
Machine Learning and Rule-based Approaches to Assertion Classification

Özlem Uzuner PhD, Xiaoran Zhang, Tawanda Sibanda MEng
DOI: http://dx.doi.org/10.1197/jamia.M2950. Pages 109–115. First published online: 1 January 2009.


Objectives: The authors study two approaches to assertion classification. One of these approaches, Extended NegEx (ENegEx), extends the rule-based NegEx algorithm to cover alter-association assertions; the other, Statistical Assertion Classifier (StAC), presents a machine learning solution to assertion classification.

Design: For each mention of each medical problem, both approaches determine whether the problem, as asserted by the context of that mention, is present, absent, or uncertain in the patient, or associated with someone other than the patient. The authors use these two systems to (1) extend negation and uncertainty extraction to recognition of alter-association assertions, (2) determine the contribution of lexical and syntactic context to assertion classification, and (3) test if a machine learning approach to assertion classification can be as generally applicable and useful as its rule-based counterparts.

Measurements: The authors evaluated assertion classification approaches with precision, recall, and F-measure.

Results: The ENegEx algorithm is a general algorithm that can be directly applied to new corpora. Despite being based on machine learning, StAC can also be applied out-of-the-box to new corpora and achieve similar generality.

Conclusion: The StAC models that are developed on discharge summaries can be successfully applied to radiology reports. These models benefit the most from words found in the ± 4 word window of the target and can outperform ENegEx.


The narrative in patient records contains information about the medical problems of patients. Given the medical problems mentioned in a record, for each mention of each medical problem, assertion classification aims to determine whether the problem is present (as stated by a positive assertion), absent (as stated by a negative assertion), or uncertain in the patient (as stated by an uncertain assertion), or is associated with someone other than the patient (as stated by an alter-association assertion).

Related Work

Extraction of key concepts from narrative medical records requires studies of medical language. Medical language processing systems enable studies of patient data repositories (e.g., for developing decision support systems1 and for automatic diagnosis)1,2 by extracting information from narrative patient records. Determining the nature of the assertion made on each mention of each medical problem is a step towards interpreting medical narratives.1 To this end, there have been some efforts in the literature.

Fiszman et al.1,2 developed the SymText system for encoding information from chest x-ray reports. SymText processes each sentence in a document independently, parses text syntactically, and fills semantic templates either with words extracted from the text or with broader concepts derived from these words. Bayesian networks applied to these templates can interpret one sentence at a time, e.g., can determine, based on a single sentence, the probability that a disease is present in the patient. Fiszman et al. used SymText's output with a rule-based system for determining whether a concept was present in a record. Identifying a single mention of the concept (or a term related to the concept) as present or possible was sufficient to qualify the concept as present in the record. They found that SymText's performance on this task “was similar to that of physicians”1; its performance was better than that of keyword search systems when these systems considered a concept to be present unless it was accompanied by an explicit negation.

Friedman et al.'s MedLEE3 uses domain-specific vocabulary and semantic grammar to process medical record narratives. It identifies the concepts in a report, maps the concepts to semantic categories, and maps the semantic categories to semantic structures. The resulting semantic representation captures information on status, location, and certainty of each mention of each concept. Hripcsak et al.4 processed these semantic representations through Medical Logic Modules that could determine whether each mention of a disease was indicated as present or absent. They studied mentions of six diseases and found that the performance in determining presence of these diseases was not significantly different from that of physicians, but was significantly better than that of a keyword-based system that used negation phrases to identify absence.4

For determining positive and negative assertions, Chapman's NegEx5 studies candidate diseases and findings identified by the Unified Medical Language System (UMLS), and employs dictionaries of pre- and post-UMLS phrases that are indicative of negation. NegEx uses heuristics to limit the scope of indicative phrases, and identifies negative assertions with 78% recall (sensitivity) and 84% precision (positive predictive value) on 1,235 findings and diseases found in 1,000 sentences taken from discharge summaries.5 Informal evaluations of NegEx report 78% recall and 86% precision on uncertain assertions.6 ConText,7 which applies NegEx to identifying the experiencer of a medical problem, achieves 50% recall and 100% precision on a corpus containing 8 instances (out of 1,620) of alter-association assertions.

Mutalik et al.'s Negfinder8 employs techniques and tools used for creating programming language compilers, makes use of a lexical scanner that is based on regular expressions, and runs a parser based on a restricted context-free grammar. Negfinder finds negated concepts in discharge summaries and surgical notes with 95.7% recall and 91.8% specificity when evaluated on 1,869 concepts found in 10 medical documents from a variety of specialties.

Aronow et al.'s NegExpander9 finds negation phrases through rules applied to part-of-speech tagged radiology reports, studies conjunctions that split negations, and expands negation phrases across conjunctions to make explicit the negation of individual concepts. NegExpander gives 93% precision on radiology reports.

Elkin et al.10 employ a rule base to mark positive, negative, and uncertain assertions on text which is preprocessed into its tokens and parsed. They achieve 97.2% recall and 91.2% precision on the assignment of negations.

The above-mentioned systems employ contextual features of various complexity with algorithms and tools of various complexity. We extend their studies on negation and uncertainty extraction to recognition of alter-association. We expect that the context immediately surrounding a medical problem holds valuable information regarding the assertion made on that medical problem. Although language allows extensive variation in the expression of assertions, we hypothesize that a significant portion of assertions are marked with clear contextual characteristics. While testing this hypothesis, we explore the significance of one form of syntactic information in assertion classification.

In general, rule-based approaches to assertion classification can be applied out-of-the-box to new corpora. On the other hand, supervised learning approaches are usually retrained for use on new corpora. This can make rule-based approaches more desirable over supervised learning approaches, even if the choice of a rule-based over a supervised learning approach trades off some performance for convenience. Ideally, the choice of an assertion classifier for a task would not trade off performance for convenience; the assertion classifier used would be convenient to apply and would outperform the alternatives. We hypothesize that supervised learning approaches hold some potential for achieving these two goals simultaneously. We check the feasibility of building a statistical assertion classifier that can be used out-of-the-box and that can maintain a performance advantage over its rule-based counterparts.

Our end product is a statistical assertion classifier, StAC, that can automatically capture the contextual clues for negative, uncertain, and alter-association assertions. The StAC approach makes use of lexical and syntactic context in conjunction with Support Vector Machines11 (SVMs). We evaluate StAC on discharge summaries and on radiology reports. We compare StAC with Extended NegEx (ENegEx), our implementation of the NegEx algorithm extended to capture alter-association in addition to positive, negative, and uncertain assertions. We employ ENegEx as a representative rule-based assertion classifier. We show that ENegEx can give good results on our corpora. We also show that StAC need only use the words that appear in the ± 4 word window of the target problem (i.e., the problem to be classified with an assertion type) to recognize most of the assertions in the same corpora. The models captured by StAC are most useful when they are specific to each corpus. However, the models built on one corpus can also identify assertions on a new corpus. As a result, StAC can be applied to new corpora out-of-the-box, in the same manner as ENegEx, and demonstrates potential for performance gain over this rule-based counterpart.


We studied assertion classification on two corpora of discharge summaries and one corpus of radiology reports. The studies of these corpora were approved by the relevant Institutional Review Boards.

Beth Israel Deaconess Medical Center (BIDMC) Corpus

The BIDMC corpus consisted of 48 deidentified discharge summaries, consisting of a total of 5,166 sentences and including 2,125 medical problem mentions, from various departments in the BIDMC.

Challenge Corpus

The Challenge corpus consisted of 142 deidentified discharge summaries, consisting of 15,042 sentences and including 8,279 medical problem mentions, from various departments of hospitals in Partners Health Care.

Computational Medicine Center (CMC) Corpus

The CMC corpus consisted of 1,954 deidentified radiology reports, consisting of 6,406 sentences and including 6,325 medical problem mentions, for the 2007 CMC challenge12 of the University of Cincinnati.

We used the BIDMC corpus for development; we used the Challenge and CMC corpora for evaluation.


Assertion classification, as tackled in this paper, assumes that mentions of medical problems in clinical records have already been identified, and aims to determine whether each mentioned medical problem is present, absent, or uncertain in the patient, or associated with someone other than the patient. Therefore, before studying assertions, we annotated our corpora in two ways: we identified the medical problems in them (the summary numbers that resulted from this annotation are in the descriptions of the corpora above) and we determined the assertion class of each identified medical problem.

Identifying Medical Problems

For our purposes, medical problems refer to the diseases and symptoms of the patient. Diseases include the UMLS semantic types pathological function, disease or syndrome, mental or behavioral dysfunction, cell or molecular dysfunction, virus, neoplastic process, anatomic abnormality, injury or poisoning, congenital abnormality, and acquired abnormality.13 Symptoms correspond to UMLS's signs or symptoms. Using this mapping, two undergraduate computer science students independently marked the medical problems in the BIDMC corpus.13,14 Two other undergraduate computer science students independently marked the medical problems in the challenge corpus. This required two months of full-time effort from each annotator. Given time and resource constraints, the medical problems in the CMC corpus were tagged using MetaMap.15 Given the possible errors of MetaMap on this task,13 the output of MetaMap was manually corrected and finalized by a nurse librarian and by a graduate student. The use of MetaMap for marking medical problems cut the annotation time per annotator by approximately 75%.

Determining Assertion Classes

Given the patient medical problems, we defined four classes of assertions:

  • Positive assertions state that the problem, marked in square brackets, is/was present in the patient. e.g., “She had [airway stenosis].”

  • Negative assertions state that the problem is absent in the patient. e.g., “Patient denies [headache].”

  • Uncertain assertions state that the patient may have the problem. e.g., “… was thought possibly to be a [neoplasm].”

  • Alter-association assertions state that the problem is not associated with the patient. e.g., “Sick contact positive for family member with [cough].” We do not differentiate between present, absent, or uncertain alter-association assertions.

While positive, negative, and uncertain assertions are often studied in negation and uncertainty extraction, alter-association assertions are usually not studied as a part of this task. We believe that alter-association assertions make sense in the context of more general assertion classification as they indicate whether the medical problem directly or indirectly affects the patient. We therefore include this assertion class in our studies.

Using the above assertion class definitions, for each occurrence of each problem in each corpus, one nurse-librarian and one information studies graduate student marked its assertion class. Initial agreement between the annotators as measured by kappa (K)16 was 0.93 on the BIDMC corpus, 0.8 on the challenge corpus, and 0.92 on the CMC corpus. In general, K ≥ 0.8 is considered “almost perfect agreement”.16 The annotators discussed and resolved their disagreements, providing us with the gold standard (see Table 1).
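Agreement of this kind is Cohen's kappa: observed agreement corrected for the agreement expected by chance. A minimal sketch on invented labels (not the study's annotations):

```python
from collections import Counter

# Cohen's kappa: observed agreement corrected for chance agreement.
# The labels below are invented for illustration, not the study's annotations.
def kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

ann1 = ["pos", "pos", "neg", "unc", "pos", "neg"]
ann2 = ["pos", "pos", "neg", "pos", "pos", "neg"]
print(round(kappa(ann1, ann2), 3))  # 0.7
```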

Table 1

Instances and Percentages of Medical Problems in Each Assertion Class

Assertion Class       Number in BIDMC    Number in Challenge    Number in CMC
Positive              1,537 (72%)        6,702 (81%)            4,761 (75%)
Negative                398 (19%)        1,249 (15%)              811 (13%)
Uncertain               169 (8%)           259 (3%)               742 (12%)
Alter-Association        21 (1%)            69 (1%)                11 (0%)
Total                 2,125 (100%)       8,279 (100%)           6,325 (100%)
  • BIDMC = Beth Israel Deaconess Medical Center; CMC = Computational Medicine Center.


Given the medical problems mentioned in a clinical record, both ENegEx and StAC classify the assertion made on each medical problem by processing the records one sentence at a time and one medical problem at a time. They treat each occurrence of each medical problem independently of all others.

Extended NegEx (ENegEx)

In the absence of direct access to NegEx in time for this study, we implemented our own version of this program using the algorithm and the pre- and post-UMLS indicative phrases of NegEx6. We extended NegEx to alter-association assertions by studying the BIDMC corpus. We added to NegEx dictionaries consisting of:

  1. Preceding alter-association phrases: that precede a problem and imply that it is associated with someone other than the patient, e.g., cousin, sister, and brother.

  2. Succeeding alter-association phrases: that succeed a problem and imply that it is associated with someone other than the patient.

The resulting number of alter-association indicative phrases was 14. These indicative phrases were a superset of ConText's7 dictionaries. We applied the NegEx algorithm6 with the extended set of indicative phrases to our data, and called this system ENegEx.
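As a sketch, the extended algorithm amounts to dictionary matching within a bounded window preceding the target. The phrase lists and window size below are illustrative stand-ins, not NegEx's actual dictionaries or the authors' 14 alter-association phrases; the real algorithm also handles succeeding phrases, pseudo-negations, and scope termination:

```python
# Illustrative phrase lists; NegEx's real dictionaries are much larger,
# and the authors' 14 alter-association phrases are not reproduced here.
PRE_NEGATION = ("no", "denies", "without", "ruled out")
PRE_ALTER = ("mother", "father", "sister", "brother", "cousin", "family member")
WINDOW = 5  # NegEx-style heuristic: an indicative phrase's scope is limited

def _contains(phrases, text):
    """True if any phrase occurs, word-aligned, inside text."""
    padded = f" {text} "
    return any(f" {p} " in padded for p in phrases)

def classify(sentence, problem):
    """Assign an assertion class to `problem` in `sentence` (rule sketch)."""
    tokens = sentence.lower().replace(",", " ").split()
    head = problem.lower().split()[0]
    if head not in tokens:
        return "positive"
    idx = tokens.index(head)
    preceding = " ".join(tokens[max(0, idx - WINDOW):idx])
    if _contains(PRE_ALTER, preceding):
        return "alter-association"
    if _contains(PRE_NEGATION, preceding):
        return "negative"
    return "positive"

print(classify("Patient denies headache", "headache"))                       # negative
print(classify("Sick contact positive for family member with cough", "cough"))  # alter-association
```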

We ran ENegEx on the BIDMC corpus (see Table 2), manually checked its output, and reestablished that its algorithm complied with the specifications of NegEx. We double-checked that the low recall on uncertain assertions was due to a weakness in NegEx's dictionaries, which for uncertain assertions consisted solely of morphological and syntactic variants of the phrase “rule out.”6 We verified that our newly introduced alter-association indicative phrases were complete in their coverage of the alter-association assertions in the BIDMC corpus.

Table 2

ENegEx on BIDMC Corpus

Assertion Class    Precision    Recall    F-Measure
  • BIDMC = Beth Israel Deaconess Medical Center.

Statistical Assertion Classifier (StAC)

To test the hypothesis that contextual features capture the information necessary for assertion classification, and to explore the contribution of one form of syntactic information to this task, we built StAC. The StAC applies SVMs to a binary feature vector. We define a feature as a characteristic that can have a multitude of values, e.g., for a person, “eye color” is a feature with several possible values, such as green. For each feature of StAC, the binary feature vector lists all possible values of that feature in the corpus as its columns, and for each target medical problem to be classified (row), it sets the columns observed to 1, leaving the rest at zero.14 If the target has no value for a feature, then all columns representing this feature will be set to zero.
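A minimal sketch of this binary encoding, with hypothetical feature names and values:

```python
# Each (feature, value) pair seen in the development data gets its own column;
# a target sets its observed columns to 1, and features with no value (or with
# values unseen during development) leave their columns at 0.
def build_columns(training_rows):
    """Map each (feature, value) pair observed in training to a column index."""
    columns = {}
    for feats in training_rows:
        for pair in feats.items():
            columns.setdefault(pair, len(columns))
    return columns

def vectorize(feats, columns):
    vec = [0] * len(columns)
    for pair in feats.items():
        if pair in columns:  # values unseen in development data are dropped
            vec[columns[pair]] = 1
    return vec

train = [{"w-1": "denies", "w+1": "or"}, {"w-1": "had", "w+1": "and"}]
columns = build_columns(train)
print(vectorize({"w-1": "denies", "w+1": "with"}, columns))  # [1, 0, 0, 0]
```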

We armed StAC with a variety of contextual features, which included some simple lexical information and some more complex syntactic information. For each target, StAC uses features extracted from the sentence containing the target. Upon request, the code for extracting these features will be made available for research purposes.

Lexical context features of StAC include:

  • ± 4 word window, i.e., words that appear within a ± 4 word window of the target. Given the target at the nth position in the sentence, the ± 4 word window captures the words found in the (n-1)th, (n-2)th, (n-3)th, (n-4)th, (n+1)th, (n+2)th, (n+3)th, and (n+4)th positions in the sentence. Our knowledge representation treats each of the above positions as an individual feature, lists all possible values for each feature, and identifies the value of the feature in the context of each target by setting only that value to one. For some targets, one or more of the features can have no values specified, e.g., the third word of the sentence will have all possible values of the (n-4)th position set to zero.

The ± 4 word window subsumes ± 1, ± 2, and ± 3 word windows so that any strings captured by these smaller windows are also captured by the larger window of ± 4. The focus on a ± 4 word window was determined by cross-validation on the BIDMC corpus. Figure 1 shows the F-measures of StAC when run only with various ± n word window features and indicates that windows greater than ± 4 can hurt performance on three of the assertion classes.

  • Section headings, i.e., whether the target appears in a section whose heading contains the word “Family”, e.g., family history. This feature is represented by a single column which is set to one only if the target appears in a section whose title contains the word “Family”.

Figure 1

Context window size (± n) vs. F-measure on each assertion class on BIDMC corpus.
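The ± 4 word window encoding can be sketched as one feature per offset; positions that fall outside the sentence are left unset. The feature names are hypothetical:

```python
# One feature per offset n-4 … n+4 (excluding n itself); offsets that fall
# outside the sentence contribute no value for that feature.
def window_features(tokens, idx, size=4):
    feats = {}
    for offset in range(-size, size + 1):
        if offset == 0:
            continue
        pos = idx + offset
        if 0 <= pos < len(tokens):
            feats[f"w{offset:+d}"] = tokens[pos]
    return feats

tokens = "patient denies headache or fever".split()
print(window_features(tokens, tokens.index("headache")))
# {'w-2': 'patient', 'w-1': 'denies', 'w+1': 'or', 'w+2': 'fever'}
```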

Syntactic context features include:

  • Verbs preceding and succeeding the target, e.g., verb showed preceding a problem suggests that the problem is present, verb cured after a problem suggests that the problem is absent. We treat the verb preceding and the verb succeeding the target as two separate features, each with numerous possible values.

  • ± 2 link window, i.e., syntactic links within a ± 2 link window of the target (and of the verbs preceding and succeeding the target) and the words they link to the target (and to the verbs preceding and succeeding the target). We extract the links and the words they link to from the output of the Link Grammar Parser17 (LGP). We use a version of LGP whose lexicon has been extended to improve coverage on medical corpora.18 Even in the absence of a fully-correct parse for each sentence, this parser provides useful parses for phrases.13,14

The choice of ± 2 link window over windows of any other size was based on cross-validation on the BIDMC corpus. Given a target (or a verb) at the nth position in the sentence, its ± 2 link window is represented by the (n-1)th, (n-2)th, (n+1)th, and (n+2)th links and the words to which they link. e.g., for asthma in “His sister, last summer, was diagnosed with asthma”, the −2 link window is given by the set {(Jp, with), (MVp, diagnosed)}, where MVp links verbs to their prepositional phrases and Jp connects prepositions to their objects (see Fig. 2). Our knowledge representation treats each of the (n-1)th, (n-2)th, (n+1)th, and (n+2)th links and each of their words as an individual feature with its own set of possible values. For some targets, one or more of the link and word positions can have no values assigned, i.e., all possible values of that feature are set to zero. When an indicative word appears inside the short-range lexical window but does not actually modify the target, the ± 2 link window features clarify the modifier–noun relationships and help eliminate the false positives that mere lexical proximity would produce. When the indicative context lies outside the lexical window, the ± 2 link window features can capture long-distance dependencies and help eliminate false negatives that the lexical context would miss. For example, the connection between sister and asthma would be missed by the lexical context but is captured by the {(Pv, was), (Ss, sister)} links of the verb diagnosed (see Sibanda14).

Figure 2

Sample link grammar parse.

The StAC employs SVMs with a linear kernel. The choice of SVMs over other classifiers is motivated by their ability to robustly handle large feature sets and by their ability to find globally optimum solutions.19 In our case, the number of features in the set is on the order of thousands. We use the multiclass SVM implementation of LIBSVM.20

We evaluate StAC using single train–test cycles and using cross-validation. For both train–test cycles and cross-validation, we create the binary feature vector from only the development data used for each round. As a result, the feature values of the targets that appear only in the validation and test sets of that round do not appear in the feature vector, i.e., the feature vector is not overfit to the validation and test sets. The performance of StAC on the BIDMC corpus is given in Table 3.

Table 3

StAC Cross-Validated on the BIDMC Corpus

Assertion Class    Precision    Recall    F-Measure
  • BIDMC = Beth Israel Deaconess Medical Center.

Evaluation Methods

We evaluate system performances in terms of precision (P), recall (R), and F-measure (F). Precision (positive predictive value) measures the proportion of predictions in a class that were correct. Recall (sensitivity) measures the proportion of true class instances that were correctly identified. F-measure is the harmonic mean of precision and recall. We test significance of the differences in performances of the two systems using the Z test on two proportions. This test considers the size of the sample in a class to make a judgment of significance on the difference of performances (proportions) in that class.21,22 We present precision, recall, and F-measure values for all of our experiments; however, we base our observations on the F-measure which provides a single convenient number for comparing systems.
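These measures, together with the two-proportion Z test, can be sketched as follows; the counts below are illustrative, not results from the paper:

```python
import math

# Precision, recall, and F-measure from true positive, false positive, and
# false negative counts, plus the Z test for comparing two proportions.
def prf(tp, fp, fn):
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f

def z_two_proportions(x1, n1, x2, n2):
    """Z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

precision, recall, f_measure = prf(tp=80, fp=20, fn=10)
print(round(f_measure, 3))                            # 0.842
print(round(z_two_proportions(80, 100, 70, 100), 2))  # 1.63
```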

Evaluation, Results, and Discussion

For evaluation, we ran ENegEx on the challenge and CMC corpora. Table 4 shows that ENegEx is strongest in recognizing positive and negative assertions, and weakest in recognizing uncertain and alter-association assertions. Although the performance of ENegEx can be improved by tuning it to the corpora on which it is to be run, even in the absence of such tuning, ENegEx remains a simple algorithm that can distinguish positive from negative assertions. Most of ENegEx's mistakes come from scope errors and from incomplete dictionaries. For example, in “She is an obese white female in no acute distress with a hoarse voice,” ENegEx places both acute distress and hoarse voice within the scope of the indicative phrase no. In “… frozen section analysis revealed this to be adenocarcinoma, metastatic disease from the colon most likely,” ENegEx misses the subtle uncertainty expressed by most likely.

Table 4

ENegEx on the Challenge and CMC Corpora

  • CMC = Computational Medicine Center.

We evaluate StAC in two different ways:

  • Cross-validation experiments: We developed the assertion classification approach of StAC, with its specific methods and features, on the BIDMC corpus. Would the methods and features of StAC be as useful on other corpora? To answer this question, we cross-validated StAC on the challenge and CMC corpora. Cross-validation developed and validated models on each corpus separately.

  • Generality experiments: ENegEx can be applied to new data sets as is and would give reasonable results. Could StAC be similarly applicable to new corpora? To answer this question, we trained StAC on the BIDMC corpus and we ran it, without retraining or cross-validating, on the challenge and CMC corpora. While cross-validating StAC on a corpus tests the generality of StAC's approach on that corpus, running StAC on a corpus as trained on another corpus checks whether the model obtained from one corpus helps assertion classification in the other corpus.

In both of the above experiments, we use ENegEx (Table 4) as a benchmark. We use * to mark the performances of StAC that are significantly different from the corresponding performance of ENegEx at α = 0.05. Bold marks performances of StAC that are equal to or greater than the corresponding performance of ENegEx.

Cross-validation Experiments

Table 5 shows that StAC's approach to extracting key contextual clues for recognizing assertion classes generalizes to all of our corpora. The StAC approach applies the methods and features identified on the BIDMC corpus to the challenge and CMC corpora. It extracts specific contextual clues from each corpus within the limits of these methods and features, and gives good results.

Table 5

Cross-Validation of StAC on the Challenge and CMC Corpora

  • CMC = Computational Medicine Center.

Generality Experiments

In general, rule-based approaches like ENegEx can be applied out-of-the-box to new corpora. To test whether StAC, based on supervised learning, can be used out-of-the-box in a manner analogous to ENegEx, we trained StAC on the BIDMC corpus and tested the resulting model as is on the challenge and CMC corpora. We found that with the models trained on the BIDMC corpus, StAC could outperform ENegEx when both systems are run on the challenge and CMC corpora (compare Table 4 and Table 6). The performance gain of StAC is more pronounced on the F-measures from the CMC corpus.

Table 6

StAC Trained on BIDMC and Run on Challenge and CMC Corpora

  • BIDMC = Beth Israel Deaconess Medical Center; CMC = Computational Medicine Center.

Naturally, StAC gives its best results when it is cross-validated because cross-validation allows it to tune its context to each corpus (compare Table 5 and Table 6). However, even in the absence of cross-validation, the information pertinent to classifying assertions on one corpus aids classification of assertions in another corpus.

Feature Evaluation

To understand the source of the strength of StAC, we cross-validated it with each of its features separately. Table 7 and Table 8, where italics mark the best F-measures, show that the words in the ± 4 word window are the most informative features of StAC on all of our corpora, indicating that the nature of the assertions made about a problem is mostly captured by the lexical context of the problem. Only for determining the alter-association assertions on the discharge summaries does lexical context drastically benefit from Section Headings. The ± 2 link window is the second most informative feature for StAC. The ± 2 link window features contribute to lexical features by correcting false positives that occur when a negation indicator such as no appears within the ± 4 word window but does not in fact modify a disease, e.g., “no intervention due to cardiovascular disease” where the ± 2 link window clarifies that no modifies intervention and not cardiovascular disease. Their value in our experiments is only limited by the number of such examples in our corpora.

Table 7

F-Measures of StAC When Run on BIDMC Corpus with Subsets of Features; Best F-Measures in Italics

BIDMC                                    Positive    Negative    Uncertain    Alter-Association
Lexical Context
  ±4 word window                         0.95        0.92        0.64         0.17
  Section headings                       0.84        0           0            0.92
  ±4 word window + section headings      0.96        0.93        0.64         0.92
Syntactic Context
  ±2 link window                         0.90        0.76        0.59         0.08
  ±2 link window + verbs                 0.91        0.79        0.51         0.08
  • BIDMC = Beth Israel Deaconess Medical Center.

Table 8

F-Measures of StAC When Run on Challenge and CMC Corpora with Subsets of Features; Best F-Measures in Italics

Challenge                                Positive    Negative    Uncertain    Alter-Association
Lexical Context
  ±4 word window                         0.97        0.90        0.58         0.49
  Section headings                       0.90        0           0            0.82
  ±4 word window + section headings      0.97        0.90        0.58         0.86
Syntactic Context
  ±2 link window                         0.94        0.70        0.49         0.47
  ±2 link window + verbs                 0.94        0.72        0.48         0.52

CMC                                      Positive    Negative    Uncertain    Alter-Association
Lexical Context
  ±4 word window                         0.98        0.95        0.89         0.42
  Section headings                       0.85        0           0            0
  ±4 word window + section headings      0.98        0.95        0.89         0.42
Syntactic Context
  ±2 link window                         0.92        0.61        0.73         0.13
  ±2 link window + verbs                 0.92        0.62        0.75         0.25
  • CMC = Computational Medicine Center.


Despite correctly classifying most of the assertions, StAC makes several recurring mistakes. For example, it misinterprets the scope of some phrases: in “No JVP, 2+ swelling, no pain”, JVP and pain appear to be absent, while swelling is present. However, the lack of a consistent indicative context prevents StAC from recognizing this information.

The results in Table 6 show that StAC can obtain much of the contextual information necessary for assertion classification on all of our corpora just from the BIDMC corpus. Our choice of the BIDMC corpus for development was guided by its decent size and by its genre, which had previously been used for assertion classification.5 If trained on a corpus that was weaker in its representation of information pertinent to assertion classes, both in terms of the number of examples of each assertion class and in terms of capturing the variety of contexts indicating the various assertion classes, the results presented for StAC and its generalizability could change (as would the results and generality of ENegEx if developed under the same conditions). The results on the alter-association class support this claim: this class could benefit from further studies on corpora that may be richer in their examples for it.


We presented StAC and used it in exploring the contribution of various contextual features to assertion classification. Using ENegEx as a benchmark, we showed that StAC can capture assertion classes on discharge summaries and radiology reports by making use of the information contained in the immediate context of target problems. The information contained in the words found in the ± 4 word window of target goes a long way towards this goal. More importantly, information obtained from one corpus can help assertion classification on other corpora.

