
★ Research Paper ★

Automated Detection of Adverse Events Using Natural Language Processing of Discharge Summaries

Genevieve B. Melton, George Hripcsak
DOI: http://dx.doi.org/10.1197/jamia.M1794. Pages 448-457. First published online: 1 July 2005


Objective: To determine whether natural language processing (NLP) can effectively detect adverse events defined in the New York Patient Occurrence Reporting and Tracking System (NYPORTS) using discharge summaries.

Design: An adverse event detection system for discharge summaries using the NLP system MedLEE was constructed to identify 45 NYPORTS event types. The system was first applied to a random sample of 1,000 manually reviewed charts. The system then processed all inpatient cases with electronic discharge summaries for two years. All system-identified events were reviewed, and performance was compared with traditional reporting.

Measurements: System sensitivity, specificity, and predictive value, with manual review serving as the gold standard.

Results: The system correctly identified 16 of 65 events in 1,000 charts. Of 57,452 total electronic discharge summaries, the system identified 1,590 events in 1,461 cases, and manual review verified 704 events in 652 cases, resulting in an overall sensitivity of 0.28 (95% confidence interval [CI]: 0.17–0.42), specificity of 0.985 (CI: 0.984–0.986), and positive predictive value of 0.45 (CI: 0.42–0.47) for detecting cases with events and an average specificity of 0.9996 (CI: 0.9996–0.9997) per event type. Traditional event reporting detected 322 events during the period (sensitivity 0.09), of which the system identified 110 as well as 594 additional events missed by traditional methods.

Conclusion: NLP is an effective technique for detecting a broad range of adverse events in text documents and outperformed traditional and previous automated adverse event detection methods.

Adverse event prevention and detection are national health care priorities.1 Detection of adverse events represents an opportunity to learn from events via a cognitive perspective so that inciting factors surrounding events can be identified and improved.2 Voluntary reporting of adverse events by most institutions, however, remains a largely unsuccessful practice.3–7 While chart review is effective,8 it is too costly for routine use.

Health care policy makers and practitioners will need information technology coupled with improved data collection to improve patient safety.9 Computerized systems for event detection rely on signals suggestive of adverse events both in the case of impending events (for prevention) and of events that have occurred (for management).10 For example, a discharge diagnosis of myocardial infarction in a patient with an unrelated surgical admission diagnosis might indicate an adverse event. Event detection systems reduce the cost of chart review by identifying those cases that are most appropriate for review.6 Successful systems require sufficient positive predictive value to avoid needless chart review and sufficient sensitivity to gather a meaningful number of events.

Most adverse event detection systems exploit numeric or coded data derived from patient registration, pharmacy orders, admission and discharge diagnoses, clinical laboratory results, and ancillary information systems.11–16 Investigators have studied adverse event detection from the perspective of adverse drug events, dangerous laboratory values, failure to follow critical paths, and other events. Although these adverse event detection systems often perform well, they are limited because they require clinical data that are in coded format.

Unfortunately, most institutions lack a detailed record of their patients' care in coded electronic format. Symptoms, physical findings, and clinical reasoning are recorded as narrative text in notes but are unavailable in coded form. The lack of coded information limits the performance of event detection systems and limits the breadth of events that they can detect.

Narrative clinical notes such as discharge summaries, operative reports, clinic notes, and nursing notes are increasingly available in electronic form either through transcription or direct data entry. Investigators have begun to exploit these documents for event detection by looking for notes with relevant words (“trigger words”) such as “iatrogenic,” “error,” or “perforation.”17,18 This technique helps, but its predictive value remains low, largely because it is difficult to distinguish whether a clinician is saying that a condition is present, is absent, or was present in the past. Natural language processing is an automated technique that converts narrative documents into a coded form that is appropriate for computer-based analysis.
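The weakness of trigger-word searching can be illustrated with a minimal sketch. The three example sentences are hypothetical and written only for this illustration, and the matcher is a plain word search, not the method used in the cited studies or in MedLEE.

```python
import re

# Trigger words quoted in the text above.
TRIGGER_WORDS = ("iatrogenic", "error", "perforation")


def has_trigger_word(note: str) -> bool:
    """Return True if any trigger word appears as a whole word in the note."""
    text = note.lower()
    return any(re.search(rf"\b{re.escape(word)}\b", text) for word in TRIGGER_WORDS)


# Hypothetical sentences: present, absent (negated), and past history.
notes = [
    "Colonoscopy complicated by perforation of the sigmoid colon.",
    "No evidence of perforation on abdominal imaging.",
    "History of bowel perforation in 1998, repaired surgically.",
]

flags = [has_trigger_word(note) for note in notes]
print(flags)
```

All three notes are flagged, although only the first describes a condition that is present during the admission. A word search alone cannot separate the three cases, which is why its predictive value stays low.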

Natural language processing has been used successfully for several specific domains of medicine19–24 and for the detection of specific adverse events, such as falls and nosocomial infections.10,25 It is unclear, however, whether natural language processors can detect a wide range of complex adverse events accurately enough to assist health care institutions meaningfully. In this study, we built an event detection system for electronic discharge summaries using an existing, noncommercial natural language processor, MedLEE,26 in an effort to detect a broad range of adverse events.


The natural language processor MedLEE employs a vocabulary and a grammar to extract information from narrative text. MedLEE was initially developed to process radiographic reports21 but has been expanded to process a wide range of medical texts.26 MedLEE also handles negation (denial), uncertainty, timing, synonyms, and abbreviations. For example, the sentence “The patient may have a history of MI” is coded as follows:

  • problem: myocardial infarction

    • certainty: moderate

    • status: past history

The certainty and status fields indicate that the diagnosis is unsure (“moderate” certainty) and that if the myocardial infarction did occur, it occurred in the past. A detailed overview of MedLEE has been published.21,26
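As a rough illustration of how such a coded finding might be held in a program, the sketch below models the three fields shown above. The class name `Finding`, the default values, and the helper function are assumptions made for this example; they do not reproduce MedLEE's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical container for one coded finding (not MedLEE's real schema)."""

    problem: str
    certainty: str = "high"   # assumed default when the text states no uncertainty
    status: str = "current"   # assumed default when the text states no timing


# Coded form of "The patient may have a history of MI", as listed above.
finding = Finding(
    problem="myocardial infarction",
    certainty="moderate",
    status="past history",
)


def is_current_problem(f: Finding) -> bool:
    """True only when the finding is asserted for the present admission."""
    return f.certainty in {"high", "moderate"} and f.status == "current"


print(is_current_problem(finding))
```

Because the status is "past history", a downstream rule that looks for a myocardial infarction during the admission would skip this finding.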

The New York Patient Occurrence Reporting and Tracking System (NYPORTS) is a mandatory adverse event reporting framework instituted in 1996 for all health care institutions in New York State.27 We used the criteria for each of the 45 patient-related hospital-based adverse event types defined in NYPORTS (Appendix 1); they represent a broad range of adverse events.

Many NYPORTS adverse event types are complex. For example, NYPORTS event type 751 includes falls in the hospital resulting in an x-ray–proven fracture, a subdural or epidural hematoma, cerebral contusion, traumatic subarachnoid hemorrhage, or internal organ trauma. The event type excludes falls that occur outside of the institution or that result in only soft tissue injuries. NYPORTS event type 604 includes perioperative myocardial infarction within 48 hours of an operative procedure. The procedure must not be cardiac related, birth related, an abdominal aortic aneurysm rupture, or a multiple trauma.


We developed and tested our adverse event system at NewYork-Presbyterian Hospital–Columbia University Medical Center, an urban, tertiary health care institution. There were 107,305 inpatient cases for the years 1996 and 2000. The target population of the study comprised all 57,452 inpatient cases at our institution with electronic discharge summaries during this period.

The adverse event detection system28 comprised the MedLEE natural language processor21,26 and a set of criteria that mapped each MedLEE-coded discharge summary to the adverse events that occurred during the admission. The inclusion and exclusion criteria for each event were implemented as a computer query, which is a short program that includes logic and terms from MedLEE's vocabulary. MedLEE converted each discharge summary to a coded form, and the 45 computer queries converted that coded form to a list of events that appeared to have occurred during each admission. The computer queries were developed iteratively; we tested them on discharge summaries from the years 1990 to 1995 (before implementation of NYPORTS), modified the queries to improve performance, and retested them on the cohort.
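The idea of a computer query over coded findings can be sketched as follows, using a simplified version of the criteria for event type 751 (in-hospital falls with serious injury). Everything here is hypothetical: the dictionary layout, the `location` field, the term strings, and the function names are assumptions for illustration, and the sketch omits details such as the x-ray confirmation requirement. It is not the study's actual query language.

```python
from typing import Iterable, Mapping

# Injuries that qualify a fall for event type 751 (simplified from the text).
QUALIFYING_INJURIES = frozenset({
    "fracture",
    "subdural hematoma",
    "epidural hematoma",
    "cerebral contusion",
    "traumatic subarachnoid hemorrhage",
    "internal organ trauma",
})


def is_asserted(finding: Mapping[str, str]) -> bool:
    """Drop negated findings and findings that refer to past history."""
    return finding.get("certainty") != "no" and finding.get("status") != "past history"


def query_fall_with_injury(findings: Iterable[Mapping[str, str]]) -> bool:
    """Inclusion: an in-hospital fall plus a qualifying injury.
    Exclusion: falls outside the institution or with only soft tissue injury."""
    current = [f for f in findings if is_asserted(f)]
    fell_in_hospital = any(
        f["problem"] == "fall" and f.get("location") == "in hospital" for f in current
    )
    injured = any(f["problem"] in QUALIFYING_INJURIES for f in current)
    return fell_in_hospital and injured


# Hypothetical coded admissions.
in_hospital_fracture = [
    {"problem": "fall", "location": "in hospital", "certainty": "high", "status": "current"},
    {"problem": "fracture", "certainty": "high", "status": "current"},
]
fall_at_home = [
    {"problem": "fall", "location": "home", "certainty": "high", "status": "current"},
    {"problem": "fracture", "certainty": "high", "status": "current"},
]
soft_tissue_only = [
    {"problem": "fall", "location": "in hospital", "certainty": "high", "status": "current"},
    {"problem": "soft tissue injury", "certainty": "high", "status": "current"},
]

print(query_fall_with_injury(in_hospital_fracture))
```

A real query would also need to link the injury to the fall rather than merely find both in the same summary, which is one reason the iterative refinement described above was needed.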

System Evaluation

Manual chart review served as the gold standard. We assessed the reliability of the reviewers on 100 cases as follows. Two reviewers, a physician coauthor (GBM) and an informatician independent of this study, identified NYPORTS events in 100 cases selected randomly but stratified so that about 40% had events. The reviewers' raw agreement was 0.97, and chance-corrected agreement (kappa) was 0.94. This high agreement justified the use of a single reviewer per case.
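For readers unfamiliar with chance-corrected agreement, the sketch below computes Cohen's kappa for two raters and a yes/no judgment. The paper does not report the reviewers' 2×2 table, so the counts used here are purely illustrative values chosen to be consistent with the reported figures (100 cases, about 40% with events, raw agreement 0.97); they are not the study's data.

```python
def cohen_kappa(both_yes: int, a_only: int, b_only: int, both_no: int) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two raters on a yes/no judgment."""
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    a_yes = (both_yes + a_only) / n
    b_yes = (both_yes + b_only) / n
    # Agreement expected by chance if the raters judged independently.
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    return observed, (observed - expected) / (1 - expected)


# Illustrative counts only (not reported in the paper).
observed, kappa = cohen_kappa(both_yes=39, a_only=2, b_only=1, both_no=58)
print(round(observed, 2), round(kappa, 2))
```

With these assumed counts the raw agreement is 0.97 and kappa rounds to 0.94, which shows that the two reported figures are mutually consistent for a sample in which roughly 40% of cases have events.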

Reliability of the data sources was assessed on 1,000 randomly selected cases in which the physician identified NYPORTS events using (1) the discharge summaries alone, (2) the full electronic chart, and (3) for a subset of 100, the combined electronic and paper charts. Electronic charts included discharge summaries, operative reports, pathology reports, laboratory results, radiology results, registration data including coded diagnoses and procedures, residents' transfer of service notes, and other ancillary notes, but they contained few admission notes, progress notes, or nursing notes. The paper chart supplied the latter missing notes. We calculated the agreement among the three data sources.

Performance of the system was assessed with the same 1,000 random cases from 1996 and 2000 used for the full data reliability dataset. These cases were used to obtain an unbiased and direct estimate of sensitivity and specificity of the system for identifying cases that had NYPORTS events. The system identified apparent events based on discharge summaries. The physician manually reviewed the electronic chart for each case and determined which NYPORTS events had clearly occurred in the case.

System performance was then assessed using all electronic discharge summaries from 1996 and 2000 to get a more precise estimate of the positive predictive value and performance on individual event types. The physician reviewed those discharge summaries that the system identified as having events. An identification was considered correct only if the system selected the correct NYPORTS event type.

Finally, to assess how the system might work in practice, we compared the events that were detected by the system and confirmed by the physician reviewer with the events that were actually detected during those years using traditional event detection techniques. In 1996 and 2000, hospital personnel reported candidate NYPORTS events in one of three ways: (1) direct phone calls from practitioners, patients, and other hospital personnel; (2) incident reports from practitioners; and (3) report forms completed by case management personnel in conjunction with utilization review. Hospital personnel then determined the veracity of candidate NYPORTS events by manual screening of the electronic chart and, if needed, the paper chart.

The institutional review board approved the study and waived informed consent for this retrospective review.


Data Reliability

In the 100 cases with both electronic chart review and combined paper-electronic chart review, there was complete agreement on all 39 events. This high agreement justified the use of electronic charts as the gold standard for the 1,000 case set. Manual review of discharge summaries agreed with manual review of the electronic chart in all but five of 1,000 cases, resulting in a raw agreement of 0.995 and kappa of 0.96. This high agreement demonstrates that discharge summaries contain most of the information needed to detect NYPORTS adverse events, so a system based on discharge summaries has the potential for accurate identification.

System Performance on 1,000 Cases

Table 1 shows the performance of the system for detecting cases with at least one adverse event, based on the 1,000 case set. “True events” are those identified by manual review of the electronic chart, and “apparent events” are those identified by the system. The system correctly identified 15 of 53 cases with events. Table 2 shows the performance of the system for detecting individual events, based on the 1,000 case set. The system identified 32 apparent events: it correctly identified 16 of 65 true events, missed the other 49, and incorrectly identified 16. Event specificity (0.9996 in Table 2) exceeds case specificity (0.982 in Table 1) because case specificity is subject to the sum of the false-positive rates of all the event types, whereas event specificity represents the average specificity expected for an investigator interested in a single NYPORTS event type.

Table 1

Automated Adverse Event Detection System Versus Manual Review for 1,000 Charts, Aggregated by Case

| Manual review | Automated detection system: cases with apparent events | Automated detection system: cases without apparent events | Total |
| --- | --- | --- | --- |
| Cases with true events | 15 | 38 | 53 |
| Cases without true events | 17 | 930 | 947 |

| Measure | Value (95% CI) |
| --- | --- |
| Sensitivity | 0.28 (0.17–0.42) |
| Specificity | 0.982 (0.971–0.990) |
| Positive predictive value | 0.47 (0.29–0.65) |
| Negative predictive value | 0.96 (0.95–0.97) |
  • CI = confidence interval.
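The point estimates in Table 1 follow directly from its four cell counts. The short sketch below recomputes them; the function name is an assumption for this example, and confidence intervals are left out because the interval method is not stated in this section.

```python
def screening_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Standard screening measures from a 2x2 table of case counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true cases that were flagged
        "specificity": tn / (tn + fp),  # event-free cases that were not flagged
        "ppv": tp / (tp + fp),          # flagged cases that were true
        "npv": tn / (tn + fn),          # unflagged cases that were event-free
    }


# Cell counts from Table 1 (1,000 charts, aggregated by case).
table1 = screening_metrics(tp=15, fn=38, fp=17, tn=930)
for name, value in table1.items():
    print(f"{name}: {value:.3f}")
```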

Table 2

Automated Adverse Event Detection System Versus Manual Review for 1,000 Charts, Aggregated by Event

| Measure | Value (95% CI) |
| --- | --- |
| Events identified by manual review | 65 |
| Events identified by the system | 32 |
| Events identified by the system and verified by manual review | 16 |
| Sensitivity | 0.25 (0.15–0.37) |
| Specificity | 0.9996 (0.9994–0.9998) |
| Positive predictive value | 0.50 (0.32–0.68) |
| Negative predictive value | 0.9989 (0.9986–0.9992) |
  • CI = confidence interval.
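The relationship between event specificity and case specificity can be checked with a small calculation. The sketch assumes that every one of the 45 event types has the average event specificity of 0.9996; the second estimate additionally assumes that false positives for different event types occur independently. Both are simplifications for illustration, not the paper's derivation.

```python
EVENT_TYPES = 45
event_specificity = 0.9996  # average per event type, from Table 2

# Summing the per-type false-positive rates (the reasoning given in the text).
summed_estimate = 1 - EVENT_TYPES * (1 - event_specificity)

# Alternative estimate if false positives were independent across event types.
independent_estimate = event_specificity ** EVENT_TYPES

print(round(summed_estimate, 3), round(independent_estimate, 3))
```

Both estimates round to 0.982, the case specificity observed in Table 1.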

System Performance on Full Cohort of 57,452 Cases

Table 3 shows the number of events that the system identified in the full cohort of 57,452 cases, the number of those events that manual review verified, and the overall positive predictive value for the system calculated by case and by event. Appendix 1 contains the positive predictive value of the system by each event type. In sum, the system identified 1,590 events in 1,461 cases, and manual review verified 704 of the events in 652 cases.

Table 3

System Performance with 57,452 Electronic Discharge Summaries Aggregated by Case and by Event

| Measure | Value by case (95% CI) | Value by event (95% CI) |
| --- | --- | --- |
| Identified by the system | 1,461 | 1,590 |
| Identified by the system and verified by manual review | 652 | 704 |
| Positive predictive value | 0.45 (0.42–0.47) | 0.44 (0.42–0.47) |
  • CI = confidence interval.

“Best Estimate” System Performance

Table 4 summarizes the event prevalence and “best estimates” of system performance using a combination of the data obtained from the 1,000 case set and the full cohort of 57,452 cases. The specificity for specific event types can also be estimated from Appendix 1. The range is 0.998 (95% confidence interval [CI]: 0.997–0.998) for event 803 to 1 (95% CI: 0.9999–1.0) for event 852.

Table 4

“Best Estimate” Event Prevalence and System Performance

| Metric* | Derivation | Value (95% CI) |
| --- | --- | --- |
| Case rate: proportion of cases with one or more true events | 53 ÷ 1,000 | 0.053 (0.040–0.069) |
| Event rate: true events per case | 65 ÷ 1,000 | 0.065 (0.051–0.082) |
| System performance for detecting cases with events | | |
| Sensitivity: proportion of cases with true events that had apparent events | 15 ÷ 53 | 0.28 (0.17–0.42) |
| Specificity: proportion of cases with no true events that had no apparent events | [formula not reproduced] | 0.985 (0.984–0.986) |
| Positive predictive value: proportion of cases with apparent events that had true events | 652 ÷ 1,461 | 0.45 (0.42–0.47) |
| Negative predictive value: proportion of cases with no apparent events that had no true events | 930 ÷ 968 | 0.96 (0.95–0.97) |
| System performance for detecting individual events | | |
| Sensitivity: proportion of true events that were identified by the system | 16 ÷ 65 | 0.25 (0.15–0.37) |
| Specificity: proportion of cases without true events of a given type that the system did not identify | [formula not reproduced] | 0.9996 (0.9996–0.9997) |
| Positive predictive value: proportion of apparent events that were true | 704 ÷ 1,590 | 0.44 (0.42–0.47) |
| Negative predictive value: proportion of cases without apparent events of a given type that had no true event of that type | [formula not reproduced] | 0.9989 (0.9986–0.9992) |

  • CI = confidence interval.

  • * A true event was detected by manual review; an apparent event was identified by the system.

  • See the text for an explanation of the difference between case specificity and event specificity.
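The derivation of the “best estimate” case specificity did not survive as text in this copy of Table 4. One plausible reconstruction, offered only as an assumption, combines the false-positive cases from the full cohort with the number of event-free cases implied by the case rate from the 1,000 case set. It reproduces the reported value of 0.985, but it may not be the authors' exact formula.

```python
total_cases = 57_452          # full cohort
case_rate = 53 / 1_000        # cases with true events, from the 1,000 case set
flagged_cases = 1_461         # cases the system identified in the full cohort
verified_cases = 652          # flagged cases confirmed by manual review

# False-positive cases are flagged cases that manual review did not confirm.
false_positive_cases = flagged_cases - verified_cases

# Estimated number of cases with no true event in the full cohort.
event_free_cases = total_cases * (1 - case_rate)

case_specificity = 1 - false_positive_cases / event_free_cases
print(round(case_specificity, 3))
```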

System Performance Compared with Traditional Reporting

The last two columns of Appendix 1 tally traditional NYPORTS detection and its overlap with the automated system. Table 5 compares traditional detection to the automated system followed by manual verification. The sensitivity of traditional detection can be approximated as 322 of 3,734 (0.065 × 57,452) or about 0.086. The system identified 110 of 322 traditionally detected events (0.34; 95% CI: 0.29–0.40). The system identified 594 events that were missed by traditional detection methods, increasing the total number of events detected from 322 for traditional detection alone to 916 using a combined approach.

Table 5

Automated Detection with Manual Verification Versus Traditional Detection

| Traditional detection | Automated detection system with manual verification: event detected | Automated detection system with manual verification: event missed | Total |
| --- | --- | --- | --- |
| Event detected | 110 | 212 | 322 |
| Event missed | 594 | * | |
  • * The number of events missed by both systems is unknown but can be estimated as 0.065 × 57,452 − (110 + 212 + 594) = 2,818.
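The arithmetic behind this comparison can be retraced in a few lines. The variable names are chosen for this sketch; all numbers come from the text, Table 4, and Table 5.

```python
total_cases = 57_452
event_rate = 65 / 1_000        # true events per case, from the 1,000 case set

traditional_events = 322       # detected by traditional reporting
detected_by_both = 110         # traditional events also found by the system
system_only_events = 594       # verified system events missed by traditional reporting

expected_events = event_rate * total_cases                   # about 3,734
traditional_sensitivity = traditional_events / expected_events
overlap_fraction = detected_by_both / traditional_events
combined_events = traditional_events + system_only_events
missed_by_both = round(expected_events) - combined_events

print(round(expected_events), round(traditional_sensitivity, 3),
      round(overlap_fraction, 2), combined_events, missed_by_both)
```

The estimate of events missed by both approaches depends entirely on extrapolating the event rate of the 1,000 case sample to the full cohort.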


In this study at a large, tertiary care medical center, our automated adverse event detection system with natural language processing achieved excellent performance and provided effective screening for NYPORTS adverse events contained within a large corpus of electronic discharge summaries over a two-year period. The sensitivity to detect events was only fair (0.25 by event and 0.28 by case) but far higher than that found for traditional reporting in this study (0.086) or in previous studies.3–7 The system achieved very high specificity.

The current system, when compared with other adverse event detection systems using text documents, is unique in its ability to both recognize a broad range of events and identify the specific event type in each case. Thus, it enables highly focused manual review to detect a significant fraction of events at minimal cost.

Most previous studies of automated adverse event detection from narrative documents used simple text search techniques and achieved limited success. In two studies of adverse drug event detection in the outpatient setting using automated text searching in clinic notes, the text search method performed well compared with other automated methods but achieved positive predictive values of only 7%13 and 12%.5 In a different study, text searching in discharge summaries, residents' transfer of service notes, and outpatient visit notes using the search terms “mistake,” “error,” “incorrect,” and “iatrogenic” to find medical errors identified a broad range of medical errors and had positive predictive values ranging from 3.4% to 24.4%.17 The system did not distinguish among the event types, however, and its sensitivity was less than 4%. In a study of text searching on discharge summaries to identify a broad range of events, the system returned 59% of discharge summaries with a predictive value of 52%.18 Because the prevalence of these nonspecific events in the underlying sample was 45%, however, the predictive value was only moderately higher than would be achieved by random sampling. Our system identified specific event types, with average prevalence per event type of less than 1%, and it still achieved a positive predictive value of 44% per event.

In addition, a recent report by Forster et al.29 described the validation of an adverse event detection instrument for discharge summaries using term searching. In contrast to the current study, which contains a direct reliability study, that report used an established instrument. The authors reported a positive predictive value of 0.41, a sensitivity of 0.23, and a specificity of 0.92. The predictive value of 0.41 must be interpreted in light of the high underlying prevalence of adverse events, which was 20% (48 of 245) in the reported case sample using a broad definition of adverse events. In addition to achieving a comparable predictive value with rare and specific events, our system achieved a better specificity and identified the exact event type.

Our reliability studies, which were conducted to verify the rater and data sources, revealed that NYPORTS events were straightforward for clinicians to identify with manual review and that discharge summaries contain most NYPORTS adverse events. Although the raters had little difficulty with manual review, query development for these events was a long and intricate task for system developers. Queries were developed in an iterative manner with many rounds often necessary to decrease both false negatives and false positives. Because of the large amount of complexity surrounding these adverse event definitions with respect to inclusion and exclusion criteria, however, mimicking the natural reasoning of a clinician within an automated query was difficult.

For example, reasoning with respect to time, an area being actively investigated by others,30 was particularly difficult in this project. While MedLEE does have some time representations for dates and other simple time structures, its current capabilities with respect to these issues are limited. Certain time reasoning could be inferred, such as an event occurring after another event using collocation information in the text. Many other time-reasoning issues, however, were not easily modeled in the queries. For instance, five postoperative NYPORTS events require that the event occurred within 48 hours of the procedure (events 601 to 605, see Appendix 1). Modeling a time difference of 48 hours with the coded data from MedLEE was difficult. Augmenting the system with other data sources and with other text documents could potentially improve time reasoning as well as overall data modeling for the event detection system.

Although the system was successful in detecting NYPORTS events, there are important adverse event types that the NYPORTS structure does not include or sometimes explicitly excludes. For instance, the NYPORTS adverse event criteria for iatrogenic pneumothorax include solely those pneumothoraces due to an intravascular catheter and exclude other iatrogenic causes, including thoracocentesis or lung biopsy. For this reason, the system would need modification if the goal were to obtain all possible adverse events of potential interest.

While the overall performance of the system was excellent compared with that of other text-processing adverse event detection systems, system performance at the event or query level varied somewhat by event type. Many event types had a low event prevalence (Appendix 1), so the performance for individual event types could not be determined accurately. Nevertheless, certain queries were more difficult to implement in an automated fashion than others, resulting in variable system performance. Another central issue, in addition to issues with time reasoning, was handling event criteria not typically contained explicitly in the discharge summary. This required indirect modeling in the query (e.g., the use of conscious sedation was indirectly modeled by detecting procedures that typically use conscious sedation). The addition of other data sources could potentially enhance system performance by directly supplying this inferred information.

One potential source of bias in this study was that only patients with electronic discharge summaries were included. Patients who stayed less than 48 hours did not require a discharge summary, and sometimes summaries were simply missing from the record. This group may have had a different event rate than those included in the study.

An important aspect of this technology is its straightforward transferability to other institutions. Previous experience using the MedLEE natural language processor at other institutions suggests that performance should be comparable and that adjusting the computer queries should reduce any loss of performance.31 For patients with electronic discharge summaries, the overhead of using the system should be minimal. There are minor formatting requirements, and standardized section headings are helpful but not mandatory. Transferability is limited in two ways: (1) not all patients have discharge summaries, typically due to short hospital stays or lack of clinician compliance, and (2) some institutions do not currently have discharge summaries in electronic form. The MedLEE natural language processing component can process a broad range of documents, and extending the adverse event detection system to progress notes, operative reports, consult notes, and ancillary reports would likely result in the detection of additional adverse events.

Moreover, system specificity is high enough to make nationwide screening feasible. For example, if electronic discharge summaries were available for all inpatients, then an investigator interested in wound dehiscence (event 805) could run the system on the 30 million admissions expected per year32 and produce about 11,000 cases with about 11,000 false positives (from Appendix 1, event positive predictive value of 0.51 with approximately one case returned by the system for every 1,350 discharge summaries).
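The nationwide estimate above can be reproduced from the three figures quoted in the text. The variable names are assumptions for this sketch.

```python
admissions_per_year = 30_000_000
summaries_per_flagged_case = 1_350   # about one case returned per 1,350 summaries
positive_predictive_value = 0.51     # event 805, wound dehiscence

flagged_cases = admissions_per_year / summaries_per_flagged_case
true_cases = flagged_cases * positive_predictive_value
false_positives = flagged_cases - true_cases

print(round(flagged_cases), round(true_cases), round(false_positives))
```

Roughly 22,000 summaries would be flagged for review, split almost evenly between true cases and false positives, which matches the figure of about 11,000 each.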

Natural language processing may revolutionize adverse event reporting and may play a significant role in adverse event prevention and other forms of intervention. The described system nearly tripled the number of detected events (from 322 to 916) without impeding the clinicians' workflow or adding to their workload, as the operation of our system on discharge summaries was completely automated and transparent to clinicians. As health care moves from simple detection to actual intervention and prevention, the system may become even more important. Processing takes only about a second per document, and MedLEE processes documents at our institution as they are created. In contrast to retrospective manual detection and to voluntary reporting in which clinicians must know about and decide to report an event, natural language processing can provide immediate feedback to clinicians for issues of which they may be unaware. For example, MedLEE processing of chest radiograph reports reduced the rate of erroneously assigning patients with active tuberculosis to nonprivate rooms by almost one half.33


Natural language processing was an effective method for automated adverse event detection, with the reported system outperforming traditional and previous automated adverse event detection methods. In contrast to previously reported techniques, the system detected a broad range of complex adverse events and identified the specific event type with high specificity, although only fair sensitivity. Ultimately, this study demonstrates the potential of natural language processing to facilitate health care processes. Automated diagnosis coding, real-time clinical guidance, computer-assisted documentation, and improved clinical trial recruitment are some of the far-reaching applications of this important technique.

Appendix 1

Events Identified by the Automated Adverse Event Detection System and by Traditional Event Detection on 1,000 Cases and on 57,452 Cases



  • Supported by grants from the Agency for Healthcare Research and Quality (R18 HS11806) “Mining Complex Clinical Data for Patient Safety Research” and National Library of Medicine (R01 LM06910) “Discovering and Applying Knowledge in Clinical Databases.” Dr. Melton was supported by the National Library of Medicine Training Grant (5T15LM007079-12).

  • The authors thank Carol Friedman for the use of the natural language processor MedLEE (National Library of Medicine grant support R01 LM06274 and R01 LM07659), Sue West for her assistance with institutional NYPORTS reporting, and Karina Tulipano for serving as a case reviewer.

