
★ FOCUS on clinical research informatics ★

Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis

Stephen T Wu , Hongfang Liu , Dingcheng Li , Cui Tao , Mark A Musen , Christopher G Chute , Nigam H Shah
DOI: http://dx.doi.org/10.1136/amiajnl-2011-000744 e149-e156 First published online: 1 June 2012


Objective To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.

Design Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.

Results For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.

Conclusion The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

Background and significance

Natural language processing (NLP) is crucial to clinical informatics because the summative information that is stored in millions of clinical notes is too massive to be processed by a human. But automatic methods of processing clinical text have their own challenges, such as the extensive use of specialised medical terms. The Unified Medical Language System (UMLS) Metathesaurus1 has over 8 million strings that an NLP system might consider relevant in clinical text. It is thus common practice for NLP systems1 ,2 to filter the desired terms by criteria such as lexical redundancy and term ambiguity2 or semantic type.3 Such filters, while reasonable, are uninformed by how the terms behave in clinical text.

The long-term goal of this work is to produce an agile information extraction user interface that allows users to specify terms, concepts and logic relevant to their own problem settings, based on criteria such as frequency, source terminology, syntax and semantic type. To that end, our objective here is twofold: first, to analyse empirical instances of UMLS term strings in a large clinical corpus; and second, to illustrate what types of term characteristics are generalisable across data sources. The resulting statistics and principles may then be used in user-directed filtering of lexicons (eg, using Lexicon Builder4) for practical clinical NLP systems. This may also improve system efficiency—the full Metathesaurus (prohibitively, for some users) requires several gigabytes of memory to serve as a lexicon for many algorithms.

This paper therefore explores the characteristics of Metathesaurus term matches in clinical text along dimensions such as term length, term frequency, source terminology, syntactic category and semantic group. The data source used is a corpus of over 51 million patient notes gathered over a 10-year period at the Mayo Clinic. A variant of the standard Aho-Corasick string matching algorithm5 ,6 is run on the data to find term matches, and these data are paired against existing information from Mayo's enterprise NLP system, Clinical Notes Indexing (CNI),7 a precursor to Mayo's open-source NLP system, cTAKES.3 The paper also examines the transferability of corpus statistics by applying a set of Mayo-based filtering parameters to the i2b2/VA NLP Challenge corpus.8 This cross-institutional test provides some insight on which statistical metrics are mainly beneficial within one setting and which are broadly applicable.

After a brief discussion on related work, the remainder of this article introduces the data and methods for empirical term matching in clinical corpora, analyses the Mayo Clinic corpus of clinical notes, applies and analyses a practical set of filters and draws a few conclusions for NLP tasks.

Related work

The UMLS Metathesaurus1 is constantly growing as its source terminologies grow; its 2011AA release contains 155 sources with 8 335 125 different strings for terms in 21 languages, and 2 404 937 different concept unique identifiers. As a thesaurus, the Metathesaurus is designed to match identical concepts from different source terminologies, and it has thus been used frequently as a normalisation target for NLP methods.3 ,7 ,9–11 Our previous work has analysed the large-scale distribution of UMLS clinical concepts.12

The Metathesaurus has also been commonly used as a lexicon2 to supply term strings that might be identified in clinical text, which is slightly different than the concept-oriented focus for which it was designed. This incongruency has been addressed to some degree in MetaMap, an NLP system from the National Library of Medicine, which allows some configurable filtering of the lexicon.2 ,13 This filtering is helpful, but lacks the ability to provide a user with in-domain, empirically based recommendations. With the rise in computational power and the increasing availability of biomedical ontologies, we believe that a corpus-driven approach14 is feasible for principled lexicon filtering.

Constructing practical string-oriented lexicons through filtering has been attempted via statistical models and via rule-based systems. Statistical models typically identify a number of properties that allow prediction of the likelihood of a given string being found or not found in a corpus.15 An excellent recent rule-based study by Hettne et al16 recommends applying five rewrite rules (of nine studied) and seven suppression rules (of eight studied) to the UMLS before it is used for biomedical term identification in MEDLINE.16 Our work complements these attempts by highlighting the large-scale effects of the lexicon-building technique of term suppression.

In the biomedical literature domain, the efforts at lexicon creation are quite advanced; for example, the BioLexicon gathers terms from existing data resources into a single, unified repository, and augments them with new term variants extracted from biomedical literature.17 Efforts by Baral et al provide an online dictionary of diseases and drugs based on frequency analysis in Medline (http://bioai4core.fulton.asu.edu/snpshot/download.html). Our work in analysing a large-scale clinical corpus provides a principled foundation for creating such resources in the clinical domain.

Other corpus studies have been conducted which analyse variability in subdomains,18 sections of a document,19 large-scale semantic characteristics of biomedical literature abstracts,20 and longitudinal semantic shift.21 Our previous work also includes comparisons between concepts in the clinical and biomedical domains.12 Here, we undertake the first known enterprise-scale exploration of clinical text that centres on term strings actually present in the text.

Data and methods

Data sources

The data source for the corpus analysis of clinical text was Mayo Clinic clinical notes between 1 January 2001 and 31 December 2010, retrieved from Mayo's Enterprise Data Trust (EDT).22 The EDT stores structured data, unstructured text and CNI-produced annotations7 from a comprehensive snapshot of Mayo Clinic's service areas, excluding only microbiology, radiology, ophthalmology and surgical reports. Additionally, each possible note type at Mayo was represented: clinical note, hospital summary, post-procedure note, procedure note, progress note, tertiary trauma and transfer note.

For the evaluation of a sample filter, the i2b2/VA 2010 NLP Challenge data8 were used. This corpus contained a total of 871 manually annotated, de-identified reports from Partners Healthcare, Beth Israel Deaconess Medical Center and the University of Pittsburgh Medical Center. The majority of notes were discharge summaries, but the University of Pittsburgh Medical Center also contributed progress reports.

String matching algorithm

Our string matching procedure implemented a modified Aho-Corasick algorithm.5 This algorithm takes a dictionary and constructs a finite state machine with efficient transitions between alphabet string states for failed matches. Our modification uses normalised words as the alphabet, but we store the original strings for each match and report results on exact matches.
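The word-level matching idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, lower-casing stands in for the unspecified word normalisation, and whitespace splitting stands in for real tokenisation. The automaton's alphabet is normalised words, failure links are used when a longer candidate term stops matching, and the original surface string is returned for each match.

```python
from collections import deque


def normalise(word):
    # Assumption: lower-casing stands in for the paper's word normalisation.
    return word.lower()


def build_automaton(terms):
    """Build a word-level Aho-Corasick automaton from term strings."""
    goto = [{}]   # goto[state][word] -> next state
    fail = [0]    # failure link for each state
    out = [[]]    # lengths (in words) of terms ending at each state
    for term in terms:
        words = [normalise(w) for w in term.split()]
        if not words:
            continue
        state = 0
        for w in words:
            if w not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][w] = len(goto) - 1
            state = goto[state][w]
        if len(words) not in out[state]:
            out[state].append(len(words))
    # Breadth-first pass: set failure links and inherit shorter matches.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for w, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and w not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(w, 0)
            out[nxt] = out[nxt] + [n for n in out[fail[nxt]] if n not in out[nxt]]
    return goto, fail, out


def find_matches(text, automaton):
    """Return (start_token_index, original_surface_string) for every match."""
    goto, fail, out = automaton
    tokens = text.split()  # Assumption: naive whitespace tokenisation.
    state = 0
    matches = []
    for i, token in enumerate(tokens):
        w = normalise(token)
        while state and w not in goto[state]:
            state = fail[state]
        state = goto[state].get(w, 0)
        for n in out[state]:
            start = i - n + 1
            matches.append((start, " ".join(tokens[start:i + 1])))
    return matches
```

In this sketch, overlapping terms such as 'chest pain' and 'pain' are both reported at the same position, which is the behaviour one would want when counting every dictionary term occurrence.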

We used the UMLS Metathesaurus as a lexicon. Due to computational constraints we filtered out entries with 10 or more words and those that were not between 3 and 100 characters. Because the algorithm used the UMLS Metathesaurus there were concept unique identifiers available for each string match. We used this normalised representation to find type unique identifiers and characterise the semantic types of the strings.

Data collection and preparation

For corpus analysis, we retrieved text documents from the EDT repository, with 51 945 627 documents represented from 2000 to 2010. The dictionary lookup procedure described above found any UMLS terms in the text documents. For analysis by syntactic category, we retrieved CNI-produced syntactic chunks7 for the same set of documents, and the dictionary lookup procedure was applied to the text of these chunks. This yielded the syntactic category for the majority of term occurrences in the text.

For the last step of examining the cross-institutional transferability of statistics, we used the 2010 i2b2/VA NLP Challenge data without modification. As above, the dictionary lookup procedure mapped UMLS terms in the i2b2/VA data.

Results and analysis

Corpus analysis

Aggregate characteristics

In the corpus of 51 945 627 clinical documents, there are a total of 2 319 010 575 case-insensitive exact term matches, drawing from 296 167 unique terms. This amounts to 44.64 matches per document on average and only utilises 3.56% of the available case-insensitive terms in the UMLS. It is thus clear that we do not need to search the full Metathesaurus in the course of a concept mapping procedure.

However, we should not overestimate how much the terminologies may be filtered, as the dictionary lookup algorithm used was fairly unsophisticated. In fact, it is unlikely that there are so few terms per document in clinical text. Xu et al report 19 million Medline abstracts to have 530.45 matched terms per document using 13% of the unique strings in the UMLS.20 This difference is particularly stark in light of the fact that the clinical documents have, on average, three times as many characters (about 2500) as biomedical abstracts.

The larger number of biomedical matches is likely indicative of the fact that the biomedical text covers a broader range of topics than clinical text. It is also difficult for exact dictionary matches to fully capture the range of synonymous expressions, abbreviations and misspellings that are found in clinical text. For example, the strings ‘dispo’ (abbreviation for the disposition of a patient) and ‘00Cardiac implant’ (tokenisation problems) both occur in the Mayo corpus but are not identifiable.

All of these factors point to a large difference between the clinical and biomedical domains, and also to the need for a clinical data-specific study such as this one.

Word and character statistics

As previously mentioned, the UMLS Metathesaurus was designed as a controlled thesaurus, not a lexicon. It therefore contains concepts whose strings have an excessive number of words or characters and are of little use to NLP techniques. Figure 1 shows histograms for the number of words in the UMLS and in the subset that is empirically found in Mayo Clinic data.

Figure 1

The number of words in a term versus relative frequency of Unified Medical Language System (UMLS) terms with that number of words.

It should be clear that the mappable dictionary terms from the UMLS are shorter on average than the full set of UMLS terms. Subsetting to these 296 167 terms reduces the average characters per term from 37.27 to 17.83 and average words per term from 4.80 to 2.41, similar to the characteristics reported in the biomedical domain. The same is seen to be true when examining the number of characters in UMLS terms, as in figure 2.

Figure 2

The number of characters in a term versus how many Unified Medical Language System (UMLS) terms had that number of characters.

These findings suggest that filtering out high word counts or character counts may be a safe way to remove unnecessary terms from a lexicon.

Term frequency and TF−IDF

To understand what types of UMLS strings are found in clinical text, we now consider some traditional metrics for the importance of a term. Figure 3 shows the distribution of the top 5000 term frequencies in each domain.

Figure 3

Distribution of the most frequent terms in clinical versus biomedical data.

We have scaled the y-axis for biomedical term frequencies to be comparable with the clinical domain. The x-axis is ordered by term frequency (tf) ranking, where the top strings are seen in table 1A,B. We can see that few terms are used frequently (the left portion of figure 3) and many terms are used infrequently (the bottom/right portion), and this characteristic is consistent across both domains. This is reminiscent of Zipf's Law, which describes the empirical frequency distribution of words in general language as having a large peak and a heavy, one-sided tail. By the technical log–log plot definition of a Zipfian distribution, we would see that this is near-Zipfian but the tail is not as heavy.

Table 1

Top terms in clinical text (Mayo corpus) and biomedical text (Medline 2011), by term frequency

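| (A) Clinical text | Frequency | (B) Biomedical text | Frequency |
|---|---|---|---|
| Patient | 38 434 437 | Patients | 10 393 786 |
| Not | 18 601 179 | Cells | 4 855 359 |
| History | 16 650 248 | Treatment | 4 103 013 |
| Pain | 15 125 464 | Study | 4 032 105 |
| Time | 14 667 600 | Results | 3 498 940 |
| Normal | 13 656 279 | Cell | 3 082 455 |
| Right | 13 181 157 | Using | 2 840 963 |
| Left | 13 170 124 | Effect | 2 754 055 |
| Daily | 10 923 371 | Activity | 2 610 750 |
| Well | 9 534 581 | Protein | 2 332 732 |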

From table 1 it is evident that in both domains, the most frequent terms are general rather than specific, and reflect the domains from which they arise. In 51 million documents, 7.7% of terms only occurred once; the 0%, 25%, 50%, 75% and 100% quantiles are at 1, 3, 18, 85 and 38 434 437 occurrences, respectively.
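One way to check how close a frequency distribution is to Zipfian is to fit a straight line to log frequency against log rank. The sketch below is an illustration under stated assumptions rather than the analysis used in the paper: `loglog_slope` is a hypothetical helper that uses ordinary least squares, and an ideal Zipf distribution (frequency proportional to 1/rank) would give a slope near -1.

```python
import math


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies; they are ranked from
    most to least frequent before fitting.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Illustration only: the ten clinical frequencies from table 1(A).
# Ten points are far too few to characterise the tail of the distribution.
top_clinical = [38434437, 18601179, 16650248, 15125464, 14667600,
                13656279, 13181157, 13170124, 10923371, 9534581]
slope = loglog_slope(top_clinical)
```

A meaningful check would fit the full ranked list of term frequencies and inspect where the tail departs from the fitted line.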

We additionally obtained the tf−idf weight of each term for the clinical corpus as in table 2. Tf−idf weights are defined by $\mathrm{tf\text{-}idf} = \mathrm{tf} \cdot \log(N/\mathrm{df})$, where tf is the frequency of the term in the corpus, N is the number of documents in the corpus and df is the number of documents a term occurs in. They are commonly used in information retrieval to measure the importance of terms, with the intuition that terms that occur often in every document are less distinctive than those that occur often in a few documents. Note that the top terms are very similar to the term frequency-ranked versions.

Table 2

Top terms in clinical text by tf–idf weight

| Term | Frequency | Document frequency | tf−idf |
|---|---|---|---|
| Patient | 38 434 437 | 12 163 186 | 5.5E+07 |
| Not | 18 601 179 | 6 921 338 | 3.7E+07 |
| Pain | 15 125 464 | 4 883 178 | 3.5E+07 |
| History | 16 650 248 | 7 375 392 | 3.2E+07 |
| Normal | 13 656 279 | 5 265 335 | 3.1E+07 |
| Daily | 10 923 371 | 2 984 235 | 3.1E+07 |
| Right | 13 181 157 | 5 351 140 | 3.0E+07 |
| Left | 13 170 124 | 5 388 304 | 3.0E+07 |
| Time | 14 667 600 | 7 177 814 | 2.9E+07 |
| Day | 8 288 472 | 3 358 834 | 2.3E+07 |
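The weights in table 2 can be approximately recomputed from the term and document frequencies. The following is a minimal sketch with two assumptions that the text does not state: the logarithm is taken to be the natural logarithm, and N is taken to be the full corpus size of 51 945 627 documents. Under those assumptions the results land close to, though not exactly on, the tabulated values.

```python
import math

N_DOCS = 51_945_627  # corpus size reported in the paper; assumed to be N


def tf_idf(tf, df, n_docs=N_DOCS):
    """tf-idf = tf * log(N / df), with the natural log as an assumption."""
    return tf * math.log(n_docs / df)


# Term and document frequencies for two rows of table 2.
patient = tf_idf(38_434_437, 12_163_186)  # close to the tabulated 5.5E+07
day = tf_idf(8_288_472, 3_358_834)        # close to the tabulated 2.3E+07
```

Because the logarithm grows slowly, the ranking stays dominated by raw term frequency, in line with the observation that the tf−idf ordering is very similar to the tf ordering.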

Figure 4 visualises this comparison by showing the tf rank (x-axis) with the tf−idf values (y-axis)—they are still highly consistent. From here, we see that traditional information retrieval metrics such as tf−idf may be somewhat limited in their ability to discover truly valuable, discriminative words in the clinical domain.

Figure 4

Tf−idf values of the most frequent terms in clinical data.

This ineffectiveness of inverse document frequency is likely due to the fact that the clinical domain is highly specialised by note type and subdomain. The term ‘patient’ is discriminative in some respects: it can be easily found in progress notes and discharge summaries, but is much less likely to be found in notes like pathology or radiology reports.

Source terminology

Here, we compare the number of strings per terminology in the raw UMLS (table 3A) with the most commonly used terminologies (by number of terms represented) in the clinical and biomedical domains (table 3B,C).

Table 3

Top source vocabularies and their degree of utilisation, by number of unique term strings in clinical notes

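| (A) UMLS: Source | Unique | (B) Clinical text (Mayo): Source | Unique | % Use | Frequency | (C) Biomedical text (Medline): Source | Unique | % Use |
|---|---|---|---|---|---|---|---|---|
| SNOMED-CT | 988 733 | CHV | 106 426 | 74.4 | 1 866 925 442 | MSH | 242 462 | 32.6 |
| MSH | 743 332 | SNOMED-CT | 94 788 | 9.6 | 1 538 745 839 | SNOMED-CT | 215 217 | 21.8 |
| MEDCIN | 726 724 | MSH | 51 584 | 6.9 | 753 847 562 | NCI | 101 807 | 58.0 |
| NCBI | 662 674 | NCI | 50 536 | 28.8 | 981 062 417 | CHV | 85 473 | 59.7 |
| RXNORM | 455 466 | RCD | 42 668 | 12.3 | 1 683 517 327 | NCBI | 84 129 | 12.7 |
| RCD | 346 922 | MEDCIN | 32 335 | 4.4 | 298 650 586 | RCD | 69 519 | 20.0 |
| LNC | 313 431 | SNMI | 30 280 | 18.5 | 629 881 044 | SNMI | 57 177 | 34.8 |
| ICD10 | 249 863 | MDR | 28 714 | 39.8 | 310 815 333 | SCTSPA | 56 735 | 3.8 |
| NCI | 175 679 | MTH | 21 642 | 15.3 | 866 386 287 | OMIM | 46 339 | 34.5 |
| SNMI | 164 069 | SCTSPA | 17 661 | 1.2 | 369 476 316 | MTH | 43 029 | 30.5 |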
  • Frequency of terms from each source in clinical text is also shown.

  • CHV, Consumer Health Vocabulary; ICD10, International Classification of Diseases, 10th revision; LNC, Logical Observation Identifier Names and Codes (LOINC); MDR, Medical Dictionary for Regulatory Activities Terminology (MedDRA); MSH, Medical Subject Headings; MTH, UMLS Metathesaurus; NCBI, National Center for Biotechnology Information; NCI, NCI Thesaurus; OMIM, Online Mendelian Inheritance in Man; RCD, Clinical Terms Version 3 (Read Codes); SCTSPA, SNOMED Terminos Clinicos; SNMI, SNOMED International v3.5; SNOMED-CT, Systematized Nomenclature of Medicine - Clinical Terms.

These tables show which terminologies are best for each domain, ranked by the number of unique case-insensitive terms used. Tables 3B and 3C also include what percentage of the terms in the full terminology are used. Interestingly, the new Consumer Health Vocabulary contains only 148 383 terms but accomplishes excellent coverage of terms in both domains because it was designed for natural language contexts. The Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) is the largest source ontology in the UMLS and was developed specifically as a clinical resource. As such, it is one of the most important terminologies in the clinical domain. Similarly, Medical Subject Headings (MSH) was developed specifically for indexing biomedical literature and therefore captures the most terms from biomedical abstracts.

The percentage usage of each of these ontologies is lower in the clinical domain than in the biomedical domain, again likely due to applying an exact case-insensitive string match to highly varied clinical notes. A low usage rate in the clinical domain (for example, that of SNOMED-CT) also indicates that the resource may best contribute to a lexicon after some filtering along other dimensions.

Semantic groups

As mentioned above, the frequent words in the clinical domain differ from those in the biomedical domain. This is most easily seen in figure 5A,B.

Figure 5

(A) Frequencies of terms discovered in clinical versus biomedical text, by semantic group; (B) number of unique terms, by semantic group.

The percentages of matched strings are compared by semantic group and they differ greatly. Here, we follow Bodenreider and McCray's 15 semantic groups23 of semantic types (UMLS Type Unique Identifiers).

Figure 6

Percentage of unique terms that are noun phrase (NP) dominated, by semantic group.

These plots display predictable domain differences in semantic type distribution of terms. Clinical data focus on disorders, anatomy, medications and procedures. cTAKES and CNI are examples of intentional semantic type-based filtering for clinically relevant types, in which five semantic groups are kept, accomplishing 59.60% coverage of occurrences and 82.74% coverage of unique strings.

Note that the difference between the clinical and biomedical domains is large. Type filters designed for one domain should not be applied to another, though some semantic groups are relatively infrequent in both domains.

Syntactic categories

Across the Mayo clinical notes in this study, 90.18% of clinical term mentions were found in noun phrase (NP) chunks; Xu et al found similar NP-dominance characteristics in biomedical data. Figure 6 stratifies the clinical NP-dominance characteristics by semantic group. While filtering out non-NP constructions is commonplace in many clinical NLP systems, it should be done with caution for semantic groups like “Procedures” or “Activities & Behaviors”.

It should be noted that this depends on a sound chunking procedure, and there were some limitations to the accuracy of the IBM shallow parser in CNI: there were terms that resided in incorrect chunks and those that were not in any chunk. However, as string-matched terms occur across the whole distribution of the text, this noise is overcome on average.

Cross-institutional analysis

Based on the corpus analysis on Mayo data above, we defined an example configuration of filters for use-case agnostic information extraction in clinical notes, and applied these candidate filters to string-matched i2b2/VA data to examine their trans-institutional applicability.

A Mayo-based filtering configuration

We implemented eight lexicon filters:

  1. Special characters. The UMLS contains fine-grained semantic distinctions that are indicated with punctuation, for example, ‘[D] Respiratory insufficiency (finding)’ versus ‘Respiratory insufficiency, NOS.’ This UMLS-intrinsic filter removes a term from the lexicon if and only if it begins with ‘[’, ends with ‘)’ or contains a comma.20

  2. Maximum number of words. Given the histogram in figure 1, fewer than 1000 terms have seven words. Thus, we eliminate terms with seven or more words, removing over a quarter of UMLS terms.

  3. Maximum number of characters. Given the histogram in figure 2, only 39 terms have 56 or more characters. We thus eliminate terms with fewer than 2 characters or more than 55 characters, removing over a fifth of UMLS terms.

  4. Language. Fifteen languages are represented in the UMLS. Filtering to English terms reduces the set of UMLS terms by almost a third.

  5. Source terminology. Many UMLS source terminologies are not designed to be lexicons (eg, International Classification of Diseases, ninth revision billing codes). We keep only the top 14 English sources out of the possible 155: SNOMED-CT, Consumer Health Vocabulary, National Cancer Institute (NCI) Thesaurus, Medical Subject Headings (MSH), Read Codes, Medical Dictionary for Regulatory Activities Terminology (MedDRA), SNOMED International, MEDCIN, UMLS Metathesaurus, National Drug File—Reference Terminology (NDF-RT), the original SNOMED, Online Mendelian Inheritance in Man (OMIM), Logical Observation Identifiers Names and Codes (LOINC) and Computer Retrieval of Information on Scientific Projects (CRISP) Thesaurus.

  6. Semantic group. Of the 15 semantic groups, over 92% of Mayo Clinic terms come from only 7: anatomy, chemicals & drugs, concepts & ideas, disorders, living beings, physiology, and procedures.

  7. Empirical occurrence filter. We filter out those terms that never appeared in the Mayo corpus. This leaves the full set of Mayo Clinic term occurrences and tests the transferability of a specific lexicon across institutions.

  8. Term frequency. A total of 99.99% of mentions can be retained if we eliminate terms that occurred only once or twice in the Mayo corpus. This is a subset of the empirical occurrences filter, since zero occurrences are also eliminated.
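The four UMLS-intrinsic filters (1 to 4) depend only on the term string and its language, so they can be expressed as simple predicates. The sketch below is illustrative and not the code used in the study: the `Term` record, the function names and the use of the language code "ENG" for English are assumptions, while the thresholds follow the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    # Hypothetical record; a real lexicon entry would also carry the
    # source terminology and semantic type needed for filters 5 and 6.
    text: str
    language: str


def passes_special_chars(term):
    # Filter 1: drop terms that begin with '[', end with ')' or contain a comma.
    s = term.text
    return not (s.startswith("[") or s.endswith(")") or "," in s)


def passes_max_words(term, max_words=6):
    # Filter 2: drop terms with seven or more words.
    return len(term.text.split()) <= max_words


def passes_max_chars(term, min_chars=2, max_chars=55):
    # Filter 3: drop terms with fewer than 2 or more than 55 characters.
    return min_chars <= len(term.text) <= max_chars


def passes_language(term, language="ENG"):
    # Filter 4: keep English terms only ("ENG" is an assumed language code).
    return term.language == language


INTRINSIC_FILTERS = (passes_special_chars, passes_max_words,
                     passes_max_chars, passes_language)


def filter_lexicon(terms, filters=INTRINSIC_FILTERS):
    """Keep only the terms that pass every filter."""
    return [t for t in terms if all(f(t) for f in filters)]
```

Keeping each filter as a separate predicate makes it straightforward to measure the effect of one filter at a time, or of a combination, in the manner of the per-filter evaluation that follows.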

Cross-institutional filtering evaluation

Table 4 reports the impact of this filtering. First, we begin with a baseline of the full UMLS. The top left cell indicates the number of unique UMLS terms. Rows show the lexicon size reduction effect of individual filters against this baseline. The final rows apply multiple filters at once.

Table 4

Transferability of corpus-based filtering of the Unified Medical Language System (UMLS)

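| Filter | UMLS: Unique | UMLS: % rdn | Mayo Clinic: Unique | Mayo Clinic: % exc (unique) | Mayo Clinic: Matches (n) | Mayo Clinic: % exc (matches) | i2b2/VA: Unique | i2b2/VA: % exc (unique) | i2b2/VA: Matches (n) | i2b2/VA: % exc (matches) |
|---|---|---|---|---|---|---|---|---|---|---|
| Full UMLS | 8 335 125 | | 296 798 | | 2.376×10^9 | | 17 570 | | 376 350 | |
| 1. Sp. Char. | 5 146 096 | 38.26 | 296 798 | 0.00 | 2.376×10^9 | 0.00 | 17 570 | 0.00 | 376 350 | 0.00 |
| 2. MaxWord | 6 157 283 | 26.13 | 295 385 | 0.48 | 2.376×10^9 | 0.00 | 17 564 | 0.03 | 376 343 | 0.00 |
| 3. MaxChar | 6 477 250 | 22.29 | 296 516 | 0.10 | 2.376×10^9 | 0.00 | 17 569 | 0.01 | 376 349 | 0.00 |
| 4. Language | 5 610 576 | 32.69 | 296 167 | 0.21 | 2.375×10^9 | 0.05 | 17 552 | 0.10 | 376 234 | 0.03 |
| 5. Sources | 3 409 183 | 59.10 | 251 361 | 15.31 | 2.327×10^9 | 2.08 | 16 491 | 6.14 | 368 682 | 2.04 |
| 6. SemGroup | 7 798 937 | 6.43 | 273 300 | 7.92 | 2.289×10^9 | 3.68 | 16 343 | 6.98 | 361 018 | 4.07 |
| 7. EmpFilt | 296 798 | 96.44 | 296 798 | 3.56 | 2.376×10^9 | 0.00 | 17 371 | 1.13 | 319 258 | 15.17 |
| 8. TermFreq | 230 011 | 97.24 | 226 697 | 23.62 | 2.376×10^9 | 0.00 | 17 326 | 1.39 | 319 039 | 15.23 |
| Filters 1–8 | 181 523 | 97.82 | 181 523 | 38.84 | 2.244×10^9 | 5.57 | 15 139 | 13.84 | 301 473 | 19.90 |
| Filters 1–6 | 1 448 811 | 82.62 | 230 860 | 22.22 | 2.244×10^9 | 5.56 | 15 343 | 12.68 | 354 274 | 5.87 |
| Filters 1–5 | 1 594 674 | 80.87 | 250 192 | 15.70 | 2.327×10^9 | 2.09 | 16 486 | 6.17 | 368 676 | 2.04 |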
  • The UMLS column shows % rdn (reduction) of lexicon size (larger % rdn is more efficient). The Mayo and i2b2/VA columns compare this to % exc (exclusion) rate, wherein UMLS terms are no longer mapped due to the filtering. Incongruencies in % exclusion indicate corpus differences.

The left ‘UMLS’ columns analyse how much of the UMLS Metathesaurus remains after each of the filters, and larger percent reduction values correspond to more memory-efficient systems. The middle ‘Mayo’ columns evaluate the reasoning for choosing these filter definitions. For example, our semantic groups filter (filter 6 in table 4) uses only seven semantic groups. Reading the row from left to right, it reduces the size of the lexicon to 7 798 937 (a 6.43% reduction), keeps 273 300 of the 296 798 unique terms (ie, excludes 7.92%), and keeps 2.289×10^9 of the 2.376×10^9 term occurrences (ie, excludes 3.68%) for the Mayo corpus. As a whole, the filters defined in this example might be reasonable for some information extraction applications, excluding only 5.57% of all mentions. The right ‘i2b2/VA’ columns are defined by using Mayo-based filters on term matches from the i2b2/VA corpus.
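Both the % rdn and % exc figures in table 4 are the same kind of quantity: the share of a baseline count that a filter removes. As a worked check, and assuming this is how the percentages were derived, the sketch below recomputes two of the semantic group filter's values from the counts in its row; `percent_removed` is a hypothetical helper name.

```python
def percent_removed(baseline, remaining):
    """Percentage of the baseline count that a filter removes."""
    return 100.0 * (baseline - remaining) / baseline


# Semantic group filter (filter 6): lexicon size reduction against the full UMLS.
rdn = percent_removed(8_335_125, 7_798_937)

# Semantic group filter (filter 6): unique Mayo terms excluded.
exc_unique = percent_removed(296_798, 273_300)

print(f"% rdn = {rdn:.2f}, % exc (unique) = {exc_unique:.2f}")
```

The occurrence-level exclusion rates cannot be recomputed as precisely from the table, because the match counts are printed to only four significant figures.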

Our cross-institutional evaluation lies in comparing the ‘Mayo’ columns with the ‘i2b2/VA’ columns. Filters 1–4 seem to apply similarly and accurately across the two corpora. This is to be expected because they largely deal with systematic intrinsic properties of the term strings in the UMLS and should not depend on corpora. The remaining filters differ between Mayo and i2b2/VA data, indicating that statistical analysis along those lines should only be transferred across data sources with caution.

The source terminology filter removed a far smaller proportion of unique terms in the i2b2/VA corpus (6.14%) than in the Mayo corpus (15.31%). This is probably due to the vast size difference between the two corpora: recalling figure 3, a heavy tail distribution within large corpora means that many uncommon terms are mapped in the Mayo corpus, but not in the i2b2/VA corpus. We may conclude that filtering by source reduces the diversity of available terms, but the most frequent terms are captured in a small number of sources.

In i2b2/VA data, the semantic group filter excludes a higher proportion of unique terms but a smaller proportion of term mentions than in Mayo data. The variability is not great, however, compared with the differences with the biomedical literature domain in figure 5A,B. We conclude that though different clinical corpora may have slightly different distributions, their utilised terms are still relatively similar to each other in semantic groups.

Perhaps most instructive are filters 7–8, the empirical occurrences and the term frequency filters. Both of these filters exclude a smaller proportion of unique terms in the i2b2/VA data (1.39% for filter 8) than in the Mayo data (23.62% for filter 8), again likely due to the corpus size differential. However, a larger proportion of i2b2/VA mentions are excluded (15.23% for filter 8) than Mayo mentions (0% for filter 8). Despite the fact that these are both clinical corpora, term frequencies have vastly different characteristics in the different corpora. Although these statistics are standard NLP techniques, they would appear to be more helpful within an institution than across sources of data.


The foregoing cross-institutional test aligns with envisioned applications because a user of a practical application like Lexicon Builder4 will need guidance on what filters to choose. It would be safe in such a situation to apply filters that have been validated across institutions, but other filters should be applied with caution. A final recommendation for use-case agnostic information extraction in the i2b2/VA corpus, as presented in table 4, might be to utilise filters 1–5. These simple filters achieve fivefold reduction in lexicon size (efficiency) while preserving almost 94% of unique terms and almost 98% of mentions in the corpus.

The semantic group filter has limited utility because it does not greatly reduce the dictionary size. However, other factors, such as the limitations of a corresponding human annotation effort, may be reasons for narrowing the scope of general information extraction to specific semantic groups.

Most importantly, the results show that the empirical occurrences and term frequency filters are highly institution specific. Any methodologies developed off of these statistics should take care to complete a preliminary corpus analysis rather than directly using the Mayo Clinic statistics.

Unlike our previous work,12 the preceding analysis does not attempt to calculate or analyse concept-level semantics. Although mixing the two analyses is an interesting problem, our term-level analysis is natural for the envisioned problem setting, where a user is building a lexicon of strings for concept indexing; concept normalisation would presumably be a downstream task. Additionally, we do not calculate the ‘usefulness’ of filters in real-world applications because such measures typically require a concept-centric focus.

Conclusion and future work

Based on the occurrences of terms in a 51 million document corpus of Mayo Clinic clinical notes, this paper has presented a suite of statistics on UMLS term occurrences in the clinical domain, and has evaluated the cross-institutional applicability of these statistics. We have shown several measures that are intrinsic to their Metathesaurus entries (term well-formedness, length and language) that generalise easily across clinical institutions. Term frequencies are highly variable across institutions and should be adapted across domains or institutions with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but the distance between institutions is much smaller than that between the clinical and biomedical literature domains.

We believe this analysis makes it possible for end users to build customised, empirically informed lexicons from the UMLS. Implementationally, this team plans on enhancing Lexicon Builder4 with the statistics presented above. Other future work includes the further characterisation of clinical note sections (eg, terms may differ in history of present illness vs discharge diagnosis sections), types of notes (eg, discharge summaries vs operative reports), co-occurrence information (ie, utilising latent semantic information), and ontological structure (eg, which branches in an ontology are more useful).

As mentioned, a concept-centric analysis and its relationship to our term-centric analysis are also areas of future work. A concept-centric filtering evaluation, for example, may actually show that precision could be improved by filtering, since it could remove ‘distracting’ terms.

While the coverage of lexicons derived out of biomedical ontologies is impressive, clinical writing contains many more variants. We plan to generate accurate variants by analysing lexical variants, synonyms and related terms at a large scale.


SW carried out the experiments, led the study design and analysis and drafted the manuscript. HL and DL helped with coding the experiments and with manuscript drafting. NS and CT enabled the comparisons with biomedical data, and NS also helped draft the manuscript. MM and CC provided institutional support and manuscript editing.


This work was supported in part by the NIH Roadmap Grant U54 HG004028. This study was also supported by National Science Foundation ABI:0845523, National Institute of Health R01LM009959A1, and the SHARPn (Strategic Health IT Advanced Research Projects) Area 4: Secondary Use of EHR Data Cooperative Agreement from the HHS Office of the National Coordinator, Washington, DC, DHHS 90TR000201.

Competing interests


Ethics approval

Ethics approval was provided by Mayo Clinic Institutional Review Board.

Provenance and peer review

Not commissioned; externally peer reviewed.

Data sharing statement

The aggregate corpus statistics in this paper will be released open-source at a future date as a part of the Lexicon Builder web service.


The authors would like to acknowledge Rong Xu, Vinod Kaggal and Yipei Liu for their help on the experiments and statistics, and the anonymous reviewers for their thorough feedback.

This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.

