
Knowledge-based Methods to Help Clinicians Find Answers in MEDLINE

Charles A. Sneiderman MD, PhD, Dina Demner-Fushman MD, PhD, Marcelo Fiszman MD, PhD, Nicholas C. Ide MS, Thomas C. Rindflesch PhD
DOI: http://dx.doi.org/10.1197/jamia.M2407 772-780 First published online: 1 November 2007


Objectives: Large databases of published medical research can support clinical decision making by providing physicians with the best available evidence. The time required to obtain optimal results from these databases using traditional systems often makes accessing the databases impractical for clinicians. This article explores whether a hybrid approach of augmenting traditional information retrieval with knowledge-based methods facilitates finding practical clinical advice in the research literature.

Design: Three experimental systems were evaluated for their ability to find MEDLINE citations providing answers to clinical questions of different complexity. The systems (SemRep, Essie, and CQA-1.0), which rely on domain knowledge and semantic processing to varying extents, were evaluated separately and in combination. Fifteen therapy and prevention questions in three categories (general, intermediate, and specific questions) were searched. The first 10 citations retrieved by each system were randomized, anonymized, and evaluated on a three-point scale. The reasons for ratings were documented.

Measurements: Metrics evaluating the overall performance of a system (mean average precision, binary preference) and metrics evaluating the number of relevant documents in the first several presented to a physician were used.

Results: Scores (mean average precision = 0.57, binary preference = 0.71) for fusion of the retrieval results of the three systems are significantly (p < 0.01) better than those for any individual system. All three systems present three to four relevant citations in the first five for any question type.

Conclusion: The improvements in finding relevant MEDLINE citations due to knowledge-based processing show promise in assisting physicians to answer questions in clinical practice.


Surveys of physicians consistently show clinical queries unanswered at the time of diagnostic and therapeutic decision-making.1,2 Physicians report that they did not pursue answers because of the time required to find information (75%) and because of resource inconvenience (8.3%).2 Ely et al.1 identified five major obstacles in addition to time: (1) difficulty formulating a question, (2) difficulty selecting an optimal search strategy, (3) failure of a resource to cover the topic, (4) uncertainty whether all of the relevant evidence had been found and the search could stop, and (5) difficulty synthesizing multiple bits of evidence.

Several possible approaches to addressing these problems have been investigated. Studies of interfaces for structured query formulation report improvements in precision without negative impact on recall, but not necessarily higher user satisfaction or acceptance of the interface.3 Moreover, training in the searching and appraisal of medical literature is essential for finding satisfactory answers to clinical questions.1,4 Even when using their electronic information resources of choice, physicians were not always successful in answering clinical questions.5 The paradigm of evidence-based medicine (EBM)6 is an important resource in devising solutions to these problems.

In an environment where time and effort are at a premium, the value of MEDLINE for assisting in therapeutic decision making depends not only on well-formed questions but also on algorithms that can improve precision and recall in finding clinically relevant information.7 Research indicates that given enough time and skill, clinicians can find answers to their questions in MEDLINE.8 A recommended strategy is to reduce search results by focusing the question, often by adding more terms to a query. This requires a clinician to invest time in analyzing the information needed to identify search terms describing a clinical situation. An alternative approach is often observed in practice9: A clinician underspecifies a search by submitting two or three terms and then selects relevant documents while browsing the results, often using clinical practice guidelines for evaluation. Some of these strategies could be implemented automatically by incorporating domain knowledge in the search and then postprocessing the search results in order to rerank results for relevance to the query.

The research presented here explores the effectiveness of such automatic methods. We evaluate three knowledge-based automatic methods being developed at the National Library of Medicine to help physicians find clinically relevant information in MEDLINE. The three systems use medical domain knowledge encoded in the Unified Medical Language System (UMLS)10 (alone or in combination with corpus-based methods) to find information for clinical queries of varying complexity.

The first system, SemRep Summarization,11 uses natural language processing and automatic summarization of MEDLINE citations to find the most relevant information about a clinical query within PubMed retrieval results. The second system, CQA-1.0,12 uses EBM recommendations for finding best answers for questions about treatment and prevention.13 Both SemRep and CQA-1.0 rerank a set of MEDLINE citations retrieved using PubMed and clinical query filters.14 The third system, Essie,15 is a probabilistic search engine that uses fine-grained tokenization, concept searching utilizing UMLS-derived synonymy, and phrase searching based on the user's query to find the best MEDLINE citations for answering a clinical question. All three systems use structured domain knowledge from the UMLS to a varying extent and rely directly or indirectly on the medical subject headings (MeSH) controlled vocabulary used to manually index MEDLINE citations.

After providing an overview of the three systems, we concentrate on evaluating them with respect to finding information about treatment and prevention of 15 disorders. A test collection was constructed using the Text Retrieval Conference (TREC) pooling strategy.16 A rating scale developed to evaluate the utility of MEDLINE citations in clinical decision making17 was used to compare the performance of the three methods in answering clinical queries. We provide results in several evaluation metrics as a way of predicting the effectiveness of the systems under consideration in different clinical situations. The goal of this work is to explore approaches to increasing the utility of the primary literature with respect to answering clinical questions of varying complexity. Specifically, we investigate whether automatic understanding of MEDLINE citations based on medical domain knowledge can provide practical support for clinicians in therapeutic decision making.


Prior Work

Previously, several domain knowledge-based automatic approaches to indexing and retrieving scientific publications that contain answers to clinical questions have been explored. The approaches range from developing a set of generic queries for clinical information retrieval18 to matching semantic representation of a patient's clinical record with that of a MEDLINE citation19 and generating personalized summaries in response to physicians' questions.20 The most thoroughly researched approaches have been automatic query expansion, concept indexing, and retrieval-based feedback. The reported evaluations of these approaches have not shown consistent improvements in satisfying clinicians' information needs. For example, automatic query expansion using controlled vocabulary terms was shown to improve overall average precision,21 but only one third of the queries showed improvement in a study of synonym- or hierarchical thesaurus-based query expansion.22 Concept indexing implemented in the SAPHIRE system helped physicians,23 whereas it degraded performance for 30 medical questions in a study of effectiveness of conceptual indexing.24 Query expansion based on retrieval feedback (terms from the top few documents retrieved in an initial run are added to the original query) improved average precision,25 but is also known to degrade performance.26

In addition to using domain knowledge to assist clinicians as they actively seek information, promising results have been obtained through passive query generation using patient records for query formulation.27 The effect of unobtrusively providing context-specific links between clinical data and information resources in the form of “infobuttons”28 has been examined in several recent studies. Rosenbloom et al.29 found a significant increase in the use of educational materials in the Care Provider Order Entry system when the materials could be accessed through visible hyperlinks (as opposed to menus). Cimino et al.30 observed positive results and increased use of infobuttons over 5.7 years. The success of context-specific access to knowledge in this study varied with context and user type. Similar to the study by Cimino et al., Del Fiol et al.31 observed increased infobutton use with preference for secondary sources (such as Micromedex and UpToDate) that summarize the results of clinical studies and underutilization of resources providing access to the primary literature (such as MDConsult and PubMed). Although these resources provide valuable general information to the clinician, there remains a need for methods that help find answers to particular questions.


PubMed automatically recognizes controlled vocabulary terms matching a user's query with the entries in several translation tables. If a match is found in the MeSH translation table, the term is searched as MeSH (including the MeSH term and any specific terms indented under that term in the MeSH hierarchy) and as a text word. One of the advanced PubMed search options, clinical queries, is a set of filters designed to find clinically relevant and scientifically sound studies.32 These filters automatically expand queries using predefined sets of terms designed to limit search results to articles addressing one of the four major clinical tasks (etiology, diagnosis, therapy, and prognosis). For each task, clinical queries provide two search choices: specific (narrow) or sensitive (broad). For example, a “narrow therapy” clinical query augments a user's query with the following search terms: randomized controlled trial [Publication Type] OR (randomized [Title/Abstract] AND controlled [Title/Abstract] AND trial [Title/Abstract]).

SemRep Summarization

The first reranking system considered in this study is based on SemRep Summarization,11,33 which depends on the semantic natural language processing system SemRep.34,35 SemRep identifies semantic predications (relationships) in biomedical text using underspecified syntactic analysis and structured domain knowledge from the UMLS. SemRep predications consist of UMLS Metathesaurus concepts as arguments and UMLS Semantic Network relations as predicates (relations between the concepts). Analysis begins with an underspecified syntactic parse that relies on the SPECIALIST lexicon36 and a part-of-speech tagger.37 MetaMap38 then matches noun phrases to concepts in the UMLS Metathesaurus and determines the semantic type for each concept. Concepts are identified as arguments in a predication using syntactic constraints based on dependency grammar rules and semantic constraints imposed by the Semantic Network. Predications representing core aspects of the clinical scenario were central to this study. These predications have predicates such as treats, co-occurs_with, and occurs_in and arguments belonging to the UMLS semantic groups39 Chemicals and Drugs, Disorders, and Population Groups.

SemRep Summarization is an automatic summarization system in the semantic abstraction paradigm.40 The system takes as input a list of predications extracted by SemRep from biomedical text (MEDLINE citations to be reranked in this study). Output is a condensed set of predications that serves as a summary of salient information on a specified topic in the citations processed. The core of the system is a transformation stage that identifies the most important information with respect to the specified topic. The transformation stage relies on four principles: (1) relevance, which keeps predications on the topic of the summary; (2) connectivity, which keeps related predications that share an argument with the summary topic; (3) novelty, which eliminates uninformative predications; and (4) saliency, which keeps high frequency predications.41 Predications in the summary are linked to the citations from which they were extracted and play an important role in exploiting SemRep Summarization for reranking retrieved citations in this study.


Another reranking method is implemented in the prototype clinical question-answering system CQA-1.0. In this system, questions and MEDLINE citations are represented using frames that capture the fundamental elements of EBM: (1) clinical scenario, (2) clinical task, and (3) strength of evidence. A question frame submitted to the system is used to generate a query and search MEDLINE using PubMed. Retrieved citations are processed with several knowledge extractors and classifiers that rely on a combination of UMLS concept recognition using MetaMap,38 manually derived patterns and rules, and supervised machine learning techniques12 to identify the fundamental EBM components listed. The PICO framework (Problem/Patient, Intervention, Comparison, and Outcome) designed to help clinicians formulate clinical questions42 is used to capture the first fundamental component (clinical scenario) in a MEDLINE citation. The elements of a clinical scenario are identified and extracted by four knowledge extractors. The problem extractor identifies a UMLS concept in the semantic group39 Disorders, which is the focus of a given study. The population extractor identifies phrases containing numerical expressions and concepts with the semantic type Group and its subcategories. The intervention and comparison extractor is based on finding concepts with nine semantic types (for example, Therapeutic or Preventive Procedure and Diagnostic Procedure). Identification of the second fundamental component (clinical task) is based on rules derived from: (1) search strategies encoded in PubMed clinical queries, (2) the JAMA EBM tutorial series on critical appraisal of medical literature,43 and (3) MeSH scope notes. The third fundamental component (strength of evidence) is based on the type of clinical study presented in the publication, authority of the journal that published it, and date of publication. 
Citation scoring and reranking with respect to a question are based on: (1) matching the question and citation with PICO frames, (2) matching the clinical task that generated the question with the task identified in the clinical study (treatment and prevention, for this study), and (3) the strength of the evidence presented in the study.


A different approach to finding citations answering clinical questions is implemented in Essie, a probabilistic search engine developed at the National Library of Medicine for the ClinicalTrials.gov database. Essie incorporates a number of strategies aimed at alleviating the need for sophisticated user queries.15 These strategies include a fine-grained tokenization algorithm that preserves punctuation information, concept searching utilizing UMLS-derived synonymy, and phrase searching based on the user's query. Citations containing phrases identified in a user's query are ranked higher than citations containing individual words comprising the phrase. Position of a matching phrase or term in a citation also influences the rank of a citation with respect to a query. For example, if a phrase is found in the title, the citation is ranked higher than one that contains this phrase in the abstract. Essie provides several possibilities for query expansion: exact match, SPECIALIST lexicon-based36 morphological expansion of terms, and UMLS-based expansion of concepts. Essie was the best-performing search engine in the 2003 TREC Genomics track44 and one of the best-performing systems in the 2006 TREC Genomics track.45

Evaluation Strategy

Our evaluation is based on techniques developed over the past 15 years in the framework of TREC—a yearly large-scale evaluation of information retrieval and question answering systems.16 Traditionally, systems are evaluated using test collections consisting of: (1) a corpus of documents (for example, MEDLINE citations), (2) a set of queries or questions (called topics in TREC), and (3) relevance judgments—human assessment of the relevance of each document in the collection to a given topic. Ideally, each document in the corpus would be judged with respect to each topic. Due to the size of modern document collections, such evaluation is not feasible even in the framework of TREC, which leads to an alternative strategy of first selecting a subset of documents to be judged, then assessing the relevance of these documents to the topics, and finally using these relevance judgments to assess the relative performance of the systems. A practical solution to the question of selection of an appropriate small subset of documents is the TREC pooling strategy. Documents to be judged for each topic are contributed to the pool by each information retrieval system participating in the evaluation. In TREC, the top 75 to 100 documents returned by each system are combined into a set given to the judge. The judged documents are subsequently used to evaluate the relative performance of the contributing systems.


In exploring the effectiveness of SemRep Summarization, CQA-1.0, and Essie in pinpointing answers to clinical questions in MEDLINE citations, we first created a test collection based on published clinical questions deemed to be of interest to the majority of American family physicians46 and a set of MEDLINE citations created using the retrieval results of the three experimental systems and the TREC pooling strategy. The citations were judged for relevance to the questions on a three-point scale. Results from the three systems being evaluated were compared against the test collection individually and fused. Several evaluation metrics were computed to predict how the systems would perform in a clinical setting.

Creating a Test Collection

To evaluate the performance of the three systems under scrutiny, we constructed a test collection consisting of 15 clinical questions along with relevant MEDLINE citations and judgments of their relevance to the questions. The top 10 documents returned by each system were added to the pool of documents evaluated by the first author, who did not participate in the development of any of the experimental systems.

Question Selection

For the questions, the first author, a practicing family physician, selected 15 queries (Table 1) from the Family Practice Information Network (FPIN) clinical queries collection, which is published monthly in the Journal of Family Practice and American Family Physician and contains queries typically generated in the daily practice of general medicine.46 Even if the query did not adhere to the syntactic form of a question (for example, specific queries 4 and 5), the original queries were not modified. The queries selected pertain to therapeutic or preventive interventions for clinical problems and can be regarded as instances of generic clinical questions.47 We identified two types of clinicians' information needs: general (an overview of a topic) and specific (an exact answer to a focused question). When inspecting the FPIN clinical queries collection, we determined that some questions are intermediate; they do not call for an overview but are not focused enough for an exact answer. The nature of the questions in the FPIN collection warrants exploration of all three question types. Five queries were selected as general in that the only element of a clinical scenario in the question was the problem. Five were intermediate, with clinical scenario elements of population group, intervention, or outcome included with the request for therapy or prevention of a problem. Finally, five were specific or complex, including at least two elements of a clinical scenario selected from population group, intervention, or outcome (in addition to the problem). Our focus on therapy and prevention questions and the intent to evaluate the systems' performance for all levels of difficulty precluded random selection of the questions. Instead, the first author selected five questions of interest to his practice from each level.

Table 1

Clinical Questions Used to Retrieve MEDLINE Citations

General Questions
  1. What is the most effective treatment for external genital warts?

  2. What are the most effective interventions to reduce childhood obesity?

  3. What is the most effective treatment for acute low back pain?

  4. What is the best approach to treatment of osteoporosis?

  5. What are effective treatments for panic disorder?

Intermediate Questions
  1. What is the most effective treatment for ADHD in children?

  2. Can type 2 diabetes be prevented through diet and exercise?

  3. What are the best therapies for acute migraine in pregnancy?

  4. Do steroid injections help with osteoarthritis of the knee?

  5. What is the best antiviral agent for influenza infection?

Specific Questions
  1. Are antibiotics effective in preventing pneumonia for nursing home patients?

  2. Is methylphenidate useful for treating adolescents with ADHD?

  3. What is the best treatment for gastroesophageal reflux and vomiting in infants?

  4. Antiviral agents for pregnant women with genital herpes.

  5. Intravenous fluids for children with gastroenteritis.

In the FPIN collection, each query is accompanied by a published answer derived from a careful process involving a search of the published literature by a medical librarian, review of that literature and any other sources of evidence by two clinicians trained in evidence-based medicine, and editorial review by FPIN academic family physicians. The MEDLINE citations published in the evidence summaries48 were used for reference while judging the abstracts selected for evaluation as described below.

Assigning Relevance Judgments

Relevance judgments were generated using the pooling strategy developed for TREC.16 The top 10 documents from each system were collected, and duplicates were removed. The titles and abstracts of MEDLINE citations were printed in random order and given to the first author. The nontopical characteristics of key articles identified in Sievert et al.49 (authors and their institutional affiliations, or document types) were removed so that the judgments could be based only on the content of the abstract text. The abstracts were rated on a three-point scale: A, leads to an answer (definitely useful in clinical decision-making for the question); B, might lead to an answer (relevant, but not sufficient to make a decision); C, not relevant (not useful for clinical decision making). In addition to randomized blinded evaluation of the citations, the first author documented the reasons for rating. Analysis of these reasons for rating provides information about features that make a citation particularly useful in decision support.
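The pooling step described above can be illustrated with a minimal sketch. The function name `build_pool`, the system names, and the document identifiers are hypothetical and chosen only for illustration; the sketch simply takes the top documents from each system, removes duplicates, and shuffles the result so that the judge sees the citations in random order.

```python
import random


def build_pool(runs, depth=10, seed=0):
    """Pool the top `depth` documents of every run, drop duplicates, shuffle.

    `runs` maps a system name to its ranked list of document identifiers.
    """
    pool = []
    seen = set()
    for ranked_ids in runs.values():
        for doc_id in ranked_ids[:depth]:
            if doc_id not in seen:
                seen.add(doc_id)
                pool.append(doc_id)
    # Shuffle with a fixed seed so the presentation order is random but
    # reproducible, and carries no hint of which system ranked a document.
    random.Random(seed).shuffle(pool)
    return pool


# Hypothetical ranked output of three systems (depth 2 keeps the toy example small).
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d4", "d1"],
    "system_c": ["d5", "d1", "d6"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

In the study itself the depth was 10 documents per system; the smaller depth here is only to keep the example readable.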

Retrieving and Reranking MEDLINE Citations

Each FPIN question was used to search MEDLINE with PubMed and Essie, limited to no later than the date of the FPIN answers for each question. Essie returns relevance-ranked output directly. The chronologically ordered citations from PubMed were subsequently reranked for each query using SemRep Summarization and the CQA-1.0 system.

For the strategies based on SemRep Summarization and CQA-1.0, an initial PubMed search strategy was to use the narrow therapy clinical queries filter and the clinical terms identified in a given question. For example, the clinical term in the FPIN question, “What is the best approach to treatment of osteoporosis?” is osteoporosis. The addition of the PubMed clinical queries filter to this term yields the following query: (osteoporosis[MeSH Terms] OR osteoporosis[Text Word]) AND (randomized controlled trial[Publication Type] OR (randomized[Title/Abstract] AND controlled[Title/Abstract] AND trial[Title/Abstract])). If the initial search yielded no results, the search was repeated with the clinical queries filter replaced with the following limits: citations with abstracts, restricted to human studies written in English. Two of the intermediate questions and all specific questions required this substitution. A total of 1,305 documents for the first set, 925 for the second set, and 959 for the third set were retrieved from MEDLINE using PubMed. Unranked PubMed results were used as a baseline against which experimental results were compared.
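As an illustration of how the initial query is assembled, the following sketch combines a clinical term with the narrow therapy filter text quoted above. The function name `build_narrow_therapy_query` is hypothetical; only the query syntax itself is taken from the example in the text.

```python
# Narrow therapy filter, copied from the query shown in the text.
NARROW_THERAPY_FILTER = (
    "(randomized controlled trial[Publication Type] OR "
    "(randomized[Title/Abstract] AND controlled[Title/Abstract] "
    "AND trial[Title/Abstract]))"
)


def build_narrow_therapy_query(term):
    """Search the term as MeSH and as a text word, then apply the filter."""
    term_clause = f"({term}[MeSH Terms] OR {term}[Text Word])"
    return f"{term_clause} AND {NARROW_THERAPY_FILTER}"


query = build_narrow_therapy_query("osteoporosis")
print(query)
```

For the term osteoporosis this reproduces the query string given above.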

In exploiting SemRep Summarization for reranking retrieved citations, predications were extracted from the MEDLINE citations retrieved for each FPIN query. After summarizing the predications, the citations from which the predications were extracted were promoted as being more highly relevant to the query based on how closely and how frequently arguments in those predications matched Metathesaurus concepts extracted from the query.
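The promotion step can be pictured with a deliberately simplified sketch. It is not the SemRep Summarization implementation: it assumes that each citation is already associated with its summary predications as (subject, predicate, object) triples, and it scores a citation only by how often predication arguments match query concepts, ignoring how closely they match. The function name, citation identifiers, and predications are hypothetical.

```python
def rerank_by_predication_overlap(citations, query_concepts):
    """Order citations by how many predication arguments match query concepts.

    `citations` maps a citation id to a list of (subject, predicate, object)
    triples; `query_concepts` is a set of concept names found in the query.
    """
    def score(citation_id):
        return sum(
            (subj in query_concepts) + (obj in query_concepts)
            for subj, _predicate, obj in citations[citation_id]
        )

    # sorted() is stable, so citations with equal scores keep their input order.
    return sorted(citations, key=lambda citation_id: -score(citation_id))


# Hypothetical predications for three retrieved citations.
citations = {
    "c1": [("Alendronate", "TREATS", "Osteoporosis")],
    "c2": [("Aspirin", "TREATS", "Headache")],
    "c3": [
        ("Alendronate", "TREATS", "Osteoporosis"),
        ("Raloxifene", "TREATS", "Osteoporosis"),
    ],
}
reranked = rerank_by_predication_overlap(citations, {"Osteoporosis"})
print(reranked)
```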

The CQA-1.0 reranking algorithm promotes citations in which the automatically identified problems and interventions match those in the question, patient-oriented outcomes are identified with strong confidence, the task matches that of the question, the study population is large, and the strength of evidence is high.50

In searching with Essie, a strategy similar to PubMed clinical queries was applied, using EBM-related and therapy-related terms (such as therapeutic use and clinical trial). Unlike the clinical queries filters, this strategy promotes EBM-oriented citations without reducing the number of retrieved citations. Essie core document ranking promotes citations that contain query phrases in the fields observed to be most informative, for example in the title.51 To take advantage of UMLS synonymy, UMLS-based expansion of concepts was used in the search. Essie returned 2,500 citations in the first set, 896 in the second, and 673 in the third.

Fusion of Results

In addition to evaluation of individual systems, the ranked results generated by each system were merged using fusion. Fusion was based on the rank order assigned to a document by each system, rather than on scores. This is because the systems either do not score documents or generate scores for ranking purposes only (that is, scores represent neither the similarity of a citation and the query nor the system's confidence in the relevance of a citation to the query). This approach relies on document overlap, which for SemRep, CQA-1.0, and the baseline PubMed retrieval constitutes the whole result set. The results were merged using the fusion approach proposed by Fox et al.52 The contribution of each system to the final ranking was weighted equally.
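A minimal sketch of rank-based fusion with equal weights follows. The exact combination formula used in the study is not given in the text, so the sketch makes an assumption: each system's rank is converted to a normalized score (1.0 for the top document, 1/n for the last of n), and the scores are summed across systems in the manner of a sum-style combination. The function name and document identifiers are hypothetical.

```python
from collections import defaultdict


def fuse_by_rank(runs, weights=None):
    """Merge ranked lists using rank-derived scores (assumed formula).

    `runs` maps a system name to its ranked list of document identifiers.
    `weights` optionally maps a system name to its weight; by default every
    system contributes equally.
    """
    scores = defaultdict(float)
    for name, ranked_ids in runs.items():
        weight = 1.0 if weights is None else weights[name]
        n = len(ranked_ids)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Rank 1 scores 1.0 and the last rank scores 1/n.
            scores[doc_id] += weight * (n - rank + 1) / n
    # Highest fused score first; ties are broken by document identifier.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings of the same three documents by three systems.
runs = {
    "system_a": ["d1", "d2", "d3"],
    "system_b": ["d2", "d1", "d3"],
    "system_c": ["d2", "d3", "d1"],
}
fused = fuse_by_rank(runs)
print(fused)
```

Working from ranks rather than scores keeps the systems comparable even when one of them produces no scores at all, which is the motivation stated above.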


Five sets of output were evaluated as part of this study: the ranked output from each of the systems under consideration, the fused output from all three, and unprocessed PubMed output (baseline). The trec_eval-8.0 package53 was used to evaluate the results. The systems were evaluated under two conditions: strict, considering only citations graded A in the three-point scale evaluation to be relevant, and soft, considering both three-point scale A-grade and B-grade citations relevant to the question. Because the relative ranking of the systems with respect to the baseline is identical under both conditions, we present and discuss the results of the soft evaluation. The differences in retrieval results between systems were compared using a Wilcoxon signed ranks test for all metrics. p values <0.05 were considered significant. The Wilcoxon signed ranks test is used when the values in the two results being compared are naturally paired (for example, the same set of documents is ranked by two systems) and the relative magnitude as well as the direction of the differences is considered.54
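The strict and soft conditions amount to two ways of collapsing the three-point scale into binary relevance. A minimal sketch, with a hypothetical function name and hypothetical citation identifiers:

```python
def relevant_ids(grades, condition="soft"):
    """Return the set of citation ids counted as relevant.

    `grades` maps a citation id to its rating: "A", "B", or "C".
    strict: only A-grade citations are relevant.
    soft:   A-grade and B-grade citations are relevant.
    """
    if condition == "strict":
        accepted = {"A"}
    elif condition == "soft":
        accepted = {"A", "B"}
    else:
        raise ValueError(f"unknown condition: {condition!r}")
    return {doc_id for doc_id, grade in grades.items() if grade in accepted}


# Hypothetical judgments for three citations.
grades = {"pmid_1": "A", "pmid_2": "B", "pmid_3": "C"}
strict_set = relevant_ids(grades, "strict")
soft_set = relevant_ids(grades, "soft")
print(sorted(strict_set), sorted(soft_set))
```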

Two classes of evaluation metrics were used to account for two different information needs experienced by clinicians, one general and the other focused. The first type of information need is reflected in our general questions and corresponds to a situation in which a clinician might need an overview of a topic. In this scenario, a clinician would be interested in both precision (the percentage of the retrieved citations that are relevant) and recall (the percentage of the relevant documents that are retrieved). Evaluation metrics that reflect this need are:

  1. Mean Average Precision (MAP): For multiple topics, it is the mean of the average precision scores for each of the topics. The average precision score for a single topic is computed by averaging the precision after each relevant document is retrieved.16 This metric has recall and precision components and is widely accepted in information retrieval as reflecting the level of performance a user should expect for a new topic retrieved using a system that achieves a given MAP value.

  2. Binary Preference (Bpref): A preference-based measure that depends on the number of documents that were judged as nonrelevant that were retrieved with higher rank than relevant documents. This distinguishes Bpref from MAP, which is determined by the ranks of the relevant documents in the result set and makes no distinction between documents explicitly judged as not relevant and documents that are not judged.55 This measure is reported to be more stable than MAP with incomplete judgments, which is probably the case for the pilot studies presented below.

  3. R precision: This measures precision after R documents have been retrieved, where R is the total number of relevant documents for a query.
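The three metrics above can be sketched as follows. This is an illustrative implementation rather than the trec_eval code: in particular, the Bpref function assumes the variant that divides by the smaller of the number of relevant and the number of judged nonrelevant documents, and the handling of edge cases in trec_eval may differ. All function names and document identifiers are hypothetical.

```python
def average_precision(ranked_ids, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of the per-topic average precision scores."""
    return sum(average_precision(runs[t], qrels[t]) for t in runs) / len(runs)


def r_precision(ranked_ids, relevant):
    """Precision after R documents, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant) / r


def bpref(ranked_ids, relevant, nonrelevant):
    """Penalize relevant documents ranked below judged nonrelevant ones.

    Unjudged documents are ignored. Assumed variant: denominator min(R, N).
    """
    r = len(relevant)
    if r == 0:
        return 0.0
    bound = min(r, len(nonrelevant))
    nonrelevant_seen = 0
    total = 0.0
    for doc_id in ranked_ids:
        if doc_id in nonrelevant:
            nonrelevant_seen += 1
        elif doc_id in relevant:
            if bound == 0:
                total += 1.0
            else:
                total += 1.0 - min(nonrelevant_seen, bound) / bound
    return total / r


# Hypothetical topic: d6 is relevant but not retrieved, d5 is unjudged.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
nonrelevant = {"d2", "d4"}
ap = average_precision(ranked, relevant)
rp = r_precision(ranked, relevant)
bp = bpref(ranked, relevant, nonrelevant)
print(round(ap, 4), round(rp, 4), round(bp, 4))
```

In this toy topic the unjudged document d5 has no effect on Bpref, which illustrates why the measure is considered more robust when judgments are incomplete.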

The second type of information need experienced by clinicians corresponds to a situation in which an exact answer to a well-focused question is required (reflected in our specific questions). Because clinicians are willing to spend no more than 4 to 5 minutes evaluating search results,56 it is important that the answer to the question be found in the first few citations retrieved. Metrics that evaluate how soon a user will see the answer and how many relevant citations are at the top of the retrieval results list are:

  1. Precision at five retrieved documents (P@5): Measures the fraction of relevant documents in the top five documents retrieved.

  2. Precision at 10 retrieved documents (P@10): Measures the fraction of relevant documents in the top 10 documents retrieved.

  3. Mean Reciprocal Rank (MRR): The metric used in TREC question answering evaluation.57 It quantifies the expected search length and is computed as the mean of the individual questions' reciprocal ranks. The reciprocal rank for a question is the reciprocal of the rank at which the first relevant document was found.
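The metrics for a focused information need can be sketched in a few lines. Function names, question identifiers, and document identifiers are hypothetical; a question with no relevant document retrieved is assumed to contribute a reciprocal rank of zero.

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def reciprocal_rank(ranked_ids, relevant):
    """Reciprocal of the rank of the first relevant document (0.0 if none)."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels):
    """Mean of the reciprocal ranks over all questions."""
    return sum(reciprocal_rank(runs[q], qrels[q]) for q in runs) / len(runs)


# Hypothetical results for two questions.
runs = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["v", "w", "x", "y", "z"],
}
qrels = {"q1": {"a", "c", "d"}, "q2": {"w"}}
p5 = precision_at_k(runs["q1"], qrels["q1"], 5)
mrr = mean_reciprocal_rank(runs, qrels)
print(p5, mrr)
```

Here the first question has three relevant documents in the top five (P@5 = 0.6), and the first relevant documents appear at ranks 1 and 2, giving an MRR of 0.75.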


Table 2 summarizes the results of the exploration of the differences between the three experimental approaches to document ranking for clinical question answering. PubMed results are used as a reference point to provide a comparison of the experimental retrieval approaches with the state-of-the-art baseline (which includes the clinical queries filters). The table also presents the fusion results for the experimental systems. The best results for individual systems and the best fused results are shown in bold.

Table 2

Results for All Questions


Overall (Table 2), CQA-1.0 performs best with respect to the baseline. Fusion of the three systems also performs well overall and outperforms CQA-1.0 for general questions.

In Tables 3 through 5, results are presented categorized by the complexity of the question and from the point of view of how well evaluated systems perform in response to general versus focused information needs. For general questions (Table 3), there is no single trend discernible. As noted, MAP, Bpref, and R-prec are likely to be most valuable for evaluating general questions as expressing a general information need. Essie and CQA-1.0 significantly outperformed PubMed according to MAP, but not Bpref. Fusion does well for Bpref.

Table 3

Results for General Questions


The baseline PubMed performance for intermediate questions (Table 4) was significantly better than for the general questions. The experimental approaches did not significantly improve on the baseline for these questions, according to the measures reflecting a general information need (MAP, Bpref, and R-prec). SemRep Summarization and CQA-1.0 did better on the measures reflecting a more focused information need (SemRep Summarization for MRR and P@5, CQA-1.0 for P@10).

Table 4

Results for Intermediate Questions


The baseline is higher for specific questions; however, the experimental approaches apparently benefited from the additional details provided in the complex questions (Table 5). The CQA-1.0 system, specifically designed to handle questions in the EBM-recommended form, benefited most among individual systems, scoring particularly well on MRR, P@5, and P@10. Fusion also does well on these measures in response to a focused information need. CQA-1.0 also did well on the complex questions according to MAP (0.6286), although the difference between CQA-1.0 and Essie is not statistically significant. Fusion of the results for the three systems (MAP = 0.7839) is particularly successful for this class of questions.

Table 5

Results for Specific (Complex) Questions

      Fusion all   0.6677   0.6572   0.5775   0.9000   0.8800   0.8200
      Fusion 3     0.7839   0.8001   0.6866   1.0000   0.9200   0.8400

In terms of finding answers to specific questions, all experimental methods were successful in promoting relevant documents to the higher ranks, achieving MRR from 0.86 to 0.96 (Figure 1), 79% to 85% precision at five retrieved documents, and 69% to 87% precision at 10 documents, meaning that three to four of the first five documents retrieved by the evaluated systems (and six to eight of the first 10) provide information that potentially or definitely leads to answers to a clinical question.
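The step from precision values to document counts is simple arithmetic: a precision of p at k retrieved documents corresponds, on average, to p × k relevant documents. The short sketch below applies this to the ranges quoted above; the helper name is hypothetical and the figures are the ones given in the text.

```python
def expected_relevant(precision: float, k: int) -> float:
    """Average number of relevant documents among the top k, given P@k."""
    return precision * k


# Precision ranges quoted in the text for specific questions.
ranges = {
    5: (0.79, 0.85),   # precision at five retrieved documents
    10: (0.69, 0.87),  # precision at 10 retrieved documents
}

counts = {
    k: tuple(round(expected_relevant(p, k), 2) for p in bounds)
    for k, bounds in ranges.items()
}

for k, (low, high) in counts.items():
    print(f"top {k}: {low} to {high} relevant documents on average")
```

These averages come to roughly 3.95 to 4.25 relevant documents in the first five and 6.9 to 8.7 in the first 10.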

Figure 1

Mean reciprocal rank of the first relevant document retrieved by each method.


Because of the size of the pool and the number of questions, the results of this exploration are promising but not definitive. For 15 questions, the CQA-1.0 improvement over PubMed is statistically significant (p < 0.01), and so is the improvement of the fused results of SemRep, Essie, and CQA-1.0 over the individual systems and the baseline. Mean average precision is not always improved by semantic reranking; that is, well-formed PubMed queries provide respectable recall and precision, and thus a good overview of the information landscape for a given topic. Semantic reranking, however, improves the rate of finding answers to specific questions.

External Knowledge

The three systems evaluated rely on UMLS domain knowledge to manipulate semantic content in MEDLINE citations. Such content includes: (1) the number of subjects, (2) comparison of multiple therapies, (3) placebo control, and (4) comparative cost of interventions. Previous research49 has identified nonsemantic characteristics of articles as being important in identifying key articles. These include methodological rigor, authors and their institutional affiliations, document types, and population studied. Our research suggests that such cues, which are used in Essie and CQA-1.0 but not in SemRep, contribute to performance. Judging by the reciprocal rank of the top retrieved document, and precision at five and 10 documents, semantic reranking is necessary when a clinician is interested in (or has time for) only the first few citations. However, using Essie might preclude the need for reranking for general and intermediate questions.

Yet another type of key element identified in this study requires external knowledge in addition to semantic processing. These characteristics include: (1) availability of a therapy for the local practitioner community (e.g., approval by the U.S. Food and Drug Administration or availability in a community environment) and (2) applicability of the study results more generally, for example, extending the results of a clinical trial conducted in a subpopulation to the population of interest.

Notes taken during evaluation identified additional nonsemantic criteria used to assess usefulness of the citation to a clinician. The rater (who considers himself a typical primary care physician1) evaluated the utility of a citation and used nontopical cues present in the citation, as well as “world knowledge.” For example, for the query, “what is the most effective treatment for ADHD in children?” a citation entitled “Attention-deficit hyperactivity disorder in children and youth: a quantitative systematic review of the efficacy of different management strategies” was judged as A grade (leads to an answer; definitely useful in clinical decision making for the question) with the assumption that a systematic review was exhaustive of the published literature for efficacy. In contrast, for the query, “What is the best antiviral agent for influenza infection?” a citation “Efficacy and safety of oseltamivir in treatment of acute influenza: a randomized control trial” was judged as A grade even though the comparisons were to placebo only. The rater believes that a citation with comparisons to various treatment methods is unlikely to appear in the primary literature.

Citations that were judged to be B grade (not sufficient to answer the query but helpful in medical decision making) also need to be qualified by an understanding that the rater assumed the knowledge level of the typical primary care physician. Thus for the query, “What is the best treatment for gastroesophageal reflux and vomiting in infants?” a citation not related to therapy entitled “The infant with chronic vomiting: the value of the upper GI series” was retrieved by the probabilistic search method. It was rated B because the rater thought that most primary care physicians might not know that “in a study of 344 otherwise healthy infants referred to pediatric gastroenterologists for chronic vomiting findings other than gastroesophageal reflux were seen in only 2 patients (0.6%)” and that knowledge might influence a best-therapy decision.

Citations that were judged to be C grade (not helpful in answering the clinical query) also involved some assumptions regarding utility to the decision maker. For the query, “In children with acute vomiting and diarrhea (gastroenteritis), does treatment with intravenous fluids improve recovery compared with oral rehydration therapy (ORT)?” a citation entitled “Ondansetron decreases vomiting associated with acute gastroenteritis: a randomized control trial” was rated C because the study population reported included only children who had been assigned to intravenous fluid therapy. The information may have been new and helpful in treatment of the disorder, but was not helpful in the decision called for in the query.


As previously mentioned, the results of our exploration should be interpreted taking into account several limitations, including the modest size of the pool of judged documents and the number of questions. In addition, our findings pertain to answering therapy questions only. Answering questions about other clinical tasks could provide additional insights. Our evaluation is based on one expert opinion. Although our evaluator is a residency-trained and board-certified practicing family physician, his opinions most probably differ somewhat from the opinions of other family doctors. Because of the exploratory nature of this investigation, we did not evaluate the spectrum of opinions of family practitioners regarding the relevancy of the citations in our test collection.

Implications and Future Work

This study presents some evidence showing that the burden of overcoming several of the major obstacles1 in practicing evidence-based medicine could be alleviated by integrating into information retrieval systems the domain knowledge in the UMLS and the EBM principles. Unless connected to an electronic patient record, automatic methods cannot be used for the initial step of formulating an information need nor (under any circumstances) for the final steps of appraising the evidence and making a clinical decision. However, automatic methods could address the challenging task of determining an optimal search strategy. A system might first provide the clinician with a pick list for selecting question type, for example, an overview of best available treatments for a given condition. The system could then use a predetermined optimal search strategy for the question type chosen.

Our study suggests several areas for further exploration. We are currently developing question templates for submitting therapy questions to our systems. We plan to expand these to accommodate other types of clinical questions, including those involving diagnosis, prognosis, and cost effectiveness. Uncertainty about finding all relevant evidence could be mitigated by using optimal recall-oriented strategies. Subsequently, the difficulty of synthesizing and appraising all evidence found could be addressed by presenting aggregated search results to the clinician (using SemRep summaries and patient-oriented outcomes extracted by CQA-1.0, for example).


We investigated three knowledge-based systems for assisting clinicians in finding answers to questions in MEDLINE. Although the number and range of clinical queries and citations retrieved are too small for any definitive conclusion, it seems that semantic processing alone may be less helpful in finding relevant citations than the hybrid approach of combining topical, semantically identified factors with nonsubject features associated with MEDLINE citations. The Essie search engine performed significantly better than the baseline overall for general searches of therapy for disease. The CQA-1.0 clinical question answering system performed significantly better than the baseline for complex queries involving population groups, outcomes, and comparison of interventions. A fusion of the three approaches (SemRep, Essie, and CQA-1.0) outperformed the baseline and each approach taken separately for all types of questions. The significance of any of these methods to point-of-care decision support remains unknown, but the growing ability to postprocess MEDLINE citations, which enables increasingly sophisticated methods of ranking retrieval results for the clinician, is promising.


  • Supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine, Bethesda, Maryland.

  • 1 See American Academy of Family Physicians Policy and Advocacy58 for one definition of a typical family doctor.

