★ Model Formulation ★

A Document Clustering and Ranking System for Exploring MEDLINE Citations

Yongjing Lin , Wenyuan Li , Keke Chen , Ying Liu
DOI: http://dx.doi.org/10.1197/jamia.M2215 651-661 First published online: 1 September 2007


Objective: A major problem faced in biomedical informatics involves how best to present information retrieval results. When a single query retrieves many results, simply showing them as a long list often provides poor overview. With a goal of presenting users with reduced sets of relevant citations, this study developed an approach that retrieved and organized MEDLINE citations into different topical groups and prioritized important citations in each group.

Design: A text mining system framework for automatic document clustering and ranking organized MEDLINE citations following simple PubMed queries. The system grouped the retrieved citations, ranked the citations in each cluster, and generated a set of keywords and MeSH terms to describe the common theme of each cluster.

Measurements: Several possible ranking functions were compared, including citation count per year (CCPY), citation count (CC), and journal impact factor (JIF). We evaluated this framework by identifying as “important” those articles selected by the Society of Surgical Oncology.

Results: Our results showed that CCPY outperforms CC and JIF, i.e., CCPY better ranked important articles than did the others. Furthermore, our text clustering and knowledge extraction strategy grouped the retrieval results into informative clusters as revealed by the keywords and MeSH terms extracted from the documents in each cluster.

Conclusions: The text mining system studied effectively integrated text clustering, text summarization, and text ranking and organized MEDLINE retrieval results into different topical groups.


MEDLINE is a major biomedical literature database supported by the U.S. National Library of Medicine (NLM). It now holds more than 15 million citations in biology and medicine, and thousands of new citations are added every day.1 Researchers can no longer keep up-to-date with all the relevant literature manually, even for specialized topics. As a result, information retrieval tools play essential roles in enabling researchers to find and access relevant papers.2 Frequently, biomedical researchers query the MEDLINE database and retrieve lists of citations based on given keywords. PubMed, an information retrieval tool, is one of the most widely used interfaces to the MEDLINE database. It allows Boolean queries based on combinations of keywords and returns all citations matching the queries. Many advanced retrieval methods, such as GoPubMed3 and Textpresso,4 also use natural language processing methods (e.g., entity recognition and part-of-speech tagging) to better identify documents relevant to a query.2 Even with these improvements, significant challenges remain to the efficient and effective use of ad hoc information retrieval systems such as PubMed.2

Information Retrieval

Information retrieval methods attempt to identify, within large text collections, the specific text segments (such as full text articles, their abstracts, or individual paragraphs or sentences) whose content pertains to certain specified topics or to users' expressed needs.2, 5 Such topics or needs are often stated in user-defined queries. Information retrieval systems typically employ one of two popular methodologies: the Boolean model and the vector model. The Boolean model, used by virtually all commercial information retrieval systems, relies on Boolean logical operators and classical set theory. Documents searched and user queries both comprise sets of terms, and retrieval occurs when documents contain the query terms. The vector model, on the other hand, represents each document as a vector of index terms (such as keywords). The set of terms is predefined, for example, as the set of all unique words occurring across all documents in the overall corpus. A weighting scheme, such as term frequency-inverse document frequency (TFIDF), assigns a value to each term occurring in each document.6 A similarity metric determines how well a document matches a query. It can be calculated, for example, from the angle between each document vector and the query vector, where the query is represented as the same kind of vector as the documents.7
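The contrast between the two models can be made concrete with a small sketch. The toy corpus, the query strings, and the tf × log(N/df) weighting below are illustrative assumptions, not data or settings from this study; real systems differ in tokenization and weighting details.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = {
    "d1": "brca1 mutation breast cancer risk",
    "d2": "tamoxifen therapy breast cancer",
    "d3": "mammography screening program",
}

def boolean_and(query, docs):
    """Boolean model: ids of documents that contain every query term."""
    terms = set(query.split())
    return sorted(d for d, text in docs.items() if terms <= set(text.split()))

def idf_table(docs):
    """Inverse document frequency log(N / df) for every term in the corpus."""
    n = len(docs)
    df = Counter(t for text in docs.values() for t in set(text.split()))
    return {t: math.log(n / df[t]) for t in df}

def weight(text, idf):
    """Sparse TFIDF vector; terms unknown to the corpus are ignored."""
    tf = Counter(text.split())
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Vector model: document ids ordered by decreasing cosine similarity."""
    idf = idf_table(docs)
    q = weight(query, idf)
    scores = {d: cosine(q, weight(text, idf)) for d, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

The Boolean function returns an unordered match set, while the vector function yields a graded ordering in which partial matches still receive a score.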

Challenges for PubMed Information Retrieval

The goal of PubMed, like all other search engines, is to retrieve citations considered relevant to a user query. Modern search engine developers have devoted great effort to optimizing retrieval result rankings, hoping to place the most relevant ones at the top of the ranking list. Nevertheless, no ranking solution is perfect, due to the inherent complexity of ranking search results.8 One aspect of this complexity derives from widely different possible query types. Narrow topic queries, or specific topic queries, retrieve relatively small numbers of citations from MEDLINE. For example, for the query “BRCA”, PubMed returns 722 citations.9 On the other hand, broad topic queries, or general topic queries, return large numbers of citations (thousands or more) from MEDLINE. For example, for the query “Breast cancer”, PubMed returns 96,292 citations.9 Manually reading, summarizing, or organizing such large numbers of articles is overwhelming. The vast majority of MEDLINE users show little patience with large retrieval sets from broad topic queries. Such users commonly browse through the first screen or even the first ten results hoping to find the right answers for their queries.10 PubMed provides some sorting mechanisms to rank citations, such as sorting by publication date, author name, or journal. Furthermore, a pre-calculated set of PubMed citations that are closely related to a user-selected article can also be retrieved. The related articles are displayed in ranked order from most to least relevant, with the “linked from” citation displayed first.11 However, the “Related Articles” are citations related to a selected MEDLINE citation, and may or may not be relevant to a user's original query. Therefore, helping users identify papers of interest more easily and quickly will require more advanced rankings and additional information about the relevance of citations to a submitted query.

Another difficulty in ranking search results is that the relevance of a citation is a subjective concept. In fact, the same set of keywords may abstract different user needs according to the context in which the user formulates the query.8, 10 For example, both a researcher interested in finding genetic-study related papers and another researcher interested in finding the latest cancer treatments might issue a MEDLINE query “Breast cancer.” Despite the differences in their initial interests, both researchers might seek out papers on breast cancer recurrences. PubMed provides some filtering functions to reduce the number of citations retrieved. However, even advanced PubMed queries with Boolean logic cannot always properly structure the search results.3

In a recent review paper, Jensen et al. (2006)2 indicated that presenting information retrieval results to users constitutes a sentinel problem in biomedical literature retrieval. When a single query retrieves a large number of citations, simply showing them as a long list often provides a poor overview. Furthermore, personal efforts and experience are necessary to extract the desirable biological knowledge from the retrieved literature. Finding elegant and accurate ways to extract the desired information could help users, particularly novice users, select and analyze a focused, reduced, relevant set of citations.12, 13

Using Clustering and Ranking to Boost Information Retrieval

Biologists urgently require efficient systems to help them find the most relevant and important articles from the expanding biological literature.12, 14, 15 One approach to this dilemma is to apply two post-processing techniques, clustering and ranking, to organize the retrieved documents into different topical groups based on semantic information. In this way, users can select, analyze, and focus on only a reduced set of citations in one or more topical groups of interest. Several projects have developed effective and efficient clustering technologies for Web search result organization.16–19 Vivisimo20 and Eigencluster21 are working and available demonstrations of search-and-cluster engines. However, how to use both clustering and ranking technologies to improve the search result presentation has not been well studied.10

Next-generation information retrieval tools should take advantage of clustering and ranking technologies.10 Given a set of documents as input, clustering techniques group them into subsets based on similarity. Not only is clustering useful when applied in the presence of broad queries, but it also can improve the search experience by labeling the clusters with meaningful keywords or sentences8—a very useful alternative to a long, flat list of search results. Therefore, clustering can boost user queries by extracting and displaying hidden knowledge from retrieved texts. Ranking is a process which estimates the quality of a set of results retrieved by a search engine. Traditional information retrieval has developed Boolean, probabilistic, and vector-space models for ranking retrieved documents based on their contents.

Clustering and ranking are closely related, but few studies have deeply explored this relationship for biomedical literature retrieval. Some claim that clustering and ranking form a mutually reinforcing relationship. A good ranking strategy can provide a valuable information basis for clustering, and conversely, a good clustering strategy can help to rank the retrieved results by emphasizing hidden knowledge content not captured by traditional text-based analyses. In addition, clustering algorithms can be used to extract, on the user's behalf, knowledge which goes beyond the traditional flat list.10

Text Mining Systems to Improve PubMed Retrieval

Clustering of MEDLINE abstracts has been studied for gene function analysis22–27 and concept discovery.28, 29 A few systems have been proposed to present PubMed retrieval results in a user-friendly way other than a long list, most of which are based on pre-defined categories. GoPubMed3 categorizes PubMed query results according to Gene Ontology (GO) terms. Textpresso4 is an information retrieval system that operates on a collection of full text papers on Caenorhabditis elegans. It classifies the papers into about 30 high-level categories, some of which are derived from GO. Another system, XplorMed,30 maps PubMed query results to eight main Medical Subject Headings (MeSH) categories and extracts keywords and their co-occurrences. All such past systems have focused on grouping PubMed results into predefined categories, using classification techniques in which an a priori taxonomy of categories is available, rather than clustering techniques. Clustering differs from classification in that the categories are part of the discovered output, rather than predefined at input. When no pre-imposed classification scheme is available, automatic clustering may critically benefit users by organizing large retrieval sets into browsable groups.10 By comparison, although current systems can classify search results into pre-defined categories, within each category, PubMed results still consist of long lists without importance-related ranking. The need exists for a system to help biomedical researchers quickly find relevant, important articles related to their research fields.

This paper describes a text mining system that automatically clusters PubMed query results into various groups where each group contains relevant articles, extracts the common topic for each group, and ranks the articles in each group. To the authors' knowledge, it is one of the first systems that integrates several text mining techniques, namely, text clustering, text summarization, and text ranking. The conceptual clustering component of the system takes an inductive machine learning approach.31, 32 There are two steps in conceptual clustering: the first is an aggregation phase, which clusters documents into different groups, while the second is a characterization phase, which obtains the description of each cluster. The proposed system applies text clustering for the first phase, with text summarization and text ranking for the second phase.

System Framework

Figure 1 provides a high-level overview of the system, which proceeds through five phases: a) query submission and document retrieval; b) preprocessing the retrieved documents; c) determining the number of clusters and partitioning the document set into clusters; d) extracting the common topic for each cluster; e) identifying the most important and relevant articles in each cluster.

Figure 1

Overview of the system.

Query Submission and Document Retrieval

The system starts with submission of a query to the PubMed website. The PubMed search queries used in the current study are the 10 PubMed queries provided by Bernstam et al. (2006)12 without field restrictions (see Online Supplemental Materials at www.jamia.org). The retrieved documents (abstracts) from each query are stored as single XML-format files. Since each retrieved PubMed document comprises one abstract, the authors use the words document, abstract, and article interchangeably in this manuscript. Each XML file is then parsed, with the title, abstract, and MeSH term fields retained for further analysis. If a MEDLINE record does not have an abstract, but has an “otherabstract,”33 then the “otherabstract” was used as the study's version of the abstract. Lacking both, the title of the record was analyzed.
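The field selection and fallback rule described above might look like the following minimal sketch. It assumes the stored files follow the element names of the public PubMed XML format (MedlineCitation, ArticleTitle, AbstractText, OtherAbstract, DescriptorName); the sample records and the function name parse_records are hypothetical, and the study's own parser is not shown in the text.

```python
import xml.etree.ElementTree as ET

# Three made-up records: one with an abstract, one with only an
# OtherAbstract, and one with neither (title-only fallback).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation>
    <PMID>1</PMID>
    <Article>
      <ArticleTitle>Title one</ArticleTitle>
      <Abstract><AbstractText>Main abstract.</AbstractText></Abstract>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName>Breast Neoplasms</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>2</PMID>
    <Article><ArticleTitle>Title two</ArticleTitle></Article>
    <OtherAbstract><AbstractText>Other abstract.</AbstractText></OtherAbstract>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation>
    <PMID>3</PMID>
    <Article><ArticleTitle>Title three</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

def _joined_text(elements):
    """Concatenate the text of several elements, including nested markup."""
    return " ".join("".join(e.itertext()).strip() for e in elements).strip()

def parse_records(xml_text):
    """Keep title, text, and MeSH terms; fall back abstract -> otherabstract -> title."""
    records = []
    for cit in ET.fromstring(xml_text).iter("MedlineCitation"):
        title = cit.findtext("Article/ArticleTitle", default="")
        abstract = _joined_text(cit.findall("Article/Abstract/AbstractText"))
        other = _joined_text(cit.findall("OtherAbstract/AbstractText"))
        mesh = [e.text for e in cit.findall("MeshHeadingList/MeshHeading/DescriptorName")]
        records.append({
            "pmid": cit.findtext("PMID"),
            "title": title,
            "text": abstract or other or title,
            "mesh": mesh,
        })
    return records
```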


The preprocessing phase plays a critical role in the subsequent clustering and concept extraction steps. Abstracts were broken during preprocessing into tokens which, in this paper, mean single words (or terms). Word stemming truncated suffixes so that words having the same root (e.g., activate, activates, and activating) collapse to the same single word for frequency counting. Our work applied the Porter stemmer for this task.34 Stop lists were used to filter out non-scientific English words. We developed a stop list based on an online dictionary of 22,205 words.25
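A minimal sketch of these preprocessing steps follows. It is not the study's implementation: the stop list here is a tiny hypothetical stand-in for the dictionary-based list, and crude_stem is a deliberately simple suffix stripper used in place of the Porter stemmer so that the example stays self-contained.

```python
import re

# Tiny illustrative stop list (assumption); the study used a far larger one.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "that", "to"}

# Suffixes tried in order; a stand-in for the Porter stemmer's rule set.
SUFFIXES = ("ing", "es", "ed", "s", "e")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split into single-word tokens, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, "activate", "activates", and "activating" all collapse to the same token, which is the property the frequency counting relies on. The Porter stemmer would produce a different stem string but the same merging effect.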

The standard term frequency-inverse document frequency (TFIDF) function was used6 to assign weights to each word in each document. Then each document was modeled as an N-dimensional TFIDF vector, where N is the number of distinct words in all of the abstracts. Formally, a document was a vector (tfidf_1, tfidf_2, …, tfidf_N), where tfidf_i is the TFIDF value of word i. Then a document-by-word matrix was built, in which each row represented a document, and each column represented a word. The values in the matrix are the TFIDF values. If a word did not appear in a document, then zero appeared in the corresponding cell in the matrix (Figure 2a).26
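The construction of one TFIDF vector per document can be sketched as follows. The text does not state which TFIDF variant was used, so the common tf × log(N/df) form is assumed here, and the token lists are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(token_lists):
    """Return (vocabulary, vectors): one dense TFIDF vector per document.

    Assumes the tf * log(N / df) weighting; other variants exist.
    """
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each word.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)  # missing words count as zero
        vectors.append([tf[t] * idf[t] for t in vocab])
    return vocab, vectors

# Hypothetical preprocessed documents.
example_docs = [
    ["brca1", "mutat", "brca1"],
    ["tamoxifen", "mutat"],
    ["tamoxifen", "estrogen"],
]
```

In practice most cells are zero, so a sparse representation would be preferable for a corpus of tens of thousands of abstracts; the dense lists here only keep the sketch short.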

Figure 2

The representation of the documents. (A) Document-by-word matrix. The values in the matrix are the term frequency-inverse document frequency (TFIDF) values. (B) Normalized document-by-word matrix. The document-by-word matrix is normalized using cosine normalization. (C) Bipartite graph representation of the normalized document-by-word matrix.

Text Clustering

The document-by-word matrix was normalized using cosine normalization26 (Figure 2b) and then used as input for the clustering step. Document clustering has been widely studied in the text mining research area. Common methods take one of two approaches: document partitioning and hierarchical clustering. Hierarchical clustering methods organize the document set into a hierarchical tree structure, with clusters in each layer.35 However, the clusters do not necessarily correspond to a meaningful grouping of the document set.36 By contrast, partitioning methods can produce clusters of documents that are better than those produced by hierarchical clustering methods. Comparative studies have shown that partitioning algorithms outperform hierarchical clustering algorithms, and suggested that partitioning algorithms should be well-suited for clustering large document datasets, owing not only to their relatively low computational requirements, but also to comparable or even better clustering performance.37 Therefore, the current study employed partitioning algorithms for clustering PubMed query results. One major drawback of partitioning algorithms is that they require prior knowledge of the number of clusters in a given data set. In our system, we applied our recently proposed algorithm, Spectroscopy,38, 39 to estimate the number of clusters in a document set.
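The cosine normalization applied before clustering can be sketched in a few lines. This is a generic illustration that assumes each document is held as a plain list of TFIDF weights; it scales every document vector to unit Euclidean length, so that the dot product of two normalized vectors equals their cosine similarity.

```python
import math

def cosine_normalize(vectors):
    """Scale each vector to unit length; all-zero vectors are left unchanged."""
    normalized = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        normalized.append([x / norm for x in vec] if norm > 0 else list(vec))
    return normalized

def dot(u, v):
    """Dot product; for unit vectors this is the cosine similarity."""
    return sum(a * b for a, b in zip(u, v))
```

Normalization removes the effect of document length, so a long abstract does not dominate a short one merely because its raw term weights are larger.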

Spectroscopy: Spectroscopy is a novel algorithm which can effectively predict the clustering characteristics of a text collection before the actual clustering algorithm is performed.38, 39 It applies the techniques of spectral graph theory to data sets by investigating only a small portion of the eigenvalues of the data. Since spectral techniques have been well-studied and constitute a mature field in computation, there are a number of applicable efficient computational methods. Particularly in the case where we are only interested in a small number of eigenvalues and the term-document text data is rather sparse, numerical computation software such as LANSO40 and ARPACK41 can obtain results efficiently. (Please see JAMIA online data supplement at www.jamia.org for listing of pseudo-code of the spectroscopy algorithm.)

Document Clustering: Once the number of clusters is estimated, we apply the CLUTO42 software to cluster a set of documents. We use the bisecting K-means technique because it performs better than the standard K-means approach.43 CLUTO is a software package for clustering low and high dimensional data sets and for analyzing the characteristics of various clusters.36 CLUTO has been shown to produce high quality clustering solutions in high dimensional data sets, especially those arising in document clustering. It has been successfully used to cluster data sets in many diverse application areas including information retrieval, commercial data, scientific data, and biological applications.35
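To illustrate the general idea of bisecting K-means, the sketch below repeatedly splits one cluster in two until the requested number of clusters is reached. It is not CLUTO and does not reproduce its criterion functions: the choices to split the largest cluster, to use squared Euclidean distance, and to seed each split deterministically from two far-apart points are assumptions made to keep the example short and repeatable.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, iters=20):
    """Split points into two groups with a small 2-means loop."""
    center = mean(points)
    a = max(points, key=lambda p: dist2(p, center))  # far from the center
    b = max(points, key=lambda p: dist2(p, a))       # far from the first seed
    left, right = list(points), []
    for _ in range(iters):
        left = [p for p in points if dist2(p, a) <= dist2(p, b)]
        right = [p for p in points if dist2(p, a) > dist2(p, b)]
        if not left or not right:
            break
        new_a, new_b = mean(left), mean(right)
        if new_a == a and new_b == b:  # centroids stopped moving
            break
        a, b = new_a, new_b
    return left, right

def bisecting_kmeans(points, k):
    """Bisect the largest cluster until k clusters exist (or no split is possible)."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        left, right = two_means(biggest)
        if not left or not right:
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters

# Hypothetical 2-D points forming three well-separated groups.
example_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
                  (20, 0), (21, 0), (20, 1)]
```

The number of clusters k is an input here; in the system described above it would come from the Spectroscopy estimate.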

Topic Extraction

For a given cluster of documents, our system generates summary sentences, a set of informative keywords, and a set of key MeSH terms, which can be used to describe the topic of that cluster.

To extract a summary sentence, the system uses the multi-document summarization software MEAD, which generates summaries using cluster centroids produced by a topic detection and tracking system.44 Although MEAD can select sentences that are most likely to be relevant to the cluster topic, the summary sentences may not include all informative terms. They may therefore be unable to describe precisely the topic of a cluster containing a large number of articles. In order to help users understand the topic of a cluster easily, the system also provides a set of keywords and a set of key MeSH terms that are specific and highly descriptive for a given cluster of documents.

Our system adopted a method that represents the relation between term set and document set as a weighted graph, and uses link analysis techniques like HITS (Hyperlink-Induced Topic Search)45 to identify important terms. HITS, first proposed by Kleinberg,45 was originally used to rate Web pages for their “authority” and “hub” values. “Authority value” estimates the value of the content of a web page. “Hub value” estimates the value of a web page's links to other pages. The higher the authority value, the more important the web page is. The higher the hub value, the more connected the Web page is. These values can be used to rank Web pages. Authority and hub values are defined in terms of one another in a mutually recursive way. A page's authority value is computed as the sum of the scaled hub values of pages that point to it. A page's hub value is the sum of the scaled authority values of the pages it points to.46 Therefore, HITS detects high scoring hub and authority Web pages using a reinforcement principle. This principle states that a Web page is a good authority if it is pointed to by many good hubs and that a good hub page points to many good authorities. The algorithm constructs a graph of nodes representing Web pages and the edges between them (representing hyperlinks) and each node receives an authority score and a hub score.47

In order to extract informative keywords, a bipartite graph can be built between terms and documents as shown in Figure 2c. All keywords are represented as term nodes on the left-hand side of the bipartite graph, which have edges connecting to document nodes on the right-hand side of the bipartite graph (Figure 2c). “Authority” terms and “hub” documents can be discovered by the HITS algorithm. Then the reinforcement principle can be stated as “A term should have a high authority if it appears in many hub documents, while a document should have a high hub value if it contains many authority terms.”48 Therefore, it is reasonable to infer that documents containing many “authority” terms must be “hubs” and core documents, while those terms occurring in many “hub” and core documents must be “authority” and keywords. For a given cluster where documents are homogeneous and central to a topic, the HITS algorithm is effective in discovering keywords and core documents. It has been shown that the HITS algorithm is efficient enough for a Web search engine and therefore it is fast enough for the current setting.

Similarly, our approach also represents the relation between a MeSH term set and a document set as a weighted graph and applies the HITS algorithm to identify the important MeSH terms. (Please see the online supplemental materials at www.jamia.org for the pseudo-code of the project's HITS implementation.)
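The reinforcement principle on a term-document bipartite graph can be sketched as a simple power iteration. This is a generic HITS illustration, not the project's implementation (which is given only in the online supplement); the edge weights, the fixed iteration count, and the function name hits_terms_docs are assumptions.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length so values stay bounded."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()} if norm else scores

def hits_terms_docs(weights, iters=50):
    """weights maps (term, doc) -> edge weight, e.g. a normalized TFIDF value.

    Returns (authority score per term, hub score per document).
    """
    terms = sorted({t for t, _ in weights})
    docs = sorted({d for _, d in weights})
    hub = dict.fromkeys(docs, 1.0)
    auth = dict.fromkeys(terms, 1.0)
    for _ in range(iters):
        # A term gains authority from the hub documents it appears in.
        new_auth = dict.fromkeys(terms, 0.0)
        for (term, doc), w in weights.items():
            new_auth[term] += w * hub[doc]
        auth = _normalize(new_auth)
        # A document gains hub value from the authority terms it contains.
        new_hub = dict.fromkeys(docs, 0.0)
        for (term, doc), w in weights.items():
            new_hub[doc] += w * auth[term]
        hub = _normalize(new_hub)
    return auth, hub

# Hypothetical cluster: "brca1" occurs in all three documents.
example_edges = {
    ("brca1", "d1"): 1.0, ("brca1", "d2"): 1.0, ("brca1", "d3"): 1.0,
    ("mutation", "d1"): 1.0, ("mutation", "d2"): 1.0,
    ("exon", "d3"): 1.0,
}
```

Sorting terms by authority score gives candidate keywords, and sorting documents by hub score gives candidate core documents. The same routine could be run on MeSH-term-to-document edges.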

Document Ranking

In the document ranking step, the goal is to identify articles that are important as well as relevant to the topic of a cluster. Our approach focuses on the citation count per year of a given article. A highly cited article has affected the field more than an article that has never been cited. Therefore, it is reasonable to consider the citation count as an important factor in ranking the articles. Bernstam et al.12 compared eight ranking algorithms: simple PubMed queries, clinical queries (sensitive and specific versions), vector cosine comparison, citation count, journal impact factor, PageRank, and machine learning algorithms based on polynomial support vector machines. They concluded that citation-based algorithms are more effective than non-citation-based algorithms in identifying important articles. Our approach uses citation count per year instead of simple citation count because an article that was relatively unimportant and published several decades ago can, over time, accumulate more citations than would an important article that was published very recently. We compared our ranking algorithm based on citation count per year with simple citation count and journal impact factor. Article citation count and journal impact factor were obtained from the Science Citation Index (SCI) and the Journal Citation Report (JCR).49 If an article does not have a citation count in SCI, then its citation count and citation count per year are taken as zero. If a journal does not have a journal impact factor in JCR, then the journal impact factor of the articles published in that journal is taken as zero. In general, each ranking strategy follows the same rule: the higher an article's citation count, citation count per year, or the impact factor of the journal in which it was published, the more important the article is taken to be.
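The three ranking functions can be compared on a toy example. All article records below are hypothetical, and the way elapsed years are counted (reference year minus publication year, with a minimum of one) is an assumption, since the text does not give the exact formula for citation count per year.

```python
def citation_count(article):
    """CC: raw citation count; a missing count is taken as zero."""
    return article.get("citations") or 0

def citations_per_year(article, reference_year):
    """CCPY: citation count divided by years since publication (at least 1)."""
    years = max(reference_year - article["year"], 1)
    return citation_count(article) / years

def impact_factor(article):
    """JIF: journal impact factor; a missing value is taken as zero."""
    return article.get("jif") or 0.0

def rank_by(articles, score):
    """Article ids ordered from highest to lowest score."""
    return [a["id"] for a in sorted(articles, key=score, reverse=True)]

# Hypothetical records: an old, heavily cited paper; a recent paper with
# fewer total citations; and a paper with no citation data.
example_articles = [
    {"id": "old", "year": 1975, "citations": 300, "jif": 2.0},
    {"id": "recent", "year": 2000, "citations": 120, "jif": 5.0},
    {"id": "uncited", "year": 1999, "citations": None, "jif": None},
]
```

With a reference year of 2006, the raw count places "old" first (300 versus 120 citations), whereas the per-year rate places "recent" first (20 versus roughly 9.7 citations per year), which is the effect the per-year measure is meant to capture.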


Gold Standard Test Set

We used the Society of Surgical Oncology (http://www.surgonc.org) Annotated Bibliography (SSO_AB) as a gold standard. The SSO_AB is maintained by the Society of Surgical Oncology (SSO) and is grouped into 10 categories, each regarding a kind of cancer. Each category was compiled by a single expert and reviewed by a panel of experts on that particular topic.12,14 The articles in SSO_AB are chosen by experts as important. The latest edition of SSO_AB is dated October 2001 because maintaining the annotated bibliography requires a great amount of human effort. It contains 458 unique articles cited by MEDLINE. Publication dates range from March 1969 to September 2001. Therefore, in this study, we restricted the PubMed query to this date range. A perfect ranking algorithm should return the SSO_AB articles at the top of the result set.

System Performance Evaluation

To evaluate the effectiveness of our system, we applied the Hit curve algorithm.12 The Hit curve function, h(n), measures the number of important articles among the top n ranked results. If there are k important articles, then the ideal Hit curve is a straight line with a slope of 1 for 1 ≤ n ≤ k, which becomes horizontal for n > k, after all k important articles have been retrieved.12 For this paper, we chose to measure the number of important articles among the top 10, 20, 40, 60… ranked articles. The Hit curve provides an intuitive representation of an algorithm's performance for a given query, and can be averaged over a number of different queries.12
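A Hit curve is straightforward to compute from a ranked list and a gold-standard set. The sketch below is illustrative only; the function names, the sample ranking, and the cutoffs are assumptions, and the ideal curve is modeled as rising by one per rank until all k important articles are found and staying flat afterwards.

```python
def hit_curve(ranked_ids, important_ids, cutoffs):
    """h(n) for each cutoff n: important articles among the top n results."""
    important = set(important_ids)
    return {n: sum(1 for doc in ranked_ids[:n] if doc in important)
            for n in cutoffs}

def ideal_hit_curve(k, cutoffs):
    """Best possible curve when k important articles exist."""
    return {n: min(n, k) for n in cutoffs}

def average_hit_curves(curves):
    """Average several per-query curves that share the same cutoffs."""
    cutoffs = curves[0].keys()
    return {n: sum(c[n] for c in curves) / len(curves) for n in cutoffs}

# Hypothetical ranking of six articles, three of which are important.
example_ranking = ["a", "b", "c", "d", "e", "f"]
example_important = {"a", "c", "f"}
```

Comparing hit_curve output against ideal_hit_curve at the same cutoffs shows how far a ranking function is from placing every important article at the top.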

Results and Discussion

As mentioned before, our aim was to implement a text mining and ranking system that allows the user to analyze the documents in a conceptually homogeneous way, as well as choose the most important and relevant documents. We tested all the 10 categories (10 types of cancers) defined by SSO_AB. In this paper, we only present the “breast cancer” results. The results of the other nine cancers are included as Online Supplementary Materials at www.jamia.org. The size of the breast cancer result set was 77,784 with 65 unique important articles in SSO_AB.

Preprocessing Results

After the preprocessing stage for the breast cancer results, we obtained a term set of 55,712 terms and a sparse matrix with 1,379,417 nonzero entries. Because the document set contains 77,784 documents, each document retains, on average, about 18 unique terms after preprocessing.
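The average quoted above can be checked with a short calculation. The sketch assumes that the reported matrix size of 1,379,417 is the number of nonzero (document, term) cells, which is the reading under which the "about 18 unique terms" figure follows; the variable names are illustrative.

```python
# Figures reported for the breast cancer result set.
n_documents = 77_784
n_terms = 55_712
n_nonzero = 1_379_417  # assumed to be the count of nonzero matrix cells

# Average number of distinct terms kept per document.
avg_terms_per_document = n_nonzero / n_documents

# Fraction of all document-term cells that are nonzero.
density = n_nonzero / (n_documents * n_terms)
```

The average comes to roughly 17.7 terms per document, and only about 0.03% of the cells are nonzero, which is why sparse eigenvalue and clustering software is a practical fit for this data.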

Clustering and Knowledge Extraction Results for Breast Cancer

The clustering procedure returned 6 clusters for the breast cancer set. Each cluster refers to a set of abstracts that are related by terms that co-occur among the different abstracts. The sentence summaries are very long, and not informative. Therefore, in this paper, only the top-ranked keywords and MeSH terms are presented (Table 1). Cluster E has the largest number of important articles selected by SSO.

Table 1

“Breast Cancer” Document Clustering and Topic Extraction Results

| Cluster | # of Important Articles | Top Keywords | Top MeSH Terms |
| --- | --- | --- | --- |
| A | 5 | BRCA1, BRCA2, p53, ERBB, cell, BCL, DNA, tumor, cyclin, apoptosis, neu, mutant, p21, carcinoma, tumour, receptors, ovarian, suppressor, oncogene, chromosome, antibody, exon, ras, kinase | “mutation” “neoplasm proteins” “transcription factors” “dna, neoplasm” “molecular sequence data” “receptor, erbb-2” “base sequence” “tumor suppressor protein p53” “proto-oncogene proteins” “brca2 protein” “genes, brca1” “genes, p53” “ovarian neoplasms” “genes, tumor suppressor” “brca1 protein” “gene expression regulation, neoplastic” “immunohistochemistry” “dna mutational analysis” |
| B | 1 | IGF, IGFBP, insulin, cell, MCF, receptors, mitogen, plasma, serum, paclitaxel, affinity, estrogen, hs578t, tamoxifen, autocrine, mda, kinase, cancer, ligand, tumor, phosphorylation, apoptosis, circulating, TGF | “antineoplastic combined chemotherapy protocols” “fluorouracil” “cyclophosphamide” “doxorubicin” “methotrexate” “paclitaxel” “antineoplastic agents, phytogenic” “adult” “neoplasm metastasis” “middle aged” “aged” “antineoplastic agents” “epirubicin” “cisplatin” “antibiotics, antineoplastic” “infusions, intravenous” “taxoids” “vincristine” “treatment outcome” “dose-response relationship, drug” “drug therapy, combination” “clinical trials” “chemotherapy, adjuvant” |
| C | 8 | estrogen, receptors, tamoxifen, cell, estradiol, MCF, hormone, antiestrogens, progesterone, TGF, steroid, EGF, aromatase, tumor, endometrial, oestrogen, women, postmenopausal, progestin, androgen, PgR, mammary, cytosolic, pS2 | “receptors, estrogen” “estradiol” “tamoxifen” “receptors, progesterone” “tumor cells, cultured” “estrogens” “estrogen antagonists” “cell division” “rats” “neoplasms, hormone-dependent” “menopause” “cell line” “antineoplastic agents, hormonal” “mice” “mammary neoplasms, experimental” “rna, messenger” “uterus” “progesterone” “kinetics” “cytosol” “aged” |
| D | 8 | women, mammography, mammogram, mammographic, cancer, fat, American, aged, BSE, pregnancy, interventions, lesion, OCs, disease, oral, African, abortion, dietary, diagnosed, birth, deaths, Hispanic, cervical, younger | “mammography” “mass screening” “adult” “aged” “risk factors” “middle aged” “age factors” “united states” “comparative study” “adolescent” “incidence” “questionnaires” “neoplasms” “male” “aged, 80 and over” “case-control studies” “risk” “breast” “breast diseases” “socioeconomic factors” “cohort studies” “sensitivity and specificity” |
| E | 41 | metastases, axillary, recurrence, lymph, bone, carcinoma, survival, SLN, DCIS, lesion, tumor, mastectomy, disease, metastasis, metastatic, ductal, cytology, excision, cell, cancer, women, lobular, nodal | “lymphatic metastasis” “neoplasm recurrence, local” “aged” “adult” “combined modality therapy” “middle aged” “mastectomy” “follow-up studies” “neoplasm staging” “retrospective studies” “prognosis” “axilla” “lymph node excision” “aged, 80 and over” “male” “lymph nodes” “mastectomy, segmental” “survival rate” “time factors” “carcinoma, intraductal, noninfiltrating” “radiotherapy dosage” “carcinoma, ductal, breast” “carcinoma” |
| F | 2 | cell, antigens, tumor, antibody, carcinoma, MMP, epithelial, MAb, CSF, MCF, CD34, MUC1, mammary, receptors, membrane, cadherin, lymphocytes, marrow, UPA, TNF, HLA, VEGF, kinase | “mice” “tumor cells, cultured” “antibodies, monoclonal” “antigens, neoplasm” “molecular sequence data” “cell line” “amino acid sequence” “mice, nude” “rna, messenger” “cell division” “immunohistochemistry” “base sequence” |

We extended the idea of the HITS (Hyperlink-Induced Topic Search) algorithm45 to extract keywords and MeSH terms, ranking them by the relationships among terms and documents. One limitation of the HITS algorithm is that it relies on “global” information derived from all the vectors in the dataset, which makes it more effective for datasets consisting of homogeneously distributed vectors. However, the documents returned from PubMed cover multiple distinguishable topics. We therefore integrated the HITS algorithm with a clustering technique: articles on the same topic are grouped into clusters, so that the top-ranked terms and documents can be identified efficiently within each cluster. Table 1 illustrates that the clustering and topic extraction strategy performs well. The six clusters our system derived represent six distinct topical groups, which are revealed by the top-ranked keywords and MeSH terms:

  1. Cluster A shows one common topic: molecular genetic studies of breast cancer, especially the genes BRCA1, BRCA2, p53, and p21.

  2. Cluster B contains articles related to chemotherapy, as revealed by the top MeSH terms, such as “antineoplastic combined chemotherapy protocols” and “drug therapy,” and by the chemotherapy drugs paclitaxel, epirubicin, and cisplatin.

  3. The articles in Cluster C report research focusing on the role of hormones and growth factors, such as epidermal growth factor (EGF) and transforming growth factor (TGF) in breast cancer development, and on tamoxifen, an anti-estrogen (anti-hormonal) drug as a treatment for breast cancer.50,51 Estrogen promotes the growth of breast cancer cells, and tamoxifen blocks the effects of estrogen on these cells, slowing the growth of the patient's cancer cells that have estrogen receptors. As adjuvant therapy, tamoxifen helps prevent the original breast cancer from returning and also helps prevent the development of new cancers in the other breast.

  4. The common topic of the articles in cluster D is population studies, including mammographic screening for breast cancer among different ethnic groups. Cluster-related keywords include “American,” “African,” and “Hispanic.”

  5. The articles in cluster E focus on recurrences of breast cancer and follow-up studies.

  6. Cluster F represents the set of articles that report on treatment of breast cancer using monoclonal antibodies (MAbs), which can target epidermal growth-factor receptors.52 The anti-vascular endothelial growth factor (VEGF) antibody, which has been approved by the Food and Drug Administration (FDA) for the treatment of colon cancer, is also able to achieve similar progress in the treatment of locally advanced breast cancer.53

The clustering and topic extraction results for the other nine types of cancers appear in the Online Supplemental Materials at www.jamia.org. For all nine cancer types, the algorithm grouped articles into distinct clusters with specific topics.
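The mutual reinforcement between terms and documents that underlies HITS-style keyword extraction within a single cluster can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name `hits_term_scores`, the binary term-by-document matrix, the fixed iteration count, and the toy data are all assumptions made for the example.

```python
import math

def hits_term_scores(matrix, n_iter=50):
    """Power-iteration HITS on a term-by-document matrix (rows = terms).

    A term scores highly when it occurs in high-scoring documents, and a
    document scores highly when it contains high-scoring terms.
    """
    n_terms, n_docs = len(matrix), len(matrix[0])
    term = [1.0] * n_terms
    doc = [1.0] * n_docs
    for _ in range(n_iter):
        # Documents collect weight from the terms they contain.
        doc = [sum(matrix[t][d] * term[t] for t in range(n_terms))
               for d in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in doc))
        doc = [x / norm for x in doc]
        # Terms collect weight from the documents they occur in.
        term = [sum(matrix[t][d] * doc[d] for d in range(n_docs))
                for t in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in term))
        term = [x / norm for x in term]
    return term, doc

# Hypothetical cluster: three terms across four documents.
terms = ["brca1", "mutation", "screening"]
matrix = [
    [1, 1, 1, 0],  # brca1
    [1, 1, 0, 0],  # mutation
    [0, 0, 0, 1],  # screening
]
term_scores, doc_scores = hits_term_scores(matrix)
ranked_terms = [terms[i] for i in
                sorted(range(len(terms)), key=lambda i: -term_scores[i])]
```

Running the iteration per cluster, rather than over the whole retrieval set, keeps a term that dominates one topic from being drowned out by terms that are frequent across all topics.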

Document Ranking Results

Figure 3 presents, using hit curves, the document ranking results for the breast cancer document set (six clusters). In Figure 3, the x-axis represents the number of ranked documents, while the y-axis represents the number of SSO_AB articles found in the top-ranked document list. All the ranking algorithms (CC, CCPY, and JIF), carried to the end, would finish with all important articles found. To provide an upper bound, we also plotted the hit curve of an ideal ranking strategy, which would rank all the important articles at the top of the result list. Therefore, the ranking algorithm whose hit curve was “closest” to the ideal hit curve was the best algorithm.12

Figure 3

The hit curves of the six clusters (A, B, C, D, E and F) derived from the “breast cancer” data set. The x-axis represents the number of ranked documents, while the y-axis represents the number of SSO_AB articles found in the top-ranked document list (e.g., in cluster E, using CCPY as the ranking function, 7 of the top 40 ranked documents are important articles). CC: Citation Count; CCPY: Citation Count Per Year; JIF: Journal Impact Factor. As a ranking function, CCPY outperforms CC and JIF. To provide an upper bound, we also plot the hit curve of an ideal ranking strategy, which ranks all the important articles at the top of the result list. Therefore, the ranking algorithm whose hit curve is “closest” to the ideal hit curve is the best algorithm.

Figure 3 shows that, as a ranking function, CCPY (Citation Count Per Year) outperforms CC (Citation Count) and JIF (Journal Impact Factor), as the CCPY hit curve is the “closest” to the ideal hit curve. In cluster E, which has the largest number of important articles (41), 7 of the 40 top-ranked articles were important if ranked by CCPY, while only 3 were important if ranked by CC, and no important article appeared among the top 40 if ranked by JIF. Similar results occurred in the other clusters. The document ranking results for the other nine types of cancers (see Online Supplementary Materials at www.jamia.org) also show that, as a ranking function, CCPY outperforms CC and JIF.
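A hit curve of the kind described here can be computed directly from a ranked list and a set of important articles. The sketch below is illustrative only: the function names `hit_curve` and `ideal_hit_curve` and the document identifiers are hypothetical, and the code simply counts, for each cutoff k, how many important articles appear among the top k.

```python
def hit_curve(ranked_ids, important_ids):
    """Return a list whose entry k-1 is the number of important
    articles found among the top k ranked documents."""
    important = set(important_ids)
    hits, found = [], 0
    for doc_id in ranked_ids:
        if doc_id in important:
            found += 1
        hits.append(found)
    return hits

def ideal_hit_curve(n_docs, n_important):
    """Hit curve of a ranking that places every important article first."""
    return [min(k, n_important) for k in range(1, n_docs + 1)]

# Hypothetical cluster of six documents, two of them important.
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
important = {"d7", "d4"}
curve = hit_curve(ranked, important)
ideal = ideal_hit_curve(len(ranked), len(important))
```

Every ranking function ends at the same final value (all important articles found), so the comparison between functions rests on how quickly each curve rises toward the ideal one.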

To show how much CCPY outperforms CC, for each cancer type, we calculated the average ranking of the important articles as:

$$\bar{R} = \frac{1}{n}\sum_{i=1}^{n} r_i \tag{1}$$

where $\bar{R}$ is the average ranking, $n$ is the number of important articles identified by SSO, and $r_i$ is the ranking of article $i$ ranked by either CC or CCPY. For example, there were 65 important articles in the “breast cancer” retrieval result ($n = 65$). After all the “breast cancer” documents were ranked either by CC or CCPY, we found the ranks of all the 65 articles and calculated the average ranking $\bar{R}$. Then, an improvement rate (IR) was calculated as:

$$IR = \frac{\bar{R}_{CC} - \bar{R}_{CCPY}}{\bar{R}_{CC}} \times 100\% \tag{2}$$

where $\bar{R}_{CC}$ is the average ranking of the important articles ranked by CC and $\bar{R}_{CCPY}$ is the average ranking of the important articles ranked by CCPY. Table 2 lists these results. For all ten different types of cancers, CCPY improved the ranking compared with CC (from ∼23% to more than 46%). For each cancer type, two important articles (their PMIDs are listed) with the largest IRs were identified. The CC rank, CCPY rank, and the year when the article was published are also listed. We note that these articles were published relatively recently (closer to 2001 within the study's PubMed query date range from March 1969 to September 2001). The ranking information of the 65 important “breast cancer” articles is shown in Table 3. The ranking information of the other 9 cancer types appears in the Online Supplementary Materials at www.jamia.org.
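As a worked check of these two quantities, the sketch below reads equation (1) as the arithmetic mean of the ranks and equation (2) as the relative reduction in average rank when moving from CC to CCPY, expressed as a percentage. That reading is an assumption, but it reproduces the improvement rate reported for breast cancer in Table 2 from the two average rankings given there. The function names are hypothetical.

```python
def average_ranking(ranks):
    """Equation (1), read as the mean rank of the important articles."""
    return sum(ranks) / len(ranks)

def improvement_rate(avg_rank_cc, avg_rank_ccpy):
    """Equation (2), read as the percentage reduction in average rank
    when ranking by CCPY instead of CC (lower rank = nearer the top)."""
    return (avg_rank_cc - avg_rank_ccpy) / avg_rank_cc * 100.0

# Average rankings for breast cancer as reported in Table 2.
ir_breast = improvement_rate(4976.5, 3069.844)

# Hypothetical ranks for three important articles, for illustration only.
example_avg = average_ranking([10, 20, 60])
```

With the Table 2 inputs, `ir_breast` comes out at about 38.31%, matching the reported value.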

Table 2

The Comparison of CCPY and CC for Important Article Ranking

Category | # of Important Articles1 (Total # of Articles Retrieved2) | Average Ranking3, CC4 | Average Ranking3, CCPY5 | Improvement Rate (%)6 and Example7 (PMID, CC Rank, CCPY Rank, Year Published)
Breast Cancer | 65 (77784) | 4976.5 | 3069.844 | 38.31321115704216553812001
Colorectal Cancer | 39 (53686) | 6938.838 | 4857.378 | 29.997231100636649102000
Endocrine Cancer | 72 (46981) | 4402.014 | 3389.435 | 23.002641097338311822612000
Esophageal Cancer | 34 (16359) | 995.4688 | 641.2188 | 35.5862510080844301999
Gastric Cancer | 47 (33938) | 2178.298 | 1463.489 | 32.81511547741101222001
Hepatobiliary Cancer | 68 (76616) | 7338.657 | 4123.269 | 43.814410636102308492000
Lung Cancer | 42 (74189) | 3413.585 | 2595.488 | 23.965931069460019795872000
Melanoma | 46 (33074) | 2968.891 | 1761.13 | 40.68054115047456382001
Pancreas Cancer | 61 (25241) | 2061.633 | 1131.933 | 45.095311129727110992382001
Soft Tissue Sarcomas | 22 (3193) | 205.7619 | 110 | 46.5401511230464117162001
  • 1 The number of important articles for each type of cancer, as defined by SSO_AB;

  • 2 The number of articles retrieved from MEDLINE;

  • 3 Please refer to equation (1) for Average Ranking calculation;

  • 4 CC: Citation Count;

  • 5 CCPY: Citation Count per Year;

  • 6 Please refer to equation (2) for Improvement Rate calculation;

  • 7 For each type of cancer, the two articles with the largest improvement rates are identified (PMIDs are listed).

Table 3

“Breast cancer” Important Article Ranking Results. There are 65 Important Articles

PMID | Citation Count | Year Published | CC Rank | CCPY Rank | Improvement Rate (%)

Bernstam et al. (2006)12 reported that citation-based algorithms are more effective than non-citation-based algorithms in identifying important articles. The current study showed that for our purposes, citation count per year worked better than simple citation count. We hypothesized that an article which was relatively unimportant and published several decades ago may accumulate more absolute citations than a more important article published just recently.
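The effect hypothesized here can be illustrated with a small example in which citation count is normalized by article age. The normalization below (citations divided by the number of years since publication, counting the publication year) is an assumption made for illustration, since the exact formula is not restated in this section; the two articles and their counts are hypothetical.

```python
def citation_count_per_year(citations, year_published, current_year):
    """Citations divided by article age in years.

    Assumption: the publication year counts as the first year, so the
    age is never below 1 and the division is always defined.
    """
    age = max(current_year - year_published + 1, 1)
    return citations / age

# Two hypothetical articles: an old one with many accumulated citations
# and a recent one that is being cited at a much higher rate.
articles = [
    {"id": "old", "citations": 120, "year": 1975},
    {"id": "new", "citations": 40, "year": 2000},
]
by_cc = sorted(articles, key=lambda a: -a["citations"])
by_ccpy = sorted(
    articles,
    key=lambda a: -citation_count_per_year(a["citations"], a["year"], 2001),
)
```

Ranked by raw citation count, the older article comes first; ranked by citations per year, the recent article overtakes it, which is the behavior the comparison between CC and CCPY is meant to capture.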

The importance of a paper to a field varies over time. A citation decay pattern has been discovered in bibliometric studies of published scientific literature.54,55 Burton and Kebler coined the term “citation half-life” with respect to scientific and technical literature.56 Citation half-life can be defined as the number of years required to encompass the most recent 50% of all references made.54 A paper with a longer half-life might have more enduring value than a paper with a shorter half-life.54,57 Therefore, it seems that citation half-life may be a better measure for identifying important articles than simple citation count or citation count per year. However, citation half-life is highly related to the journal publication frequency, journal age, language, country of publication, length of the paper, and subject category.54,57 Papers that are longer and in sciences or subfields that are growing fast are more likely to be cited over a longer period, and thus have longer citation half-lives.57 How to use the citation half-life information for document ranking requires further investigation.

Algorithm Computational Efficiency

Retrieval and mining of large document sets is computationally intensive. However, by choosing efficient clustering, summarization, and ranking algorithms, the system studied performed acceptably fast. The study used a machine with the Windows XP operating system, a 3.0 GHz CPU, and 2.0 GB of RAM. Table 4 shows the computing time for each phase of system operation. Only the “breast cancer” data set computing time results are shown because that analysis involved the largest number of documents retrieved from MEDLINE (77,784 citations). The first system phase was query submission and document retrieval. Since this phase relies on PubMed to retrieve the documents, its time is not listed in Table 4. Table 4 shows the times to conduct the other four phases: text preprocessing, document clustering, document ranking, and topic extraction. The time to extract the topic for each cluster covers only the generation of the top-ranked keywords and MeSH terms, because summarization of the document set using MEAD took much longer (about 40 minutes per cluster, not shown in the table). Furthermore, the sentence summary was not as informative as the top-ranked keywords or MeSH terms, so we used the top-ranked keywords and MeSH terms to represent the common topic of the documents in each cluster. From Table 4, we can see that the pre-processing phase took most of the time (45 seconds out of a total of 70.06 seconds). In future work, we plan to keep a local copy of MEDLINE and index each abstract with its keywords. The pre-processing time will then be significantly reduced (to about 1–2 seconds). As a result, the system will cluster and rank MEDLINE abstracts more efficiently.

Table 4

“Breast cancer” Data Set Computing Time for Each Phase in Our System

Phase | # of documents | Computing Time (in seconds)
Text pre-processing | 77,784 | ∼45
Text clustering (using CLUTO) | 77,784 | 14.28
Keyword Extraction
Cluster A | 8,212 | 0.77
Cluster B | 6,159 | 0.66
Cluster C | 13,122 | 1.057
Cluster D | 16,292 | 0.97
Cluster E | 21,005 | 1.25
Cluster F | 12,994 | 1.09
MeSH term Extraction
Cluster A | 8,212 | 0.41
Cluster B | 6,159 | 0.33
Cluster C | 13,122 | 0.58
Cluster D | 16,292 | 0.66
Cluster E | 21,005 | 0.81
Cluster F | 12,994 | 0.59
Document ranking
Cluster A | 8,212 | 0.20
Cluster B | 6,159 | 0.11
Cluster C | 13,122 | 0.27
Cluster D | 16,292 | 0.30
Cluster E | 21,005 | 0.44
Cluster F | 12,994 | 0.28
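The totals quoted in the text can be checked against the per-phase figures in Table 4. The short sketch below sums the listed times and cluster sizes; the dictionary layout and variable names are illustrative only, and the pre-processing time is taken as 45 seconds although the table reports it as approximate.

```python
# Per-phase computing times in seconds, as listed in Table 4
# (per-cluster values are in cluster order A to F).
phase_seconds = {
    "text_preprocessing": [45.0],  # reported as approximately 45
    "text_clustering": [14.28],
    "keyword_extraction": [0.77, 0.66, 1.057, 0.97, 1.25, 1.09],
    "mesh_term_extraction": [0.41, 0.33, 0.58, 0.66, 0.81, 0.59],
    "document_ranking": [0.20, 0.11, 0.27, 0.30, 0.44, 0.28],
}
cluster_sizes = [8212, 6159, 13122, 16292, 21005, 12994]

total_seconds = sum(sum(times) for times in phase_seconds.values())
preprocessing_share = phase_seconds["text_preprocessing"][0] / total_seconds
total_documents = sum(cluster_sizes)
```

The listed times add up to about 70.06 seconds, with pre-processing accounting for roughly 64% of the total, and the six cluster sizes add up to the 77,784 retrieved citations.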

Conclusions and Future Work

The text mining system we presented, which integrates several text mining techniques, namely, text clustering, text summarization, and text ranking, can effectively organize PubMed retrieval results into different topical groups. It offers users the potential to focus on reduced sets of articles for which they have greater interest, instead of reading through the long list of citations returned by a query. An additional finding of the study involved demonstrating that as a ranking function, citation count per year outperforms simple citation count and journal impact factor.

In this study, the authors developed a system framework to explore MEDLINE citations to assist biomedical researchers in identifying important articles according to different topics. There are several areas in which future efforts might improve our system. One such area is the text summarization part of the system. In this study, the sentences derived from MEAD, a multi-document summarizer, were less informative than the keywords generated by HITS. We plan to include Gene Ontology (GO) information in the next iteration of the text summarization process. For each group of articles, besides the informative keywords, the appropriate GO terms would also be listed. New algorithms will be designed and tested to find appropriate GO terms to represent the topic of a given group of articles. A second potential area for improvement would be to develop a new ranking function based on a combination of CCPY, JIF, and other factors. Third, an active learning system might be employed to refine clustering and ranking based on user-provided feedback. Fourth, a parallel and distributed algorithm might improve system performance by carrying out document clustering, ranking, and topic extraction in a parallel and distributed way. Open-source distributed computing infrastructure, such as Hadoop,58 will be explored. Last, a Web-based software system can be developed and deployed for remote researchers to use, as shown in Figure 4.

Figure 4

An example of PubMed search result clustering and ranking.


  • The authors are grateful to the editor and anonymous reviewers for a number of suggestions for improvement. We thank Phil Bachman and Lavin Urbano for helpful discussions.

