
Drug repurposing: mining protozoan proteomes for targets of known bioactive compounds

Adam Sateriale , Kovi Bessoff , Indra Neil Sarkar , Christopher D Huston
DOI: http://dx.doi.org/10.1136/amiajnl-2013-001700 238-244 First published online: 1 March 2014


Objective To identify potential opportunities for drug repurposing by developing an automated approach to pre-screen the predicted proteomes of any organism against databases of known drug targets using only freely available resources.

Materials and methods We employed a combination of Ruby scripts that leverage data from the DrugBank and ChEMBL databases, MySQL, and BLAST to predict potential drugs and their targets from 13 published genomes. Results from a previous cell-based screen to identify inhibitors of Cryptosporidium parvum growth were used to validate our in-silico prediction method.

Results In-vitro validation of the predictions, using a cell-based C parvum growth assay, showed that the predicted inhibitors were significantly more likely than expected by chance to have confirmed activity, with 8.9–15.6% of predicted inhibitors confirmed depending on the drug target database used. The method was then used to predict inhibitors for 13 disease-causing protozoan parasites: C parvum, Entamoeba histolytica, Giardia intestinalis, Leishmania braziliensis, Leishmania donovani, Leishmania major, Naegleria gruberi (in proxy of Naegleria fowleri), Plasmodium falciparum, Plasmodium vivax, Toxoplasma gondii, Trichomonas vaginalis, Trypanosoma brucei and Trypanosoma cruzi.

Conclusions Although proteome-wide screens for drug targets have disadvantages, in-silico methods can be developed that are fast, broad, inexpensive, and effective. In-vitro validation of our results for C parvum indicates that the method presented here can be used to construct a library for more directed small molecule screening, or pipelined into structural modeling and docking programs to facilitate target-based drug development.

  • computational biology
  • proteomics
  • parasitology
  • drug repositioning
  • drug therapy


The prohibitive cost of bringing a new drug to market for the treatment of rare and neglected diseases has spurred a movement to repurpose or reposition existing therapeutics. Small molecule screens to identify new bioactivities can be costly, so the choice of which libraries to screen can be crucial to the success of such an undertaking. Bioactivity databases such as ChEMBL and DrugBank hold a wealth of accumulated knowledge concerning drug–protein interactions. Here, we present an approach to pre-screen the entire proteome of any organism with available genomic data against known drug targets, which utilizes a combination of Ruby scripts and freely available resources. Results from an in-vitro growth assay of Cryptosporidium parvum used for validation showed that compounds predicted to have activity with our approach were significantly more likely than expected by chance to have confirmed activity, with 15.6% of those predicted using the DrugBank database and 8.9% of those predicted using ChEMBL being active. For the purpose of demonstration, we have screened 13 unicellular protozoan parasites against known drug targets of clinically approved therapeutics. The screening process described here offers an in-silico approach to pre-screen and build enriched small molecule libraries in order to identify drugs that may be repurposed for the treatment of protozoan and other neglected diseases.

Background and significance

Parameters that outline what makes a disease rare or orphaned were first codified by the US Orphan Drug Act in 1983. According to this act, an orphaned or rare disease is one with a prevalence of less than 200 000 individuals in the USA. Of the estimated more than 6000 orphaned/rare diseases, only 325 have established treatments.1 Neglected tropical diseases are generally far from rare globally, yet have a small market in the developed world, making them orphaned as a consequence of their geography. With such a small market in the developed world, pharmaceutical companies are often hesitant to invest in costly de-novo campaigns to develop new therapeutics. What may be considered a weakness in orphaned drug development also poses an opportunity for industrial and academic collaboration. Pharmaceutical companies such as Novartis and GlaxoSmithKline have made high-throughput screening data publicly available for researchers, opening the door for data mining and developing in-silico screens to identify new treatments and repurpose existing treatments.2

Bringing a new drug to market can cost up to US$800 million and take up to 17 years of development time.3,4 Despite a steady annual increase in expenditures for pharmaceutical research and development, de-novo drug creation has stagnated as measured by the number of original investigational new drug applications received by the US Food and Drug Administration.4 In response to a decreasing return on investment, there has been a steady movement to ‘repurpose’ or ‘reposition’ (the terms are used interchangeably) existing therapeutics for off-label applications. Patients and the pharmaceutical industry alike reap the benefits of drug repurposing. Patients may receive a novel therapeutic for a previously unmet clinical need, and extensive post-marketing surveillance data mean repurposing candidates will have a well-defined safety/side effect profile. Pharmaceutical companies that bring repurposed drugs to market can expect a decreased time of development, an extension of patent life, or the salvage of a previously failed therapeutic.5 In terms of the bottom line, repurposed drugs cost around 60% less to bring to market than drugs developed de novo.6 Drug repurposing has been largely considered a success; of the 51 new drugs to reach their first markets in 2009, approximately 30% were repurposed drugs.7

Because repurposing screens can be costly and time consuming, an in-silico drug screen with the ability to identify drugs with a high likelihood of activity could improve the chances of success by enabling the pre-selection of compounds to test in vitro. In-silico screens generally fall into one of three categories: pharmacophore based; network based; or target based.8 Early successes leveraging in-silico drug discovery were largely pharmacophore based. By mapping the three-dimensional (3D) molecular features of drugs that are required for bioactivity, drugs with similar pharmacophore signatures can be predicted to act in a similar manner. Using this method, novel antagonists for protein kinase C, HIV integrase, and CCR5 were identified.9–12 Network-based approaches rely on the integration of large-scale datasets to characterize diseases and potential therapeutics. Applications of network-based approaches to drug discovery range from drug prediction based on known side effects to drug prediction based on data-mining of ontology networks.13–15 Finally, target-based approaches to drug discovery focus on the prediction of drug activity based on the similarity of microbial proteins to known drug targets and can range from one-dimensional to 3D in nature; from protein sequence analysis to structure prediction and docking of potential ligands. Because target-based approaches rely on a priori knowledge of drug mechanisms, databases that hold vast amounts of protein–drug information such as PubChem, DrugBank and ChEMBL are invaluable.16–18

Here, we developed a target-based approach that utilizes simple sequence alignment techniques to discover potential drugs. The method can be used for any organism for which genomic data exist. Our hypothesis was that this approach would enable a low cost pre-screen to improve efficiency and yield in small molecule screens by allowing users to create enriched libraries based on in-silico drug predictions. To evaluate the potential of this approach, the efficacy of our screen was assessed using our previously published in-vitro small molecule screen against the apicomplexan parasite C parvum.19 We have also screened 13 protozoan parasites in silico, and identified many excellent potential candidates for repurposing.

Materials and methods

Protozoan parasites used for this study met two criteria: rare and/or orphaned disease status, and the availability of a comprehensive predicted proteome based on full genome data in GenBank. Thirteen parasites that met these criteria were chosen: C parvum, Entamoeba histolytica, Giardia intestinalis, Leishmania braziliensis, Leishmania donovani, Leishmania major, Naegleria gruberi (in proxy of Naegleria fowleri), Plasmodium falciparum, Plasmodium vivax, Toxoplasma gondii, Trichomonas vaginalis, Trypanosoma brucei and Trypanosoma cruzi.

The overall workflow (shown in figure 1) was implemented using a series of Ruby scripts (making use of the BioRuby and CSV gems). Predicted organism protein sequences were obtained from GenBank via taxonomic identifiers; unique records and species complex sequences were used when available. Drug target sequences were obtained through ChEMBL and DrugBank, and identifying information was loaded into a MySQL database. ChEMBL employs a 1–10 ranking system to quantify the specificity of the protein–drug interaction for relevant bioassays (10 being most specific). For this investigation, eight was used as a stringent cut-off value. Protein sequences were then searched against the drug target sequences with BLAST (blastp) using default parameters (BLOSUM62 matrix and composition-based score adjustment conditioned on sequence properties)20 and an expectation value cut-off of 10e-100. For each hit that passed the specified threshold, identifying information was retrieved. For DrugBank hits this information included the DrugBank ID, drug approval status, and generic name of the drug(s) for each known protein target. For ChEMBL hits this included the ChEMBL ID, drug approval status, published activity and units (including published IC50 values), and the target confidence score mentioned above. The gene identifier (GI) numbers for each protein that hit a DrugBank or ChEMBL entry were then subjected to pathway analysis and functional clustering using the National Institutes of Health (NIH) DAVID's API services.21 Raw output used to construct all graphs and tables and the Ruby scripts are provided as supplementary materials (available online only). For statistical analysis, GraphPad Prism software (V.6.01) was used to run a binomial test using the results of either DrugBank or ChEMBL program predictions as observed data, and the results of our previous NIH clinical collection drug screen as expected data.
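The filtering and lookup steps in this workflow can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not the supplementary scripts themselves: it assumes the BLAST results were saved in tabular form (`-outfmt 6`), it uses an in-memory hash in place of the MySQL tables, and the names `parse_blast_line`, `predicted_drugs`, `EVALUE_CUTOFF`, `MIN_CONFIDENCE` and all sample identifiers are hypothetical. The cut-off is written as it appears in the text (10e-100, which Ruby evaluates as 1e-99).

```ruby
# Sketch of the post-BLAST filtering step (hypothetical names throughout).
EVALUE_CUTOFF  = 10e-100  # expectation value cut-off as written in the text
MIN_CONFIDENCE = 8        # ChEMBL target confidence cut-off from the text

Hit = Struct.new(:query_id, :target_id, :evalue)

# Parse one line of BLAST tabular output (-outfmt 6: 12 tab-separated
# columns, with the e-value in the 11th column).
def parse_blast_line(line)
  cols = line.chomp.split("\t")
  Hit.new(cols[0], cols[1], Float(cols[10]))
end

# Keep hits at or below the e-value cut-off, then look up the drugs known
# to act on each matched target. `target_drugs` stands in for the database
# lookup: target id => array of { name:, confidence: } hashes.
def predicted_drugs(lines, target_drugs, cutoff: EVALUE_CUTOFF)
  hits = lines.map { |line| parse_blast_line(line) }
  strong_hits = hits.select { |hit| hit.evalue <= cutoff }
  rows = strong_hits.flat_map do |hit|
    drugs = target_drugs.fetch(hit.target_id, [])
    confident = drugs.select { |d| d[:confidence] >= MIN_CONFIDENCE }
    confident.map { |d| [hit.query_id, hit.target_id, d[:name]] }
  end
  rows.uniq
end

# Illustrative (made-up) input: one strong hit and one weak hit.
blast_lines = [
  "parasite_prot_1\ttarget_A\t62.1\t1180\t400\t12\t5\t1170\t3\t1165\t0.0\t1510",
  "parasite_prot_2\ttarget_B\t31.0\t210\t130\t6\t1\t205\t10\t214\t3e-20\t95.1"
]
target_drugs = {
  "target_A" => [{ name: "drug_x", confidence: 9 },
                 { name: "drug_y", confidence: 6 }],
  "target_B" => [{ name: "drug_z", confidence: 9 }]
}

predictions = predicted_drugs(blast_lines, target_drugs)
predictions.each { |row| puts row.join("\t") }
```

Only the first hit survives the e-value filter in this toy input, and only the drug whose confidence score meets the cut-off is reported for it. DrugBank records carry no such score, so a real script would apply the confidence filter to ChEMBL rows only.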

Figure 1

Overview of workflow.


Parasitic protists chosen for analysis were separated into three taxonomically significant groups: (1) the closely related Trypanosomatidae order (represented by L braziliensis, L donovani, L major, T brucei and T cruzi); (2) the Apicomplexa phylum (represented by C parvum, P falciparum, P vivax and T gondii); and (3) the highly divergent group of Amitochondriates (represented by E histolytica, G intestinalis, N gruberi and T vaginalis). Table 1 summarizes the number of predicted targets and predicted drugs identified for each organism for both approved and non-approved drug classes (see supplementary materials, available online only, for raw data including all predicted drug targets and drugs). Genome size had no correlation (linear, exponential, or logarithmic) with the number of predicted targets or the corresponding number of predicted therapeutics.

Table 1

Number of predicted targets and predicted drugs for each organism via ChEMBL and DrugBank alignments (approved drugs only)

Organism name | Records in GenBank | ChEMBL predicted targets | DrugBank predicted targets | Targets predicted in both | ChEMBL predicted drugs | DrugBank predicted drugs | Drugs predicted in both
Entamoeba histolytica | 8163 | 28 | 25 | 11 | 180 | 76 | 25
Giardia intestinalis | 6502 | 13 | 15 | 8 | 44 | 66 | 18
Naegleria gruberi | 15759 | 59 | 62 | 22 | 238 | 154 | 51
Trichomonas vaginalis | 59679 | 27 | 26 | 12 | 91 | 68 | 19
Cryptosporidium parvum | 3805 | 29 | 24 | 12 | 183 | 74 | 27
Plasmodium falciparum | 5337 | 24 | 31 | 10 | 59 | 74 | 21
Plasmodium vivax | 5392 | 24 | 29 | 11 | 58 | 72 | 21
Toxoplasma gondii | 8013 | 37 | 37 | 16 | 107 | 103 | 27
Leishmania braziliensis | 7896 | 41 | 40 | 17 | 219 | 90 | 36
Leishmania donovani | 7992 | 43 | 40 | 16 | 141 | 97 | 33
Leishmania major | 8265 | 40 | 40 | 16 | 218 | 103 | 37
Trypanosoma brucei | 8712 | 40 | 28 | 12 | 103 | 92 | 33
Trypanosoma cruzi | 19602 | 41 | 35 | 14 | 100 | 100 | 31
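The statement that genome size does not track the number of predicted targets can be explored with a simple correlation coefficient. The sketch below is one way to do so, under two assumptions that are not taken from the paper: the number of GenBank records in Table 1 is used as a stand-in for genome size, and a Pearson coefficient is used as the measure of linear association. The helper name `pearson` is hypothetical. Exponential or logarithmic relationships could be examined the same way after log-transforming one of the two series.

```ruby
# Pearson correlation coefficient between two equal-length numeric arrays.
def pearson(xs, ys)
  raise ArgumentError, "arrays must have equal length" unless xs.size == ys.size
  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n
  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Table 1 columns, in table order: records in GenBank and ChEMBL predicted targets.
records        = [8163, 6502, 15759, 59679, 3805, 5337, 5392,
                  8013, 7896, 7992, 8265, 8712, 19602]
chembl_targets = [28, 13, 59, 27, 29, 24, 24, 37, 41, 43, 40, 40, 41]

r = pearson(records, chembl_targets)
puts format("Pearson r (records vs ChEMBL targets) = %.2f", r)

# Same check on a log scale, for a possible logarithmic relationship.
r_log = pearson(records.map { |x| Math.log(x) }, chembl_targets)
puts format("Pearson r (log records vs ChEMBL targets) = %.2f", r_log)
```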

Apicomplexa and in-vitro validation of results

C parvum is one of several Cryptosporidium species that cause cryptosporidiosis, a diarrheal disease with a fecal–oral route of infection. In most cases these infections are self-limiting, yet in immunocompromised individuals cryptosporidiosis can be fatal. Figure 2 shows an example of the program output for nitazoxanide, which is the current standard of care for cryptosporidiosis and was identified as a potential therapeutic by the method. The putative target is C parvum’s pyruvate:ferredoxin oxidoreductase (GI: 66356990), which is aligned with that of Clostridium perfringens (GI: 110803645) (figure 2A). Additional sequences encoding potentially druggable Cryptosporidium proteins that were identified and the relevant metabolic pathways based on a clustering analysis performed using NIH DAVID are also shown (figure 2B,C).21 Unfortunately, nitazoxanide has shown limited efficacy in the treatment of immunocompromised patients; hence there is a great need for new treatments, de novo or repurposed.22–24

Figure 2

Example of program output for Cryptosporidium parvum's pyruvate:ferredoxin oxidoreductase (GI: 66356990), including: (A) the raw alignment results, (B) sequences corresponding to potentially druggable organism proteins that were identified for downstream modeling and docking purposes, and (C) pathway analysis and functional clustering courtesy of NIH DAVID.19

Our group has previously used a high throughput cell-based assay to screen the NIH clinical collections chemical libraries to identify candidate drugs for the treatment of cryptosporidiosis.19 In brief, C parvum growth in HCT-8 ileocecal adenocarcinoma cell monolayers treated with small molecule libraries identified 21 confirmed inhibitors. To validate the bioinformatic approach described here, we compared the results of this screen with the in-silico predictions. The DrugBank database recognized 417 of the 727 compounds in the NIH clinical collections as approved therapeutics. Based on our analysis, 32 of these were predicted to have protein targets in C parvum. Of these 32, five compounds exhibited in-vitro efficacy against C parvum growth and became candidates for repurposing. The ChEMBL database recognized 427 of our tested compounds as approved therapeutics, 45 of which were predicted to have protein targets in C parvum. Of these 45, four compounds exhibited efficacy in our screen; 15.6% of predicted compounds exhibited confirmed activity using the DrugBank database, as compared to 8.9% using ChEMBL (table 2). For reference, 0.1–4% of compounds tested in high throughput cell-based small molecule screens are typically found to be active, and the compounds identified by in-silico pre-screening of either database were significantly more likely to be active than expected by chance when compared to the results from screening the NIH clinical collections (DrugBank p=0.002; ChEMBL p=0.04).25–27 Of the 27 drugs identified by both databases as potentially having anticryptosporidial activity (see table 1), four (26.7%) were confirmed inhibitors (p=0.0007 vs expected).
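The hit rates and the comparison against chance can be worked through with an exact binomial tail probability. The sketch below is an illustration under assumptions, not a reproduction of the GraphPad Prism analysis: it takes the background rate to be the 21 confirmed inhibitors out of 727 screened compounds, and it uses a one-sided test (the probability of seeing at least the observed number of hits), since the text does not say whether the reported test was one-sided or two-sided. The function names are hypothetical. Under these assumptions the tail probabilities should land close to the reported p-values for the two databases.

```ruby
# Binomial coefficient C(n, k) using exact integer arithmetic.
def binomial_coefficient(n, k)
  (1..k).reduce(1) { |acc, i| acc * (n - k + i) / i }
end

# Exact upper-tail probability P(X >= k) for X ~ Binomial(n, p).
def binomial_upper_tail(k, n, p)
  (k..n).sum { |i| binomial_coefficient(n, i) * p**i * (1 - p)**(n - i) }
end

# Assumed null rate: confirmed inhibitors / compounds in the full screen.
background_rate = 21.0 / 727

# Confirmed hits and number of predicted compounds for each database.
screens = { "DrugBank" => [5, 32], "ChEMBL" => [4, 45] }

results = screens.map do |db, (hits, predicted)|
  hit_rate = 100.0 * hits / predicted
  p_value  = binomial_upper_tail(hits, predicted, background_rate)
  puts format("%-8s hit rate %.1f%%, one-sided p = %.4f", db, hit_rate, p_value)
  [db, { hit_rate: hit_rate, p_value: p_value }]
end.to_h
```

A two-sided test would give somewhat different values, so the sidedness is the main assumption to revisit when comparing against the published figures.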

Table 2

Validation of bioinformatic approach using ChEMBL and DrugBank alignments with in-vitro small molecule screening data

Measure | ChEMBL (%) | DrugBank (%)
Confirmed hit rate | 8.9* | 15.6**
Sensitivity (TP/(TP+FN)) | 44.4 | 45.5
Accuracy ((TP+TN)/ALL) | 89.2 | 92.1
Specificity (TN/(TN+FP)) | 90.2 | 93.4
False discovery rate (FP/(FP+TP)) | 91.1 | 84.4
  • *Indicates p=0.04 and ** indicates p=0.002 versus expected confirmation rate based on results from the NIH clinical collections screen (binomial test).

  • ALL, all predictions; FN, false negative; FP, false positive; TN, true negative; TP, true positive.
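The measures in Table 2 all follow from a two-by-two confusion matrix, which can be written out as a short Ruby sketch. The true positive and false positive counts come from the text (five of 32 DrugBank predictions and four of 45 ChEMBL predictions were confirmed). The false negative and true negative counts are not stated in the paper; the values used here are back-calculated assumptions chosen so that the totals match the 417 and 427 recognized compounds and the reported sensitivities. The `Counts` struct and its method names are hypothetical.

```ruby
# Confusion-matrix counts and the measures defined in the Table 2 footnotes.
Counts = Struct.new(:tp, :fp, :fn, :tn) do
  def total
    tp + fp + fn + tn
  end

  def hit_rate              # TP / (TP + FP), the confirmed hit rate
    100.0 * tp / (tp + fp)
  end

  def sensitivity           # TP / (TP + FN)
    100.0 * tp / (tp + fn)
  end

  def accuracy              # (TP + TN) / ALL
    100.0 * (tp + tn) / total
  end

  def specificity           # TN / (TN + FP)
    100.0 * tn / (tn + fp)
  end

  def false_discovery_rate  # FP / (FP + TP)
    100.0 * fp / (fp + tp)
  end
end

# TP and FP are from the text; FN and TN are inferred, not reported.
drugbank = Counts.new(5, 27, 6, 379)   # 417 recognized compounds
chembl   = Counts.new(4, 41, 5, 377)   # 427 recognized compounds

{ "DrugBank" => drugbank, "ChEMBL" => chembl }.each do |name, c|
  puts format("%-8s hit %.1f  sens %.1f  acc %.1f  spec %.1f  FDR %.1f",
              name, c.hit_rate, c.sensitivity, c.accuracy,
              c.specificity, c.false_discovery_rate)
end
```

With these inferred counts the computed percentages agree with Table 2 to within rounding, which suggests (but does not prove) that the counts match those behind the published table.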


The closely related Trypanosomatidae show large overlaps in predicted bioactivities (or predicted drug spectrum) as compared to the more divergent Apicomplexa and Amitochondriate groups (figure 3A,B). Phylogenetic analyses have predicted L donovani to be more closely related to L major than L braziliensis.28,29 However, in our analysis, cutaneous L major and L braziliensis have different predicted drug susceptibilities from the visceral L donovani for ChEMBL hits (figure 3A), suggesting that predicted drug spectra reflect pathogenesis more than taxonomy. On closer inspection we discovered that most of the differences in the predicted drug spectrum between Leishmania species arose from the presence or absence of ATP-binding cassette transporters implicated in multidrug resistance in their respective proteomes. From research in cancer to bacteria, many scientific fields study multidrug-resistant channels and the known bioactivities cataloged in ChEMBL reflect this effort. When multidrug-resistant channels are omitted from the alignments, the Trypanosomatidae predicted drug spectrum for ChEMBL alignments (figure 3A) is much more similar to that of DrugBank (figure 3B and data not shown).

Figure 3

Venn diagrams depicting the overlap in predicted drug susceptibilities for each organism via (A) ChEMBL alignments and (B) DrugBank alignments, drawn to scale.
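The overlaps summarized in these Venn diagrams amount to set intersections of the predicted drug lists. The sketch below shows the idea with made-up drug names; the sets are purely illustrative and do not reflect the actual predictions. The Jaccard index is added here as one convenient summary of overlap and is not a measure used in the paper. The names `predicted` and `jaccard` are hypothetical.

```ruby
require "set"

# Hypothetical predicted-drug sets; real sets would come from the screen output.
predicted = {
  "L major"        => Set["drug_a", "drug_b", "drug_c", "drug_d"],
  "L braziliensis" => Set["drug_a", "drug_b", "drug_c", "drug_e"],
  "L donovani"     => Set["drug_a", "drug_b", "drug_f"]
}

# Jaccard index: size of the intersection divided by size of the union.
def jaccard(a, b)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size.to_f / union.size
end

# Pairwise overlap, i.e. the two-way regions of a Venn diagram.
predicted.keys.combination(2).each do |x, y|
  shared = predicted[x] & predicted[y]
  puts format("%s vs %s: %d shared, Jaccard %.2f",
              x, y, shared.size, jaccard(predicted[x], predicted[y]))
end

# Drugs predicted for every organism, i.e. the central region of the diagram.
core = predicted.values.reduce(:&)
puts "Predicted for all three: #{core.to_a.sort.join(', ')}"
```

Removing a class of targets, such as the multidrug-resistance transporters discussed above, would correspond to subtracting the associated drugs from each set before recomputing the overlaps.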


In the Amitochondriate group there are a large number of predicted targets for the amoeboflagellate, N gruberi, from both ChEMBL and DrugBank alignments. N gruberi is closely related to N fowleri, an extracellular parasite that causes primary amebic meningoencephalitis (PAM), an infection that is almost invariably fatal. Although the mortality rate is very high, the case rate for N fowleri is not. Consequently, the market for a treatment is small and there are few novel therapeutics for PAM in development. BLAST hits of known drug targets against N gruberi identified 59 putative targets and 238 corresponding therapeutics, more than all other organisms tested. Granted, any treatment for PAM would have to achieve an effective concentration in a patient's cerebrospinal fluid, but the sheer number of predicted inhibitors is encouraging that one or more may show efficacy in vivo.


In this study we have presented a computational approach for identifying potential candidate drugs for repurposing. In-vitro validation of the method suggests that identified candidate drugs will have a high probability of activity in cell-based follow-up assays compared to historical rates of activity (0.1–4%) based on screening libraries of compounds with known bioactivity.25–27 Our hope is that this method can be either pipelined into more intensive in-silico screens or used to build enriched compound libraries for in-vitro drug testing, which could enable a more efficient use of laboratory resources. The method augments currently available resources, because it can be used with any genomic data rather than being limited to analysis of supported organisms. All programming scripts and screening results for the 13 protozoan parasites are available as supplementary data (online only).

Advantages and disadvantages of alignment-based approaches

A proteome-wide alignment has advantages and disadvantages in drug repurposing, both of which became evident in the results of our small molecule growth assay of C parvum. One disadvantage is the inability to identify drugs that interact with non-protein targets. Drugs that act on DNA, for example, either through a toxic metabolite (eg, furazolidone) or intercalation (eg, quinacrine) cannot be identified.30,31 Another disadvantage of this method is that some infections, especially those of intracellular parasites such as C parvum, can be strongly affected by drugs targeting the host cell. Mevastatin, the original member of the cholesterol-lowering family of statins, was identified as active against C parvum growth by cell-based screening, yet was not identified through our in-silico analysis. We believe this is because our approach is limited to parasite targets while mevastatin is predicted to block parasite growth through modulation of the host cell metabolism.19

It is also important to note that the success of the approach described here rests largely on the quality of drug target annotations in a chosen database. With regard to our in-vitro assay, DrugBank alignments showed a higher hit rate and accuracy than ChEMBL alignments; however, the differences were not statistically significant (data not shown). Further, this may change for different organisms, and the sheer amount of information in ChEMBL, especially on experimental drug targets, makes it an essential database for the identification of putative hits. Possibly the largest disadvantage of our method is that many drugs simply do not have well-defined protein targets. Despite all these weaknesses, 15.6% of the drug predictions made with our in-silico alignments using DrugBank had confirmed activity, compared to 2.9% of the compounds from the overall screen of the NIH clinical collections.19 It should be noted that our results were obtained using default alignment parameters for the sake of simplicity while working with multiple organisms. A higher confirmation rate for predictions may be possible by adjusting alignment cut-off parameters or by conducting the BLAST searches using an organism-specific BLOSUM matrix.

The advantages offered by the approach described here are speed, broadness, and cost. Local alignments take a nominal amount of time, and when pipelined into more extensive modeling/docking programs allow for less computationally expensive in-silico pipelines. Broadness, in this context, refers to the potential uses of the alignments and the scope of each alignment. With regard to the former, potential uses can range from repurposing drugs for use against neglected diseases to identifying experimental compounds for use as probes to characterize microbial pathogenesis. With regard to the latter or broad scope of each alignment, ChEMBL contains close to 30 000 bioactivities corresponding to more than 1100 unique protein targets for approved drugs. In contrast, the Protein Data Bank currently contains less than 300 unique protein targets of approved drugs available for 3D homology comparisons, greatly reducing the potential for significant hits. Finally, the cost of this in-silico alignment method of identifying bioactivities is minimal, and the user is free to work with any published proteome available through GenBank.

Using functional annotation and pathway analysis it becomes apparent that there are a number of potential synergistic targets in C parvum's metabolic pathways, specifically in the KEGG defined pathways of propanoate and pyruvate metabolism. Identification of synergistic activity may enable effective therapy with a microbicidal drug combination, in which either drug alone has static microbial activity (figure 2C). Synergistic drug combinations were recently shown to increase the activity of fluconazole from fungistatic to fungicidal in pathogenic Candida and Cryptococcus strains.32 Should the lack of efficacy of nitazoxanide in immunocompromised individuals be from an in-vivo static activity against C parvum, then it is possible that a synergistic drug combination could be cidal.


An in-silico approach can be used as a quick, broad, and inexpensive way to pre-screen entire organism proteomes against known drug targets. To validate this approach, results from a previously published cell-based C parvum growth assay were used; 15.6% of inhibitors predicted using DrugBank and 8.9% of inhibitors predicted using ChEMBL had confirmed activity. Both percentages were significantly higher than the number expected based on data from screening the NIH clinical collections compound library and exceed the normal percentage of active compounds identified using high throughput small molecule screens, which ranges between 0.1% and 4% depending on the library and the assay method employed.25–27


This work was jointly conceived by AS, KB and CDH. INS provided expertise in data acquisition and management. AS wrote the code and analyzed the data. KB provided wet-bench screening data used for confirmation of in-silico results. All authors contributed to writing the manuscript.


This project was supported by NIAID R21AI101381 to CDH; AS is supported by NIAID R01 AI072021, KB is supported by T32 AI0055402-06.

Competing interests


Provenance and peer review

Not commissioned; externally peer reviewed.

Data sharing statement

All data generated from this work have been included in the supplementary information (available online only). All of the programming scripts required for use of this method have also been included in the supplementary information (available online only).


The authors would like to thank Jose Teixeira and Peter Miller for their helpful conversations.

