
R Engine Cell: integrating R into the i2b2 software infrastructure

Daniele Segagni, Fulvia Ferrazzi, Cristiana Larizza, Valentina Tibollo, Carlo Napolitano, Silvia G Priori, Riccardo Bellazzi
DOI: http://dx.doi.org/10.1136/jamia.2010.007914. Pages 314-317. First published online: 1 May 2011

Abstract

Informatics for Integrating Biology and the Bedside (i2b2) is an initiative funded by the NIH that aims at building an informatics infrastructure to support biomedical research. The University of Pavia has recently integrated i2b2 infrastructure with a registry of inherited arrhythmogenic diseases. Within this project, the authors created a novel i2b2 cell, named R Engine Cell, which allows the communication between i2b2 and the R statistical software. As survival analyses are routinely performed by cardiology researchers, the authors have first concentrated on making Kaplan–Meier analyses available within the i2b2 web interface. To this aim, the authors developed a web-client plug-in to select the patient set on which to perform the analysis and to display the results in a graphical, intuitive way. R Engine Cell has been designed to easily support the integration of other R-based statistical analyses into i2b2.

Introduction

Informatics for Integrating Biology and the Bedside (i2b2) is an initiative funded by the NIH Roadmap National Centers for Biomedical Computing and headed by Partners HealthCare Center in Boston. The i2b2 project aims at building an informatics infrastructure to support biomedical research by storing and integrating data from clinical practice, and making them accessible in anonymized form to biomedical researchers.13 For this purpose, i2b2 has developed a software architecture structured as a ‘hive’ with different software ‘cells’ devoted to data-extraction, data-manipulation, or data-analysis tasks. The software has been developed in Java and exploits web services for the communication between the cells.

The University of Pavia, as an academic partner of the i2b2 project, has been implementing the i2b2 architecture in the research hospitals of Pavia, Italy. In one of these, the IRCCS Fondazione S Maugeri (FSM), the i2b2 infrastructure has been integrated with the Transatlantic Registry of Inherited Arrhythmogenic Diseases (TRIAD), a joint project of the Molecular Cardiology Department at FSM and the Langone Medical Center at New York University.

In order to augment the statistical analyses that can be performed within i2b2-TRIAD, we decided to rely on the R software.4 R is one of the most powerful free statistical environments, available in source code form under the terms of the Free Software Foundation's GNU General Public License. R is constantly expanded via packages, and it contains a very wide set of statistical tools of particular interest for biomedical research. We created a novel cell named the R Engine Cell (RE Cell) that allows communication between the i2b2 architecture and the R software. As survival analyses are routinely performed by cardiology researchers at FSM, we first concentrated on making Kaplan–Meier analyses available within the i2b2 web interface. To this end, we developed a web-client plug-in that allows users to easily select the patient set on which to perform the analysis and displays the results in a graphical, intuitive way. The Methods section describes in more detail our twofold implementation strategy, namely the implementation of the RE Cell and of the Kaplan–Meier plug-in, while the Observations section includes an example of use. The source code for RE Cell and the Kaplan–Meier plug-in is available at http://www.orbitproject.org/resource/r-engine-cell-integrating-r-i2b2-software-infrastructure under the LGPL license.

Methods

R Engine Cell architecture and implementation

The i2b2 hive is a set of server-side software modules, called cells, which contain applications for accessing and manipulating data.5

Cells communicate with the i2b2 web client through RESTful web services,6 which exploit XML messages compliant with the i2b2 XML specifications. Once the cell has received an XML request from the web client and processed it, the cell service sends back an XML response message containing information about the final status of the request and the service results.

Our novel R Engine Cell is inserted into the i2b2 hive and allows users to dynamically run predefined R statements, in our case Kaplan–Meier survival analyses, on the data coming from the i2b2 data warehouse. Figure 1 shows the system components and their inter-relationships. The web client first requests the analysis data from the clinical research chart (CRC) cell (data are split into patients, visits, and observations, where observations for the Kaplan–Meier survival analysis consist of concepts related to events and therapies). The client then sends the data and the analysis request to the RE Cell through dynamically created XML messages, structured into the following tags:

  • <operation_type> defines the type of operation RE Cell has to perform. The possible values are: ‘data’ if the standard i2b2 <message_body> contains information about patients, visits or observations, and ‘analysis’ if the request asks for an analysis of previously sent data.

  • <operation_name> defines the name of the analysis the cell has to perform. Currently, RE Cell embeds the Kaplan–Meier survival analysis procedure, which is invoked by setting the tag value to ‘KM_analysis.’

  • <cell_status> defines the operation-dependent cell status. The allowed values are ‘new’ for new cell calls, ‘open’ if another operation has been previously performed on the same data, and ‘closed’ if it is the last cell request. If the value is ‘closed,’ the <operation_type> value has to be set to ‘analysis’; if the cell status value is ‘new,’ the cell uses the information about the user, request date, and time to create a temporary folder in which the incoming patient data are saved.

  • <data_type> defines the type of data the request passes as a parameter. The possible values are: ‘patients,’ ‘visits’ or ‘observations.’

  • <data_tag> contains the XML data taken from the i2b2 CRC.
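The tag structure listed above can be illustrated with a minimal Python sketch that assembles such a request and enforces the stated rule that a ‘closed’ cell status requires an ‘analysis’ operation. This is only an illustration under stated assumptions: the actual RE Cell is written in Java and wraps these tags in the standard i2b2 message envelope, whereas the `re_request` wrapper element and the function name `build_re_request` used here are hypothetical.

```python
import xml.etree.ElementTree as ET

# Allowed values as listed in the text.
OPERATION_TYPES = {"data", "analysis"}
CELL_STATUSES = {"new", "open", "closed"}
DATA_TYPES = {"patients", "visits", "observations"}


def build_re_request(operation_type, operation_name, cell_status,
                     data_type=None, data_xml=None):
    """Build a simplified RE Cell request (hypothetical envelope)."""
    if operation_type not in OPERATION_TYPES:
        raise ValueError(f"unknown operation_type: {operation_type!r}")
    if cell_status not in CELL_STATUSES:
        raise ValueError(f"unknown cell_status: {cell_status!r}")
    # Rule from the text: the last request ('closed') must be an analysis.
    if cell_status == "closed" and operation_type != "analysis":
        raise ValueError("cell_status 'closed' requires operation_type 'analysis'")
    if operation_type == "data" and data_type not in DATA_TYPES:
        raise ValueError(f"unknown data_type: {data_type!r}")

    root = ET.Element("re_request")  # hypothetical wrapper element
    ET.SubElement(root, "operation_type").text = operation_type
    ET.SubElement(root, "operation_name").text = operation_name
    ET.SubElement(root, "cell_status").text = cell_status
    if data_type is not None:
        ET.SubElement(root, "data_type").text = data_type
    if data_xml is not None:
        # Embed the XML fragment obtained from the CRC cell.
        ET.SubElement(root, "data_tag").append(ET.fromstring(data_xml))
    return ET.tostring(root, encoding="unicode")
```

A typical exchange would send one ‘data’ request per data type (the first with status ‘new’, the following ones with ‘open’) and finish with a single ‘analysis’ request with status ‘closed’.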

Figure 1

System components and their inter-relationships. (1) The web plug-in needs to receive patients, visits, and observation data from the clinical research chart (CRC) cell. To this end, an XML request for data is sent from the plug-in to CRC using the i2b2 JavaScript function i2b2.CRC.ajax.getPDO_fromInputList. (2) CRC Cell sends back to the plug-in an XML response containing the requested data (extracted from the i2b2 data warehouse). (3) The web client plug-in sends the data to the RE Cell through dynamically created XML messages, structured as detailed in the subsection ‘R Engine Cell architecture and implementation’. To this end, the JavaScript function i2b2.RECell.ajax.setREngineOperation is called, also specifying the jar application to be used for the data analysis. (4) The RE Cell creates the dataset for the analysis by parsing the XML messages with predefined methods, as outlined in the subsection ‘R Engine Cell architecture and implementation’. The RE Cell then runs the Kaplan–Meier jar application. This application, through the JRI libraries, uses the R statistical software installed on the same server machine. The RE Cell public method writeHTMLReport() is employed to save results in HTML format (under the i2b2 web server installation directory). (5) The RE Cell returns to the web client plug-in the URL where the results have been saved. The web client plug-in shows the survival analysis HTML report and related graphics.

In order to parse the XML message and create the response XML message, a set of Java objects has been created inside RE Cell. These objects were generated by tailoring the ANT script present in the i2b2 cell installation package. To this end, we introduced our cell specification and relied on the i2b2 XML schemas, which consist of the three i2b2 default XSD files related to the different data types.7 A set of predefined methods to parse the XML message has also been created for RE Cell. These methods are used to unmarshal the XML request message originated by the client, instantiating a tree of Java content objects that represent the content and organization of the XML message. Different methods to extract patient-related information have been defined. For example, methods are available to extract a patient's birth date and gender, the concept associated with an observation, an observation's start and end date, the numeric/textual value associated with an observation, and the start and end date of a visit.
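The unmarshalling step described above can be sketched in Python as follows. This is a rough illustration, not the authors' Java implementation: the element and attribute names (`patient`, `birth_date`, `gender`, `observation`, `concept`, and so on) are simplified assumptions and do not reproduce the actual i2b2 XSD schemas, and the class and function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Patient:
    patient_id: str
    birth_date: date
    gender: str


@dataclass
class Observation:
    patient_id: str
    concept: str
    start_date: date
    end_date: Optional[date]   # open-ended observations have no end date
    value: Optional[str]       # numeric or textual value, if any


def _opt_date(text: Optional[str]) -> Optional[date]:
    """Convert an ISO date string to a date, passing None through."""
    return date.fromisoformat(text) if text else None


def parse_patients(xml_text: str) -> List[Patient]:
    """Turn a (simplified, assumed) patient XML fragment into objects."""
    root = ET.fromstring(xml_text)
    return [
        Patient(p.get("id"),
                date.fromisoformat(p.findtext("birth_date")),
                p.findtext("gender"))
        for p in root.iter("patient")
    ]


def parse_observations(xml_text: str) -> List[Observation]:
    """Turn a (simplified, assumed) observation XML fragment into objects."""
    root = ET.fromstring(xml_text)
    return [
        Observation(o.get("patient_id"),
                    o.findtext("concept"),
                    date.fromisoformat(o.findtext("start_date")),
                    _opt_date(o.findtext("end_date")),
                    o.findtext("value"))
        for o in root.iter("observation")
    ]
```

Keeping the parsed data in typed objects mirrors the idea in the text: an analysis routine reads fields such as birth date or concept directly instead of navigating the XML tree itself.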

RE Cell is then able to marshal the created Java objects in order to generate the XML response message to be sent back to the client. The RE Cell exposes only one web service developed in a RESTful way, like all other i2b2 cells, so that it can be called directly through an Ajax function from the web-client plug-in.8

To embed the R software for statistical computing into our cell, we used the JRI Java library.9 JRI is a Java/R Interface that allows R to be run inside Java applications as a single thread, and it is released only as part of another project called rJava.10 Thus, in order to set up the server machine, first the R software (including R survival library11) and then the rJava binaries have to be installed.

Using the JRI library, we developed a Java application to execute the Kaplan–Meier survival analysis. This application, installed on the i2b2 server as a runnable JAR application, parses the i2b2 XML information about patients, visits, and observations in order to create the correct input source to run the analysis using the embedded R software. The RE Cell runs the specific JAR application using the information stored in the XML <operation_name> tag only if <operation_type> is set to ‘analysis.’
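The dispatch rule in the paragraph above (run the JAR named by <operation_name> only when <operation_type> is ‘analysis’) might look like the following Python sketch. The registry contents, the JAR file name `km_analysis.jar`, the working-directory argument, and the function name are all hypothetical; the sketch only builds the command line and does not launch a process.

```python
from typing import List, Optional

# Maps <operation_name> values to runnable JARs.
# The file name is a made-up placeholder, not the authors' actual artifact.
ANALYSIS_JARS = {
    "KM_analysis": "km_analysis.jar",
}


def analysis_command(operation_type: str, operation_name: str,
                     work_dir: str) -> Optional[List[str]]:
    """Return the command to launch for an analysis request, else None."""
    # 'data' requests only store incoming data; nothing is executed.
    if operation_type != "analysis":
        return None
    try:
        jar = ANALYSIS_JARS[operation_name]
    except KeyError:
        raise ValueError(f"unsupported operation_name: {operation_name!r}") from None
    # The work_dir argument (where the parsed data live) is an assumption.
    return ["java", "-jar", jar, work_dir]
```

With a lookup table of this kind, supporting a further R-based analysis amounts to adding one entry and shipping the corresponding JAR.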

Plug-in implementation and usage

Using tailored plug-ins developed specifically for the i2b2 web client, the research user can directly access the service exposed by each single hive cell, using patient sets previously extracted with ad-hoc queries.

To perform a Kaplan–Meier survival analysis, the user runs the dedicated i2b2 web plug-in, named the Kaplan–Meier Statistical module, which we developed and integrated into the Plug-ins section of the i2b2 Analysis Tool web page. After dragging and dropping a patient set into the specific box, the user has to select the ‘cardiological event’ to be analyzed from a given set of possible choices. This set can be modified by editing the relevant XML file, which is saved in the plug-in folder and loaded through an Ajax GET function. Clicking the ‘Run Kaplan–Meier Analysis’ button then starts the survival analysis. During data analysis, communication between the client and RE Cell takes place through XML messages structured as described in the previous section. Once the R Engine Cell has finished parsing the XML, the Kaplan–Meier Java application uses the retrieved information to prepare the data in a format that the R software can use. When the survival analysis has been completed, RE Cell uses information about the user and project to store the result files in the web plug-in dedicated folder: the survival curves are saved in JPEG format, while the statistical results are saved as HTML files. Using Ajax methods, the web plug-in loads and shows these files.

Observations

The current implementation of i2b2 at FSM contains all data from the cardiology TRIAD database, which have been used to populate every i2b2 dimension table of its warehouse architecture comprising patients, visits, and observations, each associated with a concept.12 At present, a total of 5369 patients, 12 060 visits, 342 concepts, and 198 494 observations are contained in the i2b2-TRIAD data warehouse. The majority of concepts belong to one of the following categories: genetic information (such as mutations in genes and their location), demographic information (such as age and gender), clinical episodes (such as cardiac arrest, electric shock device appropriate discharge, syncope), and inherited arrhythmogenic disease diagnosis (such as Brugada syndrome, long QT interval syndrome, catecholaminergic ventricular tachycardia). In order to allow the export of all relevant clinical information for a patient, we allowed the presence of observations associated with concepts that are not part of the i2b2-TRIAD ontology. Significant examples of observations of this type are the results of tests such as ‘Holter monitor’ and ‘effort stress test.’

The novel i2b2 plug-in for Kaplan–Meier survival analysis has been tailored to work on patients inserted in the i2b2-TRIAD. In the current implementation, the survival analysis can be performed on one of three different clinical episodes, each associated with a concept, namely ‘syncope,’ ‘cardiac arrest,’ or ‘electric shock device appropriate discharge.’ An ‘event’ is the occurrence of the chosen episode in a patient who is not undergoing therapy at the time of observation, and the associated time is the age of the patient (in years) at the same time.
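The event definition above can be made concrete with a short Python sketch that derives one (time, event) pair per patient. Two points are assumptions not stated in the text: a patient without a qualifying episode is treated as censored at the age reached at a `last_follow_up` date, and ages are computed as days divided by 365.25. All function names are hypothetical.

```python
from datetime import date
from typing import Iterable, Optional, Tuple

Interval = Tuple[date, Optional[date]]  # (start, end); end None = ongoing


def age_years(birth: date, when: date) -> float:
    """Age in years at a given date (365.25-day years, an assumption)."""
    return (when - birth).days / 365.25


def under_therapy(when: date, therapies: Iterable[Interval]) -> bool:
    """True if any therapy interval covers the given date."""
    return any(start <= when and (end is None or when <= end)
               for start, end in therapies)


def survival_record(birth: date, episode_dates: Iterable[date],
                    therapies: Iterable[Interval],
                    last_follow_up: date) -> Tuple[float, int]:
    """Return (age in years, event flag) for one patient.

    The event is the first occurrence of the chosen episode while the
    patient is not under therapy; otherwise the patient is censored at
    the age reached at last_follow_up (assumed censoring rule).
    """
    therapies = list(therapies)
    for when in sorted(episode_dates):
        if not under_therapy(when, therapies):
            return age_years(birth, when), 1
    return age_years(birth, last_follow_up), 0
```

Episodes that occur during therapy are skipped rather than counted, which follows the wording ‘not undergoing therapy at the time of observation’ in the text.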

The analysis by default groups patients by gender, building separate survival curves for males and females, and performing the logrank hypothesis test to compare the two survival curves.13 Figure 2 shows a screenshot of the plug-in with results of a survival analysis for the event ‘cardiac arrest’ performed on patients with Brugada syndrome aged <70.
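For readers unfamiliar with the two statistics mentioned above, the following Python sketch computes a Kaplan-Meier curve and a two-group logrank test from (time, event) pairs. It is a plain textbook implementation offered for illustration; the RE Cell itself delegates these computations to R and its survival library, and the function names here are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return [(t, S(t))] at each distinct time with at least one event."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Gather all subjects sharing the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


def logrank(times_a, events_a, times_b, events_b) -> Tuple[float, float]:
    """Two-group logrank test: returns (chi-square, p-value) with 1 df."""
    a = list(zip(times_a, events_a))
    b = list(zip(times_b, events_b))
    event_times = sorted({t for t, e in a + b if e})
    observed = expected = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x, _ in a if x >= t)      # at risk in group A
        n_b = sum(1 for x, _ in b if x >= t)      # at risk in group B
        d_a = sum(1 for x, e in a if x == t and e)
        d_b = sum(1 for x, e in b if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed += d_a
        expected += d * n_a / n
        if n > 1:  # hypergeometric variance is undefined for n == 1
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

In the plug-in's default setting the two groups would be the male and female patients, with times given as ages in years.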

Figure 2

Screenshot of the plug-in results of a survival analysis for the event ‘cardiac arrest.’ The analysis was performed on 1548 patients with Brugada syndrome aged less than 70 years (354 females, 1189 males). The event occurred for three female and 54 male patients, giving rise to significantly different survival curves. The survival plot produced by the plug-in has been superimposed onto the screenshot.

Discussion

The aim of implementing our novel RE Cell was to allow users to exploit R statistical capabilities within the i2b2 web interface. We decided to develop an application for survival analysis first, as this type of analysis is routinely performed by clinical researchers. However, RE Cell has been designed to easily support the integration of other R-based applications. This integration requires two key steps: (1) a novel R routine must be embedded into RE Cell by implementing it through the JRI library and creating the corresponding jar file; (2) a novel, dedicated web-client plug-in must be implemented. Alternatively, the existing plug-in for survival analyses can be extended in order to accommodate the novel functionalities. The web-client plug-in queries the i2b2 CRC (Clinical Research Chart) Cell in order to obtain data for analysis and then sends the data to the RE Cell through XML messages (figure 1). The novel R routine is responsible for creating the dataset in the appropriate format for analysis, calling the proper R functions, and saving the results. Dataset creation implies extracting the data from the XML messages. For this task, however, it is possible to exploit the set of dedicated methods provided by RE Cell to parse the XML messages. The availability of these methods substantially reduces the burden on the R-routine writer.

In an effort to continuously improve i2b2 ease of use for hospital researchers, in addition to the Kaplan–Meier plug-in we have already added two other novel web plug-ins to i2b2-TRIAD for data export and for phenotype exploration. In all our work, we paid particular attention to being fully compliant with the i2b2 architecture, and thus we always relied on the i2b2 technical framework and exploited XML messages compliant with the i2b2 XML schemas.7

We plan to continue extending the capabilities of the i2b2-TRIAD implementation at FSM to suit the needs of biomedical researchers accessing the system. Moreover, the i2b2-Pavia group is involved in a project aimed at supporting oncology translational research, which has recently been financed by the Lombardia region. This project is based on the idea of integrating the FSM oncology IT system with the i2b2 infrastructure. To this end, we are importing data from the hospital database into the i2b2 CRC, adding observations extracted from medical reports using natural-language-processing applications, and linking i2b2 with the FSM biobank.

Funding

This work has been supported by IRCCS Fondazione S Maugeri and by FIRB project ITALBIONET ‘Rete Nazionale di Bioinformatica’ funded by the Italian Ministry of Education and Research.

Competing interests

None.

Provenance and peer review

Not commissioned; externally peer reviewed.

Acknowledgments

We thank C Prithiani for his technical support on the TRIAD registry.
