

Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium

Jyotishman Pathak, Kent R Bailey, Calvin E Beebe, Steven Bethard, David S Carrell, Pei J Chen, Dmitriy Dligach, Cory M Endle, Lacey A Hart, Peter J Haug, Stanley M Huff, Vinod C Kaggal, Dingcheng Li, Hongfang Liu, Kyle Marchant, James Masanz, Timothy Miller, Thomas A Oniki, Martha Palmer, Kevin J Peterson, Susan Rea, Guergana K Savova, Craig R Stancl, Sunghwan Sohn, Harold R Solbrig, Dale B Suesse, Cui Tao, David P Taylor, Les Westberg, Stephen Wu, Ning Zhuo, Christopher G Chute
DOI: http://dx.doi.org/10.1136/amiajnl-2013-001939 e341-e348 First published online: 1 December 2013

Abstract

Research objective To develop scalable informatics infrastructure for normalization of both structured and unstructured electronic health record (EHR) data into a unified, concept-based model for high-throughput phenotype extraction.

Materials and methods Software tools and applications were developed to extract information from EHRs. Representative and convenience samples of both structured and unstructured data from two EHR systems—Mayo Clinic and Intermountain Healthcare—were used for development and validation. Extracted information was standardized and normalized to meaningful use (MU) conformant terminology and value set standards using Clinical Element Models (CEMs). These resources were used to demonstrate semi-automatic execution of MU clinical-quality measures modeled using the Quality Data Model (QDM) and an open-source rules engine.

Results Using CEMs and open-source natural language processing and terminology services engines—namely, Apache clinical Text Analysis and Knowledge Extraction System (cTAKES) and Common Terminology Services (CTS2)—we developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. We demonstrated the applicability of this platform by executing a QDM-based MU quality measure that determines the percentage of patients between 18 and 75 years with diabetes whose most recent low-density lipoprotein cholesterol test result during the measurement year was <100 mg/dL on a randomly selected cohort of 273 Mayo Clinic patients. The platform identified 21 and 18 patients for the denominator and numerator of the quality measure, respectively. Validation results indicate that all identified patients meet the QDM-based criteria.

Conclusions End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and terminologies, as well as robust information models for storing, discovering, and processing that information. This study demonstrates the application of modular and open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent format for high-throughput phenotyping to identify patient cohorts.

Keywords
  • Electronic health record
  • Meaningful Use
  • Normalization
  • Natural Language Processing
  • Phenotype Extraction

Introduction

The Office of the National Coordinator for Health Information Technology (HIT) in 2010 established the Strategic Health IT Research Program (SHARP) to address research and development challenges in wide-scale adoption of HIT tools and technologies for improved patient care and a cost-effective healthcare ecosystem. SHARP has four areas of focus1: security in HIT (SHARPs2; led by University of Illinois, Urbana-Champaign), patient-centered cognitive support (SHARPc3; led by University of Texas, Houston), healthcare applications and network design (SMART4; led by Harvard University), and secondary use of electronic health records (EHRs) (SHARPn5; led by Mayo Clinic).

The Mayo Clinic-led SHARPn program aims to enhance patient safety and improve patient medical outcomes by enabling the use of standards-based EHR data in research and clinical practice. A core component for achieving this goal is the ability to transform heterogeneous patient health information, typically stored in multiple clinical and health IT systems, into standardized, comparable, consistent, and queryable data.6,7 To this end, over the last 3 years, SHARPn has developed a suite of open-source and publicly available applications and resources that support ubiquitous exchange, sharing and reuse of EHR data collected during routine clinical processes in the delivery of patient care. In particular, the research and development efforts have focused on four main areas: (1) clinical information modeling and terminology standards; (2) a framework for normalization and standardization of clinical data—both structured and unstructured data extracted from clinical narratives; (3) a platform for representing and executing patient cohort identification and phenotyping logic; and (4) evaluation of data quality and utility of the SHARPn resources.

In this article, we describe current research progress and future plans for these foci of the SHARPn program. We illustrate our work via use cases in high-throughput phenotype extraction from EHR data, specifically for meaningful use (MU) clinical quality measures (CQMs), and discuss the challenges to shared, secondary use of EHR data relative to the work presented.

Methods

SHARPn organizational and functional framework

The SHARPn vision is to develop and foster robust, scalable and pragmatic open-source informatics resources and applications that can facilitate large-scale use of heterogeneous EHR data for secondary purposes. In the past 3 years, via this highly collaborative project with team members from seven different organizations, we have developed an organizational and functional framework for SHARPn that is structured around informatics infrastructure for (1) information modeling and terminology standards, (2) clinical natural language processing (NLP), (3) clinical phenotyping, and (4) data quality. We describe these infrastructural components briefly in the following sections.

Clinical information modeling and terminology standards

Clinical concepts must be normalized if decision support and analytic applications are to operate reliably on heterogeneous EHR data. To achieve this goal, we adopted Clinical Element Models (CEMs)8 as target information models, which are designed to provide a consistent architecture for representing clinical information in EHR systems. The goal of CEMs has been to specify granular, computable models of all elements that might be stored in an EHR system. The intent is to use these models in validating data during persistence and in developing the services and applications that exchange data, thus promoting interoperability between applications and systems. For the secondary use needs of SHARPn, a set of generic CEMs for capturing core clinical information, such as ‘Medication’, ‘Labs’, and ‘Diagnosis’, has been defined. CEMs are distributed in the following formats: Tree, Constraint Definition Language (CDL, the language GE Healthcare developed for authoring CEMs), and Extensible Markup Language (XML) Schema Definitions (XSDs). Figure 1 shows the demographics model, SecondaryUsePatient, in Tree format, where the brackets show the cardinality of the corresponding feature. For example, a SecondaryUsePatient instance has at least one entry for PersonName (unbounded indicates that a demographics instance can have multiple entries for PersonName), while all other entries are optional. The CEM is authored in CDL and processed by a compiler, which validates the CDL and can generate various outputs, one of which is a SHARPn-specific XSD structure used to convey patient data instances conforming to CEMs.

Figure 1

SecondaryUsePatient Clinical Element Model.
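To make the Tree-format cardinality concrete, the following minimal sketch mirrors the SecondaryUsePatient description in Python. The field names, types, and validation rule are illustrative simplifications drawn from the figure 1 description, not the actual CEM, CDL definition, or SHARPn XSD.

```python
# Illustrative stand-in for the SecondaryUsePatient CEM; not the real model definition.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SecondaryUsePatient:
    person_names: List[str]                      # PersonName, cardinality 1..* (unbounded)
    birth_date: Optional[str] = None             # optional entries (cardinality 0..1)
    administrative_gender: Optional[str] = None
    administrative_race: Optional[str] = None

    def validate(self) -> None:
        """Enforce the 1..* cardinality on PersonName described for figure 1."""
        if not self.person_names:
            raise ValueError("SecondaryUsePatient requires at least one PersonName")


patient = SecondaryUsePatient(person_names=["Doe, Jane"], birth_date="1960-04-12")
patient.validate()  # passes; an empty person_names list would raise ValueError
```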

A key component in defining and instantiating the CEMs is the use of structured codes and terminologies. In particular, the CEMs used by SHARPn adopt MU standards and value sets (use-case-specific subsets of standard vocabularies) for specifying diagnoses, medications, laboratory results, and other classes of data. Terminology services—especially mapping services—are essential, and we have adopted Common Terminology Services 2 (CTS2)9 as our core terminology infrastructure.
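As a rough illustration of the kind of code mapping such terminology services provide, the sketch below resolves invented local codes to MU-standard codes with a hand-built dictionary. In the actual platform this lookup is served by CTS2; the local codes and the dictionary itself are assumptions made only for the example.

```python
# Toy semantic mapping table; in SHARPn this role is played by CTS2, not a local dict.
LOCAL_TO_STANDARD = {
    # (local code system, local code) -> (standard code system, standard code)
    ("LOCAL-DX", "DM2"): ("ICD-9-CM", "250.00"),     # hypothetical local diabetes code
    ("LOCAL-LAB", "LDL123"): ("LOINC", "13457-7"),   # hypothetical local LDL-C code
}


def normalize_code(system: str, code: str):
    """Return the standard (system, code) pair, or None if no mapping has been curated."""
    return LOCAL_TO_STANDARD.get((system, code))


print(normalize_code("LOCAL-DX", "DM2"))   # ('ICD-9-CM', '250.00')
print(normalize_code("LOCAL-DX", "???"))   # None -> flag for manual curation
```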

Clinical data normalization and standardization

Syntactic and semantic data normalization

Figure 2 shows the data normalization pipeline architecture. The SHARPn data normalization pipeline adopts Mirth Connect, an open-source healthcare integration engine, as the interface engine, and Apache Unstructured Information Management Architecture (UIMA) as the software platform.10 It takes advantage of Mirth's ability to support the creation of interfaces between disparate systems, and UIMA's resource configuration ability to enable the transformation of heterogeneous EHR data sources (including clinical narratives) to common clinical models and standard value sets.

Figure 2

SHARPn clinical data normalization pipeline. (1) Data to be normalized are read from the file system. These data can also be transmitted on NwHIN via TCP/IP from an external entity. (2) Mirth Connect invokes the normalization pipeline using one of its predefined channels and passes the data (eg, HL7, CCD, tabular data) to be normalized. (3) The normalization pipeline goes through initialization of the components (including loading resources from the file system or other predefined resources such as the Common Terminology Services 2 (CTS2)) and then performs syntactic parsing and semantic normalization to generate normalized data in the form of a Clinical Element Model (CEM). (4) Normalized data are handed back to Mirth Connect. (5) Mirth Connect uses one of the predefined channels to serialize the normalized CEM data to CouchDB or MySQL based on the configuration. cTAKES, clinical Text Analysis and Knowledge Extraction System; DB, database; NLP, natural language processing; UIMA, Unstructured Information Management Architecture.

The behavior of the normalization pipeline is driven through UIMA's resource configuration—that is, source and target information models (syntactic) and their associated value sets (semantic), as well as the mapping information from source to targets. As discussed in the previous section, we adopt the CEMs and the CTS2 infrastructure for defining the information models and for access to terminologies and value sets, respectively, enabling syntactic and semantic normalization. In particular, syntactic normalization specifies where (in the source data or other location) to obtain the values that will fill the target model. These are ‘structural’ mappings—mapping the input structure (eg, an HL7 message) to the output structure (a CEM). For semantic normalization, we draw value sets for problems, diagnoses, laboratory observations, medications and other classes of data from MU terminologies. The entire normalization process relies on the creation or identification of syntactic and semantic mappings, and we adopt the CTS2 Mapping Service (http://www.omg.org/spec/CTS2/1.0/) to specify these mappings. Additional details about the normalization process, along with examples, are provided by Kaggal et al.11
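The following sketch illustrates the two steps in miniature: a structural map pulls values out of a parsed source message into CEM slots, and a semantic map swaps local codes for standard ones. The field paths, codes, and the flat dictionary standing in for a CEM instance are all invented for illustration; they are not the HL7 parsing, CEM XSDs, or CTS2 mapping service used by the pipeline.

```python
# Stand-in for a parsed HL7 v2 observation segment (field paths are illustrative).
source_message = {
    "OBX.3": ("LOCAL-LAB", "LDL123"),   # observation identifier (local code)
    "OBX.5": "88",                      # observation value
    "OBX.6": "mg/dL",                   # units
}

# Syntactic ('structural') mapping: which source field fills each target slot.
STRUCTURAL_MAP = {"code": "OBX.3", "value": "OBX.5", "units": "OBX.6"}

# Semantic mapping: local code -> MU-standard code (resolved through CTS2 in practice).
SEMANTIC_MAP = {("LOCAL-LAB", "LDL123"): ("LOINC", "13457-7")}


def normalize(message: dict) -> dict:
    """Apply the structural map, then normalize the code against the semantic map."""
    cem = {slot: message[path] for slot, path in STRUCTURAL_MAP.items()}
    cem["code"] = SEMANTIC_MAP.get(cem["code"], cem["code"])
    return cem


print(normalize(source_message))
# {'code': ('LOINC', '13457-7'), 'value': '88', 'units': 'mg/dL'}
```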

Normalization of the clinical narrative using NLP

The clinical narrative within the EHR consists primarily of care providers' notes describing the patient's status, disease or tissue/image. The normalization of textual notes requires NLP methods that deal with the variety and complexity of human language. We have developed sophisticated information extraction methods to discover a relevant set of normalized summary information for a given patient in a disease- and use-case-agnostic way. We have defined six ‘templates’—abstractions of CEMs—which are populated by processing the textual information and then mapped to the models. Table 1 lists the six templates along with their attributes. The anchors for the templates are a Medication, a Sign/symptom, a Disease/disorder, a Procedure, a Laboratory result and an Anatomic site. Some attributes are relevant to all templates (for example, ‘negation_indicator’), while others are specific to a particular template (for example, ‘dosage’ is specific to Medications).

Table 1

Template generalizations of the Clinical Element Model used for the normalization of clinical narrative

Common attributes (shared by all templates): AssociatedCode, Conditional, Generic, Negation_indicator, Subject, Uncertainty_indicator

Template-specific attributes:
  • Medication: Change_status, Dosage, Duration, End_date, Form, Relative_temporal_context, Frequency, Route, Start_date, Strength
  • Sign/symptom: Alleviating_factor, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Severity, Start_date
  • Disease/disorder: Alleviating_factor, Associated_sign_symptom, Body_laterality, Body_location, Body_side, Course, Duration, End_date, Exacerbating_factor, Relative_temporal_context, Start_date
  • Procedure: Body_laterality, Body_location, Body_side, Device, Duration, End_date, Method, Relative_temporal_context, Start_date
  • Laboratory results: Abnormal_interpretation, Delta_flag, End_date, Lab_value, Ordinal_interpretation, Reference_range_narrative, Start_date
  • Anatomic site: Body_laterality, Body_side
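As a concrete, simplified picture of how a template is populated, the sketch below defines a Medication template with a few of the common and Medication-specific attributes from table 1 and fills it for a hypothetical mention. The extracted values are hard-coded stand-ins; they are not output from the actual NLP components.

```python
# Simplified Medication template; attribute subset taken from table 1, values invented.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MedicationTemplate:
    # common attributes shared by all six templates
    associated_code: Optional[str] = None    # eg, an RxNorm code
    negation_indicator: bool = False
    uncertainty_indicator: bool = False
    subject: str = "patient"
    # Medication-specific attributes
    dosage: Optional[str] = None
    frequency: Optional[str] = None
    route: Optional[str] = None


# A sentence such as 'metformin 500 mg orally twice daily' might populate:
mention = MedicationTemplate(
    associated_code="6809",     # RxNorm ingredient code for metformin
    dosage="500 mg",
    frequency="twice daily",
    route="oral",
)
print(mention)
```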

The main methods used for the normalization of the clinical narrative are rules and supervised machine learning. As with any supervised machine learning technique, the algorithms require labeled data points from which to learn patterns and against which to evaluate the output. Therefore, a corpus representative of the EHR from two SHARPn institutions (Mayo Clinic and Seattle Group Health) was sampled and deidentified following Health Insurance Portability and Accountability Act guidelines. The deidentification process was a combination of automatic output from the MITRE Identification Scrubber Tool (MIST)12 and manual review. The corpus comprises 500 K words, which we intend to share with the research community under data use agreements with the originating institutions. The corpus was annotated by experts for several layers following carefully extended or newly developed annotation guidelines conformant with established standards and conventions in the clinical NLP community to allow interoperability. The annotated layers (syntactic and semantic) allow learning structures over increasingly complex language representations, thus enabling state-of-the-art information extraction from the clinical narrative. A detailed description of the syntactic and select semantic layers is provided in Albright et al.13

The best performing NLP methods are implemented as part of the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES),14 which is built using UIMA. It comprises a variety of modules, including sentence boundary detection, syntactic parsing, named entity recognition, negation/uncertainty discovery, and coreference resolution. For method details and formal evaluations, see Wu et al,15 Albright et al,13 Choi and Palmer,16,17 Clark et al,18 Savova et al,19 Sohn et al,20 and Zheng et al.21 Additional details on cTAKES are available from http://ctakes.apache.org.
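To give a flavor of what the negation module decides, here is a deliberately tiny, NegEx-style check: a concept is flagged as negated if a cue phrase appears shortly before it in the sentence. This is purely illustrative and is not the cTAKES implementation, which combines rule-based and machine-learned components.

```python
# Toy NegEx-style negation check; illustrative only, not the cTAKES negation module.
NEGATION_CUES = ("no ", "denies ", "without ", "negative for ")


def is_negated(sentence: str, concept: str, window: int = 40) -> bool:
    """Flag the concept as negated if a cue appears within `window` chars before it."""
    text = sentence.lower()
    idx = text.find(concept.lower())
    if idx == -1:
        return False
    preceding = text[max(0, idx - window):idx]
    return any(cue in preceding for cue in NEGATION_CUES)


print(is_negated("The patient denies chest pain.", "chest pain"))    # True
print(is_negated("Chest pain started two days ago.", "chest pain"))  # False
```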

Data transfer, processing, storage and retrieval infrastructure

Data transfer/processing

We have primarily leveraged two industry-standard, open-source tools for data exchange and transfer: Mirth Connect and Aurion (NwHIN). Aurion provides gateway-to-gateway data exchange using the NwHIN XDR (Document Submission) protocol, enabling participating partners to push clinical documents in a variety of forms (HL7 2.x, clinical document architecture (CDA), and CEM). Mirth Connect software and channels have been developed to provide Sender, Receiver, Transformation, and Persistence channels in support of the various use cases and to enable interconnectivity among the SHARPn systems.

Data storage/retrieval

The SHARPn teams have developed two different technology solutions for the storage and retrieval of normalized data. The first is an open-source SQL-based solution. This solution enables each individual CEM instance record to be stored in a standardized SQL data model. Data can be queried directly from the SQL database tables and extracted as individual fields or as complete XML records. The second is a document storage solution that leverages the open-source CouchDB22 repository to store the CEM XML data as individual JavaScript Object Notation (JSON) documents, providing a more document-centric view into the data. Both solutions leverage toolsets built around the CEM data models for a variety of clinical data including ‘NotedDrugs’, ‘Labs’, ‘Administrative Diagnosis’, and other clinically relevant data.
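For the document-centric option, persisting a normalized instance amounts to writing a JSON document through CouchDB's HTTP API, as in the sketch below. The database name, document id, and document fields are assumptions for illustration; the SHARPn CEM-to-JSON layout is not reproduced here.

```python
# Sketch: store one normalized record as a JSON document in CouchDB via its HTTP API.
import requests

COUCHDB_URL = "http://localhost:5984"
DB = "cem_store"                       # assumed database name

requests.put(f"{COUCHDB_URL}/{DB}")    # create the database if it does not already exist

doc_id = "patient123-labresult-001"
cem_doc = {
    "cemType": "SecondaryUseStandardLabObsQuantitative",
    "patientId": "patient123",
    "code": {"system": "LOINC", "value": "13457-7"},
    "value": {"quantity": 88, "units": "mg/dL"},
    "observed": "2013-05-02",
}
response = requests.put(f"{COUCHDB_URL}/{DB}/{doc_id}", json=cem_doc)
print(response.json())   # eg, {'ok': True, 'id': 'patient123-labresult-001', 'rev': '1-...'}
```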

High-throughput cohort identification and phenotype extraction

SHARPn overloads the term ‘phenotyping’ to imply the algorithmic recognition of any cohort within an EHR for a defined purpose, including case–control cohorts for genome-wide association studies, clinical trials, quality metrics, and clinical decision support. In the recent past, several projects, including eMERGE,23 PGRN,24 and i2b2,25 have developed tools and technologies for identifying patient cohorts using EHRs. A key aspect of this process is to define inclusion and exclusion criteria involving EHR data fields (eg, diagnoses, procedures, laboratory results, and medications) and logical operators. We refer to these as ‘phenotyping algorithms’, which are typically represented as pseudocode with varying degrees of formality and structure. CQMs, for instance in the MU program, are examples of phenotyping algorithms maintained by the National Quality Forum (NQF), based on the Quality Data Model (QDM)26 and represented in the HL7 Health Quality Measures Format (HQMF,27 or eMeasure). The QDM is an information model and grammar intended to represent data collected during routine clinical care in EHRs as well as the basic logic required to articulate the algorithmic criteria for phenotype definitions.

While the NQF, individual measure developers, the National Library of Medicine, and others have made improvements in the clarity of eMeasure logic and coded value sets for the 2014 set of CQMs28 for MU stage 2, at least in MU stage 1, these measures often required human interpretation and translation into local queries in order to extract necessary data and produce calculations. There have been two challenges in particular: (1) local data elements in an EHR may not be natively represented in a format consistent with the QDM, including the required code systems and value sets; (2) an EHR typically does not natively have the capability to automatically consume and execute eMeasure logic. In other words, work has been needed locally to translate the logic (eg, using an approach such as SQL) and to map local data elements and codes (eg, mapping an element using a proprietary code to SNOMED). This greatly erodes the advantages of the eMeasure approach, but has been the reality for many vendors and institutions in the first stage of MU.29

To address these challenges, the SHARPn project has investigated aligning the representation of data with the QDM along with automatic interpretation and execution of eMeasures using the open-source JBoss Drools30 rules management system. Specifically, using the Apache UIMA platform, we have developed a translator tool that converts QDM-defined phenotyping algorithm criteria into executable Drools rules scripts. In essence, the tool takes the HQMF XML file along with the relevant value sets as inputs, maps the QDM categories and criteria to CEMs, and, at run time, generates JavaScript queries to the JSON-based CouchDB CEM datastore containing patient information. Figure 3 shows the overall architecture of the conversion tool, with additional details presented in Li et al.31

Figure 3

Architecture for Quality Data Model (QDM) to Drools translator system. AE, annotation engine; CAS, common analysis system; CEM, Clinical Element Model; DB, database; UIMA, Unstructured Information Management Architecture; XSLT, extensible stylesheet transformation language.
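Conceptually, the translation turns each QDM data criterion (a category plus a value set and timing constraints) into an executable predicate over normalized records. The sketch below shows that idea in Python with invented record fields and a made-up value-set fragment; the actual translator emits Drools rules and CouchDB queries rather than Python.

```python
# Conceptual sketch of a QDM data criterion as an executable predicate; illustrative only.
from datetime import date

DIABETES_VALUE_SET = {"250.00", "250.02"}      # illustrative subset of ICD-9-CM codes


def make_diagnosis_criterion(value_set, start, end):
    """Build a predicate for 'Diagnosis, active' drawn from value_set within a period."""
    def predicate(cem_record):
        return (cem_record["cemType"] == "AdministrativeDiagnosis"
                and cem_record["code"] in value_set
                and start <= cem_record["encounterDate"] <= end)
    return predicate


has_diabetes = make_diagnosis_criterion(
    DIABETES_VALUE_SET, date(2012, 1, 1), date(2012, 12, 31))
record = {"cemType": "AdministrativeDiagnosis", "code": "250.00",
          "encounterDate": date(2012, 6, 3)}
print(has_diabetes(record))   # True
```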

Data quality and consistency

Accurate and representative EHR data are required for effective secondary use in health improvement measures and research.32 There are often multiple sources with the same information in the EHR, such as medications from an application that supports the printing or messaging of prescriptions versus medication data that were processed from progress notes using NLP. Identical data elements from hospital, ambulatory, and nursing home care, for example, have variations in their original use context. We take a two-pronged approach to ensuring accurate and correct processing of source data. The first is to validate correct functionality of the SHARPn components and services. The second is to develop methods and components to monitor and report end-to-end data validity for intended users of the SHARPn deliverables. Our approach to validating the normalization pipeline was to construct an environment in which representative samples of various types of clinical data are transmitted across the internet, the data are persisted as normalized CEM-based data instances, and the CEM-based instances are reconciled with the originally transmitted data. Figure 4 illustrates this approach.

Figure 4

Conceptual diagram of validation processes. (1) Use-case data are translated to HL7 and submitted to the MIRTH interface engine, and then (2) processed and stored as normalized data objects. (3) A Java application pulls data from the source database and the Clinical Element Model (CEM) database, compares them, then (4) prints inconsistencies for manual review.

In particular, we have validated a stream of 1000 HL7 V.2.5 medication messages end-to-end: from the source data submitted through an NwHIN gateway at Intermountain Healthcare to normalized Secondary Use Noted Drug CEM instances retrieved from the platform datastores at Mayo Clinic. Once a validation procedure had been successfully tested in the SQL datastore environment, the evaluation process was repeated on output from CouchDB. The content of all messages was matched correctly between input and output data, with the exception of one medication RxNorm code that was not populated in the CTS2 terminology server (more details in the Results section). Other SHARPn CEMs are in various stages of validation as the pipeline components continue to evolve. Our continuing objective is to develop standard procedures to identify transmission/translation errors as data from new sites are added to a centralized repository. These errors expose the modifications to underlying system resources within the normalization pipeline that are necessary to produce an accurate conversion of the new data into the canonical CEM standard.
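The reconciliation step can be pictured as a field-by-field comparison between each source record and the normalized CEM instance retrieved from the datastore, reporting every mismatch for manual review. The sketch below shows that idea with invented field names and records; the actual SHARPn validator is a Java application working against the SQL and CouchDB stores.

```python
# Minimal reconciliation sketch: compare selected fields of source vs normalized records.
FIELDS_TO_CHECK = ("ingredient", "strength", "route", "dose")


def reconcile(source_records: dict, cem_records: dict):
    """Yield (message id, field, source value, normalized value) for every mismatch."""
    for msg_id, src in source_records.items():
        cem = cem_records.get(msg_id, {})
        for field in FIELDS_TO_CHECK:
            if src.get(field) != cem.get(field):
                yield msg_id, field, src.get(field), cem.get(field)


source = {"m1": {"ingredient": "salmeterol", "strength": "50 ug", "route": "inhaled", "dose": "1"}}
normalized = {"m1": {"ingredient": None, "strength": "50 ug", "route": "inhaled", "dose": "1"}}
for mismatch in reconcile(source, normalized):
    print(mismatch)   # ('m1', 'ingredient', 'salmeterol', None)
```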

Results

Medication CEMs and terminology mappings

As discussed above, for testing our data normalization pipeline, 1000 ambulatory medication order instances were evaluated for the correctness of structural and semantic mapping from HL7 V.2.3 Medication Order messages to CEMs. The medication data compared included ingredient, strength, frequency, route, dose, and dose unit. Specifically, 266 distinct First Databank GCN sequence codes (GCN-SEQNO) were used in the test messages. These generic clinical medication terms were a random selection from Intermountain's EHR data. All but one (GCN-SEQNO 031417, Serevent Diskus (Salmeterol) 50 μg/Dose Inhalation Disk with Device) were successfully mapped to an RxNorm ingredient code in the pipeline. It has been noted in previous evaluations of RxNorm coverage of actual medication data that metered dose inhalers are, in general, challenging to map.33,34
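A coverage check of this kind can be sketched as a lookup over the set of test codes, separating mapped from unmapped entries. The GCN-SEQNO-to-RxNorm table below is an invented two-entry fragment (only 031417 is taken from the text); in the pipeline this lookup is served by the CTS2 terminology server, not a local dictionary.

```python
# Illustrative GCN-SEQNO -> RxNorm ingredient mapping; entries are hypothetical.
GCN_TO_RXNORM_INGREDIENT = {
    "004521": "6809",      # hypothetical GCN-SEQNO mapped to metformin (RxCUI 6809)
    "009876": "83367",     # hypothetical GCN-SEQNO mapped to atorvastatin (RxCUI 83367)
}


def check_coverage(gcn_codes):
    """Split test codes into mapped and unmapped sets, as done for the 266 test codes."""
    mapped = {code for code in gcn_codes if code in GCN_TO_RXNORM_INGREDIENT}
    return mapped, set(gcn_codes) - mapped


mapped, unmapped = check_coverage(["004521", "009876", "031417"])
print(len(mapped), "mapped; unmapped:", unmapped)   # 2 mapped; unmapped: {'031417'}
```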

MU quality measures

To demonstrate the applicability of our phenotyping infrastructure, we experimented with one of the MU phase 1 eMeasures: NQF 0064 (Diabetes: Low Density Lipoprotein (LDL) Management and Control35; figure 5). NQF 0064 determines the percentage of patients between 18 and 75 years with diabetes whose most recent LDL-cholesterol (LDL-C) test result during the measurement year was <100 mg/dL. Thus, the denominator criterion identifies patients between 18 and 75 years of age who had a diagnosis of diabetes (type 1 or type 2), and the numerator criterion identifies patients with an LDL-C measurement of <100 mg/dL. The patient is not numerator-compliant if the result for the most recent LDL-C test during the measurement period is ≥100 mg/dL, or is missing, or if an LDL-C test was not performed during the measurement time period. NQF 0064 also excludes patients with diagnoses of polycystic ovaries, gestational diabetes or steroid-induced diabetes.

Figure 5

Denominator, numerator and exclusion criteria for NQF 0064: Diabetes: Low Density Lipoprotein (LDL) Management and Control.
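A simplified rendering of this logic is sketched below: the denominator checks age and a diabetes diagnosis, and the numerator checks that the most recent LDL-C result in the measurement year is below 100 mg/dL. The patient record layout, the two ICD-9-CM codes, and the measurement year are assumptions for the example, and the exclusion criteria and full QDM value sets are omitted.

```python
# Simplified NQF 0064 denominator/numerator check; record layout and codes are illustrative.
from datetime import date

MEASUREMENT_YEAR = 2012
DIABETES_CODES = {"250.00", "250.02"}          # illustrative ICD-9-CM subset


def in_denominator(patient: dict) -> bool:
    """Age 18-75 with a diabetes diagnosis (exclusion criteria omitted here)."""
    return (18 <= patient["age"] <= 75
            and any(dx in DIABETES_CODES for dx in patient["diagnoses"]))


def in_numerator(patient: dict) -> bool:
    """Most recent LDL-C result in the measurement year is <100 mg/dL."""
    ldl = [r for r in patient["ldl_results"] if r["date"].year == MEASUREMENT_YEAR]
    if not ldl:
        return False                            # no LDL-C test: not numerator-compliant
    most_recent = max(ldl, key=lambda r: r["date"])
    return most_recent["value"] < 100


patient = {"age": 64, "diagnoses": ["250.00"],
           "ldl_results": [{"date": date(2012, 3, 5), "value": 130},
                           {"date": date(2012, 11, 20), "value": 92}]}
print(in_denominator(patient), in_denominator(patient) and in_numerator(patient))  # True True
```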

As the goal is to execute such an algorithm in a semi-automatic fashion, we first invoked the SHARPn data normalization and NLP pipelines in UIMA to populate the CEM database. Specifically, we extracted demographics, diagnoses, medications, clinical procedures and laboratory measurements for 273 patients from Mayo's EHR systems (table 2) and normalized the data to HL7 value sets, International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM), RxNorm, Current Procedural Terminology (CPT)-4 and Logical Observation Identifiers Names and Codes (LOINC) codes, respectively. The NLP pipeline was invoked in particular to process the medication data, as demonstrated in prior work.36 For translating the QDM criteria into Drools rules, a key component was the mapping from QDM data elements to CEMs. This enables mapping and extraction from disparate EHRs to this normalized representation of the source data. Additionally, CEMs contain attributes of provenance, such as date/time, originating clinician, clinic, and entry application, and other model-specific attributes, such as the order status of a medication. These care process-related attributes are required in QDM data specifications. CEMs are bound to a standard value set. The ‘key’ code and the ‘data’ code are generally the CEM ‘qualifiers’ that enable a mapping from a QDM data element specification to one or more CEMs. Table 3 shows a sample of mappings developed for NQF 0064. Two QDM data categories, ‘Diagnosis’ and ‘Encounter’, may be satisfied by one or more CEMs. Both of these map to an ‘AdministrativeDiagnosis’ CEM: the QDM data element will be instantiated based on matching the ‘data’ qualifier to the QDM value set. All QDM data specifications for NQF 0064 were successfully mapped to CEMs.

Table 2

Basic demographics for patients (N=273) evaluated for NQF 0064

Category: Number of patients (%)
Gender
  • Male: 132 (48%)
  • Female: 141 (52%)
Race
  • White: 236 (85%)
  • Black: 8 (3%)
  • Alaskan Indian: 10 (4%)
  • Asian: 5 (2%)
  • Pacific Islander: 4 (2%)
  • Other: 10 (4%)
Ethnicity
  • Hispanic, Latino or Spanish origin: 7 (4%)
  • Non-Hispanic, Latino or Spanish origin: 235 (85%)
  • Unknown: 31 (11%)
Age (years)
  • ≤18: 31 (11%)
  • 19–30: 32 (11%)
  • 31–50: 70 (26%)
  • 51–75: 105 (38%)
  • >75: 35 (14%)
Table 3

Sample QDM category to CEM mapping for NQF 0064

QDM data element | QDM code system | CEM | CEM qualifier(s)
Diagnosis, active: diabetes | ICD-9-CM | Administrative Diagnosis | Data, Encounter Date
Diagnosis, active: diabetes | SNOMED-CT | Secondary Use Assertion | Data, Date Of Onset or Observed Start Time
Encounter: non-acute inpatient and outpatient | CPT | Administrative Procedure | Data, Encounter Date
Encounter: non-acute inpatient and outpatient | ICD-9-CM | Administrative Diagnosis | Data, Encounter Date
Laboratory test, result: LDL test | LOINC | Secondary Use Standard LabObs Quantitative | Key, Data
Medication, active: meds indicative of diabetes | RxNorm | Secondary Use Noted Drug | Data, Start Time
Patient characteristic: birth date | LOINC | Secondary Use Patient | Birth Date
Patient characteristic: race | CDCREC | Secondary Use Patient | Administrative Race
Patient characteristic: sex | ONC Administrative Sex | Secondary Use Patient | Administrative Gender
  • CDCREC, US Centers for Disease Control and Prevention Race and Ethnicity Code Set; CEM, Clinical Element Model; CPT, Current Procedural Terminology; ICD-9-CM, International Classification of Diseases, Ninth Revision, Clinical Modification; LDL, low-density lipoprotein; LOINC, Logical Observation Identifiers Names and Codes; ONC, Office of the National Coordinator; QDM, Quality Data Model.

Executing NQF 0064 on this population of 273 patients identified 21 and 18 patients for the denominator and numerator, respectively (table 4). Nineteen patients were excluded because they did not meet either the denominator or numerator criterion for NQF 0064. Furthermore, we have implemented an open-source platform (http://phenotypeportal.org) that provides a library of such cohort identification algorithms, as well as tools for visualization, reporting and analysis of algorithm results.

Table 4

Criteria results for patients (N=273) evaluated for NQF 0064

NQF 0064 population | Number of patients
Initial patient population | 273
Denominator | 21
Numerator | 18
Exclusion | 19

Discussion

We have briefly presented the SHARPn project, whose core mission is to develop methods and modular open-source resources for enabling secondary use of EHR data through normalization into standards-based, comparable, and consistent formats. A key aspect of achieving this is the normalization of both structured and unstructured clinical data into a unified, concept-based formalism. Using CEMs and Apache UIMA, we have developed a data-normalization platform that ensures data security, end-to-end connectivity, and reliable data flow within and across institutions. This platform leverages and builds on our existing work on the cTAKES and CTS2 open-source tools, which have been widely adopted in the health informatics community. With phenotyping as SHARPn's major use case, we have collaborated, and continue to collaborate, with NQF on further development and adoption of QDM and HQMF for standardized representation of cohort definition criteria. As illustrated in the previous section, our goal is to create a national library of standardized phenotyping algorithms and to provide publicly available infrastructure that facilitates their execution in a semi-automatic fashion. Finally, this article describes a systematic implementation of an informatics infrastructure. While there are alternative commercial (eg, Oracle Cohort Explorer37) and open-source (eg, PopHealth38) tools that achieve similar functionality, to our knowledge, ours is the first system that leverages an emerging national standard—namely QDM—for representing phenotypic criteria, as well as a rules-based system (Drools) for executing such criteria.

Several limitations and challenges remain before the SHARPn vision is fully achieved. First, to achieve any form of practical semantic normalization, we face the unavoidable requirement for well-curated semantic mapping tables between native data representations and their target standards. While several semi-automated methods for mappings are being investigated,39–42 terminology mapping remains a largely labor-intensive process. Second, within the context of secondary use, metadata and other means of understanding the provenance and meaning of source EHR data are highly pertinent. While CEMs do accommodate provenance details, such information is typically not readily available or, in a few cases, not adequately processed by the data normalization framework, suggesting the need for future enhancement. Furthermore, at present, for structured clinical data, we use HL7 2.x and CCD messages as the payload of an NwHIN Document Submission message, which are then transformed and normalized using the UIMA pipeline. However, the emergence of more comprehensive standards, such as the Consolidated Clinical Document Architecture,43 will warrant further evaluation, including investigation of the use of Mirth Connect and Aurion NwHIN for transferring the message payload. Further, our existing evaluations with eMeasures have been limited to MU phase 1 as well as to criteria with core logical operators (eg, OR, AND). However, QDM defines a comprehensive list of operators, including temporal logic (eg, NOW, WEEK, CURRTIME), mathematical functions (eg, MEAN, MEDIAN), and qualifiers (eg, FIRST, SECOND, RELATIVE FIRST), that require a deeper understanding of the underlying data semantics. Our plan is to incorporate these additional capabilities as well as the 2014 MU phase 2 CQMs in future releases of the QDM-to-Drools translator.

Conclusion

End-to-end automated systems for extracting clinical information from diverse EHR systems require extensive use of standardized vocabularies and ontologies, as well as robust information models for storing, discovering, and processing that information. The main objective of SHARPn is to develop methods and modular open-source resources for enabling secondary use of EHR data through normalization to standards-based, comparable, and consistent formats for high-throughput phenotyping. In this article, we present our current research progress and plans for future work.

Correction notice

This paper has been corrected since it was published Online First. The middle initial of the fifth author has been corrected.

Contributors

JP and CGC designed the study and wrote the manuscript. All other authors checked the individual sections in the manuscript and discussion. All authors contributed to and approved the final manuscript.

Funding

This research and the manuscript were made possible by funding from the Strategic Health IT Advanced Research Projects (SHARP) Program (90TR002) administered by the Office of the National Coordinator for Health Information Technology.

Competing interests

None.

Patient consent

Obtained.

Ethics approval

This study was approved by the Mayo Clinic Institutional Review Board (IRB#: 12-003424) as a minimal risk protocol.

Provenance and peer review

Not commissioned; externally peer reviewed.

References
