
Privacy preserving interactive record linkage (PPIRL)

Hye-Chung Kum, Ashok Krishnamurthy, Ashwin Machanavajjhala, Michael K Reiter, Stanley Ahalt
DOI: http://dx.doi.org/10.1136/amiajnl-2013-002165 212-220 First published online: 1 March 2014


Objective Record linkage to integrate uncoordinated databases is critical in biomedical research using Big Data. Balancing privacy protection against the need for high quality record linkage requires a human–machine hybrid system to safely manage uncertainty in the ever changing streams of chaotic Big Data.

Methods In the computer science literature, private record linkage is the most published area. It investigates how to apply a known linkage function safely when linking two tables. However, in practice, the linkage function is rarely known. Thus, there are many data linkage centers whose main role is to be the trusted third party to determine the linkage function manually and link data for research via a master population list for a designated region. Recently, a more flexible computerized third-party linkage platform, Secure Decoupled Linkage (SDLink), has been proposed based on: (1) decoupling data via encryption, (2) obfuscation via chaffing (adding fake data) and universe manipulation; and (3) minimum information disclosure via recoding.

Results We synthesize this literature to formalize a new framework for privacy preserving interactive record linkage (PPIRL) with tractable privacy and utility properties and then analyze the literature using this framework.

Conclusions Human-based third-party linkage centers for privacy preserving record linkage are the accepted norm internationally. We find that a computer-based third-party platform that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process, is an effective human–machine hybrid system that significantly improves on the linkage center model both in terms of privacy and utility.

  • privacy preserving interactive record linkage (PPIRL)
  • decoupled data
  • entity resolution
  • medical record linkage
  • privacy
  • Electronic Health Records (EHR)


Information systems in the health sector have undergone significant infrastructure changes making it possible to collect, store, and process huge amounts of data. However, information derived from these heterogeneous systems is often redundant, fragmented over multiple databases, incomplete, and erroneous.17 In fact, the 4V's of Big Data, Volume, Velocity, Variety, and Veracity,8 describe succinctly the nature of Big Data in healthcare as seen in the continuously generated medical records from diverse service providers which always contain some level of error. Thus, a task critical to finding the useful information among such chaotic Big Data is record linkage—the process of identifying record pairs from different information systems which belong to the same real-world entity.

The record linkage process is complicated by the inherent factors observed in Big Data, such as missing data (eg, missing social security number (SSN)), erroneous data (eg, transpose of date of birth (DOB)), non-standardized forms of data (eg, Dr Smith), and change in the data over time (eg, changed last name). The absence of common, error-free, and unique identifiers makes exact matching solutions inadequate, leading to methods for approximate linkage to address these issues.115 In a study linking cancer registries, 10% more matches were found using a deterministic approximate match compared to the exact match methods due to typos in names or missing SSNs.2 A study that used a more sophisticated approximate method, six pass probabilistic record linkage, to link a cancer registry with Medicaid data reported that only 83% of records were matched using exact match.3 In another study, 36.3% of health records were missing SSNs.5 Yet another study reported that there were between 0.16% and 16% potential duplicate medical record numbers in five different electronic health record systems.6 Due to the large number of patients served, even 0.16% equals 1583 records, a considerable number to clean up manually.

In this paper, we provide a tutorial on record linkage and a systematic review of the literature on privacy preserving record linkage (PPRL) for biomedical research. We also synthesize the literature to propose a new framework, privacy preserving interactive record linkage (PPIRL), for data integration with tractable privacy and utility properties. We evaluate the current literature using the framework.


Record linkage

The main difficulty in record linkage is that data are often expressed differently, change over time, lack unique attributes, have missing attributes, or have erroneous data entry. Let us consider an example where SSN, first name, last name, and DOB are available for linkage. If we link only on SSN, issues arise from missing and erroneous SSN. If linked using all four attributes on exact match, many true matches are missed. The goal of the different approximate approaches is to capture as many true matches as possible while minimizing the false matches. Typically, all approaches will use approximate matching and result in three categories: match, uncertain, non-match. The objective in all automatic approximate algorithms is to minimize the uncertain region which requires manual resolution by an individual. There are several good surveys915 and recent advances in new learning methods for automatic matching.1621
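The three-way outcome described above can be illustrated with a minimal sketch. It is not any published linkage algorithm: the field names, the use of `difflib.SequenceMatcher` as the string comparator, the neutral score of 0.5 for missing values, and the two thresholds are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical field names for the four attributes in the example above.
FIELDS = ("ssn", "first", "last", "dob")


def field_similarity(a, b):
    """Similarity in [0, 1]; a missing value is treated as uninformative (0.5)."""
    if not a or not b:
        return 0.5
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(rec_a, rec_b):
    """Unweighted mean of the per-field similarities."""
    total = sum(field_similarity(rec_a.get(f), rec_b.get(f)) for f in FIELDS)
    return total / len(FIELDS)


def classify(rec_a, rec_b, lower=0.6, upper=0.9):
    """Assign a pair to match, non-match, or the uncertain region in between."""
    s = score(rec_a, rec_b)
    if s >= upper:
        return "match"
    if s <= lower:
        return "non-match"
    return "uncertain"


ann = {"ssn": "000000000", "first": "Ann", "last": "Smith", "dob": "1980-01-02"}
ann_new_name = {"ssn": None, "first": "Ann", "last": "Perez", "dob": "1980-01-02"}
print(classify(ann, ann))           # identical records
print(classify(ann, ann_new_name))  # missing SSN and a changed last name
```

Widening the gap between `lower` and `upper` sends more pairs to manual review. Narrowing it trades that effort for more false matches or more missed matches.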

Uncertainty in record linkage

In all approximate matching methods, real-world entities which share similar identifying information (eg, twins and family) result in a certain number of false matches.3 ,5 ,22 In addition, in health informatics application, there are substantial ethical and liability issues involved in the potential corruption of the integrated patient data system that can result from these false matches.67 ,23 ,24 At the same time, conservative matching methods which can miss many true matches, may result in selection biases.45 ,25 Not properly accounting for linkage error, both false matches and missed true matches, can cause serious harm as the erroneous links propagate to subsequent steps in the workflow.6 ,7 ,2230 Bronstein et al 4 found that when matching Medicaid claims data to vital records, the resulting matched analytic datasets tend to under-represent the outcomes of high-risk pregnancies. Baldi et al found that the covariates in the Cox regression models can be biased due to not capturing all true links when analyzing survival rate in a cohort of patients with breast cancer.25 Lahiri and Larsen propose a method for taking into account the measurement error in the linkage process when building a linear regression model between linked variables.26 Tancredi and Liseo present a more general model for propagating the uncertainty between the parameter estimation step and the matching procedure using a hierarchical Bayesian approach.27

Currently, most research treats linked data as if there are no errors. This convention is perpetuated because most scientists using the linked data are not involved with the linkage process7 and do not fully appreciate the complex process or the uncertainties in the linked data. Researchers who use linked data need a better understanding of the nature of uncertainty in the linkage process and more research is needed on methods of propagating the uncertainty in record linkage to subsequent analysis.2529

Interactive record linkage

Linkage errors propagate into the linked data and its analysis results leading to potential problems with incorrect results, and eventually incorrect knowledge and action. Thus, interactive record linkage, defined as people fine tuning the false matches and managing the uncertainty and its propagation to subsequent analyses, is the first step in the data workflow to turn Big Data into useful biomedical information.35 ,31 We define the properly tuned output from such a hybrid human–machine data integration system as high quality record linkage.

Recently, there has been more research on interactive record linkage that takes advantage of human interaction either through active learning systems or crowdsourced systems3238 after a study described the limitations of the techniques in automatic record linkage for real applications.39 More research is needed on interactive record linkage systems that allow the scientist to tune the linkage results and manage the uncertainty in the subsequent analysis. The importance of human interaction in record linkage to resolve the many uncertainties in the process is demonstrated well in Bronstein et al.4 Their paper describes a method for matching pregnancies from Medicaid data to birth records using probabilistic record linkage that involved 11 manual steps. There were multiple uncertainties that needed human decisions during the process. For example in step 4, of 46 364 pregnancies the authors were trying to match, 4369 linked to more than one vital record and 9400 had no match to any vital record. Eventually after multiple iterative data cleaning and matching steps, the authors identified 43 500 completed pregnancies that should be documented in vital records, 5278 of which were not found (87.9% match rate). This is similar to the 90% match rates found in linking Medicaid and vital statistics records in other states in the USA. With no human interaction, the match rate would be much lower. Such a high level of human interaction and iteration is common in medical record linkage studies.35
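The 87.9% figure quoted above follows directly from the two counts given in the text. The short calculation below reproduces it; the variable names are ours.

```python
completed_pregnancies = 43_500  # pregnancies that should appear in vital records
not_found = 5_278               # of those, not found in vital records

matched = completed_pregnancies - not_found
match_rate = matched / completed_pregnancies

print(f"{matched} of {completed_pregnancies} matched ({match_rate:.1%})")
```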

Privacy in record linkage

Given the sensitivity of biomedical data, privacy is a major concern in interactive record linkage where data cannot be de-identified. In particular, in secondary data analysis the research question is not known at the time of data collection, making informed consent, the most common form of protection in biomedical research, difficult. In most cases, general blanket consent for research along with IRB review of the risks and benefits of research, is the only option available. In 2001, the US Government Accountability Office (GAO) published a report on technologies for privacy protection in record linkage in federally funded projects.40 Much is still the same with only two modes of access for research, de-identify mode and trust mode. De-identified data cannot be linked and the trust mode provides little protection from trusted users requiring high level clearance for those doing the linkage. In addition, with trusted third-party linkages, scientists are unaware of the uncertainty in the linkage process and how to propagate this uncertainty in the analysis downstream.22

Privacy protection in record linkage is fundamentally different from all other privacy preserving data operations41 ,42 because the goal is to exactly identify the entity represented by the data being linked, so that the tables can be accurately merged.22 Thus, there is a direct conflict with the conventional understanding of privacy as anonymity. More precisely, anonymity is preventing identity disclosure. In comparison, attribute disclosure refers to the disclosure of one or more sensitive attributes (eg, cancer status). Although related, identity disclosure is only a necessary condition for attribute disclosure, not a sufficient condition. Identity disclosure without attribute disclosure has a low risk of harm.22 ,4347 For safe interactive record linkage, we need to find the exact level of information disclosure that protects sensitive data but reveals enough identifying data for high quality linkage.

Thus, a model focusing on guaranteeing no attribute disclosure while also minimizing identity disclosure has the potential to significantly reduce the risk of privacy violation while still allowing for high quality data integration.46 ,47 It is important to note that the current norm for data integration in the USA is full disclosure of all information to a fully trusted human entity, often called the honest broker; for example, full disclosure of both attribute and identity to certain trusted parties for certain purposes is HIPAA (Health Insurance Portability and Accountability Act) compliant.27 Often, the trusted party is a government or hospital employee, or business associates who must access identifying information for operations.7 Typically in biomedical research, the trusted party is responsible for maintaining a master patient index (MPI), and this index is used to integrate data. The quality of MPI varies widely and most MPIs have duplicates that must be cleaned during the linkage process.6

In this current trust model, there is no protection from insider attack. The main threat model in the interactive linkage process is an honest-but-curious (HBC) user who follows protocol48 but carries out an insider attack, which accounts for close to half the breaches in the USA.49 In a survey of over 600 people, 46% of the respondents answered that the damage caused by insider attacks is greater than that caused by outsider attacks, with the most common insider e-crimes being unauthorized access to and use of information (63%) and unintentional exposure of private and sensitive data (57%).49 By focusing on giving trusted parties access to only the minimum information required, unintentional exposure of sensitive data can be significantly reduced.

Privacy preserving record linkage

For our systematic review of the topic, we modified the guidelines of the Center for Reviews and Dissemination.50 Figure 1 details the workflow with specifics provided in the online supplementary appendix. Here we present the three themes that emerged from the 71 articles reviewed in two separate sections, ‘Privacy preserving record linkage’ and ‘Privacy preserving interactive record linkage.’

Figure 1

Systematic review process workflow.

Private record linkage

On the theoretical front, there have been ongoing efforts to develop PPRL algorithms since 2003.51 Private record linkage is defined as computing the set of linked records given as input a matching function and then outputting them to the two private parties without revealing anything about the non-linked records. The first generation of private record linkage algorithms relied on hash-based algorithms.51 ,52 The use of hashes resulted in strong privacy guarantees but was limited to exact matching algorithms. This led to the second generation of algorithms that developed private string comparison methods (eg, Bloom filters) for private approximate matching.5360 Secure multi-party computation (SMC) is also a common approach to protect against cryptographic attack. Several surveys of private record linkage6164 and privacy-preserving string comparators65 have been carried out. Recently, Kuzu et al proposed a practical private record linkage system demonstrating the effectiveness of controlled information disclosure via obfuscated data and SMC.66 ,67 However, they still formulate the problem as private record linkage with a known matching function and ambiguous links.
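The difference between the two generations of algorithms can be sketched as follows. This is an illustrative toy, not the protocol of any cited paper. It assumes a hypothetical shared key (`SECRET`), character bigrams as the unit of comparison, an arbitrarily chosen filter size and number of hash functions, and the Dice coefficient as the similarity measure.

```python
import hashlib
import hmac

SECRET = b"hypothetical-shared-key"  # assumed to be agreed on by the two parties


def keyed_hash(value):
    """First-generation idea: compare keyed hashes, so only exact matches link."""
    return hmac.new(SECRET, value.lower().encode(), hashlib.sha256).hexdigest()


def bigrams(value):
    """Character bigrams of the padded, lower-cased string."""
    padded = f"_{value.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value, size=1024, num_hashes=4):
    """Second-generation idea: hash each bigram into a Bloom filter.

    The filter is returned as the set of bit positions that are set.
    """
    bits = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            digest = hmac.new(SECRET, f"{k}:{gram}".encode(), hashlib.sha256).digest()
            bits.add(int.from_bytes(digest[:4], "big") % size)
    return bits


def dice(a, b):
    """Dice coefficient of two bit-position sets, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


# Exact hashing treats a one-letter typo as a complete mismatch.
print(keyed_hash("Smith") == keyed_hash("Smyth"))
# The Bloom filter encoding still reports the two names as similar.
print(dice(bloom_encode("Smith"), bloom_encode("Smyth")))
print(dice(bloom_encode("Smith"), bloom_encode("Jones")))
```

Both functions still apply a comparison rule that was fixed in advance, which is the limitation discussed in the next paragraph.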

In summary, private record linkage involves two private parties who are trying to share minimum information with each other and assumes that the matching function between the tables is known. The goal is to apply the known matching function in a secure manner. There are two problems in this formulation. First, if the matching function is not known, as in most applications, the algorithms cannot be used. Second, there is no possibility of clerical review of the ambiguous links or human interaction during linkage because one of the assumptions is that the private data must not be revealed to the other party. Consequently, the major challenges for real applications are that, without human interaction, there is no method for finding the matching function and resolving the ambiguous links. All reviews of private record linkage identify these as open issues.6164

Trusted third-party linkages

In practice, published research using linked data uses a trusted third-party model.27 In the USA, the National Center for Health Statistics or state centers for health statistics often play the role of a trusted third party. Internationally, several countries have linkage centers whose main role is to determine the matching function manually and link data for research as the trusted third party. Many linkage centers have succeeded in building systems for integrating population health records with good protocols for privacy protection,5 ,6876 sometimes called the pseudonym approach. Such centers rely on separation of the identifying information from the sensitive information for privacy protection.5 ,7578 Dedicated record linkage experts have access to only the identifying data with no access to sensitive data, and furthermore are not involved with subsequent research using the linked data. In these linkage centers, there is significant reliance on the human expert for high quality record linkage and maintenance of a master population list to which all data are linked. Hertzman et al75 describe this proactive linkage as ‘linking each data set when it arrives from a data provider, rather than project by project.’ Most linkage centers cover a designated region, easing the burden of maintaining a master population list, and operate in countries with uniformity in health records and national identification numbers.

Privacy preserving interactive record linkage

In a heterogeneous health system, like that in the USA, the validity and reliability of integrated health data is a significant problem.57 ,2224 In these settings, given the velocity and veracity of Big Data, good incremental record linkage methods are required for proactive linkage to work well since multiple data continue to flow into the system with no shared unique identifiers. However, incremental record linkage to maintain a coordinated master list and its links to multiple data sources that change over time is still largely an open research area. The literature confirms that high quality data integration as well as managing uncertainty in Big Data require human interaction throughout the entire workflow.17 ,2230 Human interaction means that data must be revealed to someone in some form under some condition. In this section, we synthesize the literature to propose PPIRL, a novel framework with tractable privacy and utility properties. We then review a system called Secure Decoupled Linkage (SDLink) as a possible implementation of the principles of PPIRL.

The main use case for the PPIRL framework is for those with approval for full access to multiple data under the trust mode for data integration. PPIRL is a framework that can protect against HBC users in such situations. If successful, PPIRL can greatly increase the throughput of record linkage by allowing many more people, who have not obtained the highest trusted party status (eg, graduate students), to be involved in the time consuming steps of record linkage, creating the matching function, and carrying out the clerical review. Furthermore, under certain conditions, crowdsourcing parts of the process is possible.

The goals of PPIRL are to allow direct control of the matching function and the matching uncertainty by the user while still providing privacy protection, defined as no sensitive attribute disclosure, during this interaction. In box 1, we present the PPIRL framework.

Box 1

Privacy preserving interactive record linkage (PPIRL)

Problem statement (interactive record linkage, IRL) Let R1 and R2 be two private datasets, which cover data on subsets of a population Ω, with non-sensitive attributes Q1 and Q2, and sensitive attributes S1 and S2, respectively. The goal of IRL is to construct an algorithm A that takes as input R1 and R2, and outputs a function M: R1×R2→{match, non-match} AND a function CM:R1×R2→[0,1]. The function CM is an automatic function which outputs a probability score of match between 0 and 1 reflecting the confidence level of the match. The function M is also automatically computed, but for selected mappings, typically from uncertain regions, the output assignment can be interactively changed by an informed human H. In IRL, the human H has access to the full datasets R1 and R2, as well as the output from CM to tune the final matching function M.

Problem statement (privacy preserving interactive record linkage, PPIRL) The goal of PPIRL is to construct an algorithm, A′ that outputs function M′ and CM′, which serves the same purpose as algorithm A from IRL except that the sensitive attributes S1 and S2, from the datasets R1 and R2, respectively, are not disclosed to the human H. The human in PPIRL is thus typically working with less data about the records being linked but trying to still achieve the same level of quality in the matching function M′.

Privacy objective To protect against sensitive attribute disclosure, S1 and S2 are never revealed to H.

Utility objective (1) To generate the best matching function M′ possible by allowing a person H to fine tune the results; and (2) to generate and communicate the confidence level CM′ to H, so that uncertainty can be managed and propagated through the full analysis workflow flexibly.

The cost of privacy in PPIRL is the difference in quality of the matching functions M′ and M. The key to solving the PPIRL problem is to understand the minimum amount of information required for the human user, H, to make accurate linkage decisions and then to devise methods to disclose that information to H without disclosing any of the sensitive attributes S1 and S2 from the databases R1 and R2. If we can disclose all of the information required for generating the matching function M safely, then the quality of the matching function M′ can be as good as M and privacy can be guaranteed at no cost to utility.
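The roles of CM′, M′, and the human H in box 1 can be made concrete with a small sketch. It is only one possible reading of the framework. The `Record` layout, the agreement-fraction score used for CM′, the fixed threshold, and the `override` method are illustrative assumptions and do not come from the SDLink implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    rid: str
    q: dict  # non-sensitive attributes (Q), usable for linkage
    s: dict  # sensitive attributes (S), never shown to the human H


def human_view(record):
    """What H may see under PPIRL: the record id and Q only, never S."""
    return {"rid": record.rid, **record.q}


def confidence(r1, r2):
    """Toy CM': fraction of shared non-sensitive attributes that agree exactly."""
    keys = set(r1.q) & set(r2.q)
    if not keys:
        return 0.0
    return sum(1 for k in keys if r1.q[k] == r2.q[k]) / len(keys)


class Matcher:
    """Toy M': thresholded CM' whose output H can override for selected pairs."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.overrides = {}

    def override(self, rid1, rid2, decision):
        if decision not in ("match", "non-match"):
            raise ValueError("decision must be 'match' or 'non-match'")
        self.overrides[(rid1, rid2)] = decision

    def __call__(self, r1, r2):
        key = (r1.rid, r2.rid)
        if key in self.overrides:
            return self.overrides[key]
        return "match" if confidence(r1, r2) >= self.threshold else "non-match"


r1 = Record("a1", {"first": "Ann", "last": "Smith", "dob": "1980-01-02"}, {"dx": "cancer"})
r2 = Record("b1", {"first": "Ann", "last": "Smyth", "dob": "1980-01-02"}, {"claims": 3})

m = Matcher()
print(human_view(r1), confidence(r1, r2), m(r1, r2))
m.override("a1", "b1", "non-match")  # H reviews the uncertain pair and changes it
print(m(r1, r2))
```

In this reading, the cost of privacy would be measured by comparing the decisions H makes from `human_view` alone with those made with full access to the records.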

Secure decoupled linkage

SDLink is a flexible, secure linkage system that implements the key ideas behind PPIRL. Below we review the key privacy design principles of the system.22 ,46 ,47

Privacy design 1: decoupling data

Decoupling refers to separating out, via encryption, the identifying information (eg, personally identifiable information (PII)) from the sensitive data (eg, cancer status) that need protection (figure 2).46 ,47 Decoupled data provide the same level of protection as de-identified data, which is more protection than is provided in the trust mode of access, with more utility than de-identified data.22 ,46 ,47 Decoupling data follows the minimum necessary standard for privacy protection and, during the record linkage process, removes unnecessary information, that is, the information connecting the PII to sensitive data. The innovation in decoupling data is to take a privacy-by-design approach and focus on selectively revealing information rather than hiding it. The key is to understand the minimum information required for acceptable linkage and then to design protocols to reveal, in a secure manner, only that information.

Figure 2

Secure decoupled data. Internally, the data is stored in a decoupled data system (bottom), which has the same level of privacy protection as de-identified data (top right), but is much more powerful because researchers can link multiple decoupled datasets safely. Decoupled data allows for accurate record linkage with no attribute disclosure.
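The decoupling idea can be sketched in a few lines. This is a simplification under stated assumptions. The source describes decoupling via encryption, whereas the sketch replaces the encrypted connection with random tokens and a separate link table that only the linkage software would hold. The function name, field names, and table layout are hypothetical.

```python
import random
import secrets


def decouple(rows, pii_fields, seed=None):
    """Split rows into a PII table and a sensitive table.

    The only thing connecting the two is `link_table`. In a real system that
    connection would be encrypted and never shown to the reviewer.
    """
    pii_table, sensitive_table, link_table = [], [], {}
    for row in rows:
        pii_id = secrets.token_hex(8)   # random ids carry no information
        sens_id = secrets.token_hex(8)
        pii_table.append({"pii_id": pii_id, **{f: row[f] for f in pii_fields}})
        sensitive_table.append(
            {"sens_id": sens_id, **{f: v for f, v in row.items() if f not in pii_fields}}
        )
        link_table[pii_id] = sens_id
    # Shuffle each table independently so row order reveals nothing either.
    rng = random.Random(seed)
    rng.shuffle(pii_table)
    rng.shuffle(sensitive_table)
    return pii_table, sensitive_table, link_table


rows = [
    {"name": "Ann Smith", "dob": "1980-01-02", "dx": "cancer"},
    {"name": "Bob Jones", "dob": "1975-05-06", "dx": "none"},
]
pii, sensitive, link = decouple(rows, ("name", "dob"))
print(pii)        # what a reviewer may work with during linkage
print(sensitive)  # what an analyst may work with after linkage
```

A reviewer who sees only `pii` can resolve identities, and an analyst who sees only `sensitive` can run analyses. Neither can connect a name to a diagnosis without the link table.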

Privacy design 2: computerized third-party linkage

As discussed above, the trusted third-party mechanism to protect privacy is well understood. In the decoupled approach, a researcher has access to computerized third-party software that can access the PII in order to link the data. The researcher requests that two decoupled tables be merged, after which the computerized third-party software takes control and carries out the linkage. In this process, the software actively interacts with the researcher as needed for guidance on parameter settings (determining the matching function) and resolving ambiguities (clerical review of ambiguous links). Essentially, a decoupled data access system is a computerized third-party equivalent to the human-run data linkage centers that strictly controls information.

The main benefit of a computerized third-party model of privacy protection is that it allows each project to have maximum flexibility in its linkage to control the uncertainty in real data. With properly designed third-party software acting as an oracle, a person can interact frequently and inexpensively with information held by the computer third party at the smallest level (eg, asking how similar two encrypted SSNs are) in order to manage uncertainty in the linkage. The decoupled database software functions like a bank vault with security deposit boxes that have well-developed security protocols for importing and accessing datasets in the system. The system only allows access to particular tables to those who have the appropriate decryption keys which are managed by the different data custodians.

Privacy design 3: chaffing and universe manipulation

With decoupling, researchers cannot associate a particular row of data with any PII disclosed during record linkage. But researchers can combine what they know with the PII data shown during interaction to make certain inferences and learn sensitive information about people they know.46 ,47 The privacy literature has shown that background knowledge can be used to infer more information than is originally disclosed.44 ,45 For example, in homogeneous data, attribute disclosure can occur via group membership43 (eg, someone you know is on the list of cancer patients). Thus, strict decoupling via encryption is not sufficient to protect against attribute disclosure when identities could be revealed during human interaction. The probability of attribute disclosure through group membership is dependent on a variety of factors including any pre-existing information that is known by the observer, the knowledge of the nature of the list, and the uniqueness of the PII in the universe of the data.46 ,47 To overcome this, Kum et al 46 ,47 evaluated three methods of modification (figure 3): (1) chaffing: literally changing the nature of the universe by adding fake data; (2) fabrication: changing the label/name of the universe presented to mislead the user about the nature of the list; and (3) non-disclosure: hiding the identity of the universe to reduce confidence by making the list less tractable (eg, a list from the USA compared to a list from Austin). The study showed that when the universe around the data was not disclosed, 56% of the participants were uncertain about the identity given a common name. Even for rare names, if the list is chaffed and the universe is not disclosed, 66% of the participants were uncertain about the identity.22 These results show that through chaffing and universe manipulation, identity disclosure can be minimized for both common and rare names.

Figure 3

Chaffing and Universe Manipulation. Triangles: cancer patients; cross-hatched circles: not cancer patients. DA: Universe of all cancer patients (eg, USA); LA: list of subset of cancer patients being reviewed for linkage which is more tractable (eg, Austin); IanPII represents the PII of someone that the reviewer knows (eg, Ian who lives in Austin). Since Ian is not a unique name, it is unclear whether the PII represents the same real world Ian that the reviewer knows personally. (1) chaffing: literally changing the nature of the universe by adding fake data (eg, add blue circles to red triangles); (2) fabrication: changing the label/name of the universe presented to mislead the user on the nature of the list (eg, label DA as DB and/or LA as LB, thus IanPII now is presented as someone who lives in Beijing, China, who could not be the same Ian that the reviewer knows to live in Austin); and (3) nondisclosure: hiding the identity of the universe to reduce confidence by making the list less tractable. That is, by not disclosing the label LA, the user must assume the list represents a much larger universe DA (eg, a list from the USA compared to a list from Austin). The reviewer, who knows an Ian living in Austin, loses confidence in inferring the real identity of IanPII when it is presented as an Ian living in the USA compared to being presented as an Ian living in Austin.
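The three modifications described above can be sketched as simple list and label operations. The function names, the `mode` strings, and the uniform-sampling assumption behind `real_fraction` are ours. The sketch does not reproduce how the cited study generated or presented its lists.

```python
import random


def chaff_list(real_records, fake_pool, n_fake, seed=None):
    """Chaffing: mix n_fake fabricated records into the list shown to the reviewer.

    Returns the shuffled list and a parallel list of flags (True = chaff).
    The flags stay inside the linkage software and are never displayed.
    """
    rng = random.Random(seed)
    fakes = rng.sample(fake_pool, n_fake)
    tagged = [(r, False) for r in real_records] + [(f, True) for f in fakes]
    rng.shuffle(tagged)
    return [r for r, _ in tagged], [flag for _, flag in tagged]


def displayed_label(true_label, mode, decoy_label=None):
    """Universe manipulation: choose the label the reviewer sees for the list."""
    if mode == "disclose":
        return true_label
    if mode == "nondisclosure":
        return None  # the reviewer is not told which universe the list covers
    if mode == "fabrication":
        if decoy_label is None:
            raise ValueError("fabrication needs a decoy label")
        return decoy_label
    raise ValueError(f"unknown mode: {mode}")


def real_fraction(n_real, n_fake):
    """Share of listed records that are real, if all rows look alike."""
    return n_real / (n_real + n_fake)


shown, is_chaff = chaff_list(
    [{"name": "Ian"}], [{"name": "Bob"}, {"name": "Eve"}, {"name": "Al"}], n_fake=3
)
print(shown, displayed_label("Austin", "nondisclosure"), real_fraction(1, 3))
```

Under these assumptions, seeing a known name on the list no longer implies group membership: with one real record and three chaff records, only a quarter of the rows are real.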

Privacy design 4: minimum information sharing via recoding

What information is disclosed during the interaction with the decoupled system determines the risk of disclosure.4347 Figure 4 depicts a sample screen during the linkage process with only the name being fully disclosed. The differences between the attributes that are meaningful for record linkage are displayed instead of the raw data.46 ,47 For example, the gender field only indicates same, different, or missing in one or both fields. Identity (ID) numbers which are PII with a risk of harm (eg, SSN), are displayed as the number of different digits and transposes. DOB comparisons are made on an element to element basis for month, day, and year. In addition, transpose of month and day is accounted for as well as transposes within one element. Determining meaningful differences in names is the most difficult. Table 1 depicts different levels for data recoding of names from left (high disclosure) to right (low disclosure). More research is required to understand what level of information will result in acceptable levels of high quality linkages from interactive record linkage.

Table 1

Different data recoding techniques for names

Record no. | Full disclosure | Remove identical strings | Edit distance if small | Edit distance | Length:edit distance and frequency | Binary
112Parker II––– II––– II––– II7:1CommonDIFF
114Richards––––––––––––0Very commonSAME
Richards––––––––––––0Very commonSAME
116SmithSmithSmith-mith5:4Very commonDIFF

Figure 4

Data recoding techniques.47 The SDLink GUI applies data recoding techniques which display the difference between the attributes that are meaningful for record linkage instead of the raw data. For example, the gender field only indicates, same[−], different[D], or missing[M] in one or both fields. DOB, date of birth; SSN, social security number.
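The recoding rules for gender, ID numbers, and DOB can be sketched as small comparison functions that return only the difference between two values, never the values themselves. This is an illustration under assumptions, not the SDLink code. The `-`, `D`, and `M` symbols follow the caption above, using an ASCII hyphen. The function names, the tuple layout for dates, and the handling of unequal-length IDs are ours, and transposes within a single date element are not handled.

```python
def recode_gender(a, b):
    """Return '-' (same), 'D' (different), or 'M' (missing in one or both)."""
    if not a or not b:
        return "M"
    return "-" if a.strip().upper() == b.strip().upper() else "D"


def recode_id(a, b):
    """Summarize how two ID numbers (eg, SSNs) differ without showing digits.

    Returns 'M' if either value is missing, 'D' if the lengths differ,
    otherwise a (differing_digits, adjacent_transposes) tuple.
    """
    if not a or not b:
        return "M"
    if len(a) != len(b):
        return "D"
    diffs = transposes = 0
    i = 0
    while i < len(a):
        if a[i] == b[i]:
            i += 1
        elif i + 1 < len(a) and a[i] == b[i + 1] and a[i + 1] == b[i]:
            transposes += 1  # two adjacent digits swapped
            i += 2
        else:
            diffs += 1
            i += 1
    return diffs, transposes


def recode_dob(a, b):
    """Compare two (year, month, day) tuples element by element."""
    if a is None or b is None:
        return "M"
    (ya, ma, da), (yb, mb, db) = a, b
    return {
        "year": "-" if ya == yb else "D",
        "month": "-" if ma == mb else "D",
        "day": "-" if da == db else "D",
        "month_day_transposed": (ma, da) == (db, mb) and ma != da,
    }


print(recode_gender("F", None))
print(recode_id("123456789", "123456798"))
print(recode_dob((1980, 1, 2), (1980, 2, 1)))
```

The reviewer sees, for example, that two SSNs differ by one transposition, which is usually enough to judge a likely typo, without ever seeing either number.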

Comparing PPIRL with existing data linkage methods

Although many private record linkage systems have strong privacy guarantees, they assume that the matching function is known and thus have a different objective than PPIRL. The use case for PPIRL is similar to that in the linkage centers where there is one trusted party with access to all the required data for linkage. The trusted third-party model can, to some extent, meet the privacy objective of PPIRL if the sensitive data are isolated from the identifying data. However, better documentation is required on the detailed protocols to handle threats by the HBC trusted users who can disclose information unintentionally and/or access unauthorized data. As discussed above, separation alone will not guarantee that no attribute is disclosed due to homogeneous group membership. On the other hand, the SDLink platform can guarantee no sensitive attribute disclosure by: (1) decoupling sensitive data from identifying data via encryption; (2) using chaffing to block against attribute disclosure via group membership along with universe manipulation; and (3) recoding to minimize identity disclosure.

The first utility objective of generating the best possible matching function by a human expert is met in the linkage centers. However, the objective at the linkage centers is to maintain a global optimal matching function for all data at all times, which can be difficult in many circumstances involving continuously changing heterogeneous data. In comparison, the SDLink platform has been built to allow for finding a local optimal matching function on a per project basis depending on the data required to be merged for a given study. In reality, developing totally new matching functions for every project will be too expensive. But, the ability to optimize existing matching functions per project will allow for better tuning of linkage results and less uncertainty because the scope of the problem per project is significantly smaller than the global problem. Although PPIRL only discusses one matching function M′ for simplicity, in reality there are multiple matching functions that meet the diverse needs of different projects.79 A flexible system that can efficiently support multiple matching functions is required to give the scientists the control they need over the data to propagate and manage the uncertainty of Big Data. The SDLink platform is a safe infrastructure that can be utilized by many scientists to carry out all aspects of record linkage research including a model for uncertainty propagation while protecting privacy.

Discussion of PPIRL and SDLink

To the best of our knowledge, the SDLink platform is based on good privacy designs, reviewed in detail in this paper, and best meets both the privacy and utility objectives of PPIRL. SDLink proposes a platform to improve on the existing record linkage centers in terms of both privacy and utility. Nonetheless, it is unclear how well the proposed privacy designs will fare against more aggressive adversaries with background information. More research is required on precisely what information a person needs for tuning the linkage results, and on the harm that can result from release of just that information given the readily available background information in the digital age. Any wide table with the many attributes required for biomedical research, when combined with publicly available background information, may release more information than is safe even if it is de-identified. Thus, along with privacy research that guarantees the minimum release of information required for biomedical research, better research is needed on how to make sure that the information released is properly protected. The strongest confidentiality protection is provided by secure data centers that strictly control physical access to the data by not allowing remote access. However, the various costs associated with such data centers are prohibitive. Thus, a data infrastructure based on a more holistic, coordinated approach that combines methods from technology, statistics, policy, and ethics is required so that Big Data can be used for biomedical research.5,7,22,31,80 An extensible platform for building a comprehensive knowledge base that meets the needs of biomedical research is quite complex, and managing digital entities is at the core of the problem. Bellare et al present a good starting point for continuously maintaining huge numbers of digital entities for a continuous knowledge base in the context of search engines,79 which should be extended with privacy guarantees for biomedical research.

Conclusions and future work

Privacy preserving data integration is key to any data intensive biomedical research using Big Data. Given the volume, variety, velocity, and veracity of Big Data, tuning the results of automatic record linkage algorithms via human interaction is the only way to achieve high quality record linkage as well as to manage and propagate the uncertainty in the linked data. A properly designed computerized third-party platform, such as SDLink, that can precisely control the information disclosed at the micro level and allow frequent human interaction during the linkage process is an effective human–machine hybrid system that can accurately and safely integrate Big Data for biomedical research.

Sometimes the quality of linkage can be improved when sensitive data are available during linkage. For example, sorting through twin records is easier with sensitive data. The right trade-off between the quality of linkage and protection must be case dependent and should be determined by an IRB based on the risk and benefit, considering issues such as who is doing the linkage, on what computer system and with what software, and for what purpose. Most importantly, for population level research, the optimal linkage may not be required as long as there are means to propagate and bound errors from linkage. More research is needed on: (1) precisely what information is required for good linkage decisions; (2) how to disclose only that information in an effective and safe manner; (3) possible threats from and countermeasures against more aggressive adversaries; and (4) how to propagate the uncertainty in record linkage to subsequent analysis steps.
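One very simple way to bound errors from linkage in a population level count is sketched below. It is only an assumed illustration, not a method from the literature reviewed here: each candidate pair carries a hypothetical match score and an outcome flag, and the two thresholds are arbitrary. Uncertain links are treated first as all wrong and then as all correct, which gives a lower and an upper bound on the count.

```python
def linked_outcome_bounds(pairs, certain=0.9, possible=0.5):
    """Bound the number of truly linked records that have the outcome.

    pairs: iterable of (match_score, has_outcome) tuples, where match_score
    is an assumed confidence in [0, 1] that the pair is a true link.
    Lower bound: count only pairs scored at or above `certain`.
    Upper bound: also count uncertain pairs scored at or above `possible`.
    """
    pairs = list(pairs)
    lower = sum(1 for s, has_outcome in pairs if has_outcome and s >= certain)
    upper = sum(1 for s, has_outcome in pairs if has_outcome and s >= possible)
    return lower, upper


# Fabricated example: four candidate pairs with assumed scores.
EXAMPLE_PAIRS = [(0.95, True), (0.70, True), (0.60, False), (0.30, True)]
LOWER, UPPER = linked_outcome_bounds(EXAMPLE_PAIRS)
```

If the interval between the two bounds is narrow enough for the research question, further manual resolution of the uncertain pairs may add little; if it is wide, those pairs are the ones worth reviewing.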


All co-authors worked together on this research and wrote the manuscript.



Competing interests


Provenance and peer review

Not commissioned; externally peer reviewed.


We thank Darshana Pathak and Jan Werner for their helpful comments.

