OUP user menu

A secure protocol for protecting the identity of providers when disclosing data for disease surveillance

(CC)
Khaled El Emam, Jun Hu, Jay Mercer, Liam Peyton, Murat Kantarcioglu, Bradley Malin, David Buckeridge, Saeed Samet, Craig Earle
DOI: http://dx.doi.org/10.1136/amiajnl-2011-000100 212-217 First published online: 1 May 2011

Abstract

Background Providers have been reluctant to disclose patient data for public-health purposes. Even if patient privacy is ensured, the desire to protect provider confidentiality has been an important driver of this reluctance.

Methods Six requirements for a surveillance protocol were defined that satisfy the confidentiality needs of providers and ensure utility to public health. The authors developed a secure multi-party computation protocol using the Paillier cryptosystem to allow the disclosure of stratified case counts and denominators to meet these requirements. The authors evaluated the protocol in a simulated environment on its computation performance and ability to detect disease outbreak clusters.

Results Theoretical and empirical assessments demonstrate that all requirements are met by the protocol. A system implementing the protocol scales linearly in terms of computation time as the number of providers is increased. The absolute time to perform the computations was 12.5 s for data from 3000 practices. This is acceptable performance, given that the reporting would normally be done at 24 h intervals. The accuracy of detection disease outbreak cluster was unchanged compared with a non-secure distributed surveillance protocol, with an F-score higher than 0.92 for outbreaks involving 500 or more cases.

Conclusion The protocol and associated software provide a practical method for providers to disclose patient data for sentinel, syndromic or other indicator-based surveillance while protecting patient privacy and the identity of individual providers.

  • Privacy
  • reidentification
  • de-identification
  • consent
  • ethics
  • integration across care settings (inter- and intraenterprise)
  • languages
  • computational methods
  • advanced algorithms
  • personal health records and self-care systems
  • assuring information system security and personal privacy
  • anonymization
  • other methods for security and policy enforcement
  • cryptography
  • data exchange
  • communication
  • ethical study methods
  • statistical analysis of large datasets
  • methods for integration of information from disparate sources
  • distributed systems
  • agents
  • software engineering: architecture
  • detecting disease outbreaks and biological threats
  • policy
  • legal
  • historical
  • simulation of complex systems (at all levels: molecules to work groups to organizations)
  • monitoring the health of populations
  • privacy and security
  • machine learning
  • health data standards
  • vocabulary
  • ontology
  • scientific information and health data policy
  • consumer health/patient education information
  • information retrieval
  • NLP
  • public-health informatics
  • clinical trials
  • syndromic surveillance
  • secure computation

Introduction

Provider reporting of diseases to public-health authorities is common.1 ,2 However, often there is under-reporting by physicians and hospitals, including for notifiable diseases and frequently by wide margins.324 One causal factor for this under-reporting has been provider concerns about patient privacy.8 ,9 ,11 ,13 ,15 ,19 ,2123 ,2528 Such a reluctance to disclose information has been noted in the past,29 ,30 and exists despite the US Health Insurance Portability and Accountability Act Privacy Rule permitting disclosures of personal health information for public-health purposes without patient authorization.27 ,29 ,3134 Canadian privacy legislation in multiple jurisdictions also permits health-information custodians to disclose personal health information without consent for a broad array of public-health purposes, including chronic disease and syndromic surveillance.35 Concerns about disclosing data are somewhat justified; however, as there have been documented breaches of patient data from public-health information custodians.3642

One way to address patient privacy concerns is to de-identify the individual-level data before disclosure to public health, with the possibility of re-identification if an investigation or contact tracing is required.31 ,42 ,43 However, even if patient privacy concerns are addressed, there have been other concerns about risks to physicians when patient information is disclosed,13 and specifically disclosures for public-health purposes (unpublished data). At least five types of risks have been noted:

  1. Legal exposure. Disclosures without individual patient consent have resulted in tortious or contractual claims of invasion of privacy, breach of confidentiality or implied statutory violations under state law,44 and the increasing collection and disclosure of electronic information raises physicians' malpractice liability exposure.45

  2. Compliance exposure. Physicians have concerns about information being used to evaluate compliance with clinical practice guidelines and compliance with pay for performance programs.46 This concern increases with the amount of detail in the information that is collected.

  3. Intrusive marketing. Providers do not want to be targeted by marketers who gain access to their patient information.46 ,47

  4. Inference or disclosure of income data. Physicians and their professional associations consider the disclosure of income information a serious privacy breach.46 ,48

  5. Inference or disclosure of performance or competitive data. It has been noted that ‘[h]ealthcare providers compete fiercely,’ making it difficult to establish adequate trust for the exchange of health information among health information custodians.49 Furthermore, some data sources for disease surveillance are proprietary, such that they may have reservations about data sharing. For example, schools may not want their absenteeism levels known to avoid political repercussions, and commercial pharmacies would be concerned about their sales data becoming known to potential investors and competitors.27 ,50 ,51 Such custodians may not be willing to disclose information without their identity being masked.50

A distributed architecture for syndromic or other indicator-based surveillance can mask the identity of providers.29 ,30 ,5254 The sources provide count data to independent hubs; these data are aggregated and possibly analyzed by the hubs, and then forwarded to the public-health unit. However, the hubs need to be fully trusted to protect the identity of data sources. This means that if a hub is compromised, corrupted, or compelled to disclose information, then the raw data would reveal the identity of the data sources. Therefore, stronger protections than currently afforded by distributed architectures are needed to alleviate the data-sharing concerns noted above.

In this paper, we present a practical surveillance protocol for the secure multi-party computation of counts. The protocol also follows a distributed model, but it only requires semi-trusted parties. This protects the identity of data sources under different plausible threats. By addressing such concerns, we remove another barrier to the collection of data for disease surveillance.

Methods

In the following narrative, we assume that the data sources are physician practices. This is for the purposes of illustration and ease of presentation. The descriptions, and our proposed protocol itself, would be applicable if the providers were, say, hospitals, pharmacies, or schools.

Trusted versus semi-trusted parties

With distributed surveillance protocols, individual practices send count data to hubs.29 The hubs then aggregate the counts, perform additional analyses, and forward summaries or alerts to the public-health units. The hubs are considered trusted third parties because they will know the identity and counts of each practice. There are three challenges with having a trusted third party: (a) disclosures if a hub is compromised or corrupted, (b) compelled disclosures, and (c) all providers must trust the hub(s).

The first challenge is that if a hub's security is compromised, the adversary will have access to the identity of practices and their corresponding counts. A compromise can be due to either insiders or outsiders. A compromise can be as simple as a ‘change your password’ phishing attack to obtain the credentials of an employee of the hub. Many social-engineering techniques exist,5557 and have been used to obtain passwords and very personal information from individuals and organizations (as well as to commit more dramatic crimes such as bank robberies).58 ,59 A recent review of data breaches indicated that 12% of data breach incidents involved deceit and social-engineering techniques.60 Corruption can occur if an individual with access to the raw data within the hub is bribed or blackmailed to reveal information.

Second, a hub could be compelled to disclose personal health information, for example, in the context of litigation. For research, the National Institutes of Health can issue certificates of confidentiality to protect identifiable participant information from compelled disclosure, and allow researchers to refuse to disclose identifying information in any civil, criminal, administrative, legislative or other proceeding, whether at the federal, state or local level.61 However, these would not be applicable to non-research projects or to projects that are not approved by an IRB, and most public-health surveillance programs would be in that excluded category. Furthermore, such certificates do not exist outside the USA.

Third, the hub must be trusted by all of the practices supplying data. This creates potential obstacles to the exchange of data across municipal, provincial/state, and international jurisdictions. To avoid sending data across jurisdictional boundaries, many regional hubs would need to be created. However, this will result in a proliferation of hubs and the replication of the exact infrastructure multiple times.

To address these challenges, we propose a distributed protocol with the weaker requirement of having only semi-trusted third parties. A semi-trusted third party would not be able to access any of the raw data, even if it wanted to. This means that if there is a security compromise, staff corruption, or a compelled disclosure, there is no additional risk of identifying practices. A protocol with semi-trusted third parties also overcomes the requirement of practices having to completely trust the hub. This allows us to set up a single infrastructure for a large number of practices across multiple jurisdictions. The only requirement on a semi-trusted third party is that it follow the protocol faithfully.

Context

The basic scenario we will use consists of the physician practices providing count data to a public-health unit. There are two types of counts disclosed over the reporting period: cases and all patients seen (denominators). We assume a 24 h reporting period, although our protocol would work with any interval, and that the counts are stratified by syndrome and age. The syndromes are influenza-like-illness (ILI) and gastrointestinal (GI). Ages are grouped similar to the CDC syndromic surveillance system62 as <2, 2–4, 5–17, 18–27, 28–44, 45–64, 65+. Therefore, from each practice, we have a report containing 14 case counts for each age by syndrome stratum, and a total patient count for each age stratum for the previous 24 h. This makes a total of 21 counts per practice per reporting period.

Requirements for secure disease surveillance

The following are the requirements for a protocol that will allow meaningful reporting to a public-health unit while masking the identity of the reporting practices:

  • R1. It should not be possible for any single adversarial party to know the true counts for any practice. This should hold even if a third party involved in the protocol is compromised, compelled to disclose its data, or corrupted.

  • R2. The protocol should allow for technology failures. In a real-world setting, any distributed reporting system will have failures due to machine or connectivity breakdowns. The protocol should have inherent redundancy.

  • R3. It must be possible to verify if a practice did submit data or did not submit data. This ensures data integrity and provides the basis for potentially compensating practices.

  • R4. The computational requirements for the protocol should make it feasible to report at 24 h intervals.

  • R5. It should be possible to identify practices with unusual spikes so that the public-health unit can obtain patient identities and initiate contact with them when necessary.

  • R6. The ability to effectively detect disease outbreak clusters must not deteriorate with the secure protocol.

These requirements were constructed based on the authors' experiences and discussions with computer science and public-health professionals. They represent what are considered necessary conditions to protect the identity of patients and to allow public health to perform their surveillance and investigation functions effectively. The first four requirements address the trustworthiness of the protocol from the perspective of the patients, practices, and public-health units. The latter two requirements address the practical utility of the protocol to public health. Trustworthiness and practical utility are both important, as they will increase the likelihood of initial adoption of the protocol and ongoing use.

Homomorphic cryptosystems

An important technique used in our protocol is homomorphic encryption. Utilizing a homomorphic cryptosystem, mathematical operations can be performed on the encrypted values (ciphertext) that produce the correct result when decrypted (plaintext). An example is additive homomorphic encryption introduced by Paillier,63 in which, conceptually, the summation of two messages is equal to the decryption of the product of their corresponding ciphertexts:

D(E(m1,e)E(m2,e),d)=m1+m2 (1)

In this equation m1 and m2 are the two plaintext messages, E is the encryption function, D is the decryption function, e is the public encryption key, and d is the private decryption key. More details of the exact computation are provided in the appendix.

It is also possible to compute the product of a ciphertext with a constant q:

D(E(m1,e)q,d)=m1×q (2)

For example, if we want to convert the sign of a number, we would raise the power of the ciphertext to q=−1.

Another property of Paillier encryption is that it is probabilistic. This means that it uses randomness in its encryption algorithm so that when encrypting the same message several times, it will, in general, yield different ciphertexts. This property is important to ensure that an adversary would not be able to compare an encrypted message with all possible counts from zero onwards and ascertain the encrypted value.

A threshold version of the Paillier cryptosystem requires t out of l parties to decrypt a message.64 For example, if we have a (2,3) threshold cryptosystem, we would need any two parties out of three to decrypt the message. No single party can decrypt the message.

Secure protocol for disease surveillance

The two phases of our secure protocol are illustrated in figures 1 and 2. Data would be aggregated into groups of at least k practices, where we set k=5 for illustrative purposes below. This means that it will not be possible for anyone but the practice itself to know the actual counts for any practice. The public-health unit will only be able to know the total count for groups of five practices or more.

Figure 1

Set-up phase of the secure computation protocol assuming only two practices are submitting data.

Figure 2

Actual operation of the protocol to securely compute counts.

Roles in our protocol

There are six roles in the protocol: (a) the practices—in the illustration we have only two practices, but this can be a much larger number; (b) the key generator (KG) issues the public and private keys for use by the various parties; (c) the aggregators are semi-trusted third-parties who perform the group- and stratum-specific sums of counts, (d) the key holders (KH) are semi-trusted third parties who decrypt the sums; (e) the mixer is a semitrusted third party who combines the results from at least two out of three key holders; and (f) the public-health unit (PHU) itself. A single physical entity can play multiple roles. An instance of a role will be referred to as a node.

The two aggregators are fully redundant in that the protocol can be implemented with a single aggregator. The primary purpose for redundancy is to ensure that the aggregation operations are performed even if a single aggregator fails or is not accessible. There are three KHs to ensure that there is redundancy built into the system. This means that any single KH can fail, but the overall results can still be computed. For additional robustness, it is also possible to extend the protocol to have t>2 and/or l>3 key holders (eg, 2 out of 4) to be able to decrypt the counts. A minimalist implementation of the protocol with no redundancy would have only one aggregator and two KHs.

The exact configuration in figures 1 and 2 is the one we have used in our demonstration system. The protocol has two main phases described below.

Set-up phase of our protocol

At the outset, the KG generates a public key and the corresponding partial private keys. The public key is given to all of the participating practices when they register. The partial private keys are sent to each of the three KHs. The KG then destroys its copies of the partial private keys after their successful transmission.

During set-up, each practice registers with the KG to indicate that it wishes to participate in the protocol. Registration means downloading the client software and a configuration file. When a practice registers, they also provide a physical address which can be used to identify the other geographically closest registered practices. The configuration file contains the public key as well as the regional grouping of the closest registered practices. The provider installs the client software and is ready to submit count data. The KG informs the aggregators about each practice that has registered and its regional grouping.

For the sake of example, we will assume that we have two regional groupings of practices: Ottawa and Montreal. More formally:

  • Assume there are P strata, M practices, N aggregators, T KHs, and R regional groups. In our example, we have 21 strata, two aggregators, three KHs, and two groups.

  • KG generates the public key PK and the T partial private keys SKt where t∈{1…T}, and sends PK to each new practice when it joins the protocol, and sends the SKt to each of the KHs.

  • Each new practice has a unique ID. All aggregators are informed of each practice's unique ID and group when they join. The PHU is informed of all practices within a group.

Operation phase of our protocol

At the end of the 24 h period, each practice computes the counts for the 21 strata, each of which is encrypted using PK. The encrypted counts are sent by each practice to all of the aggregators. Within each group, an aggregator will sum the encrypted counts only if they come from at least k practices. For example, if the Ottawa region only submitted the counts from four practices, the aggregator would produce a ‘NO DATA’ result for Ottawa.

If at least k encrypted values for a group have been received by an aggregator, the aggregator computes the sums within each stratum across all of the practices within each group. The aggregator does not know what the original values from each practice are, and does not know what the sums are because all of these values are encrypted. The aggregator then sends the encrypted P group sums for the R groups to each KH (except for the groups with ‘NO DATA’ status).

Each KH uses its partial private key to decrypt the sums it receives, which are subsequently sent to the mixer. The KHs ignore regions with no data.

The Mixer selects any two KH values and computes the decrypted values of the P group sums for the R groups with data, which is forwarded to the PHU. More formally:

  1. Each practice computes Eij where i∈{1,…,P} and j∈{1,…,M} as: Eij=E(Cij, PK) where Cij is the count for stratum i for practice j.

  2. The P×Eij values for each practice are then sent to all of the aggregators.

  3. Each aggregator sums (which is equivalent to a multiplication of encrypted values as in equation 1) the values within each stratum within each group: Sir=j in group rEij where r∈{1,…,R}.

  4. The sums are sent by the aggregator to each of the KHs. The KHs decrypt the sums using their partial decryption key: sirt=D(Sir, SKt), which are sent along with their validity proofs to the mixer.

  5. The mixer verifies the partial KH decryptions using their proofs (see the appendix). It then selects any two valid decryption results and combines their results to obtain the final count: sir.

At the end of these steps, the PHU has the plaintext counts for each group for each stratum.

Meeting the requirements

Security analysis (R1)

The security analysis for this protocol is provided in the appendix. This demonstrates that no party will know the practice identities and their counts under plausible compromises or corruption of individual nodes and collusions among nodes.

Node failures (R2)

Our protocol has multiple points of redundancy, making allowances for real-world failures of nodes. This meets requirement number 2. A simulation of node failures is presented in the appendix. This demonstrates that with two aggregators, if any aggregator has a failure rate as high as 20%, there will still be at least one aggregator operating around 98% of the time. With a KH node failure rate of 15%, at least two nodes will be operating around 95% of the time.

Detecting practices providing and not providing data (R3)

An important element of a real-world deployment is the use of digital signatures. Digital signatures will ensure that the senders of messages are who they say they are (authenticity), that the messages cannot be modified in transit without the tampering being discovered (integrity), and that the senders cannot claim that they did not send the messages (non-repudiation). Digital signatures will make it possible to ensure that data are indeed coming from the practices and that practices cannot deny that they submitted counts. Digital signatures and their application in our protocol are described in the appendix.

There will be situations when the PHU needs to detect if any practices are consistently not providing data but claiming that they are. In a sense, the PHU needs to detect ‘free riding’ practices whose lack of contribution of counts is hidden within the practice group total. Such free riding may be deliberate or accidental. For example, a practice may insist that it is providing data and that the aggregator is ‘losing it.’ A proof of data submission would be particularly important if practices are compensated financially for providing data. In such a case, the PHU would need to verify which practices have been contributing counts. For example, if there were eight practices in a group and only five provided data, and the remaining three insist that their systems are working and sending data, the PHU can verify whether the three missing practices did indeed provide their counts. In the appendix, we provide an extension to the protocol that can be used by the PHU to verify which practices have provided data in their group. The approach checks membership using a commutative hash function,65 and makes it impossible for a practice to misrepresent that it provided a count in a total.

Computation performance (R4)

A critical concern with protocols utilizing secure multi-party computation is their performance under realistic situations. We conducted a performance evaluation of the surveillance protocol to determine how it scales as the number of practices and groups increases. Performance is defined in terms of the amount of communication and time taken to perform the computations. The assessment in the appendix shows that only a handful of messages need to be communicated among nodes, and the absolute time to perform the computations was 12.5 s for data from 3000 practices.

Contacting patients (R5)

If full identifying information about cases is sent to the PHU, it can contact the patients directly if an investigation needs to be initiated. However, under our protocol, only count data about patient encounters are sent to the PHU. For sentinel surveillance programs, this is generally not problematic because contacting patients is not usually done. However, for other types of indicator-based surveillance, the PHU may want to contact patients under certain circumstances.

The PHU would first need to find out which practices have unusually high counts that require investigation. Subsequently, these practices are contacted and asked to identify the cases. Each practice has access to a line list of the individual level records that make up each stratum count, and therefore can respond with more detailed information about specific patients. The PHU only needs to determine which practice(s) have unusual spikes.

We present a protocol extension in the appendix for identifying the N practices with the largest counts within a regional group. This protocol does not reveal the actual counts from any of these practices, only that they have the largest counts in their group. The PHU can then identify the practice(s) with the highest counts and contact them for additional details. These details could include detailed line listings of the patients who made up specific strata.

Detecting disease outbreaks (R6)

The ability to detect spatial clusters is important for disease surveillance. We consider two scenarios.

For the first scenario, the counts are stratified by some geographic area, such as the census tract. This would indicate the counts of patients in each area aggregated across practices. It would be necessary to ensure that the areas are large enough to protect patient identifiability, however.66 ,67 Since our protocol would not affect these counts, the ability to detect clusters will be the same as current distributed surveillance protocols.

In the second scenario, the strata sent to the PHU do not contain patient-specific location information. Therefore, the PHU could perform clustering on the practices themselves to detect geographically adjacent practices with unusually high cases. The question is whether the grouping of practices masks the ability to detect such practice clusters. In the appendix, we present the results of a simulation demonstrating that, for practice groups of size 5 and 10, the accuracy of cluster detection is quite high (F-scores greater than 0.95 and 0.92 respectively) and similar to when the practices are not grouped.

Deployment considerations

For deployment, two aggregators and KH pairs can coexist in the same physical node/site, since collusion between them would not reveal any new information. In addition, the mixer and the public-health unit can exist on the same physical node/site. This is illustrated in figure 3, which shows that only four nodes/sites would be needed to implement the protocol as described. The same nodes can support multiple local and national surveillance initiatives. Additional practical deployment considerations are provided in the appendix.

Figure 3

Minimalist deployment of the protocol. Each box represents a single node. Note that there may be multiple practice nodes; we have shown only two here.

It would be important to convince providers to participate in such a surveillance protocol. A recent study examining family doctor attitudes toward the disclosure of patient data for public-health purposes determined that an endorsement by their professional college would be a key factor in their willingness to participate in disclosures of data to public health (unpublished data). The reasoning would be that the college would be an independent and trusted party that would provide an objective opinion regarding the trustworthiness of the protocol. Therefore, as an initial step for implementation, it will be important to engage with the professional colleges and work with them to transition such a protocol into practice.

Funding

This work was partially funded by the Canadian Institutes of Health Research, the GeoConnections program of Natural Resources Canada, the Ontario Institute for Cancer Research, the Natural Sciences and Engineering Research Council and grant number R01-LM009989 from the National Library of Medicine, National Institutes of Health.

Competing interests

None.

Provenance and peer review

Not commissioned; externally peer reviewed.

Acknowledgments

We wish to thank P AbdelMalik, S Rose and G Middleton for their help in the data preparation and spatial analysis for the practice aggregation problem.

This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs licence (http://creativecommons.org/licenses/by-nc-nd/3.0/) which permits non-commercial reproduction and distribution of the work, in any medium, provided the original work is not altered or transformed in any way, and that the work is properly cited. For commercial re-use, please contact journals.permissions@oup.com

References

View Abstract