Evaluating Predictors of Geographic Area Population Size Cutoffs to Manage Reidentification Risk
Abstract
Objective: In public health and health services research, the inclusion of geographic information in data sets is critical. Because of concerns over the reidentification of patients, data from small geographic areas are either suppressed or the geographic areas are aggregated into larger ones. Our objective is to estimate the population size cutoff at which a geographic area is sufficiently large so that no data suppression or further aggregation is necessary.
Design: The 2001 Canadian census data were used to conduct a simulation to model the relationship between geographic area population size and uniqueness for some common demographic variables. Cutoffs were computed for geographic area population size, and prediction models were developed to estimate the appropriate cutoffs.
Measurements: Reidentification risk was measured using uniqueness. Geographic area population size cutoffs were estimated using the maximum number of possible values in the data set and a traditional entropy measure.
Results: The model that predicted population cutoffs using the maximum number of possible values in the data set had R^{2} values around 0.9, and relative error of prediction less than 0.02 across all regions of Canada. The models were then applied to assess the appropriate geographic area size for the prescription records provided by retail and hospital pharmacies to commercial research and analysis firms.
Conclusions: To manage reidentification risk, the prediction models can be used by public health professionals, health researchers, and research ethics boards to decide when the geographic area population size is sufficiently large.
Introduction
Privacy legislation in Canada applies to identifiable information. This means that if health information is deemed sufficiently deidentified, then there is no legislative requirement to obtain consent from patients to collect it and use it.1 In addition, Research Ethics Boards (REBs) are more likely to waive the consent requirement if the information collected for research is deemed deidentified.2 The option to waive consent is important as there is evidence that currently used methods for obtaining opt-in consent can result in low recruitment and selection bias in health research.3–10 The ability to make precise claims about identifiability is therefore needed to inform this consent waiver decision.
It is obvious that variables such as name and address would have to be removed, or not collected to start off with, to deidentify a data set. However, beyond the elimination of such variables, the definition of identifiability is often vague and remains an active area of research.11
The inclusion of geographic information (geocoding) in health data sets is critical for public health investigations and health services research.12–17 However, the inclusion of geographic details in a data set also makes it much easier to reidentify patients.18,19 The more specific the geographic detail included, the easier it is to use the other variables/information in the data to uniquely identify an individual. In fact, recently the federal court accepted evidence that the inclusion of the “Province” field in Health Canada's adverse drug events database can potentially reidentify individuals.20 Therefore, the province where the adverse event occurred cannot be disclosed by Health Canada in response to an access to information request. It has also been shown that patient addresses can be reidentified from published maps.21–23 Consequently, there is a risk that geographic detail in health data sets makes Canadians identifiable.
To protect privacy one can mask geocodes,24,25 or control geographic area population size (GAPS) to minimize the risk of reidentification. Due to its relative simplicity, controlling GAPS has been adopted widely in practice. Controlling GAPS means either that data about individuals living in areas with small populations are suppressed, or that areas with small populations are aggregated into larger ones. Suppression results in the direct loss of data, and aggregation reduces the utility of a data set.26–28 Controlling GAPS is nonetheless justified by the demonstrated empirical relationship between GAPS and reidentification risk29–31: reidentification risk tends to be higher in areas with smaller populations.
Examples of GAPS cutoff use include the United States Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. The HIPAA Privacy Rule defines 18 variables in the Safe Harbor List that need to be removed or generalized to ensure that a data set is deidentified. One of these 18 items stipulates that the first three numbers of the ZIP code can be collected/disclosed if the population living within that geographic area is greater than 20,000 people. The US Bureau of the Census has a 100,000 GAPS cutoff for releasing public use microdata files.32–34 That same cutoff is used for making disclosure control decisions with public health data sets.35,36 Only data from areas with a population of 120,000 or more are released as microdata from the British census.37 Similarly, Statistics Canada uses a 70,000 population size cutoff for health regions to control the risk of disclosure when releasing data from the Canadian Community Health Survey (CCHS).38 It has been suggested that different GAPS cutoffs should be applied depending on the user, with a 25,000 cutoff for data disclosed to researchers, and a 100,000 cutoff for data disclosed to the public.39
The dearth of evidence supporting the specific cutoffs that are used in practice, and the “real research need to develop empirical evidence to justify recommendations regarding geographic specificity”19 make the continued search for GAPS cutoffs important. Furthermore, existing GAPS cutoffs do not account for the fact that a cutoff is inherently dependent on the number and nature of the variables under consideration.31,40 For example, the cutoff to apply when one has two variables will be smaller than a cutoff to apply when there are 15 variables. When the variables have few response categories, the cutoff will be smaller than when they have many response categories. Therefore, many GAPS cutoffs in current use (summarized above) may be overprotecting data sets or not protecting them enough, depending on the specific variables in question.
The purpose of our study is to provide an empirically grounded basis for using GAPS cutoffs. The primary contributions of this work are to (a) provide models for predicting the GAPS cutoffs that explicitly account for reidentification risk and the variable characteristics based on two simple metrics: the number of possible combinations of data fields and entropy, (b) validate these models using Canadian census data, and (c) demonstrate their applicability with two examples of pharmacy prescription data.
Methods
Definitions and Preliminaries
Quasiidentifiers
When considering reidentification risk, we are only interested in a subset of variables in a data set.41 These are called the quasiidentifiers.42 They are variables that make individuals unique in the population and are possibly publicly known. Therefore, they do not directly identify an individual, but can be used for indirect reidentification. While there is no universal definition of what constitutes a quasiidentifier, there are some quasiidentifiers that have been studied more extensively than others such as gender, date of birth, ethnicity, income, years of education, and geocodes. In addition, quasiidentifiers may differ across data sets. For example, gender will not be a meaningful quasiidentifier if all of the individuals in a data set are female. Lastly, in this study, the quasiidentifiers that are assessed have a finite set of possible discrete values.
Uniqueness as a Measure of Reidentification Risk
We define a unique individual as the only individual with a specific combination of values on the quasiidentifiers in a particular geographic area. For example, if there is only one 95-year-old male in a postal code, then that individual is unique within that postal code. The uniqueness of individuals is often used as a surrogate measure for reidentification risk: unique records in a data set are more likely to be reidentified by an intruder than nonunique records.43 We therefore use uniqueness as our measure of reidentification risk.
Nested Geographic Areas
Geographic area aggregation implies a nesting relationship among those areas. For example, if we decide that reidentification risk is too high when we geocode using full postal codes, then we can aggregate the geographic area to Forward Sortation Areas (FSA), which are the first three characters of the postal code. Postal codes are nested within FSAs.
Determining the GAPS Cutoffs
Geographic areas can be measured in terms of the physical area or population size. In this paper we refer only to the population size of the geographic area.
Previous research has identified two characteristics of the relationship between uniqueness and GAPS:29–31
Uniqueness in a data set is inversely related to the population size of the geographic area. This means that the proportion of unique individuals in a large area will be smaller than in a nested smaller area. As smaller areas are aggregated into larger areas, the proportion of uniques goes down (see Fig 1).
Once GAPS reaches a certain point, uniqueness tends to plateau. This trend applies irrespective of the quasiidentifiers in question.
A case has been made that the 100,000 GAPS cutoff used by the Census Bureau is justified by computing the uniqueness plateau noted above (i.e., the point at which uniqueness no longer changes).29 The rationale is that increasing the size of the geographic area any further has little impact on uniqueness, and hence little impact on reidentification risk.29–31 For example, if the uniqueness plateau is reached at 100,000 then this means the reidentification risk changes insignificantly between 100,000 and 110,000. Therefore, there is no disclosure control benefit in increasing the size of the geographic region or of aggregation beyond 100,000, and a reasonable cutoff would be 100,000.
In our analysis we build on a methodology used in a previous study at the Census Bureau29,31 and proceed as follows:
Define a quasiidentifier model as a specific quasiidentifier or combination of quasiidentifiers and evaluate its uniqueness.
Plot uniqueness against GAPS and compute the cutoff point as the point where the derivative approaches zero (illustrated in Fig 1).
Let the geographic areas under consideration be indexed by i = 1, ..., K, and let S_{i} denote the population size of area i. The area indexed by i is nested within the area indexed by i+1. Consequently, we also have S_{i} < S_{i+1} for all i. We denote by U(S_{i}) the percentage of individuals in area i who are unique on a particular quasiidentifier model. Because of the monotonically decreasing relationship between GAPS and uniqueness, we expect the following relationship to hold: U(S_{i}) > U(S_{i+1}). The GAPS cutoff was then defined as the value of S_{i} where the approximate derivative, the change in the percentage of uniques, is close to zero.31
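As a sketch of this finite-difference criterion, and assuming that the approximate derivative is taken between consecutive nested areas, the condition can be written as follows. The tolerance symbol ε is an illustrative assumption and is not a value taken from the study.

```latex
\frac{U(S_{i}) - U(S_{i+1})}{S_{i+1} - S_{i}} \le \varepsilon,
\qquad \varepsilon > 0 \ \text{small}
```

Because U(S_{i}) > U(S_{i+1}) and S_{i} < S_{i+1}, the ratio is positive, so the condition bounds the rate at which uniqueness is still falling.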
This approach, however, may identify local plateaus where the uniqueness remains temporarily steady, followed by a more substantial decrease to reach the asymptotic value. To address this we adopted a model building approach where the uniqueness function is defined as U(S_{i}) = β_{0} × S_{i}^{β_{1}}, where β_{0} and β_{1} are estimated using ordinary least squares regression. We then take the derivative of this function, dU/dS_{i} = β_{0} × β_{1} × S_{i}^{β_{1} − 1} (Eq 2), and compute the cutoff as the size value where the derivative approaches zero.
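The model building step can be sketched in Python as follows. This is a minimal illustration, not the study's code: it assumes the power function is estimated by ordinary least squares on the log-log scale, and the function names, the synthetic data, and the default threshold argument are hypothetical.

```python
import math

def fit_power_law(sizes, uniqueness):
    """Estimate b0 and b1 in U = b0 * S**b1 by OLS on log(U) = log(b0) + b1*log(S).

    Assumption: the fit is done on the log-log scale; all values must be positive.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(u) for u in uniqueness]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = math.exp(mean_y - b1 * mean_x)
    return b0, b1

def derivative(b0, b1, size):
    """Derivative of U(S) = b0 * S**b1 with respect to S."""
    return b0 * b1 * size ** (b1 - 1)

def find_cutoff(b0, b1, sizes, threshold=0.001):
    """Smallest candidate size whose derivative magnitude is below the threshold.

    Only the supplied sizes are searched, so the result never extrapolates
    beyond the simulated range. Returns None if no size qualifies.
    """
    for size in sorted(sizes):
        if abs(derivative(b0, b1, size)) < threshold:
            return size
    return None

# Synthetic, purely illustrative uniqueness curve (percent unique by area size).
example_sizes = list(range(5000, 200001, 5000))
example_uniqueness = [3000 * s ** -0.5 for s in example_sizes]
b0_hat, b1_hat = fit_power_law(example_sizes, example_uniqueness)
print(find_cutoff(b0_hat, b1_hat, example_sizes))
```

Restricting the search to the observed sizes mirrors the decision to evaluate potential cutoffs only within the GAPS range of the data.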
The cutoff values were computed separately for central Canada (which includes Ontario and Quebec), western Canada (which includes all territories and provinces west of Ontario), and eastern Canada (which includes all provinces east of Quebec).
Data Source
The data set used for our study is the 2001 Canadian census Public Use Microdata File (PUMF) made available by Statistics Canada.44 The PUMF represents approximately 2.7% of the Canadian population. The variable subset that is analyzed is shown in Table 1. These are common demographics that are often available in health data sets. There are 10 quasiidentifiers. These variables were selected because they can be used to link with other databases, because they describe attributes which are visible on individuals, or because they describe attributes which would make individuals easily identifiable.41
Variable Name in the Census File  Definition  Number of Response Categories^{*} (Western and Central Canada)  Number of Response Categories^{*} (Eastern Canada)
SEXP  sex  2  2
AGEP  single years of age from 0 to 84, 85+  86  86
HLNPA  language: the language spoken most often at home by the individual  14  4
ETHNICRA  ethnic or cultural group to which respondent's ancestors belong  41  26
ABSRP  aboriginal identity  4  4
TOTSCHP  total years of schooling  9  9
MARST  marital status (legal)  5  5
RELIGRPA  religious denomination  11  3
TOTINCP  total income: we defined categories of total income in $15K intervals  11  11
VISMINP  visible minority  4  4

* The number of response categories excludes nonspecific responses such as missing value, not available, or “other”.
Disclosure control was already applied to the PUMF by Statistics Canada. Two aspects are relevant to this study: (a) some variables were suppressed or coarsened for the Eastern region of Canada, and (b) the age variable was top coded at 85 years. As a result, three variables in the Eastern region, as seen in Table 1, correspond to variables in the West and Central regions but have a smaller number of coarsened response categories.
Quasiidentifier Models
A quasiidentifier model consists of one or more quasiidentifiers (qids). To manage the scope, we only consider combinations of up to five quasiidentifiers.
There are some similarities among the ethnicity-related variables, and therefore they were treated as a group: ETHNICRA, HLNPA, RELIGRPA, and VISMINP. Whenever the ethnicity variable appeared in a model, it was replaced in turn by each of the other variables in the group. Each substitution represented a different model. This gives 7 distinct qids: sex, age, ethnicity, schooling, marital status, total income, and aboriginal identity.
Categorizing the 7 distinct qids by their sensitivity and availability to an intruder gives the following two types:
Easily used and available for reidentification: sex and age
Possibly usable for reidentification/sensitive: ethnicity, schooling, marital status, total income, and aboriginal identity
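We first constructed models that contain both age and gender together with the sensitive qids. That is, models containing: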
5 qids: have age and gender and 10 combinations of 3 of the 5 sensitive qids.
4 qids: have age and gender and 10 combinations of 2 of the 5 sensitive qids.
3 qids: have age and gender and each of the 5 sensitive qids.
2 qids: have age and gender only—there is only one model.
This gives 26 models for the 7 distinct qids. Substituting each of home language, religion and visible minority for ethnicity then gives us 18 (3 × 6) models for 5 qids (ethnicity appears in 6 of the 10 models), 12 (3 × 4) models for 4 qids (ethnicity appears in 4 of the 10 models), and 3 (1 × 3) models for 3 qids. The subtotal for this group is 59 models.
We repeated the above process using only one of age or gender in combination with the sensitive qids. With age alone, these are models containing:
5 qids: have age and 5 combinations of 4 of the 5 sensitive qids.
4 qids: have age and 10 combinations of 3 of the 5 sensitive qids.
3 qids: have age and 10 combinations of 2 of the 5 sensitive qids.
2 qids: have age and each of the 5 sensitive qids only.
This gives 30 models. As with the previous group, substituting the ethnicity-related variables gives a subtotal of 75 models for this group. For the last group, age is replaced with gender for an additional 75 models.
Therefore, in total we tested 209 different quasiidentifier models.
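The model counts above can be checked by enumeration. The following Python sketch rebuilds the three groups and the ethnicity substitutions; the variable labels and function names are illustrative assumptions rather than identifiers from the census file.

```python
from itertools import combinations

# The five "sensitive" qids; ethnicity stands in for the ethnicity-related group.
SENSITIVE = ["ethnicity", "schooling", "marital status",
             "total income", "aboriginal identity"]
# Variables substituted for ethnicity, each substitution forming a new model.
ETHNICITY_SUBSTITUTES = ["home language", "religion", "visible minority"]

def expand(model):
    """Return the model itself plus one variant per ethnicity substitute."""
    variants = [model]
    if "ethnicity" in model:
        for substitute in ETHNICITY_SUBSTITUTES:
            variants.append(tuple(substitute if q == "ethnicity" else q
                                  for q in model))
    return variants

def build_models():
    """Enumerate all quasiidentifier models with at most five qids."""
    models = []
    # Group 1: age and sex together, plus 0 to 3 sensitive qids.
    for k in range(0, 4):
        for combo in combinations(SENSITIVE, k):
            models.extend(expand(("age", "sex") + combo))
    # Groups 2 and 3: age alone, then sex alone, plus 1 to 4 sensitive qids.
    for base in ("age", "sex"):
        for k in range(1, 5):
            for combo in combinations(SENSITIVE, k):
                models.extend(expand((base,) + combo))
    return models

all_models = build_models()
print(len(all_models))
```

The enumeration reproduces the subtotals of 59, 75, and 75 models and the overall total of 209.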
Varying Region Size
We performed a simulation following the nested sampling method described by Greenberg and Voshell.30,31 We took a simple random sample of 200,000 individuals from western Canada, 200,000 from central Canada, and 60,000 from eastern Canada. For each of these three regions of Canada, we varied the size of the region by randomly removing individuals in 5,000 decrements. For example, for central Canada, we started with a random sample of 200,000 individuals, then a subsample of 195,000 was randomly selected, and then another subsample with 190,000 individuals, and so on. For each subsample we computed the proportion of unique records on each of the 209 quasiidentifier models described above. The cutoff was selected as the size at which the magnitude of the derivative given by Eq (2) fell below 0.001.
This simulation approach has been shown to produce results that are quite similar to using actual contiguous areas (e.g., Census Tracts).30,31 Furthermore, it has been argued that this simulation approach ensures that the results are controlled, replicable, and generalizable.31
When computing the cutoff using the derivative (Eq 2), the potential cutoffs were evaluated only within the GAPS range in our data set (i.e., 5–200 k for western and central Canada, and 5–60 k for eastern Canada) to ensure that we did not extrapolate beyond the original data used to build the models.
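The nested subsampling described in this section can be sketched as follows. The record layout (one dictionary per individual), the function names, and the seed are assumptions made for illustration; the study's own implementation is not shown in the text.

```python
import random
from collections import Counter

def proportion_unique(records, qids):
    """Share of records whose combination of values on `qids` occurs exactly once."""
    counts = Counter(tuple(record[q] for q in qids) for record in records)
    uniques = sum(1 for count in counts.values() if count == 1)
    return uniques / len(records)

def nested_uniqueness(records, qids, step, seed=0):
    """Shrink a sample by `step` records at a time and track uniqueness.

    Each subsample is nested within the previous one because records are
    only ever removed from the tail of a single shuffled list.
    Returns a list of (sample size, proportion unique) pairs.
    """
    rng = random.Random(seed)
    current = list(records)
    rng.shuffle(current)
    results = []
    while current:
        results.append((len(current), proportion_unique(current, qids)))
        current = current[:-step] if len(current) > step else []
    return results

# Small synthetic sample, purely for illustration.
demo_rng = random.Random(42)
demo_records = [{"age": demo_rng.randrange(86), "sex": demo_rng.randrange(2)}
                for _ in range(2000)]
for size, share in nested_uniqueness(demo_records, ["age", "sex"], step=500):
    print(size, round(share, 4))
```

With a starting sample of 200,000 and a step of 5,000, the loop ends at a smallest subsample of 5,000, which matches the lower end of the GAPS range evaluated in the simulation.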
Predicting the GAPS Cutoff
To make the results of the simulation more practical for an end user, such as a privacy analyst or epidemiologist who needs to calculate the GAPS cutoff for a particular study or data set, we developed prediction models. As noted earlier, we expected that a cutoff is related to the quasiidentifiers that are being considered. The following are two traditional ways used to characterize the quasiidentifiers:
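Entropy. A previous study formulated an entropy measure that captures the dispersion in the quasiidentifiers.31 This was found to be strongly related to uniqueness within a region. We computed the standard information theoretical entropy measure from the full samples using H = −Σ_{j} p_{j} log p_{j}, where p_{j} is the proportion of records in the sample that take the jth combination of values on the quasiidentifiers.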
MaxCombs. The maximum number of possible different values for the quasiidentifiers. For example, if we have two quasiidentifiers, say, age and gender, and assume that age has 86 possible values and gender has 2 values, 86 × 2 = 172 is the maximum number of different possible combinations of values for these two quasiidentifiers. It is expected that the greater the maximum number of combinations the more uniques will be in a data set.31
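Both characterizations are simple to compute. The sketch below assumes records are stored as dictionaries and that entropy is measured in bits (base 2 logarithm); the logarithm base and the function names are assumptions, since the text specifies only the standard information theoretical measure.

```python
import math
from collections import Counter

def max_combs(category_counts):
    """Product of the number of response categories of each quasiidentifier."""
    return math.prod(category_counts)

def entropy(records, qids):
    """Shannon entropy (in bits) of the joint distribution of `qids` in a sample."""
    counts = Counter(tuple(record[q] for q in qids) for record in records)
    n = len(records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Age (86 values) and gender (2 values), as in the MaxCombs example above.
print(max_combs([86, 2]))
```

MaxCombs depends only on the coding scheme, so it can be computed before any data are collected, whereas Entropy requires the observed distribution.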
We constructed two prediction models, each with a single independent variable: Entropy, or MaxCombs. An examination of the data indicated an obvious logarithmic relationship between each of these variables and the GAPS cutoff, giving us the following two linear models: log (GAPS_CUTOFF) ∼ β_{0} + β_{1}log (Entropy) and log (GAPS_CUTOFF) ∼ β_{0} + β_{1}log (MaxCombs). For each of the two prediction models we had 209 observations representing the quasiidentifier models.
The GAPS cutoff value is truncated from below at 5,000 because that is the smallest subsample that was selected. It is also truncated at the top at 200,000 for central and western Canada, and 60,000 for eastern Canada because that was the size of the total sample that we used. Neither Entropy nor MaxCombs is truncated. A suitable modeling technique for such a censored data set is Tobit regression.45–47
Let y denote the actual value of the GAPS cutoff, the point at which the approximate derivative is close to zero, produced during our simulations. We have y ≥ c_{1} and y ≤ c_{2}, where c_{1} and c_{2} are the bottom and top truncation threshold values respectively. Also, let y* denote an underlying latent variable that is observed only when it falls between the two thresholds.
The Tobit model takes the form:
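The display is sketched here in the standard two-limit form, under the assumption that the latent variable follows the log-log specification given earlier. The symbol x stands for either Entropy or MaxCombs, and the normally distributed error term ε with variance σ^{2} is the usual Tobit assumption rather than a detail stated in the text.

```latex
\log y^{*} = \beta_{0} + \beta_{1}\log x + \varepsilon,
\qquad \varepsilon \sim N(0, \sigma^{2})

y =
\begin{cases}
c_{1} & \text{if } y^{*} \le c_{1} \\
y^{*} & \text{if } c_{1} < y^{*} < c_{2} \\
c_{2} & \text{if } y^{*} \ge c_{2}
\end{cases}
```

Under this form the observed cutoff y coincides with the latent value y* whenever y* lies inside the simulated range, and is pinned to c_{1} or c_{2} otherwise.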
To determine the goodness of fit of the models, we used the pseudo-R^{2} of McKelvey and Zavoina,48 which was shown to be valid for the Tobit model.49 A Monte Carlo simulation compared different pseudo-R^{2} measures for the Tobit model and found this one to be the best,50 with the main criterion being equivalence to the R^{2} measure that would be obtained using ordinary least squares regression if there was no censoring in the data.
Validation of GAPS Cutoff Prediction Models
To validate the GAPS cutoff values that we used, the delta score was computed for each of the three regions of Canada. This score indicates how far the uniqueness at the GAPS cutoff was from the asymptotic value. Small values of the delta score indicate that uniqueness is close to zero, and that any additional geographic area aggregation would have an insignificant impact on uniqueness.
An end user can enter either the Entropy or MaxCombs values in the Tobit models to predict the GAPS cutoff value for their study. To validate the accuracy of the prediction models, we used the Tobit models to predict the GAPS cutoff using 10-fold cross-validation.51,52 That is, we divided the data sets into deciles and used one decile in turn for validation, and the remaining nine deciles to build the model.
The predicted cutoff used for validation was the unconditional value of the realized variable y̑—the full equation for this estimate is provided in the literature.45–47 Using y̑ in the validation ensured that the predicted value was also censored. The quality of the prediction was evaluated by considering the median and trimmed mean of the error (y−y̑) and the relative error, defined as (y−y̑)/y.
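The validation bookkeeping can be sketched as follows. The trimming proportion (10% from each tail), the assignment of observations to folds by index, and all function names are assumptions made for illustration; the text does not state these details, and the Tobit fit itself is not shown.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each tail (assumed level)."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return statistics.fmean(kept)

def prediction_errors(actual, predicted):
    """Return the errors (y - y_hat) and the relative errors (y - y_hat) / y."""
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    relative = [(y - y_hat) / y for y, y_hat in zip(actual, predicted)]
    return errors, relative

def k_fold_indices(n, k=10):
    """Yield (training, validation) index lists for k roughly equal folds."""
    folds = [list(range(start, n, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        training = [i for i in range(n) if i not in held]
        yield training, held_out

# Hypothetical actual and predicted cutoffs, purely for illustration.
errors, relative = prediction_errors([20000, 35000, 50000], [21000, 34000, 52000])
print(statistics.median(errors), trimmed_mean(relative))
```

Each of the 209 quasiidentifier models serves as a validation observation exactly once, so the summary statistics cover the full set of models.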
Applying the Prediction Models
Since an end user does not need to worry about censoring (which is an artifact of our simulation), the predicted value of the latent variable, y̑*, would be used instead. This is given by y̑* = e^{β_{0}} × Entropy^{β_{1}} or y̑* = e^{β_{0}} × MaxCombs^{β_{1}}, where β_{0} and β_{1} are the model parameter estimates.
We present the results in the next section. In the discussion, we then illustrate the application of the prediction models with real examples pertaining to the disclosure of retail and hospital pharmacy data to commercial data aggregators.
Results
An example of the relationship between GAPS and the proportion of unique records is shown in Fig 2. A similar pattern was observed for all regions and variable combinations. As illustrated in Fig 1, the cutoff was calculated from such a graph by fitting a model and taking its derivative. The cutoff values were then used to develop the prediction models, as described in the previous section.
Table 2 shows the delta scores, which indicate how far uniqueness was from the asymptotic value at the various GAPS cutoffs that were calculated. As can be seen, there is very little difference in uniqueness across the regions, suggesting that there is little disclosure control benefit in increasing area sizes beyond the cutoffs that were calculated.
Delta score  West  Central  East
Trimmed mean  0.007  0.0068  0.0061 
Median  0.0036  0.0033  0.0037 
In Tables 3 and 4 we show the model parameters and diagnostics to predict the GAPS cutoff as a function of Entropy and MaxCombs, respectively. As is clear, all of the parameters are statistically significant, and the goodness of fit is high.
Entropy Prediction Models
Region  Pseudo-R^{2}  Intercept  Log (entropy) parameter est.  Prediction error (10-fold): trimmed mean  Prediction error (10-fold): median  Relative prediction error (10-fold): trimmed mean  Relative prediction error (10-fold): median
Western  0.89  6.3; p < 0.0001  2.8; p < 0.0001  −4,433  −1,500  0.012  −0.02
Central  0.8  6.5; p < 0.0001  2.6; p < 0.0001  −1,218  −7,405  −0.015  0.019
Eastern  0.9  7.0; p < 0.0001  1.8; p < 0.0001  −1,284  −524  0.0024  −0.019
MaxCombs Prediction Models
Region  Pseudo-R^{2}  Intercept  Log (MaxCombs) parameter est.  Prediction error (10-fold): trimmed mean  Prediction error (10-fold): median  Relative prediction error (10-fold): trimmed mean  Relative prediction error (10-fold): median
Western  0.9  7.4; p < 0.0001  0.4; p < 0.0001  −2,175  −1,325  −0.012  −0.016
Central  0.9  7.3; p < 0.0001  0.4; p < 0.0001  −2,472  −1,156  −0.0002  −0.013
Eastern  0.9  7.6; p < 0.0001  0.3; p < 0.0001  −920  −445  −0.007  −0.015
For both the Entropy and MaxCombs prediction models, the prediction errors are quite small. While the MaxCombs models have a slightly higher goodness of fit than the Entropy models, the prediction accuracy of the two is very similar.
Discussion
The results suggest that the three regional models we have constructed for predicting the GAPS cutoff from both the Entropy and MaxCombs values can be quite accurate. They also make clear that having a single GAPS cutoff would be a serious oversimplification and that the appropriate cutoff is a function of the quasiidentifiers that will be collected and the region of Canada.
Geographic areas that are larger than the GAPS cutoff represent low reidentification risk since they are close to the asymptotic risk value of zero, and there is also no disclosure control benefit in aggregating areas beyond the cutoff.
The prediction accuracy results were similar for MaxCombs and Entropy. One would expect Entropy to perform better given that it represents more information about the data distribution. However, there may be a ceiling effect in that the accuracy for either variable is sufficiently high that it is difficult for Entropy to outperform MaxCombs.
In practice, the MaxCombs value is easier to compute than the Entropy value. It is also possible to compute MaxCombs at the outset of a study during the design phase before any data are collected. We therefore recommend using the MaxCombs results in practice since in terms of accuracy they are very comparable to the Entropy results.
To apply these results an analyst first needs to compute the maximum number of combinations for the quasiidentifiers in the data set. Once this MaxCombs value is determined, the prediction models in Table 5 can be used to compute the GAPS cutoff. If the cutoff is deemed too large then the analyst can look at ways to reduce the value of MaxCombs by collapsing or coarsening the response categories. This process can be repeated until the cutoff is sensible for the particular study.
Region of Canada  GAPS Cutoff 
Western  1588×MaxCombs^{0.42} 
Central  1436×MaxCombs^{0.43} 
Eastern  1978×MaxCombs^{0.304} 

GAPS = geographical area population size.
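The workflow of computing MaxCombs, predicting the cutoff, and coarsening if the cutoff is too large can be sketched as follows. The coefficients are copied from Table 5; the function name, the dictionary layout, and the 18-band age coding used for the coarsening step are hypothetical.

```python
# Coefficients copied from Table 5: cutoff = scale * MaxCombs ** exponent.
TABLE_5 = {
    "western": (1588, 0.42),
    "central": (1436, 0.43),
    "eastern": (1978, 0.304),
}

def gaps_cutoff(region, max_combs):
    """Predicted GAPS cutoff for a region of Canada and a MaxCombs value."""
    scale, exponent = TABLE_5[region]
    return scale * max_combs ** exponent

# Age in single years (86 values) and gender (2 values): MaxCombs = 172.
detailed = 86 * 2
# Hypothetical coarsening: age collapsed into 18 bands, gender unchanged.
coarsened = 18 * 2

for region in TABLE_5:
    print(region,
          round(gaps_cutoff(region, detailed)),
          round(gaps_cutoff(region, coarsened)))
```

Because the exponents are positive, any coarsening that lowers MaxCombs also lowers the predicted cutoff, which is what makes the iterative procedure converge toward a workable area size.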
Applying the Results
The following disclosure control example, which concerns the reidentification of patients from their prescription records, illustrates the application of our results. Many retail and hospital pharmacies across Canada provide prescription data to commercial data aggregators (we will refer to these data as “prescription records”). Prescription records are used to produce reports on physician prescription patterns and drug use.53 These reports are then sold primarily to the pharmaceutical industry and government agencies.
In practice, the prescription records provided to commercial data aggregators do not contain directly identifying information about the patients (e.g., patient names and telephone numbers). However, it has been argued that the patient information that is disclosed in such records can still reidentify patients,54,55 and that this possible reidentification jeopardizes the confidentiality of Canadians' health information.54
The relevant quasiidentifiers in the prescription record are summarized in Table 6. We relied on five sources to construct this table: (1) the Canadian Pharmacists Association (CPhA) Pharmacy Claim Standard which defines all fields in the pharmacy electronic record used for claims adjudication,56 (2) a report provided to us on the variables collected by the data management group at IMS Health Canada Inc, one of the largest commercial data aggregators in Canada,57 (3) the investigation report by the Office of the Information and Privacy Commissioner of Alberta (OIPCA) which listed the 37 fields that are collected by commercial data aggregators,58 (4) the results of a survey of provincial pharmacy regulatory authorities,54 and (5) a specification of the data collected by Brogan Inc from Canadian hospital pharmacies (Brogan is another large commercial data aggregator in Canada).59
Variable  Defined in CPhA Std?56  CPhA Mandatory?56  IMS57  Field in OIPCA Report?58  Disclosed According to Survey?54  Brogan59  Additional Explanations
Patient gender  Y  O  R  Y  Y^{**}  Y  All sources indicate that patient gender is collected. 
Patient year of birth  Y  O  R  Y  Y^{**}  Y  The survey suggests that some provinces collected the full date of birth.54 But both the OIPCA report58 as well as the IMS Health Reports57 indicate that only the year of birth is collected. 
Patient postal code  Y  O  —  —  N^{***}  Y^{†}  The survey indicated that only PEI allowed the collection of postal codes.54 When we contacted the pharmacy registrar in PEI it was made clear that if geographic information was disclosed by pharmacies, only the FSA was being disclosed rather than the full postal code. The IMS Health report indicated that neither the full postal code nor the FSA is collected from any province.57 The Brogan document indicated that the FSA was being collected.59
Pharmacy postal code  Y  M  Y  Y  —  Y  Brogan's data are from hospital pharmacies, therefore the pharmacist is known. 
Prescriber postal code  Y  O  Y^{*}  Y  Y^{¶}  Y  Prescriber group is in the record layout for the Brogan data. 

M = mandatory field in the CPhA claims standard; O = optional field (these fields will not necessarily be available for every pharmacy submitting data); R = the field is required by IMS Health Canada from all pharmacies submitting data, but if it is missing that would not invalidate the record; — = the field is not defined or collected at all; CPhA = Canadian Pharmacists Association; SD = standard deviation; OIPCA = Office of the Information and Privacy Commissioner of Alberta.
* Whether this field is collected depends on the arrangement with a particular pharmacy and on the province (not collected in BC, MN, QC).
** Except MN, QC, NS.
*** Except PEI.
¶ Except BC, SK, MN, Nfld.
† Brogan collects the patient FSA as part of its record layout.
Key variables that are disclosed pertaining directly to patients are gender and year of birth.
Brogan also collects the patient FSA, but IMS Health does not do so directly. However, it is often possible to infer new information about individuals from variables that already exist in a record:11 it may be possible to infer the patient (residence) postal code from the postal code of their pharmacy or the prescriber if one assumes that there is some regularity in the distances that patients travel to see their general practitioner, specialist, or pharmacist. A simulation concluded that a patient would have to live within a 100 m radius of the pharmacy or prescriber for the full postal code to be accurately predicted in urban areas.11 For rural areas, the distance varies from 1 km in Nova Scotia, to 5 km in Ontario, to 10 km in Alberta.11 We conducted a similar simulation to determine the accuracy of inferring the FSA and concluded that this can be accurately predicted if the patient lives within 10 km of the pharmacist/prescriber for rural areas, and within 1 km for urban areas in Nova Scotia and Alberta, and 0.5 km in Ontario.
In our analysis, we therefore made the assumption that the FSA was being collected or that it was reasonable to accurately infer the FSA for some of the patients if it is not collected.
Example 1
In this example, the prediction models were applied to assess patient reidentification risk for pharmacy prescription records in the ten Canadian provinces, for the two quasi-identifiers of age and gender. The MaxCombs value is 172: the number of possible values of age (86) multiplied by the number of gender categories (2). For each of the three regions of Canada, the GAPS cutoff was computed using the values in Table 5. The percentage of FSAs whose population size is above the predicted cutoff was then calculated for each province, along with the percentage of the population that resides in these FSAs.
The results are summarized in Table 7, and compared to the three other cutoffs that were being used: the 20,000 cutoff used in HIPAA (in practice the HIPAA Privacy Rule is sometimes used in Canada60), the Statistics Canada 70,000 cutoff for the CCHS, and the Census Bureau 100,000 cutoff. These data show that, except for New Brunswick, the vast majority of the provincial populations live in FSAs that are larger than the GAPS cutoff and therefore there is no disclosure control benefit in aggregating the geography any further.
Our GAPS Models  20,000 Cutoff  70,000 Cutoff  100,000 Cutoff  
Province  FSA  Pop  FSA  Pop  FSA  Pop  FSA  Pop 
Alberta  55%  84%  38%  71%  1.4%  5%  0.00  0 
British Columbia  68%  87%  46%  70%  1.1%  4.%  0.00  0 
Manitoba  59%  88%  39%  68%  0  0  0.00  0 
New Brunswick  20%  51%  4.5%  19%  0  0  0.00  0 
Newfoundland  55%  83%  30%  62%  0  0  0.00  0 
Nova Scotia  47%  82%  16%  43%  0  0  0.00  0 
Ontario  69%  91%  49%  76%  1.4%  5%  0.20%  1% 
PEI  57%  90%  43%  79%  0  0  0.00  0 
Quebec  59%  84%  36%  63%  1%  5%  0.25%  0 
Saskatchewan  60%  93%  49%  84%  2%  7%  0.00  2% 

FSA = forward sortation area; GAPS = geographical area population size; PEI = Prince Edward Island.
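The two percentages reported for each province can be computed from a list of FSA population counts and a cutoff. The following is a minimal sketch rather than the code used in this study: the FSA codes, their populations, and the cutoff of 5,000 are made-up illustrative values (the actual cutoffs come from the Table 5 models), and the function names are hypothetical.

```python
from math import prod


def max_combs(levels_per_variable):
    """Maximum number of possible value combinations of the quasi-identifiers."""
    return prod(levels_per_variable)


def coverage_above_cutoff(fsa_populations, cutoff):
    """Return (% of FSAs above the cutoff, % of population living in them)."""
    large = [p for p in fsa_populations.values() if p > cutoff]
    pct_fsas = 100.0 * len(large) / len(fsa_populations)
    pct_pop = 100.0 * sum(large) / sum(fsa_populations.values())
    return pct_fsas, pct_pop


# Age in years (86 values) and gender (2 values), as in Example 1.
example1_max_combs = max_combs([86, 2])

# Hypothetical FSA populations and a hypothetical cutoff, for illustration only.
fsa_populations = {"A1A": 12000, "A1B": 3000, "A1C": 20000, "A1E": 5000}
pct_fsas, pct_pop = coverage_above_cutoff(fsa_populations, cutoff=5000)
print(example1_max_combs, pct_fsas, pct_pop)
```

In this toy input, half of the FSAs exceed the cutoff but they hold most of the population, which is the same pattern seen in the provincial figures: the population percentage is higher than the FSA percentage because the FSAs below the cutoff are, by definition, the small ones.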
For commercial data aggregators, there are three possible options:

Suppress the FSAs that are smaller than the cutoff. For example, in Ontario, data from 31% (100% − 69%) of FSAs would need to be suppressed. These 31% of FSAs represent 9% of the Ontario population.

Given that age and gender are collected, determine what level of geographic aggregation would be suitable to avoid suppression of any data.

Coarsen or collapse the response categories of the quasi-identifiers, given that the level of geography is fixed at the FSA.
Suppression of data from small FSAs means that pharmacists would not be permitted to provide that data to the commercial data aggregators. Nevertheless, there would be far less FSA suppression using our models compared to the other cutoffs in current use: our models take into account the characteristics of the variables and calibrate the cutoff. For some provinces, no data would be released at all if some of the other GAPS cutoffs are applied.
For the second option described above, one can aggregate FSAs to the postal region, which is identified by the first character of the postal code. We found that all postal regions in the ten provinces are above the GAPS cutoff. Therefore, inclusion of the age and gender variables in the prescription record is possible as long as the geographic detail is at the postal region level, since the population at this level of geography always exceeds the cutoff. The advantage of this option is that no data needs to be suppressed at all; however, the disadvantage is that the aggregated geographic unit is quite large.
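The aggregation check for the second option can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the FSA populations and the cutoff are invented, and the helper names are hypothetical. The only fact taken from the text is that the postal region is the first character of the postal code.

```python
from collections import defaultdict


def aggregate_to_postal_region(fsa_populations):
    """Sum FSA populations by postal region (first character of the FSA code)."""
    regions = defaultdict(int)
    for fsa, population in fsa_populations.items():
        regions[fsa[0]] += population
    return dict(regions)


def all_above_cutoff(populations, cutoff):
    """True if every geographic unit has a population above the cutoff."""
    return all(p > cutoff for p in populations.values())


# Hypothetical FSA populations and cutoff, for illustration only.
fsa_populations = {"K1A": 4000, "K1B": 9000, "M4C": 3000, "M5V": 15000}
hypothetical_cutoff = 5000

regions = aggregate_to_postal_region(fsa_populations)
print(all_above_cutoff(fsa_populations, hypothetical_cutoff))  # FSA level
print(all_above_cutoff(regions, hypothetical_cutoff))          # postal region level
```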
For the third option described above, it is assumed that the FSA geographic detail needs to be retained; the question then is which of age and gender is to be coarsened and, for age, which interval to use for grouping. For example, instead of disclosing the age in years, age can be disclosed as part of a 2-year interval, a 5-year interval, or a 10-year interval. The results for such coarsened categories are shown in Table 8. As expected, the percentage of FSAs that can be disclosed increases as the amount of coarsening increases. However, for smaller provinces, such as New Brunswick, the proportion of the population in large FSAs remains low even with 10-year age intervals. Table 8 also shows the effect of coarsening the categories for age in terms of the percentage of the population. With 5-year age intervals, 98% of the Ontario population would be living in regions that are larger than the cutoff.
Original Variables  2-year Age Intervals  5-year Age Intervals  10-year Age Intervals  
Province  FSA  Pop  FSA  Pop  FSA  Pop  FSA  Pop 
Alberta  55%  84%  68%  92%  79%  96%  84%  98% 
British Columbia  68%  87%  78%  93%  90%  99%  93%  99% 
Manitoba  59%  88%  66%  92%  72%  95%  78%  98% 
New Brunswick  20%  51%  26%  59%  37%  70%  45%  75% 
Newfoundland  55%  83%  70%  91%  79%  95%  88%  98% 
Nova Scotia  47%  82%  54%  86%  66%  93%  72%  95% 
Ontario  69%  91%  78%  96%  84%  98%  87%  99% 
PEI  57%  90%  71%  97%  71%  97%  71%  97% 
Quebec  59%  84%  70%  91%  82%  96%  88%  99% 
Saskatchewan  60%  93%  69%  97%  69%  97%  71%  98% 

FSA = forward sortation area; GAPS = geographical area population size; PEI = Prince Edward Island.
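Coarsening age reduces the MaxCombs value that is fed into the prediction model. The sketch below shows how MaxCombs might change with the interval width, assuming that the 86 single-year age values are grouped into equal-width bins with a possibly shorter final bin; the text does not state exactly how the coarsened categories were formed, so the resulting values are illustrative and the function name is hypothetical.

```python
from math import ceil


def coarsened_max_combs(n_ages, interval_years, n_genders=2):
    """MaxCombs after grouping single-year ages into equal-width intervals.

    Assumes equal-width bins; the last bin may cover fewer years.
    """
    n_age_groups = ceil(n_ages / interval_years)
    return n_age_groups * n_genders


# 86 single-year age values, as in Example 1.
results = {width: coarsened_max_combs(86, width) for width in (1, 2, 5, 10)}
for width, value in results.items():
    print(f"{width}-year intervals: MaxCombs = {value}")
```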
Example 2
In this example we consider a specific data set from a hospital pharmacy. Records for all prescriptions dispensed from the Children's Hospital of Eastern Ontario pharmacy during the period beginning January 2007 to the end of June 2008 were obtained following institutional ethics approval. In total there were 94,100 records. These represent 10,259 patient visits and 6,902 unique patients.
The MaxCombs value for these data is 54, since the patient ages in years range from 0 to 26 (27 values) and there are two gender categories. Almost all of the patients of the hospital come from Ontario and Quebec; therefore, we used the Central Canada model from Table 5.
The results were that 95% of the patients in the prescription record database reside in FSAs that are larger than the cutoff. However, for pediatric hospital patients it is important to know the age in weeks for patients younger than 1 year. Here, the MaxCombs value is 156, and the result is that 89% of the patients live in FSAs that are larger than the Central Canada cutoff.
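The two MaxCombs values in this example can be checked with simple arithmetic. The first follows directly from the stated age range. For the second, the text gives only the total of 156; the decomposition below (52 weekly categories for patients younger than 1 year plus 26 yearly categories for ages 1 to 26) is our reading, chosen because it reproduces that total, and should be treated as an assumption.

```python
n_genders = 2

# Age in years from 0 to 26 inclusive gives 27 values.
ages_in_years = range(0, 27)
max_combs_years = len(ages_in_years) * n_genders

# Assumed decomposition: weekly categories under 1 year, yearly categories after.
weeks_under_one_year = 52
yearly_categories_1_to_26 = 26
max_combs_weeks = (weeks_under_one_year + yearly_categories_1_to_26) * n_genders

print(max_combs_years, max_combs_weeks)
```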
Summary
These examples show that the MaxCombs prediction models given in Table 5 provide a straightforward technique for determining the GAPS cutoffs for data sets, so that the reidentification risk is managed while an increased amount of data is made available to the health researcher.
Relationship to Other Work
There have been previous descriptive studies of uniqueness in the United States population on basic demographic variables, such as age and gender.61,62 However, these studies did not explicitly consider the impact of nested geographic areas and their population size on uniqueness.
We used uniqueness as the measure for reidentification risk. Another common criterion for evaluating reidentification risk is k-anonymity.63,64 This criterion considers that nonunique records are also risky. However, even under k-anonymity, unique records are still those with the highest probability of reidentification. Therefore, managing the risk of reidentification from uniques remains an important objective in disclosure control.
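The difference between the two criteria can be illustrated on a small set of records. This is a generic sketch with invented records and hypothetical function names; it counts uniques within a single table, whereas the study measured uniqueness in census populations within geographic areas.

```python
from collections import Counter


def equivalence_class_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def proportion_unique(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs only once."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    n_unique = sum(1 for size in sizes.values() if size == 1)
    return n_unique / len(records)


def satisfies_k_anonymity(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    sizes = equivalence_class_sizes(records, quasi_identifiers)
    return min(sizes.values()) >= k


# Invented records, for illustration only.
records = [
    {"age": 34, "gender": "F"},
    {"age": 34, "gender": "F"},
    {"age": 34, "gender": "M"},
    {"age": 71, "gender": "M"},
    {"age": 71, "gender": "M"},
]
qids = ["age", "gender"]
print(proportion_unique(records, qids))
print(satisfies_k_anonymity(records, qids, k=2))
```

A table with any unique record fails k-anonymity for every k of 2 or more, which is why controlling uniques is a necessary first step under either criterion.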
Earlier work at the United States Census Bureau evaluated nested geographic areas and provided the basic methodology for our study.29,31 That work did not document a general model that can be applied by individuals outside the bureau and that takes into account the characteristics of the quasi-identifiers, which is what we did in this study.
Limitations
The prediction models we present here should be considered as one element in a toolbox of heuristics that can be used for disclosure control. Some other heuristics have been described in previous work.65,66
Although we contend that the ten quasi-identifiers we considered represent basic demographics that are quite common in health research, they will not cover all possible quasi-identifiers that may be used in practice. Thus, our results are limited to the specific variables that we have considered in our analysis.
Conclusions
Data custodians often use general population size cutoffs to determine the level of geographic information to disclose in a data set. For example, the HIPAA Privacy Rule's Safe Harbor list allows the release of the first three digits of the ZIP code only if that area has 20,000 or more individuals living in it. National statistical agencies in the United States, UK, and Canada also use such cutoffs as part of their disclosure control practices. The primary rationale for such cutoffs is that there is no disclosure control benefit for aggregating geographic areas beyond that size.
In this paper we performed an empirical evaluation of such cutoffs using Canadian census data. Our results indicate that the appropriate cutoff depends on characteristics of the variables included in the data set; therefore there is not a single cutoff. We developed and validated models to predict such population size cutoffs for Canada. The model that predicted population cutoffs using the maximum number of possible values in the data set had R^{2} values approaching 0.9, and relative error of prediction less than 0.02 across all regions of Canada. Our prediction models were then applied in a risk assessment of the prescription records that are provided by Canadian pharmacies to commercial data aggregators. This assessment indicated that, for most of the Canadian population, there is no disclosure control benefit to aggregating geography beyond the FSA when releasing patient age and gender.
Footnotes

This work was funded by the Public Health Agency of Canada, the Ontario Centers of Excellence, GeoConnections (Natural Resources Canada), and the Natural Sciences and Engineering Research Council of Canada. The authors would like to thank Anita Fineberg from IMS Health Canada, Inc for providing us with information about the record layout for the prescription data. The authors also would like to thank David Paton (Canadian Institute for Health Information), Bradley Malin (Vanderbilt University), Jean-Louis Tambay (Statistics Canada), and Don Willison (McMaster University) for their detailed feedback on an earlier version of this paper. Comments from the anonymous review were also of considerable help in improving and clarifying the paper.

This work was approved by the research ethics board of The Children's Hospital of Eastern Ontario Research Institute.
 American Medical Informatics Association