Rank-based spatial clustering: an algorithm for rapid outbreak detection
Abstract
Objective Public health surveillance requires outbreak detection algorithms with computational efficiency sufficient to handle the increasing volume of disease surveillance data. In response to this need, the authors propose a spatial clustering algorithm, rank-based spatial clustering (RSC), that rapidly detects outbreaks of infectious but non-contagious diseases.
Design The authors compared the outbreak-detection performance of RSC with that of three well-established algorithms—the wavelet anomaly detector (WAD), the spatial scan statistic (KSS), and the Bayesian spatial scan statistic (BSS)—using real disease surveillance data onto which they superimposed simulated disease outbreaks.
Measurements The following outbreak-detection performance metrics were measured: receiver operating characteristic (ROC) curve, activity monitoring operating characteristic (AMOC) curve, cluster positive predictive value, cluster sensitivity, and algorithm run time.
Results RSC was computationally efficient. It outperformed the other two spatial algorithms in terms of detection timeliness and outbreak localization. RSC also had better overall timeliness than the time-series algorithm WAD at low false alarm rates.
Conclusion RSC is an ideal algorithm for analyzing large datasets when the application of other spatial algorithms is not practical. It also enables timely investigation by public health practitioners by providing early detection and well-localized outbreak clusters.
 Biosurveillance
 disease outbreak
 spatial clustering
 time series
Introduction
As recent experience with SARS, H1N1, and cholera demonstrates, large outbreaks of infectious disease continue to threaten the world's population. It has thus become a priority of public health practitioners to identify and localize an outbreak in its early stages to allow public health response as early as possible.
Over the past decade, a large number of approaches to early detection and localization have been developed, based on more rapid collection of disease surveillance data and automatic algorithmic analysis.1–4 In addition to time-series algorithms that analyze univariate data in the temporal dimension, these include many spatial and spatiotemporal algorithms, which allow better detection and localization of outbreaks caused by infectious but non-contagious disease agents (eg, an aerosol release of Bacillus anthracis, or waterborne outbreaks caused by pathogens such as Cryptosporidium parvum and Giardia lamblia), which typically affect an aggregated group of geographic areas. These spatial algorithms have also been applied (either retrospectively or prospectively) to non-infectious public health problems such as infant death5 and prostate cancer survival data.6
Current spatial algorithms divide into frequentist and Bayesian approaches. Given a hypothesis, a Bayesian approach assigns a probability to it, whereas a frequentist approach typically tests it. The spatial scan statistic of Kulldorff et al (KSS)5 is a frequentist approach that exhaustively searches for areas of maximum disease activity (eg, incidence) within a region of interest using circular or elliptic subregions of different sizes centered on various locations. Other frequentist methods include the flexible spatial scan statistic (FSS), which relaxes the constraint on cluster shape used in KSS,7 and the upper level set scan statistic (ULS), which searches for tessellated clusters within subsets of the study areas (each subset comprising the areas whose rates are elevated above a predefined threshold).8 All of the frequentist algorithms compute a likelihood-ratio score comparing the hypothesis of an outbreak in each candidate cluster against the hypothesis of no outbreak, and then perform a randomization test to decide its significance.
The Bayesian approaches, on the other hand, do not require the randomization test. Neill's Bayesian spatial scan statistic (BSS) employs an m×m grid covering the area of interest to search for clusters. Each cell in the grid is a geographic unit. BSS identifies a rectangular subregion composed of the cells with the highest posterior probability of having an outbreak. Other Bayesian methods include the Bayesian multilevel spatial clustering algorithm and the z-score-based multilevel spatial clustering algorithm.9,10 These algorithms identify clusters from a subset of the data to achieve computational efficiency, which introduces the challenge of deciding on proper criteria for data selection.
The current frequentist and Bayesian scan statistics face some common limitations. First, they are computationally intensive because of exhaustive searching, randomization testing, or both. Therefore, in time-sensitive applications, an algorithm may take too long to complete, rendering its results outdated or delayed for decision makers. Second, the artificial cluster shapes used by these algorithms may not conform to true outbreak shapes, reducing their sensitivity to small outbreaks and the timeliness with which they detect other outbreaks.
In an effort to overcome these limitations, we developed a non-parametric methodology for early outbreak detection and localization, the ‘rank-based spatial clustering’ (RSC) algorithm. RSC uses a heuristic search, based on statistical models that assess the risk of an ongoing outbreak in each geographic unit (eg, a ZIP code area), to improve the time efficiency of cluster searching. In the following sections, we describe the RSC algorithm in detail and then evaluate its performance by comparing it with three well-established detection algorithms.
Methods
The RSC algorithm
The key steps in the RSC algorithm are: (1) computing the risk rate of each geographic unit having an ongoing outbreak and ranking the units by their estimated risk rates; (2) searching for clusters based on the rankings and on geographic adjacency; and (3) computing the posterior probability of each cluster.
Computing the risk rates
We studied two measures for estimating the risk rate for each geographic unit: standard score and posterior probability.
Standard score (z-score)
The model computes a standard score (also known as a z-score) for each geographic unit z_{i} on the current day T. Consistent with the quantities defined below, this is R(z_{i},T)=(c_{i,T}−b_{i,T})/s_{i}, where c_{i,T} and b_{i,T} are the observed and expected counts for z_{i} on day T and s_{i} is the standard deviation of the unit's historical counts.
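As a concrete illustration, the z-score risk rate can be sketched in a few lines of Python. This is a minimal sketch: the use of a plain sample mean and standard deviation over a unit's baseline counts is our assumption for illustration (the paper derives expected values with the WAD time-series algorithm).

```python
import statistics

def z_score_risk(history, observed):
    """Standard-score risk rate for one geographic unit z_i on day T.

    history  -- baseline daily counts for the unit (hypothetical input;
                the paper computes expected values with WAD)
    observed -- today's count c_{i,T}
    """
    b = statistics.mean(history)   # expected value b_{i,T}
    s = statistics.stdev(history)  # standard deviation s_i
    if s == 0:                     # degenerate case discussed in the text
        return float("inf") if observed != b else 0.0
    return (observed - b) / s
```

The final branch makes the degeneracy explicit: a nearly constant baseline drives the score toward infinity, which motivates the Bayesian alternative described next.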
Bayesian posterior probability
The standard score produces a very large value when most of the counts in a time series are close to the mean (ie, when s_{i}→0 and c_{i,T}≠b_{i,T}, R(z_{i},T)→∞). In such circumstances, one can use a Bayesian approach to estimate R(z_{i}) as the posterior probability P(H_{1}(z_{i},T)|D_{i}), where T is the most current day.9
We assume that the counts for z_{i} within each period t, 1≤t≤T, follow a Poisson distribution, which is commonly used to model the number of discrete occurrences during a time interval of a given length.11 A gamma distribution, the conjugate prior of the Poisson distribution, is then used to model the ratio q (ie, the ratio of the observed counts to the expected). Expert knowledge can also be introduced by setting different prior probabilities P(H_{1}(z_{i},t)) for different geographic units z_{i} at different times t. (We applied uniform priors in this study.) The posterior probability of H_{1}(z_{i},T) is computed using Bayes theorem (eqn 1), where the likelihoods of the null hypothesis H_{0} (no outbreak) and the alternative hypothesis H_{1} (an outbreak in some area) are integrals over the ratio q (eqns 2 and 3) with different shape parameters α and α′, respectively. The marginal probability is computed as the sum over the two hypotheses, as denoted in eqn 4.
In the above equations, c_{i,T} and b_{i,T} are the observed and the expected values for geographic unit z_{i} on day T, respectively, and Γ(·) represents the gamma function. The shape parameter (α_{i}) and the rate parameter (β_{i}) of the gamma distribution are learned from the historical data using moment matching (eqns 5–8). Under the alternative hypothesis, we assume that the outbreak increases α_{i} by a multiplicative factor k>1 while leaving β_{i} unchanged. In this study, k is assumed to follow a discrete uniform distribution over the 10 values from 1.2 to 3 in steps of 0.2.
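The per-unit Bayesian computation can be sketched as follows. This is a hedged illustration, not the paper's code: the gamma-Poisson integrals of eqns 2 and 3 reduce to negative-binomial marginals; the function interface, the moment matching from historical observed/expected ratios, and the default prior P(H_{1})=0.5 are our assumptions (the paper applies uniform priors but does not specify this packaging).

```python
import math

def marginal_lik(c, b, alpha, beta):
    """Gamma-Poisson marginal likelihood of count c given expected count b,
    with the rate ratio q ~ Gamma(alpha, beta) integrated out."""
    return math.exp(math.lgamma(alpha + c) - math.lgamma(alpha)
                    - math.lgamma(c + 1)
                    + alpha * math.log(beta / (beta + b))
                    + c * math.log(b / (beta + b)))

def posterior_outbreak(c, b, ratios, prior_h1=0.5,
                       ks=tuple(1.2 + 0.2 * i for i in range(10))):
    """Posterior probability of an outbreak in one unit on day T.

    ratios -- historical observed/expected ratios used for moment matching
    ks     -- multiplicative factors applied to alpha under H1 (the text's
              discrete uniform over 1.2..3.0 in steps of 0.2)
    """
    mean = sum(ratios) / len(ratios)
    var = sum((r - mean) ** 2 for r in ratios) / (len(ratios) - 1)
    alpha, beta = mean ** 2 / var, mean / var          # moment matching
    lik_h0 = marginal_lik(c, b, alpha, beta)
    lik_h1 = sum(marginal_lik(c, b, k * alpha, beta) for k in ks) / len(ks)
    evidence = prior_h1 * lik_h1 + (1 - prior_h1) * lik_h0
    return prior_h1 * lik_h1 / evidence
```

Working in log space via `math.lgamma` avoids the overflow that a direct product of gamma functions would cause for realistic daily counts.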
Searching for clusters
RSC computes the shortest Euclidean distance d_{ij} between every pair of geographic units z_{i} and z_{j}. We define an adjacency threshold η: if d_{ij}≤η, z_{i} and z_{j} are considered adjacent. Geographic units fall into the same cluster when each of them is adjacent to at least one other member of the cluster.
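The adjacency relation can be precomputed once before clustering. A minimal sketch, using point locations and Euclidean distance as stand-ins (the paper uses the shortest distance between the boundary polygons of the geographic units):

```python
import math

def build_adjacency(units, dist, eta):
    """Precompute the adjacency relation: two units are adjacent
    when dist(u, v) <= eta (the threshold defined in the text)."""
    adj = {u: set() for u in units}
    for i, u in enumerate(units):
        for v in units[i + 1:]:
            if dist(u, v) <= eta:
                adj[u].add(v)
                adj[v].add(u)
    return adj

# Sketch: three units placed on a line; only the first two are within eta
euclid = lambda p, q: math.hypot(p[0] - q[0], p[1] - q[1])
adjacency = build_adjacency([(0, 0), (1, 0), (3, 0)], euclid, eta=1.5)
```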
The search for the highest probability cluster is a greedy search. The search is seeded with the highest ranked geographic unit, which by itself becomes the first cluster. The second ranked geographic unit is considered next: if it is adjacent to the first, the two merge to form a new cluster; otherwise, the second area becomes a cluster of its own. The algorithm processes the remaining geographic units in descending order of risk rate, adding each unit to a cluster if it is adjacent to one or more previously constructed clusters, or otherwise creating a separate single-area cluster. Figure 1 illustrates the clusters produced by the algorithm from a region where some areas are adjacent (connected in this example) and some are not. To constrain the growth of large clusters, one can set an upper bound on cluster size; in this study, for example, a cluster stops growing once it includes more than half of the geographic units. Each created cluster is scored using a posterior probability, which we describe in the following section. Figure 2 is the pseudocode for the algorithm.
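The greedy search described above can be sketched compactly. This is a minimal illustration under stated assumptions: the dictionary-based interface is ours, and when a unit touches several existing clusters we merge all of them (the text merges the unit with the clusters it is adjacent to).

```python
def rank_based_clusters(risk, adjacent, max_size=None):
    """Greedy rank-based cluster construction (sketch of RSC's search step).

    risk     -- dict: geographic unit -> estimated risk rate
    adjacent -- dict: unit -> set of adjacent units (d_ij <= eta)
    max_size -- optional growth cap (eg, half of all units, as in the text)
    Returns the list of clusters, each a set of units.
    """
    clusters = []
    for unit in sorted(risk, key=risk.get, reverse=True):  # descending risk
        # clusters this unit touches and that are still allowed to grow
        grown = [c for c in clusters
                 if (max_size is None or len(c) < max_size)
                 and adjacent[unit] & c]
        new = {unit}.union(*grown)        # merge, or start a singleton
        clusters = [c for c in clusters if c not in grown] + [new]
    return clusters
```

For example, with risk rates a>b>c>d and adjacency a–b, b–d (c isolated), the search yields the clusters {a, b, d} and {c}.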
Computing cluster posterior probabilities
We compute the posterior probability of a cluster S using a gamma-Poisson model, as in the BSS algorithm.12 Eqns 9 and 10 compute the likelihood that there is no outbreak in the entire region (H_{0}) and the likelihood that there is an outbreak in cluster S (H_{1}), respectively.
Under the null hypothesis H_{0}, the number of observed cases summed over all the areas, C_{all}, follows a Poisson distribution—that is, C_{all}∼Poisson(q_{all}B_{all}), where B_{all} is the number of expected cases summed over all these areas and the disease rate q_{all} follows a gamma distribution, denoted q_{all}∼gamma(α_{all},β_{all}). Similarly, under the alternative hypothesis H_{1}(S), the number of observed cases in the areas within cluster S, C_{in}, follows a Poisson distribution C_{in}∼Poisson(q_{in}B_{in}), where B_{in} is the number of expected cases summed over all the areas inside cluster S and the disease rate has a gamma distribution q_{in}∼gamma(α_{in},β_{in}). For the study areas outside cluster S, C_{out}, B_{out}, α_{out} and β_{out} represent the corresponding parameters. The prior variables of the gamma distributions can be estimated using the moment matching approach (eqns 5–8). The posterior probability is computed using Bayes theorem (eqn 11), and the marginal probability can be computed from eqn 12. The prior of each cluster S is assumed to follow a uniform distribution,
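A heavily hedged sketch of the cluster scoring step follows. The gamma-Poisson marginals again reduce to negative-binomial form; the `params` packaging of the moment-matched (α, β) pairs is hypothetical, and we assume the 'in' parameters already encode the elevated-rate hypothesis for the cluster interior (the paper's eqns 9–12 define this more precisely).

```python
import math

def nb_marginal(c, b, alpha, beta):
    """Gamma-Poisson marginal: P(C = c) with C ~ Poisson(q*b),
    q ~ Gamma(alpha, beta) integrated out."""
    return math.exp(math.lgamma(alpha + c) - math.lgamma(alpha)
                    - math.lgamma(c + 1)
                    + alpha * math.log(beta / (beta + b))
                    + c * math.log(b / (beta + b)))

def cluster_posterior(c_in, b_in, c_out, b_out, params, prior_h1=0.5):
    """Score one cluster S (sketch of eqns 9-12).

    params -- maps 'all', 'in', 'out' to moment-matched (alpha, beta)
              pairs; a hypothetical packaging, not the paper's interface.
    """
    lik_h0 = nb_marginal(c_in + c_out, b_in + b_out, *params["all"])
    lik_h1 = (nb_marginal(c_in, b_in, *params["in"])
              * nb_marginal(c_out, b_out, *params["out"]))
    evidence = prior_h1 * lik_h1 + (1 - prior_h1) * lik_h0
    return prior_h1 * lik_h1 / evidence
```

As expected, a cluster whose inside count is well above its baseline scores higher than one whose counts match the baseline.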
Algorithm evaluation
We compared the performance of RSC with that of KSS, the most widely known frequentist approach, and BSS, a well-established Bayesian approach.12 We also compared RSC with the wavelet anomaly detector (WAD), a representative time-series algorithm, in order to benchmark RSC's detection-performance characteristics against that family of methods.13,14 We applied each algorithm to semisynthetic datasets, which were generated by superimposing outbreak cases onto real over-the-counter healthcare product sales data assumed to contain no outbreaks.
Over-the-counter healthcare product sales data
We created a semisynthetic dataset for evaluation by injecting simulated outbreak data into real daily sales counts for cough and cold products in Pennsylvania between January 1, 2008 and December 31, 2008.13,15–18
We excluded stores with unreliable reporting from the dataset. We defined unreliable reporting as failure to send any records in any of the 23 over-the-counter categories for more than 27 days during the period of the dataset (27 days was 5% of the time period, plus 3 federal holidays per year). This filtering removed 498 of 1502 (33%) stores.
We aggregated the daily sales by ZIP code (471 ZIP code areas).
We used the data from January 1, 2007 through December 31, 2007 as a training dataset from which to compute the threshold values for given false alarm rates, assuming there were no known outbreak signals in this period. Each day within the training period was analyzed using an algorithm, and the greatest score (ie, the posterior probability) was recorded. These scores were then used as threshold values to control for different false alarm rates. It is worth noting that the threshold values computed in this way are likely to be overestimated because of possibly existing but undetected outbreak signals.
Semisynthetic outbreaks
We used a simple outbreak model to inject outbreak cases into the dataset of cough and cold products. The size and shape of the injected signal are adjustable by parameters that set outbreak size (K), slope (δ), and duration (D) (eqn 13)
where ω is the number of days after outbreak onset and
The geographical shapes of the simulated outbreaks created in this study were designed to be flexible and independent of any detection algorithm, except that the simulator assumes that an outbreak cluster comprises contiguous areas. To generate an outbreak, we first randomly selected a geographic unit as an outbreak area. We then randomly chose the remaining outbreak areas, requiring each new one to be adjacent to at least one of those previously selected. In this way, the outbreak simulator was kept simple and general. It conformed to the characteristics exhibited by outbreaks generated by many well-known simulation models, which likewise distribute infected cases into clustered (often connected) regions.2,20–22
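The area-selection procedure above amounts to growing a connected set from a random seed. A sketch under stated assumptions (the dictionary interface, deterministic seeding, and the frontier-sampling detail are ours, not the paper's simulator):

```python
import random

def simulate_outbreak_areas(adjacent, n_areas, seed=0):
    """Choose a connected set of outbreak areas: a random seed unit,
    then units drawn one at a time from the frontier of areas adjacent
    to those already chosen."""
    rng = random.Random(seed)
    chosen = {rng.choice(sorted(adjacent))}
    while len(chosen) < n_areas:
        frontier = sorted({n for u in chosen for n in adjacent[u]} - chosen)
        if not frontier:   # region exhausted; outbreak stays smaller
            break
        chosen.add(rng.choice(frontier))
    return chosen
```

By construction every selected area is adjacent to at least one other selected area, matching the contiguity assumption stated above.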
We injected the simulated outbreak counts into the real cough and cold sales data for the period January 1, 2008 to December 31, 2008. We injected each simulated outbreak into a randomly selected interval of length D days during this period. We used a time-series algorithm to compute each day's expected value during the evaluation period based on the 12 months of data preceding that day.
We generated six groups of datasets with K and δ chosen from {4, 8, 12} and {0.2, 0.3}, respectively. We used a fixed duration D of 10 days. Each group included 100 different outbreaks (differing by start date or seed area). More elevated outbreaks (ie, δ>0.3) are not discussed here because the detection performances of the different algorithms on them were indistinguishable.
Evaluation metrics
The evaluation metrics included (1) the receiver operating characteristic (ROC) curve,23 (2) the activity monitoring operating characteristic (AMOC) curve,24 (3) the areas under the ROC and AMOC curves, (4) computation time, (5) cluster sensitivity (the proportion of outbreak areas correctly detected), and (6) cluster positive predictive value (PPV, the proportion of areas in the detected cluster that are true outbreak areas).6,7,25
Given a threshold value computed from the training dataset to control for a particular false alarm rate, we define a true positive output cluster for the spatial algorithms as a cluster that satisfies three conditions: (1) it has a score greater than the threshold, (2) it identifies one or more outbreak areas, and (3) it is identified within the upward phase of the outbreak (ie, within the first 5 days). For the temporal algorithm, WAD, a true positive output is a geographic unit that (1) has a score greater than the threshold and (2) is one of the outbreak areas.
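Cluster sensitivity and cluster PPV, as defined above, reduce to simple set operations on the detected and true outbreak areas:

```python
def cluster_sensitivity_ppv(detected, true_outbreak):
    """Cluster sensitivity and cluster PPV for one detected cluster.

    detected      -- set of geographic units in the detected cluster
    true_outbreak -- set of units in the simulated outbreak
    """
    hits = len(detected & true_outbreak)
    sensitivity = hits / len(true_outbreak)  # share of outbreak areas found
    ppv = hits / len(detected)               # share of detected areas that are real
    return sensitivity, ppv

# A detected cluster covering 3 of 4 outbreak ZIP codes plus 1 unaffected area
# (hypothetical ZIP codes, for illustration only)
print(cluster_sensitivity_ppv({"15213", "15232", "15217", "15206"},
                              {"15213", "15232", "15217", "15208"}))
# → (0.75, 0.75)
```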
Algorithm settings
We compared the performance of RSC with WAD,26 KSS,5,27,28 and BSS.12 Note that WAD was not only compared as an individual detection algorithm, but also used to compute the expected values required by the three spatial algorithms.
We applied the discrete Poisson model proposed in KSS for a purely spatial analysis performed by SaTScan V.8.0. As suggested by the author of SaTScan, the size of a searched-for cluster was limited to <3% of the population in favor of relatively small and focused outbreaks. The input parameter file for SaTScan used in this study is shown in appendix 1. For BSS, we used a 24-by-24 grid covering the entire study region (ie, the state of Pennsylvania), with each grid cell covering ∼80 square miles. It would make a more compelling case to compare RSC with BSS run on a finer grid. However, for a grid fine enough that each cell covers a single ZIP code area (assuming each ZIP code area covers about 10 square miles on average), we would need a 64-by-64 grid. One of our pilot studies showed that BSS with a 32-by-32 grid required more than 70 min. Given the time complexity of BSS, which is O(m^{4}), doubling the grid length again to 64 would theoretically take about 2^{4}×70 min, or roughly 19 h.
The experiments were executed on a Linux server with a 2 GHz Intel CPU and 4 GB memory. All the algorithms except KSS were implemented in Java 1.5.
Results
Detection power and timeliness
The ROC curves of the RSC algorithms using the standard score model and the Bayesian model are shown in figure 4, together with the curves for WAD and BSS. Figure 5 provides the corresponding AMOC curves. Because the lowest false alarm rate that KSS could achieve was 0.76,ii it is not shown in the figures, which cover false alarm rates between 0 and 0.2. We assume that false alarm rates greater than 0.2 (1 false alarm per 5 days) have no practical value to public health practitioners.
Table 1 compares partial areas under the ROC curve for false positive rates within the range [0, 0.2] using trapezoidal approximation. DeLong testing29 showed no significant difference between any pair of the algorithms. RSC, BSS and WAD all demonstrated similar detection sensitivities.
Table 1 also compares the areas under partial AMOC curves, demonstrating the detection timeliness of the algorithms. The two RSC methods exhibited better timeliness than the other algorithms in five of the six groups of experiments. The average number of days the algorithms took to detect the outbreaks is shown in table 2. Paired Student's t tests performed on the variable ‘days-to-detect’ showed that both RSC methods detected outbreaks significantly earlier than BSS in all six groups at a false alarm rate of 0.1 (ie, 1 false alarm per 10 days). RSC also outperformed WAD on the datasets injected with low-intensity outbreaks (ie, δ=0.2).
Outbreak localization
The average cluster sensitivities of the spatial algorithms at a false alarm rate of 0.1 are provided in table 3. In all six groups of experiments, either RSC_{zscore} or RSC_{Bayesian} showed significantly higher cluster sensitivity than the other algorithms, indicating that RSC is capable of identifying more outbreak areas. Table 3 also shows the average cluster PPVs. Both RSC algorithms had significantly better cluster PPV, which suggests that the RSC algorithms included fewer non-outbreak areas in their detected clusters (ie, made fewer type I, false positive, errors) than the other algorithms.
Run time
The average run times for RSC_{zscore}, RSC_{Bayesian}, WAD, BSS, and KSS were 26 s, 24 s, 0.22 s, 44 min, and 2.58 s, respectively. WAD had a shorter run time than any of the spatial algorithms. Among the spatial algorithms, both RSC_{zscore} and RSC_{Bayesian} were more than 100 times faster than BSS. Note that the measured average run time for KSS is not strictly comparable, since the SaTScan software is implemented in a different programming language, C.
Discussion
We have presented the RSC algorithm and demonstrated several ways in which this approach is preferable to other temporal or spatial scan algorithms. The experiments demonstrated that RSC, using either risk estimation model, consistently outperforms the other algorithms in terms of detection timeliness while having comparable detection power.
The run times of RSC and the three compared algorithms can be explained theoretically by their time complexities. The time complexity of RSC is O(n^{2}) plus the cost of computing the adjacency relations (ie, the shortest distances) between pairs of geographic units. The time complexity of computing the shortest distance between two vector polygons defining the borderlines of two geographic units is O(uv), where u and v are the numbers of vertices of the two polygons.30 In practice, most spatial database systems (eg, PostGIS) offer querying capabilities on topological relations, and the adjacency relations can also be computed in advance and stored in a lookup (hash) table for quick retrieval. Either approach makes the run time of the adjacency computation insignificant. BSS has O(m^{4}) time complexity, where m is the length of the grid.12 KSS has O(Rn^{2}) complexity for the purely spatial model, where R represents the number of replicated analyses required by the randomization tests. O(n) is the time complexity of the pyramid algorithm implemented in WAD.31
The RSC algorithm can be extended to detect outbreaks in areas that are not geographically connected. One approach is to adjust the threshold distance inversely proportional to some density measure of a baseline factor (eg, population). Another is to attach non-residential landforms (eg, lakes, valleys) to their nearest census tract and include them in the analysis as well. In addition, one can use a grid cell comprising one or more geographic units (eg, ZIP code areas) as the unit study area and search for clusters of adjacent grid cells, so that non-connected geographical areas (eg, ZIP code areas) inside two or more connected grid cells can be clustered together (J Que, unpublished data).
Furthermore, RSC has better detection accuracy and precision than the other tested algorithms. A cluster that is inaccurately shaped or much larger than the actual outbreak area may hinder epidemiologists' investigations, given the very limited resources they have.
Detecting flexibly shaped clusters is another advantage of RSC. The output of RSC is a combination of a set of adjacent areas (eg, ZIP code areas). A cluster identified by RSC may therefore take any irregular shape, and the identified shape can be informative for detecting specific types of outbreaks. For example, an elongated cluster may suggest an aerosol release of a disease agent dispersed in the downwind direction, and a fishbone-like cluster may imply a waterborne outbreak.
Since RSC is computationally efficient, it can be applied to spatial units finer than postal code areas while still having an acceptable run time. For example, individual cases can be aggregated into street or block groups for a drill-down analysis, given that a more finely grained dataset (eg, with street-level addresses) is available. However, when no predefined demographic unit is appropriate, as for individual cases indicative of an environmental hazard (eg, a radiation leak) or a nosocomial (hospital-acquired) infection, artificial units have to be considered, such as circles centered on a nuclear plant or a grid overlaid on the ward locations in a hospital.
Like RSC, the ULS algorithm8 searches for clusters in a reduced search space. However, the two algorithms differ in three respects: the risk estimation models used, the cluster search space, and cluster significance testing. More specifically, ULS stratifies the dataset into several subsets using a set of predefined threshold rates and looks for tessellated areas within each subset, whereas RSC creates a new cluster each time it takes the next-ranked area into consideration, which makes its search space a superset of that of ULS. If ULS's threshold set contains the risk rate of every geographic unit in the study, ULS finds sets of clusters identical to those RSC would find, provided both apply the same risk estimation model. In terms of significance testing, ULS uses frequentist randomization testing, whereas RSC applies Bayesian inference. Our future work may include a performance comparison of these two algorithms.
A limitation of this study is that it used a geometrically shaped outbreak curve and therefore may not represent the complexity of real outbreaks. Future work should include evaluations of these algorithms using the most realistic outbreak simulators possible and validation with real outbreak data.
Conclusions
One of the essential functions of biosurveillance systems is to detect outbreaks at an early stage at local, regional, or even national levels. Our results suggest that the RSC algorithm is computationally efficient for outbreak detection and localization. When computational resources are limited or results are time-sensitive, its lower order of computational complexity makes it preferable for analyzing large datasets, whether covering broad geographic regions or including multiple data types. Its significantly better detection timeliness suggests that RSC may be able to identify some outbreaks at an earlier stage. Moreover, the compact clusters localized by RSC allow field epidemiologists to focus on investigating a manageable number of high-priority regions given limited resources.
Funding
This work is supported by grants CDC1U38HK00006301, CDC1 R01 PH0002601, PADOHME01737, and NSFIIS0325581.
Competing interests
None.
Provenance and peer review
Not commissioned; externally peer reviewed.
Footnotes

↵i In Bayesian probability theory, a prior is called conjugate when the resulting posterior distribution belongs to the same family as the prior, which provides a closed-form expression for the posterior probability.

↵ii In order to compute false alarm rates, we ran the algorithms on each day between January 1, 2007 and December 31, 2007 without injecting any outbreak cases. For each day, only the cluster with the lowest p value was considered. The results of KSS showed that 278 out of 365 clusters had the lowest p value, 0.001, which means the lowest false alarm rate that the algorithm was able to achieve was 278/365=0.76.
 © 2011, Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rightslicensing/permissions.