Finding Leading Indicators for Disease Outbreaks: Filtering, Crosscorrelation, and Caveats
Abstract
Bioterrorism and emerging infectious diseases such as influenza have spurred research into rapid outbreak detection. One primary thrust of this research has been to identify data sources that provide early indication of a disease outbreak by being leading indicators relative to other established data sources. Researchers tend to rely on the sample crosscorrelation function (CCF) to quantify the association between two data sources. There has been, however, little consideration by medical informatics researchers of the influence of methodological choices on the ability of the CCF to identify a lead–lag relationship between time series. We draw on experience from the econometric and environmental health communities, and we use simulation to demonstrate that the sample CCF is highly prone to bias. Specifically, longscale phenomena tend to overwhelm the CCF, obscuring phenomena at shorter wave lengths. Researchers seeking lead–lag relationships in surveillance data must therefore stipulate the scale length of the features of interest (e.g., shortscale spikes versus longscale seasonal fluctuations) and then filter the data appropriately—to diminish the influence of other features, which may mask the features of interest. Otherwise, conclusions drawn from the sample CCF of bivariate timeseries data will inevitably be ambiguous and often altogether misleading.
Introduction
In the last decade, leaders in public health have expressed increasing concern over the threats posed by bioterrorism and emerging infectious diseases, such as influenza. To help identify these threats, the medical informatics and public health research communities have sought to develop techniques for rapid detection of disease outbreaks. Broadly speaking, research on rapid outbreak detection has fallen into two related areas: identifying data sources that may provide an early indication of an outbreak and developing algorithms for analyzing surveillance data gleaned from these sources.1
In the search for promising data sources, considerable attention has been focused on routinely collected electronic data, since they have a short reporting delay and a low cost for acquisition. Researchers have studied many types of routinely collected electronic data in an attempt to identify variables that will provide an early, or leading, indication of an outbreak relative to a more established variable. If a variable is shown to provide an early indication of an outbreak consistently, then it is called a leading indicator. Examples of candidate leading indicators in recent studies include sales of overthecounter pharmaceuticals,2–4 outpatient or emergency department visits,5–8 ambulance dispatches,9 nurse hotline calls,10 and dead bird counts.11 Some researchers have also examined strata within datasets to determine if certain age groups, for example, provide a leading indicator of an outbreak.12,13 Established variables used for comparison include reportable disease cases,7,9–11 hospital admissions,2,3,5 aggregate volume of clinical visits in particular syndromic categories, and deaths.6,8 Leading indicator studies in public health surveillance use different methods to compare the candidate leading indicator and established variables, but the crosscorrelation function is employed commonly.2–5,7,8,13 In some studies, authors used the crosscorrelation function as part of an initial, exploratory, analysis,5,7,13 while in other studies, the result of the crosscorrelation analysis was a main outcome.2–4,8
A surveillance analyst can use knowledge of leading indicators to select variables for inclusion in a surveillance system and to develop prospective decisionmaking schemes that incorporate demonstrated lead–lag relationships between variables. The selection of leading indicators and their variables can have a direct impact on surveillance. Despite the important role of leadingindicator research in directing surveillance practices, few researchers have critically examined the methodology for defining leading indicators, or the influence of methodological choices on identifying leadlag relationships.
In this paper, we demonstrate that the empirical verification of a lead–lag relationship must be relative to a stipulated scale length. The healthservices utilization data or other healthrelated data time series under consideration may be conceptualized as aggregations of potentially many concurrent and independent components. For example, the influence of clinic closings may introduce a shortscale daily or weekly component into the time series. Similarly, seasonal changes in exposure may introduce a longscale monthly or yearly component. The scale length effectively characterizes those components of the data that are relevant to the epidemiological relationship between the variables under consideration. The scale length of the feature of interest should, therefore, be identified a priori to reflect the epidemiological understanding of the feature of interest. If researchers are interested in the ability of one time series to provide a leading indication of a rapidly appearing outbreak, for example, then the research should focus on relationships of shorter scale length features.
We borrow from the rich literature in econometrics and environmental health and use simulation to demonstrate that, if a researcher does not stipulate the relevant scale length when comparing two surveillance data sources, then conclusions drawn from crosscorrelation analyses of the timeseries data will be ambiguous, and often altogether misleading.
Background
The term leading indicator appears prominently in econometrics, where theoreticians construct distributedlag models as explanations of the joint dynamics of regularly recorded macroeconomic aggregates (e.g., gross national product, consumption, employment, etc.). A leading indicator for a given objective variable (e.g., aggregate consumption) is another variable whose significant fluctuations anticipate significant fluctuations in the target variable. For example, aggregate GNP might be postulated as a leading indicator for aggregate consumption and if this relation persists, then sudden upturns in GNP can be demonstrated empirically to anticipate, on the average, a similar upturn in consumption, by some typical time interval.14,15
Claims alleging lead–lag relations between particular crosssectional macroeconomic indices have generated considerable controversy in the field of econometrics.15–17 Such claims ultimately must be put to an empirical test using sample data and crosscorrelation techniques. In the 1960s and 1970s Granger and others demonstrated that care must be taken when trying to infer “intrinsic” relationships from sample crosscorrelations.14,15,17 An important conclusion from this work was that the “discovery” of spurious relationships is likely unless the relative emphasis carried by longer and shorter wavelength phenomena on the crosscorrelation function is taken into account.
The environmental health literature, and in particular research into the health effects of air pollution, offers another relevant perspective on the search for leading indicators.18–21 Distributedlag models, which include terms for multiple lagged measures of a pollutant, are used to predict future morbidity and mortality from current and historical measurements of pollution. The leading indicator in this context is air pollution and the outcomes of interest are illness and death due to respiratory and cardiovascular disease. In these studies, the goal is explanation, not prediction, but the principle remains the same. Estimation of shortterm effects of air pollution on human health, on the scale of days, is possible only if longterm patterns of season and month are taken into account.18
The search for leading indicators in public health surveillance data is an emerging topic in public health informatics research. Leading indicators may be exploited in a public health setting in different ways. First, it may be possible to use a univariate timeseries forecast scheme which relies exclusively on a putative leading indicator for, say, clinic or hospital visits. Thus, overthecounter pharmaceutical sales may enable detection of some types of disease outbreaks before they would otherwise be detected using another data source, such as hospital admissions. Another possible application of a leading indicator in public health is to use a leading indicator in conjunction with a primary data source to improve disease outbreak detection. This application would entail development of a bivariate (or multivariate) timeseries forecast scheme which uses the leading indicator in concert with the primary variable, or in other words, a distributedlag model relating two variables. Another application of leading indicator research is to enhance the understanding of the epidemiology of diseases, such as influenza.12 Although in this application, particular care should be taken to avoid drawing individual level inferences from associations observed in aggregated data. Whether a leading indicator is used alone or in combination with other data sources, an analyst requires a quantitative understanding of the lead–lag relationship between time series to identify promising data sources, to examine the epidemiology of disease, and to combine data sources into a prospective decisionmaking scheme.
Methodological Considerations in Searching for Leading Indicators
The crosscorrelation function (CCF) is a natural metric for measuring the similarity between segments of time series and identifying lead times between time series by lagging the sample series so as to maximize the CCF.22–24 This intuitively simple prospect is fraught with ambiguity in practice. Since the seminal 1926 article by G. Udny Yule25 it has been recognized that the sampling properties of the CCF are exceedingly sensitive to biases. The primary problem with this approach is that long wavelength phenomena tend to overwhelm the sample CCF, obscuring phenomena at shorter wavelengths.
Econometricians and environmental health researchers have shown that to avoid this problem, the distinct effects on the sample crosscorrelation function of phenomena at different scale lengths (e.g., years, seasons, months, days) must be resolved separately by filtering the data series into “bands” corresponding to the scale length of the feature of interest, prior to estimating the lagged correlation between series. This is equivalent to performing a frequencydomain crossspectral analysis and examining the phase of the smoothed crossspectrum in discrete frequency bands.23,26–29
The ultimate goal of the analysis may be to fit a distributedlag regression model to unfiltered (broadband) data which will then be used as a predictive model relying on both the lagging and leading indicators. If this is the case, then one must estimate the linear relation between values of a primary variable (say clinic visit count) and past values of itself and of a predictor variable (say drug sales). It is well known that optimal leastsquares estimation of such linear relations requires “prewhitening” of the predictor variable time series. Whitening the predictor series (say drug sales) is accomplished by applying a particular linear filtering operation: We subtract from the data the onestepahead predictions given by a univariate timeseries model which explains the serial dependence structure (autocorrelation) exhibited in that data. The data are replaced by the sequence of modelbased prediction errors, which, if the fit is good, will constitute an approximate “whitenoise” sequence exhibiting no serial correlation. The distributedlag regression is then performed using the derived whitened series in place of the serially correlated predictor data. This procedure is equivalent to fitting a “transfer function model” to a pair of stationary time series.22,24
Thus, to identify potential lead/lag relations between features of a scale length of particular interest, we would carry out bandpass filtering, followed by crosscorrelation; or equivalently, a frequencydomain crossspectral analysis. To fit a timedomain–distributedlag model, insofar as we have already developed evidence that one series does in fact lead the other, we would apply a “whitening” filter to the predictor variable as part of the model identification and fitting process.
In a public health setting, it is clear that potentially many diverse and independent effects may be expressed as features that have different characteristic time scales. For example, endemic or seasonally epidemic diseases (e.g., influenza) may manifest as slowly changing long wavelength effects in public health surveillance data. Sporadic epidemics, such as those resulting from a large pointsource exposure, may give rise to more rapidly changing short wavelength effects. In reality, the class of objective features that can be observed in timeseries data is virtually unlimited. The choice of features relevant to a particular lead–lag study in surveillance data must be determined from practical epidemiological principles and not merely statistical practice.
To be more precise, the sudden appearance in timeseries data of one or another among several distinct classes of quantitative or qualitative features may give evidential support to one or another epidemiological causal hypothesis. These features of interest may be broadly classified as “change points” in the timeseries dynamics; or, more broadly, “exceptional points” in respect to one or another numerical or statistical attributes. From the standpoint of the classical theory of timeseries modeling, we seek to identify certain types of “disturbance” in the ordinary statistics of a time series.
The crosscorrelations of interest are those between the disturbances of interest—to the extent that all others may be excluded from consideration. Such features may include: significant turning points (sudden changes in slope corresponding to the onset of an outbreak), significant peaks, significantly large values of the slope, significant changes in the local mean or variance, changes in the frequency content (or crossing rate of a particular level), and so on. In each instance, the significance must be quantified both by a statistical measure of “surprise value” and by an empirical standard of “practical significance.” Isolating features of interest, and suppressing other potentially confounding features, is implemented by applying appropriate filters to timeseries data.
The specific shape and scale of an epidemic curve, as an example feature, may depend upon many factors, including: the distribution of primary cases in space and time; the contagiousness (or lack thereof) of the disease agent; and the characteristic distribution of incubation periods, relative to the demographics under consideration. To a first approximation, the characteristic rise time and peak value of an epidemic curve are the key quantitative features of interest. It follows that we ought first to hone our search for evidence of lead/lag relationships between those (and only those) components of the data series whose time scales are in accord with the characteristic rise time of the epidemic or outbreak of interest. It is by considerations of the foregoing sort that we are naturally led to set up empirical investigations in which specific filters are applied to time series; prior to computing crosscorrelations.
To ask only, “Is data series ‘A’ a leading indicator for data series ‘B’” does not delineate the path of quantitative investigation sufficiently. Rather, the question that must be posed is, “Do signals of interest embedded in ‘A’ tend to precede those same signals embedded in ‘B’?” If the class of signals of interest is left unspecified, and global crosscorrelation analyses are performed nonetheless, the results may or may not reflect the features and dynamics of practical interest and are likely to be misleading.
Stipulating the features and circumstances of interest requires first defining the length of scale of the “signals of interest.” In the public health surveillance context, researchers have tended to model outbreaks resulting from bioterrorism as shorterterm wavelength (or more rapidly varying) features of the data. For example, epidemic curves from real30 and simulated31,32 inhalational anthrax outbreaks peak within days, not weeks, following exposure. Studies that evaluate leading indicators for use in the bioterrorism surveillance using long scalelength features, estimating crosscorrelations dominated by seasonal regularities in clinic visits and drug sales3,33 do not necessarily inform us of how shortterm features (of several days rise time) in pairs of such data series do or do not anticipate one another. On the other hand, if the interest is in the long scalelength features of influenza, for example, then estimating crosscorrelations or equivalent measures on a longer scale length is both appropriate and informative.12,13
In summary, any statement about lead–lag relations is dependent on what constitutes a signal of interest. Moreover, the means by which we filter the data to remove the other components of lesser interest is accompanied by a commitment (explicit or implicit) to treat fluctuations of certain scale lengths as such signals of interest. In the next section, we illustrate through simulation studies how stipulating the signal of interest, and preparing the data series for analysis, can significantly influence the discovery of lead–lag relationships in public health surveillance data.
Illustrations of the Influence of Methodological Decisions on the Search for Leading Indicators
To illustrate the influence of methodological choices on the evaluation of lead–lag relationships between data series, we developed two simulation studies. Authentic health care utilization data are the basis for both simulations. Clinical visit data are aggregated to form a baseline onto which simulated “signals of interest” are superimposed. We extracted the clinical data from anonymized electronic medical records from a network of ambulatorycare clinics in Norfolk, VA, where the raw data consisted of timeordered records containing dates, times, clinic location, chief complaint, and diagnostic code. We obtained these data from commercial sources in aggregate form and the data do not contain individually identifiable health information. The illustrations in this paper were developed from three years of daily counts, aggregated over all the clinics, of those outpatient physician visits whose chief complaint or diagnosis matched a simple set of criteria for acute respiratory illness.
The data are composed of mixtures of regular effects and a residual random component. The regular effects include a 7day weekly cycle with a weekday to weekend differential of 100% or more. Superimposed upon the weekly cycle is a longterm annual cycle that peaks sometime during the winter and is smallest in the summer. The longterm annual cycle takes about 120–240 days to rise from summer minimum to winter maximum. The annual cycle does not have a true yearly period since the influenza season, which is responsible for the maximum in the number of clinical cases of respiratory illness, may advance or retreat from one year to the next.
As the foundation for our simulation, we used a sample of the clinical respiratory case counts and aggregated sales of OTC cold and flu medicine from a network of pharmacies in the Norfolk, VA, area (Figure 1). Both data sources exhibit a similar 7day periodicity and a comparable seasonal transition from the summer minimum to the winter maximum.
The Importance of Stipulating a Wavelength of Interest
In the first example, the importance of identifying the scale length of interest in a lead/lag relationship analysis is underscored. We constructed a hypothetical situation in which both the longscale and shortscale features exhibit a lead/lag relationship; but the two relationships are, in fact, independent of one another and are different in both magnitude and direction. The salient point is that longterm fluctuations will obscure evidence in the crosscorrelation function of relationships between shortterm fluctuations.
The construction begins with the 3year smoothed outpatient respiratory illness visit count series shown in Figure 1. Smoothing by either centered or trailing 7day moving averages removes the weekly cycle, and the resulting data show a more or less regular annual cycle peaking sometime in the winter. A pair of three annual cycles worth of these smoothed data were offset by a fixed lag of +4 days, and Gaussian random components were added to the underlying longterm cycle. One random component was constructed of repeated short bursts of Gaussian noise that are correlated at a lag of −6 days. The pair of mixed signals therefore exhibits a long timescale structure that naturally dominates a raw crosscorrelation calculation and tends to show up as an optimal alignment at +4 days. However, if the composites are highpass filtered, using simple first differences, to admit only the short scale length random structure, we find that the crosscorrelation calculation very clearly shows the relationship at alignments of −6 days. The construction is given explicitly in Appendix A.
Series X^{t} and Y^{t} and their crosscorrelation are shown in Figure 2. Note that the embedded regular effect at short scales is completely absent from the crosscorrelation, which is dominated by the offset at +4 days between the longscale annual cycles.
Taking simple first backward differences:
Another round of differencing, setting:
The global CCF is overwhelmed by longscale length relationships, possibly obscuring the shortscale length relationships. Any analysis of these two series for a lead–lag relationship must therefore begin by stipulating the characteristics of the signal of interest, and then proceed to filter the data appropriately to remove other signals that may overwhelm the effect of the actual signal of interest. A simple differencing of the series was sufficient to remove the “nonstationary” long wavelength phenomena, and reveal the relationship at a short wavelength.
The Importance of Removing or Filtering Features Appropriately
In the second example, we show that the 7day cyclic component typical of many clinical data must be removed (filtered out) if evidence of a common “point event” affecting a pair of series is at all manifest in the sample crosscorrelation plot. We also demonstrate that the choice of filtering method influences the appearance of the CCF; to the extent that one method obscures evidence of the crossrelation of interest, while enhancing evidence of an entirely irrelevant crossrelation.27,29
We constructed a pair of series analogous to the authentic respiratory visit data but in which both the 7day cycle and the seasonal pattern are equally prominent. In addition we added a random component into each series. The cyclic, seasonal, and random components were added on a logarithmic scale and the result exponentiated. This process corresponds to using a multiplicative composition model. The random component was drawn from independent identically distributed lognormally distributed draws with a standard deviation roughly 5% of the peaktopeak transit of the 7day cycle. This gives us a “noisy” 7day periodic signal with a signaltonoise ratio comparable to many of the realworld aggregates we have worked with.
In both series, shortterm signals, suggestive of the presence of rapidly rising “outbreaks,” were superimposed additively. These pulses were of short duration, with rise times of 5 days and pulse height of 50 counts. The outbreak signals stand relative to a background with weekly maxima on the order of 200 to 400 counts. The signal “pulses” in the two background series were offset by 10 days. We have considerable flexibility in generating examples to illustrate our points, this is only one example. The details of the construction are given in Appendix B.
Two modes of pretreatment were applied to eliminate the influence of the 7day cycle on the crosscorrelation: (1) moving central averages of length 7 days; and (2) lagged 7day differences. Crosscorrelation plots were made for (a) the raw data series, (b) the averaged data series, and (c) the lagdifferenced data series. The results are shown in Figure 5.
Note that the crosscorrelation of the unfiltered data (Figure 5, panels A and D) show no evidence of the presence of the pair of “outbreak” signals separated by a lag of 10 days. This example is dominated by the 7day cycle. The movingaverage smoothed data (Figure 5, panels B and E) show no evidence of signal pair; it is dominated by the zerolag alignment between the longterm seasonal trends. The 7day differenced data exhibits evidence of the lagged pair of “outbreak” signals at lag 10 days (Figure 5, panels C and F).
The situation shown in Figure 5 is typical of healthrelated aggregated time series in its strong periodicity on a scale of 7 days. This peculiarity makes the study of lead/lag relations between shortscale (several days to a week) components superimposed on top of the cycle pattern rather delicate. The crosscorrelation function will be dominated by the 7day cycle unless it is first removed. But there are different ways to estimate and remove the cycle. For one thing, the cyclic component is itself subject to short, medium, and longterm variability in its shape and amplitude. The subtraction of a fixed 7day “skeleton” will leave behind considerable residual of 7day cycling in any given week. The problem is analogous to that faced by econometricians and demographers who must produce “seasonally adjusted” data for various standardized comparisons across diverse aggregate time series. Cycle adjustment is done by linear filtering; but the appropriate choice of filter depends on precisely what feature of the residual data (in the absence of the cycle) one wishes to study. In the example at hand, the “incorrect” cycle adjustment was the 7day moving average filter, since, as a byproduct of removing the undesired cycle, it suppresses evidence of precisely the shortscale features that were the intended object of study. The correct adjustment, in this case, was the 7day lagged difference, which actually amplifies the shortscale fluctuations; and suppresses the longscale (seasonal) fluctuations. In the broadest possible sense, when one seeks evidence of correlations between shortscale features, the appropriate filter is a “highpass” filter, as opposed to smoothing or averaging filters. More sophisticated methods of cycleadjustment might well have been applied in our example, but the two simplest techniques shown here are sufficient for the purpose of contrast.
This example illustrates that the search for lead–lag relationships between data series is influenced by the method used to remove features at wavelengths different from the signal of interest. In searching for a lead–lag relationship, one must stipulate the characteristics of the signal of interest and then filter the data in a manner that ensures that the signal of interest is not obscured. Cycle removal by smoothing (e.g., 7day average) obscures relationships between the shortscale features; and cycle removal by highpass filtering (e.g., 7day differencing) obscures relationships between the longscale features. The lesson, however, remains the same: the data must be treated to accentuate the influence of features of interest in a CCF analysis.
Discussion
Through reference to relevant literature and simulation we have shown that, when seeking a leading indicator, researchers should stipulate a feature scale length of interest, and then filter the data appropriately to eliminate other features that may bias the sample crosscorrelation. In selecting a filtering method, it is clear that the same pretreatment, especially of cycling data, cannot be used for the discovery of lead–lag relationships at both long and shortscale lengths. The longterm features in the data tend to overwhelm the sample crosscorrelation function (CCF), obscuring evidence of any lagcorrelated shortterm features that may be present. Thus, the results obtained from a lead–lag analysis are implicitly linked to the methods used to uncover them.
Identification of the scale length of interest should be determined directly from a clearly stipulated epidemiological or causal model that postulates the relationship between the data series under consideration. Take, for example, evaluation of the lead–lag relationship between sales of overthecounter (OTC) pharmaceuticals and utilization of clinical services. Researchers have shown that on long time scales, changes in sales of some OTC products tend to lead changes in clinical visits for some conditions.2–4 In other words, the seasonal patterns in the two series are similar, with the longterm trend in sales of OTC products tending to lead the longterm trend in clinical visits.
One causal model that might result in the observed lead–lag relationship is that ill individuals tend to first purchase OTC products and then at some later time utilize clinical services. Reaching this conclusion from the analysis of timeseries data is an example of ecological inference, or drawing individuallevel inference from aggregate data, and the epidemiological literature has shown that this type of inference is problematic.34 Other causal models may also explain the observed longterm lead–lag relationship. Seasonal marketing campaigns or awareness in the community of the increasing risk of infection may also cause increased purchasing of OTC products prior to increased utilization of clinical services. When one examines aggregate data only, one cannot identify with certainty the true explanation for the lead–lag relationship.
Nevertheless, OTC sales may be useful for predicting seasonal changes in the utilization of clinical services, as demonstrated by the CCF applied to unfiltered data, regardless of whether the relationship holds at the individual level and shorter time scales. If, however, one is interested in detecting a rapid increase in illness in a sporadic outbreak, then the CCF results on a longer time scale provide little evidence that OTC sales will predict changes in clinical visits on a shorter time scale. This observation is supported by a study where the authors filtered out the longerterm features prior to determining the CCF, and identified a weaker lead–lag relationship than was observed when using unfiltered data.33 Similarly, demonstrated lead–lag relationships on a seasonal scale between data sources that reflect influenza may be useful in predicting the onset of the influenza season,13 but these longterm relationships may not be useful in predicting shorterterm fluctuations.
Although we have based our examples on counts of outpatient clinic visits aggregated into daily totals for respiratory conditions, the specific characteristics of other aggregated data time series used in surveillance may be different. Nonetheless, the same caveats apply to the search for lead–lag relationships in general. The key requirement is to stipulate carefully the class of features relevant to the investigation and their characteristic scale lengths.
Isolating features of interest, and suppressing other potentially confounding features, is implemented by applying appropriate filters to timeseries data. A particular epidemiologic scenario of interest will be associated more or less strongly with some of these markers or features of interest. The repeated timelagged appearance in two data series of these preferred features (irrespective of any relationships between other features) would lend support to the claim that the two series in question enjoy a lead/lag relationship which is of special anticipatory value in respect to the scenario of interest.
Research to identify lead/lag relationships in health time series is essential for guiding surveillance efforts and for investigating the epidemiology of disease. Care must be taken, however, when using the crosscorrelation function in this setting. The scale length of the feature of interest must be specified and the data must be filtered appropriately prior to application of the crosscorrelation function or the results may be ambiguous and misleading.
Appendix
Two Gaussian random series, X^{t} and Y^{t} were constructed as:
Appendix
Data with regular cycles may be modeled as a multiplicative superposition of independent weekday cycle, W^{t}, seasonal trend, S^{t}, and random components, U^{t}.29 The weekly component has a period of 7 days, the seasonal trend effectively represents the slowly varying annual cycle but is estimated locally and so assumes a random character; and the residual random component is thought of as a random variable embodying all the variation remaining after “regular” effects have been “explained.” A multiplicative model appropriate for this type of cyclical data is:
Footnotes

David Buckeridge's research is supported by a Canada Research Chair in Public Heath Informatics. Part of this work was performed while Dr. Buckeridge was a Department of Veterans Affairs Postdoctoral Informatics Fellow.
 American Medical Informatics Association