Weekly Paper Review: Can Clinicians’ Search Query Data help Monitor Influenza Epidemics?

Arimoro Olayinka
6 min read · Jun 13, 2020

This week I read the paper titled “Using Clinicians’ Search Query Data to Monitor Influenza Epidemics” by Santillana, M., Nsoesie, E. O., et al. (2014).

Introduction

According to research, search activity on diseases such as influenza and dengue has been shown to correlate with traditional surveillance data in many instances. In fact, searches on Yahoo and Baidu have been used to track influenza pandemics.

Isn’t that interesting to know?

Let’s explore the approach used by the authors to test their claims and the results produced.

Are you ready? Let’s explore!

Preamble

The findings from the study demonstrate that a combination of a “robust dynamic” methodology and subject-matter experts’ search activity predicts influenza activity more accurately than the well-established Internet-based tool Google Flu Trends (GFT).

Interesting… right?

The authors also claimed that public search activity can be influenced by anxiety, fear, and rumors. This claim corroborates findings from Cook et al. (2011), Butler (2013), Olson et al. (2013), and Lazer et al. (2014). Therefore, the authors argued that Internet-based surveillance drawing on subject-matter experts’ searches is more reliable.

In addition, the authors used UpToDate (www.uptodate.com) search query activity related to influenza-like illness (ILI) to design a timely sentinel (like a guard) for influenza incidence in the United States.

UpToDate is a physician-authored clinical decision support Internet resource used by 700,000 clinicians in 158 countries and almost 90% of academic medical centers in the United States.

Methods

Data

As stated above, UpToDate is a professional database utilized by healthcare practitioners for point-of-care decisions. Therefore, in collaboration with UpToDate, the authors obtained search volume of 23 search terms related to ILI, as well as overall search activity from November 2011 to November 2013 for US accounts only.

Some of the search terms included: influenza, Haemophilus influenzae, flu, parainfluenza, H1N1, gripe, rhinovirus, respiratory syncytial virus, metapneumovirus, coronavirus, Mycoplasma pneumoniae, pneumonia, bronchitis, and so on.

In order to set the stage for analysis, the authors obtained a weekly search fraction for each search term, at any given point in time, by dividing the number of searches for a given phrase by the total number of searches in the UpToDate database, thus minimizing the effects of variation in the overall use of the UpToDate database through time.
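The normalization step above can be sketched in a few lines. This is a minimal illustration, not the authors’ code; the counts below are made up, whereas the real numbers came from UpToDate’s US search logs.

```python
# Sketch of the weekly search-fraction normalization described above.
# All counts here are hypothetical, for illustration only.

def search_fraction(term_counts, total_counts):
    """Divide each week's count for a term by that week's total searches."""
    return [c / t for c, t in zip(term_counts, total_counts)]

# Hypothetical weekly counts for the term "influenza" over four weeks
influenza_counts = [120, 300, 450, 200]
total_searches = [100_000, 120_000, 150_000, 110_000]

fractions = search_fraction(influenza_counts, total_searches)
print(fractions)
```

Dividing by the weekly total is what makes the series comparable across weeks even when overall UpToDate usage grows or shrinks.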

This provides robust (standardized) data for the analysis. In addition, they obtained the national weekly ILI index from the CDC for the same time period to be used for comparison.

I think at this point, the stage is set for analysis. Don’t you think so?

Let’s delve into the dynamic approach employed by the authors to give reliable estimates of the results.

Analysis

The authors built a collection of multivariate linear models using the z-scores (standardized values) of the aforementioned 23 search terms’ weekly search fractions as explanatory variables and the CDC ILI index as the dependent variable.

Note: The multiplicative coefficients associated with each search term in each multivariate linear model were updated weekly as the CDC ILI index was updated.

The multivariate models can be expressed as:

ILI(t) = β₀ + β₁·z₁(t) + β₂·z₂(t) + … + β₂₃·z₂₃(t) + ε(t)

where zᵢ(t) is the z-score of the i-th search term’s weekly search fraction at week t, and ILI(t) is the CDC ILI index.

I hope that makes sense?

Remember there were 23 search terms, hence the sum goes to 23.
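The z-scoring and regression setup can be sketched as follows. The data here are synthetic stand-ins; in the paper, the 23 columns would be the UpToDate search fractions and the dependent variable the CDC ILI index.

```python
import numpy as np

# Sketch of the z-scoring and multivariate linear-model setup described
# above. All data are synthetic, for illustration only.

def zscore(x):
    # Standardize a term's weekly search fractions to mean 0, std 1
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(0)
n_weeks, n_terms = 26, 23
fractions = rng.random((n_weeks, n_terms))     # toy weekly search fractions
Z = np.apply_along_axis(zscore, 0, fractions)  # standardize each term (column)

# Ordinary least-squares fit against a toy ILI series
ili = rng.random(n_weeks)
design = np.column_stack([np.ones(n_weeks), Z])     # intercept + 23 terms
beta, *_ = np.linalg.lstsq(design, ili, rcond=None)
print(beta.shape)  # intercept plus one coefficient per search term
```

In the paper these coefficients were not fit once: they were re-estimated every week as new CDC data arrived.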

What else did the authors do? Let us see.

Model selection was performed using a least absolute shrinkage and selection operator (LASSO) technique every week, incorporating new CDC ILI information as it became available.

The LASSO technique uses an optimization algorithm that favors models that minimize the mean squared error between the observations and predictions, while penalizing models containing many variables by simultaneously minimizing the sum of the absolute size of the regression coefficients.
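A minimal LASSO sketch, using scikit-learn rather than the authors’ Matlab implementation, shows the variable-selection behavior described above. The data are synthetic: only 3 of the 23 toy "search terms" truly drive the toy ILI series, and the L1 penalty zeroes out most of the rest.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of LASSO's built-in model selection. Synthetic data: columns
# play the role of the 23 z-scored search terms, y the CDC ILI index.
rng = np.random.default_rng(42)
n_weeks, n_terms = 52, 23
X = rng.standard_normal((n_weeks, n_terms))
true_beta = np.zeros(n_terms)
true_beta[:3] = [1.5, -2.0, 1.0]          # only a few terms truly matter
y = X @ true_beta + 0.1 * rng.standard_normal(n_weeks)

model = Lasso(alpha=0.1)                  # L1 penalty shrinks coefficients
model.fit(X, y)

# Many irrelevant coefficients end up exactly zero, so the fitted model
# keeps only a subset of the candidate terms
print(np.count_nonzero(model.coef_))
```

In the paper this fit was redone every week, so the selected subset of search terms could change as the season evolved.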

I remember that in the last paper review I did, the authors used elastic net regression.

Question for thought: What is the difference between elastic-net regression and LASSO regression? What are their strengths and weaknesses?

The authors did one more thing before the results. They produced real-time estimates of ILI activity at time t, assuming that they had access to:

  1. only CDC-reported ILI data up to 2 weeks prior (i.e., up to t–2 weeks), and
  2. the real-time (time = t) number of searches in the UpToDate database.
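The data-availability scheme in the two points above can be sketched as follows. The function name and the toy series are mine, for illustration; the key idea is that CDC data arrive with a 2-week reporting lag while search data are available immediately.

```python
# Sketch of the real-time estimation setup described above: at week t,
# CDC ILI data are only available up to week t-2, while UpToDate search
# fractions are available up to week t itself. Data are placeholders.

def available_data(t, cdc_ili, search_fractions, reporting_lag=2):
    """Return the training data and real-time predictors usable at week t."""
    y_train = cdc_ili[: t - reporting_lag + 1]           # CDC up to week t-2
    X_train = search_fractions[: t - reporting_lag + 1]  # matching weeks
    x_now = search_fractions[t]                          # searches at week t
    return X_train, y_train, x_now

cdc_ili = list(range(10))                  # placeholder weekly ILI values
searches = [[0.1 * w] for w in range(10)]  # placeholder search fractions

X_train, y_train, x_now = available_data(6, cdc_ili, searches)
print(len(y_train))  # at week 6, only weeks 0..4 of CDC data are usable
```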

This provides robustness to the results obtained. Finally, the methodology was implemented in Matlab version R2011a.

Are we good to check out the results of this study?

Results

The authors trained the model for 26 weeks (5 November 2011–28 April 2012).

The first real-time estimate of ILI was calculated for the week of 12 May 2012 (2 weeks later) using the optimal multivariate model. They then produced a weekly time series of real-time estimates using this approach for the subsequent weeks, up to the week of 30 November 2013. Figure 1 shows the real-time estimates and the CDC-reported ILI visits. GFT estimates were included for context.

Figure 1. Source: Results section of the paper

The estimates predict the CDC-reported ILI visits very well and outperform the GFT estimates during the prediction period. The authors noted, though, a slight overestimation of the influenza epidemic curve in the second week of January 2013 (overestimating flu activity by approximately 25% in relative terms, i.e., 5.6% ILI as opposed to the actual 4.5%).

However, this overestimation was minimal compared to the GFT estimates, which overestimated influenza activity by 130% in relative terms (10.5% ILI as opposed to the actual 4.5%).

The authors’ methodology had strong predictive power (Pearson correlation of 0.972). GFT also had a high Pearson correlation (0.9499) during the same time period (the prediction period starting in the week of 12 May 2012 and ending in the last week of November 2013). However, the authors’ estimates were more reliable, with a root mean square error (RMSE) of 0.2829%, whereas the GFT estimates were on average off by 1.4% of the national population (i.e., almost 5 times the authors’ RMSE).
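The two accuracy metrics quoted above are easy to compute by hand. The weekly ILI values below are made up for illustration; only the metric definitions mirror the paper.

```python
import math

# Sketch of the two accuracy metrics used above: Pearson correlation
# and RMSE. The weekly ILI series here are made up, for illustration.

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rmse(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

observed = [1.2, 2.5, 4.5, 3.1, 2.0]   # CDC-reported ILI (%), made up
predicted = [1.4, 2.4, 5.6, 3.0, 2.1]  # model estimates, made up

print(round(pearson(observed, predicted), 3))  # correlation near 1 = good
print(round(rmse(observed, predicted), 3))     # average error in ILI %
```

Note how the two metrics answer different questions: correlation rewards matching the shape of the epidemic curve, while RMSE penalizes being off in absolute terms, which is why GFT could score well on one and poorly on the other.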

The authors presented a heatmap representing the relevance of each search term in predicting influenza activity as a function of time, during the validation time period. The term Tamiflu was the strongest predictor, whereas sinusitis, influenza, H1N1, and coronavirus were relevant predictors during different time periods, as shown in figure 2 below:

Figure 2. Source: Results section of the paper

You would agree with me that these results are interesting and insightful…Right?

Link to the paper:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296132/

Discussion & Conclusion

This paper presented a model with numerous strengths compared to Google Flu Trends (GFT), and it offers a promising way to identify meaningful signals for tracking influenza activity.

The authors hoped that this will motivate future research aimed at testing the accuracy of their methodology at state and city levels, and potentially in the prediction of other diseases.

That sounds great!

Finally, the authors highlighted some limitations of digital data (novel data) sources for infectious disease surveillance based on search query data.

One such limitation is that these data sources lack the specificity of traditional surveillance systems, which rely on hierarchical reporting procedures. To deal with this limitation, the authors supplemented the digital data with data provided by the CDC.

Note: UpToDate data is not publicly available and thus not ready to be used as an alternative disease detection sentinel.

I hope you gleaned one or two insights from this paper. Remember to look into the question for thought I raised earlier.

Thank you for reading. Make sure you give this a clap. See you next time!

References

  • Cook, S.; Conrad, C.; Fowlkes, A. L.; Mohebbi, M. H. (2011). Assessing Google Flu Trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PLoS One; 6:e23610.
  • Butler, D. (2013). When Google got flu wrong. Nature; 494:155–6.
  • Olson, D. R.; Konty, K. J.; Paladini, M.; Viboud, C.; Simonsen, L. (2013). Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLoS Comput Biol; 9:e1003256.
  • Lazer, D.; Kennedy, R.; King, G.; Vespignani, A. (2014). The parable of Google Flu: traps in big data analysis. Science; 343:1203–5.
