Weekly Paper Review: How Can Convolutional Neural Networks (CNN) Assist in the Study of the Association Between the Built Environment and Obesity Prevalence?

Arimoro Olayinka
Jun 4, 2020
Source: https://www.pageuppeople.com.sg/resource/hr-its-time-for-your-performance-review/

This week I read the paper titled: “Use of Deep Learning to Examine the Association of the Built Environment With Prevalence of Neighborhood Adult Obesity” by Adyasha M. & Elaine O. N (2018).

Obesity has been linked with several factors such as genetics, diet, physical activity, and the environment.

Several studies have reported relationships between the built environment (i.e., both natural and modified elements of the physical environment) and obesity, but their findings vary.

Therefore, the paper proposed an approach to consistently measure the features of the built environment and its association with obesity prevalence.

Preamble

Research shows that associations exist between obesity and the built environment. However, these studies have been inconsistent; for example, some examined the association based on predetermined features. The authors opined that approaches that enable consistent measurement and allow for comparison across studies are needed.

Therefore, in order to achieve this consistency in measurement, the authors used a deep learning approach called Convolutional Neural Networks (CNN) to extract built environment features from satellite images. CNN allows for consistent quantification of the features of the built environment across neighborhoods. Also, this allows for comparability across studies and geographic regions.

The ability to assess and quantify the relationship between obesity prevalence and the built environment can help in selecting and implementing community-based interventions and prevention efforts, as stated by Hales, et al. (2015–2016).

Methods

Obesity Prevalence Data

The authors used the 2014 estimates of annual crude obesity prevalence at the census-tract level from the 500 Cities project.

They collected this data for six (6) selected cities from US states with high (Tennessee and Texas) and low (Washington and California) prevalence of obesity. The cities selected are Los Angeles in California, Memphis in Tennessee, San Antonio in Texas, and Tacoma, Seattle and Bellevue in Washington. The authors noted that Tacoma, Seattle and Bellevue are neighboring cities with few census tracts. Therefore, they combined their data into one region and called it Seattle.

The authors performed the analysis in this study in two (2) steps. They:

  1. Processed satellite images to extract features of the built environment using a CNN, and extracted and processed point-of-interest (POI) data
  2. Used elastic net regression to build a parsimonious model to assess the association between the built environment and obesity prevalence.

Note: A parsimonious model uses a minimal number of assumptions, steps, or conjectures. In other words, it is a model that accomplishes a desired level of explanation or prediction with as few predictor variables as possible.

Acquiring Satellite Imagery and POI data

The authors downloaded about 150,000 high-resolution satellite images from Google Static Maps API (application programming interface) by providing the geographic center, image dimensions, and zoom level for each image.
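As a rough illustration of what "providing the geographic center, image dimensions, and zoom level" can look like, here is a minimal sketch that only builds a request URL for the Google Static Maps API. The coordinates, zoom level, image size and API key are placeholder assumptions; the review does not state the values the authors used.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Static Maps API.
STATIC_MAPS_ENDPOINT = "https://maps.googleapis.com/maps/api/staticmap"


def static_map_url(lat, lng, zoom, width, height, api_key):
    """Build the request URL for one satellite image centred on (lat, lng)."""
    params = {
        "center": f"{lat},{lng}",      # geographic center of the image
        "zoom": zoom,                  # zoom level
        "size": f"{width}x{height}",   # image dimensions in pixels
        "maptype": "satellite",        # request satellite imagery
        "key": api_key,
    }
    return f"{STATIC_MAPS_ENDPOINT}?{urlencode(params)}"


# Hypothetical values chosen only for illustration.
url = static_map_url(47.6062, -122.3321, zoom=18, width=400, height=400,
                     api_key="YOUR_API_KEY")
print(url)
```

Repeating this for a grid of centers covering a city would give one URL per image to download.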

Furthermore, the authors used census tract shapefiles to associate each image with its respective census tract. The authors excluded images and points of interest from areas outside the city limits.
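To give a feel for how an image center can be matched to a census tract, here is a small sketch using the classic ray-casting point-in-polygon test. The tract polygons below are made-up squares, and the function names are hypothetical; real shapefiles would normally be read and queried with a GIS library, which the review does not name.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(x, y, tracts):
    """Return the id of the first tract containing the point, or None if outside all."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(x, y, polygon):
            return tract_id
    return None


# Toy tracts: two adjacent squares (coordinates are illustrative only).
tracts = {
    "tract_A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "tract_B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(assign_tract(1, 1, tracts), assign_tract(3, 1, tracts), assign_tract(5, 1, tracts))
```

Points that return None fall outside every tract, which mirrors the exclusion of images outside the city limits.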

Also, using a radial nearby search within an appropriate distance, the authors downloaded POI data through the Google Places of Interest API. In all, they collected a set of 96 unique POI categories. For each census tract the authors counted the number of locations associated with each category.
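The per-tract counting step can be sketched in a few lines. The records and category names below are invented for illustration; the actual data had 96 POI categories.

```python
from collections import Counter, defaultdict

# Hypothetical records: one (census_tract_id, poi_category) pair per location.
pois = [
    ("tract_A", "restaurant"),
    ("tract_A", "restaurant"),
    ("tract_A", "gym"),
    ("tract_B", "pharmacy"),
]


def count_pois_by_tract(records):
    """Count, for each census tract, the number of locations in each category."""
    counts = defaultdict(Counter)
    for tract_id, category in records:
        counts[tract_id][category] += 1
    return counts


counts = count_pois_by_tract(pois)
print(dict(counts["tract_A"]))
```

Each tract's counter can then be turned into a fixed-length row (one column per category) for use in a regression.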

All POI data and satellite images were initially downloaded from February 14 through 28, 2017, and updated during the study period, which lasted through October 31, 2017.

Image Processing

Due to the lack of a large labeled data set for classifying high- and low-obesity regions, the authors adopted a transfer learning approach, which involves using a pretrained network to extract features of the built environment from the unlabeled data set of nearly 150,000 satellite images.

Transfer learning, in this context, involves fine-tuning the pretrained CNN for a new task (with modification to the output layer) or using the pretrained CNN as a fixed feature extractor combined with linear classifiers or regression models.

The authors used the VGG-CNN-F network as seen below. The network is composed of 8 layers (5 convolutional and 3 fully connected) and is trained on approximately 1.2 million images from the ImageNet database (a dataset of over 14 million images used for large-scale visual recognition challenges) for recognizing objects belonging to 1000 categories (ImageNet, 2018).

Source: Supplemental Content of the paper

The authors collected outputs from the second fully connected layer of the network for each image in the data set. The second fully connected layer has 4096 nodes, each of which has nonlinear connections with all other nodes in the previous and next layers.

This means that each feature vector has 4096 dimensions, corresponding to the outputs (also termed activations) from these nodes. These outputs were further aggregated into mean feature vectors for each census tract by averaging over all images belonging to that census tract.
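The aggregation into mean feature vectors is simple to sketch. The example below uses toy 3-dimensional vectors in place of the 4096-dimensional activations, and the function name is hypothetical.

```python
from collections import defaultdict


def mean_feature_vectors(image_features):
    """Average per-image feature vectors within each census tract.

    image_features: iterable of (tract_id, feature_vector) pairs.
    """
    sums = {}
    n_images = defaultdict(int)
    for tract_id, vec in image_features:
        if tract_id not in sums:
            sums[tract_id] = [0.0] * len(vec)
        for i, value in enumerate(vec):
            sums[tract_id][i] += value
        n_images[tract_id] += 1
    return {t: [s / n_images[t] for s in total] for t, total in sums.items()}


# Toy data: two images in tract_A, one image in tract_B.
features = [
    ("tract_A", [1.0, 0.0, 2.0]),
    ("tract_A", [3.0, 2.0, 2.0]),
    ("tract_B", [0.5, 0.5, 0.5]),
]
tract_means = mean_feature_vectors(features)
print(tract_means["tract_A"])  # [2.0, 1.0, 2.0]
```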

Statistical Analysis

Recall that the data contained a large number of features (n=4,096). Therefore, the authors chose Elastic Net regression, a regularization and variable selection technique. This technique guards against overfitting, which is a risk given the high dimension of the feature dataset.

A major benefit of Elastic Net is that it combines the advantages of Ridge regression and Least Absolute Shrinkage and Selection Operator (LASSO); insignificant covariates are eliminated, while correlated variables that are significant are maintained.
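To make the combination of Ridge and LASSO concrete, here is a minimal pure-Python sketch of elastic net fitted by coordinate descent. It is not the authors' code (they worked in R). The function names, the toy data and the penalty parameterization (lam for overall strength, alpha for the LASSO share, in the style commonly used by glmnet) are assumptions for illustration, and the features are assumed to be centered so that no intercept is fitted.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Minimise (1/2n)*||y - X b||^2 + lam*(alpha*|b|_1 + (1 - alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            denom = sum(c * c for c in col) / n + lam * (1 - alpha)  # ridge part
            if denom == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(col[i] * (residual[i] + col[i] * beta[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam * alpha) / denom  # LASSO part
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= col[i] * delta
                beta[j] = new_bj
    return beta


# Toy data: columns 0 and 1 are identical, column 2 is unrelated to y.
X = [[1.0, 1.0, 0.5], [-1.0, -1.0, 0.5], [2.0, 2.0, -0.5], [-2.0, -2.0, -0.5]]
y = [2.0, -2.0, 4.0, -4.0]
beta = elastic_net(X, y, lam=1.0, alpha=0.5)
print([round(b, 3) for b in beta])
```

In this toy example the fit should share the weight between the two correlated columns and set the coefficient of the unrelated column to exactly zero, which is the behaviour described above.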

To check for association, the authors performed three analyses:

  1. between features of built environment and prevalence of obesity at the census tract level
  2. between density of POIs and prevalence of obesity at the census tract level
  3. between features of the built environment extracted from satellite images and socioeconomic variables, such as per capita income.

In order to evaluate how well the model predicts obesity prevalence across all cities, the data was split into two sets — a random sample representing 60% of the data for fitting (training set) and the remaining 40% for model evaluation (validation set).
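A random 60/40 split like the one described can be sketched as follows. The seed and the use of integer tract ids are assumptions made only so the example is reproducible.

```python
import random


def train_validation_split(ids, train_fraction=0.6, seed=0):
    """Randomly split ids into a training set and a validation set."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder ids standing in for the 1695 census tracts.
tract_ids = list(range(1695))
train, validation = train_validation_split(tract_ids)
print(len(train), len(validation))
```

The model is fitted on the training ids only, and the validation ids are held back to measure out-of-sample error.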

In addition, to select an appropriate value for the tuning parameter (λ value), the authors used cross-validation and selected the value that minimized the mean cross-validated error.

Specifically, the authors performed 5-fold cross-validated regression analyses to quantify the aforementioned associations for all regions except Memphis, for which they used 3-fold cross-validation because the sample size was less than 200. These analyses were performed jointly for all regions and independently for each region.
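The fold logic can be sketched in plain Python. The helper names and the example error values are hypothetical; only the "3 folds below 200 samples, otherwise 5" rule and the "pick the λ with the lowest mean cross-validated error" rule come from the description above.

```python
import random


def choose_n_folds(n_samples, threshold=200):
    """Use 3 folds for small samples (fewer than 200), otherwise 5."""
    return 3 if n_samples < threshold else 5


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (training, held_out) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        training = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield training, folds[k]


def select_lambda(mean_cv_error_by_lambda):
    """Return the candidate lambda with the smallest mean cross-validated error."""
    return min(mean_cv_error_by_lambda, key=mean_cv_error_by_lambda.get)


# Memphis had 178 census tracts, so it gets 3 folds under this rule.
memphis_folds = list(k_fold_indices(178, choose_n_folds(178)))
print(len(memphis_folds), [len(held_out) for _, held_out in memphis_folds])

# Made-up mean errors for three candidate lambda values.
print(select_lambda({0.01: 4.9, 0.1: 4.3, 1.0: 5.6}))
```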

The authors used the root mean squared error (RMSE) metric for model assessment and comparison. The R statistical software was used for all regression analyses.
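For readers who want the metrics spelled out, here is a small sketch of RMSE together with one common definition of "variation explained" (R² = 1 − SS_res / SS_tot). The review does not say exactly how the authors computed the explained variation, so that formula is an assumption, and the numbers are toy values.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and estimated values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def variance_explained(actual, predicted):
    """R^2: share of the variation in `actual` accounted for by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy obesity prevalence values (percent) for four tracts.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 31.0, 33.0]
print(round(rmse(actual, predicted), 2), round(variance_explained(actual, predicted), 2))
```

RMSE is in the same units as the outcome (percentage points of obesity prevalence), which is why it can be read alongside the percentages in the results.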

Results

The study included 1695 census tracts in 6 cities. The age-adjusted obesity prevalence was 18.8% (95% CI, 18.6%-18.9%) for Bellevue, 22.4% (95% CI, 22.3%-22.5%) for Seattle, 30.8% (95% CI, 30.6%-31.0%) for Tacoma, 26.7% (95% CI, 26.7%-26.8%) for Los Angeles, 36.3% (95% CI, 36.2%-36.5%) for Memphis, and 32.9% (95% CI, 32.8%-32.9%) for San Antonio.

After regularization, the authors retained 125 features for all cities combined, 157 for Los Angeles, 79 for Memphis, 69 for San Antonio, and 85 for Seattle.

Features of the built environment explained 64.8% (root mean square error [RMSE], 4.3) of the variation in obesity prevalence in out-of-sample estimates across all 1695 census tracts based on the elastic net regression.

Individually, the variation explained was 55.8% (RMSE, 3.2) for Seattle (213 census tracts), 56.1% (RMSE, 4.2) for Los Angeles (993 census tracts), 73.3% (RMSE, 4.5) for Memphis (178 census tracts), and 61.5% (RMSE, 3.5) for San Antonio (311 census tracts) in out-of-sample estimates, as seen in the figures below:

For High-prevalence areas

Source: Result section of the paper
Source: Result section of the paper

For Low-prevalence areas

Source: Result section of the paper
Source: Result section of the paper

Compared with the features of the built environment, the POI data explained 42.4% (RMSE, 4.3) of the variation in obesity prevalence across all 1695 census tracts in out-of-sample estimates. The variation explained at the regional level was approximately 14.0% (RMSE, 4.5) for the 213 Seattle census tracts, 29.2% (RMSE, 5.4) for the 993 Los Angeles census tracts, 43.0% (RMSE, 4.1) for the 311 San Antonio census tracts, and 43.2% (RMSE, 6.7) for the 178 Memphis census tracts.

The authors illustrated the linear correlation between the actual obesity prevalence and the model-estimated prevalence, and compared the findings from the image features and the POI data in the figures below:

Source: Result section of the paper
Source: Result section of the paper

Recall that the authors were also interested in the association between features of the built environment and socioeconomic status, measured by per capita income. The variation in per capita income explained by the features of the built environment was 37.6% for the 213 Seattle census tracts, 62.1% for the 993 Los Angeles census tracts, 58.2% for the 311 San Antonio census tracts, and 23.2% for the 178 Memphis census tracts.

This suggests that the association between obesity prevalence and the features of the built environment could potentially be explained by variations in socioeconomic status. This agrees with a previous study by Jean, et al. (2016).

Link to the paper

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2698635?utm_campaign=articlePDF&utm_medium=articlePDFlink&utm_source=articlePDF&utm_content=jamanetworkopen.2018.1535

Conclusion

The authors identified some limitations, one of which was that the obesity prevalence estimates from the Behavioral Risk Factor Surveillance System were based on self-reported height and weight. Such self-reports have been shown to be biased and tend to lead to lower estimates of obesity prevalence.

However, the results from this study show a consistent association between the built environment indicator and obesity prevalence across neighborhoods with low and high prevalence of adult obesity.

It was a long read I guess. You’d agree with me that it was insightful. I also learnt a lot from the paper.

Thank you for taking time to read this. See you next time!

References

  • Hales, C.M; Carroll, M.D; Fryar, C.D; Ogden, C.L. Prevalence of obesity among adults and youth: United States, 2015–2016. NCHS Data Brief No. 288. https://www.cdc.gov/nchs/products/databriefs/db288.htm. Updated October 13, 2017. Accessed October 14, 2017.
  • ImageNet. http://www.image-net.org. Accessed May 9, 2018.
  • Jean, N; Burke, M; Xie, M; Davis, W.M; Lobell, D.B; Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science; 353(6301):790–794. doi:10.1126/science.aaf7894
