I walk through a little example of how to analyze NHANES accelerometer data.

Our Purpose

Today’s post is about analyzing nationally representative accelerometer data. Before starting my post-doctoral fellowship at the National Cancer Institute, this type of analysis seemed daunting because it is difficult to fully grasp the data collection procedure or study design when you weren’t in charge of planning or execution. I also had an underlying fear that because the data are publicly available, someone would have already done the same analysis and I just didn’t find it during my literature search, or we would unknowingly be working on the same analysis in tandem. (Sadly, that last part did happen to me recently.) However, this type of analysis is critical for surveillance of physical activity levels at the population level and while you don’t have control over every aspect of data collection and study design, you have access to massive amounts of data for free even during a pandemic that shuts down in-person data collection, so there are definite plus sides…

NHANES Accelerometer Data

There are many publicly available data sources out there but today I want to focus on NHANES accelerometer data. The most recently-collected accelerometer data (from 2013-2014) is available in two main formats. First, there is data using the Monitor Independent Movement Summary (MIMS) metric, which provides an indication of the total volume of activity. MIMS are summed over each minute, hour, or day. More recently, the “raw” data were made available. This “raw” data is processed a bit (e.g., non-wear times have been removed), but the units are gravitational units (in g’s) so they are calling it “raw.” In the coming weeks, I will write a post about downloading and analyzing the raw data, but for now I’ll go over a few considerations for the MIMS data which will also apply for the raw data regarding use of survey weights and design variables.

Data Download

The MIMS data can be downloaded from the NHANES website. There are also R packages to directly download these data but I don’t currently use a specific one (yet…more to come in the raw data post 😉). For this example, we need two data files. First, total MIMS per day per participant which is found in the ‘Physical Activity Monitor- Day’ data (PAXDAY_H). Second, we also need the demographics file. This demographics file has the all-important survey weight and sampling design variables. I’m not using it in this example, but I want to note that there is a ton of really important information about the data quality, potential data errors, etc. contained in these files.

Let’s Get Analyzing!

You have to take the survey weights and sampling design into account when analyzing nationally representative data. There are a ton of resources out there about complex survey design analysis, some of which I’ve linked to at the bottom of this post. I use the ‘survey’ package in R, so let’s start by loading the necessary packages!

tryCatch(library(dplyr),error=function(e){install.packages("dplyr");library(dplyr)})
tryCatch(library(survey),error=function(e){install.packages("survey");library(survey)})
tryCatch(library(haven),error=function(e){install.packages("haven");library(haven)})

Then, read in the demographics data using the ‘haven’ package because it is in the ‘xpt’ format. This package is also useful for reading in spss or sas files. In the same step, I select only the variables I want to keep (it can get unwieldy otherwise!). These variables are the participant’s id (SEQN), age (RIDAGEYR), sex/gender (RIAGENDR), survey weights (WTMEC2YR), and design variables (SDMVPSU and SDMVSTRA). There are a few weighting options available in NHANES but I am using WTMEC2YR because the accelerometer data are from the NHANES exam conducted in the mobile examination center. NHANES documentation is very clear about which weights to use based on your variables of interest, so make sure you are using the right ones!

demo<-haven::read_xpt("DEMO_H.xpt") %>% select("SEQN","RIDAGEYR","RIAGENDR","WTMEC2YR","SDMVPSU","SDMVSTRA")
head(demo)
   SEQN RIDAGEYR RIAGENDR WTMEC2YR SDMVPSU SDMVSTRA
1 73557       69        1 13481.04       1      112
2 73558       54        1 24471.77       1      108
3 73559       72        1 57193.29       1      109
4 73560        9        1 55766.51       2      109
5 73561       73        2 65541.87       2      116
6 73562       56        1 25344.99       1      111

This next part has a few steps all packed in to one. I’m going to read in accelerometer data, keep only certain variables, filter out any days with less than 10 h of waking wear time, then calculate mean total MIMS per day for each participant. The variables I am keeping are participant id (SEQN), valid waking wear time (PAXWWMD), and total MIMS per day per participant (PAXMTSD).

paday<-haven::read_xpt("PAXDAY_H.xpt") %>% 
  select("SEQN","PAXWWMD","PAXMTSD") %>%
  filter(PAXWWMD>=600) %>%
  group_by(SEQN) %>% 
  summarize(overallmims=mean(PAXMTSD))
  head(paday)
   SEQN overallmims
1 73557   11766.006
2 73558   10480.065
3 73559    9605.735
4 73560   23256.377
5 73561    7637.048
6 73562   16147.576

I can then merge the demographic and physical activity data together based on the participant id (SEQN). The ‘all=TRUE’ argument is important here because not all participants have accelerometer data but we don’t want to remove people just because they are missing this type of data as we still need to account for them in the survey design (as discussed later). Put another way, we need everyone in the demographics file to be present in ‘data,’ even if they don’t have accelerometer data.

data<-merge(demo,paday,by="SEQN",all=TRUE)

Accounting for Survey Weights and Design

To appropriately account for the survey design and estimate variance, we can’t just do analysis on ‘data.’ We have to use the ‘survey’ package, which first requires us to define a survey design object using ‘svydesign.’ Here, I am indicating which variables in ‘data’ correspond to the primary sampling units (SDMVPSU), strata (SDMVSTA), and survey weights (WTMEC2YR). Nest=TRUE indicates the clusters are nested within the strata.

dsn<-svydesign(id=~SDMVPSU,strata=~SDMVSTRA,weights=~WTMEC2YR,data=data,nest=TRUE)

Another SUPER IMPORTANT thing to keep in mind is that you can’t just select a subset of the NHANES data to analyze. You have to keep ALL of the data in the dataset. If you want to analyze a sub-population, you have to do so using the ‘subset’ function so that the full survey design is taken in to account and variance is properly estimated. In this example, I’m only going to anlayze data for children 11 years of age or younger.

sub<-subset(dsn,RIDAGEYR<=11)

For pretty much any type of descriptives or analysis you want to do, the ‘survey’ package has a function that properly accounts for the design/weights. I’ll give two basic examples. First, I calculate mean (and standard error) of MIMS for boys and girls separately. Second, I use ‘svyglm’ to run a simple regression between age and overall MIMS.

svyby(~as.numeric(overallmims),~factor(RIAGENDR),sub,na.rm=T,vartype="se",svymean)
model<-svyglm(as.numeric(overallmims) ~ as.numeric(RIDAGEYR),sub)
summary(model)
  factor(RIAGENDR) as.numeric(overallmims)       se
1                1                19600.25 139.3811
2                2                19322.97 176.8942
Call:
svyglm(formula = as.numeric(overallmims) ~ RIDAGEYR, design = sub)

Survey design:
subset(dsn, RIDAGEYR <= 11)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  20815.1      197.4 105.428  < 2e-16 ***
RIDAGEYR      -187.8       35.7  -5.261  0.00012 ***
---
Signif. codes:  0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

(Dispersion parameter for gaussian family taken to be 11981929)

Number of Fisher Scoring iterations: 2

Extra Resources

Even though the actual steps of analysis are straightforward, there are a lot of important details I’ve glossed over. I recommend spending a lot of time on the NHANES website. Below are some specific examples of places to start:

There are a bunch of NHANES tutorials; here are some asynchronous ones. You can also look at the CDC’s YouTube channel.

This is the overview of the 2013-2014 data collection scope.

You can search for variables across all NHANES waves- a real time saver!

This is the manual for each measurement procedure (e.g., how anthropometrics were measured).

Getting a bit more in the weeds, this site details the weighting scheme, sampling strategy, completion rates, etc. The analytic guidelines are particularly useful for learning how to appropriately account for the sampling design and weighting scheme, combine data across waves, calculate design effects, and make sure you aren’t reporting unstable estimates.

The NHANES website even has some sample code.

To learn more about survey design and analysis, there are a ton of resources online. I mostly worked through slidesets people had made available- like these:

I really enjoyed the course I took with Brady West, who wrote a book on the topic and has posted a ton of free resources (including code!) online.

That’s it!