Descriptive Epidemiology using epiR

Mark Stevenson

2021-07-30

Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Centers for Disease Control and Prevention 2006).

This vignette provides instruction on the way R and epiR can be used for descriptive epidemiological analyses, that is, to describe how the frequency of disease varies by individual, place and time.

Individual

Descriptions of disease frequency involve reporting either the prevalence or the incidence of disease.

Some definitions. Strictly speaking, ‘prevalence’ equals the number of cases of a given disease or attribute that exists in a population at a specified point in time. Prevalence risk is the proportion of a population that has a specific disease or attribute at a specified point in time. Many authors use the term ‘prevalence’ when they really mean prevalence risk, and these notes will follow this convention.

Two types of prevalence are reported in the literature: (1) point prevalence equals the proportion of a population in a diseased state at a single point in time, (2) period prevalence equals the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).

Incidence provides a measure of how frequently susceptible individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:

Incidence risk (also known as cumulative incidence) is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.

Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a defined follow-up period.
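As a minimal sketch of how these measures relate to one another, the base R snippet below works through the arithmetic for a hypothetical closed cohort. All of the counts and the person-time figure are invented for illustration and are not taken from any study.

```r
# Hypothetical closed cohort followed for two years (all numbers are illustrative)
npop <- 1000    # individuals in the population at the start of follow-up
nprev <- 50     # existing (prevalent) cases at the start of follow-up
ninc <- 30      # new (incident) cases that occur during follow-up
ntar <- 1850    # person-years at risk contributed by susceptible individuals

# Point prevalence at the start of follow-up
point.prev <- nprev / npop

# Period prevalence: existing cases plus incident cases, over the population
period.prev <- (nprev + ninc) / npop

# Incidence risk: new cases among those who were initially susceptible
inc.risk <- ninc / (npop - nprev)

# Incidence rate: new cases per unit of person-time at risk
inc.rate <- ninc / ntar

round(c(point.prev = point.prev, period.prev = period.prev,
   inc.risk = inc.risk, inc.rate = inc.rate), digits = 4)
```

Note that the denominator for incidence risk excludes the 50 individuals who were already cases at the start of follow-up, because they are not at risk of becoming new cases.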

In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to calculate confidence intervals for prevalence, incidence risk and incidence rates.

Let’s say we’re interested in the prevalence of disease X in a population of 1000 individuals. Two hundred individuals are tested and four return a positive result. Assuming 100% test sensitivity and specificity, what is the estimated prevalence of disease X in this population?

library(epiR); library(ggplot2); library(scales); library(lubridate)

ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, 
   conf.level = 0.95) * 100
#>      est     lower    upper
#> ncas   2 0.5475566 5.041361

The estimated prevalence of disease X in this population is 2.0 (95% CI 0.55 to 5.0) cases per 100 individuals at risk.
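As a rough cross-check, the same counts can be passed to binom.test from base R, which returns a Clopper-Pearson exact binomial confidence interval. This sketch assumes that interval is comparable to the one produced by the exact method in epi.conf; it is not a description of how epi.conf is implemented.

```r
# Exact (Clopper-Pearson) binomial confidence interval using base R
ncas <- 4; npop <- 200
est <- ncas / npop
ci <- binom.test(x = ncas, n = npop, conf.level = 0.95)$conf.int

# Express the results as cases per 100 individuals at risk
round(c(est = est, lower = ci[1], upper = ci[2]) * 100, digits = 2)
```

The limits returned should be close to those reported above.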

Another example. A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22,050 person-years at risk. What was the incidence rate of cancer in this population?

ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, 
   conf.level = 0.95) * 1000
#>         est    lower    upper
#> ncas 6.1678 5.174806 7.295817

The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.
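A similar cross-check is possible for the incidence rate using poisson.test from base R, which returns an exact Poisson confidence interval for a count divided by a time base. This is a sketch that assumes the count of cases follows a Poisson distribution; it is offered for comparison only.

```r
# Exact Poisson confidence interval for an incidence rate using base R
ncas <- 136; ntar <- 22050
rval <- poisson.test(x = ncas, T = ntar, conf.level = 0.95)

# Express the results as cases per 1000 person-years at risk
round(c(est = unname(rval$estimate), lower = rval$conf.int[1],
   upper = rval$conf.int[2]) * 1000, digits = 2)
```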

Now let’s say we want to compare the frequency of disease across several populations. An effective way to do this is to use a ranked error bar plot. In a ranked error bar plot the points represent the point estimates of the measure of disease frequency and the error bars indicate the 95% confidence interval around each estimate. The disease frequency estimates are sorted from lowest to highest.

Generate some data. First we’ll generate a distribution of disease prevalence estimates. Let’s say it has a mode of 0.60 and we’re 80% certain that the prevalence is greater than 0.35. Use the epi.betabuster function to generate parameters that can be used for a beta distribution to satisfy these constraints:

tmp <- epi.betabuster(mode = 0.60, conf = 0.80, greaterthan = TRUE, x = 0.35, 
   conf.level = 0.95, max.shape1 = 100, step = 0.001)
tmp$shape1; tmp$shape2
#> [1] 2.357
#> [1] 1.904667
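To see whether these shape parameters satisfy the stated constraints, the sketch below recomputes the mode and the probability that prevalence exceeds 0.35 using base R. The shape values are copied from the output above; the mode formula applies to a beta distribution whose shape parameters are both greater than 1.

```r
# Shape parameters copied from the epi.betabuster output above
shape1 <- 2.357; shape2 <- 1.904667

# Mode of a beta distribution (valid when both shape parameters exceed 1)
bmode <- (shape1 - 1) / (shape1 + shape2 - 2)

# Probability that prevalence is greater than 0.35
p.gt <- pbeta(q = 0.35, shape1 = shape1, shape2 = shape2, lower.tail = FALSE)

round(c(mode = bmode, p.greater = p.gt), digits = 3)
```

The mode should be close to 0.60 and the upper tail probability close to 0.80.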

Take 25 draws from a beta distribution using the shape1 and shape2 values calculated above and plot them as a frequency histogram:

dprob <- rbeta(n = 25, shape1 = tmp$shape1, shape2 = tmp$shape2)
dat.df <- data.frame(dprob = dprob)

ggplot(data = dat.df, aes(x = dprob)) +
  geom_histogram(binwidth = 0.01, colour = "gray", size = 0.1) +
  scale_x_continuous(limits = c(0,1), name = "Prevalence") +
  scale_y_continuous(limits = c(0,10), name = "Number of draws")
#> Warning: Removed 2 rows containing missing values (geom_bar).
Frequency histogram of disease prevalence estimates for our simulated population.

Generate a vector of population sizes using the uniform distribution. Calculate the number of diseased individuals in each population using dprob (calculated above). Finally, calculate the prevalence of disease in each population and its 95% confidence interval using epi.conf. The function epi.conf provides several options for confidence interval calculation methods for prevalence. Here we’ll use the exact method:

dat.df$rname <- paste("Region ", 1:25, sep = "")
dat.df$npop <- round(runif(n = 25, min = 20, max = 1500), digits = 0)
dat.df$ncas <- round(dat.df$dprob * dat.df$npop, digits = 0)

tmp <- as.matrix(cbind(dat.df$ncas, dat.df$npop))
tmp <- epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, 
   conf.level = 0.95) * 100
dat.df <- cbind(dat.df, tmp)
head(dat.df)
#>       dprob    rname npop ncas      est     lower    upper
#> 1 0.6037750 Region 1 1334  805 60.34483 57.661844 62.98213
#> 2 0.8301315 Region 2 1028  853 82.97665 80.536973 85.22576
#> 3 0.6013646 Region 3  723  435 60.16598 56.492811 63.75584
#> 4 0.6939678 Region 4 1131  785 69.40760 66.629920 72.08394
#> 5 0.1208333 Region 5  288   35 12.15278  8.612695 16.49456
#> 6 0.8590510 Region 6 1209 1039 85.93879 83.850583 87.85038

Sort the data in order of variable est and assign a 1 to n identifier as variable rank:

dat.df <- dat.df[sort.list(dat.df$est),]
dat.df$rank <- 1:nrow(dat.df)
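The estimate, interval and rank steps can also be sketched in base R alone. The example below uses four hypothetical regions with invented counts, calculates an exact binomial confidence interval for each with binom.test, and then ranks the regions by prevalence. The object name reg.df is introduced here for illustration only.

```r
# Hypothetical counts for four regions (illustrative values only)
reg.df <- data.frame(rname = paste("Region", 1:4),
   ncas = c(30, 5, 60, 10), npop = c(100, 80, 150, 40))

# Exact binomial 95% confidence interval for each region
ci <- t(sapply(seq_len(nrow(reg.df)), function(i) {
   binom.test(x = reg.df$ncas[i], n = reg.df$npop[i])$conf.int
}))

# Express estimates and limits as cases per 100 individuals at risk
reg.df$est <- reg.df$ncas / reg.df$npop * 100
reg.df$lower <- ci[,1] * 100
reg.df$upper <- ci[,2] * 100

# Sort from lowest to highest prevalence and assign a 1 to n rank
reg.df <- reg.df[order(reg.df$est),]
reg.df$rank <- seq_len(nrow(reg.df))
reg.df[,c("rank", "rname", "est")]
```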

Now create a ranked error bar plot. Because it’s useful to show the region names on the horizontal axis we’ll rotate the horizontal axis labels by 90 degrees.

ggplot(data = dat.df, aes(x = rank, y = est)) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
  geom_point() +
  scale_x_continuous(limits = c(0,25), breaks = dat.df$rank, labels = dat.df$rname, name = "Region") +
  scale_y_continuous(limits = c(0,100), name = "Prevalence (cases per 100 individuals
     at risk)") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Ranked error bar plot showing the prevalence of disease (and its 95% confidence interval) for 25 population units.

Time

Epidemic curve data are often presented in one of two formats:

  1. One row for each individual identified as a case with an event date assigned to each.

  2. One row for every event date with an integer representing the number of cases identified on that date.
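The two formats carry the same information, and it can help to see how one is converted to the other. The sketch below uses a small hypothetical line list (the object names linelist and counts are introduced here for illustration) and base R only.

```r
# Format 1: hypothetical line list with one row per case
linelist <- data.frame(id = 1:6,
   odate = as.Date(c("2004-07-26", "2004-07-26", "2004-07-27",
      "2004-07-29", "2004-07-29", "2004-07-29")))

# Format 2: one row per event date with the number of cases on that date
counts <- as.data.frame(table(odate = linelist$odate), responseName = "ncas",
   stringsAsFactors = FALSE)
counts$odate <- as.Date(counts$odate)
counts

# Going the other way: repeat each date ncas times to recover one row per case
expanded <- rep(counts$odate, times = counts$ncas)
```

Dates on which no cases occurred do not appear in counts when it is built this way, so zero-count dates need to be added separately if they are required.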

Generate some data, with one row for every individual identified as a case:

n.males <- 100; n.females <- 50
odate <- seq(from = as.Date("2004-07-26"), to = as.Date("2004-12-13"), by = 1)
prob <- c(1:100, 41:1); prob <- prob / sum(prob)
modate <- sample(x = odate, size = n.males, replace = TRUE, prob = prob)
fodate <- sample(x = odate, size = n.females, replace = TRUE)

dat.df <- data.frame(sex = c(rep("Male", n.males), rep("Female", n.females)), 
   odate = c(modate, fodate))

# Sort the data in order of odate:
dat.df <- dat.df[sort.list(dat.df$odate),] 

We’d like to have the flexibility to plot counts of cases by calendar week or by epidemiological (‘epi’) week. We assign to each date the corresponding epidemiology week number using the epiweek function from the lubridate package.

dat.df$eweek <- epiweek(dat.df$odate)

Plot the epidemic curve using the ggplot2 and scales packages:

ggplot(data = dat.df, aes(x = as.Date(odate))) +
  geom_histogram(binwidth = 7, colour = "gray", size = 0.1) +
  scale_x_date(breaks = date_breaks("7 days"), labels = date_format("%d %b"), 
     name = "Date") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Frequency histogram showing counts of incident cases of disease as a function of calendar date, 26 July to 13 December 2004.

The same plot but this time showing epidemiology week on the horizontal axis:

ggplot(data = dat.df, aes(x = eweek)) +
  geom_histogram(binwidth = 1, colour = "gray", size = 0.1) +
  scale_x_continuous(breaks = seq(from = 30, to = 50, by = 1), name = "Epidemiology week") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases")
Frequency histogram showing counts of incident cases of disease as a function of epidemiology week, 26 July to 13 December 2004.

Produce a separate epidemic curve for males and females using the facet_grid option in ggplot2:

ggplot(data = dat.df, aes(x = as.Date(odate))) +
  geom_histogram(binwidth = 7, colour = "gray", size = 0.1) +
  scale_x_date(breaks = date_breaks("1 week"), labels = date_format("%d %b"), 
     name = "Date") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  facet_grid( ~ sex)
Frequency histogram showing counts of incident cases of disease as a function of time, 26 July to 13 December 2004, conditioned by sex.

Let’s say an event occurred on 31 October 2004. Mark this date on your epidemic curve using geom_vline:

ggplot(data = dat.df, aes(x = as.Date(odate))) +
  geom_histogram(binwidth = 7, colour = "gray", size = 0.1) +
  scale_x_date(breaks = date_breaks("1 week"), labels = date_format("%d %b"), 
     name = "Date") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  facet_grid( ~ sex) + 
  geom_vline(aes(xintercept = as.numeric(as.Date("31/10/2004", format = "%d/%m/%Y"))), 
   linetype = "dashed")
Frequency histogram showing counts of incident cases of disease as a function of time, 26 July to 13 December 2004, conditioned by sex. An event that occurred on 31 October 2004 is indicated by the vertical dashed line.

Plot the total number of disease events by week, coloured according to sex:

ggplot(data = dat.df, aes(x = as.Date(odate), group = sex, fill = sex)) +
  geom_histogram(binwidth = 7, colour = "gray", size = 0.1) +
  scale_x_date(breaks = date_breaks("1 week"), labels = date_format("%d %b"), 
     name = "Date") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  geom_vline(aes(xintercept = as.numeric(as.Date("31/10/2004", format = "%d/%m/%Y"))), 
   linetype = "dashed") + 
  scale_fill_manual(values = c("#d46a6a", "#738ca6"), name = "Sex") +
  theme(legend.position = c(0.90, 0.80))
Frequency histogram showing counts of incident cases of disease as a function of time, 26 July to 13 December 2004, grouped by sex.

It can be difficult to appreciate differences in male and female disease counts as a function of date with the above plot format so we dodge the data instead.

ggplot(data = dat.df, aes(x = as.Date(odate), group = sex, fill = sex)) +
  geom_histogram(binwidth = 7, colour = "gray", size = 0.1, position = "dodge") +
  scale_x_date(breaks = date_breaks("1 week"), labels = date_format("%d %b"), 
     name = "Date") +
  scale_y_continuous(breaks = seq(from = 0, to = 20, by = 2), limits = c(0,20), name = "Number of cases") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  geom_vline(aes(xintercept = as.numeric(as.Date("31/10/2004", format = "%d/%m/%Y"))), 
   linetype = "dashed") + 
  scale_fill_manual(values = c("#d46a6a", "#738ca6"), name = "Sex") + 
  theme(legend.position = c(0.90, 0.80))
Frequency histogram showing counts of incident cases of disease as a function of time, 26 July to 13 December 2004, grouped by sex.

We now provide code to deal with the situation where the data are presented with one row for every case event date and an integer representing the number of cases identified on each date.

Simulate some data in this format. In the code below the variable ncas represents the number of cases identified on a given date. The variable dcontrol is a factor with two levels: neg and pos. Level neg flags dates when no disease control measures were in place; level pos flags dates when disease control measures were in place.

odate <- seq(from = as.Date("1/1/00", format = "%d/%m/%y"), 
   to = as.Date("1/1/05", format = "%d/%m/%y"), by = "1 month")
ncas <- round(runif(n = length(odate), min = 0, max = 100), digits = 0)

dat.df <- data.frame(odate, ncas)
dat.df$dcontrol <- "neg"
dat.df$dcontrol[dat.df$odate >= as.Date("1/1/03", format = "%d/%m/%y") & 
   dat.df$odate <= as.Date("1/6/03", format = "%d/%m/%y")] <- "pos"
head(dat.df)
#>        odate ncas dcontrol
#> 1 2000-01-01   67      neg
#> 2 2000-02-01   86      neg
#> 3 2000-03-01   66      neg
#> 4 2000-04-01   88      neg
#> 5 2000-05-01   11      neg
#> 6 2000-06-01    4      neg

Generate an epidemic curve. Note weight = ncas in the aesthetics argument for ggplot2:

ggplot(dat.df, aes(x = odate, weight = ncas, fill = factor(dcontrol))) +
  geom_histogram(binwidth = 60, colour = "gray", size = 0.1) +
  scale_x_date(breaks = date_breaks("6 months"), labels = date_format("%b %Y"), 
     name = "Date") +
  scale_y_continuous(limits = c(0, 200), name = "Number of cases") +
  scale_fill_manual(values = c("#2f4f4f", "red")) + 
  guides(fill = "none") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Frequency histogram showing counts of incident cases of disease as a function of time, 1 January 2000 to 1 January 2005. Colours indicate the presence or absence of disease control measures.
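Setting weight = ncas means that each bar shows the sum of ncas for the dates falling in that bin, rather than the number of rows. The same idea can be sketched in base R by summing case counts within each time interval. The dates and counts below are hypothetical, and calendar quarters stand in for the 60-day bins used in the plot.

```r
# Hypothetical monthly case counts (illustrative values only)
odate <- seq(from = as.Date("2000-01-01"), to = as.Date("2000-06-01"), by = "1 month")
ncas <- c(5, 0, 12, 7, 3, 9)

# Label each date with its year and calendar quarter
qtr <- paste(format(odate, "%Y"), quarters(odate))

# Sum the case counts within each quarter
ncas.qtr <- tapply(ncas, qtr, sum)
ncas.qtr
```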

Place

Two types of maps are often used when describing patterns of disease by place:

  1. Choropleth maps. Choropleth mapping involves producing a summary statistic of the outcome of interest (e.g. count of disease events, prevalence, incidence) for each component area within a study region. A map is created by ‘filling’ (i.e. colouring) each component area with colour, providing an indication of the magnitude of the variable of interest and how it varies geographically.

  2. Point maps.

Choropleth maps

For illustration we make a choropleth map of sudden infant death syndrome (SIDS) deaths in North Carolina counties for 1974 using the nc.sids data provided with the spData package.

library(spData); library(rgeos); library(rgdal); library(plyr); library(RColorBrewer); library(spatstat)

ncsids.shp <- readOGR(system.file("shapes/sids.shp", package = "spData")[1])
#> OGR data source with driver: ESRI Shapefile 
#> Source: "C:\Program Files\R\R-4.0.5\library\spData\shapes\sids.shp", layer: "sids"
#> with 100 features
#> It has 22 fields
ncsids.shp@data <- ncsids.shp@data[,c("BIR74","SID74")]
head(ncsids.shp@data)
#>   BIR74 SID74
#> 0  1091     1
#> 1   487     0
#> 2  3188     5
#> 3   508     1
#> 4  1421     9
#> 5  1452     7

The ncsids.shp SpatialPolygonsDataFrame lists, for each county in North Carolina, USA, the number of SIDS deaths for 1974.
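Raw counts depend partly on how many births occurred in each county, so a choropleth map is often drawn using a risk rather than a count. The sketch below calculates SIDS deaths per 1000 births for the six counties listed in the output above. It assumes that BIR74 is the number of live births corresponding to the SID74 count, which is an interpretation of the column names and should be checked against the data set documentation. The object name sids.df is introduced here for illustration.

```r
# First six rows of the attribute table, copied from the output above
sids.df <- data.frame(BIR74 = c(1091, 487, 3188, 508, 1421, 1452),
   SID74 = c(1, 0, 5, 1, 9, 7))

# SIDS deaths per 1000 births (assumes BIR74 is the matching count of live births)
sids.df$risk <- sids.df$SID74 / sids.df$BIR74 * 1000
round(sids.df, digits = 2)
```

A variable calculated in this way could be used for the fill aesthetic in place of the raw count.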

Prepare the SpatialPolygonsDataFrame by creating a 1 to n identifier called id. We then fortify the SpatialPolygonsDataFrame to allow it to be used with ggplot2. Finally, join the attribute data from ncsids.shp to the fortified data frame ncsids.df, using variable id as the key:

ncsids.shp$id <- 1:nrow(ncsids.shp@data)
ncsids.df <- fortify(ncsids.shp, region = "id")
ncsids.df <- join(x = ncsids.df, y = ncsids.shp@data, by = "id")

Choropleth map of the counties of North Carolina showing SIDS counts for 1974:

ggplot(data = ncsids.df) + 
  theme_bw() +
  geom_polygon(aes(x = long, y = lat, group = group, fill = SID74)) + 
  geom_path(aes(x = long, y = lat, group = group), colour = "grey", size = 0.25) +
  scale_fill_gradientn(limits = c(0, 60), colours = brewer.pal(n = 5, "Reds"), 
     guide = "colourbar") +
  scale_x_continuous(name = "Longitude") +
  scale_y_continuous(name = "Latitude") +
  labs(fill = "SIDS 1974") +
  coord_map()
Map of North Carolina, USA showing the number of sudden infant death syndrome cases, by county for 1974.

Point maps

For this example we will use the epi.incin data set included with epiR. Between 1972 and 1980 an industrial waste incinerator operated at a site about 2 kilometres southwest of the town of Coppull in Lancashire, England. Addressing community concerns that there were greater than expected numbers of laryngeal cancer cases in close proximity to the incinerator, Diggle (1990) conducted a study investigating risks for laryngeal cancer, using recorded cases of lung cancer as controls. The study area is 20 km x 20 km in size and includes the location of residence of patients diagnosed with each cancer type from 1974 to 1983.

Load the epi.incin data set and create negative and positive labels for each point location. We don’t have a boundary map for these data so we’ll use spatstat to create a convex hull around the points and dilate the convex hull by 1000 metres as a proxy boundary. Create an observation window for the data as dat.w and a ppp object for plotting:

data(epi.incin); dat.df <- epi.incin
dat.df$status <- factor(dat.df$status, levels = c(0,1), labels = c("Neg", "Pos"))
names(dat.df)[3] <- "Status"

dat.w <- convexhull.xy(x = dat.df[,1], y = dat.df[,2])
dat.w <- dilation(dat.w, r = 1000)
dat.ppp <- ppp(x = dat.df[,1], y = dat.df[,2], marks = factor(dat.df[,3]), window = dat.w)
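For readers unfamiliar with convex hulls, the sketch below shows the idea using chull from base R (package grDevices) and a handful of hypothetical point locations. It covers only the hull step; the dilation step is not reproduced here.

```r
# Hypothetical point locations in metres: four corners and one interior point
pts <- data.frame(x = c(0, 1000, 1000, 0, 500), y = c(0, 0, 1000, 1000, 500))

# Indices of the points that lie on the convex hull
hull.id <- chull(x = pts$x, y = pts$y)

# Repeat the first vertex at the end to close the polygon
hull.xy <- pts[c(hull.id, hull.id[1]),]
hull.xy
```

The interior point at (500, 500) is not part of the hull, so only the four corner points define the boundary.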

Create a SpatialPolygonsDataFrame from dat.w:

coords <- matrix(c(dat.w$bdry[[1]]$x, dat.w$bdry[[1]]$y), ncol = 2, byrow = FALSE)
pol <- Polygon(coords, hole = FALSE)
pol <- Polygons(list(pol),1)
pol <- SpatialPolygons(list(pol))
pol.spdf <- SpatialPolygonsDataFrame(Sr = pol, data = data.frame(id = 1), match.ID = TRUE)
pol.map <- fortify(pol.spdf)
#> Regions defined for each Polygons

Plot the data as a point map:

ggplot() +
  geom_point(data = dat.df, aes(x = xcoord, y = ycoord, colour = Status, shape = Status)) +
  geom_polygon(data = pol.map, aes(x = long, y = lat, group = group), col = "black", 
     fill = "transparent") + 
  scale_colour_manual(values = c("blue", "red")) +
  scale_shape_manual(values = c(1,16)) +
  labs(x = "Easting (m)", y = "Northing (m)", fill = "Status") +
  coord_equal() + 
  theme_bw()
Point map showing the place of residence of individuals diagnosed with laryngeal cancer (Pos) and lung cancer (Neg), Coppull, Lancashire, UK, 1974 to 1983.

References

Centers for Disease Control and Prevention. 2006. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, Georgia: Centers for Disease Control and Prevention.

Diggle, PJ. 1990. “A point process modeling approach to raised incidence of a rare phenomenon in the vicinity of a prespecified point.” Journal of the Royal Statistical Society Series A 153: 349–62.

Feychting, M, B Osterlund, and A Ahlbom. 1998. “Reduced cancer incidence among the blind.” Epidemiology 9: 490–94.