Finding trends in US national park visits

Sep 16, 2019

TidyTuesday / line graphs / tutorial

In this post, I analyze the National Park Visits dataset from TidyTuesday, a project that shares a new dataset each week to give R users a way to apply and practice their skills. This week’s data is about visitor numbers for US National Parks, going way back to 1904, when there were only six national parks.

I’ve never been to a US national park, but I know about some of the famous ones like Yosemite and Yellowstone. Maybe this dataset will give me some ideas for lesser-known parks to visit!

Setup

First, we’ll load the tidyverse and, because I cheated and looked ahead, a few other packages:

lubridate to work with dates
ggrepel to add text labels to graphs that won’t overlap
ggsci for interesting colour palettes

In addition to visitor figures, TidyTuesday has state population data. This will be useful if we want to look at the relationship between park visitors and population (spoiler: we do exactly this).

library(tidyverse)
library(lubridate)
library(ggrepel)
library(ggsci)

theme_set(theme_light())

park_visits_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/national_parks.csv")
state_pop <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-09-17/state_pop.csv")

Data inspection

Let’s take a look.

park_visits_raw %>%
  glimpse()

## Rows: 21,560
## Columns: 12
## $ year              <chr> "1904", "1941", "1961", "1935", "1982", "1919", "...
## $ gnis_id           <chr> "1163670", "1531834", "2055170", "1530459", "2772...
## $ geometry          <chr> "POLYGON", "MULTIPOLYGON", "MULTIPOLYGON", "MULTI...
## $ metadata          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ number_of_records <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ parkname          <chr> "Crater Lake", "Lake Roosevelt", "Lewis and Clark...
## $ region            <chr> "PW", "PW", "PW", "PW", "PW", "NE", "IM", "NE", "...
## $ state             <chr> "OR", "WA", "WA", "WA", "CA", "ME", "TX", "MD", "...
## $ unit_code         <chr> "CRLA", "LARO", "LEWI", "OLYM", "SAMO", "ACAD", "...
## $ unit_name         <chr> "Crater Lake National Park", "Lake Roosevelt Nati...
## $ unit_type         <chr> "National Park", "National Recreation Area", "Nat...
## $ visitors          <dbl> 1500, 0, 69000, 2200, 468144, 64000, 448000, 7387...

We have four types of data:

Park metadata like name, official name (unit_name), and type of park
Visitors by year
Mapping data about shapefile shapes and IDs
Data export artefact: number_of_records is 1 in every row. This type of metadata appears a lot when downloading data from data portals.

Okay, what about state_pop?

state_pop %>%
  glimpse()

## Rows: 5,916
## Columns: 3
## $ year  <dbl> 1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1...
## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "...
## $ pop   <dbl> 1830000, 1907000, 1935000, 1957000, 1978000, 2012000, 2045000...

That’s as straightforward as it gets. Population by year for each state.

I want to know, How popular have national parks been over time? I would guess that it has increased since 1904, as the park system expanded and it became easier and less expensive to visit them.

Data cleaning

I’m going to do four cleaning steps:

Ditch fields won’t be useful in answering our research question.
Rename unit_code to park_code so that it’s more intuitive.
year has a row for “Totals”, so we will get rid of that. (If we need totals, we can calculate it ourselves.)
Create long (more readable) version region codes to full names, based on National Park Service (NPS) Regional Boundaries.

park_visits <- park_visits_raw %>%
  # 1. Ditch GIS fields and metadata artefact
  select(-number_of_records, -gnis_id, -geometry, -metadata) %>%
  # 2. Rename park ID field
  rename(park_code = unit_code) %>%
  # 3. Get rid of aggregate row
  filter(year != "Total") %>%
  # Convert year to number (was character because of "Total")
  mutate(year = as.numeric(year),
  # 4. Create long version of region (note the backticks for values with spaces)
         region_long = fct_recode(region,
                                  Alaska = "AK",
                                  Midwest = "MW",
                                  `Pacific West` = "PW",
                                  Intermountain = "IM",
                                  Northeast = "NE",
                                  Southeast = "SE",
                                  `National Capital` = "NC"))

We’ve got cleaner, more workable data now. But there is still an issue I’d like to address. There is a region code that wasn’t in the NPS legend: NT

park_visits %>%
  filter(region == "NT") %>%
  distinct(unit_name, unit_type, region)

## # A tibble: 1 x 3
##   region unit_name          unit_type
##   <chr>  <chr>              <chr>    
## 1 NT     Blue Ridge Parkway Parkway

Blue Ridge Parkway. According to Blue Ridge’s Wikipedia page, it is a parkway (duh) that runs through Virginia and North Carolina, which are in different NPS regions. (Virginia is in “Northeast” and North Carolina is in “Southeast”.) “NT” is probably a special classification or a code that means “not classified”.

This discovery makes me question whether to include any parkways in our analysis. When I think of visiting national parks, I think of people getting out of their cars, going into visitors’ centres, and hiking trails. Are there any other parkways in our data?

park_visits %>%
  filter(str_detect(unit_type, "Parkway")) %>%
  distinct(unit_name, unit_type, region)

## # A tibble: 4 x 3
##   region unit_name                                  unit_type       
##   <chr>  <chr>                                      <chr>           
## 1 NT     Blue Ridge Parkway                         Parkway         
## 2 IM     John D. Rockefeller, Jr., Memorial Parkway National Parkway
## 3 NC     George Washington Memorial Parkway         National Parkway
## 4 SE     Natchez Trace Parkway                      Parkway

Yes, three more:

I’m going to exclude these because, even though they are administered by the NPS, “visiting” a parkway doesn’t fit with how I think of visiting a national park. (You can make the argument that I’m wrong to exclude them or that I should exclude more than just parkways based on my logic. I’m doing a quick exploration, so I’m comfortable making this decision based my assumption. But if I were analyzing this dataset in detail, I would do more research before making a decision to exclude data.)

park_visits <- park_visits %>%
  filter(!str_detect(unit_type, "Parkway"))

Visualization

Total visits

Our question is, How popular have national parks been over time? Let’s start with total visitors. But before we, do, we need to talk about US population. According to our state_pop dataset, the population was 82 million in 1904 (the earliest year we have park visit data) and 321 million in 2015 (the latest year we have population data).

state_pop %>%
  filter(year == 1904 | year == 2015) %>%
  group_by(year) %>%
  summarise(pop_millions = sum(pop, na.rm = TRUE) / 1e6) %>%
  ggplot(aes(as.factor(year), pop_millions)) +
  geom_col(fill = "royal blue") +
  scale_y_continuous(breaks = seq(0,300, 50),
                     labels = paste0(seq(0, 300, 50), "M")) +
  labs(title = "US Population in millions, 1900-2015",
       x = "",
       y = "") +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor.y = element_blank())

## `summarise()` ungrouping output (override with `.groups` argument)

I’m interested in the popularity of national parks, not just their raw visitor numbers, so it wouldn’t be fair to compare 1904 visits to 2015 visits. There were way more people in 2015 to do the visiting! So I’m going to scale visitors by US population (e.g., “X visits per 1,000 US population”). Scaling this way doesn’t account for non-US visitors, but it should be approximate enough to account for the fact that fewer people were walking around in 1904 than 2015.

total_visits <- state_pop %>%
  group_by(year) %>%
  summarise(total_population = sum(pop, na.rm = TRUE)) %>%
  # Filters out years that aren't in park_visits data (pre-1904)
  right_join(park_visits, by = "year") %>%
  group_by(year) %>%
  summarise(total_visitors = sum(visitors),
            # Already calculated and identical, so taking mean gives total population
            total_population = mean(total_population),
            visits_per_thousand = total_visitors / (total_population / 1000)) %>%
  # Remove years that aren't state_pop dataset (post-2015)
  filter(!is.na(total_population))

## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)

total_visits %>%
  ggplot(aes(year, visits_per_thousand)) +
  geom_hline(yintercept = 1000, lty = 2) +
  geom_line(col = "dark green", size = 1.5) +
  geom_area(fill = "dark green", alpha = 0.25) +
  scale_x_continuous(breaks = seq(1900, 2010, 10)) +
  scale_y_continuous(breaks = seq(0, 1200, 200)) +
  labs(title = "How popular have national parks been?",
       subtitle = "1904-2015. Dashed line represents an equal number of visitors and population",
       x = "",
       y = "Visitors per thousand US population")

(I saw FiveThirtyEight’s graph and liked it so much I stole the aesthetics. They didn’t scale by population, though!)

Visits in the early 20th century were low (real low!), with a notable dip during the US involvement in World War II. Then visits increase until they peak in the mid 1980s. After that, there is a steady-ish decline with a possible uptick in the mid-2010s.

Visits by region

How does this time trend look when broken down by region? I’m going to do something similar to what we did for total visits, with one difference – aggregate visitors per thousand into five-year bins. Graphing annually gives spiky, noisy lines. I’m more interested in the overall trend, so I’m going to sacrifice detail and accuracy to get a graph that’s easier to read.

visits_by_region <- state_pop %>%
  group_by(year) %>%
  summarise(total_population = sum(pop, na.rm = TRUE)) %>%
  right_join(park_visits, by = "year") %>%
  # Truncated division to floor the year into 5-year bins
  mutate(year = year %/% 5 * 5) %>%
  group_by(year, region_long) %>%
  summarise(total_visitors = sum(visitors),
            total_population = mean(total_population),
            visits_per_thousand = mean(total_visitors / (total_population / 1000))) %>%
  filter(!is.na(total_population)) %>%
  ungroup()

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` regrouping output by 'year' (override with `.groups` argument)

visits_by_region %>%
  # Add region labels for the max year
  mutate(region_label = ifelse(year == max(year), as.character(region_long), NA_character_),
         region_label = fct_reorder(region_label, visits_per_thousand, last, order_by = year)) %>%
  ggplot(aes(year, visits_per_thousand, col = region_long)) +
  geom_line(size = 1) +
  # From ggrepel package to keep labels from overlapping
  geom_label_repel(aes(label = region_label), nudge_x = 10, na.rm = TRUE) +
  # Colour palette from ggsci package
  scale_color_jco() +
  scale_x_continuous(breaks = seq(1900, 2010, 10)) +
  # Gives space on right side of graph for labels
  expand_limits(x = 2030) +
  labs(title = "How has the popularity of national park regions changed?",
       subtitle = "By number of visits per thousand US residents. Aggregated into 5-year bins.",
       x = "",
       y = "Visitors per thousand US population") +
  theme(legend.position = "none",
        panel.grid.minor = element_blank())

When we break down by region, we see a few things:

Alaska: few visits and a gradual, smooth increase since the 1960s
Midwest: middling visits and smooth increaseses and decreases
All other regions: many visits and some spiky-ness, with peaks in the late 1980s or early 1990s

The spike in visits in the late 1980s is particularly stark. After some cursory research, I can’t figure out why. An important anniversary? A period of low (or no) entry fees? If you know or have any ideas, tweet me.

Conclusion

If you liked this, check out the #TidyTuesday hashtag on Twitter. People make truly wonderful contributions every week. Even better would be to participate in TidyTuesday. The R community is tremendously positive and supportive.