Heat mapping the timing of Philadelphia parking tickets

Dec 5, 2019

Tidy Tuesday / heat maps / tutorial

In this post, I create heat maps using the Philly Parking Tickets dataset from TidyTuesday, a project that shares a new dataset each week to give R users a way to apply and practice their skills.

Specifically, we’ll cover:

Cleaning and aggregating the data that will go into our heat map
Creating a basic heat map with ggplot2 defaults
Tweaking ggplot2 theme components to get a much prettier heat map

Setup

We’ll load a few packages in addition to the tidyverse:

lubridate to work with dates and times
extrafont to change the font on our graphs
scales to easily change number formats (e.g., 0.32 becomes 32%)
viridis as a nice alternative to default ggplot2 colours

library(tidyverse)
library(lubridate)
library(extrafont)
library(scales)
library(viridis)

theme_set(theme_light())

tickets_raw <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-12-03/tickets.csv")

Data inspection

Let’s get into these parking tickets.

tickets_raw %>%
  glimpse()

## Rows: 1,260,891
## Columns: 7
## $ violation_desc <chr> "BUS ONLY ZONE", "STOPPING PROHIBITED", "OVER TIME L...
## $ issue_datetime <dttm> 2017-12-06 12:29:00, 2017-10-16 18:03:00, 2017-11-0...
## $ fine           <dbl> 51, 51, 26, 26, 76, 51, 36, 36, 76, 26, 26, 301, 36,...
## $ issuing_agency <chr> "PPA", "PPA", "PPA", "PPA", "PPA", "POLICE", "PPA", ...
## $ lat            <dbl> 40.03550, 40.02571, 40.02579, 40.02590, 39.95617, 40...
## $ lon            <dbl> -75.08111, -75.22249, -75.22256, -75.22271, -75.1660...
## $ zip_code       <dbl> 19149, 19127, 19127, 19127, 19102, NA, NA, 19106, 19...

The unit of observation is a parking ticket, and we have over 1.2 million of them. I also see three categories of data:

Ticket: the basics of the ticket, like what it was for, the fine amount, and who issued it.
Time: when it was issued. This dataset has tickets for 2017 only.
Location: where it was issued.

Which are the most common violations?

tickets_raw %>%
  count(violation_desc, sort = TRUE) %>%
  head(20) %>%
  mutate(violation_desc = fct_reorder(violation_desc, n)) %>%
  ggplot(aes(violation_desc, n, fill = violation_desc)) +
  geom_col() +
  scale_fill_viridis(discrete = TRUE, direction = -1) +
  scale_y_continuous(labels = comma_format()) +
  coord_flip() +
  theme(legend.position = "none")

“METER EXPIRED” and “METER EXPIRED CC” are the two most common. Is there any difference between them? Other violations seem the same, too, except for that “CC” at the end. Let’s use a cool trick to look at a couple of them: if we group by violation_desc, we can then use the sample_n function to get random observations from each group. It’s handy for spot-checking or investigating weird values like these.

# Set seed for reproducible sampling
set.seed(24601)

tickets_raw %>%
  # Look at two violation descriptions
  # Note the regex on PROHIBITED to capture slightly different spellings
  filter(str_detect(violation_desc, "METER EXPIRED|PARKING PROHI?BITED")) %>%
  group_by(violation_desc) %>%
  sample_n(2)

## # A tibble: 10 x 7
## # Groups:   violation_desc [5]
##    violation_desc  issue_datetime       fine issuing_agency   lat   lon zip_code
##    <chr>           <dttm>              <dbl> <chr>          <dbl> <dbl>    <dbl>
##  1 METER EXPIRED   2017-05-27 08:13:00    26 PPA             40.0 -75.1    19124
##  2 METER EXPIRED   2017-02-10 15:45:00    26 PPA             40.0 -75.2    19104
##  3 METER EXPIRED ~ 2017-07-20 18:45:00    36 PPA             40.0 -75.2    19107
##  4 METER EXPIRED ~ 2017-08-09 18:28:00    36 PPA             40.0 -75.2       NA
##  5 PARKING PROHBI~ 2017-10-06 11:59:00    41 PPA             40.0 -75.2       NA
##  6 PARKING PROHBI~ 2017-09-19 11:33:00    41 PPA             40.0 -75.2    19130
##  7 PARKING PROHBI~ 2017-05-04 11:40:00    51 PPA             39.9 -75.2    19107
##  8 PARKING PROHBI~ 2017-08-25 14:08:00    51 PPA             40.0 -75.2    19103
##  9 PARKING PROHIB~ 2017-09-07 11:50:00    31 SEPTA           40.0 -75.1    19140
## 10 PARKING PROHIB~ 2017-08-12 06:50:00    31 TEMPLE          40.0 -75.2    19140

The main difference is the fine amount, but “PARKING PROHIBITED” (with its slightly different spelling) has a different issuing_agency. Some quick research makes me think that CC stands for “City Centre”. That would jive with the higher fine amounts – higher fines for violations downtown.

Fortunately, we have location data, so we can test this hypothesis. We’ll use longitude and latitude to make a crude map of violation types (with CC vs. without CC) and see if the results are consistent with “CC” meaning “City Centre”.

tickets_raw %>%
  filter(str_detect(violation_desc, "METER EXPIRED"),
         # Exclude outlier longitude values
         lon > -75.5) %>%
  group_by(violation_desc) %>%
  # Only take 1000 observations -- more takes longer to plot
  sample_n(1e4) %>%
  ggplot(aes(lon, lat, col = violation_desc)) +
  # Shape of . is small, so it alleviates overplotting
  geom_point(shape = ".")

Bingo! That concentration of blue dots looks like a city centre to me. I think we’ve got a good enough feel for our data to decide what we want to do with it.

Research question

I’m sure everyone has parked somewhere they shouldn’t have. Whenever I’ve done that, I always worry: “Will I get away with it?” If I parked illegally late at night on a Sunday, I’d be less worried about getting a ticket than if I parked illegaly on a Tuesday afternoon. Would I be right to be less worried? Let’s visualize the relationship between time and tickets with a heat map to find out. Specifically, let’s look at day of the week and time of the day.

Data cleaning and preparation

We’re going to perform four cleaning steps:

Remove “CC” from violation_desc. We’ll consider violations the same, regardless of whether they were in the city centre or not.
Correct some short-form spellings in violation_desc. For example we’ll add the “E” back into “PASSENGR”.
Derive day-of-week and hour-of-day from issue_datetime. We need these to be the nodes of our heat map.
Ditch fine, issuing_agency, and location data. We won’t be using these.

tickets <- tickets_raw %>%
  # Remove "CC" and correct spelling
  mutate(violation_desc = str_squish(str_remove(violation_desc, " CC")),
         violation_desc = str_replace_all(violation_desc,
                                          c("PASSENGR" = "PASSENGER",
                                            "PROHBITED" = "PROHIBITED",
                                            "LOADNG" = "LOADING"))) %>%
  # Derive features -- we're setting Mon as the first day of the week
  mutate(day_of_week = wday(issue_datetime, label = TRUE, week_start = 1),
         issue_hour = hour(issue_datetime)) %>%
  select(-fine:-zip_code)

# Look at our clean data
tickets %>%
  head()

## # A tibble: 6 x 4
##   violation_desc      issue_datetime      day_of_week issue_hour
##   <chr>               <dttm>              <ord>            <int>
## 1 BUS ONLY ZONE       2017-12-06 12:29:00 Wed                 12
## 2 STOPPING PROHIBITED 2017-10-16 18:03:00 Mon                 18
## 3 OVER TIME LIMIT     2017-11-02 22:09:00 Thu                 22
## 4 OVER TIME LIMIT     2017-11-05 20:19:00 Sun                 20
## 5 STOP PROHIBITED     2017-10-17 06:58:00 Tue                  6
## 6 DOUBLE PARKED       2017-10-02 10:40:00 Mon                 10

It’s a simple dataset, but it’s all we need for a heat map.

Visualizing

We can use the geom_tile function to create heat maps. After specifying our x and y dimensions (day-of-week and hour-of-day) we need to specify what represents the “heat”. In our case, the heat is number of tickets for a given hour of a given day of the week.

tickets %>%
  count(day_of_week, issue_hour) %>%
  ggplot(aes(x = day_of_week, y = issue_hour, fill = n)) +
  geom_tile()

It’s pretty cool we can get a heat map with a few lines of code, but it looks rough and incomplete. We’re going to make a bunch of changes to get something more polished:

Add light borders to each node to distinguish them more easily
Add more frequent labels for hours
Turn the integer hour into a more recognizable time (e.g., 20 becomes 20:00)
Change the colour – lighter should mean fewer tickets and darker should mean more tickets
Make the numbers in the scale prettier
Change the aspect ratio – the default nodes are too wide
Add a descriptive title
Get rid of labels for the x and y axes – I don’t need a label to know that “Mon” and “Tue” mean day of the week
Change the fill label from “n” to something more descriptive
Change the font – you’ll need the extrafont package for anything but the most basic fonts (it requires a totally-worth-it one-time setup step)
Get rid of gridlines – we already have borders on our nodes
Get rid of the border around the heat map
Adjust the x and y axes text sizes
Get rid of axis ticks on the x-axis (we’ll keep the ones on the y-axis)

tickets_heat_overall <- tickets %>%
  count(day_of_week, issue_hour)

tickets_heat_overall %>%
  ggplot(aes(x = day_of_week, y = issue_hour, fill = n)) +
  # 1. Add light borders
  geom_tile(col = "gray90") +
  scale_y_reverse(# 2. More frequent hour labels
                  breaks = seq(0, 23, 2),
                  # 3. More recognizable hours
                  labels = paste0(seq(0, 23, 2), ":00")) +
  scale_fill_gradient(# 4. Change colour
                      low = "white", high = "#ae017e",
                      # 5. Pretty numbers
                      labels = comma_format()) +
  # 6. Better aspect ratio
  coord_fixed(ratio = 0.3) +
  labs(# 7. Descriptive title
       title = "Philadelphia parking tickets by time and day",
       caption = "Based on 2017 Philadelphia parking tickets for all violation types",
       # 8. Get rid of unnecessary axis labels
       x = "",
       y = "",
       # 9. More descriptive legend label
       fill = "Tickets Issued") +
  theme(# 10. Change font
        text = element_text(family = "Bahnschrift"),
        # 11. Eliminate gridlines
        panel.grid = element_blank(),
        # 12. Eliminate border
        panel.border = element_blank(),
        # 13. Change axis text sizes
        axis.text.x = element_text(size = 14),
        axis.text.y = element_text(size = 11),
        # 14. Eliminate x-axis ticks
        axis.ticks.x = element_blank())

That looks pretty good! We can make a few immediate observations:

Most tickets are issued on weekdays during the day
Sunday is the least-ticketed day
Some tickets are issued late at night on party nights (Thu, Fri, Sat), but not other nights

This heat map includes all violations, regardless of what type they are. What if we look at violation types individually? We can still use a heat map, but break it into small multiples to allow for easy comparison. Let’s look at the five violations that racked up the most tickets. First, though, we need to do two extra steps:

Turn implicit missing values into explicit missing values. For example, if a given violation type doesn’t have any tickets for 2:00 a.m. on a Sunday, that will show up as “NA” (missing). We want it to show up as “0” instead, otherwise our heat map won’t graph properly. We can use the complete function to do this.
Look at percent of tickets issued within each type instead of number of tickets so that everything is on the same scale. If we looked at number of tickets, we couldn’t compare across types because types with the most tickets would automatically be darker.

tickets_heat_top5 <- tickets %>%
  mutate(violation_group = str_to_title(fct_lump(violation_desc, 5))) %>%
  count(day_of_week, issue_hour, violation_group) %>%
  complete(day_of_week, issue_hour, violation_group, fill = list(n = 0)) %>%
  group_by(violation_group) %>%
  mutate(pct_of_group = n / sum(n)) %>%
  ungroup()

tickets_heat_top5 %>%
  mutate(violation_group = fct_reorder(violation_group, pct_of_group, median)) %>%
  ggplot(aes(x = day_of_week, y = issue_hour, fill = pct_of_group)) +
  geom_tile(col = "gray90") +
  scale_y_reverse(breaks = seq(0, 23, 2),
                  labels = paste0(seq(0, 23, 2), ":00"),
                  sec.axis = dup_axis()) +
  scale_fill_gradient(low = "white",
                      high = "#ae017e",
                      labels = percent_format()) +
  coord_fixed(ratio = 0.3) +
  facet_wrap(~ violation_group) +
  labs(title = "Watch where you park",
       subtitle = "Nodes represent % of tickets issued for a given type of violation",
       caption = "Based on 2017 Philadelphia parking tickets",
       x = "",
       y = "",
       fill = "") +
  theme(text = element_text(family = "Bahnschrift"),
        panel.grid = element_blank(),
        panel.border = element_blank(),
        strip.text = element_text(colour = "black", size = 11),
        strip.background = element_blank(),
        axis.text = element_text(size = 7),
        axis.ticks.x = element_blank())

I’ll let you draw your own observations about the differences between violations. (And also identify the pitfalls of drawing conclusions from these heat maps!)

Conclusion

I hope you enjoyed making heat maps and going deep into some of the ways we can tweak the ggplot2 theme to get nicer, more polished graphs. If you liked this, check out the #TidyTuesday hashtag on Twitter or, even better, participate yourself!