Data Visualisation with ggplot2

Last updated on 2026-06-30

Estimated time: 115 minutes

Overview

Questions

What are the components of a ggplot?
How can I visualize check-in patterns over time?
How can I compare check-in frequencies across locations and devices?
What are the main differences between R base plots, lattice, and ggplot?
How can I visualize location data on maps with ggplot2?

Objectives

Produce scatter plots, box plots, and bar plots using ggplot.
Create time series plots for temporal check-in data.
Set universal plot settings.
Describe what faceting is and apply faceting in ggplot.
Modify the aesthetics of an existing ggplot plot (including axis labels and color).
Build complex and customized plots from data in a tibble.
Create maps with ggplot2 to visualize location-based data.
Recognize the differences between base R, lattice, and ggplot visualizations.

This episode is a broad overview of ggplot2 and focuses on getting familiar with the layering system of ggplot2, using the argument group in the aes() function, and basic customization of the plots. We’ll show how to visualize patterns in check-in behavior across different locations and devices, and introduce mapping techniques.

We start by loading the required packages: tidyverse, here, and lubridate. As you may recall, ggplot2 is included in the tidyverse package, so we do not need to load it separately.

R

library(tidyverse)
library(here)
library(lubridate)

Next, let’s load in our data! Throughout this lesson, we will be using a sampled version of the data we created at the end of “Starting With Data”. In practice, sampling data before visualization is NOT required; however, due to the size of our original data set, using a smaller, sampled data set will allow us to generate plots much faster!

R

data <- read_csv(here("data", "checkin_sample_plotting.csv"))

Before we continue, let’s take a look at the structure and size of our data set to see what we’ll be working with in detail:

R

glimpse(data)

As you may notice, the hour exceeds 12, meaning this data is in 24-hour time! If you are unfamiliar with this format, 13 represents 1PM, 14 represents 2PM, and so on.
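
If 24-hour values are new to you, the conversion can be sketched in a few lines of base R. The vector below is made up for illustration (it is not taken from the lesson data), and the names `hours_24`, `hour_12`, `suffix`, and `labels_12` are hypothetical:

```r
# Hypothetical hour values in 24-hour time (not taken from the lesson data)
hours_24 <- c(0, 7, 12, 13, 19, 23)

# Hours 0 and 12 both display as "12" on a 12-hour clock
hour_12 <- ifelse(hours_24 %% 12 == 0, 12, hours_24 %% 12)

# Anything before noon is AM; noon and later is PM
suffix <- ifelse(hours_24 < 12, "AM", "PM")

# Combine the two pieces into readable labels
labels_12 <- paste0(hour_12, suffix)
labels_12
```

Here 13 becomes "1PM" and 19 becomes "7PM", matching the rule described above.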

Additionally, for those curious, the original data set had approximately 352k lines, which means this data set is less than 10% of the size!

Visualization Options in R

Before we start with ggplot2, it’s helpful to know that there are several ways to create visualizations in R. While ggplot2 is great for building complex and highly customizable plots, there are simpler and quicker alternatives that you might encounter or use depending on the context. Let’s briefly explore a few of them:

Base-R Plots

Base R plots are the simplest form of visualization and are great for quick, exploratory analysis. You can create plots with very little code, but customizing them can be cumbersome compared to ggplot2.

Example of a simple time series plot in base R showing the number of check-ins by hour:

R

hourly_counts <- data %>%
                 count(hour)

plot(hourly_counts$hour, hourly_counts$n,
     main = "Base R Plot: Check-Ins by Hour",
     xlab = "Hour of Day",
     ylab = "Number of Check-Ins",
     type = "l")  #'l' for line

`Lattice`

Lattice is another plotting system in R, which allows for creating multi-panel plots easily. It’s different from ggplot2 because you define the entire plot in a single function call, and modifications after plotting are limited.

Example of a lattice plot showing check-ins by device for different locations:

R

library(lattice)

R

#grabs specific locations (so the graph isn't giant) and converts locations + devices to factors
checkins_lattice <- data %>%
                    filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003")) %>%
                    #we're removing "DEVICE_" because it causes overlap within the plot
                    #if you're curious, remove this line and regenerate the plot!
                    mutate(device = str_remove(device, "DEVICE_")) %>%
                    mutate(
                      device = as.factor(device),
                      location = as.factor(location)
                    )

#creates a lattice boxplot (bwplot)
bwplot(hour ~ device | location, data = checkins_lattice,
       main = "Lattice Plot: Check-in Hour Distribution by Device and Location",
       xlab = "Device",
       ylab = "Hour of Check-in",
       layout = c(length(unique(checkins_lattice$location)), 1), #adjusts layout for multiple locations
       strip = strip.custom(bg="lightgrey"),
       scales = list(y = list(at = 0:24)), #adds all hours on y, not just even numbers
       panel = function(x, y, ...) {
         panel.bwplot(x, y, ...)
       })

Plotting with `ggplot2`

ggplot2 is a plotting package that makes creating complex plots from data stored in a tibble simpler. It provides a programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. As a result, if the underlying data changes or if we decide to switch from a bar plot to a scatter plot, we only have to make minimal adjustments to the code!

ggplot2 functions work best with data in the ‘long’ format. As you may recall from “Data Wrangling with tidyr”, this consists of a column for every dimension, and a row for every observation. Ensuring you use well-structured data will save you lots of time when making figures with ggplot2.
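
To make the ‘long’ format concrete, here is a minimal sketch that reshapes a small, made-up ‘wide’ table (one column per device) into long form and then plots it. The table `wide_counts` and its values are invented for illustration; `pivot_longer()` and `starts_with()` come from tidyr:

```r
library(tidyr)
library(ggplot2)

# Hypothetical "wide" table: one column per device (made-up counts)
wide_counts <- data.frame(
  hour = c(8, 9, 10),
  DEVICE_001 = c(12, 30, 25),
  DEVICE_002 = c(7, 18, 22)
)

# Reshape to "long": one row per hour-device observation
long_counts <- pivot_longer(
  wide_counts,
  cols = starts_with("DEVICE_"),
  names_to = "device",
  values_to = "n"
)

# In long form, the device column can be mapped straight to an aesthetic
ggplot(long_counts, aes(x = hour, y = n, color = device)) +
  geom_line()
```

In the wide table there is no single column to hand to `color`, which is why the long layout tends to be easier to plot.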

ggplot2 graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

Each chart built with ggplot2 must include the following:
- Data
- Aesthetic mapping (aes)
  - Describes how variables are mapped onto graphical attributes
  - Visual attributes of the data, including x-y axes, color, fill, shape, and alpha
- Geometric objects (geom)
  - Determines how values are rendered graphically, as bars (geom_bar), scatterplot (geom_point), line (geom_line), etc.

Thus, the template for a graphic in ggplot2 is:

<DATA> %>%
    ggplot(aes(<MAPPINGS>)) +
    <GEOM_FUNCTION>()

Remember that the pipe operator %>% places the result of the previous line(s) into the first argument of the function. The ggplot function expects a data frame to be the first argument, which allows us to change from specifying the data = argument within the ggplot function to instead piping the data into the function.
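
As a small illustration of this equivalence, the sketch below builds the same plot twice, once by piping the data in and once by naming the `data =` argument. The data frame `toy` is made up for the example, and `%>%` is loaded here from magrittr (it is also available after `library(tidyverse)`):

```r
library(ggplot2)
library(magrittr)  # provides %>%

# Hypothetical hourly counts (made-up values)
toy <- data.frame(hour = c(8, 9, 10), n = c(12, 30, 25))

# Piped form: toy becomes the first argument of ggplot()
p_piped <- toy %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line()

# Explicit form: the same data passed through the data = argument
p_explicit <- ggplot(data = toy, aes(x = hour, y = n)) +
  geom_line()
```

Both objects should draw the same line; which form to use is largely a matter of style.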

To create a chart with ggplot2, follow the steps below:

Use the ggplot() function and bind the plot to a specific tibble.

R

data %>%
  ggplot()

Using the aesthetic (aes) function, define your mapping by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

R

data %>%
  ggplot(aes(x = precinct))

Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
- geom_bar() for counting observations in categories
- geom_histogram() for showing distributions
- geom_boxplot() for statistical summaries
- geom_line() for trend lines, time series, etc.

To add a geom to the plot use the + operator. Let’s start by creating a bar chart showing the distribution of check-ins across precincts:

R

data %>%
  ggplot(aes(x = precinct)) +
  geom_bar()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots! Using this idea, the above plot can also be generated with code like this, similar to the “intermediate steps” approach:

R

#assign the plot to a variable
checkins_plot <- data %>%
                 ggplot(aes(x = precinct))

#draw the plot as a bar plot
checkins_plot +
  geom_bar()

Callout

Notes

Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
You can also specify mappings for a given geom independently of the mapping defined globally in the ggplot() function.
The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

R

## This is the correct syntax for adding layers
checkins_plot +
  geom_bar()

## This will not add the new layer and will return an error message
checkins_plot
+ geom_bar()
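
The first two notes can be sketched with a small, made-up data frame. Here `x`, `y`, and `group` are set globally in `ggplot()`, so both layers see them, while `color` is mapped only inside `geom_point()`. The data frame `toy` and its values are invented for illustration:

```r
library(ggplot2)

# Hypothetical hourly counts for two devices (made-up values)
toy <- data.frame(
  hour = c(8, 9, 10, 8, 9, 10),
  n = c(12, 30, 25, 7, 18, 22),
  device = rep(c("001", "002"), each = 3)
)

p <- ggplot(toy, aes(x = hour, y = n, group = device)) + # global: seen by every layer
  geom_line() +                                           # inherits x, y, and group
  geom_point(aes(color = device))                         # color is mapped for points only

p
```

In this sketch the lines keep a single default color, one line per device because of `group`, while the points are colored by device.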

Building Your Plots Iteratively

Building plots with ggplot2 is typically an iterative process. We start by defining the data set we’ll use, lay out the axes, and choose a geom.

Let’s re-create the time-series plot we made for the Base-R demonstration:

R

#using the hourly_counts we created, generate a time-series plot
hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() #creates a line plot using the x and y from the ggplot above!

Now that we have a baseline plot to start from, we can start modifying it to extract additional information! For instance, when inspecting the plot, we can notice that it’s a bit difficult to tell at first glance where each hour sits on the line.

To resolve this, we will add points to the line to clearly indicate each hour:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point()

Next, we will add colors for all of the points by specifying a color argument inside the geom_point function:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(color = "blue")

To color each point in the plot differently, you could use a vector as an input to the color argument; however, because we are now mapping features of the data to a color, instead of setting one color for all points, the color of the points now needs to be set inside a call to the aes function. When we map a variable in our data to the color of the points, ggplot2 will provide a different color corresponding to the different values of the variable.
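
The difference between setting a color and mapping a color can be sketched side by side. The data frame `toy` is made up, and the third plot shows a common slip rather than a recommended pattern:

```r
library(ggplot2)

# Hypothetical hourly counts (made-up values)
toy <- data.frame(hour = c(8, 9, 10), n = c(12, 30, 25))

# Setting: a fixed color, written outside aes(); no legend is drawn
p_set <- ggplot(toy, aes(x = hour, y = n)) +
  geom_point(color = "blue")

# Mapping: color depends on the data, written inside aes(); a legend is drawn
p_map <- ggplot(toy, aes(x = hour, y = n)) +
  geom_point(aes(color = n > 20))

# Common slip: a constant inside aes() is treated as data, so the points
# are NOT drawn in blue and a legend with the single entry "blue" appears
p_slip <- ggplot(toy, aes(x = hour, y = n)) +
  geom_point(aes(color = "blue"))
```

If a plot shows an unexpected color together with a one-entry legend, checking whether a constant ended up inside `aes()` is a reasonable first step.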

Let’s apply this to our plot below, changing the color of each point based on the hour:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = hour))

Unfortunately, this doesn’t tell us much about our data, just that each point represents a different hour (which we already knew!). Additionally, you may notice that after adding conditional coloring using aes(), ggplot automatically added a legend to explain what the different colors represent/mean!

Now, instead of coloring each point based on one of the variables we already have, we’re going to calculate the average hourly count and set the point to green if the count at that hour is above average and red if the count at that hour is below average!

To do this, we will calculate the average hourly count and, using mutate, add a column to our hourly_counts tibble that indicates whether the count at that hour is above or below the calculated average! Then, we will use the scale_color_manual function to manually color these points green and red instead of the default (which, when writing this lesson, was red and blue, respectively).

R

#calculate average
average <- mean(hourly_counts$n)

#plot
hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color)) + #colors the points
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors

Additionally, you may want to increase the size of the points! This can be accomplished using the size argument within the geom_point function, as seen below:

R

#calculate average
average <- mean(hourly_counts$n)

#plot
hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) + #colors the points
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors

At this point, our plot is mostly completed! The only remaining issue is the lack of proper titling and labeling.

By default, the axis labels on a plot are determined by the name of the variable being plotted. However, ggplot2 offers lots of customization options, like specifying the axis labels and adding a title to the plot, with relatively few lines of code. We will add more informative x- and y-axis labels to our plot, a more explanatory label to the legend, and a plot title.

The labs function takes the following arguments:

title – to produce a plot title
subtitle – to produce a plot subtitle (smaller text placed beneath the title)
caption – a caption for the plot
... – any pair of name and value for aesthetics used in the plot (e.g., x, y, fill, color, size)

R

hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) +
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
  labs(title = "Check-In Count per Hour",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Relation to Average")

Our final step will be to improve the x-axis to include all hours, not just 10, 15, and 20! This can be achieved using the scale_x_continuous function.

The scale_x_continuous function is used to customize the x-axis when the x-axis is numeric (or continuous!). Within this function, you can control the axis limit (or range) and breaks (where tick marks appear).
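
As a minimal sketch of both arguments together, the example below restricts the x-axis to hours 6 through 20 and places a tick every two hours. The data frame `toy` and the helper name `hour_breaks` are made up for illustration:

```r
library(ggplot2)

# Hypothetical hourly counts for hours 6 to 20 (made-up values)
toy <- data.frame(
  hour = 6:20,
  n = c(3, 9, 14, 20, 26, 22, 18, 17, 15, 16, 19, 21, 12, 6, 2)
)

# Tick marks at every second hour
hour_breaks <- seq(6, 20, by = 2)

p <- ggplot(toy, aes(x = hour, y = n)) +
  geom_line() +
  scale_x_continuous(limits = c(6, 20), breaks = hour_breaks)

p
```

One thing to keep in mind: values that fall outside `limits` are dropped from the plot (with a warning), so limits are best chosen to cover all of the data you want to show.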

Let’s finish our plot using this function:

R

hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) +
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
  labs(title = "Check-In Count per Hour",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Relation to Average") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))

While the plot above gives information on the number of check-ins across all locations, we may want information unique to individual locations instead. To achieve this, we can calculate the number of check-ins every hour at each location and add a line for each of the first five locations below:

R

#calculate check-ins per hour for each location
hourly_count <- data %>%
  count(location, hour)

#plot multiple lines, changing the color for each
hourly_count %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>% 
  ggplot(aes(x = hour, y = n, color = location)) + #Note: putting color in ggplot applies to all layers (geom_line AND geom_point)!
  geom_line(linewidth = 1) + #linewidth replaces size for lines as of ggplot2 3.4.0
  geom_point(size = 3) +
  labs(title = "Hourly Check-In Count by Location",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Location") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))

As you can see, LOCATION_003 is very popular at 10AM (and may benefit from additional support from employees/volunteers), whereas LOCATION_002 dies down after 11AM.

Boxplot

We can use box plots to visualize the distribution of check-in times for specific locations:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black")

As you may notice, it’s a bit difficult to understand this plot at first glance! To resolve this, let’s begin by adding all of the hours on the y-axis using the scale_y_continuous function! This function behaves exactly the same as the scale_x_continuous function, but it applies to the y-axis instead:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

By adding points to a box plot, we can have a better idea of the number of measurements and of their distribution:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_point(color = "tomato") +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

Looking at this plot, from a rough estimate, it looks like there are far fewer dots on the plot than there are rows in our tibble. This should lead us to believe that there may be multiple observations plotted on top of each other (e.g. three observations where hour is 12 and location is LOCATION_001). This is known as “overplotting” and occurs when multiple data points share the same x and y coordinates.

There are two main ways to alleviate overplotting issues:
1. changing the transparency of the points
2. jittering the location of the points

Let’s first explore option 1, changing the transparency of the points. By “transparency”, we mean how easily you can see through a point. We can control the transparency of the points with the alpha argument! Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors (an alpha of 1 is the default value). Specifically, an alpha of 0.1 would make a point one-tenth as opaque as a normal point. Stated differently, ten points stacked on top of each other would appear roughly as dark as a single normal point.

With that being said, we’re going to change the alpha to 0.5 in an attempt to help fix the overplotting. As you may quickly notice, the overplotting is not solved, but adding transparency begins to address this problem, as the points where there are more overlapping observations are darker (as opposed to lighter red):

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_point(color = "tomato", alpha = 0.5) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

Since that only helped a little bit with the overplotting problem, let’s try option two and jitter the points on the plot, allowing us to see each point. Jittering works by introducing a little bit of randomness into the position of our points. You can think of this process as taking the overplotted graph and giving it a tiny shake! The points will move a little bit side-to-side and up-and-down, but their position in comparison to the original plot won’t dramatically change.

Note that this solution is only suitable for plotting integer figures! For numeric figures with decimals, geom_jitter() becomes inappropriate because it obscures the true value of the observation.

We can jitter our points using the geom_jitter() function instead of the geom_point() function, as seen below:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

As you can see, the points have been moved dramatically! Thankfully, the geom_jitter() function allows us to specify the amount of random motion in the jitter by using the width and height arguments. When we don’t specify values for width and height, geom_jitter() defaults to 40% of the resolution of the data (the smallest change that can be measured). Hence, if we would like less spread in our jitter than the default, we should pick values between 0.1 and 0.4. Experiment with the values to see how your plot changes!
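
To see what width and height actually bound, the sketch below jitters a small, made-up data set and then inspects how far the points moved. It uses `geom_point()` with `position_jitter()`, which is the position adjustment behind `geom_jitter()`; the `seed` argument is an addition for this example that makes the random shake reproducible. The data frame `toy` and the name `jittered` are hypothetical:

```r
library(ggplot2)

# Hypothetical check-ins: integer hours at two locations (made-up values)
toy <- data.frame(
  location = rep(c("001", "002"), each = 20),
  hour = rep(c(9, 10, 11, 12), times = 10)
)

p <- ggplot(toy, aes(x = location, y = hour)) +
  geom_point(
    alpha = 0.5,
    position = position_jitter(width = 0.2, height = 0.05, seed = 42)
  )

# Pull out the jittered coordinates that ggplot2 will draw
jittered <- layer_data(p)

# Largest vertical shift away from the original integer hour
max(abs(jittered$y - round(jittered$y)))
```

The largest vertical shift should stay within the chosen height, so a small height keeps every point visually attached to its true hour.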

Here, we initially chose a height of 0.05 (as too much variation in height may suggest different times at first glance) and a width of 0.2:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

For our final step, let’s add a title, appropriate labels, and improve the visuals of the plot overall! Additionally, to clean the location names on the x-axis, we’ll be using the mutate function (recall from Data Wrangling with dplyr) to remove the “LOCATION_” prefix from each name (since the axis label will indicate that these are locations!):

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>% #removes prefix
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1)) + 
  #adds labels to the plot
  labs(title = "Distribution of Check-in Times by Location",
       x = "Location",
       y = "Hour (24-hour Format)")

Challenge

Exercise

Box plots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bi-modal, we would not see it in a box plot. An alternative to the box plot is the violin plot, where the shape (of the density of points) is drawn.

Start by replacing the box plot with a violin plot; see geom_violin().

Show me the solution

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_violin(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1)) + 
  labs(title = "Distribution of Check-in Times by Location",
       x = "Location",
       y = "Hour (24-hour Format)")

So far, we’ve looked at the distribution of check-in times between locations. Next, you’re going to try making a new plot to explore the distribution of another variable between locations.

Let’s create a box plot for minute for the locations above. Overlay a jitter layer on the box plot layer to display the distributions more accurately. Feel free to select any fill, color, alpha, height, and width! Ensure a title and proper axis labels are added.

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location, y = minute)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(color = "navy", alpha = 0.5, height = 0, width = 0.2) +
  labs(title = "Distribution of Check-in Minutes by Location",
       x = "Location",
       y = "Minute of Check-in")

Lastly, color each point according to the device used! Ensure you change the name of the legend as well and remove “DEVICE_” from all device names (to ensure a clean legend).

Show me the solution

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = minute)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(aes(color = device), alpha = 1, width = 0.2, height = 0.2) +
  labs(title = "Distribution of Check-in Minutes by Location",
       x = "Location",
       y = "Minute of Check-in",
       color = "Device")

Bar Plot

Bar plots are great for visualizing categorical data, such as counting the number of check-ins per device, per location, or per precinct. By default, geom_bar accepts a variable for x, and plots the number of instances of each value of x (in this case, location) within the data set.

Let’s create a bar plot displaying check-in counts for the first five locations:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location)) +
  geom_bar() +
  labs(title = "Check-In Count by Location",
       x = "Location",
       y = "Count")

Next, let’s use the fill aesthetic for the geom_bar() geom to color bars by the device used for check-in:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location)) +
  geom_bar(aes(fill = device)) +
  labs(title = "Check-In Count by Location",
       x = "Location",
       y = "Count",
       fill = "Device")

This creates a stacked bar chart. Unfortunately, as you may notice, this is a bit difficult to read. Instead, we can separate the portions of the stacked bar that correspond to each device and put them side-by-side by using the position argument for geom_bar() and setting it to “dodge”.

Let’s apply this concept to the code below, changing the title for clarity:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location)) +
  geom_bar(aes(fill = device), position = "dodge") +
  labs(title = "Count of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Count",
       fill = "Device")

As you can see, this is much easier to read and interpret!

In some cases, we may be more interested in the proportion of each individual device at each location rather than the actual count of each device. Proportions are helpful because they account for differences in sample sizes, and instead focus on distribution within specific locations! To compare proportions, we will first create a new tibble (prop_device) with a new column named “prop”, representing the proportion of each device within each location.

R

prop_device <- data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  count(location, device) %>%
  group_by(location) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
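
A quick sanity check for this kind of calculation is that the proportions within each group add up to 1. The sketch below repeats the same steps on a tiny, made-up data frame (`toy`) so the numbers are easy to follow; `toy_prop` and `check` are hypothetical names:

```r
library(dplyr)

# Hypothetical check-ins: three at LOCATION_001, two at LOCATION_002
toy <- data.frame(
  location = c("LOCATION_001", "LOCATION_001", "LOCATION_001",
               "LOCATION_002", "LOCATION_002"),
  device = c("DEVICE_001", "DEVICE_001", "DEVICE_002",
             "DEVICE_001", "DEVICE_002")
)

# Same recipe as above: count, group by location, divide by the group total
toy_prop <- toy %>%
  count(location, device) %>%
  group_by(location) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Each location's proportions should sum to 1
check <- toy_prop %>%
  group_by(location) %>%
  summarise(total = sum(prop))

check
```

If a total differs from 1, the grouping variable passed to `group_by()` is the first thing worth checking.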

Now, we can use this new tibble to create our plot showing the proportion of each device at each location! When creating your plot, ensure you include y = prop within the initial ggplot call AND stat = "identity" to tell ggplot to use the y values instead of the count, and adjust labels/titles for clarity:

R

prop_device %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = prop)) +
  geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
  labs(title = "Proportion of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Proportion",
       fill = "Device")
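
As an aside, ggplot2 also provides `geom_col()`, which plots the y values as given and so does not need `stat = "identity"`. The sketch below, using a made-up tibble of proportions (`toy_prop`), builds the same dodged bars both ways:

```r
library(ggplot2)

# Hypothetical proportions (made-up values)
toy_prop <- data.frame(
  location = c("001", "001", "002", "002"),
  device = c("001", "002", "001", "002"),
  prop = c(0.6, 0.4, 0.5, 0.5)
)

# geom_bar() needs stat = "identity" to use the supplied y values
p_bar <- ggplot(toy_prop, aes(x = location, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge")

# geom_col() uses the supplied y values by default
p_col <- ggplot(toy_prop, aes(x = location, y = prop, fill = device)) +
  geom_col(position = "dodge")
```

Either form works; this lesson sticks with `geom_bar(stat = "identity")`.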

Looking at this graph, we can see that all of the devices (except DEVICE_012) have similar proportions (i.e., usage rates) when sample sizes are taken into consideration!

Callout

Note

If you’d prefer to visualize percentages instead of proportions, you can multiply the prop column by 100! For example:

R

prop_device <- prop_device %>%
  mutate(prop = (prop * 100))

If you adjust to percentages, however, please ensure you adjust titles and axis labels accordingly!

Challenge

Exercise

Using the information you learned above, create a bar plot showing the proportion (or percentages, if you’d like) of check-ins by hour for the first four devices (i.e., “DEVICE_001”, “DEVICE_002”, “DEVICE_003”, and “DEVICE_004”). Which hours had the highest proportion of check-ins from DEVICE_001 and DEVICE_002?

Show me the solution

R

#calculate proportions
prop_hour_device <- data %>%
  filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
  count(hour, device) %>%
  group_by(hour) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

#generate plot
prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))
  #note: you can remove 6 and 20 by using this line instead: 
  #scale_x_continuous(breaks = seq(7, 19, by = 1))

From this plot, we can identify that DEVICE_001 has the highest proportion at 7:00/7AM and DEVICE_002 has the highest proportion at 19:00/7PM.

Challenge

Exercise

Create a bar plot showing the check-in counts for the ten devices with the highest number of check-ins. Color each bar according to the device, title it appropriately, and use proper axis labels!

Show me the solution

R

#retrieve top devices
top_devices <- data %>%
  count(device) %>%
  slice_max(n, n = 10) %>% #keeps the 10 devices with the most check-ins (top_n() is superseded)
  pull(device)

#create plot
data %>%
  filter(device %in% top_devices) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = device, fill = device)) +
  geom_bar() +
  labs(title = "Top 10 Devices by Number of Check-ins",
       x = "Device",
       y = "Count") +
  theme_classic()

Faceting

Rather than creating a single plot with side-by-side bars for each device, we may want to create multiple plots, where each plot shows the data for a single device. This would be especially useful if we had a large number of devices that we had sampled (like 5 or 10), as side-by-side bars become harder to read as the number of bars increases.

ggplot2 has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the data set. Below, we can use this technique to split our bar plot of check-in proportions by hour for each device so each device has its own panel:

R

#generate plot
prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ device, scales = "free_y") #here, we specify we want to facet wrap by device
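
The `scales = "free_y"` argument used above lets each panel pick its own y-axis range. The sketch below compares it with the default (shared axes) on a small, made-up data frame in which one device has a much larger peak than the others; `toy`, `base`, `p_fixed`, and `p_free` are hypothetical names:

```r
library(ggplot2)

# Hypothetical proportions for three devices (made-up values)
toy <- data.frame(
  device = rep(c("001", "002", "003"), each = 3),
  hour = rep(c(8, 9, 10), times = 3),
  prop = c(0.10, 0.50, 0.40, 0.30, 0.30, 0.40, 0.02, 0.03, 0.95)
)

# Shared starting point for both versions
base <- ggplot(toy, aes(x = hour, y = prop)) +
  geom_col()

# Default: every panel shares the same y-axis
p_fixed <- base + facet_wrap(~ device)

# free_y: each panel gets its own y-axis range
p_free <- base + facet_wrap(~ device, scales = "free_y")
```

Free scales make small values easier to see within a panel, but bar heights are then no longer directly comparable across panels, so the axis labels need a closer look.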

You can click the “Zoom” button in your RStudio plots panel to view a larger version of this plot.

Plots with a white background usually look more readable when printed. We can set the background to white using the function theme_bw(). Additionally, we can remove the grid:

R

prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ device, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank())

We can also facet by location to see patterns of device proportions within different locations:

R

#creates new data using location information
prop_hour_device_loc <- data %>%
  filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
  count(hour, location, device) %>% 
  group_by(hour, location) %>% #this specifies to calculate within locations as well
  mutate(prop = n / sum(n)) %>%
  ungroup()

#generates plot
prop_hour_device_loc %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hourly Distribution of Device Check-Ins, Faceted by Location",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ location, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank())

Looking at the graph above, we can see that at LOCATION_001, devices have varying rates of usage throughout the day, and at LOCATION_002, devices are often used the same amount!
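The group_by() step decides what the proportions add up to. As a small sketch on invented data (the toy_checkins tibble below is hypothetical), we can confirm that grouping by hour and location makes the proportions sum to 1 within each hour/location pair:

```r
library(dplyr)
library(tibble)

# toy check-ins (invented for illustration): one row per check-in
toy_checkins <- tibble(
  hour     = c(7, 7, 7, 7, 8, 8),
  location = c("A", "A", "A", "B", "A", "A"),
  device   = c("001", "001", "002", "001", "001", "002")
)

toy_prop <- toy_checkins %>%
  count(hour, location, device) %>%   # check-ins per hour/location/device
  group_by(hour, location) %>%        # proportions are computed within each hour/location pair
  mutate(prop = n / sum(n)) %>%
  ungroup()

# sanity check: the total should be 1 for every hour/location pair
toy_prop %>%
  group_by(hour, location) %>%
  summarise(total = sum(prop), .groups = "drop")
```

If the totals are not 1, the grouping variables probably do not match the comparison you intend to make.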

Histograms

When working with election data, understanding the distribution of check-ins over time is crucial! As seen above, bar plots allow us to look at general peaks and overall trends using the hour variable. However, if we wanted to look at the distribution of check-ins at a more detailed level (like by minute intervals), bar plots become much less effective.

In these cases, histograms are more appropriate to use! Histograms sort a continuous variable into bins, making it easier to identify trends.

First, let’s look at the bar chart below:

R

data %>%
  ggplot(aes(x = hour)) +
  geom_bar(color = "black", fill = "lightblue") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour",
       x = "Hour (24H Format)",
       y = "Count")

Now, let’s create a similar plot displaying the distribution of check-ins by hour using a histogram instead of a bar plot:

R

data %>%
  ggplot(aes(x = hour)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = 1) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour",
       x = "Hour (24H Format)",
       y = "Count")

As you may see, the plots look almost identical, save for the histogram having bars that touch (since the data is continuous and not discrete/categorical).

With histograms, however, we can create a more granular view by using smaller bins:

R

#create a decimal representation of the data (hour + minutes)
checkins_with_dec_hour <- data %>%
  mutate(dec_hour = hour + minute/60)

#plot with 15-minute bins (binwidth = 0.25 hours)
checkins_with_dec_hour %>%
  ggplot(aes(x = dec_hour)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = 0.25) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour (15-Minute Intervals)",
       x = "Hour (24H Format)",
       y = "Count")

Looking at this graph, it’s clearer that there is a large spike of check-ins early in the morning (between 7AM and 8AM). If you were to only look at the bar plot or the histogram with 1-hour bins, however, you may have assumed check-ins kept about the same rate throughout the whole morning (7AM - 10AM)!
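To make the binning concrete, here is a small sketch on invented check-in times (toy_times and quarter_bin are hypothetical names). It computes the decimal hour, works out by hand which 15-minute interval each check-in falls into, and then draws the matching histogram. By default, geom_histogram() centers its bins on multiples of the bin width and closes them on the right, so the sketch sets boundary = 0 and closed = "left" to make each bar cover a clock quarter-hour such as 7:00 to 7:15.

```r
library(ggplot2)
library(dplyr)
library(tibble)

# toy check-in times (invented for illustration)
toy_times <- tibble(
  hour   = c(7, 7, 7, 8, 8, 9),
  minute = c(5, 10, 50, 0, 20, 45)
)

toy_times <- toy_times %>%
  mutate(
    dec_hour    = hour + minute / 60,            # e.g. 7:50 becomes roughly 7.83
    quarter_bin = floor(dec_hour / 0.25) * 0.25  # start of the 15-minute interval
  )

# number of check-ins in each 15-minute interval, computed by hand
toy_times %>% count(quarter_bin)

# boundary = 0 puts bin edges on multiples of 0.25;
# closed = "left" counts a check-in at exactly 8:00 in the 8:00 to 8:15 bar
ggplot(toy_times, aes(x = dec_hour)) +
  geom_histogram(binwidth = 0.25, boundary = 0, closed = "left",
                 color = "black", fill = "lightblue")
```

Whether to align the bins this way is a design choice. The default centering is fine for spotting overall shape, while aligned bins are easier to describe in terms of clock time.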

Visualizing Location Data with Maps

When working with geographic or location data, it’s often useful to visualize it on a map. Throughout the next section, we’ll demonstrate ways to work with spatial data using the Game of Thrones Dataset!

First, let’s load the sf package. This package allows ggplot2 to work with spatial data (like shapefiles):

R

library(sf)

Next, let’s load in the map data containing our map polygons:

R

#read in data and save to object
westeros_map <- st_read(here("data", "polygons_GoT.geojson"), quiet = TRUE)

#look at the data structure
head(westeros_map, 3)

Finally, let’s load the voting data and link it to our map data using the merge function. This function links two data sets based on a shared variable (in our case, the “id” column):

R

#read in data and save to object
got_votes <- read_csv(here("data", "voting_GoT.csv"))

#look at the data structure
head(got_votes)

#join data using the merge function
westeros_voting <- merge(westeros_map, got_votes, by = "id")
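It helps to know what merge() does with rows that have no match. The sketch below uses two small invented data frames (toy_regions and toy_votes are hypothetical) rather than the lesson files:

```r
# toy tables (invented for illustration)
toy_regions <- data.frame(id = c(1, 2, 3),
                          Name = c("North", "Vale", "Reach"))
toy_votes   <- data.frame(id = c(1, 2, 4),
                          Jon_Snow_pct = c(60, 45, 52))

# default merge(): keeps only ids found in both tables (an inner join)
inner <- merge(toy_regions, toy_votes, by = "id")

# all.x = TRUE: keeps every row of the first table, filling gaps with NA
left <- merge(toy_regions, toy_votes, by = "id", all.x = TRUE)

nrow(inner)  # 2
nrow(left)   # 3
```

On a map, a region dropped by the default merge would simply not be drawn, so comparing the number of rows before and after merging is a quick way to catch mismatched ids.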

Map Introduction

Now that our data is ready to be mapped, let’s start by visualizing which regions favor Jon Snow over Daenerys Targaryen.

When using spatial data, we use a special ggplot function called geom_sf. Simply put, this tells ggplot to look at the simple features (like lines or polygons) in your data and use them for the graph!

Below, we will be using geom_sf on our combined data and use Jon_Snow_pct to determine the level of support Jon Snow is getting from each region:

R

ggplot() +
  geom_sf(data = westeros_voting, aes(fill = Jon_Snow_pct)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Support for Jon Snow across Westeros",
       fill = "Support %") +
  theme_bw()
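If you would like to experiment with geom_sf without the lesson files, here is a self-contained sketch that builds a tiny "map" of two square regions by hand. The helper make_square, the object toy_map, and the support values are all invented for illustration.

```r
library(sf)
library(ggplot2)

# hypothetical helper for this sketch: a 1 x 1 square with its lower-left corner at (x0, y0)
make_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0,     y0),
    c(x0 + 1, y0),
    c(x0 + 1, y0 + 1),
    c(x0,     y0 + 1),
    c(x0,     y0)       # the ring must end where it started
  )))
}

# toy map (invented for illustration): two regions with a made-up support value
toy_map <- st_sf(
  Name         = c("Region A", "Region B"),
  Jon_Snow_pct = c(35, 70),
  geometry     = st_sfc(make_square(0, 0), make_square(1, 0))
)

# same layering as the lesson map: polygons filled by a numeric column
ggplot() +
  geom_sf(data = toy_map, aes(fill = Jon_Snow_pct)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(fill = "Support %") +
  theme_bw()
```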

Next, let’s do the same for Daenerys Targaryen, but with red instead of blue for the color scale:

R

# Create a map colored by Daenerys support
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = Daenerys_Targaryen_pct)) +
  scale_fill_gradient(low = "pink", high = "darkred") +
  labs(title = "Support for Daenerys Targaryen across Westeros",
       fill = "Support %") +
  theme_bw()

Conditional Map Coloring

Often, it may be more beneficial to color each part of the map according to the candidate that received the most votes, rather than displaying the amount of support a single candidate received.

This can be achieved by determining which candidate received the most votes and filling that section with that candidate’s color using the scale_fill_manual function:

R

#create a column with the name of the dominant candidate
westeros_voting$dominant <- ifelse(westeros_voting$Jon_Snow_pct > westeros_voting$Daenerys_Targaryen_pct, 
                                  "Jon Snow", "Daenerys Targaryen")

#pick fill colors based on the dominant candidate
dom_color <- c("Jon Snow" = "steelblue", 
               "Daenerys Targaryen" = "firebrick")

#create a map with the specified coloring
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = dominant)) +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  theme_bw()

In some cases, you may not just be interested in who won each region, but additionally by how much. To map this, first determine the margin of victory and add a column containing how strong of a victory they had:

R

#calculate margin of victory
westeros_voting$margin <- abs(westeros_voting$Jon_Snow_pct - westeros_voting$Daenerys_Targaryen_pct)

#bin the margin into three levels (low, med, high)
westeros_voting$margin_bin <- ifelse(
  westeros_voting$margin <= 5, "Low",
  ifelse(westeros_voting$margin <= 20, "Med",
         "High")
)
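Nested ifelse() calls become hard to read as the number of bins grows. As an alternative sketch (not the approach used in the rest of this lesson), base R's cut() produces the same three bins. The toy_margin values are invented and deliberately include the boundary values 5 and 20.

```r
# toy margins (invented for illustration), including the boundary values 5 and 20
toy_margin <- c(2, 5, 5.1, 20, 35)

# nested ifelse(), as used in the lesson
bin_ifelse <- ifelse(toy_margin <= 5, "Low",
                     ifelse(toy_margin <= 20, "Med", "High"))

# cut() with right-closed intervals: (-Inf, 5], (5, 20], (20, Inf]
bin_cut <- cut(toy_margin,
               breaks = c(-Inf, 5, 20, Inf),
               labels = c("Low", "Med", "High"))

bin_ifelse                                    # "Low" "Low" "Med" "Med" "High"
identical(bin_ifelse, as.character(bin_cut))  # TRUE
```

One difference is that cut() returns a factor whose levels are ordered Low, Med, High, while ifelse() returns plain character values.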

Using the information you gained above, you can now develop your “fill rule” and select the color that corresponds to each instance. In this case, your “fill rule” consists of the winner of each region (e.g. Jon Snow) and how high of a margin of victory they had (e.g. High):

R

#make a fill rule (e.g. Jon Snow - High)
westeros_voting$marg_fill <- paste(westeros_voting$dominant, westeros_voting$margin_bin, sep = " - ")

#pick fill colors based on the fill rule!
marg_color <- c(
  "Daenerys Targaryen - High" = "brown4",
  "Daenerys Targaryen - Med" = "firebrick",
  "Daenerys Targaryen - Low" = "pink",
  "Jon Snow - High" = "darkblue",
  "Jon Snow - Med" = "royalblue",
  "Jon Snow - Low" = "lightblue"
)
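Because scale_fill_manual matches a named color vector to the data by name, a label that is spelled differently in the data will not pick up its intended color. A quick sanity check, sketched here on invented values (toy_fill and toy_colors are hypothetical), is to list any labels that have no entry in the color vector:

```r
# toy fill labels (invented for illustration); the last one does not match any name below
toy_fill <- c("Jon Snow - High", "Jon Snow - Low", "Jon Snow - Medium")

toy_colors <- c(
  "Jon Snow - High" = "darkblue",
  "Jon Snow - Med"  = "royalblue",
  "Jon Snow - Low"  = "lightblue"
)

# labels present in the data but missing from the color vector
missing_colors <- setdiff(unique(toy_fill), names(toy_colors))
missing_colors   # "Jon Snow - Medium"
```

An empty result means every label in the data has a color assigned.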

Your final step is to combine your fill rule and chosen colors with your mapping information, creating your margin of victory map:

R

#create margin of victory map
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = marg_fill)) +
  scale_fill_manual(name = "Winner & Margin", values = marg_color) +
  labs(title = "Margin of Victory in Each Region") +
  theme_bw()

Adding Map Labels

After ensuring your map includes all the information required, the final step is adding region labels! Unfortunately, due to the nature of polygons, this is a bit more difficult than simply using the labs function.

To add region labels, your first step is to convert your data to a simple features (sf) object. This will allow for the calculation of where your labels will sit on your map:

R

#convert to sf
westeros_voting_sf <- st_as_sf(westeros_voting)

Your second step is to determine where your region labels will sit on your map! This is done by calculating the centroids, or center points, of each region. Below, we will calculate the centroid of each region and convert its x and y coordinates to columns for easier access:

R

#calculate centroid
region_centroids <- st_centroid(westeros_voting_sf)

#extract the coordinates
coords <- st_coordinates(region_centroids)

#convert coordinates to columns coords.X and coords.Y
region_centroids$coords.X <- coords[, 1]
region_centroids$coords.Y <- coords[, 2]
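To check what st_centroid and st_coordinates return, here is a minimal sketch on a single hand-built square (toy_square is invented for illustration), where the center point is easy to verify by eye:

```r
library(sf)

# toy region (invented for illustration): a 2 x 2 square with corners at (0, 0) and (2, 2)
toy_square <- st_sfc(st_polygon(list(rbind(
  c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0)
))))

toy_centroid <- st_centroid(toy_square)       # center point of the square
toy_coords   <- st_coordinates(toy_centroid)  # matrix with columns X and Y

toy_coords[, "X"]  # 1
toy_coords[, "Y"]  # 1
```

For irregular shapes the centroid can fall outside the region itself. In that situation, sf's st_point_on_surface() may be a better choice, since it returns a point that lies on the polygon.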

Now that we have determined where the region labels will be placed, we can finally add the region labels onto the map using the geom_text function.

Within this function, we can specify the data used (in this case, region_centroids), the coordinates, the information that will be used for the label, and text formatting information (like size and bold/italics)!

Additionally, it’s important to note that we need to use westeros_voting_sf as the data for the map instead of westeros_voting. This will ensure that the region labels sit in their proper locations!

R

#create a map with the specified coloring
ggplot() +
  geom_sf(data = westeros_voting_sf, aes(fill = dominant)) +
  geom_text(data = region_centroids, 
            aes(x = coords.X, y = coords.Y, label = Name),
            size = 2, fontface = "bold") +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  theme_bw()

As you may notice, some labels in dense areas are overlapping a lot! This happens because the plot panel is small relative to the number of labels. To resolve this, you can export the map at a larger size using ggsave (which will be covered at the end of this lesson!).
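Another option, sketched below on invented label positions (toy_labels is hypothetical), is the check_overlap argument of geom_text, which skips any label that would overlap one that has already been drawn:

```r
library(ggplot2)
library(tibble)

# toy label positions (invented for illustration): the first two sit almost on top of each other
toy_labels <- tibble(
  x    = c(1.00, 1.02, 3.00),
  y    = c(1.00, 1.00, 2.00),
  Name = c("Region A", "Region B", "Region C")
)

# check_overlap = TRUE leaves out labels that would overlap an earlier label
ggplot(toy_labels, aes(x = x, y = y, label = Name)) +
  geom_text(size = 4, fontface = "bold", check_overlap = TRUE)
```

The trade-off is that skipped labels are not shown at all, so exporting at a larger size remains the better choice when every region needs a label.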

Challenge

Exercise

Using what you’ve learned above, create a map displaying the average check-in times across the first 35 precincts. For this exercise, we will be using the avg_checkins.csv file we created within “Data Wrangling with dplyr”!

To complete this map, use the following steps:

1. Read in your data as “checkin_data”.
2. Using the merge function, link together your “checkin_data” with the “westeros_map”, creating a “westeros_checkins” dataframe. Hint: if the linking columns are named differently, use by.x and by.y to specify the two names (with x being the first data set and y being the second).
3. Generate your map based on the “westeros_checkins” data, filling each region based on the avg_checkin_length.
4. Choose a title and change the name of the legend to “Check-In Times”.

Show me the solution

R

#read in data
checkin_data <- read_csv(here("data", "avg_checkins.csv"))

#link together map and checkin_data
westeros_checkins <- merge(westeros_map, checkin_data, by.x = "id", by.y = "precinct")

#generate map with a title and legend name
ggplot() +
  geom_sf(data = westeros_checkins, aes(fill = avg_checkin_length)) +
  labs(title = "Average Check-In Times Across Westeros",
       fill = "Check-In Times") +
  theme_bw()

Customization

`ggplot2` Themes

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.

The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Custom Themes

If you do not like the themes offered, or you’d like to change a portion of a theme, you can use the theme() function to manually customize your maps and plots!

The theme() function allows you to customize all portions of a ggplot, including the text, title, subtitle, and grids. You can find the full list in the documentation or by using the panel on the right and navigating to the theme help page (Help > Packages > ggplot2 > theme).

Below, we will be applying a few of these customizations to a plot from earlier in the lesson:

R

prop_device %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = prop)) +
  geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
  labs(title = "Proportion of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Proportion",
       fill = "Device") +
  theme_bw() +
  theme(
    text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.border = element_rect(color = "grey70")
  )

Note: it is also possible to change the fonts of your plots! If you are on Windows, you will have to install the extrafont package before doing so.

Additionally, if you like the changes you created better than the default themes, you can save your changes as a custom theme for application to other plots:

R

my_theme <- theme_bw() +
  theme(
    text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.border = element_rect(color = "grey70")
  )

prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  my_theme
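If you want a custom theme to apply to every plot in your session without adding it each time, ggplot2's theme_set() function changes the default theme. The sketch below uses a small stand-in theme (toy_theme) and invented data so that it runs on its own:

```r
library(ggplot2)

# a small custom theme, built the same way as my_theme in the lesson
toy_theme <- theme_bw() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# theme_set() makes a theme the session default and returns the previous default
old_theme <- theme_set(toy_theme)

# toy data (invented for illustration); no theme is added here, yet toy_theme is used
p_default <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Uses the session default theme")

p_default

# restore the previous default when you are done
theme_set(old_theme)
```

Saving the previous theme in old_theme makes it easy to undo the change, which is helpful when a script is shared with others.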

These themes can also be applied to maps, as seen below:

R

ggplot() +
  geom_sf(data = westeros_voting, aes(fill = dominant)) +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  my_theme

Discussion

Exercise

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own using any of the data used throughout this lesson.

You can use the RStudio ggplot2 cheat sheet for inspiration.

Here are some ideas:

- Make a line plot showing the cumulative number of check-ins over the course of the day.
- Try using a different color palette for your device comparison.
- Generate a new map using the GoT data.

Plot Output

After creating a plot, you may want to save it as a png (or other format). To do this, you can use the ggsave() function, which allows you to easily change the dimensions and resolution of your plot by adjusting the appropriate arguments (width, height and dpi) before saving the plot to the specified directory.

Here, we will save one of the plots we customized above:

R

plot <- prop_device %>%
        filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
        mutate(location = str_remove(location, "LOCATION_")) %>%
        mutate(device = str_remove(device, "DEVICE_")) %>%
        ggplot(aes(x = location, y = prop)) +
        geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
        labs(title = "Proportion of Check-Ins by Location for Each Device",
             x = "Location",
             y = "Proportion",
             fill = "Device") +
        theme_bw() +
        theme(
          text = element_text(size = 12),
          plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
          axis.title = element_text(face = "italic"),
          panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.border = element_rect(color = "grey70")
        )

ggsave("fig-output/device_prop.png", plot, width = 10, height = 6, dpi = 300)

You can find the generated png in the fig-output folder of your working directory!
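Saving can fail or prompt you if the output folder does not exist yet. A minimal sketch of creating the folder first is shown below. It writes to a temporary folder so that it can run anywhere, and toy_plot, out_dir, and out_file are hypothetical names for illustration.

```r
library(ggplot2)

# toy plot (invented for illustration)
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_point()

# a temporary folder is used here; in the lesson this would be "fig-output"
out_dir <- file.path(tempdir(), "fig-output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)  # does nothing if the folder exists

out_file <- file.path(out_dir, "toy_plot.png")
ggsave(out_file, toy_plot, width = 10, height = 6, dpi = 300)

file.exists(out_file)  # TRUE
```

The width and height are given in inches by default, so the example above produces a 3000 by 1800 pixel image at 300 dpi.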

Key Points

ggplot2 is a flexible and useful tool for creating plots in R.
The data set and coordinate system can be defined using the ggplot function.
Additional layers, including geoms, are added using the + operator.
Time-series data can be visualized using geom_line() and geom_point().
Box plots are useful for visualizing the distribution of check-in times by location.
Bar plots are useful for visualizing counts of check-ins by categorical variables.
Faceting allows you to generate multiple plots based on a categorical variable like device.
Spatial data can be visualized on maps using the sf and ggplot2 packages.
