Data Visualisation with ggplot2
Last updated on 2026-04-28
Overview
Questions
- What are the components of a ggplot?
- How can I visualize check-in patterns over time?
- How can I compare check-in frequencies across locations and devices?
- What are the main differences between R base plots, lattice, and ggplot?
- How can I visualize location data on maps with ggplot2?
Objectives
- Produce scatter plots, box plots, and bar plots using ggplot.
- Create time series plots for temporal check-in data.
- Set universal plot settings.
- Describe what faceting is and apply faceting in ggplot.
- Modify the aesthetics of an existing ggplot plot (including axis labels and color).
- Build complex and customized plots from data in a tibble.
- Create maps with ggplot2 to visualize location-based data.
- Recognize the differences between base R, lattice, and ggplot visualizations.
This episode is a broad overview of ggplot2 and focuses on getting
familiar with the layering system of ggplot2, using the argument
group in the aes() function, and basic
customization of the plots. We’ll show how to visualize patterns in
check-in behavior across different locations and devices, and introduce
mapping techniques.
We start by loading the required packages:
tidyverse, here, and
lubridate. As you may recall,
ggplot2 is included in the
tidyverse package, so we do not need to
load ggplot2 separately.
R
library(tidyverse)
library(here)
library(lubridate)
Next, let’s load in our data! Throughout this lesson, we will be using a sampled version of the data we created at the end of “Starting With Data”. In practice, sampling data before visualization is NOT required; however, due to the size of our original data set, using a smaller, sampled data set will allow us to generate plots much faster!
R
data <- read_csv(here("data", "checkin_sample_plotting.csv"))
Before we continue, let’s take a look at the structure and size of our data set to see what we’ll be working with in detail:
R
glimpse(data)
As you may notice, the hour value exceeds 12, meaning this data is in 24-hour time! If you are unfamiliar, this means 13 represents 1PM, 14 represents 2PM, and so on.
Additionally, for those curious, the original data set had approximately 352k lines, which means this data set is less than 10% of the size!
Visualization Options in R
Before we start with ggplot2, it’s
helpful to know that there are several ways to create visualizations in
R. While ggplot2 is great for building
complex and highly customizable plots, there are simpler and quicker
alternatives that you might encounter or use depending on the context.
Let’s briefly explore a few of them:
Base-R Plots
Base R plots are the simplest form of visualization and are great for
quick, exploratory analysis. You can create plots with very little code,
but customizing them can be cumbersome compared to
ggplot2.
Example of a simple time series plot in base R showing the number of check-ins by hour:
R
hourly_counts <- data %>%
count(hour)
plot(hourly_counts$hour, hourly_counts$n,
main = "Base R Plot: Check-Ins by Hour",
xlab = "Hour of Day",
ylab = "Number of Check-Ins",
type = "l") #'l' for line
Lattice
Lattice is another plotting system in R, which allows for creating multi-panel plots easily. It’s different from ggplot2 because you define the entire plot in a single function call, and modifications after plotting are limited.
Example of a lattice plot showing check-ins by device for different locations:
R
library(lattice)
R
#grabs specific locations (so the graph isn't giant) and converts locations + devices to factors
checkins_lattice <- data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003")) %>%
#we're removing "DEVICE_" because it causes overlap within the plot
#if you're curious, remove this line and regenerate the plot!
mutate(device = str_remove(device, "DEVICE_")) %>%
mutate(
device = as.factor(device),
location = as.factor(location)
)
#creates a lattice boxplot (bwplot)
bwplot(hour ~ device | location, data = checkins_lattice,
main = "Lattice Plot: Check-in Hour Distribution by Device and Location",
xlab = "Device",
ylab = "Hour of Check-in",
layout = c(length(unique(checkins_lattice$location)), 1), #adjusts layout for multiple locations
strip = strip.custom(bg="lightgrey"),
scales = list(y = list(at = 0:24)), #adds all hours on y, not just even numbers
panel = function(x, y, ...) {
panel.bwplot(x, y, ...)
})
Plotting with ggplot2
ggplot2 is a plotting package that
makes creating complex plots from data stored in a tibble simpler. It
provides a programmatic interface for specifying what variables to plot,
how they are displayed, and general visual properties. As a result, if
the underlying data changes or if we decide to switch from a bar plot to
a scatter plot, we only have to make minimal adjustments to the
code!
ggplot2 functions work best with data
in the ‘long’ format. As you may recall from “Data Wrangling with
tidyr”, this consists of a column for every dimension, and a row for
every observation. Ensuring you use well-structured data will save you
lots of time when making figures with
ggplot2.
ggplot2 graphics are built step by step
by adding new elements. Adding layers in this fashion allows for
extensive flexibility and customization of plots.
Each chart built with ggplot2 must
include the following:
- Data
- Aesthetic mapping (aes)
  - Describes how variables are mapped onto graphical attributes
  - Visual attributes of data include x-y axes, color, fill, shape, and alpha
- Geometric objects (geom)
  - Determines how values are rendered graphically, as bars (geom_bar), scatterplot (geom_point), line (geom_line), etc.
Thus, the template for a graphic in ggplot2 is:
<DATA> %>%
ggplot(aes(<MAPPINGS>)) +
<GEOM_FUNCTION>()
Remember that the pipe operator %>% places the result
of the previous line(s) into the first argument of the function. The
ggplot function expects a data frame to be
the first argument, which allows us to change from specifying the
data = argument within the ggplot function to
instead piping the data into the function.
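As a minimal sketch of that equivalence, the two calls below should build the same bar chart. The toy_checkins tibble and its values are invented for illustration and are not part of the lesson data:

```r
library(ggplot2)
library(dplyr)   # provides the %>% pipe and tibble()

# Hypothetical toy data, invented for this sketch
toy_checkins <- tibble(precinct = c("P1", "P1", "P2", "P3", "P3", "P3"))

# Option 1: pass the data explicitly as the first argument
p_explicit <- ggplot(data = toy_checkins, aes(x = precinct)) +
  geom_bar()

# Option 2: pipe the data into ggplot()
p_piped <- toy_checkins %>%
  ggplot(aes(x = precinct)) +
  geom_bar()

# layer_data() returns the values ggplot2 computed for a layer;
# both versions give the same bar heights
identical(layer_data(p_explicit)$count, layer_data(p_piped)$count)
```

Which form you use is a matter of style; piping is convenient when the data is filtered or mutated just before plotting.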
To create a chart with ggplot2, follow
the steps below:
- Use the ggplot() function and bind the plot to a specific tibble.
R
data %>%
ggplot()
- Using the aesthetic (aes) function, define your mapping by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.
R
data %>%
ggplot(aes(x = precinct))
- Add ‘geoms’: graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
  - geom_bar() for counting observations in categories
  - geom_histogram() for showing distributions
  - geom_boxplot() for statistical summaries
  - geom_line() for trend lines, time series, etc.
To add a geom to the plot use the + operator. Let’s
start by creating a bar chart showing the distribution of check-ins
across precincts:
R
data %>%
ggplot(aes(x = precinct)) +
geom_bar()
The + in the ggplot2
package is particularly useful because it allows you to modify existing
ggplot objects. This means you can easily set up plot
templates and conveniently explore different types of plots! Using this
idea, the above plot can also be generated with code like this, similar
to the “intermediate steps” approach:
R
#assign the plot to a variable
checkins_plot <- data %>%
ggplot(aes(x = precinct))
#draw the plot as a bar plot
checkins_plot +
geom_bar()
Notes
- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
- You can also specify mappings for a given geom independently of the mapping defined globally in the ggplot() function.
- The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.
R
## This is the correct syntax for adding layers
checkins_plot +
geom_bar()
## This will not add the new layer and will return an error message
checkins_plot
+ geom_bar()
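To illustrate the difference between global and layer-specific mappings described in the notes, here is a small sketch. The toy_counts tibble and the threshold of 45 are made up for this example:

```r
library(ggplot2)
library(dplyr)

# Hypothetical hourly counts, invented for this sketch
toy_counts <- tibble(hour = 7:12, n = c(40, 55, 30, 62, 48, 20))

p_layers <- toy_counts %>%
  ggplot(aes(x = hour, y = n)) +      # global mapping: seen by every layer
  geom_line() +                       # inherits x and y, no color mapping
  geom_point(aes(color = n > 45))     # extra mapping used by this layer only

# The line layer is drawn in a single colour,
# while the point layer uses one colour per value of (n > 45)
length(unique(layer_data(p_layers, 1)$colour))
length(unique(layer_data(p_layers, 2)$colour))
```

Had color been placed in the ggplot() call instead, the line would have been colored by the same variable as the points.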
Building Your Plots Iteratively
Building plots with ggplot2 is
typically an iterative process. We start by defining the data set we’ll
use, lay out the axes, and choose a geom.
Let’s re-create the time-series plot we made for the Base-R demonstration:
R
#using the hourly_counts we created, generate a time-series plot
hourly_counts %>%
ggplot(aes(x = hour, y = n)) +
geom_line() #creates a line plot using the x and y from the ggplot above!
Now that we have a baseline plot to start from, we can start modifying it to extract additional information! For instance, when inspecting the plot, we can notice that it’s a bit difficult to tell at first glance where each hour sits on the line.
To resolve this, we will add points to the line to clearly indicate each hour:
R
hourly_counts %>%
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point()
Next, we will add colors for all of the points by specifying a
color argument inside the geom_point
function:
R
hourly_counts %>%
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(color = "blue")
To color each point in the plot differently, you could use a
vector as an input to the color argument; however, because
we are now mapping features of the data to a color, instead of setting
one color for all points, the color of the points now needs to be set
inside a call to the aes function. When we map a variable
in our data to the color of the points,
ggplot2 will provide a different color
corresponding to the different values of the variable.
Let’s apply this to our plot below, changing the color of each point based on the hour:
R
hourly_counts %>%
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(aes(color = hour))
Unfortunately, this doesn’t tell us much about our data, just that
each point represents a different hour (which we already knew!).
Additionally, you may notice that after adding conditional coloring
using aes(), ggplot automatically added a legend to explain
what the different colors represent/mean!
Now, instead of coloring each point based on one of the variables we already have, we’re going to calculate the average hourly count and set the point to green if the count at that hour is above average and red if the count at that hour is below average!
To do this, we will calculate the average hourly count and, using
mutate, add a column to our hourly_counts tibble that indicates whether
the count at that hour is above or below the calculated average! Then,
we will use the scale_color_manual function to manually
color these points green and red instead of the default (which, when
writing this lesson, was red and blue, respectively).
R
#calculate average
average <- mean(hourly_counts$n)
#plot
hourly_counts %>%
mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(aes(color = avg_color)) + #colors the points
scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors
Additionally, you may want to increase the size of the points! This
can be accomplished using the size argument within the
geom_point function, as seen below:
R
#calculate average
average <- mean(hourly_counts$n)
#plot
hourly_counts %>%
mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(aes(color = avg_color), size = 2) + #colors the points
scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors
At this point, our plot is mostly completed! The only remaining issue is the lack of proper titling and labeling.
By default, the axis labels on a plot are determined by the name of
the variable being plotted. However,
ggplot2 offers lots of customization
options, like specifying the axis labels and adding a title to the plot,
with relatively few lines of code. We will add more informative x- and
y-axis labels to our plot, a more explanatory label to the legend, and a
plot title.
The labs function takes the following arguments:
- title: to produce a plot title
- subtitle: to produce a plot subtitle (smaller text placed beneath the title)
- caption: a caption for the plot
- ...: any pair of name and value for aesthetics used in the plot (e.g., x, y, fill, color, size)
R
hourly_counts %>%
mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(aes(color = avg_color), size = 2) +
scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
labs(title = "Check-In Count per Hour",
x = "Hour (24H Format)",
y = "Count",
color = "Relation to Average")
Our final step will be to improve the x-axis to include all
hours, not just 10, 15, and 20! This can be achieved using the
scale_x_continuous function.
The scale_x_continuous function is used to customize the
x-axis when the x-axis is numeric (or continuous!). Within this
function, you can control the axis limits (or range) and breaks (where
tick marks appear).
Let’s finish our plot using this function:
R
hourly_counts %>%
mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
ggplot(aes(x = hour, y = n)) +
geom_line() +
geom_point(aes(color = avg_color), size = 2) +
scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
labs(title = "Check-In Count per Hour",
x = "Hour (24H Format)",
y = "Count",
color = "Relation to Average") +
scale_x_continuous(breaks = seq(0, 24, by = 1))
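The lesson plot only sets breaks. As a hedged sketch of how limits and breaks can be combined, the example below uses a made-up toy_counts tibble; the range of 6 to 13 is an arbitrary choice for illustration:

```r
library(ggplot2)
library(dplyr)

# Hypothetical hourly counts, invented for this sketch
toy_counts <- tibble(hour = 7:12, n = c(40, 55, 30, 62, 48, 20))

p_limits <- toy_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  scale_x_continuous(
    limits = c(6, 13),            # range shown on the x-axis
    breaks = seq(6, 13, by = 1)   # one tick mark per hour
  )

p_limits
```

Keep in mind that values falling outside the limits are dropped from the plot (with a warning), so choose limits that cover all of your data.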
While the plot above gives information on the number of check-ins across all locations, we may want information unique to individual locations instead. To achieve this, we can calculate the number of check-ins every hour at each location and plot one line for each of the first five locations:
R
#calculate check-ins per hour for each location
hourly_count <- data %>%
count(location, hour)
#plot multiple lines, changing the color for each
hourly_count %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = hour, y = n, color = location)) + #Note: putting color in ggplot applies to all layers (geom_line AND geom_point)!
geom_line(linewidth = 1) + #linewidth replaces size for lines in ggplot2 3.4.0 and later
geom_point(size = 3) +
labs(title = "Hourly Check-In Count by Location",
x = "Hour (24H Format)",
y = "Count",
color = "Location") +
scale_x_continuous(breaks = seq(0, 24, by = 1))
As you can see, LOCATION_003 is very popular at 10AM (and may benefit from additional support from employees/volunteers), whereas LOCATION_002 dies down after 11AM.
Boxplot
We can use box plots to visualize the distribution of check-in times for specific locations:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black")
As you may notice, it’s a bit difficult to understand this plot at
first glance! To resolve this, let’s begin by adding all of the hours on
the y-axis using the scale_y_continuous function! This
function behaves exactly the same as the scale_x_continuous
function, but it applies to the y-axis instead:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
scale_y_continuous(breaks = seq(0, 23, by = 1))
By adding points to a box plot, we can have a better idea of the number of measurements and of their distribution:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
geom_point(color = "tomato") +
scale_y_continuous(breaks = seq(0, 23, by = 1))
Looking at this plot, from a rough estimate, it looks like there are
far fewer dots on the plot than there are rows in our tibble. This should
lead us to believe that there may be multiple observations plotted on
top of each other (e.g. three observations where hour is 12
and location is LOCATION_001). This is known as
“overplotting” and occurs when multiple data points share the same x and
y coordinates.
There are two main ways to alleviate overplotting issues:
1. changing the transparency of the points
2. jittering the location of the points
Let’s first explore option 1, or changing the transparency of the
points. When we say “transparency”, we mean the opacity/your ability to
see through the point. We can control the transparency of the points
with the alpha argument! Values of alpha range
from 0 to 1, with lower values corresponding to more transparent colors
(an alpha of 1 is the default value). Specifically, an
alpha of 0.1 would make a point one-tenth as opaque as a normal point.
Stated differently, roughly ten points stacked on top of each other would
look like a normal point.
With that being said, we’re going to change the alpha to
0.5 in an attempt to help fix the overplotting. As you may quickly
notice, the overplotting is not solved, but adding transparency begins
to address this problem, as the points where there are more overlapping
observations are darker (as opposed to lighter red):
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
geom_point(color = "tomato", alpha = 0.5) +
scale_y_continuous(breaks = seq(0, 23, by = 1))
Since that only helped a little bit with the overplotting problem, let’s try option two and jitter the points on the plot, allowing us to see each point. This works because jittering introduces a little bit of randomness into the position of our points. You can think of this process as taking the overplotted graph and giving it a tiny shake! The points will move a little bit side-to-side and up-and-down, but their position in comparison to the original plot won’t dramatically change.
Note that this solution is only suitable for plotting integer values! For numeric values with decimals, geom_jitter() becomes inappropriate because it obscures the true value of the observation.
We can jitter our points using the geom_jitter()
function instead of the geom_point() function, as seen
below:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
geom_jitter(color = "tomato", alpha = 0.5) +
scale_y_continuous(breaks = seq(0, 23, by = 1))
As you can see, the points have been moved dramatically! Thankfully,
the geom_jitter() function allows us to specify the
amount of random motion in the jitter by using the width
and height arguments. When we don’t specify values for
width and height, geom_jitter()
defaults to 40% of the resolution of the data (the smallest difference
between adjacent values). Here, both location (a category) and hour (a whole
number) have a resolution of 1, so the default jitter is 0.4 in each
direction. Hence, if we would like less spread in our
jitter than the default, we should pick values below 0.4.
Experiment with the values to see how your plot changes!
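To see what width and height actually control, the sketch below jitters made-up data and then measures how far the points moved. The toy_checkins tibble is invented for this example, and position_jitter() with a seed is used here (instead of geom_jitter()) only so the random shake is reproducible:

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data: 50 identical check-ins at each of two locations
toy_checkins <- tibble(
  location = rep(c("001", "002"), each = 50),
  hour     = rep(c(9, 12), each = 50)
)

p_jitter <- toy_checkins %>%
  ggplot(aes(x = location, y = hour)) +
  geom_point(
    color = "tomato", alpha = 0.5,
    position = position_jitter(width = 0.2, height = 0.05, seed = 1)
  )

# Positions after jittering, as ggplot2 will draw them
jittered <- layer_data(p_jitter)

# Largest vertical shift away from the true hours (9 or 12)
max(pmin(abs(jittered$y - 9), abs(jittered$y - 12)))

# Largest horizontal shift away from the category positions (1 and 2)
max(abs(as.numeric(jittered$x) - round(as.numeric(jittered$x))))
```

Both values should stay within the chosen height and width, because the jitter is applied in both directions by at most the amount specified.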
Here, we initially chose a height of 0.05 (as too much variation in height may suggest different times at first glance) and a width of 0.2:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
scale_y_continuous(breaks = seq(0, 23, by = 1))
For our final step, let’s add a title, appropriate labels, and
improve the visuals of the plot overall! Additionally, to clean the
location names on the x-axis, we’ll be using the mutate
function (recall from Data Wrangling with dplyr) to remove the
“LOCATION_” prefix from each name (since the axis label will indicate
that these are locations!):
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>% #removes prefix
ggplot(aes(x = location, y = hour)) +
geom_boxplot(fill = "lightblue", color = "black") +
geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
scale_y_continuous(breaks = seq(0, 23, by = 1)) +
#adds labels to the plot
labs(title = "Distribution of Check-in Times by Location",
x = "Location",
y = "Hour (24-hour Format)")
Exercise
Box plots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bi-modal, we would not see it in a box plot. An alternative to the box plot is the violin plot, where the shape (of the density of points) is drawn.
Start by replacing the box plot with a violin plot; see
geom_violin().
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
ggplot(aes(x = location, y = hour)) +
geom_violin(fill = "lightblue", color = "black") +
geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
scale_y_continuous(breaks = seq(0, 23, by = 1)) +
labs(title = "Distribution of Check-in Times by Location",
x = "Location",
y = "Hour (24-hour Format)")
So far, we’ve looked at the distribution of check-in times between locations. Next, you’re going to try making a new plot to explore the distribution of another variable between locations.
Let’s create a box plot for minute for the locations
above. Overlay a jitter layer on top of the box plot layer to display the
distributions more accurately. Feel free to select any fill, color,
alpha, height, and width! Ensure a title and proper axis labels are
added.
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
ggplot(aes(x = location, y = minute)) +
geom_boxplot(alpha = 0) +
geom_jitter(color = "navy", alpha = 0.5, height = 0, width = 0.2) +
labs(title = "Distribution of Check-in Minutes by Location",
x = "Location",
y = "Minute of Check-in")
Lastly, color each point according to the device used! Ensure you change the name of the legend as well and remove “DEVICE_” from all device names (to ensure a clean legend).
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location, y = minute)) +
geom_boxplot(alpha = 0) +
geom_jitter(aes(color = device), alpha = 1, width = 0.2, height = 0.2) +
labs(title = "Distribution of Check-in Minutes by Location",
x = "Location",
y = "Minute of Check-in",
color = "Device")
Bar Plot
Bar plots are great for visualizing categorical data, such as
counting the number of check-ins per device, per location, or per
precinct. By default, geom_bar accepts a variable for x,
and plots the number of instances of each value of x (in this case,
location) within the data set.
Let’s create a bar plot displaying check-in counts for the first five locations:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
ggplot(aes(x = location)) +
geom_bar() +
labs(title = "Check-In Count by Location",
x = "Location",
y = "Count")
Next, let’s use the fill aesthetic for the
geom_bar() geom to color bars by the device used for
check-in:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location)) +
geom_bar(aes(fill = device)) +
labs(title = "Check-In Count by Location",
x = "Location",
y = "Count",
fill = "Device")
This creates a stacked bar chart. Unfortunately, as you may notice,
this is a bit difficult to read. Instead, we can separate the portions
of the stacked bar that correspond to each device and put them
side-by-side by using the position argument for
geom_bar() and setting it to “dodge”.
Let’s apply this concept to the code below, changing the title for clarity:
R
data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location)) +
geom_bar(aes(fill = device), position = "dodge") +
labs(title = "Count of Check-Ins by Location for Each Device",
x = "Location",
y = "Count",
fill = "Device")
As you can see, this is much easier to read and interpret!
In some cases, we may be more interested in the proportion
of each individual device at each location rather than the actual
count of each device. Proportions are helpful because they
account for differences in sample sizes, and instead focus on
distribution within specific locations! To compare proportions, we will
first create a new tibble (prop_device) with a new column
named “prop”, representing the proportion of each device within each
location.
R
prop_device <- data %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
count(location, device) %>%
group_by(location) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
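A quick way to check a calculation like this is to confirm that the proportions within each group add up to 1. The sketch below applies the same steps to a tiny made-up tibble (toy_checkins and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: three check-ins at location A, two at location B
toy_checkins <- tibble(
  location = c("A", "A", "A", "B", "B"),
  device   = c("X", "X", "Y", "X", "Y")
)

toy_prop <- toy_checkins %>%
  count(location, device) %>%        # one row per location/device pair
  group_by(location) %>%
  mutate(prop = n / sum(n)) %>%      # share of each device within its location
  ungroup()

# Sanity check: proportions should total 1 within every location
toy_totals <- toy_prop %>%
  group_by(location) %>%
  summarise(total = sum(prop))

toy_totals
```

If any total differs from 1, the grouping used in group_by() probably does not match the comparison you intended.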
Now, we can use this new tibble to create our plot showing the
proportion of each device at each location! When creating your
plot, ensure you include y = prop within the initial ggplot
call AND stat = "identity" within geom_bar() to tell ggplot to use the y
values instead of the count, and adjust labels/titles for clarity:
R
prop_device %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location, y = prop)) +
geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
labs(title = "Proportion of Check-Ins by Location for Each Device",
x = "Location",
y = "Proportion",
fill = "Device")
Looking at this graph, we can see that all of the devices (except DEVICE_012) have similar proportions (i.e., usage rates) when sample sizes are taken into consideration!
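As a side note, ggplot2 also provides geom_col(), which is shorthand for geom_bar(stat = "identity"). The sketch below compares the two on a small made-up tibble (toy_prop and its values are hypothetical, not taken from the lesson data):

```r
library(ggplot2)
library(dplyr)

# Hypothetical pre-computed proportions, invented for this sketch
toy_prop <- tibble(
  location = c("A", "A", "B", "B"),
  device   = c("X", "Y", "X", "Y"),
  prop     = c(0.75, 0.25, 0.5, 0.5)
)

# Version used in this lesson
p_bar <- toy_prop %>%
  ggplot(aes(x = location, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge")

# Equivalent shorthand
p_col <- toy_prop %>%
  ggplot(aes(x = location, y = prop, fill = device)) +
  geom_col(position = "dodge")

# Both layers draw bars of the same height
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```

Either form works; this lesson keeps geom_bar(stat = "identity") so that the contrast with the default counting behavior stays visible.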
Note
If you’d prefer to visualize percentages instead of proportions, you can multiply the prop column by 100! For example:
R
prop_device <- prop_device %>%
mutate(prop = (prop * 100))
If you adjust to percentages, however, please ensure you adjust titles and axis labels accordingly!
Exercise
Using the information you learned above, create a bar plot showing the proportion (or percentages, if you’d like) of check-ins by hour for the first four devices (i.e., “DEVICE_001”, “DEVICE_002”, “DEVICE_003”, and “DEVICE_004”). Which hours had the highest proportion of check-ins from DEVICE_001 and DEVICE_002?
R
#calculate proportions
prop_hour_device <- data %>%
filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
count(hour, device) %>%
group_by(hour) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
#generate plot
prop_hour_device %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = hour, y = prop, fill = device)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of Check-Ins by Hour for Each Device",
x = "Hour (24H Format)",
y = "Proportion",
fill = "Device") +
scale_x_continuous(breaks = seq(0, 24, by = 1))
#note: you can remove 6 and 20 by using this line instead:
#scale_x_continuous(breaks = seq(7, 19, by = 1))
From this plot, we can identify that DEVICE_001 has the highest proportion at 7:00/7AM and DEVICE_002 has the highest proportion at 19:00/7PM.
Exercise
Create a bar plot showing the check-in counts for the ten devices with the highest number of check-ins. Color each bar according to the device, title it appropriately, and use proper axis labels!
R
#retrieve top devices
top_devices <- data %>%
count(device) %>%
slice_max(n, n = 10) %>% #slice_max() supersedes top_n() in current dplyr
pull(device)
#create plot
data %>%
filter(device %in% top_devices) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = device, fill = device)) +
geom_bar() +
labs(title = "Top 10 Devices by Number of Check-ins",
x = "Device",
y = "Count") +
theme_classic()
Faceting
Rather than creating a single plot with side-by-side bars for each device, we may want to create multiple plots, where each plot shows the data for a single device. This would be especially useful if we had sampled a large number of devices (like 5 or 10), as side-by-side bars become harder to read as the number of bars increases.
ggplot2 has a special technique called
faceting that allows the user to split one plot into multiple
plots based on a factor included in the data set. Below, we can use this
technique to split our bar plot of check-in proportions by hour for each
device so each device has its own panel:
R
#generate plot
prop_hour_device %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = hour, y = prop, fill = device)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of Check-Ins by Hour for Each Device",
x = "Hour (24H Format)",
y = "Proportion",
fill = "Device") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
facet_wrap(~ device, scales = "free_y") #here, we specify we want to facet wrap by device
You can click the “Zoom” button in your RStudio plots panel to view a larger version of this plot.
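If the panels feel cramped, facet_wrap() also accepts nrow and ncol arguments to control the panel layout. The sketch below is a minimal illustration on made-up data (toy_prop is hypothetical, and stacking the panels in one column is just one possible choice):

```r
library(ggplot2)
library(dplyr)

# Hypothetical proportions for two devices, invented for this sketch
toy_prop <- tibble(
  hour   = rep(7:9, times = 2),
  device = rep(c("001", "002"), each = 3),
  prop   = c(0.6, 0.5, 0.3, 0.4, 0.5, 0.7)
)

p_facets <- toy_prop %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ device, ncol = 1)   # one panel per device, stacked vertically

# Number of panels ggplot2 created (one per device)
n_panels <- length(unique(layer_data(p_facets)$PANEL))
n_panels
```

Stacking panels in a single column keeps the x-axes aligned, which can make it easier to compare the same hour across devices.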
Plots with a white background usually look more readable when printed.
We can set the background to white using the function
theme_bw(). Additionally, we can remove the grid:
R
prop_hour_device %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = hour, y = prop, fill = device)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of Check-Ins by Hour for Each Device",
x = "Hour (24H Format)",
y = "Proportion",
fill = "Device") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
facet_wrap(~ device, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank())
We can also facet by location to see patterns of device proportions within different locations:
R
#creates new data using location information
prop_hour_device_loc <- data %>%
filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
count(hour, location, device) %>%
group_by(hour, location) %>% #this specifies to calculate within locations as well
mutate(prop = n / sum(n)) %>%
ungroup()
#generates plot
prop_hour_device_loc %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = hour, y = prop, fill = device)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Hourly Distribution of Device Check-Ins, Faceted by Location",
x = "Hour (24H Format)",
y = "Proportion",
fill = "Device") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
facet_wrap(~ location, scales = "free_y") +
theme_bw() +
theme(panel.grid = element_blank())
Looking at the graph above, we can see that at LOCATION_001, devices have varying rates of usage throughout the day, and at LOCATION_002, devices are often used the same amount!
Histograms
When working with election data, understanding the distribution of
check-ins over time is crucial! As seen above, bar plots allow us to
look at general peaks and overall trends using the hour
variable. However, if we wanted to look at the distribution of check-ins
at a more detailed level (like by minute intervals), bar plots become
much less effective.
In these cases, histograms are more appropriate to use! This is because histograms sort continuous variables into bins, making it easier to identify trends.
First, let’s look at the bar chart below:
R
data %>%
ggplot(aes(x = hour)) +
geom_bar(color = "black", fill = "lightblue") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
labs(title = "Check-In Distribution by Hour",
x = "Hour (24H Format)",
y = "Count")
Now, let’s create a similar plot displaying the distribution of check-ins by hour using a histogram instead of a bar plot:
R
data %>%
ggplot(aes(x = hour)) +
geom_histogram(color = "black", fill = "lightblue", binwidth = 1) +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
labs(title = "Check-In Distribution by Hour",
x = "Hour (24H Format)",
y = "Count")
As you may see, the plots look almost identical, save for the histogram having bars that touch (since the data is continuous and not discrete/categorical).
With histograms, however, we can create a more granular view by using smaller bins:
R
#create a decimal representation of the data (hour + minutes)
checkins_with_dec_hour <- data %>%
mutate(dec_hour = hour + minute/60)
#plot with 15 minute bins (0.25 hour bins)
checkins_with_dec_hour %>%
ggplot(aes(x = dec_hour)) +
geom_histogram(color = "black", fill = "lightblue", binwidth = 0.25) +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
labs(title = "Check-In Distribution by Hour (15-Minute Intervals)",
x = "Hour (24H Format)",
y = "Count")
Looking at this graph, it’s clearer that there is a large spike of check-ins early in the morning (between 7AM and 8AM). If you were to only look at the bar plot or the histogram with 1-hour bins, however, you may have assumed check-ins kept about the same rate throughout the whole morning (7AM - 10AM)!
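To make the binning arithmetic concrete, here is a small sketch in base R. The check-in times are invented for illustration; the decimal-hour calculation is the same one used for dec_hour, and each value is then assigned to the 15-minute bin it falls into:

```r
# hypothetical check-in times (hour and minute), invented for illustration
hour   <- c(7, 7, 7, 8, 9)
minute <- c(5, 10, 50, 20, 45)

# decimal hour, as in the lesson: 7:50 becomes 7 + 50/60
dec_hour <- hour + minute / 60

# start of the 15-minute (0.25-hour) bin each check-in falls into
bin_start <- floor(dec_hour / 0.25) * 0.25

# count check-ins per bin
table(bin_start)
```

Note that this sketch places bin edges exactly on the quarter hours. geom_histogram() may place its bin edges differently by default; adding boundary = 0 alongside binwidth = 0.25 should line the bins up with the quarter hours.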
Visualizing Location Data with Maps
When working with geographic or location data, it’s often useful to visualize it on a map. Throughout the next section, we’ll demonstrate ways to work with spatial data using the Game of Thrones Dataset!
First, let’s load the sf package. This package allows
ggplot2 to work with spatial data (like shape files):
R
library(sf)
Next, let’s load in the map data containing our map polygons:
R
#read in data and save to object
westeros_map <- st_read(here("data", "polygons_GoT.geojson"), quiet = TRUE)
#look at the data structure
head(westeros_map, 3)
Finally, let’s load the voting data and link it to our map data using
the merge function. This function allows two tables to
be linked based on a shared variable (in our case, the “id”):
R
#read in data and save to object
got_votes <- read_csv(here("data", "voting_GoT.csv"))
#look at the data structure
head(got_votes)
#join data using the merge function
westeros_voting <- merge(westeros_map, got_votes, by = "id")
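As a minimal sketch of how merge behaves, consider two small data frames (both invented for illustration, and plain data frames rather than sf objects). By default, only ids present in both tables are kept, which is worth checking whenever regions seem to disappear from a map:

```r
# hypothetical region table and vote table, invented for illustration
regions <- data.frame(id = c(1, 2, 3), Name = c("North", "Vale", "Reach"))
votes   <- data.frame(id = c(1, 2, 4), Jon_Snow_pct = c(60, 45, 52))

# by default, merge() keeps only ids found in both tables
both <- merge(regions, votes, by = "id")
both

# all.x = TRUE keeps every region and fills missing votes with NA
all_regions <- merge(regions, votes, by = "id", all.x = TRUE)
all_regions
```

Comparing nrow() before and after the merge is a quick way to confirm that every region found a match.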
Map Introduction
Now that our data is ready to be mapped, let’s start by visualizing which regions favor Jon Snow over Daenerys Targaryen.
When using spatial data, we use a special ggplot function called
geom_sf. Simply put, this tells ggplot to look at the simple
features (like lines or polygons) in your data and use them to draw the
graph!
Below, we will be using geom_sf on our combined data and
use Jon_Snow_pct to determine the level of support Jon Snow is getting
from each region:
R
ggplot() +
geom_sf(data = westeros_voting, aes(fill = Jon_Snow_pct)) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Support for Jon Snow across Westeros",
fill = "Support %") +
theme_bw()
Next, let’s do the same for Daenerys Targaryen, but with red instead of blue for the color scale:
R
# Create a map colored by Daenerys support
ggplot() +
geom_sf(data = westeros_voting, aes(fill = Daenerys_Targaryen_pct)) +
scale_fill_gradient(low = "pink", high = "darkred") +
labs(title = "Support for Daenerys Targaryen across Westeros",
fill = "Support %") +
theme_bw()
Conditional Map Coloring
Often, it may be more beneficial to color each part of the map according to the candidate that received the most votes, rather than displaying the amount of support a single candidate received.
This can be achieved by determining which candidate received the most
votes and filling that section with that candidate’s color using the
scale_fill_manual function:
R
#create a column with the name of the dominant candidate
westeros_voting$dominant <- ifelse(westeros_voting$Jon_Snow_pct > westeros_voting$Daenerys_Targaryen_pct,
"Jon Snow", "Daenerys Targaryen")
#pick fill colors based on the dominant candidate
dom_color <- c("Jon Snow" = "steelblue",
"Daenerys Targaryen" = "firebrick")
#create a map with the specified coloring
ggplot() +
geom_sf(data = westeros_voting, aes(fill = dominant)) +
scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
labs(title = "Dominant Candidate by Region") +
theme_bw()
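One detail of the ifelse() call above: when both candidates have exactly the same percentage, the condition is FALSE and the region is labeled "Daenerys Targaryen". If ties need their own category, a sketch using dplyr's case_when() could look like this (the percentages are invented for illustration, and the "Tie" label is an assumption, not part of the lesson data):

```r
library(dplyr)

# hypothetical support percentages, invented for illustration
jon  <- c(60, 50, 40)
dany <- c(40, 50, 60)

# label the leader in each region, with an explicit category for ties
dominant <- case_when(
  jon > dany ~ "Jon Snow",
  jon < dany ~ "Daenerys Targaryen",
  TRUE       ~ "Tie"
)
dominant
```

If you take this approach, the named color vector passed to scale_fill_manual would also need an entry for "Tie".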
In some cases, you may be interested not only in who won each region, but also by how much. To map this, first calculate the margin of victory, then add a column that sorts each margin into a low, medium, or high bin:
R
#calculate margin of victory
westeros_voting$margin <- abs(westeros_voting$Jon_Snow_pct - westeros_voting$Daenerys_Targaryen_pct)
#bin the margin into three levels (low, med, high)
westeros_voting$margin_bin <- ifelse(
westeros_voting$margin <= 5, "Low",
ifelse(westeros_voting$margin <= 20, "Med",
"High")
)
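The nested ifelse() above can also be written with base R's cut() function, which becomes easier to read as the number of bins grows. The sketch below uses invented margins and shows that both approaches assign the same bins:

```r
# hypothetical margins of victory (percentage points), invented for illustration
margin <- c(2, 5, 12, 20, 35)

# nested ifelse(), as in the lesson
bin_ifelse <- ifelse(margin <= 5, "Low",
              ifelse(margin <= 20, "Med", "High"))

# cut() with right-closed intervals: (-Inf, 5], (5, 20], (20, Inf]
bin_cut <- cut(margin,
               breaks = c(-Inf, 5, 20, Inf),
               labels = c("Low", "Med", "High"))

# both approaches assign the same bins
data.frame(margin, bin_ifelse, bin_cut)
```

One difference is that cut() returns a factor whose levels are ordered Low, Med, High, while ifelse() returns plain character values.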
Using the information you gained above, you can now develop your “fill rule” and select the color that corresponds to each instance. In this case, your “fill rule” consists of the winner of each region (e.g. Jon Snow) and how high their margin of victory was (e.g. High):
R
#make a fill rule (e.g. Jon Snow - High)
westeros_voting$marg_fill <- paste(westeros_voting$dominant, westeros_voting$margin_bin, sep = " - ")
#pick fill colors based on the fill rule!
marg_color <- c(
"Daenerys Targaryen - High" = "brown4",
"Daenerys Targaryen - Med" = "firebrick",
"Daenerys Targaryen - Low" = "pink",
"Jon Snow - High" = "darkblue",
"Jon Snow - Med" = "royalblue",
"Jon Snow - Low" = "lightblue"
)
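Because the colors are matched to the fill rule by name, a typo or a missing entry means a category has no color assigned (in scale_fill_manual, such regions are typically drawn in the grey NA color). A quick check with setdiff() can catch this before plotting. The sketch below uses invented values and a deliberately incomplete palette (`palette_sketch` is hypothetical):

```r
# hypothetical fill values, in the same format as the lesson's fill rule
marg_fill <- c("Jon Snow - High", "Jon Snow - Low", "Daenerys Targaryen - Med")

# a deliberately incomplete palette, to show what the check catches
palette_sketch <- c("Jon Snow - High" = "darkblue",
                    "Daenerys Targaryen - Med" = "firebrick")

# categories in the data that have no color assigned
missing_colors <- setdiff(unique(marg_fill), names(palette_sketch))
missing_colors
```

An empty result means every category in the data has a matching color.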
Your final step is to combine your fill rule and chosen colors with your mapping information, creating your margin of victory map:
R
#create margin of victory map
ggplot() +
geom_sf(data = westeros_voting, aes(fill = marg_fill)) +
scale_fill_manual(name = "Winner & Margin", values = marg_color) +
labs(title = "Margin of Victory in Each Region") +
theme_bw()
Adding Map Labels
After ensuring your map includes all the information required, the
final step is adding region labels! Unfortunately, due to the nature of
polygons, this is a bit more difficult than simply using the
labs function.
To add region labels, your first step is to convert your data to a simple features object, also known as an sf object. This will allow you to calculate where your labels will sit on your map:
R
#convert to sf
westeros_voting_sf <- st_as_sf(westeros_voting)
Your second step is to determine where your region labels will sit on
your map! This is completed by calculating
the centroids, or center points, of
each region. Below, we will calculate the centroid of each region and
convert its x and y coordinates to columns for easier access:
R
#calculate centroid
region_centroids <- st_centroid(westeros_voting_sf)
#extract the coordinates
coords <- st_coordinates(region_centroids)
#convert coordinates to columns coords.X and coords.Y
region_centroids$coords.X <- coords[, 1]
region_centroids$coords.Y <- coords[, 2]
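To see what these two functions return, here is a minimal sketch on a single made-up polygon (a 2 x 2 square invented for illustration). Its centroid is the middle of the square, and st_coordinates() returns that point as a matrix with columns X and Y:

```r
library(sf)

# a hypothetical 2 x 2 square region, invented for illustration
square <- st_polygon(list(rbind(c(0, 0), c(2, 0), c(2, 2), c(0, 2), c(0, 0))))
toy_region <- st_sf(Name = "Square", geometry = st_sfc(square))

# centroid of the square (sf may warn about attribute assumptions here)
toy_centroid <- st_centroid(toy_region)

# coordinates as a matrix with columns X and Y
toy_coords <- st_coordinates(toy_centroid)
toy_coords
```

For irregular or crescent-shaped regions, the centroid can fall outside the polygon, so it is worth checking label positions on the final map.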
Now that we have determined where the region labels will be placed,
we can finally add the region labels onto the map using the
geom_text function.
Within this function, we can specify the data used (in this case,
region_centroids), the coordinates, the information that
will be used for the label, and text formatting information (like size
and bold/italics)!
Additionally, it’s important to note that we need to use
westeros_voting_sf as the data for the map instead of
westeros_voting. This ensures that the map and the region labels
are drawn from the same object, so the labels sit in the correct locations!
R
#create a map with the specified coloring
ggplot() +
geom_sf(data = westeros_voting_sf, aes(fill = dominant)) +
geom_text(data = region_centroids,
aes(x = coords.X, y = coords.Y, label = Name),
size = 2, fontface = "bold") +
scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
labs(title = "Dominant Candidate by Region") +
theme_bw()
As you may notice, some labels in dense areas overlap a lot!
This happens because the map is drawn at a small size in your R session. To
resolve this, you can export the map at a larger size using
ggsave (which will be covered at the end of this
lesson!).
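As an alternative to calculating label coordinates by hand, ggplot2 also provides geom_sf_text(), which derives a label position from each geometry. The sketch below uses two made-up square regions (invented for illustration); the label point it picks is a point on the surface of each region, which may differ slightly from the centroid:

```r
library(ggplot2)
library(sf)

# two hypothetical square regions, invented for illustration
sq1 <- st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))
sq2 <- st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
toy_map <- st_sf(Name = c("West", "East"), geometry = st_sfc(sq1, sq2))

# geom_sf_text() places each label at a point derived from the geometry
toy_plot <- ggplot(toy_map) +
  geom_sf(fill = "lightblue") +
  geom_sf_text(aes(label = Name), size = 3, fontface = "bold") +
  theme_bw()
toy_plot
```

The manual approach shown earlier gives more control over where each label sits, so which one is preferable depends on the map.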
Exercise
Using what you’ve learned above, create a map displaying the peak
check-in wait times across the first 35 precincts. For this lesson, we
will be using the avg_checkins.csv file we created within
“Data Wrangling with dplyr”!
To complete this map, use the following steps:
1. Read in your data as “checkin_data”.
2. Using the merge function, link together your “checkin_data” with the “westeros_map”, creating a “westeros_checkins” dataframe. Hint: if the linking columns are named differently, use by.x and by.y to specify the two names (with x being the first data and y being the second).
3. Generate your map based on the “westeros_checkins” data, filling each region based on the avg_checkin_length.
4. Choose a title and change the name of the legend to “Check-In Times”.
R
#read in data
checkin_data <- read_csv(here("data", "avg_checkins.csv"))
#link together map and checkin_data
westeros_checkins <- merge(westeros_map, checkin_data, by.x = "id", by.y = "precinct")
#generate map with a title and legend name
ggplot() +
geom_sf(data = westeros_checkins, aes(fill = avg_checkin_length)) +
labs(title = "Average Check-In Times Across Westeros",
fill = "Check-In Times") +
theme_bw()
Customization
ggplot2 Themes
In addition to theme_bw(), which changes the plot
background to white, ggplot2 comes with
several other themes which can be useful to quickly change the look of
your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html.
theme_minimal() and theme_light() are popular,
and theme_void() can be useful as a starting point to
create a new hand-crafted theme.
The ggthemes
package provides a wide variety of options (including an Excel 2003
theme). The ggplot2
extensions website provides a list of packages that extend the
capabilities of ggplot2, including
additional themes.
Custom Themes
If you do not like the themes offered, or you’d like to change a
portion of a theme, you can use the theme() function to
manually customize your maps and plots!
The theme() function allows you to customize all
portions of a ggplot, including the text, title, subtitle, and grids.
You can find the full list in
the documentation or by using the panel on the right and navigating
to the theme help page (Help > Packages > ggplot2
> theme).
Below, we will be applying a few of these customizations to a plot from earlier in the lesson:
R
prop_device %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location, y = prop)) +
geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
labs(title = "Proportion of Check-Ins by Location for Each Device",
x = "Location",
y = "Proportion",
fill = "Device") +
theme_bw() +
theme(
text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(face = "italic"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.border = element_rect(color = "grey70")
)
Note: it is also possible to change the fonts of your plots! If you
are on Windows, you will have to install the extrafont
package before doing so.
Additionally, if you like the changes you created better than the default themes, you can save them as a custom theme and apply it to other plots:
R
my_theme <- theme_bw() +
theme(
text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(face = "italic"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.border = element_rect(color = "grey70")
)
prop_hour_device %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = hour, y = prop, fill = device)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Proportion of Check-Ins by Hour for Each Device",
x = "Hour (24H Format)",
y = "Proportion",
fill = "Device") +
scale_x_continuous(breaks = seq(0, 24, by = 1)) +
my_theme
These themes can also be applied to maps, as seen below:
R
ggplot() +
geom_sf(data = westeros_voting, aes(fill = dominant)) +
scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
labs(title = "Dominant Candidate by Region") +
my_theme
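If you want a theme applied to every plot in a session without adding it each time, ggplot2's theme_set() function can make it the default. The sketch below uses a shortened stand-in for my_theme (`sketch_theme` is a hypothetical name) and R's built-in mtcars data as a placeholder:

```r
library(ggplot2)

# a shortened stand-in for the lesson's my_theme
sketch_theme <- theme_bw() +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

# theme_set() makes the theme the default and returns the previous one
old_theme <- theme_set(sketch_theme)

# this plot is drawn with sketch_theme without adding it explicitly
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Drawn with the session default theme")
print(p)

# restore the earlier default when done
theme_set(old_theme)
```

Keeping the value returned by theme_set() makes it easy to switch back, which is useful if only part of a script should use the custom look.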
Exercise
With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own using any of the data used throughout this lesson.
You can use the RStudio ggplot2
cheat sheet for inspiration.
Here are some ideas:
- Make a line plot showing the cumulative number of check-ins over the course of the day.
- Try using a different color palette for your device comparison.
- Generate a new map using the GoT data.
Plot Output
After creating a plot, you may want to save it as a png (or other
format). To do this, you can use the ggsave()
function, which allows you to easily change the dimensions and resolution
of your plot by adjusting the appropriate arguments (width,
height and dpi) before saving the plot to the
specified directory.
Here, we will save one of the plots we customized above:
R
plot <- prop_device %>%
filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
mutate(location = str_remove(location, "LOCATION_")) %>%
mutate(device = str_remove(device, "DEVICE_")) %>%
ggplot(aes(x = location, y = prop)) +
geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
labs(title = "Proportion of Check-Ins by Location for Each Device",
x = "Location",
y = "Proportion",
fill = "Device") +
theme_bw() +
theme(
text = element_text(size = 12),
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.title = element_text(face = "italic"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.border = element_rect(color = "grey70")
)
ggsave("fig-output/device_prop.png", plot, width = 10, height = 6, dpi = 300)
You can find the generated png in the fig-output folder of your working directory!
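Two practical details are worth sketching here: the output folder generally needs to exist before saving, and the pixel size of the file is the size in inches multiplied by dpi. The example below is a minimal sketch that writes to a temporary folder (the `out_dir` path and the placeholder plot built from R's built-in mtcars data are assumptions for illustration; in your project you would use your own folder and plot):

```r
library(ggplot2)

# hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "fig-output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# a placeholder plot built from the built-in mtcars data
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# 10 in x 6 in at 300 dpi gives a 3000 x 1800 pixel image
out_file <- file.path(out_dir, "sketch.png")
ggsave(out_file, p, width = 10, height = 6, dpi = 300)

# confirm that the file was written
file.exists(out_file)
```

Increasing width and height (rather than only dpi) is what gives crowded labels more room, since text size stays fixed in absolute terms.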
- ggplot2 is a flexible and useful tool for creating plots in R.
- The data set and coordinate system can be defined using the ggplot function.
- Additional layers, including geoms, are added using the + operator.
- Time-series data can be visualized using geom_line() and geom_point().
- Box plots are useful for visualizing the distribution of check-in times by location.
- Bar plots are useful for visualizing counts of check-ins by categorical variables.
- Faceting allows you to generate multiple plots based on a categorical variable like device.
- Spatial data can be visualized on maps using the sf and ggplot2 packages.