All in One View

Content from Before we Start


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • How to find your way around RStudio?
  • How to interact with R?
  • How to manage your environment?
  • How to install packages?

Objectives

  • Install the latest version of R.
  • Install the latest version of RStudio.
  • Navigate the RStudio GUI.
  • Install additional packages using the packages tab.
  • Install additional packages using R code.

What is R? What is RStudio?


The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.

RStudio is currently a very popular way to not only write your R scripts but also to interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.

To make it easier to interact with R, we will use RStudio. RStudio is the most popular IDE (Integrated Development Environment) for R. An IDE is a piece of software that provides tools to make programming easier.

You can also use the R Presentations feature to present your work in an HTML5 presentation mixing Markdown and R code. You can display these within RStudio or your browser. There are many options for customizing your presentation slides, including an option for showing LaTeX equations. This can help you collaborate with others and also has applications in teaching and classroom use.

Why learn R?


R does not involve lots of pointing and clicking, and that’s a good thing

The learning curve might be steeper than with other software but with R, the results of your analysis do not rely on remembering a succession of pointing and clicking, but instead on a series of written commands, and that’s a good thing! So, if you want to redo your analysis because you collected more data, you don’t have to remember which button you clicked in which order to obtain your results; you just have to run your script again.

Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.

Working with scripts forces you to have a deeper understanding of what you are doing, and facilitates your learning and comprehension of the methods you use.

R code is great for reproducibility

Reproducibility is when someone else (including your future self) can obtain the same results from the same dataset when using the same analysis.

R integrates with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript are updated automatically.

An increasing number of journals and funding agencies expect analyses to be reproducible, so knowing R will give you an edge with these requirements.

To further support reproducibility and transparency, there are also packages that help with dependency management: keeping track of which packages you load and which versions of them your analysis depends on. This helps you make sure existing workflows run consistently and continue doing what they did before.

Packages like renv let you “save” and “load” the state of your project library, also keeping track of the package version you use and the source it can be retrieved from.
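As a rough sketch, a typical renv workflow uses three functions from the renv package (the calls are commented out here because they modify your project on disk):

```r
# One-time setup:
# install.packages("renv")

# renv::init()      # create a project-local library and an renv.lock file
# renv::snapshot()  # record the packages and versions your project uses
# renv::restore()   # reinstall those exact versions later, or on another machine
```

The renv.lock file can be committed alongside your scripts, so collaborators can recreate the same package environment.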

R is interdisciplinary and extensible

With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.

R works on data of all shapes and sizes

The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you.

R is designed for data analysis. It comes with special data structures and data types that make handling of missing data and statistical factors convenient.

R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.

R produces high-quality graphics

The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to convey most effectively the message from your data.

R has a large and welcoming community

Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, or on the RStudio community. Questions which are backed up with short, reproducible code snippets are more likely to attract knowledgeable responses.

Not only is R free, but it is also open-source and cross-platform

Anyone can inspect the source code to see how R works. Because of this transparency, there is less chance for mistakes, and if you (or someone else) find some, you can report and fix bugs.

Because R is open source and is supported by a large community of developers and users, there is a very large selection of third-party add-on packages which are freely available to extend R’s native capabilities.

RStudio extends what R can do, and makes it easier to write R code and interact with R.

A tour of RStudio

Knowing your way around RStudio

Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.

The RStudio IDE open-source product is free under the Affero General Public License (AGPL) v3. The RStudio IDE is also available with a commercial license and priority email support from RStudio, Inc.

We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we create, and visualize the plots we generate. RStudio can also be used for other things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.

One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, RStudio provides many shortcuts, auto completion, and highlighting for the major file types you use while developing in R. RStudio makes typing easier and less error-prone.

Getting set up

It is good practice to keep a set of related data, analyses, and text self-contained in a single folder called the working directory. All of the scripts within this folder can then use relative paths to files. Relative paths indicate where inside the project a file is located (as opposed to absolute paths, which point to where a file is on a specific computer). Working this way makes it a lot easier to move your project around on your computer and share it with others without having to directly modify file paths in the individual scripts.
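As a small illustration (the file names below are hypothetical), a relative path can be built with file.path() and is resolved from the working directory, while an absolute path is tied to one specific computer:

```r
# A relative path points inside the project:
relative <- file.path("data", "checkin_data.csv")
relative   # "data/checkin_data.csv"

# An absolute path only works on one particular machine:
absolute <- file.path("C:", "Users", "someone", "project", "data", "checkin_data.csv")
```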

RStudio provides a helpful set of tools to do this through its “Projects” interface, which not only creates a working directory for you but also remembers its location (allowing you to quickly navigate to it). The interface also (optionally) preserves custom settings and open files to make it easier to resume work after a break.

Create a new project

  • Under the File menu, click on New project, choose New directory, then New project
  • Enter a name for this new folder (or “directory”) and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/data-carpentry)
  • Click on Create project
  • Create a new file where we will type our scripts. Go to File > New File > R script. Click the save icon on your toolbar and save your script as “script.R”.

The simplest way to open an RStudio project once it has been created is to navigate through your files to where the project was saved and double click on the .Rproj (blue cube) file. This will open RStudio and start your R session in the same directory as the .Rproj file. All your data, plots, and scripts will now be relative to the project directory. RStudio projects have the added benefit of allowing you to open multiple projects at the same time, each in its own session and working directory, so they don’t interfere with each other.

The RStudio Interface

Let’s take a quick tour of RStudio.

Screenshot of the RStudio startup screen

RStudio is divided into four “panes”. The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout).

The Default Layout is:

  • Top Left - Source: your scripts and documents
  • Bottom Left - Console: what R would look and be like without RStudio
  • Top Right - Environment/History: look here to see what you have done
  • Bottom Right - Files and more: see the contents of the project/working directory here, like your Script.R file

Organizing your working directory

Using a consistent folder structure across your projects will help keep things organized and make it easy to find/file things in the future. This can be especially helpful when you have multiple projects. In general, you might create directories (folders) for scripts, data, and documents. Here are some examples of suggested directories:

  • data/ Use this folder to store your raw data and intermediate data sets. For the sake of transparency and provenance, you should always keep a copy of your raw data accessible and do as much of your data cleanup and pre-processing programmatically (i.e., with scripts, rather than manually) as possible.
  • data_output/ When you need to modify your raw data, it might be useful to store the modified versions of the data sets in a different folder.
  • documents/ Used for outlines, drafts, and other text.
  • fig_output/ This folder can store the graphics that are generated by your scripts.
  • scripts/ A place to keep your R scripts for different analyses or plotting.

You may want additional directories or subdirectories depending on your project needs, but these should form the backbone of your working directory.

Example of a working directory structure

The working directory

The working directory is an important concept to understand. It is the place where R will look for and save files. When you write code for your project, your scripts should refer to files in relation to the root of your working directory and only to files within this structure.

Using RStudio projects makes this easy and ensures that your working directory is set up properly. If you need to check it, you can use getwd(). If for some reason your working directory is not the same as the location of your RStudio project, it is likely that you opened an R script or RMarkdown file not your .Rproj file. You should close out of RStudio and open the .Rproj file by double clicking on the blue cube! If you ever need to modify your working directory in a script, setwd('my/path') changes the working directory. This should be used with caution since it makes analyses hard to share across devices and with other users.
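In the console, checking and using the working directory looks roughly like this (the example path in the comment is hypothetical):

```r
getwd()               # e.g. "/home/user/data-carpentry" when the project is open
# setwd("my/path")    # changes the working directory; use with caution
file.exists("data")   # checks a path relative to the working directory
```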

Downloading the data and getting set up

For this lesson we will use the following folders in our working directory: data/ and fig_output/. Let’s write them all in lowercase to be consistent. We can create them using the RStudio interface by clicking on the “New Folder” button in the file pane (bottom right), or directly from R by typing at the console:

R

dir.create("data")
dir.create("fig_output")

You can either download the data used for this lesson from GitHub or with R.

Check-In Dataset:

You can either copy the data from GitHub and paste it into a file called checkin_data.csv in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/checkin_data.csv",
  "data/checkin_data.csv", mode = "wb"
  )

Check-In Plotting Dataset:

You can either copy the data from this GitHub link and paste it into a file called checkin_sample_plotting.csv in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/checkin_sample_plotting.csv",
  "data/checkin_sample_plotting.csv", mode = "wb"
  )

Messy Dataset:

You can either copy the data from this GitHub link and paste it into a file called messy_data.csv in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/messy_data.csv",
  "data/messy_data.csv", mode = "wb"
  )

Game of Thrones Dataset:

You can either copy the data from this GitHub link and paste it into a file called voting_GoT.csv in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/voting_GoT.csv",
  "data/voting_GoT.csv", mode = "wb"
  )

You can either copy the data from this GitHub link and paste it into a file called polygons_GoT.geojson in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/polygons_GoT.geojson",
  "data/polygons_GoT.geojson", mode = "wb"
  )

JSON Check-In Dataset:

You can either copy the data from this GitHub link and paste it into a file called checkin_snippet.json in the data/ directory or copy-paste the code chunk below into your console:

R

download.file(
  "https://raw.githubusercontent.com/EngineeringForDemocracy/r-election-workers/main/episodes/data/checkin_snippet.json",
  "data/checkin_snippet.json", mode = "wb"
  )

Interacting with R

The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.

There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.

Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.

RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Mac, Cmd + Return will work). The command on the current line in the script (indicated by the cursor) or all of the commands in selected text will be sent to the console and executed when you press Ctrl + Enter. If there is information in the console you do not need anymore, you can clear it with Ctrl + L. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.

At some point in your analysis, you may want to check the content of a variable or the structure of an object without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts to jump between the script and the console panes.

If R is ready to accept commands, the R console shows a > prompt. If R receives a command (by typing, copy-pasting, or sent from the script editor using Ctrl + Enter), R will try to execute it and, when ready, will show the results and come back with a new > prompt to wait for new commands.

If R is still waiting for you to enter more text, the console will show a + prompt. It means that you haven’t finished entering a complete command. This is likely because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses or the same number of opening and closing quotation marks. When this happens, and you thought you finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt. You can then proofread the command(s) you entered and correct the error.
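A quick sketch of what this looks like: an unclosed call leaves the console at the + prompt until the command is completed (or cancelled with Esc):

```r
# > round(3.14159          <- unclosed parenthesis: console shows "+"
# + , digits = 2)          <- finishing the command returns you to ">"

# The completed command:
round(3.14159, digits = 2)
# [1] 3.14
```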

Installing additional packages using the packages tab

In addition to the core R installation, there are in excess of 10,000 additional packages which can be used to extend the functionality of R. Many of these have been written by R users and have been made available in central repositories, like the one hosted at CRAN, for anyone to download and install into their own R environment. You should have already installed the packages ‘ggplot2’ and ‘dplyr’. If you have not, please do so now using these instructions.

You can see if you have a package installed by looking in the packages tab (on the lower-right by default). You can also type the command installed.packages() into the console and examine the output.
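From code, a common idiom is to check a package name against the rows of installed.packages() (the package name below is just an example):

```r
# installed.packages() returns a matrix with one row per installed package
pkgs <- rownames(installed.packages())
"ggplot2" %in% pkgs  # TRUE if ggplot2 is installed, FALSE otherwise
```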

Screenshot of Packages pane

Additional packages can be installed from the ‘packages’ tab. On the packages tab, click the ‘Install’ icon and start typing the name of the package you want in the text box. As you type, packages matching your starting characters will be displayed in a drop-down list so that you can select them.

Screenshot of Install Packages Window

At the bottom of the Install Packages window is a check box labelled ‘Install dependencies’. This is ticked by default, which is usually what you want. Packages can (and do) make use of functionality built into other packages, so for the functionality contained in the package you are installing to work properly, there may be other packages which have to be installed with them. The ‘Install dependencies’ option makes sure that this happens.

Challenge

Exercise

Use both the Console and the Packages tab to confirm that you have the tidyverse installed.

Scroll down through the Packages tab to ‘tidyverse’. You can also type a few characters into the search box. The ‘tidyverse’ package is really a package of packages, including ‘ggplot2’ and ‘dplyr’, both of which require other packages to run correctly. All of these packages will be installed automatically. Depending on what packages have previously been installed in your R environment, the install of ‘tidyverse’ could be very quick or could take several minutes. As the install proceeds, messages relating to its progress will be written to the console. You will be able to see all of the packages which are actually being installed.

Because the install process accesses the CRAN repository, you will need an Internet connection to install packages.

It is also possible to install packages from other repositories, as well as from GitHub or the local file system, but we won’t be looking at these options in this lesson.

Installing additional packages using R code

If you were watching the console window when you started the install of ‘tidyverse’, you may have noticed that the line

R

install.packages("tidyverse")

was written to the console before the start of the installation messages.

You could also have installed the tidyverse packages by running this command directly in the R console.

We will be using additional packages to manage paths, plots, json files, and shape files. We will discuss these in more detail in a later episode, but we will install them now in the console:

R

install.packages(c("here", "lattice", "sf", "jsonlite"))

Key Points
  • Use RStudio to write and run R programs.
  • Use install.packages() to install packages (libraries).

Content from Introduction to R


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • What data types are available in R?
  • What is an object?
  • How can objects of different data types be assigned to names?
  • What arithmetic and logical operators can be used?
  • How can subsets be extracted from vectors?
  • How does R treat missing values?
  • How can we deal with missing values in R?
  • How can we work with dates and times in R?

Objectives

  • Define the following terms as they relate to R: object, assign, call, function, arguments, options.
  • Assign values to names in R.
  • Learn how to name objects.
  • Use comments to inform script.
  • Solve simple arithmetic operations in R.
  • Call functions and use arguments to change their default options.
  • Inspect the content of vectors and manipulate their content.
  • Subset values from vectors.
  • Analyze vectors with missing data.
  • Work with dates and times in R using proper data types.

Creating Objects in R


You can get output from R simply by typing math in the console:

R

3 + 5

OUTPUT

[1] 8

R

12 / 7

OUTPUT

[1] 1.714286

Everything that exists in R is an object: from simple numerical values, to strings, to more complex objects like vectors, matrices, and lists. Even expressions and functions are objects in R.

However, to do useful and interesting things, we need to name objects. To do so, we type the name we want, followed by the assignment operator <-, and then the value we want it to refer to:

R

num_precincts <- 5

<- is the assignment operator. It assigns values (objects) on the right to names (also called symbols) on the left. So, after executing x <- 3, the value of x is 3. The arrow can be read as 3 goes into x. For historical reasons, you can also use = for assignments, but not in every context. Because of the slight differences in syntax, it is good practice to always use <- for assignments. More generally, we prefer the <- syntax over = because it makes clear in which direction the assignment operates (left assignment), and it increases the readability of the code.

In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.

Objects can be given any name such as x, current_temperature, or subject_id. You want your object names to be explicit and not too long. They cannot start with a number (2x is not valid, but x2 is). R is case sensitive (e.g., age is different from Age). There are some names that cannot be used because they are the names of fundamental objects in R (e.g., if, else, for; see R’s reserved words for a complete list). In general, even if it’s allowed, it’s best not to reuse the names of common objects and functions (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use.

It’s also best to avoid dots (.) within an object name, as in my.dataset. There are many objects in R with dots in their names for historical reasons, but because dots have a special meaning in R (for methods) and in other programming languages, it’s best to avoid them. The recommended writing style is called snake_case, which implies using only lowercase letters and numbers and separating each word with underscores (e.g., animals_weight, average_income). It is also recommended to use nouns for object names, and verbs for function names.

It’s important to be consistent in the styling of your code (where you put spaces, how you name objects, etc.). Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, three popular style guides are Google’s, Jean Fan’s, and the tidyverse’s. The tidyverse’s is very comprehensive and may seem overwhelming at first. You can install the lintr package to automatically check for issues in the styling of your code.
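A few of these rules can be seen directly in the console (the names below are made up for illustration):

```r
x2 <- 10          # valid: names may contain digits, just not start with one
# 2x <- 10        # invalid: starting with a digit raises an error

age <- 25
Age <- 30         # R is case sensitive: age and Age are different objects
age == Age        # FALSE

average_income <- 52000  # snake_case: lowercase words joined by underscores
```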

Callout

Objects vs. Variables

The naming of objects in R is somehow related to variables in many other programming languages. In many programming languages, a variable has three aspects: a name, a memory location, and the current value stored in this location. R abstracts from modifiable memory locations. In R we only have objects which can be named. Depending on the context, name (of an object) and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

When assigning a value to a name, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:

R

num_precincts <- 5    # doesn't print anything
(num_precincts <- 5)  # putting parentheses around the call prints the value of `num_precincts`

OUTPUT

[1] 5

R

num_precincts         # and so does typing the name of the object

OUTPUT

[1] 5

Now that R has num_precincts in memory, we can do arithmetic with it. For instance, we may want to calculate the number of registered voters (assuming there are 1500 voters per precinct):

R

1500 * num_precincts

OUTPUT

[1] 7500

We can also change the value assigned to a name by assigning it a new one:

R

num_precincts <- 10
1500 * num_precincts

OUTPUT

[1] 15000

This means that assigning a value to one name does not change the values of other names. For example, let’s name the number of voters num_voters:

R

num_voters <- 1500 * num_precincts

Next, let’s change (reassign) num_precincts to 50:

R

num_precincts <- 50
Challenge

Exercise

What do you think is the current value of num_voters? 15000 or 75000?

The value of num_voters is still 15000. This is because you have not re-run the line num_voters <- 1500 * num_precincts since changing the value of num_precincts.
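You can verify this behavior in the console:

```r
num_precincts <- 10
num_voters <- 1500 * num_precincts  # num_voters is now 15000
num_precincts <- 50                 # reassigning num_precincts...
num_voters                          # ...does not update num_voters
# [1] 15000
```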

Comments


All programming languages allow the programmer to include comments in their code. Including comments in your code has many advantages: it helps you explain your reasoning and it forces you to be tidy. Commented code is also a great tool, not only for your collaborators, but for your future self. Comments are the key to a reproducible analysis.

To do this in R we use the # character. Anything to the right of the # sign and up to the end of the line is treated as a comment and is ignored by R. You can start lines with comments or include them after any code on the line.

R

num_precincts <- 10      #number of precincts
num_voters <- 1500 * num_precincts  #calculate the total number of voters
num_voters        #print the total number of voters

OUTPUT

[1] 15000

RStudio makes it easy to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.

Challenge

Exercise

  1. Create two variables ballot_cost and ballots_needed and assign them values.

  2. Create a third variable total_cost and give it a value based on the current values of ballot_cost and ballots_needed.

  3. Show that changing the values of either ballot_cost or ballots_needed does not affect the value of total_cost.

R

#set the values of ballot_cost and ballots_needed
ballot_cost <- 0.0125
ballots_needed <- 2250

#give total_cost a value
total_cost <- ballot_cost * ballots_needed

#print current value of total_cost
total_cost

OUTPUT

[1] 28.125

R

#change the values of ballot_cost and ballots_needed
ballot_cost <- 0.068
ballots_needed <- 3000

#show that the value of total_cost hasn't changed
total_cost

OUTPUT

[1] 28.125

Functions and Their Arguments

Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually gets one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:

R

b <- sqrt(a)

Here, the value of a is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to the name b. This function is very simple, because it takes just one argument.

The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a data set. We’ll see that when we read data files into R.

Arguments can be anything, not only numbers or file names, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.

Using the total_cost we calculated above, let’s try a function that can take multiple arguments: round().

R

round(total_cost)

OUTPUT

[1] 28

Here, we’ve called round() with just one argument, total_cost, and it has returned the value 28. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) or look at the help for this function using ?round.

R

args(round)

OUTPUT

function (x, digits = 0, ...)
NULL

R

?round

We see that if we want a different number of digits, we can type digits=2 or however many we want.

R

round(total_cost, digits = 2)

OUTPUT

[1] 28.12

If you provide the arguments in the exact same order as they are defined you don’t have to name them:

R

round(total_cost, 2)

OUTPUT

[1] 28.12

And if you do name the arguments, you can switch their order:

R

round(digits = 2, x = total_cost)

OUTPUT

[1] 28.12

It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing.

Challenge

Exercise

As you may have noticed, in both cases of rounding, the total_cost rounded down. However, when calculating the total cost of something, you should always round UP to the nearest dollar or cent.

For this exercise, type in ?round at the console and then look at the output in the Help panel. What other function similar to round should be used instead? Apply this function to round up to the nearest dollar.

Bonus: apply this function to round to the nearest cent.

The ceiling function rounds up to the nearest integer!

R

ceiling(total_cost)

OUTPUT

[1] 29

To use the function to round to the nearest cent, you can do the following:

R

ceiling(total_cost * 100) / 100

OUTPUT

[1] 28.13

Vectors and Data Types


A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers, characters, or other data types. We can assign a series of values to a vector using the c() function. For example, we can create a vector of job type strings, and we can create another vector storing numbers of votes at different precincts:

R

votes_per_precinct <- c(1000, 4300, 2340, 7190)
votes_per_precinct

OUTPUT

[1] 1000 4300 2340 7190

R

job_types <- c("check-in", "check-out", "supervisor")
job_types

OUTPUT

[1] "check-in"   "check-out"  "supervisor"

The quotes around “check-in”, “check-out”, and “supervisor” are essential here. Without the quotes, R will assume there are objects called check-in, check-out, and supervisor. Since these names don’t exist in R’s memory, there will be an error message.

Additionally, you may notice there are no commas between the thousands. In R, you cannot write commas inside numbers, as R will interpret them as separating distinct elements of the vector.
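You can see this pitfall for yourself: inside c(), R parses 1,000 as the two separate numbers 1 and 000 (which is just 0):

```r
# A comma inside a number splits it into two arguments:
# R reads c(1,000) as c(1, 000), and 000 is the number 0
c(1,000)
# [1] 1 0
```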

There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:

R

length(votes_per_precinct)

OUTPUT

[1] 4

An important feature of a vector is that all of the elements are the same type of data. The function typeof() indicates the type of an object:

R

typeof(votes_per_precinct)

OUTPUT

[1] "double"

The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:

R

str(votes_per_precinct)

OUTPUT

 num [1:4] 1000 4300 2340 7190

You can use the c() function to add other elements to your vector:

R

devices_per_precinct <- c(5, 2)
devices_per_precinct <- c(devices_per_precinct, 9) # add to the end of the vector
devices_per_precinct <- c(6, devices_per_precinct) # add to the beginning of the vector
devices_per_precinct

OUTPUT

[1] 6 5 2 9

In the second line, we take the original vector devices_per_precinct, add the value 9 to the end of it, and save the result back into devices_per_precinct. In the third line, we add the value 6 to the beginning, again saving the result back into devices_per_precinct.

We can do this over and over again to grow a vector, or assemble a data set. As we program, this may be useful to add results that we are collecting or calculating.

An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:

  • "logical" for TRUE and FALSE (the boolean data type)
  • "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
  • "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
  • "raw" for bit-streams (we won’t be discussing this further)
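As a quick illustration of these types (the values here are arbitrary examples), typeof() reports the atomic type of any vector:

```r
# typeof() reports the atomic type of each value
typeof(TRUE)    # "logical"
typeof(2L)      # "integer" (the L suffix marks an integer)
typeof(2.5)     # "double"
typeof(1 + 4i)  # "complex"
typeof("a")     # "character"
```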

Date Types

Dates are a common data type that require special attention. In R, dates can be represented in two ways:

  1. As character strings (e.g., “2018-11-06 07:02:36”, “11/06/2018 07:02:36”)
  2. As Date or POSIXct objects which are special data types for dates and times

When dates are stored as strings, they’re treated like any other text:

R

checkin_times_as_strings <- c("2018-11-06 07:02:36", "2018-11-06 07:04:09", "2018-11-06 07:05:45")
typeof(checkin_times_as_strings)

OUTPUT

[1] "character"

However, storing dates as proper Date or POSIXct objects offers several advantages:

  • You can perform arithmetic with dates (calculate time differences)
  • You can extract components like month, year, or day
  • You can easily format dates for display
  • You can sort dates chronologically

To convert strings to Date or POSIXct objects, use the as.POSIXct() function:

R

#convert strings to POSIXct objects
checkin_times <- as.POSIXct(checkin_times_as_strings, format = "%Y-%m-%d %H:%M:%S")
typeof(checkin_times)

OUTPUT

[1] "double"

R

class(checkin_times)

OUTPUT

[1] "POSIXct" "POSIXt" 

The following “leap year” scenario highlights the importance of using proper date types. Consider the following example:

R

#BAD: using strings for date arithmetic
date_start <- "2020-02-28"
date_end <- "2020-03-01"

#attempt to calculate the difference by converting strings to numeric days
#here we use substr to extract the day portion in string format.
#it draws the characters at position 9 to 10 and converts them to numeric
difference_wrong <- as.numeric(substr(date_end, 9, 10)) - as.numeric(substr(date_start, 9, 10))
difference_wrong #incorrect!

OUTPUT

[1] -27

In this example, we extract the day portion of the dates as strings and subtract them. While this works for simple cases, it fails to account for:

  • The transition between months (e.g., February to March).
  • Leap years (e.g., February 29 in 2020).

Now, compare this with proper date types:

R

#GOOD: using Date for leap year handling
date_start_correct <- as.Date(date_start)
date_end_correct <- as.Date(date_end)

difference_correct <- as.numeric(date_end_correct - date_start_correct)
difference_correct #correctly computes 2 days, accounting for February 29 in the leap year

OUTPUT

[1] 2

Now, the number of days has been calculated properly!

It’s important to note that Date objects and POSIXct objects are not made equal and, while we used the two types interchangeably above, you should ensure you choose the one that fits your data needs. The key differences between Date objects and POSIXct objects can be seen below:

  • Date:
    • Represents dates without time.
    • Useful for operations where time is irrelevant (e.g., calculating the number of days between two dates).
    • Stored as the number of days since January 1, 1970.
  • POSIXct:
    • Represents both date and time.
    • Useful for operations involving time (e.g., calculating the number of seconds or hours between two timestamps).
    • Stored as the number of seconds since January 1, 1970.

Using proper date types ensures that leap years and other calendar-specific rules are handled correctly, making computations accurate and reliable.
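As a small sketch of the advantages listed earlier (the dates here are made-up examples, and d and t are illustrative names), proper date types support arithmetic, component extraction via format(), and chronological sorting:

```r
# arithmetic: Date objects subtract to give a difference in days
d <- as.Date("2018-11-06")
as.numeric(as.Date("2018-11-13") - d)   # 7

# extract components with format()
format(d, "%Y")   # "2018" (the year)

t <- as.POSIXct("2018-11-06 07:02:36", format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(t, "%H")   # "07" (the hour)

# sort chronologically
sort(as.Date(c("2018-11-06", "2018-01-15")))  # 2018-01-15 comes first
```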

Coercion

An important characteristic of vectors is that they can only contain elements of the same data type. If you attempt to combine different types in a vector, R will automatically convert them to a single, common type - a process called “coercion”. This follows a hierarchy: character > numeric (double) > integer > logical.

R

# Coercion examples
num_logical <- c(1, TRUE) # TRUE converted to 1
typeof(num_logical)

OUTPUT

[1] "double"

R

num_character <- c(1, "a") # 1 converted to "1"
typeof(num_character)

OUTPUT

[1] "character"

R

logical_character <- c(TRUE, "a") # TRUE converted to "TRUE"
typeof(logical_character)

OUTPUT

[1] "character"

R

tricky <- c(1, "2", TRUE) # Everything becomes character
typeof(tricky)

OUTPUT

[1] "character"

R will always try to find a common data type that doesn’t lose information. Typically, this means converting toward the more flexible type (with character being the most flexible).

Note: Date/POSIXct will always be treated as “numeric” (days/seconds since January 1st, 1970) when being coerced within a vector!
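Besides the automatic coercion shown above, you can also convert deliberately using the as.*() family of functions. A short sketch (note that elements that cannot be converted become NA, with a warning):

```r
as.numeric(c("1", "2"))      # strings "1", "2" become the numbers 1, 2
as.character(1:3)            # numbers become "1" "2" "3"
as.numeric(c(TRUE, FALSE))   # logicals become 1, 0
as.numeric("banana")         # cannot convert: NA (with a warning)
```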

Challenge

Exercise

  1. Predict the resulting data type for this vector: c(1.1, 2L, TRUE, "a")

  2. Create a vector that contains:

    • The number 5
    • The logical value FALSE
    • The string “data”

    What is the resulting data type? Why?

  1. The vector c(1.1, 2L, TRUE, "a") will have type “character” because character is the most flexible data type.

  2. The vector would be:

R

mixed <- c(5, FALSE, "data")
typeof(mixed)

OUTPUT

[1] "character"

It has type “character” because R coerces all elements to the most flexible data type that includes all values.

Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), tibbles (tbl), factors (factor), and arrays (array).

Subsetting vectors


Subsetting (sometimes referred to as extracting or indexing) involves accessing one or more values based on their numeric placement or “index” within a vector. If we want to subset one or several values from a vector, we must provide one index or several indices in square brackets. For instance:

R

job_types <- c("check-in", "check-out", "supervisor")
job_types[2]

OUTPUT

[1] "check-out"

R

job_types[c(3, 2)]

OUTPUT

[1] "supervisor" "check-out" 

We can also repeat the indices to create an object with more elements than the original one:

R

more_jobs <- job_types[c(1, 2, 3, 2, 1, 3)]
more_jobs

OUTPUT

[1] "check-in"   "check-out"  "supervisor" "check-out"  "check-in"
[6] "supervisor"

Conditional subsetting

Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:

R

votes_per_precinct <- c(1000, 4300, 2340, 7190)
votes_per_precinct[c(TRUE, FALSE, TRUE, TRUE)]

OUTPUT

[1] 1000 2340 7190

Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values greater than 2500:

R

votes_per_precinct > 2500    # will return logicals with TRUE for the indices that meet the condition

OUTPUT

[1] FALSE  TRUE FALSE  TRUE

R

## so we can use this to select only the values greater than 2500
votes_per_precinct[votes_per_precinct > 2500]

OUTPUT

[1] 4300 7190

You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):

R

votes_per_precinct[votes_per_precinct < 2000 | votes_per_precinct > 4000]

OUTPUT

[1] 1000 4300 7190

R

votes_per_precinct[votes_per_precinct >= 2000 & votes_per_precinct <= 4000]

OUTPUT

[1] 2340

Here, < stands for “less than”, > for “greater than”, >= for “greater than or equal to”, <= for “less than or equal to”, and == for “equal to”. The double equal sign == is a test for numerical equality between the left and right-hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).

A common task is to search for certain strings in a vector. One could use the “or” operator | to test for equality to multiple values, but this can quickly become tedious.

R

job_types <- c("check-in", "check-out", "supervisor")
job_types[job_types == "check-in" | job_types == "check-out"] # returns both check-in and check-out

OUTPUT

[1] "check-in"  "check-out"

The function %in% allows you to test if any of the elements of a search vector (on the left-hand side) are found in the target vector (on the right-hand side):

R

job_types %in% c("check-in", "check-out")

OUTPUT

[1]  TRUE  TRUE FALSE

Note that the output is the same length as the search vector on the left-hand side, because %in% checks whether each element of the search vector is found somewhere in the target vector. Thus, you can use %in% to select the elements in the search vector that appear in your target vector:

R

job_types[job_types %in% c("check-in", "check-out")]

OUTPUT

[1] "check-in"  "check-out"
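Combining %in% with the ! character (the NOT operator, which flips TRUE and FALSE) selects the elements that do not appear in the target vector:

```r
job_types <- c("check-in", "check-out", "supervisor")

# ! negates the logical vector, keeping elements NOT in the target vector
job_types[!(job_types %in% c("check-in", "check-out"))]
# [1] "supervisor"
```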

Missing Data


As R was designed to analyze data sets, it includes the concept of missing data (which is uncommon in other programming languages). Missing data are represented in vectors as NA.

When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result while ignoring the missing values.

R

#create vector
checkin_lengths <- c(64, 74, NA, 287)

#calc with NA
mean(checkin_lengths)

OUTPUT

[1] NA

R

max(checkin_lengths)

OUTPUT

[1] NA

R

#calc without NA
mean(checkin_lengths, na.rm = TRUE)

OUTPUT

[1] 141.6667

R

max(checkin_lengths, na.rm = TRUE)

OUTPUT

[1] 287

If your data include missing values, you may want to become familiar with the functions is.na(), na.omit(), and complete.cases(). See below for examples:

R

## Extract those elements which are not missing values.
## The ! character is also called the NOT operator
checkin_lengths[!is.na(checkin_lengths)]

OUTPUT

[1]  64  74 287

R

## Count the number of missing values.
## The output of is.na() is a logical vector (TRUE/FALSE equivalent to 1/0) so the sum() function here is effectively counting
sum(is.na(checkin_lengths))

OUTPUT

[1] 1

R

## Returns the object with incomplete cases removed. The returned object is an atomic vector of type `"numeric"` (or `"double"`).
na.omit(checkin_lengths)

OUTPUT

[1]  64  74 287
attr(,"na.action")
[1] 3
attr(,"class")
[1] "omit"

R

## Extract those elements which are complete cases. The returned object is an atomic vector of type `"numeric"` (or `"double"`).
checkin_lengths[complete.cases(checkin_lengths)]

OUTPUT

[1]  64  74 287

Recall that you can use the typeof() function to find the type of your atomic vector.

Challenge

Exercise

  1. Using this vector of check-in lengths, create a new vector with the NAs removed.

R

checkin_lengths <- c(54, 21, 74, 65, NA, 72, 21, 16, 46, 58, 43, 61, 39, 19, NA, 24)

  2. Use the function median() to calculate the median of the checkin_lengths vector.

  3. Use R to figure out how many check-ins took longer than 55 seconds.

R

#1.
checkin_lengths <- c(54, 21, 74, 65, NA, 72, 21, 16, 46, 58, 43, 61, 39, 19, NA, 24)
checkin_lengths_no_na <- checkin_lengths[!is.na(checkin_lengths)]
# or
checkin_lengths_no_na <- na.omit(checkin_lengths)

# 2.
median(checkin_lengths, na.rm = TRUE)

OUTPUT

[1] 44.5

R

# 3.
checkin_lengths_above_55 <- checkin_lengths_no_na[checkin_lengths_no_na > 55]
length(checkin_lengths_above_55)

OUTPUT

[1] 5

Key Points
  • Access individual values by location using [].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.
  • Use proper date types (Date and POSIXct) instead of strings for date arithmetic.

Content from Starting with Data


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • What is an R package?
  • What is a data.frame?
  • What is a tibble, and how is it different from a data frame?
  • How can I read a complete csv file into R?
  • How can I get basic summary information about my data set?
  • How can I change the way R treats strings in my data set?
  • Why would I want strings to be treated differently?
  • How are dates represented in data sets and how can I change the format?

Objectives

  • Understand what an R package is.
  • Describe what a data frame is.
  • Describe what a tibble is.
  • Load external data from a .csv file into a tibble.
  • Summarize the contents of a tibble.
  • Subset values from a tibble.
  • Describe the difference between a factor and a string.
  • Convert between strings and factors.
  • Reorder and rename factors.
  • Change how character strings are handled in a tibble.
  • Examine and change date formats within a data set.

What is an R package?


An R package is a collection of functions and (occasionally) data sets that extend the functionality of R. Throughout these lessons, we will primarily be using the tidyverse, which is a collection of R packages designed to make data science easier!

When installing and loading tidyverse, the following are all of the packages that are installed/loaded as part of the collection:

  • ggplot2
  • dplyr
  • tidyr
  • readr
  • tibble
  • forcats
  • lubridate
  • stringr
  • purrr

You can learn more about the tidyverse collection of packages by visiting the tidyverse website.

There are also packages available for a wide range of tasks, including downloading data from the NCBI database or performing statistical analysis on your data set. Many packages such as these are housed on, and downloadable from, the Comprehensive R Archive Network (CRAN) using install.packages(). Once a package is installed, you load it into your R session with the library() function.

To easily access the documentation for a package within R or RStudio, use help(package = "package_name").

Callout

Note

There are alternatives to the tidyverse packages for data wrangling, including the package data.table. See this comparison for example to get a sense of the differences between using base, tidyverse, and data.table.

What are data frames?


Data frames are the de facto data structure for tabular data in R, and what we use for data processing, statistics, and plotting.

A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Data frames are analogous to the more familiar spreadsheet in programs such as Excel, with one key difference. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.

A 3 by 3 data frame with columns showing numeric, character and logical values.

Data frames can be created by hand, but most commonly they are generated by the functions read_csv() or read_table(); in other words, when importing spreadsheets from your hard drive (or the web). We will now demonstrate how to import tabular data using read_csv().

Introduction to the Check-In Dataset


The Check-In Dataset is an example data set that is based on the 2018 election. Each row in the data set represents one ballot case, and includes an ID, check-in length, arrival time, location, precinct, and machine.

The following is a visual representation of the data set’s columns:

column_name description
checkin_id Provides a unique key/ID for each ballot instance.
checkin_length How long it took the person submitting the ballot to check-in to the polling location.
checkin_time The arrival time of the person submitting the ballot, includes both the date and time.
location Anonymized ID for the location of the ballot box.
precinct Anonymized ID for the precinct that the ballot box belongs to.
device Anonymized ID for each ballot box.

Importing Data


You are going to load the data in R’s memory using the function read_csv(). This is from the readr package, which (as you may remember) is part of the tidyverse.

Before proceeding, however, this is a good opportunity to talk about conflicts. Certain packages we load can end up introducing function names that are already in use by pre-loaded R packages. For instance, when we load the tidyverse package below, we will introduce two conflicting functions: filter() and lag(). This happens because filter and lag are already functions used by the stats package (which comes pre-loaded in R). What will happen now is that if we, for example, call the filter() function, R will use the dplyr::filter() version and not the stats::filter() one. This happens because, if conflicted, by default R uses the function from the most recently loaded package. Conflicted functions may cause you some trouble in the future, so it is important that we are aware of them so that we can properly handle them, if we want.

To do so, we just need the following functions from the conflicted package:

  • conflicted::conflict_scout(): Shows us any conflicted functions.
  • conflict_prefer("function", "preferred_package"): Allows us to choose the default function we want from now on.

It is also important to know that we can, at any time, just call the function directly from the package we want, such as stats::filter().

Even with the use of an RStudio project, it can be difficult to learn how to specify paths to file locations. Enter the here package! The here package creates paths relative to the top-level directory (your RStudio project). These relative paths work regardless of where the associated source file lives inside your project, like analysis projects with data and reports in different sub-directories. This is an important contrast to using setwd(), which depends on the way you order your files on your computer.

Monsters at a fork in the road, with signs saying here, and not here. One direction, not here, leads to a scary dark forest with spiders and absolute filepaths, while the other leads to a sunny, green meadow, and a city below a rainbow and a world free of absolute filepaths. Art by Allison Horst
Image credit: Allison Horst

Before we can use the read_csv() and here() functions, we need to load the tidyverse and here packages.

R

#loads in the tidyverse and here packages
library(tidyverse)
library(here)

#reads in data and assigns it to the 'data' variable using 'here'
data <- read_csv(here("data", "checkin_data.csv"))

In the above code, we notice the here() function takes folder and file names as inputs (e.g., "data", "checkin_data.csv"), each enclosed in quotations ("") and separated by a comma. The here() function will accept as many names as are necessary to navigate to a particular file.

For example, let’s say you have both an RMarkdown file and a folder called "info" that contains multiple CSV files (including "data.csv") on your Desktop. If you want to access "data.csv" within your RMarkdown file, you can use here("info", "data.csv").

The here() function can accept the folder and file names in an alternate format, using a slash (“/”) rather than commas to separate the names. The two methods are equivalent, so that here("data", "checkin_data.csv") and here("data/checkin_data.csv") produce the same result. (The forward slash is used on all operating systems; backslashes are never used.)

If you were to type in the code above, it is likely that the read.csv() function would appear in the automatically populated list of functions. This function is different from the read_csv() function, as it is included in the “base” packages that come pre-installed with R. Overall, read.csv() behaves similarly to read_csv(), with a few notable differences. First, read.csv() coerces column names with spaces and/or special characters to different names (e.g., interview date becomes interview.date).

Second, read.csv() stores data as a data.frame, where read_csv() stores data as a different kind of data frame called a tibble. A tibble is an extension of R data frames used by the tidyverse. We prefer tibbles because they have nice printing properties among other desirable qualities. You can read more about tibbles in its docs.

Additionally, the read_csv() statement in the code above creates a tibble but doesn’t output any data because, as you might recall, assignments (<-) don’t display anything. Note, however, that read_csv may show informational text about the data frame that is created.

If we want to check that our tibble has been loaded, we can see the contents of the data by typing its name: data in the console:

R

data
## Try also
## view(data)
## head(data)

OUTPUT

# A tibble: 352,112 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000007             64 2018-11-06 07:09:36 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000008            262 2018-11-06 07:10:18 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000010            260 2018-11-06 07:13:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 352,102 more rows

Callout

Note

read_csv() assumes that fields are delimited by commas (since CSV stands for “Comma Separated Values”). However, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delimiter. To read this type of file in R, you can use the read_csv2() function. It behaves exactly like read_csv() but uses different defaults for the decimal and field separators. If you are working with another format, both can be specified by the user. Check out the help for read_csv() by typing ?read_csv to learn more. There is also read_tsv() for tab-separated data files, and read_delim() allows you to specify more details about the structure of your file.

When the data is read using read_csv(), it is stored in an object of class tbl_df, tbl, and data.frame. You can see the class of an object using:

R

class(data)

OUTPUT

[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

As a tibble, the type of data included in each column is listed in an abbreviated fashion below the column names. For instance, here checkin_id is a column of characters (<chr>), checkin_length is a column of floating point numbers (abbreviated <dbl> for the word ‘double’), and checkin_time is a column in the “date and time” format (<dttm> or <S3: POSIXct>).

Inspecting Tibbles


When calling a tbl_df object (like data here), there is already a lot of information about our tibble being displayed, such as the number of rows, the number of columns, the names of the columns, and, as we just saw, the class of data stored in each column. However, there are functions to extract this information from tibbles. Here is a non-exhaustive list of some of these functions. Let’s try them out!

Size:

  • dim(data) - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)
  • nrow(data) - returns the number of rows
  • ncol(data) - returns the number of columns

Content:

  • head(data) - shows the first 6 rows
  • tail(data) - shows the last 6 rows

Names:

  • names(data) - returns the column names (synonym of colnames() for data.frame objects)

Summary:

  • str(data) - structure of the object and information about the class, length and content of each column
  • summary(data) - summary statistics for each column
  • glimpse(data) - returns the number of columns and rows of the tibble, the names and class of each column, and previews as many values as will fit on the screen. Unlike the other inspecting functions listed above, glimpse() is not a “base R” function so you need to have the dplyr or tibble packages loaded to be able to execute it.

Note: most of these functions are “generic.” They can be used on other types of objects besides data frames or tibbles.
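Because these functions are generic, we can sketch a few of them on a small hand-made data frame (the values below are made up for illustration; the real data tibble behaves the same way at a much larger scale):

```r
# a tiny data frame mimicking the structure of the check-in data
df <- data.frame(
  checkin_id = c("CHECKIN_000001", "CHECKIN_000002", "CHECKIN_000003"),
  checkin_length = c(45, 29, 65)
)

dim(df)    # [1] 3 2   (rows, then columns)
nrow(df)   # [1] 3
ncol(df)   # [1] 2
names(df)  # [1] "checkin_id"     "checkin_length"
```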

Subsetting Tibbles


Our data tibble has rows and columns (it has 2 dimensions). In practice, we may not need the entire tibble; for instance, we may only be interested in a subset of the observations (the rows) or a particular set of variables (the columns). If we want to access some specific data from it, we need to specify the “coordinates” (i.e., indices) we want from it. Row numbers come first, followed by column numbers.

Callout

Tip

Subsetting a tibble with [ always results in a tibble. However, note this is not true in general for data frames, so be careful! Different ways of specifying these coordinates can lead to results with different classes. This is covered in the Software Carpentry lesson R for Reproducible Scientific Analysis.

R

#retrieves 1st element of the 1st column of the tibble
data[1, 1]

OUTPUT

# A tibble: 1 × 1
  checkin_id
  <chr>
1 CHECKIN_000001

R

#retrieves the 1st element in the 5th column of the tibble 
data[1, 5]

OUTPUT

# A tibble: 1 × 1
  precinct
  <chr>
1 PRECINCT_001

R

#retrieves the 1st column of the tibble as a tibble
data[1]

OUTPUT

# A tibble: 352,112 × 1
   checkin_id
   <chr>
 1 CHECKIN_000001
 2 CHECKIN_000002
 3 CHECKIN_000003
 4 CHECKIN_000004
 5 CHECKIN_000005
 6 CHECKIN_000006
 7 CHECKIN_000007
 8 CHECKIN_000008
 9 CHECKIN_000009
10 CHECKIN_000010
# ℹ 352,102 more rows

R

#retrieves the 1st column of the tibble as a vector
#we're using head() here; without it, we would print all 352,112 entries!
head(data[[1]])

OUTPUT

[1] "CHECKIN_000001" "CHECKIN_000002" "CHECKIN_000003" "CHECKIN_000004"
[5] "CHECKIN_000005" "CHECKIN_000006"

R

#retrieves the first three elements in the 3rd column of the tibble
data[1:3, 3]

OUTPUT

# A tibble: 3 × 1
  checkin_time
  <dttm>
1 2018-11-06 07:02:36
2 2018-11-06 07:04:09
3 2018-11-06 07:05:13

R

#retrieves the third row of the tibble
data[3, ]

OUTPUT

# A tibble: 1 × 6
  checkin_id     checkin_length checkin_time        location     precinct device
  <chr>                   <dbl> <dttm>              <chr>        <chr>    <chr>
1 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_001 PRECINC… DEVIC…

R

#equivalent to head_data <- head(data)
head_data <- data[1:6, ]

: is a special function that creates numeric vectors of integers in increasing or decreasing order, test 1:10 and 10:1 for instance.
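For example (the seq() function, shown alongside, is the more general base R tool when you need a step size other than 1):

```r
1:5   # [1] 1 2 3 4 5
5:1   # [1] 5 4 3 2 1

# seq() lets you control the step size
seq(2, 10, by = 2)  # [1] 2 4 6 8 10
```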

You can also exclude certain indices of a tibble using the “-” sign:

R

#retrieves the whole tibble (minus the first column)
data[, -1]

OUTPUT

# A tibble: 352,112 × 5
   checkin_length checkin_time        location     precinct     device
            <dbl> <dttm>              <chr>        <chr>        <chr>
 1             45 2018-11-06 07:02:36 LOCATION_001 PRECINCT_001 DEVICE_001
 2             29 2018-11-06 07:04:09 LOCATION_001 PRECINCT_001 DEVICE_001
 3             65 2018-11-06 07:05:13 LOCATION_001 PRECINCT_001 DEVICE_001
 4             28 2018-11-06 07:06:26 LOCATION_001 PRECINCT_001 DEVICE_001
 5             17 2018-11-06 07:08:08 LOCATION_001 PRECINCT_001 DEVICE_001
 6             56 2018-11-06 07:08:32 LOCATION_001 PRECINCT_001 DEVICE_002
 7             64 2018-11-06 07:09:36 LOCATION_001 PRECINCT_001 DEVICE_001
 8            262 2018-11-06 07:10:18 LOCATION_001 PRECINCT_001 DEVICE_001
 9            245 2018-11-06 07:12:57 LOCATION_001 PRECINCT_001 DEVICE_002
10            260 2018-11-06 07:13:41 LOCATION_001 PRECINCT_001 DEVICE_001
# ℹ 352,102 more rows

R

#equivalent to head(data)
data[-c(7:352112), ]

OUTPUT

# A tibble: 6 × 6
  checkin_id     checkin_length checkin_time        location     precinct device
  <chr>                   <dbl> <dttm>              <chr>        <chr>    <chr>
1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_001 PRECINC… DEVIC…
2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_001 PRECINC… DEVIC…
3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_001 PRECINC… DEVIC…
4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_001 PRECINC… DEVIC…
5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_001 PRECINC… DEVIC…
6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_001 PRECINC… DEVIC…

Tibbles can be subset by calling indices (as shown previously), but also by calling their column names directly:

R

#returns a tibble
data["location"]

#returns a tibble
data[, "location"]

#returns a vector
data[["location"]]

#returns a vector
data$location

In RStudio, you can use the auto-completion feature to get the full and correct names of the columns.
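To see the tibble-or-vector distinction concretely, here is a sketch using a small hand-made base data.frame (the values are illustrative). Note one caveat from the earlier tip: for a plain data.frame, df[, "location"] drops to a vector, whereas a tibble would keep it as a tibble:

```r
df <- data.frame(location = c("LOCATION_001", "LOCATION_002"))

class(df["location"])     # "data.frame" (still a table)
class(df[["location"]])   # "character"  (a plain vector)
class(df$location)        # "character"  (a plain vector)

# for a base data.frame this also drops to a vector;
# a tibble would stay a tibble here
class(df[, "location"])   # "character"
```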

Challenge

Exercise

  1. Create a tibble (data_100) containing only the data in row 100 of the data data set.

Now, continue using data for each of the following activities:

  2. Notice how nrow() gave you the number of rows in the tibble?
  • Use that number to pull out just that last row in the tibble.
  • Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
  • Pull out that last row using nrow() instead of the row number.
  • Create a new tibble (data_last) from that last row.
  3. Using the number of rows in the Check-In Dataset that you found in question 2, extract the rows that are in the middle of the data set. Store the content of these middle rows in an object named data_middle. (hint: the middle two items of a set of 4 would be 2 + 3, or visually, [][X][X][])

  4. Combine nrow() with the - notation above to reproduce the behavior of head(data), keeping just the first through 6th rows of the Check-In Dataset.

R

#part 1:
data_100 <- data[100, ]

#part 2:
#we save nrows so we can use it multiple times! makes the code cleaner :)
n_rows <- nrow(data)
data_last <- data[n_rows, ]

#part 3:
data_middle <- data[(n_rows/2):((n_rows/2) + 1), ]

#part 4:
data_head <- data[-(7:n_rows), ]

Factors


R has a special data class, called factors, to deal with categorical data that you may encounter when creating plots or doing statistical analyses. Factors are very useful and play a key role in making R particularly well suited to working with data.

Factors represent categorical data. They are stored as integers associated with labels, and can be ordered (ordinal) or unordered (nominal). Factors create a structured relation between the different levels (values) of a categorical variable, such as days of the week or responses to a question in a survey. This can make it easier to see how one element relates to the other elements in a column. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So, you need to be very careful when treating them as strings.
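The pitfall described above is easy to see directly. Here is a minimal sketch (using a small throwaway factor, not our data set) showing that R stores the labels as integer codes underneath:

```r
#a small throwaway factor (levels are sorted alphabetically: "a", "b")
f <- factor(c("b", "a", "b"))

f              #prints the labels: b a b
as.integer(f)  #prints the underlying codes: 2 1 2
```

This is why treating a factor as if it were a character vector can produce surprising results.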

Once created, factors can only contain a pre-defined set of values, known as levels. By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:

R

ballot_type <- factor(c("in-person", "absentee", "in-person", "in-person", "absentee"))

R will assign 1 to the level "absentee" and 2 to the level "in-person" (because a comes before i, even though the first element in this vector is "in-person"). You can see this by using the function levels() and you can find the number of levels using nlevels():

R

levels(ballot_type)

OUTPUT

[1] "absentee"  "in-person"

R

nlevels(ballot_type)

OUTPUT

[1] 2

Sometimes, the order of the factors does not matter. Other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”). It may improve your visualization, or it may be required by a particular type of analysis. Here, one way to reorder our levels in the ballot_type vector would be:

R

ballot_type #current order

OUTPUT

[1] in-person absentee  in-person in-person absentee
Levels: absentee in-person

R

ballot_type <- factor(ballot_type, 
                      levels = c("in-person", "absentee"))

ballot_type #re-ordered

OUTPUT

[1] in-person absentee  in-person in-person absentee
Levels: in-person absentee

In R’s memory, these factors are represented by integers (1, 2), but factor labels are more informative than bare integers: "in-person" and "absentee" are more descriptive than 1 and 2. Which one is “absentee”? You wouldn’t be able to tell just from the integer data. Factors, however, have this information built in. This is particularly helpful when there are many levels, and it makes renaming levels easier. Let’s say we made a mistake and need to recode “in-person” to “provisional”. We can do this using the fct_recode() function from the forcats package (included in the tidyverse), which provides some extra tools to work with factors.

R

levels(ballot_type)

OUTPUT

[1] "in-person" "absentee" 

R

ballot_type <- fct_recode(ballot_type, 
                          "provisional" = "in-person")

#alternatively, we could change the "in-person" level directly using the 
#levels() function, but we have to remember that "in-person" is the first level
#levels(ballot_type)[1] <- "provisional"

levels(ballot_type)

OUTPUT

[1] "provisional" "absentee"   

R

ballot_type

OUTPUT

[1] provisional absentee    provisional provisional absentee
Levels: provisional absentee

So far, your factor is unordered, like a nominal variable. R does not know the difference between a nominal and an ordinal variable unless you tell it. You can make your factor an ordered factor by using the ordered = TRUE argument inside the factor() function. Note how the reported levels change from the unordered factor above to the ordered version below: ordered levels use the less-than sign < to denote level ranking.

R

ballot_type_ordered <- factor(ballot_type, 
                              ordered = TRUE)

ballot_type_ordered #now ordered

OUTPUT

[1] provisional absentee    provisional provisional absentee
Levels: provisional < absentee

Converting Factors

If you need to convert a factor to a character vector, you use as.character(x).

R

as.character(ballot_type)

OUTPUT

[1] "provisional" "absentee"    "provisional" "provisional" "absentee"   

Converting factors where the levels appear as numbers (such as concentration levels, or years) to a numeric vector is a little trickier. The as.numeric() function returns the index values of the factor, not its levels, so it will result in an entirely new (and unwanted in this case) set of numbers. One method to avoid this is to convert factors to characters, and then to numbers. Another method is to use the levels() function. Compare:

R

year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

as.numeric(year_fct)                     #wrong! and with no warning either...

OUTPUT

[1] 3 2 1 4 3

R

as.numeric(as.character(year_fct))       #technically works...

OUTPUT

[1] 1990 1983 1977 1998 1990

R

as.numeric(levels(year_fct))[year_fct]   #recommended methodology! :)

OUTPUT

[1] 1990 1983 1977 1998 1990

Notice that in the recommended levels() approach, three important steps occur:

  • We obtain all the factor levels using levels(year_fct)
  • We convert these levels to numeric values using as.numeric(levels(year_fct))
  • We then access these numeric values using the underlying integers of the vector year_fct inside the square brackets
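To see each of those steps on its own, you can run them one at a time (re-creating year_fct so the snippet stands alone):

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

#step 1: the levels, stored as characters in sorted order
levels(year_fct)                        # "1977" "1983" "1990" "1998"

#step 2: the levels converted to numbers
as.numeric(levels(year_fct))            # 1977 1983 1990 1998

#step 3: index those numbers by the factor's underlying integer codes
as.numeric(levels(year_fct))[year_fct]  # 1990 1983 1977 1998 1990
```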

Renaming Factors

When your data is stored as a factor, you can use the plot() function to get a quick glance at the number of observations represented by each factor level. Let’s create some new data called ballotData, convert it into a factor, and use it to look at the number of ballots that are in-person or absentee:

R

#create data
ballotData <- c("in-person", "in-person", "in-person", "in-person", "in-person", "in-person", "in-person", "absentee", "absentee", "absentee", "absentee", "absentee", NA, NA)

#convert it into a factor
ballotData <- as.factor(ballotData)

#prints out the data (as a vector)
ballotData

OUTPUT

 [1] in-person in-person in-person in-person in-person in-person in-person
 [8] absentee  absentee  absentee  absentee  absentee  <NA>      <NA>
Levels: absentee in-person

R

#bar plot of the number of cases per ballot type:
plot(ballotData)
Bar plot of Number of Cases per Ballot Type

Comparing the plot to the printed vector, we can see that in addition to “absentee” and “in-person” values, there are some people whose ballot type was not noted. Consequently, these people do not appear on the plot! Let’s encode them differently so they can be counted and visualized.

R

#recreates the data
ballotData <- c("in-person", "in-person", "in-person", "in-person", "in-person", "in-person", "in-person", "absentee", "absentee", "absentee", "absentee", "absentee", NA, NA)

#replace the missing data with "undetermined"
ballotData[is.na(ballotData)] <- "undetermined"

#convert it into a factor
ballotData <- as.factor(ballotData)

#prints out the data (as a vector)
ballotData

OUTPUT

 [1] in-person    in-person    in-person    in-person    in-person
 [6] in-person    in-person    absentee     absentee     absentee
[11] absentee     absentee     undetermined undetermined
Levels: absentee in-person undetermined

R

#bar plot of the number of cases per ballot type:
plot(ballotData)
Bar plot of Number of Cases per Ballot Type (including missing values)
Challenge

Exercise

  1. Rename the levels of the factor to be in title case: “Absentee”,“In-Person”, and “Undetermined”.

  2. Now that we have renamed the factor level to “Undetermined”, can you recreate the bar plot such that “Undetermined” is first (before “Absentee”)?

R

#part 1:
ballotData <- fct_recode(ballotData, 
                         "Absentee" = "absentee",
                         "In-Person" = "in-person", 
                         "Undetermined" = "undetermined")

#part 2:
ballotData <- factor(ballotData, 
                     levels = c("Undetermined", "Absentee", "In-Person"))
plot(ballotData)
Bar plot of Number of Cases per Ballot Type (including missing values)

Formatting Dates


Recall our coverage of dates in “Intro to R”. A best practice for dealing with date data is to ensure that each component of your date is available as a separate variable. In our data set, we have a column checkin_time which contains the year, month, day, hour, minute, and second that the person who submitted the ballot arrived in the building. Let’s extract those components into six separate columns.

R

str(data)

We are going to use the package lubridate, which is included in the tidyverse installation and should be loaded by default. However, if you are using an older version of the tidyverse (2022 and earlier), you will need to load it manually by typing library(lubridate).

If necessary, start by loading the required package:

R

library(lubridate)

The lubridate function ymd_hms() takes a vector of strings representing year, month, day, hour, minute, and second components and converts it to a date-time (POSIXct) vector.

Let’s extract our checkin_time column and inspect the structure:

R

times <- data$checkin_time
str(times)

OUTPUT

 POSIXct[1:352112], format: "2018-11-06 07:02:36" "2018-11-06 07:04:09" "2018-11-06 07:05:13" ...

When we imported the data in R, read_csv() recognized that this column contained date information. We can now use the day(), month(), year(), hour(), minute(), and second() functions to extract this information from the date, and create new columns in our tibble to store it:

R

data$day <- day(times)
data$month <- month(times)
data$year <- year(times)
data$hour <- hour(times)
data$minute <- minute(times)
data$seconds <- second(times)

data

OUTPUT

# A tibble: 352,112 × 12
   checkin_id  checkin_length checkin_time        location precinct device   day
   <chr>                <dbl> <dttm>              <chr>    <chr>    <chr>  <int>
 1 CHECKIN_00…             45 2018-11-06 07:02:36 LOCATIO… PRECINC… DEVIC…     6
 2 CHECKIN_00…             29 2018-11-06 07:04:09 LOCATIO… PRECINC… DEVIC…     6
 3 CHECKIN_00…             65 2018-11-06 07:05:13 LOCATIO… PRECINC… DEVIC…     6
 4 CHECKIN_00…             28 2018-11-06 07:06:26 LOCATIO… PRECINC… DEVIC…     6
 5 CHECKIN_00…             17 2018-11-06 07:08:08 LOCATIO… PRECINC… DEVIC…     6
 6 CHECKIN_00…             56 2018-11-06 07:08:32 LOCATIO… PRECINC… DEVIC…     6
 7 CHECKIN_00…             64 2018-11-06 07:09:36 LOCATIO… PRECINC… DEVIC…     6
 8 CHECKIN_00…            262 2018-11-06 07:10:18 LOCATIO… PRECINC… DEVIC…     6
 9 CHECKIN_00…            245 2018-11-06 07:12:57 LOCATIO… PRECINC… DEVIC…     6
10 CHECKIN_00…            260 2018-11-06 07:13:41 LOCATIO… PRECINC… DEVIC…     6
# ℹ 352,102 more rows
# ℹ 5 more variables: month <dbl>, year <dbl>, hour <int>, minute <int>,
#   seconds <dbl>

Notice the six new columns at the end of our tibble.

In our example above, the checkin_time column was read in correctly as a date-time variable, but generally that is not the case. Date columns are often read in as character variables and, just as you can convert individual character variables to dates using the as_date() function, whole columns can be converted to the appropriate Date/POSIXct format.

Let’s say we have a generic tibble of IDs and character dates, as configured:

R

data2 <- tibble(
  ID = c("001", "002", "003"),
  Date = c("01/05/2025", "04/23/2024", "12/25/1987")
)

data2

OUTPUT

# A tibble: 3 × 2
  ID    Date
  <chr> <chr>
1 001   01/05/2025
2 002   04/23/2024
3 003   12/25/1987

As you can see, the Date column is stored as characters. We can easily convert this to a date type by doing one of the following:

R

#option 1: base R (as.Date)
data2$Date1 <- as.Date(data2$Date, format = "%m/%d/%Y")

#option 2: lubridate (mdy)
data2$Date2 <- mdy(data2$Date)

data2

OUTPUT

# A tibble: 3 × 4
  ID    Date       Date1      Date2
  <chr> <chr>      <date>     <date>
1 001   01/05/2025 2025-01-05 2025-01-05
2 002   04/23/2024 2024-04-23 2024-04-23
3 003   12/25/1987 1987-12-25 1987-12-25

Date1 and Date2 store the exact same data! The lubridate approach is generally preferred for readability, but either function can be used.

Outputting Data


Occasionally, after editing a data set within RStudio, you may want to output the updated data set to a CSV file. This would allow you to open the updated information in Excel, Google Sheets, or a different RMarkdown file!

To output a file to CSV, we will be using the write_csv() function from the readr package. Below, we will be outputting our updated data with our new date and time columns as "checkin_data_2.csv":

R

#takes the tibble and outputs it as a csv file
write_csv(data, "data/checkin_data_2.csv")

When choosing the name for the new file, ensure there are no files with the same name. By default, write_csv() will overwrite any files of the same name without a warning!

Additionally, you may have noticed we included the file path when specifying the name of the new CSV. When creating any sort of new file – whether that be an image, CSV, or otherwise – R will place the file in the current working directory unless specified otherwise! For an RMarkdown document, the working directory is usually the folder containing the file you’re working in.

Since we have a specific folder (called "data") to store our csv files, we specify that we want the new CSV file to go in that folder by adding "data/" before the file name!

If you want to output your new csv to a different file outside of the working directory, you can use an entire file path (ex. "C:/Users/name/Documents/checkin_data_2.csv") to specify exactly where you want the file to be saved.
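As a minimal sketch of how paths behave (written to a temporary folder rather than your project, so it is safe to run anywhere):

```r
library(readr)

#a bare file name is resolved against the current working directory
getwd()

#an absolute path puts the file exactly where you say; here we use R's
#temporary folder so this sketch doesn't clutter your project
tmp <- file.path(tempdir(), "demo.csv")
write_csv(data.frame(x = 1:3), tmp)

file.exists(tmp)   #confirms the file landed where we pointed
```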

Note: similarly to reading in CSV files, readr has an alternate version of write_csv() called write_csv2() that uses commas as decimal separators and semicolons as field delimiters.

Key Points
  • Use read_csv to read tabular data in R.
  • Access rows and columns in a tibble in R.
  • Use factors to represent categorical data in R.
  • Use date-time classes to represent dates and times in R.
  • Output an updated data set to CSV in R.

Content from Data Wrangling with dplyr


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • How can I select specific rows and columns from a tibble using dplyr?
  • How does the pipe operator (%>%) help in combining multiple commands into a single workflow?
  • What is the advantage of using mutate() for creating new variables, and how does it work?
  • How can I summarize my data by grouping observations and applying summary statistics with dplyr?

Objectives

  • Understand the purpose of the dplyr package.
  • Learn how to select specific columns from a tibble using select.
  • Learn how to filter rows based on conditions using filter().
  • Use the pipe operator (%>%) to seamlessly chain multiple dplyr commands.
  • Create new columns in a tibble with mutate(), deriving them from existing data.
  • Apply the split-apply-combine strategy using group_by() and summarize() to generate summary statistics.

dplyr is a powerful and intuitive package in R designed to make data manipulation both easy and efficient. It is part of the tidyverse ecosystem, which emphasizes readable, consistent syntax for working with data. We’re going to learn some of the most common dplyr functions:

  • select(): subset columns
  • filter(): subset rows on conditions
  • mutate(): create new columns by using information from other columns
  • group_by() and summarize(): create summary statistics on grouped data
  • arrange(): sort results
  • count(): count discrete values

As covered in “Starting with Data”, dplyr is also part of the tidyverse and will be loaded in R’s memory when we call library(tidyverse).

Callout

Note

The packages in the tidyverse (namely dplyr, tidyr, and ggplot2) accept both the British (e.g. summarise) and American (e.g. summarize) spelling variants of function and option names. For this lesson, we use the American spellings; however, feel free to use whichever variant feels best for you!

To begin working with dplyr, let’s start by loading in the packages and data set:

R

#load packages
library(tidyverse)
library(here)

#read in data
data <- read_csv(here("data", "checkin_data.csv"))

Selecting Columns


The first function we will be covering is the select() function! This function allows us to select specific columns of our data set and accepts two primary types of arguments: the original data set, and the column(s) to isolate.

In our case, for example, we are interested in seeing ONLY the precinct IDs in our data set, so our arguments will be data and precinct:

R

#selects JUST the precinct column
select(data, precinct)

OUTPUT

# A tibble: 352,112 × 1
   precinct
   <chr>
 1 PRECINCT_001
 2 PRECINCT_001
 3 PRECINCT_001
 4 PRECINCT_001
 5 PRECINCT_001
 6 PRECINCT_001
 7 PRECINCT_001
 8 PRECINCT_001
 9 PRECINCT_001
10 PRECINCT_001
# ℹ 352,102 more rows

Using the select() function, you can also select MULTIPLE columns. This can be particularly helpful with larger data sets. This operation could also be performed with bracket subsetting instead of select(), but it’s best practice to use dplyr functions when possible:

R

#selects the precinct column AND the checkin_time column
select(data, precinct, checkin_time)

OUTPUT

# A tibble: 352,112 × 2
   precinct     checkin_time
   <chr>        <dttm>
 1 PRECINCT_001 2018-11-06 07:02:36
 2 PRECINCT_001 2018-11-06 07:04:09
 3 PRECINCT_001 2018-11-06 07:05:13
 4 PRECINCT_001 2018-11-06 07:06:26
 5 PRECINCT_001 2018-11-06 07:08:08
 6 PRECINCT_001 2018-11-06 07:08:32
 7 PRECINCT_001 2018-11-06 07:09:36
 8 PRECINCT_001 2018-11-06 07:10:18
 9 PRECINCT_001 2018-11-06 07:12:57
10 PRECINCT_001 2018-11-06 07:13:41
# ℹ 352,102 more rows

In some cases, you may want to select multiple adjacent columns. Instead of writing out each individual column name, they can be selected with a :, as seen below:

R

#selects all columns from checkin_time to precinct
select(data, checkin_time:precinct)

You can see a visualized example of the select() function on Tidy Data Tutor.

Filtering Rows


The next function we will cover is the filter() function! This function allows us to choose rows based on specific criteria, and accepts two arguments: the original data set, and the condition used to select rows. In this case, we ONLY want rows where the precinct is “PRECINCT_001”:

R

#filters rows where the precinct is "PRECINCT_001"
filter(data, precinct == "PRECINCT_001")

OUTPUT

# A tibble: 648 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000007             64 2018-11-06 07:09:36 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000008            262 2018-11-06 07:10:18 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000010            260 2018-11-06 07:13:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 638 more rows

You can also use comparison operators within filter() arguments! These include less-than (<), less-than or equal-to (<=), greater-than (>), greater-than or equal-to (>=), and not-equal-to (!=).

For example, you could filter for all rows where the check-in length is less-than or equal-to 20 seconds:

R

#filters rows with the "less-than or equal-to"/"<=" operator
filter(data, checkin_length <= 20)

OUTPUT

# A tibble: 32,264 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000017             19 2018-11-06 07:20:40 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000059             19 2018-11-06 08:07:12 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000079             20 2018-11-06 08:25:41 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000092             18 2018-11-06 08:37:45 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000094             19 2018-11-06 08:39:38 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000119             17 2018-11-06 08:57:22 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000162             19 2018-11-06 09:30:57 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000163             20 2018-11-06 09:32:14 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000190             18 2018-11-06 09:49:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 32,254 more rows

Similarly to the select() function, the filter() function allows us to specify multiple conditions. However, instead of separating them with commas, conditions are combined using logical ‘and’ and ‘or’ operators.

In an ‘and’ statement, an observation (row) must meet all criteria to be included in the resulting tibble. To form ‘and’ statements within dplyr, we can pass our desired conditions as arguments in the filter() function, separated by an ampersand (&).

Below, let’s filter rows that include “PRECINCT_001” as the precinct and “DEVICE_002” as the device:

R

#filters rows with the "and"/"&" logical operator
filter(data, precinct == "PRECINCT_001" & device == "DEVICE_002")

OUTPUT

# A tibble: 265 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000019             41 2018-11-06 07:23:05 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000026             22 2018-11-06 07:33:38 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000028             21 2018-11-06 07:35:44 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000031             33 2018-11-06 07:37:36 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000041             56 2018-11-06 07:49:06 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000044             23 2018-11-06 07:52:08 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000046             24 2018-11-06 07:54:06 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000057             48 2018-11-06 08:05:54 LOCATION_0… PRECINC… DEVIC…
# ℹ 255 more rows

In an ‘or’ statement, an observation (row) must meet at least one criterion to be included in the resulting tibble. To form ‘or’ statements within dplyr, we can pass our desired conditions as arguments in the filter() function, separated by a vertical bar (|).

Below, let’s filter rows that include “PRECINCT_001” or “PRECINCT_002” as the precinct:

R

#filters rows with the "or"/"|" logical operator
filter(data, precinct == "PRECINCT_001" | precinct == "PRECINCT_002")

OUTPUT

# A tibble: 905 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000007             64 2018-11-06 07:09:36 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000008            262 2018-11-06 07:10:18 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000010            260 2018-11-06 07:13:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 895 more rows

You can see a visualized example of the filter() function on Tidy Data Tutor.

Using Pipes


In many cases, you will want to apply multiple functions to the same data set in sequence! Within dplyr, there are three ways to do this.

  1. Intermediate Steps: Using this method, you apply the first function to your data and save the result as a new object. After saving, the second function is applied to your new object instead of the original data. While this method is easy to understand, it can create many extra, unnecessary objects in your R environment.

R

#step 1: apply filter function and save it to a new object (filtered_data)
filtered_data <- filter(data, precinct == "PRECINCT_005")

#step 2: apply select function on the filtered_data object
select(filtered_data, precinct, checkin_time)

OUTPUT

# A tibble: 762 × 2
   precinct     checkin_time
   <chr>        <dttm>
 1 PRECINCT_005 2018-11-06 11:39:28
 2 PRECINCT_005 2018-11-06 11:26:09
 3 PRECINCT_005 2018-11-06 18:25:45
 4 PRECINCT_005 2018-11-06 07:01:07
 5 PRECINCT_005 2018-11-06 07:01:22
 6 PRECINCT_005 2018-11-06 07:02:02
 7 PRECINCT_005 2018-11-06 07:02:02
 8 PRECINCT_005 2018-11-06 07:02:38
 9 PRECINCT_005 2018-11-06 07:02:50
10 PRECINCT_005 2018-11-06 07:03:23
# ℹ 752 more rows
  2. Nested Functions: Instead of saving intermediate results, you can put your first function inside the second. This is called nesting and, while it works, it can become confusing if more than two functions are put together.

R

# Do it all in one go, nesting the functions
select(filter(data, precinct == "PRECINCT_005"), precinct, checkin_time)

OUTPUT

# A tibble: 762 × 2
   precinct     checkin_time
   <chr>        <dttm>
 1 PRECINCT_005 2018-11-06 11:39:28
 2 PRECINCT_005 2018-11-06 11:26:09
 3 PRECINCT_005 2018-11-06 18:25:45
 4 PRECINCT_005 2018-11-06 07:01:07
 5 PRECINCT_005 2018-11-06 07:01:22
 6 PRECINCT_005 2018-11-06 07:02:02
 7 PRECINCT_005 2018-11-06 07:02:02
 8 PRECINCT_005 2018-11-06 07:02:38
 9 PRECINCT_005 2018-11-06 07:02:50
10 PRECINCT_005 2018-11-06 07:03:23
# ℹ 752 more rows
  3. Using Pipes: Pipes allow you to connect your commands in a simple, step-by-step way. They let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same data set! When reading code with pipes, you can think of each pipe as the word “then”.

R

#takes the data THEN applies the filter function THEN applies the select function
data %>%
  filter(precinct == "PRECINCT_005") %>%
  select(precinct, checkin_time)

OUTPUT

# A tibble: 762 × 2
   precinct     checkin_time
   <chr>        <dttm>
 1 PRECINCT_005 2018-11-06 11:39:28
 2 PRECINCT_005 2018-11-06 11:26:09
 3 PRECINCT_005 2018-11-06 18:25:45
 4 PRECINCT_005 2018-11-06 07:01:07
 5 PRECINCT_005 2018-11-06 07:01:22
 6 PRECINCT_005 2018-11-06 07:02:02
 7 PRECINCT_005 2018-11-06 07:02:02
 8 PRECINCT_005 2018-11-06 07:02:38
 9 PRECINCT_005 2018-11-06 07:02:50
10 PRECINCT_005 2018-11-06 07:03:23
# ℹ 752 more rows

In the above code, you may have noticed that the data data set was not included as an argument in either of the functions. Since a pipe takes the object on its left and passes it as the first argument to the function on its right, we no longer need to explicitly include the tibble as an argument to filter() and select().

In R, there are two main types of pipe operators:

  1. |>: the native pipe, included with base R (version 4.1 and later).
  2. %>%: the magrittr pipe, installed automatically with dplyr. This pipe is the most common, and is what we will be using throughout this lesson.

Both pipes behave the exact same way, so the choice of which one to use is a matter of taste.
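As a quick self-contained sketch (using a plain numeric vector rather than our data set), both pipes hand the left-hand side to the function on the right as its first argument:

```r
library(magrittr)  #provides %>% (also loaded automatically with dplyr/tidyverse)

x <- c(4, 9, 16)

x |> sqrt()    #native pipe (base R 4.1+): 2 3 4
x %>% sqrt()   #magrittr pipe: 2 3 4
```

Both lines are equivalent to writing sqrt(x) directly.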

Challenge

Exercise

Using pipes, filter the data data set to include only observations where the device is "DEVICE_738", then select only the columns precinct, checkin_time, and device.

R

data %>%
  filter(device == "DEVICE_738") %>%
  select(precinct, checkin_time, device)

OUTPUT

# A tibble: 134 × 3
   precinct     checkin_time        device
   <chr>        <dttm>              <chr>
 1 PRECINCT_332 2018-11-06 07:02:26 DEVICE_738
 2 PRECINCT_332 2018-11-06 07:03:10 DEVICE_738
 3 PRECINCT_332 2018-11-06 07:03:56 DEVICE_738
 4 PRECINCT_332 2018-11-06 07:04:27 DEVICE_738
 5 PRECINCT_332 2018-11-06 07:05:01 DEVICE_738
 6 PRECINCT_332 2018-11-06 07:06:00 DEVICE_738
 7 PRECINCT_332 2018-11-06 07:06:36 DEVICE_738
 8 PRECINCT_332 2018-11-06 07:07:03 DEVICE_738
 9 PRECINCT_332 2018-11-06 07:07:45 DEVICE_738
10 PRECINCT_332 2018-11-06 07:08:24 DEVICE_738
# ℹ 124 more rows

Split-Apply-Combine Data Analysis


Many data analysis tasks follow a pattern known as split-apply-combine:

  1. Split the data into groups.
  2. Apply some analysis or calculation to each group.
  3. Combine the results into a summary.

The dplyr package makes this easy with two main functions:

  • group_by() to define how you want to split the data.
  • summarize() to apply one or more calculations on each group and return a summary.

group_by()

The group_by() function allows us to treat parts of our data set as separate groups so other functions can work within each group instead of on the entire data set. This function accepts one or more columns to group by as arguments!

Below, we will be grouping the data by location, and filtering the rows to only include the check-in(s) with the longest check-in length for each location:

R

#groups the data by location and applies filter
data %>%
  group_by(location) %>%
  filter(checkin_length == max(checkin_length))

OUTPUT

# A tibble: 561 × 6
# Groups:   location [417]
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000032            300 2018-11-06 07:37:43 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000106            300 2018-11-06 08:51:39 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000640            300 2018-11-06 19:47:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000839            300 2018-11-06 16:50:29 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_001137            299 2018-11-06 10:03:21 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_002362            298 2018-11-06 19:09:12 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_002572            299 2018-11-06 10:46:01 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_003919            300 2018-11-06 18:05:53 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_004805            298 2018-11-06 17:57:33 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_005944            300 2018-11-06 16:17:04 LOCATION_0… PRECINC… DEVIC…
# ℹ 551 more rows

Additionally, when multiple columns are provided, group_by() works from left to right, grouping by the first column, then within each group by the second, and so on!

Below, we will be doing the same calculation that we did above, but instead of grouping only by location, we will be grouping by location and device:

R

#groups the data by location and applies filter
data %>%
  group_by(location, device) %>%
  filter(checkin_length == max(checkin_length))

OUTPUT

# A tibble: 1,344 × 6
# Groups:   location, device [1,215]
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000032            300 2018-11-06 07:37:43 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000106            300 2018-11-06 08:51:39 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000640            300 2018-11-06 19:47:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000774            295 2018-11-06 12:52:23 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000839            300 2018-11-06 16:50:29 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_001015            296 2018-11-06 08:36:18 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_001137            299 2018-11-06 10:03:21 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_001792            290 2018-11-06 08:00:47 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_002210             75 2018-11-06 16:11:09 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_002362            298 2018-11-06 19:09:12 LOCATION_0… PRECINC… DEVIC…
# ℹ 1,334 more rows

As you can see, there are additional rows, since we are looking at the longest check-in times for each device within each location, instead of just within each location!

After completing your analysis, you may want to remove the grouping. To do so, you can use the ungroup() function:

R

data %>%
  group_by(location, device) %>% 
  filter(checkin_length == max(checkin_length)) %>%
  ungroup()

OUTPUT

# A tibble: 1,344 × 6
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000032            300 2018-11-06 07:37:43 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000106            300 2018-11-06 08:51:39 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000640            300 2018-11-06 19:47:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000774            295 2018-11-06 12:52:23 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000839            300 2018-11-06 16:50:29 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_001015            296 2018-11-06 08:36:18 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_001137            299 2018-11-06 10:03:21 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_001792            290 2018-11-06 08:00:47 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_002210             75 2018-11-06 16:11:09 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_002362            298 2018-11-06 19:09:12 LOCATION_0… PRECINC… DEVIC…
# ℹ 1,334 more rows

The final table will no longer be considered “grouped”, which can be helpful if you plan to do further operations that don’t rely on grouping.

summarize()

The summarize() function is often used alongside group_by(), as it allows us to reduce a group of rows to a single row per group. This function accepts one or more expressions that compute summary statistics as arguments!

Some common summarize() summary functions include:

  • mean(): calculates the average of a numeric column
  • max()/min(): returns the maximum or minimum of a group
  • n(): counts the number of rows in a group
  • n_distinct(): counts the number of unique values in a column
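To see several of these in action at once, here is a minimal sketch using a small, made-up tibble (the toy_data object and its values are invented for illustration; they are not part of the check-in data set):

```r
library(dplyr)

# A tiny, invented tibble for illustration
toy_data <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(10, 20, 30, 40, 40)
)

toy_summary <- toy_data %>%
  group_by(group) %>%
  summarize(
    avg      = mean(value),       # average value within each group
    largest  = max(value),        # maximum value within each group
    rows     = n(),               # number of rows in each group
    distinct = n_distinct(value)  # number of unique values in each group
  )

toy_summary
```

Group "a" averages to 15 over 2 rows, while group "b" averages to about 36.7 over 3 rows but contains only 2 distinct values.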

Suppose we want to see how many total check-ins there were for each precinct in our data set. We can do this by grouping the data by the precinct column using group_by() and then using the summarize() function to count each row within each precinct group, as seen below:

R

data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n())

OUTPUT

# A tibble: 420 × 2
   precinct     total_checkins
   <chr>                 <int>
 1 PRECINCT_001            648
 2 PRECINCT_002            257
 3 PRECINCT_003            806
 4 PRECINCT_004            466
 5 PRECINCT_005            762
 6 PRECINCT_006            676
 7 PRECINCT_007           1347
 8 PRECINCT_008           1652
 9 PRECINCT_009            742
10 PRECINCT_010            882
# ℹ 410 more rows

We can also apply summarize() on data that has been grouped by multiple columns! Below, we will be grouping by precinct and device, allowing us to see how many check-ins occurred for each device within each precinct:

R

data %>%
  group_by(precinct, device) %>%
  summarize(total_checkins = n())

OUTPUT

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by precinct and device.
ℹ Output is grouped by precinct.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(precinct, device))` for per-operation grouping
  (`?dplyr::dplyr_by`) instead.

OUTPUT

# A tibble: 1,778 × 3
# Groups:   precinct [420]
   precinct     device     total_checkins
   <chr>        <chr>               <int>
 1 PRECINCT_001 DEVICE_001            381
 2 PRECINCT_001 DEVICE_002            265
 3 PRECINCT_001 DEVICE_671              1
 4 PRECINCT_001 DEVICE_844              1
 5 PRECINCT_002 DEVICE_003            125
 6 PRECINCT_002 DEVICE_004            131
 7 PRECINCT_002 DEVICE_536              1
 8 PRECINCT_003 DEVICE_005            449
 9 PRECINCT_003 DEVICE_006            357
10 PRECINCT_004 DEVICE_006              1
# ℹ 1,768 more rows
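The `summarise()` message above is informational rather than an error: after summarizing, dplyr kept the result grouped by precinct. If you would rather get a fully ungrouped tibble back, one option is the .groups = "drop" argument, sketched here on a small, invented tibble (toy is a stand-in, not part of the check-in data set):

```r
library(dplyr)

# A small, invented stand-in for the check-in data
toy <- tibble(
  precinct = c("P1", "P1", "P1", "P2"),
  device   = c("D1", "D1", "D2", "D3")
)

toy_counts <- toy %>%
  group_by(precinct, device) %>%
  summarize(total_checkins = n(), .groups = "drop")  # drop all grouping from the result

toy_counts
```

Because of .groups = "drop", the returned tibble carries no grouping, so no follow-up ungroup() call is needed.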

You’re not limited to a single summary statistic, either! For example, you might want both the total number of check-ins and the number of unique devices for each precinct. You can combine these in one summarize() call:

R

data %>%
  group_by(precinct) %>%
  summarize(
    total_checkins = n(),
    unique_devices = n_distinct(device)
  )

OUTPUT

# A tibble: 420 × 3
   precinct     total_checkins unique_devices
   <chr>                 <int>          <int>
 1 PRECINCT_001            648              4
 2 PRECINCT_002            257              3
 3 PRECINCT_003            806              2
 4 PRECINCT_004            466              5
 5 PRECINCT_005            762              5
 6 PRECINCT_006            676              2
 7 PRECINCT_007           1347              5
 8 PRECINCT_008           1652              5
 9 PRECINCT_009            742              7
10 PRECINCT_010            882              6
# ℹ 410 more rows

Additionally, if you need to exclude certain rows before summarizing, ensure you use filter() before grouping. For example, to include only check-ins from a specific location, you can do the following:

R

data %>%
  filter(location == "LOCATION_001") %>%
  group_by(precinct) %>%
  summarize(total_checkins = n())

OUTPUT

# A tibble: 1 × 2
  precinct     total_checkins
  <chr>                 <int>
1 PRECINCT_001            646

Additional examples of the group_by() and summarize() functions can be found at tidy data tutor

arrange()

After summarizing, you may want to sort your results. To do so, you can use the arrange() function to reorder rows. For example, to list precincts from lowest to highest check-in counts, you can do the following:

R

data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n()) %>%
  arrange(total_checkins)

OUTPUT

# A tibble: 420 × 2
   precinct     total_checkins
   <chr>                 <int>
 1 PRECINCT_092              2
 2 PRECINCT_360             11
 3 PRECINCT_411             37
 4 PRECINCT_345             42
 5 PRECINCT_101             43
 6 PRECINCT_253             58
 7 PRECINCT_355             60
 8 PRECINCT_175             64
 9 PRECINCT_031             66
10 PRECINCT_403             68
# ℹ 410 more rows

Or, to instead arrange from highest to lowest, wrap the column in desc() inside the arrange() call, as seen below:

R

data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n()) %>%
  arrange(desc(total_checkins))

An additional example of the arrange() function can be found at tidy data tutor

count()

When working with data, we often want to know how many observations we have for each factor or combination of factors. As you saw above, we were able to complete this using the group_by() function, followed by the summarize() function.

However, since this is such a common task, dplyr provides the count() function to make this task much quicker and easier to write and perform!

For example, if we want to count the number of check-ins for each precinct, instead of grouping by precinct and summarizing using the n() function, we can do the following:

R

data %>%
    count(precinct)

OUTPUT

# A tibble: 420 × 2
   precinct         n
   <chr>        <int>
 1 PRECINCT_001   648
 2 PRECINCT_002   257
 3 PRECINCT_003   806
 4 PRECINCT_004   466
 5 PRECINCT_005   762
 6 PRECINCT_006   676
 7 PRECINCT_007  1347
 8 PRECINCT_008  1652
 9 PRECINCT_009   742
10 PRECINCT_010   882
# ℹ 410 more rows

Additionally, if you’d like your results sorted, instead of using the arrange() function, you can add sort = TRUE as an argument to the count() function, as seen below:

R

data %>%
    count(precinct, sort = TRUE)

OUTPUT

# A tibble: 420 × 2
   precinct         n
   <chr>        <int>
 1 PRECINCT_219  1968
 2 PRECINCT_016  1807
 3 PRECINCT_271  1798
 4 PRECINCT_317  1731
 5 PRECINCT_358  1717
 6 PRECINCT_239  1705
 7 PRECINCT_199  1700
 8 PRECINCT_323  1695
 9 PRECINCT_106  1680
10 PRECINCT_045  1671
# ℹ 410 more rows
Challenge

Exercise

Using what you’ve learned above, determine how many check-ins were recorded for each device. Which device had the highest number of check-ins?

R

data %>%
    count(device, sort = TRUE)

OUTPUT

# A tibble: 1,215 × 2
   device         n
   <chr>      <int>
 1 DEVICE_255   898
 2 DEVICE_190   894
 3 DEVICE_642   887
 4 DEVICE_178   850
 5 DEVICE_435   821
 6 DEVICE_960   817
 7 DEVICE_959   812
 8 DEVICE_436   796
 9 DEVICE_641   782
10 DEVICE_822   769
# ℹ 1,205 more rows

“DEVICE_255” has the highest number of check-ins, with 898 recorded!

Challenge

Exercise (continued)

For “PRECINCT_007”, find the device that recorded the fewest check-ins.

Hint: ensure you filter your data before applying split-apply-combine!

R

data %>%
  filter(precinct == "PRECINCT_007") %>%
  group_by(device) %>%
  summarize(total_checkins = n()) %>%
  arrange(desc(total_checkins))

OUTPUT

# A tibble: 5 × 2
  device     total_checkins
  <chr>               <int>
1 DEVICE_919            462
2 DEVICE_917            448
3 DEVICE_918            426
4 DEVICE_920             10
5 DEVICE_009              1

“DEVICE_009” recorded the fewest check-ins, with only 1.

Mutating Data


Sometimes, you may want to create new columns based on values in existing columns. For example, if you have a column measured in seconds, you might want to add a new column with the same information expressed in minutes instead.

To do this, we use the mutate() function. This function allows us to create new columns or modify existing columns by applying operations to each row of the data set!

For example, let’s say that we want to create a new column that, as mentioned above, converts the checkin_length column (which is in seconds) into minutes by dividing each value by 60. Below, we can use the mutate function to add this column to our data:

R

data %>%
    mutate(checkin_length_min = checkin_length / 60)

OUTPUT

# A tibble: 352,112 × 7
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000007             64 2018-11-06 07:09:36 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000008            262 2018-11-06 07:10:18 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000010            260 2018-11-06 07:13:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 352,102 more rows
# ℹ 1 more variable: checkin_length_min <dbl>

Admittedly, this operation doesn’t tell us anything additional about our data, as it only converts part of our data into a different format. But, with a more complex operation we could, for example, add a column that says whether a check-in length is “abnormal” or not!

For the sake of the example, let’s say that any check-in length greater-than or equal-to 200 seconds is abnormal:

R

data %>%
  mutate(checkin_category = ifelse(checkin_length >= 200, "abnormal", "normal"))

OUTPUT

# A tibble: 352,112 × 7
   checkin_id     checkin_length checkin_time        location    precinct device
   <chr>                   <dbl> <dttm>              <chr>       <chr>    <chr>
 1 CHECKIN_000001             45 2018-11-06 07:02:36 LOCATION_0… PRECINC… DEVIC…
 2 CHECKIN_000002             29 2018-11-06 07:04:09 LOCATION_0… PRECINC… DEVIC…
 3 CHECKIN_000003             65 2018-11-06 07:05:13 LOCATION_0… PRECINC… DEVIC…
 4 CHECKIN_000004             28 2018-11-06 07:06:26 LOCATION_0… PRECINC… DEVIC…
 5 CHECKIN_000005             17 2018-11-06 07:08:08 LOCATION_0… PRECINC… DEVIC…
 6 CHECKIN_000006             56 2018-11-06 07:08:32 LOCATION_0… PRECINC… DEVIC…
 7 CHECKIN_000007             64 2018-11-06 07:09:36 LOCATION_0… PRECINC… DEVIC…
 8 CHECKIN_000008            262 2018-11-06 07:10:18 LOCATION_0… PRECINC… DEVIC…
 9 CHECKIN_000009            245 2018-11-06 07:12:57 LOCATION_0… PRECINC… DEVIC…
10 CHECKIN_000010            260 2018-11-06 07:13:41 LOCATION_0… PRECINC… DEVIC…
# ℹ 352,102 more rows
# ℹ 1 more variable: checkin_category <chr>

This code adds a new checkin_category column that labels every check-in as either “abnormal” or “normal” based on its length.

Additional examples of the mutate() function can be found at tidy data tutor

Challenge

Exercise

Using what you’ve learned throughout this lesson, create a tibble called “avg_checkins” that meets the following criteria:

  1. Includes only precincts from “PRECINCT_001” to “PRECINCT_035”.
  2. Removes the “PRECINCT_0” prefix from the precinct names and converts each precinct name to a numeric value.
  3. Calculates the average check-in length for each precinct, ensuring this column is named “avg_checkin_length”.
  4. Contains two columns: “precinct” and “avg_checkin_length”.
  5. Sorts the tibble by precinct (1 to 35).

R

avg_checkins <- data %>%
  mutate(precinct = as.numeric(str_remove(precinct, "PRECINCT_0"))) %>%
  filter(precinct >= 1 & precinct <= 35) %>%
  group_by(precinct) %>%
  summarize(avg_checkin_length = mean(checkin_length)) %>%
  arrange(precinct)

WARNING

Warning: There was 1 warning in `mutate()`.
ℹ In argument: `precinct = as.numeric(str_remove(precinct, "PRECINCT_0"))`.
Caused by warning:
! NAs introduced by coercion
Challenge

Exercise (continued)

Save your new “avg_checkins” into your data folder as “avg_checkins.csv”!

R

write_csv(avg_checkins, "data/avg_checkins.csv")
Key Points
  • Use the dplyr package to manipulate tibbles.
  • Use select() to choose variables from a tibble.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.

Content from Data Wrangling with tidyr


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • How can I reformat a tibble to meet my needs?

Objectives

  • Describe the concept of a wide and a long table format and for which purpose those formats are useful.
  • Describe the roles of variable names and their associated values when a table is reshaped.
  • Reshape a tibble from long to wide format and back with the pivot_wider and pivot_longer commands from the tidyr package.

dplyr pairs nicely with tidyr, a package that enables you to swiftly convert between different data formats (long vs. wide) for plotting and analysis. To learn more about tidyr after the workshop, you may want to check out this handy data tidying with tidyr cheatsheet.

To make sure everyone will use the same data sets for this lesson, we’ll be reading in the updated version of the Check-In Dataset (as created in “Starting With Data”), as well as the Messy Dataset (which we will cover at the end of this lesson).

Reading in Data


To start, we will load in the tidyverse and here packages so we can read in our CSV files.

R

library(tidyverse)
library(here)

Next, we will read in the Check-In Data:

R

data <- read_csv(here("data", "checkin_data_2.csv"))

Reshaping with pivot_wider() and pivot_longer()


There are essentially three rules that define a “tidy” data set:

  1. Each variable has its own column
  2. Each observation has its own row
  3. Each value must have its own cell

This graphic visually represents the three rules that define a “tidy” data set:

A visual representation of the three rules that define a 'tidy' data set.
R for Data Science, Wickham H and Grolemund G (https://r4ds.had.co.nz/index.html) © Wickham, Grolemund 2017. This image is licensed under Attribution-NonCommercial-NoDerivs 3.0 United States (CC-BY-NC-ND 3.0 US).

In this section we will explore how these rules are linked to the different data formats researchers are often interested in: “wide” and “long”. This tutorial will help you efficiently transform your data shape, regardless of its original format.

First, we will explore qualities of the data tibble and how they relate to these different data formats.

Long and Wide Data Formats

In data, each row contains the values of variables associated with each record collected (each ballot instance). As you may recall from “Starting With Data”, it was stated that the checkin_id was added to provide a “unique key/ID” for each individual ballot.

Since checkin_id is unique to each instance, we can use this variable as an identifier corresponding to each of the 352,112 observations.

R

data %>% 
  select(checkin_id) %>% 
  distinct() %>%
  nrow()

OUTPUT

[1] 352112

As seen in the code below, for each check-in time corresponding to each device, no two checkin_ids are the same. Thus, this format is what we call a “long” data format, where each observation occupies only one row in the tibble.

R

data %>%
  filter(location == "LOCATION_001") %>%
  select(checkin_id, checkin_time, location) %>%
  sample_n(size = 10)

OUTPUT

# A tibble: 10 × 3
   checkin_id     checkin_time        location
   <chr>          <dttm>              <chr>
 1 CHECKIN_000106 2018-11-06 08:51:39 LOCATION_001
 2 CHECKIN_000440 2018-11-06 15:06:41 LOCATION_001
 3 CHECKIN_000175 2018-11-06 09:38:17 LOCATION_001
 4 CHECKIN_000395 2018-11-06 13:49:55 LOCATION_001
 5 CHECKIN_000185 2018-11-06 09:43:13 LOCATION_001
 6 CHECKIN_000060 2018-11-06 08:08:15 LOCATION_001
 7 CHECKIN_000340 2018-11-06 12:26:51 LOCATION_001
 8 CHECKIN_000107 2018-11-06 08:51:54 LOCATION_001
 9 CHECKIN_000345 2018-11-06 12:32:37 LOCATION_001
10 CHECKIN_000138 2018-11-06 09:13:48 LOCATION_001

If you were to look at the entire data tibble, you would notice that the layout/format of the data adheres to rules 1-3, where:

  1. each column is a variable
  2. each row is an observation
  3. each value has its own cell

As mentioned above, this is called a “long” data format. Additionally, you may notice that each column represents a different variable. In the “longest” data format there would only be three columns, one for the id variable, one for the observed variable, and one for the observed value (of that variable). This data format is quite unsightly and difficult to work with, so you will rarely see it in use.

Alternatively, in a “wide” data format we see modifications to rule 1, where each column no longer represents a single variable. Instead, columns can represent different levels/values of a variable. For instance, in some data you encounter, the researchers may have chosen for every check-in hour to be a different column.

These may sound like dramatically different data layouts, but there are some tools that make transitions between these layouts much simpler than you might think! The GIF below shows how these two formats relate to each other, and gives you an idea of how we can use R to shift from one format to the other.

A gif showing how long and wide tibble layouts relate to each other.
Animation showing pivot_wider and pivot_longer functions transforming data between long and wide formats

Long and wide tibble layouts mainly affect readability. You may find that, visually, you prefer the “wide” format, since you can see more of the data on the screen. However, all of the R functions we have used thus far expect your data to be in a “long” format. This is because the long format is more machine-readable and is closer to how databases are structured.

Questions That Warrant Different Data Formats

In data, each row contains values associated with each record (the unit). This may include values such as the ID of the ballot box, the ballot box’s location, the precinct the ballot box belongs to, or the arrival time of the person submitting the ballot. This format allows us to make comparisons across individual ballot instances!

However, what if we wanted to look at how many check-ins occurred each hour at each polling location?

To facilitate this comparison, we would need to create a new table where each row (the unit) represents a polling location (associated with the location column), each column after the first represents an hour of the day (associated with the hour column), and each cell contains the number of check-ins recorded at that location during that hour.

Once we’ve created this new table, we can explore the relationships within and between locations. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest.

Alternatively, let’s say the check-in times were originally spread across multiple columns, and we were interested in visualizing, across multiple locations, how check-in activity has changed over the course of the day. This would require the check-in time to be included in a single column rather than spread across multiple columns. Thus, we would need to transform the column names into the values of a variable.

We can do both of these transformations with two tidyr functions, pivot_wider() and pivot_longer().

Pivoting Wider


pivot_wider() takes in three principal arguments:

  1. the data to be transformed
  2. the names_from column variable (whose values will become new column names).
  3. the values_from column variable (whose values will fill the new column variables).

Further arguments include values_fill which, if set, fills in missing values with the value provided, and names_sort, which, if set, sorts the columns in alphanumerical order.
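As a quick illustration of values_fill, here is a sketch on a small, invented counts tibble (toy_counts is not part of the check-in data): location "B" was never observed at hour 7, and without values_fill = 0 that cell would become NA.

```r
library(dplyr)
library(tidyr)

# Invented counts: location "B" has no hour-7 observations
toy_counts <- tibble(
  location = c("A", "A", "B"),
  hour     = c(7, 8, 8),
  n        = c(5, 3, 4)
)

toy_wide <- toy_counts %>%
  pivot_wider(
    names_from  = hour,
    values_from = n,
    values_fill = 0  # absent location/hour combinations become 0 instead of NA
  )

toy_wide
```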

Let’s use pivot_wider() to transform data to create new columns for each hour represented within the data.

To help with understanding, we will be walking through the transformation line-by-line.

First we create a new object (data_tc) based on the data tibble:

R

data_tc <- data %>%

Our next step will be to get the values for each cell, so we will be using the count() function from the dplyr package. This is completed in the next line, grouping by location and hour:

R

count(location, hour) %>%

Finally, we will be creating and populating the new, “wide” data using the counts and the column values! This can be seen below:

R

pivot_wider(
  names_from = hour,
  values_from = n,
  values_fill = 0
)

Now that we understand what’s going on, let’s combine all those chunks together and look at what our completed tibble looks like!

R

#create the object
data_tc <- data %>%
  #get the values
  count(location, hour) %>%
  #pivot the data
  pivot_wider(
  names_from = hour,
  values_from = n,
  values_fill = 0
)

head(data_tc)

OUTPUT

# A tibble: 6 × 16
  location       `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`
  <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 LOCATION_001    50    71    77    62    65    40    41    30    28    35    62
2 LOCATION_002    16    29    19    32    14    22    14    13    19    20    24
3 LOCATION_003    74    69    88   106    65    64    54    42    49    51    55
4 LOCATION_004    81    74    73    61    59    29    35    36    42    45    54
5 LOCATION_005    53    31    57    64    61    49    57    45    54    67    99
6 LOCATION_006   115    65    75    75    78    44    50    52    50    92    88
# ℹ 4 more variables: `18` <int>, `19` <int>, `6` <int>, `20` <int>

Oh no! It looks like the hours columns are out of order, with 6 sitting between 19 and 20. If we were to perform data analysis, this would not matter, but visually, this can be confusing or misleading, since we expect time to move from left to right in ascending order.

In order to fix this, we can add the aforementioned names_sort argument to the function to specify that the columns should be in order. This line has been added to the code block below:

R

#create the object
data_tc <- data %>%
  #get the values
  count(location, hour) %>%
  #pivot the data
  pivot_wider(
  names_from = hour,
  values_from = n,
  values_fill = 0,
  names_sort = TRUE #sorts the columns from left to right
)

head(data_tc)

OUTPUT

# A tibble: 6 × 16
  location       `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`
  <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 LOCATION_001     0    50    71    77    62    65    40    41    30    28    35
2 LOCATION_002     0    16    29    19    32    14    22    14    13    19    20
3 LOCATION_003     0    74    69    88   106    65    64    54    42    49    51
4 LOCATION_004     0    81    74    73    61    59    29    35    36    42    45
5 LOCATION_005     0    53    31    57    64    61    49    57    45    54    67
6 LOCATION_006     1   115    65    75    75    78    44    50    52    50    92
# ℹ 4 more variables: `17` <int>, `18` <int>, `19` <int>, `20` <int>

As seen by the outputted tibble above, the hour columns now appear in ascending order, making the table far easier to interpret at a glance!

Now that we’ve used pivot_wider() to make our data “wide”, let’s take a closer look at the resulting data_tc tibble to gain a better understanding.

First, let’s check the dimensions:

R

dim(data_tc)

OUTPUT

[1] 417  16

As we can see, there are 417 rows and 16 columns! Each row represents a unique location within the data set. We can verify this by counting the number of unique location values within data:

R

n_distinct(data$location)

OUTPUT

[1] 417

This also returns 417, confirming that each row corresponds to a single, unique location within the data.

Next, let’s look at the 16 columns of the tibble:

R

colnames(data_tc)

OUTPUT

 [1] "location" "6"        "7"        "8"        "9"        "10"
 [7] "11"       "12"       "13"       "14"       "15"       "16"
[13] "17"       "18"       "19"       "20"      

Notice there is no longer a column titled hour. This is because the pivot_wider() function, by default, removes the original column that the new column values were taken from. In this case, the values from the original hour column have now become columns with names that range from 6 to 20, representing the hours from 6AM to 8PM, and thus the hour column has been dropped.
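If bare numeric column names like `6` or `20` feel awkward to work with, pivot_wider() also accepts a names_prefix argument that prepends a string to each generated name. Here is a sketch on a small, invented tibble (the "hour_" prefix is our own choice, not something from the check-in data):

```r
library(dplyr)
library(tidyr)

# An invented counts tibble for illustration
toy_counts <- tibble(
  location = c("A", "A"),
  hour     = c(6, 7),
  n        = c(2, 5)
)

toy_wide <- toy_counts %>%
  pivot_wider(
    names_from   = hour,
    values_from  = n,
    names_prefix = "hour_"  # columns become hour_6 and hour_7 instead of 6 and 7
  )

toy_wide
```

Prefixed names can be referred to without backticks (e.g. select(hour_6)), which can make downstream code a little cleaner.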

This new format of the data allows us to do interesting things, like make a table showing the number of check-ins across all locations at a particular time, with the rows being ordered from highest to lowest in terms of count:

R

data_tc %>%
  select(location, `7`) %>%
  arrange(desc(`7`))

OUTPUT

# A tibble: 417 × 2
   location       `7`
   <chr>        <int>
 1 LOCATION_233   234
 2 LOCATION_364   215
 3 LOCATION_258   212
 4 LOCATION_366   197
 5 LOCATION_417   197
 6 LOCATION_306   194
 7 LOCATION_317   193
 8 LOCATION_166   189
 9 LOCATION_403   188
10 LOCATION_386   183
# ℹ 407 more rows

Or, we can calculate the total number of check-ins for each location across all hours, and sort the data to determine which location had the fewest check-ins:

R

data_tc %>%
  mutate(total_checkins = rowSums(data_tc[-1])) %>%
  select(location, total_checkins) %>%
  arrange(total_checkins)

OUTPUT

# A tibble: 417 × 2
   location     total_checkins
   <chr>                 <dbl>
 1 LOCATION_048              2
 2 LOCATION_308             11
 3 LOCATION_393             38
 4 LOCATION_103             42
 5 LOCATION_280             42
 6 LOCATION_164             58
 7 LOCATION_298             60
 8 LOCATION_101             64
 9 LOCATION_014             66
10 LOCATION_138             68
# ℹ 407 more rows
Challenge

Exercise

We created data_tc by reshaping the data. Replicate this process to create a tibble named data_total that shows the total number of check-ins for each hour, across all locations.

The resulting tibble should have columns for each hour, sorted from earliest to latest similarly to the data_tc tibble. There should only be one row, representative of all locations, and an extra summary column, called total_checkins, that calculates the total number of check-ins across the entire data set.

R

data_total <- data %>%
  count(hour) %>%
  pivot_wider(
    names_from = hour,
    values_from = n,
    values_fill = 0,
    names_sort = TRUE
    ) %>%
  mutate(total_checkins = rowSums(across(everything())))

data_total

OUTPUT

# A tibble: 1 × 16
    `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1   265 34918 29613 34076 35186 30909 23119 21751 20178 23233 28925 31774 25924
# ℹ 3 more variables: `19` <int>, `20` <int>, total_checkins <dbl>

R

#alternative solution:
data_total_2 <- data %>%
  count(hour) %>%
  pivot_wider(
    names_from = hour,
    values_from = n,
    values_fill = 0,
    names_sort = TRUE
    )

data_total_2 <- data_total_2 %>%
  mutate(total_checkins = rowSums(data_total_2))

data_total_2

OUTPUT

# A tibble: 1 × 16
    `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`  `17`  `18`
  <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1   265 34918 29613 34076 35186 30909 23119 21751 20178 23233 28925 31774 25924
# ℹ 3 more variables: `19` <int>, `20` <int>, total_checkins <dbl>

Pivoting Longer


The opposite situation could occur if we had been provided with the data_tc tibble, but instead of treating each hour as an individual column, we wished to treat the hours as values of a single variable.

In this situation, we are gathering all of these columns and turning them into a pair of new variables. One variable will include the column names as values (checkin_hour), and the other will contain the values in each cell previously associated with the column names (checkin_count)!

pivot_longer() takes four principal arguments:

  1. the data to be transformed
  2. the names of the columns we use to fill the new values variable (or to drop), referred to as cols.
  3. the names_to column variable we wish to create from the cols provided.
  4. the values_to column variable we wish to create and fill with values associated with the cols provided.

R

data_tc_long <- data_tc %>%
  pivot_longer(cols = `6`:`20`,
               names_to = "checkin_hour",
               values_to = "checkin_count")

Below, we will look at the two tibbles and compare their structures:

R

head(data_tc)

OUTPUT

# A tibble: 6 × 16
  location       `6`   `7`   `8`   `9`  `10`  `11`  `12`  `13`  `14`  `15`  `16`
  <chr>        <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 LOCATION_001     0    50    71    77    62    65    40    41    30    28    35
2 LOCATION_002     0    16    29    19    32    14    22    14    13    19    20
3 LOCATION_003     0    74    69    88   106    65    64    54    42    49    51
4 LOCATION_004     0    81    74    73    61    59    29    35    36    42    45
5 LOCATION_005     0    53    31    57    64    61    49    57    45    54    67
6 LOCATION_006     1   115    65    75    75    78    44    50    52    50    92
# ℹ 4 more variables: `17` <int>, `18` <int>, `19` <int>, `20` <int>

R

head(data_tc_long)

OUTPUT

# A tibble: 6 × 3
  location     checkin_hour checkin_count
  <chr>        <chr>                <int>
1 LOCATION_001 6                        0
2 LOCATION_001 7                       50
3 LOCATION_001 8                       71
4 LOCATION_001 9                       77
5 LOCATION_001 10                      62
6 LOCATION_001 11                      65

As you can see, the hours and their corresponding counts for each location are now separated into individual rows! Each location appears multiple times – once for every hour – rather than appearing just once, as in a wide-table format.

Challenge

Exercise

In the last exercise, you created the wide tibble, data_total. In this exercise, your goal is to reverse this transformation using pivot_longer().

Create a tibble called data_total_long that has two columns: one for the hour, and one for the corresponding check-in count. During your transformation, remove the total_checkins column.

R

data_total_long <- data_total %>%
  select(-total_checkins) %>%
  pivot_longer(
    cols = everything(),
    names_to = "hour",
    values_to = "checkin_count"
  )

data_total_long

OUTPUT

# A tibble: 15 × 2
   hour  checkin_count
   <chr>         <int>
 1 6               265
 2 7             34918
 3 8             29613
 4 9             34076
 5 10            35186
 6 11            30909
 7 12            23119
 8 13            21751
 9 14            20178
10 15            23233
11 16            28925
12 17            31774
13 18            25924
14 19            12178
15 20               63

Other Useful tidyr Functions


Throughout this lesson, we used only a portion of the commands that tidyr offers for data transformation. Below, we will be briefly covering some other functions that may prove useful throughout your future analyses (you can refer to the tidyr cheat sheet linked at the beginning of the lesson for more in-depth explanations):

  1. separate_longer_delim() – splits one column into many rows, based on a delimiter.

R

tibble(location = "1", count = "1,2,3") %>%
  separate_longer_delim(count, delim = ",")

OUTPUT

# A tibble: 3 × 2
  location count
  <chr>    <chr>
1 1        1
2 1        2
3 1        3    

  2. separate_wider_delim() – splits one column into multiple columns, based on a delimiter.

R

tibble(date = "01/01/2025") %>%
  separate_wider_delim(date, delim = "/", names = c("month", "day", "year"))

OUTPUT

# A tibble: 1 × 3
  month day   year
  <chr> <chr> <chr>
1 01    01    2025 

  3. unite() – combines multiple columns into one.

R

tibble(city = "Providence", state = "RI") %>%
  unite("location", city, state, sep = ", ")

OUTPUT

# A tibble: 1 × 1
  location
  <chr>
1 Providence, RI

  4. replace_na() – fills in missing values (NA) with a value of choice. The replacement must be supplied in a list.

R

tibble(count = c(1, NA, 3)) %>%
  replace_na(list(count = 2))

OUTPUT

# A tibble: 3 × 1
  count
  <dbl>
1     1
2     2
3     3

  5. drop_na() – removes rows that contain missing values (NA).

R

tibble(count = c(1, NA, 3)) %>%
  drop_na()

OUTPUT

# A tibble: 2 × 1
  count
  <dbl>
1     1
2     3

  6. fill() – fills in missing values (NA) with the value either above (.direction = "down") or below (.direction = "up") it.

R

#below
tibble(count = c(1, NA, 3)) %>%
  fill(count, .direction = "up")

OUTPUT

# A tibble: 3 × 1
  count
  <dbl>
1     1
2     3
3     3

R

#above
tibble(count = c(1, NA, 3)) %>%
  fill(count, .direction = "down")

OUTPUT

# A tibble: 3 × 1
  count
  <dbl>
1     1
2     1
3     3

  7. complete() – adds rows for every combination of the given variables that could exist but is missing from the input data.

R

tibble(location = c("A", "B", "B"), hour = c(3, 1, 2)) %>%
  complete(location, hour)

OUTPUT

# A tibble: 6 × 2
  location  hour
  <chr>    <dbl>
1 A            1
2 A            2
3 A            3
4 B            1
5 B            2
6 B            3

Applying What We Learned to Clean Data


Introduction to the Messy Dataset

The Messy Dataset is an example of a “messy” data set that tracks when people check in to a voting location! In the context of the data set, labels (“provisional”, “assistance”, and “provisional and assistance”) are used to explain why check-in times may be longer than average. If a check-in does not have a label, assistance was not needed, and the check-in can be considered “normal”. Within this data set, missing data is encoded as “NULL”.

The following is a visual representation of the data set’s columns:

column_name                                  description
CheckIn_Duration_Provisional                 Includes check-ins that fall under the “Provisional” label.
CheckIn_Duration_Assistance                  Includes check-ins that fall under the “Assistance” label.
CheckIn_Duration_Provisional_and_Assistance  Includes check-ins that fall under the “Provisional and Assistance” label.
CheckIn_Duration_                            Includes check-ins that did not fall under any label, or in other words, were normal.

As mentioned above, missing information in data is encoded as “NULL”. This requires us to specify na = "NULL" within the read_csv() function, allowing R to automatically convert all the “NULL” entries in the data set into NA.
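To see what the na argument does, here is a small sketch using a hypothetical inline csv (the I() wrapper tells read_csv() to treat the string itself as the file contents):

```r
library(readr)

#a hypothetical three-row csv in which missing data is encoded as "NULL"
csv_text <- "duration\n80\nNULL\n55\n"

#without na = "NULL", the column is read as text: "80", "NULL", "55"
read_csv(I(csv_text), show_col_types = FALSE)

#with na = "NULL", readr converts "NULL" to NA, and the column becomes numeric
read_csv(I(csv_text), na = "NULL", show_col_types = FALSE)
```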

Below, we will be reading in the Messy Dataset using this additional argument:

R

messy_data <- read_csv(here("data", "messy_data.csv"), na = "NULL")

Tidying the Data

Throughout this next section, we’re going to be tidying/cleaning the Messy Dataset step-by-step to ensure understanding throughout!

We’ll start by looking at the data so we can understand what we’re working with:

R

messy_data

OUTPUT

# A tibble: 514 × 4
   CheckIn_Duration_Provisional CheckIn_Duration_Assist…¹ CheckIn_Duration_Pro…²
                          <dbl>                     <dbl>                  <dbl>
 1                           NA                        NA                     NA
 2                           NA                        NA                     NA
 3                           NA                        NA                     NA
 4                           NA                        NA                     NA
 5                           NA                        NA                     NA
 6                           NA                        NA                     NA
 7                           NA                        NA                     NA
 8                           NA                        NA                     NA
 9                           NA                        NA                     NA
10                           NA                        NA                     NA
# ℹ 504 more rows
# ℹ abbreviated names: ¹​CheckIn_Duration_Assistance,
#   ²​CheckIn_Duration_Provisional_and_Assistance
# ℹ 1 more variable: CheckIn_Duration_ <dbl>

At first glance, we can see this data set is wide, with each label tacked onto the end of the phrase “CheckIn_Duration_” and underscores replacing spaces. Additionally, there is no label after “CheckIn_Duration_”, which indicates this is likely representative of the normal check-ins!

However, looking at how many missing values there are, it may be a better choice to turn the data into “long” data, instead of “wide” data, with a duration column, and a label column. Let’s apply this pivot to a new tibble, named clean_data, below:

R

clean_data <- messy_data %>%
  pivot_longer(cols = everything(),
               names_to = "label",
               values_to = "duration")

head(clean_data)

OUTPUT

# A tibble: 6 × 2
  label                                       duration
  <chr>                                          <dbl>
1 CheckIn_Duration_Provisional                      NA
2 CheckIn_Duration_Assistance                       NA
3 CheckIn_Duration_Provisional_and_Assistance       NA
4 CheckIn_Duration_                                 80
5 CheckIn_Duration_Provisional                      NA
6 CheckIn_Duration_Assistance                       NA

Oh no! That’s a lot of NA values. Taking a closer look at the original data, we can see the first value within the data set consists of a duration of 80 underneath the "CheckIn_Duration_" column. Looking at our in-progress, “clean” data set, we can see the labels that do not apply to this duration are listed as NA.

Since the labels that have a duration of NA do not matter within our data set, we can drop them from the tibble completely:

R

clean_data <- clean_data %>%
  drop_na()

head(clean_data)

OUTPUT

# A tibble: 6 × 2
  label             duration
  <chr>                <dbl>
1 CheckIn_Duration_       80
2 CheckIn_Duration_       55
3 CheckIn_Duration_       61
4 CheckIn_Duration_       58
5 CheckIn_Duration_       63
6 CheckIn_Duration_       64

Now we’re getting somewhere! Next, when we loaded in the data set, it was noted that underscores replaced spaces throughout the data. As seen below, the next step is to revert that change:

R

clean_data <- clean_data %>%
  #including "all" in the str replace call ensures both underscores are replaced
  mutate(label = str_replace_all(label, "_", " "))

head(clean_data)

OUTPUT

# A tibble: 6 × 2
  label               duration
  <chr>                  <dbl>
1 "CheckIn Duration "       80
2 "CheckIn Duration "       55
3 "CheckIn Duration "       61
4 "CheckIn Duration "       58
5 "CheckIn Duration "       63
6 "CheckIn Duration "       64

Our next step is removing the “CheckIn Duration” phrase from each label, which we will be completing below:

R

clean_data <- clean_data %>%
  mutate(label = str_remove(label, "CheckIn Duration "))

head(clean_data)

OUTPUT

# A tibble: 6 × 2
  label duration
  <chr>    <dbl>
1 ""          80
2 ""          55
3 ""          61
4 ""          58
5 ""          63
6 ""          64

After removing the “CheckIn Duration” prefix, we can see that some of our labels are now empty strings. However, as you may recall from our initial analysis of the data, empty labels indicate that the check-in was normal! So, our next step will be replacing the empty labels with “Normal” labels:

R

clean_data <- clean_data %>%
  mutate(label = ifelse(label == "", "Normal", label))

head(clean_data)

OUTPUT

# A tibble: 6 × 2
  label  duration
  <chr>     <dbl>
1 Normal       80
2 Normal       55
3 Normal       61
4 Normal       58
5 Normal       63
6 Normal       64

Now, our data is clean! In practice, all of these functions can (and should!) be chained together using pipes (and comments), as seen in the code block below:

R

clean_data_final <- messy_data %>%
  #pivot longer by label
  pivot_longer(cols = everything(),
               names_to = "label",
               values_to = "duration") %>%
  #remove rows with missing values
  drop_na() %>%
  #replace underscores with spaces
  mutate(label = str_replace_all(label, "_", " ")) %>%
  #remove "CheckIn Duration " from each label
  mutate(label = str_remove(label, "CheckIn Duration ")) %>%
  #replace empty labels with "Normal"
  mutate(label = ifelse(label == "", "Normal", label))
  
head(clean_data_final)

OUTPUT

# A tibble: 6 × 2
  label  duration
  <chr>     <dbl>
1 Normal       80
2 Normal       55
3 Normal       61
4 Normal       58
5 Normal       63
6 Normal       64

Since our data has been cleaned, we can now export it as clean_data.csv for use in future analysis. As you may recall from “Starting with Data”, we will be using the write_csv() function, specifying that we want our csv to go into our data folder:

R

write_csv(clean_data_final, "data/clean_data.csv")

Key Points
  • Use the tidyr package to change the layout of tibbles.
  • Use pivot_wider() to go from long to wide format.
  • Use pivot_longer() to go from wide to long format.

Content from Data Visualisation with ggplot2


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • What are the components of a ggplot?
  • How can I visualize check-in patterns over time?
  • How can I compare check-in frequencies across locations and devices?
  • What are the main differences between R base plots, lattice, and ggplot?
  • How can I visualize location data on maps with ggplot2?

Objectives

  • Produce scatter plots, box plots, and bar plots using ggplot.
  • Create time series plots for temporal check-in data.
  • Set universal plot settings.
  • Describe what faceting is and apply faceting in ggplot.
  • Modify the aesthetics of an existing ggplot plot (including axis labels and color).
  • Build complex and customized plots from data in a tibble.
  • Create maps with ggplot2 to visualize location-based data.
  • Recognize the differences between base R, lattice, and ggplot visualizations.

This episode is a broad overview of ggplot2 and focuses on getting familiar with the layering system of ggplot2, using the argument group in the aes() function, and basic customization of the plots. We’ll show how to visualize patterns in check-in behavior across different locations and devices, and introduce mapping techniques.

We start by loading the required packages: tidyverse, here, and lubridate. As you may recall, ggplot2 is included in the tidyverse package, so we do not need to load ggplot2 separately.

R

library(tidyverse)
library(here)
library(lubridate) 

Next, let’s load in our data! Throughout this lesson, we will be using a sampled version of the data we created at the end of “Starting With Data”. In practice, sampling data before visualization is NOT required; however, due to the size of our original data set, using a smaller, sampled data set will allow us to generate plots much faster!

R

data <- read_csv(here("data", "checkin_sample_plotting.csv"))

Before we continue, let’s take a look at the structure and size of our data set to see what we’ll be working with in detail:

R

glimpse(data)

As you may notice, some of the hours exceed 12, meaning this data is in 24-hour time! If you are unfamiliar, this means 13 represents 1PM, 14 represents 2PM, and so on.
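If you would rather read these as 12-hour clock times, a quick base-R sketch converts the values (the hour24 vector here is made up for illustration):

```r
#made-up 24-hour values for illustration
hour24 <- c(6, 13, 14, 20)

#hours 13-23 map to 1-11 PM; a multiple of 12 stays 12
hour12 <- ifelse(hour24 %% 12 == 0, 12, hour24 %% 12)
ampm   <- ifelse(hour24 >= 12, "PM", "AM")

paste0(hour12, ampm)  #"6AM" "1PM" "2PM" "8PM"
```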

Additionally, for those curious, the original data set had approximately 352k lines, which means this data set is less than 10% of the size!

Visualization Options in R


Before we start with ggplot2, it’s helpful to know that there are several ways to create visualizations in R. While ggplot2 is great for building complex and highly customizable plots, there are simpler and quicker alternatives that you might encounter or use depending on the context. Let’s briefly explore a few of them:

Base-R Plots

Base R plots are the simplest form of visualization and are great for quick, exploratory analysis. You can create plots with very little code, but customizing them can be cumbersome compared to ggplot2.

Example of a simple time series plot in base R showing the number of check-ins by hour:

R

hourly_counts <- data %>%
                 count(hour)

plot(hourly_counts$hour, hourly_counts$n,
     main = "Base R Plot: Check-Ins by Hour",
     xlab = "Hour of Day",
     ylab = "Number of Check-Ins",
     type = "l")  #'l' for line

Lattice

Lattice is another plotting system in R, which allows for creating multi-panel plots easily. It’s different from ggplot2 because you define the entire plot in a single function call, and modifications after plotting are limited.

Example of a lattice plot showing check-ins by device for different locations:

R

library(lattice)

R

#grabs specific locations (so the graph isn't giant) and converts locations + devices to factors
checkins_lattice <- data %>%
                    filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003")) %>%
                    #we're removing "DEVICE_" because it causes overlap within the plot
                    #if you're curious, remove this line and regenerate the plot!
                    mutate(device = str_remove(device, "DEVICE_")) %>%
                    mutate(
                      device = as.factor(device),
                      location = as.factor(location)
                    )

#creates a lattice boxplot (bwplot)
bwplot(hour ~ device | location, data = checkins_lattice,
       main = "Lattice Plot: Check-in Hour Distribution by Device and Location",
       xlab = "Device",
       ylab = "Hour of Check-in",
       layout = c(length(unique(checkins_lattice$location)), 1), #adjusts layout for multiple locations
       strip = strip.custom(bg="lightgrey"),
       scales = list(y = list(at = 0:24)), #adds all hours on y, not just even numbers
       panel = function(x, y, ...) {
         panel.bwplot(x, y, ...)
       })

Plotting with ggplot2


ggplot2 is a plotting package that makes creating complex plots from data stored in a tibble simpler. It provides a programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. As a result, if the underlying data changes or if we decide to switch from a bar plot to a scatter plot, we only have to make minimal adjustments to the code!

ggplot2 functions work best with data in the ‘long’ format. As you may recall from “Data Wrangling with tidyr”, this consists of a column for every dimension, and a row for every observation. Ensuring you use well-structured data will save you lots of time when making figures with ggplot2.

ggplot2 graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.

Each chart built with ggplot2 must include the following:

  • Data
  • Aesthetic mapping (aes)
    • Describes how variables are mapped onto graphical attributes
    • Visual attributes of data including x-y axes, color, fill, shape, and alpha
  • Geometric objects (geom)
    • Determines how values are rendered graphically, as bars (geom_bar), scatterplot (geom_point), line (geom_line), etc.

Thus, the template for a graphic in ggplot2 is:

<DATA> %>%
    ggplot(aes(<MAPPINGS>)) +
    <GEOM_FUNCTION>()

Remember that the pipe operator %>% places the result of the previous line(s) into the first argument of the function. The ggplot function expects a data frame to be the first argument, which allows us to change from specifying the data = argument within the ggplot function to instead piping the data into the function.
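As a small sketch (using the built-in mtcars data frame so it runs on its own), these two calls produce equivalent plot objects:

```r
library(tidyverse)

#piping the data in...
p1 <- mtcars %>%
  ggplot(aes(x = mpg))

#...is the same as passing it as the first (data) argument
p2 <- ggplot(data = mtcars, aes(x = mpg))

#both hold the same underlying data
identical(p1$data, p2$data)  #TRUE
```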

To create a chart with ggplot2, follow the steps below:

  1. Use the ggplot() function and bind the plot to a specific tibble.

R

data %>%
  ggplot()

  2. Using the aesthetic (aes) function, define your mapping by selecting the variables to be plotted and specifying how to present them in the graph, e.g. as x/y positions or characteristics such as size, shape, color, etc.

R

data %>%
  ggplot(aes(x = precinct))

  3. Add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
    • geom_bar() for counting observations in categories
    • geom_histogram() for showing distributions
    • geom_boxplot() for statistical summaries
    • geom_line() for trend lines, time series, etc.

To add a geom to the plot use the + operator. Let’s start by creating a bar chart showing the distribution of check-ins across precincts:

R

data %>%
  ggplot(aes(x = precinct)) +
  geom_bar()

The + in the ggplot2 package is particularly useful because it allows you to modify existing ggplot objects. This means you can easily set up plot templates and conveniently explore different types of plots! Using this idea, the above plot can also be generated with code like this, similar to the “intermediate steps” approach:

R

#assign the plot to a variable
plot <- data %>%
        ggplot(aes(x = precinct))

#draw the plot as a bar plot
plot +
  geom_bar()

Callout

Notes

  • Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis mapping you set up in aes().
  • You can also specify mappings for a given geom independently of the mapping defined globally in the ggplot() function.
  • The + sign used to add new layers must be placed at the end of the line containing the previous layer. If, instead, the + sign is added at the beginning of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.

R

## This is the correct syntax for adding layers
checkins_plot +
  geom_point()

## This will not add the new layer and will return an error message
checkins_plot
+ geom_point()

Building Your Plots Iteratively


Building plots with ggplot2 is typically an iterative process. We start by defining the data set we’ll use, lay out the axes, and choose a geom.

Let’s re-create the time-series plot we made for the Base-R demonstration:

R

#using the hourly_counts we created, generate a time-series plot
hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() #creates a line plot using the x and y from the ggplot above!

Now that we have a baseline plot to start from, we can start modifying it to extract additional information! For instance, when inspecting the plot, we can notice that it’s a bit difficult to tell at first glance where each hour sits on the line.

To resolve this, we will add points to the line to clearly indicate each hour:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point()

Next, we will add colors for all of the points by specifying a color argument inside the geom_point function:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(color = "blue")

To color each point in the plot differently, you could use a vector as an input to the color argument; however, because we are now mapping features of the data to a color, instead of setting one color for all points, the color of the points now needs to be set inside a call to the aes function. When we map a variable in our data to the color of the points, ggplot2 will provide a different color corresponding to the different values of the variable.

Let’s apply this to our plot below, changing the color of each point based on the hour:

R

hourly_counts %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = hour))

Unfortunately, this doesn’t tell us much about our data, just that each point represents a different hour (which we already knew!). Additionally, you may notice that after adding conditional coloring using aes(), ggplot automatically added a legend to explain what the different colors represent!

Now, instead of coloring each point based on one of the variables we already have, we’re going to calculate the average hourly count and set the point to green if the count at that hour is above average and red if the count at that hour is below average!

To do this, we will calculate the average hourly count and, using mutate, add a column to our hourly_counts tibble that indicates whether the count at that hour is above or below the calculated average! Then, we will use the scale_color_manual function to manually color these points green and red instead of the default (which, when writing this lesson, was red and blue, respectively).

R

#calculate average
average <- mean(hourly_counts$n)

#plot
hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color)) + #colors the points
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors

Additionally, you may want to increase the size of the points! This can be accomplished using the size argument within the geom_point function, as seen below:

R

#calculate average
average <- mean(hourly_counts$n)

#plot
hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>% #adds the additional column
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) + #colors the points
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) #chooses the colors

At this point, our plot is mostly completed! The only remaining issue is the lack of proper titling and labeling.

By default, the axes labels on a plot are determined by the name of the variable being plotted. However, ggplot2 offers lots of customization options, like specifying the axes labels and adding a title to the plot, with relatively few lines of code. We will add more informative x-and y-axis labels to our plot, a more explanatory label to the legend, and a plot title.

The labs function takes the following arguments:

  • title – to produce a plot title
  • subtitle – to produce a plot subtitle (smaller text placed beneath the title)
  • caption – a caption for the plot
  • ... – any pair of name and value for aesthetics used in the plot (e.g., x, y, fill, color, size)

R

hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) +
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
  labs(title = "Check-In Count per Hour",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Relation to Average")

Our final step will be to improve the x-axis to include all hours, not just 10, 15, and 20! This can be achieved using the scale_x_continuous function.

The scale_x_continuous function is used to customize the x-axis when the x-axis is numeric (or continuous!). Within this function, you can control the axis limits (the range shown) and breaks (where tick marks appear).
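Before applying it to our plot, here is a standalone sketch of both arguments on a made-up tibble:

```r
library(tidyverse)

#made-up hourly data for illustration
demo <- tibble(hour = 0:23, n = (0:23)^2)

demo %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  scale_x_continuous(limits = c(6, 20),           #only show 6AM through 8PM
                     breaks = seq(6, 20, by = 2)) #tick marks every two hours
```

Note that points outside limits are dropped (with a warning), so set limits only when you genuinely want to trim the range.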

Let’s finish our plot using this function:

R

hourly_counts %>%
  mutate(avg_color = ifelse(n > average, "Above", "Below")) %>%
  ggplot(aes(x = hour, y = n)) +
  geom_line() +
  geom_point(aes(color = avg_color), size = 2) +
  scale_color_manual(values = c("Above" = "green", "Below" = "red")) +
  labs(title = "Check-In Count per Hour",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Relation to Average") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))

While the plot above gives information on the number of check-ins across all locations, we may want information unique to individual locations instead. To achieve this, using the information above, we can calculate the amount of check-ins every hour and add a line for each of the first five locations below:

R

#calculate check-ins per hour for each location
hourly_count <- data %>%
  count(location, hour)

#plot multiple lines, changing the color for each
hourly_count %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>% 
  ggplot(aes(x = hour, y = n, color = location)) + #Note: putting color in ggplot applies to all plots (geom_line AND geom_point)!
  geom_line(size = 1) +
  geom_point(size = 3) +
  labs(title = "Hourly Check-In Count by Location",
       x = "Hour (24H Format)",
       y = "Count",
       color = "Location") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))

As you can see, LOCATION_003 is very popular at 10AM (and may benefit from additional support from employees/volunteers), whereas LOCATION_002 dies down after 11AM.

Boxplot


We can use box plots to visualize the distribution of check-in times for specific locations:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black")

As you may notice, it’s a bit difficult to understand this plot at first glance! To resolve this, let’s begin by adding all of the hours on the y-axis using the scale_y_continuous function! This function behaves exactly like the scale_x_continuous function, but applies to the y-axis instead:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  scale_y_continuous(breaks = seq(0, 23, by = 1)) 

By adding points to a box plot, we can have a better idea of the number of measurements and of their distribution:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_point(color = "tomato") +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

Looking at this plot, from a rough estimate, it looks like there are far fewer dots on the plot than there are rows in our tibble. This should lead us to believe that there may be multiple observations plotted on top of each other (e.g. three observations where hour is 12 and location is LOCATION_001). This is known as “overplotting” and occurs when multiple data points share the same x and y coordinates.
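We can check this suspicion numerically. As a sketch with a tiny made-up data frame, duplicated() flags every row whose (location, hour) pair already appeared, i.e. every point that would be drawn exactly on top of an earlier one:

```r
#made-up check-ins: three identical (location, hour) pairs
df <- data.frame(location = c("A", "A", "A", "B"),
                 hour     = c(12, 12, 12, 9))

#rows 2 and 3 repeat row 1, so two points would be hidden
sum(duplicated(df))  #2
```

On the real data, the same idea (duplicated() on just the location and hour columns) counts the hidden points.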

There are two main ways to alleviate overplotting issues:

  1. changing the transparency of the points
  2. jittering the location of the points

Let’s first explore option 1, changing the transparency of the points. When we say “transparency”, we mean the opacity of the point, or your ability to see through it. We can control the transparency of the points with the alpha argument! Values of alpha range from 0 to 1, with lower values corresponding to more transparent colors (an alpha of 1 is the default). Specifically, an alpha of 0.1 would make a point one-tenth as opaque as a normal point. Stated differently, ten such points stacked on top of each other would correspond to one normal point.

With that being said, we’re going to change the alpha to 0.5 in an attempt to help fix the overplotting. As you may quickly notice, the overplotting is not solved, but adding transparency begins to address the problem, as the points where there are more overlapping observations are darker (as opposed to lighter red):

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_point(color = "tomato", alpha = 0.5) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

Since that only helped a little bit with the overplotting problem, let’s try option two and jitter the points on the plot, allowing us to see each point. Jittering introduces a little bit of randomness into the position of our points. You can think of this process as taking the overplotted graph and giving it a tiny shake! The points will move a little bit side-to-side and up-and-down, but their position in comparison to the original plot won’t dramatically change.

Note that this solution is only suitable for plotting integer values! For numeric values with decimals, geom_jitter() becomes inappropriate because it obscures the true value of the observation.

We can jitter our points using the geom_jitter() function instead of the geom_point() function, as seen below:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

As you can see, the points have been moved dramatically! Thankfully, the geom_jitter() function allows us to specify the amount of random motion in the jitter by using the width and height arguments. When we don’t specify values for width and height, geom_jitter() defaults to 40% of the resolution of the data (the smallest change that can be measured). Hence, if we would like less spread in our jitter than the default, we should pick values between 0.1 and 0.4. Experiment with the values to see how your plot changes!

Here, we initially chose a height of 0.05 (as too much variation in height may suggest different times at first glance) and a width of 0.2:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1))

For our final step, let’s add a title, appropriate labels, and improve the visuals of the plot overall! Additionally, to clean the location names on the x-axis, we’ll be using the mutate function (recall from Data Wrangling with dplyr) to remove the “LOCATION_” prefix from each name (since the axis label will indicate that these are locations!):

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>% #removes prefix
  ggplot(aes(x = location, y = hour)) +
  geom_boxplot(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1)) + 
  #adds labels to the plot
  labs(title = "Distribution of Check-in Times by Location",
       x = "Location",
       y = "Hour (24-hour Format)")
Challenge

Exercise

Box plots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bi-modal, we would not see it in a box plot. An alternative to the box plot is the violin plot, where the shape (of the density of points) is drawn.

Start by replacing the box plot with a violin plot; see geom_violin().

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location, y = hour)) +
  geom_violin(fill = "lightblue", color = "black") +
  geom_jitter(color = "tomato", alpha = 0.5, height = 0.05, width = 0.2) +
  scale_y_continuous(breaks = seq(0, 23, by = 1)) + 
  labs(title = "Distribution of Check-in Times by Location",
       x = "Location",
       y = "Hour (24-hour Format)")

So far, we’ve looked at the distribution of check-in times between locations. Next, you’re going to try making a new plot to explore the distribution of another variable between locations.

Let’s create a box plot of minute for the locations above. Overlay a jitter layer on the box plot layer to display the distributions more accurately. Feel free to select any fill, color, alpha, height, and width! Ensure a title and proper axis labels are added.

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location, y = minute)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(color = "navy", alpha = 0.5, height = 0, width = 0.2) +
  labs(title = "Distribution of Check-in Minutes by Location",
       x = "Location",
       y = "Minute of Check-in")

Lastly, color each point according to the device used! Ensure you change the name of the legend as well and remove “DEVICE_” from all device names (to ensure a clean legend).

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = minute)) +
  geom_boxplot(alpha = 0) +
  geom_jitter(aes(color = device), alpha = 1, width = 0.2, height = 0.2) +
  labs(title = "Distribution of Check-in Minutes by Location",
       x = "Location",
       y = "Minute of Check-in",
       color = "Device")

Bar Plot


Bar plots are great for visualizing categorical data, such as counting the number of check-ins per device, per location, or per precinct. By default, geom_bar accepts a variable for x, and plots the number of instances of each value of x (in this case, location) within the data set.
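Under the hood, geom_bar() simply tallies how many rows share each x value, much like base R’s table() function. A minimal sketch of that counting step, using a hypothetical vector of location IDs:

```r
# toy vector of locations; table() counts occurrences of each value,
# which is exactly the quantity geom_bar() draws as bar heights
locations <- c("001", "001", "002", "001", "003")
counts <- table(locations)

counts["001"]  # 3 check-ins at location 001
```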

Let’s create a bar plot displaying check-in counts for the first five locations:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  ggplot(aes(x = location)) +
  geom_bar() +
  labs(title = "Check-In Count by Location",
       x = "Location",
       y = "Count")

Next, let’s use the fill aesthetic for the geom_bar() geom to color bars by the device used for check-in:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location)) +
  geom_bar(aes(fill = device)) +
  labs(title = "Check-In Count by Location",
       x = "Location",
       y = "Count",
       fill = "Device")

This creates a stacked bar chart. Unfortunately, as you may notice, this is a bit difficult to read. Instead, we can separate the portions of the stacked bar that correspond to each device and put them side-by-side by using the position argument for geom_bar() and setting it to “dodge”.

Let’s apply this concept to the code below, changing the title for clarity:

R

data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location)) +
  geom_bar(aes(fill = device), position = "dodge") +
  labs(title = "Count of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Count",
       fill = "Device")

As you can see, this is much easier to read and interpret!

In some cases, we may be more interested in the proportion of each individual device at each location rather than the actual count of each device. Proportions are helpful because they account for differences in sample sizes and instead focus on the distribution within specific locations! To compare proportions, we will first create a new tibble (prop_device) with a new column named “prop”, representing the proportion of check-ins from each device within each location.

R

prop_device <- data %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  count(location, device) %>%
  group_by(location) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
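Since prop divides each device’s count by the location total, the proportions within any one location always sum to 1. A quick base-R sketch of the same arithmetic, using hypothetical counts:

```r
# hypothetical check-in counts for three devices at a single location
n <- c(12, 8, 5)
prop <- n / sum(n)

round(prop, 2)  # 0.48 0.32 0.20
sum(prop)       # proportions within a location sum to 1
```

Checking that each group’s proportions sum to 1 is a useful sanity test before plotting.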

Now, we can use this new tibble to create our plot showing the proportion of each device at each location! When creating your plot, ensure you include y = prop within the initial ggplot call AND stat = "identity" in geom_bar() to tell ggplot to use the supplied y values instead of counting rows, and adjust labels/titles for clarity:

R

prop_device %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = prop)) +
  geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
  labs(title = "Proportion of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Proportion",
       fill = "Device")

Looking at this graph, we can see that all of the devices (except DEVICE_012) have similar proportions (i.e. usage rates) once sample sizes are taken into consideration!
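As an aside, ggplot2 provides geom_col() as a shorthand for geom_bar(stat = "identity"): both draw bar heights from a y value you supply rather than from row counts. A minimal sketch with a toy data frame (not the lesson data):

```r
library(ggplot2)

# toy proportions; geom_col() uses the supplied y values directly
toy <- data.frame(location = c("001", "002"), prop = c(0.6, 0.4))

p <- ggplot(toy, aes(x = location, y = prop)) +
  geom_col(fill = "lightblue", color = "black")
```

Either spelling works; geom_col() just saves you the stat argument.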

Callout

Note

If you’d prefer to visualize percentages instead of proportions, you can multiply the prop column by 100! For example:

R

prop_device <- prop_device %>%
  mutate(prop = (prop * 100))

If you adjust to percentages, however, please ensure you adjust titles and axis labels accordingly!

Challenge

Exercise

Using the information you learned above, create a bar plot showing the proportion (or percentages, if you’d like) of check-ins by hour for the first four devices (ie. “DEVICE_001”, “DEVICE_002”, “DEVICE_003”, and “DEVICE_004”). Which hours had the highest proportion of check-ins from DEVICE_001 and DEVICE_002?

R

#calculate proportions
prop_hour_device <- data %>%
  filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
  count(hour, device) %>%
  group_by(hour) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

#generate plot
prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1))
  #note: you can remove 6 and 20 by using this line instead: 
  #scale_x_continuous(breaks = seq(7, 19, by = 1))

From this plot, we can identify that DEVICE_001 has the highest proportion at 7:00/7AM and DEVICE_002 has the highest proportion at 19:00/7PM.

Challenge

Exercise

Create a bar plot showing the check-in counts for the ten devices with the highest number of check-ins. Color each bar according to the device, title it appropriately, and use proper axis labels!

R

#retrieve top devices
top_devices <- data %>%
  count(device) %>%
  top_n(10, n) %>%
  pull(device)

#create plot
data %>%
  filter(device %in% top_devices) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = device, fill = device)) +
  geom_bar() +
  labs(title = "Top 10 Devices by Number of Check-ins",
       x = "Device",
       y = "Count")+
  theme_classic()
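A side note on top_n(): in current versions of dplyr it has been superseded by slice_max(), which makes the ordering explicit. A small sketch with a toy data frame (not the lesson data):

```r
library(dplyr)

# toy counts; slice_max() keeps the rows with the largest n
toy <- data.frame(device = c("A", "B", "C"), n = c(5, 9, 2))

top2 <- toy %>%
  slice_max(n, n = 2) %>%
  pull(device)

top2  # "B" "A"
```

top_n() still works, but slice_max() is the recommended replacement going forward.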

Faceting


Rather than creating a single plot with side-by-side bars for each device, we may want to create multiple plots, where each plot shows the data for a single device. This would be especially useful if we had sampled a larger number of devices (like 5 or 10), as side-by-side bars become harder to read as the number of bars increases.

ggplot2 has a special technique called faceting that allows the user to split one plot into multiple plots based on a factor included in the data set. Below, we can use this technique to split our bar plot of check-in proportions by hour for each device so each device has its own panel:

R

#generate plot
prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ device, scales = "free_y") #here, we specify we want to facet wrap by device

You can click the “Zoom” button in your RStudio plots panel to view a larger version of this plot.

Plots with a white background usually look more readable when printed. We can set the background to white using the function theme_bw(). Additionally, we can remove the grid:

R

prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ device, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank())

We can also facet by location to see patterns of device proportions within different locations:

R

#creates new data using location information
prop_hour_device_loc <- data %>%
  filter(device %in% c("DEVICE_001", "DEVICE_002", "DEVICE_003", "DEVICE_004")) %>%
  count(hour, location, device) %>% 
  group_by(hour, location) %>% #this specifies to calculate within locations as well
  mutate(prop = n / sum(n)) %>%
  ungroup()

#generates plot
prop_hour_device_loc %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Hourly Distribution of Device Check-Ins, Faceted by Location",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  facet_wrap(~ location, scales = "free_y") +
  theme_bw() +
  theme(panel.grid = element_blank())

Looking at the graph above, we can see that at LOCATION_001, devices have varying rates of usage throughout the day, and at LOCATION_002, devices are often used the same amount!

Histograms


When working with election data, understanding the distribution of check-ins over time is crucial! As seen above, bar plots allow us to look at general peaks and overall trends using the hour variable. However, if we wanted to look at the distribution of check-ins at a more detailed level (like by minute intervals), bar plots become much less effective.

In these cases, histograms are more appropriate! Histograms sort a continuous variable into bins of a chosen width, making it easier to identify trends at any level of granularity.

First, let’s look at the bar chart below:

R

data %>%
  ggplot(aes(x = hour)) +
  geom_bar(color = "black", fill = "lightblue") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour",
       x = "Hour (24H Format)",
       y = "Count")

Now, let’s create a similar plot displaying the distribution of check-ins by hour using a histogram instead of a bar plot:

R

data %>%
  ggplot(aes(x = hour)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = 1) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour",
       x = "Hour (24H Format)",
       y = "Count")

As you may see, the plots look almost identical, save for the histogram having bars that touch (since the data is continuous and not discrete/categorical).

With histograms, however, we can create a more granular view by using smaller bins:

R

#create a decimal representation of the data (hour + minutes)
checkins_with_dec_hour <- data %>%
  mutate(dec_hour = hour + minute/60)

#plot with 15-minute bins (binwidth = 0.25 hours)
checkins_with_dec_hour %>%
  ggplot(aes(x = dec_hour)) +
  geom_histogram(color = "black", fill = "lightblue", binwidth = 0.25) +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Check-In Distribution by Hour (15-Minute Intervals)",
       x = "Hour (24H Format)",
       y = "Count")

Looking at this graph, it’s clearer that there is a large spike of check-ins early in the morning (between 7AM and 8AM). If you were to only look at the bar plot or the histogram with 1-hour bins, however, you may have assumed check-ins kept about the same rate throughout the whole morning (7AM - 10AM)!

Visualizing Location Data with Maps


When working with geographic or location data, it’s often useful to visualize it on a map. Throughout the next section, we’ll demonstrate ways to work with spatial data using the Game of Thrones Dataset!

First, let’s load the sf package. This package allows ggplot2 to work with spatial data (like shapefiles):

R

library(sf)

Next, let’s load in the map data containing our map polygons:

R

#read in data and save to object
westeros_map <- st_read(here("data", "polygons_GoT.geojson"), quiet = TRUE)

#look at the data structure
head(westeros_map, 3)

Finally, let’s load the voting data and link it to our map data using the merge function. This function allows two tibbles to be linked based on a specified variable (in our case, “id”):

R

#read in data and save to object
got_votes <- read_csv(here("data", "voting_GoT.csv"))

#look at the data structure
head(got_votes)

#join data using the merge function
westeros_voting <- merge(westeros_map, got_votes, by = "id")
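By default, merge() performs an inner join: only rows whose key appears in both tables are kept. A small sketch with hypothetical tables:

```r
# two toy tables sharing an "id" key
regions <- data.frame(id = 1:3, region = c("North", "West", "South"))
votes   <- data.frame(id = c(2, 3, 4), votes = c(10, 20, 30))

merged <- merge(regions, votes, by = "id")
nrow(merged)  # 2 -- only ids 2 and 3 appear in both tables
```

If a region in your map data has no matching row in the voting data (or vice versa), it will silently disappear from the merged result, so it is worth checking the row count after merging.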

Map Introduction

Now that our data is ready to be mapped, let’s start by visualizing which regions favor Jon Snow over Daenerys Targaryen.

When using spatial data, we use a special ggplot2 function called geom_sf. Simply put, this tells ggplot to look at the simple features (like lines or polygons) in your data and use them for the graph!

Below, we will be using geom_sf on our combined data and use Jon_Snow_pct to determine the level of support Jon Snow is getting from each region:

R

ggplot() +
  geom_sf(data = westeros_voting, aes(fill = Jon_Snow_pct)) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Support for Jon Snow across Westeros",
       fill = "Support %") +
  theme_bw()

Next, let’s do the same for Daenerys Targaryen, but with red instead of blue for the color scale:

R

# Create a map colored by Daenerys support
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = Daenerys_Targaryen_pct)) +
  scale_fill_gradient(low = "pink", high = "darkred") +
  labs(title = "Support for Daenerys Targaryen across Westeros",
       fill = "Support %") +
  theme_bw()

Conditional Map Coloring

Often, it may be more beneficial to color each part of the map according to the candidate that received the most votes, rather than displaying the amount of support a single candidate received.

This can be achieved by determining which candidate received the most votes and filling that section with that candidate’s color using the scale_fill_manual function:

R

#create a column with the name of the dominant candidate
westeros_voting$dominant <- ifelse(westeros_voting$Jon_Snow_pct > westeros_voting$Daenerys_Targaryen_pct, 
                                  "Jon Snow", "Daenerys Targaryen")

#pick fill colors based on the dominant candidate
dom_color <- c("Jon Snow" = "steelblue", 
               "Daenerys Targaryen" = "firebrick")

#create a map with the specified coloring
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = dominant)) +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  theme_bw()

In some cases, you may not just be interested in who won each region, but also in by how much. To map this, first determine the margin of victory and add a column indicating how strong the victory was:

R

#calculate margin of victory
westeros_voting$margin <- abs(westeros_voting$Jon_Snow_pct - westeros_voting$Daenerys_Targaryen_pct)

#bin the margin into three levels (low, med, high)
westeros_voting$margin_bin <- ifelse(
  westeros_voting$margin <= 5, "Low",
  ifelse(westeros_voting$margin <= 20, "Med",
         "High")
)
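Nested ifelse() calls work, but base R’s cut() expresses the same binning more directly. A sketch of the same Low/Med/High rule applied to hypothetical margins:

```r
# hypothetical margins of victory (percentage points)
margin <- c(3, 12, 30)

# same rule as the nested ifelse: <= 5 is Low, <= 20 is Med, otherwise High
margin_bin <- cut(margin,
                  breaks = c(-Inf, 5, 20, Inf),
                  labels = c("Low", "Med", "High"))

as.character(margin_bin)  # "Low" "Med" "High"
```

cut() also scales better if you later decide to use more than three bins.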

Using the information you gained above, you can now develop your “fill rule” and select the color that corresponds to each instance. In this case, your “fill rule” consists of the winner of each region (ie. Jon Snow) and how high of a margin of victory they had (ie. High):

R

#make a fill rule (ie. Jon Snow - High)
westeros_voting$marg_fill <- paste(westeros_voting$dominant, westeros_voting$margin_bin, sep = " - ")

#pick fill colors based on the fill rule!
marg_color <- c(
  "Daenerys Targaryen - High" = "brown4",
  "Daenerys Targaryen - Med" = "firebrick",
  "Daenerys Targaryen - Low" = "pink",
  "Jon Snow - High" = "darkblue",
  "Jon Snow - Med" = "royalblue",
  "Jon Snow - Low" = "lightblue"
)

Your final step is to combine your fill rule and chosen colors with your mapping information, creating your margin of victory map:

R

#create margin of victory map
ggplot() +
  geom_sf(data = westeros_voting, aes(fill = marg_fill)) +
  scale_fill_manual(name = "Winner & Margin", values = marg_color) +
  labs(title = "Margin of Victory in Each Region") +
  theme_bw()

Adding Map Labels

After ensuring your map includes all the information required, the final step is adding region labels! Unfortunately, due to the nature of polygons, this is a bit more difficult than simply using the labs function.

To add region labels, your first step is to convert your data to a simple feature, also known as an sf, object. This will allow for the calculation of where your labels will sit on your map:

R

#convert to sf
westeros_voting_sf <- st_as_sf(westeros_voting)

Your second step is to determine where your region labels will sit on your map! This is done by calculating the centroids, or center points, of each region. Below, we will calculate the centroid of each region and convert its x and y coordinates to columns for easier access:

R

#calculate centroid
region_centroids <- st_centroid(westeros_voting_sf)

#extract the coordinates
coords <- st_coordinates(region_centroids)

#convert coordinates to columns coords.X and coords.Y
region_centroids$coords.X <- coords[, 1]
region_centroids$coords.Y <- coords[, 2]

Now that we have determined where the region labels will be placed, we can finally add the region labels onto the map using the geom_text function.

Within this function, we can specify the data used (in this case, region_centroids), the coordinates, the information that will be used for the label, and text formatting information (like size and bold/italics)!

Additionally, it’s important to note that we need to use westeros_voting_sf as the data for the map instead of westeros_voting. This will ensure that the region labels sit properly on their corresponding regions!

R

#create a map with the specified coloring
ggplot() +
  geom_sf(data = westeros_voting_sf, aes(fill = dominant)) +
  geom_text(data = region_centroids, 
            aes(x = coords.X, y = coords.Y, label = Name),
            size = 2, fontface = "bold") +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  theme_bw()

As you may notice, some labels in dense areas overlap quite a bit! This is due to the size of the plotting area in your local RStudio session. To resolve this, you can export the map at a larger size using ggsave (which will be covered at the end of this lesson!).

Challenge

Exercise

Using what you’ve learned above, create a map displaying the peak check-in wait times across the first 35 precincts. For this lesson, we will be using the avg_checkins.csv file we created within “Data Wrangling with dplyr”!

To complete this map, use the following steps:

1. Read in your data as “checkin_data”.
2. Using the merge function, link together your “checkin_data” with the “westeros_map”, creating a “westeros_checkins” dataframe. Hint: if the linking columns are named differently, use by.x and by.y to specify the two names (with x being the first data and y being the second).
3. Generate your map based on the “westeros_checkins” data, filling each region based on the avg_checkin_length.
4. Choose a title and change the name of the legend to “Check-In Times”.

R

#read in data
checkin_data <- read_csv(here("data", "avg_checkins.csv"))

#link together map and checkin_data
westeros_checkins <- merge(westeros_map, checkin_data, by.x = "id", by.y = "precinct")

#generate map with labels
ggplot() +
  geom_sf(data = westeros_checkins, aes(fill = avg_checkin_length)) +
  labs(title = "Average Check-In Times Across Westeros",
       fill = "Check-In Times") +
  theme_bw()

Customization


ggplot2 Themes

In addition to theme_bw(), which changes the plot background to white, ggplot2 comes with several other themes which can be useful to quickly change the look of your visualization. The complete list of themes is available at https://ggplot2.tidyverse.org/reference/ggtheme.html. theme_minimal() and theme_light() are popular, and theme_void() can be useful as a starting point to create a new hand-crafted theme.

The ggthemes package provides a wide variety of options (including an Excel 2003 theme). The ggplot2 extensions website provides a list of packages that extend the capabilities of ggplot2, including additional themes.

Custom Themes

If you do not like the themes offered, or you’d like to change a portion of a theme, you can use the theme() function to manually customize your maps and plots!

The theme() function allows you to customize all portions of a ggplot, including the text, title, subtitle, and grids. You can find the full list in the documentation or by using the panel on the right and navigating to the theme help page (Help > Packages > ggplot2 > theme).

Below, we will be applying a few of these customizations to a plot from earlier in the lesson:

R

prop_device %>%
  filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
  mutate(location = str_remove(location, "LOCATION_")) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = location, y = prop)) +
  geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
  labs(title = "Proportion of Check-Ins by Location for Each Device",
       x = "Location",
       y = "Proportion",
       fill = "Device") +
  theme_bw() +
  theme(
    text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.border = element_rect(color = "grey70")
  )

Note: it is also possible to change the fonts of your plots! If you are on Windows, you will have to install the extrafont package before doing so.

Additionally, if you like your changes better than the default themes, you can save them as a custom theme and apply it to other plots:

R

my_theme <- theme_bw() +
  theme(
    text = element_text(size = 12),
    plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
    axis.title = element_text(face = "italic"),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
    panel.border = element_rect(color = "grey70")
  )

prop_hour_device %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = hour, y = prop, fill = device)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Proportion of Check-Ins by Hour for Each Device",
       x = "Hour (24H Format)",
       y = "Proportion",
       fill = "Device") +
  scale_x_continuous(breaks = seq(0, 24, by = 1)) +
  my_theme

These themes can also be applied to maps, as seen below:

R

ggplot() +
  geom_sf(data = westeros_voting, aes(fill = dominant)) +
  scale_fill_manual(name = "Dominant Candidate", values = dom_color) +
  labs(title = "Dominant Candidate by Region") +
  my_theme
Discussion

Exercise

With all of this information in hand, please take another five minutes to either improve one of the plots generated in this exercise or create a beautiful graph of your own using any of the data used throughout this lesson.

You can use the RStudio ggplot2 cheat sheet for inspiration.

Here are some ideas:

- Make a line plot showing the cumulative number of check-ins over the course of the day.
- Try using a different color palette for your device comparison.
- Generate a new map using the GoT data.

Plot Output


After creating a plot, you may want to save it as a png (or another format). To do this, you can use the ggsave() function, which allows you to easily change the dimensions and resolution of your plot by adjusting the appropriate arguments (width, height, and dpi) before saving the plot to the specified directory.

Here, we will save one of the plots we customized above:

R

plot <- prop_device %>%
        filter(location %in% c("LOCATION_001", "LOCATION_002", "LOCATION_003", "LOCATION_004", "LOCATION_005")) %>%
        mutate(location = str_remove(location, "LOCATION_")) %>%
        mutate(device = str_remove(device, "DEVICE_")) %>%
        ggplot(aes(x = location, y = prop)) +
        geom_bar(aes(fill = device), position = "dodge", stat = "identity") +
        labs(title = "Proportion of Check-Ins by Location for Each Device",
             x = "Location",
             y = "Proportion",
             fill = "Device") +
        theme_bw() +
        theme(
          text = element_text(size = 12),
          plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
          axis.title = element_text(face = "italic"),
          panel.grid.minor = element_blank(),
          panel.grid.major.x = element_blank(),
          panel.border = element_rect(color = "grey70")
        )

ggsave("fig-output/device_prop.png", plot, width = 10, height = 6, dpi = 300)

You can find the generated png in your fig-output folder!

Key Points
  • ggplot2 is a flexible and useful tool for creating plots in R.
  • The data set and coordinate system can be defined using the ggplot function.
  • Additional layers, including geoms, are added using the + operator.
  • Time-series data can be visualized using geom_line() and geom_point().
  • Box plots are useful for visualizing the distribution of check-in times by location.
  • Bar plots are useful for visualizing counts of check-ins by categorical variables.
  • Faceting allows you to generate multiple plots based on a categorical variable like device.
  • Spatial data can be visualized on maps using the sf and ggplot2 packages.

Content from Getting Started with R Markdown (optional)


Last updated on 2026-04-28 | Edit this page

Overview

Questions

  • What is R Markdown?
  • How can I integrate my R code with text and plots?
  • How can I convert .Rmd files to .html, .pdf, or .docx?

Objectives

  • Create a .Rmd document containing R code, text, and plots
  • Create a YAML header to control output
  • Understand basic syntax of R Markdown
  • Customize code chunks to control formatting
  • Use code chunks and in-line code to create dynamic, reproducible documents

R Markdown


R Markdown is a flexible type of document that allows you to seamlessly combine executable R code (and its output) with text and images in a single document. These documents can be readily converted to multiple static and dynamic output formats, including PDF (.pdf), Word (.docx), and HTML (.html).

The benefit of a well-prepared R Markdown document is full reproducibility! This also means that, if you notice a data transcription error or are able to add more data to your analysis, you can simply update the data and recompile the report without having to rebuild the document by hand.

The rmarkdown package comes pre-installed with RStudio, so no action is necessary to begin using R Markdown documents.

R Markdown wizard monsters creating a R Markdown document from a recipe. Art by Allison Horst
Image credit: Allison Horst

Creating an R Markdown File


To create a new R Markdown document in RStudio, click File -> New File -> R Markdown:

Screenshot of the New R Markdown file dialogue box in RStudio

Then, click on ‘Create Empty Document’ to generate your R Markdown file.

In practice, you can enter the title of your document, your name (Author), and select the type of output. However, in this lesson, we will be learning how to start from a blank document.

Basic Components of R Markdown


To control the output, a YAML header is needed. YAML (which stands for YAML Ain’t Markup Language) is a human-readable data serialization language commonly used for configuration files!

An example of a YAML header can be seen below:

---
title: "My Awesome Report"
author: "Emmet Brickowski"
date: ""
output: html_document
---

In R Markdown, the header is defined by the three hyphens at the beginning (---) and the three hyphens at the end (---).

Within this header, the only required field is the output, which specifies the type of output you want. This can be an html_document, a pdf_document, or a word_document. We will start with an HTML document and discuss the other options later.

Since the other fields are not required, you can delete them if they are unneeded!

To begin the body of your document, start typing after the end of the YAML header (i.e. after the second ---).

Markdown Syntax


Markdown is a popular markup language that allows you to add formatting elements to text, such as bold, italics, and code. However, the formatting will not be immediately visible in your markdown (.md) document, like you would see in a Word document. Rather, Markdown syntax applied to text within your file is converted into formatted elements upon output. Markdown is useful because it is lightweight, flexible, and platform independent.

Some platforms provide a real time preview of the formatting, like RStudio’s visual markdown editor (available from version 1.4).

First, let’s create a heading! A # in front of text indicates to Markdown that this text is a heading. Adding more #s makes the heading smaller, i.e. one # is a first-level heading, two ##s is a second-level heading, etc. This can be repeated up to the 6th-level heading.

# Title
## Section
### Sub-section
#### Sub-sub section
##### Sub-sub-sub section
###### Sub-sub-sub-sub section

Please note that you should only use a level if the one above it is also in use! For example, you should not create a header using #### unless headers at ### and all higher levels are present earlier in the document.

Since we have already defined our title in the YAML header, we will use a section heading to create an Introduction section.

## Introduction

You can make things bold by surrounding the word with double asterisks, **bold**, or double underscores, __bold__. Italics can be applied using single asterisks, *italics*, or single underscores, _italics_.

You can also combine bold and italics to write something really important with triple-asterisks, ***really***, or underscores, ___really___. If you’re feeling bold (pun intended), you can also use a combination of asterisks and underscores, **_really_**, *__really__*.

To create code-type font, surround the word with back-ticks, `code-type`.

Now, let’s apply everything we’ve learned about markdown syntax thus far:

## Introduction

This report uses the **tidyverse** package along with the *Check-In* Dataset,
which has columns that include:

Then we can create a list of the variables using the -, +, or * characters.

## Introduction

This report uses the **tidyverse** package along with the *Check-In* Dataset,
which has columns that include:

- checkin\_id
- checkin\_length
- checkin\_time
- location
- precinct
- device 

You can also create an ordered list using numbers:

1. checkin\_id
2. checkin\_length
3. checkin\_time
4. location
5. precinct
6. device

And nested items by tab-indenting:

- checkin\_id
  + Unique key/ID for each ballot instance
- checkin\_length
  + Number of seconds it took for the person submitting the ballot to check-in
- checkin\_time
  + Arrival time of the person submitting the ballot
- location
  + Anonymized ID for the location of the ballot box
- precinct
  + Anonymized ID for the precinct that the ballot box belongs to
- device
  + Anonymized ID for each ballot box

For more Markdown syntax see the following reference guide.

To render your document into HTML, click the Knit button at the top of the Source panel (top left), or use the keyboard shortcut Ctrl+Shift+K for Windows and Linux or Cmd+Shift+K for Mac. If you haven’t saved the document yet, you will be prompted to do so when you Knit for the first time.

The 'knitting' process: First, R Markdown is converted to Markdown, which is then converted (via pandoc) to .html, .pdf, .docx, etc.

Writing an R Markdown Report


Next, we will add some R code from our previous data wrangling and visualization, which means we need to make sure tidyverse is loaded. However, it is no longer enough to just load tidyverse from the console – when working with R Markdown, you must ensure any necessary packages are loaded within the document itself. The same applies to our data. To do so, we will need to create a ‘code chunk’ at the top of our document (below the YAML header).

A code chunk can be inserted by clicking Code -> Insert Chunk, or by using the keyboard shortcuts Ctrl+Alt+I for Windows and Linux or Cmd+Option+I for Mac.

The syntax of a code chunk is:

MARKDOWN

```{r chunk-name}
"This is where you would place your R code!"
```

An R Markdown document knows that the chunk is not part of the report text because of the ``` that begins and ends it. It also knows that the code inside the chunk is written in R because of the r inside the curly braces ({}). After the r, you can add a name for the code chunk. Naming a chunk is optional, but recommended for organizational purposes. Each chunk name must be unique and contain only alphanumeric characters and -.

To load tidyverse and our checkin_data.csv file, we will insert a chunk and call it ‘setup’. Since we don’t want this code or the output to show in our knitted HTML document, we add an include = FALSE option after the code chunk name ({r setup, include = FALSE}).

MARKDOWN

```{r setup, include = FALSE}
#loads in the tidyverse and here packages
library(tidyverse)
library(here)

#reads in data and assigns it to the 'data' variable using 'here'
data <- read_csv(here("data", "checkin_data.csv"))
```
Callout

Important Note!

The file paths you give in a .Rmd document, e.g. to load a .csv file, are relative to the .Rmd document, not the project root.

As suggested in the Starting with Data episode, we highly recommend the use of the here() function to keep the file paths consistent within your project.
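To see the difference, here is a hypothetical chunk reading the same file both ways; if the .Rmd file lived in a reports/ sub-folder, the first path would have to be rewritten every time the file moved, while the second stays the same:

MARKDOWN

```{r read-data-example}
# fragile: relative to the .Rmd file, so it breaks if the file moves
data <- read_csv("../data/checkin_data.csv")

# robust: here() resolves the path from the project root instead
data <- read_csv(here("data", "checkin_data.csv"))
```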

Insert Table


Next, we will re-create a table from the Data Wrangling episode which shows the total number of check-ins grouped by precinct. We can do this by creating a new code chunk and calling it ‘anon-tbl’. Alternatively, you can come up with something more creative (just remember to stick to the naming rules).

When writing code chunks, unlike text, it isn’t necessary to Knit your document every time you want to see the output. Instead, you can run the code chunk with the green triangle in the top right corner of the chunk, or by using the keyboard shortcuts Ctrl+Alt+C for Windows and Linux or Cmd+Option+C for Mac.

To make sure the table is formatted nicely in our output document, we will need to use the kable() function from the knitr package. The kable() function takes the output of your R code and knits it into a nice looking HTML table. You can also specify different aspects of the table (i.e., the column names or the caption).

Run the code chunk below to ensure you get the desired output:

R

data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n()) %>%
  arrange(desc(total_checkins)) %>%
  knitr::kable(caption = "We can also add a caption.", 
               col.names = c("Precinct", 
                             "Total Check-Ins"))
We can also add a caption.

Precinct       Total Check-Ins
PRECINCT_219   1968
PRECINCT_016   1807
PRECINCT_271   1798
PRECINCT_317   1731
PRECINCT_358   1717
…
PRECINCT_360   11
PRECINCT_092   2

(Output truncated: the full table lists all 420 precincts in descending order of total check-ins.)

Many different R packages can be used to generate tables. Some of the more commonly used options are listed in the table below:

- condformat (Oller Moreno, 2022): Applies and visualizes conditional formatting of data frames using defined criteria.
- DT (Xie et al., 2023): Renders data objects as HTML tables via R Markdown or Shiny, using the bundled ‘DataTables’ JavaScript library.
- formattable (Ren and Russell, 2021): Provides functions that create “formattable” vectors and data frames. Formattable vectors are displayed with text formatting, while formattable data frames use HTML to enhance readability when rendered on web pages.
- flextable (Gohel and Skintzos, 2023): Assists in creating and customizing tables for reporting and publication. Supports ‘HTML’, ‘PDF’, ‘RTF’, ‘Microsoft Word’, ‘Microsoft PowerPoint’ and R ‘Grid Graphics’ output; ‘R Markdown’, ‘Quarto’, and the package ‘officer’ can be used to produce files with results.
- gt (Iannone et al., 2022): Builds display tables from tabular data. Tables are constructed using a set of cohesive table parts, and values can be formatted using any of the included formatting functions.
- huxtable (Hugh-Jones, 2022): Creates styled tables for data presentation, exportable to HTML, LaTeX, RTF, ‘Word’, ‘Excel’, and ‘PowerPoint’, with control over borders, size, position, captions, colors, text styles and number formatting.
- pander (Daróczi and Tsegelskyi, 2022): Catches messages, ‘stdout’ and other useful information while evaluating R code, and provides helpers that return user-specified text elements (e.g., headers, paragraphs, tables, images, lists) or R objects automatically transformed to ‘pandoc’ markdown.
- pixiedust (Nutter and Kretch, 2021): Gives tidy data frames a programming interface intended to be similar to ggplot2’s system of layers, allowing fine-tuned control over each cell of the table.
- reactable (Lin et al., 2023): Creates interactive data tables based on the ‘React Table’ JavaScript library. Provides an HTML widget that can be used in ‘R Markdown’ or ‘Quarto’ documents, ‘Shiny’ applications, or viewed from an R console.
- rhandsontable (Owen et al., 2021): Provides an R interface to the ‘Handsontable’ JavaScript library (a minimalist Excel-like data grid editor).
- stargazer (Hlavac, 2022): Generates LaTeX code, HTML/CSS code and ASCII text for well-formatted tables that display regression analysis results from multiple models side-by-side, along with summary statistics.
- tables (Murdoch, 2022): Computes and displays complex tables of summary statistics. Output can be in LaTeX, HTML, plain text, or an R matrix for further processing.
- tangram (Garbett et al., 2023): Provides a flexible formula system to create production-quality tables quickly and easily. Processing steps include a formula parser, statistical content generation from data defined by a formula, and table rendering.
- xtable (Dahl et al., 2019): Coerces data to LaTeX and HTML tables.
- ztable (Moon, 2021): Makes zebra-striped tables (tables with alternating row colors) in LaTeX and HTML formats from data.frame, matrix, lm, aov, anova, glm, coxph, nls, fitdistr, mytable and cbind.mytable objects.

Customizing Chunk Output


Earlier, we mentioned using include = FALSE in a code chunk to prevent the code and output from printing in the knitted document. There are additional options available to customize how the code-chunks are presented in the output document. The options are entered in the code chunk after chunk-name and separated by commas, e.g. {r chunk-name, eval = FALSE, echo = TRUE}.

- eval (TRUE or FALSE): Whether or not the code within the code chunk should be run.
- echo (TRUE or FALSE): Whether to show the code chunk in the output document; echo = TRUE will show the code chunk.
- include (TRUE or FALSE): Whether the output of the code chunk should be included in the document; FALSE means your code will run, but nothing will show up in the document.
- warning (TRUE or FALSE): Whether your output document should display warning messages produced by your code.
- message (TRUE or FALSE): Whether your output document should display messages produced by your code.
- fig.align ("default", "left", "right", "center"): Where a figure produced by the code chunk should be placed on the page.
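Combining several of these, a hypothetical chunk header that runs its code but hides it, suppresses warnings, and centers the resulting figure could look like:

MARKDOWN

```{r my-figure, echo = FALSE, warning = FALSE, fig.align = "center"}
"Insert the code for the figure here"
```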
Callout

Tip

  • The logical chunk options above (eval, echo, include, warning, message) all default to TRUE; fig.align defaults to "default".
  • The default settings can be modified per chunk, or for the whole document with knitr::opts_chunk$set() (i.e., entering knitr::opts_chunk$set(echo = FALSE) will change the default value of echo to FALSE for every code chunk in the document).
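As an illustration (the chunk name and option values here are only an example), a chunk near the top of the document could set those defaults once:

MARKDOWN

```{r setup-options, include = FALSE}
# hide the code and suppress messages for every chunk in this document
knitr::opts_chunk$set(echo = FALSE, message = FALSE)
```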
Challenge

Exercise

Play around with the different options in the chunk with the code for the table, and re-Knit to see what each option does to the output.

What happens if you use eval = FALSE and echo = FALSE? What is the difference between this and include = FALSE?

Chunk 1:

MARKDOWN

```{r eval = FALSE, echo = FALSE}
data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n()) %>%
  arrange(desc(total_checkins)) %>%
  knitr::kable(caption = "We can also add a caption.", 
               col.names = c("Precinct", 
                             "Total Check-Ins"))
```

Chunk 2:

MARKDOWN

```{r include = FALSE}
data %>%
  group_by(precinct) %>%
  summarize(total_checkins = n()) %>%
  arrange(desc(total_checkins)) %>%
  knitr::kable(caption = "We can also add a caption.", 
               col.names = c("Precinct", 
                             "Total Check-Ins"))
```
  • eval = FALSE and echo = FALSE will neither run the code in the chunk, nor show the code in the knitted document. The code chunk essentially doesn’t exist in the knitted document!
  • include = FALSE will display neither the code nor the output, but the code will still be run, with the output stored for later use!

In-Line R Code


Now we will use some in-line R code to present some descriptive statistics. In-line R code uses the same back-ticks that we used in the Markdown section, with an r to specify that we are writing R code. The difference between in-line code and a code chunk is the number of back-ticks: in-line R code is wrapped in single back-ticks (`r ...`), whereas code chunks are fenced with three back-ticks (```{r}```).

For example, today’s date is `r Sys.Date()`, will be rendered as: today’s date is 2026-04-28. The code will display today’s date in the output document (or, technically, the date the document was last knitted).

The best way to use in-line R code is by preparing the output in code chunks, minimizing the code needed to produce the output. For example, let’s say we’re interested in presenting the total check-ins for a specific precinct.

We can run the below code to create the total_2866 object, making future in-line R code much easier to write:

R

#create a summary tibble with the total check-ins per precinct
df <- data %>%
      group_by(precinct) %>%
      summarize(total_checkins = n())

#select the precinct we want to use
total_2866 <- df %>%
              filter(precinct == "2866")

Now we can make an informative statement on the counts of each precinct, and include the total values as in-line R-code. For example:

The total check-ins at precinct 2866 is `r total_2866$total_checkins`

becomes…

The total check-ins at precinct 2866 is .

Because we are using in-line R code instead of the actual values, we have created a dynamic document that will automatically update if we make changes to the data set and/or code chunks.

Plots


Finally, our last addition to our document will be a plot from the Data Visualization lesson!

Challenge

Exercise

Create a new code chunk for the plot, and copy the code from any of the plots we created in the previous episode to produce a plot in the chunk.

If you are feeling adventurous, you can also create a new plot using the data tibble.

R

#retrieve top devices
top_devices <- data %>%
  count(device) %>%
  top_n(10, n) %>%
  pull(device)

#create plot
data %>%
  filter(device %in% top_devices) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = device, fill = device)) +
  geom_bar() +
  labs(title = "Top 10 Devices by Number of Check-ins",
       x = "Device",
       y = "Count") +
  theme_classic() + 
  theme(legend.position = "none")

We can also create a caption with the chunk option fig.cap.

MARKDOWN

```{r chunk-name, fig.cap = "I made this plot while attending an
awesome workshop where I learned a ton of cool stuff!"}
"Insert the code for the plot here"
```

…or, ideally, something more informative.

R

#retrieve top devices
top_devices <- data %>%
  count(device) %>%
  top_n(10, n) %>%
  pull(device)

#create plot
data %>%
  filter(device %in% top_devices) %>%
  mutate(device = str_remove(device, "DEVICE_")) %>%
  ggplot(aes(x = device, fill = device)) +
  geom_bar() +
  labs(title = "Top 10 Devices by Number of Check-ins",
       x = "Device",
       y = "Count") +
  theme_classic() + 
  theme(legend.position = "none")
I made this plot while attending an awesome workshop where I learned a ton of cool stuff!

Other Output Options


To convert an R Markdown file to a PDF or Word Document, you can either click the little triangle next to the Knit button to get a drop-down menu or put pdf_document or word_document in the initial header of the file.

For example, to output to a word_document:

---
title: "My Awesome Report"
author: "Emmet Brickowski"
date: ""
output: word_document
---
Callout

Note: Creating PDF Documents

Creating .pdf documents may require installation of some extra software. The R package tinytex provides some tools to help make this process easier for R users. With tinytex installed, run tinytex::install_tinytex() to install the required software (you’ll only need to do this once) and then when you Knit to pdf tinytex will automatically detect and install any additional LaTeX packages that are needed to produce the pdf document. For more information, visit the tinytex website.

Callout

Note: Inserting Citations into an R Markdown File

It is possible to insert citations into an R Markdown file using the editor toolbar. The editor toolbar includes the formatting buttons commonly found in text editors (e.g., bold and italic buttons) and is accessible by using the settings drop-down menu (next to the ‘Knit’ drop-down menu) to select ‘Use Visual Editor’. You can also use the keyboard shortcuts Ctrl+Shift+F4 for Windows and Linux or Cmd+Shift+F4 for Mac. From here, clicking ‘Insert’ allows ‘Citation’ to be selected.

Using this menu, you can search various sources for citations and insert the appropriate citation necessary. For example, searching ‘10.1007/978-3-319-24277-4’ in ‘From DOI’ and inserting will provide the citation for ggplot2 [@wickham2016]. This will also save the citation(s) in ‘references.bib’ in the current working directory. Visit the RStudio website for more information.

Additionally, you can obtain citation information from relevant packages by using citation("package").

Resources


Key Points
  • R Markdown is a useful language for creating reproducible documents combining text and executable R-code.
  • You can specify chunk options to control formatting of the output document.