Introduction to R


Data Wrangling and Exploring Data with R

Data Academy

Course Outline

Overview

  • RStudio
  • R Projects
  • Variables
  • Types and Data Structures
  • Operators
  • Functions
  • Packages
  • Reading/Writing Data

Exploratory Data Analysis

  • Data Frames
  • The tidyverse
  • Subsetting, Sorting, and Summarizing
  • Manipulating
  • Piping
  • Joining
  • Pivoting
  • Plotting
  • Script Structure

Course Goals

  • Navigate RStudio
  • Pledge to use R Projects
  • Understand R’s types and structures
  • Manipulate data frames
  • Conduct basic EDA
  • Know where and how to get help

Why use R?

  • It’s powerful.
  • It promotes reproducibility.
  • It’s popular around the city.
  • It works well with other languages.
  • There are packages for almost everything.
  • If there isn’t a package for something, it’s fairly easy to make it yourself!
  • It’s cooler than Python.

Your First R Project

(Live)

Intro to R GitHub Repo

Use R Projects because:

  • They promote sensible file/directory structure.
  • They’re easy to share.
  • Manually fiddling with your working directory is bad practice.

Variable Assignment

Variable names can’t include spaces and must start with a letter.

a <- 5
a = 5
a
[1] 5
b <- 4
b
[1] 4
c <- a + b
c
[1] 9
test_scores <- c(82, 89, 92, 75, 74, 99)
test_scores
[1] 82 89 92 75 74 99

Tip

c() is a function that ‘concatenates’ a vector. More on vectors in a bit!

Functions

Functions are pre-defined code that accomplish one specific task. A function has two components: (1) the name of the function; and (2) the input or ‘arguments’. The value returned is called the ‘output.’ Running or executing a function is called ‘calling’ a function.

mean(x = c(3, 7, 11, 12, 14, 15))
[1] 10.33333
  • A function can have any number of arguments. Some are required, but some are optional.
  • If an argument is not explicitly specified, a default will be used.
mean(c(3, 7, 11, 12, 14, 15, NA))
[1] NA
mean(c(3, 7, 11, 12, 14, 15, NA), na.rm = TRUE)
[1] 10.33333
  • A function will throw an error if a required argument is not used.
mean(na.rm = TRUE)
Error in mean.default(na.rm = TRUE): argument "x" is missing, with no default

Types

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Types - numbers

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

R has three numeric types: numeric (or floating point), integer, and complex.

n <- 2.5
class(n)
[1] "numeric"
int <- as.integer(2)
class(int)
[1] "integer"
class(-1 + 3i)
[1] "complex"

Types - characters

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Characters are for storing text or strings and can be entered using either double or single quotes.

greetings <- "hello world"

class(greetings)
[1] "character"

‘Escape’ quotes within a character string with backslashes.

my_enthusiasm <- 'I\'m loving R!'
print(my_enthusiasm)
[1] "I'm loving R!"

Types - boolean/logical

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Logical types have three possible values: TRUE, FALSE, and NA.

class(TRUE)
[1] "logical"
cond <- 1 > 2
cond
[1] FALSE
class(cond)
[1] "logical"

Types - dates

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Objects of the date class are used for dates.

d <- as.Date("2024/09/07")
class(d)
[1] "Date"
as.Date("09/07/2024")
[1] "0009-07-20"
as.Date("09/07/2024", format = "%m/%d/%Y")
[1] "2024-09-07"
%Y 4-digit year (1982)
%y 2-digit year (82)
%m 2-digit month (01)
%d 2-digit day of the month (13)

Types - factors

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Factors are used for categorical, ‘ordinal’ variables.

m <- c("Dec", "Apr", "Jan", "Mar")
sort(m)
[1] "Apr" "Dec" "Jan" "Mar"
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

m <- factor(m, levels = month_levels)
sort(m)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Structures - vectors

Types

  • vector
  • list
  • matrix
  • data frame

A vector is a one dimensional collection of elements of the same class.

temps <- c(94, 83, 79, 55, 65)
temps
[1] 94 83 79 55 65
groceries <- c("apples", "carrots", "ice cream", "hot sauce")
groceries
[1] "apples"    "carrots"   "ice cream" "hot sauce"
conds <- c(TRUE, TRUE, FALSE, FALSE)
conds
[1]  TRUE  TRUE FALSE FALSE

Important

Be wary of R’s implicit coercion

c(1, 2, TRUE)
[1] 1 2 1
c("A", "B", TRUE)
[1] "A"    "B"    "TRUE"

Subsetting vectors

Types

  • vector
  • list
  • matrix
  • data frame

Elements can be accessed using ‘subscripts’ or ‘indices’, which are specified using brackets:

groceries <- c("apples", "carrots", "ice cream", "hot sauce")
groceries[2]
[1] "carrots"
groceries[-2]
[1] "apples"    "ice cream" "hot sauce"
groceries[c(1, 2, 3)]
[1] "apples"    "carrots"   "ice cream"
groceries[1:3]
[1] "apples"    "carrots"   "ice cream"

Structures - lists

Types

  • vector
  • list
  • matrix
  • data frame

A list can include elements of any type and of any dimension.

l1 <- list("A", 1, TRUE)
l1
[[1]]
[1] "A"

[[2]]
[1] 1

[[3]]
[1] TRUE
l2 <- list(c(1, 2, 3, 4, 5), c("a", "b", "c"), c(TRUE, TRUE))
l2
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "a" "b" "c"

[[3]]
[1] TRUE TRUE

Structures - matrices

Types

  • vector
  • list
  • matrix
  • data frame

A matrix is a two-dimensional collection of elements of the same type.

m1 <- matrix(
  c(43, 43, 65, 76, 87, 34),
  nrow = 3,
  ncol = 2
)
m1
     [,1] [,2]
[1,]   43   76
[2,]   43   87
[3,]   65   34
m2 <- matrix(
  c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  nrow = 2,
  ncol = 4
)
m2
     [,1]  [,2]  [,3]  [,4]
[1,] TRUE FALSE FALSE  TRUE
[2,] TRUE FALSE  TRUE FALSE

Structures - data frames

Types

  • vector
  • list
  • matrix
  • data frame

A data frame is a table made of equal length vectors.

df <- data.frame(
  x = c(1, 2, 3),
  y = c(5, 6, 7),
  z = c("a", "b", "c")
)
df
  x y z
1 1 5 a
2 2 6 b
3 3 7 c
  • A ‘tibble’ is an improved type of data frame.
library(tidyverse)
df <- tibble(
  x = c(1, 2, 3),
  z = c("a", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)
df
# A tibble: 3 × 3
      x z     f    
  <dbl> <chr> <lgl>
1     1 a     TRUE 
2     2 b     FALSE
3     3 c     TRUE 

Subsetting data frames

Types

  • vector
  • list
  • matrix
  • data frame

Use $ to access columns (vectors) within a data frame.

df$x
[1] 1 2 3
sum(df$x)
[1] 6
df$z[1]
[1] "a"
df$z[c(1, 3)]
[1] "a" "c"

Subscripts/indices for data frames are pairs specifying the row and column numbers.

df[1,1]
# A tibble: 1 × 1
      x
  <dbl>
1     1
df[2, 1:3]
# A tibble: 1 × 3
      x z     f    
  <dbl> <chr> <lgl>
1     2 b     FALSE

Operators

Category Operator Operation Example
Artithmetic + Addition x + y
Artithmetic - Subtraction x - y
Artithmetic * Multiplication x * y
Artithmetic / Division x / y
Artithmetic ^ Exponent x ^ y
Artithmetic %% Modulus (Remainder from division) x %% y
Comparison == Equal x == y
Comparison != Not equal x != y
Comparison > Greater than x > y
Comparison < Less than x < y
Comparison >= Greater than or equal to x >= y
Comparison <= Less than or equal to x <= y
Logical & AND x & y
Logical | OR x | y
Logical ! NOT !(x > y)
Logical %in% IN x %in% y
Sequence : Sequence 1:10

Vectorization

Many of R’s operations are ‘vectorized’, meaning a given operation will operate on each element of a vector without explicit specification.

x <- 1:5
x + 1
[1] 2 3 4 5 6
x < 4
[1]  TRUE  TRUE  TRUE FALSE FALSE
x == 4
[1] FALSE FALSE FALSE  TRUE FALSE
x[x < 4]
[1] 1 2 3
temps_f <- c(94, 83, 79, 71, 72)
temps_c <- (temps_f - 32)*5/9
temps_c
[1] 34.44444 28.33333 26.11111 21.66667 22.22222

Practice Using Functions!

temps <- c(71, 91, 77, 81, 91, 91, 68, 66, 88, 82, 85, 76, 83, 91, 81, 77, 89, 68)
min(temps)
[1] 66
max(temps)
[1] 91
median(temps)
[1] 81.5
range(temps)
[1] 66 91
sd(temps)
[1] 8.587877
sum(temps)
[1] 1456
length(temps)
[1] 18
sum(temps > 90)
[1] 4
mean(temps > 90)
[1] 0.2222222

Packages

Packages extend R’s functionality beyond the functions available in the ‘base’ version.

  • Before you can use the functions from a package, the package must first be installed.

  • After a package has been installed, it can then be loaded into your session.

install.packages("dplyr") # you only need to install the package once
library(dplyr) # but you need to load the package each session 

The Comprehensive R Archive Network serves as a repository for most packages (21,1145 as of August 2024).

The Tidyverse

“The tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

Key Tidyverse Packages

  • readr - functions for reading rectangular data (like csv, tsv, and fwf).
  • readxl - functions to read data from .xls and .xlsx files.
  • dplyr - a grammar of data manipulation that solve the most common data manipulation challenges.
  • tidyr - tools for pivoting, cleaning, and ‘tidying’ data.
  • ggplot2 - a system for creating graphics, based on The Grammar of Graphics.
  • purrr - tools for working with functions and vectors.
  • lubridate - functions for working with dates.
  • stringr - functions for working with strings.
  • forcats - functions for working with factors.

Reading Data

library(readr)
penguins <- read_csv("data/penguins.csv")
Rows: 344 Columns: 8                                                                                                                                                     
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readxl)
penguins <- read_xlsx("data/penguins.xlsx")
penguins <- read_rds("data/penguins.rds")

Reading from DataSF

library(RSocrata)
crashes <- read.socrata("https://data.sfgov.org/resource/dau3-4s8f.csv")
# crashes <- read_csv("https://data.sfgov.org/resource/dau3-4s8f.csv")
glimpse(crashes)
Rows: 287
Columns: 27
$ unique_id             <int> 269, 286, 268, 291, 271, 263, 302, 116, 267, 262…
$ case_id_fkey          <chr> "200252819", "200592740", "", "200704616", "2002…
$ latitude              <dbl> 37.75243, 37.74968, 37.76478, 37.76578, 37.75243…
$ longitude             <dbl> -122.3946, -122.3973, -122.4241, -122.4075, -122…
$ collision_year        <int> 2020, 2020, 2020, 2020, 2020, 2020, 2021, 2016, …
$ death_date            <dttm> 2020-04-21, 2020-10-02, 2020-03-17, 2020-12-07,…
$ death_time            <chr> "19:35:00", "04:50:00", "21:44:00", "", "21:45:0…
$ death_datetime        <dttm> 2020-04-21 19:35:00, 2020-10-02 04:50:00, 2020-…
$ collision_date        <dttm> 2020-04-21, 2020-10-02, 2020-03-17, 2020-11-21,…
$ collision_time        <chr> "17:30:00", "", "20:15:00", "19:31:00", "17:30:0…
$ collision_datetime    <dttm> 2020-04-21 17:30:00, 2020-10-02 00:00:00, 2020-…
$ location              <chr> "Dakota Street and 25th Street", "Cesar Chavez S…
$ age                   <int> 28, 55, 36, NA, 32, 49, 27, 72, 45, 67, 39, 66, …
$ sex                   <chr> "Female", "Male", "Male", "Male", "Female", "Mal…
$ deceased              <chr> "Driver", "Pedestrian", "Motorcyclist", "Driver"…
$ collision_type        <chr> "Motor Vehicle Collision", "Pedestrian vs Motor …
$ street_type           <chr> "City Street", "City Street", "City Street", "Ci…
$ on_vz_hin_2017        <chr> "false", "true", "true", "true", "false", "true"…
$ in_coc_2018           <chr> "true", "false", "false", "true", "true", "true"…
$ publish               <chr> "true", "true", "true", "true", "true", "true", …
$ on_vz_hin_2022        <chr> "true", "true", "true", "true", "false", "true",…
$ in_epa_2021           <chr> "false", "false", "false", "false", "false", "fa…
$ point                 <chr> "POINT (-122.394586978 37.752426801)", "POINT (-…
$ data_loaded_at        <dttm> 2024-07-01 16:48:29, 2024-07-01 16:48:29, 2024-…
$ analysis_neighborhood <chr> "Bayview Hunters Point", "Bayview Hunters Point"…
$ supervisor_district   <int> 10, 10, 8, 9, 10, 9, 7, 11, 6, 3, 9, 5, 5, 5, 7,…
$ police_district       <chr> "BAYVIEW", "BAYVIEW", "MISSION", "MISSION", "BAY…

select()

library(dplyr)
select(penguins, species, island, sex, body_mass_g)
# A tibble: 344 × 4
  species island    sex    body_mass_g
  <fct>   <fct>     <fct>        <int>
1 Adelie  Torgersen male          3750
2 Adelie  Torgersen female        3800
3 Adelie  Torgersen female        3250
# … with 341 more rows
select(penguins, 1:4)
# A tibble: 344 × 4
  species island    bill_length_mm bill_depth_mm
  <fct>   <fct>              <dbl>         <dbl>
1 Adelie  Torgersen           39.1          18.7
2 Adelie  Torgersen           39.5          17.4
3 Adelie  Torgersen           40.3          18  
# … with 341 more rows
select(penguins, bill_length_mm:body_mass_g)
# A tibble: 344 × 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <int>       <int>
1           39.1          18.7               181        3750
2           39.5          17.4               186        3800
3           40.3          18                 195        3250
# … with 341 more rows

arrange()

arrange(penguins, bill_length_mm)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Dream               32.1          15.5         188    3050 fema…  2009
2 Adelie  Dream               33.1          16.1         178    2900 fema…  2008
3 Adelie  Torgersen           33.5          19           190    3600 fema…  2008
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
arrange(penguins, species)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
arrange(penguins, desc(bill_length_mm))
# A tibble: 344 × 8
  species   island bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
  <fct>     <fct>           <dbl>         <dbl>        <int>   <int> <fct> <int>
1 Gentoo    Biscoe           59.6          17            230    6050 male   2007
2 Chinstrap Dream            58            17.8          181    3700 fema…  2007
3 Gentoo    Biscoe           55.9          17            228    5600 male   2009
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

rename()

rename(penguins, Sex = sex)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² Sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
rename(penguins, genus = species, isle = island)
# A tibble: 344 × 8
  genus  isle      bill_length_mm bill_depth_mm flipper_le…¹ body_…² sex    year
  <fct>  <fct>              <dbl>         <dbl>        <int>   <int> <fct> <int>
1 Adelie Torgersen           39.1          18.7          181    3750 male   2007
2 Adelie Torgersen           39.5          17.4          186    3800 fema…  2007
3 Adelie Torgersen           40.3          18            195    3250 fema…  2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

distinct()

distinct(penguins, sex)
# A tibble: 3 × 1
  sex   
  <fct> 
1 male  
2 female
3 <NA>  
distinct(penguins, island)
# A tibble: 3 × 1
  island   
  <fct>    
1 Torgersen
2 Biscoe   
3 Dream    
distinct(penguins, island, species)
# A tibble: 5 × 2
  island    species
  <fct>     <fct>  
1 Torgersen Adelie 
2 Biscoe    Adelie 
3 Dream     Adelie 
# … with 2 more rows

filter()

filter(penguins, sex == "female")
# A tibble: 165 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
2 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
3 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
# … with 162 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
filter(penguins, body_mass_g > 4800)
# A tibble: 81 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
  <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
1 Gentoo  Biscoe           50            16.3            230    5700 male   2007
2 Gentoo  Biscoe           50            15.2            218    5700 male   2007
3 Gentoo  Biscoe           47.6          14.5            215    5400 male   2007
# … with 78 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
filter(penguins, sex == "female", body_mass_g >= 4800)
# A tibble: 22 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
  <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
1 Gentoo  Biscoe           45.4          14.6            211    4800 fema…  2007
2 Gentoo  Biscoe           46.2          14.5            209    4800 fema…  2007
3 Gentoo  Biscoe           45.1          14.5            215    5000 fema…  2007
# … with 19 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

filter()

filter(penguins, is.na(bill_length_mm))
# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen             NA            NA          NA      NA <NA>   2007
2 Gentoo  Biscoe                NA            NA          NA      NA <NA>   2009
# … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
filter(penguins, !is.na(bill_length_mm))
# A tibble: 342 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 339 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

filter()

filter(penguins, island == "Biscoe" | island == "Dream")
# A tibble: 292 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
  <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
1 Adelie  Biscoe           37.8          18.3            174    3400 fema…  2007
2 Adelie  Biscoe           37.7          18.7            180    3600 male   2007
3 Adelie  Biscoe           35.9          19.2            189    3800 fema…  2007
# … with 289 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
filter(penguins, island %in% c("Biscoe", "Dream"))
# A tibble: 292 × 8
  species island bill_length_mm bill_depth_mm flipper_leng…¹ body_…² sex    year
  <fct>   <fct>           <dbl>         <dbl>          <int>   <int> <fct> <int>
1 Adelie  Biscoe           37.8          18.3            174    3400 fema…  2007
2 Adelie  Biscoe           37.7          18.7            180    3600 male   2007
3 Adelie  Biscoe           35.9          19.2            189    3800 fema…  2007
# … with 289 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
filter(penguins, !island %in% c("Biscoe", "Dream"))
# A tibble: 52 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 49 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

filter() with string helpers

library(stringr)

filter(penguins, str_detect(island, "ger"))
# A tibble: 52 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 49 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
filter(penguins, str_length(species) == 6)
# A tibble: 276 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 273 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

mutate()

mutate(penguins, body_mass_lb = body_mass_g/453.6)
# A tibble: 344 × 9
  species island    bill_length_mm bill_de…¹ flipp…² body_…³ sex    year body_…⁴
  <fct>   <fct>              <dbl>     <dbl>   <int>   <int> <fct> <int>   <dbl>
1 Adelie  Torgersen           39.1      18.7     181    3750 male   2007    8.27
2 Adelie  Torgersen           39.5      17.4     186    3800 fema…  2007    8.38
3 Adelie  Torgersen           40.3      18       195    3250 fema…  2007    7.16
# … with 341 more rows, and abbreviated variable names ¹​bill_depth_mm,
#   ²​flipper_length_mm, ³​body_mass_g, ⁴​body_mass_lb
usa_penguins <- mutate(
  penguins, 
  body_mass_lb = body_mass_g/453.6, 
  flipper_length_in = flipper_length_mm/25.4
  )

select(usa_penguins, species, body_mass_lb, flipper_length_in)
# A tibble: 344 × 3
  species body_mass_lb flipper_length_in
  <fct>          <dbl>             <dbl>
1 Adelie          8.27              7.13
2 Adelie          8.38              7.32
3 Adelie          7.16              7.68
# … with 341 more rows

mutate() with helpers

  • If you need to specify only one condition, use if_else(). Otherwise, use case_when().
mutate(penguins, body_mass_g = if_else(island == "Biscoe", body_mass_g - 50, body_mass_g))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <dbl> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
new_measurements <- mutate(penguins, new_body_mass_g = case_when(
  island == "Biscoe" ~ body_mass_g - 50,
  island == "Dream" ~ body_mass_g - 75,
  island == "Torgersen" ~ body_mass_g - 100
  )
)
select(new_measurements, island, body_mass_g, new_body_mass_g)
# A tibble: 344 × 3
  island    body_mass_g new_body_mass_g
  <fct>           <int>           <dbl>
1 Torgersen        3750            3650
2 Torgersen        3800            3700
3 Torgersen        3250            3150
# … with 341 more rows

mutate() with string helpers

penguins_with_ids <- mutate(penguins, id = paste(island, species, sex, year, sep = "-"))
select(penguins_with_ids, island, species, sex, year, id)
# A tibble: 344 × 5
  island    species sex     year id                          
  <fct>     <fct>   <fct>  <int> <chr>                       
1 Torgersen Adelie  male    2007 Torgersen-Adelie-male-2007  
2 Torgersen Adelie  female  2007 Torgersen-Adelie-female-2007
3 Torgersen Adelie  female  2007 Torgersen-Adelie-female-2007
# … with 341 more rows
mutate(penguins, sex = str_sub(sex, start = 1, end = 1))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <chr> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 m      2007
2 Adelie  Torgersen           39.5          17.4         186    3800 f      2007
3 Adelie  Torgersen           40.3          18           195    3250 f      2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g
mutate(penguins, sex = str_to_title(sex))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <chr> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 Male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 Fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 Fema…  2007
# … with 341 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

count()

count(penguins, sex)
# A tibble: 3 × 2
  sex        n
  <fct>  <int>
1 female   165
2 male     168
3 <NA>      11
count(penguins, species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
count(penguins, sex, species, sort = TRUE)
# A tibble: 8 × 3
  sex    species       n
  <fct>  <fct>     <int>
1 female Adelie       73
2 male   Adelie       73
3 male   Gentoo       61
4 female Gentoo       58
5 female Chinstrap    34
6 male   Chinstrap    34
7 <NA>   Adelie        6
8 <NA>   Gentoo        5

count()

count(penguins, island, name = "n_island_dwellers")
# A tibble: 3 × 2
  island    n_island_dwellers
  <fct>                 <int>
1 Biscoe                  168
2 Dream                   124
3 Torgersen                52
count(penguins, island == "Biscoe")
# A tibble: 2 × 2
  `island == "Biscoe"`     n
  <lgl>                <int>
1 FALSE                  176
2 TRUE                   168
count(penguins, body_mass_g < 3000)
# A tibble: 3 × 2
  `body_mass_g < 3000`     n
  <lgl>                <int>
1 FALSE                  333
2 TRUE                     9
3 NA                       2

summarize()

summarize(penguins, mean_flipper_length = mean(flipper_length_mm))
# A tibble: 1 × 1
  mean_flipper_length
                <dbl>
1                  NA
summarize(
  penguins, 
  mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
  mean_body_mass = mean(body_mass_g, na.rm = TRUE),
  mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
  )
# A tibble: 1 × 3
  mean_flipper_length mean_body_mass mean_bill_length
                <dbl>          <dbl>            <dbl>
1                201.          4202.             43.9

group_by()

penguins_grouped_by_sex <- group_by(penguins, sex)
summarize(penguins_grouped_by_sex, mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 female          3862.
2 male            4546.
3 <NA>            4006.
penguins_grouped_by_sex_and_species <- group_by(penguins, sex, species)
summarize(penguins_grouped_by_sex_and_species, mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 8 × 3
# Groups:   sex [3]
  sex    species   mean_body_mass
  <fct>  <fct>              <dbl>
1 female Adelie             3369.
2 female Chinstrap          3527.
3 female Gentoo             4680.
4 male   Adelie             4043.
5 male   Chinstrap          3939.
6 male   Gentoo             5485.
7 <NA>   Adelie             3540 
8 <NA>   Gentoo             4588.
summarize(penguins, mean_body_mass = mean(body_mass_g, na.rm = TRUE), .by = sex)
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 male            4546.
2 female          3862.
3 <NA>            4006.

The Pipe

We typically want to run numerous operations on a data frame, and saving the intermediate outputs as separate variables is tedious. The ‘pipe’ operator (%>% or |>), passes the output from one function directly into another.

  • Windows: Ctrl + M; Mac: Cmd + M
penguins %>% 
  filter(body_mass_g > 4000) %>% 
  group_by(sex) %>% 
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE)) %>% 
  arrange(desc(mean_body_mass))
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 male            4975 
2 female          4683.
3 <NA>            4520 
penguins %>% count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

The tidyverse paradigm

  • ‘siuba’ in Python:
from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean()))
  • ‘Tidier.jl’ in Julia:
@chain movies begin
    @group_by(Year)
    @summarize(Mean_Yearly_Rating = mean(skipmissing(Rating)),
        Median_Yearly_Rating = median(skipmissing(Rating)))
    @slice(1:5)
end
  • PRQL:
from invoices
filter invoice_date >= @1970-01-16
derive {
  transaction_fees = 0.8,
  income = total - transaction_fees
}

Practice using the tidyverse!

Source: Air Traffic Passenger Statistics

air_traffic <- read.socrata("https://data.sfgov.org/resource/rkru-6vcg.csv")

# How many passengers deplaned from airlines with 'China' in their name?
air_traffic %>% 
  filter(
    str_detect(operating_airline, "China"),
    activity_type_code == "Deplaned"
    ) %>% 
  group_by(operating_airline) %>% 
  summarize(passengers = sum(passenger_count)) %>% 
  arrange(desc(passengers))

# How many flights for each operating airline in 2020?
air_traffic %>% 
  filter(
    activity_period_start_date >= as.Date("2020-01-01") & 
      activity_period_start_date <= as.Date("2020-12-31")
    ) %>% 
  count(operating_airline, sort = TRUE, name = "flights") %>% 
  head()

left_join()

If a row in ‘x’ or the left-hand side matches a row in ‘y’ or the right-hand side, the columns from the y table are joined to the x table.

df1 <- tibble(x = 1:3)
df2 <- tibble(x = c(1, 2), y = c("first", "second"))
df1
# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3
df2
# A tibble: 2 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     2 second
left_join(df1, df2, by = "x") # or df1 %>% left_join(df2, join_by(x))
# A tibble: 3 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     2 second
3     3 <NA>  

left_join() (multiple matches)

If a row in ‘x’ or the left-hand side has multiple matches in ‘y’ or the right-hand side, all the matching rows in y will be joined to x.

df1 <- tibble(id = 1:3)
df2 <- tibble(code = c(1, 1, 2), y = c("first", "second", "third"))
df1
# A tibble: 3 × 1
     id
  <int>
1     1
2     2
3     3
df2
# A tibble: 3 × 2
   code y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     2 third 
df1 %>% left_join(df2, join_by(id == code))
# A tibble: 4 × 2
     id y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     2 third 
4     3 <NA>  

inner_join()

x <- tibble(c1 = 1:3, c2 = c("x1", "x2", "x3"))
y <- tibble(c1 = c(1, 2, 4), c3 = c("y1", "y2", "y4"))
inner_join(x, y, by = join_by(c1))
# A tibble: 2 × 3
     c1 c2    c3   
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

bind_rows()

penguins_2007 <- penguins %>% filter(year == 2007)
penguins_2008 <- penguins %>% filter(year == 2008)
nrow(penguins_2007)
[1] 110
nrow(penguins_2008)
[1] 114
all_penguins <- bind_rows(penguins_2007, penguins_2008)
all_penguins
# A tibble: 224 × 8
  species island    bill_length_mm bill_depth_mm flipper_l…¹ body_…² sex    year
  <fct>   <fct>              <dbl>         <dbl>       <int>   <int> <fct> <int>
1 Adelie  Torgersen           39.1          18.7         181    3750 male   2007
2 Adelie  Torgersen           39.5          17.4         186    3800 fema…  2007
3 Adelie  Torgersen           40.3          18           195    3250 fema…  2007
4 Adelie  Torgersen           NA            NA            NA      NA <NA>   2007
5 Adelie  Torgersen           36.7          19.3         193    3450 fema…  2007
6 Adelie  Torgersen           39.3          20.6         190    3650 male   2007
7 Adelie  Torgersen           38.9          17.8         181    3625 fema…  2007
8 Adelie  Torgersen           39.2          19.6         195    4675 male   2007
# … with 216 more rows, and abbreviated variable names ¹​flipper_length_mm,
#   ²​body_mass_g

Practice joining data!

flights <- read_rds("data/flights.rds")
airlines <- read_rds("data/airlines.rds")
planes <- read_rds("data/planes.rds")
airports <- read_rds("data/airports.rds")

left_join(flights, airlines, by = join_by(carrier))

flights %>% 
  left_join(airports, join_by(dest == faa)) %>% 
  select(year, month, day, origin, dest, tzone)

flights %>% 
  inner_join(planes, join_by(tailnum)) %>% 
  select(flight, month, day, type, engine)

pivoting

Reshape your data into something longer (increasing number of rows and decreasing the number of columns) or reshape your data into something wider (increasing the number of columns and decreasing the number of rows).

pivot_longer()

glimpse(relig_income)
Rows: 18
Columns: 11
$ religion             <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "D…
$ `<$10k`              <dbl> 27, 12, 27, 418, 15, 575, 1, 228, 20, 19, 289, 29…
$ `$10-20k`            <dbl> 34, 27, 21, 617, 14, 869, 9, 244, 27, 19, 495, 40…
$ `$20-30k`            <dbl> 60, 37, 30, 732, 15, 1064, 7, 236, 24, 25, 619, 4…
$ `$30-40k`            <dbl> 81, 52, 34, 670, 11, 982, 9, 238, 24, 25, 655, 51…
$ `$40-50k`            <dbl> 76, 35, 33, 638, 10, 881, 11, 197, 21, 30, 651, 5…
$ `$50-75k`            <dbl> 137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95, 110…
$ `$75-100k`           <dbl> 122, 73, 62, 949, 21, 949, 47, 131, 15, 69, 939, …
$ `$100-150k`          <dbl> 109, 59, 39, 792, 17, 723, 48, 81, 11, 87, 753, 4…
$ `>150k`              <dbl> 84, 74, 53, 633, 18, 414, 54, 78, 6, 151, 634, 42…
$ `Don't know/refused` <dbl> 96, 76, 54, 1489, 116, 1529, 37, 339, 37, 162, 13…
relig_income %>% 
  pivot_longer(
    cols = 2:11,
    names_to = "income",
    values_to = "count"
  )
# A tibble: 180 × 3
  religion income    count
  <chr>    <chr>     <dbl>
1 Agnostic <$10k        27
2 Agnostic $10-20k      34
3 Agnostic $20-30k      60
4 Agnostic $30-40k      81
5 Agnostic $40-50k      76
6 Agnostic $50-75k     137
7 Agnostic $75-100k    122
8 Agnostic $100-150k   109
# … with 172 more rows

pivot_wider()

penguins %>% 
  count(island, species) %>% 
  pivot_wider(
    names_from = species,
    values_from = n
  )
# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <int>  <int>     <int>
1 Biscoe        44    124        NA
2 Dream         56     NA        68
3 Torgersen     52     NA        NA

Writing Data

adelie_males_on_torgersen_in_2007 <- penguins %>% 
  filter(
    species == "Adelie",
    sex == "male",
    island == "Torgersen",
    year == "2007"
  ) %>% 
  select(bill_length_mm:body_mass_g)

write_csv(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.csv")
write_rds(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.rds")

library(writexl)
write_xlsx(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.xlsx")

ggplot2

ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  labs(x = "Species", y = "Body Mass (grams)", title = "Penguin Body Mass by Species") +
  theme_classic() +
  theme(legend.position = "none")

gt

library(gt)

penguins %>% 
  group_by(island, species, sex) %>% 
  summarize(
    mean_body_mass = mean(body_mass_g, na.rm = TRUE)
    ) %>% 
  ungroup() %>% 
  drop_na(sex) %>% 
  pivot_wider(
    names_from = sex, 
    values_from = mean_body_mass
    ) %>% 
  mutate(island = paste("On", island, "island")) %>% 
  rename(
    Island = island,
    Species = species,
    Female = female,
    Male = male
  ) %>% 
  gt(
    groupname_col = "Island", 
    rowname_col = "Species"
    ) %>% 
  tab_style(
    style = list(cell_text(align = "right")),
    locations = cells_stub(rows = TRUE)
  ) %>% 
  tab_header(
    title = "Penguin Body Mass",
    subtitle = "Adult penguins near Palmer Station"
  )
Penguin Body Mass
Adult foraging penguins near Palmer Station
Female Male
On Biscoe island
Adelie 3369.318 4050.000
Gentoo 4679.741 5484.836
On Dream island
Adelie 3344.444 4045.536
Chinstrap 3527.206 3938.971
On Torgersen island
Adelie 3395.833 4034.783

Script Structure and Basic EDA

eda.R

  1. Load packages
  2. Import raw data
  3. Clean/reshape/‘tidy’ data
  4. Analyze
  5. Write cleaned data

Where to get Help

  • Stackoverflow

  • Posit Community

  • Twitter/X/Mastadon/BlueSky

  • CCSF Teams Channel

How to get Help

What should I learn next?

Other Cool R Things

Lab

Lab instructions in the README here.

fin

Give us your feedback! (Please respond to the survey sent out after class)