Introduction to R


Data Wrangling and Exploring Data with R

Data Academy

Course Outline

Overview

  • RStudio
  • R Projects
  • Variables
  • Types and Data Structures
  • Operators
  • Functions
  • Packages
  • Reading/Writing Data

Exploratory Data Analysis

  • Data Frames
  • The tidyverse
  • Subsetting, Sorting, and Summarizing
  • Manipulating
  • Piping
  • Joining
  • Pivoting
  • Plotting
  • Script Structure

Course Goals

  • Navigate RStudio
  • Pledge to use R Projects
  • Understand R’s types and structures
  • Manipulate data frames
  • Conduct basic EDA
  • Know where and how to get help

Why use R?

  • It’s powerful.
  • It promotes reproducibility.
  • It’s popular around the city.
  • It works well with other languages.
  • There are packages for almost everything.
  • If there isn’t a package for something, it’s fairly easy to make it yourself!
  • It’s cooler than Python.

R at CCSF

  • The Rent Board uses R to clean and join parcel assessor, DBI, housing inventory and rent board fee data.

  • DataSF uses R to manage and publish the San Francisco Population and Demographic Census Data.

  • DPH uses R to clean and analyze survey data.

  • OEWD uses R to make maps, read data from Salesforce, Airtable, and other third-party systems and APIs, and build robust data pipelines.

  • You will use R to…

Your First R Project

(Live)

Intro to R GitHub Repo

Use R Projects because:

  • They promote sensible file/directory structure.
  • They’re easy to share.
  • Manually fiddling with your working directory is bad practice.

Variable Assignment

Variable names can’t include spaces and must start with a letter.

a <- 5
a = 5
a
[1] 5
b <- 4
b
[1] 4
c <- a + b
c
[1] 9
test_scores <- c(82, 89, 92, 75, 74, 99)
test_scores
[1] 82 89 92 75 74 99

Tip

c() is a function that ‘concatenates’ a vector. More on vectors in a bit!

Functions

Functions are pre-defined code that accomplish one specific task. A function has two components: (1) the name of the function; and (2) the input or ‘arguments’. The value returned is called the ‘output.’ Running or executing a function is called ‘calling’ a function.

mean(x = c(3, 7, 11, 12, 14, 15))
[1] 10.33333
  • A function can have any number of arguments. Some are required, but some are optional.
  • If an argument is not explicitly specified, a default will be used.
mean(c(3, 7, 11, 12, 14, 15, NA))
[1] NA
mean(c(3, 7, 11, 12, 14, 15, NA), na.rm = TRUE)
[1] 10.33333
  • A function will throw an error if a required argument is not used.
mean(na.rm = TRUE)
Error in mean.default(na.rm = TRUE): argument "x" is missing, with no default

Types

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Types - numbers

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

R has three numeric types: numeric (or floating point), integer, and complex.

n <- 2.5
class(n)
[1] "numeric"
int <- as.integer(2)
class(int)
[1] "integer"
class(-1 + 3i)
[1] "complex"

Types - characters

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Characters are for storing text or strings and can be entered using either double or single quotes.

greetings <- "hello world"

class(greetings)
[1] "character"

‘Escape’ quotes within a character string with backslashes.

my_enthusiasm <- 'I\'m loving R!'
print(my_enthusiasm)
[1] "I'm loving R!"

Types - boolean/logical

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Logical types have three possible values: TRUE, FALSE, and NA.

class(TRUE)
[1] "logical"
cond <- 1 > 2
cond
[1] FALSE
class(cond)
[1] "logical"

Types - dates

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Objects of the date class are used for dates.

d <- as.Date("2024/09/07")
class(d)
[1] "Date"
as.Date("09/07/2024")
[1] "0009-07-20"
as.Date("09/07/2024", format = "%m/%d/%Y")
[1] "2024-09-07"
%Y 4-digit year (1982)
%y 2-digit year (82)
%m 2-digit month (01)
%d 2-digit day of the month (13)

Types - factors

Types

  • numbers
  • characters
  • logical
  • dates
  • factors

Factors are used for categorical, ‘ordinal’ variables.

m <- c("Dec", "Apr", "Jan", "Mar")
sort(m)
[1] "Apr" "Dec" "Jan" "Mar"
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

m <- factor(m, levels = month_levels)
sort(m)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

Structures - vectors

Types

  • vector
  • list
  • matrix
  • data frame

A vector is a one dimensional collection of elements of the same class.

temps <- c(94, 83, 79, 55, 65)
temps
[1] 94 83 79 55 65
groceries <- c("apples", "carrots", "ice cream", "hot sauce")
groceries
[1] "apples"    "carrots"   "ice cream" "hot sauce"
conds <- c(TRUE, TRUE, FALSE, FALSE)
conds
[1]  TRUE  TRUE FALSE FALSE

Important

Be wary of R’s implicit coercion

c(1, 2, TRUE)
[1] 1 2 1
c("A", "B", TRUE)
[1] "A"    "B"    "TRUE"

Subsetting vectors

Types

  • vector
  • list
  • matrix
  • data frame

Elements can be accessed using ‘subscripts’ or ‘indices’, which are specified using brackets:

groceries <- c("apples", "carrots", "ice cream", "hot sauce")
groceries[2]
[1] "carrots"
groceries[-2]
[1] "apples"    "ice cream" "hot sauce"
groceries[c(1, 2, 3)]
[1] "apples"    "carrots"   "ice cream"
groceries[1:3]
[1] "apples"    "carrots"   "ice cream"

Structures - lists

Types

  • vector
  • list
  • matrix
  • data frame

A list can include elements of any type and of any dimension.

l1 <- list("A", 1, TRUE)
l1
[[1]]
[1] "A"

[[2]]
[1] 1

[[3]]
[1] TRUE
l2 <- list(c(1, 2, 3, 4, 5), c("a", "b", "c"), c(TRUE, TRUE))
l2
[[1]]
[1] 1 2 3 4 5

[[2]]
[1] "a" "b" "c"

[[3]]
[1] TRUE TRUE

Structures - matrices

Types

  • vector
  • list
  • matrix
  • data frame

A matrix is a two-dimensional collection of elements of the same type.

m1 <- matrix(
  c(43, 43, 65, 76, 87, 34),
  nrow = 3,
  ncol = 2
)
m1
     [,1] [,2]
[1,]   43   76
[2,]   43   87
[3,]   65   34
m2 <- matrix(
  c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE),
  nrow = 2,
  ncol = 4
)
m2
     [,1]  [,2]  [,3]  [,4]
[1,] TRUE FALSE FALSE  TRUE
[2,] TRUE FALSE  TRUE FALSE

Structures - data frames

Types

  • vector
  • list
  • matrix
  • data frame

A data frame is a table made of equal length vectors.

df <- data.frame(
  x = c(1, 2, 3),
  y = c(5, 6, 7),
  z = c("a", "b", "c")
)
df
  x y z
1 1 5 a
2 2 6 b
3 3 7 c
  • A ‘tibble’ is an improved type of data frame.
library(tidyverse)
df <- tibble(
  x = c(1, 2, 3),
  z = c("a", "b", "c"),
  f = c(TRUE, FALSE, TRUE)
)
df
# A tibble: 3 × 3
      x z     f    
  <dbl> <chr> <lgl>
1     1 a     TRUE 
2     2 b     FALSE
3     3 c     TRUE 

Subsetting data frames

Types

  • vector
  • list
  • matrix
  • data frame

Use $ to access columns (vectors) within a data frame.

df$x
[1] 1 2 3
sum(df$x)
[1] 6
df$z[1]
[1] "a"
df$z[c(1, 3)]
[1] "a" "c"

Subscripts/indices for data frames are pairs specifying the row and column numbers.

df[1,1]
# A tibble: 1 × 1
      x
  <dbl>
1     1
df[2, 1:3]
# A tibble: 1 × 3
      x z     f    
  <dbl> <chr> <lgl>
1     2 b     FALSE

Operators

Category Operator Operation Example
Artithmetic + Addition x + y
Artithmetic - Subtraction x - y
Artithmetic * Multiplication x * y
Artithmetic / Division x / y
Artithmetic ^ Exponent x ^ y
Artithmetic %% Modulus (Remainder from division) x %% y
Comparison == Equal x == y
Comparison != Not equal x != y
Comparison > Greater than x > y
Comparison < Less than x < y
Comparison >= Greater than or equal to x >= y
Comparison <= Less than or equal to x <= y
Logical & AND x & y
Logical | OR x | y
Logical ! NOT !(x > y)
Logical %in% IN x %in% y
Sequence : Sequence 1:10

Vectorization

Many of R’s operations are ‘vectorized’, meaning a given operation will operate on each element of a vector without explicit specification.

x <- 1:5
x + 1
[1] 2 3 4 5 6
x < 4
[1]  TRUE  TRUE  TRUE FALSE FALSE
x == 4
[1] FALSE FALSE FALSE  TRUE FALSE
x[x < 4]
[1] 1 2 3
temps_f <- c(94, 83, 79, 71, 72)
temps_c <- (temps_f - 32)*5/9
temps_c
[1] 34.44444 28.33333 26.11111 21.66667 22.22222

Practice Using Functions!

temps <- c(71, 91, 77, 81, 91, 91, 68, 66, 88, 82, 85, 76, 83, 91, 81, 77, 89, 68)
min(temps)
[1] 66
max(temps)
[1] 91
median(temps)
[1] 81.5
range(temps)
[1] 66 91
sd(temps)
[1] 8.587877
sum(temps)
[1] 1456
length(temps)
[1] 18
sum(temps > 90)
[1] 4
mean(temps > 90)
[1] 0.2222222

Packages

Packages extend R’s functionality beyond the functions available in the ‘base’ version.

  • Before you can use the functions from a package, the package must first be installed.

  • After a package has been installed, it can then be loaded into your session.

install.packages("dplyr") # you only need to install the package once
library(dplyr) # but you need to load the package each session 

The Comprehensive R Archive Network serves as a repository for most packages (21,1145 as of August 2024).

The Tidyverse

“The tidyverse is an collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

Key Tidyverse Packages

  • readr - functions for reading rectangular data (like csv, tsv, and fwf).
  • readxl - functions to read data from .xls and .xlsx files.
  • dplyr - a grammar of data manipulation that solve the most common data manipulation challenges.
  • tidyr - tools for pivoting, cleaning, and ‘tidying’ data.
  • ggplot2 - a system for creating graphics, based on The Grammar of Graphics.
  • purrr - tools for working with functions and vectors.
  • lubridate - functions for working with dates.
  • stringr - functions for working with strings.
  • forcats - functions for working with factors.

Reading Data

library(readr)
penguins <- read_csv("data/penguins.csv")
Rows: 344 Columns: 8                                                                                                                                                     
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(readxl)
penguins <- read_xlsx("data/penguins.xlsx")
penguins <- read_rds("data/penguins.rds")

Reading from DataSF

library(RSocrata)
crashes <- read.socrata("https://data.sfgov.org/resource/dau3-4s8f.csv")
# crashes <- read_csv("https://data.sfgov.org/resource/dau3-4s8f.csv")
glimpse(crashes)
Rows: 323
Columns: 28
$ unique_id             <int> 1, 2, 4, 16, 17, 20240304, 20240346, 20240391, 2…
$ case_id_fkey          <chr> "140236301", "140755533", "140365546", "15056204…
$ latitude              <dbl> 37.71041, 37.72548, 37.74826, 37.77730, 37.77825…
$ longitude             <dbl> -122.4042, -122.3942, -122.4137, -122.4197, -122…
$ collision_year        <int> 2014, 2014, 2014, 2015, 2014, 2024, 2024, 2024, …
$ death_date            <dttm> 2014-03-20, 2014-09-08, 2014-05-03, 2015-06-30,…
$ death_time            <chr> "11:21:00", "16:38:00", "17:20:00", "06:00:00", …
$ death_datetime        <dttm> 2014-03-20 11:21:00, 2014-09-08 16:38:00, 2014-…
$ collision_date        <dttm> 2014-03-20, 2014-09-08, 2014-05-03, 2015-06-28,…
$ collision_time        <chr> "", "05:10:00", "02:24:00", "03:52:00", "02:26:0…
$ collision_datetime    <dttm> 2014-03-20 00:00:00, 2014-09-08 05:10:00, 2014-…
$ location              <chr> "Bayshore Blvd near Visitation Ave", "3rd St at …
$ age                   <int> 82, 71, 26, 52, 53, 41, 48, 40, 24, 1, 38, 25, 8…
$ sex                   <chr> "Female", "Male", "Male", "Male", "Male", "Male"…
$ deceased              <chr> "Pedestrian", "Pedestrian", "Driver", "Motorcycl…
$ collision_type        <chr> "Pedestrian vs Motor Vehicle", "Pedestrian vs LR…
$ street_type           <chr> "City Street", "City Street", "City Street", "Ci…
$ on_vz_hin_2017        <chr> "false", "true", "true", "true", "true", "true",…
$ in_coc_2018           <chr> "false", "true", "false", "false", "false", "tru…
$ publish               <chr> "true", "true", "true", "true", "true", "true", …
$ on_vz_hin_2022        <chr> "true", "true", "true", "true", "true", "true", …
$ in_epa_2021           <chr> "false", "true", "false", "true", "true", "true"…
$ point                 <chr> "POINT (-122.404226037 37.710409217)", "POINT (-…
$ analysis_neighborhood <chr> "Bayview Hunters Point", "Bayview Hunters Point"…
$ supervisor_district   <int> 10, 10, 9, 5, 5, 5, 7, 7, 6, 7, 7, 10, 11, 6, 9,…
$ police_district       <chr> "INGLESIDE", "BAYVIEW", "MISSION", "NORTHERN", "…
$ data_as_of            <dttm> 2024-09-18, 2024-09-18, 2024-09-18, 2024-09-18,…
$ data_loaded_at        <dttm> 2025-02-10 13:42:17, 2025-02-10 13:42:17, 2025-…

select()

library(dplyr)
select(penguins, species, island, sex, body_mass_g)
# A tibble: 344 × 4
  species island    sex    body_mass_g
  <fct>   <fct>     <fct>        <int>
1 Adelie  Torgersen male          3750
2 Adelie  Torgersen female        3800
3 Adelie  Torgersen female        3250
# ℹ 341 more rows
select(penguins, 1:4)
# A tibble: 344 × 4
  species island    bill_length_mm bill_depth_mm
  <fct>   <fct>              <dbl>         <dbl>
1 Adelie  Torgersen           39.1          18.7
2 Adelie  Torgersen           39.5          17.4
3 Adelie  Torgersen           40.3          18  
# ℹ 341 more rows
select(penguins, bill_length_mm:body_mass_g)
# A tibble: 344 × 4
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
           <dbl>         <dbl>             <int>       <int>
1           39.1          18.7               181        3750
2           39.5          17.4               186        3800
3           40.3          18                 195        3250
# ℹ 341 more rows

arrange()

arrange(penguins, bill_length_mm)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Dream               32.1          15.5               188        3050
2 Adelie  Dream               33.1          16.1               178        2900
3 Adelie  Torgersen           33.5          19                 190        3600
# ℹ 341 more rows
# ℹ 2 more variables: sex <fct>, year <int>
arrange(penguins, species)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 2 more variables: sex <fct>, year <int>
arrange(penguins, desc(bill_length_mm))
# A tibble: 344 × 8
  species   island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>     <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo    Biscoe           59.6          17                 230        6050
2 Chinstrap Dream            58            17.8               181        3700
3 Gentoo    Biscoe           55.9          17                 228        5600
# ℹ 341 more rows
# ℹ 2 more variables: sex <fct>, year <int>

rename()

rename(penguins, Sex = sex)
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 2 more variables: Sex <fct>, year <int>
rename(penguins, genus = species, isle = island)
# A tibble: 344 × 8
  genus  isle   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex  
  <fct>  <fct>           <dbl>         <dbl>             <int>       <int> <fct>
1 Adelie Torge…           39.1          18.7               181        3750 male 
2 Adelie Torge…           39.5          17.4               186        3800 fema…
3 Adelie Torge…           40.3          18                 195        3250 fema…
# ℹ 341 more rows
# ℹ 1 more variable: year <int>

distinct()

distinct(penguins, sex)
# A tibble: 3 × 1
  sex   
  <fct> 
1 male  
2 female
3 <NA>  
distinct(penguins, island)
# A tibble: 3 × 1
  island   
  <fct>    
1 Torgersen
2 Biscoe   
3 Dream    
distinct(penguins, island, species)
# A tibble: 5 × 2
  island    species
  <fct>     <fct>  
1 Torgersen Adelie 
2 Biscoe    Adelie 
3 Dream     Adelie 
# ℹ 2 more rows

filter()

filter(penguins, sex == "female")
# A tibble: 165 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.5          17.4               186        3800
2 Adelie  Torgersen           40.3          18                 195        3250
3 Adelie  Torgersen           36.7          19.3               193        3450
# ℹ 162 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, body_mass_g > 4800)
# A tibble: 81 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe           50            16.3               230        5700
2 Gentoo  Biscoe           50            15.2               218        5700
3 Gentoo  Biscoe           47.6          14.5               215        5400
# ℹ 78 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, sex == "female", body_mass_g >= 4800)
# A tibble: 22 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Gentoo  Biscoe           45.4          14.6               211        4800
2 Gentoo  Biscoe           46.2          14.5               209        4800
3 Gentoo  Biscoe           45.1          14.5               215        5000
# ℹ 19 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter()

filter(penguins, is.na(bill_length_mm))
# A tibble: 2 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen             NA            NA                NA          NA
2 Gentoo  Biscoe                NA            NA                NA          NA
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, !is.na(bill_length_mm))
# A tibble: 342 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 339 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter()

filter(penguins, island == "Biscoe" | island == "Dream")
# A tibble: 292 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Biscoe           37.8          18.3               174        3400
2 Adelie  Biscoe           37.7          18.7               180        3600
3 Adelie  Biscoe           35.9          19.2               189        3800
# ℹ 289 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, island %in% c("Biscoe", "Dream"))
# A tibble: 292 × 8
  species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
1 Adelie  Biscoe           37.8          18.3               174        3400
2 Adelie  Biscoe           37.7          18.7               180        3600
3 Adelie  Biscoe           35.9          19.2               189        3800
# ℹ 289 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, !island %in% c("Biscoe", "Dream"))
# A tibble: 52 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 49 more rows
# ℹ 2 more variables: sex <fct>, year <int>

filter() with string helpers

library(stringr)

filter(penguins, str_detect(island, "ger"))
# A tibble: 52 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 49 more rows
# ℹ 2 more variables: sex <fct>, year <int>
filter(penguins, str_length(species) == 6)
# A tibble: 276 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 273 more rows
# ℹ 2 more variables: sex <fct>, year <int>

mutate()

mutate(penguins, body_mass_lb = body_mass_g/453.6)
# A tibble: 344 × 9
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 3 more variables: sex <fct>, year <int>, body_mass_lb <dbl>
usa_penguins <- mutate(
  penguins, 
  body_mass_lb = body_mass_g/453.6, 
  flipper_length_in = flipper_length_mm/25.4
  )

select(usa_penguins, species, body_mass_lb, flipper_length_in)
# A tibble: 344 × 3
  species body_mass_lb flipper_length_in
  <fct>          <dbl>             <dbl>
1 Adelie          8.27              7.13
2 Adelie          8.38              7.32
3 Adelie          7.16              7.68
# ℹ 341 more rows

mutate() with helpers

  • If you need to specify only one condition, use if_else(). Otherwise, use case_when().
mutate(penguins, body_mass_g = if_else(island == "Biscoe", body_mass_g - 50, body_mass_g))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <dbl>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 2 more variables: sex <fct>, year <int>
new_measurements <- mutate(penguins, new_body_mass_g = case_when(
  island == "Biscoe" ~ body_mass_g - 50,
  island == "Dream" ~ body_mass_g - 75,
  island == "Torgersen" ~ body_mass_g - 100
  )
)
select(new_measurements, island, body_mass_g, new_body_mass_g)
# A tibble: 344 × 3
  island    body_mass_g new_body_mass_g
  <fct>           <int>           <dbl>
1 Torgersen        3750            3650
2 Torgersen        3800            3700
3 Torgersen        3250            3150
# ℹ 341 more rows

mutate() with string helpers

penguins_with_ids <- mutate(penguins, id = paste(island, species, sex, year, sep = "-"))
select(penguins_with_ids, island, species, sex, year, id)
# A tibble: 344 × 5
  island    species sex     year id                          
  <fct>     <fct>   <fct>  <int> <chr>                       
1 Torgersen Adelie  male    2007 Torgersen-Adelie-male-2007  
2 Torgersen Adelie  female  2007 Torgersen-Adelie-female-2007
3 Torgersen Adelie  female  2007 Torgersen-Adelie-female-2007
# ℹ 341 more rows
mutate(penguins, sex = str_sub(sex, start = 1, end = 1))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 2 more variables: sex <chr>, year <int>
mutate(penguins, sex = str_to_title(sex))
# A tibble: 344 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
# ℹ 341 more rows
# ℹ 2 more variables: sex <chr>, year <int>

count()

count(penguins, sex)
# A tibble: 3 × 2
  sex        n
  <fct>  <int>
1 female   165
2 male     168
3 <NA>      11
count(penguins, species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
count(penguins, sex, species, sort = TRUE)
# A tibble: 8 × 3
  sex    species       n
  <fct>  <fct>     <int>
1 female Adelie       73
2 male   Adelie       73
3 male   Gentoo       61
4 female Gentoo       58
5 female Chinstrap    34
6 male   Chinstrap    34
7 <NA>   Adelie        6
8 <NA>   Gentoo        5

count()

count(penguins, island, name = "n_island_dwellers")
# A tibble: 3 × 2
  island    n_island_dwellers
  <fct>                 <int>
1 Biscoe                  168
2 Dream                   124
3 Torgersen                52
count(penguins, island == "Biscoe")
# A tibble: 2 × 2
  `island == "Biscoe"`     n
  <lgl>                <int>
1 FALSE                  176
2 TRUE                   168
count(penguins, body_mass_g < 3000)
# A tibble: 3 × 2
  `body_mass_g < 3000`     n
  <lgl>                <int>
1 FALSE                  333
2 TRUE                     9
3 NA                       2

summarize()

summarize(penguins, mean_flipper_length = mean(flipper_length_mm))
# A tibble: 1 × 1
  mean_flipper_length
                <dbl>
1                  NA
summarize(
  penguins, 
  mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE),
  mean_body_mass = mean(body_mass_g, na.rm = TRUE),
  mean_bill_length = mean(bill_length_mm, na.rm = TRUE),
  )
# A tibble: 1 × 3
  mean_flipper_length mean_body_mass mean_bill_length
                <dbl>          <dbl>            <dbl>
1                201.          4202.             43.9

group_by()

penguins_grouped_by_sex <- group_by(penguins, sex)
summarize(penguins_grouped_by_sex, mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 female          3862.
2 male            4546.
3 <NA>            4006.
penguins_grouped_by_sex_and_species <- group_by(penguins, sex, species)
summarize(penguins_grouped_by_sex_and_species, mean_body_mass = mean(body_mass_g, na.rm = TRUE))
# A tibble: 8 × 3
# Groups:   sex [3]
  sex    species   mean_body_mass
  <fct>  <fct>              <dbl>
1 female Adelie             3369.
2 female Chinstrap          3527.
3 female Gentoo             4680.
4 male   Adelie             4043.
5 male   Chinstrap          3939.
6 male   Gentoo             5485.
7 <NA>   Adelie             3540 
8 <NA>   Gentoo             4588.
summarize(penguins, mean_body_mass = mean(body_mass_g, na.rm = TRUE), .by = sex)
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 male            4546.
2 female          3862.
3 <NA>            4006.

The Pipe

We typically want to run numerous operations on a data frame, and saving the intermediate outputs as separate variables is tedious. The ‘pipe’ operator (%>% or |>), passes the output from one function directly into another.

  • Windows: Ctrl + M; Mac: Cmd + M
penguins %>% 
  filter(body_mass_g > 4000) %>% 
  group_by(sex) %>% 
  summarize(mean_body_mass = mean(body_mass_g, na.rm = TRUE)) %>% 
  arrange(desc(mean_body_mass))
# A tibble: 3 × 2
  sex    mean_body_mass
  <fct>           <dbl>
1 male            4975 
2 female          4683.
3 <NA>            4520 
penguins %>% count(species)
# A tibble: 3 × 2
  species       n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124

The tidyverse paradigm

  • ‘siuba’ in Python:
from siuba import group_by, summarize, _
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean()))
  • ‘Tidier.jl’ in Julia:
@chain movies begin
    @group_by(Year)
    @summarize(Mean_Yearly_Rating = mean(skipmissing(Rating)),
        Median_Yearly_Rating = median(skipmissing(Rating)))
    @slice(1:5)
end
  • PRQL:
from invoices
filter invoice_date >= @1970-01-16
derive {
  transaction_fees = 0.8,
  income = total - transaction_fees
}

Practice using the tidyverse!

Source: Air Traffic Passenger Statistics

air_traffic <- read.socrata("https://data.sfgov.org/resource/rkru-6vcg.csv")

# How many passengers deplaned from airlines with 'China' in their name?
air_traffic %>% 
  filter(
    str_detect(operating_airline, "China"),
    activity_type_code == "Deplaned"
    ) %>% 
  group_by(operating_airline) %>% 
  summarize(passengers = sum(passenger_count)) %>% 
  arrange(desc(passengers))

# How many flights for each operating airline in 2020?
air_traffic %>% 
  filter(
    activity_period_start_date >= as.Date("2020-01-01") & 
      activity_period_start_date <= as.Date("2020-12-31")
    ) %>% 
  count(operating_airline, sort = TRUE, name = "flights") %>% 
  head()

left_join()

If a row in ‘x’ or the left-hand side matches a row in ‘y’ or the right-hand side, the columns from the y table are joined to the x table.

df1 <- tibble(x = 1:3)
df2 <- tibble(x = c(1, 2), y = c("first", "second"))
df1
# A tibble: 3 × 1
      x
  <int>
1     1
2     2
3     3
df2
# A tibble: 2 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     2 second
left_join(df1, df2, by = "x") # or df1 %>% left_join(df2, join_by(x))
# A tibble: 3 × 2
      x y     
  <dbl> <chr> 
1     1 first 
2     2 second
3     3 <NA>  

left_join() (multiple matches)

If a row in ‘x’ or the left-hand side has multiple matches in ‘y’ or the right-hand side, all the matching rows in y will be joined to x.

df1 <- tibble(id = 1:3)
df2 <- tibble(code = c(1, 1, 2), y = c("first", "second", "third"))
df1
# A tibble: 3 × 1
     id
  <int>
1     1
2     2
3     3
df2
# A tibble: 3 × 2
   code y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     2 third 
df1 %>% left_join(df2, join_by(id == code))
# A tibble: 4 × 2
     id y     
  <dbl> <chr> 
1     1 first 
2     1 second
3     2 third 
4     3 <NA>  

inner_join()

x <- tibble(c1 = 1:3, c2 = c("x1", "x2", "x3"))
y <- tibble(c1 = c(1, 2, 4), c3 = c("y1", "y2", "y4"))
inner_join(x, y, by = join_by(c1))
# A tibble: 2 × 3
     c1 c2    c3   
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

bind_rows()

penguins_2007 <- penguins %>% filter(year == 2007)
penguins_2008 <- penguins %>% filter(year == 2008)
nrow(penguins_2007)
[1] 110
nrow(penguins_2008)
[1] 114
all_penguins <- bind_rows(penguins_2007, penguins_2008)
all_penguins
# A tibble: 224 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
7 Adelie  Torgersen           38.9          17.8               181        3625
8 Adelie  Torgersen           39.2          19.6               195        4675
# ℹ 216 more rows
# ℹ 2 more variables: sex <fct>, year <int>

Practice joining data!

flights <- read_rds("data/flights.rds")
airlines <- read_rds("data/airlines.rds")
planes <- read_rds("data/planes.rds")
airports <- read_rds("data/airports.rds")

left_join(flights, airlines, by = join_by(carrier))

flights %>% 
  left_join(airports, join_by(dest == faa)) %>% 
  select(year, month, day, origin, dest, tzone)

flights %>% 
  inner_join(planes, join_by(tailnum)) %>% 
  select(flight, month, day, type, engine)

pivoting

Reshape your data into something longer (increasing number of rows and decreasing the number of columns) or reshape your data into something wider (increasing the number of columns and decreasing the number of rows).

pivot_longer()

glimpse(relig_income)
Rows: 18
Columns: 11
$ religion             <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "D…
$ `<$10k`              <dbl> 27, 12, 27, 418, 15, 575, 1, 228, 20, 19, 289, 29…
$ `$10-20k`            <dbl> 34, 27, 21, 617, 14, 869, 9, 244, 27, 19, 495, 40…
$ `$20-30k`            <dbl> 60, 37, 30, 732, 15, 1064, 7, 236, 24, 25, 619, 4…
$ `$30-40k`            <dbl> 81, 52, 34, 670, 11, 982, 9, 238, 24, 25, 655, 51…
$ `$40-50k`            <dbl> 76, 35, 33, 638, 10, 881, 11, 197, 21, 30, 651, 5…
$ `$50-75k`            <dbl> 137, 70, 58, 1116, 35, 1486, 34, 223, 30, 95, 110…
$ `$75-100k`           <dbl> 122, 73, 62, 949, 21, 949, 47, 131, 15, 69, 939, …
$ `$100-150k`          <dbl> 109, 59, 39, 792, 17, 723, 48, 81, 11, 87, 753, 4…
$ `>150k`              <dbl> 84, 74, 53, 633, 18, 414, 54, 78, 6, 151, 634, 42…
$ `Don't know/refused` <dbl> 96, 76, 54, 1489, 116, 1529, 37, 339, 37, 162, 13…
relig_income %>% 
  pivot_longer(
    cols = 2:11,
    names_to = "income",
    values_to = "count"
  )
# A tibble: 180 × 3
  religion income    count
  <chr>    <chr>     <dbl>
1 Agnostic <$10k        27
2 Agnostic $10-20k      34
3 Agnostic $20-30k      60
4 Agnostic $30-40k      81
5 Agnostic $40-50k      76
6 Agnostic $50-75k     137
7 Agnostic $75-100k    122
8 Agnostic $100-150k   109
# ℹ 172 more rows

pivot_wider()

penguins %>% 
  count(island, species) %>% 
  pivot_wider(
    names_from = species,
    values_from = n
  )
# A tibble: 3 × 4
  island    Adelie Gentoo Chinstrap
  <fct>      <int>  <int>     <int>
1 Biscoe        44    124        NA
2 Dream         56     NA        68
3 Torgersen     52     NA        NA

Writing Data

adelie_males_on_torgersen_in_2007 <- penguins %>% 
  filter(
    species == "Adelie",
    sex == "male",
    island == "Torgersen",
    year == "2007"
  ) %>% 
  select(bill_length_mm:body_mass_g)

write_csv(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.csv")
write_rds(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.rds")

library(writexl)
write_xlsx(adelie_males_on_torgersen_in_2007, "data/adelie_males_on_torgersen_in_2007.xlsx")

ggplot2

ggplot(data = penguins, aes(x = species, y = body_mass_g, fill = species)) +
  geom_boxplot(outliers = FALSE) +
  geom_jitter(width = 0.1, alpha = 0.3) +
  labs(x = "Species", y = "Body Mass (grams)", title = "Penguin Body Mass by Species") +
  theme_classic() +
  theme(legend.position = "none")

gt

library(gt)

penguins %>% 
  group_by(island, species, sex) %>% 
  summarize(
    mean_body_mass = mean(body_mass_g, na.rm = TRUE)
    ) %>% 
  ungroup() %>% 
  drop_na(sex) %>% 
  pivot_wider(
    names_from = sex, 
    values_from = mean_body_mass
    ) %>% 
  mutate(island = paste("On", island, "island")) %>% 
  rename(
    Island = island,
    Species = species,
    Female = female,
    Male = male
  ) %>% 
  gt(
    groupname_col = "Island", 
    rowname_col = "Species"
    ) %>% 
  tab_style(
    style = list(cell_text(align = "right")),
    locations = cells_stub(rows = TRUE)
  ) %>% 
  tab_header(
    title = "Penguin Body Mass",
    subtitle = "Adult penguins near Palmer Station"
  )
Penguin Body Mass
Adult foraging penguins near Palmer Station
Female Male
On Biscoe island
Adelie 3369.318 4050.000
Gentoo 4679.741 5484.836
On Dream island
Adelie 3344.444 4045.536
Chinstrap 3527.206 3938.971
On Torgersen island
Adelie 3395.833 4034.783

Script Structure and Basic EDA

eda.R

  1. Load packages
  2. Import raw data
  3. Clean/reshape/‘tidy’ data
  4. Analyze
  5. Write cleaned data

Where to get Help

  • CCSF R Users Teams Channel

  • Stackoverflow

  • Posit Community

  • Twitter/X/Mastadon/BlueSky

How to get Help

What should I learn next?

Other Cool R Things

Lab

Lab instructions in the README here.

fin

Give us your feedback! (Please respond to the survey sent out after class)