Data wrangling with ‘dplyr’ in R

Published

July 12, 2023

Prepared by Claire Lepault and Marie Sevestre

dplyr offers a simple and consistent code syntaxe for data manipulation. We provide here short examples of the key dplyr verbs: select(), filter(), arrange(), mutate(), summarize() and group_by().

To get started !

Ensure tidyverse is installed

The dplyr package is part of the tidyverse (Hadleyverse), and it follows a grammar-based approach to programming/data work.

First, ensure tidyverse is installed : install.packages('tidyverse')

library("tidyverse") #Load the library

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidyverse_packages() #Display the packages

 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"

The pipe operator

Pipes & dplyr

Pipes are a simplifying programming tool that make your code easier to read. In dplyr, the expression for a pipe is %>%.

What is a pipe ?

Pipes take the output of a function as the input/argument of another function. R’s pipe specifically plugs the returned object to the left of the pipe into the first argument of the function on the right fo the pipe.

Example: generate a random sample and obtain its mean

rnorm(10) %>%  #rnorm(n) generates a random sample of n observations from a normal distribution 
    mean()

[1] 0.140672

Pipes avoid nested functions, prevent excessive writing to your disc, and increase the readability of our R scripts. By default, R pipes the output from the LHS of the pipe into the first argument of the function on the RHS of the pipe.
For example, a %>% fun(3) is equivalent of fun(arg1=a, arg2 = 3)

The storms database

To introduce dplyr verbs with pipes, we will use the storms database, which is the NOAA Atlantic hurricane database best track data. The data includes the positions and attributes of storms from 1975-2021. Storms from 1979 onward are measured every six hours during the lifetime of the storms.

If you want to learn more about storms :

?storms

NB : In the following sections, the head(n) and tail(n) functions are used for the presentation, to show only the first or last n or rows of the database.

storms %>%
    head(2)

# A tibble: 2 × 13
  name   year month   day  hour   lat  long status       category  wind pressure
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
1 Amy    1975     6    27     0  27.5   -79 tropical de…       NA    25     1013
2 Amy    1975     6    27     6  28.5   -79 tropical de…       NA    25     1013
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

The five key `dplyr` verbs

There are five key dplyr verbs that you need to learn.

select(): Select (i.e. subset) columns by their names.
filter(): Filter (i.e. subset) rows based on their values.
arrange(): Arrange (i.e. reorder) rows based on their values.
mutate(): Create new columns.
summarize() and group_by(): Collapse multiple rows into a single summary value.

Let’s practice these commands together using the storms data frame !

Select columns with `select()`

If you want to select specific columns of a dataframe, you can use select().

Selection based on columns names

Use commas to select multiple columns out of a data frame:

storms %>% 
    select(name, year, month, day, hour, wind, pressure)%>%
    head(5)

# A tibble: 5 × 7
  name   year month   day  hour  wind pressure
  <chr> <dbl> <dbl> <int> <dbl> <int>    <int>
1 Amy    1975     6    27     0    25     1013
2 Amy    1975     6    27     6    25     1013
3 Amy    1975     6    27    12    25     1013
4 Amy    1975     6    27    18    25     1013
5 Amy    1975     6    28     0    25     1012

You can use “first:last” for consecutive columns:

storms %>% 
    select(name:hour, wind, pressure)%>%
    head(5)

# A tibble: 5 × 7
  name   year month   day  hour  wind pressure
  <chr> <dbl> <dbl> <int> <dbl> <int>    <int>
1 Amy    1975     6    27     0    25     1013
2 Amy    1975     6    27     6    25     1013
3 Amy    1975     6    27    12    25     1013
4 Amy    1975     6    27    18    25     1013
5 Amy    1975     6    28     0    25     1012

You can deselect a column with -:

storms %>% 
    select(name:hour, wind, pressure, -day)%>%
    head(5)

# A tibble: 5 × 6
  name   year month  hour  wind pressure
  <chr> <dbl> <dbl> <dbl> <int>    <int>
1 Amy    1975     6     0    25     1013
2 Amy    1975     6     6    25     1013
3 Amy    1975     6    12    25     1013
4 Amy    1975     6    18    25     1013
5 Amy    1975     6     0    25     1012

Selection based on string characters

select(contains(PATTERN)) allows to select columns in a dataframe which contain a specific “PATTERN”. It provides a nice shortcut in relevant cases:

storms %>% 
    select(name, contains("diameter"))%>%
    tail(5)

# A tibble: 5 × 3
  name  tropicalstorm_force_diameter hurricane_force_diameter
  <chr>                        <int>                    <int>
1 Wanda                           60                        0
2 Wanda                           60                        0
3 Wanda                           90                        0
4 Wanda                           90                        0
5 Wanda                           70                        0

Here, we have selected the columns whose names contain the word “diameter”.

Rename variables

You can also rename some (or all) of your selected variables in place:

storms %>%
    select(alias=name, year, month, day, hour, wind_speed=wind, ts_diameter=tropicalstorm_force_diameter)%>%
    head(5)

# A tibble: 5 × 7
  alias  year month   day  hour wind_speed ts_diameter
  <chr> <dbl> <dbl> <int> <dbl>      <int>       <int>
1 Amy    1975     6    27     0         25          NA
2 Amy    1975     6    27     6         25          NA
3 Amy    1975     6    27    12         25          NA
4 Amy    1975     6    27    18         25          NA
5 Amy    1975     6    28     0         25          NA

If you just want to rename columns without subsetting them, you can use rename:

storms %>%
    rename(alias=name, wind_speed=wind, ts_diameter=tropicalstorm_force_diameter)%>%
    head(5)

# A tibble: 5 × 13
  alias  year month   day  hour   lat  long status  category wind_speed pressure
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>      <dbl>      <int>    <int>
1 Amy    1975     6    27     0  27.5 -79   tropic…       NA         25     1013
2 Amy    1975     6    27     6  28.5 -79   tropic…       NA         25     1013
3 Amy    1975     6    27    12  29.5 -79   tropic…       NA         25     1013
4 Amy    1975     6    27    18  30.5 -79   tropic…       NA         25     1013
5 Amy    1975     6    28     0  31.5 -78.8 tropic…       NA         25     1012
# ℹ 2 more variables: ts_diameter <int>, hurricane_force_diameter <int>

Filter observations with `filter()`

To keep observations, that validate a particular condition, you can use filter().

We can chain multiple filter commands with the pipe %>% or just separate them within a single filter command using commas:

storms %>% 
    select(name,year,month,day)%>%
    filter(year==2008,month >= 6)%>%
    head(3)

# A tibble: 3 × 4
  name    year month   day
  <chr>  <dbl> <dbl> <int>
1 Arthur  2008     6     1
2 Arthur  2008     6     1
3 Arthur  2008     6     1

Here, we have selected the columns “name”, “year”, “month” and “day” thanks to select(). With filter(), we keep row data with “year” equal to 2008 and “month” equal to or greater than six.

Using logical operators

We can use the & (“and”) and/or | (“or”) logical operators:

storms %>% 
    select(name,year,month,day)%>%
    filter(year==2008 | month >= 6)%>%
    head(3)

# A tibble: 3 × 4
  name   year month   day
  <chr> <dbl> <dbl> <int>
1 Amy    1975     6    27
2 Amy    1975     6    27
3 Amy    1975     6    27

Using regular expressions

The grepl() function allows to find the pattern in a character string.

storms %>% 
    select(name:status)%>%
    filter(grepl("tropical", status))%>%
    head(5)

# A tibble: 5 × 8
  name   year month   day  hour   lat  long status             
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>              
1 Amy    1975     6    27     0  27.5 -79   tropical depression
2 Amy    1975     6    27     6  28.5 -79   tropical depression
3 Amy    1975     6    27    12  29.5 -79   tropical depression
4 Amy    1975     6    27    18  30.5 -79   tropical depression
5 Amy    1975     6    28     0  31.5 -78.8 tropical depression

Here, we have kept the rows where the word “tropical” appears in the “status” column.

Missing values

A very common filter use case is identifying (or removing) missing data cases:

storms %>% 
    select(name,year,month,day,hour,ts_diameter=tropicalstorm_force_diameter)%>%
    filter(is.na(ts_diameter))%>% 
    head(3)

# A tibble: 3 × 6
  name   year month   day  hour ts_diameter
  <chr> <dbl> <dbl> <int> <dbl>       <int>
1 Amy    1975     6    27     0          NA
2 Amy    1975     6    27     6          NA
3 Amy    1975     6    27    12          NA

Here, we have renamed the column “tropicalstorm_force_diameter” in “ts_diameter”. The use of the filter() function, with the is.na() function, allows to filter the rows where the “ts_diameter” columns is missing (=“NA”).

To remove missing observations, simply use negation: filter(!is.na(ts_diameter))

storms %>% 
    select(name,year,month,day,hour,ts_diameter=tropicalstorm_force_diameter)%>%
    filter(!is.na(ts_diameter))%>% 
    head(3)

# A tibble: 3 × 6
  name   year month   day  hour ts_diameter
  <chr> <dbl> <dbl> <int> <dbl>       <int>
1 Alex   2004     7    31    18           0
2 Alex   2004     8     1     0           0
3 Alex   2004     8     1     6           0

Sort data with `arrange()`

You can arrange your data according to a colums, with the arrange() function.

storms %>% 
    select(name,year,month,day,hour,ts_diameter=tropicalstorm_force_diameter)%>%
    filter(!is.na(ts_diameter))%>%
    arrange(ts_diameter)%>%
    head(3)

# A tibble: 3 × 6
  name   year month   day  hour ts_diameter
  <chr> <dbl> <dbl> <int> <dbl>       <int>
1 Alex   2004     7    31    18           0
2 Alex   2004     8     1     0           0
3 Alex   2004     8     1     6           0

Here, the arrange() function is used to sort the rows according to the “ts_diameter” columns, by ascensing order.

Descending order

By default, sorting will be organised by ascending order. We can also arrange items in descending order using arrange(desc()).

storms %>% 
    select(name,year,month,day,hour,ts_diameter=tropicalstorm_force_diameter)%>%
    filter(!is.na(ts_diameter))%>%
    arrange(desc(ts_diameter))%>%
    head(3)

# A tibble: 3 × 6
  name   year month   day  hour ts_diameter
  <chr> <dbl> <dbl> <int> <dbl>       <int>
1 Oscar  2018    11     3    12        1440
2 Oscar  2018    11     3     6        1230
3 Oscar  2018    11     3    18        1140

Create news variable with `mutate()`

You can create new columns from scratch, or (more commonly) as transformations of existing columns.

storms %>% 
    select(name, year, month, day) %>%
    mutate(decade = paste0(substr(year, start = 1, stop = 3),0),
           quarter = ifelse(month %in% c(12,1,2), "Winter",
                        ifelse(month %in% 3:5, "Spring", 
                            ifelse(month %in% 6:8,"Summer","Automn"))),
            text = paste0(name, " occured in the ", decade,"'s"))%>%
    head(4)

# A tibble: 4 × 7
  name   year month   day decade quarter text                     
  <chr> <dbl> <dbl> <int> <chr>  <chr>   <chr>                    
1 Amy    1975     6    27 1970   Summer  Amy occured in the 1970's
2 Amy    1975     6    27 1970   Summer  Amy occured in the 1970's
3 Amy    1975     6    27 1970   Summer  Amy occured in the 1970's
4 Amy    1975     6    27 1970   Summer  Amy occured in the 1970's

Here, we have created 3 columns :

“decade” is composed of the first three digits of the column year (extracted with the substr() function) followed by a 0 (concatenated to the first 3 digits thanks to the paste0() function).
“quarter” reprensents the quarter season based on the month value
“text” contains the concatenation of the name of the storm, the expression “occurred in the” and the corresponding decade.

mutate is order aware. So you can chain multiple mutates in a single call.

mutate_if

mutate_if() allows you to easily work on a subset of variables.

storms %>% 
    select(name:status) %>% 
    mutate(status=as.character(status))%>%
    mutate_if(is.character, toupper) %>% 
    head(5)

# A tibble: 5 × 8
  name   year month   day  hour   lat  long status             
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>              
1 AMY    1975     6    27     0  27.5 -79   TROPICAL DEPRESSION
2 AMY    1975     6    27     6  28.5 -79   TROPICAL DEPRESSION
3 AMY    1975     6    27    12  29.5 -79   TROPICAL DEPRESSION
4 AMY    1975     6    27    18  30.5 -79   TROPICAL DEPRESSION
5 AMY    1975     6    28     0  31.5 -78.8 TROPICAL DEPRESSION

The toupper() function converts a character string in uppercase. So with the mutate_if(is.character) function, here, if the columns of the storms dataframe are character strings, the toupper() function will be apply.

mutate_at & mutate_all

mutate_at and mutate_all are also helpful. mutate_at allows to modify specified columns. The list argument enables to choose the suffix added to the name of the column that is “mutated”.

storms %>% 
    select(name:status) %>% 
    mutate_at(vars(name,status), list(UP =~ toupper(.))) %>% 
    head(3)

# A tibble: 3 × 10
  name   year month   day  hour   lat  long status             name_UP status_UP
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>              <chr>   <chr>    
1 Amy    1975     6    27     0  27.5   -79 tropical depressi… AMY     TROPICAL…
2 Amy    1975     6    27     6  28.5   -79 tropical depressi… AMY     TROPICAL…
3 Amy    1975     6    27    12  29.5   -79 tropical depressi… AMY     TROPICAL…

Here, we have created the “name_UP” and “status_UP” columns, by converting the “name” and “status” column values in uppercase.

Agregate operations with `summarize()` and `group_by()`

summarize() summarizes variables: you choose the variables and the summaries (e.g., mean() or min()). When using group_by(), summaries are done by groups:

storms %>% 
    select(name, year, wind, pressure) %>%
    mutate(decade = paste0(substr(year, start = 1, stop = 3),0))%>%
    group_by(decade) %>%                                       # for each decade
    summarize(Nobs = n(),                                      #number of observations
              mean_wind = mean(wind, na.rm = TRUE),            #average wind speed
              max_pressure = max(pressure, na.rm = TRUE))%>%   #maximum pressure
    head(4)

# A tibble: 4 × 4
  decade  Nobs mean_wind max_pressure
  <chr>  <int>     <dbl>        <int>
1 1970     932      50.9         1015
2 1980    2674      51.0         1018
3 1990    3895      51.4         1020
4 2000    5000      49.9         1024

Note that including “na.rm = TRUE” (or, its alias “na.rm = T”) is usually a good idea with summarize functions. Otherwise, any missing value will propogate to the summarized value too.

storms %>% 
    summarize(mean_ts_diameter = mean(tropicalstorm_force_diameter))

# A tibble: 1 × 1
  mean_ts_diameter
             <dbl>
1               NA

storms %>%
    summarize(mean_ts_diameter = mean(tropicalstorm_force_diameter, na.rm = TRUE))

# A tibble: 1 × 1
  mean_ts_diameter
             <dbl>
1             146.

summarize_if, summarize_at and summarize_all

The same if-, at- and all-based workflow that we saw with mutate a few slides back also works with summarize.

storms %>% 
    select(name, year, wind, pressure) %>%
    mutate(decade = paste0(substr(year, start = 1, stop = 3),0))%>%
    group_by(decade) %>% 
    summarize_if(is.numeric, mean, na.rm=T) %>%
    head(4)

# A tibble: 4 × 4
  decade  year  wind pressure
  <chr>  <dbl> <dbl>    <dbl>
1 1970   1977.  50.9     995.
2 1980   1985.  51.0     994.
3 1990   1995.  51.4     993.
4 2000   2004.  49.9     993.

storms %>% 
    select(year, wind, pressure) %>%
    mutate(decade = paste0(substr(year, start = 1, stop = 3),0))%>%
    select(-year)%>%group_by(decade) %>% 
    summarize_all(list(Mean=~mean(.,na.rm=TRUE), 
                       Min=~min(.,na.rm=TRUE),
                       Max=~max(.,na.rm=TRUE)))

# A tibble: 6 × 7
  decade wind_Mean pressure_Mean wind_Min pressure_Min wind_Max pressure_Max
  <chr>      <dbl>         <dbl>    <int>        <int>    <int>        <int>
1 1970        50.9          995.       15          924      150         1015
2 1980        51.0          994.       10          888      165         1018
3 1990        51.4          993.       10          905      155         1020
4 2000        49.9          993.       10          882      160         1024
5 2010        49.0          994.       15          908      160         1021
6 2020        47.8          995.       15          917      135         1018

And more in tidyverse

There’s more with dplyr:

Merging inner_join(),left_join(),anti_join()
Viewing data glimpse(), top_n()
Sampling sample_n(), sample_frac()
Summaries first(), last(), nth(), n_distinct()
Duplicates distinct()
Missingness na_if()

tidyr offers useful tools too:

Missingness replace_na(), drop_na(), fill()
Tidying pivot_longer(), pivot_wider()

Acknowlegdments

This tutorial is inspired from great resources on data analysis and econometrics with R offered by Ed Rubin(PhD Econometrics III) and Grant Mac Dermott (Data science for economists).