dplyr offers a simple and consistent code syntaxe for data manipulation. We provide here short examples of the key dplyr verbs: select(), filter(), arrange(), mutate(), summarize() and group_by().
To get started !
Ensure tidyverse is installed
The dplyr package is part of the tidyverse (Hadleyverse), and it follows a grammar-based approach to programming/data work.
First, ensure tidyverse is installed : install.packages('tidyverse')
library("tidyverse") #Load the library
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Pipes are a simplifying programming tool that make your code easier to read. In dplyr, the expression for a pipe is %>%.
What is a pipe ?
Pipes take the output of a function as the input/argument of another function. R’s pipe specifically plugs the returned object to the left of the pipe into the first argument of the function on the right fo the pipe.
Example: generate a random sample and obtain its mean
rnorm(10) %>%#rnorm(n) generates a random sample of n observations from a normal distribution mean()
[1] 0.140672
Pipes avoid nested functions, prevent excessive writing to your disc, and increase the readability of our R scripts. By default, R pipes the output from the LHS of the pipe into the first argument of the function on the RHS of the pipe. For example, a %>% fun(3) is equivalent of fun(arg1=a, arg2 = 3)
The storms database
To introduce dplyr verbs with pipes, we will use the storms database, which is the NOAA Atlantic hurricane database best track data. The data includes the positions and attributes of storms from 1975-2021. Storms from 1979 onward are measured every six hours during the lifetime of the storms.
If you want to learn more about storms :
?storms
NB : In the following sections, the head(n) and tail(n) functions are used for the presentation, to show only the first or last n or rows of the database.
storms %>%head(2)
# A tibble: 2 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975 6 27 0 27.5 -79 tropical de… NA 25 1013
2 Amy 1975 6 27 6 28.5 -79 tropical de… NA 25 1013
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
The five key dplyr verbs
There are five key dplyr verbs that you need to learn.
select(): Select (i.e. subset) columns by their names.
filter(): Filter (i.e. subset) rows based on their values.
arrange(): Arrange (i.e. reorder) rows based on their values.
mutate(): Create new columns.
summarize() and group_by(): Collapse multiple rows into a single summary value.
Let’s practice these commands together using the storms data frame !
Select columns with select()
If you want to select specific columns of a dataframe, you can use select().
Selection based on columns names
Use commas to select multiple columns out of a data frame:
# A tibble: 3 × 4
name year month day
<chr> <dbl> <dbl> <int>
1 Arthur 2008 6 1
2 Arthur 2008 6 1
3 Arthur 2008 6 1
Here, we have selected the columns “name”, “year”, “month” and “day” thanks to select(). With filter(), we keep row data with “year” equal to 2008 and “month” equal to or greater than six.
Using logical operators
We can use the & (“and”) and/or | (“or”) logical operators:
# A tibble: 3 × 6
name year month day hour ts_diameter
<chr> <dbl> <dbl> <int> <dbl> <int>
1 Amy 1975 6 27 0 NA
2 Amy 1975 6 27 6 NA
3 Amy 1975 6 27 12 NA
Here, we have renamed the column “tropicalstorm_force_diameter” in “ts_diameter”. The use of the filter() function, with the is.na() function, allows to filter the rows where the “ts_diameter” columns is missing (=“NA”).
To remove missing observations, simply use negation: filter(!is.na(ts_diameter))
# A tibble: 4 × 7
name year month day decade quarter text
<chr> <dbl> <dbl> <int> <chr> <chr> <chr>
1 Amy 1975 6 27 1970 Summer Amy occured in the 1970's
2 Amy 1975 6 27 1970 Summer Amy occured in the 1970's
3 Amy 1975 6 27 1970 Summer Amy occured in the 1970's
4 Amy 1975 6 27 1970 Summer Amy occured in the 1970's
Here, we have created 3 columns :
“decade” is composed of the first three digits of the column year (extracted with the substr() function) followed by a 0 (concatenated to the first 3 digits thanks to the paste0() function).
“quarter” reprensents the quarter season based on the month value
“text” contains the concatenation of the name of the storm, the expression “occurred in the” and the corresponding decade.
mutate is order aware. So you can chain multiple mutates in a single call.
mutate_if
mutate_if() allows you to easily work on a subset of variables.
The toupper() function converts a character string in uppercase. So with the mutate_if(is.character) function, here, if the columns of the storms dataframe are character strings, the toupper() function will be apply.
mutate_at & mutate_all
mutate_at and mutate_all are also helpful. mutate_at allows to modify specified columns. The list argument enables to choose the suffix added to the name of the column that is “mutated”.
# A tibble: 3 × 10
name year month day hour lat long status name_UP status_UP
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <chr> <chr>
1 Amy 1975 6 27 0 27.5 -79 tropical depressi… AMY TROPICAL…
2 Amy 1975 6 27 6 28.5 -79 tropical depressi… AMY TROPICAL…
3 Amy 1975 6 27 12 29.5 -79 tropical depressi… AMY TROPICAL…
Here, we have created the “name_UP” and “status_UP” columns, by converting the “name” and “status” column values in uppercase.
Agregate operations with summarize() and group_by()
summarize() summarizes variables: you choose the variables and the summaries (e.g., mean() or min()). When using group_by(), summaries are done by groups:
storms %>%select(name, year, wind, pressure) %>%mutate(decade =paste0(substr(year, start =1, stop =3),0))%>%group_by(decade) %>%# for each decadesummarize(Nobs =n(), #number of observationsmean_wind =mean(wind, na.rm =TRUE), #average wind speedmax_pressure =max(pressure, na.rm =TRUE))%>%#maximum pressurehead(4)
Note that including “na.rm = TRUE” (or, its alias “na.rm = T”) is usually a good idea with summarize functions. Otherwise, any missing value will propogate to the summarized value too.