Data visualization with ‘ggplot2’ in R

Published

July 31, 2023

Prepared by Claire Lepault and Marie Sevestre

ggplot2 is a widely used data visualization package for creating sophisticated plots. It was developed by Hadley Wickham and is based on the Grammar of Graphics (the "gg" in its name), a systematic framework for understanding and constructing data visualizations.

Getting started

Ensure tidyverse is installed

The ggplot2 package is part of the tidyverse (Hadleyverse).

First, ensure tidyverse is installed: install.packages('tidyverse')

library("tidyverse") #Load the library 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The storms database

As in the dplyr tutorial, we will work with the storms dataset to present the package. storms ships with dplyr, which is already loaded as part of the tidyverse. It is the NOAA Atlantic hurricane database best track data and includes the positions and attributes of storms from 1975-2021. Storms from 1979 onward are measured every six hours during their lifetime.

If you want to learn more about storms:

?storms

For example, we will look at the average, minimum and maximum wind speed of storms per decade. To do this, we create the storms_decade dataset:

storms_decade <- storms %>% 
                      select(name, year, wind, pressure) %>%
                      mutate(decade = as.numeric(paste0(substr(year, start = 1, stop = 3),0)))%>%
                      arrange(decade)%>% 
                      group_by(decade) %>% 
                      summarize(Nobs = n(),
                                min_wind = min(wind, na.rm = TRUE),
                                mean_wind = mean(wind, na.rm = TRUE),
                                max_wind = max(wind, na.rm = TRUE))
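The decade can also be derived with integer arithmetic instead of string manipulation. The following is a minimal sketch, assuming year is numeric as in the storms data; decade_arith and decade_check are names introduced here for illustration only.

```r
library(dplyr)

# Compare the string-based decade used above with an arithmetic version.
# `decade_arith` is a hypothetical column name used only for this check.
decade_check <- storms %>%
  mutate(decade       = as.numeric(paste0(substr(year, start = 1, stop = 3), 0)),
         decade_arith = (year %/% 10) * 10) %>%
  distinct(year, decade, decade_arith)

# TRUE if both approaches give the same decade for every year
all(decade_check$decade == decade_check$decade_arith)
```

The arithmetic version avoids the round trip through character strings, which may be easier to read and to adapt (for example to five-year periods).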

Creating a plot

To create a plot, the key elements to be specified are:

  • the dataset
  • the mapping of variables to aesthetics (like x and y axes, color, shape, size, etc.)
  • the geometric objects that represent the data (e.g., points, lines, bars, etc.)

ggplot()

To initialize a plot, the key function of the package is ggplot(). Its first argument, data, is the data frame containing the variables to be plotted. Then, the aes() function maps variables to aesthetics (e.g. x-axis, y-axis, color, size, etc.). Finally, layers created by geometric object functions (e.g. geom_point(), geom_line(), geom_bar()) draw the data.

  • The ggplot() function doesn’t draw any data by itself; it only sets up the plot.
ggplot(data = storms_decade, aes(x = decade, y =mean_wind)) 

  • You can choose the titles of the plot as well as the x-axis and the y-axis with labs:
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed")

  • To plot the figure, you need to add layers with the + sign.
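The steps above can be sketched by storing the plot in an object and adding a layer afterwards. This is only an illustration: the names p_base and p_points are made up here, and the summary table is rebuilt in a shortened form so that the snippet runs on its own.

```r
library(ggplot2)
library(dplyr)

# Rebuild a small per-decade summary so the snippet is self-contained
storms_decade <- storms %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarize(mean_wind = mean(wind, na.rm = TRUE))

# Step 1: set up the plot (no layer yet, so no data is drawn)
p_base <- ggplot(data = storms_decade, aes(x = decade, y = mean_wind))

# Step 2: add a layer with `+`; the result is a new plot object
p_points <- p_base + geom_point()

length(p_base$layers)    # number of layers before adding the geom
length(p_points$layers)  # number of layers after adding the geom

p_points  # printing the object draws the plot
```

Storing intermediate plots this way can make it easier to reuse the same base with different layers.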

We will now represent the average speed of storms per decade through different geometric visualizations.

geom_point() & geom_line()

Use geom_point() to represent points

ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point()

Use geom_line() to draw lines, for example to connect the points on the plot

ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point() + 
     geom_line()

Key Arguments

  • To change the shape of your points use shape
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point(shape=15) + 
     geom_line()

By selecting shape=15, the points on the graph are represented by solid squares. You can look at the point shape options on this blog.

  • To change the size of your points use size, and linewidth for the lines
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point(shape=15, size=3) + 
     geom_line(linewidth=1)

  • To change the color of your points and your lines use color
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point(shape=15, size=3, color = "red") +
     geom_line(linewidth=1, color="red")

  • To modify the transparency of the points use alpha
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_point(shape=15, size=3,color="red", alpha=0.5) +
     geom_line(linewidth=1, color="red")

The closer the alpha is to 0, the more transparent the points. If alpha equals 1, the points will be opaque.
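In the examples above, shape, size, color and alpha are set to constant values outside aes(). As a contrast, an aesthetic can also be mapped to a variable inside aes(). The sketch below assumes we want the point size to reflect the number of observations per decade (Nobs); the object name p_size is hypothetical and the summary table is rebuilt in a shortened form.

```r
library(ggplot2)
library(dplyr)

# Per-decade summary with the number of observations (rebuilt for this snippet)
storms_decade <- storms %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarize(Nobs = n(),
            mean_wind = mean(wind, na.rm = TRUE))

p_size <- ggplot(data = storms_decade, aes(x = decade, y = mean_wind)) +
  labs(title = "Average wind speed per decade",
       x = "Decade", y = "Average wind speed",
       size = "Nb. observations") +          # title of the size legend
  geom_line(linewidth = 1, color = "red") +  # constant color: set outside aes()
  geom_point(aes(size = Nobs),               # size varies with the data: inside aes()
             color = "red", alpha = 0.5)

p_size
```

Because size is mapped to a variable, ggplot2 adds a legend for it automatically, which does not happen for constants such as color="red".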

geom_bar()

Use geom_bar() to represent bars

ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_bar(stat = "identity")

stat="identity" tells geom_bar() to use the mean_wind values (the y variable given in aes()) as the bar heights, instead of counting the observations in each group.
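ggplot2 also provides geom_col(), which uses the y values as bar heights directly. The following sketch compares the two; p_bar and p_col are illustrative names, and the summary table is rebuilt in a shortened form so the snippet runs on its own.

```r
library(ggplot2)
library(dplyr)

storms_decade <- storms %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarize(mean_wind = mean(wind, na.rm = TRUE))

# Bar heights taken from mean_wind, written in two ways
p_bar <- ggplot(storms_decade, aes(x = decade, y = mean_wind)) +
  geom_bar(stat = "identity")
p_col <- ggplot(storms_decade, aes(x = decade, y = mean_wind)) +
  geom_col()

# layer_data() returns the values actually drawn; the heights should match
all.equal(layer_data(p_bar)$y, layer_data(p_col)$y)
```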

Key Arguments

  • You can color the inside of your bars with fill and their outline with color.
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_bar(stat="identity", fill="red", color="darkred")

  • To change the width of the bars use width
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) + 
     labs(title="Average wind speed per decade", 
          x="Decade", y="Average wind speed") +
     geom_bar(stat="identity", fill="yellow", width=5)

geom_histogram()

Use geom_histogram() to create a histogram

ggplot(storms, aes(x=wind))+
     labs(title="Histogram of wind speed observations",
          x="Wind speed", y="Nb. observations") + 
     geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Key Argument

  • To define the width of the bins along the x-axis use binwidth
ggplot(storms, aes(x=wind))+
     labs(title="Histogram of wind speed observations",
          x="Wind speed", y="Nb. observations") + 
     geom_histogram(color = "darkgreen", fill = "lightgreen", binwidth=10)
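binwidth fixes the width of the bins but not where they start. As a sketch, assuming we want the bins to begin at multiples of 10, the boundary argument of geom_histogram() can be combined with binwidth; the object name p_hist is hypothetical.

```r
library(ggplot2)
library(dplyr)  # provides the storms dataset

p_hist <- ggplot(storms, aes(x = wind)) +
  labs(title = "Histogram of wind speed observations",
       x = "Wind speed", y = "Nb. observations") +
  geom_histogram(color = "darkgreen", fill = "lightgreen",
                 binwidth = 10,   # each bin covers 10 units of wind speed
                 boundary = 0)    # bin edges are aligned on 0, 10, 20, ...

# Lower edges of the bins that are actually drawn
layer_data(p_hist)$xmin

p_hist
```

Aligning the bin edges on round values can make the bars easier to read against the axis ticks.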

geom_boxplot() & geom_violin()

Use geom_boxplot() to create a boxplot

ggplot(storms, aes(y=wind)) +
     labs(title="Boxplot of wind speed distribution", 
          y="Wind speed")+
     geom_boxplot(fill="lightblue", color="blue")

By default, when no x variable is given, the single box is centered on the x-axis.

  • To show the distribution within different groups, in this example within each month:
ggplot(storms, aes(x=as.factor(month), y=wind)) +
     labs(title="Boxplots of wind speed distribution per month",
           x="Month", y="Wind speed")+
     geom_boxplot(fill="lightblue", color="blue")

Use geom_violin() to create a violin plot

ggplot(storms, aes(x=as.factor(month), y=wind)) +
     labs(title="Violins of wind speed distribution per month", 
          x="Month", y="Wind speed")+
     geom_violin(fill="pink", color="violet")
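Since plots are built by adding layers, the two geoms can also be combined. The sketch below overlays a narrow boxplot on each violin; this is one common presentation choice rather than a recommendation from the original tutorial, and p_violin_box is an illustrative name.

```r
library(ggplot2)
library(dplyr)  # provides the storms dataset

p_violin_box <- ggplot(storms, aes(x = as.factor(month), y = wind)) +
  labs(title = "Wind speed distribution per month",
       x = "Month", y = "Wind speed") +
  geom_violin(fill = "pink", color = "violet") +  # shape of the distribution
  geom_boxplot(width = 0.1,                       # narrow box drawn inside the violin
               outlier.shape = NA)                # hide outlier points to limit clutter

p_violin_box
```

Layers are drawn in the order they are added, so the boxplot appears on top of the violin.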

Presentation tips

Theme customization with theme()

You can customize the presentation of your graphics with theme(). Here we show how to customize the plot title with the plot.title argument.

  • To bold the text use face

  • To change the font size use size

  • To change the color use color

  • To adjust the position use hjust (0.5 centers the title, 0 puts it on the left, and 1 puts it on the right)

  • To change the font of the title use family

ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) +
      labs(title="Average wind speed per decade",
           x="Date", y="Average wind speed") +
      geom_point(shape=1, size=3, color="red", alpha=0.5) +
      geom_line(linewidth=1, color="red") + 
      theme(plot.title=element_text(face="bold", 
                                    size=15, 
                                    color="darkblue", 
                                    hjust=0.5, 
                                    family="Times New Roman"))

NB: You can check which fonts are available on your computer with windowsFonts() on Windows and with quartzFonts() on macOS.
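A named font such as "Times New Roman" is only rendered if the graphics device knows it. As a more portable sketch, the generic family names "serif", "sans" and "mono" are understood by R graphics devices; the object name p_font is hypothetical and the summary table is rebuilt in a shortened form.

```r
library(ggplot2)
library(dplyr)

storms_decade <- storms %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarize(mean_wind = mean(wind, na.rm = TRUE))

p_font <- ggplot(data = storms_decade, aes(x = decade, y = mean_wind)) +
  labs(title = "Average wind speed per decade",
       x = "Decade", y = "Average wind speed") +
  geom_line(linewidth = 1, color = "red") +
  theme(plot.title = element_text(face = "bold",
                                  size = 15,
                                  color = "darkblue",
                                  hjust = 0.5,
                                  family = "serif"))  # generic serif font of the device

p_font
```

The exact typeface behind "serif" depends on the device and the operating system, so the result may differ slightly between machines.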

  • To add space between the title, the axis names, and the plot, use \n (a line break) in the labels
ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) +
      labs(title="Average wind speed per decade \n",
           x="\n Date", y="Average wind speed \n") +
      geom_point(shape=1, size=3, color="red", alpha=0.5) +
      geom_line(linewidth=1, color="red") + 
      theme(plot.title=element_text(face="bold", 
                                    size=15, 
                                    color="darkblue", 
                                    hjust=0.5, 
                                    family="Times New Roman"))

coord_cartesian()

coord_cartesian() controls the extent of the graph axes. It allows you to explicitly define the limits of the x and y axes of the graph (useful for zooming in or out on a specific part of the graph, while keeping the original data!)

ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) +
      labs(title="Average wind speed per decade \n",
           x="\n Date", y="Average wind speed \n") +
      geom_point(shape=1, size=3, color="red", alpha=0.5) +
      geom_line(linewidth=1, color="red") + 
      theme(plot.title=element_text(face="bold", 
                                    size=15, 
                                    color="darkblue", 
                                    hjust=0.5, 
                                    family="Times New Roman"))+
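      coord_cartesian(ylim=c(0,60))

In this example, coord_cartesian(ylim=c(0,60)) restricts the displayed y-axis to the range from 0 to 60. (The shorter y=c(0,60) also runs, but only because R partially matches y to the ylim argument.)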
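The phrase "while keeping the original data" matters in practice. The sketch below contrasts coord_cartesian(), which only zooms, with limits set on the scale, which drop the observations outside the range before statistics are computed. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)  # provides the storms dataset

p_box <- ggplot(storms, aes(y = wind)) +
  geom_boxplot(fill = "lightblue", color = "blue")

# Zoom: all observations are kept, only the visible range changes
p_zoom <- p_box + coord_cartesian(ylim = c(0, 60))

# Scale limits: observations above 60 are removed (with a warning)
# before the box is computed, so the summary statistics change
p_drop <- p_box + scale_y_continuous(limits = c(0, 60))

# Median drawn in each version (column `middle` of the computed layer data)
layer_data(p_zoom)$middle
suppressWarnings(layer_data(p_drop)$middle)
```

For geoms that summarize the data (boxplots, histograms, smoothers), zooming with coord_cartesian() is usually the safer choice when the goal is only to change the view.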

scale_color_manual()

The scale_color_manual() function is used to specify the colors associated with the different values or levels of a variable mapped to the color aesthetic. With the name argument, you specify the title of the color legend. With the values argument, you specify the color associated with each value of the selected variable.

plot_speed <- ggplot(data=storms_decade) + 
                    labs(title="Wind speed statistics per decade \n",
                         x="\n Decade", y="Wind Speed \n") +
                    geom_point(aes(x=decade, y=mean_wind,color="Mean")) + 
                    geom_line(aes(x=decade, y=mean_wind,color="Mean"))+
                    geom_point(aes(x=decade, y=min_wind,color="Min")) + 
                    geom_line(aes(x=decade, y=min_wind,color="Min"))+
                    geom_point(aes(x=decade, y=max_wind,color="Max")) + 
                    geom_line(aes(x=decade, y=max_wind,color="Max")) + 
                    coord_cartesian(ylim=c(0,200))+  # set the y-axis limits
                    scale_color_manual(name="Statistic",
                                       values=c("Min"="violet",
                                                "Mean"='#1B9E77',
                                                "Max"='#D95F02'))+
                    theme_minimal(base_size = 13) + 
                    theme(legend.position="top",
                          legend.title = element_text(size = 14),
                          legend.text = element_text(size = 12),
                          plot.title = element_text(size = 16, hjust = 0.5,face ="bold"),
                          axis.title.x = element_text(size = 14, hjust = 0.5,face ="bold"),
                          axis.title.y = element_text(size = 14, hjust = 0.5,face ="bold")
                         )
plot_speed                          


NB: scale_fill_manual() works in exactly the same way for the fill aesthetic.
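An alternative to adding one pair of layers per statistic is to reshape the table to long format and map color to a column. This is a sketch of that approach, not the method used in the original example; storms_long, statistic and plot_speed_long are names introduced here, and pivot_longer() comes from tidyr.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

storms_decade <- storms %>%
  mutate(decade = (year %/% 10) * 10) %>%
  group_by(decade) %>%
  summarize(min_wind  = min(wind, na.rm = TRUE),
            mean_wind = mean(wind, na.rm = TRUE),
            max_wind  = max(wind, na.rm = TRUE))

# One row per decade and statistic instead of one column per statistic
storms_long <- storms_decade %>%
  pivot_longer(cols = c(min_wind, mean_wind, max_wind),
               names_to = "statistic", values_to = "wind") %>%
  mutate(statistic = recode(statistic,
                            min_wind = "Min", mean_wind = "Mean", max_wind = "Max"))

plot_speed_long <- ggplot(storms_long,
                          aes(x = decade, y = wind, color = statistic)) +
  labs(title = "Wind speed statistics per decade",
       x = "Decade", y = "Wind Speed") +
  geom_point() +
  geom_line() +
  scale_color_manual(name = "Statistic",
                     values = c("Min" = "violet",
                                "Mean" = "#1B9E77",
                                "Max" = "#D95F02"))

plot_speed_long
```

With the long format, adding another statistic only requires another column in the summary, not another pair of layers.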

theme(legend.X)

As for the plot title, you can use the theme() function with the legend.X arguments (e.g. legend.title, legend.text, legend.position) to change the face, size, color, or position of the legend (its title and its text).
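As a small sketch of these legend settings, the plot below maps color to the status column of storms so that a legend exists, then adjusts its position, title and text. The choice of variables and the name p_legend are illustrative only.

```r
library(ggplot2)
library(dplyr)  # provides the storms dataset

p_legend <- ggplot(storms, aes(x = wind, y = pressure, color = status)) +
  geom_point(alpha = 0.3) +
  labs(x = "Wind speed", y = "Pressure", color = "Storm status") +
  theme(legend.position = "bottom",                               # where the legend sits
        legend.title = element_text(face = "bold", size = 12),    # legend title
        legend.text  = element_text(size = 10, color = "grey30")) # legend labels

p_legend
```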


ggarrange()

With the ggarrange() function you can arrange your plots in a single figure and control their layout (relative sizes, alignment, etc.). For example, you can create mosaics of graphics, which display several graphs side by side in a single figure.

First, ensure ggpubr is installed: install.packages('ggpubr')

library("ggpubr") #Load the library 

You can define a custom theme so that all plots are displayed in the same way, by creating a function that wraps theme() with the options you want.

theme_custom <- function(...) {
            # Shared theme settings; any extra arguments are passed on to theme()
            theme(plot.title = element_text(face = "bold", color="black", hjust = 0.5), 
                  title = element_text(size=10),
                  ...)
}

With title you change all the titles at once (the plot title, the axis titles and the legend title). With plot.title you change only the title of your plot.

Choose your plots and store them as objects to use ggarrange()

plot1 <- ggplot(data=storms_decade, aes(x=decade, y=mean_wind)) +
          labs(title = "Average wind speed per decade",
               x="Decade", y="Average wind speed") +
          geom_point(shape=15, size=3,color = "red", alpha = 0.5) +
          geom_line(linewidth=1, color="red")+
          theme_custom()

plot2 <- ggplot(storms_decade, aes(x=decade, y=mean_wind)) + 
          labs(title="Average wind speed per decade",
               x="Decade", y="Average wind speed")  +
          geom_bar(stat="identity", fill = "yellow", width=5)+
          theme_custom()

plot3 <- ggplot(storms, aes(x=wind))+
          labs(title="Histogram of wind speed",
               x="Wind speed") +
          geom_histogram(color = "darkgreen", fill="lightgreen", binwidth=10)+
          theme_custom()

plot4 <- ggplot(storms, aes(y = wind)) +
          labs(title="Boxplot of wind speed",
               x="", y="Wind speed") +
          geom_boxplot(fill="lightblue", color="blue") +
          theme_custom()+
          theme(axis.text.x=element_blank(),
                axis.ticks.x=element_blank())

You can create a mosaic with your plots.

ggarrange(plot1, plot2, plot3, plot4, nrow = 2, ncol = 2)
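ggarrange() also accepts arguments that control the layout more finely. The sketch below uses labels to tag each panel and widths to give the first column twice the space of the second; the two plots are rebuilt in a simplified form so the snippet runs on its own, and all object names are illustrative.

```r
library(ggplot2)
library(dplyr)   # provides the storms dataset
library(ggpubr)

p_hist <- ggplot(storms, aes(x = wind)) +
  labs(title = "Histogram of wind speed", x = "Wind speed") +
  geom_histogram(color = "darkgreen", fill = "lightgreen", binwidth = 10)

p_box <- ggplot(storms, aes(y = wind)) +
  labs(title = "Boxplot of wind speed", y = "Wind speed") +
  geom_boxplot(fill = "lightblue", color = "blue")

fig <- ggarrange(p_hist, p_box,
                 nrow = 1, ncol = 2,
                 labels = c("A", "B"),  # panel tags drawn in the corners
                 widths = c(2, 1))      # relative column widths

fig
```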

ggsave()

To save a graphic that you have created with ggplot2, you can use ggsave(). It saves the graph as an image file.

ggsave(filename='./img/wind_speed_stats.png', plot=plot1, height=10, width=15)
  • The image format (PNG here, or JPEG, PDF, etc.) is determined by the extension given in the filename argument.

  • The image dimensions are set by the arguments height and width, expressed in inches by default. Here, we saved an image of 15 × 10 inches (width × height).
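The units and the resolution can also be stated explicitly. The following sketch saves a plot of 15 × 10 cm at 300 dpi; it writes into a temporary folder (out_dir, a hypothetical location chosen for this example) and creates the folder first, since the target directory may need to exist before saving.

```r
library(ggplot2)
library(dplyr)  # provides the storms dataset

p <- ggplot(storms, aes(x = wind)) +
  geom_histogram(binwidth = 10)

# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "img")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
out_file <- file.path(out_dir, "wind_hist.png")

ggsave(filename = out_file, plot = p,
       width = 15, height = 10,
       units = "cm",  # interpret width and height in centimeters
       dpi = 300)     # resolution used for raster formats such as PNG

file.exists(out_file)  # check that the file was written
```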