This tutorial builds on Hadley Wickham’s book ‘R for Data Science’, r-statistics.co, and the tutorials at www.sthda.com, www.datanovia.com and www.r-graph-gallery.com.

Links to the other two parts of the workshop:
Introduction to R Markdown: https://verticalmeadows.github.io/basics.html
Introduction to ggmap: https://verticalmeadows.github.io/ggmap_basics.html

1 What is ggplot and why you should use it

ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.

A quick and dirty visualisation doesn’t have to look quick and dirty. ggplot2 is useful for simple data exploration, but you can also create beautiful graphs and visualisations for your articles or online technical reports.
Everything is under your control. Plots can be created iteratively and edited later.
Plots are reproducible: you can save the design of your plots (themes) and reuse your earlier plots.

Official package documentation: https://ggplot2.tidyverse.org/index.html

1.1 How it works

ggplot2 is part of the tidyverse, along with dplyr, tidyr, stringr etc., which you might have learnt last time.

“You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.” (https://ggplot2.tidyverse.org/)

We can plot data from dataframes only
Each function returns a graphic layer, so the order of the layers counts too.
The functions only get executed when drawing and not when saving them into a variable.
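
A minimal sketch of these three points, using the mpg data frame that ships with ggplot2. The object name p is only illustrative.

```r
library(ggplot2)

# Saving a plot into a variable builds the plot object but draws nothing yet.
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", formula = y ~ x) +  # first layer: drawn at the bottom
  geom_point()                                   # second layer: drawn on top of the trend line

# The plot is rendered only when the object is printed.
print(p)
```

Swapping the two geom lines would draw the trend line over the points instead.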

A succinct description of how ggplot2 works is hard because it embodies a deep philosophy of visualisation. However, in most cases you

start with ggplot(),
supply a dataset and
aesthetic mapping (with aes()). These are obligatory for any plot. You then add
geometric objects or layers, such as geom_point() or geom_histogram(),
scales to manipulate the aesthetic mapping (like scale_colour_brewer()),
faceting specifications (like facet_wrap()) and
coordinate systems (like coord_flip())
and maybe you manipulate the look of the plot by setting the theme, beyond axis labels and titles.
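
The steps above can be sketched in a single plot. This is only an illustration on the built-in mpg data; the choice of variables, palette and theme is an assumption made for the example.

```r
library(ggplot2)

p <- ggplot(data = mpg,                            # start with ggplot() and a dataset
            mapping = aes(x = displ, y = hwy)) +   # aesthetic mapping
  geom_point(aes(colour = drv)) +                  # a geometric object (layer)
  scale_colour_brewer(palette = "Set1") +          # a scale for the colour aesthetic
  facet_wrap(~ year) +                             # a faceting specification
  coord_flip() +                                   # a coordinate system
  theme_minimal() +                                # a theme
  labs(x = "Engine size (litres)", y = "Highway miles per gallon")

print(p)
```

Only the first three lines are obligatory; every later line can be removed and the plot will still draw with defaults.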

#install.packages("ggplot2")
library(tidyverse) # tidyverse contains several other needed packages beside ggplot, such as dplyr
#library(ggplot2)
#library(dplyr)

2 Basics

2.1 `qplot()` Quick plot with ggplot2

The qplot() function is very similar to the standard R plot() function. It offers a quick and easy way to create different types of graphs: scatter plots, box plots, violin plots, histograms and density plots. Note that qplot() has been deprecated since ggplot2 3.4.0, so prefer ggplot() in new code.

attach(mpg)

qplot(x=displ, y=cty) # aesthetics are passed to x and y, the axes of the plot

qplot(x=displ, geom="histogram") # default 'geom' is a point plot, if only x is specified, then a histogram

detach(mpg)

qplot(wt, mpg, data = mtcars) # miles per gallon vs. weight in 'cars'

mod <- lm(mpg ~ wt, data = mtcars) # we can conjure up a little linear model (mpg explained by weight) and plot the residuals against the fitted values
qplot(fitted(mod), resid(mod)) # fitted values on the x axis, residuals on the y axis
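
The same quick plots can also be written with ggplot(), the form used in the rest of this tutorial. A small sketch; the object names scatter and histo are illustrative.

```r
library(ggplot2)

# qplot(wt, mpg, data = mtcars) written with ggplot()
scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# qplot(x = displ, geom = "histogram") on the mpg data written with ggplot()
histo <- ggplot(mpg, aes(x = displ)) +
  geom_histogram(bins = 30)

print(scatter)
print(histo)
```

Passing the data frame explicitly also makes attach() and detach() unnecessary.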

2.2 Parts of an actual ggplot

First, let’s analyse a simple scatterplot presenting three variables.

# Based on mpg, a dataset about the fuel economy of cars included in the ggplot2 package, where
# hwy = fuel efficiency on the highway
# displ = a car’s engine size, in litres
# class = the kind of car
dplyr::glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "au…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quatt…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 199…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)",…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17,…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25,…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class        <chr> "compact", "compact", "compact", "compact", "compac…

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

The basic structure of a ggplot:

On its own, ggplot(data = mpg) draws only an empty panel: we have named the data, but there are no defaults for the geom function or the aesthetics. The geom function adds a layer to the plot; importantly, layers are added with the + sign.
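
To see this for yourself, compare the plot object before and after a layer is added. A minimal sketch; the names empty and filled are illustrative.

```r
library(ggplot2)

empty <- ggplot(data = mpg)   # data only: no layers yet
length(empty$layers)          # number of layers stored in the plot object
print(empty)                  # draws an empty panel

filled <- empty +
  geom_point(mapping = aes(x = displ, y = hwy))  # `+` adds the first layer
length(filled$layers)
print(filled)                 # now the points appear
```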

2.3 Analyse a complex plot

gg <- ggplot(diamonds, aes(x=carat, y=price)) 
gg + geom_point(aes(shape=color, color=cut),  alpha =0.5) + 
  geom_smooth(aes(color=cut), size=0.5) #+

  # facet_wrap(~ color, nrow = 2)  # carat, cut and color are variables in `diamonds` 

gg + geom_point(aes(shape=color, color=cut), size =0.5, alpha =0.5) + 
  geom_smooth(aes(color=cut), size=0.5)+
  facet_wrap(~ color, nrow = 2) + ggtitle("Diamonds by colour")

The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of seven parameters: a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme. This is called the layered grammar of graphics.

In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
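
As an illustration, here is a bar chart of the built-in diamonds data with all seven parameters spelled out, next to the short version that relies on the defaults. The values given for stat, position, coordinate system and facets are the defaults of geom_bar() and ggplot().

```r
library(ggplot2)

full <- ggplot(data = diamonds) +              # 1. dataset
  geom_bar(                                    # 2. geom
    mapping = aes(x = cut, fill = clarity),    # 3. mappings
    stat = "count",                            # 4. stat
    position = "stack"                         # 5. position adjustment
  ) +
  coord_cartesian() +                          # 6. coordinate system
  facet_null()                                 # 7. faceting scheme (no facets)

# Relying on the defaults gives the same plot with far less typing.
short <- ggplot(diamonds) +
  geom_bar(aes(x = cut, fill = clarity))

print(full)
print(short)
```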

More on this in: https://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics

3 Geometric Objects and Aesthetics

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg. (https://r4ds.had.co.nz/data-visualisation.html#creating-a-ggplot)

Here we will cover the most useful geometric objects, what to use them for and how to use them.

3.1 Geometric objects (`geom_<>()`)

They are essentially the layers on the plot including points, lines, text and labels, boxplots, histograms and many more.

You can get help and an overview of the existing geoms by typing help.search("geom_", package = "ggplot2") in the console. Alternatively, when typing your code in RStudio, press Tab after geom_ to get a list of available functions.
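
Another way to get the list as plain text, sketched here with base R's ls() on the attached package:

```r
library(ggplot2)

# all objects exported by ggplot2 whose names start with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
head(geoms)
length(geoms)
```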

3.2 Aesthetic mapping (`aes()`)

In ggplot2, aesthetic means “something you can see”. Each aesthetic is a mapping between a visual cue and a variable. Examples include:

position (i.e., on the x and y axes)
color (“outside” color)
fill (“inside” color)
shape (of points)
linetype
size

Each type of geom accepts only a subset of all aesthetics—refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
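
A short sketch of the difference between mapping an aesthetic to a variable and setting it to a fixed value, using the built-in mpg data. The chosen colour, shape and size are arbitrary.

```r
library(ggplot2)

# Mapping: the aesthetic depends on a variable, so it goes inside aes()
# and ggplot2 adds a legend for it.
mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class, shape = drv))

# Setting: a fixed value for all points goes outside aes(); no legend is drawn.
fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue", shape = 17, size = 2)

print(mapped)
print(fixed)
```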

Concisely: https://beanumber.github.io/sds192/lab-ggplot2.html
Officially (package documentation): https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

3.3 Barplot

Used for displaying
- counts in a categorical variable
- values of one continuous variable across instances

Let’s plot the number of people saying OMG in different ways in Switzerland.

ggplot(data = sdats) +
  geom_bar(mapping = aes(x =OMG_Sprache))

If we want to break it down by another variable, say, the gender of the respondents, we can map the fill colour to the gender variable.
By default, the bars are stacked (position = "stack").

ggplot(data = sdats) +
  geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht))

position = "fill" shows the proportion rather than the count. position is not an aesthetic but a parameter of the geom_bar() function, so it goes outside aes() and its value needs quotation marks around it. It is a game changer in many geometric objects.

ggplot(data = sdats) +
  geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht), position = "fill")

position = "dodge" shows the genders next to each other, in comparison.

ggplot(data = sdats) +
  geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht), position = "dodge")
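
If you do not have the sdats data at hand, the three positions can be reproduced with the diamonds data that ships with ggplot2. This is a sketch under that assumption; cut and clarity stand in for the survey variables.

```r
library(ggplot2)

base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

stacked <- base + geom_bar()                     # default: position = "stack"
filled  <- base + geom_bar(position = "fill")    # proportions, every bar has height 1
dodged  <- base + geom_bar(position = "dodge")   # groups side by side

print(stacked)
print(filled)
print(dodged)
```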

Plotting similar data as a pie chart in ggplot needs more tweaking. To try it, visit https://www.r-graph-gallery.com/128-ring-or-donut-plot

3.4 Frequency of a continuous variable

3.4.1 Area and density plots

Area and density plots present the distribution of a continuous variable, similarly to histograms: an area plot bins the values and counts them, while a density plot draws a smoothed estimate.

ggplot(data = sdats) +
  geom_area(mapping = aes(x =Gewicht),stat = "bin", bins=15)

We can use the fill aesthetic here too, to show weight by gender; the areas will be stacked.

ggplot(data = sdats) +
  geom_area(mapping = aes(x=Gewicht, fill=Geschlecht),stat = "bin", bins=15)

Or we can display the weights by gender but as density.

ggplot(data = sdats) +
  geom_density(mapping = aes(x =Gewicht, fill=Geschlecht), alpha=0.4)

And we can adjust the smoothing of this density plot.

ggplot(data = sdats) +
  geom_density(mapping = aes(x =Gewicht, fill=Geschlecht), alpha=0.4, adjust=1/2)

Behold alpha, using which we can adjust the transparency of just about anything in our plots.

3.4.2 Histogram

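A simple histogram works without any settings: by default, geom_histogram() splits the data into 30 bins and prints a message suggesting that you pick a better value with binwidth.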

ggplot(data = sdats) +
  geom_histogram(mapping = aes(x =Gewicht))

Let’s adjust this binwidth.

ggplot(data = sdats) +
  geom_histogram(mapping = aes(x =Gewicht), binwidth = 2)

Let’s rescale the y axis to show the density rather than the count. This way we can also add a smoothed density to the same plot; otherwise the two layers would have y values on different scales, which would prevent this.

ggplot(data = sdats) +
  geom_histogram(aes(x =Gewicht, y=..density..), binwidth = 2, colour="black", fill="white")+
  geom_density(aes(x =Gewicht), alpha=.2, fill="#FF6666")

Here we can observe some new things. What is fill doing outside the aes()? Outside aes(), fill and colour are set to fixed values instead of being mapped to a variable.
colour means the “outside” colour, while y=..density.. means that we want to deviate from the default ..count.. statistic used in histograms.
"white" is a named colour and "#FF6666" is a hexadecimal colour code; more on this later on.

Also, you don’t need to write mapping all the time, ggplot will also understand if you only put aes().

We also see that x=Gewicht comes up in the aes() of both layers. It is possible to set this as a global aesthetic by putting it up into the ggplot() function, and thus, all subsequent layers will use it as an aesthetic.

ggplot(data = sdats, mapping = aes(x=Gewicht)) +
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666") # we don't even have to write 0, when we mean 0.2
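
Newer versions of ggplot2 (3.4.0 and later) deprecate the ..density.. notation in favour of after_stat(density). A minimal sketch of the same kind of plot in the newer notation, assuming the built-in mpg data in place of sdats:

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = hwy)) +
  # after_stat(density) refers to the density computed by the histogram's stat
  geom_histogram(aes(y = after_stat(density)), binwidth = 2,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666")

print(p)
```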

We can show the genders the same way in a histogram as well.

ggplot(data = sdats, mapping = aes(x=Gewicht)) +
  geom_histogram(aes( fill=Geschlecht), binwidth = 5, alpha=0.5, position="identity")

What do you think, what would happen if the position="identity" wasn’t there?
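
One way to check your answer is to draw both versions and compare them. A sketch on the built-in mpg data, with drv standing in for gender:

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = hwy, fill = drv))

# position = "identity": every group starts at zero, so the bars overlap
overlaid <- base +
  geom_histogram(binwidth = 2, alpha = 0.5, position = "identity")

# no position given: the default for histograms is "stack"
stacked <- base +
  geom_histogram(binwidth = 2, alpha = 0.5)

print(overlaid)
print(stacked)
```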

3.5 One continuous and one discrete variable

A discrete variable is also called nominal or categorical.

3.5.1 Boxplot and friends

A basic boxplot, coloured again by a discrete variable, gender. Notice that the numbers along the x axis don’t really have a meaning besides denoting the Cartesian coordinates of the plot.

ggplot(data = sdats) +
  geom_boxplot(aes(y=Gewicht, fill=Geschlecht))

Limit: We’ve got an outlier there, which makes it harder to interpret the plot. Let’s remove it by limiting the bounds of the plot with ylim(). Let’s also turn the plot around with coord_flip(), one of the coordinate system functions mentioned before.

ggplot(data = sdats) +
  geom_boxplot(aes(y=Gewicht, fill=Geschlecht)) +
  ylim(40,125) +
  coord_flip()
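
A word of caution, sketched on the built-in mpg data: ylim() removes the observations outside the limits before the box statistics are computed, whereas coord_cartesian() only zooms in and keeps all observations. The limits 10 and 35 are arbitrary values chosen for this example.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# ylim() drops the rows outside 10..35 (with a warning) before computing the boxes
dropped <- base + ylim(10, 35)

# coord_cartesian() keeps every row and only changes the visible range
zoomed <- base + coord_cartesian(ylim = c(10, 35))

print(dropped)
print(zoomed)
```

Which of the two you want depends on whether the removed points should still count towards the medians and quartiles.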

Let’s also change the shape of the outlier points and make the boxplot look different.
R’s basic point shapes are numbered from 0 to 25; shape 8, used below, is an asterisk.

ggplot(data = sdats) +
  geom_boxplot(aes(y=Gewicht, fill=Geschlecht), notch=TRUE,
               outlier.colour="red", 
               outlier.shape=8,
               outlier.size=2) +
  ylim(40,125)

It is possible to plot the data points as non-overlapping dots along with the boxplot using geom_dotplot.

ggplot(data = sdats ) +
   geom_boxplot(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht)) +
   geom_dotplot(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht), binaxis = "y", stackdir = "center", dotsize = 0.3) +
   ylim(40,125)

Violin plots can look even better, essentially plotting the density of the data points.

ggplot(data = sdats ) +
   geom_violin(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht),scale="area") +
   ylim(40,125)

Why don’t we plot both boxplots and violin plots?

ggplot(data = sdats,aes(y=Gewicht, x=Geschlecht, fill=Geschlecht) ) +
   geom_violin(scale="area", trim = TRUE) +
   geom_boxplot(width=0.1)+
   ylim(40,125)

mean_sdl computes the mean plus or minus a constant times the standard deviation. The constant is set with mult; we use 1 this time, as the default is 2. The result can be shown as a pointrange.

ggplot(data = sdats,aes(y=Gewicht, x=Geschlecht, fill=Geschlecht) ) +
   geom_violin(scale="area", trim = TRUE) +
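   stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), # arguments for mean_sdl are passed via fun.args; mean_sdl needs the Hmisc package
                 geom="pointrange", color="red")+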
   ylim(40,125)

stat_summary() is a more advanced way to construct plots and to include summary statistics within them.
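
If you prefer not to depend on the Hmisc package, a summary function of your own can be passed to fun.data. The helper mean_sd below is hypothetical (it is not part of ggplot2); it only has to return a data frame with the columns y, ymin and ymax. The sketch uses the built-in mpg data.

```r
library(ggplot2)

# Hypothetical helper: mean plus or minus `mult` standard deviations
mean_sd <- function(x, mult = 1) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  data.frame(y = m, ymin = m - mult * s, ymax = m + mult * s)
}

p <- ggplot(data = mpg, mapping = aes(x = drv, y = hwy, fill = drv)) +
  geom_violin(scale = "area", trim = TRUE) +
  stat_summary(fun.data = mean_sd,
               fun.args = list(mult = 1),   # extra arguments for mean_sd
               geom = "pointrange", colour = "red")

print(p)
```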

3.6 Two continuous variables

3.6.1 Scatterplots

Let’s plot the height of our participants against their weight.

ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
  geom_point() +
  xlim(40,125)

The values are integers, so there is a possibility of overlaps, people with the exact same weight and height as others.

ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
  geom_point(alpha=0.25) +
  xlim(40,125)

geom_jitter() moves the points apart from each other by adding a small amount of random noise to their position. We can control the amount with the width and height arguments; the default is 40% of the resolution of the data. Besides, geom_rug() helps show the position of the points on the axes as well. You can use alpha on these too.

ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
  geom_jitter(alpha=0.5) +
  xlim(40,125)  +
   geom_rug(sides = "bl", col="steelblue",alpha=0.1, size=1.5)
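
A sketch of controlling the jitter yourself, on the built-in mpg data (cty and hwy are integers too, so they overlap in the same way). The value 0.3 is an arbitrary choice for this example.

```r
library(ggplot2)

set.seed(77)  # jitter is random; a seed makes the plot reproducible

jittered <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  # points are moved by up to 0.3 units in either direction on both axes
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.5) +
  geom_rug(sides = "bl", colour = "steelblue", alpha = 0.1)

print(jittered)
```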

geom_smooth() adds a smoothed conditional mean, essentially giving us a trendline, including confidence intervals. See individual modelling functions for more details: lm for linear smooths, glm for generalised linear smooths, loess for local smooths.

ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
  geom_jitter(alpha=0.5) +
  geom_smooth() +
  xlim(40,125)

With the default method, geom_smooth() uses loess for fewer than 1,000 observations (and mgcv::gam() otherwise). The formula argument is NULL by default, which corresponds to y ~ x for loess and lm; it can be adjusted so that we plot polynomials.

ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
  geom_jitter(alpha=0.5) +
  geom_smooth(method = "lm",formula = y ~ poly(x, 1), size = 1, col="red") +
  geom_smooth(method = "lm",formula = y ~ poly(x, 2), size = 1, col="green") +
  geom_smooth(method = "lm",formula = y ~ poly(x, 3), size = 1, col="blue") +
  xlim(40,125)

geom_smooth() more detailed: https://ropensci.github.io/plotly/ggplot2/geom_smooth.html
More fitted lines: https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/

Let’s see if people from the Alps are smaller than those living in the Swiss Plateau. To this end, plot a sampled subset of the data, labelling each point with the canton the participant is from.

set.seed(77) # if using the same seed, the same random subsets are reproduced
ggplot(data = sdats[sample(1:375, 100),], mapping = aes(x=Gewicht, y=Groesse)) +
  geom_label(aes(label=Ref_KT)) +
  xlim(40,95) + ylim(155,185)

Even though we limited the bounds even more, not all labels are visible. How about plotting them as points AND labelling them with text?

The ggrepel package helps with this task.
There are lots of other similar extensions to ggplot: https://exts.ggplot2.tidyverse.org/gallery/ .

library(ggrepel)
set.seed(77)
ggplot(data = sdats[sample(1:375, 100),], mapping = aes(x=Gewicht, y=Groesse)) +
  geom_jitter(alpha=0.5, colour="red") +
  geom_text_repel(aes(label=Ref_KT))+
  xlim(40,95) + ylim(155,185)

When even jittering or alpha doesn’t help, we can turn to density plotting in 2D as well.

The following graph plots a 383 X 383 correspondence table, plotting how linear distance corresponds to linguistic distance among 383 Swiss localities, with alpha=0.08.

syntdist <- read.csv("syntdist_vs_eucldist.csv", header=T, stringsAsFactors = F)
ggplot(syntdist,aes(DISTANCE, Syntdist60)) +
  geom_point(alpha = 0.08, size = 1.5)

The command ggMarginal() from the package ggExtra helps us plot histograms, density graphs or even boxplots outside the axes to see the distribution of the points. For ggMarginal() to work, we have to save the plot as an object first.

library(ggExtra)

p <- ggplot(syntdist,aes(DISTANCE, Syntdist60)) +
  geom_point(alpha = 0.08, size = 1.5)
ggMarginal(p, type="histogram")

If we save a plot as an object we can keep expanding it without having to write the basics again. But for clarity I’m trying to avoid that this time.

At least we can see that the densest area is not even in the middle of that blob. Try type="density" and type="boxplot" as well.

But we can plot the density of the points itself on the graph, in different ways.

p + geom_density2d() # density mapped with contour lines

p + geom_bin2d() # with default binwidth

p + geom_bin2d(binwidth=c(2500,1)) # with finer granularity (binwidths on the x and y axis, in their respective scales)

p + geom_hex() # with hexagons instead of squares

4 Stats

A quick look at stat_<>(). Details in Hadley Wickham’s handbook ‘R for Data Science’: https://r4ds.had.co.nz/data-visualisation.html#statistical-transformations

Stats are an alternative way to build a layer.
A stat builds new variables to plot (e.g., count, prop).

Visualize a stat by changing the default stat of a geom function, geom_bar(stat="count") or by using a stat function, stat_count(geom="bar"), which calls a default geom to make a layer (equivalent to a geom function).
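
A small sketch showing that the two routes produce the same layer, using the built-in diamonds data. layer_data() returns the data that a layer actually draws, including the count variable built by the stat.

```r
library(ggplot2)

via_geom <- ggplot(diamonds, aes(x = cut)) +
  geom_bar(stat = "count")     # geom function with its (default) stat

via_stat <- ggplot(diamonds, aes(x = cut)) +
  stat_count(geom = "bar")     # stat function with its (default) geom

# both layers contain the same computed counts
identical(layer_data(via_geom)$count, layer_data(via_stat)$count)

print(via_geom)
```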

A classic way to plot summary statistics is the stat_summary() function. We begin with the ggplot() function, which requires the name of the dataset (we’ll use mydata, which we create first), followed by the aes() function that encompasses the x and y variable specifications. Next, we add the stat_summary() function. In its first argument, we specify that we want to calculate the mean of the y variable (the fun argument, called fun.y before ggplot2 3.3.0, names the function to apply to the y variable). Then, we specify what graphing/geom element to plot. Here, we specify that we want points (other options could be bar, line, etc.).

## Creating an object named Subject 
Subject <- c("Wendy", "Wendy", "Wendy", # you can press ENTER to auto-indent 
             "John", "John", "John",    # the code. This produces better 
             "Helen", "Helen", "Helen") # formatting for the user.

## Creating an object named Date
Date <- c("2019-08-08", "2019-09-05", "2019-12-07", "2019-08-08", 
          "2019-09-05", "2019-12-07", "2019-08-08", "2019-09-05", "2019-12-07")

## Creating an object named Score
Score <- c(2, 15, 34,
           5, 10, 27,
           16, 8, 40)

## Creating a tibble named mydata from the three vectors
mydata <- tibble(Subject, Date, Score) 

ggplot(mydata, aes(x = Subject, y = Score)) +
  stat_summary(fun = "mean", geom = "point") # fun replaces fun.y, which is deprecated since ggplot2 3.3.0

An example with geom="line", modified from https://stackoverflow.com/questions/21239418/use-stat-summary-in-ggplot2-to-calculate-the-mean-and-sd-then-connect-mean-poin

data(ToothGrowth)
ToothGrowth$F3 <- letters[1:2]
     # coerce dose to a factor
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5,1,2))

ggplot(ToothGrowth, aes(y = len, x = supp, colour = dose, group = dose)) + 
  stat_summary(fun = mean,
               fun.min = function(x) mean(x) - sd(x), 
               fun.max = function(x) mean(x) + sd(x), 
               geom = "pointrange") +
  stat_summary(fun = mean,
               geom = "line") +
  facet_wrap( ~ F3)

4.1 Scales and colours

Let’s save a boxplot and a scatterplot first to show the changes on them. Let’s also set a theme for the plots, something that one can tweak further for better looks - we will do so further on.

theme_set(
  theme_minimal() +
    theme(legend.position = "right")
  )
# Box plot
bp <- ggplot(sdats, aes(Geschlecht, Gewicht))

# Scatter plot
sp <- ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,125) # let's cut off the outlier already

Plot bp and sp with an arbitrary colour, not given in the aesthetics.

# Box plot
bp + geom_boxplot(fill = "#19FFFF", color = "red")

# Scatter plot
sp + geom_point(colour="steelblue")

If we colour it by a certain variable, the legend appears. Note the difference between fill in aes() and color outside of it.

# Box plot
bp + geom_boxplot(aes(fill = Geschlecht), color = "red")

Let’s colour the scatterplot based on the canton of origin.

# Scatter plot
sp + geom_point(aes(colour=Ref_KT))

Notice that many of these colours are hard to distinguish. The colours for the aesthetic mapping were chosen by default. We should use colours that are easier to tell apart, and choose them automatically if possible.

Scales map data values to the visual values of an aesthetic. To change a mapping, we have to add a new scale.

Scales take on the scale_*_#() pattern, where * stands for an aesthetic, and # for a possible way to change the aesthetic.
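
A sketch of the naming pattern with three scales on one plot, using the built-in mpg data. The axis name, breaks and palette are arbitrary choices for this example.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_x_continuous(name = "Engine size (litres)", breaks = 2:7) +  # aesthetic x, continuous scale
  scale_y_log10() +                                                  # aesthetic y, log10 transformation
  scale_colour_brewer(palette = "Dark2")                             # aesthetic colour, ColorBrewer palette

print(p)
```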

But first, let’s set the values of the colours manually. This would take a long time if we had many different categories.
Check the built-in colours of R.
colors() shows the named colours in base R.
This pdf also shows the corresponding colours.
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf

# Box plot
bp + 
  geom_boxplot(aes(fill = Geschlecht)) +
  scale_fill_manual(values = c("tan","thistle","tomato2","wheat4"))

# Scatter plot
sp + 
  geom_point(aes(colour=Geschlecht)) +
  scale_colour_manual(values = c("tan","thistle","tomato2","wheat4"))

Palette generators in R, such as rainbow(), are functions.

In base R plots, colour palettes are recycled.

temp <- c(5,7,6,4,8,2,4,6,9)
par(mfrow=c(2,3))
barplot(temp, col=c("tan","thistle","tomato2",
"wheat4","violetred"), main="With 5 colors")
barplot(temp, col=c("sienna3","royalblue1","snow2"), main="With 3 colors")
barplot(temp, col=rainbow(9), main="rainbow")
barplot(temp, col=heat.colors(9), main="heat.colors")
barplot(temp, col=terrain.colors(9), main="terrain.colors")
barplot(temp, col=topo.colors(9), main="topo.colors")

# check out the hexnames of colours
rainbow(9)

## [1] "#FF0000FF" "#FFAA00FF" "#AAFF00FF" "#00FF00FF" "#00FFAAFF" "#00AAFFFF"
## [7] "#0000FFFF" "#AA00FFFF" "#FF00AAFF"

terrain.colors(9)

## [1] "#00A600FF" "#3EBB00FF" "#8BD000FF" "#E6E600FF" "#E8C32EFF" "#EBB25EFF"
## [7] "#EDB48EFF" "#F0C9C0FF" "#F2F2F2FF"

In the first two plots, you can see that we added 5 and 3 arbitrarily chosen colours and they got recycled to show the 9 bars.
In the four later plots, the parameter 9 in the palette functions tells each function to choose 9 colours along its colour range.
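
A short base R sketch of what "behaving like a function" means here. colorRampPalette() builds a new palette function from colours of your choice; the name blue_to_red is illustrative.

```r
# built-in palette generators are functions of the number of colours wanted
is.function(rainbow)
rainbow(3)

# colorRampPalette() returns a new palette function that interpolates
# between the given colours
blue_to_red <- colorRampPalette(c("blue", "white", "red"))
is.function(blue_to_red)
blue_to_red(5)

barplot(1:5, col = blue_to_red(5), main = "Custom ramp")
```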

We can set gradient colours for the aesthetics as well.

# note: par(mfrow = ...) only affects base R graphics, so each ggplot below is drawn on its own
sp + geom_point(aes(colour=Alter))+
   scale_color_gradient(low = "#fed9a6", high = "#797878")

sp + geom_point(aes(colour=Alter))+
 scale_color_gradientn(colours = rainbow(9))

sp + geom_point(aes(colour=Alter))+
 scale_color_gradientn(colours = terrain.colors(9))

median_age <- median(sdats$Alter) # we set the midpoint as the median of the age (26); a separate name avoids masking the median() function
sp + geom_point(aes(colour=Alter))+
  scale_color_gradient2(midpoint = median_age, low = "blue", mid = "white",
                            high = "red")

And take a look at two of the best colour packages in R, RColorBrewer and viridis. https://colorbrewer2.org
https://vega.github.io/vega/docs/schemes/

library(viridis) # for pleasing colour scales
library(RColorBrewer) # for creating sequential, diverging and qualitative palettes

# RColorBrewer colourscales
display.brewer.all()

display.brewer.all(colorblindFriendly = TRUE)

# viridis colourscales
par(mfrow=c(2,2))
barplot(1:10, col = viridis(10), main="Color space: viridis")
barplot(1:10, col = inferno(10), main="Color space: inferno")
barplot(1:10, col = plasma(10), main="Color space: plasma")
barplot(1:10, col = cividis(10), main="Color space: cividis")

We can, however, set up our own palette using the hex codes of colours or the named colours. Let’s include some qualitative colours from RColorBrewer too. Note that this palette is a character vector and does not work like a function.

## let's make enough colours to distinguish the cantons
myPalette <- c(brewer.pal(8, "Set1"),"black","#67dd90","#8dd3c7", "#bebada","#fb8072","#b3de69","#fed9a6","#a65628", "#409ae3","#c7a632",
               "#797878","#b3ff68","#b32968")
print("is.function(myPalette)")

## [1] "is.function(myPalette)"

is.function(myPalette)

## [1] FALSE

print("is.vector(myPalette)")

## [1] "is.vector(myPalette)"

is.vector(myPalette)

## [1] TRUE

print("typeof(myPalette)")

## [1] "typeof(myPalette)"

typeof(myPalette)

## [1] "character"

If we use it for plotting, ggplot will automatically take a subset of it if not all colours are needed. Colours, however, are NOT recycled in ggplot: if we don’t provide enough of them, we get an error.

sp + geom_point(aes(colour=Geschlecht))+
 scale_colour_manual(values = myPalette)

sp + geom_jitter(aes(colour=Ref_KT), size=2)+
 scale_colour_manual(values = myPalette)

Notice the jitter!
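
To see what happens with too few colours, here is a sketch on the built-in mpg data, where drv has three levels. The tryCatch() wrapper is only there so that the script keeps running; the exact wording of the error message depends on the ggplot2 version.

```r
library(ggplot2)

too_few <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = c("tan", "thistle"))  # two colours for three levels

# the problem only surfaces when the plot is drawn
result <- tryCatch(
  {
    print(too_few)
    "plot drawn"
  },
  error = function(e) conditionMessage(e)
)
result

# naming the values ties each colour to a specific level
named <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = c("4" = "tan", "f" = "thistle", "r" = "tomato2"))
print(named)
```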

Let’s try some colours from viridis as well. Colour our respondents by age while looking at their height and weight. And recolor our big fuzzy plot using viridis colours.

sp + geom_point(aes(colour=Alter))+
 scale_color_viridis(option = "inferno")

p + geom_hex() +
 scale_fill_viridis(option = "viridis", alpha = 0.5) # note that beforehand we didn't map any aesthetics in geom_hex, but it is logical that it is the 'fill' that we had to change

scale_color_viridis() acts like scale_color_gradient*() unless we tell it not to behave as a gradient.
Let’s plot the height and weight and see the trends by gender. We will use the discrete parameter not to supply a continuous scale to a discrete variable.

sp + geom_point(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+ 
  scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
  scale_fill_viridis(discrete = TRUE) +
  xlim(40,110)

We can also change the limits of what we colour. Let’s colour, for example, only those between 18 and 26.
All data points outside these limits (here, everyone above 26 years old) will get grayed out (dark gray).

sp + geom_point(aes(colour=Alter))+
  scale_color_gradientn(colours = terrain.colors(9), limits=c(18,26))
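
The grey used for values outside the limits is the scale's na.value, which can be changed. A sketch on the built-in mpg data, where cty stands in for age and the limits 15 and 25 are arbitrary:

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(aes(colour = cty)) +
  scale_colour_gradientn(colours = terrain.colors(9),
                         limits = c(15, 25),   # only this range gets gradient colours
                         na.value = "grey50")  # colour for everything outside the limits

print(p)
```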

And let’s put some of what we’ve seen in a brand new plot.

We might want to find out about the geographic distribution of political stances - ‘Are rural areas more conservative?’, for example.

Let’s plot political stance by cantons and age.
Even though we use geom_point(), we can add jitter as a position.
direction=-1 reverses the colour scale.

ggplot(sdats,aes(x=Alter, y=Ref_KT))+
  geom_point(aes(color=Politik_LinksRechts),
             position = position_jitter(width = 0, height = 0.25)) + # we only allow jitter to add a vertical value for this plot
  scale_color_viridis(discrete = TRUE,option = "B", direction = -1) # notice that we made the colour scale discrete here too

We actually recorded political stance on two scales. Let’s use both and map our respondents in the two dimensions, coloured by cantons.
Let’s map the colours to our custom palette.

ggplot(sdats,aes(x=Politik_LinksRechts, y=Politik_LiberalKons))+
  geom_point(aes(color=Ref_KT),
             position = position_jitter(width = 0.35, height = 0.35)) + 
  scale_color_manual(values=myPalette)

The picture is not very clear despite the jittering.

5 Facets

A faceted plot might help. Split your plot into facets, subplots that each display one subset of the data.

The variable that you pass to facet_wrap() using a ~ should be discrete.

As we supply the canton of origin as the faceting factor, we can use the aesthetic of colour for something else. Age would be interesting to see! Colour it as a gradient.

ggplot(sdats,aes(x=Politik_LinksRechts, y=Politik_LiberalKons))+
  geom_point(aes(color=Alter),
             position = position_jitter(width = 0.35, height = 0.35)) + 
  scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 40) + # midpoint defaults to 0
  facet_wrap(~Ref_KT, ncol = 5)

To facet your plot on the combination of two variables, add facet_grid() to your plot. Let’s take two discrete variables and plot the political map based on whether someone uses “OMG!”, and based on gender. If we colour it by canton, then you can see that it’s a whopping 5 variables that you just plotted!

ggplot(sdats[which(sdats$Geschlecht %in% c("Männlich","Weiblich")),], # we subset the data based on gender first
       aes(x=Politik_LinksRechts, y=Politik_LiberalKons)) +
  geom_point(aes(color=Ref_KT),
             position = position_jitter(width = 0.35, height = 0.35)) + 
  scale_color_manual(values=myPalette) +
  facet_grid(Geschlecht~OMG_Sprache)
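
The two faceting functions side by side, sketched on the built-in mpg data in case sdats is not at hand. The faceting variables are arbitrary choices for this example.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# facet_wrap(): one discrete variable, panels wrapped into rows and columns
wrapped <- base + facet_wrap(~ class, ncol = 4)

# facet_grid(): rows by one variable, columns by another
gridded <- base + facet_grid(drv ~ cyl)

print(wrapped)
print(gridded)
```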

6 Let’s combine it all, step by step

ggplot(sdats, aes(Gewicht, Groesse)) +
  geom_point(aes(colour=Geschlecht))

ggplot(sdats, aes(Gewicht, Groesse)) +
  xlim(40,110)+
  geom_point(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")

ggplot(sdats, aes(Gewicht, Groesse)) + 
  xlim(40,110)+
  geom_point(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+ 
  scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
  scale_fill_viridis(discrete = TRUE)

Let’s set the theme to black and white (theme_bw()) and change the axis labels (labs() is another function for that).

ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
  geom_point(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+ 
  scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
  scale_fill_viridis(discrete = TRUE) +
  theme_bw() +
  xlab("Weight") + ylab("Height")

Edit some of the parameters in the bw theme and plot the above plot slightly modified

theme_set(theme_bw())

theme_update(legend.position=c(0.1,0.7), #Legend placed inside the plot at relative coordinates (the default position is "right")
             legend.background=element_blank(), #No legend background
             legend.text=element_text(size=10), #Legend text size
             legend.title=element_text(size=10), #Legend title text size
             panel.grid.major = element_blank(), #remove grids
             panel.grid.minor = element_blank(), #remove grids
             panel.border = element_blank(), #remove the panel border (inner plot area border)
             axis.line = element_line(colour = "darkgray"), #Axis lines
             plot.title=element_text(hjust = 0.5,face="bold", family="mono"), #Settings for  title. Set font, boldness and center.
             plot.subtitle=element_text(hjust = 0.5,face="italic",family="mono"), #Subtitle. italic and centered.
             axis.title.x = element_text(face="italic"), #Axis labels italic
             axis.title.y = element_text(face="bold",size=8,angle = 0), #Axis y label bold, different size and angle
             plot.margin=margin(1,1,1,1, unit="cm") # set margins of the plot, defaults to unit="pt"
)

ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
  geom_jitter(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+ 
  scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
  scale_fill_viridis(discrete = TRUE) +
  xlab("Weight (kg)") + ylab("Height (cm)") + labs(title="Proportion across weight and height in Swiss German respondents",
                                         subtitle="by gender",
                                         fill= "Gender", colour="Gender") # modifies the legend title
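
Keep in mind that theme_set() and theme_update() change the theme for the whole R session. theme_set() returns the previously active theme, so it can be saved and restored later. A minimal sketch on the built-in mpg data; the name old is illustrative.

```r
library(ggplot2)

old <- theme_set(theme_bw())                       # switch theme, keep the previous one
theme_update(panel.grid.minor = element_blank())   # tweak the active theme

p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()
print(p)         # drawn with the modified bw theme

theme_set(old)   # restore the theme that was active before
print(p)         # the same plot object, now drawn with the restored theme
```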

Annotate the plot and remove the legend.
Add some horizontal and vertical lines, with the means of heights and weights by gender.
Overwrite the labs’ and title’s fontface for this plot only.

To change all the fonts in your plot, use plot + theme(text=element_text(family="mono")), where mono is your chosen font.

Note: when you’re writing a long plot or a function, selecting the code lines and pressing Ctrl/Cmd+I organises the code neatly.

# Mean height and weight by gender, rounded to 2 decimals.
# Indexing the column vector works for both data frames and tibbles.
women.meanheight <- round(mean(sdats$Groesse[which(sdats$Geschlecht=="Weiblich")], na.rm = TRUE), 2)
men.meanheight <- round(mean(sdats$Groesse[which(sdats$Geschlecht=="Männlich")], na.rm = TRUE), 2)
women.meanweight <- round(mean(sdats$Gewicht[which(sdats$Geschlecht=="Weiblich")], na.rm = TRUE), 2)
men.meanweight <- round(mean(sdats$Gewicht[which(sdats$Geschlecht=="Männlich")], na.rm = TRUE), 2)

a <- ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
  geom_jitter(aes(colour=Geschlecht))+
  geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+ 
  scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
  scale_fill_viridis(discrete = TRUE) +
  xlab("Weight (kg)") + ylab("Height (cm)") + labs(title="Proportion across weight and height in Swiss German respondents",
                                                   subtitle="by gender")+
  annotate("text",x=90,y=198,label="Men are, on average,\n taller and heavier",size=4,fontface="bold", color="darkgreen") +
  annotate("text",x=70,y=155,label="Women are, on average,\n shorter and lighter",size=4,fontface="bold", color="wheat4") +
  # linewidth replaces size for lines in ggplot2 >= 3.4.0; use size=1 in older versions
  geom_hline(yintercept = men.meanheight, linetype="dashed", color="darkgreen", linewidth=1, alpha=0.5)+
  geom_hline(yintercept = women.meanheight, linetype="dashed", color="wheat4", linewidth=1, alpha=0.5)+
  geom_vline(xintercept = men.meanweight, linetype="dashed", color="darkgreen", linewidth=1, alpha=0.5)+
  geom_vline(xintercept = women.meanweight, linetype="dashed", color="wheat4", linewidth=1, alpha=0.5)+
  annotate("text",x=men.meanweight + 6,y=202,label=paste0("Men's average weight:\n",men.meanweight),size=2)+
  annotate("text",x=women.meanweight - 6,y=202,label=paste0("Women's average weight:\n",women.meanweight),size=2)+
  annotate("text",y=men.meanheight + 6,x=40,label=paste0("Men's average height:\n",men.meanheight),size=2, angle = 90)+ # angle rotates the annotation
  annotate("text",y=women.meanheight - 6,x=40,label=paste0("Women's average height:\n",women.meanheight),size=2, angle = 90) +
  theme(
    plot.title = element_text(hjust = 0.5,color="red", size=14, family="sans"),
    axis.title.x = element_text(color="blue", size=10, face="italic"),
    axis.title.y = element_text(color="#993333", size=10, face="italic"),
    legend.position = "none") # remove legend completely
  
a

You can use specific themes by installing the ggthemes package.
A quick example using the Wall Street Journal theme.

library(ggthemes)

m <- ggplot(sdats, aes(Gewicht, Groesse, colour = Geschlecht))+
  geom_jitter()

m <- m + theme_wsj()+ scale_colour_wsj("colors6")+
  ggtitle("Biometrics by gender")

m

7 Display multiple plots

par(mfrow=c(2,2)), the solution that works with base R plots, doesn’t work with ggplots.

An affiliated package, gridExtra, comes to the rescue!

library(gridExtra)
library(grid)

sp2 <- sp + geom_point()
bp2 <- bp + geom_boxplot()

grid.arrange(sp2,bp2,
             ncol=2,
             top=textGrob("Arranging plots on a grid",gp=gpar(fontface="bold"))) #Adding a main title

grid.arrange(sp2,bp2,m,p,
             ncol=2, nrow=2,
             top=textGrob("Arranging plots on a grid",gp=gpar(fontface="bold")))

Or build it using a custom made function: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
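grid.arrange() draws the arrangement immediately. Here is a sketch of keeping the arrangement as an object instead, using gridExtra's arrangeGrob(), which builds the same layout without drawing it. The plots use the built-in mtcars data, and the names p_left, p_right and combined are illustrative only.

```r
library(ggplot2)
library(gridExtra)
library(grid)

# Two illustrative plots on built-in data (not the workshop's sp2 and bp2)
p_left  <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
p_right <- ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()

# arrangeGrob() builds the layout and returns it without drawing
combined <- arrangeGrob(p_left, p_right,
                        ncol = 2,
                        top = textGrob("Two plots side by side",
                                       gp = gpar(fontface = "bold")))

# Draw the stored arrangement on a fresh page
grid.newpage()
grid.draw(combined)
```

The stored object can also be passed to ggsave() as the plot to save.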

8 Saving plots

To avoid exporting your plots by hand in RStudio, use ggsave(). You can set which plot to save (the default is the last one displayed), the file format (taken from the file extension or set with device), the size of the plot, and so on. It works inside loops as well.

The plot size is given in units (“in”, “cm”, or “mm”). If no size is supplied, ggsave() uses the size of the current graphics device.

ggsave("beautiful_plot1.png", m,
       dpi = 300)
ggsave("beautiful_plot2.png", m,device="png",
       dpi = 300, width = 16, height = 12, units="cm")

# Another, non-ggplot method is opening a png or pdf device
png(filename="mykludgyplot.png",
    width = 480, height = 480, units = "px", pointsize = 12, bg = "white")
print(a)
dev.off()
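Using ggsave() in a loop, as mentioned above, could look like the following sketch. The list plots, its element names, and the temporary output folder are assumptions made for illustration; the data are the built-in mtcars, not the workshop data.

```r
library(ggplot2)

# Hypothetical named list of plots; the names become the file names
plots <- list(
  scatter = ggplot(mtcars, aes(wt, mpg)) + geom_point(),
  boxplot = ggplot(mtcars, aes(factor(cyl), mpg)) + geom_boxplot()
)

# Assumption: a temporary folder; replace with your own output folder
out_dir <- tempdir()

# Save every plot with the same size and resolution
for (plot_name in names(plots)) {
  ggsave(filename = file.path(out_dir, paste0(plot_name, ".png")),
         plot = plots[[plot_name]],
         width = 16, height = 12, units = "cm", dpi = 300)
}

# Paths of the files written by the loop
saved_files <- file.path(out_dir, paste0(names(plots), ".png"))
```

Passing the plot explicitly matters here, because inside a loop the "last plot displayed" default is usually not the plot you want.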

9 Correlation plot

A nice extra skill: have a look at the corrplot package.
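A minimal sketch of a correlation plot with corrplot, assuming the package is installed. The choice of columns from the built-in mtcars data and the name cor_matrix are illustrative only.

```r
library(corrplot)

# Correlation matrix of a few numeric columns of the built-in mtcars data
cor_matrix <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Draw the upper triangle only, with circles sized by correlation strength
corrplot(cor_matrix, method = "circle", type = "upper")
```

Note that corrplot() draws with base graphics rather than ggplot2, so themes and ggsave() do not apply to it.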

10 Further reading

Following the structure of this document

10.1 General

Ggplot cheat sheet - the essence: http://r-statistics.co/ggplot2-cheatsheet.html
Ggplot for beginners: https://www.tutorialspoint.com/ggplot2/
What can you do with ggplot in a nutshell: https://beanumber.github.io/sds192/lab-ggplot2.html
Ggplot examples: https://rstudio-pubs-static.s3.amazonaws.com/244477_0f53a9c257bd434fbf7af03518df3587.html
The R graph gallery: https://www.r-graph-gallery.com/index.html
Top 50 ggplot visualisations: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html

The basics of aesthetics (aes()): https://www.datanovia.com/en/blog/ggplot-aes-how-to-assign-aesthetics-in-ggplot2/
ggplot and stat_summary() in the tidyverse environment: https://bookdown.org/yih_huynh/Guide-to-R-Book/important-tidyverse-functions.html
The same book from Y. Wendy Huynh — ‘R for Graduate Students’ has a section on graphing as well: https://bookdown.org/yih_huynh/Guide-to-R-Book/how-graphing-works.html

Master the basics (tutorial): http://www.rebeccabarter.com/blog/2017-11-17-ggplot2_tutorial/
Master the details (part of the book ‘ggplot2: elegant graphics for data analysis’): https://ggplot2-book.org/mastery.html
Be awesome: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization

The R cookbook: http://www.cookbook-r.com/
Hadley Wickham’s handbook ‘R for Data Science’ (data visualisation part): https://r4ds.had.co.nz/data-visualisation.html
Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton, ‘Modern Data Science with R’ (2020-10-08): https://beanumber.github.io/mdsr2e/

10.2 Histograms and quick plots

Histogram quick start guide: http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization
Many examples, including ggmap: https://www.rpubs.com/foocheeyoong/rws-module-2x

10.3 Box plots and bar plots

Box plot quick start, including themes: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization
Visualisation of individual observations and group means: https://drsimonj.svbtle.com/plotting-individual-observations-and-group-means-with-ggplot2
Grouped observations: http://www.sthda.com/english/articles/32-r-graphics-essentials/132-plot-grouped-data-box-plot-bar-plot-and-more/
Grouped and stacked barplots: https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html

10.4 Colours and scaling

Colour basics: https://www.datamentor.io/r-programming/color/
Ggplot best colour tricks: https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/
viridis documentation: https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
Color Brewer’s online colour picker (awesome for maps): https://colorbrewer2.org/#type=diverging&scheme=RdYlGn&n=9
Scale colours outside your data range: https://stackoverflow.com/questions/13888222/ggplot-scale-color-gradient-to-range-outside-of-data-range

10.5 Themes and decoration

Themes and plot background: http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements
Legend title, position and labels: https://www.datanovia.com/en/blog/ggplot-legend-title-position-and-labels/
Detailed theme modifications: https://ggplot2-book.org/polishing.html

10.6 External packages

Gallery of 80 external ggplot affiliate packages: https://exts.ggplot2.tidyverse.org/gallery/
ggrepel: https://ggrepel.slowkow.com/articles/examples.html
ggthemes - including palettes: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/

“Official” cheatsheets: https://rstudio.com/resources/cheatsheets/

Ecosystem of R packages

ggplot basics, tips and tricks

Péter Jeszenszky

08.12.2020

1 What is ggplot and why you should use it

1.1 How it works

2 Basics

2.1 `qplot()` Quick plot with ggplot2

2.2 Parts of an actual ggplot

2.3 Analyse a complex plot

3 Geometric Objects and Aesthetics

3.1 Geometric objects (`geom_<>()`)

3.2 Aesthetic mapping (`aes()`)

3.3 Barplot

3.4 Frequency of a continuous variable

3.4.1 Area and density plots

3.4.2 Histogram

3.5 One continuous and one discrete variable

3.5.1 Boxplot and friends

3.6 Two continuous variables

3.6.1 Scatterplots

4 Stats

4.1 Scales and colours

5 Facets

6 Let’s combine it all, step by step

7 Display multiple plots

8 Saving plots

9 Correlation plot

10 Further reading

10.1 General

10.2 Histograms and quick plots

10.3 Box plots and bar plots

10.4 Colours and scaling

10.5 Themes and decoration

10.6 External packages
