This tutorial builds on Hadley Wickham’s book ‘R for Data Science’, r-statistics.co, the www.sthda.com and www.datanovia.com tutorials, and www.r-graph-gallery.com.
Links to the other two parts of the workshop:
Introduction to R Markdown: https://verticalmeadows.github.io/basics.html
Introduction to ggmap: https://verticalmeadows.github.io/ggmap_basics.html
ggplot2 implements the grammar of graphics, a coherent system for describing and building graphs.
Official package documentation: https://ggplot2.tidyverse.org/index.html
ggplot2 is part of the tidyverse, along with dplyr, tidyr, stringr etc., which you might have learnt about last time.
“You provide the data, tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.” (https://ggplot2.tidyverse.org/)
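A succinct description of how ggplot2 works is hard because it embodies a deep philosophy of visualisation. However, in most cases you start with ggplot(), supply a dataset and an aesthetic mapping (with aes()), and then add layers, scales, faceting specifications and coordinate systems.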
#install.packages("ggplot2")
library(tidyverse) # tidyverse contains several other needed packages beside ggplot, such as dplyr
#library(ggplot2)
#library(dplyr)
Quick plot with ggplot2: qplot()
The qplot() function is very similar to the standard R plot() function. It can quickly create different types of graphs: scatter plots, box plots, violin plots, histograms and density plots. Note that qplot() has been deprecated since ggplot2 3.4.0, so prefer ggplot() in new code.
attach(mpg)
qplot(x=displ, y=cty) # aesthetics are passed to x and y, the axes of the plot
qplot(x=displ, geom="histogram") # default 'geom' is a point plot, if only x is specified, then a histogram
detach(mpg)
qplot(wt, mpg, data = mtcars) # miles per gallon vs. weight in 'cars'
mod <- lm(mpg ~ wt, data = mtcars) # a little linear model of mpg on weight, matching the plot above
qplot(fitted(mod), resid(mod)) # residuals (y axis) against the fitted values (x axis)
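Since qplot() is deprecated in recent ggplot2 releases, the same residual plot can be sketched with ggplot() directly. This is only an illustration: the data frame diag_df and its column names are made up for this example and are not part of the workshop code.

```r
library(ggplot2)

# Fit the model behind the scatter plot above: fuel economy explained by weight
mod <- lm(mpg ~ wt, data = mtcars)

# Collect fitted values and residuals in a data frame (illustrative names)
diag_df <- data.frame(fitted = fitted(mod), resid = resid(mod))

# Residuals (y) against fitted values (x), with a dashed reference line at zero
resid_plot <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")
resid_plot
```

Putting the diagnostics in a data frame first keeps the plot in the usual data-plus-mappings form used in the rest of this tutorial.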
First, let’s analyse a simple scatterplot presenting three variables.
# Based on mpg, a dataset about car emissions included in the ggplot2 package, where
# hwy = fuel efficiency on the highway
# displ = a car’s engine size, in litres
# class = the kind of car
dplyr::glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "au…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quatt…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 199…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)",…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17,…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25,…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class <chr> "compact", "compact", "compact", "compact", "compac…
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = class))
The basic structure of a ggplot is ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)).
On its own, ggplot(data = mpg) draws only an empty canvas: we have named the data, but there are no default geom functions or aesthetics. A geom function adds a layer to the plot; note that layers are joined with the + sign.
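The two steps described above can be sketched as follows, using the mpg data that ships with ggplot2. The object names canvas and scatter are illustrative only.

```r
library(ggplot2)

# Step 1: data only; this draws an empty panel because no layer exists yet
canvas <- ggplot(data = mpg)

# Step 2: add one layer with `+`; the geom needs x and y mappings to draw points
scatter <- canvas +
  geom_point(mapping = aes(x = displ, y = hwy))

length(canvas$layers)   # number of layers before adding the geom
length(scatter$layers)  # number of layers after adding the geom
scatter
```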
gg <- ggplot(diamonds, aes(x=carat, y=price))
gg + geom_point(aes(shape=color, color=cut), alpha =0.5) +
geom_smooth(aes(color=cut), size=0.5) #+
# facet_wrap(~ color, nrow = 2) # carat, cut and color are variables in `diamonds`
gg + geom_point(aes(shape=color, color=cut), size =0.5, alpha =0.5) +
geom_smooth(aes(color=cut), size=0.5)+
facet_wrap(~ color, nrow = 2) + ggtitle("Diamonds by colour")
This is called the layered grammar of graphics.
The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of seven parameters: a dataset, a geom, a set of mappings, a stat, a position adjustment, a coordinate system, and a faceting scheme.
In practice, you rarely need to supply all seven parameters to make a graph because ggplot2 will provide useful defaults for everything except the data, the mappings, and the geom function.
More on this in: https://r4ds.had.co.nz/data-visualisation.html#the-layered-grammar-of-graphics
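To make the role of the defaults concrete, here is a minimal sketch that writes the same bar chart twice: once with only the data, mappings and geom, and once with all seven parameters spelled out. It assumes that "count", "stack", coord_cartesian() and facet_null() are the defaults geom_bar() falls back on, which matches the ggplot2 documentation; the object names are illustrative.

```r
library(ggplot2)

# Short form: only data, mappings and geom are supplied
short_form <- ggplot(data = mpg) +
  geom_bar(mapping = aes(x = class, fill = drv))

# Long form: all seven parameters written out with the default values
long_form <- ggplot(data = mpg) +           # 1. dataset
  geom_bar(                                 # 2. geom
    mapping = aes(x = class, fill = drv),   # 3. mappings
    stat = "count",                         # 4. stat
    position = "stack"                      # 5. position adjustment
  ) +
  coord_cartesian() +                       # 6. coordinate system
  facet_null()                              # 7. faceting scheme

# Both specifications should produce the same computed layer data
all.equal(ggplot_build(short_form)$data[[1]], ggplot_build(long_form)$data[[1]])
```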
Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, mpg. (https://r4ds.had.co.nz/data-visualisation.html#creating-a-ggplot)
Here we will cover the most useful geometric objects, what to use them for and how to use them.
Geometric objects (geom_<>())
They are essentially the layers on the plot, including points, lines, text and labels, boxplots, histograms and many more.
You can get help and an overview of the existing geoms by typing help.search("geom_", package = "ggplot2") in the console. Alternatively, type geom_ while writing your code to get a list of available functions.
Aesthetics (aes())
In ggplot2, aesthetic means “something you can see”. Each aesthetic is a mapping between a visual cue and a variable. Examples include position (on the x and y axes), colour (the “outside” colour), fill (the “inside” colour), shape (of points), line type and size.
Each type of geom accepts only a subset of all aesthetics; refer to the geom help pages to see what mappings each geom accepts. Aesthetic mappings are set with the aes() function.
Concisely: https://beanumber.github.io/sds192/lab-ggplot2.html
Officially (package documentation): https://ggplot2.tidyverse.org/articles/ggplot2-specs.html
Used for displaying
- counts in a categorical variable
- values of one continuous variable across instances
Let’s plot the number of people saying OMG in different ways in Switzerland.
ggplot(data = sdats) +
geom_bar(mapping = aes(x =OMG_Sprache))
If we want to break it down by another variable, say, the gender of the respondents, then we can map the fill colour to the gender variable.
The default position is "stack", so the bars for each gender are stacked on top of each other.
ggplot(data = sdats) +
geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht))
position = "fill" shows the proportion rather than the count. As position is not an aesthetic but a parameter of the geom_bar() function, it goes outside aes() and its value needs quotation marks. Position adjustments are a game changer in many geometric objects.
ggplot(data = sdats) +
geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht), position = "fill")
position = "dodge" shows the genders next to each other, for comparison.
ggplot(data = sdats) +
geom_bar(mapping = aes(x =OMG_Sprache, fill=Geschlecht), position = "dodge")
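As the survey data is not bundled with ggplot2, the three position adjustments can also be tried on the built-in mpg data. The sketch below is an illustration under that assumption; it inspects the computed bar heights to confirm that position = "fill" rescales every bar to a total of 1.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = class, fill = drv))

stacked <- base + geom_bar(position = "stack")  # default: counts piled up
filled  <- base + geom_bar(position = "fill")   # proportions within each bar
dodged  <- base + geom_bar(position = "dodge")  # groups side by side

# Inspect the computed bar tops: with "fill" the highest segment of every bar ends at 1
fill_data <- ggplot_build(filled)$data[[1]]
bar_tops <- tapply(fill_data$ymax, as.integer(fill_data$x), max)
bar_tops
```

Printing stacked, filled and dodged one after the other shows the same counts arranged in three different ways.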
Plotting similar data as a pie chart in ggplot needs more tweaking. To try it, visit https://www.r-graph-gallery.com/128-ring-or-donut-plot
Area and density plots can present the count data in continuous variables, by binning them, similarly to histograms.
ggplot(data = sdats) +
geom_area(mapping = aes(x =Gewicht),stat = "bin", bins=15)
We can use the fill aesthetic here too, to show weight by gender; the areas will be stacked.
ggplot(data = sdats) +
geom_area(mapping = aes(x=Gewicht, fill=Geschlecht),stat = "bin", bins=15)
Or we can display the weights by gender but as density.
ggplot(data = sdats) +
geom_density(mapping = aes(x =Gewicht, fill=Geschlecht), alpha=0.4)
And we can adjust the smoothing of this density plot.
ggplot(data = sdats) +
geom_density(mapping = aes(x =Gewicht, fill=Geschlecht), alpha=0.4, adjust=1/2)
Note alpha, which adjusts the transparency of just about anything in our plots.
A simple histogram even has a default number of bins (30); ggplot2 prints a message suggesting that you pick a better value with binwidth.
ggplot(data = sdats) +
geom_histogram(mapping = aes(x =Gewicht))
Let’s adjust the binwidth.
ggplot(data = sdats) +
geom_histogram(mapping = aes(x =Gewicht), binwidth = 2)
Let’s change the y axis to show the density rather than the count. This way we can also add a smoothed density to the same plot; otherwise the two layers would need different y scales, which would prevent this.
ggplot(data = sdats) +
geom_histogram(aes(x =Gewicht, y=..density..), binwidth = 2, colour="black", fill="white")+
geom_density(aes(x =Gewicht), alpha=.2, fill="#FF6666")
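A minimal sketch of the same idea on the built-in faithful data, written with after_stat(density), the spelling that newer ggplot2 versions (3.3.0 and later) prefer over ..density... The dataset and the bin width of 5 are assumptions chosen for illustration. The check at the end shows why the two layers share a y axis: on the density scale the bar areas add up to 1, just like the area under the density curve.

```r
library(ggplot2)

# Histogram of waiting times on the density scale, with a density curve on top
bin_width <- 5
dens_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = bin_width,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#FF6666")

# On the density scale the bar areas (height * width) add up to 1
hist_data <- ggplot_build(dens_hist)$data[[1]]
total_area <- sum(hist_data$density * bin_width)
total_area
dens_hist
```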
Here we can observe some new things.
What is fill doing outside the aes()? Outside aes(), an argument sets one constant value for the whole layer instead of mapping a variable to it.
colour means the “outside” colour, while y=..density.. means that we want to deviate from the default ..count.. statistic used in histograms. (Since ggplot2 3.4.0 the preferred spelling is after_stat(density); the ..density.. form still works but is deprecated.)
"white" is a named colour and "#FF6666" is a hexadecimal colour code; more on this later on.
Also, you don’t need to write mapping all the time; ggplot will also understand if you only put aes().
We also see that x=Gewicht comes up in the aes() of both layers. It is possible to set this as a global aesthetic by moving it up into the ggplot() function, so that all subsequent layers use it.
ggplot(data = sdats, mapping = aes(x=Gewicht)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666") # we don't even have to write 0, when we mean 0.2
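To make the difference between setting a value outside aes() and mapping a variable inside aes() concrete, here is a small sketch on the built-in mpg data (an assumption, as the survey data is not public). Looking at the colours ggplot2 actually computed shows that the fixed layer uses one colour for every point, while the mapped layer uses one colour per level of drv.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))  # global aesthetics shared by all layers

# Setting: one constant colour for the whole layer, no legend
fixed <- base + geom_point(colour = "steelblue")

# Mapping: colour depends on a variable, a scale and a legend are created
mapped <- base + geom_point(aes(colour = drv))

fixed_colours  <- unique(ggplot_build(fixed)$data[[1]]$colour)
mapped_colours <- unique(ggplot_build(mapped)$data[[1]]$colour)
fixed_colours
length(mapped_colours)
```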
We can show the genders the same way in a histogram as well.
ggplot(data = sdats, mapping = aes(x=Gewicht)) +
geom_histogram(aes( fill=Geschlecht), binwidth = 5, alpha=0.5, position="identity")
What do you think would happen if position="identity" wasn’t there?
Discrete or Nominal or Categorical
A basic boxplot, coloured again by a discrete variable, gender. Notice that the numbers along the x axis don’t really have a meaning besides denoting the Cartesian coordinates of the plot.
ggplot(data = sdats) +
geom_boxplot(aes(y=Gewicht, fill=Geschlecht))
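Limit: We’ve got an outlier there, which makes it harder to interpret the plot. Let’s remove it by limiting the bounds of the plot with ylim(). Note that ylim() drops the observations outside the limits before the boxplot statistics are computed. Let’s also turn the plot around with coord_flip(), a coordinate function, which is the coordinate system part of the grammar mentioned before.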
ggplot(data = sdats) +
geom_boxplot(aes(y=Gewicht, fill=Geschlecht)) +
ylim(40,125) +
coord_flip()
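Limiting the axis with ylim() is not the same as zooming in. The sketch below uses a small made-up data frame (toy, not the survey data) with one extreme value to show the difference: ylim() removes the outlier before the median is computed, whereas coord_cartesian(ylim = ...) only changes the visible window and leaves the statistics untouched.

```r
library(ggplot2)

# Made-up data: nine ordinary values and one extreme outlier
toy <- data.frame(group = "a", value = c(1:9, 100))
box <- ggplot(toy, aes(x = group, y = value)) + geom_boxplot()

# ylim() drops the outlier; ggplot2 warns about the removed row, silenced here
dropped <- suppressWarnings(ggplot_build(box + ylim(0, 50))$data[[1]])

# coord_cartesian() keeps all ten values and only zooms the view
zoomed <- ggplot_build(box + coord_cartesian(ylim = c(0, 50)))$data[[1]]

dropped$middle  # median of the nine remaining values
zoomed$middle   # median of all ten values
```

If the outlier is a genuine observation rather than an error, zooming is usually the safer choice because the summary statistics still describe the full data.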
Let’s also change the shape of the outlier points and make the boxplot look different.
The possible basic shapes in R:
ggplot(data = sdats) +
geom_boxplot(aes(y=Gewicht, fill=Geschlecht), notch=TRUE,
outlier.colour="red",
outlier.shape=8,
outlier.size=2) +
ylim(40,125)
It is possible to plot the data points as non-overlapping dots along with the boxplot using geom_dotplot.
ggplot(data = sdats ) +
geom_boxplot(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht)) +
geom_dotplot(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht), binaxis = "y", stackdir = "center", dotsize = 0.3) +
ylim(40,125)
Violin plots can look even better, essentially plotting the density of the data points.
ggplot(data = sdats ) +
geom_violin(aes(y=Gewicht, x=Geschlecht, fill=Geschlecht),scale="area") +
ylim(40,125)
Why don’t we plot both boxplots and violin plots?
ggplot(data = sdats,aes(y=Gewicht, x=Geschlecht, fill=Geschlecht) ) +
geom_violin(scale="area", trim = TRUE) +
geom_boxplot(width=0.1)+
ylim(40,125)
mean_sdl computes the mean plus or minus a constant times the standard deviation. The constant, mult, defaults to 2; here we set it to 1 through fun.args. mean_sdl needs the Hmisc package to be installed. The result can be shown as a pointrange.
ggplot(data = sdats,aes(y=Gewicht, x=Geschlecht, fill=Geschlecht) ) +
geom_violin(scale="area", trim = TRUE) +
stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), # arguments for mean_sdl are passed through fun.args
geom="pointrange", color="red")+
ylim(40,125)
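If Hmisc is not installed, the same summary can be computed with a small helper of your own. The sketch below is an illustration on the built-in ToothGrowth data; mean_sd is a hypothetical function written for this example, not part of ggplot2. Any function passed as fun.data has to return a data frame with the columns y, ymin and ymax.

```r
library(ggplot2)

# Hypothetical helper: mean plus or minus `mult` standard deviations
mean_sd <- function(x, mult = 1) {
  data.frame(y = mean(x),
             ymin = mean(x) - mult * sd(x),
             ymax = mean(x) + mult * sd(x))
}

range_plot <- ggplot(ToothGrowth, aes(x = supp, y = len)) +
  geom_violin() +
  stat_summary(fun.data = mean_sd, fun.args = list(mult = 1),
               geom = "pointrange", colour = "red")

# The second layer holds one row per group with the computed summary
summary_data <- ggplot_build(range_plot)$data[[2]]
summary_data[, c("x", "y", "ymin", "ymax")]
range_plot
```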
stat_summary() is a more advanced way to construct plots and to include summary statistics within them.
Let’s plot the height of our participants against their weight.
ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
geom_point() +
xlim(40,125)
The values are integers, so points can overlap: several people may have exactly the same weight and height.
ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
geom_point(alpha=0.25) +
xlim(40,125)
geom_jitter() moves the points apart from each other by adding a small amount of random noise to their position. We can control the amount, but the default is 40% of the resolution of the data. Besides, geom_rug() helps show the position of the points on the axes as well. You can use alpha on these too.
ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
geom_jitter(alpha=0.5) +
xlim(40,125) +
geom_rug(sides = "bl", col="steelblue",alpha=0.1, size=1.5)
geom_smooth() adds a smoothed conditional mean, essentially giving us a trendline, including confidence intervals. See the individual modelling functions for more details: lm for linear smooths, glm for generalised linear smooths, loess for local smooths.
ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
geom_jitter(alpha=0.5) +
geom_smooth() +
xlim(40,125)
The default method for smoothing is method="loess" (for fewer than 1,000 observations). The formula argument is NULL by default, which corresponds to y ~ x in this case; it can be adjusted to plot polynomials.
ggplot(data = sdats, mapping = aes(x=Gewicht, y=Groesse)) +
geom_jitter(alpha=0.5) +
geom_smooth(method = "lm",formula = y ~ poly(x, 1), size = 1, col="red") +
geom_smooth(method = "lm",formula = y ~ poly(x, 2), size = 1, col="green") +
geom_smooth(method = "lm",formula = y ~ poly(x, 3), size = 1, col="blue") +
xlim(40,125)
geom_smooth() in more detail: https://ropensci.github.io/plotly/ggplot2/geom_smooth.html
More fitted lines: https://aosmith.rbind.io/2018/11/16/plot-fitted-lines/
Let’s see if people from the Alps are smaller than those living on the Swiss Plateau. To this end, plot a sampled subset of the data, labelling each participant with their canton.
set.seed(77) # if using the same seed, the same random subsets are reproduced
ggplot(data = sdats[sample(1:375, 100),], mapping = aes(x=Gewicht, y=Groesse)) +
geom_label(aes(label=Ref_KT)) +
xlim(40,95) + ylim(155,185)
Even though we limited the bounds even more, not all labels are visible. How about plotting them as points AND labelling them with text?
The ggrepel package helps with this task.
There are lots of other similar extensions to ggplot: https://exts.ggplot2.tidyverse.org/gallery/ .
library(ggrepel)
set.seed(77)
ggplot(data = sdats[sample(1:375, 100),], mapping = aes(x=Gewicht, y=Groesse)) +
geom_jitter(alpha=0.5, colour="red") +
geom_text_repel(aes(label=Ref_KT))+
xlim(40,95) + ylim(155,185)
When even jittering or alpha doesn’t help, we can turn to density plotting in 2D as well.
The following graph plots a 383 × 383 correspondence table, showing how linear distance corresponds to linguistic distance among 383 Swiss localities, with alpha=0.08.
syntdist <- read.csv("syntdist_vs_eucldist.csv", header = TRUE, stringsAsFactors = FALSE)
ggplot(syntdist,aes(DISTANCE, Syntdist60)) +
geom_point(alpha = 0.08, size = 1.5)
The command ggMarginal() from the package ggExtra helps us plot histograms, density graphs or even boxplots outside the axes to see the distribution of the points. For ggMarginal() to work, we have to save the plot as an object first.
library(ggExtra)
p <- ggplot(syntdist,aes(DISTANCE, Syntdist60)) +
geom_point(alpha = 0.08, size = 1.5)
ggMarginal(p, type="histogram")
If we save a plot as an object we can keep expanding it without having to write the basics again. But for clarity I’m trying to avoid that this time.
At least we can see that the densest area is not even in the middle of that blob. Try type="density" and type="boxplot" as well.
But we can plot the density of the points itself on the graph, in different ways.
p + geom_density2d() # density mapped with contour lines
p + geom_bin2d() # with default binwidth
p + geom_bin2d(binwidth=c(2500,1)) # with finer granularity (binwidths on the x and y axis, in their respective scales)
p + geom_hex() # with hexagons instead of squares
A quick look at stat_<>(). Details in Hadley Wickham’s handbook ‘R for Data Science’: https://r4ds.had.co.nz/data-visualisation.html#statistical-transformations
Stats are an alternative way to build a layer.
A stat builds new variables to plot (e.g., count, prop).
Visualize a stat by changing the default stat of a geom function, geom_bar(stat="count"), or by using a stat function, stat_count(geom="bar"), which calls a default geom to make a layer (equivalent to a geom function).
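The equivalence of the two routes can be checked on the built-in mpg data (used here as an assumption, in place of the survey data). Both plots below compute the same new variables, count and prop, which are visible in the data that ggplot2 builds for the layer.

```r
library(ggplot2)

# Route 1: a geom function with its stat named explicitly
via_geom <- ggplot(mpg, aes(x = class)) + geom_bar(stat = "count")

# Route 2: a stat function with its geom named explicitly
via_stat <- ggplot(mpg, aes(x = class)) + stat_count(geom = "bar")

geom_data <- ggplot_build(via_geom)$data[[1]]
stat_data <- ggplot_build(via_stat)$data[[1]]

# The computed variables `count` and `prop` are the same either way
geom_data[, c("x", "count", "prop")]
all.equal(geom_data$count, stat_data$count)
```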
One of the classic ways to graph summaries is the stat_summary() function. We begin with the ggplot() function, which requires the name of the dataset (we’ll use mydata, which we create first), followed by the aes() function that holds the x and y variable specifications. Next, we add the stat_summary() function. In its first argument we specify that we want to calculate the mean of the y variable (fun, called fun.y before ggplot2 3.3.0, sets the function to apply to y). Then, we specify which geom to plot. Here, we ask for points (other options could be bar, line, etc.).
## Creating an object named Subject
Subject <- c("Wendy", "Wendy", "Wendy", # you can press ENTER to auto-indent
"John", "John", "John", # the code. This produces better
"Helen", "Helen", "Helen") # formatting for the user.
## Creating an object named Date
Date <- c("2019-08-08", "2019-09-05", "2019-12-07", "2019-08-08",
"2019-09-05", "2019-12-07", "2019-08-08", "2019-09-05", "2019-12-07")
## Creating an object named Score
Score <- c(2, 15, 34,
5, 10, 27,
16, 8, 40)
## Creating an object
mydata <- tibble(Subject, Date, Score)
ggplot(mydata, aes(x = Subject, y = Score)) +
stat_summary(fun = "mean", geom = "point") # fun replaces fun.y, which is deprecated since ggplot2 3.3.0
An example with geom="line", modified from https://stackoverflow.com/questions/21239418/use-stat-summary-in-ggplot2-to-calculate-the-mean-and-sd-then-connect-mean-poin
data(ToothGrowth)
ToothGrowth$F3 <- letters[1:2]
# coerce dose to a factor
ToothGrowth$dose <- factor(ToothGrowth$dose, levels = c(0.5,1,2))
ggplot(ToothGrowth, aes(y = len, x = supp, colour = dose, group = dose)) +
stat_summary(fun = mean,
fun.min = function(x) mean(x) - sd(x),
fun.max = function(x) mean(x) + sd(x),
geom = "pointrange") +
stat_summary(fun = mean,
geom = "line") +
facet_wrap( ~ F3)
Let’s save a boxplot and a scatterplot first to show the changes on them. Let’s also set a theme for the plots, something that one can tweak further for better looks; we will do so further on.
theme_set(
theme_minimal() +
theme(legend.position = "right")
)
# Box plot
bp <- ggplot(sdats, aes(Geschlecht, Gewicht))
# Scatter plot
sp <- ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,125) # let's cut off the outlier already
Plot bp and sp with an arbitrary colour, not given in the aesthetics.
# Box plot
bp + geom_boxplot(fill = "#19FFFF", color = "red")
# Scatter plot
sp + geom_point(colour="steelblue")
If we colour it by a certain variable, the legend appears. Note the difference between fill in aes() and color outside of it.
# Box plot
bp + geom_boxplot(aes(fill = Geschlecht), color = "red")
Let’s colour the scatterplot based on the canton of origin.
# Scatter plot
sp + geom_point(aes(colour=Ref_KT))
Notice that many of these colours are hard to distinguish. The colours for the aesthetic mapping were chosen by default. We should use colours that are easier to tell apart, and do so automatically if possible.
Scales map data values to the visual values of an aesthetic. To change a mapping, we have to add a new scale.
Scales take on the scale_*_#() pattern, where * stands for an aesthetic, and # for a possible way to change the aesthetic.
But first, let’s set the values of the colours manually. This would take a long time if we had many different categories.
Check the built-in colours of R.
colors() shows the named colours in base R.
This PDF also shows the corresponding colours.
http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf
# Box plot
bp +
geom_boxplot(aes(fill = Geschlecht)) +
scale_fill_manual(values = c("tan","thistle","tomato2","wheat4"))
# Scatter plot
sp +
geom_point(aes(colour=Geschlecht)) +
scale_colour_manual(values = c("tan","thistle","tomato2","wheat4"))
Palettes in R behave like functions.
With basic plots, colour palettes are recycled.
temp <- c(5,7,6,4,8,2,4,6,9)
par(mfrow=c(2,3))
barplot(temp, col=c("tan","thistle","tomato2",
"wheat4","violetred"), main="With 5 colors")
barplot(temp, col=c("sienna3","royalblue1","snow2"), main="With 3 colors")
barplot(temp, col=rainbow(9), main="rainbow")
barplot(temp, col=heat.colors(9), main="heat.colors")
barplot(temp, col=terrain.colors(9), main="terrain.colors")
barplot(temp, col=topo.colors(9), main="topo.colors")
# check out the hexnames of colours
rainbow(9)
## [1] "#FF0000FF" "#FFAA00FF" "#AAFF00FF" "#00FF00FF" "#00FFAAFF" "#00AAFFFF"
## [7] "#0000FFFF" "#AA00FFFF" "#FF00AAFF"
terrain.colors(9)
## [1] "#00A600FF" "#3EBB00FF" "#8BD000FF" "#E6E600FF" "#E8C32EFF" "#EBB25EFF"
## [7] "#EDB48EFF" "#F0C9C0FF" "#F2F2F2FF"
In the first two plots, you can see that we supplied 5 and 3 arbitrarily chosen colours, and they were recycled to cover the 9 bars.
In the last four plots, the argument 9 tells each palette function to pick 9 colours spaced evenly along its colour range.
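The two behaviours described above can be tried in the console without drawing anything. This base R sketch reuses the three colours from the second bar plot; rep_len() is used here only to imitate what barplot() does internally when it recycles a short colour vector.

```r
# A palette function takes the number of colours wanted and returns colour codes
is.function(rainbow)          # palettes such as rainbow() are functions
nine_colours <- rainbow(9)    # nine colours spread along the rainbow
length(nine_colours)

# Base graphics recycle a shorter colour vector; rep_len() mimics that behaviour
three_colours <- c("sienna3", "royalblue1", "snow2")
recycled <- rep_len(three_colours, 9)
recycled

temp <- c(5, 7, 6, 4, 8, 2, 4, 6, 9)
barplot(temp, col = three_colours, main = "3 colours recycled over 9 bars")
```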
We can set gradient colours for the aesthetics as well.
# each ggplot below is drawn as a separate figure; par(mfrow = ...) only affects base graphics
sp + geom_point(aes(colour=Alter))+
scale_color_gradient(low = "#fed9a6", high = "#797878")
sp + geom_point(aes(colour=Alter))+
scale_color_gradientn(colours = rainbow(9))
sp + geom_point(aes(colour=Alter))+
scale_color_gradientn(colours = terrain.colors(9))
median_age <- median(sdats$Alter) # the midpoint is the median age (26); this name avoids masking the median() function
sp + geom_point(aes(colour=Alter))+
scale_color_gradient2(midpoint = median_age, low = "blue", mid = "white",
high = "red")
And take a look at two of the best colour packages in R, RColorBrewer and viridis.
https://colorbrewer2.org
https://vega.github.io/vega/docs/schemes/
library(viridis) # for pleasing colour scales
library(RColorBrewer) # for creating sequential, diverging and qualitative palettes
# RColorBrewer colourscales
display.brewer.all()
display.brewer.all(colorblindFriendly = TRUE)
# viridis colourscales
par(mfrow=c(2,2))
barplot(1:10, col = viridis(10), main="Color space: viridis")
barplot(1:10, col = inferno(10), main="Color space: inferno")
barplot(1:10, col = plasma(10), main="Color space: plasma")
barplot(1:10, col = cividis(10), main="Color space: cividis")
We can, however, set up our own palette, using the hex codes of colours or the named colours. Include some qualitative colours from RColorBrewer too. Note that this palette does not work like a function.
## let's make enough colours to tell the cantons apart
myPalette <- c(brewer.pal(8, "Set1"),"black","#67dd90","#8dd3c7", "#bebada","#fb8072","#b3de69","#fed9a6","#a65628", "#409ae3","#c7a632",
"#797878","#b3ff68","#b32968")
print("is.function(myPalette)")
## [1] "is.function(myPalette)"
is.function(myPalette)
## [1] FALSE
print("is.vector(myPalette)")
## [1] "is.vector(myPalette)"
is.vector(myPalette)
## [1] TRUE
print("typeof(myPalette)")
## [1] "typeof(myPalette)"
typeof(myPalette)
## [1] "character"
If we use it for plotting, ggplot will automatically take a subset of it if not all colours are needed. Colours, however, are NOT recycled in ggplot: if we don’t provide enough of them, ggplot2 raises an error.
sp + geom_point(aes(colour=Geschlecht))+
scale_colour_manual(values = myPalette)
sp + geom_jitter(aes(colour=Ref_KT), size=2)+
scale_colour_manual(values = myPalette)
Notice the jitter!
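Both cases, too many and too few colours in a manual scale, can be tried on the built-in mpg data. The five-colour vector below is an illustration only; drv has three levels and class has seven.

```r
library(ggplot2)

five_colours <- c("tan", "thistle", "tomato2", "wheat4", "violetred")

# `drv` has three levels, so only the first three colours are used
enough <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  scale_colour_manual(values = five_colours)
used <- unique(ggplot_build(enough)$data[[1]]$colour)
used

# `class` has seven levels; five colours are not enough, so building the plot fails
too_few <- ggplot(mpg, aes(displ, hwy, colour = class)) +
  geom_point() +
  scale_colour_manual(values = five_colours)
result <- tryCatch(ggplot_build(too_few), error = function(e) conditionMessage(e))
result
```

Wrapping the build step in tryCatch() keeps the script running and stores the error message, which is handy when palettes are assembled programmatically.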
Let’s try some colours from viridis as well. Colour our respondents by age while looking at their height and weight. And recolor our big fuzzy plot using viridis colours.
sp + geom_point(aes(colour=Alter))+
scale_color_viridis(option = "inferno")
p + geom_hex() +
scale_fill_viridis(option = "viridis", alpha = 0.5) # note that beforehand we didn't map any aesthetics in geom_hex, but it is logical that it is a 'fill' that we had to change
scale_color_viridis() acts like scale_color_gradient*() unless we tell it not to behave as a gradient.
Let’s plot the height and weight and see the trends by gender. We will use the discrete parameter so that we don’t supply a continuous scale to a discrete variable.
sp + geom_point(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+
scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
scale_fill_viridis(discrete = TRUE) +
xlim(40,110)
We can also change the limits of what we colour. Let’s colour for example only those between 18 and 26.
All data points above 26 years old will get grayed out (dark gray).
sp + geom_point(aes(colour=Alter))+
scale_color_gradientn(colours = terrain.colors(9), limits=c(18,26))
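What happens to the values outside the limits can be checked on the built-in mpg data. In this sketch, which assumes city mileage (cty) as the coloured variable and limits of 10 to 20, every observation outside the limits is treated as missing by the scale and drawn in the na.value colour.

```r
library(ggplot2)

limited <- ggplot(mpg, aes(displ, hwy, colour = cty)) +
  geom_point() +
  scale_colour_gradientn(colours = terrain.colors(9), limits = c(10, 20),
                         na.value = "grey50")  # spelled out here for clarity

point_colours <- ggplot_build(limited)$data[[1]]$colour
n_grey <- sum(point_colours == "grey50")        # points drawn in grey
n_outside <- sum(mpg$cty < 10 | mpg$cty > 20)   # observations outside the limits
c(grey = n_grey, outside = n_outside)
```

Changing na.value is a simple way to make the out-of-range points more or less prominent.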
And let’s put some of what we’ve seen in a brand new plot.
We might want to find out about the geographic distribution of political stances - ‘Are rural areas more conservative?’, for example.
Let’s plot political stance by cantons and age.
Even though we use geom_point(), we can add jitter as a position.
direction=-1 reverses the colour scale.
ggplot(sdats,aes(x=Alter, y=Ref_KT))+
geom_point(aes(color=Politik_LinksRechts),
position = position_jitter(width = 0, height = 0.25)) + # we only allow jitter to add a vertical value for this plot
scale_color_viridis(discrete = TRUE,option = "B", direction = -1) # notice that we made the colour scale discrete here too
We actually recorded political stance on two scales. Let’s use both and map our respondents in the two dimensions, coloured by cantons.
Let’s map the colours to our custom palette.
ggplot(sdats,aes(x=Politik_LinksRechts, y=Politik_LiberalKons))+
geom_point(aes(color=Ref_KT),
position = position_jitter(width = 0.35, height = 0.35)) +
scale_color_manual(values=myPalette)
The picture is not very clear despite the jittering.
A faceted plot might help. Split your plot into facets, subplots that each display one subset of the data.
The variable that you pass to facet_wrap() using a ~ should be discrete.
As we supply the canton of origin as the faceting factor, we can use the colour aesthetic for something else. Age would be interesting to see! Colour it as a gradient.
ggplot(sdats,aes(x=Politik_LinksRechts, y=Politik_LiberalKons))+
geom_point(aes(color=Alter),
position = position_jitter(width = 0.35, height = 0.35)) +
scale_color_gradient2(low = "blue", mid = "white", high = "red", midpoint = 40) + # midpoint defaults to 0
facet_wrap(~Ref_KT, ncol = 5)
To facet your plot on the combination of two variables, add facet_grid() to your plot. Let’s take two discrete variables and plot the political map based on whether someone uses “OMG!”, and based on gender. If we colour it by canton, then you can see that it’s a whopping 5 variables that you just plotted!
ggplot(sdats[which(sdats$Geschlecht %in% c("Männlich","Weiblich")),], # we subset the data based on gender first
aes(x=Politik_LinksRechts, y=Politik_LiberalKons)) +
geom_point(aes(color=Ref_KT),
position = position_jitter(width = 0.35, height = 0.35)) +
scale_color_manual(values=myPalette) +
facet_grid(Geschlecht~OMG_Sprache)
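The difference between the two faceting functions can be sketched on the built-in mpg data (an assumption, standing in for the survey data). facet_wrap() lays out the panels of one variable in a wrapped ribbon, while facet_grid() builds a full rows-by-columns grid, including panels for combinations that have no data.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

wrapped <- base + facet_wrap(~ class, ncol = 4)  # one discrete variable, wrapped into rows
gridded <- base + facet_grid(drv ~ cyl)          # rows by drv, columns by cyl

# Count the panels: one per class, and one per drv x cyl combination
n_wrapped <- nrow(ggplot_build(wrapped)$layout$layout)
n_gridded <- nrow(ggplot_build(gridded)$layout$layout)
c(wrap = n_wrapped, grid = n_gridded)
```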
ggplot(sdats, aes(Gewicht, Groesse)) +
geom_point(aes(colour=Geschlecht))
ggplot(sdats, aes(Gewicht, Groesse)) +
xlim(40,110)+
geom_point(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")
ggplot(sdats, aes(Gewicht, Groesse)) +
xlim(40,110)+
geom_point(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+
scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
scale_fill_viridis(discrete = TRUE)
Let’s set the theme to black and white (bw) and change the axis labels (labs() is another function for that).
ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
geom_point(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+
scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
scale_fill_viridis(discrete = TRUE) +
theme_bw() +
xlab("Weight") + ylab("Height")
Edit some of the parameters in the bw theme and redraw the above plot, slightly modified.
theme_set(theme_bw())
theme_update(legend.position=c(0.1,0.7), #Default legend position is "right"
legend.background=element_blank(), #No legend background
legend.text=element_text(size=10), #Legend text size
legend.title=element_text(size=10), #Legend title text size
panel.grid.major = element_blank(), #remove grids
panel.grid.minor = element_blank(), #remove grids
panel.border = element_blank(), #remove the panel border (inner plot area border)
axis.line = element_line(colour = "darkgray"), #Axis lines
plot.title=element_text(hjust = 0.5,face="bold", family="mono"), #Settings for title. Set font, boldness and center.
plot.subtitle=element_text(hjust = 0.5,face="italic",family="mono"), #Subtitle. italic and centered.
axis.title.x = element_text(face="italic"), #Axis labels italic
axis.title.y = element_text(face="bold",size=8,angle = 0), #Axis y label bold, different size and angle
plot.margin=margin(1,1,1,1, unit="cm") # set margins of the plot, defaults to unit="pt"
)
ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
geom_jitter(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+
scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
scale_fill_viridis(discrete = TRUE) +
xlab("Weight (kg)") + ylab("Height (cm)") + labs(title="Proportion across weight and height in Swiss German respondents",
subtitle="by gender",
fill= "Gender", colour="Gender") # modifies the legend title
Annotate the plot and remove the legend.
Add some horizontal and vertical lines, with the means of heights and weights by gender.
Overwrite the font face of the axis labels and the title for this plot only.
To change all the fonts in your plot, use plot + theme(text=element_text(family="mono")), where mono is your chosen font.
Note: when you’re writing a long plot or a function, selecting the code lines and pressing Ctrl/Cmd+I organises the code neatly.
# selecting the column first works for both data frames and tibbles
women.meanheight <- round(mean(sdats$Groesse[sdats$Geschlecht == "Weiblich"], na.rm = TRUE), 2)
men.meanheight <- round(mean(sdats$Groesse[sdats$Geschlecht == "Männlich"], na.rm = TRUE), 2)
women.meanweight <- round(mean(sdats$Gewicht[sdats$Geschlecht == "Weiblich"], na.rm = TRUE), 2)
men.meanweight <- round(mean(sdats$Gewicht[sdats$Geschlecht == "Männlich"], na.rm = TRUE), 2)
a <- ggplot(sdats, aes(Gewicht, Groesse)) + xlim(40,110)+
geom_jitter(aes(colour=Geschlecht))+
geom_smooth(aes(colour=Geschlecht, fill=Geschlecht), method="lm")+
scale_color_viridis(discrete = TRUE, option = "D")+ #option "D"="viridis"
scale_fill_viridis(discrete = TRUE) +
xlab("Weight (kg)") + ylab("Height (cm)") + labs(title="Proportion across weight and height in Swiss German respondents",
subtitle="by gender")+
annotate("text",x=90,y=198,label="Men are, on average,\n taller and heavier",size=4,fontface="bold", color="darkgreen") +
annotate("text",x=70,y=155,label="Women are, on average,\n shorter and lighter",size=4,fontface="bold", color="wheat4") +
geom_hline(yintercept = men.meanheight, linetype="dashed", color="darkgreen", size=1, alpha=0.5)+
geom_hline(yintercept = women.meanheight, linetype="dashed", color="wheat4", size=1, alpha=0.5)+
geom_vline(xintercept = men.meanweight, linetype="dashed", color="darkgreen", size=1, alpha=0.5)+
geom_vline(xintercept = women.meanweight, linetype="dashed", color="wheat4", size=1, alpha=0.5)+
annotate("text",x=men.meanweight + 6,y=202,label=paste0("Men's average weight:\n",men.meanweight),size=2)+
annotate("text",x=women.meanweight - 6,y=202,label=paste0("Women's average weight:\n",women.meanweight),size=2)+
annotate("text",y=men.meanheight + 6,x=40,label=paste0("Men's average height:\n",men.meanheight),size=2, angle = 90)+ # angle rotates the annotation
annotate("text",y=women.meanheight - 6,x=40,label=paste0("Women's average height:\n",women.meanheight),size=2, angle = 90) +
theme(
plot.title = element_text(hjust = 0.5,color="red", size=14, family="sans"),
axis.title.x = element_text(color="blue", size=10, face="italic"),
axis.title.y = element_text(color="#993333", size=10, face="italic"),
legend.position = "none") # remove legend completely
a
You can use specific themes by installing the ggthemes package.
A quick example using the Wall Street Journal theme.
library(ggthemes)
m <- ggplot(sdats, aes(Gewicht, Groesse, colour = Geschlecht))+
geom_jitter()
m <- m + theme_wsj()+ scale_colour_wsj("colors6")+
ggtitle("Biometrics by gender")
m
par(mfrow=c(2,2)), a solution that works with base plots, doesn’t work with ggplots.
An affiliated package, gridExtra, comes to the rescue!
library(gridExtra)
library(grid)
sp2 <- sp + geom_point()
bp2 <- bp + geom_boxplot()
grid.arrange(sp2,bp2,
ncol=2,
top=textGrob("Arranging plots on a grid",gp=gpar(fontface="bold"))) #Adding a main title
grid.arrange(sp2,bp2,m,p,
ncol=2, nrow=2,
top=textGrob("Arranging plots on a grid",gp=gpar(fontface="bold")))
Or build it using a custom made function: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
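The grid.arrange() calls above depend on plot objects (sp, bp, m, p) created earlier in the workshop. A self-contained sketch of the same idea, assuming the mpg data shipped with ggplot2 and using arrangeGrob(), the gridExtra function that builds the layout without drawing it; the names scatter, boxes and combined are illustrative.

```r
library(ggplot2)
library(gridExtra)
library(grid)

# Two small stand-in plots built from the mpg data shipped with ggplot2
scatter <- ggplot(mpg, aes(displ, hwy)) + geom_point()
boxes <- ggplot(mpg, aes(class, hwy)) + geom_boxplot()

# arrangeGrob() builds the layout without drawing it, so it can be stored and reused
combined <- arrangeGrob(
  scatter, boxes,
  ncol = 2,
  widths = c(1, 2),  # second column twice as wide as the first
  top = textGrob("Two plots side by side", gp = gpar(fontface = "bold"))
)

grid.newpage()
grid.draw(combined)  # draw the stored layout on the current device
```

Storing the layout in an object is handy when the arrangement should be saved to a file later, since it can be handed to ggsave() as the plot argument.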
To avoid the hassle of exporting your plots by hand in RStudio, you can use ggsave(), where you can set which plot you want to save (the default is the last one displayed), the file extension, the size of your plot, etc. You can call it inside your loops as well.
The plot size is given in units (“in”, “cm”, or “mm”). If no size is supplied, ggsave() uses the size of the current graphics device.
ggsave("beautiful_plot1.png", m,
dpi = 300)
ggsave("beautiful_plot2.png", m,device="png",
dpi = 300, width = 16, height = 12, units="cm")
# Another, non-ggplot method is opening a png or pdf device
png(filename="mykludgyplot.png",
width = 480, height = 480, units = "px", pointsize = 12, bg = "white")
print(a)
dev.off()
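Calling ggsave() inside a loop, as mentioned above, could look like the following sketch. It assumes the mpg data from ggplot2; the output folder (a temporary directory), the file names and the choice of columns are all illustrative.

```r
library(ggplot2)

out_dir <- tempdir()     # assumption: a temporary folder; swap in your own path
vars <- c("cty", "hwy")  # columns of mpg to plot, one file per column
saved <- character(0)

for (v in vars) {
  # .data[[v]] looks up the column whose name is stored in v
  p_loop <- ggplot(mpg, aes(x = displ, y = .data[[v]])) +
    geom_point() +
    labs(y = v)
  path <- file.path(out_dir, paste0("mpg_", v, ".png"))
  # pass the plot explicitly instead of relying on the default (the last plot)
  ggsave(path, plot = p_loop, width = 16, height = 12, units = "cm", dpi = 300)
  saved <- c(saved, path)
}
saved  # paths of the files that were written
```

Passing the plot explicitly makes it unambiguous which object ends up in which file.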
A nice extra skill: check out the corrplot package.
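As a small taste of corrplot, here is a minimal sketch that plots a correlation matrix, assuming the package is installed and using four columns of the built-in mtcars data; the column selection and the name cor_mat are illustrative.

```r
library(corrplot)

# Correlation matrix of a few numeric columns from mtcars
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])

# Circles sized and coloured by correlation strength; show only the upper triangle
corrplot(cor_mat, method = "circle", type = "upper")
```

Note that corrplot draws with base graphics rather than ggplot2, so the result is not a ggplot object and cannot be extended with `+`.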
Following the structure of this document
Ggplot cheat sheet - the essence: http://r-statistics.co/ggplot2-cheatsheet.html
Ggplot for beginners: https://www.tutorialspoint.com/ggplot2/
What can you do with ggplot in a nutshell: https://beanumber.github.io/sds192/lab-ggplot2.html
Ggplot examples: https://rstudio-pubs-static.s3.amazonaws.com/244477_0f53a9c257bd434fbf7af03518df3587.html
The R graph gallery: https://www.r-graph-gallery.com/index.html
Top 50 ggplot visualisations: http://r-statistics.co/Top50-Ggplot2-Visualizations-MasterList-R-Code.html
The basics of aesthetics (aes()): https://www.datanovia.com/en/blog/ggplot-aes-how-to-assign-aesthetics-in-ggplot2/
ggplot and stat_summary() in the tidyverse environment: https://bookdown.org/yih_huynh/Guide-to-R-Book/important-tidyverse-functions.html
The same book from Y. Wendy Huynh — ‘R for Graduate Students’ has a section on graphing as well: https://bookdown.org/yih_huynh/Guide-to-R-Book/how-graphing-works.html
Master the basics (tutorial): http://www.rebeccabarter.com/blog/2017-11-17-ggplot2_tutorial/
Master the details (part of the book ‘ggplot2: elegant graphics for data analysis’): https://ggplot2-book.org/mastery.html
Be awesome: http://www.sthda.com/english/wiki/be-awesome-in-ggplot2-a-practical-guide-to-be-highly-effective-r-software-and-data-visualization
The R cookbook: http://www.cookbook-r.com/
Hadley Wickham’s handbook ‘R for Data Science’ (data visualisation part): https://r4ds.had.co.nz/data-visualisation.html
Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton, ‘Modern Data Science with R’ (2020-10-08): https://beanumber.github.io/mdsr2e/
Histogram quick start: http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization
Many examples, including ggmap: https://www.rpubs.com/foocheeyoong/rws-module-2x
Box plot quick start incl. themes: http://www.sthda.com/english/wiki/ggplot2-box-plot-quick-start-guide-r-software-and-data-visualization
Visualisation of individual observations and group means: https://drsimonj.svbtle.com/plotting-individual-observations-and-group-means-with-ggplot2
Grouped observations: http://www.sthda.com/english/articles/32-r-graphics-essentials/132-plot-grouped-data-box-plot-bar-plot-and-more/
Grouped and stacked barplots: https://www.r-graph-gallery.com/48-grouped-barplot-with-ggplot2.html
Colour basics: https://www.datamentor.io/r-programming/color/
Ggplot best colour tricks: https://www.datanovia.com/en/blog/ggplot-colors-best-tricks-you-will-love/
viridis documentation: https://cran.r-project.org/web/packages/viridis/vignettes/intro-to-viridis.html
Color Brewer’s online colour picker (awesome for maps): https://colorbrewer2.org/#type=diverging&scheme=RdYlGn&n=9
Scale colours outside your data range: https://stackoverflow.com/questions/13888222/ggplot-scale-color-gradient-to-range-outside-of-data-range
Themes and plot background: http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements
Legend Title, Position and Labels : https://www.datanovia.com/en/blog/ggplot-legend-title-position-and-labels/
Detailed theme modifications: https://ggplot2-book.org/polishing.html
Gallery of 80 external ggplot affiliate packages: https://exts.ggplot2.tidyverse.org/gallery/
ggrepel: https://ggrepel.slowkow.com/articles/examples.html
ggthemes - including palettes: https://yutannihilation.github.io/allYourFigureAreBelongToUs/ggthemes/
“Official” cheatsheets: https://rstudio.com/resources/cheatsheets/
Ecosystem of R packages