Hello!

This week I read Chapter 4 of Data Visualization with R by Rob Kabacoff.

Last week I read Chapter 3 on univariate graphs, which plot data about a single variable. Chapter 4 is about bivariate graphs, which show the relationship between two variables. The type of graph used depends on whether the variables are categorical or quantitative.

Categorical vs. Categorical

Much like a univariate graph, a bivariate graph typically uses various types of bar charts to display categorical data.

There are 3 different types of bar charts used, and I'll display pictures of them all. This time, I am not going to go into detail about modifications such as color, size, etc. because it is the same as for univariate graphs.

The data set used shows the relationship between automobile class and drive type: 4-wheel drive, front-wheel drive, and rear-wheel drive.

Stacked bar chart

The stacked bar chart is the default for geom_bar() in ggplot2 (position = "stack"), so no special commands are needed for this type.

Grouped bar chart

Grouped bar charts display the categories of the second variable side-by-side. To create this bar chart, you add position = "dodge" to the geom_bar() function.


Segmented Bar Chart

Each bar in a segmented bar chart represents 100%. This chart can be made using the position = "fill" option. This graph type is useful if you are trying to compare the percentage of one variable across the levels of another. You can see that the percentage of front-wheel drive cars increases depending on vehicle type.
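Here is a rough sketch of the three bar chart types in code. I'm assuming the data set is mpg (it ships with ggplot2), with class for automobile class and drv for drive type. The object names (p_stacked and so on) are just my own.

```r
library(ggplot2)  # the mpg data set comes with ggplot2

# Stacked: position = "stack" is the default, written out here for clarity
p_stacked <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "stack")

# Grouped: drive types side-by-side within each class
p_grouped <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Segmented: every bar is rescaled so it adds up to 1 (100%)
p_segmented <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

print(p_segmented)
```

The only thing that changes between the three is the position argument.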


Quantitative vs. Quantitative

The relationship between two quantitative variables is typically shown using scatterplots or line graphs.

Scatterplot

The scatterplot is the simplest way to display two quantitative variables. For this example, the book uses the Salaries dataset and plots experience (yrs.since.phd) against academic salary (salary) for college professors. You can use all the same options used on a univariate graph to change how the elements are displayed, and you can also add fit lines:
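A minimal sketch of the scatterplot, assuming the Salaries data comes from the carData package (install it first if you don't have it). The colors and the straight-line fit are my own choices, not necessarily what the book pictures.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

p_scatter <- ggplot(Salaries, aes(x = yrs.since.phd, y = salary)) +
  geom_point(color = "steelblue", alpha = 0.7) +   # one dot per professor
  geom_smooth(method = "lm", formula = y ~ x) +    # straight fit line with a confidence band
  labs(x = "Years since PhD", y = "Salary")

print(p_scatter)
```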


Line Plot

Line plots are good if one of the variables represents time. The example graph displays the relationship between time (year) and life expectancy in the US between 1952 and 2007. You can add dots to make these graphs easier to read. The book doesn't talk about many other modifications, but you can, of course, change color, line width, dot size, etc.:
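Something like this should reproduce the line plot. I'm assuming the data comes from the gapminder package and a ggplot2 version of 3.4 or later (older versions use size instead of linewidth for lines).

```r
library(ggplot2)
data(gapminder, package = "gapminder")  # assumption: gapminder is installed

# Keep only the rows for the United States
us <- subset(gapminder, country == "United States")

p_line <- ggplot(us, aes(x = year, y = lifeExp)) +
  geom_line(linewidth = 1, color = "grey40") +   # the line itself
  geom_point(size = 3, color = "steelblue") +    # dots on each observation
  labs(x = "Year", y = "Life expectancy (years)")

print(p_line)
```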


Categorical vs. Quantitative

There are several different graph types available for categorical vs. quantitative variables, including bar charts, grouped kernel density plots, side-by-side box plots, side-by-side violin plots, mean/SEM plots, ridgeline plots, and Cleveland plots. Whew! That was a lot.

Bar Chart (on summary statistics)

Previously, bar charts were used to display the number of cases in a category, but you can also use them to show other summary statistics of a quantitative variable for each level of a categorical variable. This graph shows the mean salary for some university professors by academic rank:
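A quick sketch of a bar chart on a summary statistic, again assuming Salaries from the carData package. I compute the means with base R's aggregate(), which is just one way to do it.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

# One row per rank, holding the mean salary for that rank
mean_salary <- aggregate(salary ~ rank, data = Salaries, FUN = mean)

# geom_col() draws bars whose heights are the values in the data
p_means <- ggplot(mean_salary, aes(x = rank, y = salary)) +
  geom_col(fill = "cornflowerblue") +
  labs(x = "Rank", y = "Mean salary")

print(p_means)
```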


Grouped Kernel Density Plots

You can compare groups on a numeric variable using kernel density plots on a single graph:
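For example, a sketch that overlays one density curve per academic rank, assuming the Salaries data from carData. The alpha value is my own pick so the overlapping curves stay visible.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

p_density <- ggplot(Salaries, aes(x = salary, fill = rank)) +
  geom_density(alpha = 0.4) +   # semi-transparent so overlaps can be seen
  labs(x = "Salary", y = "Density")

print(p_density)
```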

Box Plots

Box plots display the 25th percentile, median, and 75th percentile of a distribution. The vertical lines are called whiskers, and they capture roughly 99% of a normal distribution. Observations outside of that range are plotted as points that represent outliers. Box plots are very useful when placed side-by-side to compare groups on a numerical variable. The picture on the left is a basic box plot, and the picture on the right is a notched box plot. A notched box plot gives an approximate visual check of whether groups differ. If the notches of two box plots do not overlap, there is strong evidence (roughly 95% confidence) that the medians of the two groups differ:
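Here is a sketch of both versions, assuming the Salaries data from carData. At the bottom I also check the "roughly 99%" claim, assuming the usual rule that whiskers reach at most 1.5 times the interquartile range past the box (that is geom_boxplot()'s default). For a normal distribution it works out to about 99.3%.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

# Basic side-by-side box plots
p_box <- ggplot(Salaries, aes(x = rank, y = salary)) +
  geom_boxplot()

# Notched version: notch = TRUE draws the notches around each median
p_notched <- ggplot(Salaries, aes(x = rank, y = salary)) +
  geom_boxplot(notch = TRUE)

# How much of a standard normal distribution falls inside the whisker range?
q3 <- qnorm(0.75)                     # 75th percentile
upper_fence <- q3 + 1.5 * (2 * q3)    # IQR of a standard normal is 2 * q3
coverage <- 2 * pnorm(upper_fence) - 1
round(coverage, 3)
```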


Violin Plots

I had never heard of these until today... Violin plots are similar to kernel density plots, but are rotated 90 degrees and mirrored around a vertical axis. There isn't much information on these, but it is common and useful to superimpose box plots onto violin plots like so:
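A sketch of the violin plus box plot combination, assuming the Salaries data from carData. The widths and colors are my own guesses at something readable.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

p_violin <- ggplot(Salaries, aes(x = rank, y = salary)) +
  geom_violin(fill = "cornflowerblue") +            # mirrored density for each rank
  geom_boxplot(width = 0.15, fill = "orange",       # narrow box plot drawn on top
               outlier.shape = NA)                  # hide outlier points to reduce clutter

print(p_violin)
```

The box plot layer has to come after the violin layer, otherwise the violin covers it.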


Ridgeline Plot (or joyplot)

A ridgeline plot shows the distribution of a quantitative variable for several groups. These are similar to kernel density plots with vertical faceting, but they take up less room. These plots are made using the ggridges package. This graph uses the fuel economy dataset and shows city miles per gallon by car type.
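A minimal sketch with ggridges, assuming the fuel economy data is ggplot2's mpg data set (cty is city miles per gallon, class is car type). Hiding the legend is my own choice since the y-axis already labels each group.

```r
library(ggplot2)
library(ggridges)  # assumption: ggridges is installed

p_ridges <- ggplot(mpg, aes(x = cty, y = class, fill = class)) +
  geom_density_ridges() +                 # one density "ridge" per car class
  theme_ridges() +                        # theme that comes with ggridges
  theme(legend.position = "none") +       # legend would repeat the y-axis labels
  labs(x = "City miles per gallon", y = "Car class")

print(p_ridges)
```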


Mean / SEM Plots

Another way to compare groups on a numeric variable is the mean plot with error bars. The error bars can represent standard deviations, standard errors of the mean, or confidence intervals. The graph below compares salary across rank and sex. According to the book this is technically not bivariate because there are 3 variables (rank, sex, and salary), but it works in this case:
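Here is a simplified sketch of a mean/SEM plot. To keep it short I only group by rank (not rank and sex), and I compute the statistics with base R, so this is not the book's exact code. It assumes the Salaries data from carData; the stats column names come from how aggregate() and data.frame() label things.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

# Mean, standard deviation, and group size of salary for each rank
stats <- aggregate(salary ~ rank, data = Salaries,
                   FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x)))
stats <- do.call(data.frame, stats)   # columns: rank, salary.mean, salary.sd, salary.n

# Standard error of the mean = sd / sqrt(n)
stats$se <- stats$salary.sd / sqrt(stats$salary.n)

p_sem <- ggplot(stats, aes(x = rank, y = salary.mean, group = 1)) +
  geom_line() +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = salary.mean - se, ymax = salary.mean + se),
                width = 0.1) +
  labs(x = "Rank", y = "Mean salary (+/- 1 SEM)")

print(p_sem)
```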


Strip Plots

A strip plot, in short, is a one-dimensional scatter plot. Basically, it shows each group as a single horizontal row of dots. These graphs can sometimes be difficult to decipher because the dots in one group often overlap too much. To fix this, you can "jitter" them, which adds a small random number to each y-coordinate so the dots spread up or down away from each other. You can also combine jitter plots with other plot types, such as box plots or violin plots. A related chart is the beeswarm plot, which arranges the dots so they overlap less while still showing how dense the data is at each value. That is the type I decided to picture because it looks fun:
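A sketch of a jittered strip plot drawn over box plots, assuming the Salaries data from carData and ggplot2 3.3 or later (so the horizontal box plots are detected automatically). The jitter amount of 0.2 is my own choice.

```r
library(ggplot2)
data(Salaries, package = "carData")  # assumption: carData is installed

set.seed(42)  # jitter is random, so fix the seed to get the same picture each time

p_strip <- ggplot(Salaries, aes(x = salary, y = rank)) +
  geom_boxplot(outlier.shape = NA) +     # outliers hidden, the dots already show them
  geom_jitter(height = 0.2, width = 0,   # nudge dots up or down only, never sideways
              alpha = 0.5, color = "steelblue") +
  labs(x = "Salary", y = "Rank")

print(p_strip)
```

Setting width = 0 matters here because sideways jitter would change the salary values the dots appear to show.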


Cleveland Dot Charts

These plots are useful for comparing numeric statistics for lots of groups. For this example, a dataset showing the 2007 life expectancy for Asian countries is used. These graphs are usually easier to read if the groups are sorted in ascending or descending order. If you draw a line out to each point on a Cleveland dot chart, it is called a lollipop graph, and you can see why below:
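A sketch of the sorted dot chart and its lollipop version, assuming the data is from the gapminder package. Starting the lollipop sticks at 40 is my own choice (it sits below the lowest value in this subset).

```r
library(ggplot2)
data(gapminder, package = "gapminder")  # assumption: gapminder is installed

# Asian countries in 2007 only
asia <- subset(gapminder, continent == "Asia" & year == 2007)

# Cleveland dot chart: reorder() sorts the countries by life expectancy
p_dots <- ggplot(asia, aes(x = lifeExp, y = reorder(country, lifeExp))) +
  geom_point(size = 3, color = "steelblue") +
  labs(x = "Life expectancy (years)", y = NULL)

# Lollipop: same chart with a line segment running out to each dot
p_lollipop <- p_dots +
  geom_segment(aes(x = 40, xend = lifeExp,
                   yend = reorder(country, lifeExp)),
               color = "grey50")

print(p_lollipop)
```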


That's it for this week! Next week I'll learn about multivariate graphs. :)
