Hello!

This week I looked at time-dependent graphs in chapter 7 and statistical models in chapter 8.

Time-dependent Graphs

Time-dependent graphs are used to show change over a period of time. The most common graph type is a time series line graph, but you can also use dumbbell charts and slope graphs.

Time series

A time series is a set of quantitative values measured at time points that are an equal length of time apart. A time series graph plots those values in order.

The data used for this chapter is the economics time series included in the ggplot2 package. It contains monthly US economic data from July 1967 through April 2015.

Using the scale_x_date function, you can reformat the dates.
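
Here is a minimal sketch of what that can look like. The choice of psavert (personal savings rate) as the y variable, the 5-year breaks, and the label format are just my picks for illustration, not necessarily what the book uses.

```r
library(ggplot2)

# economics ships with ggplot2; psavert is the personal savings rate
p <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_line(color = "steelblue") +
  # put a tick every 5 years and print it as abbreviated month-2 digit year
  scale_x_date(date_breaks = "5 years", date_labels = "%b-%y") +
  labs(title = "Personal savings rate",
       x = "",
       y = "Personal savings rate (%)") +
  theme_minimal()

print(p)
```

Changing date_labels to something like "%Y" would show only the four-digit year.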


Dumbbell charts

Dumbbell charts are good for showing the change between 2 time points for several groups. This chart uses the geom_dumbbell function from the ggalt package.

For some reason my RStudio is not liking the ggalt package or the geom_dumbbell() function, and it keeps giving me an error. I tried looking into it on Google, and the function doesn't seem to have changed, so I'm not sure what is wrong. Maybe the package wasn't included when I was setting up and downloading all the packages into RStudio? The information above is pretty much it for the dumbbell graphs, though! Here's what the graph SHOULD look like after some edits and color changes are done:
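
As a rough sketch of the idea that does not depend on ggalt, a dumbbell chart can be assembled from geom_segment and geom_point in plain ggplot2. This is a swapped-in technique, not geom_dumbbell itself, and the data is base R's USPersonalExpenditure table (my assumption for illustration) rather than the book's dataset.

```r
library(ggplot2)

# USPersonalExpenditure is a base R matrix: spending categories x years,
# in billions of dollars. Keep only the first and last year.
spend <- data.frame(
  category = rownames(USPersonalExpenditure),
  y1940 = USPersonalExpenditure[, "1940"],
  y1960 = USPersonalExpenditure[, "1960"]
)

# order the categories by their 1960 value so the chart is easier to read
spend$category <- reorder(spend$category, spend$y1960)

p <- ggplot(spend, aes(y = category)) +
  # the "bar" of the dumbbell
  geom_segment(aes(x = y1940, xend = y1960, yend = category),
               color = "grey60") +
  # the two "weights": one point per time point
  geom_point(aes(x = y1940), color = "red", size = 3) +
  geom_point(aes(x = y1960), color = "blue", size = 3) +
  labs(title = "US personal expenditure, 1940 vs 1960",
       x = "Billions of dollars",
       y = "") +
  theme_minimal()

print(p)
```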

Slope graphs

Slope graphs show change across several time points for several groups. Using the gapminder dataset, this graph shows the life expectancy for 6 Central American countries from 1992 to 2007.

To create a slope graph, you use the newggslopegraph function from the CGPfunctions package. The parameters for this function are:
  • data frame
  • time variable (must be a factor)
  • numeric variable to be plotted
  • grouping variable (one line per group)
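
Here is a sketch of how those four arguments line up. I am using base R's USPersonalExpenditure table as a stand-in for gapminder (an assumption for illustration), and the plotting call only runs if CGPfunctions is installed.

```r
# Reshape the base R matrix to long form: one row per category and year
long <- as.data.frame(as.table(USPersonalExpenditure),
                      stringsAsFactors = FALSE)
names(long) <- c("category", "year", "spending")

# the time variable must be a factor
long$year <- factor(long$year)
# round so the value labels on the graph stay short
long$spending <- round(long$spending, 1)

if (requireNamespace("CGPfunctions", quietly = TRUE)) {
  # arguments in order: data frame, time, numeric variable, grouping variable
  p <- CGPfunctions::newggslopegraph(long, year, spending, category,
                                     Title = "US personal expenditure",
                                     SubTitle = "Billions of dollars",
                                     Caption = "Source: base R datasets")
  print(p)
}
```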

Area charts

An area chart is just a line graph that is shaded below the line. A stacked area chart can be used to show differences between groups over time. Stacked area charts are best when the interest is in both group change over time and overall change over time.

By default, the population is plotted in thousands, which gives axis labels in scientific notation. To fix this, you simply divide the Thousands variable by 1000 and add a label saying that the population is in millions.
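
A small sketch of that rescaling step, using the pop column from ggplot2's economics data (total population, recorded in thousands) as a stand-in for the book's dataset. This is a single-group area chart; mapping a grouping variable to fill would stack the groups.

```r
library(ggplot2)

# economics$pop is in thousands, so divide by 1000 to plot millions
p <- ggplot(economics, aes(x = date, y = pop / 1000)) +
  geom_area(fill = "lightblue", color = "black") +
  labs(title = "US population",
       x = "",
       y = "Population (millions)") +
  theme_minimal()

print(p)
```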

Here is the final product: 


Statistical Models

Statistical models show the relationship between explanatory variables and response variables. This chapter goes over models with a single response variable that is either quantitative or binary (yes/no).

There are 5 different types of plots used to visualize statistical models: correlation plots, linear regression, logistic regression, survival plots, and mosaic plots.

Correlation plots

Correlation plots show the pairwise relationship between quantitative variables using color and shading.

This example uses the Saratoga Houses dataset, which shows sale price and characteristics of Saratoga County, NY homes in 2006.

It is super easy to visualize the data. You load library(ggcorrplot) along with your usual library(ggplot2), compute a correlation matrix r with cor(), and then call ggcorrplot(r).

The ggcorrplot function has tons of options such as:

  • hc.order = TRUE reorders the variables using hierarchical clustering, so that similar variables sit next to each other.
  • type = "lower" plots the lower portion of the correlation matrix.
  • lab = TRUE displays the correlation coefficients on the plot.
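
Putting those options together might look like the sketch below. I am using a few numeric columns from base R's mtcars as a stand-in for the Saratoga data (an assumption for illustration), and the plot is only drawn if ggcorrplot is installed.

```r
# pick a handful of quantitative variables
df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]

# pairwise correlation matrix
r <- cor(df, use = "complete.obs")

if (requireNamespace("ggcorrplot", quietly = TRUE)) {
  p <- ggcorrplot::ggcorrplot(r,
                              hc.order = TRUE,  # cluster similar variables
                              type = "lower",   # lower triangle only
                              lab = TRUE)       # print the coefficients
  print(p)
}
```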


Linear Regression

Linear regression models the relationship between a quantitative response variable and one or more explanatory variables.


In this section, the model predicts home prices in Saratoga based on age, lot size, land value, living area, bedrooms, bathrooms, and whether the home is on a waterfront.

To help visualize relationships in this type of model, you use the visreg function. This function takes the model and the variable of interest and plots the conditional relationship, while controlling for the other variables.
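
A minimal sketch of that workflow, with base R's mtcars standing in for the Saratoga data (my assumption, so the variables are different from the book's). The visreg call only runs if the package is installed.

```r
# fit a linear model with several explanatory variables
fit <- lm(mpg ~ wt + hp + am, data = mtcars)

if (requireNamespace("visreg", quietly = TRUE)) {
  # relationship between mpg and wt, holding hp and am fixed;
  # gg = TRUE returns a ggplot2 object instead of a base graphic
  p <- visreg::visreg(fit, "wt", gg = TRUE)
  print(p)
}
```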

Logistic Regression

Logistic regression is used to show the relationship between a binary response variable (yes/no, lived/died, pass/fail) and one or more explanatory variables. This section uses the CPS85 data to predict the log-odds of being married given sex, age, race, and job sector.

Using the visreg function again, you can plot the relationship between the response and one explanatory variable while holding the others constant.
Multiple conditional plots can be created by adding a by option. In the example below, you can see the probability of being married by age, separately by sex:
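
A rough sketch of the same pattern, using base R's infert data instead of CPS85 (an assumption for illustration, so the response here is case status and the by variable is education, not marriage and sex). The plot is only drawn if visreg is installed.

```r
# logistic regression: binary response, binomial family
fit <- glm(case ~ age + parity + education + spontaneous,
           family = binomial, data = infert)

if (requireNamespace("visreg", quietly = TRUE)) {
  # one panel per education level; scale = "response" plots
  # probabilities instead of log-odds
  p <- visreg::visreg(fit, "age", by = "education",
                      scale = "response", gg = TRUE)
  print(p)
}
```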

Survival Plots

Survival plots are exactly what they sound like. They are common in healthcare research, where there is interest in time to recovery, time to death, or time to relapse. The plot shows the probability that an individual will survive up to time t.

This example uses the NCCTG Lung Cancer dataset in the survival package, showing the survival times of patients with advanced lung cancer after treatment.
Two particularly useful options in ggsurvplot (from the survminer package) are conf.int, which adds confidence intervals around the survival curves, and pval, which displays the p-value from a test comparing the curves.
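
A short sketch of how those pieces fit together. Splitting the curves by sex is my choice for illustration, and the plot is only drawn if survminer is installed.

```r
library(survival)

# Kaplan-Meier survival curves for the lung data, one curve per sex
fit <- survfit(Surv(time, status) ~ sex, data = lung)

if (requireNamespace("survminer", quietly = TRUE)) {
  p <- survminer::ggsurvplot(fit, data = lung,
                             conf.int = TRUE,  # shaded confidence bands
                             pval = TRUE)      # p-value comparing the curves
  print(p)
}
```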


Mosaic Plots

Mosaic plots show the relationship between categorical variables using rectangles whose areas are proportional to the number of cases in each combination of levels. Color is also used to show the relationship.

To make these plots, you can use ggplot2 with the ggmosaic package, but the book recommends using the vcd package.

The example uses data from the Titanic sinking, and what role sex played in survival. This does not seem to want to work on my version of R, so I'll just screenshot from the text.
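
For anyone hitting the same problem, here is a fallback sketch. It uses the Titanic table that ships with base R (which may be organized differently from the book's data), draws the plot with vcd if it is installed, and otherwise falls back to base R's mosaicplot.

```r
# Titanic is a 4-way table in base R: Class x Sex x Age x Survived.
# Collapse it to a 2-way table of Sex by Survived.
tbl <- margin.table(Titanic, c(2, 4))

if (requireNamespace("vcd", quietly = TRUE)) {
  # shade = TRUE colors the tiles by residuals; legend = TRUE adds the key
  vcd::mosaic(tbl, shade = TRUE, legend = TRUE)
} else {
  # base R alternative with similar shading
  mosaicplot(tbl, shade = TRUE, main = "Titanic survival by sex")
}
```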


