2  Getting Started

2.1 Why plot your data?

Getting information from a table is like extracting sunlight from a cucumber.
Farquhar & Farquhar (1891)

At the time the Farquhar brothers wrote this pithy aphorism, graphical methods for understanding data had advanced considerably, but they were not universally practiced, prompting the brothers’ complaint.

The main graphic forms we use today—pie charts, line graphs and bar charts—were invented by William Playfair around 1800 (Playfair, 1786, 1801). The scatterplot arrived shortly after (Herschel, 1833), and thematic maps showing the spatial distributions of social variables (crime, suicides, literacy) were used for the first time to reason about important societal questions (Guerry, 1833), such as “is increased education associated with lower rates of crime?”

In the last half of the 19th Century, the idea of correlation was developed (Galton, 1886; Pearson, 1896), and the period roughly 1860–1890, dubbed the “Golden Age of Graphics” (Funkhouser, 1937), became the richest period of innovation and beauty in the entire history of data visualization. During this time there was an incredible development of visual thinking, represented by the work of Charles Joseph Minard; advances in the role of visualization within scientific discovery, as illustrated through Francis Galton; and graphical excellence, embodied in the state statistical atlases produced in France and elsewhere. See Friendly (2008) and Friendly & Wainer (2021) for this history.

2.1.1 Anscombe’s Quartet

In 1973, Francis Anscombe (Anscombe, 1973) famously constructed a set of four datasets to illustrate the importance of plotting data before analyzing it and building models, and the effect of unusual observations on fitted models. Now known as Anscombe’s Quartet, these datasets have nearly identical statistical properties: the same means, standard deviations, correlations and regression lines.

His purpose was to debunk three notions that had been prevalent at the time:

  • Numerical calculations are exact, but graphs are rough;
  • For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis;
  • Performing intricate calculations is virtuous, whereas actually looking at the data is cheating.

The dataset datasets::anscombe has 11 observations, recorded in wide format, with variables x1:x4 and y1:y4.

data(anscombe) 
head(anscombe)
#>   x1 x2 x3 x4   y1   y2    y3   y4
#> 1 10 10 10  8 8.04 9.14  7.46 6.58
#> 2  8  8  8  8 6.95 8.14  6.77 5.76
#> 3 13 13 13  8 7.58 8.74 12.74 7.71
#> 4  9  9  9  8 8.81 8.77  7.11 8.84
#> 5 11 11 11  8 8.33 9.26  7.81 8.47
#> 6 14 14 14  8 9.96 8.10  8.84 7.04


The following code transforms this data to long format and calculates some summary statistics for each dataset.

anscombe_long <- anscombe |> 
  pivot_longer(everything(), 
               names_to = c(".value", "dataset"), 
               names_pattern = "(.)(.)"   # split "x1" into variable "x", dataset "1"
  ) |>
  arrange(dataset)

anscombe_long |>
  group_by(dataset) |>
  summarise(xbar      = mean(x),
            ybar      = mean(y),
            r         = cor(x, y),
            intercept = coef(lm(y ~ x))[1],
            slope     = coef(lm(y ~ x))[2]
         )
#> # A tibble: 4 × 6
#>   dataset  xbar  ybar     r intercept slope
#>   <chr>   <dbl> <dbl> <dbl>     <dbl> <dbl>
#> 1 1           9  7.50 0.816      3.00 0.500
#> 2 2           9  7.50 0.816      3.00 0.5  
#> 3 3           9  7.5  0.816      3.00 0.500
#> 4 4           9  7.50 0.817      3.00 0.500

As we can see, all four datasets have nearly identical univariate and bivariate statistical measures. Only in graphs can you see how they differ, and their true natures turn out to be vastly different.

Figure 2.1 is an enhanced version of Anscombe’s plot of these data, adding helpful annotations to show visually the underlying statistical summaries.

Figure 2.1: Scatterplots of Anscombe’s Quartet. Each plot shows the fitted regression line and a 68% data ellipse representing the correlation between \(x\) and \(y\).

This figure is produced as follows, using a single call to ggplot(), faceted by dataset. As we will see later (Section 3.2), the data ellipse (produced by stat_ellipse()) reflects the correlation between the variables.

desc <- tibble(
  dataset = as.character(1:4),   # match the type of anscombe_long$dataset for faceting
  label = c("Pure error", "Lack of fit", "Outlier", "Influence")
)

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "red", linewidth = 1.5) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  scale_y_continuous(breaks = seq(0, 12, 2)) +
  stat_ellipse(level = 0.68, color = "blue", type = "norm") +  # 68% data ellipse
  geom_label(data = desc, aes(label = label), x = 6, y = 12) +
  facet_wrap(~dataset, labeller = label_both) 

The subplots are labeled with the statistical idea they reflect:

  • dataset 1: Pure error. This is the typical case with well-behaved data. Variation of the points around the line reflects only measurement error or unreliability in the response, \(y\).

  • dataset 2: Lack of fit. The data are clearly curvilinear, and would be very well described by a quadratic, y ~ poly(x, 2). This violates the assumption of linear regression that the fitted model has the correct form.

  • dataset 3: Outlier. One point, second from the right, has a very large residual. Because this point is near the extreme of \(x\), it pulls the regression line towards it, as you can see by imagining a line through the remaining points (the sketch after this list quantifies this).

  • dataset 4: Influence. All but one of the points have the same \(x\) value. The one unusual point has sufficient influence to force the regression line to fit it exactly.
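To make the “pull” of the outlier in dataset 3 concrete, here is a minimal sketch, assuming anscombe_long from above; the object names d3, fit_all and fit_rest are just illustrative. It refits the line after setting aside the point with the largest absolute residual:

d3 <- subset(anscombe_long, dataset == "3")
fit_all  <- lm(y ~ x, data = d3)
worst    <- which.max(abs(residuals(fit_all)))   # index of the largest residual
fit_rest <- lm(y ~ x, data = d3[-worst, ])       # refit without that point
coef(fit_all)    # slope near 0.5, pulled toward the outlier
coef(fit_rest)   # the line through the remaining, nearly collinear points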

One moral from this example:

Linear regression only “sees” a line. It does its best when the data are really linear. Because the line is fit by least squares, it is pulled toward discrepant points to minimize the sum of squared residuals.

Datasaurus Dozen

The method Anscombe used to compose his quartet is unknown, but it turns out that there is a method to construct a much wider collection of datasets with identical statistical properties. After all, in a bivariate dataset with \(n\) observations, the correlation has \(n-2\) degrees of freedom, so it is possible to choose \(n-2\) of the \((x, y)\) pairs to yield any given value. As it happens, it is also possible to create any number of datasets with the same means, standard deviations and correlations, with nearly any shape you like — even a dinosaur!

The Datasaurus Dozen was first publicized by Alberto Cairo in a blog post; the datasets are available in the datasauRus package (Davies et al., 2022). As shown in Figure 2.2, the sets include a star, cross, circle, bullseye, horizontal and vertical lines, and, of course, the “dino”. The method (Matejka & Fitzmaurice, 2017) uses simulated annealing, an iterative process that perturbs the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to their fixed target values.
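As a quick check, here is a minimal sketch, assuming the datasauRus package is installed (its datasaurus_dozen dataset is in long format, with columns dataset, x and y) and dplyr is loaded as above:

library(datasauRus)   # provides the datasaurus_dozen dataset

datasaurus_dozen |>
  group_by(dataset) |>
  summarise(xbar = mean(x),
            ybar = mean(y),
            sd_x = sd(x),
            sd_y = sd(y),
            r    = cor(x, y))

Up to two decimal places, every dataset reports the same means, standard deviations and correlation, while Figure 2.2 shows how wildly the shapes differ.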

The datasauRus package just contains the datasets, but a general method for producing such datasets, called statistical metamers, has been described by Elio Campitelli and implemented in the metamer package.

Figure 2.2: Animation of the Datasaurus Dozen datasets. Source: https://youtu.be/It4UA75z_KQ
Quartets

The essential idea of a statistical “quartet” is to illustrate four quite different datasets or circumstances that seem superficially the same, yet are paradoxically very different when you look behind the scenes. For example, in the context of causal analysis, Gelman et al. (2023) illustrated sets of four graphs, within each of which all four represent the same average (latent) causal effect but with much different patterns of individual effects; McGowan et al. (2023) provide another illustration with four seemingly identical datasets, each generated by a different causal mechanism. As an example from machine learning, Biecek et al. (2023) introduced the “Rashomon Quartet”, a synthetic dataset for which four models from different classes (linear model, regression tree, random forest, neural network) have practically identical predictive performance. In all cases, the paradox is resolved when visualization reveals the distinct ways of understanding structure in the data. The quartets package contains these and other variations on this theme.
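For instance, here is a minimal sketch of the causal quartet, assuming the quartets package is installed and exposes a causal_quartet data frame in long format; the column names dataset, exposure and outcome are assumptions and should be checked against the package documentation:

library(quartets)   # assumed to provide the causal_quartet data frame

# Naive effect of exposure on outcome, ignoring the causal mechanism;
# the point of the quartet is that these estimates are nearly identical
causal_quartet |>
  group_by(dataset) |>
  summarise(effect = coef(lm(outcome ~ exposure))[2])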

2.1.2 A real example

In the mid-1980s, a consulting client had a strange problem. She was conducting a study of the relation between body image and weight preoccupation in exercising and non-exercising people (Davis, 1990). As part of the design, the researcher wanted to know whether self-reported weight could be taken as a reliable indicator of true weight measured on a scale. It was expected that the correlation between reported and measured weight should be close to 1.0, and that the slopes of the regression lines for men and women should also be close to 1.0. The dataset is carData::Davis.

She was therefore very surprised to see the following numerical results: for men, the correlation was nearly perfect, but not so for women.

data(Davis, package="carData")
Davis <- Davis |>
  drop_na()          # drop missing cases
Davis |>
  group_by(sex) |>
  select(sex, weight, repwt) |>
  summarise(r = cor(weight, repwt))
#> # A tibble: 2 × 2
#>   sex       r
#>   <fct> <dbl>
#> 1 F     0.501
#> 2 M     0.979

Similarly, the regression line for men showed the expected slope, but that for women was only 0.26.

Davis |>
  nest(data = -sex) |>                                   # one row per sex
  mutate(model = map(data, ~ lm(repwt ~ weight, data = .)),
         tidied = map(model, tidy)) |>                   # broom::tidy() for each fit
  unnest(tidied) |>
  filter(term == "weight") |>                            # keep only the slopes
  select(sex, term, estimate, std.error)
#> # A tibble: 2 × 4
#>   sex   term   estimate std.error
#>   <fct> <chr>     <dbl>     <dbl>
#> 1 M     weight    0.990    0.0229
#> 2 F     weight    0.262    0.0459

“What could be wrong here?” the client asked. The consultant replied with the obvious question:

Did you plot your data?

The answer turned out to be one discrepant point, a female, whose measured weight was 166 kg (366 lbs!). This single point exerted so much influence that it pulled the fitted regression line down to a slope of only 0.26.
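A quick way to confirm the diagnosis is to recompute the correlations with that single case set aside; a minimal sketch, where the filter on weight == 166 simply targets the suspect case:

# Recompute correlations by sex, excluding the one suspect case
Davis |>
  filter(weight != 166) |>
  group_by(sex) |>
  summarise(r = cor(weight, repwt))

With that one case removed, the correlation for women rises to near 1.0, as originally expected. The plot below (Figure 2.3) makes the problem immediately obvious.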

Davis |>
  ggplot(aes(x = weight, y = repwt, 
             color = sex, shape = sex)) +
  geom_point(size = ifelse(Davis$weight == 166, 6, 2)) +  # enlarge the suspect case
  labs(x = "Measured weight (kg)", 
       y = "Reported weight (kg)") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  theme(legend.position = c(.8, .8))
Figure 2.3: Regression for Davis’ data on reported weight and measured weight for men and women. Separate regression lines, predicting reported weight from measured weight, are shown for males and females. One highly unusual point is highlighted.

In this example, it is arguable that the \(x\) and \(y\) axes should be reversed, to determine how well measured weight can be predicted from reported weight. In ggplot this is easily done by reversing the x and y aesthetics.

Davis |>
  ggplot(aes(y = weight, x = repwt, color = sex, shape=sex)) +
  geom_point(size = ifelse(Davis$weight==166, 6, 2)) +
  labs(y = "Measured weight (kg)", x = "Reported weight (kg)") +
  geom_smooth(method = "lm", formula = y~x, se = FALSE) +
  theme(legend.position = c(.8, .8))
Figure 2.4: Regression for Davis’ data on reported weight and measured weight for men and women. Separate regression lines, predicting measured weight from reported weight, are shown for males and females. The highly unusual point no longer has an effect on the fitted lines.

In Figure 2.4, this discrepant observation again stands out like a sore thumb, but it makes very little difference in the fitted line for females. The reason is that this point is well within the range of the \(x\) variable (repwt). To impact the slope of the regression line, an observation must be unusual in both \(x\) and \(y\). We take up the topic of how to detect influential observations and what to do about them in Chapter XX.

The value of such plots is not only that they can reveal possible problems with an analysis; they can also help identify the reasons and suggest corrective action. What went wrong here? Examination of the original data showed that this person switched the values, recording her reported weight in the box for measured weight and vice versa.

2.2 Plots for data analysis

Visualization methods take an enormous variety of forms, but it is useful to distinguish several broad categories according to their use in data analysis:

  • data plots : primarily plot the raw data, often with annotations to aid interpretation (regression lines and smooths, data ellipses, marginal distributions)

  • reconnaissance plots : with more than a few variables, reconnaissance plots provide a high-level, bird’s-eye overview of the data, allowing you to see patterns that might not be visible in a set of separate plots. Some examples are scatterplot matrices, showing all bivariate plots of variables in a dataset; correlation diagrams, using visual glyphs to represent the correlations between all pairs of variables; and “trellis” or faceted plots that show how a focal relation of one or more variables differs across values of other variables.

  • model plots : plot the results of a fitted model, such as a regression line or curve with bands to show uncertainty, a regression surface in 3D, or a plot of coefficients in a model together with confidence intervals. Other model plots try to take into account that a fitted model may involve more variables than can be shown in a static 2D plot. Some examples of these are added variable plots and marginal effect plots, both of which attempt to show the net relation of two focal variables, controlling or adjusting for other variables in a model.

  • diagnostic plots : indicating potential problems with the fitted model. These include residual plots, influence plots, plots for testing homogeneity of variance and so forth (a small sketch follows this list).

  • dimension reduction plots : plot representations of the data in a space of fewer dimensions than the number of variables in the dataset. Simple examples include principal components analysis (PCA) and the related biplots, and multidimensional scaling (MDS) methods.
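To give a flavor of the diagnostic category, here is a minimal sketch using base R’s plot() method for "lm" objects, applied to the Davis regression fit earlier; the object names fit and op are just illustrative:

# Standard diagnostic plots for a fitted linear model:
# residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
fit <- lm(repwt ~ weight, data = Davis)
op <- par(mfrow = c(2, 2))   # 2 x 2 panel layout
plot(fit)
par(op)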

We give some more details and a few examples in the sections that follow.

2.3 Data plots

Data plots portray the data in a space where the coordinate axes are the observed variables.

  • 1D plots include line plots, histograms and density estimates.
  • 2D plots are most often scatterplots, but contour plots or hex-binned plots are also useful when the sample size is large (see the sketch below).
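Here is a minimal sketch of those two alternatives on simulated data; the data frame big and the settings (n = 10,000 points, bins = 40) are arbitrary choices, and geom_hex() additionally requires the hexbin package to be installed:

set.seed(42)
big <- data.frame(x = rnorm(10000), y = rnorm(10000))

ggplot(big, aes(x, y)) +
  geom_hex(bins = 40)       # hex-binned counts (requires the hexbin package)

ggplot(big, aes(x, y)) +
  geom_density_2d()         # contour plot of an estimated 2D density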

2.4 Model plots

Model plots show the fitted or predicted values from a statistical model and provide visual summaries…

2.5 Diagnostic plots

2.6 Principles of graphical display

[This could be a separate chapter]

  • Criteria for assessing graphs: communication goals
  • Effective data display:
    • Make the data stand out
    • Make graphical comparison easy
    • Effect ordering: For variables and unordered factors, arrange them according to the effects to be seen
  • Visual thinning: As the data becomes more complex, focus more on impactful summaries

Packages used here: broom, dplyr, forcats, ggplot2, knitr, lubridate, purrr, readr, stringr, tibble, tidyr, tidyverse