Chapter 6

The Origin and Development of the Scatterplot

Synopsis

Among all the modern forms of statistical graphics, the scatterplot may be considered the most versatile and generally useful invention in the entire history of the field. It is also notable because William Playfair didn’t invent it. This chapter considers why Playfair was unable to think about such things, and it traces the invention of the scatterplot to the eminent astronomer John F. W. Herschel. Scatterplots achieved great importance in the work of Francis Galton [1822–1911] on the heritability of traits. Galton’s work, visualized through statistical diagrams, became the source of the statistical ideas of correlation and regression, and thus of most modern statistical methods.

Chapter contents

  • Early Displays That Were Not Scatterplots
    • Johann Lambert
    • Why Not Playfair?
  • John F. W. Herschel and the Orbits of Twin Stars
    • Herschel’s Graphical Impact Factor
  • Francis Galton and the Idea of Correlation
    • Galton’s Elliptical Insight
    • Another Asymmetry
  • Some Remarkable Scatterplots
    • The Hertzsprung-Russell Diagram
    • The Phillips Curve
  • Spurious Correlations and Causation
  • Scatterplot Thinking

Selected Figures

Figure 6.1: Modern scatterplot

A conventional modern scatterplot depicting the relation between salary and years of experience for a hypothetical group of ten workers.
Source: © The Authors.
Rcode: 06_1-workers.R

Figure 6.2: A theoretical relation

Edmund Halley’s 1686 bivariate plot of the theoretical relation between barometric pressure (y) and altitude (x), derived from observational data.
Source: Edmond Halley, “On the height of the mercury in the barometer at different elevations above the surface of the earth, and on the rising and falling of the mercury on the change of weather,” Philosophical Transactions, 16, (1686), 104–115.

Figure 6.3: Cotes’s diagram

Roger Cotes’s depiction of four two-dimensional fallible observations which could be combined using a weighted mean at their center of gravity, Z, to give a more precise estimate.
Source: Roger Cotes, Aestimatio Errorum in Mixta Mathesis, per Variationes Planitum Trianguli Plani et Spherici, 1722.
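
The idea in the caption, combining several fallible two-dimensional observations into one estimate at their weighted center of gravity, can be illustrated with a minimal Python sketch. The four points and their weights below are hypothetical values chosen for easy arithmetic; they are not taken from Cotes’s diagram, and the function name is ours.

```python
# Hypothetical observations (x, y) and weights; NOT Cotes's actual values.
points = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
weights = [1.0, 2.0, 1.0, 4.0]  # larger weight = observation trusted more


def weighted_centroid(points, weights):
    """Return the weighted mean (center of gravity) of 2-D points."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return x, y


z = weighted_centroid(points, weights)
print(z)  # (3.0, 2.5)
```

With equal weights the result would be the ordinary centroid; here the heavily weighted fourth point pulls Z toward itself.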

Figure 6.4: Chart of soil temperatures

J. H. Lambert’s chart of the variation in soil temperatures at various latitudes over time. The horizontal axis represents time within a year using astronomical symbols.
Source: Johann Heinrich Lambert, Pyrometrie; oder, vom maasse des feuers und der wärme mit acht kupfertafeln. Berlin: Haude & Spener, 1779.

Figure 6.5: Plotting the ratio

Redrawn version of Playfair’s time-series graph, showing the ratio of price of wheat to wages, together with a nonparametric smooth curve.
Source: © The Authors.
Rcode: 06_5-wheat2.R

Figure 6.6: Scatterplot view

Scatterplot version of Playfair’s data, joining the points in time order, and showing the linear regression of price of wheat on wages.
Source: © The Authors.
Rcode: 06_6-wheat3.R

Figure 6.7: Herschel’s first scatterplot

Herschel’s graphical method applied to his data on the double star γ Virginis. The plot shows observations of position angle separating the twin stars on the vertical axis and time on the horizontal axis. Some observations, considered as more reliable, are shown as double circles. The key feature is the smoothed curve, from which he drew the tangent lines, giving him the angular velocities at evenly spaced distances along the curve.
Source: John F. W. Herschel, “On the investigation of the orbits of revolving double stars: Being a supplement to a paper entitled ‘micrometrical measures of 364 double stars,’” Memoirs of the Royal Astronomical Society, 5, (1833), 171–222.

Figure 6.8: Star measurements

Observations of a double star involved measuring two quantities to determine the orbit: the position angle of the fainter star and the angular distance between them.
Source: © The Authors.

Figure 6.9: Herschel redone

Reconstruction of Herschel’s graph of data on the orbits of γ Virginis together with his eye-smoothed, interpolated curve (light gray) and a loess smoothed curve (dark gray). Circles around each data point are of area proportional to Herschel’s weight for the observation.
Source: © The Authors.
Rcode: 06_9-herschel.R

Figure 6.10: Ellipse geometry

Herschel’s geometric construction of the apparent elliptical orbit of Gamma Virginis from the calculations based on Figure 6.7. The principal star is marked “S.” The labeled points around the ellipse show the interpolated observations translated to this space.
Source: John F. W. Herschel, “On the investigation of the orbits of revolving double stars: Being a supplement to a paper entitled ‘micrometrical measures of 364 double stars,’” Memoirs of the Royal Astronomical Society, 5, (1833), 171–222.

Figure 6.11: The first correlation diagram

Galton’s first correlation diagram, showing the relation between head circumference and height, from his undated notebook, “Special Peculiarities.”
Source: Reproduced from Victor L. Hilts, A Guide to Francis Galton’s English Men of Science, Vol. 65, Part 5. Philadelphia, PA: Transactions of the American Philosophical Society, 1975, pp. 1–85, fig. 5.

Figure 6.12: Sweet pea diagram

Galton’s semigraphic table of the sweet pea data, represented in classed intervals around the averages for parent and child seeds. The marks in the cells are supposed to represent the number of seeds in each combination.
Source: galton.org.

Figure 6.13: Visual argument for regression

Reconstruction of Galton’s argument from the experiment on sweet peas, from his 1877 graph. The plotted points show the data that Galton later reported in his 1889 Natural Inheritance (table 2, p. 226), but jittered horizontally because the sizes of the parent seeds were discrete. The solid line shows the fitted linear relation of means (+), joined by a thicker line. The fact that the fitted line of the means had a slope less than 1.0 implied what Galton later called “regression toward the mean” in filial characteristics.
Source: © The Authors.
Rcode: 06_13-galton-peas.R
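
To see why a fitted slope below 1.0 implies regression toward the mean, the following Python sketch simulates data of the same general shape: a few discrete parent sizes, with offspring sizes scattered around a line through the overall mean. This is not the authors’ R script and does not use Galton’s data; the parent sizes, the assumed slope of 0.5, the noise level, and the function name are all hypothetical choices for illustration.

```python
import random

random.seed(1)  # fixed seed so the simulation is repeatable

parent_sizes = [15, 16, 17, 18, 19, 20, 21]  # hypothetical classed sizes
mean_size = 18.0                              # hypothetical overall mean
assumed_slope = 0.5                           # assumption, not Galton's estimate

# Simulate 100 offspring per parent size, scattered around the line.
xs, ys = [], []
for p in parent_sizes:
    for _ in range(100):
        xs.append(float(p))
        ys.append(mean_size + assumed_slope * (p - mean_size) + random.gauss(0, 1.0))


def ols_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


slope = ols_slope(xs, ys)

# A parent 3 units above the mean predicts offspring only slope * 3 units
# above it: closer to the mean than the parent was.
predicted_excess = slope * 3.0
print(round(slope, 2), round(predicted_excess, 2))
```

Because the slope lies between 0 and 1, the predicted offspring deviation is always smaller in magnitude than the parental deviation, which is the pattern the caption describes.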

Figure 6.14: Semigraphic table

Galton’s Table I on the relationship between heights of parents and their children. Entries in the table are the number of adult children.
Source: Reformatted from Francis Galton, “Regression towards Mediocrity in Hereditary Stature,” Journal of the Anthropological Institute of Great Britain and Ireland, 15 (1886), 246–263, table I.

Figure 6.15: Finding contours

A reconstruction of Galton’s method for finding contours of approximately equal frequency in the relation between heights of parents and their children. Top: The original data from his table (Figure 6.14) are shown in a small font (dark); the smoothed values are shown in a large font (light). Bottom: the lines joining the means of y|x and the means of x|y are added, together with the corresponding regression lines.
Source: © The Authors.
Rcode: 06_15-galton-ex2.R

Figure 6.16: Elliptical insight

Galton’s smoothed correlation diagram for the data on heights of parents and children, showing one ellipse of constant frequency. The geometric relations between the regression lines and the tangent lines and major and minor axes are also shown.
Source: Extracted from Francis Galton, “Regression towards Mediocrity in Hereditary Stature,” Journal of the Anthropological Institute of Great Britain and Ireland, 15 (1886), 246–263, Plate X.

Figure 6.17: Early HR diagram

Russell’s plot of absolute magnitude against spectral class.
Source: Ian Spence and Robert F. Garrison, “A Remarkable Scatterplot,” The American Statistician, 47:1 (1993), pp. 12–19, fig. 1. Reprinted by permission of Taylor & Francis Ltd.

Figure 6.18: Outliers fool least-squares regression

Hertzsprung–Russell diagrams for a collection of stars in the CYG OB1 star cluster. Top: regression line and data ellipse covering 50% of the points, fit using standard least-squares methods; bottom: regression line and 50% data ellipse using a robust method which is insensitive to the obvious unusual points in the upper left corner.
Source: © The Authors.
Rcode: 06_18-hertzsprung.R
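
The contrast in the two panels can be reproduced in miniature with a Python sketch. The caption does not say which robust method was used, so the example below swaps in the Theil–Sen estimator (the median of all pairwise slopes) purely as one well-known robust alternative. The data are hypothetical: twenty points lying exactly on a line of slope 1, plus four discordant points in the upper left; they are not the CYG OB1 measurements.

```python
from itertools import combinations
from statistics import median

# Hypothetical data: 20 points on y = x, plus 4 outliers in the upper left.
main = [(float(x), float(x)) for x in range(1, 21)]
outliers = [(1.0, 30.0), (2.0, 30.0), (3.0, 30.0), (4.0, 30.0)]
data = main + outliers


def ols_slope(pts):
    """Ordinary least-squares slope of y on x."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    return sxy / sxx


def theil_sen_slope(pts):
    """Median of slopes over all pairs of points with distinct x values."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(pts, 2)
        if x2 != x1
    ]
    return median(slopes)


ols = ols_slope(data)
robust = theil_sen_slope(data)
print(round(ols, 3), robust)
```

In this toy data set the four outliers drag the least-squares slope far below 1, while the median of pairwise slopes stays at the slope of the main body of points.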

Figure 6.19: The Phillips curve

Phillips’s data on wage inflation and unemployment, 1861–1913, with the fitted Phillips curve. Points used in the fitting process are shown by + signs.
Source: A. W. Phillips, “The Relation between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861–1957,” Economica, 25:100 (1958), pp. 283–299, fig. 1.

Figure 6.20: Unemployment cycles

Phillips’s data showing one cycle in wage inflation and unemployment, 1893–1904, with the fitted curve from 1861 to 1913.
Source: A. W. Phillips, “The Relation between Unemployment and the Rate of Change of Money Wage Rates in the United Kingdom, 1861–1957,” Economica, 25:100 (1958), pp. 283–299, fig. 6.

Figure 6.21: A spurious correlation

Chocolate consumption per capita and the number of Nobel prizes per million population in twenty-three countries.
Source: © The Authors. Redrawn from Franz H. Messerli, “Chocolate Consumption, Cognitive Function, and Nobel Laureates,” The New England Journal of Medicine, 367:16 (2012), pp. 1562–1564, fig. 1.
Rcode: 06_21-nobel-chocolate.R

Figure 6.22: More spurious correlations

Correlations between countries’ number of Nobel laureates per 10 million population and other spurious causes: (A) annual per capita wine consumption; (B) number of IKEA stores per 10 million population. Data source: Pierre Maurage, Alexandre Heeren, and Mauro Pesenti, “Does Chocolate Consumption Really Boost Nobel Award Chances? The Peril of Over-Interpreting Correlations in Health Studies,” Journal of Nutrition, 143:6 (2013), pp. 931–933, figs. 1B and 1C.
Source: © The Authors.
Rcode: 06_22-nobel-wine.R

Plate P.9: Polarization of politics

Ridgeline plot showing the increasing polarization of U.S. House of Representatives and Senate in the Democratic (left, blue) and Republican (right, red) parties from 1963 to 2013. The DW-NOMINATE scores attempt to characterize the main dimensions that distinguish voting patterns of U.S. legislators. The first dimension, plotted here, is interpreted as a liberal-conservative or left-right dimension.
Source: Rpubs.
Rcode: 06_P9-ridgeline-politics.R

Plate P.10: Playfair’s time-series chart

William Playfair’s 1821 time-series graph of prices, wages, and ruling monarch over a 250-year period.
Source: William Playfair, A Letter on our agricultural distresses, their causes and remedies. London: W. Sams, 1821. Image courtesy of Stephen Stigler.

 

Copyright © 2021 Michael Friendly. All rights reserved.

friendly AT yorku DOT ca

ORCID iD: orcid.org/0000-0002-3237-0941