Data sets in the heplots package

Documenting package datasets

Datasets used in package examples are an important part of making a package understandable and usable, but they are often overlooked. In developing the heplots package I assembled a large collection of datasets illustrating a variety of multivariate linear models, with some analyses and graphical displays. Each of these has much more than the usual stub examples, which often look like:

data(dataset)
# str(dataset); plot(dataset)

But .Rd, and now roxygen, don’t make it easy to work with numerous datasets in a package, or, more importantly, to document what they illustrate. I’m showing the work to create this vignette, in case these ideas are useful to others.

In this release, I started with a file generated by:

vcdExtra::datasets("heplots") |> head(4)
#>        Item      class    dim                         Title
#> 1 AddHealth data.frame 4344x3 Adolescent Mental Health Data
#> 2   Adopted data.frame   62x6              Adopted Children
#> 3      Bees data.frame  246x6   Captive and maltreated bees
#> 4  Diabetes data.frame  145x6              Diabetes Dataset

Then, in the roxygen documentation, I added @concept tags to classify these datasets according to the methods used. (@concept entries are indexed with the package, so they work via help.search().) For example, the documentation for the AddHealth data contains these lines:

#' @name AddHealth
#' @docType data
 ...
#' @keywords datasets
#' @concept MANOVA
#' @concept ordered

With standard processing, these concepts, along with the keywords, appear in the Index section of the manual constructed by devtools::build_manual(). In the pkgdown site for this package, they are also searchable in the search box.
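As a small illustration of the help.search() route, the sketch below looks up installed help topics by concept. It assumes heplots is installed and that its documentation carries the MANOVA concept tag shown above; the variable name topics is only illustrative, and the guard leaves it empty when the package is missing.

```r
# Search the \concept entries of installed help pages.
# Assumption: heplots is installed and tags some topics with "MANOVA".
topics <- character(0)
if (requireNamespace("heplots", quietly = TRUE)) {
  hits <- utils::help.search("MANOVA", fields = "concept", package = "heplots")
  # hits$matches is a data frame of matching help entries
  topics <- unique(hits$matches$Topic)
}
topics
```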

With a bit of extra processing, I created a file, datasets.csv, which is used below.

How this is done is described in this StackExchange question. The first step is to extract the concept tags from the man/*.Rd files:

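library(dplyr)   # %>%, summarize(), arrange()
library(tidyr)   # separate()

# Collect every \concept{} line from the Rd files (needs a shell that provides grep)
concepts <- readLines(pipe("grep concept man/*.Rd")) %>%
  grep("concept{", ., fixed = TRUE, value = TRUE) %>%
  read.table(text = ., sep = "{", comment.char = "}",
    col.names = c("dataset", "tags")) %>%
  # split "man/<name>.Rd:..." on "/" and "." only, so names containing "_" stay whole
  separate(dataset, c(NA, "dataset"), sep = "[/.]", extra = "drop") %>%
  summarize(tags = paste(tags, collapse = " "), .by = dataset) %>%
  arrange(dataset)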

head(concepts)
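The pipeline above depends on a shell that provides grep. As a hedged alternative, here is a minimal base R sketch of the same extraction step. The helper name extract_concepts() and the temporary demo directory are hypothetical, introduced only for illustration; this is not the code used to build the package's datasets.csv.

```r
# Hypothetical helper: collect \concept{} tags from every .Rd file in a directory
extract_concepts <- function(man_dir = "man") {
  rd_files <- list.files(man_dir, pattern = "\\.Rd$", full.names = TRUE)
  rows <- lapply(rd_files, function(f) {
    lines <- readLines(f, warn = FALSE)
    # keep the text between "\concept{" and the closing "}"
    m <- regexpr("(?<=\\\\concept\\{)[^}]*", lines, perl = TRUE)
    hits <- regmatches(lines, m)
    if (length(hits) == 0) return(NULL)
    data.frame(dataset = tools::file_path_sans_ext(basename(f)),
               tags = paste(hits, collapse = " "))
  })
  do.call(rbind, rows)
}

# Demo on a made-up Rd fragment written to a temporary directory
demo_dir <- file.path(tempdir(), "man-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c("\\name{AddHealth}", "\\docType{data}", "\\keyword{datasets}",
             "\\concept{MANOVA}", "\\concept{ordered}"),
           file.path(demo_dir, "AddHealth.Rd"))
concepts_demo <- extract_concepts(demo_dir)
concepts_demo
```

Taking the dataset name from the file name mirrors what the grep-based version does; it assumes one Rd file per dataset.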

Then, I use vcdExtra::datasets() to extract title and other information for the datasets in the package, munge the result and left_join() with concepts:

dsets <- vcdExtra::datasets("heplots")[, c("Item", "dim", "Title")]     
rowcols <- as.data.frame(stringr::str_split_fixed(dsets$dim,"x", 2))
colnames(rowcols) <- c("rows", "cols")

dsets <- cbind(dsets, rowcols) |>
  rename(dataset = Item) |>
  select(-dim) |>
  relocate(c(rows, cols), .after=dataset) |>
  left_join(concepts, by = "dataset")
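Note that str_split_fixed() returns character columns, so rows and cols are text at this point. If numeric values are wanted before the table is written out, one possible base R approach is sketched below; the vector dims is made-up sample input in the same "rows x cols" form that vcdExtra::datasets() reports.

```r
# Sample input in the "<rows>x<cols>" form (illustrative values only)
dims <- c("4344x3", "62x6")

# Split each string on the literal "x" and bind the pieces into a 2-column matrix
parts <- do.call(rbind, strsplit(dims, "x", fixed = TRUE))

# Convert both pieces to integers
rowcols_demo <- data.frame(rows = as.integer(parts[, 1]),
                           cols = as.integer(parts[, 2]))
rowcols_demo
```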

For use in pkgdown documentation, I also turn the name of the dataset into a hyperlink to the documentation.

refurl <- "http://friendly.github.io/heplots/reference/"

dsets <- dsets |>
  mutate(dataset = glue::glue("[{dataset}]({refurl}{dataset}.html)"))
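To make the result concrete, the sketch below builds the same kind of markdown link for a single name with base R's sprintf() instead of glue. The names ds_name, ref_base and link are illustrative only; the URL pattern assumes the usual pkgdown layout of one reference page per topic.

```r
ds_name  <- "AddHealth"
ref_base <- "http://friendly.github.io/heplots/reference/"

# Markdown link of the form [name](base + name + ".html")
link <- sprintf("[%s](%s%s.html)", ds_name, ref_base, ds_name)
link
```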

Methods

The main methods used in the example datasets are listed below:

MANOVA: Multivariate analysis of variance
MANCOVA: Multivariate analysis of covariance
MMRA: Multivariate multiple regression
cancor: Canonical correlation (using the candisc package)
candisc: Canonical discriminant analysis (using candisc)
repeated: Repeated measures designs, analyzed from the multivariate perspective
robust: Robust estimation of MLMs

In addition, a few examples illustrate special handling for linear hypotheses concerning factors:

ordered: ordered factors
contrasts: other contrasts

The dataset names are linked to the documentation with graphical output on the pkgdown website, http://friendly.github.io/heplots/.

Dataset table

library(here)
library(dplyr)
library(tinytable)
#dsets <- read.csv(here::here("extra", "datasets.csv"))  # doesn't work in a vignette
dsets <- read.csv("https://raw.githubusercontent.com/friendly/heplots/master/extra/datasets.csv")
dsets <- dsets |> 
  dplyr::select(-X) |> 
  arrange(tolower(dataset))

# link dataset to pkgdown doc
refurl <- "http://friendly.github.io/heplots/reference/"

dsets <- dsets |>
  mutate(dataset = glue::glue("[{dataset}]({refurl}{dataset}.html)")) 

#knitr::kable(dsets)
tinytable::tt(dsets)  |> format_tt(markdown = TRUE)

dataset	rows	cols	title	tags
AddHealth	4344	3	Adolescent Mental Health Data	MANOVA ordered
Adopted	62	6	Adopted Children	MMRA repeated
Bees	246	6	Captive and maltreated bees	MANOVA
Diabetes	145	6	Diabetes Dataset	MANOVA
dogfood	16	3	Dogfood Preferences	MANOVA contrasts candisc
FootHead	90	7	Head measurements of football players	MANOVA contrasts
Headache	98	6	Treatment of Headache Sufferers for Sensitivity to Noise	MANOVA repeated
Hernior	32	9	Recovery from Elective Herniorrhaphy	MMRA candisc
Iwasaki_Big_Five	203	7	Personality Traits of Cultural Groups	MANOVA
mathscore	12	3	Math scores for basic math and word problems	MANOVA
MockJury	114	17	Effects Of Physical Attractiveness Upon Mock Jury Decisions	MANOVA candisc
NeuroCog	242	10	Neurocognitive Measures in Psychiatric Groups	MANOVA candisc
NLSY	243	6	National Longitudinal Survey of Youth Data	MMRA
oral	56	5	Effect of Delay in Oral Practice in Second Language Learning	MANOVA
Oslo	332	14	Oslo Transect Subset Data	MANOVA candisc
Overdose	17	7	Overdose of Amitriptyline	MMRA cancor
Parenting	60	4	Father Parenting Competence	MANOVA contrasts
peng	333	8	Size measurements for adult foraging penguins near Palmer Station	MANOVA
Plastic	20	5	Plastic Film Data	MANOVA
Pottery2	48	12	Chemical Analysis of Romano-British Pottery	MANOVA candisc
Probe	11	5	Response Speed in a Probe Experiment	MANOVA repeated
RatWeight	27	6	Weight Gain in Rats Exposed to Thiouracil and Thyroxin	MANOVA repeated
ReactTime	10	6	Reaction Time Data	repeated
Rohwer	69	10	Rohwer Data Set	MMRA MANCOVA
RootStock	48	5	Growth of Apple Trees from Different Root Stocks	MANOVA contrasts
Sake	30	10	Taste Ratings of Japanese Rice Wine (Sake)	MMRA
schooldata	70	8	School Data	MMRA robust
Skulls	150	5	Egyptian Skulls	MANOVA contrasts
SocGrades	40	10	Grades in a Sociology Course	MANOVA candisc
SocialCog	139	5	Social Cognitive Measures in Psychiatric Groups	MANOVA candisc
TIPI	1799	16	Data on the Ten Item Personality Inventory	MANOVA candisc
VocabGrowth	64	4	Vocabulary growth data	repeated
WeightLoss	34	7	Weight Loss Data	repeated

Concept table

This table can be inverted to list the datasets that illustrate each concept:

concepts <- dsets |>
  select(dataset, tags) |>
  tidyr::separate_longer_delim(tags, delim = " ") |>
  arrange(tags, dataset) |>
  summarize(datasets = toString(dataset), .by = tags) |>
  rename(concept = tags)

#knitr::kable(concepts)
tinytable::tt(concepts) |> format_tt(markdown = TRUE)

concept	datasets
MANCOVA	Rohwer
MANOVA	AddHealth, Bees, Diabetes, FootHead, Headache, Iwasaki_Big_Five, MockJury, NeuroCog, Oslo, Parenting, Plastic, Pottery2, Probe, RatWeight, RootStock, Skulls, SocGrades, SocialCog, TIPI, dogfood, mathscore, oral, peng
MMRA	Adopted, Hernior, NLSY, Overdose, Rohwer, Sake, schooldata
cancor	Overdose
candisc	Hernior, MockJury, NeuroCog, Oslo, Pottery2, SocGrades, SocialCog, TIPI, dogfood
contrasts	FootHead, Parenting, RootStock, Skulls, dogfood
ordered	AddHealth
repeated	Adopted, Headache, Probe, RatWeight, ReactTime, VocabGrowth, WeightLoss
robust	schooldata
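For readers without tidyr, the same inversion can be sketched in base R. The three-row data frame below is a hand-copied excerpt of the dataset table, used only to keep the example self-contained; the names ending in _demo are illustrative.

```r
# Small excerpt of the dataset table (plain names, no hyperlinks)
dsets_demo <- data.frame(
  dataset = c("AddHealth", "dogfood", "Rohwer"),
  tags    = c("MANOVA ordered", "MANOVA contrasts candisc", "MMRA MANCOVA")
)

# One row per (dataset, concept) pair
tag_list  <- strsplit(dsets_demo$tags, " ", fixed = TRUE)
long_demo <- data.frame(
  dataset = rep(dsets_demo$dataset, lengths(tag_list)),
  concept = unlist(tag_list)
)

# Collapse the dataset names within each concept into one string
by_concept <- tapply(long_demo$dataset, long_demo$concept, toString)
by_concept
```

The order in which tapply() lists the concepts follows the locale's sort order, so it may differ from the table above.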

Michael Friendly

2025-12-01
