
Documenting package datasets

Datasets used in package examples are an important part of making a package understandable and usable, but they are often overlooked. In developing the heplots package I assembled a large collection of datasets illustrating a variety of multivariate linear models, with some analyses and graphical displays. Each of these has much more than the usual stub examples, which often look like:

data(dataset)
# str(dataset); plot(dataset)

But .Rd, and now roxygen, don’t make it easy to work with numerous datasets in a package, or, more importantly, to document what they illustrate. I’m showing the work to create this vignette, in case these ideas are useful to others.

In this release, I started with a file generated by:

vcdExtra::datasets("heplots") |> head(4)
#>        Item      class    dim                         Title
#> 1 AddHealth data.frame 4344x3 Adolescent Mental Health Data
#> 2   Adopted data.frame   62x6              Adopted Children
#> 3      Bees data.frame  246x6   Captive and maltreated bees
#> 4  Diabetes data.frame  145x6              Diabetes Dataset

Then, in the roxygen documentation, I added @concept tags to classify these datasets according to the methods used. (@concept entries are indexed with the package, so they can be found via help.search().) For example, the documentation for the AddHealth data contains these lines:

#' @name AddHealth
#' @docType data
 ...
#' @keywords datasets
#' @concept MANOVA
#' @concept ordered

With standard processing, these concepts, along with the keywords, appear in the Index section of the manual constructed by devtools::build_manual(). In the pkgdown site for this package, they are also searchable in the search box.
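As a rough illustration of querying these concept entries from the R console, the sketch below calls help.search() restricted to the concept field. It assumes heplots is installed; the variable name hits is only illustrative, and agrep = FALSE is set here so that "MANOVA" is not fuzzily matched to "MANCOVA".

```r
# Sketch: find help topics in heplots tagged with the concept "MANOVA".
# Assumes heplots is installed; otherwise `hits` is left as NULL.
hits <- NULL
if (requireNamespace("heplots", quietly = TRUE)) {
  hits <- help.search("MANOVA",
                      fields  = "concept",   # search only concept entries
                      package = "heplots",
                      agrep   = FALSE)       # regular-expression, non-fuzzy matching
  # topic names of the matching help pages
  print(unique(hits$matches[, "Topic"]))
}
```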

With a bit of extra processing, I created a file, datasets.csv, which is used below.
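The extra processing itself is not shown here. One way such a step might look, purely as a sketch, is to pull the @name and @concept tags out of each roxygen block and collapse the concepts into a single space-separated tags string. The helper extract_concepts() below is hypothetical and not part of heplots or roxygen2; it uses only base R.

```r
# Hypothetical helper: extract the dataset name and @concept tags
# from the lines of one roxygen block.
extract_concepts <- function(roxy_lines) {
  # keep only roxygen comment lines and strip the leading "#' "
  body <- sub("^#'\\s?", "", grep("^#'", roxy_lines, value = TRUE))
  name     <- sub("^@name\\s+",    "", grep("^@name\\s+",    body, value = TRUE))
  concepts <- sub("^@concept\\s+", "", grep("^@concept\\s+", body, value = TRUE))
  # one row per dataset, concepts joined by a single space
  data.frame(dataset = name,
             tags    = paste(trimws(concepts), collapse = " "))
}

# The AddHealth lines quoted above (without the elided part)
roxy <- c(
  "#' @name AddHealth",
  "#' @docType data",
  "#' @keywords datasets",
  "#' @concept MANOVA",
  "#' @concept ordered"
)
one_row <- extract_concepts(roxy)
one_row
```

Rows built this way could then be merged with the listing from vcdExtra::datasets() to pick up the dimensions and titles, though that is an assumption about the workflow rather than a description of it.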

Methods

The main methods used in the example datasets are listed below:

  • MANOVA: Multivariate analysis of variance
  • MANCOVA: Multivariate analysis of covariance
  • MMRA: Multivariate multiple regression
  • cancor: Canonical correlation (using the candisc package)
  • candisc: Canonical discriminant analysis (using candisc)
  • repeated: Repeated measures designs, analyzed from the multivariate perspective
  • robust: Robust estimation of MLMs

In addition, a few examples illustrate special handling for linear hypotheses concerning factors:

  • ordered: ordered factors
  • contrasts: other contrasts

The dataset names are linked to the documentation with graphical output on the pkgdown website, [http://friendly.github.io/heplots/].

Dataset table

library(here)
library(dplyr)
library(tinytable)
#dsets <- read.csv(here::here("extra", "datasets.csv"))  # doesn't work in a vignette
dsets <- read.csv("https://raw.githubusercontent.com/friendly/heplots/master/extra/datasets.csv")
dsets <- dsets |> 
  dplyr::select(-X) |> 
  arrange(tolower(dataset))

# link dataset to pkgdown doc
refurl <- "http://friendly.github.io/heplots/reference/"

dsets <- dsets |>
  mutate(dataset = glue::glue("[{dataset}]({refurl}{dataset}.html)")) 

#knitr::kable(dsets)
tinytable::tt(dsets)  |> format_tt(markdown = TRUE)
| dataset | rows | cols | title | tags |
|---|---|---|---|---|
| AddHealth | 4344 | 3 | Adolescent Health Data | MANOVA ordered |
| Adopted | 62 | 6 | Adopted Children | MMRA repeated |
| Bees | 246 | 6 | Captive and maltreated bees | MANOVA |
| Diabetes | 145 | 6 | Diabetes Dataset | MANOVA |
| dogfood | 16 | 3 | Dogfood Preferences | MANOVA contrasts candisc |
| FootHead | 90 | 7 | Head measurements of football players | MANOVA contrasts |
| Headache | 98 | 6 | Treatment of Headache Sufferers for Sensitivity to Noise | MANOVA repeated |
| Hernior | 32 | 9 | Recovery from Elective Herniorrhaphy | MMRA candisc |
| Iwasaki_Big_Five | 203 | 7 | Personality Traits of Cultural Groups | MANOVA |
| mathscore | 12 | 3 | Math scores for basic math and word problems | MANOVA |
| MockJury | 114 | 17 | Effects Of Physical Attractiveness Upon Mock Jury Decisions | MANOVA candisc |
| NeuroCog | 242 | 10 | Neurocognitive Measures in Psychiatric Groups | MANOVA candisc |
| NLSY | 243 | 6 | National Longitudinal Survey of Youth Data | MMRA |
| oral | 56 | 5 | Effect of Delay in Oral Practice in Second Language Learning | MANOVA |
| Oslo | 332 | 14 | Oslo Transect Subset Data | MANOVA candisc |
| Overdose | 17 | 7 | Overdose of Amitriptyline | MMRA cancor |
| Parenting | 60 | 4 | Father Parenting Competence | MANOVA contrasts |
| peng | 333 | 8 | Size measurements for adult foraging penguins near Palmer Station | MANOVA |
| Plastic | 20 | 5 | Plastic Film Data | MANOVA |
| Pottery2 | 48 | 12 | Chemical Analysis of Romano-British Pottery | MANOVA candisc |
| Probe | 11 | 5 | Response Speed in a Probe Experiment | MANOVA repeated |
| RatWeight | 27 | 6 | Weight Gain in Rats Exposed to Thiouracil and Thyroxin | MANOVA repeated |
| ReactTime | 10 | 6 | Reaction Time Data | repeated |
| Rohwer | 69 | 10 | Rohwer Data Set | MMRA MANCOVA |
| RootStock | 48 | 5 | Growth of Apple Trees from Different Root Stocks | MANOVA contrasts |
| Sake | 30 | 10 | Taste Ratings of Japanese Rice Wine (Sake) | MMRA |
| schooldata | 70 | 8 | School Data | MMRA robust |
| Skulls | 150 | 5 | Egyptian Skulls | MANOVA contrasts |
| SocGrades | 40 | 10 | Grades in a Sociology Course | MANOVA candisc |
| SocialCog | 139 | 5 | Social Cognitive Measures in Psychiatric Groups | MANOVA candisc |
| TIPI | 1799 | 16 | Data on the Ten Item Personality Inventory | MANOVA candisc |
| VocabGrowth | 64 | 4 | Vocabulary growth data | repeated |
| WeightLoss | 34 | 7 | Weight Loss Data | repeated |

Concept table

This table can be inverted to list the datasets that illustrate each concept:

concepts <- dsets |>
  select(dataset, tags) |>
  tidyr::separate_longer_delim(tags, delim = " ") |>
  arrange(tags, dataset) |>
  summarize(datasets = toString(dataset), .by = tags) |>
  rename(concept = tags)

#knitr::kable(concepts)
tinytable::tt(concepts) |> format_tt(markdown = TRUE)
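To make the inversion step concrete, here is a minimal base-R sketch of the same idea, applied to three rows copied from the dataset table above. It is not the vignette's code, only an equivalent under the assumption that tags are separated by single spaces; the object names small, long and inverted are illustrative.

```r
# Three rows copied from the dataset table (names and tags only)
small <- data.frame(
  dataset = c("AddHealth", "Adopted", "Bees"),
  tags    = c("MANOVA ordered", "MMRA repeated", "MANOVA")
)

# split each tag string into its individual tags
tag_list <- strsplit(small$tags, " ", fixed = TRUE)

# long format: one row per (dataset, concept) pair
long <- data.frame(
  dataset = rep(small$dataset, lengths(tag_list)),
  concept = unlist(tag_list)
)

# collapse to one row per concept, listing its datasets
by_concept <- tapply(long$dataset, long$concept, toString)
inverted <- data.frame(
  concept  = names(by_concept),
  datasets = as.vector(by_concept)
)
inverted
```

Each concept ends up on its own row, with the datasets that carry that tag joined by commas, which mirrors what separate_longer_delim() followed by summarize() does in the tidyverse version.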