Data sets in the heplots package
Michael Friendly
2024-11-23
Source: vignettes/datasets.Rmd
Documenting package datasets
Datasets used in package examples are an important part of making a package understandable and usable, but they are often overlooked.
In developing the heplots package, I assembled a large collection of data sets illustrating a variety of multivariate linear models, with some analyses and graphical displays. Each of these has much more than the usual stub examples, which often look like:
data(dataset)
# str(dataset); plot(dataset)
But .Rd, and now roxygen, don’t make it easy to work with numerous datasets in a package, or, more importantly, to document what they illustrate. I’m showing the work to create this vignette, in case these ideas are useful to others.
In this release, I started with a file generated by:
vcdExtra::datasets("heplots") |> head(4)
#> Item class dim Title
#> 1 AddHealth data.frame 4344x3 Adolescent Mental Health Data
#> 2 Adopted data.frame 62x6 Adopted Children
#> 3 Bees data.frame 246x6 Captive and maltreated bees
#> 4 Diabetes data.frame 145x6 Diabetes Dataset
Then, in the roxygen documentation, I added @concept tags to classify these datasets according to the methods used.
(@concept entries are indexed with the package, so they work via help.search().)
For example, the documentation for the AddHealth data contains these lines:
#' @name AddHealth
#' @docType data
...
#' @keywords datasets
#' @concept MANOVA
#' @concept ordered
With standard processing, these concepts, along with the keywords, appear in the Index section of the manual constructed by devtools::build_manual(). In the pkgdown site for this package, they are also searchable in the search box.
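The concept entries can also be queried from the R console. The following is a minimal sketch, assuming the heplots package is installed; it restricts help.search() to the concept field and turns off fuzzy matching so that "MANOVA" does not also pick up near matches. The name hits is introduced here only for illustration.

```r
# Assumes heplots is installed; the block does nothing otherwise.
if (requireNamespace("heplots", quietly = TRUE)) {
  # Search only the \concept entries of the heplots help pages
  hits <- help.search("MANOVA",
                      fields  = "concept",
                      package = "heplots",
                      agrep   = FALSE)   # exact (regex) match, no fuzzy matching
  # One row per matching help page; list the distinct topics
  print(unique(hits$matches$Topic))
}
```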
With a bit of extra processing, I created the file datasets.csv, which is used below.
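The extra processing itself is not shown in this vignette. As a rough sketch of one way it could be done (not necessarily how datasets.csv was actually produced), the @name and @concept tags can be pulled out of the roxygen source lines with a regular expression. The helper get_tag() and the objects roxy and entry are hypothetical names used only for this illustration.

```r
# Roxygen lines modeled on the AddHealth excerpt shown earlier
roxy <- c(
  "#' @name AddHealth",
  "#' @docType data",
  "#' @keywords datasets",
  "#' @concept MANOVA",
  "#' @concept ordered"
)

# Hypothetical helper: return the values that follow a given roxygen tag
get_tag <- function(lines, tag) {
  pattern <- paste0("^#'[[:space:]]*@", tag, "[[:space:]]+")
  found <- grep(pattern, lines, value = TRUE)  # keep only lines with this tag
  trimws(sub(pattern, "", found))              # strip the tag, keep its value
}

# One row per dataset, with its concepts collapsed into a space-separated string
entry <- data.frame(
  dataset = get_tag(roxy, "name"),
  tags    = paste(get_tag(roxy, "concept"), collapse = " ")
)
entry
```

Applying the same helper to each dataset's documentation block and row-binding the results would give a table with the dataset and tags columns used in the sections that follow.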
Methods
The main methods used in the example datasets are listed below:
- MANOVA: Multivariate analysis of variance
- MANCOVA: Multivariate analysis of covariance
- MMRA: Multivariate multiple regression
- cancor: Canonical correlation (using the candisc package)
- candisc: Canonical discriminant analysis (using candisc)
- repeated: Repeated measures designs, analyzed from the multivariate perspective
- robust: Robust estimation of MLMs
In addition, a few examples illustrate special handling for linear hypotheses concerning factors:
- ordered: ordered factors
- contrasts: other contrasts
The dataset names are linked to the documentation with graphical output on the pkgdown website, http://friendly.github.io/heplots/.
Dataset table
library(here)
library(dplyr)
library(tinytable)
# reading via here::here() doesn't work in a vignette, so read the file from GitHub
#dsets <- read.csv(here::here("extra", "datasets.csv"))
dsets <- read.csv("https://raw.githubusercontent.com/friendly/heplots/master/extra/datasets.csv")
# drop the unneeded X column and sort by dataset name, ignoring case
dsets <- dsets |>
  dplyr::select(-X) |>
  arrange(tolower(dataset))
# link each dataset name to its pkgdown reference page
refurl <- "http://friendly.github.io/heplots/reference/"
dsets <- dsets |>
  mutate(dataset = glue::glue("[{dataset}]({refurl}{dataset}.html)"))
#knitr::kable(dsets)
tinytable::tt(dsets) |> format_tt(markdown = TRUE)
dataset | rows | cols | title | tags |
---|---|---|---|---|
AddHealth | 4344 | 3 | Adolescent Health Data | MANOVA ordered |
Adopted | 62 | 6 | Adopted Children | MMRA repeated |
Bees | 246 | 6 | Captive and maltreated bees | MANOVA |
Diabetes | 145 | 6 | Diabetes Dataset | MANOVA |
dogfood | 16 | 3 | Dogfood Preferences | MANOVA contrasts candisc |
FootHead | 90 | 7 | Head measurements of football players | MANOVA contrasts |
Headache | 98 | 6 | Treatment of Headache Sufferers for Sensitivity to Noise | MANOVA repeated |
Hernior | 32 | 9 | Recovery from Elective Herniorrhaphy | MMRA candisc |
Iwasaki_Big_Five | 203 | 7 | Personality Traits of Cultural Groups | MANOVA |
mathscore | 12 | 3 | Math scores for basic math and word problems | MANOVA |
MockJury | 114 | 17 | Effects Of Physical Attractiveness Upon Mock Jury Decisions | MANOVA candisc |
NeuroCog | 242 | 10 | Neurocognitive Measures in Psychiatric Groups | MANOVA candisc |
NLSY | 243 | 6 | National Longitudinal Survey of Youth Data | MMRA |
oral | 56 | 5 | Effect of Delay in Oral Practice in Second Language Learning | MANOVA |
Oslo | 332 | 14 | Oslo Transect Subset Data | MANOVA candisc |
Overdose | 17 | 7 | Overdose of Amitriptyline | MMRA cancor |
Parenting | 60 | 4 | Father Parenting Competence | MANOVA contrasts |
peng | 333 | 8 | Size measurements for adult foraging penguins near Palmer Station | MANOVA |
Plastic | 20 | 5 | Plastic Film Data | MANOVA |
Pottery2 | 48 | 12 | Chemical Analysis of Romano-British Pottery | MANOVA candisc |
Probe | 11 | 5 | Response Speed in a Probe Experiment | MANOVA repeated |
RatWeight | 27 | 6 | Weight Gain in Rats Exposed to Thiouracil and Thyroxin | MANOVA repeated |
ReactTime | 10 | 6 | Reaction Time Data | repeated |
Rohwer | 69 | 10 | Rohwer Data Set | MMRA MANCOVA |
RootStock | 48 | 5 | Growth of Apple Trees from Different Root Stocks | MANOVA contrasts |
Sake | 30 | 10 | Taste Ratings of Japanese Rice Wine (Sake) | MMRA |
schooldata | 70 | 8 | School Data | MMRA robust |
Skulls | 150 | 5 | Egyptian Skulls | MANOVA contrasts |
SocGrades | 40 | 10 | Grades in a Sociology Course | MANOVA candisc |
SocialCog | 139 | 5 | Social Cognitive Measures in Psychiatric Groups | MANOVA candisc |
TIPI | 1799 | 16 | Data on the Ten Item Personality Inventory | MANOVA candisc |
VocabGrowth | 64 | 4 | Vocabulary growth data | repeated |
WeightLoss | 34 | 7 | Weight Loss Data | repeated |
Concept table
This table can be inverted to list the datasets that illustrate each concept:
concepts <- dsets |>
  select(dataset, tags) |>
  tidyr::separate_longer_delim(tags, delim = " ") |>
  arrange(tags, dataset) |>
  summarize(datasets = toString(dataset), .by = tags) |>
  rename(concept = tags)
#knitr::kable(concepts)
tinytable::tt(concepts) |> format_tt(markdown = TRUE)
concept | datasets |
---|---|
MANCOVA | Rohwer |
MANOVA | AddHealth, Bees, Diabetes, FootHead, Headache, Iwasaki_Big_Five, MockJury, NeuroCog, Oslo, Parenting, Plastic, Pottery2, Probe, RatWeight, RootStock, Skulls, SocGrades, SocialCog, TIPI, dogfood, mathscore, oral, peng |
MMRA | Adopted, Hernior, NLSY, Overdose, Rohwer, Sake, schooldata |
cancor | Overdose |
candisc | Hernior, MockJury, NeuroCog, Oslo, Pottery2, SocGrades, SocialCog, TIPI, dogfood |
contrasts | FootHead, Parenting, RootStock, Skulls, dogfood |
ordered | AddHealth |
repeated | Adopted, Headache, Probe, RatWeight, ReactTime, VocabGrowth, WeightLoss |
robust | schooldata |
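For readers who prefer to avoid the tidyr and dplyr dependencies, the same inversion can be sketched in base R. The toy data frame below copies three rows from the dataset table; the names toy, tag_list and long are introduced here only for illustration.

```r
# Three rows copied from the dataset table
toy <- data.frame(
  dataset = c("AddHealth", "dogfood", "Rohwer"),
  tags    = c("MANOVA ordered", "MANOVA contrasts candisc", "MMRA MANCOVA")
)

# Split each tags string on spaces, then pair every tag with its dataset
tag_list <- strsplit(toy$tags, " ", fixed = TRUE)
long <- data.frame(
  dataset = rep(toy$dataset, lengths(tag_list)),
  concept = unlist(tag_list)
)

# Collapse the dataset names within each concept into one comma-separated string
concepts <- aggregate(dataset ~ concept, data = long, FUN = toString)
concepts
```

The row order of the result follows the sort order of the concept names in the current locale, so it may differ from the order in the table above.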