CASIdata provides the datasets from Efron & Hastie, Computer Age Statistical Inference, in an accessible R format for those who want to use them for teaching or study, or to reproduce or extend analyses from the book. They were downloaded from Trevor Hastie’s web site, http://hastie.su.domains/CASI_files/DATA/, but quite a few files were messy and required some processing to turn into R datasets.

Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.

Installation

This package is not yet on CRAN. You can install it from this GitHub repo:

# install.packages("remotes")  # if remotes is not already installed
remotes::install_github("friendly/CASIdata")

Datasets included here

Dataset Title
DTI DTI Brain Imaging Data
als ALS Data
baseball Baseball Batting Averages
bivnorm Bivariate Normal Data
butterfly Butterfly Species Data
cellinfusion Cell Infusion Data
cholesterol Cholesterol Data
diabetes Diabetes Data
doseresponse Dose Response Data
galaxy Galaxy Data
haplotype Human Ancestry Haplotype Data
insurance Insurance Life Table Data
leukemia_small Leukemia Gene Expression Data (Small)
ncog NCOG Head and Neck Cancer Data
nodes Lymph Nodes Cancer Data
pediatric Pediatric Cancer Survival Data
police Police Racial Bias Data
prostz Prostate Cancer Z-values
student_score Student Score Data
supernova Type Ia Supernova Data
vasoconstriction Vasoconstriction Data

Missing Datasets

The following dataset appears in data-raw/CASI-save.R but is not (yet) included in the package:

Dataset Reason
SPAM Variable names need cleanup; requires mapping from UCI Spambase documentation

See data-raw/missing-datasets.md for details on resolving this.

External Datasets (Not Included)

These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.

CASI datasets (too large for CRAN)

  • protein_kernel: 1708 x 1708 inner-product (kernel) matrix for human proteins (Section 19.6). Computed using a string kernel on bag-of-4-grams amino acid representations.
  • protein_label: Response labels (-1/+1) for the 1708 proteins (45 positives, 1663 negatives).
  • prostmat: 6033 x 102 gene expression matrix comparing 50 controls vs 52 prostate cancer patients (Section 3.3).
  • leukemia_big: 7128 x 72 gene expression matrix (10MB). A larger version of leukemia_small.
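
As a rough sketch of fetching one of these files with base R, the helpers below build a URL under the DATA directory quoted in the introduction and read the file into a data frame. The helper names `casi_url()` and `read_casi()`, the `.csv` extension, and the idea that each file is named after its dataset are assumptions made for illustration; check the directory listing for the actual file names before relying on them.

```r
# Base URL quoted in the introduction of this README.
casi_base <- "http://hastie.su.domains/CASI_files/DATA/"

# Hypothetical helper: build the URL for one file in that directory.
# The file name and extension are assumptions, not confirmed by this README.
casi_url <- function(name, ext = "csv") {
  paste0(casi_base, name, ".", ext)
}

# Hypothetical helper: download a file to a temporary location and read it.
# Extra arguments are passed on to read.csv().
read_casi <- function(name, ext = "csv", ...) {
  dest <- tempfile(fileext = paste0(".", ext))
  download.file(casi_url(name, ext), dest, mode = "wb")
  read.csv(dest, ...)
}

# Not run here (needs network access and a correct file name):
# prostmat <- read_casi("prostmat")
# dim(prostmat)
```

Downloading to a temporary file first keeps the large files out of the working directory; save the result with `saveRDS()` if you want to reuse it across sessions.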

Image datasets (hosted externally)

Variable Renaming

Some datasets had variables renamed for clarity:

Dataset Original Renamed
butterfly x, y k, count
police X2.411 z
prostz X1.47236666651029 z
galaxy Reshaped from wide to long format with mag, red, freq
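
The two kinds of change in this table, renaming an auto-generated column and reshaping wide data to long, can be sketched in base R as follows. The toy data frame and matrix are made-up stand-ins, not the real police or galaxy data, and the layout of the original galaxy file is assumed here rather than taken from the package sources.

```r
# Toy stand-in for an imported file whose single column received an
# auto-generated name (hypothetical values, not the real police data).
raw <- data.frame(X2.411 = c(0.5, -1.2, 2.0))

# Rename that column to z, as listed in the table above.
names(raw)[names(raw) == "X2.411"] <- "z"

# Toy wide table: rows indexed by one variable, columns by another,
# cells holding counts (assumed layout, for illustration only).
wide <- matrix(1:6, nrow = 2,
               dimnames = list(mag = c("m1", "m2"),
                               red = c("r1", "r2", "r3")))

# as.table() followed by as.data.frame() yields one row per cell,
# with the dimension names as columns and the cell value in freq.
long <- as.data.frame(as.table(wide), responseName = "freq")
```

Going through `as.table()` avoids any extra dependency; `tidyr::pivot_longer()` would be a reasonable alternative for data that already sits in a data frame.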

Example

No examples yet.

library(CASIdata)
## basic example code
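
Until worked examples are added, a minimal way to explore the package is to list its datasets and inspect one of them. This sketch uses only base R functions and the dataset name butterfly from the table above; the `has_casi` variable is introduced here for illustration, and the guard simply skips the code when CASIdata is not installed.

```r
# TRUE if the package is installed, FALSE otherwise (no error either way).
has_casi <- requireNamespace("CASIdata", quietly = TRUE)

if (has_casi) {
  # List the names and titles of the datasets shipped with the package.
  print(data(package = "CASIdata")$results[, c("Item", "Title")])

  # Load one dataset by name and look at its structure.
  data("butterfly", package = "CASIdata")
  str(butterfly)
}
```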