CASIdata provides the datasets from Efron & Hastie, Computer Age Statistical Inference, in an accessible R format for those who want to use them for teaching or study, or to reproduce or extend analyses from the book. They were downloaded from Trevor Hastie’s web site, http://hastie.su.domains/CASI_files/DATA/, but quite a few files were messy and required some processing to turn them into R datasets.
Even so, some of the datasets may require data cleaning, renaming of variables, re-shaping or other tidying steps to be useful for analysis. But that’s part of learning.
Installation
This package is not yet on CRAN. You can install it from this GitHub repo:
remotes::install_github("friendly/CASIdata")
Datasets included here
| Dataset | Title |
|---|---|
| DTI | DTI Brain Imaging Data |
| als | ALS Data |
| baseball | Baseball Batting Averages |
| bivnorm | Bivariate Normal Data |
| butterfly | Butterfly Species Data |
| cellinfusion | Cell Infusion Data |
| cholesterol | Cholesterol Data |
| diabetes | Diabetes Data |
| doseresponse | Dose Response Data |
| galaxy | Galaxy Data |
| haplotype | Human Ancestry Haplotype Data |
| insurance | Insurance Life Table Data |
| leukemia_small | Leukemia Gene Expression Data (Small) |
| ncog | NCOG Head and Neck Cancer Data |
| nodes | Lymph Nodes Cancer Data |
| pediatric | Pediatric Cancer Survival Data |
| police | Police Racial Bias Data |
| prostz | Prostate Cancer Z-values |
| student_score | Student Score Data |
| supernova | Type Ia Supernova Data |
| vasoconstriction | Vasoconstriction Data |
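Once the package is installed, the table above can be reproduced from within R. The following is a minimal sketch, not part of the package: `casi_dataset_index()` is a hypothetical helper name, and it assumes the datasets are registered in the package's data index so that `utils::data()` can find them.

```r
# Hypothetical helper (not exported by CASIdata): return a data frame of
# dataset names and titles for an installed package, or NULL if the
# package is not available.
casi_dataset_index <- function(pkg = "CASIdata") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    message(pkg, " is not installed; see the Installation section.")
    return(NULL)
  }
  # data(package = ) returns an object whose $results matrix has the
  # columns "Package", "LibPath", "Item" and "Title"
  res <- utils::data(package = pkg)$results
  data.frame(Dataset = res[, "Item"], Title = res[, "Title"],
             stringsAsFactors = FALSE)
}

idx <- casi_dataset_index()
if (!is.null(idx)) {
  print(idx)

  # Load one dataset into its own environment and inspect whatever
  # objects it creates (the object name is assumed, not guaranteed,
  # to match the dataset name)
  env <- new.env()
  utils::data("diabetes", package = "CASIdata", envir = env)
  for (nm in ls(env)) str(get(nm, envir = env))
}
```

Loading into a separate environment avoids overwriting an object of the same name in the workspace.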
Missing Datasets
The following dataset appears in data-raw/CASI-save.R but is not (yet) included in the package:
| Dataset | Reason |
|---|---|
| SPAM | Variable names need cleanup; requires mapping from UCI Spambase documentation |
See data-raw/missing-datasets.md for details on resolving this.
External Datasets (Not Included)
These large datasets are referenced in the book but not included in the package due to size constraints. They can be downloaded directly from the sources listed below.
CASI datasets (too large for CRAN)
- protein_kernel: 1708 x 1708 inner-product (kernel) matrix for human proteins (Section 19.6). Computed using a string kernel on bag-of-4-grams amino acid representations.
  - Source: https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt
  - Load in R:
    protein_kernel <- matrix(scan("http://hastie.su.domains/CASI_files/DATA/protein_kernel.txt", what=0), 1708, 1708)
- protein_label: Response labels (-1/+1) for the 1708 proteins (45 positives, 1663 negatives).
  - Source: https://hastie.su.domains/CASI_files/DATA/protein_label.txt
  - Load in R:
    protein_label <- scan("http://hastie.su.domains/CASI_files/DATA/protein_label.txt", what=0)
- prostmat: 6033 x 102 gene expression matrix comparing 50 controls vs 52 prostate cancer patients (Section 3.3).
  - Source: https://hastie.su.domains/CASI_files/DATA/prostmat.csv
  - Load in R:
    prostmat <- read.csv("http://hastie.su.domains/CASI_files/DATA/prostmat.csv")
  - Note: Column names need cleanup (see data-raw/missing-datasets.md for renaming code)
- leukemia_big: 7128 x 72 gene expression matrix (10 MB). A larger version of leukemia_small.
  - Source: https://hastie.su.domains/CASI_files/DATA/leukemia_big.csv
  - Load in R:
    leukemia_big <- read.csv("http://hastie.su.domains/CASI_files/DATA/leukemia_big.csv")
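Because these files are large, it may be convenient to download each one once and to check its size on reading. The sketch below is one possible approach, not something the package provides: `cache_download()` and `read_square_matrix()` are hypothetical helper names, and the code assumes the kernel file is a whitespace-separated list of numbers, as the `scan()` call above implies.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` does not
# already exist, and return the local path.
cache_download <- function(url, dest) {
  if (!file.exists(dest)) {
    utils::download.file(url, dest, mode = "wb")
  }
  dest
}

# Hypothetical helper: read n * n whitespace-separated numbers from `file`
# (a path or URL) into an n x n matrix, stopping if the count is wrong
# instead of silently recycling or truncating values.
read_square_matrix <- function(file, n) {
  x <- scan(file, what = numeric(), quiet = TRUE)
  if (length(x) != n * n) {
    stop("expected ", n * n, " values, found ", length(x))
  }
  matrix(x, nrow = n, ncol = n)
}

# Usage (needs network access, so it is left commented out):
# path <- cache_download(
#   "https://hastie.su.domains/CASI_files/DATA/protein_kernel.txt",
#   file.path(tempdir(), "protein_kernel.txt")
# )
# protein_kernel <- read_square_matrix(path, 1708)
# isSymmetric(protein_kernel)  # an inner-product matrix should be symmetric
```

`matrix()` fills column by column while the file is presumably written row by row; for a symmetric kernel matrix the two orders give the same result, which is why the `isSymmetric()` check is a useful sanity test after loading.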
