library(dplyr)
library(ggplot2)
library(car)
library(effects)
library(palmerpenguins)
Remove units from variable names (for display purposes). Character variables should be factors.
peng <- penguins %>%
rename(
bill_length = bill_length_mm,
bill_depth = bill_depth_mm,
flipper_length = flipper_length_mm,
body_mass = body_mass_g
) %>%
mutate(species = as.factor(species),
island = as.factor(island),
sex = as.factor(substr(sex,1,1))) %>%
filter(!is.na(bill_depth),
!is.na(sex))
str(peng)
## tibble [333 x 8] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length : num [1:333] 39.1 39.5 40.3 36.7 39.3 38.9 39.2 41.1 38.6 34.6 ...
## $ bill_depth : num [1:333] 18.7 17.4 18 19.3 20.6 17.8 19.6 17.6 21.2 21.1 ...
## $ flipper_length: int [1:333] 181 186 195 193 190 181 195 182 191 198 ...
## $ body_mass : int [1:333] 3750 3800 3250 3450 3650 3625 4675 3200 3800 4400 ...
## $ sex : Factor w/ 2 levels "f","m": 2 1 1 1 2 1 2 1 2 2 ...
## $ year : int [1:333] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
ggplotColours <- function(n = 6, h = c(0, 360) + 15){
if ((diff(h) %% 360) < 1) h[2] <- h[2] - 360/n
hcl(h = (seq(h[1], h[2], length = n)), c = 100, l = 65)
}
car::scatterplotMatrix
with data ellipses and regression lines by species
scatterplotMatrix(~ bill_length + bill_depth + flipper_length + body_mass | species,
data=peng,
col = ggplotColours(3),
ellipse=list(levels=0.68))
Assume our main interest is in understanding how bill_length
is related to other variables. We can easily include categorical predictors (sex
and island
) using color and faceting.
ggplot(peng, aes(x=body_mass, y=bill_length, color=sex)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ x") +
facet_grid(~ island) +
theme_bw()
Fitting a quadratic with geom_smooth()
is one simple way to check for non-linearity. Another is to use a method="loess"
smoother.
ggplot(peng, aes(x=body_mass, y=bill_length, color=sex)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ poly(x,2)") +
facet_grid(~ island) +
theme_bw()
Do the same, but facet by island & species
ggplot(peng, aes(x=body_mass, y=bill_length, color=sex, shape=species)) +
geom_point() +
geom_smooth(method="lm", formula = "y ~ poly(x,2)") +
facet_wrap(island ~ species) +
theme_bw()
bill_length
Here, use body_mass
as main quantitative predictor, but control for sex
, species
and island
peng.mod0 <- lm(bill_length ~ body_mass + sex + species + island, data=peng)
Anova(peng.mod0)
## Anova Table (Type II tests)
##
## Response: bill_length
## Sum Sq Df F value Pr(>F)
## body_mass 100 1 19.61 1.3e-05 ***
## sex 251 1 49.12 1.4e-11 ***
## species 3572 2 349.25 < 2e-16 ***
## island 9 2 0.89 0.41
## Residuals 1667 326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
peng.mod1 <- lm(bill_length ~ body_mass * sex + species + island, data=peng)
Anova(peng.mod1)
## Anova Table (Type II tests)
##
## Response: bill_length
## Sum Sq Df F value Pr(>F)
## body_mass 100 1 19.55 1.3e-05 ***
## sex 251 1 48.97 1.5e-11 ***
## species 3527 2 343.76 < 2e-16 ***
## island 9 2 0.89 0.41
## body_mass:sex 0 1 0.00 0.96
## Residuals 1667 325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This is a simple screening test to check whether any interactions with body_mass
are important.
peng.mod2 <- lm(bill_length ~ body_mass * (sex + species + island), data=peng)
Anova(peng.mod2)
## Anova Table (Type II tests)
##
## Response: bill_length
## Sum Sq Df F value Pr(>F)
## body_mass 100 1 19.69 1.3e-05 ***
## sex 256 1 50.25 8.6e-12 ***
## species 3383 2 331.98 < 2e-16 ***
## island 9 2 0.91 0.40
## body_mass:sex 5 1 1.04 0.31
## body_mass:species 22 2 2.12 0.12
## body_mass:island 2 2 0.17 0.84
## Residuals 1636 321
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova()
tests the additional contribution of each model over the previous one.
anova(peng.mod0, peng.mod1, peng.mod2)
## Analysis of Variance Table
##
## Model 1: bill_length ~ body_mass + sex + species + island
## Model 2: bill_length ~ body_mass * sex + species + island
## Model 3: bill_length ~ body_mass * (sex + species + island)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 326 1667
## 2 325 1667 1 0.01 0.00 0.96
## 3 321 1636 4 31.44 1.54 0.19
By default, this plots the mean response (bill_length
) all the high-order terms in the model (species
, island
, and body_mass * sex
), averaged over all other variables not shown in a given plot.
peng.eff <- allEffects(peng.mod1)
plot(peng.eff)