R Course

Linear Models

Alexandre Adalardo de Oliveira

Ecologia - IBUSP, April 2019

Concepts

Methodological unification

regression & ANOVA
dummy variables in linear models (LM)
building and interpreting the LM
model diagnostics

Classical Tests

Response	Predictors	Test	Hypothesis
Categorical	Categorical	Chi-squared	independence
Continuous	Categorical (2)	t-test	\(\mu_1 = \mu_2\)
Continuous	Categorical (>2)	ANOVA	\(\mu_1 = \mu_2 = \mu_3\)
Continuous	1 continuous	Regression	\(\beta_1 = 0\)
Continuous	>1 continuous	Multiple regression	\(\beta_1 = 0; \dots; \beta_n = 0\)
Continuous	Cont. + Cat.	ANCOVA	\(\beta_1 = \beta_2; \alpha_1 = \alpha_2\)
Proportion	Continuous	Logistic regression	\(\beta_1 = 0\) (on the logit scale)
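
The unification idea behind this table can be illustrated with a minimal sketch: a two-sample t-test and a linear model with a two-level factor test the same hypothesis. The data below are simulated for illustration only (the names `grp`, `y`, `tt` and `fit` are hypothetical and not part of the course material), and the t-test assumes equal variances so that it matches the linear model.

```r
# Hypothetical data: two groups of 10 observations each (simulated)
set.seed(1)
grp <- factor(rep(c("a", "b"), each = 10))
y <- c(rnorm(10, mean = 10, sd = 2), rnorm(10, mean = 13, sd = 2))

# Classical two-sample t-test assuming equal variances
tt <- t.test(y ~ grp, var.equal = TRUE)

# The same comparison written as a linear model with a two-level factor
fit <- lm(y ~ grp)
p_lm <- summary(fit)$coefficients["grpb", "Pr(>|t|)"]

# Both routes test mu_a = mu_b and give the same p-value
c(t_test = tt$p.value, lm = p_lm)
```

The coefficient `grpb` is the difference between the two group means, which is why its t-test coincides with the classical one.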

Analytical Toolkit


Linear Regression

The regression model

\[ y = \hat{\alpha} + \hat{\beta} x + \epsilon \] \[ \epsilon \sim N(0, \sigma) \]

Simulating data

set.seed(2)
(x1 = seq(1,5, by=0.5))

## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

y1 = 4 + 3 * x1 + rnorm(n= 9, mean= 0, sd= 2.5 ) 
y1

## [1]  4.757714  8.962123 13.969613  8.674061 12.799371 14.831051 17.769887
## [8] 16.900755 23.961185

Real data

Regression model

Estimating the parameters:

Linear regression

\[ y = \hat{\alpha} + \hat{\beta} x + \epsilon \]

Simple model: null

\[ y = \bar{y} ; \beta = 0 \]

Residuals and RSS

\[ d_i = y_i - \hat{y}_i \]

Minimum RSS

\[ RSS = \sum_i (y_i - \hat{y}_i)^2 \]
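
To make the idea of "minimum RSS" concrete, the sketch below searches numerically for the line with the smallest RSS and compares it with the closed-form least-squares fit from `lm()`. It reuses the simulation recipe from the slides; the helper `rss()` and the use of `optim()` are illustrative choices, not the method used in the course.

```r
# Simulated data as in the slides: y = 4 + 3x + N(0, 2.5)
set.seed(2)
x1 <- seq(1, 5, by = 0.5)
y1 <- 4 + 3 * x1 + rnorm(n = 9, mean = 0, sd = 2.5)

# RSS for a candidate line with intercept par[1] and slope par[2]
rss <- function(par, x, y) {
  y_hat <- par[1] + par[2] * x
  sum((y - y_hat)^2)
}

# Null model: slope fixed at 0, intercept at the mean of y
rss_null <- rss(c(mean(y1), 0), x1, y1)

# Numerical search for the line with minimum RSS (starting from the null model)
fit_num <- optim(par = c(mean(y1), 0), fn = rss, x = x1, y = y1)

# Closed-form least-squares solution used by lm()
fit_lm <- lm(y1 ~ x1)

rbind(optim = fit_num$par, lm = coef(fit_lm))
c(null = rss_null, optim = fit_num$value, lm = sum(residuals(fit_lm)^2))
```

The numerical search should land very close to the `lm()` coefficients, and both RSS values should be well below the RSS of the null model.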

Least squares (OLS), animated

[Figure: animation of the least-squares fit]

Method of Least Squares

Regression: simulated data

\[ y = \hat{\alpha} + \hat{\beta} x + \epsilon \]

Estimators vs. parameters

Linear model in R

xy <- data.frame(x1 = x1, y1 = y1)  # collect the simulated vectors in a data frame
lmxy <- lm(y1 ~ x1, data = xy)
summary(lmxy)

## 
## Call:
## lm(formula = y1 ~ x1, data = xy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0446 -1.2415 -0.7005  1.0564  4.1574 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1864     2.1097   1.036 0.334505    
## x1            3.8129     0.6459   5.903 0.000598 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.502 on 7 degrees of freedom
## Multiple R-squared:  0.8327, Adjusted R-squared:  0.8088 
## F-statistic: 34.84 on 1 and 7 DF,  p-value: 0.0005978

Gaussian residuals

\[ N(4 + 3x, 2.5) \equiv 4 + 3x + N(0, 2.5) \]

Linear model in R

\[ y = 4 + 3x + N(0, 2.5) \]

coef(lmxy)

## (Intercept)          x1 
##    2.186353    3.812911

confint(lmxy)

##                 2.5 %   97.5 %
## (Intercept) -2.802189 7.174894
## x1           2.285488 5.340333

summary(lmxy)$coefficients

##             Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 2.186353  2.1096552 1.036355 0.3345049717
## x1          3.812911  0.6459473 5.902820 0.0005977766

summary(lmxy)$sigma

## [1] 2.501743
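
The gap between the estimates (2.19 and 3.81) and the parameters used in the simulation (4 and 3) is sampling variation. A minimal sketch of this point, assuming the same generating model and an arbitrary number of repetitions (`n_sim` and the seed are illustrative assumptions):

```r
# Repeat the simulation many times to see how the estimates scatter
# around the parameters used to generate the data (alpha = 4, beta = 3).
set.seed(42)            # arbitrary seed, chosen only for reproducibility
x1 <- seq(1, 5, by = 0.5)
n_sim <- 1000           # number of simulated data sets (an assumption)

est <- replicate(n_sim, {
  y_sim <- 4 + 3 * x1 + rnorm(length(x1), mean = 0, sd = 2.5)
  coef(lm(y_sim ~ x1))
})

# est is a 2 x n_sim matrix: row 1 = intercepts, row 2 = slopes
rowMeans(est)           # averages should be close to 4 and 3
apply(est, 1, sd)       # spread of the estimates across simulations
```

Any single fit can miss the parameters by a fair margin, but the estimates are centred on them.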

Caterpillar Diet

Example: caterpillar diet

lag <- read.table("data/regression.txt", header=TRUE) 
str(lag)

## 'data.frame':    9 obs. of  2 variables:
##  $ growth: int  12 10 8 11 6 7 2 3 3
##  $ tannin: int  0 1 2 3 4 5 6 7 8

knitr::kable(head(lag))  # kable() comes from the knitr package

growth	tannin
12	0
10	1
8	2
11	3
6	4
7	5

Example: caterpillar diet

plot(growth ~ tannin, data = lag)

Linear model: caterpillars

lmlag <- lm(growth ~ tannin, data = lag)
summary(lmlag)

## 
## Call:
## lm(formula = growth ~ tannin, data = lag)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4556 -0.8889 -0.2389  0.9778  2.8944 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.7556     1.0408  11.295 9.54e-06 ***
## tannin       -1.2167     0.2186  -5.565 0.000846 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.693 on 7 degrees of freedom
## Multiple R-squared:  0.8157, Adjusted R-squared:  0.7893 
## F-statistic: 30.97 on 1 and 7 DF,  p-value: 0.0008461

Linear model: coefficients

coef(lmlag)

## (Intercept)      tannin 
##   11.755556   -1.216667

\[ y = \hat{\alpha} + \hat{\beta} x + \epsilon\]

Linear model: residuals (deviations or errors)


print(residuals(lmlag), digits = 2)

##     1     2     3     4     5     6     7     8     9 
##  0.24 -0.54 -1.32  2.89 -0.89  1.33 -2.46 -0.24  0.98

sqrt(sum(residuals(lmlag)^2)/(nrow(lag)-2))

## [1] 1.693358

ANOVA of the model

anova(lmlag)

## Analysis of Variance Table
## 
## Response: growth
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## tannin     1 88.817  88.817  30.974 0.0008461 ***
## Residuals  7 20.072   2.867                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Logic of ANOVA

ANOVA: partitioning the variation

[Figure: partitioning of the variation]

Regression: ANOVA

The logic of ANOVA

\[ SS_{total} = SS_{between} + SS_{within} \]
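
A minimal sketch of this partition for a categorical predictor, using hypothetical simulated data with three groups (the names `g`, `y` and the group means are illustrative assumptions): the between-group and within-group sums of squares add up to the total, and they are the values `anova()` reports.

```r
# Hypothetical data: three groups with five observations each
set.seed(10)
g <- factor(rep(c("g1", "g2", "g3"), each = 5))
y <- rnorm(15, mean = c(10, 12, 15)[g], sd = 2)

grand_mean <- mean(y)
group_mean <- ave(y, g)                 # each observation's own group mean

ss_total   <- sum((y - grand_mean)^2)
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((y - group_mean)^2)

c(total = ss_total, between_plus_within = ss_between + ss_within)

# The same partition as reported by anova() on the linear model
anova(lm(y ~ g))[["Sum Sq"]]
```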

The logic of regression

\[ SS_{total} = SS_{regr} + SS_{error} \]

Minimal model

nullag <- lm(growth ~ 1, data = lag)

Total squared deviations

\(SS_{total} = \sum_{i=1}^n (y_{i} - \bar{y})^2\)

(dt <- lag$growth - mean(lag$growth))

## [1]  5.1111111  3.1111111  1.1111111  4.1111111 -0.8888889  0.1111111
## [7] -4.8888889 -3.8888889 -3.8888889

dt^2

## [1] 26.12345679  9.67901235  1.23456790 16.90123457  0.79012346  0.01234568
## [7] 23.90123457 15.12345679 15.12345679

(ss_total <- sum(dt^2))

## [1] 108.8889

Linear model: caterpillar

lmlag <- lm(growth ~ tannin, data = lag)

Squared deviations of the ERROR

\(SS_{error} = \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2\)

(coeflag <- coef(lmlag))

## (Intercept)      tannin 
##   11.755556   -1.216667

(predlag <- coeflag[1] + coeflag[2] * lag$tannin)

## [1] 11.755556 10.538889  9.322222  8.105556  6.888889  5.672222  4.455556
## [8]  3.238889  2.022222

lag$growth

## [1] 12 10  8 11  6  7  2  3  3

Squared deviations of the ERROR

\[SS_{error} = \sum_{i=1}^n (y_{i} - \hat{y}_{i})^2\]

(ss_erro <- sum((lag$growth - predlag)^2))

## [1] 20.07222

The logic of regression

\[ SS_{total} = SS_{regr} + SS_{error} \]

ss_total
## [1] 108.8889
ss_erro
## [1] 20.07222
(ss_reg <- ss_total - ss_erro)
## [1] 88.81667

ANOVA table

Source	Sum Sq	Df	Mean Sq
Regression	88.82	1	88.82
Error	20.07	7	2.87
Total	108.89	8

Hypothesis test: F and \(r^2\)

(r2 <- ss_reg/ss_total)

## [1] 0.8156633

(flag <- ss_reg/(ss_erro/7))

## [1] 30.97398

1- pf(flag, 1, 7)

## [1] 0.0008460738

Regression in R: caterpillar

laglm <- lm(growth ~ tannin, data=lag)
anova(laglm)

## Analysis of Variance Table
## 
## Response: growth
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## tannin     1 88.817  88.817  30.974 0.0008461 ***
## Residuals  7 20.072   2.867                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing Models in R

Nested models

nullag <- lm(growth ~ 1, data = lag)
anova(nullag, laglm)

## Analysis of Variance Table
## 
## Model 1: growth ~ 1
## Model 2: growth ~ tannin
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1      8 108.889                                  
## 2      7  20.072  1    88.817 30.974 0.0008461 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
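
The F statistic in this table can be reproduced by hand from the residual sums of squares and degrees of freedom of the two nested models. The sketch below rebuilds the caterpillar data from the values printed by `str(lag)`; the object names `m0`, `m1`, `f_stat` and `p_val` are illustrative.

```r
# Caterpillar data as printed by str(lag) above
lag <- data.frame(growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3),
                  tannin = 0:8)

m0 <- lm(growth ~ 1, data = lag)        # null model
m1 <- lm(growth ~ tannin, data = lag)   # model with tannin

rss0 <- deviance(m0); df0 <- df.residual(m0)
rss1 <- deviance(m1); df1 <- df.residual(m1)

# F = (drop in RSS per parameter added) / (residual mean square of the larger model)
f_stat <- ((rss0 - rss1) / (df0 - df1)) / (rss1 / df1)
p_val  <- pf(f_stat, df0 - df1, df1, lower.tail = FALSE)

c(F = f_stat, p = p_val)
```

For a linear model, `deviance()` returns the RSS, so no manual prediction step is needed here.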

Comparing Models in R: caterpillar

anova(laglm)

	Df	Sum Sq	Mean Sq	F value	Pr(>F)
tannin	1	88.81667	88.81667	30.97398	0.0008461
Residuals	7	20.07222	2.86746

anova(nullag, laglm)

Res.Df	RSS	Df	Sum of Sq	F	Pr(>F)
8	108.88889
7	20.07222	1	88.81667	30.97398	0.0008461

Model Diagnostics

par( mfrow= c(2,2),mar=c(4,4.5,2,2), cex.lab=1.2, cex.axis=1.2, las=1, bg = "gray80", bty="l", pch=16)
plot(lmlag)
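
The four panels drawn by `plot()` on an `lm` object are built from a handful of per-observation quantities. As a rough sketch, they can be extracted and inspected directly; the data frame is rebuilt from the values printed by `str(lag)`, and the table name `diag_tab` is an illustrative choice.

```r
lag <- data.frame(growth = c(12, 10, 8, 11, 6, 7, 2, 3, 3), tannin = 0:8)
lmlag <- lm(growth ~ tannin, data = lag)

diag_tab <- data.frame(
  fitted   = fitted(lmlag),        # x-axis of "Residuals vs Fitted"
  resid    = residuals(lmlag),     # raw residuals
  std_res  = rstandard(lmlag),     # standardized residuals (Q-Q and Scale-Location panels)
  leverage = hatvalues(lmlag),     # x-axis of "Residuals vs Leverage"
  cook     = cooks.distance(lmlag) # influence of each observation
)
round(diag_tab, 2)
```

Looking at the numbers alongside the plots helps identify which observation a labelled point refers to.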

DON'T DESPAIR, HANG ON! KEEP CALM!!

Categorical Variable

ANOVA: the example

[Figure: the ANOVA example]

Representing the data

Regression on a categorical variable

solos <- read.table("/home/aao/Ale2016/AleCursos/Planejamento&Analise/dados/crop.csv", header = TRUE, as.is=TRUE, sep="\t")
str(solos)

## 'data.frame':    30 obs. of  2 variables:
##  $ solo : chr  "are" "are" "are" "are" ...
##  $ colhe: int  6 10 8 6 14 17 9 11 7 11 ...

lmsolos <- lm(colhe ~ solo, data = solos)
anova(lmsolos)

## Analysis of Variance Table
## 
## Response: colhe
##           Df Sum Sq Mean Sq F value  Pr(>F)  
## solo       2   99.2  49.600  4.2447 0.02495 *
## Residuals 27  315.5  11.685                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Regression on categorical variables

Dummy or indicator variables

soloslin <- solos[,c( "colhe", "solo")]
soloslin$solo

##  [1] "are" "are" "are" "are" "are" "are" "are" "are" "are" "are" "arg"
## [12] "arg" "arg" "arg" "arg" "arg" "arg" "arg" "arg" "arg" "hum" "hum"
## [23] "hum" "hum" "hum" "hum" "hum" "hum" "hum" "hum"

# "are" is the reference level: both indicators stay at 0 for it
soloslin$arg <- 0
soloslin$arg[solos$solo == "arg"] <- 1
soloslin$hum <- 0
soloslin$hum[solos$solo == "hum"] <- 1

Dummy or indicator variable

soloslin[c(1,2,3,11,12,13,21,22,23),]

	colhe	solo	arg	hum
1	6	are	0	0
2	10	are	0	0
3	8	are	0	0
11	17	arg	1	0
12	15	arg	1	0
13	3	arg	1	0
21	13	hum	0	1
22	16	hum	0	1
23	9	hum	0	1

Number of dummy variables: number of factor levels minus 1 (the reference level is represented by the intercept)
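
R builds these indicator columns automatically when a factor appears in a model formula. A minimal sketch with a hypothetical six-row factor that reuses the three soil labels (the vector `solo` below is made up for illustration, not the course data):

```r
# Hypothetical mini data set with the three soil labels used above
solo <- factor(c("are", "are", "arg", "arg", "hum", "hum"))

# model.matrix() shows the design matrix lm() would build for ~ solo
X <- model.matrix(~ solo)
X
```

The matrix has an intercept column plus one indicator per non-reference level, so three levels yield two dummy columns (`soloarg` and `solohum`).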

Linear model: dummy

Model

\(y = \alpha_{d_1} + \beta_{2} x_{d_2} + \beta_3 x_{d_3}\)

Intercept:

\(\alpha_{d_1} = \bar{y}_1\)

Coefficients:

\(\beta_{2} = \bar{y}_2 - \bar{y}_1\)

\(\beta_{3} = \bar{y}_3 - \bar{y}_1\)

Dummy regression

lmdum <- lm(colhe ~ arg + hum, soloslin)
summary(lmdum)

## 
## Call:
## lm(formula = colhe ~ arg + hum, data = soloslin)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -8.5   -1.8    0.3    1.7    7.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.900      1.081   9.158 9.04e-10 ***
## arg            1.600      1.529   1.047  0.30456    
## hum            4.400      1.529   2.878  0.00773 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 27 degrees of freedom
## Multiple R-squared:  0.2392, Adjusted R-squared:  0.1829 
## F-statistic: 4.245 on 2 and 27 DF,  p-value: 0.02495

Standard linear model

lmSolos <- lm(colhe~solo, data = solos)
summary(lmSolos)

## 
## Call:
## lm(formula = colhe ~ solo, data = solos)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -8.5   -1.8    0.3    1.7    7.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.900      1.081   9.158 9.04e-10 ***
## soloarg        1.600      1.529   1.047  0.30456    
## solohum        4.400      1.529   2.878  0.00773 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.418 on 27 degrees of freedom
## Multiple R-squared:  0.2392, Adjusted R-squared:  0.1829 
## F-statistic: 4.245 on 2 and 27 DF,  p-value: 0.02495

Model coefficients

coef(lmdum)

## (Intercept)         arg         hum 
##         9.9         1.6         4.4

tapply(solos$colhe, solos$solo, mean)

##  are  arg  hum 
##  9.9 11.5 14.3

\[y = \hat{\alpha}_{d_1} + \hat{\beta}_{2} x_{d_2}+ \hat{\beta}_3 x_{d_3}\]


This afternoon's activities

Until 4 pm:
- tutorial 7a
- handout (just the beginning!)
- questions about the previous units
After 4 pm:
- questions about the exercises

Back to regression

Weight ~ height

library(carData)  # provides the Davis data set (also loaded by the car package)
data(Davis)
str(Davis)

## 'data.frame':    200 obs. of  5 variables:
##  $ sex   : Factor w/ 2 levels "F","M": 2 1 1 2 1 2 2 2 2 2 ...
##  $ weight: int  77 58 53 68 59 76 76 69 71 65 ...
##  $ height: int  182 161 161 177 157 170 167 186 178 171 ...
##  $ repwt : int  77 51 54 70 59 76 77 73 71 64 ...
##  $ repht : int  180 159 158 175 155 165 165 180 175 170 ...

Regression plot:

Regression model

lmdavis <- lm(weight~height, data = Davis)
summary(lmdavis)

## 
## Call:
## lm(formula = weight ~ height, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.928  -5.406  -0.651   4.891  42.641 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -130.84185   12.30184  -10.64   <2e-16 ***
## height         1.15112    0.07193   16.00   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.635 on 178 degrees of freedom
## Multiple R-squared:  0.5899, Adjusted R-squared:  0.5876 
## F-statistic: 256.1 on 1 and 178 DF,  p-value: < 2.2e-16

Regression: weight ~ height

ANOVA

anova(lmdavis)

## Analysis of Variance Table
## 
## Response: weight
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## height      1  19095 19095.0  256.08 < 2.2e-16 ***
## Residuals 178  13273    74.6                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

davisNull <- lm(weight ~ 1, data = Davis)  # null model: intercept only
anova(davisNull, lmdavis)

## Analysis of Variance Table
## 
## Model 1: weight ~ 1
## Model 2: weight ~ height
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    179 32368                                  
## 2    178 13273  1     19095 256.08 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

\(p\text{-value} < 2.2 \times 10^{-16}\)

\(r^2 = 0.59\)

Regression model:

## 
## Call:
## lm(formula = weight ~ height + sex, data = Davis)
## 
## Coefficients:
## (Intercept)       height         sexM  
##    -80.2107       0.8341       7.7070

sex: dummy variable with two levels (woman = 0, man = 1)

lmdavis01 <- lm(weight~ height + sex, data = Davis)
summary(lmdavis01)

## 
## Call:
## lm(formula = weight ~ height + sex, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.302  -4.808  -0.335   5.239  41.366 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -80.2107    16.8415  -4.763 3.96e-06 ***
## height        0.8341     0.1021   8.169 5.71e-14 ***
## sexM          7.7070     1.8345   4.201 4.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.258 on 177 degrees of freedom
## Multiple R-squared:  0.6271, Adjusted R-squared:  0.6229 
## F-statistic: 148.8 on 2 and 177 DF,  p-value: < 2.2e-16

lm(weight ~ height + sex, data = Davis)

coeflm01<- coef(lmdavis01)
coeflm01

## (Intercept)      height        sexM 
## -80.2107328   0.8340964   7.7070166

Model prediction

Woman (\(sex = 0\))

\[w_f = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height\] \[w_f = \hat{\alpha} + \hat{\beta}_h \, height\]

Man (\(sex = 1\))

\[w_h = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height\] \[w_h = \hat{\alpha} + \hat{\beta}_s + \hat{\beta}_h \, height\]
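
These two equations can be evaluated directly from the coefficients printed by `coef(lmdavis01)` above. The sketch below hard-codes those printed values; the helper `pred_additive()` is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the output of coef(lmdavis01) above
coeflm01 <- c("(Intercept)" = -80.2107328, height = 0.8340964, sexM = 7.7070166)

# Hypothetical helper: predicted weight for a given height (cm) and sex (0 = woman, 1 = man)
pred_additive <- function(height, sex, b = coeflm01) {
  unname(b["(Intercept)"] + b["sexM"] * sex + b["height"] * height)
}

pred_additive(161, sex = 0)   # woman, 161 cm
pred_additive(182, sex = 1)   # man, 182 cm

# At any height the two lines differ by the sexM coefficient (parallel lines)
pred_additive(170, sex = 1) - pred_additive(170, sex = 0)
```

Because the model has no interaction term, the difference between the sexes is the same at every height.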

lm(weight ~ height + sex)

Interaction

lmdavisfull <- lm(weight ~ height + sex + sex:height, data = Davis)

# Equivalent specification: sex*height expands to sex + height + sex:height
lmdavisfull <- lm(weight ~ height + sex*height, data = Davis)
summary(lmdavisfull)

## 
## Call:
## lm(formula = weight ~ height + sex * height, data = Davis)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.990  -4.548  -0.926   4.821  41.023 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.7988    24.8453  -1.843   0.0670 .  
## height        0.6252     0.1507   4.148 5.22e-05 ***
## sexM        -57.4326    34.8293  -1.649   0.1009    
## height:sexM   0.3815     0.2037   1.873   0.0628 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.2 on 176 degrees of freedom
## Multiple R-squared:  0.6344, Adjusted R-squared:  0.6282 
## F-statistic: 101.8 on 3 and 176 DF,  p-value: < 2.2e-16

coef(lmdavisfull)

## (Intercept)      height        sexM height:sexM 
## -45.7988220   0.6252035 -57.4326307   0.3815088

Woman (\(sex = 0\))

\[w = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height + \hat{\beta}_{h:s} \, sex \cdot height\] \[w_f = \hat{\alpha} + \hat{\beta}_h \, height\]

Man (\(sex = 1\))

\[w = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height + \hat{\beta}_{h:s} \, sex \cdot height\] \[w_h = \hat{\alpha} + \hat{\beta}_s + (\hat{\beta}_h + \hat{\beta}_{h:s}) \, height\]

Model prediction

A woman 161 cm tall

\[w = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height + \hat{\beta}_{h:s} \, sex \cdot height\] \[sex = 0\]

(coefull <- coef(lmdavisfull))

## (Intercept)      height        sexM height:sexM 
## -45.7988220   0.6252035 -57.4326307   0.3815088

predMulher <- coefull[1] + coefull[2] * 161
(predMulher <- as.numeric(predMulher))

## [1] 54.85893

Predicted by the LM

A woman 161 cm tall has a predicted weight of 54.86 kg.

Model prediction

A man 182 cm tall

\[w = \hat{\alpha} + \hat{\beta}_s \, sex + \hat{\beta}_h \, height + \hat{\beta}_{h:s} \, sex \cdot height\] \[sex = 1\]

coefull

## (Intercept)      height        sexM height:sexM 
## -45.7988220   0.6252035 -57.4326307   0.3815088

predHomem <- (coefull[1]+ coefull[3]) + (coefull[2]
               + coefull[4]) * 182 

(predHomem <- as.numeric(predHomem))

## [1] 79.99018

Prediction for the man

A man 182 cm tall has a predicted weight of 79.99 kg.
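
The coefficient arithmetic above can also be delegated to `predict()`, which avoids keeping track of coefficient positions. The sketch below uses hypothetical simulated data that only loosely resemble the height and weight example (the data frame `d` and all generating values are assumptions), and checks that both routes give the same number.

```r
# Hypothetical data that merely resemble the height/weight example
set.seed(7)
n <- 100
d <- data.frame(
  sex    = factor(rep(c("F", "M"), each = n / 2)),
  height = c(rnorm(n / 2, 165, 6), rnorm(n / 2, 178, 6))
)
d$weight <- -45 + 0.6 * d$height +
  ifelse(d$sex == "M", -57 + 0.4 * d$height, 0) + rnorm(n, sd = 8)

fit <- lm(weight ~ height * sex, data = d)
b <- coef(fit)

# By hand: a man of 182 cm uses both the sexM and the height:sexM terms
by_hand <- unname((b["(Intercept)"] + b["sexM"]) +
                  (b["height"] + b["height:sexM"]) * 182)

# With predict(): the same value, without index bookkeeping
new_obs <- data.frame(height = 182, sex = factor("M", levels = levels(d$sex)))
by_predict <- unname(predict(fit, newdata = new_obs))

c(by_hand = by_hand, by_predict = by_predict)
```

Referring to coefficients by name rather than by position also protects the calculation if the order of terms in the formula changes.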

Which is the best model?

Principle of parsimony (Occam's razor)

models should have as few parameters as possible
linear is better than non-linear
rely on fewer assumptions
simplify down to the minimal adequate model
simpler explanations are preferable

Model simplification

From the full model to the minimal adequate model

fit the maximal (full) model
simplify the model:
- inspect the coefficients (summary)
- remove non-significant terms
order of term removal:
- non-significant interactions (highest order first)
- quadratic or other non-linear terms
- non-significant explanatory variables
- pool factor levels that do not differ
- ANCOVA: set non-significant intercepts to 0

Model simplification: continued

Compare the previous model with the simplified one

The difference is not significant:

* keep the simpler model
* continue simplifying

The difference is significant:

* keep the more complex model
* this is the MINIMAL ADEQUATE model
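
The procedure can be sketched with `update()`, which removes one term at a time, and `anova()`, which compares each model with its simplified version. The data below are simulated under the assumption of no true interaction; all object names and generating values are hypothetical.

```r
# Hypothetical data: weight depends on height and sex, with no true interaction
set.seed(3)
n <- 120
d <- data.frame(
  sex    = factor(rep(c("F", "M"), each = n / 2)),
  height = rnorm(n, mean = 170, sd = 8)
)
d$weight <- -80 + 0.8 * d$height + 8 * (d$sex == "M") + rnorm(n, sd = 8)

m_full <- lm(weight ~ height * sex, data = d)         # maximal model
m_add  <- update(m_full, . ~ . - height:sex)          # drop the interaction
m_h    <- update(m_add,  . ~ . - sex)                 # drop sex as well

# Each call compares a model with its simplified version (nested models)
cmp1 <- anova(m_add, m_full)
cmp2 <- anova(m_h, m_add)
cmp1
cmp2
```

Simplification stops at the first comparison in which removing a term makes the fit significantly worse; the more complex model of that pair is retained.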

Simplifying the model: example

anova(lmdavisfull, lmdavis01)

## Analysis of Variance Table
## 
## Model 1: weight ~ height + sex * height
## Model 2: weight ~ height + sex
##   Res.Df   RSS Df Sum of Sq      F  Pr(>F)  
## 1    176 11833                              
## 2    177 12069 -1   -235.82 3.5075 0.06275 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Simplifying the model: example

anova(lmdavis01, lmdavis)

## Analysis of Variance Table
## 
## Model 1: weight ~ height + sex
## Model 2: weight ~ height
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    177 12069                                 
## 2    178 13273 -1   -1203.5 17.65 4.204e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Minimal adequate model

coef(lmdavis01)

## (Intercept)      height        sexM 
## -80.2107328   0.8340964   7.7070166

confint(lmdavis01)

##                  2.5 %     97.5 %
## (Intercept) -113.44661 -46.974852
## height         0.63259   1.035603
## sexM           4.08671  11.327323

Model diagnostics: plot(model)

par(mfrow = c(2,2))
plot(lmdavis01)
