---
title: "Data collection 2/2"
author: "The ASTA team"
output:
  slidy_presentation:
    css: "https://asta.math.aau.dk/course/asta/2018-2/?file=lecture_style.css"
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic", "tibble", "dplyr", "ggplot2"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

```{r, include = FALSE, message = FALSE}
library(tibble)
library(dplyr)
library(ggplot2)
```

# Important take-home messages

## Important take-home messages

* Population vs sample:
    + What is the population?
    + Is the entire population known -- is statistics needed at all?
* Sampling
    + Sampling strategy must ensure random sampling
        - Difficult to investigate it afterwards
    + Convenience sampling often used, dangerous!
    + Be honest with yourself, describe problems: Is the sample representative of the target group/population/market segment/...?
* A badly chosen large sample is much worse than a well-chosen small sample
* Watch out for biases
    + Sample/selection bias
    + Response bias
    + Non-response bias
    + (Survivorship bias)
* Data collection
    + Privacy vs necessary information (< 50 or >= 50, age in years, birth date)
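
The point about large but badly chosen samples can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population (`population`, weekly km biked) is hypothetical, and the convenience sample is assumed to over-represent people who bike a lot.

```{r}
set.seed(2018)
# Hypothetical population: weekly km biked by 10,000 people
population <- rgamma(1e4, shape = 2, scale = 10)

# Small simple random sample: everyone has the same chance of being selected
small_random <- sample(population, size = 100)

# Large convenience sample: assumed selection probability proportional to km biked
big_biased <- sample(population, size = 2000, prob = population)

# Compare the two estimates of the mean with the true population mean
c(truth = mean(population),
  small_random = mean(small_random),
  big_biased = mean(big_biased))
```

Under these assumptions the large sample is precise but centred on the wrong value, and more observations do not remove the selection bias.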



# Brief overview of terminology

## Controlling (for)

* Multivariate analysis: "Controlled (for)" means that a variable's influence is removed
    + Size of effect often not of interest
    + Module 4: Cadmium exposure's effect on vital capacity, controlled for age
* Randomized experiments vs observational studies
* Example [A] 10.1

## Confounders

* Which variables to control for?
* The effect of one explanatory variable on the response cannot be distinguished from the effect of one or more of the other explanatory variables
* Variables that affect the association studied but are not measured are sometimes called *lurking variables*
* Example: correlation between college GPA and income later in life
    + Potential lurking variables: IQ, tendency to work (hard), ...
* Example:
    + Plant cucumbers in a garden, some in sun some in shade.
    + Add fertilizer to those in sun.
    + Wait...
    + More cucumbers on those in sun: due to sunlight or fertilizer?
    + Effect of fertilizer confounded with effect of sunlight.
* Example:
    + Ice cream sales increase with the number of shark attacks
    + Weather probably (!) has an impact?
* When analyzing the effect of an explanatory variable: an unobserved confounder may explain a major part of the apparent effect
    + **Omitted variable bias**
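
A minimal simulated sketch of omitted variable bias, loosely based on the ice cream example. All numbers are invented for illustration: `temperature` plays the role of the confounder, and by construction `shark_attacks` has no direct effect on `ice_cream`.

```{r}
set.seed(42)
n_days <- 200
temperature   <- rnorm(n_days, mean = 20, sd = 5)               # the confounder
shark_attacks <- 0.2 * temperature + rnorm(n_days, sd = 1)      # driven by temperature
ice_cream     <- 50 + 3 * temperature + rnorm(n_days, sd = 5)   # driven by temperature only

# Confounder omitted: shark attacks appear to "explain" ice cream sales
naive_fit <- lm(ice_cream ~ shark_attacks)
# Controlled for temperature: the apparent effect should largely vanish
controlled_fit <- lm(ice_cream ~ shark_attacks + temperature)

c(naive = unname(coef(naive_fit)["shark_attacks"]),
  controlled = unname(coef(controlled_fit)["shark_attacks"]))
```

In real data the confounder is often not measured, so the second model cannot be fitted. That is why the study design matters.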

## Multicollinearity

* Occurs when two or more explanatory variables are (close to) linearly dependent
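
A small sketch of what near-linear dependence does to a regression, using simulated data. The variables `x1`, `x2` and `y_mc` are made up for illustration, and `x2` is constructed as an almost exact copy of `x1`.

```{r}
set.seed(7)
n_obs <- 100
x1   <- rnorm(n_obs)
x2   <- x1 + rnorm(n_obs, sd = 0.05)   # almost linearly dependent on x1
y_mc <- 1 + 2 * x1 + rnorm(n_obs)

# Standard error of the x1 coefficient without and with the near-copy x2
se_alone <- summary(lm(y_mc ~ x1))$coefficients["x1", "Std. Error"]
se_both  <- summary(lm(y_mc ~ x1 + x2))$coefficients["x1", "Std. Error"]
c(se_alone = se_alone, se_both = se_both)

# Variance inflation factor for x1: 1 / (1 - R^2) from regressing x1 on x2
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x1
```

The coefficient of `x1` becomes much less precisely estimated once `x2` is included, because the data cannot separate the two variables.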


## Simpson's "paradox"

```{r, echo = FALSE}
set.seed(1)

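# Two groups of 50 observations each; group 1 is shifted 5 units up in x
group <- rep(0:1, 50)
x <- rnorm(length(group), mean=5*group)
# Within each group y decreases with x (slope -1), but group 1 lies 8 units higher
DF2 <- data.frame(x = x, y = rnorm(length(group), mean = 3+8*group - x), group = group)

# Shift to plausible values for age and hours of sleep
DF <- data.frame(Age = 50+DF2$x, SleepHrs = 5+DF2$y, group = DF2$group)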
```

```{r, size = "scriptsize"}
mylm <- lm(SleepHrs ~ Age, data = DF)
summary(mylm)
```

## Simpson's "paradox"

```{r, echo = FALSE}
mycoef <- coef(mylm)
ggplot(DF, aes(Age, SleepHrs)) + 
  geom_abline(intercept = mycoef[1], slope = mycoef[2], color = "red") +
  geom_point()
```

## Simpson's "paradox"

```{r, echo = FALSE}
ggplot(DF, aes(Age, SleepHrs)) + 
  geom_smooth(aes(Age, SleepHrs, group = group), method = 'lm', se = FALSE, size = 1, col = "blue") + 
  geom_abline(intercept = mycoef[1], slope = mycoef[2], color = "red") +
  geom_point()
```
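
The sign change seen in the plots can also be read off the model coefficients. The sketch below re-creates the same kind of simulated data under new, illustrative names (`simpson_df`, `pooled_fit`, `grouped_fit`) so that it runs on its own. It then compares the `Age` slope with and without `group` in the model.

```{r}
set.seed(1)
sim_group <- rep(0:1, 50)
sim_x <- rnorm(length(sim_group), mean = 5 * sim_group)
sim_y <- rnorm(length(sim_group), mean = 3 + 8 * sim_group - sim_x)
simpson_df <- data.frame(Age = 50 + sim_x,
                         SleepHrs = 5 + sim_y,
                         group = factor(sim_group))

# Pooled model: group ignored
pooled_fit <- lm(SleepHrs ~ Age, data = simpson_df)
# Model controlled for group: common slope within groups
grouped_fit <- lm(SleepHrs ~ Age + group, data = simpson_df)

c(pooled = unname(coef(pooled_fit)["Age"]),
  within_group = unname(coef(grouped_fit)["Age"]))
```

The pooled slope is positive while the within-group slope is negative. Which one is relevant depends on the question being asked.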


## Summary

* Only a few terms were introduced and there is a lot more to each of them -- but they give an idea of potential problems

# Data wrangling

## Data wrangling

Read data:

* [`rio`: A Swiss-Army Knife for Data I/O](https://cloud.r-project.org/web/packages/rio/vignettes/rio.html)
    + Excel: `readxl` (used by `rio` to import Excel files)
* [R for Data Science](https://r4ds.had.co.nz/)
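
A minimal round-trip sketch of reading data with `rio`. The data frame `biking` and its columns are hypothetical, and the file is written to a temporary path. If `rio` is not installed, the sketch falls back to base R, so it shows the workflow rather than a recommended setup.

```{r}
# Hypothetical example data
biking <- data.frame(id = 1:3, km_per_week = c(12.5, 0, 40))
csv_path <- tempfile(fileext = ".csv")

if (requireNamespace("rio", quietly = TRUE)) {
  # rio picks the file format from the file extension
  rio::export(biking, csv_path)
  biking_back <- rio::import(csv_path)
} else {
  # Base R fallback when rio is not available
  write.csv(biking, csv_path, row.names = FALSE)
  biking_back <- read.csv(csv_path)
}
str(biking_back)
```

The same `rio::import()` call can be pointed at an Excel file, in which case the file extension again decides how it is read.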

# Case-study

## Case: Questionnaire about biking habits in Region Sjælland

* Questionnaire:
    + Shared in approx 30 different Facebook groups
* Questions:
    + Representative for the entire region?
        - Each municipality represented in sample proportional to its population size?
        - Disabled people?
        - People biking (municipalities' age distribution may vary)
* Important take-home messages:
    + Sampling strategy must ensure random sampling
        - Difficult to investigate it afterwards
    + Convenience sampling often used, dangerous!
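
One of the questions above, whether municipalities are represented in proportion to their population size, can be checked with a goodness-of-fit test. The sketch uses invented numbers: the municipalities `A`, `B`, `C`, their population shares and the respondent counts are all hypothetical.

```{r}
# Hypothetical population shares of three municipalities (must sum to 1)
pop_share <- c(A = 0.50, B = 0.30, C = 0.20)
# Hypothetical number of respondents per municipality
respondents <- c(A = 90, B = 160, C = 50)

# Sample shares next to population shares
rbind(sample_share = respondents / sum(respondents),
      pop_share = pop_share)

# Chi-squared goodness-of-fit test of the sample counts against the population shares
rep_test <- chisq.test(x = respondents, p = pop_share)
rep_test$p.value
```

A small p-value indicates that the sample does not match the population on this variable. A large one does not show that the sample is random, since it only concerns the variable that was compared.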

## Analysis

Demo
