---
title: "Intro and descriptive statistics"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Software

##  **Rstudio** 

* Make a folder on your computer where you want to keep files to use in  **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder:
`Session -> Set Working Directory -> Choose Directory`
(shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in:
`Tools -> Global Options -> Choose Directory`.


## **R** basics

* Ordinary calculations:
```{r}
4.6 * (2 + 3)^4 
```
* Make a (scalar) object and print it:
```{r}
a <- 4 
a
```
* Make a (vector) object and print it:
```{r}
b <- c(2, 5, 7)
b
```
* Make a sequence of numbers and print it:
```{r}
s <- 1:4
s
```
* Note: A more flexible command for sequences:
```{r}
s <- seq(1, 4, by = 1)
```

* **R** does elementwise calculations:
```{r}
a * b
a + b
b ^ 2
```
* Sum and product of elements:
```{r}
sum(b)
prod(b)
```



## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them.
* To install a new package in  **Rstudio**  use the menu:
`Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:
```{r eval=FALSE}
install.packages("mosaic")
```
* When it is installed you can load it through the `library` command:
```{r results="hide", message=FALSE}
library(mosaic)
```
* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.



## **R** help

* You get help via `?<command>`:
```{r eval=FALSE}
?sum
```
* Use the `Tab` key to make  **Rstudio**  suggest completions of what you have started typing.
* Search for help:
```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course 
[here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf).
* Save your commands in a file for later usage:
    + Select history tab in top right pane in  **Rstudio** .
    + Mark the commands you want to save.
    + Press `To Source` button.


# Data 

## Data example

Data: [Magazine Ads Readability](https://asta.math.aau.dk/datasets?file=magazineAds.txt) <!-- * The link (http://lib.stat.cmu.edu/DASL/Datafiles/magadsdat.html) is defective. Hence it has been replaced by the data on the asta webpage. Hence there is still no description of the data. -->

* Thirty magazines were ranked by educational level of their readers.
* Three magazines were **randomly** selected from each of the following groups:
    * Group 1: highest educational level
    * Group 2: medium educational level
    * Group 3: lowest educational level.
* Six advertisements were **randomly** selected from each of the nine selected magazines:
    * Group 1: [1] Scientific American, [2] Fortune, [3] The New Yorker
    * Group 2: [4] Sports Illustrated, [5] Newsweek, [6] People
    * Group 3: [7] National Enquirer, [8] Grit, [9] True Confessions
* So, the data contains information about a total of 54 advertisements. 


## Data example (continued) - variables and format
* For each advertisement (54 cases), the data below were observed.
* **Variable names**:
    * WDS = number of words in advertisement
    * SEN = number of sentences in advertisement
    * 3SYL = number of 3+ syllable words in advertisement
    * MAG = magazine (1 through 9 as above)
    * GROUP = educational level (1 through 3 as above)
* Take a look at the data from within **Rstudio**:
```{r}
magAds <- read.delim("https://asta.math.aau.dk/datasets?file=magazineAds.txt")
head(magAds)
```
* Variable names are in the top row. Names in **R** are not allowed to start with a digit, so an `X` has been prefixed to `3SYL`, giving `X3SYL`.



## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantitative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of e.g. waiting times in a queue, revenue, share prices, etc.
    * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.


### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc.
* The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" are male/female.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells whether it is ordinal.
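As a small illustration (with made-up data), an ordinal variable can be stored as an ordered factor by setting `ordered = TRUE` in `factor`:

```{r}
# Made-up satisfaction scores stored as an ordered factor
satisfaction <- factor(c("low", "high", "medium", "low", "high"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)
satisfaction  # printed with the ordering low < medium < high
```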

# Population and sample

## Aim of statistics

* Statistics is all about "saying something" about a population.
* Typically, this is done by taking a random sample from the population.
* The sample is then analysed and a statement about the population can be made.
* The process of making conclusions about a population from analysing a sample is called **statistical inference**.

## Selecting **randomly**
* For the magazine data:
    * First we select **randomly** 3 magazines from each group.
    * Then we select **randomly** 6 ads from each magazine.
    * An important detail is that the selection is done completely at
        **random**, i.e.
        * each magazine within a group has an equal chance of being chosen and
        * each ad within a magazine has an equal chance of being chosen.
* In the following it is a fundamental requirement that the data
  collection respects this principle of randomness and in this case we use
  the term **sample**.
* More generally:
    * We have a **population** of objects.
    * We choose completely at random $n$ of these objects, and from the $j$th object we get the measurement $y_j$, $j=1,2,\ldots,n$.
    * The measurements $y_1, y_2, \ldots, y_n$ are then called a **sample**.
* If, for example, we measure the water quality 4 times a year, it is a
  bad idea to only collect data in fair weather. The chosen sampling
  times must not be influenced by anything that might influence the
  measurement itself.
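A sketch of how such a random selection could be done in **R** (the magazine and ad numbers below are made up for illustration):

```{r}
set.seed(1)                          # for reproducibility
magazines <- 1:10                    # hypothetical magazines in one group
chosen_mags <- sample(magazines, 3)  # pick 3 magazines completely at random
chosen_mags
ads <- 1:40                          # hypothetical ads in one magazine
chosen_ads <- sample(ads, 6)         # pick 6 ads completely at random
chosen_ads
```

Each call to `sample` gives every element the same chance of being chosen, which is exactly the requirement above.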



# Variable grouping and frequency tables

## Binning

* The function `cut` will divide the range of a numeric variable into a number of equally sized intervals and record which interval each observation belongs to. E.g. for the variable `X3SYL` (the number of words with more than three syllables) in the magazine data:
```{r}
# Before 'cutting':
magAds$X3SYL[1:5]

# After 'cutting' into 4 intervals:
syll <- cut(magAds$X3SYL, 4)
syll[1:5]
```
* The result is a `factor` whose labels by default are the interval end points. Custom labels can be assigned through the `labels` argument:
```{r}
labs <- c("few", "some", "many", "lots")
syll <- cut(magAds$X3SYL, 4, labels = labs) # NB: this overwrites the 'syll' defined above
syll[1:5]
magAds$syll <- syll # Adding a new column to the dataset
```



## Tables

* To summarize the results we can use the function `tally` from the `mosaic` package (remember the package **must be loaded** via `library(mosaic)` if you did not do so yet):
```{r}
tally( ~ syll, data = magAds)
```
* In percent:
```{r}
tally( ~ syll, data = magAds, format = "percent")
```
* Here we use an **R** `formula` (characterized by the "tilde" sign `~`) to indicate that we want this variable from the dataset `magAds` (without the tilde it would look for a global variable called `syll` and use that rather than the one in the dataset).


## 2 factors: Cross tabulation

* To make a table of all combinations of two factors we use `tally` again:
```{r size="small"}
tally( ~ syll + GROUP, data = magAds)
```
* Relative frequencies (in percent) columnwise:
```{r size="small"}
tally( ~ syll | GROUP, data = magAds, format = "percent")
```
* So, the above table shows e.g. the percentage of advertisements in
  group 1 that have 'few', 'some', 'many' or 'lots' of words with more than
  3 syllables. 

# Graphics

## Bar graph

* To create a bar graph of table data we use the function `gf_bar` from `mosaic`.
  For each level of the factor a bar is drawn with height proportional to the frequency (count) of that level.
```{r bargraph}
gf_bar( ~ syll, data = magAds)
```
* The bar graph can also be split by group:
```{r bargraph_grouped}
gf_bar( ~ syll | GROUP, data = magAds)
```

## The Ericksen data

* Description of data: [Ericksen 1980 U.S. Census Undercount](http://www.rdocumentation.org/packages/car/functions/Ericksen).
* This data contains the following variables:
    * `minority`: Percentage black or Hispanic.
    * `crime`: Rate of serious crimes per 1000 individuals in the population.
    * `poverty`: Percentage poor.
    * `language`: Percentage having difficulty speaking or writing English.
    * `highschool`: Percentage aged 25 or older who had not finished high school.
    * `housing`: Percentage of housing in small, multiunit buildings.
    * `city`: A factor with levels: `city` (major city) and `state` (state or state-remainder).
    * `conventional`: Percentage of households counted by conventional personal enumeration.
    * `undercount`: Preliminary estimate of percentage undercount.
* The Ericksen data has 66 rows/observations and 9 columns/variables. 
* The observations are measured in 16 large cities, the remaining parts of the states in which these cities are located, and the other U.S. states.

```{r}
Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
head(Ericksen)
```
* Want to make a histogram for crime rate - how? 

## Histogram (quantitative variables)

* How to make a histogram for some variable `x`:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` into an appropriate number of equally sized sub-intervals.
    * Draw a box over each sub-interval with the height proportional to the number of observations in that sub-interval.
* Histogram of crime rates for the Ericksen data
```{r hist}
gf_histogram( ~ crime, data = Ericksen)
```

# Summary of quantitative variables

## Measures of center of  data: Mean and median

* We return to the magazine ads example (`WDS` =  number of words in advertisement). 
A number of numerical summaries for `WDS` can be retrieved using the `favstats`
function:
```{r}
favstats( ~ WDS, data = magAds)
```

```{r echo=FALSE,results='hide'}
y <- magAds$WDS
nn <- length(y)
```
* The observed values of the variable `WDS` are $y_1=`r y[1]`$, $y_2=`r y[2]`,\ldots,y_n=`r y[nn]`$, where there are a total of $n=`r nn`$ values.
As previously defined this constitutes a **sample**.
* **mean** = `r round(mean(y))` is the **average** of the sample, which is calculated by
    $$
      \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i
    $$
    We may also call $\bar{y}$ the **(empirical) mean** or the **sample mean**.
* **median** = `r round(median(y))` is the 50-percentile,
    i.e. the value that splits the sample into 2 groups of equal size.
* An important property of the **mean** and the **median** is that they have the same unit as the observations (e.g. meter).
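The formulas can be checked by hand on a small made-up sample; the built-in functions give the same results:

```{r}
y <- c(2, 5, 7, 10)     # a small made-up sample, n = 4
sum(y) / length(y)      # the mean from the formula: 6
mean(y)                 # built-in mean: also 6
median(y)               # average of the two middle values: 6
```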



## Measures of variability of data: range, standard deviation and variance

* The **range** is the difference between the largest and the smallest observation.
* The **(empirical) variance** is the average of the squared deviations from the mean:
      $$
        s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2.
      $$
* **sd** $=$ **standard deviation** $= s=\sqrt{s^2}$.
* Note:\ If the observations are measured in meter,\ the **variance** has unit $\text{meter}^2$ which is hard to interpret. The **standard deviation** on the other hand has the same unit as the observations.
* The standard deviation describes how much data varies around the (empirical) mean.
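The formula for $s^2$ can likewise be checked against **R**'s built-in functions, again on a small made-up sample:

```{r}
y <- c(2, 5, 7, 10)                   # a small made-up sample
n <- length(y)
s2 <- sum((y - mean(y))^2) / (n - 1)  # the variance from the formula
s2
var(y)                                # built-in variance: same value
sqrt(s2)                              # the standard deviation
sd(y)                                 # built-in sd: same value
```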

## Calculation of mean, median and standard deviation using **R**
The mean, median and standard deviation are just some of the summaries that 
can be read of the `favstats` output (shown on previous page). They may also
be calculated separately in the following way:

* Mean of `WDS`:
```{r}
mean( ~ WDS, data = magAds)
```
* Median of `WDS`:
```{r}
median( ~ WDS, data = magAds)
```
* Standard deviation for `WDS`:
```{r}
sd( ~ WDS, data = magAds)
```

We may also calculate the summaries for each group (variable `GROUP`), e.g. for 
the mean:
```{r}
mean( ~ WDS | GROUP, data = magAds)
```


```{r echo=FALSE,results='hide'}
xx <- sort(y)
```



## A word about terminology

* **Standard deviation**:\ a measure of variability of a population or a sample.
* **Standard error**:\ a measure of variability of an estimate. For example, a measure of variability of the sample mean.



## The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

If the histogram of the sample looks like a bell-shaped curve, then

* about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$.
* about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.
* All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.
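A quick sanity check of the rule on simulated bell-shaped data (the exact percentages will vary a little from sample to sample):

```{r}
set.seed(1)
y <- rnorm(10000)                    # simulated bell-shaped data
mean(abs(y - mean(y)) <= 1 * sd(y))  # about 0.68
mean(abs(y - mean(y)) <= 2 * sd(y))  # about 0.95
mean(abs(y - mean(y)) <= 3 * sd(y))  # about 0.997
```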



## Percentiles

* **The $p$th percentile** is a value such that about $p$% of the population (or sample) lies below or at this value and about $(100-p)$% of the population (or sample) lies above it.

### Percentile calculation for a sample:

* First, sort data in increasing order. For the `WDS` variable in the magazine data:
    $$
      y_{(1)}=`r xx[1]`, y_{(2)}=`r xx[2]`, y_{(3)}=`r xx[3]`, \ldots, y_{(n)} = `r xx[nn]`.
    $$
Here the number of observations is $n=`r nn`$.
```{r echo=FALSE}
p <- 5
N <- nn*p/100
```
* Find the $5$th percentile (i.e. $p = `r p`$):
    * The observation number corresponding to the 5-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$. 
    * So the 5-percentile lies between the observations with observation number $k=`r floor(N)`$ and $k+1=`r floor(N)+1`$. That is, its value lies somewhere in the interval between $y_{(2)}= 32$ and $y_{(3)}=34$.
    * One of several methods for estimating the 5-percentile from the value of $N$ interpolates linearly between these two order statistics:
    $$
      y_{(k)} + (N - k)(y_{(k+1)} - y_{(k)}),
    $$
which in this case gives
$$
    y_{(2)} + (2.7 - 2)(y_{(3)} - y_{(2)}) = 32 + 0.7 \cdot (34-32) = 33.4.
$$    
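The interpolation can be reproduced directly in **R**. For the curious: with $N = np/100$ this rule appears to correspond to `type = 4` in `quantile` (the default `type = 7` uses a slightly different interpolation). A small made-up sample:

```{r}
y <- sort(c(31, 32, 34, 35, 40))     # a small made-up sorted sample, n = 5
p <- 30                              # find the 30th percentile
N <- length(y) * p / 100             # N = 1.5
k <- floor(N)                        # k = 1
y[k] + (N - k) * (y[k + 1] - y[k])   # the interpolation formula: 31.5
quantile(y, p / 100, type = 4)       # same rule via quantile()
```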
    
## Median, quartiles and interquartile range

Recall
```{r}
favstats( ~ WDS, data = magAds)
```

* 50-percentile = 96 is the **median** and it is a measure of the center of data.
* 0-percentile = 31 is the **minimum** value.
* 25-percentile = 69 is called the **lower quartile** (Q1).\ Median of lower 50% of data.
* 75-percentile = 202 is called the **upper quartile** (Q3).\ Median of upper 50% of data.
* 100-percentile = 230 is the **maximum** value.
* **Interquartile Range (IQR)**:\ a measure of variability
  given by the difference of the upper and lower quartiles: 202 - 69 = 133.
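The quartiles and IQR can also be computed directly with `quantile` and `IQR` (a small made-up sample is used here to keep the example self-contained):

```{r}
y <- c(31, 32, 34, 69, 96, 202, 230)       # a small made-up sample
quantile(y, c(0.25, 0.50, 0.75))           # lower quartile, median, upper quartile
IQR(y)                                     # interquartile range: Q3 - Q1
unname(diff(quantile(y, c(0.25, 0.75))))   # the same difference "by hand"
```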


# More graphics
## Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

* Box:
    * Calculate the median and the lower and upper quartiles.
    * Draw a line at the median and draw a box between the upper and
      lower quartiles.
* Whiskers: 
    * Calculate the interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from the lower quartile to the smallest measurement
    that is larger than *L*. 
    * Similarly, draw a line from the upper quartile to the largest measurement 
    that is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.

*Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.*
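A minimal sketch of the fence calculation on a small made-up sample (note that different software may compute the quartiles themselves slightly differently):

```{r}
y <- c(7, 13, 14, 15, 16, 17, 18, 22)   # a small made-up sample
Q1 <- quantile(y, 0.25)                 # lower quartile
Q3 <- quantile(y, 0.75)                 # upper quartile
L <- Q1 - 1.5 * (Q3 - Q1)               # lower fence
U <- Q3 + 1.5 * (Q3 - Q1)               # upper fence
y[y < L | y > U]                        # flagged as outliers (here only 7)
```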

```{r echo = FALSE}
set.seed(42)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "whiskers")),
                  y = c(7, runif(20,12,19), 22))
b <- boxplot(y ~ plottype, data = dat, ylim = c(6, 25), xlab = "", ylab = "", pch = 25)
points(rep(1, length(dat$y)), dat$y)
Q1 <- b$stats[2,2]
Q3 <- b$stats[4,2]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```


----

### Boxplot for Ericksen data

Boxplot of the poverty rates separately for cities and states (variable `city`):

```{r boxplot}
favstats(poverty ~ city, data = Ericksen)
gf_boxplot(poverty ~ city, data = Ericksen)
```

* There seems to be more poverty in the cities.
* A single state differs noticeably from the others with a high poverty rate.

## 2 quantitative variables: Scatter plot

For two quantitative variables the usual graphic is a scatter plot:
```{r scatter0}
gf_point(poverty ~ highschool, data = Ericksen)
```

This can be either split or coloured according to the value of `city`:

```{r}
gf_point(poverty ~ highschool | city, data = Ericksen)
gf_point(poverty ~ highschool, col = ~city, data = Ericksen)
```

If we want a regression line along with the points we can do:
```{r scatter05}
gf_point(poverty ~ highschool, col = ~city, data = Ericksen) %>% gf_lm()
```




# Appendix

## Recoding variables

* The function `factor` will directly convert a vector to be of type `factor`.\ E.g.:
```{r}
head(magAds$GROUP)  
f <- factor(magAds$GROUP)
magAds$GROUP <- f
head(magAds$GROUP)
```
* Custom labels for the levels can also be used:
```{r}
f <- factor(magAds$GROUP, 
            levels = c("1", "2", "3"),
            labels = c("high", "medium", "low"))
magAds$GROUP <- f
head(magAds$GROUP)
```
* In this way the numbers are replaced by more informative labels describing the educational level.

