---
title: "Intro and descriptive statistics"
author: "The ASTA team"
output:
  slidy_presentation:
    fig_caption: no
    highlight: tango
    theme: cerulean
  pdf_document:
    fig_caption: no
    highlight: tango
    number_sections: yes
    toc: yes
---

```{r, include = FALSE}
options(digits = 2)
## Remember to add all packages used in the code below!
missing_pkgs <- setdiff(c("mosaic"), rownames(installed.packages()))
if(length(missing_pkgs)>0) install.packages(missing_pkgs)
```

# Software

##  **Rstudio** 

* Make a folder on your computer where you want to keep files to use in  **Rstudio**. **Do NOT use Danish characters æ, ø, å** in the folder name (or anywhere in the path to the folder).
* Set the working directory to this folder:
`Session -> Set Working Directory -> Choose Directory`
(shortcut: Ctrl+Shift+H).
* Make the change permanent by setting the default directory in:
`Tools -> Global Options -> Choose Directory`.


## **R** basics

* Ordinary calculations:
```{r}
4.6 * (2 + 3)^4 
```
* Make a (scalar) object and print it:
```{r}
a <- 4 
a
```
* Make a (vector) object and print it:
```{r}
b <- c(2, 5, 7)
b
```
* Make a sequence of numbers and print it:
```{r}
s <- 1:4
s
```
* Note: A more flexible command for sequences:
```{r}
s <- seq(1, 4, by = 1)
```

* **R** does elementwise calculations:
```{r}
a * b
a + b
b ^ 2
```
* Sum and product of elements:
```{r}
sum(b)
prod(b)
```



## **R** extensions

* The functionality of **R** can be extended through libraries or packages (much like plugins in browsers etc.). Some are installed by default in **R** and you just need to load them.
* To install a new package in  **Rstudio**  use the menu:
`Tools -> Install Packages`
* You need to know the name of the package you want to install. You can also do it through a command:
```{r eval=FALSE}
install.packages("mosaic")
```
* When it is installed you can load it through the `library` command:
```{r results="hide", message=FALSE}
library(mosaic)
```
* This loads the `mosaic` package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the `mosaic` package, but you can safely ignore that.



## **R** help

* You get help via `?<command>`:
```{r eval=FALSE}
?sum
```
* Use the `Tab` key to make  **Rstudio**  suggest completions of what you have started typing.
* Search for help:
```{r eval=FALSE}
help.search("plot")
```

* You can find a cheat sheet with the **R** functions we use for this course 
[here](https://asta.math.aau.dk/course/asta/2022-2/?file=cheatsheet.pdf).
* Save your commands in a file for later usage:
    + Select history tab in top right pane in  **Rstudio** .
    + Mark the commands you want to save.
    + Press `To Source` button.


# Data 

## Data example

Data: [Magazine Ads Readability](https://asta.math.aau.dk/datasets?file=magazineAds.txt) <!-- * The link (http://lib.stat.cmu.edu/DASL/Datafiles/magadsdat.html) is defective. Hence it has been replaced by the data on the asta webpage. Hence there is still no description of the data. -->

* Thirty magazines were ranked by educational level of their readers.
* Three magazines were **randomly** selected from each of the following groups:
    * Group 1: highest educational level
    * Group 2: medium educational level
    * Group 3: lowest educational level.
* Six advertisements were **randomly** selected from each of the nine selected magazines:
    * Group 1: [1] Scientific American, [2] Fortune, [3] The New Yorker
    * Group 2: [4] Sports Illustrated, [5] Newsweek, [6] People
    * Group 3: [7] National Enquirer, [8] Grit, [9] True Confessions
* So, the data contains information about a total of 54 advertisements. 


## Data example (continued) - variables and format
* For each advertisement (54 cases), the data below were observed.
* **Variable names**:
    * WDS = number of words in advertisement
    * SEN = number of sentences in advertisement
    * 3SYL = number of 3+ syllable words in advertisement
    * MAG = magazine (1 through 9 as above)
    * GROUP = educational level (1 through 3 as above)
* Take a look at the data from within **Rstudio**:
```{r}
magAds <- read.delim("https://asta.math.aau.dk/datasets?file=magazineAds.txt")
head(magAds)
```
* Variable names are in the top row. Names in **R** are not allowed to start with a digit, so an `X` has been prefixed to `3SYL`, giving `X3SYL`.



## Data types

### Quantitative variables

* The measurements have numerical values.
* Quantitative data often comes about in one of the following ways:
    * **Continuous variables**: measurements of e.g. waiting times in a queue, revenue, share prices, etc.
    * **Discrete variables**: counts of e.g. words in a text, hits on a webpage, number of arrivals to a queue in one hour, etc.
* Measurements like this have a well-defined scale and in **R** they are stored as the type **numeric**.
* It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.


### Categorical/qualitative variables

* The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc.
* The measurement is usually stored (which is also recommended) as a **factor** in **R**. The possible categories are called **levels**. Example: the levels of the factor "sex" are male/female.
* Factors have two so-called scales:
    + **Nominal scale**: There is no natural ordering of the factor levels, e.g. sex and hair color.
    + **Ordinal scale**: There is a natural ordering of the factor levels, e.g. social status and satisfaction score. A factor in **R** can have a so-called **attribute** assigned, which tells whether it is ordinal.
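As a small illustration (with made-up data), an ordinal variable can be stored as an ordered factor by setting `ordered = TRUE` in `factor`:

```{r}
# Made-up satisfaction scores stored as an ordered factor
satisfaction <- factor(c("low", "high", "medium", "low", "high"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)
satisfaction  # printed with the ordering low < medium < high
```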

# Population and sample

## Aim of statistics

* Statistics is all about "saying something" about a population.
* Typically, this is done by taking a random sample from the population.
* The sample is then analysed and a statement about the population can be made.
* The process of making conclusions about a population from analysing a sample is called **statistical inference**.

## Selecting **randomly**
* For the magazine data:
    * First we select **randomly** 3 magazines from each group.
    * Then we select **randomly** 6 ads from each magazine.
    * An important detail is that the selection is done completely at
        **random**, i.e.
        * each magazine within a group has an equal chance of being chosen and
        * each ad within a magazine has an equal chance of being chosen.
* In the following it is a fundamental requirement that the data
  collection respects this principle of randomness and in this case we use
  the term **sample**.
* More generally:
    * We have a **population** of objects.
    * We choose completely at random $n$ of these objects, and from the $j$th object we get the measurement $y_j$, $j=1,2,\ldots,n$.
    * The measurements $y_1, y_2, \ldots, y_n$ are then called a **sample**.
* If, for example, we measure the water quality 4 times a year, it is a
  bad idea to only collect data in fair weather. The chosen sampling
  times must not be influenced by anything that might influence the
  measurement itself.
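A sketch of how such a random selection could be done in **R** (the magazine and ad numbers below are made up for illustration):

```{r}
set.seed(1)                          # for reproducibility
magazines <- 1:10                    # hypothetical magazines in one group
chosen_mags <- sample(magazines, 3)  # pick 3 magazines completely at random
chosen_mags
ads <- 1:40                          # hypothetical ads in one magazine
chosen_ads <- sample(ads, 6)         # pick 6 ads completely at random
chosen_ads
```

Each call to `sample` gives every element the same chance of being chosen, which is exactly the requirement above.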



# Variable grouping and frequency tables

## Binning

* The function `cut` will divide the range of a numeric variable into a number of equally sized intervals and record which interval each observation belongs to. E.g. for the variable `X3SYL` (the number of words with more than three syllables) in the magazine data:
```{r}
# Before 'cutting':
magAds$X3SYL[1:5]

# After 'cutting' into 4 intervals:
syll <- cut(magAds$X3SYL, 4)
syll[1:5]
```
* The result is a `factor` whose labels by default are the interval end points. Custom labels can be assigned through the `labels` argument:
```{r}
labs <- c("few", "some", "many", "lots")
syll <- cut(magAds$X3SYL, 4, labels = labs) # NB: this overwrites the 'syll' defined above
syll[1:5]
magAds$syll <- syll # Adding a new column to the dataset
```



## Tables

* To summarize the results we can use the function `tally` from the `mosaic` package (remember the package **must be loaded** via `library(mosaic)` if you did not do so yet):
```{r}
tally( ~ syll, data = magAds)
```
* In percent:
```{r}
tally( ~ syll, data = magAds, format = "percent")
```
* Here we use an **R** `formula` (characterized by the "tilde" sign `~`) to indicate that we want this variable from the dataset `magAds` (without the tilde it would look for a global variable called `syll` and use that rather than the one in the dataset).


## 2 factors: Cross tabulation

* To make a table of all combinations of two factors we use `tally` again:
```{r size="small"}
tally( ~ syll + GROUP, data = magAds)
```
* Relative frequencies (in percent) columnwise:
```{r size="small"}
tally( ~ syll | GROUP, data = magAds, format = "percent")
```
* So, the above table shows e.g. the percentage of advertisements in
  group 1 that have 'few', 'some', 'many' or 'lots' of words with more than
  3 syllables. 

# Graphics

## Bar graph

* To create a bar graph of table data we use the function `gf_bar` from `mosaic`.
  For each level of the factor a bar is drawn with height proportional to the frequency (count) of that level.
```{r bargraph}
gf_bar( ~ syll, data = magAds)
```
* The bar graph can also be split by group:
```{r bargraph_grouped}
gf_bar( ~ syll | GROUP, data = magAds)
```

## The Ericksen data

* Description of data: [Ericksen 1980 U.S. Census Undercount](http://www.rdocumentation.org/packages/car/functions/Ericksen).
* This data contains the following variables:
    * `minority`: Percentage black or Hispanic.
    * `crime`: Rate of serious crimes per 1000 individuals in the population.
    * `poverty`: Percentage poor.
    * `language`: Percentage having difficulty speaking or writing English.
    * `highschool`: Percentage aged 25 or older who had not finished high school.
    * `housing`: Percentage of housing in small, multiunit buildings.
    * `city`: A factor with levels: `city` (major city) and `state` (state or state-remainder).
    * `conventional`: Percentage of households counted by conventional personal enumeration.
    * `undercount`: Preliminary estimate of percentage undercount.
* The Ericksen data has 66 rows/observations and 9 columns/variables. 
* The observations are measured in 16 large cities, the remaining parts of the states in which these cities are located, and the other U.S. states.

```{r}
Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
head(Ericksen)
```
* Want to make a histogram for crime rate - how? 

## Histogram (quantitative variables)

* How to make a histogram for some variable `x`:
    * Divide the interval from the minimum value of `x` to the maximum value of `x` into an appropriate number of equally sized sub-intervals.
    * Draw a box over each sub-interval with the height proportional to the number of observations in that sub-interval.
* Histogram of crime rates for the Ericksen data
```{r hist}
gf_histogram( ~ crime, data = Ericksen)
```

# Summary of quantitative variables

## Measures of center of  data: Mean and median

* We return to the magazine ads example (`WDS` =  number of words in advertisement). 
A number of numerical summaries for `WDS` can be retrieved using the `favstats`
function:
```{r}
favstats( ~ WDS, data = magAds)
```

```{r echo=FALSE,results='hide'}
y <- magAds$WDS
nn <- length(y)
```
* The observed values of the variable `WDS` are $y_1=`r y[1]`$, $y_2=`r y[2]`,\ldots,y_n=`r y[nn]`$, where there are a total of $n=`r nn`$ values.
As previously defined this constitutes a **sample**.
* **mean** = `r round(mean(y))` is the **average** of the sample, which is calculated by
    $$
      \bar{y}=\frac{1}{n}\sum_{i=1}^n y_i
    $$
    We may also call $\bar{y}$ the **(empirical) mean** or the **sample mean**.
* **median** = `r round(median(y))` is the 50-percentile,
    i.e. the value that splits the sample into 2 groups of equal size.
* An important property of the **mean** and the **median** is that they have the same unit as the observations (e.g. meter).
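The formulas can be checked by hand on a small made-up sample; the built-in functions give the same results:

```{r}
y <- c(2, 5, 7, 10)     # a small made-up sample, n = 4
sum(y) / length(y)      # the mean from the formula: 6
mean(y)                 # built-in mean: also 6
median(y)               # average of the two middle values: 6
```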



## Measures of variability of data: range, standard deviation and variance

* The **range** is the difference between the largest and the smallest observation.
* The **(empirical) variance** is the average of the squared deviations from the mean:
      $$
        s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\bar{y})^2.
      $$
* **sd** $=$ **standard deviation** $= s=\sqrt{s^2}$.
* Note:\ If the observations are measured in meter,\ the **variance** has unit $\text{meter}^2$ which is hard to interpret. The **standard deviation** on the other hand has the same unit as the observations.
* The standard deviation describes how much data varies around the (empirical) mean.
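The formula for $s^2$ can likewise be checked against **R**'s built-in functions, again on a small made-up sample:

```{r}
y <- c(2, 5, 7, 10)                   # a small made-up sample
n <- length(y)
s2 <- sum((y - mean(y))^2) / (n - 1)  # the variance from the formula
s2
var(y)                                # built-in variance: same value
sqrt(s2)                              # the standard deviation
sd(y)                                 # built-in sd: same value
```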

## Calculation of mean, median and standard deviation using **R**
The mean, median and standard deviation are just some of the summaries that 
can be read of the `favstats` output (shown on previous page). They may also
be calculated separately in the following way:

* Mean of `WDS`:
```{r}
mean( ~ WDS, data = magAds)
```
* Median of `WDS`:
```{r}
median( ~ WDS, data = magAds)
```
* Standard deviation for `WDS`:
```{r}
sd( ~ WDS, data = magAds)
```

We may also calculate the summaries for each group (variable `GROUP`), e.g. for 
the mean:
```{r}
mean( ~ WDS | GROUP, data = magAds)
```


```{r echo=FALSE,results='hide'}
xx <- sort(y)
```



## A word about terminology

* **Standard deviation**:\ a measure of variability of a population or a sample.
* **Standard error**:\ a measure of variability of an estimate. For example, a measure of variability of the sample mean.



## The empirical rule

```{r empRule,echo=FALSE,fig.width=12,fig.height=6}
set.seed(5)
x <- rnorm(900)
x <- (x-mean(x))/sd(x)
hist(x,axes=F,xlab="",breaks="FD",ylab="",main="",cex.main = 1.5,ylim=c(-.21,.4),probability=TRUE)
axis(1,at=-3:3,labels=F,pos=0)
axis(1,at=-3,labels=substitute(bar(y)-3*s),pos=0)
axis(1,at=-2,labels=substitute(bar(y)-2*s),pos=0)
axis(1,at=-1,labels=substitute(bar(y)-s),pos=0)
axis(1,at=0,labels=substitute(bar(y)),pos=0)
axis(1,at=1,labels=substitute(bar(y)+s),pos=0)
axis(1,at=2,labels=substitute(bar(y)+2*s),pos=0)
axis(1,at=3,labels=substitute(bar(y)+3*s),pos=0)
arrows(-1,-.1,1,-.1,col="red",code=3,length=.1)
text(-.01,-.115,"About 68% of measurements",col="red")
arrows(-2,-.15,2,-.15,col="blue",code=3,length=.1)
text(-.01,-.1655,"About 95% of measurements",col="blue")
arrows(-3,-.2,3,-.2,code=3,length=.1)
text(-.01,-.215,"All or nearly all measurements")
```

If the histogram of the sample looks like a bell-shaped curve, then

* about 68% of the observations lie between $\bar{y}-s$ and $\bar{y}+s$.
* about 95% of the observations lie between $\bar{y}-2s$ and $\bar{y}+2s$.
* All or almost all (99.7%) of the observations lie between $\bar{y}-3s$ and $\bar{y}+3s$.
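A quick sanity check of the rule on simulated bell-shaped data (the exact percentages will vary a little from sample to sample):

```{r}
set.seed(1)
y <- rnorm(10000)                    # simulated bell-shaped data
mean(abs(y - mean(y)) <= 1 * sd(y))  # about 0.68
mean(abs(y - mean(y)) <= 2 * sd(y))  # about 0.95
mean(abs(y - mean(y)) <= 3 * sd(y))  # about 0.997
```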



## Percentiles

* **The $p$th percentile** is a value such that about $p$% of the population (or sample) lies below or at this value and about $(100-p)$% of the population (or sample) lies above it.

### Percentile calculation for a sample:

* First, sort data in increasing order. For the `WDS` variable in the magazine data:
    $$
      y_{(1)}=`r xx[1]`, y_{(2)}=`r xx[2]`, y_{(3)}=`r xx[3]`, \ldots, y_{(n)} = `r xx[nn]`.
    $$
Here the number of observations is $n=`r nn`$.
```{r echo=FALSE}
p <- 5
N <- nn*p/100
```
* Find the $5$th percentile (i.e. $p = `r p`$):
    * The observation number corresponding to the 5-percentile is $N = \frac{ `r nn` \cdot `r p`}{100} = `r N`$. 
    * So the 5-percentile lies between the observations with observation number $k=`r floor(N)`$ and $k+1=`r floor(N)+1`$. That is, its value lies somewhere in the interval between $y_{(2)}= 32$ and $y_{(3)}=34$.
    * One of several methods for estimating the 5-percentile from the value of $N$ interpolates linearly between these two order statistics:
    $$
      y_{(k)} + (N - k)(y_{(k+1)} - y_{(k)}),
    $$
which in this case gives
$$
    y_{(2)} + (2.7 - 2)(y_{(3)} - y_{(2)}) = 32 + 0.7 \cdot (34-32) = 33.4.
$$    
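The interpolation can be reproduced directly in **R**. For the curious: with $N = np/100$ this rule appears to correspond to `type = 4` in `quantile` (the default `type = 7` uses a slightly different interpolation). A small made-up sample:

```{r}
y <- sort(c(31, 32, 34, 35, 40))     # a small made-up sorted sample, n = 5
p <- 30                              # find the 30th percentile
N <- length(y) * p / 100             # N = 1.5
k <- floor(N)                        # k = 1
y[k] + (N - k) * (y[k + 1] - y[k])   # the interpolation formula: 31.5
quantile(y, p / 100, type = 4)       # same rule via quantile()
```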
    
## Median, quartiles and interquartile range

Recall
```{r}
favstats( ~ WDS, data = magAds)
```

* 50-percentile = 96 is the **median** and it is a measure of the center of data.
* 0-percentile = 31 is the **minimum** value.
* 25-percentile = 69 is called the **lower quartile** (Q1).\ Median of lower 50% of data.
* 75-percentile = 202 is called the **upper quartile** (Q3).\ Median of upper 50% of data.
* 100-percentile = 230 is the **maximum** value.
* **Interquartile Range (IQR)**:\ a measure of variability
  given by the difference of the upper and lower quartiles: 202 - 69 = 133.
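The quartiles and IQR can also be computed directly with `quantile` and `IQR` (a small made-up sample is used here to keep the example self-contained):

```{r}
y <- c(31, 32, 34, 69, 96, 202, 230)       # a small made-up sample
quantile(y, c(0.25, 0.50, 0.75))           # lower quartile, median, upper quartile
IQR(y)                                     # interquartile range: Q3 - Q1
unname(diff(quantile(y, c(0.25, 0.75))))   # the same difference "by hand"
```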


# More graphics
## Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

* Box:
    * Calculate the median and the lower and upper quartiles.
    * Draw a line at the median and draw a box between the upper and
      lower quartiles.
* Whiskers: 
    * Calculate the interquartile range and call it IQR.
    * Calculate the following values:
        * L = lower quartile - 1.5*IQR
        * U = upper quartile + 1.5*IQR
    * Draw a line from the lower quartile to the smallest measurement
    that is larger than *L*. 
    * Similarly, draw a line from the upper quartile to the largest measurement 
    that is smaller than *U*.
* Outliers: Measurements smaller than *L* or larger than *U* are drawn as circles.

*Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.*
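A minimal sketch of the fence calculation on a small made-up sample (note that different software may compute the quartiles themselves slightly differently):

```{r}
y <- c(7, 13, 14, 15, 16, 17, 18, 22)   # a small made-up sample
Q1 <- quantile(y, 0.25)                 # lower quartile
Q3 <- quantile(y, 0.75)                 # upper quartile
L <- Q1 - 1.5 * (Q3 - Q1)               # lower fence
U <- Q3 + 1.5 * (Q3 - Q1)               # upper fence
y[y < L | y > U]                        # flagged as outliers (here only 7)
```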

```{r echo = FALSE}
set.seed(42)
dat <- data.frame(plottype = factor("boxplot", levels = c("points", "boxplot", "whiskers")),
                  y = c(7, runif(20,12,19), 22))
b <- boxplot(y ~ plottype, data = dat, ylim = c(6, 25), xlab = "", ylab = "", pch = 25)
points(rep(1, length(dat$y)), dat$y)
Q1 <- b$stats[2,2]
Q3 <- b$stats[4,2]
IQR <- Q3-Q1
arrows(3, Q1, 3, Q1-1.5*IQR)
text(x = 3, y = c(Q1-1.5*IQR,Q1), pos = c(1, 3) , labels = c("Q1-1.5IQR", "Q1"), cex = 1)
arrows(3, Q3, 3, Q3+1.5*IQR)
text(x = 3, y = c(Q3, Q3+1.5*IQR), pos = c(1,3), labels = c("Q3", "Q3+1.5IQR"), cex = 1)
abline(h = c(Q1-1.5*IQR, Q1, Q3, Q3+1.5*IQR), lty = 2)
```


----

### Boxplot for Ericksen data

Boxplot of the poverty rates separately for cities and states (variable `city`):

```{r boxplot}
favstats(poverty ~ city, data = Ericksen)
gf_boxplot(poverty ~ city, data = Ericksen)
```

* There seems to be more poverty in the cities.
* A single state differs noticeably from the others with a high poverty rate.

## 2 quantitative variables: Scatter plot

For two quantitative variables the usual graphic is a scatter plot:
```{r scatter0}
gf_point(poverty ~ highschool, data = Ericksen)
```

This can be either split or coloured according to the value of `city`:

```{r}
gf_point(poverty ~ highschool | city, data = Ericksen)
gf_point(poverty ~ highschool, col = ~city, data = Ericksen)
```

If we want a regression line along with the points we can do:
```{r scatter05}
gf_point(poverty ~ highschool, col = ~city, data = Ericksen) %>% gf_lm()
```




# Appendix

## Recoding variables

* The function `factor` will directly convert a vector to be of type `factor`.\ E.g.:
```{r}
head(magAds$GROUP)  
f <- factor(magAds$GROUP)
magAds$GROUP <- f
head(magAds$GROUP)
```
* Custom labels for the levels can also be used:
```{r}
f <- factor(magAds$GROUP, 
            levels = c("1", "2", "3"),
            labels = c("high", "medium", "low"))
magAds$GROUP <- f
head(magAds$GROUP)
```
* In this way the numbers are replaced by more informative labels describing the educational level.

