Intro and descriptive statistics

The ASTA team

Software

Rstudio

R basics

4.6 * (2 + 3)^4 
## [1] 2875
a <- 4 
a
## [1] 4
b <- c(2, 5, 7)
b
## [1] 2 5 7
s <- 1:4
s
## [1] 1 2 3 4
s <- seq(1, 4, by = 1)
a * b
## [1]  8 20 28
a + b
## [1]  6  9 11
b ^ 2
## [1]  4 25 49
sum(b)
## [1] 14
prod(b)
## [1] 70

R extensions

install.packages("mosaic")
library(mosaic)

R help

?sum
help.search("plot")

Data

Data example

Data: Magazine Ads Readability

Data example (continued) - variables and format

magAds <- read.delim("https://asta.math.aau.dk/datasets?file=magazineAds.txt")
head(magAds)
##   WDS SEN X3SYL MAG GROUP
## 1 205   9    34   1     1
## 2 203  20    21   1     1
## 3 229  18    37   1     1
## 4 208  16    31   1     1
## 5 146   9    10   1     1
## 6 230  16    24   1     1

Data types

Quantitative variables

Categorical/qualitative variables

Population and sample

Aim of statistics

Selecting randomly

Variable grouping and frequency tables

Binning

# Before 'cutting':
magAds$X3SYL[1:5]
## [1] 34 21 37 31 10
# After 'cutting' into 4 intervals:
syll <- cut(magAds$X3SYL, 4)
syll[1:5]
## [1] (32.2,43]     (10.8,21.5]   (32.2,43]     (21.5,32.2]   (-0.043,10.8]
## Levels: (-0.043,10.8] (10.8,21.5] (21.5,32.2] (32.2,43]
labs <- c("few", "some", "many", "lots")
syll <- cut(magAds$X3SYL, 4, labels = labs) # NB: this overwrites the 'syll' defined above
syll[1:5]
## [1] lots some lots many few 
## Levels: few some many lots
magAds$syll <- syll # Adding a new column to the dataset
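Instead of asking cut() for a number of equal-width intervals, the cut points can be given explicitly. The sketch below uses only the first five X3SYL values shown above; the break points and the three labels are hypothetical and chosen purely for illustration.

```r
# First five X3SYL values, copied from the output above
x <- c(34, 21, 37, 31, 10)

# Hypothetical cut points and labels (not the ones used in the slides)
brks  <- c(0, 15, 30, 45)
labs3 <- c("few", "some", "many")

# Intervals are right-closed by default: (0,15], (15,30], (30,45]
binned <- cut(x, breaks = brks, labels = labs3)
binned

# Frequency table of the binned values
table(binned)
```

Explicit breaks make the bins reproducible: they do not move if new observations extend the range of the data.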

Tables

tally( ~ syll, data = magAds)
## syll
##  few some many lots 
##   26   14   10    4
tally( ~ syll, data = magAds, format = "percent")
## syll
##  few some many lots 
## 48.1 25.9 18.5  7.4
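The percentages are simply the counts divided by the sample size. As a small check in base R, using the counts from the first table (the object names are illustrative):

```r
# Counts copied from the tally() output above
syll_counts <- c(few = 26, some = 14, many = 10, lots = 4)

# Sample size: the counts add up to the number of ads
n_ads <- sum(syll_counts)

# Relative frequencies in percent, rounded to one decimal
syll_pct <- round(100 * syll_counts / n_ads, 1)
syll_pct
```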

2 factors: Cross tabulation

tally( ~ syll + GROUP, data = magAds)
##       GROUP
## syll    1  2  3
##   few   8 11  7
##   some  4  2  8
##   many  3  5  2
##   lots  3  0  1
tally( ~ syll | GROUP, data = magAds, format = "percent")
##       GROUP
## syll      1    2    3
##   few  44.4 61.1 38.9
##   some 22.2 11.1 44.4
##   many 16.7 27.8 11.1
##   lots 16.7  0.0  5.6
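The second table conditions on GROUP, so each column sums to 100%. A minimal base R sketch that rebuilds these column percentages from the counts in the first table (the object names are illustrative):

```r
# Counts copied from the cross tabulation above
xtab <- matrix(c(8, 11, 7,
                 4,  2, 8,
                 3,  5, 2,
                 3,  0, 1),
               nrow = 4, byrow = TRUE,
               dimnames = list(syll  = c("few", "some", "many", "lots"),
                               GROUP = c("1", "2", "3")))

# Number of ads in each group (column totals)
colSums(xtab)

# margin = 2 divides every column by its own total
col_pct <- round(100 * prop.table(xtab, margin = 2), 1)
col_pct
```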

Graphics

Bar graph

gf_bar( ~ syll, data = magAds)

gf_bar( ~ syll | GROUP, data = magAds)

The Ericksen data

Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
head(Ericksen)
##           name minority crime poverty language highschool housing  city
## 1      Alabama     26.1    49      19      0.2         44     7.6 state
## 2       Alaska      5.7    62      11      1.7         18    23.6 state
## 3      Arizona     18.9    81      13      3.2         28     8.1 state
## 4     Arkansas     16.9    38      19      0.2         44     7.0 state
## 5 California.R     24.3    73      10      5.0         26    11.8 state
## 6     Colorado     15.2    73      10      1.2         21     9.2 state
##   conventional undercount
## 1            0      -0.04
## 2          100       3.35
## 3           18       2.48
## 4            0      -0.74
## 5            4       3.60
## 6           19       1.34

Histogram (quantitative variables)

gf_histogram( ~ crime, data = Ericksen)
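A histogram is a bar graph of a binned quantitative variable. To see the counts behind the bars, base R's hist() can be called without plotting. The sketch uses only the six crime values printed by head(Ericksen); the break points are hypothetical.

```r
# First six crime values, copied from head(Ericksen) above
crime6 <- c(49, 62, 81, 38, 73, 73)

# Hypothetical bin edges: 20, 40, 60, 80, 100
edges <- seq(20, 100, by = 20)

# plot = FALSE returns the binning without drawing anything
h <- hist(crime6, breaks = edges, plot = FALSE)

# Number of observations in (20,40], (40,60], (60,80], (80,100]
h$counts
```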

Summary of quantitative variables

Measures of center of data: Mean and median

favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0

Measures of variability of data: range, standard deviation and variance

Calculation of mean, median and standard deviation using R

The mean, median and standard deviation are just some of the summaries that can be read off the favstats output (shown earlier). They may also be calculated separately in the following way:

mean( ~ WDS, data = magAds)
## [1] 123
median( ~ WDS, data = magAds)
## [1] 96
sd( ~ WDS, data = magAds)
## [1] 66
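To connect these functions to the definitions, the sketch below computes the three summaries by hand in base R. It uses only the six WDS values printed by head(magAds), so the results differ from those for the full data set.

```r
# First six WDS values, copied from head(magAds) above
wds <- c(205, 203, 229, 208, 146, 230)
n   <- length(wds)

# Mean: sum of the observations divided by their number
m <- sum(wds) / n

# Sample variance divides by n - 1; the standard deviation is its square root
v <- sum((wds - m)^2) / (n - 1)
s <- sqrt(v)

# Median: n is even here, so average the two middle values of the sorted data
srt <- sort(wds)
med <- (srt[n / 2] + srt[n / 2 + 1]) / 2

c(mean = m, median = med, sd = s)
```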

We may also calculate the summaries for each group (variable GROUP), e.g. for the mean:

mean( ~ WDS | GROUP, data = magAds)
##   1   2   3 
## 140 121 106
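Outside mosaic, the same kind of grouped summary can be obtained with base R's tapply(). The data frame below is made-up toy data, not the magazine ads data, and serves only to show the mechanics.

```r
# Made-up toy data (NOT the magazine ads data)
toy <- data.frame(
  WDS   = c(100, 120, 140, 90, 110, 60, 80),
  GROUP = c(1, 1, 1, 2, 2, 3, 3)
)

# Apply mean() to WDS separately within each value of GROUP
group_means <- tapply(toy$WDS, toy$GROUP, mean)
group_means
```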

A word about terminology

The empirical rule

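If the histogram of the sample looks like a bell-shaped curve, then about 68% of the observations lie within one standard deviation of the mean, about 95% within two standard deviations, and nearly all (about 99.7%) within three.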
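A quick way to see the empirical rule at work is to simulate bell-shaped data and count how many observations fall within 1, 2 and 3 standard deviations of the mean; the usual reference figures are roughly 68%, 95% and 99.7%. The sample size, mean and standard deviation below are hypothetical choices, and the helper share_within() is defined here only for illustration.

```r
set.seed(1)  # make the simulation reproducible

# Simulated bell-shaped sample (hypothetical parameters)
x <- rnorm(10000, mean = 100, sd = 15)

m <- mean(x)
s <- sd(x)

# Proportion of observations within k standard deviations of the mean
share_within <- function(k) mean(abs(x - m) <= k * s)

props <- c(one_sd   = share_within(1),
           two_sd   = share_within(2),
           three_sd = share_within(3))
round(props, 3)
```

For data that are clearly skewed the proportions can deviate noticeably from these figures.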

Percentiles

Percentile calculation for a sample:

Median, quartiles and interquartile range

Recall

favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0
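Q1, the median and Q3 are the 25th, 50th and 75th percentiles, and the interquartile range is Q3 minus Q1. A minimal base R sketch with quantile(), again using only the six WDS values printed by head(magAds). Note that several percentile definitions exist; the sketch relies on R's default (type = 7), which interpolates between neighbouring observations.

```r
# First six WDS values, copied from head(magAds) above
wds <- c(205, 203, 229, 208, 146, 230)

# Quartiles as the 25th, 50th and 75th percentiles
q <- quantile(wds, probs = c(0.25, 0.5, 0.75))
q

# Interquartile range: distance between the upper and lower quartile
iqr <- q[["75%"]] - q[["25%"]]
iqr
```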

More graphics

Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

Note: The whiskers extend to the minimum and maximum of the observations that are not deemed to be outliers.
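The sketch below shows how outliers and whisker ends can be computed, assuming the common convention that an observation is an outlier if it lies more than 1.5 times the interquartile range below Q1 or above Q3. The data are made up, and plotting functions may compute the quartiles slightly differently (for example as hinges).

```r
# Made-up data with one unusually large value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Assumed convention: fences 1.5 * IQR outside the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations outside the fences are flagged as outliers
is_outlier <- x < lower_fence | x > upper_fence
outliers   <- x[is_outlier]

# Whiskers end at the most extreme observations that are not outliers
whiskers <- range(x[!is_outlier])

outliers
whiskers
```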

Boxplot for Ericksen data

Boxplot of the poverty rates separately for cities and states (variable city):

favstats(poverty ~ city, data = Ericksen)
##    city  min   Q1 median Q3 max mean  sd  n missing
## 1  city 11.5 13.8     19 21  23   18 4.0 16       0
## 2 state  6.8  9.5     11 14  24   12 3.7 50       0
gf_boxplot(poverty ~ city, data = Ericksen)

2 quantitative variables: Scatter plot

For two quantitative variables the usual graphic is a scatter plot:

gf_point(poverty ~ highschool, data = Ericksen)

This can be either split or coloured according to the value of city:

gf_point(poverty ~ highschool | city, data = Ericksen)

gf_point(poverty ~ highschool, col = ~city, data = Ericksen)

If we want a regression line along with the points, we can add gf_lm():

gf_point(poverty ~ highschool, col = ~city, data = Ericksen) %>% gf_lm()
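The line drawn by gf_lm() is a least squares line. As a rough illustration of what is being fitted, the sketch below fits a single line with base R's lm() to the six rows printed by head(Ericksen) and checks the coefficients against the textbook formulas. It ignores the grouping by city and uses far fewer observations than the plot, so the numbers are not those of the plotted lines.

```r
# First six rows of Ericksen, copied from head(Ericksen) above
highschool <- c(44, 18, 28, 44, 26, 21)
poverty    <- c(19, 11, 13, 19, 10, 10)

# Least squares fit of poverty on highschool
fit <- lm(poverty ~ highschool)
coef(fit)

# The same coefficients from the formulas:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope     <- cov(highschool, poverty) / var(highschool)
intercept <- mean(poverty) - slope * mean(highschool)
c(intercept = intercept, slope = slope)
```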

Appendix

Recoding variables

head(magAds$GROUP)  
## [1] 1 1 1 1 1 1
f <- factor(magAds$GROUP)
magAds$GROUP <- f
head(magAds$GROUP)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3
f <- factor(magAds$GROUP, 
            levels = c("1", "2", "3"),
            labels = c("high", "medium", "low"))
magAds$GROUP <- f
head(magAds$GROUP)
## [1] high high high high high high
## Levels: high medium low