Introduction to R and descriptive statistics

The ASTA team

Introduction to R

RStudio

R basics

4.6 * (2 + 3)^4 
## [1] 2875
a <- 4 
a
## [1] 4
b <- c(2, 5, 7)
b
## [1] 2 5 7
s <- 1:4
s
## [1] 1 2 3 4
s <- seq(1, 4, by = 1)
a * b
## [1]  8 20 28
a + b
## [1]  6  9 11
b ^ 2
## [1]  4 25 49
sum(b)
## [1] 14
prod(b)
## [1] 70

R Markdown

R extensions

install.packages("mosaic")  # run once to install the package
library(mosaic)             # run in every session to load it

R help

?sum
help.search("plot")

Data in R

Data example

Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
head(Chile)
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000       1.0    Y
## 2      N     175000   M  29        PS   7500      -1.3    N
## 3      N     175000   F  38         P  15000       1.2    Y
## 4      N     175000   F  49         P  35000      -1.0    N
## 5      N     175000   F  23         S  35000      -1.1    N
## 6      N     175000   F  28         P   7500      -1.0    N

Data types

Quantitative variables

Categorical/qualitative variables

Variables in the data set

head(Chile)
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000       1.0    Y
## 2      N     175000   M  29        PS   7500      -1.3    N
## 3      N     175000   F  38         P  15000       1.2    Y
## 4      N     175000   F  49         P  35000      -1.0    N
## 5      N     175000   F  23         S  35000      -1.1    N
## 6      N     175000   F  28         P   7500      -1.0    N

Descriptive statistics of categorical data

Tables

tally( ~ vote, data = Chile)
## vote
##    A    N    U    Y <NA> 
##  187  889  588  868  168
tally( ~ vote, data = Chile, format = "percent")
## vote
##    A    N    U    Y <NA> 
##  6.9 32.9 21.8 32.1  6.2

2 factors: Cross tabulation

tally( ~ vote + sex, data = Chile)
##       sex
## vote     F   M
##   A    104  83
##   N    363 526
##   U    362 226
##   Y    480 388
##   <NA>  70  98
tally( ~ vote | sex, data = Chile, format = "percent")
##       sex
## vote      F    M
##   A     7.5  6.3
##   N    26.3 39.8
##   U    26.3 17.1
##   Y    34.8 29.4
##   <NA>  5.1  7.4
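
The two calls above differ in what the numbers are relative to: `vote + sex` gives joint counts, while `vote | sex` gives percentages within each sex, so each column sums to 100. The sketch below shows the same idea in base R. As an assumption made only so that the example runs without downloading anything, it uses the built-in `mtcars` data, with `cyl` and `am` standing in for `vote` and `sex`.

```r
# Joint counts of two categorical variables (stand-ins for vote and sex)
counts <- table(cyl = mtcars$cyl, am = mtcars$am)
counts

# Column percentages: distribution of cyl within each level of am
col_pct <- 100 * prop.table(counts, margin = 2)
round(col_pct, 1)

# Each column sums to 100 because we condition on am
colSums(col_pct)
```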

Visualizing categorical data: Bar graph

gf_bar( ~ vote, data = Chile)

gf_bar( ~ vote | sex, data = Chile)

Descriptive statistics of quantitative variables

Data example: Fuel consumption of cars

head(mtcars)
##                   mpg cyl disp  hp drat  wt qsec vs am gear carb
## Mazda RX4          21   6  160 110  3.9 2.6   16  0  1    4    4
## Mazda RX4 Wag      21   6  160 110  3.9 2.9   17  0  1    4    4
## Datsun 710         23   4  108  93  3.9 2.3   19  1  1    4    1
## Hornet 4 Drive     21   6  258 110  3.1 3.2   19  1  0    3    1
## Hornet Sportabout  19   8  360 175  3.1 3.4   17  0  0    3    2
## Valiant            18   6  225 105  2.8 3.5   20  1  0    3    1

Visualizing quantitative data: Histogram

gf_histogram( ~ mpg, data = mtcars, bins = 10)

Relation between histogram and density function
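
A minimal base-R sketch, assuming this heading refers to the usual idea that a histogram rescaled to total area 1 approximates the density function. The `breaks = 10` value is only a suggestion to `hist()`, and the overlaid kernel density estimate is an illustrative choice, not something taken from the slides.

```r
# Compute the histogram without drawing it
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

# Bar widths and density-scale bar heights
widths <- diff(h$breaks)
heights <- h$density

# On the density scale the bar areas sum to 1
total_area <- sum(widths * heights)
total_area

# Draw the density-scale histogram with a kernel density estimate on top
hist(mtcars$mpg, breaks = 10, freq = FALSE, main = "", xlab = "mpg")
lines(density(mtcars$mpg))
```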

Summary statistics for quantitative data

favstats( ~ mpg, data = mtcars)
##  min Q1 median Q3 max mean sd  n missing
##   10 15     19 23  34   20  6 32       0

Calculation of mean, median and standard deviation using R

length(mtcars$mpg) 
## [1] 32
mean( ~ mpg, data = mtcars)
## [1] 20
median( ~ mpg, data = mtcars)
## [1] 19
sd( ~ mpg, data = mtcars)
## [1] 6
mean( ~ mpg | factor(vs), data = mtcars)
##  0  1 
## 17 25
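
As a cross-check of the functions above, the same three statistics can be computed directly from their definitions in base R. This is a sketch; the variable names `x_bar`, `s_x` and `med` are chosen here for illustration.

```r
x <- mtcars$mpg
n <- length(x)

# Sample mean: sum of the observations divided by n
x_bar <- sum(x) / n

# Sample standard deviation: note the n - 1 in the denominator
s_x <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Sample median: n = 32 is even here, so it is the average
# of the two middle observations in the sorted sample
x_sorted <- sort(x)
med <- (x_sorted[n / 2] + x_sorted[n / 2 + 1]) / 2

c(mean = x_bar, median = med, sd = s_x)
```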

Interpretation of summary statistics: The empirical rule
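
The empirical rule says that for a roughly bell-shaped distribution, about 68% of the observations lie within 1 standard deviation of the mean, about 95% within 2, and nearly all within 3. The sketch below checks how well this holds for `mpg`. The helper `within_k` is a hypothetical name introduced for this example.

```r
x <- mtcars$mpg
x_bar <- mean(x)
s_x <- sd(x)

# Proportion of observations within k standard deviations of the mean
within_k <- function(k) mean(abs(x - x_bar) <= k * s_x)

observed <- sapply(1:3, within_k)
expected <- c(0.68, 0.95, 0.997)  # approximate values from the empirical rule

data.frame(k = 1:3, observed = observed, expected = expected)
```

With only 32 cars, some deviation from the rule is to be expected even for bell-shaped data.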

Percentiles

Percentile calculation for a sample:
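
One possible calculation, sketched under the assumption that linear interpolation between sorted observations is acceptable. This is R's default rule (`quantile()` with `type = 7`); other textbook rules exist and can give slightly different values. The function name `percentile` is hypothetical.

```r
# p-th percentile by linear interpolation between order statistics
# (mirrors R's default, quantile(type = 7); other rules exist)
percentile <- function(x, p) {
  x_sorted <- sort(x)
  n <- length(x_sorted)
  h <- 1 + (n - 1) * p          # fractional position in the sorted sample
  lo <- floor(h)
  hi <- ceiling(h)
  x_sorted[lo] + (h - lo) * (x_sorted[hi] - x_sorted[lo])
}

percentile(mtcars$mpg, 0.10)
quantile(mtcars$mpg, 0.10)      # built-in equivalent
```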

Median, quartiles and interquartile range
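
The median is the 50th percentile, the quartiles Q1 and Q3 are the 25th and 75th percentiles, and the interquartile range is their difference. A short base-R sketch, using R's default quantile rule:

```r
x <- mtcars$mpg

median(x)                                 # 50th percentile
quartiles <- quantile(x, c(0.25, 0.75))   # Q1 and Q3
quartiles

# Interquartile range: the width of the middle 50% of the data
iqr <- unname(quartiles[2] - quartiles[1])
iqr
IQR(x)                                    # built-in equivalent
```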

Recall

favstats( ~ mpg, data = mtcars)
##  min Q1 median Q3 max mean sd  n missing
##   10 15     19 23  34   20  6 32       0

Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:
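
The ingredients of a box plot can be computed by hand. The sketch below assumes the common convention that whiskers extend to the most extreme observations within 1.5 times the IQR of the box, and that anything further out is drawn as an individual point. It uses R's default quantile rule; software that defines the quartiles differently may place the box and whiskers slightly differently.

```r
x <- mtcars$mpg

# Box: from Q1 to Q3, with a line at the median
q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1

# Fences: observations beyond these are treated as outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# Outliers are drawn as individual points
outliers <- x[x < lower_fence | x > upper_fence]

list(box = c(q1, median(x), q3), whiskers = whiskers, outliers = outliers)
```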

Boxplot for fuel consumption

favstats(mpg ~ vs, data = mtcars)
##   vs min Q1 median Q3 max mean  sd  n missing
## 1  0  10 15     16 19  26   17 3.9 18       0
## 2  1  18 21     23 30  34   25 5.4 14       0
gf_boxplot(mpg ~ factor(vs), data = mtcars)

2 quantitative variables: Scatter plot

gf_point(mpg ~ hp, data = mtcars)

gf_point(mpg ~ hp | factor(vs), data = mtcars)

gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars)

gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) %>% gf_lm()
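
Because colour is mapped to `factor(vs)`, the plot above shows one fitted line per group. A rough base-R sketch of the corresponding fits, assuming a separate least-squares line of `mpg` on `hp` for each value of `vs`; the names `fits` and `coefs` are illustrative only.

```r
# Fit mpg ~ hp separately for each engine type (vs = 0 and vs = 1)
fits <- lapply(split(mtcars, mtcars$vs), function(d) lm(mpg ~ hp, data = d))

# One row per group: intercept and slope of the fitted line
coefs <- t(sapply(fits, coef))
coefs
```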

Quantile plots

The empirical quantiles

Normal quantile-quantile plots

qqnorm(mtcars$mpg)
qqline(mtcars$mpg)
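
The points that `qqnorm()` draws can be reproduced by hand, which makes the construction explicit: the sorted observations (the empirical quantiles) are plotted against standard normal quantiles at the plotting positions `ppoints(n)`. A minimal sketch:

```r
x <- mtcars$mpg
n <- length(x)

# Empirical quantiles: the sorted observations
empirical <- sort(x)

# Matching standard normal quantiles at the plotting positions
# that qqnorm() uses, ppoints(n)
theoretical <- qnorm(ppoints(n))

# Same picture as qqnorm(x): roughly a straight line if x is close to normal
plot(theoretical, empirical,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
```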