Introduction to R and descriptive statistics

The ASTA team (Modified by Søren Højsgaard, September 2025)

Introduction to R

RStudio

R basics

4.6 * (2 + 3)^4 
## [1] 2875
a <- 4 
a
## [1] 4
b <- c(2, 5, 7)
b
## [1] 2 5 7
s <- 1:4
s
## [1] 1 2 3 4
s <- seq(1, 4, by = 1)
a * b
## [1]  8 20 28
a + b
## [1]  6  9 11
b ^ 2
## [1]  4 25 49
sum(b)
## [1] 14
prod(b)
## [1] 70

R Markdown

R extensions

install.packages("mosaic")
library(mosaic)

R help

?sum
help.search("plot")

Data in R

Data example

Note: The referendum was held in Chile on 5 October 1988. The “No” side won with 56% of the vote. Democratic elections were held in 1989, leading to the establishment of a new government in 1990.

Chile <- read.delim("https://asta.math.aau.dk/datasets?file=Chile.txt")
head(Chile)
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000      1.01    Y
## 2      N     175000   M  29        PS   7500     -1.30    N
## 3      N     175000   F  38         P  15000      1.23    Y
## 4      N     175000   F  49         P  35000     -1.03    N
## 5      N     175000   F  23         S  35000     -1.10    N
## 6      N     175000   F  28         P   7500     -1.05    N

Data types

Quantitative variables

Categorical/qualitative variables

Variables in the data set

head(Chile)
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000      1.01    Y
## 2      N     175000   M  29        PS   7500     -1.30    N
## 3      N     175000   F  38         P  15000      1.23    Y
## 4      N     175000   F  49         P  35000     -1.03    N
## 5      N     175000   F  23         S  35000     -1.10    N
## 6      N     175000   F  28         P   7500     -1.05    N

Descriptive statistics of categorical data

Tables

tally( ~ vote, data = Chile)
## vote
##    A    N    U    Y <NA> 
##  187  889  588  868  168
tally( ~ vote, data = Chile, format = "percent")
## vote
##     A     N     U     Y  <NA> 
##  6.93 32.93 21.78 32.15  6.22

2 factors: Cross tabulation

tally( ~ vote + sex, data = Chile)
##       sex
## vote     F   M
##   A    104  83
##   N    363 526
##   U    362 226
##   Y    480 388
##   <NA>  70  98
tally( ~ vote | sex, data = Chile, format = "percent")
##       sex
## vote       F     M
##   A     7.54  6.28
##   N    26.32 39.82
##   U    26.25 17.11
##   Y    34.81 29.37
##   <NA>  5.08  7.42
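
The `| sex` form conditions on sex, so each column of percentages sums to 100. A rough base-R sketch of the same idea follows (it is not how mosaic implements `tally()`). It uses the built-in `mtcars` data so that it runs without downloading `Chile`; letting `cyl` and `am` stand in for `vote` and `sex` is an assumption made purely for illustration.

```r
# Counts of cylinders by transmission type (am: 0 = automatic, 1 = manual)
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
tab

# Percentages within each column (conditioning on am), like tally(~ cyl | am)
pct <- 100 * prop.table(tab, margin = 2)
round(pct, 2)

# Each column sums to 100
colSums(pct)
```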

Visualizing categorical data: Bar graph

gf_bar( ~ vote, data = Chile)

gf_bar( ~ vote | sex, data = Chile)

Descriptive statistics of quantitative variables

Data example: Fuel consumption of cars

head(mtcars)
##                    mpg cyl disp  hp drat   wt qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.62 16.5  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.88 17.0  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.32 18.6  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.21 19.4  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.44 17.0  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.46 20.2  1  0    3    1

Visualizing quantitative data: Histogram

gf_histogram( ~ mpg, data = mtcars, bins = 10)

Relation between histogram and density function
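
One way to see the relation, sketched in base R rather than with `gf_histogram()`: when the histogram is drawn on the density scale, the bar areas sum to 1, so it can be compared directly with a density curve. The normal curve below is only an illustrative choice of density, not a claim that mpg is normally distributed.

```r
# Histogram on the density scale: bar height = proportion / bin width
h <- hist(mtcars$mpg, freq = FALSE, main = "mpg", xlab = "mpg")

# Total area of the bars is 1, just like the area under a density function
sum(h$density * diff(h$breaks))

# Overlay a normal density with the sample mean and standard deviation
curve(dnorm(x, mean = mean(mtcars$mpg), sd = sd(mtcars$mpg)),
      add = TRUE, col = "red", lwd = 2)
```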

Summary statistics for quantitative data

favstats( ~ mpg, data = mtcars)
##   min   Q1 median   Q3  max mean   sd  n missing
##  10.4 15.4   19.2 22.8 33.9 20.1 6.03 32       0

Calculation of mean, median and standard deviation using R

length(mtcars$mpg) 
## [1] 32
mean( ~ mpg, data = mtcars)
## [1] 20.1
median( ~ mpg, data = mtcars)
## [1] 19.2
sd( ~ mpg, data = mtcars)
## [1] 6.03
mean( ~ mpg | factor(vs), data = mtcars)
##    0    1 
## 16.6 24.6
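
For reference, the same quantities can be computed from their definitions, \(\bar y = \frac{1}{n}\sum_i y_i\) and \(s = \sqrt{\frac{1}{n-1}\sum_i (y_i - \bar y)^2}\). The sketch below uses base R only; the variable names are chosen for illustration.

```r
y <- mtcars$mpg
n <- length(y)

y_bar <- sum(y) / n                        # sample mean
s <- sqrt(sum((y - y_bar)^2) / (n - 1))    # sample standard deviation

c(mean = y_bar, sd = s)

# Agrees with the built-in functions
all.equal(y_bar, mean(y))
all.equal(s, sd(y))

# Group means, as in mean(~ mpg | factor(vs)): one mean per level of vs
tapply(mtcars$mpg, mtcars$vs, mean)
```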

Interpretation of summary statistics: The empirical rule

Very practical rules of thumb

If the histogram of the sample is unimodal and approximately bell shaped, then the mean and the median are approximately equal:

mean(mtcars$mpg)
## [1] 20.1
median(mtcars$mpg)
## [1] 19.2

The median is easy to find: sort the data and locate the middle observation (or average the two middle observations when \(n\) is even).
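
The sorting recipe can be sketched as follows. With an even number of observations there is no single middle observation, and the usual convention (also used by `median()`) is to average the two middle ones.

```r
y_sorted <- sort(mtcars$mpg)
n <- length(y_sorted)

if (n %% 2 == 1) {
  med <- y_sorted[(n + 1) / 2]                        # single middle observation
} else {
  med <- (y_sorted[n / 2] + y_sorted[n / 2 + 1]) / 2  # average the two middle ones
}
med
```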

If we treat the central 95% of the observations as “all observations”, we get a very practical rule of thumb: the range of all, or nearly all, observations is approximately \(4s\), that is, two standard deviations on either side of the mean. This gives a very useful interpretation of \(s\).

4*sd(mtcars$mpg)
## [1] 24.1
diff(range(mtcars$mpg))
## [1] 23.5
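
The empirical rule says that, for roughly bell-shaped data, about 68% of the observations lie within one standard deviation of the mean, about 95% within two, and nearly all within three. A quick check on mpg follows; the sample is small, so only rough agreement should be expected.

```r
y <- mtcars$mpg
y_bar <- mean(y)
s <- sd(y)

# Proportion of observations within k standard deviations of the mean
within_k <- sapply(1:3, function(k) mean(abs(y - y_bar) <= k * s))
names(within_k) <- paste0(1:3, "s")
within_k
```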

Percentiles

Percentile calculation for a sample:
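
In base R, `quantile()` computes sample percentiles. Several interpolation conventions exist (the `type` argument); the sketch below uses R's default, so other software may give slightly different values for small samples.

```r
y <- mtcars$mpg

# 10th, 25th, 50th, 75th and 90th percentiles
quantile(y, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))

# The 50th percentile is the median
all.equal(quantile(y, probs = 0.5, names = FALSE), median(y))
```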

Median, quartiles and interquartile range

Recall

favstats( ~ mpg, data = mtcars)
##   min   Q1 median   Q3  max mean   sd  n missing
##  10.4 15.4   19.2 22.8 33.9 20.1 6.03 32       0

Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:
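
A sketch of the usual construction, assuming the common 1.5 × IQR convention for the whiskers (the exact quartile definition used by a given plotting function may differ slightly):

```r
y <- mtcars$mpg

# The box runs from Q1 to Q3, with a line at the median
q <- quantile(y, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]                      # interquartile range = length of the box

# Fences: observations beyond these are drawn as individual points (outliers)
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Whiskers extend to the most extreme observations inside the fences
inside <- y[y >= lower_fence & y <= upper_fence]
whiskers <- range(inside)
outliers <- y[y < lower_fence | y > upper_fence]

list(box = q, iqr = iqr, whiskers = whiskers, outliers = outliers)
```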

Boxplot for fuel consumption

favstats(mpg ~ vs, data = mtcars)
##   vs  min   Q1 median   Q3  max mean   sd  n missing
## 1  0 10.4 14.8   15.7 19.1 26.0 16.6 3.86 18       0
## 2  1 17.8 21.4   22.8 29.6 33.9 24.6 5.38 14       0
gf_boxplot(mpg ~ factor(vs), data = mtcars)

2 quantitative variables: Scatter plot

gf_point(mpg ~ hp, data = mtcars)

gf_point(mpg ~ hp | factor(vs), data = mtcars)

gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars)

gf_point(mpg ~ hp, col = ~factor(vs), data = mtcars) |> gf_lm()

Quantile plots

The empirical quantiles

The quantiles of a distribution may be used to summarize the distribution or to investigate if a sample comes from a specific distribution.

\[\hat F(x) = \frac{\mbox{number of sample points} \le x}{n}\]

\[x_{(1)}\leq x_{(2)}\leq \ldots \leq x_{(n)}\]

plot(ecdf(mtcars$mpg))

\[\hat{F}(x_{(i)}) = \frac{i}{n}.\]
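
The definition of \(\hat F\) can be written directly in R and compared with the built-in `ecdf()`. Note that \(\hat F(x_{(i)}) = i/n\) holds exactly only when there are no ties; at tied values \(\hat F\) jumps to the largest index among the ties. `F_hat` is a name chosen for this sketch.

```r
x <- mtcars$mpg
n <- length(x)
x_sorted <- sort(x)

# Empirical distribution function: proportion of sample points <= t
F_hat <- function(t) mean(x <= t)

by_hand <- sapply(x_sorted, F_hat)
built_in <- ecdf(x)(x_sorted)
all.equal(by_hand, built_in)

# i/n versus F_hat at the order statistics; they differ only at tied values
cbind(x_sorted, i_over_n = (1:n) / n, F_hat = by_hand)[1:4, ]
```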

Quantile-quantile plots

\[ \frac{i}{n} = \hat F(x_{(i)}) \approx P(X\leq x_{(i)}) \]

\[ \frac{i}{n} = P(Y \le q_i) \]

Example: Does mpg come from a uniform distribution \(U(10, 34)\)? It does not look like it, since the points deviate systematically from the red identity line:

x <- mtcars$mpg
n <- length(x)
x_sorted <- sort(x) 
i.n <- (1:n) / n
q.i <- qunif(i.n, min(x), max(x)) # min(x) is 10.4 and max(x) is 33.9, i.e. roughly U(10, 34)
qqplot(q.i, x_sorted)
abline(a=0, b=1, col="red", lwd=2)

Normal quantile-quantile plots

\[ \begin{aligned} \frac{i}{n} &= P(Z\leq q_i) = P(\mu + \sigma Z \leq \mu + \sigma q_i)\\ &= P(Y \leq \mu + \sigma q_i)\\ \end{aligned} \]

So \(\mu + \sigma q_i\) is the corresponding \(\frac{i}{n}\)-quantile of a \(\texttt{norm}(\mu,\sigma)\) distribution.

Hence if the sample comes from a \(\texttt{norm}(\mu,\sigma)\) distribution, then the sample quantiles \(x_{(i)}\) should be approximately equal to the population quantiles \(\mu + \sigma q_i\):

\[ x_{(i)} \approx \mu + \sigma q_i \]

So if we plot the points \((q_i, x_{(i)})\) and the sample comes from a \(\texttt{norm}(\mu,\sigma)\) distribution, the points should lie close to a straight line with intercept \(\mu\) and slope \(\sigma\). For mpg the plot looks mostly like a straight line, so mpg could be described, approximately, by a normal distribution:

qqnorm(mtcars$mpg)
qqline(mtcars$mpg, col="red", lwd=2)
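
The same plot can be built by hand. One caveat: with probabilities \(i/n\) the last quantile, `qnorm(1)`, is infinite, so the sketch below uses the shifted probabilities \((i - 0.5)/n\), which is what `ppoints()` returns for \(n > 10\). A least-squares line through the points then gives rough estimates of \(\mu\) (intercept) and \(\sigma\) (slope). This is an illustration only; `qqline()` itself draws a line through the quartiles instead.

```r
x <- mtcars$mpg
n <- length(x)
x_sorted <- sort(x)

# Standard normal quantiles at probabilities (i - 0.5) / n
p <- ((1:n) - 0.5) / n
q <- qnorm(p)

plot(q, x_sorted, xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# Straight line fitted to the points: intercept ~ mu, slope ~ sigma
fit <- lm(x_sorted ~ q)
abline(fit, col = "red", lwd = 2)
coef(fit)
c(mean = mean(x), sd = sd(x))
```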