The ASTA team
Session -> Set Working Directory -> Choose Directory (shortcut: Ctrl+Shift+H).
Tools -> Global Options -> Choose Directory.
## [1] 2875
## [1] 4
## [1] 2 5 7
## [1] 1 2 3 4
## [1] 8 20 28
## [1] 6 9 11
## [1] 4 25 49
## [1] 14
## [1] 70
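The output above comes from using R as a calculator. The commands themselves are not shown in the text, so the following is only a sketch of commands that would print values like most of those above. The variable names a, b and s are assumptions.

```r
# Assumed commands; the names a, b and s are illustrative only
4.6 * (2 + 3)^4      # plain arithmetic
a <- 4               # store a number in a variable
a
b <- c(2, 5, 7)      # combine numbers into a vector
b
s <- 1:4             # integer sequence from 1 to 4
s
a * b                # elementwise multiplication
a + b                # elementwise addition
b^2                  # elementwise squaring
sum(b)               # sum of the elements of b
```

Arithmetic on a vector is done element by element, which is why a * b returns three numbers.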
The slides and all exercises in R (including the exam questions) are made in the special Rmarkdown format.
This allows you to combine text and R code.
You can write formulas using standard LaTeX commands.
Tools -> Install Packages
Packages are loaded with the library command. Here we load the mosaic package, which has a lot of convenient functions for this course (we will get back to that later). Loading it also prints a lot of info about functions that have been changed by the mosaic package, but you can safely ignore that.
Help for a command is shown with ?<command>.
Press tab to make RStudio guess what you have started typing.
To Source button.
Now we will have a look at a data set concerning the 1988 vote in Chile for or against Pinochet continuing as leader. The sample consists of 2700 voters randomly selected from the Chilean population.
The data set contains the variables:
region: The region in Chile where the voter lives.
population: Population of the region.
sex: The gender of the voter.
age: The age of the voter.
education: Education level of the voter.
income: Monthly income of the voter.
statusquo: To which degree the voter supports the status quo.
vote: Should Pinochet continue? Y = yes, N = no, U = undecided, A = will abstain from voting.
More information about the data set may be found here.
##   region population sex age education income statusquo vote
## 1      N     175000   M  65         P  35000       1.0    Y
## 2      N     175000   M  29        PS   7500      -1.3    N
## 3      N     175000   F  38         P  15000       1.2    Y
## 4      N     175000   F  49         P  35000      -1.0    N
## 5      N     175000   F  23         S  35000      -1.1    N
## 6      N     175000   F  28         P   7500      -1.0    N
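The first rows shown above can be printed once the data are in a data frame. How the course loads the file is not shown here. As an assumption, this sketch uses the copy of the Chile data shipped with the carData package, which has the same variables.

```r
# Assumption: use the copy of the data in the carData package
# (run install.packages("carData") first if it is missing)
data("Chile", package = "carData")

head(Chile)   # first six rows of the data set
dim(Chile)    # number of rows (voters) and columns (variables)
str(Chile)    # type of each variable (numeric or factor)
```

The output of str is a quick way to see which variables R treats as quantitative and which as categorical.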
Quantitative variables in the Chile data set: population, age, income, statusquo
Categorical variables: region, sex, education, vote
All the categorical variables are nominal except education, which has three ordered categories (primary, secondary, post-secondary).
For the variable vote we can use the function tally from the mosaic package (remember the package must be loaded via library(mosaic) if you did not do so yet):
## vote
##    A    N    U    Y <NA>
##  187  889  588  868  168
In percent:
## vote
##    A    N    U    Y <NA>
##  6.9 32.9 21.8 32.1  6.2
Here the variable is written as a formula (characterized by the “tilde” sign ~) to indicate that we want this variable from the dataset Chile (without the tilde it would look for a global variable called vote and use that rather than the one in the dataset).
To get an overview of the relation between two categorical variables, we can make a cross tabulation.
To make a table of all combinations of the two factors vote and sex, we use tally again:
##       sex
## vote     F   M
##   A    104  83
##   N    363 526
##   U    362 226
##   Y    480 388
##   <NA>  70  98
##       sex
## vote      F    M
##   A     7.5  6.3
##   N    26.3 39.8
##   U    26.3 17.1
##   Y    34.8 29.4
##   <NA>  5.1  7.4
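The tables above can be produced along the following lines. This is a sketch rather than the exact course code: it assumes the mosaic package is installed and, as an assumption, takes the Chile data from the carData package. The object names tab_vote, pct_vote, tab_vs and pct_vs are illustrative.

```r
library(mosaic)                       # provides tally() with a formula interface
data("Chile", package = "carData")    # assumption: carData copy of the data

# One categorical variable: counts and percentages
tab_vote <- tally(~ vote, data = Chile)
pct_vote <- tally(~ vote, data = Chile, format = "percent")
tab_vote
round(pct_vote, 1)

# Two categorical variables: cross tabulation of counts
tab_vs <- tally(~ vote + sex, data = Chile)
tab_vs

# Percentages of vote within each sex, so each column sums to 100
pct_vs <- tally(~ vote | sex, data = Chile, format = "percent")
round(pct_vs, 1)
```

Writing vote | sex conditions on sex, which is why the percentages in the second cross table add to 100 within each column.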
A bar graph can be made with gf_bar from mosaic. For each level of the factor, a box is drawn with the height proportional to the frequency (count) of the level.
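A minimal sketch of such a bar graph for the vote variable, assuming mosaic is installed and, as an assumption, taking the Chile data from the carData package. The object name p_bar is illustrative.

```r
library(mosaic)                       # loads ggformula, which provides gf_bar()
data("Chile", package = "carData")    # assumption: carData copy of the data

# One bar per level of vote; bar height is the count of that level
p_bar <- gf_bar(~ vote, data = Chile)
p_bar
```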
In this data set, a car magazine tested the fuel consumption of 32 cars. The variable mpg gives the fuel consumption in miles per gallon (the data set is from 1974).
The data set is built into R under the name mtcars, so it does not need to be loaded before use.
##                   mpg cyl disp  hp drat  wt qsec vs am gear carb
## Mazda RX4          21   6  160 110  3.9 2.6   16  0  1    4    4
## Mazda RX4 Wag      21   6  160 110  3.9 2.9   17  0  1    4    4
## Datsun 710         23   4  108  93  3.9 2.3   19  1  1    4    1
## Hornet 4 Drive     21   6  258 110  3.1 3.2   19  1  0    3    1
## Hornet Sportabout  19   8  360 175  3.1 3.4   17  0  0    3    2
## Valiant            18   6  225 105  2.8 3.5   20  1  0    3    1
The way to get a first impression of a quantitative variable is to draw a histogram.
The histogram of a variable x is made as follows:
Divide the interval from the minimum value of x to the maximum value of x into an appropriate number of equal sized sub-intervals.
Draw a bar over each sub-interval with height proportional to the number of observations in it.
Histogram of mpg for the mtcars data. The bins option sets the number of sub-intervals to 10.
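The construction can be sketched by hand and then compared with the plotting function. The manual binning below is an illustration with assumed names (x, breaks, counts). gf_histogram comes from mosaic and may place its bin edges slightly differently, so its bar heights need not match the manual counts exactly.

```r
library(mosaic)

x <- mtcars$mpg

# Divide [min(x), max(x)] into 10 equal sized sub-intervals
breaks <- seq(min(x), max(x), length.out = 11)

# Count the observations falling in each sub-interval
counts <- table(cut(x, breaks = breaks, include.lowest = TRUE))
counts

# Histogram with 10 bins drawn by mosaic/ggformula
p_hist <- gf_histogram(~ mpg, data = mtcars, bins = 10)
p_hist
```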
Suppose a sample comes from a population having a continuous distribution with density function \(f\).
Draw a histogram where the \(y\)-axis is scaled such that the total area of the bars is 1.
When the number of observations (the sample size) increases we can make a finer interval division and get a smoother histogram.
When the number of observations tends to infinity, we obtain a nice smooth curve, where the area below the curve is \(1\). This curve is exactly the probability density function \(f\).
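A small simulation illustrates the idea. As an assumption made only for this illustration, the population is taken to be standard normal, so the density f is dnorm. The names y, h and area are illustrative.

```r
set.seed(1)                      # make the simulation reproducible
y <- rnorm(10000)                # large sample from a standard normal population

# Histogram on the density scale: total area of the bars is 1
h <- hist(y, breaks = 50, freq = FALSE,
          main = "Density histogram of 10000 draws")

# Overlay the population density f
curve(dnorm(x), add = TRUE)

# Check the total area of the bars
area <- sum(h$density * diff(h$breaks))
area
```

Rerunning with a smaller sample, for example rnorm(50), gives a much rougher histogram around the same curve.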
Consider the mtcars example. A summary of the fuel consumption mpg can be retrieved using the favstats function:
##  min Q1 median Q3 max mean sd  n missing
##   10 15     19 23  34   20  6 32       0
min The minimal value in the sample is \(10.4\).
max The maximal value in the sample is \(33.9\).
n The sample size (number of observations) is 32.
mean The sample mean is \(20.1\). Recall that this was the average of all observations \(x_1,\ldots,x_n\), i.e. \[ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i \]
sd The sample standard deviation is \(6.03\). Recall that this was given by \[ s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}. \]
missing There are no missing values.
median The median (or 50-percentile) is the value such that half of the sample has lower values than the median and half the sample has larger values.
Q1 and Q3 will be introduced on later slides.
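As a check of the definitions, the sample mean and the sample standard deviation (the square root of the sample variance) can be computed directly and compared with the favstats output. The names x, n, x_bar and s are illustrative, and mosaic is assumed to be installed.

```r
library(mosaic)

x <- mtcars$mpg
n <- length(x)                            # sample size

x_bar <- sum(x) / n                       # sample mean
s <- sqrt(sum((x - x_bar)^2) / (n - 1))   # sample standard deviation

c(mean = x_bar, sd = s)

# All summaries in one call
favstats(~ mpg, data = mtcars)
```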
The mean, median and standard deviation are just some of the summaries that can be read off the favstats output (shown on previous page). They may also be calculated separately in the following way:
Sample size of mpg:
## [1] 32
Sample mean of mpg:
## [1] 20
Sample median of mpg:
## [1] 19
Sample standard deviation of mpg:
## [1] 6
For each group (defined by the variable vs) the sample mean is:
##  0  1
## 17 25
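These separate summaries could be computed as sketched below, using the formula interface that mosaic adds to mean, median and sd. The object names are illustrative. The output above appears rounded to two significant digits, which is presumably a print setting in the slides.

```r
library(mosaic)

n_obs  <- nrow(mtcars)                  # sample size
m_mpg  <- mean(~ mpg, data = mtcars)    # sample mean
md_mpg <- median(~ mpg, data = mtcars)  # sample median
sd_mpg <- sd(~ mpg, data = mtcars)      # sample standard deviation

# Sample mean of mpg within each level of vs
m_by_vs <- mean(mpg ~ vs, data = mtcars)

n_obs
m_mpg
md_mpg
sd_mpg
m_by_vs
```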
If the histogram of the sample looks like a bell shaped curve, then we have
First, sort data from smallest to largest. For the mpg variable: \[
x_{(1)}=10.4, x_{(2)}=10.4, x_{(3)}=13.3, \ldots, x_{(n)} = 33.9.
\] Here the number of observations is \(n=32\).
Find the \(10\)th percentile (i.e. \(p = 10\)):
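In R the sorting and the percentile can be sketched as follows. Note the assumption: quantile uses its default rule (type = 7), which interpolates between neighbouring ordered observations, so the value may differ slightly from a hand calculation based on another rule. The names x, n, p and q10 are illustrative.

```r
x <- sort(mtcars$mpg)     # ordered sample x_(1), ..., x_(n)
n <- length(x)

x[1:3]                    # the three smallest observations
x[n]                      # the largest observation

p <- 10                   # we want the 10th percentile
q10 <- quantile(mtcars$mpg, probs = p / 100)   # default rule, type = 7
q10

# Share of the sample at or below the computed percentile
mean(mtcars$mpg <= q10)
```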
Recall
##  min Q1 median Q3 max mean sd  n missing
##   10 15     19 23  34   20  6 32       0
How to draw a box-and-whiskers plot:
Box:
Whiskers:
Outliers: Measurements smaller than L or larger than U are drawn as circles.
Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.
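The limits L and U are not defined in the text as it stands. The sketch below assumes the common convention of 1.5 times the interquartile range beyond the quartiles. That convention, and the names q, iqr, L, U, outliers and whiskers, are assumptions made for illustration.

```r
x <- mtcars$mpg

# Quartiles Q1 and Q3 (R's default quantile rule)
q <- unname(quantile(x, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]                 # interquartile range

# Assumed limits: 1.5 * IQR outside the box
L <- q[1] - 1.5 * iqr
U <- q[2] + 1.5 * iqr

outliers <- x[x < L | x > U]              # drawn as circles
whiskers <- range(x[x >= L & x <= U])     # min and max of the non-outliers

c(L = L, U = U)
outliers
whiskers

# Base R box plot; it computes the box from hinges, so details may differ slightly
boxplot(x, horizontal = TRUE)
```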
##   vs min Q1 median Q3 max mean  sd  n missing
## 1  0  10 15     16 19  26   17 3.9 18       0
## 2  1  18 21     23 30  34   25 5.4 14       0
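A grouped summary like the one above, and the matching box plots, can be sketched with the mosaic formula interface. Converting vs to a factor for the plot is a choice made here so that each group gets its own box. The names fs_by_vs and p_box are illustrative.

```r
library(mosaic)

# Summary statistics of mpg within each level of vs
fs_by_vs <- favstats(mpg ~ vs, data = mtcars)
fs_by_vs

# One box plot of mpg per level of vs
p_box <- gf_boxplot(mpg ~ factor(vs), data = mtcars)
p_box
```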
A scatter plot is used to visualize two quantitative variables.
For instance, we can plot the relation between fuel consumption and horsepower (hp) of a car as follows:
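A minimal sketch of such a scatter plot with gf_point from mosaic. The second plot, which colours the points by vs, is an assumption about how the groups might be shown and not necessarily the plot used in the slides. The names p_scatter and p_by_vs are illustrative.

```r
library(mosaic)

# Fuel consumption (y-axis) against horsepower (x-axis)
p_scatter <- gf_point(mpg ~ hp, data = mtcars)
p_scatter

# Same plot with points coloured by the level of vs
p_by_vs <- gf_point(mpg ~ hp, color = ~ factor(vs), data = mtcars)
p_by_vs
```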
vs:
The quantiles may be used to investigate whether the sample comes from a normal distribution.
Call the \(\frac{i}{n}\)th quantile of a standard normal distribution \(q_i\), i.e. \(P(Z\leq q_i)= \frac{i}{n}\).
If \(Y\sim \texttt{norm}(\mu,\sigma)\), then this is equivalent to \[P(Y\leq \mu+\sigma q_i)= \frac{i}{n}.\]
Suppose the population follows a \(\texttt{norm}(\mu,\sigma)\) distribution, then the sample quantiles \(x_{(i)}\) should be approximately equal to the population quantiles \(\mu+\sigma q_i\).
If we make a scatter plot of the pair \((q_i,x_{(i)})\), these should lie on a straight line. We call this a normal Q-Q plot (or quantile-quantile plot).
We can use a normal Q-Q plot to check whether the mpg variable in the mtcars data set follows a normal distribution:
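The construction can be sketched by hand and compared with the built-in function. One deliberate deviation, stated as an assumption: using exactly i/n would give an infinite quantile for i = n, so the sketch uses ppoints(n), which shifts the probabilities slightly inwards (this is also what qqnorm does). The names x, n, q and r are illustrative.

```r
x <- sort(mtcars$mpg)        # sample quantiles x_(1), ..., x_(n)
n <- length(x)

# Standard normal quantiles at probabilities close to i/n
q <- qnorm(ppoints(n))

# Manual Q-Q plot: points near a straight line suggest normality
plot(q, x, xlab = "Standard normal quantiles", ylab = "Sample quantiles of mpg")

# Built-in version with a reference line
qqnorm(mtcars$mpg)
qqline(mtcars$mpg)

# Correlation between the two sets of quantiles as a rough linearity check
r <- cor(q, x)
r
```

If the points follow the line, its intercept and slope give rough estimates of \(\mu\) and \(\sigma\).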