The ASTA team
To set the working directory in RStudio: Session -> Set Working Directory -> Choose Directory (shortcut: Ctrl+Shift+H).
To change the default working directory: Tools -> Global Options -> Choose Directory
4.6 * (2 + 3)^4
## [1] 2875
a <- 4
a
## [1] 4
b <- c(2, 5, 7)
b
## [1] 2 5 7
s <- 1:4
s
## [1] 1 2 3 4
s <- seq(1, 4, by = 1)
a * b
## [1] 8 20 28
a + b
## [1] 6 9 11
b ^ 2
## [1] 4 25 49
sum(b)
## [1] 14
prod(b)
## [1] 70
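The products and sums above rely on R applying arithmetic element by element, with the shorter vector recycled to the length of the longer one. A minimal sketch of that rule, reusing a and b from above; the vectors short and long are made up for illustration:

```r
a <- 4
b <- c(2, 5, 7)

# A single number is recycled against every element of b
a * b                      # same as c(4, 4, 4) * b

# A length-2 vector is recycled along a length-4 vector
short <- c(1, 2)           # illustrative values, not from the course data
long  <- c(10, 20, 30, 40)
recycled <- short * long   # computed as c(1, 2, 1, 2) * long

# 1:4 and seq(1, 4, by = 1) hold the same values
same_values <- all(1:4 == seq(1, 4, by = 1))
```

Recycling is convenient for scaling a whole vector by one number, but it also applies silently when vector lengths differ, so it is worth checking lengths when a result looks odd.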
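Packages are installed via Tools -> Install Packages or with the command:
install.packages("mosaic")
A package is loaded with the library command:
library(mosaic)
This loads the mosaic package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the mosaic package, but you can safely ignore that.
Help for a command is available via ?<command>:
?sum
Use the tab key to make RStudio guess what you have started typing.
The help pages can be searched with help.search("plot").
Commands can be copied from the history into a script with the To Source button.
Data: Magazine Ads Readability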
magAds <- read.delim("https://asta.math.aau.dk/datasets?file=magazineAds.txt")
head(magAds)
##   WDS SEN X3SYL MAG GROUP
## 1 205   9    34   1     1
## 2 203  20    21   1     1
## 3 229  18    37   1     1
## 4 208  16    31   1     1
## 5 146   9    10   1     1
## 6 230  16    24   1     1
Note that an X has been prefixed in X3SYL, because a syntactically valid variable name in R cannot begin with a digit.
The function cut will divide the range of a numeric variable into a number of equally sized intervals, and record which interval each observation belongs to. E.g. for the variable X3SYL (the number of words with more than three syllables) in the magazine data:
# Before 'cutting':
magAds$X3SYL[1:5]
## [1] 34 21 37 31 10
# After 'cutting' into 4 intervals:
syll <- cut(magAds$X3SYL, 4)
syll[1:5]
## [1] (32.2,43] (10.8,21.5] (32.2,43] (21.5,32.2] (-0.043,10.8]
## Levels: (-0.043,10.8] (10.8,21.5] (21.5,32.2] (32.2,43]
The result is a factor and the labels are the interval end points by default. Custom ones can be assigned through the labels argument:
labs <- c("few", "some", "many", "lots")
syll <- cut(magAds$X3SYL, 4, labels = labs) # NB: this overwrites the 'syll' defined above
syll[1:5]
## [1] lots some lots many few
## Levels: few some many lots
magAds$syll <- syll # Adding a new column to the dataset
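When equal-width intervals are not what you want, cut also accepts explicit break points. A small sketch using the five X3SYL values printed above; the break points are arbitrary choices for illustration and are not used elsewhere in these notes:

```r
x <- c(34, 21, 37, 31, 10)        # first five X3SYL values shown above

# Hypothetical break points; intervals are (0,10], (10,20], (20,30], (30,45]
brks <- c(0, 10, 20, 30, 45)
grp  <- cut(x, breaks = brks, labels = c("few", "some", "many", "lots"))

grp                                # factor recording each value's interval
table(grp)                         # counts per interval
```

By default the intervals are closed on the right, so the value 10 falls in the first interval rather than the second.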
The number of observations in each level can be tabulated with tally from the mosaic package (remember the package must be loaded via library(mosaic) if you did not do so yet):
tally( ~ syll, data = magAds)
## syll
##  few some many lots
##   26   14   10    4
tally( ~ syll, data = magAds, format = "percent")
## syll
##  few some many lots
## 48.1 25.9 18.5  7.4
Here we use an R formula (characterized by the “tilde” sign ~) to indicate that we want this variable from the dataset magAds (without the tilde it would look for a global variable called syll and use that rather than the one in the dataset).
Two factors can be cross tabulated using tally again:
tally( ~ syll + GROUP, data = magAds)
##       GROUP
## syll    1  2  3
##   few   8 11  7
##   some  4  2  8
##   many  3  5  2
##   lots  3  0  1
tally( ~ syll | GROUP, data = magAds, format = "percent")
##       GROUP
## syll      1    2    3
##   few  44.4 61.1 38.9
##   some 22.2 11.1 44.4
##   many 16.7 27.8 11.1
##   lots 16.7  0.0  5.6
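The percentages in the last table are conditional on GROUP, i.e. each column sums to 100. A base R sketch that recomputes them from the counts printed above; the matrix is typed in by hand here, so the example needs neither the dataset nor mosaic:

```r
# Counts copied from the tally( ~ syll + GROUP ) output above
counts <- matrix(c( 8, 4, 3, 3,    # GROUP 1
                   11, 2, 5, 0,    # GROUP 2
                    7, 8, 2, 1),   # GROUP 3
                 nrow = 4,
                 dimnames = list(syll  = c("few", "some", "many", "lots"),
                                 GROUP = c("1", "2", "3")))

# margin = 2 divides every column by its own total
col_pct <- round(100 * prop.table(counts, margin = 2), 1)
col_pct
```

Using margin = 1 instead would give row percentages, i.e. the distribution of GROUP within each level of syll.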
A bar graph of a factor can be drawn with gf_bar from mosaic. For each level of the factor a box is drawn with the height proportional to the frequency (count) of the level.
gf_bar( ~ syll, data = magAds)
gf_bar( ~ syll | GROUP, data = magAds)
minority: Percentage black or Hispanic.
crime: Rate of serious crimes per 1000 individuals in the population.
poverty: Percentage poor.
language: Percentage having difficulty speaking or writing English.
highschool: Percentage aged 25 or older who had not finished highschool.
housing: Percentage of housing in small, multiunit buildings.
city: A factor with levels: city (major city) and state (state or state-remainder).
conventional: Percentage of households counted by conventional personal enumeration.
undercount: Preliminary estimate of percentage undercount.
Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
head(Ericksen)
##           name minority crime poverty language highschool housing  city
## 1      Alabama     26.1    49      19      0.2         44     7.6 state
## 2       Alaska      5.7    62      11      1.7         18    23.6 state
## 3      Arizona     18.9    81      13      3.2         28     8.1 state
## 4     Arkansas     16.9    38      19      0.2         44     7.0 state
## 5 California.R     24.3    73      10      5.0         26    11.8 state
## 6     Colorado     15.2    73      10      1.2         21     9.2 state
##   conventional undercount
## 1            0      -0.04
## 2          100       3.35
## 3           18       2.48
## 4            0      -0.74
## 5            4       3.60
## 6           19       1.34
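How to make a histogram for a variable x:
Divide the interval from the minimum value of x to the maximum value of x in an appropriate number of equal sized sub-intervals.
Draw a box over each sub-interval with height proportional to the number of observations in the sub-interval.
gf_histogram( ~ crime, data = Ericksen)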
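The binning behind a histogram can be sketched in base R. This uses only the six crime values printed above, so the counts are purely illustrative; the choice of four sub-intervals is an assumption, and gf_histogram picks its own default number of bins.

```r
crime <- c(49, 62, 81, 38, 73, 73)   # crime values from the six rows shown above

# Four equal-width sub-intervals from the minimum to the maximum
breaks <- seq(min(crime), max(crime), length.out = 5)

# Count the observations falling in each sub-interval
bins   <- cut(crime, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
counts

# hist() performs the same counting (plot = FALSE returns the numbers only)
h <- hist(crime, breaks = breaks, plot = FALSE)
h$counts
```

The argument include.lowest = TRUE makes sure the minimum value is counted in the first sub-interval instead of being dropped.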
Consider the variable WDS in the magazine data (WDS = number of words in advertisement). A number of numerical summaries for WDS can be retrieved using the favstats function:
favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0
The observed values of WDS are \(y_1=205\), \(y_2=203\), \(\ldots\), \(y_n=208\), where there are a total of \(n=54\) values. As previously defined this constitutes a sample.
The mean, median and standard deviation are just some of the summaries that can be read off the favstats output (shown on previous page). They may also be calculated separately in the following way:
Mean of WDS:
mean( ~ WDS, data = magAds)
## [1] 123
Median of WDS:
median( ~ WDS, data = magAds)
## [1] 96
Standard deviation of WDS:
sd( ~ WDS, data = magAds)
## [1] 66
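For reference, the three summaries can be written out with base R arithmetic. The sketch uses the first five WDS values shown in the head(magAds) output, so the numbers differ from the full-sample results above; it assumes, as R's sd does, the sample standard deviation with divisor n - 1.

```r
y <- c(205, 203, 229, 208, 146)   # first five WDS values shown earlier
n <- length(y)

y_bar <- sum(y) / n                           # mean
y_med <- sort(y)[(n + 1) / 2]                 # median: middle value when n is odd
s     <- sqrt(sum((y - y_bar)^2) / (n - 1))   # standard deviation

c(mean = y_bar, median = y_med, sd = s)
```

The results agree with the built-in mean(y), median(y) and sd(y).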
We may also calculate the summaries for each group (variable GROUP), e.g. for the mean:
mean( ~ WDS | GROUP, data = magAds)
## 1 2 3
## 140 121 106
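If the histogram of the sample looks like a bell shaped curve, then about 68% of the observations lie within one standard deviation of the mean, about 95% within two standard deviations, and nearly all within three (the empirical rule).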
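A rough check of this rule of thumb on simulated bell-shaped data, as a sketch: the data are drawn from a normal distribution with a mean and standard deviation chosen to resemble WDS (an arbitrary choice), and the usual statement of the rule, about 68% within one standard deviation of the mean and about 95% within two, is assumed.

```r
set.seed(1)                                # for reproducibility
y <- rnorm(10000, mean = 123, sd = 66)     # simulated bell-shaped sample

y_bar <- mean(y)
s     <- sd(y)

within1 <- mean(abs(y - y_bar) <= 1 * s)   # proportion within one sd of the mean
within2 <- mean(abs(y - y_bar) <= 2 * s)   # proportion within two sd of the mean
c(within1, within2)
```

For a skewed sample such as WDS itself the proportions can deviate noticeably from these values, which is why the rule is stated only for bell-shaped histograms.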
First, sort data in increasing order. For the WDS variable in the magazine data:
\[
y_{(1)}=31, y_{(2)}=32, y_{(3)}=34, \ldots, y_{(n)} = 230.
\]
Here the number of observations is \(n=54\).
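The sorting step and the resulting median and quartiles can be sketched as follows, using the six WDS values printed by head(magAds) rather than the full sample. Note that quantile uses an interpolation rule by default (its type 7), which may differ slightly from a hand calculation.

```r
y <- c(205, 203, 229, 208, 146, 230)   # six WDS values shown earlier
y_sorted <- sort(y)                     # y_(1) <= y_(2) <= ... <= y_(n)
n <- length(y_sorted)

# n is even here, so the median is the average of the two middle values
med <- mean(y_sorted[c(n / 2, n / 2 + 1)])

# Lower and upper quartiles as computed by R's default rule
q <- quantile(y, probs = c(0.25, 0.75))
q
```

The hand-computed med agrees with median(y).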
Recall
favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0
How to draw a box-and-whiskers plot:
Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.
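The ingredients of the plot can be computed with base R's boxplot.stats. This sketch assumes the common convention, which is the default in boxplot.stats, that an observation is an outlier when it lies more than 1.5 box lengths beyond the box; the data vector is made up for illustration.

```r
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up data with one extreme value

bs <- boxplot.stats(y)                    # default coef = 1.5

bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # observations flagged as outliers

# The same outlier rule written out by hand
box    <- bs$stats[c(2, 4)]               # ends of the box (hinges, close to Q1 and Q3)
fences <- box + c(-1.5, 1.5) * diff(box)  # limits beyond which points are outliers
y[y < fences[1] | y > fences[2]]
```

Here the upper whisker stops at 9, the largest observation that is not an outlier, which matches the note above.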
Boxplot of the poverty rates separately for cities and states (variable city):
favstats(poverty ~ city, data = Ericksen)
##    city  min   Q1 median Q3 max mean  sd  n missing
## 1  city 11.5 13.8     19 21  23   18 4.0 16       0
## 2 state  6.8  9.5     11 14  24   12 3.7 50       0
gf_boxplot(poverty ~ city, data = Ericksen)
For two quantitative variables the usual graphic is a scatter plot:
gf_point(poverty ~ highschool, data = Ericksen)
This can be either split or coloured according to the value of city:
gf_point(poverty ~ highschool | city, data = Ericksen)
gf_point(poverty ~ highschool, col = ~city, data = Ericksen)
If we want a regression line along with the points we can do:
gf_point(poverty ~ highschool, col = ~city, data = Ericksen) %>% gf_lm()
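By default gf_lm draws a least squares line; the coefficients of such a line can be obtained with lm. A sketch using only the six rows of Ericksen printed above, so the fitted values are illustrative rather than those of the full data:

```r
# poverty and highschool for the six states shown in head(Ericksen)
d <- data.frame(poverty    = c(19, 11, 13, 19, 10, 10),
                highschool = c(44, 18, 28, 44, 26, 21))

fit <- lm(poverty ~ highschool, data = d)   # least squares fit
coef(fit)                                   # intercept and slope

# The slope equals cov(x, y) / var(x)
slope <- cov(d$highschool, d$poverty) / var(d$highschool)
slope
```

The formula poverty ~ highschool reads the same way as in gf_point: response on the left of the tilde, explanatory variable on the right.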
The function factor will directly convert a vector to be of type factor. E.g.:
head(magAds$GROUP)
## [1] 1 1 1 1 1 1
f <- factor(magAds$GROUP)
magAds$GROUP <- f
head(magAds$GROUP)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3
f <- factor(magAds$GROUP,
            levels = c("1", "2", "3"),
            labels = c("high", "medium", "low"))
magAds$GROUP <- f
head(magAds$GROUP)
## [1] high high high high high high
## Levels: high medium low
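A self-contained sketch of the same conversion on a short made-up vector, showing that levels fixes which values are allowed and in what order, while labels renames them; values not listed in levels become NA.

```r
g <- c(1, 3, 2, 1, 4)                       # made-up codes; 4 is not a valid level

f <- factor(g,
            levels = c(1, 2, 3),
            labels = c("high", "medium", "low"))

f            # the value 4 has no matching level and becomes NA
levels(f)    # order follows the levels argument, not the alphabet
table(f)     # counts per level
```

The order of the levels is what tally and gf_bar use when arranging rows and bars, so setting it explicitly avoids an alphabetical ordering such as high, low, medium.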