The ASTA team
Session -> Set Working Directory -> Choose Directory (shortcut: Ctrl+Shift+H).
Tools -> Global Options -> Choose Directory.

4.6 * (2 + 3)^4
## [1] 2875
a <- 4
a
## [1] 4
b <- c(2, 5, 7)
b
## [1] 2 5 7
s <- 1:4
s
## [1] 1 2 3 4
s <- seq(1, 4, by = 1)
a * b
## [1]  8 20 28
a + b
## [1]  6  9 11
b ^ 2
## [1]  4 25 49
sum(b)
## [1] 14
prod(b)
## [1] 70

Tools -> Install Packages
install.packages("mosaic")

Load the package with the library command:
library(mosaic)

This loads the mosaic package which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the mosaic package, but you can safely ignore that.

Get help on a command with ?<command>:
?sum

Press tab to make RStudio guess what you have started typing.

help.search("plot")

To Source button.

Data: Magazine Ads Readability
magAds <- read.delim("https://asta.math.aau.dk/datasets?file=magazineAds.txt")
head(magAds)
##   WDS SEN X3SYL MAG GROUP
## 1 205   9    34   1     1
## 2 203  20    21   1     1
## 3 229  18    37   1     1
## 4 208  16    31   1     1
## 5 146   9    10   1     1
## 6 230  16    24   1     1

Note that an X has been prefixed in X3SYL, since syntactically valid R variable names cannot begin with a digit.

The function cut will divide the range of a numeric variable into a number of equally sized intervals, and record which interval each observation belongs to. E.g. for the variable X3SYL (the number of words with three or more syllables) in the magazine data:

# Before 'cutting':
magAds$X3SYL[1:5]
## [1] 34 21 37 31 10
# After 'cutting' into 4 intervals:
syll <- cut(magAds$X3SYL, 4)
syll[1:5]
## [1] (32.2,43]     (10.8,21.5]   (32.2,43]     (21.5,32.2]   (-0.043,10.8]
## Levels: (-0.043,10.8] (10.8,21.5] (21.5,32.2] (32.2,43]

The result is a factor and the labels are the interval end points by default. Custom ones can be assigned through the labels argument:

labs <- c("few", "some", "many", "lots")
syll <- cut(magAds$X3SYL, 4, labels = labs) # NB: this overwrites the 'syll' defined above
syll[1:5]
## [1] lots some lots many few 
## Levels: few some many lots
magAds$syll <- syll # Adding a new column to the dataset

To summarize the factor we can use tally from the mosaic package (remember the package must be loaded via library(mosaic) if you did not do so yet):

tally( ~ syll, data = magAds)
## syll
##  few some many lots 
##   26   14   10    4
tally( ~ syll, data = magAds, format = "percent")
## syll
##  few some many lots 
## 48.1 25.9 18.5  7.4

Here we use a formula (characterized by the “tilde” sign ~) to indicate that we want this variable from the dataset magAds (without the tilde it would look for a global variable called syll and use that rather than the one in the dataset).

To cross-tabulate two factors we use tally again:

tally( ~ syll + GROUP, data = magAds)
##       GROUP
## syll    1  2  3
##   few   8 11  7
##   some  4  2  8
##   many  3  5  2
##   lots  3  0  1
tally( ~ syll | GROUP, data = magAds, format = "percent")
##       GROUP
## syll      1    2    3
##   few  44.4 61.1 38.9
##   some 22.2 11.1 44.4
##   many 16.7 27.8 11.1
##   lots 16.7  0.0  5.6

A bar graph can be drawn with gf_bar from mosaic. For each level of the factor a box is drawn with the height proportional to the frequency (count) of the level.

gf_bar( ~ syll, data = magAds) 
gf_bar( ~ syll | GROUP, data = magAds)

minority: Percentage black or Hispanic.
crime: Rate of serious crimes per 1000 individuals in the population.
poverty: Percentage poor.
language: Percentage having difficulty speaking or writing English.
highschool: Percentage aged 25 or older who had not finished highschool.
housing: Percentage of housing in small, multiunit buildings.
city: A factor with levels: city (major city) and state (state or state-remainder).
conventional: Percentage of households counted by conventional personal enumeration.
undercount: Preliminary estimate of percentage undercount.

Ericksen <- read.delim("https://asta.math.aau.dk/datasets?file=Ericksen.txt")
head(Ericksen)
##           name minority crime poverty language highschool housing  city
## 1      Alabama     26.1    49      19      0.2         44     7.6 state
## 2       Alaska      5.7    62      11      1.7         18    23.6 state
## 3      Arizona     18.9    81      13      3.2         28     8.1 state
## 4     Arkansas     16.9    38      19      0.2         44     7.0 state
## 5 California.R     24.3    73      10      5.0         26    11.8 state
## 6     Colorado     15.2    73      10      1.2         21     9.2 state
##   conventional undercount
## 1            0      -0.04
## 2          100       3.35
## 3           18       2.48
## 4            0      -0.74
## 5            4       3.60
## 6           19       1.34

Histogram of a variable x:
Divide the interval from the minimum value of x to the maximum value of x into an appropriate number of equal sized sub-intervals.

gf_histogram( ~ crime, data = Ericksen)

Return to the magazine ads data (WDS = number of words in advertisement). A number of numerical summaries for WDS can be retrieved using the favstats function:

favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0

The observed values of WDS are \(y_1=205\), \(y_2=203,\ldots,y_n=208\), where there are a total of \(n=54\) values. As previously defined this constitutes a sample.

The mean, median and standard deviation are just some of the summaries that can be read off the favstats output (shown above). They may also be calculated separately in the following way:

Mean of WDS:
mean( ~ WDS, data = magAds)
## [1] 123
Median of WDS:
median( ~ WDS, data = magAds)
## [1] 96
Standard deviation of WDS:
sd( ~ WDS, data = magAds)
## [1] 66

We may also calculate the summaries for each group (variable GROUP), e.g. for the mean:
mean( ~ WDS | GROUP, data = magAds)
##   1   2   3 
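## 140 121 106

If the histogram of the sample looks like a bell shaped curve, then about 68% of the observations lie within one standard deviation of the mean, about 95% within two standard deviations, and all or nearly all within three (the empirical rule).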
First, sort data in increasing order. For the WDS variable in the magazine data: \[
  y_{(1)}=31, y_{(2)}=32, y_{(3)}=34, \ldots, y_{(n)} = 230.
\] Here the number of observations is \(n=54\).
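As a rough sketch of how the sorted values lead to the median and the quartiles, the following base R code works through a small made-up vector (the values in y are hypothetical and are not the WDS data). The quartiles come from R's quantile function with its default settings; other software, or a calculation by hand, may use a slightly different convention and give slightly different quartiles.

```r
# Hypothetical toy sample (not the WDS data)
y <- c(31, 230, 96, 69, 202, 34, 32)

# Sort in increasing order, giving y_(1), ..., y_(n)
y_sorted <- sort(y)
n <- length(y_sorted)

# Median: the middle value when n is odd,
# the average of the two middle values when n is even
if (n %% 2 == 1) {
  med <- y_sorted[(n + 1) / 2]
} else {
  med <- mean(y_sorted[c(n / 2, n / 2 + 1)])
}

# The hand-computed median agrees with the built-in function
med == median(y)

# Lower quartile, median and upper quartile using R's default rule
quantile(y, probs = c(0.25, 0.5, 0.75))
```

With n = 7 the median is the 4th sorted value. For an even n, such as the 54 observations of WDS, the else branch averages the two middle values instead.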
Recall
favstats( ~ WDS, data = magAds)
##  min Q1 median  Q3 max mean sd  n missing
##   31 69     96 202 230  123 66 54       0

How to draw a box-and-whiskers plot:
Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.
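The note above does not say which rule decides whether an observation is an outlier. A common convention, assumed in this sketch, flags observations lying more than 1.5 times the interquartile range below the lower quartile or above the upper quartile. The data in y are made up for illustration, and plotting functions may compute the quartiles slightly differently.

```r
# Hypothetical toy sample with one unusually large value
y <- c(7, 9, 10, 11, 12, 13, 14, 15, 40)

# Lower and upper quartile (R's default quantile rule) and their distance
q1 <- unname(quantile(y, 0.25))
q3 <- unname(quantile(y, 0.75))
iqr <- q3 - q1

# Assumed convention: outliers lie more than 1.5 * IQR outside the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
is_outlier <- y < lower_fence | y > upper_fence

# Observations flagged as outliers
outliers <- y[is_outlier]

# Whiskers: smallest and largest observation not flagged as an outlier
whiskers <- range(y[!is_outlier])

outliers
whiskers
```

Here only the value 40 falls outside the fences, so the whiskers end at 7 and 15 rather than at the overall maximum.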
Boxplot of the poverty rates separately for cities and states (variable city):
favstats(poverty ~ city, data = Ericksen)
##    city  min   Q1 median Q3 max mean  sd  n missing
## 1  city 11.5 13.8     19 21  23   18 4.0 16       0
## 2 state  6.8  9.5     11 14  24   12 3.7 50       0
gf_boxplot(poverty ~ city, data = Ericksen)

For two quantitative variables the usual graphic is a scatter plot:
gf_point(poverty ~ highschool, data = Ericksen)

This can be either split or coloured according to the value of city:
gf_point(poverty ~ highschool | city, data = Ericksen)
gf_point(poverty ~ highschool, col = ~city, data = Ericksen)

If we want a regression line along with the points we can do:
gf_point(poverty ~ highschool, col = ~city, data = Ericksen) %>% gf_lm()

The function factor will directly convert a vector to be of type factor. E.g.:
head(magAds$GROUP)
## [1] 1 1 1 1 1 1
f <- factor(magAds$GROUP)
magAds$GROUP <- f
head(magAds$GROUP)
## [1] 1 1 1 1 1 1
## Levels: 1 2 3
f <- factor(magAds$GROUP, 
            levels = c("1", "2", "3"),
            labels = c("high", "medium", "low"))
magAds$GROUP <- f
head(magAds$GROUP)
## [1] high high high high high high
## Levels: high medium low
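The same relabelling can be tried on a small made-up vector without downloading the data. In this sketch the vector g is hypothetical. It illustrates that levels fixes the order of the categories, that labels renames them one-to-one in that order, and that tables follow the level order rather than the order of appearance.

```r
# Hypothetical numeric group codes (not the magazine data)
g <- c(1, 3, 2, 1, 1, 3)

# levels fixes the order of the categories; labels renames them one-to-one
f <- factor(g, levels = c(1, 2, 3), labels = c("high", "medium", "low"))

levels(f)          # "high" "medium" "low"
as.character(f)    # "high" "low" "medium" "high" "high" "low"

# Counts per level, reported in the order given by levels
table(f)
```

Giving levels explicitly is a design choice that keeps the category order under your control; without it, factor sorts the distinct values itself.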