The ASTA team
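Set the working directory: Session -> Set Working Directory -> Choose Directory (shortcut: Ctrl+Shift+H).
Set the default working directory: Tools -> Global Options -> Choose Directory.
Install packages from the menu with Tools -> Install Packages, or from the console: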
install.packages("mosaic")
Load a package with the library command:
library(mosaic)
This loads the mosaic package, which has a lot of convenient functions for this course (we will get back to that later). It also prints a lot of info about functions that have been changed by the mosaic package, but you can safely ignore that.
Get help on a command with ?<command>:
?sum
Use tab to make Rstudio guess what you have started typing.
Search the help pages:
help.search("plot")
We use data about penguins from the R package palmerpenguins:
pingviner <- palmerpenguins::penguins
pingviner
## # A tibble: 344 x 8
## species island bill_length_mm bill_depth_mm flipp… body… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18.0 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42.0 20.2 190 4250 <NA> 2007
## # ... with 334 more rows
What type of variable are species and body_mass_g?
It is important to be able to distinguish between discrete count variables and continuous variables, since this often determines how we describe the uncertainty of a measurement.
Are any of the measurements in our data set quantitative?
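One quick way to approach this question is to look at how R stores each variable. The sketch below uses a small hand-typed data frame (the name toy is hypothetical; its four rows are copied from the printed output above) so that it runs in base R without extra packages:

```r
# Toy data: four rows copied by hand from the printed penguin output
toy <- data.frame(
  species        = factor(c("Adelie", "Adelie", "Adelie", "Adelie")),
  bill_length_mm = c(39.1, 39.5, 40.3, 36.7),
  body_mass_g    = c(3750L, 3800L, 3250L, 3450L)
)

# Storage class of each column: factor suggests a categorical variable,
# numeric/integer suggests a quantitative one
sapply(toy, class)

# Names of the columns that R stores as numbers
quantitative <- names(toy)[sapply(toy, is.numeric)]
quantitative
```

The storage class is a hint rather than proof: body_mass_g is stored as an integer here even though body mass is continuous in nature.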
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner)
gf_point(bill_length_mm ~ bill_depth_mm | species, color = ~ species, data = pingviner)
gf_point(bill_length_mm ~ bill_depth_mm, color = ~ species, data = pingviner) %>%
gf_lm()
gf_point(1 ~ bill_length_mm, color = ~ species, data = pingviner)
gf_point(species ~ bill_length_mm, color = ~ species, data = pingviner)
gf_histogram( ~ bill_length_mm, data = pingviner)
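How to make a histogram for a variable x: divide the interval from the minimum value of x to the maximum value of x into an appropriate number of equal sized sub-intervals, and draw a bar over each sub-interval with height proportional to the number of observations in it.
gf_histogram( ~ bill_length_mm, color = ~species, data = pingviner)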
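The binning idea behind a histogram can be sketched in base R. This is only an illustration under stated assumptions: the values are the nine non-missing bill lengths from the first ten printed rows, and the choice of four bins (n_bins) is arbitrary. gf_histogram picks its own bins, so its picture will differ.

```r
# Nine non-missing bill lengths (mm) from the first ten printed rows
x <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)

n_bins <- 4  # illustrative choice, not what gf_histogram uses

# Equally spaced break points from the minimum to the maximum of x
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Bin index of each observation; all.inside = TRUE keeps max(x) in the last bin
bin <- findInterval(x, breaks, all.inside = TRUE)

# Number of observations per bin, i.e. the bar heights
counts <- tabulate(bin, nbins = n_bins)

data.frame(lower = head(breaks, -1), upper = tail(breaks, -1), count = counts)
```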
Boxplots can be good for comparing groups (notice that we put the values on the y-axis here, as is conventional for boxplots):
gf_boxplot(bill_length_mm ~ species, color = ~ species, data = pingviner)
To understand the details of the boxplot, we need to introduce percentiles/quantiles, and in particular quartiles, which can be seen here:
Q <- quantile(bill_length_mm ~ species, data = pingviner, na.rm = TRUE)
Q
## species 0% 25% 50% 75% 100%
## 1 Adelie 32 37 39 41 46
## 2 Chinstrap 41 46 50 51 58
## 3 Gentoo 41 45 47 50 60
Interquartile Range (IQR): a measure of variability given by the difference between the upper and lower quartiles (the 75% and 25% quantiles). For Adelie in the table above: 41 - 37 = 4.
Details about how to find/calculate percentiles are postponed to later.
Note: The whiskers extend to the minimum and maximum of the observations that are not deemed to be outliers.
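As a rough illustration of the IQR and of how whisker ends can be found, the sketch below uses base R on the nine non-missing bill lengths from the first ten printed rows. The 1.5 x IQR rule for flagging outliers is an assumption here: it is the common boxplot convention, but the text above does not say which rule is used.

```r
# Nine non-missing bill lengths (mm) from the first ten printed rows
x <- c(39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0)

# Lower and upper quartiles (25% and 75% quantiles)
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# Interquartile range: upper quartile minus lower quartile (same as IQR(x))
iqr <- q[2] - q[1]

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
is_outlier  <- x < lower_fence | x > upper_fence

# Whiskers: smallest and largest observation not flagged as an outlier
whiskers <- range(x[!is_outlier])
outliers <- x[is_outlier]

list(iqr = iqr, whiskers = whiskers, outliers = outliers)
```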
Select a subset of columns/variables with select() (automatically loaded from the dplyr package by mosaic):
# Assigning the data subset to an object
bill_data <- select(pingviner, bill_length_mm, bill_depth_mm,
species, sex, island, year)
# This particular subset can be written shorter by (overwriting the object):
bill_data <- select(pingviner, -flipper_length_mm, -body_mass_g)
The same can be written with the pipe (%>%) syntax:
bill_data <- pingviner %>% select(-flipper_length_mm, -body_mass_g)
We read this as: first take the dataset pingviner and then select all columns except flipper_length_mm and body_mass_g.
The resulting dataset doesn’t have flipper length and body mass:
names(bill_data)
## [1] "species" "island" "bill_length_mm" "bill_depth_mm"
## [5] "sex" "year"
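To check that the different ways of writing a selection agree, here is a minimal sketch on a hand-typed data frame. The name toy is hypothetical and its two rows are copied from the printed output above. It loads dplyr directly instead of going through mosaic.

```r
library(dplyr)

# Toy data frame: two rows copied by hand from the printed penguin output
toy <- data.frame(
  species           = c("Adelie", "Adelie"),
  bill_length_mm    = c(39.1, 39.5),
  flipper_length_mm = c(181L, 186L),
  body_mass_g       = c(3750L, 3800L)
)

# Dropping columns with a minus sign, without and with the pipe
nested <- select(toy, -flipper_length_mm, -body_mass_g)
piped  <- toy %>% select(-flipper_length_mm, -body_mass_g)

# Naming the columns to keep instead of the ones to drop
by_name <- toy %>% select(species, bill_length_mm)

names(piped)
identical(nested, piped)
```

Which form to use is mostly a matter of readability: dropping is shorter when only a few columns should go, naming is clearer when only a few should stay.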
Use filter() to subset rows/cases. E.g. all penguins from Biscoe islands:
pingviner %>% filter(island == "Biscoe")
## # A tibble: 168 x 8
## species island bill_length_mm bill_depth_mm flipper… body_… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Biscoe 37.8 18.3 174 3400 fema… 2007
## 2 Adelie Biscoe 37.7 18.7 180 3600 male 2007
## 3 Adelie Biscoe 35.9 19.2 189 3800 fema… 2007
## 4 Adelie Biscoe 38.2 18.1 185 3950 male 2007
## 5 Adelie Biscoe 38.8 17.2 180 3800 male 2007
## 6 Adelie Biscoe 35.3 18.9 187 3800 fema… 2007
## 7 Adelie Biscoe 40.6 18.6 183 3550 male 2007
## 8 Adelie Biscoe 40.5 17.9 187 3200 fema… 2007
## 9 Adelie Biscoe 37.9 18.6 172 3150 fema… 2007
## 10 Adelie Biscoe 40.5 18.9 180 3950 male 2007
## # ... with 158 more rows
pingviner %>% filter(sex == "male") %>%
filter(species == "Gentoo") %>%
filter(flipper_length_mm>220)
## # A tibble: 34 x 8
## species island bill_length_mm bill_depth_mm flipper… body_… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 50.0 16.3 230 5700 male 2007
## 2 Gentoo Biscoe 49.2 15.2 221 6300 male 2007
## 3 Gentoo Biscoe 48.7 15.1 222 5350 male 2007
## 4 Gentoo Biscoe 47.3 15.3 222 5250 male 2007
## 5 Gentoo Biscoe 59.6 17.0 230 6050 male 2007
## 6 Gentoo Biscoe 49.6 16.0 225 5700 male 2008
## 7 Gentoo Biscoe 50.5 15.9 222 5550 male 2008
## 8 Gentoo Biscoe 50.5 15.9 225 5400 male 2008
## 9 Gentoo Biscoe 50.1 15.0 225 5000 male 2008
## 10 Gentoo Biscoe 50.4 15.3 224 5550 male 2008
## # ... with 24 more rows
The same conditions can be combined in a single filter() command (output not shown):
pingviner %>% filter(sex == "male" & species == "Gentoo" & flipper_length_mm>220)
pingviner %>% filter(species == "Gentoo" | species == "Adelie")
## # A tibble: 276 x 8
## species island bill_length_mm bill_depth_mm flipp… body… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Adelie Torgersen 39.1 18.7 181 3750 male 2007
## 2 Adelie Torgersen 39.5 17.4 186 3800 fema… 2007
## 3 Adelie Torgersen 40.3 18.0 195 3250 fema… 2007
## 4 Adelie Torgersen NA NA NA NA <NA> 2007
## 5 Adelie Torgersen 36.7 19.3 193 3450 fema… 2007
## 6 Adelie Torgersen 39.3 20.6 190 3650 male 2007
## 7 Adelie Torgersen 38.9 17.8 181 3625 fema… 2007
## 8 Adelie Torgersen 39.2 19.6 195 4675 male 2007
## 9 Adelie Torgersen 34.1 18.1 193 3475 <NA> 2007
## 10 Adelie Torgersen 42.0 20.2 190 4250 <NA> 2007
## # ... with 266 more rows
pingviner %>% filter(species != "Chinstrap")
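The filter() variants above can be compared on a small example. This is a sketch under stated assumptions: toy is a hypothetical hand-typed data frame with six rows copied from the printed output (including one row with missing values), and dplyr is loaded directly instead of through mosaic.

```r
library(dplyr)

# Toy data frame: six rows copied by hand from the printed penguin output
toy <- data.frame(
  species           = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo", "Chinstrap"),
  sex               = c("male", NA, "male", "male", "female", "female"),
  flipper_length_mm = c(181L, NA, 230L, 221L, 217L, 192L)
)

# Three chained filters versus one filter with conditions joined by &
chained <- toy %>%
  filter(sex == "male") %>%
  filter(species == "Gentoo") %>%
  filter(flipper_length_mm > 220)
combined <- toy %>%
  filter(sex == "male" & species == "Gentoo" & flipper_length_mm > 220)
all.equal(chained, combined)

# "Gentoo or Adelie" versus "not Chinstrap" (the only other species here)
either   <- toy %>% filter(species == "Gentoo" | species == "Adelie")
not_chin <- toy %>% filter(species != "Chinstrap")
all.equal(either, not_chin)

# Rows where the condition is NA are dropped, so the row with missing sex is gone
males <- toy %>% filter(sex == "male")
nrow(males)
```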
Use arrange() to arrange the order of the rows/cases:
pingviner %>% filter(sex == "female") %>% arrange(body_mass_g)
## # A tibble: 165 x 8
## species island bill_length_mm bill_de… flipper… body_… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Chinstrap Dream 46.9 16.6 192 2700 fema… 2008
## 2 Adelie Biscoe 36.5 16.6 181 2850 fema… 2008
## 3 Adelie Biscoe 36.4 17.1 184 2850 fema… 2008
## 4 Adelie Biscoe 34.5 18.1 187 2900 fema… 2008
## 5 Adelie Dream 33.1 16.1 178 2900 fema… 2008
## 6 Adelie Torgersen 38.6 17.0 188 2900 fema… 2009
## 7 Chinstrap Dream 43.2 16.6 187 2900 fema… 2007
## 8 Adelie Biscoe 37.9 18.6 193 2925 fema… 2009
## 9 Adelie Dream 37.0 16.9 185 3000 fema… 2007
## 10 Adelie Dream 37.3 16.8 192 3000 fema… 2009
## # ... with 155 more rows
Use arrange(desc()) for descending values:
pingviner %>% filter(sex == "female") %>% arrange(desc(body_mass_g))
## # A tibble: 165 x 8
## species island bill_length_mm bill_depth_mm flipper… body_… sex year
## <fctr> <fctr> <dbl> <dbl> <int> <int> <fct> <int>
## 1 Gentoo Biscoe 46.5 14.8 217 5200 fema… 2008
## 2 Gentoo Biscoe 45.2 14.8 212 5200 fema… 2009
## 3 Gentoo Biscoe 49.1 14.8 220 5150 fema… 2008
## 4 Gentoo Biscoe 44.9 13.3 213 5100 fema… 2008
## 5 Gentoo Biscoe 45.1 14.5 207 5050 fema… 2007
## 6 Gentoo Biscoe 45.1 14.5 215 5000 fema… 2007
## 7 Gentoo Biscoe 42.9 13.1 215 5000 fema… 2007
## 8 Gentoo Biscoe 50.5 15.2 216 5000 fema… 2009
## 9 Gentoo Biscoe 47.2 15.5 215 4975 fema… 2009
## 10 Gentoo Biscoe 42.6 13.7 213 4950 fema… 2008
## # ... with 155 more rows