Intro and descriptive statistics

The ASTA team

Software

Integrated development environments (IDEs)

Basics

import numpy as np
4.6 * (2 + 3)**4 
## 2875.0
a = 4 
a
## 4
b = np.array([2, 5, 7])
b
## array([2, 5, 7])
s = np.arange(1, 5)
s
## array([1, 2, 3, 4])
s = np.arange(1, 5, 2)  # same range, but in steps of 2
s
## array([1, 3])
a * b
## array([ 8, 20, 28])
a + b
## array([ 6,  9, 11])
b ** 2
## array([ 4, 25, 49])
np.sum(b)
## np.int64(14)
np.prod(b)
## np.int64(70)
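
One detail in the output above is easy to miss: `np.arange` stops before its second argument. The sketch below restates this with the same numbers as above; the variable names `s1`, `s2`, `s3` and `scaled` are only illustrative.

```python
import numpy as np

# np.arange(start, stop, step) never includes `stop` itself.
s1 = np.arange(1, 5)        # 1, 2, 3, 4
s2 = np.arange(1, 5, 2)     # 1, 3
s3 = np.arange(1, 6)        # 1, 2, 3, 4, 5  (stop raised to include 5)

# Arithmetic between a scalar and an array is applied element by element.
b = np.array([2, 5, 7])
scaled = 4 * b              # each element multiplied by 4
print(s1, s2, s3, scaled)
```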

Data

Data example

Data: Magazine Ads Readability

Data example (continued) - variables and format

import pandas as pd

magAds = pd.read_csv("https://asta.math.aau.dk/datasets?file=magazineAds.txt", sep='\t')
magAds.head()
##    WDS  SEN  X3SYL  MAG  GROUP
## 0  205    9     34    1      1
## 1  203   20     21    1      1
## 2  229   18     37    1      1
## 3  208   16     31    1      1
## 4  146    9     10    1      1

Data types

Quantitative variables

Categorical/qualitative variables

Population and sample

Aim of statistics

Selecting randomly
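
As a rough illustration of drawing a simple random sample, one option is the pandas method `sample`. The data frame `population` below is made up for the example and is not part of the course data.

```python
import pandas as pd

# Hypothetical population of 10 labelled units (not from the course data).
population = pd.DataFrame({"unit": range(1, 11)})

# Simple random sample of 4 units without replacement.
# random_state fixes the seed so the draw is reproducible.
sample = population.sample(n=4, random_state=1)
print(sample)
```

Leaving out `random_state` gives a different sample on each run.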

Variable grouping and frequency tables

Binning

# Before 'cutting':
magAds["X3SYL"].iloc[0:5]
## 0    34
## 1    21
## 2    37
## 3    31
## 4    10
## Name: X3SYL, dtype: int64
# After 'cutting' into 4 intervals:
syll = pd.cut(magAds["X3SYL"], bins=4)

# First 5 values
syll.iloc[0:5]
## 0      (32.25, 43.0]
## 1      (10.75, 21.5]
## 2      (32.25, 43.0]
## 3      (21.5, 32.25]
## 4    (-0.043, 10.75]
## Name: X3SYL, dtype: category
## Categories (4, interval[float64, right]): [(-0.043, 10.75] < (10.75, 21.5] < (21.5, 32.25] <
##                                            (32.25, 43.0]]
labs = ["few", "some", "many", "lots"]
syll = pd.cut(magAds["X3SYL"], bins = 4, labels = labs)
syll.iloc[0:5]
## 0    lots
## 1    some
## 2    lots
## 3    many
## 4     few
## Name: X3SYL, dtype: category
## Categories (4, object): ['few' < 'some' < 'many' < 'lots']
magAds["syll"] = syll
magAds.head()
##    WDS  SEN  X3SYL  MAG  GROUP  syll
## 0  205    9     34    1      1  lots
## 1  203   20     21    1      1  some
## 2  229   18     37    1      1  lots
## 3  208   16     31    1      1  many
## 4  146    9     10    1      1   few
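
With `bins=4`, pandas splits the observed range into four equal-width intervals and widens the lower end slightly so that the minimum is included, which is where the edge -0.043 above comes from. If round edges are preferred, they can be given explicitly. The sketch below uses hand-picked edges and a small made-up series `x` in place of `magAds["X3SYL"]`; both are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values standing in for magAds["X3SYL"].
x = pd.Series([34, 21, 37, 31, 10, 0, 43])

# Explicit, hand-picked edges (an assumption for illustration).
edges = [0, 10, 20, 30, 43]
labs = ["few", "some", "many", "lots"]

# Intervals are right-closed by default: (0, 10], (10, 20], ...
# include_lowest=True makes the first interval include the left edge 0.
binned = pd.cut(x, bins=edges, labels=labs, include_lowest=True)
print(binned.tolist())
```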

Tables

magAds["syll"].value_counts()
## syll
## few     26
## some    14
## many    10
## lots     4
## Name: count, dtype: int64
magAds["syll"].value_counts(normalize = True) * 100
## syll
## few     48.148148
## some    25.925926
## many    18.518519
## lots     7.407407
## Name: proportion, dtype: float64
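
Counts and percentages can also be shown side by side in one table. A minimal sketch, using a made-up series `syll` in place of `magAds["syll"]`:

```python
import pandas as pd

# Hypothetical categorical variable standing in for magAds["syll"].
syll = pd.Series(["few", "some", "few", "many", "few", "lots", "some", "few"])

counts = syll.value_counts()                       # absolute frequencies
percent = syll.value_counts(normalize=True) * 100  # relative frequencies in %

# Both Series are indexed by category, so they align when combined.
freq_table = pd.DataFrame({"count": counts, "percent": percent})
print(freq_table)
```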

2 factors: Cross tabulation

pd.crosstab(magAds["syll"], magAds["GROUP"])
## GROUP  1   2  3
## syll           
## few    8  11  7
## some   4   2  8
## many   3   5  2
## lots   3   0  1
magAds.groupby("GROUP")["syll"].value_counts(normalize=True).mul(100)
## GROUP  syll
## 1      few     44.444444
##        some    22.222222
##        many    16.666667
##        lots    16.666667
## 2      few     61.111111
##        many    27.777778
##        some    11.111111
##        lots     0.000000
## 3      some    44.444444
##        few     38.888889
##        many    11.111111
##        lots     5.555556
## Name: proportion, dtype: float64
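
The same within-group percentages can be laid out as a two-way table by passing `normalize="columns"` to `pd.crosstab`. The sketch below uses a small made-up data frame `df` rather than `magAds`.

```python
import pandas as pd

# Hypothetical data standing in for magAds[["syll", "GROUP"]].
df = pd.DataFrame({
    "syll":  ["few", "few", "some", "many", "few", "some", "few", "lots"],
    "GROUP": [1, 1, 1, 1, 2, 2, 2, 2],
})

# normalize="columns" divides each column by its total,
# so every GROUP column sums to 100 after scaling.
col_pct = pd.crosstab(df["syll"], df["GROUP"], normalize="columns") * 100
print(col_pct.round(1))
```

Using `normalize="index"` instead would give percentages within each row.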

Graphics

Bar graph

import seaborn as sns
import matplotlib.pyplot as plt

g = sns.countplot(data = magAds, x = "syll")
g.set_title("Bar plot of syll")

g = sns.catplot(data = magAds, x = "syll", kind = "count", col = "GROUP")

The Ericksen data

Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep='\t')
Ericksen.head()
##            name  minority  crime  ...   city  conventional  undercount
## 0       Alabama      26.1     49  ...  state             0       -0.04
## 1        Alaska       5.7     62  ...  state           100        3.35
## 2       Arizona      18.9     81  ...  state            18        2.48
## 3      Arkansas      16.9     38  ...  state             0       -0.74
## 4  California.R      24.3     73  ...  state             4        3.60
## 
## [5 rows x 10 columns]
pd.set_option('display.max_columns', None)  # Show all columns
Ericksen.head()
##            name  minority  crime  poverty  language  highschool  housing  \
## 0       Alabama      26.1     49     18.9       0.2        43.5      7.6   
## 1        Alaska       5.7     62     10.7       1.7        17.5     23.6   
## 2       Arizona      18.9     81     13.2       3.2        27.6      8.1   
## 3      Arkansas      16.9     38     19.0       0.2        44.5      7.0   
## 4  California.R      24.3     73     10.4       5.0        26.0     11.8   
## 
##     city  conventional  undercount  
## 0  state             0       -0.04  
## 1  state           100        3.35  
## 2  state            18        2.48  
## 3  state             0       -0.74  
## 4  state             4        3.60

Histogram (quantitative variables)

sns.histplot(data = Ericksen, x = "crime", bins = 30, edgecolor = "black")
plt.xlabel("Crime")
plt.ylabel("Count")
plt.title("Histogram of Crime")
plt.show()
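
The numbers behind a histogram are just counts per interval. As a sketch, they can be computed without plotting using `np.histogram`; the array `x` is made up and is not the Ericksen crime variable.

```python
import numpy as np

# Hypothetical measurements (not the Ericksen crime rates).
x = np.array([1, 2, 2, 3, 5, 6, 6, 7, 9, 10])

# Split the range [1, 10] into 3 equal-width bins and count per bin.
counts, edges = np.histogram(x, bins=3)
print(edges)   # bin boundaries
print(counts)  # number of observations in each bin
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.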

Summary of quantitative variables

Measures of center of data: Mean and median

col = magAds["WDS"]
summary = {
    "min": col.min(),
    "Q1": col.quantile(0.25),
    "median": col.median(),
    "mean": col.mean(),
    "Q3": col.quantile(0.75),
    "max": col.max(),
    "sd": col.std(),
    "n": col.count(),
    "missing": col.isna().sum()
}
pd.DataFrame([summary])
##    min    Q1  median       mean     Q3  max         sd   n  missing
## 0   31  69.0    95.5  122.62963  201.5  230  65.877043  54        0
#or: magAds["WDS"].describe()

Measures of variability of data: range, standard deviation and variance

Calculation of mean, median and standard deviation

The mean, median and standard deviation are just some of the summaries that can be read off the output (shown on the previous page). They may also be calculated separately as follows:

magAds["WDS"].mean()
## np.float64(122.62962962962963)
magAds["WDS"].median()
## np.float64(95.5)
magAds["WDS"].std()
## np.float64(65.87704278349153)
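
To make explicit what these functions compute, the sketch below works out the mean and the standard deviation step by step on a small made-up sample `y`. It also shows one pitfall: pandas divides by n - 1 by default, whereas `np.std` divides by n unless `ddof=1` is given.

```python
import numpy as np
import pandas as pd

# Hypothetical small sample (not the WDS column).
y = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0])
n = len(y)

mean = y.sum() / n                           # sample mean
var = ((y - mean) ** 2).sum() / (n - 1)      # sample variance, divisor n - 1
sd = np.sqrt(var)                            # sample standard deviation

print(mean, sd)
print(y.std())               # pandas: divisor n - 1 by default (ddof=1)
print(np.std(y.to_numpy()))  # NumPy: divisor n by default (ddof=0)
```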

We may also calculate the summaries for each group (variable GROUP), e.g. for the mean:

magAds.groupby("GROUP")["WDS"].mean().reset_index() # reset_index() turns the GROUP index back into an ordinary column
##    GROUP         WDS
## 0      1  140.000000
## 1      2  121.388889
## 2      3  106.500000

A word about terminology

The empirical rule

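If the histogram of the sample looks like a bell-shaped curve, then about 68% of the observations lie within one standard deviation of the mean, about 95% within two standard deviations, and all or nearly all within three.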
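
A quick way to see the empirical rule in action is to simulate bell-shaped data and count how many observations fall within 1, 2 and 3 standard deviations of the mean. The sample below is simulated from a normal distribution with arbitrarily chosen mean and spread; it is an assumption for illustration, not course data.

```python
import numpy as np

# Simulated bell-shaped sample (an assumption; not course data).
rng = np.random.default_rng(0)
y = rng.normal(loc=100, scale=15, size=10_000)

mean, sd = y.mean(), y.std(ddof=1)

# Share of observations within k standard deviations of the mean.
share = {k: np.mean(np.abs(y - mean) <= k * sd) for k in (1, 2, 3)}
print(share)  # roughly 0.68, 0.95 and 0.997 for bell-shaped data
```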

Percentiles

Percentile calculation for a sample:
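
Textbooks and software use slightly different conventions for sample percentiles. As a sketch of what pandas does by default: `quantile` sorts the data and interpolates linearly between neighbouring observations. The series `y` is made up.

```python
import pandas as pd

# Hypothetical sample of 5 observations.
y = pd.Series([50, 15, 40, 20, 35])

# Sorted data: 15, 20, 35, 40, 50 (positions 0 to 4).
p25 = y.quantile(0.25)   # position (n - 1) * 0.25 = 1 in the sorted data -> 20
p40 = y.quantile(0.40)   # position 1.6: 60% of the way from 20 to 35 -> 29
p50 = y.quantile(0.50)   # the median -> 35
print(p25, p40, p50)
```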

Median, quartiles and interquartile range

Recall

col = magAds["WDS"]
summary = {
    "min": col.min(),
    "Q1": col.quantile(0.25),
    "median": col.median(),
    "Q3": col.quantile(0.75),
    "max": col.max(),
    "n": col.count(),
    "missing": col.isna().sum()
}
pd.DataFrame([summary])
##    min    Q1  median     Q3  max   n  missing
## 0   31  69.0    95.5  201.5  230  54        0
#or: magAds["WDS"].describe()
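
The interquartile range is the distance between the first and third quartiles. One way to obtain it, sketched here on a made-up series `y`, is to pick the quartiles out of the `describe()` output:

```python
import pandas as pd

# Hypothetical sample (not the WDS column).
y = pd.Series([1, 3, 5, 7, 9, 11, 13, 15, 17])

d = y.describe()            # count, mean, std, min, 25%, 50%, 75%, max
iqr = d["75%"] - d["25%"]   # interquartile range = Q3 - Q1
print(d["25%"], d["50%"], d["75%"], iqr)
```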

More graphics

Box-and-whiskers plots (or simply box plots)

How to draw a box-and-whiskers plot:

Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.
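
The ingredients of a box plot can be computed by hand. The sketch below assumes the common convention that an observation is an outlier if it lies more than 1.5 times the interquartile range outside the box (this is also the default `whis=1.5` in seaborn and matplotlib). The array `y` is made up.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
y = np.array([1, 3, 5, 7, 9, 11, 13, 15, 40])

q1, q3 = np.quantile(y, [0.25, 0.75])   # edges of the box
iqr = q3 - q1

# Observations more than 1.5 * IQR outside the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = y[(y >= lower_fence) & (y <= upper_fence)]

# Whiskers end at the most extreme observations that are not outliers.
whisker_low, whisker_high = inside.min(), inside.max()
outliers = y[(y < lower_fence) | (y > upper_fence)]
print(q1, q3, whisker_low, whisker_high, outliers)
```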

Boxplot for Ericksen data

Boxplot of the poverty rates separately for cities and states (variable city):

summary_by_city = Ericksen.groupby("city")["poverty"].agg(
    ['min', 
     lambda x: x.quantile(0.25),  # Q1
     'median',
     'mean',
     lambda x: x.quantile(0.75),  # Q3
     'max',
     'std',
     'count',
     lambda x: x.isna().sum()      # missing values
])

# Rename columns for clarity
summary_by_city.columns = ['min', 'Q1', 'median', 'mean', 'Q3', 'max', 'sd', 'n', 'missing']
summary_by_city
##         min      Q1  median      mean     Q3   max        sd   n  missing
## city                                                                     
## city   11.5  13.775    19.3  17.69375  20.90  22.9  4.041859  16        0
## state   6.8   9.500    10.7  12.11600  14.45  23.9  3.733596  50        0
sns.boxplot(data=Ericksen, x="city", y="poverty")
plt.xlabel("City")
plt.ylabel("Poverty")
plt.title("Boxplot of Poverty by City")
plt.show()

2 quantitative variables: Scatter plot

For two quantitative variables the usual graphic is a scatter plot:

sns.scatterplot(data=Ericksen, x="highschool", y="poverty")
plt.xlabel("Highschool")
plt.ylabel("Poverty")
plt.title("Scatterplot of Poverty vs Highschool")
plt.show()

This can be either split or coloured according to the value of city:

g = sns.relplot(data=Ericksen, x="highschool", y="poverty", col="city", kind="scatter")

g = sns.scatterplot(data=Ericksen, x="highschool", y="poverty", hue="city")
plt.legend(title="City")

If we want a regression line along with the points we can do:

g = sns.lmplot(data=Ericksen, x="highschool", y="poverty", hue="city", ci=None)
plt.xlabel("Highschool")
plt.ylabel("Poverty")
plt.title("Poverty vs Highschool with regression lines by City")
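
The plot only draws the fitted lines. If the slope and intercept are needed as numbers, one option is a least-squares fit with `np.polyfit`. The sketch below uses made-up points `x` and `y` rather than the Ericksen data.

```python
import numpy as np

# Hypothetical (x, y) pairs (not the Ericksen data).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

# A degree-1 polynomial fit is the least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)

# Sample correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(slope, intercept, r)
```

To get one line per group, as in the plot, the same fit can be repeated on each subset of the data.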

Appendix

Recoding variables

magAds["GROUP"].head()
## 0    1
## 1    1
## 2    1
## 3    1
## 4    1
## Name: GROUP, dtype: int64
f = magAds["GROUP"].astype("category")
magAds["GROUP"] = f
magAds["GROUP"].head()
## 0    1
## 1    1
## 2    1
## 3    1
## 4    1
## Name: GROUP, dtype: category
## Categories (3, int64): [1, 2, 3]
f = magAds["GROUP"].astype("category")
f.value_counts()
## GROUP
## 1    18
## 2    18
## 3    18
## Name: count, dtype: int64
f = f.cat.rename_categories({
    1: 'high',
    2: 'medium', 
    3: 'low'
})
magAds['GROUP'] = f
magAds["GROUP"].head()
## 0    high
## 1    high
## 2    high
## 3    high
## 4    high
## Name: GROUP, dtype: category
## Categories (3, object): ['high', 'medium', 'low']
magAds['GROUP'].value_counts()
## GROUP
## high      18
## medium    18
## low       18
## Name: count, dtype: int64
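
The recoded categories above are unordered. If the levels should be treated as ordered, the order can be declared explicitly. The sketch below uses a made-up series `group` and assumes the order low < medium < high, which the text above does not state.

```python
import pandas as pd

# Hypothetical codes standing in for magAds["GROUP"].
group = pd.Series([1, 3, 2, 1, 3])

# Map codes to labels (same labels as in the recoding above).
labels = {1: "high", 2: "medium", 3: "low"}
f = group.map(labels)

# Declare an explicit order (an assumption) so that sorting
# and comparisons respect it.
levels = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
f = f.astype(levels)

print(f.sort_values().tolist())
print((f >= "medium").tolist())
```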