The ASTA team
## 2875.0
## 4
## array([2, 5, 7])
## array([1, 2, 3, 4])
## array([1, 3])
## array([ 8, 20, 28])
## array([ 6, 9, 11])
## array([ 4, 25, 49])
## np.int64(14)
## np.int64(70)
Data: Magazine Ads Readability
import pandas as pd
magAds = pd.read_csv("https://asta.math.aau.dk/datasets?file=magazineAds.txt", sep='\t')
magAds.head()
## WDS SEN X3SYL MAG GROUP
## 0 205 9 34 1 1
## 1 203 20 21 1 1
## 2 229 18 37 1 1
## 3 208 16 31 1 1
## 4 146 9 10 1 1
X has been prefixed in X3SYL.
The measurement is one of a set of given categories, e.g. sex (male/female), social status, satisfaction score (low/medium/high), etc.
The measurement is usually stored (which is also recommended) as a categorical type in Python. The possible values are called categories. Sometimes these are also referred to as factors with levels. Example: the categories of the categorical type “sex” are male/female.
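As a minimal sketch of how such a variable might be stored, the example below builds a small categorical series with pandas. The satisfaction values are made up for illustration, and declaring the categories as ordered is an assumption that suits a low/medium/high score but would not suit a variable such as sex.

```python
import pandas as pd

# Hypothetical satisfaction scores, used only for illustration.
scores = pd.Series(["low", "high", "medium", "low", "high"])

# Store them as a categorical type with an explicit category order.
satisfaction = pd.Series(
    pd.Categorical(scores, categories=["low", "medium", "high"], ordered=True)
)

print(satisfaction.cat.categories.tolist())  # the possible categories
print(satisfaction.cat.ordered)              # whether the categories are ordered
print(satisfaction.max())                    # largest category under that order
```

Because the categories are ordered, comparisons such as the maximum are meaningful; for an unordered categorical they are not defined.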
Categorical types have two so-called scales:
cut will divide the range of a numeric variable into a number of equally sized intervals, and record which interval each observation belongs to. E.g. for the variable X3SYL (the number of words with more than three syllables) in the magazine data:
## 0 34
## 1 21
## 2 37
## 3 31
## 4 10
## Name: X3SYL, dtype: int64
# After 'cutting' into 4 intervals:
syll = pd.cut(magAds["X3SYL"], bins=4)
# First 5 values
syll.iloc[0:5]
## 0 (32.25, 43.0]
## 1 (10.75, 21.5]
## 2 (32.25, 43.0]
## 3 (21.5, 32.25]
## 4 (-0.043, 10.75]
## Name: X3SYL, dtype: category
## Categories (4, interval[float64, right]): [(-0.043, 10.75] < (10.75, 21.5] < (21.5, 32.25] <
## (32.25, 43.0]]
The result has dtype category, and the labels are the interval end points by default. Custom ones can be assigned through the labels argument:
labs = ["few", "some", "many", "lots"]
syll = pd.cut(magAds["X3SYL"], bins = 4, labels = labs)
syll.iloc[0:5]
## 0 lots
## 1 some
## 2 lots
## 3 many
## 4 few
## Name: X3SYL, dtype: category
## Categories (4, object): ['few' < 'some' < 'many' < 'lots']
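The labelling step can be tried without downloading the data. The sketch below applies pd.cut to the five X3SYL values shown earlier. Passing the interval end points from the output above as explicit bin edges (rather than bins=4) is an assumption made here so that a five-value sample falls into the same intervals as the full data.

```python
import pandas as pd

# First five X3SYL values from the magazine data shown above.
x3syl = pd.Series([34, 21, 37, 31, 10], name="X3SYL")

# Interval end points copied from the pd.cut output on the full data.
edges = [-0.043, 10.75, 21.5, 32.25, 43.0]
labs = ["few", "some", "many", "lots"]

# Each value is assigned the label of the interval it falls into.
syll_demo = pd.cut(x3syl, bins=edges, labels=labs)
print(syll_demo.tolist())  # ['lots', 'some', 'lots', 'many', 'few']
```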
## WDS SEN X3SYL MAG GROUP syll
## 0 205 9 34 1 1 lots
## 1 203 20 21 1 1 some
## 2 229 18 37 1 1 lots
## 3 208 16 31 1 1 many
## 4 146 9 10 1 1 few
To tabulate the categories, use value_counts() from the pandas package:
## syll
## few 26
## some 14
## many 10
## lots 4
## Name: count, dtype: int64
## syll
## few 48.148148
## some 25.925926
## many 18.518519
## lots 7.407407
## Name: proportion, dtype: float64
## GROUP 1 2 3
## syll
## few 8 11 7
## some 4 2 8
## many 3 5 2
## lots 3 0 1
## GROUP syll
## 1 few 44.444444
## some 22.222222
## many 16.666667
## lots 16.666667
## 2 few 61.111111
## many 27.777778
## some 11.111111
## lots 0.000000
## 3 some 44.444444
## few 38.888889
## many 11.111111
## lots 5.555556
## Name: proportion, dtype: float64
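One way such tables might be produced is sketched below. The data frame is rebuilt from the cell counts in the GROUP-by-syll table above, so it runs without the original file; using pd.crosstab for the two-way tables is an assumption, not necessarily the call used to generate the output shown.

```python
import pandas as pd

# Cell counts copied from the GROUP-by-syll table above: (syll, GROUP) -> count.
cells = {
    ("few", 1): 8, ("few", 2): 11, ("few", 3): 7,
    ("some", 1): 4, ("some", 2): 2, ("some", 3): 8,
    ("many", 1): 3, ("many", 2): 5, ("many", 3): 2,
    ("lots", 1): 3, ("lots", 2): 0, ("lots", 3): 1,
}

# Expand the counts into one row per advertisement.
rows = [(s, g) for (s, g), n in cells.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=["syll", "GROUP"])
df["syll"] = pd.Categorical(
    df["syll"], categories=["few", "some", "many", "lots"], ordered=True
)

counts = df["syll"].value_counts()                        # frequencies
percent = df["syll"].value_counts(normalize=True) * 100   # percentages
table = pd.crosstab(df["syll"], df["GROUP"])              # two-way counts
col_percent = pd.crosstab(df["syll"], df["GROUP"], normalize="columns") * 100

print(counts)
print(percent)
print(table)
print(col_percent)
```

With normalize="columns" each GROUP column sums to 100, which corresponds to the percentages within each group.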
A bar plot can be drawn with countplot from seaborn. For each level of the factor a box is drawn with the height proportional to the frequency (count) of the level.
import matplotlib.pyplot as plt
import seaborn as sns
g = sns.countplot(data = magAds, x = "syll")
g.set_title("Bar plot of syll")
minority: Percentage black or Hispanic.
crime: Rate of serious crimes per 1000 individuals in the population.
poverty: Percentage poor.
language: Percentage having difficulty speaking or writing English.
highschool: Percentage aged 25 or older who had not finished high school.
housing: Percentage of housing in small, multiunit buildings.
city: A factor with levels: city (major city) and state (state or state-remainder).
conventional: Percentage of households counted by conventional personal enumeration.
undercount: Preliminary estimate of percentage undercount.
Ericksen = pd.read_csv("https://asta.math.aau.dk/datasets?file=Ericksen.txt", sep='\t')
Ericksen.head()
## name minority crime ... city conventional undercount
## 0 Alabama 26.1 49 ... state 0 -0.04
## 1 Alaska 5.7 62 ... state 100 3.35
## 2 Arizona 18.9 81 ... state 18 2.48
## 3 Arkansas 16.9 38 ... state 0 -0.74
## 4 California.R 24.3 73 ... state 4 3.60
##
## [5 rows x 10 columns]
## name minority crime poverty language highschool housing \
## 0 Alabama 26.1 49 18.9 0.2 43.5 7.6
## 1 Alaska 5.7 62 10.7 1.7 17.5 23.6
## 2 Arizona 18.9 81 13.2 3.2 27.6 8.1
## 3 Arkansas 16.9 38 19.0 0.2 44.5 7.0
## 4 California.R 24.3 73 10.4 5.0 26.0 11.8
##
## city conventional undercount
## 0 state 0 -0.04
## 1 state 100 3.35
## 2 state 18 2.48
## 3 state 0 -0.74
## 4 state 4 3.60
Histogram of a variable x: divide the interval from the minimum value of x to the maximum value of x into an appropriate number of equal sized sub-intervals.
sns.histplot(data = Ericksen, x = "crime", bins = 30, edgecolor = "black")
plt.xlabel("Crime")
plt.ylabel("Count")
plt.title("Histogram of Crime")
plt.show()
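The binning behind a histogram can be sketched numerically with numpy. The example uses only the five crime values visible in the data preview above and three bins, both chosen here to keep the illustration small.

```python
import numpy as np

# First five crime values from the Ericksen preview above.
crime = np.array([49, 62, 81, 38, 73])

# Split [min, max] into 3 equal sized sub-intervals and count the values in each.
counts, edges = np.histogram(crime, bins=3)

print(edges)   # end points of the sub-intervals, from 38 to 81
print(counts)  # number of observations in each sub-interval
```

The bar heights in the plotted histogram correspond to these counts.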
WDS = number of words in advertisement. A number of numerical summaries for WDS can be computed with pandas, mirroring the output of the favstats function in R:
col = magAds["WDS"]
summary = {
"min": col.min(),
"Q1": col.quantile(0.25),
"median": col.median(),
"mean": col.mean(),
"Q3": col.quantile(0.75),
"max": col.max(),
"sd": col.std(),
"n": col.count(),
"missing": col.isna().sum()
}
pd.DataFrame([summary])
## min Q1 median mean Q3 max sd n missing
## 0 31 69.0 95.5 122.62963 201.5 230 65.877043 54 0
The values of WDS are \(y_1=205\), \(y_2=203,\ldots,y_n=208\), where there are a total of \(n=54\) values. As previously defined this constitutes a sample.
The mean, median and standard deviation are just some of the summaries that can be read off the output (shown above). They may also be calculated separately in the following way:
Mean of WDS:
## np.float64(122.62962962962963)
Median of WDS:
## np.float64(95.5)
Standard deviation of WDS:
## np.float64(65.87704278349153)
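These three summaries can be checked by hand on a small sample. The sketch below uses only the first five WDS values from the data preview and Python's standard statistics module, so the numbers differ from the full-sample results above. Note that statistics.stdev divides by n - 1, as does the std method in pandas by default.

```python
import statistics

# First five WDS values from the magazine data preview.
wds = [205, 203, 229, 208, 146]

mean = statistics.mean(wds)      # sum of the values divided by n
median = statistics.median(wds)  # middle value of the sorted sample
sd = statistics.stdev(wds)       # sample standard deviation (divides by n - 1)

print(mean, median, sd)
```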
We may also calculate the summaries for each group (variable GROUP), e.g. for the mean:
## GROUP WDS
## 0 1 140.000000
## 1 2 121.388889
## 2 3 106.500000
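A grouped mean of this shape might be obtained with groupby. The sketch below uses a tiny made-up data frame (the values are hypothetical, not the magazine data), and the choice of as_index=False is an assumption made to get GROUP as an ordinary column, as in the output above.

```python
import pandas as pd

# Hypothetical miniature data set with two observations per group.
demo = pd.DataFrame({
    "GROUP": [1, 1, 2, 2, 3, 3],
    "WDS": [205, 203, 120, 100, 90, 110],
})

# Mean of WDS within each level of GROUP.
group_means = demo.groupby("GROUP", as_index=False)["WDS"].mean()
print(group_means)
```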
If the histogram of the sample looks like a bell shaped curve, then
First, sort data in increasing order. For the WDS variable in the magazine data:
\[
y_{(1)}=31, y_{(2)}=32, y_{(3)}=34, \ldots, y_{(n)} = 230.
\]
Here the number of observations is \(n=54\).
Find the \(5\)th percentile (i.e. \(p = 5\)):
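A percentile computation can be sketched as follows. Several interpolation conventions exist; this example assumes linear interpolation between neighbouring order statistics, which is the default of numpy.percentile and of the pandas quantile method used in these notes, and may differ from other textbook rules. The sample is illustrative: its first three values match \(y_{(1)}, y_{(2)}, y_{(3)}\) above and the remaining two are made up.

```python
import numpy as np

# Illustrative sorted sample (only the first three values come from the WDS data).
y = np.array([31, 32, 34, 40, 50])
p = 5  # the percentile we want

# Fractional position of the p-th percentile among the sorted values (0-based).
position = (len(y) - 1) * p / 100
lower = int(np.floor(position))
frac = position - lower

# Interpolate linearly between the two neighbouring sorted values.
by_hand = y[lower] + frac * (y[lower + 1] - y[lower])
by_numpy = np.percentile(y, p)

print(by_hand, by_numpy)
```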
Recall
col = magAds["WDS"]
summary = {
"min": col.min(),
"Q1": col.quantile(0.25),
"median": col.median(),
"Q3": col.quantile(0.75),
"max": col.max(),
"n": col.count(),
"missing": col.isna().sum()
}
pd.DataFrame([summary])
## min Q1 median Q3 max n missing
## 0 31 69.0 95.5 201.5 230 54 0
How to draw a box-and-whiskers plot:
Note: Whiskers are minimum and maximum of the observations that are not deemed to be outliers.
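The whisker rule in the note can be checked against the WDS summary above. The sketch assumes the common convention that an observation is an outlier when it lies more than 1.5 times the interquartile range outside the box; the factor 1.5 is an assumption here, although it is also the default in seaborn's boxplot.

```python
# Summary values for WDS taken from the table above.
q1, q3 = 69.0, 201.5
y_min, y_max = 31, 230

iqr = q3 - q1                  # interquartile range: the length of the box
lower_fence = q1 - 1.5 * iqr   # observations below this are deemed outliers
upper_fence = q3 + 1.5 * iqr   # observations above this are deemed outliers

# If no observation lies outside the fences, the whiskers end at min and max.
has_outliers = y_min < lower_fence or y_max > upper_fence
print(iqr, lower_fence, upper_fence, has_outliers)
```

Under this assumption WDS has no outliers, so the whiskers would extend to 31 and 230.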
Boxplot of the poverty rates separately for cities and states (variable city):
summary_by_city = Ericksen.groupby("city")["poverty"].agg(
['min',
lambda x: x.quantile(0.25), # Q1
'median',
'mean',
lambda x: x.quantile(0.75), # Q3
'max',
'std',
'count',
lambda x: x.isna().sum() # missing values
])
# Rename columns for clarity
summary_by_city.columns = ['min', 'Q1', 'median', 'mean', 'Q3', 'max', 'sd', 'n', 'missing']
summary_by_city
## min Q1 median mean Q3 max sd n missing
## city
## city 11.5 13.775 19.3 17.69375 20.90 22.9 4.041859 16 0
## state 6.8 9.500 10.7 12.11600 14.45 23.9 3.733596 50 0
sns.boxplot(data=Ericksen, x="city", y="poverty")
plt.xlabel("City")
plt.ylabel("Poverty")
plt.title("Boxplot of Poverty by City")
plt.show()
For two quantitative variables the usual graphic is a scatter plot:
sns.scatterplot(data=Ericksen, x="highschool", y="poverty")
plt.xlabel("Highschool")
plt.ylabel("Poverty")
plt.title("Scatterplot of Poverty vs Highschool")
plt.show()
This can be either split or coloured according to the value of city:
g = sns.scatterplot(data=Ericksen, x="highschool", y="poverty", hue="city")
plt.legend(title="City")
If we want a regression line along with the points we can do:
g = sns.lmplot(data=Ericksen, x="highschool", y="poverty", hue="city", ci=None)
plt.xlabel("Highschool")
plt.ylabel("Poverty")
plt.title("Poverty vs Highschool with regression lines by City")
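The line drawn by lmplot is a least-squares fit. A minimal numerical sketch of such a fit, using numpy.polyfit on the five highschool and poverty values visible in the data preview (far too few for a serious fit, chosen only to keep the example self-contained):

```python
import numpy as np

# First five rows of the Ericksen preview above.
highschool = np.array([43.5, 17.5, 27.6, 44.5, 26.0])
poverty = np.array([18.9, 10.7, 13.2, 19.0, 10.4])

# Fit poverty = intercept + slope * highschool by least squares.
slope, intercept = np.polyfit(highschool, poverty, deg=1)

# Residuals: observed poverty minus the fitted line.
residuals = poverty - (intercept + slope * highschool)
print(slope, intercept)
```

For a least-squares line with an intercept the residuals sum to zero, which gives a quick sanity check on the fit.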
astype can be used to convert a vector of numerical values to a categorical variable. E.g.:
## 0 1
## 1 1
## 2 1
## 3 1
## 4 1
## Name: GROUP, dtype: int64
## 0 1
## 1 1
## 2 1
## 3 1
## 4 1
## Name: GROUP, dtype: category
## Categories (3, int64): [1, 2, 3]
## GROUP
## 1 18
## 2 18
## 3 18
## Name: count, dtype: int64
f = magAds["GROUP"].astype("category")  # convert GROUP to a categorical variable
f = f.cat.rename_categories({
1: 'high',
2: 'medium',
3: 'low'
})
magAds['GROUP'] = f
magAds["GROUP"].head()
## 0 high
## 1 high
## 2 high
## 3 high
## 4 high
## Name: GROUP, dtype: category
## Categories (3, object): ['high', 'medium', 'low']
## GROUP
## high 18
## medium 18
## low 18
## Name: count, dtype: int64
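The conversion and renaming can also be tried on a small stand-alone series. The group codes below are made up for illustration; the mapping to high/medium/low is the one used above.

```python
import pandas as pd

# Hypothetical numeric group codes.
group = pd.Series([1, 1, 2, 3, 3, 2], name="GROUP")
print(group.dtype)  # an integer dtype before the conversion

# Convert to a categorical variable, then give the categories readable names.
f_demo = group.astype("category")
f_demo = f_demo.cat.rename_categories({1: "high", 2: "medium", 3: "low"})

print(f_demo.dtype)  # category
print(f_demo.tolist())
```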