STAT1013: Statistics in Python II
Descriptive statistics
- Try to summarize the data in a compact, easily understood fashion.
- Use `pandas` to show the descriptive statistics of a DataFrame: `pandas.DataFrame.info()` and `pandas.DataFrame.describe()`
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')
## take a look at the DataFrame
df.head(5)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
## Types of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 891 non-null int64
1 pclass 891 non-null int64
2 sex 891 non-null object
3 age 714 non-null float64
4 sibsp 891 non-null int64
5 parch 891 non-null int64
6 fare 891 non-null float64
7 embarked 889 non-null object
8 class 891 non-null category
9 who 891 non-null object
10 adult_male 891 non-null bool
11 deck 203 non-null category
12 embark_town 889 non-null object
13 alive 891 non-null object
14 alone 891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
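The non-null counts above imply missing values: `age` has 891 - 714 = 177 missing entries and `deck` has 891 - 203 = 688. A minimal sketch of counting them directly, using a small made-up frame (the `toy` name and its values are illustrative and not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with some missing entries
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "deck": ["C", None, None, "E", None],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

missing = toy.isna().sum()          # number of missing values per column
missing_share = toy.isna().mean()   # fraction of missing values per column
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

The same two calls should work unchanged on `df`.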
## check the basic statistics for all columns
df.describe(include='all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
survived | 891.0 | NaN | NaN | NaN | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
pclass | 891.0 | NaN | NaN | NaN | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
sex | 891 | 2 | male | 577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
age | 714.0 | NaN | NaN | NaN | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
sibsp | 891.0 | NaN | NaN | NaN | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
parch | 891.0 | NaN | NaN | NaN | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
fare | 891.0 | NaN | NaN | NaN | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
embarked | 889 | 3 | S | 644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
class | 891 | 3 | Third | 491 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
who | 891 | 3 | man | 537 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
adult_male | 891 | 2 | True | 537 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
deck | 203 | 7 | C | 59 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
embark_town | 889 | 3 | Southampton | 644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
alive | 891 | 2 | no | 549 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
alone | 891 | 2 | True | 537 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
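For non-numeric columns, `describe` reports `unique`, `top`, and `freq`: the number of distinct values, the most frequent value, and its count. A small sketch on made-up data (the `toy` frame is hypothetical) showing how `value_counts` recovers the same numbers, and how describing numeric and non-numeric columns separately avoids the `NaN` padding:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "male", "female"],
    "fare": [7.25, 71.28, 8.05, 13.00, 26.55],
})

counts = toy["sex"].value_counts()   # sorted by frequency, descending
top, freq = counts.index[0], counts.iloc[0]
print(top, freq)

# Summaries split by column type
numeric_summary = toy.describe(include=["number"])
categorical_summary = toy.describe(exclude=["number"])
print(numeric_summary)
print(categorical_summary)
```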
Data visualization
- Before demonstrating data visualization in Python, we need to figure out the type of the data of interest. We mainly focus on categorical data and continuous data.
- Use `seaborn` to visualize your data, for both categorical and continuous data.
- Example gallery in seaborn
Single variable
- Overall distribution -> histogram (both categorical and continuous) -> `sns.histplot` or `sns.stripplot` or `sns.violinplot`
- Quantile information -> boxplot (continuous) -> `sns.boxplot`
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [10, 5]
sns.set()
## 1. histplot
sns.histplot(df, x='age')
plt.show()
sns.histplot(df, x='survived')
plt.show()
## 1(b) stripplot
sns.stripplot(x=df['age'])
plt.show()
## 1(c) violinplot
sns.violinplot(x=df['age'])
plt.show()
## 2 boxplot
sns.boxplot(x=df["age"])
plt.show()
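The box in a box plot spans the first to the third quartile, with a line at the median. By default seaborn extends the whiskers to the most extreme points within 1.5 × IQR of the box and marks anything beyond that as an outlier. A short sketch on made-up numbers (the `values` series is illustrative) that computes those quantities by hand:

```python
import pandas as pd

# Toy sample (hypothetical values) with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                      # interquartile range
lower_fence = q1 - 1.5 * iqr       # points below this are flagged
upper_fence = q3 + 1.5 * iqr       # points above this are flagged
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```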
Two variables: effect of X on Y
- X (categorical) to Y (categorical) -> histogram -> `sns.histplot` with `hue`
- X (categorical) to Y (continuous) -> histogram -> `sns.histplot` with `hue`, or `sns.stripplot` with `hue`, or `sns.violinplot` with `hue`
- X (continuous) to Y (continuous) -> scatter plot -> `sns.scatterplot`
- X (continuous) to Y (categorical) -> histogram -> `sns.histplot` with `hue`, or `sns.stripplot` with `hue`, or `sns.violinplot` with `hue`
# 1. `sns.histplot` with `hue`
## For example: sex -> survived
## put the Y (outcome) on the x-axis, since we want to check the histogram of the outcome
sns.histplot(df, x='survived', hue='sex')
plt.show()
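A grouped histogram of two categorical variables has a direct numeric counterpart in a contingency table. A minimal sketch with `pd.crosstab` on made-up data (the `toy` frame and its values are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male"] * 4 + ["female"] * 4,
    "survived": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Raw counts of each (sex, survived) combination
counts = pd.crosstab(toy["sex"], toy["survived"])
# Row proportions: share of each outcome within each sex
rates = pd.crosstab(toy["sex"], toy["survived"], normalize="index")
print(counts)
print(rates)
```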
# 2(a). `sns.histplot` with `hue`
## For example: sex -> age
## put the Y (outcome) on the x-axis, since we want to check the histogram of the outcome
sns.histplot(data=df, x="age", hue="sex")
plt.show()
# 2(b). `sns.stripplot` with the grouping variable on the y-axis
sns.stripplot(data=df, x="age", y="sex")
plt.show()
# 2(c). `sns.violinplot` with the grouping variable on the y-axis
sns.violinplot(data=df, x="age", y="sex")
plt.show()
# 3 `sns.scatterplot`
sns.scatterplot(data=df, x="age", y='fare')
plt.show()
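When both variables are continuous, a fitted line and a correlation coefficient can complement the scatter plot. A sketch using `sns.regplot` and `Series.corr` on simulated data (the `toy` frame, the seed, and the linear relationship are all made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data (hypothetical): fare depends linearly on age plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 80, size=100)
toy = pd.DataFrame({"age": x, "fare": 10 + 0.5 * x + rng.normal(0, 5, size=100)})

# Pearson correlation between the two columns
r = toy["age"].corr(toy["fare"])
print(round(r, 3))

# Scatter plot with a fitted linear regression line and confidence band
ax = sns.regplot(data=toy, x="age", y="fare")
plt.show()
```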
# 4(a) 'stripplot'
sns.stripplot(data=df, x="sex", y="age")
plt.show()
# 4(b) 'violinplot'
sns.violinplot(data=df, x="sex", y="age")
plt.show()
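Strip and violin plots compare the distribution of a continuous variable across groups; `groupby` gives the same comparison as numbers. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 30.0, 40.0, 25.0, 35.0, 60.0],
})

# Per-group count, mean, and median of age
by_sex = toy.groupby("sex")["age"].agg(["count", "mean", "median"])
print(by_sex)
```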
Three variables
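- X1 (categorical) + X2 (categorical) to Y (categorical) -> histogram -> `FacetGrid` + `sns.histplot`
- X1 (categorical) + X2 (continuous) to Y (categorical) -> histogram -> `FacetGrid` + `sns.histplot`
- X1 (categorical) + X2 (categorical) to Y (continuous) -> violin plot -> `sns.stripplot` with `hue` and `y`, or `sns.violinplot` with `hue` and `y`
- X1 (categorical) + X2 (continuous) to Y (continuous) -> scatter plot -> `sns.scatterplot` with `hue` (`sns.regplot` has no `hue` argument; `sns.lmplot` does)
- X1 (continuous) + X2 (continuous) to Y (categorical) -> density plot -> `sns.kdeplot` with `hue`
- X1 (continuous) + X2 (continuous) to Y (continuous) -> density plot -> `sns.kdeplot`, or `sns.PairGrid`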
# 1 `FacetGrid` + `sns.histplot` with `x` and `y`
g = sns.FacetGrid(df, col="pclass")
g.map(sns.histplot, "survived", "sex")
plt.show()
# 2 `FacetGrid` + `sns.histplot` with `x` and `y`
g = sns.FacetGrid(df, col="pclass")
g.map(sns.histplot, "survived", 'age')
plt.show()
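With two categorical predictors and a binary outcome, faceted histograms can be condensed into a table of group means. Since `survived` is coded 0/1, its mean within each group is the survival rate. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["male", "male", "female", "female"] * 2,
    "survived": [0, 1, 1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 outcome per (pclass, sex) cell, with sex spread across columns
rate = toy.groupby(["pclass", "sex"])["survived"].mean().unstack("sex")
print(rate)
```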
# 3(a) `sns.violinplot` with `hue` and `y`
sns.violinplot(data=df, x="sex", y="age", hue='survived')
plt.show()
# 3(b) `sns.stripplot` with `hue` and `y`
sns.stripplot(data=df, x="sex", y="age", hue='survived', dodge=True, alpha=.25, zorder=1)
plt.show()
# 4 `sns.scatterplot` with `hue`
sns.scatterplot(data=df, x="age", y="fare", hue='survived', style='survived')
plt.show()
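`sns.regplot` does not take a `hue` argument; for one fitted line per group, the figure-level `sns.lmplot` does. A sketch on simulated data (the `toy` frame, the seed, and the slopes are invented for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data (hypothetical): two groups with different age -> fare slopes
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "age": rng.uniform(0, 80, size=2 * n),
    "survived": np.repeat([0, 1], n),
})
slope = np.where(toy["survived"] == 1, 0.8, 0.2)
toy["fare"] = 10 + slope * toy["age"] + rng.normal(0, 5, size=2 * n)

# One scatter cloud and one regression line per level of `survived`
g = sns.lmplot(data=toy, x="age", y="fare", hue="survived")
plt.show()
```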
# 5 `sns.kdeplot`
sns.kdeplot(data=df, x="age", y='fare', hue='survived')
plt.show()
g = sns.jointplot(data=df, x="age", y="fare", hue="survived", kind="kde")
plt.show()
# 6 sns.PairGrid
g = sns.PairGrid(df[['age', 'fare', 'pclass']])
g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2)
plt.show()
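The pairwise panels can be summarized numerically with a correlation matrix. A minimal sketch using `DataFrame.corr` on simulated data (the `toy` frame and the seed are made up):

```python
import numpy as np
import pandas as pd

# Simulated data (hypothetical): fare is related to age, pclass is independent
rng = np.random.default_rng(2)
age = rng.uniform(0, 80, size=200)
toy = pd.DataFrame({
    "age": age,
    "fare": 10 + 0.5 * age + rng.normal(0, 10, size=200),
    "pclass": rng.integers(1, 4, size=200),
})

# Pairwise Pearson correlations (the default method)
corr = toy[["age", "fare", "pclass"]].corr()
print(corr.round(2))
```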
Even more variables
Use `FacetGrid` to include more variables.
# 1 `FacetGrid`
g = sns.FacetGrid(df, col="pclass", row='alone')
g.map(sns.scatterplot, "fare", 'age')
plt.show()
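Figure-level functions such as `sns.relplot` build the `FacetGrid` for you and accept `row`, `col`, and `hue` directly, which adds one more variable to the picture. A sketch on simulated data (the `toy` frame and the seed are made up):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Simulated data (hypothetical values)
rng = np.random.default_rng(3)
n = 120
toy = pd.DataFrame({
    "fare": rng.uniform(0, 100, size=n),
    "age": rng.uniform(0, 80, size=n),
    "pclass": rng.integers(1, 4, size=n),
    "alone": rng.integers(0, 2, size=n).astype(bool),
    "survived": rng.integers(0, 2, size=n),
})

# Rows by `alone`, columns by `pclass`, color by `survived`
g = sns.relplot(data=toy, x="fare", y="age",
                row="alone", col="pclass", hue="survived")
plt.show()
```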
In-class practice
- Produce a boxplot of `x='pclass'` and `y='age'`.
- What happens if we add `palette='rainbow'` to `sns.boxplot`?
- Plot a histogram of `fare` with `bins=10` and with `bins=50`. What happens?
- Visualize the effect `age` -> `fare` conditional on different `parch` and `survived`.