STAT1013: Statistics in Python II

Descriptive statistics

  • Try to summarize the data in a compact, easily understood fashion.
  • Use pandas to show the descriptive statistics of a DataFrame.
  • The two main methods are pandas.DataFrame.info() and pandas.DataFrame.describe().
import pandas as pd
import seaborn as sns

df = sns.load_dataset('titanic')
## take a look at the dataframe
df.head(5)
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   NaN  Southampton     no   True
## Types of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB
## check the basic statistics for all columns
df.describe(include='all').T
             count  unique          top  freq       mean        std   min     25%      50%   75%       max
survived     891.0     NaN          NaN   NaN   0.383838   0.486592   0.0     0.0      0.0   1.0       1.0
pclass       891.0     NaN          NaN   NaN   2.308642   0.836071   1.0     2.0      3.0   3.0       3.0
sex            891       2         male   577        NaN        NaN   NaN     NaN      NaN   NaN       NaN
age          714.0     NaN          NaN   NaN  29.699118  14.526497  0.42  20.125     28.0  38.0      80.0
sibsp        891.0     NaN          NaN   NaN   0.523008   1.102743   0.0     0.0      0.0   1.0       8.0
parch        891.0     NaN          NaN   NaN   0.381594   0.806057   0.0     0.0      0.0   0.0       6.0
fare         891.0     NaN          NaN   NaN  32.204208  49.693429   0.0  7.9104  14.4542  31.0  512.3292
embarked       889       3            S   644        NaN        NaN   NaN     NaN      NaN   NaN       NaN
class          891       3        Third   491        NaN        NaN   NaN     NaN      NaN   NaN       NaN
who            891       3          man   537        NaN        NaN   NaN     NaN      NaN   NaN       NaN
adult_male     891       2         True   537        NaN        NaN   NaN     NaN      NaN   NaN       NaN
deck           203       7            C    59        NaN        NaN   NaN     NaN      NaN   NaN       NaN
embark_town    889       3  Southampton   644        NaN        NaN   NaN     NaN      NaN   NaN       NaN
alive          891       2           no   549        NaN        NaN   NaN     NaN      NaN   NaN       NaN
alone          891       2         True   537        NaN        NaN   NaN     NaN      NaN   NaN       NaN
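
The combined table mixes two kinds of summaries, which is why so many cells are NaN: numeric columns get mean, std and quartiles, while non-numeric columns get unique, top and freq. As a minimal sketch, the two can be requested separately, together with an explicit count of missing values. The `toy` DataFrame below is hypothetical data that only mimics a few Titanic columns; it is not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mimicking a few Titanic columns (not the real dataset).
toy = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "sex": ["male", "female", "female", "male", "female"],
    "age": [22.0, 38.0, np.nan, 35.0, 26.0],
    "fare": [7.25, 71.28, 7.92, 8.05, 53.10],
})

# Missing values per column (the complement of the non-null counts in info()).
missing = toy.isna().sum()

# Default describe(): numeric columns only (count, mean, std, min, quartiles, max).
numeric_summary = toy.describe()

# Non-numeric columns only (count, unique, top, freq).
categorical_summary = toy.describe(exclude="number")

print(missing)
print(numeric_summary.T)
print(categorical_summary.T)
```

On the real data the same calls would be `df.isna().sum()`, `df.describe()` and `df.describe(exclude="number")`.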

Data visualization

  • Before demonstrating data visualization in Python, we need to figure out the type of the data of interest. We mainly focus on categorical and continuous data.
  • Use seaborn to visualize your data, for both categorical and continuous variables.
  • See the example gallery in the seaborn documentation.

Single variable

  1. Overall distribution -> histogram (both cate and cont) -> sns.histplot, sns.stripplot, or sns.violinplot
  2. Quantile information -> boxplot (cont) -> sns.boxplot
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [10, 5]

sns.set()
## 1. histplot
sns.histplot(df, x='age')
plt.show()

sns.histplot(df, x='survived')
plt.show()

## 1(b) stripplot
sns.stripplot(x=df['age'])
plt.show()

## 1(c) violinplot
sns.violinplot(x=df['age'])
plt.show()

## 2 boxplot
sns.boxplot(x=df["age"])
plt.show()
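
The quantile information that a boxplot draws can also be computed directly. The sketch below uses a short hypothetical list of ages (not the Titanic column) and assumes the default whisker rule of 1.5 times the interquartile range: points beyond those fences are the ones drawn individually as outliers.

```python
import pandas as pd

# Hypothetical ages, chosen only to illustrate the computation.
ages = pd.Series([2, 19, 22, 26, 28, 35, 38, 54, 80], name="age")

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Whiskers reach the most extreme points still inside these fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is plotted as an individual point.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())
```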

Two variables: effect of X on Y

  1. X (cate) to Y (cate) -> histogram -> sns.histplot with hue
  2. X (cate) to Y (cont) -> histogram -> sns.histplot with hue, or sns.stripplot / sns.violinplot with the category on the other axis
  3. X (cont) to Y (cont) -> scatter plot -> sns.scatterplot (or sns.regplot to add a fitted line)
  4. X (cont) to Y (cate) -> histogram -> sns.histplot with hue, or sns.stripplot / sns.violinplot with the category on the other axis
# 1. `sns.histplot` with `hue`

## For example: sex -> survived
## put the Y (outcome) on the x-axis, since we want to check the histogram of the outcome
sns.histplot(df, x='survived', hue='sex')
plt.show()

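# 2(a). `sns.histplot` with the category on `y`

## For example: sex -> age
## put the Y (outcome) on the x-axis, since we want to check the histogram of the outcome
## y="sex" draws a two-dimensional histogram with one row per category;
## hue="sex" would overlay the two groups on a single axis instead
sns.histplot(data=df, x="age", y="sex")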
plt.show()

# 2(b). `sns.stripplot` with the category on `y`

sns.stripplot(data=df, x="age", y="sex")
plt.show()

# 2(c). `sns.violinplot` with the category on `y`

sns.violinplot(data=df, x="age", y="sex")
plt.show()

# 3 `sns.scatterplot` (use `sns.regplot` to add a fitted regression line)

sns.scatterplot(data=df, x="age", y='fare')
plt.show()

# 4(a) `sns.stripplot`

sns.stripplot(data=df, x="sex", y="age")
plt.show()

# 4(b) `sns.violinplot`

sns.violinplot(data=df, x="sex", y="age")
plt.show()
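
The two-variable plots above have simple numeric counterparts. As a rough sketch on hypothetical toy data (not the real Titanic records), `pd.crosstab` gives the counts behind a histogram split by `hue` for the cate -> cate case, and a grouped aggregation summarizes the cate -> cont case.

```python
import pandas as pd

# Hypothetical toy data mimicking three Titanic columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female", "male"],
    "survived": [0, 1, 1, 0, 0, 1],
    "age": [22.0, 38.0, 26.0, 35.0, 30.0, 4.0],
})

# cate -> cate: counts of survived within each sex.
counts = pd.crosstab(toy["sex"], toy["survived"])

# The same table as row proportions, i.e. survival rates per sex.
rates = pd.crosstab(toy["sex"], toy["survived"], normalize="index")

# cate -> cont: summary of age within each sex.
age_by_sex = toy.groupby("sex")["age"].agg(["count", "mean", "median"])

print(counts)
print(rates.round(2))
print(age_by_sex)
```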

Three variables

  1. X1 (cate) + X2 (cate) to Y (cate) -> histogram -> FacetGrid + sns.histplot
  2. X1 (cate) + X2 (cont) to Y (cate) -> histogram -> FacetGrid + sns.histplot
  3. X1 (cate) + X2 (cate) to Y (cont) -> violinplot -> sns.stripplot or sns.violinplot with hue and y
  4. X1 (cate) + X2 (cont) to Y (cont) -> scatter plot -> sns.scatterplot with hue (sns.regplot has no hue parameter; sns.lmplot does)
  5. X1 (cont) + X2 (cont) to Y (cate) -> density plot -> sns.kdeplot with hue
  6. X1 (cont) + X2 (cont) to Y (cont) -> pairwise plots -> sns.PairGrid
# 1 `sns.FacetGrid` + `sns.histplot` (the positional arguments of `g.map` become `x` and `y`)
g = sns.FacetGrid(df, col="pclass")
g.map(sns.histplot, "survived", "sex")
plt.show()

# 2 `sns.FacetGrid` + `sns.histplot` (the positional arguments of `g.map` become `x` and `y`)
g = sns.FacetGrid(df, col="pclass")
g.map(sns.histplot, "survived", 'age')
plt.show()

# 3(a) `sns.violinplot` with `hue` and `y`

sns.violinplot(data=df, x="sex", y="age", hue='survived')
plt.show()

# 3(b) `sns.stripplot` with `hue` and `y`

sns.stripplot(data=df, x="sex", y="age", hue='survived', dodge=True, alpha=.25, zorder=1)
plt.show()

# 4 `sns.scatterplot` with `hue`

sns.scatterplot(data=df, x="age", y="fare", hue='survived', style='survived')
plt.show()

# 5 `sns.kdeplot`

sns.kdeplot(data=df, x="age", y='fare', hue='survived')
plt.show()

g = sns.jointplot(data=df, x="age", y="fare", hue="survived", kind="kde")
plt.show()

# 6 `sns.PairGrid`

g = sns.PairGrid(df[['age', 'fare', 'pclass']])
g.map_upper(sns.scatterplot, s=15)
g.map_lower(sns.kdeplot)
g.map_diag(sns.kdeplot, lw=2)
plt.show()
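
A pairwise grid of continuous variables is often read alongside a correlation matrix. The sketch below is an optional numeric companion, using hypothetical toy values for the same three columns; `DataFrame.corr()` computes Pearson correlations by default.

```python
import pandas as pd

# Hypothetical toy values for the three columns used in the PairGrid.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0, 2.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "pclass": [3, 1, 3, 1, 1, 3],
})

# Pairwise Pearson correlations; the matrix is symmetric with ones on the diagonal.
corr = toy.corr()

print(corr.round(2))
```

On the real data the equivalent call would be `df[['age', 'fare', 'pclass']].corr()`; rows with a missing age are dropped pair by pair.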

Even more variables

Use `sns.FacetGrid` to include more variables: each extra categorical variable becomes a row or a column of subplots.

# 1 `sns.FacetGrid`

g = sns.FacetGrid(df, col="pclass", row='alone')
g.map(sns.scatterplot, "fare", 'age')
plt.show()
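
A fifth variable can be added to the same grid through `hue`. The sketch below is one possible way to do it, on hypothetical toy data rather than the Titanic dataset: `map_dataframe` passes each facet's subset as `data=` and takes keyword arguments, so `x` and `y` are named explicitly.

```python
import pandas as pd
import seaborn as sns

# Hypothetical toy data mimicking five Titanic columns.
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 2, 3, 3, 1, 2, 3, 3],
    "alone": [True, False, True, False, True, False, True, True, False, True],
    "survived": [1, 1, 0, 1, 0, 0, 1, 0, 1, 0],
    "age": [38.0, 35.0, 27.0, 14.0, 22.0, 4.0, 54.0, 30.0, 26.0, 35.0],
    "fare": [71.3, 53.1, 13.0, 30.1, 7.25, 16.7, 51.9, 10.5, 7.9, 8.05],
})

# Rows and columns split the data by two categorical variables;
# hue colors the points by a third one.
g = sns.FacetGrid(toy, row="alone", col="pclass", hue="survived")
g.map_dataframe(sns.scatterplot, x="fare", y="age")
g.set_axis_labels("fare", "age")
g.add_legend(title="survived")

# One subplot per (alone, pclass) combination.
n_rows, n_cols = g.axes.shape
print(n_rows, n_cols)
```

Call `plt.show()` afterwards, as in the cells above, to display the figure.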

In-class practice

  • Produce a boxplot with x='pclass' and y='age'.
    • What happens if we add palette='rainbow' to sns.boxplot?
  • Plot histograms of 'fare' with bins=10 and bins=50.
  • Visualize the effect age -> fare conditional on different values of parch and survived.