STAT1013: Statistics in Python I

Load data in Pandas

  • Read a CSV File
  • Read an Excel File

Read a CSV file

import pandas as pd

## Read a CSV File via Github

# Step 1: Find raw csv data in a link (end up with .csv)
# E.g., find titanic dataset at https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
# If you use Github, please click `Raw` for the data

# Step 2: load data based on 'pd.read_csv'
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
## Read a CSV file via Raw data

# Step 1: Download the raw csv file

# Step 2: put the file in the same folder of this python code

# step 3: load data based on 'pd.read_csv'

df = pd.read_csv('data.csv')
df.head(5)

Read an Excel file

import pandas as pd

## Read a xls file via Raw data

# Step 1: Download the raw xls file

# Step 2: put the file in the same folder of this python code

# step 3: load data based on 'pd.read_csv'

df = pd.read_excel('data.xls')
df.head(5)

Descriptive-statistics

  • Average; Conditional Average, Sample variance
  • df.mean(), df.std(), df.median(), df.quantile(),
df.head(5)
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
## Types of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
## compute some sample estimators (continuous)

## Measures of central tendency

print('sample mean of age')
print(df['Age'].mean())

print('sample median of fare')
print(df['Fare'].median())

print('sample mean of age conditional on <Sex = male>')
print(df[df['Sex'] == 'male']['Age'].mean())

## Measures of variability
print('sample std of fare')
print(df['Fare'].std())

print('sample quantile of fare')
print(df['Fare'].quantile(.32))
sample mean of age
29.69911764705882
sample median of fare
14.4542
sample mean of age conditional on <Sex = male>
30.72664459161148
sample std of fare
49.6934285971809
sample quantile of fare
8.100000000000001
## compute some sample estimators (categorical)

## Measures of central tendency

print('all possible outcome of pclass')
print(set(df['Pclass']))

print('freq of who of man')
print(len(df[df['Sex'] == 'male'])/len(df))

print('conditional prob of survived | male')
print(len(df[(df['Survived'] == 1) & (df['Sex'] == 'male')])/len(df[df['Sex'] == 'male']))

print('conditional prob of survived | female')
print(len(df[(df['Survived'] == 1) & (df['Sex'] == 'female')])/len(df[df['Sex'] == 'female']))
## Measures of variability

# using data visualization
all possible outcome of pclass
{1, 2, 3}
freq of who of man
0.6475869809203143
conditional prob of survived | male
0.18890814558058924
conditional prob of survived | female
0.7420382165605095
## check the basic statistics for all columns
df.describe(include='all').T
countuniquetopfreqmeanstdmin25%50%75%max
PassengerId891.0NaNNaNNaN446.0257.3538421.0223.5446.0668.5891.0
Survived891.0NaNNaNNaN0.3838380.4865920.00.00.01.01.0
Pclass891.0NaNNaNNaN2.3086420.8360711.02.03.03.03.0
Name891891Braund, Mr. Owen Harris1NaNNaNNaNNaNNaNNaNNaN
Sex8912male577NaNNaNNaNNaNNaNNaNNaN
Age714.0NaNNaNNaN29.69911814.5264970.4220.12528.038.080.0
SibSp891.0NaNNaNNaN0.5230081.1027430.00.00.01.08.0
Parch891.0NaNNaNNaN0.3815940.8060570.00.00.00.06.0
Ticket8916813470827NaNNaNNaNNaNNaNNaNNaN
Fare891.0NaNNaNNaN32.20420849.6934290.07.910414.454231.0512.3292
Cabin204147B96 B984NaNNaNNaNNaNNaNNaNNaN
Embarked8893S644NaNNaNNaNNaNNaNNaNNaN

Groupby in DataFrame

DataFrame.groupby is a very powerful tool to grouping data and compute conditional estimators.

  • How it works

  • Grouping the data -> then take mean, std, min, max, median

## conditional mean and std on one outcome

print('conditional mean of Survived | Sex')
print(df.groupby('Sex')['Survived'].mean())

print('conditional std of Survived | Sex')
print(df.groupby('Sex')['Survived'].std())
conditional mean of Survived | Sex
Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
conditional std of Survived | Sex
Sex
female    0.438211
male      0.391775
Name: Survived, dtype: float64
## conditional median on multiple outcome

print('conditional median of Survived | Sex')
print(df.groupby('Sex')[['Survived', 'Pclass']].median())

print('conditional median of ALL | Sex')
print(df.groupby('Sex').median())
conditional median of Survived | Sex
        Survived  Pclass
Sex                     
female       1.0     2.0
male         0.0     3.0
conditional median of ALL | Sex
        PassengerId  Survived  Pclass   Age  SibSp  Parch  Fare
Sex                                                            
female        414.5       1.0     2.0  27.0    0.0    0.0  23.0
male          464.0       0.0     3.0  29.0    0.0    0.0  10.5
## conditional (min, max, median, mean, std) on multiple outcome
print('conditional (min, max, median, mean, std) of Survived | Sex')
print(df.groupby('Sex')['Survived'].agg(['min', 'max', 'median', 'mean', 'std']))


print('conditional (min, max, median, mean, std) of Survived, Pclass | Sex')
print(df.groupby('Sex')[['Survived', 'Pclass']].agg(['min', 'max', 'median', 'mean', 'std']))

conditional (min, max, median, mean, std) of Survived | Sex
        min  max  median      mean       std
Sex                                         
female    0    1     1.0  0.742038  0.438211
male      0    1     0.0  0.188908  0.391775
conditional (min, max, median, mean, std) of Survived, Pclass | Sex
       Survived                                Pclass                       \
            min max median      mean       std    min max median      mean   
Sex                                                                          
female        0   1    1.0  0.742038  0.438211      1   3    2.0  2.159236   
male          0   1    0.0  0.188908  0.391775      1   3    3.0  2.389948   

                 
            std  
Sex              
female  0.85729  
male    0.81358  

InClass Practice

  • How many Passenger’s pClass higher than 2?
  • What is the survival ratio for each pClass?
  • Print the median Fare for every pClass.
  • Print the std of Fare for every Embarked.