STAT1013: Statistics in Python I
Load data in Pandas
- Read a CSV File
- Read an Excel File
Read a CSV file
import pandas as pd
## Read a CSV File via Github
# Step 1: Find raw csv data in a link (end up with .csv)
# E.g., find titanic dataset at https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
# If you use Github, please click `Raw` for the data
# Step 2: load data based on 'pd.read_csv'
df = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')
df.head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
## Read a CSV file via Raw data
# Step 1: Download the raw csv file
# Step 2: put the file in the same folder of this python code
# step 3: load data based on 'pd.read_csv'
df = pd.read_csv('data.csv')
df.head(5)
Read an Excel file
import pandas as pd
## Read a xls file via Raw data
# Step 1: Download the raw xls file
# Step 2: put the file in the same folder of this python code
# step 3: load data based on 'pd.read_csv'
df = pd.read_excel('data.xls')
df.head(5)
Descriptive-statistics
- Average; Conditional Average, Sample variance
df.mean()
,df.std()
,df.median()
,df.quantile()
,
df.head(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
## Types of columns
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
## compute some sample estimators (continuous)
## Measures of central tendency
print('sample mean of age')
print(df['Age'].mean())
print('sample median of fare')
print(df['Fare'].median())
print('sample mean of age conditional on <Sex = male>')
print(df[df['Sex'] == 'male']['Age'].mean())
## Measures of variability
print('sample std of fare')
print(df['Fare'].std())
print('sample quantile of fare')
print(df['Fare'].quantile(.32))
sample mean of age
29.69911764705882
sample median of fare
14.4542
sample mean of age conditional on <Sex = male>
30.72664459161148
sample std of fare
49.6934285971809
sample quantile of fare
8.100000000000001
## compute some sample estimators (categorical)
## Measures of central tendency
print('all possible outcome of pclass')
print(set(df['Pclass']))
print('freq of who of man')
print(len(df[df['Sex'] == 'male'])/len(df))
print('conditional prob of survived | male')
print(len(df[(df['Survived'] == 1) & (df['Sex'] == 'male')])/len(df[df['Sex'] == 'male']))
print('conditional prob of survived | female')
print(len(df[(df['Survived'] == 1) & (df['Sex'] == 'female')])/len(df[df['Sex'] == 'female']))
## Measures of variability
# using data visualization
all possible outcome of pclass
{1, 2, 3}
freq of who of man
0.6475869809203143
conditional prob of survived | male
0.18890814558058924
conditional prob of survived | female
0.7420382165605095
## check the basic statistics for all columns
df.describe(include='all').T
count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|---|---|---|
PassengerId | 891.0 | NaN | NaN | NaN | 446.0 | 257.353842 | 1.0 | 223.5 | 446.0 | 668.5 | 891.0 |
Survived | 891.0 | NaN | NaN | NaN | 0.383838 | 0.486592 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
Pclass | 891.0 | NaN | NaN | NaN | 2.308642 | 0.836071 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
Name | 891 | 891 | Braund, Mr. Owen Harris | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sex | 891 | 2 | male | 577 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Age | 714.0 | NaN | NaN | NaN | 29.699118 | 14.526497 | 0.42 | 20.125 | 28.0 | 38.0 | 80.0 |
SibSp | 891.0 | NaN | NaN | NaN | 0.523008 | 1.102743 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
Parch | 891.0 | NaN | NaN | NaN | 0.381594 | 0.806057 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
Ticket | 891 | 681 | 347082 | 7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Fare | 891.0 | NaN | NaN | NaN | 32.204208 | 49.693429 | 0.0 | 7.9104 | 14.4542 | 31.0 | 512.3292 |
Cabin | 204 | 147 | B96 B98 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Embarked | 889 | 3 | S | 644 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Groupby in DataFrame
DataFrame.groupby
is a very powerful tool to grouping data and compute conditional estimators.
How it works
Grouping the data -> then take
mean
,std
,min
,max
,median
## conditional mean and std on one outcome
print('conditional mean of Survived | Sex')
print(df.groupby('Sex')['Survived'].mean())
print('conditional std of Survived | Sex')
print(df.groupby('Sex')['Survived'].std())
conditional mean of Survived | Sex
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
conditional std of Survived | Sex
Sex
female 0.438211
male 0.391775
Name: Survived, dtype: float64
## conditional median on multiple outcome
print('conditional median of Survived | Sex')
print(df.groupby('Sex')[['Survived', 'Pclass']].median())
print('conditional median of ALL | Sex')
print(df.groupby('Sex').median())
conditional median of Survived | Sex
Survived Pclass
Sex
female 1.0 2.0
male 0.0 3.0
conditional median of ALL | Sex
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 414.5 1.0 2.0 27.0 0.0 0.0 23.0
male 464.0 0.0 3.0 29.0 0.0 0.0 10.5
## conditional (min, max, median, mean, std) on multiple outcome
print('conditional (min, max, median, mean, std) of Survived | Sex')
print(df.groupby('Sex')['Survived'].agg(['min', 'max', 'median', 'mean', 'std']))
print('conditional (min, max, median, mean, std) of Survived, Pclass | Sex')
print(df.groupby('Sex')[['Survived', 'Pclass']].agg(['min', 'max', 'median', 'mean', 'std']))
conditional (min, max, median, mean, std) of Survived | Sex
min max median mean std
Sex
female 0 1 1.0 0.742038 0.438211
male 0 1 0.0 0.188908 0.391775
conditional (min, max, median, mean, std) of Survived, Pclass | Sex
Survived Pclass \
min max median mean std min max median mean
Sex
female 0 1 1.0 0.742038 0.438211 1 3 2.0 2.159236
male 0 1 0.0 0.188908 0.391775 1 3 3.0 2.389948
std
Sex
female 0.85729
male 0.81358
InClass Practice
- How many Passenger’s pClass higher than 2?
- What is the survival ratio for each pClass?
- Print the median Fare for every pClass.
- Print the std of Fare for every Embarked.