STAT1013: Data Science Overview
Software preparation
Colab
Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.
from IPython.display import YouTubeVideo
YouTubeVideo('inN8seMm7UI')
Install numpy, pandas, seaborn, and scikit-learn in your Colab:
!pip install numpy pandas seaborn scikit-learn
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Requirement already satisfied: numpy in /home/ben/np/lib/python3.10/site-packages (1.22.3)
Requirement already satisfied: pandas in /home/ben/np/lib/python3.10/site-packages (1.4.2)
Requirement already satisfied: seaborn in /home/ben/np/lib/python3.10/site-packages (0.11.2)
Requirement already satisfied: scikit-learn in /home/ben/np/lib/python3.10/site-packages (1.0.2)
Requirement already satisfied: pytz>=2020.1 in /home/ben/np/lib/python3.10/site-packages (from pandas) (2022.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/ben/np/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: scipy>=1.0 in /home/ben/np/lib/python3.10/site-packages (from seaborn) (1.8.0)
Requirement already satisfied: matplotlib>=2.2 in /home/ben/np/lib/python3.10/site-packages (from seaborn) (3.5.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/ben/np/lib/python3.10/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /home/ben/np/lib/python3.10/site-packages (from scikit-learn) (1.1.0)
Requirement already satisfied: pyparsing>=2.2.1 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (3.0.8)
Requirement already satisfied: fonttools>=4.22.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (4.34.4)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (1.4.4)
Requirement already satisfied: pillow>=6.2.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (9.2.0)
Requirement already satisfied: cycler>=0.10 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (21.3)
Requirement already satisfied: six>=1.5 in /home/ben/np/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
1.22.3
1.4.2
Basic Usage and Types of Data
Basic types in Python
Adapted from Scipy lecture notes and Python Numpy Tutorial. Python supports the following types:
## int
1 + 1
2
x = 4
type(x)
int
## float
x = 2.1
type(x)
float
## Booleans
1 > 2
False
x = (1 > 2)
type(x)
bool
## Type conversion (casting):
float(1)
1.0
## arithmetic operation
x, y = 3, 2
print(x + y)
print(x - y)
print(x * y)
print(x / y)
print(x // y)
print(x % y)
print(-x)
print(abs(-x))
print(int(3.9))
print(float(x))
print(x ** y)
5
1
6
1.5
1
1
-3
3
3
3.0
9
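The three division-related operators above are easiest to compare side by side. The following is a minimal sketch with illustrative values that are not taken from the notes; it checks the identity `x == (x // y) * y + (x % y)` and shows that `//` rounds toward negative infinity when an operand is negative.

```python
# Illustrative values (an assumption, not from the notes above).
x, y = 7, 2

quotient = x // y      # floor division
remainder = x % y      # remainder
print(x / y)           # true division always returns a float
print(quotient, remainder)

# Floor division and remainder always satisfy this identity.
identity_holds = (x == quotient * y + remainder)
print(identity_holds)

# With a negative operand, // rounds down (toward negative infinity),
# so the remainder takes the sign of the divisor.
neg_quotient = -7 // 2
neg_remainder = -7 % 2
print(neg_quotient, neg_remainder)
```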
Containers
Python provides many efficient types of containers, in which collections of objects can be stored.
## lists
colors = ['red', 'blue', 'green', 'black', 'white']
print(colors)
['red', 'blue', 'green', 'black', 'white']
type(colors)
list
## Indexing: accessing individual objects contained in the list:
colors[0]
'red'
colors[-1]
'white'
colors[2:4]
['green', 'black']
## Dictionaries
tel = {'emmanuelle': 5752, 'sebastian': 5578}
tel
{'emmanuelle': 5752, 'sebastian': 5578}
tel['francis'] = 5915
tel
{'emmanuelle': 5752, 'sebastian': 5578, 'francis': 5915}
tel['sebastian']
5578
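Beyond direct lookup, a dictionary is commonly used in three other ways: membership tests, safe lookup, and iteration. The sketch below reuses the phone-book dictionary from above; the name 'alice' is hypothetical and only illustrates a missing key.

```python
# Same phone-book dictionary as above.
tel = {'emmanuelle': 5752, 'sebastian': 5578, 'francis': 5915}

# The `in` operator checks keys, not values.
has_francis = 'francis' in tel

# .get() returns a default instead of raising KeyError for a missing key.
# 'alice' is a hypothetical name used only for illustration.
missing = tel.get('alice', None)

# Iterate over key/value pairs in insertion order.
for name, number in tel.items():
    print(name, number)
```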
## function
def add(x, y):
    return x+y
z = add(x=1,y=1)
print(z)
2
## positional arguments
z = add(1,2)
print(z)
3
## default argument
def add(x, y=10):
    return x+y
add(1)
11
Control flow
Type the following lines in your Python interpreter, and be careful to respect the indentation depth. The IPython shell automatically increases the indentation depth after a colon (:); to decrease it, move four spaces to the left with the Backspace key. Press the Enter key twice to leave the logical block.
## if/elif/else
if 2**2 == 4:
    print('correct!')
correct!
## step function
## f(x) = 1 if x < 0; f(x) = 0 if 0 <= x <= 1; f(x) = -1 if x > 1
def step_fun(x):
    if x < 0:
        return 1
    elif x > 1:
        return -1
    else:
        return 0
step_fun(100)
-1
step_fun(.5)
0
step_fun(-5)
1
## For loop
for i in range(5):
    print(i)
0
1
2
3
4
## loop over a list
for name in ['A', 'B', 'C', 1, 5, 'eric']:
    print(name)
A
B
C
1
5
eric
Data in Pandas and Numpy
Read and store .csv data in pandas and numpy
Pandas
Load the fmri dataset with pandas
import pandas as pd
## load from website
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/fmri.csv')
df.head(5)
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
0 | s13 | 18 | stim | parietal | -0.017552 |
1 | s5 | 14 | stim | parietal | -0.080883 |
2 | s12 | 18 | stim | parietal | -0.081033 |
3 | s11 | 18 | stim | parietal | -0.046134 |
4 | s10 | 18 | stim | parietal | -0.037970 |
## header option in read_csv
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/fmri.csv', header=0)
df.head(5)
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
0 | s13 | 18 | stim | parietal | -0.017552 |
1 | s5 | 14 | stim | parietal | -0.080883 |
2 | s12 | 18 | stim | parietal | -0.081033 |
3 | s11 | 18 | stim | parietal | -0.046134 |
4 | s10 | 18 | stim | parietal | -0.037970 |
## load from csv file
df = pd.read_csv('fmri.csv')
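Storing a table works the same way in reverse, with `DataFrame.to_csv`. The sketch below writes a tiny hand-made table to a temporary file and reads it back; the table values and the file name `small.csv` are hypothetical and are not part of the fmri data.

```python
import os
import tempfile

import pandas as pd

# A tiny hand-made table (hypothetical values, not the real fmri data).
small = pd.DataFrame({
    'subject': ['s1', 's2', 's3'],
    'timepoint': [0, 1, 2],
    'signal': [0.5, -0.25, 0.75],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'small.csv')

    # index=False leaves the row labels out of the file.
    small.to_csv(path, index=False)

    # Read the file back into a new DataFrame.
    restored = pd.read_csv(path)

print(restored)
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then shows up as a column when the file is read back.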
## read partial columns
df.columns
Index(['subject', 'timepoint', 'event', 'region', 'signal'], dtype='object')
## single column
df['timepoint']
0 18
1 14
2 18
3 18
4 18
..
1059 8
1060 7
1061 7
1062 7
1063 0
Name: timepoint, Length: 1064, dtype: int64
## multiple columns
df[['subject', 'timepoint']]
| | subject | timepoint |
---|---|---|
0 | s13 | 18 |
1 | s5 | 14 |
2 | s12 | 18 |
3 | s11 | 18 |
4 | s10 | 18 |
... | ... | ... |
1059 | s0 | 8 |
1060 | s13 | 7 |
1061 | s12 | 7 |
1062 | s11 | 7 |
1063 | s0 | 0 |
1064 rows × 2 columns
## select one row
df.iloc[0]
subject s13
timepoint 18
event stim
region parietal
signal -0.017552
Name: 0, dtype: object
## select many rows
df.iloc[0:10]
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
0 | s13 | 18 | stim | parietal | -0.017552 |
1 | s5 | 14 | stim | parietal | -0.080883 |
2 | s12 | 18 | stim | parietal | -0.081033 |
3 | s11 | 18 | stim | parietal | -0.046134 |
4 | s10 | 18 | stim | parietal | -0.037970 |
5 | s9 | 18 | stim | parietal | -0.103513 |
6 | s8 | 18 | stim | parietal | -0.064408 |
7 | s7 | 18 | stim | parietal | -0.060526 |
8 | s6 | 18 | stim | parietal | -0.007029 |
9 | s5 | 18 | stim | parietal | -0.040557 |
## select rows by a list of positions
df.iloc[[1,5,7]]
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
1 | s5 | 14 | stim | parietal | -0.080883 |
5 | s9 | 18 | stim | parietal | -0.103513 |
7 | s7 | 18 | stim | parietal | -0.060526 |
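`iloc` selects by integer position. Its label-based counterpart is `loc`, and the two only differ visibly when the index labels are not simply 0, 1, 2, and so on. The sketch below uses a hypothetical mini-table with the labels 10, 20, 30 to make that difference visible.

```python
import pandas as pd

# Hypothetical mini-table; the index labels are deliberately not 0, 1, 2.
demo = pd.DataFrame(
    {'subject': ['s13', 's5', 's12'], 'timepoint': [18, 14, 18]},
    index=[10, 20, 30],
)

first_by_position = demo.iloc[0]   # first row, whatever its label is
row_by_label = demo.loc[20]        # row whose index label is 20

# .loc can also select a row and a column together by name.
value = demo.loc[30, 'timepoint']

print(first_by_position['subject'], row_by_label['subject'], value)
```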
## conditional select
df[df['region'] == 'parietal']
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
0 | s13 | 18 | stim | parietal | -0.017552 |
1 | s5 | 14 | stim | parietal | -0.080883 |
2 | s12 | 18 | stim | parietal | -0.081033 |
3 | s11 | 18 | stim | parietal | -0.046134 |
4 | s10 | 18 | stim | parietal | -0.037970 |
... | ... | ... | ... | ... | ... |
930 | s7 | 16 | cue | parietal | -0.024589 |
931 | s8 | 9 | cue | parietal | -0.039664 |
949 | s6 | 9 | cue | parietal | -0.069248 |
967 | s5 | 9 | cue | parietal | -0.056757 |
1063 | s0 | 0 | cue | parietal | -0.006899 |
532 rows × 5 columns
df[df['signal'] > 0]
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
17 | s7 | 9 | stim | parietal | 0.058897 |
36 | s8 | 9 | stim | parietal | 0.170227 |
118 | s8 | 10 | stim | parietal | 0.065198 |
119 | s7 | 10 | stim | parietal | 0.001648 |
123 | s3 | 10 | stim | parietal | 0.089231 |
... | ... | ... | ... | ... | ... |
1043 | s10 | 12 | cue | frontal | 0.005711 |
1044 | s9 | 12 | cue | frontal | 0.024292 |
1051 | s8 | 8 | cue | frontal | 0.007278 |
1052 | s7 | 8 | cue | frontal | 0.015765 |
1059 | s0 | 8 | cue | frontal | 0.018165 |
406 rows × 5 columns
df[(df['signal'] > 0) & (df['region'] == 'parietal')]
| | subject | timepoint | event | region | signal |
---|---|---|---|---|---|
17 | s7 | 9 | stim | parietal | 0.058897 |
36 | s8 | 9 | stim | parietal | 0.170227 |
118 | s8 | 10 | stim | parietal | 0.065198 |
119 | s7 | 10 | stim | parietal | 0.001648 |
123 | s3 | 10 | stim | parietal | 0.089231 |
... | ... | ... | ... | ... | ... |
866 | s2 | 13 | cue | parietal | 0.039476 |
872 | s11 | 13 | cue | parietal | 0.028161 |
881 | s2 | 14 | cue | parietal | 0.035840 |
885 | s2 | 12 | cue | parietal | 0.030923 |
908 | s2 | 11 | cue | parietal | 0.009137 |
194 rows × 5 columns
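Conditions can be combined with `|` (or) and `~` (not) in the same way as with `&`, and `isin` tests membership in a list of values. The sketch below uses a few hypothetical rows shaped like the fmri table; each comparison needs its own parentheses because `&`, `|`, and `~` bind more tightly than `>` and `==`.

```python
import pandas as pd

# Hypothetical rows shaped like the fmri table above.
demo = pd.DataFrame({
    'region': ['parietal', 'frontal', 'parietal', 'frontal'],
    'event': ['stim', 'stim', 'cue', 'cue'],
    'signal': [0.05, -0.02, -0.01, 0.03],
})

# OR: positive signal, or parietal region.
either = demo[(demo['signal'] > 0) | (demo['region'] == 'parietal')]

# NOT: every row that is not a 'cue' event.
not_cue = demo[~(demo['event'] == 'cue')]

# isin: membership in a list of allowed values.
in_list = demo[demo['region'].isin(['frontal'])]

print(len(either), len(not_cue), len(in_list))
```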
## generate columns from existing columns
df['period'] = df['timepoint'] > 5
df
| | subject | timepoint | event | region | signal | period |
---|---|---|---|---|---|---|
0 | s13 | 18 | stim | parietal | -0.017552 | True |
1 | s5 | 14 | stim | parietal | -0.080883 | True |
2 | s12 | 18 | stim | parietal | -0.081033 | True |
3 | s11 | 18 | stim | parietal | -0.046134 | True |
4 | s10 | 18 | stim | parietal | -0.037970 | True |
... | ... | ... | ... | ... | ... | ... |
1059 | s0 | 8 | cue | frontal | 0.018165 | True |
1060 | s13 | 7 | cue | frontal | -0.029130 | True |
1061 | s12 | 7 | cue | frontal | -0.004939 | True |
1062 | s11 | 7 | cue | frontal | -0.025367 | True |
1063 | s0 | 0 | cue | parietal | -0.006899 | False |
1064 rows × 6 columns
## more about pandas
from IPython.display import YouTubeVideo
YouTubeVideo('vmEHCJofslg')
Numpy
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.
## array
import numpy as np
a = np.array([1, 2, 3]) # Create a rank 1 array
print(type(a)) # Prints "<class 'numpy.ndarray'>"
print(a.shape) # Prints "(3,)"
print(a[0], a[1], a[2]) # Prints "1 2 3"
a[0] = 5 # Change an element of the array
print(a) # Prints "[5 2 3]"
b = np.array([[1,2,3],[4,5,6]]) # Create a rank 2 array
print(b.shape) # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0]) # Prints "1 2 4"
<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4
a = np.zeros((2,2)) # Create an array of all zeros
print(a) # Prints "[[0. 0.]
#          [0. 0.]]"
b = np.ones((1,2)) # Create an array of all ones
print(b) # Prints "[[1. 1.]]"
c = np.full((2,2), 7) # Create a constant array (an integer fill value gives an integer array)
print(c) # Prints "[[7 7]
#          [7 7]]"
d = np.eye(2) # Create a 2x2 identity matrix
print(d) # Prints "[[1. 0.]
#          [0. 1.]]"
[[0. 0.]
[0. 0.]]
[[1. 1.]]
[[7 7]
[7 7]]
[[1. 0.]
[0. 1.]]
- Basic array usage
  - row selection
  - conditional selection
  - casting
## array index
import numpy as np
# Create the following rank 2 array with shape (3, 4)
# [[ 1 2 3 4]
# [ 5 6 7 8]
# [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
a[1:3, 2:3]
array([[ 7],
[11]])
## conditional select
a[a>8]
array([ 9, 10, 11, 12])
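Conditional selection works in two steps: the comparison first builds a boolean array of the same shape, and that array is then used as an index. The sketch below makes the intermediate mask explicit; the variable names `mask` and `clipped` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# The comparison itself produces a boolean array of the same shape as `a`.
mask = a > 8
print(mask.shape, mask.dtype)

# Counting matches: each True counts as 1.
n_matches = mask.sum()

# A mask can also be used for assignment; work on a copy to keep `a` intact.
clipped = a.copy()
clipped[mask] = 8

print(n_matches)
print(clipped)
```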
## casting
x = np.array([1, 2], dtype=np.int64)
print(x)
[1 2]
np.array(x, dtype=np.float32)
array([1., 2.], dtype=float32)
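An existing array is more commonly converted with its `astype` method, which returns a new array and leaves the original unchanged. The sketch below also shows that casting floats to integers drops the fractional part rather than rounding; the values in `y` are illustrative.

```python
import numpy as np

x = np.array([1, 2], dtype=np.int64)

# astype returns a new array with the requested dtype.
x_float = x.astype(np.float32)

# Casting floats to integers truncates toward zero (no rounding).
y = np.array([1.9, -1.9])
y_int = y.astype(np.int64)

print(x_float.dtype, x.dtype)
print(y_int)
```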
- Math operators
  - vector-based
  - matrix-based
## Math operators
## note: do not do arithmetic directly on a Python list; convert it to a NumPy array first
## add scalar
a = np.array([1, 2, 3, 4])
a + 1
array([2, 3, 4, 5])
## two array
b = np.ones(4) + 1
a - b
array([-1., 0., 1., 2.])
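The reason arithmetic should be done on arrays rather than lists is that the same operators mean different things for the two types. A minimal sketch, with illustrative variable names:

```python
import numpy as np

values = [1, 2, 3, 4]

# On lists, + concatenates and * repeats.
list_sum = values + values
list_times_two = values * 2

# On arrays, the same operators act element by element.
arr = np.array(values)
arr_sum = arr + arr
arr_times_two = arr * 2

print(list_sum)
print(arr_sum)
```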
## elementwise operators on matrices
c = np.ones((3, 3))
c+c
array([[2., 2., 2.],
[2., 2., 2.],
[2., 2., 2.]])
c*c
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
## matrix product
np.dot(c, c) # or c.dot(c)
array([[3., 3., 3.],
[3., 3., 3.],
[3., 3., 3.]])
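With an all-ones matrix it is hard to tell the elementwise product from the matrix product. The sketch below uses two small illustrative matrices where the results differ, and also shows the `@` operator, which gives the same result as `np.dot` for 2-D arrays.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])

elementwise = A * B        # multiplies matching entries
matrix_product = A @ B     # row-by-column product, same as np.dot(A, B) here

print(elementwise)
print(matrix_product)
```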
## some basic functions
x = np.array([1, 2, 3, 4])
x.sum()
10
np.sort(x)
array([1, 2, 3, 4])
x.min()
1
x.argmin()
0
## reductions on a 2-D array (the axis argument)
x = np.array([[1, 1], [2, 2]])
x.sum()
6
x.sum(axis=0)
array([3, 3])
x.sum(axis=1)
array([2, 4])
x.min()
1
x.min(axis=0)
array([1, 1])
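The `axis` argument names the dimension that is collapsed by the reduction. A non-square array makes this easier to see, because the shape of the result shows which axis disappeared. The array `m` below is illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

col_sums = m.sum(axis=0)   # collapses the 2 rows, result has shape (3,)
row_sums = m.sum(axis=1)   # collapses the 3 columns, result has shape (2,)

# keepdims=True keeps the reduced axis with length 1.
row_sums_2d = m.sum(axis=1, keepdims=True)

print(col_sums, row_sums, row_sums_2d.shape)
```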
from IPython.display import YouTubeVideo
YouTubeVideo('QUT1VHiLmmI')
In-Class Exercise
- Example 1
Compute the sum of $1 + 2 + 3 + \cdots + 100$
- Example 2
Compute the sum of $1/1 + 1/2 + 1/3 + \cdots + 1/100$
- Example 3
Given a list a = [1, 3, 5, 7, 4], replace its maximum value with its minimum value.
- Example 4
The ReLU function is a widely used activation function in machine learning and deep learning. It is defined elementwise as $f(x) = \max(0, x)$, so it maps a vector in $\mathbb{R}^d$ to a vector in $\mathbb{R}^d$. Implement a Python function relu() that returns the ReLU of a given vector.