STAT1013: Data Science Overview

  1. Software preparation
  2. Basic Usage and Types of Data
  3. Data in Pandas and Numpy

Software preparation

Colab

Colab notebooks allow you to combine executable code and rich text in a single document, along with images, HTML, LaTeX and more.

from IPython.display import YouTubeVideo
YouTubeVideo('inN8seMm7UI')

Install numpy, pandas, seaborn, and scikit-learn in your Colab

!pip install numpy pandas seaborn scikit-learn
Looking in indexes: https://pypi.org/simple, https://packagecloud.io/github/git-lfs/pypi/simple
Requirement already satisfied: numpy in /home/ben/np/lib/python3.10/site-packages (1.22.3)
Requirement already satisfied: pandas in /home/ben/np/lib/python3.10/site-packages (1.4.2)
Requirement already satisfied: seaborn in /home/ben/np/lib/python3.10/site-packages (0.11.2)
Requirement already satisfied: scikit-learn in /home/ben/np/lib/python3.10/site-packages (1.0.2)
Requirement already satisfied: pytz>=2020.1 in /home/ben/np/lib/python3.10/site-packages (from pandas) (2022.1)
Requirement already satisfied: python-dateutil>=2.8.1 in /home/ben/np/lib/python3.10/site-packages (from pandas) (2.8.2)
Requirement already satisfied: scipy>=1.0 in /home/ben/np/lib/python3.10/site-packages (from seaborn) (1.8.0)
Requirement already satisfied: matplotlib>=2.2 in /home/ben/np/lib/python3.10/site-packages (from seaborn) (3.5.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/ben/np/lib/python3.10/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: joblib>=0.11 in /home/ben/np/lib/python3.10/site-packages (from scikit-learn) (1.1.0)
Requirement already satisfied: pyparsing>=2.2.1 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (3.0.8)
Requirement already satisfied: fonttools>=4.22.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (4.34.4)
Requirement already satisfied: kiwisolver>=1.0.1 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (1.4.4)
Requirement already satisfied: pillow>=6.2.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (9.2.0)
Requirement already satisfied: cycler>=0.10 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (0.11.0)
Requirement already satisfied: packaging>=20.0 in /home/ben/np/lib/python3.10/site-packages (from matplotlib>=2.2->seaborn) (21.3)
Requirement already satisfied: six>=1.5 in /home/ben/np/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas) (1.16.0)
import numpy as np
print(np.__version__)

import pandas as pd
print(pd.__version__)
1.22.3
1.4.2

Basic Usage and Types of Data

Basic types in Python

Adapted from Scipy lecture notes and Python Numpy Tutorial. Python supports the following types:

Types of data

## int
1 + 1
2
x = 4
type(x)
int
## float
x = 2.1
type(x)
float
## Booleans
1 > 2
False
x = (1 > 2)
type(x)
bool
## Type conversion (casting):
float(1)
1.0
## arithmetic operation
x, y = 3, 2
print(x + y) 
print(x - y) 
print(x * y) 
print(x / y) 
print(x // y) 
print(x % y) 
print(-x) 
print(abs(-x)) 
print(int(3.9)) 
print(float(x)) 
print(x ** y) 
5
1
6
1.5
1
1
-3
3
3
3.0
9
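
The `//` and `%` operators above are linked by a simple identity, and they behave differently from `int()` when a number is negative. The following is a minimal sketch; the values 7 and 2 are chosen only for illustration.

```python
# Floor division and modulo satisfy: x == (x // y) * y + (x % y)
x, y = 7, 2
q, r = x // y, x % y
print(q, r)              # 3 1
print(q * y + r == x)    # True

# With a negative operand, // rounds toward negative infinity,
# while int() truncates toward zero.
print(-7 // 2)           # -4
print(int(-7 / 2))       # -3
print(-7 % 2)            # 1
```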

Containers

Python provides many efficient types of containers, in which collections of objects can be stored.

## lists
colors = ['red', 'blue', 'green', 'black', 'white']
print(colors)
['red', 'blue', 'green', 'black', 'white']
type(colors)
list
## Indexing: accessing individual objects contained in the list:
colors[0]
'red'
colors[-1]
'white'
colors[2:4]
['green', 'black']
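
The slice `colors[2:4]` above includes index 2 and excludes index 4. A short sketch of a few more slicing forms, and of the fact that lists can be modified in place, might look like this (the new values `'yellow'` and `'pink'` are illustrative only):

```python
colors = ['red', 'blue', 'green', 'black', 'white']

# A slice start:stop includes start and excludes stop.
middle = colors[2:4]
print(middle)          # ['green', 'black']
print(colors[:2])      # ['red', 'blue']
print(colors[3:])      # ['black', 'white']
print(colors[::2])     # every second item: ['red', 'green', 'white']

# Lists are mutable: items can be replaced or appended in place.
colors[0] = 'yellow'
colors.append('pink')
print(len(colors))     # 6
```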
## Dictionaries
tel = {'emmanuelle': 5752, 'sebastian': 5578}
tel
{'emmanuelle': 5752, 'sebastian': 5578}
tel['francis'] = 5915
tel
{'emmanuelle': 5752, 'sebastian': 5578, 'francis': 5915}
tel['sebastian']
5578
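
Looking up a key that is not in the dictionary with `tel[key]` raises a `KeyError`. As a hedged sketch, the membership test `in`, the `get` method, and iteration over `items()` are common ways to work with a dictionary safely; the key `'alice'` below is a hypothetical missing entry.

```python
tel = {'emmanuelle': 5752, 'sebastian': 5578, 'francis': 5915}

# Test for a key before using it.
print('francis' in tel)       # True
print('alice' in tel)         # False

# get() returns None (or a supplied default) instead of raising KeyError.
print(tel.get('alice'))       # None
print(tel.get('alice', 0))    # 0

# Iterate over key/value pairs.
for name, number in tel.items():
    print(name, number)
```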
## function
def add(x, y):
    return x+y
z = add(x=1,y=1)
print(z)
2
## position argument
z = add(1,2)
print(z)
3
## default argument
def add(x, y=10):
    return x+y
add(1)
11

Control flow

Type the following lines in your Python interpreter, and be careful to respect the indentation depth. The IPython shell automatically increases the indentation depth after a colon (:); to decrease it, move four spaces to the left with the Backspace key. Press the Enter key twice to leave the logical block.

## if/elif/else
if 2**2 == 4:
    print('correct!')
correct!
## step function
## f(x) = 1 if x < 0; f(x) = 0 if 0 <= x <= 1; f(x) = -1 if x > 1
def step_fun(x):
    if x < 0:
        return 1
    elif x> 1:
        return -1
    else:
        return 0
step_fun(100)
-1
step_fun(.5)
0
step_fun(-5)
1
## For loop
for i in range(5):
    print(i)
0
1
2
3
4
## loop over a list
for name in ['A', 'B', 'C', 1, 5, 'eric']:
    print(name)
A
B
C
1
5
eric

Data in Pandas and Numpy

Read and store .csv data in pandas and numpy

Types of data

Pandas

Load the fmri dataset with pandas

import pandas as pd

## load from website
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/fmri.csv')
df.head(5)
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970
## header option in read_csv
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/fmri.csv', header=0)
df.head(5)
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970
## load from csv file
df = pd.read_csv('fmri.csv')
## list the column names
df.columns
Index(['subject', 'timepoint', 'event', 'region', 'signal'], dtype='object')
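
To read only some of the columns at load time, `pd.read_csv` accepts a `usecols` argument. The sketch below uses a small in-memory stand-in for `fmri.csv` (its first three rows, as shown earlier) so that it runs without a file or network access; `csv_text` and `df_small` are illustrative names.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for fmri.csv (header plus three rows).
csv_text = """subject,timepoint,event,region,signal
s13,18,stim,parietal,-0.017552
s5,14,stim,parietal,-0.080883
s12,18,stim,parietal,-0.081033
"""

# usecols keeps only the named columns while parsing.
df_small = pd.read_csv(io.StringIO(csv_text), usecols=['subject', 'signal'])
print(df_small.columns.tolist())   # ['subject', 'signal']
print(df_small.shape)              # (3, 2)
```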
## single column
df['timepoint']
0       18
1       14
2       18
3       18
4       18
        ..
1059     8
1060     7
1061     7
1062     7
1063     0
Name: timepoint, Length: 1064, dtype: int64
## multiple columns
df[['subject', 'timepoint']]
     subject  timepoint
0        s13         18
1         s5         14
2        s12         18
3        s11         18
4        s10         18
...      ...        ...
1059      s0          8
1060     s13          7
1061     s12          7
1062     s11          7
1063      s0          0

1064 rows × 2 columns

## select one row
df.iloc[0]
subject           s13
timepoint          18
event            stim
region       parietal
signal      -0.017552
Name: 0, dtype: object
## select a contiguous range of rows
df.iloc[0:10]
  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970
5      s9         18  stim  parietal -0.103513
6      s8         18  stim  parietal -0.064408
7      s7         18  stim  parietal -0.060526
8      s6         18  stim  parietal -0.007029
9      s5         18  stim  parietal -0.040557
## select specific rows by position
df.iloc[[1,5,7]]
  subject  timepoint event    region    signal
1      s5         14  stim  parietal -0.080883
5      s9         18  stim  parietal -0.103513
7      s7         18  stim  parietal -0.060526
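
`iloc` selects by integer position, whereas `loc` selects by index label. The two agree on a freshly loaded DataFrame but can differ after filtering, because filtering keeps the original labels. A minimal sketch on a small hypothetical frame (`df_demo` is not part of the fmri data):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'subject': ['s13', 's5', 's12', 's11'],
     'timepoint': [18, 14, 18, 18]}
)

# Filtering drops the row labelled 1 but keeps the labels 0, 2, 3.
sub = df_demo[df_demo['timepoint'] == 18]
print(sub.index.tolist())        # [0, 2, 3]

# Position 1 and label 2 refer to the same row here.
print(sub.iloc[1]['subject'])    # s12
print(sub.loc[2]['subject'])     # s12
```

Calling `sub.loc[1]` would raise a `KeyError`, since label 1 is no longer present.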
## conditional select
df[df['region'] == 'parietal']
     subject  timepoint event    region    signal
0        s13         18  stim  parietal -0.017552
1         s5         14  stim  parietal -0.080883
2        s12         18  stim  parietal -0.081033
3        s11         18  stim  parietal -0.046134
4        s10         18  stim  parietal -0.037970
...      ...        ...   ...       ...       ...
930       s7         16   cue  parietal -0.024589
931       s8          9   cue  parietal -0.039664
949       s6          9   cue  parietal -0.069248
967       s5          9   cue  parietal -0.056757
1063      s0          0   cue  parietal -0.006899

532 rows × 5 columns

df[df['signal'] > 0]
     subject  timepoint event    region    signal
17        s7          9  stim  parietal  0.058897
36        s8          9  stim  parietal  0.170227
118       s8         10  stim  parietal  0.065198
119       s7         10  stim  parietal  0.001648
123       s3         10  stim  parietal  0.089231
...      ...        ...   ...       ...       ...
1043     s10         12   cue   frontal  0.005711
1044      s9         12   cue   frontal  0.024292
1051      s8          8   cue   frontal  0.007278
1052      s7          8   cue   frontal  0.015765
1059      s0          8   cue   frontal  0.018165

406 rows × 5 columns

df[(df['signal'] > 0) & (df['region'] == 'parietal')]
    subject  timepoint event    region    signal
17       s7          9  stim  parietal  0.058897
36       s8          9  stim  parietal  0.170227
118      s8         10  stim  parietal  0.065198
119      s7         10  stim  parietal  0.001648
123      s3         10  stim  parietal  0.089231
...     ...        ...   ...       ...       ...
866      s2         13   cue  parietal  0.039476
872     s11         13   cue  parietal  0.028161
881      s2         14   cue  parietal  0.035840
885      s2         12   cue  parietal  0.030923
908      s2         11   cue  parietal  0.009137

194 rows × 5 columns
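
The conditional selections above work by building a Boolean mask, one True/False value per row. Masks are combined with `&` (and), `|` (or) and `~` (not), and each comparison needs its own parentheses because these operators bind more tightly than `>` or `==`. The sketch below uses a four-row hypothetical frame; `DataFrame.query` is shown as one possible alternative spelling.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'region': ['parietal', 'frontal', 'parietal', 'frontal'],
     'signal': [0.058897, 0.005711, -0.017552, -0.029130]}
)

# Each comparison yields a Boolean Series; & combines them row by row.
mask = (df_demo['signal'] > 0) & (df_demo['region'] == 'parietal')
print(mask.tolist())                  # [True, False, False, False]
print(df_demo[mask])

# query() expresses the same condition as a string.
same = df_demo.query("signal > 0 and region == 'parietal'")
print(same.equals(df_demo[mask]))     # True

# | means "or", ~ means "not".
either = df_demo[(df_demo['signal'] > 0) | (df_demo['region'] == 'parietal')]
print(len(either))                    # 3
print(len(df_demo[~mask]))            # 3
```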

## generate columns from existing columns
df['period'] = df['timepoint'] > 5
df
     subject  timepoint event    region    signal  period
0        s13         18  stim  parietal -0.017552    True
1         s5         14  stim  parietal -0.080883    True
2        s12         18  stim  parietal -0.081033    True
3        s11         18  stim  parietal -0.046134    True
4        s10         18  stim  parietal -0.037970    True
...      ...        ...   ...       ...       ...     ...
1059      s0          8   cue   frontal  0.018165    True
1060     s13          7   cue   frontal -0.029130    True
1061     s12          7   cue   frontal -0.004939    True
1062     s11          7   cue   frontal -0.025367    True
1063      s0          0   cue  parietal -0.006899   False

1064 rows × 6 columns
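
This section's heading mentions storing .csv data and moving it into NumPy, so here is a hedged sketch of both steps. It writes a small hypothetical frame to an in-memory buffer with `to_csv` (a file name such as `'out.csv'` could be passed instead), reads it back, and converts columns to arrays with `to_numpy`.

```python
import io
import pandas as pd

df_demo = pd.DataFrame(
    {'subject': ['s13', 's5', 's0'],
     'timepoint': [18, 14, 0],
     'signal': [-0.017552, -0.080883, -0.006899]}
)
df_demo['period'] = df_demo['timepoint'] > 5

# Store: index=False avoids writing the row labels as an extra column.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Read the stored text back into a new DataFrame.
buffer.seek(0)
df_back = pd.read_csv(buffer)
print(df_back.shape)                 # (3, 4)
print(df_back['period'].tolist())    # [True, True, False]

# Convert one column, or several numeric columns, to NumPy arrays.
signal = df_demo['signal'].to_numpy()
matrix = df_demo[['timepoint', 'signal']].to_numpy()
print(signal.shape, matrix.shape)    # (3,) (3, 2)
```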

## more about pandas
from IPython.display import YouTubeVideo
YouTubeVideo('vmEHCJofslg')

Numpy

Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. If you are already familiar with MATLAB, you might find this tutorial useful to get started with Numpy.

## array
import numpy as np

a = np.array([1, 2, 3])   # Create a rank 1 array
print(type(a))            # Prints "<class 'numpy.ndarray'>"
print(a.shape)            # Prints "(3,)"
print(a[0], a[1], a[2])   # Prints "1 2 3"
a[0] = 5                  # Change an element of the array
print(a)                  # Prints "[5 2 3]"

b = np.array([[1,2,3],[4,5,6]])    # Create a rank 2 array
print(b.shape)                     # Prints "(2, 3)"
print(b[0, 0], b[0, 1], b[1, 0])   # Prints "1 2 4"
<class 'numpy.ndarray'>
(3,)
1 2 3
[5 2 3]
(2, 3)
1 2 4
a = np.zeros((2,2))   # Create an array of all zeros
print(a)              # Prints "[[0. 0.]
                      #          [0. 0.]]"

b = np.ones((1,2))    # Create an array of all ones
print(b)              # Prints "[[1. 1.]]"

c = np.full((2,2), 7)  # Create a constant array (integer dtype, since 7 is an int)
print(c)               # Prints "[[7 7]
                       #          [7 7]]"

d = np.eye(2)         # Create a 2x2 identity matrix
print(d)              # Prints "[[1. 0.]
                      #          [0. 1.]]"
[[0. 0.]
 [0. 0.]]
[[1. 1.]]
[[7 7]
 [7 7]]
[[1. 0.]
 [0. 1.]]
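
In the output above, `np.full((2,2), 7)` prints without decimal points while the other arrays print with them. The reason is the element type (`dtype`): `zeros`, `ones` and `eye` default to floating point, whereas `full` takes its type from the fill value. A small sketch, with illustrative variable names:

```python
import numpy as np

z = np.zeros((2, 2))
c_int = np.full((2, 2), 7)                  # fill value is a Python int
c_float = np.full((2, 2), 7.0)              # fill value is a Python float
c_forced = np.full((2, 2), 7, dtype=float)  # dtype requested explicitly

print(z.dtype)          # float64
print(c_int.dtype)      # an integer dtype (int64 on most platforms)
print(c_float.dtype)    # float64
print(c_forced.dtype)   # float64
```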
  • Basic usage Array
    • row selection
    • conditional selection
    • casting
## array index
import numpy as np

# Create the following rank 2 array with shape (3, 4)
# [[ 1  2  3  4]
#  [ 5  6  7  8]
#  [ 9 10 11 12]]
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

a[1:3, 2:3]
array([[ 7],
       [11]])
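
The result above has shape (2, 1) because both indices are slices. Replacing a slice with a single integer drops that dimension. As a further hedged illustration, slices are views of the original array, so writing to one changes the original; the value 70 is arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

col_2d = a[1:3, 2:3]    # slice, slice   -> 2-D result
col_1d = a[1:3, 2]      # slice, integer -> 1-D result
print(col_2d.shape, col_1d.shape)   # (2, 1) (2,)

# A slice is a view: modifying it modifies the original array.
col_1d[0] = 70
print(a[1, 2])          # 70
```

Use `a[1:3, 2].copy()` when an independent array is needed.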
## conditional select
a[a>8]
array([ 9, 10, 11, 12])
## casting
x = np.array([1, 2], dtype=np.int64)
print(x)
[1 2]
np.array(x, dtype=np.float32)
array([1., 2.], dtype=float32)
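
Besides passing `dtype` to `np.array`, an existing array can be converted with its `astype` method. The sketch below, using made-up values, also shows two details worth knowing: converting floats to integers truncates toward zero, and true division of an integer array produces floats.

```python
import numpy as np

x = np.array([1.7, -2.3])
truncated = x.astype(np.int64)       # fractional parts are dropped
print(truncated.tolist())            # [1, -2]

ints = np.array([1, 2], dtype=np.int64)
halves = ints / 2                    # true division yields floats
print(halves.dtype)                  # float64
print(ints.astype(np.float32).dtype) # float32
```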
  • Math operator
    • vector-based
    • matrix-based
## Math operators

## note: arithmetic operators on a Python list are not elementwise; convert the list to a NumPy array first

## add scalar
a = np.array([1, 2, 3, 4])
a + 1
array([2, 3, 4, 5])
## two arrays
b = np.ones(4) + 1
a - b
array([-1.,  0.,  1.,  2.])
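
The note above about not doing math on a list can be made concrete. On a Python list, `+` concatenates and `*` repeats, and adding a number fails; on a NumPy array the same operators act element by element. A minimal sketch:

```python
import numpy as np

lst = [1, 2, 3, 4]
print(lst + lst)    # concatenation: [1, 2, 3, 4, 1, 2, 3, 4]
print(lst * 2)      # repetition:    [1, 2, 3, 4, 1, 2, 3, 4]

arr = np.array(lst)
print(arr + arr)    # elementwise:   [2 4 6 8]
print(arr * 2)      # elementwise:   [2 4 6 8]

# Adding a scalar to a list is an error.
try:
    lst + 1
except TypeError as err:
    print('TypeError:', err)
```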
## elementwise operators on matrices
c = np.ones((3, 3))
c+c
array([[2., 2., 2.],
       [2., 2., 2.],
       [2., 2., 2.]])
c*c
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
## matrix product
np.dot(c, c) # or c.dot(c)
array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])
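
With a matrix of all ones it is easy to miss the difference between `*` and the matrix product. The sketch below uses a non-trivial matrix (values chosen for illustration) and the `@` operator, which is equivalent to `np.dot` for 2-D arrays.

```python
import numpy as np

A = np.array([[1., 2.], [3., 4.]])
I = np.eye(2)

print(A * I)     # elementwise product keeps only the diagonal of A
print(A @ I)     # matrix product with the identity returns A
print(np.array_equal(A @ I, np.dot(A, I)))   # True

# Matrix-vector product: each entry is a row of A dotted with v.
v = np.array([1., 1.])
print(A @ v)     # [3. 7.]
```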
## some basic functions
x = np.array([1, 2, 3, 4])
x.sum()
10
np.sort(x)
array([1, 2, 3, 4])
x.min()
1
x.argmin()
0
## sums and minima along an axis
x = np.array([[1, 1], [2, 2]])
x.sum()
6
x.sum(axis=0)
array([3, 3])
x.sum(axis=1)
array([2, 4])
x.min()
1
x.min(axis=0)
array([1, 1])
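
The `axis` argument names the dimension that is collapsed: `axis=0` combines the rows and leaves one value per column, and `axis=1` combines the columns and leaves one value per row. A short sketch on the same array, with `keepdims` shown as an optional way to retain the 2-D shape:

```python
import numpy as np

x = np.array([[1, 1], [2, 2]])

print(x.sum(axis=0))     # one sum per column: [3 3]
print(x.sum(axis=1))     # one sum per row:    [2 4]
print(x.min(axis=1))     # one minimum per row: [1 2]
print(x.mean(axis=0))    # column means: [1.5 1.5]

# keepdims=True keeps the collapsed axis with length 1.
row_sums = x.sum(axis=1, keepdims=True)
print(row_sums.shape)    # (2, 1)
```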
from IPython.display import YouTubeVideo
YouTubeVideo('QUT1VHiLmmI')

InClass Exercise

  • Example 1

Compute the sum of $1 + 2 + 3 + \cdots + 100$

  • Example 2

Compute the sum of $1/1 + 1/2 + 1/3 + \cdots + 1/100$

  • Example 3

Given a list a = [1, 3, 5, 7, 4], replace the maximum value with the minimum value.

  • Example 4

The ReLU function is a widely used activation function $f: \mathbb{R}^d \to \mathbb{R}^d$ in machine learning and deep learning. It is applied elementwise and is defined as $f(x) = \max(0, x)$. Implement a Python function relu() that returns the ReLU of a given vector.