# STAT1013: A/B Testing with Cookie Cats

This notebook is adapted from *A/B Testing with the Cookie Cats Game*.

Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It’s a classic “connect three” style puzzle game where the player must connect tiles of the same color in order to clear the board and win the level. It also features singing cats.

from IPython.display import YouTubeVideo
YouTubeVideo('pIMzD9ayPiE')


As players progress through the levels of the game, they will occasionally encounter gates that force them to wait a non-trivial amount of time or make an in-app purchase to progress. In addition to driving in-app purchases, these gates serve the important purpose of giving players an enforced break, ideally increasing and prolonging the player's enjoyment of the game.

But where should the gates be placed? Initially, the first gate was placed at level 30. Tactile Entertainment is planning to move Cookie Cats' first time gate from level 30 to level 40, but they don't know how much this decision would affect user retention.

## 🧪 Hypothesis

A decision like this can impact not only user retention but also expected revenue, so we set up the initial hypotheses as:

• $H_0:$ Moving the Time Gate from Level 30 to Level 40 will not increase our user retention.
• $H_1:$ Moving the Time Gate from Level 30 to Level 40 will increase our user retention.

Alternatively, denote $X$ as the user retention, $\mu_A$ as user retention based on Time Gate on Level 30, and $\mu_B$ as user retention based on Time Gate on Level 40.

• $H_0: \mu_A = \mu_B$
• $H_1: \mu_A < \mu_B$

## 🗃️ Data

The Cookie Cats dataset is publicly available on GitHub as a structured CSV file.

# Importing pandas
import pandas as pd

# Reading in the data
df = pd.read_csv('https://raw.githubusercontent.com/ryanschaub/Mobile-Games-A-B-Testing-with-Cookie-Cats/master/cookie_cats.csv')

# Showing the first few rows
df.head()

|   | userid | version | sum_gamerounds | retention_1 | retention_7 |
|---|--------|---------|----------------|-------------|-------------|
| 0 | 116    | gate_30 | 3              | False       | False       |
| 1 | 337    | gate_30 | 38             | True        | False       |
| 2 | 377    | gate_40 | 165            | True        | False       |
| 3 | 483    | gate_40 | 1              | False       | False       |
| 4 | 488    | gate_40 | 179            | True        | True        |

Data Description. This dataset contains 90,189 records of players who started the game while the telemetry system was running. The variables collected are:

• userid - a unique number that identifies each player.
• version - whether the player was put in the control group (gate_30 - a gate at level 30) or the test group (gate_40 - a gate at level 40).
• sum_gamerounds - the number of game rounds played by the player during the first week after installation.
• retention_1 - did the player come back and play 1 day after installing?
• retention_7 - did the player come back and play 7 days after installing?

When a player installed the game, he or she was randomly assigned to either gate_30 or gate_40.

Note: An important fact to keep in mind is that in the game industry retention_1 is a crucial metric, since it indicates whether the game generated a first engagement after the player's first log-in.
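To make the schema concrete, here is a quick sanity check of the randomized split and of missing values, sketched on a few toy rows copied from the `df.head()` output above (not the full data):

```python
import pandas as pd

# Toy rows mimicking the cookie_cats.csv schema (values from df.head() above)
toy = pd.DataFrame({
    'userid': [116, 337, 377, 483, 488],
    'version': ['gate_30', 'gate_30', 'gate_40', 'gate_40', 'gate_40'],
    'sum_gamerounds': [3, 38, 165, 1, 179],
    'retention_1': [False, True, True, False, True],
    'retention_7': [False, False, False, False, True],
})

counts = toy['version'].value_counts()  # group sizes should be roughly balanced
missing = toy.isna().sum()              # expect zero missing values per column
print(counts)
print(missing)
```

On the real `df`, the same two lines confirm 44,700 gate_30 players, 45,489 gate_40 players, and no missing values.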

## Checking missing data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90189 entries, 0 to 90188
Data columns (total 5 columns):
#   Column          Non-Null Count  Dtype
---  ------          --------------  -----
0   userid          90189 non-null  int64
1   version         90189 non-null  object
2   sum_gamerounds  90189 non-null  int64
3   retention_1     90189 non-null  bool
4   retention_7     90189 non-null  bool
dtypes: bool(2), int64(2), object(1)
memory usage: 2.2+ MB

## Overview of data
df.describe()

|       | userid       | sum_gamerounds |
|-------|--------------|----------------|
| count | 9.018900e+04 | 90189.000000   |
| mean  | 4.998412e+06 | 51.872457      |
| std   | 2.883286e+06 | 195.050858     |
| min   | 1.160000e+02 | 0.000000       |
| 25%   | 2.512230e+06 | 5.000000       |
| 50%   | 4.995815e+06 | 16.000000      |
| 75%   | 7.496452e+06 | 51.000000      |
| max   | 9.999861e+06 | 49854.000000   |

## Summary statistics

df.groupby(["version"]).sum_gamerounds.agg(["count", "median", "mean", "std", "max"])

| version | count | median | mean      | std        | max   |
|---------|-------|--------|-----------|------------|-------|
| gate_30 | 44700 | 17.0   | 52.456264 | 256.716423 | 49854 |
| gate_40 | 45489 | 16.0   | 51.298776 | 103.294416 | 2640  |

df.groupby(["version"]).retention_1.agg(["count", "median", "mean", "std", "max"])

| version | count | median | mean     | std      | max  |
|---------|-------|--------|----------|----------|------|
| gate_30 | 44700 | 0.0    | 0.448188 | 0.497314 | True |
| gate_40 | 45489 | 0.0    | 0.442283 | 0.496663 | True |

df.groupby(["version"]).retention_7.agg(["count", "median", "mean", "std", "max"])

| version | count | median | mean     | std      | max  |
|---------|-------|--------|----------|----------|------|
| gate_30 | 44700 | 0.0    | 0.190201 | 0.392464 | True |
| gate_40 | 45489 | 0.0    | 0.182000 | 0.385849 | True |

## Joint effect

df.groupby(["version", "retention_1"]).sum_gamerounds.agg(["count", "median", "mean", "std", "max"])

| version | retention_1 | count | median | mean      | std        | max   |
|---------|-------------|-------|--------|-----------|------------|-------|
| gate_30 | False       | 24666 | 6.0    | 18.379591 | 319.423232 | 49854 |
| gate_30 | True        | 20034 | 48.0   | 94.411700 | 135.037697 | 2961  |
| gate_40 | False       | 25370 | 6.0    | 16.340402 | 35.925756  | 1241  |
| gate_40 | True        | 20119 | 49.0   | 95.381182 | 137.887256 | 2640  |

## 🔥 Visualization

The most accurate way to test changes is to perform A/B testing targeting a specific variable, in this case retention (1 and 7 days after installation).

import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [20, 9]
sns.set_theme()


### sum_gamerounds VS version

• continuous VS categorical features
• violinplot or stripplot or kdeplot
## violinplot
sns.violinplot(df, x='sum_gamerounds', y="version")

<Axes: xlabel='sum_gamerounds', ylabel='version'>


## stripplot
sns.stripplot(df, x='sum_gamerounds', y="version", hue="version", alpha=0.4)

<Axes: xlabel='sum_gamerounds', ylabel='version'>


## kdeplot
sns.kdeplot(data=df, x="sum_gamerounds", hue="version", multiple="stack", alpha=.5,)

<Axes: xlabel='sum_gamerounds', ylabel='Density'>


Note: Some outliers or extreme points hurt the visualization; we can focus on sum_gamerounds < 3000.

## violinplot (refined)
sns.violinplot(df[df['sum_gamerounds']<3000], x='sum_gamerounds', y="version", hue="version", alpha=0.4)

<Axes: xlabel='sum_gamerounds', ylabel='version'>


## stripplot (refined)
sns.stripplot(df[df['sum_gamerounds']<3000], x='sum_gamerounds', y="version", hue="version", alpha=0.4)

<Axes: xlabel='sum_gamerounds', ylabel='version'>


sns.kdeplot(data=df[df['sum_gamerounds']<3000], x="sum_gamerounds", hue="version", multiple="stack")

<Axes: xlabel='sum_gamerounds', ylabel='Density'>


### version VS retention_1/retention_7

• bool VS bool
• histplot
sns.histplot(df,x="retention_1", hue="version", stat='probability', multiple="dodge")


<Axes: xlabel='retention_1', ylabel='Probability'>


sns.histplot(df,x="retention_7", hue="version", stat='probability', multiple="dodge")


<Axes: xlabel='retention_7', ylabel='Probability'>


Conclusion: It is very difficult to tell which one is better.

## 📊 A/B Test

Assumptions:

• Sample size: both groups are large (44,700 and 45,489 players), so a t-test is well justified

Steps:

• Split & Define Control Group & Test Group
• Independent populations OR paired populations
• Are the variances known?
• Are the variances equal?
• What’s the alternative?
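The cells below apply Welch's two-sample t-test (independent samples, unknown and unequal variances). For groups $A$ (gate_30) and $B$ (gate_40) with sample means $\bar{X}_A, \bar{X}_B$, sample variances $s_A^2, s_B^2$, and sizes $n_A, n_B$, the statistic computed by scipy's ttest_ind with equal_var=False is

$$t = \frac{\bar{X}_A - \bar{X}_B}{\sqrt{s_A^2/n_A + s_B^2/n_B}},$$

with degrees of freedom from the Welch–Satterthwaite approximation. With alternative="less", the reported p-value is $P(T \le t)$, i.e., evidence for $\mu_A < \mu_B$.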
# Split & Define Control Group & Test Group
data_g30 = df[df['version'] == 'gate_30']
data_g40 = df[df['version'] == 'gate_40']

## Independent populations + unknown and unequal variances
metric = 'retention_1'
from scipy.stats import ttest_ind

t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
                       alternative="less", equal_var=False)

print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))

A/B test result on retention_1: test-stat: 1.784; p-value: 0.9628

## result for retention_7
metric = 'retention_7'
from scipy.stats import ttest_ind

t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
                       alternative="less", equal_var=False)

print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))

A/B test result on retention_7: test-stat: 3.164; p-value: 0.9992

## result for sum_gamerounds
metric = 'sum_gamerounds'
from scipy.stats import ttest_ind

t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
                       alternative="less", equal_var=False)

print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))

A/B test result on sum_gamerounds: test-stat: 0.885; p-value: 0.8120
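Because retention_1 is binary, the Welch t-test for that metric can be cross-checked with a two-proportion z-test. Using the group sizes and day-1 retention counts from the summary tables above (44,700 / 20,034 for gate_30 and 45,489 / 20,119 for gate_40), the z statistic essentially reproduces the t statistic:

```python
from math import sqrt
from scipy.stats import norm

# Day-1 retention counts taken from the summary tables above
n_30, x_30 = 44700, 20034   # gate_30: players, retained after 1 day
n_40, x_40 = 45489, 20119   # gate_40: players, retained after 1 day

p_30, p_40 = x_30 / n_30, x_40 / n_40
p_pool = (x_30 + x_40) / (n_30 + n_40)                    # pooled retention rate
se = sqrt(p_pool * (1 - p_pool) * (1 / n_30 + 1 / n_40))  # pooled standard error

z = (p_30 - p_40) / se          # same orientation as ttest_ind(a=gate_30, b=gate_40)
p_value = norm.cdf(z)           # one-sided p-value for H1: p_30 < p_40

print('two-proportion z-test: z = %.3f; p-value = %.4f' % (z, p_value))
```

The z statistic (≈1.78) and p-value (≈0.96) agree closely with the retention_1 t-test above, again giving no evidence that the gate_40 group retains better.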


## 🔃 Conclusion

There is not enough evidence, at the significance level α = 0.05, to support the alternative hypothesis that moving the Time Gate from level 30 to level 40 increases retention_1, retention_7, or sum_gamerounds.

### InClass Practice (10 min)

What if we change the hypotheses as:

• $H_0:$ Moving the Time Gate from Level 40 to Level 30 will not increase our user retention.
• $H_1:$ Moving the Time Gate from Level 40 to Level 30 will increase our user retention.

Alternatively, denote $X$ as the user retention, $\mu_A$ as user retention based on Time Gate on Level 40, and $\mu_B$ as user retention based on Time Gate on Level 30.

• $H_0: \mu_A = \mu_B$
• $H_1: \mu_A < \mu_B$
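With the same a=gate_30, b=gate_40 ordering as above, these flipped hypotheses correspond to alternative="greater" in ttest_ind. A minimal sketch on synthetic boolean samples (drawn at roughly the observed retention rates, not the real data); since the t statistic is continuous, the two one-sided p-values sum to 1:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups (not the real data):
# day-1 retention of roughly 44.8% (gate_30) vs 44.2% (gate_40)
g30 = 1.0 * (rng.random(5000) < 0.448)
g40 = 1.0 * (rng.random(5000) < 0.442)

_, p_less = ttest_ind(g30, g40, alternative='less', equal_var=False)
_, p_greater = ttest_ind(g30, g40, alternative='greater', equal_var=False)

# Mirror-image one-sided tests: p_less + p_greater == 1
print('p(less) = %.4f, p(greater) = %.4f' % (p_less, p_greater))
```

On the real data, flipping the alternative for retention_7 gives p ≈ 1 − 0.9992 = 0.0008, i.e., strong evidence that 7-day retention is higher at gate_30.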

## 🏠 Take-home message

### Intuition can make mistakes

So, why is retention higher when the gate is positioned earlier? Normally, we would expect the opposite: the later the obstacle, the longer people stay engaged with the game. But this is not what the data tells us; one explanation is the theory of hedonic adaptation.

For future strategies, the game designers can consider that pushing players to take a break when they reach a gate postpones the fun of the game. When the gate is moved to level 40, players are more likely to quit before reaching it simply because they have grown bored of the game.

### What could the stakeholders do to take action?

(After the InClass Practice) We now have enough statistical evidence to say that 7-day retention is higher when the gate is at level 30 than when it is at level 40, matching the conclusion for 1-day retention. If we want to keep user retention high, we should not move the gate from level 30 to level 40; in other words, we keep the control version with the gate at level 30.