STAT1013: A/B Testing with Cookie Cats
This notebook is adapted from A/B Testing with Cookie Cats Game
Cookie Cats is a hugely popular mobile puzzle game developed by Tactile Entertainment. It’s a classic “connect three” style puzzle game where the player must connect tiles of the same color in order to clear the board and win the level. It also features singing cats.
from IPython.display import YouTubeVideo
YouTubeVideo('pIMzD9ayPiE')
As players progress through the levels of the game, they will occasionally encounter gates that force them to wait a non-trivial amount of time or make an in-app purchase to progress. In addition to driving in-app purchases, these gates serve the important purpose of giving players an enforced break from playing the game, hopefully increasing and prolonging the player's enjoyment of the game.
But where should the gates be placed? Initially, the first gate was placed at level 30. Tactile Entertainment is planning to move Cookie Cats' first time gate from level 30 to level 40, but they do not know how much this decision will impact user retention.
🧪 Hypothesis
A decision like this can impact not only user retention but also expected revenue, so we set up the initial hypotheses as:
- $H_0:$ Moving the Time Gate from Level 30 to Level 40 will not increase our user retention.
- $H_1:$ Moving the Time Gate from Level 30 to Level 40 will increase our user retention.
Alternatively, denote $X$ as the user retention, $\mu_A$ as the mean user retention with the time gate at level 30, and $\mu_B$ as the mean user retention with the time gate at level 40.
- $H_0: \mu_A = \mu_B$
- $H_1: \mu_A < \mu_B$
🗃️ Data
The Cookie Cats dataset is publicly available on GitHub as a structured CSV file.
# Importing pandas
import pandas as pd
# Reading in the data
df = pd.read_csv('https://raw.githubusercontent.com/ryanschaub/Mobile-Games-A-B-Testing-with-Cookie-Cats/master/cookie_cats.csv')
# Showing the first few rows
df.head()
|   | userid | version | sum_gamerounds | retention_1 | retention_7 |
|---|--------|---------|----------------|-------------|-------------|
| 0 | 116 | gate_30 | 3 | False | False |
| 1 | 337 | gate_30 | 38 | True | False |
| 2 | 377 | gate_40 | 165 | True | False |
| 3 | 483 | gate_40 | 1 | False | False |
| 4 | 488 | gate_40 | 179 | True | True |
Data Description. This dataset contains 90,189 records of players who started the game while the telemetry system was running. The variables are:
- `userid` - a unique number that identifies each player.
- `version` - whether the player was put in the control group (`gate_30` - a gate at level 30) or the test group (`gate_40` - a gate at level 40).
- `sum_gamerounds` - the number of game rounds played by the player during the first week after installation.
- `retention_1` - did the player come back and play 1 day after installing?
- `retention_7` - did the player come back and play 7 days after installing?
When a player installed the game, he or she was randomly assigned to either `gate_30` or `gate_40`.
Note: An important fact to keep in mind is that in the game industry, `retention_1` is a crucial metric, since it captures whether the game generates a first engagement after the player's first log-in.
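Since assignment is randomized, two quick sanity checks are worth running before any analysis (a minimal sketch, assuming `df` is loaded as above):

# Each player should appear exactly once in the data.
print(df['userid'].is_unique)
# The two groups should be roughly the same size under random assignment.
print(df['version'].value_counts())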
## Checking missing data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90189 entries, 0 to 90188
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userid 90189 non-null int64
1 version 90189 non-null object
2 sum_gamerounds 90189 non-null int64
3 retention_1 90189 non-null bool
4 retention_7 90189 non-null bool
dtypes: bool(2), int64(2), object(1)
memory usage: 2.2+ MB
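`df.info()` reports 90,189 non-null entries in every column, so there is no missing data. An explicit per-column count (a small check on the same `df`) confirms this:

# Count missing values per column; all zeros means no missing data.
print(df.isna().sum())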
## Overview of data
df.describe()
|       | userid | sum_gamerounds |
|-------|--------|----------------|
| count | 9.018900e+04 | 90189.000000 |
| mean | 4.998412e+06 | 51.872457 |
| std | 2.883286e+06 | 195.050858 |
| min | 1.160000e+02 | 0.000000 |
| 25% | 2.512230e+06 | 5.000000 |
| 50% | 4.995815e+06 | 16.000000 |
| 75% | 7.496452e+06 | 51.000000 |
| max | 9.999861e+06 | 49854.000000 |
## Summary statistics
df.groupby(["version"]).sum_gamerounds.agg(["count", "median", "mean", "std", "max"])
| version | count | median | mean | std | max |
|---------|-------|--------|------|-----|-----|
| gate_30 | 44700 | 17.0 | 52.456264 | 256.716423 | 49854 |
| gate_40 | 45489 | 16.0 | 51.298776 | 103.294416 | 2640 |
df.groupby(["version"]).retention_1.agg(["count", "median", "mean", "std", "max"])
| version | count | median | mean | std | max |
|---------|-------|--------|------|-----|-----|
| gate_30 | 44700 | 0.0 | 0.448188 | 0.497314 | True |
| gate_40 | 45489 | 0.0 | 0.442283 | 0.496663 | True |
df.groupby(["version"]).retention_7.agg(["count", "median", "mean", "std", "max"])
| version | count | median | mean | std | max |
|---------|-------|--------|------|-----|-----|
| gate_30 | 44700 | 0.0 | 0.190201 | 0.392464 | True |
| gate_40 | 45489 | 0.0 | 0.182000 | 0.385849 | True |
## Joint effect
df.groupby(["version", "retention_1"]).sum_gamerounds.agg(["count", "median", "mean", "std", "max"])
| version | retention_1 | count | median | mean | std | max |
|---------|-------------|-------|--------|------|-----|-----|
| gate_30 | False | 24666 | 6.0 | 18.379591 | 319.423232 | 49854 |
| gate_30 | True | 20034 | 48.0 | 94.411700 | 135.037697 | 2961 |
| gate_40 | False | 25370 | 6.0 | 16.340402 | 35.925756 | 1241 |
| gate_40 | True | 20119 | 49.0 | 95.381182 | 137.887256 | 2640 |
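The summary tables above can be distilled into per-group retention rates directly, since the mean of a boolean column is the share of True values. A quick computation on the same `df`:

# Mean of a boolean column is the retention rate for that group.
print(df.groupby('version')[['retention_1', 'retention_7']].mean())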
🔥 Visualization
The most reliable way to evaluate such a change is an A/B test targeting a specific variable, in this case retention (1 and 7 days after installation). Before testing, we visualize the data.
import seaborn as sns
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [20, 9]
sns.set_theme()
`sum_gamerounds` vs `version`: a continuous feature vs a categorical feature. Suitable plots: `violinplot`, `stripplot`, or `kdeplot`.
## violinplot
sns.violinplot(df, x='sum_gamerounds', y="version")
<Axes: xlabel='sum_gamerounds', ylabel='version'>
## stripplot
sns.stripplot(df, x='sum_gamerounds', y="version", hue="version", alpha=0.4)
<Axes: xlabel='sum_gamerounds', ylabel='version'>
## kdeplot
sns.kdeplot(data=df, x="sum_gamerounds", hue="version", multiple="stack", alpha=.5,)
<Axes: xlabel='sum_gamerounds', ylabel='Density'>
Note: A few outliers or extreme points distort the visualization, so we restrict attention to `sum_gamerounds` < 3000.
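Before filtering, it is worth checking how many rows the cutoff actually removes (a quick sketch; the threshold 3000 is the one chosen above):

# How many players exceed the cutoff? Given that the 75th percentile of
# sum_gamerounds is 51, this should be only a handful of extreme outliers.
n_outliers = (df['sum_gamerounds'] >= 3000).sum()
print(f"{n_outliers} of {len(df)} rows have sum_gamerounds >= 3000")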
## violinplot (refined)
sns.violinplot(df[df['sum_gamerounds']<3000], x='sum_gamerounds', y="version", hue="version", alpha=0.4)
<Axes: xlabel='sum_gamerounds', ylabel='version'>
## stripplot (refined)
sns.stripplot(df[df['sum_gamerounds']<3000], x='sum_gamerounds', y="version", hue="version", alpha=0.4)
<Axes: xlabel='sum_gamerounds', ylabel='version'>
## kdeplot (refined)
sns.kdeplot(data=df[df['sum_gamerounds']<3000], x="sum_gamerounds", hue="version", multiple="stack")
<Axes: xlabel='sum_gamerounds', ylabel='Density'>
`version` vs `retention_1`/`retention_7`: a boolean feature vs a boolean feature. Suitable plot: `histplot`.
sns.histplot(df,x="retention_1", hue="version", stat='probability', multiple="dodge")
<Axes: xlabel='retention_1', ylabel='Probability'>
sns.histplot(df,x="retention_7", hue="version", stat='probability', multiple="dodge")
<Axes: xlabel='retention_7', ylabel='Probability'>
Conclusion: From the plots alone, it is very difficult to tell which version is better.
📊 A/B Test
Assumptions:
- Sample size: both groups are large (about 45,000 players each), so large-sample approximations apply.
Steps:
- Split and define the control group and the test group
- Independent populations or paired populations?
- Are the variances known?
- Are the variances equal?
- What is the alternative hypothesis?
# Split & Define Control Group & Test Group
data_g30 = df[df['version'] == 'gate_30']
data_g40 = df[df['version'] == 'gate_40']
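Before testing, a quick check that the split is exhaustive (a small sanity sketch):

# The two groups together should account for every row in df.
assert len(data_g30) + len(data_g40) == len(df)
print(len(data_g30), len(data_g40))  # 44700 and 45489, matching the summary tables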
## Independent populations + unknown and unequal variances
metric = 'retention_1'
from scipy.stats import ttest_ind
t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
alternative="less", equal_var=False)
print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))
A/B test result on retention_1: test-stat: 1.784; p-value: 0.9628
## result for retention_7
metric = 'retention_7'
from scipy.stats import ttest_ind
t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
alternative="less", equal_var=False)
print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))
A/B test result on retention_7: test-stat: 3.164; p-value: 0.9992
## result for sum_gamerounds
metric = 'sum_gamerounds'
from scipy.stats import ttest_ind
t_value, p = ttest_ind(a=1.0*data_g30[metric], b=1.0*data_g40[metric],
alternative="less", equal_var=False)
print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' %(metric, t_value, p))
A/B test result on sum_gamerounds: test-stat: 0.885; p-value: 0.8120
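Since `retention_1` and `retention_7` are binary, a two-proportion z-test is a natural companion to the Welch t-tests above; with samples this large, the two approaches give nearly identical conclusions. A minimal sketch, assuming `statsmodels` is available (it is not used elsewhere in this notebook):

from statsmodels.stats.proportion import proportions_ztest

# Number of retained players and group sizes, ordered (gate_30, gate_40).
successes = [data_g30['retention_1'].sum(), data_g40['retention_1'].sum()]
nobs = [len(data_g30), len(data_g40)]
# alternative='smaller' matches H1: mu_A < mu_B.
z_value, p = proportions_ztest(successes, nobs, alternative='smaller')
print('Two-proportion z-test on retention_1: z: %.3f; p-value: %.4f' % (z_value, p))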
🔃 Conclusion
At the significance level $\alpha = 0.05$, there is not enough evidence to support the alternative hypothesis that moving the time gate from level 30 to level 40 will increase `retention_1`, `retention_7`, or `sum_gamerounds`.
InClass Practice (10 min)
What if we change the hypotheses as:
- $H_0:$ Moving the Time Gate from Level 40 to Level 30 will not increase our user retention.
- $H_1:$ Moving the Time Gate from Level 40 to Level 30 will increase our user retention.
Alternatively, denote $X$ as the user retention, $\mu_A$ as the mean user retention with the time gate at level 40, and $\mu_B$ as the mean user retention with the time gate at level 30.
- $H_0: \mu_A = \mu_B$
- $H_1: \mu_A < \mu_B$
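Hint (one possible sketch, reusing the split from above): with the groups relabeled so that $\mu_A$ now refers to `gate_40`, the same Welch t-test applies with the sample arguments swapped.

# Group A is now gate_40 and group B is gate_30, so H1: mu_A < mu_B
# is tested with a=gate_40, b=gate_30, alternative='less'.
metric = 'retention_1'
t_value, p = ttest_ind(a=1.0*data_g40[metric], b=1.0*data_g30[metric],
                       alternative="less", equal_var=False)
print('A/B test result on %s: test-stat: %.3f; p-value: %.4f' % (metric, t_value, p))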
🏠 Take-home message
Intuition can be mistaken
So, why is retention higher when the gate is positioned earlier? Normally, we might expect the opposite: the later the obstacle, the longer people stay engaged with the game. But this is not what the data tell us; this can be explained by the theory of hedonic adaptation.
For future strategies, the game designers can consider that pushing players to take a break when they reach a gate postpones the fun of the game; when the gate is moved to level 40, players are more likely to quit because they have simply grown bored of the game before reaching it.
What could the stakeholders do to take action?
(After InClass Practice) Now we have enough statistical evidence to say that 7-day retention is higher when the gate is at level 30 than when it is at level 40, matching our conclusion for 1-day retention. If we want to keep user retention high, we should not move the gate from level 30 to level 40; that is, we keep the control version with the gate at level 30.