Cohort Analysis with Seaborn

A guide to performing Cohort Analysis in Python

Why?


To build a product that delivers value to users (whether individuals or companies), it is important to understand how (and indeed if!) they use what you have built, and whether that changes over time and across different cohorts of users.

A cohort is simply a subset of users grouped by shared characteristics. Often the shared characteristic is a cohort_date, which indicates when a user was first onboarded or first began using your product. This date (or date-time) can be as granular as the hour the user first engaged with your website, or as coarse as the year they first subscribed to your product.
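As a minimal sketch of those different granularities (the frame and column names here are hypothetical, not from the Saucy Chef dataset), pandas can derive hourly, weekly, and yearly cohort labels from a signup timestamp:

```python
import pandas as pd

# Hypothetical signup timestamps for three users
signups = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_at": pd.to_datetime(
        ["2021-07-26 09:15", "2021-07-26 14:40", "2021-08-02 08:05"]
    ),
})

# Most granular: truncate the timestamp to the hour
signups["cohort_hour"] = signups["signup_at"].dt.floor("h")

# Weekly: the ISO calendar week (used later in this article)
signups["cohort_week"] = signups["signup_at"].dt.isocalendar().week

# Coarsest: the calendar year
signups["cohort_year"] = signups["signup_at"].dt.year
```

The first two users land in ISO week 30, the third in week 31, so the choice of granularity directly determines how many cohorts you end up comparing.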

Comparing the behaviour of different cohorts over time helps you understand your user base, the value of your product, and the effectiveness of your growth strategies. Questions you can answer with such an analysis include:

  1. Does the user retention rate (retained users are those who still engage with or use your product and haven’t churned) change over time, and does it depend on when a user first began using your product?
  2. Are there differences in absolute activity rates and trends in activity between those onboarded to your product at different times?
  3. What proportion of users complete an action (for example filling in a feedback form) within x time periods (e.g. weeks) after some prompt (in this example this could be an email to a user asking them for feedback)?

In order to explore cohort analysis, this article details the steps in Python required to create a custom Seaborn heatmap for a fictional recipe app: Saucy Chef 😉

What?

Saucy Chef is a fictional (sadly 😩) app where users can upload recipes and share them with their network. A user is considered onboarded to the app once they create an account with their email address. The developers of the app want to measure how engaged onboarded users are with the app to determine the value of their product. They define weekly activity as a KPI:

A Weekly Active User (WAU) is an onboarded user who uploads at least one recipe to the Saucy Chef app during a week (an ISO Calendar Week)

They are interested in what proportion of onboarded users end up as Weekly Active Users for each week subsequent to their onboarding, but also in whether that proportion differs across onboarding cohorts (defined by the calendar week a user was onboarded), since they have employed different marketing strategies over time to attract sign-ups. Enter Cohort Analysis 🤗
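The WAU definition can be sketched in a few lines (the events frame here is hypothetical): counting unique users per ISO week ensures a user who uploads several recipes in one week is only counted once.

```python
import pandas as pd

# Hypothetical recipe-upload events: user 1 uploads twice in one week
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "recipe_submission_day": pd.to_datetime(
        ["2021-07-26", "2021-07-27", "2021-08-03"]
    ),
})

# A user is a WAU for a week if they uploaded >= 1 recipe in that ISO week
events["week"] = events["recipe_submission_day"].dt.isocalendar().week
wau_per_week = events.groupby("week")["user_id"].nunique()
```

Here each week has exactly one WAU: user 1's two uploads in week 30 collapse to a single active user.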

How?


First, the snippet below reads in the fake dataset and imports the required Python libraries:

import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from colour import Color
from matplotlib.colors import LinearSegmentedColormap
plt.style.use('ggplot')  # you can use whichever style you prefer

df = pd.read_csv('saucy_chef_data.csv')

The data has columns user_id (a unique identifier for an onboarded user), cohort_week (the calendar week during which a user was onboarded) and recipe_submission_day, a string indicating a date (e.g. 2021-07-26) on which that user submitted a recipe. The count of rows for a given user_id then gives the total number of recipes submitted by that user.
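Since the CSV itself is not provided with the article, a minimal stand-in frame with the same three columns (the values here are invented) makes the rest of the snippets reproducible and illustrates the row-count-per-user point:

```python
import pandas as pd

# Hypothetical stand-in for saucy_chef_data.csv: same three columns
df = pd.DataFrame({
    "user_id": [101, 101, 102, 103],
    "cohort_week": [30, 30, 30, 31],
    "recipe_submission_day": ["2021-07-26", "2021-08-02",
                              "2021-07-27", "2021-08-03"],
})

# Total recipes per user = number of rows per user_id
recipes_per_user = df.groupby("user_id").size()
```

User 101 has two rows, so they have submitted two recipes; users 102 and 103 have one each.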

In order to define a recipe_submission_week in ISO calendar format, the snippet below first converts the column to date-time format and then extracts the calendar week.

df['recipe_submission_day'] = pd.to_datetime(df['recipe_submission_day'], format='%Y-%m-%d')
df['recipe_submission_week'] = df['recipe_submission_day'].dt.isocalendar().week

In order to determine the number of WAUs per calendar week from each cohort, df_cohort is the dataset grouped by cohort week and recipe submission week, where the values are counts of unique users (unique, since a user may upload more than one recipe in a week). Another column, weeks_since_onboarding, is created to indicate how many weeks after the cohort’s onboarding the recipe submission week falls.

df_cohort = df.groupby(['cohort_week', 'recipe_submission_week']) \
    .agg(n_users=('user_id', 'nunique')) \
    .reset_index(drop=False)
df_cohort['weeks_since_onboarding'] = (df_cohort.recipe_submission_week - df_cohort.cohort_week)

In order to get the activity matrix (the proportion of WAUs among all onboarded users [matrix entries], per cohort [rows], per week since onboarding [columns]), the code below pivots the grouped DataFrame and divides each row by the cohort size:

cohort_pivot = df_cohort.pivot_table(index='cohort_week',
                                     columns='weeks_since_onboarding',
                                     values='n_users')
cohort_size = cohort_pivot.iloc[:, 0]
activity_matrix = cohort_pivot.divide(cohort_size, axis=0)
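Note that taking the first column as the cohort size assumes every onboarded user uploads at least one recipe during their onboarding week (week 0). A toy example of the row-wise division, with invented counts, shows the shape of the result:

```python
import pandas as pd

# Toy pivot: rows = cohorts, columns = weeks since onboarding,
# values = number of weekly active users
cohort_pivot = pd.DataFrame(
    {0: [10, 20], 1: [5, 8], 2: [2, 4]},
    index=pd.Index([30, 31], name="cohort_week"),
)

cohort_size = cohort_pivot.iloc[:, 0]                 # 10 and 20 users
activity_matrix = cohort_pivot.divide(cohort_size, axis=0)
```

Each row of activity_matrix now starts at 1.0 (100% active in week 0) and shows the fraction of the cohort still active in each later week, e.g. 5/10 = 50% of the week-30 cohort is active one week after onboarding.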

Finally, the code snippet below plots the activity matrix heatmap together with a heatmap of the cohort sizes (which acts as the legend on the left of the figure).

with sns.axes_style("white"):
    fig, ax = plt.subplots(1, 2, figsize=(12, 8), sharey=True,
                           gridspec_kw={'width_ratios': [1, 11]})

    # activity matrix
    sns.heatmap(activity_matrix,
                mask=activity_matrix.isnull(),
                annot=True,
                fmt='.0%',
                ax=ax[1])
    ax[1].set_title('Saucy Chefs: Weekly Active Users', fontsize=16)
    ax[1].set(xlabel='# of weeks since onboarding', ylabel='')

    # cohort size
    cohort_size_df = pd.DataFrame(cohort_size).rename(columns={0: 'cohort_size'})
    sns.heatmap(cohort_size_df,
                annot=True,
                cbar=False,
                fmt='g',
                ax=ax[0])
    fig.tight_layout()

There are many different default colour palettes you can explore but it is also possible to create your own custom colour map with the colours you like or those of your brand and pass it to the heatmap function with cmap=pretty_color_map 🌈

from typing import Any, List

import numpy as np


def custom_color_map(map_colors: List[str]) -> Any:
    """Creates and displays a custom color map."""
    color_map = LinearSegmentedColormap.from_list(
        "my_list", [Color(col).rgb for col in map_colors]
    )
    # preview the gradient as a horizontal strip
    plt.figure(figsize=(15, 3))
    plt.imshow(
        [list(np.arange(0, len(map_colors), 0.1))],
        interpolation="nearest",
        origin="lower",
        cmap=color_map,
    )
    plt.xticks([])
    plt.yticks([])
    return color_map


pretty_color_map = custom_color_map(['#E4BDF0', '#685acd'])
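As an aside, the third-party colour package is optional here: matplotlib accepts hex strings directly, so an equivalent two-colour gradient can be built without it (the name "saucy_chef" below is just an illustrative label):

```python
from matplotlib.colors import LinearSegmentedColormap

# matplotlib parses hex strings itself, so no Color(...) conversion is needed
pretty_color_map = LinearSegmentedColormap.from_list(
    "saucy_chef", ["#E4BDF0", "#685acd"]
)
```

The result can then be passed straight to the plot, e.g. sns.heatmap(activity_matrix, cmap=pretty_color_map).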

So?

What insights might the developers of Saucy Chef glean from this Cohort Analysis?

  • Generally cohort size is increasing over time. Good news for Saucy Chef as it means that the user base has been growing 🚀
  • Activity appears to typically decline over time: the longer a user has been using the app, the less likely they are to upload a recipe on a weekly basis.
  • The proportions of users onboarded in calendar weeks 31 and 38 who remain active several weeks after sign-up appear quite low in comparison with other cohorts. For cohort week 31 the small sample size probably prohibits any definitive conclusions, but for week 38 an investigation of the marketing strategy at that time might reveal why these users were less active over time.
  • Many other things that could be interesting for future analysis! 🤓

Hopefully it is now clear why performing a Cohort Analysis is a good idea and how to visualise the data with Seaborn. The next step is explaining the visualisation to stakeholders 😝 For this, please look out for my next article on Explaining Data.

Gráinne Mcknight, Data Scientist in Berlin 😍