
The Paradox of Completion: Predicting Churn in an Audiobook App

  • smileytr
  • Apr 4
  • 2 min read

Updated: Apr 4

In the world of digital subscriptions, churn is the silent killer. For audiobook platforms especially, retention is everything — acquiring a user only to lose them months later can be a costly, recurring problem. When I set out to build a machine learning model to predict which users were likely to leave an audiobook app, I expected the usual suspects: lack of engagement, no purchases, minimal interaction. But what I found turned that assumption on its head. In a dataset of 14,000+ users, it wasn’t the inactive or disengaged users who left — it was the ones who finished the books. The more a user completed, the more likely they were to churn. This “one-and-done” behavior became the foundation of a broader insight: completion isn’t always commitment — sometimes, it’s closure.



The journey started with a thorough data cleanup, where I renamed columns for clarity and eliminated multicollinear features using a combination of correlation heatmaps and Variance Inflation Factor (VIF) analysis. With the dataset stripped to its most informative features, I explored the behaviors that correlated most strongly with user retention. Book completion percentage emerged as a striking variable. Not a single user who completed more than 0% of a book remained active — a pattern confirmed visually through a completion-status heatmap. Similarly, violin plots of listening time and reviews revealed that users with deep engagement often disappeared afterward. This led to a counterintuitive but compelling theory: some users come for one specific experience, complete it, and leave satisfied. They’re not disengaged — they’re fulfilled.
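The VIF-pruning step described above can be sketched as follows. This is a minimal illustration, not the actual cleanup script: the column names and the VIF threshold of 10 are assumptions, and the toy DataFrame stands in for the real audiobook features.

```python
# Sketch of iterative VIF-based feature pruning (illustrative; the
# threshold of 10 is a common rule of thumb, not the post's value).
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def vif_table(X: pd.DataFrame) -> pd.Series:
    """Return the Variance Inflation Factor for each column of X."""
    return pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )


def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Repeatedly drop the worst-offending column until all VIFs pass."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = vif_table(X)
        worst = vifs.idxmax()
        if vifs[worst] < threshold:
            break  # every remaining feature is acceptably independent
        X = X.drop(columns=worst)
    return X
```

Dropping one column at a time matters: removing the single worst offender can bring several other features back under the threshold, so recomputing VIFs after each drop avoids discarding more features than necessary.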

Figure showing the Truth Table of Completion & User Activity

From there, I tackled the modeling process. Given the class imbalance — most users in the dataset had churned — I employed SMOTE to synthetically oversample the active class, experimenting with different sampling ratios to strike the right balance between realism and recall.


I tested four models: Logistic Regression, Support Vector Classifier (SVC), Random Forest, and Histogram-Based Gradient Boosting. The Random Forest emerged as the best performer, especially after hyperparameter tuning. A final SMOTE ratio of 0.75 gave it the best balance, achieving 88% precision and an 82% F1-score on the active-user class.


Tuned Random Forest Model w/ SMOTE @ 75%
Baseline Logistic Regression

But the value of the project wasn’t just in performance metrics — it was in what the model taught me. Retention isn’t about how much a user consumes; it’s about how long they stay engaged after the initial payoff.


Figure shows that Active users had a significantly greater engagement longevity compared to Churned users

This means strategies like recommending new content after a user finishes a book, nudging them to explore more, or rewarding post-purchase activity could be far more effective than simply tracking how many minutes they listened. In the end, understanding churn meant redefining what engagement really looks like.
