The Paradox of Completion: Predicting Churn in an Audiobook App
- smileytr
- Apr 4
- 2 min read
Updated: Apr 4
In the world of digital subscriptions, churn is the silent killer. For audiobook platforms especially, retention is everything — acquiring a user only to lose them months later can be a costly, recurring problem. When I set out to build a machine learning model to predict which users were likely to leave an audiobook app, I expected the usual suspects: lack of engagement, no purchases, minimal interaction. But what I found turned that assumption on its head. In a dataset of 14,000+ users, it wasn’t the inactive or disengaged users who left — it was the ones who finished the books. The more a user completed, the more likely they were to churn. This “one-and-done” behavior became the foundation of a broader insight: completion isn’t always commitment — sometimes, it’s closure.
The journey started with a thorough data cleanup, where I renamed columns for clarity and eliminated multicollinear features using a combination of correlation heatmaps and Variance Inflation Factor (VIF) analysis. With the dataset stripped to its most informative features, I explored the behaviors that correlated most strongly with user retention. Book completion percentage emerged as a striking variable: in this dataset, not a single user with a non-zero completion percentage remained active — a pattern confirmed visually through a completion-status heatmap. Similarly, violin plots of listening time and reviews revealed that users with deep engagement often disappeared afterward. This led to a counterintuitive but compelling theory: some users come for one specific experience, complete it, and leave satisfied. They're not disengaged — they're fulfilled.
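The VIF step above can be sketched in a few lines: the VIF of each feature equals the corresponding diagonal entry of the inverse of the feature correlation matrix. The feature names and synthetic data below are purely illustrative assumptions, not the actual audiobook dataset:

```python
import numpy as np

def vif_scores(X):
    """Variance Inflation Factors via the inverse correlation matrix:
    VIF_j is the j-th diagonal entry of inv(corr(X))."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# synthetic stand-ins for user features (hypothetical names)
rng = np.random.default_rng(0)
minutes = rng.normal(300, 60, 500)
books_started = 0.02 * minutes + rng.normal(0, 0.3, 500)  # nearly collinear with minutes
reviews = rng.poisson(2, 500).astype(float)                # independent feature

X = np.column_stack([minutes, books_started, reviews])
vifs = vif_scores(X)
# features with VIF well above ~5-10 are candidates for removal
```

Here the two collinear columns come back with high VIFs while the independent one stays near 1, which is the signal used to decide which columns to drop.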

From there, I tackled the modeling process. Given the class imbalance — most users in the dataset had already churned — I employed SMOTE to synthetically oversample the active class, experimenting with different ratios to strike the right balance between realism and recall.
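In practice this is usually done with imbalanced-learn's `SMOTE`, but the core idea fits in a short from-scratch sketch: synthesize new minority-class points by interpolating between a real sample and one of its nearest minority-class neighbours. The data and parameters here are illustrative assumptions, not the post's pipeline:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Minimal SMOTE sketch: create n_new synthetic minority samples by
    interpolating between a random sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # exclude self-matches
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = neighbours[i, rng.integers(k)]
        lam = rng.random()                  # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# hypothetical minority-class (active-user) feature matrix
rng = np.random.default_rng(1)
X_active = rng.normal(0, 1, (40, 3))
X_synth = smote_oversample(X_active, n_new=60)  # oversample toward a chosen ratio
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class stays inside the region the real data occupies — the "realism" side of the trade-off mentioned above.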
I tested four models: Logistic Regression, Support Vector Classifier (SVC), Random Forest, and Histogram-Based Gradient Boosting. The Random Forest model emerged as the best performer, especially after hyperparameter tuning. A final SMOTE ratio of 0.75 gave it the best balance, achieving 88% precision and 82% F1-score for predicting active users.
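The comparison-and-tuning step can be sketched with scikit-learn's `GridSearchCV` wrapping a Random Forest. The synthetic dataset, grid values, and metrics below are illustrative assumptions standing in for the real user features and the post's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score
from sklearn.model_selection import GridSearchCV, train_test_split

# imbalanced synthetic data: label 1 = active (minority), 0 = churned
X, y = make_classification(n_samples=600, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# small hyperparameter grid, scored on F1 for the active class
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="f1", cv=3)
grid.fit(X_tr, y_tr)

pred = grid.predict(X_te)
prec = precision_score(y_te, pred)
f1 = f1_score(y_te, pred)
```

Swapping in `LogisticRegression`, `SVC`, or `HistGradientBoostingClassifier` as the estimator gives the same comparison loop for the other three models.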


But the value of the project wasn’t just in performance metrics — it was in what the model taught me. Retention isn’t about how much a user consumes; it’s about how long they stay engaged after the initial payoff.

This means strategies like recommending new content after a user finishes a book, nudging them to explore more, or rewarding post-purchase activity could be far more effective than simply tracking how many minutes they listened. In the end, understanding churn meant redefining what engagement really looks like.