Measuring Customer Usage Retention Rate (CURR) lift through A/B testing involves a disciplined approach to experimentation, focused on understanding user behavior and the long-term impact of product changes. While the term "CURR lift" may not be universally familiar, the principles of measuring retention lift are well established and critical for sustainable growth.
Core Principles for A/B Testing Retention
Successful A/B testing for retention begins with a clear understanding of the experiment's purpose and careful design:
Define a Clear Hypothesis and Goals: Experimentation is a discovery tool that helps teams avoid bias and gather quantitative data. A hypothesis should outline how a product change will alter user behavior or outcomes, informed by existing data or insights. For instance, a problem like "too many people open the app and don’t request a ride" can lead to the hypothesis that "passengers don’t request rides because they don’t see enough cars on the map." It's crucial to define what questions need answering and the simplest way to answer them confidently.
Design the Experiment and Measurement Strategy: Clearly identify the success metrics for retention lift as well as tradeoff metrics that might be negatively impacted. This includes defining the experiment's trigger, targeting, eligibility, assignment criteria, and response metrics. For retention, key metrics often include daily active users (DAU), D7 retention, and D14 retention.
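As a rough illustration, DN retention per experiment arm can be computed directly from event logs. The pandas sketch below assumes a simple activity table with hypothetical column names and defines "D7 retained" as being active on day 7 after signup; adjust the definition to whatever your team actually uses.

```python
# Minimal sketch: D7 / D14 retention per experiment arm with pandas.
# Hypothetical columns: user_id, variant, signup_date, activity_date.
import pandas as pd

def retention_by_variant(events: pd.DataFrame, day: int) -> pd.Series:
    """Share of users in each variant active exactly `day` days after signup."""
    events = events.copy()
    events["days_since_signup"] = (
        events["activity_date"] - events["signup_date"]
    ).dt.days

    # One flag per (variant, user): did they show up on day N?
    retained = (
        events.groupby(["variant", "user_id"])["days_since_signup"]
        .apply(lambda d: (d == day).any())
    )
    return retained.groupby(level="variant").mean()

# Toy data for illustration only
events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "variant": ["control", "control", "treatment", "treatment", "treatment"],
    "signup_date": pd.to_datetime(["2024-01-01"] * 5),
    "activity_date": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-01", "2024-01-15", "2024-01-01"]
    ),
})
print(retention_by_variant(events, day=7))   # D7 retention per arm
print(retention_by_variant(events, day=14))  # D14 retention per arm
```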
Ensure Statistical Significance and Power: Calculate the Minimum Detectable Effect (MDE), the smallest lift the test can reliably distinguish from chance at the chosen significance level and power; this determines the required sample size and, in turn, the experiment's runtime. Low statistical power can cause real effects to be missed. It's important to run tests long enough to achieve statistical significance and to avoid "peeking early" at results. For early-stage companies with lower traffic, reaching statistical significance can be challenging, sometimes making it more practical to launch changes and monitor pre/post data if the expected impact is substantial.
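A back-of-the-envelope way to translate an MDE into sample size (and therefore runtime) is the standard two-proportion power formula. The baseline retention and lift values in this sketch are purely illustrative.

```python
# Minimal sketch: required users per arm to detect a given absolute retention lift.
import math
from scipy.stats import norm

def sample_size_per_arm(p_baseline: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per arm to detect an absolute lift of `mde_abs` over `p_baseline`."""
    p_treatment = p_baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance threshold
    z_beta = norm.ppf(power)            # power requirement
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# e.g. baseline D7 retention of 20%, smallest lift worth detecting = 1 point absolute
n = sample_size_per_arm(0.20, 0.01)
print(f"~{n:,} users per arm")  # runtime ≈ n / eligible new users per arm per day
```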
Implement Guardrail Metrics: For every metric optimized, pair it with another that catches potential adverse consequences, preventing short-term gains at the expense of long-term customer satisfaction or brand value. These are business-critical metrics that should never significantly decline, such as app crashes or app store ratings. For example, a change in subscription button copy at Duolingo increased revenue but decreased new-user D1 retention, and the experiment was killed to prioritize long-term trust. Net Promoter Score (NPS) can also serve as a long-term guardrail.
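One lightweight way to operationalize a guardrail is a one-sided check that the treatment arm's guardrail metric has not dropped significantly. The sketch below uses a simple two-proportion z-test with illustrative counts; real guardrail policies are usually richer than a single p-value.

```python
# Minimal sketch: one-sided guardrail check on a retention metric (e.g. D1 retention).
from scipy.stats import norm

def guardrail_declined(retained_t: int, n_t: int,
                       retained_c: int, n_c: int,
                       alpha: float = 0.05) -> bool:
    """Two-proportion z-test: is treatment retention significantly below control?"""
    p_t, p_c = retained_t / n_t, retained_c / n_c
    p_pool = (retained_t + retained_c) / (n_t + n_c)
    se = (p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c)) ** 0.5
    z = (p_t - p_c) / se
    p_value = norm.cdf(z)        # probability of a drop this large under the null
    return p_value < alpha       # True => guardrail breached

# Illustrative counts: a revenue win that dents new-user D1 retention
if guardrail_declined(retained_t=4_100, n_t=10_000, retained_c=4_300, n_c=10_000):
    print("Guardrail breached: D1 retention down significantly; stop and investigate.")
```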
Strategies for Measuring Retention Lift
Retention is often considered one of the hardest metrics to move, and its effects can take time to manifest.
Focus on Early Proxies and Correlated Behaviors: Since true long-term retention takes time to observe, identify and experiment on early proxies that strongly correlate with future retention, often occurring early in the customer's journey. For example, at Thumbtack, optimizing for "the number of customer requests that had a certain number of quotes" became a key proxy for retention. Similarly, instead of directly measuring long-term retention for a single feature, focus on behaviors and "setup moments" that correlate with it; Facebook's "10 friends in 7 days" was a setup moment correlating with long-term retention.
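A simple starting point for finding such proxies is to rank candidate early behaviors by their correlation with a later retention flag. Correlation alone doesn't establish causation, so candidates still need experimental validation; the column names and values below are hypothetical.

```python
# Minimal sketch: ranking candidate early proxies by correlation with long-term retention.
# One row per user: week-1 behavior counts plus a retention flag observed much later.
import pandas as pd

users = pd.DataFrame({
    "requests_with_enough_quotes": [3, 0, 5, 1, 4, 0],
    "profile_completed":           [1, 0, 1, 0, 1, 1],
    "friends_added_week1":         [8, 1, 12, 2, 10, 0],
    "retained_d60":                [1, 0, 1, 0, 1, 0],
})

# Correlation between each early behavior and later retention.
proxy_strength = (
    users.drop(columns="retained_d60")
         .corrwith(users["retained_d60"])
         .sort_values(ascending=False)
)
# Strongest candidates are worth validating as experiment response metrics.
print(proxy_strength)
```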
Long-Term Measurement and Holdout Experiments: Many experiments that show short-term lift do not translate into long-term incremental lift. To understand long-term effects, implement holdouts, such as keeping a percentage of users from seeing a new change for an extended period, or running a 50/50 split for new users and tracking the original cohort for a year or more. At Shopify, a 5% general holdout and 50/50 splits for new merchants allowed for quick shipping while tracking long-term Gross Merchandise Volume (GMV) and merchant activation.
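A minimal sketch of how a long-term holdout might be assigned deterministically, so the same users stay held out for the entire tracking window (the 5% size and salt are illustrative, not a prescribed setup):

```python
# Minimal sketch: deterministic long-term holdout assignment.
import hashlib

HOLDOUT_PCT = 0.05
SALT = "2024-longterm-holdout"  # rotate the salt to start a fresh holdout cohort

def in_longterm_holdout(user_id: str) -> bool:
    """Stable assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{SALT}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # uniform value in [0, 1]
    return bucket < HOLDOUT_PCT

def experience_for(user_id: str) -> str:
    # Holdout users keep the old experience for the whole tracking window;
    # everyone else receives new features as they ship. Long-term metrics
    # (retention, GMV, activation) are then compared across the two groups.
    return "holdout" if in_longterm_holdout(user_id) else "current_product"

print(experience_for("user_12345"))
```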
Beware of the "Pull Forward Effect": Short-term wins can simply pull revenue or success forward in time without generating additional long-term value. An experiment at Udemy that increased day-of conversion by 7% on a success page only increased total session revenue by 2%, because users were less likely to keep browsing afterward.
Practical Examples of Driving and Measuring Retention Lift
Gamification and Engagement: At Duolingo, a leaderboard system adapted from a game led to a 17% increase in overall learning time and tripled the number of highly engaged learners. Badges also proved to be a significant win for retention.
Notifications and Email Systems: These are powerful tools for driving engagement. Pinterest built a system that logged user preferences for email delivery and ranked content by likelihood of engagement, dramatically improving engagement. Duolingo found that optimizing notifications significantly impacted D1 retention. However, aggressive A/B testing of notifications can lead users to opt out, destroying the channel.
Onboarding and Activation Flow Improvements: Improving the initial activation experience directly impacts long-term retention. At Airtable, changing the onboarding flow to encourage document uploads led to higher activation and monetization rates. Pinterest doubled its activation rate through dozens of experiments that simplified the experience and customized initial content recommendations.
Core Product Improvements: Retention is often driven by reducing friction in the core product. At Grubhub, improving restaurant variety, lowering order minimums, and burying guest ordering increased retention.
Behavioral Psychology and Messaging: Duolingo A/B tested emotional cues, such as the number of tears its mascot cried, as well as growth-mindset messaging, which showed exciting results in retention.
Identifying "Setup Moments": For Gojek, users who used the app for transportation during the morning rush were more likely to retain, indicating an "aha moment" tied to urgent, reliable service.
Tools and Frameworks
A robust experimentation framework is essential, whether built internally or using third-party tools. Tools like Optimizely, Amplitude, Eppo, and Statsig are commonly used. For in-app experiments, a system to bucket users and integrate experiment IDs with data analysis tools like Segment and Amplitude is beneficial. The TARS framework (Target Audience, Adoption, Retention, Satisfaction) can help evaluate features by focusing on proxy metrics rather than direct, isolated retention measurements.
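A minimal sketch of what per-experiment bucketing and experiment-tagged event payloads might look like; function names and the payload shape are illustrative, not any vendor's actual API.

```python
# Minimal sketch: per-experiment bucketing plus tagging analytics events with the
# experiment ID and variant so downstream tools can slice results by arm.
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic, per-experiment assignment so a user stays in one arm."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

def build_event(user_id: str, event: str, experiment_id: str) -> dict:
    """Build an event payload carrying experiment context for the analytics pipeline."""
    return {
        "user_id": user_id,
        "event": event,
        "properties": {
            "experiment_id": experiment_id,
            "variant": assign_variant(user_id, experiment_id),
        },
    }
    # In practice this payload would be handed to your analytics SDK.

print(build_event("user_42", "ride_requested", "map_car_density_v2"))
```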
Common Mistakes and Best Practices
Avoid Overbuilding and Over-testing: For discovery, experiments don't need to be perfect or infinitely scalable. Not every change requires an A/B test; if a feature is clearly better or required, it can be more efficient to launch and monitor.
Run Clean Experiments: Limit the difference between test and control to only the variation being tested, and avoid layering multiple experiments on top of each other.
Build a Culture of Experimentation: Make experiment results transparent across the company to foster learning. Ideally, any code change or feature introduction should be part of an experiment.
Focus on Directional Impact: Especially for early-stage companies, focus on clear, directional impact rather than false precision, as over-engineered attribution models tend to break down.
Avoid Data Theater: Be wary of collecting data and building dashboards that don't improve decision-making quality.
In conclusion, measuring retention lift through A/B testing requires a strategic approach that balances rigorous experimental design with an understanding of long-term user behavior. By defining clear hypotheses, focusing on early proxies, implementing guardrail metrics, and embracing a culture of continuous learning, organizations can effectively identify changes that drive sustainable customer usage and retention.