A/B Test Planning Template

A comprehensive template for planning statistically rigorous A/B tests. Used by the Wise Uplift team to ensure every experiment is properly designed, tracked, and analyzed.

Why Proper Test Planning Matters

Most A/B tests fail not because of poor ideas, but because of poor planning. Without a structured approach, you risk running tests too short, misinterpreting results, or testing the wrong things entirely. This template ensures your experiments are scientifically valid and actionable.

We've run thousands of A/B tests across industries, and this template captures the essential elements that separate successful testing programs from those that waste time and resources.

1. Hypothesis Development

A strong hypothesis is specific, measurable, and based on insights from data or user research. It should explain what you're changing, why you expect it to work, and how you'll measure success.

Hypothesis Format Template

If we [make this change]

Then we expect [this measurable outcome]

Because [this insight or research finding]

We'll measure success by [specific metric and target]

Hypothesis Checklist

  • Insight-Driven: Is your hypothesis based on data (analytics, heatmaps, user research) rather than opinion?
  • Specific Change: Can a developer or designer implement exactly what you're proposing without ambiguity?
  • Measurable Outcome: Have you defined the primary metric and the minimum lift you're testing for?
  • Business Justification: If this test wins, will it meaningfully impact revenue or key business goals?

Example Strong Hypothesis:

"If we add customer testimonials with photos and company logos above the fold on our pricing page, then we expect to increase trial signups by at least 15% because heatmap data shows 78% of visitors scroll to read our testimonials section, and exit surveys indicate trust is the #1 barrier to signup. We'll measure success by trial signup rate over 2 weeks with 10,000 visitors per variation."

2. Sample Size & Test Duration

Running a test for the wrong duration is one of the most common mistakes. Too short and you lack statistical power; too long and you're leaving wins on the table.

Sample Size Calculation

  • Baseline Conversion Rate: What's your current conversion rate for the metric you're testing? Pull from the last 30 days.
  • Minimum Detectable Effect: What's the smallest lift that would be worth implementing? Usually 10-20% relative improvement.
  • Statistical Power: Use 80% power (standard). Higher power lowers the chance of missing a real effect, but it requires more traffic.
  • Significance Level: Use 95% confidence (α = 0.05) as standard. Don't peek at results early or you'll inflate false positives.

Use Our Calculator:

Don't guess at sample size—use our Sample Size Calculator to determine exactly how many visitors you need per variation based on your baseline rate and target lift.
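If you'd rather script the math than rely on a tool, the standard two-proportion sample-size formula is short enough to sketch in plain Python. The function name and example numbers below are illustrative; treat this as a sanity check against the calculator, not a replacement for it:

```python
# Rough sketch of the standard two-proportion sample-size formula,
# assuming a two-sided test at 95% confidence and 80% power (the defaults above).
from statistics import NormalDist

def sample_size_per_variation(baseline_rate: float,
                              relative_mde: float,
                              alpha: float = 0.05,
                              power: float = 0.80) -> int:
    """Visitors needed in EACH variation to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)        # expected treatment rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (p2 - p1) ** 2
    return int(n) + 1                              # round up

# Example: 3% baseline conversion rate, testing for a 15% relative lift
print(sample_size_per_variation(0.03, 0.15))       # roughly 24,000 visitors per variation
```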

Test Duration Best Practices

  • Full Week Cycles: Run tests in complete 7-day cycles to account for weekday vs weekend behavior differences.
  • Minimum 2 Weeks: Even with enough traffic, run for at least 2 weeks to capture behavioral variations and smooth out anomalies.
  • Avoid External Events: Don't run tests during major holidays, sales events, or product launches that could skew results.
  • Traffic Allocation: Use 50/50 splits for most tests. Only use uneven splits if you need to minimize risk exposure.
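Once you know the required sample size, the run length follows mechanically from your eligible traffic and the rules above. A rough sketch (the helper name and the daily-traffic figure are hypothetical):

```python
import math

def test_duration_days(visitors_per_variation: int,
                       daily_eligible_traffic: int,
                       num_variations: int = 2) -> int:
    """Days needed to collect the sample, rounded up to full weeks, minimum 2 weeks."""
    total_needed = visitors_per_variation * num_variations
    raw_days = math.ceil(total_needed / daily_eligible_traffic)
    full_weeks = math.ceil(raw_days / 7) * 7   # run in complete 7-day cycles
    return max(full_weeks, 14)                 # never shorter than 2 weeks

# Example: ~24,000 visitors per variation, 4,000 eligible visitors per day
print(test_duration_days(24_000, 4_000))       # 14 days
```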

3. Tracking & Implementation

Proper tracking setup is critical. If your tracking fails, weeks of test runtime are wasted. Always QA your tracking before launching.

Tracking Setup Checklist

  • Primary Metric Tracking: Confirm your conversion goal fires correctly in both variations. Test with browser dev tools.
  • Secondary Metrics: Track 2-3 secondary metrics to understand side effects (bounce rate, time on page, scroll depth).
  • Variation Assignment: Ensure users are consistently bucketed into the same variation across sessions (use cookies or user IDs); a hash-based bucketing sketch follows this checklist.
  • Guardrail Metrics: Monitor revenue per visitor and cart abandonment to catch unexpected negative impacts early.
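The consistent-bucketing point above is typically handled by hashing a stable identifier instead of re-randomizing on every page load. A minimal sketch, assuming you already persist a user ID in a cookie or account record (the function and experiment names are illustrative):

```python
import hashlib

def assign_variation(user_id: str,
                     experiment: str,
                     variations=("control", "treatment")) -> str:
    """Deterministically bucket a user: the same ID + experiment always yields the same variation."""
    # Include the experiment name so different tests get independent assignments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100               # stable 0-99 bucket per user
    split = 100 // len(variations)               # even split across variations
    return variations[min(bucket // split, len(variations) - 1)]

# The same user lands in the same variation across sessions and devices.
print(assign_variation("user-12345", "pricing-page-testimonials"))
```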

Pre-Launch QA Checklist

  • Visual QA: Test both variations on desktop, mobile, and tablet. Check all major browsers (Chrome, Safari, Firefox).
  • Tracking Verification: Complete the conversion flow in both variations and confirm events fire in your analytics tool.
  • Flicker Testing: Check for visual "flicker" when the test loads. Use server-side testing or anti-flicker snippets if needed.
  • Performance Impact: Confirm the test doesn't slow page load. Run Lighthouse tests on both variations.

4. Analysis & Decision Framework

Analyzing test results is where many teams make critical errors. This framework ensures you make data-driven decisions while avoiding common statistical pitfalls.

Statistical Analysis Checklist

  • Reach Sample Size: Don't end the test early, even if results look promising. Wait until you hit your calculated sample size.
  • Statistical Significance: Use a proper significance calculator (like our T-Test Calculator) rather than relying on platform calculations alone; a two-proportion z-test sketch for conversion metrics follows this checklist.
  • Practical Significance: Even if statistically significant, is the lift large enough to matter to your business?
  • Secondary Metrics Review: Check if the winning variation had negative impacts on other important metrics.
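For a conversion-rate metric, the significance check above amounts to a two-proportion z-test. A sketch of that calculation, using illustrative counts (roughly a 3% control rate against a 15% relative lift at about 24,000 visitors per variation):

```python
# Two-proportion z-test sketch for a conversion-rate metric.
# The counts are illustrative; plug in your own totals once the test
# has reached its planned sample size.
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (observed relative lift of B vs A, two-sided p-value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return (p_b - p_a) / p_a, p_value

lift, p = two_proportion_z_test(conv_a=720, n_a=24_000, conv_b=828, n_b=24_000)
print(f"relative lift {lift:.1%}, p-value {p:.4f}")   # significant if p < 0.05
```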

Segmentation Analysis

  • Device Segmentation: Did the treatment perform differently on mobile vs desktop? Common for design changes.
  • Traffic Source: Compare results by traffic source (paid, organic, direct, email). Different audiences may respond differently.
  • New vs Returning: New visitors often respond differently than returning visitors to messaging changes.
  • Geographic Differences: For international sites, check if results vary by region or language.
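If the raw assignment and conversion events are available as a table, each of these cuts is a single group-by. A sketch with pandas, using hypothetical column names and a tiny inline dataset:

```python
import pandas as pd

# Hypothetical raw export: one row per visitor, converted is 0/1.
# Column names are illustrative; adapt them to your analytics export.
df = pd.DataFrame({
    "variation": ["control", "control", "treatment", "treatment", "treatment", "control"],
    "device":    ["mobile",  "desktop", "mobile",    "desktop",   "mobile",    "mobile"],
    "converted": [0,          1,         1,           1,           0,           0],
})

# Conversion rate and sample size per variation within each device segment
by_device = (df.groupby(["device", "variation"])["converted"]
               .agg(visitors="count", conversion_rate="mean"))
print(by_device)

# Swap "device" for traffic-source, new-vs-returning, or region columns
# to run the other cuts in the list above.
```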

Decision Matrix

  • Clear Winner (stat sig, practical lift, no negative impacts): Implement the winning variation immediately.
  • Mixed Results (stat sig but negative secondary metrics): Analyze the trade-offs; consider an iteration or a segmented rollout.
  • Inconclusive (not stat sig after reaching sample size): Keep the control, document the learnings, and test a bolder variation or a different hypothesis.
  • Significant Loss (treatment performed worse): End the test, document why it failed, and use the insights for your next hypothesis.

5. Documentation & Knowledge Sharing

The value of a testing program compounds when you document and share learnings. Build an institutional knowledge base that prevents repeated mistakes and accelerates future tests.

What to Document

  • Test Details: Original hypothesis, variations tested, sample size, and duration.
  • Results Summary: Primary metric lift, statistical significance, and confidence intervals.
  • Key Learnings: Why did it win or lose? What does this tell you about your users?
  • Next Steps: Follow-up tests to run, other pages to apply learnings to.
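Teams that keep their test log in a repo or database often capture these fields as a small structured record so every experiment is documented the same way. A hypothetical sketch of one entry's shape:

```python
from dataclasses import dataclass, field

@dataclass
class ExperimentRecord:
    """One entry in the testing knowledge base; fields mirror the checklist above."""
    name: str
    hypothesis: str
    variations: list
    sample_size_per_variation: int
    duration_days: int
    primary_metric_lift: float        # relative lift, e.g. 0.15 for +15%
    p_value: float
    confidence_interval: tuple        # (low, high) on the relative lift
    key_learnings: str
    next_steps: list = field(default_factory=list)
```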

Need Help Running Rigorous A/B Tests?

Wise Uplift manages the entire experimentation process for you: from hypothesis development and test design to implementation, analysis, and scaling winners. We've run thousands of tests and know how to extract maximum insights from every experiment.
