    The Death of A/B Testing Best Practices

    Best practices like one variable at a time and 200 conversions per arm assumed variants were scarce. They are no longer scarce. The math of marketing experimentation is changing under our feet.

    By James R. Gosnell. Educational content. Not legal advice.

    The Rules Were Built for Scarcity

    Run one variable at a time. Wait for 200 conversions per arm. Run for at least two weeks to control for day-of-week effects. Do not peek at the data. Pick a winner only at 95% significance.
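
    For reference, the old rulebook reduces to a few lines. The sketch below is a minimal illustration, assuming a pooled two-proportion z-test as the significance check; the function names and the example counts are hypothetical, and real platforms may use a different test.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for a pooled two-proportion z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal, via math.erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


def classical_verdict(conv_a, n_a, conv_b, n_b, min_conversions=200, alpha=0.05):
    """Apply the old rules: enough conversions per arm, then 95% significance."""
    if min(conv_a, conv_b) < min_conversions:
        return "keep waiting"
    _, p_value = two_proportion_z_test(conv_a, n_a, conv_b, n_b)
    return "declare a winner" if p_value < alpha else "no winner"
```

    With these made-up counts, classical_verdict(150, 3000, 195, 3000) returns "keep waiting" because one arm is still short of 200 conversions, no matter how large the observed gap looks.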

    These were the right rules for a market where each ad variant cost real money to produce. A video shoot was a budget line. A landing page was a sprint of designer and engineer time. Test cautiously, control rigorously, declare a winner only when the data earned the verdict.

    That cost structure is gone. A marketer with an Adobe Firefly seat and a Meta Advantage+ subscription ships 50 ad variants for the price of a coffee. Meta's 2026 Advantage+ benchmarks show AI-generated variants delivering 22 to 34 percent higher ROAS than statically produced creatives over a 90-day window. eMarketer projects AI-powered US ad spend to hit $57 billion in 2026, about 12 percent of the total US ad market.

    The constraint moved. It is no longer producing the variant. It is learning which variant works fast enough to spend the rest of the quarter on the winner.

    What the Platforms Just Shipped

    The experimentation platforms felt it. In January 2026, Everstone Capital announced the merger of VWO and AB Tasty, creating a combined $100 million ARR business with 4,000 customers. Both companies had Bayesian engines. Both had bandit modes. Neither had grown into them fast enough alone.

    Amplitude pushed multi-armed bandits into Amplitude Experiment. The implementation uses Thompson sampling, reallocates traffic hourly or daily, and ships with a 100-exposure minimum per variant. No other statistical methodology is offered. Statsig, founded by ex-Facebook experimentation engineers, now runs the testing layer at OpenAI, Notion, and Atlassian. Its product positions bandits as the default and frequentist sequential testing as a fallback for cases that need clean error control.
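
    To make the mechanics concrete, here is a minimal sketch of Thompson sampling with a per-variant exposure floor. It is not Amplitude's or Statsig's implementation. It assumes binary conversions, a uniform Beta(1, 1) prior, and an even split until every variant clears the floor; the function name and parameters are hypothetical.

```python
import random


def thompson_allocation(stats, min_exposures=100, draws=5000, rng=None):
    """Return the share of traffic each variant gets in the next window.

    stats maps variant name -> (conversions, exposures).
    """
    rng = rng or random.Random()
    names = list(stats)
    # Assumption: until every variant has the minimum exposures, split evenly.
    if any(exposures < min_exposures for _, exposures in stats.values()):
        return {name: 1 / len(names) for name in names}

    wins = {name: 0 for name in names}
    for _ in range(draws):
        # Sample a plausible conversion rate per variant from its Beta posterior.
        sampled = {
            name: rng.betavariate(1 + conv, 1 + exposures - conv)
            for name, (conv, exposures) in stats.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    # Traffic share = how often each variant looked best across the draws.
    return {name: wins[name] / draws for name in names}
```

    Calling this once per reallocation window (hourly or daily) with fresh counts shifts traffic toward the leader while still giving uncertain variants a share proportional to their chance of being best.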

    Forty-one percent of CRO programs ran Bayesian frameworks in 2026, up from 18 percent in 2022. Bayesian teams report 14 percent shorter test durations with similar win rates. The two-week pure A/B run is not banned anywhere. It is just being skipped.

    The Quiet Death of Frequentist Marketing

    The argument is not methodological. It is cost-structural. Frequentist A/B testing earns its keep when the cost of running the wrong variant through the test is high relative to the cost of waiting for power. That tradeoff was real in 2015. It is not real in 2026 for most marketing decisions.

    When variants are nearly free, the marginal cost of running a bandit that reallocates to the leader by day three is the small statistical loss from killing variants before they reach classical significance. The marginal benefit is two weeks of additional budget aimed at the variants that are working. A bandit that picks the wrong leader somewhat more than 5 percent of the time but reallocates fast can still beat an A/B test that gets the right answer 95 percent of the time and never reallocates.
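
    A small simulation shows the shape of that tradeoff. This is a sketch under loud assumptions: two variants, hypothetical true conversion rates of 5 and 10 percent, 20,000 visitors, and a bandit that updates after every visitor rather than hourly. All names are illustrative.

```python
import random


def simulate(true_rates, visitors, policy, seed=0):
    """Return total conversions when `policy` picks a variant for each visitor."""
    rng = random.Random(seed)
    conversions = [0] * len(true_rates)
    exposures = [0] * len(true_rates)
    for visitor in range(visitors):
        arm = policy(visitor, conversions, exposures, rng)
        exposures[arm] += 1
        if rng.random() < true_rates[arm]:
            conversions[arm] += 1
    return sum(conversions)


def even_split(visitor, conversions, exposures, rng):
    # Classic A/B: rotate through the arms and never reallocate.
    return visitor % len(exposures)


def thompson(visitor, conversions, exposures, rng):
    # Bandit: sample each arm's Beta posterior, send the visitor to the best draw.
    draws = [
        rng.betavariate(1 + c, 1 + n - c) for c, n in zip(conversions, exposures)
    ]
    return draws.index(max(draws))


rates = [0.05, 0.10]  # hypothetical true conversion rates
ab_total = simulate(rates, 20_000, even_split)
bandit_total = simulate(rates, 20_000, thompson)
print(ab_total, bandit_total)
```

    The even split keeps sending half the traffic to the weaker variant for the whole run, so its total sits near the average of the two rates. The bandit's total approaches what the better variant alone would have earned. The gap narrows when the true rates are close, which is exactly where the classical test needs the longest wait.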

    Frequentist methods do not vanish. They remain the right call for product launches where one variant ships for a year, for pricing tests where customer trust is at stake, and for any decision that needs a defensible point estimate of lift in a deck for the CFO. But the bulk of marketing decisions (the ad creative, the landing page hero, the subject line, the send time) do not need that level of rigor. They need fast reallocation against a moving baseline. Bandits do that. A/B does not.

    The Wealth Firm Cannot Wait Six Weeks

    The math we kept running into while building LeadLord forced this shift. A wealth firm does not have time to wait six weeks for a clean A/B on a single ad. It has a quarter to fill seats at a webinar. The system has to run live: ship 30 variants Monday, kill the bottom half by day three, double down on the top quartile, retire the next worst tier by day seven. That is not A/B testing. That is a multi-armed bandit with compliance constraints. The best practices we grew up on assume a market that no longer exists.
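
    The Monday-to-day-seven schedule can be sketched as two pruning steps. This is an illustration, not LeadLord's code. It assumes variants are ranked by observed conversion rate, that "double down" means twice the budget weight, and that the day-seven tier is the worst half of what is still running. Compliance checks are left out, and every name is hypothetical.

```python
def rank(scores):
    """Variant names ordered from best to worst observed rate."""
    return sorted(scores, key=scores.get, reverse=True)


def day_three_cut(scores):
    """Kill the bottom half; give top-quartile variants double budget weight."""
    ranked = rank(scores)
    survivors = ranked[: max(1, len(ranked) // 2)]
    top_quartile = set(ranked[: max(1, len(ranked) // 4)])
    weights = {v: 2.0 if v in top_quartile else 1.0 for v in survivors}
    total = sum(weights.values())
    # Normalise so the budget shares add up to 1.
    return {v: w / total for v, w in weights.items()}


def day_seven_cut(scores, active):
    """Retire the worst half of whatever is still running (assumed tier size)."""
    ranked = [v for v in rank(scores) if v in active]
    return ranked[: max(1, len(ranked) // 2)]


# Thirty hypothetical variants with made-up observed conversion rates.
scores = {f"v{i:02d}": 0.01 + 0.001 * i for i in range(30)}
budget = day_three_cut(scores)            # 15 survivors with budget shares
finalists = day_seven_cut(scores, budget)  # 7 variants left by day seven
```

    A hard cut like this is cruder than Thompson sampling, since a killed variant never gets a second look, but it is easy to explain to a compliance reviewer and easy to audit afterward.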

    The Skill Mix That Actually Pays

    A growth team in 2018 needed a CRO lead who could run a clean test, a copywriter, and a designer. The bottleneck was production. It is now learning.

    The growth lead worth hiring in 2026 designs experiments rather than runs them. They write the test plan: hypothesis, segment, metric, reallocation cadence, kill criteria, spend ceiling, compliance guardrails. They know the difference between epsilon-greedy and Thompson sampling well enough to pick the right one for a webinar funnel versus a retention email. They read attribution outputs across a fragmented post-iOS-14 measurement stack and can tell the team which lifts to trust.
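
    One way to picture that test plan is as a small data structure with the allocation policy as one of its fields. The sketch below is hypothetical throughout: the field names, defaults, and example values are assumptions, and both policies are reduced to their textbook form for binary conversions.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ExperimentPlan:
    hypothesis: str
    segment: str
    metric: str
    reallocation_cadence_hours: int
    kill_criteria: str
    spend_ceiling_usd: float
    compliance_guardrails: Tuple[str, ...] = ()
    policy: str = "thompson"  # or "epsilon_greedy"
    epsilon: float = 0.1  # exploration rate, used only by epsilon-greedy


def choose_variant(
    plan: ExperimentPlan, stats: Dict[str, Tuple[int, int]], rng: random.Random
) -> str:
    """Pick the variant for the next impression.

    stats maps variant name -> (conversions, exposures).
    """
    if plan.policy == "epsilon_greedy":
        # Explore uniformly with probability epsilon, else exploit the best rate.
        if rng.random() < plan.epsilon:
            return rng.choice(sorted(stats))
        return max(stats, key=lambda v: stats[v][0] / max(stats[v][1], 1))
    if plan.policy == "thompson":
        # Sample each Beta posterior; exploration shrinks as evidence accumulates.
        return max(
            stats,
            key=lambda v: rng.betavariate(
                1 + stats[v][0], 1 + stats[v][1] - stats[v][0]
            ),
        )
    raise ValueError(f"unknown policy: {plan.policy}")
```

    The practical difference is where the exploration goes. Epsilon-greedy keeps spending a fixed fraction of traffic on random variants, including clear losers, for as long as the test runs. Thompson sampling spends its exploration on variants that could still plausibly win and tapers off on its own.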

    The copywriter and designer matter less because the model writes the variants. The variant reader matters more. A creative director who can explain why segment A responded to a fee-transparency angle while segment B responded to a tax-efficiency angle is worth more than one who defends a single hero concept in a tissue meeting.

    Agencies built around weekly test reviews and pre-launch approval calls are working off the old org chart. They are losing budgets to in-house teams who run bandits and feed the agency only the winners to scale.

    What to Watch

    The first thing worth watching is whether experimentation platforms differentiate on bandit math or collapse into commodity reallocation. If Thompson sampling becomes a checkbox, the value moves to the layer above: experiment design, attribution modeling, multi-platform orchestration.

    The second is the regulators. Bandits work by reallocating traffic toward winners while a test is still inconclusive. In a regulated industry, that means a compliance team has to sign off on variants that may never reach statistical significance individually. The frameworks for that approval do not exist yet. The teams that write them first will set the standard for the next decade of regulated marketing.