Retell AI now lets you split inbound call traffic across multiple agent versions. Test prompt changes on live calls with built-in performance comparison.

Test voice agent changes on live traffic with statistical confidence, reducing iteration cycles and deployment risk.
Signal analysis
Retell AI launched A/B testing for voice agents - a feature that routes incoming calls to different agent configurations based on percentage splits you define. Instead of deploying a new prompt or configuration to all calls at once, you can now test variants on a subset of your live traffic.
This is a direct response to a real friction point in voice AI workflows. Previously, you had two options: test changes in a sandbox environment (which doesn't reflect real call patterns), or deploy to production and hope nothing breaks. Neither approach gives you confidence that a prompt change actually improves performance on your actual call distribution.
The implementation includes comparative metrics across agent versions. You can compare call completion rates, transcription accuracy, user satisfaction signals, or any metric you track - all within the same time period and call volume distribution.
Voice agents are notoriously hard to iterate on. Unlike chat interfaces where you can test with a small user cohort, inbound voice systems serve customers immediately. A bad prompt change can tank your first-call resolution rate across thousands of calls before you notice the problem.
A/B testing in production gives you guardrails. You can test a new prompt on 10% of inbound calls while 90% use your proven agent. If the variant performs worse, the damage is limited to that 10% slice. If it performs better, you gradually increase the split or roll it out to 100%.
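To make the 10/90 split concrete, here is a minimal sketch of percentage-based routing. It is not Retell's implementation or API: the variant names, the SPLIT table, and the pick_variant helper are all hypothetical, and hashing the call ID is just one common way to get a stable, roughly proportional assignment.

```python
import hashlib

# Hypothetical variant table: agent version name -> share of inbound traffic.
# These names are illustrative and are not Retell identifiers.
SPLIT = {"agent_v1_production": 0.90, "agent_v2_candidate": 0.10}


def pick_variant(call_id: str, split: dict[str, float]) -> str:
    """Map a call ID to a variant so that traffic shares follow the split."""
    if abs(sum(split.values()) - 1.0) > 1e-9:
        raise ValueError("split shares must sum to 1.0")
    # Hash the call ID into a stable, roughly uniform number between 0 and 1.
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    # Walk the cumulative shares until the bucket falls inside a slice.
    cumulative = 0.0
    for variant, share in split.items():
        cumulative += share
        if bucket < cumulative:
            return variant
    return variant  # fallback for floating-point rounding at the top edge


if __name__ == "__main__":
    counts = {name: 0 for name in SPLIT}
    for i in range(10_000):
        counts[pick_variant(f"call-{i}", SPLIT)] += 1
    print(counts)
```

Because the assignment depends only on the call ID, the same call always maps to the same variant. Raising the candidate's share in the table moves only the calls near the slice boundary.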
This is especially valuable for optimization work: testing different temperature settings, more aggressive follow-up questions, domain-specific context injection, or new model versions. Each test becomes a data point instead of a binary deploy-or-rollback decision.
The comparison metrics are the critical piece. You need to know whether your variant actually improved handle time, resolution rate, or customer satisfaction - not guess based on anecdotal calls.
Start by identifying one specific thing you want to test. Don't try to test multiple variables at once - you won't know which change moved the needle. Good candidates: a new prompt instruction, different handling of objections, adjusted model temperature, or a completely new agent architecture.
Set up your split conservatively. If this is your first test, try 20-30% on the variant and 70-80% on your current production agent. If the variant is a dramatic change (new model, new prompt from scratch), go even lower - 10-15%. Run the test for at least 3-5 days, and preferably a full week: a test shorter than seven days cannot cover every weekday, so it cannot rule out day-of-week bias.
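How long a test needs to run also depends on call volume and on the size of the improvement you hope to detect. The sketch below estimates the number of days using the standard normal-approximation sample-size formula for comparing two proportions. The function name, parameters, and default significance and power levels are assumptions chosen for illustration and are not part of any Retell tooling.

```python
import math
from statistics import NormalDist


def days_needed(daily_calls, variant_share, baseline_rate, expected_rate,
                alpha=0.05, power=0.80):
    """Estimate test days for a two-sided comparison of two success rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Control calls received for every variant call under this split.
    ratio = (1 - variant_share) / variant_share
    # Variance of the rate difference, expressed per variant call.
    variance = (baseline_rate * (1 - baseline_rate) / ratio
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    variant_calls = (z_alpha + z_beta) ** 2 * variance / effect ** 2
    # The smaller variant arm is the bottleneck, so divide by its daily volume.
    return math.ceil(variant_calls / (daily_calls * variant_share))


if __name__ == "__main__":
    # Example inputs (made up): 1,000 calls per day, 20% on the variant,
    # hoping to lift a 70% completion rate to 75%.
    print(days_needed(1000, 0.20, 0.70, 0.75))
```

Treat the result as a lower bound from the statistics alone. A smaller variant share or a smaller expected lift pushes the estimate up, and the weekday-coverage consideration above still applies even when the formula returns a short duration.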
Define your success metric before the test starts. Will you measure call completion? Resolution rate? Average handle time? Transcription accuracy? Pick one primary metric plus maybe two secondary ones. Don't add metrics after seeing results.
Compare the results honestly. If the variant loses on your primary metric, you know to iterate further. If it wins, you now have data to justify the change to stakeholders. Then either roll it to 100% or do a second test with a new variant.
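One way to make that comparison less subjective is a two-proportion z-test on the primary metric. This is a minimal sketch that assumes the metric is a per-call success or failure (for example, call completed or not). The function name and the counts in the example are hypothetical, and the article does not say which statistical method Retell's built-in comparison uses.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the success-rate difference B minus A."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the null hypothesis that both agents perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if std_err == 0:
        raise ValueError("cannot test when every call succeeded or every call failed")
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: production agent (A) vs. variant (B).
    z, p = two_proportion_z_test(5600, 8000, 1500, 2000)
    print(f"z = {z:.2f}, p = {p:.5f}")
```

A positive z means the variant's rate is higher than production's. A small p-value suggests the difference is unlikely to be noise, provided the metric and the threshold were fixed before the test began.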
A/B testing for voice agents is still relatively rare in the market. Most voice AI platforms treat deployment as a one-shot event - you update your agent configuration and it applies to all calls. Retell's move suggests they're positioning for customers who run high-call-volume operations and need statistical confidence before rolling changes.
This feature also signals Retell's focus on the production voice workflow, not just the build-and-deploy stage. They're acknowledging that voice agents need continuous optimization, and that optimization requires real traffic data and controlled experiments.
Because few other platforms offer this feature, many builders are still managing agent iterations manually - maintaining multiple versions, manually routing calls, and cobbling together comparison metrics across different time periods. Retell has removed friction from a core workflow that every production voice system needs.