New ways to balance cost and reliability in the Gemini API
Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.
What’s Happening
Apr 02, 2026

Introducing Flex and Priority inference: advanced controls for developers to optimize costs and reliability through a single, unified interface.
Lucia Loher, Product Manager, Gemini API
Hussein Hassan Harrirou, Engineering, Gemini API
The Details
Why This Matters
Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These options give you granular control over cost and reliability through a single, unified interface. As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:

- Background tasks: high-volume workflows like data enrichment or "thinking" processes that don't need instant responses.
- Interactive tasks: user-facing features like chatbots and copilots where high reliability is needed.

Key Takeaways

Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority bridge this gap: you can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints.
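The routing decision described above can be sketched as a small helper. This is an illustrative assumption, not part of any official SDK: the tier labels mirror the announcement, but the function and workload names are hypothetical.

```python
# Sketch: route each workload to a service tier.
# The tier labels ("flex", "priority") mirror the announcement;
# the helper and workload names are hypothetical illustrations,
# not part of the Gemini SDK.

LATENCY_TOLERANT = {"data_enrichment", "thinking", "batch_eval"}

def select_tier(workload: str) -> str:
    """Background jobs go to Flex; interactive jobs go to Priority."""
    return "flex" if workload in LATENCY_TOLERANT else "priority"

print(select_tier("data_enrichment"))  # flex
print(select_tier("chatbot"))          # priority
```

Because both tiers sit behind the same synchronous interface, a router like this is the only branching logic an application needs, rather than maintaining two separate serving paths.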
The Bottom Line
This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.

Flex Inference: grow innovation for 50% less

Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
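Since tier selection happens per request on the same synchronous endpoint, switching between Flex and Priority amounts to changing one request option. A minimal sketch, assuming a hypothetical `service_tier` field in a generateContent-style request body (the real parameter name may differ; check the Gemini API reference):

```python
# Build a generateContent-style request body. Only the (assumed)
# "service_tier" field changes between background and interactive
# calls; "service_tier" is a hypothetical name for illustration.

def build_request(prompt: str, tier: str) -> dict:
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": tier,  # assumption: "flex" or "priority"
    }

background = build_request("Enrich this batch of records", "flex")
interactive = build_request("Draft a reply to this email", "priority")
```

The point of the sketch is the shape of the workflow: no separate batch job submission or polling loop is needed for the cost-optimized path.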