New ways to balance cost and reliability in the Gemini API
Google is introducing two new inference tiers to the Gemini API, Flex and Priority, to balance cost and latency.
What’s Happening
Apr 02, 2026

Introducing Flex and Priority inference: advanced controls for developers to optimize costs and reliability through a single, unified interface.
Lucia Loher, Product Manager, Gemini API
Hussein Hassan Harrirou, Engineering, Gemini API
The Details
Why This Matters
Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These options give you granular control over cost and reliability through a single, unified interface. As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:

- Background tasks: high-volume workflows like data enrichment or "thinking" processes that don't need instant responses.
- Interactive tasks: user-facing features like chatbots and copilots where high reliability is needed.

Key Takeaways

Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority bridge this gap: you can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints.
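The routing decision described above can be sketched as a small helper. This is an illustrative assumption, not part of any official SDK: the tier labels mirror the announcement, but the function and workload names are hypothetical.

```python
# Sketch: route each workload to a service tier.
# The tier labels ("flex", "priority") mirror the announcement;
# the helper and workload names are hypothetical illustrations,
# not part of the Gemini SDK.

LATENCY_TOLERANT = {"data_enrichment", "thinking", "batch_eval"}

def select_tier(workload: str) -> str:
    """Background jobs go to Flex; interactive jobs go to Priority."""
    return "flex" if workload in LATENCY_TOLERANT else "priority"

print(select_tier("data_enrichment"))  # flex
print(select_tier("chatbot"))          # priority
```

Because both tiers sit behind the same synchronous interface, a router like this is the only branching logic an application needs, rather than maintaining two separate serving paths.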
The Bottom Line
This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.

Flex Inference: grow innovation for 50% less

Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
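Since tier selection happens per request on the same synchronous endpoint, switching between Flex and Priority amounts to changing one request option. A minimal sketch, assuming a hypothetical `service_tier` field in a generateContent-style request body (the real parameter name may differ; check the Gemini API reference):

```python
# Build a generateContent-style request body. Only the (assumed)
# "service_tier" field changes between background and interactive
# calls; "service_tier" is a hypothetical name for illustration.

def build_request(prompt: str, tier: str) -> dict:
    return {
        "contents": [{"parts": [{"text": prompt}]}],
        "service_tier": tier,  # assumption: "flex" or "priority"
    }

background = build_request("Enrich this batch of records", "flex")
interactive = build_request("Draft a reply to this email", "priority")
```

The point of the sketch is the shape of the workflow: no separate batch job submission or polling loop is needed for the cost-optimized path.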