Google does not serve all models equally

Darsan Guruvayurappan

15 Dec 2025


TL;DR - Gemini API calls with images may give you incorrect responses if you’re using the Gemini Developer API endpoint. Fix it by switching to the Vertex endpoint.

Early last week, one of our clients filed a ticket complaining that our translation feature was not working properly. It wasn’t that the feature was throwing errors; rather, for specific pages in some languages, the translation was wildly incorrect.

Our PM flagged this, and I thought, “hmm… must be an issue with our image resolutions, I’m sure we can solve it by tonight”. Boy, was I in for some pain.

Step 1 - Triage

The ticket we received was simple enough: PDF documents in Marathi weren’t translating correctly. For some context, Lucio has a translation feature that takes in non-English documents and returns a translated version that maintains the formatting of the original document.

Lawyers typically don’t share documents with third parties, so we couldn’t get our hands on the problematic document. To triage this, we found a bunch of non-English documents online and ran them through our systems.

Not a single error.

We found more documents. And more. And even more. Yet we were consistently receiving high quality outputs on our testing and production environments.

Deployment Woes

Most of our clients are law firms that often have incredibly high data privacy standards. As a result, we offer a bevy of deployment options for clients, based on jurisdiction, data residency requirements, and more. For the most sensitive firms, we also offer the option to deploy Lucio fully on the firm’s VPC, so that their data never leaves their infrastructure.

While this makes for a great sales pitch, it results in a fairly large technical surface area for us to keep track of. Managing different cloud providers, LLM availability, rate limits, and more can get tricky very fast.

This client, in particular, was on a self-hosted deployment. Their configs seemed correct; nothing was out of place. Our PM tried uploading a document (which had worked on our instance) into their instance.

Complete gibberish came out the other end.

We had a reproduction.

Broken Pipelines?

Mind you, we didn’t receive any obvious error codes. The pipelines were still working; it was just that some pages gave results that looked right but were completely wrong.

Our first thought, of course, was that our configs were somehow broken. We audited their entire config set top to bottom, but nothing was amiss. In fact, it was almost exactly the same as our testing environment.

Our next thought was that the pipeline itself was somehow broken. For the translation feature, when a user uploads a document, we run it through a fairly complex set of transformations to get the final result. At a very high level, it looks somewhat like this (a code sketch follows the list):

  1. User uploads a document

  2. Normalize and extract text from it

  3. Translate text using conventional translation engines and LLMs

  4. Use vision LLMs to parse document layout and structure

  5. Synthesize the content from 3 and 4 for final result
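
As a sketch of that flow, with every function name below being a hypothetical stand-in for a much more involved internal step (not our actual internals):

def normalize_and_extract(pdf_bytes: bytes) -> list[str]:
    """Step 2: split the PDF into pages and pull out raw text (stub)."""
    raise NotImplementedError

def translate_pages(pages: list[str], target_lang: str) -> list[str]:
    """Step 3: translate each page with translation engines and LLMs (stub)."""
    raise NotImplementedError

def parse_layout(pdf_bytes: bytes) -> list[dict]:
    """Step 4: use vision LLMs to recover per-page layout and structure (stub)."""
    raise NotImplementedError

def synthesize(translated: list[str], layout: list[dict]) -> bytes:
    """Step 5: merge the translated text back into the original layout (stub)."""
    raise NotImplementedError

def translate_document(pdf_bytes: bytes, target_lang: str = "en") -> bytes:
    """Step 1 onwards: a user-uploaded document goes in, a formatted translation comes out."""
    pages = normalize_and_extract(pdf_bytes)
    translated = translate_pages(pages, target_lang)
    layout = parse_layout(pdf_bytes)
    return synthesize(translated, layout)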

There are several places where things could go wrong here. Extraction errors, low-res image generation due to PDF conversion failing, LLMs acting up, and more. We audited every single line of code, ran our manual tests, yet nothing seemed broken.

The next few days were a blur:

  • Our stack has multiple ways of calling LLMs. We switch between Azure OpenAI, Gemini SDKs, Portkey, and OpenRouter for calls depending upon the deployment.

  • Some poorly generated PDF files have incorrect encodings. They visually look correct, but extracting the text results in gibberish. This was our leading thesis for a while (see the sketch after this list).

  • We switched models, pinned versions, changed prompts, fiddled around with every possible setting when calling LLMs.
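
One rough way to sanity-check that thesis (an illustrative sketch, not our production check): Marathi is written in Devanagari, so if a page’s extracted text contains almost no Devanagari characters, extraction is the likely culprit rather than the LLM. The helper name and threshold here are illustrative:

def looks_garbled(extracted_text: str, min_devanagari_ratio: float = 0.5) -> bool:
    # Count only alphabetic characters; punctuation and digits are neutral.
    letters = [c for c in extracted_text if c.isalpha()]
    if not letters:
        return True  # no recoverable text at all
    devanagari = sum(1 for c in letters if "\u0900" <= c <= "\u097F")
    return devanagari / len(letters) < min_devanagari_ratio

print(looks_garbled("महाराष्ट्र शासन"))  # False: genuine Devanagari text
print(looks_garbled("à¤®à¤¹à¤¾à¤°"))      # True: mojibake from a bad encoding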

Many coffees later, we realized that the pipelines were throwing out garbage results at a particular point where we were relying on Gemini models.

Step 2 - Eureka

Before we move forward, a quick primer on how Google serves their LLMs.

Google offers two main ways of accessing their models: the Gemini Developer API and Vertex AI. Google’s had a long and annoying history with their LLM SDKs. They started out with Vertex AI, which is enterprise-focused and a complete pain to use. Once their quickstart reached the point where it required 4 days and 2 full-stack developers to complete, they decided to start over with the Gemini Developer API.

This is a much simpler, API-key-based service and is currently the recommended way to access their models. Their new Python SDK (called google-genai, which is completely different from the now-deprecated google-generativeai) offers a simple way to switch between the Vertex and GDAPI endpoints.
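
In google-genai, switching between the two endpoints comes down to how you construct the Client. A minimal sketch (the key and project ID are placeholders):

from google import genai

# Gemini Developer API: authenticate with a plain API key.
gdapi_client = genai.Client(api_key="YOUR_API_KEY")

# Vertex AI: same SDK, same model names, but authenticated against a Google Cloud project.
vertex_client = genai.Client(vertexai=True, project="YOUR_PROJECT_ID", location="global")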

Subtleties

At Lucio, we consume Gemini models through a variety of sources:

  • Our Dev instances use models through OpenRouter, sometimes via their BYOK service to route requests through our own endpoints.

  • Our production instances use Portkey to route requests through our own GDAPI and Vertex endpoints.

  • Some deployments, where neither Portkey nor OpenRouter is allowed, use the google-genai package directly. Where possible, we use the Vertex endpoint first and fall back to GDAPI, as sketched below.
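
A simplified sketch of that Vertex-first fallback for the direct google-genai deployments (the real config handling is more involved):

from google import genai

def make_gemini_client(project_id: str | None, api_key: str | None) -> genai.Client:
    """Prefer the Vertex endpoint when a GCP project is configured;
    otherwise fall back to a Gemini Developer API key."""
    if project_id:
        return genai.Client(vertexai=True, project=project_id, location="global")
    if api_key:
        return genai.Client(api_key=api_key)
    raise ValueError("No Gemini credentials configured for this deployment")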

Theoretically, every single one of these calls should give the same result, since at the end of the day, they are all routed to Google’s servers.

We were, however, at the point of throwing stuff at the wall to see what stuck. While messing around with our OpenRouter configs, we discovered something interesting: when we disabled our Vertex API BYOK, response quality degraded.

That made me want to test every single model call separately. Here’s what we saw:

  • OpenRouter call via google/gemini-2.5-flash — worked

  • Gemini call (through vertex) — worked

  • OpenRouter via google-vertex/global — worked

  • OpenRouter via google-vertex — worked

  • OpenRouter via google-ai-studio — worked

  • Gemini OpenAI-compatible endpoint on Vertex — worked

  • OpenRouter with BYOK Vertex endpoint disabled — garbage?!

  • OpenRouter with BYOK AI Studio endpoint disabled — worked

  • Gemini call (through GDAPI) — garbage?!

The tests were clear: for some reason, LLM calls with images through our GDAPI key were giving poor-quality results. OpenRouter through google-ai-studio was fine? Switching to Vertex somehow made it work?

Madness.

Step 3 - Reproduction

The error seemed to be fairly consistent. GDAPI would consistently give irrelevant results, while both Vertex and OpenRouter consistently gave the correct response.

You can run these snippets to verify the error yourself. Fetch this random test image to get started. I ran these on 15th December 2025 to get these results.

GDAPI Bad Result

from google import genai
from google.genai import types as genai_types
from openai import OpenAI
from base64 import b64encode

# Load the image file
with open("test.png", "rb") as f:
    image_data = f.read()
    
# Test Case 1 - Calling the Gemini 2.5 Flash model directly through Google GenAI Client
# This returns terrible quality output, with very high latency.

GEMINI_STUDIO_API_KEY = "YOUR KEY"
client = genai.Client(api_key=GEMINI_STUDIO_API_KEY)

# Create generation configuration
generation_config = genai_types.GenerateContentConfig(
    temperature=0.2,
    max_output_tokens=6000,
    system_instruction="You are an expert translator.",
)

contents = [
    genai_types.Part.from_bytes(
        data=image_data,
        mime_type="image/png",
    ),
    "Extract all the text and translate to English. Return just the english version.",
]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=contents,
    config=generation_config,
)

print(response.text)

Output (completely irrelevant)


One line change to use Vertex

# Test Case 2 - Calling the Gemini 2.5 Flash model through Google GenAI Vertex AI Client
# This returns perfect output with low latency. WHY?!
GOOGLE_CLOUD_PROJECT_ID = "YOUR PROJECT ID"
client = genai.Client(vertexai=True, project=GOOGLE_CLOUD_PROJECT_ID, location='global')

# Create generation configuration
generation_config = genai_types.GenerateContentConfig(
    temperature=0.2,
    max_output_tokens=6000,
    system_instruction="You are an expert translator.",
)

contents = [
    genai_types.Part.from_bytes(
        data=image_data,
        mime_type="image/png",
    ),
    "Extract all the text and translate to English. Return just the english version.",
]

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=contents,
    config=generation_config,
)

print(response.text)

Output (correct)


Calls through OpenRouter

# Test Case 3 - Calling the Gemini 2.5 Flash model through OpenRouter OpenAI-compatible API
# This returns perfect output with low latency, across all providers.
OPENROUTER_API_KEY = "YOUR KEY"
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY
)

encoded_image = b64encode(image_data).decode("utf-8")
messages = [
    {
        "role": "system",
        "content": "You are an expert translator.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{encoded_image}"},
            }
        ],
    },
    {
        "role": "user",
        "content": "Extract all the text and translate to English. Return just the english version.",
    }
]

config = {
    "max_tokens": 6000,
    "temperature": 0.2,
}

model = "google/gemini-2.5-flash"

# Accessing through any of these providers gives the same result
# model = "google/gemini-2.5-flash:google-ai-studio"
# model = "google/gemini-2.5-flash:google-vertex"
# model = "google/gemini-2.5-flash:google-vertex/global"

response = client.chat.completions.create(
    model=model,
    messages=messages,
    **config,
)
print(response.choices[0].message.content)

Output (same as vertex call)


Conclusions

Shifting to the Vertex endpoint and removing GDAPI as a translation fallback fixed the issue. A week of debugging, reduced to a one-line fix.

It is clear that, even though the model identifiers are the same, Google serves different versions of the model depending on how you access it. What is confusing, however, is why they would serve a degraded version when you access the model directly, but a better one through OpenRouter.

Is it a quantization thing? Or possibly some kind of a botched update? Maybe our API key is cursed. LLMs are already difficult to work with and scale; these kinds of issues add an extra level of spiciness to the whole affair!

Darsan is a co-founder of Lucio, and is passionate about all things Law, AI, and Tech. If you enjoy solving fun problems like this at the cutting edge of AI, come work with us at Lucio!