Some AI failures are easy to recognize: an invented date, a fake source, an overconfident conclusion. The Gemini 3.1 Pro + Deep Research case analyzed here, based on a real user-provided sample, is different. The answer does not merely get something wrong. It starts as a long technical report about a specialized topic, with sections, sources, and operational recommendations, but at some point it loses control of the writing and falls into an avalanche of synonyms, adjectives, and repeated connectors.

The clearest signal is not that it says a single false sentence. It is that it stops saying anything. In the full sample, a simple count finds roughly 24,800 words, with the Spanish word “asertivamente” appearing almost 9,700 times, close to 39% of the text. Terms such as “única”, “de manera exclusiva”, “poda”, “purga”, “ineludible”, and “crítico” also repeat heavily. That is no longer a long answer: it is a degenerate output.
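
The count itself is easy to reproduce. A minimal sketch, assuming the report is available as plain text; the function names and the `demo` string are illustrative stand-ins, not the real sample:

```python
import re
from collections import Counter

def word_counts(text: str) -> Counter:
    # \w+ also matches accented letters such as "ú" in Python 3.
    return Counter(re.findall(r"\w+", text.lower()))

def word_share(text: str, word: str) -> float:
    """Fraction of all word tokens taken by a single word (0.0 for empty text)."""
    counts = word_counts(text)
    total = sum(counts.values())
    return counts[word.lower()] / total if total else 0.0

# Tiny stand-in text, not the real 24,800-word report.
demo = "la poda única asertivamente asertivamente asertivamente crítico asertivamente"
print(word_counts(demo)["asertivamente"])   # 4 of 8 tokens
print(word_share(demo, "asertivamente"))    # 0.5
```

A healthy technical report rarely gives a single adverb more than a small fraction of its tokens, so even this crude ratio separates a long answer from a collapsed one.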

The most reasonable hypothesis is not a single magical Gemini bug, but the combination of three layers: a generative model, an agentic research system, and a long synthesis stage. Deep Research does not behave like a simple chat answer. Google describes it as a product that plans, searches, reasons, explores many sources, and generates extended reports. In the API documentation, Google also describes Deep Research as an agentic researcher built for autonomous, multi-step investigations and cited reports.

When one of those layers loses stability, the error can amplify across hundreds or thousands of tokens.

[Figure: Deep Research pipeline with planning, search, context, synthesis, and a failure point in repetitive generation.]
Deep Research is not a single response: it is a chain of planning, search, context, and synthesis. A failure in the final stage can drag the whole report down.

In short

The visible Gemini 3.1 Pro + Deep Research error can be described as repetitive output degeneration. The system keeps producing grammatically recognizable text, but loses information density. Instead of advancing with evidence, structure, and conclusions, it starts inflating sentences with increasingly weak semantic variations.

It is not enough to call this a hallucination. A typical hallucination invents data, sources, or causal relationships. Here something more basic and more mechanical happens: generation enters a loop where each repetition increases the chance of further repetition. The output may look intentional because it preserves a technical tone, but functionally it breaks.

The practical difference is this:

| Failure type | What happens | How it appears in the sample |
| --- | --- | --- |
| Factual hallucination | The model invents or distorts data. | A specific figure, source, or claim does not hold up. |
| Semantic drift | The text moves away from the original objective. | The report shifts from developing the original topic to accumulating empty solemnity. |
| Lexical loop | A word or family of words repeats out of control. | “asertivamente”, “única”, and “de manera exclusiva” cascade through the output. |
| Synthesis collapse | The output no longer summarizes or organizes evidence. | Length increases while useful information decreases. |

What exactly failed in the response

The sample begins with a recognizable structure. It has sections, tables, technical concepts, and references to real practices: definitions, validation signals, review processes, logs, control criteria, and auditing. That first part can be debated in detail, but it still serves an objective.

The failure appears when the writing becomes ornamental and self-referential. Phrases like “estricto marco perimetral temporal absolutamente limitante” or “lentísimo incesante continuado asedio iterativo” already show stylistic inflation. Then the system does not just exaggerate: it gets stuck. In the second block, the repetition of “única” and “asertivamente” stops functioning as language and becomes an automatic pattern.

There are four clear symptoms:

  1. Compression loss: the model no longer reduces information; it expands it without adding content.
  2. Hierarchy loss: everything seems equally important, critical, and unavoidable.
  3. Style-control loss: the technical tone turns into mechanical grandiosity.
  4. Stopping-criterion loss: the system does not detect that the answer is no longer helping the user.

That last point matters. A research system should have internal quality signals: source density, coverage, repetition, novelty per paragraph, coherence with the plan, and reasonable length. When those signals do not stop the output, the user receives a huge block that consumes time and trust.
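
One of those signals, novelty per paragraph, can be approximated very cheaply. The following is a rough sketch under a strong assumption: that "new idea" can be proxied by "new vocabulary". The function name and the sample report are hypothetical.

```python
import re

def novelty_per_paragraph(report: str) -> list[float]:
    """For each paragraph, the share of its distinct words not seen in earlier paragraphs."""
    seen: set[str] = set()
    scores: list[float] = []
    for paragraph in re.split(r"\n\s*\n", report.strip()):
        words = set(re.findall(r"\w+", paragraph.lower()))
        if not words:
            continue
        scores.append(len(words - seen) / len(words))
        seen |= words
    return scores

report = (
    "Logs record every review.\n\n"
    "Auditing compares logs with criteria.\n\n"
    "Logs record every review with criteria."
)
print(novelty_per_paragraph(report))  # [1.0, 0.8, 0.0]
```

The third paragraph scores 0.0 because it only recombines words already used. A long run of near-zero scores is the kind of signal that could stop a report before it reaches the user.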

Why this is not just “writing too much”

A long answer is not a problem by itself. Deep Research exists precisely for complex tasks that require several steps. Google presents it as a feature that can explore complex topics, refine analysis, and produce reports with links to original sources. The problem begins when length is no longer connected to progress.

In a healthy report, every paragraph should do at least one of these things:

  • introduce a new idea;
  • qualify a previous idea;
  • compare sources;
  • explain a mechanism;
  • ground a practical implication;
  • close a section.

In the broken sample, many paragraphs do none of that. The text behaves as if it confused exhaustiveness with accumulation. “Longer” becomes “more correct”, and “more emphatic” becomes “more technical”.

That is the point where a research agent stops researching and starts filling space.

The likely mechanism: text degeneration

Language models generate text step by step. At each step, they calculate which token is likely next, conditioned by the prompt, the context, the retrieved sources, and the tokens already generated. That architecture enables fluent answers, but it can also create loops.

The paper “The Curious Case of Neural Text Degeneration” showed that certain decoding strategies can produce repetitive, generic, or strange text even when the base model is strong. Later, “Learning to Break the Loop” analyzed how repetition can become self-reinforcing: the more often a sentence or pattern appears in context, the easier it is for the model to keep following it. The “Repetition Neurons” work adds another perspective: some internal activations appear to be associated with continuing repeated patterns.

Translated to this case: once the report starts overusing intensifiers such as “estricto”, “crítico”, “ineludible”, “único”, or “asertivamente”, those words leave a trace in the generated context. The output itself becomes part of the prompt for the next token. If the system does not correct, penalize, or cut the pattern, repetition feeds itself.

[Figure: Lexical repetition loop where a repeated word increases the probability of further repetitions.]
In a repetition loop, previous output contaminates the next decision: generated text becomes context for repeating more.
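
The feedback effect can be shown with a toy decoder. This is not how Gemini decodes: the four-word vocabulary, the base scores, the `boost` that rewards tokens already in context, and the count-based `penalty` (similar in spirit to the frequency penalties some decoding APIs expose) are all assumptions made for illustration.

```python
VOCAB = ["evidence", "source", "asertivamente", "criterion"]
# Assumed base scores: nearly balanced, with a slight edge for the intensifier.
BASE_LOGITS = {"evidence": 2.0, "source": 1.75, "asertivamente": 2.25, "criterion": 1.5}

def next_token(context: list[str], boost: float, penalty: float) -> str:
    """Greedy choice: each earlier occurrence adds `boost` and subtracts `penalty`."""
    def score(token: str) -> float:
        seen = context.count(token)
        return BASE_LOGITS[token] + boost * seen - penalty * seen
    return max(VOCAB, key=score)

def generate(steps: int, boost: float, penalty: float = 0.0) -> list[str]:
    context: list[str] = []
    for _ in range(steps):
        # The token just produced becomes part of the context for the next step.
        context.append(next_token(context, boost, penalty))
    return context

looped = generate(6, boost=0.5)               # self-reinforcement, no correction
damped = generate(6, boost=0.5, penalty=1.0)  # penalty outweighs the reinforcement
print(looped)
print(damped)
```

Without a penalty, the first small advantage locks in and every step picks the same word. Once the penalty outweighs the reinforcement, the other tokens get their turn again.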

Why Deep Research can make the failure more visible

Deep Research increases the failure surface because it turns a query into an agentic task. There is a plan, several searches, page reading, note accumulation, synthesis, and writing. That design is powerful, but it also creates pressure at five points.

| Pressure point | What can go wrong | Visible signal |
| --- | --- | --- |
| Planning | The plan asks for too much exhaustiveness or too many layers. | The report tries to cover everything without prioritizing. |
| Retrieval | Too many similar sources, notes, or fragments enter the context. | Repeated ideas appear with tiny variations. |
| Context compression | Intermediate notes are summarized poorly. | Emphasis words survive while structure disappears. |
| Final writing | The generator interprets “deep” as “longer and more solemn”. | Grandiose, redundant, low-actionability sentences. |
| Quality control | No automatic cut-off for repetition or low density. | The loop continues until the answer is consumed. |

Google’s long-context documentation is useful here: a huge context window unlocks new use cases, but it does not remove every limitation. In tasks with many pieces of information to retrieve, performance can vary, and reviewing the entire context has a cost. In agentic workflows, accumulated state is useful, but it can also carry errors forward.
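
The retrieval pressure point is the easiest one to illustrate outside the product. A minimal sketch of filtering near-duplicate notes before they reach the synthesis step, assuming notes are plain strings; the Jaccard measure, the 0.8 threshold, and the function names are choices made here, not anything documented for Deep Research.

```python
import re

def _tokens(note: str) -> set[str]:
    return set(re.findall(r"\w+", note.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Shared words divided by all words across both notes."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def drop_near_duplicates(notes: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a note only if it is not too similar to any note already kept."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for note in notes:
        tokens = _tokens(note)
        if all(jaccard(tokens, other) < threshold for other in kept_tokens):
            kept.append(note)
            kept_tokens.append(tokens)
    return kept

notes = [
    "Audit logs must be reviewed weekly.",
    "Audit logs must be reviewed weekly!",
    "Validation signals come from source coverage.",
]
print(drop_near_duplicates(notes))
```

Here the second note is dropped because it shares every word with the first. Fewer redundant fragments in context means fewer repeated patterns for the writer stage to latch onto.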

The role of Spanish

The fact that the failure happens in Spanish does not mean Spanish is the problem. But Spanish makes this kind of collapse more visible.

There are practical reasons:

  • Technical Spanish accepts long chains of nouns and adjectives without immediately breaking grammar.
  • Words like “de”, “y”, “o”, “del”, and “la” can connect phrases for a very long time.
  • Formal synonyms are abundant: “crítico”, “severo”, “drástico”, “ineludible”, “restrictivo”, “perentorio”.
  • Grandiose technical writing can look valid for a few lines before revealing that it says nothing.

That is why the error looks almost baroque. The model does not produce random characters. It produces degenerated formal Spanish: many words make local sense, but the whole loses function.

Where I would place technical responsibility

From the outside, we cannot know whether the failure belongs exactly to the Gemini 3.1 Pro base model, the Deep Research agent, the final report composer, a style policy, or a combination of layers. That is why a simplistic conclusion is not useful.

A better approach is to separate responsibilities:

  1. Base model: it may be prone to repeating patterns under certain decoding or context conditions.
  2. Research orchestrator: it may accumulate too many notes, repeat concepts, or fail to clean intermediate context.
  3. System prompt: it may reward exhaustiveness, formality, or total coverage without enough concision limits.
  4. Decoding: it may allow a high-probability pattern to dominate for too many tokens.
  5. Post-processing: it may not detect obvious repetition before delivering the report.
  6. User interface: it may not offer a clean recovery path, such as “the answer degraded, retry from the last healthy section”.

In long agentic systems, quality does not depend on one model call. It depends on the whole circuit.

How to detect it before wasting time

There are early signals that let you cut the task:

  • The same technical word appears three or more times in a few lines without a real need.
  • Every sentence adds adjectives but no data.
  • The report redefines what it already defined without changing the angle.
  • Connectors and intensifiers multiply: “therefore”, “irrevocably”, “as a consequence”, “under strict rigor”.
  • Sources stop appearing or no longer support the claims.
  • The answer keeps an authoritative tone while failing to advance.

A practical rule: if a research paragraph cannot be summarized as one new idea, it is probably filler.
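
The first signal in the list can be checked mechanically while the report streams in. A small sketch, assuming the text arrives as a list of lines; the window size, the minimum word length used to skip short function words, and the function name are illustrative choices.

```python
import re
from collections import Counter

def repeated_words(lines: list[str], window: int = 3,
                   min_count: int = 3, min_length: int = 5) -> dict[int, list[str]]:
    """Map the start index of each window of lines to the long words repeated in it."""
    flagged: dict[int, list[str]] = {}
    for start in range(max(1, len(lines) - window + 1)):
        chunk = " ".join(lines[start:start + window]).lower()
        counts = Counter(w for w in re.findall(r"\w+", chunk) if len(w) >= min_length)
        hits = sorted(w for w, n in counts.items() if n >= min_count)
        if hits:
            flagged[start] = hits
    return flagged

lines = [
    "The audit compares sources.",
    "The criterion is strict, strict and critical.",
    "A strict rule, under strict rigor, is strict.",
]
print(repeated_words(lines, window=2))  # {1: ['strict']}
```

A flag on one window is only a hint, since some terms legitimately recur. Flags on several consecutive windows are a reasonable moment to stop reading and retry with a narrower request.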

[Figure: Checklist for detecting and recovering a Deep Research report that enters repetition.]
Recovery does not start by repeating the same query: it starts by reducing scope, format, and stylistic freedom.

How to ask for reports with less risk

Prompt shape matters. It does not eliminate the problem, but it reduces the chance of triggering inflated output.

A safer prompt would be:

Research the topic and first deliver a source table with title, URL, date, and usefulness. Then write a report of up to 1,200 words. Use direct sentences. Do not use grandiose language. Avoid repeating concepts. If a section repeats previous ideas, condense it or remove it. Each section must add a new idea, verifiable data, or a practical implication. If evidence is missing, say so.

For longer reports, split the work:

  1. Research plan.
  2. Source table.
  3. Executive summary.
  4. Section-by-section development.
  5. Critical review.
  6. Final version.

The key is not to ask for “an exhaustive report” without constraints. In generative models, “exhaustive” can become “infinite”, “formal” can become “pompous”, and “detailed” can become “repetitive”.
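
For anyone driving a model through code rather than a chat window, the split can be sketched as a loop that validates each stage before moving on. Everything here is hypothetical: `generate` stands for whatever model call is available, `fake_generate` is a stub that degrades on purpose, the 1,200-word cap from the sample prompt is applied per stage, and the 0.4 threshold is an assumed starting point.

```python
from typing import Callable

# Stage names follow the list above; everything else is illustrative.
STAGES = [
    "Research plan",
    "Source table",
    "Executive summary",
    "Section-by-section development",
    "Critical review",
    "Final version",
]

def distinct_word_ratio(text: str) -> float:
    """Distinct words divided by total words; low values suggest a loop."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def run_staged_report(
    topic: str,
    generate: Callable[[str], str],  # hypothetical model call, injected by the caller
    max_words: int = 1200,           # assumed per-stage cap
    min_ratio: float = 0.4,          # assumed threshold, tune on real reports
) -> dict[str, str]:
    """Run the stages in order and stop at the first draft that looks degraded."""
    accepted: dict[str, str] = {}
    for stage in STAGES:
        prompt = (
            f"Topic: {topic}\n"
            f"Stage: {stage}\n"
            f"Use direct sentences. Do not repeat earlier stages. "
            f"Stay under {max_words} words."
        )
        draft = generate(prompt)
        too_long = len(draft.split()) > max_words
        if too_long or distinct_word_ratio(draft) < min_ratio:
            break  # keep the healthy stages; the caller can retry this one
        accepted[stage] = draft
    return accepted

def fake_generate(prompt: str) -> str:
    """Stand-in for a real model: degrades on the executive summary."""
    if "Stage: Executive summary" in prompt:
        return "asertivamente " * 50
    return "Each stage returns a short and distinct draft."

result = run_staged_report("log auditing", fake_generate)
print(list(result))  # ['Research plan', 'Source table']
```

The point of the design is that a failure costs one stage instead of the whole report: the plan and the source table survive and the retry starts from there.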

What a system like Deep Research should do better

A robust research agent should have visible and automatic defenses:

| Defense | What it controls | Expected result |
| --- | --- | --- |
| Repetition detector | Repeated words, n-grams, and phrases. | Cut or regenerate before delivering garbage. |
| Information-density meter | New ideas per paragraph. | Reduce filler. |
| Plan validation | Every section must answer part of the plan. | Avoid drift. |
| Source coverage | Important claims connected to sources. | Preserve traceability. |
| Block-by-block delivery | The user validates sections before the full report. | Prevent one final failure from ruining everything. |
| Recovery button | Retry from the last healthy block. | Save time. |
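
Two of these defenses, the repetition detector and recovery from the last healthy block, fit in a few lines. A minimal sketch, assuming the report is already split into blocks; the trigram size, the 0.3 cut-off, and the function names are assumptions rather than anything the product is known to use.

```python
import re

def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Share of word n-grams that repeat an n-gram already present in the same text."""
    words = re.findall(r"\w+", text.lower())
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1 - len(set(ngrams)) / len(ngrams)

def last_healthy_block(blocks: list[str], max_ratio: float = 0.3) -> int:
    """Index of the block just before the first degraded one (-1 if the first block fails)."""
    for index, block in enumerate(blocks):
        if repeated_ngram_ratio(block) > max_ratio:
            return index - 1
    return len(blocks) - 1

blocks = [
    "Logs record each review and feed the audit.",
    "The audit compares logs against control criteria.",
    "única asertivamente " * 10,
]
print(last_healthy_block(blocks))  # 1
```

The third block consists of two alternating words, so almost all of its trigrams are repeats and the function points back to block 1 as the place to resume.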

This matters because Deep Research is positioned as a time-saving feature. If the user has to read thousands of words to discover that the report broke, the promise reverses: automation creates review debt.

How Nicolás Torres would review it

I would treat this failure the way I would treat a generation pipeline that returns corrupted output after several intermediate steps: I would not blame only the final screen. I would review the whole flow.

First, I would isolate the sample:

  • original prompt;
  • Deep Research plan;
  • sources used;
  • exact point where repetition starts;
  • report length;
  • language;
  • selected model;
  • whether files or private sources were attached;
  • whether the report was exported or generated inside the interface.

Then I would run three controlled retries:

  1. Same topic, short output: to see whether the base content can be summarized well.
  2. Same topic, structured format: table, bullets, conclusions, and no long prose.
  3. Same topic, another model or no Deep Research: to separate base-model behavior from agentic orchestration.

If the error appears only in the long Deep Research report, the likely cause sits in the combination of accumulated context, synthesis, and output control.
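
The reading of those retries can be written down as a small decision rule. This sketch encodes only the reasoning above plus two labeled assumptions: a failure without Deep Research points toward the base model or decoding, and a failure in short or structured Deep Research output points toward orchestration. The `RetryResult` type and its field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RetryResult:
    label: str
    uses_deep_research: bool
    long_prose: bool
    degenerated: bool

def likely_layer(results: list[RetryResult]) -> str:
    """Rough triage of where the failure most likely sits."""
    failed = [r for r in results if r.degenerated]
    if not failed:
        return "not reproduced"
    if all(r.uses_deep_research and r.long_prose for r in failed):
        return "accumulated context, synthesis, and output control"
    if any(not r.uses_deep_research for r in failed):
        return "base model or decoding"      # assumption, see lead-in
    return "orchestration or retrieval"      # assumption, see lead-in

runs = [
    RetryResult("original long report", True, True, True),
    RetryResult("short output", True, False, False),
    RetryResult("structured format", True, False, False),
    RetryResult("no Deep Research", False, True, False),
]
print(likely_layer(runs))
```

Three retries are a small sample, so the result is a direction for further testing, not a verdict.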

Frequently Asked Questions

What error is visible in the Gemini 3.1 Pro response?
The visible error is output degeneration: the report stops adding new information and starts repeating synonyms, connectors, and adjectives until it becomes a huge block of useless text.
Is this the same as hallucination?
Not exactly. A hallucination invents facts. Here the main problem is self-reinforcing generation: the model repeats plausible linguistic forms while losing its objective, structure, and verifiable content.
Why can this happen in Deep Research?
Deep Research combines planning, search, reading many sources, synthesis, and long-form writing. If context compression, output style, or decoding enters a repetitive pattern, the system can carry that failure through many lines.
Does this mean Gemini 3.1 Pro is not useful for research?
No. It means long agentic tasks need extra controls: length limits, repetition validation, source checks, section-by-section delivery, and the ability to stop when the report loses information density.
How can I reduce the risk when asking for long reports?
Ask for a plan and source table first, limit each section, demand direct language, ban ornamental repetition, and instruct the model to stop if it detects redundancy or missing evidence.
