Research Agent Pattern: search, verify, cite

Use a bounded research pipeline: search, read, extract facts, and synthesize with citations, without tool spam or infinite loops.
On this page
  1. Pattern Essence
  2. Problem
  3. Solution
  4. How It Works
  5. In Code, It Looks Like This
  6. How It Looks During Runtime
  7. When It Fits - And When It Doesn't
  8. Good Fit
  9. Not a Fit
  10. How It Differs From RAG
  11. When To Use Research (vs Other Patterns)
  12. How To Combine With Other Patterns
  13. In Short
  14. Pros and Cons
  15. FAQ
  16. What Next

Pattern Essence

Research Agent is a pattern where the agent runs controlled research through a bounded pipeline: search, dedupe, policy-check, read, extract notes with provenance, and synthesize an answer only from verified materials.

When to use it: when you need to collect and verify facts from multiple sources, rather than answer without evidence.


This is not "just browsing".

A research pipeline usually contains:

  • Search + URL dedupe: remove duplicates before reading
  • Read within budget: read pages within time and source limits
  • Extract facts: build structured notes
  • Verify claims: spot-check key claims against quotes
  • Synthesize with references: write the final answer with citations

Problem

Imagine you ask:

"Find the rules in the new law and explain briefly."

The agent "googled" and returned a conclusion, but without a clear source trail.

Then typical risks appear:

  • shallow reading (many opened, few actually read)
  • duplicates of the same materials
  • mixing facts with author interpretation
  • weak or fabricated citations
  • zero reproducibility of results

Open-web research without process quickly turns into chaos: many steps, little evidence.

That is the core problem: without a controlled pipeline it is hard to prove the conclusion is based on real and verified sources.

Solution

Research Agent works through a bounded pipeline, not through "search a bit more".

Analogy: this is like investigative journalism with a checklist. First you gather sources and notes with links, then you write conclusions. Without this, it is easy to mix facts with assumptions.

Core principle: the writer gets permission to synthesize only after extracted notes with provenance exist.

Controlled process:

  1. Search: find a limited set of sources
  2. Dedupe: normalize URLs and remove duplicates
  3. Policy check: pass sources through the policy gate
  4. Read: read only allowed pages
  5. Extract: create notes with provenance (url + quote)
  6. Verify: check key claims
  7. Synthesize: write the answer only from notes

This lets you attach to the answer:

  • which pages were read
  • which quotes were extracted
  • which claims were verified
  • why the pipeline stopped (stop reason)

Works well if:

  • the search step has strict limits (max_urls, max_seconds)
  • the read step passes the policy check
  • the extract step carries full provenance
  • execution forbids synthesis without notes

Otherwise the agent may:

  • cite sources it never read
  • mix unverified facts
  • fabricate citations to sound convincing

That is why you need budget caps, dedupe + cache, guarded execution, and stop rules against infinite search loops.
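The stop rules can be made concrete as a small budget tracker that records an explicit stop reason. This is a minimal sketch; the class name `ResearchBudget` and the reason strings are illustrative, not a fixed API.

```python
import time

class ResearchBudget:
    """Track hard limits and record why the pipeline stopped."""

    def __init__(self, max_urls: int = 10, max_seconds: float = 90.0):
        self.max_urls = max_urls
        self.max_seconds = max_seconds
        self.urls_read = 0
        self.started_at = time.monotonic()
        self.stop_reason = None  # set when a limit is hit

    def allow_read(self) -> bool:
        """Return True if another page may be read; otherwise set stop_reason."""
        if self.urls_read >= self.max_urls:
            self.stop_reason = "max_urls_reached"
            return False
        if time.monotonic() - self.started_at > self.max_seconds:
            self.stop_reason = "max_seconds_reached"
            return False
        return True

    def record_read(self) -> None:
        self.urls_read += 1
```

The point is that the loop exits with a named reason ("max_urls_reached", "max_seconds_reached") that can be attached to the final answer, rather than with a vague "enough searching".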

How It Works


Critical principle: the writer must not invent sources. It works only with extracted notes.

Full flow: Search → Dedupe → Policy Check → Read → Extract → Verify → Synthesize

Search
One or two controlled search steps with limits on time and URL count.

Dedupe
URLs are normalized and duplicates removed before reading.
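A minimal sketch of the dedupe step, using only the standard library. The set of tracking parameters is an illustrative assumption; real pipelines usually maintain a longer list.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters that never change page content
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "ref", "fbclid"}

def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivial variants compare equal."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        urlencode(query),
        "",  # drop the fragment: it never changes the fetched content
    ))

def dedupe_and_normalize(urls: list[str]) -> list[str]:
    """Normalize, then keep the first occurrence of each canonical URL."""
    seen, out = set(), []
    for url in urls:
        canon = normalize_url(url)
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out
```

With this, `https://Example.com/a/?utm_source=x` and `https://example.com/a` count as one source, so the effective source count in the final answer stays honest.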

Policy check
Only allowed domains, content types, and safe source risk levels are processed.
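One possible shape for the policy gate, assuming a domain allowlist. The domains and blocked extensions here are placeholders, not a recommendation.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist and blocklist for illustration only
ALLOWED_DOMAINS = {"europa.eu", "eur-lex.europa.eu", "example.org"}
BLOCKED_EXTENSIONS = (".exe", ".zip", ".dmg")

def policy_allow(url: str) -> bool:
    """Gate: allowlisted domains over HTTPS only, no binary downloads."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = parts.netloc.lower()
    # Accept the domain itself and its subdomains
    if not any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS):
        return False
    if parts.path.lower().endswith(BLOCKED_EXTENSIONS):
        return False
    return True
```

Rejections should be logged with the URL and the failed rule, so the final report can explain why a source was skipped.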

Read
Pages are fetched through cache to avoid re-reading the same content.
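A sketch of the caching wrapper. The HTTP client is injected as a callable so the cache logic can be tested without network access; in a real agent it would be a guarded fetch tool.

```python
# Module-level cache: one fetch per URL per run
_PAGE_CACHE: dict = {}

def fetch_with_cache(url, fetch):
    """Return cached page text; call `fetch` at most once per URL."""
    if url not in _PAGE_CACHE:
        _PAGE_CACHE[url] = fetch(url)
    return _PAGE_CACHE[url]
```

Besides saving money, the cache makes reruns reproducible: the same URL always yields the same text within a session.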

Extract
Facts are stored in structured form: url, quote, claims, timestamp.
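The note structure can be sketched as a dataclass. The toy extractor below is a stand-in: in a real agent the model extracts the quote and claims, but the provenance fields are what the pipeline enforces.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ResearchNote:
    """One extracted fact with full provenance."""
    url: str
    quote: str                      # verbatim excerpt from the page
    claims: list = field(default_factory=list)
    fetched_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def extract_structured_note(goal: str, page_text: str, url: str) -> ResearchNote:
    """Toy extractor: keep the first sentence sharing a word with the goal."""
    goal_words = set(goal.lower().split())
    for sentence in page_text.split("."):
        if goal_words & set(sentence.lower().split()):
            quote = sentence.strip() + "."
            return ResearchNote(url=url, quote=quote, claims=[quote])
    return ResearchNote(url=url, quote="", claims=[])
```

An empty `quote` marks a page that was read but yielded nothing usable; such notes should not feed synthesis.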

Verify
Basic spot-check: are key claims supported by page quotes.
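A minimal spot-check sketch: sample a few notes and confirm that each quoted excerpt literally appears on its source page. Notes are plain dicts here for brevity; the check itself is the point.

```python
import random

def spot_check_claims(notes, pages, sample_size: int = 2):
    """Verify a sample of notes: the quote must actually appear on its page.

    `pages` maps url -> fetched page text.
    """
    sample = random.sample(notes, min(sample_size, len(notes)))
    verified = []
    for note in sample:
        page_text = pages.get(note["url"], "")
        if note["quote"] and note["quote"] in page_text:
            verified.append(note)
    return verified
```

Exact substring matching is deliberately strict: it cannot prove a claim true, but it catches the cheapest failure mode, a citation pointing at text that was never on the page.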

Synthesize
Final answer is written only from notes and includes explicit citations.
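The synthesis contract can be sketched as follows. In practice the model writes the prose; what the pipeline enforces is that every claim carries a citation marker and every marker maps to a URL that was actually read. The output format here is an assumption.

```python
def synthesize_from_notes(goal: str, notes) -> str:
    """Assemble an answer strictly from notes, with numbered citations.

    `notes` is a list of dicts with at least `url` and `quote` keys.
    """
    if not notes:
        # Hard rule: no notes, no answer
        raise ValueError("refusing to synthesize without notes")
    lines = [f"Answer to: {goal}", ""]
    for i, note in enumerate(notes, start=1):
        lines.append(f"- {note['quote']} [{i}]")
    lines.append("")
    lines.append("Sources:")
    for i, note in enumerate(notes, start=1):
        lines.append(f"[{i}] {note['url']}")
    return "\n".join(lines)
```

Raising on an empty note list is the code-level form of the core principle above: the writer never runs without evidence.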

In Code, It Looks Like This

PYTHON
def research(goal: str):
    budget = {"max_urls": 10, "max_seconds": 90}

    urls = search_once(goal, k=8)
    urls = dedupe_and_normalize(urls)[: budget["max_urls"]]

    notes = []
    for url in urls:
        if budget_exceeded(budget):
            break  # stop reason: budget exhausted

        if not policy_allow(url):
            continue  # skip reason can be logged

        page = fetch_with_cache(url)
        note = extract_structured_note(goal, page, url=url)
        notes.append(note)

    if not notes:
        # no evidence -> partial answer or escalation, never fabrication
        return partial_or_escalate("no_reliable_sources")

    verified = spot_check_claims(notes, sample_size=2)
    return synthesize_from_notes(goal, notes, verified=verified)

What matters here is not "beautiful prompts", but controlled execution: budget, dedupe, cache, stop reasons.

How It Looks During Runtime

TEXT
Goal: What are the EU AI Act restrictions for high-risk systems?

Search:
- found 12 URLs
- after dedupe: 7 unique

Read/Extract:
- 5 pages fetched successfully
- 2 rejected due to low relevance

Verify:
- 2 key claims passed spot-check

Synthesize:
- short summary generated
- citations added from 3 sources

Full Research agent example


When It Fits - And When It Doesn't

Good Fit

  • ✅ External sources are required and citation is needed: a research agent can search, read, and cite sources.
  • ✅ Topic is dynamic and the internal base is insufficient: runtime search can retrieve current data from the open web or external sources.
  • ✅ Conclusion provenance is required: you can explicitly show where key facts and claims came from.
  • ✅ Execution controls exist (budgets and tool rules): bounded research controls cost and reduces tool-loop risk.

Not a Fit

  • ❌ Data already exists in RAG: external search is unnecessary; internal retrieval is enough.
  • ❌ Critical latency path: search and page reading are usually more expensive than local generation.
  • ❌ No policy-safe pipeline for browsing: safe control of tools and domains is not guaranteed.

Research mode is almost always more expensive than local generation or RAG.

How It Differs From RAG

  • Sources: RAG uses an internal index or knowledge base; Research Agent uses the open web or external sources.
  • Focus: RAG aims for a fast grounded answer; Research Agent does search, reading, and fact verification.
  • Cost control: RAG is relatively stable; Research Agent needs strict budget caps.
  • Main risk: weak retrieval for RAG; tool loops and fabricated citations for Research Agent.

RAG works on a prepared knowledge layer. Research Agent acquires knowledge externally at runtime.

When To Use Research (vs Other Patterns)

Use Research Agent when you need to gather facts from multiple sources and consolidate them into structured evidence.

Quick test:

  • if you need to "research a topic across sources and provide evidence-based conclusion" -> Research
  • if you need to "analyze an already provided dataset" -> Data Analysis Agent
Comparison with other patterns and examples

Quick cheat sheet:

  • After each step you need to decide what to do next: ReAct Agent
  • You first need to split a large goal into smaller executable tasks: Task Decomposition Agent
  • You need to execute code, validate results, and iterate safely: Code Execution Agent
  • You need to analyze data and return conclusions based on the analysis: Data Analysis Agent
  • You need multi-source research with structured evidence: Research Agent

Examples:

ReAct: "Find root cause of API outage: check logs -> inspect errors -> run next check based on result".

Task Decomposition: "Prepare launch of a new plan: break into subtasks for content, engineering, QA, and support".

Code Execution: "Calculate 12-month retention in Python and validate formula correctness on real data".

Data Analysis: "Analyze sales CSV: find trends, outliers, and provide short conclusions".

Research: "Collect data on 5 competitors from multiple sources and prepare comparative summary".

How To Combine With Other Patterns

  • Research + RAG: verified external findings are stored in internal knowledge base for later answers.
  • Research + Guarded-Policy: policies limit allowed tools, domains, and data types.
  • Research + Fallback-Recovery: on unstable search/fetch, the agent retries or switches to fallback sources.
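The Fallback-Recovery combination can be sketched as a fetch wrapper that retries each source before moving to the next. The retry counts and the broad `Exception` catch are simplifications; production code would catch specific network errors and add backoff.

```python
import time

def fetch_with_fallback(url, fetchers, retries: int = 2, delay: float = 0.0):
    """Try each fetcher in order; retry transient failures before giving up.

    `fetchers` is an ordered list of callables, e.g. primary and mirror clients.
    """
    last_error = None
    for fetch in fetchers:
        for _attempt in range(retries):
            try:
                return fetch(url)
            except Exception as exc:  # narrow this in real code
                last_error = exc
                time.sleep(delay)
    raise RuntimeError(f"all sources failed for {url}") from last_error
```

The final `RuntimeError` should surface as a stop reason in the research report rather than as a silent gap in the sources.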

In Short

Quick take

Research Agent:

  • runs bounded search and source reading
  • extracts structured notes with provenance
  • verifies key claims before answering
  • returns cited answer without infinite review loops

Pros and Cons

Pros

  • collects data from multiple sources in one flow
  • adds links, so answers are easier to verify
  • covers topics better than a single source
  • good for comparing facts and versions

Cons

  • runs slower due to search and reading
  • without limits, it can spend too many resources
  • answer quality depends on source quality

FAQ

Q: Can we search "until confidence"?
A: No. Confidence is not a stop condition. You need explicit limits: max_urls, max_seconds, and stagnation/convergence rules.

Q: Why is URL dedupe important?
A: Without it, the agent pays to re-read the same content and distorts the effective source count.

Q: Is it enough to just add citations at the end?
A: No. Citations must come from extracted notes, not be generated "for appearance".

What Next

Research Agent covers open-world search and citation.

Now that you know the core patterns, next question is: how to combine them in a real system? When to use deterministic workflow, and when a flexible agent? And how do they work together?

Hybrid Workflow + Agent ->

⏱️ 11 min read • Updated Mar 2026 • Difficulty: ★★★
Practical continuation

Pattern implementation examples

Continue with implementation using example projects.

Integrated: production control with OnceOnly
Add guardrails to tool-calling agents
Ship this pattern with governance:
  • Budgets (steps / spend caps)
  • Tool permissions (allowlist / blocklist)
  • Kill switch & incident stop
  • Idempotency & dedupe
  • Audit logs & traceability
Integrated mention: OnceOnly is a control layer for production agent systems.
Author

This documentation is curated and maintained by engineers who ship AI agents in production.

The content is AI-assisted, with human editorial responsibility for accuracy, clarity, and production relevance.

Patterns and recommendations are grounded in post-mortems, failure modes, and operational incidents in deployed systems, including during the development and operation of governance infrastructure for agents at OnceOnly.