Pattern Essence
Research Agent is a pattern where the agent runs controlled research through a bounded pipeline: search, dedupe, policy-check, read, extract notes with provenance, and synthesize an answer only from verified materials.
When to use it: when you need to collect and verify facts from multiple sources rather than answering without evidence.
This is not "just browse".
A research pipeline usually contains:
- Search + dedupe URLs: remove duplicates before reading
- Read within budget: read pages within time and source limits
- Extract facts: build structured notes
- Verify claims/citations: basic check of key claims
- Synthesize with references: write the final answer with citations
Problem
Imagine you ask:
"Find the rules in the new law and explain briefly."
The agent "googled" and returned a conclusion, but without a clear source trail.
Then typical risks appear:
- shallow reading (many opened, few actually read)
- duplicates of the same materials
- mixing facts with author interpretation
- weak or fabricated citations
- zero reproducibility of results
Open-web research without process quickly turns into chaos: many steps, little evidence.
That is the core problem: without a controlled pipeline it is hard to prove the conclusion is based on real and verified sources.
Solution
Research Agent works through a bounded pipeline, not through "search a bit more".
Analogy: this is like investigative journalism with a checklist. First you gather sources and notes with links, then you write conclusions. Without this, it is easy to mix facts with assumptions.
Core principle: the writer is allowed to synthesize only after notes with provenance have been extracted.
Controlled process:
- Search: find a limited set of sources
- Dedupe: normalize URLs and remove duplicates
- Policy check: pass sources through the policy gate
- Read: read only allowed pages
- Extract: create notes with provenance (url + quote)
- Verify: check key claims
- Synthesize: write the answer only from notes
This lets you attach to the answer:
- which pages were read
- which quotes were extracted
- which claims were verified
- why the pipeline stopped (stop reason)
Works well if:
- the search step has strict limits (max_urls, max_seconds)
- the read step passes the policy check
- the extract step carries full provenance
- execution forbids synthesis without notes
Otherwise the agent may:
- cite sources it never read
- mix in unverified facts
- fabricate citations to sound convincing
That is why you need budget caps, dedupe + cache, guarded execution, and stop rules against infinite search loops.
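A budget cap can be a small object the pipeline checks before every read. The sketch below is a minimal illustration; `ResearchBudget` and its method names are hypothetical, not a real API.

```python
import time

class ResearchBudget:
    """Tracks hard caps for one research run (illustrative names)."""

    def __init__(self, max_urls=10, max_seconds=90):
        self.max_urls = max_urls
        self.max_seconds = max_seconds
        self.urls_read = 0
        self.started = time.monotonic()

    def exceeded(self):
        # Stop when either the URL cap or the time cap is hit.
        return (self.urls_read >= self.max_urls
                or time.monotonic() - self.started >= self.max_seconds)

    def charge_url(self):
        self.urls_read += 1

budget = ResearchBudget(max_urls=3, max_seconds=90)
for url in ["https://a.example", "https://b.example",
            "https://c.example", "https://d.example"]:
    if budget.exceeded():
        break               # stop reason: budget
    budget.charge_url()

print(budget.urls_read)  # 3: the fourth URL is never read
```

The key design choice is that the budget is checked *before* each read, so a run can never overshoot its cap by more than zero pages.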
How It Works
Critical principle: the writer must not invent sources; it works only with extracted notes.
Full flow: Search → Dedupe → Policy Check → Read → Extract → Verify → Synthesize
Search
One or two controlled search steps with limits on time and URL count.
Dedupe
URLs are normalized and duplicates removed before reading.
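URL normalization is mostly mechanical: lowercase the host, drop fragments and tracking parameters, strip trailing slashes. A minimal sketch (the `TRACKING_PREFIXES` list is an illustrative assumption):

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of tracking parameters to strip during normalization.
TRACKING_PREFIXES = ("utm_", "fbclid", "gclid")

def normalize_url(url: str) -> str:
    """Lowercase scheme/host, drop fragment and tracking params, trim slash."""
    parts = urlsplit(url)
    query = "&".join(
        p for p in parts.query.split("&")
        if p and not p.startswith(TRACKING_PREFIXES)
    )
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, query, ""))

def dedupe(urls):
    """Keep the first URL for each normalized form."""
    seen, unique = set(), []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

urls = [
    "https://Example.com/law/",
    "https://example.com/law?utm_source=x",
    "https://example.com/law#section-2",
]
print(len(dedupe(urls)))  # 1: all three point at the same page
```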
Policy check
Only allowed domains, content types, and safe source risk levels are processed.
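A policy gate can be a pure function over the URL, content type, and a source risk score. The sketch below is hypothetical: the `POLICY` dict, the risk scale, and the example domains are assumptions for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical policy: domain allowlist, content-type allowlist, max risk level.
POLICY = {
    "allowed_domains": {"eur-lex.europa.eu", "europa.eu"},
    "allowed_content_types": {"text/html", "application/pdf"},
    "max_risk": 1,   # assumed scale: 0 = official, 1 = reputable, 2 = unknown
}

def policy_allow(url, content_type, risk, policy=POLICY):
    """Return True only if the source passes every policy rule."""
    host = urlsplit(url).netloc.lower()
    return (host in policy["allowed_domains"]
            and content_type in policy["allowed_content_types"]
            and risk <= policy["max_risk"])

print(policy_allow("https://eur-lex.europa.eu/doc", "text/html", 0))  # True
print(policy_allow("https://random.blog/post", "text/html", 2))       # False
```

Keeping the gate a pure function makes skip reasons easy to log and the policy easy to test in isolation.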
Read
Pages are fetched through cache to avoid re-reading the same content.
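A page cache can be as simple as a dict keyed by URL, with the transport injected so the sketch stays network-free. `CachedFetcher` below is an illustrative sketch, not a real library class; production code would persist the cache and set TTLs.

```python
class CachedFetcher:
    """In-memory page cache keyed by URL (illustrative sketch)."""

    def __init__(self, fetch_fn):
        self._fetch = fetch_fn   # injected transport, e.g. a real HTTP client
        self._cache = {}
        self.misses = 0          # counts actual network fetches

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.misses += 1
            self._cache[url] = self._fetch(url)
        return self._cache[url]

# Stub transport standing in for a real HTTP client.
def fake_fetch(url):
    return f"<html>content of {url}</html>"

fetcher = CachedFetcher(fake_fetch)
fetcher.get("https://example.com/law")
fetcher.get("https://example.com/law")  # second call is served from cache
print(fetcher.misses)  # 1
```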
Extract
Facts are stored in structured form: url, quote, claims, timestamp.
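The note schema above maps naturally onto a small dataclass. The field names mirror the text (url, quote, claims, timestamp); everything else is an illustrative assumption:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Note:
    """One extracted fact with full provenance (illustrative schema)."""
    url: str
    quote: str                 # verbatim text copied from the page
    claims: list[str]          # claims this quote supports
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

note = Note(
    url="https://example.com/ai-act",
    quote="High-risk systems must undergo conformity assessment.",
    claims=["High-risk AI systems require conformity assessment"],
)
print(note.url)
```

Storing the verbatim quote alongside the claim is what makes the later verify step possible.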
Verify
Basic spot check: are key claims supported by page quotes?
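One cheap spot check is sampling a few notes and confirming that each stored quote literally appears on its source page. A naive substring version (function name and dict shapes are assumptions; real verification would be fuzzier):

```python
import random

def spot_check_claims(notes, pages, sample_size=2, seed=0):
    """Check that each sampled note's quote actually appears on its page.

    `notes` is a list of dicts with url/quote keys; `pages` maps url -> raw text.
    """
    rng = random.Random(seed)   # seeded for reproducible sampling
    sample = rng.sample(notes, min(sample_size, len(notes)))
    results = []
    for note in sample:
        page_text = pages.get(note["url"], "")
        results.append({"url": note["url"],
                        "supported": note["quote"] in page_text})
    return results

notes = [
    {"url": "https://a.example", "quote": "fines up to 6% of turnover"},
    {"url": "https://b.example", "quote": "applies from 2026"},
]
pages = {
    "https://a.example": "Violations carry fines up to 6% of turnover.",
    "https://b.example": "The regulation applies from 2026 onward.",
}
print(all(r["supported"] for r in spot_check_claims(notes, pages)))  # True
```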
Synthesize
Final answer is written only from notes and includes explicit citations.
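Synthesis from notes can be enforced structurally: the writer only sees the notes and emits numbered citations derived from them. The sketch below skips the LLM call and just shows the citation bookkeeping (all names are hypothetical):

```python
def synthesize_from_notes(goal, notes):
    """Assemble an answer only from notes, with numbered citations (sketch).

    A real implementation would call an LLM constrained to the notes;
    here we simply list the claims with [n] source markers.
    """
    lines = [f"Answer to: {goal}"]
    sources = []
    for note in notes:
        if note["url"] not in sources:
            sources.append(note["url"])
        ref = sources.index(note["url"]) + 1
        lines.append(f"- {note['claim']} [{ref}]")
    lines.append("Sources:")
    lines.extend(f"[{i}] {url}" for i, url in enumerate(sources, 1))
    return "\n".join(lines)

notes = [
    {"url": "https://a.example",
     "claim": "High-risk systems need conformity assessment"},
    {"url": "https://a.example",
     "claim": "Providers must keep technical documentation"},
]
answer = synthesize_from_notes("EU AI Act high-risk rules", notes)
print("[1]" in answer)  # True: both claims cite the same source
```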
In Code, It Looks Like This
```python
budget = {"max_urls": 10, "max_seconds": 90}

urls = search_once(goal, k=8)
urls = dedupe_and_normalize(urls)[: budget["max_urls"]]

notes = []
for url in urls:
    if budget_exceeded(budget):
        break                    # stop reason: budget
    if not policy_allow(url):
        continue                 # skip reason: policy (can be logged)
    page = fetch_with_cache(url)
    note = extract_structured_note(goal, page, url=url)
    notes.append(note)

if not notes:
    return partial_or_escalate("no_reliable_sources")

verified = spot_check_claims(notes, sample_size=2)
answer = synthesize_from_notes(goal, notes, verified=verified)
return answer
```
What matters here is not "beautiful prompts", but controlled execution: budget, dedupe, cache, stop reasons.
How It Looks During Runtime
Goal: What are the EU AI Act restrictions for high-risk systems?
Search:
- found 12 URLs
- after dedupe: 7 unique
Read/Extract:
- 5 pages fetched successfully
- 2 rejected due to low relevance
Verify:
- 2 key claims passed spot-check
Synthesize:
- short summary generated
- citations added from 3 sources
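A trace like the one above is easy to emit if each stage writes its counters into a run record. A minimal sketch with an assumed trace shape:

```python
# Hypothetical run record populated by each pipeline stage.
trace = {
    "goal": "EU AI Act restrictions for high-risk systems",
    "search": {"found": 12, "after_dedupe": 7},
    "read": {"fetched": 5, "rejected": 2},
    "verify": {"claims_checked": 2, "passed": 2},
    "synthesize": {"sources_cited": 3},
    "stop_reason": "completed",
}

def summarize(trace):
    """One-line run summary for logs or audits."""
    s, r = trace["search"], trace["read"]
    return (f"{s['found']} found -> {s['after_dedupe']} unique -> "
            f"{r['fetched']} read -> "
            f"{trace['synthesize']['sources_cited']} cited "
            f"({trace['stop_reason']})")

print(summarize(trace))  # 12 found -> 7 unique -> 5 read -> 3 cited (completed)
```

Because the trace is plain data, it can be attached to the final answer as the evidence trail the pattern promises.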
Full Research agent example
When It Fits - And When It Doesn't
Good Fit
| | Situation | Why Research Fits |
|---|---|---|
| ✅ | External sources are required and citations are needed | A research agent can search, read, and cite sources. |
| ✅ | The topic is dynamic and the internal base is insufficient | Runtime search can retrieve current data from the open web or external sources. |
| ✅ | Conclusion provenance is required | You can explicitly show where key facts and claims came from. |
| ✅ | Execution controls exist: budgets and tool rules | Bounded research controls cost and reduces tool-loop risk. |
Not a Fit
| | Situation | Why Research Does Not Fit |
|---|---|---|
| ❌ | Data already exists in RAG | External search is unnecessary; internal retrieval is enough. |
| ❌ | Critical latency path | Search and page reading are usually more expensive than local generation. |
| ❌ | No policy-safe pipeline for the browse process | Safe control of tools and domains is not guaranteed. |
Research mode is almost always more expensive than local generation or RAG, so reserve it for tasks that genuinely need external evidence.
How It Differs From RAG
| | RAG | Research Agent |
|---|---|---|
| Sources | Internal index/knowledge base | Open web or external sources |
| Focus | Fast grounded answer | Search, reading, and fact verification |
| Cost control | Relatively stable | Needs strict budget caps |
| Main risk | Weak retrieval | Tool loops and fake citation |
RAG works on a prepared knowledge layer. Research Agent acquires knowledge externally at runtime.
When To Use Research (vs Other Patterns)
Use Research Agent when you need to gather facts from multiple sources and consolidate them into structured evidence.
Quick test:
- if you need to "research a topic across sources and provide evidence-based conclusion" -> Research
- if you need to "analyze an already provided dataset" -> Data Analysis Agent
Comparison with other patterns and examples
Quick cheat sheet:
| If task looks like this... | Use |
|---|---|
| After each step you need to decide what to do next | ReAct Agent |
| You first need to split a large goal into smaller executable tasks | Task Decomposition Agent |
| You need to execute code, validate results, and iterate safely | Code Execution Agent |
| You need to analyze data and return conclusions based on analysis | Data Analysis Agent |
| You need multi-source research with structured evidence | Research Agent |
Examples:
ReAct: "Find root cause of API outage: check logs -> inspect errors -> run next check based on result".
Task Decomposition: "Prepare launch of a new plan: break into subtasks for content, engineering, QA, and support".
Code Execution: "Calculate 12-month retention in Python and validate formula correctness on real data".
Data Analysis: "Analyze sales CSV: find trends, outliers, and provide short conclusions".
Research: "Collect data on 5 competitors from multiple sources and prepare comparative summary".
How To Combine With Other Patterns
- Research + RAG: verified external findings are stored in internal knowledge base for later answers.
- Research + Guarded-Policy: policies limit allowed tools, domains, and data types.
- Research + Fallback-Recovery: on unstable search/fetch, the agent retries or switches to fallback sources.
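The Research + Fallback-Recovery combination can be as simple as trying source functions in order with bounded retries. An illustrative sketch (names, error handling, and the stop-reason string are assumptions):

```python
def fetch_with_fallback(url, sources, max_attempts=2):
    """Try each source function in order with bounded retries.

    Returns the first success, or a stop reason when everything fails.
    """
    for fetch in sources:
        for attempt in range(max_attempts):
            try:
                return {"ok": True, "content": fetch(url)}
            except IOError:
                continue   # transient failure: retry, then fall back
    return {"ok": False, "stop_reason": "all_sources_failed"}

# Stubs standing in for a flaky primary source and a cached mirror.
def flaky(url):
    raise IOError("timeout")

def mirror(url):
    return f"cached copy of {url}"

result = fetch_with_fallback("https://example.com/law", [flaky, mirror])
print(result["ok"])  # True: the mirror source succeeded
```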
In Short
Research Agent:
- runs bounded search and source reading
- extracts structured notes with provenance
- verifies key claims before answering
- returns cited answer without infinite review loops
Pros and Cons
Pros
- collects data from multiple sources in one flow
- adds links, so answers are easier to verify
- covers topics better than a single source
- good for comparing facts and versions
Cons
- runs slower due to search and reading
- without limits it can spend too many resources
- answer quality depends on source quality
FAQ
Q: Can we search "until confidence"?
A: No. Confidence is not a stop condition. You need explicit limits: max_urls, max_seconds, and stagnation/convergence rules.
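Those stop rules can be expressed as a tiny predicate over per-round progress. A sketch, where `history` counts the new notes gained in each search round (function name and shape are hypothetical):

```python
def should_stop(history, max_rounds=3, stagnation_rounds=2):
    """Stop on a hard round cap, or when recent rounds yield nothing new."""
    if len(history) >= max_rounds:
        return True   # hard cap reached
    recent = history[-stagnation_rounds:]
    # Stagnation: the last N completed rounds produced zero new notes.
    return len(recent) == stagnation_rounds and sum(recent) == 0

print(should_stop([5, 0, 0]))  # True: cap of 3 rounds reached
print(should_stop([0, 0]))     # True: two stagnant rounds in a row
print(should_stop([5, 2]))     # False: still making progress
```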
Q: Why is URL dedupe important?
A: Without it, the agent pays to re-read the same content and distorts the effective source count.
Q: Is it enough to just add citations at the end?
A: No. Citations must come from extracted notes, not be generated "for appearance".
What Next
Research Agent covers open-world search and citation.
Now that you know the core patterns, the next question is how to combine them in a real system: when to use a deterministic workflow, when a flexible agent, and how they work together.