
Do You Even Need a Vector Database Anymore?
Skip embeddings. Skip chunking. Build a full-document RAG system using Claude’s 200k context window and tool calling. Perfect for small datasets, policy manuals, research papers, and internal assistants.
Let me start with something slightly uncomfortable.
What if you don’t need a vector database for half the RAG apps people are building?
For the last year, the default advice has been:
- Split documents
- Chunk into 1,000 tokens
- Generate embeddings
- Store in a vector database
- Retrieve top-k
- Feed into the model
That works. But it’s not always necessary.
With Claude 3.x long-context models (200k token window), you can often skip the entire embedding pipeline for small-to-medium documents.
In this tutorial, we’ll build:
A “Chat with 200k Token PDF” system
No vector DB
No chunking
No embedding layer
Just:
PDF → full context → structured queries → grounded answers
You’ll be able to build this in under an hour.
Why this matters
Chunking creates real problems:
- Context gets split mid-paragraph
- Tables break
- Cross-section reasoning fails
- Retrieval misses relevant sections
- Infrastructure becomes heavier than necessary
If your dataset is:
- A single 100–150 page policy manual
- One legal agreement
- A research paper
- An internal handbook
- A single RFP document
And it fits inside ~200k tokens…
You can directly load the whole thing into Claude.
No retrieval layer required.
When this approach makes sense
Use long-context RAG without embeddings when:
- You have 1–5 documents
- Total size < 200k tokens
- You need cross-section reasoning
- You want fast MVP development
- You want simpler infrastructure
This is ideal for:
- Internal knowledge assistants
- Policy Q&A systems
- Legal draft analyzers
- Research summarizers
- SOP assistants
What we’re building
Minimal system:
- Upload PDF
- Extract text
- Send entire document to Claude
- Ask structured questions
- Get grounded answers
No chunking logic.
No vector store.
No retrieval ranking.
Tool stack
- Python 3.10+
- Anthropic SDK
- PyPDF
- Claude 3.5 Sonnet (200k context)
Install dependencies:

    pip install anthropic pypdf python-dotenv

Set your API key (Mac/Linux):

    export ANTHROPIC_API_KEY=your_key_here

On Windows:

    setx ANTHROPIC_API_KEY your_key_here

Restart your terminal after setting it.
Step 1 — Extract full PDF text
Create pdf_loader.py:
    from pypdf import PdfReader

    def load_pdf_text(path: str) -> str:
        """Extract text from every page of a text-based PDF."""
        reader = PdfReader(path)
        text_parts = []
        for page in reader.pages:
            text = page.extract_text()
            if text:
                text_parts.append(text)
        return "\n".join(text_parts)
Usage:
    document_text = load_pdf_text("sample.pdf")
    print(len(document_text))
Important: this works only for text-based PDFs. If the PDF contains scanned images, you’ll need OCR (e.g., Tesseract) first.
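Before sending anything, measure the document against the context window. A quick sketch, using the rough heuristic of about 4 characters per token for English text (an approximation; use a real tokenizer or the API's token counting for exact numbers):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use a real tokenizer or the API's token counting.
    return len(text) // 4

# Stand-in for load_pdf_text("sample.pdf")
document_text = "word " * 1000
tokens = estimate_tokens(document_text)
print(tokens)

if tokens > 180_000:  # leave headroom below the 200k window
    print("Document may not fit; trim it or fall back to retrieval.")
```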
Step 2 — Send entire document to Claude
Create chat_pdf.py:
    import os

    from anthropic import Anthropic

    from pdf_loader import load_pdf_text

    client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    document_text = load_pdf_text("sample.pdf")

    def ask_question(question: str) -> str:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=1000,
            temperature=0,
            system="You answer questions strictly using the provided document.",
            messages=[
                {
                    "role": "user",
                    "content": f"""Here is the full document:

    {document_text}

    Question:
    {question}

    Instructions:
    - Answer only from the document
    - Quote sections where relevant
    - If the answer is not found, say 'Not found in document'
    """,
                }
            ],
        )
        return response.content[0].text
Now test it:
    print(ask_question("What are the eligibility criteria?"))
That’s your minimal long-context RAG system.
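If you want to test the grounding instructions without burning API calls, one option (a refactor sketch; the helper name is my own, not from the code above) is to pull the prompt assembly into a pure function:

```python
def build_grounded_prompt(document_text: str, question: str) -> str:
    # Assemble the grounded-QA prompt: document, question, grounding rules.
    return (
        "Here is the full document:\n\n"
        f"{document_text}\n\n"
        f"Question:\n{question}\n\n"
        "Instructions:\n"
        "- Answer only from the document\n"
        "- Quote sections where relevant\n"
        "- If the answer is not found, say 'Not found in document'\n"
    )

prompt = build_grounded_prompt(
    "Refunds are issued within 30 days of purchase.",
    "What is the refund window?",
)
print(prompt)
```

`ask_question` can then pass `build_grounded_prompt(document_text, question)` as the user message content.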
Step 3 — Add structured extraction with tool use
Now we make it more powerful.
Instead of free-form answers, we extract structured data.
Example: extract
- Key points
- Important dates
- Risks
Define tool schema:
    tools = [
        {
            "name": "extract_summary",
            "description": "Extract structured summary from document",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key_points": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "important_dates": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "risks": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
                "required": ["key_points"],
            },
        }
    ]
Call Claude with tool enforcement:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        tools=tools,
        tool_choice={"type": "tool", "name": "extract_summary"},
        messages=[
            {
                "role": "user",
                "content": f"Extract structured summary from this document:\n\n{document_text}",
            }
        ],
    )

    # With a forced tool choice, the first content block is a tool_use block
    tool_use = response.content[0]
    print(tool_use.input)  # dict matching the input_schema
Now Claude:
- Reads full document
- Calls your tool
- Returns structured JSON
No retrieval pipeline involved.
Real example workflow
Let’s say you upload a 120-page policy manual.
You can now ask:
- "List all penalty clauses and cite section numbers."
- "Extract every numeric threshold in this document."
- "Compare Section 3 and Section 7 for contradictions."
- "Summarize all compliance requirements."
Because the full document is in memory, Claude can reason across sections.
Chunk-based RAG often struggles with cross-chapter logic.
That’s the core advantage here.
Cost tradeoff
If your document is 150k tokens, every query sends all 150k tokens. That gets expensive fast.

This approach is:
- Infrastructure-simple
- Compute-heavy per query

Embedding-based RAG is:
- Infrastructure-heavy
- Compute-lighter per query

Choose based on scale.
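To make "choose based on scale" concrete, here is a back-of-envelope sketch. The per-million-token prices are placeholder assumptions, not current Anthropic pricing; substitute real numbers before deciding:

```python
# Placeholder prices in USD per million tokens (assumptions, not real pricing)
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def query_cost(input_tokens: int, output_tokens: int) -> float:
    # Estimated USD cost of a single query
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Full 150k-token document every query vs. ~4k tokens of retrieved chunks
print(f"full-context: ${query_cost(150_000, 1_000):.3f}/query")
print(f"retrieval:    ${query_cost(4_000, 1_000):.3f}/query")
```

At these assumed prices the full-context query costs roughly 17x more per call; that recurring cost is what you trade for the simpler infrastructure.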
Common beginner mistakes
- Ignoring token count: always measure document length.
- Not forcing grounding: tell the model to say "Not found in document" if the answer is missing.
- Leaving temperature high: use temperature=0 for document QA.
- Using messy PDF extraction: clean headers and artifacts before sending.
- Overbuilding too early: start simple; add retrieval only if needed.
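For the messy-extraction point, a minimal cleanup pass might look like this. The two rules here (dropping standalone page-number lines, collapsing blank-line runs) are assumptions to tune per document, not a general-purpose cleaner:

```python
import re

def clean_pdf_text(text: str) -> str:
    # Drop lines that are just page numbers ("12", "Page 12"),
    # strip surrounding whitespace, and collapse runs of blank lines.
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"(page\s+)?\d+", stripped, flags=re.IGNORECASE):
            continue  # likely a page number, not content
        lines.append(stripped)
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))

raw = "Policy Manual\n\n12\n\nSection 1: Scope\nPage 13\nAll employees must..."
print(clean_pdf_text(raw))
```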
When NOT to use this
Do not use long-context-only RAG if:
- You have hundreds of documents
- Corpus updates frequently
- You need low-latency queries
- You exceed token window
- Cost becomes prohibitive
This is best for small corpora and fast MVPs.
Final takeaway
Vector databases are a tool, not a requirement.
If your entire knowledge base fits inside 200k tokens, try long-context reasoning first.
Build the simplest thing that works.
Then scale only when the problem demands it.