
Do You Even Need a Vector Database Anymore?
Skip embeddings. Skip chunking. Build a full-document RAG system using Claude’s 200k context window and tool calling. Perfect for small datasets, policy manuals, research papers, and internal assistants.
Let me start with something slightly uncomfortable.
What if you don’t need a vector database for half the RAG apps people are building?
For the last year, the default advice has been:
- Split documents
- Chunk into 1,000 tokens
- Generate embeddings
- Store in a vector database
- Retrieve top-k
- Feed into the model
That works. But it’s not always necessary.
With Claude 3.x long-context models (200k token window), you can often skip the entire embedding pipeline for small-to-medium documents.
In this tutorial, we’ll build:
A “Chat with 200k Token PDF” system
No vector DB
No chunking
No embedding layer
Just:
PDF → full context → structured queries → grounded answers
You’ll be able to build this in under an hour.
Why this matters
Chunking creates real problems:
- Context gets split mid-paragraph
- Tables break
- Cross-section reasoning fails
- Retrieval misses relevant sections
- Infrastructure becomes heavier than necessary
If your dataset is:
- A single 100–150 page policy manual
- One legal agreement
- A research paper
- An internal handbook
- A single RFP document
And it fits inside ~200k tokens…
You can directly load the whole thing into Claude.
No retrieval layer required.
When this approach makes sense
Use long-context RAG without embeddings when:
- You have 1–5 documents
- Total size < 200k tokens
- You need cross-section reasoning
- You want fast MVP development
- You want simpler infrastructure
This is ideal for:
- Internal knowledge assistants
- Policy Q&A systems
- Legal draft analyzers
- Research summarizers
- SOP assistants
What we’re building
Minimal system:
- Upload PDF
- Extract text
- Send entire document to Claude
- Ask structured questions
- Get grounded answers
No chunking logic.
No vector store.
No retrieval ranking.
Tool stack
- Python 3.10+
- Anthropic SDK
- PyPDF
- Claude 3.5 Sonnet (200k context)
Install dependencies:

    pip install anthropic pypdf python-dotenv

Set your API key (Mac/Linux):

    export ANTHROPIC_API_KEY=your_key_here

On Windows:

    setx ANTHROPIC_API_KEY your_key_here

Restart your terminal after setting it.
Step 1 — Extract full PDF text
Create pdf_loader.py:
    from pypdf import PdfReader

    def load_pdf_text(path: str) -> str:
        """Extract text from every page of a text-based PDF."""
        reader = PdfReader(path)
        text_parts = []
        for page in reader.pages:
            text = page.extract_text()
            if text:
                text_parts.append(text)
        return "\n".join(text_parts)
Usage:
    document_text = load_pdf_text("sample.pdf")
    print(len(document_text))
Important: this works only for text-based PDFs. If the PDF contains scanned images, you’ll need OCR (e.g., Tesseract) first.
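Before sending anything, measure the document against the context window. A quick sketch, using the rough heuristic of about 4 characters per token for English text (an approximation; use a real tokenizer or the API's token counting for exact numbers):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # For exact counts, use a real tokenizer or the API's token counting.
    return len(text) // 4

# Stand-in for load_pdf_text("sample.pdf")
document_text = "word " * 1000
tokens = estimate_tokens(document_text)
print(tokens)

if tokens > 180_000:  # leave headroom below the 200k window
    print("Document may not fit; trim it or fall back to retrieval.")
```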
Step 2 — Send entire document to Claude
Create chat_pdf.py:
    import os

    from anthropic import Anthropic

    from pdf_loader import load_pdf_text

    client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
    document_text = load_pdf_text("sample.pdf")

    def ask_question(question: str) -> str:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=1000,
            temperature=0,
            system="You answer questions strictly using the provided document.",
            messages=[
                {
                    "role": "user",
                    "content": f"""Here is the full document:

    {document_text}

    Question:
    {question}

    Instructions:
    - Answer only from the document
    - Quote sections where relevant
    - If the answer is not found, say 'Not found in document'
    """,
                }
            ],
        )
        return response.content[0].text
Now test it:
    print(ask_question("What are the eligibility criteria?"))
That’s your minimal long-context RAG system.
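If you want to test the grounding instructions without burning API calls, one option (a refactor sketch; the helper name is my own, not from the code above) is to pull the prompt assembly into a pure function:

```python
def build_grounded_prompt(document_text: str, question: str) -> str:
    # Assemble the grounded-QA prompt: document, question, grounding rules.
    return (
        "Here is the full document:\n\n"
        f"{document_text}\n\n"
        f"Question:\n{question}\n\n"
        "Instructions:\n"
        "- Answer only from the document\n"
        "- Quote sections where relevant\n"
        "- If the answer is not found, say 'Not found in document'\n"
    )

prompt = build_grounded_prompt(
    "Refunds are issued within 30 days of purchase.",
    "What is the refund window?",
)
print(prompt)
```

`ask_question` can then pass `build_grounded_prompt(document_text, question)` as the user message content.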
Step 3 — Add structured extraction with tool use
Now we make it more powerful.
Instead of free-form answers, we extract structured data.
Example: extract
- Key points
- Important dates
- Risks
Define tool schema:
    tools = [
        {
            "name": "extract_summary",
            "description": "Extract structured summary from document",
            "input_schema": {
                "type": "object",
                "properties": {
                    "key_points": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "important_dates": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                    "risks": {
                        "type": "array",
                        "items": {"type": "string"},
                    },
                },
                "required": ["key_points"],
            },
        }
    ]
Call Claude with tool enforcement:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        tools=tools,
        tool_choice={"type": "tool", "name": "extract_summary"},
        messages=[
            {
                "role": "user",
                "content": f"Extract structured summary from this document:\n\n{document_text}",
            }
        ],
    )

    # With a forced tool choice, the first content block is a tool_use block
    tool_use = response.content[0]
    print(tool_use.input)  # dict matching the input_schema
Now Claude:
- Reads full document
- Calls your tool
- Returns structured JSON
No retrieval pipeline involved.
Real example workflow
Let’s say you upload a 120-page policy manual.
You can now ask:
- "List all penalty clauses and cite section numbers."
- "Extract every numeric threshold in this document."
- "Compare Section 3 and Section 7 for contradictions."
- "Summarize all compliance requirements."
Because the full document is in memory, Claude can reason across sections.
Chunk-based RAG often struggles with cross-chapter logic.
That’s the core advantage here.
Cost tradeoff
If your document is 150k tokens, every query sends all 150k tokens. That gets expensive fast.

This approach is:
- Infrastructure-simple
- Compute-heavy per query

Embedding-based RAG is:
- Infrastructure-heavy
- Compute-lighter per query

Choose based on scale.
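To make "choose based on scale" concrete, here is a back-of-envelope sketch. The per-million-token prices are placeholder assumptions, not current Anthropic pricing; substitute real numbers before deciding:

```python
# Placeholder prices in USD per million tokens (assumptions, not real pricing)
INPUT_PRICE_PER_M = 3.00
OUTPUT_PRICE_PER_M = 15.00

def query_cost(input_tokens: int, output_tokens: int) -> float:
    # Estimated USD cost of a single query
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Full 150k-token document every query vs. ~4k tokens of retrieved chunks
print(f"full-context: ${query_cost(150_000, 1_000):.3f}/query")
print(f"retrieval:    ${query_cost(4_000, 1_000):.3f}/query")
```

At these assumed prices the full-context query costs roughly 17x more per call; that recurring cost is what you trade for the simpler infrastructure.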
Common beginner mistakes
- Ignoring token count: always measure document length.
- Not forcing grounding: tell the model to say "Not found in document" if the answer is missing.
- Leaving temperature high: use temperature=0 for document QA.
- Using messy PDF extraction: clean headers and artifacts before sending.
- Overbuilding too early: start simple; add retrieval only if needed.
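For the messy-extraction point, a minimal cleanup pass might look like this. The two rules here (dropping standalone page-number lines, collapsing blank-line runs) are assumptions to tune per document, not a general-purpose cleaner:

```python
import re

def clean_pdf_text(text: str) -> str:
    # Drop lines that are just page numbers ("12", "Page 12"),
    # strip surrounding whitespace, and collapse runs of blank lines.
    lines = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"(page\s+)?\d+", stripped, flags=re.IGNORECASE):
            continue  # likely a page number, not content
        lines.append(stripped)
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))

raw = "Policy Manual\n\n12\n\nSection 1: Scope\nPage 13\nAll employees must..."
print(clean_pdf_text(raw))
```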
When NOT to use this
Do not use long-context-only RAG if:
- You have hundreds of documents
- Corpus updates frequently
- You need low-latency queries
- You exceed token window
- Cost becomes prohibitive
This is best for small corpora and fast MVPs.
Final takeaway
Vector databases are a tool, not a requirement.
If your entire knowledge base fits inside 200k tokens, try long-context reasoning first.
Build the simplest thing that works.
Then scale only when the problem demands it.