Do You Even Need a Vector Database Anymore?
Public · March 2, 2026 · 4 min read


Skip embeddings. Skip chunking. Build a full-document RAG system using Claude’s 200k context window and tool calling. Perfect for small datasets, policy manuals, research papers, and internal assistants.



Let me start with something slightly uncomfortable.

What if you don’t need a vector database for half the RAG apps people are building?

For the last year, the default advice has been:

  • Split documents
  • Chunk into 1,000 tokens
  • Generate embeddings
  • Store in a vector database
  • Retrieve top-k
  • Feed into the model

That works. But it’s not always necessary.

With Claude 3.x long-context models (200k token window), you can often skip the entire embedding pipeline for small-to-medium documents.

In this tutorial, we’ll build:

A “Chat with 200k Token PDF” system
No vector DB
No chunking
No embedding layer

Just:

PDF → full context → structured queries → grounded answers

You’ll be able to build this in under an hour.


Why this matters

Chunking creates real problems:

  • Context gets split mid-paragraph
  • Tables break
  • Cross-section reasoning fails
  • Retrieval misses relevant sections
  • Infrastructure becomes heavier than necessary

If your dataset is:

  • A single 100–150 page policy manual
  • One legal agreement
  • A research paper
  • An internal handbook
  • A single RFP document

And it fits inside ~200k tokens…

You can directly load the whole thing into Claude.

No retrieval layer required.
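How do you know whether a document fits? A rough rule of thumb is ~4 characters per token for English prose; this sketch uses that heuristic (the exact count depends on Claude's tokenizer, so leave headroom):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English prose.
    # The real count depends on the tokenizer, so treat this as approximate.
    return len(text) // 4

def fits_in_context(text: str, window: int = 200_000, headroom: int = 20_000) -> bool:
    # Reserve headroom for the system prompt, the question, and the reply.
    return estimate_tokens(text) <= window - headroom

# Example: a 300-page manual at ~2,000 characters per page
sample = "x" * 600_000
print(estimate_tokens(sample))   # 150000
print(fits_in_context(sample))   # True
```

If the estimate is anywhere near the window, measure properly before committing to this architecture.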


When this approach makes sense

Use long-context RAG without embeddings when:

  • You have 1–5 documents
  • Total size < 200k tokens
  • You need cross-section reasoning
  • You want fast MVP development
  • You want simpler infrastructure

This is ideal for:

  • Internal knowledge assistants
  • Policy Q&A systems
  • Legal draft analyzers
  • Research summarizers
  • SOP assistants

What we’re building

Minimal system:

  1. Upload PDF
  2. Extract text
  3. Send entire document to Claude
  4. Ask structured questions
  5. Get grounded answers

No chunking logic.
No vector store.
No retrieval ranking.


Tool stack

  • Python 3.10+
  • Anthropic SDK
  • PyPDF
  • Claude 3.5 Sonnet (200k context)

Install dependencies:

pip install anthropic pypdf python-dotenv

Set your API key (Mac/Linux):

export ANTHROPIC_API_KEY=your_key_here

On Windows:

setx ANTHROPIC_API_KEY your_key_here

Restart your terminal after setting it.
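A quick sanity check saves a confusing auth error later. This hypothetical helper just verifies the variable is set before any API call:

```python
import os

def require_api_key(var: str = "ANTHROPIC_API_KEY") -> str:
    # Fail fast with a clear message instead of an opaque 401 later.
    key = os.getenv(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before running.")
    return key
```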


Step 1 — Extract full PDF text

Create pdf_loader.py:

from pypdf import PdfReader

def load_pdf_text(path: str) -> str:
    reader = PdfReader(path)
    text_parts = []

    for page in reader.pages:
        text = page.extract_text()
        if text:
            text_parts.append(text)

    return "\n".join(text_parts)

Usage:

document_text = load_pdf_text("sample.pdf")
print(len(document_text))

Important:

If the PDF consists of scanned images, extract_text() will return little or nothing, and you’ll need OCR (e.g., Tesseract) first.
This approach works only for text-based PDFs.
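Before wiring up OCR, a cheap heuristic catches the scanned-image case: if extraction yields almost no text per page, the PDF is probably image-based. A minimal sketch (the 50-characters-per-page threshold is an assumption; tune it for your documents):

```python
def looks_scanned(text: str, num_pages: int, min_chars_per_page: int = 50) -> bool:
    # Text-based PDFs typically yield hundreds of characters per page;
    # scanned PDFs yield little or nothing from extract_text().
    if num_pages == 0:
        return False
    return len(text.strip()) / num_pages < min_chars_per_page
```

If this returns True for your file, run it through OCR before moving on to Step 2.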


Step 2 — Send entire document to Claude

Create chat_pdf.py:

import os
from anthropic import Anthropic
from pdf_loader import load_pdf_text

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

document_text = load_pdf_text("sample.pdf")

def ask_question(question: str):
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1000,
        temperature=0,
        system="You answer questions strictly using the provided document.",
        messages=[
            {
                "role": "user",
                "content": f"""
Here is the full document:

{document_text}

Question:
{question}

Instructions:
- Answer only from the document
- Quote sections where relevant
- If answer not found, say 'Not found in document'
"""
            }
        ]
    )

    return response.content[0].text

Now test it:

print(ask_question("What are the eligibility criteria?"))

That’s your minimal long-context RAG system.
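If you plan to ask many questions, it helps to factor the grounding prompt into a pure function you can test and tweak without an API call. A hypothetical refactor of the template above:

```python
def build_grounded_prompt(document_text: str, question: str) -> str:
    # Same template as ask_question, extracted so the grounding rules
    # live in one place and can be unit-tested offline.
    return (
        "Here is the full document:\n\n"
        f"{document_text}\n\n"
        f"Question:\n{question}\n\n"
        "Instructions:\n"
        "- Answer only from the document\n"
        "- Quote sections where relevant\n"
        "- If answer not found, say 'Not found in document'\n"
    )
```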


Step 3 — Add structured extraction with tool use

Now we make it more powerful.

Instead of free-form answers, we extract structured data.

Example: extract

  • Key points
  • Important dates
  • Risks

Define tool schema:

tools = [
    {
        "name": "extract_summary",
        "description": "Extract structured summary from document",
        "input_schema": {
            "type": "object",
            "properties": {
                "key_points": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "important_dates": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "risks": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["key_points"]
        }
    }
]
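Claude enforces the schema server-side, but validating the output on your end too catches drift early. A minimal pure-Python check against the schema's required fields and array types (a sketch; a library like jsonschema is the fuller option):

```python
def validate_tool_output(output: dict, schema: dict) -> list[str]:
    # Returns a list of problems; an empty list means the output passes.
    problems = []
    for field in schema.get("required", []):
        if field not in output:
            problems.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in output and spec.get("type") == "array":
            if not isinstance(output[field], list):
                problems.append(f"{field} should be an array")
    return problems
```

Run it on the tool input before persisting anything downstream.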

Call Claude with tool enforcement:

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1000,
    tools=tools,
    tool_choice={"type": "tool", "name": "extract_summary"},
    messages=[
        {
            "role": "user",
            "content": f"Extract structured summary from this document:\n\n{document_text}"
        }
    ]
)

# Pull the structured arguments out of the tool_use content block
for block in response.content:
    if block.type == "tool_use":
        print(block.input)

Now Claude:

  • Reads full document
  • Calls your tool
  • Returns structured JSON

No retrieval pipeline involved.


Real example workflow

Let’s say you upload a 120-page policy manual.

You can now ask:

  • "List all penalty clauses and cite section numbers."
  • "Extract every numeric threshold in this document."
  • "Compare Section 3 and Section 7 for contradictions."
  • "Summarize all compliance requirements."

Because the full document is in memory, Claude can reason across sections.

Chunk-based RAG often struggles with cross-chapter logic.

That’s the core advantage here.


Cost tradeoff

If your document is 150k tokens:

Every query sends 150k tokens.

That’s expensive.

This approach is:

Infrastructure simple
Compute heavy

Embedding-based RAG is:

Infrastructure heavy
Compute lighter per query

Choose based on scale.
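To make the tradeoff concrete: assuming an input price of $3 per million tokens (an assumption here; check current pricing), a 150k-token document costs roughly this per query:

```python
def cost_per_query(doc_tokens: int, price_per_mtok: float = 3.00) -> float:
    # Input-side cost only; output tokens are priced separately.
    # price_per_mtok is an assumed rate; check current pricing.
    return doc_tokens / 1_000_000 * price_per_mtok

# 150k-token document at the assumed $3/MTok input rate
print(round(cost_per_query(150_000), 2))  # 0.45
```

At 1,000 queries that is roughly $450 on input tokens alone, which is the point where prompt caching or an embedding pipeline starts to pay off.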


Common beginner mistakes

  1. Ignoring token count
    Always measure document length.

  2. Not forcing grounding
    Tell the model to say "Not found in document" if missing.

  3. Leaving temperature high
    Use temperature=0 for document QA.

  4. Using messy PDF extraction
    Clean headers and artifacts before sending.

  5. Overbuilding too early
    Start simple. Add retrieval only if needed.
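For mistake 4, even a crude cleanup pass helps: collapse whitespace and drop lines that repeat on every page, which are almost always running headers or footers. A minimal sketch (the repeat threshold is an assumption):

```python
import re
from collections import Counter

def clean_extracted_text(text: str, repeat_threshold: int = 5) -> str:
    lines = [line.strip() for line in text.splitlines()]
    # Lines that recur many times are likely running headers/footers.
    counts = Counter(line for line in lines if line)
    kept = [line for line in lines if line and counts[line] < repeat_threshold]
    # Collapse whitespace runs left over from PDF layout.
    return "\n".join(re.sub(r"\s+", " ", line) for line in kept)
```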


When NOT to use this

Do not use long-context-only RAG if:

  • You have hundreds of documents
  • Corpus updates frequently
  • You need low-latency queries
  • You exceed token window
  • Cost becomes prohibitive

This is best for small corpora and fast MVPs.


Final takeaway

Vector databases are a tool, not a requirement.

If your entire knowledge base fits inside 200k tokens, try long-context reasoning first.

Build the simplest thing that works.

Then scale only when the problem demands it.

