PDF to Markdown for RAG & LLM Ingestion

Q: Why convert PDFs to Markdown for RAG instead of using raw text?

Raw PDF text extraction loses structure: headings, lists and tables collapse and reading order breaks, which produces noisy chunks. Markdown keeps the structure, so chunking on headings and sections yields cleaner, more retrievable passages, and it uses far fewer tokens than raw dumps or HTML.

Short answer

Markdown is the clean input RAG wants

Embedding raw PDF text gives a retriever noisy chunks: headings, lists and tables collapse, reading order breaks, and binary/layout junk leaks in. Convert the PDF to Markdown first and the document structure survives – so you can chunk on headings and sections, keep tables and formulas as facts, and spend far fewer tokens per passage. Cleaner chunks in, better retrieval out.

Why it matters

What clean Markdown buys your pipeline

Clean chunk boundaries

Headings and sections survive, so splitting on structure produces coherent, self-contained chunks instead of mid-sentence cuts.

Fewer tokens

Plain Markdown is far cheaper to embed and to send as context than raw PDF dumps or HTML, so you index and retrieve more for less.

Tables stay facts

Columns become real Markdown tables, not jumbled lines, so tabular numbers remain retrievable rather than scrambled.

Formulas preserved

Mathematical notation is kept rather than flattened into garbled characters that pollute embeddings.

Scans become text

OCR turns image-only and scanned PDFs into selectable Markdown across many languages, so scans are indexable too.

Links & footnotes

Hyperlinks and footnotes carry over as Markdown links instead of being dropped, keeping references intact.

How to

Ingest a PDF in four steps

One predictable lifecycle over HTTPS with a bearer API key. The hosted MCP exposes the same steps as agent tools.

1. Create a job

POST the PDF URL (or uploaded bytes) to /api/v2/jobs. The response contains a job_id and a status.

2. Poll until ready

GET /api/v2/jobs/{id} until status is ready or error. On error, read error_code/error_message. On paid tiers, register a webhook instead of polling.

3. Fetch the Markdown

GET /api/v2/jobs/{id}/download. Check truncated and pages: truncated means the conversion hit its time budget and only part of a long document was returned.

4. Chunk & embed

Split on headings/sections, embed the chunks, and index them in your vector store. The Markdown structure keeps boundaries clean.

# 1. create a job from a PDF URL
curl -s -X POST https://pdf2md.dev/api/v2/jobs \
  -H "Authorization: Bearer p2m_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/paper.pdf"}'
# -> { "job_id": "...", "status": "queued" }

# 2. poll status until ready or error
curl -s https://pdf2md.dev/api/v2/jobs/JOB_ID \
  -H "Authorization: Bearer p2m_your_key"
# -> { "status": "ready", "pages": 24, "truncated": false }
#    on failure: { "status": "error", "error_code": "...", "error_message": "..." }

# 3. fetch the Markdown, then chunk + embed it
curl -s https://pdf2md.dev/api/v2/jobs/JOB_ID/download \
  -H "Authorization: Bearer p2m_your_key"

Prefer agent tools? The hosted MCP exposes the same lifecycle (pdf_to_markdown_create_job_from_url, _get_job, _get_markdown) so Claude, ChatGPT and agent frameworks can ingest PDFs with no local server.

Chunking

Chunking strategies for Markdown

Clean Markdown gives you natural seams to split on, so chunks stay coherent and retrievable.

Split on headings

Use the heading hierarchy as chunk boundaries so each chunk is a self-contained section with its own context.

Keep tables whole

Never split inside a Markdown table; half a table loses its meaning. Keep each table, and its heading, in one chunk.
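A minimal sketch of table-safe packing, assuming pipe-style Markdown tables whose rows all start with `|`. The function names and the `max_chars` budget are illustrative, not part of the converter. A heading that sits directly above a table is glued to it, and a table is never cut even when it exceeds the budget. Fenced code is not treated specially here.

```python
import re

HEADING_RE = re.compile(r"^#{1,6}\s")


def split_blocks(md: str) -> list[str]:
    """Split Markdown into blocks separated by blank lines."""
    blocks, current = [], []
    for line in md.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            blocks.append("\n".join(current))
            current = []
    if current:
        blocks.append("\n".join(current))
    return blocks


def is_table(block: str) -> bool:
    # Assumption: every row of a pipe table starts with "|".
    return all(line.lstrip().startswith("|") for line in block.splitlines())


def is_heading(block: str) -> bool:
    return "\n" not in block and bool(HEADING_RE.match(block))


def pack_keeping_tables(md: str, max_chars: int = 800) -> list[str]:
    """Greedily pack blocks into chunks without ever splitting a table."""
    blocks = split_blocks(md)
    units, i = [], 0
    while i < len(blocks):
        # Glue a heading to the table that immediately follows it.
        if is_heading(blocks[i]) and i + 1 < len(blocks) and is_table(blocks[i + 1]):
            units.append(blocks[i] + "\n\n" + blocks[i + 1])
            i += 2
        else:
            units.append(blocks[i])
            i += 1
    chunks, current = [], ""
    for unit in units:
        candidate = f"{current}\n\n{unit}" if current else unit
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = unit  # an oversized table still stays in one piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

An oversized table becomes its own oversized chunk rather than two broken halves, which is usually the better trade for retrieval.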

Add a little overlap

A small overlap of a sentence or two between adjacent chunks preserves context across boundaries and improves recall.
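One way to add that overlap, sketched under the assumption that chunks are plain strings and that a naive split on sentence-ending punctuation is good enough. The helper names are hypothetical, and the sentence splitter will misfire on abbreviations and on chunks that end in a table.

```python
import re


def last_sentences(text: str, n: int = 1) -> str:
    """Return the last n sentences of text (naive punctuation-based split)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[-n:])


def add_overlap(chunks: list[str], n: int = 1) -> list[str]:
    """Prefix every chunk except the first with the tail of its predecessor."""
    out = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            out.append(chunk)
        else:
            out.append(last_sentences(chunks[i - 1], n) + "\n\n" + chunk)
    return out
```

The overlap is taken from the original neighbour, not from the already-extended one, so context does not snowball across the document.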

Carry the heading path

Prepend the section heading (or the breadcrumb of headings) to each chunk so the embedding captures where it came from.
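Splitting on the heading hierarchy and carrying the breadcrumb can be sketched together. This assumes ATX-style headings (`#` to `######`) in the converted Markdown; the function name, the `" > "` separator and the returned dict shape are illustrative choices, not anything the converter prescribes.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_with_heading_path(md: str) -> list[dict]:
    """Split Markdown into sections, each prefixed with its heading breadcrumb."""
    chunks = []
    path = []   # stack of (level, title) for the current breadcrumb
    body = []   # lines collected for the current section
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            chunks.append({
                "path": crumb,
                "text": f"{crumb}\n\n{text}" if crumb else text,
            })
        body.clear()

    for line in md.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore "#" lines inside fenced code
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Pop headings at the same or a deeper level, then push this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Embedding the `text` field means a chunk about "Install it." is indexed as "Guide > Setup" plus its body, so a query that names the section can still match it.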

Right-size chunks

Aim for a few hundred tokens per chunk: too large dilutes relevance, too small loses context. Markdown structure makes this easy to tune.
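A rough sketch of size tuning, assuming sections are already split and that a whitespace word count is an acceptable stand-in for real token counts (an actual tokenizer will give different numbers). The thresholds and function names are illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Assumption: one whitespace-separated word is roughly one token.
    return len(text.split())


def right_size(sections: list[str], max_tokens: int = 300,
               min_tokens: int = 50) -> list[str]:
    """Break oversized sections on paragraph boundaries, then merge tiny pieces."""
    sized = []
    for section in sections:
        current, count = [], 0
        for para in section.split("\n\n"):
            n = estimate_tokens(para)
            if current and count + n > max_tokens:
                sized.append("\n\n".join(current))
                current, count = [], 0
            current.append(para)
            count += n
        if current:
            sized.append("\n\n".join(current))

    merged = []
    for piece in sized:
        # Fold undersized pieces into their predecessor; this can push the
        # predecessor slightly above max_tokens.
        if merged and estimate_tokens(piece) < min_tokens:
            merged[-1] = merged[-1] + "\n\n" + piece
        else:
            merged.append(piece)
    return merged
```

Merging a tiny piece backwards can cross a section boundary, so tune `min_tokens` down if section purity matters more than avoiding fragments.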

Drop boilerplate

Strip repeated headers and footers so they do not dominate embeddings; the converter already removes most layout noise.
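For whatever boilerplate remains, a simple frequency filter is one option. This sketch assumes running headers and footers show up as identical lines repeated several times, and that "Page N" or "Page N of M" lines are page furniture; the threshold and function name are illustrative.

```python
import re
from collections import Counter

PAGE_LINE = re.compile(r"page\s+\d+(\s+of\s+\d+)?", re.IGNORECASE)
RULE_LINE = re.compile(r"[-*_\s]+")


def drop_repeated_lines(md: str, min_repeats: int = 3) -> str:
    """Remove lines that repeat often enough to look like headers or footers."""
    lines = md.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())

    def is_boilerplate(line: str) -> bool:
        s = line.strip()
        if not s:
            return False
        # Table rows, code fences and horizontal rules repeat legitimately.
        if s.startswith("|") or s.startswith("```") or RULE_LINE.fullmatch(s):
            return False
        if PAGE_LINE.fullmatch(s):
            return True
        return counts[s] >= min_repeats

    return "\n".join(line for line in lines if not is_boilerplate(line))
```

Run it before chunking, and inspect what it removes on a sample document first, since short list items that genuinely repeat would also be dropped.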

In code

A minimal Python ingestion

Create, poll, fetch, then split on headings, ready to embed and index in your vector store.

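# pip install requests
import re
import time

import requests

API = "https://pdf2md.dev/api/v2"
H = {"Authorization": "Bearer p2m_your_key"}

# 1. create a job from a PDF URL
resp = requests.post(f"{API}/jobs", headers=H,
    json={"url": "https://example.com/paper.pdf"}, timeout=30)
resp.raise_for_status()
jid = resp.json()["job_id"]

# 2. poll until the job is ready or has failed
while True:
    j = requests.get(f"{API}/jobs/{jid}", headers=H, timeout=30).json()
    if j["status"] in ("ready", "error"):
        break
    time.sleep(3)
if j["status"] == "error":
    raise RuntimeError(f"{j.get('error_code')}: {j.get('error_message')}")

# 3. fetch the Markdown
md = requests.get(f"{API}/jobs/{jid}/download", headers=H, timeout=60).text
# naive split on top-level headings -> chunks ready to embed
chunks = re.split(r"\n(?=# )", md)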
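To make the "embed and index" step concrete, here is a toy, dependency-free sketch. The bag-of-words `toy_embed` is only a stand-in for a real embedding model, and the in-memory list is a stand-in for a vector store; every name below is hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse bag-of-words vector (NOT a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Pair each non-empty chunk with its vector."""
    return [(c, toy_embed(c)) for c in chunks if c.strip()]


def search(index: list[tuple[str, Counter]], query: str, k: int = 3):
    """Return the k (score, chunk) pairs most similar to the query."""
    q = toy_embed(query)
    scored = [(cosine(q, vec), chunk) for chunk, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

In a real pipeline, swap `toy_embed` for your embedding model's call and `build_index`/`search` for your vector store's upsert and query operations; the shape of the loop stays the same.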

Why tables retrieve better

A scrambled table embeds as noise, so the number you need rarely matches a query. In a real Markdown table, the rows and headers stay aligned, so tabular facts like prices, metrics and dates remain searchable and quotable in answers.

Polling vs webhooks

For a few documents, poll GET /api/v2/jobs/{id} every few seconds. For bulk or backend pipelines on a paid tier, register a webhook (or pass callback_url) and we POST to your endpoint when a job reaches ready or error, so you skip polling entirely.
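When polling, it helps to bound the wait on the client side. The helper below is a sketch: it takes any callable that returns the job status JSON, so it makes no assumptions about the HTTP client. The 900-second default simply mirrors the 15-minute free-tier budget mentioned in the FAQ, and the function name and parameters are illustrative.

```python
import time
from typing import Callable


def wait_for_job(fetch_status: Callable[[], dict], interval: float = 3.0,
                 timeout: float = 900.0, sleep=time.sleep,
                 clock=time.monotonic) -> dict:
    """Poll fetch_status() until the job is ready, fails, or the deadline passes."""
    deadline = clock() + timeout
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "ready":
            return job
        if status == "error":
            raise RuntimeError(
                f"{job.get('error_code')}: {job.get('error_message')}")
        if clock() >= deadline:
            raise TimeoutError("job did not finish before the client-side deadline")
        sleep(interval)


# Usage with requests (not executed here):
# job = wait_for_job(lambda: requests.get(
#     f"{API}/jobs/{jid}", headers=H, timeout=30).json())
```

Injecting `sleep` and `clock` keeps the helper easy to test without real waiting.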

Build it into your pipeline

The same converter is a REST API and a hosted MCP endpoint, with machine-readable discovery so scripts and agents can drive it directly.

Developer hub | OpenAPI | llms.txt

FAQ

Common questions

Why Markdown for RAG instead of raw PDF text?

Raw extraction loses structure – headings, lists and tables collapse and reading order breaks – which produces noisy chunks. Markdown keeps the structure, so chunking on headings yields cleaner, more retrievable passages, and it uses far fewer tokens than raw dumps or HTML.

Do tables and formulas survive?

Yes. Columns become real Markdown tables instead of jumbled lines, and mathematical notation is preserved rather than flattened, so numeric and tabular facts stay retrievable. More on extracting tables.

Can I automate it in a pipeline?

Yes. Use the REST API (create, poll, download) with a bearer API key, or the hosted MCP for agent frameworks. On paid tiers a webhook removes polling. There is nothing to host. See the Python tutorial for a full example.

How do I handle long documents?

Each conversion runs up to a per-tier time budget; a long document that exceeds it is returned partially with truncated=true instead of failing. Check truncated, and either split very large files before converting or use a higher tier with a longer budget.

Does it handle scanned PDFs for ingestion?

Yes. Image-only and scanned PDFs are OCR'd into selectable Markdown across many languages, so scans become retrievable text in your index.

How should I chunk the Markdown for RAG?

Split on the heading hierarchy so each chunk is a self-contained section, keep tables whole, add a small overlap between chunks, and prepend the section heading so the embedding knows the context. Aim for a few hundred tokens per chunk.

Why do Markdown tables retrieve better than raw text?

Raw extraction scrambles columns into noise, so the value you need rarely matches a query. A real Markdown table keeps rows and headers aligned, so tabular facts stay searchable and quotable.

Is there a free tier for testing?

Yes. Sign in with a Google account to get an API key and the hosted MCP on the free tier: 3 slots, 10 MB files, a 15-minute time budget and 1-hour retention. Paid tiers raise every limit and add webhooks and batch create.