PDF to Markdown for RAG
Turn PDFs into clean, chunk-friendly Markdown for retrieval and LLM ingestion – tables and formulas intact, scans OCR'd, far fewer tokens than raw PDF text. Drive it from the REST API or the hosted MCP.
Markdown is the clean input RAG wants
Embedding raw PDF text gives a retriever noisy chunks: headings, lists and tables collapse, reading order breaks, and binary/layout junk leaks in. Convert the PDF to Markdown first and the document structure survives – so you can chunk on headings and sections, keep tables and formulas as facts, and spend far fewer tokens per passage. Cleaner chunks in, better retrieval out.
What clean Markdown buys your pipeline
Clean chunk boundaries
Headings and sections survive, so splitting on structure produces coherent, self-contained chunks instead of mid-sentence cuts.
Fewer tokens
Plain Markdown is far cheaper to embed and to send as context than raw PDF dumps or HTML, so you index and retrieve more for less.
Tables stay facts
Columns become real Markdown tables, not jumbled lines, so tabular numbers remain retrievable rather than scrambled.
Formulas preserved
Mathematical notation is kept rather than flattened into garbled characters that pollute embeddings.
Scans become text
OCR turns image-only and scanned PDFs into selectable Markdown across many languages, so scans are indexable too.
Links & footnotes
Hyperlinks and footnotes carry over as Markdown links instead of being dropped, keeping references intact.
Ingest a PDF in four steps
One predictable lifecycle over HTTPS with a bearer API key. The hosted MCP exposes the same steps as agent tools.
Create a job
POST the PDF URL (or uploaded bytes) to /api/v2/jobs. You get a job id and a status.
Poll until ready
GET /api/v2/jobs/{id} until status is ready or error. On error, read error_code/error_message. On paid tiers, register a webhook instead of polling.
Fetch the Markdown
GET /api/v2/jobs/{id}/download. Check the truncated and pages fields before you index: truncated means a long document hit the time budget and was returned only in part.
Chunk & embed
Split on headings/sections, embed the chunks, and index them in your vector store. The Markdown structure keeps boundaries clean.
# 1. create a job from a PDF URL
curl -s -X POST https://pdf2md.dev/api/v2/jobs \
  -H "Authorization: Bearer p2m_your_key" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/paper.pdf"}'
# -> { "job_id": "...", "status": "queued" }

# 2. poll status until ready or error
curl -s https://pdf2md.dev/api/v2/jobs/JOB_ID \
  -H "Authorization: Bearer p2m_your_key"
# -> { "status": "ready", "pages": 24, "truncated": false }
# on failure: { "status": "error", "error_code": "...", "error_message": "..." }

# 3. fetch the Markdown, then chunk + embed it
curl -s https://pdf2md.dev/api/v2/jobs/JOB_ID/download \
  -H "Authorization: Bearer p2m_your_key"
Prefer agent tools? The hosted MCP exposes the same lifecycle (pdf_to_markdown_create_job_from_url, _get_job, _get_markdown) so Claude, ChatGPT and agent frameworks can ingest PDFs with no local server.
Chunking strategies for Markdown
Clean Markdown gives you natural seams to split on, so chunks stay coherent and retrievable.
Split on headings
Use the heading hierarchy as chunk boundaries so each chunk is a self-contained section with its own context.
Keep tables whole
Never split inside a Markdown table; half a table loses its meaning. Keep each table, and its heading, in one chunk.
Add a little overlap
A small overlap of a sentence or two between adjacent chunks preserves context across boundaries and improves recall.
Carry the heading path
Prepend the section heading (or the breadcrumb of headings) to each chunk so the embedding captures where it came from.
Right-size chunks
Aim for a few hundred tokens per chunk: too large dilutes relevance, too small loses context. Markdown structure makes this easy to tune.
Drop boilerplate
Strip repeated headers and footers so they do not dominate embeddings; the converter already removes most layout noise.
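As a rough illustration of the first strategies above (split on headings, keep tables whole, carry the heading path), here is a minimal sketch. The function name chunk_by_headings and the " > " breadcrumb separator are illustrative choices, not part of the converter or its API. Because cuts happen only at heading lines, a table is never split, and lines inside fenced code are not mistaken for headings.

```python
import re

# ATX headings only ("# Title" .. "###### Title"); an assumption about the converter's output.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(md: str) -> list[dict]:
    """Return one chunk per heading section, each with its breadcrumb of headings."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for the current section
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            breadcrumb = " > ".join(title for _, title in path)
            chunks.append({"path": breadcrumb, "text": text})
        body.clear()

    for line in md.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore "# ..." lines inside fenced code
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # drop headings at the same or a deeper level before pushing the new one
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


def embedding_text(chunk: dict) -> str:
    """Prepend the heading path so the embedding captures where the chunk came from."""
    return f"{chunk['path']}\n\n{chunk['text']}" if chunk["path"] else chunk["text"]
```

Feed embedding_text(chunk) to your embedding model and store chunk["path"] as metadata if your vector store supports it.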
A minimal Python ingestion
Create, poll, fetch, then split on headings, ready to embed and index in your vector store.
# pip install requests
import re
import time

import requests

API = "https://pdf2md.dev/api/v2"
H = {"Authorization": "Bearer p2m_your_key"}

# 1. create a job from a PDF URL
job = requests.post(f"{API}/jobs", headers=H,
                    json={"url": "https://example.com/paper.pdf"}).json()
jid = job["job_id"]

# 2. poll until the job is ready or has failed
while True:
    j = requests.get(f"{API}/jobs/{jid}", headers=H).json()
    if j["status"] in ("ready", "error"):
        break
    time.sleep(3)
if j["status"] == "error":
    raise RuntimeError(f'{j.get("error_code")}: {j.get("error_message")}')

# 3. fetch the Markdown
md = requests.get(f"{API}/jobs/{jid}/download", headers=H).text

# naive split on top-level headings -> chunks ready to embed
chunks = re.split(r"\n(?=# )", md)
Why tables retrieve better
A scrambled table embeds as noise, so the number you need rarely matches a query. As a real Markdown table the rows and headers stay aligned, so tabular facts like prices, metrics and dates remain searchable and quotable in answers.
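To make the alignment point concrete, the sketch below reads a simple pipe-delimited Markdown table into one record per row, so each value stays attached to its column header. The helper table_rows is hypothetical and assumes a plain table with a header row, a separator row and no escaped pipes inside cells.

```python
def split_cells(line: str) -> list[str]:
    """Split one table line into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def table_rows(table_md: str) -> list[dict]:
    """Turn a simple Markdown table into a list of {header: value} records."""
    lines = [l for l in table_md.strip().splitlines() if l.strip()]
    header = split_cells(lines[0])
    # lines[1] is the |---|---| separator row, so data starts at index 2
    return [dict(zip(header, split_cells(line))) for line in lines[2:]]


table = """
| Plan  | Price |
|-------|-------|
| Basic | $5    |
| Pro   | $20   |
"""
rows = table_rows(table)
```

Here rows[1] pairs "Pro" with "$20", a pairing that is lost once columns are scrambled into loose lines of text.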
Polling vs webhooks
For a few documents, poll GET /api/v2/jobs/{id} every few seconds. For bulk or backend pipelines on a paid tier, register a webhook (or pass callback_url) and we POST to your endpoint when the job is ready or has failed, so you skip polling entirely.
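If you do poll, it helps to bound the loop with a deadline. The helper below is a sketch, not part of the API: wait_for_job and its parameters are hypothetical, and the fetch argument is any callable that returns the job's status JSON as a dict. The default timeout of 900 seconds is only an assumption; pick one that matches your tier's time budget.

```python
import time
from typing import Callable


def wait_for_job(
    fetch: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the job status is 'ready' or 'error', or the deadline passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch()
        if job.get("status") in ("ready", "error"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the deadline")
        sleep(interval)


# Example wiring with requests (assumed installed), using the endpoint from the steps above:
#   job = wait_for_job(lambda: requests.get(f"{API}/jobs/{jid}", headers=H, timeout=30).json())
```

Passing fetch in as a callable keeps the loop independent of the HTTP client and easy to test without network access.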
Build it into your pipeline
The same converter is a REST API and a hosted MCP endpoint, with machine-readable discovery so scripts and agents can drive it directly.
Common questions
Why Markdown for RAG instead of raw PDF text?
Raw extraction loses structure – headings, lists and tables collapse and reading order breaks – which produces noisy chunks. Markdown keeps the structure, so chunking on headings yields cleaner, more retrievable passages, and it uses far fewer tokens than raw dumps or HTML.
Do tables and formulas survive?
Yes. Columns become real Markdown tables instead of jumbled lines, and mathematical notation is preserved rather than flattened, so numeric and tabular facts stay retrievable. More on extracting tables.
Can I automate it in a pipeline?
Yes. Use the REST API (create, poll, download) with a bearer API key, or the hosted MCP for agent frameworks. On paid tiers a webhook removes polling. There is nothing to host. See the Python tutorial for a full example.
How do I handle long documents?
Each conversion runs up to a per-tier time budget; a long document is returned partially with truncated=true instead of failing. Check truncated on every job and, when it is set, split the source file into smaller PDFs or use a higher tier with a longer budget.
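One way to act on truncated=true is to cut the source PDF into fixed-size page ranges and submit each range as its own job. The sketch below only plans the ranges; the helper name plan_page_ranges and the part size of 50 pages are assumptions, and the actual splitting is left to whichever PDF library you already use.

```python
def plan_page_ranges(total_pages: int, pages_per_part: int = 50) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (first_page, last_page) ranges covering the document."""
    if total_pages < 1 or pages_per_part < 1:
        raise ValueError("total_pages and pages_per_part must be positive")
    ranges = []
    for start in range(1, total_pages + 1, pages_per_part):
        end = min(start + pages_per_part - 1, total_pages)
        ranges.append((start, end))
    return ranges
```

Keep the range alongside each job id so the resulting Markdown parts can be indexed in page order.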
Does it handle scanned PDFs for ingestion?
Yes. Image-only and scanned PDFs are OCR'd into selectable Markdown across many languages, so scans become retrievable text in your index.
How should I chunk the Markdown for RAG?
Split on the heading hierarchy so each chunk is a self-contained section, keep tables whole, add a small overlap between chunks, and prepend the section heading so the embedding knows the context. Aim for a few hundred tokens per chunk.
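For sections that are still too long after a heading split, a possible follow-up is to pack blank-line-separated blocks up to a size budget, repeating the last sentence of the previous chunk as overlap. This is a sketch under stated assumptions: pack_blocks is a hypothetical helper, word count stands in for tokens, and a table is recognised only when every line of its block starts with a pipe.

```python
import re


def is_table(block: str) -> bool:
    """Treat a block as a table when every line starts with a pipe (an assumption)."""
    lines = block.splitlines()
    return bool(lines) and all(line.lstrip().startswith("|") for line in lines)


def last_sentence(block: str) -> str:
    """Return the final sentence of a prose block, split on ., ! or ? followed by space."""
    return re.split(r"(?<=[.!?])\s+", block.strip())[-1]


def pack_blocks(section: str, max_words: int = 300) -> list[str]:
    """Greedily pack blocks into chunks of roughly max_words, never cutting inside a block."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", section) if b.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for block in blocks:
        words = len(block.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            tail = current[-1]
            # overlap: carry one sentence forward, but never a fragment of a table
            current = [] if is_table(tail) else [last_sentence(tail)]
            count = sum(len(b.split()) for b in current)
        current.append(block)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single block larger than the budget, such as a long table, is kept whole and simply becomes an oversized chunk.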
Why do Markdown tables retrieve better than raw text?
Raw extraction scrambles columns into noise, so the value you need rarely matches a query. A real Markdown table keeps rows and headers aligned, so tabular facts stay searchable and quotable.
Is there a free tier for testing?
Yes. Signing in with a free Google account gives you an API key and access to the hosted MCP, with 3 slots, 10 MB files, a 15-minute time budget and 1-hour retention. Paid tiers raise every limit and add webhooks and batch create.