PDF to Markdown glossary
Short, plain-language definitions of the terms that come up when you convert PDFs into clean, LLM-ready Markdown – from OCR and reading order to RAG, chunking and hosted MCP. Each one links to the guide that goes deeper.
Jump to: Chunking · Conversion engine · Embeddings · Formulas · Hosted MCP · Markdown · OCR · PDF to Markdown · RAG · Reading order · REST API · Scanned PDF · Table reconstruction · Tokens · Vector database
The terms, A to Z
Chunking
Splitting a document into smaller passages so a retrieval system or LLM can index and search them. Clean Markdown splits into chunks far more cleanly than raw PDF text because headings and tables stay intact. See Markdown for RAG.
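One simple way to chunk Markdown is to start a new chunk at every heading, so each passage keeps its own title and any table under it. The sketch below is only an illustration of that idea, not how any particular tool chunks; the function name `chunk_by_headings` is made up here, and it assumes ATX-style `#` headings and does not treat fenced code specially.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")

def chunk_by_headings(markdown):
    """Split Markdown into one chunk per heading section (illustrative sketch)."""
    chunks = []
    heading, lines = "", []

    def flush():
        # Keep a section only if it has a heading or some non-blank text.
        if heading or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if HEADING.match(line):
            flush()
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    flush()
    return chunks

doc = "# Intro\nHello.\n\n## Data\n| a | b |\n|---|---|\n| 1 | 2 |\n"
chunks = chunk_by_headings(doc)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Because the split happens only at headings, the table under "Data" stays in one piece instead of being cut mid-row.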
Conversion engine
The component that reads a PDF and produces Markdown. pdf2md.dev runs two open-source engines: Docling (fast on clean documents) and MinerU (robust on dense, complex layouts). See tables to Markdown.
Embeddings
Numeric vector representations of text that let a system find passages by meaning rather than exact words. They power retrieval in a RAG pipeline; you embed the Markdown chunks of a document. See Markdown for RAG.
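To make "find passages by meaning" concrete, the sketch below compares vectors with cosine similarity, a common way to score how close two embeddings are. The three-dimensional vectors are hand-made stand-ins chosen for illustration; real embeddings come from a model and have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hand-made toy vectors (assumed values, not real model output).
invoice = [0.9, 0.1, 0.0]
bill = [0.8, 0.2, 0.1]
giraffe = [0.0, 0.1, 0.9]

print(cosine_similarity(invoice, bill) > cosine_similarity(invoice, giraffe))
```

The two related words share no letters, yet their vectors point in nearly the same direction, which is what lets retrieval match on meaning.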
Formulas (LaTeX / math)
Mathematical notation in a document. A good converter preserves equations, often as LaTeX, instead of flattening them into garbled characters. See tables & formulas.
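As an illustration (not taken from any particular converter's output), here is the quadratic formula preserved as LaTeX. A flattened extraction of the same equation might read something like "x = -b ± b2 - 4ac 2a", where the fraction bar, the square root and the exponent are all lost.

```latex
x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
```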
Hosted MCP
MCP (Model Context Protocol) is an open standard that lets AI agents call external tools. A hosted MCP endpoint exposes PDF-to-Markdown conversion as a tool an agent can call directly, with no local setup. See the developer hub.
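Under the hood, MCP messages are JSON-RPC 2.0, and an agent invokes a tool with a `tools/call` request. The sketch below builds such a message; the tool name `convert_pdf` and its `url` argument are hypothetical placeholders, since the actual tool names and parameters are whatever a given server advertises.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# "convert_pdf" and "url" are assumed names, used only for illustration.
message = build_tool_call(1, "convert_pdf", {"url": "https://example.com/report.pdf"})
print(json.dumps(message, indent=2))
```

In practice an MCP client library sends this for you; the point is that the agent only needs the endpoint and the tool's declared arguments.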
Markdown
A lightweight plain-text format that marks up headings, lists, tables, links and code with simple symbols. It is compact, diffable and a widely used format for feeding documents to LLMs. See PDF to Markdown for AI.
OCR
Optical Character Recognition: turning the text inside an image or scanned page into real, selectable characters. It is what makes a scanned PDF convertible to editable Markdown. See scanned PDF to Markdown.
PDF to Markdown
Converting a PDF, whose text is stored by position rather than as structure, into clean Markdown with real headings, tables and lists. The result is editable, searchable and ready for LLMs. Try it.
RAG
Retrieval-Augmented Generation: a pattern where an LLM answers using passages retrieved from your own documents instead of only its training data. Converting PDFs to clean Markdown is the first step in most RAG pipelines. See Markdown for RAG.
Reading order
The correct sequence in which a page's text should be read, especially across multiple columns. Most PDFs do not store it explicitly (only tagged PDFs carry a logical structure), so a converter must reconstruct it to avoid scrambled output. See tables to Markdown.
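A toy example shows why reading order matters on a two-column page. The sketch below is a deliberately naive heuristic, not what a real conversion engine does: it assumes exactly two columns split at the page midpoint, and text blocks given as dicts with `x`, `y` and `text`, where `y` grows downward from the top of the page.

```python
def naive_order(blocks):
    """Sort strictly top to bottom, then left to right (interleaves columns)."""
    return [b["text"] for b in sorted(blocks, key=lambda b: (b["y"], b["x"]))]

def two_column_order(blocks, page_width):
    """Read the whole left column first, then the right column."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b["x"] < mid), key=lambda b: b["y"])
    right = sorted((b for b in blocks if b["x"] >= mid), key=lambda b: b["y"])
    return [b["text"] for b in left + right]

blocks = [
    {"x": 50, "y": 100, "text": "Left 1"},
    {"x": 320, "y": 100, "text": "Right 1"},
    {"x": 50, "y": 200, "text": "Left 2"},
    {"x": 320, "y": 200, "text": "Right 2"},
]
print(naive_order(blocks))
print(two_column_order(blocks, page_width=600))
```

The naive sort jumps between columns on every line, which is the scrambled output the definition warns about. Real layouts (full-width headings, sidebars, three columns) need layout analysis rather than a fixed midpoint.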
REST API
A web interface for driving the converter from code: create a job, poll its status, then download the Markdown. It lets you convert PDFs programmatically or from an agent. See the Python tutorial.
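The create, poll, download flow can be sketched without committing to specific endpoints. Everything named below is hypothetical: `FakeClient` and its `create_job`, `get_status` and `download` methods stand in for whatever HTTP calls the real API defines, and the status strings `"processing"`, `"done"` and `"failed"` are assumed values.

```python
import time

class FakeClient:
    """Stand-in for a real HTTP client; all method names are hypothetical."""

    def __init__(self):
        self._polls = 0

    def create_job(self, pdf_bytes):
        return "job-1"

    def get_status(self, job_id):
        # Pretend the job finishes on the third status check.
        self._polls += 1
        return "done" if self._polls >= 3 else "processing"

    def download(self, job_id):
        return "# Converted document\n"

def convert(client, pdf_bytes, poll_interval=1.0, timeout=60.0):
    """Create a job, poll until it finishes, then return the Markdown."""
    job_id = client.create_job(pdf_bytes)
    deadline = time.monotonic() + timeout
    while True:
        status = client.get_status(job_id)
        if status == "done":
            return client.download(job_id)
        if status == "failed":
            raise RuntimeError(f"conversion job {job_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        time.sleep(poll_interval)

markdown = convert(FakeClient(), b"%PDF-1.7", poll_interval=0)
print(markdown)
```

Keeping a timeout and an explicit failure branch in the loop avoids polling forever when a job stalls or errors out.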
Scanned PDF
A PDF whose pages are images, for example photographed or scanned paper, with no underlying text layer. It needs OCR before its content can become Markdown. See scanned PDF to Markdown.
Table reconstruction
Rebuilding the rows and columns a PDF only draws visually into a real Markdown table, instead of a screenshot or misaligned lines. See tables to Markdown.
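In miniature, table reconstruction means grouping positioned text into rows and columns and then emitting Markdown. The sketch below is a simplified illustration, not a production algorithm: it assumes each cell is already one text fragment with `x` and `y` coordinates, that cells on the same row have nearly equal `y`, and that the first row is the header. The name `rebuild_table` and the `row_tolerance` parameter are invented for this example.

```python
def rebuild_table(cells, row_tolerance=3):
    """Group positioned cells into rows by y, order by x, emit a Markdown table."""
    rows = []  # each entry: [row_y, list_of_cells]
    for cell in sorted(cells, key=lambda c: (c["y"], c["x"])):
        if rows and abs(cell["y"] - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(cell)
        else:
            rows.append([cell["y"], [cell]])
    table = [[c["text"] for c in sorted(row, key=lambda c: c["x"])] for _, row in rows]
    if not table:
        return ""
    width = max(len(row) for row in table)

    def fmt(row):
        padded = row + [""] * (width - len(row))  # pad short rows
        return "| " + " | ".join(padded) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(table[0]), separator] + [fmt(r) for r in table[1:]])

cells = [
    {"x": 10, "y": 100, "text": "Item"},
    {"x": 120, "y": 101, "text": "Qty"},
    {"x": 10, "y": 120, "text": "Bolt"},
    {"x": 120, "y": 119, "text": "4"},
]
print(rebuild_table(cells))
```

The small `y` differences (100 vs 101, 119 vs 120) mimic the slight baseline jitter found in real PDFs, which is why rows are matched within a tolerance rather than by exact position.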
Tokens
The units, roughly word fragments, an LLM counts for context limits and pricing. Clean Markdown typically uses fewer tokens than messy extracted text, so more of a document fits in a prompt. See Markdown for RAG.
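For quick budgeting, a common rule of thumb is about four characters per token for English text. The sketch below applies that rule; it is only an approximation, the ratio is an assumption that varies by tokenizer and language, and exact counts require the tokenizer of the model you use. The function names are made up for this example.

```python
import math

CHARS_PER_TOKEN = 4  # rough rule of thumb for English; an assumption, not exact

def estimate_tokens(text):
    """Roughly estimate the token count of a string from its length."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def fits_in_context(text, context_limit, reserved_for_answer=0):
    """Check whether the estimated tokens fit, leaving room for the reply."""
    return estimate_tokens(text) + reserved_for_answer <= context_limit

sample = "## Results\n\nRevenue grew in every region."
print(estimate_tokens(sample))
print(fits_in_context(sample, context_limit=8000, reserved_for_answer=1000))
```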
Vector database
A store for embeddings that retrieves the passages most similar in meaning to a query. It holds the embedded Markdown chunks a RAG system searches. See Markdown for RAG.
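The core operation of a vector database is nearest-neighbour search over stored vectors. The in-memory sketch below shows the idea with a brute-force scan; the class name `TinyVectorStore` and the two-dimensional toy vectors are invented for illustration, and real systems use approximate indexes to stay fast over millions of chunks.

```python
import math

def _normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot index or query with a zero vector")
    return [x / norm for x in vector]

class TinyVectorStore:
    """Brute-force in-memory store: keeps unit vectors, ranks by dot product."""

    def __init__(self):
        self._items = []  # list of (text, unit_vector)

    def add(self, text, vector):
        self._items.append((text, _normalize(vector)))

    def search(self, query_vector, k=3):
        query = _normalize(query_vector)
        # On unit vectors the dot product equals cosine similarity.
        scored = [
            (sum(q * v for q, v in zip(query, vec)), text)
            for text, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]

store = TinyVectorStore()
store.add("Refund policy", [0.9, 0.1])    # toy vectors, not model output
store.add("Shipping times", [0.2, 0.8])
store.add("Return window", [0.8, 0.3])
results = store.search([1.0, 0.0], k=2)
print(results)
```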
From terms to a converted file
Put the glossary to work: drop a PDF and get clean Markdown with OCR, real tables and formulas – free in the browser, or through the REST API or a hosted MCP endpoint.