Bulk conversion

Batch convert PDFs to Markdown

Have a folder of reports, papers or invoices? Convert them all with the REST API: one small job per file, looped over your set, with idempotent retries so a rerun never duplicates work.

Short answer

A batch is many small jobs

The API converts one PDF per job: create, poll, download. To convert many, you run that flow once per file in a loop, with two additions that make a batch reliable. An Idempotency-Key per file means a rerun after a crash picks up where it left off instead of redoing finished work. A small concurrency limit keeps several conversions in flight without overwhelming the API. That same shape scales from ten files to thousands.

How to

Convert a whole folder

A minimal shell loop over every PDF in the current directory. It needs curl and jq on your PATH. Swap in your own API key.

API="https://pdf2md.dev/api/v2"
AUTH="Authorization: Bearer p2m_your_key"

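for pdf in *.pdf; do
  [ -e "$pdf" ] || continue  # no PDFs matched: skip the literal "*.pdf"

  # filename as the Idempotency-Key: a rerun reuses the same job
  JID=$(curl -fsS -X POST "$API/jobs" -H "$AUTH" \
    -H "Idempotency-Key: $pdf" -F file=@"$pdf" | jq -r .job_id)

  # poll until the job is ready; give up on this file if it reports an error
  STATUS=""
  until [ "$STATUS" = "ready" ] || [ "$STATUS" = "error" ]; do
    sleep 3
    STATUS=$(curl -fsS "$API/jobs/$JID" -H "$AUTH" | jq -r .status)
  done
  if [ "$STATUS" = "error" ]; then
    echo "failed $pdf" >&2
    continue
  fi

  curl -fsS "$API/jobs/$JID/download" -H "$AUTH" -o "${pdf%.pdf}.md"
  echo "converted $pdf"
done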

This version is sequential for clarity. For real volume, convert several files at once with a worker pool and switch from polling to webhooks, so you are notified as each job finishes.

Scale it

From a folder to a pipeline

Concurrency

Convert several files at once with a bounded worker pool. Start around five in flight and tune to your tier.
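
One simple way to bound concurrency in plain shell is to start conversions in the background and pause whenever the limit is reached. The sketch below is an illustration, not the API's recommended client: convert_one and run_pool are hypothetical names, and convert_one is a stub you would replace with the create, poll and download flow.

```shell
# Sketch: convert PDFs in waves of at most MAX files at a time.
# convert_one is a hypothetical stand-in for the create, poll, download flow.
convert_one() {
  echo "converted $1"   # replace with the real single-file flow
}

# run_pool DIR [MAX]: process every DIR/*.pdf, MAX at a time (default 5).
run_pool() {
  dir=$1
  max=${2:-5}
  running=0
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue      # directory has no PDFs
    convert_one "$pdf" &           # start one conversion in the background
    running=$((running + 1))
    if [ "$running" -ge "$max" ]; then
      wait                         # wave is full: let it drain before continuing
      running=0
    fi
  done
  wait                             # wait for the final, partial wave
}
```

Call it as run_pool ./reports 5. Each wave waits for its slowest file, so this is a limiter rather than a true worker pool; it trades some throughput for a script with no extra tools.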

Idempotent retries

A stable key per file (filename or content hash) makes a rerun safe, so a half-finished batch resumes cleanly.
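
If you prefer a content hash to a filename, a key can be derived from the file's bytes. This is a minimal sketch: idem_key is a hypothetical helper, and it assumes the API accepts a 64-character hex string as an Idempotency-Key.

```shell
# idem_key FILE: print a SHA-256 hex digest of the file's contents.
# Reading from stdin keeps the filename out of the tool's output.
idem_key() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum < "$1" | cut -d ' ' -f 1
  else
    shasum -a 256 < "$1" | cut -d ' ' -f 1   # fallback where sha256sum is missing
  fi
}
```

Pass it on create as -H "Idempotency-Key: $(idem_key "$pdf")". A content hash stays the same when a file is renamed and changes when its contents change, so an edited PDF would get a new key; a filename key behaves the other way round.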

Webhooks

For thousands of files, let webhooks tell you when each job is ready instead of polling them all.

Any language

Wrap the single-file flow from the Python, Node.js, Go or cURL guides in a loop.

Feeding a knowledge base?

A batch of PDFs becomes a folder of clean Markdown, ready to chunk and embed. See the RAG guide for turning that output into retrievable context.

Make it robust

Handling failures across a batch

In a real run, a few files will be password-protected, corrupt or simply too large. The goal is to let the batch finish and tell you what was skipped, not to stop on the first bad file.

Skip and log, do not stop

When a job comes back with status: error, record the filename and its error_code to a log and move on. At the end you have a clean set of Markdown files plus a short list of the ones that need attention, instead of a batch that died halfway.
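
The skip-and-log step can be sketched as a small helper that your polling loop calls with each status it sees. The names handle_status, record_failure and FAILED_LOG are hypothetical, as is the tab-separated log format; only the status values ready and error and the error_code field come from the text above.

```shell
FAILED_LOG="failed.tsv"   # hypothetical log path: one "file<TAB>error_code" per line

# record_failure FILE ERROR_CODE: append one line to the failure log.
record_failure() {
  printf '%s\t%s\n' "$1" "$2" >> "$FAILED_LOG"
}

# handle_status FILE STATUS [ERROR_CODE]
# Returns 0 when the job is ready, 1 when it should be polled again,
# and 2 after logging an error so the caller can move on to the next file.
handle_status() {
  case "$2" in
    ready) return 0 ;;
    error) record_failure "$1" "${3:-unknown}"; return 2 ;;
    *)     return 1 ;;
  esac
}

# In a polling loop, with JOB holding the job's JSON:
#   handle_status "$pdf" "$(printf '%s' "$JOB" | jq -r .status)" \
#                        "$(printf '%s' "$JOB" | jq -r .error_code)"
```

When the batch ends, the log is the list of files that need attention, and an empty or missing log means every file converted.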

Resume safely

Because each create uses the filename (or a content hash) as its Idempotency-Key, rerunning the whole batch is safe: finished files return their existing job, and only the previously failed or unprocessed ones do real work. A nightly job can rerun the same folder without redoing everything.

Two more details keep large runs tidy. Write each result next to its source (report.pdf becomes report.md) so the mapping is obvious, and check the truncated flag on each job, since a very long document can come back as a partial result that hit the time budget. On the free tier the slot and size limits are lower, so a big batch is a good reason to move to a paid tier, which raises concurrency, file size and the monthly page allowance.
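
The output naming and the truncated check can be sketched as two small helpers. The names out_path and is_truncated are hypothetical, and the sketch assumes truncated is a top-level boolean in the job's JSON; check the job response you actually receive before relying on that shape.

```shell
# out_path FILE.pdf: print the Markdown path next to the source
# (report.pdf becomes report.md).
out_path() {
  printf '%s\n' "${1%.pdf}.md"
}

# is_truncated JOB_JSON: succeed only when the job's truncated flag is true.
# Assumes "truncated" is a top-level boolean; a missing flag counts as false.
is_truncated() {
  printf '%s' "$1" | jq -e '.truncated == true' >/dev/null
}

# In the loop, with JOB holding the finished job's JSON:
#   is_truncated "$JOB" && echo "partial result: $pdf" >&2
```

Logging truncated files alongside failed ones gives you a single list of documents to review after the run.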

FAQ

Common questions

How do I convert many PDFs at once?

Run the create, poll and download flow per file in a loop. Use an Idempotency-Key per file so a rerun does not duplicate work, and a small concurrency limit to stay friendly to the API.

Is there a single bulk endpoint?

No. Conversion is per job, so a batch is many jobs. The same loop scales from a handful of files to thousands; add concurrency and webhooks for large runs.

How do I avoid re-converting files on a retry?

Pass a stable Idempotency-Key on create, for example the filename or a content hash. The same key returns the same job instead of starting a duplicate conversion.

How many can I run in parallel?

Start with a small pool, for example five, and adjust. The free tier has lower slot limits; paid tiers raise slots and throughput.

Can I use webhooks instead of polling each job?

Yes. For large batches, webhooks notify you as each job finishes, which scales better than polling thousands of jobs. See the developer hub.

Which language should I write the batch in?

Any. See the Python, Node.js, Go and cURL guides for the single-file flow, then wrap it in a loop over your files.