Documents API
Turn any document into AI-ready content
Extract text, tables, and images from PDFs, Office files, and images through one API. Advanced OCR turns scanned documents into clean Markdown, JSON, or HTML, ready for RAG pipelines, digitization, and invoice processing.
Why GreenPT
Document AI without the data tradeoff
Most document APIs ask you to send sensitive files to infrastructure you do not control. GreenPT processes your documents on private, EU-hosted, renewable-powered infrastructure, and never trains on your data.
- EU-hosted and GDPR-aligned, so sensitive documents stay in Europe.
- Private by design: your files are never used to train models.
- Wide format coverage in a single API, from PDFs to spreadsheets to images.
- Built-in OCR reads scans and image-based documents.
- Fast or accurate table modes, so you can trade speed against precision.
- Structured DoclingDocument JSON that drops straight into RAG pipelines.
Capabilities
One API for every document
- Wide format support
  Process PDFs, Word, PowerPoint, Excel, CSV, HTML, and common image types through a single endpoint, with no per-format plumbing.
- OCR for scans and images
  Advanced OCR reads scanned documents and images with embedded text, turning pixels into clean, searchable content.
- Table extraction
  Detect and reconstruct table structure from reports and spreadsheets, with a fast mode for speed or an accurate mode for precision.
- Multiple output formats
  Get results as Markdown, JSON, HTML, HTML split by page, plain text, or DocTags. Pick one or request several at once.
- Image extraction
  Pull embedded images out alongside the text so figures and diagrams are not lost in conversion.
- Structured JSON for RAG
  Receive a structured DoclingDocument schema with texts, tables, pictures, and pages, ready to chunk and embed for retrieval.
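As a rough sketch of what "ready to chunk" can look like, the Python snippet below walks the `texts` array of a DoclingDocument-style payload and packs the text items into chunks under a character budget. The sample payload, the `chunk_texts` helper, the `max_chars` parameter, and the per-item `text` field are illustrative assumptions, not taken from the GreenPT docs; only the top-level keys (texts, tables, pictures, pages) come from the description above.

```python
import json

# Hypothetical, trimmed-down payload for illustration only. A real response
# carries more fields; the per-item "text" key is an assumption.
SAMPLE = json.loads("""
{
  "texts": [
    {"text": "Invoice 1042"},
    {"text": "Billed to Example BV"},
    {"text": ""},
    {"text": "Total due: 120.00 EUR"}
  ],
  "tables": [],
  "pictures": [],
  "pages": {}
}
""")


def chunk_texts(doc, max_chars=200):
    """Greedily pack text items into newline-joined chunks.

    A chunk is closed as soon as adding the next item would push it past
    max_chars. An item that is itself longer than max_chars becomes its
    own oversized chunk rather than being split.
    """
    chunks, current = [], ""
    for item in doc.get("texts", []):
        text = item.get("text", "").strip()
        if not text:
            continue  # skip empty text items
        candidate = f"{current}\n{text}" if current else text
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_texts(SAMPLE, max_chars=35):
        print(repr(chunk))
```

Each resulting chunk can then be passed to whichever embedding step your retrieval pipeline uses. Greedy packing is only one option; splitting on headings or pages may suit long documents better.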
Formats
Many inputs in, clean structure out
Send the documents you already have. Get back the format your application needs, whether that is Markdown for an LLM or structured JSON for a pipeline.
Input formats
Documents
- .pdf
- .docx
- .pptx
- .xlsx
- .csv
- .md
- .html
Images
- .png
- .jpg
- .tiff
- .bmp
- .webp
Special
- .vtt
- .xml
- .json
Output formats
- Markdown
- JSON
- HTML
- HTML by page
- Plain text
- DocTags
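Before uploading, it can help to check a file against the input list above. The Python sketch below does that by extension. The `INPUT_FORMATS` table and the `input_category` helper are illustrative names, not part of the API, and the real service may accept extensions not printed here (for example `.jpeg` alongside `.jpg`).

```python
from pathlib import Path

# Extensions as listed on this page, grouped the same way. ".pdf" is included
# because the page names PDFs as a supported input.
INPUT_FORMATS = {
    "documents": {".pdf", ".docx", ".pptx", ".xlsx", ".csv", ".md", ".html"},
    "images": {".png", ".jpg", ".tiff", ".bmp", ".webp"},
    "special": {".vtt", ".xml", ".json"},
}


def input_category(filename):
    """Return the category for a filename, or None if its extension is not listed."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    for category, extensions in INPUT_FORMATS.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ("report.PDF", "scan.tiff", "captions.vtt", "notes.txt"):
        print(name, "->", input_category(name))
```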
Use cases
Built for document-heavy workflows
- RAG pipelines
  Convert source documents into clean, structured text so your retrieval and embedding steps start from quality input.
- Invoice processing
  Automate extraction of data from invoices and receipts, including the tables that hold line items and totals.
- Document digitization
  Turn scanned archives and legacy PDFs into searchable, machine-readable text for indexing and reuse.
- Data extraction
  Lift tables out of financial reports and spreadsheets into structured formats your systems can consume.
- Academic research
  Process research papers with their formulas, citations, and figures intact for analysis or summarization.
- Accessibility
  Make image-based documents accessible by extracting their text, so screen readers and search can reach the content.
Documents API, in short
Which file formats can I send?
PDFs, Microsoft Office files (Word, PowerPoint, Excel), CSV, Markdown, HTML, and common image types including PNG, JPEG, TIFF, BMP, and WebP. Special formats like VTT, XML, and JSON are supported too.
What output formats do I get back?
Markdown (the default), structured JSON in the DoclingDocument schema, HTML, HTML split by page, plain text, or DocTags. You can request one format or several in a single call.
Can it read scanned documents?
Yes. Built-in OCR converts scanned documents and images with embedded text into clean, structured output. You can also force OCR to replace the existing text when a source PDF has an unreliable text layer.
Does it extract tables?
Yes. Table structure detection is on by default, with a fast mode for speed and an accurate mode for complex layouts. Embedded images can be extracted alongside the text.
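The choices described in the answers above (output formats, forced OCR, table mode, image extraction) could be collected into a single options object before a request is sent. The Python sketch below shows one way to validate them. Every field name and value identifier here (`to_formats`, `table_mode`, `force_ocr`, `include_images`, `html_split_page`, and so on) is a hypothetical placeholder, as is the `build_options` helper; check the API docs for the real parameter names.

```python
# Hypothetical identifiers for the six output formats named on this page.
ALLOWED_OUTPUTS = {"markdown", "json", "html", "html_split_page", "text", "doctags"}
ALLOWED_TABLE_MODES = {"fast", "accurate"}


def build_options(outputs=("markdown",), table_mode="fast",
                  force_ocr=False, include_images=False):
    """Validate conversion choices and return them as a plain dict.

    Markdown is the default output, as stated on this page. The "fast"
    default for table_mode is an arbitrary choice made for this sketch.
    """
    # Drop duplicate formats while keeping the caller's order.
    formats = list(dict.fromkeys(outputs))
    unknown = [f for f in formats if f not in ALLOWED_OUTPUTS]
    if unknown:
        raise ValueError(f"unsupported output format(s): {unknown}")
    if table_mode not in ALLOWED_TABLE_MODES:
        raise ValueError(f"table_mode must be one of {sorted(ALLOWED_TABLE_MODES)}")
    return {
        "to_formats": formats,
        "table_mode": table_mode,
        "force_ocr": bool(force_ocr),
        "include_images": bool(include_images),
    }


if __name__ == "__main__":
    # Several formats in one call, accurate tables, OCR forced over a bad text layer.
    print(build_options(outputs=["markdown", "json"],
                        table_mode="accurate", force_ocr=True))
```

Validating on the client side keeps typos such as an unknown format name from costing a round trip, though the service remains the source of truth for what it accepts.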
Is my data private, and where is it processed?
The Documents API runs on GreenPT’s private, EU-hosted, renewable-powered infrastructure. Your files are processed to fulfil your request and are not used to train models.
Read the API docs →
Start building
Make every document AI-ready.
Send your first file in minutes. Convert PDFs, scans, and Office documents into clean, structured content on private, EU-hosted infrastructure.
- 100% Renewable
- EU Hosted
- GDPR-aligned