Mail-Merge SQL Data into DOCX & PDF in DuckDB

A Query.Farm VGI worker for DuckDB.

Mail-Merge SQL Data into DOCX & PDF in DuckDB

vgi-docgen · a Query.Farm VGI worker · powered by docxtpl

A VGI worker that brings mail-merge at scale from SQL to DuckDB: merge query data into DOCX templates to produce filled documents — invoices, contracts, statements, letters — each returned as a BLOB. It is the inverse of vgi-tika / vgi-pdf: those pull text/structure out of documents; this fills documents up.

Backed by docxtpl (Jinja2-style {{ placeholders }} and {% for %} loops over python-docx, both LGPL/MIT) and docxcompose (BSD) for the merge. PDF output is optional and uses headless LibreOffice if present.

INSTALL vgi FROM community; LOAD vgi;
ATTACH 'docgen' (TYPE vgi, LOCATION 'uv run docgen_worker.py');

-- One rendered DOCX per row (scalar): template + a STRUCT of row data -> BLOB
SELECT docgen.docgen_render('invoice.docx', {customer: name, total: amount})
FROM invoices;

-- Inline template bytes work too (BLOB template instead of a path)
SELECT docgen.docgen_render(tpl.bytes, {customer: name})
FROM letters, templates AS tpl;

-- Opt into PDF (3rd positional arg; requires headless LibreOffice on PATH)
SELECT docgen.docgen_render('invoice.docx', {total: amount}, true) FROM invoices;

-- Many rows -> ONE merged document (buffering): template as a NAMED arg
SELECT doc FROM docgen.docgen_merge(
    (SELECT customer, total FROM invoices),
    template := 'invoice.docx');

Functions

function	kind	signature	returns
`docgen_render(template, data)`	scalar	template `VARCHAR` path or `BLOB` bytes; `data` a `STRUCT`	rendered `.docx` `BLOB` (NULL on failure)
`docgen_render(template, data, pdf)`	scalar	+ `pdf BOOLEAN`	`.docx` or `.pdf` `BLOB` (NULL if PDF requested but LibreOffice absent)
`docgen_merge(rel, template[, pdf])`	table (buffering)	`rel` a `(SELECT …)`; `template := 'path'` named; `pdf := true` named	one-row `doc BLOB` — all rows merged into a single document

`docgen_render` — one document per row (scalar)

A true DuckDB scalar, so it composes in any projection. Arguments are positional only (scalar functions do not support name := value):

template — a VARCHAR path (resolved directly, then under $VGI_DOCGEN_TEMPLATES) or a BLOB of raw .docx bytes. These are two overloads sharing the name, resolved by argument type.
data — an arbitrary STRUCT. Every field becomes a Jinja2 variable in the template ({{ field }}); nested STRUCT/LIST data drives {% for %} loops and tables. Build it inline with {a: x, b: y} or struct_pack(a := x).
pdf (optional) — BOOLEAN; true converts to PDF via LibreOffice.

`docgen_merge` — many rows into one document (buffering)

The mail-merge path: renders the template once per input row and concatenates the results into a single merged .docx (a page break between rows), returned as one BLOB. Because it must see every row first, it is a buffering (Sink/finalize) table function: the relation is the positional (SELECT …), and the template (and optional pdf) are named args (table functions support name := value). Every relation column is available to the template by name.

Templates

Path — 'invoice.docx'. Resolved relative to the worker's working directory, then under $VGI_DOCGEN_TEMPLATES if set, so you can keep a shared templates directory and reference bare names.
Inline bytes — pass a BLOB column of raw .docx bytes (e.g. a templates table). Lets templates live in DuckDB itself.

Templates are ordinary .docx files authored in Word/LibreOffice with docxtpl tags: {{ customer }}, {% for item in items %} … {% endfor %}, {%tr %} for table-row loops, etc.

Robustness (untrusted-binary discipline)

The scalar path treats a NULL template/data, a non-.docx (non-ZIP) blob, a malformed template, a Jinja render error, or a requested-but-unavailable PDF conversion as NULL output — never a worker crash. The merge path surfaces the same failures as a clean DuckDB error (so a typo'd template name is visible, not silently empty). Every .docx is magic-checked (PK\x03\x04) before docxtpl touches it.

PDF output (optional)

DOCX is the clean, dependency-light default. PDF conversion shells out to a headless LibreOffice (soffice/libreoffice on PATH, or $VGI_DOCGEN_SOFFICE). LibreOffice is not a dependency of this package — if it is absent, the scalar PDF path yields NULL and the merge PDF path raises a clear error. DOCX→PDF fidelity is whatever LibreOffice produces.

Development

uv sync --extra dev
uv run --no-sync pytest -q                 # pure core + in-proc + Client RPC E2E
make test-sql                              # haybarn-unittest SQL E2E (authoritative)
uv run --no-sync ruff check . && uv run --no-sync mypy vgi_docgen/

Caveats

DOCX→PDF needs LibreOffice (heavy, optional, slower); without it, stick to DOCX output. Fidelity is LibreOffice's, not Word's.
Large BLOBs through Arrow have a practical size ceiling. A merge of thousands of rich documents into one file can grow large; prefer per-row docgen_render (or batch the merge) for very large jobs.
Template complexity is the support burden. Simple {{ }} substitution is trivial; loops, tables, and embedded images depend on correctly-authored docxtpl templates and are where most issues live.

Licensing

This worker is MIT. Dependencies are permissive:

docxtpl — LGPL-2.1 (used as a library, not modified/vendored).
python-docx — MIT.
docxcompose — BSD-3-Clause.
Jinja2 (via docxtpl) — BSD-3-Clause.
pyarrow — Apache-2.0.
LibreOffice — MPL-2.0 / LGPL-3.0, and only an optional runtime dependency for the PDF path (never bundled, never imported; invoked as an external process only when you opt into PDF output).

Authorship & License

Written by Query.Farm.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github/workflows		.github/workflows
ci		ci
docs		docs
test/sql		test/sql
tests		tests
vgi_docgen		vgi_docgen
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
conftest.py		conftest.py
docgen_worker.py		docgen_worker.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Mail-Merge SQL Data into DOCX & PDF in DuckDB

Functions

`docgen_render` — one document per row (scalar)

`docgen_merge` — many rows into one document (buffering)

Templates

Robustness (untrusted-binary discipline)

PDF output (optional)

Development

Caveats

Licensing

Authorship & License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Mail-Merge SQL Data into DOCX & PDF in DuckDB

Functions

docgen_render — one document per row (scalar)

docgen_merge — many rows into one document (buffering)

Templates

Robustness (untrusted-binary discipline)

PDF output (optional)

Development

Caveats

Licensing

Authorship & License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`docgen_render` — one document per row (scalar)

`docgen_merge` — many rows into one document (buffering)

Packages