A Query.Farm VGI worker for DuckDB.
vgi-docgen · a Query.Farm VGI worker · powered by docxtpl
A VGI worker that brings mail-merge at scale from SQL to DuckDB: merge query data into DOCX templates to produce filled documents — invoices, contracts, statements, letters — each returned as a BLOB. It is the inverse of vgi-tika / vgi-pdf: those pull text/structure out of documents; this fills documents up.
Backed by docxtpl (Jinja2-style
{{ placeholders }} and {% for %} loops over python-docx,
both LGPL/MIT) and docxcompose (BSD) for
the merge. PDF output is optional and uses headless LibreOffice if present.
INSTALL vgi FROM community; LOAD vgi;
ATTACH 'docgen' (TYPE vgi, LOCATION 'uv run docgen_worker.py');
-- One rendered DOCX per row (scalar): template + a STRUCT of row data -> BLOB
SELECT docgen.docgen_render('invoice.docx', {customer: name, total: amount})
FROM invoices;
-- Inline template bytes work too (BLOB template instead of a path)
SELECT docgen.docgen_render(tpl.bytes, {customer: name})
FROM letters, templates AS tpl;
-- Opt into PDF (3rd positional arg; requires headless LibreOffice on PATH)
SELECT docgen.docgen_render('invoice.docx', {total: amount}, true) FROM invoices;
-- Many rows -> ONE merged document (buffering): template as a NAMED arg
SELECT doc FROM docgen.docgen_merge(
(SELECT customer, total FROM invoices),
template := 'invoice.docx');| function | kind | signature | returns |
|---|---|---|---|
docgen_render(template, data) |
scalar | template VARCHAR path or BLOB bytes; data a STRUCT |
rendered .docx BLOB (NULL on failure) |
docgen_render(template, data, pdf) |
scalar | + pdf BOOLEAN |
.docx or .pdf BLOB (NULL if PDF requested but LibreOffice absent) |
docgen_merge(rel, template[, pdf]) |
table (buffering) | rel a (SELECT …); template := 'path' named; pdf := true named |
one-row doc BLOB — all rows merged into a single document |
A true DuckDB scalar, so it composes in any projection. Arguments are
positional only (scalar functions do not support name := value):
- template — a
VARCHARpath (resolved directly, then under$VGI_DOCGEN_TEMPLATES) or aBLOBof raw.docxbytes. These are two overloads sharing the name, resolved by argument type. - data — an arbitrary
STRUCT. Every field becomes a Jinja2 variable in the template ({{ field }}); nestedSTRUCT/LISTdata drives{% for %}loops and tables. Build it inline with{a: x, b: y}orstruct_pack(a := x). - pdf (optional) —
BOOLEAN;trueconverts to PDF via LibreOffice.
The mail-merge path: renders the template once per input row and concatenates the
results into a single merged .docx (a page break between rows), returned as
one BLOB. Because it must see every row first, it is a buffering
(Sink/finalize) table function: the relation is the positional (SELECT …), and
the template (and optional pdf) are named args (table functions support
name := value). Every relation column is available to the template by name.
- Path —
'invoice.docx'. Resolved relative to the worker's working directory, then under$VGI_DOCGEN_TEMPLATESif set, so you can keep a shared templates directory and reference bare names. - Inline bytes — pass a
BLOBcolumn of raw.docxbytes (e.g. a templates table). Lets templates live in DuckDB itself.
Templates are ordinary .docx files authored in Word/LibreOffice with
docxtpl tags: {{ customer }},
{% for item in items %} … {% endfor %}, {%tr %} for table-row loops, etc.
The scalar path treats a NULL template/data, a non-.docx (non-ZIP) blob, a
malformed template, a Jinja render error, or a requested-but-unavailable PDF
conversion as NULL output — never a worker crash. The merge path surfaces the
same failures as a clean DuckDB error (so a typo'd template name is visible,
not silently empty). Every .docx is magic-checked (PK\x03\x04) before
docxtpl touches it.
DOCX is the clean, dependency-light default. PDF conversion shells out to a
headless LibreOffice (soffice/libreoffice on PATH, or $VGI_DOCGEN_SOFFICE).
LibreOffice is not a dependency of this package — if it is absent, the scalar
PDF path yields NULL and the merge PDF path raises a clear error. DOCX→PDF
fidelity is whatever LibreOffice produces.
uv sync --extra dev
uv run --no-sync pytest -q # pure core + in-proc + Client RPC E2E
make test-sql # haybarn-unittest SQL E2E (authoritative)
uv run --no-sync ruff check . && uv run --no-sync mypy vgi_docgen/- DOCX→PDF needs LibreOffice (heavy, optional, slower); without it, stick to DOCX output. Fidelity is LibreOffice's, not Word's.
- Large BLOBs through Arrow have a practical size ceiling. A merge of
thousands of rich documents into one file can grow large; prefer per-row
docgen_render(or batch the merge) for very large jobs. - Template complexity is the support burden. Simple
{{ }}substitution is trivial; loops, tables, and embedded images depend on correctly-authored docxtpl templates and are where most issues live.
This worker is MIT. Dependencies are permissive:
- docxtpl — LGPL-2.1 (used as a library, not modified/vendored).
- python-docx — MIT.
- docxcompose — BSD-3-Clause.
- Jinja2 (via docxtpl) — BSD-3-Clause.
- pyarrow — Apache-2.0.
- LibreOffice — MPL-2.0 / LGPL-3.0, and only an optional runtime dependency for the PDF path (never bundled, never imported; invoked as an external process only when you opt into PDF output).
Written by Query.Farm.
Copyright 2026 Query Farm LLC - https://query.farm
