diff --git a/Doc/library/base64.rst b/Doc/library/base64.rst index a722607b2c1f19..8af40a2f8a65e3 100644 --- a/Doc/library/base64.rst +++ b/Doc/library/base64.rst @@ -16,8 +16,10 @@ This module provides functions for encoding binary data to printable ASCII characters and decoding such encodings back to binary data. This includes the :ref:`encodings specified in ` -:rfc:`4648` (Base64, Base32 and Base16) -and the non-standard :ref:`Base85 encodings `. +:rfc:`4648` (Base64, Base32 and Base16), the :ref:`Base85 encoding +` specified in `PDF 2.0 +`_, and non-standard variants +of Base85 used elsewhere. There are two interfaces provided by this module. The modern interface supports encoding :term:`bytes-like objects ` to ASCII @@ -284,19 +286,28 @@ POST request. Base85 Encodings ----------------- -Base85 encoding is not formally specified but rather a de facto standard, -thus different systems perform the encoding differently. +Base85 encoding is a family of algorithms which represent four bytes +using five ASCII characters. Originally implemented in the Unix +``btoa(1)`` utility, a version of it was later adopted by Adobe in the +PostScript language and is standardized in PDF 2.0 (ISO 32000-2). +This version, in both its ``btoa`` and PDF variants, is implemented by +:func:`a85encode`. -The :func:`a85encode` and :func:`b85encode` functions in this module are two implementations of -the de facto standard. You should call the function with the Base85 -implementation used by the software you intend to work with. +A separate version, using a different output character set, was +defined as an April Fool's joke in :rfc:`1924` but is now used by Git +and other software. This version is implemented by :func:`b85encode`. 
-The two functions present in this module differ in how they handle the following: +Finally, a third version, using yet another output character set +designed for safe inclusion in programming language strings, is +defined by ZeroMQ and implemented here by :func:`z85encode`. -* Whether to include enclosing ``<~`` and ``~>`` markers -* Whether to include newline characters -* The set of ASCII characters used for encoding -* Handling of null bytes +The functions present in this module differ in how they handle the following: + +* Whether to include and expect enclosing ``<~`` and ``~>`` markers. +* Whether to fold the input into multiple lines. +* The set of ASCII characters used for encoding. +* Compact encodings of sequences of spaces and null bytes. +* The encoding of zero-padding bytes applied to the input. Refer to the documentation of the individual functions for more information. @@ -307,18 +318,22 @@ Refer to the documentation of the individual functions for more information. *foldspaces* is an optional flag that uses the special short sequence 'y' instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This - feature is not supported by the "standard" Ascii85 encoding. + feature is not supported by the standard encoding used in PDF. If *wrapcol* is non-zero, insert a newline (``b'\n'``) character after at most every *wrapcol* characters. If *wrapcol* is zero (default), do not insert any newlines. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. - Note that the ``btoa`` implementation always pads. + *pad* controls whether zero-padding applied to the end of the input + is fully retained in the output encoding, as done by ``btoa``, + producing an exact multiple of 5 bytes of output. This is not part + of the standard encoding used in PDF, as it does not preserve the + length of the data. 
- *adobe* controls whether the encoded byte sequence is framed with ``<~`` - and ``~>``, which is used by the Adobe implementation. + *adobe* controls whether the encoded byte sequence is framed with + ``<~`` and ``~>``, as in a PostScript base-85 string literal. Note + that while ASCII85Decode streams in PDF documents *must* be + terminated with ``~>``, they *must not* use a leading ``<~``. .. versionadded:: 3.4 @@ -330,10 +345,12 @@ Refer to the documentation of the individual functions for more information. *foldspaces* is a flag that specifies whether the 'y' short sequence should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20). - This feature is not supported by the "standard" Ascii85 encoding. + This feature is not supported by the standard Ascii85 encoding used in + PDF and PostScript. - *adobe* controls whether the input sequence is in Adobe Ascii85 format - (i.e. is framed with <~ and ~>). + *adobe* controls whether the ``<~`` and ``~>`` markers are + present. While the leading ``<~`` is not required, the input must + end with ``~>``, or a :exc:`ValueError` is raised. *ignorechars* should be a :term:`bytes-like object` containing characters to ignore from the input. @@ -356,8 +373,11 @@ Refer to the documentation of the individual functions for more information. Encode the :term:`bytes-like object` *b* using base85 (as used in e.g. git-style binary diffs) and return the encoded :class:`bytes`. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. + The input is padded with ``b'\0'`` so its length is a multiple of 4 + bytes before encoding. If *pad* is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes, and thus the length of the data may not be + preserved on decoding. If *wrapcol* is non-zero, insert a newline (``b'\n'``) character after at most every *wrapcol* characters. 
@@ -372,8 +392,7 @@ Refer to the documentation of the individual functions for more information. .. function:: b85decode(b, *, ignorechars=b'', canonical=False) Decode the base85-encoded :term:`bytes-like object` or ASCII string *b* and - return the decoded :class:`bytes`. Padding is implicitly removed, if - necessary. + return the decoded :class:`bytes`. *ignorechars* should be a :term:`bytes-like object` containing characters to ignore from the input. @@ -392,11 +411,12 @@ Refer to the documentation of the individual functions for more information. .. function:: z85encode(s, pad=False, *, wrapcol=0) Encode the :term:`bytes-like object` *s* using Z85 (as used in ZeroMQ) - and return the encoded :class:`bytes`. See `Z85 specification - `_ for more information. + and return the encoded :class:`bytes`. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. + The input is padded with ``b'\0'`` so its length is a multiple of 4 + bytes before encoding. If *pad* is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes, as required by the ZeroMQ standard. If *wrapcol* is non-zero, insert a newline (``b'\n'``) character after at most every *wrapcol* characters. @@ -414,8 +434,7 @@ Refer to the documentation of the individual functions for more information. .. function:: z85decode(s, *, ignorechars=b'', canonical=False) Decode the Z85-encoded :term:`bytes-like object` or ASCII string *s* and - return the decoded :class:`bytes`. See `Z85 specification - `_ for more information. + return the decoded :class:`bytes`. *ignorechars* should be a :term:`bytes-like object` containing characters to ignore from the input. @@ -499,3 +518,11 @@ recommended to review the security section for any code deployed to production. Section 5.2, "Base64 Content-Transfer-Encoding," provides the definition of the base64 encoding. 
+ `ISO 32000-2 Portable document format - Part 2: PDF 2.0 `_ + Section 7.4.3, "ASCII85Decode Filter," provides the definition + of the Ascii85 encoding used in PDF and PostScript, including + the output character set and the details of data length preservation + using zero-padding and partial output groups. + + `ZeroMQ RFC 32/Z85 `_ + The "Formal Specification" section provides the character set used in Z85. diff --git a/Doc/library/binascii.rst b/Doc/library/binascii.rst index 8b4ba6ae9fb254..60afe9261d51fa 100644 --- a/Doc/library/binascii.rst +++ b/Doc/library/binascii.rst @@ -133,8 +133,11 @@ The :mod:`!binascii` module defines the following functions: should be accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is not supported by the "standard" Ascii85 encoding. - *adobe* controls whether the input sequence is in Adobe Ascii85 format - (i.e. is framed with <~ and ~>). + *adobe* controls whether the encoded byte sequence is framed with + ``<~`` and ``~>``, as in a PostScript base-85 string literal. If + *adobe* is true, a leading ``<~`` is optionally accepted, while a + trailing ``~>`` is *required*, and :exc:`binascii.Error` is raised + if it is not found. *ignorechars* should be a :term:`bytes-like object` containing characters to ignore from the input. @@ -164,12 +167,16 @@ The :mod:`!binascii` module defines the following functions: after at most every *wrapcol* characters. If *wrapcol* is zero (default), do not insert any newlines. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. - Note that the ``btoa`` implementation always pads. + If *pad* is true, the zero-padding applied to the end of the input + is fully retained in the output encoding, as done by ``btoa``, + producing an exact multiple of 5 bytes of output. This is not part + of the standard encoding used in PDF, as it does not preserve the + length of the data. 
- *adobe* controls whether the encoded byte sequence is framed with ``<~`` - and ``~>``, which is used by the Adobe implementation. + *adobe* controls whether the encoded byte sequence is framed with + ``<~`` and ``~>``, as in a PostScript base-85 string literal. Note + that while ASCII85Decode streams in PDF documents *must* be + terminated with ``~>``, they *must not* use a leading ``<~``. .. versionadded:: 3.15 @@ -213,8 +220,10 @@ The :mod:`!binascii` module defines the following functions: after at most every *wrapcol* characters. If *wrapcol* is zero (default), do not insert any newlines. - If *pad* is true, the input is padded with ``b'\0'`` so its length is a - multiple of 4 bytes before encoding. + If *pad* is true, the zero-padding applied to the end of the input + is retained in the output, which will always be a multiple of 5 + bytes, and thus the length of the data may not be preserved on + decoding. .. versionadded:: 3.15 diff --git a/Lib/base64.py b/Lib/base64.py index 4b810e08569e5b..4a0e9d446edb0b 100644 --- a/Lib/base64.py +++ b/Lib/base64.py @@ -315,16 +315,20 @@ def a85encode(b, *, foldspaces=False, wrapcol=0, pad=False, adobe=False): foldspaces is an optional flag that uses the special short sequence 'y' instead of 4 consecutive spaces (ASCII 0x20) as supported by 'btoa'. This - feature is not supported by the "standard" Adobe encoding. + feature is not supported by the standard encoding used in PDF. If wrapcol is non-zero, insert a newline (b'\\n') character after at most every wrapcol characters. - pad controls whether the input is padded to a multiple of 4 before - encoding. Note that the btoa implementation always pads. + pad controls whether zero-padding applied to the end of the input + is fully retained in the output encoding, as done by btoa, + producing an exact multiple of 5 bytes of output. + + adobe controls whether the encoded byte sequence is framed with <~ + and ~>, as in a PostScript base-85 string literal. 
Note that + while ASCII85Decode streams in PDF documents must be terminated + with ~>, they must not use a leading <~. - adobe controls whether the encoded byte sequence is framed with <~ and ~>, - which is used by the Adobe implementation. """ return binascii.b2a_ascii85(b, foldspaces=foldspaces, adobe=adobe, wrapcol=wrapcol, pad=pad) @@ -333,12 +337,14 @@ def a85decode(b, *, foldspaces=False, adobe=False, ignorechars=b' \t\n\r\v', canonical=False): """Decode the Ascii85 encoded bytes-like object or ASCII string b. - foldspaces is a flag that specifies whether the 'y' short sequence should be - accepted as shorthand for 4 consecutive spaces (ASCII 0x20). This feature is - not supported by the "standard" Adobe encoding. + foldspaces is a flag that specifies whether the 'y' short sequence + should be accepted as shorthand for 4 consecutive spaces (ASCII + 0x20). This feature is not supported by the standard Ascii85 + encoding used in PDF and PostScript. - adobe controls whether the input sequence is in Adobe Ascii85 format (i.e. - is framed with <~ and ~>). + adobe controls whether the <~ and ~> markers are present. While + the leading <~ is not required, the input must end with ~>, or a + ValueError is raised. ignorechars should be a byte string containing characters to ignore from the input. This should only contain whitespace characters, and by default @@ -358,8 +364,10 @@ def b85encode(b, pad=False, *, wrapcol=0): If wrapcol is non-zero, insert a newline (b'\\n') character after at most every wrapcol characters. - If pad is true, the input is padded with b'\\0' so its length is a multiple of - 4 bytes before encoding. + The input is padded with b'\\0' so its length is a multiple of 4 + bytes before encoding. If pad is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes.
""" return binascii.b2a_base85(b, wrapcol=wrapcol, pad=pad) @@ -379,8 +387,10 @@ def z85encode(s, pad=False, *, wrapcol=0): If wrapcol is non-zero, insert a newline (b'\\n') character after at most every wrapcol characters. - If pad is true, the input is padded with b'\\0' so its length is a multiple of - 4 bytes before encoding. + The input is padded with b'\\0' so its length is a multiple of 4 + bytes before encoding. If pad is true, all the resulting + characters are retained in the output, which will always be a + multiple of 5 bytes, as required by the ZeroMQ standard. """ return binascii.b2a_base85(s, wrapcol=wrapcol, pad=pad, alphabet=binascii.Z85_ALPHABET) diff --git a/Modules/binascii.c b/Modules/binascii.c index 673dca6ee134bd..0e7af135a6f6ce 100644 --- a/Modules/binascii.c +++ b/Modules/binascii.c @@ -1057,7 +1057,8 @@ binascii.a2b_ascii85 foldspaces: bool = False Allow 'y' as a short form encoding four spaces. adobe: bool = False - Expect data to be wrapped in '<~' and '~>' as in Adobe Ascii85. + Expect data to be terminated with '~>' as in Adobe Ascii85, and + optionally accept leading '<~'. ignorechars: Py_buffer = b'' A byte string containing characters to ignore from the input. canonical: bool = False @@ -1069,7 +1070,7 @@ Decode Ascii85 data. static PyObject * binascii_a2b_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces, int adobe, Py_buffer *ignorechars, int canonical) -/*[clinic end generated code: output=09b35f1eac531357 input=dd050604ed30199e]*/ +/*[clinic end generated code: output=09b35f1eac531357 input=08eab2e53c62f1a8]*/ { const unsigned char *ascii_data = data->buf; Py_ssize_t ascii_len = data->len; @@ -1264,7 +1265,7 @@ binascii.b2a_ascii85 wrapcol: size_t = 0 Split result into lines of provided width. pad: bool = False - Pad input to a multiple of 4 before encoding. + Retain zero-padding bytes at end of output. adobe: bool = False Wrap result in '<~' and '~>' as in Adobe Ascii85. @@ -1274,7 +1275,7 @@ Ascii85-encode data. 
static PyObject * binascii_b2a_ascii85_impl(PyObject *module, Py_buffer *data, int foldspaces, size_t wrapcol, int pad, int adobe) -/*[clinic end generated code: output=5ce8fdee843073f4 input=791da754508c7d17]*/ +/*[clinic end generated code: output=5ce8fdee843073f4 input=a77e31d63517bf19]*/ { const unsigned char *bin_data = data->buf; Py_ssize_t bin_len = data->len; @@ -1539,7 +1540,7 @@ binascii.b2a_base85 / * pad: bool = False - Pad input to a multiple of 4 before encoding. + Retain zero-padding bytes at end of output. wrapcol: size_t = 0 alphabet: Py_buffer(c_default="{NULL, NULL}") = BASE85_ALPHABET @@ -1549,7 +1550,7 @@ Base85-code line of data. static PyObject * binascii_b2a_base85_impl(PyObject *module, Py_buffer *data, int pad, size_t wrapcol, Py_buffer *alphabet) -/*[clinic end generated code: output=98b962ed52c776a4 input=1b20b0bd6572691b]*/ +/*[clinic end generated code: output=98b962ed52c776a4 input=54886d05128d41a8]*/ { const unsigned char *bin_data = data->buf; Py_ssize_t bin_len = data->len; diff --git a/Modules/clinic/binascii.c.h b/Modules/clinic/binascii.c.h index ed695758ef998c..29fa9e87de87c7 100644 --- a/Modules/clinic/binascii.c.h +++ b/Modules/clinic/binascii.c.h @@ -372,7 +372,8 @@ PyDoc_STRVAR(binascii_a2b_ascii85__doc__, " foldspaces\n" " Allow \'y\' as a short form encoding four spaces.\n" " adobe\n" -" Expect data to be wrapped in \'<~\' and \'~>\' as in Adobe Ascii85.\n" +" Expect data to be terminated with \'~>\' as in Adobe Ascii85, and\n" +" optionally accept leading \'<~\'.\n" " ignorechars\n" " A byte string containing characters to ignore from the input.\n" " canonical\n" @@ -492,7 +493,7 @@ PyDoc_STRVAR(binascii_b2a_ascii85__doc__, " wrapcol\n" " Split result into lines of provided width.\n" " pad\n" -" Pad input to a multiple of 4 before encoding.\n" +" Retain zero-padding bytes at end of output.\n" " adobe\n" " Wrap result in \'<~\' and \'~>\' as in Adobe Ascii85."); @@ -709,7 +710,7 @@ PyDoc_STRVAR(binascii_b2a_base85__doc__, 
"Base85-code line of data.\n" "\n" " pad\n" -" Pad input to a multiple of 4 before encoding."); +" Retain zero-padding bytes at end of output."); #define BINASCII_B2A_BASE85_METHODDEF \ {"b2a_base85", _PyCFunction_CAST(binascii_b2a_base85), METH_FASTCALL|METH_KEYWORDS, binascii_b2a_base85__doc__}, @@ -1684,4 +1685,4 @@ binascii_b2a_qp(PyObject *module, PyObject *const *args, Py_ssize_t nargs, PyObj return return_value; } -/*[clinic end generated code: output=b41544f39b0ef681 input=a9049054013a1b77]*/ +/*[clinic end generated code: output=42dd48f323cbb118 input=a9049054013a1b77]*/