Improve the runtime performance of the bsonjs.dumps API #106
Open
maozguttman wants to merge 1 commit into
Overview
Improve the runtime performance of the `bsonjs.dumps` API.
Below are the results from `benchmark.py`, executed before and after my changes.
`bsonjs` was compiled with GCC 14.2.0 on SLES 15 and run on Python 3.13.2.
Before my changes:
After my changes:
My internal tool processes BSON files, some of which are extremely large and are probably not representative of the BSON data that most users work with.
As a result, the performance improvements described below may be most relevant to workloads with similar characteristics.
For example, I have a BSON file with a size of 791,038,613 bytes, containing:
It is important to note that these BSON files rarely contain escaped or Unicode characters.
I used Copilot to identify potential areas for runtime optimization. Some of the suggested improvements provided measurable gains in this use case.
I would appreciate it if you could review my changes and, if appropriate, provide feedback, commit them, and include them in a future release.
The overall runtime of my test case was reduced from 19.50 seconds to 4.91 seconds (a 74.8% reduction).
The table below shows the runtime reduction contributed by each change:
Testing
I ran the Python test suite and added a unit test for the "`bson_utf8_escape_for_json` optimization".
In addition, to validate the "Nested string copy elimination in BSON visitor" changes, I locally enabled the "BSON max len limited" configuration (this change was used only for testing and is not included in the commit). I tested all possible values using sample data containing BSON objects and obtained identical results both with and without my changes.
I also tested the complete set of changes on several BSON files and obtained identical results with and without the optimizations.
Code changes
1. `bson_utf8_escape_for_json` optimization
Problem:
The `bson_utf8_escape_for_json` function is called for every string value and key. It always creates a new `bson_string_t` and processes escape and Unicode characters even when the string does not contain any.
This leads to unnecessary memory allocations and copying.
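The cost described above suggests a cheap pre-check: scan the string once and take the allocating escape path only when something actually needs escaping. The following is a minimal sketch of that idea, not the code in this PR. The helper name `needs_json_escape` and the byte classes it tests (control characters, `"`, `\`, and any non-ASCII byte) are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of libbson or this PR).
 * Returns true if the byte range contains anything that a JSON writer
 * would have to escape or inspect more closely. Non-ASCII bytes are
 * conservatively sent to the slow path as well. */
bool
needs_json_escape (const char *s, size_t len)
{
   assert (s != NULL || len == 0);

   for (size_t i = 0; i < len; i++) {
      unsigned char c = (unsigned char) s[i];

      if (c < 0x20 || c == '"' || c == '\\' || c >= 0x80) {
         return true;
      }
   }

   return false; /* plain ASCII: the input can be appended unchanged */
}
```

When this returns false, a caller could append the original bytes directly to the output and skip the temporary string entirely, which matters for data that rarely contains escaped or Unicode characters.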
Fix:
Update the function to:
Files changed:
- `bsonjs/bson/bson-string.c`: `bson_string_new_n`, `bson_string_new`
- `bsonjs/bson/bson-string.h`: `bson_string_new_n`
- `bsonjs/bson/bson-utf8.c`
- `bsonjs/bson/bson.c`: `_bson_as_json_visit_utf8`, `_bson_as_json_visit_regex`, `_bson_as_json_visit_dbpointer`, `_bson_as_json_visit_before`, `_bson_as_json_visit_code`, `_bson_as_json_visit_symbol`, `_bson_as_json_visit_codewscope`
- `test/test_bsonjs.py`: `test_dumps_escaped_and_unicode_characters`
2. Nested string copy elimination in BSON visitor
Problem:
For each nested document or array:
- `bson_string_t` is allocated
This causes excessive allocations and copying.
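To make the contrast concrete, here is a small self-contained sketch of the shared-buffer approach on a toy nested structure. It is an illustration under stated assumptions, not the PR's implementation: `buf_t`, `buf_append`, `node_t`, and `emit` are hypothetical names, and `buf_t` merely stands in for `bson_string_t`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer standing in for bson_string_t. */
typedef struct {
   char *data;
   size_t len;
   size_t cap;
} buf_t;

void
buf_append (buf_t *b, const char *s, size_t n)
{
   if (b->len + n + 1 > b->cap) {
      size_t cap = b->cap ? b->cap : 64;
      while (cap < b->len + n + 1) {
         cap *= 2;
      }
      char *p = realloc (b->data, cap);
      assert (p != NULL); /* sketch only: abort on allocation failure */
      b->data = p;
      b->cap = cap;
   }
   memcpy (b->data + b->len, s, n);
   b->len += n;
   b->data[b->len] = '\0';
}

/* Toy nested value: a leaf (text != NULL) or an array of children. */
typedef struct node {
   const char *text;
   const struct node *children;
   size_t n_children;
} node_t;

/* Every nesting level appends to the same buffer, so nothing is copied
 * from a child buffer back into a parent buffer. */
void
emit (const node_t *n, buf_t *out)
{
   if (n->text) {
      buf_append (out, n->text, strlen (n->text));
      return;
   }
   buf_append (out, "[", 1);
   for (size_t i = 0; i < n->n_children; i++) {
      if (i > 0) {
         buf_append (out, ", ", 2);
      }
      emit (&n->children[i], out);
   }
   buf_append (out, "]", 1);
}
```

With a per-level buffer, the text of a deeply nested value is copied once for every enclosing level. With one shared buffer, each byte is written once, apart from the occasional `realloc` growth.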
Fix:
Use a single shared `bson_string_t` buffer across all nesting levels.
Files changed:
- `bsonjs/bson/bson.c`: `_state_str_len`, `bson_json_state_t`, `_bson_as_json_visit_after`, `_bson_as_json_visit_codewscope`, `_bson_as_json_visit_document`, `_bson_as_json_visit_array`, `_bson_as_json_visit_all`
3. `bson_string_append_printf` optimization
Problem:
Each call to `bson_string_append_printf`:
- `vsnprintf` into it
This results in frequent heap allocations.
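One common way to avoid a temporary heap buffer per call is to let `vsnprintf` write directly into the free tail of the destination buffer and to grow that buffer only when the output did not fit. The sketch below shows that general technique. It is not taken from this PR, and `out_t`, `out_reserve`, and `out_append_printf` are hypothetical names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable output buffer (stand-in for bson_string_t). */
typedef struct {
   char *data;
   size_t len;
   size_t cap;
} out_t;

/* Ensure room for `extra` more bytes plus the terminating NUL. */
void
out_reserve (out_t *o, size_t extra)
{
   size_t need = o->len + extra + 1;
   if (need <= o->cap) {
      return;
   }
   size_t cap = o->cap ? o->cap : 64;
   while (cap < need) {
      cap *= 2;
   }
   char *p = realloc (o->data, cap);
   assert (p != NULL); /* sketch only: abort on allocation failure */
   o->data = p;
   o->cap = cap;
}

void
out_append_printf (out_t *o, const char *fmt, ...)
{
   va_list ap, ap2;
   va_start (ap, fmt);
   va_copy (ap2, ap); /* the first vsnprintf consumes ap */

   out_reserve (o, 32); /* optimistic guess for short outputs */
   int n = vsnprintf (o->data + o->len, o->cap - o->len, fmt, ap);
   va_end (ap);
   assert (n >= 0);

   if ((size_t) n >= o->cap - o->len) {
      /* Output was truncated: grow once and format again in place. */
      out_reserve (o, (size_t) n);
      n = vsnprintf (o->data + o->len, o->cap - o->len, fmt, ap2);
   }
   va_end (ap2);

   o->len += (size_t) n;
}
```

Short outputs, which are the typical case for numbers and fixed JSON fragments, are formatted exactly once and never touch a temporary allocation.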
Fix:
`vsnprintf`
Files changed:
- `bsonjs/bson/bson-string.c`: `bson_string_append_printf`, `bson_strdupv_printf`, `bson_strdup_printf`
- `bsonjs/bson/bson-string.h`: `bson_strdupv_printf`
4. `bson_utf8_validate` optimization
Problem:
Fix:
`memchr` for more efficient detection.
Files changed:
- `bsonjs/bson/bson-utf8.c`: `bson_utf8_validate`
5. Eliminate Redundant `strlen` Calls
Problem:
`strlen` are made even though the string length is already known.
Fix:
`strlen`.
Files changed:
- `bsonjs/bson/bson-iso8601.c`: `_bson_iso8601_date_format`
- `bsonjs/bson/bson-iter.c`: `bson_iter_visit_all`
- `bsonjs/bson/bson-string.c`: `bson_string_append`, `bson_string_append_c`, `bson_string_append_unichar`, `bson_string_append_printf`, `bson_strdupv_printf`, `bson_strdup_printf`
- `bsonjs/bson/bson-string.h`: `STR_AND_LEN`, `bson_string_append`, `bson_string_append_printf`
- `bsonjs/bson/bson-utf8.c`: `bson_utf8_escape_for_json`
- `bsonjs/bson/bson-utf8.h`: `bson_utf8_escape_for_json`
- `bsonjs/bson/bson.c`: `_bson_as_json_visit_utf8`, `_bson_as_json_visit_decimal128`, `_bson_as_json_visit_double`, `_bson_as_json_visit_undefined`, `_bson_as_json_visit_null`, `_bson_as_json_visit_oid`, `_bson_as_json_visit_binary`, `_bson_as_json_visit_bool`, `_bson_as_json_visit_date_time`, `_bson_as_json_visit_regex`, `_bson_as_json_visit_timestamp`, `_bson_as_json_visit_dbpointer`, `_bson_as_json_visit_minkey`, `_bson_as_json_visit_maxkey`, `_bson_as_json_visit_before`, `_bson_as_json_visit_code`, `_bson_as_json_visit_symbol`, `_bson_as_json_visit_codewscope`, `_bson_as_json_visit_document`, `_bson_as_json_visit_array`, `_bson_as_json_visit_all`, `_bson_iter_validate_before`
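For readers unfamiliar with the pattern behind change 5, here is a minimal sketch of passing a known length alongside a string so the callee never calls `strlen`. The definition of the PR's `STR_AND_LEN` is not shown in this description, so the macro below uses a different name, `LIT_AND_LEN`, and is an assumption about how such a macro is commonly written. `append_n` is likewise a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative macro (an assumption, not the PR's STR_AND_LEN).
 * Expands to two arguments: the string and its compile-time length.
 * Only valid for string literals and char arrays; for a char pointer,
 * sizeof would yield the pointer size instead. */
#define LIT_AND_LEN(s) (s), (sizeof (s) - 1)

/* Hypothetical fixed-capacity appender taking an explicit length.
 * Returns the new length of the destination string. */
size_t
append_n (char *dst, size_t dst_len, size_t dst_cap, const char *src, size_t n)
{
   assert (dst_len + n < dst_cap); /* sketch only: caller guarantees room */
   memcpy (dst + dst_len, src, n);
   dst[dst_len + n] = '\0';
   return dst_len + n;
}
```

A call such as `len = append_n (buf, len, sizeof buf, LIT_AND_LEN ("null"));` lets the compiler supply the length 4 as a constant, whereas a `strlen`-based appender would rescan the literal on every call.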