
GH-48701: [C++][Parquet] Add ALPpd encoding (#48345)

Open
prtkgaur wants to merge 80 commits into apache:main from prtkgaur:gh540-alp-pseudoDecimal-encoding

Conversation

@prtkgaur

@prtkgaur prtkgaur commented Dec 5, 2025

Co-authored-by: dhirhan17@gmail.com

@Reviewer: Suggested order in which to look at the code while reviewing: (outdated, will update shortly)

Rationale for this change

ALP significantly improves the compression ratio and decompression speed of float/double columns over other encoding/compression techniques.

Spec

This PR also contains a terse version of the spec in the file cpp/src/arrow/util/alp/ALP_Encoding_Specification_terse.md, which can go into Encodings.md.

Parquet Format PR

Dataset PR (parquet-testing)

apache/parquet-testing#100

What changes are included in this PR?

This PR introduces ALP (pseudo-decimal) encoding into the C++ Arrow code. We also provide benchmarks and a dataset to demonstrate the effectiveness of the algorithm.

Adding the above required the following new classes:

  • Alp h/cc: houses the core logic for encoding and decoding.
  • Sampler h/cc: houses the logic to sample data and select parameters for encoding.
  • AlpWrapper h/cc: binds together the Alp and Sampler classes.

Integration of the above code was done in:

  • Encoder/Decoder cc, which exposes the wrapper to encode a buffer of data.

Are these changes tested?

  • We have added unit tests covering the code.
  • Benchmarks have also been added that cover a wide variety of floating-point values, from low precision to high precision.

Unit tests

  • alp_test.cc

Benchmark tests

  • encoding_benchmark.cc and encoding_alp_benchmark.cc

Are there any user-facing changes?

  • It's a new encoding, so the only user-facing impact is query performance, which we expect will only improve.

DuckDB

  • We looked at DuckDB's ALP implementation while implementing ALP and would like to give that team due credit.

@github-actions

github-actions Bot commented Dec 5, 2025

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

See also:

@prtkgaur prtkgaur force-pushed the gh540-alp-pseudoDecimal-encoding branch 3 times, most recently from 1b78a5c to d563ce0 Compare December 7, 2025 15:46
Contributor

I think the more standard place to put test data is in either arrow-testing or parquet-testing so it can be used across implementations

In this case I would recommend https://github.com/apache/parquet-testing

Author

Makes sense. Thanks.
apache/parquet-testing#100

Comment thread cpp/src/parquet/types.h
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
Contributor

🎉

Contributor

bump

Author

done

Author

For parquet-format we have this PR : apache/parquet-format#557

@alamb
Contributor

alamb commented Dec 8, 2025

Thanks @prtkgaur -- it is super exciting to see this movement.

Unfortunately, I am not familiar enough with the C/C++ codebase to give this a realistic review.

I started the CI checks on this PR and had some comments about the testing.

@prtkgaur prtkgaur changed the title [Gh540] Add ALPpd encoding to parquet [Gh539] Add ALPpd encoding to parquet Dec 8, 2025
Author

Makes sense. Thanks.
apache/parquet-testing#100

std::string tarball_path = std::string(__FILE__);
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path = tarball_path.substr(0, tarball_path.find_last_of("/\\"));
tarball_path += "/arrow/cpp/submodules/parquet-testing/data/floatingpoint_data.tar.gz";
Author

@Reviewer the data sits in the parquet-testing submodule
apache/parquet-testing#100


// Unsafe resize without initialization - use only when you will immediately
// overwrite the memory (e.g., before memcpy). Only safe for POD types.
void UnsafeResize(size_t n) {
Author

Using this over resize gave us around 2-3% performance improvement
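The idea can be sketched as follows. This is an illustrative stand-in for the PR's actual UnsafeResize, not its real implementation: the class name and the realloc-backed storage here are hypothetical, and error handling is elided.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <type_traits>

// Illustrative sketch (not the PR's actual class): grow a buffer without
// zero-initializing the new tail. Safe only when the caller overwrites the
// memory immediately (e.g. via memcpy) and T is trivially copyable.
template <typename T>
class RawBuffer {
  static_assert(std::is_trivially_copyable<T>::value, "POD-like types only");

 public:
  ~RawBuffer() { std::free(data_); }

  void UnsafeResize(size_t n) {
    if (n > capacity_) {
      // realloc leaves the newly added bytes uninitialized, so there is no
      // per-element value-initialization cost, unlike std::vector::resize.
      // (realloc failure handling elided for brevity.)
      data_ = static_cast<T*>(std::realloc(data_, n * sizeof(T)));
      capacity_ = n;
    }
    size_ = n;
  }

  T* data() { return data_; }
  size_t size() const { return size_; }

 private:
  T* data_ = nullptr;
  size_t size_ = 0;
  size_t capacity_ = 0;
};
```

The contract is that the caller must fill `data()` (e.g. with memcpy) before reading any of the resized range; std::vector cannot express this because resize always value-initializes.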

@prtkgaur prtkgaur changed the title [Gh539] Add ALPpd encoding to parquet [Gh539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
@prtkgaur prtkgaur changed the title [Gh539][Encoding] Add ALPpd encoding to parquet [Gh-539][Encoding] Add ALPpd encoding to parquet Dec 8, 2025
Add float32 (FLOAT) coverage to the ALP encoding test pipeline:
- Generator: WriteAlpParquetFloat() and WriteExpectCsvFloat() produce
  float32 parquet files by casting double source data to float
- Tests: ReadFloatsFromCSV() and AssertTableMatchesCSVFloat() with
  bit-exact uint32 comparison for 4 new test cases (C++ and Java
  generated float parquets for spotify1 and arade datasets)
- Fix missing #include <optional> in alp.h
Add static_assert(ARROW_LITTLE_ENDIAN) in alp.cc and alp_wrapper.cc to
guard memcpy-based integer serialization, as requested by emkornfield.

Strengthen DecodeAlp buffer validation: before reading each vector,
verify that enough buffer remains for both metadata and data sections.
Previously only the offset itself was bounds-checked, which could allow
out-of-bounds reads from crafted compressed input.
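The per-vector check described above can be sketched as follows; the function name and parameters are illustrative, not the PR's exact API.

```cpp
#include <cstddef>

// Illustrative sketch of the strengthened validation: before decoding a
// vector, verify that the remaining buffer covers both the fixed metadata
// and the variable-size data section, not just that the offset is in range.
// Arithmetic uses subtraction only, so it cannot overflow for valid inputs.
bool HasRoomForVector(size_t buf_size, size_t offset, size_t metadata_size,
                      size_t data_size) {
  if (offset > buf_size) return false;            // offset itself out of range
  const size_t remaining = buf_size - offset;
  if (metadata_size > remaining) return false;    // metadata would overrun
  return data_size <= remaining - metadata_size;  // data section fits too
}
```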
- Rename AlpEncodingPreset -> AlpEncodingParameters across all files
- Rename AlpWrapper -> AlpCodec across all files
- Rename CreateEncodingPreset -> CreateEncodingParameters
- Replace std::ceil(.../8.0) with bit_util::BytesForBits()
- Simplify BitUnpackIntegers: single unpack() call instead of manual batch splitting
- Replace std::memcpy with SafeCopy in DecodeVector for consistency
- Move out parameters to last position in Decode() and DecodeAlp()
- Change Decode num_elements from uint32_t to int32_t
- Change GetMaxCompressedSize to int64_t, rename param to uncompressed_size
- Change CompressionProgress/DecompressionProgress fields to int64_t
- Change AlpEncodingParameters::best_compressed_size to int64_t
- Spell out ALP (Adaptive Lossless floating-Point) in alp_constants.h with spec link
- Add ALP paper citation for sampling constants
- Remove ALP from Compression::type enum (it's an encoding, not a compressor)
- Add validation in AlpDecoder::SetData for num_values>0 with len<=0
- Document GetMaxCompressedSize() in Encode/EncodeWithPreset comp param
- Fix narrowing warnings in alp_wrapper.cc (uint64_t -> int64_t casts
  for CompressionProgress/DecompressionProgress fields)
- Fix narrowing in alp.cc (best_compressed_size_bytes uint64_t -> int64_t)
- Fix remaining Decode() call sites in alp_test.cc to use new parameter
  order (num_elements, comp, comp_size, output)
@@ -565,6 +565,12 @@ if(ARROW_WITH_ZSTD)
list(APPEND ARROW_UTIL_SRCS util/compression_zstd.cc)
endif()

Author

For the reviewer, rough stats about this PR:

Production Code  | ~2,449 lines | ~39%
Test Code        | ~1,151+      | ~18%
Benchmark        | ~1,824       | ~29%
Documentation    | ~897         | ~14%

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
min_encoded_value = std::min(encoded_value, min_encoded_value);
continue;
}
num_exceptions++;
Author

Removed the num_non_exceptions counter from the loop; it is now derived as input_vector.size() - num_exceptions after the loop instead.

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
const ExactType delta = (static_cast<ExactType>(max_encoded_value) -
static_cast<ExactType>(min_encoded_value));

const uint32_t estimated_bits_per_value =
Author

Done — replaced both std::ceil(std::log2(delta + 1)) in EstimateCompressedSize and the hand-rolled __builtin_clz/__builtin_clzll in BitPackIntegers with bit_util::NumRequiredBits(). The existing uint64_t overload handles both float (uint32_t delta) and double (uint64_t delta) correctly via implicit widening, so an int32 version isn't strictly needed here, but happy to add one if you think it's worthwhile for the broader codebase.
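For readers unfamiliar with the helper, the computation is equivalent to the following portable sketch. Arrow's actual bit_util::NumRequiredBits is the authoritative version; this stand-in just illustrates why the integer form is preferable to the floating-point ceil(log2(delta + 1)), which can mis-round near exact powers of two.

```cpp
#include <cstdint>

// Number of bits needed to represent any value in [0, x]: the exact integer
// equivalent of ceil(log2(x + 1)), with no floating-point rounding hazards.
// (Illustrative stand-in for Arrow's bit_util::NumRequiredBits.)
inline int NumRequiredBitsSketch(uint64_t x) {
  int bits = 0;
  while (x != 0) {  // count significant bits by shifting out one per step
    ++bits;
    x >>= 1;
  }
  return bits;
}
```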

// N of appearances is irrelevant at this phase; we search for best compression.
AlpCombination best_combination{best_encoding_options, 0, best_total_bits};
// Try all combinations to find the one which minimizes compression size.
for (uint8_t exp_idx = 0; exp_idx <= Constants::kMaxExponent; exp_idx++) {
Author

Done — extracted the inner loops into a find_best_for_vector lambda. The outer loop is now flat: auto [best, size_bits] = find_best_for_vector(sampled_vector)

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
std::vector<AlpCombination> best_k_combinations;
best_k_combinations.reserve(
std::min(best_k_combinations_hash.size(), kMaxCombinationCount));
for (const auto& combination : best_k_combinations_hash) {
Author

Done — applied structured bindings. This is actually C++17 (which Arrow already uses), so no version bump needed.

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
// (SIMD128/NEON, SIMD256/AVX2, SIMD512/AVX512) have identical batch sizes:
// - uint32_t (float): Simd*UnpackerForWidth::kValuesUnpacked = 32
// - uint64_t (double): Simd*UnpackerForWidth::kValuesUnpacked = 64
// These constants are in anonymous namespaces (internal implementation detail),
Author

Simplified — Arrow's unpack() already handles arbitrary batch sizes internally (runs SIMD for complete batches, then unpack_exact for the remainder), so we no longer need to manually split or hardcode batch sizes. The function is now a single unpack() call.

const ExactType* data = input_vector.data();
const ExactType frame_of_ref = for_info.frame_of_reference;

#pragma GCC unroll AlpConstants::kLoopUnrolls
Author

The #pragma GCC unroll with ivdep already generates efficient SIMD code — the compiler auto-vectorizes these loops. Hand-unrolling with a float/double-specific factor would couple the code to specific SIMD widths and add complexity. Happy to benchmark and revisit if there's measurable room for improvement.

Comment thread cpp/src/arrow/util/alp/alp.cc Outdated
const ExactType unfored_value = data[i] + frame_of_ref;
// 2. Reinterpret as signed integer for decode
SignedExactType signed_value;
std::memcpy(&signed_value, &unfored_value, sizeof(SignedExactType));
Author

Done — replaced with SafeCopy for consistency with EncodeVector.

arrow::util::span<const uint16_t> exception_positions) {
// Exceptions Patching.
uint64_t exception_idx = 0;
#pragma GCC unroll AlpConstants::kLoopUnrolls
Author

The trade-off is instruction-level parallelism vs code size/register pressure. kLoopUnrolls = 4 is a reasonable middle ground for both x86 (wide OoO) and ARM (narrower). Lower values (1-2) under-utilize the pipeline; higher values (8+) increase I-cache pressure. We profiled with 2, 4, and 8 during development — 4 was consistently best across platforms. The constant is in alp_constants.h so it's easy to tune per-platform if needed.

// Exceptions Patching.
uint64_t exception_idx = 0;
#pragma GCC unroll AlpConstants::kLoopUnrolls
#pragma GCC ivdep
Author

Without ivdep, the compiler may assume output_vector[i] and data[i] alias (both are pointer-accessed arrays of same-sized types), which prevents vectorization of the fused unFOR+decode loop. In practice they never alias: data comes from a local StaticVector and output_vector is the caller's output buffer. ivdep tells the compiler this is safe to vectorize.
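Putting the pieces from this thread together, the fused loop under discussion looks roughly like the following. The names and the single scale multiplier are illustrative; the real decode derives its multiplier from the ALP exponent/factor pair, and the PR uses SafeCopy rather than raw memcpy.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative fused unFOR + decode loop: undo the frame of reference,
// reinterpret the unsigned bit pattern as a signed integer (memcpy avoids
// type-punning UB), then scale back to floating point. With no aliasing
// between data and out (the ivdep claim), every iteration is independent
// and the compiler can vectorize the whole loop.
void DecodeVectorSketch(const uint64_t* data, size_t n, uint64_t frame_of_ref,
                        double scale, double* out) {
  for (size_t i = 0; i < n; ++i) {
    const uint64_t unfored = data[i] + frame_of_ref;            // 1. undo FOR
    int64_t signed_value;
    std::memcpy(&signed_value, &unfored, sizeof(signed_value));  // 2. reinterpret bits
    out[i] = static_cast<double>(signed_value) * scale;          // 3. undo scaling
  }
}
```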

Break the fused encode→decode→compare loop into three separate passes
over batches of 8 elements. The encode and decode passes are now
independent loops that the compiler can vectorize (FastRound uses the
magic number trick, not std::lround, so it is SIMD-friendly). A scalar
tail handles the remainder.
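The magic-number rounding mentioned here is a classic trick; a hedged sketch (the PR's FastRound may differ in detail) for doubles with |x| < 2^51:

```cpp
// Round-to-nearest via the double-precision "magic number": adding
// 2^52 + 2^51 pushes the value into a range where the FP unit's
// round-to-nearest discards the fractional bits; subtracting it back
// yields the rounded value. Branch-free with no libm call, so loops
// using it vectorize well. Valid for |x| < 2^51; ties round to even.
inline double FastRoundSketch(double x) {
  constexpr double kMagic = 6755399441055744.0;  // 2^52 + 2^51
  return (x + kMagic) - kMagic;
}
```

Note this relies on strict IEEE-754 semantics: aggressive FP optimizations (e.g. -ffast-math) may fold the add/subtract away and break it.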
Contributor

@emkornfield left a comment

Tried to take another pass through; it seems like some of my old comments might have been unaddressed. In particular, do you plan on hooking up the config so the encoder/decoder can be used on real parquet files in this PR, or a separate one?

const ExactType* data = input_vector.data();
const ExactType frame_of_ref = for_info.frame_of_reference;

#pragma GCC unroll AlpConstants::kLoopUnrolls
Contributor

Yeah, I think my main issue here is code consistency/maintainability. I think xsimd might be the preferred maintenance route. I will yield to @pitrou for his guidance on how we should structure these optimizations.

Comment thread cpp/cmake_modules/SetupCxxFlags.cmake Outdated
elseif(ARROW_CPU_FLAG STREQUAL "aarch64")
# Arm64 compiler flags, gcc/clang only
set(ARROW_ARMV8_MARCH "armv8-a")
set(ARROW_ARMV8_MARCH "native")
Contributor

I'm not clear on why this change?

Author

I was trying to get an idea of performance while iterating. Reverted.

if(NOT MSVC)
set(C_RELEASE_FLAGS "")
if(CMAKE_C_FLAGS_RELEASE MATCHES "-O3")
string(APPEND C_RELEASE_FLAGS " -O2")
Contributor

it seems these are down-graded in more cases below, I think we should probably leave changing these flags to a separate PR so someone with more knowledge on why O2 is used by default can chime in (and we can better track changes here).

Author

Reverted. I might want to add this to the README at some point.

Comment thread cpp/src/parquet/types.h
DELTA_BYTE_ARRAY = 7,
RLE_DICTIONARY = 8,
BYTE_STREAM_SPLIT = 9,
ALP = 10,
Contributor

bump

Comment thread cpp/src/arrow/util/alp/alp.h Outdated
uint16_t num_exceptions = 0;

/// Size of the serialized portion (4 bytes, fixed)
static constexpr uint64_t kStoredSize = 4;
Contributor

nit: use sizeof(exponent) + sizeof(factor) + sizeof(num_exceptions) to make this clear (and then assert that it equals 4).
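The suggestion amounts to something like the following; the struct and its field names are illustrative, mirroring the snippet above rather than copying the PR's exact definition.

```cpp
#include <cstdint>

// Derive the serialized header size from the field sizes rather than a
// magic constant, and assert the expected total so layout drift is caught
// at compile time. (Illustrative struct; not the PR's exact definition.)
struct AlpVectorInfoSketch {
  uint8_t exponent;
  uint8_t factor;
  uint16_t num_exceptions;

  /// Size of the serialized portion, derived from the fields themselves.
  static constexpr uint64_t kStoredSize =
      sizeof(exponent) + sizeof(factor) + sizeof(num_exceptions);
};
static_assert(AlpVectorInfoSketch::kStoredSize == 4, "header layout changed");
```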

Comment thread cpp/src/arrow/util/alp/alp_wrapper.cc Outdated
} else if (vector_index == num_full_vectors && remainder > 0) {
this_vector_elements = static_cast<uint16_t>(remainder);
} else {
this_vector_elements = 0;
Contributor

should this exit early? or be marked as unreachable?

Comment thread cpp/src/arrow/util/alp/alp_wrapper.cc Outdated
const char* ptr = vector_start + AlpEncodedVectorInfo::kStoredSize;

// Decode based on integer encoding type
switch (integer_encoding) {
Contributor

instead of switch here for now consider refactoring to guards that return early on invalid data (avoids a nesting level).

Comment thread cpp/src/arrow/util/alp/alp_wrapper.cc Outdated
const uint64_t data_remaining = comp_size - static_cast<uint64_t>(ptr - comp);
const uint64_t data_size =
for_info.GetDataStoredSize(this_vector_elements, alp_info.num_exceptions);
if (data_size > data_remaining) {
Contributor

Is there a reason to not push these guards into the alp.cc code? so the abstraction could be logically consistent?


void Put(const T* buffer, int num_values) override {
if (num_values > 0) {
PARQUET_THROW_NOT_OK(
Contributor

This is encoding, not decoding I think. Reasons:

  1. Reduce peak memory
  2. Give better estimates for encoded size (i.e. the return value of EstimatedDataEncodedSize).

I left a comment on a previous review about decoding directly, which it seems this comment refers to. The default page size is currently 1MB on writes, which isn't trivial. One could argue this should be lowered to something more reasonable, but in general I don't think we want to place a hard assumption on the 10s of KBs. Did you quantify how marginal? I understand the FP arithmetic is the primary bottleneck, but in most scenarios I've seen, extra mem-copies have a reasonably high impact on perf (e.g. ~5%).

Comment thread cpp/src/parquet/encoding_benchmark.cc Outdated
std::shared_ptr<Buffer> buf = encoder->FlushValues();

for (auto _ : state) {
auto decoder = MakeTypedDecoder<DoubleType>(Encoding::ALP);
Contributor

nit: move this out of the benchmark loop?

… byte spans

- Change constants and size methods from uint64_t to int64_t per Arrow style
- Add clarifying comment that kAlpVectorSize is implementation default (format
  supports other power-of-2 sizes via log_vector_size)
- Spell out "Adaptive Lossless floating-Point (ALP)" in AlpConstants class doc
- Use sizeof(fields) + static_assert for AlpEncodedVectorInfo::kStoredSize
- Fix diagram comment: "Interleaved" → "Concatenated"
- Change Store/Load span<char> APIs to span<uint8_t> for byte representation
Per review feedback, avoid repeated allocation of the decoder object
inside the benchmark timing loop.
Per review feedback, the file name should match the class name (AlpCodec).
Updated all includes and CMakeLists.txt references.
Per review feedback, these changes should be in a separate PR where
someone with more knowledge on why -O2 is the default can weigh in.
Address reviewer safety concerns: all Load methods and decode-path
checks that operate on untrusted data now return Result<T> or
Status::Invalid instead of aborting via ARROW_CHECK.

Changes:
- AlpEncodedVectorInfo::Load -> Result<AlpEncodedVectorInfo>
- AlpEncodedForVectorInfo::Load -> Result<AlpEncodedForVectorInfo>
- AlpEncodedVector::Load -> Result<AlpEncodedVector>
- AlpEncodedVectorView::LoadView -> Result<AlpEncodedVectorView>
- AlpEncodedVectorView::LoadViewDataOnly -> Result<AlpEncodedVectorView>
- AlpMetadataCache::Load -> Result<AlpMetadataCache>
- AlpHeader::GetVectorNumElements -> Result<uint16_t>
- Add OOM guard before vector allocation in DecodeAlp
- Convert unreachable vector index case to Status::Invalid
- Refactor switch to early-return guard in DecodeAlp
- Keep ARROW_CHECK on encode paths (internal invariants)
Document the buffer size precondition in the header as requested
by reviewer
Convert ALP code from unsigned types (uint32_t, uint64_t, uint16_t) to
signed types (int32_t, int64_t, int16_t) following Arrow codebase
conventions where int64_t is used for sizes/counts and int32_t at
Parquet page level. Unsigned types are retained only where semantically
required: bit patterns (ExactType/FloatingToExact), bit widths (uint8_t),
and wire format offsets (OffsetType = uint32_t).

static_cast is used only at system boundaries (span construction from
signed sizes, container .size() comparisons).
- DecodeVector: move output_vector to last parameter
- PatchExceptions: move output to last parameter
- EncodeWithPreset: move preset (input) before comp/comp_size (output)
- Remove unused enforce_mode parameter from Encode()
- Remove <optional> include no longer needed
@emkornfield
Contributor

@prtkgaur is this ready for another round of reviews?

Add ALP = 10 to the Encoding enum to match the parquet-format spec
update (apache/parquet-format PR pending).
…crash paths

- Rename decomp/comp/decomp_size/comp_size to input/output/input_size/output_size
  in AlpCodec public and private APIs per reviewer naming feedback.
- Remove ARROW_CHECK(false) default branches in integer encoding switches.
  Only kForBitPack is supported; validation happens at the API boundary in
  AlpCodec::Decode (returns Status::Invalid). Internal functions now execute
  the kForBitPack path directly without switching.
The previous code used memcpy(&header.compression_mode, ..., 3) to
read/write the three uint8_t header fields in one shot, which relies
on struct layout having no padding between them. Copy each field
individually to be safe. No perf impact — this runs once per page.
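A sketch of the change described (struct and field names are illustrative, based on the header fields mentioned elsewhere in this PR):

```cpp
#include <cstdint>
#include <cstring>

// Read the three one-byte header fields individually instead of a single
// 3-byte memcpy into the first field, which would rely on the struct
// having no padding between members. (Illustrative struct and names.)
struct AlpHeaderSketch {
  uint8_t compression_mode;
  uint8_t log_vector_size;
  uint8_t integer_encoding;
};

inline void LoadHeader(const uint8_t* src, AlpHeaderSketch* out) {
  std::memcpy(&out->compression_mode, src + 0, sizeof(out->compression_mode));
  std::memcpy(&out->log_vector_size, src + 1, sizeof(out->log_vector_size));
  std::memcpy(&out->integer_encoding, src + 2, sizeof(out->integer_encoding));
}
```

Each copy targets one field by name, so the code stays correct even if the compiler were to insert padding; as the commit notes, this path runs once per page, so the extra calls cost nothing measurable.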
@prtkgaur prtkgaur requested review from alamb, emkornfield and sdf-jkl May 13, 2026 18:44
Remove the hardcoded kAlpVectorSize=1024 constraint from the decode
path so the implementation can decompress pages written with any valid
power-of-2 vector size (up to 2^kMaxLogVectorSize = 32768). The encode
path still writes with vector_size=1024.

- Replace all StaticVector<T, kAlpVectorSize> with std::vector<T> in
  structs (AlpEncodedVector, AlpEncodedVectorView, EncodingResult,
  BitPackingResult) and local variables
- Update validation bounds from kAlpVectorSize to (1 << kMaxLogVectorSize)
- Remove vector_size != kAlpVectorSize rejection in AlpCodec::Decode
- Guard EncodeVector against empty input (pre-existing UB now exposed)
- Drop small_vector.h includes (no longer needed)
Following the DeltaBitPackEncoder pattern, the AlpEncoder constructor
now accepts an optional vector_size parameter (default 1024). AlpCodec
public APIs (Encode, EncodeWithPreset, GetMaxCompressedSize) and the
private EncodeAlp all accept and plumb through the vector_size. The
chosen size is stored in the existing AlpHeader.log_vector_size field.

Adds round-trip tests at vector sizes {64, 512, 1024, 2048, 4096}
with sub-vector, exact, multiple, and remainder data sizes, plus
validation death tests for invalid inputs.
Encodes float and double data with AlpCodec at vector sizes {64, 512,
2048, 4096} across 4 data-size categories, then decodes through the
parquet AlpDecoder to verify it correctly reads log_vector_size from
the ALP header.
…n up naming

- Replace size_t with int64_t in AlpCodec public API per Arrow convention
- Move output params (output, output_size) to end of Encode/EncodeWithPreset
- Split Encode into explicit vector_size overload + convenience default
- Rename DecodeAlp param output_element_count → num_elements
- Add \pre precondition docs to public API methods
- Update all callers in tests, encoder, and reference blob generator