Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 44 additions & 0 deletions AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,47 @@ Notes:
self-explanatory code.
- Keep public APIs small and consistent with neighboring crates.

## Performance: avoid hidden-cost accessors in hot loops

The most common performance trap in this codebase is calling a *per-element accessor that
hides non-trivial work* inside an `O(n)` loop, instead of doing the work once over the whole
chunk. The call site looks like a cheap getter, but each call re-pays a cost that is constant
(or amortizable) across the loop, making the loop `O(n · k)`.

Watch for these accessors used inside `for i in 0..n { ... }`:

| Per-element accessor (in a loop) | Hidden cost | Bulk replacement |
| --- | --- | --- |
| `Validity::is_valid(i)` / `is_null(i)` | for array-backed validity, **allocates an `ExecutionCtx` and runs a scalar lookup per call** | `validity.execute_mask(len, ctx)?` once, then `Mask::value(i)` |
| `array.scalar_at(i)` / `array.execute_scalar(i, ctx)` | per-element execution through the compute stack | canonicalize once (`execute::<PrimitiveArray>` / `as_slice`) then index |
| `BitBuffer::value(i)` / `Mask::value(i)` accumulated into a count | recomputes the byte address each call; defeats popcount | `true_count()`, `BitBuffer::count_range(start, end)`, `set_indices()` |
| `BitIterator::next()` to accumulate a running rank/prefix count | bit-at-a-time | `count_range` over each gap (SIMD popcount) |
| re-deriving a value inside the loop (e.g. `self.validity()?` each iteration) | re-runs the derivation `n` times | hoist it above the loop |

Decide per site — bulk is not always the answer:

- **Sequential / contiguous access** over an accessor that hides amortizable work → hoist and
go bulk (materialize once, then index or iterate the chunk).
- **Gather over arbitrary indices** → you cannot amortize a per-element *decode*, but you can
still materialize the backing buffer once (e.g. `execute_mask`) and then do cheap `O(1)`
random reads, avoiding per-call context/allocation.
- **The accessor is already genuinely `O(1)`** (e.g. reading an already-materialized `Mask`/
slice, or a native bitmap) → leave it; bulk would not help.

Even after materializing into a `Mask`/`BitBuffer`, **do not loop with `value(i)` per
element** to act on each set bit — the per-element branch dominates. Iterate a `u64` word
at a time with all-set / all-unset fast paths: use [`BitBuffer::for_each_set_index`]. It
beats `for i in 0..len { if buf.value(i) {..} }` by 2-45x (more at low density) and beats
collecting `set_indices()` by ~2x at mid/high density, while self-adapting from sparse to
dense — see `vortex-mask/benches/mask_iteration.rs`. Reach for the cached `indices()` /
`slices()` representations when you need them more than once; otherwise `for_each_set_index`
needs no materialization.

When you touch such a loop, back the change with a benchmark (see
`vortex-array/benches/validity_is_valid.rs` for the `is_valid` case,
`vortex-mask/benches/valid_counts.rs` for the popcount case, and
`vortex-mask/benches/mask_iteration.rs` for the per-element-vs-word-vs-sparse comparison).

## Tests

- Strongly consider `rstest` cases when parameterizing repetitive test logic.
Expand Down Expand Up @@ -179,6 +220,9 @@ Check new and modified lines against this list before finishing:
- Updating expected test output to match buggy behavior without independently verifying the
intended semantics.
- Silently reducing the scope of an approved plan when implementation is harder than expected.
- Calling a hidden-cost per-element accessor (`Validity::is_valid`, `scalar_at`, `BitBuffer::
value` accumulation) inside a hot loop instead of materializing once — see
"Performance: avoid hidden-cost accessors in hot loops".

## Summaries

Expand Down
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions encodings/runend/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -52,3 +52,7 @@ harness = false
[[bench]]
name = "run_end_decode"
harness = false

[[bench]]
name = "run_end_filter"
harness = false
128 changes: 128 additions & 0 deletions encodings/runend/benches/run_end_filter.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,128 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

//! Benchmarks for the run-end filter inner loop (`filter_run_end_primitive`).
//!
//! This measures the kernel directly rather than going through the lazy
//! `ArrayRef::filter` (which only builds a `FilterArray` node and does not run
//! the kernel). The hot work is a per-run popcount of the predicate mask, which
//! now uses `BitBuffer::count_range` (SIMD) instead of a bit-by-bit walk.

#![expect(clippy::cast_possible_truncation)]
#![expect(clippy::cast_precision_loss)]
#![expect(clippy::cast_sign_loss)]
#![expect(clippy::expect_used)]

use std::fmt;

use divan::Bencher;
use rand::SeedableRng;
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use vortex_buffer::BitBuffer;
use vortex_runend::_benchmarking::filter_run_end_primitive;

fn main() {
divan::main();
}

#[derive(Clone, Copy)]
struct FilterBenchArgs {
/// Total logical length of the decoded array.
length: usize,
/// Average run length used when building the run-end array.
run_length: usize,
/// Fraction of mask bits that are set to `true`.
density: f64,
}

impl fmt::Display for FilterBenchArgs {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
write!(
f,
"len={}_run={}_density={:.1}",
self.length, self.run_length, self.density
)
}
}

const FILTER_ARGS: &[FilterBenchArgs] = &[
FilterBenchArgs {
length: 16_384,
run_length: 16,
density: 0.1,
},
FilterBenchArgs {
length: 16_384,
run_length: 16,
density: 0.5,
},
FilterBenchArgs {
length: 16_384,
run_length: 16,
density: 0.9,
},
FilterBenchArgs {
length: 262_144,
run_length: 16,
density: 0.1,
},
FilterBenchArgs {
length: 262_144,
run_length: 16,
density: 0.5,
},
FilterBenchArgs {
length: 262_144,
run_length: 16,
density: 0.9,
},
FilterBenchArgs {
length: 1_048_576,
run_length: 16,
density: 0.1,
},
FilterBenchArgs {
length: 1_048_576,
run_length: 16,
density: 0.5,
},
FilterBenchArgs {
length: 1_048_576,
run_length: 16,
density: 0.9,
},
];

/// Build the run-end boundaries (cumulative run lengths) for `length` rows.
fn build_run_ends(length: usize, run_length: usize) -> Vec<u32> {
let n_runs = length.div_ceil(run_length);
(0..n_runs)
.map(|r| (((r + 1) * run_length).min(length)) as u32)
.collect()
}

/// Build a predicate mask of `length` bits with approximately `density` set bits,
/// shuffled so the set bits are spread across runs.
fn build_mask(length: usize, density: f64) -> BitBuffer {
let n_true = (length as f64 * density).round() as usize;
let mut bits = vec![false; length];
for b in bits.iter_mut().take(n_true) {
*b = true;
}
let mut rng = StdRng::seed_from_u64(0x5eed);
bits.shuffle(&mut rng);
BitBuffer::from(bits)
}

#[divan::bench(args = FILTER_ARGS)]
fn filter_run_end(bencher: Bencher, args: FilterBenchArgs) {
let run_ends = build_run_ends(args.length, args.run_length);
let mask = build_mask(args.length, args.density);
let length = args.length as u64;
bencher
.with_inputs(|| (run_ends.clone(), mask.clone()))
.bench_refs(|(run_ends, mask)| {
filter_run_end_primitive::<u32>(run_ends, 0, length, mask).expect("filter")
});
}
24 changes: 15 additions & 9 deletions encodings/runend/src/compute/filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ use std::cmp::min;
use std::ops::AddAssign;

use num_traits::AsPrimitive;
use num_traits::NumCast;
use vortex_array::ArrayRef;
use vortex_array::ArrayView;
use vortex_array::ExecutionCtx;
Expand Down Expand Up @@ -74,7 +75,7 @@ impl FilterKernel for RunEnd {
}

// Code adapted from apache arrow-rs https://github.com/apache/arrow-rs/blob/b1f5c250ebb6c1252b4e7c51d15b8e77f4c361fa/arrow-select/src/filter.rs#L425
fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
pub fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitive<u64>>(
run_ends: &[R],
offset: u64,
length: u64,
Expand All @@ -87,16 +88,21 @@ fn filter_run_end_primitive<R: NativePType + AddAssign + From<bool> + AsPrimitiv
let mut count = R::zero();

let new_mask: Mask = BitBuffer::collect_bool(run_ends.len(), |i| {
let mut keep = false;
let end = min(run_ends[i].as_() - offset, length);

// Safety: predicate must be the same length as the array the ends have been taken from
for pred in (start..end).map(|i| unsafe {
mask.value_unchecked(i.try_into().vortex_expect("index must fit in usize"))
}) {
count += <R as From<bool>>::from(pred);
keep |= pred
}
// SIMD popcount of the predicate bits in this run. The range matches the
// bit-by-bit `value_unchecked` read it replaces, so `end <= mask.len()`.
let start_usize = start
.try_into()
.vortex_expect("run start index must fit in usize");
let end_usize = end
.try_into()
.vortex_expect("run end index must fit in usize");
let run_trues = mask.count_range(start_usize, end_usize);
count += <R as NumCast>::from(run_trues)
.vortex_expect("run popcount must fit in run-end native type");
let keep = run_trues > 0;

// this is to avoid branching
new_run_ends[j] = count;
j += keep as usize;
Expand Down
1 change: 1 addition & 0 deletions encodings/runend/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ mod rules;

#[doc(hidden)]
pub mod _benchmarking {
pub use compute::filter::filter_run_end_primitive;
pub use compute::take::take_indices_unchecked;

use super::*;
Expand Down
4 changes: 4 additions & 0 deletions vortex-array/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -180,6 +180,10 @@ required-features = ["_test-harness"]
name = "dict_mask"
harness = false

[[bench]]
name = "validity_is_valid"
harness = false

[[bench]]
name = "dict_unreferenced_mask"
harness = false
Expand Down
62 changes: 62 additions & 0 deletions vortex-array/benches/validity_is_valid.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,62 @@
// SPDX-License-Identifier: Apache-2.0
// SPDX-FileCopyrightText: Copyright the Vortex contributors

//! Benchmarks the cost of reading array-backed [`Validity`] per element.
//!
//! `Validity::is_valid(i)` for the `Validity::Array` variant spins up a fresh
//! execution context and executes a scalar lookup on *every* call, so calling it
//! in a `for i in 0..n` loop is pathologically slow. The fix used by callers is to
//! materialize the validity into a `Mask` once (`execute_mask`) and then do cheap
//! O(1) bit reads via `Mask::value`. This bench contrasts the two.
//!
//! Sizes are kept small because the per-element variant is intentionally the slow
//! one we are measuring.

#![allow(clippy::unwrap_used)]

use divan::Bencher;
use vortex_array::LEGACY_SESSION;
use vortex_array::VortexSessionExecute;
use vortex_array::validity::Validity;
use vortex_buffer::BitBuffer;

fn main() {
divan::main();
}

const SIZES: &[usize] = &[256, 1024, 4096];

/// Build an array-backed validity (~10% nulls so it is `Validity::Array`).
fn array_validity(len: usize) -> Validity {
Validity::from(BitBuffer::from_iter(
(0..len).map(|i| !i.is_multiple_of(10)),
))
}

/// Per-element `Validity::is_valid` over array-backed validity (the antipattern).
#[divan::bench(args = SIZES)]
fn is_valid_per_element(bencher: Bencher, len: usize) {
let validity = array_validity(len);
bencher.with_inputs(|| &validity).bench_refs(|validity| {
let mut count = 0usize;
for i in 0..len {
count += validity.is_valid(i).unwrap() as usize;
}
count
});
}

/// Materialize the validity into a `Mask` once, then read bits (the fix).
#[divan::bench(args = SIZES)]
fn execute_mask_then_value(bencher: Bencher, len: usize) {
let validity = array_validity(len);
bencher.with_inputs(|| &validity).bench_refs(|validity| {
let mut ctx = LEGACY_SESSION.create_execution_ctx();
let mask = validity.execute_mask(len, &mut ctx).unwrap();
let mut count = 0usize;
for i in 0..len {
count += mask.value(i) as usize;
}
count
});
}
17 changes: 15 additions & 2 deletions vortex-array/src/arrays/listview/rebuild.rs
Original file line number Diff line number Diff line change
Expand Up @@ -209,9 +209,16 @@ impl ListViewArray {
let mut new_sizes = BufferMut::<S>::with_capacity(len);
let mut take_indices = BufferMut::<u64>::with_capacity(self.elements().len());

// Materialize the validity into a mask once. `self.validity()` re-derives the
// listview validity, and `Validity::is_valid` allocates a fresh execution context
// per call for array-backed validity, so checking it per row would be quadratic-ish.
let validity_mask = self
.validity()?
.execute_mask(len, &mut LEGACY_SESSION.create_execution_ctx())?;

let mut n_elements = NewOffset::zero();
for index in 0..len {
if !self.validity()?.is_valid(index)? {
if !validity_mask.value(index) {
new_offsets.push(n_elements);
new_sizes.push(S::zero());
continue;
Expand Down Expand Up @@ -292,9 +299,15 @@ impl ListViewArray {
let mut new_elements_builder =
builder_with_capacity(element_dtype.as_ref(), self.elements().len());

// Materialize the validity into a mask once (see `rebuild_by_offset`): avoids
// re-deriving the listview validity and allocating an execution context per row.
let validity_mask = self
.validity()?
.execute_mask(len, &mut LEGACY_SESSION.create_execution_ctx())?;

let mut n_elements = NewOffset::zero();
for index in 0..len {
if !self.validity()?.is_valid(index)? {
if !validity_mask.value(index) {
// For NULL lists, place them after the previous item's data to maintain the
// no-overlap invariant for zero-copy to `ListArray` arrays.
new_offsets.push(n_elements);
Expand Down
Loading
Loading