Start for free
Engine internals Patch notes · #2026-04-24

Inside REGEX_REPLACE: how a string-rewriting function compiles down to vectorised SIMD

A look at the patch that brought a real, vectorised string-rewriting function to the Opteryx execution engine — why the obvious implementation was the wrong one, and what we learned by measuring the version we threw away.

patch · execution/string_ops.rs // before: 1.7M rows/s, allocator-bound fn regex_replace(col: &StringArray, pat: &Regex, rep: &str) -> StringArray { ... } // after: 11.4M rows/s, scan-bound fn regex_replace_simd(col: &StringArray, pat: &CompiledNFA, rep: &str) -> StringArray { ... }

Most string functions in a query engine are boring on purpose. They have to be — they run a billion times, on every workload, and a single allocation in the wrong place will undo a quarter's worth of planner work. REGEX_REPLACE is one of the few that resists this. It is regular-expression-shaped, allocation-shaped, and unicode-shaped, all at once.

This post is about how we built ours, the version we threw away first, and the surprisingly small change that took it from 1.7M rows/s to 11.4M rows/s on the same hardware. If you have ever benchmarked a vectorised string function and watched the profile flatten out at the wrong call, this is for you.

The brief

Opteryx is a read-only analytical engine, which means most of our string work is in the filter path — predicates, projections, the occasional rewrite for compatibility with downstream BI tools. REGEX_REPLACE showed up on the roadmap because three independent customers asked for it in the same week, all to scrub PII from a column before exporting to a notebook. None of them wanted it to be fast. They wanted it to be correct, and to not blow up the cost preview.

The obvious implementation, and why it was wrong

The first cut took an afternoon. Iterate the column, run regex.replace_all on each row, push the result into a fresh StringArray. Done. It passed the test suite, returned the right answers on every fuzz seed we threw at it, and ran at 1.7 million rows per second on the bench dataset.

1.7M is fine in isolation. It is not fine when the rest of the engine ingests Parquet at 90M rows/s and aggregates at 40M. A single REGEX_REPLACE in a long projection list dropped end-to-end throughput by 30×. Worse, the profile showed something we already knew but had hoped to ignore.

Every row produced a fresh String. Every String went through malloc. The matcher itself was finishing work in nanoseconds and then waiting for the allocator to catch up. We were not regex-bound; we were memory-bound, and the regex was a passenger.

Rebuilding around the buffer

The fix is a pattern that shows up often in column engines and rarely in the regex literature: don't allocate per row, allocate per batch. Conceptually, it looks like this.

Rust · execution/string_ops.rs
pub fn regex_replace_batch(
    col: &StringArray,
    pat: &CompiledNFA,
    rep: &str,
) -> StringArray {
    let mut values = Vec::<u8>::with_capacity(col.values_size() * 2);
    let mut offsets = OffsetsBuilder::with_capacity(col.len());

    for s in col.iter() {
        pat.replace_into(s, rep, &mut values);  // no alloc
        offsets.push(values.len() as i32);
    }
    StringArray::from_parts(offsets.finish(), values.into())
}

Two changes, both small:

  1. The output buffer is allocated once, sized to roughly twice the input. Rewrites that grow the column reallocate at most one or two times across the whole batch instead of once per row.
  2. The matcher writes directly into our buffer with replace_into. The regex crate does not expose this on its public API; we wrapped its NFA executor with a sink that accepts a &mut Vec<u8>.

That alone got us to 6.2M rows/s. The allocator dropped from 78% of CPU to 4%. The matcher started showing up in the profile, which is the polite way of saying we now had a different bottleneck.

Where SIMD actually helps

It is fashionable to say "we used SIMD" as if it explains a 10× speedup. It almost never does, on its own. SIMD lanes help when the work fits the lane shape — fixed widths, branch-free, predictable memory access. A regex NFA satisfies none of these.

What does fit a lane is the literal prefix scan. Almost every real-world regex starts with a literal: ^Error:, \bemail=, (\d{3})-. We compile the pattern, lift the literal prefix, and run a memchr-style SIMD scan across the input to find candidate positions. Only the candidates feed into the NFA. For a column where 99% of rows do not match, this is the entire game.

The regex does not have to be fast. The not-matching has to be fast. Eng review notes, week of Apr 14
bench/regex_replace
# 10M rows, mean length 84 bytes, match rate 1.2%
naive          1,712,304 rows/s   // per-row alloc
batched        6,218,901 rows/s   // shared buffer
batched+simd  11,420,556 rows/s   // literal-prefix SIMD scan

Surfacing the cost

The other half of the brief — "do not blow up the cost preview" — turned out to be harder than the performance work. Opteryx's planner shows an estimated scan size before the query runs. REGEX_REPLACE does not change scan size, but it does change CPU time, and on a metered cluster the two are not the same thing.

We extended the cost model with a per-function CPU coefficient, calibrated from the bench numbers above. When you write a query that calls REGEX_REPLACE over a billion rows, the preview now tells you what it will cost in the same units as the rest of the query. No surprises, no apologies.

What is next

Two follow-ups are already in flight. The first is extending the same buffer-sharing pattern to REPLACE, SUBSTR, and UPPER/LOWER. The second is a planner-side rewrite that recognises REGEX_REPLACE with a constant pattern and a one-shot replacement and lowers it to plain REPLACE when the pattern is, in fact, a literal.

About 11% of regex_replace calls in production turn out to be plain string replacements with the regex syntax wrapped around them. Those should never have hit the NFA in the first place.