Why Sort is row-based in Velox — A Quantitative Assessment

· 8 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta

TL;DR

Velox is a fully vectorized execution engine[1]. Its internal columnar memory layout enhances cache locality, exposes more inter-instruction parallelism to CPUs, and enables the use of SIMD instructions, significantly accelerating large-scale query processing.

However, some operators in Velox utilize a hybrid layout, where datasets can be temporarily converted to a row-oriented format. The OrderBy operator is one example, where our implementation first materializes the input vectors into rows, containing both sort keys and payload columns, sorts them, and converts the rows back to vectors.

In this article, we explain the rationale behind this design decision and provide experimental evidence for its implementation. We show a prototype of a hybrid sorting strategy that materializes only the sort-key columns, reducing the overhead of materializing payload columns. Contrary to expectations, the end-to-end performance did not improve; in some cases it was even slower. We present the two variants and discuss why one is counter-intuitively faster than the other.

Multi-Round Lazy Start Merge

· 6 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

Background

Efficiently merging sorted data partitions at scale is crucial for a variety of training data preparation workloads, especially for Generative Recommenders (GRs), a new paradigm introduced in the paper Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations. A key requirement is to merge training data across partitions—for example, merging hourly partitions into daily ones—while ensuring that all rows sharing the same primary key are stored consecutively. Training data is typically partitioned and bucketed by primary key, with rows sharing the same key stored consecutively, so merging across partitions essentially becomes a multi-way merge problem.

Normally, Apache Spark can be used for this sort-merge requirement — for example, via CLUSTER BY. However, training datasets for a single job can often reach the PB scale, which in turn generates shuffle data at PB scale. Because we typically apply bucketing and ordering by key when preparing training data in production, Spark can eliminate the shuffle when merging training data from multiple hourly partitions. However, each Spark task can only read the files planned from various partitions within a split sequentially, placing them into the sorter and spilling as needed. Only after all files have been read does Spark perform a sort-merge of the spilled files. This process produces a large number of small spill files, which further degrades efficiency.

Velox switches to C++20 standard

· 2 min read
Christian Zentgraf
Software Engineer @ IBM

Background

The C++ standard used in a project determines what built-in functionality developers can use to ease development and extend capabilities.

Since its inception in August 2021, Velox has used the C++17 standard.

Extending Velox - GPU Acceleration with cuDF

· 4 min read
Gregory Kimball
Software Engineer @ NVIDIA

TL;DR

This post describes the design principles and software components for extending Velox with hardware acceleration libraries like NVIDIA's cuDF. Velox provides a flexible execution model for hardware accelerators, and cuDF's data structures and algorithms align well with core components in Velox.

SEGFAULT due to Dependency Update

· 4 min read
Deepak Majeti
Software Engineer @ IBM
Christian Zentgraf
Software Engineer @ IBM

Background

Velox depends on several libraries. Some of these dependencies include open-source libraries from Meta, including Folly and Facebook Thrift. These libraries are in active development and also depend on each other, so they all have to be updated to the same version at the same time.

Updating these dependencies typically involves modifying the Velox code to align with any public API or semantic changes in these dependencies. However, a recent upgrade of Folly and Facebook Thrift to version v2025.04.28.00 caused a SEGFAULT only in one unit test in Velox named velox_functions_remote_client_test.

A Velox Primer, Part 3

· 10 min read
Orri Erling
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

At the end of the previous article, we were halfway through running our first distributed query:

SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;

We discussed how a query starts, how tasks are set up, and the interactions between plans, operators, and drivers. We also presented how the first stage of the query is executed, from table scan to partitioned output - the producer side of the shuffle.

In this article, we will discuss the second query stage, or the consumer side of the shuffle.

A Velox Primer, Part 2

· 9 min read
Orri Erling
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

In this article, we will discuss how a distributed compute engine executes a query similar to the one presented in our first article:

SELECT l_partkey, count(*) FROM lineitem GROUP BY l_partkey;

We use the TPC-H schema to illustrate the example, and Prestissimo as the compute engine orchestrating distributed query execution. Prestissimo is responsible for the query engine frontend (parsing, resolving metadata, planning, optimizing) and distributed execution (allocating resources and shipping query fragments), and Velox is responsible for the execution of plan fragments within a single worker node. Throughout this article, we will present which functions are performed by Velox and which by the distributed engine - Prestissimo, in this example.

A Velox Primer, Part 1

· 6 min read
Orri Erling
Software Engineer @ Meta
Pedro Pedreira
Software Engineer @ Meta

This is the first part of a series of short articles that will take you through Velox’s internal structures and concepts. In this first part, we will discuss how distributed queries are executed, how data is shuffled among different stages, and present Velox concepts such as Tasks, Splits, Pipelines, Drivers, and Operators that enable such functionality.

Velox Query Tracing

· 6 min read
Meng Duan (macduan)
Software Engineer @ ByteDance
Xiaoxuan Meng
Software Engineer @ Meta
Jialiang Tan
Software Engineer @ Meta

TL;DR

The query trace tool helps analyze and debug query performance and correctness issues. It helps prevent interference from external noise in a production environment (such as storage, network, etc.) by allowing replay of a part of the query plan and dataset in an isolated environment, such as a local machine. This is much more efficient for query performance analysis and issue debugging, as it eliminates the need to replay the whole query in a production environment.