Introduction
I read this post recently by Nikita Sivukhin. In it, they refactor a find
method to locate the position of an element in a slice. The optimization takes advantage of LLVM’s auto-vectorization to improve performance.
I decided to apply the techniques discussed to the std.mem.allEqual
function in Zig’s standard library to investigate how vectorization could affect its performance.
The Standard Library
The allEqual
method is defined as
Returns true if all elements in a slice are equal to the scalar value provided
With this pretty simple source code:
pub fn allEqual(comptime T: type, slice: []const T, scalar: T) bool {
for (slice) |item| {
if (item != scalar) return false;
}
return true;
}
Putting this on godbolt using Zig 0.14.1 and the flags -O ReleaseFast -target x86_64-linux -mcpu=native
gives the compiler output:
allEqual:
push rbp
mov rbp, rsp
test rsi, rsi
je .LBB0_1
cmp dword ptr [rdi], edx
jne .LBB0_3
mov ecx, 1
.LBB0_7:
mov rax, rcx
cmp rsi, rcx
je .LBB0_4
lea rcx, [rax + 1]
cmp dword ptr [rdi + 4*rax], edx
je .LBB0_7
.LBB0_4:
cmp rax, rsi
setae al
pop rbp
ret
.LBB0_1:
mov al, 1
pop rbp
ret
.LBB0_3:
xor eax, eax
pop rbp
ret
However, this approach lacks vectorization. Why doesn’t LLVM apply it? To understand what might be preventing vectorization, we can check the LLVM optimization remarks for clues.
Unlike Rust, Zig doesn’t support passing arbitrary LLVM flags such as -pass-remarks
, -pass-remarks-missed
, or -pass-remarks-analysis
directly. To work around this, we need to first emit the LLVM IR using the -emit-llvm-ir
flag to generate a .ll
file. Then use the llvm-opt
tool on this file to generate the optimization remarks.
Setting up the development environment
I used Nix with Devenv to set up the environment. Just run devenv init
in your project directory and configure the devenv.nix
file as needed:
{ pkgs, lib, config, inputs, ... }:
{
languages.zig.enable = true;
packages = with pkgs; [
llvmPackages_19.llvm
];
}
Finally execute devenv shell
to build the environment.
Investigation
We put the following slightly modified code in a test.zig
file.
const std = @import("std");
export fn allEqual_std(data: [*]const u8, len: usize, target: u8) bool {
return std.mem.allEqual(u8, data[0..len], target);
}
The export
keyword is needed, otherwise LLVM’s dead code elimination will ignore this function as its never actually called.
Executing:
zig build-obj -femit-llvm-ir=test.ll -femit-asm=output.s -O ReleaseFast test.zig &&
opt -O3 \
-pass-remarks=.* \
-pass-remarks-missed=.* \
-pass-remarks-analysis=.* \
test.ll -S -o optimized.ll
The following output is received:
remarks-analysis=.* test.ll -S -o optimized.ll
remark: mem.zig:1191:10: loop not vectorized: could not determine number of loop iterations
remark: mem.zig:1191:10: loop not vectorized
The remark loop not vectorized: could not determine number of loop iterations
means the compiler can’t statically determine how many times the loop will run. Vectorization needs predictable iteration counts to safely group operations into SIMD instructions, which is handled by the LoopVectorizer.
This can be caused by non-constant loop bounds or early exits. In the original implementation of std.mem.allEqual
in the standard library, there’s an early return
statement inside the loop. That’s what prevents the loop from being vectorized.
Refactor and Restructure
To trigger loop vectorization, the implementation must be modified to not use early returns:
export fn allEqual_no_early_stop(slice: [*]const u8, len: usize, scalar: u8) bool {
var m: bool = true;
for (slice[0..len]) |item| {
m = m and (item == scalar);
}
return m;
}
This looks correct, now let’s check the compiler output and the LLVM remarks.
.LBB0_5:
cmp byte ptr [rdi + r8], dl
sete r9b
and r9b, al
cmp byte ptr [rdi + r8 + 1], dl
sete al
cmp byte ptr [rdi + r8 + 2], dl
sete r10b
and r10b, al
and r10b, r9b
cmp byte ptr [rdi + r8 + 3], dl
sete al
cmp byte ptr [rdi + r8 + 4], dl
sete r9b
and r9b, al
...
Unfortunately, there still isn’t any SIMD output. However, the loop does get unrolled, and the compiler generates branchless code with no jump instructions. The following remarks are generated:
remark: w.zig:11:15: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: w.zig:11:15: loop not vectorized
remark: w.zig:11:15: loop not vectorized: value that could not be identified as reduction is used outside the loop
remark: w.zig:11:15: loop not vectorized
remark: <unknown>:0:0: Cannot SLP vectorize list: vectorization was impossible with available vectorization factors
remark: w.zig:12:9: Vectorized horizontal reduction with cost -17 and with tree size 3
remark: w.zig:11:15: unrolled loop by a factor of 8 with run-time trip count
That’s a lot. What do they mean?
The good news
Vectorized horizontal reduction with cost -17 and with tree size 3
This refers to a SIMD operation that reduces vector elements into a single scalar value. The -17
indicates a profitable optimization according to LLVM’s vectorization cost model. A tree size of 3
reflects the structure of the operation’s dependency graph and 3
is its depth.
unrolled loop by a factor of 8 with run-time trip count
This means the LLVM optimizer has applied runtime loop unrolling with an unroll factor of 8 i.e. each iteration of the unrolled loop performs the work of 8 iterations of the original loop.
The bad news
loop not vectorized: value that could not be identified as reduction is used outside the loop
This remark is due to something known as Data Dependency Chain. It prevents the compiler from generating vectorized output from the above implementation. I’ll get to it soon.
Cannot SLP vectorize list: vectorization was impossible with available vectorization factors
The SLPVectorizer
combines similar independent instructions into vector instructions. This could not be achieved here due to the reduction pattern in the code m = m and (item == scalar)
as well as the dependency chain.
Data Dependency Chain
In the current implementation there is:
m = m and (item == scalar);
This is a read-after-write dependency. It is not vectorizable because each iteration depends on the previous value of m
. Iteration $N +1$ cannot be computed until the result of iteration $N$ is known.
A method is needed to check whether all elements of a slice are equal to a given scalar, without introducing data dependencies, and where the result of each iteration is independent of the others.
The following is a well-known vectorizable pattern that uses summation:
export fn allEqual_no_early_stop(slice: [*]const u8, len: usize, scalar: u8) bool {
var mismatch_count: u32 = 0;
for (slice[0..len]) |item| {
mismatch_count += @intFromBool(item != scalar);
}
return mismatch_count == 0;
}
Here we count the number of mismatches and equate it with zero. If all the elements of the slice are equal to the scalar. The result of the operation @intFromBool(item != scalar)
is independent of the other results of the other iterations. So, it is safe to vectorize. Looking at the compiler output:
.LBB0_6:
vmovq xmm7, qword ptr [rdi + rcx]
vmovq xmm8, qword ptr [rdi + rcx + 8]
vmovq xmm9, qword ptr [rdi + rcx + 16]
vmovq xmm10, qword ptr [rdi + rcx + 24]
vpcmpeqb xmm7, xmm7, xmm1
vpxor xmm7, xmm7, xmm2
vpmovzxbd ymm7, xmm7
vpand ymm7, ymm7, ymm3
vpaddd ymm0, ymm0, ymm7
vpcmpeqb xmm7, xmm8, xmm1
vpxor xmm7, xmm7, xmm2
vpmovzxbd ymm7, xmm7
vpand ymm7, ymm7, ymm3
vpaddd ymm4, ymm4, ymm7
vpcmpeqb xmm7, xmm9, xmm1
...
We can see the SSE registers, and hence LLVM is successfully able to vectorize the allEqual
function.
We get the following remarks:
remark: w.zig:4:15: loop not vectorized: vectorization and interleaving are explicitly disabled, or the loop has already been vectorized
remark: w.zig:4:15: loop not vectorized: vectorization and interleaving are explicitly disabled, or the loop has already been vectorized
This remark can occur due to two reasons:
Vectorization has been explicitly disabled using compiler flags such as
-fno-vectorize
or-fno-unroll-loops
.The loops were already vectorized in an earlier optimization phase, so
llvm-opt
skips redundant attempts.
Since we did not use any such disabling flags, the remark likely results from the second reason. Indicating that the loops have already been successfully vectorized.
Results
Benchmarking the stdlib implementation against the new implementation that triggers LLVM’s auto-vectorizer. The benchmark is performed against 3 scenarios:
- All elements of the slice match the target.
- Only the first element does not match the target.
- Only the last element does not match the target.
1. All elements equal to the target
Method | Median (ns) | Mean(ns) | Throughput |
---|---|---|---|
std.mem.allEqual | 581131 | 580878 | 7e9 |
allEqual_no_early | 192468 | 192378 | 2e10 |
2. Only the first element does not match the target.
Method | Median(ns) | Min(ns) | Throughput |
---|---|---|---|
std.mem.allEqual | 0 | 0 | inf |
allEqual_no_early | 194103 | 196067 | 2e10 |
3. Only the last element does not match the target
Method | Median(ns) | Min(ns) | Throughput |
---|---|---|---|
std.mem.allEqual | 557238 | 557433 | 7e9 |
allEqual_no_early | 194800 | 194474 | 2e10 |
Conclusion
The auto-vectorized implementation performs well only when all elements match the scalar value or the first mismatch is near the end. If there’s a mismatch early on, the early-stopping version is faster. It returns as soon as a mismatch is found, while the vectorized version still scans the entire slice to count all mismatches.