
Conversation

@allisoneer (Owner) commented Dec 5, 2025

Summary by CodeRabbit

  • New Features

    • Added support for multiple distance metrics (squared Euclidean, Euclidean, cosine, dot product).
    • Introduced metadata support for indexed vectors with encoding/decoding capabilities.
    • Added persistence functionality to save and load indices from disk.
    • Enabled advanced search with predicate-based filtering.
    • Added batch insertion capabilities.
    • Introduced deletion and compaction operations for index maintenance.
  • Chores

    • Updated build configuration and minimum Zig version requirement.


Allison Durham added 3 commits December 4, 2025 23:01
- Update build.zig to use root_module pattern (addStaticLibrary removed)
- Update ArrayList API calls to pass allocator explicitly
- Simplify custom format function signature (Writergate)
- Use {f} specifier for custom formatters
- Update .gitignore for Claude and thoughts-data paths
- Add thoughts MCP tool config
- Add symlinks for thoughts workspace directories
Add distance metrics (cosine, dot_product, euclidean, squared_euclidean),
multi-layer HNSW search with ef_search, metadata integration with comptime
serialization, and full persistence (save/load).

New API: HNSW(T, metric, Metadata) with save()/load() methods.

coderabbitai bot commented Dec 5, 2025

Walkthrough

This pull request introduces a major architectural refactor of the HNSW index implementation: it modularizes distance metrics, adds metadata support with compile-time encoding/decoding, introduces a v2 index built on vector stores and graph tapes, and implements persistent storage. The build system is updated to Zig 0.15+ module conventions.

Changes

  • Project Configuration & Setup (.gitignore, .thoughts/config.json, .thoughts-data/context, .thoughts-data/references, .thoughts-data/thoughts): Added Claude-related ignore patterns and new thought/context/reference configuration files.
  • Build System Modernization (build.zig, build.zig.zon): Migrated from a static library to the module-based build pattern for Zig 0.15+. Updated the minimum Zig version, added fingerprint and paths fields to the manifest, and reorganized library/test/benchmark declarations to use root_module.
  • Distance Metrics & Abstractions (src/distance.zig): New distance metric abstraction with a public enum DistanceMetric (squared_euclidean, euclidean, cosine, dot_product) and a generic distanceFn factory. Includes SIMD-optimized distance implementations with scalar fallbacks.
  • Metadata Framework (src/metadata.zig): New compile-time metadata encoding/decoding system. Introduces StringTable for dynamic string storage, a generic FixedOf for fixed-layout metadata representations, and encode/decode functions supporting nested structs, optionals, arrays, and batch operations.
  • Persistence Layer (src/persistence.zig, src/persistence_v2.zig): Two persistence modules: a legacy version with generic serialization utilities, and v2 with a FileHeaderV2 struct, saveV2/loadHeaderV2 functions, and LoadedSections for section-based I/O.
  • New Index Architecture (src/vector_store.zig, src/graph_tape.zig, src/neighbors_ref.zig): New foundational data structures: VectorStore provides 64-byte aligned vector storage; GraphTape manages a contiguous neighbor tape per level; NeighborsRef provides a tape-based neighbor list API with iteration and access utilities. GraphConfig sets per-level neighbor counts.
  • HNSW v2 Implementation (src/hnsw_v2.zig): Complete rewrite of the core HNSW index. Generic over type T, distance metric, and metadata. Integrates VectorStore, GraphTape, and metadata encoding. Adds a search context, deletion with optional slot reuse, compaction with clustering, and persistence via persistence_v2.
  • Search Implementations (src/search.zig, src/search2.zig): Two search modules: legacy search with generic topK/topKFiltered; search2 with a new topK2 supporting generic context types and ef-search configuration. Both return SearchResult(T) with id and distance.
  • Module Exports & Root (src/zvdb.zig, src/hnsw.zig): zvdb.zig updated as the root module exporting all submodules (distance, metadata, hnsw_v2, persistence, search2, etc.), with HNSW as a factory function and a legacy namespace for backward compatibility.
  • Benchmark Updates (benchmarks/shared_benchmarks.zig, benchmarks/single_threaded_benchmarks.zig, benchmarks/multi_threaded_benchmarks.zig): Updated to use namespace imports, refactored HNSW initialization with the new signature and batch approach, changed format specifiers from {} to {f}, and modified the BenchmarkResult.format signature. Removed multi-threaded execution paths in favor of single-threaded batching.
  • Comprehensive Test Suite (src/test_hnsw.zig, src/test_hnsw_v2_basic.zig, src/test_hnsw_v2_compact.zig, src/test_hnsw_v2_persist.zig, src/test_graph_tape.zig, src/test_search2.zig, src/test_vector_store.zig, src/test_persistence_v2.zig): Extensive new tests covering distance metrics, SIMD equivalence, metadata encoding/decoding, the HNSW v2 lifecycle (insert, search, delete, compact, persist), graph tape operations, and search2 behavior.
  • Compile-Fail Tests (src/array_of_optionals.zig, src/array_of_structs.zig, src/array_of_strings.zig, src/name_collision.zig, src/zero_length_array.zig): New compile-time validation tests for metadata.FixedOf constraints, ensuring unsupported field types are caught at compile time.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant HNSW as HNSW v2
    participant VectorStore
    participant GraphTape
    participant Metadata
    participant Search
    participant Persist as Persistence

    Client->>HNSW: insert(point, meta)
    activate HNSW
    HNSW->>VectorStore: add(vec_bytes)
    activate VectorStore
    VectorStore->>VectorStore: grow if needed
    VectorStore-->>HNSW: slot_id
    deactivate VectorStore
    
    HNSW->>Metadata: encode(meta, string_table)
    activate Metadata
    Metadata-->>HNSW: meta_fixed
    deactivate Metadata
    
    HNSW->>GraphTape: allocateNode(level)
    activate GraphTape
    GraphTape-->>HNSW: graph_offset
    deactivate GraphTape
    
    HNSW->>HNSW: link neighbors at levels
    HNSW->>GraphTape: getNeighbors, pushBack
    deactivate HNSW

    Client->>HNSW: searchTopK(query, k, ef)
    activate HNSW
    HNSW->>Search: topK2(query, k, ctx)
    activate Search
    Search->>HNSW: distance(a, b)
    Search->>HNSW: getPoint(id)
    Search->>HNSW: getNeighborsSlice(id, level)
    Search->>HNSW: isDeleted(id)
    Search-->>HNSW: results[]
    deactivate Search
    HNSW-->>Client: SearchResult[]
    deactivate HNSW

    Client->>HNSW: save(path)
    activate HNSW
    HNSW->>VectorStore: read all vectors
    HNSW->>Metadata: encode all metadata
    HNSW->>GraphTape: read all graph data
    HNSW->>Persist: saveV2(header, sections...)
    activate Persist
    Persist->>Persist: write header + vectors + metadata + graph + strings
    deactivate Persist
    deactivate HNSW

    Client->>HNSW: load(path)
    activate HNSW
    HNSW->>Persist: loadHeaderV2(path)
    HNSW->>Persist: loadSectionsV2(path, header)
    activate Persist
    Persist-->>HNSW: sections{vectors, metadata, tape, strings}
    deactivate Persist
    HNSW->>VectorStore: restore from bytes
    HNSW->>GraphTape: restore from bytes
    HNSW->>Metadata: decode all metadata
    HNSW-->>Client: HNSW instance
    deactivate HNSW

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Areas requiring extra attention:

  • Generic comptime metadata framework (src/metadata.zig): Complex recursive field flattening, encode/decode logic, and optional handling. Verify all field types are correctly handled and no name collisions occur.
  • HNSW v2 core logic (src/hnsw_v2.zig): Dense insertion/search/deletion/compaction algorithms with graph operations, level selection, and multi-layer traversal. Verify neighbor linking, entry point maintenance, and compaction ID remapping.
  • Graph tape and neighbors (src/graph_tape.zig, src/neighbors_ref.zig): Low-level memory layout and offset calculations; verify alignment, bounds checks, and level-based indexing across all accessor paths.
  • Search implementation (src/search2.zig): Priority queue logic, visited tracking, and pruning strategy for ef-search. Verify correct handling of deleted nodes and result ordering.
  • Persistence round-trip (src/hnsw_v2.zig + src/persistence_v2.zig): Serialization/deserialization of vectors, metadata, graph structure, and string blobs. Verify header validation, section lengths, and metadata reconstruction.
  • Build system migration (build.zig, build.zig.zon): Zig 0.15+ module patterns, import wiring, and artifact declarations. Confirm all modules resolve correctly.

Poem

🐰 A modular warren, now built with care,
With distance metrics floating through the air,
Metadata encoded in fixed-size blocks,
Graph tapes and vectors locked by locks,
From search to save, each layer shines bright—
hop hop—refactored just right! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title is vague and generic, using non-descriptive language ('Updates', 'many things will happen here') that fails to convey the actual scope of changes. Resolution: replace it with a specific, concise title summarizing the main change, such as 'Refactor HNSW to v2 with metadata support and persistence' or 'Add HNSWv2 implementation with metadata encoding and graph restructuring'.

✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; skipping the check.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (6)
.gitignore (1)

4-13: Redundant ignore patterns.

Line 4 (.claude/*) already ignores all contents of the .claude/ directory, making lines 10 and 13 redundant. Either:

  • Keep only .claude/* if you want to ignore everything in that directory, or
  • Remove .claude/* and keep only the specific patterns if you want to track some files in .claude/
 .claude/*
 
 # Thoughts data directory (created by thoughts init)
 /.thoughts-data
-
-# Claude settings backup (managed by thoughts)
-/.claude/settings.local.json.bak
-
-# Claude settings quarantine backups (auto-pruned)
-/.claude/settings.local.json.malformed.*.bak
src/distance.zig (1)

43-59: Consider numerical stability for near-zero vectors.

The check if (na == 0 or nb == 0) return 1; handles exact zero vectors, but extremely small vectors could still cause numerical instability in the division. For typical use cases with normalized embeddings this is fine, but if unnormalized vectors are expected, consider adding an epsilon threshold.

-            if (na == 0 or nb == 0) return 1;
+            const epsilon: T = 1e-10;
+            if (na < epsilon or nb < epsilon) return 1;
src/metadata.zig (1)

125-151: Document lifetime requirements for decoded strings.

The doc comment mentions strings are slices into StringTable, which is good. Consider adding a stronger warning about the lifetime constraint - decoded Metadata instances become invalid if the StringTable is deallocated or modified.

 /// Decode FixedOf(Metadata) back to Metadata using StringTable for string fields.
-/// Note: The returned strings are slices into the StringTable, not owned copies.
+/// WARNING: The returned Metadata contains string slices that point into the StringTable.
+/// The Metadata becomes invalid if the StringTable is deallocated or reallocated.
+/// Callers must ensure the StringTable outlives all decoded Metadata instances.
 pub fn decode(
src/search.zig (1)

217-228: Consider bounding candidates queue growth.

The filtered search always adds neighbors to candidates regardless of filter outcome to preserve reachability. However, unlike the unfiltered version, there's no pruning based on ef_search for candidates that don't pass the filter. With very sparse filters, this could lead to significant memory growth.

Consider adding a maximum traversal limit or pruning candidates based on distance from the best filtered result.

src/test_hnsw.zig (1)

76-85: Size assertion may be fragile.

The size assertion @sizeOf(Fixed) == 16 assumes specific struct packing. While correct now, this could break if padding or field ordering changes. Consider using @sizeOf(Fixed) >= 16 or documenting the expected layout.

src/persistence.zig (1)

6-19: Consider explicit alignment for cross-platform compatibility.

The FileHeader is marked extern struct which disables Zig's automatic field reordering, but the struct layout may still have platform-specific padding. Consider adding explicit padding or using packed struct with align(1) if byte-exact serialization is required.

Also note that entry_point uses u64 while the code uses usize internally - this is good for 32/64-bit portability.
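
For illustration, a byte-stable header along those lines might look like this; the field set is hypothetical, inferred only from what this review mentions:

// Hypothetical sketch: extern struct fixes field order, and sized integer
// fields keep the on-disk layout identical across 32/64-bit targets.
pub const FileHeader = extern struct {
    magic: [4]u8, // e.g. "ZVDB"
    version: u32,
    dims: u64,
    node_count: u64,
    entry_point: u64, // sentinel (maxInt(u64)) encodes "no entry point"
};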

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f02c187 and 94e99fb.

📒 Files selected for processing (18)
  • .gitignore (1 hunks)
  • .thoughts/config.json (1 hunks)
  • .thoughts/config.v1.bak-20251204-220114.json (1 hunks)
  • benchmarks/multi_threaded_benchmarks.zig (1 hunks)
  • benchmarks/shared_benchmarks.zig (0 hunks)
  • benchmarks/single_threaded_benchmarks.zig (1 hunks)
  • build.zig (3 hunks)
  • build.zig.zon (1 hunks)
  • context (1 hunks)
  • references (1 hunks)
  • src/distance.zig (1 hunks)
  • src/hnsw.zig (6 hunks)
  • src/metadata.zig (1 hunks)
  • src/persistence.zig (1 hunks)
  • src/search.zig (1 hunks)
  • src/test_hnsw.zig (9 hunks)
  • src/zvdb.zig (1 hunks)
  • thoughts (1 hunks)
💤 Files with no reviewable changes (1)
  • benchmarks/shared_benchmarks.zig
🔇 Additional comments (32)
thoughts (1)

1-1: Scaffolding file for thoughts data tooling.

This appears to be a reference/discovery file for the thoughts tooling system. Consider whether this file should be tracked in version control given that .thoughts-data/ is already in .gitignore.

context (1)

1-1: Scaffolding file for thoughts context data.

Same pattern as the thoughts file. The file will be tracked in git while the actual .thoughts-data/context directory it references is ignored.

build.zig (4)

7-17: LGTM on Zig 0.15+ module migration.

The module-based library pattern is correctly implemented: addModule creates the reusable module, and addLibrary with .root_module wires it up.
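
For readers migrating their own projects, a minimal sketch of that wiring; names and options are illustrative, not the PR's exact build.zig:

const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // Reusable module that dependents import as "zvdb".
    const zvdb_mod = b.addModule("zvdb", .{
        .root_source_file = b.path("src/zvdb.zig"),
        .target = target,
        .optimize = optimize,
    });

    // Library artifact built from the same root module.
    const lib = b.addLibrary(.{
        .linkage = .static,
        .name = "zvdb",
        .root_module = zvdb_mod,
    });
    b.installArtifact(lib);
}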


53-62: Benchmark module pattern looks correct.

The single-threaded benchmark correctly creates a module, imports the zvdb library, and creates an executable from it.


73-82: Multi-threaded benchmark follows the same correct pattern.

Consistent with single-threaded benchmarks.


19-25: No module import needed for tests.

The test file imports zvdb.zig as a relative path (@import("zvdb.zig")), not as a named module like @import("zvdb"). Relative path imports don't require addImport registration in the build configuration, so the current setup is correct.

Likely an incorrect or invalid review comment.

build.zig.zon (2)

2-3: Zig 0.15 package manifest updates.

The bare symbol .zvdb for name and the fingerprint field are correct for Zig 0.15+. The fingerprint provides package identity for the build system.


9-9: Version bump to Zig 0.15.0.

This aligns with the module-based build pattern changes in build.zig. Ensure your CI/CD and developer environments are updated to Zig 0.15+.

references (1)

1-1: Scaffolding file for thoughts references data.

Consistent with the thoughts and context files pattern.

benchmarks/single_threaded_benchmarks.zig (1)

11-11: Same format specifier concern as in multi_threaded_benchmarks.zig.

The {f} specifier requires floating-point values. If shared.runInsertionBenchmark and shared.runSearchBenchmark return structs, this will fail to compile.

Also applies to: 16-16

.thoughts/config.json (1)

1-24: Configuration file for context management tooling looks valid.

The JSON structure is well-formed. Note that this configuration references a personal repository (personal_thoughts.git) for syncing thoughts and context. Ensure this is intentional for your workflow and that the repository access permissions are appropriately configured for any collaborators.

src/zvdb.zig (1)

1-7: Clean module re-export pattern.

This establishes a well-organized public API surface for the library. Consumers can now import from zvdb rather than individual internal modules, which provides better encapsulation and allows internal restructuring without breaking the public API.
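
A sketch of what such a root-module surface can look like; the file names follow the PR's listing, but the exact exports and the inner factory name are assumptions:

// src/zvdb.zig: re-export submodules and expose the factory.
pub const distance = @import("distance.zig");
pub const metadata = @import("metadata.zig");
pub const persistence = @import("persistence.zig");
pub const search = @import("search.zig");

const hnsw_impl = @import("hnsw.zig"); // inner name assumed

// Factory matching the PR's stated API: HNSW(T, metric, Metadata).
pub fn HNSW(
    comptime T: type,
    comptime metric: distance.DistanceMetric,
    comptime Metadata: type,
) type {
    return hnsw_impl.HNSW(T, metric, Metadata);
}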

src/distance.zig (2)

3-8: Well-designed distance metric enum.

The enum with explicit u8 backing type is appropriate for serialization and the set of metrics covers common use cases for vector similarity search.


10-19: Good comptime specialization pattern.

The distanceFn factory cleanly returns specialized distance functions at compile time, enabling zero-cost abstraction for metric selection.
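
A minimal sketch of that factory pattern; only the Euclidean variants are shown, and the PR's actual src/distance.zig adds SIMD fast paths and the remaining metrics:

const std = @import("std");

pub const DistanceMetric = enum(u8) { squared_euclidean, euclidean, cosine, dot_product };

// Comptime factory: returns a distance function specialized for `metric`.
pub fn distanceFn(comptime T: type, comptime metric: DistanceMetric) fn ([]const T, []const T) T {
    return switch (metric) {
        .squared_euclidean => struct {
            fn f(a: []const T, b: []const T) T {
                std.debug.assert(a.len == b.len);
                var sum: T = 0;
                for (a, b) |x, y| {
                    const d = x - y;
                    sum += d * d;
                }
                return sum;
            }
        }.f,
        .euclidean => struct {
            fn f(a: []const T, b: []const T) T {
                return std.math.sqrt(distanceFn(T, .squared_euclidean)(a, b));
            }
        }.f,
        else => @compileError("sketch covers only the Euclidean variants"),
    };
}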

src/search.zig (4)

1-8: LGTM!

Clean and simple generic result struct with appropriate fields for search results.


15-25: LGTM!

The function signature is well-designed with clear comptime parameters for distance function and type, along with runtime parameters for search configuration. The documentation clearly explains the expected interface for the nodes parameter.


28-53: LGTM!

Greedy descent logic correctly navigates from the top layer down to layer 1, refining the entry point at each level. The boundary check level < node.connections.len properly handles nodes with fewer levels.


55-119: LGTM!

The ef-search implementation at layer 0 is correct:

  • Uses min-heap for candidates (closest first) and max-heap for results (worst at top for pruning)
  • Proper early termination when current candidate is worse than the worst result
  • Correctly maintains result set size bounded by ef_search
  • Final extraction properly sorts results in ascending order by distance
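
As a standalone sketch of that loop: distances are precomputed into a dists array for brevity, whereas the PR's code computes them on the fly and walks per-level connections, so this is illustrative only:

const std = @import("std");

const Candidate = struct { id: usize, dist: f32 };

fn closerFirst(_: void, a: Candidate, b: Candidate) std.math.Order {
    return std.math.order(a.dist, b.dist);
}
fn fartherFirst(_: void, a: Candidate, b: Candidate) std.math.Order {
    return std.math.order(b.dist, a.dist);
}

// Layer-0 ef-search over a flat adjacency list; caller frees the result.
fn efSearch(
    allocator: std.mem.Allocator,
    graph: []const []const usize,
    dists: []const f32, // dists[i] = distance(query, node i), precomputed
    entry: usize,
    ef: usize,
) ![]Candidate {
    var candidates = std.PriorityQueue(Candidate, void, closerFirst).init(allocator, {});
    defer candidates.deinit();
    var results = std.PriorityQueue(Candidate, void, fartherFirst).init(allocator, {});
    defer results.deinit();

    const visited = try allocator.alloc(bool, graph.len);
    defer allocator.free(visited);
    @memset(visited, false);

    visited[entry] = true;
    try candidates.add(.{ .id = entry, .dist = dists[entry] });
    try results.add(.{ .id = entry, .dist = dists[entry] });

    while (candidates.removeOrNull()) |c| {
        // Early termination: closest open candidate is worse than the
        // worst kept result and the result set is already full.
        if (c.dist > results.peek().?.dist and results.count() >= ef) break;

        for (graph[c.id]) |nb| {
            if (visited[nb]) continue;
            visited[nb] = true;
            if (results.count() < ef or dists[nb] < results.peek().?.dist) {
                try candidates.add(.{ .id = nb, .dist = dists[nb] });
                try results.add(.{ .id = nb, .dist = dists[nb] });
                if (results.count() > ef) _ = results.remove(); // drop worst
            }
        }
    }

    // Pop from the max-heap (worst first) into the output back-to-front,
    // yielding ascending order by distance.
    const out = try allocator.alloc(Candidate, results.count());
    var i = out.len;
    while (results.removeOrNull()) |r| {
        i -= 1;
        out[i] = r;
    }
    return out;
}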
src/test_hnsw.zig (5)

3-11: LGTM!

Clean imports and well-structured test metadata type that covers typical use case fields (path, line range).


36-70: LGTM!

Good coverage of distance metrics with appropriate test cases for edge cases like orthogonal vectors (cosine) and identical vectors.


253-308: LGTM!

Excellent concurrent access test that validates thread safety with multiple threads inserting simultaneously and verifies correct final count.


483-531: LGTM!

Thorough round-trip test that validates not just structure fields but also search result consistency and metadata preservation across save/load cycles.


419-460: LGTM!

Well-structured test demonstrating the filtered search API with a realistic use case (filtering by file extension). Properly validates that results match the filter criteria.

src/persistence.zig (3)

43-54: LGTM!

Clean header loading with proper magic number and version validation. Returns descriptive errors for invalid files.


108-153: LGTM!

Metadata serialization is straightforward with proper size validation. No inner allocations means simpler error handling.


32-41: Likely an incorrect or invalid review comment.

src/hnsw.zig (6)

14-18: LGTM!

Clean parameterization with compile-time metric and metadata type. The MetaFixed and dist aliases provide convenient access to derived types.


27-43: LGTM!

Node initialization properly handles memory allocation with appropriate errdefer cleanup. Clear ownership semantics for point data.


89-99: LGTM!

Insert properly validates dimensions and encodes metadata. The metadata encoding integrates cleanly with the string table for string field storage.


199-250: LGTM!

Search functions properly validate query dimensions and delegate to the search module with appropriate locking. The filtered search API is well-designed with compile-time predicate.


269-302: LGTM!

Save function properly acquires lock, builds header with sentinel value for null entry point, and serializes all sections with appropriate cleanup via defer.


383-396: LGTM!

Entry point handling correctly decodes the sentinel value and validates bounds before assignment.

Comment on lines 361 to 380
const points = try persistence.deserializeVectors(T, vectors_bytes, node_count, dims, allocator);
defer allocator.free(points); // Outer slice only; inner slices moved to nodes

const metas = try persistence.deserializeMetadata(MetaFixed, metas_bytes, node_count, allocator);
defer if (metas.len > 0) allocator.free(metas);

const conns = try persistence.deserializeGraph(graph_bytes, node_count, allocator);
defer if (conns.len > 0) allocator.free(conns);

// Assemble nodes
for (0..node_count) |id| {
    const node = Node{
        .id = id,
        .point = points[id], // Takes ownership
        .meta_fixed = metas[id],
        .connections = conns[id], // Takes ownership
        .mutex = .{},
    };
    try g.nodes.put(id, node);
}

⚠️ Potential issue | 🟡 Minor

Ownership transfer during error may cause double-free or leak.

The deserialized points, metas, and conns are freed via defer (lines 362, 365, 368), but inner slices are transferred to nodes in the loop (lines 374, 376). If g.nodes.put fails mid-loop:

  • Previously assembled nodes have taken ownership of their slices
  • Remaining slices would be freed by the defer, but the outer slice free doesn't free inner slices
  • On g.deinit() via errdefer (line 313), those nodes would try to free their slices again

Consider restructuring to either:

  1. Not use defer and manually free remaining items on error
  2. Use toOwnedSlice pattern to transfer entire ownership

Comment on lines 3 to 9
pub const StringTable = struct {
    data: []u8 = &.{},

    pub fn slice(self: *const StringTable, off: u32, len: u32) []const u8 {
        return self.data[@as(usize, off)..][0..@as(usize, len)];
    }
};

🛠️ Refactor suggestion | 🟠 Major

StringTable needs a deinit method for memory cleanup.

The StringTable stores allocated memory in data but lacks a deinit method. Callers need a way to free this memory.

 pub const StringTable = struct {
     data: []u8 = &.{},
+    allocator: ?std.mem.Allocator = null,
 
     pub fn slice(self: *const StringTable, off: u32, len: u32) []const u8 {
         return self.data[@as(usize, off)..][0..@as(usize, len)];
     }
+
+    pub fn deinit(self: *StringTable) void {
+        if (self.allocator) |alloc| {
+            if (self.data.len > 0) {
+                alloc.free(self.data);
+            }
+        }
+        self.* = .{};
+    }
 };
🤖 Prompt for AI Agents
In src/metadata.zig around lines 3 to 9, StringTable currently holds allocated
memory in its data field but provides no way to free it; add a public deinit
method like pub fn deinit(self: *StringTable, allocator: *std.mem.Allocator)
void that checks if self.data.len > 0 and frees the backing allocation via
allocator.free(self.data), then set self.data = &.{} to avoid double-free;
update any callers that construct/populate StringTable to call deinit with the
allocator when done.

src/metadata.zig Outdated
Comment on lines 102 to 110
if (st.data.len == 0) {
    // First allocation
    const new_data = try allocator.alloc(u8, s.len);
    st.data = new_data;
} else {
    // Resize existing
    const new_data = try allocator.realloc(st.data, old_len + s.len);
    st.data = new_data;
}

⚠️ Potential issue | 🟠 Major

Memory leak risk on realloc failure.

If allocator.realloc fails (returns error), the existing st.data is not freed. The error propagates but the old allocation remains, causing a leak. Also, the allocator used should be stored in the StringTable for later cleanup.

             // Append string to blob
             const old_len = st.data.len;
             if (st.data.len == 0) {
                 // First allocation
                 const new_data = try allocator.alloc(u8, s.len);
                 st.data = new_data;
+                st.allocator = allocator;
             } else {
                 // Resize existing
-                const new_data = try allocator.realloc(st.data, old_len + s.len);
-                st.data = new_data;
+                const new_data = allocator.realloc(st.data, old_len + s.len) catch |err| {
+                    // Note: old st.data remains valid, caller must handle cleanup
+                    return err;
+                };
+                st.data = new_data;
             }

Alternatively, ensure the caller always has access to an allocator that matches what was used for encoding.

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines +94 to +106
    const points = try allocator.alloc([]T, node_count);
    errdefer allocator.free(points);

    var off: usize = 0;
    for (0..node_count) |id| {
        const p = try allocator.alloc(T, dims);
        const src = data[off..][0..(dims * elem_size)];
        @memcpy(std.mem.sliceAsBytes(p), src);
        points[id] = p;
        off += dims * elem_size;
    }
    return points;
}

⚠️ Potential issue | 🟡 Minor

Memory leak on partial deserialization failure.

If allocator.alloc(T, dims) fails mid-loop (line 99), previously allocated point slices (lines 99, 102) are not freed. The errdefer on line 95 only frees the outer points slice.

Consider tracking and freeing allocated inner slices on error:

     const points = try allocator.alloc([]T, node_count);
-    errdefer allocator.free(points);
+    errdefer {
+        for (points[0..off / (dims * elem_size)]) |p| allocator.free(p);
+        allocator.free(points);
+    }

Or use an index-based approach to track how many have been allocated.

🤖 Prompt for AI Agents
In src/persistence.zig around lines 94 to 106, the inner per-point allocations
can leak if allocator.alloc(T, dims) fails mid-loop because only the outer
points slice is errdefer-freed; modify the loop to track how many inner slices
have been successfully allocated (e.g., maintain a counter or use the loop
index), and on any allocation error free each of those allocated inner slices
before returning (and still free the outer points slice). Implement an explicit
cleanup path that iterates 0..allocated_count to allocator.free each points[i]
on error, then free points and return the error.

Comment on lines +172 to +178
for (node.connections) |level_conns| {
    const neighbor_count: u32 = @intCast(level_conns.items.len);
    try buffer.appendSlice(allocator, std.mem.asBytes(&neighbor_count));
    for (level_conns.items) |nb| {
        const nb_id: u32 = @intCast(nb);
        try buffer.appendSlice(allocator, std.mem.asBytes(&nb_id));
    }

⚠️ Potential issue | 🟠 Major

Endianness inconsistency between serialize and deserialize.

Serialization uses std.mem.asBytes(&neighbor_count) which uses native endianness, but deserialization uses std.mem.readInt(..., .little) which forces little-endian. This will cause data corruption on big-endian systems.

Apply consistent little-endian encoding:

-                const neighbor_count: u32 = @intCast(level_conns.items.len);
-                try buffer.appendSlice(allocator, std.mem.asBytes(&neighbor_count));
+                var neighbor_count_bytes: [4]u8 = undefined;
+                std.mem.writeInt(u32, &neighbor_count_bytes, @intCast(level_conns.items.len), .little);
+                try buffer.appendSlice(allocator, &neighbor_count_bytes);
                 for (level_conns.items) |nb| {
-                    const nb_id: u32 = @intCast(nb);
-                    try buffer.appendSlice(allocator, std.mem.asBytes(&nb_id));
+                    var nb_bytes: [4]u8 = undefined;
+                    std.mem.writeInt(u32, &nb_bytes, @intCast(nb), .little);
+                    try buffer.appendSlice(allocator, &nb_bytes);
                 }
🤖 Prompt for AI Agents
In src/persistence.zig around lines 172 to 178, the serialization writes
integers using std.mem.asBytes(&neighbor_count) which produces native-endian
bytes while deserialization always reads little-endian; change the serialization
to explicitly write little-endian bytes so both sides match. Fix by converting
each integer to little-endian before appending (e.g., use the integer's
toLittleEndian() method or a std helper like std.mem.writeIntLittle /
writeIntLittleEndian into a small buffer) and append those bytes (neighbor_count
and each nb_id) instead of raw native-endian asBytes; ensure types are cast to
u32 first and that the same little-endian format is used for all integer writes.

Comment on lines +195 to +231
var connections = try allocator.alloc([]std.ArrayList(usize), node_count);
errdefer allocator.free(connections);

var offset: usize = 0;
for (0..node_count) |id| {
    if (offset >= data.len) {
        connections[id] = &.{};
        continue;
    }

    const level_count = data[offset];
    offset += 1;

    if (level_count == 0) {
        connections[id] = &.{};
        continue;
    }

    const levels = try allocator.alloc(std.ArrayList(usize), level_count);
    errdefer allocator.free(levels);

    for (0..level_count) |level| {
        const neighbor_count = std.mem.readInt(u32, data[offset..][0..4], .little);
        offset += 4;

        levels[level] = std.ArrayList(usize){};
        try levels[level].ensureTotalCapacity(allocator, neighbor_count);

        for (0..neighbor_count) |_| {
            const nb_id = std.mem.readInt(u32, data[offset..][0..4], .little);
            offset += 4;
            try levels[level].append(allocator, @intCast(nb_id));
        }
    }

    connections[id] = levels;
}

⚠️ Potential issue | 🟡 Minor

Incomplete error cleanup in deserializeGraph.

The nested allocation structure makes error cleanup complex. If an error occurs mid-loop:

  1. Line 196's errdefer only frees the outer connections slice
  2. Line 214's errdefer only frees the current node's levels slice
  3. ArrayLists populated via append are not cleaned up
  4. Previously processed nodes' allocations are leaked

Consider a cleanup helper or restructuring to ensure all nested allocations are freed on error paths.

🤖 Prompt for AI Agents
In src/persistence.zig around lines 195 to 231, the nested allocations in
deserializeGraph can leak on errors because the current errdefer calls only free
the outer slices and the currently allocated inner slice, but not any ArrayLists
that have been appended to or levels allocated for previously processed nodes;
implement a deterministic cleanup routine: after allocating connections,
register a single errdefer that calls a helper cleanup(allocator, connections,
processed_node_count) which iterates 0..processed_node_count, for each node
iterates its levels length and deallocates each levels[level] ArrayList (and any
internal buffers), then frees the levels slice itself, and finally frees the
connections slice; ensure that when you allocate a node's levels you initialize
each levels[level] to an empty ArrayList before filling so the cleanup can
safely free partially-populated structures, and remove the per-allocation
errdefer usage in favor of the single cleanup helper to guarantee no leaks on
any early return or error.

Allison Durham added 3 commits December 5, 2025 10:16
- Add SIMD-accelerated distance metrics for f32/f64 using std.simd
- Add batch insert with single-lock optimization and pre-allocated StringTable
- Add soft delete with tombstones and entry point migration
- Add updateReplace (new ID) and updateInPlace (stable ID) operations
- Add compact() to rebuild contiguous IDs after deletions
- Update save() to always compact before writing
- Update search to skip deleted nodes while preserving graph traversal
Update benchmark functions to use the new HNSW(T, metric, Metadata)
signature and updated method names (insert -> insert with metadata,
search -> searchDefault).

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/shared_benchmarks.zig (1)

37-48: toCsv has memory leak risk and uses catch unreachable.

This function allocates via page_allocator but the caller has no way to free the returned slice (no allocator handle returned). Additionally, catch unreachable will panic on allocation failure instead of propagating the error.

-    pub fn toCsv(self: BenchmarkResult) []const u8 {
-        return std.fmt.allocPrint(std.heap.page_allocator, "{s},{d},{d},{d},{d},{d},{d},{d:.2}", .{
+    pub fn toCsv(self: BenchmarkResult, allocator: std.mem.Allocator) ![]const u8 {
+        return std.fmt.allocPrint(allocator, "{s},{d},{d},{d},{d},{d},{d},{d:.2}", .{
             self.operation,
             self.num_points,
             self.dimensions,
             self.num_queries orelse 0,
             self.k orelse 0,
             self.num_threads orelse 1,
             self.total_time_ns,
             self.operations_per_second,
-        }) catch unreachable;
+        });
     }
♻️ Duplicate comments (3)
src/metadata.zig (2)

3-22: StringTable memory management issues already flagged.

The previous review identified that StringTable needs a deinit method for memory cleanup and that ensureAdditionalCapacity should track the allocator. These remain valid concerns.


113-136: Memory leak on realloc failure already flagged.

The previous review identified that if allocator.realloc fails, the existing st.data allocation is not freed, causing a leak.

src/hnsw.zig (1)

614-633: Ownership transfer during load - verify no double-free on error.

The previous review flagged that if g.nodes.put fails mid-loop, there's a risk of double-free because:

  • points and conns outer slices are freed by defer (lines 615, 621)
  • Inner slices are transferred to nodes in the loop
  • On g.deinit() via errdefer (line 566), those nodes would try to free their slices again

The current implementation still has this concern. If put fails at iteration N, nodes 0..N-1 have taken ownership of their points[i] and conns[i], but the defer statements only free the outer slice, not the inner slices for indices N+1 onwards.

🧹 Nitpick comments (6)
src/distance.zig (1)

39-48: Missing length assertion in euclideanDist.

Unlike squaredEuclidean, this function doesn't assert length equality before calling the SIMD path directly. While simdSquaredEuclidean will eventually work with mismatched lengths (producing incorrect results silently), adding the assertion here maintains consistency and fails fast.

 fn euclideanDist(comptime T: type) fn ([]const T, []const T) T {
     return struct {
         fn f(a: []const T, b: []const T) T {
+            std.debug.assert(a.len == b.len);
             if (T == f32 or T == f64) {
                 return std.math.sqrt(simdSquaredEuclidean(T, a, b));
             }
             return std.math.sqrt(squaredEuclidean(T)(a, b));
         }
     }.f;
 }
src/metadata.zig (1)

168-197: encodeInto has no bounds checking on cursor.

If the caller didn't allocate enough capacity via ensureAdditionalCapacity, the @memcpy at line 185 will write past the allocated buffer causing undefined behavior.

 pub fn encodeInto(
     comptime Metadata: type,
     meta: Metadata,
     st: *StringTable,
     cursor: *usize,
 ) FixedOf(Metadata) {
     if (Metadata == void) return .{};
 
     var fixed: FixedOf(Metadata) = undefined;
     const info = @typeInfo(Metadata).@"struct";
 
     inline for (info.fields) |f| {
         switch (@typeInfo(f.type)) {
             .pointer => {
                 const s = @field(meta, f.name);
                 const off: u32 = @intCast(cursor.*);
                 const len: u32 = @intCast(s.len);
+                std.debug.assert(cursor.* + s.len <= st.data.len);
                 @memcpy(st.data[cursor.*..][0..s.len], s);
                 cursor.* += s.len;
                 @field(fixed, f.name ++ "_off") = off;
                 @field(fixed, f.name ++ "_len") = len;
             },
src/test_hnsw.zig (1)

732-756: Memory leak test doesn't call hnsw.deinit() before arena cleanup.

While the ArenaAllocator will free all memory regardless, not calling deinit() means the test doesn't verify that HNSW's cleanup logic works correctly. This test currently only verifies that HNSW doesn't hold references to memory outside the arena.

 test "HNSW - Memory Leaks" {
     var hnsw: HNSW(f32, .squared_euclidean, void) = undefined;
     {
         var arena = std.heap.ArenaAllocator.init(testing.allocator);
         defer arena.deinit();
         const allocator = arena.allocator();
 
         hnsw = HNSW(f32, .squared_euclidean, void).init(allocator, 64, 16, 200);
+        defer hnsw.deinit();  // Verify deinit runs without error
 
         const num_points = 1000;
         const dim = 64;
src/hnsw.zig (3)

96-98: Consider returning Error.DimMismatch instead of debug.assert for dimension validation.

Using debug.assert means dimension mismatches silently corrupt data in release builds. Given that Error.DimMismatch exists, returning an error provides consistent runtime safety.

-        pub fn insert(self: *Self, point: []const T, meta: Metadata) !usize {
-            std.debug.assert(point.len == self.dims);
+        pub fn insert(self: *Self, point: []const T, meta: Metadata) !usize {
+            if (point.len != self.dims) return Error.DimMismatch;

The same applies to insertBatch (line 159), searchTopK (line 449), and searchFiltered (line 480).


151-159: Batch insert dimension validation should return error, not assert.

Same issue as single insert - using debug.assert for user-provided data is unsafe in release builds.

         pub fn insertBatch(self: *Self, points: []const []const T, metas: []const Metadata) ![]usize {
-            std.debug.assert(points.len == metas.len);
+            if (points.len != metas.len) return error.LengthMismatch;
             if (points.len == 0) return &[_]usize{};
 
             self.mutex.lock();
             defer self.mutex.unlock();
 
             // Validate all dimensions
-            for (points) |p| std.debug.assert(p.len == self.dims);
+            for (points) |p| {
+                if (p.len != self.dims) return Error.DimMismatch;
+            }

438-441: Prefer returning Error.DimMismatch over debug.assert for runtime safety.

Consistent with earlier feedback on insert - user-provided query vectors should be validated with error returns.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 94e99fb and 6e0c2f3.

📒 Files selected for processing (6)
  • benchmarks/shared_benchmarks.zig (5 hunks)
  • src/distance.zig (1 hunks)
  • src/hnsw.zig (6 hunks)
  • src/metadata.zig (1 hunks)
  • src/search.zig (1 hunks)
  • src/test_hnsw.zig (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/search.zig
🔇 Additional comments (21)
benchmarks/shared_benchmarks.zig (3)

2-4: LGTM - imports updated to use module namespace.

The import pattern using zvdb namespace and aliasing HNSW and DistanceMetric is clean and consistent with the module re-export pattern.


59-70: LGTM - insertion benchmark updated correctly.

The HNSW initialization with (f32, .squared_euclidean, void) and insert(point, {}) correctly matches the new API.


88-107: LGTM - search benchmark updated correctly.

Both insert and searchDefault calls are correctly updated to the new API signatures.

src/distance.zig (4)

3-8: LGTM - Clean enum definition for distance metrics.

The enum with u8 backing type is memory-efficient and covers the essential metrics.


50-70: LGTM - cosine distance implementation is correct.

The handling of zero-norm vectors (returning 1 for maximum distance) is appropriate, and the "1 - similarity" formulation maintains the "lower is better" invariant.
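
For reference, the formulation being checked is d(a, b) = 1 − (a · b) / (‖a‖ ‖b‖), with d = 1 returned when either norm is zero so that degenerate vectors rank as maximally distant.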


91-111: LGTM - SIMD squared Euclidean with proper tail handling.

The implementation correctly handles vector lengths not divisible by SIMD lane width via scalar tail processing.
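
The technique, as a self-contained sketch with the lane count chosen via std.simd.suggestVectorLength; this mirrors the idea, not the PR's exact code:

const std = @import("std");

// Sketch: vectorized squared Euclidean with a scalar tail for the
// remainder when len is not a multiple of the lane count.
fn simdSquaredEuclidean(comptime T: type, a: []const T, b: []const T) T {
    std.debug.assert(a.len == b.len);
    const lanes = std.simd.suggestVectorLength(T) orelse 4;
    const V = @Vector(lanes, T);

    var acc: V = @splat(0);
    var i: usize = 0;
    while (i + lanes <= a.len) : (i += lanes) {
        const va: V = a[i..][0..lanes].*;
        const vb: V = b[i..][0..lanes].*;
        const d = va - vb;
        acc += d * d;
    }
    var sum: T = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) {
        const d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}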


132-159: LGTM - Efficient single-pass SIMD for cosine components.

Computing dot product and both norms in one pass is a good optimization, avoiding redundant memory traversal.

src/metadata.zig (2)

26-92: LGTM - Elegant comptime struct generation.

The FixedOf function correctly transforms dynamic string fields to offset/length pairs while preserving primitive fields, producing a fixed-size extern struct suitable for persistence.
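
The core idea, reduced to a hand-written example; the PR derives MetaFixed via comptime reflection rather than writing it out like this:

const std = @import("std");

const Meta = struct { path: []const u8, line: u32 };

// What FixedOf(Meta) conceptually produces: string fields become
// (offset, length) pairs into a shared blob; primitives pass through.
const MetaFixed = extern struct { path_off: u32, path_len: u32, line: u32 };

fn encode(meta: Meta, blob: *std.ArrayList(u8), allocator: std.mem.Allocator) !MetaFixed {
    const off: u32 = @intCast(blob.items.len);
    try blob.appendSlice(allocator, meta.path);
    return .{ .path_off = off, .path_len = @intCast(meta.path.len), .line = meta.line };
}

fn decode(fixed: MetaFixed, blob: []const u8) Meta {
    // The returned string is a view into the blob: the same lifetime
    // caveat raised for StringTable applies here.
    return .{
        .path = blob[fixed.path_off..][0..fixed.path_len],
        .line = fixed.line,
    };
}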


140-164: LGTM - decode correctly reconstructs Metadata from fixed representation.

The documentation clearly states returned strings are slices into StringTable, not owned copies - this is the expected behavior.

src/test_hnsw.zig (6)

1-11: LGTM - Clean test setup with practical metadata struct.

The TestMeta struct with path, line_start, line_end is a good real-world example for testing the metadata encoding/decoding.


36-70: LGTM - Thorough distance metric unit tests.

Good coverage of edge cases: orthogonal vectors for cosine, identical vectors, and explicit expected values for dot product.


116-200: LGTM - Excellent SIMD equivalence testing across dimensions.

Testing various dimensions (1, 4, 8, 16, 100, 128, 768) ensures SIMD tail handling is correct. Using relative tolerance for larger results is appropriate.


217-238: LGTM - Metadata encode/decode round-trip verified.

The test properly validates both primitive fields and string field reconstruction via StringTable.


758-813: LGTM - Concurrent insertion test validates thread safety.

The test spawns 8 threads each inserting 1000 points, then verifies the total count matches expected and search works correctly.


988-1036: LGTM - Comprehensive save/load round-trip verification.

The test validates structure fields, search results, distances, and metadata preservation after persistence round-trip.

src/hnsw.zig (6)

7-12: LGTM - Clean module organization with public re-exports.

The explicit imports and DistanceMetric alias make the public API clear.


26-57: LGTM - Node struct with proper initialization and cleanup.

The errdefer chain in Node.init correctly handles partial allocation failures, and deinit properly frees both connections and point data.


238-252: LGTM - Soft delete with tombstone pattern.

The implementation correctly marks the node as deleted and updates the entry point if necessary. Search traversal through deleted nodes' neighbors is preserved.


300-378: LGTM - Compact implementation with proper ID remapping.

The compact function correctly:

  1. Builds old→new ID mapping
  2. Remaps all connection references
  3. Frees deleted nodes
  4. Updates entry point
  5. Recomputes max_level

518-555: LGTM - Save implementation correctly compacts before persisting.

The auto-compaction ensures clean contiguous IDs and no tombstones in the persisted file.


569-577: The file.reader(&buf) API usage is correct and safe. The stack-allocated buffer has sufficient lifetime because the reader's takeStruct() operation is synchronous and completes immediately (line 576 in hnsw.zig), before the buffer goes out of scope. This pattern is idiomatic in the codebase and matches the documented API: the buffer must outlive the reader operations, which it does. No lifetime issues exist.

- Add presence_words header for optional field tracking with bit-packed flags
- Support ?T optionals for primitives, strings, arrays, and nested structs
- Inline primitive arrays [N]T in FixedOf output
- Flatten nested structs with __ separator
- Add compile-time validation with clear error messages
- Recursive encode/decode with presence bit handling
- Add comprehensive tests including HNSW save/load round-trip
- Add compile-fail tests for unsupported type combinations

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
src/metadata.zig (1)

3-22: Add a small StringTable.deinit helper to centralize cleanup

StringTable owns an expandable data buffer, and helpers like appendString / encode / encodeInto grow it via the provided allocator, but there’s no convenience method to release that buffer. Call sites in tests are already manually doing if (st.data.len > 0) allocator.free(st.data);.

Consider adding something like:

pub fn deinit(self: *StringTable, allocator: std.mem.Allocator) void {
    if (self.data.len > 0) {
        allocator.free(self.data);
        self.data = &.{};
    }
}

and switching callers over to defer st.deinit(allocator);. That keeps memory‑ownership conventions explicit and avoids repeating ad‑hoc cleanup logic.

Also applies to: 312-328, 498-590

🧹 Nitpick comments (1)
src/test_hnsw.zig (1)

1160-1184: “HNSW - Memory Leaks” test does not actually exercise HNSW deallocation

This test initializes hnsw with an ArenaAllocator but never calls hnsw.deinit(), and ArenaAllocator.deinit() simply frees its backing allocations wholesale, without per‑allocation leak checking. As written, the test mainly exercises that operations succeed under an arena allocator, not whether HNSW correctly frees its internal allocations.

If the goal is to detect leaks in HNSW:

  • Create hnsw inside the arena scope with defer hnsw.deinit(); so deinit() runs before arena.deinit(), and
  • Use a leak‑detecting allocator (e.g., std.heap.GeneralPurposeAllocator hooked up to std.testing.allocator) directly, instead of hiding allocations behind ArenaAllocator.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6e0c2f3 and 8de5098.

📒 Files selected for processing (9)
  • build.zig (3 hunks)
  • src/array_of_optionals.zig (1 hunks)
  • src/array_of_strings.zig (1 hunks)
  • src/array_of_structs.zig (1 hunks)
  • src/enum_no_explicit_tag.zig (1 hunks)
  • src/metadata.zig (1 hunks)
  • src/name_collision.zig (1 hunks)
  • src/test_hnsw.zig (9 hunks)
  • src/zero_length_array.zig (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/zero_length_array.zig
🔇 Additional comments (7)
src/array_of_optionals.zig (1)

1-5: Compile‑fail harness for [N]?T metadata looks correct

Bad.a is [2]?u32, which validateSupportedField rejects (Arrays of optionals [N]?T not supported), so FixedOf(Bad) will fail at comptime as intended. No changes needed.
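
For readers unfamiliar with the pattern, such a probe file is roughly this shape (built manually via zig build-obj per the build.zig comment; the import path is an assumption):

// src/array_of_optionals.zig: expected to FAIL compilation, because
// metadata.FixedOf rejects arrays of optionals ([N]?T).
const metadata = @import("metadata.zig");

const Bad = struct { a: [2]?u32 };

comptime {
    _ = metadata.FixedOf(Bad);
}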

build.zig (1)

7-18: Root‑module based build wiring looks consistent

Library, tests, and benchmarks are wired using root_module / createModule in a way that matches Zig 0.15’s build API, and the zvdb module is shared correctly via addImport. The compile‑fail test files are intentionally left as manual zig build-obj runs per the comment, which is fine for now.

Also applies to: 19-25, 57-67, 77-87

src/name_collision.zig (1)

1-5: Reserved field name collision test is wired correctly

Using a user field named presence_words alongside an optional ensures FixedOf(Bad) trips the reserved‑name check in checkNameCollisions, so this file serves as a good compile‑fail guard.

src/array_of_strings.zig (1)

1-5: Compile‑fail for [N][]const u8 is aligned with validation rules

Bad.a is an array of strings, which validateSupportedField rejects, so FixedOf(Bad) will produce a comptime error as intended. No changes needed.

src/array_of_structs.zig (1)

1-6: Array‑of‑structs rejection is exercised correctly

Bad.a is [3]S where S is a struct; validateSupportedField’s array branch will emit the “Arrays of structs [N]S not supported” compileError, so this file is a good negative test for that rule.

src/test_hnsw.zig (2)

36-71: Distance and SIMD equivalence tests give solid coverage

The scalar vs. SIMD distance checks for squared‑euclidean, euclidean, cosine, and dot product across varied dims (f32/f64) look well‑constructed and numerically reasonable (relative/absolute tolerances are conservative). These will be very helpful for guarding future SIMD changes.

Also applies to: 76-200


1186-1241: Concurrent insertion test assumes HNSW is safely shared across threads

"HNSW - Concurrent Access" spawns 8 threads that all call insert on the same HNSW instance without external synchronization, then asserts count() == num_threads * points_per_thread.

That’s a good stress test, but it also encodes the contract that HNSW is thread‑safe for concurrent writes. Please ensure the implementation actually provides the necessary synchronization (or explicitly document that this test is allowed to data‑race in non‑debug builds and is only for experimentation).

Comment on lines 321 to 338
test "metadata encode/decode - optional present and null" {
    const Meta = struct { a: ?u32, b: u32 };
    var st = zvdb.metadata.StringTable{};

    // Present
    const m1: Meta = .{ .a = 42, .b = 10 };
    const f1 = try zvdb.metadata.encode(Meta, m1, &st, testing.allocator);
    const r1 = zvdb.metadata.decode(Meta, f1, &st);
    try testing.expect(r1.a != null);
    try testing.expectEqual(@as(u32, 42), r1.a.?);

    // Null
    const m2: Meta = .{ .a = null, .b = 20 };
    const f2 = try zvdb.metadata.encode(Meta, m2, &st, testing.allocator);
    const r2 = zvdb.metadata.decode(Meta, f2, &st);
    try testing.expect(r2.a == null);
    try testing.expectEqual(@as(u32, 20), r2.b);
}

⚠️ Potential issue | 🟠 Major

Several metadata encode/decode tests leak StringTable allocations

In these tests you allocate a fresh zvdb.metadata.StringTable and call metadata.encode with testing.allocator, but never free st.data:

  • metadata encode/decode - optional present and null
  • metadata encode/decode - optional array
  • metadata encode/decode - nested struct
  • metadata encode/decode - primitive array
  • metadata encode/decode - multiple optionals
  • metadata encode/decode - nested optional with inner optional

Under std.testing.allocator this will typically be reported as a leak.

Consider adding a common pattern, e.g.:

var st = zvdb.metadata.StringTable{};
defer if (st.data.len > 0) testing.allocator.free(st.data);

(or, once a StringTable.deinit exists, defer st.deinit(testing.allocator);) to keep the tests leak‑free.

Also applies to: 357-384, 406-442, 454-479

🤖 Prompt for AI Agents
In src/test_hnsw.zig around lines 321 to 338 (and similarly for ranges 357-384,
406-442, 454-479), the tests create a local zvdb.metadata.StringTable but never
free its allocations, causing allocator leaks under std.testing.allocator; fix
by ensuring the StringTable is cleaned up after each test — either add a defer
that frees st.data when non-empty using testing.allocator (e.g., defer if
(st.data.len > 0) testing.allocator.free(st.data);) or, if available, call a
dedicated st.deinit(testing.allocator) in a defer — apply this pattern to all
listed tests so no StringTable allocations leak.

Phase 1 - Concurrent reads:
- Replace global Mutex with RwLock for concurrent search
- Add shared locks to read-only methods (count, getNode, etc.)
- Fix connect() with ID-ordered locking to prevent deadlocks
- Handle self-edge case in connect()

Phase 2 - Parallel batch insert:
- Rewrite insertBatch with parallel worker threads
- Add insertBatchThreaded for explicit thread count control
- Pre-reserve HashMap and connection capacities before parallel phase
- Spawn workers for greedy search and connect operations

Benchmarks:
- Update insertion benchmark to use insertBatchThreaded
- Update search benchmark to spawn actual threads

Tests:
- Add concurrent search test (4 threads, 100 queries each)
- Add parallel insertBatch correctness test
- Add insertBatch edge cases (empty, single element)
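
The ID-ordered locking mentioned in the commit message above is a standard deadlock-avoidance technique; a minimal sketch, assuming per-node std.Thread.Mutex fields and numeric ids (illustrative, not the PR's actual connect()):

const std = @import("std");

const Node = struct {
    id: usize,
    mutex: std.Thread.Mutex = .{},
};

// Always acquire the lower-id lock first so two threads linking the same
// pair can never hold one lock each while waiting on the other.
fn lockPair(a: *Node, b: *Node) void {
    if (a.id == b.id) { // self-edge: one node, one lock
        a.mutex.lock();
        return;
    }
    const first = if (a.id < b.id) a else b;
    const second = if (a.id < b.id) b else a;
    first.mutex.lock();
    second.mutex.lock();
}

fn unlockPair(a: *Node, b: *Node) void {
    a.mutex.unlock();
    if (a.id != b.id) b.mutex.unlock();
}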

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/hnsw.zig (1)

97-147: errdefer node.deinit in insert causes use-after-free when subsequent operations fail

In insert (lines 109–112), once the node is placed into self.nodes via put(), any subsequent error (e.g., from connect at line 131) will trigger the errdefer and deinit the node while it remains stored in the map, leaving a dangling pointer inside self.nodes. Further access to the index becomes use-after-free.

In insertBatchThreaded (lines 207–214), the pre-insert loop reuses a single node variable with an errdefer inside each iteration. While errdefers are scoped to their iterations, if all insertions succeed but a later operation fails (e.g., line 220+), the nodes remain in the map with no cleanup, leading to resource leaks and potential inconsistency.

A safer pattern is to handle errors only at the point of failure:

@@ pub fn insert(self: *Self, point: []const T, meta: Metadata) !usize {
             const mf = try metadata.encode(Metadata, meta, &self.string_table, self.allocator);
 
             var node = try Node.init(self.allocator, id, point, level, mf);
-            errdefer node.deinit(self.allocator);
-
-            try self.nodes.put(id, node);
+            self.nodes.put(id, node) catch |err| {
+                node.deinit(self.allocator);
+                return err;
+            };
@@ pub fn insertBatchThreaded(self: *Self, points: []const []const T, metas: []const Metadata, num_threads: ?usize) ![]usize {
                 const id = base_id + i;
                 const level = levels[i];
                 var node = try Node.init(self.allocator, id, p, level, fixeds[i]);
-                errdefer node.deinit(self.allocator);
 
                 // Pre-reserve connection capacity to avoid realloc during parallel phase
                 for (node.connections) |*alist| {
                     try alist.ensureTotalCapacity(self.allocator, self.m + 1);
                 }
-                try self.nodes.put(id, node);
+                self.nodes.put(id, node) catch |err| {
+                    node.deinit(self.allocator);
+                    return err;
+                };

This avoids deiniting nodes that are already owned by the map on later failures. If full rollback is needed on post-insertion errors (e.g., within connect), an explicit cleanup path must remove partially inserted nodes from self.nodes before returning an error.
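
Stripped to its essence, the ownership rule looks like this (a sketch with a hypothetical Node and map, not the actual zvdb types):

const std = @import("std");

const Node = struct {
    buf: []u8,
    fn deinit(self: *Node, a: std.mem.Allocator) void {
        a.free(self.buf);
    }
};

fn insertSketch(a: std.mem.Allocator, map: *std.AutoHashMap(usize, Node), id: usize) !void {
    var node = Node{ .buf = try a.alloc(u8, 16) };
    // Clean up only if the put itself fails; on success the map owns `node`.
    map.put(id, node) catch |err| {
        node.deinit(a);
        return err;
    };
    // From here on, a failure must roll back explicitly (remove `id` from the
    // map, then deinit the removed node); an errdefer on `node` would free
    // memory the map still points at.
}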

♻️ Duplicate comments (1)
src/test_hnsw.zig (1)

321-479: Several metadata encode/decode tests still leak StringTable allocations

This is the same issue called out in the earlier review: some tests allocate a fresh zvdb.metadata.StringTable and call metadata.encode with testing.allocator but never free st.data. Under std.testing.allocator this will show up as leaks.

The affected tests in this version:

  • metadata encode/decode - optional present and null
  • metadata encode/decode - optional array
  • metadata encode/decode - nested struct
  • metadata encode/decode - primitive array
  • metadata encode/decode - multiple optionals
  • metadata encode/decode - nested optional with inner optional

You already added cleanup in other tests (e.g. metadata encode/decode - optional string, metadata encode and decode, optional nested struct); mirroring that here will keep the whole suite leak‑free.

Suggested patch pattern:

 test "metadata encode/decode - optional present and null" {
     const Meta = struct { a: ?u32, b: u32 };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);
@@
 test "metadata encode/decode - optional array" {
     const Meta = struct { arr: ?[3]i16 };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);
@@
 test "metadata encode/decode - nested struct" {
@@
     const Meta = struct { loc: Inner, tag: u8 };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);
@@
 test "metadata encode/decode - primitive array" {
@@
     const Meta = struct { arr: [4]f32, tag: u8 };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);
@@
 test "metadata encode/decode - multiple optionals" {
@@
     const Meta = struct { a: ?u32, b: ?u32, c: u32 };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);
@@
 test "metadata encode/decode - nested optional with inner optional" {
@@
     const Meta = struct { data: ?Inner };
     var st = zvdb.metadata.StringTable{};
+    defer if (st.data.len > 0) testing.allocator.free(st.data);

This matches the pattern already used elsewhere and ensures no StringTable‑backed allocations escape the tests.

🧹 Nitpick comments (3)
benchmarks/shared_benchmarks.zig (1)

58-84: Batch insertion benchmark looks correct; consider avoiding @ptrCast for slice covariance

The HNSW init + batch insert path and lifetimes (points/metas/ids) look sound and match the new API. The only nit is:

const const_points: []const []const f32 = @ptrCast(points);

You can typically rely on slice const‑coercion instead of a low‑level @ptrCast, e.g.:

-    const const_points: []const []const f32 = @ptrCast(points);
+    const const_points: []const []const f32 = points;

Same idea applies to similar patterns in tests; it keeps the code idiomatic and avoids depending on slice representation details.

src/test_hnsw.zig (1)

1304-1353: Parallel insertBatch test is good coverage; consider simplifying the slice cast

The parallel insertBatch test does a nice job stress‑testing the internal worker‑based path. One minor nit:

const const_points: []const []const f32 = @ptrCast(points);

As with the benchmarks, you can usually rely on slice const‑coercion instead of @ptrCast, e.g.:

-    const const_points: []const []const f32 = @ptrCast(points);
+    const const_points: []const []const f32 = points;

This keeps the test code idiomatic and reduces dependence on slice representation details.

src/hnsw.zig (1)

630-651: getNode/getStringTable expose internal pointers without a lifetime guarantee under concurrency

Both helpers:

pub fn getStringTable(self: *const Self) *const metadata.StringTable { ... }

pub fn getNode(self: *Self, id: usize) ?*const Node {
    self.rwlock.lockShared();
    defer self.rwlock.unlockShared();
    return self.nodes.getPtr(id);
}

return raw pointers, but the read lock is released before the caller uses them. External callers that:

  • hold on to a *const Node across a mutation (delete, compact, updateReplace, updateInPlace, save with implicit compactLocked), or
  • hold on to a *const StringTable while concurrent inserts/batch inserts are appending to it

can observe use‑after‑free or stale data unless they also coordinate on the same RwLock.

Since tests already pass the StringTable pointer straight into searchFiltered (which does acquire a shared lock internally), this is probably safe for current internal usage, but it’s easy to misuse from external code.

Consider either:

  • documenting that callers must not retain these pointers across mutations, or
  • changing the API to operate via callbacks while holding the shared lock, e.g. withNode(id, fn (*const Node) void), withStringTable(fn (*const StringTable) void), so consumers can’t accidentally outlive the lock.

This would make the concurrency story around these helpers more robust.
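
One possible shape for the callback variant; a sketch assuming the Self, Node, rwlock, and nodes declarations already in src/hnsw.zig:

/// Run `f` against the node while the shared lock is held, so the
/// pointer cannot outlive the lock. A withStringTable helper would
/// follow the same shape.
pub fn withNode(
    self: *Self,
    id: usize,
    ctx: anytype,
    comptime f: fn (@TypeOf(ctx), *const Node) void,
) void {
    self.rwlock.lockShared();
    defer self.rwlock.unlockShared();
    if (self.nodes.getPtr(id)) |node| f(ctx, node);
}

Callers that need data past the callback must copy it out while the lock is held.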

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8de5098 and b73ec31.

📒 Files selected for processing (3)
  • benchmarks/shared_benchmarks.zig (3 hunks)
  • src/hnsw.zig (5 hunks)
  • src/test_hnsw.zig (9 hunks)
🔇 Additional comments (2)
src/test_hnsw.zig (1)

36-200: Distance metric + SIMD equivalence tests look solid

The scalar vs SIMD tests for squared‑euclidean, cosine, and dot product across multiple dims, plus the basic metric sanity checks, give good coverage and use appropriate tolerances and edge cases (e.g. zero‑norm cosine). No issues spotted here.

src/hnsw.zig (1)

8-83: Parametric HNSW + metadata/string-table plumbing looks coherent

The refactor to pub fn HNSW(comptime T, comptime metric, comptime Metadata) with:

  • MetaFixed = metadata.FixedOf(Metadata)
  • dist = distance.distanceFn(T, metric)
  • per‑instance dims, string_table, and RwLock

is consistent with how tests and benchmarks now use the API (typed metric + metadata). Node’s ownership model (owning point and connections, but only a MetaFixed view into the shared string table) plus init/deinit and HNSW.init/deinit all line up cleanly with persistence and metadata encode/decode.

Allison Durham added 2 commits December 7, 2025 01:44
- Phase 1: SIMD distance functions with vector accumulators and @mulAdd FMA
- Phase 2: VectorStore with 64-byte aligned SoA layout
- Phase 3: GraphTape and NeighborsRef for compact inline neighbors
- Phase 4: Search v2 with DynamicBitSet visited tracking
- Phase 5: HNSWv2 core with dense nodes, Xoshiro256 PRNG, shrink cache
- Phase 6: Persistence v2.0 with new header format and round-trip tests

Also fixes OOB read bug: guard getNeighborsSlice against accessing
levels not allocated for a node, and clamp neighbor count in slice().
Phase 7: Add soft delete modes - tombstone-only and slot reuse
Phase 8: Implement cluster-based compaction with findCluster() and compact()
Phase 9: Update zvdb.zig exports to HNSWv2, add legacy module access,
         refactor tests and benchmarks for new API
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (1)
src/test_hnsw.zig (1)

314-472: Ensure StringTable is deinitialized in all metadata encode/decode tests

Several tests still create a fresh zvdb.metadata.StringTable without ever freeing its backing data:

  • metadata encode/decode - optional present and null
  • metadata encode/decode - optional array
  • metadata encode/decode - nested struct
  • metadata encode/decode - primitive array
  • metadata encode/decode - multiple optionals
  • metadata encode/decode - nested optional with inner optional

If metadata.encode populates the string table for these Meta shapes (now or in the future), these will show up as leaks under std.testing.allocator.

To keep things consistent and future-proof, mirror the pattern you already use in other tests:

var st = zvdb.metadata.StringTable{};
defer if (st.data.len > 0) testing.allocator.free(st.data);

Adding that to each of the above tests will ensure no StringTable allocations leak, regardless of the exact encode behavior.

🧹 Nitpick comments (10)
src/test_hnsw_v2_basic.zig (1)

185-191: Potential underflow when res.len is 0.

If searchTopK returned an empty slice, the loop 0..res.len - 1 would underflow since res.len is usize. The assertion at line 185 (res.len > 0) currently rules that out, but consider adding explicit protection or restructuring to be defensive.

     try testing.expect(res.len > 0);
     try testing.expect(res.len <= 10);

     // Results should be sorted by distance
-    for (0..res.len - 1) |i| {
-        try testing.expect(res[i].distance <= res[i + 1].distance);
+    if (res.len > 1) {
+        for (0..res.len - 1) |i| {
+            try testing.expect(res[i].distance <= res[i + 1].distance);
+        }
     }
src/test_hnsw_v2_compact.zig (1)

81-88: Same potential underflow concern as in basic tests.

Line 85's loop 0..res2.len - 1 could underflow if res2.len is 0. While line 81 asserts res2.len >= 1, consider the same defensive pattern for consistency.

     try testing.expect(res2.len >= 1);
     try testing.expect(res2.len <= 4); // At most 4 nodes left

     // Results should be sorted by distance
-    for (0..res2.len - 1) |i| {
-        try testing.expect(res2[i].distance <= res2[i + 1].distance);
+    if (res2.len > 1) {
+        for (0..res2.len - 1) |i| {
+            try testing.expect(res2[i].distance <= res2[i + 1].distance);
+        }
     }
src/graph_tape.zig (1)

29-49: Consider extracting level offset calculation to reduce duplication.

The level offset calculation is duplicated between getNeighbors and getNeighborsConst. Extracting it to a helper would improve maintainability.

+    fn levelOffset(self: *const GraphTape, level: usize) usize {
+        return if (level == 0)
+            0
+        else
+            self.config.neighborsBaseBytes() + (level - 1) * self.config.neighborsLevelBytes();
+    }
+
     /// Get a NeighborsRef for the given node offset and level.
     pub fn getNeighbors(self: *GraphTape, offset: u32, level: usize) nr.NeighborsRef {
         const base = self.data.items.ptr + offset;
-        const lvl_off = if (level == 0)
-            0
-        else
-            self.config.neighborsBaseBytes() + (level - 1) * self.config.neighborsLevelBytes();
+        const lvl_off = self.levelOffset(level);
         const max_n = if (level == 0) self.config.m_base else self.config.m;
         return .{ .tape = base + lvl_off, .max_neighbors = max_n };
     }

     /// Get const neighbors (read-only view).
     pub fn getNeighborsConst(self: *const GraphTape, offset: u32, level: usize) nr.NeighborsRef {
         const base = self.data.items.ptr + offset;
-        const lvl_off = if (level == 0)
-            0
-        else
-            self.config.neighborsBaseBytes() + (level - 1) * self.config.neighborsLevelBytes();
+        const lvl_off = self.levelOffset(level);
         const max_n = if (level == 0) self.config.m_base else self.config.m;
         return .{ .tape = base + lvl_off, .max_neighbors = max_n };
     }
src/vector_store.zig (1)

61-80: Add bounds checks for slot in get/getTyped to mirror set

set asserts slot < self.count, but get and getTyped don’t, so a bad caller can read past initialized data (or past the buffer) without an early debug signal.

You can make this symmetric and safer in debug builds:

    /// Get vector at slot as raw bytes.
    pub fn get(self: *const VectorStore, slot: u32) []const u8 {
-        const off = @as(usize, slot) * self.vector_size_bytes;
+        std.debug.assert(@as(usize, slot) < self.count);
+        const off = @as(usize, slot) * self.vector_size_bytes;
         return self.data[off..][0..self.vector_size_bytes];
    }

    /// Get vector at slot as typed slice.
    pub fn getTyped(self: *const VectorStore, comptime T: type, slot: u32, dims: usize) []const T {
-        const off = @as(usize, slot) * self.vector_size_bytes;
+        std.debug.assert(@as(usize, slot) < self.count);
+        const off = @as(usize, slot) * self.vector_size_bytes;
         const ptr: [*]const T = @ptrCast(@alignCast(self.data.ptr + off));
         return ptr[0..dims];
    }

You may also want to assert that dims * @sizeOf(T) and capacity * vector_size_bytes don’t overflow usize in init/grow, but that’s more of a defensive refactor.

src/neighbors_ref.zig (1)

35-68: Add defensive bounds checks in get/set to complement max_neighbors

pushBack enforces n < self.max_neighbors, and slice clamps count to max_neighbors, but get/set blindly trust i. If a caller passes an out-of-range index (or the header is corrupted), you’ll read/write past the allocated tape.

Consider adding debug assertions:

    pub fn get(self: NeighborsRef, i: usize) u32 {
-        const off = @sizeOf(u32) + i * @sizeOf(u32);
+        std.debug.assert(i < self.max_neighbors);
+        const off = @sizeOf(u32) + i * @sizeOf(u32);
         return std.mem.readInt(u32, @as(*const [4]u8, @ptrCast(self.tape + off)), .little);
    }

    pub fn set(self: NeighborsRef, i: usize, val: u32) void {
-        const off = @sizeOf(u32) + i * @sizeOf(u32);
+        std.debug.assert(i < self.max_neighbors);
+        const off = @sizeOf(u32) + i * @sizeOf(u32);
         std.mem.writeInt(u32, @as(*[4]u8, @ptrCast(self.tape + off)), val, .little);
    }

This keeps the neighbor tape API safer in debug builds without changing release behavior.

src/hnsw_v2.zig (5)

75-81: Consider caching live count for frequent access.

liveCount() is O(n) which is fine for occasional use. If this is called frequently, consider maintaining a counter that's updated on insert/delete.
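
A minimal sketch of the cached-counter idea (the field and helper names here are hypothetical):

const std = @import("std");

const Index = struct {
    deleted: std.ArrayListUnmanaged(bool) = .empty,
    live_count: usize = 0,

    fn insert(self: *Index, a: std.mem.Allocator) !u32 {
        try self.deleted.append(a, false);
        self.live_count += 1; // maintained at each mutation site
        return @intCast(self.deleted.items.len - 1);
    }

    fn delete(self: *Index, id: u32) void {
        if (!self.deleted.items[id]) {
            self.deleted.items[id] = true;
            self.live_count -= 1;
        }
    }

    fn liveCount(self: *const Index) usize {
        return self.live_count; // O(1) instead of scanning all nodes
    }
};

compact() and load() would recompute the counter once from the nodes array.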


222-222: Redundant break statement.

The if (lvl == 0) break; is unnecessary—the loop condition while (lvl > level) already handles exit when lvl reaches level. Since lvl is only decremented in the continue expression after the body executes, this check never triggers in a way that the loop condition wouldn't already handle.

                     }
                 }
-                if (lvl == 0) break;
             }

327-333: Unused parameter old in findNewEntryPoint.

The parameter is suppressed with _ = old; but never used. Consider removing it if not needed, or using it to find a better replacement entry point (e.g., a neighbor of the old entry point) for better search quality.

-        fn findNewEntryPoint(self: *Self, old: u32) ?u32 {
-            _ = old;
+        fn findNewEntryPoint(self: *Self) ?u32 {
             for (self.nodes.items, 0..) |n, i| {
                 if (!n.deleted) return @intCast(i);
             }
             return null;
         }

Update the call site accordingly:

-            if (self.entry_point == id) self.entry_point = self.findNewEntryPoint(id);
+            if (self.entry_point == id) self.entry_point = self.findNewEntryPoint();

335-347: Clarify return values when Metadata is void.

When has_metadata is false, returning {} (empty struct/void) from a ?Metadata function is technically correct but semantically confusing. Returning null would be more consistent with the "no metadata" semantic. The current code returns {} (present but empty) vs null (absent).

         pub fn getMetadata(self: *const Self, id: u32) ?Metadata {
-            if (!has_metadata) return {};
+            if (!has_metadata) return null;
             if (id >= self.metadata_store.items.len) return null;
             if (self.nodes.items[id].deleted) return null;
             return metadata.decode(Metadata, self.metadata_store.items[id], &self.string_table);
         }

         pub fn getMetadataFixed(self: *const Self, id: u32) ?MetaFixed {
-            if (!has_metadata) return .{};
+            if (!has_metadata) return null;
             if (id >= self.metadata_store.items.len) return null;
             if (self.nodes.items[id].deleted) return null;
             return self.metadata_store.items[id];
         }

507-507: Direct tape memory manipulation is fragile.

Writing directly to nr.tape with pointer casting bypasses the NeighborsRef abstraction. If the internal format of NeighborsRef changes, this code will silently break.

Consider adding a setCount() method to NeighborsRef for encapsulation, or at minimum add a comment explaining the assumed memory layout.
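
For the first option, a sketch that mirrors the little-endian count-prefix layout get/set already assume (not necessarily the final API):

/// Write the neighbor-count header at the start of the tape, so callers
/// such as compact() can reset counts without raw pointer arithmetic.
pub fn setCount(self: NeighborsRef, n: u32) void {
    std.debug.assert(n <= self.max_neighbors);
    std.mem.writeInt(u32, @as(*[4]u8, @ptrCast(self.tape)), n, .little);
}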

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b73ec31 and 47a13e0.

📒 Files selected for processing (17)
  • .thoughts/config.json (1 hunks)
  • benchmarks/shared_benchmarks.zig (4 hunks)
  • src/distance.zig (1 hunks)
  • src/graph_tape.zig (1 hunks)
  • src/hnsw_v2.zig (1 hunks)
  • src/neighbors_ref.zig (1 hunks)
  • src/persistence_v2.zig (1 hunks)
  • src/search2.zig (1 hunks)
  • src/test_graph_tape.zig (1 hunks)
  • src/test_hnsw.zig (2 hunks)
  • src/test_hnsw_v2_basic.zig (1 hunks)
  • src/test_hnsw_v2_compact.zig (1 hunks)
  • src/test_hnsw_v2_persist.zig (1 hunks)
  • src/test_search2.zig (1 hunks)
  • src/test_vector_store.zig (1 hunks)
  • src/vector_store.zig (1 hunks)
  • src/zvdb.zig (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • .thoughts/config.json
  • src/distance.zig
🔇 Additional comments (36)
src/test_hnsw_v2_basic.zig (5)

1-31: LGTM! Well-structured basic insert/search test.

The test properly initializes the allocator with deferred cleanup, validates IDs, count, and search results including exact distance match verification.


33-53: LGTM! Good nearest neighbor validation.

The test correctly validates that querying near (0.9, 0.9) returns the closest point (1,1) at id 3.


55-79: LGTM! Deletion behavior is well-tested.

Properly validates that deleted nodes are tombstoned (count unchanged, liveCount decremented) and excluded from search results.


81-118: LGTM! Slot reuse configurations are thoroughly tested.

Both slot reuse and tombstone-only modes are validated with clear assertions on expected slot assignments.


120-152: LGTM! Metadata and empty index edge cases covered.

The metadata retrieval test properly validates both live and deleted node behavior, and the empty index test confirms zero results are returned.

src/test_hnsw_v2_persist.zig (3)

7-58: LGTM! Comprehensive persistence round-trip test.

The test properly validates vector contents, metadata preservation, and search functionality after save/load cycle with appropriate cleanup.


60-88: LGTM! Void metadata persistence is validated.

Good coverage for the case where metadata type is void, ensuring the persistence layer handles this correctly.


90-128: LGTM! Deleted state preservation is well-tested.

Validates that tombstoned entries persist correctly and remain excluded from search results after reload.

src/test_graph_tape.zig (4)

6-23: LGTM! Size calculations are mathematically correct.

The expected values align with the formula: base = 4 + m_base * 4 = 132, level = 4 + m * 4 = 68. Node sizes correctly accumulate across levels.
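
Spelled out with the defaults these tests use (m = 16, m_base = 32, one u32 count prefix per level), a sketch of the arithmetic:

const std = @import("std");

test "graph tape size arithmetic (illustrative)" {
    const m: usize = 16;
    const m_base: usize = 32;
    const base_bytes = 4 + m_base * 4; // count prefix + level-0 neighbor IDs
    const level_bytes = 4 + m * 4; // count prefix + upper-level neighbor IDs
    try std.testing.expectEqual(@as(usize, 132), base_bytes);
    try std.testing.expectEqual(@as(usize, 68), level_bytes);
    // A node with one level above the base occupies 132 + 68 = 200 bytes,
    // which is how contiguous offsets such as 0, 132, 332 arise.
    try std.testing.expectEqual(@as(usize, 200), base_bytes + level_bytes);
}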


25-66: LGTM! Core neighbor operations are well-tested.

Covers initialization, pushBack, count, get, clear, and iterator functionality across multiple levels.


68-113: LGTM! Contains, set, and slice operations validated.

Good coverage for mutation operations and slice-based access patterns.


115-157: LGTM! Multi-node allocation and byte access validated.

The offset calculations (0, 132, 332) correctly demonstrate that nodes are allocated contiguously with proper sizing, and getNodeBytes returns the expected slice length.

src/test_vector_store.zig (4)

5-36: LGTM! Comprehensive alignment and grow behavior test.

Validates 64-byte alignment (SIMD-friendly), correct slot indexing after growth, and typed retrieval accuracy.


38-52: LGTM! Empty initialization and auto-grow validated.

Good edge case coverage for zero initial capacity with subsequent growth on first add.


54-82: LGTM! Set overwrite and raw byte access tests are correct.

Validates that set properly overwrites existing data and get returns the correct byte slice length.


84-93: LGTM! Large dimension alignment validated.

Confirms that 128-dimensional vectors maintain 64-byte alignment with correct size calculation (512 bytes).

src/test_hnsw_v2_compact.zig (4)

5-43: LGTM! Compaction correctly removes deleted nodes.

Validates count/liveCount changes after delete and compact operations.


90-126: LGTM! Metadata preservation through compaction is validated.

Good test structure: insert with metadata, delete middle, compact, then verify remaining metadata via search results.


128-170: LGTM! Edge cases for empty and no-op compaction covered.

Validates both the all-deleted scenario (empty index, null entry_point) and the no-op case when nothing is deleted.


172-209: LGTM! Persistence after compaction is validated.

Good end-to-end test: create, insert, delete, compact, save, then load and verify state consistency.

src/graph_tape.zig (1)

51-66: LGTM! Utility methods are well-implemented.

getNodeBytes, size, and clear provide clean accessors for compaction workflows and tape management.

src/zvdb.zig (1)

4-33: Public zvdb surface + legacy namespace look coherent

The new module surface (HNSW factory, DistanceMetric alias, SearchResult, and legacy re-exports) is clean and keeps migration paths explicit. No issues from a correctness/API-shape perspective.

benchmarks/shared_benchmarks.zig (1)

58-99: Single-threaded HNSWv2 benchmark wiring looks correct

The switch to HNSW(f32, .squared_euclidean, void) with a clear single-threaded insertion loop and proper allocation/free of points aligns with the new API and keeps the benchmarks honest about num_threads = 1.

src/hnsw_v2.zig (13)

1-16: LGTM!

Clean import structure and well-designed compile-time parameterization. The use of comptime for has_metadata and distFn resolution is idiomatic Zig.


17-32: LGTM!

Config defaults are sensible for HNSW (m=16, m_base=32, ef_construction=200). Node struct is compact with appropriate field sizes.


48-58: LGTM!

Good approach to seed the PRNG by XORing the user-provided seed with a timestamp, ensuring different seeds across runs while allowing reproducibility when needed.


60-69: LGTM!

Proper cleanup with defensive check on string_table.data.len before freeing.


83-90: LGTM!

Clever use of @ctz for geometric level distribution. The x == 0 edge case is correctly handled to avoid returning 64.


92-116: LGTM!

Good defensive bound check in getNeighborsSlice for level access. The stateless distance wrapper is clean.


118-143: LGTM!

Correct pruning logic with proper temp allocation cleanup. Insertion sort is appropriate given the small neighbor list sizes (bounded by m/m_base).


145-156: LGTM!

Correct bidirectional connection logic with proper capacity checks. The self-connection guard at line 146 is a good defensive measure.


277-317: LGTM!

Clean adapter pattern with SearchContext for decoupling the search implementation. The delegation is straightforward and correct.


353-382: LGTM!

Correct greedy descent implementation for cluster identification. Good approach for improving cache locality during compaction.


544-606: LGTM!

Comprehensive save implementation with proper handling of empty cases and conditional defers for cleanup. Header construction is complete with all necessary fields.


682-686: LGTM!

Clean handling of the sentinel value for null entry point using std.math.maxInt(u64).


627-641: No memory leak. VectorStore.init with capacity 0 returns an empty slice (&[_]u8{}) without any heap allocation, so directly assigning to self.vectors.data at line 637 is safe.

src/search2.zig Outdated
Comment on lines 30 to 37
if (entry_point == null or k == 0 or node_count == 0) return &.{};

var visited = try std.DynamicBitSet.initEmpty(allocator, node_count);
defer visited.deinit();

var ep = entry_point.?;
var ep_dist = ctx.dist(query, ctx.getPoint(ep));

⚠️ Potential issue | 🔴 Critical

Fix early-return allocation and index types for visited

Two hard issues here:

  1. Allocator contract violation on early return

return &.{} returns a pointer to a static empty slice, but all current callers (e.g. tests) unconditionally allocator.free(res). That is not a slice allocated from allocator and can break allocator invariants.

Make early exits allocate a zero-length slice via the same allocator so ownership is consistent:

 pub fn topK2(
@@
-    ctx: Context,
+    ctx: Context,
 ) ![]SearchResult(T) {
-    if (entry_point == null or k == 0 or node_count == 0) return &.{};
+    if (entry_point == null or k == 0 or node_count == 0) {
+        // Always return an allocator-owned slice so callers can unconditionally free.
+        return try allocator.alloc(SearchResult(T), 0);
+    }
@@
-    visited.set(ep);
+    visited.set(@intCast(ep));
@@
-            if (visited.isSet(nb)) continue;
-            visited.set(nb);
+            if (visited.isSet(@intCast(nb))) continue;
+            visited.set(@intCast(nb));

This restores allocator ownership semantics and makes the u32-to-usize index conversions for DynamicBitSet explicit at the call sites.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In src/search2.zig around lines 30-37, the early return uses a static empty
slice (&.{}) which violates the allocator ownership contract because callers
always call allocator.free on the result, and the DynamicBitSet init uses a
mismatched index type (node_count is not usize). Replace the static return by
allocating a zero-length slice from the same allocator and returning that slice
(so callers can free it safely), and ensure the value passed to
std.DynamicBitSet.initEmpty is a usize (either cast node_count to usize or
change its declaration) so index types match; update any subsequent uses or
casts accordingly.

Allison Durham added 2 commits December 15, 2025 21:04
- Add ef_search==0 guard and cast u32 IDs to usize for DynamicBitSet
- Fix MockGraph.getPoint to return proper slice
- Add TapeOverflow check in GraphTape.allocateNode
- Add LengthOverflow checks in persistence_v2.loadSectionsV2
- Implement vector slot reuse via free_vector_slots
- Add errdefer for new_meta in compact()
- Add StringTable.deinit() method and use it in HNSWv2.deinit
- Add HNSWv2.entryPoint() getter
- Rename isEnumWithExplicitTag to isIntegerBackedEnum
- Remove obsolete enum_no_explicit_tag.zig compile-fail test
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (9)
.thoughts/config.json (1)

13-20: Same privacy concern applies to context_mounts.

The context_mounts configuration also references the private personal_thoughts repository with auto-sync enabled. This compounds the privacy/compliance risk flagged in the thoughts_mount section.

src/search2.zig (2)

79-101: ef_search == 0 case needs handling to prevent trap on empty heap

This issue was flagged in previous reviews. When ef_search == 0, the early exit on line 30 now catches it, but if that exit is removed or changed, lines 81 and 91 will trap on results.peek().? because the heap is empty.

Since line 30 currently returns early when ef_search == 0, this is currently safe. However, if the early-exit logic changes, consider adding an explicit assertion at the start of the function:

std.debug.assert(ef_search > 0);

30-30: Allocator ownership violation on early return

This issue was flagged in previous reviews. Line 30 returns &.{}, a static empty slice, but callers expect an allocator-owned slice they can unconditionally free. This violates the allocator ownership contract and can break allocator invariants.

Apply this fix:

-    if (entry_point == null or k == 0 or node_count == 0 or ef_search == 0) return &.{};
+    if (entry_point == null or k == 0 or node_count == 0 or ef_search == 0) {
+        return try allocator.alloc(SearchResult(T), 0);
+    }
src/graph_tape.zig (1)

23-32: LGTM: Overflow checks properly address previous concerns

The code now includes proper bounds checking before casting to u32, addressing the integer overflow concern raised in earlier reviews. Lines 27-28 correctly guard against both current length overflow and addition overflow before performing the cast.

src/metadata.zig (2)

24-27: LGTM: deinit method properly added

The StringTable now has a proper deinit method that frees allocated memory, addressing the memory cleanup concern from earlier reviews.


41-48: Function name isIntegerBackedEnum may be misleading

This concern was raised in earlier reviews. The function name suggests it distinguishes between explicitly-tagged and inferred-tag enums, but all Zig enums have integer backing types. The current implementation returns true for any enum.

If the intent is to document that enums are supported, the current implementation is correct. If the intent was to reject inferred-tag enums specifically, additional compile-time introspection would be needed.

src/persistence_v2.zig (1)

115-144: LGTM: Proper overflow guards before casting u64 to usize

The code now includes explicit checks for header.*_len > max_usize before casting to usize, returning error.LengthOverflow when necessary. This properly addresses the concern from earlier reviews about unsafe casts on 32-bit platforms.

src/hnsw_v2.zig (2)

174-183: LGTM: Vector slot reuse properly implemented

The code now reuses vector slots when reuse_deleted_slots is enabled and free_vector_slots has available slots. This addresses the unbounded vector storage growth concern from earlier reviews. The free_vector_slots list (line 44) tracks freed slots, and delete() (line 340) populates it.


468-472: LGTM: errdefer properly added for new_meta

Line 472 now includes errdefer if (has_metadata) new_meta.deinit(self.allocator) to prevent memory leaks if subsequent operations fail. This addresses the concern from earlier reviews.

🧹 Nitpick comments (2)
src/test_persistence_v2.zig (1)

33-33: Consider using std.testing.tmpDir() for better portability

Hardcoded /tmp/ paths may not exist on all systems (e.g., Windows). While these tests will likely run on Unix-like systems, using Zig's std.testing.tmpDir() would improve cross-platform compatibility.

Also applies to: 68-68

src/test_hnsw_v2_persist.zig (1)

11-11: Consider using std.testing.tmpDir() for cross-platform compatibility

Hardcoded /tmp/ paths may fail on Windows and other non-Unix systems. Using std.testing.tmpDir() would be more portable.

Also applies to: 64-64, 94-94, 134-134
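
For both files, a sketch of the tmpDir-based pattern; the file name is a placeholder and the save/load calls are elided:

const std = @import("std");
const testing = std.testing;

test "persistence round-trip without hardcoded /tmp (sketch)" {
    var tmp = testing.tmpDir(.{});
    defer tmp.cleanup();

    // Build an absolute, platform-appropriate path inside the temp dir.
    const dir_path = try tmp.dir.realpathAlloc(testing.allocator, ".");
    defer testing.allocator.free(dir_path);
    const file_path = try std.fs.path.join(testing.allocator, &.{ dir_path, "index.zvdb" });
    defer testing.allocator.free(file_path);

    // index.save(file_path) and load(file_path) would replace the
    // hardcoded "/tmp/..." literals in the existing tests.
}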

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47a13e0 and f4c4e9f.

📒 Files selected for processing (9)
  • .thoughts/config.json (1 hunks)
  • src/graph_tape.zig (1 hunks)
  • src/hnsw_v2.zig (1 hunks)
  • src/metadata.zig (1 hunks)
  • src/persistence_v2.zig (1 hunks)
  • src/search2.zig (1 hunks)
  • src/test_hnsw_v2_persist.zig (1 hunks)
  • src/test_persistence_v2.zig (1 hunks)
  • src/test_search2.zig (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/test_search2.zig
🔇 Additional comments (3)
.thoughts/config.json (1)

21-85: Well-curated reference collection.

The references section provides a comprehensive, well-documented collection of relevant libraries for HNSW implementation, SIMD optimization, and concurrent data structures. The descriptions clearly explain each library's relevance to the project.

src/test_hnsw_v2_persist.zig (1)

147-147: LGTM: Correct use of public entryPoint() method

The code correctly uses the public entryPoint() method to access the entry point, avoiding the private field access issue that was flagged in earlier reviews.

src/metadata.zig (1)

16-20: Standard error propagation pattern - no leak on realloc failure

The past comment raised concerns about memory leaks if realloc fails. However, this follows Zig's standard error propagation pattern: when realloc returns an error, the original allocation remains valid and owned by the caller, who must handle cleanup via errdefer or similar mechanisms. This is correct Zig idiom.
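
A sketch of why that holds; on a failed realloc the original pointer remains valid, so a caller-side errdefer covers both outcomes:

const std = @import("std");

fn growBuffer(allocator: std.mem.Allocator) ![]u8 {
    var buf = try allocator.alloc(u8, 16);
    // If realloc fails, `buf` still refers to the original 16-byte
    // allocation and this errdefer frees it; if realloc succeeds and a
    // later step fails, it frees the resized buffer instead.
    errdefer allocator.free(buf);
    buf = try allocator.realloc(buf, 1024);
    return buf;
}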

Comment on lines +8 to +12
"thoughts_mount": {
"remote": "git@github.com:allisoneer/personal_thoughts.git",
"subpath": "projects/zvdb/thoughts",
"sync": "auto"
},
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if config.json is in gitignore and search for other private repo references

echo "=== Checking .gitignore for .thoughts/config.json ==="
if [ -f .gitignore ]; then
  rg -n '\.thoughts.*config\.json' .gitignore || echo "Not found in .gitignore"
else
  echo ".gitignore does not exist"
fi

echo -e "\n=== Searching for other private repository URLs ==="
rg -n 'git@github\.com:[^/]+/personal_' --type json

echo -e "\n=== Checking if .thoughts/config.json exists and its size ==="
if [ -f .thoughts/config.json ]; then
  echo ".thoughts/config.json exists"
  wc -l .thoughts/config.json
else
  echo ".thoughts/config.json does not exist"
fi

echo -e "\n=== Checking for any .gitignore patterns related to .thoughts ==="
if [ -f .gitignore ]; then
  echo "=== Content of .gitignore ==="
  cat .gitignore | grep -i thoughts || echo "No 'thoughts' patterns found in .gitignore"
fi

Repository: allisoneer/zvdb

Length of output: 519


Remove private repository URL from committed configuration.

The .thoughts/config.json file is tracked in the repository and contains an SSH URL to a private GitHub repository (git@github.com:allisoneer/personal_thoughts.git). This exposes private infrastructure details and should never be committed.

Take one of these approaches:

  • Move sensitive configuration to environment variables or a secrets manager
  • Add .thoughts/config.json to .gitignore and commit config.json.example with placeholder values instead
  • For infrastructure URLs, use SSH config (~/.ssh/config) with host aliases rather than embedding URLs in application configuration
🤖 Prompt for AI Agents
.thoughts/config.json around lines 8 to 12 contains a private SSH repo URL that
must not be committed; remove the sensitive URL and replace it with a
non-sensitive placeholder, then either (a) move the real repository URL into an
environment variable or secrets manager and update the code to read from that
secret, or (b) add .thoughts/config.json to .gitignore and commit a
config.json.example with a placeholder value; optionally document using an SSH
host alias in ~/.ssh/config for infra URLs instead of embedding the full repo
URL.

Comment on lines +91 to +97
fn randomLevel(self: *Self) u16 {
const max_level_limit: u16 = 31;
const x = self.prng.random().int(u64);
if (x == 0) return max_level_limit;
const lvl: u16 = @intCast(@ctz(x));
return if (lvl > max_level_limit) max_level_limit else lvl;
}
⚠️ Potential issue | 🟡 Minor

randomLevel() has a redundant special case for x == 0

When x == 0, the function returns max_level_limit (31) immediately. However, @ctz(0) is well-defined in Zig and returns the bit width (64 for u64), which the final clamp would map to 31 anyway. The branch therefore does not change the level distribution at all; it is simply redundant.

Consider removing the special case and letting @ctz handle zero naturally:

 fn randomLevel(self: *Self) u16 {
     const max_level_limit: u16 = 31;
     const x = self.prng.random().int(u64);
-    if (x == 0) return max_level_limit;
     const lvl: u16 = @intCast(@ctz(x));
     return if (lvl > max_level_limit) max_level_limit else lvl;
 }

The result is identical for every input, so removing the branch is a pure simplification.
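
A quick check of the equivalence, clamping the same way the function does:

const std = @import("std");

test "ctz of zero clamps to the same level as the special case" {
    const max_level_limit: u16 = 31;
    const x: u64 = 0;
    // @ctz(0) is the bit width (64), which the clamp maps to 31,
    // matching the value the removed branch returned.
    const lvl: u16 = @intCast(@ctz(x));
    const clamped = if (lvl > max_level_limit) max_level_limit else lvl;
    try std.testing.expectEqual(max_level_limit, clamped);
}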

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
fn randomLevel(self: *Self) u16 {
const max_level_limit: u16 = 31;
const x = self.prng.random().int(u64);
if (x == 0) return max_level_limit;
const lvl: u16 = @intCast(@ctz(x));
return if (lvl > max_level_limit) max_level_limit else lvl;
}
fn randomLevel(self: *Self) u16 {
const max_level_limit: u16 = 31;
const x = self.prng.random().int(u64);
const lvl: u16 = @intCast(@ctz(x));
return if (lvl > max_level_limit) max_level_limit else lvl;
}
🤖 Prompt for AI Agents
In src/hnsw_v2.zig around lines 91 to 97, the special-case branch for x == 0 is redundant: @ctz(@as(u64, 0)) is 64, which the existing clamp already maps to max_level_limit (31). Remove the branch and compute the level from @ctz(x) directly, cast the result to u16, then clamp it to max_level_limit (i.e., return @min(@as(u16, @intCast(@ctz(x))), max_level_limit)); this lets @ctz handle zero naturally and simplifies the function without changing behavior.
