Benchmark methodology for HoloVec.

This guide defines what we benchmark, why the suites are shaped the way they are, and which literature each suite is meant to reflect.

Benchmark Philosophy

HDC/VSA benchmarking has to be model-aware.

The comparison literature already highlights that useful evaluation covers at least:

  • bundle capacity
  • non-exact unbinding quality
  • the interaction of binding and bundling in query answering
  • application-level behavior rather than timing in isolation

That structure is explicit in Schlegel, Neubert, and Protzel's comparison study:

  • Schlegel et al. evaluate "(1) the capacity of bundles, (2) the approximation quality of non-exact unbinding operations, (3) the influence of combining binding and bundling operations on the query answering performance, and (4) the performance on two example applications" (Schlegel et al. 2022).
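As an illustration of the first of these axes, bundle capacity can be probed by bundling k items from a codebook and checking how many survive cleanup. The sketch below is a generic bipolar (MAP-style) example in NumPy; the dimension, codebook size, seed, and majority-vote bundling are illustrative choices, not HoloVec's implementation or its benchmark code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, k = 1000, 50, 10  # dimension, codebook size, items bundled

# Random bipolar codebook; a generic MAP-style stand-in, not HoloVec's API.
codebook = rng.choice([-1, 1], size=(n_items, d))

# Bundle k items by majority vote (sign of the elementwise sum).
bundle = np.sign(codebook[:k].sum(axis=0))
bundle[bundle == 0] = 1  # break ties deterministically

# Cleanup: score every codebook item and take the top k by dot product.
scores = codebook @ bundle
recovered = set(np.argsort(scores)[-k:].tolist())
accuracy = len(recovered & set(range(k))) / k
print(f"recovered {accuracy:.0%} of bundled items")
```

Sweeping k upward while holding d fixed traces out the capacity curve that this axis is about.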

The HDC/VSA survey literature also argues that different models and data transformations should be understood in terms of their algebraic properties rather than treated as interchangeable vectors.

That means:

  • exact-inverse, self-inverse, approximate-inverse, sparse, and non-commutative models should not be forced into one scoreboard
  • quality metrics are as important as timing
  • the benchmark suite must expose the workload assumptions directly

Literature Mapping

Each suite maps to a specific strand of the literature:

  • primitives: sanity-check core ops and record baseline timings. Anchor: survey-level cross-model grounding from Kleyko et al.
  • bundle-capacity: bundled-item recovery under cleanup. Anchor: Schlegel et al. 2022.
  • approximate-unbinding: sequential bind/unbind degradation on approximate models. Anchor: Schlegel et al. 2022.
  • cleanup-factorization: multi-factor recovery with cleanup dynamics. Anchor: Frady et al. (Resonator Networks) and follow-on factorization work.
  • order-sensitivity: non-commutativity and exact recovery for directional models. Anchor: Yeung et al. 2024 and matrix-binding literature such as Gallant and Okaywe (2013).
  • sparse-retrieval: sparse overlap and segment-pattern retrieval. Anchor: Rachkovskij and Kussul 2001.
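To make the cleanup-factorization task concrete: given a composite that binds one item from each of several codebooks, the goal is to recover all factors jointly, which is what Resonator Networks address. Below is a minimal two-factor sketch assuming bipolar MAP-style binding and a simplified resonator-style update; all names, sizes, and the seed are illustrative, not HoloVec's suite code.

```python
import numpy as np

rng = np.random.default_rng(6)
d, n = 2048, 10  # dimension, codebook size per factor

# Two bipolar codebooks; the composite binds one item from each.
A = rng.choice([-1, 1], size=(n, d))
B = rng.choice([-1, 1], size=(n, d))
composite = A[3] * B[8]  # elementwise product: self-inverse binding

def binarize(v):
    # Re-binarize with a deterministic tie-break.
    v = np.sign(v)
    v[v == 0] = 1
    return v

# Start each estimate as the superposition of its whole codebook, then
# iterate: unbind with the other estimate, project onto the codebook span.
a_hat = binarize(A.sum(axis=0))
b_hat = binarize(B.sum(axis=0))
for _ in range(10):
    a_hat = binarize(A.T @ (A @ (composite * b_hat)))
    b_hat = binarize(B.T @ (B @ (composite * a_hat)))

a_idx = int(np.argmax(A @ a_hat))
b_idx = int(np.argmax(B @ b_hat))
print(a_idx, b_idx)
```

The suite measures how this kind of cleanup dynamic scales as factors, codebook sizes, and noise grow.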

Model-Specific Expectations

Exact-inverse models

Examples: FHRR, GHRR

Expect:

  • very high bind/unbind recovery on clean pairs
  • strong compositional recovery on structured queries
  • sensitivity to the underlying structure, not just nearest-neighbor speed
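The exact-inverse property can be sketched with FHRR-style complex phasors: binding is elementwise multiplication and unbinding multiplies by the conjugate, so recovery is exact up to floating-point error. This is a generic NumPy sketch, not HoloVec's FHRR implementation; the dimension and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512

# FHRR-style hypervectors: unit-magnitude complex phasors.
a = np.exp(1j * rng.uniform(0, 2 * np.pi, d))
b = np.exp(1j * rng.uniform(0, 2 * np.pi, d))

bound = a * b                    # binding: elementwise complex multiplication
recovered = bound * np.conj(a)   # exact inverse: multiply by the conjugate

# Recovery is exact up to floating-point error.
similarity = np.real(np.vdot(recovered, b)) / d
print(round(similarity, 6))
```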

Self-inverse models

Examples: MAP, BSC, BSDC-SEG

Expect:

  • strong cleanup and factorization behavior on clean compositions
  • simple algebra that often makes them attractive for hardware or discrete pipelines
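The self-inverse algebra is easiest to see with BSC-style binary vectors, where XOR serves as both bind and unbind and bind(x, x) is the identity. A generic sketch, not HoloVec's BSC implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256

# BSC-style binary hypervectors; XOR binding is its own inverse.
a = rng.integers(0, 2, d)
b = rng.integers(0, 2, d)

bound = a ^ b
recovered = bound ^ a            # unbinding reuses the binding operation
exact = bool(np.array_equal(recovered, b))
identity = not np.any(a ^ a)     # bind(x, x) is the all-zero identity
print(exact, identity)
```

This one-operation algebra is part of why these models map well to hardware and discrete pipelines.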

Approximate-inverse models

Examples: HRR, VTB

Expect:

  • graceful degradation instead of perfect recovery
  • quality to depend more strongly on cleanup strategy and task formulation
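The "graceful degradation plus cleanup" pattern can be sketched with HRR: binding is circular convolution, unbinding is circular correlation, and the result is the target plus noise, so a cleanup step against a codebook is what makes recovery reliable. The dimensions, seed, and Gaussian codebook below are illustrative assumptions, not HoloVec's HRR code.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_items = 1024, 20

def bind(x, y):
    # HRR binding: circular convolution via FFT.
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(y)))

def unbind(bound, x):
    # Approximate inverse: circular correlation (convolve with the involution).
    x_inv = np.concatenate(([x[0]], x[:0:-1]))
    return bind(bound, x_inv)

# Gaussian codebook scaled so each vector has roughly unit norm.
codebook = rng.normal(0.0, 1.0 / np.sqrt(d), size=(n_items, d))
a, b = codebook[0], codebook[1]

noisy = unbind(bind(a, b), a)
sim = float(noisy @ b / (np.linalg.norm(noisy) * np.linalg.norm(b)))
best = int(np.argmax(codebook @ noisy))  # cleanup still finds b
print(f"similarity {sim:.2f}, cleanup -> item {best}")
```

The similarity is well below 1, yet nearest-neighbor cleanup still identifies the correct item, which is exactly the regime the approximate-unbinding suite measures.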

Sparse models

Examples: BSDC, BSDC-SEG

Expect:

  • retrieval behavior to depend on overlap or segment structure, not just cosine-like scoring
  • different capacity and error regimes from dense continuous models
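For sparse models, retrieval scores are overlaps of active bits rather than cosine similarity, so degradation means losing active positions rather than adding Gaussian noise. A BSDC-style sketch under assumed sizes and seed, not HoloVec's sparse model:

```python
import numpy as np

rng = np.random.default_rng(4)
d, active, n_items = 2048, 20, 30

# BSDC-style sparse binary codebook: each vector has `active` one-bits.
codebook = np.zeros((n_items, d), dtype=np.int64)
for row in codebook:
    row[rng.choice(d, size=active, replace=False)] = 1

# Degrade item 7 by dropping 5 of its active bits, then retrieve by overlap.
query = codebook[7].copy()
query[rng.choice(np.flatnonzero(query), size=5, replace=False)] = 0

overlaps = codebook @ query   # shared active bits, not cosine similarity
best = int(np.argmax(overlaps))
print(best, int(overlaps[best]))
```

Because random sparse vectors share almost no active bits, even a heavily degraded query still separates cleanly from the rest of the codebook.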

Non-commutative models

Examples: GHRR, VTB

Expect:

  • order-sensitive workloads to reveal their value
  • symmetric workloads to under-test them
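Order sensitivity is easiest to see with matrix binding in the spirit of Gallant and Okaywe (2013): roles are random orthogonal matrices, so the two binding orders produce different vectors, and exact recovery peels roles off in reverse order via the transpose. A sketch under assumed sizes and seed, not HoloVec's GHRR or VTB code:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 64

# Roles as random orthogonal matrices (QR of a Gaussian matrix).
role_a, _ = np.linalg.qr(rng.normal(size=(d, d)))
role_b, _ = np.linalg.qr(rng.normal(size=(d, d)))
x = rng.normal(size=d)

ab = role_a @ (role_b @ x)   # "b inside a" ...
ba = role_b @ (role_a @ x)   # ... differs from "a inside b"
order_sensitive = not np.allclose(ab, ba)

# Exact recovery: orthogonal matrices invert by transpose, applied in reverse.
exact = bool(np.allclose(role_b.T @ (role_a.T @ ab), x))
print(order_sensitive, exact)
```

A benchmark built only on commutative queries would never exercise the first property, which is why the order-sensitivity suite exists.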

Runner

The benchmark runner is invoked as:

python -m benchmarks.run --suite <suite> --model <model|all> --output <path>

Useful examples:

python -m benchmarks.run \
  --suite primitives \
  --model FHRR \
  --backend numpy \
  --smoke \
  --output artifacts/primitives-fhrr.json

python -m benchmarks.run \
  --suite order-sensitivity \
  --model GHRR \
  --format csv \
  --output artifacts/ghrr-order.csv

Supported suites:

  • primitives
  • bundle-capacity
  • approximate-unbinding
  • cleanup-factorization
  • order-sensitivity
  • sparse-retrieval

Output Policy

  • JSON is the default and is the preferred archival format.
  • CSV is supported for spreadsheet and docs workflows.
  • Outputs are written to local artifact paths; benchmark result blobs are not committed to the repository.

CI Policy

CI only smoke-tests the runner on tiny workloads. It does not enforce benchmark thresholds.

That is intentional. Before v1.0, we want reproducible methodology and correct model-aware task selection first. Hard regression thresholds can come later once we have stable published baselines.