Data Sharing Between Benchmarks
Some benchmarks reuse the datasets generated by another benchmark (for example, Primitives reuses TPC-H data). These benchmarks should:
- Override BaseBenchmark.get_data_source_benchmark() to return the canonical source benchmark identifier (e.g., "tpch").
- Resolve their default output_dir to the same benchmark_runs/datagen location used by the source benchmark. The helper get_benchmark_runs_datagen_path() mirrors the DirectoryManager layout and applies format_scale_factor() so that sf1, sf10, etc. are used consistently.
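The two requirements above can be sketched as follows. This is a minimal illustration rather than the project's code: BaseBenchmark, format_scale_factor() and get_benchmark_runs_datagen_path() are re-declared here as simplified stand-ins so the snippet runs on its own, their signatures are assumptions, the `<benchmark>_<sfN>` directory naming is assumed, and SharedDataBenchmark is a hypothetical class name.

```python
from pathlib import Path
from typing import Optional, Union


def format_scale_factor(scale_factor: float) -> str:
    """Stand-in helper: render whole-number scale factors as sf1, sf10, ..."""
    if float(scale_factor) != int(scale_factor):
        raise ValueError("this sketch only handles whole-number scale factors")
    return f"sf{int(scale_factor)}"


def get_benchmark_runs_datagen_path(benchmark: str, scale_factor: float) -> Path:
    """Stand-in helper: shared datagen directory under the default root."""
    # Assumed layout: benchmark_runs/datagen/<benchmark>_<sfN>
    name = f"{benchmark}_{format_scale_factor(scale_factor)}"
    return Path("benchmark_runs") / "datagen" / name


class BaseBenchmark:
    """Stand-in base class: by default a benchmark generates its own data."""

    def __init__(self, scale_factor: float = 1,
                 output_dir: Optional[Union[str, Path]] = None):
        self.scale_factor = scale_factor
        self.output_dir = Path(output_dir) if output_dir is not None else None

    def get_data_source_benchmark(self) -> Optional[str]:
        return None


class SharedDataBenchmark(BaseBenchmark):
    """Hypothetical benchmark that reuses TPC-H data."""

    def __init__(self, scale_factor: float = 1,
                 output_dir: Optional[Union[str, Path]] = None):
        super().__init__(scale_factor, output_dir)
        if self.output_dir is None:
            # Default to the directory the source benchmark writes to.
            self.output_dir = get_benchmark_runs_datagen_path("tpch", scale_factor)

    def get_data_source_benchmark(self) -> Optional[str]:
        # Canonical identifier of the benchmark whose data is reused.
        return "tpch"


bench = SharedDataBenchmark(scale_factor=10)
print(bench.get_data_source_benchmark())  # tpch
print(bench.output_dir.as_posix())        # benchmark_runs/datagen/tpch_sf10
```

An explicit output_dir passed by the caller is kept as-is in this sketch; only the default is redirected to the shared location.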
CLI vs. Programmatic Usage
- CLI / BenchmarkOrchestrator – the orchestrator constructs a DirectoryManager (respecting any custom base_dir supplied by callers). When a benchmark declares a data source, the orchestrator forces benchmark.output_dir to the shared path returned by DirectoryManager.get_datagen_path(alias, scale_factor). As a result, data sharing works even when the CLI is pointed at an alternate benchmark_runs root.
- Direct instantiation – classes such as ReadPrimitivesBenchmark fall back to get_benchmark_runs_datagen_path() when no output_dir is provided. This keeps programmatic usage aligned with the CLI defaults without requiring DirectoryManager.
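The orchestrator path can be pictured roughly as below. This is a sketch under stated assumptions, not the orchestrator's actual code: DirectoryManager is a simplified stand-in with an assumed constructor and directory layout, and PrimitivesLike and prepare_output_dir() are hypothetical names introduced only for illustration.

```python
from pathlib import Path


class DirectoryManager:
    """Stand-in for the real class; only the datagen lookup is sketched."""

    def __init__(self, base_dir="benchmark_runs"):
        self.base_dir = Path(base_dir)

    def get_datagen_path(self, benchmark_name, scale_factor):
        # Assumed layout: <base_dir>/datagen/<benchmark>_sf<N>
        return self.base_dir / "datagen" / f"{benchmark_name}_sf{int(scale_factor)}"


class PrimitivesLike:
    """Hypothetical benchmark that declares TPC-H as its data source."""

    def __init__(self, scale_factor=1, output_dir=None):
        self.scale_factor = scale_factor
        self.output_dir = output_dir

    def get_data_source_benchmark(self):
        return "tpch"


def prepare_output_dir(benchmark, directory_manager):
    """Force a data-sharing benchmark onto its source's datagen directory."""
    alias = benchmark.get_data_source_benchmark()
    if alias is not None:
        # The alias, not the benchmark's own name, selects the directory.
        benchmark.output_dir = directory_manager.get_datagen_path(
            alias, benchmark.scale_factor
        )
    return benchmark.output_dir


# A custom root is honoured because the path comes from the manager.
manager = DirectoryManager(base_dir="/mnt/scratch/benchmark_runs")
bench = PrimitivesLike(scale_factor=1, output_dir="overridden/by/orchestrator")
print(prepare_output_dir(bench, manager).as_posix())
```

Because the path is derived from the manager's base_dir, the sharing benchmark and its source end up in the same directory even under an alternate root.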
If a benchmark wishes to share data in a non-standard location, it should expose
configuration (or honour output_dir) and return None from
get_data_source_benchmark(), opting out of the automatic reuse logic.
Note that circular data-sharing relationships are not supported; aliases should form a simple chain that terminates at a benchmark which generates its own data.
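The source does not show how, or whether, the project enforces this constraint. One way to check a set of declarations is sketched below; resolve_data_source() and the benchmark names in the mapping are hypothetical.

```python
from typing import Dict, Optional


def resolve_data_source(name: str, sources: Dict[str, Optional[str]]) -> str:
    """Follow declared data sources to the benchmark that generates the data.

    `sources` maps each benchmark name to the value it returns from
    get_data_source_benchmark() (None when it generates its own data).
    """
    visited = []
    current = name
    while sources.get(current) is not None:
        if current in visited:
            chain = " -> ".join(visited + [current])
            raise ValueError(f"circular data-sharing chain: {chain}")
        visited.append(current)
        current = sources[current]
    return current


declared = {"tpch": None, "read_primitives": "tpch"}
print(resolve_data_source("read_primitives", declared))  # tpch
```

A benchmark that opts out by returning None resolves to itself, which matches the non-standard-location case described above.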
The lifecycle runner recognises these aliases during manifest validation. When a
benchmark declares a data source, _validate_manifest_if_present accepts
manifests whose benchmark field matches either the benchmark’s own name or the
alias. This allows --no-regenerate to work correctly for shared datasets while
preserving manifest integrity.
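The acceptance rule can be sketched as a small predicate. This is not the body of _validate_manifest_if_present, which the source does not show; manifest_matches_benchmark() is a hypothetical helper, and the manifest is assumed to be a mapping with a "benchmark" key.

```python
from typing import Any, Mapping, Optional


def manifest_matches_benchmark(
    manifest: Mapping[str, Any],
    benchmark_name: str,
    data_source: Optional[str] = None,
) -> bool:
    """Accept a manifest written by the benchmark itself or by its data source."""
    accepted = {benchmark_name}
    if data_source is not None:
        accepted.add(data_source)
    return manifest.get("benchmark") in accepted


tpch_manifest = {"benchmark": "tpch"}

# A sharing benchmark accepts the manifest written by its source.
print(manifest_matches_benchmark(tpch_manifest, "read_primitives", "tpch"))  # True
# Without a declared data source, the same manifest is rejected.
print(manifest_matches_benchmark(tpch_manifest, "read_primitives"))          # False
```

Restricting the accepted set to exactly these two names keeps unrelated manifests from being reused by accident.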
When adding a new data-sharing benchmark, ensure:
- Default paths align via benchbox.utils.path_utils.get_benchmark_runs_datagen_path.
- Shared manifests populate benchmark.tables during reuse.
- Lifecycle tests cover manifest reuse and --no-regenerate flows for the new alias.