Data Sharing Between Benchmarks

Some benchmarks reuse the datasets generated by another benchmark (for example, Primitives reuses TPC-H data). These benchmarks should:

  1. Override BaseBenchmark.get_data_source_benchmark() to return the canonical source benchmark identifier (e.g., "tpch").

  2. Resolve their default output_dir to the same benchmark_runs/datagen location used by the source benchmark. The helper get_benchmark_runs_datagen_path() mirrors the DirectoryManager layout and applies format_scale_factor() so that sf1, sf10, etc. are used consistently.
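
The two steps above can be sketched as follows. This is a minimal illustration rather than the project's implementation: BaseBenchmark, format_scale_factor and get_benchmark_runs_datagen_path are redefined here as simplified stand-ins, and their signatures, the benchmark_runs/datagen/<name>_<sfN> layout, and the PrimitivesLikeBenchmark class are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Union


def format_scale_factor(scale_factor: float) -> str:
    """Stand-in helper: render whole-number scale factors as sf1, sf10, ..."""
    if float(scale_factor).is_integer():
        return f"sf{int(scale_factor)}"
    # How fractional scale factors are rendered is an assumption of this sketch.
    return f"sf{scale_factor}"


def get_benchmark_runs_datagen_path(
    benchmark: str,
    scale_factor: float,
    base_dir: Union[str, Path] = "benchmark_runs",
) -> Path:
    """Stand-in helper: the directory layout below is assumed, not confirmed."""
    return Path(base_dir) / "datagen" / f"{benchmark}_{format_scale_factor(scale_factor)}"


class BaseBenchmark:
    """Simplified stand-in for the real base class."""

    def __init__(self, scale_factor: float = 1, output_dir: Optional[str] = None):
        self.scale_factor = scale_factor
        self.output_dir = Path(output_dir) if output_dir else None

    def get_data_source_benchmark(self) -> Optional[str]:
        # Default: the benchmark generates its own data.
        return None


class PrimitivesLikeBenchmark(BaseBenchmark):
    """Hypothetical benchmark that reuses TPC-H data."""

    def __init__(self, scale_factor: float = 1, output_dir: Optional[str] = None):
        super().__init__(scale_factor, output_dir)
        if self.output_dir is None:
            # Step 2: default to the source benchmark's datagen location.
            self.output_dir = get_benchmark_runs_datagen_path(
                self.get_data_source_benchmark(), scale_factor
            )

    def get_data_source_benchmark(self) -> Optional[str]:
        # Step 1: declare the canonical source benchmark.
        return "tpch"
```

An explicitly supplied output_dir still takes precedence in this sketch; only the default is redirected to the shared location.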

CLI vs. Programmatic Usage

  • CLI / BenchmarkOrchestrator – the orchestrator constructs a DirectoryManager (respecting any custom base_dir supplied by callers). When a benchmark declares a data source, the orchestrator forces benchmark.output_dir to the shared path returned by DirectoryManager.get_datagen_path(alias, scale_factor). As a result, data sharing works even when the CLI is pointed at an alternate benchmark_runs root.

  • Direct instantiation – classes such as ReadPrimitivesBenchmark fall back to get_benchmark_runs_datagen_path() when no output_dir is provided. This keeps programmatic usage aligned with the CLI defaults without requiring DirectoryManager.
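
The orchestrator behaviour described above might look roughly like the sketch below. SketchDirectoryManager and apply_shared_output_dir are hypothetical names introduced for illustration, and the path layout is assumed; only the get_datagen_path(alias, scale_factor) call shape comes from the description above.

```python
from pathlib import Path


class SketchDirectoryManager:
    """Hypothetical stand-in for DirectoryManager; the layout is assumed."""

    def __init__(self, base_dir="benchmark_runs"):
        # Callers may point the manager at an alternate benchmark_runs root.
        self.base_dir = Path(base_dir)

    def get_datagen_path(self, benchmark_name, scale_factor):
        return self.base_dir / "datagen" / f"{benchmark_name}_sf{scale_factor}"


def apply_shared_output_dir(benchmark, directory_manager):
    """Force output_dir to the shared path when a data source is declared."""
    alias = benchmark.get_data_source_benchmark()
    if alias is not None:
        benchmark.output_dir = directory_manager.get_datagen_path(
            alias, benchmark.scale_factor
        )
    return benchmark.output_dir


class DemoBenchmark:
    """Minimal object exposing the attributes the sketch relies on."""

    def __init__(self, alias, scale_factor=1, output_dir=None):
        self._alias = alias
        self.scale_factor = scale_factor
        self.output_dir = output_dir

    def get_data_source_benchmark(self):
        return self._alias


manager = SketchDirectoryManager(base_dir="alt_runs")
shared = apply_shared_output_dir(DemoBenchmark("tpch"), manager)
print(shared)
```

Because the shared path is derived from the manager's base_dir, a benchmark that returns None keeps whatever output_dir it already had, while one that declares an alias follows the custom root.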

If a benchmark needs to share data from a non-standard location, it should expose its own configuration (or honour a caller-supplied output_dir) and return None from get_data_source_benchmark(), which opts it out of the automatic reuse logic.

Note that circular data-sharing relationships are not supported; aliases should form a simple chain that terminates at a benchmark which generates its own data.
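
One way to check the no-cycles rule is to walk the alias chain and fail as soon as a name repeats. The function below is a hypothetical helper, not part of the codebase; it assumes the aliases are available as a plain mapping from benchmark name to declared data source (or None).

```python
from typing import Mapping, Optional


def resolve_data_source(name: str, aliases: Mapping[str, Optional[str]]) -> str:
    """Follow the alias chain from `name` to the benchmark that owns the data.

    Raises ValueError if the chain loops back on itself.
    """
    seen = [name]
    current = name
    while aliases.get(current) is not None:
        current = aliases[current]
        if current in seen:
            chain = " -> ".join(seen + [current])
            raise ValueError(f"circular data-sharing chain: {chain}")
        seen.append(current)
    return current


# Example mapping (illustrative): primitives reuses tpch, tpch owns its data.
aliases = {"primitives": "tpch", "tpch": None}
print(resolve_data_source("primitives", aliases))
```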

The lifecycle runner recognises these aliases during manifest validation. When a benchmark declares a data source, _validate_manifest_if_present accepts manifests whose benchmark field matches either the benchmark’s own name or the alias. This allows --no-regenerate to work correctly for shared datasets while preserving manifest integrity.
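
The acceptance rule can be expressed in a few lines. The function below is an illustrative sketch, not the body of _validate_manifest_if_present; it assumes the manifest is a dict-like object with a "benchmark" key and that names are compared exactly.

```python
from typing import Mapping, Optional


def manifest_benchmark_matches(
    manifest: Mapping[str, object],
    benchmark_name: str,
    data_source: Optional[str] = None,
) -> bool:
    """Return True if the manifest belongs to this benchmark or to its alias."""
    accepted = {benchmark_name}
    if data_source is not None:
        # A declared data source makes the source benchmark's manifest valid too.
        accepted.add(data_source)
    # Exact string comparison is an assumption of this sketch.
    return manifest.get("benchmark") in accepted


# A manifest written by the source benchmark is accepted for the reusing one.
print(manifest_benchmark_matches({"benchmark": "tpch"}, "primitives", "tpch"))
```

Without the alias, the same manifest would be rejected, which is why a benchmark that opts out by returning None cannot rely on another benchmark's manifest.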

When adding a new data-sharing benchmark, ensure:

  • The default output_dir is resolved with benchbox.utils.path_utils.get_benchmark_runs_datagen_path, so it matches the source benchmark's location.

  • benchmark.tables is populated from the shared manifest when data is reused.

  • Lifecycle tests cover manifest reuse and --no-regenerate flows for the new alias.