datagen - Data Generation
Generate benchmark data files without running queries. This is a convenience wrapper for benchbox run --phases generate.
Basic Syntax
benchbox datagen --benchmark <name> --scale <sf> [OPTIONS]
Options
Required:
--benchmark TEXT: Benchmark name (e.g., tpch, tpcds, clickbench)
--scale FLOAT: Scale factor for data generation
Optional:
--output PATH: Output directory for generated data
--format [parquet|csv|json]: Not yet implemented. This option is accepted but has no effect. Data is generated as pipe-delimited .tbl flat files (compressed to .tbl.zst when compression is enabled, which is the default). To convert generated data to parquet or other formats, use benchbox run with --table-format.
--seed INT: Random seed for reproducible data generation
--verbose, -v: Enable verbose logging
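Because the output is pipe-delimited text, optionally zstd-compressed, a generated table can be inspected with standard tools. The helper below is a minimal sketch: `head_tbl` is a hypothetical name chosen for this example, not a benchbox command, and it assumes the `zstd` command-line tool is installed for compressed files.

```shell
#!/bin/sh
# Sketch: print the first rows of a generated table file.
# head_tbl is a hypothetical helper, not part of benchbox.

head_tbl() {
    # $1 = path to a .tbl or .tbl.zst file
    # $2 = number of rows to print (defaults to 5)
    case "$1" in
        *.zst) zstd -dc "$1" | head -n "${2:-5}" ;;  # decompress to stdout first
        *)     head -n "${2:-5}" "$1" ;;             # plain text, read directly
    esac
}
```

For example, `head_tbl ./data/tpch_0.1/lineitem.tbl.zst 3` would print three rows, assuming a file with that name exists in the output directory; the exact file layout is not specified here.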
Usage Examples
# Generate TPC-H data at scale factor 0.1
benchbox datagen --benchmark tpch --scale 0.1 --output ./data/tpch_0.1
# Generate TPC-DS data with specific seed
benchbox datagen --benchmark tpcds --scale 1 --seed 42 --output ./data/tpcds_1
# Generate ClickBench data
benchbox datagen --benchmark clickbench --scale 1 --output ./data/clickbench
# Generate with verbose logging
benchbox datagen --benchmark tpch --scale 0.01 --output ./data --verbose
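When several scale factors are needed, the single commands above can be wrapped in a loop. This is a sketch only: the `gen` helper, the `DRY_RUN` variable, and the `./data/<benchmark>_<sf>` directory naming are illustrative assumptions, not benchbox features. Only the documented flags are used. By default the script prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch: generate data for one benchmark at several scale factors.
# gen and DRY_RUN are hypothetical conveniences, not part of benchbox.

gen() {
    # $1 = benchmark name, $2 = scale factor
    # Rebuild the positional parameters as the full command line.
    set -- benchbox datagen --benchmark "$1" --scale "$2" --output "./data/${1}_${2}"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"   # dry run: print the command only
    else
        "$@"                 # run the command
    fi
}

for sf in 0.01 0.1 1; do
    gen tpch "$sf"
done
```

Keeping each scale factor in its own directory avoids mixing files from different runs.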
Notes
Internally invokes benchbox run --phases generate with a dummy platform. No database connection is needed.
Generated data can be reused across multiple benchmark runs by pointing --output to a shared location, or by using benchbox run --global-cache.
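One way to reuse a shared location from a script is to generate only when the directory holds no data yet. The sketch below makes several assumptions: `has_data` and `SHARED_DIR` are hypothetical names, the check simply looks for .tbl or .tbl.zst files, and whether benchbox itself skips regeneration for an already populated directory is not stated here.

```shell
#!/bin/sh
# Sketch: generate into a shared directory only if it has no table files yet.
# has_data and SHARED_DIR are hypothetical names for this example.

has_data() {
    # Succeeds if directory $1 contains at least one .tbl or .tbl.zst file.
    [ -d "$1" ] || return 1
    [ -n "$(find "$1" -type f \( -name '*.tbl' -o -name '*.tbl.zst' \) | head -n 1)" ]
}

SHARED_DIR="${SHARED_DIR:-./data/tpch_0.1}"

if has_data "$SHARED_DIR"; then
    echo "reusing existing data in $SHARED_DIR"
elif command -v benchbox >/dev/null 2>&1; then
    benchbox datagen --benchmark tpch --scale 0.1 --output "$SHARED_DIR"
else
    echo "benchbox not found on PATH" >&2
fi
```

The file-presence check is deliberately simple; it does not verify that a previous generation run completed or that the scale factor matches.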