BigQuery Platform¶
Google BigQuery is a serverless, highly scalable data warehouse with built-in machine learning and real-time analytics. BenchBox provides full support for BigQuery benchmarking with optimized data loading via Cloud Storage.
Features¶
- Serverless - No infrastructure to manage
- Separation of storage and compute - Pay for what you use
- Standard SQL - ANSI SQL compliant
- Columnar storage - Capacitor format with automatic compression
- Slot-based pricing - On-demand or reserved capacity
Prerequisites¶
- Google Cloud project with BigQuery API enabled
- Service account or user credentials with BigQuery permissions
- Cloud Storage bucket for data staging (recommended for large datasets)
- Billing enabled on the project
Installation¶
# Install BigQuery client
pip install google-cloud-bigquery google-cloud-storage
# Or via BenchBox extras
pip install "benchbox[bigquery]"
Configuration¶
Service Account (Recommended)¶
# Set credentials path
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
export BIGQUERY_PROJECT=your-project-id
Application Default Credentials¶
# Login with gcloud
gcloud auth application-default login
# Set project
export BIGQUERY_PROJECT=your-project-id
Interactive Setup¶
benchbox platforms setup --platform bigquery
CLI Options¶
benchbox run --platform bigquery --benchmark tpch --scale 1.0 \
--platform-option project_id=your-project-id \
--platform-option dataset=benchbox \
--platform-option location=US
Platform Options¶
| Option | Default | Description |
|---|---|---|
| `project_id` | (env) | GCP project ID |
| `dataset` | (auto) | BigQuery dataset name |
| `location` | `US` | Data location (`US`, `EU`, etc.) |
| `credentials` | (env) | Service account JSON path |
| `staging_bucket` | (none) | GCS bucket for staging |
| `maximum_bytes_billed` | (none) | Query cost limit |
Authentication Methods¶
Service Account¶
# Create service account
gcloud iam service-accounts create benchbox-sa \
--display-name="BenchBox Service Account"
# Grant permissions
gcloud projects add-iam-policy-binding your-project \
--member="serviceAccount:benchbox-sa@your-project.iam.gserviceaccount.com" \
--role="roles/bigquery.admin"
# Create key
gcloud iam service-accounts keys create benchbox-key.json \
--iam-account=benchbox-sa@your-project.iam.gserviceaccount.com
# Set credentials
export GOOGLE_APPLICATION_CREDENTIALS=benchbox-key.json
User Credentials¶
# Interactive login
gcloud auth application-default login
# Set project
gcloud config set project your-project-id
Workload Identity (GKE)¶
# Use with Workload Identity Federation
benchbox run --platform bigquery --benchmark tpch \
--platform-option use_workload_identity=true
Usage Examples¶
Basic Benchmark¶
# TPC-H at scale factor 1
benchbox run --platform bigquery --benchmark tpch --scale 1.0 \
--platform-option project_id=your-project-id
With Cloud Storage Staging¶
# For large datasets, use GCS staging
benchbox run --platform bigquery --benchmark tpch --scale 100.0 \
--staging-root gs://your-bucket/benchbox/
With Cost Controls¶
# Limit query costs
benchbox run --platform bigquery --benchmark tpch --scale 10.0 \
--platform-option maximum_bytes_billed=10000000000 # 10GB limit
Python API¶
from benchbox import TPCH
from benchbox.platforms.bigquery import BigQueryAdapter
adapter = BigQueryAdapter(
    project_id="your-project-id",
    dataset="benchbox",
    location="US",
)
benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
Pricing¶
On-Demand Pricing¶
| Operation | Cost |
|---|---|
| Queries | $5 per TB scanned |
| Storage | $0.02 per GB/month |
| Streaming inserts | $0.01 per 200 MB |
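Putting the three rates together gives a back-of-the-envelope monthly estimate (illustrative arithmetic only; check current GCP pricing before budgeting):

```python
def monthly_cost_usd(tb_scanned: float, gb_stored: float, mb_streamed: float) -> float:
    """Rough monthly bill from the on-demand rates above."""
    queries = tb_scanned * 5.00             # $5 per TB scanned
    storage = gb_stored * 0.02              # $0.02 per GB/month
    streaming = (mb_streamed / 200) * 0.01  # $0.01 per 200 MB
    return queries + storage + streaming
```

For example, a month of runs scanning 2 TB against 100 GB of stored TPC-H data comes to roughly $12.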
Slot Reservations¶
For predictable costs, use slot reservations:
# Use with reserved slots
benchbox run --platform bigquery --benchmark tpch \
--platform-option reservation_id=projects/proj/locations/US/reservations/benchbox
Performance Features¶
Partitioning¶
BenchBox applies partitioning with --tuning tuned:
-- Date partitioning
CREATE TABLE lineitem
PARTITION BY DATE(l_shipdate)
CLUSTER BY l_orderkey
AS SELECT * FROM staging.lineitem;
Clustering¶
Clustering improves query performance:
-- Clustering keys
ALTER TABLE lineitem
CLUSTER BY l_shipdate, l_orderkey;
Query Caching¶
BenchBox disables caching for accurate benchmarks:
from google.cloud import bigquery

job_config = bigquery.QueryJobConfig(
    use_query_cache=False,  # always execute; never serve cached results
    use_legacy_sql=False,   # standard SQL
)
Data Loading¶
Direct Load (Small Datasets)¶
For datasets under 10 GB, BenchBox loads data directly through the BigQuery API:
# Automatic for small scale factors
benchbox run --platform bigquery --benchmark tpch --scale 0.1
Cloud Storage (Large Datasets)¶
For large datasets, stage in GCS first:
# Configure GCS staging
benchbox run --platform bigquery --benchmark tpch --scale 100.0 \
--staging-root gs://your-bucket/benchbox/
Load Job Configuration¶
# Customize load behavior
benchbox run --platform bigquery --benchmark tpch \
--platform-option write_disposition=WRITE_TRUNCATE \
--platform-option create_disposition=CREATE_IF_NEEDED
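These dispositions map onto the client library's `LoadJobConfig`. A minimal sketch (assumes `google-cloud-bigquery` is installed; the Parquet source format is an illustrative choice, not what BenchBox necessarily uses):

```python
def truncate_load_config():
    """LoadJobConfig mirroring the CLI options above."""
    from google.cloud import bigquery  # deferred import: needs google-cloud-bigquery

    return bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
        create_disposition=bigquery.CreateDisposition.CREATE_IF_NEEDED,
        source_format=bigquery.SourceFormat.PARQUET,  # illustrative
    )
```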
Cost Optimization¶
Query Cost Limits¶
Prevent runaway queries:
# Set per-query limit
benchbox run --platform bigquery --benchmark tpch \
--platform-option maximum_bytes_billed=1000000000 # 1GB
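In the Python client, the same guard is the `maximum_bytes_billed` field of `QueryJobConfig`: a query that would scan more than the cap fails instead of billing. A sketch (assumes `google-cloud-bigquery`; `gb_to_bytes` is a hypothetical helper added for readability):

```python
def gb_to_bytes(gb: float) -> int:
    """Decimal gigabytes to bytes; BigQuery limits are expressed in bytes."""
    return int(gb * 10**9)

def capped_query_config(max_gb: float):
    """QueryJobConfig that rejects queries scanning more than max_gb."""
    from google.cloud import bigquery  # deferred import: needs google-cloud-bigquery

    return bigquery.QueryJobConfig(maximum_bytes_billed=gb_to_bytes(max_gb))
```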
Dry Run Estimates¶
Preview query costs:
# Dry run mode
benchbox run --platform bigquery --benchmark tpch --dry-run ./estimate
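Dry runs correspond to the client's `dry_run` job setting, which returns the bytes a query would scan without executing it. A sketch, assuming `google-cloud-bigquery` and application-default credentials; `scan_cost_usd` applies the on-demand $5/TB rate:

```python
def dry_run_bytes(sql: str, project: str) -> int:
    """Ask BigQuery how many bytes a query would scan, without running it."""
    from google.cloud import bigquery  # deferred import: needs google-cloud-bigquery

    client = bigquery.Client(project=project)
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=cfg).total_bytes_processed

def scan_cost_usd(num_bytes: int, usd_per_tb: float = 5.0) -> float:
    """On-demand cost of scanning num_bytes at the listed $/TB rate."""
    return num_bytes / 10**12 * usd_per_tb
```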
Dataset Expiration¶
Auto-delete test datasets:
# Set dataset TTL
benchbox run --platform bigquery --benchmark tpch \
--platform-option dataset_expires_days=7
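Dataset TTLs correspond to BigQuery's default table expiration, which the API takes in milliseconds. A sketch with the Python client (assumes `google-cloud-bigquery`; `set_dataset_ttl` is a hypothetical helper name):

```python
def expiration_ms(days: int) -> int:
    """Days to the millisecond value BigQuery expects."""
    return days * 24 * 60 * 60 * 1000

def set_dataset_ttl(project: str, dataset_id: str, days: int) -> None:
    """Make new tables in the dataset auto-delete after `days` days."""
    from google.cloud import bigquery  # deferred import: needs google-cloud-bigquery

    client = bigquery.Client(project=project)
    dataset = client.get_dataset(f"{project}.{dataset_id}")
    dataset.default_table_expiration_ms = expiration_ms(days)
    client.update_dataset(dataset, ["default_table_expiration_ms"])
```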
Troubleshooting¶
Authentication Failed¶
# Verify credentials
gcloud auth application-default print-access-token
# Check service account permissions
gcloud projects get-iam-policy your-project \
--filter="bindings.members:benchbox-sa"
Project Not Found¶
# Verify project ID
gcloud projects list
# Set default project
gcloud config set project your-project-id
Quota Exceeded¶
# Check quotas
gcloud compute project-info describe --project your-project-id
# Request quota increase via Cloud Console
# BigQuery > Quotas > Request Increase
Dataset Location Mismatch¶
# Datasets must be in same location as jobs
# Specify location explicitly
benchbox run --platform bigquery --benchmark tpch \
--platform-option location=us-east1
Permission Denied¶
-- Required permissions
-- roles/bigquery.admin (full access)
-- Or specific roles:
-- - roles/bigquery.dataEditor
-- - roles/bigquery.jobUser
-- - roles/storage.objectAdmin (for GCS staging)