GCP Dataproc Platform

Tags: intermediate, guide, dataproc, cloud-platform

GCP Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. BenchBox integrates with Dataproc to execute benchmarks on managed clusters, supporting both persistent and ephemeral cluster modes.

Features

  • Managed clusters - Automatic provisioning and monitoring

  • Flexible pricing - Per-second billing with preemptible VM support

  • GCS integration - Native Google Cloud Storage support

  • Auto-scaling - Dynamic cluster sizing based on workload

  • Hive Metastore - Managed metadata service support

Installation

# Install with Dataproc support
uv add benchbox --extra dataproc

# Dependencies installed: google-cloud-dataproc, google-cloud-storage
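
A quick, optional smoke test (plain Python, nothing BenchBox-specific) confirms that the extra pulled in both client libraries:

# Confirm the Dataproc extra installed its client libraries
from google.cloud import dataproc_v1   # cluster/job control
from google.cloud import storage       # GCS staging

print("google-cloud-dataproc and google-cloud-storage import cleanly")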

Prerequisites

  1. GCP Project with Dataproc API enabled

  2. GCS Bucket for data staging

  3. Service Account with the following roles:

    • roles/dataproc.editor - Create/manage clusters and jobs

    • roles/storage.objectAdmin - Read/write GCS objects

  4. Application Default Credentials configured
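
A minimal pre-flight sketch, assuming Application Default Credentials are already set up (for example via gcloud auth application-default login), uses the google-auth library that the Dataproc client depends on:

# Check that Application Default Credentials resolve, and report the project
# they are bound to, before creating any clusters.
import google.auth

credentials, project = google.auth.default()
print(f"ADC OK; default project: {project}")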

Configuration

Environment Variables

# Required
export DATAPROC_PROJECT_ID=my-project-id
export DATAPROC_GCS_STAGING_DIR=gs://your-bucket/benchbox/

# Optional
export DATAPROC_REGION=us-central1
export DATAPROC_CLUSTER_NAME=my-cluster
export DATAPROC_DATABASE=benchbox
export DATAPROC_MASTER_TYPE=n2-standard-4
export DATAPROC_WORKER_TYPE=n2-standard-4
export DATAPROC_NUM_WORKERS=2
export DATAPROC_USE_PREEMPTIBLE=false
export DATAPROC_EPHEMERAL=false

CLI Usage

# Basic usage with existing cluster
benchbox run --platform dataproc --benchmark tpch --scale 1.0 \
  --platform-option project_id=my-project \
  --platform-option cluster_name=my-cluster \
  --platform-option gcs_staging_dir=gs://bucket/benchbox/

# With custom machine types
benchbox run --platform dataproc --benchmark tpch --scale 10.0 \
  --platform-option project_id=my-project \
  --platform-option cluster_name=my-cluster \
  --platform-option gcs_staging_dir=gs://bucket/benchbox/ \
  --platform-option worker_machine_type=n2-highmem-8 \
  --platform-option num_workers=4

# With preemptible workers
benchbox run --platform dataproc --benchmark tpch --scale 10.0 \
  --platform-option project_id=my-project \
  --platform-option cluster_name=my-cluster \
  --platform-option gcs_staging_dir=gs://bucket/benchbox/ \
  --platform-option use_preemptible=true

# Dry-run to preview queries
benchbox run --platform dataproc --benchmark tpch --dry-run ./preview \
  --platform-option project_id=my-project \
  --platform-option gcs_staging_dir=gs://bucket/benchbox/

Platform Options

| Option | Default | Description |
| --- | --- | --- |
| project_id | required | GCP project ID |
| gcs_staging_dir | required | GCS path for data staging |
| region | us-central1 | GCP region |
| cluster_name | auto-generated | Dataproc cluster name |
| database | benchbox | Hive database name |
| master_machine_type | n2-standard-4 | Master VM type |
| worker_machine_type | n2-standard-4 | Worker VM type |
| num_workers | 2 | Number of worker nodes |
| use_preemptible | false | Use preemptible workers |
| ephemeral_cluster | false | Create/delete cluster per job |
| timeout_minutes | 60 | Job timeout in minutes |

Machine Types

| Type | vCPU | Memory | Cost/Hour | Best For |
| --- | --- | --- | --- | --- |
| n2-standard-4 | 4 | 16 GB | ~$0.20 | General workloads |
| n2-standard-8 | 8 | 32 GB | ~$0.39 | Larger datasets |
| n2-highmem-4 | 4 | 32 GB | ~$0.26 | Memory-intensive |
| n2-highmem-8 | 8 | 64 GB | ~$0.52 | Large joins/aggregations |
| c2-standard-4 | 4 | 16 GB | ~$0.21 | CPU-intensive |

Preemptible VMs: ~80% cheaper but can be interrupted

Python API

from benchbox.platforms.gcp import DataprocAdapter

# Initialize adapter with existing cluster
adapter = DataprocAdapter(
    project_id="my-project",
    region="us-central1",
    cluster_name="my-cluster",
    gcs_staging_dir="gs://my-bucket/benchbox/",
    database="tpch_benchmark",
    num_workers=4,
)

# Or with ephemeral cluster
adapter = DataprocAdapter(
    project_id="my-project",
    gcs_staging_dir="gs://my-bucket/benchbox/",
    create_ephemeral_cluster=True,
    use_preemptible_workers=True,
)

# Create database in Hive
adapter.create_schema("tpch_sf1")

# Load data to GCS and create Hive tables
adapter.load_data(
    tables=["lineitem", "orders", "customer"],
    source_dir="/path/to/tpch/data",
)

# Execute query (submitted as Dataproc job)
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")

# Clean up (deletes ephemeral cluster if configured)
adapter.close()

Cluster Modes

Persistent Cluster

Best for ongoing development and multiple benchmark runs:

adapter = DataprocAdapter(
    project_id="my-project",
    cluster_name="benchmark-cluster",  # Existing cluster
    gcs_staging_dir="gs://bucket/benchbox/",
)

  • Reuse cluster across jobs

  • Faster job submission (no cluster startup)

  • Pay for cluster uptime

Ephemeral Cluster

Best for one-off benchmarks and CI/CD:

adapter = DataprocAdapter(
    project_id="my-project",
    gcs_staging_dir="gs://bucket/benchbox/",
    create_ephemeral_cluster=True,
)

  • Cluster created when first job is submitted

  • Cluster deleted on adapter.close() (see the sketch after this list)

  • Pay only for actual usage
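
Since the cluster is only removed when adapter.close() runs, a try/finally guard keeps an interrupted benchmark from leaving a cluster behind. A minimal sketch using only the adapter API shown above:

from benchbox.platforms.gcp import DataprocAdapter

adapter = DataprocAdapter(
    project_id="my-project",
    gcs_staging_dir="gs://bucket/benchbox/",
    create_ephemeral_cluster=True,
)
try:
    adapter.create_schema("tpch_sf1")
    result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
finally:
    adapter.close()  # deletes the ephemeral cluster even if a query failed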

Spark Configuration

BenchBox automatically optimizes Spark configuration based on benchmark type and scale factor:

# Automatic configuration includes:
# - Adaptive Query Execution (AQE) settings
# - Shuffle partition tuning
# - Memory allocation
# - Join optimization

# For TPC-H at SF=10 with 4 workers:
# spark.sql.shuffle.partitions = 200
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.skewJoin.enabled = true
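
These properties are attached per job rather than baked into the cluster. For reference only (independent of BenchBox internals), submitting a Spark SQL job with explicit properties through the google-cloud-dataproc client looks roughly like this; the project, region, and cluster names are placeholders:

from google.cloud import dataproc_v1

# The regional endpoint must match the cluster's region.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": "my-cluster"},
    "spark_sql_job": {
        "query_list": {"queries": ["SELECT COUNT(*) FROM lineitem"]},
        "properties": {
            "spark.sql.adaptive.enabled": "true",
            "spark.sql.adaptive.skewJoin.enabled": "true",
            "spark.sql.shuffle.partitions": "200",
        },
    },
}

operation = client.submit_job_as_operation(
    request={"project_id": "my-project", "region": "us-central1", "job": job}
)
operation.result()  # blocks until the job finishes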

Cost Estimation

| Scale Factor | Data Size | Workers | Est. Runtime | Est. Cost |
| --- | --- | --- | --- | --- |
| 0.01 | ~10 MB | 2x n2-std-4 | ~30 min | ~$0.30 |
| 1.0 | ~1 GB | 2x n2-std-4 | ~2 hours | ~$1.20 |
| 10.0 | ~10 GB | 4x n2-std-8 | ~4 hours | ~$6.24 |
| 100.0 | ~100 GB | 8x n2-std-8 | ~8 hours | ~$24.96 |

With preemptible workers: reduce worker costs by ~80%
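
The estimates above work out to roughly worker count x hourly rate x runtime. A back-of-envelope sketch (estimate_worker_cost is a hypothetical helper, not part of BenchBox; it ignores the master node, disks, and Dataproc's per-vCPU service fee):

# Rough worker cost only; real billing also covers the master node,
# persistent disks, and the Dataproc service fee.
def estimate_worker_cost(workers, rate_per_hour, hours, preemptible=False):
    rate = rate_per_hour * (0.2 if preemptible else 1.0)  # ~80% preemptible discount
    return workers * rate * hours

print(estimate_worker_cost(4, 0.39, 4))                    # ~6.24 (SF=10 row above)
print(estimate_worker_cost(4, 0.39, 4, preemptible=True))  # ~1.25 with preemptible workers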

IAM Configuration

Minimum required IAM roles:

# For the service account running BenchBox
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/dataproc.editor"

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/storage.objectAdmin"

# For Dataproc worker service account
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
  --role="roles/dataproc.worker"

Troubleshooting

Common Issues

Cluster creation fails:

  • Verify Dataproc API is enabled

  • Check service account permissions

  • Ensure region has available quota

Job fails to start:

  • Check cluster is in RUNNING state

  • Verify GCS bucket is accessible

  • Review job logs in Dataproc console

Slow performance:

  • Increase number of workers

  • Use larger machine types

  • Enable preemptible secondary workers

  • Check for data skew in queries

GCS access denied:

  • Verify service account has Storage Object Admin role

  • Check bucket and object permissions

  • Ensure bucket is in same project or cross-project access is configured

Comparison with Other Platforms

| Aspect | Dataproc | AWS EMR | Azure HDInsight |
| --- | --- | --- | --- |
| Pricing | Per-second | Per-second | Per-minute |
| Preemptible | Preemptible VMs | Spot Instances | Low-priority VMs |
| Storage | GCS | S3/HDFS | ADLS/Blob |
| Metastore | Dataproc Metastore | AWS Glue Catalog | Hive Metastore |
| Scaling | Manual/Auto | Auto | Manual |