GCP Dataproc Platform¶
GCP Dataproc is Google Cloud’s fully managed Apache Spark and Hadoop service. BenchBox integrates with Dataproc to execute benchmarks on managed clusters, supporting both persistent and ephemeral cluster modes.
Features¶
Managed clusters - Automatic provisioning and monitoring
Flexible pricing - Per-second billing with preemptible VM support
GCS integration - Native Google Cloud Storage support
Auto-scaling - Dynamic cluster sizing based on workload
Hive Metastore - Managed metadata service support
Installation¶
# Install with Dataproc support
uv add benchbox --extra dataproc
# Dependencies installed: google-cloud-dataproc, google-cloud-storage
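If you manage dependencies with pip rather than uv, the same extra can presumably be installed with the standard extras syntax (package and extra names taken from the uv command above):
# Equivalent install with pip
pip install "benchbox[dataproc]"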
Prerequisites¶
GCP Project with Dataproc API enabled
GCS Bucket for data staging
Service Account with the following roles:
roles/dataproc.editor - Create/manage clusters and jobs
roles/storage.objectAdmin - Read/write GCS objects
Application Default Credentials configured
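If any of these are not yet in place, they can be set up with gcloud; the project ID and bucket name below are placeholders:
# Enable the Dataproc API
gcloud services enable dataproc.googleapis.com --project=my-project-id
# Create a GCS staging bucket
gcloud storage buckets create gs://your-bucket --project=my-project-id --location=us-central1
# Configure Application Default Credentials
gcloud auth application-default login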
Configuration¶
Environment Variables¶
# Required
export DATAPROC_PROJECT_ID=my-project-id
export DATAPROC_GCS_STAGING_DIR=gs://your-bucket/benchbox/
# Optional
export DATAPROC_REGION=us-central1
export DATAPROC_CLUSTER_NAME=my-cluster
export DATAPROC_DATABASE=benchbox
export DATAPROC_MASTER_TYPE=n2-standard-4
export DATAPROC_WORKER_TYPE=n2-standard-4
export DATAPROC_NUM_WORKERS=2
export DATAPROC_USE_PREEMPTIBLE=false
export DATAPROC_EPHEMERAL=false
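With the required variables exported, the matching --platform-option flags can likely be omitted from the CLI invocation; this assumes the adapter falls back to the DATAPROC_* environment:
# Run with configuration taken from environment variables
benchbox run --platform dataproc --benchmark tpch --scale 1.0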
CLI Usage¶
# Basic usage with existing cluster
benchbox run --platform dataproc --benchmark tpch --scale 1.0 \
--platform-option project_id=my-project \
--platform-option cluster_name=my-cluster \
--platform-option gcs_staging_dir=gs://bucket/benchbox/
# With custom machine types
benchbox run --platform dataproc --benchmark tpch --scale 10.0 \
--platform-option project_id=my-project \
--platform-option cluster_name=my-cluster \
--platform-option gcs_staging_dir=gs://bucket/benchbox/ \
--platform-option worker_machine_type=n2-highmem-8 \
--platform-option num_workers=4
# With preemptible workers
benchbox run --platform dataproc --benchmark tpch --scale 10.0 \
--platform-option project_id=my-project \
--platform-option cluster_name=my-cluster \
--platform-option gcs_staging_dir=gs://bucket/benchbox/ \
--platform-option use_preemptible=true
# Dry-run to preview queries
benchbox run --platform dataproc --benchmark tpch --dry-run ./preview \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://bucket/benchbox/
Platform Options¶
| Option | Default | Description |
|---|---|---|
| project_id | required | GCP project ID |
| gcs_staging_dir | required | GCS path for data staging |
| region | us-central1 | GCP region |
| cluster_name | auto-generated | Dataproc cluster name |
| database | benchbox | Hive database name |
| master_machine_type | n2-standard-4 | Master VM type |
| worker_machine_type | n2-standard-4 | Worker VM type |
| num_workers | 2 | Number of worker nodes |
| use_preemptible | false | Use preemptible workers |
| create_ephemeral_cluster | false | Create/delete cluster per job |
| | 60 | Job timeout in minutes |
Machine Types¶
| Type | vCPU | Memory | Cost/Hour | Best For |
|---|---|---|---|---|
| n2-standard-4 | 4 | 16 GB | ~$0.20 | General workloads |
| n2-standard-8 | 8 | 32 GB | ~$0.39 | Larger datasets |
| n2-highmem-4 | 4 | 32 GB | ~$0.26 | Memory-intensive |
| n2-highmem-8 | 8 | 64 GB | ~$0.52 | Large joins/aggregations |
| c2-standard-4 | 4 | 16 GB | ~$0.21 | CPU-intensive |
Preemptible VMs: ~80% cheaper but can be interrupted
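Prices above are approximate and vary by region. Current specs for these machine types in a given zone can be checked with gcloud (the zone below is a placeholder):
# List available n2 machine types in a zone
gcloud compute machine-types list --filter="zone:us-central1-a AND name~n2-"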
Python API¶
from benchbox.platforms.gcp import DataprocAdapter
# Initialize adapter with existing cluster
adapter = DataprocAdapter(
project_id="my-project",
region="us-central1",
cluster_name="my-cluster",
gcs_staging_dir="gs://my-bucket/benchbox/",
database="tpch_benchmark",
num_workers=4,
)
# Or with ephemeral cluster
adapter = DataprocAdapter(
project_id="my-project",
gcs_staging_dir="gs://my-bucket/benchbox/",
create_ephemeral_cluster=True,
use_preemptible_workers=True,
)
# Create database in Hive
adapter.create_schema("tpch_sf1")
# Load data to GCS and create Hive tables
adapter.load_data(
tables=["lineitem", "orders", "customer"],
source_dir="/path/to/tpch/data",
)
# Execute query (submitted as Dataproc job)
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Clean up (deletes ephemeral cluster if configured)
adapter.close()
Cluster Modes¶
Persistent Cluster¶
Best for ongoing development and multiple benchmark runs:
adapter = DataprocAdapter(
project_id="my-project",
cluster_name="benchmark-cluster", # Existing cluster
gcs_staging_dir="gs://bucket/benchbox/",
)
Reuse cluster across jobs
Faster job submission (no cluster startup)
Pay for cluster uptime
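If no suitable cluster exists yet, one matching the defaults above can be created ahead of time with gcloud (names are placeholders):
# Create a small persistent cluster for repeated benchmark runs
gcloud dataproc clusters create benchmark-cluster \
    --region=us-central1 \
    --master-machine-type=n2-standard-4 \
    --worker-machine-type=n2-standard-4 \
    --num-workers=2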
Ephemeral Cluster¶
Best for one-off benchmarks and CI/CD:
adapter = DataprocAdapter(
project_id="my-project",
gcs_staging_dir="gs://bucket/benchbox/",
create_ephemeral_cluster=True,
)
Cluster created when first job is submitted
Cluster deleted on adapter.close()
Pay only for actual usage
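If a run aborts before adapter.close() is called, the ephemeral cluster may not be cleaned up automatically; leftover clusters can be found and removed manually (the cluster name below is a placeholder):
# Find and delete a leftover ephemeral cluster
gcloud dataproc clusters list --region=us-central1
gcloud dataproc clusters delete leftover-cluster-name --region=us-central1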
Spark Configuration¶
BenchBox automatically optimizes Spark configuration based on benchmark type and scale factor:
# Automatic configuration includes:
# - Adaptive Query Execution (AQE) settings
# - Shuffle partition tuning
# - Memory allocation
# - Join optimization
# For TPC-H at SF=10 with 4 workers:
# spark.sql.shuffle.partitions = 200
# spark.sql.adaptive.enabled = true
# spark.sql.adaptive.skewJoin.enabled = true
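These settings are applied when BenchBox submits the job. To reproduce or override them in an ad-hoc Dataproc Spark SQL job outside BenchBox, equivalent properties can be passed explicitly (cluster and table names are placeholders):
# Submit a Spark SQL job with explicit AQE and shuffle settings
gcloud dataproc jobs submit spark-sql \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties=spark.sql.shuffle.partitions=200,spark.sql.adaptive.enabled=true,spark.sql.adaptive.skewJoin.enabled=true \
    -e "SELECT COUNT(*) FROM tpch_benchmark.lineitem"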
Cost Estimation¶
| Scale Factor | Data Size | Workers | Est. Runtime | Est. Cost |
|---|---|---|---|---|
| 0.01 | ~10 MB | 2x n2-std-4 | ~30 min | ~$0.30 |
| 1.0 | ~1 GB | 2x n2-std-4 | ~2 hours | ~$1.20 |
| 10.0 | ~10 GB | 4x n2-std-8 | ~4 hours | ~$6.24 |
| 100.0 | ~100 GB | 8x n2-std-8 | ~8 hours | ~$24.96 |
With preemptible workers: reduce worker costs by ~80%
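As a rough sanity check, these figures track worker count × hourly rate × runtime: for SF=100, 8 workers × ~$0.39/hour × ~8 hours ≈ $24.96 before any preemptible discount.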
IAM Configuration¶
Minimum required IAM roles:
# For the service account running BenchBox
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:SA_EMAIL" \
--role="roles/dataproc.editor"
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:SA_EMAIL" \
--role="roles/storage.objectAdmin"
# For Dataproc worker service account
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="serviceAccount:PROJECT_NUMBER-compute@developer.gserviceaccount.com" \
--role="roles/dataproc.worker"
Troubleshooting¶
Common Issues¶
Cluster creation fails:
Verify Dataproc API is enabled
Check service account permissions
Ensure region has available quota
Job fails to start:
Check cluster is in RUNNING state
Verify GCS bucket is accessible
Review job logs in Dataproc console
Slow performance:
Increase number of workers
Use larger machine types
Enable preemptible secondary workers
Check for data skew in queries
GCS access denied:
Verify service account has Storage Object Admin role
Check bucket and object permissions
Ensure bucket is in same project or cross-project access is configured
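When working through the issues above from the command line, cluster state and job output can be inspected directly (names and IDs are placeholders):
# Check that the cluster is RUNNING
gcloud dataproc clusters describe my-cluster --region=us-central1 --format="value(status.state)"
# List recent jobs on the cluster and inspect one
gcloud dataproc jobs list --region=us-central1 --cluster=my-cluster
gcloud dataproc jobs describe JOB_ID --region=us-central1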
Comparison with Other Platforms¶
| Aspect | Dataproc | AWS EMR | Azure HDInsight |
|---|---|---|---|
| Pricing | Per-second | Per-second | Per-minute |
| Preemptible | Preemptible VMs | Spot Instances | Low-priority VMs |
| Storage | GCS | S3/HDFS | ADLS/Blob |
| Metastore | Dataproc Metastore | AWS Glue Catalog | Hive Metastore |
| Scaling | Manual/Auto | Auto | Manual |