Google Cloud Dataproc Serverless Platform¶
Dataproc Serverless is Google Cloud's fully managed Apache Spark service that eliminates cluster management entirely. You submit batch workloads, and Google Cloud provisions and releases the underlying infrastructure automatically, with sub-minute startup times.
Features¶
Zero Cluster Management - No clusters to create, configure, or maintain
Fast Startup - Sub-minute batch startup (vs minutes for clusters)
Auto-scaling - Resources scale automatically based on workload
Cost-effective - Pay only for actual compute time, no idle costs
GCS Integration - Native Google Cloud Storage support
Installation¶
# Install with Dataproc Serverless support
uv add benchbox --extra dataproc-serverless
# Dependencies installed: google-cloud-dataproc, google-cloud-storage
Prerequisites¶
GCP project with Dataproc Serverless API enabled
GCS bucket for data staging
Google Cloud authentication configured:
Interactive: gcloud auth application-default login
Service account: GOOGLE_APPLICATION_CREDENTIALS environment variable
Compute Engine: automatic metadata service
Configuration¶
Environment Variables¶
# Required
export GOOGLE_CLOUD_PROJECT=my-project
export GCS_STAGING_DIR=gs://my-bucket/benchbox
# Optional
export DATAPROC_REGION=us-central1
export DATAPROC_RUNTIME_VERSION=2.1
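The resolution of these variables can be sketched as follows. This is an illustrative helper, not BenchBox's actual internals: the function name `resolve_config` is hypothetical, but the variable names and defaults mirror the listing above.

```python
import os

def resolve_config(env=os.environ):
    """Resolve Dataproc Serverless settings from the environment,
    falling back to the documented defaults for optional variables."""
    missing = [v for v in ("GOOGLE_CLOUD_PROJECT", "GCS_STAGING_DIR") if v not in env]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {missing}")
    return {
        "project_id": env["GOOGLE_CLOUD_PROJECT"],
        "gcs_staging_dir": env["GCS_STAGING_DIR"],
        "region": env.get("DATAPROC_REGION", "us-central1"),
        "runtime_version": env.get("DATAPROC_RUNTIME_VERSION", "2.1"),
    }
```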
CLI Usage¶
# Basic usage
benchbox run --platform dataproc-serverless --benchmark tpch --scale 1.0 \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://my-bucket/benchbox
# With custom region
benchbox run --platform dataproc-serverless --benchmark tpch --scale 1.0 \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://my-bucket/benchbox \
--platform-option region=europe-west1
# With service account
benchbox run --platform dataproc-serverless --benchmark tpch --scale 1.0 \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://my-bucket/benchbox \
--platform-option service_account=my-sa@my-project.iam.gserviceaccount.com
# Dry-run to preview queries
benchbox run --platform dataproc-serverless --benchmark tpch --dry-run ./preview \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://my-bucket/benchbox
Platform Options¶
| Option | Default | Description |
|---|---|---|
| project_id | required | GCP project ID |
| gcs_staging_dir | required | GCS path for data staging (e.g., gs://bucket/path) |
| region | us-central1 | GCP region for batch execution |
| database | benchbox | Hive database name |
| runtime_version | 2.1 | Dataproc Serverless runtime version |
| service_account | - | Service account email for batch execution |
| network_uri | - | VPC network URI (optional) |
| subnetwork_uri | - | Subnetwork URI (optional) |
| timeout_minutes | 60 | Batch timeout in minutes |
Python API¶
from benchbox.platforms.gcp import DataprocServerlessAdapter
# Initialize with project and staging
adapter = DataprocServerlessAdapter(
project_id="my-project",
region="us-central1",
gcs_staging_dir="gs://my-bucket/benchbox",
)
# Verify connection
adapter.create_connection()
# Create schema
adapter.create_schema("tpch_benchmark")
# Load data to GCS and create tables
adapter.load_data(
tables=["lineitem", "orders", "customer"],
source_dir="/path/to/tpch/data",
)
# Execute query via Serverless batch
result = adapter.execute_query("SELECT COUNT(*) FROM lineitem")
# Clean up
adapter.close()
Execution Model¶
Dataproc Serverless executes benchmarks via the Batch Controller API:
Batch Submission - PySpark script submitted to Serverless
Resource Provisioning - GCP automatically provisions Spark resources
Execution - SQL query executed via Spark SQL
Result Storage - Results written to GCS
Cleanup - Resources automatically released
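The batch submission step above can be sketched as a request body. This is an assumed shape based on the snake_case field names of the `google-cloud-dataproc` Batch proto (`pyspark_batch`, `runtime_config`, `environment_config`); the GCS path and helper name are placeholders, and BenchBox constructs its own request internally.

```python
def build_batch_body(script_uri, runtime_version="2.1", service_account=None):
    """Build a Dataproc Serverless batch request body for a staged
    PySpark driver script (illustrative sketch, not BenchBox internals)."""
    body = {
        "pyspark_batch": {
            # The PySpark script staged to GCS that runs the benchmark query
            "main_python_file_uri": script_uri,
        },
        "runtime_config": {"version": runtime_version},
        "environment_config": {"execution_config": {}},
    }
    if service_account:
        body["environment_config"]["execution_config"]["service_account"] = service_account
    return body
```

A body like this would be passed to the Batch Controller's create-batch call, after which GCP provisions resources, runs the script, and releases everything on completion.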
Serverless vs Cluster-based Dataproc¶
| Aspect | Dataproc Serverless | Dataproc Clusters |
|---|---|---|
| Startup | Sub-minute | Minutes |
| Management | Zero | Requires configuration |
| Scaling | Automatic | Manual or auto-scale rules |
| Idle Costs | None | Cluster running costs |
| Use Case | Intermittent workloads | Continuous workloads |
| Customization | Runtime only | Full cluster control |
Runtime Versions¶
| Version | Spark | Python | Release |
|---|---|---|---|
| 2.2 | 3.5.x | 3.11 | Latest |
| 2.1 | 3.4.x | 3.10 | Stable |
| 2.0 | 3.3.x | 3.10 | Legacy |
Authentication¶
Dataproc Serverless uses Google Cloud Application Default Credentials (ADC):
Interactive Login (Development)¶
gcloud auth application-default login
Service Account (Automation)¶
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
# Or use service account impersonation
gcloud auth application-default login --impersonate-service-account=SA@PROJECT.iam.gserviceaccount.com
Compute Engine (GCE/GKE)¶
No configuration needed - automatically uses instance service account.
Cost Estimation¶
Dataproc Serverless uses consumption-based billing:
| Resource | Price |
|---|---|
| vCPU | ~$0.06/hour |
| Memory | ~$0.0065/GB-hour |

| Scale Factor | Data Size | Est. Runtime | Est. Cost* |
|---|---|---|---|
| 0.01 | ~10 MB | ~10 min | ~$0.05 |
| 1.0 | ~1 GB | ~45 min | ~$0.50 |
| 10.0 | ~10 GB | ~2 hours | ~$3.00 |
| 100.0 | ~100 GB | ~6 hours | ~$15.00 |
*Estimates vary based on query complexity and data distribution.
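The rates above can be combined into a back-of-envelope estimate. This is a rough model using the approximate per-hour prices from the first table; actual Dataproc Serverless billing is based on consumed compute units and may differ.

```python
# Approximate rates from the pricing table above (assumptions, not exact billing)
VCPU_PER_HOUR = 0.06     # ~$ per vCPU-hour
MEMORY_PER_GB_HOUR = 0.0065  # ~$ per GB-hour

def estimate_cost(vcpus, memory_gb, runtime_minutes):
    """Rough cost estimate: (vCPU rate + memory rate) scaled by runtime."""
    hours = runtime_minutes / 60
    return vcpus * VCPU_PER_HOUR * hours + memory_gb * MEMORY_PER_GB_HOUR * hours

# A small batch (4 vCPUs, 16 GB) running 45 minutes:
cost = estimate_cost(4, 16, 45)  # ≈ $0.26
```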
IAM Permissions¶
Required roles for the executing identity:
roles/dataproc.worker # Submit and manage batches
roles/storage.objectAdmin # Read/write GCS staging
Recommended project-level setup:
# Grant Dataproc permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:USER@DOMAIN.COM" \
--role="roles/dataproc.worker"
# Grant storage permissions
gcloud projects add-iam-policy-binding PROJECT_ID \
--member="user:USER@DOMAIN.COM" \
--role="roles/storage.objectAdmin"
VPC Configuration¶
For private networking, specify VPC and subnet:
benchbox run --platform dataproc-serverless --benchmark tpch \
--platform-option project_id=my-project \
--platform-option gcs_staging_dir=gs://my-bucket/benchbox \
--platform-option network_uri=projects/my-project/global/networks/my-vpc \
--platform-option subnetwork_uri=projects/my-project/regions/us-central1/subnetworks/my-subnet
Troubleshooting¶
Common Issues¶
Authentication fails:
Run gcloud auth application-default login
Check service account permissions
Verify project ID is correct
Batch fails to start:
Check Dataproc Serverless API is enabled
Verify IAM permissions
Review GCP quotas
Storage access denied:
Check roles/storage.objectAdmin permission
Verify the GCS bucket exists
Check that the bucket is in the same project, or that cross-project permissions are granted
Batch timeout:
Increase timeout_minutes for large scale factors
Check for data skew issues
Review batch logs in the GCP Console