Onehouse Quanton Platform

Tags: intermediate, guide, quanton, cloud-platform, spark

Onehouse Quanton is a serverless, managed Spark compute runtime that delivers 2-3x better price-performance than AWS EMR and Databricks, with native support for the Apache Hudi, Apache Iceberg, and Delta Lake table formats.

Features

  • Multi-format native - Hudi, Iceberg, and Delta Lake support

  • Serverless Spark - No cluster management overhead

  • Apache XTable - Cross-format metadata translation

  • Cost-efficient - No per-cluster fees, pay-as-you-go compute

  • Open standards - 100% Spark SQL compatible

Prerequisites

  • Onehouse account (onehouse.ai)

  • AWS account with S3 access for data staging

  • Onehouse API key
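
A quick pre-flight check before your first run (a minimal sketch in Python: the ONEHOUSE_API_KEY environment variable matches the one used in the Troubleshooting section, and the bucket name is a placeholder for your own staging bucket):

import os
import boto3

# Fail early if the Onehouse API key is not exported in this shell
assert os.environ.get("ONEHOUSE_API_KEY"), "export ONEHOUSE_API_KEY first"

# Confirm the local AWS credentials can reach the staging bucket
boto3.client("s3").head_bucket(Bucket="your-bucket")
print("Prerequisites look good")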

Installation

# Install required dependencies
pip install requests boto3

# Or via BenchBox extras
pip install "benchbox[quanton]"

Configuration

CLI Options

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option api_key=your-api-key \
  --platform-option s3_staging_dir=s3://your-bucket/benchbox-data \
  --platform-option table_format=iceberg

Platform Options

Option              Default          Description
------------------  ---------------  ---------------------------------------------------
api_key             (env)            Onehouse API key
s3_staging_dir      (required)       S3 path for data staging (e.g., s3://bucket/path)
region              us-east-1        AWS region for cluster deployment
database            benchbox         Database name for benchmarks
table_format        iceberg          Table format: iceberg, hudi, or delta
cluster_size        small            Cluster size: small, medium, large, xlarge
record_key          (auto)           Hudi record key field (required for hudi format)
precombine_field    (optional)       Hudi precombine field for ordering during updates
hudi_table_type     COPY_ON_WRITE    Hudi table type: COPY_ON_WRITE or MERGE_ON_READ

Table Format Selection

Quanton supports three open table formats, each with distinct strengths:

Format     Best For                                      Key Features
---------  --------------------------------------------  -------------------------------------------------------------
Iceberg    Enterprise data lakes, multi-engine access    Schema evolution, hidden partitioning, partition evolution
Hudi       Streaming workloads, record-level updates     Record-level ACID, incremental processing, efficient upserts
Delta      Databricks interop, time travel queries       ACID transactions, unified batch/streaming

Iceberg (Default)

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=iceberg

Hudi

# Hudi requires record_key for write operations
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=hudi \
  --platform-option record_key=l_orderkey \
  --platform-option precombine_field=l_shipdate

Delta Lake

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=delta

Usage Examples

Basic Benchmark

# TPC-H with Iceberg (default)
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://my-bucket/benchbox

Production Benchmark

# Larger scale with medium cluster
benchbox run --platform quanton --benchmark tpch --scale 10.0 \
  --platform-option s3_staging_dir=s3://my-bucket/benchbox \
  --platform-option cluster_size=medium \
  --platform-option table_format=iceberg

Cross-Format Comparison

# Compare performance across table formats
for format in iceberg hudi delta; do
  benchbox run --platform quanton --benchmark tpch --scale 1.0 \
    --platform-option s3_staging_dir=s3://my-bucket/benchbox \
    --platform-option table_format=$format \
    --output results/quanton_${format}.json
done

Python API

from benchbox import TPCH
from benchbox.platforms.onehouse import QuantonAdapter

adapter = QuantonAdapter(
    api_key="your-onehouse-api-key",
    s3_staging_dir="s3://my-bucket/benchbox-data",
    region="us-east-1",
    table_format="iceberg",
    cluster_size="small",
)

benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
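
The same adapter can drive the cross-format comparison shown in the CLI loop above. This sketch builds one adapter per format and reuses the generated data; passing record_key and precombine_field as adapter keyword arguments for the Hudi run mirrors the CLI options and is an assumption:

from benchbox import TPCH
from benchbox.platforms.onehouse import QuantonAdapter

benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()

results = {}
for table_format in ("iceberg", "hudi", "delta"):
    # Hudi needs a record key; the other formats do not (see Hudi-Specific Configuration)
    extra = (
        {"record_key": "l_orderkey", "precombine_field": "l_shipdate"}
        if table_format == "hudi"
        else {}
    )
    adapter = QuantonAdapter(
        api_key="your-onehouse-api-key",
        s3_staging_dir="s3://my-bucket/benchbox-data",
        table_format=table_format,
        **extra,
    )
    adapter.load_benchmark(benchmark)
    results[table_format] = adapter.run_benchmark(benchmark)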

Cluster Sizing

Size      Workers    Recommended Scale    Use Case
--------  ---------  -------------------  ----------------------
Small     1-2        SF 0.01-1.0          Development, testing
Medium    2-5        SF 1.0-10.0          Standard benchmarks
Large     5-10       SF 10.0-100.0        Production workloads
XLarge    10+        SF 100.0+            Large-scale analytics

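If you script sweeps over scale factors, a small helper can pick a size from the table above. This is only an illustration of the thresholds, not a BenchBox API:

def cluster_size_for(scale_factor: float) -> str:
    # Thresholds mirror the cluster sizing table above
    if scale_factor <= 1.0:
        return "small"
    if scale_factor <= 10.0:
        return "medium"
    if scale_factor <= 100.0:
        return "large"
    return "xlarge"

For example, cluster_size_for(10.0) returns "medium", matching the production benchmark example earlier in this guide.
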
Hudi-Specific Configuration

When using the Hudi table format, additional configuration is required:

Record Key

The record key uniquely identifies each record for ACID operations:

# TPC-H lineitem: use composite key
--platform-option record_key=l_orderkey,l_linenumber

# TPC-H orders: use primary key
--platform-option record_key=o_orderkey

Precombine Field

The precombine field determines which record is kept when multiple records share the same key during deduplication (the record with the larger precombine value wins):

# Use date field for ordering
--platform-option precombine_field=l_shipdate

Table Type

Choose COPY_ON_WRITE (faster reads) or MERGE_ON_READ (faster writes):

# Analytics workload (default)
--platform-option hudi_table_type=COPY_ON_WRITE

# Write-heavy workload
--platform-option hudi_table_type=MERGE_ON_READ
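
For the Python API, the same Hudi settings can presumably be passed as adapter keyword arguments; the parameter names below simply mirror the CLI platform options and are an assumption, as are the TPC-H lineitem field values:

from benchbox.platforms.onehouse import QuantonAdapter

# Keyword names mirror the CLI platform options (assumed, not confirmed)
adapter = QuantonAdapter(
    api_key="your-onehouse-api-key",
    s3_staging_dir="s3://my-bucket/benchbox-data",
    table_format="hudi",
    record_key="l_orderkey,l_linenumber",
    precombine_field="l_shipdate",
    hudi_table_type="MERGE_ON_READ",
)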

Cost Optimization

Serverless Benefits

  • No idle cluster costs

  • Pay only for compute time used

  • Automatic scaling based on workload

Cluster Auto-Stop

Clusters automatically terminate after an idle timeout (default: 15 minutes).

S3 Data Reuse

Data staged to S3 is reused across runs:

# First run uploads data
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data

# Subsequent runs skip upload
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data
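
If you need to force a fresh upload (for example after regenerating data at a different scale factor), clearing the staging prefix should make the next run re-stage the data. The bucket and prefix below are placeholders matching the example above:

import boto3

# Delete everything under the staging prefix so the next run re-uploads it
staging = boto3.resource("s3").Bucket("bucket")
staging.objects.filter(Prefix="data/").delete()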

Troubleshooting

Authentication Failed

# Verify API key is valid
curl -H "Authorization: Bearer $ONEHOUSE_API_KEY" \
  https://api.onehouse.ai/v1/health

S3 Access Denied

Ensure your AWS credentials have access to the S3 staging bucket:

# Test S3 access
aws s3 ls s3://your-bucket/benchbox-data/

Job Timeout

For large-scale benchmarks, increase the timeout:

benchbox run --platform quanton --benchmark tpch --scale 100.0 \
  --platform-option timeout_minutes=120

Hudi Write Failures

Ensure record_key is specified for Hudi format:

# Error: No record_key configured
# Fix: Add record_key parameter
--platform-option record_key=primary_key_column