Onehouse Quanton Platform

Tags: intermediate, guide, quanton, cloud-platform, spark

Onehouse Quanton is a serverless, managed Spark compute runtime that delivers 2-3x better price-performance than AWS EMR and Databricks, with native support for the Apache Hudi, Apache Iceberg, and Delta Lake table formats.

Features

  • Multi-format native - Hudi, Iceberg, and Delta Lake support

  • Serverless Spark - No cluster management overhead

  • Apache XTable - Cross-format metadata translation

  • Cost-efficient - No per-cluster fees, pay-as-you-go compute

  • Open standards - 100% Spark SQL compatible

Prerequisites

  • Onehouse account (onehouse.ai)

  • AWS account with S3 access for data staging

  • Onehouse API key
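
A quick pre-flight check before your first run (a minimal sketch in Python: the ONEHOUSE_API_KEY environment variable matches the one used in the Troubleshooting section, and the bucket name is a placeholder for your own staging bucket):

import os
import boto3

# Fail early if the Onehouse API key is not exported in this shell
assert os.environ.get("ONEHOUSE_API_KEY"), "export ONEHOUSE_API_KEY first"

# Confirm the local AWS credentials can reach the staging bucket
boto3.client("s3").head_bucket(Bucket="your-bucket")
print("Prerequisites look good")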

Installation

# Install required dependencies
pip install requests boto3

# Or via BenchBox extras
pip install "benchbox[quanton]"

Configuration

CLI Options

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option api_key=your-api-key \
  --platform-option s3_staging_dir=s3://your-bucket/benchbox-data \
  --platform-option table_format=iceberg

Platform Options

Option              Default          Description
------------------  ---------------  ---------------------------------------------------
api_key             (env)            Onehouse API key
s3_staging_dir      (required)       S3 path for data staging (e.g., s3://bucket/path)
region              us-east-1        AWS region for cluster deployment
database            benchbox         Database name for benchmarks
table_format        iceberg          Table format: iceberg, hudi, or delta
cluster_size        small            Cluster size: small, medium, large, xlarge
record_key          (auto)           Hudi record key field (required for hudi format)
precombine_field    (optional)       Hudi precombine field for ordering during updates
hudi_table_type     COPY_ON_WRITE    Hudi table type: COPY_ON_WRITE or MERGE_ON_READ

Table Format Selection

Quanton supports three open table formats, each with distinct strengths:

Format     Best For                                      Key Features
---------  --------------------------------------------  -------------------------------------------------------------
Iceberg    Enterprise data lakes, multi-engine access    Schema evolution, hidden partitioning, partition evolution
Hudi       Streaming workloads, record-level updates     Record-level ACID, incremental processing, efficient upserts
Delta      Databricks interop, time travel queries       ACID transactions, unified batch/streaming

Iceberg (Default)

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=iceberg

Hudi

# Hudi requires record_key for write operations
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=hudi \
  --platform-option record_key=l_orderkey \
  --platform-option precombine_field=l_shipdate

Delta Lake

benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data \
  --platform-option table_format=delta

Usage Examples

Basic Benchmark

# TPC-H with Iceberg (default)
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://my-bucket/benchbox

Production Benchmark

# Larger scale with medium cluster
benchbox run --platform quanton --benchmark tpch --scale 10.0 \
  --platform-option s3_staging_dir=s3://my-bucket/benchbox \
  --platform-option cluster_size=medium \
  --platform-option table_format=iceberg

Cross-Format Comparison

# Compare performance across table formats
for format in iceberg hudi delta; do
  benchbox run --platform quanton --benchmark tpch --scale 1.0 \
    --platform-option s3_staging_dir=s3://my-bucket/benchbox \
    --platform-option table_format=$format \
    --output results/quanton_${format}.json
done

Python API

from benchbox import TPCH
from benchbox.platforms.onehouse import QuantonAdapter

adapter = QuantonAdapter(
    api_key="your-onehouse-api-key",
    s3_staging_dir="s3://my-bucket/benchbox-data",
    region="us-east-1",
    table_format="iceberg",
    cluster_size="small",
)

benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()
adapter.load_benchmark(benchmark)
results = adapter.run_benchmark(benchmark)
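
The same adapter can drive the cross-format comparison shown in the CLI loop above. This sketch builds one adapter per format and reuses the generated data; passing record_key and precombine_field as adapter keyword arguments for the Hudi run mirrors the CLI options and is an assumption:

from benchbox import TPCH
from benchbox.platforms.onehouse import QuantonAdapter

benchmark = TPCH(scale_factor=1.0)
benchmark.generate_data()

results = {}
for table_format in ("iceberg", "hudi", "delta"):
    # Hudi needs a record key; the other formats do not (see Hudi-Specific Configuration)
    extra = (
        {"record_key": "l_orderkey", "precombine_field": "l_shipdate"}
        if table_format == "hudi"
        else {}
    )
    adapter = QuantonAdapter(
        api_key="your-onehouse-api-key",
        s3_staging_dir="s3://my-bucket/benchbox-data",
        table_format=table_format,
        **extra,
    )
    adapter.load_benchmark(benchmark)
    results[table_format] = adapter.run_benchmark(benchmark)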

Cluster Sizing

Size      Workers    Recommended Scale    Use Case
--------  ---------  -------------------  ----------------------
Small     1-2        SF 0.01-1.0          Development, testing
Medium    2-5        SF 1.0-10.0          Standard benchmarks
Large     5-10       SF 10.0-100.0        Production workloads
XLarge    10+        SF 100.0+            Large-scale analytics

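If you script sweeps over scale factors, a small helper can pick a size from the table above. This is only an illustration of the thresholds, not a BenchBox API:

def cluster_size_for(scale_factor: float) -> str:
    # Thresholds mirror the cluster sizing table above
    if scale_factor <= 1.0:
        return "small"
    if scale_factor <= 10.0:
        return "medium"
    if scale_factor <= 100.0:
        return "large"
    return "xlarge"

For example, cluster_size_for(10.0) returns "medium", matching the production benchmark example earlier in this guide.
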
Hudi-Specific Configuration

When using the Hudi table format, additional configuration is required:

Record Key

The record key uniquely identifies each record for ACID operations:

# TPC-H lineitem: use composite key
--platform-option record_key=l_orderkey,l_linenumber

# TPC-H orders: use primary key
--platform-option record_key=o_orderkey

Precombine Field

The precombine field determines which record is kept when multiple records share the same key during deduplication (the record with the larger precombine value wins):

# Use date field for ordering
--platform-option precombine_field=l_shipdate

Table Type

Choose COPY_ON_WRITE (faster reads) or MERGE_ON_READ (faster writes):

# Analytics workload (default)
--platform-option hudi_table_type=COPY_ON_WRITE

# Write-heavy workload
--platform-option hudi_table_type=MERGE_ON_READ
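
For the Python API, the same Hudi settings can presumably be passed as adapter keyword arguments; the parameter names below simply mirror the CLI platform options and are an assumption, as are the TPC-H lineitem field values:

from benchbox.platforms.onehouse import QuantonAdapter

# Keyword names mirror the CLI platform options (assumed, not confirmed)
adapter = QuantonAdapter(
    api_key="your-onehouse-api-key",
    s3_staging_dir="s3://my-bucket/benchbox-data",
    table_format="hudi",
    record_key="l_orderkey,l_linenumber",
    precombine_field="l_shipdate",
    hudi_table_type="MERGE_ON_READ",
)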

Cost Optimization

Serverless Benefits

  • No idle cluster costs

  • Pay only for compute time used

  • Automatic scaling based on workload

Cluster Auto-Stop

Clusters automatically terminate after an idle timeout (default: 15 minutes).

S3 Data Reuse

Data staged to S3 is reused across runs:

# First run uploads data
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data

# Subsequent runs skip upload
benchbox run --platform quanton --benchmark tpch --scale 1.0 \
  --platform-option s3_staging_dir=s3://bucket/data
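
If you need to force a fresh upload (for example after regenerating data at a different scale factor), clearing the staging prefix should make the next run re-stage the data. The bucket and prefix below are placeholders matching the example above:

import boto3

# Delete everything under the staging prefix so the next run re-uploads it
staging = boto3.resource("s3").Bucket("bucket")
staging.objects.filter(Prefix="data/").delete()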

Troubleshooting

Authentication Failed

# Verify API key is valid
curl -H "Authorization: Bearer $ONEHOUSE_API_KEY" \
  https://api.onehouse.ai/v1/health

S3 Access Denied

Ensure your AWS credentials have access to the S3 staging bucket:

# Test S3 access
aws s3 ls s3://your-bucket/benchbox-data/

Job Timeout

For large-scale benchmarks, increase the timeout:

benchbox run --platform quanton --benchmark tpch --scale 100.0 \
  --platform-option timeout_minutes=120

Hudi Write Failures

Ensure record_key is specified for Hudi format:

# Error: No record_key configured
# Fix: Add record_key parameter
--platform-option record_key=primary_key_column