NYC Taxi OLAP Benchmark¶
CLI name:
nyctaxi- usebenchbox run --benchmark nyctaxi
Overview¶
The NYC Taxi OLAP Benchmark uses real-world NYC Taxi & Limousine Commission (TLC) trip record data for comprehensive OLAP analytics testing. Unlike synthetic benchmarks, this benchmark leverages actual transportation data from New York City, providing realistic distributions, seasonal patterns, and geographic analytics opportunities.
The benchmark is ideal for testing analytical database performance on real-world data patterns, particularly for organizations dealing with transportation, logistics, or time-series geospatial data.
Key Features¶
Real-world data - Uses actual NYC TLC trip records (or realistic synthetic fallback)
Multi-dimensional analysis - Temporal, geographic, and financial dimensions
25 OLAP queries - Comprehensive query coverage across 9 categories
Zone-based geography - 265 NYC taxi zones for geographic analytics
Flexible scale factors - From testing (0.01) to full-dataset (SF=10 ceiling)
Date range filtering - Configurable year and month selection
Standard SQL - Queries work across multiple database platforms
Data Source¶
The benchmark uses data from the NYC TLC Trip Record Data:
Format: Parquet files from TLC data portal
Coverage: Yellow and green taxi trips
Time range: 2019-2025 (configurable)
Fallback: Synthetic data generation when download unavailable
Schema Description¶
The NYC Taxi benchmark uses a star schema with a fact table (trips) and dimension table (taxi_zones):
Tables¶
Table |
Purpose |
Approximate Rows (SF 1) |
|---|---|---|
trips |
Trip fact records with fare and location data |
~9,600,000 |
taxi_zones |
NYC taxi zone dimension table |
265 |
taxi_zones Table Structure¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER |
Unique zone identifier (1-265) |
|
VARCHAR |
NYC borough (Manhattan, Brooklyn, Queens, Bronx, Staten Island, EWR) |
|
VARCHAR |
Zone name (e.g., “Times Sq/Theatre District”) |
|
VARCHAR |
Service zone type (Yellow Zone, Boro Zone, Airports, EWR, N/A) |
trips Table Structure¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER |
Primary key (synthetic) |
|
TIMESTAMP |
Trip start timestamp |
|
TIMESTAMP |
Trip end timestamp |
|
INTEGER |
Pickup zone ID (FK to taxi_zones) |
|
INTEGER |
Dropoff zone ID (FK to taxi_zones) |
|
DECIMAL(10,2) |
Trip distance in miles |
|
INTEGER |
Number of passengers |
|
INTEGER |
Rate code (1-6) |
|
INTEGER |
Payment method (1-6) |
|
DECIMAL(10,2) |
Base fare amount |
|
DECIMAL(10,2) |
Tip amount |
|
DECIMAL(10,2) |
Tolls paid |
|
DECIMAL(10,2) |
MTA tax |
|
DECIMAL(10,2) |
Improvement surcharge |
|
DECIMAL(10,2) |
Congestion pricing surcharge |
|
DECIMAL(10,2) |
Total trip cost |
|
INTEGER |
Taxi vendor identifier |
Query Categories¶
The benchmark includes 25 queries organized into 9 categories:
Temporal Queries¶
Time-based aggregations and patterns:
trips-per-hour: Hourly trip distributiontrips-per-day: Daily trip patternstrips-per-month: Monthly aggregationshourly-revenue: Revenue by hour of day
Geographic Queries¶
Zone-level spatial analytics:
top-pickup-zones: Busiest pickup locationstop-dropoff-zones: Busiest dropoff locationszone-pairs: Popular origin-destination pairsborough-summary: Borough-level aggregations
Financial Queries¶
Revenue and tip analysis:
total-revenue: Overall revenue metricstip-analysis: Tip patterns and percentagesfare-distribution: Fare amount distributionspayment-analysis: Payment type breakdowns
Characteristics Queries¶
Trip attribute analysis:
distance-stats: Trip distance statisticspassenger-distribution: Passenger count patternstrip-duration: Duration analysis
Rate Code Queries¶
Rate code analysis:
rate-code-distribution: Rate code usage patterns
Vendor Queries¶
Vendor performance comparisons:
vendor-comparison: Vendor-level metrics
Complex Queries¶
Multi-dimensional analytics:
peak-hour-zones: Peak hours by zoneweekend-weekday: Weekend vs weekday patternsrevenue-by-zone-hour: Zone-hour revenue matrix
Point Queries¶
Single-value lookups:
specific-trip-count: Filtered trip counts
Baseline Queries¶
Full table operations:
full-scan: Complete table scanrow-count: Basic count
Usage Examples¶
Basic Benchmark Setup¶
from benchbox import NYCTaxi
# Initialize NYC Taxi benchmark
nyctaxi = NYCTaxi(scale_factor=1.0, output_dir="nyctaxi_data")
# Download/generate data
data_files = nyctaxi.generate_data()
# Get all benchmark queries
queries = nyctaxi.get_queries()
print(f"Generated {len(queries)} NYC Taxi queries")
# Get specific query
hourly_query = nyctaxi.get_query("trips-per-hour")
print(hourly_query)
Configuring Data Year and Months¶
# Use specific year and months
nyctaxi_2023 = NYCTaxi(
scale_factor=0.1,
output_dir="nyctaxi_2023",
year=2023,
months=[1, 2, 3] # Q1 only
)
data_files = nyctaxi_2023.generate_data()
DuckDB Integration Example¶
import duckdb
from benchbox import NYCTaxi
# Initialize and generate data
nyctaxi = NYCTaxi(scale_factor=0.1, output_dir="nyctaxi_small")
data_files = nyctaxi.generate_data()
# Create DuckDB connection and schema
conn = duckdb.connect("nyctaxi.duckdb")
schema_sql = nyctaxi.get_create_tables_sql(dialect="duckdb")
for stmt in schema_sql.split(";"):
if stmt.strip():
conn.execute(stmt)
# Load data efficiently with DuckDB
zones_file = nyctaxi.tables["taxi_zones"]
trips_file = nyctaxi.tables["trips"]
conn.execute(f"""
INSERT INTO taxi_zones
SELECT * FROM read_csv('{zones_file}', header=true)
""")
conn.execute(f"""
INSERT INTO trips
SELECT * FROM read_csv('{trips_file}', header=true)
""")
# Run queries
queries = nyctaxi.get_queries()
for query_id, query_sql in list(queries.items())[:5]:
result = conn.execute(query_sql).fetchall()
print(f"{query_id}: {len(result)} rows")
conn.close()
Query Categories Example¶
from benchbox import NYCTaxi
nyctaxi = NYCTaxi(scale_factor=0.01)
# Get queries by category
temporal_queries = nyctaxi.get_queries_by_category("temporal")
print(f"Temporal queries: {temporal_queries}")
geographic_queries = nyctaxi.get_queries_by_category("geographic")
print(f"Geographic queries: {geographic_queries}")
financial_queries = nyctaxi.get_queries_by_category("financial")
print(f"Financial queries: {financial_queries}")
# Get detailed query info
info = nyctaxi.get_query_info("trips-per-hour")
print(f"Query info: {info}")
CLI Options (--benchmark-option)¶
Configure NYC Taxi data generation via --benchmark-option KEY=VALUE:
Option |
Default |
Description |
|---|---|---|
|
all |
Comma-separated taxi types: |
|
|
Year of TLC data to load (2019-2025) |
|
all |
Comma-separated months to include (1-12) |
|
- |
Random seed for reproducibility |
|
- |
Force data regeneration ( |
Options accept hyphenated aliases (e.g. taxi-types for taxi_types).
# Load only yellow and green trips from Jan-Mar 2022
benchbox run --platform duckdb --benchmark nyctaxi --scale 1 \
--benchmark-option taxi_types=yellow,green \
--benchmark-option year=2022 \
--benchmark-option months=1,2,3
Spatial Queries¶
The NYC Taxi benchmark includes a geospatial layer
(benchbox/core/nyctaxi/spatial.py) that exposes platform-specific spatial
queries against a companion taxi_zones_spatial table. These queries are
not automatically included in a standard benchmark run - they require a
spatial extension on the target platform and are retrieved via the spatial
API rather than through --benchmark-option.
Platform Support¶
check_spatial_support(platform) returns the matrix of capabilities used to
decide which queries are runnable on a given platform:
Platform |
Extension |
Distance fns |
Geohash |
H3 |
Notes |
|---|---|---|---|---|---|
DuckDB |
|
|
No |
No |
Basic Euclidean operations |
PostgreSQL |
PostGIS |
Full ST_* including |
Yes |
Via extension |
Most complete support |
ClickHouse |
Native geo |
|
Yes |
Yes |
Different API surface |
Platforms not listed return {"basic_spatial": False}.
Query Catalogs¶
Each platform exposes a subset of spatial query IDs from
spatial.py (10 DuckDB / 4 PostGIS / 4 ClickHouse):
Category |
DuckDB |
PostGIS |
ClickHouse |
|---|---|---|---|
|
✅ |
✅ |
✅ |
|
✅ |
✅ |
✅ |
|
✅ |
✅ |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
✅ |
- |
- |
|
- |
✅ |
- |
|
- |
- |
✅ |
|
- |
- |
✅ |
Reference Data¶
TAXI_ZONE_CENTROIDS in spatial.py provides representative
(longitude, latitude) points for the NYC TLC zones most commonly referenced
by the spatial queries (airports, Manhattan neighborhoods, key Brooklyn and
Queens zones). The full zone geometry is not included - callers materialize
centroids into the taxi_zones_spatial table at setup time and the geometry
column is populated per-platform.
API Reference¶
from benchbox.core.nyctaxi import (
get_spatial_queries,
get_spatial_create_table_sql,
check_spatial_support,
)
get_spatial_queries(platform: str) -> dict- returns the spatial query catalog forplatform(duckdb,postgres/postgresql/postgis, or any ClickHouse variant). Unknown platforms return an empty dict.get_spatial_create_table_sql(dialect: str = "duckdb") -> str- returns theCREATE TABLE taxi_zones_spatialstatement in the requested dialect. PostgreSQL output includes a generatedgeomcolumn (PostGISGEOMETRY(POINT, 4326)) and a GIST index.check_spatial_support(platform: str) -> dict[str, bool]- capability map (keys:basic_spatial,st_distance,st_point,st_centroid,geohash,h3,geography, …). Use this to filter query catalogs before execution.
The spatial layer is not currently wired to a --benchmark-option flag; it
is accessed programmatically via the functions above.
Scale Factor Guidelines¶
Scale Factor |
Trips |
Data Size |
Memory Usage |
Use Case |
|---|---|---|---|---|
0.01 |
~96K |
~10 MB |
< 100 MB |
Quick testing |
0.1 |
~960K |
~96 MB |
< 500 MB |
Development |
1.0 |
~9.6M |
~0.96 GB |
< 4 GB |
Standard benchmark |
10.0 |
~96M |
~9.6 GB |
< 20 GB |
Performance testing (ceiling) |
100.0 |
~96M |
~9.6 GB |
< 20 GB |
⚠️ Saturated - same as SF=10 |
Note: Each taxi type’s sample rate saturates at 1.0 when SF ≥ 10. Beyond SF=10, the full available dataset is used and no additional scaling occurs. This applies independently to each taxi type (Yellow, Green, HVFHV).
Why not extend beyond the ceiling with synthetic data? NYC Taxi is a real-data benchmark - its value comes from actual trip distributions, fare structures, and temporal patterns that synthetic generators cannot faithfully replicate. Mixing synthetic rows into real data would compromise the benchmark’s core purpose: evaluating database performance against authentic workload characteristics. For larger synthetic datasets, use TPC-H or CoffeeShop instead.
Performance Characteristics¶
Query Performance Patterns¶
Temporal Queries:
Bottleneck: Date/time extraction and grouping
Optimization: Temporal indexes, date partitioning
Typical performance: Fast (seconds)
Geographic Queries:
Bottleneck: Join with taxi_zones dimension table
Optimization: Zone ID indexes, denormalization
Typical performance: Fast to medium
Financial Queries:
Bottleneck: Aggregation over numeric columns
Optimization: Columnar storage, SIMD operations
Typical performance: Fast
Complex Queries:
Bottleneck: Multi-dimensional grouping, joins
Optimization: Materialized views, query caching
Typical performance: Medium to slow
Data Characteristics¶
The NYC Taxi data exhibits realistic patterns:
Temporal patterns: Peak hours (7-9am, 5-7pm), weekday/weekend differences
Geographic clusters: Manhattan Yellow Zones dominate, airport traffic patterns
Fare distributions: Right-skewed with peak around $10-15
Tip patterns: Strong correlation with fare amount, payment type
Seasonal variations: Holiday effects, summer vs winter patterns
Best Practices¶
Data Generation¶
Start small - Use SF=0.01 for initial testing
Choose appropriate year - Match your analysis timeframe
Consider months - Use specific months for seasonal analysis
Query Optimization¶
Index zone IDs - For geographic join performance
Partition by date - For temporal query efficiency
Materialize zones - Denormalize frequently-joined columns
Performance Testing¶
Warm-up queries - Run queries multiple times
Monitor resources - Track CPU, memory, I/O
Compare categories - Different query types stress different components
Common Issues and Solutions¶
Data Download Failures¶
Issue: Unable to download TLC data (network restrictions, rate limiting)
# Solution: Use synthetic data fallback (automatic)
nyctaxi = NYCTaxi(scale_factor=0.1)
# Benchmark will automatically generate synthetic data if download fails
data_files = nyctaxi.generate_data()
Memory Issues with Large Scale Factors¶
Issue: Out of memory during data generation
# Solution: Process in smaller chunks using months
for month in [1, 2, 3]:
nyctaxi = NYCTaxi(
scale_factor=10.0,
year=2019,
months=[month]
)
data_files = nyctaxi.generate_data()
# Process and unload before next month
Query Date Range Issues¶
Issue: Queries return no results
# Solution: Ensure query date parameters match generated data
nyctaxi = NYCTaxi(year=2023, months=[1]) # January 2023
# Queries will be parameterized for Jan 1-31, 2023
External Resources¶
NYC TLC Trip Record Data - Official data source
NYC Taxi Zones - Zone geography
TLC Data Dictionary - Column definitions