NYC Taxi OLAP Benchmark¶
Overview¶
The NYC Taxi OLAP Benchmark uses real-world NYC Taxi & Limousine Commission (TLC) trip record data for comprehensive OLAP analytics testing. Unlike synthetic benchmarks, this benchmark leverages actual transportation data from New York City, providing realistic distributions, seasonal patterns, and geographic analytics opportunities.
The benchmark is ideal for testing analytical database performance on real-world data patterns, particularly for organizations dealing with transportation, logistics, or time-series geospatial data.
Key Features¶
Real-world data - Uses actual NYC TLC trip records (or realistic synthetic fallback)
Multi-dimensional analysis - Temporal, geographic, and financial dimensions
25 OLAP queries - Comprehensive query coverage across 9 categories
Zone-based geography - 265 NYC taxi zones for geographic analytics
Flexible scale factors - From testing (0.01) to production scale (100+)
Date range filtering - Configurable year and month selection
Standard SQL - Queries work across multiple database platforms
Data Source¶
The benchmark uses data from the NYC TLC Trip Record Data:
Format: Parquet files from TLC data portal
Coverage: Yellow and green taxi trips
Time range: 2019-2025 (configurable)
Fallback: Synthetic data generation when download unavailable
Schema Description¶
The NYC Taxi benchmark uses a star schema with a fact table (trips) and dimension table (taxi_zones):
Tables¶
Table |
Purpose |
Approximate Rows (SF 1) |
|---|---|---|
trips |
Trip fact records with fare and location data |
~30,000,000 |
taxi_zones |
NYC taxi zone dimension table |
265 |
taxi_zones Table Structure¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER |
Unique zone identifier (1-265) |
|
VARCHAR |
NYC borough (Manhattan, Brooklyn, Queens, Bronx, Staten Island, EWR) |
|
VARCHAR |
Zone name (e.g., “Times Sq/Theatre District”) |
|
VARCHAR |
Service zone type (Yellow Zone, Boro Zone, Airports, EWR, N/A) |
trips Table Structure¶
Column |
Type |
Description |
|---|---|---|
|
INTEGER |
Primary key (synthetic) |
|
TIMESTAMP |
Trip start timestamp |
|
TIMESTAMP |
Trip end timestamp |
|
INTEGER |
Pickup zone ID (FK to taxi_zones) |
|
INTEGER |
Dropoff zone ID (FK to taxi_zones) |
|
DECIMAL(10,2) |
Trip distance in miles |
|
INTEGER |
Number of passengers |
|
INTEGER |
Rate code (1-6) |
|
INTEGER |
Payment method (1-6) |
|
DECIMAL(10,2) |
Base fare amount |
|
DECIMAL(10,2) |
Tip amount |
|
DECIMAL(10,2) |
Tolls paid |
|
DECIMAL(10,2) |
MTA tax |
|
DECIMAL(10,2) |
Improvement surcharge |
|
DECIMAL(10,2) |
Congestion pricing surcharge |
|
DECIMAL(10,2) |
Total trip cost |
|
INTEGER |
Taxi vendor identifier |
Query Categories¶
The benchmark includes 25 queries organized into 9 categories:
Temporal Queries¶
Time-based aggregations and patterns:
trips-per-hour: Hourly trip distributiontrips-per-day: Daily trip patternstrips-per-month: Monthly aggregationshourly-revenue: Revenue by hour of day
Geographic Queries¶
Zone-level spatial analytics:
top-pickup-zones: Busiest pickup locationstop-dropoff-zones: Busiest dropoff locationszone-pairs: Popular origin-destination pairsborough-summary: Borough-level aggregations
Financial Queries¶
Revenue and tip analysis:
total-revenue: Overall revenue metricstip-analysis: Tip patterns and percentagesfare-distribution: Fare amount distributionspayment-analysis: Payment type breakdowns
Characteristics Queries¶
Trip attribute analysis:
distance-stats: Trip distance statisticspassenger-distribution: Passenger count patternstrip-duration: Duration analysis
Rate Code Queries¶
Rate code analysis:
rate-code-distribution: Rate code usage patterns
Vendor Queries¶
Vendor performance comparisons:
vendor-comparison: Vendor-level metrics
Complex Queries¶
Multi-dimensional analytics:
peak-hour-zones: Peak hours by zoneweekend-weekday: Weekend vs weekday patternsrevenue-by-zone-hour: Zone-hour revenue matrix
Point Queries¶
Single-value lookups:
specific-trip-count: Filtered trip counts
Baseline Queries¶
Full table operations:
full-scan: Complete table scanrow-count: Basic count
Usage Examples¶
Basic Benchmark Setup¶
from benchbox import NYCTaxi
# Initialize NYC Taxi benchmark
nyctaxi = NYCTaxi(scale_factor=1.0, output_dir="nyctaxi_data")
# Download/generate data
data_files = nyctaxi.generate_data()
# Get all benchmark queries
queries = nyctaxi.get_queries()
print(f"Generated {len(queries)} NYC Taxi queries")
# Get specific query
hourly_query = nyctaxi.get_query("trips-per-hour")
print(hourly_query)
Configuring Data Year and Months¶
# Use specific year and months
nyctaxi_2023 = NYCTaxi(
scale_factor=0.1,
output_dir="nyctaxi_2023",
year=2023,
months=[1, 2, 3] # Q1 only
)
data_files = nyctaxi_2023.generate_data()
DuckDB Integration Example¶
import duckdb
from benchbox import NYCTaxi
# Initialize and generate data
nyctaxi = NYCTaxi(scale_factor=0.1, output_dir="nyctaxi_small")
data_files = nyctaxi.generate_data()
# Create DuckDB connection and schema
conn = duckdb.connect("nyctaxi.duckdb")
schema_sql = nyctaxi.get_create_tables_sql(dialect="duckdb")
for stmt in schema_sql.split(";"):
if stmt.strip():
conn.execute(stmt)
# Load data efficiently with DuckDB
zones_file = nyctaxi.tables["taxi_zones"]
trips_file = nyctaxi.tables["trips"]
conn.execute(f"""
INSERT INTO taxi_zones
SELECT * FROM read_csv('{zones_file}', header=true)
""")
conn.execute(f"""
INSERT INTO trips
SELECT * FROM read_csv('{trips_file}', header=true)
""")
# Run queries
queries = nyctaxi.get_queries()
for query_id, query_sql in list(queries.items())[:5]:
result = conn.execute(query_sql).fetchall()
print(f"{query_id}: {len(result)} rows")
conn.close()
Query Categories Example¶
from benchbox import NYCTaxi
nyctaxi = NYCTaxi(scale_factor=0.01)
# Get queries by category
temporal_queries = nyctaxi.get_queries_by_category("temporal")
print(f"Temporal queries: {temporal_queries}")
geographic_queries = nyctaxi.get_queries_by_category("geographic")
print(f"Geographic queries: {geographic_queries}")
financial_queries = nyctaxi.get_queries_by_category("financial")
print(f"Financial queries: {financial_queries}")
# Get detailed query info
info = nyctaxi.get_query_info("trips-per-hour")
print(f"Query info: {info}")
Scale Factor Guidelines¶
Scale Factor |
Trips |
Data Size |
Memory Usage |
Use Case |
|---|---|---|---|---|
0.01 |
~300K |
~11 MB |
< 100 MB |
Quick testing |
0.1 |
~3M |
~110 MB |
< 500 MB |
Development |
1.0 |
~30M |
~1.1 GB |
< 4 GB |
Standard benchmark |
10.0 |
~300M |
~11 GB |
< 20 GB |
Performance testing |
100.0 |
~3B |
~111 GB |
< 150 GB |
Production simulation |
Performance Characteristics¶
Query Performance Patterns¶
Temporal Queries:
Bottleneck: Date/time extraction and grouping
Optimization: Temporal indexes, date partitioning
Typical performance: Fast (seconds)
Geographic Queries:
Bottleneck: Join with taxi_zones dimension table
Optimization: Zone ID indexes, denormalization
Typical performance: Fast to medium
Financial Queries:
Bottleneck: Aggregation over numeric columns
Optimization: Columnar storage, SIMD operations
Typical performance: Fast
Complex Queries:
Bottleneck: Multi-dimensional grouping, joins
Optimization: Materialized views, query caching
Typical performance: Medium to slow
Data Characteristics¶
The NYC Taxi data exhibits realistic patterns:
Temporal patterns: Peak hours (7-9am, 5-7pm), weekday/weekend differences
Geographic clusters: Manhattan Yellow Zones dominate, airport traffic patterns
Fare distributions: Right-skewed with peak around $10-15
Tip patterns: Strong correlation with fare amount, payment type
Seasonal variations: Holiday effects, summer vs winter patterns
Best Practices¶
Data Generation¶
Start small - Use SF=0.01 for initial testing
Choose appropriate year - Match your analysis timeframe
Consider months - Use specific months for seasonal analysis
Query Optimization¶
Index zone IDs - For geographic join performance
Partition by date - For temporal query efficiency
Materialize zones - Denormalize frequently-joined columns
Performance Testing¶
Warm-up queries - Run queries multiple times
Monitor resources - Track CPU, memory, I/O
Compare categories - Different query types stress different components
Common Issues and Solutions¶
Data Download Failures¶
Issue: Unable to download TLC data (network restrictions, rate limiting)
# Solution: Use synthetic data fallback (automatic)
nyctaxi = NYCTaxi(scale_factor=0.1)
# Benchmark will automatically generate synthetic data if download fails
data_files = nyctaxi.generate_data()
Memory Issues with Large Scale Factors¶
Issue: Out of memory during data generation
# Solution: Process in smaller chunks using months
for month in [1, 2, 3]:
nyctaxi = NYCTaxi(
scale_factor=10.0,
year=2019,
months=[month]
)
data_files = nyctaxi.generate_data()
# Process and unload before next month
Query Date Range Issues¶
Issue: Queries return no results
# Solution: Ensure query date parameters match generated data
nyctaxi = NYCTaxi(year=2023, months=[1]) # January 2023
# Queries will be parameterized for Jan 1-31, 2023
External Resources¶
NYC TLC Trip Record Data - Official data source
NYC Taxi Zones - Zone geography
TLC Data Dictionary - Column definitions