Module Location: src/tif1/io_pipeline.py
Source Implementation: src/tif1/core.py (re-exported for public API)
Dependencies: pandas, polars (optional), pydantic (validation), orjson (JSON parsing)
The io_pipeline module is the core data transformation layer in tif1, responsible for converting raw JSON payloads from the TracingInsights CDN into structured, FastF1-compatible DataFrames. This module orchestrates the entire data flow from network fetch through validation, parsing, column renaming, type coercion, and final DataFrame construction.
The pipeline is designed with three primary goals:
- Performance: Zero-copy construction, vectorized operations, and minimal memory allocations
- Compatibility: 100% FastF1-compatible output with identical column names, types, and ordering
- Reliability: Comprehensive validation, error handling, and graceful degradation for malformed data
Architecture Overview
The I/O pipeline is designed as a multi-stage transformation system that prioritizes performance, correctness, and FastF1 compatibility. Each stage is optimized for zero-copy operations where possible, with careful attention to memory efficiency and processing speed.
Design Principles
The pipeline architecture follows these core principles:
- Separation of Concerns: Each function has a single, well-defined responsibility
- Composability: Functions can be chained together to build complex transformations
- Backend Agnostic: Supports both pandas and polars with library-specific optimizations
- Fail-Safe: Graceful degradation for malformed data, with optional strict validation
- Performance First: Zero-copy construction, vectorized operations, and lazy evaluation where possible
Pipeline Stages
The data transformation pipeline consists of six distinct stages, each handling a specific aspect of the data flow:
Stage Descriptions
| Stage | Function | Purpose | Input | Output |
|---|---|---|---|---|
| 1. Fetch | fetch_json_async | Retrieve JSON from CDN with caching | URL parameters | Raw JSON dict |
| 2. Validate | _validate_json_payload | Schema validation with Pydantic | Raw JSON dict | Validated JSON dict |
| 3. Extract | _extract_driver_info_map | Build driver metadata lookup | Driver list | Driver code → metadata map |
| 4. Construct | _create_lap_df / _create_session_df | Build raw DataFrame | JSON dict | Raw DataFrame |
| 5. Transform | _process_lap_df | Rename columns, coerce types | Raw DataFrame | Processed DataFrame |
| 6. Finalize | _reorder_laps_columns | Apply FastF1 column order | Processed DataFrame | Final DataFrame |
Data Flow Characteristics
The pipeline is optimized for the following characteristics:
- Zero-copy construction: Uses copy=False in pandas and strict=False in polars to avoid unnecessary memory allocations
  - Pandas: pd.DataFrame(data, copy=False) creates views instead of copies when possible
  - Polars: pl.DataFrame(data, strict=False) allows flexible schema inference without strict type checking
  - Result: 30-50% reduction in memory usage for large datasets
- Batch processing: Processes entire datasets at once using vectorized operations rather than row-by-row iteration
  - All type coercions use pandas/polars vectorized operations
  - Column renaming applied in a single operation via dictionary mapping
  - Categorical conversion applied to all columns simultaneously
  - Result: 10-100x faster than row-by-row processing
- Lazy validation: Validation is optional and can be disabled for maximum performance in production environments
  - Controlled by the validate_data, validate_lap_times, and validate_telemetry config flags
  - Non-strict mode logs errors but continues processing
  - Strict mode raises InvalidDataError on validation failures
  - Result: 5-20ms saved per session when validation is disabled
- Dual backend support: Seamlessly supports both pandas and polars with library-specific optimizations
  - Pandas: Optimized for categorical types, nullable booleans, and timedelta operations
  - Polars: Optimized for lazy evaluation, memory efficiency, and parallel processing
  - Backend selection via the lib parameter (“pandas” or “polars”)
  - Result: Users can choose the best backend for their use case
- FastF1 compatibility: Ensures output DataFrames match FastF1’s column names, types, and ordering conventions
  - Column names: PascalCase (e.g., LapTime, Sector1Time)
  - Column types: timedelta64[ns] for times, float64 for numeric, category for categorical
  - Column order: Matches FastF1’s FASTF1_LAPS_COLUMN_ORDER constant
  - Result: Drop-in replacement for FastF1 with zero code changes
Performance Benchmarks
Typical performance characteristics on modern hardware (Intel i7/AMD Ryzen 7, 16GB RAM):
| Operation | Dataset Size | Pandas Time | Polars Time | Memory Usage |
|---|---|---|---|---|
| Process 50 laps | 50 rows × 40 cols | 2-5ms | 3-7ms | ~200KB |
| Process 1000 laps | 1000 rows × 40 cols | 20-40ms | 15-30ms | ~3MB |
| Full session (20 drivers) | 1000 rows × 40 cols | 100-200ms | 80-150ms | ~50MB |
| Telemetry (1 driver) | 10000 rows × 15 cols | 50-100ms | 40-80ms | ~10MB |
| Weather data | 200 rows × 8 cols | 1-3ms | 2-4ms | ~50KB |
Performance Tip: For maximum performance, disable validation in production. This can reduce processing time by 10-30% for large datasets.
Core Concepts
JSON Payload Structure
The pipeline processes several types of JSON payloads, each with a distinct structure optimized for network efficiency and parsing speed.
Lap Data Payload
Source Files: session_laptimes.json, {driver}_tel.json
Purpose: Contains lap timing data, sector times, tire information, and track status
Structure: Dictionary of arrays (columnar format for efficient parsing)
- Columnar format: Each field is an array, not an array of objects (faster parsing)
- Abbreviated keys: Short keys reduce JSON size by ~30% (e.g., "s1" instead of "sector_1_time")
- Consistent lengths: All arrays must have the same length (validated by Pydantic)
- Nullable values: null values allowed for optional fields
- Type flexibility: Numbers can be int or float, booleans can be 0/1 or true/false
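As a rough illustration of the columnar layout described above, the sketch below builds a small payload, checks that all arrays share one length, and converts it to per-lap rows. The values and the helper names (column_lengths, is_consistent, to_rows) are invented for this example and are not part of tif1.

```python
# Illustrative columnar payload; the values are made up for this sketch.
lap_payload = {
    "lap": [1, 2, 3],
    "time": [95.123, 92.765, None],  # JSON null becomes None
    "s1": [30.1, 29.8, 29.9],
    "compound": ["SOFT", "SOFT", "SOFT"],
    "pb": [0, 1, False],             # 0/1 and true/false may both appear
}


def column_lengths(payload):
    """Return the length of every array-valued field."""
    return {key: len(values) for key, values in payload.items()}


def is_consistent(payload):
    """True when all arrays share one length (or the payload is empty)."""
    return len(set(column_lengths(payload).values())) <= 1


def to_rows(payload):
    """Convert the columnar layout into a list of per-lap dicts."""
    keys = list(payload)
    return [dict(zip(keys, row)) for row in zip(*payload.values())]


rows = to_rows(lap_payload)
```

The row-oriented form is shown only for comparison; the pipeline itself keeps the columnar form because it maps directly onto DataFrame columns.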
Driver Metadata Payload
Source File: drivers.json
Purpose: Contains driver information, team assignments, and visual metadata
Structure: Array of driver objects
- Array format: List of driver objects (not a dictionary)
- 3-letter codes: Driver codes are always 3 uppercase letters (e.g., "VER", "HAM")
- Team colors: Hex color codes for visualization (e.g., "#3671C6")
- Headshot URLs: Direct links to driver photos for UI integration
Weather Data Payload
Source File: weather.json
Purpose: Contains session weather conditions sampled at regular intervals
Structure: Dictionary of arrays (time-series data)
- Time-series format: Data sampled at regular intervals (typically 60 seconds)
- Abbreviated keys: wT (time), wAT (air temp), wTT (track temp), etc.
- Metric units: Temperatures in Celsius, pressure in mbar, wind speed in m/s
- Boolean rainfall: true/false for rain detection
Race Control Messages Payload
Source File: rcm.json
Purpose: Contains race control messages, flags, and safety car deployments
Structure: Dictionary of arrays (event log)
- Event log format: Chronological list of race control events
- Category types: Flag, SafetyCar, DRS, Other
- Track status codes: “1” (green), “2” (yellow), “4” (safety car), “5” (red), “6” (VSC), “7” (VSC ending)
- Sector-specific: Some events apply to specific sectors (1, 2, or 3)
- Driver-specific: Some events target specific drivers (by driver number)
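The track status codes listed above lend themselves to a simple lookup table. The following is a minimal sketch; the names TRACK_STATUS_LABELS and describe_track_status are hypothetical and only illustrate how the codes could be decoded.

```python
# Codes and labels copied from the list above; the dictionary name is hypothetical.
TRACK_STATUS_LABELS = {
    "1": "green",
    "2": "yellow",
    "4": "safety car",
    "5": "red",
    "6": "VSC",
    "7": "VSC ending",
}


def describe_track_status(code):
    """Map a status code (string or int) to a label, or 'unknown'."""
    return TRACK_STATUS_LABELS.get(str(code), "unknown")
```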
Column Naming Philosophy
The pipeline transforms abbreviated JSON keys into descriptive, FastF1-compatible column names through a sophisticated mapping system.
Naming Conventions
| Format | Purpose | Example | Use Case |
|---|---|---|---|
| Abbreviated | Network efficiency | "s1", "vi1", "wAT" | JSON payloads from CDN |
| snake_case | Pydantic validation | "sector_1_time", "speed_i1", "air_temp" | Validated schemas |
| PascalCase | DataFrame columns | "Sector1Time", "SpeedI1", "AirTemp" | Final output |
Transformation Process
The pipeline supports bidirectional mapping to handle both raw and validated JSON.
Mapping Tables
The complete mapping is defined in LAP_RENAME_MAP in src/tif1/core_utils/constants.py:
Timing Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| lap | lap | LapNumber | Lap number (1-indexed) |
| time | time | LapTime | Total lap time |
| s1 | s1 | Sector1Time | Sector 1 time |
| s2 | s2 | Sector2Time | Sector 2 time |
| s3 | s3 | Sector3Time | Sector 3 time |
| sesT | session_time | Time | Session time at lap end |
| s1T | sector1_session_time | Sector1SessionTime | Session time at S1 end |
| s2T | sector2_session_time | Sector2SessionTime | Session time at S2 end |
| s3T | sector3_session_time | Sector3SessionTime | Session time at S3 end |
Speed Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| vi1 | speed_i1 | SpeedI1 | Speed trap 1 (km/h) |
| vi2 | speed_i2 | SpeedI2 | Speed trap 2 (km/h) |
| vfl | speed_fl | SpeedFL | Finish line speed (km/h) |
| vst | speed_st | SpeedST | Speed trap (km/h) |
Tire Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| compound | compound | Compound | Tire compound name |
| life | life | TyreLife | Tire age in laps |
| stint | stint | Stint | Stint number |
| fresh | fresh_tyre | FreshTyre | Fresh tire flag |
Driver and Position Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| drv | source_driver | Driver | 3-letter driver code |
| dNum | driver_number | DriverNumber | Driver number (string) |
| team | source_team | Team | Team name |
| pos | pos | Position | Position at lap end |
| status | status | TrackStatus | Track status code |
Flag Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| pb | pb | IsPersonalBest | Personal best lap flag |
| del | deleted | Deleted | Lap deleted flag |
| delR | deleted_reason | DeletedReason | Deletion reason |
| ff1G | fastf1_generated | FastF1Generated | FastF1 generated flag |
| iacc | is_accurate | IsAccurate | Accuracy flag |
Weather Columns:
| JSON Key (Raw) | JSON Key (Validated) | DataFrame Column | Description |
|---|---|---|---|
| wT | weather_time | WeatherTime | Weather sample time |
| wAT | air_temp | AirTemp | Air temperature (°C) |
| wTT | track_temp | TrackTemp | Track temperature (°C) |
| wH | humidity | Humidity | Relative humidity (%) |
| wP | pressure | Pressure | Air pressure (mbar) |
| wR | rainfall | Rainfall | Rainfall flag |
| wWD | wind_direction | WindDirection | Wind direction (degrees) |
| wWS | wind_speed | WindSpeed | Wind speed (m/s) |
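To show how one map can accept both raw and validated keys, here is a small sketch built from a handful of rows in the tables above. It is an assumption-laden illustration: EXAMPLE_RENAME_MAP and rename_keys are hypothetical names, and the real LAP_RENAME_MAP in src/tif1/core_utils/constants.py may be organized differently.

```python
# Subset of the raw-key and validated-key mappings from the tables above.
RAW_TO_COLUMN = {
    "lap": "LapNumber",
    "time": "LapTime",
    "s1": "Sector1Time",
    "vi1": "SpeedI1",
    "sesT": "Time",
    "fresh": "FreshTyre",
}
VALIDATED_TO_COLUMN = {
    "session_time": "Time",
    "speed_i1": "SpeedI1",
    "fresh_tyre": "FreshTyre",
}
# Merging both dictionaries lets either key style resolve to the same column.
EXAMPLE_RENAME_MAP = {**RAW_TO_COLUMN, **VALIDATED_TO_COLUMN}


def rename_keys(payload, rename_map):
    """Rename known keys and leave unknown keys untouched."""
    return {rename_map.get(key, key): values for key, values in payload.items()}


renamed = rename_keys({"sesT": [95.1], "speed_i1": [301.0], "extra": [1]}, EXAMPLE_RENAME_MAP)
```

Keys that appear in neither style pass through unchanged in this sketch; whether the library drops or keeps such columns is not stated here.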
Type System
The pipeline enforces a strict type system to ensure data consistency and FastF1 compatibility. All type coercions are performed using vectorized operations for maximum performance.
Type Categories
| Category | Pandas Type | Polars Type | Description | Example Values |
|---|---|---|---|---|
| Time values | timedelta64[ns] | Duration(ns) | Lap times, sector times, session times | 0 days 00:01:32.765000000 |
| Numeric values | float64 | Float64 | Speeds, temperatures, positions | 108.901, 18.5, 1.0 |
| Integer values | float64 | Float64 | Lap numbers, stint numbers (nullable) | 1.0, 2.0, NaN |
| Boolean flags | bool | Boolean | Personal best, fresh tyre | True, False |
| Nullable booleans | boolean (pandas) | Boolean | Deleted flag (pandas nullable bool) | True, False, <NA> |
| Categorical | category | Categorical | Driver, Team, Compound, TrackStatus | "VER", "Red Bull Racing" |
| String values | str / object | Utf8 | Driver numbers, deletion reasons | "33", "Track limits" |
Type Coercion Rules
Timedelta Conversion:
Column-Specific Types
Lap DataFrame Types:
Type Coercion Performance
Type coercion is performed using vectorized operations for maximum performance:
| Operation | Method | Time (1000 rows) | Time (10000 rows) |
|---|---|---|---|
| Timedelta conversion | pd.to_timedelta() | ~0.5ms | ~2ms |
| Numeric coercion | pd.to_numeric() | ~0.3ms | ~1ms |
| Boolean coercion | .fillna().astype() | ~0.2ms | ~0.8ms |
| Categorical conversion | .astype('category') | ~1ms | ~5ms |
| Total (all columns) | Vectorized batch | ~5ms | ~20ms |
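The coercion methods named in the table can be sketched on a tiny pandas DataFrame as follows. The sample values are invented, and the boolean step goes through the nullable "boolean" dtype before filling, which is a choice made for this example to avoid object-dtype fill warnings; it is not necessarily how tif1 implements it.

```python
import pandas as pd

# Invented sample data using already-renamed column names.
raw = pd.DataFrame({
    "LapTime": [92.765, None, 91.5],        # float seconds
    "SpeedI1": ["301", "bad", 298.4],       # mixed strings and numbers
    "IsPersonalBest": [True, None, False],  # missing flag
    "Deleted": [False, None, True],         # nullable flag
    "Compound": ["SOFT", "SOFT", "HARD"],
})

out = pd.DataFrame({
    # Float seconds become timedeltas; missing values become NaT
    "LapTime": pd.to_timedelta(raw["LapTime"], unit="s"),
    # Unparseable values become NaN instead of raising
    "SpeedI1": pd.to_numeric(raw["SpeedI1"], errors="coerce"),
    # Non-nullable flag: missing values are treated as False
    "IsPersonalBest": raw["IsPersonalBest"].astype("boolean").fillna(False).astype(bool),
    # Nullable flag: missing values stay <NA>
    "Deleted": raw["Deleted"].astype("boolean"),
    # Low-cardinality strings become categoricals
    "Compound": raw["Compound"].astype("category"),
})
```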
Categorical Optimization: Categorical types reduce memory usage by 50-80% for columns with low cardinality (Driver, Team, Compound, TrackStatus). However, they add overhead for small datasets. Use the polars_lap_categorical=False config option to disable categorical types in polars for maximum performance.
API Reference
_validate_json_payload
Validation Behavior
The validation process is path-aware and applies different schemas based on the resource type:
| Path Pattern | Schema | Config Flag | Strict Mode |
|---|---|---|---|
| drivers.json | validate_drivers | validate_data | Non-strict |
| rcm.json | validate_race_control_data | validate_data | Non-strict |
| weather.json | validate_weather_data | validate_data | Non-strict |
| session_laptimes.json | validate_lap_data | validate_lap_times | Non-strict |
| *_tel.json | validate_telemetry_data | validate_telemetry | Non-strict |
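One way to picture the path-aware dispatch in the table is a pattern list matched against the file name. This is a sketch only: SCHEMA_RULES and select_schema are hypothetical names, and the use of fnmatch is an assumption about how the patterns could be matched, not a description of the actual implementation.

```python
from fnmatch import fnmatch

# (path pattern, validator name, config flag), copied from the table above.
SCHEMA_RULES = [
    ("drivers.json", "validate_drivers", "validate_data"),
    ("rcm.json", "validate_race_control_data", "validate_data"),
    ("weather.json", "validate_weather_data", "validate_data"),
    ("session_laptimes.json", "validate_lap_data", "validate_lap_times"),
    ("*_tel.json", "validate_telemetry_data", "validate_telemetry"),
]


def select_schema(path):
    """Return (validator_name, config_flag) for a resource path, or None."""
    filename = path.rsplit("/", 1)[-1]  # match on the last path segment only
    for pattern, validator, flag in SCHEMA_RULES:
        if fnmatch(filename, pattern):
            return validator, flag
    return None
```

For example, select_schema("laps/VER/19_tel.json") resolves to the telemetry validator and the validate_telemetry flag, while an unrecognized file name yields None.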
Parameters
- path (str): Resource path for error context and schema selection
  - Examples: "drivers.json", "laps/VER/19_tel.json", "weather.json"
  - Used to determine which validation schema to apply
  - Included in error messages for debugging
- data (dict[str, Any]): Raw JSON dictionary from CDN fetch
  - Must be a dictionary (not a list or primitive)
  - Keys are JSON field names (abbreviated or snake_case)
  - Values are typically lists of primitives or nested dictionaries
Returns
dict[str, Any]: Validated and potentially transformed JSON dictionary
- Keys may be transformed from abbreviated to snake_case
- Values are type-checked and coerced where necessary
- Invalid fields may be removed or replaced with defaults
Raises
InvalidDataError: If validation fails in strict mode or encounters fatal errors
- Includes the resource path in the error message
- Contains detailed validation error information
- Preserves the original exception as the cause
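The difference between strict and non-strict handling can be sketched as below. Everything here is a stand-in: the InvalidDataError class is redefined locally, and run_validation and require_dict are hypothetical helpers that only mimic the documented behavior (path in the message, original exception preserved as the cause, non-strict mode logging and continuing).

```python
import logging

logger = logging.getLogger("tif1.sketch")


class InvalidDataError(Exception):
    """Local stand-in for the library's exception of the same name."""


def run_validation(path, data, validator, strict=False):
    """Apply a validator; raise in strict mode, otherwise log and pass data through."""
    try:
        return validator(data)
    except (ValueError, TypeError) as exc:
        if strict:
            # Keep the resource path in the message and chain the original error
            raise InvalidDataError(f"Validation failed for {path}: {exc}") from exc
        logger.warning("Validation failed for %s: %s", path, exc)
        return data


def require_dict(data):
    """Toy validator: accept dictionaries only."""
    if not isinstance(data, dict):
        raise TypeError("payload must be a dictionary")
    return data
```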
Special Handling
Telemetry Payload Sanitization: Telemetry payloads receive special treatment to remove validator-only defaults that would break DataFrame construction.
Configuration
Validation is controlled by multiple config flags: validate_data, validate_lap_times, and validate_telemetry.
Performance Impact
Validation adds overhead to the data pipeline:
- Lap data validation: ~5-10ms per session
- Telemetry validation: ~10-20ms per driver
- Weather/race control validation: ~1-2ms per session
Example Usage
This function uses the global config singleton from config.get_config(). The underlying implementation in async_fetch.py accepts a config parameter for testing, but the exported version in io_pipeline.py always uses the global config.
_extract_driver_codes
Parameters
drivers (list[dict] | None): List of driver dictionaries from drivers.json, or None
- Each dictionary must contain a "driver" key with the 3-letter code
- If None or empty list, returns an empty set
- Malformed dictionaries without a "driver" key are silently skipped
Returns
set[str]: Set of unique 3-letter driver codes
- Examples: {"VER", "HAM", "LEC", "SAI"}
- Empty set if input is None or empty
- Duplicates are automatically removed by set construction
Implementation Details
The function performs a simple list comprehension with dictionary key access.
Example Usage
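A minimal sketch of the behavior described in this section, assuming only what the parameter and return descriptions state. The function below is a re-implementation for illustration (without the leading underscore), not the library's source.

```python
def extract_driver_codes(drivers):
    """Collect unique driver codes, skipping entries without a "driver" key."""
    if not drivers:  # covers both None and an empty list
        return set()
    return {d["driver"] for d in drivers if isinstance(d, dict) and "driver" in d}


drivers = [
    {"driver": "VER", "team": "Red Bull Racing"},
    {"driver": "HAM", "team": "Mercedes"},
    {"team": "Ferrari"},   # malformed: no "driver" key, skipped
    {"driver": "VER"},     # duplicate, collapsed by the set
]
codes = extract_driver_codes(drivers)
```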
Use Cases
This function is primarily used for:
- Session validation: Checking if a session has driver data before processing
- Driver filtering: Determining which drivers to fetch telemetry for
- Quick lookups: Fast set membership tests without processing full metadata
- Debugging: Logging which drivers are present in a session
This function is extremely lightweight and performs no validation or transformation. It’s designed for quick driver enumeration without the overhead of full metadata processing.
_extract_driver_info_map
Parameters
drivers (list[dict] | None): List of driver dictionaries from drivers.json, or None
- Each dictionary contains full driver metadata
- If None or empty list, returns an empty dictionary
- Malformed dictionaries without a "driver" key are silently skipped
Returns
dict[str, dict]: Dictionary mapping driver codes to raw metadata dictionaries
- Keys: 3-letter driver codes (e.g., "VER", "HAM")
- Values: Raw JSON dictionaries with all metadata fields
- Empty dictionary if input is None or empty
Metadata Fields
Each driver metadata dictionary contains the following fields:
| Field | Type | Description | Example |
|---|---|---|---|
| driver | str | 3-letter driver code | "VER" |
| dn | str | Driver number (as string) | "33" |
| team | str | Full team name | "Red Bull Racing" |
| first_name | str | Driver’s first name | "Max" |
| last_name | str | Driver’s last name | "Verstappen" |
| team_color | str | Hex color code for team | "#3671C6" |
| headshot_url | str | URL to driver photo | "https://..." |
Implementation Details
The function creates a dictionary comprehension that maps driver codes to their full metadata.
Example Usage
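The mapping step can be sketched with a dictionary comprehension, under the same assumptions as the parameter description. This is an illustrative re-implementation (named without the leading underscore), and the sample metadata reuses the example values from the field table.

```python
def extract_driver_info_map(drivers):
    """Map each driver code to its raw metadata dictionary."""
    if not drivers:  # covers both None and an empty list
        return {}
    return {d["driver"]: d for d in drivers if isinstance(d, dict) and "driver" in d}


drivers = [
    {"driver": "VER", "dn": "33", "team": "Red Bull Racing", "team_color": "#3671C6"},
    {"dn": "44"},  # malformed: no "driver" key, skipped
]
info = extract_driver_info_map(drivers)
team = info["VER"]["team"]  # O(1) lookup by driver code
```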
Use Cases
This function is used throughout the pipeline for:
- DataFrame enrichment: Adding driver metadata columns to lap DataFrames
- Team assignment: Mapping driver codes to team names
- Display formatting: Accessing driver names and colors for plotting
- Validation: Checking if a driver code is valid for a session
Performance Characteristics
- Time complexity: O(n) where n is the number of drivers (typically 20)
- Space complexity: O(n) for the dictionary storage
- Lookup time: O(1) for accessing driver info by code
_create_lap_df
Parameters
- lap_data (dict): Dictionary of lap data arrays (columnar format, not row-based)
  - Keys: Internal JSON field names like "lap", "time", "s1", "s2", "s3", etc.
  - Values: Lists/arrays of primitive values (numbers, strings, booleans)
  - Structure: All arrays should have the same length (normalized automatically if mismatched)
- driver (str): 3-letter driver code (e.g., "VER", "HAM", "LEC")
  - Format: Exactly 3 uppercase letters
  - Purpose: Added as a constant column to all rows
  - Validation: No validation performed (assumed valid from upstream)
- team (str): Full team name (e.g., "Red Bull Racing", "Mercedes", "Ferrari")
  - Format: Free-form string (no length restrictions)
  - Purpose: Added as a constant column to all rows
  - Validation: No validation performed (assumed valid from upstream)
- lib (str): DataFrame library to use ("pandas" or "polars")
  - pandas: Uses pd.DataFrame(data, copy=False) for zero-copy construction
  - polars: Uses pl.DataFrame(data, strict=False) for flexible schema inference
  - Default: No default (must be explicitly specified)
Returns
DataFrame: Raw lap DataFrame with unnormalized column names
- Columns: Raw JSON keys (e.g., "lap", "time", "s1") + "Driver" + "Team"
- Types: Inferred from input data (not coerced yet)
- Order: Arbitrary (column order not guaranteed)
- Note: Column renaming and type coercion happen later in _process_lap_df
Raw Columns Created
The function creates the following columns (before renaming):
Core Timing Columns:
- lap: Lap number (1-indexed integer/float)
- time: Lap time in seconds (float)
- s1, s2, s3: Sector times in seconds (float)
- sesT: Session time at lap end in seconds (float)
Speed Columns:
- vi1, vi2: Speed trap 1 and 2 in km/h (float)
- vfl: Finish line speed in km/h (float)
- vst: Speed trap in km/h (float)
Tire Columns:
- compound: Tire compound name (string: SOFT, MEDIUM, HARD, INTERMEDIATE, WET)
- life: Tire age in laps (integer)
- stint: Stint number (integer)
- fresh: Fresh tire flag (boolean)
Status and Driver Columns:
- pb: Personal best lap flag (boolean)
- status: Track status code (string: “1”, “2”, “4”, “5”, “6”, “7”)
- pos: Position at lap end (integer)
- dNum: Driver number (string)
- drv: Driver code (string, may differ from the driver parameter)
- team: Team name (string, may differ from the team parameter)
Flag Columns:
- del: Lap deleted flag (boolean)
- delR: Deletion reason (string)
- ff1G: FastF1 generated data flag (boolean)
- iacc: Accuracy flag (boolean)
Pit Columns:
- pout: Pit out time in seconds (float)
- pin: Pit in time in seconds (float)
Session Timing Columns:
- s1T, s2T, s3T: Session times at sector ends in seconds (float)
- lST: Lap start time in seconds (float)
- lSD: Lap start date (string)
Weather Columns:
- wT: Weather sample time in seconds (float)
- wAT: Air temperature in Celsius (float)
- wTT: Track temperature in Celsius (float)
- wH: Humidity percentage (float)
- wP: Pressure in mbar (float)
- wR: Rainfall flag (boolean)
- wWD: Wind direction in degrees (float)
- wWS: Wind speed in m/s (float)
Added Columns:
- Driver: Driver code from the driver parameter (string)
- Team: Team name from the team parameter (string)
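The constant Driver and Team columns can be pictured as a dictionary-level step before DataFrame construction. The helper below is a hypothetical sketch (with_driver_and_team is not a tif1 function); it assumes list-valued columns and replaces any pre-existing "Driver" or "Team" keys with the parameter values.

```python
def with_driver_and_team(lap_data, driver, team):
    """Return a copy of lap_data with constant Driver and Team columns."""
    n_rows = max((len(v) for v in lap_data.values() if isinstance(v, list)), default=0)
    # Drop any existing Driver/Team keys so the parameters always win
    columns = {k: v for k, v in lap_data.items() if k not in ("Driver", "Team")}
    columns["Driver"] = [driver] * n_rows
    columns["Team"] = [team] * n_rows
    return columns


columns = with_driver_and_team(
    {"lap": [1, 2], "time": [95.1, 92.7], "Driver": ["???", "???"]},
    driver="VER",
    team="Red Bull Racing",
)
```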
Array Length Normalization
The function automatically normalizes mismatched array lengths (required in Python 3.12+):
- Calculate maximum length across all arrays
- Pad short arrays with None values to match max length
- Replicate scalar values to match max length
- Handle numpy arrays and other array-like objects
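The normalization steps listed above could look roughly like the following. This is a sketch under assumptions: normalize_lengths is a hypothetical name, strings and bytes are treated as scalars, and anything else with a length is treated as array-like and converted with list().

```python
def normalize_lengths(data):
    """Pad short arrays with None and replicate scalars to a common length."""

    def as_list(value):
        # Strings have a length but are treated as scalars here
        if isinstance(value, (str, bytes)) or not hasattr(value, "__len__"):
            return None
        return list(value)  # also works for tuples and numpy arrays

    lists = {key: as_list(value) for key, value in data.items()}
    max_len = max((len(v) for v in lists.values() if v is not None), default=1)

    normalized = {}
    for key, value in data.items():
        seq = lists[key]
        if seq is None:
            normalized[key] = [value] * max_len                     # replicate scalar
        else:
            normalized[key] = seq + [None] * (max_len - len(seq))   # pad with None
    return normalized


fixed = normalize_lengths({"lap": [1, 2, 3], "s1": [30.1], "compound": "SOFT"})
```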
Backend-Specific Behavior
Pandas Backend (lib="pandas"):
Polars Backend (lib="polars"):
Example Usage
Basic Usage:
Performance Characteristics
- Time complexity: O(n × m) where n = number of rows, m = number of columns
- Space complexity: O(n × m) for DataFrame storage
- Zero-copy optimization: Avoids data duplication when possible
- Typical performance:
- 50 laps × 40 columns: ~1-2ms (pandas), ~2-3ms (polars)
- 1000 laps × 40 columns: ~10-20ms (pandas), ~15-25ms (polars)
Driver/Team Columns: The driver and team parameters are added as constant columns to all rows. If the input lap_data already contains "Driver" or "Team" keys, they are removed before adding the parameter values. This ensures consistency and prevents duplicate columns.
_create_session_df
Parameters
- data (dict[str, Any]): Raw data dictionary with arrays (columnar format)
  - Keys: JSON field names (abbreviated or snake_case)
  - Values: Lists/arrays of primitive values
  - Structure: All arrays should have consistent lengths
- rename_map (dict[str, str]): Column rename mapping dictionary
  - Purpose: Maps JSON keys to DataFrame column names
  - Format: {json_key: dataframe_column}
  - Available maps:
    - WEATHER_RENAME_MAP: Weather data columns
    - RACE_CONTROL_RENAME_MAP: Race control message columns
    - TELEMETRY_RENAME_MAP: Telemetry data columns
    - LAP_RENAME_MAP: Lap timing data columns
  - Location: src/tif1/core_utils/constants.py
- lib (str): DataFrame library to use ("pandas" or "polars")
  - pandas: Uses pd.DataFrame(data, copy=False) for zero-copy construction
  - polars: Uses pl.DataFrame(data, strict=False) for flexible schema inference
Returns
DataFrame: Session DataFrame with renamed columns
- Columns: Renamed according to rename_map (PascalCase)
- Types: Inferred from input data (no type coercion applied)
- Order: Arbitrary (column order not guaranteed)
- Empty handling: Returns empty DataFrame if input is empty
Column Rename Maps
Weather Rename Map (WEATHER_RENAME_MAP):
Race Control Rename Map (RACE_CONTROL_RENAME_MAP):
Telemetry Rename Map (TELEMETRY_RENAME_MAP):
Implementation Details
The function performs three main operations:
- DataFrame Construction: Creates DataFrame using zero-copy optimization
- Empty Check: Returns empty DataFrame if input is empty
- Column Renaming: Applies rename map to transform column names
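For the pandas backend, the three operations above can be sketched in a few lines. The function name create_session_df and the EXAMPLE_WEATHER_MAP subset are assumptions for this example (the subset reuses entries from the weather mapping table); the real WEATHER_RENAME_MAP may contain more keys.

```python
import pandas as pd

# Hypothetical subset of the weather rename map, taken from the mapping tables.
EXAMPLE_WEATHER_MAP = {"wT": "WeatherTime", "wAT": "AirTemp", "wR": "Rainfall"}


def create_session_df(data, rename_map):
    """Build a pandas DataFrame from columnar data and rename its columns."""
    df = pd.DataFrame(data, copy=False)    # 1. construction
    if df.empty:                           # 2. empty check
        return df
    return df.rename(columns=rename_map)   # 3. column renaming


weather = {"wT": [0.0, 60.0], "wAT": [18.5, 18.7], "wR": [False, True]}
weather_df = create_session_df(weather, EXAMPLE_WEATHER_MAP)
```

No type coercion happens in this sketch, matching the description: column dtypes are whatever pandas infers from the input lists.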
Example Usage
Weather Data:
Backend-Specific Behavior
Pandas Backend (lib="pandas"):
Polars Backend (lib="polars"):
Performance Characteristics
- Time complexity: O(n × m) where n = number of rows, m = number of columns
- Space complexity: O(n × m) for DataFrame storage
- Zero-copy optimization: Avoids data duplication when possible
- Typical performance:
- Weather data (200 rows × 8 cols): ~1-3ms (pandas), ~2-4ms (polars)
- Race control (50 rows × 10 cols): ~0.5-2ms (pandas), ~1-3ms (polars)
- Telemetry (10000 rows × 15 cols): ~50-100ms (pandas), ~40-80ms (polars)
Use Cases
This function is used throughout the pipeline for:
- Weather DataFrames: Converting weather JSON to DataFrames
- Race Control DataFrames: Converting race control messages to DataFrames
- Telemetry DataFrames: Converting telemetry JSON to DataFrames (before lap-specific processing)
- Custom Session Data: Any session-level data that needs column renaming
No Type Coercion: This function does NOT perform type coercion. Types are inferred from the input data. For lap DataFrames that require type coercion (timedelta conversion, categorical types, etc.), use _create_lap_df followed by _process_lap_df.
_process_lap_df
Parameters
- lap_df (DataFrame): Raw lap DataFrame from _create_lap_df
  - Columns: Raw JSON keys (e.g., "lap", "time", "s1", "s2")
  - Types: Inferred types from JSON (not coerced yet)
  - Order: Arbitrary column order
  - Source: Output from _create_lap_df
- lib (str): DataFrame library ("pandas" or "polars")
  - pandas: Full type coercion with categorical types
  - polars: Selective type coercion (categorical types optional)
Returns
DataFrame: Fully processed lap DataFrame with:
- Renamed columns: PascalCase FastF1-compatible names
- Proper data types: timedelta64[ns], float64, bool, category, etc.
- Categorical types: Applied to Driver, Team, Compound, TrackStatus (pandas default)
- FastF1 column order: Matches the FASTF1_LAPS_COLUMN_ORDER constant
- Additional columns: LapTimeSeconds (float representation of LapTime)
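The column-ordering step can be illustrated on plain lists of column names. This sketch makes two assumptions that are not confirmed by the text: EXAMPLE_COLUMN_ORDER is only a short illustrative subset standing in for FASTF1_LAPS_COLUMN_ORDER, and columns outside the preferred order (such as LapTimeSeconds) are appended at the end in their original order.

```python
# Short illustrative subset; the real FASTF1_LAPS_COLUMN_ORDER is longer.
EXAMPLE_COLUMN_ORDER = ["Time", "Driver", "DriverNumber", "LapTime", "LapNumber"]


def reorder_columns(columns, preferred_order):
    """Put known columns in the preferred order, then append the rest."""
    present = set(columns)
    known = set(preferred_order)
    ordered = [c for c in preferred_order if c in present]
    extras = [c for c in columns if c not in known]
    return ordered + extras


new_order = reorder_columns(
    ["LapNumber", "LapTimeSeconds", "Driver", "Time"], EXAMPLE_COLUMN_ORDER
)
```

With pandas, the resulting list could then be applied by indexing the DataFrame with it, for example df[new_order].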
Transformations Applied
The function applies six major transformations in sequence:
1. Duplicate Column Removal (pandas only):
FastF1-Compatible Column Order
The final DataFrame has columns in this exact order (matching FastF1):
Type Coercion Details
Timedelta Columns (pandas):
- LapTime, Time, Sector1Time, Sector2Time, Sector3Time
- Sector1SessionTime, Sector2SessionTime, Sector3SessionTime
- PitOutTime, PitInTime, LapStartTime, WeatherTime
- Conversion: Float seconds → timedelta64[ns]
- Method: pd.to_timedelta(values, unit='s')
Numeric Columns (pandas):
- LapNumber, Stint, TyreLife, Position
- SpeedI1, SpeedI2, SpeedFL, SpeedST
- AirTemp, TrackTemp, Humidity, Pressure, WindDirection, WindSpeed
- LapTimeSeconds
- Conversion: Mixed types → float64
- Method: pd.to_numeric(values, errors='coerce')
Boolean Columns (pandas):
- IsPersonalBest, FreshTyre, FastF1Generated, IsAccurate, Rainfall
- Conversion: Mixed boolean representations → bool
- Method: values.fillna(False).astype(bool)
Nullable Boolean Columns (pandas):
- Deleted (pandas nullable boolean type)
- Conversion: Mixed boolean representations → boolean
- Method: values.astype('boolean')
String Columns (pandas):
- DriverNumber, DeletedReason, LapStartDate, QualifyingSession
- Conversion: No conversion (kept as object dtype)
Categorical Columns (pandas):
- Driver, Team, Compound, TrackStatus
- Conversion: String → category
- Method: values.astype('category')
- Memory savings: 50-80% reduction for low-cardinality columns
Backend-Specific Behavior
Pandas Backend (lib="pandas"):
Polars Backend (lib="polars"):
Configuration Options
Categorical Types in Polars:
Example Usage
Basic Processing:
Performance Characteristics
- Time complexity: O(n × m) where n = number of rows, m = number of columns
- Space complexity: O(n × m) for DataFrame storage
- Typical performance (pandas):
- 50 laps: ~2-5ms
- 1000 laps: ~20-40ms
- 10000 laps: ~200-400ms
- Typical performance (polars):
- 50 laps: ~3-7ms
- 1000 laps: ~15-30ms
- 10000 laps: ~150-300ms
Performance Breakdown
| Operation | Time (1000 laps) | Percentage |
|---|---|---|
| Column renaming | ~2ms | 10% |
| Timedelta conversion | ~8ms | 40% |
| Type coercion | ~5ms | 25% |
| Categorical conversion | ~3ms | 15% |
| Column reordering | ~2ms | 10% |
| Total | ~20ms | 100% |
Categorical Types: Categorical types provide significant memory savings (50-80%) for columns with low cardinality (Driver, Team, Compound, TrackStatus). However, they add overhead for small datasets (<100 laps). For maximum performance with small datasets, consider disabling categorical types.
Column naming conventions
The I/O pipeline transforms raw JSON keys to FastF1-compatible column names:
| JSON Key | DataFrame Column | Type | Description |
|---|---|---|---|
| lap | LapNumber | float64 | Lap number (1-indexed) |
| time | LapTime | timedelta64[ns] | Lap time |
| s1 | Sector1Time | timedelta64[ns] | Sector 1 time |
| s2 | Sector2Time | timedelta64[ns] | Sector 2 time |
| s3 | Sector3Time | timedelta64[ns] | Sector 3 time |
| compound | Compound | str/category | Tire compound (SOFT, MEDIUM, HARD, INTERMEDIATE, WET) |
| life | TyreLife | float64 | Tire age in laps |
| stint | Stint | float64 | Stint number |
| pb | IsPersonalBest | bool | Personal best lap flag |
| vi1 | SpeedI1 | float64 | Speed trap 1 (km/h) |
| vi2 | SpeedI2 | float64 | Speed trap 2 (km/h) |
| vfl | SpeedFL | float64 | Finish line speed (km/h) |
| vst | SpeedST | float64 | Speed trap (km/h) |
| status | TrackStatus | str/category | Track status code |
| pos | Position | float64 | Position at lap end |
| del | Deleted | boolean | Lap deleted flag |
| delR | DeletedReason | str | Reason for deletion |
| ff1G | FastF1Generated | bool | FastF1 generated data flag |
| sesT | Time | timedelta64[ns] | Session time at lap end |
| dNum | DriverNumber | str | Driver number |
| pout | PitOutTime | timedelta64[ns] | Pit out time |
| pin | PitInTime | timedelta64[ns] | Pit in time |
The complete mapping is defined in LAP_RENAME_MAP in src/tif1/core_utils/constants.py. Both validated (snake_case) and raw (abbreviated) JSON keys are supported.
Library Support
The pipeline supports both pandas and polars libraries:
- pandas: Uses pd.DataFrame(data, copy=False) for zero-copy construction
- polars: Uses pl.DataFrame(data, strict=False) with schema inference
- pandas: Applies categorical types by default for Driver, Team, Compound, TrackStatus
- polars: Categorical types disabled by default (enable with the polars_lap_categorical config option)
Data Validation
When validate_data is enabled in config, _validate_json_payload validates raw JSON using Pydantic schemas:
- Required fields: Ensures all required fields are present in JSON
- Type checking: Validates data types match schema definitions
- Value ranges: Checks values are within expected ranges
- Referential integrity: Validates driver codes, lap numbers, etc.
Validation is controlled by the validate_data config option. When disabled, raw JSON is passed through without validation for maximum performance.
Performance Considerations
The I/O pipeline is heavily optimized for speed:
- Zero-copy construction: Uses copy=False in pandas, strict=False in polars
- Batch processing: Processes all laps at once, not row-by-row
- Vectorized operations: Uses numpy/pandas vectorization for type coercion
- Minimal allocations: Reuses arrays where possible, avoids intermediate copies
- Lazy categorical: Categorical types applied only when beneficial
Typical performance:
- Process 50 laps: ~2-5ms
- Process 1000 laps: ~20-40ms
- Full session (20 drivers × 50 laps): ~100-200ms
Internal Implementation
Column Renaming Strategy
The pipeline maintains two sets of column names:
- JSON keys: Abbreviated keys like "lap", "s1", "vi1" (raw) or snake_case like "lap_number", "sector_1_time" (validated)
- DataFrame columns: PascalCase like "LapNumber", "Sector1Time", "SpeedI1"
Renaming is applied in _process_lap_df() using LAP_RENAME_MAP from core_utils/constants.py. The map supports both raw and validated JSON keys for maximum compatibility.
Type Coercion
The pipeline coerces types to ensure FastF1 compatibility:
- Lap times (float seconds) → timedelta64[ns]
- Session times (float seconds) → timedelta64[ns]
- Lap numbers → float64 (not int, to allow NaN)
- Boolean flags → bool (fillna False for non-nullable)
- Deleted flag → boolean (nullable bool)
- Categorical data → category (pandas only by default)
- Driver numbers → str (not int, to preserve leading zeros)
Missing Data Handling
Missing values are handled gracefully:
- Numeric fields: NaN (pandas) or null (polars)
- String fields: empty string or null
- Boolean fields: False (fillna applied)
- Deleted field: null (nullable boolean)
- Timedelta fields: NaT (not-a-time)
In strict validation mode, the pipeline raises InvalidDataError for missing required fields.
Array Length Normalization
_create_lap_df normalizes mismatched array lengths (required in Python 3.12+):
- Calculates max length across all arrays
- Pads short arrays with None values
- Replicates scalar values to match max length