Formula 1 Race Prediction & AI Analytics Platform

2026

Client / Project: Formula 1 Prediction System
Year: 2026
Industry: Motorsport Analytics / Artificial Intelligence
Tech Stack: Python, XGBoost, LightGBM, CatBoost, QLoRA, Hugging Face, Ollama, Pandas, BeautifulSoup

Key Features

Historical Race Data Pipeline

Custom web scraping infrastructure collects and maintains three decades of Formula 1 race data (1996 through 2026), with intelligent HTML caching to minimize redundant requests.
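A minimal sketch of how disk-based HTML caching of this kind might work, using only the Python standard library. The function names, the SHA-256 file naming scheme, and the injectable `fetch` argument are illustrative assumptions, not the project's actual code.

```python
import hashlib
from pathlib import Path
from typing import Callable
from urllib.request import urlopen


def fetch_over_network(url: str) -> str:
    """Download a page and decode it as UTF-8 (hypothetical default fetcher)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def cached_fetch(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], str] = fetch_over_network,
) -> str:
    """Return the HTML for `url`, calling `fetch` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so any address maps to a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.html"

    if cache_file.exists():
        # Cache hit: reuse the stored page, no network request.
        return cache_file.read_text(encoding="utf-8")

    # Cache miss: fetch once and store the page for later runs.
    html = fetch(url)
    cache_file.write_text(html, encoding="utf-8")
    return html
```

Passing the fetcher as an argument keeps the cache logic testable without network access. A real scraper would likely also need a way to expire or refresh pages for seasons still in progress.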

Advanced Feature Engineering

Generates 44 predictive features per driver per race, including rolling form metrics, championship standings, and circuit-specific performance statistics.
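As an illustration of a rolling form metric, the sketch below computes each driver's average finishing position over their previous few races. The field names (`race_index`, `driver`, `position`), the window size, and the feature name are assumptions made for the example. It is written in plain Python to stay self-contained; in pandas the same idea is usually expressed as a grouped shift followed by a rolling mean.

```python
from collections import defaultdict, deque
from statistics import mean


def add_rolling_form(results, window=5):
    """Add 'avg_finish_last_n' to each row, using only that driver's earlier races.

    `results` is a list of dicts with hypothetical keys
    'race_index' (chronological order), 'driver', and 'position'.
    """
    # One bounded history per driver; old results fall out of the window.
    history = defaultdict(lambda: deque(maxlen=window))
    enriched = []
    for row in sorted(results, key=lambda r: r["race_index"]):
        past = history[row["driver"]]
        # Feature is None when the driver has no earlier races.
        feature = mean(past) if past else None
        enriched.append({**row, "avg_finish_last_n": feature})
        # Record the result only after the feature is computed,
        # so the current race never contributes to its own feature.
        past.append(row["position"])
    return enriched
```

Updating the history after computing the feature is what keeps the current race's result out of its own input row.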

Multi-Model Prediction Engine

Combines six predictive approaches (XGBoost, LightGBM, CatBoost, Random Forest, ranking models, and large language models) to predict race winners, podium finishers, and final race positions.

Data Leakage Prevention

Strict historical look-back methodology ensures models only use information available before each race weekend.
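Two simple guards can support a look-back methodology like this: a chronological train/test split and a check that no feature row drew on data from the race date or later. The sketch below assumes hypothetical row keys `season`, `race_date`, and `latest_source_date` (the date of the newest data point used to build that row's features); it is not the project's actual validation code.

```python
from datetime import date


def chronological_split(rows, test_season):
    """Train on seasons strictly before `test_season`, evaluate on that season."""
    train = [row for row in rows if row["season"] < test_season]
    test = [row for row in rows if row["season"] == test_season]
    return train, test


def find_lookahead_rows(rows):
    """Return rows whose features used data from on or after the race date."""
    return [
        row for row in rows
        if row["latest_source_date"] >= row["race_date"]
    ]


if __name__ == "__main__":
    # Placeholder rows for illustration only, not real race data.
    sample = [
        {"season": 2023, "race_date": date(2023, 3, 5),
         "latest_source_date": date(2022, 11, 20)},
        {"season": 2024, "race_date": date(2024, 3, 2),
         "latest_source_date": date(2023, 11, 26)},
    ]
    train, test = chronological_split(sample, test_season=2024)
    print(len(train), len(test), len(find_lookahead_rows(sample)))
```

Splitting by season rather than at random keeps results from later races out of the training set when evaluating on an earlier or held-out season.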

Local LLM Race Analysis

Integrates local large language models for race prediction and comparative AI reasoning without cloud dependencies.

Fine-Tuned Motorsport AI

A custom QLoRA-trained language model specializes in Formula 1 race analysis and prediction tasks.
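Instruction-based fine-tuning data is commonly stored as one JSON object per line. The sketch below shows one way a race could be turned into such an example; the `instruction` / `input` / `output` keys follow a widespread convention, and the prompt wording, the use of the starting grid, and the helper names are all assumptions rather than the project's actual format.

```python
import json


def build_instruction_example(race_name, season, grid, podium):
    """Format one race as an instruction/input/output training example.

    `grid` is the starting order and `podium` the actual top three;
    both are lists of driver names (hypothetical inputs).
    """
    grid_text = "\n".join(
        f"P{slot}: {driver}" for slot, driver in enumerate(grid, start=1)
    )
    return {
        "instruction": "Predict the podium for the race described below.",
        "input": f"Race: {race_name} ({season})\nStarting grid:\n{grid_text}",
        "output": ", ".join(podium),
    }


def to_jsonl(examples):
    """Serialize examples as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)


if __name__ == "__main__":
    # Placeholder names only; not real race data.
    example = build_instruction_example(
        "Example Grand Prix", 2024,
        grid=["Driver A", "Driver B", "Driver C", "Driver D"],
        podium=["Driver B", "Driver A", "Driver C"],
    )
    print(to_jsonl([example]))
```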

Interactive Data Visualisation

Comprehensive analytics dashboards reveal long-term trends, driver performance, and circuit characteristics.

Live Race Prediction Workflow

End-to-end pipeline generates predictions for upcoming race weekends using freshly collected data.

Business Impact

High Prediction Accuracy

LightGBM achieved 87.5% top-three prediction accuracy during the 2024 season evaluation.
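The exact definition of top-three accuracy is not spelled out here. One plausible reading, sketched below as an assumption, treats the podium as a per-driver yes/no label: in each race the three drivers with the highest predicted scores are marked as podium picks, and accuracy is the share of all driver entries whose predicted label matches the actual one.

```python
def top_three_accuracy(races):
    """Per-driver podium classification accuracy across races.

    `races` is a list of races; each race is a list of
    (predicted_score, actual_position) tuples, one per driver.
    """
    correct = 0
    total = 0
    for drivers in races:
        # Rank driver indices by predicted score, highest first.
        ranked = sorted(
            range(len(drivers)), key=lambda i: drivers[i][0], reverse=True
        )
        predicted_podium = set(ranked[:3])
        for i, (_, actual_position) in enumerate(drivers):
            predicted = i in predicted_podium
            actual = actual_position <= 3
            correct += predicted == actual
            total += 1
    if total == 0:
        raise ValueError("no driver entries to evaluate")
    return correct / total
```

Under this definition most drivers in a full field are non-podium finishers, so the figure is best read alongside a stricter measure such as how many of the three predicted podium drivers actually finished in the top three.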

Reduced Infrastructure Costs

QLoRA fine-tuning produced a smaller specialized model that performs close to a much larger 14-billion-parameter model while requiring significantly less hardware.
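A rough back-of-the-envelope calculation shows why model size and 4-bit quantization matter for hardware requirements. QLoRA keeps the frozen base weights in 4-bit form and trains only small adapter layers. The sketch below estimates weight storage alone, using round parameter counts of 3 billion and 14 billion as assumptions; it ignores quantization constants, adapter weights, activations, and optimizer state, so real memory use is higher.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


if __name__ == "__main__":
    # Nominal, rounded parameter counts (assumptions for illustration).
    for label, params in [("3B model", 3e9), ("14B model", 14e9)]:
        fp16 = weight_memory_gib(params, 16)
        int4 = weight_memory_gib(params, 4)
        print(f"{label}: {fp16:.1f} GiB at 16-bit, {int4:.1f} GiB at 4-bit")
```

Under these assumptions, the weights of a 3-billion-parameter model take roughly 1.4 GiB at 4 bits, against about 26 GiB for a 14-billion-parameter model at 16 bits.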

Scalable Data Collection

Disk-based caching lets repeat runs reuse previously downloaded pages, cutting scraping time and avoiding unnecessary network requests.

Explainable Predictions

Feature engineering and visualization tools provide transparency into prediction outcomes.

Rapid Model Experimentation

Multiple model architectures enable continuous evaluation and optimization.

Real-Time Race Forecasting

The platform supports live prediction generation for future race weekends using newly collected data.

Efficient Local Deployment

The entire AI pipeline runs on consumer-grade hardware with a single RTX 4050 laptop GPU.

Future Cloud Expansion

The architecture is prepared for GPU cloud deployment and API-based prediction services.

The Challenge

Formula 1 race prediction is exceptionally difficult because race outcomes are influenced by dozens of interconnected variables including driver form, team performance, circuit characteristics, weather conditions, reliability, and championship context. Most publicly available datasets are incomplete, inconsistent, or lack the historical depth required to build reliable predictive models.

Another major challenge was preventing data leakage. Motorsport datasets frequently contain information that would not have been available before a race weekend, causing models to produce unrealistically high accuracy during testing. Building trustworthy predictions required a strict historical approach where every feature could only use information known before the race began.

Hardware limitations also presented significant constraints. Fine-tuning modern language models typically requires enterprise-grade GPUs, while this project was developed entirely on consumer hardware equipped with an RTX 4050 laptop GPU. Achieving competitive performance under these limitations required careful optimization of both machine-learning and language-model training workflows.

The Outcome

The final solution is a complete end-to-end Formula 1 prediction platform built entirely from scratch. A custom scraping system collects and maintains race data from 1996 through 2026, covering 590 races, 134 drivers, 42 constructors, and 41 circuits. The data pipeline transforms raw race information into a machine-learning dataset containing 44 engineered features for every driver and race combination.

Six predictive approaches were trained and evaluated, including XGBoost, LightGBM, CatBoost, Random Forest, ranking models, and large language models. LightGBM achieved the highest top-three prediction accuracy of 87.5%, while CatBoost delivered the lowest position prediction error with a mean absolute error of 2.71 positions.

The project also successfully demonstrated affordable local AI fine-tuning using QLoRA techniques. A Qwen2.5-3B-Instruct model was trained on nearly 500 instruction-based racing examples, producing a specialized motorsport assistant that achieved near-parity with a much larger 14-billion-parameter model while requiring substantially fewer resources.

Combined with live race prediction capabilities and detailed visual analytics, the platform delivers a practical and scalable foundation for AI-powered motorsport intelligence.