ADR-0002 — Data integrity finding & honest-modelling strategy¶
- Status: Accepted
- Date: 2026-06-03
- Author: Maxime GOURGUECHON
- Supersedes: —
- Related: ADR-0001,
reports/data_integrity_report.md
Context¶
Before committing to a modelling approach, I ran a formal data-integrity analysis
(bmw_sales.data.validation). It produced three reproducible findings:
- Structurally pristine — 50,000 rows, 0 nulls, 0 duplicates.
- No predictive signal — every feature is statistically independent of the targets:
- Max absolute off-diagonal Pearson correlation = 0.0087.
- One-way ANOVA of
Sales_Volumeacross each categorical: all p > 0.4 (Model 0.87, Region 0.50, Color 0.42, Fuel 0.73, Transmission 0.74). - Max mutual information across all features = 0.0042 nats (captures non-linear dependence too → none exists).
- Mean
Price_USDis flat across models (74.4k–75.6k), which is economically impossible for a real luxury line-up → confirms a synthetic, uniform DGP. - Target leakage —
Sales_Classification = High ⟺ Sales_Volume ≥ 7000, with zero class overlap. The label is a deterministic threshold on the regression target.
Decision¶
Adopt an honest-modelling strategy with three pillars:
- Report honest baselines. We build the full ML/DL/econometric pipeline and report real metrics, expecting R² ≈ 0 (regression) and ROC-AUC ≈ 0.5 (leakage-free classification). We do not tune toward, or cherry-pick, spurious gains.
- Demonstrate the leakage explicitly. We include
Sales_Volumein one classification run to show the trivial ~100% accuracy, then exclude it as the correct, leakage-free setup — turning the flaw into a teaching artefact. - Deliver business value through a labelled Scenario Simulator. Because the data cannot support forecasting, decision value comes from an explicit simulation grounded in (a) price elasticities from the automotive economics literature and (b) real macro/fuel/FX/CO₂ data pulled from external APIs. It is clearly labelled as a what-if decision-support tool, never presented as a fit to the historical data.
Alternatives considered¶
- Inflate metrics / overfit — rejected: dishonest, and trivially detectable by any senior reviewer (the held-out R² collapses).
- Silently re-synthesise a signal into the target — rejected: would mislead; only acceptable if explicitly labelled (which is exactly what the Simulator is).
- Abandon predictive modelling — rejected: the pipeline itself is the portfolio artefact (engineering, MLOps, testing), independent of dataset signal.
Consequences¶
- + The deliverable showcases senior judgement: detecting, proving and communicating a data problem most pipelines would mask.
- + Reproducible: the report regenerates from raw data via
make eda. - − Headline predictive metrics are intentionally unimpressive; mitigated by framing and the Scenario Simulator. This trade-off is the whole point.