ADR-0006 — Statistical signal audit & positive control¶

Status: Accepted
Date: 2026-06-04
Author: Maxime GOURGUECHON
Related: ADR-0002, reports/signal_audit.md

Context¶

ADR-0002 established (via correlation, ANOVA and MI) that the dataset is signal- free. Two objections remain that a rigorous reviewer would raise:

"Absence of evidence isn't evidence of absence — prove the null formally."
"Maybe your pipeline is just broken."

Decision¶

Add a dataset-agnostic bmw_sales.audit package providing falsifiable, transferable evidence:

Permutation (label-shuffle) test — compares the real held-out score to a null distribution from shuffled labels; a one-sided p-value quantifies the probability of seeing the real score by chance. Result: p ≈ 0.90 (regression), p ≈ 0.81 (classification) → indistinguishable from chance.
Positive control — runs the identical pipeline on a synthetic target that is a known function of the features. Result: R² ≈ 0.86 (synthetic) vs ≈ 0 (real) → the pipeline is sound; the data is empty.
KS uniformity and chi-squared independence — fingerprint the synthetic, uniform data-generating process.

The audit is surfaced in reports/signal_audit.md (via make audit) and as a "Statistical proof" panel in the dashboard's Data Integrity tab.

Rationale¶

This converts the project's central claim from an assertion into a reusable research instrument: the same module audits any dataset for exploitable signal. The positive control is the single most convincing artefact — it pre-empts the "your model is broken" objection with a number.

Consequences¶

+ Decisive, reproducible evidence; a transferable tool beyond this dataset.
+ Strengthens the credibility of every downstream null result.
− Permutation testing is compute-heavy; mitigated by sub-sampling and a fast HistGradientBoosting estimator, and cached in the app.