ADR-0001: Project architecture & technology stack

Status: Accepted
Date: 2026-06-03
Author: Maxime GOURGUECHON

Context

We are building a production-grade analytics & forecasting platform on the BMW_sales_data_(2010-2024) dataset (50,000 rows, 11 columns, no missing values). The deliverable targets a senior data-science portfolio: it must demonstrate econometrics, ML, DL, external-data augmentation, a premium UI, containerisation and CI/CD — with clean, modular, typed, tested code.

Decision

Layout: `src/`-layout package (`bmw_sales`)

The code ships as an installable src/-layout package (`pip install -e .`) rather than as loose notebook scripts. This forces explicit imports, prevents accidental reliance on the current working directory, and lets the tests, the Streamlit app and the Docker image import exactly the same code.

src/bmw_sales/
  config.py          # typed settings + canonical dataset schema
  data/              # loading, validation, integrity report
  audit/             # No-Signal Auditor: permutation, positive control (ADR-0006)
  apis/              # hybrid real+mock external clients
  features/          # feature engineering pipelines
  econometrics/      # OLS / hedonic / elasticity models
  models/            # ML (XGB/LGBM/CatBoost) + DL (tabular NN) + MLflow tracking
  simulation/        # Scenario Simulator + Monte-Carlo uncertainty (ADR-0008)
  explainability/    # SHAP analysis
  sql/               # DuckDB analytics over sql/queries/*.sql (ADR-0007)
app/                 # Streamlit premium UI (7 tabs)
tests/               # pytest (unit + integration)
docs/                # MkDocs Material site + ADRs (ADR-0009)

This layout has grown with the project; the modules added later are recorded in their own ADRs (signal audit · SQL · uncertainty · observability), cross-linked above and listed in the ADR index.
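
To give a feel for what the `audit/` module in the tree above is for, here is a minimal sketch of a permutation check with a positive control. It is an illustration only: the function names, the use of Pearson correlation as the score, the permutation count and the synthetic data are all assumptions, and the actual design is the one recorded in ADR-0006.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def permutation_p_value(xs, ys, n_permutations=200, seed=0):
    """Share of shuffled targets that score at least as well as the real one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed ordering itself, so p is never exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Positive control (hypothetical data): a target with a planted linear signal.
rng = random.Random(42)
feature = [rng.gauss(0, 1) for _ in range(200)]
planted = [2.0 * x + rng.gauss(0, 0.1) for x in feature]
control_p = permutation_p_value(feature, planted)
```

The positive control matters because a low-powered auditor would also report "no signal" on data that does contain one. Confirming that the planted signal yields a small p-value is what makes a large p-value on the real target informative.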

Configuration: `pydantic-settings`

A single typed `Settings` object reads its values from environment variables and a `.env` file. There are no magic strings: the dataset schema lives in one `DatasetSchema` class, so renaming a column is a single edit that the type checker verifies.
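
The idea can be sketched as follows. To keep the example runnable without installing anything, standard-library dataclasses stand in for `pydantic-settings` here, so this is not the project's actual `config.py`. The environment variable names, default values and the three column names are hypothetical placeholders.

```python
import os
from dataclasses import dataclass, field
from pathlib import Path


@dataclass(frozen=True)
class DatasetSchema:
    """Canonical column names. The three below are hypothetical examples."""
    model: str = "Model"
    price: str = "Price_USD"
    sales_volume: str = "Sales_Volume"


@dataclass(frozen=True)
class Settings:
    """Typed settings; each field falls back to a default when the env var is unset."""
    data_path: Path = field(
        default_factory=lambda: Path(
            os.environ.get("BMW_DATA_PATH", "data/bmw_sales.csv")
        )
    )
    use_mock_apis: bool = field(
        default_factory=lambda: os.environ.get(
            "BMW_USE_MOCK_APIS", "true"
        ).lower() == "true"
    )
    columns: DatasetSchema = field(default_factory=DatasetSchema)


settings = Settings()

# Callers refer to columns through the schema, never through string literals.
target_column = settings.columns.sales_volume
```

Because every consumer writes `settings.columns.sales_volume` rather than a literal string, renaming the column means changing one default, and a misspelled attribute is caught by mypy instead of surfacing as a runtime `KeyError`.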

Stack rationale

| Concern | Choice | Why |
| --- | --- | --- |
| Econometrics | `statsmodels` | p-values, confidence intervals, robust standard errors; explanatory, not just predictive. |
| ML | XGBoost / LightGBM / CatBoost | State-of-the-art gradient boosting on tabular data; CatBoost handles categorical features natively. |
| DL | PyTorch tabular MLP | Benchmarked against ML to justify (or refute) its use (see ADR-0004). |
| Explainability | SHAP | Model-agnostic, board-ready feature attributions. |
| External data | `requests` + `tenacity` | Hybrid real+mock with retry/circuit-breaker; offline-safe. |
| UI | Streamlit + Plotly | Fast, interactive, fully themeable to the BMW luxury identity. |
| SQL analytics | DuckDB-over-CSV | Portable, reviewable SQL with no ETL or server (ADR-0007). |
| Experiment tracking | MLflow (file store) | Zero-infrastructure run history (ADR-0009). |
| Docs | MkDocs Material + mkdocstrings | Docs-as-code: ADRs + auto API reference (ADR-0009). |
| Packaging | Docker multi-stage | Small, reproducible runtime image. |
| CI/CD | GitHub Actions | Lint · mypy · pytest (coverage gate) · pip-audit · Docker + Trivy (ADR-0005/0007). |
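
The "hybrid real+mock" and "offline-safe" behaviour in the External data row can be pictured with a small sketch. It is written with the standard library only, so a hand-rolled retry loop stands in for `tenacity` and plain callables stand in for `requests` calls. The function names, the retry count and the canned payload are hypothetical.

```python
import time
from typing import Callable


def fetch_with_fallback(
    fetch_real: Callable[[], dict],
    fetch_mock: Callable[[], dict],
    attempts: int = 3,
    base_delay: float = 0.0,
) -> dict:
    """Try the real source a few times, then fall back to the mock payload."""
    for attempt in range(attempts):
        try:
            return {**fetch_real(), "source": "real"}
        except OSError:
            # Network-style failure: back off exponentially before the next try.
            if attempt < attempts - 1:
                time.sleep(base_delay * (2 ** attempt))
    return {**fetch_mock(), "source": "mock"}


calls = []


def flaky_real() -> dict:
    """Hypothetical endpoint that is always unreachable (simulates being offline)."""
    calls.append(1)
    raise ConnectionError("offline")


def canned_mock() -> dict:
    """Hypothetical canned payload; the value is a placeholder, not real data."""
    return {"eur_usd": 1.1}


result = fetch_with_fallback(flaky_real, canned_mock)
```

Tagging each payload with its `source` lets downstream code and the UI state whether a figure came from a live API or from the mock, which keeps offline runs reproducible without hiding the substitution.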

Consequences

+ Clear separation of concerns; every layer is independently testable.
+ Offline-by-default reproducibility (CI never flakes on a third-party API).
− More upfront structure than a notebook; justified by the production bar.

A critical data-quality finding shapes the analytical framing of this project; it is recorded separately in ADR-0002 (Data Integrity).