Skip to content

ADR-0001 — Project architecture & technology stack

  • Status: Accepted
  • Date: 2026-06-03
  • Author: Maxime GOURGUECHON

Context

We are building a production-grade analytics & forecasting platform on the BMW_sales_data_(2010-2024) dataset (50,000 rows, 11 columns, no missing values). The deliverable targets a senior data-science portfolio: it must demonstrate econometrics, ML, DL, external-data augmentation, a premium UI, containerisation and CI/CD — with clean, modular, typed, tested code.

Decision

Layout — src/ layout package (bmw_sales)

A src/-layout installable package (pip install -e .) rather than loose notebook scripts. This forces explicit imports, prevents accidental reliance on the CWD, and makes the same code importable from tests, the Streamlit app and Docker identically.

src/bmw_sales/
  config.py          # typed settings + canonical dataset schema
  data/              # loading, validation, integrity report
  audit/             # No-Signal Auditor: permutation, positive control (ADR-0006)
  apis/              # hybrid real+mock external clients
  features/          # feature engineering pipelines
  econometrics/      # OLS / hedonic / elasticity models
  models/            # ML (XGB/LGBM/CatBoost) + DL (tabular NN) + MLflow tracking
  simulation/        # Scenario Simulator + Monte-Carlo uncertainty (ADR-0008)
  explainability/    # SHAP analysis
  sql/               # DuckDB analytics over sql/queries/*.sql (ADR-0007)
app/                 # Streamlit premium UI (7 tabs)
tests/               # pytest (unit + integration)
docs/                # MkDocs Material site + ADRs (ADR-0009)

This layout has grown with the project; the modules added later are recorded in their own ADRs (signal audit · SQL · uncertainty · observability), cross-linked above and listed in the ADR index.

Configuration — pydantic-settings

A single typed Settings object reads from env / .env. No magic strings; the dataset schema lives in one DatasetSchema class so a rename is one edit and is type-checked.

Stack rationale

Concern Choice Why
Econometrics statsmodels p-values, confidence intervals, robust SE — explanatory, not just predictive.
ML XGBoost / LightGBM / CatBoost SOTA gradient boosting on tabular; CatBoost handles native categoricals.
DL PyTorch tabular MLP Benchmarked against ML to justify (or refute) its use — see ADR-0004.
Explainability SHAP Model-agnostic, board-ready feature attributions.
External data requests + tenacity Hybrid real+mock with retry/circuit-breaker; offline-safe.
UI Streamlit + Plotly Fast, interactive, fully themeable to the BMW luxury identity.
SQL analytics DuckDB-over-CSV Portable, reviewable SQL with no ETL or server — ADR-0007.
Experiment tracking MLflow (file store) Zero-infrastructure run history — ADR-0009.
Docs MkDocs Material + mkdocstrings Docs-as-code: ADRs + auto API reference — ADR-0009.
Packaging Docker multi-stage Small, reproducible runtime image.
CI/CD GitHub Actions Lint · mypy · pytest (coverage gate) · pip-audit · Docker + Trivy — ADR-0005/0007.

Consequences

  • + Clear separation of concerns; every layer is independently testable.
  • + Offline-by-default reproducibility (CI never flakes on a third-party API).
  • More upfront structure than a notebook; justified by the production bar.

A critical data-quality finding shapes the analytical framing of this project; it is recorded separately in ADR-0002 (Data Integrity).