API Reference
Auto-generated from the bmw_sales package docstrings.
Data
bmw_sales.data.loader
Dataset loading with schema enforcement.
The loader is the single supported entrypoint for reading the raw BMW dataset. It validates structure on the way in so that downstream code can assume a clean, well-typed frame and fail fast (with a clear message) otherwise.
SchemaValidationError
Bases: ValueError
Raised when the loaded dataset does not match the expected schema.
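The fail-fast behaviour described above can be sketched with plain dicts, so that it runs with only the standard library. The `EXPECTED_COLUMNS` mapping and the `validate_rows` helper are hypothetical illustrations, not the package's real schema or API; only the fact that `SchemaValidationError` derives from `ValueError` comes from this reference.

```python
class SchemaValidationError(ValueError):
    """Raised when the rows do not match the expected schema."""


# Hypothetical schema for illustration: column name -> required Python type.
EXPECTED_COLUMNS = {"Region": str, "Year": int, "Sales_Volume": int}


def validate_rows(rows):
    """Check every row against EXPECTED_COLUMNS, failing fast with a clear message."""
    for i, row in enumerate(rows):
        missing = sorted(set(EXPECTED_COLUMNS) - set(row))
        if missing:
            raise SchemaValidationError(f"row {i}: missing columns {missing}")
        for name, expected_type in EXPECTED_COLUMNS.items():
            if not isinstance(row[name], expected_type):
                raise SchemaValidationError(
                    f"row {i}: column {name!r} should be {expected_type.__name__}, "
                    f"got {type(row[name]).__name__}"
                )
    return rows
```

Because the error subclasses `ValueError`, callers that already catch `ValueError` would also catch a schema failure.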
Source code in src/bmw_sales/data/loader.py
load_raw(path=None, *, validate=True, apply_dtypes=True)
Load the raw BMW sales dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Optional[Path] | Override the default raw dataset location (useful for tests/fixtures). | None |
| validate | bool | If … | True |
| apply_dtypes | bool | If … | True |

Returns:
| Type | Description |
|---|---|
| DataFrame | The validated dataset. |

Raises:
| Type | Description |
|---|---|
| FileNotFoundError | If the dataset file does not exist. |
| SchemaValidationError | If validation is enabled and the schema does not match. |
Source code in src/bmw_sales/data/loader.py
bmw_sales.data.validation
Data-integrity checks and the markdown report.
Looks at three things: structural integrity (shape, nulls, duplicates), whether the features carry any signal about the targets (Pearson correlation, one-way ANOVA, mutual information), and whether Sales_Classification is just a threshold on Sales_Volume (target leakage).
Run with python -m bmw_sales.data.validation to regenerate the report.
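The leakage check mentioned above asks whether one cut-off on Sales_Volume reproduces Sales_Classification exactly. A minimal sketch of that idea follows, assuming a two-class label; the function name `threshold_separates` and the label values used in the usage note are hypothetical, and the package's own check may work differently.

```python
def threshold_separates(volumes, labels):
    """Return a cut-off if one threshold on `volumes` reproduces `labels`, else None.

    Assumes exactly two distinct labels. The cut-off is the midpoint between
    the two classes when their volume ranges do not overlap.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    a = [v for v, lab in zip(volumes, labels) if lab == classes[0]]
    b = [v for v, lab in zip(volumes, labels) if lab == classes[1]]
    # Order the groups so that `low` is the class with the smaller volumes.
    low, high = (a, b) if max(a) < min(b) else (b, a)
    if max(low) < min(high):
        return (max(low) + min(high)) / 2
    return None
```

For example, `threshold_separates([100, 200, 7000, 8000], ["Low", "Low", "High", "High"])` finds a cut-off, which would indicate that the label carries no information beyond the volume itself.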
IntegrityFinding dataclass
A single, human-readable finding with its supporting statistic.
Source code in src/bmw_sales/data/validation.py
DataIntegrityReport dataclass
Structured result of the data-integrity analysis.
Source code in src/bmw_sales/data/validation.py
analyse(df=None)
Run the full integrity analysis and return a structured report.
Source code in src/bmw_sales/data/validation.py
to_markdown(report)
Render a DataIntegrityReport as a portfolio-ready markdown doc.
Source code in src/bmw_sales/data/validation.py
main()
CLI entrypoint: run analysis and persist the markdown report.
Source code in src/bmw_sales/data/validation.py
Signal audit
bmw_sales.audit.signal_tests
Statistical tests for whether a dataset has any learnable signal.
Three checks, usable on any frame:
- permutation (label-shuffle) test: compare the real held-out score to the distribution of scores under shuffled labels; a high p-value means the real score is indistinguishable from the shuffled-label scores, i.e. no detectable signal;
- Kolmogorov-Smirnov test of each numeric feature against a uniform fit;
- chi-squared test of independence between categoricals.
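The first check in the list above can be sketched in a few lines. This is a simplified illustration, not the package's implementation: it uses the absolute Pearson correlation between one feature and the target as a cheap stand-in for a held-out model score, and the names `abs_corr` and `permutation_p_value` are hypothetical.

```python
import random
from statistics import mean


def abs_corr(x, y):
    """Absolute Pearson correlation, used here as a cheap stand-in score."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_permutations=30, seed=0):
    """p-value: share of shuffled-label scores at least as good as the real one."""
    rng = random.Random(seed)
    real = abs_corr(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs_corr(x, shuffled) >= real:
            hits += 1
    # The +1 counts the real labelling as one member of the null distribution.
    return (hits + 1) / (n_permutations + 1)
```

With this estimator and 30 permutations, the smallest attainable p-value is 1/31 (about 0.032), so a 5% threshold can still be met.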
PermutationResult dataclass
Outcome of a label-permutation signal test.
Source code in src/bmw_sales/audit/signal_tests.py
has_signal property
Signal is present only if the real score beats the null at 5%.
UniformityResult dataclass
KS test of one numeric feature against a fitted Uniform distribution.
Source code in src/bmw_sales/audit/signal_tests.py
looks_uniform property
Cannot reject Uniform at 5% ⇒ consistent with synthetic uniform data.
permutation_test(df, task, *, n_permutations=30, sample=6000)
Run a label-permutation test for exploitable signal on task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Source dataset. | required |
| task | Task | | required |
| n_permutations | int | Size of the null distribution (each is a full model fit). | 30 |
| sample | int | Sub-sample size for tractable runtime. | 6000 |
Source code in src/bmw_sales/audit/signal_tests.py
uniformity_tests(df)
KS-test every numeric feature against Uniform(min, max).
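As a rough illustration of a KS test against Uniform(min, max), the sketch below computes the statistic by hand and converts it to a p-value with the asymptotic Kolmogorov series (including a common small-sample correction factor). The helper `ks_uniform` is hypothetical and is not the package's code. Because min and max are estimated from the same sample, the p-value should be read as approximate.

```python
import math


def ks_uniform(values):
    """KS statistic and asymptotic p-value against Uniform(min(values), max(values))."""
    xs = sorted(values)
    n = len(xs)
    lo, hi = xs[0], xs[-1]
    if hi == lo:
        raise ValueError("need at least two distinct values")
    d = 0.0
    for i, x in enumerate(xs):
        cdf = (x - lo) / (hi - lo)  # Uniform CDF at x
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        # The survival function is numerically 1 here and the series converges slowly.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)
```

A p-value above 0.05 means uniformity cannot be rejected at the 5% level, which matches the looks_uniform criterion described earlier.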
Source code in src/bmw_sales/audit/signal_tests.py
chi2_independence(df, col_a, col_b)
Return the chi-squared independence p-value between two categoricals.
Source code in src/bmw_sales/audit/signal_tests.py
bmw_sales.audit.control
Positive control: run the same pipeline on a synthetic, signal-bearing target.
A near-zero R² on the real data could mean either no signal or a broken pipeline. Building a target that is a known function of the features and checking that the pipeline recovers it (high R²) rules out the second case.
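The positive-control logic can be sketched with a toy pipeline. This example swaps in a one-feature least-squares line for the real model and uses made-up data, so every name and number here is an assumption for illustration only.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b


def held_out_r2(x, y, n_train):
    """Fit on the first n_train points and report R² on the remaining points."""
    a, b = fit_line(x[:n_train], y[:n_train])
    y_test = y[n_train:]
    pred = [a + b * xi for xi in x[n_train:]]
    mean_test = sum(y_test) / len(y_test)
    ss_res = sum((yt - p) ** 2 for yt, p in zip(y_test, pred))
    ss_tot = sum((yt - mean_test) ** 2 for yt in y_test)
    return 1 - ss_res / ss_tot


rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(400)]
real_like = [rng.gauss(0, 1) for _ in x]              # target unrelated to x
synthetic = [3.0 * xi + rng.gauss(0, 1) for xi in x]  # known function of x plus noise

r2_real = held_out_r2(x, real_like, n_train=300)
r2_synthetic = held_out_r2(x, synthetic, n_train=300)
```

If `r2_synthetic` is high while `r2_real` stays near zero, the pipeline itself is working and the flat result points to the data rather than to a bug.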
ControlResult dataclass
Held-out R² of the same pipeline on the real vs synthetic target.
Source code in src/bmw_sales/audit/control.py
pipeline_validated property
The pipeline is sound if it clearly learns the known synthetic signal.
make_signal_bearing_target(df, *, noise_sd=400.0)
Construct a synthetic demand target that genuinely depends on the features.
synthetic_demand = f(region, premium tier, engine, year, price, electrified) + Gaussian noise.
This is not a claim about the real world; it exists only to verify that the pipeline can learn a relationship that is known to exist.
Source code in src/bmw_sales/audit/control.py
run_control(df, *, model_name='LightGBM', sample=12000)
Train the same model on the real and a synthetic signal-bearing target.
Source code in src/bmw_sales/audit/control.py
External-data APIs
bmw_sales.apis.base
Base class for the external API clients.
Each client either hits a real endpoint or returns a deterministic mock, so the
project runs offline with no keys. Live calls retry with backoff (tenacity), and
after a failure a per-client circuit breaker falls back to the mock. Successful
responses are cached to parquet, and every result carries its provenance
(live / cache / mock). Subclasses implement _fetch_live and _mock.
DataSource
Bases: str, Enum
Provenance of a returned dataset.
Source code in src/bmw_sales/apis/base.py
APIResult dataclass
A dataset plus its provenance metadata.
Source code in src/bmw_sales/apis/base.py
BaseAPIClient
Bases: ABC
Abstract hybrid client with caching, retries and a circuit breaker.
Source code in src/bmw_sales/apis/base.py
fetch(**params)
Return data for params, preferring cache → live → mock.
The method never raises on network failure: it degrades gracefully to a deterministic mock so downstream code always receives a valid frame.
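The cache → live → mock order can be illustrated with a toy client. This sketch leaves out the retries and the circuit breaker, keeps the cache in memory rather than in parquet, and returns a plain tuple instead of an APIResult; the class `HybridClient` and its mock payload are hypothetical.

```python
class HybridClient:
    """Toy client that prefers cache, then a live call, then a deterministic mock."""

    def __init__(self, live_fetcher):
        self._live_fetcher = live_fetcher  # callable(**params) -> data; may raise
        self._cache = {}                   # in-memory stand-in for the parquet cache

    def _mock(self, **params):
        # Deterministic: the same params always give the same placeholder data.
        return {"value": sum(len(str(v)) for v in params.values())}

    def fetch(self, **params):
        key = tuple(sorted(params.items()))
        if key in self._cache:
            return self._cache[key], "cache"
        try:
            data = self._live_fetcher(**params)
        except Exception:
            # Degrade to the mock instead of propagating the failure.
            return self._mock(**params), "mock"
        self._cache[key] = data            # only successful live responses are cached
        return data, "live"
```

Returning the provenance next to the data lets callers show whether a value came from a live call, the cache, or the fallback.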
Source code in src/bmw_sales/apis/base.py
bmw_sales.apis.enrichment
Augmentation layer: join external API data onto the BMW sales dataset.
Builds a region×year (×fuel) external panel from the four hybrid clients and left-joins it onto the transactional sales data. All joins are left joins so the sales data is never dropped, and provenance per source is reported so the UI can show whether each block came from a live API or a mock fallback.
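To show why a left join never drops sales rows, here is a minimal pure-Python sketch on lists of dicts. The helper `left_join` and the `gdp_growth` column in the usage note are hypothetical; the package itself works on DataFrames. The sketch assumes each key combination appears at most once in the panel.

```python
def left_join(sales_rows, panel_rows, keys=("Region", "Year")):
    """Left-join panel columns onto sales rows; unmatched rows get None values."""
    index = {tuple(p[k] for k in keys): p for p in panel_rows}
    extra_cols = sorted({c for p in panel_rows for c in p} - set(keys))
    joined = []
    for row in sales_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        # Every sales row is kept, whether or not the panel has a match.
        joined.append({**row, **{c: match.get(c) for c in extra_cols}})
    return joined
```

A sales row for a region and year that the panel does not cover comes back with `None` in the external columns, for example `gdp_growth`, rather than disappearing.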
EnrichmentResult dataclass
Augmented dataset plus per-source provenance.
Source code in src/bmw_sales/apis/enrichment.py
build_external_panel(start_year=2010, end_year=2024)
Assemble the region×year(×fuel) external panel from all four clients.
Source code in src/bmw_sales/apis/enrichment.py
enrich_dataset(df, *, start_year=2010, end_year=2024)
Left-join the external panel onto the sales dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The raw/clean sales dataset (must contain Region, Year, Fuel_Type). | required |

Returns:
| Type | Description |
|---|---|
| EnrichmentResult | The augmented frame and per-source provenance (…) |
Source code in src/bmw_sales/apis/enrichment.py
summarise_provenance(provenance)
One-line human summary of where the external data came from.
Source code in src/bmw_sales/apis/enrichment.py
Features & models
bmw_sales.features.engineering
Feature engineering shared by the econometric and ML pipelines.
Adds vehicle age, usage intensity, electrification and premium-tier flags, and a couple of log transforms. The same frame feeds statsmodels and the boosters.
add_engineered_features(df, *, reference_year=REFERENCE_YEAR)
Return a copy of df with domain-informed engineered features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | A frame containing the canonical raw columns. | required |
| reference_year | int | Year used as "today" when computing vehicle age. | REFERENCE_YEAR |

Returns:
| Type | Description |
|---|---|
| DataFrame | |
Source code in src/bmw_sales/features/engineering.py
feature_columns(*, include_leakage=False)
Return the modelling feature sets (categorical vs numeric).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| include_leakage | bool | If … | False |
Source code in src/bmw_sales/features/engineering.py
bmw_sales.models.preprocessing
Shared preprocessing for the supervised ML/DL pipelines.
Provides a single, leakage-aware way to turn the (optionally enriched) dataset
into model-ready X/y plus a fitted-on-train ColumnTransformer. Using
the same preprocessing for every model keeps the benchmark fair.
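The "fitted-on-train" idea can be sketched without any ML library: split first, learn the scaling statistics from the training split only, then apply them unchanged to validation and test. The helpers `split_indices` and `fit_scaler` are hypothetical stand-ins for the real split and ColumnTransformer; the 0.15 split sizes mirror the make_dataset defaults listed below.

```python
import random
from statistics import mean, pstdev


def split_indices(n, test_size=0.15, val_size=0.15, seed=0):
    """Shuffle 0..n-1 and cut it into train/val/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test, n_val = round(n * test_size), round(n * val_size)
    return idx[n_test + n_val:], idx[n_test:n_test + n_val], idx[:n_test]


def fit_scaler(train_values):
    """Learn mean and standard deviation from the training split only."""
    mu, sd = mean(train_values), pstdev(train_values)
    return lambda values: [(v - mu) / sd for v in values]


values = [float(v) for v in range(100)]
train, val, test = split_indices(len(values))
scale = fit_scaler([values[i] for i in train])
val_scaled = scale([values[i] for i in val])  # uses train statistics, not val's own
```

Fitting the scaler on the full data instead would let validation and test statistics leak into training, which is the situation this layout is meant to avoid.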
Dataset dataclass
A train/validation/test split plus column metadata.
Source code in src/bmw_sales/models/preprocessing.py
select_features(df, *, include_leakage=False)
Return (numeric, categorical) feature names present in df.
Adds API-enriched numeric columns when available. Never includes a target.
Source code in src/bmw_sales/models/preprocessing.py
build_preprocessor(numeric, categorical)
Standardise numerics and one-hot encode categoricals (dense, unknown-safe).
Source code in src/bmw_sales/models/preprocessing.py
make_dataset(df, task, *, include_leakage=False, test_size=0.15, val_size=0.15)
Build a leakage-aware train/val/test split for the requested task.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | Raw or enriched dataset. | required |
| task | Task | | required |
| include_leakage | bool | Classification only - include … | False |
Source code in src/bmw_sales/models/preprocessing.py
Simulation
bmw_sales.simulation.scenario
What-if demand simulator (not a fit to the historical data).
Projects demand under a constant-elasticity model:
Q' = Q0 * (1+dp)^ep * (1+dy)^ey * (1+df)^ef * R(ds) * (1+dfx)^ep
for changes in list price (dp), income (dy), fuel price (df), CO2-regulation stringency (ds) and FX (dfx). Elasticity priors are segment-specific: the premium tier is less price-elastic (own-price ~-0.3, with Veblen effects) and more income-elastic (~2.2) than the standard tier (~-0.7 / ~1.3). Baselines come from the macro APIs; all priors are adjustable in the UI.
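The formula above can be written out as a short function. This is a sketch under stated assumptions, not the package's simulate: `reg_factor` stands in for R(ds), whose functional form is not given in this reference, and the fuel-price elasticity `ef=-0.2` is an illustrative value. The price and income defaults are the standard-tier figures quoted above.

```python
def project_demand(q0, *, dp=0.0, dy=0.0, df=0.0, dfx=0.0, reg_factor=1.0,
                   ep=-0.7, ey=1.3, ef=-0.2):
    """Apply Q' = Q0 * (1+dp)^ep * (1+dy)^ey * (1+df)^ef * R(ds) * (1+dfx)^ep.

    `reg_factor` is passed in directly as the value of R(ds) (assumption);
    `ef` is an illustrative fuel-price elasticity, not a documented prior.
    """
    factors = {
        "price": (1 + dp) ** ep,
        "income": (1 + dy) ** ey,
        "fuel": (1 + df) ** ef,
        "regulation": reg_factor,
        "fx": (1 + dfx) ** ep,
    }
    q = q0
    for factor in factors.values():
        q *= factor  # effects are independent and multiplicative
    return q, factors


# A 10% list-price increase under the standard and the premium priors.
standard, _ = project_demand(1000.0, dp=0.10)
premium, _ = project_demand(1000.0, dp=0.10, ep=-0.3, ey=2.2)
```

Returning the per-driver factors alongside the total makes it possible to report each contribution separately, in the spirit of FactorContribution.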
ElasticityAssumptions dataclass
Segment-specific elasticity priors (all user-overridable in the UI).
Defaults are for the standard segment; use for_segment for the
luxury/premium tier (less price-elastic, more income-elastic).
Source code in src/bmw_sales/simulation/scenario.py
for_segment(premium) classmethod
Return priors for the premium/luxury tier or the standard tier.
Premium: own-price ≈ -0.3 (Veblen-leaning, price-inelastic), income ≈ 2.2 (positional good), weaker fuel sensitivity. See the module docstring.
Source code in src/bmw_sales/simulation/scenario.py
ScenarioInput dataclass
A single what-if scenario.
Source code in src/bmw_sales/simulation/scenario.py
FactorContribution dataclass
Multiplicative contribution of one driver to projected demand.
Source code in src/bmw_sales/simulation/scenario.py
ScenarioResult dataclass
Output of a scenario projection.
Source code in src/bmw_sales/simulation/scenario.py
simulate(scenario, assumptions=None)
Project demand for a scenario using the constant-elasticity model.
All effects are independent and multiplicative; each is reported separately so the user can see why demand moves, not just by how much.
Source code in src/bmw_sales/simulation/scenario.py
macro_defaults(region, *, year=2024)
Suggest scenario defaults from the (real/mock) external APIs for a region.
Returns plausible starting values: recent inflation, a GDP-growth proxy, and the latest regulation-stringency level - so the UI opens on realistic numbers.
Source code in src/bmw_sales/simulation/scenario.py
bmw_sales.simulation.uncertainty
Monte-Carlo uncertainty for the scenario simulator.
Puts Gaussian priors on the elasticities and samples them through the constant-elasticity model, so each scenario returns a distribution of projected demand with credible intervals instead of a single point. Sampling is seeded.
ElasticityPriors dataclass
Gaussian priors (mean ± sd) on each elasticity - segment-specific.
Means match the deterministic model's standard-segment priors; the
standard deviations encode honest parameter uncertainty. Use
for_segment for the luxury/premium tier.
Source code in src/bmw_sales/simulation/uncertainty.py
for_segment(premium) classmethod
Priors for the premium/luxury tier (Veblen-leaning) or the standard tier.
Source code in src/bmw_sales/simulation/uncertainty.py
ScenarioDistribution dataclass
Monte-Carlo distribution of projected demand for a scenario.
Source code in src/bmw_sales/simulation/uncertainty.py
pct_change_ci()
80% credible interval expressed as % change vs baseline.
Source code in src/bmw_sales/simulation/uncertainty.py
simulate_mc(scenario, priors=None, *, n_draws=5000)
Propagate elasticity uncertainty through the demand model via Monte Carlo.
Each draw samples the elasticities from their priors and recomputes projected demand; the collection of draws forms the predictive distribution.
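A reduced sketch of the sampling loop, assuming only a price shock and a Gaussian prior on the own-price elasticity. The function name, the prior standard deviation `ep_sd=0.15` and the returned dictionary are illustrative assumptions; the real simulate_mc samples every elasticity and returns a ScenarioDistribution.

```python
import random
from statistics import median, quantiles


def simulate_price_shock_mc(q0, dp, *, ep_mean=-0.7, ep_sd=0.15, n_draws=5000, seed=0):
    """Sample the own-price elasticity from a Gaussian prior and project demand."""
    rng = random.Random(seed)  # seeded, so repeated runs give identical draws
    draws = [q0 * (1 + dp) ** rng.gauss(ep_mean, ep_sd) for _ in range(n_draws)]
    deciles = quantiles(draws, n=10)    # 9 cut points: 10%, 20%, ..., 90%
    lo, hi = deciles[0], deciles[-1]    # bounds of the central 80% interval

    def pct(q):
        return 100.0 * (q / q0 - 1.0)   # % change vs baseline

    return {
        "median": median(draws),
        "ci80": (lo, hi),
        "pct_change_ci": (pct(lo), pct(hi)),
    }
```

Taking the 10th and 90th percentiles of the draws gives the 80% interval, and dividing by the baseline expresses it as a percentage change, which is the quantity pct_change_ci is described as reporting.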
Source code in src/bmw_sales/simulation/uncertainty.py
SQL analytics
bmw_sales.sql.analytics
Run the .sql files in sql/queries/ against the CSV with DuckDB.
No database server or ETL; the queries are plain SQL and this module just runs them.
list_queries()
Return the available query names (.sql file stems), sorted.
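Discovering queries by file stem can be sketched with pathlib. The helper below takes the directory as an argument and is a hypothetical stand-in; the real list_queries takes no arguments and presumably resolves sql/queries/ itself.

```python
from pathlib import Path


def list_sql_queries(query_dir):
    """Return the stems of the .sql files in query_dir, sorted."""
    return sorted(p.stem for p in Path(query_dir).glob("*.sql"))
```

Using the file stem as the query name means adding a new query only requires dropping another .sql file into the directory.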
Source code in src/bmw_sales/sql/analytics.py
run_query(name, *, dataset_path=None)
Execute the named query and return the result as a DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| name | str | A query name from :func: … | required |
| dataset_path | Path \| None | Optional dataset override (for tests/fixtures). | None |
Source code in src/bmw_sales/sql/analytics.py
run_all(*, dataset_path=None)
Execute every query and return {name: DataFrame}.
Source code in src/bmw_sales/sql/analytics.py