This Essay -- Four Parts

This is Part III of a four-part essay on building a Bayesian MMM worth trusting. Part I covered the credibility problem and the foundations of data architecture. Part II worked through the prior as a strategic asset. Part III covers the structural controls that isolate the marketing signal and the diagnostic sequence that determines whether the model is ready for a review room.

A note on the figures that follow: all illustrations in this essay use revenue as the example dependent variable. In practice the modeled KPI may be units, volume, share, or any other business outcome, and the diagnostic principles apply equally regardless of the dependent variable chosen.

05

Structural Controls: Isolating the Marketing Signal

A marketing mix model does not measure marketing in isolation. It measures marketing in the context of everything else simultaneously driving the business, and its ability to produce credible channel-level estimates depends entirely on how well it accounts for that context. Without structural controls, the model cannot use the data alone to separate what marketing caused from what would have happened regardless. In a Bayesian MMM with priors grounded in strong incrementality experiments, the prior itself can carry some of that separating work. The risk is that the model becomes more prior-dependent and more fragile: a channel contribution estimate that is defensible today becomes harder to defend the moment the experimental evidence behind its prior is questioned. And where priors are weak or absent, the stakes are real: simulation analysis found that omitting key confounders can inflate channel contribution estimates (for example, TV) to three times their actual value.

The Control Variable Landscape

One important framing note before diving in: the goal of including these non-media variables is not to generate actionable insights on pricing, promotions, or distribution directly from the MMM. It is to isolate the marketing signal more accurately and build confidence in the channel-level contribution estimates. A standard MMM is not the right tool for deep trade promotion analytics or pricing optimization -- those require purpose-built models designed around those decisions. The Bayesian methodology can absolutely be extended to understand those dynamics in more depth, but that is a separate modeling exercise. What belongs in the MMM is the minimum set of controls needed to tell the model what is not media, so that what remains can be attributed to media with greater confidence.

A related discipline: within each category, the goal is to include one well-specified variable rather than multiple overlapping ones. Including both a price index and a relative price ratio, or both a discount depth and a promo spend variable without careful thought, introduces collinearity within the control set itself -- the same problem that destabilizes media coefficients when channels move together. One clean, well-measured variable per category, chosen deliberately and documented with a rationale, is more defensible than a dense control set where the individual contributions cannot be separated.

What follows is not an exhaustive taxonomy but a practical orientation to the categories of non-marketing variables that most commonly belong in a well-specified MMM.

Seasonality and trend are distinct structural components that capture fundamentally different things and should be modeled separately. Seasonality captures the repeating within-year rhythm of the business -- the demand patterns that return at roughly the same time each year regardless of what marketing is doing. Fourier terms, pairs of sine and cosine functions at different frequencies, model this well because they are periodic by construction, learning the smooth seasonal shape from the data without requiring manual specification for every peak period. Trend captures the longer-term directional movement of the business -- underlying growth, gradual decline, or a structural shift that does not repeat -- and is typically modeled as a linear or piecewise linear component. Conflating the two into a single control is a common modeling error: a Fourier term applied to both will try to absorb trend as a very slow seasonal cycle, distorting both estimates in the process.
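A minimal sketch of that separation, in plain Python. The function names, the 52-week period, and the choice of two harmonics are illustrative assumptions rather than a recommended specification. The point is only that the Fourier columns are periodic by construction, while trend enters as its own column.

```python
import math

def fourier_terms(week_index, period=52.0, order=2):
    """Return sin/cos pairs for harmonics 1..order at a given week.

    period=52.0 is an assumption (weekly data, yearly cycle).
    """
    terms = []
    for k in range(1, order + 1):
        angle = 2.0 * math.pi * k * week_index / period
        terms.append(math.sin(angle))
        terms.append(math.cos(angle))
    return terms

def design_row(week_index, n_weeks, period=52.0, order=2):
    """Seasonality columns plus a separate linear trend column scaled to [0, 1]."""
    trend = week_index / (n_weeks - 1)
    return fourier_terms(week_index, period, order) + [trend]

# Two years of weekly rows: 4 seasonal columns + 1 trend column each.
rows = [design_row(t, n_weeks=104) for t in range(104)]
```

The seasonal columns repeat exactly every 52 rows and the trend column never repeats, which is why the two components can be estimated separately.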

Pricing captures demand response to price level changes -- a continuous signal reflecting how volume shifts as the brand's price moves. Two variables are commonly used and they answer different questions. Average unit price (AUP), the brand's own absolute price computed as revenue over units, captures own-price elasticity -- how volume responds when the brand's price moves regardless of what competitors are doing. Relative price, the brand's price expressed as a ratio to the category average or a defined competitive set, captures competitive pricing position -- how volume responds when the brand moves relative to the market. Including both in the same model is typically not advisable because they move together and create collinearity within the control set; pick the one that matches the question being asked. Pricing effects are often semi-permanent: a price change that lands in week 12 continues to influence volume in weeks 13, 14, and beyond until another change occurs. Modeling pricing as a continuous variable allows the model to capture this persistent demand effect separately from the spikes that promotional mechanics produce.
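The two pricing variables reduce to simple ratios. The sketch below uses made-up weekly figures and hypothetical field names (`revenue`, `units`, `category_aup`) purely to show the arithmetic.

```python
def average_unit_price(revenue, units):
    """AUP: the brand's own absolute price, revenue divided by units."""
    if units <= 0:
        raise ValueError("units must be positive to compute AUP")
    return revenue / units

def relative_price(brand_aup, category_aup):
    """Brand price as a ratio to the category (or competitive set) average."""
    return brand_aup / category_aup

# Hypothetical weekly records for illustration only.
weekly = [
    {"revenue": 120000.0, "units": 30000, "category_aup": 5.0},
    {"revenue": 110000.0, "units": 25000, "category_aup": 5.0},
]

aup = [average_unit_price(w["revenue"], w["units"]) for w in weekly]
rel = [relative_price(a, w["category_aup"]) for a, w in zip(aup, weekly)]
```

In this toy case the category price is flat, so the two series move in lockstep. That is the collinearity the paragraph warns about when both are included.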

Promotions capture temporary demand spikes from discrete promotional mechanics -- a price cut, a BOGO, a feature, an end-cap -- that are event-like in nature and closer in structure to a Gaussian bump than to a continuous price variable. Promo spend, the trade investment behind the promotion, should also be included where available: it separates the ROI of the investment from the demand effect of the discount itself, a distinction that matters for trade planning. A model that cannot see a promotional discount running in the same week as a media flight has no principled way to separate the promotional lift from the media effect. Separating pricing and promotions also matters for planning: pricing effects are structural and repeatable at the same cost, whereas promotional effects are temporary and require investment to reproduce.

Distribution and availability capture changes in the brand's physical or digital footprint, new retailer listings, stockouts, and shelf placement changes that affect the baseline independently of marketing activity. These controls are particularly important in CPG and retail contexts where distribution gains can produce sales lifts that dwarf the contribution of any individual media channel.

Macroeconomic signals including consumer confidence indices, unemployment rates, and stock market indices are most relevant for high-consideration purchases where consumer spending is sensitive to broader economic conditions. Category-level demand indices are sometimes included alongside macro signals but warrant separate treatment: category demand can diverge significantly from macro trends depending on the vertical, and in some categories -- pet food, healthcare, essential CPG -- the correlation with macro conditions is weak enough that bundling category demand into the macro variable can mask the distinction. My preference is to keep them as separate variables in the model, each with its own coefficient, so the model can reflect the fact that macro tailwinds and category-specific demand shifts are distinct forces that may not move together.

Competitive activity captures the demand displacement effect when competitors increase media weight, launch new products, or run aggressive promotions. Where share of voice or competitor spend data is available at sufficient quality, this can be modeled directly as a competitive steal variable. Where syndicated data is accessible -- Nielsen, Circana, or equivalent -- competitor units sold or revenue is a more direct signal of competitive demand pressure than media weight or price alone, capturing actual volume shifts in the competitive set rather than inferring them. Where these data sources are incomplete or lagged, relative price is the more operationally reliable fallback, capturing both the brand's own pricing position and the competitive environment in a single well-measured variable (and worth noting: where a standalone pricing variable is already in the model, relative price may overlap -- the two should not both be included without careful thought about what each is uniquely contributing). Competitor data is often expensive and not always available at the right granularity, so the variable choice should reflect what is actually available rather than what is theoretically ideal.

One-time events -- including planned holidays, product reformulations, major PR moments, brand crises, competitor market exits, and macro shocks -- produce demand effects that are sharp, concentrated in time, and impossible for a smooth Fourier curve to absorb without distorting the seasonal estimate. The solution for all of these is the same: a Gaussian bump centered on the event date, whose width reflects how quickly the demand effect rises and dissipates. This means holiday effects, which are discrete and predictable, are handled the same way as unexpected events -- not as part of the seasonality control but as their own explicitly modeled variables. The practical discipline is to document candidate events during the data audit and make an explicit decision about which warrant a bump before the build begins.
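One way to build such a variable is a unit-height Gaussian curve over the week index. The function name, the event week, and the width below are hypothetical choices for illustration; in a fitted model the coefficient on this column would set the size and sign of the effect.

```python
import math

def gaussian_bump(week_index, event_week, width_weeks):
    """Unit-height bump: 1.0 at the event week, decaying symmetrically.

    width_weeks plays the role of a standard deviation: larger values
    mean the demand effect rises and dissipates more slowly.
    """
    z = (week_index - event_week) / width_weeks
    return math.exp(-0.5 * z * z)

# One year of weekly values for a hypothetical event in week 47.
bump = [gaussian_bump(t, event_week=47, width_weeks=1.5) for t in range(52)]
```

The column is effectively zero away from the event, so it cannot absorb the smooth seasonal shape that the Fourier terms are responsible for.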

Research on MMM bias taxonomy documents what practitioners call suppression bias. Omitted variable bias occurs when a relevant variable is left out; suppression bias arises from including a variable. That makes the selection of controls a two-sided problem with no safe default. The discipline is not to include every potentially relevant control but to include controls that are well-measured, conceptually clean, and genuinely orthogonal to the media variables the model is trying to isolate -- meaning they move independently of media spend rather than alongside it, so the model can separate their effects cleanly. Once a control set is large enough that variables interact in complex ways, the model becomes a black box where small changes in multiple controls affect the output in unpredictable ways, and in the vast majority of cases marketers do not trust the results and the MMM gets ignored.

Control variable reference
Non-media variables, primary use, and risk if omitted
Control variable | Primary use | Common form | Potential risk if omitted*
Seasonality and trend | Capture repeating within-year demand rhythm (seasonality) and longer-term directional movement (trend) as separate components | Fourier terms for seasonality; linear or piecewise linear component for trend | Channel coefficients biased; direction depends on whether seasonal peaks align with media timing; baseline absorbs unexplained variance
Pricing | Capture persistent demand response to price level changes; pick the variable that matches the question | Average unit price (AUP) for own-price elasticity; relative price ratio for competitive position | Price-driven volume shifts misattributed to media if pricing changes correlate with media timing; confidence in media reads reduced
Promotions | Capture temporary demand spikes from discrete promotional mechanics and the trade investment behind them | Discount depth, promo flag, feature/end-cap indicator, promo spend | Promotional lift misattributed to whichever channels were active during the promo window
Distribution and availability | Isolate footprint changes from marketing effects | Weighted distribution, ACV, stockout flag | Distribution-driven volume changes misattributed to media; baseline drifts as footprint shifts
Macroeconomic signals | Account for external demand conditions | Consumer confidence index, unemployment rate, stock market index, category index | Macro tailwinds or headwinds absorbed into baseline; media estimates unbiased but confidence bands widen
Competitive activity | Capture demand displacement from competitor actions | Share of voice, relative price ratio, competitor units sold or revenue (Nielsen, Circana) | Competitor-driven volume shifts misattributed to brand media or absorbed into baseline
One-time events | Isolate discrete demand shocks -- planned holidays, PR moments, crises, macro events | Gaussian bump centered on event date; applies to holidays and unexpected events alike | Event-driven spikes inflate or deflate channel coefficients depending on what media was active in the event window

*Potential risks materialize most fully when priors are weak or absent. In a Bayesian MMM with priors grounded in strong empirical evidence -- incrementality experiments, prior comprehensive MMMs with proper controls, or other rigorous causal measurement -- several of these risks are partially mitigated because the prior anchors the channel-level estimate even when a control is missing. The risks listed describe what can go wrong when both the control and a well-sourced prior are absent.

From Controls to Explainability: How the Model Accounts for Itself

The control variable framework matters not only because it produces more accurate estimates but because it makes the model's outputs explainable in terms that a CMO, a CFO, or a measurement consultant vetting the results can actually interrogate.

The standard output through which controls surface in a reported MMM is the revenue decomposition, sometimes called a waterfall or contribution chart. This visualization breaks total observed revenue into its constituent drivers: the baseline representing everything the brand would have generated without any marketing activity, and the aggregate contributions of each media channel, promotions, pricing, and the other structural controls included in the model. The decomposition is the model's account of itself, and it is the output that will be scrutinized most intensely in any serious model review.

The baseline deserves particular attention because it is the most consequential and least understood component of the decomposition. In a well-specified model, the baseline captures organic demand, brand equity built over time, distribution strength, and the residual effect of all non-marketing forces the control variables have accounted for. The key is to frame the baseline not as evidence that marketing does not work but as the foundation that makes marketing's incremental contribution legible. A high baseline is a sign of a strong brand with durable organic demand, and the marketing contribution sitting above it represents genuinely incremental revenue that would not have existed without the investment.

The decomposition should ideally be presented with control variables visible and labeled rather than collapsed into an undifferentiated baseline or residual category. Some control variables will appear as negative contributors in certain periods, reflecting their role in suppressing demand rather than amplifying it, and this is expected and informative rather than a sign of model error. The YoY waterfall view in the figure below makes this pattern particularly legible, separating headwinds from tailwinds so that the directional contribution of each driver is immediately visible to practitioners and executives alike.

Figure 5: Revenue Decomposition -- Waterfall, YoY Waterfall, and iROAS

Three views of the same underlying decomposition, each suited to a different audience and question.

Figure 5: Revenue decomposition across three views. Hypothetical data for illustrative purposes.

06

Diagnostics That Actually Mean Something

There are two kinds of diagnostics in MMM practice, and most practitioners only run one of them. The first tests whether the model is internally consistent. The second tests whether it is externally credible. A model can pass every internal diagnostic and still be wrong in ways that matter enormously for the decisions it informs. What follows is the sequence a measurement leader and data science team should work through in order, from exploratory checks before modeling begins through the external validation that determines whether the outputs are ready for a review room.

The ten-step diagnostic sequence
From exploratory checks to external triangulation
Step | Diagnostic | When | What to look for | Red flag
1 | EDA | Before modeling | Outliers, gaps, seasonal consistency | Missing data, implausible spikes
2 | Collinearity | Before prior spec | Pairwise correlations, VIF | Correlation >0.8; prior quality insufficient to compensate
3 | Prior predictive check | Before fitting | Plausible revenue range from priors | Negative values, implausible magnitudes
4 | In-sample fit | After fitting | MAPE, R-squared, residual patterns | MAPE persistently >10% with systematic residuals; R² <0.80; large in-sample/out-of-sample gap
5 | MCMC convergence | After fitting | R-hat, effective sample size, divergences | R-hat >1.01, low ESS, any divergences
6 | Posterior predictive + channel-level posteriors | After fitting | Simulated vs. observed distribution; channel posterior vs. prior | Systematic time-series divergence; prior-dominated or implausibly far channel posteriors
7 | Holdout validation | After fitting | Out-of-sample MAPE gap and absolute level | Holdout MAPE lands in a range that limits decision use
8 | Posterior width and sensitivity | After fitting | Prior vs. posterior movement, sensitivity scenarios | Posterior as wide as prior, large sensitivity shifts
9 | Decomposition sanity | Before review | Baseline, seasonal peaks, channel plausibility | Implausible contributions, unexplained baseline
10 | Triangulation | Before review | vs. experiments, prior runs, benchmarks | Material unexplained cross-run shift

The Diagnostic Sequence

Step 1

Exploratory Data Analysis, before modeling begins

Much of the foundational data review described in Section 02 overlaps with this step: variable definitions, completeness checks, and stakeholder sign-off all happen upstream. What EDA adds here is the modeler's own analytical verification of the ingested data before any model code is written. Plot the dependent variable across the full window and flag outliers, structural breaks, and anomalous periods that may warrant Gaussian bumps. Plot each media variable and check for gaps, zeros, implausible spikes, and changes in spend patterns that could affect coefficient stability. Confirm seasonal patterns are consistent year over year. Problems caught at this stage cost hours to fix; the same problems discovered mid-build or in a review room cost weeks.

If problems are found: Return the data to the client or data engineering team for correction before proceeding. Do not model around known data problems. Document every anomaly and the decision made about it before the build begins.

Step 2

Correlation and collinearity assessment

Examine the correlation structure across all media variables before prior specification begins. Channels with pairwise correlations above 0.7 to 0.8 will produce unstable coefficient estimates in a frequentist context, and the same channels in a Bayesian MMM will become increasingly reliant on the prior to separate their individual contributions. High collinearity does not invalidate a Bayesian build, but it raises the stakes on prior quality: when channels move together, the model's ability to distinguish their effects depends more heavily on what the priors encode and less on what the data can show. A Variance Inflation Factor above 10 is a signal worth investigating in any modeling context, but in Bayesian MMM the more important question is whether the priors for correlated channels are sufficiently well-sourced to compensate for what the data cannot resolve on its own.
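A rough sketch of the pairwise screen, using only the standard library. The spend series are invented, the channel names are hypothetical, and the 0.8 cutoff is simply the upper end of the range mentioned above. VIF is omitted here for brevity.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def flag_collinear(channels, threshold=0.8):
    """Return (channel_a, channel_b, r) for pairs with |r| above threshold."""
    flagged = []
    for a, b in combinations(sorted(channels), 2):
        r = pearson(channels[a], channels[b])
        if abs(r) > threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical weekly spend: TV and online video are flighted together.
spend = {
    "tv": [10, 12, 15, 20, 18, 11, 9, 14],
    "online_video": [5, 6, 8, 10, 9, 6, 4, 7],
    "paid_search": [7, 3, 6, 4, 8, 5, 7, 4],
}
flags = flag_collinear(spend)
```

Any pair returned here is a prompt to review how well the priors for those channels are sourced, not a reason to abandon the build.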

If high collinearity is found: Strengthen priors for the correlated channels using available experimental evidence before building. Where channels are structurally inseparable and combining them is not viable because marketing needs a channel-level read, the path forward is some combination of: anchoring priors with geo-based incrementality tests, designing deliberate spend variation into the next planning cycle, and labeling channel-level estimates with an explicit confidence level rather than presenting them as equally certain point estimates. If none of these are possible, disclose the limit in the review and flag any channel-level decisions as provisional until the underlying data or experimental foundation improves.

Figure 6: Collinearity Assessment

Correlation matrix and coefficient instability under high collinearity.

Figure 6: Collinearity diagnostic -- correlation matrix and coefficient instability. Hypothetical data for illustrative purposes.

Step 3

Prior predictive checks

Before fitting the model, simulate from the prior distributions alone and examine whether the resulting predictions for the modeled KPI (revenue, units, or whatever the dependent variable is) span a plausible range. Simulations producing negative values, implausible seasonal shapes, or contribution ranges outside any reasonable business expectation signal a prior specification problem that should be resolved before data enters the model. The goal is not to force a specific answer but to confirm the model's starting assumptions are consistent with commercial reality.
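The idea can be sketched with a deliberately small toy model in plain Python. Everything here is an assumption made for illustration: the functional form (baseline plus a linear media effect plus noise), the prior parameters, and the spend values. A real MMM would include adstock, saturation, and controls, and would normally use the prior-sampling facilities of its modeling library.

```python
import random

def prior_predictive(spend, baseline_sd, n_draws=500, seed=0):
    """Simulate revenue paths from priors alone (no observed revenue used).

    Hypothetical toy model: revenue_t = baseline + beta * spend_t + noise.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_draws):
        baseline = rng.gauss(100.0, baseline_sd)  # prior on baseline revenue
        beta = abs(rng.gauss(0.0, 2.0))           # half-normal prior: media effect >= 0
        paths.append([baseline + beta * s + rng.gauss(0.0, 5.0) for s in spend])
    return paths

def share_negative(paths):
    """Fraction of simulated revenue values below zero."""
    values = [v for path in paths for v in path]
    return sum(v < 0 for v in values) / len(values)

spend = [8.0, 10.0, 12.0, 15.0, 9.0, 11.0, 14.0, 10.0]
well_specified = share_negative(prior_predictive(spend, baseline_sd=15.0))
too_wide = share_negative(prior_predictive(spend, baseline_sd=200.0))
```

With the tighter baseline prior the simulations stay positive. With the overly wide one a substantial share of simulated revenue is negative, which is the kind of signal that should send the modeler back to the prior specification.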

If simulations are implausible: The check tells you something is wrong, not which prior is wrong. Investigate iteratively by holding the suspect priors at their tightest reasonable specification one at a time and re-running the check until the implausible behavior disappears. The most common culprits are an overly wide baseline prior, an aggressive control coefficient, or a contribution prior that compounds with wide adstock or saturation priors to produce implausible response magnitudes.

Figure 7: Prior Predictive Check

Simulated revenue trajectories generated from the prior distributions alone, before any data enters the model. The diagnostic is not whether the bands hug the observed line -- the prior is intentionally wider than what a posterior will produce. The diagnostic is whether the simulated range is commercially plausible. Look for negative values, implausible magnitudes, irregular spikes in the HDI bands, or seasonal shapes that bear no resemblance to the business. A well-specified prior produces a wide but coherent envelope. A poorly-specified prior produces nonsense.

Figure 7: Prior predictive check -- well-specified vs. poorly-specified prior. Hypothetical data for illustrative purposes.

Step 4

In-sample fit and MAPE

A MAPE below 10% is generally acceptable for weekly MMM, with values below 5% indicating strong in-sample fit. That said, MAPE above 10% is not automatically disqualifying -- in highly seasonal categories where weekly revenue swings are large relative to the annual mean, or in businesses with significant demand volatility driven by factors outside the model's control, MAPEs in the 12 to 15% range can still produce defensible contribution estimates provided the residual patterns are random rather than systematic. The threshold should be interpreted in the context of the business, not applied mechanically. An R-squared below 0.80 suggests the model is missing important structural variables, leaving too much observed variation unexplained. An R-squared above 0.95 is sometimes treated as evidence of overfitting, but the number alone does not tell the story -- the gap between in-sample and out-of-sample fit does. A 0.98 in-sample R-squared paired with a 0.96 holdout R-squared is a genuinely strong model. A 0.98 in-sample paired with a 0.78 holdout is overfitting: the model has fit the historical noise rather than learning to predict the future. Always evaluate in-sample fit alongside the holdout diagnostic in Step 7.
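For reference, both fit statistics are short calculations. The sketch below uses four invented weekly values; the function names are illustrative.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)

def r_squared(actual, predicted):
    """Share of variance in actuals explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical weekly revenue and fitted values.
actual = [100.0, 120.0, 90.0, 110.0]
fitted = [98.0, 123.0, 93.0, 108.0]

in_sample_mape = mape(actual, fitted)
in_sample_r2 = r_squared(actual, fitted)
```

Applying the same two functions to the holdout window gives the in-sample versus out-of-sample comparison that the paragraph above treats as the real test of overfitting.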

If fit is poor (R² below 0.80 or MAPE persistently above 10% with systematic residuals): Audit the variable set for missing structural drivers. If in-sample fit is very high but the holdout gap is large, the model is overfitting: simplify by removing low-quality controls, widening priors that may be locking the model too tightly to the training data, or checking whether an anomalous period in the modeling window is being absorbed into channel coefficients rather than into a dedicated event variable.

Step 5

MCMC convergence diagnostics

Before any posterior summary is interpreted, confirm the chains have converged. R-hat measures whether multiple chains have settled on the same posterior; values close to 1.00 indicate convergence. Thresholds vary by convention: a strict reading flags anything above 1.01, a looser one anything above 1.05, and in either case a flagged R-hat means the chains have not yet agreed and the posterior cannot be trusted. Effective sample size (ESS) measures how many independent samples the sampler has effectively drawn -- low ESS relative to total iterations means high autocorrelation in the chain and unreliable posterior summaries. Divergences during NUTS sampling indicate the sampler is hitting regions of the posterior with high curvature that it cannot explore accurately; even a small number warrants investigation before interpreting any output.
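To make the R-hat idea concrete, here is a sketch of the classic Gelman-Rubin calculation for one parameter. This is the textbook form, stated as an assumption: current libraries typically report a rank-normalized split variant, so their numbers will differ somewhat. The simulated chains are synthetic.

```python
import math
import random
import statistics

def r_hat(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)

rng = random.Random(1)
# Four synthetic chains sampling the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# Same chains, except the last one has settled somewhere else.
stuck = mixed[:3] + [[x + 3.0 for x in mixed[3]]]

r_hat_mixed = r_hat(mixed)
r_hat_stuck = r_hat(stuck)
```

The statistic compares variance between chain means with variance within chains. When one chain sits in a different region, the between-chain term inflates and R-hat moves well away from 1.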

If convergence diagnostics flag: Do not interpret the posterior. Run longer chains, reparameterize hierarchical structures (non-centered parameterization), tighten priors that are creating funnel geometries, or simplify the model. Convergence is non-negotiable: a model that has not converged is not a model.

Step 6

Posterior predictive checks and channel-level posterior inspection

Step 6 pairs two related diagnostics that answer different questions about the fitted model. The time-series posterior predictive check asks whether the model's account of the aggregate data matches what was observed. Channel-level posterior inspection asks whether the model's account of each channel matches what the prior expected and what the experimental evidence supports. Both diagnostics live in this step because both are interpretive lenses on the same fitted posterior, and a model can pass one while failing the other.

The time-series posterior predictive check uses the fitted posterior to generate simulated datasets and compare them to observed data across revenue distribution, seasonal patterns, and response to known media events. A model whose simulated data systematically differs from observed data in any of these dimensions is encoding a picture of the business that does not match reality. Figure 8 shows what this looks like for a well-fit and a structurally misspecified model.

Channel-level posterior inspection is a sibling diagnostic to the time-series posterior predictive check. Strictly, it is not a posterior predictive check in the formal sense, since it does not simulate replicated data from the posterior. It inspects the posterior over parameters directly, asking whether the data has updated beliefs about each channel's contribution in a commercially defensible way. The three patterns introduced in Part II, Figure 2 -- prior-dominated, well-calibrated, and data-dominated -- apply directly here as an interpretive framework, now observed at post-fit time rather than anticipated at prior-spec time. This is the diagnostic that determines whether stakeholders should trust the channel-level outputs they are about to read in a review room, and it is among the most underused in commercial MMM practice.

If systematic divergence is found in the time-series posterior predictive check: Identify the periods where simulated and observed data diverge most. Residuals clustering around promotions point to missing promotion controls. Divergence in seasonal peaks points to misspecified Fourier terms or missing holiday indicators. Divergence following media flights points to misspecified adstock decay. Address the identified structural gap and re-run.

If channel-level posteriors show a prior-dominated pattern: The channel lacked the spend variation needed to update beliefs. Flag it as data-limited, present the channel-level estimate as a directional range rather than a point estimate, and prioritize spend variation or an incrementality experiment in the next cycle.

If channel-level posteriors show a data-dominated pattern that moved implausibly far from a well-sourced prior: Investigate for confounds, structural breaks, collinearity, or data definition problems before accepting any channel-level output.

Figure 8: Posterior Predictive Check

After fitting, the posterior is used to simulate datasets and compare them to observed data. HDI bands and mean prediction now reflect what the model has learned, not just what the prior believed.

Figure 8: Posterior predictive check -- well-fit model vs. structurally misspecified model. Hypothetical data for illustrative purposes.

Step 7

Holdout validation

Withhold the most recent three to six months from the model build and evaluate out-of-sample prediction accuracy. The diagnostic combines two readings: the gap between in-sample and holdout MAPE, and the absolute level the holdout MAPE lands on. A holdout within two to three percentage points of in-sample indicates clean generalization. Larger gaps require interpretation in context of the absolute holdout level: a model that fits at 2% in-sample and predicts at 8% out-of-sample carries a six-point gap but still produces forecasts well within the range needed for decision-making, while a model that fits at 7% in-sample and predicts at 14% out-of-sample carries a similar gap but lands on a holdout level that limits its usability for budget decisions and requires investigation before the model is trusted. The holdout period should include at least one high-stakes window, a seasonal peak or significant media flight, since a model that only predicts well during normal periods is not a model worth trusting when decisions are most consequential.
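The two readings can be combined in a small helper. The three-point gap comes from the text above, but the 10% ceiling on a usable holdout MAPE is a placeholder assumption; the appropriate level depends on the decisions the model will inform.

```python
def holdout_reading(in_sample_mape, holdout_mape, clean_gap=3.0, usable_level=10.0):
    """Combine the in-sample/holdout gap with the absolute holdout level.

    clean_gap: gap (percentage points) treated as clean generalization.
    usable_level: hypothetical ceiling on holdout MAPE for decision use.
    """
    gap = holdout_mape - in_sample_mape
    if holdout_mape > usable_level:
        verdict = "investigate: holdout level limits decision use"
    elif gap <= clean_gap:
        verdict = "clean generalization"
    else:
        verdict = "usable, but document the gap"
    return gap, verdict

# The two cases from the text, plus a clean one.
case_a = holdout_reading(2.0, 8.0)
case_b = holdout_reading(7.0, 14.0)
case_c = holdout_reading(4.0, 6.0)
```

The absolute level is checked first, so a model with a modest gap still fails if its holdout error is too high to support budget decisions.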

If the holdout MAPE lands in a range that limits decision use: First check whether the holdout period contains an unusual event that was not modeled, such as a structural break, a promotional anomaly, or a market shock. If not, the model is likely overfitting: simplify by removing low-quality controls, widening priors, or simplifying the interaction structure. A large in-sample-to-holdout gap that still leaves the absolute holdout MAPE in a usable range does not require the same treatment, though the underlying cause is worth documenting for the next cycle. Do not present the model in a review room until the holdout MAPE supports the decisions the model will inform.

Figure 9: In-Sample MAPE vs. Holdout MAPE

The gap between fitting the past and predicting the future. A well-generalizing model vs. an overfitting model side by side.

Figure 9: In-sample vs. holdout MAPE -- good generalization vs. overfitting. Hypothetical data for illustrative purposes.

Step 8

Posterior width and prior sensitivity

A posterior nearly as wide as its prior has not learned from the data, signaling spend sparsity, data gaps, or collinearity. A posterior that has moved implausibly far from a well-sourced prior signals a structural data problem pulling the coefficient estimate. Run sensitivity scenarios across wider and narrower priors on key channels and adstock decay shifted one standard deviation in either direction. Contribution estimates that shift materially across these scenarios should be flagged explicitly before the model enters a review. If conclusions fall apart under slightly different assumptions, they were probably not solid to begin with.
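Both signals can be summarized per channel from draws of the prior and the posterior. The draws below are tiny invented lists, and the function name is hypothetical; in practice these would be thousands of samples from the fitted model.

```python
import statistics

def posterior_vs_prior(prior_draws, posterior_draws):
    """Return (width_ratio, shift) for one channel coefficient.

    width_ratio: posterior SD / prior SD (near 1.0 means little was learned).
    shift: movement of the mean, measured in prior SDs.
    """
    prior_sd = statistics.stdev(prior_draws)
    width_ratio = statistics.stdev(posterior_draws) / prior_sd
    shift = (statistics.fmean(posterior_draws) - statistics.fmean(prior_draws)) / prior_sd
    return width_ratio, shift

# Hypothetical draws for illustration only.
prior = [0.5, 1.0, 1.5, 2.0, 2.5]
learned = posterior_vs_prior(prior, [1.4, 1.5, 1.6, 1.5, 1.5])
not_learned = posterior_vs_prior(prior, [0.6, 1.0, 1.5, 2.0, 2.4])
```

A width ratio near 1 corresponds to the first warning in the paragraph above, and a large shift from a well-sourced prior corresponds to the second. Re-running the same summary under wider and narrower priors gives the sensitivity scenarios.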

If estimates are highly sensitive to prior choice: Diagnose the underlying cause rather than defaulting to an experiment. Sensitivity typically traces to insufficient spend variation, sparse history, collinearity with another media or control variable, or confounding with seasonality or promotions, and each cause calls for a different response. Insufficient variation is addressed by strengthening the prior using historical posteriors from prior cycles, planning evidence, or industry benchmarks, with the basis documented. Collinearity and confounding are structural problems and the model itself requires reformulation: consider merging closely correlated channels, adding the confound as an explicit control, or moving to a hierarchical structure that shares information across markets. Sparse history calls for evaluating whether the modeling window can be extended or whether aggregating to a coarser time grain produces stable estimates. An incrementality experiment is the right tool when prior strengthening and structural reformulation have been exhausted, targeted to the highest-stakes channel with the largest sensitivity rather than treated as a universal default. Flag the channel as data-limited in the review and treat the contribution estimate as a directional range until the underlying condition is addressed.

Step 9

Decomposition sanity check

Does the baseline make sense given the brand's organic demand and distribution footprint? Do seasonal peaks align with what the commercial team knows? Are any channel contributions implausible to an experienced practitioner? A decomposition that fails this check is not ready for a review room regardless of how clean the internal diagnostics are.

If the decomposition fails the sanity check: Do not present it. Walk the decomposition period by period against the commercial calendar to identify where the model's account of the business diverges from what actually happened. The divergence is almost always traceable to a missing control, a misspecified event bump, or a prior that was set without sufficient commercial input. Fix the structural cause rather than adjusting outputs to match expectations.
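A lightweight way to prepare for this check is to convert the decomposition into shares and compare each share against a range the commercial team considers plausible. The sketch below assumes revenue contributions summed over the modeling window. The component names, the contribution values, and the plausible ranges are all hypothetical. The output only marks where to start the period-by-period walk and is not a basis for adjusting model outputs.

```python
def decomposition_shares(contributions):
    """Convert absolute contributions into shares of the modeled total."""
    total = sum(contributions.values())
    if total == 0:
        raise ValueError("total contribution is zero")
    return {name: value / total for name, value in contributions.items()}


def implausible_components(shares, plausible_ranges):
    """Return components whose share falls outside its plausible range.

    Components listed in plausible_ranges but missing from the
    decomposition are also returned, since their absence needs explaining.
    Components without a stated range are not checked.
    """
    flagged = []
    for name, (low, high) in plausible_ranges.items():
        share = shares.get(name)
        if share is None or not (low <= share <= high):
            flagged.append(name)
    return flagged


# Illustrative revenue decomposition over the modeling window.
contributions = {"baseline": 600.0, "tv": 250.0, "search": 100.0, "promo": 50.0}

# Hypothetical ranges gathered from the commercial team before the fit.
plausible_ranges = {
    "baseline": (0.50, 0.80),
    "tv": (0.02, 0.15),
    "search": (0.02, 0.15),
}

shares = decomposition_shares(contributions)
for name in implausible_components(shares, plausible_ranges):
    print(f"Review before presenting: {name} share is {shares.get(name, 0):.0%}")
```

Collecting the ranges before the model is fit keeps the check honest, because the expectations cannot drift toward whatever the model happened to produce.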

Step 10

Triangulation against prior results and incrementality experiments

The appropriate triangulation reference depends on where you are in the measurement program lifecycle. For a first-time build, the primary external check is whatever experimental evidence exists: geo-based holdout results, platform lift studies, or published benchmarks for the category. Where none of these are available, cross-industry MMM benchmarks can serve as a plausibility check, particularly for well-studied channels like paid search or national television, provided the sample is large and the category dynamics are reasonably comparable. The goal is not to anchor to an external number but to confirm the model's estimates are within a commercially defensible range before they are relied on. For a model rebuild or refresh, the comparison against the outputs of the previous model run is equally important: if a channel's contribution has shifted materially since the last build, and nothing meaningful has changed in spend levels, flighting, or market conditions, that shift is a finding worth investigating before the new outputs are accepted. Where incrementality experiments exist, compare the model's channel contribution estimates against them directionally and in order of magnitude. A model attributing 20% or more contribution to a channel for which multiple experiments have consistently shown 5 to 7% lift has a structural problem no internal diagnostic will surface. When meaningful divergence exists, the gap is itself a finding: understand what is driving it and feed that understanding back into the prior specification for the next build.

If triangulation reveals material divergence: Treat the gap as a diagnostic finding, not a discrepancy to explain away. Document the divergence, identify the most likely structural cause, and decide whether to investigate further before the review or to present the gap transparently with a proposed resolution path. A model that disagrees with a well-designed experiment is not automatically wrong, but the disagreement needs an explanation that both the data science team and the measurement leader can defend.
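The directional and order-of-magnitude comparison can be written down as a simple rule so that it is applied the same way to every channel. The sketch below is a hypothetical illustration: it assumes the model estimate and the experimental range are expressed in the same unit (here, share of the KPI attributed to the channel), and the factor-of-two tolerance and the three labels are assumptions, not established standards.

```python
def triangulate(model_estimate, exp_low, exp_high, max_ratio=2.0):
    """Compare a model estimate with the range seen across experiments.

    Returns "consistent" when the estimate sits inside the experimental
    range, "close" when it is outside but within max_ratio of the nearest
    bound, and "divergent" otherwise (including a sign disagreement).
    """
    if exp_low <= 0 or exp_high < exp_low:
        raise ValueError("expected 0 < exp_low <= exp_high")
    if model_estimate <= 0:
        return "divergent"  # model and experiments disagree on direction
    if exp_low <= model_estimate <= exp_high:
        return "consistent"
    if model_estimate > exp_high:
        ratio = model_estimate / exp_high
    else:
        ratio = exp_low / model_estimate
    return "close" if ratio <= max_ratio else "divergent"


# The case described in the text: the model attributes 20% to a channel
# whose experiments have consistently shown 5 to 7%.
status = triangulate(model_estimate=0.20, exp_low=0.05, exp_high=0.07)
print(f"Channel status: {status}")
```

A "divergent" label is the starting point for the documentation and resolution path described above, not a judgment on whether the model or the experiment is the one at fault.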

"Measurement programs that fail are often not the ones with the weakest models but the ones where stakeholder alignment was never built. The diagnostic sequence is not just a technical checklist; it is a shared account of what the model can and cannot see, written before the review room."

Next -- Part IV

"The meeting described in Section 01 had a more productive version. It happened six months later, with the same client, the same data science team, and the same measurement consultants. It went differently not because the model was more sophisticated or the diagnostics more thorough, but because the people in the room had been in the same conversations before the outputs were ever produced."

Part IV: The Model Review as a Strategic Ritual

References

[1] Towards Data Science (2025). Marketing Mix Modeling: How to Avoid Biased Channel Estimates.

[2] De Livera, A., Hyndman, R., and Snyder, R. (2011). Forecasting Time Series with Complex Seasonal Patterns Using Exponential Smoothing. Journal of the American Statistical Association.

[3] Sun, Y., Wang, Y., Jin, Y., Chan, D., and Koehler, J. (2017). A Hierarchical Bayesian Approach to Improve Media Mix Models Using Category Data. Google Inc.

[4] Scanmarqed (2025). A Taxonomy of Biases in Marketing Mix Model Effect Estimation, Part 2.

[5] Analytical Alley (2024). Multicollinearity in Marketing Data: What It Is and How to Fix It.

[6] PyMC-Marketing Documentation. Prior Predictive Modeling. pymc-marketing.io.

[7] MASS Analytics (2025). 7 Stats to Evaluate Your Marketing Mix Model.

[8] Towards Data Science (2025). Media Mix Modeling: Technical Guideline to Avoid Pitfalls.

[9] Funnel.io (2025). What Makes an MMM Model Good?

[10] LeadSources (2026). Marketing Mix Modeling (MMM).

[11] Gelman, A., Meng, X., and Stern, H. (1996). Posterior Predictive Assessment of Model Fitness via Realized Discrepancies. Statistica Sinica.

[12] Stan Documentation. Prior and Posterior Predictive Checks. mc-stan.org.

[13] Bayesian Analysis Reporting Guidelines (2021). PMC / Nature Human Behaviour.

[14] Robyn / Meta. An Analyst's Guide to MMM. facebookexperimental.github.io.

[15] Moussavi, H. (2026). Did Your Marketing Actually Work? Prior & Effect. priorandeffect.com.

About the Author

Hedi Moussavi, PhD
