This Essay -- Four Parts

This is Part IV of a four-part essay on building a Bayesian MMM worth trusting. Part I covered the credibility problem and data architecture. Part II worked through prior specification. Part III covered structural controls and the 10-Step P&E MMM Diagnostic Sequence. Part IV closes the essay with the model review as a strategic ritual: what goes wrong before the room, what the productive version looks like, and what needs to leave every review with owners and timelines.

07

When the Model Is Lying to You

Return to the meeting from Section 01. The diagnostics came back clean. MAPE is below 10%, the posteriors have converged, the holdout validation is within range, and the decomposition sums correctly to observed revenue. The data science team has done everything right by every standard internal measure. The model is not broken, and yet the measurement consultants across the table are not satisfied, because something in the outputs does not match what the room knows to be true about the business.

A broken model is fixable. A model that has passed every diagnostic and is still producing outputs that cannot be defended in a review room is a different kind of failure, one that does not announce itself in the posterior distributions or the residual plots. It lives in the gap between what the model is technically capable of reporting and what the business is actually asking it to explain. The table below organizes the most common credibility gaps by what the room sees in the outputs, what is likely causing it, and where to look first.

Credibility gap reference
What the room sees, what may be causing it, and where to look first
| What the room sees | Possible root cause | Where to look first |
| --- | --- | --- |
| A channel's contribution is far above or below what past experiments have shown | Collinearity with another channel, or insufficient spend variation to isolate the effect | Correlation matrix, posterior width, triangulation against experiment results |
| A channel that has never performed well is suddenly carrying significant contribution weight | Prior too weak, or an omitted control variable absorbing variance into that channel | Prior specification review, control variable audit |
| The baseline is implausibly high or low given brand strength and distribution | Seasonality or trend not fully accounted for, or the model pooling across two distinct market regimes, for example before and after a major distribution change, price repositioning, or demand shock | Decomposition by year, seasonal component review, residuals plotted around the candidate event date (a known disruption such as a distribution change, competitive entry, or demand shock) |
| Budget recommendations require spend levels the model has never observed | Saturation curve extrapolated beyond the training range | Plot observed spend distribution against the response curve; flag the extrapolation zone explicitly |
| Contribution estimates shift materially when a single variable is added or removed | Collinearity between that variable and one or more media channels | Re-run sensitivity scenario with and without the variable; examine coefficient movement |
| The model's story changes significantly between refresh cycles | Insufficient data window, structural break, or a control variable whose values have changed | Compare prior and current posteriors directly; flag what moved and why |
| A channel's adstock implies carryover that no one in the room believes | Posterior decay parameter too high, or the channel variable is absorbing a trend | Plot the implied decay curve; compare against experimental half-life estimates if available |
| The decomposition tells a different story than the brand team's lived experience of the year | Missing event controls, promotional timing misalignment, or a model averaging across two distinct market conditions | Walk the decomposition period by period against the commercial calendar |
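
Two of the checks in the table reduce to simple arithmetic, and a minimal sketch may make them easier to bring into a review. It assumes a geometric adstock, where the carryover weight at lag t is decay ** t; the essay does not specify the adstock form, so treat that as an assumption. The function names and all numbers are hypothetical.

```python
import math


def geometric_adstock_weights(decay: float, n_periods: int) -> list[float]:
    """Carryover weight remaining after 0, 1, ..., n_periods - 1 periods."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    return [decay ** lag for lag in range(n_periods)]


def implied_half_life(decay: float) -> float:
    """Periods until the carryover weight falls to half, solving decay ** t = 0.5."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    return math.log(0.5) / math.log(decay)


def extrapolation_ratio(observed_spend: list[float], recommended_spend: float) -> float:
    """Recommended spend as a multiple of the highest spend seen in training."""
    max_observed = max(observed_spend)
    if max_observed <= 0:
        raise ValueError("observed spend must contain a positive value")
    return recommended_spend / max_observed


# Hypothetical values for illustration only.
posterior_mean_decay = 0.8
half_life_weeks = implied_half_life(posterior_mean_decay)  # about 3.1 weeks
weights = geometric_adstock_weights(posterior_mean_decay, 8)

weekly_spend = [40_000.0, 55_000.0, 62_000.0, 48_000.0]
ratio = extrapolation_ratio(weekly_spend, recommended_spend=93_000.0)
in_extrapolation_zone = ratio > 1.0
```

A half-life stated in weeks is easier for the room to challenge than a raw decay parameter, and a ratio above 1.0 marks a recommendation that sits outside the spend range the response curve was fitted on.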

The thread connecting every row in this table is the same: the model is technically correct within the assumptions it was given, and those assumptions are wrong in some way that the diagnostics could not see. The model is not lying in the adversarial sense; it is reporting faithfully on a picture of the business that was never accurate to begin with. The job of the measurement leader in that review room is to understand which assumption is failing and why, not to defend the model and not to dismiss it.

The resolution is not a better model or a more skeptical consultant. It is a structured process for building models in a way that makes their assumptions visible before outputs are produced, and for conducting reviews in a way that gives every person in the room a precise account of what the model can and cannot see. That process is what Section 08 describes.

08

The Model Review as a Strategic Ritual

Returning to Section 01

The meeting had been on the calendar for two weeks. This time, it went differently, not because the model was more sophisticated or the diagnostics more thorough, but because the people in the room had been in the same conversations before the outputs were ever produced. The data science team and the measurement consultants had sat together during the prior specification session, reviewed the correlation matrix jointly, and walked the commercial calendar together during the data audit. When the channel-level outputs appeared on the screen, the measurement consultant who would have challenged them in the original scene recognized the numbers, not because they matched her intuition perfectly, but because she had been part of the conversation that produced them. That is what a model review as a strategic ritual produces: not agreement, not the elimination of uncertainty, but a shared account of where each number came from and what it would take for it to be wrong.

What Goes Wrong Before the Room

Most model review failures do not originate in the review itself. They originate in the gap between the alignment session and the moment outputs are presented, when changes accumulate quietly and no one realizes the model being reviewed is not the model that was agreed upon. This section assumes alignment happened: priors jointly specified, data signed off, parameters agreed. What follows is what goes wrong after that, when the agreed model drifts before anyone sees the outputs. The most common ways that gap opens are:

  • Spend data pulled incorrectly by data engineering: a channel's weekly spend series contains a processing error, a duplicate week, a misallocated cost, or a suppressed spike that was never caught in EDA.
  • Priors silently revised after alignment: a prior that was jointly specified and documented gets tightened or loosened during the build without the change being surfaced to the measurement leader before the review.
  • Adstock or saturation parameters shifted from agreed ranges: changing how a channel's carryover or diminishing returns are modeled in ways that can materially move its contribution estimate.
  • A control variable or event bump added or removed after data sign-off: altering the baseline and redistributing contribution across channels without a corresponding update to the assumptions log.
  • The modeling window shifted without notification: adding a quarter or trimming a problematic period without surfacing the change for review.

Any one of these can produce outputs that look diagnostically clean and are commercially wrong, and the measurement leader who does not know the model changed has no basis for challenging the result. Before a review is scheduled, the following conditions should be confirmed:

  • Data sign-off is documented and no changes to the spend series have occurred since sign-off.
  • Prior specification is documented and matches what was jointly agreed before the build began.
  • The 10-Step P&E MMM Diagnostic Sequence from Section 06 has been completed, with any flags prepared for disclosure rather than discovered in the room.
  • The decomposition has passed the internal sanity check, with baseline, seasonal peaks, and channel contributions reviewed against business knowledge.
  • Where other empirical evidence exists, whether prior MMM runs, incrementality experiments, or other causal studies, the comparison between model estimates and that evidence is documented, with any divergence and its likely explanation ready to present.
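
One way to make the first two conditions checkable rather than asserted is to fingerprint the agreed artifacts at sign-off and compare them again before the review is scheduled. The sketch below is an illustration under assumptions: the record layout, the field names, and the `pre_review_gate` helper are hypothetical, and a real program would likely store the digests alongside the assumptions log.

```python
import hashlib
import json


def fingerprint(records: list[dict]) -> str:
    """Stable SHA-256 digest of a list of row dicts, independent of row order."""
    canonical = sorted(json.dumps(row, sort_keys=True) for row in records)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def pre_review_gate(signed_off: dict, current: dict) -> list[str]:
    """Return the names of agreed artifacts whose fingerprint has changed."""
    return [name for name, digest in signed_off.items() if current.get(name) != digest]


# Hypothetical artifacts captured at the alignment session.
spend_at_signoff = [
    {"week": "2025-01-06", "channel": "ctv", "spend": 12000.0},
    {"week": "2025-01-13", "channel": "ctv", "spend": 9500.0},
]
priors_at_signoff = [{"channel": "ctv", "dist": "HalfNormal", "sigma": 0.5}]
signed_off = {
    "spend": fingerprint(spend_at_signoff),
    "priors": fingerprint(priors_at_signoff),
}

# Simulate a prior tightened during the build without being surfaced.
priors_in_build = [{"channel": "ctv", "dist": "HalfNormal", "sigma": 0.2}]
current = {
    "spend": fingerprint(spend_at_signoff),
    "priors": fingerprint(priors_in_build),
}

changed = pre_review_gate(signed_off, current)  # ["priors"]
```

A non-empty result does not mean the change was wrong. It means the change has to be surfaced and explained before the outputs reach the room.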

The Argument That Happens in Every Room

Every practitioner reading this has been in a version of the same disagreement. The new model is on the screen, and CTV, which carried the strongest efficiency story in the prior run and anchored the budget recommendation the client acted on six months ago, has moved materially in the wrong direction, with no obvious change in spend levels, creative, or competitive conditions to explain it.

In the unproductive version of this conversation, the data science team defends the methodology and the measurement consultant defends the previously reported results, the argument stays at the level of outputs with no resolution available at that level, and everyone leaves the room frustrated with no decision made.

In the version where the assumptions are visible, someone pulls the spend data before the conversation goes further. In this case, a data engineering error had introduced a duplicate spend entry across several CTV weeks during the new modeling window, inflating the apparent cost base and compressing the channel's measured efficiency with no corresponding change in its actual performance. The correction is made, the model is rerun, and the CTV estimate returns to a range consistent with the prior result. The measurement consultant, recognizing that the experimental evidence on CTV has always been thinner than the confidence the prior model implied, recommends a geo-based incrementality test for the next planning cycle to build a stronger evidentiary foundation before the next budget commitment. That recommendation leaves the room as a committed action item, and the client has a cleaner basis for their upfront decision than either model run alone could have provided.
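
The kind of duplicate entry described above can often be caught with a simple check on the spend feed before it reaches the model. This is a minimal sketch that assumes one row per week and channel; the row layout and function names are hypothetical.

```python
from collections import Counter


def duplicate_spend_keys(rows: list[dict]) -> list[tuple[str, str]]:
    """Return (week, channel) pairs that appear more than once in the spend feed."""
    counts = Counter((row["week"], row["channel"]) for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def weekly_totals(rows: list[dict]) -> dict[tuple[str, str], float]:
    """Sum spend per (week, channel), which is what a naive aggregation would model."""
    totals: dict[tuple[str, str], float] = {}
    for row in rows:
        key = (row["week"], row["channel"])
        totals[key] = totals.get(key, 0.0) + row["spend"]
    return totals


# Hypothetical feed in which one CTV week was loaded twice.
feed = [
    {"week": "2025-03-03", "channel": "ctv", "spend": 10000.0},
    {"week": "2025-03-10", "channel": "ctv", "spend": 10000.0},
    {"week": "2025-03-10", "channel": "ctv", "spend": 10000.0},
    {"week": "2025-03-17", "channel": "ctv", "spend": 11000.0},
]

duplicates = duplicate_spend_keys(feed)  # [("2025-03-10", "ctv")]
inflated = weekly_totals(feed)[("2025-03-10", "ctv")]  # 20000.0 instead of 10000.0
```

A flagged pair is a question for data engineering rather than an automatic correction, since a feed that legitimately splits a week into several line items would trigger the same check.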

"Measurement programs that fail are often not the ones with the weakest models but the ones where stakeholder alignment was never built. The diagnostic sequence is not just a technical checklist; it is a shared account of what the model can and cannot see, written before the review room."

What Leaves the Room

Six things should leave every model review with owners and timelines. Accountability for all six sits with the measurement program leader, who owns the record and ensures it is shared cross-functionally with the data science, data engineering, and client-facing teams whose work it depends on:

01

A documented model decision

Accepted, accepted with caveats, or returned for revision with a clear scope. Where multiple candidates were evaluated, document why specific models were rejected and on what grounds the accepted model was selected, including key fit statistics and the business plausibility considerations that informed the final choice.

02

An assumptions log

Every prior, structural decision, and diagnostic flag captured for reference in the next cycle. Without it, each new build relitigates the same debates from scratch rather than compounding what the prior run learned.

03

A model configuration record

Variable inputs, prior specifications, adstock and saturation parameter ranges, control variables included, and the modeling window for this run. This is the baseline against which the next run is compared; without it there is no principled way to explain why the outputs moved.

04

A model-to-model change log

Any meaningful differences from the prior run: variables added or removed, priors revised, window extended, parameters adjusted. Cross-run movement should be attributed to a documented decision rather than treated as unexplained drift.

05

An open questions register

Unresolved tensions, sensitivity findings that warrant further investigation, and hypotheses generated during the review. A review that produces no open questions is not a rigorous review; it is a presentation.

06

A calibration commitment

Which channels and contribution ranges the next round of experiments is designed to test. This connects the model review to the experimental roadmap, ensuring that evidentiary gaps become measurement priorities for the next planning cycle.

These six are the review-level record, owned by the measurement program leader. They sit alongside, not in place of, the system-level logging maintained by the data science and engineering teams: pipeline run logs and data freshness timestamps, model fitting artifacts such as sampler diagnostics, trace objects, and convergence statistics, version control history for the modeling code, and data lineage records tracing each input back to its source. The review record captures the decisions and their rationale; the engineering logs capture the technical execution. Both are required, and neither substitutes for the other.
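
The model-to-model change log can be derived mechanically once each run has a configuration record. The following is a minimal sketch, assuming the record is stored as a flat dictionary; the keys, values, and the `change_log` helper are hypothetical.

```python
def change_log(previous: dict, current: dict) -> list[str]:
    """Describe every top-level setting that differs between two configuration records."""
    entries = []
    for key in sorted(previous.keys() | current.keys()):
        if key not in previous:
            entries.append(f"added {key}: {current[key]!r}")
        elif key not in current:
            entries.append(f"removed {key}: {previous[key]!r}")
        elif previous[key] != current[key]:
            entries.append(f"changed {key}: {previous[key]!r} -> {current[key]!r}")
    return entries


# Hypothetical configuration records for two consecutive runs.
previous_run = {
    "window": ("2023-01-02", "2024-12-30"),
    "controls": ["price", "distribution"],
    "ctv_adstock_decay_range": (0.3, 0.8),
}
current_run = {
    "window": ("2023-04-03", "2025-03-31"),
    "controls": ["price", "distribution", "competitor_entry"],
    "ctv_adstock_decay_range": (0.3, 0.8),
}

log = change_log(previous_run, current_run)
for entry in log:
    print(entry)
```

The diff only lists what changed. The rationale for each change still has to be written by a person, which is why the change log sits in the review-level record rather than in the engineering logs.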

The model review generates the measurement agenda, not just the budget recommendation. The meeting from Section 01 ended without an answer; every review should instead end with these six outputs and a room that knows precisely what the model can see and what it cannot.

09

From Trusted Model to Living System

Most organizations treat their first credible MMM as a destination. The build is complete, the review went well, the budget recommendation was accepted, and the model is filed alongside the presentation deck that delivered it. Six months later the planning cycle returns, the data is stale, and the rebuild begins from scratch, relitigating the same prior specification debates, the same data audit decisions, and the same diagnostic sequence without the benefit of what the prior run learned. The institutional knowledge evaporates. The credibility that was earned has to be rebuilt.

Measurement programs that are ahead of this curve do not wait for the next planning cycle to find out if the model is still telling the truth. They monitor continuously through two mechanisms. The first is periodic model reruns: refreshing the data window weekly, monthly, or quarterly depending on how quickly the business and its media environment move, running the accepted model's posteriors forward as priors, and checking whether contribution estimates have moved in ways that can be explained by what actually changed in the business. The second is forecast monitoring: running updated spend and control data through the accepted model as-is to generate short-term revenue predictions, then comparing those predictions against actual outcomes on a rolling basis. A model whose forecasts are consistently off in a particular direction is signaling a structural change the current specification has not yet captured, and catching that signal between builds is far less costly than discovering it in a review room six months later.
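
The forecast monitoring mechanism can be sketched as a rolling check for errors that share a direction. This is an illustration under assumptions: the four-period window and the 5% threshold are hypothetical defaults, not recommendations from the essay, and the revenue figures are invented.

```python
def signed_errors(actual: list[float], forecast: list[float]) -> list[float]:
    """Forecast error as a fraction of actual; positive means the model under-predicted."""
    return [(a - f) / a for a, f in zip(actual, forecast)]


def persistent_bias(errors: list[float], window: int = 4, threshold: float = 0.05) -> bool:
    """True when the last `window` errors share a sign and their mean magnitude exceeds `threshold`."""
    recent = errors[-window:]
    if len(recent) < window:
        return False
    same_sign = all(e > 0 for e in recent) or all(e < 0 for e in recent)
    mean_error = sum(recent) / window
    return same_sign and abs(mean_error) > threshold


# Hypothetical weekly revenue (in thousands) and the accepted model's forecasts.
actual = [100.0, 104.0, 98.0, 110.0, 112.0, 115.0, 118.0]
forecast = [101.0, 103.0, 99.0, 102.0, 103.0, 105.0, 107.0]

errors = signed_errors(actual, forecast)
early_flag = persistent_bias(errors[:4])  # mixed signs in the first four weeks
late_flag = persistent_bias(errors)       # last four weeks all under-predicted
```

Requiring both a shared sign and a minimum magnitude separates ordinary noise, where errors alternate around zero, from the one-directional drift that suggests the specification is missing something.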

The framework described in this essay is designed to prevent that loss of institutional knowledge. Data architecture decisions documented before the build, priors jointly specified and recorded, diagnostics run in sequence and flagged before the review room, model configuration and change logs maintained across runs: these are not overhead. They are the compounding asset that makes each successive model more credible than the last and each review room less contentious than the one before it.

The posterior-as-prior mechanism described in Section 03 is the technical expression of this compounding. By using the accepted model's posterior distributions as the starting priors for each subsequent refresh, the effective evidence base grows continuously even when the active window stays fixed. Every experiment result fed into the next build's prior, every assumption documented and revisited, every cross-run movement explained rather than presented: these practices accumulate into a measurement program that earns trust over time rather than defending outputs one review at a time. The lifecycle of that program, including how it refreshes, when it rebuilds, and how AI is beginning to automate the connective-tissue work that frees measurement leaders to focus on the judgment calls that actually require human expertise, will be the subject of future essays.
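
A toy example may help show why carrying the posterior forward accumulates evidence. The sketch below is not an MMM: it assumes a single Normal-distributed effect with known noise variance, so the posterior has a closed form, and every number is hypothetical. It also uses non-overlapping windows; with overlapping refresh windows the same observations would be counted twice, which is a separate issue this sketch does not address.

```python
import math


def update_normal(prior_mean: float, prior_var: float,
                  observations: list[float], noise_var: float) -> tuple[float, float]:
    """Normal prior plus Normal likelihood with known noise variance gives a Normal posterior."""
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var
    data_mean = sum(observations) / n
    post_precision = 1.0 / prior_var + n / noise_var
    post_mean = (prior_mean / prior_var + n * data_mean / noise_var) / post_precision
    return post_mean, 1.0 / post_precision


# Hypothetical prior on a channel effect and two successive data windows.
prior_mean, prior_var, noise_var = 1.0, 0.25, 0.04
first_window = [1.3, 1.1, 1.2]
second_window = [1.25, 1.15]

# Refresh 1 sees only the first window; refresh 2 sees only the second,
# but starts from the posterior of refresh 1.
mean_1, var_1 = update_normal(prior_mean, prior_var, first_window, noise_var)
mean_2, var_2 = update_normal(mean_1, var_1, second_window, noise_var)

# Reference: a single fit on both windows pooled together.
pooled_mean, pooled_var = update_normal(
    prior_mean, prior_var, first_window + second_window, noise_var
)
matches_pooled = math.isclose(mean_2, pooled_mean) and math.isclose(var_2, pooled_var)
```

In this simplified setting the second refresh reaches the same posterior as a single fit on all the data, and the variance shrinks at each step, which is the sense in which the evidence base grows while the active window stays the same size.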

Credibility Is a Practice, Not a Property

The meeting from Section 01 that ended without an answer was not a failure of the model. It was a failure of the practice surrounding it: the shared language, the visible assumptions, the joint ownership of the decisions that determined what the model could and could not see. A model that is technically correct and organizationally isolated will lose the room every time, not because the outputs are wrong but because the room has no basis for believing they are right.

The measurement leader who has built the practice described in this essay arrives in every review room with something more durable than a clean diagnostic report. They arrive with a documented account of every assumption that shaped the model, a prior-to-posterior record that shows what the model has learned across successive runs, an experimentally grounded evidentiary foundation that connects the model's outputs to causal reality, and a shared language with every person in the room for interrogating the outputs rather than simply accepting or rejecting them.

That is what a trusted model looks like from the inside: not a model that is never challenged, but a model whose challenges can always be answered specifically, precisely, and with reference to decisions that were made transparently and in good faith. Credibility is not a property a model has at delivery. It is a property a measurement program earns over time, through the accumulated discipline of building well, reviewing rigorously, working transparently, and learning continuously from whatever the data and the experiments have to say.

About the Author

Hedi Moussavi, PhD
