Measurement Fundamentals · Issue 02 · Part I of IV

Building a Model Worth Believing

What it takes to build a Bayesian MMM whose outputs can be trusted - by the data science team, the measurement consultant, and the CMO.

HEDI MOUSSAVI, PHD
MARCH 2026 · 10 MIN READ
LinkedIn →

This Essay - Four Parts

Part I: The Credibility Problem - this installment
Part II: The Prior as a Strategic Asset
Part III: Structural Controls and the Diagnostic Sequence
Part IV: The Model Review as a Strategic Ritual
Full essay: all sections, complete - to follow

This essay is the second in a series on the practice of marketing measurement. Where Essay 01 examined whether your marketing worked at all, Essay 02 works through what it takes to build a model whose answer to that question can be trusted. It publishes in four parts.

01

The Credibility Problem

The meeting had been on the calendar for two weeks, a model review of the kind where outputs are presented, interrogated, and eventually translated into a budget recommendation that will move real money. The data science team had built something technically solid: a modern Bayesian MMM, carefully scoped with agreed-upon variables, clean posterior distributions and reasonable in-sample fit, and no obvious diagnostic flags. The measurement consultants on the other side of the table, whether internal analytics leads or the vendor managers who sit between the model and the business in many organizations, had not built a competing model, because their role was to vet it, to pressure-test the outputs against accumulated business knowledge so that whatever recommendation left the room would be as trustworthy as possible.

They were not satisfied with what they saw.

Connected TV had been the story of the prior model run, with contribution estimates that were among the strongest in the mix and efficiency metrics that anchored a clear recommendation: increase CTV investment in the next planning cycle. The client had acted on that recommendation, budgets had shifted, and the upfront had been signed. Now the new model was on the screen, and CTV was near the bottom of the efficiency ranking, not marginally lower but materially and significantly lower, in a direction that had no obvious explanation in the spend data, the creative rotation, or anything the client's media team could point to. The data science team noted that the diagnostics were clean, the posteriors had converged, and the new model reflected an updated six months of data alongside a refined prior specification. The measurement consultant, who had been the one to present the prior model's CTV findings to the client, was not reassured by that answer, because she had both a professional stake in the previously reported results and a legitimate methodological concern: if the model could move this far between runs with no corresponding change in market conditions, what was the recommendation actually worth, and how could she take it forward to key stakeholders with confidence?

Both sides were right about the parts they could see, and the budget decision left the room without an answer.

What Trustworthy Actually Means

Most practitioners define a trustworthy model the way a clinical trial defines a successful drug: it passed the test. The diagnostics look clean, the R-squared is high, the posterior distributions have converged, and the fit line tracks the actuals closely enough that no one in the review raises an objection. This is a reasonable definition of a model that is not obviously broken, but it is not a definition of a model that is strategically correct.

What makes this harder than it sounds is that trustworthy means something different to every person in the room. The data science team is asking whether the model is well-specified. The measurement consultant is asking whether the outputs are consistent with what she knows about the channels. The CMO is asking whether the recommendation is defensible to the CFO. This is where MMM practice diverges from classical scientific inquiry: a clinical trial operates within a tightly governed community with agreed-upon standards for what constitutes evidence, while a model review room has people with different roles, different incentives, and different tolerances for uncertainty all looking at the same output. Measurement programs that fail are often not the ones with the weakest models but the ones where stakeholder alignment was never built. Listening to all sides, understanding their expectations before results arrive, and knowing how to communicate what the model can and cannot say in terms that land with each audience is not a soft skill layered on top of the technical work. It is part of the technical work.

A model can pass every standard diagnostic and still give you a structurally wrong picture of your business. It can assign implausible contribution to a channel that has been running continuously for three years, because the model has no way to separate its effect from the baseline it has quietly become. It can undervalue a brand awareness investment because the adstock decay parameter is set too tight, and the long-tail carryover never shows up in the revenue signal. It can look perfectly confident while averaging over two distinct market regimes that should never have been pooled together. None of these failures announce themselves in the output. They live in decisions made before the first line of model code is written: what data to include, what to control for, how to structure the time dimension, what the priors are encoding and why. A model resting on the wrong architecture will produce outputs that look clean and feel wrong, and the measurement leader who cannot explain why they feel wrong will eventually lose the room.
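The adstock failure is the easiest of these to make concrete. A minimal sketch of a geometric adstock transform, with hypothetical decay values, shows how much of a spend burst's carryover the model is even allowed to see:

```python
import numpy as np

def geometric_adstock(spend, decay):
    """Geometric carryover: effective exposure x_t = spend_t + decay * x_{t-1}."""
    out = np.zeros(len(spend))
    carry = 0.0
    for t, s in enumerate(spend):
        carry = s + decay * carry
        out[t] = carry
    return out

# A single burst of brand spend in week 0, then nothing for a year (hypothetical).
spend = np.zeros(52)
spend[0] = 100.0

for label, decay in [("tight decay (0.30)", 0.30), ("loose decay (0.85)", 0.85)]:
    effect = geometric_adstock(spend, decay)
    share_early = effect[:8].sum() / effect.sum()
    print(f"{label}: {share_early:.0%} of total carryover lands in the first 8 weeks")
```

With the tight decay, essentially all of the effect is forced into the first few weeks, so any revenue a brand campaign generates months later gets credited to something else, usually the baseline.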

This is what makes MMM diagnostics genuinely hard. The failure modes that matter most are not statistical, they are architectural, and the gap between what the model reports and what the model actually knows is precisely where trustworthiness lives.

Two Failure Modes and the Assumption That Changes Everything

There is a version of the credibility problem on each end of the spectrum, and both are common enough to deserve names.

The first is false precision. This is the model that arrives in a boardroom looking fully formed, with clean outputs, confident point estimates, and a clear budget recommendation. The uncertainty has been compressed into a single number, and the assumptions underneath it are invisible to everyone in the room. False precision is a recognized failure mode in statistical practice, defined broadly as presenting numerical data in a manner that implies better precision than the underlying evidence justifies. In an MMM context it is dangerous not because it is dishonest but because it is seductive, giving the room what it wants while hiding the cost of what it is glossing over.

The second is false humility. This is the model that arrives wrapped in so many caveats, confidence intervals, and qualifications that it communicates nothing actionable. Executives hear the uncertainty and disengage, and the model gets set aside while budget decisions get made on platform ROAS and gut feel instead, a far less rigorous and far more confident set of inputs. False humility feels responsible, and it may be intellectually defensible in academic settings where acknowledging uncertainty is a virtue, but in business environments that move fast and require hard decisions, it leads to paralysis. A model that cannot make a recommendation is not a more honest model, it is a less useful one, and in cases where the evidence genuinely supports a direction, over-hedging is not caution, it is abdication.
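The way out of both failure modes is mechanical once the posterior exists. A sketch of the middle path, using simulated draws in place of a fitted model's output, reports a point estimate, an interval, and the probability that actually matters for the decision; the channel values and breakeven threshold below are hypothetical:

```python
import numpy as np

# Stand-in for posterior ROI draws for one channel; in practice these come
# from the fitted model's posterior samples, not a random number generator.
rng = np.random.default_rng(7)
roi_draws = rng.normal(loc=1.4, scale=0.35, size=4000)

point = roi_draws.mean()
lo, hi = np.percentile(roi_draws, [2.5, 97.5])
hurdle = 1.0  # breakeven ROI for this hypothetical business
p_clears_hurdle = (roi_draws > hurdle).mean()

print(f"Posterior mean ROI: {point:.2f}")                     # the point alone is false precision
print(f"95% credible interval: {lo:.2f} to {hi:.2f}")         # the interval alone invites disengagement
print(f"Probability ROI exceeds breakeven: {p_clears_hurdle:.0%}")  # the decision-relevant statement
```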

"The goal of trustworthy measurement is neither false precision nor false humility. Model quality is not just a property of the model itself - it is a property of the relationship between the model and the decisions it is supposed to inform."

Reaching that goal requires a reframe that the rest of this essay is built on. Trustworthiness is not a one-time achievement, it is a continuous practice, earned through the quality of the initial build, maintained through a rigorous rerun cadence, and renewed every time the business changes in ways the model was not designed to see. The measurement leader who understands this does not ask whether the model is correct. They ask what the model is currently capable of seeing, what it is blind to, and whether the decision being made falls within its field of vision. That question is harder to answer than any diagnostic, and what follows in this essay is a framework for building models that survive it.

02

Data Architecture Before Model Architecture

Most modeling teams begin by choosing a framework. The more disciplined ones begin by auditing their data, because the decisions made in that audit, what the dependent variable measures, which inputs to include, how far back the window extends, will constrain every modeling choice that follows. A model built on a poorly defined revenue variable or a spend series with a quiet data engineering error cannot be fixed by a more sophisticated prior or a better convergence diagnostic. The problem is upstream of the model, and it needs to be solved there.

The Dependent Variable is a Business Decision

Before any data is pulled, the most important early step is a general alignment with the client team on what they are trying to measure, what channels and time horizons are in scope, and whether the data infrastructure exists to support an MMM at all.

With that context established, the first and most consequential scoping decision is the formal definition of the dependent variable. This is not a technical choice, it is a business one, and it needs to be made explicitly and documented formally before any data is pulled.

Whether revenue is measured at the transaction level or net of returns, whether it includes trade promotions or strips them out, and whether the outcome is expressed in dollars or in units sold - each choice scopes what the model is measuring and what its outputs mean for the budget decisions it informs. None of these choices is inherently wrong, but all of them need to be made explicitly and agreed upon before the build begins, not discovered in a review room. The same discipline applies to the independent variables: gross spend or net spend, impressions or dollars, working media only or inclusive of agency fees - each definition needs stakeholder sign-off before data is pulled.
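One way to make that sign-off durable is to encode it, so the build fails loudly if a series reaches the model without an agreed definition. The fields and names below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariableDefinition:
    """One signed-off definition per modeled series (fields are illustrative)."""
    column: str
    definition: str
    unit: str
    approved_by: str

# Hypothetical scope document, agreed before any data is pulled.
SCOPE = [
    VariableDefinition("revenue", "net of returns, excluding trade promotions", "USD per week", "VP Finance"),
    VariableDefinition("ctv_spend", "working media only, net of agency fees", "USD per week", "Media Director"),
    VariableDefinition("search_spend", "net spend, brand and non-brand combined", "USD per week", "Media Director"),
]

def check_scope(model_columns):
    """Fail loudly if a series reaches the model without a documented definition."""
    defined = {v.column for v in SCOPE}
    missing = [c for c in model_columns if c not in defined]
    if missing:
        raise ValueError(f"No signed-off definition for: {missing}")

check_scope(["revenue", "ctv_spend", "search_spend"])  # passes silently when the scope is complete
```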

The Data Window and Granularity Standard

The standard for MMM data architecture is two to three years of weekly observations. Two years is the commonly recommended minimum, sufficient to capture two full seasonal cycles and enough spend variation for reliable coefficient estimation. Longer windows reduce sensitivity to any single anomalous period and lower the risk of cross-run movement, with the benefit compounding as the program matures through successive quarterly refreshes. Each channel also needs sufficient spend variation across the window to support a reliable estimate: a channel running at flat or near-flat spend gives the model nothing to learn from, and its coefficient will be driven almost entirely by the prior rather than the data. Monthly data is too coarse to model adstock dynamics reliably, while daily data is acceptable but introduces additional noise and requires more careful outlier management.

Modeling with less than two years of data is possible, but model quality and posterior confidence will drop materially. Shorter windows amplify the influence of any single anomalous period, increase sensitivity to prior specification, and limit the model's ability to identify seasonal and trend components reliably. Where a sub-two-year build is unavoidable, stronger informative priors, tighter diagnostic scrutiny, and explicit communication of uncertainty to stakeholders are all warranted.
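Both the window-length and spend-variation requirements can be checked before any modeling starts. The sketch below assumes a weekly data frame and uses illustrative thresholds, roughly two years of weeks and a minimum coefficient of variation, rather than hard rules:

```python
import pandas as pd

def audit_window_and_variation(df, spend_cols, date_col="week", min_weeks=104, min_cv=0.20):
    """Flag two architecture risks before modeling: a window shorter than roughly
    two years of weekly data, and channels whose spend barely varies.
    The thresholds are illustrative defaults, not universal standards."""
    issues = []
    n_weeks = df[date_col].nunique()
    if n_weeks < min_weeks:
        issues.append(f"Only {n_weeks} weeks of data; expect wider posteriors and heavier reliance on priors.")
    for col in spend_cols:
        mean = df[col].mean()
        cv = df[col].std() / mean if mean > 0 else 0.0
        if cv < min_cv:
            issues.append(f"{col}: coefficient of variation {cv:.2f} - spend is nearly flat, so its estimate will lean on the prior.")
    return issues
```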

Figure 1: Posterior Distribution Width by Data Window Length

12-month window - wide uncertainty: posterior mean 9.8%, 95% CI width ±6.1pp, uncertainty range 3.7% - 15.9%
24-month window - moderate confidence: posterior mean 9.8%, 95% CI width ±3.5pp, uncertainty range 6.3% - 13.3%
36-month window - tightest posterior: posterior mean 9.8%, 95% CI width ±1.9pp, uncertainty range 7.9% - 11.7%

Figure 1: Posterior distribution width by data window length across 12, 24, and 36-month modeling windows. All panels share the same prior (Normal(10%, 3.5%)) and posterior mean (9.8%), isolating the effect of window length on posterior uncertainty. Hypothetical data for illustrative purposes.
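The mechanism behind Figure 1 can be sketched with a toy conjugate update: the same prior, fed more weekly observations that keep pointing at the same value, produces a progressively narrower posterior. The observation-noise scale here is hypothetical, so the widths will not match the figure's numbers, only its direction:

```python
import numpy as np

prior_mean, prior_sd = 10.0, 3.5   # the figure's Normal(10%, 3.5%) prior
obs_sd = 45.0                      # hypothetical week-to-week noise, for illustration only
sample_mean = 9.8                  # assume each window keeps pointing at the same value

for weeks in (52, 104, 156):
    post_var = 1.0 / (1.0 / prior_sd**2 + weeks / obs_sd**2)
    post_mean = post_var * (prior_mean / prior_sd**2 + weeks * sample_mean / obs_sd**2)
    half_width = 1.96 * np.sqrt(post_var)
    print(f"{weeks:>3} weeks: posterior mean {post_mean:.1f}%, interval roughly ±{half_width:.1f}pp")
```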

Data Sign-Off as a Governance Discipline

Data rarely arrives clean. The modeling team collects, ingests, and processes the inputs, then summarizes them back to the client before finalizing the data table used for modeling. This review step, where the client confirms that spend series, revenue figures, and control variables reflect what they intended to provide, is the most cost-effective intervention in the entire build. Problems caught here are hours to fix. The same problems discovered mid-build or in a review room are days to weeks of rework.

A practical data audit covers the following: dependent variable definition and scope, independent variable definitions for all media and control inputs, data completeness for the revenue series, temporal consistency to confirm that channel taxonomies have not changed mid-window, any restatements or retroactive data updates that could affect the historical series, and formal stakeholder sign-off documented before the build begins.
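Several of these audit items can be automated as a first pass before the human review. A minimal sketch, with assumed column names and thresholds:

```python
import pandas as pd

def pre_build_data_audit(df, revenue_col="revenue", date_col="week", spend_cols=None):
    """Automated first pass over the audit items above. Column names and
    thresholds are assumptions; the scope document defines the real ones."""
    findings = []

    # Completeness: every week between the first and last observation should appear once.
    weeks = pd.to_datetime(df[date_col]).sort_values()
    expected_n = int((weeks.iloc[-1] - weeks.iloc[0]).days / 7) + 1
    if weeks.nunique() != expected_n:
        findings.append(f"Expected {expected_n} weekly observations, found {weeks.nunique()} - gaps or duplicates in the window.")

    # Revenue nulls or zero weeks often signal a data engineering error, not reality.
    if (df[revenue_col].isna() | (df[revenue_col] == 0)).any():
        findings.append("Revenue series contains nulls or zero weeks - confirm with the client before modeling.")

    # Temporal consistency: a long all-zero block in a channel can mean the
    # taxonomy changed mid-window (renamed or split channel), not a real pause.
    for col in spend_cols or []:
        zero = df[col] == 0
        longest_zero_run = zero.astype(int).groupby((~zero).cumsum()).sum().max()
        if longest_zero_run >= 26:
            findings.append(f"{col}: {longest_zero_run} consecutive zero weeks - check for a taxonomy change.")

    return findings
```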

Getting the data right establishes the foundation. Getting the priors right determines what the model believes before it ever sees that data, and that distinction matters more than most practitioners realize.

Next - Part II

"Most practitioners treat prior specification as a technical formality - a set of distributional choices that gets documented in a methods appendix and rarely revisited. This is a significant underestimation of what the prior is doing."

Part II: The Prior as a Strategic Asset

References

[1] Pirie, W. (1985). False Precision. Encyclopedia of Statistical Sciences. Wiley.

[2] Huff, D. (1954). How to Lie with Statistics. W. W. Norton & Company.

[3] Chan, D., Perry, M. (2017). Challenges and Opportunities in Media Mix Modeling. Google Research.

[4] Towards Data Science (2025). Marketing Mix Modeling 101.

[5] Recast (2025). Branded Search Tests: The Fastest Way to Teach Your Org Incrementality.

About the Author

Hedi Moussavi, PhD

Connect on LinkedIn →
Prior & Effect
