№ 001 · Field guide

What good looks like: the analytics maturity model

22 min read · Published May 19, 2026

The phrase we hear most often, by the time a team calls us, is "we need a fully integrated stack." Sometimes it arrives as "we want one source of truth." Sometimes as "we just need a single tool that does everything." The frustration underneath is real, and the diagnosis is almost always wrong. The analytics maturity model that actually works inside a $5M to $100M company is not a shopping list of integrated suites; it is a sequence of governance moves, made in the order they have to be made, that allows integration to follow on its own. This field guide is the version of that argument we end up writing on a whiteboard at least once a quarter, so we may as well write it down.

The analytics maturity model, briefly stated

Most maturity models you read on vendor blogs end up being five-stage ladders where the rungs are abstractions like "data-aware," "data-capable," "data-adept." We use a shorter version with four rungs that map to behaviours executives can actually observe in their own week. The point of the ladder is not to label the team; the point is to find the rung the team is stuck on, because that is where the next $40K of work has the highest payback.

L1 · Tracking. Events fire. Numbers exist. Whether they are the right numbers is unresolved. The question being asked, when the team is honest, is "are we measuring at all."

L2 · Reporting. Numbers reconcile to one source. Finance, marketing, and product look at the same definitions of revenue, of acquisition, of activation. The question being asked is "do we trust the dashboards."

L3 · Decision. Numbers shape the next $50K of spend, the next channel cut, the next hiring conversation. The question being asked is "what would change if the modeled answer were different."

L4 · Compounding. The stack itself gets reviewed and pruned on a calendar. Models that stopped earning their keep get retired. Kill-switches are scheduled in advance, not negotiated under pressure. The question being asked is "what are we still doing that we shouldn't be."

The honest observation, after auditing a few hundred stacks, is that most teams who tell us they want L4 are stuck at L1, and the most common failure mode is buying L4 tools while the L1 tracking plan does not yet exist in writing. Hightouch is not a fix for a missing measurement plan; PyMC-Marketing is not a fix for a missing warehouse. The order is non-negotiable.

A maturity model is a diagnostic. It is not a roadmap. The roadmap is what the model produces once you find the rung. You cannot model your way to L3 if the L2 warehouse has no signed definition of net revenue, and you cannot earn L4 governance unless somebody has earned the right to pull the kill cord. We have audited teams who installed reverse-ETL before they had a marts layer, and teams who bought Snowflake before they had agreed on what counts as a customer. Both kinds of teams paid an honest price to find out they had skipped a rung.

L1: tracking you can actually believe

This is where the team that just inherited a green-field Shopify build lives, and where most engagements diagnose their way into expensive surprises. The temptation, every time, is to ship tags first and write the plan later. The team that does this spends the next 18 months apologising for the data, sometimes from a new job.

A tracking plan is the artifact, not the GTM container

The mistake we see most often is that "the tracking plan" lives inside a GTM container or a GA4 admin panel rather than as a written document. The container has whatever an agency dropped into it last quarter; the admin panel has whatever someone toggled to silence an alert on a Tuesday. Neither is a contract. A tracking plan is a written object: every event by name, every property by type, the owner of the event, the business question the event exists to answer, and the conversion definitions tied through to each marketing platform. Without that document, the next agency, the next CRM swap, and the next ad-platform reshuffle each get a free pass to reinvent the taxonomy.

We model tracking plans on either the Segment Spec or the Snowplow Iglu registry, depending on the stack the client already runs. Object plus action event names. snake_case properties. PII flagged where it travels and stripped where it shouldn't. A schema file for every event, version-controlled in the same repo as the front-end code, reviewed in pull requests by the analytics owner before deployment. We have not seen a tracking-plan-first engagement produce ambiguous attribution at L2; we have seen every tags-first engagement produce ambiguous attribution at L2. The order matters more than the choice of tooling does.
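To make "a schema file for every event" concrete, here is a minimal sketch in Python of a plan entry and the checks a pull-request review could automate. The event name, owner address, properties, and helper functions are all hypothetical; a real plan would follow whichever spec the stack already uses.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Hypothetical plan entry: the event, owner, question, and properties are
# illustrative assumptions, not taken from a real tracking plan.
TRACKING_PLAN = {
    "order_completed": {
        "owner": "analytics-owner@example.com",
        "question": "How much net revenue did each channel produce this week?",
        "properties": {
            "order_id": {"type": str, "pii": False},
            "net_revenue_usd": {"type": float, "pii": False},
            "customer_email": {"type": str, "pii": True},
        },
    },
}


def plan_naming_errors(plan=TRACKING_PLAN):
    """Return every event or property name in the plan that is not snake_case."""
    names = list(plan) + [p for entry in plan.values() for p in entry["properties"]]
    return [name for name in names if not SNAKE_CASE.match(name)]


def validate_event(name, payload, plan=TRACKING_PLAN):
    """Return a list of violations; an empty list means the payload matches the plan."""
    if name not in plan:
        return [f"event '{name}' is not in the tracking plan"]
    spec = plan[name]["properties"]
    errors = []
    for key, value in payload.items():
        if key not in spec:
            errors.append(f"unplanned property '{key}'")
        elif not isinstance(value, spec[key]["type"]):
            errors.append(f"property '{key}' should be {spec[key]['type'].__name__}")
    errors.extend(f"missing property '{key}'" for key in spec if key not in payload)
    return errors


def strip_pii(name, payload, plan=TRACKING_PLAN):
    """Drop properties flagged as PII before forwarding to a destination that should not see them."""
    spec = plan[name]["properties"]
    return {k: v for k, v in payload.items() if not spec.get(k, {}).get("pii", False)}
```

A check like this would run in CI next to the front-end code, so an event that is missing from the plan fails the build instead of surfacing as an attribution argument later.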

The "integrated stack" question, restated

The most common version of this conversation, almost word-for-word, is a B2B retailer with a green-field Shopify build, no warehouse yet, sales split across Shopify, Amazon, and an offline channel, and a legacy on-prem accounting system. They are about to ask vendors for "a fully integrated solution." The honest answer is that integration is not a SKU on a marketplace. Integration is the consequence of a governed warehouse and a written metric contract; it does not ship in a box. The work to do, in the order to do it, is to write the tracking plan against the decisions the team has to make, stand up the smallest defensible warehouse, land Shopify and Amazon and the accounting system through Fivetran or Airbyte connectors, and only then choose a BI tool. Tracking plan first. Warehouse second. BI last. The all-in-one vendor goes home.

The ten-point review we audit against

When we audit a team's L1, we run a ten-point review adapted from Trackingplan's GA4 checklist. Property structure. Event taxonomy completeness. UTM consistency. User-ID and identity stitching. Conversion definitions tied to platform-side configurations. PII scanning. Tag-management container hygiene. Anomaly detection on event volume. Cross-domain integrity for multi-property setups. Custom dimensions and parameters that the team relies on but no one currently owns. Each one gets a P0, P1, or P2 grade, where P0 is revenue-at-risk or actively-wrong, P1 is trust-eroding, P2 is hygiene. The fix list comes out of the grading, not out of the consultant's intuition about what feels broken.
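As a sketch of how the grading turns into a fix list, the snippet below sorts findings by severity. The P0, P1, and P2 labels come from the review described above; the `Finding` structure is an assumption made for illustration.

```python
from dataclasses import dataclass

# Severity labels as described in the review: P0 revenue-at-risk or actively
# wrong, P1 trust-eroding, P2 hygiene.
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2}


@dataclass(frozen=True)
class Finding:
    check: str      # which of the ten review points the finding belongs to
    severity: str   # "P0", "P1", or "P2"
    note: str       # what was observed


def fix_list(findings):
    """Order findings P0 first; ties keep their original (review) order."""
    for finding in findings:
        if finding.severity not in SEVERITY_ORDER:
            raise ValueError(f"unknown severity: {finding.severity}")
    # sorted() is stable, so findings of equal severity stay in review order.
    return sorted(findings, key=lambda f: SEVERITY_ORDER[f.severity])
```

The ordering is deliberately mechanical: the grade decides what gets fixed first, not whichever finding was discussed most recently.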

What good L1 looks like

A team at the top of L1 can answer three questions in writing, without re-asking the analyst: what events are we collecting and why, what do we believe each conversion number to be, and who is the documented owner of each event when the schema changes. That sounds like a low bar. We have audited eight-figure-revenue brands that do not pass it. The L1 deliverable is unglamorous, the kind of artifact the team has to be persuaded to value because nobody at a board meeting ever applauded a tracking plan. It is also the deliverable that makes every subsequent dollar spent on tooling worth its price.

L2: reporting that survives a CFO challenge

The most expensive Monday in a mid-market company is the one where the CFO and the VP Marketing are both technically right. Both numbers reconcile to a source; the sources just disagree. Reporting that survives a CFO challenge is the artifact that makes that Monday rare instead of routine, and it is the work that the analytics maturity model puts at L2.

The warehouse comes first

We are warehouse-first because every alternative has been tried and produced the same set of failure modes. Reporting straight out of GA4 produces a number that disagrees with finance by 8% to 30% on any given week, depending on cookie consent rates and attribution settings. Reporting from a BI tool's native connectors produces a number that disagrees with finance by a different 8% to 30%, this time tied to the specific connector's join logic. Reporting from a spreadsheet produces a number that disagrees with everyone, including last week's version of itself.

The smallest defensible warehouse, in 2026, is BigQuery on-demand or Snowflake at the smallest paid edition. For Google-native teams it is usually BigQuery, because the GA4 export already lands there for free and the IAM model is one fewer system to govern. For everyone else it is usually Snowflake, because the RBAC model and time-travel features make the governance conversation easier from week one. We have stood up either inside half a day. The cost-of-ownership target at this stage is under $1K a month including ingestion and BI; the cost of skipping this layer is, in our experience, about three analyst-days a week in reconciliation work and a recurring board meeting where two slides quote two different revenue numbers.

The metric contract is the second artifact

A metric contract is a single document that defines every executive metric on every dashboard, in plain English plus the SQL that backs it. Each entry names the source dbt model, the business definition, the refresh cadence, the named owner, the on-track threshold, and the variance that triggers an escalation. The CFO signs it. The CMO signs it. The CEO sees it on a wall.
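One way to picture a contract entry is as a small record with the fields listed above. The sketch below is an illustration only: the field names, the example values, and the rule for when a shortfall escalates are assumptions, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricContractEntry:
    name: str
    business_definition: str     # the plain-English definition the executives sign
    source_model: str            # the dbt model that backs the metric
    refresh_cadence: str
    owner: str
    on_track_threshold: float    # the value at or above which the metric is on track
    escalation_variance: float   # relative shortfall that triggers escalation (assumed rule)

    def status(self, observed):
        """Classify an observed value as on_track, watch, or escalate."""
        if observed >= self.on_track_threshold:
            return "on_track"
        shortfall = (self.on_track_threshold - observed) / self.on_track_threshold
        return "escalate" if shortfall >= self.escalation_variance else "watch"


# Hypothetical entry; every value here is made up for the example.
weekly_net_revenue = MetricContractEntry(
    name="weekly_net_revenue",
    business_definition="Net revenue ex-shipping, returns netted in the week processed",
    source_model="fct_weekly_revenue",
    refresh_cadence="daily",
    owner="finance-analyst@example.com",
    on_track_threshold=100_000.0,
    escalation_variance=0.10,
)
```

With these example values, `weekly_net_revenue.status(95_000.0)` returns `"watch"` and `weekly_net_revenue.status(85_000.0)` returns `"escalate"`.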

The first time a team signs a metric contract, an honest argument happens. Marketing wants gross revenue inclusive of self-serve trial conversions; finance wants net revenue ex-shipping with returns netted into the fiscal week the return was processed. Both definitions are legitimate. The contract forces the choice; the dashboard then enforces it. We have watched contracts survive two CMO changes inside an 18-month window, which is the only test of a contract that ultimately matters. A contract that does not survive the next leadership change was a contract in name only.

A digital bank, fourteen source systems, OSFI on the line

A digital banking client we worked with had fourteen source systems each feeding an operational read replica, with sixty analysts writing ad-hoc SQL against the replicas to answer the same recurring questions in slightly incompatible ways. The board deck was quarterly because it took six weeks to assemble; the customer funnel was rebuilt from scratch every quarter by a different analyst. We replaced the read-replica chaos with a governed Snowflake warehouse, an SCD-2 dim_customer, an fct_journey_event capturing every customer-affecting event with stage attribution, and a signed metric contract on what an "active customer" meant under OSFI supervision. The weekly executive funnel replaced the quarterly board deck. A team that had argued about which spreadsheet to trust started arguing about which decision the dashboard should drive, which is the better argument. The longer write-up is at /work/customer-journey-warehouse.

The dbt project structure that holds up

dbt is the constant across our Build engagements because the project structure is portable, the framework runs against every major warehouse, and the project is readable by the next analyst the client hires. We ship staging/ one-to-one against sources, intermediate/ for re-grained business logic, and marts/ materialised as the tables the BI tool consumes. not_null and unique tests on every primary key. relationships tests across foreign keys. CI runs dbt build on every PR; red blocks merge. None of this is novel. The novelty is in enforcement, which is the part that takes the discipline that the warehouse vendor's slide deck does not sell.

-- models/marts/finance/fct_weekly_revenue.sql
-- One row per fiscal week and channel; the dashboard reads this model.
{{ config(materialized='table') }}

select
    fiscal_week_start,
    channel,
    sum(net_revenue_usd)                  as net_revenue,
    sum(returns_usd)                      as returns,
    sum(net_revenue_usd - returns_usd)    as net_revenue_after_returns,
    max(_updated_at)                      as last_refreshed_at
from {{ ref('int_orders_enriched') }}
-- Keep paid and partially refunded orders; other statuses are excluded.
where order_status in ('paid', 'partially_refunded')
group by 1, 2

The point of the SQL above is not the SQL. The point is that there is one model the dashboard reads, one place the definition lives, and one named owner if the model breaks. That is L2. Everything else at this rung is a variant of the same idea.
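The not_null, unique, and relationships tests mentioned earlier can each be read as a query that must return zero rows. The sketch below reproduces that idea with Python's built-in sqlite3 module so it runs anywhere; the table names and rows are invented, and the SQL that dbt actually compiles for its generic tests may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables with deliberately bad rows: a duplicate order_id,
# a null order_id, and an order pointing at a customer that does not exist.
conn.executescript("""
    create table dim_customer (customer_id integer);
    create table fct_orders (order_id integer, customer_id integer);
    insert into dim_customer values (1), (2);
    insert into fct_orders values (10, 1), (11, 2), (11, 2), (null, 3);
""")

# Each check selects the offending rows; an empty result means the test passes.
CHECKS = {
    "not_null": "select order_id from fct_orders where order_id is null",
    "unique": """
        select order_id from fct_orders
        where order_id is not null
        group by order_id having count(*) > 1
    """,
    "relationships": """
        select o.customer_id from fct_orders o
        left join dim_customer c on o.customer_id = c.customer_id
        where o.customer_id is not null and c.customer_id is null
    """,
}


def run_checks(connection, checks=CHECKS):
    """Return {check_name: failing_rows}; CI would block the merge if any list is non-empty."""
    return {name: connection.execute(sql).fetchall() for name, sql in checks.items()}


results = run_checks(conn)
```

Against the sample rows every check fails, which is the point: the duplicate key, the null key, and the orphaned foreign key each surface as a named failure before the mart reaches a dashboard.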

L3: decisions, not dashboards

A team at L2 has a defensible weekly number. A team at L3 has a defensible answer to "what should we do next." The difference looks small on a slide and is the difference between consulting bills that recur because something broke and consulting bills that recur because the team wants more of what they paid for the last time. Most teams we meet are at L2 and wish they were at L3; most vendors we meet have a slide deck explaining how their tool gets the team there. The deck is usually wrong, because L3 is mostly about admitting which decisions the model has to defend and which it cannot.

Boring baselines first

The most-skipped step in L3 work is the boring baseline. Before the Bayesian media mix model, the Markov-chain attribution model, or the XGBoost churn predictor, we ship the boring baseline: linear attribution, last-non-direct, a 30-day moving average, a logistic regression on six features. The new model has to beat the baseline on a holdout the team did not see during training, by a margin worth the operational complexity of running it. If a hierarchical Bayesian MMM beats a 30-day moving average by 2% on holdout, we ship the moving average and move on. The savings buy a quarter of L4 governance work that actually compounds, which beats a 2% lift on a dashboard nobody can re-run.
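The baseline comparison can be sketched in a few lines. The moving-average baseline and the holdout come from the paragraph above; the use of mean absolute error and the 10% minimum relative gain are assumptions chosen for the example, since the margin "worth the operational complexity" is a judgment each team has to write down for itself.

```python
def moving_average_forecast(history, window=30):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def baseline_predictions(train, holdout, window=30):
    """One-step-ahead baseline forecasts across the holdout, rolling actuals into history."""
    history = list(train)
    predictions = []
    for actual in holdout:
        predictions.append(moving_average_forecast(history, window))
        history.append(actual)
    return predictions


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def should_ship_candidate(baseline_mae, candidate_mae, min_relative_gain=0.10):
    """Ship the complex model only if it beats the baseline by the agreed margin.

    min_relative_gain is a hypothetical threshold, not a recommendation.
    """
    if baseline_mae == 0:
        return False  # the baseline is already perfect on this holdout
    gain = (baseline_mae - candidate_mae) / baseline_mae
    return gain >= min_relative_gain
```

Under this rule a candidate that improves holdout error by 2% does not clear the bar, so the moving average ships.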

Attribution that survives a CFO challenge

Attribution is where L3 work tends to start because attribution is where the highest-stakes weekly arguments live. A CPG holding we worked with had three brands sharing a paid media buy, and the attribution model had been rebuilt three times in 18 months by three different analysts because each new CMO inherited an unsigned measurement plan. We wrote the measurement plan first: the seven decisions the attribution number had to defend each quarter, signed by each brand's marketing leader and by the CFO. Only then did we build the weighted multi-touch model with per-channel time-decay calibrated to a Nielsen panel, landed as a versioned channel-calibration dim in the warehouse. The model survived two more CMO changes, which is the deliverable that matters. The model itself was not particularly clever; the contract underneath it was.

LTV measured on a real cohort

LTV is the second-most-common L3 question and the one most prone to vendor theater. The legitimate technique set is small: BG/NBD with Gamma-Gamma for non-contractual settings such as e-commerce and marketplaces, survival models (Cox PH, Kaplan-Meier) for contractual settings like SaaS and subscriptions, and cohort marts at (cohort × channel × month-since-acquisition) against fully-loaded spend for the executive view. The lifetimes Python library does the BG/NBD work in about thirty lines of code; the work is the schema and the calibration, not the algorithm. A two-sided healthtech marketplace we worked with had marketing counting bookings and finance counting completed-net revenue with a 28% to 34% gap per cohort. We wrote a one-page metric contract on what an "activated patient" was, stitched ad-click through to signup, booking, and completed appointment into one event spine, and built the cohort-LTV mart against fully-loaded spend. Two unprofitable channels got killed in week six. Blended CAC payback compressed from 14 months to 6. The longer write-up is at /work/marketplace-cohort-economics.
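The CAC payback figure read off a cohort mart is simple arithmetic once the grain is right. A minimal sketch, assuming one cohort-and-channel row with fully-loaded spend, customers acquired, and net revenue by month since acquisition; the function name and the numbers in the usage note are hypothetical.

```python
def payback_month(fully_loaded_spend, customers_acquired, monthly_net_revenue):
    """Return the first month since acquisition (1-indexed) in which cumulative
    net revenue per acquired customer covers fully-loaded CAC, or None if the
    cohort has not paid back within the months observed."""
    cac = fully_loaded_spend / customers_acquired
    cumulative_per_customer = 0.0
    for month, revenue in enumerate(monthly_net_revenue, start=1):
        cumulative_per_customer += revenue / customers_acquired
        if cumulative_per_customer >= cac:
            return month
    return None
```

For example, a cohort of 100 customers acquired for 12,000 in fully-loaded spend, earning 3,000 in net revenue each month, pays back in month 4; the same cohort earning 1,000 a month has not paid back after five months and returns `None`.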

Anomaly detection where it earns its keep

Anomaly detection sits at L3 in our model because the decision being made is "should I treat this number as signal or as noise this week." The working set is Seasonal Hybrid ESD (the open-source approach Twitter shipped a decade ago and that still works), Prophet residuals where seasonality is clean, and EWMA where it is not. The wrong move is to wire the alerts to a #data-alerts Slack channel everyone has muted; the right move is to route the page to the named human who owns the metric, with a runbook describing what the anomaly type means and what the owner does next. False-positive tuning is the part that turns the artifact from a liability into a tool, because an alert system the team mutes is worse than no alert system at all.
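Of the three techniques, EWMA is the easiest to show in full. The sketch below flags a point when it sits more than a chosen number of exponentially weighted standard deviations from the running mean. The smoothing factor, threshold, and warm-up length are illustrative defaults, and they are exactly the knobs that false-positive tuning adjusts.

```python
import math


def ewma_anomalies(values, alpha=0.3, threshold=3.0, warmup=5):
    """Return the indices of points that deviate from the EWMA by more than
    `threshold` exponentially weighted standard deviations.

    alpha, threshold, and warmup are assumed defaults for illustration.
    """
    mean = values[0]
    variance = 0.0
    flagged = []
    for i, x in enumerate(values[1:], start=1):
        std = math.sqrt(variance)
        if i >= warmup and std > 0 and abs(x - mean) > threshold * std:
            flagged.append(i)
            continue  # do not let the anomaly contaminate the running statistics
        diff = x - mean
        mean += alpha * diff
        variance = (1 - alpha) * (variance + alpha * diff * diff)
    return flagged


daily_orders = [100, 102, 98, 101, 99, 100, 102, 98, 150, 100]
print(ewma_anomalies(daily_orders))  # → [8]
```

Skipping the update on a flagged point is a design choice: it keeps one spike from inflating the variance and hiding the next one, at the cost of reacting slowly to a genuine level shift. Whatever the detector returns still needs to be routed to the metric's named owner along with the runbook.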

When not to model

We turn down modeling engagements about as often as we accept them, because most asks for "AI in our analytics" turn out to be unmodelable from the data on the ground. The honest test is whether enough events exist at the right grain to support the model, whether the leadership team has agreed on the decision the model is supposed to defend, and whether there is an owner who will run the model's quarterly review. If any of the three is missing, the model gets shelved within a quarter regardless of how clever the implementation was. We say so on the discovery call and route into Diagnose or Build instead. Shipping a model into an L1 stack is malpractice with a deliverable.

L4: a stack that compounds, not one that accumulates

The honest description of L4 is "stop adding things and start removing them on a schedule." Most stacks accumulate. They do not compound. The difference is governance, and governance is mostly the practice of writing down what gets kept and what gets killed at a known cadence.

The quarterly model review

Every model we ship has a kill-switch in writing and a quarterly review on the calendar. The review is sixty minutes against the measurement plan: did the decisions the model was supposed to defend get made on its evidence, did the predictions hold up against the holdout we wrote down at handover, did drift show up in the input-output relationship, and is the cost of keeping the model still smaller than the value of keeping it. If the answer to the last question is no, we say so out loud and the model gets retired. The kill-switch is the single most-skipped artifact across our engagements, and the one that prevents the next consultancy's invoice from arriving in 18 months to re-do the same work.

Build, buy, kill, on every tool, on a schedule

The build-buy-kill conversation does not end with the audit; it becomes a quarterly habit. Every tool in the stack gets re-evaluated against three numbers: utilization (what percentage of seats were actually used in the last 30 days), fit (does this tool still serve the workflow it was bought for), and total cost of ownership (the monthly invoice plus the analyst hours spent maintaining it). Tools that fail any of the three move toward Kill, with a written migration note for the workflows they still own. We have seen teams keep paying for a $1,200-a-month BI tool because nobody wanted to write the migration note for the two dashboards it still owned. The migration note, when we eventually wrote it, cost less than two months of the invoice.
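The three numbers lend themselves to a small, repeatable check. In this sketch the utilization floor, the cost ceiling, and the analyst hourly rate are all hypothetical inputs a team would set for itself; only the three criteria and the rule that failing any one moves a tool toward Kill come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolReview:
    name: str
    seats_paid: int
    seats_active_30d: int
    still_fits_workflow: bool
    monthly_invoice_usd: float
    maintenance_hours_per_month: float


def total_cost_of_ownership(tool, analyst_hourly_rate_usd):
    """Monthly invoice plus the cost of analyst hours spent maintaining the tool."""
    return tool.monthly_invoice_usd + tool.maintenance_hours_per_month * analyst_hourly_rate_usd


def verdict(tool, analyst_hourly_rate_usd, min_utilization=0.5, max_tco_usd=2000.0):
    """Return ("keep" | "kill", failed_criteria). Thresholds are assumed, not prescribed."""
    failed = []
    if tool.seats_active_30d / tool.seats_paid < min_utilization:
        failed.append("utilization")
    if not tool.still_fits_workflow:
        failed.append("fit")
    if total_cost_of_ownership(tool, analyst_hourly_rate_usd) > max_tco_usd:
        failed.append("total_cost_of_ownership")
    return ("kill" if failed else "keep", failed)
```

A "kill" verdict is the trigger for the migration note, not a substitute for it.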

Documentation that the inheriting analyst can read

The handoff packet is itself a deliverable. dbt docs published. Architecture diagram. Source schema list. Dashboard ownership table. Metric contract. Runbooks per alert. A written onboarding plan for the inheriting analyst that names what a new hire reads in their first week. The dry-run test is the only test that matters: can a teammate who was not on the project stand up a dev environment from the docs alone, in under two hours, without us in the room. If not, the docs are not done. We have written this paragraph more times than we have written any other paragraph; it remains the work that buys the team the right to fire the next consultant.

Hiring conversations stop being abstract

A team that has earned L4 governance can recruit a full-time senior analyst against a system instead of against a vibe. The job description names the metric contract, the dbt project, the dashboards in scope, the monitoring runbooks, and the decisions the role is expected to defend in the next four quarterly model reviews. Candidates who would have ghosted a vibe-based JD say yes to a system-based one, because they can picture the work on day one. This is the L4 dividend that matters most to founders, even though it is the one that shows up nowhere on a vendor's slide deck and never makes its way into a case-study headline.

The integration trap, restated

Coming back to the conversation that opened the field guide. The B2B retailer with the green-field Shopify build, the Amazon channel, the offline sales, the on-prem accounting system, and the impulse to ask vendors for a "fully integrated solution" is asking the right question with the wrong shape. Integration is the consequence of the four rungs above, not a SKU you can buy off a marketplace. The vendor pitching the all-in-one suite is selling integration where the integration has no contract to integrate against. The slide says "fully integrated." The team's Slack says "still reconciling." The result, every time we have audited it, is a stack that integrates the wrong things and re-integrates them every time a vendor reshuffles a connector.

The order we recommend, in writing, for the team that asked the question:

First, write the tracking plan against the decisions the leadership team has to defend. Eight to twenty events, in a versioned schema, owned by a named person. This work fits in a week if leadership shows up to the workshop; it gets dragged for a quarter if they don't. A team in this position will be tempted to skip it because the Shopify build is green-field and "the data isn't there yet anyway." That is the exact moment to write the plan, because the cost of writing it before tags ship is two analyst-days and the cost of writing it after tags ship is two analyst-quarters.

Second, stand up the smallest defensible warehouse. BigQuery on-demand pricing or Snowflake at the smallest paid edition. RBAC and dev/prod separation on day one; the cost-control conversation matters less than the access-control conversation at this stage. The warehouse, not the BI tool, is the source of truth. Every alert, every dashboard, every reverse-ETL sync reads from it eventually.

Third, land ingestion through Fivetran or Airbyte for the SaaS sources you have connectors for: Shopify, Amazon Seller Central, the marketing platforms, the CRM. The accounting system, even if it lives on-prem and the connector has to talk to a SQL replica, lands here too. Custom Python orchestrated by Dagster or Prefect for the one or two critical sources nobody has built a hosted connector for. Source-freshness checks per source, against the SLA in the metric contract; alerts route to a named human, not a muted channel.
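A source-freshness check reduces to comparing each source's last successful load against its SLA and naming who gets paged. A minimal sketch, assuming the SLA table below; the source keys, lag limits, and owner addresses are hypothetical stand-ins for what the metric contract would specify.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLAs; in practice these would be read from the metric contract.
FRESHNESS_SLA = {
    "shopify": {"max_lag": timedelta(hours=6), "owner": "ecom-analyst@example.com"},
    "amazon_seller_central": {"max_lag": timedelta(hours=24), "owner": "ecom-analyst@example.com"},
    "accounting_replica": {"max_lag": timedelta(hours=24), "owner": "finance-analyst@example.com"},
}


def stale_sources(last_loaded_at, now, sla=FRESHNESS_SLA):
    """Return (source, owner) pairs for sources that are late or have never loaded."""
    alerts = []
    for source, rule in sla.items():
        loaded = last_loaded_at.get(source)
        if loaded is None or now - loaded > rule["max_lag"]:
            alerts.append((source, rule["owner"]))
    return alerts


now = datetime(2026, 5, 19, 12, 0, tzinfo=timezone.utc)
loads = {
    "shopify": now - timedelta(hours=2),
    "amazon_seller_central": now - timedelta(hours=30),
    # accounting_replica has never loaded, so it is treated as stale
}
alerts = stale_sources(loads, now)
```

Here `alerts` names the Amazon feed and the accounting replica along with their owners, which is what lets the alert go to a person instead of a channel.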

Fourth, build the dbt project in staging/ then intermediate/ then marts/. Tests written with the model, not bolted on at the end. Primary keys non-negotiable. Metric definitions tied to the contract; the dashboard reads the mart, never the staging layer, never raw. The temptation to write a "quick mart" without the staging discipline is the same temptation as shipping tags without a plan, and it ends the same way.

Fifth, pick a BI tool the team can hire for, not the BI tool with the loudest sales rep. Power BI for Microsoft-native teams. Looker for Google-native shops with Looker investment already on the books. Metabase or Lightdash for cost-sensitive open-source defaults. Tableau for teams whose analysts already use it. The BI tool is the surface layer; switching it later costs about a month of work and is cheaper than picking the wrong one because it had the prettiest demo at the conference last quarter.

Sixth, and only sixth, talk about activation. Reverse-ETL through Hightouch or Census back to Meta Conversions API, Google Customer Match, Klaviyo, HubSpot. Predictions piped from the L3 models you have already validated. Conversion APIs wired against the metric contract that already exists. This is the layer where the "fully integrated" pitch becomes true, and only because the four rungs underneath made it possible.

The retailer who follows this sequence ends up with the integration they wanted from the start, six to twelve weeks in, with documentation a new analyst can read on day one and a metric contract that survives a CMO change. The retailer who skips to the integration pitch ends up two years in, with a stack that integrates four versions of the wrong number and a board meeting that goes badly. We have audited both shapes of stack often enough to call the difference.

The analytics maturity model is not a brand or a framework; it is a sequence, and the sequence is the deliverable. Most of our engagements start by writing down where on the ladder a team actually lives, not where the leadership team feels it lives. The honest answer is usually one rung lower than the deck implies. The work is the climb. If the climb has already started and the diagnosis is already in writing, we built Build for what comes next. If not, the climb begins with the diagnosis, and we built Diagnose for that. Field notes on the patterns inside each rung will keep arriving in this journal; the next one, on the five reports we replace inside the first ninety days, is queued at /journal/five-reports-cfo-wrong.