Product

The PR gate that speaks engineering and finance.

Class1 turns a diff into a budget-case approval workflow. It does not wait for spend to appear in production.

01

Scan

Find LLM callsites in Python, TypeScript, and JavaScript. Detect OpenAI, Anthropic, LiteLLM, Gemini, LangChain, model constructors, aliases, max_tokens, tool/MCP lists, and default-model inference.

02

Translate

Convert only defensible code signals into scenario changes. Model swaps change the rate. max_tokens changes the output tail. New callsites are counted, but no traffic volume is invented for them.

03

Simulate

Run paired Monte Carlo with common random numbers. Report the expected, P50, P80, P90, P95, and worst-case monthly delta. Comparing a scenario against itself yields a delta of exactly zero.

04

Explain

Rank tail drivers, project escalation, recommend a model by cost per completed task, and attach the Basis of Estimate.

05

Enforce

If a .class1 budget exists and the positive P90 delta exceeds it, the check fails. Without a budget, the check stays advisory. Non-increases never block.

06

Calibrate

After merge, actuals are paired with the stored estimate. The estimate class improves only when reality proves it.
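
The Simulate step above can be illustrated with a minimal sketch of paired Monte Carlo using common random numbers. This is not Class1's implementation: the toy cost model, the exponential output-length distribution, the rate, and every function name here are assumptions made for illustration. The point it shows is that replaying the same random stream for both scenarios makes a self-comparison come out to exactly zero.

```python
import math
import random

def monthly_cost(rng, usd_per_1k_output, max_tokens, calls=500):
    """One simulated month under a toy model (assumption): output length per
    call is exponential with mean 300 tokens, capped at max_tokens."""
    total = 0.0
    for _ in range(calls):
        tokens = min(rng.expovariate(1 / 300), max_tokens)
        total += tokens / 1000 * usd_per_1k_output
    return total

def paired_deltas(base, head, trials=200, seed=7):
    """Sorted head-minus-base monthly cost deltas, one per trial."""
    deltas = []
    for t in range(trials):
        # Common random numbers: both scenarios replay the same seeded stream,
        # so the delta reflects the code change rather than sampling noise.
        base_cost = monthly_cost(random.Random(seed + t), **base)
        head_cost = monthly_cost(random.Random(seed + t), **head)
        deltas.append(head_cost - base_cost)
    return sorted(deltas)

def percentile(sorted_values, p):
    """Nearest-rank percentile of an already sorted list."""
    k = max(0, math.ceil(p / 100 * len(sorted_values)) - 1)
    return sorted_values[k]

# Hypothetical scenarios: the PR raises max_tokens and keeps the rate.
base = {"usd_per_1k_output": 0.01, "max_tokens": 256}
head = {"usd_per_1k_output": 0.01, "max_tokens": 1024}

self_deltas = paired_deltas(base, base)
pr_deltas = paired_deltas(base, head)
p50, p90, p95 = (percentile(pr_deltas, p) for p in (50, 90, 95))
```

Because each trial reuses one seed for both runs, `self_deltas` contains only `0.0`, while `pr_deltas` isolates the effect of the wider output cap.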

What the gate actually reads

Class1 is conservative by design: it prices evidence from the diff and labels the rest.

The scanner looks for LLM callsites and concrete parameters: model identifiers, maximum output tokens, constructor-bound model choices, provider wrappers, LangChain chat classes, Gemini model objects, LiteLLM aliases, and TypeScript or JavaScript patterns when tree-sitter is available.
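
As a rough sketch of what reading concrete parameters from Python source can look like, the snippet below walks a syntax tree with the standard-library `ast` module and collects literal `model` and `max_tokens` keyword arguments. It is an illustration under assumptions, not the Class1 scanner: the function name, the sample source, and the `small-model` identifier are hypothetical, and a real scanner would also need to resolve aliases, constructors, and provider wrappers.

```python
import ast

# Hypothetical source under review; the scanner only parses it, never runs it.
SOURCE = '''
client.chat.completions.create(model="small-model", max_tokens=256, messages=msgs)
helper(x=1)
'''

def find_llm_call_params(source: str) -> list[dict]:
    """Return literal model / max_tokens keyword arguments found on any call."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        params = {}
        for kw in node.keywords:
            # Only literal values are treated as evidence; variables are skipped.
            if kw.arg in ("model", "max_tokens") and isinstance(kw.value, ast.Constant):
                params[kw.arg] = kw.value.value
        if params:
            findings.append({"line": node.lineno, **params})
    return findings

findings = find_llm_call_params(SOURCE)
```

Skipping non-literal values mirrors the conservative stance described here: a parameter the code does not spell out is left unpriced rather than guessed.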

When a pull request changes a model from a cheaper model to a frontier model, the rate basis changes. When max_tokens rises, the output tail widens. When tool schemas are loaded into every request, MCP overhead becomes recurring input tokens. These are defensible signals because the code itself exposes them.

The product deliberately does not invent traffic volume from a new callsite. Added callsites are counted and called out, but their magnitude belongs to actuals and calibration. That restraint matters in sales: the report is credible because it refuses to fill unknowns with confident fiction.

PR comment anatomy

One report. Six decisions.

Cost band: expected, P50, P90, P95, worst
Estimate class: how mature the inputs are
Risk drivers: retry loop, context, output, demand, fallback
Escalation: price, volume, structure, driver to attack
Model fit: cheapest per completed task, not cheapest per token
Policy gate: advisory or blocking based on repo config
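
To make the "per completed task, not per token" distinction concrete, here is a small worked example. It assumes one plausible definition, cost per attempt divided by completion rate, which the page does not spell out. The model names, prices, and completion rates are hypothetical.

```python
def cost_per_completed_task(cost_per_attempt_usd: float, completion_rate: float) -> float:
    """Expected spend to get one successful task (assumed definition)."""
    if not 0 < completion_rate <= 1:
        raise ValueError("completion_rate must be in (0, 1]")
    return cost_per_attempt_usd / completion_rate

# Hypothetical candidates: the small model is cheaper per attempt but fails more often.
candidates = {
    "small-model": {"cost_per_attempt_usd": 0.002, "completion_rate": 0.40},
    "large-model": {"cost_per_attempt_usd": 0.004, "completion_rate": 0.95},
}

best = min(candidates, key=lambda name: cost_per_completed_task(**candidates[name]))
```

With these made-up numbers the large model wins despite costing twice as much per attempt, because far fewer attempts are wasted.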

Approval semantics

The policy gate is narrow enough to trust in CI.

No budget configured: The PR receives the full cost-risk comment, but the check exits successfully. This is the free education layer.
Budget configured and P90 below limit: The comment documents the budget-case delta and the check stays green, giving reviewers traceable approval context.
Budget configured and P90 above limit: The Action still posts the comment, then returns a non-zero exit so the PR check goes red with the reason visible.
Delta is zero or negative: The gate cannot block a non-increase, including the exact-zero self-delta created by common random numbers.

Config surface

A budget gate small enough to read in one glance.

The enforcement lever is intentionally narrow: P90 monthly delta against a declared budget. A PR that does not increase cost cannot be blocked by the gate.

fail_pr_if:
  delta_p90_usd: 500
  warn_at_fraction: 0.8
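
The approval semantics described above can be sketched as a single decision function. This is an illustrative sketch, not Class1's code: `evaluate_gate`, `GateResult`, and the status strings are hypothetical names, and the reading of `warn_at_fraction` as "warn once the P90 delta reaches that fraction of the budget" is an assumption, since its behavior is not documented here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GateResult:
    status: str      # "advisory", "pass", "warn", or "fail"
    exit_code: int   # non-zero turns the PR check red

def evaluate_gate(delta_p90_usd: float,
                  budget_usd: Optional[float],
                  warn_at_fraction: float = 0.8) -> GateResult:
    if budget_usd is None:
        return GateResult("advisory", 0)   # no budget: comment only
    if delta_p90_usd <= 0:
        return GateResult("pass", 0)       # non-increases never block
    if delta_p90_usd > budget_usd:
        return GateResult("fail", 1)       # positive P90 delta exceeds the budget
    if delta_p90_usd >= warn_at_fraction * budget_usd:
        return GateResult("warn", 0)       # assumed meaning of warn_at_fraction
    return GateResult("pass", 0)

# Values mirror the config above: delta_p90_usd: 500, warn_at_fraction: 0.8
over = evaluate_gate(600.0, 500.0)
near = evaluate_gate(450.0, 500.0)
no_budget = evaluate_gate(600.0, None)
```

Checking the non-increase case before the budget comparison keeps the rule that a zero or negative delta cannot fail, whatever the budget is set to.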

Language coverage

Python parsing uses the standard-library AST and runs offline. TS/JS uses tree-sitter when installed and degrades cleanly otherwise.

CI behavior

The Action runs tests, estimates the diff, posts the comment, then enforces the gate so the PR still gets context when it fails.

Machine output

--json writes the gate payload for dashboards, governance systems, and reporting workflows.

Private pilot

--persist stores estimates so later actuals can close the loop and improve the estimate class.

Product objections

The questions buyers ask before putting a gate in the merge path.

Will this slow engineering down?

The default mode is advisory. Teams can start by posting comments, then turn on blocking only for repositories or workflows where P90 budget exposure matters.

What if the forecast is early and rough?

That is why the estimate class is part of the report. Class 5 and Class 4 are still useful for risk detection, but the product does not pretend they are definitive budgets.

Why not just watch production usage?

Production usage is necessary for calibration, but it arrives after the decision. Class1 uses actuals to improve future estimates while still reviewing the next risky change before merge.

Can engineers override it?

A failed check is not a moral judgement. It creates a controlled conversation: reduce the tail, change the model, add a budget owner, or explicitly accept the cost.