/posts/the-cheapest-token-is-the-one-you-never-send
The Cheapest Token Is the One You Never Send
TL;DR
Asking an LLM to find SOLID violations in a large codebase is like asking a physicist to explain gravity by staring at the universe. Build detectors first: ripgrep, ast-grep, jscpd scan for measurable signals in milliseconds. Then send classified candidates to the model for triage. Classification is cheap. Exploration burns tokens.
Guide
Walk into an MIT physics lab. Ask the first person you see: "What's wrong with it?" They will stare at you. Wrong with what? The equipment? The grant proposal? The universe?
Now open a 200,000-line codebase. Paste it into an LLM. Ask: "Find the SOLID violations." Same energy. The model will burn through its context window generating plausible-sounding observations about half the files in the repo, most of which you already knew, none of which are prioritized, and several of which are wrong. You spent 50,000 tokens to get a book report.
Physicists do not measure gravity. They measure acceleration, orbital periods, spacetime curvature -- observables that gravity produces. The force is the abstraction. The falling apple is the signal. Refactoring works the same way. SOLID violations are abstractions. You cannot grep for "single responsibility." But you can count the signals that SRP violations produce, and you can do it in milliseconds with tools that already exist.
You Cannot Grep for Single Responsibility
Run wc -l on the file that scares your team:
wc -l src/services/OrderService.php
# 2847 src/services/OrderService.php
2,847 lines. That number alone is not a diagnosis. But now count what it imports:
grep -c "^use " src/services/OrderService.php
# 34
34 dependencies. Database connections, HTTP clients, cache adapters, validation libraries, email formatters, PDF generators. One file touches six different infrastructure boundaries. You did not need a language model to find this. wc and grep told you in 3 milliseconds.
Each SOLID principle produces specific, measurable signals:
- SRP: File length > 500 lines combined with high dependency count and diverse method prefixes (validate*, format*, send*, query*). The file does not have one reason to change -- it has six.
- OCP: Long switch statements or type-dispatch cascades. Every new case means opening the same file. Count the cases: grep -c "case " src/handlers/dispatch.go. Twelve cases today, fifteen next sprint.
- ISP: Interfaces with 12 methods where concrete implementations use 2 or 3. The rest are empty stubs or panic calls. ast-grep finds these in seconds.
- DIP: Business logic constructing infrastructure objects directly -- new RedisClient() inside a domain service, sql.Open() in a handler function. grep -rn "new.*Client\|new.*Connection\|sql\.Open" src/domain/ lights them up.
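The "diverse method prefixes" signal in the SRP bullet is countable too. A minimal sketch, assuming PHP-style camelCase method names (the OrderService.php path is the illustrative one from earlier):

```shell
# Tally lowercase method-name prefixes: the regex stops at the first capital
# letter, so "function validateOrder" yields "function validate"
grep -oE 'function [a-z]+' src/services/OrderService.php \
  | sort | uniq -c | sort -rn
```

Three or four dominant verbs in the output (validate, format, send, query) is the SRP smell expressed as a number.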
DRY violations have their own signal set: duplicated literal strings across files, duplicated conditional chains (the same if/else cascade in three handlers), repeated database query patterns, copied validation rules with slight variations.
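The first of those signals is a one-liner. A sketch, assuming double-quoted literals of 12+ characters are worth flagging (tune the length threshold to your codebase):

```shell
# String literals repeated anywhere under src/: -h drops filenames, so identical
# literals from different files collapse into a single count
grep -rhoE '"[^"]{12,}"' src/ \
  | sort | uniq -c | awk '$1 > 1' | sort -rn | head -20
```

Anything that surfaces twice or more is a candidate for a named constant or a shared rule.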
These are not opinions. They are measurements.
Detectors Run in Seconds, Models Burn Minutes
You already have the instruments:
# Duplicated code blocks across the repo
jscpd --min-lines 5 --min-tokens 50 src/
# Interfaces where implementations stub most methods
ast-grep -p 'func ($RECV) $METHOD($$$) { panic("not implemented") }' --lang go
# Files with the most import statements (dependency fan-out)
rg "^import |^use |^from .* import" --count-matches -t go -t php -t py \
| sort -t: -k2 -nr | head -20
# Switch cascades: files whose case count exceeds 8, ranked
rg -c '^\s*case ' -t go src/ | awk -F: '$2 > 8' | sort -t: -k2 -nr
Each command finishes in seconds on a large repo. The output is a candidate list -- files and line numbers, ranked by severity. No tokens consumed. No hallucinations. No "I noticed that your codebase could benefit from..." preamble.
golangci-lint runs 50+ analyzers in one pass. PHPStan catches type violations and dead code paths. jscpd finds copy-paste duplication across any language it can tokenize. These tools are deterministic. Run them twice, get the same output. An LLM given the same prompt twice might give you completely different files.
The tools produce false positives. A 600-line file might be a generated protocol buffer. A 15-case switch might be an exhaustive enum handler that should stay together. The tools find candidates. Judgment happens next.
Classification Is Cheap, Exploration Burns Tokens
Here is the pipeline:
Stage 1 -- Deterministic detection. Local tools scan the repo and produce candidate lists. One list per smell class: "files over 500 lines with 10+ imports," "switch statements with 8+ cases," "interfaces where implementations stub methods." Each list is small, specific, and machine-generated.
Stage 2 -- Analytical triage. Send one candidate list to the model. Not the whole repo. Not "find problems." One batch, one smell class, one question: "Here are 12 files flagged for high dependency fan-out. For each, classify whether the coupling is structural (this file legitimately coordinates these concerns) or accidental (these concerns should be separated). For accidental cases, propose a minimal decomposition."
The model receives maybe 200 lines of context per candidate instead of 200,000 lines of repo. Classification on a pre-filtered list is a fundamentally different task than open-ended exploration. The model is not searching -- it is evaluating. That distinction is the difference between burning 100,000 tokens on vague suggestions and spending 5,000 tokens on actionable decompositions.
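Keeping each candidate to a couple of hundred lines is mechanical: strip the bodies and send only imports and public signatures. A sketch, assuming PHP sources and a hypothetical candidates_srp.txt file listing the flagged paths:

```shell
# Build a bounded context slice per candidate: imports and public signatures
# only, no method bodies (candidates_srp.txt is a hypothetical file list)
while IFS= read -r f; do
  printf '=== %s ===\n' "$f"
  grep -E '^use |public function' "$f"
done < candidates_srp.txt > triage_context.txt
```

The model sees what the file depends on and what it exposes, which is exactly what the structural-versus-accidental question needs.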
Open-ended exploration:
Input: entire repo (200k+ tokens)
Output: broad observations, low signal
Cost: high
Reliability: low (different results each run)
Classified triage:
Input: 12 candidate files, one smell class (5k tokens)
Output: per-file verdict + minimal refactor proposal
Cost: low
Reliability: high (constrained evaluation)
Batch by smell class, not by file. Sending "everything wrong with OrderService.php" invites the model to freelance. Sending "here are the 8 files with the highest import counts -- classify each" keeps the task narrow.
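Producing those batches is a few lines of shell. A sketch, assuming a hypothetical candidates.txt where each detector appends one "smell_class|path" line per finding:

```shell
# Split flagged candidates into one triage batch per smell class
while IFS='|' read -r smell path; do
  printf '%s\n' "$path" >> "batch_${smell}.txt"
done < candidates.txt
```

Each batch_*.txt then becomes one narrow prompt: one smell class, one question.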
A Persistent Inventory Makes Architecture Queryable
Run the detectors once and throw away the output, and you are back to guessing next quarter. Instead, persist a simple dataset per file:
file_path | lines | public_methods | imports | duplication_score | last_modified
--------------------|-------|----------------|---------|-------------------|-------------
src/services/Order | 2847 | 43 | 34 | 0.23 | 2026-03-08
src/services/User | 412 | 8 | 7 | 0.02 | 2026-02-15
src/handlers/api | 1205 | 22 | 19 | 0.15 | 2026-03-09
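The scan that produces those rows is itself deterministic. A minimal sketch, assuming PHP sources and GNU date; the duplication_score column would come from a separate jscpd pass, omitted here:

```shell
# Rebuild the inventory from scratch: path|lines|public_methods|imports|modified
find src -name '*.php' | sort | while IFS= read -r f; do
  printf '%s|%s|%s|%s|%s\n' "$f" \
    "$(wc -l < "$f")" \
    "$(grep -c 'public function' "$f")" \
    "$(grep -c '^use ' "$f")" \
    "$(date -r "$f" +%F)"
done > inventory.csv
```

Run it in CI on every merge and each snapshot costs nothing.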
Now architectural reasoning is a query:
# Files that grew more than 200 lines in the last 30 days
# (compare current scan to last month's snapshot)
# Top 10 largest files as a first cut at decomposition candidates
sort -t'|' -k2 -nr inventory.csv | head -10
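Both queries sketched in the comments above need only join and awk. A sketch, assuming both snapshots share the inventory's column order and are sorted by path (join requires sorted input); the score weights are a starting point, not gospel:

```shell
# Growth since the last snapshot: join on path, then diff the line counts
# ($2 = old lines, $7 = current lines after the join)
join -t'|' last_month.csv current.csv \
  | awk -F'|' '$7 - $2 > 200 { print $1 "|" $7 - $2 }'

# Composite decomposition score: big files with many imports and heavy
# duplication float to the top
awk -F'|' '{ print ($2/100 + $4 + $5*100) "|" $0 }' current.csv \
  | sort -t'|' -k1 -nr | head -10
```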
The inventory answers questions that used to require a senior engineer reading code for two days. Which files change most often? Which files have the most cross-cutting dependencies? Where is duplication concentrating? Track it over time and you can see whether refactoring efforts are working -- the numbers go down, or they don't.
Physics labs do not run one experiment and declare the field solved. They instrument, measure, record, compare.
Detector Pipeline vs LLM-First vs Manual Code Review
| Axis | Detector + LLM triage | LLM-first exploration | Manual code review |
|---|---|---|---|
| Speed to first signal | Seconds (deterministic scan) | Minutes (context loading + generation) | Hours to days |
| Token cost per repo scan | Near zero (tools) + small triage batches | 100k-500k tokens per pass | Zero (human time instead) |
| Reproducibility | Deterministic detection, consistent triage | Variable -- different results each run | Depends on reviewer |
| False positive handling | Tools flag, model classifies, human decides | Model flags and classifies in one pass (lower precision) | Human handles both |
| Scales to 500k lines | Yes -- tools do not care about repo size | No -- context window limits force chunking heuristics | No -- reviewer fatigue |
| When it wins | Ongoing maintenance of large, active codebases | Greenfield design review, small codebases, one-off audits | Mentorship, knowledge transfer, nuanced design decisions |
LLM-first exploration works when the codebase fits in context and you need design-level insight, not mechanical detection. For a 2,000-line project where you want someone to challenge your architecture, paste the whole thing and ask. For a 200,000-line monolith where you need to find the 15 files that should be split, run the detectors.
Measurement Precedes Explanation
Every experimental physics lab runs the same pattern. Instruments collect signals. Signals get ranked by strength. The interesting phenomena get studied in detail. Nobody walks into CERN and asks the accelerator what is wrong with the Standard Model.
Build the detectors. Run them on CI. Persist the inventory. Send classified batches to the model for triage. Refactoring becomes what it should have been: experimental science. You measure first. Then you explain what you found.