/posts/the-cheapest-token-is-the-one-you-never-send
The Cheapest Token Is the One You Never Send
TL;DR
Asking an LLM to find SOLID violations in a large codebase is like asking a physicist to explain gravity by staring at the universe. Build detectors first: ripgrep, ast-grep, jscpd scan for measurable signals in milliseconds. Then send classified candidates to the model for triage. Classification is cheap. Exploration burns tokens.
Guide
Walk into an MIT physics lab. Ask the first person you see: "What's wrong with it?" They will stare at you. Wrong with what? The equipment? The grant proposal? The universe?
Now open a 200,000-line codebase. Paste it into an LLM. Ask: "Find the SOLID violations." Same energy. The model will burn through its context window generating plausible-sounding observations about half the files in the repo, most of which you already knew, none of which are prioritized, and several of which are wrong. You spent 50,000 tokens to get a book report.
Physicists do not measure gravity. They measure acceleration, orbital periods, spacetime curvature -- observables that gravity produces. The force is the abstraction. The falling apple is the signal. Refactoring works the same way. SOLID violations are abstractions. You cannot grep for "single responsibility." But you can count the signals that SRP violations produce, and you can do it in milliseconds with tools that already exist.
You Cannot Grep for Single Responsibility
Run wc -l on the file that scares your team:
wc -l src/services/OrderService.php
# 2847 src/services/OrderService.php
2,847 lines. That number alone is not a diagnosis. But now count what it imports:
grep -c "^use " src/services/OrderService.php
# 34
34 dependencies. Database connections, HTTP clients, cache adapters, validation libraries, email formatters, PDF generators. One file touches six different infrastructure boundaries. You did not need a language model to find this. wc and grep told you in 3 milliseconds.
Each SOLID principle produces specific, measurable signals:
- SRP: File length > 500 lines combined with high dependency count and diverse method prefixes (validate*, format*, send*, query*). The file does not have one reason to change -- it has six.
- OCP: Long switch statements or type-dispatch cascades. Every new case means opening the same file. Count the cases: grep -c "case " src/handlers/dispatch.go. Twelve cases today, fifteen next sprint.
- ISP: Interfaces with 12 methods where concrete implementations use 2 or 3. The rest are empty stubs or panic calls. ast-grep finds these in seconds.
- DIP: Business logic constructing infrastructure objects directly -- new RedisClient() inside a domain service, sql.Open() in a handler function. grep -rn "new.*Client\|new.*Connection\|sql\.Open" src/domain/ lights them up.
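The "diverse method prefixes" signal in the SRP bullet is countable too. A minimal sketch, assuming PHP-style camelCase method names (the OrderService.php path is the illustrative one from earlier):

```shell
# Tally lowercase method-name prefixes: the regex stops at the first capital
# letter, so "function validateOrder" yields "function validate"
grep -oE 'function [a-z]+' src/services/OrderService.php \
  | sort | uniq -c | sort -rn
```

Three or four dominant verbs in the output (validate, format, send, query) is the SRP smell expressed as a number.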
DRY violations have their own signal set: duplicated literal strings across files, duplicated conditional chains (the same if/else cascade in three handlers), repeated database query patterns, copied validation rules with slight variations.
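The first of those signals is a one-liner. A sketch, assuming double-quoted literals of 12+ characters are worth flagging (tune the length threshold to your codebase):

```shell
# String literals repeated anywhere under src/: -h drops filenames, so identical
# literals from different files collapse into a single count
grep -rhoE '"[^"]{12,}"' src/ \
  | sort | uniq -c | awk '$1 > 1' | sort -rn | head -20
```

Anything that surfaces twice or more is a candidate for a named constant or a shared rule.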
These are not opinions. They are measurements.
Detectors Run in Seconds, Models Burn Minutes
You already have the instruments:
# Duplicated code blocks across the repo
jscpd --min-lines 5 --min-tokens 50 src/
# Interfaces where implementations stub most methods
ast-grep -p 'func ($RECV) $METHOD($$$) { panic("not implemented") }' --lang go
# Files with the most import statements (dependency fan-out)
rg "^import |^use |^from .* import" --count-matches -t go -t php -t py \
| sort -t: -k2 -nr | head -20
# Switch cascades: files whose case count exceeds 8, ranked
rg -c '^\s*case ' -t go src/ | awk -F: '$2 > 8' | sort -t: -k2 -nr
Each command finishes in seconds on a large repo. The output is a candidate list -- files and line numbers, ranked by severity. No tokens consumed. No hallucinations. No "I noticed that your codebase could benefit from..." preamble.
golangci-lint runs 50+ analyzers in one pass. PHPStan catches type violations and dead code paths. jscpd finds copy-paste duplication across any language it can tokenize. These tools are deterministic. Run them twice, get the same output. An LLM given the same prompt twice might give you completely different files.
The tools produce false positives. A 600-line file might be a generated protocol buffer. A 15-case switch might be an exhaustive enum handler that should stay together. The tools find candidates. Judgment happens next.
Classification Is Cheap, Exploration Burns Tokens
Here is the pipeline:
Stage 1 -- Deterministic detection. Local tools scan the repo and produce candidate lists. One list per smell class: "files over 500 lines with 10+ imports," "switch statements with 8+ cases," "interfaces where implementations stub methods." Each list is small, specific, and machine-generated.
Stage 2 -- Analytical triage. Send one candidate list to the model. Not the whole repo. Not "find problems." One batch, one smell class, one question: "Here are 12 files flagged for high dependency fan-out. For each, classify whether the coupling is structural (this file legitimately coordinates these concerns) or accidental (these concerns should be separated). For accidental cases, propose a minimal decomposition."
The model receives maybe 200 lines of context per candidate instead of 200,000 lines of repo. Classification on a pre-filtered list is a fundamentally different task than open-ended exploration. The model is not searching -- it is evaluating. That distinction is the difference between burning 100,000 tokens on vague suggestions and spending 5,000 tokens on actionable decompositions.
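Keeping each candidate to a couple of hundred lines is mechanical: strip the bodies and send only imports and public signatures. A sketch, assuming PHP sources and a hypothetical candidates_srp.txt file listing the flagged paths:

```shell
# Build a bounded context slice per candidate: imports and public signatures
# only, no method bodies (candidates_srp.txt is a hypothetical file list)
while IFS= read -r f; do
  printf '=== %s ===\n' "$f"
  grep -E '^use |public function' "$f"
done < candidates_srp.txt > triage_context.txt
```

The model sees what the file depends on and what it exposes, which is exactly what the structural-versus-accidental question needs.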
Open-ended exploration:
Input: entire repo (200k+ tokens)
Output: broad observations, low signal
Cost: high
Reliability: low (different results each run)
Classified triage:
Input: 12 candidate files, one smell class (5k tokens)
Output: per-file verdict + minimal refactor proposal
Cost: low
Reliability: high (constrained evaluation)
Batch by smell class, not by file. Sending "everything wrong with OrderService.php" invites the model to freelance. Sending "here are the 8 files with the highest import counts -- classify each" keeps the task narrow.
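Producing those batches is a few lines of shell. A sketch, assuming a hypothetical candidates.txt where each detector appends one "smell_class|path" line per finding:

```shell
# Split flagged candidates into one triage batch per smell class
while IFS='|' read -r smell path; do
  printf '%s\n' "$path" >> "batch_${smell}.txt"
done < candidates.txt
```

Each batch_*.txt then becomes one narrow prompt: one smell class, one question.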
A Persistent Inventory Makes Architecture Queryable
Run the detectors once and throw away the output, and you are back to guessing next quarter. Instead, persist a simple dataset per file:
file_path | lines | public_methods | imports | duplication_score | last_modified
--------------------|-------|----------------|---------|-------------------|-------------
src/services/Order | 2847 | 43 | 34 | 0.23 | 2026-03-08
src/services/User | 412 | 8 | 7 | 0.02 | 2026-02-15
src/handlers/api | 1205 | 22 | 19 | 0.15 | 2026-03-09
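The scan that produces those rows is itself deterministic. A minimal sketch, assuming PHP sources and GNU date; the duplication_score column would come from a separate jscpd pass, omitted here:

```shell
# Rebuild the inventory from scratch: path|lines|public_methods|imports|modified
find src -name '*.php' | sort | while IFS= read -r f; do
  printf '%s|%s|%s|%s|%s\n' "$f" \
    "$(wc -l < "$f")" \
    "$(grep -c 'public function' "$f")" \
    "$(grep -c '^use ' "$f")" \
    "$(date -r "$f" +%F)"
done > inventory.csv
```

Run it in CI on every merge and each snapshot costs nothing.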
Now architectural reasoning is a query:
# Files that grew more than 200 lines in the last 30 days
# (compare current scan to last month's snapshot)
# Top 10 largest files as a first cut at decomposition candidates
sort -t'|' -k2 -nr inventory.csv | head -10
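Both queries sketched in the comments above need only join and awk. A sketch, assuming both snapshots share the inventory's column order and are sorted by path (join requires sorted input); the score weights are a starting point, not gospel:

```shell
# Growth since the last snapshot: join on path, then diff the line counts
# ($2 = old lines, $7 = current lines after the join)
join -t'|' last_month.csv current.csv \
  | awk -F'|' '$7 - $2 > 200 { print $1 "|" $7 - $2 }'

# Composite decomposition score: big files with many imports and heavy
# duplication float to the top
awk -F'|' '{ print ($2/100 + $4 + $5*100) "|" $0 }' current.csv \
  | sort -t'|' -k1 -nr | head -10
```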
The inventory answers questions that used to require a senior engineer reading code for two days. Which files change most often? Which files have the most cross-cutting dependencies? Where is duplication concentrating? Track it over time and you can see whether refactoring efforts are working -- the numbers go down, or they don't.
Physics labs do not run one experiment and declare the field solved. They instrument, measure, record, compare.
Detector Pipeline vs LLM-First vs Manual Code Review
| Axis | Detector + LLM triage | LLM-first exploration | Manual code review |
|---|---|---|---|
| Speed to first signal | Seconds (deterministic scan) | Minutes (context loading + generation) | Hours to days |
| Token cost per repo scan | Near zero (tools) + small triage batches | 100k-500k tokens per pass | Zero (human time instead) |
| Reproducibility | Deterministic detection, consistent triage | Variable -- different results each run | Depends on reviewer |
| False positive handling | Tools flag, model classifies, human decides | Model flags and classifies in one pass (lower precision) | Human handles both |
| Scales to 500k lines | Yes -- tools do not care about repo size | No -- context window limits force chunking heuristics | No -- reviewer fatigue |
| When it wins | Ongoing maintenance of large, active codebases | Greenfield design review, small codebases, one-off audits | Mentorship, knowledge transfer, nuanced design decisions |
LLM-first exploration works when the codebase fits in context and you need design-level insight, not mechanical detection. For a 2,000-line project where you want someone to challenge your architecture, paste the whole thing and ask. For a 200,000-line monolith where you need to find the 15 files that should be split, run the detectors.
Measurement Precedes Explanation
Every experimental physics lab runs the same pattern. Instruments collect signals. Signals get ranked by strength. The interesting phenomena get studied in detail. Nobody walks into CERN and asks the accelerator what is wrong with the Standard Model.
Build the detectors. Run them on CI. Persist the inventory. Send classified batches to the model for triage. Refactoring becomes what it should have been: experimental science. You measure first. Then you explain what you found.