Summary
The intuitive same-hand judge reads English authorial voice without recognition — strip a real Flint passage of its names and the judges still credit the hand. For Chinese it is recognition-bound.
The asymmetry tracks language, not author — confirmed directly (Session 30, B24b): on the same clean 7-model slate, four capable judges read transposed real Flint above the fakes (Δ +1.9 to +6.4) while none read transposed/obscure real Ma above the fakes. So it is a model-capability wall, not a calibration bug. Flint stays production; the Chinese voice instrument is blocked on model capability — not buildable with current models.
The instrument, and what its band certifies
The gate-4 judge is intuitive-holistic: ~8 unlabeled real-author exemplars, anonymized author, the question "are these the same hand?", gentle framing, no checklist, a single VOICEFIT: X/10 confidence line. The band answers "is this the author's own hand?" — a forensic bar, not a fitness-for-purpose one. A below-band score is not "worthless to the consumer," and the band is never lowered to make verdicts green.
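A minimal sketch of reading the single confidence line out of a judge reply. The function name and the strictness rule (exactly one well-formed line, otherwise no score) are assumptions for illustration, not the production parser.

```python
import re
from typing import Optional

# Matches a line such as "VOICEFIT: 7.5/10" standing on its own.
VOICEFIT_RE = re.compile(
    r"^[ \t]*VOICEFIT:[ \t]*(\d+(?:\.\d+)?)[ \t]*/[ \t]*10[ \t]*$",
    re.MULTILINE,
)


def parse_voicefit(reply: str) -> Optional[float]:
    """Return the 0-10 score, or None if the reply is malformed.

    Assumption: a reply with zero or several VOICEFIT lines is treated
    as unscored rather than guessed at.
    """
    matches = VOICEFIT_RE.findall(reply)
    if len(matches) != 1:
        return None
    score = float(matches[0])
    return score if 0.0 <= score <= 10.0 else None
```

Returning None instead of a default keeps a malformed reply from being mistaken for a low score.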
Why intuitive, not rubric
A rubric judge — labeled exemplars plus a checklist of countable moves — skews both ways: it penalizes genuine author prose that doesn't tick every box, and rewards box-ticking fakes. Confirmed on three independent voice families:
| Voice | Rubric | Intuitive · same prose |
|---|---|---|
| 马伯庸 zh fiction | recall@0-FP 44% · fakes ≤5.2 | recall@0-FP 78% · fakes ≤2.2 |
| Flint en fiction | our output "6.8, near band" | ~3 — with the generic fakes |
| explainers en/de/fr/zh | mimic ceiling 4.3–5.33 | mimic ceiling 0.3–2.0 |
The variable a 2×2 factorial isolated was framing: gentle "same hand?" keeps real and fake correctly separated; harsh "find where it betrays itself" craters real and fake alike — a floorless flaw-hunter.
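The recall@0-FP figures in the table can be read as "the share of real passages that score above every fake." The sketch below implements that reading; the definition is an assumption inferred from the metric's name, and the scores in the usage note are invented for illustration, not study data.

```python
from typing import Sequence


def recall_at_zero_fp(real_scores: Sequence[float],
                      fake_scores: Sequence[float]) -> float:
    """Fraction of reals scoring strictly above the highest fake.

    Assumption: the zero-false-positive threshold sits at the top fake
    score, so no fake can pass it.
    """
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake score")
    threshold = max(fake_scores)
    passed = sum(1 for s in real_scores if s > threshold)
    return passed / len(real_scores)
```

With illustrative reals `[8.0, 7.5, 6.0, 2.0]` and fakes `[2.2, 1.0, 0.5]`, three of the four reals clear the top fake, so the function returns 0.75.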
The recognition confound
On material the judge can place in its training memory, the score measures recognition of the specific text, not voice. Watch what a name-swap does:
The load-bearing proof is the fake
Real-Ma-transposed scoring low (1.62), on its own, is consistent with the innocent reading "the name-swap damaged the prose." The cell that refutes it is a fabrication scoring as high as the real author — and it keeps its names, so it is immune to the prose-damage objection:
A fabrication outscores the author's own prose the moment that prose is made unrecognizable. A fake inheriting the full score proves the high score isn't a measurement of authorship at all. And flat prose + real names = 0.5 (lowest in the study) shows the names are a retrieval key, not the signal.
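A name-swap of the kind described above could be sketched as follows. The function and the placeholder names are hypothetical; the one design point it illustrates is doing all swaps in a single pass, so the prose outside the names stays byte-identical and a replacement is never itself replaced.

```python
import re
from typing import Dict


def transpose_names(text: str, name_map: Dict[str, str]) -> str:
    """Swap every mapped name in one pass, leaving other text untouched.

    Longer names are tried first so a full name wins over a bare first
    name. No word boundaries are used (Chinese has none), so a mapped
    name that occurs inside a longer word is swapped too.
    """
    if not name_map:
        return text
    keys = sorted(name_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: name_map[m.group(0)], text)
```

For example, `transpose_names("Anna Weber met Anna.", {"Anna": "Clara", "Anna Weber": "Clara Holt"})` gives `"Clara Holt met Clara."`.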
The four-arm recognition test
Generalized from this — never prove recognition-binding with arm 1 alone:
- Real-author transposed — names swapped, prose else byte-identical.
- Fake of recognized material, named and transposed — the decisive arm.
- Obscure held-out real — a work the model never memorized.
- Names-only on flat, voiceless prose.
The decisive contrast is arm 2 named vs arm 1. A judge that lets a fake of recognized material outscore the real author's transposed prose is recognition-bound → rebuild it, or declare it blocked. Mechanism: the judge performs passage-identification — names as the retrieval key — not prose-craft reading.
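The decision rule can be written down compactly. This is a sketch under stated assumptions: the field names and the optional `margin` parameter are hypothetical, and only the comparison of arm 2 (named) against arm 1 comes from the text.

```python
from typing import NamedTuple


class FourArmScores(NamedTuple):
    real_transposed: float   # arm 1: real author, names swapped
    fake_named: float        # arm 2: fake of recognized material, names kept
    fake_transposed: float   # arm 2: the same fake, names swapped
    obscure_real: float      # arm 3: held-out real the model never memorized
    names_on_flat: float     # arm 4: real names on flat, voiceless prose


def is_recognition_bound(s: FourArmScores, margin: float = 0.0) -> bool:
    """True when a named fake outscores the real author's transposed prose.

    `margin` is an assumed noise allowance; the source states only the
    bare comparison.
    """
    return s.fake_named > s.real_transposed + margin
```

Arms 3 and 4 are carried in the record but not used in the verdict here. They serve as context: the held-out real checks arm 1 against prose damage, and the names-only arm shows whether names alone move the score.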
The language asymmetry
The same deciders, the same de-confound method, opposite outcomes by language. For English there is a voice-detector under the recognition halo; for Chinese there is nothing under it.
English · Eric Flint READS VOICE ✓
Names stripped, the judge still credits the hand — recognition adds only ~0.5–2.5 on top.
Chinese · 马伯庸 CAN'T READ ✗
Transposed/obscure real Ma, AI fakes, and a different author collapse together. No separation.
This generalizes the standing 2026-05-29 "Opus blind to Chinese accent" finding — and Session 30 confirmed it directly. The capability probe (B24b) ran the full 7-model slate on a clean channel (all judges + all generated stimuli on OpenRouter official upstreams, no aiberm), scoring transposed/obscure real author vs clean AI-reconstruction fakes. Four capable judges credit transposed real Flint above the fakes — gpt +3.53, gemini +6.41, qwen +2.61, opus +1.93 — while the same four cannot separate real Ma from fakes (opus −0.32, gpt +0.40 wash, gemini +1.71 noise). opus is the cleanest demonstration: +1.93 English vs −0.32 Chinese. So a transposition-robust maboyong judge is not buildable with current models — a Chinese-authorial-voice-reading wall, not a broken instrument (stimuli re-confirmed clean, judge and generation). The earlier aiberm "opus signal" (3.57) was killed twice — clean judge, then clean fakes.
The wall is language-and-register, not language alone. Re-run the same probe on Lu Changhai's Chinese exposition (B24c, 2026-06-16): a strong few-shot mimic — a frontier model given Lu's own 8 exemplars, writing his hand on a fresh topic — clears the judge as easily as recognition-stripped real Lu (gpt-mimic glm 9.4 / qwen 7.8 / gemini 7.6 / opus 5.1, vs real-Lu-transposed 1.7 / 0.9 / 3.3 / 4.9). So exposition also fails the forensic hand-identity bar — Lu's signature technique is imitable, not a fingerprint. (This corrected an initial over-claim: a "+3.44 opus reads voice" delta against weak fakes, killed by the mimic capstone — §10.) But unlike Ma, transposed real Lu scores above flat/generic fakes for the graded judge (opus +2.72 obscure-only, gemini +4.28; Ma's fiction fell below, opus −0.32) — the exposition register survives transposition where fiction-narrator voice did not. That supports a fit-for-purpose gate (rejects slop; passes competent on-voice prose, real or mimic), not a forensic one — but only on an opus-led panel: the locked production deciders glm + qwen floor transposed real Lu (1.5 / 1.0, ~recognition-bound like Ma), and the gate passes strong mimics by construction.
Length sensitivity
The judge is calibrated at ~415 words and compresses toward 5 ("can't tell") the longer the passage — a long multi-POV surface gives the gestalt read more chances to find one "slightly off" draw.
Compression is judge-specific and flips by language: on English GPT-5.5 collapses while Opus holds; on Chinese it reverses. The fix — sliding-window scoring (window_score.py) — chops the passage into calibration-length windows, scores each on the unchanged panel, and aggregates.
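A sliding-window pass along these lines might look like the sketch below. It is not the contents of window_score.py: the function names, the whitespace word count (which would not work for Chinese), the merging of a short tail, and the mean as the aggregate are all assumptions.

```python
from statistics import mean
from typing import Callable, List, Tuple


def split_windows(text: str, window_words: int = 415) -> List[str]:
    """Chop text into consecutive windows of about `window_words` words.

    A tail shorter than half a window is merged into the previous window
    so no window falls far below calibration length.
    """
    words = text.split()
    chunks = [words[i:i + window_words]
              for i in range(0, len(words), window_words)]
    if len(chunks) > 1 and len(chunks[-1]) < window_words // 2:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(c) for c in chunks]


def window_score(text: str,
                 judge: Callable[[str], float],
                 window_words: int = 415) -> Tuple[float, List[float]]:
    """Score each window with the unchanged judge and return (mean, scores)."""
    windows = split_windows(text, window_words)
    if not windows:
        raise ValueError("empty passage")
    scores = [judge(w) for w in windows]
    return mean(scores), scores
```

Returning the per-window scores alongside the aggregate keeps a single off window visible instead of averaging it away.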
Axis conflation: composition vs prose
The single VOICEFIT number blends two separable axes — composition (how a scene is built; in the plan) and prose (the wording) — and under-credits composition. Content-constant 2×2 on Flint:
| Composition | Prose | voicefit | compfit |
|---|---|---|---|
| Flint | Flint (real) | 7.7 | 9.0 |
| Flint | flat | 2.65 | 6.4 |
| conventional | Flint | 3.1 | 2.0–2.6 |
| neutral | flat | 0.7 | 2.0–2.2 |
The prose ruler needs both axes maxed; one at a time reads as a fake. The fix is a second instrument, compfit (the same exemplars, asking "same construction?"), locked for Flint. compfit_zh is built but provisional: its first pass used a recognizable scene, so it inherits §2's confound until re-run.
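The "both axes maxed" reading can be checked with the standard 2×2 interaction contrast on the voicefit column of the table. One assumption is needed: the "conventional" and "neutral" composition rows are both treated as the non-Flint baseline. The function names are illustrative.

```python
def additive_prediction(first_only: float, second_only: float,
                        neither: float) -> float:
    """Score expected for the both-present cell if the two effects simply add."""
    return first_only + second_only - neither


def interaction(both: float, first_only: float, second_only: float,
                neither: float) -> float:
    """2x2 interaction contrast: observed both-cell minus the additive prediction."""
    return both - additive_prediction(first_only, second_only, neither)


# voicefit cells from the table above
BOTH = 7.7         # Flint composition, Flint prose
COMP_ONLY = 2.65   # Flint composition, flat prose
PROSE_ONLY = 3.1   # conventional composition, Flint prose
NEITHER = 0.7      # neutral composition, flat prose

predicted = additive_prediction(COMP_ONLY, PROSE_ONLY, NEITHER)   # about 5.05
excess = interaction(BOTH, COMP_ONLY, PROSE_ONLY, NEITHER)        # about 2.65
```

If the two axes contributed independently, the real Flint cell would sit near 5.05. It scores 7.7, about 2.65 points above that, which matches the observation that either axis alone reads as a fake.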
Model screening map
| Model | Discriminates | Role |
|---|---|---|
| GPT-5.5 | every voice — en/de/fr + Chinese gestalt | decide |
| Opus-4.8 | en/de/fr fully; Chinese coarse-gestalt only, blind on the accent axis | decide · not zh |
| Fable-5 | everything (reals saturate 9.0); coarse for iteration | certifier |
| glm-5.1 | Chinese — the validated zh discriminator | decide · zh |
| qwen3.7-max | 卢昌海 (zh explainer) | decide · zh |
| Gemini-3.x | clear cases only (saturating binary) | loose |
| kimi · deepseek · mistral | can't credit reals / gameable / fooled by mimics | rejected |
Recipe-level caution: a model can pass coarse gestalt yet be blind on the fine-grained accent axis. A coarse screen "passing" does not retire an accent finding — to retire it you must beat the accent adversarial (B21). Until then Opus stays off the zh voice panels.
Transport
Two traps, both first mis-recorded as model defects: (a) slug form differs by relay (aiberm wants bare slugs, OpenRouter prefixed — engine/models.py _route normalizes); (b) max_tokens as a footgun — a low cap makes reasoning-heavy models return empty content. With both fixed, only three models can't run on aiberm — glm-5.1 (verdict truncated), qwen3.7-max and mistral (not carried); every other judge matches OpenRouter within ±1 including on Chinese (GPT-5.5 on aiberm = a maboyong real 8.4). So panels route mixed — the incompatible members go to OpenRouter (run_mixed.sh), the rest to aiberm; the maboyong zh panel runs mixed (glm→OR, GPT→aiberm), and only uniflection_zh is OpenRouter-only because both its deciders (glm+qwen) are incompatible. So “zh on OpenRouter” is about those two models, not about aiberm and Chinese. Validate each judge on a real+fake pair on any new transport — a model-list match is not equivalence.
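The mixed routing and slug normalization described above could be sketched like this. It is not the code in engine/models.py: the function names are hypothetical, the vendor prefixes in the usage note are placeholders, and only the list of aiberm-incompatible models comes from the text.

```python
from typing import Dict, List, Tuple

# Models the text lists as unable to run on aiberm.
OPENROUTER_ONLY = {"glm-5.1", "qwen3.7-max", "mistral"}


def split_slug(slug: str) -> Tuple[str, str]:
    """Split 'vendor/name' into (vendor, name); a bare slug has vendor ''."""
    vendor, sep, name = slug.partition("/")
    return (vendor, name) if sep else ("", slug)


def slug_for(relay: str, slug: str) -> str:
    """Return the slug form a relay expects: bare for aiberm, prefixed for OpenRouter."""
    vendor, name = split_slug(slug)
    if relay == "aiberm":
        return name
    if relay == "openrouter":
        if not vendor:
            raise ValueError(f"OpenRouter needs a vendor-prefixed slug: {slug!r}")
        return f"{vendor}/{name}"
    raise ValueError(f"unknown relay: {relay!r}")


def route_panel(slugs: List[str]) -> Dict[str, List[str]]:
    """Send aiberm-incompatible judges to OpenRouter and the rest to aiberm."""
    routes: Dict[str, List[str]] = {"aiberm": [], "openrouter": []}
    for slug in slugs:
        _, name = split_slug(slug)
        relay = "openrouter" if name in OPENROUTER_ONLY else "aiberm"
        routes[relay].append(slug_for(relay, slug))
    return routes
```

Called with placeholder slugs `["vendor-a/glm-5.1", "vendor-b/gpt-5.5"]`, this sends glm to OpenRouter in prefixed form and GPT to aiberm as a bare slug, which is the split described for the maboyong zh panel. Routing alone does not establish equivalence; each judge still needs validating on a real+fake pair on any new transport.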
Per-voice scoreboard
| Voice | Band (short / long form) | Fingerprint | Recognition | Verdict |
|---|---|---|---|---|
| Flint en | 7.5 / 5.5 | d607d5a83bf3 | clean | Reads voice. Production. |
| 马伯庸 zh | 6.0 / void | 058ef8e6b986 | FAILED | Recognition-bound. Void as voice. |
| PG en explainer | 8.0 | 7c8927ccae25 | owed | Real 9.5–10 · mimics 2.0 · unaudited reals |
| Freistetter de | 7.5 | 1982f8398fff | owed | POS 8.1–8.5 · mimics ≤0.5 |
| Louapre fr | 7.0 | 20fcaa5edf31 | owed | POS 7.7–8.4 · mimics ≤0.8 |
| 卢昌海 zh explainer | 5.5 (recognition-inflated) | 7ab47b0e9175 | B24c · forensic FAIL | Mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed) → not hand-identity. But register separates (opus +2.72) → fit-for-purpose gate only, opus-led; glm+qwen floor transposed real Lu. |
The four explainer bands and Flint's short-form band were calibrated on reals whose recognizability was not audited. By §2's doctrine, a band is only a voice band once it holds on transposed reals — Flint's long-form is clean (transposed ref); the rest owe the audit.
Open questions
- B24(b) — RESOLVED (Session 30): no model reads Chinese voice without recognition. The full slate, screened clean (§3): none credit transposed/obscure real Ma above clean AI fakes; four read transposed real Flint above the fakes. → maboyong is blocked on model capability. Decision for Stephen: a fit-for-purpose bar (recognition-anchored, scoped to 1628) vs hold and re-probe at the next model refresh.
- B24(c): 卢昌海 RESOLVED (2026-06-16); other explainers + Flint short-form still owed. The Lu exposition judge fails the forensic bar: a strong few-shot mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed); band 5.5 is recognition-inflated. But it differs from Ma: exposition register survives transposition (transposed real Lu > flat fakes: opus +2.72, gemini +4.28), so a fit-for-purpose "competent on-voice explainer" gate is buildable, but only on an opus-led panel (the locked glm+qwen deciders floor transposed real Lu), and it passes mimics by construction. Decisive follow-up (not run): a different real Chinese science-popularizer, transposed, on glm+qwen, to tell a register-gate from a Lu-gate. Evidence: recognition_probe_b24c_2026_06_16.json. PG / Freistetter / Louapre + Flint short-form still owe the transposed-reals audit.
- compfit_zh: the content-constant 2×2, re-run on a non-recognizable scene.
- B21 — the accent adversarial. A Ma-gestalt passage salted with native-fluency defects, Opus vs native deciders head-to-head — the only test that can retire "Opus blind to Chinese accent."
The meta-lesson
A coarser or looser instrument flatters a result that a sharper test then corrects.
- The rubric judge scored our output 6.8; the intuitive judge read ~3.
- "Prompting is exhausted" — until the restraint test showed from-plan Flint moves ~2 → ~4.8.
- The Chinese bake-off ranked writers ~6.7 — until name-transposition floored all of them, real Ma included.
- "Opus blind to Chinese is retired" — until the axis was named: the gestalt screen never probed the accent axis.
- "Lu's non-fiction reads voice (opus +3.44)" — until the few-shot-mimic capstone showed a strong mimic clears the judge as easily as real Lu: what survives transposition is the register, not the hand. This time the flattering read appeared at the reasoning layer and the sharper stimulus corrected it (B24c).
This is why the findings here are dated, why every "at band" call owes the sharper test, and why the four-arm recognition test and transposition-before-banding are now hard rules. The same flattering-instrument failure the judge doctrine fixes at the measurement layer recurs at the reasoning layer — guard against it there too.