Summary
The intuitive same-hand judge reads English authorial voice without recognition — strip a real Flint passage of its names and the judges still credit the hand. For Chinese it is recognition-bound.
The asymmetry tracks language, not author — confirmed directly (Session 30, B24b): on the same clean 7-model slate, four capable judges read transposed real Flint above the fakes (Δ +1.9 to +6.4) while none read transposed/obscure real Ma above the fakes. So it is a model-capability wall, not a calibration bug. Flint stays production; the Chinese voice instrument is blocked on model capability — not buildable with current models.
The instrument, and what its band certifies
The gate-4 judge is intuitive-holistic: ~8 unlabeled real-author exemplars, anonymized author, the question "are these the same hand?", gentle framing, no checklist, a single VOICEFIT: X/10 confidence line. The band answers "is this the author's own hand?" — a forensic bar, not a fitness-for-purpose one. A below-band score is not "worthless to the consumer," and the band is never lowered to make verdicts green.
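A minimal sketch of reading the judge's single confidence line, assuming the reply contains exactly one line of the form VOICEFIT: X/10. The regex and the name `parse_voicefit` are illustrative, not the pipeline's own code. A missing or ambiguous line yields None rather than a number, so an empty reply is never mistaken for a low verdict.

```python
import re
from typing import Optional

# Assumed verdict format: one line such as "VOICEFIT: 7.5/10" in the judge reply.
_VOICEFIT = re.compile(r"^\s*VOICEFIT:\s*(\d+(?:\.\d+)?)\s*/\s*10\s*$", re.MULTILINE)


def parse_voicefit(reply: str) -> Optional[float]:
    """Return the 0-10 score from a judge reply, or None if it cannot be read."""
    matches = _VOICEFIT.findall(reply)
    if len(matches) != 1:
        # No verdict line, or more than one: treat as a failed call, not a score.
        return None
    score = float(matches[0])
    # Reject values outside the stated 0-10 scale.
    return score if 0.0 <= score <= 10.0 else None
```

Returning None keeps transport failures (for example a truncated or empty reply) separate from genuine below-band scores.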
Why intuitive, not rubric
A rubric judge — labeled exemplars plus a checklist of countable moves — skews both ways: it penalizes genuine author prose that doesn't tick every box, and rewards box-ticking fakes. Confirmed on three independent voice families:
| Voice | Rubric | Intuitive · same prose |
|---|---|---|
| 马伯庸 zh fiction | recall@0-FP 44% · fakes ≤5.2 | recall@0-FP 78% · fakes ≤2.2 |
| Flint en fiction | our output "6.8, near band" | ~3 — with the generic fakes |
| explainers en/de/fr/zh | mimic ceiling 4.3–5.33 | mimic ceiling 0.3–2.0 |
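The recall@0-FP figures can be read as follows, assuming the usual definition: place the pass threshold at the highest-scoring fake, so that no fake passes, and report the share of real passages that still score above it. The function name and the sample scores are invented for illustration and are not taken from the study.

```python
from typing import Sequence


def recall_at_zero_fp(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Share of reals scoring strictly above the best fake (zero false positives)."""
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake score")
    threshold = max(fake_scores)  # the strictest cut that still rejects every fake
    passed = sum(1 for s in real_scores if s > threshold)
    return passed / len(real_scores)


# Illustrative values only: three of four reals clear the top fake (2.2).
example = recall_at_zero_fp([8.0, 7.1, 2.0, 6.5], [2.2, 1.0])
```

Under this reading, a judge improves on the metric either by crediting more reals or by pushing the fake ceiling down, which is why the table reports both numbers side by side.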
A 2×2 factorial isolated framing as the operative variable: gentle "same hand?" keeps real and fake correctly separated, while harsh "find where it betrays itself" craters real and fake alike, turning the judge into a floorless flaw-hunter.
The recognition confound
On material the judge can place in its training memory, the score measures recognition of the specific text, not voice. Watch what a name-swap does:
The load-bearing proof is the fake
Real-Ma-transposed scoring low (1.62), on its own, is consistent with the innocent reading "the name-swap damaged the prose." The cell that refutes it is a fabrication scoring as high as the real author — and it keeps its names, so it is immune to the prose-damage objection:
A fabrication outscores the author's own prose the moment that prose is made unrecognizable. A fake inheriting the full score proves the high score isn't a measurement of authorship at all. And flat prose + real names = 0.5 (lowest in the study) shows the names are a retrieval key, not the signal.
The four-arm recognition test
Generalized from this: never prove recognition-binding with arm 1 alone.
- Arm 1: real-author transposed. Names swapped, prose otherwise byte-identical.
- Arm 2: fake of recognized material, named and transposed. The decisive arm.
- Arm 3: obscure held-out real. A work the model never memorized.
- Arm 4: names-only on flat, voiceless prose.
The decisive contrast is arm 2 named vs arm 1. A judge that lets a fake of recognized material outscore the real author's transposed prose is recognition-bound → rebuild it, or declare it blocked. Mechanism: the judge performs passage-identification — names as the retrieval key — not prose-craft reading.
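The decisive contrast can be sketched as a small check, assuming each arm has already been reduced to one mean score per judge. The dataclass, the field names, and the scores in the example are hypothetical; only the rule itself (a named fake outscoring the transposed real marks the judge as recognition-bound) comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class FourArmScores:
    real_transposed: float  # arm 1: names swapped, prose otherwise identical
    fake_named: float       # arm 2, named variant: fake of recognized material
    fake_transposed: float  # arm 2, transposed variant
    obscure_real: float     # arm 3: held-out work the model never memorized
    names_on_flat: float    # arm 4: real names on flat, voiceless prose


def is_recognition_bound(scores: FourArmScores) -> bool:
    """True when a named fake outscores the real author's transposed prose."""
    return scores.fake_named > scores.real_transposed


# Hypothetical scores for one judge, not measurements from the study.
suspect = FourArmScores(real_transposed=2.0, fake_named=7.0,
                        fake_transposed=2.5, obscure_real=2.2, names_on_flat=0.5)
```

Arms 3 and 4 are carried in the record even though the check reads only two fields, since they are what rule out the "name-swap damaged the prose" reading when the verdict is reviewed.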
The language asymmetry
The same deciders, the same de-confound method, opposite outcomes by language. For English there is a voice-detector under the recognition halo; for Chinese there is nothing under it.
- English · Eric Flint: READS VOICE ✓. Names stripped, the judge still credits the hand; recognition adds only ~0.5–2.5 on top.
- Chinese · 马伯庸: CAN'T READ ✗. Transposed/obscure real Ma, AI fakes, and a different author collapse together. No separation.
This generalizes the standing 2026-05-29 "Opus blind to Chinese accent" finding — and Session 30 confirmed it directly. The capability probe (B24b) ran the full 7-model slate on a clean channel (all judges + all generated stimuli on OpenRouter official upstreams, no aiberm), scoring transposed/obscure real author vs clean AI-reconstruction fakes. Four capable judges credit transposed real Flint above the fakes — gpt +3.53, gemini +6.41, qwen +2.61, opus +1.93 — while the same four cannot separate real Ma from fakes (opus −0.32, gpt +0.40 wash, gemini +1.71 noise). opus is the cleanest demonstration: +1.93 English vs −0.32 Chinese. So a transposition-robust maboyong judge is not buildable with current models — a Chinese-authorial-voice-reading wall, not a broken instrument (stimuli re-confirmed clean, judge and generation). The earlier aiberm "opus signal" (3.57) was killed twice — clean judge, then clean fakes.
The wall is language-and-register, not language alone. Re-run the same probe on Lu Changhai's Chinese exposition (B24c, 2026-06-16): a strong few-shot mimic — a frontier model given Lu's own 8 exemplars, writing his hand on a fresh topic — clears the judge as easily as recognition-stripped real Lu (gpt-mimic glm 9.4 / qwen 7.8 / gemini 7.6 / opus 5.1, vs real-Lu-transposed 1.7 / 0.9 / 3.3 / 4.9). So exposition also fails the forensic hand-identity bar — Lu's signature technique is imitable, not a fingerprint. (This corrected an initial over-claim: a "+3.44 opus reads voice" delta against weak fakes, killed by the mimic capstone — §10.) But unlike Ma, transposed real Lu scores above flat/generic fakes for the graded judge (opus +2.72 obscure-only, gemini +4.28; Ma's fiction fell below, opus −0.32) — the exposition register survives transposition where fiction-narrator voice did not. That supports a fit-for-purpose gate (rejects slop; passes competent on-voice prose, real or mimic), not a forensic one — but only on an opus-led panel: the locked production deciders glm + qwen floor transposed real Lu (1.5 / 1.0, ~recognition-bound like Ma), and the gate passes strong mimics by construction.
Resolved — the wall is language, not register (Paul Graham, en explainer, 2026-06-18). The same probe on PG comes out the opposite of Lu: three genuinely strong fresh few-shot PG mimics floor (gpt 2.45 / opus 3.06 / qwen 1.45 / gemini 1.22) while real PG scores 8.7–10 and generic prose ~0 — a forensic gap of +5.6 to +8.8 on PG's deciders, where the matched Lu mimic cleared its judge (glm 9.44). So the PG judge rejects imitation — it reads the hand. PG and Lu are both explainer register, so the same register gives opposite outcomes ⇒ with current models the forensic wall falls on the Chinese side, not the explainer-register side. The clean 2×2 — English fiction (Flint ✓) + English explainer (PG ✓) read voice; Chinese fiction (Ma ✗) + Chinese explainer (Lu ✗) are recognition/register-bound — with the one confound-free demonstration being within-model opus: it reads PG's hand and transposed Flint (+1.93 en) but cannot separate real Ma from fakes (−0.32 zh). Caveats: PG's "obscure" reals are published (this passes the mimic capstone, the test that killed Lu, but doesn't fully close recognition-independence); the name-transposition control is weak; the axis is capability-dated and n=1-per-cell (de/fr explainers still owe the probe). Evidence: recognition_probe_pg_2026_06_18.json.
Closed — the wall is Chinese-specific, not non-English (Freistetter de + Louapre fr, 2026-06-18). Both European explainers pass the same mimic capstone. Real Freistetter and Louapre score 6.7–10 on the capable judges while strong fresh few-shot mimics floor (≈2.3–3.8); FORENSIC (real − mimic) is positive on every judge — gpt +3.7 / +5.6, opus +2.8 / +3.3, qwen +7.3 / +7.6, gemini +7.2 / +8.1 (de / fr). So three Western languages — English (fiction + explainer), German, French — all read the hand and reject imitation, across fiction and exposition. Only Chinese (Ma + Lu) is recognition/register-bound. The wall is Chinese-specific among the languages tested, not "non-English" and not "explainer register." Caveats: capability-dated (within-model opus reads en/de/fr, fails zh); only one non-Western language was tested, so "Chinese" may stand for a broader CJK / low-resource class; deepseek over-credits the mimics in every language (gameable judge). Evidence: de_results.json, fr_results.json.
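The FORENSIC figure is simply real minus mimic, per judge. As a worked instance, the sketch below applies it to the Lu scores quoted earlier (strong mimic vs real-Lu-transposed), assuming the two slash-separated lists follow the same judge order (glm / qwen / gemini / opus). The helper names are illustrative.

```python
from typing import Dict


def forensic_gap(real: Dict[str, float], mimic: Dict[str, float]) -> Dict[str, float]:
    """Per-judge FORENSIC delta: real score minus mimic score."""
    return {judge: round(real[judge] - mimic[judge], 2) for judge in real}


def rejects_imitation(gap: Dict[str, float]) -> bool:
    """A judge panel passes the mimic capstone only if every gap is positive."""
    return all(delta > 0 for delta in gap.values())


# Lu figures as quoted in the text; the judge-to-value mapping is an assumption.
lu_mimic = {"glm": 9.4, "qwen": 7.8, "gemini": 7.6, "opus": 5.1}
lu_real_transposed = {"glm": 1.7, "qwen": 0.9, "gemini": 3.3, "opus": 4.9}

lu_gap = forensic_gap(lu_real_transposed, lu_mimic)
lu_passes = rejects_imitation(lu_gap)  # False: the mimic matches or beats real Lu everywhere
```

On these numbers every Lu gap is negative, the mirror image of the positive de / fr gaps, which is the sense in which Lu fails the forensic bar while Freistetter and Louapre pass it.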
Length sensitivity
The judge is calibrated at ~415 words and compresses toward 5 ("can't tell") the longer the passage — a long multi-POV surface gives the gestalt read more chances to find one "slightly off" draw.
Compression is judge-specific and flips by language: on English GPT-5.5 collapses while Opus holds; on Chinese it reverses. The fix — sliding-window scoring (window_score.py) — chops the passage into calibration-length windows, scores each on the unchanged panel, and aggregates.
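A rough sketch of the sliding-window idea, not the contents of window_score.py. It assumes whitespace-delimited words (so Chinese text would need a character-based splitter instead), non-overlapping windows by default, and the median as the aggregate; the source does not say which aggregate the real script uses. The `judge` callable stands in for the unchanged panel.

```python
from statistics import median
from typing import Callable, List, Optional

CALIBRATION_WORDS = 415  # the length the judge is calibrated at, per the notes


def split_windows(text: str, size: int = CALIBRATION_WORDS,
                  stride: Optional[int] = None) -> List[str]:
    """Chop text into full-length windows; the last window is back-filled to size."""
    words = text.split()
    stride = stride or size
    if len(words) <= size:
        return [" ".join(words)]
    starts = list(range(0, len(words) - size + 1, stride))
    if starts[-1] + size < len(words):
        # Cover the tail with one more full-length window ending at the last word.
        starts.append(len(words) - size)
    return [" ".join(words[s:s + size]) for s in starts]


def score_passage(text: str, judge: Callable[[str], float]) -> float:
    """Score each calibration-length window and aggregate (median is an assumption)."""
    return median(judge(window) for window in split_windows(text))
```

Back-filling the final window keeps every scored chunk at calibration length, at the cost of some overlap with the previous window.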
Axis conflation: composition vs prose
The single VOICEFIT number blends two separable axes — composition (how a scene is built; in the plan) and prose (the wording) — and under-credits composition. Content-constant 2×2 on Flint:
| Composition | Prose | voicefit | compfit |
|---|---|---|---|
| Flint | Flint (real) | 7.7 | 9.0 |
| Flint | flat | 2.65 | 6.4 |
| conventional | Flint | 3.1 | 2.0–2.6 |
| neutral | flat | 0.7 | 2.0–2.2 |
The prose ruler needs both axes maxed; one at a time reads as a fake. The fix is a second instrument — compfit, the same exemplars asking "same construction?" — locked for Flint. compfit_zh is built but provisional (its first-pass used a recognizable scene, so it inherits §2's confound until re-run).
Model screening map
| Model | Discriminates | Role |
|---|---|---|
| GPT-5.5 | every voice — en/de/fr + Chinese gestalt | decide |
| Opus-4.8 | en/de/fr fully; Chinese coarse-gestalt only, blind on the accent axis | decide · not zh |
| Fable-5 | everything (reals saturate 9.0); coarse for iteration | certifier |
| glm-5.1 | Chinese — the validated zh discriminator | decide · zh |
| qwen3.7-max | 卢昌海 (zh explainer) | decide · zh |
| Gemini-3.x | clear cases only (saturating binary) | loose |
| kimi · deepseek · mistral | can't credit reals / gameable / fooled by mimics | rejected |
Recipe-level caution: a model can pass coarse gestalt yet be blind on the fine-grained accent axis. A coarse screen "passing" does not retire an accent finding — to retire it you must beat the accent adversarial (B21). Until then Opus stays off the zh voice panels.
Transport
Two traps, both first mis-recorded as model defects: (a) slug form differs by relay (aiberm wants bare slugs, OpenRouter wants prefixed ones; engine/models.py _route normalizes); (b) max_tokens is a footgun: a low cap makes reasoning-heavy models return empty content. With both fixed, only three models can't run on aiberm: glm-5.1 (verdict truncated), plus qwen3.7-max and mistral (not carried). Every other judge matches OpenRouter within ±1, including on Chinese (GPT-5.5 on aiberm scores a maboyong real 8.4). So panels route mixed: the incompatible members go to OpenRouter (run_mixed.sh), the rest to aiberm. The maboyong zh panel runs mixed (glm→OR, GPT→aiberm), and only uniflection_zh is OpenRouter-only, because both its deciders (glm + qwen) are incompatible. "zh on OpenRouter" is therefore about those two models, not about aiberm and Chinese. Validate each judge on a real+fake pair on any new transport; a model-list match is not equivalence.
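The mixed routing can be sketched as below. This is not the actual engine/models.py _route; it assumes models are configured in prefixed "vendor/name" form, and the vendor prefixes in the example are placeholders. Only the list of aiberm-incompatible models comes from the notes.

```python
from typing import Iterable, Tuple

# Models the notes report as unable to run on aiberm.
AIBERM_INCOMPATIBLE = {"glm-5.1", "qwen3.7-max", "mistral"}


def route(prefixed_slug: str) -> Tuple[str, str]:
    """Pick a relay and return the slug in the form that relay expects."""
    if "/" not in prefixed_slug:
        raise ValueError(f"expected 'vendor/name', got {prefixed_slug!r}")
    bare = prefixed_slug.split("/", 1)[1]
    if bare in AIBERM_INCOMPATIBLE:
        return "openrouter", prefixed_slug  # OpenRouter wants the prefixed slug
    return "aiberm", bare                   # aiberm wants the bare slug


def is_openrouter_only(panel: Iterable[str]) -> bool:
    """True when every decider on the panel has to go through OpenRouter."""
    return all(route(model)[0] == "openrouter" for model in panel)


# Placeholder vendor prefixes; a glm + qwen panel comes out OpenRouter-only.
zh_explainer_panel = ["vendor-a/glm-5.1", "vendor-b/qwen3.7-max"]
```

Routing decides only where a call goes; each judge still needs the real+fake validation pair on any new transport.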
Per-voice scoreboard
| Voice | Band (short / long) | Fingerprint | Recognition | Verdict |
|---|---|---|---|---|
| Flint en | 7.5 / 5.5 | d607d5a83bf3 | clean | Reads voice. Production. |
| 马伯庸 zh | 6.0 / void | 058ef8e6b986 | FAILED | Recognition-bound. Void as voice. |
| PG en explainer | 8.0 | 7c8927ccae25 | mimic-capstone ✓ | Reads the hand. 3 strong fresh mimics floor 1.2–3.5 vs real PG 8.7–10; FORENSIC +5.6/+6.9 → Lu failure mode excluded. Band = voice band vs imitation |
| Freistetter de | 7.5 | 1982f8398fff | mimic-capstone ✓ | Reads the hand: mimics floor ≈3 vs real 6.7–10; FORENSIC +2.8 to +7.3. Wall is Chinese-specific |
| Louapre fr | 7.0 | 20fcaa5edf31 | mimic-capstone ✓ | Reads the hand: mimics floor ≈2.5 vs real 6.9–10; FORENSIC +3.3 to +8.1. Wall is Chinese-specific |
| 卢昌海 zh explainer | 5.5 recog-infl | 7ab47b0e9175 | B24c · forensic FAIL | Mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-tp) → not hand-identity. But register separates (opus +2.72) → fit-for-purpose gate only, opus-led; glm+qwen floor transposed real. |
The four explainer bands and Flint's short-form band were calibrated on reals whose recognizability was not audited. By §2's doctrine, a band is only a voice band once it holds on transposed reals — Flint's long-form is clean (transposed ref); the rest owe the audit.
Open questions
- B24(b) — RESOLVED (Session 30): no model reads Chinese voice without recognition. The full slate, screened clean (§3): none credit transposed/obscure real Ma above clean AI fakes; four read transposed real Flint above the fakes. → maboyong is blocked on model capability. Decision for Stephen: a fit-for-purpose bar (recognition-anchored, scoped to 1628) vs hold and re-probe at the next model refresh.
- B24(c): 卢昌海 RESOLVED (2026-06-16); other explainers + Flint short-form still owed. The Lu exposition judge fails the forensic bar: a strong few-shot mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed); band 5.5 is recognition-inflated. But it differs from Ma: exposition register survives transposition (transposed real Lu > flat fakes: opus +2.72, gemini +4.28), so a fit-for-purpose "competent on-voice explainer" gate is buildable, but only on an opus-led panel (the locked glm+qwen floor transposed real), and it passes mimics by construction. Decisive follow-up (not run): a different real Chinese science-popularizer, transposed, on glm+qwen, to tell a register-gate from a Lu-gate. Evidence: recognition_probe_b24c_2026_06_16.json. Freistetter / Louapre + Flint short-form still owe the transposed-reals audit (PG now done, see the next item).
- PG (en explainer): mimic-capstone PASSED (2026-06-18); de/fr + Flint short-form still owed. The PG judge rejects imitation: 3 strong fresh few-shot mimics floor (1.2–3.5 on gpt/opus/qwen/gemini) vs real PG 8.7–10; FORENSIC +5.6 to +8.8, the clean opposite of Lu. ⇒ the forensic wall is language (Chinese), not explainer-register; with current models English voice (Flint + PG) is readable, Chinese (Ma + Lu) is not (cleanest: within-model opus, +1.93 en / −0.32 zh). Caveats: PG's reals are published; transposition control weak; n=1-per-cell + capability-dated.
- de + fr explainers — mimic-capstone PASSED (2026-06-18); only Flint short-form still owes. Freistetter (de) and Louapre (fr) both reject strong fresh mimics (FORENSIC +2.8 to +8.1 on the capable judges) ⇒ the wall is Chinese-specific, not non-English and not explainer-register. Three Western languages × (fiction + exposition) all read voice; only Chinese fails. Open: only one non-Western language tested — Chinese vs CJK vs low-resource is unprobed.
- compfit_zh — the content-constant 2×2, re-run on a non-recognizable scene.
- B21 — the accent adversarial. A Ma-gestalt passage salted with native-fluency defects, Opus vs native deciders head-to-head — the only test that can retire "Opus blind to Chinese accent."
The meta-lesson
A coarser or looser instrument flatters a result that a sharper test then corrects.
- The rubric judge scored our output 6.8; the intuitive judge read ~3.
- "Prompting is exhausted" — until the restraint test showed from-plan Flint moves ~2 → ~4.8.
- The Chinese bake-off ranked writers ~6.7 — until name-transposition floored all of them, real Ma included.
- "Opus blind to Chinese is retired" — until the axis was named: the gestalt screen never probed the accent axis.
- "Lu's non-fiction reads voice (opus +3.44)" — until the few-shot-mimic capstone showed a strong mimic clears the judge as easily as real Lu: what survives transposition is the register, not the hand. This time the flattering read appeared at the reasoning layer and the sharper stimulus corrected it (B24c).
- "PG's recognition audit is DONE/PASSED" — until the red-team flagged PG's "obscure" reals are published (recognition-light) and the name-transposition control is argument-confounded. The honest claim is narrower and true: PG passes the mimic capstone (rejects imitation), not "fully recognition-independent." The reasoning-layer guard caught the over-claim before it shipped (PG probe).
This is why the findings here are dated, why every "at band" call owes the sharper test, and why the four-arm recognition test and transposition-before-banding are now hard rules. The same flattering-instrument failure the judge doctrine fixes at the measurement layer recurs at the reasoning layer — guard against it there too.