Voice-Judge Findings — VoiceWriter Calibration Record

Summary

The intuitive same-hand judge reads English authorial voice without recognition — strip a real Flint passage of its names and the judges still credit the hand. For Chinese it is recognition-bound.

A competent AI fake of a recognized scene scores as high as real Ma (~6.7), while real Ma with its names stripped floors (1.6). So 马伯庸's "band 6.0" is a recognition line, not a voice threshold — no unrecognized text clears it, real Ma included — and our Chinese generation's "below band" was an artifact: the judge never measured our voice.

The asymmetry tracks language, not author — confirmed directly (Session 30, B24b): on the same clean 7-model slate, four capable judges read transposed real Flint above the fakes (Δ +1.9 to +6.4) while none read transposed/obscure real Ma above the fakes. So it is a model-capability wall, not a calibration bug. Flint stays production; the Chinese voice instrument is blocked on model capability — not buildable with current models.

The instrument, and what its band certifies

The gate-4 judge is intuitive-holistic: ~8 unlabeled real-author exemplars, anonymized author, the question "are these the same hand?", gentle framing, no checklist, a single VOICEFIT: X/10 confidence line. The band answers "is this the author's own hand?" — a forensic bar, not a fitness-for-purpose one. A below-band score is not "worthless to the consumer," and the band is never lowered to make verdicts green.
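A minimal sketch of reading the single verdict line out of a judge reply. The exact reply shape is an assumption (a line such as "VOICEFIT: 7.5/10"), and `parse_voicefit` is a hypothetical helper, not the project's parser.

```python
import re
from typing import Optional

# Assumed shape of the verdict line: "VOICEFIT: 7.5/10" on a line of its own.
_VOICEFIT = re.compile(r"^\s*VOICEFIT:\s*(\d+(?:\.\d+)?)\s*/\s*10\s*$", re.MULTILINE)


def parse_voicefit(reply: str) -> Optional[float]:
    """Return the 0-10 confidence from a judge reply, or None if it is unusable."""
    matches = _VOICEFIT.findall(reply)
    if len(matches) != 1:
        # Missing verdict (e.g. empty content) or several conflicting verdicts.
        return None
    score = float(matches[0])
    # Reject out-of-range values instead of clamping them into the band.
    return score if 0.0 <= score <= 10.0 else None
```

Returning None rather than a default keeps an empty or truncated reply from being averaged into a panel as if it were a low score.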

Why intuitive, not rubric

A rubric judge — labeled exemplars plus a checklist of countable moves — skews both ways: it penalizes genuine author prose that doesn't tick every box, and rewards box-ticking fakes. Confirmed on three independent voice families:

Voice	Rubric	Intuitive · same prose
马伯庸 zh fiction	recall@0-FP 44% · fakes ≤5.2	recall@0-FP 78% · fakes ≤2.2
Flint en fiction	our output "6.8, near band"	~3 — with the generic fakes
explainers en/de/fr/zh	mimic ceiling 4.3–5.33	mimic ceiling 0.3–2.0
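The recall@0-FP column in the table above can be read as follows; this is a sketch under the assumption that the metric means "share of real passages scoring strictly above the highest-scoring fake". The function name and the example scores are illustrative, not taken from the evidence files.

```python
from typing import Sequence


def recall_at_zero_fp(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Fraction of reals that clear a threshold set so that no fake passes."""
    if not real_scores or not fake_scores:
        raise ValueError("need at least one real and one fake score")
    # The zero-false-positive threshold is the best score any fake achieved.
    threshold = max(fake_scores)
    passed = sum(1 for s in real_scores if s > threshold)
    return passed / len(real_scores)


# Illustrative scores only (invented for the example).
example = recall_at_zero_fp([8.0, 6.0, 2.0, 7.5], [2.2, 1.0])  # 3 of 4 reals clear 2.2
```

Under this reading, a lower fake ceiling and a higher recall move together: the intuitive judge's fakes top out lower, so more reals clear the line.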

The variable a 2×2 factorial isolated was framing: gentle "same hand?" keeps real and fake correctly separated; harsh "find where it betrays itself" craters real and fake alike — a floorless flaw-hunter.

The recognition confound

On material the judge can place in its training memory, the score measures recognition of the specific text, not voice. Watch what a name-swap does:

[Figure: 马伯庸 panel, score 0–10. Each arrow runs from a named passage (●) to the same passage transposed (○). Zone labels: credited, floored. Arrow label: names stripped.]

Read it as two zones. Everything the judge recognizes (殷商, 长安 named, 风起陇西 named, and AI reconstructions of 长安) sits right of the line at 6.2–8.6. Everything unrecognized — obscure real Ma, transposed real Ma, our generation, a different author — floors at 1–3, indistinguishable. The arrows are the same prose with its names swapped: strip recognition and the score slides through the line.

The load-bearing proof is the fake

Real-Ma-transposed scoring low (1.62), on its own, is consistent with the innocent reading "the name-swap damaged the prose." The cell that refutes it is a fabrication scoring as high as the real author — and it keeps its names, so it is immune to the prose-damage objection:

The comparison that ends the argument

AI fake · named: 6.7

Real Ma · transposed: 1.62

A fabrication outscores the author's own prose the moment that prose is made unrecognizable. A fake inheriting the full score proves the high score isn't a measurement of authorship at all. And flat prose + real names = 0.5 (lowest in the study) shows the names are a retrieval key, not the signal.

The four-arm recognition test

Generalized from this — never prove recognition-binding with arm 1 alone:

1. Real-author transposed: names swapped, prose otherwise byte-identical.
2. Fake of recognized material, named and transposed (the decisive arm).
3. Obscure held-out real: a work the model never memorized.
4. Names-only on flat, voiceless prose.

The decisive contrast is arm 2 named vs arm 1. A judge that lets a fake of recognized material outscore the real author's transposed prose is recognition-bound → rebuild it, or declare it blocked. Mechanism: the judge performs passage-identification — names as the retrieval key — not prose-craft reading.
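The decisive contrast can be sketched as a small check. `ArmScores`, `recognition_bound`, and the `margin` parameter are hypothetical names introduced for illustration; the example values are the 马伯庸 figures quoted above (fake named 6.7, real transposed 1.62, names on flat prose 0.5).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArmScores:
    real_transposed: float                   # arm 1
    fake_named: float                        # arm 2, names kept
    fake_transposed: Optional[float] = None  # arm 2, names swapped
    obscure_real: Optional[float] = None     # arm 3
    names_on_flat: Optional[float] = None    # arm 4


def recognition_bound(scores: ArmScores, margin: float = 0.0) -> bool:
    """True when a named fake outscores the author's own transposed prose.

    `margin` is an assumed noise allowance; the source does not specify one.
    """
    return scores.fake_named > scores.real_transposed + margin


ma = ArmScores(real_transposed=1.62, fake_named=6.7, names_on_flat=0.5)
print(recognition_bound(ma))  # True
```

The check deliberately ignores arm 1 in isolation: a low transposed-real score alone never trips it, which is the rule stated above.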

The language asymmetry

✓ Resolved — the clean OpenRouter re-run (2026-06-24) CONFIRMS the finding

The forensic-PASS half below (PG / de / fr "reads voice") is now settled. The probation worry was that the floor might be slop-rejection, since the original probe mimics were judged raw. The full 4-step method then ran end-to-end on OpenRouter — both generation and judging, no aiberm: clean-by-construction mimic (anti-signature overlay baked into the generator) → surgical residual de-slop (decorative spans only, spliced by index, diff-gated) → verify the mimic is a credible, clean, meaning-preserving adversary → score on the calibrated panel, reading the independent (non-generator) decider cell. All 9 clean+surgical mimics (GPT / Opus / Gemini × en / de / fr) verified fair adversaries (credible 8.0–9.0, clean 7.3–9.0, meaning ~10) and floor on the independent cell — en 2.33–3.2, de 3.0–3.8, fr 2.67–3.6 — vs band 8.0 / 7.5 / 7.0 and real 8–9 (forensic gap +4 to +6). A credible, clean imitation still floors ⇒ the floor is voice-rejection, not slop-rejection ⇒ the judges read the hand, and the wall is Chinese-specific. Evidence: clean_evidence_2026_06_24.json (supersedes the void 2026-06-20 aiberm run).

The same deciders, the same de-confound method, opposite outcomes by language. For English there is a voice-detector under the recognition halo; for Chinese there is nothing under it.

English · Eric Flint READS VOICE ✓

Names stripped, the judge still credits the hand — recognition adds only ~0.5–2.5 on top.

This generalizes the standing 2026-05-29 "Opus blind to Chinese accent" finding — and Session 30 confirmed it directly. The capability probe (B24b) ran the full 7-model slate on a clean channel (all judges + all generated stimuli on OpenRouter official upstreams, no aiberm), scoring transposed/obscure real author vs clean AI-reconstruction fakes. Four capable judges credit transposed real Flint above the fakes — gpt +3.53, gemini +6.41, qwen +2.61, opus +1.93 — while the same four cannot separate real Ma from fakes (opus −0.32, gpt +0.40 wash, gemini +1.71 noise). opus is the cleanest demonstration: +1.93 English vs −0.32 Chinese. So a transposition-robust maboyong judge is not buildable with current models — a Chinese-authorial-voice-reading wall, not a broken instrument (stimuli re-confirmed clean, judge and generation). The earlier aiberm "opus signal" (3.57) was killed twice — clean judge, then clean fakes.

The wall is language-and-register, not language alone. Re-run the same probe on Lu Changhai's Chinese exposition (B24c, 2026-06-16): a strong few-shot mimic — a frontier model given Lu's own 8 exemplars, writing his hand on a fresh topic — clears the judge as easily as recognition-stripped real Lu (gpt-mimic glm 9.4 / qwen 7.8 / gemini 7.6 / opus 5.1, vs real-Lu-transposed 1.7 / 0.9 / 3.3 / 4.9). So exposition also fails the forensic hand-identity bar — Lu's signature technique is imitable, not a fingerprint. (This corrected an initial over-claim: a "+3.44 opus reads voice" delta against weak fakes, killed by the mimic capstone — §10.) But unlike Ma, transposed real Lu scores above flat/generic fakes for the graded judge (opus +2.72 obscure-only, gemini +4.28; Ma's fiction fell below, opus −0.32) — the exposition register survives transposition where fiction-narrator voice did not. That supports a fit-for-purpose gate (rejects slop; passes competent on-voice prose, real or mimic), not a forensic one — but only on an opus-led panel: the locked production deciders glm + qwen floor transposed real Lu (1.5 / 1.0, ~recognition-bound like Ma), and the gate passes strong mimics by construction.

Resolved — the wall is language, not register (Paul Graham, en explainer, 2026-06-18). The same probe on PG comes out the opposite of Lu: three genuinely strong fresh few-shot PG mimics floor (gpt 2.45 / opus 3.06 / qwen 1.45 / gemini 1.22) while real PG scores 8.7–10 and generic prose ~0 — a forensic gap of +5.6 to +8.8 on PG's deciders, where the matched Lu mimic cleared its judge (glm 9.44). So the PG judge rejects imitation — it reads the hand. PG and Lu are both explainer register, so the same register gives opposite outcomes ⇒ with current models the forensic wall falls on the Chinese side, not the explainer-register side. The clean 2×2 — English fiction (Flint ✓) + English explainer (PG ✓) read voice; Chinese fiction (Ma ✗) + Chinese explainer (Lu ✗) are recognition/register-bound — with the one confound-free demonstration being within-model opus: it reads PG's hand and transposed Flint (+1.93 en) but cannot separate real Ma from fakes (−0.32 zh). Caveats: PG's "obscure" reals are published (this passes the mimic capstone, the test that killed Lu, but doesn't fully close recognition-independence); the name-transposition control is weak; the axis is capability-dated and n=1-per-cell (de/fr explainers still owe the probe). Evidence: recognition_probe_pg_2026_06_18.json.

Closed — the wall is Chinese-specific, not non-English (Freistetter de + Louapre fr, 2026-06-18). Both European explainers pass the same mimic capstone. Real Freistetter and Louapre score 6.7–10 on the capable judges while strong fresh few-shot mimics floor (≈2.3–3.8); FORENSIC (real − mimic) is positive on every judge — gpt +3.7 / +5.6, opus +2.8 / +3.3, qwen +7.3 / +7.6, gemini +7.2 / +8.1 (de / fr). So three Western languages — English (fiction + explainer), German, French — all read the hand and reject imitation, across fiction and exposition. Only Chinese (Ma + Lu) is recognition/register-bound. The wall is Chinese-specific among the languages tested, not "non-English" and not "explainer register." Caveats: capability-dated (within-model opus reads en/de/fr, fails zh); only one non-Western language was tested, so "Chinese" may stand for a broader CJK / low-resource class; deepseek over-credits the mimics in every language (gameable judge). Evidence: de_results.json, fr_results.json.
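The FORENSIC (real − mimic) figure used throughout this section can be computed per judge as below. This is a sketch, assuming the gap is a difference of mean scores; the function names are hypothetical and any input scores used with it here are illustrative, not the recorded ones.

```python
from statistics import mean
from typing import Dict, List


def forensic_gap(real: Dict[str, List[float]],
                 mimic: Dict[str, List[float]]) -> Dict[str, float]:
    """Per-judge gap: mean score on real passages minus mean score on mimics."""
    gaps: Dict[str, float] = {}
    for judge in sorted(real.keys() & mimic.keys()):
        if real[judge] and mimic[judge]:  # skip judges with no usable scores
            gaps[judge] = mean(real[judge]) - mean(mimic[judge])
    return gaps


def rejects_imitation(gaps: Dict[str, float]) -> bool:
    """The capstone passes only if the gap is positive on every judge."""
    return bool(gaps) and all(g > 0 for g in gaps.values())
```

A sign test like `rejects_imitation` is the loosest possible reading; the source treats small positive deltas (for example +0.40) as a wash, so a real gate would need a noise margin that this sketch does not guess at.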

Length sensitivity

The judge is calibrated at ~415 words, and its scores compress toward 5 ("can't tell") as the passage gets longer: a long multi-POV surface gives the gestalt read more chances to find one "slightly off" draw.

Real Flint · whole-passage: 7.7 → 4.9 (415w → 3375w). Absolute scores aren't comparable across lengths.

Windowed · ~420w chunks: 6.03 / 7.39 vs negatives ~3.5. Discrimination is restored on the unchanged panel.

Compression is judge-specific and flips by language: on English GPT-5.5 collapses while Opus holds; on Chinese it reverses. The fix — sliding-window scoring (window_score.py) — chops the passage into calibration-length windows, scores each on the unchanged panel, and aggregates.
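A minimal sketch of the windowing idea, not the contents of window_score.py. The non-overlapping split, the short-tail merge, and mean aggregation are all assumptions; `score_fn` stands in for a call to the unchanged panel. Whitespace splitting only suits space-delimited languages, so Chinese text would need a different length unit.

```python
from statistics import mean
from typing import Callable, List, Sequence


def split_windows(text: str, window_words: int = 415, min_tail: int = 200) -> List[str]:
    """Chop a passage into calibration-length windows of whole words."""
    words = text.split()
    chunks = [words[i:i + window_words] for i in range(0, len(words), window_words)]
    # Fold a short trailing remainder into the previous window so that no
    # window is far below the length the judge was calibrated on.
    if len(chunks) > 1 and len(chunks[-1]) < min_tail:
        tail = chunks.pop()
        chunks[-1].extend(tail)
    return [" ".join(chunk) for chunk in chunks]


def window_score(text: str,
                 score_fn: Callable[[str], float],
                 aggregate: Callable[[Sequence[float]], float] = mean,
                 window_words: int = 415) -> float:
    """Score each window independently, then aggregate (mean is an assumption)."""
    windows = split_windows(text, window_words)
    if not windows:
        raise ValueError("empty passage")
    return aggregate([score_fn(w) for w in windows])
```

Passing `aggregate` as a parameter leaves the choice open: a minimum would punish a single off-voice window, while a mean or median tolerates it.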

Axis conflation: composition vs prose

The single VOICEFIT number blends two separable axes — composition (how a scene is built; in the plan) and prose (the wording) — and under-credits composition. Content-constant 2×2 on Flint:

Composition	Prose	voicefit	compfit
Flint	Flint (real)	7.7	9.0
Flint	flat	2.65	6.4
conventional	Flint	3.1	2.0–2.6
neutral	flat	0.7	2.0–2.2

The prose ruler needs both axes maxed; one at a time reads as a fake. The fix is a second instrument — compfit, the same exemplars asking "same construction?" — locked for Flint. compfit_zh is built but provisional (its first-pass used a recognizable scene, so it inherits §2's confound until re-run).
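As a worked check on the voicefit column of the 2×2 above, the interaction term can be computed directly. This is a sketch that treats the "conventional" and "neutral" rows as a single "not Flint" composition level, which is an assumption; the compfit column is left out because two of its cells are ranges.

```python
# voicefit cells from the table, keyed by (composition, prose).
voicefit = {
    ("flint", "flint"): 7.7,
    ("flint", "flat"): 2.65,
    ("other", "flint"): 3.1,  # "conventional" composition row
    ("other", "flat"): 0.7,   # "neutral" composition row
}


def prose_effect(composition: str) -> float:
    """Gain from Flint prose over flat prose at a fixed composition level."""
    return voicefit[(composition, "flint")] - voicefit[(composition, "flat")]


def interaction() -> float:
    """How much more Flint prose is worth when the composition is also Flint."""
    return prose_effect("flint") - prose_effect("other")


print(round(prose_effect("flint"), 2))  # 5.05
print(round(prose_effect("other"), 2))  # 2.4
print(round(interaction(), 2))          # 2.65
```

A positive interaction means the two axes are not additive on this ruler: Flint prose earns roughly twice as much credit when the composition is also Flint, consistent with one axis at a time reading as a fake.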

Model screening map

DATED 2026-06-12 · re-screen at every model refresh · the protocol is judge_methodology.md §3

Model	Discriminates	Role
GPT-5.5	every voice — en/de/fr + Chinese gestalt	decide
Opus-4.8	en/de/fr fully; Chinese coarse-gestalt only, blind on the accent axis	decide · not zh
Fable-5	everything (reals saturate 9.0); coarse for iteration	certifier
glm-5.1	Chinese — the validated zh discriminator	decide · zh
qwen3.7-max	卢昌海 (zh explainer)	decide · zh
Gemini-3.x	clear cases only (saturating binary)	loose
kimi · deepseek · mistral	can't credit reals / gameable / fooled by mimics	rejected

Recipe-level caution: a model can pass coarse gestalt yet be blind on the fine-grained accent axis. A coarse screen "passing" does not retire an accent finding — to retire it you must beat the accent adversarial (B21). Until then Opus stays off the zh voice panels.

Transport

DATED · equivalence is per-judge, not per-relay

Two traps, both first mis-recorded as model defects: (a) slug form differs by relay (aiberm wants bare slugs, OpenRouter prefixed — engine/models.py _route normalizes); (b) max_tokens as a footgun — a low cap makes reasoning-heavy models return empty content. With both fixed, only three models can't run on aiberm — glm-5.1 (verdict truncated), qwen3.7-max and mistral (not carried); every other judge matches OpenRouter within ±1 including on Chinese (GPT-5.5 on aiberm = a maboyong real 8.4). So panels route mixed — the incompatible members go to OpenRouter (run_mixed.sh), the rest to aiberm; the maboyong zh panel runs mixed (glm→OR, GPT→aiberm), and only uniflection_zh is OpenRouter-only because both its deciders (glm+qwen) are incompatible. So “zh on OpenRouter” is about those two models, not about aiberm and Chinese. Validate each judge on a real+fake pair on any new transport — a model-list match is not equivalence.
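The per-judge validation on a real+fake pair can be sketched as a tolerance check. `transports_equivalent` is a hypothetical helper; the ±1 tolerance is the figure quoted above, and None models a reply with no usable verdict (for instance empty content from a low max_tokens cap).

```python
from typing import Dict, Optional

Scores = Dict[str, Optional[float]]  # keys: "real" and "fake"


def transports_equivalent(baseline: Scores, candidate: Scores,
                          tolerance: float = 1.0) -> bool:
    """True if one judge scores the same real+fake pair alike on both transports."""
    for key in ("real", "fake"):
        a, b = baseline.get(key), candidate.get(key)
        if a is None or b is None:
            # No verdict on one side: treat as not equivalent, not as a score of 0.
            return False
        if abs(a - b) > tolerance:
            return False
    return True
```

The check runs once per judge per transport; a relay that merely lists the same model slug has not passed it.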

Per-voice scoreboard

Voice	Band s/l	Fingerprint	Recognition	Verdict
Flint en	7.5 / 5.5	d607d5a83bf3	clean	Reads voice. Production.
马伯庸 zh	6.0 / void	058ef8e6b986	FAILED	Recognition-bound. Void as voice.
PG en explainer	8.0	7c8927ccae25	capstone ✓ · clean re-run ✓	Reads the hand. Confirmed (06-24). Strong fresh mimics floor 1.2–3.5 vs real PG 8.7–10; the clean+surgical credible mimics floor on the independent cell 2.33–3.2 vs band 8.0 / real 8.8–9.3
Freistetter de	7.5	1982f8398fff	capstone ✓ · clean re-run ✓	Reads the hand. Confirmed (06-24). Strong mimics floor vs real 6.7–10; clean+surgical credible mimics floor on the independent cell 3.0–3.8 vs band 7.5 / real 7.6–7.9
Louapre fr	7.0	20fcaa5edf31	capstone ✓ · clean re-run ✓	Reads the hand. Confirmed (06-24). Strong mimics floor vs real 8.0–8.8; clean+surgical credible mimics floor on the independent cell 2.67–3.6 vs band 7.0 / real 7.0–8.13
卢昌海 zh explainer	5.5 recog-infl	7ab47b0e9175	B24c · forensic FAIL	Mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-tp) → not hand-identity. But register separates (opus +2.72) → fit-for-purpose gate only, opus-led; glm+qwen floor transposed real.

The four explainer bands and Flint's short-form band were calibrated on reals whose recognizability was not audited. By §2's doctrine, a band is only a voice band once it holds on transposed reals — Flint's long-form is clean (transposed ref); the rest owe the audit.

Open questions

B24(b) — RESOLVED (Session 30): no model reads Chinese voice without recognition. The full slate, screened clean (§3): none credit transposed/obscure real Ma above clean AI fakes; four read transposed real Flint above the fakes. → maboyong is blocked on model capability. Decision for Stephen: a fit-for-purpose bar (recognition-anchored, scoped to 1628) vs hold and re-probe at the next model refresh.
B24(c) — 卢昌海 RESOLVED (2026-06-16); other explainers + Flint short-form still owed. The Lu exposition judge fails the forensic bar — a strong few-shot mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed); band 5.5 is recognition-inflated. But it differs from Ma: exposition register survives transposition (transposed real Lu > flat fakes: opus +2.72, gemini +4.28), so a fit-for-purpose "competent on-voice explainer" gate is buildable — but only on an opus-led panel (the locked glm+qwen floor transposed real), and it passes mimics by construction. Decisive follow-up (not run): a different real Chinese science-popularizer, transposed, on glm+qwen — register-gate vs Lu-gate. Evidence: recognition_probe_b24c_2026_06_16.json. Freistetter / Louapre + Flint short-form still owe the transposed-reals audit (PG now done — below).
PG (en explainer) — CONFIRMED (clean OpenRouter re-run, 2026-06-24); Flint short-form still owes. The clean+surgical credible mimics floor on the independent cell 2.33–3.2 vs band 8.0 / real 8.8–9.3, replacing the void 2026-06-20 aiberm run. The mimic-capstone already stood: the PG judge rejects imitation — 3 strong fresh mimics floor (1.2–3.5) vs real PG 8.7–10; FORENSIC +5.6 to +8.8. ⇒ the forensic wall is language (Chinese), not explainer-register; English voice (Flint + PG) readable, Chinese (Ma + Lu) not (cleanest: within-model opus, +1.93 en / −0.32 zh). Residual caveat: PG's reals are published — but a credible clean mimic still floors, the load-bearing claim.
de + fr explainers — CONFIRMED (clean OpenRouter re-run, 2026-06-24); only Flint short-form still owes. Clean+surgical credible mimics floor on the independent cell — de 3.0–3.8, fr 2.67–3.6 vs band 7.5 / 7.0 and real 7.6–7.9 / 7.0–8.13. The mimic-capstone already stood (FORENSIC +2.8 to +8.1) ⇒ the wall is Chinese-specific, not non-English and not explainer-register. Three Western languages × (fiction + exposition) all read voice; only Chinese fails. Open: only one non-Western language tested — Chinese vs CJK vs low-resource is unprobed.
compfit_zh — the content-constant 2×2, re-run on a non-recognizable scene.
B21 — the accent adversarial. A Ma-gestalt passage salted with native-fluency defects, Opus vs native deciders head-to-head — the only test that can retire "Opus blind to Chinese accent."

Honest position: we cannot currently measure Chinese authorial voice without recognition. We don't know how good our Chinese generation is — only that it is unrecognized, the same as real Ma on this judge.

The meta-lesson

A coarser or looser instrument flatters a result that a sharper test then corrects.

The rubric judge scored our output 6.8; the intuitive judge read ~3.
"Prompting is exhausted" — until the restraint test showed from-plan Flint moves ~2 → ~4.8.
The Chinese bake-off ranked writers ~6.7 — until name-transposition floored all of them, real Ma included.
"Opus blind to Chinese is retired" — until the axis was named: the gestalt screen never probed the accent axis.
"Lu's non-fiction reads voice (opus +3.44)" — until the few-shot-mimic capstone showed a strong mimic clears the judge as easily as real Lu: what survives transposition is the register, not the hand. This time the flattering read appeared at the reasoning layer and the sharper stimulus corrected it (B24c).
"PG's recognition audit is DONE/PASSED" — until the red-team flagged PG's "obscure" reals are published (recognition-light) and the name-transposition control is argument-confounded. The honest claim is narrower and true: PG passes the mimic capstone (rejects imitation), not "fully recognition-independent." The reasoning-layer guard caught the over-claim before it shipped (PG probe).

This is why the findings here are dated, why every "at band" call owes the sharper test, and why the four-arm recognition test and transposition-before-banding are now hard rules. The same flattering-instrument failure the judge doctrine fixes at the measurement layer recurs at the reasoning layer — guard against it there too.

—

Evidence index

Recognition-binding · discovery (长安 6.62→1.62)	judge_recognition_confound_2026_06_15.json
Recognition-binding · confirmed + Flint contrast	judge_recognition_second_test_2026_06_16.json
卢昌海 exposition recognition probe · B24c (forensic FAIL via mimic capstone; register-separation)	recognition_probe_b24c_2026_06_16.json
PG (en explainer) mimic-capstone / language-vs-register probe (forensic PASS; the 2×2)	recognition_probe_pg_2026_06_18.json
de + fr explainer mimic-capstone probes (Chinese-specific wall: en/de/fr pass, zh fails)	de_results.json · fr_results.json
Clean OpenRouter re-run: en/de/fr read voice CONFIRMED (probation resolved)	clean_evidence_2026_06_24.json
maboyong panel lock + generation re-measure	judge_lock_2026_06_11.json
Flint short-form band 7.5/7.0	intuitive_band_2026_06_11.json
Flint long-form band 5.5 + length compression	longform_band_2026_06_14.json
B17 · Flint named-vs-transposed (~2–3pt)	rewrite_threshold_b17_2026_06_11.json
Two-aspect 2×2 + compfit lock	two_aspect_2026_06_13.json
Restraint recipe · from-plan ceiling ~4.8	restraint_findings_2026_06_14.json
Chinese long-form bake-off (VOID) + compfit_zh	longform_and_compfit_2026_06_14.json
Claude-judge screen (Opus coarse-gestalt; Fable-5)	claude_judges_2026_06_12.json
Transport equivalence per-judge	transport_aiberm_validation_2026_06_12.json
Explainer validations (en/zh/de/fr)	intuitive_validation_2026_06_11.json
