Voice-Judge Findings — VoiceWriter Calibration Record

00

Summary

The intuitive same-hand judge reads English authorial voice without recognition — strip a real Flint passage of its names and the judges still credit the hand. For Chinese it is recognition-bound.

A competent AI fake of a recognized scene scores as high as real Ma (~6.7), while real Ma with its names stripped floors (1.6). So 马伯庸's "band 6.0" is a recognition line, not a voice threshold — no unrecognized text clears it, real Ma included — and our Chinese generation's "below band" was an artifact: the judge never measured our voice.

The asymmetry tracks language, not author — confirmed directly (Session 30, B24b): on the same clean 7-model slate, four capable judges read transposed real Flint above the fakes (Δ +1.9 to +6.4) while none read transposed/obscure real Ma above the fakes. So it is a model-capability wall, not a calibration bug. Flint stays production; the Chinese voice instrument is blocked on model capability — not buildable with current models.

01

The instrument, and what its band certifies

The gate-4 judge is intuitive-holistic: ~8 unlabeled real-author exemplars, anonymized author, the question "are these the same hand?", gentle framing, no checklist, a single VOICEFIT: X/10 confidence line. The band answers "is this the author's own hand?" — a forensic bar, not a fitness-for-purpose one. A below-band score is not "worthless to the consumer," and the band is never lowered to make verdicts green.
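A minimal sketch of how the single confidence line might be read back out of a judge reply. The exact reply layout is an assumption: only the `VOICEFIT: X/10` line comes from the record, and the helper names `parse_voicefit` and `at_band` are hypothetical.

```python
import re
from typing import Optional

# Assumed shape of the confidence line: "VOICEFIT: X/10", X an integer or decimal.
VOICEFIT_RE = re.compile(
    r"^\s*VOICEFIT:\s*(\d+(?:\.\d+)?)\s*/\s*10\s*$", re.MULTILINE
)

def parse_voicefit(reply: str) -> Optional[float]:
    """Return the score from the last VOICEFIT line, or None if absent or out of range."""
    matches = VOICEFIT_RE.findall(reply)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0.0 <= score <= 10.0 else None

def at_band(score: Optional[float], band: float) -> bool:
    """A missing score never passes; the band is a fixed input, never tuned here."""
    return score is not None and score >= band

# Hypothetical reply text, for illustration only.
reply = "These read as one hand.\nVOICEFIT: 7.5/10"
print(parse_voicefit(reply), at_band(parse_voicefit(reply), band=7.5))
```

Treating an unparseable reply as "no score" rather than zero keeps a transport failure (for example an empty completion) from being recorded as a below-band verdict.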

Why intuitive, not rubric

A rubric judge — labeled exemplars plus a checklist of countable moves — skews both ways: it penalizes genuine author prose that doesn't tick every box, and rewards box-ticking fakes. Confirmed on three independent voice families:

Voice	Rubric	Intuitive · same prose
马伯庸 zh fiction	recall@0-FP 44% · fakes ≤5.2	recall@0-FP 78% · fakes ≤2.2
Flint en fiction	our output "6.8, near band"	~3 — with the generic fakes
explainers en/de/fr/zh	mimic ceiling 4.3–5.33	mimic ceiling 0.3–2.0
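The record does not define recall@0-FP. The sketch below assumes the usual reading: set the threshold at the highest-scoring fake, so that no fake passes, and report the fraction of reals that still clear it. The scores are hypothetical and are not the study's data.

```python
from typing import Sequence

def recall_at_zero_fp(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Fraction of reals scoring strictly above the highest-scoring fake."""
    if not real_scores:
        raise ValueError("need at least one real score")
    # With no fakes at all, every real trivially clears the threshold.
    threshold = max(fake_scores) if fake_scores else float("-inf")
    return sum(score > threshold for score in real_scores) / len(real_scores)

# Hypothetical scores, for illustration only.
reals = [8.0, 7.0, 6.5, 2.0]
fakes = [2.2, 1.0]
print(recall_at_zero_fp(reals, fakes))
```

Read this way, the two numbers in each table cell move together: a lower fake ceiling lowers the threshold, which lets more reals clear it.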

The variable a 2×2 factorial isolated was framing: gentle "same hand?" keeps real and fake correctly separated; harsh "find where it betrays itself" craters real and fake alike — a floorless flaw-hunter.

02

The recognition confound

On material the judge can place in its training memory, the score measures recognition of the specific text, not voice. Watch what a name-swap does:

[Figure: 马伯庸 panel, score 0–10. Each text is plotted named (●) and transposed (○), with an arrow between the two. Legend: credited · floored · names stripped.]

Read it as two zones. Everything the judge recognizes (殷商, 长安 named, 风起陇西 named, and AI reconstructions of 长安) sits right of the line at 6.2–8.6. Everything unrecognized — obscure real Ma, transposed real Ma, our generation, a different author — floors at 1–3, indistinguishable. The arrows are the same prose with its names swapped: strip recognition and the score slides through the line.

The load-bearing proof is the fake

Real-Ma-transposed scoring low (1.62), on its own, is consistent with the innocent reading "the name-swap damaged the prose." The cell that refutes it is a fabrication scoring as high as the real author — and it keeps its names, so it is immune to the prose-damage objection:

The comparison that ends the argument

AI fake · named: 6.7  >  real Ma · transposed: 1.62

A fabrication outscores the author's own prose the moment that prose is made unrecognizable. A fake inheriting the full score proves the high score isn't a measurement of authorship at all. And flat prose + real names = 0.5 (lowest in the study) shows the names are a retrieval key, not the signal.

The four-arm recognition test

Generalized from this — never prove recognition-binding with arm 1 alone:

1. Real-author transposed: names swapped, prose otherwise byte-identical.
2. Fake of recognized material, named and transposed: the decisive arm.
3. Obscure held-out real: a work the model never memorized.
4. Names-only on flat, voiceless prose.

The decisive contrast is arm 2 named vs arm 1. A judge that lets a fake of recognized material outscore the real author's transposed prose is recognition-bound → rebuild it, or declare it blocked. Mechanism: the judge performs passage-identification — names as the retrieval key — not prose-craft reading.
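The decisive contrast can be written down as a small check. This is a sketch under stated assumptions: the `ArmScores` container, the `recognition_bound` helper and its `margin` parameter are hypothetical names, and arms with no score quoted in this record are left as `None` rather than filled in.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ArmScores:
    real_transposed: float                   # arm 1
    fake_named: float                        # arm 2, names kept
    fake_transposed: Optional[float] = None  # arm 2, names swapped
    obscure_real: Optional[float] = None     # arm 3
    names_on_flat: Optional[float] = None    # arm 4

def recognition_bound(scores: ArmScores, margin: float = 0.0) -> bool:
    """True when a named fake of recognized material outscores
    the real author's transposed prose by more than `margin`."""
    return scores.fake_named > scores.real_transposed + margin

# Values quoted in this record for the 马伯庸 panel.
ma = ArmScores(real_transposed=1.62, fake_named=6.7, names_on_flat=0.5)
print(recognition_bound(ma))
```

The check deliberately ignores arm 1 on its own: a low transposed-real score is only evidence once a named fake sits above it.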

03

The language asymmetry

The same deciders, the same de-confound method, opposite outcomes by language. For English there is a voice-detector under the recognition halo; for Chinese there is nothing under it.

English · Eric Flint READS VOICE ✓

Names stripped, the judge still credits the hand — recognition adds only ~0.5–2.5 on top.

This generalizes the standing 2026-05-29 "Opus blind to Chinese accent" finding — and Session 30 confirmed it directly. The capability probe (B24b) ran the full 7-model slate on a clean channel (all judges + all generated stimuli on OpenRouter official upstreams, no aiberm), scoring transposed/obscure real author vs clean AI-reconstruction fakes. Four capable judges credit transposed real Flint above the fakes — gpt +3.53, gemini +6.41, qwen +2.61, opus +1.93 — while the same four cannot separate real Ma from fakes (opus −0.32, gpt +0.40 wash, gemini +1.71 noise). opus is the cleanest demonstration: +1.93 English vs −0.32 Chinese. So a transposition-robust maboyong judge is not buildable with current models — a Chinese-authorial-voice-reading wall, not a broken instrument (stimuli re-confirmed clean, judge and generation). The earlier aiberm "opus signal" (3.57) was killed twice — clean judge, then clean fakes.
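As a worked piece of arithmetic, the per-judge gap between the English and Chinese deltas can be tabulated from the figures quoted above. The helper name `language_gap` is hypothetical, and no Chinese delta is quoted for qwen in this paragraph, so it is reported as missing rather than guessed.

```python
from typing import Dict, Optional

# Deltas (transposed/obscure real minus clean fakes) as quoted for B24b.
delta_en: Dict[str, float] = {"gpt": 3.53, "gemini": 6.41, "qwen": 2.61, "opus": 1.93}
# Only three Chinese deltas are quoted in the text; qwen is not among them.
delta_zh: Dict[str, float] = {"opus": -0.32, "gpt": 0.40, "gemini": 1.71}

def language_gap(judge: str) -> Optional[float]:
    """English delta minus Chinese delta, or None when either is unreported."""
    if judge not in delta_en or judge not in delta_zh:
        return None
    return round(delta_en[judge] - delta_zh[judge], 2)

for judge in delta_en:
    print(judge, language_gap(judge))
```

For opus this gives 1.93 − (−0.32) = 2.25: the same judge, on the same clean channel, loses more than two points of separation when the language changes.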

The wall is language-and-register, not language alone. Re-run the same probe on Lu Changhai's Chinese exposition (B24c, 2026-06-16): a strong few-shot mimic — a frontier model given Lu's own 8 exemplars, writing his hand on a fresh topic — clears the judge as easily as recognition-stripped real Lu (gpt-mimic glm 9.4 / qwen 7.8 / gemini 7.6 / opus 5.1, vs real-Lu-transposed 1.7 / 0.9 / 3.3 / 4.9). So exposition also fails the forensic hand-identity bar — Lu's signature technique is imitable, not a fingerprint. (This corrected an initial over-claim: a "+3.44 opus reads voice" delta against weak fakes, killed by the mimic capstone — §10.) But unlike Ma, transposed real Lu scores above flat/generic fakes for the graded judge (opus +2.72 obscure-only, gemini +4.28; Ma's fiction fell below, opus −0.32) — the exposition register survives transposition where fiction-narrator voice did not. That supports a fit-for-purpose gate (rejects slop; passes competent on-voice prose, real or mimic), not a forensic one — but only on an opus-led panel: the locked production deciders glm + qwen floor transposed real Lu (1.5 / 1.0, ~recognition-bound like Ma), and the gate passes strong mimics by construction.

04

Length sensitivity

The judge is calibrated at ~415 words and compresses toward 5 ("can't tell") the longer the passage — a long multi-POV surface gives the gestalt read more chances to find one "slightly off" draw.

Real Flint · whole-passage: 7.7 → 4.9 as length grows from 415w → 3375w. Absolute scores aren't comparable across lengths.

Windowed · ~420w chunks: 6.03 / 7.39 vs negatives ~3.5. Discrimination is restored on the unchanged panel.

Compression is judge-specific and flips by language: on English GPT-5.5 collapses while Opus holds; on Chinese it reverses. The fix — sliding-window scoring (window_score.py) — chops the passage into calibration-length windows, scores each on the unchanged panel, and aggregates.
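A rough sketch of the windowing idea, not the contents of window_score.py. The function names, the non-overlapping split, and the mean/min aggregation are all assumptions, since the record does not say how windows are aggregated. Splitting on whitespace only suits space-delimited languages; Chinese text would need a character-based window instead.

```python
from statistics import mean
from typing import Callable, Dict, List

def split_windows(text: str, window_words: int = 420) -> List[str]:
    """Chop text into non-overlapping windows of `window_words` words.
    A short tail is replaced by a full-length window anchored at the end,
    so every window matches the calibration length."""
    words = text.split()
    if not words:
        return []
    if len(words) <= window_words:
        return [" ".join(words)]
    windows = []
    for start in range(0, len(words), window_words):
        chunk = words[start:start + window_words]
        if len(chunk) < window_words:
            chunk = words[-window_words:]  # overlaps the previous window
        windows.append(" ".join(chunk))
    return windows

def score_windowed(text: str, judge: Callable[[str], float],
                   window_words: int = 420) -> Dict[str, float]:
    """Score each window with the unchanged judge and aggregate.
    Mean and min are assumed aggregates, not the documented ones."""
    scores = [judge(window) for window in split_windows(text, window_words)]
    if not scores:
        raise ValueError("empty passage")
    return {"mean": mean(scores), "min": min(scores), "windows": float(len(scores))}

# Stub judge for illustration only; a real judge would call the panel.
passage = " ".join(["word"] * 1000)
print(score_windowed(passage, judge=lambda window: 6.0))
```

Keeping every window at calibration length is the point of the design: the judge only ever sees passages of the size its band was set on.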

05

Axis conflation: composition vs prose

The single VOICEFIT number blends two separable axes — composition (how a scene is built; in the plan) and prose (the wording) — and under-credits composition. Content-constant 2×2 on Flint:

Composition	Prose	voicefit	compfit
Flint	Flint (real)	7.7	9.0
Flint	flat	2.65	6.4
conventional	Flint	3.1	2.0–2.6
neutral	flat	0.7	2.0–2.2

The prose ruler needs both axes maxed; one at a time reads as a fake. The fix is a second instrument — compfit, the same exemplars asking "same construction?" — locked for Flint. compfit_zh is built but provisional (its first-pass used a recognizable scene, so it inherits §2's confound until re-run).
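The "both axes maxed" reading can be checked with simple arithmetic on the voicefit column. One assumption is made loudly here: the "conventional" and "neutral" composition rows are both treated as "not Flint composition" so that the table reads as a clean 2×2.

```python
# voicefit cells keyed by (composition, prose), taken from the table above.
# Assumption: "conventional" and "neutral" composition are pooled as "other".
voicefit = {
    ("flint", "flint"): 7.7,
    ("flint", "flat"): 2.65,
    ("other", "flint"): 3.1,
    ("other", "flat"): 0.7,
}

baseline = voicefit[("other", "flat")]
composition_gain = voicefit[("flint", "flat")] - baseline   # composition alone
prose_gain = voicefit[("other", "flint")] - baseline        # prose alone
additive_prediction = baseline + composition_gain + prose_gain
interaction = voicefit[("flint", "flint")] - additive_prediction

print(f"composition alone: +{composition_gain:.2f}")
print(f"prose alone:       +{prose_gain:.2f}")
print(f"additive estimate:  {additive_prediction:.2f}")
print(f"interaction:       +{interaction:.2f}")
```

Under that pooling, each axis alone adds about 2 points over the baseline, while the real passage sits about 2.65 above what the two gains would predict if they simply added. That surplus is what a single number cannot attribute to either axis.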

06

Model screening map

DATED 2026-06-12 · re-screen at every model refresh · the protocol is judge_methodology.md §3

Model	Discriminates	Role
GPT-5.5	every voice — en/de/fr + Chinese gestalt	decide
Opus-4.8	en/de/fr fully; Chinese coarse-gestalt only, blind on the accent axis	decide · not zh
Fable-5	everything (reals saturate 9.0); coarse for iteration	certifier
glm-5.1	Chinese — the validated zh discriminator	decide · zh
qwen3.7-max	卢昌海 (zh explainer)	decide · zh
Gemini-3.x	clear cases only (saturating binary)	loose
kimi · deepseek · mistral	can't credit reals / gameable / fooled by mimics	rejected

Recipe-level caution: a model can pass coarse gestalt yet be blind on the fine-grained accent axis. A coarse screen "passing" does not retire an accent finding — to retire it you must beat the accent adversarial (B21). Until then Opus stays off the zh voice panels.

07

Transport

DATED · equivalence is per-judge, not per-relay

Two traps, both first mis-recorded as model defects: (a) slug form differs by relay (aiberm wants bare slugs, OpenRouter wants prefixed ones; engine/models.py _route normalizes); (b) max_tokens is a footgun: a low cap makes reasoning-heavy models return empty content. With both fixed, only three models can't run on aiberm: glm-5.1 (verdict truncated), and qwen3.7-max and mistral (not carried). Every other judge matches OpenRouter within ±1, including on Chinese (GPT-5.5 on aiberm scores a maboyong real at 8.4). So panels route mixed: the incompatible members go to OpenRouter (run_mixed.sh), the rest to aiberm. The maboyong zh panel runs mixed (glm→OR, GPT→aiberm), and only uniflection_zh is OpenRouter-only, because both its deciders (glm+qwen) are incompatible. "zh on OpenRouter" is therefore about those two models, not about aiberm and Chinese. Validate each judge on a real+fake pair on any new transport; a model-list match is not equivalence.
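A sketch of the two routing decisions described above. It is not the code in engine/models.py or run_mixed.sh: the function names are hypothetical, and the `vendor` prefix is a placeholder, because the record does not give the actual prefixed slugs.

```python
from typing import Dict, Iterable

# Models the record lists as unable to run on aiberm.
OPENROUTER_ONLY = {"glm-5.1", "qwen3.7-max", "mistral"}

def pick_relay(model: str) -> str:
    """Send incompatible models to OpenRouter, everything else to aiberm."""
    return "openrouter" if model in OPENROUTER_ONLY else "aiberm"

def normalize_slug(slug: str, relay: str, vendor: str) -> str:
    """aiberm takes bare slugs; OpenRouter takes 'vendor/slug'.
    `vendor` is a placeholder supplied by the caller."""
    bare = slug.split("/", 1)[-1]  # drop any existing prefix
    return bare if relay == "aiberm" else f"{vendor}/{bare}"

def route_panel(models: Iterable[str]) -> Dict[str, str]:
    """Map each panel member to its relay."""
    return {model: pick_relay(model) for model in models}

# The maboyong zh panel as described: glm goes to OpenRouter, GPT to aiberm.
print(route_panel(["glm-5.1", "gpt-5.5"]))
print(normalize_slug("vendor/gpt-5.5", "aiberm", vendor="vendor"))
```

Routing per model rather than per panel is what makes a mixed panel possible; it does not replace the real+fake validation pair on each new transport.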

08

Per-voice scoreboard

Voice	Band (short / long)	Fingerprint	Recognition	Verdict
Flint en	7.5 / 5.5	d607d5a83bf3	clean	Reads voice. Production.
马伯庸 zh	6.0 / void	058ef8e6b986	FAILED	Recognition-bound. Void as voice.
PG en explainer	8.0	7c8927ccae25	owed	Real 9.5–10 · mimics 2.0 · unaudited reals
Freistetter de	7.5	1982f8398fff	owed	POS 8.1–8.5 · mimics ≤0.5
Louapre fr	7.0	20fcaa5edf31	owed	POS 7.7–8.4 · mimics ≤0.8
卢昌海 zh explainer	5.5 recog-infl	7ab47b0e9175	B24c · forensic FAIL	Mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-tp) → not hand-identity. But register separates (opus +2.72) → fit-for-purpose gate only, opus-led; glm+qwen floor transposed real.

The four explainer bands and Flint's short-form band were calibrated on reals whose recognizability was not audited. By §2's doctrine, a band is only a voice band once it holds on transposed reals — Flint's long-form is clean (transposed ref); the rest owe the audit.

09

Open questions

B24(b) — RESOLVED (Session 30): no model reads Chinese voice without recognition. The full slate, screened clean (§3): none credit transposed/obscure real Ma above clean AI fakes; four read transposed real Flint above the fakes. → maboyong is blocked on model capability. Decision for Stephen: a fit-for-purpose bar (recognition-anchored, scoped to 1628) vs hold and re-probe at the next model refresh.
B24(c) — 卢昌海 RESOLVED (2026-06-16); other explainers + Flint short-form still owed. The Lu exposition judge fails the forensic bar — a strong few-shot mimic clears it (glm 9.4 / qwen 7.8 / opus 5.1 ≥ real-Lu-transposed); band 5.5 is recognition-inflated. But it differs from Ma: exposition register survives transposition (transposed real Lu > flat fakes: opus +2.72, gemini +4.28), so a fit-for-purpose "competent on-voice explainer" gate is buildable — but only on an opus-led panel (the locked glm+qwen floor transposed real), and it passes mimics by construction. Decisive follow-up (not run): a different real Chinese science-popularizer, transposed, on glm+qwen — register-gate vs Lu-gate. Evidence: recognition_probe_b24c_2026_06_16.json. PG / Freistetter / Louapre + Flint short-form still owe the transposed-reals audit.
compfit_zh — the content-constant 2×2, re-run on a non-recognizable scene.
B21 — the accent adversarial. A Ma-gestalt passage salted with native-fluency defects, Opus vs native deciders head-to-head — the only test that can retire "Opus blind to Chinese accent."

Honest position: we cannot currently measure Chinese authorial voice without recognition. We don't know how good our Chinese generation is — only that it is unrecognized, the same as real Ma on this judge.

10

The meta-lesson

A coarser or looser instrument flatters a result that a sharper test then corrects.

The rubric judge scored our output 6.8; the intuitive judge read ~3.
"Prompting is exhausted" — until the restraint test showed from-plan Flint moves ~2 → ~4.8.
The Chinese bake-off ranked writers ~6.7 — until name-transposition floored all of them, real Ma included.
"Opus blind to Chinese is retired" — until the axis was named: the gestalt screen never probed the accent axis.
"Lu's non-fiction reads voice (opus +3.44)" — until the few-shot-mimic capstone showed a strong mimic clears the judge as easily as real Lu: what survives transposition is the register, not the hand. This time the flattering read appeared at the reasoning layer and the sharper stimulus corrected it (B24c).

This is why the findings here are dated, why every "at band" call owes the sharper test, and why the four-arm recognition test and transposition-before-banding are now hard rules. The same flattering-instrument failure the judge doctrine fixes at the measurement layer recurs at the reasoning layer — guard against it there too.

—

Evidence index

Recognition-binding · discovery (长安 6.62→1.62): judge_recognition_confound_2026_06_15.json

Recognition-binding · confirmed + Flint contrast: judge_recognition_second_test_2026_06_16.json

卢昌海 exposition recognition probe · B24c (forensic FAIL via mimic capstone; register-separation): recognition_probe_b24c_2026_06_16.json

maboyong panel lock + generation re-measure: judge_lock_2026_06_11.json

Flint short-form band 7.5/7.0: intuitive_band_2026_06_11.json

Flint long-form band 5.5 + length compression: longform_band_2026_06_14.json

B17 · Flint named-vs-transposed (~2–3pt): rewrite_threshold_b17_2026_06_11.json

Two-aspect 2×2 + compfit lock: two_aspect_2026_06_13.json

Restraint recipe · from-plan ceiling ~4.8: restraint_findings_2026_06_14.json

Chinese long-form bake-off (VOID) + compfit_zh: longform_and_compfit_2026_06_14.json

Claude-judge screen (Opus coarse-gestalt; Fable-5): claude_judges_2026_06_12.json

Transport equivalence per-judge: transport_aiberm_validation_2026_06_12.json

Explainer validations (en/zh/de/fr): intuitive_validation_2026_06_11.json
