SpeakerCard-1M

TL;DR

Resource

10.2K speakers, 56.7K post-QC records, 1.78M utterance captions, balanced EN+ZH, with speaker-disjoint hard-negative triplets and field-level probe provenance.

Method

Tool-first, LLM-last: ten acoustic probes extract field-level evidence; a schema separates stable traits from utterance-level states; an LLM only verbalizes structured fields, never raw audio.

Benchmarks

Bidirectional T2S-R / S2T-R retrieval + AC-Verify (CF counterfactual rejection, Hard near-miss rejection). A dual-encoder baseline beats the strongest of eight 7B–30B+ LALMs on pitch by 11.7 percentage points.

Method overview

Construction pipeline

1. Unified ingestion of VoxCeleb1/2 and CN-Celeb1/2

2. Ten probes extract six traits and four utterance states

3. Confidence-weighted aggregation builds speaker profiles

4. Constrained LLM renders EN/ZH cards in four styles

5. QC removes malformed, duplicate, leaky, and inconsistent cards
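The aggregation step can be pictured as a confidence-weighted vote over per-utterance probe outputs. The sketch below is a minimal illustration under stated assumptions, not the released pipeline: the function name `aggregate_trait`, the `(label, confidence)` input format, and the example values are all hypothetical.

```python
from collections import defaultdict


def aggregate_trait(predictions):
    """Confidence-weighted vote for one trait of one speaker.

    predictions: list of (label, confidence) pairs, one per utterance.
    Returns (winning_label, share_of_total_confidence).
    """
    if not predictions:
        raise ValueError("need at least one probe prediction")
    weights = defaultdict(float)
    for label, confidence in predictions:
        weights[label] += confidence
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    best = max(weights, key=weights.get)
    return best, weights[best] / total


# Hypothetical per-utterance pitch-band predictions for a single speaker.
utterance_preds = [("very_high", 0.75), ("high", 0.40), ("very_high", 0.60)]
label, support = aggregate_trait(utterance_preds)
print(label, round(support, 3))
```

Returning the winning label's share of the total confidence keeps a per-field score that can be carried into the speaker profile alongside the label.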

Evaluation protocols

What each metric asks

The benchmark is designed to separate retrieval, verification, and attribute grounding. Each protocol uses speaker-disjoint splits; Vox-only retrieval uses a fixed 1K-speaker English gallery, while the bilingual extension additionally reports a 144-speaker clean CN-Celeb gallery.

T2S-R

Text-to-Speaker Retrieval

Input: one Speaker Card text, such as short_query or identity_only.

Candidate set: speaker audio anchors, where each speaker embedding averages up to three utterances.

Correct if: the matching speaker appears in the top K results.

Example: query “female, Southwestern accent, very low pitch” should retrieve that speaker from a 1K-speaker gallery.
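As a rough sketch of how T2S-R could be scored, the code below averages up to three utterance embeddings per speaker into an audio anchor and computes Recall@K by cosine similarity. The function names, the 2-D toy embeddings, and the speaker ids are illustrative assumptions; real embeddings would come from the trained audio and text towers.

```python
import math


def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_gallery(utterance_embeddings, max_utts=3):
    """speaker_id -> anchor averaged over at most `max_utts` utterances."""
    return {spk: mean_vector(embs[:max_utts])
            for spk, embs in utterance_embeddings.items()}


def recall_at_k(queries, gallery, k):
    """queries: list of (true_speaker_id, text_embedding)."""
    hits = 0
    for true_spk, query in queries:
        ranked = sorted(gallery, key=lambda spk: cosine(query, gallery[spk]),
                        reverse=True)
        hits += true_spk in ranked[:k]
    return hits / len(queries)


# Toy data (hypothetical): three speakers, 2-D embeddings.
utts = {
    "spkA": [[1.0, 0.0], [0.9, 0.1]],
    "spkB": [[0.0, 1.0]],
    "spkC": [[0.7, 0.7], [0.6, 0.8], [0.8, 0.6], [-1.0, 0.0]],  # 4th is ignored
}
gallery = build_gallery(utts)
queries = [("spkA", [1.0, 0.2]), ("spkB", [0.1, 1.0]), ("spkC", [1.0, 0.1])]
print(recall_at_k(queries, gallery, 1), recall_at_k(queries, gallery, 2))
```

S2T-R follows the same pattern with the roles swapped: the query is an audio anchor and the gallery holds one text anchor per candidate speaker.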

S2T-R

Speaker-to-Text Retrieval

Input: a speaker audio anchor.

Candidate set: one text anchor per candidate speaker, averaged over the three paraphrase variants when applicable.

Correct if: the matching Speaker Card is ranked in the top K.

Example: an utterance from speaker A should retrieve speaker A’s identity-only card, not a near-miss card with similar demographics.

AC-Verify CF

Counterfactual Rejection

Input: audio plus two deterministic trait templates.

Candidate set: the true identity template and a counterfactual template where exactly one trait is changed.

Correct if: the true template scores higher than the contradicted one.

Example: “pitch level: very_low” versus the same card with “pitch level: high.” Only the pitch field changes.
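The CF protocol can be sketched as follows: render the true traits with a fixed template, copy them with exactly one field changed, and count a trial as correct when the true text scores higher. The template format, the field names, and `toy_score` (a stand-in for an audio-text compatibility model) are assumptions made for illustration.

```python
TRAIT_ORDER = ("gender", "age", "accent", "pitch level")


def render_template(traits):
    """Deterministic 'field: value' rendering (format is an assumption)."""
    return "; ".join(f"{name}: {traits[name]}" for name in TRAIT_ORDER)


def make_counterfactual(traits, field, new_value):
    """Copy `traits` with exactly one field changed."""
    if field not in traits:
        raise KeyError(field)
    if traits[field] == new_value:
        raise ValueError("counterfactual must change the value")
    flipped = dict(traits)
    flipped[field] = new_value
    return flipped


def cf_accuracy(trials, score):
    """trials: (audio, true_text, cf_text); score(audio, text) -> float."""
    correct = sum(score(a, t) > score(a, c) for a, t, c in trials)
    return correct / len(trials)


true_traits = {"gender": "female", "age": "25-35",
               "accent": "southwestern", "pitch level": "very_low"}
cf_traits = make_counterfactual(true_traits, "pitch level", "high")
changed = [f for f in TRAIT_ORDER if true_traits[f] != cf_traits[f]]


def toy_score(audio, text):
    # Stand-in scorer: the "audio" here is just the clip's hypothetical
    # ground-truth fields, and the score counts how many appear in the text.
    return sum(f"{k}: {v}" in text for k, v in audio.items())


trials = [(true_traits, render_template(true_traits), render_template(cf_traits))]
accuracy = cf_accuracy(trials, toy_score)
print(changed, accuracy)
```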

AC-Verify Hard

Hard-Negative Rejection

Input: audio plus two real Speaker Cards.

Candidate set: the positive card and a Stage5 mined hard negative that matches coarse traits but differs in finer attributes such as pitch or timbre.

Correct if: the positive card receives the higher audio-text compatibility score.

Example: two male speakers with the same age band and accent; the model must reject the near-miss card.

SV EER

Traditional Speaker Verification

Input: a VoxCeleb trial pair.

Candidate set: target and nontarget trial labels from VoxCeleb1-O/E/H.

Correct if: cosine scores separate same-speaker from different-speaker pairs; lower EER is better.

Example: “utt1 vs utt2” is scored using the same WavLM+MHFA audio encoder, without any text input.
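For reference, EER is the operating point where the false-accept rate on nontarget trials equals the false-reject rate on target trials. The sketch below is a simple threshold sweep in plain Python with made-up scores; it is not the evaluation script used for the reported numbers, and `equal_error_rate` is a hypothetical name.

```python
def equal_error_rate(scores, labels):
    """scores: cosine scores; labels: 1 for target trials, 0 for nontarget."""
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need both target and nontarget trials")
    best_gap, eer = None, None
    for thr in sorted(set(scores)):
        frr = sum(s < thr for s in targets) / len(targets)          # false rejects
        far = sum(s >= thr for s in nontargets) / len(nontargets)   # false accepts
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Made-up trial scores: four target pairs, four nontarget pairs.
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
eer = equal_error_rate(scores, labels)
print(f"EER = {100 * eer:.1f}%")
```

With a finite trial list the two error curves rarely cross exactly, so the sketch reports the midpoint of FAR and FRR at the threshold where they are closest.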

LALM / Cascade

External Model Controls

Cascade: replaces the audio tower with probe→LLM text and compares texts in BGE-M3 space.

LALM forced choice: gives an audio language model the same audio and Card A/B, with balanced A/B order.

Correct if: the matched card is selected or ranked above the contradicted/negative card.

Example: Audio Flamingo, Qwen3-Omni, Gemini 2.5/3.5 Flash, and GPT audio mini answer which of two cards better matches the audio.
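Balanced A/B order matters because a model with a position bias would otherwise look better or worse than chance. The sketch below shows one way to balance slots and score choices; the trial dictionary layout, function names, and card strings are hypothetical, and `choose` stands in for the actual model call.

```python
import random


def build_trials(pairs, seed=0):
    """pairs: list of (audio_id, matched_card, distractor_card).

    The matched card lands in slot A for half of the trials
    (one extra A when the count is odd).
    """
    rng = random.Random(seed)
    slots = ["A", "B"] * (len(pairs) // 2) + ["A"] * (len(pairs) % 2)
    rng.shuffle(slots)
    trials = []
    for (audio_id, matched, distractor), slot in zip(pairs, slots):
        card_a, card_b = (matched, distractor) if slot == "A" else (distractor, matched)
        trials.append({"audio": audio_id, "A": card_a, "B": card_b, "answer": slot})
    return trials


def forced_choice_accuracy(trials, choose):
    """choose(trial) -> 'A' or 'B'."""
    return sum(choose(t) == t["answer"] for t in trials) / len(trials)


# Hypothetical trials; audio ids and card texts are placeholders.
pairs = [(f"clip_{i}", f"matched card {i}", f"distractor card {i}") for i in range(4)]
trials = build_trials(pairs)

# A model that always answers "A" lands exactly at chance under balanced order.
always_a = forced_choice_accuracy(trials, lambda t: "A")
print(always_a)
```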

Concrete inputs

What AC-Verify actually compares

These are real card pairs sampled from the released test sets. In each panel the model sees one audio clip plus two text candidates and chooses A or B under a balanced forced-choice prompt. CF tests minimal trait flips with LLM-rewritten distractors; Hard uses real cards from different speakers that share coarse traits.

CF · Counterfactual rejection

One trait flipped, every other phrase preserved verbatim

The distractor is produced offline by Qwen3-32B-Instruct under a minimal-edit prompt, so style cannot leak the answer (style-symmetric llm_cf protocol).

Gender flip · EN female → male

A · matched Female, approximately 35-45, Mandarin accent, with a very high-pitched voice.

B · CF Male, approximately 35-45, Mandarin accent, with a very high-pitched voice.

Age flip · EN 25-35 → 18-24

A · matched Female, approximately 25-35, Mandarin accent, with a high-pitched voice.

B · CF Female, approximately 18-24, Mandarin accent, with a high-pitched voice.

Accent flip · EN southwestern → australian

A · matched Female, approximately 25-35, Southwestern accent, with a very high-pitched voice.

B · CF Female, approximately 25-35, Australian accent, with a very high-pitched voice.

Pitch flip · EN low → very_low

A · matched Male, approximately 18-25, Mandarin accent, with a low-pitched voice.

B · CF Male, approximately 18-25, Mandarin accent, with a very low-pitched voice.

Gender flip · ZH 女性 → 男性

A · matched 女性，Southwestern 口音，约 25-35 岁，very_high 音高。

B · CF 男性，Southwestern 口音，约 25-35 岁，very_high 音高。

Pitch flip · ZH very_high → high

A · matched 女性，大约 25-35 岁，带有 Mandarin 口音，音高 very_high。

B · CF 女性，大约 25-35 岁，带有 Mandarin 口音，音高 high。

Hard · Mined hard-negative rejection

Two different real speakers, nearly identical card text

Hard negatives are mined from real distractor speakers that share gender, age band, accent, and often pitch band with the positive. In the extreme cases shown below, the rendered cards are essentially indistinguishable as text, so the model can only solve the task by listening. Train/val/test splits are speaker-ID disjoint.
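One plausible way to mine such negatives is to bucket speakers by split and coarse traits, then prefer candidates that also share the pitch band. This is a sketch of the idea only: the actual mining procedure is not described here, and the function name, field names, and speaker ids below are hypothetical. Restricting candidates to the same split is an assumption made to keep the splits disjoint.

```python
def mine_hard_negatives(profiles, split_of, coarse=("gender", "age", "accent")):
    """profiles: speaker_id -> trait dict; split_of: speaker_id -> split name.

    Returns speaker_id -> candidate negatives from the same split that share
    every coarse trait, with same-pitch candidates listed first.
    """
    buckets = {}
    for spk, traits in profiles.items():
        key = (split_of[spk],) + tuple(traits[f] for f in coarse)
        buckets.setdefault(key, []).append(spk)

    negatives = {}
    for members in buckets.values():
        for spk in members:
            others = [o for o in members if o != spk]
            # False sorts before True, so matching pitch bands come first.
            others.sort(key=lambda o: profiles[o]["pitch"] != profiles[spk]["pitch"])
            negatives[spk] = others
    return negatives


def profile(gender, age, accent, pitch):
    return {"gender": gender, "age": age, "accent": accent, "pitch": pitch}


# Hypothetical speakers.
profiles = {
    "spk_a": profile("male", "25-35", "england", "low"),
    "spk_b": profile("male", "25-35", "england", "low"),
    "spk_c": profile("male", "25-35", "england", "mid"),
    "spk_d": profile("male", "25-35", "england", "low"),
    "spk_e": profile("female", "25-35", "england", "low"),
}
split_of = {"spk_a": "test", "spk_b": "test", "spk_c": "test",
            "spk_d": "train", "spk_e": "test"}
negs = mine_hard_negatives(profiles, split_of)
print(negs["spk_a"])
```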

Hard pair · English male (two speakers, identical profile)

A · positive Male, 25-35, England accent, with a low pitch. id10166

B · hard-neg Male, approximately 25-35, England accent, with a low pitch. id10055

Hard pair · US female, mid pitch (two speakers, identical profile)

A · positive Female, probably 35-45, US accent, with a mid-pitched voice. id10117

B · hard-neg Female, approximately 35-45, US accent, with a mid-pitched voice. id10026

Hard pair · Senior male, 60+ (two speakers, identical profile)

A · positive Male, likely 60+, US accent, low pitch. id10113

B · hard-neg Male, 60+, likely US accent, low pitch. id10901

Why this matters: CF tests whether a model rejects an explicit single-trait contradiction. Hard tests whether it can still pick the correct speaker when the two text candidates are essentially synonymous — only the audio identity differs. CF is largely solvable from text-level reasoning (best LALM: 87.79%); Hard collapses on text alone and exposes the true cross-modal grounding gap. The dual encoder achieves 72.53% Hard vs the strongest LALM at 55.28%.

Main results

One corpus, two operating points

We release two trained checkpoints from the same audio–text dual encoder. The balanced recipe maximizes attribute discrimination (AC-Verify CF / Hard) while keeping retrieval competitive; the retrieval-specialized recipe continues from a retrieval-push checkpoint and pushes T2S / S2T recall further at a controlled cost to Hard. Numbers below are on the Vox-only 1K-speaker English gallery.

Model	T2S-R R@1 (%)	T2S-R R@5 (%)	T2S-R R@10 (%)	S2T-R R@1 (%)	S2T-R R@5 (%)	S2T-R R@10 (%)	AC-Verify CF (%)	AC-Verify Hard (%)
Text-only reference
Cascade (probe → LLM card)	3.50	10.10	15.60	1.60	6.10	9.30	—	—
Ours (audio + text dual encoder)
Balanced	3.00	15.30	24.80	4.60	16.00	25.50	93.84	72.53
Retrieval-specialized	5.10	16.60	27.50	5.50	16.90	27.30	85.45	65.53

Cascade replaces the audio tower with the same probe → LLM verbalizer used in corpus construction, matched in BGE-M3 text–text space. The substantially larger gap on S2T (9.30 vs 25.50 at R@10) than on T2S (15.60 vs 24.80) localizes the audio tower's contribution to utterance-to-speaker aggregation rather than text retrieval per se. AC-Verify scores reported here use the style-symmetric LLM-generated counterfactual (llm_cf) protocol. SV EER columns from the paper (Vox1-O/E/H) are omitted here and reported in the manuscript.

Key findings

Where speaker cards change the picture

LALMs are weak on pitch

All 8 audio language models stay below 77%

Pitch CF accuracy (%): AF3 55.26, Qwen2 49.20, Qwen3 69.59, MiMo 70.27, Kimi 65.07, Gem2.5 74.74, Gem3.5 76.99, GPT 70.26, Ours 88.66

On 2-way forced choice with pitch as the differing trait under the LLM-generated counterfactual protocol, all eight LALMs (open and closed, 7B–30B+) score 49–77% (mean 66.4%). Our dual encoder reaches 88.66%, 11.7 percentage points above the strongest LALM (Gemini 3.5 Flash, 76.99%).

Audio tower vs probe→LLM cascade

Direction-dependent contribution

T2S R@10: 15.60 (cascade) < 24.80 (dual encoder)

S2T R@10: 9.30 (cascade) ≪ 25.50 (dual encoder)

A text-only cascade sharing our probe pipeline trails the dual encoder on T2S R@10 (15.60 vs 24.80) and falls further behind on S2T R@10 (9.30 vs 25.50), localizing the audio tower's contribution to utterance-to-speaker aggregation that single-utterance probe-then-verbalize cannot fully recover.

Schema enforcement matters

Training-time text view ablation

Training view	T2S@10	S2T@10	CF	Hard
detailed (no schema)	19.30	22.10	86.97	68.03
identity_only (schema)	24.80	25.50	93.84	72.53

Removing schema enforcement drops CF by 6.87 percentage points and Hard by 4.50 points (−6.2% relative); retrieval recall also drops, by 5.50 points on T2S@10 and 3.40 points on S2T@10. Under the stricter LLM-generated CF protocol, schema enforcement protects attribute-conditioned verification at least as much as hard-negative discrimination.
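The absolute and relative drops follow directly from the two ablation rows. The short check below recomputes them; only the numbers come from the table, and the variable and function names are illustrative.

```python
schema = {"T2S@10": 24.80, "S2T@10": 25.50, "CF": 93.84, "Hard": 72.53}
no_schema = {"T2S@10": 19.30, "S2T@10": 22.10, "CF": 86.97, "Hard": 68.03}


def drop(metric):
    """Return (absolute drop in points, relative drop in percent)."""
    absolute = schema[metric] - no_schema[metric]
    relative = 100 * absolute / schema[metric]
    return absolute, relative


for metric in schema:
    absolute, relative = drop(metric)
    print(f"{metric}: -{absolute:.2f} points ({relative:.1f}% relative)")
```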

Eval-time text view is a controllable axis

Single balanced checkpoint, four query styles

Switching only the eval-time text view changes retrieval but leaves AC-Verify flat: identity_only matches training and wins on T2S/S2T R@10, while the other three views lose 8.0 to 16.2 points of T2S R@10. Hard varies by under 0.4 points and CF stays at 93.84% across all four views, since both matched and counterfactual cards are scored under the same view.

Eval text style	T2S R@1	T2S R@10	S2T R@1	S2T R@10	Hard
`detailed`	2.30	14.80	2.30	16.10	72.30
`identity_only`	3.00	24.80	4.60	25.50	72.53
`technical_report`	0.70	8.60	2.50	15.30	72.17
`short_query`	2.70	16.80	3.40	16.40	72.43

AC-Verify zero-shot benchmark

How eight audio language models score on attribute grounding

Each model is shown one audio clip and two cards (the matched card and a counterfactual that differs in a single trait, or a mined hard-negative card), and chooses A or B with balanced A/B order. Counterfactuals are LLM-rewrites in the same surface style as the original card (the style-symmetric llm_cf protocol), so the distractor cannot be detected by style alone. CF is the per-trait average across gender, accent, age, and pitch. Higher is better.

Model	Gender	Accent	Age	Pitch	CF	Hard
Open-source
Audio Flamingo 3	94.59	71.88	56.06	55.26	69.45	50.05
Qwen2-Audio-7B	53.97	46.28	52.99	49.20	50.61	49.77
Qwen3-Omni-30B	97.76	95.37	80.93	69.59	85.91	55.28
MiMo-Audio-7B	97.45	70.12	67.45	70.27	76.32	51.90
Kimi-Audio-7B	94.90	81.51	64.44	65.07	76.51	48.58
Closed-source (API)
Gemini 2.5 Flash	96.73	94.62	75.18	74.74	85.32	53.41
Gemini 3.5 Flash	97.41	92.35	84.40	76.99	87.79	51.72
GPT audio mini	87.50	60.92	71.41	70.26	72.52	49.32
This work
Ours (dual-task, balanced)	95.93	97.43	93.33	88.66	93.84	72.53

Six of eight LALMs handle gender well (94–98%); accent shows a clear split with Qwen3-Omni and both Gemini variants above 92% versus 46–82% for the rest; age sees only Qwen3-Omni (80.93%) and Gemini 3.5 Flash (84.40%) above 80%. Pitch is the universal weak point: all eight LALMs score 49–77%, with one model below chance. Our dual encoder reaches 88.66% on pitch without conceding the coarse traits and is the only system above 88% on every per-trait column.
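Since CF is defined as the per-trait average, the CF column can be recomputed from the four trait columns, and the pitch statistics quoted for the LALMs can be checked the same way. The snippet below copies the table values; the names and the 0.01 tolerance are illustrative choices. Under that tolerance every row matches except Kimi-Audio-7B, whose recomputed mean is 76.48 against a reported 76.51.

```python
# (gender, accent, age, pitch, reported CF), copied from the table above.
rows = {
    "Audio Flamingo 3": (94.59, 71.88, 56.06, 55.26, 69.45),
    "Qwen2-Audio-7B": (53.97, 46.28, 52.99, 49.20, 50.61),
    "Qwen3-Omni-30B": (97.76, 95.37, 80.93, 69.59, 85.91),
    "MiMo-Audio-7B": (97.45, 70.12, 67.45, 70.27, 76.32),
    "Kimi-Audio-7B": (94.90, 81.51, 64.44, 65.07, 76.51),
    "Gemini 2.5 Flash": (96.73, 94.62, 75.18, 74.74, 85.32),
    "Gemini 3.5 Flash": (97.41, 92.35, 84.40, 76.99, 87.79),
    "GPT audio mini": (87.50, 60.92, 71.41, 70.26, 72.52),
    "Ours (dual-task, balanced)": (95.93, 97.43, 93.33, 88.66, 93.84),
}


def macro_cf(row):
    """Unweighted mean of the four per-trait accuracies."""
    return sum(row[:4]) / 4


# Rows whose recomputed mean differs from the reported CF by more than 0.01.
mismatches = [name for name, row in rows.items()
              if abs(macro_cf(row) - row[4]) > 0.01]

lalm_pitch = [row[3] for name, row in rows.items() if not name.startswith("Ours")]
pitch_mean = sum(lalm_pitch) / len(lalm_pitch)
pitch_gap = rows["Ours (dual-task, balanced)"][3] - max(lalm_pitch)

print(mismatches, round(pitch_mean, 2), round(pitch_gap, 2))
```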

Speaker Card examples

Four styles, one real speaker

Real captions for cnceleb1/id00000, the same speaker referenced across the corpus. Each record exposes four text views so retrieval, verification, and interpretability experiments can choose how much state context to include. Trait phrases (gender, age, accent, pitch) are highlighted in orange; state phrases (emotion, channel, environment, rate) in teal; numeric evidence in gold.

Detailed (detailed)

Female, 25-35 years, Southwestern accent. Very high pitch. Happy, slow speaking rate, noisy telephone environment.

Identity only (identity_only)

Female, approximately 25-35, Southwestern accent, with a very high-pitched voice.

Technical Report (technical_report)

Adult female; age 25-35 (P=0.514); median F0 361 Hz (very high pitch); jitter 15.83%, shimmer 0.56% (raspy phonation, P=0.657); habitual speech rate 110.3 wpm (slow).

Short query (short_query)

Female, Southwestern, 25-35, very high pitch, happy, slow, noisy telephone.

Detailed · ZH (detailed)

该演讲者为女性，年龄在25-35岁之间，具有西南口音。她的音调非常高，表现出快乐的情绪，但录音环境较为嘈杂，且通过电话通道录制。她的语速较慢或非常慢。

Technical Report · ZH (technical_report)

成年女性；年龄25-35岁（P=0.514）；中位F0 361 Hz（极高音高，P=0.75）；jitter 15.83%，shimmer 0.56%（嘶哑发音，P=0.657）；习惯语速 110.3 wpm（缓慢）。

10.2K speakers

56.7K post-QC cards

1.78M utterance captions

189K Vox-only train triplets

Vox+CN extension

Bilingual transfer snapshot

Vox-only and Vox+CN use different training pools, validation sets, and test galleries, so the bilingual rows are reported as an extension snapshot rather than a strict same-gallery comparison.

Train	Test	Gallery	T2S@10	S2T@10	CF	Hard
EN	EN	1000	25.00	25.40	93.90	72.13
EN	ZH	144	31.94	31.25	70.34	70.37
ZH	EN	1000	7.50	6.90	71.30	62.20
ZH	ZH	144	57.64	59.03	90.49	74.77
Bilingual	EN	1000	22.40	25.60	93.60	73.97
Bilingual	ZH	144	53.47	62.50	90.67	74.77

Release

Annotation layer, protocols, and baselines

SpeakerCard-1M releases metadata and benchmark protocols only. Original VoxCeleb/CN-Celeb audio is not redistributed; released audio identifiers are source-relative lookup keys.
