Text-to-Speaker Retrieval
Input: one Speaker Card text, such as short_query or identity_only.
Candidate set: one audio anchor per speaker, where each anchor embedding is the average of up to three utterance embeddings.
Correct if: the matching speaker appears in the top-K results.
Example: the query “female, Southwestern accent, very low pitch” should retrieve the speaker it describes from a 1K-speaker gallery.
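
The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: text and audio embeddings are taken to live in a shared space, candidates are ranked by cosine similarity, and the helper names (speaker_anchor, top_k_speakers, recall_at_k) and the toy 2-D vectors are hypothetical, not part of the task definition.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def speaker_anchor(utterance_embeddings, max_utts=3):
    """Average up to max_utts utterance embeddings into one speaker anchor."""
    used = utterance_embeddings[:max_utts]
    dim = len(used[0])
    mean = [sum(u[i] for u in used) / len(used) for i in range(dim)]
    return l2_normalize(mean)

def top_k_speakers(query_emb, anchors, k):
    """Return the k speaker ids ranked highest by cosine similarity (assumed metric)."""
    q = l2_normalize(query_emb)
    scores = {
        spk: sum(a * b for a, b in zip(q, anchor))
        for spk, anchor in anchors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

def recall_at_k(queries, anchors, k):
    """Fraction of queries whose matching speaker appears in the top k.

    queries is a list of (query_embedding, true_speaker_id) pairs.
    """
    hits = sum(
        1 for emb, spk in queries if spk in top_k_speakers(emb, anchors, k)
    )
    return hits / len(queries)

# Toy gallery: made-up 2-D embeddings standing in for real encoder outputs.
utterances = {
    "spk_a": [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]],  # 4th is ignored
    "spk_b": [[0.0, 1.0], [0.1, 0.9]],
    "spk_c": [[-1.0, 0.0]],
}
anchors = {spk: speaker_anchor(utts) for spk, utts in utterances.items()}

# Each query stands in for the embedding of one Speaker Card text.
queries = [
    ([0.95, 0.05], "spk_a"),
    ([0.2, 0.8], "spk_b"),
    ([0.6, 0.4], "spk_c"),  # deliberately misplaced, so it misses at K=1
]

r1 = recall_at_k(queries, anchors, 1)
r3 = recall_at_k(queries, anchors, 3)
print(f"recall@1={r1:.3f} recall@3={r3:.3f}")
```

With a real 1K-speaker gallery the same logic applies, though a matrix product over all anchors would be the practical way to score candidates.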