H-VECTORS: UTTERANCE-LEVEL SPEAKER EMBEDDING USING A HIERARCHICAL ATTENTION MODEL (ICASSP 2020)

In this paper, a hierarchical attention network is proposed to generate utterance-level embeddings (H-vectors) for speaker identification and verification. Since different parts of an utterance may have different contributions to speaker identities, the use of hierarchical structure aims to learn speaker related information locally and globally. In the proposed approach, frame-level encoder and attention are applied on segments of an input utterance and generate individual segment vectors. Then, segment level attention is applied on the segment vectors to construct an utterance representation. To evaluate the effectiveness of the proposed approach, the data of the NIST SRE2008 Part1 is used for training, and two datasets, the Switchboard Cellular (Part1) and the CallHome American English Speech, are used to evaluate the quality of extracted utterance embeddings on speaker identification and verification tasks. In comparison with two baselines, X-vectors and X-vectors+Attention, the obtained results show that the use of H-vectors can achieve a significantly better performance. Furthermore, the learned utterance-level embeddings are more discriminative than the two baselines when mapped into a 2D space using t-SNE.

utterance: a single stretch of speech produced by one speaker (the spoken segment itself, not the speaker)

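The authors propose a hierarchical attention network that generates utterance-level embeddings (H-vectors) for speaker identification and verification. Because different parts of an utterance contribute differently to speaker identity, the hierarchical structure is used to learn speaker-related information both locally and globally.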

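In the proposed method, a frame-level encoder and attention are applied to each segment of the input utterance, producing one vector per segment. Segment-level attention is then applied to these segment vectors to build the utterance representation.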
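To make the two pooling levels concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the frame-level encoder is omitted (frames are used as they are), scoring each vector by its dot product with a query vector is an assumed form of attention, and the names `split_segments`, `attention_pool`, `h_vector_sketch`, `seg_len`, `hop`, `frame_query` and `segment_query` are all hypothetical. In a trained model the queries would be learned parameters.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a non-empty list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors: Sequence[Vector], query: Vector) -> Vector:
    """Weighted sum of vectors; weights are softmax(dot(vector, query)).

    Assumption: dot-product scoring against a single query vector.
    """
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def split_segments(frames: Sequence[Vector], seg_len: int, hop: int) -> List[Sequence[Vector]]:
    """Cut the frame sequence into windows of seg_len frames, hop frames apart."""
    last_start = max(len(frames) - seg_len, 0)
    return [frames[s:s + seg_len] for s in range(0, last_start + 1, hop)]


def h_vector_sketch(frames: Sequence[Vector], seg_len: int, hop: int,
                    frame_query: Vector, segment_query: Vector) -> Vector:
    """Two-level pooling: frames -> segment vectors -> one utterance vector.

    Assumes frames is non-empty; the frame-level encoder is left out.
    """
    segments = split_segments(frames, seg_len, hop)
    # Local level: frame-level attention inside each segment.
    segment_vectors = [attention_pool(seg, frame_query) for seg in segments]
    # Global level: segment-level attention across the segment vectors.
    return attention_pool(segment_vectors, segment_query)
```

With all-zero queries every weight is equal, so both levels reduce to plain averaging. Non-zero queries let some frames and some segments count for more than others, which is the point of the hierarchical design described above.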