Vietnamese keyphrase generation method based on multi-feature fusion

陈瑞清; 高盛祥; 余正涛; 张迎晨; 张磊; 杨舰

doi:10.7540/j.ynu.P00018

Vietnamese keyphrase generation method based on multi-feature fusion

Abstract

Vietnamese is a low-resource language, and high-quality news data annotated with keyphrases is scarce. To address the low accuracy of Vietnamese news keyphrase generation when training samples are insufficient, a multi-feature fusion model for Vietnamese keyphrase generation is proposed, with the aim of improving the relevance of the generated keyphrases to the source news documents. First, entity, part-of-speech, and word-position features of the Vietnamese news text are concatenated with the word vectors, so that the vectors fed to the model carry richer semantic information. Second, a bidirectional attention mechanism captures the dependencies between the context and the news title, strengthening the guiding role of the title in keyphrase generation. Finally, a copy mechanism is incorporated to generate the Vietnamese keyphrases, improving their semantic relevance. Experiments on a purpose-built Vietnamese news keyphrase dataset show that the multi-feature model generates high-quality keyphrases even with limited Vietnamese training samples: compared with TG-Net, its F1@10 and R@50 scores improve by 13.2% and 17.1%, respectively.
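The first step, concatenating entity, part-of-speech, and position features with the word vectors, can be illustrated with a minimal sketch. It is not the authors' implementation: the table sizes, the feature dimensions, the random initialisation, the helper names (`make_table`, `fuse_features`), and the use of the token index as the position feature are all assumptions made only for illustration.

```python
import random

def make_table(num_rows, dim, seed):
    """Build a small embedding table with random values (stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

# Hypothetical dimensions; the abstract does not specify them.
WORD_DIM, ENT_DIM, POS_DIM, LOC_DIM = 8, 2, 3, 2

word_table = make_table(100, WORD_DIM, seed=0)     # word vectors
entity_table = make_table(5, ENT_DIM, seed=1)      # entity-type feature
pos_table = make_table(20, POS_DIM, seed=2)        # part-of-speech feature
position_table = make_table(50, LOC_DIM, seed=3)   # word-position feature

def fuse_features(word_ids, entity_ids, pos_ids):
    """Concatenate word, entity, part-of-speech, and position vectors per token."""
    if not (len(word_ids) == len(entity_ids) == len(pos_ids)):
        raise ValueError("all feature sequences must have the same length")
    fused = []
    for index, (w, e, p) in enumerate(zip(word_ids, entity_ids, pos_ids)):
        # Assumption: the position feature is the token index, clipped to the table size.
        loc = min(index, len(position_table) - 1)
        # List concatenation joins the four vectors into one longer vector.
        fused.append(word_table[w] + entity_table[e] + pos_table[p] + position_table[loc])
    return fused

# Three tokens with made-up word, entity, and part-of-speech ids.
fused = fuse_features([4, 17, 9], [0, 1, 0], [3, 7, 3])
print(len(fused), len(fused[0]))  # 3 tokens, each 8 + 2 + 3 + 2 = 15 dimensions
```

In a trained model the tables would be learned parameters, and the fused vectors would feed the encoder that the bidirectional attention and copy mechanism operate on.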
