Multilingual event discovery based on augmentation contrastive learning

潘通; 余正涛; 黄于欣; 关昕; 严海宁; 杨溪

doi:10.7540/j.ynu.20230070


Abstract

Multilingual event discovery clusters texts in different languages that describe the same event into a single cluster, and it is the foundation of multilingual event analysis. Current deep-learning-based clustering methods mainly cluster by optimizing the distances between text representations, so their performance depends heavily on the representation ability of the model. In a multilingual setting, text representations are poorly aligned across languages, which makes multilingual event clustering difficult. This paper proposes a multilingual event discovery method based on augmentation contrastive learning. The method optimizes both the distance from each event text to its cluster center and the distances between multilingual positive and negative samples, so that texts in different languages describing the same event lie closer together in the representation space and the model's ability to represent multilingual text improves. For the event clustering task, representations of event elements are introduced as the event cluster centers, which further improves multilingual event clustering. Experiments on the Reuters dataset show that the proposed method improves performance on top of several pre-trained models, with best accuracy and normalized mutual information (NMI) of 76.14% and 91.09%, respectively.
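The abstract describes two distance terms, text to cluster center and text to multilingual positive and negative samples, without giving the loss itself. The following is a minimal sketch of how such a joint objective could look, assuming an InfoNCE-style contrastive term and a squared-Euclidean center term. The function names, the temperature, the weighting, and the toy vectors are illustrative assumptions, not the paper's formulation, and the event center is passed in as a plain vector rather than being built from event elements.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): small when the anchor is more
    similar to its positive than to any negative."""
    a = l2_normalize(anchor)
    logits = [dot(a, l2_normalize(positive)) / temperature]
    logits += [dot(a, l2_normalize(n)) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[0]


def center_loss(embedding, center):
    """Squared Euclidean distance from a text embedding to its event center."""
    return sum((x - c) ** 2 for x, c in zip(embedding, center))


def joint_loss(anchor, positive, negatives, center, weight=1.0):
    """Weighted sum of the two terms; the weight is a hypothetical hyperparameter."""
    contrast = contrastive_loss(anchor, positive, negatives)
    return contrast + weight * center_loss(anchor, center)


# Toy 2-d embeddings, made up purely for illustration.
en_text = [1.0, 0.1]             # English report of an event
zh_text_same_event = [0.9, 0.2]  # Chinese report of the same event
other_events = [[-1.0, 0.3], [0.0, -1.0]]
event_center = [0.95, 0.15]

aligned = joint_loss(en_text, zh_text_same_event, other_events, event_center)
misaligned = joint_loss(
    en_text, other_events[0], [zh_text_same_event, other_events[1]], event_center
)
print(aligned < misaligned)
```

In this toy setup the loss is lower when the positive sample is the same-event text in another language, which is the behavior the abstract attributes to the training objective. A real system would compute these terms on encoder outputs in batches and backpropagate through the encoder.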
