GUO Yue-jiang, YAN Xin, LIU Xiao-hui, YU Zheng-tao, XIAN Yan-tuan, MO Yuan-yuan. A method of building Chinese-Khmer comparable corpus mixing with themes and elements[J]. Journal of Yunnan University: Natural Sciences Edition, 2017, 39(3): 360-368. doi: 10.7540/j.ynu.20160540

A method of building Chinese-Khmer comparable corpus mixing with themes and elements

  • Abstract: To obtain comparable corpora effectively, this paper takes Chinese-Khmer bilingual news documents as candidate texts and proposes an acquisition method that fuses publication-time elements, entity elements, and topic distributions. The method first computes the JS (Jensen-Shannon) distance between the topic probability distributions of two texts and combines it with the topic and element features to compute text similarity. It then clusters the bilingual texts with an improved hierarchical clustering algorithm and finally extracts comparable corpora from the resulting clusters. Compared with clustering driven by a dictionary-based text similarity measure, the proposed method achieves higher Purity and F-measure and yields more high-quality comparable corpora, which demonstrates its effectiveness.
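
The similarity step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion formula, so everything below is an assumption made for illustration: the linear weighting, the weight values, the Jaccard overlap for entities, the linear decay for publication dates, and the names `fused_similarity`, `time_similarity`, and `jaccard`. It also assumes that entities from both languages have already been mapped to shared identifiers and that topic distributions come from a topic model fitted beforehand.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete
    distributions of equal length; the result lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def jaccard(a, b):
    """Overlap of two entity collections (hypothetical entity feature)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def time_similarity(day_a, day_b, window=3):
    """Linear decay with the gap in days (hypothetical time feature);
    drops to 0 once the gap reaches `window` days."""
    return max(0.0, 1.0 - abs(day_a - day_b) / window)


def fused_similarity(doc_a, doc_b, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of topic, entity, and time similarity.
    The weights are illustrative, not taken from the paper."""
    w_topic, w_entity, w_time = weights
    topic_sim = 1.0 - js_divergence(doc_a["topics"], doc_b["topics"])
    entity_sim = jaccard(doc_a["entities"], doc_b["entities"])
    time_sim = time_similarity(doc_a["day"], doc_b["day"])
    return w_topic * topic_sim + w_entity * entity_sim + w_time * time_sim


# Toy documents: topic distributions, normalized entity IDs, day index.
zh_doc = {"topics": [0.7, 0.2, 0.1], "entities": {"E1", "E2"}, "day": 10}
km_doc = {"topics": [0.6, 0.3, 0.1], "entities": {"E1", "E3"}, "day": 11}
score = fused_similarity(zh_doc, km_doc)
```

A pairwise score of this kind could then be converted to a distance (for example `1 - score`) and passed to a hierarchical clustering routine; the specific improvements the authors make to the clustering algorithm are not stated in the abstract and are not reproduced here.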

     
