Abstract:
To obtain comparable corpora effectively, this paper presents a method that integrates time distribution, entity features, and topic distribution, using Chinese-Khmer bilingual news documents as the candidate corpus. The method first computes the Jensen-Shannon (JS) distance between the topic probability distributions of the texts and combines the topic and element features to compute text similarity. An improved hierarchical clustering algorithm is then used to cluster the bilingual texts. Finally, comparable corpora are extracted from each cluster. Compared with a dictionary-based text similarity computation method, the proposed method achieves higher Purity and F values and yields more high-quality comparable corpora.
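
As a rough illustration of the first step, the JS distance between two topic probability distributions could be computed as in the sketch below. The function names, the base-2 logarithm, and the choice to take the square root of the JS divergence as the "distance" are assumptions made for this example; the abstract does not specify which variant the paper uses, nor how the result is weighted against the time and entity features.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    # Mixture distribution; m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Square root of the JS divergence (assumed definition of 'JS distance')."""
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(js_divergence(p, q), 0.0))


# Hypothetical topic distributions for a Chinese and a Khmer news document.
doc_zh = [0.7, 0.2, 0.1]
doc_km = [0.6, 0.3, 0.1]
print(js_distance(doc_zh, doc_km))
```

A smaller value means the two documents have more similar topic mixtures, so it would act as one ingredient of the overall text similarity alongside the other features.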