Abstract:
Segmentation of Tibetan sentences is one of the difficult tasks in Tibetan information processing, and it is a foundation for applications such as Tibetan-Chinese machine translation and text categorization. To deal with the ambiguous functions of Tibetan punctuation marks, this paper proposes a method for the automatic segmentation of Tibetan sentences that combines statistics and rules. Experiments show that the approach is effective: the F1-measure reaches 98% or higher. First, empirical rules are used to identify ambiguous Tibetan sentences, which serve as the candidate sentences. Then, an analysis of compound sentences based on conjunctive words combines clauses to form further candidate sentences. Finally, a Maximum Entropy model segments these further candidate sentences according to their meaning. In this way, the empirical rules and the compound-sentence analysis address the problems of corpus sparsity and of clauses, which Maximum Entropy alone cannot resolve.
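The three stages described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything named in it is an assumption made for illustration: the ambiguous punctuation mark is taken to be the shad (U+0F0D), the conjunctive words are passed in as a plain set and matched with a crude suffix test, and the Maximum Entropy model is represented only by a hypothetical callable `is_boundary(left, right)` that a trained classifier would supply. The example data uses ASCII placeholder tokens rather than real Tibetan words.

```python
from typing import Callable, Iterable, List

# Assumption: the ambiguous punctuation mark is the shad (U+0F0D).
SHAD = "\u0f0d"


def split_candidates(text: str, mark: str = SHAD) -> List[str]:
    """Stage 1 (rules): cut after every mark, keeping the mark on its piece."""
    pieces: List[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch == mark:
            pieces.append(current.strip())
            current = ""
    if current.strip():
        pieces.append(current.strip())
    return pieces


def merge_clauses(pieces: Iterable[str], conjunctions: Iterable[str],
                  mark: str = SHAD) -> List[str]:
    """Stage 2: join a clause that ends in a conjunctive word to the next piece."""
    conj = tuple(conjunctions)
    merged: List[str] = []
    buffer = ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip()
        body = piece.rstrip(mark).rstrip()
        # Crude suffix test; a real system would match on tokenized words.
        if conj and body.endswith(conj):
            continue  # the clause is linked to what follows
        merged.append(buffer)
        buffer = ""
    if buffer:
        merged.append(buffer)
    return merged


def segment(text: str, conjunctions: Iterable[str],
            is_boundary: Callable[[str, str], bool],
            mark: str = SHAD) -> List[str]:
    """Stage 3: let a classifier (hypothetical stand-in for the Maximum
    Entropy model) decide which remaining candidate boundaries are real."""
    units = merge_clauses(split_candidates(text, mark), conjunctions, mark)
    sentences: List[str] = []
    current = ""
    for i, unit in enumerate(units):
        current = f"{current} {unit}".strip()
        is_last = i + 1 == len(units)
        if is_last or is_boundary(current, units[i + 1]):
            sentences.append(current)
            current = ""
    return sentences


if __name__ == "__main__":
    # Placeholder tokens, not real Tibetan; "and" plays the conjunctive word.
    sample = f"a b{SHAD} c and{SHAD} d{SHAD} e{SHAD}"
    for sentence in segment(sample, {"and"}, lambda left, right: True):
        print(sentence)
```

Keeping the classifier behind a callable means the rule-based stages can be tested on their own, and any probabilistic model (for example, a multinomial logistic regression, which is equivalent to a Maximum Entropy classifier) can be plugged in without changing the pipeline.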