Citation: XU Tao, JIA Yang-ji, YU Hong-zhi. An approach of automatic segmentation for Tibetan sentence based on rules and statistics[J]. Journal of Yunnan University: Natural Sciences Edition, 2012, 34(6): 653-657,663.

An approach of automatic segmentation for Tibetan sentence based on rules and statistics

  • Abstract: Sentence segmentation is one of the difficult tasks in Tibetan information processing and a foundation for work such as Tibetan-Chinese machine translation and Tibetan text categorization. To resolve the functional ambiguity of Tibetan punctuation marks, this paper proposes an automatic sentence segmentation method that combines statistics and rules. Experiments show that the method performs well, with an F1-measure above 98%. In the rule-based stage, a heuristic (experience-based) method first identifies uncertain Tibetan sentences as candidate sentences; compound-sentence analysis based on conjunctive words then merges clauses to form second-stage candidate sentences. Finally, a maximum entropy model segments the second-stage candidate sentences. The heuristic method and the compound-sentence analysis address the corpus sparsity and clause problems that maximum entropy alone cannot handle.
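The three stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no rules, connective list, features, or trained weights, so everything below is an assumption made for illustration. Stage 1 is reduced to splitting on the Tibetan shad (U+0F0D), stage 2 merges a clause into its predecessor when the predecessor ends with a token from a hypothetical connective set, and stage 3 uses a binary maximum entropy (logistic) score with hand-set, untrained weights. The function names, the placeholder tokens (`CONN`, `END`, `w1`...), and the threshold are all hypothetical.

```python
import math

SHAD = "\u0f0d"  # Tibetan mark shad; it can close either a clause or a sentence


def split_candidates(text):
    """Stage 1 (simplified assumption): every shad yields a candidate unit."""
    return [piece.strip() for piece in text.split(SHAD) if piece.strip()]


def merge_by_connectives(clauses, connectives):
    """Stage 2: merge a clause into the previous unit when that unit
    ends with a connective token (hypothetical connective set)."""
    merged = []
    for clause in clauses:
        if merged and merged[-1].split()[-1] in connectives:
            merged[-1] = merged[-1] + " " + SHAD + " " + clause
        else:
            merged.append(clause)
    return merged


def boundary_probability(left, right, weights):
    """Stage 3: binary maximum entropy (logistic) score for a sentence
    boundary between two units. Features and weights are illustrative."""
    features = ["last=" + left.split()[-1], "first=" + right.split()[0]]
    score = sum(weights.get(feature, 0.0) for feature in features)
    return 1.0 / (1.0 + math.exp(-score))


def segment(text, connectives, weights, threshold=0.5):
    """Run the three stages and return the list of sentences."""
    units = merge_by_connectives(split_candidates(text), connectives)
    if not units:
        return []
    sentences = [units[0]]
    for left, right in zip(units, units[1:]):
        if boundary_probability(left, right, weights) >= threshold:
            sentences.append(right)  # shad treated as a sentence end
        else:
            sentences[-1] += " " + SHAD + " " + right  # shad treated as clause-internal
    return sentences


# Toy input with placeholder tokens instead of real Tibetan words.
toy_text = f"w1 CONN {SHAD} w2 END {SHAD} w3 w4 {SHAD} w5 END {SHAD}"
toy_connectives = {"CONN"}
toy_weights = {"last=END": 2.0, "last=w4": -2.0}
result = segment(toy_text, toy_connectives, toy_weights)
```

In this toy run the connective merges the first two clauses before the statistical step sees them, which mirrors the abstract's point that the rule stages handle cases the maximum entropy model is not expected to resolve on its own. A real system would train the weights on an annotated corpus rather than set them by hand.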

     
