ZU Xian, XIE Fei. A keyphrase extraction algorithm based on global and local feature representation[J]. Journal of Yunnan University: Natural Sciences Edition, 2023, 45(4): 825-836. DOI: 10.7540/j.ynu.20220337

A keyphrase extraction algorithm based on global and local feature representation

    Abstract: Traditional keyphrase extraction algorithms tend to ignore the contextual semantic information of a document, and deep learning methods do not fully exploit important statistical features of words. To address these problems, a keyphrase extraction algorithm based on global and local feature representation is proposed. First, a deep learning model is built from a Transformer and convolutional neural networks: the multi-head attention mechanism computes a global semantic feature representation for each word, and this representation is concatenated with two statistical features of the word, its part of speech and its word frequency, to form the word's feature vector. Second, a multi-layer convolutional neural network combined with a dilated convolutional neural network efficiently captures local word features and inter-word dependencies. Finally, keyphrase extraction is treated as a sequence labeling task to extract the final keyphrases. Parameter tuning and comparison experiments on two public corpora show that the proposed algorithm outperforms existing mainstream keyphrase extraction algorithms, reaching F1 values of 49.87% on the Inspec dataset and 35.77% on the kp20k dataset, which improves the accuracy of automatic keyphrase extraction.
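    The final step above, treating extraction as sequence labeling and scoring it with F1, can be illustrated with a minimal sketch. It is not the authors' implementation: the B/I/O tag names, the helper names `decode_bio` and `precision_recall_f1`, the case-insensitive exact matching, and the toy sentence are all assumptions made for illustration, since the abstract does not state the tagging scheme or the matching rule used in the experiments.

```python
from typing import Iterable, List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect phrases from a per-token tag sequence.

    Assumed (hypothetical) scheme: "B" starts a keyphrase, "I" continues it,
    "O" is outside any keyphrase.
    """
    phrases: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:  # close the phrase that was open
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # "O", or a stray "I" with no open phrase
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:  # phrase running to the end of the sequence
        phrases.append(" ".join(current))
    return phrases


def precision_recall_f1(
    predicted: Iterable[str], gold: Iterable[str]
) -> Tuple[float, float, float]:
    """Score predicted keyphrases against gold ones.

    Assumption: case-insensitive exact match on the whole phrase.
    """
    pred = {p.lower() for p in predicted}
    ref = {g.lower() for g in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (invented data, not taken from Inspec or kp20k).
tokens = ["a", "keyphrase", "extraction", "algorithm",
          "based", "on", "dilated", "convolution"]
tags = ["O", "B", "I", "O", "O", "O", "B", "I"]

phrases = decode_bio(tokens, tags)
print(phrases)  # ['keyphrase extraction', 'dilated convolution']
print(precision_recall_f1(phrases, ["keyphrase extraction", "sequence labeling"]))
# (0.5, 0.5, 0.5)
```

    In this toy case one of the two predicted phrases matches one of the two gold phrases, so precision, recall, and F1 are all 0.5. Reported figures such as 49.87% are an F1 of this kind, although the exact averaging and normalization used in the paper may differ from this sketch.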

     
