Research on multi-task learning methods for traditional Chinese medicine visual question answering

    Abstract: With the rapid advancement of artificial intelligence technologies, visual question answering (VQA) has gained increasing attention in the field of traditional Chinese medicine (TCM). Given an image containing medicinal herbs, a VQA model can answer questions about the herbs depicted, offering a more intuitive and accessible way for people to learn about TCM and promoting the dissemination and popularization of TCM culture. However, the severe lack of specialized datasets and the limited adaptability of existing VQA models to TCM-specific tasks pose significant challenges. To address these issues, this study constructs a dedicated TCM VQA dataset and proposes a visual question answering method for traditional Chinese medicine based on multi-task learning (TCMML). The proposed model integrates Faster R-CNN and Chinese BERT to extract image and text features, respectively, and employs an end-to-end joint attention network based on self-attention and cross-attention mechanisms for feature fusion. Additionally, the model adopts a multi-task learning strategy, leveraging a shared task layer to achieve modality alignment and capture inter-task dependencies, while five task-specific expert modules address the five sub-tasks of TCM VQA, enabling the model to generate highly accurate answers. Experimental results demonstrate that the TCMML model achieves superior accuracy on TCM visual question answering tasks compared with existing mainstream models, validating the effectiveness of the multi-task learning strategy in this domain.
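The fusion-plus-experts design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, attention configuration, pooling choice, and the five answer-vocabulary sizes are all assumptions; in practice the image features would come from Faster R-CNN region proposals and the text features from Chinese BERT token embeddings, both projected to a common dimension.

```python
# Hedged sketch of a joint-attention fusion network with a shared layer
# and per-sub-task expert heads. All dimensions and vocabulary sizes are
# illustrative assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn


class JointAttentionFusion(nn.Module):
    """Self-attention within each modality, then cross-attention
    in which text tokens query image regions."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.img_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, txt_feats):
        img, _ = self.img_self(img_feats, img_feats, img_feats)
        txt, _ = self.txt_self(txt_feats, txt_feats, txt_feats)
        fused, _ = self.cross(txt, img, img)  # text attends to regions
        return fused.mean(dim=1)              # pooled joint representation


class MultiTaskVQA(nn.Module):
    """Shared task layer followed by one expert head per sub-task."""

    def __init__(self, dim=512, num_answers=(10, 20, 5, 8, 15)):
        super().__init__()
        self.fusion = JointAttentionFusion(dim)
        self.shared = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.experts = nn.ModuleList(nn.Linear(dim, n) for n in num_answers)

    def forward(self, img_feats, txt_feats):
        h = self.shared(self.fusion(img_feats, txt_feats))
        return [expert(h) for expert in self.experts]


# Example: 36 region features (as a Faster R-CNN detector might yield)
# and 20 token features (as Chinese BERT might yield), pre-projected
# to the shared dimension. Output: one logit tensor per sub-task.
model = MultiTaskVQA()
img = torch.randn(2, 36, 512)
txt = torch.randn(2, 20, 512)
logits = model(img, txt)
```

Training such a model would typically sum a cross-entropy loss over the five expert heads, letting the shared layer learn information correlated across sub-tasks.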

     
