Paper title
Biomedical Entity Representations with Synonym Marginalization
Authors
Abstract
Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use a model-based candidate selection and maximize the marginal likelihood of the synonyms present in top candidates. Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BioSyn consistently outperforms previous state-of-the-art models, almost reaching the upper bound on each dataset.
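
The abstract does not spell out the training objective, so the following is only a minimal sketch of what "maximizing the marginal likelihood of the synonyms present in top candidates" could look like. It assumes that each candidate already has a similarity score against the input mention, that candidate probabilities are a softmax over the top-k scores, and that the loss is the negative log of the probability mass falling on candidates that are true synonyms. The function names `top_k_candidates` and `marginal_nll` and the handling of the no-synonym case are hypothetical and are not taken from the paper.

```python
import math


def top_k_candidates(scores, k):
    """Return the indices of the k highest-scoring candidates, best first.

    Illustrative stand-in for model-based candidate selection: the scores
    are assumed to come from the current model, so the selected set changes
    as the model is updated.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def marginal_nll(scores, is_synonym):
    """Negative log marginal probability of the synonym candidates.

    scores:      similarity scores of the selected candidates (floats)
    is_synonym:  booleans, True where the candidate is a synonym of the mention
    """
    if len(scores) != len(is_synonym):
        raise ValueError("scores and is_synonym must have the same length")
    if not any(is_synonym):
        # Assumption: a mention with no synonym among its candidates
        # contributes no loss.
        return 0.0
    # Softmax with the usual max-shift for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    # Marginalize: add up the probability of every synonym candidate.
    positive = sum(e for e, y in zip(exps, is_synonym) if y)
    return -math.log(positive / total)


# Usage: pick the top 3 of 5 scored candidates, then compute the loss.
all_scores = [0.2, 1.5, 0.9, -0.3, 1.1]
all_labels = [False, True, False, False, True]
picked = top_k_candidates(all_scores, 3)
loss = marginal_nll([all_scores[i] for i in picked],
                    [all_labels[i] for i in picked])
```

Under these assumptions, the non-synonym candidates that remain in the top k act as the negative samples. Re-running the selection with updated scores is one way to read the abstract's statement that the candidates are iteratively updated to contain more difficult negatives.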
