Paper title
Deep learning models for representing out-of-vocabulary words
Paper authors
Paper abstract
Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, the quality of distributed representation models is affected by new words that appear frequently or that result from spelling errors. These words, which are unknown to the models and are called out-of-vocabulary (OOV) words, must be handled properly so that they do not degrade the quality of natural language processing (NLP) applications, which depend on appropriate vector representations of the texts. To better understand this problem and find the best techniques for handling OOV words, in this study we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using different NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. Although the results indicated that the best technique for handling OOV words differs for each task, Comick, a deep learning method that infers the embedding based on the context and the morphological structure of the OOV word, obtained promising results.
