cs.AI updates on arXiv.org 10月09日 12:05
科学领域最优词表示算法研究
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文针对科学领域,研究最优词表示算法及其分词方法,旨在为下游科学领域NLP任务提供支持,并构建一个全面的评估套件以评估不同算法。

arXiv:2510.06244v1 Announce Type: cross Abstract: Finding an optimal word representation algorithm is particularly important in terms of domain specific data, as the same word can have different meanings and hence, different representations depending on the domain and context. While Generative AI and transformer architecture does a great job at generating contextualized embeddings for any given work, they are quite time and compute extensive, especially if we were to pre-train such a model from scratch. In this work, we focus on the scientific domain and finding the optimal word representation algorithm along with the tokenization method that could be used to represent words in the scientific domain. The goal of this research is two fold: 1) finding the optimal word representation and tokenization methods that can be used in downstream scientific domain NLP tasks, and 2) building a comprehensive evaluation suite that could be used to evaluate various word representation and tokenization algorithms (even as new ones are introduced) in the scientific domain. To this end, we build an evaluation suite consisting of several downstream tasks and relevant datasets for each task. Furthermore, we use the constructed evaluation suite to test various word representation and tokenization algorithms.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

词表示算法 科学领域 NLP 分词方法 评估套件
相关文章