First Kurdish Semantic Textual Similarity Dataset Released

cs.AI updates on arXiv.org, October 6

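This paper introduces the first Kurdish semantic textual similarity dataset, comprising 10,000 sentence pairs annotated for similarity, intended to support research on Kurdish semantics and low-resource NLP.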

arXiv:2510.02336v1 Announce Type: cross

Abstract: Semantic Textual Similarity (STS) measures the degree of meaning overlap between two texts and underpins many NLP tasks. While extensive resources exist for high-resource languages, low-resource languages such as Kurdish remain underserved. We present, to our knowledge, the first Kurdish STS dataset: 10,000 sentence pairs spanning formal and informal registers, each annotated for similarity. We benchmark Sentence-BERT, multilingual BERT, and other strong baselines, obtaining competitive results while highlighting challenges arising from Kurdish morphology, orthographic variation, and code-mixing. The dataset and baselines establish a reproducible evaluation suite and provide a strong starting point for future research on Kurdish semantics and low-resource NLP.
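To make the evaluation setup concrete, here is a minimal sketch of how an STS benchmark is commonly scored: each sentence pair gets a predicted similarity (here, the cosine similarity of two embedding vectors), and the predictions are compared with the human annotations via a rank correlation. The abstract does not state which metric or score scale the authors use, so Spearman correlation, the toy vectors, and the gold scores below are all illustrative assumptions rather than the paper's method or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0.0 else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical data: (embedding_a, embedding_b, gold_score) per sentence pair.
# In practice the embeddings would come from a sentence encoder.
toy_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0], 4.8),
    ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], 3.0),
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5),
]

predicted = [cosine_similarity(a, b) for a, b, _ in toy_pairs]
gold = [score for _, _, score in toy_pairs]
print(f"Spearman correlation: {spearman(predicted, gold):.3f}")
```

Rank correlation only checks that the model orders pairs the way annotators do, so it is insensitive to the scale of the gold scores; Pearson correlation is the usual alternative when the linear relationship matters.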
