A New Method for Bengali Text Detoxification

cs.AI updates on arXiv.org, two days ago, 13:30

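This paper proposes a new method for Bengali text detoxification that combines Pareto-optimized LLMs with Chain-of-Thought (CoT) prompting. The authors also construct the BanglaNirTox dataset and report significantly improved detoxification quality.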

arXiv:2511.01512v1 (Announce Type: cross)

Abstract: Toxic language in Bengali remains prevalent, especially in online environments, with few effective precautions against it. Although text detoxification has seen progress in high-resource languages, Bengali remains underexplored due to limited resources. In this paper, we propose a novel pipeline for Bengali text detoxification that combines Pareto class-optimized large language models (LLMs) and Chain-of-Thought (CoT) prompting to generate detoxified sentences. To support this effort, we construct BanglaNirTox, an artificially generated parallel corpus of 68,041 toxic Bengali sentences with class-wise toxicity labels, reasonings, and detoxified paraphrases, using Pareto-optimized LLMs evaluated on random samples. The resulting BanglaNirTox dataset is used to fine-tune language models to produce better detoxified versions of Bengali sentences. Our findings show that Pareto-optimized LLMs with CoT prompting significantly enhance the quality and consistency of Bengali text detoxification.
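
The abstract does not say which objectives or which models the Pareto step uses. As a rough sketch only, assume each candidate LLM has been scored on a random sample along several higher-is-better axes. The model names, the metric names `detox`, `meaning`, and `fluency`, and all numbers below are hypothetical and are not taken from the paper. Selecting the non-dominated models could then look like this:

```python
from typing import Dict, List

# Hypothetical per-model scores; higher is better on every axis.
# Names and numbers are illustrative only, not results from the paper.
Scores = Dict[str, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every metric and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]


candidates = {
    "model_a": {"detox": 0.90, "meaning": 0.70, "fluency": 0.80},
    "model_b": {"detox": 0.80, "meaning": 0.85, "fluency": 0.80},
    "model_c": {"detox": 0.75, "meaning": 0.70, "fluency": 0.75},
}

front = pareto_front(candidates)
print(front)
```

Here `model_c` drops out because `model_a` matches or beats it on every axis, while `model_a` and `model_b` both stay because each wins on a different axis. A pipeline like the one described might then apply CoT prompting only to the models on this front, though the abstract leaves that detail open.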

Tags

Text detoxification, Bengali, LLMs, CoT prompting, BanglaNirTox dataset