Aleph Alpha · September 28, 23:41
Transparency and Control: Multilingual LLM Innovation


Transparency and control have been at the forefront of our innovation since we started to build LLM-based systems for the German and broader international market in 2019. To us, they’re the foundation of sovereignty in AI. That’s why we started with multilingual research when GPT-3 could hardly handle any language beyond English.

And not just in Europe: language is more than a means to communicate. It’s a driver of business, research and collaboration, as well as a carrier of culture and values.

In addition to our focus on language fairness, especially for low-resource languages, our customers have always used our technology for the most complex and critical use cases. These are domains where specialized knowledge comes with its own vocabulary, almost behaving like a language in itself. Just imagine the language of a German engineering patent and how different it is from your average internet text.

If AI is going to serve people, it must speak their language.

Sure, 2025’s LLMs can write poetry, clean up clunky email drafts and even plan your next vacation. But when it comes to the kind of enterprise use cases that can really drive value, the cracks start to show. That’s because the language used inside companies (think contracts, technical specs, internal jargon) is often a far cry from the casual chatter of the web corpus that most LLMs are optimized for.

How do we address this gap? By breaking free of tokenizers.

A tokenizer’s job is to break text into chunks (tokens) and assign each chunk an ID number. During training, the LLM learns how to represent each token as a vector (a list of numbers) because that’s the format a neural network needs to operate. 
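The pipeline above can be sketched in a few lines. This is a toy illustration, not any real tokenizer: the tiny vocabulary, the greedy longest-match rule and the random embedding matrix are all stand-ins for what a production system would learn from data.

```python
# Toy sketch: text -> token IDs -> vectors. The vocabulary and embeddings
# here are made up for illustration; a real LLM learns both from data.
import numpy as np

VOCAB = {"tele": 0, "phone": 1, "Tele": 2, "fon": 3}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match split of `text` into vocabulary chunks."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest chunk first
            if text[i:j] in VOCAB:
                ids.append(VOCAB[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

# Each ID indexes a row of a learned embedding matrix; the network only
# ever sees these rows, never the letters that produced them.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((len(VOCAB), 8))

ids = tokenize("telephone")     # [0, 1]
vectors = embeddings[ids]       # shape (2, 8)
```

Note that `tokenize("Telefon")` yields `[2, 3]`, IDs that share nothing with `[0, 1]`: from the model’s perspective the two words are unrelated symbols.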

But, here’s the catch: once text is tokenized, the original characters become irrelevant. The model only sees the token ID, not the actual letters inside it. That means that while the LLM trains, it has to infer from context and many data samples that telephone (English) and Telefon (German) should receive nearly the same vectors. This is something that’s far from trivial with tokenizers because they cannot see the letters inside the words.

More critically, tokenizers are trained before the LLM training even begins and cannot adapt thereafter. Each tokenizer has a fixed set of tokens, essentially its vocabulary, optimized mostly for standard English text. This English-first assumption means that if a tokenizer isn’t built for a specific language or domain, it falls back to a fragmented representation, often breaking words into tokens of just one or two characters. For example, the word Bundeskanzler (German for chancellor) is broken down into four tokens, while chancellor requires only one.
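The fragmentation effect can be reproduced with a toy English-first vocabulary (again hypothetical, not taken from any real tokenizer): a common English word gets a single token, while an unseen German compound falls back to short pieces.

```python
# Toy sketch of English-first fragmentation. The vocabulary is invented:
# it contains whole English words plus small fallback fragments, so a
# German compound shatters into several tokens.
VOCAB = {"chancellor", "bund", "es", "kanz", "ler"} | set("abcdefghijklmnopqrstuvwxyz")

def tokenize(word: str) -> list[str]:
    """Greedy longest-match split into known chunks."""
    word, chunks, i = word.lower(), [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                chunks.append(word[i:j])
                i = j
                break
    return chunks

tokenize("chancellor")     # ['chancellor']            -> 1 token
tokenize("bundeskanzler")  # ['bund', 'es', 'kanz', 'ler'] -> 4 tokens
```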

More tokens mean more memory, more compute steps and, ultimately, higher costs. It also makes the underlying knowledge significantly harder to learn. Sometimes, that’s enough to shut down the business case entirely. That’s exactly where our tokenizer-free architecture, T-Free, comes in. It’s designed to remove this bottleneck and unlock new possibilities for efficiency and domain-specific intelligence.

How T-Free packs more punch

Instead of relying on a separate tokenizer, our LLM converts words directly into vectors, eliminating the fragmentation caused by breaking words into awkward chunks. This preserves the integrity of even rare or domain-specific terms and allows us to consistently pack more characters into each vector. While traditional LLMs average around four characters per vector, T-Free reaches nearly seven. This efficiency directly translates into lower costs, reduced energy consumption and less data required for training.

T-Free can also use similarities in character patterns that are hidden to standard LLMs. It can see straightaway, even before training, that “telephone” and “Telefon” are almost the same word. This built-in awareness gives fine-tuning a head start, enabling more adaptable LLMs from the get-go.
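One simple way to see this surface-level similarity is to compare character trigrams. The post doesn’t specify T-Free’s exact mechanism, so this is only an illustration of the underlying idea: overlapping character patterns are visible before any training happens.

```python
# Character-trigram overlap as a stand-in for "character pattern
# similarity". This is an illustrative measure, not T-Free's actual scheme.
def trigrams(word: str) -> set[str]:
    w = word.lower()
    return {w[i:i + 3] for i in range(len(w) - 2)}

def jaccard(a: str, b: str) -> float:
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

# "telephone" and "Telefon" share trigrams ("tel", "ele") out of the box,
# while an unrelated word shares none.
jaccard("telephone", "Telefon")   # 0.2
jaccard("telephone", "contract")  # 0.0
```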

The result?

AI that’s no longer locked into English-first assumptions, bringing efficiency and capability to all languages and enabling specialized enterprise knowledge. T-Free opens the door to sovereign AI strategies by making it more practical to train on proprietary data, low-resource languages and still retain the general-purpose capabilities we value in today’s LLMs.

And now, we’re proud to share the latest T-Free checkpoints:

Models that excel at benchmarks and offer the strongest foundation for capturing what matters most: your knowledge, language and sovereignty.

The post Breaking Free of Tokenizers: Why We Built T-Free and What It Means for Sovereign AI appeared first on Aleph Alpha.