Transparency and control have been at the forefront of our innovation since we started to build LLM-based systems for the German and broader international market in 2019. To us, they’re the foundation of sovereignty in AI. That’s why we started with multilingual research when GPT-3 could hardly handle any language beyond English.
In Europe and beyond, language is more than a means of communication. It’s a driver of business, research and collaboration, as well as a carrier of culture and values.
In addition to our focus on language fairness, especially for low-resource languages, our customers have always used our technology for the most complex and critical use cases. These are domains where specialized knowledge comes with its own vocabulary, almost behaving like a language in itself. Just imagine the language of a German engineering patent and how different it is from your average internet text.
If AI is going to serve people, it must speak their language.
Sure, 2025’s LLMs can write poetry, clean up clunky email drafts and even plan your next vacation. But when it comes to the kind of enterprise use cases that can really drive value, the cracks start to show. That’s because the language used inside companies (think contracts, technical specs, internal jargon) is often a far cry from the casual chatter of the web corpus that most LLMs are optimized for.
How do we address this gap? By breaking free of tokenizers.
A tokenizer’s job is to break text into chunks (tokens) and assign each chunk an ID number. During training, the LLM learns how to represent each token as a vector (a list of numbers) because that’s the format a neural network needs to operate.
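To make that pipeline concrete, here is a minimal Python sketch. The vocabulary, the greedy matching rule and the 8-dimensional embedding matrix are all toy values invented for illustration; production tokenizers (BPE and friends) use learned merge rules and vocabularies of tens of thousands of entries.

```python
import numpy as np

# Toy vocabulary: each chunk the tokenizer knows gets a fixed ID.
# (Invented for illustration; real vocabularies hold ~50k-250k entries.)
vocab = {"tele": 0, "phone": 1, "the": 2, " ": 3}

def tokenize(text, vocab):
    """Greedy longest-match tokenization, a simplified stand-in
    for schemes like BPE."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest chunk first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

# During training, the LLM learns one vector per token ID.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))  # toy 8-dim vectors

ids = tokenize("telephone", vocab)   # -> [0, 1]
vectors = embeddings[ids]            # what the network actually sees
print(ids, vectors.shape)            # [0, 1] (2, 8)
```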
But here’s the catch: once text is tokenized, the original characters become irrelevant. The model only sees the token ID, not the actual letters inside it. While the LLM trains, it has to infer from context, across many data samples, that telephone (English) and Telefon (German) should receive nearly the same vectors. That’s far from trivial, because the tokenizer cannot see the letters inside the words.
More critically, tokenizers are trained before LLM training even begins and cannot adapt afterwards. Each tokenizer has a fixed set of tokens, essentially its vocabulary, optimized mostly for standard English text. This English-first assumption means that if a tokenizer isn’t built for a specific language or domain, it falls back to a fragmented representation, often breaking words into tokens of just one or two characters. For example, the word Bundeskanzler (German for chancellor) is broken into four tokens, while chancellor requires only one.
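To see the fragmentation effect in action, here is a hedged sketch with a deliberately English-biased toy vocabulary. The exact splits of any production tokenizer will differ, but the pattern, one token for the English word and several fragments for the German one, is the same:

```python
# Toy English-biased vocabulary: "chancellor" is one token, but the
# German word is missing, so the tokenizer falls back to fragments.
# (Invented vocabulary; real splits vary by tokenizer.)
vocab = {"chancellor", "bund", "es", "kanz", "ler"}
vocab |= set("abcdefghijklmnopqrstuvwxyz")  # single chars as universal fallback

def tokenize(word, vocab):
    """Greedy longest-match, as in the sketch above."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("chancellor", vocab))     # ['chancellor']                 -> 1 token
print(tokenize("bundeskanzler", vocab))  # ['bund', 'es', 'kanz', 'ler']  -> 4 tokens
```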
More tokens mean more memory, more compute steps and, ultimately, higher costs. It also makes the underlying knowledge significantly harder to learn. Sometimes, that’s enough to shut down the business case entirely. That’s exactly where our tokenizer-free architecture, T-Free, comes in. It’s designed to remove this bottleneck and unlock new possibilities for efficiency and domain-specific intelligence.
How T-Free packs more punch
Instead of relying on a separate tokenizer, our LLM converts words directly into vectors, eliminating the fragmentation caused by breaking words into awkward chunks. This preserves the integrity of even rare or domain-specific terms and allows us to consistently pack more characters into each vector. While traditional LLMs average around four characters per vector, T-Free reaches nearly seven. This efficiency directly translates into lower costs, reduced energy consumption and less data required for training.
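A quick back-of-the-envelope calculation shows what that density buys. The four and seven characters-per-vector figures come from above; the quadratic attention scaling is a standard transformer assumption, not a measured T-Free number:

```python
doc_chars = 100_000                  # e.g., a long technical contract

baseline_len = doc_chars / 4         # ~4 chars per vector in standard LLMs
tfree_len = doc_chars / 7            # ~7 chars per vector with T-Free

print(f"sequence length: {baseline_len:.0f} vs {tfree_len:.0f} positions")
# Self-attention cost grows quadratically with sequence length,
# so the saving compounds:
print(f"relative attention cost: {(tfree_len / baseline_len) ** 2:.2f}x")
```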
T-Free can also exploit similarities in character patterns that are hidden from standard LLMs. It can see straight away, even before training, that “telephone” and “Telefon” are almost the same word. This built-in awareness gives fine-tuning a head start, enabling more adaptable LLMs from the get-go.
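Here is a minimal sketch of how that character-level overlap is visible before any training. Character trigrams and Jaccard similarity are used purely for illustration, in the spirit of a tokenizer-free representation rather than as a reproduction of T-Free’s actual method:

```python
def trigrams(word):
    """Character trigrams with boundary markers, lowercased."""
    w = f"_{word.lower()}_"
    return {w[i:i + 3] for i in range(len(w) - 2)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

tel_en, tel_de = trigrams("telephone"), trigrams("Telefon")
print(tel_en & tel_de)                   # shared trigrams: '_te', 'tel', 'ele'
print(f"{jaccard(tel_en, tel_de):.2f}")  # clearly > 0 before any training
print(f"{jaccard(tel_en, trigrams('contract')):.2f}")  # 0 for an unrelated word
```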
The result?
AI that’s no longer locked into English-first assumptions, bringing efficiency and capability to all languages and unlocking specialized enterprise knowledge. T-Free opens the door to sovereign AI strategies by making it practical to train on proprietary data and low-resource languages while still retaining the general-purpose capabilities we value in today’s LLMs.
And now, we’re proud to share the latest T-Free checkpoints:
Models that excel at benchmarks and offer the strongest foundation for capturing what matters most: your knowledge, language and sovereignty.
Ready to dive deeper? Read up on the full research release on our T-Free-HAT models here.
