philschmid RSS feed 09月30日 19:11
亚马逊Bedrock Titan Embeddings评测:性能与价格分析
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

亚马逊近期推出了Bedrock服务,提供基于基础模型的文本、图像和聊天API访问。本文评测了其中的Titan Embeddings模型,使用MTEB基准测试其在多种任务上的表现。评测发现,Titan Embeddings在性能上落后于顶级开源模型,且价格也显著高于开源模型。此外,Bedrock服务缺乏批处理功能和分词器支持,导致使用不便。评测结果对考虑使用Titan Embeddings的开发者提供了重要参考。

🔍 Titan Embeddings是亚马逊Bedrock服务的一部分,支持处理长达8K个token的文本输入,输出1536维度的嵌入向量,覆盖25种语言,适用于文本检索、语义相似度和聚类等场景。

📊 在MTEB基准测试中,Titan Embeddings的平均得分为66.01,明显低于BGE和GTE等顶级开源模型(平均74.35),也落后于OpenAI的Embedding模型。

💰 亚马逊Bedrock的定价为每1M tokens 0.1美元,与OpenAI持平,但比运行开源模型如bge-base-en-v1.5(74.35)高出125倍。

👥 Bedrock服务目前不支持批处理,一次只能处理一个文档,导致计算和网络开销巨大,吞吐量仅为每秒6个文档(单线程),嵌入100万文档需要约46小时。

🤔 Titan Embeddings声称支持25种语言,但缺乏相关文档说明,使得多语言支持的实际价值大打折扣。此外,模型不支持8K token输入时提供的分词器,增加了使用难度。

Update (November 29, 2023): Since the time of writing the blog AWS reached out to me and also released new features and added missing information:

    Amazon Bedrock now supports batch inferenceDocumentation has been added detailing the supported languages, which now includes: English, Arabic, Chinese (Sim.), French, German, Hindi, Japanese, Spanish, Czech, Filipino, Hebrew, Italian, Korean, Portuguese, Russian, Swedish, Turkish, Chinese (trad), Dutch, Kannada, Malayalam, Marathi, Polish, Tamil, Telugu and "others".

I've left the original content of my post intact to preserve the context around what features were initially missing.

Amazon recently announced Bedrock, a fully managed service that provides access to foundation models for text, image, and chat through an API. One of the initial models included is Titan Embeddings, which generates vector representations of input text. Embeddings can be a powerful tool for semantic search, document clustering, and finding semantic similarity. Over the last few weeks, I took a closer look at Amazon Titan Embeddings using the Multitask Text Embedding Benchmark (MTEB).

MTEB is a comprehensive benchmark to measure how well text embedding models capture semantics and meaning. It includes tasks like semantic similarity, text classification, and information retrieval. By evaluating Titan Embeddings on MTEB, we can get a sense of its capabilities on a diverse set of embedding tasks. Key questions we will explore through the post will be:

    How does Titan Embeddings compare to other state-of-the-art embedding models on MTEB?What kinds of use cases could Titan Embeddings be well-suited for based on its performance?

What is Amazon Titan Embeddings?

The Titan Embeddings model is one of the initial offerings included with Amazon Bedrock, providing text embedding capabilities. Amazon states that Titan Embeddings is a "large language model" capable of handling up to 8K tokens of input text and outputting embeddings with a dimension of 1536. It supports over 25 languages and is intended for use cases like text retrieval, semantic similarity, and clustering.

However, Amazon has not provided details on the architecture, training data, or model performance of Titan Embeddings. The lack of transparency is disappointing. Without further technical documentation, potential users have little insight into the quality or biases of the Titan embeddings.

The code examples are easy and straightforward using the Bedrock API to generate embeddings but lack any evaluation of the resulting emebeddings. Are they high quality? Do they capture semantics well across languages and use cases? What languages are included in the 25? How do they compare to other publicly available embedding models? These are open questions for those considering using Titan.

Ideally, Amazon would release a technical report with an in-depth analysis of Titan's embeddings on established benchmarks like MTEB. The lack of transparency makes me doubt the claims that Titan is a cutting-edge embedding model.

Evaluating Titan Embeddings on MTEB

Based on my doubts, I evaluated Amazon Titan Embeddings on MTEB, a comprehensive benchmark for evaluating text embedding models across a diverse set of tasks and datasets. With over 50 datasets covering 8 task types and 112 languages, MTEB aims to provide a holistic assessment of embedding models to determine the current state-of-the-art across different use cases. Hugging Face hosts an open Leaderboard for the MTEB benchmark.

At the time of writing, Amazon Bedrock is not offering batching, meaning I could only send one document at a time with strict load limits. This made it impossible to evaluate Titan Embeddings on all 50 datasets. I hand-selected 8 datasets across all tasks:

    MSMARCO: A collection of datasets focused on deep learning in searchArguAna: A Full-Text Learning to Rank Dataset for Medical Information RetrievalAmazon Counterfactual Classification: Amazon customer reviews annotated for counterfactual detection pair classification.Banking77: Online banking queries annotated with their corresponding intents.SciDocsRR: Ranking of related scientific papers based on their title.SciFact: Scientific claims using evidence from the research literature containing scientific paper abstracts.NQ: Question answering dataset of google.com query and a corresponding Wikipedia page.TRECCOVID: Scientific articles related to the COVID-19 pandemic

Note: Evaluating Titan on those datasets took ~3 weeks and cost ~150$.

I wrote a small custom module to benchmark Amazon Bedrock with the MTEBpython package, which allowed me to use the Bedrock API. You can find the code on Github to reproduce my results. During the evaluation, I ran into several painful errors, all included in the “The ugly parts about Titan Embedding” section.

Okay, and now, how good is Amazon Titan Embedding? Well, not as good as you might think. I compared the performance of Titan to the top 3 open-source models and OpenAIs Embedding model (text-embedding-ada-002). Amazon Titan achieves an average score of 66.01, the worst of the five evaluated models. The best open-source model achieves an average of 74.35. You can find an excerpt of the results in the table below or in this Google Sheet.

But I didn’t stop there. I also took a look at the pricing. How does Amazon Bedrock compare to OpenAI or to hosting open-source Models?

For The pricing of open-source models, I ran a benchmark with Text-Embedding-Inference using documents with a sequence length of 512 and batch size of 32. The duration was then multiplied by the compute cost of the instance.

Amazon Bedrock is equally expensive as OpenAI with $0.1 per 1M tokens, or 125x more expensive than running bge-base-en-v1.5 (74.35).

Amazon Titan Embeddings Model is not only significantly worse than current open source models, it is also significantly more expensive than open source models.

The ugly parts about Amazon Bedrock and Titan Embeddings

While testing Titan Embedding, I encountered several limitations and pain points that made using the model and evaluation difficult:

    The model supposedly supports 25 languages, but I couldn’t find any documentation about what languages. That makes the whole point of multilingualism useless.Titan should accept up to 8K tokens of text input but doesn’t provide a tokenizer. What does Amazon think how someone should know how long your document is? During the evaluation, I ran into Too many input tokens. Max input tokens: 8192, request input token count: 18422. To overcome this, I recursively reduced the length of the document until the error was gone.Amazon Bedrock is currently not supporting batching. This means you can only send one document at a time, which is a huge waste of computing and additional network overhead, especially for embeddings. During my test, I only achieved a through of 6 documents per second (single-threaded). For example, embedding 1 million documents would take ~46 hours.

Going back to our initial questions from the introduction.

How does Titan Embeddings compare to other state-of-the-art embedding models on MTEB?

    Amazon Titan Embeddings not only lags behind top open-source models like BGE and GTE, it is also worse than competitive managed solutions.

What kinds of use cases could Titan Embeddings be well-suited for based on its performance?

    At the current state, I can not recommend investing in building applications on Amazon Titan Embeddings.

Conclusion

Overall, evaluating and using Titan Embeddings was less exciting and smooth than I thought when starting the benchmark. The benchmark of Amazon's new Titan Embeddings and Amazon Bedrock showed significant room for improvement.

The service lacks critical features like batching and access to Tokenizer, which are standards for building production-ready applications. The low throughput and quota limits makes it almost impoossible for Titan Embeddings to be used in real applications.

If Amazonians are reading this and are interested in collaborating to improve Titan Embeddings, please let me know. I believe with transparency and technical collaboration, Bedrock could become a very good solution for using hosted foundations and embedding models.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

亚马逊Bedrock Titan Embeddings 文本嵌入 MTEB基准测试 性能评测 开源模型 定价分析
相关文章