Llama 2 Deployment Optimization Guide

Originally published on the Hugging Face Blog

Deploying large language models (LLMs) and other generative AI models can be challenging due to their computational requirements and latency needs. To provide useful recommendations to companies looking to deploy Llama 2 on Amazon SageMaker with the Hugging Face LLM Inference Container, we created a comprehensive benchmark analyzing over 60 different deployment configurations for Llama 2.

In this benchmark, we evaluated varying sizes of Llama 2 on a range of Amazon EC2 instance types with different load levels. Our goal was to measure latency (ms per token) and throughput (tokens per second) to find the optimal deployment strategies for three common use cases:

- Most Cost-Effective Deployment: For users looking for good performance at low cost
- Best Latency Deployment: Minimizing latency for real-time services
- Best Throughput Deployment: Maximizing tokens processed per second

To keep this benchmark fair, transparent, and reproducible, we share all of the assets, code, and data we used and collected below.

We hope to enable customers to use LLMs and Llama 2 efficiently and optimally for their use case. Before we get into the benchmark and data, let's look at the technologies and methods we used.

What is the Hugging Face LLM Inference Container?

The Hugging Face LLM DLC is a purpose-built inference container that makes it easy to deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), an open-source, purpose-built solution for deploying and serving LLMs. TGI enables high-performance text generation using tensor parallelism and dynamic batching for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Falcon, Llama, and T5. VMware, IBM, Grammarly, Open-Assistant, Uber, Scale AI, and many others already use Text Generation Inference.
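To make this concrete, here is a minimal sketch of how such a deployment typically looks with the SageMaker Python SDK. It assumes an existing SageMaker execution role and Hugging Face Hub access for the gated Llama 2 weights; the model ID, environment values, and instance type are illustrative, not the exact benchmark configuration.

```python
# Minimal sketch: deploying Llama 2 13B with the Hugging Face LLM DLC (TGI) on SageMaker.
# Assumes the `sagemaker` Python SDK is installed and a SageMaker execution role exists.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN when running outside SageMaker

# Retrieve the TGI-powered Hugging Face LLM container image (version used in this benchmark: 1.0.3)
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-13b-chat-hf",  # gated model, also needs HUGGING_FACE_HUB_TOKEN
        "SM_NUM_GPUS": "1",              # tensor parallelism degree (number of GPUs to shard across)
        "MAX_INPUT_LENGTH": "2048",      # illustrative values, tune for your use case
        "MAX_TOTAL_TOKENS": "4096",
        # "HF_MODEL_QUANTIZE": "gptq",   # enable GPTQ quantization (see the GPTQ section below)
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    container_startup_health_check_timeout=600,  # large models need time to download and load
)

print(predictor.predict({"inputs": "What is Amazon SageMaker?"}))
```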

What is Llama 2?

Llama 2 is a family of LLMs from Meta, trained on 2 trillion tokens. Llama 2 comes in three sizes - 7B, 13B, and 70B parameters - and, compared to Llama 1, introduces key improvements like a longer context length, commercial licensing, and optimized chat abilities through reinforcement learning. If you want to learn more about Llama 2, check out this blog post.

What is GPTQ?

GPTQ is a post-training quantization method to compress LLMs, like GPT. GPTQ compresses GPT (decoder) models by reducing the number of bits needed to store each weight in the model, from 32 bits down to just 3-4 bits. This means the model takes up much less memory and can run on less hardware, e.g. a single GPU for 13B Llama 2 models. GPTQ analyzes each layer of the model separately and approximates the weights to preserve the overall accuracy. If you want to learn more about GPTQ and how to use it, check out Optimize open LLMs using GPTQ and Hugging Face Optimum.
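As a rough illustration, here is how a Llama 2 checkpoint can be quantized with GPTQ through the Transformers/Optimum integration. This sketch assumes `transformers`, `optimum`, and `auto-gptq` are installed and a GPU is available; the model ID and calibration dataset are placeholders.

```python
# Minimal sketch: 4-bit GPTQ quantization with Hugging Face Transformers + Optimum.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-13b-hf"  # gated model, requires an accepted license and HF token
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates on a small dataset and approximates each layer's weights in 4 bits
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

# Save (and optionally push to the Hub) so TGI can load it with HF_MODEL_QUANTIZE=gptq
quantized_model.save_pretrained("llama-2-13b-gptq")
tokenizer.save_pretrained("llama-2-13b-gptq")
```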

Benchmark

To benchmark the real-world performance of Llama 2, we tested three model sizes (7B, 13B, and 70B parameters) on four different instance types with four different load levels, resulting in 60 different configurations:

- Models: We evaluated all currently available model sizes, including 7B, 13B, and 70B.
- Concurrent Requests: We tested configurations with 1, 5, 10, and 20 concurrent requests to determine the performance in different usage scenarios.
- Instance Types: We evaluated different GPU instances: g5.2xlarge, g5.12xlarge, and g5.48xlarge powered by NVIDIA A10G GPUs, and p4d.24xlarge powered by NVIDIA A100 40GB GPUs.
- Quantization: We compared performance with and without quantization, using GPTQ 4-bit as the quantization technique.

As metrics, we used Throughput and Latency defined as:

- Throughput (tokens/sec): Number of tokens being generated per second.
- Latency (ms/token): Time it takes to generate a single token.

We used those metrics to evaluate the performance of Llama 2 across the different setups and understand the benefits and tradeoffs. If you want to run the benchmark yourself, we created a GitHub repository.
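For intuition on how these two metrics can be collected, here is a simplified, hypothetical measurement loop (not the actual benchmark harness from the repository). It sends a fixed number of concurrent requests to a SageMaker endpoint and assumes each request generates exactly `max_new_tokens` tokens; the endpoint name and request parameters are placeholders.

```python
# Simplified illustration of the two metrics: latency (ms/token) and throughput (tokens/sec).
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

ENDPOINT_NAME = "llama-2-13b-endpoint"  # hypothetical endpoint name
CONCURRENT_REQUESTS = 5
NEW_TOKENS = 256  # simplification: assume every request generates exactly this many tokens

smr = boto3.client("sagemaker-runtime")

def generate(prompt: str) -> float:
    """Send one request and return its wall-clock duration in seconds."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": NEW_TOKENS}}
    start = time.perf_counter()
    smr.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENT_REQUESTS) as pool:
    durations = list(pool.map(generate, ["Explain GPTQ briefly."] * CONCURRENT_REQUESTS))
total = time.perf_counter() - start

latency_ms_per_token = (sum(durations) / len(durations)) / NEW_TOKENS * 1000
throughput_tokens_per_sec = CONCURRENT_REQUESTS * NEW_TOKENS / total

print(f"latency: {latency_ms_per_token:.1f} ms/token")
print(f"throughput: {throughput_tokens_per_sec:.1f} tokens/sec")
```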

You can find the full data of the benchmark in the Amazon SageMaker Benchmark: TGI 1.0.3 Llama 2 sheet. The raw data is available on GitHub.

If you are interested in all of the details, we recommend diving into the provided raw data.

Recommendations & Insights

Based on the benchmark, we provide specific recommendations for optimal LLM deployment depending on your priorities between cost, throughput, and latency for all Llama 2 model sizes.

Note: The recommendations are based on the configurations we tested. In the future, other environments or hardware offerings, such as Inferentia2, may be even more cost-efficient.

Most Cost-Effective Deployment

The most cost-effective configuration focuses on the right balance between performance (latency and throughput) and cost. The goal is to maximize the output per dollar spent. We looked at the performance with 5 concurrent requests. We can see that GPTQ offers the best cost-effectiveness, allowing customers to deploy Llama 2 13B on a single GPU.

| Model | Quantization | Instance | Concurrent requests | Latency (ms/token, median) | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | GPTQ | g5.2xlarge | 5 | 34.245736 | 120.0941633 | $1.52 | 138.78 | $3.50 |
| Llama 2 13B | GPTQ | g5.2xlarge | 5 | 56.237484 | 71.70560104 | $1.52 | 232.43 | $5.87 |
| Llama 2 70B | GPTQ | ml.g5.12xlarge | 5 | 138.347928 | 33.33372399 | $7.09 | 499.99 | $59.08 |
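The last two columns follow directly from the measured throughput and the hourly instance price. A quick sketch of that arithmetic, using the 70B row above:

```python
# How the "Time to generate 1M tokens" and "Cost to generate 1M tokens" columns
# are derived from measured throughput and the hourly instance price.
def time_and_cost_per_million_tokens(tokens_per_sec: float, price_per_hour: float):
    minutes = 1_000_000 / tokens_per_sec / 60
    cost = minutes / 60 * price_per_hour
    return minutes, cost

# Llama 2 70B (GPTQ) on ml.g5.12xlarge, numbers from the table above:
minutes, cost = time_and_cost_per_million_tokens(33.33372399, 7.09)
print(f"{minutes:.2f} minutes, ${cost:.2f}")  # ~499.99 minutes, ~$59.08
```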

Best Throughput Deployment

The Best Throughput configuration maximizes the number of tokens that are generated per second. This might come at the cost of some increase in per-request latency, since more tokens are processed simultaneously. We looked at the highest tokens-per-second performance during 20 concurrent requests, while also taking the cost of the instance into account. The highest throughput was for Llama 2 13B on the ml.p4d.24xlarge instance with 688 tokens/sec.

| Model | Quantization | Instance | Concurrent requests | Latency (ms/token, median) | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | None | ml.g5.12xlarge | 20 | 43.99524 | 449.9423027 | $7.09 | 37.04 | $4.38 |
| Llama 2 13B | None | ml.g5.12xlarge | 20 | 67.4027465 | 295.6378071 | $7.09 | 18.72 | $2.21 |
| Llama 2 70B | None | ml.p4d.24xlarge | 20 | 59.798591 | 321.5369158 | $37.69 | 16.61 | $10.43 |
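For a throughput-oriented setup of the 70B model, the deployment sketch from earlier mostly changes in its environment and instance type: the model is sharded across all eight A100 GPUs of a p4d.24xlarge via tensor parallelism. The values below are illustrative, not the exact benchmark settings.

```python
# Throughput-oriented variation of the earlier deployment sketch (illustrative values).
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",  # gated model, needs HUGGING_FACE_HUB_TOKEN
        "SM_NUM_GPUS": "8",                 # tensor parallelism across all 8 A100 GPUs
        "MAX_INPUT_LENGTH": "2048",
        "MAX_TOTAL_TOKENS": "4096",
        "MAX_BATCH_TOTAL_TOKENS": "8192",   # lets TGI batch more tokens across concurrent requests
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=900,  # 70B weights take a while to load
)
```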

Best Latency Deployment

The Best Latency configuration minimizes the time it takes to generate one token. Low latency is important for real-time use cases and for providing a good experience to the customer, e.g. chat applications. We looked at the lowest median latency (ms per token) during 1 concurrent request. The lowest overall latency was for Llama 2 7B on the ml.g5.12xlarge instance with 16.8 ms/token.

| Model | Quantization | Instance | Concurrent requests | Latency (ms/token, median) | Throughput (tokens/sec) | On-demand cost ($/h) in us-west-2 | Time to generate 1M tokens (min) | Cost to generate 1M tokens ($) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 2 7B | None | ml.g5.12xlarge | 1 | 16.812526 | 61.45733054 | $7.09 | 271.19 | $32.05 |
| Llama 2 13B | None | ml.g5.12xlarge | 1 | 21.002715 | 47.15736567 | $7.09 | 353.43 | $41.76 |
| Llama 2 70B | None | ml.p4d.24xlarge | 1 | 41.348543 | 24.5142928 | $37.69 | 679.88 | $427.05 |

Conclusions

In this benchmark, we tested 60 configurations of Llama 2 on Amazon SageMaker. For cost-effective deployments, we found 13B Llama 2 with GPTQ on g5.2xlarge delivers 71 tokens/sec at an hourly cost of $1.55. For max throughput, 13B Llama 2 reached 296 tokens/sec on ml.g5.12xlarge at $2.21 per 1M tokens. And for minimum latency, 7B Llama 2 achieved 16ms per token on ml.g5.12xlarge.

We hope the benchmark will help companies deploy Llama 2 optimally based on their needs. If you want to get started deploying Llama 2 on Amazon SageMaker, check out the Introducing the Hugging Face LLM Inference Container for Amazon SageMaker and Deploy Llama 2 7B/13B/70B on Amazon SageMaker blog posts.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
