Deploying the T5 11B Model with Hugging Face

This article describes how to deploy the T5 11B model for inference using Hugging Face Inference Endpoints. T5 is a Transformer-based text-to-text model widely used across NLP tasks. Deployment covers preparing the model repository, a custom handler, and additional dependencies, then deploying the custom handler as an Inference Endpoint. Finally, the endpoint is tested by sending HTTP requests from Python. The article also discusses using mixed precision and sharding to reduce the memory footprint, as well as LLM.int8() quantization to lower memory requirements further.

📚 T5 is a Transformer-based text-to-text model widely used for NLP tasks such as translation, summarization, and question answering.

🔧 Deploying T5 11B requires sharding the model and using mixed precision to fit within the memory limits of a single GPU. Concretely, the weights are converted to fp16 and then quantized with LLM.int8().

🚀 With Hugging Face Inference Endpoints, users can easily deploy the model as an inference endpoint and call it via an API. Deployment involves preparing the model repository, a custom handler, and additional dependencies, and selecting a suitable GPU configuration.

📈 After deployment, the model can be tested by sending HTTP requests from Python to confirm it works correctly. Users can also debug and optimize with the tools Hugging Face provides.

🔗 The article includes detailed code examples and step-by-step instructions to help users get started quickly.

This blog will teach you how to deploy T5 11B for inference using Hugging Face Inference Endpoints. The T5 model was presented in the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer and is one of the best-known and most widely used Transformer models today.

T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks and for which each task is converted into a text-to-text format. T5 works well on various tasks out-of-the-box by prepending a different prefix to the input corresponding to each task, e.g., for translation: translate English to German: …, for summarization: summarize: ...
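For illustration, here is a minimal sketch of the prefix convention using the much smaller t5-small checkpoint (chosen only so the snippet runs on modest hardware; the prefixes work the same way for T5 11B):

from transformers import pipeline

# t5-small stands in for T5 11B here purely to keep the example lightweight
text2text = pipeline("text2text-generation", model="t5-small")

# translation: prepend the task prefix to the input
print(text2text("translate English to German: The weather is nice today.")[0]["generated_text"])

# summarization uses a different prefix on the same model
print(text2text("summarize: T5 is an encoder-decoder model pre-trained on a multi-task mixture of unsupervised and supervised tasks.")[0]["generated_text"])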

Before we can get started, make sure you meet all of the following requirements:

    An Organization/User with an active plan and WRITE access to the model repository.
    You can access the UI at: https://ui.endpoints.huggingface.co

The Tutorial will cover how to:

    Prepare model repository, custom handler, and additional dependencies
    Deploy the custom handler as an Inference Endpoint
    Send HTTP request using Python

What is Hugging Face Inference Endpoints?

🤗 Inference Endpoints offers a secure production solution to easily deploy Machine Learning models on dedicated and autoscaling infrastructure managed by Hugging Face.

A Hugging Face Inference Endpoint is built from a Hugging Face Model Repository. It supports all Transformers and Sentence-Transformers tasks, as well as any arbitrary ML framework, through easy customization by adding a custom inference handler. This custom inference handler can be used to implement simple inference pipelines for ML frameworks like Keras, TensorFlow, and scikit-learn, or to add custom business logic to your existing transformers pipeline.
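In practice, a custom handler is a handler.py file in the model repository that defines a class named EndpointHandler with an __init__ and a __call__ method. A minimal, framework-agnostic sketch of that contract (the load_my_model loader is a hypothetical placeholder) could look like this:

from typing import Any, Dict, List

class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` points to the files of the model repository;
        # load a model from any framework here (hypothetical placeholder)
        self.model = load_my_model(path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        # the request body is passed in as a dict
        inputs = data.pop("inputs", data)
        # run inference with whatever framework was loaded above
        prediction = self.model.predict(inputs)
        return [{"prediction": prediction}]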

Tutorial: Deploy T5-11B on a single NVIDIA T4

In this tutorial, you will learn how to deploy T5 11B for inference using Hugging Face Inference Endpoints and how you can integrate it via an API into your products.

1. Prepare model repository, custom handler, and additional dependencies

T5 11B is, with 11 billion parameters, one of the largest openly available Transformer models. Its weights in float32 take 45.2GB, which is too big to deploy on an NVIDIA T4 with 16GB of GPU memory.

To be able to fit T5-11b into a single GPU, we are going to use two techniques:

    Mixed precision and sharding: converting the weights to fp16 will reduce the memory footprint by 2x, and sharding will allow us to easily place each "shard" on a GPU without the need to load the model into CPU memory first.
    LLM.int8(): introduces a new quantization technique for Int8 matrix multiplication, which cuts the memory needed for inference roughly in half while maintaining prediction quality (see the back-of-the-envelope estimate after this list). To learn more, check out this blog post or the paper.
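As a rough back-of-the-envelope estimate (assuming ~11.3 billion parameters, which matches the 45.2GB float32 checkpoint, and ignoring activations and CUDA overhead), the weight memory per precision works out as follows:

params = 11.3e9  # T5 11B has roughly 11.3 billion parameters

for name, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.1f} GB of weights")

# fp32: ~45.2 GB -> far too big for a 16GB T4
# fp16: ~22.6 GB -> still too big for a single T4
# int8: ~11.3 GB -> fits, leaving headroom for activations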

We already prepared a repository with sharded fp16 weights of T5-11B on the Hugging Face Hub at: philschmid/t5-11b-sharded. Those weights were created using the following snippet.

Note: If you want to convert the weights yourself, e.g., to deploy google/flan-t5-xxl, you need at least 80GB of memory.

import torch
from transformers import AutoModelWithLMHead
from huggingface_hub import HfApi

# load model as float16
model = AutoModelWithLMHead.from_pretrained("t5-11b", torch_dtype=torch.float16, low_cpu_mem_usage=True)

# shard the model and save it locally
model.save_pretrained("sharded", max_shard_size="2000MB")

# push to hub
api = HfApi()
api.upload_folder(
    folder_path="sharded",
    repo_id="philschmid/t5-11b-sharded-fp16",
)

After we have our sharded fp16 model weights, we can prepare the additional dependencies we need to use LLM.int8(). LLM.int8() has been natively integrated into transformers through bitsandbytes.

To add custom dependencies, we need to add a requirements.txt file to the model repository on the Hugging Face Hub, listing the Python dependencies we want to install.

accelerate==0.13.2
bitsandbytes

The last step before creating our Inference Endpoint is to create a custom inference handler. If you want to learn how to create a custom handler for Inference Endpoints, you can either check out the documentation or go through "Custom Inference with Hugging Face Inference Endpoints".

from typing import Dict, List, Any
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch


class EndpointHandler:
    def __init__(self, path=""):
        # load model and tokenizer from path
        self.model = AutoModelForSeq2SeqLM.from_pretrained(path, device_map="auto", load_in_8bit=True)
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, str]]:
        """
        Args:
            data (:obj:`dict`):
                includes the input text under "inputs" and optional generation "parameters"
        """
        # process input
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None)

        # preprocess
        input_ids = self.tokenizer(inputs, return_tensors="pt").input_ids

        # pass inputs with all kwargs in data
        if parameters is not None:
            outputs = self.model.generate(input_ids, **parameters)
        else:
            outputs = self.model.generate(input_ids)

        # postprocess the prediction
        prediction = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

        return [{"generated_text": prediction}]
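Before deploying, it can be worth smoke-testing the handler locally. The sketch below assumes you have cloned the sharded repository into ./t5-11b-sharded and are on a machine with a GPU that can hold the 8-bit weights:

# local smoke test for handler.py (not part of the deployed repository)
from handler import EndpointHandler

# point the handler at a local clone of the model repository
my_handler = EndpointHandler(path="./t5-11b-sharded")

payload = {"inputs": "translate English to German: The weather is nice today."}
print(my_handler(payload))
# -> [{"generated_text": "<German translation>"}]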

2. Deploy the custom handler as an Inference Endpoint

UI: https://ui.endpoints.huggingface.co/

Since we have prepared our model weights, dependencies, and custom handler, we can now deploy the model as an Inference Endpoint. We can deploy our custom handler the same way as a regular Inference Endpoint.

[Image: selecting the model repository (model id)]

Select the repository, the cloud, and the region. After that, we need to open the "Advanced Settings" to select GPU • small • 1x NVIDIA Tesla T4.

[Image: instance selection showing GPU • small • 1x NVIDIA Tesla T4]

The Inference Endpoint Service will check during the creation of your Endpoint if there is a handler.py available and will use it for serving requests no matter which “Task” you select.

The deployment can take 20-40 minutes due to the model size (~30GB) and the image artifact build. After deploying, we can test the endpoint using the inference widget.

[Image: inference widget]

3. Send HTTP request using Python

Hugging Face Inference Endpoints can be used with an HTTP client in any language. We will use Python and the requests library to send our requests (make sure you have it installed: pip install requests).

import json
import requests as r

ENDPOINT_URL = ""  # url of your endpoint
HF_TOKEN = ""

# payload samples
regular_payload = {"inputs": "translate English to German: The weather is nice today."}
parameter_payload = {
    "inputs": "translate English to German: Hello my name is Philipp and I am a Technical Leader at Hugging Face",
    "parameters": {
        "max_length": 40,
    },
}

# HTTP headers for authorization
headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json",
}

# send request
response = r.post(ENDPOINT_URL, headers=headers, json=parameter_payload)
generated_text = response.json()

print(generated_text)

Conclusion

That's it! We successfully deployed T5-11B to Hugging Face Inference Endpoints for less than $500.

To underline this again, we deployed one of the biggest available Transformer models to a managed, secure, scalable inference endpoint. This allows data scientists and machine learning engineers to focus on R&D and improving the model rather than fiddling with MLOps topics.

Now, it's your turn! Sign up and create your custom handler within a few minutes!


Thanks for reading! If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.
