Deploying the Falcon Models on Amazon SageMaker

 

This article explains how to deploy the open-source Falcon LLMs (7B & 40B) on Amazon SageMaker. The models are released under the Apache 2.0 license and allow commercial use. The article walks through the complete workflow, from setting up the development environment and retrieving the Hugging Face LLM inference container to deploying the Falcon 40B model and running inference. Deployment uses an ml.g5.12xlarge instance (4 NVIDIA A10G GPUs), and the article discusses adjusting the input length and quantization configuration to optimize performance.

💻 The article details the complete workflow for deploying the open-source Falcon LLMs (7B & 40B) on Amazon SageMaker, including setting up the development environment, retrieving the Hugging Face LLM inference container, deploying the model, and running inference, giving developers a practical deployment guide.

🔧 It shows how to deploy the model on an ml.g5.12xlarge instance (with 4 NVIDIA A10G GPUs) and discusses configuration changes for long inputs (>1k tokens), such as enabling int-8 quantization for a smaller model memory footprint (at the cost of slightly higher latency).

📚 It highlights that the Falcon models are released under the Apache 2.0 license and support commercial use, lowering the barrier for companies to adopt generative AI, and notes that Falcon 40B was trained on a multilingual dataset (including German, Spanish, and French).

🗣️ It demonstrates how to chat with the deployed Falcon 40B model, using parameters such as top_p, temperature, and max_new_tokens to control generation, validating the model's interactive capabilities in practice.

🧹 It provides cleanup steps (delete_model() and delete_endpoint()) to help users release SageMaker resources after finishing their experiments and avoid unnecessary charges.

The Falcon models are taking the open-source LLM space by storm! Falcon (7B & 40B) are currently the most exciting models, offering commercial use through the Apache 2.0 license! The Falcon model family comes in two sizes: 7B, trained on 1.5T tokens, and 40B, trained on 1T tokens. Falcon 40B was trained on a multi-lingual dataset, including German, Spanish, and French!

Last week, we announced the new Hugging Face LLM Inference Container for Amazon SageMaker, which allows you to easily deploy the most popular open-source LLMs, including Falcon, StarCoder, BLOOM, GPT-NeoX, Llama, and T5.

This blog will guide you through deploying the Instruct Falcon 40B model to Amazon SageMaker. We will cover how to:

1. Setup development environment
2. Retrieve the new Hugging Face LLM DLC
3. Deploy Falcon 40B to Amazon SageMaker
4. Run inference and chat with our model

By the end of this guide, you will have a fully operational SageMaker Endpoint running the Falcon 40B, ready to be used for your Generative AI application.

1. Setup development environment

We are going to use the sagemaker python SDK to deploy Falcon 40B to Amazon SageMaker. We need to make sure we have an AWS account configured and the sagemaker python SDK installed.

# install supported sagemaker SDK
!pip install "sagemaker>=2.175.0" --upgrade --quiet

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the new Hugging Face LLM DLC

Compared to deploying regular Hugging Face models, we first need to retrieve the container URI and provide it to our HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="1.0.3"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

3. Deploy Falcon 40B to Amazon SageMaker

Note: Quotas for Amazon SageMaker can vary between accounts. If you receive an error indicating you've exceeded your quota, you can increase them through the Service Quotas console.
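If you want to check your current quota for the instance type used in this guide before deploying, you can query the Service Quotas API. The following is a rough sketch using boto3; the quota-name filter is an assumption, so inspect the printed names to find the exact quota in your account.

import boto3

# list SageMaker quotas and look for the ml.g5.12xlarge endpoint quota
# (sketch: the exact quota name may differ, check the printed names in your account)
sq = boto3.client("service-quotas")
quotas, token = [], None
while True:
    kwargs = {"ServiceCode": "sagemaker"}
    if token:
        kwargs["NextToken"] = token
    page = sq.list_service_quotas(**kwargs)
    quotas.extend(page["Quotas"])
    token = page.get("NextToken")
    if not token:
        break

for q in quotas:
    if "ml.g5.12xlarge" in q["QuotaName"] and "endpoint" in q["QuotaName"].lower():
        print(q["QuotaName"], q["Value"])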

To deploy Falcon 40B Instruct to Amazon SageMaker, we create a HuggingFaceModel model class and define our endpoint configuration, including the hf_model_id, instance_type, etc. We will use a g5.12xlarge instance type with 4 NVIDIA A10G GPUs and 96GB of GPU memory.

Note: If you plan to have long inputs (> 1k tokens), adjust the config and enable int-8 quantization for a smaller memory footprint of the model (which comes with a small latency increase).

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300

# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # uncomment to quantize
}

# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
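If you expect prompts longer than ~1k tokens, the note above suggests raising the input limits and enabling int-8 quantization. A minimal sketch of what the adjusted TGI config could look like; the exact token limits here are illustrative assumptions, not values from the post.

# illustrative config for longer inputs (token limits are assumptions, tune them to your workload)
config_long_inputs = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(3072),   # allow longer prompts
  'MAX_TOTAL_TOKENS': json.dumps(4096),   # prompt + generated tokens
  'HF_MODEL_QUANTIZE': "bitsandbytes",    # int-8 quantization -> smaller memory footprint, slightly higher latency
}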

After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.g5.12xlarge instance type. TGI will automatically distribute and shard the model across all GPUs.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 5 minutes to give the endpoint time to load the model
)

SageMaker will now create our endpoint and deploy the model to it. This can take 10 minutes.
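While the endpoint is being created, you can check its status through the SageMaker API. A small sketch using boto3; the endpoint name comes from the predictor returned by deploy().

import boto3

# describe the endpoint to see whether it is still "Creating" or already "InService"
sm_client = boto3.client("sagemaker")
status = sm_client.describe_endpoint(EndpointName=llm.endpoint_name)["EndpointStatus"]
print(f"endpoint status: {status}")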

4. Run inference and chat with our model

After our endpoint is deployed, we can run inference on it using the predict method of the predictor. We can use different parameters to control the generation by defining them in the parameters attribute of the payload. The Hugging Face LLM Inference Container supports a wide variety of generation parameters, including top_p, temperature, stop, max_new_tokens, and more. You can find a full list of supported parameters here.

The tiiuae/falcon-40b-instruct model is a conversational chat model, meaning we can chat with it using the following prompt:

# define payload
prompt = """You are a helpful Assistant, called Falcon. Knowing everything about AWS.

User: Can you tell me something about Amazon SageMaker?
Falcon:"""

# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:", "<|endoftext|>", "</s>"]
  }
}

# send request to endpoint
response = llm.predict(payload)

# print the assistant response
assistant = response[0]["generated_text"][len(prompt):]
print(assistant)

As a response, you should get something like this.

Sure! Amazon SageMaker is a fully managed service that provides everything you need to build, train and deploy machine learning models. It allows you to create, train and deploy machine learning models, as well as host real-time inference to make predictions on new data.

Before we wrap up, let's use this response and chat for another turn.

new_prompt = f"""{prompt}{assistant}
User: How would you recommend I start using Amazon SageMaker if I am new to Machine Learning?
Falcon:"""
# update payload
payload["inputs"] = new_prompt

# send request to endpoint
response = llm.predict(payload)

# print the assistant response
new_assistant = response[0]["generated_text"][len(new_prompt):]
print(new_assistant)

Let's see what it says.

If you are new to machine learning, I would recommend starting with Amazon SageMaker Studio. This is a web-based interface that provides a step-by-step guide to creating, training, and deploying machine learning models.
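If you want to keep the conversation going beyond two turns, the same pattern (append the model's reply, add a new User: line, call predict again) can be wrapped in a small helper. This is only a sketch; the chat_turn function, its names, and the follow-up question are mine, not part of the post.

def chat_turn(llm, transcript, user_message, parameters):
    """Append a user turn to the transcript, query the endpoint, and return the updated transcript and reply."""
    prompt = f"{transcript}\nUser: {user_message}\nFalcon:"
    response = llm.predict({"inputs": prompt, "parameters": parameters})
    reply = response[0]["generated_text"][len(prompt):]
    return prompt + reply, reply

# continue the conversation from the previous turn (the question is illustrative)
transcript = new_prompt + new_assistant
transcript, reply = chat_turn(llm, transcript, "What does SageMaker Studio cost?", payload["parameters"])
print(reply)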

Awesome! 🚀 We have successfully deployed Falcon 40B to Amazon SageMaker and run inference on it. To clean up, we can delete the model and endpoint.

llm.delete_model()
llm.delete_endpoint()

Conclusion

We successfully deployed Falcon 40B to Amazon SageMaker using the new Hugging Face LLM Inference DLC and its easy-to-use API. We covered how to set up the development environment, retrieve the new Hugging Face LLM DLC, deploy the model, and run inference on it. Remember to clean up by deleting the model and endpoint after you're done. Good luck with your Generative AI application!


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
