Infrastructure as Code for Generative AI Deployment


Infrastructure as Code (IaC) allows us to manage and provision infrastructure through programmatic interfaces. This approach helps companies manage, update, and use Generative AI more easily in automated setups.

By leveraging IaC, we can streamline the deployment and maintenance of models and unlock new capabilities like batch processing and automated end-to-end model evaluation.

I am happy to share that the huggingface_hub Python library now supports Hugging Face Inference Endpoints. This allows you to programmatically manage Inference Endpoints: you can now create, update, pause, delete, or send requests to endpoints using the huggingface_hub library.

Hugging Face Inference Endpoints offers an easy and secure way to deploy Generative AI models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.

The huggingface_hub library allows you to interact with the Hugging Face Hub, a machine learning platform for creators and collaborators.

End-to-End Example

This tutorial will guide you through an example of managing Inference Endpoints using the huggingface_hub library. We'll focus on the model Zephyr for this example. Support for managing Inference Endpoints was added in version 0.20.

pip install "huggingface_hub>=0.20.1" --upgrade

Before we can create and manage our endpoints, we need to log in using a Hugging Face token. We also want to define our namespace; the namespace is the organization or account we use. It is the identifier in the URL when you go to your profile, e.g. huggingface.

from huggingface_hub import login

# set credentials
login(token="YOUR_TOKEN")

# set namespace for IE account
namespace = "huggingface"

The huggingface_hub library provides a create_inference_endpoint method, which accepts the same parameters as the Inference Endpoints HTTP API. This means we need to define:

    endpoint_name: The name for the endpoint
    repository: The model repository to use
    framework: The framework to use, most likely pytorch
    task: The task to use; for LLMs this is text-generation
    vendor: The cloud provider to use, e.g. aws
    region: The region to use, e.g. us-east-1
    type: The security type to use, e.g. protected
    instance_size: The instance size to use, which can be found in the UI
    instance_type: The instance type to use, which can be found in the UI
    accelerator: The accelerator to use, e.g. gpu
    namespace: The namespace to use, e.g. huggingface
    custom_image: The custom container image to use; this is optional

We are going to use custom_image to deploy with Text Generation Inference (TGI). This is the same image you get when deploying an LLM through the UI.

In our example we want to use the model Zephyr. First, let's define our custom image. The custom_image also allows us to define TGI-specific parameters like MAX_BATCH_PREFILL_TOKENS, MAX_INPUT_LENGTH, and MAX_TOTAL_TOKENS. Below is an example of how to set them; make sure to adjust the values to your needs, e.g. the input length.

# define TGI as custom image
custom_image = {
    "health_route": "/health",  # health route for TGI
    "env": {
        "MAX_BATCH_PREFILL_TOKENS": "2048",  # can be adjusted to your needs
        "MAX_INPUT_LENGTH": "1024",  # can be adjusted to your needs
        "MAX_TOTAL_TOKENS": "1512",  # can be adjusted to your needs
        "MODEL_ID": "/repository",  # IE will save the model in /repository
    },
    "url": "ghcr.io/huggingface/text-generation-inference:1.3.3",
}

After we have defined our custom image, we can create our Inference Endpoint.

from huggingface_hub import create_inference_endpoint

# Create Inference Endpoint to run Zephyr 7B
print("Creating Inference Endpoint for Zephyr 7B")
zephyr_endpoint = create_inference_endpoint(
    "zephyr-ie-test",
    repository="HuggingFaceH4/zephyr-7b-beta",
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    accelerator="gpu",
    namespace=namespace,
    custom_image=custom_image,
)

The huggingface_hub library will return an InferenceEndpoint object. This object allows us to interact with the endpoint. This means we can directly send requests, pause, delete or update the endpoint. We can also call the wait method to wait for the endpoint to be ready for inference. This is super handy when you want to run inference right after creating the endpoint, e.g. for batch processing or automatic model evaluation.
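As a rough sketch of that batch-processing pattern (the prompt list below is made up purely for illustration, not from the original post), the flow is: wait for the endpoint, run all requests through the attached client, then pause the endpoint again.

# minimal batch-processing sketch (illustrative prompts)
prompts = [
    "Summarize Infrastructure as Code in one sentence.",
    "List three benefits of autoscaling.",
]
zephyr_endpoint.wait()  # block until the endpoint is running
outputs = [
    zephyr_endpoint.client.text_generation(prompt, max_new_tokens=256)
    for prompt in prompts
]
zephyr_endpoint.pause()  # stop the endpoint once the batch is done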

Note: This may take a few minutes.

print("Waiting for endpoint to be deployed")zephyr_endpoint.wait()

After the endpoint is ready, the .wait() method returns. This means we can test our endpoint by sending requests.

print("Running Inference")res = zephyr_endpoint.client.text_generation(    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n<|user|>\nHow many helicopters can a human eat in one sitting?</s>\n<|assistant|>")print(res)# Matey, I'm afraid I've never heard of a human eating a helic

For more details about how to use the InferenceClient, check out the Inference guide.
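As one example of what the client supports beyond single-shot generation, text_generation can also stream tokens as they are produced via stream=True (the prompt below is only illustrative).

# stream tokens instead of waiting for the full generation
for token in zephyr_endpoint.client.text_generation(
    "<|user|>\nWrite a haiku about autoscaling.</s>\n<|assistant|>",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="")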

If we want to temporarily pause the endpoint, we can call the pause method.

print("Pausing Inference Endpoint")zephyr_endpoint.pause()

To delete the endpoint we call the delete method.

print("Deleting Inference Endpoint")zephyr_endpoint.delete()

Conclusion

In this tutorial, you've learned how to use the huggingface_hub library to create, send requests to, pause, and delete Hugging Face Inference Endpoints. This allows for efficient and scalable management of Generative AI models in production environments.

We have more in-depth documentation about managing Inference Endpoints in our documentation. The huggingface_hub library offers further capabilities, like listing endpoints, updating scaling, and more.
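For example, a rough sketch of listing the endpoints in a namespace and adjusting the autoscaling range of the endpoint we created above could look like this (check the documentation for the exact parameters; the replica values are illustrative).

from huggingface_hub import get_inference_endpoint, list_inference_endpoints

# list all Inference Endpoints in our namespace
for endpoint in list_inference_endpoints(namespace=namespace):
    print(endpoint.name, endpoint.status)

# fetch an endpoint by name and update its autoscaling range
endpoint = get_inference_endpoint("zephyr-ie-test", namespace=namespace)
endpoint.update(min_replica=0, max_replica=2)  # scale to zero when idle, up to 2 replicas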

If you are missing a feature or have feedback, please let us know.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
