Infrastructure as Code for Generative AI Deployment


Infrastructure as Code (IaC) allows us to manage and provision infrastructure through programmatic interfaces. This approach helps companies manage, update, and use Generative AI more easily in automated setups.

By leveraging IaC, we can streamline the deployment and maintenance of models and unlock new capabilities like batch processing and automated end-to-end model evaluation.

I am happy to share that the huggingface_hub Python library now supports Hugging Face Inference Endpoints. This allows you to programmatically manage Inference Endpoints: you can now create, update, pause, delete, or send requests to endpoints using the huggingface_hub library.

Hugging Face Inference Endpoints offers an easy and secure way to deploy Generative AI models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.

The huggingface_hub library allows you to interact with the Hugging Face Hub, a machine learning platform for creators and collaborators.

End-to-End Example

This tutorial will guide you through an example of managing Inference Endpoints using the huggingface_hub library. We'll focus on the model Zephyr for this example. Support for managing Inference Endpoints was added in version 0.20.

pip install "huggingface_hub>=0.20.1" --upgrade

Before we can create and manage our endpoints, we need to log in using a Hugging Face token. We also want to define our namespace; the namespace is the organization or account we use. It is the identifier in the URL when you go to your profile, e.g. huggingface.

from huggingface_hub import login

# set credentials
login(token="YOUR_TOKEN")

# set namespace for IE account
namespace = "huggingface"

The huggingface_hub library provides a create_inference_endpoint method, which accepts the same parameters as the Inference Endpoints HTTP API. This means we need to define:

    endpoint_name: The name for the endpoint
    repository: The model repository to use
    framework: The framework to use, most likely pytorch
    task: The task to use; for LLMs this is text-generation
    vendor: The cloud provider to use, e.g. aws
    region: The region to use, e.g. us-east-1
    type: The security type to use, e.g. protected
    instance_size: The instance size to use, which can be found in the UI
    instance_type: The instance type to use, which can be found in the UI
    accelerator: The accelerator to use, e.g. gpu
    namespace: The namespace to use, e.g. huggingface
    custom_image: The custom container image to use; this is optional

We are going to use custom_image to deploy with Text Generation Inference (TGI). This is the same image you get when deploying an LLM through the UI.

In our example we want to use the model Zephyr. First, let's define our custom image. The custom_image also allows us to define TGI-specific parameters like MAX_BATCH_PREFILL_TOKENS, MAX_INPUT_LENGTH, and MAX_TOTAL_TOKENS. Below is an example of how to set them; make sure to adjust the values to your needs, e.g. the input length.

# define TGI as custom image
custom_image = {
    "health_route": "/health",  # health route for TGI
    "env": {
        "MAX_BATCH_PREFILL_TOKENS": "2048",  # can be adjusted to your needs
        "MAX_INPUT_LENGTH": "1024",  # can be adjusted to your needs
        "MAX_TOTAL_TOKENS": "1512",  # can be adjusted to your needs
        "MODEL_ID": "/repository",  # IE will save the model in /repository
    },
    "url": "ghcr.io/huggingface/text-generation-inference:1.3.3",
}

After we have defined our custom image, we can create our Inference Endpoint.

from huggingface_hub import create_inference_endpoint

# Create Inference Endpoint to run Zephyr 7B
print("Creating Inference Endpoint for Zephyr 7B")
zephyr_endpoint = create_inference_endpoint(
    "zephyr-ie-test",
    repository="HuggingFaceH4/zephyr-7b-beta",
    framework="pytorch",
    task="text-generation",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="medium",
    instance_type="g5.2xlarge",
    accelerator="gpu",
    namespace=namespace,
    custom_image=custom_image,
)

The huggingface_hub library will return an InferenceEndpoint object. This object allows us to interact with the endpoint. This means we can directly send requests, pause, delete or update the endpoint. We can also call the wait method to wait for the endpoint to be ready for inference. This is super handy when you want to run inference right after creating the endpoint, e.g. for batch processing or automatic model evaluation.
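As a rough sketch of that batch-processing pattern (the prompt list below is made up purely for illustration, not from the original post), the flow is: wait for the endpoint, run all requests through the attached client, then pause the endpoint again.

# minimal batch-processing sketch (illustrative prompts)
prompts = [
    "Summarize Infrastructure as Code in one sentence.",
    "List three benefits of autoscaling.",
]
zephyr_endpoint.wait()  # block until the endpoint is running
outputs = [
    zephyr_endpoint.client.text_generation(prompt, max_new_tokens=256)
    for prompt in prompts
]
zephyr_endpoint.pause()  # stop the endpoint once the batch is done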

Note: This may take a few minutes.

print("Waiting for endpoint to be deployed")zephyr_endpoint.wait()

After the endpoint is ready, the .wait() method returns. This means we can test our endpoint by sending requests.

print("Running Inference")res = zephyr_endpoint.client.text_generation(    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n<|user|>\nHow many helicopters can a human eat in one sitting?</s>\n<|assistant|>")print(res)# Matey, I'm afraid I've never heard of a human eating a helic

For more details about how to use the InferenceClient, check out the Inference guide.
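As one example of what the client supports beyond single-shot generation, text_generation can also stream tokens as they are produced via stream=True (the prompt below is only illustrative).

# stream tokens instead of waiting for the full generation
for token in zephyr_endpoint.client.text_generation(
    "<|user|>\nWrite a haiku about autoscaling.</s>\n<|assistant|>",
    max_new_tokens=64,
    stream=True,
):
    print(token, end="")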

If we want to temporarily pause the endpoint, we can call the pause method.

print("Pausing Inference Endpoint")zephyr_endpoint.pause()

To delete the endpoint we call the delete method.

print("Deleting Inference Endpoint")zephyr_endpoint.delete()

Conclusion

In this tutorial, you've learned how to use the huggingface_hub library to create, send requests to, pause, and delete Hugging Face Inference Endpoints. This allows for efficient and scalable management of Generative AI models in production environments.

We have more in-depth documentation about managing Inference Endpoints in our documentation. The huggingface_hub library offers further capabilities, like listing endpoints, updating scaling, and more.
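For example, a rough sketch of listing the endpoints in a namespace and adjusting the autoscaling range of the endpoint we created above could look like this (check the documentation for the exact parameters; the replica values are illustrative).

from huggingface_hub import get_inference_endpoint, list_inference_endpoints

# list all Inference Endpoints in our namespace
for endpoint in list_inference_endpoints(namespace=namespace):
    print(endpoint.name, endpoint.status)

# fetch an endpoint by name and update its autoscaling range
endpoint = get_inference_endpoint("zephyr-ie-test", namespace=namespace)
endpoint.update(min_replica=0, max_replica=2)  # scale to zero when idle, up to 2 replicas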

If you are missing a feature or have feedback, please let us know.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
