Open-Source LLM Deployment Guide

 

This article shows how to deploy open-source LLMs (such as Falcon and Open-LLaMA) on Hugging Face Inference Endpoints, covering the steps to deploy the Falcon 40B instruct model, how to test the LLM endpoint, and how to stream responses in JavaScript and Python. It also explains the key features of Hugging Face Inference Endpoints, such as easy deployment, cost efficiency, and security, and provides the parameters and code examples for controlling text generation.

💡 Open-source LLMs such as Falcon and Open-LLaMA have made significant progress recently and can compete with closed-source models like ChatGPT or GPT-4 for certain use cases. However, deploying these models efficiently remains a major challenge.

📈 Hugging Face Inference Endpoints is a managed SaaS solution that simplifies model deployment, supports autoscaling and scale-to-zero, and offers advanced security features such as offline endpoints and SOC2 Type 2 certification.

🔧 The article walks through the steps to deploy the Falcon 40B instruct model, including choosing a suitable instance type (such as an NVIDIA A100 GPU), clicking "Create Endpoint", and the endpoint coming online about 10 minutes later.

🛠️ Using the Inference Widget or a cURL command, users can manually test the LLM endpoint and control text generation with parameters such as temperature, max_new_tokens, and stop_sequences.

🌐 The article provides code examples for streaming responses in Python and JavaScript, using the text-generation client and the HuggingFace.js library to generate tokens in real time and improve the user experience.

Open-source LLMs like Falcon, (Open-)LLaMA, X-Gen, StarCoder or RedPajama have come a long way in recent months and can compete with closed-source models like ChatGPT or GPT-4 for certain use cases. However, deploying these models in an efficient and optimized way still presents a challenge.

In this blog post, we will show you how to deploy open-source LLMs to Hugging Face Inference Endpoints, our managed SaaS solution that makes it easy to deploy models. Additionally, we will teach you how to stream responses and test the performance of our endpoints. So let's get started!

    1. How to deploy Falcon 40B instruct
    2. Test the LLM endpoint
    3. Stream responses in Javascript and Python

Before we start, let's refresh our knowledge about Inference Endpoints.

What is Hugging Face Inference Endpoints

Hugging Face Inference Endpoints offers an easy and secure way to deploy Machine Learning models for use in production. Inference Endpoints empower developers and data scientists alike to create AI applications without managing infrastructure: simplifying the deployment process to a few clicks, including handling large volumes of requests with autoscaling, reducing infrastructure costs with scale-to-zero, and offering advanced security.

Here are some of the most important features for LLM deployment:

    - Easy Deployment: Deploy models as production-ready APIs with just a few clicks, eliminating the need to handle infrastructure or MLOps.
    - Cost Efficiency: Benefit from automatic scale-to-zero capability, reducing costs by scaling down the infrastructure when the endpoint is not in use, while paying based on the uptime of the endpoint, ensuring cost-effectiveness (see the sketch after this list).
    - Enterprise Security: Deploy models in secure offline endpoints accessible only through direct VPC connections, backed by SOC2 Type 2 certification, and offering BAA and GDPR data processing agreements for enhanced data security and compliance.
    - LLM Optimization: Optimized for LLMs, enabling high throughput with Paged Attention and low latency through custom transformers code and Flash Attention, powered by Text Generation Inference.
    - Comprehensive Task Support: Out-of-the-box support for 🤗 Transformers, Sentence-Transformers, and Diffusers tasks and models, and easy customization to enable advanced tasks like speaker diarization or any Machine Learning task and library.
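As a small illustration of the scale-to-zero point above, here is a hedged sketch of how an endpoint could be scaled down from code. It assumes a recent huggingface_hub release that exposes the Inference Endpoints API; the endpoint name "falcon-40b-instruct" is hypothetical and should be replaced with the name of your own endpoint.

# Hedged sketch: scale a deployed endpoint to zero while it is not in use.
# Assumes a recent huggingface_hub release with the Inference Endpoints API;
# the endpoint name "falcon-40b-instruct" is hypothetical.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("falcon-40b-instruct", token="hf_YOUR_TOKEN")

# Release the underlying compute while keeping the endpoint configured;
# it restarts (with a cold start) when the next request arrives.
endpoint.scale_to_zero()

# Alternatively, pause it explicitly and resume it later:
# endpoint.pause()
# endpoint.resume()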

You can get started with Inference Endpoints at: https://ui.endpoints.huggingface.co/

1. How to deploy Falcon 40B instruct

To get started, you need to be logged in with a User or Organization account with a payment method on file (you can add one here), then access Inference Endpoints at https://ui.endpoints.huggingface.co

Then, click on “New endpoint”. Select the repository, the cloud, and the region, adjust the instance and security settings, and deploy the model, in our case tiiuae/falcon-40b-instruct.

Inference Endpoints suggests an instance type based on the model size, which should be big enough to run the model, here 4x NVIDIA T4 GPUs. To get the best performance for the LLM, change the instance to GPU [xlarge] · 1x Nvidia A100.

Note: If the instance type cannot be selected, you need to contact us and request an instance quota.

You can then deploy your model with a click on “Create Endpoint”. After 10 minutes, the Endpoint should be online and available to serve requests.
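If you prefer to watch the deployment from code rather than refreshing the UI, a minimal sketch along these lines can poll the endpoint status. It assumes a recent huggingface_hub release that exposes the Inference Endpoints API; the endpoint name "falcon-40b-instruct" is hypothetical and should match the name you chose in the UI.

# Minimal sketch: wait programmatically until the endpoint is online.
# Assumes a recent huggingface_hub release with the Inference Endpoints API;
# the endpoint name "falcon-40b-instruct" is hypothetical.
from huggingface_hub import get_inference_endpoint

endpoint = get_inference_endpoint("falcon-40b-instruct", token="hf_YOUR_TOKEN")
endpoint.wait()  # blocks until the endpoint reports a running state
print(endpoint.status, endpoint.url)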

2. Test the LLM endpoint

The Endpoint overview provides access to the Inference Widget, which can be used to manually send requests. This allows you to quickly test your Endpoint with different inputs and share it with team members. Note that the Widget does not support generation parameters.

The widget also generates a cURL command you can use. Just add your Hugging Face token (hf_xxx) and test.

curl https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud \
  -X POST \
  -d '{"inputs":"Once upon a time,"}' \
  -H "Authorization: Bearer <hf_token>" \
  -H "Content-Type: application/json"

You can use different parameters to control the generation, defining them in the parameters attribute of the payload. As of today, the following parameters are supported:

    - temperature: Controls randomness in the model. Lower values make the model more deterministic, higher values make it more random. Default value is 1.0.
    - max_new_tokens: The maximum number of tokens to generate. Default value is 20, max value is 512.
    - repetition_penalty: Controls the likelihood of repetition. Default is null.
    - seed: The seed to use for random generation. Default is null.
    - stop: A list of tokens on which to stop the generation. The generation will stop when one of these tokens is generated.
    - top_k: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default value is null, which disables top-k filtering.
    - top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Default is null.
    - do_sample: Whether or not to use sampling; uses greedy decoding otherwise. Default value is false.
    - best_of: Generate best_of sequences and return the one with the highest token logprobs. Default is null.
    - details: Whether or not to return details about the generation. Default value is false.
    - return_full_text: Whether or not to return the full text or only the generated part. Default value is false.
    - truncate: Whether or not to truncate the input to the maximum length of the model. Default value is true.
    - typical_p: The typical probability of a token. Default value is null.
    - watermark: The watermark to use for the generation. Default value is false.
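For example, a few of these parameters can be set in the parameters attribute of the payload. Below is a minimal sketch using the Python requests library; the endpoint URL is the example one from the cURL command above, and the parameter values are chosen for illustration only.

# Minimal sketch: the same request as the cURL above, with generation
# parameters in the payload. Replace the URL and token with your own values.
import requests

API_URL = "https://j4xhm53fxl9ussm8.us-east-1.aws.endpoints.huggingface.cloud"
headers = {
    "Authorization": "Bearer <hf_token>",
    "Content-Type": "application/json",
}

payload = {
    "inputs": "Once upon a time,",
    "parameters": {
        "temperature": 0.7,     # less random than the default of 1.0
        "max_new_tokens": 128,  # up to 512
        "do_sample": True,      # enable sampling instead of greedy decoding
        "stop": ["\nUser:"],    # stop when this sequence is generated
    },
}

response = requests.post(API_URL, headers=headers, json=payload)
print(response.json())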

3. Stream responses in Javascript and Python

Requesting and generating text with LLMs can be a time-consuming and iterative process. A great way to improve the user experience is streaming tokens to the user as they are generated. Below are two examples of how to stream tokens using Python and JavaScript. For Python, we are going to use the client from Text Generation Inference, and for JavaScript, the HuggingFace.js library.

Streaming requests with Python

First, you need to install the text-generation client.

pip install text-generation

We can create a Client providing our endpoint URL and credentials alongside the hyperparameters we want to use.

from text_generation import Client

# HF Inference Endpoints parameter
endpoint_url = "https://YOUR_ENDPOINT.endpoints.huggingface.cloud"
hf_token = "hf_YOUR_TOKEN"

# Streaming Client
client = Client(endpoint_url, headers={"Authorization": f"Bearer {hf_token}"})

# generation parameter
gen_kwargs = dict(
    max_new_tokens=512,
    top_k=30,
    top_p=0.9,
    temperature=0.2,
    repetition_penalty=1.02,
    stop_sequences=["\nUser:", "<|endoftext|>", "</s>"],
)

# prompt
prompt = "What can you do in Nuremberg, Germany? Give me 3 Tips"

stream = client.generate_stream(prompt, **gen_kwargs)

# yield each generated token
for r in stream:
    # skip special tokens
    if r.token.special:
        continue
    # stop if we encounter a stop sequence
    if r.token.text in gen_kwargs["stop_sequences"]:
        break
    # yield the generated token
    print(r.token.text, end="")
    # yield r.token.text

Replace the print command with yield or with the function you want to stream the tokens to.
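For example, a minimal sketch of wrapping the loop in a generator that yields tokens, reusing the client and gen_kwargs defined in the snippet above (the function name stream_tokens is made up for illustration):

# Minimal sketch: the same streaming loop wrapped in a generator.
# Reuses `client` and `gen_kwargs` from the snippet above;
# the function name stream_tokens is hypothetical.
def stream_tokens(prompt: str):
    for r in client.generate_stream(prompt, **gen_kwargs):
        if r.token.special:                               # skip special tokens
            continue
        if r.token.text in gen_kwargs["stop_sequences"]:  # stop at a stop sequence
            break
        yield r.token.text                                # hand each token to the caller

# consume the generator, e.g. from a CLI or a web handler
for token in stream_tokens("What can you do in Nuremberg, Germany? Give me 3 Tips"):
    print(token, end="")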

Streaming requests with JavaScript

First, you need to install the @huggingface/inference library.

npm install @huggingface/inference

We can create a HfInferenceEndpoint providing our endpoint URL and credentials alongside the hyperparameters we want to use.

import { HfInferenceEndpoint } from '@huggingface/inference'

const hf = new HfInferenceEndpoint(
  'https://YOUR_ENDPOINT.endpoints.huggingface.cloud',
  'hf_YOUR_TOKEN'
)

// generation parameter
const gen_kwargs = {
  max_new_tokens: 512,
  top_k: 30,
  top_p: 0.9,
  temperature: 0.2,
  repetition_penalty: 1.02,
  stop_sequences: ['\nUser:', '<|endoftext|>', '</s>'],
}
// prompt
const prompt = 'What can you do in Nuremberg, Germany? Give me 3 Tips'

const stream = hf.textGenerationStream({ inputs: prompt, parameters: gen_kwargs })
for await (const r of stream) {
  // skip special tokens
  if (r.token.special) {
    continue
  }
  // stop if we encounter a stop sequence
  if (gen_kwargs['stop_sequences'].includes(r.token.text)) {
    break
  }
  // yield the generated token
  process.stdout.write(r.token.text)
}

Replace the process.stdout call with the function you want to stream the tokens to.

Conclusion

In this blog post, we showed you how to deploy open-source LLMs using Hugging Face Inference Endpoints, how to control the text generation with advanced parameters, and how to stream responses to a Python or JavaScript client to improve the user experience. By using Hugging Face Inference Endpoints you can deploy models as production-ready APIs with just a few clicks, reduce your costs with automatic scale to zero, and deploy models into secure offline endpoints backed by SOC2 Type 2 certification.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
