This article takes a deep dive into LitServe, a lightweight yet powerful serving framework that makes it easy to deploy machine learning models as APIs. We build and test multiple endpoints covering practical features such as text generation, batching, streaming, multi-task processing, and caching, all running locally with no dependence on external APIs. By the end of the tutorial, readers will have a clear picture of how to design scalable, flexible, and efficient ML serving pipelines that are easy to extend for production-grade applications. Complete code examples are provided for hands-on learning and practice.
💡 **Core advantages of LitServe**: LitServe is a lightweight yet powerful serving framework that lets users deploy machine learning models as APIs with minimal overhead. It streamlines model serving, enabling developers to quickly build and test features such as text generation, batching, streaming, multi-task processing, and caching.
🚀 **Diverse API capabilities**: The tutorial shows in detail how to implement several advanced API features with LitServe, including a text-generation API built on DistilGPT2, a batched sentiment-analysis API, a text-generation API that simulates real-time streaming, and a multi-task API that handles both sentiment analysis and text summarization. Together, these examples demonstrate LitServe's flexibility and power.
⚡ **Performance optimization and efficiency**: The article highlights LitServe's performance-optimization capabilities by implementing an API with a caching mechanism that cuts redundant computation, speeds up responses, and tracks cache hits and misses in real time, showing how LitServe helps build efficient services optimized for repeated-inference scenarios.
💻 **Local testing and production readiness**: Beyond building the APIs, the tutorial walks through testing all of them locally without starting an external server, ensuring that every component runs smoothly and efficiently and laying a solid foundation for eventual production deployment. LitServe's design philosophy of flexibility, performance, and simplicity streamlines the deployment of ML systems.
In this tutorial, we explore LitServe, a lightweight and powerful serving framework that allows us to deploy machine learning models as APIs with minimal effort. We build and test multiple endpoints that demonstrate real-world functionalities such as text generation, batching, streaming, multi-task processing, and caching, all running locally without relying on external APIs. By the end, we clearly understand how to design scalable and flexible ML serving pipelines that are both efficient and easy to extend for production-level applications. Check out the FULL CODES here.
```python
!pip install litserve torch transformers -q

import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List
```
We begin by setting up our environment on Google Colab and installing all required dependencies, including LitServe, PyTorch, and Transformers. We then import the essential libraries and modules that will allow us to define, serve, and test our APIs efficiently. Check out the FULL CODES here.
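Before defining any endpoints, it can help to confirm what hardware the pipelines will run on. The short snippet below is an optional sanity check, not part of the original tutorial; it prints library versions and whether CUDA is visible, which is what decides the `device=0` versus `device=-1` choice used throughout the APIs.

```python
# Optional sanity check (not in the original tutorial): report versions and
# whether a CUDA device is visible, which determines GPU (device=0) vs CPU (device=-1).
import torch
import transformers
import litserve as ls

print("litserve:", getattr(ls, "__version__", "unknown"))
print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```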
```python
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(
            prompt,
            max_length=100,
            num_return_sequences=1,
            temperature=0.8,
            do_sample=True,
        )
        return result[0]['generated_text']

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        return inputs

    def predict(self, batch: List[str]):
        results = self.model(batch)
        return results

    def unbatch(self, output):
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
```
Here, we create two LitServe APIs, one for text generation using a local DistilGPT2 model and another for batched sentiment analysis. We define how each API decodes incoming requests, performs inference, and returns structured responses, demonstrating how easy it is to build scalable, reusable model-serving endpoints. Check out the FULL CODES here.
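These classes are exercised in-process later in the tutorial, but the same objects can also be exposed over HTTP. The snippet below is a minimal, hedged sketch rather than part of the original code: it assumes the `LitServer` constructor accepts `accelerator`, `max_batch_size`, and `batch_timeout` and serves requests at the default `/predict` route, as in recent LitServe releases.

```python
# Hedged sketch: serving BatchedSentimentAPI over HTTP with server-side batching.
# Assumes LitServer's accelerator/max_batch_size/batch_timeout arguments and the
# default /predict route, as documented in recent LitServe releases.
if __name__ == "__main__":
    server = ls.LitServer(
        BatchedSentimentAPI(),
        accelerator="auto",   # use a GPU if one is available, otherwise CPU
        max_batch_size=8,     # group up to 8 concurrent requests into one forward pass
        batch_timeout=0.05,   # wait at most 50 ms while filling a batch
    )
    server.run(port=8000)
```

With batching enabled, concurrent clients posting `{"text": ...}` payloads are grouped by the server, passed through `batch`/`predict`/`unbatch`, and each caller receives its own response from `encode_response`.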
```python
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline(
            "text-generation",
            model="distilgpt2",
            device=0 if device == "cuda" and torch.cuda.is_available() else -1,
        )

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}
```
In this section, we design a streaming text-generation API that emits tokens as they are generated. We simulate real-time streaming by yielding words one at a time, demonstrating how LitServe can handle continuous token generation efficiently. Check out the FULL CODES here.
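To stream tokens to clients over HTTP rather than in-process, the generator-based API would be served with streaming enabled. The sketch below is not from the article: it relies on LitServer's `stream=True` flag and the default `/predict` route behaving as in recent LitServe releases, and the client loop is illustrative only.

```python
# Hedged sketch: exposing StreamingTextAPI as a streaming endpoint.
# Assumes LitServer's stream=True flag and the default /predict route
# (recent LitServe releases); the client loop is illustrative only.
if __name__ == "__main__":
    server = ls.LitServer(StreamingTextAPI(), stream=True)
    server.run(port=8001)

# A client could then consume the response incrementally, for example:
#   import requests
#   with requests.post(
#       "http://localhost:8001/predict",
#       json={"prompt": "Tell me a story"},
#       stream=True,
#   ) as resp:
#       for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
#           print(chunk, end="", flush=True)
```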
```python
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline(
            "summarization", model="sshleifer/distilbart-cnn-6-6", device=-1
        )
        self.device = device

    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        else:
            return {"task": "unknown", "error": "Unsupported task"}

    def encode_response(self, output):
        return output
```
We now develop a multi-task API that handles both sentiment analysis and summarization via a single endpoint. This snippet demonstrates how we can manage multiple model pipelines through a unified interface, dynamically routing each request to the appropriate pipeline based on the specified task. Check out the FULL CODES here.
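Because the routing happens inside `predict`, the summarization branch can be checked in-process the same way the tests at the end do. The short example below is illustrative rather than part of the original code (the sample text is made up); it simply shows how the `task` field selects the summarizer pipeline.

```python
# Illustrative in-process check of the "summarize" branch; the sample text is made up.
api = MultiTaskAPI()
api.setup("cpu")

long_text = (
    "LitServe is a lightweight serving framework that lets you wrap machine "
    "learning models as APIs with very little code. In this tutorial we build "
    "endpoints for text generation, batched sentiment analysis, streaming, "
    "multi-task routing, and caching, and we test every component locally "
    "before deploying anything to a server."
)
decoded = api.decode_request({"task": "summarize", "text": long_text})
print(api.predict(decoded)["result"]["summary_text"])
```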
```python
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {
            "label": result["label"],
            "score": float(result["score"]),
            "from_cache": from_cache,
            "cache_stats": {"hits": self.hits, "misses": self.misses},
        }
```
We implement an API that uses caching to store previous inference results, reducing redundant computation for repeated requests. We track cache hits and misses in real time, illustrating how simple caching mechanisms can drastically improve performance in repeated inference scenarios. Check out the FULL CODES here.
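One caveat worth noting is that the dictionary cache above grows without bound over the life of the process. The sketch below is not part of the original tutorial; it shows one common refinement, capping the cache and evicting the least recently used entry with an `OrderedDict` (the subclass name and `MAX_CACHE_SIZE` are illustrative).

```python
# Hedged sketch: a size-bounded LRU variant of CachedAPI's dictionary cache.
# The class name and MAX_CACHE_SIZE are illustrative, not from the tutorial.
from collections import OrderedDict

MAX_CACHE_SIZE = 1024

class LRUCachedAPI(CachedAPI):
    def setup(self, device):
        super().setup(device)
        self.cache = OrderedDict()

    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            self.cache.move_to_end(text)      # mark as most recently used
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        if len(self.cache) > MAX_CACHE_SIZE:  # evict the least recently used entry
            self.cache.popitem(last=False)
        return result, False
```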
```python
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    api1 = TextGeneratorAPI()
    api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    api2 = BatchedSentimentAPI()
    api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    api3 = MultiTaskAPI()
    api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    api4 = CachedAPI()
    api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print(" All tests completed successfully!")
    print("=" * 70)


test_apis_locally()
```
We test all our APIs locally to verify their correctness and performance without starting an external server. We sequentially evaluate text generation, batched sentiment analysis, multi-tasking, and caching, ensuring each component of our LitServe setup runs smoothly and efficiently.
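Once the in-process checks pass, the step beyond this tutorial is to put one of the classes behind a real endpoint. The sketch below is a hedged example of that hand-off, not part of the original code; the port and the default `/predict` route are assumptions based on standard LitServe usage.

```python
# Hedged sketch: promoting the locally tested TextGeneratorAPI to an HTTP endpoint.
# The port and the default /predict route are assumptions based on standard LitServe usage.
if __name__ == "__main__":
    server = ls.LitServer(TextGeneratorAPI(), accelerator="auto")
    server.run(port=8000)

# From another process, a client would send the same payload decode_request expects:
#   import requests
#   r = requests.post(
#       "http://localhost:8000/predict",
#       json={"prompt": "Artificial intelligence will"},
#   )
#   print(r.json()["generated_text"])
```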
In conclusion, we create and run diverse APIs that showcase the framework's versatility. We experiment with text generation, sentiment analysis, multi-tasking, and caching to experience LitServe's seamless integration with Hugging Face pipelines. As we complete the tutorial, we see how LitServe simplifies model deployment workflows, enabling us to serve intelligent ML systems in just a few lines of Python code while maintaining flexibility, performance, and simplicity.