MarkTechPost@AI · August 20
A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface

This tutorial walks through building a complete self-hosted large language model (LLM) workflow with Ollama inside a Google Colab environment. It first installs Ollama via the official script and launches the service in the background to expose an HTTP API. It then pulls a lightweight model (such as qwen2.5:0.5b-instruct or llama3.2:1b) that can run in a CPU-only environment. The article demonstrates streaming interaction with the model through the /api/chat endpoint using Python's requests module, capturing output token by token. Finally, it integrates Gradio to build a user interface that supports multi-turn conversation, parameter configuration, and real-time viewing of results, providing a convenient way to experiment with LLMs inside a notebook.

🚀 **Ollama installation and server startup**: The tutorial first guides the user through installing Ollama on the Colab VM and starting its background service, exposing the HTTP API on localhost:11434 as the foundation for the model interactions that follow.

🧠 **Model selection and pulling**: To fit Colab's resource limits, the tutorial recommends and demonstrates pulling lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, checking whether the model already exists on the server and pulling it automatically if not.

💬 **Streaming API interaction**: Using Python's requests module against Ollama's /api/chat endpoint, the tutorial implements streaming interaction with the model, capturing output token by token for real-time responses.

✨ **Gradio user interface integration**: Finally, the tutorial shows how to build a user-friendly chat interface with Gradio that lets users enter prompts, manage conversation history, adjust parameters such as temperature and context length, and view model replies in real time, completing a full self-hosted LLM workflow.

In this tutorial, we implement a fully functional Ollama environment inside Google Colab to replicate a self-hosted LLM workflow. We begin by installing Ollama directly on the Colab VM using the official Linux installer and then launch the Ollama server in the background to expose the HTTP API on localhost:11434. After verifying the service, we pull lightweight models such as qwen2.5:0.5b-instruct or llama3.2:1b, which balance resource constraints with usability in a CPU-only environment. To interact with these models programmatically, we use the /api/chat endpoint via Python’s requests module with streaming enabled, which allows token-level output to be captured incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters like temperature and context size, and view results in real time. Check out the Full Codes here.
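
Before streaming, it helps to see the basic shape of an /api/chat call. The snippet below is a minimal sketch rather than part of the tutorial's code: it assumes the Ollama server is already listening on localhost:11434 and that the model has been pulled, and it sends a single non-streaming request so the whole reply comes back as one JSON object.

# Minimal non-streaming /api/chat call: a sketch, assuming the server is
# already running on localhost:11434 and the model has been pulled.
import requests

resp = requests.post(
    "http://127.0.0.1:11434/api/chat",
    json={
        "model": "qwen2.5:0.5b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])  # the assistant's reply text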

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path

def sh(cmd, check=True):
    """Run a shell command, stream output."""
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="")
    p.wait()
    if check and p.returncode != 0:
        raise RuntimeError(f"Command failed: {cmd}")

if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
    print("Installing Ollama ...")
    sh("curl -fsSL https://ollama.com/install.sh | sh")
else:
    print("Ollama already installed.")

try:
    import gradio
except Exception:
    print("Installing Gradio ...")
    sh("pip -q install gradio==4.44.0")

We first check if Ollama is already installed on the system, and if not, we install it using the official script. At the same time, we ensure Gradio is available by importing it or installing the required version when missing. This way, we prepare our Colab environment for running the chat interface smoothly. Check out the Full Codes here.
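
As an optional sanity check that is not in the original code, we can confirm where the ollama binary landed and print its version; the snippet below uses only the standard library and assumes the CLI accepts the --version flag.

# Optional sanity check (not part of the original tutorial): locate the
# ollama binary and print its version. Assumes the CLI accepts "--version".
import shutil, subprocess

path = shutil.which("ollama")
if path is None:
    print("ollama binary not found on PATH")
else:
    print("ollama found at:", path)
    out = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
    print((out.stdout or out.stderr).strip())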

def start_ollama():
    try:
        requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
        print("Ollama server already running.")
        return None
    except Exception:
        pass
    print("Starting Ollama server ...")
    proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for _ in range(60):
        time.sleep(1)
        try:
            r = requests.get("http://127.0.0.1:11434/api/tags", timeout=1)
            if r.ok:
                print("Ollama server is up.")
                break
        except Exception:
            pass
    else:
        raise RuntimeError("Ollama did not start in time.")
    return proc

server_proc = start_ollama()

We start the Ollama server in the background and keep checking its health endpoint until it responds successfully. By doing this, we ensure the server is running and ready before sending any API requests. Check out the Full Codes here.
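
The tutorial leaves the server running for the rest of the session, but if we ever want to tear it down explicitly (for example, before restarting with different settings), a small helper like the one below works; it is an addition rather than part of the original code and only touches the server_proc handle we just created.

# Optional teardown helper (an addition, not in the original code): stop the
# background "ollama serve" process if we were the ones who started it.
import subprocess

def stop_ollama(proc, timeout=10):
    if proc is None:               # the server was already running before us
        print("No server process of our own to stop.")
        return
    proc.terminate()               # ask the server to exit
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                # force-kill if it does not exit in time
    print("Ollama server stopped.")

# Example: stop_ollama(server_proc)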

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"Using model: {MODEL}")

try:
    tags = requests.get("http://127.0.0.1:11434/api/tags", timeout=5).json()
    have = any(m.get("name") == MODEL for m in tags.get("models", []))
except Exception:
    have = False

if not have:
    print(f"Pulling model {MODEL} (first time only) ...")
    sh(f"ollama pull {MODEL}")

We define the default model to use, check if it is already available on the Ollama server, and if not, we automatically pull it. This ensures that the chosen model is ready before we start running any chat sessions. Check out the Full Codes here.
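
Pulling through the CLI is the simplest route in Colab, but Ollama also exposes a pull endpoint over its REST API. The sketch below is an alternative rather than the tutorial's approach; it assumes /api/pull accepts a "name" field (newer releases also accept "model") and streams JSON status lines while the download runs.

# Alternative to "ollama pull" over REST (a sketch, not the tutorial's method).
# Assumes /api/pull accepts {"name": ...}; newer releases also accept {"model": ...}.
import json, requests

def pull_model_via_api(name, base="http://127.0.0.1:11434"):
    with requests.post(f"{base}/api/pull", json={"name": name, "stream": True}, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            status = json.loads(line.decode("utf-8"))
            print(status.get("status", ""), end="\r")  # e.g. "pulling manifest", "success"
    print("\nPull finished:", name)

# Example: pull_model_via_api("qwen2.5:0.5b-instruct")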

OLLAMA_URL = "http://127.0.0.1:11434/api/chat"

def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
    """Yield streaming text chunks from Ollama /api/chat."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "options": {"temperature": float(temperature)},
    }
    if num_ctx:
        payload["options"]["num_ctx"] = int(num_ctx)
    with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            data = json.loads(line.decode("utf-8"))
            if "message" in data and "content" in data["message"]:
                yield data["message"]["content"]
            if data.get("done"):
                break

We create a streaming client for the Ollama /api/chat endpoint, where we send messages as JSON payloads and yield tokens as they arrive. This lets us handle responses incrementally, so we see the model’s output in real time instead of waiting for the full completion. Check out the Full Codes here.
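
When we only need the finished answer as a single string (for logging, quick evaluation, or saving to disk), we can wrap the generator above; this tiny helper is a convenience addition, not part of the original code, and simply joins the streamed chunks.

# Convenience wrapper (an addition, not in the original code): run the
# streaming client to completion and return the full reply as one string.
def ollama_chat_once(messages, **kwargs):
    return "".join(ollama_chat_stream(messages, **kwargs))

# Example:
# reply = ollama_chat_once([{"role": "user", "content": "One-line summary of Ollama?"}], temperature=0.2)
# print(reply)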

def smoke_test():
    print("\nSmoke test:")
    sys_msg = {"role": "system", "content": "You are concise. Use short bullets."}
    user_msg = {"role": "user", "content": "Give 3 quick tips to sleep better."}
    out = []
    for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
        print(chunk, end="")
        out.append(chunk)
    print("\nDone.\n")

try:
    smoke_test()
except Exception as e:
    print("Smoke test skipped:", e)

We run a quick smoke test by sending a simple prompt through our streaming client to confirm that the model responds correctly. This helps us verify that Ollama is installed, the server is running, and the chosen model is working before we build the full chat UI. Check out the Full Codes here.
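
To get a rough feel for generation speed on Colab's CPU, we can time a similar call and count streamed chunks; this measurement is an optional sketch that is not in the original tutorial, and chunk counts only approximate token counts.

# Optional speed check (a sketch, not part of the original tutorial): time a
# short generation and report streamed chunks per second (roughly tokens/s).
import time

def rough_speed_test(prompt="List 3 uses of a paperclip.", temperature=0.3):
    start = time.time()
    chunks = 0
    for _ in ollama_chat_stream([{"role": "user", "content": prompt}], temperature=temperature):
        chunks += 1
    elapsed = time.time() - start
    print(f"{chunks} chunks in {elapsed:.1f}s (~{chunks / max(elapsed, 1e-6):.1f} chunks/s)")

# Example: rough_speed_test()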

import gradio as gr

SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."

def chat_fn(message, history, temperature, num_ctx):
    msgs = [{"role": "system", "content": SYSTEM_PROMPT}]
    for u, a in history:
        if u: msgs.append({"role": "user", "content": u})
        if a: msgs.append({"role": "assistant", "content": a})
    msgs.append({"role": "user", "content": message})
    acc = ""
    try:
        for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
            acc += part
            yield acc
    except Exception as e:
        yield f"Error: {e}"

with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
    gr.Markdown("# Ollama Chat (Colab)\nSmall local-ish LLM via Ollama + Gradio.\n")
    with gr.Row():
        temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
        num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
    chat = gr.Chatbot(height=460)
    msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
    clear = gr.Button("Clear")

    def user_send(m, h):
        m = (m or "").strip()
        if not m:
            return "", h
        return "", h + [[m, None]]

    def bot_reply(h, temperature, num_ctx):
        u = h[-1][0]
        stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
        acc = ""
        for partial in stream:
            acc = partial
            h[-1][1] = acc
            yield h

    msg.submit(user_send, [msg, chat], [msg, chat]).then(bot_reply, [chat, temp, num_ctx], [chat])
    clear.click(lambda: None, None, chat)

print("Launching Gradio ...")
demo.launch(share=True)

We integrate Gradio to build an interactive chat UI on top of the Ollama server, where user input and conversation history are converted into the correct message format and streamed back as model responses. The sliders let us adjust parameters like temperature and context length, while the chat box and clear button provide a simple, real-time interface for testing different prompts.
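
If Gradio is unavailable or a public share link is undesirable, the same streaming client can also drive a bare-bones loop in a notebook cell; the sketch below is a minimal alternative to the Gradio UI, not part of the original tutorial, and it keeps multi-turn history as a plain list of message dictionaries.

# Minimal console chat loop (an alternative sketch, not the tutorial's UI):
# reuses ollama_chat_stream and keeps history as a list of message dicts.
def console_chat(system_prompt="You are a helpful, crisp assistant."):
    history = [{"role": "system", "content": system_prompt}]
    while True:
        user = input("you> ").strip()
        if not user or user.lower() in {"exit", "quit"}:
            break
        history.append({"role": "user", "content": user})
        reply = ""
        for chunk in ollama_chat_stream(history, temperature=0.3):
            print(chunk, end="", flush=True)
            reply += chunk
        print()
        history.append({"role": "assistant", "content": reply})

# Example: console_chat()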

In conclusion, we establish a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and user interface integration. The system uses Ollama’s REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach preserves the “self-hosted” design described in the original guide but adapts it for Colab’s constraints, where Docker and GPU-backed Ollama images are not practical. The result is a compact yet technically complete framework that lets us experiment with multiple LLMs, adjust generation parameters dynamically, and test conversational AI locally within a notebook environment.


Check out the Full Codes here. Feel free to check out our GitHub Page for Tutorials, Codes and Notebooks. Also, feel free to follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

The post A Coding Implementation to Build a Complete Self-Hosted LLM Workflow with Ollama, REST API, and Gradio Chat Interface appeared first on MarkTechPost.
