Zhipu open-sources GLM-Edge, a series of on-device large language and multimodal models!

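GLM-Edge is a series of on-device-optimized models from Zhipu AI, comprising 1.5B/4B chat models and 2B/5B multimodal models. Designed for phones, in-vehicle systems, and PCs, the models use mixed quantization and inference optimization to achieve high-speed decoding on the Qualcomm Snapdragon 8 Elite platform and to meet the needs of different devices.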

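The GLM-Edge series is Zhipu's open-source attempt at real-world on-device deployment. It consists of large language chat models and multimodal understanding models, each in two sizes (GLM-Edge-1.5B-Chat, GLM-Edge-4B-Chat, GLM-Edge-V-2B, GLM-Edge-V-5B). The 1.5B / 2B models mainly target phones, in-vehicle systems, and similar platforms, while the 4B / 5B models mainly target PCs and similar platforms.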

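Building on the technology accumulated in the GLM-4 series, the research team adjusted the model structure and size specifically for on-device deployment, aiming to balance model quality, on-device inference performance, and ease of deployment. In addition, thanks to close collaboration with partner companies and sustained work on inference optimization, GLM-Edge models run very fast on some on-device platforms.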

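For example, on the Qualcomm Snapdragon 8 Elite platform, using its NPU and a mixed quantization scheme, the 1.5B chat model and the 2B multimodal model reach decoding speeds above 60 tokens per second. With speculative sampling applied, both models reach peak decoding speeds above 100 tokens per second. These inference solutions will be released later by the GLM team or its partners.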

GLM-Edge collection:

https://modelscope.cn/collections/GLM-Edge-ff0306563d2844

Demo space (GLM-Edge-V-5B):

https://modelscope.cn/studios/ZhipuAI/GLM-Edge-V-5B-Demo

Demo space (GLM-Edge-1.5B-Chat):

https://modelscope.cn/studios/ZhipuAI/GLM-Edge-1.5B-Chat-Demo

On-device performance data

Data was collected up to November 28, 2024. The GLM team is still working with partners to improve these numbers.

Qualcomm

Model	Task	Quantization	Framework	1st token latency (ms)	Token Rate (tokens/s)	Peak Memory Footprint (GB)
GLM-Edge-1.5B-Chat	(input/output=512/128)	INT4	QNN	260	65	1.2
GLM-Edge-4B-Chat	(input/output=512/128)	INT4	QNN	660	24	2.9

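Tested on the Qualcomm Snapdragon 8 Elite (Gen4) platform, with the models running entirely on the NPU.

When running the V models, an additional 890 ms of processing time per image and about 660 MB of extra memory are required.

With speculative decoding, Token Rate improves by up to 50%.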
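As a rough back-of-the-envelope reading of the Qualcomm numbers above, the sketch below applies the stated upper bound of 50% to the measured token rates and estimates end-to-end generation time for the 512/128 task. The latency formula (first-token latency plus the remaining tokens at the steady token rate) and the function names are illustrative assumptions, not part of the published benchmark.

```python
# Illustrative arithmetic only; the simple latency model is an assumption.


def boosted_token_rate(base_rate: float, uplift: float = 0.5) -> float:
    """Token rate after a relative uplift (0.5 = the 'up to 50%' upper bound)."""
    return base_rate * (1.0 + uplift)


def estimated_generation_time_s(first_token_ms: float, token_rate: float, output_tokens: int) -> float:
    """First-token latency plus the remaining tokens decoded at a steady rate."""
    return first_token_ms / 1000.0 + (output_tokens - 1) / token_rate


# Measured values from the Qualcomm table: (1st token latency in ms, tokens/s)
qualcomm = {
    "GLM-Edge-1.5B-Chat": (260, 65),
    "GLM-Edge-4B-Chat": (660, 24),
}

for name, (latency_ms, rate) in qualcomm.items():
    best_case = boosted_token_rate(rate)
    total_s = estimated_generation_time_s(latency_ms, rate, output_tokens=128)
    print(f"{name}: up to {best_case:.1f} tokens/s with speculative decoding, "
          f"about {total_s:.2f} s for 128 output tokens without it")
```

Because the 50% figure is an upper bound, the boosted rates should be read as a best case rather than an expected value.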

Intel

Model	Task	Quantization	Framework	1st token latency (ms)	Token Rate (tokens/s)	Peak Memory Footprint (GB)
GLM-Edge-4B-Chat	(input/output=1024/128)	INT4	OPENVINO	541.2	27	3.9
GLM-Edge-1.5B-Chat	(input/output=1024/128)	INT4	OPENVINO	228.2	63	2.3
GLM-Edge-V-2B	Single image understanding (672x672)	INT4	OPENVINO	362.1	70	3.4

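Tested on the Intel LNL 288V (ARC 140V 8X@2.05GHz) platform.

When running the V models, an additional 1.7 s of processing time per image and about 2 GB of extra memory are required.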

Model results

Image description

Counting:

Math:

OCR and understanding

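Model inference

Inference was run on the free GPU compute provided by the ModelScope community.

transformers inference

Install dependencies

Please install transformers from source.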

pip install git+https://github.com/huggingface/transformers.git

Large language model inference

from modelscope import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "ZhipuAI/glm-edge-4b-chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto")

# Build the chat-formatted input for a single user turn
message = [{"role": "user", "content": "hello!"}]
inputs = tokenizer.apply_chat_template(
    message,
    return_tensors="pt",
    add_generation_prompt=True,
    return_dict=True,
).to(model.device)

# Greedy decoding, at most 128 new tokens
generate_kwargs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
    "max_new_tokens": 128,
    "do_sample": False,
}
out = model.generate(**generate_kwargs)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

GPU memory usage:

Multimodal model inference

import torch
from PIL import Image
from modelscope import snapshot_download
from transformers import (
    AutoTokenizer,
    AutoImageProcessor,
    AutoModelForCausalLM,
)

url = "example.png"  # local path to the input image
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "describe this image"}]}]
image = Image.open(url)

# Download the model files from ModelScope
model_dir = snapshot_download("ZhipuAI/glm-edge-v-5b")

processor = AutoImageProcessor.from_pretrained(model_dir, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, tokenize=True, return_tensors="pt"
).to(next(model.parameters()).device)

# Pass the preprocessed image alongside the tokenized prompt
generate_kwargs = {
    **inputs,
    "pixel_values": torch.tensor(processor(image).pixel_values).to(next(model.parameters()).device),
}
output = model.generate(**generate_kwargs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(output[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

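GPU memory usage:

llama.cpp inference

Environment setup

Adaptation code for these models is currently being merged into the official llama.cpp. In the meantime, you can test with the adapted fork below: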

git clone https://github.com/piDack/llama.cpp -b support_glm_edge_model
cd llama.cpp
cmake -B build -DGGML_CUDA=ON # or enable another hardware backend
cmake --build build -- -j

Large language model inference

Model download:

Download the model with the ModelScope CLI:

modelscope download --model=ZhipuAI/glm-edge-4b-chat-gguf --local_dir . ggml-model-Q5_K_M.gguf

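Inference

Once the build is complete, you can launch the GLM-Edge Chat model with the following commands:

cd build/bin
./llama-cli -m /mnt/workspace/ggml-model-Q5_K_M.gguf -p "<|user|>\nhi<|assistant|>\n" -ngl 999

In the command-line interface you can then interact with the model: enter your request and the model will reply.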

Multimodal model inference

Model download:

Download the model with the ModelScope CLI:

modelscope download --model=ZhipuAI/glm-edge-v-5b-gguf --local_dir . ggml-model-Q5_K.gguf
modelscope download --model=ZhipuAI/glm-edge-v-5b-gguf --local_dir . mmproj-model-f16.gguf

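Inference

Once the build is complete, you can launch the GLM-Edge multimodal model with the following commands:

cd build/bin
./llama-llava-cli -m /mnt/workspace/ggml-model-Q5_K.gguf --mmproj /mnt/workspace/mmproj-model-f16.gguf --image img_path/image.jpg -p "<|system|>\n system prompt <image><|user|>\n prompt <|assistant|>\n"

In the command-line interface you can then interact with the model: enter your request and the model will reply.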
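The prompt strings passed with -p in the llama.cpp commands follow a fixed tag layout. The helpers below are a minimal sketch that reproduces exactly the two layouts shown in those commands. The function names are hypothetical, and only the single-turn forms that appear in this article are covered, since the multi-turn layout is not stated here.

```python
def chat_prompt(user_text: str) -> str:
    """Single-turn prompt in the form used by the llama-cli command."""
    return f"<|user|>\n{user_text}<|assistant|>\n"


def vision_prompt(system_text: str, user_text: str) -> str:
    """Prompt in the form used by the llama-llava-cli command, with one image."""
    return f"<|system|>\n {system_text} <image><|user|>\n {user_text} <|assistant|>\n"


def shell_escaped(prompt: str) -> str:
    """Render real newlines as the two-character \\n escapes seen in the commands."""
    return prompt.replace("\n", "\\n")


if __name__ == "__main__":
    # Prints the same strings that appear after -p in the two commands
    print(shell_escaped(chat_prompt("hi")))
    print(shell_escaped(vision_prompt("system prompt", "prompt")))
```

The builders return real newline characters, which is convenient when the prompt is passed programmatically. shell_escaped is only there to compare the result against the command-line form.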

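Model fine-tuning

ms-swift is the official LLM toolkit from the ModelScope community. It supports the full workflow from fine-tuning to deployment for 400+ large language models and 100+ multimodal large models.

ms-swift repository:

https://github.com/modelscope/ms-swift

Before you start fine-tuning, make sure ms-swift is correctly installed in your environment.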

# Install ms-swift
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .[llm]

glm-edge-1.5b-chat

We fine-tune glm-edge-1.5b-chat for self-cognition. Self-cognition dataset: https://www.modelscope.cn/datasets/swift/self-cognition

Fine-tuning script:

# Test environment: 3090, A10
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type glm-edge-1_5b-chat \
    --model_id_or_path ZhipuAI/glm-edge-1.5b-chat \
    --dataset alpaca-zh#500 alpaca-en#500 self-cognition#500 \
    --logging_steps 5 \
    --learning_rate 1e-4 \
    --output_dir output \
    --lora_target_modules ALL \
    --model_name 小黄 'Xiao Huang' \
    --model_author 魔搭 ModelScope

GPU memory usage during fine-tuning:

glm-edge-v-2b

We fine-tune glm-edge-v-2b for image captioning. Fine-tuning dataset: https://modelscope.cn/datasets/modelscope/coco_2014_caption

Fine-tuning script:

# Test environment: 3090, A10
CUDA_VISIBLE_DEVICES=0 \
swift sft \
  --model_type glm-edge-v-2b \
  --model_id_or_path ZhipuAI/glm-edge-v-2b \
  --sft_type lora \
  --learning_rate 1e-4 \
  --output_dir output \
  --dataset coco-en-mini#20000

To use a custom dataset, specify it as follows:

# val_dataset is optional. If it is omitted, part of dataset is split off as the validation set
  --dataset train.jsonl \
  --val_dataset val.jsonl \

Custom dataset format:

{"query": "<image>query", "response": "response", "images": ["image_path"]}
{"query": "query3", "response": "response3", "history": [["query1", "response1"], ["query2", "response2"]]}
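A train.jsonl file in this format can be produced with a few lines of Python. The sketch below is one possible way to do it and uses only the keys shown above (query, response, images, history). The helper names and the validation rules are illustrative assumptions and are not part of ms-swift.

```python
import json


def make_record(query, response, images=None, history=None):
    """Build one training record with the keys shown in the format above."""
    record = {"query": query, "response": response}
    if history:
        record["history"] = [list(turn) for turn in history]
    if images:
        record["images"] = list(images)
    return record


def check_record(record):
    """Raise ValueError if a record does not match the documented shape."""
    for key in ("query", "response"):
        if not isinstance(record.get(key), str):
            raise ValueError(f"'{key}' must be a string")
    for turn in record.get("history", []):
        if len(turn) != 2:
            raise ValueError("each history turn must be a [query, response] pair")
    if not all(isinstance(p, str) for p in record.get("images", [])):
        raise ValueError("'images' must be a list of path strings")


def write_jsonl(records, path):
    """Write one JSON object per line; ensure_ascii=False keeps non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            check_record(record)
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    records = [
        make_record("<image>query", "response", images=["image_path"]),
        make_record("query3", "response3",
                    history=[("query1", "response1"), ("query2", "response2")]),
    ]
    write_jsonl(records, "train.jsonl")
```

Checking each record before writing catches malformed rows before a training run starts, which is cheaper than finding them partway through data loading.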

GPU memory usage during fine-tuning:

The inference script after fine-tuning is shown below. Change ckpt_dir to the last checkpoint folder produced by training.

# To merge the LoRA weights, add `--merge_lora true`
CUDA_VISIBLE_DEVICES=0 swift infer \
    --ckpt_dir output/glm-edge-v-2b/vx-xxx/checkpoint-xxx \
    --load_dataset_config true --show_dataset_sample 10
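Picking the last checkpoint by hand is easy to get wrong, because directory names sort lexicographically (checkpoint-50 sorts after checkpoint-400). The helper below is a small sketch that selects the highest step number instead. It assumes the run directory contains subfolders named checkpoint-<step>, as in the path shown in the command. The function name is hypothetical, and the run directory name (vx-xxx) is a placeholder that must be replaced with the actual folder.

```python
import re
from pathlib import Path


def last_checkpoint(run_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    best_step, best_path = -1, None
    for child in Path(run_dir).iterdir():
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        # Compare steps numerically so that 400 beats 50
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


if __name__ == "__main__":
    # Placeholder path taken from the command; replace vx-xxx with the real run folder
    run_dir = Path("output/glm-edge-v-2b/vx-xxx")
    if run_dir.is_dir():
        print(last_checkpoint(run_dir))
```

The returned path can then be passed as the value of --ckpt_dir.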

Inference results of the fine-tuned model on the validation set (trained for 400 steps):
