NVIDIA Developer, October 24, 01:58
Unsloth Project Simplifies LLM Fine-Tuning, Empowering More Developers

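Unsloth is an open source framework designed to simplify and accelerate fine-tuning and reinforcement learning for large language models (LLMs). Using custom Triton kernels and algorithms, it doubles training throughput and cuts VRAM usage by 70% with no loss of accuracy. The project is now optimized for NVIDIA Blackwell GPUs and supports popular models such as Llama, gpt-oss, and DeepSeek. Unsloth runs on hardware ranging from consumer NVIDIA GPUs (such as the GeForce RTX 50 Series) to enterprise NVIDIA DGX Cloud instances, making custom LLM development more widely accessible.

💡 **Simpler LLM fine-tuning and reinforcement learning**: Through its custom Triton kernels and algorithms, the open source Unsloth project lowers the barrier to LLM fine-tuning and reinforcement learning. It delivers up to 2x training throughput and 70% VRAM savings without affecting model accuracy, so individual developers and small teams can customize and study LLMs.

🚀 **NVIDIA Blackwell GPU optimization and broad compatibility**: Unsloth is optimized for NVIDIA Blackwell GPUs and adds NVFP4 precision support for further performance gains. The framework runs on a range of NVIDIA hardware, including the consumer GeForce RTX 50 Series, the RTX PRO 6000 Blackwell Series, NVIDIA DGX Spark, and enterprise NVIDIA DGX Cloud, which widens hardware access for LLM training.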

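📈 **Performance gains and long-context handling**: On NVIDIA Blackwell GPUs, Unsloth delivers 2x faster training, 70% lower VRAM usage, and 12x longer context windows compared to other optimized setups, including Flash Attention 2. As a result, models with up to 40 billion parameters can be fine-tuned on a single Blackwell GPU, and smaller GPUs can handle longer text sequences.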

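🛠️ **Flexible deployment and ease of use**: Unsloth offers several installation and deployment options, including pip installation, an isolated virtual environment, and containerized Docker deployment. Users can experiment on local NVIDIA GPUs and then scale the same workflows to cloud platforms such as NVIDIA DGX Cloud for production-scale training and fine-tuning.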

Fine-tuning and reinforcement learning (RL) for large language models (LLMs) require advanced expertise and complex workflows, making them out of reach for many. The open source Unsloth project changes that by streamlining the process, making it easier for individuals and small teams to explore LLM customization. When paired with the efficiency and throughput of the NVIDIA Blackwell GPUs, this combination helps democratize access to LLM development, opening the door for a wider community of practitioners to innovate.

This post explains how developers can train custom LLMs locally on NVIDIA RTX PRO 6000 Blackwell Series, GeForce RTX 50 Series, and NVIDIA DGX Spark using Unsloth. It also covers how these same workflows scale seamlessly into Blackwell-powered cloud instances, such as NVIDIA DGX Cloud and those from NVIDIA Cloud Partners, for production workloads.

What is Unsloth?

Unsloth is an open source framework that simplifies and accelerates LLM fine-tuning and RL. It uses custom Triton kernels and algorithms to deliver:

- 2x faster training throughput
- 70% less VRAM usage
- No accuracy loss

It supports popular models such as Llama, gpt-oss, and DeepSeek, and is now optimized for NVIDIA Blackwell GPUs with NVFP4 precision.

With support from the NVIDIA DGX Cloud AI team, Unsloth extends from consumer GPUs, such as the GeForce RTX 50 Series, RTX PRO 6000 Blackwell Series, and NVIDIA GB10-based developer workstations (such as the NVIDIA DGX Spark), to enterprise-class NVIDIA HGX B200 and NVIDIA GB200 NVL72 systems. This makes fine-tuning accessible to everyone.

How does Unsloth perform on NVIDIA Blackwell? 

Unsloth benchmarks show that, with NVIDIA Blackwell, it delivers significant gains compared to other optimized setups, including Flash Attention 2. Specifically, it delivers:

- 2x increase in training speed
- 70% VRAM reduction (even for 70B+ parameter models)
- 12x longer context windows

These results mean that you can now fine-tune models with as many as 40 billion parameters on a single Blackwell GPU.

Test setup: NVIDIA GeForce RTX 5090 GPU with 32 GB of VRAM, Alpaca dataset, batch size = 2, gradient accumulation = 4, rank = 32, QLoRA applied on all linear layers.
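
The test setup above can be written down as a small configuration object. This is a minimal sketch using only the Python standard library; the class name `QLoRABenchmarkConfig`, its field names, and the list of projection-layer names are illustrative assumptions rather than part of the Unsloth API. The layer names follow the common naming convention for linear layers in Llama-style models.

```python
from dataclasses import dataclass, field

# Assumption: typical names of the linear projection layers in Llama-style models.
LLAMA_LINEAR_LAYERS = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
    "gate_proj", "up_proj", "down_proj",     # MLP projections
]


@dataclass
class QLoRABenchmarkConfig:
    """Hypothetical container for the benchmark settings quoted in the text."""
    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    lora_rank: int = 32
    target_modules: list = field(default_factory=lambda: list(LLAMA_LINEAR_LAYERS))

    @property
    def effective_batch_size(self) -> int:
        # Number of samples that contribute to each optimizer step on one GPU.
        return self.per_device_batch_size * self.gradient_accumulation_steps


config = QLoRABenchmarkConfig()
print(config.effective_batch_size)
```

With a batch size of 2 and 4 gradient accumulation steps, each optimizer step sees 8 samples while only 2 are held in memory at a time.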

| Model | VRAM | Unsloth speed | VRAM reduction | Longer context | Hugging Face + FA2 |
|---|---|---|---|---|---|
| Llama 3.1 (8B) | 80 GB | 2x | >70% | 12x longer | 1x |

Table 1. Performance benchmarks for Unsloth on a GeForce RTX 5090 GPU

| VRAM | Unsloth context length | Hugging Face + FA2 context length |
|---|---|---|
| 8 GB | 2,972 | OOM |
| 12 GB | 21,848 | 932 |
| 16 GB | 40,724 | 2,551 |
| 24 GB | 78,475 | 5,789 |
| 32 GB | 122,181 | 9,711 |

Table 2. Detailed benchmarks for different context lengths for Unsloth on a GeForce RTX 5090 GPU
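
To see how the "12x longer context" figure relates to Table 2, the ratio between the two columns can be computed directly. The sketch below copies the values from the table; the names `CONTEXT_LENGTHS` and `context_ratio` are illustrative.

```python
# Context lengths copied from Table 2; None marks an out-of-memory (OOM) run.
CONTEXT_LENGTHS = {
    8: (2_972, None),
    12: (21_848, 932),
    16: (40_724, 2_551),
    24: (78_475, 5_789),
    32: (122_181, 9_711),
}


def context_ratio(unsloth_len, baseline_len):
    """Return how many times longer the Unsloth context is, or None if the baseline hit OOM."""
    if baseline_len is None:
        return None
    return unsloth_len / baseline_len


ratios = {vram: context_ratio(u, b) for vram, (u, b) in CONTEXT_LENGTHS.items()}

for vram_gb, ratio in ratios.items():
    label = "n/a (baseline OOM)" if ratio is None else f"{ratio:.1f}x"
    print(f"{vram_gb} GB: {label}")
```

At 32 GB the ratio is about 12.6x, in line with the headline figure. At smaller VRAM budgets the gap is wider (about 23.4x at 12 GB), and at 8 GB the baseline does not run at all.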

How to set up Unsloth on NVIDIA GPUs

Unsloth setup is easy, whether you prefer a quick pip install, an isolated virtual environment, or a containerized Docker deployment. Try the following examples on any Blackwell generation GPU, including the GeForce RTX 50 Series.

Running a 20B model

The following example shows what it might look like to run the gpt-oss-20b model:

    from unsloth import FastLanguageModel
    import torch

    max_seq_length = 1024

    # 4bit pre quantized models we support for 4x faster downloading + no OOMs.
    fourbit_models = [
        "unsloth/gpt-oss-20b-unsloth-bnb-4bit", # 20B model using bitsandbytes 4bit quantization
        "unsloth/gpt-oss-120b-unsloth-bnb-4bit",
        "unsloth/gpt-oss-20b", # 20B model using MXFP4 format
        "unsloth/gpt-oss-120b",
    ] # More models at https://huggingface.co/unsloth

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "unsloth/gpt-oss-20b",
        max_seq_length = max_seq_length, # Choose any for long context!
        load_in_4bit = True,  # 4 bit quantization to reduce memory
        full_finetuning = False, # [NEW!] We have full finetuning now!
        # token = "hf_...", # use one if using gated models
    )
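
A rough calculation shows why 4-bit loading matters for a model of this size. The sketch below estimates memory for the weights alone, assuming a nominal 20 billion parameters; it ignores quantization metadata, activations, LoRA adapters, and optimizer state, so real usage will be higher. The helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_20B = 20e9  # assumption: nominal parameter count implied by the model name

for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(PARAMS_20B, bits):.0f} GB")
```

Under these assumptions, 16-bit weights need about 40 GB, which exceeds the 32 GB of a GeForce RTX 5090, while 4-bit weights need about 10 GB and leave room for training state.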

Docker deployment

Unsloth also offers a prebuilt Docker image that supports NVIDIA Blackwell GPUs.

Note that the Docker container requires the NVIDIA Container Toolkit to be installed on your host system.
Before running the following command, fill in your specific information:

    docker run -d -e JUPYTER_PASSWORD="mypassword" \
      -p 8888:8888 -p 2222:22 \
      -v $(pwd)/work:/workspace/work \
      --gpus all \
      unsloth/unsloth
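
The values to fill in are the Jupyter password, the host ports, and the host directory to mount. As a sketch, the same command can be assembled from Python so that these values are explicit parameters. The function name `build_unsloth_docker_command` and its parameters are hypothetical; the flags are the ones from the command above.

```python
import shlex
from pathlib import Path


def build_unsloth_docker_command(jupyter_password: str, work_dir: Path,
                                 jupyter_port: int = 8888, ssh_port: int = 2222) -> list:
    """Assemble the argument list for the `docker run` command shown above."""
    return [
        "docker", "run", "-d",
        "-e", f"JUPYTER_PASSWORD={jupyter_password}",
        "-p", f"{jupyter_port}:8888",       # host port -> Jupyter inside the container
        "-p", f"{ssh_port}:22",             # host port -> SSH inside the container
        "-v", f"{work_dir}:/workspace/work",
        "--gpus", "all",
        "unsloth/unsloth",
    ]


command = build_unsloth_docker_command("mypassword", Path.cwd() / "work")
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would start the container on a host with Docker and the NVIDIA Container Toolkit installed.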

Using an isolated environment

Issue the following commands from the shell to install Unsloth using Python:

    python -m venv unsloth
    source unsloth/bin/activate
    pip install unsloth

Note: Depending on your system, you may need to use pip3 / pip3.13 and python3 / python3.13.
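
After activating the environment, a quick check from Python can confirm that the interpreter is the one inside the virtual environment and that the package is visible to it. This is a small sketch using only the standard library; the helper names are illustrative.

```python
import importlib.util
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv (its prefix differs from the base install)."""
    return sys.prefix != sys.base_prefix


def unsloth_installed() -> bool:
    """True when a top-level `unsloth` package can be found, without importing it."""
    return importlib.util.find_spec("unsloth") is not None


print("interpreter:", sys.executable)
print("virtual environment active:", in_virtual_environment())
print("unsloth importable:", unsloth_installed())
```

If the first check prints `False`, the `pip install` above most likely went into the system Python rather than the `unsloth` environment.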

Handling issues with xFormers 

If you encounter issues with xFormers, build from source. 

First, uninstall any existing xFormers:

pip uninstall xformers -y

Next, clone and build:

    pip install ninja
    export TORCH_CUDA_ARCH_LIST="12.0"
    git clone --depth=1 https://github.com/facebookresearch/xformers --recursive
    cd xformers && python setup.py install && cd ..
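
The `TORCH_CUDA_ARCH_LIST` value is the CUDA compute capability of the target GPU, written as `major.minor`. The sketch below shows one way to derive that string; the helper names are illustrative, and `detect_arch_list` assumes PyTorch is installed with a CUDA device visible. Other GPUs may report a value different from the `12.0` used above.

```python
def arch_list_from_capability(major: int, minor: int) -> str:
    """Format a CUDA compute capability as a TORCH_CUDA_ARCH_LIST entry."""
    return f"{major}.{minor}"


def detect_arch_list() -> str:
    """Query the first visible GPU through PyTorch; requires torch and a CUDA device."""
    import torch  # imported lazily so the helper above works without PyTorch

    major, minor = torch.cuda.get_device_capability(0)
    return arch_list_from_capability(major, minor)


print(arch_list_from_capability(12, 0))
```

On a machine with a working PyTorch install, calling `detect_arch_list()` before the build avoids hard-coding the value.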

Using uv

If you prefer to use uv, install Unsloth using the following command:

    uv pip install unsloth

While Unsloth enables local experimentation with 20B and 40B models on a single Blackwell GPU, the same workflows are fully portable to NVIDIA DGX Cloud and NVIDIA Cloud Partners. This enables scaling to clusters of Blackwell GPUs for fine-tuning 70B+ models, reinforcement learning, and enterprise workloads without changing a line of code.

Get started transforming LLM training runs

From experimentation to production, NVIDIA DGX Cloud and NVIDIA Cloud Partners deliver the power to train and fine-tune at any scale—combining elastic compute, enterprise storage, and real-time monitoring in fully managed AI environments optimized for NVIDIA GPUs.

According to Unsloth Co-Founder Daniel Han, “AI shouldn’t be an exclusive club. The next great AI breakthrough could come from anywhere—students, individual researchers, or small startups. Unsloth is here to ensure they have the tools they need.”

Start locally on your NVIDIA GeForce RTX 50 Series GPU, NVIDIA RTX PRO 6000 Blackwell Series GPU, or NVIDIA DGX Spark system to fine-tune models with Unsloth. Then scale seamlessly with NVIDIA DGX Cloud or an NVIDIA Cloud Partner to harness clusters of Blackwell GPUs with enterprise-grade reliability and visibility—all without compromise. Check out the step-by-step guide to fine-tuning LLMs with NVIDIA Blackwell GPUs and Unsloth, and how to install the software on NVIDIA DGX Spark.
