AWS Machine Learning Blog, August 12
Fine-tune OpenAI GPT-OSS models on Amazon SageMaker AI using Hugging Face libraries

On August 5, 2025, OpenAI released the GPT-OSS family of models, gpt-oss-20b and gpt-oss-120b, now available on AWS through Amazon SageMaker AI and Amazon Bedrock. Built on a Mixture-of-Experts (MoE) Transformer architecture, these models excel at coding, scientific analysis, and mathematical reasoning, and support a 128,000-token context length. This article walks through how to fine-tune GPT-OSS models on SageMaker using the Hugging Face TRL library, the Accelerate library, and DeepSpeed ZeRO-3, adapting them to specific domains and use cases for efficient, scalable customization.

🚀 **GPT-OSS release and core features**: OpenAI released two text models, gpt-oss-20b and gpt-oss-120b, built on a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per token, delivering high reasoning performance at lower compute cost. They excel at coding, scientific analysis, and mathematical reasoning, support a 128,000-token context window, and offer adjustable reasoning levels, chain-of-thought (CoT) capabilities, structured outputs, and tool use, making them well suited for agentic AI workflows. The models underwent safety-focused training and adversarial fine-tuning evaluations to strengthen robustness.

🔧 **Deployment through SageMaker and Bedrock**: GPT-OSS models can be deployed through Amazon SageMaker JumpStart or accessed through Amazon Bedrock APIs, giving developers the flexibility to integrate the models into production-grade AI workflows. Users can also fine-tune the models on the fully managed infrastructure of SageMaker using open source tools from the Hugging Face ecosystem to meet domain- and use-case-specific needs.

💡 **Efficient fine-tuning techniques in practice**: The article details the process of fine-tuning GPT-OSS models on SageMaker, focusing on combining the Hugging Face TRL library, the Accelerate library, and DeepSpeed ZeRO-3 optimization for efficient distributed training of large models. It also covers the MXFP4 (Microscaling FP4) quantization format and parameter-efficient fine-tuning (PEFT) methods such as LoRA, which together enable high-performance model customization at manageable cost.

📊 **Multilingual reasoning and the dataset**: The article highlights the strengths of GPT-OSS models on complex multilingual reasoning tasks. Fine-tuning on the HuggingFaceH4/Multilingual-Thinking dataset helps the model better understand and generate chain-of-thought logic across languages. This is essential for building multilingual virtual assistants, cross-regional support systems, or international knowledge systems, and it verifies that the model stays logically coherent when switching between languages and reasoning patterns.

⚙️ **End-to-end AI services in SageMaker**: As a managed machine learning service, Amazon SageMaker streamlines the entire foundation model (FM) lifecycle. It offers interactive notebooks, fully managed training jobs, and SageMaker HyperPod clusters, covering everything from model exploration and large-scale fine-tuning to production deployment. Built-in AIOps tooling, such as reusable pipelines and MLflow, supports experiment tracking, model registration, and seamless deployment, along with governance and enterprise-grade security features.

Released on August 5, 2025, OpenAI’s GPT-OSS models, gpt-oss-20b and gpt-oss-120b, are now available on AWS through Amazon SageMaker AI and Amazon Bedrock. These pre-trained, text-only Transformer models are built on a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per token, delivering high reasoning performance while reducing compute costs. They specialize in coding, scientific analysis, and mathematical reasoning, and support a 128,000-token context length, adjustable reasoning levels (low/medium/high), chain-of-thought (CoT) reasoning with audit-friendly traces, structured outputs, and tool use to support agentic AI workflows. As discussed in OpenAI’s documentation, both models have undergone safety-focused training and adversarial fine-tuning evaluations to assess and strengthen robustness against misuse. The following table summarizes the model specifications.

Model                Layers  Total Parameters  Active Parameters Per Token  Total Experts  Active Experts Per Token  Context Length
openai/gpt-oss-120b  36      117 billion       5.1 billion                  128            4                         128,000
openai/gpt-oss-20b   24      21 billion        3.6 billion                  32             4                         128,000

The GPT-OSS models are deployable using Amazon SageMaker JumpStart and also accessible through Amazon Bedrock APIs. Both options give developers the flexibility to deploy and integrate GPT-OSS models into production-grade AI workflows. Beyond out-of-the-box deployment, these models can be fine-tuned to align with specific domains and use cases, using open source tools from the Hugging Face ecosystem and running on the fully managed infrastructure of SageMaker AI.
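For example, after the model is enabled in your account, you can invoke it through the Bedrock Converse API. The following is a minimal sketch; the model ID and Region are illustrative, so confirm the exact identifier and availability in the Amazon Bedrock console:

import boto3

# Bedrock Runtime client; Region is illustrative -- check model availability
bedrock = boto3.client("bedrock-runtime", region_name="us-west-2")

response = bedrock.converse(
    modelId="openai.gpt-oss-20b-1:0",  # illustrative ID -- confirm in the Bedrock console
    messages=[
        {"role": "user", "content": [{"text": "Summarize MoE routing in two sentences."}]}
    ],
)
print(response["output"]["message"]["content"][0]["text"])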

Fine-tuning large language models (LLMs) is the process of adjusting a pre-trained model’s weights using a smaller, task-specific dataset to tailor its behavior to a particular domain or application. Fine-tuning large models like GPT-OSS transforms them from a broad generalist into a domain-specific expert without the cost of training from scratch. Adapting the model to your data and terminology can deliver more accurate, context-aware outputs, improve reliability, and reduce hallucinations. The result is a specialized GPT-OSS that excels at targeted tasks while retaining the scalability, flexibility, and open-weight benefits ideal for secure, enterprise-grade deployment.

In this post, we walk through the process of fine-tuning a GPT-OSS model in a fully managed training environment using SageMaker AI training jobs. The workflow uses the Hugging Face TRL library for fine-tuning, the Hugging Face Accelerate library to simplify distributed training across multiple GPUs and nodes, and the DeepSpeed ZeRO-3 optimization technique to reduce memory usage by partitioning model states across devices for efficient training of billion-parameter models. We then apply this setup to fine-tune the GPT-OSS model on a multilingual reasoning dataset, HuggingFaceH4/Multilingual-Thinking, enabling GPT-OSS to handle structured, CoT reasoning across multiple languages.

Solution overview

SageMaker AI is a managed machine learning (ML) service that streamlines the entire foundation model (FM) lifecycle. It provides hosted, interactive notebooks for rapid exploration, fully managed ephemeral training jobs for large-scale and distributed fine-tuning, and Amazon SageMaker HyperPod clusters that offer granular control over persistent training infrastructure for large-scale model training and fine-tuning workloads. By using managed hosting in SageMaker, you can serve models reliably in production, and the suite of AIOps-ready tools, such as reusable pipelines and fully managed MLflow, support experiment tracking, model registration, and seamless deployment. With built-in governance and enterprise-grade security, SageMaker AI provides data engineers, data scientists, and ML engineers with a unified, fully managed platform to build, train, deploy, and govern FMs end-to-end.

GPT-OSS can be fine-tuned on SageMaker using the latest Hugging Face TRL library, whose fine-tuning workflows can be written as recipes built on the Hugging Face SFTTrainer. These recipes can also be adapted to fine-tune other open-weight language or vision models such as Qwen, Mistral, Meta Llama, and many more. In this post, we show how to fine-tune GPT-OSS in a distributed setup, either on a single-node multi-GPU setup or across a multi-node multi-GPU setup, using Hugging Face Accelerate to manage multi-device training and DeepSpeed ZeRO-3 to train large models more efficiently. Together, they help you fine-tune faster and scale to larger datasets.

We also highlight MXFP4 (Microscaling FP4), a 4-bit floating-point quantization format from the Open Compute Project. It groups tensors into small blocks that each share a scaling factor, which reduces memory and compute needs while helping preserve model accuracy, making it well suited for efficient model training. Complementing quantization, we explore Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA, which adapt large models by learning a small set of additional parameters instead of modifying all weights. This approach is memory- and compute-efficient, highly compatible with quantized models, and supports fine-tuning even on constrained hardware environments.
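To make the LoRA idea concrete, the following is a minimal sketch using the Hugging Face peft library; the rank and target modules shown are placeholders, and the recipe later in this post shows the values actually used for GPT-OSS:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Requires transformers>=4.55 for GPT-OSS support and substantial GPU memory
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")

# LoRA trains small low-rank adapter matrices instead of updating all weights
lora_config = LoraConfig(
    r=8,                          # adapter rank (placeholder)
    lora_alpha=16,                # scaling factor (placeholder)
    target_modules="all-linear",  # attach adapters to every linear layer
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters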

The following diagram illustrates this configuration.

By using MXFP4 quantization, PEFT fine-tuning methods like LoRA, and distributed training with Hugging Face Accelerate and DeepSpeed ZeRO-3 together, we can efficiently and scalably fine-tune large models like gpt-oss-120b and gpt-oss-20b for high-performance customization while keeping infrastructure and compute costs manageable.

Prerequisites

To fine-tune GPT-OSS models on SageMaker AI, you must have the following prerequisites:

Business outcomes for fine-tuning GPT-OSS

Global enterprises increasingly need AI tools that support complex reasoning across multiple languages—whether for multilingual virtual assistants, cross-location support desks, or international knowledge systems. Although FMs offer a powerful starting point, their effectiveness in diverse linguistic contexts hinges on structured reasoning inputs—datasets that surface logic steps explicitly and across languages. That’s why testing with a multilingual, CoT-style dataset is a valuable first step. It lets you verify how well a model holds reasoning coherence when switching between languages and reasoning patterns, laying a robust foundation before scaling to larger, domain-specific multilingual datasets. GPT-OSS is particularly well-suited for this task, with its native CoT capabilities, long 128,000 context window, and adjustable reasoning levels, making it ideal for evaluating and refining multilingual reasoning performance before production deployment.

Fine-tune GPT-OSS models for multilingual reasoning on SageMaker AI

In this section, we walk through how to fine-tune OpenAI’s GPT-OSS models on SageMaker AI using training jobs. SageMaker training jobs support distributed multi-GPU and multi-node configurations, so you can spin up high-performance clusters on demand, train billion-parameter models faster, and automatically shut down resources when the job finishes.

Set up your environment

In the following sections, we run the code from SageMaker Studio JupyterLab notebook instances. You can also use your preferred IDE, such as VS Code or PyCharm, but make sure your local environment is configured to work with AWS, as discussed in the prerequisites.

Complete the following steps:

    On the SageMaker AI console, choose Domains in the navigation pane, then open your domain. In the navigation pane under Applications and IDEs, choose Studio. On the User profiles tab, locate your user profile, choose Launch, and then choose Studio.

    In SageMaker Studio, launch an ml.t3.medium JupyterLab notebook instance with at least 50 GB of storage.

A large notebook instance isn’t required, because the fine-tuning job will run on a separate ephemeral training job instance with NVIDIA accelerators.

    To begin fine-tuning, clone the GitHub repo, navigate to the 3_distributed_training/models/openai--gpt-oss directory, and launch the finetune_gpt_oss.ipynb notebook with a Python 3.12 or higher kernel:

# clone github repo
git clone https://github.com/aws-samples/amazon-sagemaker-generativeai.git

Dataset for fine-tuning

Selecting and curating the right dataset is a critical first step in fine-tuning any LLM. In this post, we use the HuggingFaceH4/Multilingual-Thinking dataset, which is a multilingual reasoning dataset containing CoT examples translated into languages such as French, Spanish, and German. Its combination of diverse languages, varied reasoning tasks, and explicit step-by-step thought processes makes it well-suited for evaluating how a model handles structured reasoning, adapts to multilingual inputs, and maintains logical consistency across different linguistic contexts. With around 1,000 examples, it’s small enough for quick experimentation yet sufficient to demonstrate fine-tuning and evaluation of large pre-trained models like GPT-OSS. The dataset can be loaded in just a few lines of code using the Hugging Face Datasets library:

# load datasets in memory
from datasets import load_dataset

dataset_name = 'HuggingFaceH4/Multilingual-Thinking'
dataset = load_dataset(dataset_name, split="train")

The following code is some sample data:

{  "reasoning_language": "French",  "developer": "You are a recipe suggestion bot, ...",  "user": "Can you provide me with a step-by-step ...",  "analysis": "D'accord, l'utilisateur souhaite une recette ...",  "final": "Certainly! Here's a classic homemade chocolate ...",  "messages": [    {      "content": "reasoning language: French\n\nYou are a ...",      "role": "system",      "thinking": null    },    {      "content": "Can you provide me with a step-by-step ...",      "role": "user",      "thinking": null    },    {      "content": "Certainly! Here's a classic homemade chocolate ...",      "role": "assistant",      "thinking": "D'accord, l'utilisateur souhaite une recette ...“    }  ]}

For supervised fine-tuning, we use only the data in the messages key to train our GPT-OSS model. Because TRL’s SFTTrainer natively supports this format, it can be used as-is. We extract all rows containing only the messages key, save them in JSONL format, and upload the file to Amazon Simple Storage Service (Amazon S3). This makes sure the dataset is readily accessible to SageMaker training jobs at runtime.

import os

# preserve only messages key
dataset = dataset.remove_columns(
    [col for col in dataset.column_names if col != "messages"]
)

# save as JSONL format
dataset_filename = os.path.join(
    dataset_parent_path,
    f"{dataset_name.replace('/', '--').replace('.', '-')}.jsonl"
)
dataset.to_json(dataset_filename, lines=True)
...
from sagemaker.s3 import S3Uploader

# select a data destination bucket
data_s3_uri = f"s3://{sess.default_bucket()}/dataset"

# upload to S3
uploaded_s3_uri = S3Uploader.upload(
    local_path=dataset_filename,
    desired_s3_uri=data_s3_uri
)
print(f"Uploaded {dataset_filename} to > {uploaded_s3_uri}")

Experimentation tracking with MLflow (Optional)

SageMaker AI offers the fully managed MLflow capability, so you can track multiple training runs within experiments, compare results with visualizations, evaluate models, and register the best ones in the model registry. MLflow also supports integration with agentic workflows.

TRL’s SFTTrainer natively integrates with experiment tracking tools such as MLflow, TensorBoard, Weights & Biases, and more. With SFTTrainer, you can log training parameters, hyperparameters, loss metrics, system metrics, and more to a centralized location, giving you audit trails, governance, and streamlined experiment tracking. This step is optional; if you choose not to use SageMaker managed MLflow, you can set the SFTTrainer report_to parameter to tensorboard, which logs all metrics locally to disk for visualization using a local or remote TensorBoard service.

# set to None to log to local disk
MLFLOW_TRACKING_SERVER_ARN = None  # or "arn:aws:sagemaker:us-west-2:<account-id>:mlflow-tracking-server/<server-name>"

if MLFLOW_TRACKING_SERVER_ARN:
    reports_to = "mlflow"
else:
    reports_to = "tensorboard"

print("reports to:", reports_to)

Experiments logged from TRL’s SFTTrainer to an MLflow tracking server in SageMaker automatically capture key metrics and parameters. The SageMaker managed MLflow service renders real-time visualizations, profiles training hardware with minimal setup, enables side-by-side run comparisons, and provides built-in evaluation tools to track, train, and assess your fine-tuning jobs end-to-end.
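Inside a training script, pointing MLflow at the managed tracking server takes only a couple of calls. The following is a minimal sketch, assuming the sagemaker-mlflow plugin (listed later in requirements.txt) is installed so MLflow accepts the tracking server ARN as a URI; the environment variable and experiment names here are our own, not SageMaker defaults:

import os
import mlflow

# Tracking server ARN passed to the job, for example through the estimator's
# environment; the variable name is an assumption, not a SageMaker default
tracking_arn = os.environ.get("MLFLOW_TRACKING_SERVER_ARN")
if tracking_arn:
    mlflow.set_tracking_uri(tracking_arn)        # sagemaker-mlflow resolves the ARN
    mlflow.set_experiment("gpt-oss-finetuning")  # experiment name is illustrative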

Fine-tune GPT-OSS on training jobs

The following example demonstrates how to fine-tune the gpt-oss-20b model. To switch to gpt-oss-120b, simply update the model_name. The model-to-instance mapping shown in this section has been tested as part of this notebook workflow. You can adjust the instance type and instance count to fit your specific use case.

The following table summarizes the different model specifications.

GPT-OSS Model        SageMaker Instance  GPU Specifications
openai/gpt-oss-120b  ml.p5en.48xlarge    8× NVIDIA H200 GPUs, 141 GB HBM3e each
openai/gpt-oss-20b   ml.p4de.24xlarge    8× NVIDIA A100 GPUs, 80 GB HBM2e each
# User-defined variables
model_name = "openai/gpt-oss-20b"
tokenizer_name = "openai/gpt-oss-20b"

# dataset path inside a sagemaker container
dataset_path = "/opt/ml/input/data/training/HuggingFaceH4--Multilingual-Thinking.jsonl"
output_path = "/opt/ml/model/openai-gpt-oss-20b-HuggingFaceH4-Multilingual-Thinking/"

# bf16 supported only on Ampere, Hopper, and Grace Blackwell
bf16_flag = "true"

SageMaker training jobs automatically download datasets from the specified S3 prefix or file into the training container, mapping them to /opt/ml/input. Training artifacts and logs are stored in /opt/ml/output, and the final trained or fine-tuned model is saved to /opt/ml/model. Saving the model to this path allows SageMaker to automatically detect it for downstream workflows such as model registration, deployment, and other automation. You can set or unset the bf16_flag to choose between float16 and bfloat16. float16 uses less memory but has a smaller numeric range, whereas bfloat16 provides a wider range with similar memory savings, making it more stable for training large models. bfloat16 is supported on newer GPU architectures such as NVIDIA Ampere, Hopper, and Grace Blackwell.
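In the training script, the flag can translate into a dtype choice along the following lines (a sketch, not the exact code from the repository):

import torch

bf16_flag = "true"  # string hyperparameter passed through the SageMaker estimator

# bfloat16 keeps float32's exponent range at half the memory; float16 is the
# fallback for GPUs without bfloat16 support (pre-Ampere architectures)
torch_dtype = torch.bfloat16 if bf16_flag == "true" else torch.float16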

Fine-tuning with open source Hugging Face recipes

With Hugging Face’s TRL library, you can define Supervised Fine-Tuning (SFT) recipes, which are essentially preconfigured training workflows that streamline fine-tuning FMs like Meta Llama, Qwen, Mistral, and now OpenAI GPT-OSS with minimal setup. These recipes simplify the process of adapting models to new datasets using TRL’s SFTTrainer and configuration tools.

yaml_template = """# Model argumentsmodel_name_or_path: {{ model_name }}tokenizer_name_or_path: {{ tokenizer_name }}model_revision: maintorch_dtype: bfloat16attn_implementation: kernels-community/vllm-flash-attn3bf16: {{ bf16_flag }}tf32: falseoutput_dir: {{ output_dir }}# Dataset argumentsdataset_id_or_path: {{ dataset_path }}max_seq_length: 2048packing: truepacking_strategy: wrapped# LoRA argumentsuse_peft: truelora_target_modules: "all-linear"### Specific to GPT-OSSlora_modules_to_save: ["7.mlp.experts.gate_up_proj", "7.mlp.experts.down_proj", "15.mlp.experts.gate_up_proj", "15.mlp.experts.down_proj", "23.mlp.experts.gate_up_proj", "23.mlp.experts.down_proj"]lora_r: 8lora_alpha: 16# Training argumentsnum_train_epochs: 1. per_device_train_batch_size: 6per_device_eval_batch_size: 6gradient_accumulation_steps: 3gradient_checkpointing: trueoptim: adamw_torch_fusedgradient_checkpointing_kwargs:  use_reentrant: truelearning_rate: 1.0e-4lr_scheduler_type: cosinewarmup_ratio: 0.1max_grad_norm: 0.3bf16: {{ bf16_flag }}bf16_full_eval: {{ bf16_flag }}tf32: false# Logging argumentslogging_strategy: stepslogging_steps: 2report_to:  - {{ reports_to }}save_strategy: "epoch"seed: 42"""config_filename = "openai-gpt-oss-20b-qlora.yaml"

The recipe.yaml file contains the following key parameters:

After a recipe is defined and tested, you can seamlessly swap configurations such as the model name, dataset, number of epochs, or PEFT settings and run or rerun the fine-tuning workflow with minimal or no code changes.
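The {{ }} placeholders in yaml_template suggest standard template rendering; the following is a minimal sketch using jinja2 with the variables defined earlier (the notebook’s actual rendering step may differ in detail):

from jinja2 import Template

# Fill the recipe template with the user-defined variables from earlier
rendered_recipe = Template(yaml_template).render(
    model_name=model_name,
    tokenizer_name=tokenizer_name,
    dataset_path=dataset_path,
    output_dir=output_path,
    bf16_flag=bf16_flag,
    reports_to=reports_to,
)

# Write it where the training job expects it (see the directory structure below)
with open(f"code/recipes/{config_filename}", "w") as f:
    f.write(rendered_recipe)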

SageMaker estimators

As a next step, we use a SageMaker training job estimator to spin up a training cluster and run the model fine-tuning. The SageMaker AI estimator API provides a high-level interface to define and run training jobs on fully managed infrastructure, handling environment setup, scaling, and artifact management. You can specify training scripts, input data, and compute resources without manually provisioning servers. SageMaker also offers prebuilt Hugging Face and PyTorch estimators, which come optimized for their respective frameworks, making it straightforward to train and fine-tune models with minimal setup.

It’s recommended to use Python 3.12 or higher to fine-tune GPT-OSS, with the following packages installed. Add or update the requirements.txt file in your script’s root directory with the following packages. SageMaker estimators will automatically detect this file and install the listed dependencies at runtime.

%%writefile code/requirements.txt
transformers>=4.55.0
kernels>=0.9.0
datasets==4.0.0
bitsandbytes==0.46.1
trl>=0.20.0
peft>=0.17.0
lighteval==0.10.0
hf-transfer==0.1.8
hf_xet
tensorboard
liger-kernel==0.6.1
deepspeed==0.17.4
lm-eval[api]==0.4.9
Pillow
mlflow
sagemaker-mlflow==0.1.0
triton
git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Define a SageMaker estimator and point it to your local training script directory. SageMaker will package the contents and place them in /opt/ml/code inside the training container. This includes your training script, additional modules in the directory, and if a requirements.txt file is present, SageMaker will automatically install the listed packages at runtime.

pytorch_estimator = PyTorch(
    image_uri="763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.7.1-gpu-py312-cu128-ubuntu22.04-sagemaker",
    entry_point="accelerate_sagemaker_train.sh",  # adapted bash script to train using Accelerate on SageMaker (multi-GPU)
    source_dir="code",
    instance_type=training_instance_type,
    instance_count=1,  # multi-node training support
    base_job_name=f"{job_name}-pytorch",
    role=role,
    ...
    hyperparameters={
        "num_process": NUM_GPUS,  # number of GPUs for distributed training, per instance
        "config": f"recipes/{config_filename}",
    },
)

The following is the directory structure for fine-tuning GPT-OSS on SageMaker AI training jobs:

code/
├── accelerate/                    # Accelerate configuration files
├── accelerate_sagemaker_train.sh  # Launch script for distributed training with Accelerate on SageMaker training jobs
├── gpt_oss_sft.py                 # Main training script for supervised fine-tuning (SFT) of GPT-OSS
├── recipes/                       # Predefined training configuration recipes (YAML)
└── requirements.txt               # Python dependencies installed at runtime

To fine-tune across multiple GPUs, we use Hugging Face Accelerate and DeepSpeed ZeRO-3, which work together to train large models more efficiently. Hugging Face Accelerate simplifies launching distributed training by automatically handling device placement, process management, and mixed precision settings. DeepSpeed ZeRO-3 reduces memory usage by partitioning optimizer states, gradients, and parameters across devices—allowing billion-parameter models to fit and train faster.
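The accelerate/zero3.yaml file referenced in the launch command follows the standard Accelerate config format. The following is a representative sketch, not the exact file from the repository:

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3                  # partition optimizer states, gradients, and parameters
  zero3_init_flag: true          # initialize large models directly in partitioned form
  offload_optimizer_device: none
  offload_param_device: none
mixed_precision: bf16
num_machines: 1
num_processes: 8                 # one process per GPU; overridden by --num_processes
machine_rank: 0
main_training_function: main
use_cpu: false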

You can run your SFTTrainer script with Hugging Face Accelerate using a simple command like the following:

accelerate launch \
    --config_file accelerate/zero3.yaml \
    --num_processes 8 \
    gpt_oss_sft.py \
    --config recipes/openai-gpt-oss-20b-qlora.yaml

SageMaker executes this command inside the training container because we set entry_point="accelerate_sagemaker_train.sh" when initializing the SageMaker estimator. The accelerate_sagemaker_train.sh script is defined as follows:

#!/bin/bash
set -e
...
# Launch fine-tuning with Accelerate + DeepSpeed (ZeRO-3)
accelerate launch \
  --config_file accelerate/zero3.yaml \
  --num_processes "$NUM_GPUS" \
  gpt_oss_sft.py \
  --config "$CONFIG_PATH"

PEFT vs. full fine-tuning

The gpt_oss_sft.py script lets you choose between PEFT and full fine-tuning by setting use_peft to true or false. Full fine-tuning gives you greater control over the base model weights, enabling broader adaptability and expressiveness. However, it also carries the risk of catastrophic forgetting and higher resource consumption during the training process.
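If you train with use_peft set to true, the job saves LoRA adapters rather than a fully updated model. You can fold them back into the base weights before deployment; the following is a minimal sketch with peft, with illustrative paths:

from peft import AutoPeftModelForCausalLM

# Load the base model plus LoRA adapters from the training output, then merge
# the adapters into the base weights so the result deploys like a plain model
model = AutoPeftModelForCausalLM.from_pretrained("/opt/ml/model/checkpoint-final")  # illustrative path
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model/merged")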

At the end of training, you will have the fully adapted model weights, which can be deployed to a SageMaker endpoint for inference. You can then run predictions against the deployed endpoint using the SageMaker Predictor.
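The following is a minimal deployment sketch using the SageMaker Python SDK’s HuggingFaceModel; the framework versions and instance type are placeholders, so choose a container and instance that support GPT-OSS:

from sagemaker.huggingface import HuggingFaceModel

# Point the model at the artifacts produced by the training job
hf_model = HuggingFaceModel(
    model_data=pytorch_estimator.model_data,  # s3://.../model.tar.gz from the job
    role=role,
    transformers_version="4.55",  # placeholder -- use a version with GPT-OSS support
    pytorch_version="2.7",        # placeholder
    py_version="py312",           # placeholder
)

predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",  # placeholder instance type
)

# Run a quick multilingual prompt against the endpoint
print(predictor.predict({"inputs": "Réponds en français : pourquoi le ciel est-il bleu ?"}))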

Conclusion

In this post, we demonstrated how to fine-tune OpenAI’s GPT-OSS models (gpt-oss-120b and gpt-oss-20b) on SageMaker AI using SageMaker training jobs, the Hugging Face TRL library, and distributed training with Hugging Face Accelerate and DeepSpeed ZeRO-3. By combining the fully managed, ephemeral infrastructure of SageMaker with TRL’s streamlined fine-tuning recipes, you can adapt GPT-OSS to your domain quickly and efficiently, using either PEFT for cost-effective customization or full fine-tuning for maximum model control. With the resulting model artifacts, you can deploy to SageMaker endpoints for secure, scalable inference and bring advanced reasoning capabilities directly into your enterprise workflows.

If you’re interested in exploring further, the GitHub repo contains all the resources used in this walkthrough. It’s a great starting point for experimenting with fine-tuning GPT-OSS on your own datasets and deploying the resulting models to SageMaker for real-world applications. You can get set up with a notebook in minutes using the SageMaker Studio domain quick setup and start experimenting right away.


About the authors

Pranav Murthy is a Senior Generative AI Data Scientist at AWS, specializing in helping organizations innovate with Generative AI, Deep Learning, and Machine Learning on Amazon SageMaker AI. Over the past 10+ years, he has developed and scaled advanced computer vision (CV) and natural language processing (NLP) models to tackle high-impact problems—from optimizing global supply chains to enabling real-time video analytics and multilingual search. When he’s not building AI solutions, Pranav enjoys playing strategic games like chess, traveling to discover new cultures, and mentoring aspiring AI practitioners. You can find Pranav on LinkedIn.

Sumedha Swamy is a Senior Manager of Product Management at Amazon Web Services (AWS), where he leads several areas of Amazon SageMaker, including SageMaker Studio (the industry-leading integrated development environment for machine learning), developer and administrator experiences, AI infrastructure, and the SageMaker SDK.
