Nvidia Developer · 14 hours ago
NVIDIA DGX Spark: A New Option for Local AI Development

NVIDIA DGX Spark is a compact supercomputer designed to give AI developers a local alternative to cloud services and data-center queues. Built on the Blackwell architecture, it delivers 1 petaflop of FP4 AI compute performance, 128 GB of unified system memory, and 273 GB/second of memory bandwidth. DGX Spark ships with the NVIDIA AI software stack preinstalled, letting developers run large, compute-intensive tasks such as model fine-tuning, image generation, data science, and inference locally, without moving work to the cloud. The article details DGX Spark's performance on these workloads, provides benchmark data, and highlights its advantages in memory capacity and compute performance.

🚀 **A local AI development solution**: NVIDIA DGX Spark offers a powerful alternative for moving AI development work from the cloud or data center to a local machine. Its compact design integrates the Blackwell architecture, delivering up to 1 petaflop of FP4 AI compute performance and 128 GB of unified system memory, so developers can handle complex AI tasks directly on their desks without waiting for cloud resources or suffering data-transfer delays.

💡 **Strong compute performance and memory advantages**: DGX Spark shows clear advantages across AI workloads including model fine-tuning, image generation, data science, and inference. In fine-tuning, for example, it reaches very high tokens/sec throughput on memory-intensive jobs that exceed what consumer GPUs can handle. In image generation, its large memory and compute capacity support higher-resolution, higher-precision output, with the FP4 data format accelerating the process.

📊 **Accelerated data science and inference pipelines**: Through CUDA-X libraries such as cuML and cuDF, DGX Spark dramatically speeds up data-science tasks, letting large datasets be processed and analyzed in seconds. For inference, DGX Spark's Blackwell GPU supports the FP4 data format, which delivers near-FP8 accuracy while significantly improving model performance and responsiveness for faster interactive experiences.

🔗 **Multi-model and multi-GPU support**: DGX Spark supports a range of AI models, including Llama, Flux.1, and SDXL, optimized through multiple backends (such as TensorRT-LLM and llama.cpp). The article also shows that connecting two DGX Spark units makes it possible to run very large models whose memory footprint exceeds 120 GB (such as Qwen3 235B), giving developers unprecedented flexibility for experimentation and prototyping.

Today’s demanding AI developer workloads often need more memory than desktop systems provide, or require software that laptops and PCs lack. This forces work to move to the cloud or data center.

NVIDIA DGX Spark provides an alternative to cloud instances and data-center queues. The Blackwell-powered compact supercomputer delivers 1 petaflop of FP4 AI compute performance, 128 GB of coherent unified system memory, 273 GB/second of memory bandwidth, and the NVIDIA AI software stack preinstalled. With DGX Spark, you can run large, compute-intensive tasks locally, without moving to the cloud or data center.

We’ll walk you through how DGX Spark’s compute performance, large memory, and preinstalled AI software accelerate fine-tuning, image generation, data science, and inference workloads. Keep reading for some benchmarks.

Fine-tuning workloads on DGX Spark

Tuning pre-trained models is a common task for AI developers. To show how DGX Spark performs at this workload, we ran three tuning tasks using different methodologies: full fine-tuning, LoRA, and QLoRA. 

In full fine-tuning of a Llama 3.2 3B model, we reached a peak of 82,739.20 tokens per second. Tuning a Llama 3.1 8B model using LoRA on DGX Spark reached a peak of 53,657.60 tokens per second, and tuning a Llama 3.3 70B model using QLoRA reached a peak of 5,079.04 tokens per second.

Because fine-tuning is so memory-intensive, none of these tuning workloads can run on a 32 GB consumer GPU.

Fine-tuning

| Model | Method | Backend | Configuration | Precision | Peak tokens/sec |
| --- | --- | --- | --- | --- | --- |
| Llama 3.2 3B | Full fine-tuning | PyTorch | Sequence length: 2048, batch size: 8, epoch: 1, steps: 125 | BF16 | 82,739.20 |
| Llama 3.1 8B | LoRA | PyTorch | Sequence length: 2048, batch size: 4, epoch: 1, steps: 125 | BF16 | 53,657.60 |
| Llama 3.3 70B | QLoRA | PyTorch | Sequence length: 2048, batch size: 8, epoch: 1, steps: 125 | FP4 | 5,079.04 |

Table 1. Fine-tuning performance
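
To make the LoRA row in Table 1 concrete, here is a minimal sketch of that style of run using the Hugging Face transformers and peft libraries. This is not necessarily the harness used for the benchmark above; the dataset, LoRA rank, and target modules are illustrative assumptions, while the sequence length, batch size, precision, and step count mirror Table 1.

```python
# Minimal LoRA fine-tuning sketch mirroring the Table 1 settings for
# Llama 3.1 8B (sequence length 2048, batch size 4, BF16, 125 steps).
# The dataset and LoRA rank are illustrative assumptions, not the benchmark's.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_id = "meta-llama/Llama-3.1-8B"  # gated repo: requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda")

# Attach low-rank adapters to the attention projections; the base weights
# stay frozen, which is what makes LoRA lighter on memory than full tuning.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda ex: len(ex["text"]) > 0)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = dataset.map(tokenize, batched=True,
                      remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out",
                           per_device_train_batch_size=4,  # Table 1: 4
                           max_steps=125,                  # Table 1: 125
                           bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```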

DGX Spark’s image-generation capabilities

Image generation models are always pushing for greater accuracy, higher resolutions, and faster performance. Creating high-resolution images or multiple images per prompt drives the need for more memory, as well as the compute required to generate the images.

DGX Spark’s large GPU memory and strong compute performance let you work with larger-resolution images and higher-precision models for higher image quality. Support for the FP4 data format enables DGX Spark to generate images quickly, even at high resolutions.

Using the Flux.1 12B model at FP4 precision, DGX Spark can generate a 1K image every 2.6 seconds (see Table 2 below). DGX Spark’s large system memory provides the capacity necessary to run a BF16 SDXL 1.0 model and generate seven 1K images per minute.

Image generation

| Model | Precision | Backend | Configuration | Images/min |
| --- | --- | --- | --- | --- |
| Flux.1 12B Schnell | FP4 | TensorRT | Resolution: 1024×1024, denoising steps: 4, batch size: 1 | 23 |
| SDXL 1.0 | BF16 | TensorRT | Resolution: 1024×1024, denoising steps: 50, batch size: 2 | 7 |

Table 2. Image-generation performance
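
As a rough illustration of the Flux.1 Schnell row in Table 2, the sketch below runs the same generation settings (1024×1024, four denoising steps, batch size 1) through the Hugging Face diffusers library. Note that the benchmark above used an FP4 engine built with TensorRT; a BF16 diffusers pipeline is shown here only because it is the more common starting point, so expect different throughput.

```python
# Flux.1 Schnell at 1K resolution with 4 denoising steps, as in Table 2.
# The benchmarked pipeline was an FP4 TensorRT engine; this BF16 diffusers
# sketch reproduces the workload shape, not the measured speed.
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16)
pipe.to("cuda")  # DGX Spark's 128 GB unified memory holds the full pipeline

image = pipe(
    "a container ship entering harbor at sunset",
    height=1024, width=1024,   # 1K resolution (Table 2)
    num_inference_steps=4,     # Schnell is distilled for 4-step sampling
    guidance_scale=0.0,        # Schnell does not use classifier-free guidance
).images[0]
image.save("ship.png")
```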

Using DGX Spark for data science

DGX Spark supports foundational CUDA-X libraries like NVIDIA cuML and cuDF. NVIDIA cuML accelerates scikit-learn machine-learning algorithms, as well as UMAP and HDBSCAN, on GPUs with zero code changes required.

For computationally intensive ML algorithms like UMAP and HDBSCAN, DGX Spark can process 250 MB datasets in seconds (see Table 3 below). NVIDIA cuDF significantly speeds up common pandas data-analysis tasks like joins and string methods. cuDF pandas operations on datasets with tens of millions of records run in just seconds on DGX Spark.

Data science

| Library | Benchmark | Dataset size | Time |
| --- | --- | --- | --- |
| NVIDIA cuML | UMAP | 250 MB | 4 secs |
| NVIDIA cuML | HDBSCAN | 250 MB | 10 secs |
| NVIDIA cuDF pandas | Key data-analysis operations (joins, string methods, UDFs) | 0.5 to 5 GB | 11 secs |

Table 3. Data-science performance
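
The sketch below shows what this zero-code-change pattern looks like in practice: cudf.pandas transparently redirects pandas operations to the GPU, and cuML provides drop-in UMAP and HDBSCAN estimators. The synthetic data sizes are stand-ins for the Table 3 datasets, not the benchmark inputs themselves.

```python
# GPU-accelerated pandas plus UMAP/HDBSCAN, in the style of Table 3.
# The synthetic data here is a stand-in for the benchmarked datasets.
import cudf.pandas
cudf.pandas.install()   # must run before pandas is imported

import numpy as np
import pandas as pd     # pandas calls now execute on the GPU via cuDF

df = pd.DataFrame({
    "key": np.random.randint(0, 1_000, 10_000_000),
    "value": np.random.rand(10_000_000),
})
# A groupby plus join, the kind of operation benchmarked in Table 3
means = df.groupby("key", as_index=False)["value"].mean()
joined = df.merge(means, on="key", suffixes=("", "_mean"))

# cuML's UMAP and HDBSCAN match the scikit-learn-style estimator API
from cuml.cluster import HDBSCAN
from cuml.manifold import UMAP

X = np.random.rand(100_000, 64).astype(np.float32)
embedding = UMAP(n_components=2).fit_transform(X)
labels = HDBSCAN(min_cluster_size=50).fit_predict(embedding)
```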

Using DGX Spark for inference

DGX Spark’s Blackwell GPU supports the FP4 data format, specifically the NVFP4 data format, which provides near-FP8 accuracy (<1% degradation). This makes it possible to run smaller, quantized models without sacrificing accuracy, and the smaller data footprint of FP4 also improves performance. Table 4 below provides inference performance data for DGX Spark.

DGX Spark supports a range of 4-bit data formats (NVFP4 and MXFP4) and many backends, such as TRT-LLM, llama.cpp, and vLLM. The system’s 1 petaflop of AI performance enables fast prompt processing, as shown in Table 4. Quick prompt processing yields a faster time to first token, which delivers a better experience for users and speeds up end-to-end throughput.

Inference (ISL | OSL = 2048 | 128, BS = 1)

| Model | Precision | Backend | Prompt processing throughput (tokens/sec) | Token generation throughput (tokens/sec) |
| --- | --- | --- | --- | --- |
| Qwen3 14B | NVFP4 | TRT-LLM | 5,928.95 | 22.71 |
| GPT-OSS-20B | MXFP4 | llama.cpp | 3,670.42 | 82.74 |
| GPT-OSS-120B | MXFP4 | llama.cpp | 1,725.47 | 55.37 |
| Llama 3.1 8B | NVFP4 | TRT-LLM | 10,256.9 | 38.65 |
| Qwen2.5-VL-7B-Instruct | NVFP4 | TRT-LLM | 65,831.77 | 41.71 |
| Qwen3 235B (on dual DGX Spark) | NVFP4 | TRT-LLM | 23,477.03 | 11.73 |

Table 4. Inference performance

NVFP4 is a 4-bit floating-point format introduced with the NVIDIA Blackwell GPU architecture. MXFP4 (Microscaling FP4) is a 4-bit floating-point format created by the Open Compute Project (OCP). ISL (input sequence length) is the number of tokens in the input prompt (a.k.a. prefill tokens); OSL (output sequence length) is the number of tokens generated by the model in response (a.k.a. decode tokens).
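
For reference, here is what a single-batch run in the shape of Table 4 looks like through vLLM, one of the supported backends. The NVFP4 rows above were measured with TRT-LLM and the MXFP4 rows with llama.cpp, so treat this only as a sketch of the workload; gpt-oss-20b is used because its weights ship in MXFP4.

```python
# Single-batch inference in the shape of Table 4 (OSL = 128, BS = 1),
# using vLLM, one of the backends DGX Spark supports. The Table 4 numbers
# were measured with TRT-LLM and llama.cpp, so this is only an illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # gpt-oss-20b ships in MXFP4

params = SamplingParams(max_tokens=128, temperature=0.0)  # OSL = 128, greedy
outputs = llm.generate(["Explain the NVFP4 data format in one paragraph."],
                       params)
print(outputs[0].outputs[0].text)
```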

We also connected two DGX Sparks together via their ConnectX-7 chips to run the Qwen3 235B model. The model uses over 120 GB of memory, including overhead. Such models typically run on large cloud or data-center servers, but the fact that they can run on dual DGX Spark systems shows what’s possible for developer experimentation. As shown in the last row of Table 4, the token generation throughput on dual DGX Sparks was 11.73 tokens per second. 

The new NVFP4 version of the NVIDIA Nemotron Nano 2 model also performs well on DGX Spark, delivering up to 2x higher throughput with little to no accuracy degradation. Download the model checkpoints from Hugging Face or as an NVIDIA NIM.
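
If you are pulling the checkpoints from Hugging Face, a download sketch looks like the following; the repo id is a placeholder, so check NVIDIA's Hugging Face organization for the exact NVFP4 Nemotron Nano 2 checkpoint name.

```python
# Sketch: fetch model checkpoints with huggingface_hub. The repo id below
# is a placeholder; look up the exact NVFP4 Nemotron Nano 2 repo on
# NVIDIA's Hugging Face page before running.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="nvidia/Nemotron-Nano-2-NVFP4")
print("Checkpoints downloaded to", local_dir)
```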

Get your DGX Spark, join the DGX Spark developer community, and start your AI-building journey today.
