NVIDIA Developer · September 25
Windows ML Empowers Developers, Unlocking the AI Inference Potential of RTX GPUs

Microsoft has announced that Windows ML is now generally available to developers, allowing C#, C++, and Python developers to run AI models efficiently on PC hardware. By integrating the NVIDIA TensorRT for RTX EP, Windows ML can take full advantage of the Tensor Cores on GeForce RTX and RTX Pro GPUs, along with advanced features such as FP8 and FP4, to deliver the fastest AI inference performance on Windows 11. Windows ML is built on the ONNX Runtime API and automatically manages execution providers across CPU, NPU, and GPU, simplifying dependency management for developers. The NVIDIA TensorRT for RTX EP provides low-latency, high-throughput inference and supports optimizations such as dynamic shapes and CUDA graphs, significantly improving the efficiency of AI applications on Windows.

🚀 **Windows ML launch and core capabilities**: Microsoft has officially released Windows ML, giving C#, C++, and Python developers a new platform for running AI models efficiently on local PCs. The tool makes full use of the CPU, NPU, and GPU resources on the PC to deliver optimal AI model performance.

⚡ **Performance advantages of the NVIDIA TensorRT for RTX EP**: Through deep integration with the NVIDIA TensorRT for RTX Execution Provider (EP), Windows ML unlocks the full potential of GeForce RTX and RTX Pro GPUs. It leverages the GPU's Tensor Cores and architectural features such as FP8 and FP4 to deliver the fastest AI inference on Windows 11, with up to 50% higher throughput and lower latency than the previous DirectML implementation.

📦 **ONNX Runtime and execution provider management**: Windows ML is built on the ONNX Runtime API and extends it to dynamically manage and initialize execution providers across different hardware (CPU, NPU, GPU). Developers no longer need to manage dependencies and packages for each hardware vendor; Windows ML automatically downloads the required execution provider on demand, greatly simplifying the development workflow.

💡 **Performance optimizations and deployment convenience**: The NVIDIA TensorRT for RTX EP offers a range of optimizations, including a lightweight package of roughly 200 MB, just-in-time compilation of model execution, and support for many model architectures such as LLMs, diffusion models, and CNNs. It also supports CUDA graphs and a runtime cache, which further reduce launch overhead and speed up model loading for faster inference.

🔗 **Minimal data transfer overhead**: Using the ONNX Runtime device API together with Windows ML, developers can allocate device-specific tensors to minimize data transfer overhead. With CopyTensors and IOBinding, EP-agnostic GPU-accelerated inference delivers significantly better performance and cleaner code; for example, applied to a Stable Diffusion model, it reduces copy time by up to 75x.

Today, Microsoft is making Windows ML available to developers. Windows ML enables C#, C++, and Python developers to optimally run AI models locally across PC hardware, from CPUs and NPUs to GPUs. On NVIDIA RTX GPUs, it uses the NVIDIA TensorRT for RTX Execution Provider (EP), leveraging the GPU's Tensor Cores and architectural advancements like FP8 and FP4 to provide the fastest AI inference performance on Windows-based RTX AI PCs.

“Windows ML unlocks full TensorRT acceleration for GeForce RTX and RTX Pro GPUs, delivering exceptional AI performance on Windows 11,” said Logan Iyer, VP, Distinguished Engineer, Windows Platform and Developer. “We’re excited it’s generally available for developers today to build and deploy powerful AI experiences at scale.”

Overview of Windows ML and TensorRT for RTX EP

Video 1. Deploy high-performance AI models in Windows applications on NVIDIA RTX AI PCs

Windows ML is built upon the ONNX Runtime APIs for inferencing. It extends these APIs to handle dynamic initialization and dependency management of execution providers across the CPU, NPU, and GPU hardware on the PC. In addition, Windows ML automatically downloads the necessary execution provider on demand, removing the need for app developers to manage dependencies and packages across multiple hardware vendors.

Figure 1. Windows ML stack diagram

The NVIDIA TensorRT for RTX Execution Provider (EP) provides several benefits to Windows ML developers using ONNX Runtime, including:

    Run ONNX models with low-latency inference and 50% faster throughput compared to prior DirectML implementations on NVIDIA RTX GPUs, as shown in the figure below.
    Integrated directly with Windows ML through its flexible EP architecture and integration with ONNX Runtime.
    Just-in-time compilation for streamlined deployment on end-user devices. Learn more about the compilation process within TensorRT for RTX. This compilation process is supported in ONNX Runtime as EP context models.
    Leverages architecture advancements like FP8 and FP4 on the Tensor Cores.
    Lightweight package at just under 200 MB.
    Support for a variety of model architectures, from LLMs (with the ONNX Runtime GenAI SDK extension) to diffusion models, CNNs, and more.

Learn more about TensorRT for RTX.

Selecting an execution provider

The 1.23.0 release of ONNX Runtime, included with Windows ML, provides vendor- and execution-provider-independent APIs for device selection. This dramatically reduces the application logic needed to take advantage of the optimal execution provider for each hardware vendor's platform. The code excerpts below show how to do this effectively and obtain maximum performance on NVIDIA GPUs.

// Register desired execution provider libraries of various vendors
auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING);
env.RegisterExecutionProviderLibrary("nv_tensorrt_rtx", L"onnxruntime_providers_nv_tensorrt_rtx.dll");

// Option 1: Rely on the ONNX Runtime execution policy
Ort::SessionOptions sessions_options;
sessions_options.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_GPU);

// Option 2: Iterate over EpDevices to perform manual device selection
std::vector<Ort::ConstEpDevice> ep_devices = env.GetEpDevices();
std::vector<Ort::ConstEpDevice> selected_devices = select_ep_devices(ep_devices);
Ort::SessionOptions session_options;
Ort::KeyValuePairs ep_options;
session_options.AppendExecutionProvider_V2(env, selected_devices, ep_options);
# Register desired execution provider libraries of various vendors
ort.register_execution_provider_library("NvTensorRTRTXExecutionProvider", "onnxruntime_providers_nv_tensorrt_rtx.dll")

# Option 1: Rely on the ONNX Runtime execution policy
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)

# Option 2: Iterate over EpDevices to perform manual device selection
ep_devices = ort.get_ep_devices()
ep_device = select_ep_devices(ep_devices)
provider_options = {}
session_options.add_provider_for_devices([ep_device], provider_options)
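To complete the picture, here is a minimal Python sketch of what typically follows device selection: creating the inference session with the configured options and running the model. The session and run calls are standard ONNX Runtime API; the model path, input name, and input shape are hypothetical placeholders.

# Minimal follow-on sketch (hypothetical "model.onnx" with one input named "input").
import numpy as np
import onnxruntime as ort

# Uses the session_options configured in the excerpt above.
session = ort.InferenceSession("model.onnx", sess_options=session_options)

# Dummy input tensor; replace the name and shape with those of your model.
input_feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

# The selected execution provider (for example, TensorRT for RTX) is used transparently.
outputs = session.run(None, input_feed)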

Precompiled runtimes offering quick load times

Model runtimes can now be precompiled using EP context ONNX files within ONNX Runtime. Each execution provider can use this to optimize entire subgraphs of an ONNX model and provide an EP-specific implementation. The result can be serialized to disk to enable quick load times with Windows ML, which is often faster than the prior operator-based approach in DirectML.
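To illustrate, below is a hedged Python sketch of producing and then reusing such a precompiled EP context model. The ep.context_enable and ep.context_file_path session config keys follow ONNX Runtime's documented EP-context convention; the file names are placeholders, and exact behavior for the TensorRT for RTX EP should be confirmed against its documentation.

# Hedged sketch: compile once to an EP context model, then reload it for fast startup.
# The "ep.context_*" keys follow ONNX Runtime's EP-context convention; file names are placeholders.
import onnxruntime as ort

compile_options = ort.SessionOptions()
compile_options.add_session_config_entry("ep.context_enable", "1")
compile_options.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")

# Creating this session triggers just-in-time compilation and writes the EP context model.
ort.InferenceSession("model.onnx", sess_options=compile_options,
                     providers=["NvTensorRTRTXExecutionProvider"])

# Later runs load the precompiled context model directly, skipping recompilation.
session = ort.InferenceSession("model_ctx.onnx",
                               providers=["NvTensorRTRTXExecutionProvider"])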

The chart below shows that the TensorRT for RTX EP takes time to compile, but loading and inference are faster because the optimizations are already serialized. In addition, the runtime cache feature within the TensorRT for RTX EP ensures that the kernels generated during the compile phase are serialized and stored to a directory, so they don't have to be recompiled for subsequent inferences.

Figure 3. Different load times of DeepSeek-R1-Distill-Qwen-7B model runtimes including ONNX model, EP context files, and with EP context and Runtime Cache. Lower is better.

Minimal data transfer overheads with ONNX Runtime Device API and Windows ML

The new ONNX Runtime device API, also available in Windows ML, enumerates the available devices for each execution provider. Using this abstraction, developers can allocate device-specific tensors without additional EP-dependent type specifications.

By leveraging CopyTensors and IOBinding, this API enables developers to perform EP-agnostic, GPU-accelerated inference with minimal runtime data transfer overhead—leading to improved performance and cleaner code design.
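As a concrete illustration of the bind-once pattern, here is a minimal sketch using ONNX Runtime's Python IOBinding API. The model path, tensor names, shape, and iteration count are placeholders; the newer device-API/CopyTensors flow described above follows the same idea of binding device tensors once and running many iterations without per-step transfers.

# Hedged sketch of IOBinding in Python: copy the input to the GPU once, then run the
# loop without per-iteration host<->device transfers. Names and shapes are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx",
                               providers=["NvTensorRTRTXExecutionProvider"])

# Allocate the input directly on the GPU (device type "cuda", device id 0).
host_input = np.random.rand(1, 3, 512, 512).astype(np.float32)
gpu_input = ort.OrtValue.ortvalue_from_numpy(host_input, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("latents", gpu_input)          # hypothetical input name
binding.bind_output("out", device_type="cuda", device_id=0)  # let ORT allocate on device

for _ in range(30):  # e.g. a diffusion denoising loop
    session.run_with_iobinding(binding)  # no host<->device copies inside the loop

# Single device-to-host copy of the final result.
result = binding.copy_outputs_to_cpu()[0]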

Figure 5 showcases the Stable Diffusion 3.5 Medium model leveraging the ONNX Runtime Device API. Figure 4 below represents the time required for a single iteration in the diffusion loop for the same model, both with and without device IO bindings.

Figure 4. Stable Diffusion 3.5 Medium running with and without device IO bindings on an AMD Ryzen 7 7800X3D CPU + RTX 5090 GPU connected via PCIe 5.0. Lower time is better.

Using Nsight Systems, we visualized the performance overhead caused by repeated copies between host and device when IO binding is not used:

Figure 5. Nsight Systems timeline that shows the overhead that additional synchronous PCI traffic creates.

Prior to every inference run, the input tensor is copied from host to device (highlighted in green in our profile), and the device-to-host copy of the output takes about the same time. In addition, ONNX Runtime by default uses pageable memory, for which the device-to-host copy is an implicit synchronization even though ONNX Runtime uses the cudaMemcpyAsync API.

On the other hand, when the input and output tensors are bound via IOBinding, the host-to-device copy of the input happens just once, prior to the multi-model inference pipeline. The same applies to the device-to-host copy of the output, after which we synchronize the CPU with the GPU again. The Nsight trace above depicts multiple inference runs in the loop with no copy or synchronization operations in between, even freeing CPU resources in the meantime. This results in a device copy time of 4.2 milliseconds and a one-time host copy time of 1.3 milliseconds, for a total copy time of 5.5 milliseconds, irrespective of the number of iterations in the inference loop. For reference, this approach results in a ~75x reduction in copy time for a 30-iteration loop!

TensorRT for RTX-specific optimizations

The TensorRT for RTX EP offers custom options to further optimize performance. The most important optimizations are listed below, followed by a combined configuration sketch.

    CUDA graphs: Enabled by setting enable_cuda_graph to capture all CUDA kernels launched from TensorRT inside a graph, thereby reducing the launch overhead on the CPU. This is important when the TensorRT graph launches many small kernels that the GPU executes faster than the CPU can submit them. This method yields around a 30% performance gain with LLMs, and is useful for many model types, including traditional AI models and CNN architectures.
Figure 6. Throughput speedup with CUDA graphs enabled versus disabled via the ONNX Runtime API. Data measured on an NVIDIA RTX 5090 GPU with several LLMs.
    Runtime cache: nv_runtime_cache_path points to a directory where compiled kernels are cached, enabling quick load times in combination with EP context nodes.
    Dynamic shapes: Overwrite known dynamic shape ranges by setting the three options profile_{min|max|opt}_shapes, or specify static shapes using AddFreeDimensionOverrideByName to fix the input shapes of a model. Currently, this feature is in experimental mode.
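As referenced above, here is a hedged Python sketch of passing these provider options together. The option names come from the list above, while the value formats (particularly the profile shape strings) and the input name are assumptions to verify against the TensorRT for RTX EP documentation.

# Hedged configuration sketch combining the options above. Option names are taken
# from this post; value formats and "input" are placeholders to check against the docs.
import onnxruntime as ort

provider_options = {
    "enable_cuda_graph": "1",                    # capture TensorRT kernel launches in a CUDA graph
    "nv_runtime_cache_path": "./trt_rtx_cache",  # persist compiled kernels across runs
    # Experimental dynamic-shape overrides (assumed "name:dims" format as a placeholder):
    "profile_min_shapes": "input:1x3x224x224",
    "profile_opt_shapes": "input:8x3x224x224",
    "profile_max_shapes": "input:16x3x224x224",
}

session = ort.InferenceSession(
    "model_ctx.onnx",
    providers=[("NvTensorRTRTXExecutionProvider", provider_options)],
)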

Summary

We’re excited to collaborate with Microsoft to bring Windows ML and the TensorRT for RTX EP to Windows application developers for maximum performance on NVIDIA RTX GPUs. Top Windows application developers, including Topaz Labs and Wondershare Filmora, are currently working on integrating Windows ML and the TensorRT for RTX EP into their applications.

Get started with Windows ML, ONNX Runtime APIs, and TensorRT for RTX EP using the below resources:

Stay tuned for future improvements and get up to speed with the new APIs that our samples are demonstrating. If there’s a feature request from your side, feel free to open an issue on GitHub and let us know!

Acknowledgements

We would like to thank Gaurav Garg, Kumar Anshuman, Umang Bhatt, and Vishal Agarawal for their contributions to the blog.
