NVIDIA Developer · September 25
Windows ML Empowers Developers, Unlocking the AI Inference Potential of RTX GPUs

Microsoft has announced that Windows ML is now generally available to developers, allowing C#, C++, and Python developers to run AI models efficiently on PC hardware. By integrating the NVIDIA TensorRT for RTX EP, Windows ML can take full advantage of the Tensor Cores on GeForce RTX and RTX Pro GPUs, along with advanced features such as FP8 and FP4, to deliver the fastest AI inference performance on Windows 11. Windows ML is built on the ONNX Runtime API and automatically manages execution providers across CPU, NPU, and GPU, simplifying dependency management for developers. The NVIDIA TensorRT for RTX EP provides low-latency, high-throughput inference and supports optimizations such as dynamic shapes and CUDA graphs, significantly improving the efficiency of AI applications on Windows.

🚀 **Windows ML launch and core capabilities**: Microsoft has officially released Windows ML, giving C#, C++, and Python developers a new platform for running AI models efficiently on local PCs. The tool makes full use of the CPU, NPU, and GPU resources on the PC to deliver optimal AI model performance.

⚡ **Performance advantages of the NVIDIA TensorRT for RTX EP**: Through deep integration with the NVIDIA TensorRT for RTX Execution Provider (EP), Windows ML unlocks the full potential of GeForce RTX and RTX Pro GPUs. It leverages the GPU's Tensor Cores and architectural features such as FP8 and FP4 to deliver the fastest AI inference on Windows 11, with up to 50% higher throughput and lower latency than the previous DirectML implementation.

📦 **ONNX Runtime and execution provider management**: Windows ML is built on the ONNX Runtime API and extends it to dynamically manage and initialize execution providers across different hardware (CPU, NPU, GPU). Developers no longer need to manage dependencies and packages for each hardware vendor; Windows ML automatically downloads the required execution provider on demand, greatly simplifying the development workflow.

💡 **Performance optimizations and deployment convenience**: The NVIDIA TensorRT for RTX EP offers a range of optimizations, including a lightweight package of roughly 200 MB, just-in-time compilation of model execution, and support for many model architectures such as LLMs, diffusion models, and CNNs. It also supports CUDA graphs and a runtime cache, which further reduce launch overhead and speed up model loading for faster inference.

🔗 **Minimal data transfer overhead**: Using the ONNX Runtime device API together with Windows ML, developers can allocate device-specific tensors to minimize data transfer overhead. With CopyTensors and IOBinding, EP-agnostic GPU-accelerated inference delivers significantly better performance and cleaner code; for example, applied to a Stable Diffusion model, it reduces copy time by up to 75x.

Today, Microsoft is making Windows ML available to developers. Windows ML enables C#, C++, and Python developers to optimally run AI models locally across PC hardware, from CPUs and NPUs to GPUs. On NVIDIA RTX GPUs, it uses the NVIDIA TensorRT for RTX Execution Provider (EP), leveraging the GPU's Tensor Cores and architectural advancements like FP8 and FP4 to provide the fastest AI inference performance on Windows-based RTX AI PCs.

“Windows ML unlocks full TensorRT acceleration for GeForce RTX and RTX Pro GPUs, delivering exceptional AI performance on Windows 11,” said Logan Iyer, VP, Distinguished Engineer, Windows Platform and Developer. “We’re excited it’s generally available for developers today to build and deploy powerful AI experiences at scale.”

Overview of Windows ML and TensorRT for RTX EP

Video 1. Deploy high-performance AI models in Windows applications on NVIDIA RTX AI PCs

Windows ML is built upon the ONNX Runtime APIs for inferencing. It extends these APIs to handle dynamic initialization and dependency management of execution providers across the CPU, NPU, and GPU hardware on the PC. In addition, Windows ML automatically downloads the necessary execution provider on demand, removing the need for app developers to manage dependencies and packages across multiple hardware vendors.

Figure 1. Windows ML stack diagram

The NVIDIA TensorRT for RTX Execution Provider (EP) provides several benefits to Windows ML developers using ONNX Runtime, including:

    Run ONNX models with low-latency inference and 50% faster throughput compared to prior DirectML implementations on NVIDIA RTX GPUs, as shown in the figure below.
    Integrated directly with Windows ML through its flexible EP architecture and integration with ONNX Runtime.
    Just-in-time compilation for streamlined deployment on end-user devices. Learn more about the compilation process within TensorRT for RTX. This compilation process is supported in ONNX Runtime as EP context models.
    Leverages architecture advancements like FP8 and FP4 on the Tensor Cores.
    Lightweight package at just under 200 MB.
    Support for a variety of model architectures, from LLMs (with the ONNX Runtime GenAI SDK extension) to diffusion models, CNNs, and more.

Learn more about TensorRT for RTX.

Selecting an execution provider

The 1.23.0 release of ONNX Runtime, included with Windows ML, provides vendor- and execution-provider-independent APIs for device selection. This dramatically reduces the application logic needed to take advantage of the optimal execution provider for each hardware vendor's platform. The code excerpts below show how to do this effectively and obtain maximum performance on NVIDIA GPUs.

// Register desired execution provider libraries of various vendors
auto env = Ort::Env(ORT_LOGGING_LEVEL_WARNING);
env.RegisterExecutionProviderLibrary("nv_tensorrt_rtx", L"onnxruntime_providers_nv_tensorrt_rtx.dll");

// Option 1: Rely on the ONNX Runtime execution policy
Ort::SessionOptions sessions_options;
sessions_options.SetEpSelectionPolicy(OrtExecutionProviderDevicePolicy_PREFER_GPU);

// Option 2: Iterate over EpDevices to perform manual device selection
std::vector<Ort::ConstEpDevice> ep_devices = env.GetEpDevices();
std::vector<Ort::ConstEpDevice> selected_devices = select_ep_devices(ep_devices);
Ort::SessionOptions session_options;
Ort::KeyValuePairs ep_options;
session_options.AppendExecutionProvider_V2(env, selected_devices, ep_options);
# Register desired execution provider libraries of various vendors
ort.register_execution_provider_library("NvTensorRTRTXExecutionProvider", "onnxruntime_providers_nv_tensorrt_rtx.dll")

# Option 1: Rely on the ONNX Runtime execution policy
session_options = ort.SessionOptions()
session_options.set_provider_selection_policy(ort.OrtExecutionProviderDevicePolicy.PREFER_GPU)

# Option 2: Iterate over EpDevices to perform manual device selection
ep_devices = ort.get_ep_devices()
ep_device = select_ep_devices(ep_devices)
provider_options = {}
session_options.add_provider_for_devices([ep_device], provider_options)
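To complete the picture, here is a minimal Python sketch of what typically follows device selection: creating the inference session with the configured options and running the model. The session and run calls are standard ONNX Runtime API; the model path, input name, and input shape are hypothetical placeholders.

# Minimal follow-on sketch (hypothetical "model.onnx" with one input named "input").
import numpy as np
import onnxruntime as ort

# Uses the session_options configured in the excerpt above.
session = ort.InferenceSession("model.onnx", sess_options=session_options)

# Dummy input tensor; replace the name and shape with those of your model.
input_feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

# The selected execution provider (for example, TensorRT for RTX) is used transparently.
outputs = session.run(None, input_feed)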

Precompiled runtimes offering quick load times

Model runtimes can now be precompiled using EP context ONNX files within ONNX Runtime. Each execution provider can use this to optimize entire subgraphs of an ONNX model and provide an EP-specific implementation. The result can be serialized to disk to enable quick load times with Windows ML, which is often faster than the prior operator-based approach in DirectML.
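To illustrate, below is a hedged Python sketch of producing and then reusing such a precompiled EP context model. The ep.context_enable and ep.context_file_path session config keys follow ONNX Runtime's documented EP-context convention; the file names are placeholders, and exact behavior for the TensorRT for RTX EP should be confirmed against its documentation.

# Hedged sketch: compile once to an EP context model, then reload it for fast startup.
# The "ep.context_*" keys follow ONNX Runtime's EP-context convention; file names are placeholders.
import onnxruntime as ort

compile_options = ort.SessionOptions()
compile_options.add_session_config_entry("ep.context_enable", "1")
compile_options.add_session_config_entry("ep.context_file_path", "model_ctx.onnx")

# Creating this session triggers just-in-time compilation and writes the EP context model.
ort.InferenceSession("model.onnx", sess_options=compile_options,
                     providers=["NvTensorRTRTXExecutionProvider"])

# Later runs load the precompiled context model directly, skipping recompilation.
session = ort.InferenceSession("model_ctx.onnx",
                               providers=["NvTensorRTRTXExecutionProvider"])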

The chart below shows that the TensorRT for RTX EP takes time to compile, but loading and inference are faster because the optimizations are already serialized. In addition, the runtime cache feature within the TensorRT for RTX EP ensures that the kernels generated during the compile phase are serialized and stored to a directory, so they don't have to be recompiled for subsequent inferences.

Figure 3. Different load times of DeepSeek-R1-Distill-Qwen-7B model runtimes including ONNX model, EP context files, and with EP context and Runtime Cache. Lower is better.

Minimal data transfer overheads with ONNX Runtime Device API and Windows ML

The new ONNX Runtime device API, also available in Windows ML, enumerates the available devices for each execution provider. Using this abstraction, developers can allocate device-specific tensors without additional EP-dependent type specifications.

By leveraging CopyTensors and IOBinding, this API enables developers to perform EP-agnostic, GPU-accelerated inference with minimal runtime data transfer overhead—leading to improved performance and cleaner code design.
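As a concrete illustration of the bind-once pattern, here is a minimal sketch using ONNX Runtime's Python IOBinding API. The model path, tensor names, shape, and iteration count are placeholders; the newer device-API/CopyTensors flow described above follows the same idea of binding device tensors once and running many iterations without per-step transfers.

# Hedged sketch of IOBinding in Python: copy the input to the GPU once, then run the
# loop without per-iteration host<->device transfers. Names and shapes are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx",
                               providers=["NvTensorRTRTXExecutionProvider"])

# Allocate the input directly on the GPU (device type "cuda", device id 0).
host_input = np.random.rand(1, 3, 512, 512).astype(np.float32)
gpu_input = ort.OrtValue.ortvalue_from_numpy(host_input, "cuda", 0)

binding = session.io_binding()
binding.bind_ortvalue_input("latents", gpu_input)          # hypothetical input name
binding.bind_output("out", device_type="cuda", device_id=0)  # let ORT allocate on device

for _ in range(30):  # e.g. a diffusion denoising loop
    session.run_with_iobinding(binding)  # no host<->device copies inside the loop

# Single device-to-host copy of the final result.
result = binding.copy_outputs_to_cpu()[0]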

Figure 5 showcases the Stable Diffusion 3.5 Medium model leveraging the ONNX Runtime Device API. Figure 4 below represents the time required for a single iteration in the diffusion loop for the same model, both with and without device IO bindings.

Figure 4. Stable Diffusion 3.5 Medium running with and without device IO bindings on an AMD Ryzen 7 7800X3D CPU + RTX 5090 GPU connected via PCIe 5.0. Lower time is better.

Using Nsight Systems, we visualized the performance overhead caused by repeated copies between host and device when IO binding is not used:

Figure 5. Nsight Systems timeline that shows the overhead that additional synchronous PCI traffic creates.

Prior to every inference run, the input tensor is copied from host to device (highlighted in green in our profile), and the device-to-host copy of the output takes about the same time. In addition, ONNX Runtime by default uses pageable memory, for which the device-to-host copy is an implicit synchronization even though ONNX Runtime uses the cudaMemcpyAsync API.

On the other hand, when the input and output tensors are bound via IOBinding, the host-to-device copy of the input happens just once, prior to the multi-model inference pipeline. The same applies to the device-to-host copy of the output, after which we synchronize the CPU with the GPU again. The Nsight trace above depicts multiple inference runs in the loop with no copy or synchronization operations in between, even freeing CPU resources in the meantime. This results in a device copy time of 4.2 milliseconds and a one-time host copy time of 1.3 milliseconds, for a total copy time of 5.5 milliseconds, irrespective of the number of iterations in the inference loop. For reference, this approach results in a ~75x reduction in copy time for a 30-iteration loop!

TensorRT for RTX-specific optimizations

The TensorRT for RTX EP offers custom options to further optimize performance. The most important optimizations are listed below, followed by a combined configuration sketch.

    CUDA graphs: Enabled by setting enable_cuda_graph to capture all CUDA kernels launched from TensorRT inside a graph, thereby reducing the launch overhead on the CPU. This is important when the TensorRT graph launches many small kernels that the GPU executes faster than the CPU can submit them. This method yields around a 30% performance gain with LLMs, and is useful for many model types, including traditional AI models and CNN architectures.
Figure 6. Throughput speedup with CUDA graphs enabled versus disabled via the ONNX Runtime API. Data measured on an NVIDIA RTX 5090 GPU with several LLMs.
    Runtime cache: nv_runtime_cache_path points to a directory where compiled kernels are cached, enabling quick load times in combination with EP context nodes.
    Dynamic shapes: Overwrite known dynamic shape ranges by setting the three options profile_{min|max|opt}_shapes, or specify static shapes using AddFreeDimensionOverrideByName to fix the input shapes of a model. Currently, this feature is in experimental mode.
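As referenced above, here is a hedged Python sketch of passing these provider options together. The option names come from the list above, while the value formats (particularly the profile shape strings) and the input name are assumptions to verify against the TensorRT for RTX EP documentation.

# Hedged configuration sketch combining the options above. Option names are taken
# from this post; value formats and "input" are placeholders to check against the docs.
import onnxruntime as ort

provider_options = {
    "enable_cuda_graph": "1",                    # capture TensorRT kernel launches in a CUDA graph
    "nv_runtime_cache_path": "./trt_rtx_cache",  # persist compiled kernels across runs
    # Experimental dynamic-shape overrides (assumed "name:dims" format as a placeholder):
    "profile_min_shapes": "input:1x3x224x224",
    "profile_opt_shapes": "input:8x3x224x224",
    "profile_max_shapes": "input:16x3x224x224",
}

session = ort.InferenceSession(
    "model_ctx.onnx",
    providers=[("NvTensorRTRTXExecutionProvider", provider_options)],
)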

Summary

We’re excited to collaborate with Microsoft to bring Windows ML and the TensorRT for RTX EP to Windows application developers for maximum performance on NVIDIA RTX GPUs. Top Windows application developers, including Topaz Labs and Wondershare Filmora, are currently working on integrating Windows ML and the TensorRT for RTX EP into their applications.

Get started with Windows ML, ONNX Runtime APIs, and TensorRT for RTX EP using the below resources:

Stay tuned for future improvements and get up to speed with the new APIs that our samples are demonstrating. If there’s a feature request from your side, feel free to open an issue on GitHub and let us know!

Acknowledgements

We would like to thank Gaurav Garg, Kumar Anshuman, Umang Bhatt, and Vishal Agarawal for their contributions to the blog.
