MarkTechPost@AI · October 6, 13:51
StreamTensor: A New FPGA Compiler for AI Inference

StreamTensor is a novel compiler that lowers PyTorch large language model (LLM) computation graphs into dataflow accelerators on the AMD Alveo U55C FPGA. The system introduces an "iterative tensor" (itensor) type that encodes the tiling and ordering of streams, enabling provably correct inter-kernel streaming and automating the insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, StreamTensor achieves latency as low as 0.64× of a GPU baseline and up to 1.99× higher energy efficiency, largely avoiding off-chip DRAM reads and writes.

💡 **StreamTensor optimizes LLM inference efficiency:** By compiling PyTorch LLM computation graphs into dataflow accelerators on an FPGA, StreamTensor significantly improves AI inference efficiency. It moves data tiles through on-chip FIFOs and stream converters, reducing dependence on DRAM and yielding notable gains in both latency and energy consumption.

🔄 **A novel "iterative tensor" (itensor) type:** The system's core abstraction, the itensor, precisely encodes the tiling, ordering, and layout of data streams. This makes inter-kernel streaming provably compatible and lets the compiler automatically insert and configure DMA engines, FIFOs, and layout converters so that data moves efficiently between processing units.

🚀 **Significant performance gains:** On LLM decoding workloads, StreamTensor achieves latency as low as 0.64× of a GPU baseline and up to 1.99× higher energy efficiency. This shows that co-designing a dataflow compiler with FPGA hardware can outperform conventional GPU solutions, particularly in latency- and energy-sensitive scenarios.

🛠️ **An end-to-end PyTorch-to-accelerator flow:** StreamTensor provides a complete compilation pipeline from PyTorch to hardware accelerators. It supports multiple LLM families, including GPT-2, Llama, Qwen, and Gemma, lowering them to streaming dataflow kernels on the FPGA with no hand-written RTL, which greatly simplifies development (a minimal sketch of the graph-capture starting point follows).
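To make that starting point concrete, here is a minimal, hypothetical sketch of capturing a PyTorch module as an op-level graph. It uses the standard torch.fx tracer as a generic stand-in; per the article, StreamTensor's actual frontend goes through Torch-MLIR, and the module below is a toy, not one of the supported models:

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

# Toy stand-in for an LLM sub-block; not StreamTensor code.
class TinyBlock(nn.Module):
    def __init__(self, d: int = 64):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(x))

# Capture the op-level computation graph that a dataflow compiler
# (StreamTensor via Torch-MLIR, per the article) would then lower to
# streaming kernels instead of batched DRAM-resident ops.
gm = symbolic_trace(TinyBlock())
print(gm.graph)
```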

Why treat LLM inference as batched kernels bouncing through DRAM when a dataflow compiler can pipe tiles through on-chip FIFOs and stream converters? StreamTensor is a compiler that lowers PyTorch LLM graphs (GPT-2, Llama, Qwen, Gemma) into stream-scheduled dataflow accelerators on AMD’s Alveo U55C FPGA. The system introduces an iterative tensor (“itensor”) type to encode the tiling and ordering of streams, enabling provably correct inter-kernel streaming and automated insertion and sizing of DMA engines, FIFOs, and layout converters. On LLM decoding workloads, the research team reports latency as low as 0.64× of a GPU baseline and up to 1.99× higher energy efficiency.

https://arxiv.org/pdf/2509.13694

What does StreamTensor do?

StreamTensor compiles PyTorch graphs into a stream-oriented dataflow design in which intermediate tiles largely avoid off-chip DRAM round-trips: tiles are forwarded through on-chip FIFOs to downstream kernels via streaming and fusion, and DMAs are inserted only when required. The compiler’s central abstraction, the iterative tensor (itensor), records iteration order, tiling, and layout, which makes inter-kernel stream compatibility explicit and drives converter generation only where needed. The framework also searches hierarchically over tiling, fusion, and resource allocation, and uses a linear program to size FIFOs so as to avoid stalls or deadlock while minimizing on-chip memory.
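As a way to picture the abstraction, here is a minimal, hypothetical Python sketch of what an itensor descriptor and its compatibility check might look like. The field names (shape, tile, loop_order) and the helper streams_compatible are illustrative assumptions, not the paper's actual IR:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ITensor:
    """Hypothetical sketch of an iterative-tensor (itensor) descriptor:
    a logical tensor plus the tiling and order in which its tiles stream."""
    shape: Tuple[int, ...]        # logical tensor shape, e.g. (seq, hidden)
    tile: Tuple[int, ...]         # tile extents streamed per FIFO transaction
    loop_order: Tuple[int, ...]   # traversal order over tile indices (outer->inner)

def streams_compatible(prod: ITensor, cons: ITensor) -> bool:
    """Producer and consumer can be wired FIFO-to-FIFO only if they agree on
    shape, tiling, and traversal order; otherwise the compiler inserts a
    layout/stream converter instead of spilling the tile to DRAM."""
    return (prod.shape == cons.shape
            and prod.tile == cons.tile
            and prod.loop_order == cons.loop_order)

# Example: a kernel producing a (512, 768) tensor in 64x64 tiles, row-major
# tile order, feeding a consumer that wants column-major tile order.
a = ITensor((512, 768), (64, 64), (0, 1))
b = ITensor((512, 768), (64, 64), (1, 0))
print(streams_compatible(a, b))  # False -> generate a stream layout converter
```

Encoding the traversal order in the type is what makes compatibility decidable at compile time, so converters are generated only on the edges that actually need them.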

What’s actually new?

The core novelty is making stream order part of the type system: the itensor encodes iteration order, tiling, and layout, so inter-kernel streaming can be verified and converters generated automatically, while a linear program sizes FIFOs to avoid stalls and deadlock with minimal on-chip memory, all driven by a hierarchical search over tiling, fusion, and resource allocation.

Results

**Latency:** up to 0.76× vs. prior FPGA LLM accelerators and 0.64× vs. a GPU baseline on GPT-2.

**Energy efficiency:** up to 1.99× vs. an NVIDIA A100 on emerging LLMs (model-dependent).

**Platform context:** Alveo U55C (16 GB HBM2 at 460 GB/s, PCIe Gen3×16 or dual Gen4×8, 2×QSFP28).

Our Comments

The useful contribution here is a PyTorch→Torch-MLIR→dataflow compiler that emits stream-scheduled kernels plus a host/runtime for AMD’s Alveo U55C; the iterative tensor type and linear-programming-based FIFO sizing enable safe inter-kernel streaming rather than DRAM round-trips. On the reported LLM decoding benchmarks across GPT-2, Llama, Qwen, and Gemma, the research team reports geometric-mean latency as low as 0.64× vs. a GPU baseline and energy efficiency up to 1.99×, with scope limited to decoding workloads. The hardware context is clear: the Alveo U55C provides 16 GB HBM2 at 460 GB/s with dual QSFP28 and PCIe Gen3×16 or dual Gen4×8, which aligns with the streaming dataflow design.
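To illustrate the flavor of the FIFO-sizing step, here is a small, hypothetical linear program solved with scipy.optimize.linprog. The cost weights, per-edge depth bounds, and the reconvergent-path skew constraint are made-up numbers standing in for what the compiler would derive from its dataflow analysis; the paper's actual formulation is not reproduced here:

```python
import numpy as np
from scipy.optimize import linprog

# Toy FIFO-sizing LP: choose depths d0..d2 (in tokens) for three stream
# edges, minimizing total on-chip storage (bits) subject to correctness
# lower bounds. All numbers are illustrative, not from the paper.
widths = np.array([32.0, 64.0, 32.0])   # bits per token on each edge

# Per-edge minimum depths from rate/burst analysis (hypothetical).
min_depth = np.array([16.0, 32.0, 8.0])

# Reconvergent-path constraint (hypothetical): edges 0 and 1 split from one
# kernel and rejoin at another. If path 1's pipeline holds ~240 more tokens
# in flight, edge 0's FIFO must absorb that skew or the join deadlocks:
#   d0 >= d1 + 240   ->   -d0 + d1 <= -240
A_ub = np.vstack([
    -np.eye(3),                 # -d_i <= -min_depth_i
    [-1.0, 1.0, 0.0],           # -d0 + d1 <= -240
])
b_ub = np.concatenate([-min_depth, [-240.0]])

res = linprog(c=widths, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
print("FIFO depths (tokens):", res.x)   # -> [272., 32., 8.]
print("Total storage (bits):", res.fun)
```

The point is the shape of the problem: FIFO depths trade on-chip memory against stall- and deadlock-freedom, which is exactly the kind of trade-off a linear program can balance globally across the dataflow graph.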


Check out the Paper.
