MarkTechPost@AI, September 19
MIT's LEGO: Template-Free Automatic Hardware Generation for AI Chips

 

MIT researchers have developed LEGO, a novel compiler framework that automatically turns tensor workloads (e.g., GEMM, Conv2D, attention) into synthesizable RTL (register-transfer level) code for spatial accelerators, eliminating hand-written templates entirely. LEGO's front end uses a relation-centric affine representation to express workloads and dataflows, and designs functional-unit interconnects and on-chip memory layouts to maximize reuse. Its back end lowers the design to a primitive-level graph and applies linear programming and graph transforms to insert pipeline registers, rewire broadcast connections, and extract reduction trees, shrinking area and power. Across benchmarks, LEGO-generated hardware achieves a 3.2x speedup and 2.4x better energy efficiency than Gemmini under matched resources.

💡 **Template-free hardware generation:** LEGO's core innovation is generating synthesizable RTL for spatial accelerators directly from high-level descriptions of tensor workloads (e.g., GEMM, Conv2D, attention), with no hand-written templates. This removes the rigidity of template-based flows, which struggle to adapt to modern dynamic workloads, and opens new possibilities for AI chip design.

📐 **Relation-centric affine representation with front-end/back-end co-design:** LEGO's front end uses a relation-centric affine representation to describe tensor programs precisely, decoupling control flow from dataflow so that reuse detection and address generation become linear-algebra problems. The back end then applies linear programming and graph transforms to optimize the generated graph, including delay matching, broadcast rewiring, and reduction-tree extraction, significantly reducing hardware area and power.

🚀 **Strong performance and energy efficiency:** Validated across benchmarks spanning foundation models and classic CNNs/Transformers, LEGO-generated hardware delivers an average 3.2x speedup and 2.4x energy efficiency over Gemmini under matched resources. The gains come from an efficient performance model guiding the mapping and from dynamic switching between spatial dataflows, so the hardware performs well across different kinds of AI tasks.

🛠️ **Hardware as code, enabling broad applications:** LEGO treats hardware generation like software compilation, letting researchers and developers design custom AI accelerators with a lower barrier to entry and in a more systematic way. The generated hardware adapts to many models, making it especially suitable for deploying power-optimized AI accelerators on edge devices (e.g., wearables, IoT) and accelerating the adoption of AI across domains.

MIT researchers (Han Lab) introduced LEGO, a compiler-like framework that takes tensor workloads (e.g., GEMM, Conv2D, attention, MTTKRP) and automatically generates synthesizable RTL for spatial accelerators—no handwritten templates. LEGO’s front end expresses workloads and dataflows in a relation-centric affine representation, builds FU (functional unit) interconnects and on-chip memory layouts for reuse, and supports fusing multiple spatial dataflows in a single design. The back end lowers to a primitive-level graph and uses linear programming and graph transforms to insert pipeline registers, rewire broadcasts, extract reduction trees, and shrink area and power. Evaluated across foundation models and classic CNNs/Transformers, LEGO’s generated hardware shows 3.2× speedup and 2.4× energy efficiency over Gemmini under matched resources.

https://hanlab.mit.edu/projects/lego

Hardware Generation without Templates

Existing flows either: (1) analyze dataflows without generating hardware, or (2) generate RTL from hand-tuned templates with fixed topologies. These approaches restrict the architecture space and struggle with modern workloads that need to switch dataflows dynamically across layers/ops (e.g., conv vs. depthwise vs. attention). LEGO directly targets any dataflow and combinations, generating both architecture and RTL from a high-level description rather than configuring a few numeric parameters in a template.


Input IR: Affine, Relation-Centric Semantics (Deconstruct)

LEGO models tensor programs as loop nests with three index classes: temporal (for-loops), spatial (par-for FUs), and computation (pre-tiling iteration domain). Two affine relations drive the compiler: f_{I→D}, which maps iteration indices to tensor data coordinates (the data mapping), and f_{TS→I}, which maps temporal/spatial indices into the iteration domain (the dataflow).

This affine-only representation eliminates modulo/division in the core analysis, making reuse detection and address generation a linear-algebra problem. LEGO also decouples control flow from dataflow (a vector c encodes control signal propagation/delay), enabling shared control across FUs and substantially reducing control logic overhead.
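Because the representation is affine-only, deciding whether two FUs can share data reduces to solving a small linear system. The sketch below illustrates the idea for a 1-D temporal index and a 1-D spatial index; the access map, shapes, and example dataflow are illustrative assumptions, not LEGO's actual IR.

```python
from fractions import Fraction

def matvec(M, v):
    # Multiply a small matrix (list of rows) by a vector.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def reuse_delay(A_t, A_s, s_src, s_dst):
    """Solve A_t @ dt = A_s @ (s_src - s_dst) for a FIFO delay dt.

    The data address is modeled as d = A_t @ t + A_s @ s.  If a
    non-negative integer dt exists, FU s_dst at time t + dt reads the
    same datum FU s_src read at time t, so a FIFO of depth dt can
    forward it instead of re-fetching from memory.  (1-D case only;
    the general case solves a full linear system.)
    """
    rhs = matvec(A_s, [a - b for a, b in zip(s_src, s_dst)])
    a = A_t[0][0]
    if a == 0:
        return None
    dt = Fraction(rhs[0], a)
    if dt.denominator == 1 and dt >= 0:
        return int(dt)
    return None

# Toy dataflow d = t + s (A_t = [1], A_s = [1]): FU 0 at cycle t+1 reads
# the datum FU 1 read at cycle t, so a depth-1 FIFO connects them.
print(reuse_delay([[1]], [[1]], [1], [0]))  # 1
```

When no non-negative integer solution exists, there is no FU-to-FU forwarding opportunity and the datum must come from the banked memory instead.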

Front End: FU Graph + Memory Co-Design (Architect)

The front end's main objective is to maximize reuse and on-chip bandwidth while minimizing interconnect/mux overhead.

- **Interconnection synthesis.** LEGO formulates reuse as solving linear systems over the affine relations to discover direct and delay (FIFO) connections between FUs. It then computes minimum spanning arborescences (Chu-Liu/Edmonds) to keep only necessary edges (cost = FIFO depth). A BFS-based heuristic rewrites direct interconnects when multiple dataflows must co-exist, prioritizing chain reuse and nodes already fed by delay connections to cut muxes and data nodes.
- **Banked memory synthesis.** Given the set of FUs that must read/write a tensor in the same cycle, LEGO computes bank counts per tensor dimension from the maximum index deltas (optionally dividing by the GCD to reduce banks). It then instantiates data-distribution switches to route between banks and FUs, leaving FU-to-FU reuse to the interconnect.
- **Dataflow fusion.** Interconnects for different spatial dataflows are combined into a single FU-level Architecture Description Graph (ADG); careful planning avoids naïve mux-heavy merges and yields up to ~20% energy gains compared to naïve fusion.
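The banked-memory rule above (bank count from the maximum index delta, reduced by the stride GCD) can be sketched as follows; the exact formula is my reading of the description, so treat it as an illustration rather than LEGO's implementation.

```python
from math import gcd
from functools import reduce

def banks_per_dim(accesses):
    """Pick a conflict-free bank count per tensor dimension.

    accesses: list of index tuples, one per FU reading the tensor in the
    same cycle.  Sketch: banks = (max index delta) / (stride GCD) + 1,
    so FUs with a common stride map to distinct banks without leaving
    banks idle.
    """
    ndim = len(accesses[0])
    counts = []
    for d in range(ndim):
        idx = sorted(a[d] for a in accesses)
        delta = idx[-1] - idx[0]
        if delta == 0:
            counts.append(1)  # all FUs hit the same index: one bank
            continue
        strides = [b - a for a, b in zip(idx, idx[1:]) if b != a]
        g = reduce(gcd, strides, 0) or 1
        counts.append(delta // g + 1)
    return counts

# Four FUs reading rows 0, 2, 4, 6 of a tile concurrently: the stride
# GCD of 2 cuts a naive 7 banks down to 4.
print(banks_per_dim([(0,), (2,), (4,), (6,)]))  # [4]
```

The data-distribution switches then route each FU to its bank; FU-to-FU reuse stays on the interconnect, as noted above.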

Back End: Compile & Optimize to RTL (Compile & Optimize)

The ADG is lowered to a Detailed Architecture Graph (DAG) of primitives (FIFOs, muxes, adders, address generators). LEGO applies several LP/graph passes: a delay-matching LP that inserts pipeline registers, MST-based broadcast rewiring, reduction-tree extraction, a pin-reuse ILP, plus bit-width inference and optional power gating.

These passes focus on the datapath, which dominates resources (e.g., FU-array registers ≈ 40% area, 60% power), and produce ~35% area savings versus naïve generation.

Outcome

Setup. LEGO is implemented in C++ with HiGHS as the LP solver and emits SpinalHDL→Verilog. Evaluation covers tensor kernels and end-to-end models (AlexNet, MobileNetV2, ResNet-50, EfficientNetV2, BERT, GPT-2, CoAtNet, DDPM, Stable Diffusion, LLaMA-7B). A single LEGO-MNICOC accelerator instance is used across models; a mapper picks per-layer tiling/dataflow. Gemmini is the main baseline under matched resources (256 MACs, 256 KB on-chip buffer, 128-bit bus @ 16 GB/s).

End-to-end speed/efficiency. LEGO achieves 3.2× speedup and 2.4× energy efficiency on average vs. Gemmini. Gains stem from: (i) a fast, accurate performance model guiding mapping; (ii) dynamic spatial dataflow switching enabled by generated interconnects (e.g., depthwise conv layers choose OH–OW–IC–OC). Both designs are bandwidth-bound on GPT-2.
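A roofline-style model of the kind that could guide the mapper makes the "bandwidth-bound on GPT-2" observation concrete: a layer is bandwidth-bound when its memory cycles exceed its compute cycles. The MAC count and bus bandwidth below mirror the matched-resource setup quoted above; the clock frequency and layer shape are assumptions.

```python
MACS = 256                                     # multiply-accumulate units
FREQ_GHZ = 1.0                                 # assumed clock (not stated)
BW_BYTES_PER_CYCLE = 16e9 / (FREQ_GHZ * 1e9)   # 16 GB/s bus -> bytes/cycle

def layer_cycles(macs_needed, bytes_moved):
    """Return (cycles, bandwidth_bound) for one layer under a simple
    roofline: runtime is the max of compute time and memory time."""
    compute = macs_needed / MACS
    memory = bytes_moved / BW_BYTES_PER_CYCLE
    return max(compute, memory), memory > compute

# A GEMV-heavy decode step with little weight reuse (GPT-2-style):
# every weight byte is touched once, so memory time dominates.
cycles, bw_bound = layer_cycles(macs_needed=4096 * 4096,
                                bytes_moved=2 * 4096 * 4096)
print(bw_bound)  # True: bandwidth-bound, matching the GPT-2 result
```

Conv layers with high operational intensity flip the comparison, which is why the same accelerator can be compute-bound on CNNs and bandwidth-bound on autoregressive Transformers.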

Resource breakdown. Example SoC-style configuration shows FU array and NoC dominate area/power, with PPUs contributing ~2–5%. This supports the decision to aggressively optimize datapaths and control reuse.

Generative models. On a larger 1024-FU configuration, LEGO sustains >80% utilization for DDPM/Stable Diffusion; LLaMA-7B remains bandwidth-limited (expected for low operational intensity).



How the “Compiler for AI Chips” Works, Step by Step

1. **Deconstruct (Affine IR).** Write the tensor op as loop nests; supply affine f_{I→D} (data mapping), f_{TS→I} (dataflow), and the control-flow vector c. This specifies what to compute and how it is spatialized, without templates.
2. **Architect (Graph Synthesis).** Solve reuse equations → FU interconnects (direct/delay) → MST/heuristics for minimal edges and fused dataflows; compute banked memory and distribution switches to satisfy concurrent accesses without conflicts.
3. **Compile & Optimize (LP + Graph Transforms).** Lower to a primitive DAG; run delay-matching LP, broadcast rewiring (MST), reduction-tree extraction, and pin-reuse ILP; perform bit-width inference and optional power gating. These passes jointly deliver ~35% area and ~28% energy savings vs. naïve codegen.
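The broadcast-rewiring step is essentially a spanning-tree problem over the receiving FUs: replace one high-fanout net with a forwarding tree of short hops. A Prim-style sketch on an undirected wire-cost graph is below; LEGO's interconnect edges are directed, which is why the paper uses a spanning arborescence instead, and the Manhattan-distance cost is an assumption for illustration.

```python
import heapq

def rewire_broadcast(source, sinks, cost):
    """Replace a 1-to-N broadcast with a forwarding tree (Prim's MST).

    cost(a, b): wire cost between two endpoints, e.g. distance on the
    FU grid.  Returns the list of tree edges in insertion order.
    """
    in_tree = {source}
    tree_edges = []
    heap = [(cost(source, s), source, s) for s in sinks]
    heapq.heapify(heap)
    while len(in_tree) < len(sinks) + 1:
        c, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was reached by a cheaper edge
        in_tree.add(v)
        tree_edges.append((u, v))
        for w in sinks:
            if w not in in_tree:
                heapq.heappush(heap, (cost(v, w), v, w))
    return tree_edges

# FUs on a 1-D row at positions 0..3: broadcasting from FU 0 becomes the
# chain 0->1->2->3 (each hop cost 1) instead of fan-out wires of cost
# 1 + 2 + 3.
print(rewire_broadcast(0, [1, 2, 3], lambda a, b: abs(a - b)))
```

The same cost-driven rewiring is what lets the back end trade long broadcast wires for short FU-to-FU links without changing the dataflow.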

Where It Lands in the Ecosystem

Compared with analysis tools (Timeloop/MAESTRO) and template-bound generators (Gemmini, DNA, MAGNET), LEGO is template-free, supports any dataflow and their combinations, and emits synthesizable RTL. Results show comparable or better area/power versus expert handwritten accelerators under similar dataflows and technologies, while offering one-architecture-for-many-models deployment.

Summary

LEGO operationalizes hardware generation as compilation for tensor programs: an affine front end for reuse-aware interconnect/memory synthesis and an LP-powered back end for datapath minimization. The framework’s measured 3.2× performance and 2.4× energy gains over a leading open generator, plus ~35% area reductions from back-end optimizations, position it as a practical path to application-specific AI accelerators at the edge and beyond.


Check out the Paper and Project Page.

The post MIT’s LEGO: A Compiler for AI Chips that Auto-Generates Fast, Efficient Spatial Accelerators appeared first on MarkTechPost.
