The TensorFlow Blog · September 12
TensorFlow v2 adds distributed FFT support

 

Distributed Fast Fourier Transform (Distributed FFT) is an important signal-processing method for large image datasets: it overcomes single-device memory limits by distributing the computation across multiple devices. This article introduces the native Distributed FFT support added to TensorFlow v2 through the DTensor API, along with DTensor's synchronous distributed computing framework. Users simply pass a sharded tensor to the existing FFT ops to compute in a distributed fashion, and the output is likewise a sharded tensor. Profiling shows that while distributed FFT can process more data, communication overhead slows the computation: the local FFTs account for only 3.6% of the total time. Future performance improvements could come from switching FFT algorithms, tuning NCCL communication, reducing the number of collectives, and using N-d local FFTs.

🔍 Distributed Fast Fourier Transform (Distributed FFT) is an important signal-processing method that handles large image datasets by distributing the computation across multiple devices, solving the single-device memory limitation.

🆕 TensorFlow v2 adds native support for Distributed FFT through the DTensor API: users simply pass a sharded tensor to the existing FFT ops (such as tf.signal.fft2d) to compute in parallel across multiple devices.

⚙️ DTensor is an extension for synchronous distributed computing that distributes the program and tensors to multiple devices through Single Program, Multiple Data (SPMD) expansion, offering a uniform API for traditional data and model parallelism.

📊 Profiling shows that although distributed FFT can process more data than its non-distributed counterpart, communication overhead (such as the ncclAllToAll operation) takes up most of the time; the local FFT computation accounts for only 3.6% of the total.

🚀 Future performance optimizations include switching to a different DFT/FFT algorithm, tuning the NCCL communication settings to better use network bandwidth, reducing the number of collectives, and using N-d local FFTs in place of multiple 1-d local FFTs.

Posted by Ruijiao Sun, Google Intern - DTensor team

The Fast Fourier Transform is an important signal-processing method, commonly used to speed up convolutions, extract features, and regularize models. Distributed Fast Fourier Transform (Distributed FFT) offers a way to compute Fourier Transforms in models that work with image-like datasets that are too large to fit into the memory of a single accelerator device. In a previous Google Research paper, “Large-Scale Discrete Fourier Transform on TPUs” by Tianjian Lu, a Distributed FFT algorithm was implemented for TensorFlow v1 as a library. This work presents the newly added native support for Distributed FFT in TensorFlow v2, through the new TensorFlow distribution API, DTensor.

About DTensor

DTensor is an extension to TensorFlow for synchronous distributed computing. It distributes the program and tensors through a procedure called single program, multiple data (SPMD) expansion. DTensor offers a uniform API for the traditional data and model parallelism patterns widely used in machine learning.
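To make the mesh and layout concepts concrete, here is a minimal sketch of sharding a tensor with DTensor on a single machine. The mesh axis name "batch", the 8 logical CPU devices, and the tensor shape are illustrative assumptions, not details from this post.

import tensorflow as tf
from tensorflow.experimental import dtensor

# Split one physical CPU into 8 logical devices so the sketch runs locally.
cpus = tf.config.list_physical_devices('CPU')
tf.config.set_logical_device_configuration(
    cpus[0], [tf.config.LogicalDeviceConfiguration()] * 8)

# A 1-d mesh of 8 devices; a Layout maps each tensor axis to a mesh axis
# (sharded) or marks it UNSHARDED (replicated).
mesh = dtensor.create_mesh([('batch', 8)], device_type='CPU')
layout = dtensor.Layout(['batch', dtensor.UNSHARDED], mesh)

# Materialize a tensor directly with that layout; each device holds a
# (1, 16) slice of the global (8, 16) tensor.
x = dtensor.call_with_layout(tf.ones, layout, shape=(8, 16))
print(dtensor.fetch_layout(x))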

Example Usage

The API interface for distributed FFT is the same as the original FFT in TensorFlow. Users just need to pass a sharded tensor as an input to the existing FFT ops in TensorFlow, such as tf.signal.fft2d. The output of a distributed FFT becomes sharded too.

import tensorflow as tf
from tensorflow.experimental import dtensor

# Set up devices.
device_type = dtensor.preferred_device_type()
if device_type == 'CPU':
    cpu = tf.config.list_physical_devices(device_type)
    tf.config.set_logical_device_configuration(
        cpu[0], [tf.config.LogicalDeviceConfiguration()] * 8)
if device_type == 'GPU':
    gpu = tf.config.list_physical_devices(device_type)
    tf.config.set_logical_device_configuration(
        gpu[0], [tf.config.LogicalDeviceConfiguration(memory_limit=1000)] * 8)
dtensor.initialize_accelerator_system()

# Create a mesh.
mesh = dtensor.create_distributed_mesh(
    mesh_dims=[('x', 1), ('y', 2), ('z', 4)], device_type=device_type)

# Set up a distributed input tensor.
input = tf.complex(
    tf.random.stateless_normal(shape=(2, 2, 4), seed=(1, 2), dtype=tf.float32),
    tf.random.stateless_normal(shape=(2, 2, 4), seed=(2, 4), dtype=tf.float32))
init_layout = dtensor.Layout(['x', 'y', 'z'], mesh)
d_input = dtensor.relayout(input, layout=init_layout)

# Run distributed fft2d. DTensor determines the most efficient
# layout of d_output.
d_output = tf.signal.fft2d(d_input)
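As a follow-up usage sketch (not part of the original example), the layout DTensor chose for the output can be inspected, and the result can be gathered back to a single device by relaying out to a fully replicated layout:

# Inspect the layout DTensor selected for the sharded output.
print(dtensor.fetch_layout(d_output))

# Replicate the result on every device, then read one local component
# tensor; rank=3 matches the (2, 2, 4) tensor above.
replicated = dtensor.relayout(d_output, dtensor.Layout.replicated(mesh, rank=3))
local_output = dtensor.unpack(replicated)[0]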

Performance Analysis

The following experiment demonstrates that the distributed FFT can process more data than the non-distributed one by utilizing memory across multiple devices. The tradeoff is additional time spent on communication and data transposes, which slows down the calculation.

This phenomenon is shown in detail in the profiling result of the 10K×10K distributed FFT experiment. The current implementation of distributed FFT in TensorFlow follows the simple shuffle+local-FFT method, which is also used by other popular distributed FFT libraries such as FFTW and PFFT. Notably, the two local FFT ops take only 3.6% of the total time (15 ms), around 1/3 of the time of the non-distributed fft2d. Most of the computing time is spent on data shuffling, represented by the ncclAllToAll operation. Note that these experiments were conducted on an 8xV100 GPU system.
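For readers who want to reproduce this kind of breakdown, a trace can be captured with the standard TensorFlow profiler; this is a hedged sketch of one way to do it, not the exact setup used for the experiment above (the log directory is an arbitrary choice).

# Capture a trace around the distributed FFT; the resulting timeline,
# viewable in TensorBoard, shows the local FFT ops alongside the
# ncclAllToAll communication ops.
tf.profiler.experimental.start('/tmp/dtensor_fft_profile')
d_output = tf.signal.fft2d(d_input)
tf.profiler.experimental.stop()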

Next steps

The feature is new and we have adopted the simplest distributed FFT algorithm. A few ideas to fine-tune or improve the performance are:

- Switch to a different DFT/FFT algorithm.
- Tune the NCCL communication settings to make better use of network bandwidth.
- Reduce the number of collective communications.
- Use N-d local FFTs instead of multiple 1-d local FFTs.

Try the new distributed FFT! We welcome your feedback on the TensorFlow Forum and look forward to working with you on improving the performance. Your input would be invaluable!
