4.1. OP融合(计算,通信)

4.1.1. 计算融合

将模型网络中顺序执行的多个OPs进行融合能够减少OP 调度的开销,提升训练速度。目前Fleet 中支持如下3种的OP 融合:

  • fuse_all_optimizer_ops:表明是否融合(fuse) 是否融合 optimizer_op,仅对部分 optimizer 可用(SGD、Adam和Momentum)。

  • fuse_elewise_add_act_ops:表明是否融合(fuse) elementwise_add_op和activation_op。

  • fuse_bn_act_ops:表明是否融合(fuse) batch_norm_op 和 activation_op。

通常使用这些策略都会使整体执行过程更快。

4.1.2. 通信融合

AllReduce 融合默认情况下会将同一layer中参数的梯度的多个AllReduce操作合并成一个。 比如对于 fc 中有Weight和Bias两个参数,打开该选项之前,需要两次AllReduce操作;打开该选项之后,只用一次AllReduce 操作。这样可以减少梯度同步时的通信耗时。

此外,为支持更大粒度的参数梯度融合,Fleet 提供了以下两个选项,用户可以在训练程序运行前在DistributedStrategy中设置:

  • fuse_grad_size_in_MB: 指定每个AllReduce操作的梯度字节数,如该参数等于16 则每次AllReduce调用传输16MB的梯度。 该参数的经验值为总通信量的十分之一。

  • fuse_grad_size_in_TFLOPS: 指定每次AllReduce操作的最大层数,即到达该层数就进行AllReduce。如该参数等于50, 则最多每50层做一次 fused AllReduce。

注意: AllReduce融合目前不支持sparse参数梯度。

4.1.3. 操作实践

# 计算融合
build_strategy = paddle.static.BuildStrategy()
build_strategy.fuse_elewise_add_act_ops = True
build_strategy.fuse_bn_act_ops = True
build_strategy.fuse_relu_depthwise_conv = True
build_strategy.fuse_broadcast_ops = True
build_strategy.fuse_all_optimizer_ops = True

strategy = paddle.distributed.fleet.DistributedStrategy()
strategy.build_strategy = build_strategy

# 通信融合
strategy.fuse_grad_size_in_MB = 16
strategy._fuse_grad_size_in_TFLOPS = 50
strategy.fuse_all_reduce_ops=True

上述例子存放在:example/resnet/train_fleet_static_op_fusion.py。 假设要运行2卡的任务,那么只需在命令行中执行:

fleetrun --gpus=0,1 train_fleet_static_op_fusion.py

您将看到显示如下日志信息:

-----------  Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
...
------------------------------------------------
WARNING 2021-01-19 14:53:04,943 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-01-19 14:53:04,945 launch_utils.py:472] Local start 8 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                 PADDLE_CURRENT_ENDPOINT                 127.0.0.1:28355               |
    |                     PADDLE_TRAINERS_NUM                        8                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... 0.1:33653,127.0.0.1:27766,127.0.0.1:16631|
    |                     FLAGS_selected_gpus                        0                      |
    |                       PADDLE_TRAINER_ID                        0                      |
    +=======================================================================================+
...
W0119 14:53:16.871562 68031 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
W0119 14:53:16.875859 68031 device_context.cc:372] device: 0, cuDNN Version: 7.4.
W0119 14:53:25.973377 68031 build_strategy.cc:116] Currently, fuse_broadcast_ops only works under Reduce mode.
I0119 14:53:27.382609 68031 graph_pattern_detector.cc:101] ---  detected 16 subgraphs
I0119 14:53:27.390769 68031 graph_pattern_detector.cc:101] ---  detected 16 subgraphs
W0119 14:53:27.407582 68031 fuse_optimizer_op_pass.cc:207] Find momentum operators : 161, and 161 for dense gradients. To make the speed faster, those optimization are fused during training.
W0119 14:53:27.436177 68031 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 6.
[Epoch 0, batch 0] loss: 0.15131, acc1: 0.00000, acc5: 0.03125
[Epoch 0, batch 5] loss: 1.15416, acc1: 0.00000, acc5: 0.03125