4.2. Communication Overlap
4.2.1. Introduction
Overlapping Paddle's communication operations can effectively improve communication efficiency.
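4.2.2. How It Works
Paddle's framework currently has only one compute stream, but it can have multiple communication streams. On low-end networks where communication is the bottleneck, overlapping the communication streams makes better use of the available bandwidth and therefore gives better communication performance. For the concepts behind multiple streams, see: cuda-streams-best-practices.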
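To make the effect of multiple communication streams concrete, here is a toy timing model in Python. It is only a sketch under strong assumptions, not a description of how Paddle schedules NCCL calls. It assumes that the single compute stream makes gradients ready one after another, that each all-reduce takes a fixed time, and that concurrent streams do not slow each other down. The function name `finish_time` and all numbers are hypothetical.

```python
import heapq


def finish_time(ready_times, durations, num_streams):
    """Return when the last all-reduce finishes in a toy model.

    Assumptions (illustrative only): ops are launched in order, each op
    goes to the communication stream that becomes free first, and streams
    do not compete for bandwidth.
    """
    # Each heap entry is the time at which one communication stream is free.
    streams = [0] * num_streams
    heapq.heapify(streams)
    last_finish = 0
    for ready, duration in zip(ready_times, durations):
        free_at = heapq.heappop(streams)
        # An op cannot start before its gradient is ready
        # or before its stream is free.
        start = max(ready, free_at)
        finish = start + duration
        heapq.heappush(streams, finish)
        last_finish = max(last_finish, finish)
    return last_finish


# Hypothetical workload: 5 all-reduce ops, ready at t = 1..5, each taking 3 units.
ready = [1, 2, 3, 4, 5]
cost = [3] * 5
print(finish_time(ready, cost, num_streams=1))  # 16
print(finish_time(ready, cost, num_streams=2))  # 10
```

With one stream the all-reduce ops queue behind each other; with two streams a newly ready op can start while an earlier one is still in flight. On real hardware the streams share the same link, so the actual gain is likely to be smaller than this model suggests.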
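4.2.3. Usage
Fleet already implements communication-stream overlap. You only need to set the number of communicators, nccl_comm_num, to speed up communication between GPUs. The recommended value is 1 for single-machine training and 2 for multi-machine training.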
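import paddle.distributed.fleet as fleet

strategy = fleet.DistributedStrategy()
strategy.nccl_comm_num = 2  # number of NCCL communicators
strategy.sync_nccl_allreduce = False  # turn off synchronous NCCL all-reduce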
The example above is located at example/resnet/train_fleet_static_overlap.py. To run it on 2 GPUs, execute the following on the command line:
fleetrun --gpus=0,1 train_fleet_static_overlap.py
You will see log output like the following:
----------- Configuration Arguments -----------
gpus: 0,1
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
...
------------------------------------------------
...
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:10097 |
| PADDLE_TRAINERS_NUM 2 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:10097,127.0.0.1:59371 |
| FLAGS_selected_gpus 0 |
| PADDLE_TRAINER_ID 0 |
+=======================================================================================+
...
W0118 21:44:34.542804 70071 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
W0118 21:44:34.547377 70071 device_context.cc:372] device: 0, cuDNN Version: 7.4.
W0118 21:44:40.178053 70071 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
[Epoch 0, batch 0] loss: 0.14466, acc1: 0.00000, acc5: 0.03125
[Epoch 0, batch 5] loss: 4.00225, acc1: 0.00000, acc5: 0.03125
...
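The recommendation in section 4.2.3 (1 communicator on a single machine, 2 across machines) can be turned into a small helper. The sketch below is not part of Fleet: the function names `count_nodes` and `recommended_nccl_comm_num` are hypothetical, and it assumes that machines can be told apart by the host part of the PADDLE_TRAINER_ENDPOINTS value shown in the log above.

```python
import os


def count_nodes(trainer_endpoints: str) -> int:
    """Count distinct hosts in a comma-separated "host:port" list."""
    hosts = {
        endpoint.strip().rsplit(":", 1)[0]
        for endpoint in trainer_endpoints.split(",")
        if endpoint.strip()
    }
    return len(hosts)


def recommended_nccl_comm_num(num_nodes: int) -> int:
    """Apply the rule of thumb: 1 for a single machine, 2 for several."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return 1 if num_nodes == 1 else 2


# Fall back to the endpoints from the sample log when the variable is unset.
endpoints = (
    os.environ.get("PADDLE_TRAINER_ENDPOINTS")
    or "127.0.0.1:10097,127.0.0.1:59371"
)
comm_num = recommended_nccl_comm_num(count_nodes(endpoints))
print(comm_num)

# The result could then be assigned to the strategy, for example:
#   strategy.nccl_comm_num = comm_num
```

For the two endpoints in the sample log, both trainers run on 127.0.0.1, so the helper treats the job as single-machine and returns 1. The values 1 and 2 are starting points; the best setting for a given cluster may need to be measured.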