4.3. 通信拓扑优化¶
4.3.1. 原理¶
TBA
4.3.2. 操作实践¶
Fleet 实现了底层通过改变通信拓扑,实现分层 allreduce。用户只需要指定相应的DistributedStrategy() 的开关,就可以选择不同的通信拓扑。
dist_strategy = fleet.DistributedStrategy()
dist_strategy.use_hierarchical_allreduce = True
dist_strategy.hierarchical_allreduce_inter_nranks = 8
上述例子存放在:example/resnet/train_fleet_static_communication_topology.py。 假设要运行8卡的任务,那么只需在命令行中执行:
fleetrun --gpus=0,1,2,3,4,5,6,7 train_fleet_static_communication_topology.py
您将看到显示如下日志信息:
----------- Configuration Arguments -----------
gpus: None
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
...
------------------------------------------------
...
INFO 2021-01-19 14:58:43,720 launch_utils.py:472] Local start 8 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:53762 |
| PADDLE_TRAINERS_NUM 8 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:58938,127.0.0.1:54203,127.0.0.1:44221|
| FLAGS_selected_gpus 0 |
| PADDLE_TRAINER_ID 0 |
+=======================================================================================+
...
W0119 14:58:52.487838 95116 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 9.2
W0119 14:58:52.493592 95116 device_context.cc:372] device: 0, cuDNN Version: 7.4.
W0119 14:59:01.665702 95116 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 161. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 5.
[Epoch 0, batch 0] loss: 0.13468, acc1: 0.00000, acc5: 0.06250
[Epoch 0, batch 5] loss: 0.18902, acc1: 0.03125, acc5: 0.03125