
Installing Horovod, a distributed machine learning framework (Linux)

Author: Internet

1. Download and install Open MPI

  Download link:

    https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz

  Installation commands (installing into /usr/local may require root privileges):

shell$ gunzip -c openmpi-4.0.1.tar.gz | tar xf -
shell$ cd openmpi-4.0.1
shell$ ./configure --prefix=/usr/local
<...lots of output...>
shell$ make all install

  

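After installation, refresh the shared library cache so the new Open MPI libraries can be found:

shell$ sudo ldconfig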

2. Install Horovod

 Official documentation: https://github.com/horovod/horovod#install

[sudo] pip3 install horovod
To install Horovod with NCCL support instead:
HOROVOD_GPU_ALLREDUCE=NCCL pip3 install --no-cache-dir horovod
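
Before moving on, it can help to confirm that the pieces installed above are visible to the Python interpreter you plan to train with. The following is a minimal sketch, not part of Horovod: the helper name check_prerequisites and its report format are made up for illustration, and it only checks that the modules and the mpirun executable can be located, not that they work together.

```python
import importlib.util
import shutil


def check_prerequisites(modules=("horovod", "tensorflow"), commands=("mpirun",)):
    """Report which Python modules and executables this interpreter can locate.

    Hypothetical helper for illustration; it does not import or run anything.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[f"module:{name}"] = importlib.util.find_spec(name) is not None
    for name in commands:
        # shutil.which returns None when the executable is not on PATH.
        report[f"command:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    for item, found in check_prerequisites().items():
        print(f"{item}: {'found' if found else 'MISSING'}")
```

If mpirun is reported missing, /usr/local/bin is probably not on PATH. For a deeper check of which frameworks and collective backends Horovod was built with, Horovod also provides `horovodrun --check-build`.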

3. Using Horovod

3.1 TensorFlow code changes


import tensorflow as tf
import horovod.tensorflow as hvd


# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)
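
Two details in the script above are easy to miss: the learning rate is multiplied by hvd.size(), and only rank 0 receives a checkpoint directory. The sketch below restates both rules in plain Python so they can be tried without TensorFlow or Horovod installed. The function names are hypothetical, and world_size and rank stand in for the values that hvd.size() and hvd.rank() would return.

```python
def scaled_learning_rate(base_lr, world_size):
    """Linear scaling: multiply the single-process rate by the process count.

    With synchronous training, each step averages gradients from world_size
    batches, so the effective batch size grows by the same factor.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    return base_lr * world_size


def checkpoint_dir_for(rank, path="/tmp/train_logs"):
    """Return the checkpoint directory for rank 0 and None for every other rank."""
    return path if rank == 0 else None


if __name__ == "__main__":
    world_size = 8  # stand-in for hvd.size()
    print("learning rate:", scaled_learning_rate(0.01, world_size))
    for rank in range(world_size):  # each value is a stand-in for hvd.rank()
        print(rank, checkpoint_dir_for(rank))
```

Linear scaling is a common starting point rather than a guarantee. Large process counts may also need a warm-up period or a smaller multiplier.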


3.2 Running the TensorFlow job

Launch the job with mpirun. The MCA parameter plm_rsh_args passes the SSH port used to reach the other hosts:


mpirun --allow-run-as-root --oversubscribe \
 -np 8 -H ubuntu1:4,ubuntu2:4 \
 -bind-to none -map-by slot \
 -mca plm_rsh_args "-p 22" \
 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
 -mca pml ob1 -mca btl ^openib \
 python3 -u train.py 
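
The value passed to -np must equal the sum of the per-host slot counts given to -H (4 + 4 = 8 here), which is easy to get wrong when hosts are added or removed. As a rough sketch, the command can be assembled from a host table. The helper build_mpirun_command is hypothetical; the host names and flags are the ones used in the command above.

```python
import shlex


def build_mpirun_command(hosts, script="train.py", ssh_port=22):
    """Build the mpirun argument list from a {hostname: slots} mapping.

    Hypothetical helper for illustration; it only assembles the arguments.
    """
    total = sum(hosts.values())  # -np is the total number of processes
    host_list = ",".join(f"{name}:{slots}" for name, slots in hosts.items())
    return [
        "mpirun", "--allow-run-as-root", "--oversubscribe",
        "-np", str(total), "-H", host_list,
        "-bind-to", "none", "-map-by", "slot",
        "-mca", "plm_rsh_args", f"-p {ssh_port}",
        "-x", "NCCL_DEBUG=INFO", "-x", "LD_LIBRARY_PATH", "-x", "PATH",
        "-mca", "pml", "ob1", "-mca", "btl", "^openib",
        "python3", "-u", script,
    ]


if __name__ == "__main__":
    cmd = build_mpirun_command({"ubuntu1": 4, "ubuntu2": 4})
    # shlex.join quotes arguments such as "-p 22" so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run directly, which avoids shell quoting issues with the "-p 22" argument.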


Source: https://blog.csdn.net/sinat_22510827/article/details/122017745