Multi-GPU Training with TensorFlow

Author: Internet

Problem description

This post covers multi-GPU training on a single machine. For multi-machine training, see the references.




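Solution

How tf.distribute.MirroredStrategy works:

  1. Before training starts, the strategy places a full copy of the model on each of the N GPUs.
  2. Each time a batch is fed in, the batch is split into N parts and one part is sent to each GPU.
  3. Each of the N GPUs uses its local variables to compute the gradients for its own part of the data.
  4. An All-reduce operation exchanges the gradients efficiently between the GPUs and sums them.
  5. Each GPU updates its local variables with the summed gradients.
  6. Once every device has updated its local variables, the next training step begins.

By default, MirroredStrategy in TensorFlow uses NVIDIA NCCL for the All-reduce operation.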
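The steps above can be illustrated without any GPU. The following is a minimal pure-Python sketch, not TensorFlow's implementation: it assumes a toy one-parameter model y = w * x with a squared-error loss, and the helper names (local_gradient, all_reduce_sum, split_batch, train_step) are hypothetical names chosen for this illustration. The real All-reduce is performed by NCCL across devices; here it is replaced by a plain sum.

```python
# Pure-Python simulation of synchronous data-parallel training.
# Toy model (an assumption for illustration): y = w * x, squared-error loss.

def local_gradient(w, shard):
    """Gradient of sum((w*x - y)^2) with respect to w over one replica's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)


def all_reduce_sum(values):
    """Stand-in for All-reduce: every replica receives the same total."""
    total = sum(values)
    return [total] * len(values)


def split_batch(batch, num_replicas):
    """Split a global batch into num_replicas contiguous, equally sized shards."""
    shard_size = len(batch) // num_replicas
    return [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_replicas)]


def train_step(replica_weights, batch, learning_rate):
    """One synchronous step: split, local gradients, all-reduce, local update."""
    shards = split_batch(batch, len(replica_weights))
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    summed = all_reduce_sum(grads)
    # Every replica applies the same summed gradient (averaged over the
    # global batch), so the copies of the weight never diverge.
    return [w - learning_rate * g / len(batch)
            for w, g in zip(replica_weights, summed)]


num_replicas = 4
replica_weights = [0.0] * num_replicas               # step 1: one copy per replica
batch = [(float(x), 3.0 * x) for x in range(1, 9)]   # 8 examples with y = 3x

for _ in range(50):
    replica_weights = train_step(replica_weights, batch, learning_rate=0.01)

print(replica_weights)
```

Because every replica starts from the same weight and applies the same reduced gradient, all copies stay identical after each step, which is the property that lets the mirrored variables be treated as one model.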

Installation

pip install tensorflow-datasets --upgrade

Before (single GPU)

import tensorflow as tf
import tensorflow_datasets as tfds


def resize(image, label):
    """Preprocess an image: resize to 224x224 and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label


batch_size = 64
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN, as_supervised=True)
dataset = dataset.map(resize).shuffle(1024).batch(batch_size)

model = tf.keras.applications.MobileNetV2(weights=None, classes=2)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.sparse_categorical_crossentropy,
    metrics=[tf.keras.metrics.sparse_categorical_accuracy]
)

model.fit(dataset, epochs=5)
# Epoch 1/5
# 364/364 [==============================] - 110s 303ms/step - loss: 0.6229 - sparse_categorical_accuracy: 0.6500
# Epoch 2/5
# 364/364 [==============================] - 111s 305ms/step - loss: 0.4781 - sparse_categorical_accuracy: 0.7690
# Epoch 3/5
# 364/364 [==============================] - 110s 301ms/step - loss: 0.3919 - sparse_categorical_accuracy: 0.8202
# Epoch 4/5
# 364/364 [==============================] - 113s 311ms/step - loss: 0.3171 - sparse_categorical_accuracy: 0.8602
# Epoch 5/5
# 364/364 [==============================] - 113s 311ms/step - loss: 0.2532 - sparse_categorical_accuracy: 0.8919

After (with MirroredStrategy)

import tensorflow as tf
import tensorflow_datasets as tfds


def resize(image, label):
    """Preprocess an image: resize to 224x224 and scale pixel values to [0, 1]."""
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label


strategy = tf.distribute.MirroredStrategy()
batch_size = 64 * strategy.num_replicas_in_sync  # global batch = per-GPU batch size x number of devices

dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN, as_supervised=True)
dataset = dataset.map(resize).shuffle(1024).batch(batch_size)

with strategy.scope():
    model = tf.keras.applications.MobileNetV2(weights=None, classes=2)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=tf.keras.losses.sparse_categorical_crossentropy,
        metrics=[tf.keras.metrics.sparse_categorical_accuracy]
    )

print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
model.fit(dataset, epochs=5)
# Number of devices: 4
# Epoch 1/5
# 91/91 [==============================] - 35s 390ms/step - loss: 0.6459 - sparse_categorical_accuracy: 0.6374
# Epoch 2/5
# 91/91 [==============================] - 34s 377ms/step - loss: 0.5499 - sparse_categorical_accuracy: 0.7225
# Epoch 3/5
# 91/91 [==============================] - 34s 373ms/step - loss: 0.4560 - sparse_categorical_accuracy: 0.7826
# Epoch 4/5
# 91/91 [==============================] - 35s 382ms/step - loss: 0.3811 - sparse_categorical_accuracy: 0.8285
# Epoch 5/5
# 91/91 [==============================] - 34s 379ms/step - loss: 0.3274 - sparse_categorical_accuracy: 0.8558
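The two logs can be cross-checked with a little arithmetic. The sketch below is only a sanity check: NUM_EXAMPLES is an assumed size for the cats_vs_dogs train split (the post does not state it), and the epoch times are rounded from the logs (about 110 s on one GPU, about 35 s on four).

```python
import math


def steps_per_epoch(num_examples, per_replica_batch, num_replicas):
    """Number of steps when the global batch is per_replica_batch * num_replicas."""
    global_batch = per_replica_batch * num_replicas
    return math.ceil(num_examples / global_batch)


NUM_EXAMPLES = 23262  # assumption: size of the cats_vs_dogs train split

single = steps_per_epoch(NUM_EXAMPLES, per_replica_batch=64, num_replicas=1)
multi = steps_per_epoch(NUM_EXAMPLES, per_replica_batch=64, num_replicas=4)
print(single, multi)  # 364 91

# Rough speedup from the logged epoch times (rounded from the logs).
speedup = 110 / 35
efficiency = speedup / 4  # fraction of ideal 4x scaling
print(round(speedup, 2), round(efficiency, 2))
```

Each step takes somewhat longer with four GPUs (roughly 380 ms against 305 ms) because it processes four times as much data and pays for the gradient exchange, but there are only a quarter as many steps, so an epoch finishes roughly three times faster. The accuracy after five epochs is slightly lower in the multi-GPU run, which may be related to the larger global batch making fewer weight updates per epoch at the same learning rate.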




References

  1. Custom training with tf.distribute.Strategy
  2. Distributed training with TensorFlow

Source: https://blog.csdn.net/lly1122334/article/details/118931338