Pushing the limits of GPU performance with XLA

November 14, 2018

Posted by Toby Boyd, Yanan Cao, Sanjoy Das, Thomas Joerg, Justin Lebar

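XLA is a compiler for TensorFlow graphs that you can use today to accelerate your TensorFlow ML models with minimal source code changes. This post describes what XLA is and shows how you can try it in your own code.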

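TensorFlow 1.12 (with XLA) achieves higher performance than TF 1.11 (without XLA) when training ResNet50 v1.0 on NVIDIA® Tesla® V100 GPUs: 10,526 images/sec with synthetic data and 10,267 images/sec with real data (see the appendix for reproduction instructions). We have observed speedups ranging from 1.13x to 3.04x on a variety of internal models.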

Chart 1: Bar chart showing ResNet50 v1 training performance with synthetic data, comparing TensorFlow v1.11 without XLA and TensorFlow v1.12 with XLA.
1 GPU: 888 images/sec without XLA, 1,401 images/sec with XLA.
8 GPUs: 6,818 images/sec without XLA, 10,526 images/sec with XLA.

Chart 2: Bar chart showing ResNet50 v1 training performance with real data, comparing TensorFlow v1.11 without XLA and TensorFlow v1.12 with XLA.
1 GPU: 871 images/sec without XLA, 1,395 images/sec with XLA.
8 GPUs: 6,413 images/sec without XLA, 10,268 images/sec with XLA.
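
To make the chart figures easier to compare, here is a small sketch that turns them into speedup ratios. The numbers are copied from the chart descriptions above; the dictionary layout and the `speedup` helper are just illustrative names. Keep in mind that each ratio compares TF 1.12 with XLA against TF 1.11 without XLA, so it may also reflect differences between the two releases.

```python
# Throughput in images/sec, copied from the chart descriptions above.
# Keys are (data source, GPU count); values are (without XLA, with XLA).
throughput = {
    ("synthetic", 1): (888, 1401),
    ("synthetic", 8): (6818, 10526),
    ("real", 1): (871, 1395),
    ("real", 8): (6413, 10268),
}


def speedup(without_xla, with_xla):
    """Return the ratio of XLA throughput to non-XLA throughput."""
    return with_xla / without_xla


# Round to two decimals, matching the precision used elsewhere in the post.
speedups = {key: round(speedup(*pair), 2) for key, pair in throughput.items()}

for (data, gpus), ratio in speedups.items():
    print(f"{data} data, {gpus} GPU(s): {ratio:.2f}x")
```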

XLA: TensorFlow, compiled!

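Normally, when you run a TensorFlow graph, the TensorFlow graph executor executes every operation individually. Each op has a precompiled GPU kernel implementation (shipped as part of the TensorFlow binary) that the graph executor dispatches to.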

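XLA provides an alternative mode of running TF models: it compiles your TensorFlow graph into a sequence of GPU kernels generated specifically for your model. Because these kernels are unique to your program, they can exploit model-specific information for optimization.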

As an example, let's look at an optimization XLA performs on a simple TensorFlow computation.

def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)

Run without XLA, the graph launches three kernels: one for the multiplication, one for the addition, and one for the reduction.

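However, XLA can optimize the graph so that it computes the result in a single kernel launch. It does this by "fusing" the addition, multiplication, and reduction into a single GPU kernel. Moreover, this fused operation does not write the intermediate values produced by y*z and x+y*z out to memory; instead, it "streams" the results of these intermediate computations directly to their users while keeping them entirely in GPU registers.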

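Fusion is XLA's single most important optimization. Memory bandwidth is typically the scarcest resource on hardware accelerators, so removing memory operations is one of the best ways to improve performance.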
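
To make the idea of fusion concrete, here is a plain-Python analogy (not XLA code, and not how XLA is implemented). The `unfused` function mirrors the three-kernel execution by materializing each intermediate result, while `fused` computes the same value in a single pass with no intermediate buffers. Both function names are made up for this sketch.

```python
def unfused(x, y, z):
    """Three passes; the first two materialize a full intermediate list."""
    product = [b * c for b, c in zip(y, z)]       # pass 1: y * z
    total = [a + p for a, p in zip(x, product)]   # pass 2: x + y * z
    return sum(total)                             # pass 3: reduction


def fused(x, y, z):
    """One pass; intermediates live only in local variables."""
    acc = 0.0
    for a, b, c in zip(x, y, z):
        acc += a + b * c
    return acc


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [0.5, 0.5, 0.5]

# Both versions compute the same reduction over x + y * z.
print(unfused(x, y, z), fused(x, y, z))
```

On a GPU the benefit comes from skipping the memory traffic for `product` and `total`; this sketch only shows the structural difference between the two approaches.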

Using XLA for your models

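XLA exposes an API, xla.compile, that lets you explicitly invoke the XLA compiler on a part of your TensorFlow graph. xla.compile accepts a Python function that generates a TensorFlow computation and wires up the generated computation to be compiled by XLA. xla.compile returns a list of tensors, each corresponding to an output of the computation built by the passed-in function, but now optimized by XLA.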

So the computation generated by model_fn above can be run with XLA by invoking xla.compile as follows:

import tensorflow as tf
from tensorflow.contrib.compiler import xla

def model_fn(x, y, z):
  return tf.reduce_sum(x + y * z)

def create_and_run_graph():
  with tf.Session() as sess:
    x = tf.placeholder(tf.float32, name='x')
    y = tf.placeholder(tf.float32, name='y')
    z = tf.placeholder(tf.float32, name='z')
    result = xla.compile(computation=model_fn, inputs=(x, y, z))[0]
    # `result` is a normal Tensor (albeit one that is computed by an XLA
    # compiled executable) and can be used like any other Tensor.
    result = tf.add(result, result)
    return sess.run(result, feed_dict={ ... })

You can use a command-line flag (or other arbitrary logic) to control whether your computation is compiled by XLA. Models commonly call xla.compile like this:

if should_use_xla():
  result = xla.compile(model_fn, (x, y, z))[0]
else:
  result = model_fn(x, y, z)

This makes experimentation easy.
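
The post does not define should_use_xla, so here is one possible sketch of it built on the standard argparse module. The `--use_xla` flag name and the `build_parser` helper are assumptions made for illustration; any flag library or configuration mechanism would work just as well.

```python
import argparse


def build_parser():
    """Build a parser with a single boolean switch for XLA compilation."""
    parser = argparse.ArgumentParser(description="Toggle XLA compilation.")
    # `--use_xla` is a hypothetical flag name chosen for this sketch.
    parser.add_argument("--use_xla", action="store_true",
                        help="Compile the model function with xla.compile.")
    return parser


def should_use_xla(argv=None):
    """Return True when the hypothetical --use_xla flag is present.

    With argv=None, argparse reads the process's own command line.
    Unknown flags are ignored so the sketch can coexist with other flags.
    """
    args, _unknown = build_parser().parse_known_args(argv)
    return args.use_xla


print(should_use_xla(["--use_xla"]))
print(should_use_xla([]))
```

With a helper like this, the same training script can be run with and without the flag to compare throughput.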

We have set up a colab where you can try xla.compile on a more complex model.

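xla.compile is not the only way to invoke XLA on a TensorFlow subgraph; in particular, you can ask TensorFlow to *automatically* find XLA-compatible subgraphs and compile them with XLA, but we won't discuss those approaches in this post.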

Caveats for using XLA

First, the XLA GPU backend is experimental at this time. While we're not aware of any major problems, it hasn't been tested with broad production use.

Second, xla.compile does not yet work with Keras high-level APIs like model.fit (though you can use Keras ops), or in eager mode. We're actively working on APIs to enable XLA in these modes; stay tuned.

Third, XLA cannot compile all TensorFlow graphs; only graphs with the following properties can be passed to xla.compile.

All operations must have inferable shapes

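XLA needs to be able to infer the shapes of all operations it compiles, given the inputs to the computation. A model function that produces a tensor with an unpredictable shape will therefore fail with an error when run. (For instance, if the shape of a tf.expand_dims output depends on a value such as random_dim_size that is only known at run time, it cannot be inferred from x, y, and z.)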

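Note that because XLA is a JIT compiler, shapes *can* vary across runs, as long as they can be inferred given the inputs to the cluster. For instance, feeding inputs of a different shape on each run is fine.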
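
The distinction between the two cases above can be sketched with a toy shape-inference function in plain Python. This is only a mental model, not XLA's actual algorithm; `expand_dims_shape` is a hypothetical helper that mimics how the output shape of an expand_dims-like op follows from its input shape and axis.

```python
def expand_dims_shape(input_shape, axis):
    """Toy shape inference for an expand_dims-like op.

    `axis` is an int when it is known at compile time, or None when it is
    only known at run time (for example, drawn from a random op).
    """
    if axis is None:
        raise ValueError("output shape depends on a runtime value")
    if not 0 <= axis <= len(input_shape):
        raise ValueError("axis out of range")
    return input_shape[:axis] + (1,) + input_shape[axis:]


# Shapes may differ between runs, as long as each one is inferable
# from the inputs given for that run.
print(expand_dims_shape((2,), axis=0))
print(expand_dims_shape((3, 4), axis=1))

# An axis that is only known at run time makes the output shape uninferable.
try:
    expand_dims_shape((2,), axis=None)
except ValueError as err:
    print("cannot compile:", err)
```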

All operations must be supported by XLA

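Not all TensorFlow ops can be compiled by XLA, and if your model contains an op that XLA does not support, XLA compilation will fail. For example, XLA does not support the tf.where op, so a model function that includes this op will fail when run with xla.compile.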

Every TensorFlow op supported by XLA has a REGISTER_XLA_OP call in tensorflow/compiler/tf2xla/kernels/, so you can grep for instances of the REGISTER_XLA_OP macro to find the list of supported TensorFlow ops.
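
As an alternative to grep, the same search can be sketched in Python. The regular expression assumes registrations of the form REGISTER_XLA_OP(Name("OpName")..., Kernel); registrations written differently would be missed, so treat the result as a starting point rather than an authoritative list. `find_xla_ops` is a name invented for this sketch.

```python
import os
import re

# Assumed macro form: REGISTER_XLA_OP(Name("OpName")..., KernelClass);
_PATTERN = re.compile(r'REGISTER_XLA_OP\(\s*Name\("([^"]+)"\)')


def find_xla_ops(kernels_dir):
    """Return the sorted, de-duplicated op names registered under kernels_dir."""
    ops = set()
    for root, _dirs, files in os.walk(kernels_dir):
        for name in files:
            # Only look at C++ sources and headers.
            if not name.endswith((".cc", ".h")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                ops.update(_PATTERN.findall(f.read()))
    return sorted(ops)


if __name__ == "__main__":
    # Run from the root of a TensorFlow source checkout.
    for op in find_xla_ops("tensorflow/compiler/tf2xla/kernels"):
        print(op)
```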

Appendix

Performance on Google benchmarks

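Below is a plot of the relative speedup/slowdown of TensorFlow with XLA versus TensorFlow without XLA on all of the XLA team's benchmark models, run on a V100 GPU. We aren't holding anything back; this is the full set of benchmarks that we currently use to evaluate the compiler.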

Chart showing the speedup/slowdown of TensorFlow with XLA versus TensorFlow without XLA on Google-internal benchmarks. The data is a list of results for fp16 and fp32 models, sorted by speedup.
fp32 results: [0.86 0.94 0.94 0.97 0.98 0.99 0.99 0.99 1.00 1.01 1.01 1.01 1.01 1.02 1.04 1.05 1.06 1.06 1.07 1.07 1.08 1.08 1.08 1.09 1.09 1.10 1.10 1.11 1.11 1.11 1.12 1.12 1.12 1.13 1.15 1.15 1.18 1.18 1.20 1.27 1.30 1.30 1.32 1.37 1.40 1.41 1.43 1.44 1.52]
fp16 results: [1.10 1.32 1.41 1.47 1.48 1.55 1.56 1.59 1.63 1.64 1.64 1.67 2.07 2.51 3.09]

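Each bar represents one complete model, for example "resnet50 training images/sec" or "inference throughput of a Google-internal model". The X-axis is sorted by speedup.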

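Your results may vary, particularly because we have tuned XLA specifically for many of these benchmarks! Nonetheless, many of these benchmarks work well out of the box, and we are continuing to improve.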
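
For readers who want a quick numeric summary of the benchmark chart, the sketch below computes a few statistics from the speedup lists given in the chart description above. The `summarize` helper and its field names are illustrative only; the numbers themselves are copied from the post.

```python
import statistics

# Speedups copied from the chart description (with XLA / without XLA).
fp32 = [0.86, 0.94, 0.94, 0.97, 0.98, 0.99, 0.99, 0.99, 1.00, 1.01,
        1.01, 1.01, 1.01, 1.02, 1.04, 1.05, 1.06, 1.06, 1.07, 1.07,
        1.08, 1.08, 1.08, 1.09, 1.09, 1.10, 1.10, 1.11, 1.11, 1.11,
        1.12, 1.12, 1.12, 1.13, 1.15, 1.15, 1.18, 1.18, 1.20, 1.27,
        1.30, 1.30, 1.32, 1.37, 1.40, 1.41, 1.43, 1.44, 1.52]
fp16 = [1.10, 1.32, 1.41, 1.47, 1.48, 1.55, 1.56, 1.59, 1.63, 1.64,
        1.64, 1.67, 2.07, 2.51, 3.09]


def summarize(name, speedups):
    """Return the count, median, and number of slowdowns for one list."""
    return {
        "name": name,
        "count": len(speedups),
        "median": statistics.median(speedups),
        # Values below 1.0 mean the model ran slower with XLA.
        "slower_with_xla": sum(1 for s in speedups if s < 1.0),
    }


for summary in (summarize("fp32", fp32), summarize("fp16", fp16)):
    print(summary)
```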

Reproducing the ResNet50 v1.0 benchmarks

The following section walks through setting up a Google Cloud instance and running the ResNet50 benchmarks.

Prepare data

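This step is only needed for real data testing and may take a few hours to complete. We recommend doing it on a CPU-only instance to reduce compute cost. Follow the instructions for imagenet_to_gcs.py to create ImageNet data in TFRecord format and push it to a Google Cloud Storage bucket.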

Create GCE instance

The snippet below creates an instance of the Google Deep Learning VM with eight Tesla® V100 GPUs on Google Cloud Platform.

export INSTANCE_NAME="xla-benchmark-8xV100"
export IMAGE_FAMILY="tf-1-12-cu100"
export PROJECT_NAME=""

gcloud beta compute instances create $INSTANCE_NAME \
  --project=$PROJECT_NAME \
  --machine-type=n1-standard-64 \
  --maintenance-policy=TERMINATE \
  --accelerator=type=nvidia-tesla-v100,count=8 \
  --tags=http-server,https-server \
  --image-family=$IMAGE_FAMILY \
  --image-project=deeplearning-platform-release \
  --boot-disk-size=100GB \
  --boot-disk-type=pd-ssd \
  --local-ssd interface=nvme \
  --local-ssd interface=nvme \
  --local-ssd interface=nvme \
  --local-ssd interface=nvme \
  --metadata install-nvidia-driver=True

## Combines the 4 local nvme SSD drives into a single RAID 0 drive.
# Install raid management tool.
sudo apt-get update && sudo apt-get install mdadm --no-install-recommends

# Creates RAID 0 array.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
/dev/nvme0n1 /dev/nvme0n2 /dev/nvme0n3 /dev/nvme0n4 

# Formats and mounts the array.
sudo mkfs.ext4 -F /dev/md0
sudo mkdir -p /data/imagenet
sudo mount /dev/md0 /data
sudo chmod a+w /data

# Installs custom TensorFlow 1.12 binary with AVX2.  Binary included on
# the image already has XLA but the custom binary is compiled with AVX2.
sudo pip install --force-reinstall https://storage.googleapis.com/tf-performance/tf_binary/tensorflow-1.12.0.a6d8ffa.AVX2.CUDA10-cp27-cp27mu-linux_x86_64.whl

Run the benchmarks

gcloud compute ssh $INSTANCE_NAME

# Clone TensorFlow benchmark repository.
git clone https://github.com/tensorflow/benchmarks.git && cd benchmarks
git reset --hard 1e7d788042dfc6d5e5cd87410c57d5eccee5c664
cd scripts/tf_cnn_benchmarks

## Synthetic data test

# 8 GPUs
python tf_cnn_benchmarks.py \
    --batch_size=364 \
    --num_batches=100 \
    --model=resnet50 \
    --optimizer=momentum \
    --variable_update=replicated \
    --all_reduce_spec=nccl \
    --use_fp16=True \
    --nodistortions \
    --gradient_repacking=2 \
    --compute_lr_on_cpu=True \
    --single_l2_loss_op=True \
    --xla_compile=True \
    --num_gpus=8 \
    --loss_type_to_report=base_loss

# 1 GPU
python tf_cnn_benchmarks.py \
    --batch_size=364 \
    --num_batches=100 \
    --model=resnet50 \
    --optimizer=momentum \
    --use_fp16=True \
    --nodistortions \
    --compute_lr_on_cpu=True \
    --single_l2_loss_op=True \
    --xla_compile=True \
    --loss_type_to_report=base_loss

## Real data test
# add --data_dir=/data/imagenet to the 1 or 8 GPU command.
