Speeding up Transformer CPU inference in Google Cloud

By Mariko Wakabayashi
Wednesday, 10 November 2021

The Twitter Cortex team has been focused on improving content understanding to extract new topics, promote healthy conversations, and help users discover more relevant Tweets and accounts to follow.

Transformer-based models like BERT are among the most effective natural language processing (NLP) techniques for understanding words and phrases in different contexts. BERT can distinguish between the different meanings of homonyms by generating more nuanced, context-dependent embeddings.

Machine Learning (ML) practitioners at Twitter have seen significant performance gains on NLP tasks such as content moderation and topic discovery by incorporating transformer-based embeddings. However, this performance gain comes with latency and throughput tradeoffs due to transformers' computationally intensive nature. For example, we have a fine-tuned BERT model that processes Tweets asynchronously for downstream tasks to consume. This can take around 500ms to process a single Tweet (of at most 128 tokens) on a CPU-based machine. The processing time can be greatly reduced to 20ms by running the model on a GPU instance, but that can get very costly as the model inference demand continues to scale.

To power Twitter features like recommendations with transformer-based embeddings, we wanted to investigate techniques that can help improve throughput and minimize computational costs without compromising the model architecture. As a result, we investigated several different inference engines, dynamic quantization techniques, and hardware on Google Cloud. We gathered our findings to help the rest of the ML community improve transformers’ speed and reduce their computational demand in production.

Background

The following companies have shared optimization techniques and findings to improve latency for BERT CPU inference:

  • Roblox sped up their fine-tuned PyTorch BERT-base model by over 30x with three techniques: model distillation, variable-length inputs, and dynamic quantization. Roblox saw the largest performance boost from dynamic quantization with a small negative F1 impact (<1%) and achieved a median latency of under ~20ms for batch size 1. It’s a little unclear what sequence length was used for benchmarking, but based on the examples provided in the article, the length seems to range between 10 and 13 tokens. (Published: 5/2020)
  • Microsoft sped up their PyTorch BERT-base model by 1.2x with ONNX runtime conversion and optimization for a sequence length of 128 tokens and a batch size of 1. (Published: 5/2020)
  • Microsoft sped up their ONNX BERT-base model by 2.9x with dynamic quantization and using Intel Deep Learning Boost: Vector Neural Network Instructions without significant loss of accuracy. They achieved a median latency of 4.5ms for a batch size of 1. Their benchmark was done on sequence lengths of 20, 32, and 64. However, it’s a little unclear what sequence length was used to achieve the 4.5ms latency. (Published: 3/2021)
  • Rasa reduced their TensorFlow BERT-base model size by 4x with TensorFlow Lite 8-bit quantization. However, the CPU inference speed slowed down by ~5x. (Published: 8/2019)

In the findings above, some benchmarking details that can affect inference speed, such as sequence length, were either omitted or uncontrolled. Moreover, a year or more has passed since most of these companies presented their benchmarks, so the tested inference engines may have gained new performance improvements since then.

To understand the current state (7/2021) of optimization techniques and how they can be applied to transformer models at Twitter, we wanted to benchmark and find the best configuration for a transformer-based model like BERT in Google Cloud Platform. We used the following fixed parameters:


We chose 128 tokens for the sequence length because this length was used for a fine-tuned BERT model at Twitter. However, in most cases, a Tweet's length is much lower, so we also performed additional single-threaded tests for sequence lengths of 8 and 16 tokens.

Since most production models at Twitter use TensorFlow, our benchmark focuses on comparison between vanilla TensorFlow and optimized models. 

Google Cloud benchmarking setup

ONNX Runtime has a benchmarking script to help measure the performance of ONNX Runtime, PyTorch, and TorchScript on pretrained transformer models. We adapted their script to test and dynamically quantize the pretrained BERT Base Uncased English model on four inference engines: ONNX Runtime, PyTorch, TorchScript, and TensorFlow. An option to dynamically quantize a TensorFlow model wasn’t available, so we updated the script to convert the TensorFlow models into TFLite and created the options to apply int8 or fp16¹ quantization.
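
To make the idea behind dynamic int8 quantization concrete, here is a minimal pure-Python sketch. It is only an illustration under simplifying assumptions (symmetric scaling, one scale per tensor) and is not how ONNX Runtime, PyTorch, or TFLite implement it. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric int8 quantization: one scale maps floats into [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dynamic_quantized_dot(weights_q: List[int], weight_scale: float,
                          activations: List[float]) -> float:
    """Dot product with pre-quantized weights and activations quantized
    on the fly at inference time (the "dynamic" part)."""
    activations_q, activation_scale = quantize_int8(activations)
    # Integer multiply-accumulate: the kind of work int8 instructions speed up.
    accumulator = sum(w * a for w, a in zip(weights_q, activations_q))
    # Rescale the integer result back to floating point.
    return accumulator * weight_scale * activation_scale


# Toy values chosen only for illustration.
weights = [0.5, -1.0, 0.25]
activations = [1.0, 2.0, -4.0]

weights_q, weight_scale = quantize_int8(weights)  # done once, ahead of time
exact = sum(w * a for w, a in zip(weights, activations))
approx = dynamic_quantized_dot(weights_q, weight_scale, activations)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The weights are quantized once, while the activation scale is recomputed for every input. That per-input step is why this approach needs no calibration dataset.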

To test in Google Cloud, we built custom Docker images from the following Dockerfiles to execute the benchmark scripts as an AI Platform Training job:

We also ran the training job on two GCP instance types for comparison: n1-standard-16 and n2-standard-16.

The major difference between the two instances is that the n2 instance type has access to Intel’s new DL Boost instructions. These instructions can improve the performance of int8 inference workloads.

The instance type n2-standard-16 (16 vCPUs)² has the following machine information:


The above machine information was extracted using ONNX Runtime’s machine information extraction script.

Results

For every benchmarking test, the ONNX Runtime script saves a detailed summary of the model configurations and latency results in a CSV format. Since files do not persist after completion of an AI Platform Training job, we modified our benchmarking scripts to upload the CSV files to a Google Cloud Storage bucket. We gathered the CSV files in a single sheet to compare and share the raw results:
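
Gathering the per-run CSV files can be sketched roughly as follows. This is a minimal illustration using only the standard library. The column names (`engine`, `average_latency_ms`) and file names are hypothetical and may not match the columns the ONNX Runtime script writes. The two latency values are the n1 and n2 figures quoted in the footnotes.

```python
import csv
import io
from typing import Dict, List


def merge_result_csvs(csv_texts: Dict[str, str]) -> List[Dict[str, str]]:
    """Combine rows from several CSV result files, tagging each row with its source."""
    merged = []
    for source_name, text in csv_texts.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["source_file"] = source_name
            merged.append(row)
    return merged


def sort_by_latency(rows: List[Dict[str, str]],
                    column: str = "average_latency_ms") -> List[Dict[str, str]]:
    """Order rows from lowest to highest latency (column name is an assumption)."""
    return sorted(rows, key=lambda row: float(row[column]))


# In-memory stand-ins for files downloaded from a storage bucket.
files = {
    "n1.csv": "engine,average_latency_ms\nonnxruntime,327.63\n",
    "n2.csv": "engine,average_latency_ms\nonnxruntime,94.02\n",
}
rows = sort_by_latency(merge_result_csvs(files))
for row in rows:
    print(row["source_file"], row["engine"], row["average_latency_ms"])
```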


In this section, we summarize the best and worst performing model setups, highlight the top performing setups, and share our model accuracy evaluation results.

Best and worst performing model setups

Below, we analyzed the raw results from the sheets document (available above) and outlined the best and worst performing model setups. We also shared caveats and issues we ran into when benchmarking with different ONNX runtime versions. 

Best performing setup for TensorFlow Models³

For a single-threaded setup, an ONNX-converted and dynamic-quantized TensorFlow model: 

  • Had the lowest latency
  • Was 2.15x faster than a vanilla TensorFlow model on n2 instances
  • Was 3.5x faster on an n2 instance than on an n1 instance

For a multi-threaded setup, an ONNX-converted and dynamic-quantized TensorFlow model with 8 threads and 16 logical cores: 

  • Had the lowest latency
  • Was 2.34x faster than a vanilla TensorFlow model with 8 threads on n2 instances

Best performing single-threaded setup for 8, 16, and 128 tokens

An ONNX-converted and dynamic-quantized PyTorch Model on n2-standard-16 had the lowest latency for the following sequence lengths:


Best performing setup for all sequence lengths on an n1-standard-16 instance

A dynamic-quantized TorchScript model was consistently the best for all sequence lengths. If n2 is not an option, this setup might be a good choice for n1-standard-16. However, n2-standard-16 is only 2.23% more expensive than n1-standard-16 and is 34-68% faster. The performance difference increases for longer sequences.

Best performing setup for multiple threads with sequence length of 128 tokens

An ONNX-converted and dynamic-quantized PyTorch Model with 8 threads on an n2-standard-16 instance had the lowest average latency of 18.5ms.

Worst performing setup

TensorFlow Lite (TFLite) models with dynamic quantization consistently ranked at the bottom, with a worst-case average latency of around 1 second. The outcome was similar to Rasa's TensorFlow Lite models: our TFLite model was 4.9x slower than a vanilla TensorFlow model. This is most likely because TensorFlow Lite is optimized for ARM NEON and runs much slower on x86_64 processors. For more details on this issue, please check out this GitHub issue.

Best performing setup by highest QPS

Below are the top 5 results for each setup, ordered by highest queries-per-second (QPS), for single-threaded and multi-threaded tests on the GCP n2 instances. For single-threaded tests, sequences of 8, 16, and 128 tokens were tested. For multi-threaded tests, only sequences of 128 tokens were tested. In the following tables, we also included results from the vanilla TensorFlow models for comparison.
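
For readers relating latency to QPS, the conversion can be sketched as below. This assumes QPS is derived from the average latency of requests served one after another (batch size divided by latency), which is an assumption about the benchmark script rather than something stated here. It does not model concurrent serving.

```python
def queries_per_second(average_latency_ms: float, batch_size: int = 1) -> float:
    """Throughput of a single sequential stream: batch_size / latency."""
    if average_latency_ms <= 0:
        raise ValueError("latency must be positive")
    return batch_size * 1000.0 / average_latency_ms


# 18.5 ms is the best multi-threaded average latency reported above.
qps = queries_per_second(18.5)
print(f"{qps:.2f} queries per second")
```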

Single thread results for 128 input tokens on n2-standard-16

The following setups produced the lowest latency for sequence length of 128 tokens on an n2-standard-16 instance: 


For more details on the settings for each row: Single Thread Results: 128 tokens on n2

For more details on N1 instance results: Single Thread Results: 128 tokens on n1

Single thread results for 8 input tokens on n2-standard-16

The following setups produced the lowest latency for sequence length of 8 tokens on an n2-standard-16 instance:


For more details on the settings for each row: Single Thread Results: 8 tokens on n2

For more details on N1 instance results: Single Thread Results: 8 tokens on n1

Single thread results for 16 input tokens on n2-standard-16

The following setups produced the lowest latency for sequence length of 16 tokens on an n2-standard-16 instance:


For more details on the settings for each row: Single Thread Results: 16 tokens on n2

For more details on N1 instance results: Single Thread Results: 16 tokens on n1

Multi-threaded results for 128 input tokens on n2-standard-16

For multi-threaded tests, thread counts of 2, 4, and 8 were tested. Thread counts larger than the number of physical cores, 8 in this case, were not tested because of the slowdown we saw when testing locally.

One caveat is that latency can increase for thread count greater than the number of physical cores due to cache thrashing. This can happen when threads are constantly getting swapped between cores. To find the best setup for your use case, we recommend tuning threads and experimenting with thread counts ranging between the number of physical cores and logical cores.
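
A thread-count sweep like the one recommended above could be sketched as follows. This is a hedged outline, not our benchmarking script: `make_runner` is a hypothetical factory that would return a zero-argument callable running one inference with the given thread count, and `placeholder_runner` only stands in for a real model call.

```python
import time
from typing import Callable, Dict, List


def candidate_thread_counts(physical_cores: int, logical_cores: int) -> List[int]:
    """Thread counts from the number of physical cores up to logical cores."""
    return list(range(physical_cores, logical_cores + 1))


def average_latency_ms(run_once: Callable[[], None],
                       warmup: int = 2, repeats: int = 10) -> float:
    """Average wall-clock latency of run_once after a few warmup calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(repeats):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / repeats


def sweep(make_runner: Callable[[int], Callable[[], None]],
          thread_counts: List[int]) -> Dict[int, float]:
    """Measure average latency for each thread count."""
    return {n: average_latency_ms(make_runner(n)) for n in thread_counts}


def placeholder_runner(num_threads: int) -> Callable[[], None]:
    """Stand-in workload; a real runner would configure the engine's threads."""
    def run_once() -> None:
        sum(range(1000))
    return run_once


# 8 physical and 16 logical cores, as on n2-standard-16.
results = sweep(placeholder_runner, candidate_thread_counts(8, 16))
best = min(results, key=results.get)
print(f"lowest average latency at {best} threads")
```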

The following setups produced the lowest latency for sequence length of 128 tokens on an n2-standard-16 instance:


For more details on the settings for each row: Multi-threaded Results: 128 tokens on n2

For more details on N1 instance results: Multi-threaded Results: 128 tokens on n1

Model accuracy evaluation after dynamic quantization

To check for any model degradation after dynamic quantization, we compared the prediction outputs between Twitter’s fine-tuned BERT model and its dynamic-quantized version with the relative cross entropy metric. We confirmed that the model’s prediction RCE decreased by 0.20% from 15.87 to 15.84. This essentially means there was no measurable difference in performance.
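
The relative cross entropy metric is not defined in this post. The sketch below assumes one common definition: the percentage improvement in cross entropy over a naive predictor that always outputs the average positive rate. The labels and predictions are toy values for illustration only. The last helper reproduces the relative change from 15.87 to 15.84, which comes to roughly 0.2%.

```python
import math
from typing import Sequence


def cross_entropy(labels: Sequence[int], predictions: Sequence[float],
                  eps: float = 1e-12) -> float:
    """Mean binary cross entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def relative_cross_entropy(labels: Sequence[int],
                           predictions: Sequence[float]) -> float:
    """Assumed definition: improvement (in %) over a base-rate predictor."""
    base_rate = sum(labels) / len(labels)
    baseline = cross_entropy(labels, [base_rate] * len(labels))
    return (1.0 - cross_entropy(labels, predictions) / baseline) * 100.0


def relative_change_percent(before: float, after: float) -> float:
    """Relative decrease from before to after, in percent."""
    return (before - after) / before * 100.0


labels = [1, 0, 0, 1]
naive_rce = relative_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
better_rce = relative_cross_entropy(labels, [0.9, 0.1, 0.2, 0.8])
rce_drop = relative_change_percent(15.87, 15.84)
print(f"naive={naive_rce:.2f} better={better_rce:.2f} drop={rce_drop:.2f}%")
```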

Conclusion

We executed benchmark tests on Google Cloud Platform to compare BERT CPU inference times on four different inference engines: ONNX Runtime, PyTorch, TorchScript, and TensorFlow. Compared to vanilla TensorFlow, we observed that the dynamic-quantized ONNX model performs:

  • 4x faster⁴ for a single thread on 128 input tokens
  • 3.6x faster⁵ for multiple threads on 128 input tokens
  • 4.5x faster for a single thread on 8 input tokens
  • 5.1x faster for a single thread on 16 input tokens

We also confirmed that dynamic quantization did not degrade the model’s accuracy in our testing setup. As for GCP hardware, if executed on N2 instances (compared to N1 instances), the dynamic-quantized ONNX model can get up to 3.5x faster⁶ and is 3.4x cheaper⁷ for single-threaded inference. We also learned that TensorFlow Lite dynamic quantization slows down CPU inference by about 5x, most likely because TensorFlow Lite is optimized for embedded devices.
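
The speedup and cost figures above can be reproduced from the latencies listed in the footnotes. The sketch below is a simple check. The cost calculation assumes the 2.23% price premium quoted earlier for n2-standard-16 over n1-standard-16 also applies to the instance sizes used for the price comparison, which is an assumption on our part.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """How many times faster the optimized setup is than the baseline."""
    return baseline_ms / optimized_ms


def times_cheaper(speedup_factor: float, price_premium: float) -> float:
    """Cost-per-inference ratio given a speedup and a relative price increase."""
    return speedup_factor / (1.0 + price_premium)


# Latencies (ms) taken from the footnotes for 128 input tokens.
single_thread = speedup(382.49, 94.02)   # vanilla TensorFlow vs quantized ONNX
multi_thread = speedup(146.49, 40.93)    # same comparison with 8 threads
n1_to_n2 = speedup(327.63, 94.02)        # quantized ONNX on N1 vs N2
cheaper = times_cheaper(n1_to_n2, 0.0223)

print(f"{single_thread:.2f}x, {multi_thread:.2f}x, {n1_to_n2:.2f}x faster")
print(f"{cheaper:.2f}x cheaper per inference")
```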

There were several takeaways from this investigation. First, AI Platform Training jobs are great for benchmarking models. For example, you can run custom Docker images on custom hardware setups. It’s a quicker and simpler way to test than setting up a job in Google Kubernetes Engine or on VMs, which might require additional DevOps knowledge.

Moreover, we learned that you must explicitly set the environment variable OMP_NUM_THREADS to 1 for PyTorch and TensorFlow frameworks to ensure inferences are executed on a single thread.

When testing multiple deep learning frameworks together, you should also ensure ONNX Runtime isn’t using OpenMP for multi-threaded execution by unsetting the KMP environment variables.
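
The two environment tweaks above can be sketched as a small helper. It is a minimal illustration: it treats every variable whose name starts with `KMP_` as KMP-related, which is an assumption, and it would need to run before the frameworks are imported so that their threading runtimes see the values.

```python
import os
from typing import List, MutableMapping


def configure_single_thread_env(
        environ: MutableMapping[str, str] = os.environ) -> List[str]:
    """Force OpenMP to one thread and drop KMP_* settings; returns removed names."""
    environ["OMP_NUM_THREADS"] = "1"
    removed = [name for name in list(environ) if name.startswith("KMP_")]
    for name in removed:
        del environ[name]
    return removed


# Demonstrated on a plain dict so the real process environment is untouched.
demo_env = {
    "KMP_AFFINITY": "granularity=fine,compact,1,0",
    "KMP_BLOCKTIME": "1",
    "PATH": "/usr/bin",
}
removed_names = configure_single_thread_env(demo_env)
print(sorted(removed_names), demo_env)
```

In a benchmark script, calling `configure_single_thread_env()` with no arguments at the very top would apply the same changes to the real environment.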

Finally, Python's os module offers useful utility functions for setting and getting CPU affinity. This can be used as an additional measure to confine inference to a single CPU. However, these functions are only available on some Unix platforms, such as Linux.
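
A minimal sketch of pinning the current process to one CPU is shown below. It uses `os.sched_getaffinity` and `os.sched_setaffinity`, falls back to doing nothing where they are missing, and restores the original affinity afterwards. The context manager name is hypothetical. Note that affinity limits which CPUs the process may run on and does not by itself limit how many threads a framework creates.

```python
import os
from contextlib import contextmanager


@contextmanager
def pinned_to_one_cpu():
    """Temporarily restrict this process to a single CPU where supported."""
    if not hasattr(os, "sched_setaffinity"):
        # Platform without affinity support: run unpinned.
        yield None
        return
    original = os.sched_getaffinity(0)   # 0 means the current process
    cpu = min(original)                  # pick one CPU we are allowed to use
    os.sched_setaffinity(0, {cpu})
    try:
        yield cpu
    finally:
        os.sched_setaffinity(0, original)


with pinned_to_one_cpu() as cpu:
    # A real benchmark would run inference here.
    allowed = os.sched_getaffinity(0) if cpu is not None else None
    print(f"pinned to CPU {cpu}, allowed set: {allowed}")
```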

Future work

ONNX Runtime published a new version (v1.9.1) on October 4th, 2021 with more performance optimizations. As of this writing, the most recent multi-threaded tests use v1.8.1. However, an older package was used for the single-threaded tests (v1.7.0). There might be further performance gains with the latest package, so it may be worthwhile to run the single-threaded tests with the latest package.

Thank you Yury Malkov for his comments and guidance through the inference optimization space research. Thank you Ying Xiao for his comments and review of this blog post.

1 Fp16 quantization was only tested locally because of its slow outcome.

2 A vCPU is implemented as a single hardware Hyper-thread on one of the available CPU platforms.

3 The ONNX-converted TensorFlow models were ~330x slower than vanilla TensorFlow when using the onnxruntime-openmp==1.7.0 package. However, this was not the case for ONNX-converted PyTorch models, so we created a GitHub issue with the ONNX Runtime library.

The ONNX Runtime maintainers found several potential causes for the slow down and shared two suggestions:

  1. Upgrade to the latest version, because the latest package (v1.8) had switched from OpenMP (OMP) to thread pools for optimal performance.
  2. Unset the KMP-related environment variables, because they were set by default in the base Docker image.

After following the suggestions, we confirmed faster inference time for ONNX-converted TensorFlow models compared to vanilla TensorFlow. One more important note is that vanilla TensorFlow uses OMP internally. For more details on how TensorFlow uses OMP, please check out Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads. While debugging, we also surfaced a 1.17x performance slowdown with ONNX-converted PyTorch models in version 1.8 relative to onnxruntime-openmp==1.7.0. The maintainers created a separate issue to look into this.

4 N1 TensorFlow (382.49ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow model (94.02ms) for 128 input tokens.

5 N1 TensorFlow (146.49ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow (40.93ms) with 8 threads for 128 input tokens.

6 N1 ONNX-converted and dynamic-quantized TensorFlow (327.63ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow (94.02ms) for 128 input tokens.

7 Single inference price of ONNX-converted and dynamic-quantized TensorFlow model on n2-standard-2 vs n1-standard-2. We referred to Google’s VM price charts for the instances’ pricing.

Mariko Wakabayashi

@mwkby

Sr. ML Engineer