The Twitter Cortex team has been focused on improving content understanding to extract new topics, promote healthy conversations, and help users discover more relevant Tweets and accounts to follow.
Transformer-based models like BERT are among the most effective natural language processing (NLP) techniques for understanding words and phrases in different contexts. BERT can distinguish between different semantic meanings of homonyms by generating more nuanced and context-dependent embeddings.
Machine Learning (ML) practitioners at Twitter have seen significant performance gains on NLP tasks such as content moderation and topic discovery by incorporating transformer-based embeddings. However, this performance gain comes with latency and throughput tradeoffs due to transformers' computationally intensive nature. For example, we have a fine-tuned BERT model that processes Tweets asynchronously for downstream tasks to consume. This can take around 500ms to process a single Tweet (of at most 128 tokens) on a CPU-based machine. The processing time can be greatly reduced to 20ms by running the model on a GPU instance, but that can get very costly as the model inference demand continues to scale.
To power Twitter features like recommendations with transformer-based embeddings, we wanted to investigate techniques that can help improve throughput and minimize computational costs without compromising the model architecture. As a result, we investigated several different inference engines, dynamic quantization techniques, and hardware on Google Cloud. We gathered our findings to help the rest of the ML community improve transformers’ speed and reduce their computational demand in production.
The following companies have shared optimization techniques and findings to improve latency for BERT CPU inference:
In the findings above, some benchmarking details that can affect inference speed, such as sequence length, were either omitted or uncontrolled. Moreover, a year or more has passed since most of these companies presented their benchmarks, so newer releases of the tested inference engines may be faster overall.
To understand the current state (7/2021) of optimization techniques and how they can be applied to transformer models at Twitter, we wanted to benchmark and find the best configuration for a transformer-based model like BERT in Google Cloud Platform. We used the following fixed parameters:
We chose 128 tokens for the sequence length because this length was used for a fine-tuned BERT model at Twitter. However, in most cases, a Tweet's length is much lower, so we also performed additional single-threaded tests for sequence lengths of 8 and 16 tokens.
Since most production models at Twitter use TensorFlow, our benchmark focuses on comparison between vanilla TensorFlow and optimized models.
ONNX Runtime has a benchmarking script to help measure the performance of ONNX Runtime, PyTorch, and TorchScript on pretrained transformer models. We adapted their script to test and dynamically quantize the pretrained BERT Base Uncased English model on four inference engines: ONNX Runtime, PyTorch, TorchScript, and TensorFlow. An option to dynamically quantize a TensorFlow model wasn’t available, so we updated the script to convert the TensorFlow models into TFLite and added options to apply int8 or fp16 quantization (see footnote 1).
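As a rough illustration of the dynamic quantization step for an ONNX model, the sketch below calls ONNX Runtime's `quantize_dynamic` helper to write an int8 copy of an already exported model. It assumes the `onnxruntime` package is installed. The file naming scheme and both function names are hypothetical and are not taken from our benchmark script.

```python
from pathlib import Path


def quantized_path(model_path: str) -> str:
    """Return a sibling path, e.g. 'bert.int8.onnx' for 'bert.onnx'.

    The '.int8' naming scheme is an assumption made for this example.
    """
    p = Path(model_path)
    return str(p.with_name(f"{p.stem}.int8{p.suffix}"))


def quantize_onnx_model(model_path: str) -> str:
    """Dynamically quantize an ONNX model's weights to int8 and return the output path."""
    # Imported lazily so the path helper above works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    output_path = quantized_path(model_path)
    quantize_dynamic(model_path, output_path, weight_type=QuantType.QInt8)
    return output_path
```

Calling `quantize_onnx_model("bert.onnx")` would write the quantized model next to the original; the resulting file can then be loaded with an ONNX Runtime inference session like any other ONNX model.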
To test in Google Cloud, we built custom Docker images from the following Dockerfiles to execute the benchmark scripts as an AI Platform Training job:
We also ran the training job on the following two GCP instance types for comparison:
The major difference between the two instances is that the n2 instance type has access to Intel’s new DL Boost instructions, which can improve the performance of int8 inference workloads.
The instance type n2-standard-16 (16 vCPUs; see footnote 2) has the following machine information:
The above machine information was extracted using ONNX Runtime’s machine information extraction script.
For every benchmarking test, the ONNX Runtime script saves a detailed summary of the model configurations and latency results in a CSV format. Since files do not persist after completion of an AI Platform Training job, we modified our benchmarking scripts to upload the CSV files to a Google Cloud Storage bucket. We gathered the CSV files in a single sheet to compare and share the raw results:
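Gathering the per-run CSV files into a single sheet can be sketched with the standard library as follows. This is a minimal example that assumes the files have already been downloaded from the bucket into a local directory; the `source_file` column and the function name are our own additions for illustration, not part of the ONNX Runtime script's output.

```python
import csv
from pathlib import Path


def merge_csv_files(input_dir: str, output_file: str) -> int:
    """Concatenate every CSV file in input_dir into output_file.

    Returns the number of data rows written. Write output_file outside
    input_dir so that it is not picked up on a later run.
    """
    rows = []
    fieldnames = []
    for path in sorted(Path(input_dir).glob("*.csv")):
        with path.open(newline="") as f:
            reader = csv.DictReader(f)
            # Collect the union of column names, preserving first-seen order.
            for name in reader.fieldnames or []:
                if name not in fieldnames:
                    fieldnames.append(name)
            for row in reader:
                row["source_file"] = path.name  # remember where each row came from
                rows.append(row)
    if "source_file" not in fieldnames:
        fieldnames.append("source_file")
    with open(output_file, "w", newline="") as f:
        # restval fills columns that a given input file did not have.
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```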
In this section, we summarized the best and worst performing model setups, highlighted the top performing setups, and shared our model accuracy evaluation results.
Below, we analyzed the raw results from the sheets document (available above) and outlined the best and worst performing model setups. We also shared caveats and issues we ran into when benchmarking with different ONNX Runtime versions.
For a single-threaded setup, an ONNX-converted and dynamic-quantized TensorFlow model:
For a multi-threaded setup, an ONNX-converted and dynamic-quantized TensorFlow model with 8 threads and 16 logical cores:
An ONNX-converted and dynamic-quantized PyTorch Model on n2-standard-16 had the lowest latency for the following sequence lengths:
On n1-standard-16, a dynamic-quantized TorchScript model was consistently the best for all sequence lengths. If n2 is not an option, this setup might be a good choice for n1-standard-16. However, n2-standard-16 is only 2.23% more expensive than n1-standard-16 and is 34-68% faster, and the performance difference increases for longer sequences.
An ONNX-converted and dynamic-quantized PyTorch Model with 8 threads on a n2-standard-16 instance had the lowest average latency of 18.5ms.
TensorFlow Lite (TFLite) models with dynamic quantization consistently ranked at the bottom, with a worst-case average latency of around 1 second. These outcomes were similar to those of Rasa's TensorFlow Lite models. The dynamic-quantized TFLite model was 4.9x slower than a vanilla TensorFlow model. This is most likely because TensorFlow Lite is optimized for ARM NEON and runs much slower on x86_64 processors. For more details on this issue, please check out this GitHub issue.
Below are the top 5 results for each setup, ordered by highest queries per second (QPS), for single-threaded and multi-threaded tests on the GCP n2 instances. For single-threaded tests, sequences of 8, 16, and 128 tokens were tested. For multi-threaded tests, only 128-token sequences were tested. In the following tables, we also included results from the vanilla TensorFlow models for comparison.
The following setups produced the lowest latency for sequence length of 128 tokens on an n2-standard-16 instance:
For more details on the settings for each row: Single Thread Results: 128 tokens on n2
For more details on N1 instance results: Single Thread Results: 128 tokens on n1
The following setups produced the lowest latency for sequence length of 8 tokens on an n2-standard-16 instance:
For more details on the settings for each row: Single Thread Results: 8 tokens on n2
For more details on N1 instance results: Single Thread Results: 8 tokens on n1
The following setups produced the lowest latency for sequence length of 16 tokens on an n2-standard-16 instance:
For more details on the settings for each row: Single Thread Results: 16 tokens on n2
For more details on N1 instance results: Single Thread Results: 16 tokens on n1
For multi-threaded tests, thread counts of 2, 4, and 8 were tested. Thread counts larger than the number of physical cores (8 in this case) were not tested because of the slow results we saw when testing locally.
One caveat is that latency can increase for thread count greater than the number of physical cores due to cache thrashing. This can happen when threads are constantly getting swapped between cores. To find the best setup for your use case, we recommend tuning threads and experimenting with thread counts ranging between the number of physical cores and logical cores.
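One way to run such an experiment is sketched below. It assumes a hypothetical `run_inference(num_threads)` callable that configures your inference engine with the given thread count and runs a single inference; that callable, the function names, and the default repeat count are placeholders rather than part of our benchmark script.

```python
import time
from typing import Callable, Dict


def time_thread_counts(
    run_inference: Callable[[int], None],
    physical_cores: int,
    logical_cores: int,
    repeats: int = 10,
) -> Dict[int, float]:
    """Return the average latency in milliseconds for each thread count
    from physical_cores up to and including logical_cores."""
    results = {}
    for num_threads in range(physical_cores, logical_cores + 1):
        start = time.perf_counter()
        for _ in range(repeats):
            run_inference(num_threads)
        elapsed = time.perf_counter() - start
        results[num_threads] = elapsed / repeats * 1000.0
    return results


def best_thread_count(latencies_ms: Dict[int, float]) -> int:
    """Return the thread count with the lowest average latency."""
    return min(latencies_ms, key=latencies_ms.get)
```

For the n2-standard-16 instance, this would be called with `physical_cores=8` and `logical_cores=16`. In practice you would likely also add warm-up runs before timing.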
The following setups produced the lowest latency for sequence length of 128 tokens on an n2-standard-16 instance:
For more details on the settings for each row: Multi-threaded Results: 128 tokens on n2
For more details on N1 instance results: Multi-threaded Results: 128 tokens on n1
To check for any model degradation after dynamic quantization, we compared the prediction outputs between Twitter’s fine-tuned BERT model and its dynamic-quantized version using the relative cross entropy (RCE) metric. The model’s prediction RCE decreased by 0.20%, from 15.87 to 15.84, which essentially means there was no measurable difference in performance.
We executed benchmark tests on Google Cloud Platform to compare BERT CPU inference times on four different inference engines: ONNX Runtime, PyTorch, TorchScript, and TensorFlow. Compared to vanilla TensorFlow, we observed that the dynamic-quantized ONNX model performs:
We also confirmed that dynamic quantization did not degrade the model’s accuracy in our testing setup. As for GCP hardware, if executed on N2 instances (compared to N1 instances), the dynamic-quantized ONNX model can get up to 3.5x faster (see footnote 6) and is 3.4x cheaper (see footnote 7) for single-threaded inference. We also learned that TensorFlow dynamic quantization slows down inference by 5x due to its optimizations for embedded devices.
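The speedup figures can be rechecked from the latencies listed in footnotes 4, 5, and 6. The snippet below is a small worked check of that arithmetic; the function name and the dictionary labels are ours, chosen for illustration.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Return how many times faster the optimized setup is than the baseline."""
    return baseline_ms / optimized_ms


# Latencies in milliseconds for 128 input tokens, copied from footnotes 4, 5, and 6.
comparisons = {
    "footnote 4, single thread, N1 TensorFlow vs N2 quantized ONNX": (382.49, 94.02),
    "footnote 5, 8 threads, N1 TensorFlow vs N2 quantized ONNX": (146.49, 40.93),
    "footnote 6, single thread, N1 quantized ONNX vs N2 quantized ONNX": (327.63, 94.02),
}

for label, (baseline_ms, optimized_ms) in comparisons.items():
    print(f"{label}: {speedup(baseline_ms, optimized_ms):.2f}x")
```

The footnote 6 pair works out to roughly 3.5x, which is the N1-versus-N2 figure quoted for single-threaded inference.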
There were several takeaways from this investigation. First, AI Platform Training jobs are great for benchmarking models. For example, you can run custom docker images with custom hardware set-up. It’s a quicker and simpler way to test than setting up a training job in Google Kubernetes Engine or in VMs, which might require additional DevOps knowledge.
Moreover, we learned that you must explicitly set the environment variable OMP_NUM_THREADS to 1 for PyTorch and TensorFlow frameworks to ensure inferences are executed on a single thread.
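A minimal sketch of this setting is shown below. The function name is hypothetical. The important detail, as far as we understand it, is that the variable should be set before the framework is imported, because the threading runtime typically reads it at startup.

```python
import os


def force_single_thread_env() -> None:
    """Limit OpenMP to a single thread for frameworks imported afterwards."""
    os.environ["OMP_NUM_THREADS"] = "1"


force_single_thread_env()

# Import PyTorch or TensorFlow only after this point, for example:
#   import torch
#   torch.set_num_threads(1)  # optional, additional framework-level limit
```

Setting the variable in the shell or in the Dockerfile before launching the benchmark achieves the same effect without touching the script.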
When testing multiple deep learning frameworks together, you should also ensure ONNX Runtime isn’t using OpenMP for multi-threaded execution by unsetting the KMP environment variables.
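The unsetting step might look like the sketch below. It assumes that "the KMP environment variables" means every variable whose name starts with `KMP_` (such as `KMP_AFFINITY` or `KMP_BLOCKTIME`); the function name is hypothetical.

```python
import os
from typing import List


def unset_kmp_variables() -> List[str]:
    """Remove every environment variable whose name starts with 'KMP_'.

    Returns the sorted names of the removed variables.
    """
    # Build the list first so the environment is not modified while iterating.
    removed = [name for name in os.environ if name.startswith("KMP_")]
    for name in removed:
        del os.environ[name]
    return sorted(removed)
```

As with `OMP_NUM_THREADS`, this is best done before the inference libraries are imported.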
Finally, Python's os module offers useful utility functions for setting and getting CPU affinity. These can be used as an additional measure to ensure only a single thread is used for inference. However, at the time of our tests, these functions were only available on Linux.
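For illustration, the sketch below pins the current process to one CPU with `os.sched_setaffinity` and reads the result back with `os.sched_getaffinity`. The function name is hypothetical, and choosing the lowest-numbered allowed CPU is an arbitrary choice made for this example.

```python
import os
from typing import Set


def pin_to_first_allowed_cpu() -> Set[int]:
    """Pin the current process to the lowest-numbered CPU it may run on.

    Returns the resulting affinity set. Raises OSError on platforms that
    do not provide the scheduler affinity functions.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity functions are not available on this platform")
    allowed = os.sched_getaffinity(0)  # 0 refers to the current process
    os.sched_setaffinity(0, {min(allowed)})
    return set(os.sched_getaffinity(0))
```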
ONNX Runtime published a new version (v1.9.1) on October 4th, 2021 with more performance optimizations. As of this writing, the most recent multi-threaded tests use v1.8.1. However, an older package was used for the single-threaded tests (v1.7.0). There might be further performance gains with the latest package, so it may be worthwhile to run the single-threaded tests with the latest package.
Thank you to Yury Malkov for his comments and guidance on our research into the inference optimization space, and to Ying Xiao for his comments and review of this blog post.
1 Fp16 quantization was only tested locally because of its slow outcome.
2 A vCPU is implemented as a single hardware Hyper-thread on one of the available CPU platforms.
3 The ONNX-converted TensorFlow models were ~330x slower than vanilla TensorFlow when using the onnxruntime-openmp==1.7.0 package. However, this was not the case for ONNX-converted PyTorch models, so we created a GitHub issue with the ONNX Runtime library.
The ONNX Runtime maintainers found several potential causes for the slow down and shared two suggestions:
After following the suggestions, we confirmed faster inference time for ONNX-converted TensorFlow models compared to vanilla TensorFlow. One more important note is vanilla TensorFlow uses OMP internally. For more details on how TensorFlow uses OMP, please check out Maximize TensorFlow* Performance on CPU: Considerations and Recommendations for Inference Workloads. While debugging, we also surfaced a 1.17x performance slowdown with ONNX-converted PyTorch models in version 1.8 relative to onnxruntime-openmp==1.7.0. The maintainers created a separate issue to look into this.
4 N1 TensorFlow (382.49ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow model (94.02ms) for 128 input tokens.
5 N1 TensorFlow (146.49ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow (40.93ms) with 8 threads for 128 input tokens.
6 N1 ONNX-converted and dynamic-quantized TensorFlow (327.63ms) vs N2 ONNX-converted and dynamic-quantized TensorFlow (94.02ms) for 128 input tokens.
7 Single inference price of ONNX-converted and dynamic-quantized TensorFlow model on n2-standard-2 vs n1-standard-2. We referred to Google’s VM price charts for the instances’ pricing.