Optimization Engine to Enable Edge Deployment of Deep Learning Models

JUNE 2024 | Volume 45, Issue 2

William Cai

HPC Internship Program, DoD High-Performance Computing Modernization Program
Stanford University

 

Abstract

This study assessed the performance impact of Apache TVM optimizations on MobilenetV2 across hardware platforms available on Google Colab: an Intel Xeon CPU, an NVIDIA T4 GPU, an NVIDIA A100 GPU, and a TPU. The analysis benchmarked MobilenetV2 under distinct optimization configurations: TVM-AutoScheduler and TVM-AutoTune. The models, executed using the ONNX Runtime inference engine and the TVM framework, were evaluated on the Xeon CPU, NVIDIA T4 GPU, and NVIDIA A100 GPU in ONNX data format and TVM-compiled format, while the model running on the TPU used the PyTorch format. The study found that the MobilenetV2 model with TVM-AutoScheduler optimization, running on the NVIDIA A100 GPU, demonstrated the best and most reliable performance of all configurations, as indicated by its lowest mean execution time and smallest standard deviation of execution time. However, MobilenetV2 with TVM-AutoTune and TVM-AutoScheduler optimizations underperformed on the NVIDIA T4 and Intel Xeon CPU compared to the non-optimized MobilenetV2. Finally, while TVM-AutoScheduler optimization outperformed TVM's default AutoTune in terms of reliability and performance, it did so at the expense of additional optimization time.

Keywords: Machine learning, optimization, auto-tuning, GPU, TPU

Introduction

Machine learning (ML) models have become fundamental tools in various fields such as healthcare, financial services, and autonomous driving, offering solutions with an unprecedented degree of accuracy and efficiency. As these models grow increasingly complex, there is a rising demand for advanced hardware optimizations to ensure superior performance and cost-effectiveness [1]. However, optimizing ML models for various hardware comes with distinct challenges. Many existing frameworks focus on specialized libraries, optimizing mainly for a restricted range of server-class GPUs [2]. Moreover, deploying AI/ML models on mobile devices often demands substantial manual effort due to resource constraints [3].

In response to these challenges, Apache TVM (https://tvm.apache.org/), an open-source machine learning compiler stack, has emerged as a potent solution. TVM is specifically designed to tackle optimization challenges inherent to deep learning models and aims to expedite the inference speed of trained ML models [4]. TVM accepts deep learning models from different frameworks such as TensorFlow, PyTorch, Keras, and ONNX. It then employs advanced compiler techniques that leverage specific hardware features and abstractions to maximize a model's performance, including graph-level optimization for computational efficiency and operator-level optimization to maximize hardware utilization [5]. Furthermore, TVM provides a large search space for low-level optimization through AutoTune and AutoScheduler, which allows it to generate highly efficient code across various hardware platforms. The main difference between the two is that AutoScheduler does not require manually written schedule templates to define the optimization search space [6].
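To make this flow concrete, the short sketch below shows one common way to import an ONNX model into TVM's Relay intermediate representation and compile it for a target device. It is a minimal illustration rather than the exact code used in this study; the file name mobilenetv2.onnx, the input tensor name "input", and the choice of a CUDA target are assumptions.

import onnx
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Load the ONNX model (file name is an assumption for illustration).
onnx_model = onnx.load("mobilenetv2.onnx")

# Map the model's input name to its shape; MobilenetV2 expects 1x3x224x224.
shape_dict = {"input": (1, 3, 224, 224)}
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

# Pick a target: "cuda" for an NVIDIA GPU, "llvm" for a CPU.
target = "cuda"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

# Run one inference with the compiled module.
dev = tvm.device(target, 0)
module = graph_executor.GraphModule(lib["default"](dev))
x = np.random.rand(1, 3, 224, 224).astype("float32")
module.set_input("input", x)
module.run()
out = module.get_output(0).numpy()

The same compiled-module interface is used whether the target is a GPU ("cuda") or a CPU ("llvm"); only the target string and device handle change.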

Figure 1: Illustration of the procedures and various levels of optimization mechanisms of TVM, from the original TVM paper [4].

In this project, we investigate the performance implications of TVM optimizations on MobilenetV2 (MV2), a convolutional neural network renowned for its balance between accuracy and computational resource demand [7]. The performance will be benchmarked across multiple hardware platforms hosted on Google Colab, including Intel Xeon CPU, NVIDIA T4 GPU, NVIDIA A100 GPU, and TPU. Given the capabilities of TVM, we hypothesize that TVM-optimized MobilenetV2 will outperform the regular MobilenetV2 across all hardware platforms, delivering lower and more consistent execution times. By testing this hypothesis, we hope to provide valuable insights into effectively deploying deep learning models on mobile devices, thus contributing to the practical implementation of these models in various fields.

Methods

We evaluated the performance of MobilenetV2 in various framework formats across multiple hardware platforms on Google Colab. We defined better performance for an optimization-hardware configuration as a lower mean execution time and a smaller standard deviation of execution time.

Figure 2: Illustration of the workflow of the project.

We first evaluated the performance of MobilenetV2 in ONNX data exchange format, executed with the ONNX Runtime inference engine, on the Intel Xeon CPU, NVIDIA T4 GPU, and NVIDIA A100 GPU. We ran 320 trials on each hardware platform.
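The benchmarking loop for the ONNX Runtime baseline follows the usual pattern of repeated, individually timed inference calls. The sketch below is an assumed, simplified version of that procedure; the file name, input shape, and execution-provider list are illustrative rather than taken from the study's code.

import time
import numpy as np
import onnxruntime as ort

# Create an inference session; ONNX Runtime falls back to the CPU provider
# if the CUDA provider is unavailable on the current Colab runtime.
session = ort.InferenceSession(
    "mobilenetv2.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Time 320 inference calls and report the mean and standard deviation in ms.
times_ms = []
for _ in range(320):
    start = time.perf_counter()
    session.run(None, {input_name: x})
    times_ms.append((time.perf_counter() - start) * 1000.0)

print(f"mean = {np.mean(times_ms):.3f} ms, std = {np.std(times_ms):.3f} ms")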

We then converted MobilenetV2 from the ONNX data exchange format into a representation that the TVM framework can compile. We then evaluated the performance of the TVM-compiled MobilenetV2, MobilenetV2 with TVM-AutoTune optimization, and MobilenetV2 with TVM-AutoScheduler optimization on the Intel Xeon CPU, NVIDIA T4 GPU, and NVIDIA A100 GPU, running 320 trials for each optimization-hardware pair. Finally, we evaluated the performance of MobilenetV2 in PyTorch format on the Google Colab TPU, also with 320 trials.
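As an illustration of the TVM-AutoScheduler step, the sketch below shows one assumed way to extract tuning tasks from the converted model, search for schedules, and rebuild the model with the best results. The trial count, log file name, and input name are illustrative; the study's actual tuning scripts may differ.

import onnx
import tvm
from tvm import relay, auto_scheduler

# Re-import the ONNX model into Relay (same illustrative assumptions as before).
onnx_model = onnx.load("mobilenetv2.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, {"input": (1, 3, 224, 224)})
target = "cuda"
log_file = "mv2_autoscheduler.json"

# Extract tunable tasks from the network and search for good schedules.
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=320,  # total measurement trials across all tasks
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
tuner.tune(tune_option)

# Rebuild the model using the best schedules found during tuning.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)

Once rebuilt, the tuned library is executed and timed with TVM's graph executor in the same way as the untuned TVM-compiled model.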

Results and Discussion

The performance of the model varied dramatically depending on its optimization type and hardware configuration.

Figure 3: Cross-comparisons of performance among all optimization-hardware configurations. The MobilenetV2 model with TVM-AutoScheduler optimization, running on the NVIDIA A100 GPU, has the lowest mean execution time and the smallest standard deviation of execution time.

All configurations that ran on the NVIDIA GPUs (A100 and T4) had mean execution times below 10 ms, considerably lower than those on the CPU (Intel Xeon) and the TPU, as shown in Figure 3. This demonstrates that the hardware a model runs on still plays a major role in its performance, as reflected in the mean execution time.

Figure 4: A comparison of performance among different optimization configurations running on the NVIDIA A100 GPU.

As shown in Figure 4, MV2 models running on the NVIDIA A100 GPU with TVM-AutoScheduler optimization or plain TVM compilation performed better, on average, than the ONNX-formatted MV2 and the TVM-AutoTune-optimized MV2. Interestingly, MV2 running on the A100 with TVM-AutoTune optimization had the worst performance. A potential reason the TVM-AutoTune-optimized MV2 performed worse than the ONNX-formatted MV2 is that TVM-AutoTune was limited to searching only a portion of the parameter space, and that portion yielded configurations inferior to those found by the other setups.

Figure 5: A comparison of performance among different optimization configurations running on the NVIDIA T4 GPU.

On the other hand, when running on the NVIDIA T4 GPU, the ONNX-formatted MV2 and the TVM-compiled MV2 performed better than those with TVM-AutoScheduler or TVM-AutoTune optimization, as shown in Figure 5. MV2 running on the T4 with TVM-AutoTune optimization again had the worst performance, likely for the same reason as on the A100. Interestingly, MV2 running on the T4 with TVM-AutoScheduler optimization performed about the same as, if not worse than, its ONNX-formatted counterpart. This was likely due to the low number of tuning trials, which limited the ability of TVM-AutoScheduler to explore more of the parameter space for the MobilenetV2-T4 configuration.

Figure 6: A comparison of performance among different optimization configurations running on the Intel Xeon CPU.

Interestingly, when running on the Intel Xeon CPU, the ONNX-formatted MV2 achieved the best performance and the TVM-compiled MV2 had the worst performance. One potential explanation for the underwhelming performance of the TVM-optimized and TVM-compiled MV2 is that finding good model parameters for MV2 on the Xeon CPU requires significantly more than 320 tuning trials.

Figure 7: Cross-comparisons of performance between TVM-AutoScheduler and TVM-AutoTune across different hardware configurations.

We also conducted a comparative analysis of the TVM-AutoScheduler optimization against the TVM-AutoTune optimization, specifically assessing their impact on the MV2’s performance across varied hardware platforms. Our results indicated that, for both NVIDIA A100 and T4, the TVM-AutoScheduler optimization yielded a reduced mean execution time and a more consistent, narrower standard deviation of execution times. Conversely, these benefits were not observed when MV2 operated on the Xeon CPU.

Figure 8: A comparison between the mean time required to perform the TVM-AutoScheduler optimization and the TVM-AutoTune optimization.

Given the same number of trials, the TVM-AutoScheduler optimization required a longer average time to optimize the MV2 on a given hardware platform. This extended duration can be attributed to the TVM-AutoScheduler’s comprehensive search through a larger parameter space for optimization, delving deeper to identify the most suitable configurations for optimal performance.
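For comparison with the AutoScheduler sketch in the Methods section, the following is an assumed minimal TVM-AutoTune (AutoTVM) flow with the same trial budget. AutoTVM extracts per-operator tasks from hand-written schedule templates, which constrains its search space relative to AutoScheduler; the file names, runner settings, and tuner choice here are illustrative.

import onnx
import tvm
from tvm import relay, autotvm

# Re-import the ONNX model into Relay (same illustrative assumptions as before).
onnx_model = onnx.load("mobilenetv2.onnx")
mod, params = relay.frontend.from_onnx(onnx_model, {"input": (1, 3, 224, 224)})
target = "cuda"
log_file = "mv2_autotvm.json"

# Extract per-operator tuning tasks based on TVM's built-in schedule templates.
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1),
)

# Tune each task with the same trial budget used in this study.
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=320,
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file(log_file)],
    )

# Rebuild the model using the best measured configurations.
with autotvm.apply_history_best(log_file):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)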

Conclusion

This study thoroughly evaluated the performance implications of TVM optimization on the MobilenetV2 model across various hardware platforms. Our findings confirm that TVM optimization significantly enhances the performance of MobilenetV2 on the NVIDIA A100 GPU compared to the unoptimized model. This demonstrates the potential of TVM's advanced compiler techniques and hardware-specific optimizations in harnessing the power of high-performance GPUs, leading to faster and more efficient model execution. Conversely, the TVM-AutoTune and TVM-AutoScheduler optimizations exhibited a negative impact on the performance of MobilenetV2 on the NVIDIA T4 and Intel Xeon CPU compared to the standard model. This suggests that while TVM has immense potential for improving deep learning performance, care must be taken when applying optimizations across diverse hardware platforms. The nature of the optimizations and the specific features of the hardware both significantly influence the overall performance.

A comparison between TVM's AutoScheduler and AutoTune methods revealed that while AutoScheduler generally yields superior results, it comes at the cost of increased optimization time. This trade-off presents a significant consideration for the practical deployment of ML models, where both execution efficiency and time-to-market are essential factors. In conclusion, while TVM represents a promising solution to the challenges of hardware optimization for machine learning models, its practical application requires careful selection and tuning of optimization methods suited to the specific hardware in use. This study serves as a valuable guide for those looking to leverage the power of TVM to improve machine learning model performance, highlighting the need for targeted strategies and a nuanced understanding of both the software and hardware aspects of model deployment.

Further research is required to expand the range of tested models and hardware, enabling a more comprehensive understanding of TVM’s potential and limitations.

The Impact of Summer Research

By participating in the HIP internship and working on this project, I further developed my high-performance computing skills and enriched my experience with deep learning models.

Acknowledgements

This research was sponsored by the High-Performance Computing Modernization Program (HPCMP), the HPC Internship Program (HIP), and the Oak Ridge Institute for Science & Education (ORISE). I would also like to express my gratitude towards Dr. Manuel Vindiola for his guidance on this research.

References

1. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436-444.

2. Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.

3. Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., & Jiao, L. (2016, October). DeepX: A software accelerator for low-power deep learning inference on mobile devices. In 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) (pp. 1-12). IEEE.

4. Chen, T., Moreau, T., Jiang, Z., Zheng, L., Yan, E., Shen, H., … & Chen, Z. (2018). TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18) (pp. 578-594).

5. Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., & Amarasinghe, S. (2013). Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In ACM SIGPLAN Notices (Vol. 48, No. 6, pp. 519-530). ACM.

6. Zheng, L., Jia, C., Sun, M., Wu, Z., & Yu, C. H. (2021, March 3). Introducing TVM Auto-scheduler (a.k.a. Ansor). Apache TVM. Retrieved August 18, 2023, from https://tvm.apache.org/2021/03/03/intro-auto-scheduler

7. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. C. (2018). MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4510-4520).

Author Biographies

William Cai obtained Engineering Physics B.S.H., Mathematics B.S., and Computer Science M.S. degrees from Stanford University. Prior to this internship, he conducted research applying physics-based machine-learning models to improve problem-solving efficiency in materials science and neuroscience. He began a Materials Science and Engineering Ph.D. at Stanford in September 2023.
