Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster
Document Type
Conference Proceeding
Publication Date
12-19-2022
Department
Department of Computer Science
Abstract
With the proliferation of deep learning, there is a strong need to operate GPU clusters efficiently, both for deep learning production in large AI companies and for research and development (R&D) in small research institutes and universities. Existing work has performed thorough trace analyses of large-scale production-level clusters in large companies, revealing the characteristics of deep learning production jobs and motivating the design of scheduling frameworks. However, R&D clusters differ significantly from production-level clusters in both job properties and user behaviors, calling for a different scheduling mechanism. In this paper, we present a detailed workload characterization of an R&D cluster, CloudBrain-I, at a research institute, Peng Cheng Laboratory. After analyzing the fine-grained resource utilization, we identify a severe problem, resource underutilization, which is especially important in R&D clusters yet has not been characterized by existing work. We further investigate two specific underutilization phenomena and draw several implications and lessons for R&D cluster scheduling. The traces will be open-sourced to motivate further studies in the community.
Publication Title
2022 IEEE 40th International Conference on Computer Design (ICCD)
Recommended Citation
Yang, Z., Ye, Z., Fu, T., Luo, J., Wei, X., Luo, Y., Wang, Z., et al. (2022). Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster. 2022 IEEE 40th International Conference on Computer Design (ICCD). http://doi.org/10.1109/ICCD56317.2022.00103
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p/17306