
Responsibilities
About The Team
The mission of our AML team is to advance the next-generation AI infrastructure and recommendation platform for ads ranking, search ranking, and live and e-commerce ranking across the company. We also drive substantial impact on the company's core businesses.
Responsibilities
- Iterate on the underlying architecture of the large model inference engine and drive end-to-end GPU performance optimization. Use techniques such as operator fusion and compiler optimization to tune GPU memory access, compute pipelines, and asynchronous stream scheduling, eliminating inference bottlenecks, improving single-GPU inference throughput, and reducing inference latency.
- Adapt the inference engine to the full range of GPU/NPU hardware architectures, improve its generality and hardware adaptability, and build a high-performance, low-loss foundation for large model inference.
- Lead the design, development, and optimization of distributed parallel solutions for large model inference, with a focus on multi-dimensional parallel strategies such as tensor parallelism (TP), pipeline parallelism (PP), sequence parallelism, and MoE expert parallelism. These address core problems such as partitioning and deploying ultra-large models across multiple GPUs, high cross-GPU communication overhead, load imbalance, and low parallel efficiency.
- Track cutting-edge developments worldwide in large model inference, GPU high-performance computing, distributed parallelism, and cache optimization. Benchmark against mainstream inference frameworks such as vLLM and TensorRT-LLM, turn new techniques into working solutions, continuously improve the performance and cost efficiency of the inference system, and build the team's core technical advantages.
Qualifications
Minimum Qualification(s)
- Bachelor’s degree in Computer Science or equivalent, with 3+ years of relevant experience.
- Solid foundation in low-level computer systems; proficient in C/C++ and Python; skilled in CUDA programming and familiar with GPU hardware architecture, including GPU memory models, compute scheduling, and communication mechanisms.
- Strong command of the low-level development and implementation of basic deep learning operators, including GPU adaptation and optimization of core operators such as matrix operations, normalization, and activation functions. Able to independently hand-write and rework operators, optimize memory access, apply vectorization, and align numerical precision to ensure high-performance, stable operator inference.
- Familiar with the end-to-end deep learning inference compilation pipeline and with core compilation techniques such as computational graph optimization, operator fusion, constant folding, memory reuse, scheduling optimization, and quantization compilation. Able to use compiler-level improvements to simplify the inference process, reduce GPU memory usage, lower inference latency, and increase model inference throughput.
- Proficient with GPU performance analysis tools such as Nsight and Profiler, and able to accurately identify bottlenecks such as wasted compute, memory access stalls, and redundant scheduling during inference. Brings a hardware-software co-optimization mindset, can produce systematic optimization plans and carry them through implementation, and can meet the demands of production-grade, high-concurrency, low-latency inference workloads.
Preferred Qualification(s)
- Thorough understanding of the core principles of large model inference and of model parallelism, with hands-on experience implementing distributed inference solutions such as tensor parallelism, pipeline parallelism, and sequence parallelism, and familiarity with multi-GPU communication, load balancing, and parallel efficiency optimization.
- Experience extending and optimizing the performance of mainstream large model inference frameworks such as vLLM, SGLang, and TensorRT-LLM.
- Familiarity with computation efficiency optimization techniques for models in mainstream deep learning frameworks.
About Us
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover, and connect, and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity, and enrich life, a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.