Production Engineer - Applied Machine Learning Engine (Singapore)

Location:

Singapore

Team:

Technology

Employment Type:

Regular

Job Code:

A104354

Responsibilities

About The Team

Backed by ByteDance’s world-leading core algorithm businesses in recommendation, advertising, and search, the Data-AML team is dedicated to building high-performance, highly available machine learning storage systems that support trillion-parameter models. We tackle the extreme challenges of globalized, ultra-large-scale clusters, while playing a key role in the development and evolution of machine learning infrastructure. In this team, you'll have the opportunity to sharpen your expertise in several areas, including model serving, model training, and scheduling and orchestration. You will work on a team that runs machine learning services central to ByteDance at the highest level of availability, and you will build highly automated systems and pipelines.

Responsibilities

- Responsible for production operations management and stability assurance of AML training, inference, and storage systems, covering core pipelines such as scheduling and orchestration, Kubernetes (K8s)/GPU clusters, distributed training, online inference serving, and Parameter Server/NoSQL storage.
- Build and maintain SLO/SLA frameworks, observability, alerting, on-call processes, incident diagnosis, self-healing mechanisms, disaster recovery, and post-incident review (postmortem) practices.
- Drive engineering capabilities including CI/CD, canary/gradual deployments, automated rollback, system health inspections, pre-flight checks, capacity forecasting, and elastic scaling.
- Lead resource governance and optimization across GPU, CPU, storage, and network infrastructure, including quota management, cost attribution, and performance tuning, to improve system availability, resource utilization, and engineering productivity.

Qualifications

Minimum Qualification(s)

- Bachelor's degree or above in Computer Science, Software Engineering, Artificial Intelligence, or a related field.
- Proficient in Linux and skilled in at least one of the following programming languages: Shell, Python, Go, or C++.
- Familiar with machine learning training and inference architectures, Kubernetes, GPU clusters, or distributed storage systems.
- Experienced in production issue troubleshooting, performance analysis, and automation platform development.
- Strong sense of ownership, solid analytical and problem-solving skills, and the ability to drive cross-functional collaboration to resolve complex technical challenges.

Preferred Qualification(s)

- Prior experience with large-scale training, inference, or storage platforms, SLO governance, FinOps practices, NoSQL systems, or open-source infrastructure projects.
- Familiar with the Kubernetes (K8s) ecosystem, with hands-on experience in operating and governing large-scale containerized clusters, including areas such as Operators, declarative operations, and release protection mechanisms.
- Familiar with recommendation and advertising system architectures, with experience in AI infrastructure components such as Parameter Servers or KV Caches (e.g., Mooncake).

Job Information

About Us

Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.

Why Join ByteDance

Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.

As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.

Diversity & Inclusion

ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.