Responsibilities
About The Team
Backed by ByteDance’s world-leading core algorithm businesses in recommendation, advertising, and search, the Data-AML team is dedicated to building high-performance, highly available machine learning storage systems that support trillion-parameter models. We tackle the extreme challenges of globalized, ultra-large-scale clusters while playing a key role in the development and evolution of machine learning infrastructure. On this team, you'll have the opportunity to sharpen your expertise in multiple areas, including model serving, model training, and scheduling and orchestration. You will work on a team that serves ByteDance's core machine learning services at the highest level of availability and builds highly automated systems and pipelines.
Responsibilities
- Global High Availability & Reliability Governance (SLA/MTTR): As the primary owner of system reliability, ensure the stable operation of large-scale machine learning storage systems (offline and online ultra-large-scale Parameter Servers). Lead end-to-end observability initiatives (metrics, logs, tracing), alerting and incident response, postmortem analysis, and risk governance, while driving disaster recovery drills and the evolution of globally distributed architectures.
- Massive Resource Management & FinOps Cost Governance: Build capacity planning models and cost attribution systems for globally distributed, multi-datacenter resources spanning heterogeneous hardware and multiple hardware generations. Through intelligent scheduling, capacity governance, and resource profiling, break through performance bottlenecks under extreme model scales and highly concurrent traffic, achieving both cost reduction and efficiency improvement.
- Platform Engineering (Self-Healing / Change Management / Capacity Scheduling): Develop automated operations and management platforms, including closed-loop fault self-healing, change management and release protection, auto-scaling and capacity forecasting, inspections, and health scoring capabilities, enabling systems to be operable, scalable, and sustainable.
- Reliability & Performance Engineering: Deeply engage with business scenarios to conduct specialized performance optimization across the training layer (ultra-large models), inference layer (ultra-low latency and large capacity), and synchronization layer (high consistency and timeliness). Drive tail-latency and jitter optimization from the perspectives of storage engines and replication pipelines, while continuously evaluating and implementing cutting-edge industry hardware and software solutions.
Qualifications
Minimum Qualifications
- Bachelor’s degree or above, with 3+ years of experience in SRE, infrastructure engineering, or operations development.
- Strong interest in reliability and operability engineering for large-scale distributed storage systems and machine learning infrastructure.
- Proficient in at least one of Go, Python, or C/C++, with the ability to translate reliability requirements into engineering and platformized solutions, such as automation toolchains, governance platforms, and system components.
- Solid foundational knowledge of NoSQL storage systems, with systematic understanding of at least one of Parameter Server (sparse embedding tables), Redis, RocksDB, or MongoDB, including mechanisms such as LSM-Tree, Compaction, WAL, MVCC, and replication pipelines.
- Able to quickly identify system-level anomalies (e.g., P99/P999 tail latency spikes, compaction abnormalities, memory fragmentation) through monitoring metrics and logs, and ensure SLA compliance and reduce MTTR through parameter tuning, capacity planning, scaling strategies, and mitigation mechanisms.
- Capable of collaborating efficiently with R&D teams to identify root causes, drive fault self-healing, and continuously improve system robustness.
- Deep understanding of Linux operating system internals and production troubleshooting. Experienced in kernel-level profiling, tracing, and flame graph analysis, with the ability to thoroughly resolve complex end-to-end system issues.
Preferred Qualifications
- Familiar with the Kubernetes (K8s) ecosystem, with hands-on experience in operating and governing large-scale containerized clusters, including areas such as Operators, declarative operations, and release protection mechanisms.
- Familiar with recommendation and advertising system architectures, with experience in AI infrastructure components such as Parameter Servers or KV Caches (e.g., Mooncake).
- Deep understanding of the Linux kernel and the internals of classic NoSQL databases.
- Experience reading source code and/or contributing PRs or bug fixes to open-source projects such as Redis, RocksDB, Pika, Ceph, or HBase.
Job Information
About Us
Founded in 2012, ByteDance has a mission to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut, and Pico, as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.