
Responsibilities
The Data Center Service team supports the company's fast growth by building and operating hyperscale data centers. The team manages the end-to-end lifecycle of the server fleet, providing cloud solutions and infrastructure services and ensuring that they are scalable and reliable.

Responsibilities:
1. Large-Scale Server OS Deployment
- Responsible for operating system deployment and delivery across large-scale IDC environments.
- Perform OS image installation, system initialization, and customized OS provisioning for servers.
2. Provisioning Platform Architecture Evolution
- Design, develop, and continuously enhance the core architecture of hyperscale automated server provisioning platforms.
- Drive improvements in platform scalability, reliability, and operational efficiency.
3. Low-Level Services & Hardware Enablement
- Develop and maintain core backend components of the provisioning system, including PXE services, OS image management, and related infrastructure.
- Support hardware enablement and compatibility for new server platforms and components.
4. Complex Troubleshooting & AIOps Innovation
- Investigate and resolve complex issues across the end-to-end server delivery lifecycle.
- Explore and implement Large Language Models (LLMs) and AI Agent technologies for intelligent log analysis, root cause identification, automated troubleshooting, and self-healing systems.
5. Engineering Efficiency & Security
- Build and optimize CI/CD pipelines for infrastructure changes.
- Strengthen lifecycle security compliance, risk mitigation, and disaster recovery capabilities.
6. Hardware Validation & Delivery Assurance
- Coordinate end-to-end server hardware validation activities to ensure that delivery quality and compliance requirements are met.
7. Performance Testing & Optimization
- Lead validation and testing of critical server components, including CPUs, memory, storage devices, and GPUs.
- Conduct single-node and cluster-level GPU performance benchmarking, stress testing, and performance tuning.
8. Test Automation
- Develop automated benchmarking and stress-testing frameworks using scripting languages to improve testing efficiency and coverage.
9. Quality Analytics & Continuous Improvement
- Perform quality analysis on large-scale server shipments.
- Drive quality control initiatives and manage closed-loop resolution of hardware and delivery issues.
Qualifications
Minimum Qualifications
1. Bachelor's degree in Computer Science, Engineering, or a related field, with 3+ years of experience in IT infrastructure, server operations, system engineering, or hardware validation.
2. Strong understanding of data center infrastructure and operational models.
3. Deep knowledge of Linux operating systems and server hardware architecture, including CPU, memory, storage, RAID, and network interface controllers.
4. Solid understanding of PXE-based automated provisioning workflows and related network protocols such as DHCP, TFTP, and HTTP.
5. Hands-on experience with out-of-band management technologies such as IPMI and Redfish, as well as boot architectures including BIOS and UEFI.
6. Strong programming and automation skills in at least one of the following: Golang, Python, or Shell scripting.
7. Familiarity with Git-based software development workflows and collaborative engineering practices, and proficiency in Linux system administration and operational troubleshooting.

Preferred Qualifications
1. Experience in server hardware validation, benchmarking, and quality assurance programs.
2. Deep understanding of performance characteristics and benchmarking methodologies for CPUs, memory, storage systems, and GPUs.
3. Experience designing and implementing automated performance and stress-testing frameworks.
4. Proven ability to conduct large-scale quality analytics and operational excellence initiatives.
5. Strong documentation skills, including technical specifications, test procedures, project reports, and operational runbooks.
6. Strong analytical and problem-solving skills, with the ability to independently drive complex technical investigations and solutions.
7. Experience leveraging AI-assisted development and troubleshooting tools to improve engineering productivity.
Job Information
About Us
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.