
Responsibilities
The Data Systems Infrastructure (DSI) team sits within the ByteDance global technology structure and supports the company's fast growth by building and operating hyper-scale datacenters, managing the life cycle of the server fleet, providing cloud solutions, and developing infrastructure services that are scalable and reliable.
We are seeking a technically skilled and detail-oriented professional to serve as a front-line responder for incident detection, triage, and response across infrastructure, facilities, and security operations. The ideal candidate will have a strong foundation in facility operations, broad knowledge across IT, infrastructure, or engineering disciplines, experience in critical environments, and the ability to analyze incidents, manage them properly, identify trends, and drive sustained improvements. This role requires performance under pressure, data-driven thinking, and a proactive approach to continuous improvement and operational resilience.
- Act as the primary first responder for the IRC Operations Center by continuously monitoring infrastructure, facilities, and external services using approved tools, and immediately responding to all alerts and anomalies.
- Promptly address environmental, facility, IT infrastructure, and external service events (e.g., power, temperature, flooding, outages, partner notifications) to minimize operational and customer impact.
- Conduct detailed root cause analysis for all incidents, assess scope and impact, determine corrective actions, and ensure issues are fully understood before closure.
- Accurately assess incident severity and customer risk, communicate clearly and proactively with stakeholders, and coordinate timely escalations and collaboration with resolver teams to drive rapid resolution.
- Manage incidents properly and efficiently, track response performance against SLAs, and ensure alerts, notifications, and resolutions occur within agreed timelines.
- Produce comprehensive incident reports, post-mortems, and operational metrics; analyze trends and recurring issues to generate insights and drive continuous improvement.
- Own Incident, Problem, and Change Management processes; maintain SOPs/runbooks; provide technical leadership; and champion continuous improvements to reliability, security, and operational effectiveness across teams.
Qualifications
Minimum Qualifications:
- Hold a Bachelor’s degree in Computer Science, Information Technology, Engineering, or a related technical discipline, with solid fundamentals in infrastructure and systems operations.
- Demonstrate hands-on experience in Data Center Facility Operations Centers, IT infrastructure, network operations, or systems monitoring environments.
- Be proficient with monitoring and alerting platforms (e.g., Grafana, Nagios, or similar) to detect, analyze, and respond to operational events effectively.
- Exhibit strong analytical and troubleshooting skills, with the proven ability to investigate incidents, determine root causes, and implement corrective actions.
- Operate properly and decisively during critical situations while coordinating incident and problem management processes across cross-functional teams.
- Communicate clearly with both technical and non-technical stakeholders through reports and reviews, while maintaining a proactive mindset focused on continuous improvement and operational excellence.
Preferred Qualifications:
- 5+ years of hands-on experience in IT or data center environments, with strong exposure to incident and problem management in enterprise-scale systems.
- Demonstrate working knowledge of data center facility operations, including mechanical, electrical, and plumbing (MEP) systems, along with server and infrastructure technologies.
- Have practical experience with ticketing systems, monitoring platforms (e.g., Grafana), and data center or server management tools to support reliable operations.
- Consistently perform in fast-changing, time-sensitive situations, balancing multiple priorities while meeting deadlines and resolving critical issues efficiently.
- Contribute to or lead initiatives that enhance operational efficiency, security, resilience, and overall infrastructure performance through continuous improvement efforts.
- Maintain relevant certifications or technical knowledge (e.g., ITIL, Server+, DCCA, CCNA, PMP, analytics tools), adapt quickly to changing environments, and support operational needs including on-call coverage.
- This role requires on-call coverage through a scheduled on-call rotation.
Job Information
About Us
Founded in 2012, ByteDance's mission is to inspire creativity and enrich life. With a suite of more than a dozen products, including TikTok, Lemon8, CapCut and Pico as well as platforms specific to the China market, including Toutiao, Douyin, and Xigua, ByteDance has made it easier and more fun for people to connect with, consume, and create content.
Why Join ByteDance
Inspiring creativity is at the core of ByteDance's mission. Our innovative products are built to help people authentically express themselves, discover and connect – and our global, diverse teams make that possible. Together, we create value for our communities, inspire creativity and enrich life - a mission we work towards every day.
As ByteDancers, we strive to do great things with great people. We lead with curiosity, humility, and a desire to make an impact in a rapidly growing tech company. By constantly iterating and fostering an "Always Day 1" mindset, we achieve meaningful breakthroughs for ourselves, our Company, and our users. When we create and grow together, the possibilities are limitless. Join us.
Diversity & Inclusion
ByteDance is committed to creating an inclusive space where employees are valued for their skills, experiences, and unique perspectives. Our platform connects people from across the globe and so does our workplace. At ByteDance, our mission is to inspire creativity and enrich life. To achieve that goal, we are committed to celebrating our diverse voices and to creating an environment that reflects the many communities we reach. We are passionate about this and hope you are too.