Updated on 07 Jan 2026
- SRE Strategy & Roadmap Development: Define and drive the execution of the SRE strategy and technical roadmap to enhance service reliability, performance, and scalability.
- Observability Platform Leadership: Lead the management and improvement of monitoring, alerting, logging, and tracing tools, driving the establishment of optimal observability environments for each product.
- Service Quality Definition & Achievement: Define Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and plan/execute improvement activities to achieve them. Drive the adoption and operation of Error Budgets.
- Performance & Latency Improvement: Identify bottlenecks in service performance and latency, and direct/oversee the team in proposing and implementing solutions.
- Incident Management & Troubleshooting: Act as an incident commander during production outages, leading rapid restoration efforts. Conduct Root Cause Analysis (RCA) and drive the implementation of preventative measures.
- Operational Efficiency & Automation: Promote automation of operational processes to reduce toil, building an efficient and scalable operational framework.
- Team Management & Development: Provide technical guidance, mentorship, and performance evaluations for SRE team members, contributing to the overall skill enhancement and performance of the team.
- Cross-functional Collaboration: Strengthen collaboration with product development teams, infrastructure teams, security teams, and other relevant departments, fostering a DevOps culture and strong cooperative relationships.
Mandatory Qualifications:
- 5+ years of hands-on experience in SRE, infrastructure engineering, or a related field, with at least 2 years in a team lead or technical lead capacity.
- Experience in building and operating production systems in public cloud (AWS, GCP, Azure, etc.) or private cloud environments.
- Extensive experience in designing, building, operating, and scaling Kubernetes environments.
- Deep knowledge and hands-on experience in building and operating modern monitoring, alerting, and logging tools (e.g., Prometheus, Grafana, ELK Stack, Datadog).
- In-depth knowledge of UNIX-like operating system internals and/or networking.
- Deep knowledge of IP network systems and protocols (TCP/IP, HTTP, etc.) and troubleshooting experience.
- Experience in building automated workflows using CI/CD tools (e.g., Jenkins, CircleCI, GitLab CI/CD).
- Experience in developing operational automation tools and scripts using scripting languages such as Shell, Python, etc.
- Strong communication, negotiation, and collaboration skills to effectively articulate complex technical issues and align with internal and external stakeholders.
Desired Qualifications:
- Web application development experience.
- Experience as a Software Engineer in Test (SET) or knowledge of test automation.
- Deep knowledge and practical experience in Observability, and a strong drive to improve services leveraging SLIs/SLOs.
- Experience in implementing and operating Error Budgets, or a proven track record in toil reduction initiatives.
- Experience working with cross-cultural global teams in different locations.
- Japanese skills: Business-level reading and writing proficiency (while internal communication is primarily in English, some documentation or communication may occur in Japanese).