As a Site Reliability Engineering Lead/Manager at Zuora, you will play a critical and visible role in delivering and supporting our platform. We are responsible for scaling and optimizing the reliability, availability, and performance of our infrastructure and platform services, and for partnering with developers to build highly available and performant services. We work with amazing developer teams on the design, provisioning, integration, configuration, monitoring, and incident response of large-scale distributed applications and platform services. We deliver kickass SaaS.
As an SRE Lead/Manager, you will lead a team that understands the configuration, technical dependencies, and overall behavioral characteristics of production services. In partnership with developers, you and your team are responsible for ensuring that services are designed and delivered with a focus on security, resiliency, scale, and performance. SREs are the ultimate authority on, and are accountable for, the end-to-end performance and operability of the services they own.
What you'll achieve
Champion service reliability and prevention: You will be part of the team whose mission is the shared ownership of a collection of services and technology areas, in partnership with developer teams.
Service restoration: You will lead a team that serves as a key escalation point for complex or critical issues that have not yet been documented as SOPs for L1 staff. You will often be called in during major incidents as an SME, when the source of a problem is unclear. You will have the deep understanding of service topologies and their dependencies required to troubleshoot issues and define mitigations. You will help maintain up-to-date documentation on deployments, processes, and SOP run-books.
Prevention: Once you have expertly resolved an issue, you will immediately work on how to resolve it more quickly next time, with the goal of preventing the problem from ever recurring. You will drive the discovery and implementation of automated and self-healing solutions.
Service design and implementation: You will partner with development Scrum teams in defining and implementing improvements to service architecture, both current and future. You will be an expert at articulating the technical characteristics of services and their dependencies, and will guide development teams to engineer highly reliable and performant services. You will then execute on the tasks required to meet the milestones and deliverables set by the team throughout a release cycle.
Operations Engineering: You will own the reliability and performance of one or more services. You will understand and be able to communicate the capacity, scale, security, and performance attributes and requirements of the services you own. You will take part in a shared on-call rotation that won’t cripple your life or kill your soul. You are an SME, able to understand and communicate the characteristics of your service stack, such as:
Degradation and behavior under load of the services and their dependencies
End-to-end tuning needs, optimizing resource utilization, as load patterns fluctuate
Instrumentation and metrics that clearly describe the service behaviors
Scaling requirements and patterns
Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained
What you will need to be successful:
SREs are a rare mix of sysadmins and development engineers, and as such you have the ability to understand and explain the effect of product architecture decisions on how well services run as distributed systems. You are driven by professional curiosity and a desire to develop a deep understanding of the services and the technologies they depend upon. You are passionate about automation, and can demonstrate practical knowledge of various aspects of distributed service design, including messaging protocols, caching strategies, persistence technologies, and queuing.
Leadership: You will lead a team of highly technical SREs. You will keep the team focused on priorities and mentor team members to grow and expand their careers.
Technical: Even though you are a manager/lead, you still need to be able to get in and help the team to get projects completed. You will also be able to provide technical knowledge and know
Proactive: self-motivated, customer-focused, organized, and a good communicator.
Scripting Languages: You demonstrate competence in shell scripting and high-level programming languages such as Bash, Python, and Ruby. We use Python extensively.
Experience: You have over 4 years of experience running large-scale, customer-facing web services, with a solid understanding of:
Load balancing technologies, including L7 routing, DNS, and CDN
Networking and TCP/IP
Server hardware configuration
Monitoring and instrumentation – we own ensuring critical instrumentation and alerting is in place
Standard Internet services, such as DNS and HTTP
Cloud computing patterns
Configuration management using Puppet, Chef, Ansible, or similar
Infrastructure security and compliance
AWS services such as EC2, ELB, ElastiCache, DynamoDB, SQS, SNS, RDS, and S3
Container and container management technologies, such as Docker and Kubernetes
Databases and big data stores
Defining and documenting technical architecture of complex and highly scalable products
Familiarity with ITIL-based incident, problem, and change management