As a Principal Site Reliability Engineer at Zuora you will play a critical and visible role in delivering and supporting our platform. We are responsible for scaling and optimizing the reliability, availability, and performance of our infrastructure and platform services, and for partnering with developers to build highly available and performant services. We work with amazing developer teams on the design, provisioning, integration, configuration, monitoring, and incident response of large-scale distributed applications and platform services. We deliver kickass SaaS.
As a Principal SRE you will lead a team that understands the configuration, technical dependencies, and overall behavioral characteristics of production services. In partnership with developers, you and your team are responsible for ensuring that services are designed and delivered with a focus on security, resiliency, scale, and performance. SREs are the ultimate authority and are accountable for the end-to-end performance and operability of the services they own.
What you’ll achieve
- Champion service reliability and prevention: You will be part of the team whose mission is the shared ownership of a collection of services and technology areas, in partnership with developer teams.
- Service restoration: You will be a key escalation point for complex or critical issues that have not yet been documented as SOPs for L1 staff. You will often be called in during major incidents as an SME when the source of a problem is unclear. You will have the deep understanding of service topologies and their dependencies required to troubleshoot issues and define mitigations. You will help maintain up-to-date documentation on deployments, processes, and SOP run-books.
- Prevention: Once you have expertly resolved an issue, you will immediately work on how to resolve it more quickly next time, with the goal of preventing the problem from ever recurring. You will drive the discovery and implementation of automated and self-healing solutions.
- Service design and implementation: You will partner with development Scrum teams in defining and implementing improvements to service architecture, both current and future. You will be an expert at articulating the technical characteristics of services and their dependencies, and will guide development teams to engineer highly reliable and performant services. You will then execute on the tasks required to meet the milestones and deliverables set by the team throughout a release cycle.
- Operations Engineering: You will own the reliability and performance of one or more services. You will understand and be able to communicate the capacity, scale, security, and performance attributes and requirements of the services you own. You will take part in a shared on-call rotation that won’t cripple your life or kill your soul. You are an SME, able to understand and communicate the characteristics of your service stack, such as:
- Degradation and behavior under load of the services and their dependencies
- End-to-end tuning needs, optimizing resource utilization, as load patterns fluctuate
- Instrumentation and metrics that clearly describe the service behaviors
- Scaling requirements and patterns
- Resiliency and recoverability, ensuring that backup / restore and disaster recovery capabilities are implemented, tested and maintained
What you will need to be successful:
SREs are a rare mix of sysadmins and development engineers, and as such you are able to understand and explain how product architecture decisions affect the ability of services to run as distributed systems. You are driven by professional curiosity and a desire to develop a deep understanding of the services and the technologies they depend upon. You are passionate about automation, and can demonstrate practical knowledge of various aspects of distributed service design, including messaging protocols, caching strategies, persistence technologies, and queuing.
- Technical: Even though you are a manager/lead, you still need to be able to get in and help the team complete projects. You will also be able to provide technical knowledge and know-how.
- Proactive: You are self-motivated, customer-focused, organized, and a good communicator.
- Scripting Languages: You demonstrate competence in shell scripting and high-level programming languages such as Bash, Python, and Ruby. We use Python extensively.
- Experience: You have over 4 years of experience running large-scale, customer-facing web services, with a solid understanding of:
- REST APIs
- Load balancing technologies, including L7 routing, DNS, and CDN
- Networking and TCP/IP
- Server hardware configuration
- Monitoring and instrumentation – we are responsible for ensuring that critical instrumentation and alerting are in place
- Standard Internet services, such as DNS, HTTP, etc.
- Cloud computing patterns
- Configuration management using Puppet, Chef, Ansible, or similar
- Infrastructure security and compliance
- AWS services such as EC2, ELB, ElastiCache, DynamoDB, SQS, SNS, RDS, and S3
- Container and Container Management technologies, such as Docker and Kubernetes
- Databases and big data stores
- Defining and documenting technical architecture of complex and highly scalable products
- Familiarity with ITIL-based incident, problem, and change management