Site Reliability Engineer
FULL DESCRIPTION
As a Site Reliability Engineer, you will enhance system reliability, observability and performance through a strong engineering approach, and you will support incident resolution and the adoption of best practices.
Preferred Skills and Experience
- Excellent knowledge of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction.
- Knowledge of contemporary observability tools, techniques and best practices, including Splunk, New Relic, Grafana and PagerDuty.
- Excellent knowledge of programming languages including Python, Golang and JavaScript.
- Knowledge and experience of modern software development techniques and lifecycles.
- Experience with Infrastructure as Code (IaC) automation and orchestration tools such as Ansible and Terraform.
- Prior experience working in a large-scale, 24/7 enterprise where system uptime and stability are of paramount importance to the Business.
- Keen interest in industry trends, particularly Platform Engineering.
- Proficiency in shell scripting for automation and system management tasks.
What you will be doing
- Writing and contributing to code that enhances the reliability and observability of services, including telemetry, operational APIs and tooling.
- Developing and maintaining tools that facilitate effective management of our systems, ensuring they are operationally efficient and resilient.
- Working with automation and orchestration platforms to automate manual activity and reduce toil.
- Building sophisticated dashboards using a range of telemetry data and dashboarding technologies like Grafana, Splunk and New Relic.
- Maintaining and administering existing monitoring and analytic toolsets.
- Mentoring colleagues in the use of new technologies and practices.
- Actively participating in live incident resolution and post-mortem analysis, providing effective remediation strategies to improve overall system health and prevent future issues.
- Driving initiatives to enhance system reliability and observability, contributing to a culture of continuous improvement.
- Collaborating with the central Site Reliability Engineering and Observability teams to establish and uphold standards for reliability and observability, assisting teams in adhering to these practices.
- Working with IT Operations, providing and supporting the use of critical tooling to enable increasing levels of value to the Business.
Employee Perks
- Bonus
- Eye care and Flu Vaccinations
- Life Assurance
- Matched pension contribution