Site Reliability Engineer

We are recruiting for an experienced Site Reliability Engineer (SRE). As a key member of the operations team, you will be passionate about keeping products and platforms running at peak performance. You will be excited to understand how the software of a global tech business runs on top of its infrastructure (both physical and cloud), and how the individual components interact to provide the brilliant solution that customers around the world expect.

You will take pride in the performance and availability of the solutions you look after, and you will work hard to keep the manual overhead of running a 24x7 internet solution to a minimum. As an SRE, you will work collaboratively with engineering teams to ensure that issues with existing software are understood and responded to appropriately, and that new solutions are built with reliability at their heart.

What you'll be doing

As a Site Reliability Engineer you will be responsible for:

- Ensuring high levels of system performance through monitoring, analysis and performance tuning.
- Troubleshooting system hardware, software, networks, operating systems and system management systems.
- Working with the security team to identify and protect against threats to the cloud solutions.
- Liaising with developers, product owners and other engineering teams to deliver engineering roadmaps showing key items such as upgrades, technical refreshes and new versions.
- Contributing to reviews and audits of projects from an engineering perspective, including identifying risks and mitigation options.
- Providing on-call support, including out-of-hours incident support on a rota basis, to help deliver a high quality of service around the clock.
- Building automation scripts, tools and run-books to reduce the need for manual responses to alerts.
- Building knowledge and skills within engineering teams to ensure the successful running of their software in a high-throughput production environment.
What we're looking for

We're interested in hearing from candidates with the following skills and experience:

- A passion for reliability.
- Previous experience of working in an operations role (ideally a site reliability role).
- The ability to work collaboratively across multiple teams, and to take ownership of, prioritise and be accountable for your work.
- Excellent communication skills and a desire to continue to learn.
- Centralised monitoring solutions (New Relic, Application Insights, Log Analytics, ELK or similar).
- Configuration management tools (Ansible, Chef or similar).
- Scripting/programming languages to assist in automating solutions, e.g. PowerShell (preferred), Bash, C#, Ruby or Python.
- Experience supporting web-based applications, with an understanding of firewall configuration, load balancing and availability checks.
- Experience of working with Linux and Microsoft server operating systems.

It would be great if you also had:

- Experience of defining service level objectives/operational requirements for a cloud-based solution.
- Understanding and working knowledge of MS Azure Cloud offerings, especially in the PaaS category (Web Apps, Storage, Functions).
- A good understanding or working knowledge of the following tools: Terraform, Ansible, VSTS, ARM, Puppet, Chef, Jenkins, ELK, Grafana.
- A good understanding or working knowledge of DNS, load balancer configuration, Active Directory and cloud-based network infrastructure.
- Experience of working in an agile environment and with agile methodologies such as TDD, Scrum and Kanban.
- Understanding and experience of implementing a monitoring and alerting system for a micro-service architecture.
- Applied understanding of cloud security best practice.

Does this role fit what you are looking for, and do you have the experience we are looking for? If so, apply today.

This job was originally posted as www.totaljobs.com/job/92860413
