Site Reliability Engineer

Offered By: Udacity

Course Description

Overview

The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with the engineering and operational skills required to build automation tools and responses that ensure designed solutions meet non-functional requirements such as availability, performance, security, and maintainability. The content focuses both on designing systems that automate the response to site issues and on responding to common on-call situations.

Syllabus

Welcome!

Welcome! We're so glad you're here. Join us in learning a bit more about what to expect in this program and ways to succeed.

Establishing a Foundation in Observability

In this course, we will learn the foundational concepts of observability in terms of both people and tools.

Planning for High Availability and Incident Response

In this course, we will look at how SREs view availability and reliability for their infrastructure. We'll learn how to create effective monitoring using service level objectives (SLOs) and service level indicators (SLIs), and we'll create dashboards in Grafana. Next, we'll identify all our IT assets and ensure they are configured for high availability. Then we'll craft a disaster recovery plan to make sure failover is seamless and automated. After that, we'll deploy the infrastructure to AWS using Terraform, learn the benefits of infrastructure as code, and see how easy it is to deploy to multiple regions. Finally, we'll learn how to make databases highly available and ready for disaster recovery. We'll look at recovery strategies and implement them in AWS via Terraform.

Self-Healing Architectures

A self-healing architecture is resilient enough to withstand failures and uses automation to resolve issues without human intervention. In this course, you'll gain skills in self-healing architecture design strategies, deployment strategies, and cloud automation.

Establishing a Culture of Reliability

This course is about fostering a culture based on reliability. We will learn best practices for several key areas of a Site Reliability Engineer's (SRE's) work and how they contribute to a culture of reliability. We will cover how to run balanced and effective on-call rotations as well as how to handle incidents. Next, we will discuss how to review your system throughout its lifecycle to find and mitigate potential risk factors. Managing system capacity at all phases of a system's lifecycle is another major component of keeping everything operating at maximum reliability. We will round out this course by discussing a thorn in every SRE's side: toil. We will discuss how to identify and reduce toil to maximize the time available for engineering work.

Congratulations!

Congratulations on finishing your program!

Career Services

The Careers team at Udacity is here to help you move forward in your career, whether it's finding a new job, exploring a new career path, or applying new skills to your current job.

Taught by

Nathan Anderson, Travis Scotto, Emmanuel Apau, and Sonny Sevin
