YoVDO

Site Reliability Engineer

Offered By: Udacity

Tags

Site Reliability Engineering (SRE) Courses, Microservices Courses, Terraform Courses, Incident Response Courses, Observability Courses

Course Description

Overview

The goal of the Site Reliability Engineer (SRE) Nanodegree program is to equip software developers with the engineering and operational skills required to build automation tools and responses that ensure solutions meet non-functional requirements such as availability, performance, security, and maintainability. The content focuses both on designing systems that automate the response to issues with software sites and on responding to common on-call situations.

Syllabus

  • Welcome!
    • Welcome! We're so glad you're here. Join us in learning a bit more about what to expect in this program and ways to succeed.
  • Establishing a foundation in observability
    • In this course, we will learn about the foundational concepts of observability in terms of people and tools.
  • Planning for High Availability and Incident Response
    • In this course, we will look at how SREs view availability and reliability for their infrastructure. We'll learn how to create effective monitoring using service level objectives (SLOs) and service level indicators (SLIs), and we will create dashboards in Grafana. Next, we'll identify all our IT assets and ensure they are configured for high availability. Then we will craft a disaster recovery plan to make sure failover is seamless and automated. After that, we'll deploy the infrastructure to AWS using Terraform, learn the benefits of infrastructure as code, and see how easy it is to deploy to multiple regions. Finally, we'll learn how to make databases highly available and disaster recovery ready. We'll look at recovery strategies and implement them in AWS via Terraform.
  • Self Healing Architectures
    • A self-healing architecture is resilient enough to withstand failure and resolves issues through automation, without human intervention. In this course, you'll gain skills in self-healing architecture design strategies, deployment strategies, and cloud automation.
  • Establishing a Culture of Reliability
    • This course is about fostering a culture based on reliability. We will learn best practices for several key areas of a Site Reliability Engineer's (SRE's) work and how they contribute to a culture of reliability. We will cover how to run balanced and effective on-call rotations as well as how to handle incidents. Next, we will discuss how to review your system throughout its lifecycle to find and mitigate potential risk factors. Managing system capacity at all phases of a system's lifecycle is another major component of keeping everything operating at maximum reliability. We will round out this course by discussing a thorn in every SRE's side: toil. We will discuss how to identify and reduce toil to free up time for engineering work.
  • Congratulations!
    • Congratulations on finishing your program!
  • Career Services
    • The Careers team at Udacity is here to help you move forward in your career, whether that means finding a new job, exploring a new career path, or applying new skills to your current job.

Taught by

Nathan Anderson, Travis Scotto, Emmanuel Apau, and Sonny Sevin

Related Courses

Google Professional Cloud DevOps Engineer Certification Course (GCP DevOps Engineer Track Part 5)
A Cloud Guru
SRE Capstone
IBM via edX
SRE Fundamentals and Security
IBM via edX
Developing a Google SRE Culture - 日本語版
Google Cloud via Coursera
Developing a Google SRE Culture
Google via Google Cloud Skills Boost