Improve your reliability with modern operations practices
Offered By: Microsoft via Microsoft Learn
Course Description
Overview
- Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.
- Express why reliability is crucial to your success
- Describe modern operations practices that offer tools you can use to work on your reliability challenges
- Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges
- Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn how to increase your operational awareness as a precursor to reliability work
- Expand your understanding of reliability itself
- Change the way you frame your thinking about monitoring to make it more impactful
- Gain a basic understanding of the applicable monitoring platform and tools available on Azure
- Learn a practice from site reliability engineering that can immediately start to create an impact on reliability
- Learn to craft actionable alerts to make your operational practices sustainable
- Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn the importance of effective incident response
- Gain an understanding of the lifecycle of an incident so you know where to apply your efforts
- Learn the building blocks for constructing an incident response process that allows you to respond with urgency
- Begin to track your incidents effectively using Azure DevOps tools
- Explore ways to automate your incident tracking for a speedy and consistent response
- Understand the guidelines around communication that allow incident response to be more efficient
- Visit some Azure tools that can significantly speed up your remediation times during an incident
- Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Discover the importance of learning from incidents
- Understand the aspects of complex systems that make learning from failure important
- Learn when and how to conduct a post-incident review
- Understand the purpose and goals of a post-incident review
- Learn the components that go into a good post-incident review
- Explore the Azure tools that can assist with getting started with post-incident reviews
- Become aware of common traps to avoid
- Identify helpful practices to conduct a better review
- Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn what software deployment is and the different kinds of deployments you might employ
- Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
- Explore the components of continuous deployment
- Look deep into pipelines and how they are implemented in Azure Pipelines
- Learn several strategies for deploying to production that can help you avoid incidents
- Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software
- Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn about scalability and the scalability/reliability relationship
- Understand the role of capacity planning in preparing for growth
- Learn basic concepts and fundamental terms related to scaling
- Eliminate single points of failure
- Understand the different kinds of growth and how to respond to them
- Be able to measure capacity in the cloud
- Catch issues with service limits and quotas before they emerge using Azure tools
- Understand important steps to take before beginning work on scaling
- List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
- Learn about the Azure tools that make it possible to take your application or service global
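To make one of the Module 6 scaling techniques concrete, here is a minimal sketch of database sharding by key hash. It is an illustration only and is not taken from the course material: the function name `shard_for` and the choice of SHA-256 as the stable hash are assumptions made for this example.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index (hypothetical helper, not from the course).

    A stable hash is used so the same key always lands on the same shard,
    regardless of which process or machine computes it.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo
    # the number of shards to get an index in the range [0, shard_count).
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for user_id in ("user-1", "user-2", "user-3"):
        print(user_id, "-> shard", shard_for(user_id, 4))
```

One caveat with this simple modulo scheme: changing `shard_count` remaps most keys, so systems that expect to add shards often use consistent hashing or a lookup table instead.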
Syllabus
- Module 1: Improve your reliability with modern operations practices: An introduction
- Introduction
- Why reliability matters
- Modern operations
- The Dickerson hierarchy of reliability
- Summary
- Module 2: Improve your reliability with modern operations practices: Monitoring
- Introduction
- Operational awareness
- Expanding our understanding of reliability
- Changing the frame
- Azure monitoring tools
- Log analytics and KQL queries
- Service level indicators (SLIs) and service level objectives (SLOs)
- Actionable alerts
- Summary
- Module 3: Improve your reliability with modern operations practices: Incident response
- Introduction
- Importance of incident response
- Characteristics and lifecycle of an incident
- Foundations of incident response
- Incident tracking
- Communication and collaboration
- Remediation
- Summary
- Module 4: Improve your reliability with modern operations practices: Learning from failure
- Introduction
- Why learn from incidents?
- What is a post-incident review?
- Characteristics and components of a good post-incident review
- The post-incident review process
- Common traps to avoid
- Helpful practices for learning from failure
- Summary
- Module 5: Improve your reliability with modern operations practices: Deployment
- Introduction
- What is software deployment?
- The continuous delivery deployment model
- Test automation and the delivery pipeline
- Deployment strategies
- Summary
- Module 6: Improve your reliability with modern operations practices: Capacity planning and scaling
- Introduction
- What is scalability?
- Prepare for growth
- Capacity planning considerations
- Make applications scalable
- Go global
- Summary
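As an illustration of the SLI and SLO topic listed under Module 2, the sketch below computes an availability SLI as the ratio of good requests to total requests and compares it with an SLO target to see how much error budget remains. The function names and the sample numbers are assumptions made for this example and do not come from the course.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Return the fraction of the error budget still unspent.

    The budget is the unreliability the SLO allows (1 - slo). A negative
    result means the budget has been overspent.
    """
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("slo must be below 1.0 to leave any error budget")
    observed_failure = 1.0 - sli
    return 1.0 - observed_failure / allowed_failure


if __name__ == "__main__":
    # Hypothetical measurement window: 100,000 requests, 50 of them failed.
    sli = availability_sli(good_requests=99_950, total_requests=100_000)
    remaining = error_budget_remaining(sli, slo=0.999)
    print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With a 99.9% objective, 50 failures out of 100,000 requests use about half of the allowed 100 failures, which is the kind of signal a team can use to decide whether to prioritize reliability work over new features.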