Improve your reliability with modern operations practices
Offered By: Microsoft via Microsoft Learn
Course Description
Overview
- Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.
- Express why reliability is crucial to your success
- Describe modern operations practices that offer tools you can use to work on your reliability challenges
- Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges
- Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn how to increase your operational awareness as a precursor to reliability work
- Expand your understanding of reliability itself
- Change the way you frame your thinking about monitoring to make it more impactful
- Gain a basic understanding of the applicable monitoring platform and tools available on Azure
- Learn a practice from site reliability engineering, service level indicators (SLIs) and service level objectives (SLOs), that can immediately start to improve reliability
- Learn to craft actionable alerts to make your operational practices sustainable
- Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn the importance of effective incident response
- Gain an understanding of the lifecycle of an incident so you know where to apply your efforts
- Learn the building blocks for constructing an incident response process that lets you respond with urgency
- Begin to track your incidents effectively using Azure DevOps tools
- Explore ways to automate your incident tracking for a speedy and consistent response
- Understand the communication guidelines that make incident response more efficient
- Explore Azure tools that can significantly reduce your remediation time during an incident
- Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Discover the importance of learning from incidents
- Understand the aspects of complex systems that make learning from failure important
- Learn when and how to conduct a post-incident review
- Understand the purpose and goals of a post-incident review
- Learn the components that go into a good post-incident review
- Explore the Azure tools that can help you get started with post-incident reviews
- Become aware of common traps to avoid
- Identify helpful practices to conduct a better review
- Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn what software deployment is and the different kinds of deployments you might employ
- Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
- Explore the components of continuous deployment
- Take a deep look at pipelines and how they are implemented in Azure Pipelines
- Learn several strategies for deploying to production that can help you avoid incidents
- Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software
- Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
- Learn about scalability and the scalability/reliability relationship
- Understand the role of capacity planning in preparing for growth
- Learn basic concepts and fundamental terms related to scaling
- Eliminate single points of failure
- Understand the different kinds of growth and how to respond to them
- Measure capacity in the cloud
- Use Azure tools to catch issues with service limits and quotas before they emerge
- Understand important steps to take before beginning work on scaling
- List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
- Learn about the Azure tools that make it possible to take your application or service global
Syllabus
- Module 1: Improve your reliability with modern operations practices: An introduction
- Introduction
- Why reliability matters
- Modern operations
- The Dickerson hierarchy of reliability
- Summary
- Module 2: Improve your reliability with modern operations practices: Monitoring
- Introduction
- Operational awareness
- Expanding our understanding of reliability
- Changing the frame
- Azure monitoring tools
- Log analytics and KQL queries
- Service level indicators (SLIs) and service level objectives (SLOs)
- Actionable alerts
- Summary
- Module 3: Improve your reliability with modern operations practices: Incident response
- Introduction
- Importance of incident response
- Characteristics and lifecycle of an incident
- Foundations of incident response
- Incident tracking
- Communication and collaboration
- Remediation
- Summary
- Module 4: Improve your reliability with modern operations practices: Learning from failure
- Introduction
- Why learn from incidents?
- What is a post-incident review?
- Characteristics and components of a good post-incident review
- The post-incident review process
- Common traps to avoid
- Helpful practices for learning from failure
- Summary
- Module 5: Improve your reliability with modern operations practices: Deployment
- Introduction
- What is software deployment?
- The continuous delivery deployment model
- Test automation and the delivery pipeline
- Deployment strategies
- Summary
- Module 6: Improve your reliability with modern operations practices: Capacity planning and scaling
- Introduction
- What is scalability?
- Prepare for growth
- Capacity planning considerations
- Make applications scalable
- Go global
- Summary