YoVDO

Improve your reliability with modern operations practices

Offered By: Microsoft via Microsoft Learn

Tags

DevOps Courses, Operations Management Courses, Incident Response Courses, Reliability Engineering Courses, Scaling Courses, Capacity Planning Courses, Test Automation Courses

Course Description

Overview

  • Module 1: Discover a map for navigating reliability challenges and sustainably achieving the appropriate level of reliability in your systems, services, and products.
  • By the end of this module, you will be able to:

    • Express why reliability is crucial to your success
    • Describe modern operations practices that offer tools you can use to work on your reliability challenges
    • Explain the Dickerson hierarchy of reliability and the map it provides for approaching reliability challenges
  • Module 2: Learn how to use monitoring to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
  • In this module you will:

    • Learn how to increase your operational awareness as a precursor to reliability work
    • Expand your understanding of reliability itself
    • Change the way you frame your thinking about monitoring to make it more impactful
    • Gain a basic understanding of the applicable monitoring platform and tools available on Azure
    • Learn a practice from site reliability engineering that can immediately start to create an impact on reliability
    • Learn to craft actionable alerts to make your operational practices sustainable
  • Module 3: Learn the incident response fundamentals necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
  • In this module you will:

    • Learn the importance of effective incident response
    • Gain an understanding of the lifecycle of an incident so you know how to apply your efforts
    • Learn the building blocks for constructing an incident response process that allows you to respond with urgency
    • Begin to track your incidents effectively using Azure DevOps tools
    • Explore ways to automate your incident tracking for a speedy and consistent response
    • Understand the guidelines around communication that allow incident response to be more efficient
    • Visit some Azure tools that can significantly speed up your remediation times during an incident
  • Module 4: Learn about post-incident reviews, a practice necessary to help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
  • In this module you will:

    • Discover the importance of learning from incidents
    • Understand the aspects of complex systems that make learning from failure important
    • Learn when and how to conduct a post-incident review
    • Understand the purpose and goals of a post-incident review
    • Learn the components that go into a good post-incident review
    • Explore the Azure tools that can assist with getting started with post-incident reviews
    • Become aware of common traps to avoid
    • Identify helpful practices to conduct a better review
  • Module 5: Learn about deployment practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
  • In this module you will:

    • Learn what software deployment is and the different kinds of deployments you might employ
    • Discover the significant benefits of switching from an "epic deployment" model to a "continuous deployment" model
    • Explore the components of continuous deployment
    • Look deep into pipelines and how they are implemented in Azure Pipelines
    • Learn a number of different strategies for deployment to production that can help you avoid incidents
    • Examine some important best practices that can minimize the risk when rolling out new software or a new version of existing software
  • Module 6: Learn about capacity planning and scaling practices that can help you sustainably achieve the appropriate level of reliability in your systems, services, and products.
  • In this module you will:

    • Learn about scalability and the scalability/reliability relationship
    • Understand the role of capacity planning in preparing for growth
    • Learn basic concepts and fundamental terms related to scaling
    • Eliminate single points of failure
    • Understand the different kinds of growth and how to respond to them
    • Be able to measure capacity in the cloud
    • Catch issues with service limits and quotas before they emerge using Azure tools
    • Understand important steps to take before beginning work on scaling
    • List techniques for making an application more scalable, including decoupling, queues, in-memory caching, and database sharding
    • Learn about the Azure tools that make it possible to take your application or service global

Syllabus

  • Module 1: Improve your reliability with modern operations practices: An introduction
    • Introduction
    • Why reliability matters
    • Modern operations
    • The Dickerson hierarchy of reliability
    • Summary
  • Module 2: Improve your reliability with modern operations practices: Monitoring
    • Introduction
    • Operational awareness
    • Expanding our understanding of reliability
    • Changing the frame
    • Azure monitoring tools
    • Log analytics and KQL queries
    • Service level indicators (SLIs) and service level objectives (SLOs)
    • Actionable alerts
    • Summary
  • Module 3: Improve your reliability with modern operations practices: Incident response
    • Introduction
    • Importance of incident response
    • Characteristics and lifecycle of an incident
    • Foundations of incident response
    • Incident tracking
    • Communication and collaboration
    • Remediation
    • Summary
  • Module 4: Improve your reliability with modern operations practices: Learning from failure
    • Introduction
    • Why learn from incidents?
    • What is a post-incident review?
    • Characteristics and components of a good post-incident review
    • The post-incident review process
    • Common traps to avoid
    • Helpful practices for learning from failure
    • Summary
  • Module 5: Improve your reliability with modern operations practices: Deployment
    • Introduction
    • What is software deployment?
    • The continuous delivery deployment model
    • Test automation and the delivery pipeline
    • Deployment strategies
    • Summary
  • Module 6: Improve your reliability with modern operations practices: Capacity planning and scaling
    • Introduction
    • What is scalability?
    • Prepare for growth
    • Capacity planning considerations
    • Make applications scalable
    • Go global
    • Summary
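
As a rough illustration of the Module 2 topic "Service level indicators (SLIs) and service level objectives (SLOs)", the sketch below computes a request-based availability SLI and compares it with an SLO target. It is not taken from the course material: the function names, the request counts, and the 99.9% target are all hypothetical, and a real SLI would normally be derived from monitoring data rather than hard-coded numbers.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests that succeeded (a request-based availability SLI)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means no budget has been used; a value below 0 means the SLO is missed.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical numbers: 100,000 requests, 50 of which failed, against a 99.9% SLO.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```

With these assumed numbers, roughly half of the allowed failures have been used. A team might alert when the remaining budget drops quickly rather than on every individual failure, which fits the course's emphasis on actionable alerts.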
