Reliability

• Ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues

• Design Principles

  • Test recovery procedures - Use automation to simulate different failures or to recreate scenarios that led to failures before

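  • Automatically recover from failure - Monitor key performance indicators (KPIs) and trigger automation when a threshold is breached, so that failures are anticipated and remediated before they occur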

  • Scale horizontally to increase aggregate system availability - Distribute requests across multiple, smaller resources to ensure that they don't share a common point of failure

  • Stop guessing capacity - Monitor demand and utilization, and automatically add or remove resources (for example, with Auto Scaling) to maintain the optimal level to satisfy demand without over- or under-provisioning

  • Manage change in automation - Use automation to make changes to infrastructure
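
Two of the principles above involve simple arithmetic that a short sketch can make concrete. The Python below is illustrative only: `aggregate_availability` assumes resources fail independently and that any single healthy resource can serve traffic, and `desired_capacity` is a hypothetical proportional sizing rule, not the algorithm Auto Scaling actually uses. Both function names and all sample figures are assumptions introduced for this example.

```python
import math


def aggregate_availability(per_resource: float, count: int) -> float:
    """Availability of `count` redundant resources.

    Assumes failures are independent (no shared common point of failure)
    and that one healthy resource is enough to serve requests.
    """
    if not 0.0 <= per_resource <= 1.0:
        raise ValueError("per_resource must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    # The system is down only when every resource is down at once.
    return 1.0 - (1.0 - per_resource) ** count


def desired_capacity(current: int, observed: float, target: float) -> int:
    """Hypothetical proportional sizing rule (an assumption for illustration).

    Scales the resource count so that the observed per-resource metric
    (for example, CPU utilization) moves toward the target value.
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if target <= 0:
        raise ValueError("target must be positive")
    return max(1, math.ceil(current * observed / target))


if __name__ == "__main__":
    # Scale horizontally: more small resources raise aggregate availability.
    for n in (1, 2, 3):
        print(n, f"{aggregate_availability(0.99, n):.6f}")

    # Stop guessing capacity: size the fleet from observed demand.
    print(desired_capacity(current=4, observed=90.0, target=60.0))
```

Under the independence assumption, each added resource multiplies the chance of a total outage by the single-resource failure probability, which is why several small resources can be more available than one large one. Resources that share a dependency (the same rack, zone, or configuration) break that assumption and gain less.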
