Reliability

Hardware faults

  • Hard disk crash

  • RAM becomes faulty

  • Network device is not accessible

  • etc.

Potential solutions

  • Add redundant hardware components, e.g. RAID for disks

  • Add a standby machine and fail over to it if the primary fails (high availability, HA)
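
The failover idea above can be illustrated with a minimal sketch. It assumes each machine is represented by a plain Python callable and that a hardware fault shows up as a ConnectionError; the names call_with_failover, primary, and standby are hypothetical and exist only for this example.

```python
from typing import Callable, Optional, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order (primary first) and return the first success."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # Remember the failure and fall through to the next replica.
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    # Hypothetical primary machine: simulates a hardware fault.
    raise ConnectionError("primary is down")


def standby() -> str:
    # Hypothetical standby machine: healthy.
    return "served by standby"


result = call_with_failover([primary, standby])
print(result)  # prints: served by standby
```

A real HA setup would also need health checks and a way to keep the standby's state in sync with the primary; the sketch only shows the switch-over decision.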

Software errors

  • Software bug

  • Running out of a shared resource

  • etc.

Potential solutions

  • Comprehensive testing

  • Process isolation

  • Process life cycle management (auto restart)

  • Monitoring
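
As a rough illustration of life cycle management with auto restart, the sketch below restarts a crashing task up to a fixed budget and prints a line per crash as a stand-in for monitoring. It runs the task in-process for simplicity, which is an assumption of the example; in practice the supervised unit would usually be a separate OS process managed by a dedicated supervisor. The names supervise and flaky_worker are hypothetical.

```python
import time
from typing import Callable


def supervise(task: Callable[[], None], max_restarts: int = 3,
              backoff_s: float = 0.0) -> int:
    """Run task, restarting it after a crash; return the number of restarts used."""
    restarts = 0
    while True:
        try:
            task()
            return restarts  # clean exit: stop supervising
        except Exception as exc:
            if restarts >= max_restarts:
                raise RuntimeError("restart budget exhausted") from exc
            restarts += 1
            # Stand-in for a monitoring hook (metric, log line, alert).
            print(f"task crashed ({exc!r}); restart {restarts}/{max_restarts}")
            time.sleep(backoff_s)


attempts = {"count": 0}


def flaky_worker() -> None:
    # Hypothetical worker that fails on its first two runs.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("simulated bug")


restarts_used = supervise(flaky_worker)
print(restarts_used)  # prints: 2
```

The restart budget matters: restarting forever can hide a bug that crashes the task on every run, so the supervisor gives up and surfaces the error once the budget is spent.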

Human errors

  • Misconfigured system

  • Operational mistakes

Potential solutions

  • Provide a sandbox environment

  • Comprehensive tests that cover negative cases

  • Provide a rollback mechanism

  • Provide telemetry metrics

  • Training
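
One possible shape for a rollback mechanism is a configuration store that validates every change before applying it and keeps earlier versions so a bad change can be undone quickly. This is a sketch under assumptions: the class VersionedConfig, the helper is_valid, and the max_connections setting are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


class VersionedConfig:
    """Keeps every accepted configuration so the latest change can be undone."""

    def __init__(self, initial: Config) -> None:
        self._history: List[Config] = [dict(initial)]

    @property
    def current(self) -> Config:
        return dict(self._history[-1])  # copy, so callers cannot mutate history

    def apply(self, change: Config, validate: Callable[[Config], bool]) -> bool:
        """Merge change into the current config if the result passes validation."""
        candidate = {**self._history[-1], **change}
        if not validate(candidate):
            return False  # negative case: reject the misconfiguration
        self._history.append(candidate)
        return True

    def rollback(self) -> Config:
        """Drop the latest version, keeping at least the initial one."""
        if len(self._history) > 1:
            self._history.pop()
        return self.current


def is_valid(cfg: Config) -> bool:
    # Hypothetical rule: the connection limit must be positive.
    return cfg["max_connections"] > 0


store = VersionedConfig({"max_connections": 100})
rejected = store.apply({"max_connections": 0}, is_valid)    # invalid change
accepted = store.apply({"max_connections": 500}, is_valid)  # valid change
after_apply = store.current
after_rollback = store.rollback()
print(after_apply, after_rollback)
```

Validation catches mistakes that can be detected up front, while rollback covers the ones that only become visible after the change is live.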
