Reliability

Hardware faults

Hard disk crash
RAM becomes faulty
Network device is not accessible
etc.

Potential solutions

Add redundant hardware components, e.g. RAID for disks
Add a standby machine and fail over to it if the primary fails (high availability, HA)
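
The failover idea above can be sketched in a few lines. This is a minimal illustration, not a production HA setup: the names `call_with_failover`, `primary`, and `standby` are hypothetical, and a `ConnectionError` is assumed to mean "the node is down".

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order (primary first) and return the first success."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:  # assumption: this signals a dead node
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    # Simulated hardware fault on the primary machine.
    raise ConnectionError("primary is unreachable")


def standby() -> str:
    return "served by standby"


if __name__ == "__main__":
    print(call_with_failover([primary, standby]))
```

Real systems usually detect failure with health checks or heartbeats rather than waiting for a request to fail, but the control flow is the same: the caller only sees an error when every replica is down.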

Software errors

Software bugs
Running out of a shared resource
etc.

Potential solutions

Comprehensive testing
Process isolation
Process life cycle management (auto restart)
Monitoring
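
Process isolation and automatic restart can be combined in a small supervisor loop. The sketch below is one possible shape, under stated assumptions: `supervise`, `max_restarts`, and `backoff_s` are hypothetical names, and a non-zero exit code is assumed to mean the worker crashed.

```python
import subprocess
import sys
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3, backoff_s: float = 0.1) -> int:
    """Run cmd in a child process and restart it when it exits non-zero.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError once max_restarts is exhausted.
    """
    restarts = 0
    while True:
        # The worker runs in its own process, so a crash cannot take
        # the supervisor down with it.
        exit_code = subprocess.run(cmd).returncode
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts (exit code {exit_code})"
            )
        # Exponential backoff avoids a tight crash loop.
        time.sleep(backoff_s * 2 ** restarts)
        restarts += 1


if __name__ == "__main__":
    # A worker that exits cleanly needs no restart.
    print(supervise([sys.executable, "-c", "pass"]))
```

In practice this role is usually played by an init system or an orchestrator; the loop only shows the mechanism. Capping the number of restarts matters, because a bug that crashes the process on every start would otherwise be retried forever.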

Human errors

Misconfiguring the system
Operational mistakes

Potential solutions

Provide a sandbox environment
Comprehensive tests that cover negative cases
Provide rollback mechanism
Provide telemetry metrics
Training
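
A rollback mechanism for configuration changes could look roughly like the following. It is a sketch only: `ConfigStore`, `apply`, and the `health_check` callback are hypothetical names, and the health check stands in for whatever validation a real system would run.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


class ConfigStore:
    """Keeps every applied config so that a bad change can be undone."""

    def __init__(self, initial: Config) -> None:
        self._history: List[Config] = [dict(initial)]

    @property
    def current(self) -> Config:
        return dict(self._history[-1])

    def apply(self, new: Config, health_check: Callable[[Config], bool]) -> bool:
        """Apply new; roll back automatically if the health check fails."""
        self._history.append(dict(new))
        if health_check(new):
            return True
        self.rollback()
        return False

    def rollback(self) -> None:
        """Return to the previous config; the initial one is never dropped."""
        if len(self._history) > 1:
            self._history.pop()


def has_capacity(cfg: Config) -> bool:
    # Assumed validation rule for this example.
    return cfg.get("max_connections", 0) > 0


if __name__ == "__main__":
    store = ConfigStore({"max_connections": 100})
    store.apply({"max_connections": 0}, has_capacity)  # human error
    print(store.current)
```

Because the previous state is kept, a mistaken change costs one failed health check instead of an outage, and an operator can also call `rollback()` by hand.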

