Reliability
Hardware faults
Hard disk crash
RAM becomes faulty
Network device is not accessible
Etc.
Potential solutions
Add redundant hardware components, e.g. RAID for disks
Add a standby machine and fail over to it if the primary fails (high availability, HA)
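A minimal sketch of the failover idea, assuming each machine is represented by a callable that raises ConnectionError when it is down. The names call_with_failover, primary, and standby are hypothetical and used only for illustration; a real HA setup would also need health checks and protection against split-brain.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    last_error: Optional[Exception] = None
    for replica in replicas:
        try:
            return replica()
        except ConnectionError as exc:
            # Assumption: ConnectionError means "this machine is down".
            last_error = exc
    raise RuntimeError("all replicas failed") from last_error


def primary() -> str:
    # Hypothetical primary that simulates a hardware fault.
    raise ConnectionError("primary is unreachable")


def standby() -> str:
    # Hypothetical standby that is healthy.
    return "served by standby"


result = call_with_failover([primary, standby])
print(result)
```

The caller never sees the primary's failure as long as one replica answers; only when every replica fails does the error surface.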
Software errors
Software bugs
Running out of a shared resource
Etc.
Potential solutions
Comprehensive testing
Process isolation
Process life cycle management (auto restart)
Monitoring
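Process isolation and auto restart can be sketched together as a tiny supervisor that runs the worker in a child process and restarts it when it exits with a non-zero code. This is an illustrative sketch only: the function name supervise and its parameters max_restarts and backoff_s are assumptions, and production systems would normally rely on an existing supervisor (for example systemd or a container orchestrator) rather than hand-written code.

```python
import subprocess
import sys
import time
from typing import Sequence


def supervise(cmd: Sequence[str], max_restarts: int = 3, backoff_s: float = 0.05) -> int:
    """Run cmd in a child process and restart it when it exits non-zero.

    Returns the number of restarts that were needed before a clean exit.
    Raises RuntimeError once max_restarts has been exhausted.
    """
    restarts = 0
    while True:
        # The worker runs in its own process, so a crash cannot corrupt
        # the supervisor's memory (process isolation).
        exit_code = subprocess.run(list(cmd)).returncode
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up after {restarts} restarts")
        restarts += 1
        # Minimal monitoring hook: report the crash before restarting.
        print(f"worker exited with {exit_code}, restart #{restarts}", file=sys.stderr)
        time.sleep(backoff_s * restarts)  # simple linear backoff


# A worker that exits cleanly needs no restart.
restarts_needed = supervise([sys.executable, "-c", "pass"])
print(restarts_needed)
```

Capping the number of restarts avoids a tight crash loop when the fault is persistent rather than transient.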
Human errors
Misconfiguring the system
Mistakes during operations
Potential solutions
Provide a sandbox environment
Comprehensive tests that cover the negative cases
Provide rollback mechanism
Provide telemetry metrics
Training
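The rollback and negative-test ideas can be sketched with a small configuration store that validates changes before applying them and keeps earlier versions so that a bad change can be undone. The class ConfigStore and the max_connections setting are hypothetical, chosen only to keep the example short.

```python
from typing import Dict, List


class ConfigStore:
    """Hypothetical store that keeps every applied config for rollback."""

    def __init__(self, initial: Dict[str, int]) -> None:
        self._history: List[Dict[str, int]] = [dict(initial)]

    @property
    def current(self) -> Dict[str, int]:
        return self._history[-1]

    def apply(self, new: Dict[str, int]) -> None:
        # Reject obviously invalid input before it takes effect.
        if new.get("max_connections", 0) <= 0:
            raise ValueError("max_connections must be positive")
        self._history.append(dict(new))

    def rollback(self) -> None:
        # Drop the latest version, but never the initial one.
        if len(self._history) > 1:
            self._history.pop()


store = ConfigStore({"max_connections": 100})
store.apply({"max_connections": 200})
store.rollback()
print(store.current)  # {'max_connections': 100}
```

A negative test for this sketch would assert that apply raises ValueError for a non-positive max_connections and that the current configuration is left unchanged.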