Hard disk crash
RAM becomes faulty
Network device is not accessible
Etc.
Add redundant hardware components, e.g. RAID for disks
Add a standby machine and fail over to it if the primary fails (high availability, HA)
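
The failover idea above can be sketched as a small selection routine. This is only an illustration under stated assumptions: the node names, the `pick_active` helper, and the `health` table are hypothetical, and a real HA setup would replace the lookup with an actual health probe (heartbeat, TCP check, and so on).

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node, preferring earlier entries.

    The primary is listed first, so a standby is chosen only when
    every node before it fails its health check.
    """
    for node in nodes:
        if is_healthy(node):
            return node
    return None  # nothing is healthy: no node can serve traffic


# Hypothetical health state: the primary is down, the standby is up.
health = {"primary": False, "standby": True}
active = pick_active(["primary", "standby"], lambda name: health[name])
```

Listing the primary first keeps the policy simple: traffic returns to the primary as soon as its health check passes again.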
Software bug
Running out of a shared resource
Comprehensive testing
Process isolation
Process life cycle management (auto restart)
Monitoring
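
Process isolation and auto restart can be illustrated with a minimal supervisor loop. This is a sketch, not a production process manager: the `supervise` function, its parameters, and the backoff policy are assumptions made for illustration. It relies only on the standard `subprocess.run` call, which starts the command in a separate child process.

```python
import subprocess
import sys
import time
from typing import List


def supervise(cmd: List[str], max_restarts: int = 3,
              backoff_seconds: float = 0.1) -> int:
    """Run cmd as a child process and restart it when it exits non-zero.

    Returns how many times the command was started. Raises RuntimeError
    once the restart budget is used up.
    """
    starts = 0
    while True:
        starts += 1
        # The child runs in its own process, so a crash there does not
        # take the supervisor down with it.
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return starts
        if starts > max_restarts:
            raise RuntimeError(
                f"gave up after {starts} starts, "
                f"last exit code {result.returncode}"
            )
        time.sleep(backoff_seconds * starts)  # wait longer after each failure


# A command that exits cleanly is started exactly once.
clean_starts = supervise([sys.executable, "-c", "pass"])
```

Capping the number of restarts matters: a process that fails immediately on every start would otherwise loop forever and hide the underlying bug from monitoring.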
Misconfigured system
Operation failure
Provide a sandbox environment
Write comprehensive tests that cover the negative cases
Provide rollback mechanism
Provide telemetry metrics
Training
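
A rollback mechanism for configuration changes might look like the sketch below. Everything here is hypothetical: the `apply_with_rollback` helper, the `Config` type, and the `max_connections` setting are made up for illustration. The candidate configuration is built on a copy, validated, and discarded if validation fails, so the running configuration is never left in a bad state.

```python
from typing import Callable, Dict

Config = Dict[str, int]


def apply_with_rollback(current: Config, change: Config,
                        validate: Callable[[Config], bool]) -> Config:
    """Return the new config if it validates, else the unchanged original."""
    candidate = {**current, **change}  # build on a copy, never in place
    if validate(candidate):
        return candidate
    return dict(current)  # roll back: keep the last known-good config


def valid(cfg: Config) -> bool:
    # Hypothetical rule: the connection limit must be positive.
    return cfg.get("max_connections", 0) > 0


running = {"max_connections": 100}
after_good = apply_with_rollback(running, {"max_connections": 200}, valid)
after_bad = apply_with_rollback(running, {"max_connections": -5}, valid)
```

The rejected change in `after_bad` is also an example of a negative test case: it checks that an invalid input is refused rather than only checking that valid input works.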