Tuesday, December 16, 2014

Building Fault-tolerant Software

Building a software system or app that is fail-safe, let alone fault-tolerant, is not an easy task. It is especially hard for cloud apps and services, since they are often platform-agnostic and may depend on network availability.

I often ask myself how to preserve a good user experience even in unexpected scenarios. The answer varies from system to system and case to case, but here are some simple questions I deliberate on when designing for error handling and fault tolerance:
 
1. What corner cases might the app or system encounter? Can the system handle them?
2. When an internal or external error occurs, can the system recover from it automatically? Can it prevent the error from happening again?
3. If not, can the system continue its intended operation by offering users a workaround?
4. If not, can the system suggest manual steps users can take to resolve the error?
5. If not, can the system fail gracefully with a proper error message for users?
6. Finally, does the error need to be logged or sent off for further analysis?
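
One way to picture questions 2 through 6 is as a chain of fallbacks in code. The following is only a minimal Python sketch under my own assumptions: the names `run_with_fallbacks`, `Outcome`, `workaround`, and `manual_hint` are hypothetical and made up for illustration, and a real system would recover from specific, known errors rather than catching every exception.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("fault_tolerance_sketch")  # hypothetical logger name


@dataclass
class Outcome:
    value: Optional[str]  # the result, if any step succeeded
    stage: str            # "primary", "recovered", "workaround", or "failed"
    message: str = ""     # user-facing text when everything failed


def run_with_fallbacks(
    primary: Callable[[], str],
    workaround: Optional[Callable[[], str]] = None,
    retries: int = 2,
    manual_hint: str = "",
) -> Outcome:
    last_error: Optional[Exception] = None

    # Question 2: try to recover automatically, here by simply retrying.
    for attempt in range(retries + 1):
        try:
            value = primary()
            return Outcome(value, "primary" if attempt == 0 else "recovered")
        except Exception as exc:  # broad on purpose; this is only a sketch
            last_error = exc
            # Question 6: record the error for later analysis.
            logger.warning("attempt %d failed: %s", attempt + 1, exc)

    # Question 3: keep the user going with a workaround, if one exists.
    if workaround is not None:
        try:
            return Outcome(workaround(), "workaround")
        except Exception as exc:
            last_error = exc
            logger.warning("workaround failed: %s", exc)

    # Questions 4 and 5: fail gracefully and suggest a manual fix if we have one.
    message = "Something went wrong."
    if manual_hint:
        message += " " + manual_hint
    logger.error("giving up: %s", last_error)
    return Outcome(None, "failed", message)


def rich_view() -> str:
    # Stand-in for a feature that cannot work in the current environment.
    raise RuntimeError("JavaScript is disabled")


result = run_with_fallbacks(rich_view, workaround=lambda: "basic HTML view")
print(result.stage, result.value)
```

In this sketch the order of the steps mirrors the order of the questions: the user only sees an error message once automatic recovery and the workaround have both been exhausted, and every failure is logged along the way.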

A simple example is how different JavaScript-dependent websites react when the browser has JavaScript support disabled (as of December 2014):

1. My Facebook personal homepage: shows neither content nor an error message, only an app bar.
 

2. My Live.com homepage: shows a proper error message along with a solution.


3. My Gmail homepage after logging in: provides both a proper error message with a solution and, excitingly, an HTML version of the mail page as a workaround.

 
 
Uncovering software failures takes extensive validation, and remedying them takes careful error-handling design. But this process eventually benefits users, IT staff, and developers!