High availability is the term used to describe the level of system availability normally expected by users. This level of availability is typically demanded of critical systems that provide essential services to the general public, such as communication systems, e-commerce systems, and banking services such as Automated Teller Machines (ATMs). A high availability system must be able to perform under stated conditions for a stated period of time (reliability); be able to easily bypass and recover from a component failure; and support effective problem determination, diagnosis, and repair. This paper discusses various high availability technologies and the roles they play in maintaining system uptime. Examples of these technologies are grouped by function, namely: redundancy, fault tolerance, clustering, partitioning, automation, security mechanisms, and caching.

Redundancy: a technique in which system components are identically duplicated so that if one component fails, the system can continue to function. The purpose is to enhance reliability by masking system failures from users, and to enable instant recovery, since the redundant component is automatically used in place of the failed one. Redundancy is thus a powerful tool for improving system availability. Examples include mirrored databases, RAID technology, TCP/IP-based Internet communication, and the dual network interface cards (NICs) found in high availability servers designed to function as file or application servers.
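The mirrored-database idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not a real product's API: every write goes to two identical stores, so reads still succeed, transparently, when the primary fails.

```python
# Hypothetical sketch of component redundancy: writes are mirrored to two
# identical stores so that the failure of one is masked from the user.
class MirroredStore:
    def __init__(self):
        self.primary = {}     # e.g. the primary disk in a mirrored pair
        self.mirror = {}      # identical redundant copy
        self.primary_up = True

    def write(self, key, value):
        # Both copies receive every write, so either can serve reads alone.
        self.primary[key] = value
        self.mirror[key] = value

    def read(self, key):
        # If the primary has failed, the mirror is used automatically:
        # recovery is instant and invisible to the caller.
        source = self.primary if self.primary_up else self.mirror
        return source[key]
```

A caller who writes a value, loses the primary, and reads again still gets the value back, which is exactly the masking behavior described above.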

Fault tolerance: systems or their components are designed to continue functioning even under faulty (internal or external error) conditions. That is, they buy time to identify and resolve the cause of the problem. For example, Error Checking and Correcting (ECC) memory stores check bits generated by a special algorithm within the memory system itself. The algorithm validates each retrieved piece of data against the stored code, and any detected bad data or incorrect bit is fixed automatically. This keeps the system reliable, since it remains operational despite the problem, and gives the IT team time to analyze the root cause. However, because ECC memory depends on the specific algorithm implemented, it can correct only a limited number of erroneous bits. Other examples of fault tolerance include RAID-5 storage systems, and the hot fixing feature of Windows NT and Windows 2000, which handles write errors.
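The ECC behavior described above can be illustrated with a classic Hamming(7,4) code, which protects 4 data bits with 3 check bits and can correct exactly one flipped bit, matching the point that such algorithms correct only a limited number of errors. This is a textbook sketch, not the algorithm of any particular memory controller.

```python
# Illustrative single-bit error correction in the spirit of ECC memory,
# using a Hamming(7,4) code: 4 data bits protected by 3 check bits.
def encode(d):
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def decode(c):
    # Recompute the checks; the syndrome is the 1-based position of a
    # single flipped bit (0 means no error detected).
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1   # fix the bad bit automatically
    return [c[2], c[4], c[5], c[6]]   # extract the original data bits
```

Flipping any single bit of an encoded word and decoding it returns the original data, while a two-bit error would exceed what this code can correct.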

Partitioning: a technique more effective at limiting the scope of outages than at preventing them, calls for dividing or splitting the system so that when a fault occurs in one subsystem, it does not affect the others. When systems are isolated adequately, there is less risk of a system-wide outage; subsystems are easier to manage since they are less complex; the risk of changes is minimized; resource contention is reduced; and recovery is simpler and faster. For example, critical applications can be isolated from non-critical applications, and data can be isolated from applications by storing it in a different storage location. Likewise, separate file servers can be implemented for each department, instead of a single centralized server for the entire company.
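The per-department file server example can be sketched as follows. The department names and server class are hypothetical; the point is that an outage in one partition leaves the others serving normally.

```python
# Hypothetical sketch of partitioning: each department has its own file
# store, so a fault in one partition cannot take down the others.
class DepartmentServer:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.files = {}

    def get(self, filename):
        if not self.healthy:
            raise RuntimeError(f"{self.name} server is down")
        return self.files[filename]

servers = {dept: DepartmentServer(dept) for dept in ("sales", "hr")}
servers["sales"].files["q1.xls"] = "sales data"
servers["hr"].files["staff.doc"] = "hr data"

# Simulate an outage confined to one partition.
servers["sales"].healthy = False
```

After the simulated sales outage, the hr partition still answers requests; with one centralized server, the same fault would have been a company-wide outage.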

System backup and recovery: backup is the process of making copies of system data or files and storing them on standby in a secure location, or in multiple locations, to be used when the system fails. Recovery is the process of restoring the system and bringing it current using the backed-up copies. For example, a standby recovery server can be implemented so that two servers share the same set of storage systems. If the primary server fails, the storage system is automatically switched over to the backup server, which brings the system up within minutes. This improves system availability, since it eliminates the need to schedule maintenance during off hours and users retain access.
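A minimal sketch of the backup-then-recover cycle, assuming the system state is a simple serializable structure and the "secure location" is just a file path, is:

```python
import json, os, tempfile

# Minimal sketch of backup and recovery: copy system state to a standby
# location, then restore it after a simulated failure.
def backup(state, path):
    with open(path, "w") as f:
        json.dump(state, f)

def recover(path):
    with open(path) as f:
        return json.load(f)

state = {"accounts": {"alice": 250}}
backup_path = os.path.join(tempfile.gettempdir(), "system.bak")
backup(state, backup_path)

state = None                  # simulated failure: in-memory state is lost
state = recover(backup_path)  # bring the system current from the copy
```

A real standby recovery server automates exactly this hand-off, switching the shared storage to the backup machine instead of restoring from a file.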

Clustering: a cluster is a type of parallel or distributed processing system consisting of a collection of interconnected stand-alone computers or servers working together as a single integrated computing resource. Cluster technology provides load distribution and high availability, and it permits organizations to boost their processing power by expanding without incurring extra costs, using standard technology that can be acquired at low cost. In addition, application performance improves with the support of scalable software and automatic load sharing. Further, failover capability allows a backup computer or server to take over the tasks of a failed server in the cluster. An example of failover clustering is a pair of Dell PowerEdge servers configured to share external data storage devices: if one server fails, the other assumes the storage handled by the failed server.
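The failover behavior can be sketched with a dispatcher that sends each request to the first healthy node. The node names are illustrative; a real cluster manager adds health checks, shared storage hand-off, and state replication on top of this basic idea.

```python
# Hypothetical failover sketch: requests go to the first healthy node in
# the cluster; if the primary fails, the backup takes over its work.
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True

    def handle(self, request):
        return f"{self.name} served {request}"

def dispatch(cluster, request):
    for node in cluster:
        if node.up:
            return node.handle(request)
    raise RuntimeError("all cluster nodes are down")

cluster = [Node("server-a"), Node("server-b")]
cluster[0].up = False          # primary fails; server-b takes over
```

The caller never changes: `dispatch` hides which physical server answered, which is what makes the cluster look like a single integrated resource.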

Caching: delivers an object to the user from the closest site that holds the content the user is requesting. Busy server CPUs can also be relieved by caching, preventing bottlenecks due to high CPU utilization. Examples of caches include web caches, server caches, iCaches, and so forth. A cache keeps the most frequently requested objects in memory and sends them to clients without the usual processing. With some web caches, for example, bandwidth is saved because each request reaches the origin only once, optimizing web site performance; application scalability improves because fewer requests reach the servers; large files can be staged for more rapid transfer across the network; and load can be distributed among several web servers. Intangible benefits may include customer retention and employee satisfaction, due to better online and work experiences respectively, and improved company-wide communications through messaging and online training and development.
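The "request occurs only once" behavior can be shown with a minimal in-memory cache in front of a slow origin fetch. The `origin` function here stands in for whatever expensive lookup a real web or server cache would protect.

```python
# Simple sketch of a cache: frequently requested objects are kept in
# memory, so repeat requests skip the expensive origin fetch entirely.
class Cache:
    def __init__(self, fetch):
        self.fetch = fetch        # the slow origin lookup
        self.store = {}
        self.origin_requests = 0  # how often we actually hit the origin

    def get(self, key):
        if key not in self.store:              # cache miss: fetch once
            self.store[key] = self.fetch(key)
            self.origin_requests += 1
        return self.store[key]                 # cache hit: from memory

def origin(url):
    return f"content of {url}"   # stand-in for the real, costly fetch

cache = Cache(origin)
```

Repeated requests for the same object touch the origin only on the first call, which is where the bandwidth and server-load savings described above come from. A production cache would also bound its size and expire stale entries.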

Automation: automated operations are designed to reduce or replace manual procedures for running the computer system with tools or programs that simulate or bypass the decisions and functions normally performed by humans. System tasks that are highly repetitive, prone to human error, difficult to monitor, or difficult to enforce, such as security procedures, should be automated. Examples include "Availability monitoring tools that transmit and receive up/down polling information, many times emulating mission-critical traffic to gauge availability and latency; Security management tools that transmit and receive authentication, and authorization information and may perform vulnerability tracking services; Fault management tools that receive SNMP traps and syslog event messages and so on" (Cisco). These tools are prime candidates for high availability because automation makes it easier to detect events and respond immediately, eliminates problems that result from human error, and performs its tasks accurately within the defined parameters.
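The up/down polling that availability monitoring tools perform can be sketched as follows. The hostnames and the probe are simulated stand-ins; a real tool would probe with ICMP pings or TCP health checks and raise alerts through a management console.

```python
# Hypothetical sketch of an availability monitoring tool: poll each host,
# record up/down status, and flag outages automatically rather than
# waiting for a human to notice.
def poll(host, probe):
    try:
        probe(host)              # e.g. a ping or TCP health check
        return "up"
    except Exception:
        return "down"

def monitor(hosts, probe):
    status = {host: poll(host, probe) for host in hosts}
    alerts = [h for h, s in status.items() if s == "down"]
    return status, alerts

# Simulated probe: one host is unreachable.
def fake_probe(host):
    if host == "db01":
        raise ConnectionError("no response")

status, alerts = monitor(["web01", "db01"], fake_probe)
```

Run on a schedule, such a loop detects an outage within one polling interval and responds the same way every time, which is the accuracy and immediacy the paragraph attributes to automation.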

Security mechanisms: most organizations use some form of authentication requirement and procedure as a first line of defense to thwart unauthorized users from illegally accessing data stored in the company database. Typical system security includes creating profiles, roles, and user accounts, and assigning them privileges and passwords, thereby restricting users to the specific activities they may perform in that database. Security rules and procedures are set to determine which users can access the system, and which data each user is allowed to access (Holden, 2003). There are various other security measures that can be implemented within any computer system or network, but those mentioned above are the most common; used and reinforced properly, they can prevent unauthorized users from accessing the system and performing illegal activities that might crash it, making it unavailable. Furthermore, access control mechanisms should be tested and audited, and security tools updated, as often as needed.
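The account/role/privilege pattern described above can be sketched as follows. The user names, roles, and permissions are illustrative, and a production system should use a salted, deliberately slow password hash (e.g., bcrypt or PBKDF2) rather than bare SHA-256.

```python
import hashlib

# Minimal sketch of authentication plus role-based restriction: passwords
# are stored only as hashes, and each account is limited to the
# activities its role allows. Names and roles are hypothetical.
PERMISSIONS = {"admin": {"read", "write"}, "clerk": {"read"}}

def hash_pw(password):
    # Demo only: real systems need a salted, slow hash.
    return hashlib.sha256(password.encode()).hexdigest()

users = {"alice": {"pw": hash_pw("s3cret"), "role": "clerk"}}

def authorize(username, password, action):
    user = users.get(username)
    if user is None or user["pw"] != hash_pw(password):
        return False                       # authentication failed
    return action in PERMISSIONS[user["role"]]
```

A clerk who authenticates correctly can read but not write, and a wrong password denies everything, which is the restriction-to-specified-activities the paragraph describes.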

“Enact basic administrative practices: This includes such things as training for IT staff members, defining and enforcing security measures, and enforcing configuration change control.”

References:
Arregoces, M. (2006). Data Center Fundamentals. Cisco Press

Benson, A. (1996). Client/Server Architecture. 2nd Ed. McGraw-Hill

Buyya, R. (1999). High Performance Cluster Computing, Volume 1. Prentice Hall PTR

Brown, K. et al. (2003). Enterprise Java Programming with IBM WebSphere.
Addison-Wesley

Cisco. Network Management: Implementing and Operating High Availability Solutions.
http://www.cisco.com/en/US/technologies/tk869/tk769/white_paper_c11-449655.html

Holden, G. (2003). Guide to Network Defense and Countermeasures. Thomson Course Technology

Marcus, E. & Stern, H. (2003). Blueprints for High Availability. 2nd Ed.
    Wiley

Mohan, C. (2001). Caching Technologies for Web Applications. 12 September 2001.
http://www.almaden.ibm.com/u/mohan/Caching_VLDB2001.pdf