Network Self-Fault Management Based on Multi-Intelligent Agents and Windows Management Instrumentation (WMI)

This paper proposes a new method for network self-fault management (NSFM) based on two technologies: intelligent agents, which automate fault management tasks, and Windows Management Instrumentation (WMI), which identifies faults faster when the resources are independent (different types of devices). The proposed NSFM reduces network traffic by reducing the requests and responses exchanged between the server and the client, which results in less downtime for each node when a fault occurs in the client. The performance of the proposed system is measured by three criteria: efficiency, availability, and reliability. Depending on the faults that occurred in the system, the average efficiency reaches 92.19%, the availability 92.375%, and the reliability 100%. The proposed system manages five devices. The NSFM is implemented in Java and C#.


Introduction:
A network management system (NMS) is a set of hardware and/or software tools that allows an Information Technology (IT) professional to supervise the individual components of a network within a larger network management framework [1]. Autonomic management systems promise to provide guaranteed, smooth, and autonomous network services and operations. Autonomic computing has four basic self-X functions [2]:
• Self-configuring: systems adapt to dynamically changing environments.
• Self-healing: systems discover, diagnose, and react to disruptions.
• Self-optimizing: systems monitor and tune resources automatically.
• Self-protecting: systems anticipate, detect, identify, and protect themselves from attacks from anywhere.
Network management is divided into five functional areas by the International Organization for Standardization (ISO) network management forum. [...] interaction beyond previous implementations by producing a list of potential root causes in decreasing order of likelihood, which brings the state of the art one step closer to fully self-healing systems.
M. Toy [6] (2014) introduced the self-managed network, which identifies network failures and repairs them, and the self-configuring network, which configures network resources and services. The architectures of the Self-managed Network Element (sNE) and the Network Management System (sNMS) for centrally managed networks are described in that work. A hierarchy among repairing entities is defined, and an in-band message format for Metro Ethernet networks is proposed for fault management communication. In a self-managed network, the network itself identifies, isolates, and fixes faults, and technicians are needed at the failure site only when a single point of hardware fails. The operational cost is therefore reduced.
D. Mitrovic et al. [7] (2010) presented an easy and flexible way of adding fault tolerance to existing agent frameworks. The approach uses two new types of mobile agents: the Connection Agent and the Remnant Agent. These mobile agents manage the efficient construction and maintenance of fault-tolerant multi-agent system networks and implement a robust agent tracking technique.

Network Fault Management:
The purpose of fault management is to detect, diagnose, repair, and report faults in network equipment so that the network runs efficiently and service failures are avoided. The functions of fault management are alarm surveillance, fault localization, management, testing, fault correction, and trouble administration [8]. Fault management involves several steps, explained in the following [9]:
• Data collection and modeling: errors can be reported by monitoring devices.
• Detection: the errors are analyzed and the type of each error is defined.
• Isolation: procedures and tools aid in isolating a fault when an operational device suddenly fails.
• Recovery: recovery actions are within the scope of external signaling for automated or manual correction of the problem.

Intelligent Agent:
An intelligent agent can be defined as software that assists people and acts on their behalf, allowing them to delegate work that they could have done themselves. Agents can perform repetitive tasks, intelligently summarize complex data, remember things you forgot, learn from you, and even make recommendations to you [10]. An intelligent agent has several characteristics [11]:
• Autonomous: the agent has control over all of its own actions.
• Goal-driven: the agent has a purpose and acts in accordance with that purpose.
• Social: the agent interacts or communicates with other agents.
• Reactive: the agent senses the dynamic environment and responds to changes in a timely fashion.
• Customized or adaptive: the agent learns, or changes its behavior, based on previous experience.
• Mobile: the agent moves from machine to machine.

Windows Management Instrumentation (WMI) For Network Management:
WMI provides management data and functionality for Microsoft Windows operating systems running on local and remote computers. WMI management data can be obtained directly through enterprise management tools such as Microsoft Operations Manager (MOM) and Microsoft Systems Management Server (SMS), or through scripts and applications. Scripts can be written in any scripting language that works with Windows Script Host. WMI data can also be obtained over the WS-Management protocol through Windows Remote Management. WMI is the Microsoft implementation of Web-Based Enterprise Management (WBEM), an industry initiative to establish standards for sharing and accessing management information over an enterprise network. WBEM enables the industry to deliver a well-integrated set of standards-based management tools, facilitating the exchange of data across otherwise disparate technologies and platforms [12]. WMI includes a CIM-compliant object repository and the CIM Object Manager. The object repository contains object definitions that supply data for managing hardware and software. Examples of WMI classes are the Win32 classes, such as Win32_Printer or Win32_ComputerSystem, and StdRegProv, which supplies registry data. The CIM Object Manager handles the collection and manipulation of objects in the repository and gathers information from WMI providers. WMI providers act as intermediaries between WMI and components of the operating system, drivers, applications, and other systems [13].
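As an illustration of how WMI data of this kind might be consumed, the following Python sketch builds a WQL query against the Win32_PnPEntity class and decodes its ConfigManagerErrorCode property. It is a sketch under stated assumptions rather than the system's own code: the records are plain dictionaries standing in for the objects a WMI client would return, the helper names are hypothetical, and only a handful of the documented error codes are listed.

```python
# Hypothetical helper names; records are plain dicts standing in for WMI objects.

# A few documented ConfigManagerErrorCode values (not the full list).
CONFIG_MANAGER_ERRORS = {
    0: "This device is working properly.",
    10: "This device cannot start.",
    22: "This device is disabled.",
    28: "The drivers for this device are not installed.",
}


def build_fault_query(wmi_class="Win32_PnPEntity"):
    """Return a WQL query selecting devices that report a non-zero error code."""
    return (
        "SELECT Name, ConfigManagerErrorCode "
        f"FROM {wmi_class} WHERE ConfigManagerErrorCode <> 0"
    )


def describe_fault(record):
    """Translate one query result into a (device name, fault description) pair."""
    code = record["ConfigManagerErrorCode"]
    text = CONFIG_MANAGER_ERRORS.get(code, f"Unlisted error code {code}")
    return record["Name"], text


# Sample record, made up for illustration.
sample = {"Name": "USB Camera", "ConfigManagerErrorCode": 22}
print(build_fault_query())
print(describe_fault(sample))
```

On a Windows machine the query string would be handed to a WMI client; that step is left out here so that the sketch runs anywhere.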

The Performance Criteria:
The performance evaluation is based on three metrics: efficiency, availability, and reliability [14].

1-Efficiency Criteria:
In general, efficiency can be defined as the ratio between the output (such as an amount of time or a number of processes) and the input (such as the total amount of time or the total number of processes). When efficiency improves, the output-to-input ratio improves. To measure the efficiency of the proposed system, the Up states (the client works correctly) and the Down states (the client is faulty) are counted, and the efficiency is calculated by equation (1):

Efficiency = (Output / Input) * 100% …(1)

where Output represents the number of Up states and Input represents the number of all states (Up and Down).

2-Availability Criteria:
Availability is the ratio of the uptime to the sum of the uptime and downtime of the system. This measure gives the availability of the proposed system to the other clients. To measure it, the Uptime (the amount of time during which the system works correctly) and the Downtime (the amount of time during which the system is faulty) are determined, and the availability is calculated by equation (2):

Availability = Uptime / (Uptime + Downtime) …(2)

3-Reliability Criteria:
Reliability is the probability that the system works accurately as a function of time 't', or the ability of an item to perform a required function under stated conditions for a stated time period. To measure the service reliability of the proposed system, the Total Requests (the total number of requests made to the proposed system) and the Successful Responses (the number of requests answered by the proposed system) are counted, and the reliability is calculated by equation (3):

Service Reliability = (Successful Responses / Total Requests) * 100 …(3)
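Equations (1) to (3) can be written as three small Python functions. This is a minimal sketch: the sample sequence, the 5-second sampling interval, and the request counts are hypothetical values chosen only to exercise the formulas.

```python
def efficiency(samples):
    """Equation (1): share of Up samples (1) among all samples, in percent."""
    return sum(samples) / len(samples) * 100


def availability(uptime, downtime):
    """Equation (2): uptime divided by the total observed time."""
    return uptime / (uptime + downtime)


def service_reliability(successful_responses, total_requests):
    """Equation (3): share of requests answered successfully, in percent."""
    return successful_responses / total_requests * 100


# Hypothetical node: eight state samples (1 = Up, 0 = Down), one of them Down.
samples = [1, 1, 1, 0, 1, 1, 1, 1]
interval = 5  # seconds represented by each sample (assumed)
uptime = sum(samples) * interval
downtime = (len(samples) - sum(samples)) * interval

print(efficiency(samples))              # 87.5
print(availability(uptime, downtime))   # 0.875
print(service_reliability(3, 3))        # 100.0
```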
Proposed Method For Network Self-Fault Management:
The proposed method supports devices related to two network services, a chatting system and video chatting. These devices are the camera, monitor, sound card, network card, and keyboard. Because fault management consists of three stages, the proposed method consists of three agents. The first agent performs monitoring: it watches specific devices and tries to detect a fault in each monitored device. The second agent identifies the type of fault based on WMI, a part of Windows that supports 49 types of fault for each device, and returns the type of the fault. The third agent finds the solution in two phases: the first phase searches for the fault in a database called the common faults, which contains all faults that occurred previously; the second phase uses another database, called recommendation solved, to obtain a solution for the fault. Algorithm (1) explains the detection stage, in which the monitoring process continues until a fault occurs and the isolation process is then performed.
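The detection stage described above can be sketched in Python as a polling loop. This is not the paper's algorithm (1); it is an illustrative sketch in which the probe function, the polling interval, and the `max_rounds` bound are all assumptions.

```python
import time


def detect_fault(devices, is_healthy, interval=0.0, max_rounds=None):
    """Poll every device until one reports a fault; return its name.

    `is_healthy` is a caller-supplied probe (hypothetical); `max_rounds`
    bounds the loop so the sketch can also terminate without a fault.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for device in devices:
            if not is_healthy(device):
                return device  # fault detected: hand over to isolation
        rounds += 1
        time.sleep(interval)
    return None


DEVICES = ["camera", "monitor", "sound card", "network card", "keyboard"]

# Fake probe for illustration: the sound card fails on the third round.
calls = {"count": 0}


def fake_probe(device):
    calls["count"] += 1
    return not (device == "sound card" and calls["count"] > 10)


print(detect_fault(DEVICES, fake_probe))
```

Passing the probe in as a parameter keeps the loop independent of how a device is actually checked, so the same loop could be driven by a WMI lookup or by a test stub.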

Algorithms of Identification Stage:
The identification stage consists of five secondary agents; each agent is responsible for identifying the fault of a specific device by using WMI. The steps for creating the WMI class and obtaining the information are explained in algorithm (2). The identification stage is explained by the following steps:
Step a: Obtain information from WMI class (PNPEntity)
Step b: Obtain information from WMI class (NetworkAdapter)
Step c: Obtain information from WMI class (SoundDevice)
Step d: Obtain information from WMI class (Desktop_Monitor)
Step e: Obtain information from WMI class (Keyboard)
End.
End.
Step 4: Return fault information for each managed device.
End
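The identification steps can be sketched in Python as five lookups run in parallel, one per device. The sketch rests on several assumptions: the Win32_ names are the standard WMI class names and are taken to correspond to the classes named in the steps, the camera is assumed to be the device queried through the PnP entity class, and `query_class` is a hypothetical callable standing in for the real WMI lookup.

```python
from concurrent.futures import ThreadPoolExecutor

# Device -> WMI class (assumed mapping, see the note above).
WMI_CLASS_FOR_DEVICE = {
    "camera": "Win32_PnPEntity",
    "network card": "Win32_NetworkAdapter",
    "sound card": "Win32_SoundDevice",
    "monitor": "Win32_DesktopMonitor",
    "keyboard": "Win32_Keyboard",
}


def identify_faults(query_class, devices=WMI_CLASS_FOR_DEVICE):
    """Run one sub-agent per device in parallel and collect fault information.

    `query_class(wmi_class)` is a hypothetical callable that returns the
    fault information obtained from that WMI class.
    """
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        futures = {
            device: pool.submit(query_class, wmi_class)
            for device, wmi_class in devices.items()
        }
        return {device: future.result() for device, future in futures.items()}


# Fake WMI lookup, for illustration only.
def fake_query(wmi_class):
    return "disabled" if wmi_class == "Win32_SoundDevice" else "ok"


print(identify_faults(fake_query))
```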

Algorithm of Recovery Stage:
Algorithm (3) explains the recovery stage of the proposed NSFM system. Device Name, Device Type, and Type of Fault are three input variables obtained from the identification stage, and Fault Solved represents the solution to the fault that occurred. Algorithm (4) explains the proposed network self-fault management approach.
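One possible reading of the recovery stage is sketched below in Python. It is not the paper's algorithm (3): the two databases are modelled as dictionaries with invented entries, and treating the second database as a fallback when the first has no match is an assumption.

```python
# Both databases are modelled as dictionaries; their contents are invented.
COMMON_FAULTS = {
    ("sound card", "disabled"): "Enable the device",
}
RECOMMENDED_SOLUTIONS = {
    "driver missing": "Reinstall the device driver",
}


def recover(device_name, device_type, fault_type):
    """Return the device name together with a solution (None if not found)."""
    # Phase 1: has this fault occurred on this type of device before?
    solution = COMMON_FAULTS.get((device_type, fault_type))
    if solution is None:
        # Phase 2 (assumed fallback): general recommendation for the fault type.
        solution = RECOMMENDED_SOLUTIONS.get(fault_type)
    return {"device": device_name, "fault_solved": solution}


print(recover("Realtek Audio", "sound card", "disabled"))
print(recover("USB Camera", "camera", "driver missing"))
```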

Discussion and Test Results:
This section evaluates the proposed NSFM. It is faster than other methods because the client tries to solve its faults alone, without server action. The proposed method therefore avoids the bottleneck that arises when all clients query the server and wait for its responses, and it reduces the load of network traffic. The NSFM consists of three intelligent agents that run automatically under their own full control. The first agent is used for the detection stage of the NSFM system. In this stage the agent continuously monitors for faults and isolates the node (the client's computer) from the NSFM system when a fault occurs. The second agent is used for the identification stage. In this stage the agent is responsible for five sub-agents that work in parallel to identify the fault that occurred. The third agent is used for the recovery stage. In this stage the agent is responsible for finding a solution to the fault in two phases.

Evaluation of the Efficiency Criteria
The proposed system was applied to a WLAN consisting of a set of nodes denoted as N = {N1, N2, N3, …, Nn}. At any time, each node is in one of two states: the Up state (working correctly), denoted as 1, or the Down state (failure), denoted as 0. Four video conferences were used for testing the NSFM system. In each video conference, the node state (Up or Down) was tested in periods of 5 seconds and the value ("1" or "0") was recorded. When the video conference finished, the total number of Up states, the total uptime, and the total downtime were calculated, and the request information, such as the total number of requests and the number of successful requests, was recorded. Table (1) shows the efficiency results for video conferences 1 to 4 (VC1 to VC4), where N1 denotes Node 1, N2 denotes Node 2, and No. of up denotes the number of Up states. The efficiency of each node in a video conference is calculated by using equation (1). The efficiency is also calculated for each node at a specific moment in time. The results in Table (1) show that high efficiency is obtained when the number of Up states increases, and that efficiency decreases noticeably when the number of Up states decreases and the number of Down states increases. In VC1, the smallest number of Up states (7) gives a low efficiency (87.5%), and the largest number of Up states (8) gives a high efficiency (100%). The average efficiency for VC1 is 93.75%. The average efficiency of VC2 is 87.5% because the efficiency of node 1 is 75% and the efficiency of node 2 is 100%. Node 1 has the smallest number of Up states and therefore a low efficiency; node 2 has a large number of Up states and therefore a high efficiency.
The average efficiency of VC3 is 87.5% because the efficiency of node 1 is 87.5% and the efficiency of node 2 is 87.5%. The two nodes have the same number of Up states and therefore the same efficiency. Video conference 4 represents the optimal state for the proposed system: both node 1 (N1) and node 2 (N2) are in the Up state for the whole video conference, so each node has a high efficiency (100%) and the average efficiency in video conference 4 is 100%. Fig. (2) plots the average efficiency of the proposed system for the four video conferences.
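The averages quoted above can be checked with a few lines of Python. The per-node efficiencies are taken from the text; the order of the two nodes within VC1 is not stated and does not affect the averages.

```python
from statistics import mean

# Per-node efficiencies (percent) as reported in the text for each video conference.
efficiency_by_vc = {
    "VC1": [87.5, 100.0],
    "VC2": [75.0, 100.0],
    "VC3": [87.5, 87.5],
    "VC4": [100.0, 100.0],
}

vc_average = {vc: mean(values) for vc, values in efficiency_by_vc.items()}
overall = mean(vc_average.values())

print(vc_average)
print(round(overall, 2))
```

The overall value rounds to the 92.19% reported for the system.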

Evaluation of the Availability Criteria
The availability of the proposed system is measured with equation (2), for which the uptime and downtime of each node in a video conference are calculated. Table (2) shows the availability results of the proposed system, where Total time represents the duration of the video conference, No. Node represents the number of nodes, Uptime represents the time of correct operation, and Downtime represents the time spent solving the fault of the node. The availability depends on the amount of uptime and downtime of each node in a specific video conference. In the first row of Table (2), node 1 has an availability of 1 because its uptime is 40 sec and its downtime is 0 sec, whereas node 2 has an availability of 0.88 because its uptime is 35 sec and its downtime is 5 sec out of a total video conference time of 40 sec. The best results are obtained when both the first node (N1) and the second node (N2) have an availability of 1, while the availability decreases noticeably when the downtime increases: the largest downtime (10 sec) gives a low availability (0.75), and the smallest downtime (0 sec) gives the highest availability (1). The average availability of the system is approximately 92.375%. Fig. (3) shows the availability of the NSFM system, where the x-axis represents the video conference number and the y-axis represents the average availability. The average availability is 0.94 in the first video conference, 0.875 in the second, 0.88 in the third, and 1 in the fourth.
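As a worked check of these figures, the Python sketch below applies the availability ratio to the uptime and downtime pairs quoted in the text and averages the four per-conference values. The 30-second uptime paired with the 10-second downtime is inferred from the 40-second total and is therefore an assumption.

```python
# (uptime, downtime) pairs in seconds; the last pair is inferred (see above).
pairs = [(40, 0), (35, 5), (30, 10)]
per_node = [up / (up + down) for up, down in pairs]
print(per_node)  # [1.0, 0.875, 0.75]

# Average availability per video conference, as quoted in the text.
reported = [0.94, 0.875, 0.88, 1.0]
overall = sum(reported) / len(reported)
print(round(overall * 100, 3))
```

The value 0.875 for the second pair is the one the text reports as 0.88 after rounding.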

Evaluation of the Reliability Criteria
To measure the reliability of the proposed NSFM system, the total number of requests for the node and the number of successful responses are counted. Table (3) shows the reliability of the proposed NSFM system for the node, where Total Requests denotes the total number of requests to solve a fault that occurred, and Successful Responses denotes the number of requests that were solved successfully. The results in Table (3) show a high reliability of the proposed system for the nodes (100%) because the client responds to each request by itself, without any server action. The average reliability for each video conference is 100%. Fig. (4) shows the reliability of the proposed NSFM system for the node. A traditional system requires a collection of the main faults in order to build a database, or it uses distributed management with a middle level of administration. The test results show that the proposed work is fast and efficient when applied to network services because it uses intelligent agents and a parallel technique to reduce the detection, identification, and recovery times. The proposed work also uses WMI and does not need a specially built database.

Conclusions:
The proposed NSFM system uses three intelligent agents with full self-control: the first agent is used for the detection stage, the second for the identification stage, and the third for the recovery stage. The intelligent agents make the system more flexible and powerful, which helps to ensure that a solution is found when a fault occurs.
Self-management reduces the management traffic in the network (no bandwidth-intensive client/server message exchange). Most management operations of the proposed NSFM system are done without server intervention. This property enables the proposed system to reduce congestion in the network, which supports offering the services without noticeable delay. Video chatting is represented by a video conference application. The proposed NSFM system succeeds in running this application with high performance in terms of efficiency, availability, and reliability: the efficiency criterion reaches 92.19%, the availability criterion 92.375%, and the reliability criterion 100%.

[Figure: average reliability (y-axis) versus number of video conferences (x-axis)]