Reliability Analysis of a Fault-Tolerant Full-Duplex Optical Wireless Communication Transceiver

Optical wireless communication (OWC) has emerged as a promising solution to the radio spectrum crunch. OWC technology requires more access points (APs) compared to other wireless technologies, making it crucial to have reliable OWC APs to ensure low repair rates and optimal network availability. To address this, our paper provides methods to enhance the reliability of duplex OWC AP transceivers and calculates their reliability parameters. The transceiver design consists of several independent modules, and we calculate the failure in time (FIT) for each module using MTBF calculator software and component datasheets. Our reliability analysis includes two methods: mean time between failure (MTBF) calculations using reliability block diagrams and Markov chain modeling. The former identifies modules with a high failure rate and determines the system’s cumulative downtime or repair time, while the latter accounts for the repair rates of individual modules for availability calculations. Both methods complement each other and provide insight into fault-tolerant system designs. Furthermore, we perform a life cycle cost analysis of the proposed fault-tolerant transceiver designs to facilitate appropriate design choices for different applications.


I. INTRODUCTION
The traffic share of Wi-Fi networks is 5 times that of mobile networks [1]. Wi-Fi networks face a spectrum crunch and cannot meet the ever-increasing data demands, driving research towards alternative frequency bands. As a result, optical wireless communication (OWC) and millimeter wave (mmWave) wireless technologies have emerged as the primary competitors for indoor wireless communication. OWC has the potential to deliver significantly higher bandwidth [2] than any other wireless method. It is safe, secure, green, and uses light-emitting diodes (LEDs) simultaneously for communication and illumination [3]. In OWC, we switch the LEDs at high speed to transmit the optical signals and use photodetectors in the receivers to convert the light intensity variations back to electrical signals. The infrastructure LEDs act as the optical antennas [4] for the OWC access points (APs) to transmit downstream data over visible light, while the OWC users may send upstream data over infrared (IR) light. Since OWC utilizes separate frequency bands for upstream and downstream communication, it can transmit and receive data simultaneously in full-duplex mode. However, legacy reasons [5] restrict OWC networks to half-duplex media access control (MAC), which hinders the network's ability to utilize the full spectrum for communication.
The associate editor coordinating the review of this manuscript and approving it for publication was Lorenzo Ciani.
OWC networks are often used in environments that require high security or are sensitive to radio-frequency (RF) interference [6], [7], [8]. These environments are also fault-sensitive, as minor device failures can have disastrous consequences. For instance, a malfunctioning OWC device could lead to the loss of life in a hospital where patients are being monitored, disrupt communication in a defense unit that relies on OWC networks or cause production failure in an industrial setting where machines are connected and controlled remotely via an OWC network. To avoid these potential damages, ensuring the reliability of OWC networks becomes important. The reliability of an OWC network is evaluated in terms of connection and hardware reliability. The connection reliability can be enhanced by replicating paths using multiple OWC transmitters or alternate technologies like Wi-Fi. In contrast, hardware reliability can be improved by replicating the components responsible for most failures. The authors of [9] and [10] investigated multiple-input-multiple-output (MIMO) techniques, including transmit-diversity, which uses multiple transmitters to send the same data, thereby providing multiple parallel downstream paths to a single user, improving reliability. IEEE P802.11bb light communication task group is working on a hybrid network of Wi-Fi and light communication [4]. This network can switch seamlessly between Wi-Fi and OWC networks depending on the connection quality. Our group was the first [11] to propose the connection reliability metric in a hybrid OWC-RF system. Additionally, our group provides the hardware reliability of the receiver of a passive optical positioning system in [12]. To achieve full-duplex communication in a standalone OWC network, we introduced various changes at the MAC layer [13], [14] and presented the corresponding hardware in [15]. However, the hardware reliability of an OWC AP transceiver has not been studied so far.
This article presents the reliability analysis of an OWC transceiver, taking as an example the OWC front-ends that we designed for communication and presented at COMSNETS 2022 [15]. We propose several module-level improvements that enhance the reliability and availability of the transceiver. The methodology is general in nature; it is illustrated here with application to OWC and can be utilized for other networks too. For the analysis, we assume a 5-year life expectancy, after which the technology is expected to become obsolete. The reliability analysis of a fault-tolerant OWC transceiver is carried out in two parts: a) MTBF analysis using reliability block diagrams, and b) a Markov model to calculate the system availability.
The MTBF analysis method is used to identify the modules within a system that have a higher likelihood of failure. By pinpointing these modules, it becomes possible to determine which ones should be replicated in order to achieve optimal system reliability. Using this method, we propose various fault-tolerant transceiver designs. A fault-tolerant system has the ability to operate with reduced performance in the event of certain module failures, as the modules causing higher failures are replicated inside it. This analysis calculates the system's average downtime or the mean time to repair (MTTR) to maintain the desired system availability. On the other hand, the Markov chain model analyzes the fault-tolerant system designs by considering all the system states, namely fully working, degraded, and failed. A system in a fully working state has no module failures. In the event of replicated module failures, it can still function in a degraded state. However, in a failed state, the system cannot operate due to the failure of a primary module. The Markov chain model can be used to determine the repair rate needed for circuit modules to achieve a desired level of availability. This model allows flexibility in selecting different repair rates for each module and also considers the repair of degraded states, which is not possible in MTBF analysis.
The significant contributions of our work are summarized as follows:
1) Our MTBF analysis of the OWC transceiver identifies the key modules contributing to most failures. This provides a targeted approach to improve the overall reliability of the system.
2) We present three different designs of transceiver modules, each with distinct types of module replication, and provide reliability metrics for each design. We propose a highly reliable and fault-tolerant transceiver circuit design by combining these different replications. This approach ensures the system remains operational even after a few module failures.
3) We further provide the detailed methodology to perform the Markov analysis for system availability calculations.
4) Using the Markov model, we analyze the transceiver designs proposed in the first half of this article and compare them based on their system availability. By considering the repair rates of each module separately, we provide a more detailed analysis of the system availability.
5) We compare the various transceiver designs based on their life cycle cost. This analysis helps to decide on the most appropriate design for different applications.
The structure of the rest of the paper is as follows: Section II provides the reliability metrics, followed by Section III, which discusses the modular view of our hardware and the corresponding MTBF calculations. Section IV details a variety of hardware upgrades as well as a highly reliable fault-tolerant design by using reliability block diagrams. The Markov model analysis for the various redundancies proposed in Section IV is provided in Section V. Section VI compiles the analysis results for the reliability block diagram and the Markov model. Section VII provides the life cycle cost analysis of the proposed transceiver designs. Finally, Section VIII concludes our work, followed by the scope of future work.

II. RELIABILITY THEORY
Reliability is the probability that a system performs satisfactorily under stated conditions for a given time. In reliability theory, we assume that the number of failures per unit of time for a particular type of device is constant. This theory adheres to the principle of memorylessness, which states that the probability of device failure is not influenced by the amount of time it has been in operation. Consequently, the exponential distribution $f_T(t)$ is used to represent the random time of failure ($T$) of a device.

A. RELIABILITY FUNCTION
The survivor or reliability function represents the probability that a system will continue to function for a time greater than $t$. This function is calculated as follows [16]:
$$R_0(t) = P(T > t) = e^{-\lambda_0 t} \tag{1}$$
where $\lambda_0$ (unit in FITs) is the failure rate of the system. The failure rate $\lambda_0$ or failure in time (FIT) defines the average number of system failures in one billion operating hours. We use the internationally accepted MIL-HDBK-217 [17] standard for predicting the FIT of electronic equipment. The MIL-HDBK-217 part stress method is used for designing highly reliable systems and requires detailed information on the system components.
VOLUME 11, 2023
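The reliability of a system having $N$ components in series, where the $i$-th component has an individual reliability of $R_i(t)$ and a failure rate of $\lambda_i$, can be given as [16]
$$R_0(t) = \prod_{i=1}^{N} R_i(t) \tag{2}$$
or, $R_0(t) = R(t)^N$, for $N$ similar series components with individual reliability of $R(t)$. Using (1) in (2),
$$e^{-\lambda_0 t} = \prod_{i=1}^{N} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{N} \lambda_i\right) t} \tag{3}$$
$$\lambda_0 = \sum_{i=1}^{N} \lambda_i \tag{4}$$
Therefore, the failure rate of a series combination is simply the sum of the failure rate of each component. Similarly, the reliability of a system having $N$ components in parallel is [16]:
$$R_0(t) = 1 - \prod_{i=1}^{N} \left(1 - R_i(t)\right)$$
or, $R_0(t) = 1 - (1 - R(t))^N$, for $N$ similar parallel components.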
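As a minimal sketch of the series and parallel rules above, the following Python snippet evaluates both combinations for three components. The FIT values and the helper names are illustrative assumptions and are not taken from Table 1.

```python
import math

FIT_HOURS = 1e9          # one FIT is one failure per 10^9 operating hours
T_HOURS = 43800.0        # 5 years expressed in hours

def reliability(fit, hours=T_HOURS):
    """R(t) = exp(-lambda * t), with lambda given in FITs."""
    return math.exp(-fit * hours / FIT_HOURS)

def series_reliability(reliabilities):
    """All components must work: product of the individual reliabilities."""
    result = 1.0
    for r in reliabilities:
        result *= r
    return result

def parallel_reliability(reliabilities):
    """At least one component must work: 1 - product of the unreliabilities."""
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= 1.0 - r
    return 1.0 - all_fail

# Hypothetical component failure rates in FITs (placeholders).
fits = [1000.0, 2500.0, 400.0]
rs = [reliability(f) for f in fits]

r_series = series_reliability(rs)
r_from_sum = reliability(sum(fits))   # series rule: failure rates add up
r_parallel = parallel_reliability(rs)

print(f"series   R(5 y) = {r_series:.4f}")
print(f"parallel R(5 y) = {r_parallel:.4f}")
```

The series result coincides with the reliability computed from the summed failure rate, which is the property used later to add module FITs.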

1) MTBF
Mean time between failures (MTBF) is another essential reliability metric, which defines the average time a system takes to run into a repairable failure. MTBF is the reciprocal of the failure rate [16]:
$$\mathrm{MTBF} = \frac{1}{\lambda_0} \tag{7}$$
The commonly used reliability metric mean time to failure (MTTF) is applicable to non-repairable devices and defines the average time until the device fails and cannot be repaired. In this work, the transceiver is a repairable device and the appropriate measure of its reliability is the MTBF.

2) MTTR
The mean time to repair (MTTR) refers to the average time it takes to repair a device after a repairable failure occurs and is expressed in standard time units. The repair rate ($\mu$), which is the reciprocal of MTTR, represents the number of repairs per unit of time for a system, and can be expressed as follows:
$$\mu = \frac{1}{\mathrm{MTTR}} \tag{8}$$

3) AVAILABILITY
The availability of a repairable device is the fraction of time its intended operational service is available to a user. The availability [16] of a device is calculated as:
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \tag{9}$$
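The three metrics above can be chained to find the repair time that a target availability allows. The sketch below assumes the availability is the ratio MTBF / (MTBF + MTTR); the failure rate of 12500 FIT is a hypothetical input chosen only for illustration.

```python
def mtbf_hours(fit):
    """MTBF in hours for a failure rate given in FITs (failures per 1e9 h)."""
    return 1e9 / fit

def availability(mtbf, mttr):
    """Fraction of time the device is operational."""
    return mtbf / (mtbf + mttr)

def required_mttr(mtbf, target_availability):
    """Largest MTTR (hours) that still meets the target availability.

    Obtained by solving A = MTBF / (MTBF + MTTR) for MTTR.
    """
    return mtbf * (1.0 - target_availability) / target_availability

system_fit = 12500.0                      # hypothetical system failure rate
mtbf = mtbf_hours(system_fit)             # 80000 h
mttr = required_mttr(mtbf, 0.9999)        # four nines
repair_rate = 1.0 / mttr                  # repairs per hour

print(f"MTBF = {mtbf:.0f} h, allowed MTTR = {mttr:.2f} h")
```

With these assumed numbers the allowed repair time is about 8 hours, so a lower failure rate directly relaxes the repair deadline for the same availability target.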

III. MODULAR RELIABILITY ANALYSIS
An access point (AP) is a network control device that provides Internet access to the end users. Fig. 1 displays standard OWC AP transceiver circuit blocks. The first block in the diagram is the power supply of the transceiver. Typically, the AP is powered from the mains or through a power over Ethernet (PoE) cable. Thus, the power supply consists of either an AC-DC converter or a buck-boost converter circuit. The second block is a processor responsible for modulation, demodulation, digital signal processing techniques, and networking-related control functions. The processor connects to the OWC front-end LEDs and photodetector via LED driver and amplifier circuits. We designed our own OWC transceiver circuit along similar lines for the AP [15] and present its modular block diagram in Fig. 2. Each module further consists of multiple electronic components. We calculate the failure rate (λ) of each component at the ambient temperature of 25 °C and the stress temperature of 70 °C using ALD's MTBF Calculator software [18] and summarize the values in Table 1. The method for calculating the MTBF and the reliability of each module is discussed in Section III-A, and the values are tabulated in columns six and seven, respectively. We have discussed the working principle of the transceiver in our previous work [15]. In this section, we focus on the internal structure of the OWC transceiver, which is crucial for MTBF calculations. All the module components form a series combination for reliability calculations, as a single component failure results in module failure.

A. MODULE 1: BUCK CONVERTER
We use a DC voltage supply of 12 V and a combination of buck converters (or voltage step-down regulators) to reduce the 12 V DC to ±5 V for the rest of the circuit. Fig. 3 displays the circuit diagram of the buck converter. It uses two voltage step-down regulator integrated circuits (ICs) and supporting components: a rectifier diode, an inductor, electrolytic capacitors, 0.25 W carbon film resistors, an NPN transistor, and a Zener diode.
Since all the module components are in series for reliability analysis, we find the failure rate of module 1 ($\lambda_1$) using (4). For example, to calculate the module 1 FIT at 25 °C, we multiply columns three (quantity) and four (FIT at 25 °C) of Table 1 row by row and then add all the products together. After calculating the FIT value of module 1, the MTBF and reliability values are determined using (7) and (1), respectively. We use the same reliability calculations for the rest of the modules in this section and tabulate the results in Table 1.
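The quantity-times-FIT summation described above can be sketched as follows. The component list and every FIT value in it are hypothetical placeholders; the actual figures are those listed in Table 1.

```python
import math

# (component, quantity, FIT at 25 deg C) -- hypothetical placeholder values.
buck_converter_parts = [
    ("step-down regulator IC", 2, 40.0),
    ("rectifier diode",        1, 5.0),
    ("inductor",               1, 2.0),
    ("electrolytic capacitor", 4, 10.0),
    ("carbon film resistor",   3, 1.0),
]

def module_fit(parts):
    """Series combination: sum of quantity * FIT over all components."""
    return sum(quantity * fit for _, quantity, fit in parts)

def module_mtbf_hours(fit):
    """MTBF is the reciprocal of the failure rate (FIT = failures per 1e9 h)."""
    return 1e9 / fit

def module_reliability(fit, hours):
    """R(t) = exp(-lambda * t)."""
    return math.exp(-fit * hours / 1e9)

fit_1 = module_fit(buck_converter_parts)
mtbf_1 = module_mtbf_hours(fit_1)
r_1 = module_reliability(fit_1, 43800.0)   # 5-year reliability

print(f"module FIT = {fit_1}, MTBF = {mtbf_1:.3e} h, R(5 y) = {r_1:.4f}")
```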

B. MODULE 2: PROCESSOR
We test our transceiver circuit on an ARM Cortex-A53 64-bit quad-core processor. The processor provides packetized input to the transmitter front-end (modules 3 to 6) through its universal asynchronous receiver-transmitter (UART) pin and receives packetized data from the receiver front-end (modules 7 to 10) through the second UART pin.

C. MODULE 3 TO 6: TRANSMITTER FRONT-END
The transmitter front-end of our OWC transceiver comprises an inverter, a buffer, an LED driver, and an optical antenna, which is an LED, as depicted in Fig. 4. To achieve high switching speeds, we utilize the LM7171 operational amplifier (op-amp) in the circuit. The inverter circuit has a low current flow (∼a few mA), which is why we opted for 0.25 W carbon film resistors. A buffer circuit following module 3 is employed to prevent loading effects on the processor. It consists of a single op-amp and is followed by modules 5 and 6, an LED driver and an LED. We utilize a bipolar junction transistor (BJT) as a switch in the LED driver circuit to provide adequate currents. When the BJT is in the cut-off region, the LED current of ∼200 mA flows through the collector resistance, which requires a 5 W collector resistor. As the optical antenna, we incorporate OSRAM visible light LEDs LCW W5SM, whose lifetime is specified as an L70B50 value in hours. L70B50 represents the duration after which the LED lumen output declines to 70% of its initial level, with 50% of the LED area failing to attain the 70% lumen level. We obtain the L70B50 values of our LEDs from the OSRAM (dragon product family) application notes [19] and use them as the MTBF of the LED.

D. MODULE 7 TO 10: RECEIVER FRONT-END
The receiver front-end consists of a photodetector (PD), followed by a trans-impedance amplifier (TIA), a high pass filter (HPF), and a two-stage amplifier as displayed in Fig. 5. We use OSRAM PD SFH2701 to receive the optical signal at the receiver. The PD output current is amplified and converted to a voltage signal by using a TIA. An HPF with a low cut-off frequency (∼300 Hz) is used after TIA to remove the low-frequency ambient light effects. The voltage signal from the HPF is amplified by using a two-stage amplifier. The signal of the two-stage amplifier is suitable for the processor UART input pin.
As shown in Fig. 6, we connect all of the transceiver circuit modules in series and include the failure rate of each module (in FITs) alongside the module in the transceiver reliability diagram. It could be argued that a failure of the transmitter front-end (modules 3-6) or the receiver front-end (modules 7-10) does not imply full transceiver failure, so that the two front-ends could be connected in parallel. However, in duplex communication, both the transmitter and receiver data are required. Consequently, any module failure leads to communication failure, which results in a series connection in the reliability diagram. We list the transceiver circuit reliability values in the last row of Table 1.

IV. FAULT-TOLERANT DESIGN AND MTBF ANALYSIS
The transceiver circuit has a low 5-year reliability of ∼0.25 under temperature stress conditions, which means that a transceiver fails with a probability of about 75% within five years. We improve the reliability of the circuit by adding redundancy to the modules having high failure rates. The redundancy introduces parallel routes in the circuit. We consider the three modules with the highest failure rates, module 6, module 2, and module 1, for incorporating fault tolerance.

A. PARALLEL REDUNDANCY IN THE LED MODULE
An OWC system works at a variable data rate depending on the light dimming levels [20]. However, there is no mechanism to identify LED lumen degradation. A receiver front-end has a fixed gain, and the circuit fails to receive the signal if the light intensity is lower than a specific limit. We consider the sensitivity limit of the receiver circuit such that it receives a signal between 100% and 70% of the initial lumen values of the transmitter. We add 12 LEDs in parallel in the circuit to improve illumination and the fault tolerance of module 6. Since a single BJT cannot provide the high current of the 12 LEDs (∼2.4 A), we replicate the LED driver circuits as shown in Fig. 7. The circuit is said to fail if all 12 LEDs reach their L70B50 values (lumen output declines to 70% of its initial level, with 50% of the LED area failing to reach the 70% lumen level). All the LEDs and the driver circuits are in parallel with a combined failure rate of $\lambda'_{56}$. We calculate $\lambda'_{56}$ by first calculating the reliability $R'_{56}(t)$ (for $t = T = 5$ years or 43800 hours) of the series-parallel combination (modules 5A-5L, modules 6A-6L) as follows:
$$R'_{56}(t) = 1 - \left(1 - R_5(t)\,R_6(t)\right)^{12} \tag{10}$$
where $R_5(t)$ and $R_6(t)$ are the reliabilities of module 5 and module 6, listed in Table 1. The corresponding failure rate $\lambda'_{56}$ follows from inverting (1):
$$\lambda'_{56} = -\frac{\ln R'_{56}(T)}{T} \tag{11}$$
The failure rate of the transceiver circuit is simply the sum of the failure rates of all the modules provided in Fig. 7. We calculate the MTBF of the transceiver from its failure rate using (7). We assume an availability of four nines (0.9999) to further calculate the MTTR value for the circuit from (9), which comes out to be 8 hours under temperature stress conditions. An MTTR of 8 hours means that the circuit or the faulty module must be replaced within 8 hours to provide a service availability of 99.99%. The calculated values for the new system are summarized in Table 3 in Section VI. The failure rate of the combined module is ∼940 times lower than the initial failure rate.
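A minimal sketch of this calculation is given below, assuming that each of the 12 branches is a driver in series with its LED and that the array fails only when every branch has failed. The two FIT inputs are hypothetical placeholders, so the printed reduction factor is not expected to match the value reported for the actual design.

```python
import math

T_HOURS = 43800.0   # 5 years
FIT_HOURS = 1e9

def rel(fit, hours=T_HOURS):
    """Reliability from a failure rate in FITs."""
    return math.exp(-fit * hours / FIT_HOURS)

def fit_from_rel(r, hours=T_HOURS):
    """Equivalent constant failure rate (FITs) that gives reliability r."""
    return -math.log(r) * FIT_HOURS / hours

def led_array_reliability(fit_driver, fit_led, branches=12):
    """Parallel combination of identical driver+LED series branches."""
    branch = rel(fit_driver) * rel(fit_led)
    return 1.0 - (1.0 - branch) ** branches

FIT_DRIVER = 600.0     # hypothetical module 5 failure rate
FIT_LED = 19000.0      # hypothetical module 6 failure rate

single_branch_fit = FIT_DRIVER + FIT_LED
array_fit = fit_from_rel(led_array_reliability(FIT_DRIVER, FIT_LED))

print(f"single branch: {single_branch_fit:.0f} FIT")
print(f"12-branch array: {array_fit:.1f} FIT")
print(f"reduction factor: {single_branch_fit / array_fit:.0f}x")
```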

B. VOTING REDUNDANCY IN THE PROCESSOR MODULE
The processor has the second-highest failure rate in the transceiver circuit. We use three parallel processors in series with a voter logic, as shown in Fig. 8. The voter logic decides the failure of a processor based on the received signals. The voter logic XORs the signals of every pair of the three processors ($\binom{3}{2}$ combinations) to check for a processor failure and takes appropriate action accordingly. A minimal processor (ARM Cortex-M0 32-bit) can act as the voter logic unit, having a low failure rate of 158.57 and 812.12 FITs at temperatures of 25 °C and 70 °C, respectively. The reliability $R'_2(t)$ of the parallel-series combination, which works as long as at least two of the three processors and the voter work, is as follows [16]:
$$R'_2(t) = \left(3R_2^2(t) - 2R_2^3(t)\right) R_V(t) \tag{12}$$
where $R_2(t)$ and $R_V(t)$ are the reliabilities of module 2 and the voting unit. Logically, all three possible series combinations of two processors out of three processors are interconnected in parallel. These combinations are further connected in series with the voter logic. The probability of two processors failing simultaneously, or of a common cause of failure, is assumed to be negligibly small. We find the rest of the reliability values of the fault-tolerant system (Fig. 8) following the procedure discussed in Section IV-A and tabulate them in Table 3. The failure rate of the combined module is ∼1.7 times lower than the initial failure rate of module 2. Note that introducing an additional processor to perform the voting logic introduces a slight latency in the transmitter signal, typically on the order of a few ns. For a simple estimate, let us assume that the clock frequency of the processor used for the voting logic is 200 MHz, and that it takes one clock cycle to compare two input signals. To compare three different signals, the processor requires 3 clock cycles of 5 ns each, which corresponds to a time of 15 ns. This delay is small and can be easily accommodated in most OWC systems. In general, latency is an important network performance metric, and interested readers may refer to our previous work [13].
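The sketch below illustrates the voting calculation under the assumption that the module works when at least two of the three identical processors and the voter work. It cross-checks the closed-form two-out-of-three expression against a brute-force enumeration and reproduces the latency estimate. The processor FIT is a hypothetical placeholder; the voter FIT is the 70 °C value quoted above.

```python
import math
from itertools import product

def two_out_of_three(r):
    """Closed form for at least 2 of 3 identical units working."""
    return 3 * r**2 - 2 * r**3

def two_out_of_three_enumerated(r):
    """Same quantity by summing over all 8 up/down combinations."""
    total = 0.0
    for states in product((True, False), repeat=3):
        if sum(states) >= 2:
            p = 1.0
            for up in states:
                p *= r if up else 1.0 - r
            total += p
    return total

def voting_module_reliability(r_proc, r_voter):
    """Majority group in series with the voter."""
    return two_out_of_three(r_proc) * r_voter

def voter_latency_ns(clock_hz, comparisons=3, cycles_per_comparison=1):
    """Delay added by the voter, in nanoseconds."""
    return comparisons * cycles_per_comparison / clock_hz * 1e9

T_HOURS = 43800.0
FIT_PROC = 3000.0     # hypothetical processor failure rate at 70 deg C
FIT_VOTER = 812.12    # voter failure rate at 70 deg C (from the text)

r_proc = math.exp(-FIT_PROC * T_HOURS / 1e9)
r_voter = math.exp(-FIT_VOTER * T_HOURS / 1e9)
r_module = voting_module_reliability(r_proc, r_voter)
fit_module = -math.log(r_module) * 1e9 / T_HOURS
latency = voter_latency_ns(200e6)

print(f"voted module: {fit_module:.0f} FIT, latency: {latency:.0f} ns")
```

Because the voter sits in series with the majority group, its own failure rate bounds how much the voting redundancy can help.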

C. DIODE REDUNDANCY IN THE BUCK CONVERTER MODULE
We connect two parallel buck converters using diode redundancy, in which two diodes are connected in series with the buck converters, as shown in Fig. 9. Buck converters are equipped with thermal shutdown and current limit protection features. Therefore, in case of a buck converter failure, the output voltage drops below the required value. However, the working buck converter provides supply to the load through the diode connection, ensuring that the output voltage is maintained. The series diodes must be rectifier diodes (1N5824 diode in Table 1) to operate at the high currents and voltages of the buck converters. The reliability $R'_1(t)$ of the parallel combination is as follows:
$$R'_1(t) = 1 - \left(1 - R_1(t)\,R_D(t)\right)^{2} \tag{13}$$
where $R_1(t)$ and $R_D(t)$ are the reliabilities of module 1 and the rectifier diode. We can perform the rest of the reliability calculations using the method mentioned in Section IV-A, and the values are summarized in Table 3. The diode redundancy lowers the failure rate of module 1 by a factor of ∼13.

D. A HIGHLY RELIABLE FAULT-TOLERANT OWC TRANSCEIVER DESIGN
We can design a highly reliable OWC transceiver by introducing the three redundancies discussed in Sections IV-A to IV-C, as provided in Fig. 10. The reliability values of the highly reliable transceiver design are provided in Table 3. The reliability of the resulting transceiver unit is 0.68 at 70 °C, compared with 0.24 for the circuit with no fault tolerance. Using (9) and considering an availability of four nines (0.9999), the MTTR of the circuit is predicted to be 12.66 hours at a stress temperature of 70 °C. This is easily accomplished by keeping spares at the site.
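The way the three redundancies combine into one system figure can be sketched as follows. This is an illustration under stated assumptions: the module FITs and the diode FIT are hypothetical placeholders (only the voter FIT comes from Section IV-B), each redundant block is converted to an equivalent failure rate over the 5-year horizon, and the equivalent rates are then added as for a series system.

```python
import math

T_HOURS = 43800.0
FIT_HOURS = 1e9

def rel(fit):
    return math.exp(-fit * T_HOURS / FIT_HOURS)

def fit_of(r):
    return -math.log(r) * FIT_HOURS / T_HOURS

def dual_buck(r1, rd):
    """Two buck converters, each in series with a diode, in parallel."""
    return 1.0 - (1.0 - r1 * rd) ** 2

def voted(r2, rv):
    """Two-out-of-three processors in series with the voter."""
    return (3 * r2**2 - 2 * r2**3) * rv

def led_array(r5, r6, n=12):
    """n driver+LED branches in parallel."""
    return 1.0 - (1.0 - r5 * r6) ** n

# Hypothetical module failure rates at 70 deg C, keyed by module number.
module_fit = {1: 1900.0, 2: 3000.0, 3: 100.0, 4: 100.0, 5: 600.0,
              6: 19000.0, 7: 50.0, 8: 150.0, 9: 100.0, 10: 200.0}
FIT_VOTER = 812.12    # from Section IV-B
FIT_DIODE = 20.0      # hypothetical

plain_fit = sum(module_fit.values())
redundant_fit = (
    fit_of(dual_buck(rel(module_fit[1]), rel(FIT_DIODE)))
    + fit_of(voted(rel(module_fit[2]), rel(FIT_VOTER)))
    + fit_of(led_array(rel(module_fit[5]), rel(module_fit[6])))
    + sum(module_fit[m] for m in (3, 4, 7, 8, 9, 10))
)

target = 0.9999
mttr_plain = (FIT_HOURS / plain_fit) * (1 - target) / target
mttr_redundant = (FIT_HOURS / redundant_fit) * (1 - target) / target

print(f"R(5 y): {rel(plain_fit):.2f} -> {rel(redundant_fit):.2f}")
print(f"allowed MTTR: {mttr_plain:.1f} h -> {mttr_redundant:.1f} h")
```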

V. MARKOV MODEL ANALYSIS
So far, we have gained sufficient insight into the quantitative analysis of the transceiver failure, using the reliability calculations of Section II to determine the MTBF values. This section presents the Markov model for the transceiver circuit with and without redundancy, as proposed in Section IV. We initially present the generic method of solving a Markov model by taking our transceiver circuit without redundancy as an example. This method can be used to solve any Markov model for reliability analysis [16], [21], [22]. The subsections of this section provide the Markov model for the circuits with redundancy, and Section VI compiles their results. The assumptions for designing the Markov model are as follows:
• We assume that test points are provided on the transceiver circuit board for every module, and that their recorded waveforms for a correctly working transceiver card are available for fault diagnosis of the circuit to identify the faulty module.
• The buck converter, processor, and LED with LED driver are mounted on separate daughterboards connected to the transceiver PCB using sockets. The sockets are assumed to be highly reliable and are not included in the reliability calculations. These modules are designed in plug-and-play modular formats because they have high FIT values. The buck-converter daughterboard takes 12 V at the input, gives out ±5 V at the output, and is connected to the transceiver PCB using a 4-pin connector. Similarly, the processor and the LED with LED driver are connected using 5-pin and 3-pin connectors, respectively. The repair of these modules takes the same MTTR values, as they are in a similar plug-and-play format.
• The rest of the transceiver modules, like the buffer and inverter, are embedded in a chip. Any failure in these modules requires the transceiver card replacement.
• The transceiver includes a notification light that turns on when any transceiver redundancy fails, indicating that the device is in a degraded state. The failed redundant module can be replaced at the notification, and the transceiver downtime is considered insignificant and ignored for the replacement.
Fig. 11 presents the Markov model for the transceiver circuit without redundancy. The Markov model consists of thirteen states: one working, ten failure (each corresponding to a different module failure), and two fault-detection and repair states corresponding to two types of failures: plug-and-play module failures or chip failure due to the embedded modules. The probability of state 1 gives the availability of the device, as it is the only working state. Similarly, the probabilities of states 2 to 11 present the fraction of time the device spends in a failure state owing to the failure of module 1 to module 10, respectively. Each failure state transits to fault detection states 12 and 13 with repair rates $\mu_{fd1}$ for chip-embedded modules (non-replicated modules in the following subsections) and $\mu_{fd2}$ for plug-and-play modules (replicated modules in the following subsections). States 12 and 13 return to the fully working state 1 with repair rates of $\mu_1$ and $\mu_2$, respectively. The Markov model is analyzed by using the Chapman-Kolmogorov equation. The procedure for solving the Chapman-Kolmogorov equation for the Markov model in terms of failure and repair rates is briefly described below. The failure rates ($\lambda_i$ or FIT) are the same as those indicated above or below the respective modules in Fig. 6.
The transition probabilities of the Markov chain shown in Fig. 11 satisfy the Chapman-Kolmogorov forward equation given by [16]:
$$\frac{dP(t)}{dt} = P(t)\,Q \tag{14}$$
where $Q$ is the state transition matrix, and $P(t)$ collects the state probabilities. The state transition matrix is a square matrix in which each row $i$ indicates the transition rates out of state $i$ to all other states, or the departure rates from state $i$. In contrast, each column $j$ presents the transition rates of all states into state $j$, or the arrival rates to state $j$. Each diagonal element of the matrix, or the rate of transition of a state $i$ to itself, is the negative sum of the departure rates from state $i$. $P(t)$ is a one-dimensional array that provides the probability of the system being in each state at time $t$. The solution to this matrix equation is given by:
$$P(t) = P(0)\,e^{Qt} \tag{15}$$
where $P(0)$ is the probability of all states at time $t = 0$, one for the fully functional state and zero for the rest. The exponential of the full square matrix in (15) is computationally more expensive than the exponential of a diagonal matrix. The calculations can be simplified using linear algebra if $Q$ is diagonalizable, which is guaranteed when $Q$ has distinct eigenvalues:
$$Q = M D M^{-1} \tag{16}$$
where $M$ is a non-singular matrix formed with the eigenvectors of $Q$, and $D$ is the diagonal matrix with the distinct eigenvalues of $Q$ as its elements. Then we can obtain from (15) and (16),
$$P(t) = P(0)\,M\,e^{Dt}\,M^{-1} \tag{17}$$
From the state probabilities, we can find the availability of a device. As an example, we provide a step-by-step solution of the Markov model of the OWC transceiver circuit without redundancy in the following manner. We first develop the state transition matrix $Q$ from the state diagram provided in Fig. 11. In $Q$, the diagonal entry of state 1 is $S_1 = -(\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6 + \lambda_7 + \lambda_8 + \lambda_9 + \lambda_{10})$, and the fault-detection rates are denoted by $r_1 = \mu_{fd1}$ and $r_2 = \mu_{fd2}$. We assume that initially, the device is in the fully working state. Therefore, the initial conditions are
$$P(0) = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]$$
We find the probabilities for all the states at a time $t$ by solving (15).
The probabilities of the system being in different states for a lifetime of $t = 5$ years are given by the vector $P(t) = [P_1(t), P_2(t), P_3(t), P_4(t), P_5(t), P_6(t), P_7(t), P_8(t), P_9(t), P_{10}(t), P_{11}(t), P_{12}(t), P_{13}(t)]$. State 1 is the only state where the system is in working condition. So the availability of the system is the probability of state 1, which is two nines:
$$\text{Availability} = P_1(t)$$
We present the Markov model with redundancy in the following subsections. For easy understanding, we provide a compact form of the Markov model, where multiple states are grouped together and represented in a single state. Multiple transitions corresponding to that state are then grouped together and represented as an array, as discussed later.
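The solution procedure for the thirteen-state model can be sketched in plain Python as follows. Several choices here are assumptions made for illustration: all failure, fault-detection, and repair rates are hypothetical placeholders, modules 1, 2, 5, and 6 are taken as the plug-and-play modules that route to state 13, and the matrix exponential is evaluated with a scaling-and-squaring Taylor series instead of the eigendecomposition route, only to keep the example free of external libraries.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(a, terms=20):
    """exp(a) by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = 0
    while norm > 0.5:              # scale the matrix down until it is small
        norm /= 2.0
        squarings += 1
    factor = 2.0 ** squarings
    scaled = [[x / factor for x in row] for row in a]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):  # Taylor series of exp(scaled)
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):     # undo the scaling by repeated squaring
        result = mat_mul(result, result)
    return result

def build_q(fits, plug_and_play, mu_fd1, mu_fd2, mu_1, mu_2):
    """Rate matrix: state 1 working, 2..11 module failures, 12/13 repair."""
    n = len(fits) + 3
    q = [[0.0] * n for _ in range(n)]
    chip_state, plug_state = n - 2, n - 1      # states 12 and 13 (0-based)
    for module, fit in enumerate(fits, start=1):
        q[0][module] = fit * 1e-9              # failure rate per hour
        if module in plug_and_play:
            q[module][plug_state] = mu_fd2     # fault detection, plug-and-play
        else:
            q[module][chip_state] = mu_fd1     # fault detection, chip-embedded
    q[chip_state][0] = mu_1                    # card replacement
    q[plug_state][0] = mu_2                    # daughterboard replacement
    for i in range(n):
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)
    return q

# Hypothetical rates (per hour for the mu values, FITs for the modules).
FITS = [2000.0, 3000.0, 100.0, 100.0, 600.0,
        19000.0, 50.0, 150.0, 100.0, 200.0]
PLUG_AND_PLAY = {1, 2, 5, 6}
Q = build_q(FITS, PLUG_AND_PLAY, mu_fd1=1.0, mu_fd2=1.0,
            mu_1=1.0 / 24.0, mu_2=1.0 / 4.0)

T_HOURS = 43800.0
# With P(0) = [1, 0, ..., 0], P(t) is the first row of exp(Q t).
P_t = mat_exp([[x * T_HOURS for x in row] for row in Q])[0]
availability = P_t[0]
print(f"availability after 5 years: {availability:.6f}")
```

With repair rates many orders of magnitude larger than the failure rates, the chain is effectively at steady state after five years, so the result can be cross-checked against the balance equations.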

A. PARALLEL REDUNDANCY IN THE LED MODULE
As stated in Section IV-A, we replicate twelve LEDs (with their respective LED drivers) in parallel in the LED module to introduce parallel redundancy. We use four light-dependent resistors (LDR) in the LED array and average their current outputs to track the transmitter lumen levels. The LDR is a very reliable component and is assumed to last till the lifetime of the transceiver. We presume that the loss of either an LED or the connecting driver causes an LED branch (out of twelve) failure, ignoring the simultaneous failure of both the LED and its driver. An LED failure drops the LED branch lumen levels to 70%, while a driver failure reduces the lumen levels to 0%. The OWC transceiver fails to provide sufficient illumination to the user if the overall lumen levels drop to 70% or less from the original lumen levels. Fig. 12 shows the Markov model for the OWC transceiver with parallel redundancy. One leaf of the Markov chain is described here for a better understanding. State 1 has twelve working LEDs, and it reaches state 19 at a failure rate of 12λ 6 (see [16]). State 19 is the transceiver state after one LED failure, and the subsequent LED failure happens at a rate of 11λ 6 . Similarly, in the vertically downward direction, the transition from state 19 to state 20 is with a failure rate of 11λ 5 , where λ 5 is the failure rate of the driver. State 20 denotes the transceiver state with an LED and a driver failure, while the ten LED branches (out of twelve) are working. In this fashion, one can move in the horizontal (LED failure) or vertical (driver failure) direction for every LED branch failure. The Markov model explores all permutation routes leading to the thirteen failure states of the transceiver due to the LED array, tabularized in Table 2. Note that the permutation paths (3,4), (8,9), and (12,13) in Table 2 represent the same failure state. 
At the LED array failure (when the intensity falls to 70% or lower), the notification light turns on, and the input current to the LED array is increased at a repair rate of $\mu_4$. The processor controls the current boost to increase the output lumen level, and the transceiver enters the degraded working state 14. The processor performs this repair quickly, and we take $\mu_4$ to be one repair per second in our calculations. We employ the on-off keying (OOK) modulation scheme in our transmitter, since a current boost does not impact the OOK-modulated data. The degraded working state 14 can reach states 6 and 7 due to the failure of the LED drivers and LEDs. Every working state (including the degraded working state) of the transceiver can enter a failure state (2-5, 8-11) due to the failure of the respective non-replicated modules (1-4, 7-10). The compact state diagram of Fig. 12 uses arrays to represent these transitions. Each working state can transit to a failed state with rates as given by the arrays $\lambda_M$ and $\lambda_L$. The grouping of the failure states for the two arrays is done based on the respective fault detection rates $\mu_{fd1}$ and $\mu_{fd2}$. The degraded state 14 can be directly repaired by replacing the LED array at a repair rate of $\mu_3$, or the array can completely fail at a failure rate $\lambda_f$ after exhausting the increased current lifetime to reach state 6 or 7. There is no fault-detection stage between a fully working and a degraded state, as the notification light already indicates the fault. States 2 to 13 hold the same meaning as discussed in the Markov model for the transceiver without redundancy in Fig. 11. The Markov model is solved using (15) for 5 years, where the state transition matrix is a 55 × 55 matrix. The availability of the transceiver is simply the sum of the probabilities of all the working states (including the degraded state), which is three nines, as provided in Table 4.

B. VOTING REDUNDANCY IN THE PROCESSOR MODULE
As mentioned in Section IV-B, we use three identical processors and connect them to the inputs of the voting logic to introduce voting redundancy in the processor module. At its output, the voting logic produces a majority vote from its inputs. As a result, our voting logic produces correct results for a single processor failure but not in the event of several processor failures. Fig. 13 depicts the Markov model of the transceiver with processor redundancy. State 1 is the full working state, and it transitions to the single-processor-failure state 14 at a transition rate of λ_2. State 14 is the degraded working state, in which the indicator light turns on, and the failed processor is replaced at a repair rate of µ_3 to return to state 1. The transceiver can enter failed state 3 (the processor module failure state) either due to a voting logic failure at a failure rate of λ_11 or due to a second processor failure at a failure rate of 2λ_2. Both states 1 and 14 are working states that can fail due to the non-replicated modules (1, 3-10) and reach their respective failure states (2, 4-11), as shown in Fig. 13 in a compact form. States 2 to 13 correspond to the failure and fault-detection states of the transceiver without redundancy. We calculate the state probabilities using (15) for t = 5 years, where the state transition matrix Q is a 14 × 14 matrix. The availability of the transceiver is simply the sum of the state probabilities of states 1 and 14, which is three nines, as provided in Table 4:
Availability = P_1(t) + P_14(t).
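The behaviour of the voting logic can be illustrated with a minimal sketch. The function name and the bit patterns standing in for processor outputs are hypothetical; the actual voting logic is a hardware block.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of inputs, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


good = 0b1011  # hypothetical correct processor output

print(majority_vote([good, good, good]))        # 11: all processors healthy
print(majority_vote([good, 0b0000, good]))      # 11: one fault is outvoted
print(majority_vote([good, 0b0000, 0b1111]))    # None: no majority remains
```

Note that two faulty processors that happen to agree on the same wrong value would win the vote, which is why the model treats a second processor failure as a module failure.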

C. DIODE REDUNDANCY IN THE BUCK CONVERTER MODULE
As investigated in Section IV-C, we connect two buck converters in parallel and use diode 'OR' redundancy to get a single output. Fig. 14 presents the Markov model for the transceiver with diode redundancy in the buck converter, where state 1 is the full working state. A buck converter failure (one failure out of the two buck converters, at a rate of 2λ_1) causes the system to transition to the degraded state 14, and the notification light turns on. The failed buck converter is replaced at a repair rate of µ_3 to return to state 1. The degraded state can enter failure state 2 due to the failure of the second buck converter. The diodes are highly reliable components; therefore, their failure is neglected. The working states 1 and 14 can enter the transceiver failure states (3-11) due to the respective failures of the non-replicated modules (2-10), as displayed in compact form in Fig. 14. States 2 to 13 are similar to the states of the Markov model for the transceiver without redundancy. We solve the Markov model using (15) for a time of 5 years by using a 14 × 14 transition matrix. The transceiver availability is the probability that the transceiver is in state 1 or state 14, as provided in Table 4.

D. A HIGHLY RELIABLE FAULT-TOLERANT OWC TRANSCEIVER DESIGN
We present a highly reliable OWC transceiver circuit by combining the above-discussed redundancies in a single circuit. The reliability block diagram for the circuit was presented earlier in Fig. 10, and the Markov model is provided in Fig. 15. The Markov model of the reliable transceiver circuit has 184 distinct states, which can be sub-divided into four quadrants A, B, C, and D. Quadrants B, C, and D display different levels of degradation for each state of quadrant A. The states are numbered accordingly for easy understanding. Quadrant A represents the states due to LED array degradation and failure. Each working state of quadrant A transitions to its corresponding working state in quadrant B for a redundant buck converter failure or in quadrant C for a redundant processor failure. The working states of quadrants B and C move to quadrant D in case of a redundant processor failure and a redundant buck converter failure, respectively. As a result, the states in quadrant D represent a redundant buck converter failure and a redundant processor failure along with the LED failure and degradation states. For example, when a redundant buck converter fails, state 19A moves to state 19B, after which a redundant processor failure causes state 19B to transition to state 19D. State 19A could take another route through quadrant C (19C) to reach quadrant D (19D), where a redundant buck converter failure occurs after a redundant processor failure. Every working state of quadrants B, C, and D (excluding state 14) is repaired at a rate of µ_3 and transitions to the respective intensity state in quadrant A, which means that all the defective redundant components are replaced during the repair of the degraded transceiver. State 14 is a degraded state in every quadrant, which transitions to the fully working state 1A at the repair rate of µ_3.
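The movement between quadrants can be summarized as a small lookup table. This is a sketch with hypothetical names: a state is written as a pair (state number, quadrant), and only the transitions between working states described above are covered. Failures that take the transceiver down, such as the loss of the second buck converter, are deliberately left out.

```python
# Quadrant reached when a redundant module fails.
# "buck" = one of the two buck converters, "proc" = one of the three processors.
NEXT_QUADRANT = {
    ("A", "buck"): "B",
    ("A", "proc"): "C",
    ("B", "proc"): "D",
    ("C", "buck"): "D",
}


def step(state, event):
    """Apply one event to a working state given as (number, quadrant)."""
    number, quadrant = state
    if event == "repair":
        # Repair at rate mu_3 replaces all defective redundant components.
        # The degraded state 14 returns to the fully working state 1A.
        return (1, "A") if number == 14 else (number, "A")
    return (number, NEXT_QUADRANT[(quadrant, event)])


via_b = step(step((19, "A"), "buck"), "proc")   # 19A -> 19B -> 19D
via_c = step(step((19, "A"), "proc"), "buck")   # 19A -> 19C -> 19D
print(via_b, via_c, step(via_b, "repair"))
```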
Due to the failure of non-redundant modules, each working state (including the current-increase state 14) of a quadrant moves to a complete failure state, and we report the failure rates through an array. For example, due to the failure of modules (1-4, 7-10), each state of quadrant A can enter a complete failure, which is displayed in compact form using the array λ_A = [0, λ_11, λ_3, λ_4, λ_7, λ_8, λ_9, λ_10]. Note that the first element of the array λ_A is zero, indicating that the working states of quadrant A do not fail due to the failure of a single buck converter, as redundancy is present for module 1 in quadrant A. The second element of the array indicates that the transceiver fails due to the processor module, where we consider the failure of the voting logic. The remaining elements of the array λ_A correspond to failures of the non-replicated modules (3, 4, 7-10). Similarly, we represent the failures for the working states in quadrants B, C, and D by the arrays λ_B, λ_C, and λ_D, respectively. We solve the Markov model using (15) over the expected device life of five years using a 184 × 184 transition matrix. The sum of the state probabilities of all the operating states defines the transceiver availability, and the value is provided in Table 4.

VI. RESULTS
Table 3 compares the FIT, MTBF, reliability, and MTTR values for various transceiver designs using the reliability block diagram discussed earlier in Section IV. We conclude that introducing redundancies improves the transceiver reliability; however, each redundancy has a distinct effect. This section discusses the results at a stress temperature of 70 °C. The parallel redundancy in the LED module improves the MTBF value from 3.9 years to 9 years. For an availability of four nines, the required MTTR of this circuit is 7.9 hours, compared to 3.4 hours for the circuit with no redundancy. The MTTR value of 7.9 hours is still low for the transceiver, which we further improve by introducing voting and diode redundancy. On their own, the diode redundancy improves the MTBF to 4 years, and the voting redundancy increases it to 4.4 years. The MTTR values are 3.5 hours and 3.9 hours for the respective redundancies. We merge all three redundancies in a single circuit to obtain a highly reliable transceiver design with an MTBF value of 14 years and an MTTR value of 12.66 hours.
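The MTTR values quoted for the reliability block diagram results can be cross-checked with a short calculation. This sketch assumes the standard steady-state relation A = MTBF / (MTBF + MTTR) and a 365-day year; the relation actually used in Section IV is not reproduced here, so treat this as a plausibility check only.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def required_mttr_hours(mtbf_years, availability=0.9999):
    """MTTR giving the target availability, from A = MTBF / (MTBF + MTTR)."""
    mtbf_hours = mtbf_years * HOURS_PER_YEAR
    return mtbf_hours * (1.0 - availability) / availability


designs = [("no redundancy", 3.9), ("LED parallel", 9.0),
           ("diode", 4.0), ("voting", 4.4), ("all three", 14.0)]
for label, mtbf in designs:
    mttr = required_mttr_hours(mtbf)
    print(f"{label:14s} MTBF {mtbf:4.1f} y -> MTTR {mttr:5.2f} h")
```

With the rounded MTBF values from the text, this gives roughly 3.4, 7.9, 3.5, and 3.9 hours for the first four designs. For the combined design it gives about 12.3 hours rather than 12.66 hours, presumably because the 14-year MTBF is quoted in rounded form.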
We further assess the availability of the following circuits by using the Markov model technique:
• Circuit with no redundancy
• Parallel redundancy in the LED module
• Voting redundancy in the processor module
• Diode redundancy in the buck-converter module
• Highly reliable fault-tolerant circuit with the above three redundancies
Table 4 summarizes the availability values for the above circuits. The table shows that the circuit availability without redundancy is two nines and increases to four nines with all three redundancies. We determine these availability values by summing the probabilities of the working states of the Markov chains discussed in Section V. The repair rates are chosen such that an availability of four nines is achieved for the highly reliable transceiver design under the temperature stress conditions. To do so, we assume that the time per repair for the replicated components varies between 10 and 100 hours (including fault-detection time); in our calculations, this time per repair is the sum of the reciprocals of the fault-detection and repair rates (1/µ_fd2 + 1/µ_2). Similarly, the fault-detection and repair time per repair for non-replicated modules (1/µ_fd1 + 1/µ_1) varies from 8 hours to 36 hours. We set a lengthy repair time of ten days, or 240 hours, for notified component failures (that is, redundant component failures, 1/µ_3) and plot the availability in Fig. 16. We can use this graph to find the availability given the fault-detection and repair rates of the replicated and non-replicated components in the transceiver. Alternatively, given the availability requirement, one can read the required repair rate for both types of modules from the graph. The figure shows the readings for the availability of four nines. Table 5 provides the typical time per repair for replicated and non-replicated components. To achieve the desired availability, the user can install repair facilities for the two categories of component failures mentioned above.
For instance, achieving four nines availability necessitates a time of 56 hours per repair for replicated components and 18 hours for non-replicated components. A similar graph can be plotted for lower availability values, for which the allowed time per repair increases (columns 2 and 3 of Table 5).

VII. LIFE CYCLE COST ANALYSIS
The total life cycle cost of a device comprises the initial product cost and the estimated maintenance cost required during its lifetime. We define the life cycle cost function (C_f) of a repairable device in terms of the initial cost of the device (C_i), the cost of spares required over its lifetime (C_S), and the technician cost (C_T) for repairing the system as follows:
C_f = C_i + C_S + C_T.
This section calculates the total life cycle cost per transceiver for an indoor application with 500 transceivers. Table 6 lists the cost of the various components used in our transceiver. The cost of the components is presented in terms of cost units (1 cost unit is approximately equal to £1), calculated based on an online survey of well-known component distributor websites (Digi-Key, Mouser). Using Table 6 and the number of components in each module from Table 1, we obtain the cost of each transceiver design, as listed in Table 7. Let us assume that we sell the transceiver with a 50% margin, which gives the initial cost per transceiver for the consumer listed in the last row of Table 7. The 50% margin includes our manufacturing cost and the profits. The initial cost of a transceiver design with redundancy is higher than that of the circuit without redundancy, as the number of components in the design increases. Specifically, the cost increases by factors of 1.21, 1.59, 1.03, and 1.83 for LED, processor, buck-converter, and all three types of module replications, respectively. Table 9 provides the total initial cost of all 500 transceivers in our considered application.
The cost of spare modules for a device is typically greater than their initial cost due to various factors [24]. For instance, spare parts transportation, demand, warehouse storage, and production costs make a spare more expensive than a product manufactured in large quantities. In our case, we establish a uniform cost (including profit margin) for spares by selecting a cost slightly higher than the highest cost among the ten modules, which is 21 cost units (> 20.3 for Module 10). To compute the total cost of spares (C_S) required for the lifespan of the transceiver, we first determine the total number of repairs (N_R) necessary for 500 transceivers over 5 years. We compute N_R by dividing the device lifespan (5 years) by the MTBF value from Table 3 and multiplying the result by the total number of transceivers, as shown below:
N_R = (5 years / MTBF) × 500.
After obtaining the value of N_R, we calculate C_S for the lifespan of 500 transceivers by multiplying N_R by the cost per spare, as determined earlier. The computed value of C_S is presented in Table 9.
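As a worked example of the spares calculation, the sketch below evaluates N_R and C_S for the rounded 70 °C MTBF values quoted in the text (3.9 years without redundancy and 14 years with all three redundancies). The function names are hypothetical, and the results may differ slightly from Table 9 if the table was computed from unrounded MTBF values.

```python
LIFESPAN_YEARS = 5     # device lifetime considered in the text
N_TRANSCEIVERS = 500   # installation size considered in the text
SPARE_COST = 21        # uniform cost per spare, in cost units


def number_of_repairs(mtbf_years):
    """N_R: expected repairs for the whole installation over its lifetime."""
    return LIFESPAN_YEARS / mtbf_years * N_TRANSCEIVERS


def spares_cost(mtbf_years):
    """C_S: total cost of spares, in cost units."""
    return number_of_repairs(mtbf_years) * SPARE_COST


for label, mtbf in [("no redundancy", 3.9), ("all three", 14.0)]:
    print(f"{label:14s} N_R = {number_of_repairs(mtbf):7.2f}, "
          f"C_S = {spares_cost(mtbf):9.2f}")
```

With these inputs, the number of repairs drops from about 641 to about 179, and the spares cost from about 13,460 to 3,750 cost units.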
One of the critical components of the life cycle cost analysis is the technician cost C_T. The C_T is directly influenced by the technician's placement with respect to the repair site, as a longer travel time for the technician results in higher MTTR values [25]. In other words, faster service can be provided when a technician is located closer to the repair site. However, it should be noted that placing a technician close to the repair site can be expensive. For example, in-house salaried technicians can provide the quickest service, but a full-time salaried technician costs more than occasional third-party technician support. Therefore, we notice an inverse relationship between the C_T and the required MTTR values. Our analysis assumes that one technician is needed to provide the fastest service for the scenario. We further assume that the technician works for 8 hours per day at an average base salary of 21 cost units per hour, a value calculated from an online survey of an employment website (Indeed.com). We apply an overhead factor [26] of 2 to the technician's salary to obtain the total C_T to the employer. The overhead factor accounts for training, health coverage, tools, rented office space, and so on. Taking an average of 250 working days per year, the annual cost of a technician would be 84,000 cost units. We further consider an increase of 3% in the base salary every year, accumulating to 445,968 cost units for the considered device lifetime of 5 years. However, a part-time technician may suffice for a device with a higher MTTR value, leading to reduced salaries. We have established this relationship in Table 8, where the fastest service with MTTR values of 0-5 hours is equivalent to a full-time salaried technician. For MTTR values between 5 and 12 hours, we assume that the technician is a part-timer and the C_T is halved. Similarly, for MTTR values between 12 and 24 hours, the C_T is quartered.
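The technician salary figures can be reproduced with the following arithmetic. The sketch assumes that the first year is paid at the base rate and that the 3% increase is compounded from the second year onward. This reading is an assumption, but it recovers the quoted totals.

```python
BASE_RATE = 21        # cost units per hour (from the text)
OVERHEAD = 2          # overhead factor (from the text)
HOURS_PER_DAY = 8
DAYS_PER_YEAR = 250
RAISE = 0.03          # yearly salary increase
YEARS = 5             # device lifetime

annual_cost = BASE_RATE * OVERHEAD * HOURS_PER_DAY * DAYS_PER_YEAR
lifetime_cost = sum(annual_cost * (1 + RAISE) ** year
                    for year in range(YEARS))

print(annual_cost)              # 84000
print(f"{lifetime_cost:.1f}")   # close to the 445,968 quoted in the text
```

The five-year total comes to roughly 445,967.4 cost units, which agrees with the quoted 445,968 to within rounding.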
For MTTR values of more than 24 hours, the technician support is provided by a third-party service provider, resulting in a reduced C_T of one hour's cost per repair. We assume that the third-party service provider keeps an approximate profit margin of 25%, and the cost of the third-party technician is fixed at 56 cost units per hour for the device lifetime. The C_T values for all the circuits can be found by comparing the MTTR values of Table 3 with Table 8. The C_T for the circuit with all three redundancies at a temperature of 25 °C (MTTR > 24 hours) is calculated by multiplying the value of N_R by the C_T per repair, which is 56 cost units, i.e., 67.35 × 56 ≈ 3772.1. The C_T values for the different transceiver designs are listed in Table 9. Table 9 lists the various cost function components for our scenario, where 500 transceivers are installed. The average cost per transceiver includes a share of the total technician cost, which is higher for circuits with lower MTTR values. The average transceiver cost, listed in the second-last row of the table, falls to about 47.5% of its no-redundancy value, from 1070.92 cost units to 509.45 cost units, for the circuit with all three redundancies operating at a temperature of 70 °C. The last row of Table 9 provides the technician cost contribution to the life cycle cost of the transceivers. The technician contribution decreases from 83% to 43% for the transceivers operating at a temperature of 70 °C. Therefore, replication increases the initial device cost, but the life cycle cost of the device reduces due to the decrease in the technician cost contribution. The cost analysis for selecting an appropriate transceiver design varies based on different scenarios. Factors such as the number of installed devices, geographical location (with higher C_T in specific locations), and device lifetime requirements (with lower requirements resulting in lower initial device costs) can all contribute to cost variations. Therefore, a cost analysis is critical for selecting an appropriate transceiver design.
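The technician cost tiers and the resulting cost shares can be put together in a short sketch. The function name is hypothetical, and the handling of the exact tier boundaries (5, 12, and 24 hours) is an assumption, since Table 8 is not reproduced here. The average costs of 1070.92 and 509.45 cost units are taken from the text.

```python
FULL_TIME_COST = 445_968   # five-year full-time technician cost (from the text)
THIRD_PARTY_RATE = 56      # cost units per repair: 21 * 2 / (1 - 0.25)
N_TRANSCEIVERS = 500


def technician_cost(mttr_hours, n_repairs):
    """Five-year technician cost C_T for the whole installation."""
    if mttr_hours <= 5:
        return FULL_TIME_COST          # full-time technician
    if mttr_hours <= 12:
        return FULL_TIME_COST / 2      # part-time technician
    if mttr_hours <= 24:
        return FULL_TIME_COST / 4
    return n_repairs * THIRD_PARTY_RATE  # third party, one hour per repair


# 70 C designs from the text: (MTTR in hours, average cost per transceiver).
# n_repairs is irrelevant for these two, since both MTTRs are below 24 hours.
for label, mttr, avg_cost in [("no redundancy", 3.4, 1070.92),
                              ("all three", 12.66, 509.45)]:
    per_unit = technician_cost(mttr, n_repairs=0) / N_TRANSCEIVERS
    print(f"{label:14s} C_T per unit = {per_unit:7.2f}, "
          f"share = {per_unit / avg_cost:.1%}")
```

Under these assumptions, the technician share of the average cost is about 83% without redundancy and about 44% with all three redundancies, in line with the contributions reported for Table 9.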

VIII. CONCLUSION
OWC hardware requires fault tolerance in environments where loss of production or life is a primary concern and high network availability is required. In this paper, we analyze the reliability parameters of an OWC transceiver to calculate its mean time between failures (MTBF), reliability, and the required mean time to repair (MTTR) for an availability of four nines. We divide the OWC transceiver into independent modules to calculate each module's reliability and FIT values. This process aims to identify and replicate the low-reliability modules, thereby improving the system's tolerance to faults in such modules. In the first analysis, we use the MTBF method and identify three modules in our system with a high failure rate: LED, processor, and buck converter. We then replicate each of these modules and calculate various reliability parameters, reported in Table 3. In the end, we provide a highly reliable fault-tolerant transceiver circuit with a reliability of 0.68, compared to 0.24 without module replication. The OWC transceiver design ensures an availability of four nines with an MTTR of 12.66 hours under temperature-stress conditions. This represents a significant improvement over the transceiver design with no redundancy, which requires an MTTR of 3.4 hours. The new MTTR value can easily be achieved by keeping spares at the site.
In the second analysis, we use Markov model analysis to find the availability of the various proposed fault-tolerant designs. We increase the availability of the circuit from two nines to four nines by including all three redundancies, as displayed in Table 4. To calculate the availability of the highly reliable transceiver design, we create a composite state diagram by combining the state diagrams for three different types of faults. It covers faults in the LED module (quadrant A in Fig. 15), a single fault in the buck converter along with LED module faults (quadrant B), a single fault in the voting redundancy of the processor module along with LED module faults (quadrant C), and a combination of all three types of faults (quadrant D). This was made possible by developing a compact state diagram representation, as discussed in Section V. The compact state diagram uses arrays to express the standard state diagram.
The availability values with reference to the repair rates are also presented in a graphical form in Fig. 16. Such a graph can be used to choose various repair rates of the replicated and non-replicated components in the transceiver to achieve the desired availability value. The graph shows the range of readings for the availability of four nines.
We further provide the device life cycle cost analysis of the various replications. The analysis suggests that the transceiver designs with replications have a higher initial device cost, but the life cycle cost of the highly reliable fault-tolerant design falls to about 47.5% of that of the design without redundancy under temperature stress conditions. With this, we conclude that we can increase the reliability of any OWC transceiver by replicating the LED, power supply, and processor modules, since these modules usually have a higher failure rate. The method described in our study can be used to estimate the reliability of the user transceiver circuit (IR transmitter and visible light receiver) and obtain the reliability of the duplex system, consisting of an access point and user transceivers. There is scope for future work in developing software for fault diagnosis and for reconfiguring the replicated modules.