Data Analysis and Results of the Radiation-Tolerant Collaborative Computer On-Board OPTOS CubeSat



Introduction
In recent years, space missions have experienced impressive growth in the number of small satellites designed, manufactured, and launched into Earth orbit. Many space agencies, as well as universities and aerospace companies, have undertaken the construction of reliable, low-power, and low-cost small satellites to make scientific experiments and technology demonstrations affordable for a wide community of researchers [1]. In [2], a summary of nanosatellites and CubeSats launched since 1998 is reported: a total of 948 nanosats have been manufactured and launched, of which 587 are currently operational in orbit. For their part, 875 CubeSats have been launched since that date. The past 10 years have seen the highest number of launches of these spacecraft, because they can be launched as an additional payload of a larger mission.
The miniaturization of electronic devices for sensing, processing, and actuating has helped to produce very effective and very small satellites. There has been intensive research into innovative solutions for building these small satellites at low cost while maintaining the reliability required in aerospace systems. The final system must behave with the robustness expected from space systems, while consuming significantly less power and costing significantly less money. Space-qualified components do not provide the degree of flexibility required for these small systems. The inclusion of commercial components for noncritical tasks is a reality in many space missions, especially in CubeSats. In [3], a survey of CubeSats launched in the past 12 years highlights the margin for improvement in control (On-Board Computer (OBC)) and communications, which would allow larger missions for space exploration to be accomplished, and not only in Low Earth Orbit (LEO). In this study, and in NASA reports [4], the use of commercial devices is not forbidden, although testing is highly recommended.
In line with this trend, the National Institute of Aerospace Technique (INTA) proposed, designed, and developed in 2013 a 3U CubeSat named OPTOS [5,6] that features small size, low cost, low power consumption, low weight, and high-performance capabilities. The satellite was partially built with commercial components whose fault tolerance had been previously assessed. OPTOS was conceived with no data wires, relying instead on a wireless optical data bus. The Optical Wireless Links to intra-Spacecraft communications (OWLS) technology [7,8] is based on diffuse light emitted by LEDs through open space within the satellite, which is received by discrete photodiodes and processed by programmable logic devices (PLD). This wireless approach facilitates the implementation of new collaborative hardening techniques, where all messages sent by any unit are received by all units at the same time (mesh-like topologies). Furthermore, the system was designed as a network of distributed terminals, each in charge of different payloads but sharing critical and global tasks. Collaborative hardening techniques were applied to minimize the effects of transient faults in the architecture [9].
The design and development of this CubeSat have already been reported in scientific publications and technical reports. In this paper, the authors present the assessment of the OPTOS collaborative hardening computer during its 3-year operation in LEO.
The paper is organized as follows. In Section 2, related work on the design of reliable on-board computers is detailed. Section 3 states the procedure followed to analyse the radiation environment and apply this knowledge to the OPTOS OBC. Section 4 describes the on-board computer architecture. In-orbit data analysis is presented in Section 5. Finally, Section 6 states the conclusions of this work.

Related Work
The hardening techniques applied to processor architectures have been addressed through software-based approaches, hardware-based approaches, or both [10]. One of the most reliable and widely used approaches consists of triplicating the processor core in a COTS field-programmable gate array (FPGA), with an external radiation-hardened (Rad-Hard) component performing majority voting and reconfiguration tasks. This technique is commonly known as scrubbing [11,12]. It consists of reading the configuration memory of the FPGA and comparing it with the initial value, which is stored in a Rad-Hard nonvolatile memory. Whenever a bit-flip is detected, the external component (the so-called scrubber) reconfigures the FPGA with the correct values. The whole process is performed on the fly, without interrupting operation. This technique allows a reliable use of SRAM-based FPGAs, increasing performance over traditional space computers. The main drawbacks of these solutions are a significant increase in power consumption and a longer development time, introduced by the complexity of the redundancy and the scrubbing process.
An alternative mechanism is to apply redundancy at the device level, or even to make the whole system board redundant. This approach is called hot redundancy: all the redundant units process data synchronously, and external hardware is in charge of identifying faulty units. Examples of this architecture are ESA's Data Management System (DMS-R) and EADS's SPAICE computer. Obviously, these approaches are reserved for highly demanding, resource-rich platforms, such as the International Space Station or telecom satellites. Some common software hardening techniques are referred to as Software-Implemented Hardware Fault Tolerance (SIHFT). These techniques usually focus on correcting computational, control flow, or memory errors [13]. SIHFT techniques are low-cost and much easier to implement than their hardware counterparts. A well-known disadvantage of software hardening is the overhead it adds in terms of both execution time and memory usage. However, some new methodologies, such as those presented in [14], have been shown to reduce the execution time overhead while, at the same time, increasing the reliability of the system against Single Event Functional Interrupt (SEFI) errors. Nevertheless, software techniques are only a partial solution, since Single Event Latch-up (SEL) and Total Ionizing Dose (TID) effects on microelectronics can only be dealt with by hardware approaches.

Computer Radiation Assurance
As part of the radiation engineering activities of the OPTOS mission, a full radiation tolerance analysis was carried out. This analysis included the characterization of the expected high-energy particle fluxes during the mission and a detailed simulation study of the propagation of such fluxes through a full 3D model of the platform (with the FASTRAD [15] tool), which considered the main shielding structures and subsystems. The final aim was to provide a realistic estimation of the radiation levels inside the spacecraft in the form of particle spectral fluxes and cumulated ionizing and nonionizing doses at selected locations and components.
These estimated radiation levels were used for the selection of technologies according to their radiation tolerances and provided the input environment for the evaluation of the End-of-Life (EOL) degradation of radiation-sensitive elements. Moreover, the outputs of the study were used for the estimation of the average Single Event Effects (SEE) rates for specific components. In particular, the rates for Single Event Upsets (SEU) in different memory regions and for device functional interruptions (SEFI) were estimated for the Xilinx CoolRunner-II and Virtex-II programmable logic devices.
In the following subsections, we detail the calculation techniques and models used for the radiation environment propagation, as well as the experimental irradiation test data used for the SEE rate prediction. For the estimation of the Van Allen belt particle fluxes, the environment model used was AP8/AE8 [16][17][18], which was recommended by the European Space Agency (ESA) in its ECSS guidelines [19,20] as the standard model for near-Earth missions. For solar events and GCRs, following ESA recommendations, the environmental models applied were, respectively, ESP/PSYCHIC [21][22][23], with a 95% confidence level, and ISO-15390 [24], including the Earth magnetosphere cut-off.

OPTOS Radiation 3D Model.
To propagate the mission environment from the previous section down to specific locations inside the OPTOS platform, a 3D mass model of the main spacecraft structural elements was built by means of the FASTRAD tool and the actual OPTOS mechanical CAD designs. In addition, simplified models of the CoolRunner-II and the Virtex-II ICs were implemented, simulating both devices as plastic-packaged components, with the actual IC dimensions, and encapsulating them inside a thin silicon layer (100 μm thick). Figure 1 shows the OPTOS 3D model as built with the FASTRAD tool.
The final OPTOS 3D model achieved a total simulated mass of 2.5 kg. Once the 3D model was geometrically verified, it was exported as a file in GDML format [25] to be used as an input for the particle propagation code, GEANT4 [26,27], a Monte Carlo toolkit developed at the European Organization for Nuclear Research (CERN), capable of simulating the passage and interaction of high-energy radiation in matter using realistic electromagnetic and hadronic models [28].
The physics packages included for the simulation of the OPTOS environment were the standard model, option-4, for the electromagnetic interactions, and the QGSP-BIC package for the hadronic processes. The average annual proton spectral flux obtained for locations inside the OPTOS platform is shown in Figure 2.

Experimental Irradiation Test.
In 2007, during the feasibility phases of the OPTOS program, a series of irradiation test campaigns were conducted [29] with the aim of characterizing the SEE sensitivity of the CoolRunner-II devices to high-energy protons. Six CPLDs were tested (DUT 1 to DUT 6), four of them (DUT 1, DUT 7, DUT 2, and DUT 3) in static mode to study SEU susceptibility: DUT 1 and DUT 7 were irradiated while powered on to test the SRAM memory, and DUT 2 and DUT 3 were irradiated powered on and off, respectively, to test the flash configuration memory under different conditions. All devices were tested with monoenergetic proton beams, at energies from 10.75 to 62.91 MeV, up to a maximum fluence of 10^10 protons/cm^2 or 100 SEE events, whichever was reached first.
Testing results, in the form of SEE cross-section versus incident proton energy, were obtained for each DUT and fitted to a Weibull distribution function. Table 1 shows the Weibull fitting parameters for the different DUTs and SEE.
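To make the fitting step concrete, the following minimal sketch evaluates a Weibull cross-section curve. It assumes the four-parameter form commonly used for SEE data (saturation cross-section, onset energy, width, and shape); the text does not state which parameterization was used, and the parameter values below are hypothetical placeholders, not the values of Table 1.

```python
import math

def weibull_cross_section(energy_mev, sigma_sat, e_onset, width, shape):
    """Four-parameter Weibull SEE cross-section (cm^2/device).

    Assumed form: sigma(E) = sigma_sat * (1 - exp(-((E - E0) / W) ** S))
    for E > E0, and 0 below the onset energy E0.
    """
    if energy_mev <= e_onset:
        return 0.0
    x = (energy_mev - e_onset) / width
    return sigma_sat * (1.0 - math.exp(-(x ** shape)))

# Hypothetical parameters for illustration only (NOT taken from Table 1).
example_params = dict(sigma_sat=1e-9, e_onset=10.0, width=20.0, shape=1.5)

# Evaluate the curve at the two ends of the tested proton energy range.
for energy in (10.75, 62.91):
    print(energy, weibull_cross_section(energy, **example_params))
```

With real fitting parameters, the same function would give the cross-section to be folded with the attenuated proton spectrum when predicting SEE rates.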
For the Virtex-II devices [30], Tables 2 and 3 show the manufacturer's irradiation testing results for SEU and SEFI in the form of Weibull fitting parameters for heavy ions and protons.

International Journal of Aerospace Engineering

Estimated SEE Error Rates.
Finally, combining the particle models from Section 3.1, the attenuated proton spectrum from Section 3.2, and the experimental SEE data from Section 3.3, the SEE occurrence rates for the OPTOS mission were estimated. Table 4 shows the predicted SEU (DUT 1 and 7) and SEFI (DUT 4 and 5) rates for the CoolRunner-II CPLD used in the OPTOS OBC in Low Earth Orbit. The same analysis was performed for the Virtex-II devices (see Table 5).
After performing all these calculations, the MTBF of the CoolRunner-II CPLD is around 20.83 days in continuous operation without any reconfiguration cycle. A Virtex-II (XC2V256) is the device used for DOT 7 (Distributed OBC Terminal). This device was selected instead of the CR-II because DOT 7 controls a complex payload called APIS [31] and needed extra resources to deal with it. The Virtex-II XC2V256 FPGA has 1,593,632 configuration bits, and therefore the SEU rate is estimated at 0.07 SEU/Device/Day. Therefore, the MTBF of DOT 7 will be very similar to that of the other DOTs (1 through 6). As shown in Table 6, SEFI are negligible for the OPTOS mission.
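As a rough cross-check of the figures above, the sketch below converts between an event rate and an MTBF. It assumes the simplest constant-rate relation, MTBF = 1/rate, which is not necessarily how the quoted values were derived; the function names are illustrative.

```python
def mtbf_days(rate_per_day):
    """MTBF in days for a constant event rate (events/device/day)."""
    if rate_per_day <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate_per_day

def rate_per_bit_day(rate_per_device_day, n_bits):
    """Average per-bit upset rate implied by a per-device rate."""
    return rate_per_device_day / n_bits

# Figures quoted in the text.
VIRTEX_SEU_RATE = 0.07        # SEU/device/day
VIRTEX_CONFIG_BITS = 1_593_632
CR2_MTBF_DAYS = 20.83

print(mtbf_days(VIRTEX_SEU_RATE))                             # about 14.3 days
print(rate_per_bit_day(VIRTEX_SEU_RATE, VIRTEX_CONFIG_BITS))  # about 4.4e-8 SEU/bit/day
print(1.0 / CR2_MTBF_DAYS)                                    # about 0.048 events/device/day
```

Under this assumption both device types have an MTBF of the order of a few weeks.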

On-Board Computer
The proposed architecture is based on a distributed processor where all terminals are connected by an Optical Wireless CAN Bus [7]. Each unit can redundantly perform all the critical duties that belong to the OBC, while separately providing specific services to the Sub-Systems (S/S) or Pay-Loads (P/L) connected to it. Typical critical duties of an OBC are real-time maintenance, self-check supervision, and P/L latch-up control. Additionally, parallel processing may also be achieved if necessary. The purpose of this architecture is to maximize the processing capabilities and to develop collaborative hardening techniques that increase the reliability of the OBC.
This design considers two kinds of units.
(i) Enhanced Processing Hardware (EPH). This unit is based on a Xilinx Virtex-II (XC2V1000) FPGA commanded by a MicroBlaze soft processor. The aim of this unit is to support the On-Board Software (OBSW), which processes communications through the TTC S/S and supports the ADCS software. These two S/S require a complex processing capability that cannot be achieved by DOT units.
(ii) Distributed OBC Terminals (DOT). These units are based on the ultralow-power Xilinx CoolRunner-II (XC2C512) CPLD. They are oriented to control all the other satellite S/S (PDU, ADCS, etc.) and P/L. They are all interconnected (including the EPH) through an optical wireless CAN bus, and they provide intelligent control logic to the S/S and P/L connected to them through the following interfaces:
(a) Four ten-bit ADC inputs with a dynamic range of 0 to 3.0 V
(b) Sixteen digital inputs/outputs
Figure 3 shows the top-level architecture of the OPTOS collaborative computer.
The On-Board Communications Subsystem (Ob-Com) is based on a set of miniaturized transceivers capable of implementing an optical wireless network inside the satellite [7]. This network is made available through a CAN bus implementation, giving the OBC instant and complete communication between all its DOTs. Ob-Com is based on OWLS technology [8], which transmits data with diffuse light (emitted by LEDs) through the open spaces in the satellite. Data is received by discrete photodiodes, with real-time processing performed by a built-in PLD.
4.1. Collaborative Hardening.
This section describes the hardened computer architecture of the OPTOS satellite, which relies on collaborative hardening, with special attention to the implementation of the critical tasks. This innovative hardening technique must be considered as a design paradigm. It has been developed to create a complete solution for small-mass, low-power, low-price, resource-rich, and reliable picosatellite processing architectures.
On-Board Real Time maintenance is a critical task in every satellite system [32]. This task is usually undertaken either by a Rad-Hard component that keeps track of Real Time or by a Global Navigation Satellite System (GNSS) receiver [33][34][35]. Both options are expensive in terms of power consumption and cost. A further benefit of the proposed collaborative system is that it can maintain On-Board Real Time without either of these options.
In the development of the OPTOS OBC, the challenge was twofold. First, a reliable and dependable system had to be built using devices that are not inherently robust: commercial CPLDs with SEU sensitivity in their SRAM memory elements. Secondly, critical tasks to be run on the satellite can be replicated along the nodes in the network to perform an intrinsically redundant task execution. One of these critical tasks is real-time maintenance, which has been designed in a distributed way within the network previously described, avoiding a single point of failure thanks to collaborative hardening.
Collaborative hardening is based on the parallel capabilities of the PLD. Each DOT uses its free resources to execute, in parallel with its own dedicated duties, specific algorithms to maintain OBC critical tasks in collaboration with all the other DOTs in the system. As an example, the satellite Real Time is maintained through the Real-Time Maintenance (RTM) process. Real Time is updated from Earth through the TTC subsystem at every Earth contact to avoid time shifts due to the oscillator errors of each DOT. Once the Real Time has been introduced in the OBC, every DOT maintains its Own Real Time (ORT). Every second, the RTM process begins for all DOTs. The DOTs can broadcast two types of messages. First, each DOT broadcasts its ORT to the other elements of the network. Secondly, a DOT broadcasts a voting message when its ORT is not equal to the Real Time received from the other DOTs. Once the complete voting process has finished, those DOTs that detect a failure in their ORT execute a reset process to clean their configuration memory of possible errors caused by SEUs. Meanwhile, collaborative hardening protects Real-Time maintenance, since the nonfailing DOTs succeed, through the RTM process, in keeping the satellite Real Time updated. As explained hereafter, in Section 5.3, no Real Time loss has been detected across the whole mission.
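The RTM round described above can be illustrated with a heavily simplified sketch. It assumes that the outcome of the vote is a plain majority over the broadcast ORT values; the message formats, timing, and exact voting rules of the on-board implementation are not specified here, and the function name `rtm_round` is hypothetical.

```python
from collections import Counter

def rtm_round(ort_by_dot):
    """One simplified Real-Time Maintenance round.

    ort_by_dot maps a DOT number to the Own Real Time (ORT) it broadcast.
    Returns (agreed_time, dots_to_reset), where agreed_time is the value
    held by a strict majority of DOTs (an assumption of this sketch) and
    dots_to_reset lists the DOTs whose ORT disagrees with it.
    """
    counts = Counter(ort_by_dot.values())
    agreed_time, votes = counts.most_common(1)[0]
    if votes <= len(ort_by_dot) // 2:
        raise RuntimeError("no majority: Real Time cannot be recovered")
    dots_to_reset = sorted(
        dot for dot, ort in ort_by_dot.items() if ort != agreed_time
    )
    return agreed_time, dots_to_reset

# Example: DOT 5 has lost three seconds after an upset.
broadcasts = {1: 1000, 2: 1000, 3: 1000, 4: 1000, 5: 997, 6: 1000, 7: 1000}
print(rtm_round(broadcasts))  # (1000, [5])
```

In this picture a single faulty unit never changes the agreed time: it is outvoted, reset, and resynchronized, which is the property the collaborative scheme relies on.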

In-Orbit Data Analysis
This section is composed of three subsections. The first subsection describes how the collaborative computer generates data, which type of data, and under what circumstances. The second subsection describes the tools implemented to analyse the gathered data. Finally, the third subsection presents the results of the analysis.

Telemetry.
To analyse the behaviour of the fault-tolerant collaborative computer on board a satellite, there are several considerations and difficulties to take into account. Probably the most important one is the observability of the system. The system we intend to analyse is the one that generates the data to be studied, and therefore any on-board processing of those data should be avoided. This means that raw data must be sent to Earth, which limits the amount of data that can be gathered. This limitation is compounded by the bandwidth shared with the other subsystems and payloads and by the low downlink capacity. With these considerations in mind, we decided to produce telemetries (i.e., pieces of data suitable to be stored in the On-Board Memory for later download to Earth) storing only the CAN messages generated by the OBC that concern the collaborative hardening. This approach has maximized the amount of useful data retrieved to validate the OBC collaborative activities.

Analysis Tool.
The analysis tool designed and developed to study the behaviour of the OPTOS OBC has been divided into two well-differentiated modules. The first module, called TM Decoder, is meant to extract the useful information from the Telemetry Data Packages that are transferred from OPTOS to Earth. This tool extracts the CAN messages, propagates the timestamp of each message, and stores them in a structure that is easy for the second module to manage.
The second module is called OBC Simulator. This module is model-driven software in which each DOT has been modelled with a set of parametric values. Each of these models can produce a response that is deterministic not only in form but also in time. The models are fed with only two inputs as follows:
(i) Clock. This is the tick of each model, which simulates the real oscillator mounted on each DOT. The clock allows minimal differences for each DOT, as each oscillator has small differences (the QT25L9M part from Q-TECH was used; this oscillator has a frequency stability of 50 ppm), allowing the simulation of scenarios that are more realistic and, at the same time, more random
(ii) CAN Messages. Messages generated by the other units that constitute the OBC
Finally, the analysis tool connects the TM Decoder module with the OBC Simulator module, feeding the latter with the messages extracted from the telemetry. The analysis tool compares the output of each DOT model with the messages generated on board the OPTOS satellite. Whenever a message falls outside the simulated behaviour of the model, it is marked as erroneous. An erroneous message is defined as a misbehaviour of the DOT due to an SEU or the accumulation of multiple SEUs.
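The effect of the 50 ppm oscillator stability on the simulated clocks can be sketched as follows. This is only an illustration of how per-DOT clock offsets might be modelled; the uniform distribution, the seed, and the function names are assumptions, not details of the actual OBC Simulator.

```python
import random

OSC_STABILITY_PPM = 50.0  # frequency stability quoted for the oscillator

def worst_case_drift_s(elapsed_s, ppm=OSC_STABILITY_PPM):
    """Largest clock error (seconds) accumulated over elapsed_s seconds."""
    return elapsed_s * ppm * 1e-6

def make_dot_offsets_ppm(n_dots, ppm=OSC_STABILITY_PPM, seed=0):
    """Draw one fixed frequency offset per DOT, uniform in [-ppm, +ppm].

    The uniform distribution is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return [rng.uniform(-ppm, ppm) for _ in range(n_dots)]

def simulated_time_s(true_elapsed_s, offset_ppm):
    """Time counted by a DOT whose oscillator is off by offset_ppm."""
    return true_elapsed_s * (1.0 + offset_ppm * 1e-6)

# A free-running oscillator may drift by up to about 4.32 s per day.
print(worst_case_drift_s(86_400))

# Seven simulated DOT clocks after one hour without an Earth update.
for dot, offset in enumerate(make_dot_offsets_ppm(7), start=1):
    print(dot, simulated_time_s(3_600, offset))
```

Drift of this size between Earth contacts is why Real Time is refreshed from the ground and cross-checked among the DOTs every second.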
To validate the analysis tool (among other functionalities of the satellite), a functional Mission Simulation Test (MST) was carried out during the Assembly, Integration and Verification phase of the OPTOS CubeSat. During this test campaign,
(i) Sun and eclipse periods were simulated as if the satellite were in a real orbit around Earth
(ii) Earth contacts were limited to a real scenario
(iii) Communications were only achieved through the RF Telemetry and Telecommand subsystem
The MST was carried out for 21 days. The OBC telemetries gathered were then fed into the analysis tool. Only 8 CAN messages out of more than 30,000 were marked as erroneous by the OBC Simulator. After analysing those messages, all of them turned out to be due to a misunderstanding of the programmed behaviour of the EPH unit. The error in the model was corrected. This test was helpful not only for validating the tool but also for gaining more confidence in the collaborative hardening techniques implemented in the OBC.

In-Orbit Results.
At End-of-Life, the OPTOS ground segment had received a total of 102,439,125 CAN messages coming from OBC collaborative hardening tasks. From the whole set of messages, only 280 have been categorized as erroneous. Table 7 shows the numbers categorized by DOT.
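For scale, the overall error fraction implied by these two totals can be computed directly. The sketch below uses only the figures quoted above; the function name is illustrative.

```python
TOTAL_MESSAGES = 102_439_125
ERRONEOUS_MESSAGES = 280

def error_fraction(errors, total):
    """Fraction of messages flagged as erroneous."""
    if total <= 0:
        raise ValueError("total must be positive")
    return errors / total

fraction = error_fraction(ERRONEOUS_MESSAGES, TOTAL_MESSAGES)
print(f"{fraction:.2e}")                            # 2.73e-06
print(round(TOTAL_MESSAGES / ERRONEOUS_MESSAGES))   # 365854
```

That is, roughly 2.7 erroneous messages per million, or about one in every 366,000 messages.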
The relationship between the number of errors and the number of messages is very consistent in DOT 3, 4, 6, and 7. For DOT 1 and 2, the number of errors is higher than for the other DOTs. The reason is that DOT 1 and 2 also have the highest CAN ID priority among all units. This means that during the arbitration phase of the CAN protocol, when two or more units try to send a message at the same bit time, access to the bus is obtained by the high-priority units (i.e., DOT 1 and DOT 2). As explained before, the Real-Time Maintenance algorithm starts with every unit trying to send a Broadcast Message with its own real time. As most of the Broadcast Messages are sent by these units, they send more messages than the others and therefore account for more erroneous messages.
Figure 4 illustrates this situation well, together with the relationship between the number of messages sent and the number of errors detected.
Nevertheless, the Real-Time Maintenance algorithm does not explain why DOT 5 also has such a large number of errors compared with the other DOTs. To find a possible explanation, we studied the satellite housekeeping information, searching for any significant change in the measured parameters of DOT 5. In fact, only three months after OPTOS was injected into its orbit, an unexplained issue began to appear in the current housekeeping of DOT 5. Each DOT has a characteristic current draw depending on the connected P/L and S/S and on its functionality. The nominal current of DOT 5 was 23 mA, but in February 2014 the current started to decrease, down to 14 mA. During satellite commissioning, we were not able to see any unusual behaviour of DOT 5; however, the thorough analysis carried out during the last months and explained in this paper shows that deterioration in the unit caused extra errors in this DOT throughout the whole mission (see Figure 5).
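The priority effect can be illustrated with a small model of CAN bitwise arbitration, in which a dominant bit (0) overrides a recessive bit (1) so that the numerically lowest identifier wins the bus. The identifiers used below are hypothetical; the actual CAN IDs assigned to the DOTs are not given in the text.

```python
def arbitrate(identifiers, id_bits=11):
    """Return the identifier that wins CAN bitwise arbitration.

    Identifiers are compared bit by bit, most significant bit first.
    The bus behaves as a wired-AND: if any contender sends a dominant 0,
    the bus reads 0 and every contender that sent a recessive 1 backs off.
    """
    contenders = set(identifiers)
    for bit in range(id_bits - 1, -1, -1):
        bus_level = min((ident >> bit) & 1 for ident in contenders)
        contenders = {i for i in contenders if (i >> bit) & 1 == bus_level}
    return next(iter(contenders))

# Hypothetical identifiers: a lower value means a higher priority.
dot_ids = {"DOT 1": 0x101, "DOT 3": 0x103, "DOT 7": 0x107}
winner = arbitrate(dot_ids.values())
print(hex(winner))  # 0x101
```

When all units start their broadcast in the same bit time, the unit with the lowest identifier always gets through first, which is consistent with the high-priority DOTs sending the largest share of broadcast messages.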
CAN messages marked as erroneous have been categorized according to the nature of the problem that caused them. Five different groups have been defined as follows:
(a) DOT Timing Error.
Across the whole analysis performed with all the telemetry gathered during the almost 3-year mission of the OPTOS OBC, no System Error (i.e., loss of the real time due to an error in the collaborative hardening techniques) has been found. This analysis confirms that the reliability of the system is higher than the sum of the reliabilities of its components. The OPTOS OBC has successfully applied collaborative hardening techniques to maximize reliability while reducing cost, mass, and power, with an OBC made of non-radiation-tolerant COTS. Of course, this could never have been achieved if the parts used were very susceptible to Single Event Latch-up or Total Ionizing Dose.
Regarding the unitary errors found through the analysis, the real distribution is shown in Figure 6.
As clearly seen in the graph, the TIME Errors are the most predominant of all, not only in DOT 1 and 2, as already stated, but also, to a lesser extent, in the rest of the units. This is explained by two complementary reasons as follows:
To conclude the analysis, we decided to add the latitude and longitude to each of the messages identified as errors. Given the earlier cross-section analysis of both devices, passes of the satellite through the South Atlantic Anomaly (SAA) and over the poles (especially when coinciding with a solar flare) should be the most likely to produce errors. We created a tool to propagate the satellite position from a TLE file provided by NORAD. We then loaded the result into a Google Fusion Table to visualize the data on a map. As expected, the SAA was the main driver of the SEUs affecting the OPTOS OBC.
To check whether the map images could provide any further information, we decided to include two distributions. In the first image (Figure 7), each dot colour represents a different unit (DOT 1, 2, etc.). In Figure 8, the errors are categorized by error type.
No significant differences have been found either by DOT or by type of error. The distribution of errors along the SAA and the poles is exactly the behaviour predicted for a satellite in LEO.
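Once each erroneous message carries a latitude and longitude, grouping the errors by region is straightforward. The sketch below shows one way to do it; the rectangular SAA box and the polar latitude threshold are rough, hypothetical values chosen for illustration and are not the regions used in the analysis.

```python
# Hypothetical region limits, in degrees (illustrative only).
SAA_LAT_RANGE = (-50.0, 0.0)
SAA_LON_RANGE = (-90.0, 40.0)
POLAR_LAT_THRESHOLD = 60.0

def normalize_longitude(lon_deg):
    """Wrap a longitude into the interval [-180, 180)."""
    return (lon_deg + 180.0) % 360.0 - 180.0

def classify_position(lat_deg, lon_deg):
    """Label a sub-satellite point as 'SAA', 'polar', or 'other'."""
    lon = normalize_longitude(lon_deg)
    in_saa_lat = SAA_LAT_RANGE[0] <= lat_deg <= SAA_LAT_RANGE[1]
    in_saa_lon = SAA_LON_RANGE[0] <= lon <= SAA_LON_RANGE[1]
    if in_saa_lat and in_saa_lon:
        return "SAA"
    if abs(lat_deg) >= POLAR_LAT_THRESHOLD:
        return "polar"
    return "other"

# Made-up positions, only to exercise the function.
for lat, lon in [(-25.0, -45.0), (75.0, 10.0), (20.0, 100.0)]:
    print(lat, lon, classify_position(lat, lon))
```

Counting the labels per DOT or per error type would give a numerical counterpart to the two map views.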

Conclusions
The use of commercial electronic devices in aerospace applications is still considered a risky decision. However, space agencies, academia, and aerospace companies are using them on board spacecraft for noncritical tasks. This work presents a real case of a satellite, the OPTOS CubeSat, designed, manufactured, launched, and flown during a 3-year mission, that merges the new winds of flexibility, smart fault tolerance techniques, and cost consciousness to prove that it is possible to produce reliable and effective architectures for small satellites.
In this paper, the hardening-by-design techniques applied to the collaborative OBC of the OPTOS satellite are presented. Due to the use of COTS, a comprehensive radiation assurance analysis has been carried out, based on the tests performed by INTA on the CR-II during the early phases of the project and, for the Virtex-II FPGA, on the analysis carried out by the Xilinx Radiation Test Consortium.
Moreover, the authors have produced a set of models and simulators to thoroughly analyse the in-orbit behaviour of the On-Board Computer. These tools were previously validated with real data and have returned a valuable set of processed data.
Collaborative hardening techniques have proven reliable enough to support the critical tasks of a small satellite while allowing the use of components that are much more efficient in terms of power consumption, processing capabilities, and cost. The data analysis presented in the "In-Orbit Results" section showed that the collaborative architecture succeeded in preventing single-unit errors from propagating to the system, keeping the critical tasks fault-free during the whole 3-year mission of OPTOS.
Future small satellites may take advantage of the presented architecture to reduce cost, size, and power consumption while keeping the critical tasks of the satellite safe.
3.1. OPTOS Radiation Environment.
Launched on November 23rd, 2013, into a sun-synchronous orbit with an average altitude of 670 km, the OPTOS mission was exposed to a severe radiation environment, consisting essentially of a dominating flux of high-energy protons and electrons coming from the Van Allen inner radiation belt, occasionally perturbed by solar particle events, and the constant background flux of high-Z species due to the galactic cosmic ray (GCR) contribution.

Figure 3: OPTOS On-Board Computer Top-Level Architecture. (OPTOS S/S: EPS (Electronic Power Subsystem); ADCS (Attitude Determination and Control Subsystem), which comprises a set of magnetometers (MGM), a reaction wheel (RW), and several sun sensors; On-Board Communications based on OWLS; TTC (Telemetry/Telecommand and Control); OBSW (On-Board SoftWare), comprising the Application and Boot Software; and the OBC itself. OPTOS P/L: radiation monitor based on RadFETs (ODM), a novel magnetometer based on giant magnetoresistance (GMR), Fiber Bragg Grating for optical sensing (FIBOS), and an Athermalized Panchromatic Image Sensor (APIS)).
Factors limiting the amount of telemetry that can be gathered (see Telemetry):
(i) The OBC must share bandwidth with 7 other subsystems and 4 payloads
(ii) The downlink speed is 5 kbps, with only 3 to 4 contacts with the Earth Ground Station and a typical total contact time of 20 minutes per day (including the uplink of Telecommands to the satellite)
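These figures bound the daily telemetry volume. As a back-of-the-envelope sketch, assuming that 5 kbps means 5,000 bits per second, that the whole 20 minutes is available for downlink (an upper bound, since uplink time is included), and, purely for illustration, that the bandwidth is split equally among the 12 units:

```python
DOWNLINK_BITS_PER_S = 5_000       # assumes "5 kbps" means 5,000 bit/s
CONTACT_MINUTES_PER_DAY = 20      # typical total contact time per day
N_SHARING_UNITS = 12              # OBC + 7 other subsystems + 4 payloads

def daily_downlink_bytes(bits_per_s=DOWNLINK_BITS_PER_S,
                         minutes=CONTACT_MINUTES_PER_DAY):
    """Upper bound on the bytes that can be downloaded per day."""
    return bits_per_s * minutes * 60 // 8

def equal_share_bytes(total_bytes, n_units=N_SHARING_UNITS):
    """Per-unit budget under a hypothetical equal split."""
    return total_bytes // n_units

total = daily_downlink_bytes()
print(total)                      # 750000 bytes per day
print(equal_share_bytes(total))   # 62500 bytes per day per unit
```

Even under these optimistic assumptions, the OBC has well under 1 MB per day available, which motivates storing only the CAN messages related to collaborative hardening.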

Figure 4: Total messages sent by unit and its committed errors.

Figure 5: Housekeeping analysis of DOT 5 current deterioration values.

Figure 6: Distribution of type of errors per DOT.

Figure 7: World map error distribution by DOT type.

Figure 8: World map error distribution by error type.

Table 4: SEU and SEFI rates for CoolRunner-II.

Table 6: SEFI rates for Virtex-II for OPTOS radiation environment.

Table 7: Total number of messages vs. erroneous ones.