Alphabet Size Matching Techniques Based on Non-Binary Gilbert-Varshamov Bounded Limits for Synchronization Finite State Markov Channel

The Gilbert-Varshamov (GV) lower bound is used to provide indications and prescriptions for the outer code parameters for a memory synchronisation model that focuses solely on the internal resynchronisation process. The binary and q-ary GV bounds are utilised in this analysis to indicate parameters to remove the remaining substitution errors and provide a complete framework. Procedures and examples are provided to determine optimal outer code parameters for given inner entropies and residual substitution errors produced during resynchronisation. In particular, using the non-binary GV bounds allows us to match the best alphabet size for given parameters. For the cases explored, a 16-ary GV bound provides the best results, with an (n, k, d) code of (120, 57, 37) being a possible outer code when the inner entropy is 0.1. Using GV bounds for outer code parameter considerations frees the system from using stringent codes and instead allows any outer code to be utilised to meet the required error correction needs.


I. INTRODUCTION
Reliable communication is a necessity in today's world with its ever-increasing data transmission demands. Communication channels should be able to recover easily from errors produced during the transmission process. The fundamental principle that allows communication systems to recover from errors is redundancy [1]. Adding more bits or symbols to a data stream based on predefined rules provides the power to detect and correct errors in the received sequence. The more additional data is appended to the information, the better the chance of recovering from an erroneous transmission. These error correction capabilities, however, come at the cost of transmission bandwidth and, consequently, less efficient use of the channel. This naturally leads to a fundamental problem in information theory, where redundancy (and consequently error-correcting capability) is traded off against the efficient use of the channel [1].
The associate editor coordinating the review of this manuscript and approving it for publication was Abdullah Iliyasu.
Achari and Cheng [2] consider a channel model and decoding scheme to correct insertion and deletion errors in a system based on a Synchronisation Finite-State Markov Channel (S-FSMC) where Insertion-Deletion-Substitution (IDS) errors are present. That paper restricts its focus to the inner decoding of the communication system, where the purpose of the inner decoder is to regain synchronisation by removing the insertion and deletion errors detected. However, this resynchronisation process does not account for substitution errors caused by the channel or by incorrect decoding in the inner decoder. This paper aims to provide considerations and suggestions for the outer code parameters of the S-FSMC presented in [2] to ensure reliable communication and provide a framework for a complete system.
This paper makes three main contributions. Firstly, the binary GV lower bound is used to give indications of the parameters of the outer code needed to ensure reliable communication for the S-FSMC. Secondly, these bounds are extended to q-ary symbols, which seeks to reduce the effects of bursty errors and provide a more efficient coding scheme. Lastly, using GV bounds to indicate parameters for the outer code frees the synchronisation decoder from stringent outer substitution codes such as LDPC codes. This allows any code construction and substitution Error Correction Code (ECC) to be utilised as the outer code, so long as it satisfies the necessary communication requirements.
The rest of the paper is structured as follows. The black box inner process and basics of the GV bounds are outlined in Section II. The methodology and general case studies for both the binary and q-ary cases are described and discussed in Section III. Section IV then uses simulations of the inner decoder to give practical results and analysis of the described procedure. Conclusions are finally drawn in Section V.

II. BACKGROUND AND LITERATURE REVIEW

A. SYNCHRONISATION-FSMC (S-FSMC)
In general, research has either looked at synchronisation errors or memory effects of a channel in isolation. The modelling and analysis of synchronisation errors can be found in references such as [3], [4], [5], and [6] and more recently in [7] and [8]. Likewise, the modelling of memory within channels can be found in literature such as [9], [10], and [11] with an extensive review of these memory channels with error control techniques presented in [12].
In [2], Achari and Cheng incorporate both of these channel effects and describe a synchronisation channel based on a Finite-State Markov Channel (FSMC). FSMCs, also referred to as discrete finite-state channels, are primarily based on Markov processes and were presented by Claude Shannon in his landmark 1948 paper [13].
In addition to the channel model, [2] provides an algorithm to regain synchronisation by removing the effects of insertions and deletions encountered during data transmission. However, this leaves substitution errors from the decoding process and inherent substitution errors from the channel itself. The work in [2] is based on the memoryless synchronisation channel and watermark decoding scheme presented by Davey and MacKay in [5], which has been used in fields ranging from DNA barcoding in medicine [14], [15] to data hiding in watermarking applications [16].
This paper treats the synchronisation channel and the workings of the inner decoder as a black box, illustrated in Figure 1. The input to this system is a block-coded frame of length n. The outputs of this black box are the Bit Error Rate (BER), or the Symbol Error Rate (SER) for the q-ary case, which measure the substitution errors that remain after the synchronisation process. Additionally, the corresponding entropy of the synchronisation channel, or inner channel entropy, denoted by $H_I$, is also an output of this black box system. The sole focus of this paper remains on the outer decoding process to ensure reliable communication for the given system.

B. GILBERT-VARSHAMOV (GV) BOUNDS
Numerous bounds are used to analyse channel models and determine system performance, where the main trade-off between code rate and error-correction capability is explored. The best known and most notable lower bound is the GV bound, described independently by Edgar Gilbert [17] and later by Rom Varshamov [18], where the bound was slightly improved [19]. As this is a lower bound, it is a positive result and, as such, provides a definite achievable bound [1], [20]. The GV bound states that for a q-ary code of length n and minimum Hamming distance d, the maximum number of codewords, $A_q(n, d)$, must satisfy Equation (1) [1], [21], [22].
Here $V_q(n, l)$ is the volume, or number of strings, in a Hamming ball of radius l and is equal to $\sum_{j=0}^{l} \binom{n}{j} (q-1)^j$, which gives rise to Equation (2) [1], [20], [21], [22].
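As a minimal sketch of the finite-length bound, the code below assumes the standard form of the GV bound, $A_q(n, d) \ge q^n / V_q(n, d-1)$, with the ball volume given by the sum above. The function names are illustrative and do not come from [2] or from this paper.

```python
from math import comb


def hamming_ball_volume(n: int, radius: int, q: int = 2) -> int:
    """Number of q-ary strings of length n within Hamming distance `radius` of a fixed string."""
    return sum(comb(n, j) * (q - 1) ** j for j in range(radius + 1))


def gv_lower_bound(n: int, d: int, q: int = 2) -> int:
    """Smallest integer M with M >= q^n / V_q(n, d - 1) (standard GV form, assumed here)."""
    volume = hamming_ball_volume(n, d - 1, q)
    return -(-(q ** n) // volume)  # exact integer ceiling division


if __name__ == "__main__":
    # A binary code of length 7 and distance 3 is guaranteed at least this many codewords.
    print(gv_lower_bound(7, 3))
```

Integer arithmetic is used throughout so that the bound stays exact for frame lengths such as n = 480, where $q^n$ is far outside floating-point range.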
Using Stirling's approximation and simplifying, the GV bound can be written in its asymptotic form shown in Equation (3), where R(δ) is the corresponding code rate as a function of δ, $H_q$ is the q-ary entropy function defined in Equation (4) [22], and δ is the relative distance, or fractional minimum distance, which is equal to d/n and is restricted to 0 ≤ δ ≤ 1 − 1/q [20], [21], [22]. It is worth noting that Equation (4) is derived by making use of the change of base formula, and as such, any logarithm base may be used so long as it is consistent throughout.
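For reference, the asymptotic bound and the q-ary entropy function are usually written as follows. This is the standard textbook form and is assumed, not confirmed, to match Equations (3) and (4); the second line of the entropy expression applies the change of base formula mentioned above, so any logarithm base may be used.

```latex
\begin{align}
R(\delta) &\ge 1 - H_q(\delta), \qquad 0 \le \delta \le 1 - \tfrac{1}{q}, \\
H_q(\delta) &= \delta \log_q (q-1) - \delta \log_q \delta - (1-\delta) \log_q (1-\delta) \\
            &= \frac{\delta \log (q-1) - \delta \log \delta - (1-\delta) \log (1-\delta)}{\log q}.
\end{align}
```

For q = 2 the first term vanishes and $H_q$ reduces to the binary entropy function.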
Theorem 1 (GV Bound): Let q ≥ 2. Then for any 0 ≤ δ ≤ 1 − 1/q and any 0 < ϵ ≤ 1 − $H_q(\delta)$, there exists a family of q-ary codes C with relative distance at least δ and rate R ≥ 1 − $H_q(\delta)$ − ϵ.

As this paper employs a black box approach for the inner decoder, the only information needed to determine the outer decoder parameters is the residual SER and the corresponding inner entropy (the entropy of the S-FSMC). Figure 2 recreates the results from the inner synchronisation process in [2]; in particular, the BER is plotted for various inner entropy values. From Figure 2, the residual BER and, consequently, the number of substitution errors remaining after resynchronisation can easily be identified. Thus a corresponding outer code can be prescribed.
It is well known in coding theory that the maximum number of errors a block code can correct, denoted by t, is dependent on d, and this inequality is described in Equation (5). From this, it is easily shown that the required minimum distance of our prescribed code should be at least 2t + 1.
Consequently, Equation (6) shows the corresponding relative distance that should be chosen given the number of errors we wish to correct.
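The relationship between t, d and δ described above can be written out as below. This is a sketch of the standard error-correction inequality and the relative distance it implies, assumed to correspond to Equations (5) and (6).

```latex
\begin{align}
t \le \left\lfloor \frac{d-1}{2} \right\rfloor
\quad &\Longrightarrow \quad d \ge 2t + 1, \\
\delta = \frac{d}{n} \quad &\Longrightarrow \quad \delta \ge \frac{2t+1}{n}.
\end{align}
```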

B. BINARY CASE STUDY
We use an example based on the data reproduced in Figure 2, in which the residual BER is at worst 0.1. Using a frame length, or coded data size, of 480 bits (the value used in [2]) implies an average of roughly 48 substitution errors that the outer code should try to correct. Equation (6) then implies that the outer code should have a relative distance of at least 0.2021. Assuming a normal binary data sequence allows the use of the binary entropy function and produces Figure 3, which shows the relationship between code rate, R(δ), and relative distance for the binary GV bound. As indicated by the highlighted data point and red line in Figure 3, a maximum outer code rate, $R_o$, of 0.2741 is achievable for the given relative distance. Additionally, any point below the curve and to the right of the red dotted line would have the required error correction capabilities for the given parameters. However, to ensure the best code rate for the given scenario, it is recommended to stay as close to the curve and the required δ line as possible. Using the maximum $R_o$ of 0.2741, the number of message bits can be calculated as $k = \lfloor R_o \times n \rfloor$, which gives 131 message bits. As this number is rounded down to ensure whole bits are used, the actual $R_o$ is recalculated to be 0.2729. In terms of an (n, k, d) code, this gives possible parameters of (480, 131, 97). Lastly, the overall code rate, $R_T$, is determined by multiplying the inner code rate, $R_i$, and the outer code rate together. In this case, $R_i$ depends solely on the sparsifier used during the resynchronisation process; the simulations in [2] use a 4-to-5 sparsifier, making $R_i$ = 0.8. Therefore the corresponding $R_T$ in this example is approximately 0.2183. Again, this is not the only possible set of parameters that can be used to correct the given errors for this example.
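The binary procedure can be sketched in a few lines. The snippet assumes the asymptotic rate $1 - H_2(\delta)$ and evaluates it at the exact δ = 97/480; the function and key names are illustrative. The bound evaluates to roughly 0.274 and may differ from the quoted 0.2741 in the fourth decimal, depending on how δ is rounded, but the resulting (n, k, d) parameters are the same.

```python
from math import floor, log2


def binary_entropy(x: float) -> float:
    """Binary entropy function H_2(x) in bits."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log2(x) - (1 - x) * log2(1 - x)


def binary_outer_code(n: int, residual_ber: float, inner_rate: float) -> dict:
    """Outer code parameters suggested by the binary GV bound (illustrative helper)."""
    t = round(residual_ber * n)            # average substitution errors per frame
    d = 2 * t + 1                          # minimum distance needed to correct t errors
    delta = d / n                          # relative distance
    rate_bound = 1 - binary_entropy(delta)
    k = floor(rate_bound * n)              # whole message bits
    return {
        "n": n, "k": k, "d": d, "delta": delta,
        "rate_bound": rate_bound,
        "R_o": k / n,                      # outer rate after rounding k down
        "R_T": inner_rate * k / n,         # overall rate with the inner code
    }


if __name__ == "__main__":
    print(binary_outer_code(n=480, residual_ber=0.1, inner_rate=0.8))
```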

C. EXTENSION TO q-ARY
In the previous example, the scenario was limited to the binary case. However, the remaining errors produced during the inner resynchronisation process are likely to occur in bursts. This is expected, as the channel model used in the inner portion is an FSMC, which has correlated errors. Since the errors may be clumped together in bursts, extending the outer code to a q-ary GV bound would be more effective. Figure 4 shows the GV bounds for different q values. Not all q values are directly beneficial to our application, and only values of $q = 2^b$, where b is a positive integer, will be utilised; however, q values that fall outside this constraint are shown for completeness. It is easily seen from Figure 4 that higher code rates are achievable for corresponding δ values as larger q values are utilised. In other words, we can obtain similar or better error-correction capabilities by utilising symbols composed of more bits while having a more efficient code rate. This is common knowledge in the field, as the focus is now on symbol-level data communication, and as such, we can correct up to $t'$ symbols and consequently up to $t' \times \log_2(q)$ bits.
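The trend described for Figure 4 can be checked numerically. The sketch below assumes the asymptotic rate $1 - H_q(\delta)$ and an arbitrary illustrative relative distance of δ = 0.2; it is not a reproduction of the figure.

```python
from math import log


def q_ary_entropy(x: float, q: int) -> float:
    """q-ary entropy function H_q(x), valid for 0 <= x <= 1."""
    if x == 0.0:
        return 0.0
    if x == 1.0:
        return log(q - 1) / log(q)
    return (x * log(q - 1) - x * log(x) - (1 - x) * log(1 - x)) / log(q)


def gv_rate(delta: float, q: int) -> float:
    """Asymptotic GV rate 1 - H_q(delta)."""
    return 1 - q_ary_entropy(delta, q)


# Rates at a fixed relative distance for alphabet sizes q = 2^b.
rates = {q: gv_rate(0.2, q) for q in (2, 4, 8, 16, 32)}

if __name__ == "__main__":
    for q, rate in rates.items():
        print(f"q = {q:2d}: R = {rate:.4f}")
```

At this fixed δ the achievable rate grows with q. Note that δ is measured in symbols here, so the comparison with the binary case depends on how bit errors map onto symbols.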

D. Q-ARY CASE STUDY
Using the values from the binary example, the idea of utilising q-ary codes is further illustrated. We use $k'$ to denote the symbol message length and $n'$ to denote the symbol block length, which is further quantified in Equation (7).

$$n' = \frac{n}{\log_2(q)} \tag{7}$$

This gives rise to $n'$ = 120 symbols for a value of q = 16, as we still require 480 coded bits at the input of the inner resynchronisation process for the scenario described. Suppose 48 substitution errors remain after resynchronisation; as a worst case, this will affect 48 separate symbols once the bits are converted to their corresponding symbols. This, in turn, using the symbol equivalent values in Equation (6), suggests that the outer code's relative distance should now be approximately 0.808, as we have $t'$ = 48 symbols to correct. This relates to approximately $R_o$ = 0.0335, as shown by the red dotted line in Figure 5. At first glance, this may seem like an inferior result when compared with the binary case, as a lower code rate is achieved. However, it is crucial to keep in mind that this is the worst-case scenario, and when looking at the number of bits that can be corrected, the given code can correct up to 192 bits. On the opposite end of the spectrum, the best-case scenario is when all the bit errors occur in one large burst. In this scenario, 12 symbols overall would be affected, and as such, we would only require a relative distance of approximately 0.208. This corresponds to $R_o$ = 0.61 and is indicated by the blue dotted line in Figure 5. The procedure to find the overall code parameters follows that of the binary case study. This gives a worst-case code of (120, 4, 97) with a corresponding $R_o$ of 0.033 and $R_T$ of 0.027. The best-case scenario gives a (120, 73, 25) code, where $R_o$ is 0.608 and $R_T$ is 0.487.
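The worst-case and best-case parameters can be reproduced with the following sketch, which assumes the asymptotic rate $1 - H_q(\delta)$ evaluated at the exact δ = d/n' and $n' = n / \log_2(q)$. The helper names are illustrative, and the computed rates may differ slightly from the quoted values depending on how δ is rounded.

```python
from math import ceil, floor, log


def q_ary_entropy(x: float, q: int) -> float:
    """q-ary entropy function H_q(x), valid for 0 <= x <= 1."""
    if x == 0.0:
        return 0.0
    if x == 1.0:
        return log(q - 1) / log(q)
    return (x * log(q - 1) - x * log(x) - (1 - x) * log(1 - x)) / log(q)


def outer_code_from_symbol_errors(n_bits: int, q: int, symbol_errors: int) -> tuple:
    """(n', k', d) suggested by the q-ary GV bound for q = 2^b (illustrative helper)."""
    bits_per_symbol = q.bit_length() - 1       # b such that q = 2^b
    n_sym = n_bits // bits_per_symbol          # symbol block length n'
    d = 2 * symbol_errors + 1                  # distance needed to correct t' symbols
    rate_bound = 1 - q_ary_entropy(d / n_sym, q)
    k_sym = floor(rate_bound * n_sym)          # whole message symbols k'
    return n_sym, k_sym, d


# 48 bit errors: every error in its own symbol versus one aligned burst.
worst_case = outer_code_from_symbol_errors(480, 16, symbol_errors=48)
best_case = outer_code_from_symbol_errors(480, 16, symbol_errors=ceil(48 / 4))

if __name__ == "__main__":
    print("worst case:", worst_case)
    print("best case: ", best_case)
```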

IV. RESULTS AND ANALYSIS A. RESULTS FROM SIMULATED CHANNEL
This section runs simulations using the entire process, including inner synchronisation. Consequently, the practicality of the method discussed is illustrated by using different q-ary divisions. Figure 6 shows the SER obtained after the inner resynchronisation process when various q-ary values are employed. As seen in this figure, all the plots follow a similar trend, and at higher $H_I$ values there is generally a higher chance of errors. Additionally, the SER increases as the q value increases. This is especially true at inner entropies exceeding 0.1, where the higher the entropy of the S-FSMC, the more evident the difference in SER obtained for corresponding q-ary groupings. This intuitively makes sense, as more errors are expected at higher entropies due to more uncertainty within the channel. The various drops in the plot occurring at approximately $H_I$ = 0.1 and $H_I$ = 0.2 are due to the method used to generate the different inner channels and are detailed further in [2]. It is evident from these plots that the inner system does produce scattered errors, since the SER increases with q. If the SER had remained equal as q increased, this would suggest that the errors affect consecutive bits and thus lie closer to the best-case scenario described in the method. It is also worth noting that the plots for q = 8 and q = 16 in Figure 6 are reasonably similar in SER performance, which further suggests that groupings of more than three or four consecutive error bits are unlikely. This is explored further by looking at the error-run plots shown in Figure 7. Here $\Pr(1^m \mid 0)$ is the probability of having at least m errors after an error-free communication state.
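The link between error clustering and SER can be illustrated with two synthetic error patterns. Both patterns are made up for illustration and are not simulation output: each places 48 bit errors in a 480-bit frame, one evenly spread and one as a single aligned burst.

```python
def symbol_error_rate(bit_errors: list, bits_per_symbol: int) -> float:
    """Fraction of symbols containing at least one bit error."""
    symbols = [
        bit_errors[i:i + bits_per_symbol]
        for i in range(0, len(bit_errors), bits_per_symbol)
    ]
    return sum(any(symbol) for symbol in symbols) / len(symbols)


FRAME_BITS = 480

# Hypothetical patterns: one error every 10 bits, and one burst of 48 bits.
spread_errors = [1 if i % 10 == 0 else 0 for i in range(FRAME_BITS)]
burst_errors = [1 if i < 48 else 0 for i in range(FRAME_BITS)]

if __name__ == "__main__":
    for bits in (1, 2, 4):
        print(
            f"q = {2 ** bits:2d}: "
            f"spread SER = {symbol_error_rate(spread_errors, bits):.3f}, "
            f"burst SER = {symbol_error_rate(burst_errors, bits):.3f}"
        )
```

With spread errors the SER grows with the symbol size, whereas with a single aligned burst it stays at the BER. This mirrors the worst-case and best-case scenarios of the q-ary case study.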
In the case of $H_I$ = 0.014, the probability of at least four consecutive errors is 0.08, which drops to approximately 0.03 for at least five consecutive errors, reiterating the low chance of getting more than four consecutive error bits. This is confirmed at higher entropy values too. In the case of $H_I$ = 0.292, the probability of at least four consecutive errors is 0.096, the probability of at least five consecutive errors is approximately 0.04, and the probability of six consecutive errors drops to 0.02, again reemphasising this idea. It is worth remarking that the error-run plots are all based on the binary case (q = 2), as this provides the consecutive raw errors before any symbol segmentation.

Table 1 shows the coding gain achieved at various inner entropy values for corresponding q-ary values. The values of R(δ) are chosen by looking at the corresponding minimum δ value rounded to the nearest hundredth. The values obtained show that using a q value of 16 provides the best results in terms of code rate (outer and total) for the given channel simulations. This appears optimal for the given channel, as increasing to q = 32 starts to reduce the coding gain. A potential reason that a 16-ary code performs best in these tests is the use of a 4-to-5 sparsifier during the resynchronisation process. Intuitively, the number of bits at the input of the sparsifier can be seen as the bits that constitute a single symbol of the outer q-ary code. As such, matching these values inherently makes sense, as errors are contained and thus decoded within each individual symbol. Choosing an outer code in which each symbol has more bits than the sparsifier input potentially allows errors to affect more symbols during resynchronisation, thus producing a poorer error rate performance. Definite indications of the effect of the sparsifier on the outer code require further research.
Figure 8 gives a graphical representation of Table 1, where the red dotted lines and arrows show the code gain between q = 2 and q = 16 for an $H_I$ of 0.043. In this figure, the corresponding symbols show the code gain for a given $H_I$ using the various q-ary GV bounds. Again, it is fairly evident for respective values of $H_I$ that using q = 16 provides the best code rate for the given error correction requirements.

B. SIMULATED CHANNEL CASE STUDY
To further illustrate the method described, the simulation results are used to give a more realistic case study. Again using $H_I$ = 0.1 in this example, the SER, no matter the number of bits composing a symbol, will always be less than 0.2 according to Figure 6. As previously mentioned, a q value of 16 would be the best option for the described channel, and thus symbols consisting of 4 bits are chosen in this example. The multitude of 16-ary code constructions available in practice also makes this an easy choice. From Figure 6, using a 16-ary code at an entropy of 0.1 for this scenario gives an SER of approximately 0.1437. Overcompensating, an SER of 0.15 can be coded for, which means 18 (four-bit) symbol errors must be corrected, giving δ = 0.308 and consequently $R_o$ = 0.4739 and $k'$ = 57. This gives an overall rate of approximately 0.38 and a code of (120, 57, 37). While not the absolute best case for the 16-ary instance, these parameters tend more towards the blue dotted line in Figure 5 and again show that the simulated channel benefits from using the q-ary case over standard binary.
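The arithmetic of this case study can be sketched as follows, assuming the asymptotic rate $1 - H_{16}(\delta)$ evaluated at the exact δ = 37/120 and the inner rate of 0.8 from the 4-to-5 sparsifier. The variable names are illustrative. At this exact δ the bound is close to 0.476; the quoted 0.4739 is consistent with δ rounded to the nearest hundredth.

```python
from math import floor, log


def q_ary_entropy(x: float, q: int) -> float:
    """q-ary entropy function H_q(x), valid for 0 < x < 1."""
    return (x * log(q - 1) - x * log(x) - (1 - x) * log(1 - x)) / log(q)


Q = 16               # alphabet size (4-bit symbols)
N_SYMBOLS = 120      # 480 coded bits / 4 bits per symbol
DESIGN_SER = 0.15    # overcompensated SER taken from the text
INNER_RATE = 0.8     # 4-to-5 sparsifier

t_sym = round(DESIGN_SER * N_SYMBOLS)       # symbol errors to correct
d = 2 * t_sym + 1                           # required minimum distance
delta = d / N_SYMBOLS                       # relative distance
rate_bound = 1 - q_ary_entropy(delta, Q)    # asymptotic GV rate
k_sym = floor(rate_bound * N_SYMBOLS)       # whole message symbols
overall_rate = INNER_RATE * k_sym / N_SYMBOLS

if __name__ == "__main__":
    print((N_SYMBOLS, k_sym, d), round(overall_rate, 3))
```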

C. SIMULATED CHANNEL WITH REED-SOLOMON OUTER CODING
Finally, the complete framework, including a choice of outer code, is simulated to show the application of the methods discussed. In this case, the Reed-Solomon (RS) code is chosen due to its many use cases in practical systems, especially in applications where burst errors are prevalent. Again, various q-ary segmentations are tested, corresponding to the RS parameters outlined in Table 2. As RS codes are non-binary, N here corresponds to the symbol codeword length, or block length, and K is the length of the symbol-wise message. An RS coding scheme can correct up to $t' = (N - K)/2$ symbols. To keep the comparison fair, we resume from the previous section and choose the RS parameters so as to keep $R_o$ as close to 0.4739 as possible. The number of blocks sent, $n_b$, is chosen to ensure that the inner resynchronisation process receives 480 bits (padding may be required to get exactly 480 bits). For the given case, $n_b = \lfloor 480 / (q \times N) \rfloor$. Figure 9 shows the SER and corresponding BER after the RS decoding. The BER is calculated by converting the symbol data stream into its binary counterpart based on the number of bits for that specific q-ary code. As can be seen from the figure, the q = 16 and q = 32 schemes have relatively similar performance at lower $H_I$ values. The q = 16 coding scheme shows substantial coding gains after an inner entropy of around 0.1. Figure 10 highlights these effects further by plotting both the BER at the output of the inner decoder (after the resynchronisation process) and the BER after the RS decoding. It is clearly seen that the BER after resynchronisation is much the same no matter how many bits comprise a symbol, again reiterating the usefulness of the black box approach. At lower inner entropy values, the q = 16 and q = 32 cases again have very similar results, with the q = 32 case performing slightly better.
After an inner entropy of around 0.1, which corresponds to BER values above 0.1 after resynchronisation, we see substantial performance improvements for the 16-ary coding scheme. Again, this agrees with intuition, as the parameters in this example application were chosen for these regions of interest.
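As a rough sketch of how RS parameters might be chosen to track a target outer rate, the snippet below assumes a primitive RS block length of N = q − 1 and picks K by rounding. Both choices are assumptions made for illustration; the values actually used in Table 2 are not reproduced here and may differ.

```python
def rs_parameters(q: int, target_rate: float) -> tuple:
    """Return (N, K, t') for an RS code over GF(q) with rate near target_rate.

    Assumes the primitive block length N = q - 1 (hypothetical choice, not Table 2).
    """
    n_sym = q - 1
    k_sym = max(1, round(target_rate * n_sym))   # message symbols closest to the target
    t_sym = (n_sym - k_sym) // 2                 # correctable symbol errors
    return n_sym, k_sym, t_sym


if __name__ == "__main__":
    for q in (8, 16, 32):
        n_sym, k_sym, t_sym = rs_parameters(q, target_rate=0.4739)
        print(f"q = {q:2d}: RS({n_sym}, {k_sym}), t' = {t_sym}, rate = {k_sym / n_sym:.3f}")
```

Because N is small for these alphabets, the achievable rate K/N can only approximate the target, which is one reason the comparison keeps $R_o$ "as close as possible" to 0.4739 instead of matching it exactly.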

V. CONCLUSION
The S-FSMC is presented as a black box implementation along with the theory of GV bounds. From this, a method is presented to deduce the outer code parameters needed to correct the remaining substitution errors produced by the synchronisation channel and the resynchronisation process. Firstly, indications are given based on a binary setting, but it is shown that, depending on the distribution of errors, the system may benefit from using the q-ary GV bounds. This depends on the structure of the errors; the system may in fact suffer from utilising the q-ary counterpart if the errors produced are spread out across the data frame. From simulations of the synchronisation channel, it is shown that a q-ary value of 16 gives the best results in terms of coding gain. Increasing q beyond 16 starts to yield diminishing returns for the given channel. The simulated channel case study shows that the system benefits from using the q-ary GV bounds, and a good indication of the outer code parameters for an inner entropy of 0.1 would be (120, 57, 37) for a 16-ary partition. The proposed use of GV bounds for outer code parameter considerations liberates the coding construction and decoder from specialised outer error correction codes. It thus allows any code to be implemented as long as the coding requirements and parameters are met.