Parallel Self-Avoiding Walks for a Low-Autocorrelation Binary Sequences Problem

A low-autocorrelation binary sequences problem with a high figure of merit factor represents a formidable computational challenge. An efficient parallel computing algorithm is required to reach the new best-known solutions for this problem. Therefore, we developed the sokol skew solver for the skew-symmetric search space. The developed solver takes the advantage of parallel computing on graphics processing units. The solver organized the search process as a sequence of parallel and contiguous self-avoiding walks and achieved a speedup factor of 387 compared with lssOrel , its predecessor. The sokol skew solver belongs to stochastic solvers and can not guarantee the optimality of solutions. To mitigate this problem, we established the predictive model of stopping conditions according to the small instances for which the optimal skew-symmetric solutions are known. With its help and 99% probability, the sokol skew solver found all the known and seven new best-known skew-symmetric sequences for odd instances from L = 121 to L = 223 . For larger instances, the solver can not reach 99% probability within our limitations, but it still found several new best-known binary sequences. We also analyzed the trend of the best merit factor values, and it shows that as sequence size increases, the value of the merit factor also increases, and this trend is flatter for larger instances.


Introduction
General-Purpose computing on Graphics Processing Units (GPGPU) is used to reach better solutions in many fields, such as in computer vision, graph processing, biomedical, financial analysis, and physical simulation [1,2,3,4,5].The reason for that is the computational power of Graphics Processing Units (GPU).Therefore, these devices are used in High-Performance Computing (HPC) for solving real-world problems.However, current GPU devices have some drawbacks and limitations that must be considered when designing a new algorithm.The architecture, memory bandwidth, amount of memory, and memory speed are some of them.Despite these limitations, we can implement a solver for reaching impressive results and new solutions.
In this paper, we proposed a parallel self-avoiding walk algorithm for solving a Low-Autocorrelation Binary Sequences (LABS) problem that represents a formidable computational challenge.Here we have to locate binary sequences S * L of length L with the highest value of merit factor F , as shown in Eq. (1).These sequences are important in many areas, such as communication engineering [6,7,8], statistical mechanics [9,10,11], chemistry [12], cryptography [13,14], and mathematics [15,16,17].
S * L = arg max Two solvers, lssOrel and xLastovka, were proposed in our previous works [18,19].Both are stochastic solvers that take the advantage of the mechanism for fast evaluation of neighbor solutions and are limited to skew-symmetric sequences.The main difference between both solvers is in the memory usage that guides the search process.The lssOrel uses a self-avoiding walk and a small hash table to prevent visiting the same pivot twice.In contrast, xLastovka stores the large number of promising sequences into a priority queue and uses them for managing the search process.Both solvers explore the skew-symmetric search space only.This means the dimension of the search space is reduced by almost half (D = L+1 2 ), and, consequently, the solver can find (sub)-optimal sequences significantly faster.The skew-symmetric search space is defined only for odd instances, as shown in Eq. (2), and optimal skew-symmetric sequences may not be optimal for the entire search space.
The authors in [20] proposed a hybrid architecture that utilizes CPU and GPGPU processing units for the agent-based memetic computational system.This system was used for solving the LABS problem on the entire search space with the steepest descent and tabu search.The obtained maximum speedup was 46 with the tabu search for L = 128.The usefulness of GPGPU processing units for solving the LABS problem is evident from this result.Motivated by this work, we proposed the sokol skew solver for skew-symmetric sequences.It is implemented for the NVIDIA GPU devices with the help of the CUDA (Compute Unified Device Architecture) toolkit.The sokol skew solver uses parallel self-avoiding walks and a recently published mechanism for efficient neighborhood evaluation [21].Both are not only computationally efficient but also require a small amount of memory.They represent a suitable algorithm that prevents cycling, and a mechanism which can take the advantage of GPU devices to speedup the energy calculation.Using them, sokol skew obtained a speedup of 387 compared with the sequential CPU solver and consequently found new best-known sequences.
The predictive model of stopping conditions for sokol skew is established in this paper according to the instances from L = 71 to L = 119.Using this model, we determined the stopping conditions needed by the solver to reach the optimal skew-symmetric sequences with a probability of 99%.With these stopping conditions, sokol skew found all the known and seven new best-known sequences for all odd instances from L = 121 to L = 223.The sokol skew solver was also used for solving larger instances up to L = 247.In this case, it found all known and several new best-known sequences.We also calculated the probability that the achieved sequences are optimal according to the predictive model.
Using all the best merit factor values, we established their trend model.It shows that the merit factor values increase slightly as the sequence length increases, and this trend is flatter for larger sequence sizes.
According to the above mentioned, the main contributions of this paper are: • A parallel search with self-avoiding walks, designed for skew-symmetric sequences, which takes the advantage of GPGPU devices.
• Skew-symmetric sequences that are optimal with a probability of 99% according to the predictive model for instances from L = 121 to L = 223.
• The seventeen new best-known sequences for L ≤ 247.
• The trend of merit factors according to new best-known sequences.
The remainder of the paper is organized as follows.Related work is described in Section 2. The proposed sokol skew solver is described in Section 3. The description of the experiments, analysis of the proposed solver, and the obtained results are presented in Section 4. Finally, the paper ends with a conclusion in Section 5.

Related work
This section is divided into two parts since we address two fields in this paper.The first part is finding the skew-symmetric binary sequences with the highest value of merit factor, and the second is the development of stochastic solvers that take the advantage of GPGPUs.

Binary sequences with the highest value of merit factor
In [22,23] it was shown that the upper bound of merit factor for the LABS problem could be 12.3248 or 10.23 as L → ∞.With the help of construction methods, it is possible to form a sequence with the merit factor value of 6.342061 or 6.4382 [16,24].The difference between the established upper bound and the obtained merit factors by construction methods remains large.However, these methods are favorable for large sequence sizes.In the survey [25], we can find the following challenge: "Find a binary sequence S of length L > 13 for which F ≥ 10."In [26], the authors have suspected that for L > 250 under the skew-symmetric search space, we can expect F ≥ 10.Currently, the optimal sequences are known for L ≤ 66, and the skew-symmetric optimal sequences are known for L ≤ 119 [14].
Two well-known search approaches exist to find (sub)-optimal sequences.The exact search allows finding the optimal sequence that is time-consuming and inappropriate for long sequences.On the other hand, the stochastic search can return a reasonable good sequence but it is not necessarily optimal.Therefore, the stochastic search is preferable for larger instances.In the literature, we can find the following stochastic approaches used for solving the LABS problem: local search algorithm [27,21], tabu search [28], memetic algorithm combined with tabu search [29], self-avoiding walk technique [18], guiding the search process with a priority queue [19,30], etc. Currently best-known longer skew-symmetric sequences are published in [30].

Stochastic solvers and graphics processing units
A general metaheuristic framework for solving combinatorial optimization problems based on GPGPU is presented in [31].It enables efficient utilization of the GPU for the parallelization of an objective function and takes the advantage of the computational power provided by modern GPUs.The usefulness of this framework was demonstrated in the probabilistic traveling salesman problem with deadlines.The efficient implementation of the local search for GPU was presented in [32].The authors have used different profiling information and best-known practice to identify bottlenecks and improve performance incrementally.They took into account efficient solution evaluation on kernels, design of CPU-GPU interactions, and how efficiently they can evaluate large neighborhoods whose fitness structures do not fit into the GPU's memory.A GPU implementation of Ant Colony Optimization is presented in [33], and [34].In both works, the algorithm was applied to the traveling salesman problem.The parallel differential evolution algorithm for GPUs was designed in [35], and, in such a way, the computational time was reduced for high-dimensional problems.The advanced parallel cellular genetic algorithm for GPU was developed in [36] for solving large instances of the Scheduling of Independent Tasks problem.The research in [37] uses difficult black-box problems and popular metaheuristic algorithms implemented on up-to-date parallel, multi-core, and many-core platforms, to show that the population-based techniques may have benefited from employing dedicated hardware like a GPGPU or an FPGA.In the paper [38], the authors have proposed a scalable implementation of the adaptive bulk search for solving quadratic unconstrained binary optimization problems.This search algorithm combines a genetic algorithm in a CPU and local searches in multiple GPUs.
The paper [20] proposes a hybrid GPGPU architecture for the memetic evolutionary multiagent systems to improve the efficiency of computations.It is analyzed on the entire search space of the LABS problem, and the speed-up factor 46 was obtained for L = 128 compared to the CPU solver.In [39], the two new heuristics for the whole search space of the LABS problem were developed and implemented on the GPGPU architectures.The authors have shown that exploring the larger neighborhood improves the quality of the found sequences.
In contrast to the mentioned works on the LABS problem with GPGPU approaches, our approach is based on skew-symmetric sequences.It uses self-avoiding walks with a recently published efficient neighborhood evaluation mechanism [21].Using these components, we designed an architecture where the amount of memory used by a contiguous self-avoiding walk is minimized, the overhead related to synchronization between the CPU and GPGPU's memories is minimized, and the possibility of cycling in the searching process is reduced.This way the implemented solver can obtain a high speed or the number of sequence evaluations per second and find new best-known sequences.

The sokol skew solver
The way the solver explores the search space is crucial for its efficiency.In [18] it was demonstrated that cycling can happen within the search process of the skew-symmetric sequences.A sequence of contiguous self-avoiding walks {SAW t1 , SAW t2 , ..., SAW t N } is used in our solver to mitigate this problem (see Eq. ( 3)).The reason for that is the memory limitation of devices.It is impossible to perform an unlimited self-avoiding walk and prevent cycling completely because we cannot store all visited solutions in the memory for larger instances.Each contiguous self-avoiding walk SAW ti of length n is a walk at the time t i on a lattice that does not visit the same node or pivot more than once, and contains n steps or pivots {P 1 , P 2 , ..., P n }.In our case, each lattice node represents a sequence S j of length L and has D neighbors.Each neighbor (S k ) differs from its pivot in only one sequence component.The first pivot (P 1 ) is selected randomly, while the following pivot is the best neighbor that has not been visited before.With the hash table, we can store the previous pivots and, in an efficient way, prevent cycling within the search process of one contiguous self-avoiding walk.
Algorithm 1 The sokol skew algorithm  4).All contiguous self-avoiding walks are independent.That means there is no need for synchronization, and efficient parallelization is possible with GPGPU devices.However, the best-located sequence and its energy value (S best and E best ) should be updated during the optimization process, as shown in Algorithm 1 and Fig. 1.Parallel self-avoiding walks SAW p t are performed in each iteration.In the GPU device, each contiguous self-avoiding walk with the walk length n is performed in one block.Each block has several threads with shared memory.With these threads, each contiguous self-avoiding walk is performed in parallel, as shown in Algorithm 2. Note that all functions in this algorithm are executed in parallel and are responsible for the following: • generating the first pivot randomly, • calculating the values of sidelobe array C = { Ĉ0 (S L ), Ĉ1 (S L ), ..., ĈL−1 (S L )} where • calculating the differences ∆ step between the pivot E step and neighbors energy values, • searching for the best-unvisited neighbor, • updating the values of the sidelobe array and energy value for the next pivot P step .
As already mentioned, each contiguous self-avoiding walk is performed in parallel with the specific number of threads.This number depends on the instance size.It is set with Eq. ( 5) and this equation allows an efficient parallel reduction mechanism in the following functions: CalculateEnergy and BestUnvisitedNeighbor.
The first function calculates the sum we have used two arrays ∆ step and ∆ index .The first array contains differences between the pivot and neighbors' energy values.Exceptions are only neighbors that were already visited.Their differences are set to the maximum values, which prevent them from being selected as the next pivot.The second array contains the corresponding neighbor's indexes.Both are used in the reduction process, which returns the index of the best-unvisited neighbor or next pivot.
When the parallel contiguous self-avoiding walks are finished, the best energy values from each block E b are transferred from the GPU device to the host and compared with the best energy value (E best ).If a new best energy value is found, the corresponding sequence S block is also transferred to the host, and the best sequence (S best ) is updated.This update is frequent only at the beginning of the optimization process, and only energy values are often transferred to the host in each iteration.The overhead between the CPU and GPU memory is minimized this way.
Each contiguous self-avoiding walk requires a specific data structures, the hash table for preventing cycles, and a sidelobe array for efficient neighborhood evaluation, which contains only L integers.From this, we can conclude that a small amount of memory is required for one contiguous self-avoiding walk, and a large number of them can be performed efficiently in parallel on GPU devices.

Experiments
The purpose of the experiments was to analyze the implemented sokol skew solver.The solver was implemented with the C++ programming language using GNU C++ compiler 9.4 and the CUDA Toolkit 11.2.152.The personal computer with an AMD Ryzen 5 5600X processor was used for testing the lssOrel solver, and the NVIDIA A100-SXM4-40GB devices were used in the grid environment VEGA 1 for testing the sokol skew solver.q q q q q q q q 0 1000 3000 5000 7000 1e+09 2e+09 3e+09 4e+09 5e+09 b l ock s speed

Analysis of the sokol skew solver
In this section, we will analyze the efficiency of the sokol skew solver.It has three control parameters n, threads, and blocks.The first parameter defines the walk length of the contiguous self-avoiding walk.It is set to n = 8 • D according to [18].The second parameter, threads, determines the number of threads used for one contiguous self-avoiding walk executed in one block.As was mentioned in the previous section, this number is determined according to the sequence length (see Eq. ( 5)).The third parameter, blocks, specifies the number of blocks executed on the GPU device.This parameter minimizes the overhead related to synchronization between the CPU and GPGPU memory.It affects the speed of the solver, and it is hardware dependent.Our GPU device has 6912 CUDA cores.To use all GPU cores, the product between the number of blocks and threads must be a multiple of the number of CUDA cores.In our case these values could be 54 and 128, respectively.Therefore, the solver was analyzed on sequence length L = 245 with the following number of blocks 54, 108, 216, 432, 864, 1728, 3456, 6912 and with 128 threads, as shown in Fig. 2. From the displayed results, we can see that, as the number of blocks increased, the speed of the solver also increased.We can not perform an infinite number of blocks, because a GPU device has a limited amount of memory.Therefore, in all the following experiments, we will use the following equation to calculate the number of blocks: blocks = 6912 threads • 128.The main goal of our solver is to find optimal or sub-optimal solutions.Unfortunately, it belongs to stochastic solvers, and cannot guarantee the optimality of solutions.However, we can establish the predictive model of stopping conditions or the maximum number of solution evaluations (NSEs lmt ).According to this model, we can also assume the optimality of reached solutions.For this purpose, the sokol skew was analyzed on odd instances from L=71 to L=119.These sequences were selected because the optimal solutions are known for them [14].For determining NSEs lmt , the target approach was used with the optimal solutions and 100 independent runs for each instance size.With the Anderson-Darling test [40] and visual identification, we found that the  variable NSEs lmt was distributed exponentially for all instances.For example, the histogram and corresponding exponential distribution for L = 117 are shown in Fig. 3a.The λ values of exponential distributions for all the selected instances and their trend model (λ = −4.998• (−0.1489L )) are displayed in Fig. 3b.This model will be used in the continuation of the paper for two purposes.The first purpose was to determine the stopping condition NSEs lmt needed by the solver to reach the optimal solution with a 99% probability.For larger instances, the solver cannot obtain the determined NSEs lmt due to runtime lmt = 4 days.Therefore, the second purpose was to calculate the probability of solutions' optimality according to the predicted exponential distribution and obtained NSEs values.

The best-known sequences
To demonstrate the efficiency of the proposed solver and locate new best-known sequences, we performed N = 100 independent runs for L ∈ {121, 123, ..., 247}.The stopping conditions were NSEs lmt and runtime lmt .NSEs lmt was defined for one run, and was set according to the probability of 99% with Eq. (6).
The λ model described in the previous section is used in this equation.With the variable N we also consider 100 independent runs.We can do this, because contiguous self-avoiding walks within the search process are independent, and each of the 100 runs can be considered as a single run.Each run was also limited with the runtime lmt = 4 days.This limitation was determined by the VEGA environment.These values of the runtime and number of runs were also used in previous works [18,30].For instances where the solver cannot reach NSEs lmt within runtime lmt = 4 days, we calculated the probability of reaching the optimal solution with Eq. ( 7).The sum of the NSEs for all independent runs was used in this equation because we can consider 100 runs as a single run.
With the described limitations, our solver reached the 17 new best-known sequences shown in Table 1.These sequences are presented by using a hexadecimal coding.Only first half of each sequence was coded, because the remaining values are defined by skew-symmetry.Each hexadecimal digit is presented with a binary string: 0 = 0000, 1 = 0001, 2 = 0010, ..., F = 1111.Therefore, it is necessary to remove the leading 0 values to obtain the length of the binary string D. To get the sequence values, each 0 of the binary string should be converted to +1 and each 1 to −1 or vice versa.In Table 1, we can observe that the sequences obtained up to L = 223 are optimal, with the probability of 99% according to the λ model.For larger instances, the runs were limited to 4 days, which means the corresponding probability is less than 99%.For the largest instance size L = 247, this probability was reduced to only 13%.
The best-known merit factor values of our and related works [41,18,30], are depicted in Fig. 4a.As we can see, sokol skew has found all the best-known sequences and 17 new best-known sequences.Vertical red lines show improvements in these sequences.We can also observe that the significant improvements were obtained the for the last seven instances.Additionally, merit factor values greater than 9 were obtained for almost all new best-known sequences.In Fig. 4a, we can also observe the significant improvement reached in the last two decades by comparing Knauer's and  L speed q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q NVIDIA A100−SXM4−40GB AMD Ryzen 5 5600X − 6 threads AMD Ryzen 5 5600X − 1 thread (a) The speed or the number of solution evaluations per second for L ∈ {121, 123, ..., 247}.q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q 120 140 160  the shown lines, we can conclude that, as the sequence size increases, the value of the merit factor also increases.We can also observe that the second fitted line is flatter, which means this increase of merit factor value for larger instances is lower.However, the merit factors of the proposed sokol skew solver are much better than the values of constructive methods (6.4382), but still below value 10, which is a challenge value for L > 13.The largest merit factor value found by sokol skew was 9.3601 for L = 247.
According to the reached new best-known solutions, it is evident that the proposed sokol skew solver took the advantage of parallel computing on GPU devices.It is also shown in Fig. 4a where we compare the results of the sokol skew solver and related works.Note that the same number of runs N = 100 and runtime lmt = 4 days were used by lssOrel and xLastovka.To demonstrate this advantage, Fig. 5a shows the speed or number of sequence evaluations per second of the sequential lssOrel solver, parallel lssOrel solver with 6 CPU threads, and sokol skew solver with 6912 GPU threads.The obtained speedup is shown in Fig. 5b.The smallest speedup was achieved between the parallel and sequential lssOrel.The larger speedup was obtained between the sokol skew and parallel lssOrel.In this case, the largest speedup of 74 was obtained for L = 247.The largest speedup was obtained between the sokol skew and sequential lssOrel.In this case, the speedup of 387 was obtained for L = 247.From all the reported results, it is evident that the sokol skew solver exploited the advantage of parallel computing on GPU devices successfully and achieved outstanding results.

Conclusions
This paper introduces a new stochastic sokol skew solver for the low-autocorrelation binary sequences problem.It takes the advantage of parallel computing on the graphics processing units.For this purpose, sokol skew organizes the search process as a series of parallel and contiguous self-avoiding walks on the skew-symmetric sequences.With these sequences, the dimension of the search space is reduced by almost half, and, consequently, the solver can find (sub)-optimal sequences significantly faster.The skew-symmetric search space is defined for odd sequence lengths, and its optimal solutions may not be optimal for the entire search space.The contiguous selfavoiding walks reduce the possibility of cycling in the searching process and are independent of one another.The neighborhood evaluation mechanism allows the solver to achieve a high speed in the search process and requires a small amount of memory.With this mechanism, we designed an architecture that minimizes the overhead related to synchronization between the CPU and GPU's memories.All the described components maximize the solver's speed or the number of solution evaluations per second.This way, a bigger search space is examined, and a better sequence can be achieved at the same amount of time.
The solver was analyzed to determine a good value of control parameters and stopping condition (the number of solution evaluations) needed to reach an optimal solution.An exponential distribution was identified for this stopping condition.Using the parameter λ of the exponential distribution, we established the predictive model for the stopping condition with a 99% probability.With this model and the VEGA supercomputer, we applied the proposed solver to all odd sequence lengths from 121 to 223.For these instances, the seven new best-know sequences were found.For larger sequence sizes up to 247, the corresponding probability of optimality is calculated according to the predictive model.In this case, the solver found ten new best-known sequences.The advantage of the proposed solver is demonstrated by comparing the speed or number of solution evaluations per second between sokol skew and its predecessor, lssOrel.The sokol skew solver with 6912 GPU threads obtained a speedup of 387 and 74 compared to sequential and parallel lssOrel with 6 CPU threads, respectively.
The quality of sequences defines the merit factor.In this work, the maximum value of the merit factor was 9.3601 for the sequence with length of 247.Unfortunately, the challenge of finding a binary sequence of length greater than 13 for which the merit factor is greater or equal to 10, remains open.We also analyzed the trend of merit factor values.This analysis shows that as the sequence size increases, the value of the merit factor also increases, and a flatter trend is observed for larger instances.
In future work, we will try to improve the search algorithm or find a better search space.With the better search space, we will try to increase the probability of locating better solutions.We will also implement the sokol full solver for the GPGPU devices that will examine the entire search space of the problem.

Figure 1 :
Figure 1: Architecture of the sokol skew solver.

Figure 3 :
Figure 3: The distribution of NSEs that are needed to reach the optimal solution and the model (R 2 = 0.9429) of the exponential distribution parameter λ according to the results for L ∈ {71, 73, ..., 119}.
The best-known merit factors and their trend lines.

Figure 5 :
Figure 5: The speed ratio between the sokol skew with 6912 GPU threads, parallel lssOrel with 6 CPU threads, and sequential lssOrel solver.

Table 1 :
The new best-known skew-symmetric sequences obtained by the sokol skew solver.