New selection algorithm based on generating sequences

The paper proposes a new deterministic selection algorithm with computational complexity O(n), called cs-select, which is a modification of the quickselect algorithm. The changes concern the choice of reference elements. Instead of taking reference elements from the sequence itself, cs-select uses finite segments of complete sequences, which allow any natural number in a given range to be represented as a sum of elements of the selected segment. It is theoretically justified that, for some complete sequences, the number of comparisons cs-select makes when searching for the k-th statistic in a sequence of length n can converge to 2n as n grows without bound. It is also shown that different variations of the complete sequence make it possible to reduce the hidden constant in O(n) below 2.


Introduction
Selection is a fundamental problem of theoretical computer science and is widely used in processing numerical data: finding elements of a specified rank in a sample, sorting algorithms, nearest-neighbor search in numerical sequences, and constructing convex hulls of point sets.
The formulation of this problem goes back to C. A. R. Hoare [1]: given a set X of n elements, a linear order on X, and an integer k, 1 ≤ k ≤ n, find the k-th smallest element, i.e. the element x ∈ X such that at least k elements of X are less than or equal to x and at least n − k + 1 elements are greater than or equal to x. Usually, for convenience of analysis, X is taken to be a set (without repeated elements). The first implementation of this task, the FIND algorithm of [1], was proposed by Hoare himself as a modification of his quicksort sorting algorithm; FIND is also often called quickselect. Its expected number of comparisons is linear in n [2], ranging from roughly 2n for the extreme order statistics up to about 2(1 + ln 2)n ≈ 3.39n for the median. As noted in [3], FIND remains popular in practice because many other algorithms are much slower on average.
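The idea of FIND can be illustrated by a minimal Python sketch (a simplified illustration with a fixed first-element pivot, not one of the tuned implementations discussed in [3]; random pivot choice gives the average-case bounds cited above):

```python
def quickselect(xs, k):
    """Return the k-th smallest element of xs (1-based rank), in the
    spirit of Hoare's FIND: partition around a pivot drawn from the
    data itself, then recurse into the side containing the answer."""
    pivot = xs[0]  # simplest possible pivot choice, for illustration only
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(less):
        return quickselect(less, k)
    if k <= len(less) + len(equal):
        return pivot
    return quickselect([x for x in xs if x > pivot],
                       k - len(less) - len(equal))
```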
The SELECT algorithm, considered in a series of papers [3, 4, 5], currently provides the best average-case result: an expected comparison count of n + min(k, n − k) + o(n). The paper [3] also gives a comparative analysis of SELECT and FIND. Compared to FIND, SELECT requires only a small additional stack for recursion, and it was shown to outperform rather sophisticated implementations of FIND.
It should be noted that all the algorithms considered so far rely on random pivot selection, so their efficiency is estimated for the average case. For specially constructed inputs they do not guarantee linear time O(n) for finding the k-th statistic. There therefore remains theoretical interest in algorithms that guarantee O(n) in the worst case with acceptable hidden constants. The paper [6] proposes the PICK algorithm, a modification of quickselect that guarantees O(n) in the worst case. However, as shown in [6], the number of comparisons of PICK does not exceed 5.4305n, and with such a large hidden constant applying PICK in practice is not appropriate. In this article, based on the results presented in [7], the authors propose a deterministic algorithm called cs-select that guarantees O(n) in the worst case with a hidden constant close to 2 in many practically important cases. The experiments conducted by the authors confirm that, on average, the proposed algorithm is comparable to quickselect in computational complexity.
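The worst-case guarantee of PICK [6] rests on choosing the pivot as the median of medians of groups of five, which forces every recursion step to discard a constant fraction of the elements. A compact sketch of that idea (an illustration of the well-known median-of-medians scheme, not the exact PICK pseudocode of [6]):

```python
def mom_select(xs, k):
    """Deterministic selection in the spirit of PICK: the pivot is the
    median of the medians of groups of 5, guaranteeing that each step
    discards a constant fraction of elements, hence O(n) worst case."""
    if len(xs) <= 5:
        return sorted(xs)[k - 1]
    # median of each group of five, then the median of those medians
    medians = [sorted(xs[i:i + 5])[len(xs[i:i + 5]) // 2]
               for i in range(0, len(xs), 5)]
    pivot = mom_select(medians, (len(medians) + 1) // 2)
    less = [x for x in xs if x < pivot]
    equal = [x for x in xs if x == pivot]
    if k <= len(less):
        return mom_select(less, k)
    if k <= len(less) + len(equal):
        return pivot
    return mom_select([x for x in xs if x > pivot],
                      k - len(less) - len(equal))
```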

The description of cs-select algorithm
By analogy with quickselect and quicksort, the cs-select algorithm is a modification of the cs-sort sorting algorithm described in the paper [7].
Regular algorithms for constructing generating sequences are considered in [7], where they are defined as follows: for a given natural number n, a finite sequence of natural numbers B is called a generating sequence if any natural number k ≤ n can be represented as a sum of elements of this sequence. To obtain generating sequences, [7] analyzes the properties of the procedure CAlgorithm(n, c) under the condition c ≥ 2 (the individual steps of CAlgorithm are given in [7]).
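Since the exact steps of CAlgorithm are given only in [7], the following is a plausible Python reading of the construction (an assumption: we repeatedly emit ⌈r/c⌉ of the remaining range r), together with a direct subset-sum check of the generating property:

```python
import math

def c_algorithm(n, c=2.0):
    """Plausible sketch of CAlgorithm(n, c) -- an assumed reading, not
    the exact procedure of [7]: repeatedly emit ceil(r / c) of the
    remaining range r until the range is exhausted."""
    b, r = [], n
    while r > 0:
        step = math.ceil(r / c)
        b.append(step)
        r -= step
    return b

def is_generating(b, n):
    """Check the defining property: every k in 1..n is a sum of some
    sub-multiset of b (dynamic programming over reachable sums)."""
    reachable = {0}
    for x in b:
        reachable |= {s + x for s in reachable if s + x <= n}
    return all(k in reachable for k in range(1, n + 1))
```

For example, the sequence S = (51, 26, 13, 7, 4, 4, 2, 1) from the analysis section does pass `is_generating(S, 104)`.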
In [7] it was proved, in particular, that for c = 2 the procedure CAlgorithm(n, c) is guaranteed to produce a generating sequence B of minimum length. The basic idea of cs-sort and cs-select is to use the generating sequence B to separate the elements of the input sequence. On the one hand this makes cs-sort similar to quicksort; on the other hand it marks the difference: in cs-sort the reference elements are not elements of the sequence being sorted, but third-party elements derived from the generating sequence. At the same time, the result is guaranteed to be obtained once the elements of the generating sequence are exhausted. Below are all the steps of the proposed cs-select algorithm, assuming that its input is a set X, |X| = n, a generating sequence B = CAlgorithm(n, c), and the number k of the required order statistic of X.
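One plausible reading of these steps can be sketched in Python (an assumption, not the authors' Common Lisp implementation [8]: pivots are value thresholds accumulated as running sums of the generating sequence, so partitioning never uses elements of X itself and terminates when B is exhausted; integer inputs with min(X) ≥ 1 are assumed):

```python
import math

def cs_select(xs, k, c=2.0):
    """Hedged sketch of the cs-select idea: reference values are running
    sums of a generating sequence for max(xs), never elements of xs."""
    # plausible CAlgorithm(max(xs), c): emit ceil(r / c) of the remainder
    b, r = [], max(xs)
    while r > 0:
        b.append(math.ceil(r / c))
        r -= b[-1]

    threshold, active = 0, list(xs)
    for step in b:
        cut = threshold + step          # next reference value, built from B
        less = [x for x in active if x <= cut]
        if k <= len(less):
            active = less               # the k-th statistic lies at or below cut
        else:
            k -= len(less)              # discard the lower block, shift the rank
            threshold = cut
            active = [x for x in active if x > cut]
    return active[0]                    # survivors are all equal by construction
```

After B is exhausted, the active elements all lie in an integer interval of width at most 1, so they are equal and the answer is guaranteed, mirroring the termination guarantee described above.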

Analysis of the computational complexity of the cs-select algorithm
The speed of the cs-select algorithm is determined by two factors: (1) the length of the generating sequence B; (2) the cardinality of the set X, as well as the cardinalities of the subsets X_Less and X_More.
As shown in [7], the length of the generating sequence B obtained by CAlgorithm(n, c) for c ≥ 2, n = max(X), is bounded from above by a value logarithmic in n. As can be seen from Table 1, with c = 2 the values of the hidden constant have the smallest spread, with an average value of 1.988; c = 2.5 gives a much larger spread, in some cases yielding a hidden constant smaller than for c = 2, but on average the hidden constant for c = 2.5 is 2.146. The last column of Table 1 demonstrates the possibility of "modifying" the generating sequence so that the best result, an average hidden constant of 1.95, is achieved. As the columns for c = 2 and c = 2.5 of Table 1 show, the number of comparisons grows with increasing k; on an initial segment of k values the hidden constant is below 2. In the generating sequence of the last column of Table 1 this initial segment is stretched, because the sequence S = (51, 26, 13, 7, 4, 4, 2, 1) is generating for the slightly larger number n = 104 while keeping the same length. It is interesting to note that the hidden constant is smallest when the median is sought. All this demonstrates the potential for further lowering the theoretical bound on the hidden constant.

Experimental comparison
In addition to the theoretical estimates, it is interesting to compare the cs-select algorithm in practice with existing algorithms for finding the k-th statistic. Since the main idea of cs-select coincides with that of quickselect, and the difference lies in the choice of elements relative to which the sequence is divided, quickselect was chosen as the comparison baseline. It should be noted that the original version of the quickselect algorithm is aimed at finding the k-th statistic in sets, i.e. it does not allow repeated elements. A quickselect applied to multisets must include a check that the multiset obtained at the next recursion step does not consist of identical elements and therefore cannot be divided; such a check incurs additional time costs. In cs-select, on the contrary, separation is guaranteed to complete once the elements of the generating sequence are exhausted, with no need to account for repeated elements. This fact is an important advantage of the cs-select algorithm.
It is also important to note the limitations on the practical use of the cs-select algorithm. Since the number of steps of the algorithm depends on the length of the generating sequence, which in turn is determined by the maximum number in the set X (or, indirectly, by its digit count), cs-select turns out to be a priori inefficient on sufficiently sparse sequences of high-digit numbers.
The platform for the experiments was the authors' implementation of the cs-select and quickselect algorithms in the Common Lisp programming language (source code and results are available in the GitHub repository [8]).
The experiment used random sequences X of different lengths, with elements ranging from 1 to 10^17 − 1 and with possible repetition of elements. For each length n, the k-th statistics were found and the number of comparisons was averaged over a sample of 100 sequences of that length. The cs-select algorithm always used generating sequences with c = 2; the quickselect algorithm used a random choice of the reference element.
The results of the experiment are shown in Tables 2 and 3. The results in Table 2 demonstrate stabilization of the number of comparisons at the 2n level. Indeed, for n = 10 elements, when finding the median, the number of comparisons was 20% above the 2n level; for the largest considered n = 1000000 elements, the excess was only 0.03%. As can be seen from Table 2, finding the median value is the most laborious case. Table 3 shows the average number of comparisons obtained by the quickselect algorithm on the same sequences as for cs-select. As can be seen from Table 3, the quickselect algorithm, in the implementation proposed in [8], is in general inferior to cs-select. For this algorithm, finding the median is also the most laborious case; here, however, the excess of the number of comparisons over the 2n level was 68%. At the same time, it should be noted that when finding the maximum value Q(X, n), the quickselect algorithm is always more efficient than cs-select.
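The stabilization at the 2n level can be checked roughly with the following Python sketch (an assumption throughout: a value-threshold reading of cs-select in which pivots are running sums of the generating sequence and each pass costs one comparison per active element; this is not the authors' Common Lisp setup [8], so only the order of magnitude is meaningful):

```python
import math
import random

def cs_select_count(xs, k, c=2.0):
    """Return (k-th statistic, number of element comparisons) under the
    assumed value-threshold scheme: one comparison per active element
    per pass over the generating sequence."""
    b, r = [], max(xs)
    while r > 0:
        b.append(math.ceil(r / c))
        r -= b[-1]
    comparisons, threshold, active = 0, 0, list(xs)
    for step in b:
        comparisons += len(active)      # one "x <= cut" test per element
        cut = threshold + step
        less = [x for x in active if x <= cut]
        if k <= len(less):
            active = less
        else:
            k -= len(less)
            threshold = cut
            active = [x for x in active if x > cut]
    return active[0], comparisons

random.seed(1)
n = 100_000
xs = [random.randint(1, n) for _ in range(n)]
value, cost = cs_select_count(xs, n // 2, 2.0)
```

On uniform random data the active set roughly halves per pass, so the total cost n + n/2 + n/4 + … stays close to 2n, in line with Table 2.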

Conclusions
In summary, the cs-select algorithm is a deterministic algorithm for selecting the k-th statistic in a sequence of integers, repetitions allowed, with computational complexity O(n) and a hidden constant in O(n) close to 2 for sufficiently long sequences. It should also be noted that there is further potential for reducing the hidden constant by choosing a suitable generating sequence.