Mathematical model for recovering discrete parts of a text message

The paper studies a procedure for restoring discrete segments of an unknown source message based on information about the possible variants of each character. An algorithm is proposed that compiles dictionaries of appropriate lengths, searches for text sections whose total number of character variants does not exceed a given boundary, and then iterates through the candidate texts and eliminates false variants that are not dictionary values. Statistical properties of dictionaries of short text segments are investigated, and extrapolation estimates are made for long segments. The main mathematical properties of the algorithm are described. The effectiveness of the procedure is studied theoretically within the framework of a probabilistic model.


Introduction
In a number of problems in the analysis of information security algorithms, information becomes available about the possible variants of each character of an unknown source text message. This situation occurs, for example, when stream encryption is used, if: 1) information about possible variants of the message characters was obtained through side channels, 2) the size of the key set is smaller than the size of the plaintext alphabet, 3) the characters of the key sequence are not uniformly distributed, 4) the same encryption key is used repeatedly.
If the number of variants for each character is relatively small, the original message can simply be restored completely [1]. Otherwise, if the number of character variants can be considered random, it becomes possible to build a procedure that searches for and then recovers discrete sections of the message in which characters with a relatively small number of possible variants are concentrated. This paper considers the recovery of messages in English with a fixed plaintext alphabet of 29 characters (26 letters of the Latin alphabet, a space, a period and a comma).

Description of the algorithm
Assume that for each character of the encrypted message it was possible to construct a set of variants of that character, among which is the true character. The number of such variants may follow different probability distributions. In this paper, we consider the case of a uniform distribution.
The algorithm consists of the following main steps:
1) Compilation of s-gram dictionaries. Dictionaries are compiled from a large text sample (text corpus).
2) Selection of "limited" segments of the message. Only those segments that contain a relatively small number of possible variants are selected for recovery. Selection criterion: the geometric mean $l_i$ of the number of possible variants over the segment does not exceed the specified critical boundary $L$:
$$l_i = \sqrt[s]{l_{i1} \cdot \ldots \cdot l_{is}} \le L,$$
where $s$ is the length of the segment and $l_{ij}$ is the number of character variants for the j-th character in the i-th s-gram.
3) Building recovery options for the segment. All combinations that can be constructed from the known variants of each character of the selected segment are enumerated.
4) Selection of all legal recovery options based on the dictionary. If a constructed variant is a dictionary value, it is taken to be a part of legal text and is kept as a potential segment of the original message. If the constructed s-gram is not present in the dictionary, it is assumed to be text of random structure and is rejected as a false recovery option.
5) Restoring the message segment. Depending on the number of legal variants found, the message segment is considered successfully restored or not restored. The maximum allowed number of recovery options is estimated as $k = 2^{\beta \cdot s}$, where $s$ is the length of the message section and $\beta$ is some constant less than 1. In practice, $\beta = 0.1$ [6].
The following algorithm parameters are considered:
- Length of the text segment s: 10-25 characters
- Critical boundary L: 8-16 character variants
- β = 0.1

Text length s                10    15    16    20    25
Number of possible options   1-2   1-2   1-3   1-4   1-5
Table 1. Acceptable degree of segment reconstruction ambiguity
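The five steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`geometric_mean`, `max_options`, `recover_segments`) and the data layout (a list of candidate-character sets and a set of legal s-grams) are hypothetical, and the exhaustive enumeration in step 3 is practical only for toy sizes, since the number of candidates grows as $l_i^s$.

```python
import math
from itertools import product


def geometric_mean(counts):
    """Geometric mean of the per-character variant counts of one segment."""
    return math.exp(sum(math.log(c) for c in counts) / len(counts))


def max_options(s, beta=0.1):
    """Largest accepted number of legal variants: floor(2^(beta*s))."""
    # The small epsilon guards against floating-point results just below an integer.
    return math.floor(2 ** (beta * s) + 1e-9)


def recover_segments(variants, dictionary, s, L, beta=0.1):
    """Sketch of steps 2-5.

    variants   -- list of sets of candidate characters, one set per position
    dictionary -- set of legal s-grams (step 1 is assumed to be done already)
    Returns a dict {start position: sorted list of legal candidates}.
    """
    recovered = {}
    for start in range(len(variants) - s + 1):
        window = variants[start:start + s]
        # Step 2: keep only "limited" segments (geometric mean <= L).
        if geometric_mean([len(v) for v in window]) > L:
            continue
        # Step 3: enumerate every combination of candidate characters.
        candidates = ("".join(chars)
                      for chars in product(*[sorted(v) for v in window]))
        # Step 4: keep only the combinations present in the dictionary.
        legal = sorted(c for c in candidates if c in dictionary)
        # Step 5: accept the segment if the ambiguity is within the limit.
        if 1 <= len(legal) <= max_options(s, beta):
            recovered[start] = legal
    return recovered
```

With β = 0.1, `max_options` gives 2, 2, 3, 4 and 5 for s = 10, 15, 16, 20 and 25, which matches the upper ends of the ranges in Table 1.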

Statistical properties of s-gram dictionaries
As part of the experimental study, dictionaries are compiled for segments 10, 15, 20 and 25 characters long from an English text of 100 million characters.
Based on the volumes of the resulting dictionaries, the coverage of the dictionaries and the entropy of the corresponding s-grams of English are estimated experimentally.
The dictionaries were compiled from a limited volume of material and do not cover the set of all possible legal s-grams of English, so the coverage of the resulting dictionaries has to be evaluated. The coverage estimate uses the number of s-grams that occur only once in the source material.
In this estimate, $N_s$ is the initial volume of the s-gram dictionary, $n_s$ is the number of s-grams occurring once, and $\tau$ is the s-gram dictionary coverage.
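The two quantities that enter the coverage estimate can be collected from a corpus as sketched below. The sketch assumes that s-grams are taken with a sliding window of step one character and that the dictionary volume $N_s$ is the number of distinct s-grams; both are assumptions, as are the function names. The coverage value $\tau$ itself is not computed here.

```python
from collections import Counter


def sgram_counts(text, s):
    """Count every s-gram (sliding window of s characters) in the corpus."""
    return Counter(text[i:i + s] for i in range(len(text) - s + 1))


def dictionary_stats(text, s):
    """Return (N_s, n_s): the number of distinct s-grams and the number
    of s-grams that occur exactly once in the corpus."""
    counts = sgram_counts(text, s)
    N_s = len(counts)
    n_s = sum(1 for c in counts.values() if c == 1)
    return N_s, n_s
```

For the toy corpus `"abcabcabd"` and s = 3, the distinct 3-grams are `abc`, `bca`, `cab` and `abd`, of which only `abd` occurs once, so `dictionary_stats` returns `(4, 1)`.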
Based on the coverage, the volume of each dictionary is recalculated, and the value obtained in this way is assumed to be the size of the real dictionary of s-grams of the language. From this theoretically found volume, the entropy $H_s$ of s-grams is estimated. Extrapolating these results to large values of s is difficult, since the shape of this sequence of values is unknown, except that it is positive and decreasing.
The limiting entropy is defined as $H = \lim_{s \to \infty} H_s$. To estimate it from this set of measurements, a model of sequential estimates is constructed.
It is assumed that the sequence of entropy values obeys a linear recurrence relation with a coefficient k, which is determined numerically from the experimentally obtained entropy values for segments of small length.
In this case, the value that gives the best approximation is k ≈ 0.62. By increasing s, a sequence of heuristic estimates of $H_s$ is constructed for segment lengths at which experimental evaluation becomes difficult. Starting from s = 50, the values of $H_s$ stabilize and no longer change with the length of the segment. Thus, the limiting entropy is H = 0.8 bits per character. This limit value can be used in theoretical calculations for long message segments.
Mathematical properties of the algorithm

Probability of occurrence of bounded segments
Let the numbers of character variants $l_{ij}$ be independent and uniformly distributed from 1 to 29 for all segments, that is, each takes the value k with probability $p_k = \frac{1}{29}$ for any k = 1, 2, ..., 29. The value of the critical boundary L is fixed.
Let us define the characteristic of the i-th segment of the message and, for any segment, its expected value. Then the following theorems about the probability distribution of the appearance of a bounded segment in the message hold.
Theorem 1. Let the length of the message segment s → ∞. Then the probability that the geometric mean of the i-th segment does not exceed the set limit L is expressed, for any i, through Φ, the standard normal distribution function.
Theorem 2. Expected number of bounded segments of length s in a message of N characters:
Theorem 4. b) The expected geometric mean value of the next segment, provided that the previous one is bounded, is expressed through φ, the density of the standard normal distribution, and Φ, the standard normal distribution function.
Theorem 5. The probability that all segments of the message are simultaneously bounded tends to a multidimensional normal distribution. In its covariance matrix, the entry at the intersection of the k-th row and the z-th column is the covariance of the k-th and z-th segments.
Remark. The sum of independent uniform quantities converges quickly to the normal distribution. Traditional estimates of the error of the CLT in the finite case, such as the Berry-Esseen inequality, give a very rough, overestimated upper bound for moderate values of s (tens). There are two reasons for this. First, different classes of distributions converge to the normal distribution at different rates, while the error estimate is stated in general form for any distribution. Second, the Berry-Esseen inequality expresses the error as the ratio of the third moment to the square root of the number of terms, and this form gives a fairly accurate estimate only for large s. It is known from numerical calculations that the uniform distribution converges to the normal one much faster than the general theoretical estimates suggest [5]. Thus, even for small values of s, the normal approximation yields a fairly accurate estimate of the theoretical probability.
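The claim of the remark can be checked numerically. The sketch below does not reproduce the formula of Theorem 1, which is not available here. It applies the standard central limit theorem to the sum of the logarithms of the variant counts, which is an assumption about how the segment characteristic is defined, and compares the result with a Monte Carlo estimate. All function names are hypothetical.

```python
import math
import random

M = 29  # alphabet size; variant counts are assumed uniform on 1..M

# Mean and standard deviation of ln(k) for k uniform on 1..M.
LOGS = [math.log(k) for k in range(1, M + 1)]
MU = sum(LOGS) / M
SIGMA = math.sqrt(sum(x * x for x in LOGS) / M - MU ** 2)


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def p_bounded_normal(s, L):
    """CLT approximation of P(geometric mean of s variant counts <= L)."""
    z = (math.log(L) - MU) * math.sqrt(s) / SIGMA
    return normal_cdf(z)


def p_bounded_monte_carlo(s, L, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    threshold = s * math.log(L)  # geometric mean <= L  <=>  sum of logs <= s*ln(L)
    hits = 0
    for _ in range(trials):
        if sum(rng.choice(LOGS) for _ in range(s)) <= threshold:
            hits += 1
    return hits / trials
```

Calling both functions for, say, s = 10 and L = 8 or L = 16 gives a quick impression of how close the normal approximation is for segment lengths in the tens.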

Mathematical model of the distribution of the number of possible legal texts
Let $N = m^s$ be the total number of s-grams that can be constructed from characters of an alphabet of size m (in this paper m = 29 is fixed), $D$ be the number of possible legal texts of length s in the same alphabet (this value is estimated as $2^{Hs}$, where H is the entropy of s-grams), $n_i = l_i^s$ be the number of possible recovery options for the i-th segment of the message, and $k_i$ be the number of legal recovery options among the $n_i$ for the i-th recoverable segment, where $l_i = \sqrt[s]{l_{i1} \cdot \ldots \cdot l_{is}}$ is the geometric mean number of character variants per unknown character of the i-th segment. It is assumed that the true recovery option is always present in the sample, so that $n_i - 1$ options are false.
Then the probability that, in a sample of $n_i$ different recovery options drawn from the set of all $N$ possible variants containing $D$ legal variants, exactly $k_i$ variants are legal is described by the hypergeometric distribution [6].
Theorem 6. If a section of a message in English (alphabet size 29 characters) with a length of s characters with an average number of character variants ...
The most likely number of possible legal texts (including the true one) that will be found when restoring the message segment (the mode of the hypergeometric distribution):

l_i \ s     10     15     20     25
8            3      1      1      1
10          24      1      1      1
12         143      2      1      1
14         667     11      1      1
16        2534     80      1      1
Table 3. Most likely number of possible legal texts

Therefore, restoring 10-grams with an average number of character variants greater than 8-9 is practically meaningless. For 15-grams, recovery remains effective only for L of no more than 12-13.
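The mode of such a hypergeometric model can be computed exactly with integer arithmetic, even though $N = 29^s$ is very large. The sketch below rests on one particular reading of the model, which is an assumption: the true variant is set aside, and the remaining $n_i - 1$ false options are drawn without replacement from the other $N - 1$ s-grams, of which $D - 1$ are legal. It uses the known mode of the hypergeometric distribution, $\lfloor (n+1)(K+1)/(N+2) \rfloor$. The function name and the entropy values passed to it are hypothetical and are not the measured values behind Table 3.

```python
def most_likely_legal_count(s, l, H, m=29):
    """Most likely number of legal texts (true one included) among the
    candidates for a segment of length s with geometric-mean variant
    count l, given a per-character entropy H (bits) for s-grams."""
    N = m ** s                 # all s-grams over the alphabet
    D = round(2 ** (H * s))    # estimated number of legal s-grams
    n = round(l ** s)          # candidate recoveries for the segment
    # Assumed model: n-1 false candidates drawn without replacement from
    # N-1 s-grams, D-1 of which are legal.  The hypergeometric mode
    # floor((n'+1)(K'+1)/(N'+2)) with n'=n-1, K'=D-1, N'=N-1 simplifies
    # to floor(n*D/(N+1)).
    false_legal_mode = (n * D) // (N + 1)
    return 1 + false_legal_mode
```

For example, with the limiting entropy H = 0.8 from the text, a segment of length 25 and 16 variants per character yields a most likely count of 1, in line with the qualitative conclusion that ambiguity vanishes for long segments.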
As the length of the message segment increases, the degree of ambiguity in restoring the plaintext decreases. However, for long sections of text, compiling the search dictionaries becomes very difficult. The model allows us to choose the optimal value of the main algorithm parameter, the length of the recovered message segment, based on the best trade-off between the degree of ambiguity of the plaintext recovery and the coverage of the corresponding dictionary. Such a parameter is the minimum length of the restored s-gram at which the probability of restoring the segment tends to one for any average number of character variants considered. Based on the probabilities of the hypergeometric distribution, the length of such a minimal s-gram is determined to be 16 characters.
The limit distribution of the number of possible legal texts arises as s → ∞. The type of the limit distribution depends on the number of possible legal texts in the sample.
Theorem 7. The probability of finding exactly 1 legal text (the true one) as s → ∞:
Theorem 8. If the number of possible legal texts is $k = 2^{s \cdot \beta}$, then the probability of getting k possible legal texts when restoring a segment of length s → ∞ is:

Algorithm efficiency
The probability of restoring a bounded segment of a given length is the joint probability that such a segment appears in the message and that its restoration succeeds within the acceptable ambiguity:
$$P_{recovery}(s, l_i) = P_{occurrence}(s, l_i) \cdot P_{uniqueness}(s, l_i),$$
where $P_{occurrence}(s, l_i)$ is the probability of occurrence of a segment of length s with geometric mean number of character variants $l_i$, $P_{uniqueness}(s, l_i)$ is the probability that the ambiguity of the segment recovery does not exceed the allowed level, and $P_{recovery}(s, l_i)$ is the probability of successful recovery of the segment.

Conclusion
In this paper, we propose an algorithm for restoring individual sections of encrypted messages, which can be used in a number of cases, such as the use of an incomplete or non-uniform key sequence, side-channel leakage, and repeated use of the same encryption key. The recovery procedure is based on compiling dictionaries of the appropriate lengths, selecting message segments that contain characters with a relatively small number of possible variants, enumerating all possible combinations, and eliminating false variants that are not dictionary entries. The theoretical evaluation of the proposed method is carried out in the framework of a uniform model, and the optimal parameters of the algorithm are found.
As part of the experimental study, s-gram dictionaries were created and their statistical properties were studied. A method for the theoretical evaluation of dictionary coverage is developed. The values of the entropy of s-grams are found and the limit value of the entropy is estimated.
Various probability distributions arising in the framework of this algorithm are considered. It is found that as the length of the segment to be restored increases, the probability of finding a bounded segment approaches the normal distribution. Limit distributions are given for the probability of occurrence of a single segment, for the conditional probability of occurrence of a bounded segment, and for the joint occurrence of bounded segments. It is found that the normal distribution gives a good approximation even for small values of the segment length (tens). In addition, it is shown that the degree of ambiguity arising in the problem of restoring text segments is described by a hypergeometric distribution, and some numerical calculations of the probability of obtaining an acceptable number of possible legal texts during the restoration are given. The asymptotics of the hypergeometric distribution for large values of the segment length are found. Using the probability distributions found, the overall theoretical efficiency of the algorithm, that is, the average fraction of recovered segments, is estimated.
In the future, it is planned to consider other probability distributions of the number of variants of a message character (polynomial), to investigate the probability distribution that arises when the key is reused, and to evaluate the effectiveness of the algorithm in this case.

Theorem 3.
a) The correlation coefficient of two adjacent segments $S_i$ and $S_{i+1}$ of length s characters is $\rho = \frac{s-1}{s}$.
b) The correlation coefficient of two arbitrary segments $S_k$ and $S_z$ of length s characters is $\rho = \frac{s - z + k}{s}$ if $z - k < s$, and $\rho = 0$ if $z - k \ge s$.
Theorem 4. a) Conditional probability of occurrence of a bounded segment, provided that the previous segment is bounded:

Fig. 3. Number of possible legal texts per segment

Table 2. Dictionaries and entropy