Encrypted audio dataset based on the Collatz conjecture

In information security, one way to keep content secret is through encryption. The objective is to alter the content so that it is unintelligible and only the intended user can recover it. To provide examples of encrypted audio data, we applied a novel encryption method based on the Collatz conjecture to five hundred speech recordings (50 speakers, 10 different messages), obtaining five hundred encrypted audio files. The main characteristics of our encrypted recordings are as follows: the spectrogram is quasi-uniform, the histograms have a repetitive pattern, the average of the samples is around −0.4, the standard deviation is around 0.55, and the Shannon entropy is around 7.5 (for 8 bits per sample). The novelty of the results lies in obtaining behavior completely different from that of natural speech recordings, which have a spectrogram with higher energy at low frequencies, a histogram with Gaussian behavior, an average of samples around 0, a standard deviation around 0.11, and an entropy around 5.5. A more comprehensive analysis of our encrypted signals can be found in the article “High-uncertainty audio signal encryption based on the Collatz conjecture” in the Journal of Information Security and Applications.



Data
The shared data contain 500 audio files that have been encrypted using the algorithm proposed in Ref. [2]. The original audio recordings are 500 audio files corresponding to 50 speakers and 10 different messages per speaker. The encrypted files are new audio signals with unintelligible content that can be used to test cryptanalysis techniques; they have a length in the range [85 295] s, a sampling frequency of 8 kHz and 32 bits/sample (128 kbps).
Statistical analyses of the encrypted audio data are provided in Figs. 1 to 4, using radial plots.

Experimental design, materials, and methods
The encrypted audio data were obtained from 500 speech recordings using the method presented in Ref. [2]. The statistical analysis of the 500 encrypted audio files is provided in the file titled Encrypted audios.rar.
The average, m, is calculated as $m = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $x_i$ are the audio samples (original or encrypted) and $n$ is the number of samples. Fig. 1 shows the average for each group of audio files (i.e., 500 original audios and 500 encrypted audios). According to Fig. 1, the average of the original audios is around 0, whereas for the encrypted audio files it is around −0.4. The difference in this statistical metric between the original audios and their encrypted versions obtained with the Collatz conjecture is remarkable.
The standard deviation, s, is obtained as $s = \sqrt{\sum_{i=1}^{n} |x_i - m|^2 / (n - 1)}$. Fig. 2 shows the results for this parameter. It is remarkable that the s values of the original audio signals vary noticeably, whereas those of the encrypted audios are nearly constant.
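As a minimal illustration of the two definitions above, the following Python sketch computes the average and the standard deviation (with the n − 1 denominator) of a list of samples. The function names and the toy signal are hypothetical and are not part of the published dataset or of the method in Ref. [2]; samples are assumed to be already loaded as floating-point values.

```python
import math


def average(samples):
    """Arithmetic mean m = (1/n) * sum(x_i)."""
    return sum(samples) / len(samples)


def standard_deviation(samples):
    """Standard deviation s with the (n - 1) denominator used in the text."""
    m = average(samples)
    squared_deviations = sum(abs(x - m) ** 2 for x in samples)
    return math.sqrt(squared_deviations / (len(samples) - 1))


# Toy signal for illustration only (not taken from the dataset).
toy = [-1.0, -0.5, 0.0, 0.5]
print(average(toy), standard_deviation(toy))
```

Applied to every file of each group, these two functions would give the kind of per-recording values that are compared in Fig. 1 and Fig. 2.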
Kurtosis is obtained through the equation $k = E[(x - m)^4] / s^4$, where E(.) represents the expected value of the data. Fig. 3 shows the results for the original recordings and their encrypted files. Again, the behavior of these two groups is completely different.
The Shannon entropy, H(x), is obtained as $H(x) = -\sum_{i=1}^{n} P(x_i)\log_2(P(x_i))$, where P(.) is the probability of the data. For data with uniform distribution, i.e., where all values are equally likely, the expected entropy value is equal to the number of bits per sample [3]. Otherwise, entropy decreases. Fig. 4 shows the entropy comparison.
Given that the encrypted audio files have 8 bits per sample, the theoretical highest value of entropy is 8. According to Fig. 4, the entropy of the encrypted data is around 7.5, whereas for natural audio signals it is around 5.5 in most cases. The entropy obtained in the encrypted audio files suggests that the level of uncertainty is very close to the highest possible.

Value of the Data
- This data set can be used for cryptanalysis purposes in order to try to break the encryption method proposed in Ref. [2].
- It is useful for comparing the quality of encrypted audios in terms of their statistics, such as average, standard deviation, kurtosis, and entropy. Our encrypted audio files have the following values: average around −0.4, standard deviation around 0.55, kurtosis around 2.03, and entropy of 7.5 (for 8 bits).
- In addition, it can be used to compare the behavior of the encrypted audio signal in terms of its spectrogram (quasi-uniform) and histogram (repetitive pattern).
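The kurtosis and Shannon entropy used in this article can be sketched in Python as follows. This is only an illustration under stated assumptions: the expectation E(.) is estimated by the sample mean, the fourth power of s is taken as the squared second central moment (practically identical to the n − 1 version for long recordings), and samples are assumed to lie in [−1, 1] and to be mapped uniformly to 8-bit levels before the entropy is computed. The quantization scheme and all function names are hypothetical; the source does not state how the authors quantized the data.

```python
import math
from collections import Counter


def kurtosis(samples):
    """k = E[(x - m)^4] / s^4, with expectations estimated by sample means."""
    n = len(samples)
    m = sum(samples) / n
    second_moment = sum((x - m) ** 2 for x in samples) / n
    fourth_moment = sum((x - m) ** 4 for x in samples) / n
    return fourth_moment / second_moment ** 2


def quantize(samples, bits=8):
    """Map samples in [-1, 1] to integer levels 0 .. 2**bits - 1 (assumed scheme)."""
    levels = 2 ** bits
    quantized = []
    for x in samples:
        level = int((x + 1.0) / 2.0 * levels)
        quantized.append(min(max(level, 0), levels - 1))  # clamp the end points
    return quantized


def shannon_entropy(symbols):
    """H = -sum(P(x_i) * log2(P(x_i))) over the distinct symbol values."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Sanity check: 256 equally likely 8-bit values reach the 8-bit maximum.
uniform_symbols = list(range(256))
print(shannon_entropy(uniform_symbols))  # -> 8.0
```

For a recording held in a list `x`, the entropy for 8 bits per sample would then be `shannon_entropy(quantize(x))`.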
With the encrypted recordings, the histogram and spectrogram can be obtained and plotted, for example with Matlab.
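A rough Python counterpart for obtaining the histogram and spectrogram is sketched below; it is not the authors' Matlab code. It assumes that NumPy is available (and Matplotlib for the optional plot), that the recording has already been loaded as an array of samples in [−1, 1], and that the sampling frequency is 8 kHz as stated for the dataset. The frame length, hop size, number of bins, and function names are arbitrary, hypothetical choices.

```python
import numpy as np


def histogram(samples, bins=256):
    """Count samples per amplitude bin over the assumed range [-1, 1]."""
    counts, edges = np.histogram(samples, bins=bins, range=(-1.0, 1.0))
    return counts, edges


def spectrogram(samples, fs=8000, frame=256, hop=128):
    """Short-time power spectrum in dB using Hann-windowed frames."""
    x = np.asarray(samples, dtype=float)
    window = np.hanning(frame)
    starts = np.arange(0, len(x) - frame + 1, hop)
    frames = np.stack([x[s:s + frame] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    freqs = np.fft.rfftfreq(frame, d=1.0 / fs)
    times = (starts + frame / 2) / fs
    # Rows are frequencies, columns are time frames.
    return freqs, times, 10 * np.log10(power.T + 1e-12)


def plot(samples, fs=8000):
    """Draw the histogram and spectrogram side by side."""
    import matplotlib.pyplot as plt  # imported here so the analysis runs without it

    counts, edges = histogram(samples)
    freqs, times, sxx_db = spectrogram(samples, fs)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.bar(edges[:-1], counts, width=np.diff(edges), align="edge")
    ax1.set(xlabel="Sample value", ylabel="Count", title="Histogram")
    ax2.pcolormesh(times, freqs, sxx_db, shading="auto")
    ax2.set(xlabel="Time (s)", ylabel="Frequency (Hz)", title="Spectrogram")
    plt.show()


# Synthetic one-second signal for illustration only (not from the dataset).
rng = np.random.default_rng(0)
demo = rng.uniform(-1.0, 1.0, 8000)
counts, edges = histogram(demo)
freqs, times, sxx_db = spectrogram(demo)
print(counts.sum(), sxx_db.shape)
```

Calling `plot(x)` on a loaded recording `x` would display both figures. Reading the files themselves is left out because the container format inside Encrypted audios.rar is not stated in the text.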

Funding sources
This work was funded by the "Universidad Militar Nueva Granada - Vicerrectoría de Investigaciones" under the grant IMP-ING-2936.