Lossless Steganography for Speech Communications

By transmitting supplementary data through steganography, new functions can be added to communications systems without changing the conventional data format. Based on this concept, several applications have been proposed for enhancing the speech quality of telephony communications. These applications secretly transmit side information along with the speech data itself in order to enhance the performance of signal processing such as packet loss concealment and band extension (Aoki, 2003; Aoki, 2006; Aoki, 2007a; Aoki, 2012).


Lossless steganography technique based on the folded binary code
For representing signed integers as binary data, many speech codecs employ the folded binary code instead of the 2's complement, the most common format of binary data. Table 1 shows how 8 bit speech data is represented by the folded binary code as well as the 2's complement.

Table 1. 2's complement and folded binary code for representing 8 bit speech data.

As shown in this table, 8 bit speech data encoded in the 2's complement ranges from -128 to +127. On the other hand, 8 bit speech data encoded in the folded binary code ranges from -127 to +127. Although the folded binary code cannot represent -128, it represents both +0 and -0 instead. This redundancy can be a container for embedding a secret message without any degradation.
The embedding procedure of the proposed technique is programmed in C language as shown in Fig. 2. In this procedure, b represents a 1 bit secret message and c represents an 8 bit cover data encoded in the folded binary code. The proposed technique is categorized as a lossless steganography technique, since it does not degrade the cover data by embedding the secret message.
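The code of Fig. 2 is not reproduced in this text, so the following is only a minimal sketch of the idea described above, not the author's listing. It assumes that bit 7 of the 8 bit cover data is the sign (0 for plus, 1 for minus) and that bits 6 to 0 are the magnitude, so that +0 is 0x00 and -0 is 0x80. The function names fb_is_container, fb_embed, fb_extract, and fb_decode are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of the folded binary code (8 bit):
   bit 7 is the sign (0: plus, 1: minus), bits 6..0 are the magnitude.
   Under this assumption +0 is 0x00 and -0 is 0x80. */

/* Returns 1 if c is +0 or -0, i.e. a container for 1 bit of secret message. */
int fb_is_container(uint8_t c)
{
    return (c & 0x7F) == 0;
}

/* Embeds the 1 bit secret message b into the cover data c.
   Only the sign of zero is rewritten; every other code is returned unchanged. */
uint8_t fb_embed(uint8_t c, int b)
{
    assert(b == 0 || b == 1);
    if (fb_is_container(c)) {
        return (uint8_t)(b << 7);   /* b = 0 gives +0, b = 1 gives -0 */
    }
    return c;
}

/* Extracts the secret bit from c, or returns -1 if c carries no message. */
int fb_extract(uint8_t c)
{
    return fb_is_container(c) ? (c >> 7) : -1;
}

/* Decodes a folded binary code to a signed integer in -127..+127. */
int fb_decode(uint8_t c)
{
    int magnitude = c & 0x7F;
    return (c & 0x80) ? -magnitude : magnitude;
}
```

Since fb_decode returns 0 for both 0x00 and 0x80, rewriting the sign of zero leaves the decoded value untouched, which is the lossless property claimed above.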

G.711
G.711 is the most common codec for telephony speech standardized by ITU-T (International Telecommunication Union Telecommunication Standardization Sector) (ITU-T, 1988). It consists of μ-law and A-law schemes designated as PCMU and PCMA, respectively. PCMU is mainly employed in North America and Japan. It encodes 14 bit speech data into 8 bit compression data at an 8 kHz sampling rate. PCMA is mainly employed in Europe. It encodes 13 bit speech data into 8 bit compression data at an 8 kHz sampling rate.
Figures 3 and 4 show the encoding and decoding procedures of PCMU. The compression data of PCMU consists of a 1 bit sign, a 3 bit exponent, and a 4 bit mantissa (ITU-T, 2005). The compression data is encoded in the folded binary code. Table 2 shows some of the compression data and their corresponding speech data decoded with PCMU. As shown in this table, the speech data decoded with PCMU ranges from -0 to -8031, and +0 to +8031.
Table 2 also lists compression data and their corresponding speech data decoded with PCMA. As shown in this table, the speech data decoded with PCMA ranges from -1 to -4032, and +1 to +4032.
Note that there is an overlap in the speech data decoded with PCMU: both +0 and -0 are decoded to 0. On the other hand, there is no such overlap in the speech data decoded with PCMA. This indicates that a lossless steganography technique is available for PCMU, although it is not for PCMA.

Table 2. Speech data decoded with PCMU and PCMA.
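The overlap can be checked with a small decoder sketch. The code below is not the procedure of Figs. 4 and 6, which are not reproduced in this text. It uses the commonly cited segment formulas of G.711 expansion on the 14 bit (PCMU) and 13 bit (PCMA) scales. As an assumption, the compression data is taken in the plain folded binary layout (sign, exponent, mantissa), and the bit inversion that G.711 applies to the transmitted octet is omitted. The function names are hypothetical.

```c
#include <stdint.h>

/* c: 8 bit compression data in the folded binary code
   (bit 7: sign, bits 6..4: exponent, bits 3..0: mantissa).
   Assumption: the bit inversion applied by G.711 on the wire is omitted. */

/* PCMU (mu-law) expansion on the 14 bit scale: magnitude 0..8031. */
int pcmu_decode(uint8_t c)
{
    int e = (c >> 4) & 0x07;
    int m = c & 0x0F;
    int x = (((m << 1) + 33) << e) - 33;
    return (c & 0x80) ? -x : x;
}

/* PCMA (A-law) expansion on the 13 bit scale: magnitude 1..4032. */
int pcma_decode(uint8_t c)
{
    int e = (c >> 4) & 0x07;
    int m = c & 0x0F;
    int x = (e == 0) ? ((m << 1) + 1) : (((m << 1) + 33) << (e - 1));
    return (c & 0x80) ? -x : x;
}
```

Under these assumptions pcmu_decode maps both 0x00 and 0x80 to 0, whereas pcma_decode maps them to +1 and -1, so only PCMU offers a pair of codes with the same decoded value.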

Lossless steganography technique for G.711
Taking account of this characteristic of PCMU, a secret message can be embedded into both +0 and -0 in the compression data without any degradation. When 0 is required to be embedded, the sign bit of the compression data is changed to be 0. This means that the compression data is changed to be +0. When 1 is required to be embedded, the sign bit of the compression data is changed to be 1. This means that the compression data is changed to be -0. The embedding procedure of the proposed technique is defined as follows.
\[ c \leftarrow \begin{cases} +0 & (b = 0) \\ -0 & (b = 1) \end{cases} \quad \text{if } |c| = 0 \]
where b represents a 1 bit secret message and c represents an 8 bit compression data. This procedure is programmed in C language as shown in Fig. 7.
Figure 8 shows an example of the proposed technique. The compression data is represented as white and black circles according to the sign bit. The sign bit of the compression data represented by a white circle is 0. On the other hand, the sign bit of the compression data represented by a black circle is 1.
This example shows 4 candidates that can contain a 4 bit secret message in total. According to their sign bits, these data originally contain a 4 bit secret message represented as (0, 0, 1, 1). In order to embed a secret message represented as (0, 1, 0, 1), the proposed technique changes these data as shown in Fig. 8. Since all of these are decoded to be 0 even if their sign bits are changed, the proposed technique does not degrade the speech quality at all.
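The example of Fig. 8 can be traced with a short sketch that walks over a buffer of compression data. It is an illustration under the same assumption as before (sign in bit 7, +0 as 0x00 and -0 as 0x80, no wire inversion), and the function names pcmu_embed_stream and pcmu_extract_stream are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Embeds up to n_bits secret bits into the n compression data in c[].
   Only data of +0 (0x00) or -0 (0x80) are rewritten.
   Returns the number of bits actually embedded. */
size_t pcmu_embed_stream(uint8_t *c, size_t n, const int *bits, size_t n_bits)
{
    size_t used = 0;
    assert(c != NULL && bits != NULL);
    for (size_t i = 0; i < n && used < n_bits; i++) {
        if ((c[i] & 0x7F) == 0) {
            c[i] = (uint8_t)((bits[used] & 1) << 7);
            used++;
        }
    }
    return used;
}

/* Reads the sign bits of the zero-valued data back into bits[].
   Returns the number of bits found (at most max_bits). */
size_t pcmu_extract_stream(const uint8_t *c, size_t n, int *bits, size_t max_bits)
{
    size_t found = 0;
    assert(c != NULL && bits != NULL);
    for (size_t i = 0; i < n && found < max_bits; i++) {
        if ((c[i] & 0x7F) == 0) {
            bits[found++] = c[i] >> 7;
        }
    }
    return found;
}
```

For an illustrative frame {0x00, 0x12, 0x00, 0x80, 0x91, 0x80}, whose four zero-valued data carry the sign bits (0, 0, 1, 1), embedding (0, 1, 0, 1) yields {0x00, 0x12, 0x80, 0x00, 0x91, 0x80}. The nonzero data are untouched, and the receiver needs no side information to locate the containers, since it finds them by the same zero-magnitude test.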

DVI-ADPCM
The concept of the proposed technique may potentially be applicable to other codecs that also employ the folded binary code. Another example is DVI-ADPCM. Not only G.711 but also DVI-ADPCM is employed in telephony communications as a standard VoIP (Voice over IP) codec (RFC, 1996). Figures 10 and 11 show the decoding procedures of DVI3 and DVI4 programmed in C language. In these procedures, c is a compression data, x is a speech data, d is a difference between the previous and the current speech data, and s is a step size (Microsoft, 1994).
Note that there is no such condition that allows a lossless steganography technique for DVI3. This means that a lossless steganography technique is not available for DVI3 in the same manner as the proposed technique for DVI4.
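The condition for DVI4 is not stated in the surviving text, so the following sketch is an assumption rather than the author's procedure. It follows the commonly documented IMA/DVI ADPCM decoding, in which the difference d is built from the step size s and the magnitude bits of c, and in which the smallest step size is 7. Under that reading, +0 and -0 of DVI4 decode identically only when d is 0, which requires a zero magnitude and the smallest step size, while the corresponding d of DVI3 is never 0. The names dvi4_difference, dvi3_difference, dvi4_is_container, and DVI_MIN_STEP are hypothetical.

```c
#include <assert.h>

/* Smallest step size of the IMA/DVI ADPCM step table (assumed here). */
#define DVI_MIN_STEP 7

/* Difference d decoded by DVI4 from the 3 bit magnitude of c and step size s. */
int dvi4_difference(int c, int s)
{
    int d = s >> 3;
    assert(s >= DVI_MIN_STEP);
    if (c & 4) d += s;
    if (c & 2) d += s >> 1;
    if (c & 1) d += s >> 2;
    return d;
}

/* Difference d decoded by DVI3 from the 2 bit magnitude of c and step size s. */
int dvi3_difference(int c, int s)
{
    int d = s >> 2;
    assert(s >= DVI_MIN_STEP);
    if (c & 2) d += s;
    if (c & 1) d += s >> 1;
    return d;
}

/* A DVI4 code (sign in bit 3) is a lossless container only if flipping
   its sign cannot change the decoded speech data, i.e. if d is 0. */
int dvi4_is_container(int c, int s)
{
    return (c & 7) == 0 && dvi4_difference(c, s) == 0;
}
```

With s = 7, dvi4_difference(0, 7) is 7 >> 3 = 0, whereas dvi3_difference(0, 7) is 7 >> 2 = 1, which is consistent with the statement that no such condition exists for DVI3.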

Capacity of the proposed technique
The capacity of the proposed technique was evaluated by using speech data obtained from actual telephony environments, such as a private room, an office room, a cafeteria, and a railroad station. In these conditions, 8 male speech data (m1 - m8) and 8 female speech data (f1 - f8) were collected. As shown in Table 3, the duration of the speech data denoted as L was more than 120 s. The voice activity ratio of the speech data denoted as R was around 50 %, since telephony speech generally shows a half duplex structure due to the alternate conversation process (Wright, 2001). The RMS (Root Mean Square) of the background noise was calculated from voice inactive intervals.
The capacity of the proposed technique for PCMU is shown in Fig. 14. This figure also shows a solid line that represents the average capacity obtained from a simulation using a speech dialogue database (ATR, 1997). It is indicated that the capacity of the proposed technique depends on the background noise in each telephony environment. The capacity ranges from 3.3 % to 6.4 % for the speech data obtained from a private room in which the background noise is almost imperceptible. Since 8000 compression data are transmitted per second and each container carries 1 bit, this corresponds to a capacity from 264 bps to 512 bps. On the other hand, the capacity ranges from 0.24 % to 0.44 % for the speech data obtained from a railroad station in which the background noise is very annoying. This corresponds to a capacity from 19.2 bps to 35.2 bps.
The capacity of the proposed technique for DVI4 as well as PCMU is shown in Table 4. Compared with PCMU, the capacity for DVI4 is much smaller. Note that the capacity for DVI4 is very small even if the background noise is almost imperceptible. The capacity ranges from 0.029 % to 0.16 % for the speech data obtained from a private room. This corresponds to a capacity from 2.32 bps to 12.8 bps. On the other hand, there is no capacity for the speech data obtained from a cafeteria and a railroad station.
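The conversion between the percentages and the bit rates quoted above can be sketched as follows. The sketch assumes that the capacity in percent is the ratio of compression data that can serve as containers, that each container carries 1 bit, and that 8000 compression data are transmitted per second. The names pcmu_capacity_ratio and capacity_bps are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of compression data per second at the 8 kHz sampling rate. */
#define SAMPLING_RATE_HZ 8000.0

/* Ratio of the compression data that can carry 1 bit (PCMU: +0 or -0). */
double pcmu_capacity_ratio(const uint8_t *c, size_t n)
{
    size_t containers = 0;
    assert(c != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((c[i] & 0x7F) == 0) {
            containers++;
        }
    }
    return n ? (double)containers / (double)n : 0.0;
}

/* Capacity in bps for a given container ratio (0.033 means 3.3 %). */
double capacity_bps(double ratio)
{
    return ratio * SAMPLING_RATE_HZ;
}
```

For instance, a ratio of 0.033 gives 0.033 x 8000 = 264 bps and a ratio of 0.00029 gives 2.32 bps, matching the figures quoted for the private room.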

Semi-lossless steganography
The semi-lossless steganography technique is an idea for increasing the capacity of the proposed technique (Aoki, 2010b). This article describes how the capacity of the lossless steganography technique for PCMU can be increased by the semi-lossless steganography technique.
Figure 15 shows how the semi-lossless steganography technique embeds a secret message. In the embedding procedure, this technique modifies an 8 bit compression data as follows.
\[ c \leftarrow \begin{cases} c + j & (c > 0) \\ c - j & (c < 0) \end{cases} \]
where j (≥0) represents the amplitude modification level.
The amplitude modification may cause undesirable clipping in the 8 bit compression data, if its magnitude exceeds 127-j. Consequently, this technique can recover the original speech data only when the amplitude of the 8 bit compression data ranges from -127+j to +127-j. Most of the practical cases meet this condition when the amplitude modification level is small enough. This is based on the fact that the amplitude of speech data statistically shows an exponential distribution (Rabiner, 1978). In general, the maximum magnitude of the 8 bit compression data is less than 127-j.
In this sense, this technique can be categorized as a reversible steganography technique, if this condition is satisfied.However, if this condition is not satisfied, this technique cannot recover the original speech data any more.Therefore, this technique is named semi-lossless steganography technique in this study.
In the embedding procedure of the semi-lossless steganography technique, the secret message is embedded into the compression data whose magnitude is 0, as in the lossless steganography technique, but each of these data carries more than 1 bit when j is larger than 0.
The capacity of the lossless steganography technique is defined as N bit, where N represents the number of the compression data in which the secret message can be embedded. On the other hand, the capacity of the semi-lossless steganography technique is defined as \( N(\log_2(1 + j) + 1) \) bit. The capacity of the semi-lossless steganography technique increases according to the amplitude modification level. However, undesirable clipping may occur more frequently in such a situation.
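As a worked example, reading the capacity of the semi-lossless steganography technique as N(log2(1 + j) + 1) bit, the values for a few amplitude modification levels are as follows. The subscripts "lossless" and "semi" are labels introduced here for illustration only.

```latex
\begin{align*}
C_{\mathrm{lossless}} &= N \\
C_{\mathrm{semi}}(j)  &= N\left(\log_2(1 + j) + 1\right) \\
C_{\mathrm{semi}}(0)  &= N(0 + 1) = N \\
C_{\mathrm{semi}}(1)  &= N(1 + 1) = 2N \\
C_{\mathrm{semi}}(3)  &= N(2 + 1) = 3N \\
C_{\mathrm{semi}}(7)  &= N(3 + 1) = 4N
\end{align*}
```

With j = 0 the expression reduces to the capacity of the lossless steganography technique, and each doubling of 1 + j adds 1 bit per container, so the capacity grows only logarithmically while the risk of clipping grows with j.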
After the extracting procedure of the secret message, the semi-lossless steganography technique recovers the 8 bit compression data as follows.
\[ c \leftarrow \begin{cases} c - j & (j < c \le 127) \\ 0 & (-j \le c \le j) \\ c + j & (-127 \le c < -j) \end{cases} \]
Of course, the recovery procedure is necessary for decoding the original speech data. However, this procedure is omitted in the conventional telephony systems that do not implement the semi-lossless steganography technique. In such a situation, there is no way to remove the degradation from the speech data.
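The whole cycle of amplitude modification, embedding, extraction, and recovery can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: the compression data is taken in the folded binary layout (sign in bit 7, magnitude in bits 6 to 0), nonzero magnitudes are shifted by j with clipping at 127, and the mapping of a message symbol v to a sign and a magnitude in sl_embed is a hypothetical choice that merely provides the 2(1 + j) states implied by the capacity. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Amplitude modification: nonzero magnitudes are increased by j.
   Magnitudes above 127 - j are clipped, so the original cannot be recovered. */
uint8_t sl_modify(uint8_t c, int j)
{
    int sign = c & 0x80;
    int mag = c & 0x7F;
    assert(j >= 0 && j <= 127);
    if (mag == 0) return c;
    mag += j;
    if (mag > 127) mag = 127;
    return (uint8_t)(sign | mag);
}

/* Embedding (assumed mapping): a symbol v in 0 .. 2(1 + j) - 1 is written
   into a zero-valued datum; the low bit of v becomes the sign and the
   remaining bits become a magnitude in 0..j. Other data are unchanged. */
uint8_t sl_embed(uint8_t c, int j, int v)
{
    assert(v >= 0 && v < 2 * (1 + j));
    if ((c & 0x7F) != 0) return c;
    return (uint8_t)(((v & 1) << 7) | (v >> 1));
}

/* Extraction: data with magnitude <= j carry a symbol, others return -1. */
int sl_extract(uint8_t c, int j)
{
    int mag = c & 0x7F;
    if (mag > j) return -1;
    return (mag << 1) | (c >> 7);
}

/* Recovery: containers return to 0, other magnitudes are reduced by j. */
uint8_t sl_recover(uint8_t c, int j)
{
    int sign = c & 0x80;
    int mag = c & 0x7F;
    if (mag <= j) return 0x00;
    return (uint8_t)(sign | (mag - j));
}
```

With j = 3, the datum 0x05 becomes 0x08 and is recovered as 0x05, while a zero-valued datum can hold any symbol from 0 to 7, which is 3 bit. A datum of magnitude 126 is clipped to 127 and recovered as 124, which illustrates why the technique is only semi-lossless.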
In order to evaluate such degradation, this study investigated the quality of the modified speech data by using PESQ (Perceptual Evaluation of Speech Quality) (ITU-T, 2001). PESQ is widely employed as an objective evaluation measure of the speech quality in telephony communications. Taking account of the characteristics of human auditory perception, PESQ positively correlates with a subjective evaluation measure such as MOS (Mean Opinion Score). The PESQ score ranges from -0.5 to 4.5. The higher the PESQ score, the better the speech quality.
Figure 16 shows the average PESQ scores with 95 % confidence intervals. These were calculated from the 16 speech data employed in the evaluation of the capacity of the proposed technique.
This figure indicates that the amplitude modification causes some degradation. However, it is almost imperceptible when the amplitude modification level is small enough. This result may potentially assure the compatibility of the semi-lossless steganography technique with the conventional telephony systems. This means that normal playback of the speech data modified with the proposed technique is still acceptable in the conventional telephony systems that do not implement the semi-lossless steganography technique.

Conclusion
This article described an idea for the lossless steganography technique based on the characteristic of the folded binary code employed in several speech codecs, such as G.711 and DVI-ADPCM. In addition, an idea for the semi-lossless steganography technique was also described.
The proposed techniques take advantage of the redundancy of the speech codecs. It is a sort of loophole in the speech codecs that can be employed as a container of a secret message. Such a loophole plays an important role in embedding a secret message without any degradation.
The concept of the proposed technique may potentially be applicable to other codecs that also employ the folded binary code. Besides G.711 and DVI-ADPCM, it is of interest to find out the codecs in which the proposed technique is available. In addition, it is also of interest to develop some practical applications that employ the proposed technique for transmitting a secret message. Both of these topics are future work of this study.

Acknowledgment
The author would like to express gratitude to the Ministry of Education, Culture, Sports, Science and Technology of Japan for providing a grant (no. 21760270) toward this study.

Fig. 2. Embedding procedure of the proposed technique programmed in C language.

Figures 5 and 6 show the encoding and decoding procedures of PCMA. The compression data of PCMA consists of a 1 bit sign, a 3 bit exponent, and a 4 bit mantissa (ITU-T, 2005). The compression data is encoded in the folded binary code. Table 2 shows some of the compression data and their corresponding speech data decoded with PCMA.

Fig. 4. Decoding procedure of PCMU.

Fig. 7. Embedding procedure of the proposed technique programmed in C language.

Fig. 8. Example of the proposed technique: (a) compression data before embedding, and (b) compression data after embedding.

DVI-ADPCM is a speech codec based on the ADPCM (Adaptive Differential Pulse Code Modulation) algorithm developed by DVI (Intel's Digital Video Interactive Group) (Microsoft, 1994). The block diagram of the ADPCM algorithm is shown in Fig. 9. In this diagram, x(n) represents a speech data and c(n) represents a compression data at the time of n. DVI-ADPCM is designated as DVI3 and DVI4 according to the size of the compression data. DVI3 encodes 16 bit speech data into 3 bit compression data at an 8 kHz sampling rate. DVI4 encodes 16 bit speech data into 4 bit compression data at an 8 kHz sampling rate. The compression data of DVI3 consists of a 1 bit sign and a 2 bit magnitude. The compression data of DVI4 consists of a 1 bit sign and a 3 bit magnitude. Both of these are encoded in the folded binary code.

Fig. 13. Embedding procedure of the proposed technique programmed in C language.

Fig. 14. Capacity of the proposed technique for PCMU: circles, triangles, diamonds, and squares represent the capacity of a private room, an office room, a cafeteria, and a railroad station, respectively. The solid line represents the average capacity obtained from a simulation using a speech dialogue database.

Table 4. Capacity of the proposed technique for PCMU and DVI4.

Table 3. Speech data obtained from actual telephony environment.