Development of a Hausa dataset as a baseline for speech recognition

The Hausa read-speech dataset was created by recording native Hausa speakers. The recordings took place at the Nile University of Nigeria audio studio and at a radio broadcasting studio. The recorded speech was segmented into unigrams and bigrams. The dataset contains 47 hours of recorded audio and can be used for automatic speech recognition, speech synthesis (text-to-speech) and speech-to-text applications.


Value of the Data
• The data presented are important because they constitute the first freely licensed Hausa speech dataset.
• The dataset can be used to develop automatic speech recognition (ASR), text-to-speech and speech-to-text systems for the Hausa language.
• Researchers can use the dataset for Hausa speech research, while industry can use it to develop applications for the Hausa language.
• The dataset is in a raw state and can be further segmented into trigrams, just as we have segmented it into unigrams and bigrams.

Data Description
Almost every language has a writing system. Hausa has two: Ajami (derived from Arabic script) and Boko (derived from Latin script). Hausa is a Chadic language within the Afro-Asiatic family [1]. The dataset uploaded to the repository contains recordings of nine different pieces of literature. The works were read by different people at different locations: the Nile University audio studio, a radio broadcasting studio and quiet rooms at different homes.
The aim of constructing the Hausa Speech Corpus (HSC) is to provide a public Hausa language corpus that can serve as a baseline for ASR research and applications [2], and to promote research on Hausa speech-processing applications. Although some Hausa speech corpora were collected and developed in [3] and [4], they are not publicly available and contain an insufficient amount of data to train reliable end-to-end models, which are extremely data hungry [5].
At the Nile University audio studio, the books Iliya Dan Maikarfi and Koya Da Kanka were read; 154 recordings took place there. At the broadcasting studio, the following works were read: Shehu Umar, A Duniya Ne, Rayuwar Hibba, Komai Nisan Dare, Gani Gare Ka, Wani Gari Yafi Gaban Kunu, Magana Jarice and Jiki Magayi; 167 recordings took place there. Kamus Na Turanci Da Hausa was read in quiet rooms by different people; 36 recordings were done at homes. The raw dataset is presented in mp3 format. The place of recording, recording time and gender of the reader for each work are presented in Table 1. The works were chosen because they contain rich Hausa vocabulary and grammar. Kamus Na Turanci Da Hausa is a Hausa-English dictionary whose recordings were made for a machine translation project, while Koya Da Kanka is a short Hausa book containing everyday Hausa conversation and digits. Fig. 1 depicts the speech corpus development flow. [Table residue: rows 6-11 listed Rayuwar Hibba, Komai Nisan Dare, Gani Gare Ka, Wani Gari Yafi Gaban Kunu, Magana Jarice and Jiki Magayi with two numeric columns whose headers were lost; see Table 1 in the repository.] Table 4 presents the statistical summary of the n-grams after segmentation.
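The unigram/bigram segmentation summarised in Table 4 can be sketched at the transcript level as follows. This is a minimal illustration; the example sentence, tokenization by whitespace, and the `ngrams` helper are our assumptions, not part of the published corpus pipeline:

```python
# Sketch: segmenting a transcript into unigrams and bigrams.
# The sentence below is illustrative, not taken from the corpus.

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

transcript = "sannu da zuwa"          # Hausa: "welcome"
tokens = transcript.split()

unigrams = ngrams(tokens, 1)          # one tuple per word
bigrams = ngrams(tokens, 2)           # consecutive word pairs

print(unigrams)
print(bigrams)
```

The same sliding-window idea extends directly to trigrams (`n=3`), which is the further segmentation the dataset description suggests.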

Experimental Design, Materials and Methods
Gutkin et al. [5] developed an open-source Yoruba speech corpus. Yoruba is spoken in West Africa and belongs to the Niger-Congo family. The authors recorded 36 male and female speakers in two offices in Lagos, Nigeria. The corpus was used for a text-to-speech application implemented with a hidden Markov model.
Georgescu et al. [6] developed the largest Romanian speech corpus: 100 hours of speech read by 164 people, making it one of the largest public Romanian speech datasets. They collected the corpus from interviews, news and literature utterances, and presented acoustic and language models for it. For model analysis they used an RNN-based language model.
The authors of [7] took advantage of Catalan TV programmes, which come with attached transcript files. Their aim in using public TV programmes was to reduce the cost of collecting and transcribing the dataset.
Selecting the correct script length for an audio set is an important factor in text-to-speech applications, as presented in [8].
Paper [9] describes the experiences and challenges of collecting speech data for Hindi via mobile phones. The goals in [10] were to improve the accuracy of speech-data transcription and harvesting, and to enhance text normalization and pronunciation modelling. According to [11], the major challenge in speech corpus creation is segmenting audio data into sentences. [12] presented the methodology used to design and create a Hindi speech corpus, involving text crawling, filtering, recording and annotation phases.
The Hausa speech corpus was recorded in three different locations: quiet rooms at homes, a university audio studio and a news broadcasting studio. The dataset comprises 357 files in mp3 format. Female and male readers participated voluntarily and were informed of the dataset collection and use protocols.

Study area
The works were selected based on their generic and specific nature. Volunteers were contacted for recording purposes, and the recording took place in three different locations.

Data acquisition
Some of the recordings took place at the Nile University of Nigeria Mass Communication Department audio studio, others at a broadcasting studio, and the rest in quiet rooms at different homes. Similarly, paper [15] presented the process of developing a speech corpus, pronunciation dictionary and transcription for Malayalam automatic speech recognition. That research started by building a text corpus containing connected and isolated digits for recognition tasks, together with text data for a continuous speech recognition task. The data were recorded in an office environment via microphone, with 15 speakers asked to read in a normal reading manner. The dataset contains Malayalam's unique phoneme categories and classes, which were used for analysis.

Data processing
Paper [13] described a system architecture and an acoustic-phonetic approach for developing a speech corpus. Work done in [14] presented a method that can automatically structure the text files of Filipino audio files for deep-learning automatic speech recognition.
Adobe Audition software was used for recording. For pre-processing, spectral subtraction and an adaptive noise cancellation algorithm were applied [18] to remove background or ambient noise, that is, to adjust the speech signal before creating feature vectors. Some recordings were filtered out due to pronunciation errors, background noise or lack of clarity in the readers' voices. The dataset is composed of 11 folders of mp3 files and contains 47 hours of spoken Hausa sentences, words and syllables, with a total size of 3.97 GB. Each file in a folder follows a naming convention used to distinguish one file from another.
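The spectral-subtraction step can be sketched as follows. This is a minimal illustration of the general technique, not the authors' exact pipeline: the paper cites [18] without giving parameters, so the frame size, the use of the first 0.25 s as a noise estimate, and the synthetic demo signal are all our assumptions.

```python
# Minimal spectral-subtraction sketch: estimate an average noise magnitude
# spectrum from the leading (assumed silent) frames, subtract it from every
# STFT frame, and resynthesise using the original phase.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(signal, sr, noise_secs=0.25, nperseg=512):
    """Subtract a noise magnitude spectrum, estimated from the first
    `noise_secs` of the recording, from every STFT frame."""
    f, t, Z = stft(signal, fs=sr, nperseg=nperseg)
    hop = nperseg // 2
    noise_frames = max(1, int(noise_secs * sr / hop))
    noise_mag = np.abs(Z[:, :noise_frames]).mean(axis=1, keepdims=True)
    mag = np.maximum(np.abs(Z) - noise_mag, 0.0)    # floor at zero
    Z_clean = mag * np.exp(1j * np.angle(Z))        # keep original phase
    _, clean = istft(Z_clean, fs=sr, nperseg=nperseg)
    return clean[: len(signal)]

# Synthetic demo: one second of a 440 Hz tone buried in white noise.
sr = 16000
tt = np.arange(sr) / sr
noisy = np.sin(2 * np.pi * 440 * tt) + 0.3 * np.random.randn(sr)
clean = spectral_subtract(noisy, sr)
```

In practice, an over-subtraction factor and a small spectral floor are usually added to reduce musical-noise artefacts; the floor-at-zero version above is the simplest form.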
The raw dataset needs pre-processing to make it ready for speech recognition. To manipulate the waveform signal, the data need to be translated into spectrograms; a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. Adobe Audition was used as an editing tool for the simple cutting and splicing needed for this speech manipulation.
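The waveform-to-spectrogram translation can be sketched as follows. The STFT parameters (25 ms windows with 60% overlap) and the 16 kHz synthetic tone are assumptions for illustration; the paper does not specify them:

```python
# Sketch: converting a waveform into a (log-)spectrogram via the STFT.
import numpy as np
from scipy.signal import spectrogram

sr = 16000                                  # assumed sample rate
t = np.arange(2 * sr) / sr                  # two seconds of audio
wave = np.sin(2 * np.pi * 300 * t)          # stand-in for a Hausa utterance

freqs, times, Sxx = spectrogram(wave, fs=sr, nperseg=400, noverlap=240)
log_spec = 10 * np.log10(Sxx + 1e-10)       # log scale, as usual for ASR

print(Sxx.shape)   # (frequency bins, time frames)
```

Each column of `Sxx` is the power spectrum of one short frame, so the 2-D array shows how the signal's frequency content evolves over time, which is exactly what ASR front-ends consume.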
A sample sentence waveform was taken from one audio file and spliced using WavePad. The transcript uttered in the Hausa sentence is "Sannu Da Zuwa" (English translation: "Welcome"). This follows the work of [17], which used audio segmentation to achieve accurate alignment; the segmentation target was to improve the baseline acoustic model by improving the quality of text normalization and the accuracy of the pronunciation dictionaries. To perform the speech segmentation, Adobe Audition was installed, and the recorded speech was segmented into sub-words and words. Segmentation is the act of breaking continuous speech into discrete units such as words, phonemes, syllables and meaningful sub-words [16]. These segmented utterances can be used when building automatic speech recognition systems.
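The manual cut-and-splice segmentation described above is often approximated automatically by splitting on low-energy (silent) frames. The sketch below shows this common energy-threshold approach; the frame length, threshold and demo signal are assumptions, since the authors segmented interactively in Adobe Audition rather than with this algorithm:

```python
# Energy-based segmentation sketch: split a waveform into voiced chunks
# separated by frames whose mean absolute amplitude falls below a threshold.
import numpy as np

def segment_on_silence(signal, sr, frame_ms=25, threshold=0.02):
    """Return (start, end) sample ranges of voiced regions."""
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    energy = np.array([np.abs(signal[i * frame:(i + 1) * frame]).mean()
                       for i in range(n)])
    voiced = energy > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame               # voiced region begins
        elif not v and start is not None:
            segments.append((start, i * frame))
            start = None                    # voiced region ends
    if start is not None:
        segments.append((start, n * frame))
    return segments

# Demo: two tone bursts separated by silence.
sr = 8000
sig = np.zeros(sr)
sig[1000:3000] = np.sin(2 * np.pi * 440 * np.arange(2000) / sr)
sig[5000:7000] = np.sin(2 * np.pi * 440 * np.arange(2000) / sr)
print(segment_on_silence(sig, sr))   # two (start, end) sample ranges
```

For real recordings the threshold would need tuning per studio (the three recording environments here have different noise floors), which is one reason manual splicing was a reasonable choice.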

Ethics Statements
The readers were contacted and informed of the dataset collection process and use protocol. Permission was also sought from the head of the Mass Communication Department, Nile University of Nigeria.
Permission was also sought for the recordings done in broadcasting studios.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.