
Factorized Recurrent Neural Network with Attention for Language Identification and Content Detection

Published: 19 December 2023


Abstract

Language identification and content detection are essential for ensuring effective digital communication and content moderation. While extensive research has primarily focused on well-known and widely spoken languages, challenges persist when dealing with indigenous and resource-limited languages, especially among closely related languages such as the Ethiopian languages. This article aims to simultaneously identify the language of a given text and detect its content. To achieve this, we propose a novel attention-based recurrent neural network framework. The proposed method has an attention-embedded bidirectional LSTM architecture with two classifiers that identify the language of a given text and the content within the text. The two classifiers share a common feature space before branching into their task-specific layers, both of which are assisted by an attention mechanism. We use a dataset covering five different topics in six Ethiopian languages, consisting of 22,624 sentences. We compared our results with classical NLP techniques, and the proposed method shortens the data preprocessing steps. We evaluated the model performance using the accuracy metric, achieving results of 98.88% for language identification and 96.5% for text content detection. The dataset, source code, and pretrained model are available at https://github.com/bdu-birhanu/LID_TCD.


1 INTRODUCTION

Digitization, together with the advancement and widespread adoption of communication technology, has enabled the use of the Internet as a means of worldwide communication. It has brought a shift in society that will not be reverted. Thus, there is an ever-increasing demand for Natural Language Processing (NLP) applications to facilitate knowledge acquisition, extraction, organization, and filtering of documents [1]. Language identification and text content detection are core NLP applications that automatically recognize the language of a given text in a document and detect the content of the text in that language. Language identification is an important research area. It is often found as the first step in many NLP tasks such as machine translation [1], sentiment analysis [3, 17], information retrieval [10, 23], web search engines [19], text content classification [22], spelling and grammar checking [32], and text summarization [26]. These NLP applications are needed in multilingual environments and on social media platforms such as X (formerly known as Twitter) and Facebook. However, they do not work as expected, since tools trained for one language may not handle unknown languages unless suitable resources are developed. The challenge is greater for resource-limited languages, such as Ethiopian languages, which have limited NLP applications.

In Ethiopia, there are over 86 languages, divided into Semitic, Cushitic, Omotic, and Nilo-Saharan groups [21]. The languages are written with either Ge’ez/Amharic and/or Latin scripts. Of the 86 languages in Ethiopia, 41 are living languages at the institutional level, 14 are developing, 18 are vigorous, 8 are in danger, and 5 are near extinction [21]. None of these languages has a workable and efficient NLP tool that can be applied in industrial/commercial settings. Identifying the language of a small piece of text is an interesting problem in the Ethiopian language context. In this article, we propose a multi-task model that automatically determines the language of a given written text and detects its content. The emphasis is on six Ethiopian languages spoken in different regional states of the country, namely Amharic, Afan-Oromo, Tigrigna, Afar, Somali, and Awi. The text content in each language covers five topics: Agriculture, Sport, Health, Politics, and Religion.

Various approaches have been proposed and used for language identification, including detection based on character n-gram frequency [40], dictionary-based methods [16], and words as lexical representations [12]. In the literature, n-gram frequency is considered one of the most popular techniques for language identification. It uses letter n-grams to represent the frequency of occurrence of various n-letter combinations in a particular language. The frequency statistics of n-gram occurrences were used as features for many classical machine learning techniques, such as Naïve Bayes, Support Vector Machines (SVM), and Artificial Neural Networks (ANN), as discussed in [16, 38]. These techniques have been extensively used for tasks like language identification and text classification but have been surpassed by deep learning methods.

Despite their close relationship, conventional approaches to language identification and text content detection involve the separate training of two distinct models using traditional techniques such as n-gram features, Naive Bayes, and SVM classification algorithms. One significant limitation of these methods is their failure to consider long-term character relationships within text. Additionally, statistical approaches require labor-intensive preprocessing steps, adding complexity to the process. Moreover, developing two separate models for related tasks is time-consuming and does not fully capitalize on their shared characteristics.

Language identification can be applied in any language modality, such as speech, sign language, and printed or handwritten text. Language identification systems in all modalities are relevant since information can be indexed, processed, distributed, and stored from various sources. In this article, we limit the scope to language identification and text content detection based on Ethiopian language text data stored in a digitally encoded form. Table 1 shows sample text fragments in English and Ethiopian languages on the topics of Coronavirus (COVID-19) and football news, labeled according to the language in which they are written and the content of the information in the text. For example, without referring to the labels in Table 1, readers of this article will certainly have recognized at least one language in Table 1, and a few Ethiopian readers are likely to be able to identify some of the languages therein, as well as the content of the texts in each language.

Table 1. Sample segment texts taken from the news on different topics and languages.

The motivation behind proposing the factored attention-based network for language identification and content detection lies in addressing existing challenges. For well-known languages, many works have reported near-perfect accuracy for language identification, yet unsolved tasks remain [33]. NLP applications, including language identification, often yield impressive results but are limited to narrow domains and specific use cases. This limitation characterizes NLP as a challenging task. Moreover, there are other indigenous and resource-limited languages, such as Ethiopian languages, for which no well-developed NLP applications exist [8]. Our objective is, therefore, to design and develop a multi-task text-based language identification and text content detection model for Ethiopian languages, streamlining text preprocessing steps. This article presents the first attempt at Ethiopian language identification and text content detection, leveraging attention mechanisms and multi-task learning approaches to address these challenges effectively. This article’s contributions are summarized as follows:

We propose an attention-based language identification and text content detection framework for Ethiopian languages for the first time. This enables the model to attend to different parts of a text depending on the aspect of concern.

We design an attention-based language identification and text content detection model by leveraging a multi-task learning architecture that shares representations across related tasks for better generalization.

We propose a recurrent neural network model that minimizes a single joint loss, composed of the cross-entropy losses of the two classifiers, to enhance existing language identification and text content detection performance.

We propose a multi-task model that uses the same input data for both tasks and is trained jointly with a parameter-sharing strategy, which reduces the risk of overfitting for both tasks.

We propose an approach based on word embeddings to shorten the text preprocessing steps, which are a challenge for classical NLP model development.

We introduce a baseline dataset for Ethiopian language identification and text content detection as a shared task.

The rest of the article is organized as follows: Section 2 reviews relevant methods and related works. The proposed method, the details of the dataset, and the training strategies are described in Section 3. Section 4 presents the experiments, empirical analysis, and obtained results. Finally, conclusions and future work are described in the last section.


2 RELATED WORK

In this section, we present previous works related to language identification and text content detection. Attempts for Ethiopian languages, however, are very few and focus only on language identification, typically relying on frequently occurring words, which vary across languages and are limited in number. Various approaches have been used for developing language identification models, ranging from classical machine learning techniques to recent deep learning techniques. For local languages, recognized words in a given statement are weighted based on their ranks, and the language with the highest sum of word weights is selected. We start with research attempts that follow classical NLP approaches and then present recent findings based on deep learning techniques.

Language identification and content detection, specifically for native languages, have gained more attention in the past decades. Mustonen [36] proposed language identification based on multiple discriminant analysis with the ability to distinguish texts at the word level. The model was designed to differentiate between three languages: English, Swedish, and Finnish. The author compiled a list of linguistically motivated character-based features and trained the model on 300 words for each of the three target languages. From each language, 100 words were used to evaluate the language identifier, and 76% were correctly classified. However, such languages lack unified resources for collective evaluation, and in this work an attention-based model is used to improve performance on native or local language identification and text content detection tasks.

A statistical character-based n-gram model, which uses the most frequent 1- to 5-grams in a text, was proposed for newsgroup text classification in eight European languages [11]. Markov models [46] and kernel methods in SVMs [28] were applied to language identification, and part-of-speech (POS) correlation was used to determine whether texts were written in the same or different languages [25]. Bayesian classifiers were also employed to distinguish between European and Brazilian variants of tweets written in Portuguese, achieving 95% classification accuracy [29]. The same technique was also used for short-text identification on film subtitles in 22 languages and reported promising performance [42].

A language identification model dealing with short texts was proposed using linear regression to classify different Indian languages, including Hindi, Bengali, Marathi, Punjabi, Oriya, Telugu, Tamil, Malayalam, and Kannada [35]. Other language identification techniques were also proposed on multilingual X (Twitter) datasets with tweets in nine languages covering Cyrillic, Arabic, and Devanagari scripts [9].

Following the recent success of deep learning in both text- and image-based applications, multiple studies have been conducted to solve various problems in NLP and have reported groundbreaking performance. Among NLP tasks, attention-based networks have been intensively applied and have achieved strong results in neural machine translation [6, 31]. More recently, researchers have applied attention mechanisms to other research areas, where they have gained popularity across various NLP tasks, such as speech recognition [14], image captioning and visual question answering [4], sentiment analysis [41], text classification [44], and language identification [34].

Another very common training strategy in machine learning is multi-task learning, which allows multiple related tasks to be learned simultaneously in an efficient manner. It has also been applied successfully across multiple applications, from speech recognition [18] to computer vision [24], drug discovery [37], and natural language processing [15].

Many well-known languages have benefited, in one way or another, from the above-stated techniques and technological advancements. As a result, these languages have sophisticated NLP applications in commercial and industrial settings. However, few attempts have been made for the many resource-limited Ethiopian languages, which are still underrepresented in the field of NLP. According to Legesse [43] and Biruk [39], except for the attempts made by these two researchers, there is no published work on Ethiopian language identification. Although they reported considerably good language identification performance for a limited number of Ethiopian languages, their research did not consider factors that can determine identification performance, such as the amount and variety of training data. In addition, both researchers followed traditional approaches such as n-gram features, Naïve Bayes, and SVM classification algorithms. The main drawback of these statistical methods is their low efficiency when working with short texts and unseen words. Another disadvantage of n-gram-based statistical methods is that they ignore long-term relationships between characters occurring in a text. In addition, statistical methods require tedious preprocessing steps.

Inspired by the success of attention and multi-task learning in various NLP tasks, we integrate the capabilities of these techniques into a single framework that can be trained end to end. The proposed attention-based recurrent neural network framework, called FRNN, has two classifiers with shared layers at the lower stage and task-specific layers at the final stage. Both classifiers are trained jointly and can identify the language and detect the text content of a given written document.


3 RESEARCH METHODOLOGY

In general, a language identification and text content detection system starts with dataset collection and preprocessing, followed by model training. In this section, we first elaborate on the sources of the text data and the dataset preparation; then, the basic principles of multi-task learning and the attention mechanism are presented. The proposed architecture of the factored and attention-based RNN model, along with the training procedures, is detailed in Section 3.4.

3.1 Dataset

The shortage of training datasets is one of the main challenges in machine learning and pattern recognition, particularly in the development of reliable systems for indigenous and resource-limited languages in the field of NLP. Ethiopian languages, in particular, lack publicly available datasets for language identification and text content detection. Therefore, to address this issue, in this work we prepared our own Ethiopian text corpora for language identification and text content detection tasks. The dataset is now freely available at https://github.com/bdu-birhanu/LID_TCD.

Text data can be characterized by various features that describe the formality of the vocabulary used, grammatical structure, content, and length. Building a suitable dataset that covers many texts from various sources and topics is necessary for training and evaluating the model. The dataset we used to train and test the model consists of 22,624 text lines collected from different television (TV) news sources, Wikipedia, and institutional web pages. The dataset covers six Ethiopian languages that are widely used at the institutional level in the country and available electronically. Among these, Amharic is the working language in most regions of Ethiopia, and Awi is spoken in one zone of the Amhara Region; the other four languages are now considered additional working languages. The dataset also comprises five topics: Agriculture, Sport, Health, Religion, and Politics.

From the above sources, 22,624 text lines were prepared and annotated manually. In the manual annotation process, each text line is associated with its corresponding language and text content code. The manual annotation was conducted by three annotators, who were instructed to assign codes to text lines according to the language scripts and the text contents, based on the predefined groups of the TV news and/or institutional archives. We also asked the annotators to remove or ignore multilingual text lines and to assign a single label to each remaining line. During preprocessing, redundant white/blank spaces within the document are removed, and all uppercase text is converted to lowercase. Table 2 shows the distribution of annotations in our text corpora. The new corpora consist of text lines ranging in length from a single word to 17 words.
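
To make the preprocessing step concrete, the following is a minimal sketch of the whitespace and case normalization described above; the helper name and the sample string are illustrative and not taken from the released code.

```python
import re

def normalize_line(line: str) -> str:
    """Collapse redundant white/blank spaces and lowercase the text.
    (Case folding only affects Latin-script lines; Ge'ez script has no case.)"""
    line = re.sub(r"\s+", " ", line).strip()
    return line.lower()

print(normalize_line("  Example   NEWS Text  "))  # "example news text"
```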

Language      Agriculture   Health   Politics   Religion   Sport
Amharic       964           1,643    1,512      1,976      1,406
Afar          164           127      116        1,862      158
Awi           90            615      169        425        249
Afan-Oromo    112           500      633        2,418      826
Tigrigna      20            861      217        1,080      821
Somali        277           311      345        2,412      315

Table 2. Details of Text Corpora

3.2 Multi-task Learning

Multi-task learning (MTL) is a training paradigm in machine learning that learns and solves multiple tasks simultaneously while exploiting their commonalities and differences [18, 20]. MTL is inspired by biological learning in humans, where we often apply knowledge already acquired from related tasks when learning new tasks. Compared to training tasks separately, MTL improves learning efficiency and prediction accuracy for task-specific models.

MTL provides various benefits, such as implicit data augmentation, attention focusing, and eavesdropping. Suppose two tasks \(T_{1}\) and \(T_{2}\) have different noise patterns; training both tasks enables the model to learn a good representation. In addition, training multiple tasks helps the model focus its attention on what actually matters by differentiating between relevant and irrelevant features. The model can also learn some feature F through hints [2] from either task \(T_{1}\) or \(T_{2}\), depending on how complex the feature is to learn through each task (i.e., the model is allowed to eavesdrop). For example, if the feature F is easy to learn through task \(T_{1}\), then the model learns F through task \(T_{1}\).

MTL can follow different training strategies to improve the performance of all the tasks [5, 45]. In this work, we use a joint training strategy, one of the most commonly used training approaches in the MTL paradigm [7], where the model shares parameters across tasks and optimizes a single joint loss L, computed as (1) \(\begin{equation} L=\sum _{i=1}^{n} l_{i} , \end{equation}\) where \(l_{i}\) is the loss of task \(T_{i}\) and n is the total number of tasks involved in the training procedure.
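
As a worked illustration of Equation (1), the snippet below sums two per-task losses; in Keras, compiling a multi-output model with one loss per output (and default loss weights of 1) optimizes exactly this sum. The numeric values are illustrative.

```python
import tensorflow as tf

def joint_loss(per_task_losses):
    """Equation (1): the joint MTL loss L is the sum of the task losses l_i."""
    return tf.add_n(per_task_losses)

l_1 = tf.constant(0.42)  # illustrative loss of task T1
l_2 = tf.constant(0.17)  # illustrative loss of task T2
print(float(joint_loss([l_1, l_2])))  # about 0.59
```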

3.3 Attention-based Models

Attention is an attempt to implement the same action of selectively concentrating on a few relevant things while ignoring others. The attention mechanism emerged as an improvement over encoder-decoder-based neural machine translation systems in NLP [6, 31]. Later, this mechanism and/or its variants were used in other NLP applications, including speech recognition [14], image captioning [30], and visual question answering [4]. Common forms of attention are classified as Bahdanau attention [6] and Luong attention [31], where the intuition is to produce a weighted context vector that improves the hidden-layer representation.

3.4 The Proposed Model Architecture

The overall framework of the proposed approach is shown in Figure 1. In this architecture, we employ two bidirectional LSTM networks, together with an attention layer, shared by both classifiers, which then fork out into one fully connected layer with a softmax activation function for each classification branch: six neurons for language identification and five neurons for text content detection.

Fig. 1. The proposed factorized Bi-LSTM architecture. The framework consists of three main components: the encoder layer, together with the embedding layer, converts an input text into a sequence of fixed-length feature vectors; the attention layer produces aggregated information; and the classification layers are the top branched layers responsible for language identification and content detection, with six and five neurons corresponding to the number of classes for each task, respectively.

The word embedding framework is used as a dictionary that maps the integer index of each word to a real-valued vector in a high-dimensional space, where words with similar meanings are closer together while words with different meanings are farther apart. This is achieved by learning the vector representation of a word from the context in which it appears. An embedding layer takes as input a 2D tensor of integers of shape (samples, sequence length), where each entry is a sequence of integers. In this case, we use the Keras Embedding layer, which supports a supervised approach that learns a custom embedding while training the whole model. The weights are randomly initialized and then updated during training using the back-propagation algorithm, so the resulting word embedding is guided by the loss function. The embedding layer returns a 3D floating-point tensor of shape (samples, sequence length, embedding dimensionality). Such a 3D tensor can then be processed by the recurrent neural network layers.
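
As a concrete illustration of the tensor shapes involved, the sketch below passes a batch of integer-encoded sequences through a Keras Embedding layer; the vocabulary size, sequence length, and embedding dimensionality are illustrative assumptions, not values reported by the authors.

```python
import numpy as np
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 20000, 17, 128   # illustrative values

embedding = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# Input: 2D integer tensor of shape (samples, sequence_length).
batch = np.random.randint(1, vocab_size, size=(32, seq_len))

# Output: 3D float tensor of shape (samples, sequence_length, embedding_dim),
# ready to be consumed by the Bi-LSTM encoder; the weights are trained with
# the rest of the model via back-propagation.
vectors = embedding(batch)
print(vectors.shape)  # (32, 17, 128)
```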

The encoder module of our factored RNN model comprises two bidirectional LSTM networks that process an input sequence \(X=x_{1},\ldots ,x_{T_{x}}\) of length \(T_{x}\) from the embedding layer and map it to a higher-level feature representation \(H=h_{1},\ldots ,h_{T_{x}}\). The attention module takes the output features H of the encoder module and computes the aggregated information, the context vector, from all these hidden states using Equation (2), (2) \(\begin{equation} C=\sum _{i=1}^{T_{x}}a_{i}h_{i} , \end{equation}\) where C is the context vector and \(a_{i}\) is the attention score, which is computed as (3) \(\begin{equation} a_{i}= softmax(f(g(h_{i}))), \quad \mbox{for}\ i= 1,\ldots ,T_{x} . \end{equation}\) The functions f and g in Equation (3) denote feed-forward neural networks with ReLU and tanh activation functions that are stacked consecutively and trained together with the other parts of the model.
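
A minimal sketch of one way to realize Equations (2) and (3) as a custom Keras layer is shown below. The paper names only the stacked feed-forward layers (ReLU then tanh, per Table 3); the final scalar scoring projection and the layer name are assumptions added so that the softmax yields one weight per time step.

```python
import tensorflow as tf
from tensorflow.keras import layers

class AttentionContext(layers.Layer):
    """Sketch of Equations (2)-(3): score each hidden state with stacked
    FC+ReLU and FC+tanh layers, softmax over time, then take the weighted
    sum of the hidden states as the context vector C."""

    def __init__(self, relu_units=64, tanh_units=32, **kwargs):
        super().__init__(**kwargs)
        self.g = layers.Dense(relu_units, activation="relu")
        self.f = layers.Dense(tanh_units, activation="tanh")
        self.score = layers.Dense(1)  # assumed scalar projection before softmax

    def call(self, h):                       # h: (batch, T_x, hidden)
        e = self.score(self.f(self.g(h)))    # (batch, T_x, 1)
        a = tf.nn.softmax(e, axis=1)         # attention weights a_i over time
        return tf.reduce_sum(a * h, axis=1)  # context vector C, Equation (2)

# Usage: context = AttentionContext()(encoder_outputs)
```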

The computed context information is a 3D tensor, owing to the dimensionality of the word embedding layer. Therefore, we apply a Keras Flatten operation before feeding it to the task-specific layers. The final probability p(\(y|X\)) in both task-specific layers is computed using a softmax function. The network parameter details and the configuration of the proposed model are given in Table 3.

Network module   Network layer (type)   Hidden neuron size
Encoder          Bi-LSTM                128
                 Bi-LSTM                128
Attention        FC + ReLU              64
                 FC + tanh              32
                 Softmax
Classifier       FC1 + softmax          6
                 FC2 + softmax          5

Table 3. The Recurrent Network and Attention Layers of the Proposed Model with their Corresponding Parameter Values

  • The inputs of the Long Short-Term Memory (LSTM) network are vectors from the word embedding layers.

During training, the texts are tokenized, and each unique word is assigned an integer. These integer sequences are then passed through the word embedding layer, in which the integer index of each word is mapped to a vector in a high-dimensional space; words with similar meanings are represented by similar vectors. The output of the embedding layer is connected to the bidirectional LSTM networks, and the attention layer aggregates the outputs of the LSTMs. The aggregated information is reshaped and fed into the task-specific layers, each a feed-forward neural network with a softmax activation function. The classification branches have six neurons for language identification and five neurons for text content detection. The final output of each branch is determined by the softmax probability computed using Equation (6).
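
The pipeline above can be sketched end to end in Keras as follows. The vocabulary size, sequence length, and embedding dimensionality are illustrative; only the layer sizes from Table 3 and the class counts (six languages, five topics) come from the paper, and keeping the weighted hidden states in 3D before Flatten is one reading of the description above.

```python
from tensorflow.keras import layers, Model

vocab_size, seq_len, embed_dim = 20000, 17, 128      # illustrative values

inputs = layers.Input(shape=(seq_len,), dtype="int32")
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# Shared encoder: two stacked bidirectional LSTMs with 128 hidden units each.
x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Attention: FC+ReLU and FC+tanh (Table 3), softmax over the time axis,
# element-wise weighting of the hidden states, then Flatten.
u = layers.Dense(64, activation="relu")(h)
u = layers.Dense(32, activation="tanh")(u)
scores = layers.Dense(1)(u)                  # assumed scalar scoring step
weights = layers.Softmax(axis=1)(scores)
attended = layers.Multiply()([weights, h])
features = layers.Flatten()(attended)

# Task-specific branches: 6-way language identifier and 5-way content detector.
lid = layers.Dense(6, activation="softmax", name="lid")(features)
cd = layers.Dense(5, activation="softmax", name="cd")(features)

model = Model(inputs, [lid, cd])
model.summary()
```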

We propose jointly training the two tasks (language identifier and text content detector) by taking advantage of multi-task learning. In this way, during training, the parameters of the language identifier's task-specific layer are not affected by errors of the text content detector, and vice versa, while the parameters of the shared layers are updated by both tasks, since we optimize a single joint loss computed as (4) \(\begin{equation} loss_{joint}=l_{LID}+l_{CD} , \end{equation}\) where \(l_{LID}\) and \(l_{CD}\) denote the losses of the language identifier and the text content detector, respectively, each computed by Equation (5), (5) \(\begin{equation} loss=-\sum _{i}^{c} t_{i} \log (s_{i}) , \end{equation}\) where c is the number of classes, \(t_{i}\) is the ground truth of each class, and \(s_{i}\) is the softmax score of each class, calculated by Equation (6), (6) \(\begin{equation} s_{i}=\frac{\exp (z_{i})}{\sum _ {k=1}^{c} \exp (z_{k})} , \end{equation}\) where \(z_{i}\) is the pre-softmax score of class i produced by the corresponding task-specific layer.
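
For a single example, the worked sketch below evaluates Equations (4) through (6) with NumPy; the score vectors and ground-truth classes are illustrative, and z denotes the pre-softmax score of each class as in Equation (6).

```python
import numpy as np

def softmax(z):
    """Equation (6): class probabilities s from pre-softmax scores z."""
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(t, s):
    """Equation (5): loss = -sum_i t_i * log(s_i) for a one-hot target t."""
    return -np.sum(t * np.log(s))

s_lid = softmax(np.array([2.0, 0.1, -1.0, 0.3, 0.0, -0.5]))  # 6 language classes
s_cd = softmax(np.array([0.2, 1.5, -0.3, 0.1, 0.0]))         # 5 topic classes
t_lid = np.eye(6)[0]   # illustrative ground truth: language class 0
t_cd = np.eye(5)[1]    # illustrative ground truth: topic class 1

loss_joint = cross_entropy(t_lid, s_lid) + cross_entropy(t_cd, s_cd)  # Equation (4)
print(round(float(loss_joint), 4))
```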


4 EXPERIMENT AND RESULT DISCUSSION

Experiments are conducted using a dataset of six Ethiopian languages covering five topics, collected from TV news, Wikipedia, and institutional web pages. The proposed models are implemented with the Keras Application Programming Interface (API) on a TensorFlow back end [13]. We consider different network parameter values and tuning during experimentation. The results reported in this article are obtained using the Adam optimizer, chosen for its suitability for training on sparse data, its computational efficiency, and its low memory requirements [27]. We also use a dropout rate of 0.25 for the LSTM layers to avoid overfitting on the training set. Early stopping is employed to make the learning process more time-efficient.
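
A minimal sketch of this training configuration is given below; the early-stopping patience is an assumption (the paper does not report it), and the dropout of 0.25 is passed to the LSTM layers of the architecture sketch above.

```python
from tensorflow.keras import callbacks, layers, optimizers

optimizer = optimizers.Adam()   # default settings, as in Kingma and Ba [27]

early_stop = callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,                 # assumed patience; not reported in the paper
    restore_best_weights=True,
)

# Dropout of 0.25 applied inside the recurrent layers, for example:
lstm = layers.Bidirectional(layers.LSTM(128, return_sequences=True, dropout=0.25))
```

These objects would then be passed to model.compile(...) and model.fit(..., callbacks=[early_stop]) for the two-output model sketched earlier.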

Two bidirectional LSTM networks with a hidden size of 128 are employed as shared layers, followed by the attention module, which consists of two fully connected networks, and task-specific layers with a softmax activation function, one for each task. A Keras embedding layer is employed to map the integer index of each word to a real-valued vector in a high-dimensional space, where words with similar meanings receive similar vector representations. A Keras Flatten operation is applied to reshape the 3D output of the attention module into 2D before feeding it to the task-specific fully connected layers. Since we use batch training with a batch size of 32, all sequences in a batch must have the same length so that they can be packed into a single tensor; sequences that are shorter than the others are padded with zeros. We randomly split the data into training and test sets, using a total of 20,362 samples for training and 2,262 samples for testing. For validation, we use 20% of the selected training dataset. We evaluate the performance of our model for each task-specific classifier using the accuracy (A) metric, calculated as follows: (7) \(\begin{equation} A=\frac{ \#\ correctly\ predicted\ samples}{\#\ total\ number\ of\ samples } . \end{equation}\) The model’s overall accuracy is computed by averaging the classification accuracy of the two task-specific classifiers. Figure 2 presents a sample of randomly selected input texts with the corresponding ground-truth labels and model predictions across the two jointly learned tasks, using data from our new LICD dataset. Here, we would like to emphasize that we are not accountable for any errors in the ground-truth labels or interpretations of the contents in our experiment.
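
The batching and evaluation steps described above can be sketched as follows; the token sequences and labels are illustrative stand-ins, and scikit-learn's train_test_split is just one way to realize the random 90/10 split (the authors do not state which tool they used).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[12, 5, 87], [4, 990, 23, 6, 41], [7]]           # integer-encoded lines
padded = pad_sequences(sequences, maxlen=17, padding="post")  # zero-pad shorter lines
labels = np.array([0, 3, 1])                                  # illustrative labels

X_train, X_test, y_train, y_test = train_test_split(
    padded, labels, test_size=0.1, random_state=42)           # ~90/10 random split

def accuracy(y_true, y_pred):
    """Equation (7): correctly predicted samples over total samples."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

print(accuracy([0, 1, 1, 2], [0, 1, 2, 2]))  # 0.75
```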

Fig. 2. Sample randomly selected input texts with corresponding ground-truth labels, illustrating predictions for both language identification and content detection tasks.

Based on the results recorded during experimentation, 98.88% of the languages are identified correctly, 96.5% of the text contents are detected correctly, and the overall classification accuracy is 97.69%.

For text content detection, the proposed model works better on longer texts than on short texts, whereas text length has no significant effect on language identification performance. The generalization of the model on short texts is not as good as on longer sentences because short texts, often composed of only one or two words, are either neutral or lack contextual meaning.

A confusion matrix for the two tasks is presented in Figure 3. In this confusion matrix, the diagonal represents correctly predicted instances, while the off-diagonal entries correspond to wrongly predicted instances. As shown in the language identification confusion matrix, Amharic is frequently confused with Tigrigna, resulting in 8 instances of confusion. In contrast, in the content detection matrix, agriculture is often confused with health and sport.

Fig. 3. Confusion matrices for the Language Identification (left) and Content Detection (right) tasks.

Moreover, confusion between similar languages also occurred, in a different way, for the human annotators, as each annotator dealt only with languages from a specific region. Consequently, there is rarely confusion between languages written with different alphabets, because they usually belong to different language groups. Still, one of the most common errors in our dataset is confusion between Amharic, Tigrigna, and Awi, which use similar alphabets for scripting and are from the same language group.


5 CONCLUSION AND FUTURE WORK

In this article, we present a novel RNN-based approach for Ethiopian language identification and text content detection. Our proposed model is a multi-task attention architecture consisting of two classifiers: (i) a classifier responsible for identifying the language of a given written text, and (ii) a classifier responsible for detecting its content. The proposed framework, together with the newly annotated text corpora, enables us to study currently unresolved challenges and untouched issues of resource-limited Ethiopian languages. The model can also tackle challenges such as script similarity and the analysis of short texts to identify both the language and its content.

Our dataset includes six Ethiopian languages (Amharic, Tigrigna, Afan-Oromo, Awi, Afar, and Somali) widely used at the institutional level, covering five diverse topics (Agriculture, Sport, Health, Politics, and Religion). We evaluated the performance of the proposed model using a test set of 2,262 samples and achieved 98.88% language identification accuracy and 96.5% text content detection accuracy. The proposed model also requires fewer preprocessing steps. The dataset, source code, and pretrained model are publicly available at https://github.com/bdu-birhanu/LID_TCD. As part of future work, we will investigate how to better leverage more training data, incorporating additional Ethiopian languages and further text content from social media, to improve the proposed approach. In addition, the proposed model will be investigated, improved, and extended to spoken Ethiopian language identification and content detection.


REFERENCES

[1] Abate Solomon Teferra, Michael Melese, Martha Yifiru Tachbelie, Million Meshesha, Solomon Atinafu, Wondwossen Mulugeta, Yaregal Assabie, Hafte Abera, Binyam Ephrem Seyoum, Tewodros Abebe, et al. 2018. Parallel corpora for bi-directional statistical machine translation for seven Ethiopian language pairs. In Proceedings of the 1st Workshop on Linguistic Resources for Natural Language Processing. 83–90.
[2] Abu-Mostafa Yaser S. 1990. Learning from hints in neural networks. Journal of Complexity 6, 2 (1990), 192–198.
[3] Alharbi Ahmed Sulaiman M. and Elise de Doncker. 2019. Twitter sentiment analysis with a deep neural network: An enhanced approach using user behavioral information. Cognitive Systems Research 54 (2019), 50–61.
[4] Anderson Peter, He Xiaodong, Buehler Chris, Teney Damien, Johnson Mark, Gould Stephen, and Zhang Lei. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6077–6086.
[5] Argyriou Andreas, Evgeniou Theodoros, and Pontil Massimiliano. 2007. Multi-task feature learning. In Advances in Neural Information Processing Systems. 41–48.
[6] Bahdanau Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL]. Retrieved from https://arxiv.org/abs/1409.0473
[7] Belay Birhanu, Habtegebrial Tewodros, Liwicki Marcus, Belay Gebeyehu, and Stricker Didier. 2019. Factored convolutional neural network for Amharic character image recognition. In Proceedings of the 2019 IEEE International Conference on Image Processing. IEEE, 2906–2910.
[8] Belay Birhanu Hailu, Habtegebirial Tewodros, Liwicki Marcus, Belay Gebeyehu, and Stricker Didier. 2019. Amharic text image recognition: Database, algorithm, and analysis. In Proceedings of the 2019 International Conference on Document Analysis and Recognition. IEEE, 1268–1273.
[9] Bergsma Shane, McNamee Paul, Bagdouri Mossaab, Fink Clayton, and Wilson Theresa. 2012. Language identification for creating language-specific Twitter collections. In Proceedings of the 2nd Workshop on Language in Social Media. 65–74.
[10] Bounhas Ibrahim, Nadia Soudani, and Yahya Slimani. 2020. Building a morpho-semantic knowledge graph for Arabic information retrieval. Information Processing and Management 57, 6 (2020), 102124.
[11] Cavnar William B., John M. Trenkle, et al. 1994. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval. Las Vegas, 161–175.
[12] Chittaranjan Gokul, Vyas Yogarshi, Bali Kalika, and Choudhury Monojit. 2014. Word-level language identification using CRF: Code-switching shared task report of MSR India system. In Proceedings of the 1st Workshop on Computational Approaches to Code Switching. 73–79.
[13] Chollet François et al. 2018. Keras: The Python deep learning library. Astrophysics Source Code Library (2018), ascl-1806.
[14] Chorowski Jan K., Bahdanau Dzmitry, Serdyuk Dmitriy, Cho Kyunghyun, and Bengio Yoshua. 2015. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems. 577–585.
[15] Collobert Ronan and Weston Jason. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning. 160–167.
[16] Davidson Douglas R. and Ali Ozer. 2013. Automatic language identification for dynamic text processing. US Patent 8,464,150, issued June 11, 2013.
[17] Dehkharghani Rahim. 2019. SentiFars: A Persian polarity lexicon for sentiment analysis. ACM Transactions on Asian and Low-Resource Language Information Processing 19, 2 (2019), 1–12.
[18] Deng Li, Hinton Geoffrey, and Kingsbury Brian. 2013. New types of deep neural network learning for speech recognition and related applications: An overview. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8599–8603.
[19] Dittrich Sabrina, Weiss Zarah, Schröter Hannes, and Meurers Detmar. 2019. Integrating large-scale web data and curated corpus data in a search engine supporting German literacy education. In Proceedings of the 8th Workshop on Natural Language Processing for Computer Assisted Language Learning. Linköping University Electronic Press, 41–56.
[20] Duong Long, Cohn Trevor, Bird Steven, and Cook Paul. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 845–850.
[21] Eberhard David M., Simons Gary F., and Fennig Charles D. 2020. Ethnologue: Languages of the World (23rd ed.). SIL International, Dallas. Retrieved from https://www.ethnologue.com
[22] Galal Mohamed, Madbouly Magda M., and El-Zoghby Adel. 2019. Classifying Arabic text using deep learning. Journal of Theoretical and Applied Information Technology 97, 23 (2019), 3412–3422.
[23] Gashaw Ibrahim and Shashirekha H. L. 2019. Enhanced Amharic-Arabic cross-language information retrieval system using part of speech tagging. In Proceedings of the 2019 International Conference on Advances in Computing, Communication and Control. IEEE, 1–7.
[24] Girshick Ross. 2015. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision. 1440–1448.
[25] Grefenstette Gregory. 2001. Text summarization using part-of-speech. US Patent 6,289,304, issued September 11, 2001.
[26] Kanapala Ambedkar, Pal Sukomal, and Pamula Rajendra. 2019. Text summarization from legal documents: A survey. Artificial Intelligence Review 51, 3 (2019), 371–402.
[27] Kingma Diederik P. and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980 [cs.LG]. Retrieved from https://arxiv.org/abs/1412.6980
[28] Kruengkrai Canasai, Srichaivattana Prapass, Sornlertlamvanich Virach, and Isahara Hitoshi. 2005. Language identification based on string kernels. In Proceedings of the IEEE International Symposium on Communications and Information Technology. IEEE, 926–929.
[29] Laboreiro Gustavo, Bošnjak Matko, Sarmento Luís, Rodrigues Eduarda Mendes, and Oliveira Eugénio. 2013. Determining language variant in microblog messages. In Proceedings of the 28th Annual ACM Symposium on Applied Computing. 902–907.
[30] Li Linghui, Tang Sheng, Deng Lixi, Zhang Yongdong, and Tian Qi. 2017. Image caption with global-local attention. In Proceedings of the 31st AAAI Conference on Artificial Intelligence.
[31] Luong Minh-Thang, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv:1508.04025 [cs.CL]. Retrieved from https://arxiv.org/abs/1508.04025
[32] Madi Nora and Hend S. Al-Khalifa. 2018. A proposed Arabic grammatical error detection tool based on deep learning. Procedia Computer Science 142 (2018), 352–355.
[33] McNamee Paul. 2005. Language identification: A solved problem suitable for undergraduate instruction. Journal of Computing Sciences in Colleges 20, 3 (2005), 94–101.
[34] Miao Xiaoxiao, McLoughlin Ian, and Yan Yonghong. 2020. A new time-frequency attention tensor network for language identification. Circuits, Systems, and Signal Processing 39, 5 (2020), 2744–2758.
[35] Murthy Kavi Narayana and Kumar G. Bharadwaja. 2006. Language identification from small text samples. Journal of Quantitative Linguistics 13, 1 (2006), 57–80.
[36] Mustonen Seppo. 1965. Multiple discriminant analysis in linguistic problems. Statistical Methods in Linguistics 4 (1965), 37–44.
[37] Ramsundar Bharath, Steven Kearnes, Patrick Riley, Dale Webster, David Konerding, and Vijay Pande. 2015. Massively multitask networks for drug discovery. arXiv:1502.02072 [stat.ML]. Retrieved from https://arxiv.org/abs/1502.02072
[38] Rao K. Sreenivasa and Nandi Dipanjan. 2015. Language identification—a brief review. In Language Identification Using Excitation Source Features. Springer, 11–30.
[39] Tadesse Biruk. 2018. Automatic Identification of Major Ethiopian Languages. Master's thesis. School of Computing, Bahir Dar Institute of Technology, Bahir Dar, Ethiopia.
[40] Tromp Erik and Pechenizkiy Mykola. 2011. Graph-based n-gram language identification on short texts. In Proceedings of the 20th Machine Learning Conference of Belgium and the Netherlands. 27–34.
[41] Usama Mohd, Belal Ahmad, Enmin Song, M. Shamim Hossain, Mubarak Alrashoud, and Ghulam Muhammad. 2020. Attention-based sentiment analysis using convolutional and recurrent neural networks. Future Generation Computer Systems 113 (2020), 571–578.
[42] Winkelmolen Fela and Mascardi Viviana. 2011. Statistical language identification of short texts. In Proceedings of ICAART, Vol. 1. 498–503.
[43] Wodajo Legesse. 2014. Modeling Text Language Identification for Ethiopian Cushitic Languages. Master's thesis. HiLCoE School of Computer Science, Addis Ababa, Ethiopia.
[44] Xie Jinbao, Hou Yongjin, Wang Yujing, Wang Qingyan, Li Baiwei, Lv Shiwei, and Vorotnitsky Yury I. 2020. Chinese text classification based on attention mechanism and feature-enhanced fusion neural network. Computing 102, 3 (2020), 683–700.
[45] Zhang Yu and Qiang Yang. 2017. A survey on multi-task learning. arXiv:1707.08114 [cs.LG]. Retrieved from https://arxiv.org/abs/1707.08114
[46] Zissman Marc A. 1993. Automatic language identification using Gaussian mixture and hidden Markov models. In Proceedings of the 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 399–402.
