1 Introduction

In order to contribute to the development of a community, we should take into account its socio-cultural and economic context. In this paper, we propose a joint solution to the socio-cultural context of the Basque Country and the needs and characteristics of small and medium companies with limited staff and economic resources, which are one of the foundations of the economy of the area. The model is easily exportable to other similar environments that may appear in developing areas [1]. Thus, innovative technologies have to be adequately designed so that they suit the needs of the industrial fabric, and they can be integrated into real applications. For instance, those technologies should be designed including auto-changing and auto-learning abilities. These capabilities will allow systems to automatically learn from new data and to adapt its components to new conditions, thus increasing system robustness [1].

In the socio-cultural context of the Basque Country, the interest in multilingual systems [1,2,3,4,5,6] arises because there are three languages in coexistence (Basque, Spanish, and French). However, Basque is an under-resourced language and its situation is different depending on the geographical location: In the Spanish State, Basque and Spanish are both official languages, and the recovery of Basque is active with positive results; in the French State, Basque is not an official language and it shows a clear negative growth with a low number of speakers. In addition to this, there is a high cross-lingual interaction between them, even though the origins of Basque are completely different from the origins of the other two languages (Spanish and French). In this scenario, speakers tend to mix words and sentences that belong to the three languages in their discourse, and the acoustic interactions between the languages and the Basque dialects are a very interesting area of research [6]. Although some people speak the three languages like native speakers, people are more commonly fluent in only two of them, Basque–Spanish or Basque–French. This means that in their conversations there are mixed features of different languages, and for this reason cross-lingual ASR is necessary. Thus, the three languages have to be taken into account in order to develop an efficient ASR system, and it is essential to try to find new strategies for the development.

In this sense, audio information retrieval systems are increasingly used in different applications ranging from the extraction of musical information [7, 8], health-related applications [9, 10], to applications related to the acquisition of data in the mobile environment [11] or on the Internet of Things (IoT). New trends in the field of information and communications technologies are committed to humanize not only information, but also ways of access by means of new techniques such as concept-based semantic web, or access to information by more user-friendly interfaces [7]. In this research area, within the field of speech processing, multilingual automatic speech recognition (ASR) systems provide users with improved interaction by means of various languages. This increases comfort and naturalness because users can express themselves in their own language. Machine learning and neural computing-based approaches provide flexible solutions for complex systems [1].

In this context, this paper proposes an audio information management system (the adiUP system, available online: [12]) that can be directly applied to radio broadcasting and that has been tested using the trilingual (Basque, Spanish, and French) internet radio channel Info7 [13]. The system provides automatic tools to manage all the information included in the audio records by semantic knowledge. One of the objectives of the development is to allow to introduce new information resources in a more open and collaborative way, where the user is an active part of the system. In this trilingual environment with an under-resourced language, there are other constrains regarding limited resources (financial, material, human, and audio resources) which, in addition to the poor quality of the signal, increase the complexity of the development. The developed adiUP system is an ASR system for complex environments that fulfills these requirements with under-resourced languages. The proposed system has been built using universal design, so that it can provide barrier-free access for example for people with hearing loss, cognitive impairments, or deafness.

This document is organized as follows: Sect. 2 presents the environment and the requirements of the system. Section 3 describes the resources. Section 4 sets forth the design of the proposed system, its architecture and components. In Sect. 5, a new methodology for automatic selection of the system configuration is proposed. Section 6 comprises the experimentation and discussion. Finally, conclusions are drawn in Sect. 7.

2 Description of the environment and requirements of the system

In the field of speech processing, extraction and indexing of audio information from internet mass media is a research area of interest to the international scientific community [14, 15]. The complexity of this task is closely related to factors such as the type of oral communication (isolated words, continuous speech, spontaneous speech, dialogue), the quality of the signal, the environment in which recognition takes place, the amount of terminology, the ability to generate confusion, language, and speakers [16]. There is a wide variety of solutions that addresses these problems in different ways, ranging from the detection of large vocabularies [17], through the detection of spoken numbers for telephone applications [18], to the detection of segments, spoken or not spoken [19].

When working in complex environments with limited amount of data, multilingual contexts, nonlinearities, or uncontrollable noise, some possibilities are based on: enriching poor resources of a language with resources from another powerful language beside it, approaches oriented to the lack of resources, cross-lingual approaches [20], training of acoustic models for a new language using results from other languages [21], data optimization methods, collaborative systems, or open configuration systems. However, the development of a robust ASR system is very tough when there are under-resourced languages involved, even if there are powerful languages beside them, and the classic techniques perform poorly with regard to correct rates [22, 23]. In fact, the statistical nature of most of the approaches used in ASR systems requires large amounts of language resources in order to perform properly. In the case of under-resourced languages, the availability of resources is very limited, most of which are of very poor quality. Therefore, nowadays there is an increasing interest in multilingual systems and under-resourced languages. There are interesting applications focused on these fields such as Babel [24] and meetings and publications whose main goal is to search for new methodologies oriented to this kind of complex environments [25,26,27]. Current research lines include: proposals for generating language resources; acoustic and language modeling; multilingual approaches; or management of audio information [28, 29].

Within this complex environment, our proposal aims to: automatically generate an optimal corpus from an initial corpus; a corpus is said to be optimal when the recognition results measured by means of cross-validation methods are optimal for the task; reduce the dimensionality of data through the convergence of redundant information by means of principal components analysis (PCA); estimate the accuracy of the system by means of the leave-one-out cross-validation technique (LOOCV); reduce unwished effects in sublexical units (SLU) due to the lack of resources and the low number of samples by choosing properly sublexical units and possible groupings of them; and extract information by means to folksonomies.

In this context, the requirements and novelty of the system lie in the complex environment in which it has to work, that is defined by:

  • The trilingual context that comprises the three languages of the Basque Country (Basque, Spanish, and French), which requires the design of the first multilingual ASR system for these three languages.

  • The regular appearance of cross-lingual elements between the three languages.

  • The fact that within the geographical area of this research project, the variety of the Basque language in the French region is in a critical under-resourced situation because it is not an official language in the French State.

  • The limited resources (financial, material, human, and audio resources), specially the audio corpora for training is only about 0.5–1% of the usual corpora for such systems, for example, in [14] and [30] they use 100–200 h for training, while in our system we use only 1 h.

  • The extremely poor quality of the audio signal, taken from an internet radio channel, which increases the complexity of the development.

3 Materials

In this section, the resources for the development of the system are presented. Furthermore, the main linguistic features are analyzed because they have a clear impact both on the performance of the acoustic phonetic decoding (APD) system, and on the size of the vocabulary of the system.

3.1 Phonetic features of the languages

The three languages involved in our study are: Basque, Spanish, and French. Basque is a Pre-Indo-European language of unknown origin and has circa 1,000,000 speakers in the Basque Country, which spreads over the international border between France and Spain. The Basque language has a wide range of dialects, there are six main dialects and several variations, and this dialectal variety entails phonetic, phonologic, and morphologic differences. In order to develop the APD system, a sound inventory of each language is necessary. Table 1 summarizes the sound inventories for the three languages, expressed in the eXtended Speech Assessment Methods Phonetic Alphabet (X-SAMPA) notation, and the usage of phonemes in the three cases.

Table 1 Sound inventories in the X-SAMPA

3.2 Lexical structure

A further challenge for developing an ASR system is that Basque is an agglutinative language with a special intra-word morphosyntactic structure [11] that may, depending on the complexity of the task, lead to intractable vocabularies. Inside Basque words, there is not only semantic information, but also grammatical elements as it can be seen in Table 2.

Table 2 Examples of the agglutinative structure of the Basque language

A plausible approach to the problem would be to use lemmas and morphemes instead of words when defining the system vocabulary [11]. However, this could lead to a recognition problem for the shortest morphemes, especially those that are not lemmas, for instance: “ak,” “ko,” “go,” “k”, etc. In this project, a robust proposal based on lemmas and pseudo-morphemes was used [31].

3.3 Cross-lingual effects

Most speakers in the Basque Country are bilingual, and they commonly mix two of the three languages in their speech, particularly in spontaneous speech. The two languages that are mixed depend on the region of the country where the speaker lives: most Basque speakers living in the Spanish side also use Spanish, while Basque speakers in the French side also use French. Moreover, mixing all the three languages also occurs. Indeed, the acoustic interactions between these three languages with the addition of Basque dialects are very strong, because speakers naturally and spontaneously mix sounds and vocabulary, and sometimes they also add other influences, such as English. Some speakers are able to use the three languages consecutively in the same sentence with native pronunciation. All the resources contain numerous instances of cross-lingual material at words, sentences, and pronunciation levels (Tables 1 and 3). The strongest effects appear in Spanish and French recordings, which unfortunately also have the highest background noise level. The inventory of allophones for the languages of the system and the matches between each of them are shown in Table 1 [32].

Table 3 Examples of cross-lingual appearance in the language resources

3.4 Resources and materials

The basic audio resources used in this project have been mainly provided by a small local news radio channel, Info7 [13], that is trilingual (Basque-BS-, Spanish-SP-, and French-FR-). It has provided the audio and text data from their news bulletins for each language (semi-parallel corpus). The texts have been processed to create XML files which include information of different speakers, noises, and parts of the speech files and transcriptions. The transcriptions for the Basque language also include morphological information such as the lemma of each word and Part-Of-Speech tag.

In order to correctly implement an ASR system, it is crucial to design and obtain appropriate linguistic resources. A speech corpus is a collection of audio recordings tagged at different levels, which contains phrases, words, and common expressions of a certain language. This type of corpus is a database that stores implicitly various properties of the language, and this information lays the foundation for building voice recognition systems.

When only limited resources are available, the appropriate choice of the training corpus is a fundamental part of the design of the application. This is because on the one hand, depending on the circumstances a larger corpus does not imply better performance results [33], and on the other hand, the errors within the training corpus cause deterioration in the performance of the recognition system, or in other words, recordings with errors at various levels can cause the results not to be appropriate. The methodology for choosing optimal corpus is called HobeCor. Thanks to the HobeCor system, files that contain extreme anomalies can be automatically removed, thus decreasing the number of errors in the recognition process, for example: if an extreme noise interrupts communication, which causes an extreme distortion of the signal that prevents readability; if background music covers the speech; if the audio file is empty although there is a phonetic transcription, etc. The HobeCor system optimizes the recognition results, even after removing part of the corpus. Therefore, it makes possible to reduce even more the training corpus and to balance the corpuses of the different languages with regard to time duration. The HobeCor system has only been applied to speech segments.

The resources inventory is summarized in Table 4, which shows the duration of the overall corpus provided by Info7 radio channel (second column, audio), and the duration of the speech segments (third column, SSG). There is a difference between them because in the broadcasted signal there are also segments that only contain music. The third column of Table 4 (SSG used for training) shows the time features of the training corpus used in this project after using the HobeCor system.

Table 4 Summary of the resources inventory, audio, and speech segments (SSG)

In the audio for French and Spanish, there is a high background noise due to signature tunes, which can be noted in Fig. 1 that shows NIST signal-to-noise ratio (SNR) and the waveform amplitude distribution analysis (WADA) SNR of speech signals with regard to the signal length [34, 35].

Fig. 1
figure 1

NIST SNR and WADA SNR of speech signals with regard to the signal length

4 Development of the proposed system

The adiUP multilingual tool [30] developed for the Info7 internet radio channel provides users with information such as identification of language, subject of speech, terms of interest, or duration of audio files in news broadcasted in Basque, Spanish, and French. Thus, by reading the information added to the audio, the user can select topics of interest, or search among the audio files within the radio repository by subject, keywords, or language. In Sect. 4.1, the architecture of the system is analyzed, and in the following subsections the components of the system are described: folksonomies, sublexical units and acoustic phonetic models, and lexical units (LU).

4.1 Architecture of the system

The architecture of the adiUP system is based on three layers: the interface layer or user interaction layer, the domain layer, and the database layer, as it is outlined in Fig. 2.

Fig. 2
figure 2

Architecture of the adiUP system

4.1.1 Interface layer

This layer, labeled as 1 in Fig. 2, comprises all aspects of software and design related to interfaces for different kinds of users. In addition to the super-user, there are two types of users that can access the application: external users and system-management collaborators:

  • Super-user This user is responsible for the supervision and control of the whole system, manages external users and system-management collaborators, and grants permissions.

  • External users The so-called external users only use the application for information retrieval, but they cannot insert or modify any feature of the resources of the adiUP system.

  • System-management collaborators These special external users are not necessarily experts. They are part of a collaboration network of the radio channel intended to improve and evolve the initial prototype, by adding new information and knowledge. These members are actively involved in the adiUP system and have the potential to enrich the resources of the system by inserting new audio files with related information (such as audio data, the name of the speaker, or the language), and by correcting errors at different stages. These active users can influence and improve both the results of the ASR module and the categorization of concepts of the audio files.

4.1.2 Domain layer

The domain layer consists of different components which provide all the information to fill the database of the system. This layer includes the Language Identification (LID) tool [32] that uses a hybrid system based on triphonemes and cross-lingual approaches which achieves an optimal and stable LID recognition rate despite the complexity of the problem. The domain layer is divided into some modules related to: the semantic management of audio information, the morpheme extraction task, the characteristics of folksonomies (categorization and extra information provided by the super-user), and the decision process. The modules of this layer are labeled as 2 in Fig. 2:

  • Multilingual ASR Audio files inserted through the interface are processed by a multilingual ASR system. The engine is a keyword-spotting system based on concepts (one word, several words, or slogans) and fillers of different types. The aim of the fillers is to absorb the words out of vocabulary (OOV) according to folksonomies.

  • Morphological analysis The output provided by the recognition system is analyzed morphologically in order to extract lemmas, which are used for concept specification in the folksonomy. The morphological analyzer has been developed by a partner company, Insima Teknologia [36], and it is multilingual: English, French, Spanish, and Basque.

  • Categorization Three folksonomies have been inserted in the adiUP system, one for each language, and concepts and terms have been grouped into superclasses, classes, and subclasses. The results of the morphological analyzer are the input to the categorization module, which calculates a value that defines their membership to a class.

  • Decision system In this module, the final assignment of concepts to the audio file is done by using a decision equation.

  • Additional information Extra information supplied by the super-user is inserted through this module.

4.1.3 Database layer

The database contains all the information of the application, which is related to users, audio, concepts, and conceptual information. This information is divided into tables and developed in the management system MySQL. This layer is labeled as 3 in Fig. 2. In order to index the results of the domain layer in the database, there is also a step called Indexing that transmits the information with the results between the domain layer and the database layer. Note that it is important to supply the system with a high flexibility level, so that the application can provide a powerful way to generate a network of collaborators, who will assist in improving the initial system. The parts that may be changed are the application dictionary, the expansion and creation of folksonomies, the categorization method, and the thresholds of thematic decision.

4.2 Folksonomies

For each language, a folksonomy has been defined. Figure 3 shows an example of the structure of these folksonomies.

Fig. 3
figure 3

Example of the structure of the folksonomy. Showing only some of the concepts (culture, politics, international, …), superclasses (sports, arts, …), classes (local sports, general…), etc

4.2.1 Components of the folksonomy

The folksonomy of each language consist of:

  • Concepts The concepts defined in the folksonomy of each language are politics, international, culture, general, sports, and headlines.

  • Superclasses, classes, and subclasses One concept, for example politics, is subdivided into superclasses which are more specific, for example Basque Country and State. Superclasses are divided into classes as well, and subsequently, classes are divided into subclasses. Each level provides more accurate information.

  • Instances Instances are the last elements of the folksonomy, and they are words, lemmas, or terms made of groups of words.

  • Attributes The attribute is a numeric value that indicates the relevance or weight of a subclass within a certain class. In general terms, there is a single attribute for each subclass. However, some subclasses can be shared by two or more classes. The relevance or weight (attribute) given to the subclass within each class makes the difference.

  • Axioms An axiom is a parameter that takes into account the relevance and the number of instances of each class that appears in a certain text. Thus, the class where a certain text belongs to can be automatically chosen by comparing the different axioms of the text computed for each class.

4.2.2 Construction of the folksonomy

In order to determine the class where a certain text belongs to, the following procedure is followed, where several variables have been defined:


WTc It is the total weight for a class c. For each class, this value is calculated by adding the weights of the n instances in the text that belong to that class:

$$WT_{C} = \mathop \sum \limits_{J = 1}^{n} w_{{jsch_{v} }}$$
(1)

where \(w_{{jsch_{v} }}\) is the weight of the instance j within a subclass s, that belongs to that particular class c, that can also belong to a superclass h, which can belong to a particular concept ν as well. The value of the weight ranges from 1 (the lowest relevance) to 5 (the highest relevance).


WRc The so-called ratified weight for a certain class c is a numerical value which takes into account the relevance of the subclasses that belong to that class. Within each class, there are defined five weight values Wcr, r = 1,…,5, where Wc5 represents the weight of the most relevant subclasses, and \(W_{c1}\) represents the weight of the less relevant subclasses. These parameters (Wcr,) depend on the subjectivity and experience of the creator of the folksonomy. For each class, the value of the ratified weight WRc is calculated as the sum of these five weights, this is:

$$WR_{C} = \mathop \sum \limits_{r = 1}^{5} W_{cr}$$
(2)

Wc It is the final weight value or axiom of the text that is computed for each class c by multiplying the total weight by the ratified weight:

$$W_{C} = WT_{C} \cdot WR_{C}$$
(3)

Finally, the class where the text belongs to is the one that fulfills the following optimization criterion:

$$C_{t} = \mathop {\hbox{max} }\limits_{c} \{ W_{C} \}$$
(4)

4.3 Modeling of sublexical units

Although usually an expert defines the set of SLUs, this method becomes very complex when the application is multilingual, resources are limited, or in shortage or under-resourced conditions Fig. 4. The lack of resources produces unwished effects in SLUs with very few samples. In consequence, it is necessary to define new hidden Markov model (HMM) topologies that can optimize the internal structure of the different sounds of the language. Two different HMM structure configurations have been tested:

Fig. 4
figure 4

General diagram of adiUP system

  1. 1.

    In the first configuration, the HMMs have the same number of states (NS) for all the SLUs (NS-X, where X is the number of states used).

  2. 2.

    In the second configuration, the effect of assigning different number of states to each SLU is analyzed, where the selection of the number of states depends on the nature of the allophones (a similar approach can be found in [37]) as it is shown in Table 5. There are also groups of SLUs defined according to the sound class for the three languages. In some cases, certain allophones are grouped (joining acoustic models) by considering that different allophones are the same sound unit. Table 6 shows for each language: Basque-BS, Spanish-SP, and French-FR, the different groups taken into account for each of the languages under study (language is stated in the first column). The second column shows the name of the group type, and the third column describes the groups of allophones and their nomenclature. For example, /i/ = /i/+/j/ represents that a new /i/ SLU is created by joining the /i/ vowel model and the /j/ semi-vowel model. Thus, a new acoustic model /i/ will be trained with speech samples of the previous /i/ and /j/.

    Table 5 Examples of HMM topologies for the SLUs of the three languages with different topologies and number of states (NS)
    Table 6 Description of examples of groups of SLUs for the three languages

Finally, sets of triphonemes are created for all the selected allophonic options, and discrete and semicontinuous HMM are generated [38]. These triphonemes will be used in the acoustic phonetic decoding system as SLUs.

4.4 Modeling of lexical units

The adiUP system uses LUs both for creating the vocabulary for the ASR engine and for other elements of the system, such as the components of the folksonomy. These LUs are selected taking into account the nature of each language. In the case of the Basque language, the fundamental LU will be the lemma or a combination of lemmas, based on the pseudo-morphemes proposed in [31]. In the case of Spanish and French, the fundamental LU will be the word (simple or compound).

4.5 Modeling of fillers

In a keyword-spotting system, the presence of words out of vocabulary (OOV) in the speech signal during the recognition process can produce unwanted performance results. One of the classic methods for solving this problem consists in using filler models to absorb this part of unwanted signal. Furthermore, these fillers can provide valuable information for the construction of hypothetical new vocabulary words or in vocabulary. This approach is really interesting in the case of under-resourced languages because there is a lack of words in vocabulary. On the one hand, the adiUP system tests fillers of type: Allophonic, allophone, triphoneme, etc.; Syllabic, syllables or combinations; SLUs, most frequent sublexicals or subwords based on words or lemmas, and morphemes of the language. On the other hand, in the recognition process a parameter is used in order to adjust word insertion penalty (WIP) and filler insertion penalty (FIP). The folksonomy completes the recognition process by extracting relevant semantic information in the form of concepts or conceptual phrases built from lexical unit. Moreover, the strong cross-lingual effect requires the creation of folksonomy elements which integrate concepts of the three languages.

5 Evolutionary and adaptive methodology for automatic selection of the system configuration

5.1 Analysis

The characteristics of a system determine its performance and its ability to evolve and improve, and to become more general in the long term. Nevertheless, the definition of these characteristics becomes tedious when multiple possibilities must be tested in order to reach optimal configurations Fig. 4. This task is crucial in the case of complex environments where, due to minimal resources, experimental tests cannot objectively measure the performance of the system in the future. New data will be available during the auto-learning process in the collaborative network, and the system components could change, for instance, the SLUs, folksonomies, and LUs, which could lead to an increase in the robustness and performance of the system. Optimization methods for automatic selection of the system configuration provide a powerful tool that can forecast long-term changes. The adiUP system allows multiple system configuration options that could be generated by changing the components of the systems described in Table 7.

Table 7 Inventory of components involved in the system configuration

In our case, for the three languages there are about 6000 possible system configurations. Therefore, our goal is to automatically find optimal combinations of parameters (optimal configurations), while ensuring the quality of the system. In the literature, phone error rate (PER) and word error rate (WER) and word correct rate (WCR) have usually been used in order to assess the quality of ASR systems.

$${\text{PER/WER}} = \frac{S + I + D}{N} \cdot 100$$
(5)
$${\text{WCR}} = \frac{C}{N} \cdot 100$$
(6)

where S is the number of incorrect words substituted, I is the number of extra words inserted, D is the number of words deleted, C the number of correct words, and N is the total number of words in the correct transcription.

However, these measures strongly depend on the vocabulary and they only analyze some features of recognition, without measuring any features related to the quality of unit partitions, the mutual interaction between units, combined semantic quality features, or a combination of these factors. Therefore, there is not a comprehensive quality measurement of the system that would be very useful in the future, when there could be significant changes in the original components of the system. In this paper, we propose a new method for the selection and ranking of the features of the system in order to adjust the performance of the system, which is already in use in other fields [1, 39, 40], where we take into account not only the components of the system, but also their integration.

5.2 Methodology

The proposed methodology based on optimization algorithms, fuzzy indices, PCA, ANN and SVM is described below:

  1. 1.

    In a first stage, the quality of the sets of SLUs is ensured by an automatic selection based on several indices which analyze the quality of unit partitions, the mutual interaction between units, false positive and true negative rates, entropy, and similarities. More than 80 external and fuzzy indices have been analyzed, and the best eight are chosen for the selection process described in [41]. In this reference, the optimum ones are selected: IR, ISTL, IMAC, IIMN, IMN, IKAPPA, IHP, and IPE.

  2. 2.

    Then, for each of the combinations of configuration parameters, twelve system performance rates are calculated (described in Table 8). These rates are relative to the APD, the adiUP system, the lexicons, and the folksonomies.

    Table 8 System performance rates
  3. 3.

    During the second stage, an expert classifies all configurations in five levels of quality with regard to an Objective Function which consists in a linear combination of the aforementioned rates (Table 8). Several rates of the classes of the system are used in order to automatically classify the combinations of configuration parameters, because they play a key role according to the experts and the staff of the Info7 radio channel. Finally, an automatic classification by using the so-called decision tree (DT) machine learning algorithm [42] is carried out. This classification is used as a reference.

In ANN case, multi-layer perceptron (MLP) with neuron number in hidden layer (NNHL) = (attribute number + classes/2) and training step (TS) NNHL*10. In SVM case, PolyKernel option has been used. For the training and validation steps, we used k-fold cross-validation with k = 10. Cross-validation is a robust validation for variable selection [1].

  1. 4.

    The third stage consists in the automatic selection of rates in order to get the optimal set of parameters. The most relevant features are selected by means of attribute selection algorithms, which are available in the WEKA Program [43]. In general, these algorithms can be classified by several criteria: filters or wrappers, and analysis of each attribute or group of attributes [44]. Then, in order to determine the most influential factors in the process (ASR or performance of the adiUP application), the rates are sorted according to the number of times that they are chosen by these selection methods, and a ranking of the system performance rates is obtained taking into account their repetitiveness in the appearance in the WEKA selection methods.

  2. 5.

    Once the ranking is obtained, the system performance rates outside the ranking are discarded, and the classification is repeated by using the DT machine learning algorithm in order to analyze significant changes in the results, and overall quality of the system.

  3. 6.

    Then the number of features (rates) is reduced by applying the principal component analysis (PCA) algorithm to the ten selected rates, and the n-best solutions for the configuration are selected by employing an Objective Function which is a linear combination of the PCA components.

  4. 7.

    In this step, artificial neural networks (ANN) and support vector machines (SVM) are used for the final automatic model selection.

  5. 8.

    Finally, a group of experts (developers, linguists, and staff of the Info7 radio channel) supervises and selects the best configuration for each language from the n-best solutions provided by the previous stage.

6 Experimentation and discussion

The main aim of the experiments is to improve the performance of the designed system using the current resources, but keeping in mind the possibility of incorporating new resources in the future. The experimentation is divided into two tasks: the automatic selection of the system configuration by using the methodology described in Sect. 5; and the analysis of usability with the Info7 internet radio channel in order to assess the performance of the system under the design requirements.

6.1 Automatic configuration and performance of the system

Prior to evaluating the system by users, it is necessary to select the optimal configuration of the system, that is, to choose the best set of parameters in order to optimize the system performance. These parameters include the optimal unit set, the folksonomy, the topology, the LUs, the SLUs, the number of Gaussians, the word insertion penalty, the fillers, and the acoustic models, among others. The selection process is carried out by using the methodology described in Sect. 5. The parameters of the configuration of the system have been automatically selected and tuned by using 1000 sentences from recordings of the Info7 radio channel [13], and the validation method was a 10-fold cross-validation. In some cases, for the selection of SLUs the leave-one-out cross-validation technique has been used (LOOCV).

6.1.1 Automatic selection of sublexical units

In the first stage, the optimal sets of SLUs are selected from the proposals described in Sect. 4. All parameters are also selected pursuing optimal performance of the system under noise conditions. Semicontinuous hidden Markov models with different topologies were used. The input signal is transformed and characterized with a set of 13 Mel Frequency Cepstral Coefficients (MFCC), energy and their dynamic components, by taking into account the high noise level of the signal (42 features). The frame period is 10 ms, the FFT uses a Hamming window, and the signal has first-order pre-emphasis applied using a coefficient of 0.97. The filter-bank has 26 channels. By using the methodology described in Sect. 5.2 and also voice activity detection (VAD) methodologies [45], the most robust sets of SLUs are selected for the three languages among 10,000 options. These best sets are based on SC-HMM and triphonemes.

6.1.2 Automatic selection of the system configuration

Then, the combinations of system configuration parameters are generated. There are about 6000 different combinations for the three languages of the adiUP system. The optimal set of parameters is obtained by using the methodology described in Sect. 5. That is, for each one of the combinations of configuration parameters, the twelve system performance rates described in Sect. 5 are calculated (Table 8). Afterward, an expert classifies all the configurations in five levels of quality with regard to the radio objectives, and the classification reference is calculated. The ranking of rates is carried out by using the DT machine learning algorithm, and then, the attribute selection algorithms yield Table 9 that shows the ten rates that have the greatest effect on the classification.

Table 9 Rates that have the greatest effect on the classification, according to the attribute-selection algorithms

The other system performance rates are discarded, and the classification is repeated by using the DT machine learning algorithm without significant changes in the results, so we decided to keep using only the best rates.

In the next stage, new combinations of rates are generated by PCA, and the number of criteria is reduced. The result of the reduction by PCA is the creation of five groups that integrate different performance features. These groups are shown in Table 10. The group PC1 comprises information related to the WER of LUs; PC2 describes the general performance of the adiUP system; PC3 is related to the performance of SLUs; PC4 defines the performance of words; PC5 measures the performance of key concepts.

Table 10 Results of the PCA algorithm

For each language, an Objective Function is created by a linear combination of PCA components, and the six best solutions are selected. Afterward, an automatic selection by ANNs and SVM is carried out and the best selected solutions are supervised by a group of experts according to their performance on: traditional classification paradigms; experiments with semicontinuous HMM (SC-HMM) ASR systems; experiments with the adiUP system; and industrial interests. Finally, the optimal solution is implemented in the adiUP system. Table 11 shows the results of the success rate in percentage for three of the best options for each language (Basque-BS, Spanish-SP, and French-FR) using the chosen system configuration parameters selected from the 6000 possible options available at the beginning. Several proposals for the HMM structure were tested, not only by using different model topologies, but also with different numbers of Gaussians, and with different word insertion penalties. The features shown in Table 11 are the topology (as described in Table 5), or NS-X if the same number of states—X—has been used for all the SLUs); the type of SLU (as described in Tables 6 and 7); the number of Gaussians (NG); the value of the WIP parameter, and the type of filler in use (as described in Table 7); the WCR and the WER of the APD system, of the key concepts, and of the words (as described in Table 8); and the correct rates for classes (Cl) and concepts (Co) of the adiUP system (as described in Table 8). The best rates are marked in bold in Table 11.

Table 11 Three of the best configuration options selected for each language and their system performance % rates

It can be seen that the best results were obtained for Basque. Although Spanish has weaker acoustic models, the success rates in the adiUP system for words and concepts are good. The worst results are obtained for French, as it was expected due to the cross-lingual influence from the other two languages (French is not the native language of the radio-speaker). Note that the models that have a lot of states need fewer Gaussians, and in general they need fewer unit insertions because their configuration is appropriately adjusted. In some cases, the WER value of the APD system is very high, but this weakness is compensated by the folksonomy, achieving reasonable rates for the adiUP system and acceptable WER values for key concepts.

6.1.3 Evaluation oriented to users

An expert supervises the proposed solutions, and selects the final combination of configuration parameters for each of the three languages. Table 12 shows for each language the final configuration selected for the adiUP system.

Table 12 Configurations implemented in the adiUP system

The assessment of the final performance of the system was carried out both automatically by using the tags created during the analysis stage and manually by experts, mainly journalists. Figure 5 shows the results obtained from tests in the laboratory. These results of the performance of the adiUP system in terms of concepts (Co) and classes (Cl) fulfill the requirements defined by the clients.

Fig. 5
figure 5

Performance of the final system in terms of concept correct rate (adiUP-Co) and class correct rate (adiUP-Cl)

6.2 Usability

Finally, in order to evaluate the usability of the system, it was also tested by people not familiarized with this kind of systems. Tests have been carried out with four types of users: One person that uses the system as a super-user; three expert workers of the internet radio channel; three collaborators of the project; and three external real users of the system. These people have used the system and have subjectively graded it with regard to: the interface of the application; the answers provided by the search engine; the updating of the system; the flexibility of the system; and the general information provided by the system. Real users have only been inquired about this last general issue. The subjective parameters represent the degree of satisfaction with regard to the outcome of the system and the interface. This satisfaction was measured in a range of 0–10, where 0 is the lowest satisfaction level, and 10 is the highest satisfaction level with regard to the application performance: interface, perplexity, level of confusion, information, search usefulness, updating-adaptation, and information level. The tests and usability heuristics were designed by following the methodology described in [46,47,48]. Figure 6 summarizes the results of the tests that fulfill the design requirements.

Fig. 6
figure 6

Summary of the results of the usability tests for each user profile

7 Concluding remarks

Neural computing-based approaches provide flexible solutions for complex systems. This paper presents the design and development of the adiUP system, a multilingual audio information management system in complex environments based on a scalable architecture of automatic methodologies for semantic knowledge analysis. The neural computing-based approaches integrate: artificial neural networks (ANN), support vector machines (SVM), and hidden Markov models (HMM). The complex environment of internet mass media is given by: the trilingual context (Basque, Spanish, and French); the regular appearance of cross-lingual elements between the three languages; the under-resourced situation of Basque because the Basque language in the French State is not an official language and has a low number of speakers; the limited resources (financial, material, human, and audio resources), specially the audio corpora for training is only about 0.5–1% of the usual corpora for such systems; and the extremely poor quality of the audio signal, taken from an internet radio channel, that increases the complexity of the development.

The adiUP system has been tested with a local radio channel, the trilingual Info7 radio channel (Basque, Spanish, and French) [13], and the design specifications have been successfully fulfilled. As a result, a multilingual ASR system has been created which is able to: handle under-resourced languages with a very small audio corpus; select automatically attributes by reducing the redundant data; reduce unwished effects in sublexical units (SLU) due to the low number of samples; and extract information by means to folksonomies. Additionally, this is an evolutionary system where new information resources can be introduced in a more open and collaborative way by users, that has been developed using universal design criteria. In future research lines, other under-resourced languages and media can be integrated.