Morphosyntactic Linguistic Wavelets for Knowledge Management



Introduction
Morphosyntactics studies grammatical categories and linguistic units that have both morphological and syntactic properties. In its prescriptive form, morphosyntactics describes the set of rules that govern linguistic units whose properties are definable by both morphological and syntactic paradigms.
Thus, morphosyntactics establishes a common framework for oral and written language that guides the process of externally encoding ideas produced in the mind. Speech is an important vehicle for exchanging thoughts, and phonetics also has a significant influence on oral communication. Hearing deficiency causes a leveling and distortion of phonetic processes and hinders morphosyntactic development, particularly when present during the second and third years of life (Kampen, 2005).
Fundamental semantic and ontologic elements of speech become apparent through word usage. For example, the distance between successive occurrences of a word has a distinctive non-Poisson distribution that is well characterized by a stretched exponential scaling (Altmann, 2004). The variance in this analysis depends strongly on semantic type, a measure of the abstractness of each word, and only weakly on frequency. Distribution characteristics are related to the semantics and functions of words. The use of words provides a uniquely precise and powerful lens into human thought and activity (Altmann, 2004). As a consequence, word usage is likely to affect other manifestations of collective human dynamics.
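The quantity discussed above, the distance between successive occurrences of a word, can be computed directly from a token sequence. The following is a minimal sketch, not the procedure used in the cited study; the function name and the toy sentence are illustrative assumptions.

```python
from collections import defaultdict


def recurrence_gaps(tokens):
    """Map each repeated word to the distances between its successive occurrences."""
    last_seen = {}             # word -> position of its most recent occurrence
    gaps = defaultdict(list)   # word -> list of inter-occurrence distances
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(position - last_seen[word])
        last_seen[word] = position
    return dict(gaps)


# Toy input (hypothetical); a real analysis would use a large corpus.
tokens = "the orchid is a flower and the orchid is rare".split()
gaps = recurrence_gaps(tokens)
print(gaps)
```

Words that occur only once have no gap and are therefore absent from the result. The distribution of the collected gaps is what would then be compared against an exponential or a stretched exponential model.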

Words may follow Zipf's empirical law
Zipf's empirical law was formulated using mathematical statistics. It refers to the fact that many types of data studied in the physical and social sciences can be approximated with a Zipfian distribution, one of a family of related discrete power law probability distributions (Figure 1) (Wolfram, 2011).
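For reference, the Zipfian distribution is usually written as below. The symbols are the conventional ones and are not notation taken from this chapter: k is the rank of an item, N is the number of ranked items, and s is the exponent of the power law.

```latex
f(k; s, N) = \frac{1/k^{s}}{\sum_{n=1}^{N} 1/n^{s}}, \qquad k = 1, \dots, N
```

For word frequencies, s close to 1 is the case usually associated with Zipf's law, so that the frequency of a word is roughly inversely proportional to its rank.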
There is no theoretical proof that Zipf's law applies to most languages (Brillouin, 2004), but Wentian Li (Li, 1992) provided empirical evidence supporting the validity of Zipf's law in the domain of language. Li generated a document by choosing each character at random from a uniform distribution including letters and the space character. Its words follow the general trend of Zipf's law. Some experts explain this linguistic phenomenon as a natural conservation of effort in which speakers and hearers minimize the work needed to reach understanding, resulting in an approximately equal distribution of effort consistent with the observed Zipf distribution (Ferrer, 2003).
Whatever the underlying cause of this behavior, word distribution has established correspondences between social activities and natural and biological phenomena. As language is a natural instrument for representation and communication (Altmann, 2004), it becomes a particularly interesting and promising domain for exploration and indirect analysis of social activity, and it offers a way to understand how humans perform conceptualization. Word meaning is directly related to its distribution and location in context. A word's position is also related to its thematic importance and its usefulness as a keyword (López De Luise, 2008b, 2008c). This kind of information (recurrence, distribution and position) is strongly correlated with morphosyntactic analysis and strongly supports "views of human conceptual structure" in which all concepts, no matter how abstract, directly or indirectly engage contextually specific experience. Tracing language in the ever larger digital databases of human communications "can be a most promising tool for tracing human and social dynamics" (Altmann, 2004). Thus, morphosyntactic analysis offers a new and promising tool for the study of dynamic social interaction.
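Li's random-text experiment described above is easy to reproduce in outline. The sketch below is an illustration under stated assumptions, not Li's original setup: the five-letter alphabet, the text length, the seed and the function name are all arbitrary choices made here.

```python
import random
from collections import Counter


def random_text_rank_frequency(n_chars=200_000, alphabet="abcde", seed=0):
    """Generate uniformly random characters (letters plus space) and
    return word frequencies sorted from most to least frequent."""
    rng = random.Random(seed)
    symbols = alphabet + " "   # the space acts as the word separator
    text = "".join(rng.choice(symbols) for _ in range(n_chars))
    counts = Counter(text.split())   # split() ignores runs of spaces
    return [freq for _, freq in counts.most_common()]


freqs = random_text_rank_frequency()
for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        print(rank, freqs[rank - 1])
```

Plotting the logarithm of the frequency against the logarithm of the rank gives a stepwise curve with a roughly linear overall trend, which is the Zipf-like behavior referred to in the text.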

Why morphosyntactic wavelets?
The evidence that wavelets offer the best description of such morphosyntactic decomposition is revealed by comparing the details of both traditional and morphosyntactical analyses. Figure 2 shows a graphical comparison between a signal and its FFT. Figure 3 is a linguistic version: E ci and ER. The graphics in Figure 2 represent the original signal (time-domain) and the resulting FFT decomposition (Lahm, 2002). The images in Figure 3 represent a translated original Spanish text (content from wikipedia.org, topic Topacio) transformed into an E ci (López De Luise, 2007) that models dialog knowledge (Hisgen, 2010). Statistical modeling of knowledge is beyond the scope of this chapter, but additional information is available in (López De Luise, 2005, 2008, 2008b, 2008c, 2007b, 2007c). Figure 4 shows a sample wavelet decomposition. It is a signature decomposition using a Daubechies wavelet, a wavelet specially suited for this type of image. Figure 5 shows an MLW decomposition of a generic text. There, C i and C j,k stand for abstract knowledge and F m represents filters. This figure will be described further in the final section.

Technical overview

Wavelets
Wavelets are mathematical tools that are used to decompose/transform data into different components (coefficients) that describe different levels of detail (Lahm, 2002). Thus, they can extract the main features of a signal while simultaneously and independently analyzing details.
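As a concrete illustration of decomposition into coefficients at different levels of detail, the sketch below applies the Haar wavelet, the simplest wavelet, to a short signal. It uses the unnormalized averaging form and assumes the signal length is divisible by two at every level; the function names and sample values are chosen here for illustration.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (main features) and pairwise
    half-differences (details)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_decompose(signal, levels):
    """Apply haar_step repeatedly, keeping the detail coefficients of each level."""
    current = list(signal)
    details = []
    for _ in range(levels):
        current, detail = haar_step(current)
        details.append(detail)
    return current, details


approx, details = haar_decompose([4, 6, 10, 12, 8, 6, 5, 5], levels=2)
print(approx)    # coarse approximation after two levels
print(details)   # detail coefficients, finest level first
```

Each additional level halves the approximation and records what was lost as detail coefficients, so the original signal can be rebuilt from the final approximation plus all details.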
These tools have been applied to several problems, including the challenges of linguistic information retrieval. For example, wavelets have been used to build a Fuzzy Wavelet Neural Network (FWNN) for decision making over multiple criteria (Chen, 2008). In that analysis, custom built linguistic labels were used to represent information about events and situations and were processed with the FWNN.
Wavelets are sometimes used to replace linguistic analysis. For example, Tolba (Tolba, 2005) used consonant and vowel segmentation to develop automatic speech recognition for Arabic speech without linguistic information. Segmentation was performed with a combination of wavelet transformation and spectral analysis.
Hui and Wanglu combined the Linguistic Cloud Model (LCM) with wavelets to perform Advanced Synthetic Aperture Radar (ASAR) image target detection (Hui, 2008). This approach first solves image segmentation, avoids noise and recovers from errors. Then, it uses LCM to resolve the uncertainty of pixels. Representation using LCM bridges the gap between qualitative knowledge and quantitative knowledge, and it is thus used to map linguistic terms with contextually specific meaning to numeric processing.

Comparison between MLW and traditional wavelets
To demonstrate the concept of MLW and its relationship to its traditional counterpart, Table 2 summarizes the main characteristics that unite or distinguish them.

Linguistic cloud model and MLW
LCM models linguistic knowledge (Li, 2000) using a set of predefined, customized fuzzy linguistic variables. These variables are generated in accordance with two rules:
1. The atom generation rule specifies the manner in which a linguistic "atom" may be generated. An atom is a variable that cannot be sliced into smaller parts.
2. The semantic rule specifies the procedure by which composite linguistic terms are computed from linguistic atoms.
In addition, there are connecting operators ("and", "or", etc.), modifiers ("very", "quite", etc.) and negatives that are treated as soft operators that modify an operand's (atom's) meaning to produce linguistic "terms".
The MLW and the LCM share a common goal. However, the MLW replaces the manual procedure used to obtain linguistic atoms with automated processing that determines an atom's linguistic category (e.g., noun or verb) (López De Luise, 2007d, 2008c). The result is not an atom or a term but is a structure named E ci (an acronym from the Spanish, Estructura de Composición Interna). The E ci is used to model the morphosyntactic configuration within sentences (López De Luise, 2007; Hisgen, 2010). Thus, the core processing is based on E ci structures instead of linguistic variables. An E ci is a plastic representation that can evolve to reflect more detailed information regarding the represented portion of text. While atoms cannot be sliced, any E ci can be partitioned as required during the learning process. Further differences between the LCM and the MLW are shown in Table 3.

Morphosyntactics as a goal
Most morphological and syntactical processing is intended for information retrieval, while alignment supports automatic translation. Those approaches are mainly descriptive and are defined by cross-classifying different varieties of features (Harley, 1994).

Detecting language tendencies
Language tendencies denote cultural characteristics, which are represented as dialects and regional practices. Noyer (Noyer, 1992) described a hierarchical tree organization defined by applying manually predefined morphological feature filters to manage morphological contrasts. This organization was used as an indicator of linguistic tendencies in language usage. Extensions of this approach attempt to derive the geometry of morphological features (Harley, 1994, 1998), with the goal of classifying features into subgroups based on a universal geometry while accounting for universals in feature distribution and realization. In MLW, the structure of the information is organized in a general oriented graph (E ci structure) for only the smallest unit of processing (a sentence), and a hierarchy is defined by a chained sequence of clustering filters (Hisgen, 2010). Language tendencies are therefore visible in the configuration of a current graph.

Sentence generation
Morphosyntax has also been used to implement a language sentence generator. In an earlier study (Martínez López, 2007), Spanish adverbial phrases were analyzed to extract the reusable structures and discard the remainder, with the goal of using the reusable subset to generate new phrases. Interestingly, the shortest, simplest structures presented the most productive patterns and represented 45% of the corpus.
Another study (López De Luise, 2007) suggested translating Spanish text, represented by sets of E ci, into a graphic representing the main structure of the content. This structure was tested with 44 subjects (López De Luise, 2005). The results showed that this treatment, even without directly managing semantics, could communicate the original content. Volunteers were able to reconstruct the original text content successfully in 100% of the cases. As MLW is based on E ci structure, it follows that it:
- represents keywording well.
- performs well independent of an individual's knowledge of a specific subject.
- performs well independent of an individual's knowledge of informatics.

Language comprehension detection
As language is an expression of mind and its processes, it also becomes the expression of meaning (or lack of meaning) in general. This fact is also true when the subject is the language itself. A recent study focused on the most frequently recurrent morphosyntactic uses in a group of students who study Spanish as a foreign language (González Negrón, 2011) revealed a peculiar distribution of nouns and personal pronouns. These parts of speech were present at a higher frequency than in the speech of native speakers, probably to guarantee the reader's comprehension of the text. Other findings included preposition repetition and a significant number of misplaced prepositions. Thus, morphosyntactic statistics detect deficient language understanding. A similar study was performed in (Konopka, 2008) with Mexican subjects living in Chicago (USA). In the case of MLW, the E ci and E ce structures will reflect irregular language usage and make detection of incorrect language practices easy.

Semantics detection
Morphosyntactics can be used to detect certain types of semantics in a text. An analysis of vowel formant structure and vowel space dispersion revealed overall spectral reduction for certain talkers. These findings suggest an interaction between semantic and indexing factors in vowel reduction processes (Cloppera, 2008).
Two morphosyntactic experimental studies of numeral quantifiers in English (more than k, at least k, at most k, and fewer than k) (Koster-Moeller, 2008) showed that Generalized Quantifier Theory (GQT) must be extended to manage morphosyntactic differences between denotationally equivalent quantifiers. The formal semantic is focused on the correct set of entailment patterns of expressions but is not concerned with deep comprehension or real-time verification. However, certain systematic distinctions occur during real-time comprehension. The degree of compromise implicit in a semantic theory depends on the types of semantic primitives it assumes, and this also influences its ability to treat these phenomena. In (López De Luise, 2008b), sentences were processed to automatically obtain specific semantic interpretations. The shape of the statistics performed over the E ci's internal weighting value (named p o) is strongly biased by the semantics behind sentence content.
Generalized Quantifier Theory is a logical semantic theory that studies the interpretation of noun phrases and determiners. The formal theory of generalized quantifiers already existed as a part of mathematical logic (Mostowski, 1957), and it was implicit in Montague Grammar (Montague, 1974). It has been fully developed by Barwise & Cooper (1981) and Keenan & Stavi (Barwise, 1981) as a framework for investigating universal constraints on quantification and inferential patterns concerning quantifiers.

Improvement of translation quality/performance
Automatic translation has evolved considerably. Translation quality depends on proper pairing or alignment of sources and on appropriate targeting of languages. This sensitive processing can be improved using morphosyntactic tools.
Hwang used morphosyntactics intensively for three kinds of language pairs (Hwang, 2005). The pairs were matched on the basis of morphosyntactical similarities or differences. The study investigated the effects of morphosyntactical information such as base form, part-of-speech, and the relative positional information of a word in a statistical machine translation framework. Word and class-based language models were built by manipulating morphological and relative positional information.
The language pairs used were Japanese-Korean (languages with same word order and high inflection/agglutination), English-Korean (a highly inflecting and agglutinating language with partially free word order and an inflecting language with rigid word order), and Chinese-Korean (a highly inflecting and agglutinating language with partially free word order and a non-inflectional language with rigid word order).
According to the language pairing and the direction of translation, different combinations of morphosyntactic information most strongly improve translation quality. In all cases, however, using morphosyntactic information in the target language optimized translation efficacy. Language models based on morphosyntactic information effectively improved performance. E ci is an important part of the MLW, and it has inbuilt morphophonemic descriptors that contribute significantly to this task.

Speech recognition
Speech recognition requires real-time speech detection. This is problematic when modeling languages that are highly inflectional but can be achieved by decomposing words into stems and endings and storing these word subunits (morphemes) separately in the vocabulary. An enhanced morpheme-based language model has been designed for the inflectional Dravidian language Tamil (Saraswathi, 2007). This enhanced, morpheme-based language model was trained on the decomposed corpus. The results were compared with word-based bigram and trigram language models, a distance-based language model, a dependency-based language model and a class-based language model. The proposed model improves the performance of the Tamil speech recognition system relative to the word-based language models. The MLW approach is based on a similar decomposition into stems and endings, but it includes additional morphosyntactical features that are processed with the same importance as full words (for more information, see the last sections). Thus, we expect that this approach will be suitable for processing highly inflectional languages.
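The decomposition of words into stems and endings mentioned above can be illustrated with a simple longest-suffix split. This is only a sketch: the suffix list is a small hypothetical set of Spanish-like endings, the minimum stem length is an arbitrary assumption, and neither is taken from the cited Tamil model or from MLW.

```python
# Hypothetical endings, for illustration only.
SUFFIXES = ("ando", "iendo", "aron", "amos", "es", "as", "os", "s")


def split_word(word, suffixes=SUFFIXES, min_stem=3):
    """Split a word into (stem, ending) using the longest matching suffix.
    Words with no usable suffix are returned whole with an empty ending."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], suffix
    return word, ""


# Stems and endings are stored separately, as the text describes.
stems, endings = set(), set()
for w in ["orquideas", "cantando", "sol"]:
    stem, ending = split_word(w)
    stems.add(stem)
    if ending:
        endings.add(ending)
print(sorted(stems), sorted(endings))
```

Storing stems and endings separately keeps the vocabulary small for highly inflectional languages, since each stem no longer needs one entry per inflected form.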

Morphosyntactic linguistic wavelet approach

A sequential approach to wavelets
Because language is complex, soft decomposition into a set of base functions (as in traditional wavelets) is a multi-step process with several components.
Developing numeric wavelets usually includes the following steps:
1. Take the original signal sample
2. Apply filtering (decomposition using the mother wavelet)
3. Analyze coefficients defined by the basis function
4. If the granularity and details are inadequate for the current problem, repeat from step 2
5. Take the resulting coefficients as a current representation of the signal
Language requires additional steps, which are described in more detail in the following section. In brief, these steps are:
1. Take the original text sample
2. Compress and translate text into an oriented graph (E ci) preserving most morphosyntactic properties
3. Apply filtering using the most suitable approach
4. If abstraction granularity and details are insufficient for the current problem
4.1 Insert a new filter, E ce, in the knowledge organization
4.2 Repeat from step 3
5. Take the resulting sequence of filtering as a current representation of the knowledge about and ontology of the text
6. Take the resulting E ci as the internal representation of the new text event
A short description of the MLW steps is presented below, with an example in the Case study section.

Details of the MLW process
Further details of the MLW process are provided in this section, with the considerations relevant to each step included.

Take the original text sample
Text can be extracted from Spanish dialogs, Web pages, documents, speech transcriptions, and other documents. The case study in Section 4 uses dialogs, transcriptions, and other documents. Several references mentioned in this chapter were based on Web pages.

Compress and translate text into an oriented graph (called E ci) preserving most morphosyntactic properties
Original text is processed using predefined and static tables. The main components of this step are as follows:
- Filter useless morphemes using reference tables.

Apply filtering using the most suitable approach
Since knowledge management depends on previous language experiences, filtering is a dynamic process that adapts itself to current cognitive capabilities. Furthermore, as shown in the Case Study section, filtering is a very sensitive step in the MLW transformation.
Filtering is a process composed of several filters. The current paper includes the following three clustering algorithms: Simple K-means, Farthest First and Expectation Maximization (Witten, 2005). They are applied sequentially for each new E ce. When an E ce is "mature", the filter no longer changes.
The distance used to evaluate clustering is based on the similarity between the descriptor values and the internal morphosyntactic metric, p o, that weights each EBH (Estructura Básica Homogénea, the uniform basic structure that represents morphemes). It has been shown that clusters generated with p o represent consistent word agglomerations (López De Luise, 2008, 2008b). Although this chapter does not use fuzzy clustering algorithms, it is important to note that such filters require a specific adaptation for distance using the categorical metrics defined in (López De Luise, 2007e).
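Of the three filters named above, Farthest First is the most compact to show. The sketch below implements the standard farthest-first traversal for choosing cluster centers and then assigns each point to its nearest center. Plain Euclidean distance on made-up two-dimensional points is used as a stand-in, which is an assumption: the chapter's distance is based on descriptor values and the p o metric, which are not reproduced here.

```python
import math


def farthest_first_centers(points, k, start=0):
    """Pick k centers: begin with points[start], then repeatedly add the
    point that is farthest from all centers chosen so far."""
    centers = [points[start]]
    # Distance from every point to its nearest chosen center.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx]))
                   for d, p in zip(nearest, points)]
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


# Hypothetical data standing in for descriptor vectors.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 8)]
centers = farthest_first_centers(points, k=3)
labels = assign(points, centers)
print(centers, labels)
```

The `start` index plays a role loosely comparable to the seed reported in the case study settings, in that it fixes the first center and therefore the whole traversal.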

If "Abstraction" granularity and details are inadequate for the current problem
Granularity is determined by the ability to discriminate the topic and by the degree of detail required to represent the E ci. In the MLW context it is the logic distance between the current E ci and the E ce partitions (see Figure 5). This distance depends on the desired learning approach. In the example included herein (Section 4), it is the number of elements in the E ci that fall within each E ce partition. The distribution of EBHs determines whether a new E ce is necessary. When the EBHs are too irregular, a new E ce is built per step 3.2.4.1. Otherwise the new E ci is added to the partition that is the best match.

Insert a new filter, E ce , in the knowledge organization
The current E ce is cleaned so that it keeps all the E ci structures that best match its partitions, and a new E ce that includes all the E ci structures that are not well represented is created and linked.

Take the resulting sequence of filtering as a current representation of the knowledge about and ontology of the text
The ontology of the learned E ci is distributed along the chain of E ce structures.

Take the resulting E ci as the internal representation of the new text event
The specific acquired, concrete knowledge is now condensed in the E ci. This provides a good representation of the original text and its keywording (López De Luise, 2005).
Real texts include contradictions and ambiguities. As previously shown (López De Luise, 2007b), they are processed and handled despite potentially inadequate contextual information. The algorithm neither performs detailed clause analysis nor encodes linguistic knowledge about the context, because these components complicate the process and make it less automatic.
Furthermore, the p o metric can distinguish the following Writing Profiles: general document, Web forum, Web index and blogs. This metric is therefore independent of document size and mentioned text styles (López De Luise, 2007c). Consequently, it is useful to define the quality of the text that is being learned and to decide whether to accept it as a source of knowledge.

Gelernter's perspective on reasoning
Section 3.2.3 specifies that hard clustering algorithms must be applied first and fuzzy ones afterwards. This is not a trivial restriction. Its goal is to organize learning across a range from specific concrete data to abstract and fuzzy information. The filters are therefore organized as a sequence from simple k-means clustering to fuzzy clustering. This approach is compatible with Gelernter's belief that thinking is not a static algorithm that applies to every situation. Thinking requires a set of diverse algorithms that are not limited to reasoning. Some of these algorithms are sharp and deep, allowing clear manipulation of concrete objects, but there are other algorithms with different properties.
David Gelernter's theory (Gelernter, 2010) states that thinking is not the same as reasoning.
When your mind wanders, you are still thinking. Your mind is still at work. This free association is an important part of human thought. No computer will be able to think like a man unless it can perform free association.
People have three common misconceptions:

The belief that "thinking" is the same as "reasoning"
There are several activities in the mind that are not reasoning. The brain keeps working even when the mind is wandering.

The belief that reality and thoughts are different and separated things
Reality is conceptualized as external while the mental landscape created by thoughts is seen as internal and mental. According to Gelernter, both are essentially the same although the attentional focus varies.

The separation of the thinker and the thought
Thinking is not a PowerPoint presentation in which the thinker watches the stream of his thoughts. When a person is dreaming or hallucinating, the thinker and his thought-stream are not separate. They are blended together. The thinker inhabits his thoughts.
Gelernter describes thinking as a spectrum of many methods that alternate depending on the current attentional focus. When the focus is high, the method is analytic and sharp. When the brain is not sharply focused, emotions are more involved and objects become fuzzy. That description is analogous to the filtering restriction: define sharp clustering first and leave fuzzy clustering approaches for the final steps. As Gelernter writes, "No computer will be creative unless it can simulate all the nuances of human emotion."

Case study
This section presents a sample case to illustrate the MLW procedure. The database is a set of ten Web pages with the topic "Orchids". From more than 4200 original symbols and morphemes in the original pages, 3292 words were extracted; 67 of them were automatically selected for the example. This section shows the sequential MLW decomposition. Table 4 shows the filtering results for the first six E ci structures.

Build E ci 1
Because the algorithm has no initial information about the text, we start with a transition state and set the d parameter to 20%. This parameter is the threshold for Diff, the difference in the number of elements between the most and least populated partitions.
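The comparison of Diff against d that drives the following steps can be sketched as below. The chapter defines the quantity only as the difference in the number of elements between the most and least populated partitions; expressing it as a percentage of all clustered elements is an assumption made here, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def partition_diff(labels, n_clusters):
    """Percentage difference in size between the most and least populated
    partitions, relative to the total number of elements (assumed normalization)."""
    counts = Counter(labels)
    sizes = [counts.get(c, 0) for c in range(n_clusters)]  # empty clusters count as 0
    return 100.0 * (max(sizes) - min(sizes)) / len(labels)


# Hypothetical cluster labels for ten elements in N=5 clusters.
labels = [0, 0, 0, 1, 1, 2, 2, 3, 3, 4]
diff = partition_diff(labels, n_clusters=5)
d = 20.0   # transition-state threshold from the text
print(diff, "keep filter" if diff < d else "change filter")
```

A value below d means the partitions are balanced enough to keep the current filter; a value at or above d triggers the change of filter described in the steps that follow.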

Apply filters to E ci 1
K-means clustering (hereafter KM) is used as the first filter with settings N=5 clusters, seed 10. Diff=16%<d. Keep KM as the filter.

Apply filters to E ci 2
Filter using KM with the same settings, and the current Diff=11%<d. Keep KM as the filter.

Apply filters to E ci 3
Filter using KM with the same settings, and the current Diff=10%<d. Keep KM as the filter and exit the transition state.

Apply filters to E ci 4
Filter using KM with d=10% for steady state. This process will indicate whether to change the filter or build a new E ce. Clustering settings are the same, and the current Diff=20%>d. Change to Farthest First (FF) as the filter.

Apply filters to E ci 5
Filter using FF with clustering settings N=5 clusters, seed 1. The current Diff=45%>d. Change to Expectation Maximization (EM) as the next filter.

Apply filters to E ci 6
Filter using EM. Keep all the individuals as E ci 1, and put in E ce 2 the individuals in cluster 1 (one of the three less cohesive, with lower p o). This procedure is shown in Figure 6.
Detect the E ce 1 partition that best suits E ci 7 using cohesiveness criteria. The result shows that the partition that holds E ci 6 is the best. E ci 7 now hangs from this partition as indicated in Figure 7.
Detect the E ce 1 partition that best suits E ci 8 using the same cohesiveness criteria. The partition that holds E ci 5 is the best.
E ce 1,1 now contains E ci 4, E ci 5 and E ci 6. Filter E ce 1,1 using KM with clustering settings of N=5 clusters, seed 10. The value of Diff=20%>d. Change to Farthest First (FF) as the next filter.
Now the E ce sequence is as indicated in Figure 8.
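The filter changes in the steps above follow a simple escalation pattern, replayed in the sketch below with the Diff and d values reported for E ci 1 to E ci 5. Two details are assumptions made for illustration: a Diff equal to d is treated as a failure, and exhausting the chain after EM is signalled by the hypothetical marker "NEW_ECE", since the text does not report a Diff value for the EM step.

```python
FILTER_CHAIN = ["KM", "FF", "EM"]   # order used in the case study


def next_action(current_filter, diff, d):
    """Keep the current filter while Diff < d; otherwise move to the next
    filter in the chain, or request a new E ce when none is left."""
    if diff < d:
        return current_filter
    position = FILTER_CHAIN.index(current_filter)
    if position + 1 < len(FILTER_CHAIN):
        return FILTER_CHAIN[position + 1]
    return "NEW_ECE"


# (Diff %, d %) for E ci 1 to E ci 5 as reported in the case study;
# d drops from 20 to 10 when the transition state ends.
observations = [(16, 20), (11, 20), (10, 20), (20, 10), (45, 10)]

applied = []
current = "KM"
for diff, d in observations:
    applied.append(current)              # filter used for this E ci
    current = next_action(current, diff, d)

print(applied, "-> next filter:", current)
```

The replay yields KM for the first four E ci structures, FF for the fifth, and EM as the filter for E ci 6, which matches the sequence described in the text.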

The representation in MLW
We do not expect E ce content to be understood from the human point of view, but it should be considered a tool to condense and potentially regenerate knowledge from textual sources. This is a first step in the study of this type of tool that uses mathematical and statistical extraction of knowledge to automatically decompose text and represent it in a self-organizational approach.
For instance, the following sentence from the dataset, "Dactylorhiza incarnata es orquídea de especies Europeas" (Dactylorhiza incarnata is a European orchid species) corresponds to the EBH number 04, and can be found (after MLW) as the sequence E ce 1-E ce 1,2-E ci 4.
If there is an interest in understanding the topic, the main entry of the set of E ci structures in the cluster can be used as a brief description. To regenerate the concepts saved in the structure for human understanding, it is only necessary to use the symbolic representation of the E ci (López De Luise, 2007).

Conclusion
MLW is a new approach that attempts to model natural language automatically, without the use of dictionaries, special languages, tagging, external information, adaptation for new changes in the languages, or other supports. It differs from traditional wavelets in that it depends on previous usage, but it does not require human activities to produce definitions or provide specific adaptations to regional settings. In addition, it compresses the original text into the final E ci. However, the long-term results require further testing, both to further evaluate MLW and to evaluate the correspondence between human ontology and conceptualization and the sequence of E ce structures.
This approach can be completed with the use of a p o weighting to filter the results of any query or browsing activity according to quality and to detect additional source types automatically.
It will also be important to test the use of categorical metrics for fuzzy filters and to evaluate MLW with alternate distances, filter sequences and cohesiveness parameters.

Fig. 6. E ce 1 after cleaning up the less cohesive E ci structures

Table 2. Characteristics of traditional wavelets and MLW (the main characteristics that unite or distinguish them)

Table 3. Comparison between the LCM and the MLW

such as number and person. When morphological operations are an autonomous subpart of the derivation, they acquire a status beyond descriptive convenience. They become linguistic primitives, manipulated by the rules of word formation. In many approaches, they are manipulated as an undifferentiated bundle divided only into nominal and verbal atoms. The following section describes elements of the morphosyntactic approach.
- Extract morphosyntactic descriptors (numeric values automatically extracted, such as the number of vowels) for each word processed. Words were previously represented by Porter's Stemming, but this tool does not have enough classification power for use as a sole instrument. Morphosyntactic descriptors are required to process text with sufficient confidence levels (López De Luise, 2007d).
- Collapse syntagmas into a condensed internal representation (usually, selected morphemes). The resulting representation is called EBH (Estructura Básica Homogénea, uniform basic structure). EBHs are linked with specific connectors.
- Calculate and set the morphosyntactic weighting p o for E ci.
More details of each of these steps are outside of the scope of this chapter (but see (López De Luise, 2008c) and (López De Luise, 2008)).
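The idea of numeric descriptors extracted automatically from each word can be illustrated as follows. Only the number of vowels is named in the text; the other two descriptors, the vowel set and the function name are hypothetical additions for the sake of the example and do not reproduce the descriptor set of the cited work.

```python
# Vowels including accented Spanish forms (an assumption for this sketch).
VOWELS = set("aeiouáéíóúü")


def descriptors(word):
    """Return a few simple numeric descriptors for one word."""
    lower = word.lower()
    n_vowels = sum(ch in VOWELS for ch in lower)
    n_letters = sum(ch.isalpha() for ch in lower)
    return {
        "length": len(lower),                 # hypothetical descriptor
        "vowels": n_vowels,                   # descriptor named in the text
        "consonants": n_letters - n_vowels,   # hypothetical descriptor
    }


for w in "Dactylorhiza incarnata es orquídea de especies Europeas".split():
    print(w, descriptors(w))
```

Vectors of this kind, one per word, are the sort of input on which the clustering filters of the MLW process would operate.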
Table 4. Filtering results for each E ci
*This is the result of EM to define the splitting of E ce 1.