EXTRACTION OF COMPOUND NOUNS IN MALAY NOUN PHRASES USING A NOUN PHRASE FRAME STRUCTURE

This paper addresses the process of extracting compound nouns from Malay noun phrases using a noun phrase frame structure. Studying compound nouns is important because the dependency between words determines whether a sentence is given its correct meaning. Each complete sentence in Malay must have its respective compound nouns. The compound nouns can be extracted from the subject and predicate of a sentence. The subject must consist of a noun phrase, while the predicate may contain a verb phrase, an adjective phrase or a prepositional phrase. From the structure of a compound noun, we can identify the head and the modifier. The compound noun has been discussed in detail by our native language experts, and many issues were highlighted to further strengthen the concept. Based on the issues discussed, we attempted to extract compound nouns computationally using a noun phrase frame structure technique.


INTRODUCTION
As cited in Rencana Pilihan (2009), Malay is spoken and written by 40 million people across the Malacca Strait, Peninsular Malaysia, Southern Thailand, Singapore, the eastern coast of Sumatra, the Riau Islands in Indonesia, western coastal Sarawak, Brunei and West Kalimantan in Borneo. In Malaysia, a public agency known as Dewan Bahasa dan Pustaka (DBP) is responsible for coordinating the use of the Malay language in Malaysia and Brunei. Various studies have been conducted to strengthen the usage of Malay in various domains (reference). Grammar issues in Malay have been discussed by (Addullah Hassan 2004; Addullah Hassan 1992; Fazal Mohamed 2009; Fazal Mohamed 2008; Nik Sapiah et al. 2010; Ong Ching Guan 2009).
One of the important issues in Natural Language Processing for Malay is the compound noun. Researchers such as (Ken Barker 1998; L. Barrett 2001; Takaaki Tanaka and Yoshihiro Matsuo 2001; Vivi Nastase and Stan Szpakowicz 2003) discussed the fundamental concept of compound nouns in English. Most compound nouns in English are constructed from nouns that are modified by other nouns or by adjectives.
As explained by (Addullah Hassan 1992; Nik Sapiah et al. 2010), compound nouns in Malay fall into two categories: i) noun and noun modifier; and ii) noun and non-noun modifier. In the first category (noun and noun modifier), the compound nouns are grouped according to their respective class names. Each compound consists of a noun followed by another noun, and there are 13 different class names. In the second category (noun and non-noun modifier), a compound noun is formed based on 6 different class names. Each compound consists of a noun followed by a non-noun.
Andrew, H et al. (2008) explained the semantic relations, such as hypernym and meronym, used in their research work. Both relations are used to show the linkages between nouns. They give a few example words: boat, houseboat, and speedboat. The word boat is a super-class of the sub-classes houseboat and speedboat. The compounds houseboat and speedboat are therefore hyponyms of boat, i.e. "..is a kind of boat". The hypernym relation describes the "..is a kind of" relation between the words. They also used the meronym relation to elicit an "attribute" or "property" of the sub-class words. The speedboat is a super-class to its length, size and engine, which means that the meronyms of speedboat are length, size and engine. Andrew, H et al. (2008) referred to WordNet to identify the nouns using both the hypernym and meronym relations.
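The boat example can be written down as two small lookup tables. The sketch below is only an illustration under assumed names (`HYPERNYM`, `MERONYMS`, `is_kind_of` are hypothetical); it uses plain dictionaries rather than WordNet itself.

```python
from typing import Dict, List

# Hypothetical toy tables holding only the words from the boat example.
HYPERNYM: Dict[str, str] = {"houseboat": "boat", "speedboat": "boat"}
MERONYMS: Dict[str, List[str]] = {"speedboat": ["length", "size", "engine"]}


def is_kind_of(word: str, ancestor: str) -> bool:
    """Follow hypernym links upward and report whether `ancestor` is reached."""
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word == ancestor:
            return True
    return False


print(is_kind_of("speedboat", "boat"))  # hypernym check
print(MERONYMS["speedboat"])            # attributes listed as meronyms
```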
(H. Sundblad 2002; Vivi Nastase and Stan Szpakowicz 2003) discussed the bracketing technique for finding possible compound nouns in a sentence. They use a left-branching and right-branching format to group words, and give an example of each. For the phrase laser printer manual, left-branching produces the bracketing ((laser printer) manual), while for the phrase desktop laser printer, right-branching produces the bracketing (desktop (laser printer)). However, there is still a lack of empirical research on resolving ambiguity in pairing the words.
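The two bracketings can be reproduced with a few lines of code. This is a minimal sketch, not the cited authors' procedure: the function names are hypothetical, and it only builds the two shapes without deciding which one is correct for a given phrase.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def left_branching(words: List[str]) -> Tree:
    """Group words from the left: ((w1 w2) w3)."""
    tree: Tree = words[0]
    for word in words[1:]:
        tree = (tree, word)
    return tree


def right_branching(words: List[str]) -> Tree:
    """Group words from the right: (w1 (w2 w3))."""
    tree: Tree = words[-1]
    for word in reversed(words[:-1]):
        tree = (word, tree)
    return tree


def show(tree: Tree) -> str:
    """Render a nested tuple as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(show(part) for part in tree) + ")"


print(show(left_branching(["laser", "printer", "manual"])))
print(show(right_branching(["desktop", "laser", "printer"])))
```

Choosing between the two shapes for an unseen phrase is the ambiguity problem mentioned above and is not addressed by this sketch.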
The rest of the paper is organized as follows. Section 2 discusses compound nouns in Malay sentences, and Section 3 gives a brief conclusion.

COMPOUND NOUNS IN MALAY SENTENCE
The main activity in our research work is to observe and find an acceptable technique for extracting pairs of compound nouns from Malay noun phrases. Referring to the compound noun construction process in Figure 1, we compiled nearly 107 pairs of words that fit the 13 categories of noun and noun modifier defined in our research requirements (Suhaimi Ab Rahman et al. 2011; Suhaimi Ab Rahman et al. 2012).
To label each Malay word's Part-of-Speech (POS), Table 1 shows several Malay POS labels taken from Arbak Othman (2006) with their corresponding English POS. We use the Malay POS labels shown in Table 1 for the examples and discussion in this study.
[Table 1 fragment: Penbil; Gelaran (Title)]
In general, compound nouns in Malay sentences can be extracted using the following three steps:
Step 1: Collect and analyse Malay noun phrase sentences
Step 2: Create a noun phrase frame structure form
Step 3: Extract pairs of compound nouns based on Step 2
Step 1: Collect and analyse Malay noun phrase sentences. In our data preparation process, we currently focus on noun phrase sentences to identify the compound nouns that exist in each sentence. To obtain the noun phrase sentences, we gathered 600 examples of Malay sentences from a dictionary, a children's story book, the internet, and a magazine. The examples were collected manually. From these 600 examples, we determine the number of compound nouns that occur in the sentences using the Malay compound noun analyser tool. This prototype tool was developed to detect possible pairs of compound nouns in a Malay noun phrase sentence. It uses a technique named a noun phrase frame structure to extract compound nouns from a noun phrase sentence, but it does not detect compound nouns in verb phrases, adjective phrases or prepositional phrases.
At present, each sentence is split manually into two sub-phrases, a subject and a predicate, which become the input to our prototype system. Each word in the subject and predicate is assigned its respective POS. The POS assignment is done automatically using our prototype Malay language noun modifier tagger tool.
Table 2 shows examples of Malay noun phrases collected in our data preparation process. Based on the examples shown in Table 2, we divided each sentence into a subject and a predicate. Identifying the correct groups of words as subject and predicate is very important for finding a compound noun in the sentence. The correctness of these groupings was verified by a linguist.
Step 2: Analyse and create a noun phrase frame structure. To find a compound noun in a Malay sentence, we use the noun phrase frame structure depicted in Table 3. This frame structure contains an ordered list of word categories, starting from numeric and classifier through to determiner. A sentence need not fill every category. Rules are necessary to determine the appropriate position in the frame for each word of the input sentence. In addition to the POS assigned to each word, a further label known as the noun modifier is needed so that the rule is more precise and the words can be recognised as a compound noun. (Addullah Hassan 2004; Addullah Hassan 1992; Nik Sapiah et al. 2010) also discussed these issues, showing the importance of a noun phrase frame structure in constructing a Malay noun phrase or sentence.
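To make the idea of a frame concrete, the sketch below places tagged words into ordered slots. It is only an illustration under stated assumptions: the text names the numeric, classifier and determiner categories, while the `head` and `modifier` slots, the tag-to-slot mapping and the function name `fill_frame` are hypothetical. The words and POS labels are taken from Sentence 3.

```python
from typing import Dict, List, Tuple

# Assumed slot order. Only numeric, classifier and determiner are named in
# the text; "head" and "modifier" are hypothetical additions.
FRAME_SLOTS = ["numeric", "classifier", "head", "modifier", "determiner"]

# Hypothetical mapping from POS label to the slots a word may occupy.
SLOTS_FOR_TAG: Dict[str, List[str]] = {
    "Bil": ["numeric"],
    "PenBil": ["classifier"],
    "KN": ["head", "modifier"],
}


def fill_frame(tagged: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Place words into frame slots left to right, stopping at the first
    word that fits no remaining slot (taken as the end of the noun phrase)."""
    frame: Dict[str, List[str]] = {slot: [] for slot in FRAME_SLOTS}
    position = 0  # index of the earliest slot still available
    for word, tag in tagged:
        for slot in SLOTS_FOR_TAG.get(tag, []):
            index = FRAME_SLOTS.index(slot)
            if index < position:
                continue  # slot lies behind the current position
            if slot == "head" and frame["head"]:
                continue  # head already filled, so try the modifier slot
            frame[slot].append(word)
            position = index
            break
        else:
            break  # no slot accepted the word
    return frame


sentence_3 = [
    ("Dua", "Bil"), ("orang", "PenBil"), ("kanak-kanak", "KN"),
    ("perempuan", "KN"), ("sedang", "KB"), ("bermain", "KK"),
    ("di", "KS"), ("taman", "KN"), ("permainan", "KN"),
]
frame = fill_frame(sentence_3)
print(frame)
```

Under these assumptions the head and modifier slots together give a candidate compound noun, and slots that a sentence does not use simply stay empty.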
Examples of noun phrase rules are discussed below:
Sentence 1: Malay: "Gunung itu amat tinggi." English: The mountain is very high.
Sentence 2: Malay: "Mr Ahmad seorang guru matematik sekolah saya." English: Mr Ahmad is a mathematics teacher at my school.
Sentence 3: Malay: "Dua orang kanak-kanak perempuan sedang bermain di taman permainan." English: Two girls are playing at the playground.
Based on Sentences 1, 2 and 3, we can use a noun phrase frame structure to arrange the position of the words. Referring to the noun phrase frame structure in Table 3, the discussion can be summarized as follows:
Sentence 1: Malay: "Gunung itu amat tinggi." English: The mountain is very high.
The noun modifier labels for the words in Sentence 1 are:
In syntax 1, only the first two words in Sentence 1 comply with the rule in Row 1:
Sentence 2: Malay: "Mr Ahmad seorang guru matematik sekolah saya." English: Mr Ahmad is a mathematics teacher at my school.
The noun modifier labels for the words in Sentence 2 are:
In syntax 2, the first two words in Sentence 2 comply with the rule in Row 2:
In this rule, a compound noun for a noun phrase is nominated as follows:
Sentence 3: Malay: "Dua orang kanak-kanak perempuan sedang bermain di taman permainan." English: Two girls are playing at the playground.
For Sentence 3, the noun phrase is [Bil 0]+[KN 1][PenBil 1]+[KN 2]+[KN 3], while the remaining words are in verb phrases, such as [KB 4]+[KK 5]+[KS 6]+[KN 7]+[KN 8].
In this rule, a compound noun for a noun phrase is nominated as follows:
All formulated rules are kept in a Malay noun phrase structure rule database for reference. The more examples of Malay sentences are collected, the more noun phrase rules can be constructed. The rules are also discussed in [Michael, N 2002].
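A rule database of this kind can be pictured as a table of POS sequences that are matched against the start of a tagged sentence. The sketch below is an assumption-laden illustration, not the authors' analyser: the names `NOUN_PHRASE_RULES`, `match_rule` and `noun_pairs` are hypothetical, the Sentence 3 rule is read as one POS label per word (treating [KN 1][PenBil 1] as the single word "orang"), and pairing adjacent KN words is an assumed way of nominating candidates.

```python
from typing import Dict, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # (word, POS label) pairs

# Hypothetical rule database. The single entry is an assumed reading of
# the Sentence 3 rule with one POS label per word.
NOUN_PHRASE_RULES: Dict[str, List[str]] = {
    "sentence_3": ["Bil", "PenBil", "KN", "KN"],
}


def match_rule(tagged: Tagged, rules: Dict[str, List[str]]) -> Optional[str]:
    """Return the name of the longest rule that matches the sentence start."""
    tags = [tag for _, tag in tagged]
    best: Optional[str] = None
    for name, pattern in rules.items():
        if tags[:len(pattern)] == pattern:
            if best is None or len(pattern) > len(rules[best]):
                best = name
    return best


def noun_pairs(tagged: Tagged, length: int) -> List[str]:
    """List adjacent KN+KN pairs inside the first `length` words only."""
    phrase = tagged[:length]
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, t2) in zip(phrase, phrase[1:])
        if t1 == "KN" and t2 == "KN"
    ]


sentence_3: Tagged = [
    ("Dua", "Bil"), ("orang", "PenBil"), ("kanak-kanak", "KN"),
    ("perempuan", "KN"), ("sedang", "KB"), ("bermain", "KK"),
    ("di", "KS"), ("taman", "KN"), ("permainan", "KN"),
]
rule = match_rule(sentence_3, NOUN_PHRASE_RULES)
candidates = noun_pairs(sentence_3, len(NOUN_PHRASE_RULES[rule])) if rule else []
print(rule, candidates)
```

Because the search is limited to the matched noun phrase, noun pairs that occur later in the predicate are deliberately left out, in line with the restriction to noun phrases described in Step 1.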
[Figure labels: subject (noun phrase), predicate (noun phrase), compound noun, noun modifier tagged output]
Step 3: Extract compound nouns based on Step 2. We developed a prototype Malay language compound noun analyser to identify which words in Malay noun phrase sentences form compound nouns. The analyser uses the technique described in Step 2.
Below are examples of the output produced using a Malay language compound noun analyser.
We tested the system using 100 simple Malay sentences from the general domain. A few examples of the tested sentences are listed in Table 4. Based on the 100 sentences tested, we found the number of compound nouns shown in Table 5.
Referring to Table 5, the compound nouns in each noun phrase were detected using the rules in Step 2. The total number of compound nouns that occur in subjects and predicates is calculated with summation formula (3), where n = 100 is the total number of Malay sentences tested, and the variables Compound Noun (CN), sum subject, and sum predicate represent the numbers of compound nouns obtained using Step 2. The sum of the compound nouns found in the subjects and predicates of the sentences is represented as Total CN. Referring to Table 5, Total CN is 140, which means that 140 compound nouns were detected in the 100 Malay sentences tested.
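Formula (3) is not legible in this text. One reading that is consistent with the definitions above, offered as an assumption rather than the authors' exact notation, adds the compound nouns found in the subject and in the predicate of each sentence i:

```latex
\[
\text{Total CN} \;=\; \sum_{i=1}^{n} \left( \text{CN}_{i}^{\text{subject}} + \text{CN}_{i}^{\text{predicate}} \right),
\qquad n = 100
\]
```

Here the symbols CN_i^subject and CN_i^predicate are hypothetical names for the per-sentence counts. With the counts in Table 5, this total is reported as 140.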
Table 6 shows a few examples of compound nouns obtained from testing the 100 Malay sentences. The list of compound nouns produced by our prototype Malay language compound noun analyser serves as the data sample for evaluating the accuracy of the method used in the prototype system. The accuracy of the results is evaluated by a linguist or language expert, who verifies the correctness of the compound nouns produced by the prototype system. The language expert analyses the sentences to identify the compound nouns in them. The expert's results are then compared with the results generated by our system to find the percentage of matching results, which gives a measure of the accuracy of the method used in our prototype system.
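The comparison with the language expert can be sketched as a set overlap. The text does not define the exact measure, so the sketch below assumes it is the share of expert-identified compound nouns that the system also produced. The function name `agreement_percentage` and both sample lists are hypothetical and are not results from the study.

```python
from typing import Iterable


def agreement_percentage(system: Iterable[str], expert: Iterable[str]) -> float:
    """Percentage of expert-identified compound nouns also found by the system."""
    system_set, expert_set = set(system), set(expert)
    if not expert_set:
        return 0.0  # nothing to compare against
    return 100.0 * len(system_set & expert_set) / len(expert_set)


# Hypothetical sample data for illustration only.
system_output = ["guru matematik", "kanak-kanak perempuan", "taman permainan"]
expert_output = ["guru matematik", "kanak-kanak perempuan",
                 "taman permainan", "buku cerita"]

score = agreement_percentage(system_output, expert_output)
print(f"{score:.1f}%")
```

A measure that also penalises compound nouns the system proposed but the expert rejected would divide by the size of the system's list instead; which of the two the evaluation uses is not stated in the text.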
The results of this study will be used in our next research work to detect the head and modifier in a noun phrase sentence based on the compound noun method.

CONCLUSION
In this paper, we described the fundamental process of extracting compound nouns from Malay sentences. As discussed, the Malay language compound noun analyser tool detects compound nouns only in noun phrases, not in verb phrases, adjective phrases or prepositional phrases. The rules for detecting compound nouns can be constructed using a noun phrase frame structure rule, as explained in Section 2. In addition to the noun phrase frame structure rule, we plan to create two further frames to handle cases where a verb or an adjective appears in a noun phrase. The accuracy of compound noun detection in a noun phrase depends on the number of formulated rules: the more rules are created, the better the chance of detecting the compound nouns in a noun phrase. The final result, a list of compound nouns, will be verified by a language expert so that we know the correctness of the process, particularly of the technique used to extract compound nouns from noun phrases in Malay sentences.

TABLE 3. Noun phrase frame structure for Malay sentence 1, 2, and 3
The remaining words are grouped as adjective phrases. In this rule, a compound noun for a noun phrase is nominated as follows:

TABLE 4. Sentence examples to be tested

TABLE 5. Number of compound noun

TABLE 6. Example of compound nouns