Context Free Grammar for Turkish

Formal grammar, introduced by Chomsky, is one of the most important developments in Natural Language Processing, a branch of Artificial Intelligence. Formal grammars make a mathematical representation of languages possible. Almost all natural languages have word classes such as noun, adjective and verb. In addition, a sentence consists of a noun phrase and a verb phrase, and a noun phrase may contain location, destination and source elements. Despite the many similarities between languages, there are important differences in the grammar rules of languages that belong to different language families. In this study we investigate the most appropriate formal grammar for representing Turkish. The accuracy of the suggested grammar rules is evaluated on two different corpora. This study is an enhanced version of “Turkish Context Free Grammar Rules with Case Suffix and Phrase Relation”, which was presented at the UBMK 2016 International Conference on Computer Science & Engineering [1]. Unlike the first study, this study covers all word and sentence types of Turkish. Adjectives and prepositions are considered. Quoted sentences, incomplete sentences and question sentences are included, as are genitive phrase structures that contain verbal words. Noun phrases are also defined in detail.


Introduction
Scientific studies of language began in the 1900s.
English is represented by context-free grammar (CFG), which is a type of formal grammar. Different CFG rules have been determined for English in different studies [14], [15]. The main source for a CFG for English is the book "An Introduction to Natural Language Processing" by Daniel Jurafsky and James H. Martin, in which "CFG for English" is a chapter [16]. There are also other studies related to CFGs for English [17], [18]. An English CFG does not need as many rules for suffixes as one for Turkish, which is an agglutinative language. Turkish can also be represented by a CFG. In this study, we search for the most appropriate context-free grammar and rules for Turkish.
Z. Güngördü and C. Demir studied Turkish syntactic structure (parsing Turkish using the lexical functional grammar (LFG) formalism) in 1993 [19]. The LFG grammar in that study does not consider some of the verbal phrases (VP). T. Güngör and S. Kuru used an augmented transition network (ATN) for extracting Turkish suffixes [20]. This is an extensive study that includes different types of phrases with verbal items, and it defines some standard NP generation and phrase generation networks. When investigated in detail, however, these networks are not sufficient for generating all kinds of sentences, because the study does not consider recursion in the sentence or word type transformation.
R. Çakıcı studied the automatic induction of a CCG grammar for Turkish [21], [22]. This study uses machine learning techniques with supervised learning, so its results depend strongly on the data used. Our literature research shows that it is the only study using Combinatory Categorial Grammar (CCG), which is a kind of CFG. Word type transformation and Turkish-specific phrase types are not included in that study.
In 2006, Ö. İstek studied a link grammar for Turkish [23]. This study does not include multi-word expressions or punctuation symbols.
In 2007, E. İ. Ünkar parsed Turkish sentences for text watermarking [24]. Word types are not detailed in this study; for example, the word "olası" is called a modifier, without noting that it is a modifier formed as an adverbial verb.
The importance of our study lies in its general representation of Turkish sentences. Using all Turkish phrase structures, the transformations between word types through suffixes, and the recursive sentence structure inside phrases distinguishes our study from other studies.
In this paper, section 1 contains the introduction and previous work related to the concept. In section 2, Turkish-specific features are defined. In section 3, Turkish-specific CFG rules are introduced. Section 4 and section 5 cover the data, evaluation and results. The CFG representation is built step by step. First, the rules for the simple sentence and for noun phrases in the simple sentence are defined. Then the rules for the complex sentence and for noun phrases in the complex sentence are defined. Finally, the CFG rules for compound, quoted and incomplete sentences are defined.

Turkish Specific Features
Turkish has the following specific features:
1. Phrase structure with case suffixes
2. Free phrase order in the sentence
3. Compatibility between the predicate and the other phrases in the sentence
4. Sentence recursion using participles (verbal adjectives), converbs (verbal adverbs) and gerunds (verbal nouns)
5. Transformation between word types using suffixes

Phrase structure with suffixes
Turkish sentences can be separated into their phrases by means of specific case suffixes. Each phrase type has a role in the sentence. There are many studies of Turkish phrase structure [25], [26], [27]. The phrases shown in Table 1 are used in this study. In Figure 1, the sentence "Ayşe bugün okula annesiyle gitti." is divided into its phrases; this sentence has a verb root in its predicate. The sentence "Gül en güzel çiçektir." is also separated into its phrases; this sentence has a noun root in its predicate.

Free phrase order
In Turkish, phrase order is so flexible that a sentence S can be formed with any permutation of the phrases P1, P2, P3, P4, P5, P6, P7, P8, P9 and the predicate V_predicate.
In a regular sentence, V_predicate is placed at the end of the sentence. The computational analysis of the syntax and interpretation of "free" word order in Turkish was studied by Hoffman in 1995 [28].
In Figure 2, all permutations of the words "Ayşe", "okula" and "annesiyle" are used. We assume that the sentence is regular, so the predicate is at the end; even if the position of the predicate is changed, however, the meaning does not change. As seen in Figure 3, the predicate also determines which phrases the sentence may include. For example, if the predicate is "ol (be)", the sentence does not contain an object phrase, and if the predicate is "oturmak (reside)", the sentence does not contain a source phrase.
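The free phrase order described above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the predicate "gitti" is taken from the example sentence of Figure 1, and the function name regular_orders is hypothetical and not part of the original study.

```python
from itertools import permutations

def regular_orders(phrases, predicate):
    """Return every ordering of the phrases with the predicate kept sentence-final."""
    return [" ".join(order + (predicate,)) for order in permutations(phrases)]

# Three phrases give 3! = 6 regular sentences, all ending with the predicate.
sentences = regular_orders(["Ayşe", "okula", "annesiyle"], "gitti")
for sentence in sentences:
    print(sentence)
```

Keeping the predicate fixed at the end mirrors the "regular sentence" assumption; inverted sentences would also permute the position of the predicate.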

Recursion in sentences
One of the common features of languages is the recursive structure of sentences [29], [30]. The syntax of a phrase or a sentence is constructed from repeated rules. In this paper, we consider the recursion of the sentence. A sentence should contain a judgment or convey a statement. Sentences sometimes contain an inner statement or judgment.
In Turkish, recursion in the sentence is achieved with verbal forms. When p denotes one of the permutations of the phrases P1 P2 P3 P4 P5 P6 P7 P8 P9, a simple sentence can be formed by the rule "p + V + suffix". In the sentence "Ali okula geldi.", "Ali" is the subject, "okula" is the dative phrase, and "geldi" is constructed from a verb and the past suffix. The recursion of "p + V + suffix" can also be seen in the complex sentence. As seen in Figure 4, the sentence "Okula gelen Ayşe dün üzgündü." has the form "p + V + suffix + X + p + V + suffix".

Figure 4. Recursion example
A compound sentence is generated by joining complex and/or basic sentences with conjunctions. The compound sentence also has the recursive structure S ← S + C + S, or S ← p + V + suffix + C + p + V + suffix. Here, S is the sentence, C is the conjunction and "+" is the concatenation operator.

Transformation structure with suffixes
Turkish is an agglutinative language, so suffixes are determining features for phrase type, subject type, singularity or plurality, and tense and modality. There are many studies of the morphological structure of Turkish [31], [32]. Suffixes are also used to transform the basic word types (noun, adjective and verb) into one another.

Figure 5. Transformation between noun and verb example
As seen in Figure 5, suffixes make it possible to transform between noun and adjective, noun and verb, and verb and adjective. In Turkish, rules related to the original type can be applied to the transformed type. Unlike in English, in Turkish the time, location, adverb and destination phrases can easily be separated from the verb phrase because of the case suffixes and the free phrase order.

CFG For Turkish
The aim of this study is to create a context-free grammar whose rules can handle all Turkish text and allow all possible text to be derived.

CFG for simple sentence
In formal grammar representation, a language is represented by {P, N, T, S}, where P is the set of generation rules, N is the set of non-terminals, T is the set of terminals and S is the start symbol [11]. Turkish case markers and phrase relations for the simple sentence are represented with a formal grammar by the rules in Table 2.
As seen in Table 2, the suffixes -i, -e, -de, -den and -le, together with noun phrase, noun, adverb phrase, adverb and predicate, are terminals. S, P1, P2, P3, P4, P5, P6, P7, P8, P9, X and V_predicate are non-terminals. λ denotes the empty string. Each rule has a name part and an expansion of the name part. Let Π = {Pi : 1 ≤ i ≤ 9} be the set of phrases and let p be a sequence of phrases drawn from Π; then S is generated as S ← p V_predicate or S ← p V_predicate mi?. As seen in Figure 6, the simple sentence "Ali bugün sabah okula geldi mi?" can be represented with the rule "S ← p V_predicate mi?", where p includes P1, P8 and P4.
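A minimal sketch of how the simple-sentence rule could be checked over phrase tags is given here. It assumes, purely for illustration and not as part of the original study, that the word groups of a sentence have already been tagged as P1 ... P9, V_predicate or the question particle "mi?", and that each phrase type occurs at most once in p. The function name is_simple_sentence is hypothetical.

```python
PHRASES = {f"P{i}" for i in range(1, 10)}  # phrase tags P1 .. P9

def is_simple_sentence(tags):
    """Check S <- p V_predicate | p V_predicate mi? over a list of tags."""
    tags = list(tags)
    # Optional question particle at the very end.
    if tags and tags[-1] == "mi?":
        tags = tags[:-1]
    # A regular simple sentence must end with the predicate.
    if not tags or tags[-1] != "V_predicate":
        return False
    p = tags[:-1]
    # Assumption: p holds distinct phrase tags in any order (it may be empty).
    return all(tag in PHRASES for tag in p) and len(set(p)) == len(p)

# "Ali bugün sabah okula geldi mi?" tagged as P1 P8 P4 V_predicate mi?
print(is_simple_sentence(["P1", "P8", "P4", "V_predicate", "mi?"]))
```

Because the tags are checked as a set, any permutation of the same phrases is accepted, which reflects the free phrase order of Turkish.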

CFG rules for noun phrase in simple sentence
All kinds of Turkish noun phrases for the simple sentence can be generated using the rules in Table 3. A noun phrase in a simple sentence does not include verbal words. The noun phrases in a simple sentence are simple nouns, possessive constructions, adjectives and noun modifiers with nouns, and so on. They are constructed from nouns, adjectives, pronouns, conjunctions and suffixes. X can be formed by combining more than one noun, by placing conjunctions between noun phrases, and by using possessive suffixes.
Table 3. CFG for noun phrase (NP) in simple sentence (non-terminals; terminals: nouns, adverbs and adjectives)
As seen in Figure 7, "senin robotun, demir kapı kolu ve camın pervazı" is a noun phrase. It may be used in the sentence "Senin robotun, demir kapı kolu ve camın pervazı bozuldu.". This noun phrase is constructed from possessive constructions, adjectives and noun modifiers with nouns, and conjunctions. In this table, sg1 is used for the singular 1st person, pl1 is used for the plural 1st person, and so on.
Table 4. CFG rules for simple and complex sentence
Unlike in the simple sentence, noun phrases may include gerunds (verbal nouns), participles (verbal adjectives) and/or converbs (verbal adverbs). For example, "Okula gelen Ayşe bugün çok üzgündü. (Ayşe who came to school was unhappy today.)" is a complex sentence. The subject phrase "Okula gelen Ayşe (Ayşe who came to school)" includes a verbal adjective. λ denotes the empty string.
Figure 8. Parsing complex sentence including gerund
"Bütün gün ders çalışmak beni çok yordu." is a complex sentence with a gerund. This sentence can be parsed as seen in Figure 8.
Table 5. Generating NPs in complex sentences (additional rules to Table 3 and Table 4)
NP with gerunds:
X_withGerund ← V_gerund (example: almak, gidiş)
X_withGerund ← p V_gerund, where p is not empty (example: eve gelmek)
X_withGerund ← V_gerund+'in' X+'i' (example: atışın sesi)
X_withGerund ← X+'in' V_gerund+'i' (example: dersin bitişi)
X_withGerund ← V_gerund + V_gerund (example: alış veriş)
NP with participles:
X_withParticiple ← V_participle X (example: biten iş)
X_withParticiple ← V_participle (example: biten, olan)
X_withParticiple ← p V_participle X, where p is not empty (example: at süren kız)
X_withParticiple ← p V_participle, where p is not empty (example: bu işi yapan)
As seen in Table 5, X_withGerund can be generated in different ways. Because adjectives can be used in place of nouns in Turkish, the rules X ← X_withGerund and X ← X_withParticiple are both valid. To obtain a general rule set, these rules should be added to the CFG rules for complex sentences and to the noun phrase rules for simple sentences. "Acele kararlarla yönetilen şirket" is a noun phrase. It is the subject phrase in the sentence "Acele kararlarla yönetilen şirket hata yapmaya mahkumdur.".
The verbal item in a complex sentence may be in gerund, participle or converb (verbal adverb) form. Table 6 shows example generation rules for complex sentences with a converb.
Figure 10. Complex sentence including converb parsing example
The sentence "Hızlıca üzerini değiştirip yola çıktı" is a complex sentence with a converb (verbal adverb). The adverb phrase P8 includes a verbal adverb. As seen in Figure 10, the adverb phrase P8 can be generated from P8 ← P8 P3 V_converb or P8 ← p V_converb.

CFG for compound sentence
In compound sentences, equal emphasis is placed on the sentences that are connected by conjunctions. The rules in Table 4 are the generation rules for simple and complex sentences. When the rules in Table 7 are added to the complex sentence rules in Table 4, a general Turkish context-free grammar with phrases and suffixes is obtained that covers simple, complex and compound sentences. If "S" denotes a complex or simple sentence, "S ← S C S" and "S ← C S C S" denote the compound sentence.
Here the symbol "C" denotes a conjunction such as "ama, ile, ve, ya, ya da" or certain punctuation marks such as ";" and ",". Conjunctions are generally used between sentences. "Ali geldi, okula gitti ama hiçbir şey söylemedi. (Ali came, went to school but did not say anything.)" is an example of a compound sentence. In Turkish, some sentences may also start with a conjunction; "Ya buradan gidersin ya da ben giderim (either you go or I go)" is an example. The sentence "Ali bugün okula geldi ve sırasına oturdu" is a compound sentence. It is formed from two different sentences, "Ali bugün okula geldi" and "Sırasına oturdu", joined by the conjunction "ve". As seen in Figure 11, the compound sentence S can be generated from S ← P1 P8 P4 V_predicate C P4 V_predicate or S ← p V_predicate C p V_predicate.
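The recursive compound-sentence rules can be sketched as a small recognizer over tag sequences. This is an illustration under assumptions, not the implementation used in the study: sentences are assumed to be pre-tagged with P1 ... P9, V_predicate and C, the basic sentence is reduced to "phrase tags followed by one predicate", and the names is_basic and is_sentence are hypothetical.

```python
from functools import lru_cache

PHRASES = {f"P{i}" for i in range(1, 10)}  # phrase tags P1 .. P9

def is_basic(tags):
    """S <- p V_predicate: phrase tags followed by exactly one predicate."""
    return (len(tags) > 0 and tags[-1] == "V_predicate"
            and all(tag in PHRASES for tag in tags[:-1]))

@lru_cache(maxsize=None)
def is_sentence(tags):
    """Recognize S <- S C S | C S | basic sentence over a tuple of tags."""
    if is_basic(tags):
        return True
    # S <- C S: a sentence may start with a conjunction.
    if tags and tags[0] == "C" and is_sentence(tags[1:]):
        return True
    # S <- S C S: try every inner conjunction as the split point.
    for i in range(1, len(tags) - 1):
        if tags[i] == "C" and is_sentence(tags[:i]) and is_sentence(tags[i + 1:]):
            return True
    return False

# "Ali bugün okula geldi ve sırasına oturdu"
example = ("P1", "P8", "P4", "V_predicate", "C", "P4", "V_predicate")
print(is_sentence(example))
```

Every recursive call works on a strictly shorter tuple, so the recognizer terminates; the cache only avoids re-checking the same sub-sequence.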

Turkish CFG for sentence that include quotation
In Turkish, quoted speech should be processed like a noun phrase in the context-free grammar. The main evidence for this hypothesis is the use of case suffixes after quoted speech; case suffixes are attached to phrases after the noun phrase.
1. Example for nominative phrase (no suffix): "Ali 'haydi buraya gel' dedi." What did Ali say? The answer is "Let's come here". It is the object phrase of the sentence. The sentence "Hasan, 'Ali buraya gel' dedi" is a quoted sentence. It is formed from two different sentences. Even though "Ali buraya gel" is a sentence, it is used in place of the object phrase of the main sentence. As seen in Figure 12, the quoted sentence S can be generated from S ← P1 P2 V_predicate together with P2 ← "S", which here expands to P2 ← "P1 P4 V_predicate".
Table 8. CFG for sentence that includes quotation
A rule for noun phrase: X ← "S"
For the CFG representation of sentences that include a quotation, the rule X ← "S" in Table 8 should be added to Table 5. Here the quotation marks around the sentence are important.
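The quotation rule can be sketched by letting a phrase carry a nested sentence. The representation below is an assumption made for illustration only: a quoted phrase is written as a pair (tag, quoted sentence), each phrase type is assumed to occur at most once per sentence, and the function name is_sentence is hypothetical.

```python
PHRASES = {f"P{i}" for i in range(1, 10)}  # phrase tags P1 .. P9

def is_sentence(items):
    """Check S <- p V_predicate where a phrase may be a quoted sentence (X <- "S")."""
    if not items or items[-1] != "V_predicate":
        return False
    seen = set()
    for item in items[:-1]:
        if isinstance(item, tuple):
            # A quoted phrase: the quoted part must itself be a sentence.
            tag, quoted = item
            if not is_sentence(quoted):
                return False
        else:
            tag = item
        if tag not in PHRASES or tag in seen:
            return False
        seen.add(tag)
    return True

# "Hasan, 'Ali buraya gel' dedi": S <- P1 P2 V_predicate with P2 <- "P1 P4 V_predicate"
quoted_example = ["P1", ("P2", ["P1", "P4", "V_predicate"]), "V_predicate"]
print(is_sentence(quoted_example))
```

The recursion on the quoted part is what makes the quotation behave like an ordinary noun phrase in the outer sentence.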

Turkish CFG for incomplete sentence
An incomplete sentence does not have to contain a predicate. In dialogs, an incomplete sentence may be generated from a noun phrase, subject phrase, object phrase, destination phrase, location phrase, source phrase, instrument phrase, adverb phrase and/or prepositional phrases. Figure 13 shows some examples of incomplete sentences. Such an incomplete sentence is generated from a noun phrase and a case suffix, like a phrase.

Table 9. CFG for incomplete sentence
A rule for sentence: S ← p
For the CFG representation of incomplete sentences, the rule "S ← p" in Table 9 should be added to Table 4.

Turkish CFG related with verb and noun type transformation
Table 10 lists the verbs that have a noun root, and Table 11 lists the nouns that have a verb root. The transformation rules between noun and verb are listed in these tables.
Figure 14. Verb which has noun root parsing example
As seen in Figure 14, the verb "Güzelleştirmektir" has the noun root "güzel". This example contains two noun-to-verb transformations and one verb-to-noun transformation. A verb with a noun root can be generated directly with the suffixes "len", "leş" and "le" and/or with case, time and subject suffixes. The verb "Evdir" consists of the noun "ev" with TMS suffixes.
In Table 11, sg1 is used for the singular 1st person, pl1 is used for the plural 1st person, and so on.
After a verbal adjective or verbal noun is created, it can be used as a noun, and it can be used to create new noun phrases with the rules for noun phrase generation. For example, "gelmek (to come)" is a gerund, and "zamanın gelmesi" can be generated from the rule X ← X+in X+i. This noun phrase, which includes a verbal item, is shown in Figure 15. The rules in Table 10 and Table 11 are selected from the CFG rules in the previous tables to show the verb-to-noun and noun-to-verb type transformations.

Data
The data for the evaluation is taken from the İTÜ NLP Machine Learning Corpus [34]. This corpus was generated for an English to Turkish machine translation project and contains one million parallel sentences in English and Turkish. For our evaluation we generate 3 datasets from this corpus: dataset-1 contains 1000 random simple sentences, dataset-2 contains 1000 simple and complex sentences, and dataset-3 contains 1000 sentences of all types. Dataset-1 and dataset-2 share 23 sentences, and dataset-2 and dataset-3 share 69 sentences.
Because the sentences are selected randomly, the shared sentences do not affect the result. The first dataset is used for the evaluation of the CFG for the simple sentence, the 1000 simple and complex sentences are used for the evaluation of the CFG for simple and complex sentences, and the 1000 sentences of all kinds are used for the other CFGs.
To compare and to find the average result, the Turkish National Corpus is also used. This corpus is compiled from datasets in 9 different domains and has 50,000,000 words [35]. 10,000 sentences are downloaded from this corpus and grouped according to their sentence types in the same way as the İTÜ Corpus datasets: dataset-1 contains 1000 random simple sentences, dataset-2 contains 1000 simple and complex sentences, and dataset-3 contains 1000 sentences of all types. Dataset-1 and dataset-2 share 43 sentences, and dataset-2 and dataset-3 share 78 sentences.

Evaluation and Results
In his study "The evaluation metric in generative grammar", John Goldsmith proposes two ways to evaluate the accuracy of a generative language [33]. The first method measures how appropriate the formal language is for the given language data, and the second measures the accuracy of the formal language independently of the language data. In the first method, we test whether each sentence can be generated using the related rules of the related CFG. If the rules allow the sentence to be derived, the sentence is scored "True"; if the sentence cannot be generated with the related rules of the related CFG, it is scored "False".
The CFG accuracy is equal to the number of sentences scored "True" divided by the total number of sentences. In Table 12, the evaluated values are compared with our previous study, "Turkish Context Free Grammar Rules with Case Suffixes and Phrase Relation" (Version-1) [1]. When we looked for the reason for the increase in accuracy, we found that the use of prepositions such as "için, e göre, gibi" for the prepositional phrase and the additional rules for incomplete and quoted sentences caused this difference in average accuracy. CFG-IV and CFG-V cannot be compared directly with the previous version, because the previous version does not have rules for incomplete and quoted sentences. The Accuracy-V1 values are taken from the previous study.
As seen in Table 12, compared with the first study, including all word and sentence types, considering adjectives and prepositions, and covering inverted sentences, incomplete sentences and genitive case structures with verbal items causes an 8.2% increase in the accuracy of the CFG for all sentence types. The second method measures the accuracy of the formal language independently of the language data. To approximate the second method and to decrease the error, the evaluation is also carried out on an independent language corpus, and the average accuracy value is taken into account.
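The accuracy measure and the averaging over corpora described in this section can be sketched as follows. The True/False scores below are made-up toy values used only to show the arithmetic, not results from the study, and the function names are hypothetical.

```python
def cfg_accuracy(scores):
    """Number of sentences scored True divided by the number of all sentences."""
    if not scores:
        raise ValueError("at least one scored sentence is required")
    return sum(1 for score in scores if score) / len(scores)

def average_accuracy(scores_per_corpus):
    """Mean of the per-corpus accuracies, as used to reduce corpus-specific error."""
    accuracies = [cfg_accuracy(scores) for scores in scores_per_corpus]
    return sum(accuracies) / len(accuracies)

# Toy scores: True means the CFG rules could derive the sentence.
corpus_a = [True, True, False, True]   # accuracy 0.75
corpus_b = [True, False]               # accuracy 0.50
print(cfg_accuracy(corpus_a), average_accuracy([corpus_a, corpus_b]))
```

Averaging the per-corpus accuracies, rather than pooling all sentences, gives each corpus the same weight regardless of its size.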
As seen in Table 13, the CFG rules are grouped according to their function. For example, CFG-I is used for the evaluation of simple sentences, and CFG-V is used for the evaluation of all sentences, including simple, complex, compound, quoted and incomplete sentences. As seen in Table 14, there is a difference of nearly 10% between the accuracy values of CFG-III, CFG-IV and CFG-V. The reason may be the more generic content of the TNC corpus, which contains speech-like material such as quoted, incomplete and irregular sentences and phrases.
Finally, the average accuracy values in different fields are calculated. 500 sentences are used for the different field types.
Here both the İTÜ and TNC corpora are used. The results can be seen in Table 15. This grammar representation may have different accuracy values in different fields; in other words, the average accuracy of the output changes according to the field.
In academic sentences, because there are few quoted and incomplete sentences, the accuracy results for CFG II, CFG III and CFG IV are similar. The largest difference between CFG II, CFG III and CFG IV is seen in the "Story" field.

Discussion and Conclusion
In conclusion, Turkish is one of the most regular and rule-based languages in the world. When we represent Turkish sentences with CFG rules that have a recursive structure and test this representation on two different corpora, the average accuracy is found to be 81.9%. The suggested context-free grammar for Turkish covers a large portion of the Turkish corpora.
Natural language understanding has recently become an important topic. In order to understand the semantic meaning of a sentence, we should separate the sentence into its meaningful parts and their functions, and the relationships between these parts should be known. Correctly parsed sentences are necessary in many fields of Natural Language Processing. With our study, a sentence can be parsed into its phrases and basic word types, together with its suffixes, and the transformations of and relations between words can be obtained.
The suggested CFG provides a general, rule-based method.
Using the suggested CFG, a sentence can be separated into its phrases. In Turkish, understanding the phrases is very important for understanding the sentence: who performed the action, when it happened, where it happened, and so on. Using the CFG rules, the phrases can also be decomposed into their root words. Complex sentences can be understood together with their verbal adjectives, gerunds or converbs, and so the inner sentences can be understood as well. This CFG also recognizes question, compound, quoted and incomplete sentences. The noun phrase parsing rules can be used in Named Entity Recognition (NER) applications. The suggested CFG for Turkish is expected to become a resource for different areas of NLP.

Figure 2. Free phrase order example

Figure 3. Predicate and other phrases example

Figure 9. Noun phrase in complex sentence

Table 7. CFG rules for compound sentence
Non-terminals:
S ← S C S | C S | S (basic, complex and compound sentences)
C ← ama | ve | veya | , | ; etc.

Figure 15. Verb which has noun root parsing example

Table 12. Suggested CFG accuracy values compared with previous version

Table 13. CFG and related rules
Table 14 shows the accuracy values of the different CFGs according to the sentence types they include. The accuracy values are calculated on the İTÜ Machine Learning Corpus and the Turkish National Corpus (TNC), and the average accuracy is calculated to decrease the error.

Table 14. Grammars' accuracy in different corpora

Table 15. Suggested grammars' accuracy in different fields
Columns: CFG rules, Academic, Story, Tweet, Novel, News