Finding Structured and Unstructured Features to Improve the Search Results of Complex Questions

Current research on question answering (QA) usually retrieves answers only from unstructured text resources such as collections of news articles or web pages. According to our observation of Yahoo!Answers, users sometimes ask complex natural language questions that contain both structured and unstructured features. In general, answering complex questions requires considering not only unstructured but also structured resources. In this work, we propose a new idea to improve the accuracy of answers to complex questions by recognizing the structured and unstructured features of the questions and integrating both kinds of data on the web. Our framework consists of three parts: Question Analysis, Resource Discovery, and Analysis of the Relevant Answer. In Question Analysis, we use a few assumptions and try to find the structured and unstructured features of the questions. In Resource Discovery, we integrate structured data (a relational database) and unstructured data (web pages) to take advantage of both kinds of data and obtain correct answers. In the Relevant Answer part, we find the best top fragments from the context of the relevant web pages, compute a matching score between the results from the structured and the unstructured data, and finally use a QA template to reformulate the question.


Introduction
Analyzing the focus of a question is not a new issue in question analysis research. A large part of the purpose of those studies is to identify the question type or user intention clearly and definitely. Understanding the key features of questions is the prominent task of those studies for reaching the user's information need. This topic becomes more interesting when facing long and complex questions. In some studies, complex questions refer to long-answer questions: the answer to a complex question is often a long passage, a set of sentences, a paragraph, or even an article [1]. Although many prior studies of keyword search over text documents (e.g., HTML documents) have been proposed, they all produce a list of individual pages as results [2].
Automatic question answering systems usually give a document or a passage containing the answer as the result. For the example question "Who is president of USA?", we usually find results such as those shown in figure 1. We can see that the result is usually a bag of words. The asker's intention is actually quite clear: they need the name of the current president of the USA. The results from search engines tend to be a bag of words that contains relevant answers.
Sometimes it is difficult to obtain the answer to a complex question, since the answer cannot be retrieved from only one web page or one resource. In fact, it is very common that the answer to one complex question is spread across several web pages. Recently, Question Answering research has faced the challenge of complex questions [3][4][5][6]. The details of our observation are described in the next section.
In this work, a complex question is a natural language question that contains structured and unstructured features. Thus, we propose an idea to integrate structured and unstructured data on the web to answer such questions, which is effective for improving the search results. The resources to consider include not only unstructured data but also structured data. One example is "What is the capital city of the country that is the largest country in Arabian Peninsula?". The focus of this question is clearly to know the capital name of the country that is the largest in the Arabian Peninsula. From this question we can find "the capital city" as the structured feature of the question and "that is the largest country in Arabian Peninsula" as an unstructured feature. With these features, we can effectively retrieve the relevant resource data for the answer from both structured and unstructured data.
For comparison, figure 2 shows that the result from the search engine Bing is usually a relevant passage that contains the needed answer; the factual answer is Riyadh.
In another example, in the topic "movie", we can find movie databases on the web as structured data, and a huge number of web pages containing movie information also exist. Actually, data for many domains is stored as structured data on the web. These are our motivations in this work, and the major concern is how to find the structured and unstructured features of the question and integrate the two kinds of data as an effective resource to improve the answer. Structured data on the web is prevalent but often ignored by existing information search [7]. Moreover, structured data on the web usually has high-quality content, such as flight schedules, library catalogs, sensor readings, patent filings, genetic research data, and product information. Recently, the World Wide Web has witnessed an increase in the amount of heterogeneous collections of structured data, such as product information, Google Base, tables on web pages, and the deep web [8].
Given the complementary characteristics of the two kinds of data, it is very useful to take advantage of both. Users will not care from which kind of resource the relevant information is found; they only want better answers to their questions.
Since a question is the primary source of information to direct the search for the answer, a careful, high-quality analysis of the question is of utmost importance in domain-restricted QA. [9] explains three main question-answering approaches, based on Natural Language Processing, Information Retrieval, and question templates. [10] proposed other approaches according to the resources on the web. Lin [11] proposed a federated approach and a distributed approach. The federated approach applies techniques for handling semistructured data to access web sources as if they were databases, allowing large classes of common questions to be answered uniformly. In the distributed approach, large-scale text-processing techniques are used to extract answers directly from unstructured web documents.
NLP techniques are used in applications that query databases, extract information from text, retrieve relevant documents from a collection, translate from one language to another, generate text responses, or recognize spoken words and convert them into text. [12] describes NLP-based QA as systems that allow a user to ask a question in everyday language and receive an answer quickly and succinctly, with sufficient context to validate the answer. [13] distinguishes questions by answer type: factual answers, opinion answers, or summary answers. Some kinds of questions are harder than others; for example, "why" and "how" questions tend to be more difficult because they require understanding causality or instrumental relations, which are typically expressed as clauses or separate sentences [12].
IR systems are traditionally seen as document retrieval systems, i.e., systems that return documents relevant to the user's information need but do not supply direct answers. The Text Retrieval Conferences (TREC) aim at comparing IR systems implemented by academic and commercial research groups. The best-performing system within the two latest TRECs, Power Answer [14], reached 83% accuracy in TREC 02 and 70% in TREC 03. A further step towards the QA paradigm is the development of document retrieval systems into passage retrieval systems [15][21].
Template-based QA extends the pattern-matching approach of NLP interfaces to databases. It does not process text; like IR enhanced with shallow NLP, it presents relevant information without any guarantee that the answer is correct. This approach is mostly useful for structured data, as mentioned in [10]. [22] proposes a generic model of template-based QA that shows the relations between a knowledge domain, its conceptual model, structured databases, question templates, and user questions, and describes about 24 constituents of template-based QA. [23] used a kind of template and an ontology in question analysis, working on structured information in text.
The Considered Problems: Existing search engines cannot integrate information from multiple unrelated pages to answer queries meaningfully [2]. Moreover, they usually consider only one kind of resource: either unstructured data such as web pages, or structured data such as Freebase (used by Powerset).
Question Analysis: In the beginning of our idea, we only consider questions whose prefix is a question word (What, Who, Where, When, Which, Why, How) for each topic domain, including Book, Country, and Movie.
In this first step, we need to know the structured and unstructured features that exist in the questions. For simplicity, in this initial work we only consider one kind of complex question that may contain a structured and an unstructured feature. As is well known, a natural language question has many forms of syntax and expression. Hence, we make some assumptions in this step according to our observation of questions from Yahoo!Answers (in English). Besides finding those features, we also want to find the focus and subfocus of the question. Consider the same example, "What is the capital city of the country that is the largest country in Arabian Peninsula?". Here the Question Topic is "country", the Question Focus is "the capital city", the Question Subfocus is "that is the largest country in Arabian Peninsula", the Structured feature is "the capital city", and the Unstructured feature is "that is the largest country in Arabian Peninsula".
We can see that the structured feature is the question focus; this situation is one of the issues in question analysis. Our question data are mostly entity questions, and we expect the answers to tend toward structured data.
Resource Discovery and Reaching the Relevant Answer: Figure 3 shows the framework used in this work. We take advantage of two kinds of data. The structured data is simple relational data, e.g., a single table with attribute names and attribute values. For the unstructured data, we crawl web pages from several websites, including Wikipedia. In this initial work, we try to integrate the answer results from the two different types of data resources. One of the basic problems of integration is the relevant-answer matching problem; in our work this matching is mostly about matching terms from both resources. We propose a simple linear combination model to compute the matching score between the unstructured data and the structured data for a given complex question. Finally, based on this simple answer-matching model, the answer can be reached from both kinds of resources, and we can improve the resulting answer.
We focus on two main tasks: the first step is finding the structured and unstructured features in the question; the second step is retrieving the relevant information over structured and unstructured data to obtain the exact answer. Some notations and definitions used in this work are listed below.

Methodology
Question Analysis: In the beginning of our idea, we only consider questions whose prefix is a question word (What, Who, Where, When, Which, Why, and How). We observed 100 questions across three topics: Book, Country, and Movie. We consider questions that contain the phrase "of a" or "of the", or that have a main clause and a subordinate clause. We propose the Algorithm for Finding Structured-Unstructured Features, which consists of a first step of finding the Question topic (Qt), Question focus (Qf), and Question subfocus (Qs), and a second step of finding the Feature topic (Ft), Feature structured (Fs), and Feature unstructured (Fu) from the question.
To measure whether Qf is Fs or Fu, we use Equation (1): Qf is taken as the structured feature when it matches an attribute name or attribute value in the structured data, and as the unstructured feature otherwise:

    Fs = Qf, if Qf ∈ An(Ds) ∪ Av(Ds);  Fu = Qf, otherwise    (1)

where Fs is Feature_structured, Qf is Question_focus, Ds is Data_structured, An is Attribute_name, and Av is Attribute_value.
Next, to measure whether Qf can become the Focus of Attribute names (FAn), we use Equation (2):

    FAn = Qf, if Qf matches an attribute name An in Ds    (2)

Figure 4 shows the algorithm for finding structured-unstructured features.
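As an illustration of these two steps, the following sketch decomposes a question and labels its focus as structured or unstructured. The regular expression and the splitting heuristics are our assumptions for this sketch, not the paper's exact algorithm (which is given in figure 4):

```python
import re

def analyze_question(question, attribute_names):
    """Split a complex question into topic (Qt), focus (Qf) and subfocus (Qs),
    then label the parts structured (Fs) or unstructured (Fu)."""
    # Strip the leading question word (What/Who/Where/When/Which/Why/How).
    m = re.match(r"(?i)(what|who|where|when|which|why|how)\s+is\s+(.*)",
                 question.rstrip("?"))
    if not m:
        return None
    body = m.group(2)
    # Assumption: the focus precedes "of the", and the subfocus is the
    # relative clause introduced by "that".
    focus_part, _, rest = body.partition(" of the ")
    topic, _, subfocus = rest.partition(" that ")
    qf = focus_part.strip()
    qs = ("that " + subfocus.strip()) if subfocus else ""
    # Qf is a structured feature (Fs) if it matches an attribute name in Ds;
    # the subfocus clause is treated as the unstructured feature (Fu).
    fs = qf if any(a in qf for a in attribute_names) else None
    return {"Qt": topic.strip(), "Qf": qf, "Qs": qs,
            "Fs": fs, "Fu": qs if qs else (None if fs else qf)}

q = "What is the capital city of the country that is the largest country in Arabian Peninsula?"
print(analyze_question(q, ["capital city", "population", "area"]))
```

On the running example, this yields Qt "country", Qf/Fs "the capital city", and Qs/Fu "that is the largest country in Arabian Peninsula".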
Resource Discovery: Most information on the web is stored in semi-structured or unstructured documents; making this information available in a usable form is the goal of text analysis and text mining systems [24]. In this work we use, on the Data_structured (Ds) side, a single-table relational database, and on the Data_unstructured (Du) side, web pages [25], [30]. The main aim of those works is to find the advantage of each resource: the richer the resources, the better the answer. In particular, [8] notes that askers do not care about the resource; they only want to find better answers. Other works [31][32] also use both structured and unstructured data to improve the answer. [2] is the first work on keyword search over integrated structured, semi-structured, and unstructured data with a graph approach, and proposed a kind of integration of entities that exist in table-like format on web pages, i.e., an integration of information within unstructured data.

Figure 4. Algorithm of Finding Structured-Unstructured Features
Using structured and unstructured data in Information Retrieval or Question Answering research is not a new issue. However, since the amount of high-quality structured data on the web is increasing and has not yet been fully explored, combining the two is a fresh research issue in Question Answering. One previous prominent work finds structured content over text [33]. [34] proposed the integration of web documents with the myriad structured information about real-world objects embedded in static web pages and online web databases, and reported that a hybrid approach using both structured and unstructured features gave the best result for object information retrieval.
Consider the question example "What is the capital of the country that is located on a long-boot-shaped peninsula?". The Question_focus (Qf) is the same as the Feature_structured (Fs), and "capital" is the Focus_Attribute_name (FAn), which is one of the Attribute_names (An) in the Data_structured (Ds).
The Question_subfocus is identified as the Feature_unstructured (Fu), "that is located on a long-boot-shaped peninsula", and is annotated as terms in the Data_unstructured (Du). From the annotated terms in Du, some useful attribute names and their corresponding values can be extracted from the terms around the annotations, and the best snippet or fragment in Du can be found.
To find the relevant page Du_j, we use the cosine similarity measure defined in Equation (3), and use Fu to find the annotated snippet:

    S(Du_j, q) = Σ_t (w_t,Duj × w_t,q) / ( sqrt(Σ_t w_t,Duj²) × sqrt(Σ_t w_t,q²) )    (3)

where S is the cosine similarity score between Du_j and q, Du is Data_unstructured, and q consists of the Feature_topic and Feature_unstructured. The weight w is based on the TF-IDF weighting scheme.

Inspired by previous work [15], we want to find the relevant snippet of Du_j. The term weight w in Equation (3) follows a TF-IDF scheme:

    w_t = tf_t × log(N / n_t)    (4)

where N is the total number of attribute values in Ds, and n_t is the number of attribute values (Av) that contain t in Du_j.
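As a concrete sketch of this retrieval step, the following computes TF-IDF weights and the cosine similarity of Equation (3). The helper names and the tiny document-frequency table are illustrative assumptions, not the paper's implementation:

```python
import math
from collections import Counter

def tfidf_weights(terms, doc_freq, n_docs):
    """w_t = tf_t * log(N / n_t), the weighting of Equation (4)."""
    tf = Counter(terms)
    return {t: tf[t] * math.log(n_docs / doc_freq.get(t, 1)) for t in tf}

def cosine(wq, wd):
    """Cosine similarity S(Du_j, q) between two term-weight vectors."""
    shared = set(wq) & set(wd)
    num = sum(wq[t] * wd[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in wq.values()))
           * math.sqrt(sum(v * v for v in wd.values())))
    return num / den if den else 0.0

# Hypothetical document frequencies over a 10-document collection.
df = {"capital": 2, "peninsula": 5, "country": 8}
wq = tfidf_weights(["capital", "peninsula"], df, 10)          # query: Ft + Fu terms
wd = tfidf_weights(["capital", "peninsula", "country"], df, 10)  # one page Du_j
print(cosine(wq, wd))
```

Pages Du_j would then be ranked by this score and the top fragments extracted around the annotated Fu terms.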
Consider all terms in the snippet as candidates for the Attribute_unstructured (Au), and calculate the answer-matching score between the unstructured data and the structured data to obtain the matching score of a record R. We propose a matching score inspired by full-string matching based on the Jaccard coefficient and by n-gram matching. First, we use the Jaccard coefficient to calculate the answer-matching score between a record R in Ds and Au:

    J(R, Au) = |R ∩ Au| / |R ∪ Au|    (8)

Second, n-grams are typically used in approximate string matching by "sliding" a window of length n over the characters of a string to create a number of length-n grams; a match is then rated as the number of n-gram matches within the second string over the possible n-grams. Inspired by [35], we use Equation (9) to calculate the answer-matching score between R and Au, where R contains a set of Av and Au is a sequence of text, compared as pairs of n-grams. Let R: x_1 ... x_k and Au: y_1 ... y_l; then

    sim_n(R, Au) = |G_n(R) ∩ G_n(Au)| / |G_n(R)|    (9)

where G_n(·) is the set of n-grams of a string and both strings contain at least one complete n-gram (10). If both strings are exactly one n-gram, the definition is strictly binary: 1 if the n-grams are identical and 0 otherwise (11). We use n-grams to find the similarity between Du and Ds while considering letter positions, so we find a similarity even when the match is not exact. The answer-matching score is very important for matching the unstructured data with the structured data. All of this uses an IR approach, and the final score is a linear combination of the two matching scores,

    match_score = α × J(R, Au) + (1 − α) × sim_n(R, Au)

where α is a weighting parameter (0.1 to 0.9).
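The Jaccard, n-gram, and combined matching scores described above can be sketched as follows. This is a minimal illustration under our reading of the garbled equations (one plausible n-gram denominator, the first string's gram count); the example strings are hypothetical:

```python
def jaccard(record_values, snippet_terms):
    """Eq. (8): set overlap between a record R in Ds and candidate terms Au."""
    r, a = set(record_values), set(snippet_terms)
    return len(r & a) / len(r | a) if r | a else 0.0

def ngrams(s, n=3):
    """Slide a window of length n over the characters of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_score(x, y, n=3):
    """Eq. (9), as read here: share of x's n-grams that also occur in y."""
    gx, gy = ngrams(x, n), ngrams(y, n)
    return len(gx & gy) / len(gx) if gx else 0.0

def match_score(record_values, snippet_terms, record_text, snippet_text, alpha=0.5):
    """Linear combination of the two scores; alpha in [0.1, 0.9]."""
    return (alpha * jaccard(record_values, snippet_terms)
            + (1 - alpha) * ngram_score(record_text, snippet_text))

print(match_score(["Riyadh"], ["Riyadh"], "riyadh", "riyad", alpha=0.6))
```

The n-gram term rewards near matches such as "riyadh" vs. "riyad", which exact set matching would miss.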
To reach the final answer, we use a QA-template approach modified with an IR approach as structured retrieval. The QA-template approach is used to reformulate the question and build the structured retrieval.
For the example question "What is the capital city of the country that is the largest country in Arabian Peninsula?", the QA template is shown in figure 8.
What is <FAn> of <Ft> <Fu>

As stated at the very beginning of our explanation, we use two kinds of data, covering three topics. The structured data is a single-table relational database, and the unstructured data is web pages from websites.
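A minimal sketch of how the template could be instantiated and then used against the structured table; the table row, helper names, and the added question mark are our assumptions for illustration:

```python
def fill_template(fan, ft, fu):
    """Instantiate the 'What is <FAn> of <Ft> <Fu>' reformulation template."""
    return f"What is {fan} of {ft} {fu}?"

def answer_structured(fan, table, matched_name):
    """Structured-retrieval half: return the FAn value of the record that
    the answer-matching step selected."""
    for row in table:
        if row.get("name") == matched_name:
            return row.get(fan)
    return None

# Hypothetical single-table relational data for the topic "Country".
countries = [{"name": "Saudi Arabia", "capital city": "Riyadh"}]

q = fill_template("the capital city", "the country",
                  "that is the largest country in Arabian Peninsula")
print(q)
print(answer_structured("capital city", countries, "Saudi Arabia"))
```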

Result and Analysis
In Question Analysis we use the evaluation metrics Recall (R), Precision (P), and F-Measure. In Resource Discovery and reaching the relevant answer, besides Precision, Recall, and F-Measure, we use MRR with different fragment sizes, different thresholds of match_score, and different values of the weighting parameter.
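For reference, these metrics can be computed as follows; this is a standard sketch, not tied to the paper's data:

```python
def precision_recall_f(tp, fp, fn):
    """Precision, Recall, and F-Measure from true/false positive and
    false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def mrr(first_correct_ranks):
    """Mean Reciprocal Rank: each entry is the rank of the first correct
    answer for a question, or None if no correct answer was returned."""
    return sum(1.0 / r for r in first_correct_ranks if r) / len(first_correct_ranks)

print(precision_recall_f(8, 2, 2))
print(mrr([1, 2, None, 4]))
```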
We conducted several experiments to show how our simple approach can improve the results for complex questions by finding the structured and unstructured features and using a light combination of structured and unstructured data. The experiments are divided into two sections: Question Analysis and the resulting answers.
We did the experiments on small unstructured data. Given this condition, we first consider only the top-ranked document and run the experiments with different fragment sizes (50, 75, and 100) and different numbers of fragments (n: 3, 5, 7, and 10).
For the above results on the topics "Movie" and "Book", the MRR values, shown in Table IV, are not really high but are very promising for this initial work, which used a shallow approach for Question Analysis and the Relevant Answer.
We also compared our approach to two other systems, QuALiM and Powerset, because their unstructured data resource is similar to ours: Wikipedia. Since their results are snippets that contain the answer, we manually calculated the MRR of their results, examining whether the answer exists in the snippet; an answer is counted as correct if the correct answer can be found in the snippet.

Conclusions
We have presented preliminary work on finding structured and unstructured features in complex questions. A complex question in this work is a natural language question that contains structured and unstructured features; a structured feature refers to structured data and an unstructured feature to unstructured data. Structured data grows rapidly on the web but is usually ignored by existing search engines. This work shows the combination of structured and unstructured data. Besides using two kinds of data, we also use two approaches: an IR approach that tends toward unstructured data and a QA-template approach that tends toward structured data. Historically, those two approaches have worked separately. Another idea of this work is to try a structured approach on unstructured data.
This work gives fairly good results for Question Analysis on all evaluation metrics: Precision, Recall, and F-Measure. In finding the relevant answer, the results are not really high but still promising, with an average above 0.5. In the comparison with two other systems, QuALiM and Powerset, our approach outperforms both; we compared against them because they use similar unstructured data, Wikipedia (English version).
To our knowledge, the idea in this work is novel, because previous related research worked on either unstructured or structured data alone, and we believe it will be very useful. Since this is preliminary work, much remains to be done. Our future work will emphasize the Question Analysis and matching-measure parts, improving Question Analysis to handle many kinds of complex questions, even long ones.
We also plan to improve the scoring measure: from our observation, the main work of integrating structured and unstructured features is the matching problem, and this part still has a long journey in data integration. On the unstructured side, we will work on bigger unstructured data that is not closely tied to the structured data; on the structured side, on more complex structured data with multiple tables and multiple schemas.