Crowdsourcing in developing repository of phrase definition in Bahasa Indonesia



Introduction
A repository is a place that stores resources containing a large amount of data, usually with tools to access the data [1]. A language repository stores language resources of various types, such as words, phrases, sentences, and paragraphs, which are stored and processed by electronic means. A language repository (commonly called a corpus) may take the form of a dictionary, a thesaurus, or a collection of annotated texts.
Language repositories are important as a reference for using a language, and they can be valuable for preserving a language along with its cultural context [2]. In the field of natural language processing, language corpora play an important role in running applications that use language-dependent algorithms. Corpora in the form of N-grams, for example, can be used to calculate word and sentence similarity [3,4]. Language corpora are also essential as testing materials in the development of methods, such as those implemented in classification, speech recognition, and machine translation [5][6][7]. Supervised algorithms use corpora as training data, while unsupervised methods may use corpora as testing data to assess algorithm performance. Even language-independent methods may need corpora to prove the applicability of the methods in a target language.
Bahasa Indonesia is the formal language of Indonesia, so it has a large number of users, as more than 200 million people reside in the country. Unfortunately, language repositories in Bahasa are scarce. The largest open repository takes the form of the Big Dictionary, which is widely available in print and is also accessible online [8]. However, many definitions in the Big Dictionary are outdated. A book containing a thesaurus in Bahasa has been published, but the number of items is small [9]. Many researchers claim to have used data repositories that they built during their research, but most of these repositories are not openly available [10,11]. Developing language repositories can be costly in terms of data collection, annotation, and validation [12,13]. Many parties have attempted to build repositories of Bahasa, but the results seem to have been unexciting [14][15][16][17]. Experts or native speakers may need to get involved to validate or annotate the items.

ISSN: 1693-6930. TELKOMNIKA Vol. 17, No. 5, October 2019: 2321-2326

This paper describes an effort to build a repository of phrase definitions in Bahasa Indonesia. The effort uses technology-mediated social participation through "SiS", an application that manages undergraduate students' final projects. Such a model of repository construction is commonly called crowdsourcing, which describes the act of taking a task once performed by employees and outsourcing it to a large network of people [18].

Research Method

Application
An application named SiS has been used to collect data from the crowd. The main purpose of the application is to manage undergraduate students' final projects. The long-term use of the application makes it potentially suitable for implementing the crowdsourcing model. At the end of their undergraduate study, students carry out a brief research project and write a scientific paper that explains the results. The project proposal consists of a title, paragraphs of description, and several keywords. During their research activities, students have to record their progress in an online log book at least eight times.
A companion add-on has been inserted into the main application. After every third log entry a student writes, the add-on displays a small pop-up box to attract the student's attention. The pop-up box contains a message that invites students to participate in developing or validating repository items. Participation is not obligatory, which may keep uninterested parties from taking part. When a student clicks a link in the pop-up box, the system redirects to web pages that enable the user to contribute, either by writing a new phrase definition or by validating existing phrase definitions. Because students' scientific papers are written in Bahasa Indonesia, the resulting repository will be in Bahasa.
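The invitation trigger described above can be sketched in a few lines. This is only an illustration, not the SiS implementation: the function name `should_show_invitation` and the parameter `every` are hypothetical, and the sketch assumes the pop-up appears after every third log entry. The source wording could also mean that the pop-up appears from the third entry onward.

```python
def should_show_invitation(log_count: int, every: int = 3) -> bool:
    """Return True if the log entry just saved should trigger the pop-up.

    Assumption: the invitation is shown after every `every`-th log entry.
    """
    if log_count <= 0:
        return False
    return log_count % every == 0


# Over the minimum of eight log entries, list the entries that trigger a pop-up.
shown_at = [n for n in range(1, 9) if should_show_invitation(n)]
print(shown_at)  # [3, 6]
```

Under this assumption, a student who writes only the required eight log entries would see the invitation twice.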

Data Collection and Analysis
The application is relatively new to the university, and the add-on is newer still. After the system had run for one semester, we examined the number of students served by the system and the number of students who contributed to the crowdsourcing activities. More importantly, we counted the phrase definitions and the definition validations contributed by the crowd. We then identified the students with the most contributions and invited them to a survey. Of the 34 students invited, 16 took part.

We had prepared a questionnaire containing 22 statements about the application feature and student perception, each of which required the student's confirmation, as shown in Table 1. The statements are grouped into four categories intended to observe user knowledge, user perception, product usability, and user interest. The first category covers the knowledge that leads the user to the feature. The second covers user perception during interaction with the system, including the significance of keywords for their work (importance and relevance). The third observes product usability, that is, whether the user's contribution brings benefits. The last category is user interest. The main purpose of the research is to indicate whether the information system can be employed for crowdsourcing. Respondents state whether they strongly agree, agree, disagree, or strongly disagree with each statement in the questionnaire. The responses were then scored on a Likert scale, with each response converted into one of the values 4, 3, 2, or 1, respectively.
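The scoring step can be illustrated with a short sketch. The mapping of answers to the values 4, 3, 2, 1 follows the text; the function name `category_means`, the data layout, and the sample answers are assumptions made for illustration and are not the survey data.

```python
from statistics import mean

# Answer-to-score mapping as described in the text.
LIKERT = {
    "strongly agree": 4,
    "agree": 3,
    "disagree": 2,
    "strongly disagree": 1,
}


def category_means(responses):
    """Average the Likert scores per questionnaire category.

    `responses` is a list of (category, answer) pairs (hypothetical layout).
    """
    scores = {}
    for category, answer in responses:
        scores.setdefault(category, []).append(LIKERT[answer.lower()])
    return {category: mean(values) for category, values in scores.items()}


# Made-up answers for illustration only.
sample = [
    ("knowledge", "strongly agree"),
    ("knowledge", "agree"),
    ("perception", "agree"),
    ("perception", "disagree"),
]
result = category_means(sample)
print(result)  # {'knowledge': 3.5, 'perception': 2.5}
```

A category mean above 3 would then correspond to an average response between "agree" and "strongly agree".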

Result
After a year of implementation, the number of application users has grown and reached a steady level of about 1000 people. Users come and go because of the nature of student final projects. Students work on a final project for one semester on average, and although some extend it for another semester, they eventually finish their work, pass the course, graduate, and stop accessing the application.
An average user can be assumed to have access to the system for six months (one semester), so our analysis can be confined to a six-month frame. We selected the time frame from February to August 2018 for analysis. As the application was relatively new, the number of users was still growing, from 205 in February to 1086 in August. The growth was quick in the early phase of the semester and very slow by the end of the semester. Over the same period, the number of users who participated as definers and validators was also growing, but at a steadier and slower rate.
Not all users participate in crowdsourcing activities. In February, about 17% of users contributed by writing definitions, i.e. acted as definers, and the proportion increased slowly to reach 25% in August. For August, this percentage is equivalent to 271 users. A smaller proportion of users involved themselves in validation, ranging from 12% to 17% of the total users, as shown in Figure 1.

Figure 1. Number of users, definers and validators during the period of observation
Although fewer users took on the task, validation produced more output than definition writing. The explanation is straightforward: validating a definition takes much less effort than writing one. A user only needs to select a multiple-choice option and click a button to indicate whether a phrase definition is accurate and acceptable. Figure 2 shows that the number of phrases increased by more than 3000 items from February to August. Over the same period, the number of definitions increased by about 1300 items, while the number of validations increased by more than 2700 items. These figures imply that about 40% of new phrases receive a definition from users and that, on average, each definition is validated by two users.
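The two ratios quoted above can be rechecked from the rounded counts in the text. The sketch below uses those approximate values ("more than 3000", "about 1300", "more than 2700"), so the results are only indicative; the variable names are illustrative.

```python
# Approximate increases between February and August, taken from the text.
new_phrases = 3000
new_definitions = 1300
new_validations = 2700

# Share of new phrases that received a definition.
definition_coverage = new_definitions / new_phrases

# Average number of validations per new definition.
validations_per_definition = new_validations / new_definitions

print(f"{definition_coverage:.0%}")         # 43%
print(f"{validations_per_definition:.1f}")  # 2.1
```

With these rounded inputs, the coverage comes out slightly above the "about 40%" stated in the text, and the validation ratio is close to two.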
Most of the 271 users who contributed definitions did so for phrases in their own scientific paper, i.e. for their own project. Interestingly, about 34 people wrote at least 7 phrase definitions, which means that they also wrote definitions for phrases from other students' projects.

The users' responses to the questionnaire are displayed in Figure 3. The response for every category is above 3 on the Likert scale, which means that respondents on average range from agree to strongly agree with the statements in the questionnaire. Most users involved in crowdsourcing were well aware of the tasks: they knew what they were doing and what it was about. They also have a positive perception of the feature for defining phrases and validating definitions, and feel that the feature is useful to some degree.

Figure 3. Response of users that were active in crowdsourcing; Category A is knowledge, B is perception, and C is usability

Category D of the questionnaire asks about the user's intention in taking part in crowdsourcing activities. Among the 16 students who filled in the questionnaire, ten thought that they would get an additional score in their final project by getting involved in the activities, and seven said that they did it because their supervisor suggested it. Most of them (13) said that they gained a sort of reward from doing a good thing. We observed at least three students who did all the work purely for the sake of doing good.

Discussion
An information system that manages undergraduate students' final projects has been deployed, together with an add-on to crowdsource a repository of phrase definitions. The student users write definitions of the keywords that they use in the scientific paper for their final project. Aside from defining phrases, students may opt to validate definitions written by their peers. Writing and validating phrase definitions are not obligatory. Instead, a pop-up box shows up to users randomly and intermittently, inviting them to join the crowdsourcing activities [19]. The pop-up box acts as an open call to all registered users. The application gives a sort of reward in the form of virtual scores and medals. Moreover, the application has a large number of active users, so it meets the conditions for use in crowdsourcing.

The application has attracted up to 25% of registered users to contribute. This number is well above the expected value of 10% of users predicted by [20], which stated that 90% of users are expected to be passive. We found that approximately 3% of users contribute much more than average users. One user did exceptional work by contributing 51 word definitions alone. This contribution is of course not comparable to the work of the Madman in Winchester's tale [21], but the result is no less important to us.
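The participation figures can be set against the 10% expectation with a small calculation. The user counts come from the Result section; treating the 34 users with at least 7 definitions as the "approximately 3%" of heavy contributors is an assumption made here for illustration, since the text does not state that link explicitly.

```python
total_users = 1086            # registered users in August 2018
definers = 271                # users who wrote at least one definition
heavy_contributors = 34       # users with at least 7 definitions (assumed link)
expected_active_share = 0.10  # active share implied by the 90%-passive rule in [20]

active_share = definers / total_users
heavy_share = heavy_contributors / total_users
ratio_to_expectation = active_share / expected_active_share

print(f"active: {active_share:.1%}")              # active: 25.0%
print(f"heavy: {heavy_share:.1%}")                # heavy: 3.1%
print(f"ratio: {ratio_to_expectation:.1f}x")      # ratio: 2.5x
```

Under these assumptions, the observed share of active users is roughly two and a half times the expected share.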
The number of repository items has been growing since the add-on was deployed. Phrase definitions increase by about 200 new entries each month, though the growth is unsteady. Definition validations, on the other hand, increased progressively during the period of observation, but the increase appears to settle at about 400 validations per month in the following months (September and October). The result is encouraging. If growth stays steady at this rate, we could optimistically have a dictionary of scientific terms with ten thousand entries after four years.
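The four-year estimate follows from simple linear extrapolation, sketched below. The sketch assumes a constant rate of 200 new definitions per month and, as a simplification, starts counting from zero rather than from the entries already collected; the variable names are illustrative.

```python
import math

monthly_new_definitions = 200  # approximate monthly growth from the text
target_entries = 10_000        # "ten thousand entries"

# Entries accumulated after four years at a constant monthly rate.
entries_after_four_years = monthly_new_definitions * 12 * 4

# Months needed to reach the target at the same rate.
months_to_target = math.ceil(target_entries / monthly_new_definitions)

print(entries_after_four_years)  # 9600
print(months_to_target)          # 50
```

At this rate the repository would hold 9600 new entries after 48 months and reach ten thousand after 50 months, which is consistent with the rough four-year estimate.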
The optimism is well founded on the observed statistics. The success in building repositories for major international languages, such as in [22][23][24][25], may be replicated or even improved upon. However, we need to stay alert to phenomena that are hard to predict. Students who participate in crowdsourcing do not share a single motivation. A few stated that they got involved to follow their supervisors' directions. Some assumed that participating in crowdsourcing would add to the score of their final project. Others joined for immaterial rewards, i.e. a worthy cause. Some of these reasons may be unsustainable, which could prevent the target from being reached.
Strategies may need to be devised and implemented to keep the good work going. Experts have offered guidance for a variety of crowdsourcing projects. A project should have a clear goal, a sound challenge, and regular reporting. The application should be easy and fun, reliable and quick, and intuitive, and it should provide options so that users can choose what they work on [26]. In addition, contributors should be acknowledged, rewarded, and trusted. The content should be interesting, novel, and focused on history or science, and there should be plenty of it to create over the years.

Conclusion
A language repository can be developed through a crowdsourcing application. We have developed such a system, and it meets the conditions for crowdsourcing activities. It has a large number of trustworthy users, approximately one thousand students per semester. The main system runs well and provides facilities for users to complete their final projects. About 25% of users participate in crowdsourcing activities by writing phrase definitions and validating definitions. During the period of observation, about 200 phrase definitions were written each month, and on average each definition was validated by two users. In short, it is possible to develop a language repository, in this case a repository of phrase definitions in Bahasa Indonesia, using an application that implements the crowdsourcing model.

The users who wrote definitions for phrases from other students' projects were strongly engaged in the crowdsourcing activity, which is why they were invited to fill in a questionnaire to investigate why they were keen to contribute.

Figure 2. Number of phrases, phrase definitions, and validations during the period of observation.

Table 1. Questionnaire to contributing users

Statement

Knowledge
1. I am aware of the feature to contribute on keyword definition and definition validation, in application SiS.
2. I understand the purpose of the features related to keywords in SiS.
3. The keyword in a scientific article and thesis is no more than a less useful complement.
4. A scientific term or keyword needs to be explained in terms of the definition of terms.