Web Crawler for Indexing Video e-Learning Resources: A YouTube Case Study

The main objective of this paper is to develop and validate an algorithm for automatically indexing YouTube e-learning resources about a given domain of interest. After the keywords specific to the domain are identified, a web crawler evaluates video resources from the YouTube platform in terms of their relevance to that domain. The most relevant videos are then indexed by applying a Named Entity Recognition (NER) engine to their transcripts. In this manner, semantic queries can be used to find exactly the needed information inside these multimedia resources. The crawler repeats the indexing process daily in order to keep the repository of semantically indexed videos up to date. The final section presents the obtained results together with the validation of the model.


Introduction
YouTube is arguably one of the largest e-learning resources available. With 300 hours of video content uploaded to the platform every minute [1], no one knows how deeply the information about a certain topic may be buried among a wide range of irrelevant videos. To make things worse, most of the information on the platform is available in video format only, without captions or any other form of searchable content. The problem that arises is how one can search for and watch interesting videos about a certain topic only, without being flooded with unwanted or distracting videos. Considering that the platform's declared main goal is to increase watch time [2], it is uncertain when and how Google will introduce a feature that resolves this problem. One solution is to use the existing search feature of the platform and add one or more filtering layers that extract and rank only the desired videos for a specific topic. To this end, this paper proposes a keyword-based search approach for finding the most interesting videos for every key term related to the desired topic. Then, for each video, the captions are extracted (if they are present) or generated (as shown in [3]). The DBpedia Spotlight algorithm [4] is then used to extract the entities from the text, and the results are saved into a Microsoft Azure Cosmos DB database. This cloud database was chosen because it is ranked as the second fastest graph database and the optimal option for saving large semantic data [5]. The indexing process presented above is repeated daily, searching for newly added videos and indexing them semantically. The results of the proposed algorithm are presented either in a classic format, as on the YouTube platform, or in a graph format that highlights the relationships between videos and the degree of their relevance to the searched terms or entities.

Related Work
A similar approach to indexing e-learning video resources was taken by the author in his PhD thesis [6]. However, the current work differs from the previous one in certain stages of data processing. The work in [6] used a manual data input scheme, and the developed platform needed an administrator in charge of adding new e-learning resources. The current approach uses a keyword-based web crawler that automatically adds new data to the platform for processing. Another distinction is the multi-language support of the current algorithm. Because the captions are generated when they are not present on the YouTube platform, they can easily be translated into English in order to perform the entity recognition. After this step, as long as users search for entities that have English correspondents, the language of the video no longer matters from the algorithm's point of view. This was a limitation of the previous work, resolved in this version of the algorithm. A short literature review is presented below. Paper [7] proposes a framework for video semantic recognition using supervised and semi-supervised machine learning models. However, that paper focuses on feature extraction based on the visual component of the multimedia content rather than on the cognitive content per se. Even though the authors of [8] use a Web Ontology Language (OWL)-based ontology for semantic content analysis, the main focus remains on feature extraction, not on knowledge extraction. A study closer to the current work is [9], where unsupervised machine learning algorithms are used, together with Named Entity Recognition (NER), to classify videos from 9 different channels of the Dailymotion website. Nevertheless, the final goal there is to improve the classification of the videos rather than to extract and further use the content.
The research paper closest to this one is [10], where Natural Language Processing (NLP) techniques are used to extract certain entities from YouTube videos. The main disadvantage of the model presented in [10] is that it can be used by experienced users only, via the SPARQL Protocol and RDF Query Language (SPARQL) endpoint or by connecting it to the Linked Open Data (LOD) cloud. To summarize, the added value of the current paper lies in the overall data flow between the different components, which allows the automatic indexing of multimedia e-learning resources from the YouTube platform, the final goal being access to the indexed resources by non-technical users.

The web crawler
The first step of the developed algorithm is the automatic crawling of newly added YouTube videos for a specific domain of interest. For the purposes of this paper, entrepreneurship was chosen as the main topic. The YouTube Data API [11] is used to perform search queries based on relevant keywords. The keywords are extracted from the DBpedia ontology [12], based on the Dublin Core Metadata Initiative [13] terms that the ontology links to. Let us take a more specific case from the entrepreneurship domain. If we focus on the dct:subject property of the http://dbpedia.org/resource/Entrepreneurship entity, we find the following entities as values (in fact, their English labels): "Entrepreneurship", "Business models", "Business occupations", "Business terms", "Management occupations", and "Small business". By further using the relations available in the LOD cloud, we can check, for every entity found in this step, whether it is the subject of other DBpedia entities by following the dct:subject property in the inverse direction. For example, for "Business occupations" we find the following entities (again enumerated by their English labels): "Consultant", "Decision analyst", "Board of directors", "Chief Scientific Officer", "Chief operating officer", "Chief privacy officer", "Enrolled agent", and the list continues. For the topic of entrepreneurship, a list of 732 keywords was extracted by following these steps with SPARQL queries. The next step of the algorithm is to feed the identified keywords to the web crawler, which calls the YouTube Data API to obtain the first 50 results (the maximum number of results in the current version of the YouTube Data API) for every keyword. The API has options for filtering the results by geographic location or language, if this is the desired behavior. After the unique video identifiers are obtained for all the extracted keywords, the next phase begins.
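The keyword expansion and search steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names (`subjects_of`, `members_of`, `search_url`) are hypothetical, the public DBpedia SPARQL endpoint and the YouTube Data API v3 `search` endpoint are assumed, and the code only builds the queries and the request URL without sending anything over the network.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed public endpoints; the paper does not state which ones were used.
DBPEDIA_SPARQL_ENDPOINT = "https://dbpedia.org/sparql"
YOUTUBE_SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

PREFIXES = (
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
)


def subjects_of(resource_uri: str) -> str:
    """Build a SPARQL query for the English labels of a resource's dct:subject values."""
    return PREFIXES + (
        "SELECT DISTINCT ?subject ?label WHERE {\n"
        f"  <{resource_uri}> dct:subject ?subject .\n"
        "  ?subject rdfs:label ?label .\n"
        '  FILTER (lang(?label) = "en")\n'
        "}"
    )


def members_of(subject_uri: str) -> str:
    """Build a SPARQL query for entities whose dct:subject is the given subject.

    This follows dct:subject in the inverse direction.
    """
    return PREFIXES + (
        "SELECT DISTINCT ?entity ?label WHERE {\n"
        f"  ?entity dct:subject <{subject_uri}> .\n"
        "  ?entity rdfs:label ?label .\n"
        '  FILTER (lang(?label) = "en")\n'
        "}"
    )


def search_url(keyword: str, api_key: str,
               language: Optional[str] = None,
               region: Optional[str] = None) -> str:
    """Build a YouTube Data API v3 search URL returning up to 50 videos for a keyword."""
    params = {
        "part": "snippet",
        "q": keyword,
        "type": "video",
        "maxResults": 50,  # the per-request maximum mentioned in the text
        "key": api_key,
    }
    if language:
        params["relevanceLanguage"] = language  # optional language filter
    if region:
        params["regionCode"] = region  # optional geographic filter
    return YOUTUBE_SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(subjects_of("http://dbpedia.org/resource/Entrepreneurship"))
    print(search_url("Business occupations", "YOUR_API_KEY", language="en"))
```

Each query string would be sent to the SPARQL endpoint, and each returned label would become one keyword passed to `search_url`. The video identifiers in the responses are what the next phase consumes.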
In this stage, the algorithm requests the English captions for the videos. The videos that have English captions are processed first, because obtaining them requires only one additional YouTube API call. For the rest of the videos, the Google Speech-to-Text API is used to obtain an approximate transcription of the video content. If the language of the video is not English, the captions are translated into English with the help of the Google Translate API. The main reason for doing this is that the DBpedia Spotlight NER algorithm provides the best results for the English language [14]. All the obtained data are saved into a Microsoft Azure Cosmos DB cloud database, as mentioned before. Figure 1 shows the overall schema of this component of the developed algorithm.
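The caption-acquisition logic of this stage can be sketched as a small decision function. The sketch below is an assumption-laden illustration: the three callables (`fetch_captions`, `transcribe`, `translate`) are hypothetical stand-ins for the YouTube captions call, the Google Speech-to-Text call and the Google Translate call, whose real client code is not given in the paper.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def english_text(
    video_id: str,
    fetch_captions: Callable[[str, str], Optional[str]],
    transcribe: Callable[[str], Tuple[str, str]],
    translate: Callable[[str, str], str],
) -> str:
    """Return English text for a video, following the order described in the text.

    fetch_captions(video_id, "en") returns the captions or None if absent.
    transcribe(video_id) returns (transcript, detected_language_code).
    translate(text, source_language) returns the English translation.
    All three are hypothetical wrappers around the external services.
    """
    captions = fetch_captions(video_id, "en")
    if captions is not None:
        return captions  # cheapest path: captions already exist in English
    text, language = transcribe(video_id)  # approximate speech-to-text transcript
    if language != "en":
        text = translate(text, language)  # normalise to English before NER
    return text


def order_for_processing(
    video_ids: Iterable[str],
    has_english_captions: Callable[[str], bool],
) -> List[str]:
    """Put videos that already have English captions first, keeping relative order."""
    return sorted(video_ids, key=lambda vid: not has_english_captions(vid))


if __name__ == "__main__":
    # Stub services used only to demonstrate the control flow.
    stored = {"a": "hello from captions"}
    text = english_text(
        "b",
        fetch_captions=lambda vid, lang: stored.get(vid),
        transcribe=lambda vid: ("buna ziua", "ro"),
        translate=lambda text, src: "good day",
    )
    print(text)
```

Passing the services in as callables keeps the decision logic testable without network access. A real crawler would bind them to the corresponding API clients.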

Extracting the knowledge
In the previous phase, the videos relevant to the specific domain, along with their captions, were saved in the Cosmos DB instance. The next step is the extraction of named entities from the captions of each video. To achieve this goal, the DBpedia Spotlight NER engine is used. According to [14], for under-resourced languages (Romanian, for example) the best results were obtained on English text with the confidence set to 30%. After the NER process is performed on the captions of each video, the database containing the video list is updated with the list of entities from the DBpedia ontology found in each e-learning resource. The entire process described up to this point is repeated daily in order to search for and index new videos added to the YouTube platform. Already indexed videos are ignored; the YouTube unique identifier is used to detect whether a video has already been indexed. After the first videos are indexed, the users of the platform are able to perform searches based on their needs. The searching process is not a classical keyword-based one. Instead, the algorithm applies NER again, this time on the search string, in order to identify the entities that the user is interested in. For this step, the n-best candidates version of the DBpedia Spotlight algorithm was chosen, with the confidence again set to 30%. If the language of the platform is not English, the search string is translated into English, NER is performed, and the translations of the identified entities' labels are displayed. Figure 2 shows the search page prototype of the designed platform. On this page the user can type the search string, and entities are identified on the fly. Every new entity found is emphasized by a box with a randomly generated color. When the user hovers over the text in the box, the description of the entity is displayed.
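The indexing loop described earlier in this section (annotating captions with the confidence set to 30% and skipping already indexed videos) can be sketched as follows. This is an illustrative sketch only: it assumes the public DBpedia Spotlight REST endpoint and its JSON response layout (a `Resources` list whose items carry an `@URI` field), and the helper names are hypothetical. No request is actually sent.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Assumed public endpoint; the paper does not say which Spotlight deployment was used.
SPOTLIGHT_ANNOTATE_URL = "https://api.dbpedia-spotlight.org/en/annotate"
CONFIDENCE = 0.3  # the 30% confidence mentioned in the text


def annotate_request(text: str) -> Dict[str, object]:
    """Describe the HTTP request for annotating a text, without sending it."""
    return {
        "url": SPOTLIGHT_ANNOTATE_URL,
        "params": {"text": text, "confidence": CONFIDENCE},
        "headers": {"Accept": "application/json"},
    }


def count_entities(spotlight_json: dict) -> Counter:
    """Count how many times each DBpedia URI occurs in a Spotlight JSON response.

    The "Resources" key is assumed to be missing when nothing was recognised.
    """
    resources = spotlight_json.get("Resources", [])
    return Counter(resource["@URI"] for resource in resources)


def new_video_ids(crawled_ids: Iterable[str], indexed_ids: Iterable[str]) -> List[str]:
    """Keep only the videos not indexed yet, using the YouTube identifier as the key."""
    seen = set(indexed_ids)
    fresh = []
    for video_id in crawled_ids:
        if video_id not in seen:
            seen.add(video_id)  # also drops duplicates returned by several keywords
            fresh.append(video_id)
    return fresh


if __name__ == "__main__":
    sample = {
        "Resources": [
            {"@URI": "http://dbpedia.org/resource/Business_plan"},
            {"@URI": "http://dbpedia.org/resource/Startup_company"},
            {"@URI": "http://dbpedia.org/resource/Business_plan"},
        ]
    }
    print(count_entities(sample))
    print(new_video_ids(["v1", "v2", "v1", "v3"], ["v2"]))
```

The per-entity counts produced by `count_entities` are the occurrence numbers that the ranking stage of the search algorithm relies on.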
For the entities that have more than one candidate, the box is replaced by a combo box with all the possible candidates ordered by their probability.

Fig. 2 The prototype of the search page with the details of the "Business plan" entity displayed

Once the entities that the user is interested in are found, the searching process starts. The searching algorithm uses these entities along with the relationships between them from the LOD cloud. In the first stage, the algorithm computes two ranks for each video from the database by taking into consideration the entities specified by the user. Below, equations (1) and (2) describe how these ranks are computed [6].
$$R_i = \sum_{j} n_{i,j} \tag{1}$$

where $R_i$ is the rank of video $i$, and $n_{i,j}$ represents the number of occurrences of entity $j$ in video resource $i$.

$$R_i^{\max} = \max_{j} \left( n_{i,j} \right) \tag{2}$$

where $R_i^{\max}$ is the maximum rank of video $i$, and $n_{i,j}$ represents the number of occurrences of entity $j$ in video resource $i$.

In the second stage, the searching algorithm determines the DBpedia classes that correspond to each of the searched entities. A SPARQL query is used to interrogate the DBpedia ontology for the parent classes of every entity. Afterwards, the videos that contain entities sharing the same parent classes as the identified ones are also ranked. Both ranks are recomputed, so equations (1) and (2) presented above become (3) and (4) after this phase [6].

$$R_i = \sum_{j} n_{i,j} + \sum_{k} q \cdot m_{i,k} \tag{3}$$

where $R_i$ is the rank of video $i$, $n_{i,j}$ represents the number of occurrences of entity $j$ in video resource $i$, $m_{i,k}$ represents the number of occurrences of instances of class $k$ in video $i$, and $q$ is a configurable significance coefficient (set to 0.2 in our case).

$$R_i^{\max} = \max\left( \max_{j} \left( n_{i,j} \right), \max_{k} \left( q \cdot m_{i,k} \right) \right) \tag{4}$$

where $R_i^{\max}$ is the maximum rank of video $i$, and $n_{i,j}$, $m_{i,k}$ and $q$ have the same meaning as in equation (3).

The rank $R_i$ is used further for sorting the results of the search string, and $R_i^{\max}$ to determine the dominant entity for each e-learning video. The dominant entity is graphically marked in the platform by the use of a specific colour (similar to what is visible in Figure 2). Figure 4 presents the entire logical schema of the searching algorithm. Because the algorithm needs a large amount of processing power, only first-level (direct) parent classes are taken into consideration. The second stage is necessary because some of the videos might not contain exactly the identified resource.
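The two ranks of equations (3) and (4) can be computed as in the short sketch below. It is a minimal illustration, not the code of the platform: the function names and the sample counts are hypothetical, and choosing the dominant entity as the searched entity with the highest occurrence count is an assumption about how the maximum rank is used.

```python
from typing import Dict, Optional, Tuple

Q = 0.2  # significance coefficient q, set to 0.2 in the text


def rank_video(
    entity_counts: Dict[str, int],
    class_counts: Dict[str, int],
    q: float = Q,
) -> Tuple[float, float]:
    """Return (rank, maximum rank) of one video, following equations (3) and (4).

    entity_counts maps each searched entity j to its occurrence count n_ij.
    class_counts maps each parent class k to the count m_ik of its instances.
    """
    rank = sum(entity_counts.values()) + q * sum(class_counts.values())
    max_rank = max(
        max(entity_counts.values(), default=0),
        max((q * m for m in class_counts.values()), default=0),
    )
    return rank, max_rank


def dominant_entity(entity_counts: Dict[str, int]) -> Optional[str]:
    """Assumed rule: the searched entity with the highest occurrence count dominates."""
    if not entity_counts:
        return None
    return max(entity_counts, key=entity_counts.get)


if __name__ == "__main__":
    # Hypothetical counts for two videos.
    videos = {
        "video_a": ({"Business_plan": 3, "Startup_company": 2}, {"Organisation": 5}),
        "video_b": ({}, {"Organisation": 10}),
    }
    ranked = sorted(
        videos,
        key=lambda vid: rank_video(*videos[vid])[0],
        reverse=True,
    )
    for vid in ranked:
        print(vid, rank_video(*videos[vid]), dominant_entity(videos[vid][0]))
```

With q = 0.2, a video that contains none of the searched entities but 10 instances of a shared parent class still receives a rank of 2, so it appears in the results below videos that mention the entities directly. With an empty class dictionary the function reduces to equations (1) and (2).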
For example, if one of the searched entities is "Funding", it may happen that no video contains this entity. In the second phase of the algorithm, the parent class of "Funding" is queried, which in this case is "Band". In this way we can find videos that contain entities such as "Lobbying", "Society", "Bestinvest" and other similar terms that might be partially relevant for a person interested in funding. The results are presented by the platform in two ways: a classical one and a graph-based one, which better illustrates the relationships between different video resources.

In order to validate the efficiency of the developed algorithm compared to YouTube's algorithm, which is based on keyword search and focused on maximizing watch time, a survey approach was taken. A sample of 105 persons was chosen among the second and third year students of The Faculty of Cybernetics, Statistics and Economic Informatics, all of them early users of the developed prototype. The chosen methodology was random sampling without replacement, with a 95% confidence level. The results of the survey were processed using the SPSS software [15]. Of the respondents, 64.8% were third year students and 35.2% were second year students, divided approximately equally by gender. The first section of the survey focused on the time spent on the internet, the knowledge of English and the usage of search engines. The surveyed students are good English speakers and use Google and YouTube for searches related to educational resources (98.1% of them mainly use these two search engines). Even so, 60% of them said that they do not always find the needed information when they use these platforms, and 70.5% of them prefer video e-learning resources over text ones.
When it comes to the developed prototype, the respondents rated the efficiency of the algorithm at 87.7%, compared to just 76.5% for YouTube's. Table 1 and Table 2 present the descriptive statistics for the two questions in greater detail. Additionally, the students evaluated the efficiency of the graph-based representation of the results (depicted in Figure 5) compared to the classical one (depicted in Figure 3) to be

Future work will include a way of allowing users to rate the relevance of the found videos, so that search results that are not appropriate are ignored in future searches. Ways of implementing this include, but are not limited to, like and dislike buttons, such as the ones on the YouTube platform, or 5-star ratings. Another future feature is the addition of a new step to the algorithm that will make it possible to create a video containing all the needed information by cropping and linking together parts of the relevant videos found by the existing algorithm. In this manner, users can specify the time that they want to invest in learning something and their current knowledge of the domain. The algorithm will generate as a result a single video of the selected length, containing relevant information extracted from multiple e-learning materials.