Index Point Detection for Text Summarization using Cosine Similarity in Educational Videos

Explosive growth in digital content has created a trend of video-based learning and knowledge sharing. Numerous educational videos are shared by trainers, bloggers, and instructors daily. It is now evident that even during emergencies like the recent COVID-19 outbreak, video lectures are the savior when the whole world comes to a standstill. This rapid and expansive growth has encouraged researchers to efficiently and automatically index, browse, and retrieve video data. This research aims to process videos and precisely identify the index points discussed within them. Experimental results show good performance of the topic detection model in finding keywords and reducing dimensionality.


Introduction
Segmentation splits a video stream into different camera takes, scenes, or shots. These are correlated sets of contiguous picture frames captured by a camera from the moment it begins recording to the moment it finishes [1], [2]. Content-wise, shots are homogeneous and possess a degree of visual uniformity. The idea is to extract the video content and analyze it for information and knowledge retrieval. This paper offers a framework that assists video summarization by extracting keywords as and when they occur in the video. Shot boundary detection is done through keyframe identification using cosine similarity with a static threshold value.

Textual content in a Video
Text in a video sequence can exist in one of two forms, viz.
• Scene Text: also called graphic text, it is captured in the video sequence during video making, for example, text written on boards, banners, and hoardings.
• Artificial Text: text that is introduced explicitly into the video through editing. It is always displayed at a specific position in the video. Examples are names, data about the video, interpretations of what is discussed in the video, and captions.

Online Video Lectures
Recording video lectures and presentations to enhance the teaching-learning mechanism is now a longstanding practice in India [3], [4]. With the support of the Ministry of Human Resource Development (MHRD), institutes of national importance in India started recording video lectures, which substantially increased self-paced learning for students and academicians. This repository is now so large and authentic that many articles have been published, and much research is possible for data scientists using this data set [5]. The data set is useful in many research domains such as natural language processing, speech recognition, and image and video analysis.

Related Work
Text summarization is the process of automatically producing the precise meaning of a given document, and it has attracted the attention of Natural Language Processing researchers. [1] proposed an algorithm in which the OpenCV library is used to decode video into frames. [6] proposed that clustering and dictionary learning together can be employed to determine keyframes, after which additional local and global features of these frames can be determined. Color features [7], [8], moments of inertia [9], as well as local binary features [10] are most predominantly used for representing image frames in a video. Table 1 summarizes the literature on text and feature extraction techniques used by different authors in their state-of-the-art research. A few methods use divide-and-conquer algorithms to segment videos and determine keyframes [11]. [12] used an intra-modality and inter-modality fusion mechanism to select sub-summaries; an aggregated attention curve was created, and all sub-summaries were finally combined into an aggregated summary. [13] proposed a dual-threshold method for video segmentation in which different keyframes are extracted from each segment and Scale-Invariant Feature Transform (SIFT) features are extracted from the keyframes of the segments; an SVD-based (Singular Value Decomposition) method is proposed to match two video frames using their SIFT point-set descriptors. [14] applies the mutual information method from information theory and confirms that it improves extraction efficiency.

Index Point Detection Framework
This research found that the frequency of words, the frequency of n-grams, and the number of first-time words in a video provide valuable information for segmenting a video by topic. The research aims to identify semantic and topic-based index points in educational videos so that the topics discussed in a video can be listed quickly and efficiently. These index points are the topics/subtopics discussed in the video lecture. Data sets of online video lectures, tutorials, open courseware, and similar resources with variable playback time, specifically videos that contain lecture slides, are used as input. As represented in figure 1, the proposed framework is robust and exhibits good performance for video text detection and recognition.
For similarity measurement, cosine similarity is used, and a threshold value is adopted while measuring the similarity among video frames. Videos consisting mostly of graphs, statistical data, and diagrams are excluded from the data set to maintain throughput. The research work focuses on selecting keyframes; the goal is to produce a sequence of keyframes from the original sequence while reducing redundancy as much as possible.

Partitioning of Video into Frames
Initially, image frames are extracted from the video. Generally, a video contains 24 image frames per second, the majority of which are redundant. Therefore, measures are needed to extract critical frames only. Usually, a video of k minutes is partitioned into

N = k × 60 × 24 frames,

where N is the number of frames generated initially from the video. It is evident that in educational videos, a topic will be discussed for at least 10 seconds. Therefore, to curtail the number of redundant frames, a 10-second delay is introduced in frame generation. Figure 2 displays the frame generation of a sample video from the dataset after every 10 seconds.

Table 1. Literature on shot boundary detection and feature extraction techniques:

2. A Novel Key-Frames Selection Framework for Comprehensive Video Summarization [16] — Keyframes are identified by aggregating spatio-temporal features generated by CapsuleNet; G1 continuity error and cubic Bézier curve fitting are used for shot boundary detection within local sliding windows.

3. A Novel Shot Detection Approach using 8 Neighbors and Key-Colors [17] — For eight neighbours, the grey-level difference is calculated between adjacent pixels. If the threshold is exceeded, the corresponding statistic is set to 1, and 0 otherwise. Key colours are applied to adjacent frames using colour features to enhance the accuracy of shot detection and to judge the similarity of material between adjacent frames.

4. An Efficient Method for Video Shot Boundary Detection and Key Frame Extraction using SIFT Point Distribution Histogram; Multi-Modal Visual Features Based Video Shot Boundary Detection; Abrupt Shot Detection in Video using Weighted Edge Information; Shot Boundary Detection using Perceptual and Semantic Information [18] — SIFT, SURF (Speeded-Up Robust Features), the Sobel edge detector, and MSER (Maximally Stable Extremal Regions) are the key state-of-the-art local feature extraction algorithms used recently, together with histogram-based feature extraction techniques.

5. A Study of Discriminant Visual Descriptors for Sport Video Shot Boundary Detection [19] — Colour histograms are the most popular global attribute of a video; they offer a good balance between precision and time.

6. Improving the Video Shot Boundary Detection Using the HSV Color Space and Image Subsampling [20] — The HSV colour space determines a 3-D colour representation that in turn provides more robust detection results.

7. Edge Strength Extraction using Orthogonal Vectors for Shot Boundary Detection [21] — Edge detection for feature extraction.

8. Shot Boundary Detection in MPEG Videos Using Local and Global Indicators [22] — Hybrid features for feature extraction.

9. Video Shot Transition Detection using Spatio-Temporal Analysis and Fuzzy Classification [23] — Image intensity or colour histogram for feature extraction.
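The frame-count and sampling arithmetic described above can be sketched as follows. This is a minimal illustration assuming the fixed rate of 24 frames per second stated in the text; the function names are illustrative, not from the paper's implementation.

```python
FPS = 24  # frames per second assumed for the source video, as stated above

def total_frames(k_minutes: int) -> int:
    """N = k * 60 * 24: frames generated initially from a k-minute video."""
    return k_minutes * 60 * FPS

def sampled_frame_indices(k_minutes: int, delay_seconds: int = 10) -> list:
    """Indices of the frames kept when one frame is sampled every delay_seconds."""
    step = delay_seconds * FPS  # number of raw frames skipped between samples
    return list(range(0, total_frames(k_minutes), step))

# A 2-minute video yields 2 * 60 * 24 = 2880 raw frames,
# but only 12 sampled frames (one every 10 seconds).
print(total_frames(2))                # 2880
print(len(sampled_frame_indices(2)))  # 12
```

In a full pipeline, the sampled indices would be used to seek into the decoded video (e.g. with OpenCV) instead of keeping every raw frame.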

Cosine Similarity
Text similarity estimates the similarity between text documents using distance measures or a semantic similarity index; the text in a document acts as the local feature used to find similarity between documents. In our research, we use the cosine similarity method [25]. Cosine similarity uses the cosine of the angle between two vectors to represent the similarity between two image frames [26], as per equation 1, where A and B are image frames:

cos(θ) = (A · B) / (‖A‖ ‖B‖)    (1)

Values range from -1 to 1, where -1 is perfectly dissimilar and 1 is perfectly similar; identical frames have a cosine value close to 1. The similarity threshold is typically set between 0.7 and 0.8. It is observed that the higher the compression-ratio requirement, the smaller the threshold setting.
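A minimal sketch of the cosine measure and the static-threshold keyframe selection it drives is shown below. Frames are represented here as plain feature vectors; in the actual framework they would be derived from decoded video frames, which is an assumption on our part.

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (||A|| * ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def select_keyframes(frames, threshold=0.8):
    """Keep a frame as a new keyframe only when it is not similar enough
    to the previously selected keyframe (static threshold)."""
    keyframes = [frames[0]]
    for frame in frames[1:]:
        if cosine_similarity(keyframes[-1], frame) < threshold:
            keyframes.append(frame)
    return keyframes

# Identical vectors give similarity 1.0; orthogonal vectors give 0.0.
print(cosine_similarity([3, 4], [3, 4]))  # 1.0
frames = [[1, 0], [1, 0.05], [0, 1]]      # second frame is near-duplicate
print(len(select_keyframes(frames)))      # 2
```

Lowering the threshold keeps fewer keyframes, which matches the observation above that a higher compression ratio calls for a smaller threshold.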

Index Point Identification
After successful extraction of image text, index points/keywords need to be identified. The first step is text pre-processing [27], which involves removal of noisy data such as punctuation marks, special characters, tags, and URLs, along with redundant text components. The second step is removal of stop words, viz. pronouns, prepositions, conjunctions, and connecting verbs. The third and final step is text preparation [28], in which the Bag-of-Words model is used to count word frequencies. The word frequency determines the possibility of a word being called an 'Index Point'.
In the process of keyword extraction, removal of stop words and conversion of the text into a list of words are also done. Unigrams and bigrams are generated so that accurate index points can be identified; these are converted to a matrix of integers through vectorization. Figure 3 depicts that "data structure", "data type", "abstract data", and "primitive data" are bigrams used most often by the instructor in the video.
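The pre-processing and n-gram counting steps above can be sketched as follows. This is a simplified illustration: the stop-word list is a tiny subset for demonstration, not the one used in the paper, and the sample sentence is invented.

```python
import re
from collections import Counter

# Illustrative stop-word subset (the paper's full list is not reproduced here).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "to", "it"}

def preprocess(text):
    """Lowercase, strip punctuation/special characters, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def ngram_counts(tokens, n):
    """Count n-grams; high-frequency n-grams are index-point candidates."""
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

text = "A data structure is a data type. The data structure stores data."
tokens = preprocess(text)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(unigrams.most_common(1))  # [('data', 4)]
print(bigrams.most_common(1))   # [('data structure', 2)]
```

The resulting counts correspond to the vectorized frequency matrix from which index points are selected.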

Result Analysis
Keyframe extraction was carried out using the cosine similarity method, experimenting with different threshold values. [Table: number of keyframes extracted at each threshold and the resulting accuracy of the framework.] The average precision value of the framework is 86%.

Conclusion
Through this research paper, we propose a new framework that extracts text from keyframes for keyword-based video summarization. Throughout the article, the keywords are referred to as index points. The cosine similarity approach is used to verify the index of similarity between frames. The OCR tool PyTesseract is used to extract text from the keyframes. The findings are persuasive, and the precision is considerably high. This system will allow better indexing of videos in the future.