WET: Word embedding-topic distribution vectors for MOOC video lectures dataset

In this article, we present a dataset containing word embeddings and document topic distribution vectors generated from MOOC video lecture transcripts. Transcripts of 12,032 video lectures from 200 courses were collected from the Coursera learning platform. This large corpus of transcripts was used as input to two well-known NLP techniques, namely Word2Vec and Latent Dirichlet Allocation (LDA), to generate word embeddings and topic vectors, respectively. We used the Word2Vec and LDA implementations in the Gensim package in Python. The data presented in this article are related to the research article entitled “Integrating word embeddings and document topics with deep learning in a video classification framework” [1]. The dataset is hosted in the Mendeley Data repository [2].

Specifications table
Subject: Computer Science
Specific subject area: Machine Learning, Natural Language Processing, Text Classification, eLearning
Type of data:

Value of the data
- This dataset is useful for the research community for two reasons: first, it is the first dataset of its kind aimed at video lecture classification using NLP techniques that is collected in the wild from massive open online courses (MOOCs); second, with its large corpus size, it could serve as a standard benchmark for these research areas and for testing the performance of existing and new methods and techniques.
- The research community in the fields of machine learning, information retrieval, video processing, and education can benefit from these data in research tasks such as transfer learning, video lecture classification and recommendation, contextual analysis, short-text enrichment with topics, performance analysis of deep learning models and techniques, and personalized learning.
- These data could also be used and adopted by content providers and managers of MOOC learning platforms to organize educational resources for maximum visibility and to make the best content easy to search and find.

Data
The key summary statistics of the MOOC video lecture transcript corpus used to generate the word embeddings and topic representation vectors are presented in Table 1. The dataset contains 12,032 video lecture transcripts comprising over 878 thousand sentences and more than 79 million tokens. The vocabulary size is over 68 thousand unique words.
Video transcripts vary in length from 228 to 32,767 tokens, with an average of 6,622 tokens per transcript. Fig. 1 illustrates this variation with a box plot and shows the distribution of tokens across the entire transcript corpus as a density function.
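Such summary statistics can be recomputed from raw transcripts with a few lines of Python. The sketch below is not the authors' code: the transcripts/ directory, the plain-text file layout, and the use of NLTK's tokenizers are our assumptions.

```python
# Minimal sketch for reproducing Table 1-style statistics; assumes the
# transcripts are UTF-8 text files in a hypothetical transcripts/ directory.
from pathlib import Path

from nltk.tokenize import sent_tokenize, word_tokenize  # needs NLTK's 'punkt' data

n_docs = n_sents = n_tokens = 0
vocab = set()
doc_lengths = []

for path in Path("transcripts").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    tokens = word_tokenize(text.lower())
    n_docs += 1
    n_sents += len(sent_tokenize(text))
    n_tokens += len(tokens)
    doc_lengths.append(len(tokens))
    vocab.update(tokens)

print(f"transcripts: {n_docs}, sentences: {n_sents}, tokens: {n_tokens}")
print(f"vocabulary size: {len(vocab)}")
print(f"tokens per transcript (min/mean/max): "
      f"{min(doc_lengths)}/{n_tokens / n_docs:.0f}/{max(doc_lengths)}")
```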
Table 2 lists the categories of the dataset, both general-level and fine-grained, along with the number of video lecture transcripts associated with each category.
A visual representation of the word embeddings generated from the MOOC video lecture corpus, projected into a geometric space using principal component analysis (PCA), is shown in Fig. 2. More specifically, Fig. 2 maps the word 'studying' and its neighbours (e.g., academic, studies, institution, reading) in three-dimensional space.
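A Fig. 2-style neighbourhood projection can be reproduced roughly as follows. This is a sketch rather than the code used to produce the figure: the CSV file name is hypothetical, and the retrieved neighbours will match Fig. 2 only approximately.

```python
# Load one of the dataset's embedding CSVs (hypothetical file name) into a
# Gensim KeyedVectors object, then project the neighbourhood of 'studying'
# from 100 dimensions down to 3 with PCA.
import csv

import numpy as np
from gensim.models import KeyedVectors
from sklearn.decomposition import PCA

kv = KeyedVectors(vector_size=100)
with open("wet_word_embeddings_100d.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        kv.add_vector(row[0], np.asarray(row[1:], dtype=np.float32))

query = "studying"
words = [query] + [w for w, _ in kv.most_similar(query, topn=10)]
coords = PCA(n_components=3).fit_transform(np.vstack([kv[w] for w in words]))

for word, (x, y, z) in zip(words, coords):
    print(f"{word:>12}: ({x:+.3f}, {y:+.3f}, {z:+.3f})")
```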

Experimental design, materials, and methods
A new real-world dataset from the education domain is presented in this article. The dataset contains word embeddings and document topic distribution vectors generated from a corpus of 12,032 video lecture transcripts. The steps involved in collecting and creating the WET dataset are illustrated in Fig. 3.
As a first step, we downloaded video lecture transcripts from the Coursera learning platform and annotated them. For annotation, we used the two-level hierarchical organizational structure of Coursera, where each downloaded video transcript is associated with one fine-grained category and one general-level category of the structure. Eight general-level and 40 fine-grained categories constitute the dataset, and the distribution of lecture transcripts among these categories is given in Table 2.
Prior to creating the corpus and dictionary for generating word embeddings and topic distribution vectors, the video lecture transcripts underwent several preprocessing steps, including converting text to lowercase, removing stop words and punctuation, and removing words that are not purely comprised of alphabetical characters or that consist of only one character. In addition, WordNetLemmatizer is used to lemmatize all words in the transcripts. The Natural Language Toolkit (NLTK), an open-source Python library for symbolic and statistical natural language processing, is used to perform these preprocessing tasks.
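This pipeline can be expressed compactly as follows. The sketch implements the steps described above but is not the authors' exact code, and the preprocess function name is ours.

```python
# Preprocessing sketch: lowercase, tokenize, drop stop words, punctuation,
# non-alphabetic and one-character tokens, then lemmatize with WordNet.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    tokens = word_tokenize(text.lower())              # lowercasing + tokenization
    tokens = [t for t in tokens
              if t.isalpha()                          # purely alphabetic (drops punctuation)
              and len(t) > 1                          # drop one-character tokens
              and t not in STOP_WORDS]                # drop stop words
    return [LEMMATIZER.lemmatize(t) for t in tokens]  # WordNet lemmatization

print(preprocess("Welcome! Today we're studying the basics of machine learning."))
```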

Word embeddings
To train and generate word embeddings, we used the Word2Vec [3] word embedding technique implemented in Python's Gensim package [5]. Word2Vec is an unsupervised learning method in which word embeddings are learned from the distribution of word co-occurrences within a local context, that is, a text window scanned across the whole corpus. There are two model architectures of Word2Vec, namely the continuous bag-of-words (CBOW) and the skip-gram.
We set various parameters for Word2Vec as shown in Table 3. Word embeddings of four different vector sizes (50, 100, 200, and 300 dimensions) are generated.
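A training run of this shape could look as follows. Since Table 3 is not reproduced here, every parameter value below except vector_size is a placeholder, and the corpus construction reuses the hypothetical preprocess function and transcripts/ directory from the sketches above.

```python
# Train one Word2Vec model per embedding size; only vector_size follows the
# text above, the remaining hyperparameters stand in for the Table 3 settings.
from pathlib import Path

from gensim.models import Word2Vec

corpus = [preprocess(p.read_text(encoding="utf-8"))
          for p in Path("transcripts").glob("*.txt")]

for dim in (50, 100, 200, 300):
    model = Word2Vec(
        sentences=corpus,
        vector_size=dim,  # 50, 100, 200, or 300 dimensions
        window=5,         # placeholder context-window size
        min_count=5,      # placeholder vocabulary cutoff
        sg=1,             # 1 = skip-gram, 0 = CBOW
        workers=4,
    )
    model.wv.save_word2vec_format(f"wet_word_embeddings_{dim}d.txt")
```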
Each of the four CSV files contains 68,176 lines, each comprising a unique word followed by 50, 100, 200, or 300 real numbers corresponding to the dimensions of the respective embedding.
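Assuming comma-separated values, one file can be read into a plain word-to-vector mapping as sketched below (the file name is hypothetical).

```python
# Read the 300-dimensional embedding file into a dict and sanity-check the
# layout described above: one word per line followed by its components.
import csv

import numpy as np

vectors = {}
with open("wet_word_embeddings_300d.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        vectors[row[0]] = np.asarray(row[1:], dtype=np.float32)

print(len(vectors))                        # expected: 68176 unique words
print(next(iter(vectors.values())).shape)  # expected: (300,)
```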

Topic distribution vectors
We conducted unsupervised topic modeling on the MOOC lecture transcript corpus. Latent Dirichlet allocation (LDA) [4], a conventional topic modeling scheme implemented in Python's Gensim package, is used for generating document topic distribution vectors. LDA is a generative statistical model in which each document of a corpus is represented by a finite mixture of topics/themes, which, in turn, are represented by groups of words. The parameter settings given in Table 4 are used for LDA, and we performed training for varying numbers of topics: 50, 100, 200, and 300.
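The sketch below shows the overall shape of this step with Gensim. The dictionary and bag-of-words construction follow Gensim's standard workflow; passes and random_state are placeholders for the Table 4 settings, and corpus is the preprocessed token-list collection from the earlier sketches.

```python
# Train one LDA model per topic count and extract a fixed-length topic
# distribution vector for every transcript.
from gensim import corpora
from gensim.models import LdaModel

dictionary = corpora.Dictionary(corpus)                   # word <-> id mapping
bow_corpus = [dictionary.doc2bow(doc) for doc in corpus]  # bag-of-words per transcript

for k in (50, 100, 200, 300):
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=k,
                   passes=10,       # placeholder for the Table 4 setting
                   random_state=1)  # placeholder seed
    # minimum_probability=0.0 keeps explicit zeros, so every vector has length k.
    topic_vectors = [
        [prob for _, prob in lda.get_document_topics(bow, minimum_probability=0.0)]
        for bow in bow_corpus
    ]
```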