On Building an Interpretable Topic Modeling Approach for the Urdu Language

Zarmeen Nasim

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence
Doctoral Consortium. Pages 5200-5201. https://doi.org/10.24963/ijcai.2020/740

This research aims to combine deep-learning-based language modeling with classical topic modeling techniques to produce interpretable topics for a given set of documents in Urdu, a low-resource language. Existing topic modeling techniques produce a collection of words, often uninterpretable, as suggested topics without integrating them into a semantically correct phrase or sentence. The proposed approach would first build an accurate Part-of-Speech (POS) tagger for the Urdu language using a publicly available corpus of many million sentences. In the next step, using semantically rich feature extraction approaches including Word2Vec and BERT, it would experiment with different clustering and topic modeling techniques to produce a list of potential topics for a given set of documents. Finally, this list of topics would be sent to a labeler module to produce syntactically correct phrases that represent interpretable topics.
Keywords:
Natural Language Processing: Natural Language Processing
Natural Language Processing: NLP Applications and Tools
Natural Language Processing: Embeddings
Natural Language Processing: Natural Language Summarization
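
The three stages described in the abstract (POS tagging, embedding-based clustering, and phrase labeling) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 2-D vectors stand in for Word2Vec or BERT embeddings, the POS tags stand in for the output of a trained Urdu tagger, the English tokens are placeholders for Urdu words, and the adjective-then-noun template is a hypothetical labeling rule rather than the labeler module proposed in the paper. Plain k-means is used only as one example of a clustering technique.

```python
import math

# Hypothetical 2-D vectors standing in for Word2Vec/BERT embeddings.
EMBEDDINGS = {
    "cricket": (0.90, 0.10),
    "match": (0.80, 0.20),
    "national": (0.85, 0.15),
    "election": (0.10, 0.90),
    "vote": (0.20, 0.80),
    "general": (0.15, 0.85),
}

# Hypothetical POS tags, as a trained tagger might emit them.
POS_TAGS = {
    "cricket": "NOUN", "match": "NOUN", "national": "ADJ",
    "election": "NOUN", "vote": "NOUN", "general": "ADJ",
}


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: return the cluster index assigned to each point."""
    centroids = list(initial_centroids)
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment


def label_cluster(words):
    """Hypothetical template rule: adjectives first, then nouns."""
    adjectives = [w for w in words if POS_TAGS[w] == "ADJ"]
    nouns = [w for w in words if POS_TAGS[w] == "NOUN"]
    return " ".join(adjectives + nouns)


def label_topics(k=2):
    """Cluster the vocabulary and turn each cluster into a phrase."""
    words = list(EMBEDDINGS)
    points = [EMBEDDINGS[w] for w in words]
    # Deterministic seeding for the sketch: one seed from each half of the list.
    seeds = [points[i * len(points) // k] for i in range(k)]
    assignment = kmeans(points, seeds)
    clusters = [[w for w, a in zip(words, assignment) if a == c] for c in range(k)]
    return [label_cluster(cluster) for cluster in clusters]


labels = label_topics()
print(labels)
```

In this toy setting each cluster of related words is turned into a single readable phrase instead of being reported as an unordered word list, which is the kind of interpretability gain the abstract targets. A realistic labeler would need Urdu-specific syntax rules or a language model rather than a fixed template.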