Automatic Text Classification and Concept Analysis Based on MapReduce

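With the widespread use of information technology, users tend to accumulate large numbers of documents over time. If these documents are not classified and labeled, users cannot quickly find the information they need in a large, disorganized collection. Classifying such a volume of documents by hand also consumes considerable human resources. The goal of this research is therefore to analyze large document collections efficiently and to classify them automatically rather than manually. In this study, we use the Hadoop cloud computing platform. Unclassified documents stored in the Hadoop distributed file system (HDFS) are preprocessed to extract the more representative keywords of each document, along with related information. These data are then passed to Pig, which parses them and computes their TF-IDF weights. Next, using machine learning techniques, we classify the documents with the Hierarchical Agglomerative Clustering and K-Nearest Neighbors algorithms. Once classification is complete, we use Concept Analysis to describe the relationships among the categories, build an ontology from the analyzed relationships, and feed it back to the preprocessing stage for parsing and adjustment. Through repeated adjustment, we hope to find the best relationships between categories and attributes and to improve classification accuracy.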

Keywords

Concept Analysis; K-Nearest Neighbors; Hadoop; Pig; TF-IDF; Hierarchical Agglomerative Clustering

Parallel Abstract

Frequent use of information technology leads users to accumulate a large number of documents. If these documents are not classified, it is difficult for users to find information in a large, unorganized collection. Classifying the documents manually requires substantial human resources. How to analyze and classify such documents efficiently and automatically is the goal of this research. In this thesis, we use the Hadoop platform to classify documents. First, we analyze the documents to find representative keywords for each document. We then compute the TF-IDF weight of each keyword. After that, we use the Hierarchical Agglomerative Clustering and K-Nearest Neighbors algorithms to classify the documents. After classification, we use Concept Analysis to describe the relationships between keyword concepts and to build an ontology. Finally, we use the ontology to improve the accuracy of the classification.

Parallel Keywords

Hierarchical Agglomerative Clustering; Concept Analysis; K-Nearest Neighbors; Hadoop; Pig; TF-IDF
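
The weighting and classification steps named in the abstract (TF-IDF weights followed by K-Nearest Neighbors) can be illustrated with a minimal single-machine sketch in plain Python. This is not the thesis implementation, which computes the weights with Pig on Hadoop. The function names, the cosine similarity measure, the value of k, and the sample documents and labels below are all assumptions made for illustration, and the Hierarchical Agglomerative Clustering and Concept Analysis stages are omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, labeled, k=3):
    """Majority vote over the k labeled vectors most similar to query."""
    nearest = sorted(labeled, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus: four labeled documents plus one unlabeled query.
docs = [
    ["bake", "bread", "oven"],
    ["grill", "steak", "oven"],
    ["hadoop", "cluster", "node"],
    ["pig", "hadoop", "script"],
    ["bread", "oven", "grill"],  # document to classify
]
labels = ["recipe", "recipe", "computing", "computing"]

vectors = tf_idf(docs)
labeled = list(zip(vectors[:4], labels))
predicted = knn_classify(vectors[4], labeled, k=3)
print(predicted)
```

In this sketch the query document shares weighted terms only with the two documents labeled "recipe", so those two neighbors decide the vote. A distributed version would compute the term and document counts as MapReduce jobs rather than in memory.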


