Content Analysis of Scientific Articles in Apache Hadoop Ecosystem

Dendek, Piotr Jan; Czeczko, Artur; Fedoryszak, Mateusz; Kawa, Adam; Wendykier, Piotr; Bolikowski, Łukasz

doi:10.1007/978-3-319-04714-0_10

Part of the book series: Studies in Computational Intelligence (SCI, volume 541)

Abstract

Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization and citation matching of scientific publications. The size of the input data places these algorithms in the class of big data problems, which can be solved efficiently on Hadoop clusters.

Author information

Authors and Affiliations

Interdisciplinary Centre for Mathematical and Computational Modelling, University of Warsaw, Warsaw, Poland
Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier & Łukasz Bolikowski

Corresponding authors

Correspondence to Piotr Jan Dendek, Artur Czeczko, Mateusz Fedoryszak, Adam Kawa, Piotr Wendykier or Łukasz Bolikowski.

Editor information

Editors and Affiliations

Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Robert Bembenik
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Łukasz Skonieczny
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Henryk Rybiński
Faculty of Electronics and Information Technology, Warsaw University of Technology, Institute of Computer Science, Warsaw, Poland
Marzena Kryszkiewicz
Interdisciplinary Centre for Mathematical and Computational Modelling (ICM), University of Warsaw, Warsaw, Poland
Marek Niezgódka

About this chapter

Cite this chapter

Dendek, P.J., Czeczko, A., Fedoryszak, M., Kawa, A., Wendykier, P., Bolikowski, Ł. (2014). Content Analysis of Scientific Articles in Apache Hadoop Ecosystem. In: Bembenik, R., Skonieczny, Ł., Rybiński, H., Kryszkiewicz, M., Niezgódka, M. (eds) Intelligent Tools for Building a Scientific Information Platform: From Research to Implementation. Studies in Computational Intelligence, vol 541. Springer, Cham. https://doi.org/10.1007/978-3-319-04714-0_10

DOI: https://doi.org/10.1007/978-3-319-04714-0_10
Published: 27 February 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-04713-3
Online ISBN: 978-3-319-04714-0
eBook Packages: Engineering, Engineering (R0)
