Discovering OLAP dimensions in semi-structured data
Introduction and motivation
The explosion of social network activity in recent years has led to the generation of massive volumes of user-related data, such as status updates, messages, blog posts and forum entries, recommendations, and connection requests and suggestions, and has given birth to novel analysis areas, such as Social Media Analysis and Social Network Analysis. This phenomenon can be viewed as part of the “Big Data” [1] challenge, which is to cope with the rising flood of digital data from many sources including
Acquiring facts and dimensions
To exemplify the challenges of transforming semi-structured data into multidimensional cubes, let us recall the relevant concepts of data warehouse design. Data in a data warehouse is structured according to the aggregation-centric multidimensional data model, which uses numeric measures as its analysis objects [24]. A fact entry represents the finest level of detail and normally corresponds to a single transaction or event occurrence. A fact consists of one or multiple measures, such as
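The relationship between facts, measures, and dimensions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout, the field names (`user`, `lang`, `retweets`, `date`), and the helpers `to_fact` and `roll_up` are hypothetical and chosen only for illustration. Each nested record is flattened into one fact entry, and the measures are then aggregated by summation along a chosen dimension.

```python
from collections import defaultdict

# Hypothetical semi-structured records; the field names are illustrative
# assumptions, not the schema used in the paper.
records = [
    {"id": 1, "user": {"name": "ann", "lang": "en"}, "retweets": 3, "date": "2012-06-08"},
    {"id": 2, "user": {"name": "bob", "lang": "de"}, "retweets": 5, "date": "2012-06-08"},
    {"id": 3, "user": {"name": "ann", "lang": "en"}, "retweets": 1, "date": "2012-06-09"},
]


def to_fact(record):
    """Flatten one nested record into a fact entry: dimension values plus measures."""
    return {
        "dims": {"date": record["date"], "lang": record["user"]["lang"]},
        "measures": {"retweets": record["retweets"], "count": 1},
    }


def roll_up(facts, dim):
    """Aggregate all measures by summation along a single dimension."""
    totals = defaultdict(lambda: defaultdict(int))
    for fact in facts:
        key = fact["dims"][dim]
        for name, value in fact["measures"].items():
            totals[key][name] += value
    return {key: dict(measures) for key, measures in totals.items()}


facts = [to_fact(r) for r in records]
by_date = roll_up(facts, "date")
by_lang = roll_up(facts, "lang")
```

Here each record yields exactly one fact, which matches the idea of a fact as a single event occurrence. Summation is only one possible aggregation function; other measures may call for averages or distinct counts.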
Modeling discovered elements
A cube can be extended by adding new elements of type measure or dimension category. A measure is a simple atomic field of a fact entry. Therefore, computing a new field of this type does not require additional adjustments to the overall cube structure. However, adding a new dimension, or a new hierarchy level to an existing dimension, poses a number of challenges with respect to modeling, implementing, querying, and maintaining such an added element. We demonstrate the differences in
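The asymmetry between the two kinds of extension can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the paper's design: the dictionary-based cube, the helpers `add_measure` and `add_dimension`, and the keyword rule standing in for a data mining classifier are all hypothetical.

```python
# A toy cube: a schema (measure and dimension names) plus a list of fact rows.
cube = {
    "measures": ["retweets"],
    "dimensions": ["date"],
    "facts": [
        {"date": "2012-06-08", "text": "great goal", "retweets": 3},
        {"date": "2012-06-08", "text": "what a terrible foul", "retweets": 5},
    ],
}


def add_measure(cube, name, compute):
    """A new measure is just one more atomic field computed per fact."""
    for fact in cube["facts"]:
        fact[name] = compute(fact)
    cube["measures"].append(name)


def add_dimension(cube, name, classify):
    """A new dimension also needs a member set that queries can group by."""
    members = set()
    for fact in cube["facts"]:
        fact[name] = classify(fact)
        members.add(fact[name])
    cube["dimensions"].append(name)
    cube.setdefault("members", {})[name] = sorted(members)


# Adding a measure touches only the fact rows.
add_measure(cube, "length", lambda f: len(f["text"]))

# Adding a dimension category; the keyword rule is a placeholder for a
# real classifier and is purely an assumption of this sketch.
add_dimension(cube, "tone", lambda f: "negative" if "terrible" in f["text"] else "other")
```

Even in this small model, the dimension requires extra bookkeeping (the member list) that the measure does not. A real system would additionally have to handle hierarchy levels and the maintenance of the member set as the data changes.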
Maintaining dynamic elements
Discovered elements of type dimension category may be of a static nature (e.g., language, sentiment, topic) or may evolve over time along with the dataset. The former type can be treated just like a full-fledged dimension category, since no additional constraints on maintaining the data are imposed in that case. The latter type, however, behaves similarly to a changing dimension, a term introduced by Kimball in [30]. Kimball distinguishes between slowly and rapidly changing dimensions
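One widely used way to maintain a slowly changing dimension is to keep every version of a member together with its validity interval, an approach commonly referred to as Type 2. The Python sketch below illustrates that general technique only; this excerpt does not say which strategy the authors adopt, and the helpers `assign_category` and `category_at`, as well as the activity categories, are hypothetical.

```python
from datetime import date


def assign_category(history, key, category, as_of):
    """Record a (possibly changed) category for `key`, keeping old versions."""
    versions = history.setdefault(key, [])
    if versions and versions[-1]["category"] == category:
        return  # nothing changed, keep the current version open
    if versions:
        versions[-1]["valid_to"] = as_of  # close the previous version
    versions.append({"category": category, "valid_from": as_of, "valid_to": None})


def category_at(history, key, when):
    """Look up the category that was valid for `key` on the date `when`."""
    for version in history.get(key, []):
        starts_before = version["valid_from"] <= when
        still_open = version["valid_to"] is None or when < version["valid_to"]
        if starts_before and still_open:
            return version["category"]
    return None


# Hypothetical example: a user's discovered activity class changes over time.
history = {}
assign_category(history, "ann", "occasional", date(2012, 6, 1))
assign_category(history, "ann", "active", date(2012, 6, 10))
```

With the history preserved, a fact dated before the change can still be aggregated under the category that was valid at that time, while newer facts fall under the current one. The validity intervals are half-open, so the change date itself belongs to the new version.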
Demonstration
Twitter has become a reflection of real-world events. Be it the Arab uprising, a natural disaster, political elections, a movie or music launch, or a sporting event, each is mirrored in a surge of social activity on Twitter. Data analysts expect valuable insights from event-oriented analysis of the Twitter stream, which delivers user-generated content. Our usage scenario is concerned with the prominent sporting event of the 2012 UEFA European Football Championship,
Conclusions and future work
In this work, we proposed extracting multidimensional data cubes for OLAP from semi-structured datasets and extending the resulting model with dynamic categories and hierarchies discovered from the data through data mining (DM) methods and other computations. The discovered classifications reflect “hidden” relationships in the dataset and thus represent new axes for exploring the measures in a cube.
As a non-conventional application for OLAP, we used the publicly available stream of the user-generated
References (36)
- et al., A foundation for capturing and querying complex multidimensional data, Inf. Syst. (2001)
- K. Roebuck, Big Data: High-Impact Strategies - What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity,...
- C. Strauch, NoSQL Databases. Lecture Selected Topics on Software-Technology Ultra-Large Scale Sites, Manuscript, Lecture...
- J. Han, OLAP mining: an integration of OLAP with data mining, in: Proceedings of the 7th IFIP 2.6 Working Conference on...
- Twitter Team, Twitter Turns Six,...
- R. Krikorian, Developing for @twitterapi (techcrunch disrupt hackathon),...
- J. MacLennan, Z. Tang, B. Crivat, Mining OLAP Cubes, Wiley Publishing, 2008, pp....
- M. Usman, S. Asghar, S. Fong, A conceptual model for combining enhanced OLAP and data mining systems, in: 5th...
- J. Han, S. Chee, J. Chiang, Issues for on-line analytical mining of data warehouses, in: Proceedings of the Workshop on...
- H. Zhu, On-line analytical mining of association rules (Ph.D. thesis), Simon Fraser University,...