Elsevier

Information Systems

Volume 44, August 2014, Pages 120-133
Discovering OLAP dimensions in semi-structured data

https://doi.org/10.1016/j.is.2013.09.002

Abstract

OLAP cubes enable aggregation-centric analysis of transactional data by shaping data records into measurable facts with dimensional characteristics. A multidimensional view is obtained from the available data fields and the explicit relationships between them. This classical modeling approach is not feasible for scenarios dealing with semi-structured or poorly structured data. We propose to extend the data warehouse design methodology with a content-driven discovery of measures and dimensions in the original dataset. Our approach is based on introducing a data enrichment layer responsible for detecting new structural elements in the data using data mining and other techniques. Discovered elements can be of type measure, dimension, or hierarchy level and may represent static or even dynamic properties of the data. This paper focuses on the challenge of generating, maintaining, and querying discovered elements in OLAP cubes.

We demonstrate the power of our approach by providing OLAP to the public stream of user-generated content on the Twitter platform. We have been able to enrich the original set with dynamic characteristics, such as user activity, popularity, messaging behavior, as well as to classify messages by topic, impact, origin, method of generation, etc. Knowledge discovery techniques coupled with human expertise enable structural enrichment of the original data beyond the scope of the existing methods for obtaining multidimensional models from relational or semi-structured data.

Section snippets

Introduction and motivation

The explosion of social network activity in recent years has led to the generation of massive volumes of user-related data, such as status updates, messaging, blog posts and forum entries, recommendations, connection requests and suggestions, and has given birth to novel analysis areas, such as Social Media Analysis and Social Network Analysis. This phenomenon can be viewed as a part of the “Big Data” [1] challenge, which is to cope with the rising flood of digital data from many sources including

Acquiring facts and dimensions

To exemplify the challenges of transforming semi-structured data into multidimensional cubes, let us recall the relevant concepts of data warehouse design. Data in a data warehouse is structured according to the aggregation-centric multidimensional data model that uses numeric measures as its analysis objects [24]. A fact entry represents the finest level of detail and normally corresponds to a single transaction or event occurrence. A fact consists of one or multiple measures, such as

Modeling discovered elements

A cube can be extended by adding new elements of type measure or dimension category. A measure is a simple atomic field of a fact entry. Therefore, computing a new field of this type does not require additional adjustments to the overall cube structure. However, adding a new dimension, or a new hierarchy level to an existing dimension, imposes a number of challenges with respect to modeling, implementing, querying, and maintaining such an added element. We demonstrate the differences in
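The contrast between the two kinds of extension can be illustrated with a minimal Python sketch. It is not the paper's implementation: the fact records, the `word_count` measure, the keyword-based `classify_topic` function, and all helper names are hypothetical, chosen only to show that a new measure is a per-fact computation, whereas a new dimension category also introduces a set of members along which measures can be aggregated.

```python
from collections import defaultdict

# Hypothetical fact entries: one record per message (illustrative data only).
facts = [
    {"user": "a", "text": "goal for spain", "retweets": 3},
    {"user": "b", "text": "rain in kyiv today", "retweets": 1},
    {"user": "a", "text": "what a goal", "retweets": 5},
]

def add_measure(facts, name, fn):
    """Add a derived measure: an atomic field computed per fact entry."""
    for fact in facts:
        fact[name] = fn(fact)

def classify_topic(fact):
    """Toy stand-in for a data mining classifier (an assumption, not the paper's)."""
    return "football" if "goal" in fact["text"].split() else "other"

def add_dimension(facts, name, classifier):
    """Add a discovered dimension category and return its member values."""
    members = set()
    for fact in facts:
        fact[name] = classifier(fact)
        members.add(fact[name])
    return sorted(members)

def roll_up(facts, dimension, measure):
    """Aggregate (sum) a measure along one dimension category."""
    totals = defaultdict(int)
    for fact in facts:
        totals[fact[dimension]] += fact[measure]
    return dict(totals)

# A new measure touches only the fact entries themselves.
add_measure(facts, "word_count", lambda f: len(f["text"].split()))

# A new dimension category adds members that become a new aggregation axis.
topic_members = add_dimension(facts, "topic", classify_topic)
retweets_by_topic = roll_up(facts, "topic", "retweets")
```

In a real warehouse the dimension members would live in their own table and be referenced from the facts by key; the flat dictionaries here only keep the sketch short.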

Maintaining dynamic elements

Discovered elements of type dimension category may be of a static nature (e.g., language, sentiment, topic) or evolve over time along with the evolution of the dataset. The former type can be treated just as a full-fledged dimension category since no additional constraints on maintaining the data are imposed in that case. The latter type, however, behaves similarly to a changing dimension, a term introduced by Kimball in [30]. Kimball distinguishes between slowly and rapidly changing dimensions
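One common way to keep the history of a category that changes over time is to version its rows with validity intervals, in the style of what is usually called a Type 2 slowly changing dimension. The Python sketch below shows that general technique under stated assumptions; it is not necessarily the maintenance strategy adopted in the paper, and the class names, the `activity` category, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class CategoryVersion:
    user: str
    activity: str                    # discovered category value, e.g. "casual"
    valid_from: date                 # inclusive start of validity
    valid_to: Optional[date] = None  # exclusive end; None marks the current row

class UserActivityDimension:
    """Versioned dimension: every change closes the old row and adds a new one."""

    def __init__(self):
        self.rows = []  # list of CategoryVersion

    def current(self, user):
        for row in self.rows:
            if row.user == user and row.valid_to is None:
                return row
        return None

    def assign(self, user, activity, as_of):
        row = self.current(user)
        if row is not None:
            if row.activity == activity:
                return               # unchanged: no new version needed
            row.valid_to = as_of     # close the outdated version
        self.rows.append(CategoryVersion(user, activity, as_of))

    def lookup(self, user, on):
        """Return the category value that was valid for `user` on date `on`."""
        for row in self.rows:
            if (row.user == user and row.valid_from <= on
                    and (row.valid_to is None or on < row.valid_to)):
                return row.activity
        return None

dim = UserActivityDimension()
dim.assign("alice", "casual", date(2012, 6, 1))
dim.assign("alice", "heavy", date(2012, 6, 15))
before_change = dim.lookup("alice", date(2012, 6, 10))
after_change = dim.lookup("alice", date(2012, 6, 20))
```

Because old versions are closed rather than overwritten, a fact recorded on a given date can still be grouped under the category value that applied at that time.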

Demonstration

Twitter has become a mirror of real-world events. Be it the Arab uprising, a natural disaster, a political election, a movie or music launch, or a sporting event, each is echoed by a surge of social activity on Twitter. Data analysts expect valuable insights from event-oriented analysis of the Twitter stream that delivers user-generated content. Our usage scenario is concerned with the prominent sporting event of the 2012 UEFA European Football Championship,

Conclusions and future work

In this work we proposed to extract multidimensional data cubes for OLAP from semi-structured datasets and to extend the resulting model by including dynamic categories and hierarchies discovered from the data through DM methods and other computations. The discovered classifications reflect “hidden” relationships in the dataset and thus represent new axes for exploring the measures in a cube.

As a non-conventional application for OLAP, we used the publicly available stream of the user-generated

References (36)

  • T.B. Pedersen et al., A foundation for capturing and querying complex multidimensional data, Inf. Syst. (2001)
  • K. Roebuck, Big Data: High-Impact Strategies—What You Need to Know: Definitions, Adoptions, Impact, Benefits, Maturity,...
  • C. Strauch, NoSQL Databases. Lecture Selected Topics on Software-Technology Ultra-Large Scale Sites, Manuscript, Lecture...
  • J. Han, Olap mining: an integration of olap with data mining, in: Proceedings of the 7th IFIP 2.6 Working Conference on...
  • Twitter Team, Twitter Turns Six,...
  • R. Krikorian, Developing for @twitterapi (techcrunch disrupt hackathon),...
  • J. MacLennan, Z. Tang, B. Crivat, Mining OLAP Cubes, Wiley Publishing, 2008, pp....
  • M. Usman, S. Asghar, S. Fong, A conceptual model for combining enhanced olap and data mining systems, in: 5th...
  • J. Han, S. Chee, J. Chiang, Issues for on-line analytical mining of data warehouses, in: Proceedings of the Workshop on...
  • H. Zhu, On-line analytical mining of association rules (Ph.D. thesis), Simon Fraser University,...
  • S. Dzeroski, D. Hristovski, B. Peterlin, Using data mining and olap to discover patterns in a database of patients with...
  • F. Dehne, T. Eavis, A. Rau-Chaplin, Coarse grained parallel on-line analytical processing (OLAP) for data mining, in:...
  • A. Abelló, J. Samos, F. Saltor, A framework for the classification and description of multidimensional data models, in:...
  • S. Luján-Mora et al., A UML profile for multidimensional modeling in data warehouses, Data Knowl. Eng. (2006)
  • S. Mansmann, Extending the OLAP technology to handle non-conventional and complex data (Ph.D. thesis), University of...
  • A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sen Sarma, R. Murthy, H. Liu, Data warehousing and analytics...
  • H. Kwak, C. Lee, H. Park, S. Moon, What is twitter, a social network or a news media?, in: Proceedings of the 19th...
  • A. Java, X. Song, T. Finin, B. Tseng, Why we twitter: understanding microblogging usage and communities, in:...