TeraScope: distributed visual data mining of terascale data sets over photonic networks

https://doi.org/10.1016/S0167-739X(03)00072-4

Abstract

TeraScope is a framework and a suite of tools for interactively browsing and visualizing large terascale data sets. Unique to TeraScope is its utilization of the Optiputer paradigm to treat distributed computer clusters as a single giant computer, where the dedicated optical networks that connect the clusters serve as the computer’s system bus. TeraScope explores one aspect of the Optiputer architecture by employing a distributed pool of memory, called LambdaRAM, that serves as a massive data cache for supporting parallel data mining and visualization algorithms.

Introduction

“Where the telescope ends, the microscope begins. Which of the two has the grander view?”—Victor Hugo

Areas of research such as Geoscience, Astronomy, and High Energy Physics are routinely producing terabytes, and soon petabytes, of data from direct data gathering, data post-processing, and simulations. Algorithmic detection of hidden patterns within these large data sets has been the focus of data mining [6]. Visualization used in this context (often referred to as Visual Data Mining) has been valuable as a way to verify detected patterns, and in particular when algorithmic specifications of the patterns are difficult to derive [2], [3], [8], [9], [10], [15]. In the latter case, user interfaces that allow one to interactively browse, query, and visualize enormous data sets need to be developed [17].

The work described in this paper is motivated by several emerging trends. Firstly, scientific databases are becoming highly distributed. Secondly, the capacity of high speed networking is increasing at a rate far exceeding Moore’s Law: network bandwidth is doubling every 8 months, whereas processor speed is doubling only every 18–24 months. This means that the computers, rather than the networks, are the bottleneck. Thirdly, there is an increasing need and potential, facilitated by these high speed networks, for scientists to publish terabyte data sets on the Web in a manner similar to the way most netizens can create Web pages, so that researchers can make new discoveries by combining data from previously disparate disciplines. For example, by correlating data from the World Health Organization with data from the National Center for Atmospheric Research, one could potentially understand how weather patterns influence the spread of diseases.

The Optiputer is a National Science Foundation funded project intended to exploit these trends by interconnecting distributed storage, computation, and visualization resources using extremely high speed photonic networks [13]. The important difference between this and classical Grid computing is that in this new model, the optical networks serve as the system bus for a potentially planetary-scale computer, and compute clusters, taken as a whole, serve as the peripherals of the computer. For example, a cluster of computers with high performance graphics cards would be thought of as a single giant graphics card in this context. In the Optiputer concept, we refer to compute clusters as LambdaNodes to denote the fact that they are connected by multiple light paths (often referred to as Lambdas) in an optical network. Each computer in a LambdaNode is referred to as a nodule, and collections of LambdaNodes form a LambdaGrid.

TeraScope is an experimental visual data mining toolkit intended to take advantage of the Optiputer paradigm. This paper describes the prototype that was developed and demonstrated at the iGrid 2002 conference in Amsterdam (http://www.igrid2002.org). Furthermore, this paper describes LambdaRAM, a high performance cache for the Optiputer.
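To make the caching idea concrete, the sketch below shows a minimal read-through block cache in the spirit of LambdaRAM. It is only an illustration: the real LambdaRAM pools memory across cluster nodes over optical links, and the class name, interface, and LRU eviction policy here are assumptions, not the system's actual design.

```python
from collections import OrderedDict

class BlockCache:
    """Minimal read-through block cache (illustrative only, not
    LambdaRAM's actual API). Blocks fetched from a remote pool are
    kept in local memory and evicted least-recently-used when the
    capacity is exceeded."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # callable: block_id -> block data
        self.capacity = capacity      # max number of cached blocks
        self.blocks = OrderedDict()   # block_id -> data, in LRU order
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch(block_id)            # stand-in for a remote fetch
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data
```

The key point of such a cache is that repeated reads of a hot block pay the remote-fetch cost only once, which is what makes distributed memory usable as a working set for data mining algorithms.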

Section snippets

TeraScope

The vision for TeraScope is to provide a way to fluidly work with massive data sets as interactively as one would work with a spreadsheet on a laptop. The goal is not necessarily to massively parallelize visualization algorithms so that a terabyte of points can be plotted. The goal is to use parallel algorithms to process terabyte data sets to produce visual summaries (which we call TeraMaps) to help the user locate regions that are most interesting to them. Once the area of interest has been
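A TeraMap-style visual summary can be sketched as a coarse 2D histogram that each cluster node computes over its own shard of the data, with the partial results merged by addition. The function and parameter names below are illustrative assumptions, not TeraScope's actual code.

```python
from collections import Counter

def teramap_2d_histogram(records, x_bins, y_bins, x_range, y_range):
    """Bin (x, y) pairs into a coarse 2D histogram -- a stand-in for a
    TeraMap-style visual summary (names are illustrative)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    counts = Counter()
    for x, y in records:
        if not (x_lo <= x < x_hi and y_lo <= y < y_hi):
            continue  # skip points outside the plotted range
        i = int((x - x_lo) / (x_hi - x_lo) * x_bins)
        j = int((y - y_lo) / (y_hi - y_lo) * y_bins)
        counts[(i, j)] += 1
    return counts

def merge_summaries(parts):
    """Partial histograms combine by addition, so each node can
    summarize its own shard of the data independently."""
    total = Counter()
    for p in parts:
        total.update(p)
    return total
```

Because the merge is a simple sum, the summary scales with the number of bins rather than the number of points, which is what lets a terabyte data set be reduced to something a visualization tool can draw interactively.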

Results

TeraScope was built in stages. The first prototype included only the visualization tools, and these tools interfaced directly with the data retrieved from the DSTP servers. All correlation calculations in the 3D histogram and 2D scatterplot were performed in the visualization tool. The test case consisted of 100 GB of data generated from the National Center for Atmospheric Research’s (NCAR) Community Climate Model (CCM3). This first prototype was demonstrated at SC2001 in Denver, Colorado.

The second
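The correlation calculations mentioned above can be computed from running sums, so the data can arrive chunk by chunk (e.g. streamed from a remote DSTP server) rather than being held in memory at once. A minimal sketch of such a streaming Pearson correlation follows; the names are illustrative, not TeraScope's actual code.

```python
import math

def correlation_streaming(chunks):
    """Pearson correlation of two columns, accumulated from running
    sums over an iterable of chunks of (x, y) pairs. Illustrative
    sketch only, not TeraScope's actual implementation."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
    # Unnormalized covariance and variances from the accumulated sums.
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)
```

Since only six scalars are carried between chunks, the same accumulation can run in parallel on each node, with the partial sums added before the final division.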

Discussion and future work

This paper has provided an overview of a project recently underway to develop interactive visual data mining tools for exploring massive data sets, using the Optiputer paradigm. At the time of writing, a prototype of TeraScope had been demonstrated at iGrid 2002. With the TeraScope framework in place, future work will focus on the development of new tools for creating visual summaries, performance monitoring of the first LambdaRAM prototype, and augmentation of LambdaRAM with

Acknowledgements

The virtual reality and advanced networking research, collaborations, and outreach programs at the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago are made possible by major funding from the National Science Foundation (NSF), awards EIA-9802090, EIA-0115809, ANI-9980480, ANI-0229642, ANI-9730202, ANI-0123399, ANI-0129527 and EAR-0218918, as well as the NSF Information Technology Research (ITR) cooperative agreement (ANI-0225642) to the University of California


References (21)

  • M.D. Beynon et al., Processing large-scale multi-dimensional data in parallel and distributed environments, Parallel Comput. (2002)
  • S. Bailey, R. Grossman, S. Gutti, H. Sivakumar, A high performance implementation of the data space transfer protocol...
  • A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: towards an architecture for...
  • K. Li, IVY: a shared virtual memory system for parallel computing, in: Proceedings of the International Conference on...
  • Project DataSpace....
  • W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge discovery in databases: an overview, in: Knowledge Discovery in...
  • QUANTA....
  • D.A. Keim, Information visualization and visual data mining, IEEE Trans. Visual. Comput. Graphics 8 (1)...
  • D.A. Keim, H.-P. Kriegel, Visualization techniques for mining large databases: a comparison, IEEE Trans. Knowledge Data Eng....
  • T. Kurc et al., Exploration and visualization of very large datasets with the active data repository, IEEE Comput. Graphics Appl. (2001)
There are more references available in the full text version of this article.

Cited by (19)

  • Federated grid clusters using service address routed optical networks

    2007, Future Generation Computer Systems
    Citation excerpt:

    Optical networks can be as much as 100 times faster than regular 100 Mbps Ethernet interconnected network, which is used in some laboratories as the fabric for interconnection networks for tightly coupled distributed systems [26]. An example of an optiputer would be a set of clusters interconnected by dedicated optical fibers [3,14,25,28,31] sharing computing resources. Consider the distance between two sister campuses within the University of California: UC Irvine and UC San Diego.

  • Real-time multi-scale brain data acquisition, assembly, and analysis using an end-to-end OptIPuter

    2006, Future Generation Computer Systems
    Citation excerpt:

    Applications accessing LambdaRAM see the distributed cache as a contiguous memory image. LambdaRAM is currently used in JuxtaView and TeraScope [6] (EVL’s visual data mining software) to provide data correlation algorithms with fast access to distributed database tables. LambdaStream [12] is a transport protocol designed specifically to support gigabit-level streaming, which is required by streaming applications over OptIPuter.

  • Data mining middleware for wide-area high-performance networks

    2006, Future Generation Computer Systems
    Citation excerpt:

    Moving large data sets over high-speed wide-area networks has been recognized as a challenging task for many years. During iGrid 2002, various groups demonstrated prototypes of several different tools for high-performance data transport [2,3,9,16,18,21]. Since then, various new data transport protocols or related congestion control algorithms [8,10,12] have been designed and developed.

  • International real-time streaming of 4K digital cinema

    2006, Future Generation Computer Systems
    Citation excerpt:

    In the first public demonstration of 4K motion pictures in 2001, nearly all content had been played from local servers. The demonstrations at iGrid 2002 included 2K class real-time networked applications, such as Griz, TeraScope, TeraVision, HDTV over IP and so forth [5–8]. In 2002 and again in 2004, 4K-over-networks was demonstrated in Japan and in the USA using domestic networks [9].

  • Distributed and collaborative visualization of large data sets using high-speed networks

    2006, Future Generation Computer Systems
    Citation excerpt:

    High-speed networks that provide bandwidths larger than disk transfer rates make transferring data from remote memory faster than reading data from the local disk. In effect, we are using a large pool of memory distributed over multiple remote computers, similar to LambdaRAM/Optiputer [5]. With distributed resources network latency can become an issue for the application.


Chong Zhang is a PhD student in the Department of Computer Science at the University of Illinois at Chicago (UIC). He is also working as a research assistant at the Electronic Visualization Laboratory (EVL) at UIC. His current interests include distributed computing, scientific visualization, and data mining.

Jason Leigh is an associate professor in the Department of Computer Science at the University of Illinois at Chicago and a senior scientist at the Electronic Visualization Laboratory (EVL). Leigh is co-chair of the Global Grid Forum’s Advanced Collaborative Environments research group and a co-founder of the GeoWall Consortium.

Thomas A. DeFanti, PhD, is director of the Electronic Visualization Laboratory (EVL), a distinguished professor in the Department of Computer Science, and director of the Software Technologies Research Center at the University of Illinois at Chicago.

Marco Mazzucco is a post-doctoral fellow at the University of Wales, Swansea. He is currently working on an EPSRC funded project in theoretical computer science under the direction of Dr. Martin Otto. He also does research and consulting for the National Center for Data Mining. He received his PhD in Mathematics at UIC in 2000.

Robert Grossman is the director of the Laboratory for Advanced Computing and National Center for Data Mining at the University of Illinois at Chicago (UIC). He is also the Founder and CEO of the Two Cultures Group.
