TeraScope: distributed visual data mining of terascale data sets over photonic networks

https://doi.org/10.1016/S0167-739X(03)00072-4

Abstract

TeraScope is a framework and a suite of tools for interactively browsing and visualizing large terascale data sets. Unique to TeraScope is its utilization of the Optiputer paradigm to treat distributed computer clusters as a single giant computer, where the dedicated optical networks that connect the clusters serve as the computer’s system bus. TeraScope explores one aspect of the Optiputer architecture by employing a distributed pool of memory, called LambdaRAM, that serves as a massive data cache for supporting parallel data mining and visualization algorithms.

Introduction

“Where the telescope ends, the microscope begins. Which of the two has the grander view?”—Victor Hugo

Areas of research such as Geoscience, Astronomy, and High Energy Physics are routinely producing terabytes, and soon petabytes, of data from direct data gathering, data post-processing, and simulations. Algorithmic detection of hidden patterns within these large data sets has been the focus of data mining [6]. Visualization used in this context (often referred to as Visual Data Mining) has been valuable as a way to verify detected patterns, and in particular when algorithmic specifications of the patterns are difficult to derive [2], [3], [8], [9], [10], [15]. In the latter case, user interfaces that allow one to interactively browse, query, and visualize enormous data sets need to be developed [17].

The work described in this paper is motivated by several emerging trends. Firstly, scientific databases are becoming highly distributed. Secondly, the capacity of high speed networking is increasing at a rate far exceeding Moore’s Law: network bandwidth is doubling every 8 months, whereas processor speed is doubling only every 18–24 months. This means that the computers, rather than the networks, are the bottleneck. Thirdly, there is an increasing need and potential, facilitated by these high speed networks, for scientists to publish terabyte data sets on the Web in a manner similar to the way most netizens can create Web pages, so that researchers can make new discoveries by combining data from previously disparate disciplines. For example, by correlating data from the World Health Organization with data from the National Center for Atmospheric Research, one could potentially understand how weather patterns influence the spread of diseases.

The Optiputer is a National Science Foundation funded project intended to exploit these trends by interconnecting distributed storage, computation, and visualization resources using extremely high speed photonic networks [13]. The important difference between this and classical Grid computing is that in this new model, the optical networks serve as the system bus for a potentially planetary-scale computer, and compute clusters, taken as a whole, serve as the peripherals of the computer. For example, a cluster of computers with high performance graphics cards would be thought of as a single giant graphics card in this context. In the Optiputer concept, we refer to compute clusters as LambdaNodes to denote the fact that they are connected by multiple light paths (often referred to as Lambdas) in an optical network. Each computer in a LambdaNode is referred to as a nodule, and collections of LambdaNodes form a LambdaGrid.

TeraScope is an experimental visual data mining toolkit intended to take advantage of the Optiputer paradigm. This paper describes the prototype that was developed and demonstrated at the iGrid 2002 conference in Amsterdam (http://www.igrid2002.org). Furthermore, this paper describes LambdaRAM, a high performance cache for the Optiputer.
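To make the caching idea concrete, the sketch below shows a minimal read-through block cache in the spirit of LambdaRAM. It is only an illustration: the real LambdaRAM pools memory across cluster nodes over optical links, and the class name, interface, and LRU eviction policy here are assumptions, not the system's actual design.

```python
from collections import OrderedDict

class BlockCache:
    """Minimal read-through block cache (illustrative only, not
    LambdaRAM's actual API). Blocks fetched from a remote pool are
    kept in local memory and evicted least-recently-used when the
    capacity is exceeded."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # callable: block_id -> block data
        self.capacity = capacity      # max number of cached blocks
        self.blocks = OrderedDict()   # block_id -> data, in LRU order
        self.misses = 0

    def read(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark most recently used
            return self.blocks[block_id]
        self.misses += 1
        data = self.fetch(block_id)            # stand-in for a remote fetch
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict least recently used
        return data
```

The key point of such a cache is that repeated reads of a hot block pay the remote-fetch cost only once, which is what makes distributed memory usable as a working set for data mining algorithms.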

Section snippets

TeraScope

The vision for TeraScope is to provide a way to fluidly work with massive data sets as interactively as one would work with a spreadsheet on a laptop. The goal is not necessarily to massively parallelize visualization algorithms so that a terabyte of points can be plotted. The goal is to use parallel algorithms to process terabyte data sets to produce visual summaries (which we call TeraMaps) to help the user locate regions that are most interesting to them. Once the area of interest has been
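A TeraMap-style visual summary can be sketched as a coarse 2D histogram that each cluster node computes over its own shard of the data, with the partial results merged by addition. The function and parameter names below are illustrative assumptions, not TeraScope's actual code.

```python
from collections import Counter

def teramap_2d_histogram(records, x_bins, y_bins, x_range, y_range):
    """Bin (x, y) pairs into a coarse 2D histogram -- a stand-in for a
    TeraMap-style visual summary (names are illustrative)."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    counts = Counter()
    for x, y in records:
        if not (x_lo <= x < x_hi and y_lo <= y < y_hi):
            continue  # skip points outside the plotted range
        i = int((x - x_lo) / (x_hi - x_lo) * x_bins)
        j = int((y - y_lo) / (y_hi - y_lo) * y_bins)
        counts[(i, j)] += 1
    return counts

def merge_summaries(parts):
    """Partial histograms combine by addition, so each node can
    summarize its own shard of the data independently."""
    total = Counter()
    for p in parts:
        total.update(p)
    return total
```

Because the merge is a simple sum, the summary scales with the number of bins rather than the number of points, which is what lets a terabyte data set be reduced to something a visualization tool can draw interactively.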

Results

TeraScope was built in stages. The first prototype included only the visualization tools, and these tools interfaced directly with the data retrieved from the DSTP servers. All correlation calculations in the 3D histogram and 2D scatterplot were performed in the visualization tool. The test case consisted of 100 GB of data generated from the National Center for Atmospheric Research’s (NCAR) Community Climate Model (CCM3). This first prototype was demonstrated at SC2001 in Denver, Colorado.

The second
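The correlation calculations mentioned above can be computed from running sums, so the data can arrive chunk by chunk (e.g. streamed from a remote DSTP server) rather than being held in memory at once. A minimal sketch of such a streaming Pearson correlation follows; the names are illustrative, not TeraScope's actual code.

```python
import math

def correlation_streaming(chunks):
    """Pearson correlation of two columns, accumulated from running
    sums over an iterable of chunks of (x, y) pairs. Illustrative
    sketch only, not TeraScope's actual implementation."""
    n = sx = sy = sxx = syy = sxy = 0.0
    for chunk in chunks:
        for x, y in chunk:
            n += 1
            sx += x
            sy += y
            sxx += x * x
            syy += y * y
            sxy += x * y
    # Unnormalized covariance and variances from the accumulated sums.
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / math.sqrt(var_x * var_y)
```

Since only six scalars are carried between chunks, the same accumulation can run in parallel on each node, with the partial sums added before the final division.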

Discussion and future work

This paper has provided an overview of a project recently underway to develop interactive visual data mining tools for exploring massive data sets, using the Optiputer paradigm. At the time of writing, a prototype of TeraScope had been demonstrated at iGrid 2002. With the TeraScope framework in place, future work will focus on the development of new tools for creating visual summaries, performance monitoring of the first LambdaRAM prototype, and augmentation of LambdaRAM with

Acknowledgements

The virtual reality and advanced networking research, collaborations, and outreach programs at the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago are made possible by major funding from the National Science Foundation (NSF), awards EIA-9802090, EIA-0115809, ANI-9980480, ANI-0229642, ANI-9730202, ANI-0123399, ANI-0129527 and EAR-0218918, as well as the NSF Information Technology Research (ITR) cooperative agreement (ANI-0225642) to the University of California


References (21)

  • M.D. Beynon et al., Processing large-scale multi-dimensional data in parallel and distributed environments, Parallel Comput. (2002)
  • S. Bailey, R. Grossman, S. Gutti, H. Sivakumar, A high performance implementation of the data space transfer protocol...
  • A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: towards an architecture for...
  • K. Li, IVY: a shared virtual memory system for parallel computing, in: Proceedings of the International Conference on...
  • Project DataSpace....
  • W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge discovery in databases: an overview, in: Knowledge Discovery in...
  • QUANTA....
  • D.A. Keim, Information visualization and visual data mining, IEEE Trans. Visual. Comput. Graphics 8 (1)...
  • D.A. Keim, H.-P. Kriegel, Visualization techniques for mining large databases: a comparison, IEEE Trans. Knowledge Data Eng....
  • T. Kurc et al., Exploration and visualization of very large datasets with the active data repository, IEEE Comput. Graphics Appl. (2001)
There are more references available in the full text version of this article.

Cited by (19)

  • Federated grid clusters using service address routed optical networks

    2007, Future Generation Computer Systems
    Citation excerpt:

    Optical networks can be as much as 100 times faster than regular 100 Mbps Ethernet interconnected network, which is used in some laboratories as the fabric for interconnection networks for tightly coupled distributed systems [26]. An example of an optiputer would be a set of clusters interconnected by dedicated optical fibers [3,14,25,28,31] sharing computing resources. Consider the distance between two sister campuses within the University of California: UC Irvine and UC San Diego.

  • Real-time multi-scale brain data acquisition, assembly, and analysis using an end-to-end OptIPuter

    2006, Future Generation Computer Systems
    Citation excerpt:

    Applications accessing LambdaRAM see the distributed cache as a contiguous memory image. LambdaRAM is currently used in JuxtaView and TeraScope [6] (EVL’s visual data mining software) to provide data correlation algorithms with fast access to distributed database tables. LambdaStream [12] is a transport protocol designed specifically to support gigabit-level streaming, which is required by streaming applications over OptIPuter.

  • Data mining middleware for wide-area high-performance networks

    2006, Future Generation Computer Systems
    Citation excerpt:

    Moving large data sets over high-speed wide-area networks has been recognized as a challenging task for many years. During iGrid 2002, various groups demonstrated prototypes of several different tools for high-performance data transport [2,3,9,16,18,21]. Since then, various new data transport protocols or related congestion control algorithms [8,10,12] have been designed and developed.

  • International real-time streaming of 4K digital cinema

    2006, Future Generation Computer Systems
    Citation excerpt:

    In the first public demonstration of 4K motion pictures in 2001, nearly all content had been played from local servers. The demonstrations at iGrid 2002 included 2K class real-time networked applications, such as Griz, TeraScope, TeraVision, HDTV over IP and so forth [5–8]. In 2002 and again in 2004, 4K-over-networks was demonstrated in Japan and in the USA using domestic networks [9].

  • Distributed and collaborative visualization of large data sets using high-speed networks

    2006, Future Generation Computer Systems
    Citation excerpt:

    High-speed networks that provide bandwidths larger than disk transfer rates make transferring data from remote memory faster than reading data from the local disk. In effect, we are using a large pool of memory distributed over multiple remote computers, similar to LambdaRAM/Optiputer [5]. With distributed resources network latency can become an issue for the application.


Chong Zhang is a PhD student in the Department of Computer Science at the University of Illinois at Chicago (UIC). He is also working as a research assistant at the Electronic Visualization Laboratory (EVL) at UIC. His current interests include distributed computing, scientific visualization, and data mining.

Jason Leigh is an associate professor in the Department of Computer Science at the University of Illinois at Chicago and a senior scientist at the Electronic Visualization Laboratory (EVL). Leigh is co-chair of the Global Grid Forum’s Advanced Collaborative Environments research group and a co-founder of the GeoWall Consortium.

Thomas A. DeFanti, PhD, is director of the Electronic Visualization Laboratory (EVL), a distinguished professor in the Department of Computer Science, and director of the Software Technologies Research Center at the University of Illinois at Chicago.

Marco Mazzucco is a post-doctoral fellow at the University of Wales, Swansea. He is currently working on an EPSRC funded project in theoretical computer science under the direction of Dr. Martin Otto. He also does research and consulting for the National Center for Data Mining. He received his PhD in Mathematics at UIC in 2000.

Robert Grossman is the director of the Laboratory for Advanced Computing and National Center for Data Mining at the University of Illinois at Chicago (UIC). He is also the Founder and CEO of the Two Cultures Group.
