TeraScope: distributed visual data mining of terascale data sets over photonic networks
Introduction
“Where the telescope ends, the microscope begins. Which of the two has the grander view?”—Victor Hugo
Areas of research such as Geoscience, Astronomy, and High Energy Physics are routinely producing terabytes, and soon petabytes, of data from direct data gathering, data post-processing, and simulations. Algorithmic detection of hidden patterns within these large data sets has been the focus of data mining [6]. Visualization used in this context (often referred to as Visual Data Mining) has proved valuable as a way to verify the detected patterns, particularly when algorithmic specifications of the patterns are difficult to derive [2], [3], [8], [9], [10], [15]. In the latter case, user interfaces that allow one to interactively browse, query, and visualize enormous data sets need to be developed [17].
The work described in this paper is motivated by several emerging trends. Firstly, scientific databases are becoming highly distributed. Secondly, the capability of high speed networking is increasing at a rate far exceeding Moore's Law: network bandwidth is doubling every 8 months, whereas processors are doubling in speed only every 18–24 months. This means that computers, rather than the networks, are the bottleneck. Thirdly, there is an increasing need and potential, facilitated by these high speed networks, for scientists to publish terabyte data sets on the Web in a manner similar to the way most netizens can create Web pages, so that researchers can make new discoveries by combining data from previously disparate disciplines. For example, by correlating data from the World Health Organization with data from the National Center for Atmospheric Research, one could potentially understand how weather patterns influence the spread of diseases.
The Optiputer is a National Science Foundation funded project intended to exploit these trends by interconnecting distributed storage, computation, and visualization resources using extremely high speed photonic networks [13]. The important difference between this and classical Grid computing is that in this new model, the optical networks serve as the system bus for a potentially planetary-scale computer; and compute clusters taken as a whole, serve as the peripherals in the computer. For example, a cluster of computers with high performance graphics cards would be thought of as a single giant graphics card in this context. In the Optiputer concept, we refer to compute clusters as LambdaNodes to denote the fact that they are connected by multiples of light paths (often referred to as Lambdas) in an optical network. Each computer in a LambdaNode is referred to as a nodule, and collections of LambdaNodes form a LambdaGrid.
TeraScope is an experimental visual data mining toolkit intended to take advantage of the Optiputer paradigm. This paper describes the prototype that was developed and demonstrated at the IGrid 2002 conference in Amsterdam (http://www.igrid2002.org). Furthermore, this paper describes LambdaRAM, a high performance cache for the Optiputer.
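The paper describes LambdaRAM only at a high level here. As a rough illustration of the underlying idea of caching remote data in local memory rather than reading it from disk, the following sketch (all names hypothetical, not TeraScope's actual API) fetches fixed-size blocks on demand and keeps them in RAM, with a naive sequential prefetch standing in for the latency-hiding strategies a real LambdaRAM would need:

```python
# Hypothetical sketch of a LambdaRAM-style block cache. The fetch_block
# interface and all names are assumptions for illustration only.
class BlockCache:
    """Caches fixed-size blocks of a remote data set in local RAM."""

    def __init__(self, fetch_block, block_size=1 << 20):
        self.fetch_block = fetch_block   # callable: block index -> bytes
        self.block_size = block_size
        self.blocks = {}                 # block index -> cached bytes

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`, fetching (and
        prefetching) blocks over the network only on a cache miss."""
        out = bytearray()
        first = offset // self.block_size
        last = (offset + length - 1) // self.block_size
        for i in range(first, last + 1):
            if i not in self.blocks:
                self.blocks[i] = self.fetch_block(i)
            out += self.blocks[i]
        # Naive sequential prefetch: assume the next block is wanted soon.
        if last + 1 not in self.blocks:
            self.blocks[last + 1] = self.fetch_block(last + 1)
        start = offset - first * self.block_size
        return bytes(out[start:start + length])
```

A real implementation would also need eviction and parallel fetches across multiple light paths; this sketch only shows the cache-and-prefetch pattern.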
TeraScope
The vision for TeraScope is to provide a way to fluidly work with massive data sets as interactively as one would work with a spreadsheet on a laptop. The goal is not necessarily to massively parallelize visualization algorithms so that a terabyte of points can be plotted. The goal is to use parallel algorithms to process terabyte data sets to produce visual summaries (which we call TeraMaps) to help the user locate regions that are most interesting to them. Once the area of interest has been
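The TeraMap idea described above, producing compact visual summaries from terabyte data rather than plotting every point, can be sketched as a coarse 2D histogram computed independently on each nodule's data partition and merged by summing bins. This is an illustration of the concept under assumed names, not TeraScope's actual implementation:

```python
# Illustrative TeraMap-style summary: names are hypothetical.
def local_histogram(points, bins, xmin, xmax, ymin, ymax):
    """Bin (x, y) points from one data partition into a bins x bins grid."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        i = min(int((x - xmin) / (xmax - xmin) * bins), bins - 1)
        j = min(int((y - ymin) / (ymax - ymin) * bins), bins - 1)
        grid[i][j] += 1
    return grid

def merge(grids):
    """Combine per-nodule histograms into one summary by summing bins."""
    bins = len(grids[0])
    return [[sum(g[i][j] for g in grids) for j in range(bins)]
            for i in range(bins)]
```

The merged grid is small regardless of input size, which is what makes it feasible to ship to a visualization client and render interactively.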
Results
TeraScope was built in stages. The first prototype included only the visualization tools, and these tools interfaced directly with the data retrieved from the DSTP servers. All correlation calculations in the 3D histogram and 2D scatterplot were performed in the visualization tool. The test case consisted of 100 GB of generated data from the National Center for Atmospheric Research's (NCAR) Community Climate Model (CCM3). This first prototype was demonstrated at SC2001 in Denver, Colorado.
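One way correlation calculations like these can be moved off the visualization tool and onto a cluster (a hedged sketch, not the paper's implementation) is to have each node compute sufficient statistics over its partition and combine them centrally into a single Pearson correlation; all names here are illustrative:

```python
import math

# Illustrative distributed correlation via partial sums; hypothetical names.
def partial_stats(xs, ys):
    """Per-partition sufficient statistics for Pearson correlation."""
    return (len(xs), sum(xs), sum(ys),
            sum(x * x for x in xs), sum(y * y for y in ys),
            sum(x * y for x, y in zip(xs, ys)))

def merged_correlation(parts):
    """Combine partial statistics from all nodes and evaluate Pearson's r."""
    n, sx, sy, sxx, syy, sxy = (sum(p[k] for p in parts) for k in range(6))
    num = n * sxy - sx * sy
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    return num / den
```

Because only six numbers per partition cross the network, the communication cost is independent of the data size, which suits the Optiputer model of clusters as peripherals.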
The second
Discussion and future work
This paper has provided an overview of a project recently underway to develop interactive visual data mining tools for exploring massive data sets, using the Optiputer paradigm. At the time of writing of this paper, a prototype of TeraScope had been demonstrated at IGrid 2002. With the TeraScope framework in place, future work will focus on the development of new tools for creating visual summaries, performance monitoring of the first LambdaRAM prototype, and augmentation of LambdaRAM with
Acknowledgements
The virtual reality and advanced networking research, collaborations, and outreach programs at the Electronic Visualization Laboratory (EVL) at the University of Illinois at Chicago are made possible by major funding from the National Science Foundation (NSF), awards EIA-9802090, EIA-0115809, ANI-9980480, ANI-0229642, ANI-9730202, ANI-0123399, ANI-0129527 and EAR-0218918, as well as the NSF Information Technology Research (ITR) cooperative agreement (ANI-0225642) to the University of California
Chong Zhang is a PhD student in the Department of Computer Science at the University of Illinois at Chicago (UIC). He is also working as a research assistant at the Electronic Visualization Laboratory (EVL) at UIC. His current interests include distributed computing, scientific visualization, and data mining.
References (21)
- et al., Processing large-scale multi-dimensional data in parallel and distributed environments, Parallel Comput. (2002)
- S. Bailey, R. Grossman, S. Gutti, H. Sivakumar, A high performance implementation of the data space transfer protocol...
- Ann Chervenak, I. Foster, Carl Kesselman, Charles Salisbury, Steven Tuecke, The data grid: towards an architecture for...
- K. Li, IVY: a shared virtual memory system for parallel computing, in: Proceedings of the International Conference on...
- Project DataSpace....
- W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge Discovery in Databases: An Overview. Knowledge Discovery in...
- QUANTA....
- D.A. Keim, Information visualization and visual data mining, IEEE Trans. Visual. Comput. Graphics 8 (1)...
- D.A. Keim, H.-P. Kriegel, Visualization techniques for mining large databases: a comparison, IEEE Trans. Knowledge Data Eng....
- et al., Exploration and visualization of very large datasets with the active data repository, IEEE Comput. Graphics Appl. (2001)
Jason Leigh is an associate professor in the Department of Computer Science at the University of Illinois at Chicago and a senior scientist at the Electronic Visualization Laboratory (EVL). Leigh is co-chair of the Global Grid Forum’s Advanced Collaborative Environments research group and a co-founder of the GeoWall Consortium.
Thomas A. DeFanti, PhD, is director of the Electronic Visualization Laboratory (EVL), a distinguished professor in the Department of Computer Science, and director of the Software Technologies Research Center at the University of Illinois at Chicago.
Marco Mazzucco is a post-doctoral fellow at the University of Wales, Swansea. He is currently working on an EPSRC-funded project in theoretical computer science under the direction of Dr. Martin Otto. He also does research and consulting for the National Center for Data Mining. He received his PhD in Mathematics at UIC in 2000.
Robert Grossman is the director of the Laboratory for Advanced Computing and National Center for Data Mining at the University of Illinois at Chicago (UIC). He is also the Founder and CEO of the Two Cultures Group.