Analysis and Visualization of Coastal Ocean Model Data in the Cloud

Abstract: The traditional flow of coastal ocean model data is from High-Performance Computing (HPC) centers to the local desktop, or to a file server where just the needed data can be extracted via services such as OPeNDAP. Analysis and visualization are then conducted using local hardware and software. This requires moving large amounts of data across the internet as well as acquiring and maintaining local hardware, software, and support personnel. Further, as data sets increase in size, the traditional workflow may not be scalable. Alternatively, recent advances make it possible to move data from HPC to the Cloud and perform interactive, scalable, data-proximate analysis and visualization, with simply a web browser user interface. We use the framework advanced by the NSF-funded Pangeo project, a free, open-source Python system which provides multi-user login via JupyterHub and parallel analysis via Dask, both running in Docker containers orchestrated by Kubernetes. Data are stored in the Zarr format, a Cloud-friendly n-dimensional array format that allows performant extraction of data by anyone without relying on data services like OPeNDAP.


Introduction
Analysis, visualization, and distribution of coastal ocean model data are challenging due to the sheer size of the data involved, with regional simulations commonly in the 10 GB to 1 TB range. The traditional workflow is to download data to local workstations or file servers from which the data needed can be extracted via services such as OPeNDAP [1]. Analysis and visualization take place with environments like MATLAB® and Python running on local computers. Not only are these datasets becoming too large to effectively download and analyze locally, but this approach requires acquiring and maintaining local hardware, software, and personnel to ensure reliable and efficient processing. Archiving is an additional challenge for many centers. Effective sharing with collaborators is often limited by unreliable services that cannot scale with demand. In some cases, a subset of analysis and visualization tools is made available through custom web portals (e.g., [2,3]). These portals can satisfy the needs of data dissemination to the public but do not have the suite of scientific analysis tools needed for collaborative research use. In addition, the development and maintenance of these portals requires dedicated web software developers and is out of the reach of most scientists. The traditional method of data access and use is becoming time and cost inefficient. The Cloud and recent advances in technology provide new opportunities for analysis, visualization, and distribution of model data, overcoming these problems [4].
Data can be stored in the Cloud efficiently in object storage, which allows performant access by providers and end users alike. Analysis and visualization can take place in the Cloud, close to the data, allowing efficient and cost-effective access, as the only data that need to leave the Cloud are graphics and text returned to the browser. As these tools have matured, they have lowered the barrier to entry and are poised to transform the ability of regular scientists and engineers to collaborate on difficult research problems without being constrained by their local resources.
The Pangeo project [5][6][7] was created to take advantage of these advances for the scientific community. The specific goals of Pangeo are to: "(1) Foster collaboration around the open source scientific Python ecosystem for ocean/atmosphere/land/climate science. (2) Support the development with domain-specific geoscience packages. (3) Improve scalability of these tools to handle petabyte-scale datasets on HPC and Cloud platforms." It makes progress toward these goals by building on open-source packages already widely used in the Python ecosystem, and supporting a flexible and modular framework for interactive, scalable, data-proximate computing on large gridded datasets. Here we first describe the essential components of this framework, then demonstrate two coastal ocean modeling use cases: (1) calculating the maximum water level at each grid cell from a 53 GB, 720 time step, 9 million node triangular mesh ADCIRC [8] simulation of Hurricane Ike; (2) creating a dashboard for visualizing data from the curvilinear orthogonal COAWST/ROMS [9] forecast model.

Framework Description
Pangeo is a flexible framework which can be deployed on different types of platforms with different components, so here we describe the specific framework we used for this work (Figure 1), consisting of:
• Zarr [10] format files with model output for cloud-friendly access
• Dask [11] for parallel scheduling and execution
• Xarray [12] for working effectively with model output using the NetCDF/CF [13,14] Data Model
• PyViz [15] for interactive visualization of the output
• Jupyter [16] to allow user interaction via their web browser
Figure 1. The Pangeo Cloud framework used here: Zarr for analysis-ready data on distributed, globally accessible storage; Dask for managing parallel computations; Xarray for gridded data analysis; PyViz for interactive visualization; and Jupyter for user access via a web browser. The framework works with any Cloud provider because it uses Kubernetes, which orchestrates and scales a cluster of Docker containers.
We will briefly describe these and several other important components in more detail.

Zarr
Zarr is a Cloud-friendly data format. The Cloud uses object storage. Access to NetCDF files (the most commonly used format for model data) in object storage is poor, due to the latency of object requests and the numerous small requests involved with accessing data from a NetCDF file. Therefore, we converted the model output from NetCDF format to Zarr format, which was developed specifically to allow Cloud-friendly access to n-dimensional array data. With Zarr, the metadata are stored in JSON format, and the data chunks are stored as separate storage objects, typically with chunk sizes of 10-100 MB, which enables concurrent reads by multiple processors. The major features of the HDF5 and NetCDF4 data models are supported: self-describing datasets with variables, dimensions and attributes, supporting groups, chunking, and compression. It is being developed in an open community fashion on GitHub, with contributions from multiple research organizations. Currently, only a Python interface exists, but it has a well-documented specification and other language bindings are being developed. The Unidata NetCDF team is working on the adoption of Zarr as a back-end to the NetCDF C library. Data can be converted from NetCDF, HDF5 or other n-dimensional array formats to Zarr using the Xarray library (described below).

Dask
Dask is a component that facilitates out-of-memory and parallel computations. Dask arrays allow handling very large array operations using many small arrays known as "chunks". Dask workers perform operations in parallel, and dask worker clusters can be created on local machines with multiple CPUs, on HPC with job submission, and on the Cloud via Kubernetes [17] orchestration of Docker [18] containers.
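The chunking idea can be illustrated with a small Dask array. This is a sketch under assumed array and chunk sizes, run on Dask's default local scheduler rather than on a Kubernetes cluster.

```python
import dask.array as da

# A 720 x 1000 array split into chunks of 10 x 250 values.
# Creating it only builds a task graph; no data are computed yet.
x = da.ones((720, 1000), chunks=(10, 250))
print(x.numblocks)  # number of chunks along each axis: (72, 4)

# The reduction is expressed once; Dask applies it chunk by chunk and
# combines the partial results when compute() is called.
col_max = x.max(axis=0).compute()
print(col_max.shape)
```

The same code runs unchanged on a larger cluster; only the scheduler that executes the task graph differs.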

Xarray
Xarray is a component which implements the NetCDF Data model, with the concept of a dataset that contains named shared dimensions, global attributes, and a collection of variables that have identified dimensions and variable attributes. It can read from a variety of sources, including NetCDF, HDF, OPeNDAP, Zarr and many raster data formats. Xarray automatically uses Dask for parallelization when the data are stored in a format that uses chunks, or when chunking is explicitly specified by the user.
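A brief sketch of this data model follows, using a toy dataset whose variable names, sizes, and attributes are invented for illustration. It also shows that explicitly chunking a dataset makes subsequent operations lazy and Dask-backed.

```python
import dask.array
import numpy as np
import xarray as xr

# Toy dataset: one variable with named dimensions and variable attributes,
# coordinate variables, and a global attribute.
ds = xr.Dataset(
    data_vars={
        "zeta": (
            ("time", "node"),
            np.zeros((20, 6)),
            {"units": "m", "long_name": "water level"},
        ),
    },
    coords={"time": np.arange(20), "node": np.arange(6)},
    attrs={"title": "toy dataset"},
)

# Explicit chunking converts the underlying arrays to Dask arrays.
chunked = ds.chunk({"time": 5})
print(chunked["zeta"].chunks)

# Operations on chunked data are lazy until .compute() or .load() is called.
mean_zeta = chunked["zeta"].mean(dim="time")
print(isinstance(mean_zeta.data, dask.array.Array))
```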

PyViz
PyViz is a coordinated effort to make data visualization in Python easier to use, easier to learn, and more powerful. It is a collection of visualization packages built on top of a foundation of mature, widely used data structures and packages in the scientific Python ecosystem. The functions of these packages are described separately below along with the associated EarthSim project that is instrumental in advancing and extending PyViz capabilities.

EarthSim
EarthSim [19,20] is a project that acts as a testing ground for PyViz workflows for specifying, launching, visualizing and analyzing environmental simulations such as hydrologic, oceanographic, weather and climate modeling. It contains both experimental tools and example workflows. Approaches and tools developed in this project are often incorporated into the other PyViz packages upon maturity. Specifically, key improvements in the ability to represent large curvilinear mesh and triangular mesh grids (e.g., TriMesh and QuadMesh) made the PyViz tools practical for use by modelers.

PyViz: Datashader
Datashader [21] renders visualizations of large data into rasters, allowing accurate, dynamic representation of datasets that would otherwise be impossible to display in the browser.

PyViz: HoloViews
HoloViews [22] is a package that allows the visualization of data objects through annotation of the objects. It supports different back-end plotting packages, including Matplotlib, Plotly, and Bokeh. The Matplotlib backend provides static plots, while Bokeh generates visualizations in JavaScript that are rendered in the browser and allow user interaction such as zooming, panning, and selection. We used Bokeh here, and the visualizations work both in Jupyter Notebooks and deployed as web pages running with Python backends. A key aspect of using HoloViews for large data is that it can dynamically rasterize the plot to screen resolution using Datashader.

PyViz: GeoViews
GeoViews [23] is a package that layers geographic mapping on top of HoloViews, using the Cartopy [24] package for map projections and plotting. It also allows a consistent interface to many different map elements, including Web Map Tile Services, vector-based geometry formats such as Shapefiles and GeoJSON, raster data and QuadMesh and TriMesh objects useful for representing model grids.

PyViz: HvPlot
HvPlot [25] is a high-level package that makes it easy to create HoloViews/GeoViews objects by allowing users to replace their normal object .plot() commands with .hvplot(). Sophisticated visualizations can therefore be created with one plot call, and then if needed supplemented with additional lower-level HoloViews information for finer-grained control.

PyViz: Panel
Panel [26] is a package that provides a framework for creating dashboards that contain multiple visualizations, control widgets and explanatory text. It works within Jupyter and the dashboards can also be deployed as web applications that work dynamically with Dask-powered Python backends.

JupyterHub
JupyterHub [27] is a component that allows multi-user login, with each user getting their own Jupyter server and persisted disk space. The Jupyter server runs on the host system, and users interact with the server via the Jupyter client, which runs in any modern web browser. Users type code into cells in a Jupyter Notebook; the cells are processed on the server, and the output (e.g., figures and results of calculations) is returned as cell output directly below the code. The notebooks themselves are simple text files that may be shared and reused by others.

Kubernetes
Kubernetes [17] is a component that orchestrates containers like Docker, automating deployment, scaling, and operations of containers across clusters of hosts. Although developed by Google, the project is open-source and Cloud agnostic. It allows JupyterHub to scale with the number of users, and individual tasks to scale with the number of requested Dask workers.

Conda: Reproducible Software Environment
Utilizing both Pangeo and PyViz components, the system contains 300+ packages. With this many packages, we need an approach that minimizes the possibility of conflicts. We use Conda [28], "an open source, cross-platform, language agnostic package manager and environment management system". Conda allows installation of pre-built binary packages, and providers can deliver packages via channels at anaconda.org. To provide a consistent and reliable build environment, the community has created conda-forge [29], a build infrastructure that relies on continuous integration to create packages for Windows, macOS, and Linux. We specify only the conda-forge channel when we create our environment, and use specific packages from other channels only when absolutely necessary. For example, currently, over 90% of the 300+ packages we use to build the Pangeo Docker containers are from conda-forge.

Community
The Pangeo collaborator community [30] plays a critical role in making this framework deployable and usable by domain scientists like ocean modelers. The community discusses technical and usage challenges on GitHub issues [31], during weekly check-in meetings, and in a blog [32].

Deployment on Amazon Cloud
We obtained research credits from the Amazon Open Data program [33] to deploy and test the framework on the Amazon Cloud. The credits were obtained under the umbrella of the Earth Science Information Partners (ESIP) [34], of which USGS, NOAA, NASA, and many other federal agencies are partners. Deploying the environment under ESIP made access possible for government and academic collaborators alike.

Example: Mapping Maximum Water Level During a Storm Simulation from an Unstructured Grid (Triangular Mesh) Model
A common requirement in analysis is to compute the mean or some other property over the entire grid for a period of the simulation. Here we illustrate the power of the Cloud to perform one of these calculations in 15 s instead of 15 min by using 60 Dask workers instead of just one, describing a notebook that computes and visualizes the maximum water level over a one-week simulation of Hurricane Ike for the entire model mesh (covering the US East and Gulf Coasts).
The notebook workflow commences with a specification of how much processing power is desired, here requesting 60 Dask workers utilizing 120 CPU cores (Figure 2). The next step is opening the Zarr dataset in Xarray, which simply reads the metadata (Figure 3). We see that the dataset has a water level variable called zeta, with more than 9 million nodes and 720 time steps. We can also see that the data are arranged in chunks that each contain 10 time steps and 141,973 nodes. The chunk size was specified when the Zarr dataset was created, using Xarray to convert the original NetCDF file from Clint Dawson (University of Texas), and using the Amazon Web Services command line interface to upload to Amazon S3 object storage.
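The chunk layout quoted above can be checked with a little arithmetic. The sketch assumes that zeta is stored as 8-byte (float64) values, which the text does not state; under that assumption each uncompressed chunk falls within the 10-100 MB range given in the Zarr section.

```python
# Values taken from the text.
time_steps = 720
steps_per_chunk = 10
nodes_per_chunk = 141_973

# Assumption: 8 bytes per value (float64); not stated in the text.
bytes_per_value = 8

chunk_bytes = steps_per_chunk * nodes_per_chunk * bytes_per_value
chunks_along_time = time_steps // steps_per_chunk

print(f"{chunk_bytes / 1e6:.1f} MB per uncompressed chunk")  # 11.4 MB
print(chunks_along_time, "chunks along the time dimension")  # 72
```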
After inspecting the total size that zeta would take in memory (58 GB), we calculate the maximum of zeta over the time dimension (Figure 4), and Dask automatically creates, schedules and executes the parallel tasks over the workers in the Dask cluster. The progress bar shows the parallel calculations that are taking place, in this case reading chunks of data from Zarr, computing the maximum for each chunk, and assembling each piece into the final 2D field.
Once the maximum water level has been computed, we can display the results using the GeoViews TriMesh method, which, when combined with the rasterize command from Datashader, dynamically renders and rasterizes the mesh to the requested figure size (here 600 × 400 pixels) (Figure 5). The controls on the right side of the plot allow the user to zoom and pan the visualization, which triggers additional rendering and rasterization of the data (Figure 6). In this way, the user can investigate the full resolution of the model results. Even with this 9 million node grid, rendering is fast, taking less than 1 s. This will become even faster with PyViz optimizations soon to be implemented.
The notebook is completely reproducible, as it accesses public data on the Cloud, and the software required to run the notebook is all on the community conda-forge channel. The notebook is available on GitHub [35], and interested parties can not only download it for local use, but launch it immediately on the Cloud using Binder [36].
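The reduction at the heart of this example can be sketched on synthetic data. The array below is a small, randomly generated stand-in for the ADCIRC water level, with sizes and chunk shape chosen only for illustration, and the computation runs on Dask's default local scheduler rather than on a 60-worker cluster.

```python
import dask.array as da
import xarray as xr

# Hypothetical stand-in for zeta: 720 time steps x 5,000 nodes,
# chunked as 10 time steps x 1,000 nodes.
zeta = xr.DataArray(
    da.random.random((720, 5000), chunks=(10, 1000)),
    dims=("time", "node"),
    name="zeta",
)

# Size the full array would occupy in memory.
print(f"{zeta.nbytes / 1e6:.1f} MB")  # 28.8 MB

# Build the lazy reduction, then execute it; Dask computes per-chunk
# maxima in parallel and combines them into one value per node.
max_zeta = zeta.max(dim="time")
result = max_zeta.compute()
print(result.shape)
```

When a dask.distributed Client is connected to a cluster, the same two lines that define and compute the maximum are executed by the cluster's workers instead of local threads.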

Example: Creating a Dashboard for Exploring a Structured Grid (Orthogonal Curvilinear Grid) Model
In addition to dynamic visualization of large grids, the PyViz tools hvPlot and Panel allow for easy and flexible construction of dashboards containing both visualization and widgets. In fact, hvPlot creates widgets automatically if the variable to be mapped has more than two dimensions. We can demonstrate this functionality with the forecast from the USGS Coupled Ocean Atmosphere Wave and Sediment Transport (COAWST) model [9]. In Figure 7, a simple dashboard is shown that allows the user to explore the data by selecting different time steps and layers. This was created by the notebook code cell shown in Figure 8. The curvilinear orthogonal grid used by the COAWST model is visualized using the QuadMesh function in GeoViews. As with the TriMesh example, the user has the ability to zoom and pan, after which Datashader re-renders the data (within a second) and delivers the result to the browser (Figure 9). This notebook is also available on GitHub [34], where it can be examined, downloaded, or run on the Cloud (using the "launch Binder" button).

Figure 8. Notebook cell that generates the dashboard in Figure 7. Because COAWST uses a curvilinear grid, we specify that HvPlot use the QuadMesh method to visualize the potential temperature, that the result be rasterized, and, via the GroupBy option, that widgets be created for the time and vertical layers. We then specify that we want to use the ESRI Imagery tile service for a basemap, and overlay the visualization on top. Finally, we change the widgets to type Select, which provides a dropdown list of values (instead of the default slider).

Figure 9. Zooming into the Gulf of Mexico on the COAWST forecast temperature, using the pan and wheel zoom controls. The hover control is also selected, which allows data values to be displayed along with their coordinates.

Discussion
The Pangeo framework demonstrated here works not only on the Cloud, but can run on HPC or even on a local desktop. On the local desktop, however, data need to be downloaded for analysis by each user, and parallel computations are limited to the locally available CPUs. On HPC, there may be access to more CPUs, but the data still need to be downloaded to the HPC center. On the Cloud, however, anyone can access the data without it having to be moved and have virtually unlimited processing power available to them. On the Cloud, Pangeo allows similar functionality to Google Earth Engine [37], allowing computation at scale close to the data, but can be run on any Cloud, and with any type of data. Let us review the advantages of the Cloud in more detail:
Data access: Data in object storage like S3 can be accessed directly from a URL without the need for a special data service like OPeNDAP. This prevents the data service from being a bottleneck on operations, or data access failing because the data service has failed. It also means that data stored on the Cloud are immediately available for use by collaborators or users. While data services like OPeNDAP can become overwhelmed by too many concurrent requests, this does not happen with access from object storage. Object storage is also extremely durable: 99.999999999% with default storage on Amazon, which means that if 10,000 objects are stored, one may be lost on average every 10 million years. Finally, data in object storage are not just available for researchers to analyze, but are also available for Cloud-enabled web applications to use. This includes applications that have been developed by scientists as PyViz dashboards, then published using Panel as dynamic web applications with one additional line of code.
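The durability figure can be reproduced with a short calculation. The sketch assumes that the quoted percentage is the probability that a single object survives one year, and that objects are lost independently; exact fractions are used to avoid floating-point rounding at this many digits.

```python
from fractions import Fraction

# Assumption: 99.999999999% is the per-object, per-year survival probability.
durability = Fraction(99_999_999_999, 100_000_000_000)
annual_loss_per_object = 1 - durability  # 1 in 10^11

objects_stored = 10_000
expected_losses_per_year = objects_stored * annual_loss_per_object

# Average number of years between single-object losses.
years_per_lost_object = 1 / expected_losses_per_year
print(years_per_lost_object)  # 10000000
```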
Computing on demand: On the Cloud, costs accrue per hour for each machine type in use. It costs the same to run 60 CPUs for 1 min as it does to run 1 CPU for 60 min, and because nearly instantaneous access is available, with virtually unlimited numbers of CPUs, big data analysis tasks can be conducted interactively instead of being limited to batch operations. The Pangeo instance automatically spins up and down Cloud instances based on computational demands.
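The following sketch works through this trade-off using the timings reported for the Hurricane Ike example (15 min on one worker, 15 s on 60 workers). The price is a hypothetical placeholder, and the calculation assumes that cost is strictly proportional to worker time, ignoring billing granularity and cluster start-up.

```python
def worker_seconds(workers: int, wall_seconds: int) -> int:
    """Total resource use: number of workers times elapsed time."""
    return workers * wall_seconds


serial = worker_seconds(workers=1, wall_seconds=15 * 60)
parallel = worker_seconds(workers=60, wall_seconds=15)

# Hypothetical price per worker-hour, for illustration only.
price_per_worker_hour = 0.05

serial_cost = serial / 3600 * price_per_worker_hour
parallel_cost = parallel / 3600 * price_per_worker_hour
speedup = (15 * 60) / 15

print(serial, parallel)            # 900 900
print(serial_cost == parallel_cost)  # True
print(speedup)                     # 60.0
```

Under these assumptions the parallel run uses the same total worker time, and so costs the same, while returning the answer 60 times sooner.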
Freedom from local infrastructure: Because the data, analysis, and visualization are on the Cloud, buying or maintaining local computer centers, high power computer systems, or even fast internet connections is not necessary. Researchers and their colleagues can analyze and visualize data from anywhere with a simple laptop computer and the WiFi from a cell phone hotspot.
We have demonstrated the Pangeo framework for coastal ocean modeling here, but the framework is flexible and is being used increasingly by a wide variety of research projects, including climate scale modeling [38] and remote sensing [39]. While the framework clearly benefits the analysis and visualization of large datasets, it is useful for other applications as well. For example, the AWS Pangeo instance we deployed was used by the USGS for two multi-day machine learning workshops that each had 40 students from various institutions with a diversity of computer configurations, operating systems, and versions. The students were able to do the coursework on the Cloud using their web browsers, avoiding the challenges encountered when the course computing environment needs to be installed on a number of heterogeneous personal computers.
While there are numerous benefits to this framework, there are also some remaining challenges [40,41]. One important challenge is cost. The Cloud often appears expensive to researchers because much of the true cost of computing is covered by local overhead (e.g., the physical structure, electricity, internet costs, support staff). Gradual adoption, training, and subsidies for Cloud computing are some of the approaches that can help researchers and institutions make the transition to the Cloud more effectively. Another challenge is cultural: Scientists are accustomed to having their data local, and some do not trust storage on the Cloud, despite the reliability. Security issues are also a perceived concern with non-local data. Finally, converting the large collection of datasets designed for file systems to datasets that work well on object storage is a non-trivial task even with the tools discussed above.
Once these challenges are overcome, we can look forward to a day when all model data and analysis takes place on the Cloud, with all data directly accessible and connected by high-speed networks (e.g., Internet 2) and common computing environments can be shared easily. This will lead to unprecedented levels of performance, reliability, and reproducibility for the scientific community, leading to more efficient and effective science.
Several agencies have played key roles in the development of these open-source tools that support the entire community: DARPA (U.S. Defense Advanced Research Projects Agency) provided significant funding for Dask, and ERDC (U.S. Army Engineer Research and Development Center) has provided significant funding through their EarthSim project for developing modeling-related functionality in the PyViz package. We hope that more agencies will participate in this type of open source development, accelerating our progress on expanding this framework to more use cases and more communities.

Conclusions
Pangeo with PyViz provides an open-source framework for interactive, scalable, data-proximate analysis and visualization of coastal ocean model output on the Cloud. The framework described here provides a glimpse of the scientific workplace of the future, where a modeler with a laptop and a modest internet connection can work interactively at scale with big data on the Cloud, create interactive visual dashboards for data exploration, and generate more reproducible science.
Funding: This research benefited from National Science Foundation grant number 1740648, and the EarthSim project was funded by ERDC projects PETTT BY17-094SP and PETTT BY16-091SP. This project also benefited from research credits granted by Amazon.