A mathematical framework for earth science data provenance tracing

  • Software Article
Earth Science Informatics

Abstract

This paper identifies three distinct data production paradigms for Earth science data, each having its own versioning structure:

  • Climate data record production, used when the data producer’s dominant concern is providing a homogeneous error structure for each data set version, particularly when the data record is expected to cover a long time period

  • Operational data set production, used when the producer must ensure low latency and service continuity with less attention to error homogeneity across the entire record

  • Exploratory production, used for validation or research in which the producer decides which processes to apply by interacting with the data. In this paradigm, there may not be a common versioning structure from one production episode to another

This paper then develops a mathematical framework for three provenance tracing activities that are important in long-term preservation of Earth science data:

  • tracing the history of data production that created an item of Earth science data, with particular attention to the versioning structure of the data collections

  • tracing the history of custody for an item

  • tracing the history of Intellectual Property Rights transfers for an item

Each of these activities has its own type of Directed Acyclic Graph (DAG) underlying a particular kind of provenance. Provenance tracing is equivalent to performing a Breadth First Search on the appropriate DAG.
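
As a minimal sketch of the last point, the following Python fragment runs a breadth-first search over a small production DAG stored as a mapping from each item to the items it was derived from. The function name `trace_provenance`, the `parents` mapping, and every item name are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import deque


def trace_provenance(parents, item):
    """Return the ancestors of `item` in breadth-first order (nearest first).

    `parents` maps each item to the list of items it was directly derived
    from. The graph is assumed to be a DAG, as stated in the abstract.
    """
    visited = {item}
    order = []
    queue = deque([item])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            # An ancestor reachable by several paths is reported only once.
            if parent not in visited:
                visited.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


# Hypothetical production graph: each key was produced from its listed inputs.
parents = {
    "monthly_mean_v2": ["daily_flux_v2"],
    "daily_flux_v2": ["calibrated_radiance_v2", "cloud_mask_v1"],
    "calibrated_radiance_v2": ["raw_counts", "calibration_coeffs_v2"],
    "cloud_mask_v1": ["raw_counts"],
}

lineage = trace_provenance(parents, "monthly_mean_v2")
print(lineage)
```

The same traversal could in principle be applied to a custody graph or an Intellectual Property Rights transfer graph by changing what the edges of `parents` represent.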

Acknowledgements

The author is deeply grateful to the reviewers of this paper for helping him to remove a number of misconceptions and to clarify the writing.

Author information

Corresponding author

Correspondence to Bruce R. Barkstrom.

Additional information

Communicated by: H. A. Babaie

About this article

Cite this article

Barkstrom, B.R. A mathematical framework for earth science data provenance tracing. Earth Sci Inform 3, 167–196 (2010). https://doi.org/10.1007/s12145-010-0057-0

  • DOI: https://doi.org/10.1007/s12145-010-0057-0
