Digital Forensics Formats: Seeking a Digital Preservation Storage Container Format for Web Archiving

In this paper we discuss archival storage container formats from the point of view of digital curation and preservation, an aspect of preservation overlooked by most other studies. Taking established approaches to data management as our starting point, we selected seven container format attributes that are core to the long-term accessibility of digital materials. We have labeled these core preservation attributes. These attributes are then used as evaluation criteria to compare storage container formats belonging to five common categories: formats for archiving selected content (e.g. tar, WARC), disk image formats that capture data for recovery or installation (e.g. partimage, dd raw image), these two types combined with a selected compression algorithm (e.g. tar+gzip), formats that combine packing and compression (e.g. 7-zip), and forensic file formats for data analysis in criminal investigations (e.g. aff, the Advanced Forensic File format). We present a general discussion of the storage container format landscape in terms of these attributes, and make a direct comparison between the three most promising archival formats: tar, WARC, and aff. We conclude by suggesting next steps to take the research forward and to validate the observations we have made.

Yunhyong Kim & Seamus Ross. International Journal of Digital Curation (2012), 7(2), 21–39. http://dx.doi.org/10.2218/ijdc.v7i2.227

The International Journal of Digital Curation is an international journal committed to scholarly excellence and dedicated to the advancement of digital curation across a wide range of sectors. The IJDC is published by UKOLN at the University of Bath and is a publication of the Digital Curation Centre. ISSN: 1746-8256. URL: http://www.ijdc.net/


Introduction
The selection of a storage container format for digital materials that facilitates the long-term accessibility of digital object content, and supports the continued recognition of behaviour and functionalities associated with digital objects, is one of many core tasks of a digital archive. This task is especially challenging with respect to complex aggregate digital objects, such as weblogs, involving multimedia objects that are produced in varying formats to carry out a wide range of interactive functionalities, including dynamic changes over time, and displayed using distributed information within the context of social networks. As a first step to meet this challenge, we present here the results of our preliminary investigations examining storage container formats likely to benefit a dynamic weblog archive, a study conducted as part of the BlogForever project, which aims to create a platform for aggregating, preserving, managing and disseminating blogs.
There have been many studies on the impact of digital object formats on the preservation of digital information (e.g. Brown, 2008; Todd, 2009; Buckley, 2008; Christensen, 2004; Fanning, 2008; McLellan, 2006). The retention of essential object properties can be facilitated by examining the preservation attributes of the file format. Some of these (e.g. scale of adoption and disclosure, support for data validation, and flexibility in embedding metadata) have surfaced elsewhere as sustainability factors (cf. Library of Congress sustainability factors; Arms & Fleischhauer, 2003; Rog & van Wijk, 2008; Brown, 2008) and as factors that capture the format's capacity to retain significant digital object properties (Hedstrom & Lee, 2002; Dappert & Farquhar, 2009; Guttenbrunner et al., 2010).
Most of these studies focus on individual digital object formats and, even then, generate many differences of opinion. There has been little consensus on best practices for selecting storage container formats (e.g. tar) that aggregate or capture collections composed of multiple object types, such as we might encounter within a single standalone computer, a complex office system, or a web archiving environment. While formats such as WARC [A3] have been proposed and developed into an international ISO standard, these recommendations are rarely based on a comparison of a range of formats using the full range of preservation attributes within the same environmental setup. Even when storage architecture is discussed on a wider scale, the discussion often focuses on one or two selected factors (e.g. software and hardware scalability and costs).
In the following, we discuss a core set of preservation attributes for storage formats. These include those that have been addressed in common by several previous studies on file formats, such as those conducted by the UK Digital Curation Centre (e.g. Abrams, 2007), the US Library of Congress, and the technology watch reports published by the Digital Preservation Coalition (e.g. Todd, 2009). However, we have augmented the set of attributes to reflect an increased cognizance of the concepts covering the quality and completeness of data, as reflected in the ability to represent the full digital content of an object and/or data. The central role of quality and completeness of data has been observed as a relevant factor before (e.g. Todd, 2009; Pipino, Lee & Wang, 2002; Batini & Scannapieco, 2006; Huc et al., 2004). However, the "completeness of data" we address here refers to much more than the target digital object content. For example, in the digital environment, provenance evidence surrounding digital objects can be derived from information external to the object, such as file modification dates, lists of files that were deleted, logs of processes (e.g. installation of programs) and resulting errors, and trails of programs that had been run on the system. This kind of history is retained on the system disk, as a result of often tacitly understood standard practice in software design and systems administration, and should be retained to track accountability (not only with respect to humans but also software and hardware). Once the preservation activity is reduced to that associated with digital objects only, all this supporting information tends to become hidden and may even be lost. Indeed, although we focus here on storage, we believe that this is a reductionist approach to preservation, and that real advances will come from system preservation and system thinking (e.g. Checkland, 1981), based on an understanding of complex systems driven by inter-related data.
We have also placed more emphasis on scalability (e.g. measured by compression ratio to meet storage requirements, and decompression speed to reduce overheads on any processes that take place on the material) and flexibility (e.g. being able to deal with multiple types, sizes, and numbers of digital materials through a variety of operating systems) than previous studies. Scalability and flexibility are crucial within the web environment where we need to support rapidly growing data, distributed processing, aggregation of multimedia objects, and sophisticated approaches to search.
In the next section, these observations will be reflected in our proposal of seven core attributes for assessing storage container formats. We will then discuss a range of container formats with respect to these attributes, and make some concluding remarks with suggestions of next steps in the final section.

Seven Core Preservation Attributes for File Formats
We propose seven core attributes that should be considered with respect to storage container formats for the purposes of supporting digital preservation, based on current knowledge. As mentioned in the previous section, these attributes were selected to reflect preservation requirements identified through other research and application development initiatives, such as the sustainability factors for formats discussed at the Library of Congress. However, previous studies have placed much emphasis on front-end isolated formats for individual digital objects. The attributes here include the notion of completeness of data, intended to consider the extent of contextual metadata (e.g. file system information, permissions and error logs) surrounding the object that is being captured. We also put weight on scalability, not only in terms of minimising storage and optimising management efficiency with respect to variations in the quantities of data (crucial in the case of web archives, which grow ever larger and more diverse with respect to included object types, or data collected from scientific instruments), but also in terms of reducing overhead with respect to sophisticated data mining and search technologies that are likely to play a more ubiquitous role in the future. The attributes are described below along with Library of Congress sustainability factors (LC SF) in parentheses, for comparison, where relevant:

1. Completeness of data: The container format should preserve data as closely as possible to raw data at the time of storage or capture. For example, this could be a sector-by-sector replication (e.g. disk image) of the raw data on a system disk, block-by-block replication of tape storage, or packet-by-packet recording of streamed content as it was captured, inclusive of any file structure, dependencies, and history. This:
   i. Minimises deterioration and information loss;
   ii. Maximises the chances of preserving file system information (e.g. directory structure, file size, permissions, encoding, any relationships and dependencies between files and executables);
   iii. Increases the possibility of retaining extra information about changes that have been made on the disk to be used for tracking accountability, integrity, authenticity, and maximising recoverability.

2. Recoverability of data: The container format should support the recovery of data wherever possible. For example, one corrupted file or sector should not, if possible, pose serious problems in recovering other files or sectors in the archive.

3. Support for data validation (cf. LC SF "technical protection mechanisms"): The format should support validation procedures. For example, the container format should provide:
   i. Piecewise hashing utilities (i.e. programs that hash arbitrary sized blocks of data, such as md5deep) and digital signatures to verify it as an authentic representation of the initial instantiation (Ross, 2006); and,
   ii. Optional means of encryption to protect the data from malicious manipulation or illicit access.
   While these functions can be added in some cases, it is best to minimise the accumulation of functionality through the use of third-party tools and added procedures, as this increases overhead and the margin for introducing errors.

4. Scalability of data management processes: The format should have properties that make all processes within the archive scalable to handle files of any size, datasets of any size and added services. In particular, the format should:
   i. Not limit the size of input file, output file, and/or media; and,
   ii. Support efficiency with respect to storage and processing speed. For example, the format should:
      • Have inherent efficient and effective compression methods, which could be used to reduce storage requirements;
      • If possible, not require decompression for accessing information within the stored data (e.g. searching and indexing); and,
      • Support random access of files within the archive.

5. Transparency (cf. LC SF "disclosure", "impact of patents", and "transparency"): Any tools and specifications involved in the format should be a publicly published open standard and non-proprietary to avoid restrictions regarding activities that support long-term preservation and access of material in the archive, such as making modifications to the format, distributing new versions, and tracing accountability and authenticity.

6. Flexibility of embedding metadata (cf. LC SF "self-documentation"): The container format should, if feasible, support the possibility of embedding user-defined metadata with the data objects.

7. Flexibility in handling data (cf. LC SF "external dependencies"): The container format must be:
   i. Able to capture data objects in their entirety or in small portions;
   ii. Able to handle any media type (e.g. text, image, audio, video, executable);
   iii. Able to process any source of material (e.g. entire disk contents, folders, files, webpages, websites) whether it is acquired through the network or provided on some form of storage media; and,
   iv. Accessible using a variety of methods, environments and operating systems.
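The piecewise hashing mentioned under support for data validation can be illustrated with a minimal Python sketch. It is not the implementation used by md5deep or by any of the formats discussed; the function names, the 4096-byte block size and the choice of SHA-256 (rather than MD5) are assumptions made for illustration only.

```python
from __future__ import annotations

import hashlib


def piecewise_hashes(data: bytes, block_size: int = 4096) -> list[str]:
    """Return one SHA-256 hex digest per fixed-size block of `data`.

    The block size and hash function are illustrative assumptions.
    """
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_blocks(original: list[str], current: list[str]) -> list[int]:
    """Return the indices of blocks whose digests differ."""
    return [i for i, (a, b) in enumerate(zip(original, current)) if a != b]


# A 16 KiB stand-in for captured data, hashed in four 4 KiB blocks.
image = bytes(range(256)) * 64
baseline = piecewise_hashes(image)

# Flip one byte inside the second block and locate the damage.
damaged = bytearray(image)
damaged[5000] ^= 0xFF
print(changed_blocks(baseline, piecewise_hashes(bytes(damaged))))
```

Because each block carries its own digest, a mismatch points to the affected block rather than invalidating the whole container, which is the property that makes piecewise hashing attractive for large archives.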

Comparison of Storage Formats
In this section, we compare several file formats that have been widely accepted as formats for storage of information, with respect to the attributes identified previously.
A list of widely used formats is presented in Table 1, shown in the appendix. The examples listed there are not meant to constitute an exhaustive list of storage container formats by any means. Some formats (such as the EnCase image format [A12] and other proprietary formats for forensics, and the rar [A22] format for archiving content) were omitted because they are clearly restricted and closed proprietary formats. Also, formats whose license status was hard to resolve (e.g. BagIt [A6]), formats which have a stable extended version (e.g. Internet Archive ARC [A3], now extended by the ISO standard WARC [A27]), and formats that are designed for limited purposes (e.g. jar [A16] for Java applications and associated libraries, and iso image [A15] for optical media) have also been excluded. Formats such as cpio [A9] are not extensively discussed here. Some formats have little documentation and support. This may be because the format is associated with a native Linux command (e.g. shar [A25] and dd raw image [A10]), old (e.g. SEA ARC [A4]), and/or not widely adopted (e.g. cfs [A8] and kgb [A17]). While we have mentioned them in some of our discussion, the lack of documentation and support would suggest that they are unsuitable in a large-scale preservation context. Likewise, formats for which there is no evidence of further planned development (e.g. the forensic format gfzip [A13], frozen since 2006), or those tied to a specific program (e.g. sgzip [A24], the native format of the forensic software PyFlag) or a specific platform (e.g. dmg [A11] for Mac OS X) seem unsuitable for serious consideration as candidates for preservation formats.

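The formats in Table 1 will be broadly considered with respect to five format categories:
• Formats for archiving content, mostly intended for aggregating, storing, transferring, and backing up the content (e.g. tar [A26], International Internet Preservation Consortium WARC [A27], AXF [A5]).
• Formats that capture raw data, including or excluding unused portions, as it is on the disk, mostly intended for recovery or installation (e.g. partimage [A20], dd raw image [A10]).
• These two types combined with a selected compression algorithm (e.g. tar+gzip).
• Formats that combine packing and compression (e.g. 7-zip).
• Forensic file formats for data analysis in criminal investigations (e.g. aff).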
The container formats can first be compared on the basis of compression and decompression speed, and compression ratio, which may impact on system performance and management cost. We have excluded any discussion of compression methods, such as xz-utils [A28] and lzop [A19], which have not been adopted widely. Archiving and raw-capture formats used on their own (e.g. tar, dd raw image) are not accompanied by compression, and therefore incur no compression or decompression time. However, they also require the largest amount of storage, which may impact on system design (and, hence, also on performance) and maintenance cost. The format tar compressed using gzip and bzip2 has been compared to 7-zip and PeaZip on the basis of compression ratio and compression speed by Nieminen (2004), who found that, while 7-zip produces the best compression ratio, tar+bzip2 and tar+gzip offer the best trade-off between ratio and speed. Other studies that have compared the gzip, bzip2 and lzma compression methods show that, while lzma outperforms the other two in terms of compressed size, gzip is significantly superior to the other two in terms of compression and decompression speed (Collins, 2005; Klausmann, 2008). The gzip compression method also has the least demanding memory requirements. There is no information on compression ratios for WARC or AXF in combination with bzip2, gzip, and zip. However, as WARC and AXF are container formats that make no special provision to optimize the size of embedded objects beyond the capability of a selected compression algorithm, they cannot be expected to greatly outperform tar (with a selected compression algorithm) in terms of compression ratio. We could not find a direct comparison of compression ratio and speed between the above formats and the forensics file format aff. However, we do know that the compression algorithms supported within aff are zlib and lzma. The former has a typical compression ratio of 2:1 to 5:1, which is comparable to that of gzip. The latter is the compression algorithm for 7-zip. This suggests the aff format's potential to compete with tar+gzip and 7-zip in terms of compression ratio and speed. Furthermore, aff has the advantage that it comes with tools that allow the contents to be read without decompression.
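The kind of comparison described above (compression ratio against compression and decompression time) can be sketched with the gzip, bzip2 and lzma codecs in the Python standard library. This is only an illustrative harness, not the method used in the cited studies; the `measure` helper and the synthetic, highly repetitive sample are assumptions, and figures obtained from such data should not be read as representative of real web archive content.

```python
import bz2
import gzip
import lzma
import time


def measure(name, compress, decompress, payload):
    """Time one compress/decompress round trip and report the ratio."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_seconds = time.perf_counter() - start

    # A container is only useful if the round trip is lossless.
    assert restored == payload
    return {
        "method": name,
        "ratio": len(payload) / len(packed),
        "compress_s": compress_seconds,
        "decompress_s": decompress_seconds,
    }


CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

# Hypothetical sample: repeated HTML-like text standing in for blog pages.
sample = b"<html><body>blog post text</body></html>\n" * 20000

results = [measure(name, c, d, sample) for name, (c, d) in CODECS.items()]
for row in results:
    print(f"{row['method']:>5}  ratio {row['ratio']:.1f}:1  "
          f"compress {row['compress_s']:.4f}s  "
          f"decompress {row['decompress_s']:.4f}s")
```

Running the same harness over a heterogeneous sample of harvested content, rather than synthetic text, would be closer to the comparison the discussion calls for.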
Below, we present a general discussion of storage container formats with respect to our seven attributes extracted from the literature. We then follow up the discussion with a direct comparison between tar, WARC, and aff, the three formats that our preliminary analysis indicated to be the most promising. While AXF also claims to be an open standard conforming to preservation aims, it is a very new development. At the time of writing this paper, so little documentation and source code was publicly accessible that the format was difficult to assess. For this reason, we propose to reserve judgement at this stage on its suitability for inclusion in large-scale long-term storage initiatives.

General Discussion of File Format Attributes
In this section, we first present some broad observations on various formats with respect to several of the attributes identified earlier. We have organised these under four headings: completeness of data, recovery and validation, scalability, and flexibility. Transparency is not discussed separately, as we have opted, as evidenced throughout the paper, not to consider container formats that are not publicly open and well documented.

Completeness of data
Each of the listed formats archives a different degree of information. For example, tar will save system information, such as permissions and file directory structure. Others, such as partimage, have limitations on supported file systems and do not retain information from unused sectors. Formats such as 7-zip do not retain file permissions across platforms. For instance, data on a Windows system aggregated using 7-zip would lose file permission information when transferred onto a Linux machine, as these attributes will be reset upon transfer. Many of these formats have intrinsic and implicit ways of handling processes that are not widely known, and that impact on their sustainability for preservation purposes. The inability to retain information of this sort also manifests in formats such as WARC, which is designed to aggregate resources on the Internet in a descriptive, surface-oriented fashion without much regard to the original file system structure or the file system characteristics of the embedded resource (e.g. image). In contrast, forensics formats are implemented to keep the data as close as possible to the way it was at the time of creation, as this can constitute vital evidence in judicial contexts.
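The file system information that tar retains can be inspected with Python's standard tarfile module. The sketch below is illustrative only: the helper names, the file path, the permission bits and the timestamp are all hypothetical values chosen for the example.

```python
import io
import tarfile


def pack_with_metadata(name, content, mode, mtime):
    """Build an in-memory tar archive holding one file plus its metadata."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(content)
        info.mode = mode      # permission bits recorded in the tar header
        info.mtime = mtime    # modification time recorded in the tar header
        archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def read_metadata(tar_bytes):
    """List (path, permissions, mtime, size) for every archive member."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as archive:
        return [
            (member.name, oct(member.mode), member.mtime, member.size)
            for member in archive.getmembers()
        ]


# Hypothetical blog page with owner read/write and group read permissions.
packed = pack_with_metadata(
    "blog/posts/entry.html", b"<p>hello</p>", mode=0o640, mtime=1325376000
)
print(read_metadata(packed))
```

The directory path, permissions and modification time survive the round trip, but nothing about unused sectors, deleted files or system logs is captured, which is the gap in completeness described above.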

Recovery and validation
Publicly available information on archive file formats (excluding WARC and AXF) shows that only shar, ace, afa, arj, DGCA, WinMount format, rar, and ultra compressor II come with support for integrity checks, recovery records and encryption. These formats are proprietary, poorly documented (e.g. shar) or have a limited community of support (e.g. DGCA). The WARC format, as far as we know, does not have any validation mechanisms (e.g. checksum) built into it. In contrast, forensic disk images (e.g. aff) almost always come with some means of supporting all three, as they impact the weight a court might give to the extracted information when it is produced as evidence. While the Archival eXchange Format (AXF) does provide validation mechanisms, its provisions for recovery (that is, robustness against errors) are yet to be tested. In fact, while with many container formats the corruption of part of the data leads to the loss of a large chunk of data, formats like the Advanced Forensics Format (aff) have provisions for restoring the maximum amount of uncorrupted data.
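The general idea behind recovering the maximum amount of uncorrupted data can be sketched by compressing fixed-size blocks independently and storing a digest with each one. This is a simplified illustration of the principle and not the actual on-disk layout of aff or any other format; the function names, block size and use of zlib with SHA-256 are assumptions.

```python
import hashlib
import zlib


def pack_blocks(data, block_size):
    """Compress each block on its own and keep a digest of its content."""
    blocks = []
    for offset in range(0, len(data), block_size):
        chunk = data[offset:offset + block_size]
        blocks.append((hashlib.sha256(chunk).hexdigest(), zlib.compress(chunk)))
    return blocks


def recover_blocks(blocks):
    """Return the readable blocks plus the indices of damaged ones."""
    recovered, bad = [], []
    for index, (digest, packed) in enumerate(blocks):
        try:
            chunk = zlib.decompress(packed)
        except zlib.error:
            chunk = None  # the block could not be decompressed at all
        if chunk is None or hashlib.sha256(chunk).hexdigest() != digest:
            bad.append(index)
            recovered.append(None)
        else:
            recovered.append(chunk)
    return recovered, bad


# Eight 1 KiB blocks of hypothetical data.
data = b"".join(bytes([i]) * 1024 for i in range(8))
stored = pack_blocks(data, block_size=1024)

# Simulate damage by truncating the compressed form of one block.
digest, packed = stored[3]
stored[3] = (digest, packed[: len(packed) // 2])

recovered, bad = recover_blocks(stored)
print("damaged blocks:", bad)
print("intact blocks:", sum(chunk is not None for chunk in recovered))
```

Because no block depends on its neighbours, the damage stays confined to one block, whereas a single compressed stream over the whole archive can become unreadable from the point of corruption onwards.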

Scalability
Many of the listed formats have limitations on the size of the input and output file that they can produce. For example, older versions of tar only allowed file sizes of up to eight gigabytes. The elasticity and processability of a format are key aspects of its scalability. Even some forensic file formats came with this limitation. However, unlike forensic file formats, most of the other formats do not allow easy partitioning of the data to be archived into blocks of user-defined size. In addition, newer versions of forensic file formats, such as Advanced Forensics Format (aff), have lifted the limitation on file size. More importantly, some archival formats (e.g. tar) do not allow random access to data, so for these there is no way to retrieve individual files without decompressing and unpacking everything. This incurs a significant overhead for management (e.g. migration of selected file types within the archived object), indexing, and retrieval operations within the archive. Even when a format allows random access (e.g. 7-zip), it is often the case that the selected file has to be decompressed before processing. Forensics formats, such as aff, in contrast, allow searching and analysis of the data without any decompression.

In terms of metadata, both WARC and aff are designed to support user-defined metadata. The format tar and other content archival formats (partimage and dd raw image) support only a limited amount of predefined metadata. This is natural, as content archival formats and raw disk images are generally born as a means of storing and transferring data from one location to another, while WARC and forensics formats are designed to support data access, analysis by end-users, and sometimes the maintenance of evidential value, as well as storage and transfer.
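The difference between sequential and random access noted in the scalability discussion can be illustrated with the Python standard library. The sketch uses zip as a stand-in for an indexed archive format, since 7-zip is not supported by the standard library; the helper names and the hundred small synthetic files are assumptions made for the example.

```python
import io
import tarfile
import zipfile


def build_archives(files):
    """Pack the same files into a gzip-compressed tar and into a zip."""
    tar_buffer, zip_buffer = io.BytesIO(), io.BytesIO()
    with tarfile.open(fileobj=tar_buffer, mode="w:gz") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    with zipfile.ZipFile(zip_buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return tar_buffer.getvalue(), zip_buffer.getvalue()


def read_from_tar_gz(tar_bytes, wanted):
    """Stream through the archive until the wanted member is reached."""
    scanned = 0
    # Stream mode ("r|gz") visits members strictly in stored order.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r|gz") as tar:
        for member in tar:
            scanned += 1
            if member.name == wanted:
                return tar.extractfile(member).read(), scanned
    raise KeyError(wanted)


def read_from_zip(zip_bytes, wanted):
    """Look the member up in the zip central directory and read only it."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.read(wanted)


files = {f"post{i}.html": f"<p>post {i}</p>".encode() for i in range(100)}
tar_bytes, zip_bytes = build_archives(files)

content, scanned = read_from_tar_gz(tar_bytes, "post99.html")
print("tar.gz members scanned before a hit:", scanned)
print("same content from zip:", read_from_zip(zip_bytes, "post99.html") == content)
```

Reaching the last file in the compressed tar requires passing over every earlier member, while the zip reader locates it through an index. In both cases the selected file still has to be decompressed before it can be processed.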

Flexibility
With respect to flexibility across platforms, while many of the listed formats support multiple platforms, tar requires third-party tools on Windows, which may incur extra cost in terms of processing time and pose potential obstacles for long-term preservation, as the third-party tools are often not open source. One clear disadvantage of aff is that it assumes the image is from a disk as opposed to a collection of files or folders. However, this is not an insurmountable obstacle, as harvested websites can, in theory, be mounted on to virtual disks that are then turned into images using aff (see Figure 1). Further, an extension of aff, known as aff4, now allows the capture of webpages over the network as images. It may be too soon for aff4 to be employed, as it may not be stable enough, but the format promises to be compatible with aff formats. This means that a plan to use aff initially, with a view to migrating to aff4 when it becomes stable, is fully feasible.
Figure 1. Workflow: Implementation of aff format using virtual disks.
In addition to what was mentioned above, the International Internet Preservation Consortium WARC format has been shown to have compatibility issues with the Internet Archive ARC format, even though it was created to accommodate previous data stored in the Internet Archive ARC format. Data recovery problems have also been observed with respect to tar.

Table 2. Comparison of seven attributes across three formats, tar, WARC and aff.

Comparison of tar, WARC, and aff
In Table 2, we have summarised aspects of the seven attributes with respect to three file formats: tar, WARC, and aff. The description in Table 2 illustrates that:

1. The tar format has limited provisions for validation or recovery mechanisms, and no support for user-defined metadata. While the format allows working with various media types and collections, it does not allow user-defined block sizes. The format does retain file structure information and sometimes even file permissions, but it does not retain sector-by-sector information including unused space.

2. While WARC is specific to web crawls and therefore may provide features that are not available to other generic formats, the biggest drawback for this format is that rendered access is available only using the Internet Archive Wayback Machine.

3. The Advanced Forensic File (aff) format is clearly the most robust in that it stores sector-by-sector information as a sequence of blocks of user-defined size designed for maximum recovery when an error is found, has an inbuilt validation mechanism, and allows user-defined metadata.

Another attractive feature of the aff format is that the collection can be searched and indexed without decompression or unpacking. While the aff format is limited to imaging disks, we have already pointed out that this can be partially circumvented with the use of virtual disks.

Conclusions
In this document, we made some observations on the advantages of employing forensic file formats (more specifically, the aff format) in a digital archive. We have:

1. Discussed attributes for file formats that need to be considered within an archive to support digital preservation;
2. Compared a broad range of file formats with respect to seven core file format attributes;
3. Made a direct comparison of three of the file formats, tar, WARC, and aff; and,
4. Proposed the Advanced Forensic File (aff) format as the most robust among the three formats as a data-mining-aware preservation storage format, where the preservation of a complex system of different file types is required, a situation often encountered within, but not limited to, a web archive.
While the aff format was originally intended for use in imaging disks (Garfinkel, 2006; Panda, Giordano & Kalil, 2006), we have illustrated that this limitation can be partially overcome through the use of virtual disk technology. Once the virtual disk technology is used to extend aff functionality, aff can be deployed as a storage container format for diverse types of media and information, such as tapes and data streams. In the context of information from the web crawled automatically, the virtual disk approach would not capture all the information available at the time of creation, which is often beyond our reach. However, it still helps us to work towards preserving the information we gather at the time of capture. This serves the purpose of not only supporting the preservation of the targeted information, but also recording the process by which we have gathered and processed the information, as the data capture history will be preserved in the aff disk image.
In digital forensics, the fidelity, integrity and authenticity of the data are crucial, as they directly link to the weight and sometimes even the admissibility of the object content as evidence in judicial settings (Goodin, 2011; Bell & Boddington, 2010). The forensics community is sensitive to the vital role of tracing data history. For example, the provenance of data and how the data was changed plays a part in understanding accountability and discovering evidence. The discipline's focus on not tampering with the data, even at the time of searching (e.g. no decompression and unpacking of the storage), is intended to ensure that the integrity of the digital material is maintained. As such, the handling of data within digital forensics is centred around preservation aims. Further, as forensics often involves making connections between several information entities, it is rapidly opening up to supporting data mining techniques (see Louis & Engelbrecht, 2011). The possibility of processing data in an archive without unpacking and decompressing reduces overheads in implementing these processes. It is also a valuable property with respect to basic large dataset indexing and search, which are must-preserve functionalities within the web data context. By absorbing digital forensics technology into the archival storage architecture, we could bring together the strengths of digital forensics that focuses on preserving digital information as evidence (data and interaction), and the wider context of preserving digital information, to introduce a preservation approach that also supports future data mining potential. The main questions to be answered to carry out the adoption of aff are: how will information be captured into virtual disks (e.g. will blogs from one website be kept together?), and how will the information within each object be segmented and distributed?

Next Steps
We suggest that a small-scale experiment be conducted to compare the formats tar, WARC and aff (and possibly the AXF format, which has not been properly examined here), using compression ratio, speed, and preservation attributes as evaluation criteria. The experiment should be based on a framework that can be used as a benchmark for comparing currently available container formats, as well as evaluating the suitability of new formats as they emerge. Such an experiment must:
• Include the precise definition of the experimental context (e.g. research communities, public sector, business);
• Investigate the variance of performance with respect to the heterogeneity of data types (e.g. file types, programs, databases);
• Examine the scalability over a range of data collection sizes (say, from one gigabyte to ten terabytes); and,
• Compare the difficulties posed by equipment (e.g. processors, bandwidth, device type) and software constraints (e.g. operating systems).
In addition, it must also be emphasised that rigorous quantitative measures for each of the seven attributes should be developed so that each experiment can be replicated, compared, reviewed and validated within the information sciences community.