g2gp 17-01-2009
Needs revising

Appendix 1. The Open Archival Information System#

The development of the OAIS reference model was pioneered by the Consultative Committee for Space Data Systems (CCSDS), of which NASA is a member agency, and the model has been accepted as an ISO standard (ISO 14721:2003)[1]. A technical recommendation is also available for consultation on the CCSDS website[2]. As a reference model, OAIS provides a conceptual framework within which to consider the functional requirements for an archival system suited to the long-term management and preservation of digital data. Such consideration can be given both to proposed and to existing systems. The model is also seen as a way of comparing systems by mapping discipline-specific jargon to OAIS terminology, which is intended to be clear and unambiguous enough to be understood by those beyond dedicated archival staff. The core entities and work flows within the model are shown in Figure 1 below.

Figure 1: OAIS Functional Entities (after CCSDS Fig.4.1[3]).

Data producers create Submission Information Packages (SIPs). A SIP equates to a deposit of digital data plus any documentation and metadata necessary for the archive to facilitate the long-term preservation of the data and to provide access for consumers (i.e. reuse). The SIP provides the basis for the Archival Information Package (AIP) and the Dissemination Information Package (DIP), both of which are generated by the archive. The process involves generating preservation and dissemination versions of the deposited data where necessary. For example, a Microsoft Word .doc file might be converted to an XML-based format such as an OpenDocument text file (the OpenOffice format) for long-term preservation and to PDF for dissemination. Metadata documenting this processing is added to the AIP, as is any relevant information from the SIP. Similarly, any resource discovery metadata and reuse documentation in the SIP is added to the DIP.

Consequently, metadata and documentation supplied as part of a SIP assume major importance in terms of data deposition. The OAIS standard notes of the SIP that 'Its form and detailed content are typically negotiated between the Producer and the OAIS'. In practice most repositories offer guidelines to depositors about acceptable formats, delivery media, copyright issues and necessary documentation and metadata.

In general the archival community is actively seeking to become compliant with the reference model through the process of certification (see Archival Strategies). It should, however, be noted that the audit checklists used for certification are a very recent development and, for the time being, a state of trust needs to exist between creator and archive.

Creating an Archival Information Package (AIP)#

Data in the Submission Information Package (SIP) should be in, or have migration paths to, suitable preservation formats and, together with the associated documentation, this data should be sufficient to support the creation of an Archival Information Package (AIP).

The AIP should consist 'of the Content Information and the associated Preservation Description Information (PDI), which is preserved within an OAIS'.

  • The Content Information is defined as the 'set of information that is the original target of preservation. It is an Information Object comprised of its Content Data Object and its Representation Information. An example of Content Information could be a single table of numbers representing, and understandable as, temperatures, but excluding the documentation that would explain its history and origin, how it relates to other observations, etc'.
  • The PDI is the 'information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, and Context information'[4].
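As a rough illustration of the two definitions above, the parts of an AIP could be modelled as follows. This is a minimal sketch only: the class names, field names and example values are hypothetical and are not taken from the OAIS standard, which defines concepts rather than a data format.

```python
from dataclasses import dataclass


@dataclass
class ContentInformation:
    """The original target of preservation."""
    content_data_object: bytes        # the data itself, e.g. a table of numbers
    representation_information: str   # what is needed to interpret the data


@dataclass
class PreservationDescriptionInformation:
    """The four PDI categories named in the OAIS definition."""
    provenance: str   # history and origin of the content
    reference: str    # identifier(s) for the content
    fixity: str       # e.g. a checksum guarding against undocumented change
    context: str      # how the content relates to other information


@dataclass
class ArchivalInformationPackage:
    content: ContentInformation
    pdi: PreservationDescriptionInformation


# Hypothetical example: a small table of temperatures and its PDI.
aip = ArchivalInformationPackage(
    content=ContentInformation(
        content_data_object=b"12.1,12.4,11.9\n",
        representation_information="CSV, ASCII; values are degrees Celsius",
    ),
    pdi=PreservationDescriptionInformation(
        provenance="Recorded by the depositor; deposited unchanged",
        reference="example-archive:temperature-table-1",
        fixity="sha256 checksum recorded at ingest",
        context="One of a series of site observations",
    ),
)
```

Keeping the Content Information and the PDI as separate objects mirrors the wording of the definition: the table of numbers is preserved, and the documentation about it travels alongside.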

Given a well-formed SIP, an archive should have few problems in generating the AIP. Rich metadata then supports the ongoing management of the data it describes, for example through the automated audit of data against fixity (checksum) values or through migration as a batch process.
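An automated fixity audit of the kind mentioned above might look like the following. This is a sketch under stated assumptions: the choice of SHA-256, the function names and the manifest layout (relative path mapped to the checksum recorded at ingest) are all illustrative and do not describe any particular archive's practice.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths whose file is missing or has changed.

    `manifest` (hypothetical layout) maps a path relative to `root`
    to the checksum recorded when the file was ingested.
    """
    failures = []
    for relative_path, recorded in sorted(manifest.items()):
        target = root / relative_path
        if not target.is_file() or sha256_of(target) != recorded:
            failures.append(relative_path)
    return failures
```

Reading in chunks keeps memory use flat for very large files, which matters for Big Data deposits. An empty list returned by `audit` means every file still matches its recorded checksum.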

Dissemination Information Packages#

Data in the Submission Information Package (SIP) should also be in, or have migration paths to, formats suitable for dissemination for reuse. The submitted format can in many cases be the same for both preservation and dissemination. The SIP needs to contain any documentation that facilitates reuse including metadata relating to resource discovery, fitness for use, access, transfer and use. A well formed SIP will facilitate the generation of the Dissemination Information Package (DIP).

Many of the formats noted as suitable for preservation are also suitable for dissemination. This is the ideal situation, especially for 'Big Data', as datasets need only be stored once. However, as already noted, archivists generally prefer simple file formats such as ASCII whilst users prefer the smaller file sizes of binary files. Some formats have associated tools that allow a file to be stored as ASCII and a binary file to be generated from it automatically on demand. For example, the NetCDF format appears to support such an operation. The development of LAStoASCII and ASCIItoLAS tools would also provide an ideal environment for the increasingly popular LAS format.
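The size difference between text and binary encodings, and the idea of generating the binary form from the stored text form on demand, can be sketched in a few lines. The example below does not use NetCDF or LAS tools; it uses Python's standard `struct` module as a stand-in, and the point values, the three-decimal precision and the function names are assumptions made for illustration.

```python
import struct

Point = tuple[float, float, float]


def to_ascii(points: list[Point]) -> bytes:
    """One 'x y z' line per point, written to three decimal places."""
    lines = [f"{x:.3f} {y:.3f} {z:.3f}" for x, y, z in points]
    return "\n".join(lines).encode("ascii")


def to_binary(points: list[Point]) -> bytes:
    """Pack each point as three little-endian 64-bit floats (24 bytes)."""
    return b"".join(struct.pack("<3d", *point) for point in points)


def from_binary(data: bytes) -> list[Point]:
    """Inverse of to_binary: unpack 24-byte records back into points."""
    return [tuple(values) for values in struct.iter_unpack("<3d", data)]


# Hypothetical survey points with large easting/northing style coordinates.
points = [(400000.0 + i * 0.25, 500000.0 + i * 0.25, 100.0 + i * 0.125)
          for i in range(1000)]
ascii_size = len(to_ascii(points))
binary_size = len(to_binary(points))
print(f"ASCII: {ascii_size} bytes, binary: {binary_size} bytes")
```

The exact saving depends on the precision written to the text file and on the binary types chosen, so the figures printed here should be read as indicative only.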

Revise from here

6.2 Dissemination strategies#

As with data transfer between creator and archive, the dissemination of data to a wider audience is often seen as problematic. Users prefer online access to file downloads. Whilst archival organisations are often connected to high bandwidth systems, many end users are not. For this reason the Archaeology Data Service (ADS), as an example, restricts file download sizes so that users do not unwittingly affect their networks. On occasion larger files are made available for download by special arrangement for users known to have suitable connections. This may be one solution.

Other network technologies that were investigated included BitTorrent, a peer to peer (P2P) communications protocol for file sharing which appears to have possibilities as a means of distribution. To share a file an initial peer creates a 'torrent', a small file containing metadata about the file(s) to be shared and about the 'tracker', the computer that coordinates the file distribution. When the first peers pick up the torrent and download the file(s) using BitTorrent clients, they are expected as part of the process to become distributors of a small piece of the file(s). The tracker maintains a manifest of which peer has which part of a file and tells new peers where to download each piece. As the number of peers builds up, the load is increasingly shifted off the seed computer. Clearly the system needs peers or clients to have largely persistent network connections so that others can access the file fragments.

The above works very well with audio and video data that will have a high download usage and hence lots of potential peers. Research by CableLabs in 2006 suggests that 'some 18% of all broadband traffic carries the torrents of BitTorrent'. This could provide a distributed archiving model; however, the reuse of Big Data is likely to be an occasional and limited activity, with the consequence that BitTorrent is unlikely to provide an advantageous service where, within a small community, there will be limited downloads and thus few peers. To quantify this, file fragments are typically between 64 KB and 1 MB each; taking the upper value, a 1 GB file is split into about 1,000 pieces, so spreading the load fully would need in the order of 1,000 peers. There would be some advantage to the original seed, but anyone attempting to reuse the data would experience even longer download times because of administration overheads.
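The arithmetic behind the figure above can be checked with a short calculation. This is a sketch only: it assumes decimal units (1 GB = 10^9 bytes) to match the round number in the text, and the function name is illustrative.

```python
def piece_count(file_size_bytes: int, piece_size_bytes: int) -> int:
    """Number of fixed-size pieces needed to cover a file (ceiling division)."""
    return -(-file_size_bytes // piece_size_bytes)


GB = 10**9   # decimal units, to match the round figure in the text
MB = 10**6
KB = 10**3

print(piece_count(1 * GB, 1 * MB))    # 1000 pieces at the upper piece size
print(piece_count(1 * GB, 64 * KB))   # 15625 pieces at the lower piece size
```

At the lower piece size the number of pieces rises by more than an order of magnitude, which makes the point about small communities even more strongly.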

High speed 'Point of Access' (PoA) optical networks and Grid Computing were also considered. UKLight 'is a national facility to support projects working on developments towards optical networks' (http://www.managinginformation.com/news/content_show_full.php?id=3080). Data is transferred across dedicated 10 gigabit channels in a continuous stream, rather than being broken down in the conventional way into small packets which are variously routed to their destination, with a propensity for packet loss and the need to retransmit. As well as providing speed, the dedicated channels mean other network users are unaffected in terms of bandwidth loss. The HP Vista Centre within the Institute of Archaeology and Antiquity at the University of Birmingham is a UKLight member and is connected to the PoA in London. The ADS discussed the possibility of adding a spur to an existing UKLight connection with Computing Services at the University of York (where the ADS is based).

The cost of several thousand pounds, however, prevented this from proceeding further. Interestingly, although part of the academic network, UKLight is not exclusive in that collaborative projects within a wider community are considered. This may be worth investigating further as a way to link up academic and other archaeological organisations.

Grid Computing has a number of meanings. Of specific interest are data grids, which are concerned with ‘the controlled sharing and management of large amounts of distributed data’. Data grids may be combined with computational grid systems. A number of open source middleware applications have been developed to support grids as a means of data sharing. The in-depth investigation of using grids would be a project in its own right. The ADS is actively investigating the possibility of a project proposal to the e-Science programme. This would use Big Data archives or data coming from the Virtual ExploratioN of Underwater Sites (VENUS) project, in which the ADS is a partner. This will feed back into the wider archaeological community.

Currently the most consistent way of disseminating large datasets is likely to be on portable media: DVDs for the lower end of Big Data and external hard drives for anything bigger. As noted already, one terabyte portable hard drives are available for under £300 and can be supplied and returned.
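To put the network options in perspective, the idealised time to move one terabyte over different links can be estimated as below. This is a back-of-the-envelope sketch: the 10 gigabit rate is the dedicated channel figure quoted for UKLight, the 8 Mbit/s end-user rate is purely an assumption for comparison, and protocol overheads are ignored.

```python
def transfer_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealised transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / bits_per_second


TERABYTE = 10**12  # decimal terabyte, in bytes
links = {
    "dedicated 10 Gbit/s channel": 10 * 10**9,
    "assumed 8 Mbit/s end-user connection": 8 * 10**6,
}
for name, rate in links.items():
    hours = transfer_seconds(TERABYTE, rate) / 3600
    print(f"{name}: {hours:.1f} hours")
```

Under these assumptions the dedicated channel needs a matter of minutes while the end-user connection needs well over a week of continuous transfer, which is why posting a hard drive remains competitive.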

Acquiring large files is likely to be expensive in one way or another, whether in terms of taking up bandwidth or of costs for preparing media. Clearly potential users need to be able to ascertain the relevance to them of available data. Traditionally this has been done through descriptive metadata. The use of ‘tasters’ such as thumbnail images or movie clips is also a well established decision support mechanism. Big Data throws up some perhaps more unusual mechanisms such as fly-throughs and point cloud models. These are generally project outcomes and tend to use decimated datasets, but they will inform on the relevance of the associated raw data. One example is the point cloud models produced by the Big Data case study Breaking through Rock Art Recording. These models are available through the ADS website as Visualization Toolkit (.vtk) files, which can be viewed with 3D visualisation software including the freely available ParaView.

Archival Strategies at large (recommendations from the Big Data report)#

Because the certification metrics are very new, many archives are currently working towards OAIS compliance. As such, trust must exist between creator and archive.

The Submission Information Package (SIP) assumes major importance in the relationship between data producer and an OAIS compliant archive: as well as the data itself, documentation and metadata inform preservation and reuse.

Data creation#

In order to undertake the long-term preservation and dissemination of data effectively, archival organisations need a well-formed Submission Information Package (SIP).

Consideration must be given to software and the formats it supports during data creation. Where long-term reuse is a goal there must be clear migration paths for both preservation and reuse.

In general ASCII text is seen as the most stable format for data preservation, whilst open binary formats suit the dissemination of Big Data because of a dramatic reduction in file size.

Inadequate documentation during data creation is the single biggest barrier to the future reuse of data. Documentation, including metadata, facilitates reuse as well as supporting in-house administration and management during a project.

It is recommended that the UK GEMINI metadata standard, which is compliant with the ISO 19115 standard for geographic information, is used to describe survey data. Further, maintenance of provenance and fixity metadata is identified as a crucial part of data creation.

Any other documentation that may facilitate reuse should also be included in the SIP.

[1] http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=24683&ICS1=49&ICS2=140&ICS3
[2] http://public.ccsds.org/publications/archive/650x0b1.pdf
[3] http://public.ccsds.org/publications/archive/650x0b1.pdf
[4] See Section 1.7.2 TERMINOLOGY of http://public.ccsds.org/publications/archive/650x0b1.pdf