g2gp 17-01-2009

Planning for the Creation of Digital Data#

From the moment a project begins, careful thought must go into the planning of the digital archive that will be created at the project's conclusion. Whilst project planning is often specific to the project and data types being generated (and will be covered within the technique-specific chapters that follow), some general project planning principles apply regardless of project type. Planning should include:

  • Planning for the Creation of Data (covered generally in this chapter) which includes:
    • Preparing a Project Design that defines the types of digital data that will be created or acquired and defines and documents areas of responsibility for creating and managing these files at all stages of their life.
    • Planning the File Formats that will be used during the data creation and analysis phases and how these relate to formats used in the secure archiving and dissemination of data. The formats used may be the same or may change throughout the project lifecycle.
  • Assessing the Level of Documentation and Metadata Required - at the various levels (project, technique, file) to ensure that the project and its datasets are understandable and reusable. This is covered in the following chapters on Project Documentation and Project Metadata.
  • Assessing the Type of Digital Archive that will be created and deciding which files will be preserved. This is covered in the following chapters on Data Selection and Storage.
  • Checking with the digital archive facility destined to receive the files to see which specific guidelines or standards should be followed. If guidelines are not specified, it is recommended that the advice put forward in these Guidelines is followed.

Most importantly, data creators should document the tasks necessary for the successful completion of the project at its outset and update this documentation throughout the life of the project. This section outlines some of the considerations required at a general project level during the planning and data creation stages.

Data Creation and Capture#

Distinctions should be made early on in a project between data that is created digitally ('born digital'), captured from a non-digital original (digitised), or acquired in digital form from external sources or suppliers (e.g. purchased data such as base map layers for GIS). Within each of the data type or technique-based chapters in these Guides, file formats and software applications commonly used during the data creation phase are discussed in terms of their suitability for long term preservation and possible migration paths to other formats. These sections recognise that, while certain formats may not be suitable for long-term preservation, such formats may be the most appropriate for data creation, development and dissemination (Brown 2008, 5). It is recommended that such considerations be recognised in the data creation stage of a project so that adequate planning can be put in place - if required - for the later conversion of potentially problematic files. However, there are a number of general principles, commonly cited by archives and repositories (see Todd 2009), that guide the identification of stable and reliable file formats.

  • Open and Proprietary Formats and Standards - There is a general preference amongst archives and repositories to store data in formats which are standardised, openly documented and, where possible, non-proprietary. Such formats are deemed to be more easily sustainable due to the openness and availability of the file format and, in many cases, its use across platforms or by multiple applications. This is discussed elsewhere (Library of Congress[1] and Todd 2009) in terms of Disclosure, External dependencies, Impact of patents and Adoption.
  • Binary and Plain Text files - For many datasets, such as raster images, file formats based on binary encoding are the only option. However, for a wide range of datasets, such as spreadsheets, databases, text documents and so on, there is generally an option (and, archivally, a preference) to use a file format based on some form of textual encoding, e.g. ASCII plain text or XML. The advantages here are that the file format is more transparent (i.e. it is human-readable and more open to direct analysis) and, as a result, more likely to be identified in terms of content and associated software. The easy access to the file content also means that such formats have fewer external dependencies and more possible migration paths should the associated software become unavailable.
  • Compressed Files - Compression can be used either in the creation of a compressed archive file (e.g. a ZIP or RAR archive) or within specific file formats such as JPG or PNG. In both cases it can use either lossless (no data is discarded) or lossy techniques. In the case of single 'archive' files, as with binary encoding, data compression creates a potential barrier to identifying file types and accessing content. When combined with security features such as password protection, such encoding can ultimately make data permanently inaccessible. Compression used within a file format, such as JPG images or MPEG video, can cause additional problems and degradation of data quality - known as 'generation loss' - when files are continuously reprocessed and re-compressed. In all files in which it is used, lossy compression discards data by design, and any compression can magnify the effect of bit corruption, because a single damaged bit may affect far more of the decoded content than it would in an uncompressed file (Heydegger 2008).
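The point about bit corruption can be illustrated with a minimal sketch. It assumes Python's standard zlib module as a stand-in for any lossless compressor; the sample record and the helper names (flip_bit, bytes_lost) are hypothetical and chosen only for this illustration. It flips one bit in a plain text file and one bit in a compressed copy of the same content, then compares how much of the original can still be read.

```python
import zlib


def flip_bit(data: bytes, byte_index: int, bit: int = 0) -> bytes:
    """Return a copy of data with a single bit inverted (simulated corruption)."""
    damaged = bytearray(data)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)


def bytes_lost(original: bytes, stored: bytes, compressed: bool):
    """Count bytes that differ from the original after reading the stored copy.

    Returns None when the stored copy can no longer be decoded at all.
    """
    try:
        recovered = zlib.decompress(stored) if compressed else stored
    except zlib.error:
        return None
    differing = sum(a != b for a, b in zip(original, recovered))
    return differing + abs(len(original) - len(recovered))


# Hypothetical sample content: fifty identical plain text records.
text = ("context 101: a shallow pit cut into natural clay\n" * 50).encode("ascii")

# One flipped bit in the plain text copy damages exactly one character.
plain_damage = bytes_lost(text, flip_bit(text, len(text) // 2), compressed=False)

# One flipped bit in the compressed copy usually makes the whole file undecodable.
packed = zlib.compress(text)
packed_damage = bytes_lost(text, flip_bit(packed, len(packed) // 2), compressed=True)

print("plain text copy, bytes damaged:", plain_damage)
print("compressed copy, bytes damaged:", packed_damage, "(None means unreadable)")
```

How a damaged compressed file behaves depends on the format and the software reading it; some tools refuse the file outright, as zlib typically does here, while others return partially garbled content.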

An additional element that is key to identifying a suitable file format is the level of Adoption that a specific file format has within a certain community. This can vary between techniques, data types and countries and so will be discussed in more detail within the specific chapters later in these Guides.

There are a number of resources available which describe in detail the criteria used to identify suitable digital formats for preservation. Examples include those published by the Library of Congress[2], the UK National Archives (Brown 2008) and the Digital Preservation Coalition (DPC)[3].

Digitised and External Data

These Guides largely address born digital data and do not aim to provide advice on digitisation. However, such projects, whether digitising internally or outsourcing such work, should still consider the implications of file formats upon the resulting dataset. Regardless of format, such projects should plan to digitise original material using the highest quality data capture to create archival quality data files. These files may be compressed for dissemination purposes, by techniques that often depend on degrading the data quality, and it may be advisable to create and store multiple versions of each file for different purposes. The originals from which the digital datasets were created may still be useful and a documentary archive should be consulted to establish whether this information should be preserved in paper format.

A number of organisations and guidelines exist which provide substantial guidance on undertaking digitisation. JISC Digital Media[4] provides a wide range of advice on digitising existing images as well as analogue video and audio, while other guidelines, such as those produced by the AHDS[5] and UKOLN[6], provide shorter guidance on digitisation project planning.

File Naming Conventions

Regardless of the source of the data, one of the first (and most immediate) steps in ensuring that your data is understandable is to use meaningful file names that reflect content. Data creators should plan to use standard file naming conventions and directory structures from the beginning of a project and, where possible, use consistent conventions across all projects. Directory structures and file names are discussed in relation to specific types of data in a number of the following chapters but general principles and conventions which may be followed are:

  • Reserve the 3-letter file extension for application-specific codes, e.g. PDF, DOC, TIF, and avoid using a full stop (.) elsewhere in a file name.
  • Where possible avoid using spaces within filenames as these can cause problems in some operating systems. It is recommended that data creators use the underscore character to imply a space within the filename.
  • Include some means of identifying the relevant activity in the file name, e.g. a unique reference number, project number or project name.
  • Include version number information in the file name where necessary.

Files generated under certain operating systems may have specific requirements, e.g. DOS requires file names of at most 8 characters with 3-character file extensions, whereas long file names may be used under a Windows environment.
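The conventions above can be checked automatically. The following is a minimal sketch rather than a prescribed scheme: the pattern (project identifier, description, then a version number such as v03, joined by underscores, with a single full stop before a 3-character extension) and the function names are assumptions made for illustration.

```python
import re

# Hypothetical convention: <project>_<description>_v<version>.<ext>
# Letters, digits and underscores only; one full stop, before a 3-character extension.
NAME_PATTERN = re.compile(r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*_v\d+\.[A-Za-z0-9]{3}$")


def follows_convention(file_name: str) -> bool:
    """Return True if the file name matches the assumed naming pattern."""
    return NAME_PATTERN.match(file_name) is not None


def build_file_name(project_id: str, description: str, version: int, extension: str) -> str:
    """Assemble a file name from its parts, using underscores in place of spaces."""
    cleaned = [re.sub(r"\s+", "_", part.strip()) for part in (project_id, description)]
    name = f"{'_'.join(cleaned)}_v{version:02d}.{extension}"
    if not follows_convention(name):
        raise ValueError(f"File name breaks the convention: {name!r}")
    return name


print(build_file_name("ABC123", "site plan", 3, "tif"))
print(follows_convention("site plan.final.doc"))
```

Running a check of this kind over a project directory before deposit is one way to catch names that contain spaces or extra full stops.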

Version Control

It is extremely important to maintain strict version control when working with files, especially if different people are working on single files or where files undergo multiple stages of processing. For example, within the Newham Museum Archaeological Services archive (see What is Digital Archiving? for more details) there were multiple versions of the same file without any indication of which was the most up-to-date. As the archive had no documentation to accompany the digital files, the ADS was forced into making judgements on the currency of the files based upon their date and file size.

There are three common strategies for providing version control:

  • File-naming conventions
  • Standard headers listing creation dates and version numbers
  • File logs

It is important to record, where practical, every change to a file no matter how small the change. Versions that are no longer needed should be weeded out, after making sure that adequate back-up files have been created.
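Of the three strategies, a file log is the simplest to automate. The sketch below is one possible approach, not a required format: the CSV layout, the field names and the use of a SHA-256 checksum to tell versions apart are all assumptions made for illustration.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical log layout; adjust the fields to suit the project.
LOG_FIELDS = ["logged_at", "file_name", "version", "sha256", "changed_by", "note"]


def log_file_version(log_path: Path, data_file: Path, version: str,
                     changed_by: str, note: str) -> dict:
    """Append one row describing the current state of data_file to a CSV file log."""
    entry = {
        "logged_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "file_name": data_file.name,
        "version": version,
        # The checksum identifies this exact version (reads the whole file into memory).
        "sha256": hashlib.sha256(data_file.read_bytes()).hexdigest(),
        "changed_by": changed_by,
        "note": note,
    }
    is_new_log = not log_path.exists()
    with log_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        if is_new_log:
            writer.writeheader()  # write the column headings once
        writer.writerow(entry)
    return entry
```

A call such as log_file_version(Path("file_log.csv"), Path("ABC123_report_v02.doc"), "02", "A. Editor", "Corrected context numbers") would add one line to the log; the file names and the editor's name here are placeholders. With a log of this kind, the most recent version of a file can be identified from the log itself rather than inferred from dates and file sizes.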

File Structures

Digital files should also be organised into easily understandable directory structures. It may be desirable to collect related data files in a folder using standard naming conventions to aid retrieval from the directory structure. Some files produced as the result of certain techniques (e.g. geophysical survey or GIS) are best structured and stored in standardised file structures and these are discussed in the relevant chapters later on in these Guides.

Storing Digital Datasets - precautions during the data creation stage (and beyond).#

During the working life of most projects digital data will be created on the hard disks of standalone PCs, on laptop computers or on network drives. Additionally, data may be acquired or stored on USB drives, back-up tapes, CD or DVD ROMs or other electronic media. Ideally, however they were created or acquired, digital files in current use will be routinely backed up as part of good working practice.

It is not sufficient to leave digital media languishing on shelves or in data safes. Fireproof, anti-magnetic facilities are extremely important for the safe storage of digital media, and back-up versions should be stored separately from original media. Data creators should make sure that the archive is complete before storing it away, and ensure that archive documentation is also included. It is also important to have an effective data management system in place in which it is noted where files are stored and how the media are labelled. Data creators can follow a number of simple procedures or strategies which will ensure that data is safe during the creation stage of a project.

Secure backing-up

Back-up is the familiar task of ensuring that there is an emergency copy, or a snap-shot, of data held somewhere else. For a small project this may mean a single file held on an external disc drive, on removable media or over a network; for a larger project or dataset it may mean more rigorous procedures of disaster planning, with fireproof cupboards, off-site copies and daily, weekly and monthly copies. These are important in the life span of the project, but are not the same as long-term archiving because once the project is completed and its digital archive safely deposited, the action of backing up will become unnecessary.

The most widely used back-up strategy is the so-called 'Grandparent-Parent-Child' strategy, often implemented by large institutions using digital tapes, but appropriate to other storage media too. The system works by employing a rotation of full and partial back-ups on each day of the week or month. The most recent full back-up, the 'Parent', contains a snap-shot of the whole network or dataset at the start of a week. 'Children' are more frequent, normally daily, back-ups containing only the changes to the system executed on that day. These tapes don't have to be kept in perpetuity, but can be recycled every time a new Parent is created. Once a month or so, a permanent complete snap-shot is taken, which should be stored in perpetuity and would not normally be recycled. This monthly back-up is the 'Grandparent', and can be brought out in moments of real crisis. It is best practice that the weekly and monthly back-ups are stored away from the office where the data are normally stored, preferably in a secure, fireproof, anti-magnetic environment. Of course, for a small dataset, or one that changes infrequently, such regular copying is excessive. The system can be tailored to individual requirements and the time periods expanded or contracted as necessary.
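The rotation can be expressed as a simple calendar rule. The sketch below assumes one possible schedule: a Grandparent on the first day of each month, a Parent every Monday and a Child on every other day. These days, and the function names, are illustrative assumptions that should be adjusted to the project's own time periods.

```python
from datetime import date, timedelta


def backup_type(day: date) -> str:
    """Classify the back-up due on a given day under the assumed schedule."""
    if day.day == 1:
        return "grandparent"  # monthly full snap-shot, kept permanently
    if day.weekday() == 0:    # Monday
        return "parent"       # weekly full back-up
    return "child"            # daily back-up of that day's changes only


def schedule(start: date, days: int) -> list[tuple[date, str]]:
    """List the back-up due on each of the given number of days from start."""
    return [(start + timedelta(days=offset), backup_type(start + timedelta(days=offset)))
            for offset in range(days)]


for day, kind in schedule(date(2009, 1, 1), 7):
    print(day.isoformat(), kind)
```

Under this rule the Child media written since the previous Monday become available for reuse each time a new Parent is created, while Grandparent media are never recycled.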

It is also important to validate the back-up copies to ensure that all formatting and important data have been accurately preserved. Create back-ups when a project is complete or dormant, prior to any major changes, or if files are large enough to cause handling difficulties on the network. Each back-up should be clearly labelled, and its location should be logged.
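One way to validate a back-up is to compare checksums of the original files and their copies. The sketch below uses SHA-256 from Python's standard library; the choice of checksum and the function names are assumptions made for illustration. It confirms only that each copy is bit-for-bit identical to its original, so it does not replace opening a sample of files in the relevant software to check that formatting has survived.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir: Path, backup_dir: Path) -> dict[str, str]:
    """Return {relative path: problem} for each original file whose copy is missing or differs."""
    problems = {}
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems[relative.as_posix()] = "missing from back-up"
        elif sha256_of(copy) != sha256_of(source):
            problems[relative.as_posix()] = "content differs"
    return problems
```

An empty result means every file in the original directory has an identical copy in the back-up; any entries returned can be recorded alongside the back-up's label and location.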

Periodic checking for viruses and other issues

Periodic checks should be performed on a random sample of digital datasets whether in active use or stored elsewhere. Appropriate checks include searching for viruses and routine screening procedures that come with all computer operating systems. These periodic checks should be in addition to constant rigorous virus searching on all files.

Viruses are self-executing programs that enter a computing system, either hidden inside harmless programs or files or disguised in such a way as to encourage unsuspecting users to install them. Once in a system, they replicate themselves and carry out operations over which the user, and often the operating system, has no control. The type of operation depends on the virus, ranging from the invisible to the vaguely irritating to the absolutely devastating. Because they replicate, they can be very difficult to flush out, and because they are invisible, they can come from the most innocent of sources.

Trojans are programs that appear to have useful or desirable features that entice people to install or download them. Trojans may well have some functionality, but they actually exist to do damage. They are technically different from normal viruses because they are not programmed to replicate themselves: once the damage has been repaired, they do not return (whereas some conventional viruses just keep coming back). However, they can be very damaging.

A third group of malicious programs is known as worms. These are similar to viruses in that they replicate themselves and often, but not always, interfere with the normal use of a computer or a program. The difference is that they exist as separate entities; they do not attach themselves to other files or programs. While viruses, trojans and worms can cause great damage, the actual risk is less than some would believe. Experience suggests that much of the damage blamed on viruses and trojans is actually the result of poor management. There is, however, a constant and real, if minor, risk from genuine, malicious programs.

There are some basic steps which can be taken to avoid viruses, trojans and worms:

  • Install anti-virus software on the computer, and make sure that it is kept up to date. There are numerous versions of such software available, some of them free, and some provided by large host institutions (e.g. many universities supply such software to students). Otherwise, most software suppliers will be able to advise on price and functionality. It is important to ensure that the product is routinely upgraded (most check for updates on a daily basis) because new viruses and trojans are constantly being designed and older software might not identify these.
  • Be suspicious of any unsolicited programs or files, particularly from unwanted email, and don't download software from the internet that you are unsure about. Scan all files received with the appropriate software, even if the file was solicited from a close colleague or friend.
  • Don't forward emails called things like 'Virus Warning' unless you are certain that they come from a reliable source. Most of these are hoaxes and can be viruses in their own right. Several of the anti-virus software houses maintain up-to-date lists of all the known viruses (and hoax viruses) in the world. Consult these before forwarding the notice.
  • When buying software, ensure that the supplier will underwrite (reasonable) damages incurred should the software contain a virus (i.e. that the supplier is reputable), or that the product comes with an anti-virus guarantee that could reasonably be used as the basis of legal protection. (If the supplier or manufacturer offers such legal protection, then they will have taken steps to ensure the quality of their product).
  • Have a back-up strategy in place should the worst happen.

The creation of secure back-up copies does not adequately protect digital data in the long term, whether from the degradation of the media on which they are stored or from changes in hardware or software leading to the data becoming irretrievable. Additional steps are required for successful digital archiving.

[1] http://www.digitalpreservation.gov/formats/sustain/sustain.shtml
[2] http://www.digitalpreservation.gov/formats/sustain/sustain.shtml
[3] http://www.dpconline.org/advice/preservationhandbook/media-and-formats
[4] http://www.jiscdigitalmedia.ac.uk/
[5] http://www.ahds.ac.uk/creating/information-papers/checklist/index.htm
[6] http://www.ukoln.ac.uk/interop-focus/gpg/DigitisationProcess/

References

Brown, A. (2008) Selecting File Formats for Long-Term Preservation. The National Archives. http://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf

Todd, M. (2009) File formats for preservation. DPC Technology Watch Series Report 09-02. http://www.dpconline.org/publications/technology-watch-reports

Heydegger, V. (2008) 'Analysing the Impact of File Formats on Data Integrity' in Archiving 2008, Volume 5. http://www.imaging.org/IST/store/epub.cfm?abstrid=38884