g2gp 17-01-2009

Section 1. Introduction to Documents and Texts

Document and text files are arguably the most common file types created as the result of archaeological research. Regardless of the type of work being undertaken, whether it is field survey, desktop assessment or radiocarbon dating, the overwhelming majority of projects will, at the very least, produce some kind of final report in the form of a text document. In addition to reports, text documents are often produced to record processes and metadata related to other elements of a project, such as geophysical survey or database documentation.

This guide aims to provide an overview of the main types of binary and plain text documents commonly produced by archaeological projects. In addition to a discussion of common file types and archival formats, this guide will also discuss which elements should be considered significant properties of text documents, how different methods of creation influence these properties, and what archival strategies should be employed to ensure that they survive.

1.1 What are Documents and Texts?

Simply put, the majority of text documents are digital analogues of traditional publications and can therefore range in size and complexity from fairly simple reports and short papers through to substantial documents such as theses or books. These files consist predominantly of structured text (sentences, paragraphs, pages, chapters) but often include other elements such as images, figures and tabular data.

Digital texts can be produced in a variety of ways, though most are created from scratch in word processing packages such as Microsoft Word and OpenOffice. Files produced by word processing packages have in the past been stored predominantly in proprietary binary file formats, although more recent packages such as Microsoft Word 2007 and OpenOffice mark a distinct move towards human-readable XML-based formats and standards such as .docx (part of the Office Open XML [1] format) and .odt (part of the OpenDocument [2] format). In addition to the formats in which documents are originally created, many text documents in their final versions are stored and disseminated in a common interchange format, most notably Adobe's Portable Document Format [3] (PDF), which keeps the formatting and structure of a document consistent across a variety of platforms while also removing many of the editing possibilities.
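As a brief illustration of what 'XML-based' means in practice, both .docx and .odt files are ZIP packages holding a set of XML parts, so their contents can be inspected without the originating word processor. The following Python sketch is illustrative only and is not part of the original guide. The function names and the in-memory sample package are hypothetical, and the main text locations used (word/document.xml for .docx, content.xml for .odt) are the conventional ones and are assumed rather than looked up from the package itself. The XML is also assumed to be UTF-8 encoded.

```python
import io
import zipfile

# Conventional location of the main text part in each package type
# (an assumption for this sketch; a robust tool would consult the
# package's own manifest or relationship files).
MAIN_PART = {".docx": "word/document.xml", ".odt": "content.xml"}


def list_package_parts(source):
    """Return the sorted names of all parts stored inside the package.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def read_main_text_xml(source, extension):
    """Return the raw XML of the main text part, assuming UTF-8."""
    with zipfile.ZipFile(source) as package:
        return package.read(MAIN_PART[extension]).decode("utf-8")


def build_sample_docx():
    """Build a tiny in-memory stand-in for a .docx package.

    Hypothetical demonstration data only; it is not a valid Word file.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<w:document/>")
    buffer.seek(0)
    return buffer
```

With a real file, list_package_parts("report.docx") (a hypothetical file name) would list every XML part and media file in the package, which can be a useful first check when assessing a deposit.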

In addition to documents created within word processing software, a significant proportion of text documents are created as the result of a digitisation process. Journal digitisation, usually for the preservation or dissemination of pre-digital collections, is often the largest source of digital texts created outside of a word processor. This process generally starts with a digitised image of the hard copy page, which is then processed using optical character recognition (OCR) in order to transform the image into 'real' (i.e. editable and searchable) text. The final text, which may also include images and figures, is predominantly stored using the PDF file format, though an XML-based format may also be used, especially where dynamic online dissemination is required.

Beyond common word processing formats and PDF files, texts may also exist in a range of plain text or marked-up formats such as SGML, HTML and XML. This range of formats is discussed in detail in the Oxford Text Archive Preservation Manuals [4] and will be dealt with briefly alongside other formats below.

1.2 Current Issues and Concerns

File Formats

A number of issues will be discussed below in reference to specific file formats but, in general terms, there are two areas of concern for archives that apply to text documents as a whole. The first of these, which is also seen in relation to a wide range of other file formats, is the continually developing nature of the formats used by word processing packages. Aside from the possibility of receiving files produced by now defunct software (e.g. WordStar), the continual development and enhancement of formats used by currently popular word processing packages often results in incompatibility between older versions of a file and the current version of the software. As mentioned above, the current move towards XML-based open standard formats such as .docx and .odt has been an attempt both to standardise these formats and to allow different software packages to read non-native formats. To some extent a similar problem has also been apparent with the PDF format and, again, the recent move towards an open standard (PDF/A [5]) has been an attempt to address these long-term access issues.
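Because file extensions are not always a reliable guide to the format actually received, one simple first check is to look at the opening bytes of a file. The Python sketch below is illustrative only and is not part of the original guide. The function names and labels are hypothetical, and the list covers only three well-known signatures: the PDF header, the OLE2 compound file header used by legacy binary Word documents, and the ZIP header shared by .docx and .odt packages. It cannot distinguish between format versions or between different ZIP-based formats, for which a dedicated identification tool would be needed.

```python
# Well-known leading byte signatures, checked in order.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file (e.g. legacy binary .doc)"),
    (b"PK\x03\x04", "ZIP container (e.g. .docx or .odt)"),
]


def sniff_format(header):
    """Return a coarse format label for the first bytes of a file."""
    for magic, label in SIGNATURES:
        if header.startswith(magic):
            return label
    return "unknown"


def sniff_file(path):
    """Read the first eight bytes of the file at `path` and classify them."""
    with open(path, "rb") as handle:
        return sniff_format(handle.read(8))
```

A result of "unknown" does not mean a file is unusable; plain text and marked-up formats such as HTML have no fixed signature and need a different check.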

Embedded Objects

In addition to the file formats themselves, there are general concerns regarding the ability to embed content within text documents and the implications this has for preserving such content in the long term within the original document format. The most common type of embedded content is arguably images, although in certain formats, most notably Microsoft Word and PDF, more complex content such as spreadsheets and video can be stored within the text document itself, often in a format that should be deposited and archived in its own right. It is generally recommended that, in addition to being embedded, such content is stored and archived separately, thereby retaining its original qualities (e.g. image resolution) and allowing it to follow a separate archival strategy to the textual content.
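For .docx files, embedded content can often be located by looking inside the ZIP package. The Python sketch below is illustrative only and is not part of the original guide. It assumes the folder names that Microsoft Word conventionally uses (word/media/ for images and word/embeddings/ for embedded objects such as spreadsheets); other software may store such parts elsewhere, so an empty result is not proof that nothing is embedded. The function names and the in-memory sample are hypothetical.

```python
import io
import zipfile

# Folders conventionally used by Word for embedded content (an assumption).
EMBED_PREFIXES = ("word/media/", "word/embeddings/")


def find_embedded_parts(source):
    """Map each embedded part's name to its uncompressed size in bytes.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as package:
        return {
            info.filename: info.file_size
            for info in package.infolist()
            if info.filename.startswith(EMBED_PREFIXES) and not info.is_dir()
        }


def build_sample_docx_with_media():
    """Build a tiny in-memory stand-in package with two embedded parts.

    Hypothetical demonstration data only; it is not a valid Word file.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<w:document/>")
        package.writestr("word/media/image1.png", b"\x89PNG")
        package.writestr("word/embeddings/sheet1.xlsx", b"abc")
    buffer.seek(0)
    return buffer
```

A listing of this kind can help a depositor or archivist identify which images and objects should also be supplied as separate files in their original formats.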

[1] http://www.ecma-international.org/publications/standards/Ecma-376.htm
[2] http://xml.openoffice.org/
[3] http://www.adobe.com/products/acrobat/adobepdf.html
[4] http://ota.ahds.ac.uk/documents/index.xml
[5] http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=50655
