Text files and documents are generally a component of a larger archive and often themselves document the project planning and requirements of a larger project. While little project planning in itself is required for the creation of a word processed document, there are a few considerations that should be borne in mind when creating these files:
A more general consideration is that creators should ensure, where possible, that the content of a document is complete and self-explanatory.
In addition to these general considerations, each application and file format (many of which can be created from the same application) offers different benefits. The table below aims to outline some of the common formats used to create text documents along with their associated applications and potential uses for long term preservation.
|.doc||A proprietary binary format from Microsoft Word.||A popular format and the default file format for all versions of MS Word from 1.0 through to 6.0, 95, and 97-2003. In addition to use within Word, the files can also be read in OpenOffice and can easily be converted to .pdf. Although backwards compatibility has been fairly good across versions of Word, the recent addition of service pack 3 to Microsoft Office Word 2003 has stopped support for versions 2.0 and earlier. From 2008 the specifications for a number of Microsoft binary file formats were made available from the Microsoft website as well as the British Library.||Though not suitable for archival preservation or dissemination, it is a popular format and thus convenient for archival deposit.|
|.docx||Part of the Office Open XML (OOXML) format created by Microsoft. An ECMA (ECMA-376) and ISO (ISO/IEC 29500-1:2008) standard.||A relatively new format from Microsoft, released with Office 2007. Microsoft chose to develop its own specification (OOXML) rather than use the existing ODF international standard (ISO/IEC 26300:2006, see ODT below) in order to provide better backwards compatibility with earlier MS Word file formats. The format consists of human-readable XML files packed with other content within a single zipped file.||Suitable for deposit, dissemination and preservation, though embedded content should be stored separately. The final file is essentially a zipped archive and may be best stored in an uncompressed format.|
|.rtf||RTF (Rich Text Format) is a tagged textual format developed by Microsoft.||Although a largely human readable plain text format, and therefore suitable for both presentation and preservation, there are compatibility issues regarding formatting (e.g. textboxes and tables) when opening files in different word processing packages. In addition, file size of an .rtf file is generally much larger than the equivalent .doc, .pdf or .odt file.||Although suitable for deposit and preservation, newer formats such as .docx and .odt provide a more compact and compatible format and should be used in preference to .rtf.|
|.odt||Open Document Text is part of the OpenDocument Format, an ISO standard (ISO/IEC 26300:2006) for XML-based office document formats.||As with the .docx format, the .odt file is essentially a compressed zip file containing separate style, text (as XML) and embedded content (e.g. images) files.||As an open XML-based format, ODT is suitable for both deposit and preservation though, in the latter case, the files should be stored in their uncompressed form. Additionally, where the document contains images or other content, these should ideally be stored separately in a suitable preservation format.|
|.sxw||Part of the OpenOffice.org XML format used by OpenOffice/StarOffice from version 1.0 to version 2.0. Superseded by the OpenDocument Format.||Although this format has been superseded by ODF, it is structured similarly to ODF (i.e. zipped XML files) and can still be read by OpenOffice.org 2.0.||Suitable for preservation but ODT should be used where possible.|
|.wpd||A binary and proprietary format from WordPerfect.||The popularity of WordPerfect has declined significantly since its initial release in the early 1980s in response to the rise of its main competitor Microsoft Word. Although .wpd (also .wp or .wp5 etc. for earlier versions) is the main format, more recent versions of the software support a wide range of import and export options.||Not recommended for preservation or dissemination. Although native to WordPerfect, .wpd files can also be read in Microsoft Office Word and OpenOffice. Current versions of WordPerfect Office also support the export (and import) of both ODF and OOXML files; it is therefore recommended that these XML-based open standards be used instead.|
|.txt / plain text files||A simple plain text document. Plain text also forms the basis for marked-up texts (discussed below).||Plain text files are the "lowest common denominator encoding for textual information" (Wynne & Yeates 2004) and are widely compatible with a number of platforms and software. However, as a result, they support little in the way of formatting and are only useful for the very simplest of documents. For all plain text files, an encoding (commonly US-ASCII or Unicode) should be specified. Plain text files are covered in more detail in the AHDS Preservation Manual on Plain Text.||Suitable for ingest, preservation and dissemination but only for extremely simple files.|
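The table notes that .docx, .odt and .sxw files are essentially zipped archives of XML and embedded content, and that such files may be best stored uncompressed. As a minimal sketch of what that looks like in practice, the Python below lists the parts of a zipped document and unpacks them using only the standard library. The function names and the tiny in-memory sample archive are illustrative assumptions; the sample is a stand-in and not a valid .odt or .docx file.

```python
import io
import zipfile


def list_package_parts(source):
    """Return (part name, uncompressed size in bytes) for each file in a zipped document."""
    with zipfile.ZipFile(source) as package:
        return [(info.filename, info.file_size) for info in package.infolist()]


def unpack_package(source, target_dir):
    """Extract every part so the XML and embedded media can be stored uncompressed."""
    with zipfile.ZipFile(source) as package:
        package.extractall(target_dir)


def make_sample_package():
    """Build a tiny stand-in archive in memory (hypothetical; not a valid .odt or .docx)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("content.xml", "<text>Hello</text>")
        package.writestr("Pictures/figure1.png", b"\x89PNG")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    # A real file path such as "report.odt" could be passed instead of the sample.
    for name, size in list_package_parts(make_sample_package()):
        print(name, size)
```

Listing the parts before unpacking makes it easy to spot embedded images or other media that, as recommended above, should ideally be stored separately in a suitable preservation format.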
Marked-up text formats
Marked-up texts are dealt with in detail elsewhere in the AHDS Guide 'Creating and Documenting Electronic Texts: A Guide to Good Practice' (Morrison et al 2001) and in the 'AHDS Preservation Handbook: Marked-up Textual Data' (Morrison & Wynne 2005). Although not commonly created for report-style documents (HTML is more commonly used for web-pages and XML for data exchange), a number of common marked-up text formats are outlined briefly below.
|.sgml||Standard Generalized Markup Language (SGML). An ISO standard (ISO 8879:1986) metalanguage.||SGML is a metalanguage used to define other markup languages such as HTML and XML.||Suitable for preservation and dissemination though documents must be valid.|
|.html / .xhtml||Hypertext Markup Language (HTML) is a plain text based markup language developed as an application of SGML and maintained by the W3C.||HTML is a markup language commonly used to create webpages. Aside from the plain text content of the HTML file (including either inline or linked stylesheets), websites generally consist of a wide range of linked media (images, video, audio, documents, etc.) which should be dealt with as separate objects.||Suitable for preservation and dissemination though documents must adhere to (and specify) a valid DTD and character encoding. Where used, CSS styles should either be specified within the document or supplied as a separate file. Images and other media should be dealt with as individual objects as per other guides.|
|.xml||Extensible Markup Language (XML) is a plain text based open standard produced by the W3C.||XML was developed as a subset of SGML and is generally used on the web and in exchanging data between systems (e.g. databases).||Suitable for preservation and dissemination though documents must adhere to (and specify) a valid DTD/schema and character encoding.|
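The recommendations above ask that marked-up documents specify a character encoding and conform to a DTD or schema. As a first, partial check, the sketch below uses the Python standard library to test whether an XML document is well-formed and to report the encoding named in its XML declaration. The function name `check_xml` is a hypothetical helper. Note the limits: this does not validate against a DTD or schema (that needs a separate validating parser), and the declaration lookup assumes an ASCII-compatible encoding.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute in a leading XML declaration.
# Assumption: the file starts with ASCII-compatible bytes (e.g. UTF-8, ISO-8859-1).
XML_DECL_ENCODING = re.compile(rb"^<\?xml[^>]*encoding=[\"']([A-Za-z0-9._-]+)[\"']")


def check_xml(data: bytes):
    """Return (well_formed, declared_encoding) for an XML document held as bytes."""
    match = XML_DECL_ENCODING.match(data)
    declared = match.group(1).decode("ascii") if match else None
    try:
        ET.fromstring(data)  # raises ParseError if the markup is not well-formed
        well_formed = True
    except ET.ParseError:
        well_formed = False
    return well_formed, declared


if __name__ == "__main__":
    good = b'<?xml version="1.0" encoding="UTF-8"?><report><title>Survey</title></report>'
    bad = b"<report><title>Survey</report>"  # mismatched closing tag, no declaration
    print(check_xml(good))
    print(check_xml(bad))
```

A document that fails the well-formedness test cannot be valid, so this is a cheap screening step before full DTD or schema validation.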
In terms of lifecycles, for the most part text files remain in the same format while they are being created. An exception to this would be the PDF format in that few documents are created originally in PDF, the vast majority coming from a word processed file (e.g. from Word or OpenOffice) which is then saved into PDF format for dissemination at the end of the authoring process.