As with text documents, spreadsheets and databases largely remain in a single format throughout their creation period. They are also largely self-contained in that they seldom include imported images or other media within the file. Where this does happen, however, it is recommended that embedded content is stored separately in order that these files can follow a separate relevant preservation path.
As far as many archives are concerned the core of both a spreadsheet and a database are the data tables/worksheets themselves along with documentation and metadata describing the contents of and relationships between tables or sheets. The order or layout of the columns and rows may also be of significance (especially in spreadsheets where layout is significant) but forms, reports, queries and macros are not seen as core data and are therefore often not preserved.
The significant properties - i.e. the most basic elements of a file that are to be preserved and maintained - are outlined below:
In addition to ensuring that the significant properties of a file are retained during any file conversions, there are certain checks to be made before a conversion can be carried out. These checks ensure that certain significant characteristics of a spreadsheet or database are not lost during the conversion process.
Depending on how the data is to be preserved, it may be necessary to rename database tables and spreadsheet worksheets. If files are to be preserved as delimited text then, for both databases and spreadsheets, the final dataset will consist of a text file per table/worksheet. Where possible files should retain the same name as the original (though the file extension may be different). Where multiple worksheets or database tables are being exported, filenames should reflect both the name of the original spreadsheet/database and the name of the worksheet or table that the data came from, for example:
Where extra files are created to hold images, queries, notes or formulae, these should also be named in logical way that makes it easy to trace exactly where they came from, e.g. [spreadsheet name]-[worksheet name]-[chart name].tif.
In some cases it may be necessary to change the table names to either shorten them or remove punctuation that cannot be used within a filename. In such cases it is important to remain as close to the original name as possible.
For the majority of databases and spreadsheets, the preferred format for preserving data is as delimited text. However, as highlighted above, there are often stylistic or functional elements of a spreadsheet or database that can only be maintained within a certain format. In such cases it is recommended that a compatible open XML-based format (e.g. .ods or xlsx) is used or that the elements of the file are documented and preserved alongside a text export of the data.
|Delimited Text (e.g..txt, .tab, csv||Delimited text is generally the preferred format for the preservation of spreadsheet and database files and is a common export format from many applications. In such files cells are separated by a delimiter and (optionally) a qualifier (e.g. a comma and enclosed in double quotes in .csv files) and rows are represented by new lines. Common delimiters include commas (csv files), tabs or the pipe '|' character. It is a format that most people are familiar with and has the added advantage that many files (e.g. .csv) will open directly into common applications such as MS Excel. This makes the format ideal for both preservation and dissemination as the file can be easily opened using common software. As highlighted above, this format merely stores the data and additional elements (images, formulae, etc.) should be archived separately (see section below).|
|.ods||Suitable for long-term preservation of spreadsheet files but embedded graphs or charts should also be stored separately.|
|.xlsx||Suitable for long-term preservation of spreadsheet files but embedded graphs or charts should also be stored separately.|
Checks after Migration
When migrating files to new formats it is important to carry out a number of checks to ensure that data hasn't become lost or corrupted during the process. Such checks include:
Portable Document Format (preferably .pdfa, see the chapter on Documents and Digital Texts) can occasionally be used to disseminate spreadsheet data. This should only be used where there is significant information present in the form of formatting and layout which can not adequately be replicated in a csv file and is lost when the data is migrated to an XML-based format (ods or xlsx). In some cases it may be necessary to provide a pdf version alongside a csv version for complicated spreadsheets so that users can see what the spreadsheet looked like but can also work with the data.
XML is also widely used for spreadsheets and an application that may become more popular is Gnumeric part of the GNOME desktop. It is freely available and open source and, as with OpenOffice Calc and Excel, its native file format again is compressed XML.
XML also offers a potentially reliable format for storing database data. A schema (or DTD) would hold the data dictionary which can not be stored so readily and usefully for delimited text files. A number of converters for a variety of database formats are available. Alternatively Table definitions and data can be stored as SQL DDL and DML statements (though users should be aware of variations in SQL dialects).
Another option for databases is the SIARD database archiving tool (Software Independent Archiving of Relational Databases) which has been partially incorporated into the Planets toolkit. The SIARD Suite is based on international standards such as XML, SQL:1999 and UNICODE and currently supports a number of databases including Oracle, Microsoft SQL Server and Microsoft Access. The Swiss Federal Archives distributes the SIARD Suite free of charge (subject to licence agreement terms).
Databases and spreadsheets require metadata and documentation at a number of levels to ensure that they can be preserved and reused reliably. The following elements should be recorded and stored with the dataset
|Name of database/spreadsheet file|
|Name of worksheet/table|
|Purpose of worksheet/table|
|Number of rows of data|
|Primary keys (database only)|
|Foreign keys (database only)|
Repeat the following section for each column/field within your spreadsheet/database:
|Field Name||Name of database field or spreadsheet column.|
|Field Description||Full description of fields and codes or terminology used. Alternatively, codes used within a dataset can be supplied as a separate document.|
|Data type and field length (database only)|
The elements above essentially make up the 'data dictionary' commonly associated with databases. In addition, for databases in particular, it is also necessary to describe the relationships between tables either in words or by the inclusion of an entity relationship diagram.
Documentation can also include any extra features in the spreadsheet or database that require preservation, for example, formulae, queries, macros, and comments. These can generally be stored alongside the data as text files.
Although the relationships between files should be clear from the filenames, it is useful to store exported tables or worksheets together under a single directory. Any exported images or documentation can then be stored in sub directories within this structure.