Powered by
JSPWiki v2.8.2
g2gp 17-01-2009
View PDF
This is version . It is not the current version, and thus it cannot be edited.
[Back to current version]   [Restore this version]

Databases and Spreadsheets#

Section 1. Introduction to Databases and Spreadsheets#

Section 2.#

Section 3.#


Starting format Procedure End format
.mdb, .dbf, .db 1. Open or import the data into MS Access. Perform checks outlines above.
2. Export each data table to delimited text (select the table and then Export from either the File menu or the pop-up menu .....or see SJW for a bulk export tool), include headers, use the pipe as delimiter, do not enclose fields in quotes.
3. If Unicode characters are present consider translation to UTF8.There are no tools to help you do this.
4. Check row counts against those supplied by the depositor.
5. If necessary re-organise the data into a suitable relational format.
6. Build the required tables in Oracle.Note that Oracle does not have a Boolean (Yes/No) data type.
7. Transfer the data by whatever means you can deal with.
8. Build and test a ColdFusion interface.
9. Create those indexes necessary and tune queries to optimise the search performance.
.txt
Delimited text 1. Archive the tables as is (seeADS Data Procedures Binary and plain text documents)

2. Create a MS Access database and import each table.

Note that:
* Access examines the first twenty or so rows and assigns a data type according to what it finds, one of text, memo, long, or double. This sample is sometimes inadequate and, for example, a text field is assigned where a memo field is required. Simple mis-assignements like this can normally be picked up from the table of import errors generated by Access. The data must then be re-imported and the field types set manually.
* Access converts character sets behind the scenes. While this is normally not a problem if the data is in any way exotic unexpected character replacements may occur.
* Be aware that more subtle problems may occur and that it is possible for the imported data to be corrupt.

3. Check row counts against those supplied by the depositor.

4. If necessary re-organise the data into a suitable relational format.

5. Build the required tables in Oracle.

6. Transfer the data by whatever means you can deal with.

7. Build and test a ColdFusion interface.

8. Create those indexes necessary and tune queries to optimise the search performance.
.txt
Oracle dump file 1. Import into Oracle (I don’t know how).This must be performed by the SYS user.

2. Using SQLPlus create queries to export the data as delimited text.
* Telnet into the server
* start SQLPlus as the SYS user
* for each imported table
* use DESC table to examine the fields
* create a query in the form SELECT field1
' ' field2 ' ' … FROM table } including all fields
* run the query to check it is working – interrupt using Control+C if it appears to work correctly.
* issue the SPOOL filename command to write output to the required file
* run the query fully using ‘/’
* terminate output with SPOOL OFF
* copy the file to your machine
* use an ASCII text editor to remove leading and trailing non-data linesNote that there is the possibility of character set problems with UTF8 data.

3. Treat the resultant delimited text files as detailed above.

Storing the data in the AIP directory structure#

Any documentation for that database (such as a completed ADS database documentation template or a relationships diagram) should be stored in a subdirectory within the /delim/ directory called /documentation/. This documentation should be stored in formats that are suitable for preservation (for example, odt, tif).

See Ingest checklist for more comprehensive list of tasks.


ADS Data Procedures: Spreadsheets#

Purpose of this document.#

This is a working document to be used in the production of presentation and preservation copies of spreadsheets. The content of this document has been taken from the following:
  • Spreadsheets Preservation Manual (on the AHDS Wiki)

Significant properties#

  • The actual data within the spreadsheet (there may be several sheets of data) - including cell headings and the values themselves
  • Any figures/graphs in the spreadsheet
  • Where formatting/use of colour etc is seen as adding meaning to the data, this meaning becomes a significant property and must be preserved in some way

Note: we do not attempt to preserve the full look/feel and functionality of the spreadsheet. Formulas, formatting, colours etc are not normally preserved.

Accessioning checks#

  • Do we have necessary documentation? (any codes used within the data need to be documented)
  • Where spreadsheets contain multiple sheets, is it clear what each sheet represents? The sheet needs a meaningful name or a heading, or some separate documentation that describes its contents
  • Are all the sheets intended for preservation? Sometimes people deposit spreadsheets with sheets that are just random workings out and calculations and are not intended for preservation

Formats#

DeliveryPreservationPresentationDocumentation Required
Comma separated values .csv
(Preferred format)

Microsoft Excel .xls and .xlsx
(Preferred format)

OpenOffice.org 1.0 Calc .sxc

ODF Spreadsheet .ods

Lotus 1-2-3
version 97/9 .123
versions 4, 5 .wk4
version 3.x .wk3
version 2.x .wk1
version 1.0 .wks
educational .wke (problematic)

Quattro Pro - DOS .wq1

Quattro Pro for Windows
version 1/5 .wb1
version 6 .wb2
version 7/8 .wb3
version 9 .qpw
Comma separated values .csv

May be necessary to preserve as .xlsx or .ods where significant properties can not be preserved adequately in .csv format

Also ensure any images/charts within the spreadsheet are preserved in a suitable format (see raster images data procedures document)
Comma separated values .csv

May be necessary to disseminate as a pdf/a file where the significant properties can not be maintained in .csv

Figures/charts within the spreadsheet should be disseminated as raster images (see raster images data procedures document for details)
Software, version and platform.

Purpose and content of spreadsheet and individual worksheets.

Content of each column and row if not obvious from column and row headings.

Data type and scale for each column.

Key for any codes within data.

Column and row counts.

Documentation of any extra features the spreadsheet may contain, for example, formulae, macros, charts, comments and an indication of which of these features represent significant characteristics to be preserved.

Notes on formats#

[also see preservation manuals]

Microsoft Excel spreadsheets (.xls and more recently .xlsx) are the most commonly deposited formats and though the software is widely used, Excel spreadsheets are only suitable as a deposit format. For both preservation and presentation they normally need to be migrated to non-proprietary formats (.csv).

There are exceptions to this rule. When some spreadsheets are transferred to .csv, their significant properties are not preserved. This could be the case if there is complicated formatting (which conveys meaning), use of special characters/symbols which do not translate into .csv, or formula which are considered to be a significant property and can not be preserved in any other way. Where .csv is deemed to be an inadequate preservation format, an XML based file format can be used instead. This may be .xlsx or .ods - it is worth having a go at both of these as one may be better than another for preserving the significant properties in question. In one recent example, an Excel spreadsheet was saved as .ods but this did not preserve the values within the cells, just the formulas. Watch out for special characters too - they may translate differently in .xlsx and .ods. Where .xlsx or .ods are used as a preservation format, make sure that figures/charts within the spreadsheet are also preserved in a suitable format.

Note: the following info is out-of-date - not worth doing any research on how we deal with Quattro Pro or Lotus 123 using MS Office 2007 unless we actually receive any! //There is a copy of Quattro Pro 9 available on the network for installation. It is recommended that preservation of Quattro Pro files is carried out using this Quattro Pro. There can be problems importing these files into Excel. For example, graphs, embedded objects and macros in Quattro Pro for Windows are lost when the file is imported into Excel.

Alternatively, most Quattro Pro files can be imported into Microsoft Excel 2000. For Quattro Pro for Windows versions 1-5 the Quattro Pro Converter will need to be installed (this is not included in the default Microsoft Office 2000 installation).

To install the Quattro Pro Converter follow these instructions:

  • go to Add/Remove Programs in Control Panel
  • select 'Microsoft Office 2000 SR-1 Professional' and click 'Change'
  • add/remove features
  • click the plus next to the Microsoft Excel for Windows folder
  • click on the Quattro Pro 5.0 Converter and select 'Run from my computer'
  • Once this has installed you will need to reboot.

Lotus 1-2-3 is not a package that is available on the York University network. However, these files can be read into either MS Excel 2000 or Quattro Pro 9 where they can be converted into a more suitable preservation and presentation format. Not all features of a Lotus 1-2-3 spreadsheet will be available when it is viewed in one of these packages, for example, functions and formulas may be calculated in a different way or be unavailable. It is particularly important therefore that full documentation is supplied explaining any important additional features and functionality.

Open Office.org Calc (currently ODF Spreadsheet (.ods) but spreadsheets created in OpenOffice.org 1.0 have the extension .sxc) is freely available from http://www.openoffice.org/. It's native spreadsheet format is a compressed XML file, with the actual spreadsheet data being stored in a file called content.xml and images in a separate directory. Ods can be used as a format for preservation where csv is not able to adequately preserve all the significant properties of a file. It may be worth trying file conversions to xlsx too in order to see which is most effective. Note that Calc provides character set options when exporting to csv. This is a good option if you need to get from xls to csv with UTF-8 encoding.

Comma separated values (.csv) is the preferred format for dissemination of spreadsheet data. Cells are separated by a comma and enclosed in double quotes. It is a format that most people are familiar with, and has the added advantage that the files will open directly into MS Excel for those users who have this software. This makes it easier for people who are not so comfortable with importing delimited data into a spreadsheet application of their choice. This format is also suitable for preservation.

Tab delimited text (.tab) is another suitable preservation format for spreadsheet data but it is not routinely used by the ADS any more. There may be some data in our archive that has been preserved like this in the past. Spreadsheet data cells are separated by tabs and should not be enclosed by double or single quotes. A new row in the data is indicated by the new line character. This overcomes the problems sometimes found with .csv format as tab characters are not routinely used within spreadsheet data. Indeed, it is impossible to add a tab or a new line character to a cell in Microsoft Excel as these keystrokes are used for navigating around the spreadsheet.

Portable Document Format (.pdf) can occassionally be used to disseminate spreadsheet data. This should only be used where there is information in the spreadsheet which can not adequately be replicated in a csv file. Where possible, csv should be used so that users can more effectively re-use the data. It may be necessary to provide a pdf version alongside a csv version for complicated spreadsheets so that users can see what the spreadsheet looked like but also work with the data. Obviously, pdf/a should be used rather than standard pdf if we are creating pdfs for dissemination.

Future directions?#

Another spreadsheet application to watch out for is Gnumeric, http://www.gnome.org/projects/gnumeric/ part of the GNOME desktop. It is freely available and open source. Its native file format again is compressed XML.

Other formats#

If we receive formats other than those listed above we should contact the depositor and ask if they can supply the data in a format we support. If not need to inform them of our current practice. This is that we endeavour to transform the file(s) into an archive format if the software we have to hand can do this in a quick and automated fashion. If this is not the case we will archive the file(s) as is, but will be unable to migrate it to newer versions of that format.

How to transfer...#

  • All file conversions should be carried out on local versions of files.
  • The files in question should be copied from the server and a fastsum check carried out this file should then be compared with the checksum on the server.Before converting files

There are certain checks to be made before a conversion can be carried out. Certain significant characteristics of a spreadsheet may not be preserved in the recommended preservation format so need to be dealt with separately as described below.

ProblemSolution
Does the file contain non-textual features that need preservation?Graphs and charts will need to be saved in an appropriate image format alongside the data (see image manual for further guidance)
Does the layout and formatting of the text convey extra meaning? For example do headings span multiple rows or columns or is information conveyed through use of colour, borders or font?The spreadsheet data will need to be edited by hand before migration to ensure that meaning is not lost. For example, merged cells must be split and the text within them duplicated within each cell.
Does the file contain complex functions/formulae that need to be preserved in their own right?Migrated versions of the spreadsheet data will only preserve the actual values calculated by the functions, not the functions themselves. Complicated formulae may need to be preserved separately within a text file so the spreadsheet functionality can be recreated at a later date.
Does the file contain macros or Visual Basic scripts that need to be preserved?Macros can be preserved as plain text if necessary.
Do any of the cells contain textual notes or comments?Migration process described below will not save comments. Before migration, comments will need to be stored in a separate text file with a clear indication of which cell the comment relates to.
Does the spreadsheet contain multiple sheets where the sheets are linked by formulae?Migration strategies for complex spreadsheets such as these should be decided on a case by case basis.
Does the spreadsheet contain hidden cells?Information within hidden cells probably needs to be preserved too (check with depositor) so ensure your migration strategy takes account of this.
Does the spreadsheet contain symbols that can not be converted to UTF-8?Odf seems to do a good job of preserving special characters, but it is also worth trying xlsx.

File-naming#

Where possible files should retain the same name as the original (though the file extension may be different). Where multiple worksheets are being converted, filenames should reflect both the name of the original spreadsheet and the name of the worksheet the data came from, for example [spreadsheet name]-[worksheet name].txt. Where extra files are created to hold images, notes and formulae, these should also be named in logical way that makes it easy to trace exactly where they came from, for example [spreadsheet name]-[worksheet name]-[chart name].txt

Starting FormatProcedureEnd Format
.xls1. Open file in Microsoft Office Excel 2007 - check contents of worksheet against supplied documentation

2. Select Save As... and save the active sheet as 'CSV (comma delimited) (*.csv)'

3. Open the new file in a text pad application and check

4. Follow this procedure for each worksheet
.csv
.xls1. Open file in Microsoft Office Excel 2007 - check contents of worksheets against supplied documentation

2. Select Save As... and save the active sheet as 'Excel Workbook (*.xlsx)'

3. Check the file in Excel to ensure Significant Properties have been preserved. Look in particular at any special characters and formulas to ensure all is well

4. Open the new file in zip software and check contents. Remember if there are images within the spreadsheet they will need to be saved separately

.xlsx
.xls1. Open file in Microsoft Office Excel 2007 - check contents of worksheets against supplied documentation

2. Select Save As... and save the active sheet as 'Open Document Spreadsheet (*.ods)'

3. Check the file in Excel and Open Office to ensure Significant Properties have been preserved. Look in particular at any special characters and formulas to ensure all is well

4.Open the new file in zip software and check contents. Remember if there are images within the spreadsheet they will need to be saved separately

.ods
.xls1. Open file in Microsoft Office Excel 2007 - check contents of worksheets against supplied documentation

2. If you have Acrobat Pro installed there should be an 'Acrobat' tab in Excel. Under this tab, click 'Preferences'. Under 'Settings' ensure that the checkbox 'Create PDF/A-1a: 2005 compliant file' is selected.

3. Click 'Create PDF' and select appropriate option to save entire workbook, selection or sheet as pdf

4. Open the new file in Acrobat Reader and check contents (also check it is a pdf/a file).

.pdf (pdf/a)
Note: the following instructions are probably out-of-date and have been left in for reference only
Quattro Pro (all versions)1. Open file in Quattro Pro 9 and check contents against supplied documentation

2. Under File menu, select Save As... and save as 'ASCII text (tab delimited) (*.txt)' and 'CSV (comma delimited) (*.csv)'

3. Open the new file in a text pad application and check

4. Follow this procedure for each worksheet
.txt.csv
Lotus 1-2-3 (all versions)1. Open file in either MS Excel 2000 or Quattro Pro 9 and check contents against supplied documentation. Establish which features are likely to be lost in translation.

2. Under File menu, select Save As... and save as 'Text (tab delimited) (*.txt)' and 'CSV (comma delimited) (*.csv)'

3. Open the new file in a text pad application and check

4. Follow this procedure for each worksheet
.txt.csv
.sxc1. Open file in Open Office.org Calc 1.1.4

2. Under File menu, select Save As... and save as 'Text CSV (.csv; .txt)'

3. A dialog box will then allow you to select your delimiters. Select (,) as the field delimiter and (") as text delimiter (not possible to save with no text delimiter).

4. Convert the resulting .csv file to preservation format as described above
.csv

Maximum (best) requirements for a deposited archive#

We receive:
  • Comma separated values (.csv) files, tab delimited files (.txt), with appropriate documentation
  • Images/charts supplied separately in appropriate preservation format (see images handbook)
  • Documentation of any formulas used or comments that appear within the data
  • Metadata record specifying row count, column count, data types and scales etc.
We must:
  • Accession files
  • Create web interface
  • Create metadata record based on supplied metadata
  • Release resource

See Ingest checklist for more comprehensive list of tasks.

Minimum (worst) requirements for a deposited archive#

We receive:
  • Spreadsheet file in proprietary format containing charts, formulae, comments and macros
  • No documentation
  • No metadata
We must:
  • Thoroughly check the spreadsheet for additional features which may need to be preserved separately (embedded images, notes, formulae).
  • Make any necessary edits to data to allow for meaning conveyed through colour, layout and other formatting
  • Migrate to preservation format
  • Migrate to dissemination format
  • Preserve any additional features as images or plain text
  • Record all data conversions in Accessions database
  • Record ingest procedures with GMDP tool
  • Create metadata record for resource
  • Tasks as above and from the Ingest checklistTime estimate:
All time estimates depend on the size and contents of a file and are given as a multiple of the best option above. A simple spreadsheet consisting of straightforward data tables alone does not take a long time to migrate. However, if no documentation is supplied, time can be wasted checking for additional features or data corruptions.