Finding aid aggregation

Revision as of 18:08, 2 September 2008

Overview of Aggregation Process

EAD files for all partners are being hosted as part of the Mountain West Digital Library (MWDL) network. The workflow for this process is illustrated in the diagram to the left and discussed below. All LSTA partners should follow the workflow on the right of this diagram, building EAD collections that reside on CONTENTdm servers. The left of this diagram shows the workflow for NEH-grant-funded partners in the Western Waters Digital Library, whose EAD collections reside on other Open Archives Initiative (OAI)-compatible servers. Each LSTA partner institution controls its own EAD collection, using tools and scripts developed by the technology team at the University of Utah.

The EAD collections for the LSTA-funded project are available locally by institution, and are also searchable and browsable centrally in the MWDL statewide index. Each EAD file exists in only one place, namely, on the local repository. When a user searches or browses the central index and clicks on an item of interest in the search/browse results, he or she is taken to the EAD file on the local repository. Users can search locally as well, but a local search returns results only from the local institution. A link allows them to go to the central MWDL index at any time.

For more information about the process described below, please contact Sandra McIntyre (mailto:sandra.mcintyre@utah.edu), Nathan Pugh (mailto:nathan.pugh@utah.edu), or Debbie Rakhsha (mailto:debbie.rakhsha@utah.edu) at the University of Utah Marriott Library.

Configuring CONTENTdm for Your EAD Collection

If a collection in CONTENTdm has not already been created for your EAD files, ask your CONTENTdm server administrator to create it and give it a collection name, such as "Weber State University Encoded Archival Description Files", and a collection alias, such as "WSU_EAD".

In the CONTENTdm Administration interface, on the "Collections" tab, click "Fields" to go to the Fields properties page. Create CONTENTdm fields for the EAD information as suggested in this screen capture. Placing the CONTENTdm fields in this order will make the importing process easier, since the tab-delimited file will contain the extracted information in this order; the extracted information will automatically match up with your field order.

Notes

The extraction script will assign these field names to the extracted information. You can rename any field by assigning a different name in the fields setup.
You must keep the "DC map" column of Dublin Core mapping as shown here, so the MWDL central index will harvest your records in a standard way. This mapping determines the Dublin Core fields that the CONTENTdm fields map to for exposing the metadata via the Open Archives Initiative (OAI) protocols.
Note that the Date field has the data type set to "Date", not "Text". This enforces proper date formatting and allows for accurate searches by date.

Make sure the Description field is set to "Full Text Search".
The other fields should be set to searchable (i.e., you should toggle "Search" to be "Yes") if you want to make it possible for users to search on them locally. None of the "Administrative fields" at the bottom of the Fields properties page will be visible in the interface, but you may wish to make some of them searchable, depending on your needs.
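
The Dublin Core mapping noted above matters because an OAI harvester asks the server for records in simple Dublin Core. As a rough illustration only, the following Python sketch builds such a request. The verb ListRecords and the metadata prefix oai_dc come from the OAI-PMH protocol; the base URL is a placeholder, and the use of the collection alias as the set name is an assumption, not something confirmed for the MWDL harvester.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec):
    """Build an OAI-PMH ListRecords request for simple Dublin Core records.

    base_url and set_spec are supplied by the caller; the values used in
    the example below are placeholders, not real MWDL addresses.
    """
    params = {
        "verb": "ListRecords",       # OAI-PMH verb for harvesting records
        "metadataPrefix": "oai_dc",  # simple Dublin Core
        "set": set_spec,             # assumed: one set per collection
    }
    return base_url + "?" + urlencode(params)


# Hypothetical example: placeholder server, alias taken from the text above.
print(list_records_url("http://example.org/oai", "WSU_EAD"))
```

Opening the resulting URL in a browser against your own server is one way to check which Dublin Core elements your field mapping actually exposes.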

Uploading Your EAD Files to Your Institution's Repository

EAD files are uploaded to a CONTENTdm digital asset management system server that hosts your collections as part of the Mountain West Digital Library network. This involves two steps:

extracting the values in certain EAD elements and mapping them to CONTENTdm fields
using CONTENTdm's Acquisition Station to upload the EAD file and metadata to the Mountain West Digital Library hub server.

Both steps can be done on multiple EAD files at a time for batch processing.

1. Extracting the EAD Elements

An extraction script has been created to automate the first step. The 35 EAD elements chosen for extraction, the local CONTENTdm fields they correspond to, and the Dublin Core fields they correspond to are given in the EAD-CONTENTdm-Dublin Core Elements Assignments (Mapping Table). The extraction script has been written in VBScript by Nathan Pugh at the University of Utah, based on a VBScript created by Terry Reese at Oregon State University. Nate has tailored this script for the LSTA project partners.

Summary of the Process

Use the extraction script when you have a batch of EAD files ready to upload to the server. When double-clicked, the script acts on every file with the extension ".xml" in the same folder as itself. It goes through each file, queries the values of certain EAD elements, and saves them as CONTENTdm fields in a tab-delimited text file. This tab-delimited file can then be used in Step 2 below to upload the extracted metadata to the CONTENTdm collection along with your EAD files.
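
For readers who want to see the shape of this process, here is a minimal sketch in Python. It is not the project's extraction script, which is written in VBScript and extracts 35 elements. The three element queries, the header row, and the "File Name" column are assumptions chosen for illustration, and the sketch assumes EAD files without an XML namespace.

```python
import csv
import glob
import os
import xml.etree.ElementTree as ET

# Hypothetical subset of the mapping: CONTENTdm field name -> EAD element query.
# The real script extracts 35 elements; these three are for illustration only.
FIELD_QUERIES = [
    ("Title", ".//titleproper"),
    ("Date", ".//unitdate"),
    ("Repository", ".//repository/corpname"),
]


def extract_fields(xml_path):
    """Return one row of field values for a single EAD file."""
    root = ET.parse(xml_path).getroot()
    row = []
    for _field, query in FIELD_QUERIES:
        node = root.find(query)  # first matching element, or None
        text = "".join(node.itertext()) if node is not None else ""
        # Collapse line breaks and runs of spaces so each value stays on one line.
        row.append(" ".join(text.split()))
    row.append(os.path.basename(xml_path))  # assumed column naming the source file
    return row


def extract_folder(folder, out_name="ead-to-cdm-extraction.txt"):
    """Process every .xml file in folder and write one tab-delimited file.

    Returns the number of EAD files processed.
    """
    paths = sorted(glob.glob(os.path.join(folder, "*.xml")))
    rows = [extract_fields(p) for p in paths]
    out_path = os.path.join(folder, out_name)
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        # Assumed header row of field names, in the same order as the values.
        writer.writerow([name for name, _ in FIELD_QUERIES] + ["File Name"])
        writer.writerows(rows)
    return len(rows)
```

Calling extract_folder on a folder of validated EAD files would write ead-to-cdm-extraction.txt into that folder, one row per file, with the columns in the order given in FIELD_QUERIES.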

Directions

If you have not already done so, save the file ead-to-cdm-extraction_v11.zip to a local drive. The script will work only on a Windows machine.
Unzip the file. The unzipped file is called "ead-to-cdm-extraction.vbs". (The file was zipped to allow the transfer of the file over the Internet. Most systems will not allow an executable file to be transferred on its own.)
Move or copy the extraction script file, ead-to-cdm-extraction.vbs, into the same directory as the EAD files you want to process. All files that you want to process must be in one folder. Remove from this folder any files that you don't want to process at this time. Please note that the EAD files in this folder must already be validated and ready to upload.
Double-click the extraction script file and wait for it to process all the EAD files in the folder. Depending on the number and size of your EAD files, this may take anywhere from a fraction of a second to 30 or more seconds. An alert will appear when the processing is done, giving the number of files processed successfully. Click the "OK" button in the alert window. Note that a new file, called ead-to-cdm-extraction.txt, has been created in the same folder (or, if you have already run the script before, the existing file has been modified). This is the tab-delimited text file that you will use in Step 2 below.

Exceptions

What if the extraction script produces an error?: At least one of your EAD files is not encoded the way the script expects. Review the UMA Best Practices Guidelines, particularly the guidelines for the elements to be extracted, which are listed in the EAD-CONTENTdm-Dublin Core Elements Assignments (Mapping Table). Then revise your EAD files and run the script again.
What if we have encoded our EAD elements slightly differently than the extraction script queries?: Your IT staff can change the queries in the extraction script as needed to reflect the details of your encoding and/or the structure of your CONTENTdm fields. If, for example, you have encoded the browse subject terms source as "umabroad" instead of "UMAbroad", you may want to change this query in the extraction script file. Or, if you want to rename the "Repository" field to "Holding Institution", you can change the field name accordingly. There are comments and instructions in the extraction script file for making modifications. Warning: When making changes, please continue to conform to the UMA Best Practices Guidelines. If the extraction script is changed to pull elements in non-standard ways, the resulting records may not show up in searches in the MWDL central index.
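
To illustrate the kind of change involved in the "umabroad" versus "UMAbroad" case, the Python sketch below matches the source value without regard to case. It is an illustration only, not a patch for the VBScript extraction script. It assumes that browse terms are encoded as subject elements carrying a source attribute, and the function name browse_subjects is hypothetical.

```python
import xml.etree.ElementTree as ET


def browse_subjects(xml_text, source="UMAbroad"):
    """Return subject terms whose source attribute matches, ignoring case."""
    root = ET.fromstring(xml_text)
    wanted = source.lower()
    return [
        # Join nested text and collapse whitespace into a single clean term.
        " ".join("".join(el.itertext()).split())
        for el in root.iter("subject")
        if el.get("source", "").lower() == wanted
    ]


# Hypothetical sample: two spellings of the same source, plus an unrelated one.
sample = (
    '<ead><controlaccess>'
    '<subject source="umabroad">Water rights</subject>'
    '<subject source="UMAbroad">Mining</subject>'
    '<subject source="lcsh">Irrigation</subject>'
    '</controlaccess></ead>'
)
print(browse_subjects(sample))
```

A tolerant match of this kind only helps with extraction. Correcting the encoding in the EAD files themselves remains the safer route, for the reasons given in the warning above.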

2. Uploading EAD Files to CONTENTdm Server

CONTENTdm's Acquisition Station software is used to upload the EAD file and metadata to the Mountain West Digital Library hub server. This follows standard CONTENTdm procedures for importing and uploading multiple files. You can upload all the EAD files in a single directory at one time.

Directions

Follow the directions given in the CONTENTdm Help files in the page "Importing Multiple Files with Tab-Delimited Text" at http://contentdm.com/help4/acq-station/importing5a.html. (Do NOT use the directions in the page "Importing EAD Files"! The process there imports only a few fields, not the full list in the LSTA project.)

Use the tab-delimited text file you created above, ead-to-cdm-extraction.txt. If you have created the CONTENTdm fields in the order listed above, the mapping from imported field to collection field will be correct by default. You do not need to add any import actions using the Template Creator.

Important note re date ranges:

After importing the records into the Project Spreadsheet, carefully examine the Date column. There is a bug in CONTENTdm, and some date ranges do not expand correctly and will need to be edited.
CONTENTdm's import should have expanded most date ranges from the format yyyy-yyyy into the expanded list of years. For example, "1813-1815" will be expanded automatically into "1813; 1814; 1815". However, when CONTENTdm encounters a date range that crosses a century boundary, such as "1898-1905", it mangles it instead into an incorrect date with the format yyyy-01-dd, such as "1898-01-05" for this example.
If you catch these errors after the Multiple Files import process but before you have uploaded the records, you can correct them fairly easily:
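
One possible aid, sketched in Python under stated assumptions, is to flag suspect values in the Date column and compute the intended expansion from the original range in the EAD file. Both function names are hypothetical, and nothing here is part of CONTENTdm. Because a value such as "1898-01-05" could also be a genuine date, the check only marks values for manual review.

```python
import re

YEAR_RANGE = re.compile(r"^(\d{4})-(\d{4})$")  # e.g. 1898-1905
MANGLED = re.compile(r"^\d{4}-01-\d{2}$")      # e.g. 1898-01-05


def expand_year_range(value):
    """Expand 'yyyy-yyyy' into 'yyyy; yyyy; ...'; return other values unchanged."""
    m = YEAR_RANGE.match(value.strip())
    if not m:
        return value
    start, end = int(m.group(1)), int(m.group(2))
    if end < start:
        return value  # not a forward range; leave it for a person to check
    return "; ".join(str(year) for year in range(start, end + 1))


def looks_mangled(value):
    """Flag values in the yyyy-01-dd shape so they can be reviewed by hand."""
    return bool(MANGLED.match(value.strip()))


print(expand_year_range("1898-1905"))
print(looks_mangled("1898-01-05"))
```

The expanded string can then be pasted into the Date cell for any record that was flagged, after confirming against the EAD file that the original value really was a range.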

Exceptions

As I view a record in the Project Spreadsheet before uploading, I can see an encoding error -- a misspelling or other minor error. Should I correct the error in the record?: Remember that corrections made in CONTENTdm do not fix the error in the EAD file itself. Recommended practice is to delete the record with the error from the Project Spreadsheet, correct the error in the EAD file using xEAD or another XML encoding tool, re-extract, and re-import.

My import aborts and produces an error.: At least one of the values in the tab-delimited file is not in the format that CONTENTdm requires. This is most likely to be a date field.

Viewing Your Individual EAD Files in CONTENTdm

Summary

EAD files are displayed within CONTENTdm's item viewer. As with all CONTENTdm collections, the item viewer displays a header and footer that each partner can configure, typically with the partner's logo and other branding related to the EAD collection. See a sample EAD file in the University of Utah's EAD test collection.

Nathan Pugh has modified the CONTENTdm item viewer to bypass the usual display of metadata and instead to go directly to the display of the EAD file itself. He will be posting the required code shortly.

The display is done using an XSL transform (XSLT), which uses an XSL stylesheet (template) to transform the XML in the EAD file into XHTML for viewing in a browser. Nate has created a stylesheet for the specific needs of the partners in this project. The stylesheet transforms the elements recommended by the Stylesheet Subcommittee convened by Dan Davis. A separate default stylesheet is being released to transform the container lists. Although the default container list stylesheet will transform most container lists, some partners may wish to modify this default styling to reflect their own organization of the collection. Nate will be sharing both stylesheets in early September.

Directions

Once your CONTENTdm staff member has configured the item viewer with Nathan Pugh's code, you will be able to do the following:

Browse your EAD collection in CONTENTdm by going to your Digital Collections page and selecting the EAD collection. You will see a results page listing the first 20 or so of your EAD files.
Click any file in the results list to view it.

Exceptions

My EAD file shows elements that seem to be misplaced or styled differently than I had expected.: Please check your encoding against the UMA Best Practices Guidelines. To change an already-uploaded EAD file, see these directions (to be written).
I would like to change the formatting of my container list.: Your IT staff can change the default container list stylesheet. The default stylesheet is designed to group hierarchically embedded container elements, <c01> through <c06>. Various partners on this project, as well as partners in the Northwest Digital Archives, have created a variety of stylings for container lists. If you are interested in pursuing this, contact Nathan Pugh (mailto:nathan.pugh@utah.edu) and he will point you to some resources.

Searching and Browsing

Institutional Search and Browse: Once you have created your EAD collection in CONTENTdm and uploaded files to it, your users will be able to use standard CONTENTdm features to search and browse your EAD files. In addition, you may create special search and browse pages using CONTENTdm's Custom Query Report functions.

Central Search and Browse in MWDL: The metadata from all uploaded EAD files will be harvested periodically and aggregated into the Mountain West Digital Library at http://mwdl.org. Search and browse pages within MWDL's interface will allow users to discover finding aids from all partners, or from any selected subset of the partners. Sandra McIntyre and Nathan Pugh are creating interface mockups for both searching and browsing for consideration by the LSTA partners.
