Finding aid aggregation: Difference between revisions
Revision as of 14:47, 30 March 2010
Overview of Aggregation Process
EAD files for all partners are being hosted as part of the Mountain West Digital Library (MWDL) network. The workflow for this process is illustrated in the diagram to the left and discussed below. All LSTA partners should follow the workflow on the right of this diagram, building EAD collections that reside on CONTENTdm servers. The left of this diagram shows the workflow for NEH-grant-funded partners in the Western Waters Digital Library, whose EAD collections reside on other Open Archives Initiative (OAI)-compatible servers. Each LSTA partner institution controls its own EAD collection, using tools and scripts developed by the technology team at the University of Utah.
The EAD collections for the LSTA-funded project are available locally by institution, as well as searchable and browsable centrally in the MWDL statewide index. Each EAD file exists in only one place, namely, on the local repository. When a user searches or browses the central index and clicks on an item of interest in the search/browse results, he or she is taken to the EAD file on the local repository. Users can search locally as well, but a local search returns results only from the local institution. A link allows them to go to the central MWDL index at any time.
For more information about the process described below, please contact Sandra McIntyre (mailto:sandra.mcintyre@utah.edu) at the Mountain West Digital Library, or Debbie Rakhsha (mailto:debbie.rakhsha@utah.edu) at the University of Utah Marriott Library.
Configuring CONTENTdm for Your EAD Collection
If a collection in CONTENTdm has not already been created for your EAD files, ask your CONTENTdm server administrator to create it and give it a collection name, such as "Weber State University Encoded Archival Description Files", and a collection alias, such as "WSU_EAD".
In the CONTENTdm Administration interface, on the "Collections" tab, click "Fields" to go to the Fields properties page. Create CONTENTdm fields for the EAD information as suggested in this screen capture. Placing the CONTENTdm fields in this order will make the importing process easier, since the tab-delimited file will contain the extracted information in this order; the extracted information will automatically match up with your field order.
Notes
- The extraction script will assign these field names to the extracted information. You can rename any field by assigning a different name in the fields setup.
- You must keep the "DC map" column of Dublin Core mapping as shown here, so the MWDL central index will harvest your records in a standard way. This mapping determines the Dublin Core fields that the CONTENTdm fields map to for exposing the metadata via the Open Archives Initiative (OAI) protocols.
- Note that the Date field has the data type set to "Date", not "Text". This enforces proper date formatting and allows for accurate searches by date.
- Make sure the Description field is set to "Full Text Search". The other fields should be set to searchable (i.e., you should toggle "Search" to be "Yes") if you want to make it possible for users to search on them locally. None of the "Administrative fields" at the bottom of the Fields properties page will be visible in the interface, but you may wish to make some of them searchable, depending on your needs.
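To check what the Dublin Core mapping actually exposes, you can request your collection's records over OAI-PMH and inspect the result. The sketch below only builds the request URL. The verb and parameter names (`ListRecords`, `metadataPrefix`, `set`) come from the OAI-PMH protocol; the base URL is a hypothetical placeholder, and the use of the collection alias as the OAI set name is an assumption to confirm with your CONTENTdm administrator.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: ask your CONTENTdm administrator for the real OAI base URL.
BASE_URL = "http://cdm.example.edu/oai"


def list_records_url(collection_alias, base_url=BASE_URL):
    """Build an OAI-PMH ListRecords request for one collection's Dublin Core records."""
    query = urlencode({
        "verb": "ListRecords",        # OAI-PMH verb for harvesting records
        "metadataPrefix": "oai_dc",   # unqualified Dublin Core
        "set": collection_alias,      # assumption: set name equals the collection alias
    })
    return f"{base_url}?{query}"


print(list_records_url("WSU_EAD"))
```

Opening the printed URL in a browser should show each record with the Dublin Core elements that your "DC map" column assigns.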
Uploading Your EAD Files to Your Institution's Repository
EAD files are uploaded to a CONTENTdm digital asset management server that hosts your collections as part of the Mountain West Digital Library network. This involves two steps:
- extracting the values in certain EAD elements and mapping them to CONTENTdm fields
- using CONTENTdm's Acquisition Station to upload the EAD file and metadata to the Mountain West Digital Library hub server.
Both steps can be done on multiple EAD files at a time for batch processing.
1. Extracting the EAD Elements
An extraction script has been created to automate the first step. The extraction script has been written in JSE by Nathan Pugh at the University of Utah, based on a VBScript created by Terry Reese at Oregon State University. Nate has re-organized and tailored this script for the LSTA project partners, complete with a logger.
Summary of the Process
Use the extraction script when you have a batch of EAD files that are ready for uploading to the server. When the script is double-clicked, it acts on all files in the same folder as itself that have the extension ".xml". It automatically goes through each file and queries the values in certain EAD elements and saves them as CONTENTdm fields within a tab-delimited text file. This tab-delimited file can then be used to upload the extracted metadata to the CONTENTdm collection along with your EAD files in Step 2 below.
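For readers who want to see the shape of this process, here is a minimal sketch in Python. It is not the JSE script itself, and it runs on any platform. The two field paths are illustrative assumptions (the real script extracts 35 elements), and the sketch assumes EAD files without an XML namespace, UTF-8 output, and a header row of field names.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical subset of the mapping: CONTENTdm field name -> path below the <ead> root.
# See the Mapping Table for the elements the real script extracts.
FIELD_PATHS = {
    "Title": "archdesc/did/unittitle",
    "Repository": "archdesc/did/repository/corpname",
}


def extract_record(xml_path):
    """Return one row of field values for a single EAD file."""
    root = ET.parse(xml_path).getroot()  # assumes un-namespaced EAD
    row = []
    for path in FIELD_PATHS.values():
        node = root.find(path)
        text = "".join(node.itertext()) if node is not None else ""
        row.append(" ".join(text.split()))  # collapse whitespace, including tabs and newlines
    return row


def extract_folder(folder, out_name="ead-to-cdm-extraction.txt"):
    """Process every .xml file in `folder`; return (processed, failed) file names."""
    folder = Path(folder)
    processed, failed = [], []
    rows = [list(FIELD_PATHS)]  # header row of CONTENTdm field names (assumption)
    for xml_path in sorted(folder.glob("*.xml")):
        try:
            rows.append(extract_record(xml_path))
            processed.append(xml_path.name)
        except ET.ParseError:
            failed.append(xml_path.name)  # malformed XML: report it, keep going
    with open(folder / out_name, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write("\t".join(row) + "\n")
    return processed, failed
```

Calling `extract_folder("C:/ead-batch")` (a hypothetical folder) writes the tab-delimited file next to the EAD files and returns the names of the files that parsed and the files that did not.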
The 35 EAD elements chosen for extraction, the local CONTENTdm fields they correspond to, and the Dublin Core fields they correspond to are given in the EAD-CONTENTdm-Dublin Core Elements Assignments (Mapping Table). The script also converts the EAD-conformant date formats into CONTENTdm-conformant date formats, according to the conversion given in Date Formats for EAD Central Index Project.
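The full conversion rules are in Date Formats for EAD Central Index Project and are not reproduced here. As a rough illustration only, the sketch below handles the one case this page describes: an EAD range written with a slash ("1986/1989") becoming the hyphenated yyyy-yyyy form that CONTENTdm imports. The function name is hypothetical, and only whole years are handled.

```python
import re

# One four-digit year, optionally followed by "/" and a second four-digit year.
_NORMAL = re.compile(r"^(\d{4})(?:/(\d{4}))?$")


def normal_to_cdm(normal):
    """Convert an EAD `normal` value such as "1986/1989" to "1986-1989".

    Anything other than a year or a year/year range raises ValueError, so the
    offending EAD file can be corrected rather than silently imported.
    """
    match = _NORMAL.match(normal.strip())
    if match is None:
        raise ValueError(f"unsupported normal date: {normal!r}")
    start, end = match.groups()
    if end is None or end == start:
        return start
    if int(end) < int(start):
        raise ValueError(f"range ends before it starts: {normal!r}")
    return f"{start}-{end}"
```

Rejecting unrecognized values instead of guessing mirrors the advice in the Exceptions below: fix the date in the EAD file, then re-run the extraction.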
Directions
See the full procedure with screenshots, "Using CONTENTdm's Multiple File Import Feature to Import EAD Files", or follow the brief set of steps below.
- If you have not already done so, save the file ead-to-cdm-extraction_v2.jse.zip to a local drive. The script will work only on a Windows machine.
- Unzip the file. The unzipped file is called "ead-to-cdm-extraction_v2.jse". (The file was zipped to allow the transfer of the file over the Internet. Most systems will not allow an executable file to be transferred on its own.)
- Move or copy the extraction script file, ead-to-cdm-extraction_v2.jse, into the same directory as the EAD files you want to process. All files that you want to process must be in one folder. Remove from this folder any files that you don't want to process at this time. Please note that the EAD files in this folder must already be validated and ready to upload.
- Double-click the extraction script file and wait for it to process all the EAD files in the folder. Depending on the number and size of your EAD files, this may take anywhere from a fraction of a second to 30 or more seconds. An alert will appear when the processing is done, giving the number of files processed successfully. Click the "OK" button in the alert window. Note that a new file, called ead-to-cdm-extraction.txt, has been created in the same folder (or, if you have already run the script before, the existing file has been modified). This is the tab-delimited text file that you will use in Step 2 below. Also, a logging file called "_extraction_status.txt" is created, which should make it easier to determine which, if any, files are causing the batch extraction to fail.
Exceptions
- What if the extraction script produces an error?
- At least one of your EAD files is not coded the right way. Please review the UMA Best Practices Guidelines, particularly the guidelines regarding the elements to be extracted. The list of elements extracted is given in EAD-CONTENTdm-Dublin Core Elements Assignments (Mapping Table). Then revise your EAD files and try again.
- Note: One of the most common errors is formatting the date incorrectly, for example, using "1986-1989" instead of "1986/1989". The extraction script pulls the value of the "normal" attribute of the element ead/archdesc/did/unitdate, not the display text of the element. Refer to the UMA Best Practices Guidelines and to Date Formats for EAD Central Index Project for more information about acceptable date formats.
- What if we have encoded our EAD elements slightly differently than the extraction script queries? Or what if we want to assign different labels to our CONTENTdm fields?
- Your IT staff can change the queries in the extraction script as needed to reflect the details of your encoding and/or the structure of your CONTENTdm fields. If, for example, you have encoded the browse subject terms source as "umabroad" instead of "UMAbroad", you may want to change this query in the extraction script file. Or, if you want to rename the "Repository" field to "Holding Institution", you can change the field name accordingly. There are comments and instructions in the extraction script file for making modifications. Warning: When making changes, please continue to conform to the UMA Best Practices Guidelines. If the extraction script is changed to pull elements in non-standard ways, the resulting records may not show up in searches in the MWDL central index.
- I changed some values in my EAD file(s) after running the extraction script. Do I need to run the script again?
- Yes. The extraction script is not dynamically linked to the EAD files. Re-run the extraction script and proceed with uploading below.
2. Uploading EAD Files to CONTENTdm Server
CONTENTdm's Acquisition Station software is used to upload the EAD file and metadata to the Mountain West Digital Library hub server. This follows standard CONTENTdm procedures for importing and uploading multiple files. You can upload all the EAD files in a single directory at one time.
Directions
See the full procedure with screenshots, "Using CONTENTdm's Multiple File Import Feature to Import EAD Files", or follow the brief set of steps below.
You can also refer to the directions given in the CONTENTdm Help files in the page "Importing Multiple Files with Tab-Delimited Text" at http://contentdm.com/help4/acq-station/importing5a.html. (Do NOT use the directions in the page "Importing EAD Files"! The process there imports only a few fields, not the full list in the LSTA project.)
Use the tab-delimited text file you created above, ead-to-cdm-extraction.txt. If you have created the CONTENTdm fields in the order listed above, the mapping from imported field to collection field will be correct by default. You do not need to add any import actions using the Template Creator.
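Before importing, it can help to confirm that the columns in the tab-delimited file really are in the same order as your CONTENTdm fields. The sketch below assumes the first row of the file holds the field names and that the file is UTF-8; both are assumptions to check against your own output. The field list shown is hypothetical.

```python
# Hypothetical field order: replace with the order on your Fields properties page.
EXPECTED_FIELDS = ["Title", "Creator", "Date", "Description", "Repository"]


def header_mismatches(tab_file_path, expected=EXPECTED_FIELDS):
    """Compare the first row of a tab-delimited file with the expected field order.

    Returns a list of (position, expected_name, found_name), one per difference.
    """
    with open(tab_file_path, encoding="utf-8") as fh:  # encoding is an assumption
        found = fh.readline().rstrip("\r\n").split("\t")
    problems = []
    for i in range(max(len(expected), len(found))):
        want = expected[i] if i < len(expected) else None
        got = found[i] if i < len(found) else None
        if want != got:
            problems.append((i + 1, want, got))  # positions are 1-based
    return problems
```

An empty list means the default mapping should line up; any entries point to the column positions to fix in the Fields setup or in the script.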
- Important note re date ranges (for CONTENTdm 4.3 only -- this issue is resolved in CONTENTdm 5):
- After importing the records into the Project Spreadsheet, carefully examine the Date column. There is a bug in CONTENTdm, and some date ranges do not expand correctly and will need to be edited.
- CONTENTdm's import should have expanded most date ranges from the format yyyy-yyyy into the expanded list of years. For example, "1813-1815" will be expanded automatically into "1813; 1814; 1815". However, when CONTENTdm encounters a date range that crosses from one century into the next, it mangles it instead into an incorrect date with the format yyyy-01-dd. For example, "1898-1905" is converted to "1898-01-05".
- If you catch these errors before you have uploaded the records, by looking through the values in the Date column in the Project Spreadsheet, you can correct them fairly easily:
- Double-click the problem record in the Project Spreadsheet to open it in the Media Editor.
- Re-type the date range in the format yyyy-yyyy.
- Click the "Save" button. The date range should now be expanded to the list of years in the Project Spreadsheet.
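The check described above can also be reasoned about outside CONTENTdm. The sketch below is not part of CONTENTdm or the extraction script; it shows what a correct expansion looks like and flags values that match the mangled yyyy-01-dd pattern. A genuine January date would match the same pattern, so flagged values still need a human look.

```python
import re

_RANGE = re.compile(r"^(\d{4})-(\d{4})$")
_MANGLED = re.compile(r"^\d{4}-01-\d{2}$")


def expand_range(value):
    """Expand "1813-1815" to "1813; 1814; 1815"; return other values unchanged."""
    match = _RANGE.match(value)
    if match is None:
        return value
    start, end = int(match.group(1)), int(match.group(2))
    if end < start:
        return value  # not a usable range; leave it for manual review
    return "; ".join(str(year) for year in range(start, end + 1))


def suspicious_dates(values):
    """Return the Date values that look like the mangled yyyy-01-dd form."""
    return [v for v in values if _MANGLED.match(v)]


print(expand_range("1898-1905"))
print(suspicious_dates(["1813; 1814; 1815", "1898-01-05"]))
```

The first call shows the list of years the Project Spreadsheet should contain for "1898-1905"; the second picks out the value that needs re-typing.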
Exceptions
- As I view a record in the Project Spreadsheet before uploading, I can see an EAD encoding error. Should I correct the error in the record?
- Remember that any corrections in CONTENTdm do not change the error in the EAD file itself. We recommend you delete the record with the error from the Project Spreadsheet, correct the error in the EAD file using xEAD or other XML encoding tool, re-extract, and re-import.
- My import aborts and produces an error.
- At least one of the values in the tab-delimited file is not in the format that CONTENTdm requires. This is most often a date field. The extraction script takes the value of the "normal" attribute of the ead/archdesc/did/unitdate element in the EAD file and attempts to convert it to a date format that will be recognized by CONTENTdm. Refer to the UMA Best Practices Guidelines and to Date Formats for EAD Central Index Project for more information about acceptable date formats.
Viewing Your Individual EAD Files in CONTENTdm
Summary
The display of EAD files is within CONTENTdm's item viewer. As with all CONTENTdm collections, the CONTENTdm item viewer displays a header and footer that can be configured at the partner's choice, typically with the partner's logo and other branding related to the EAD collection.
Nathan Pugh has modified the CONTENTdm item viewer to bypass the usual display of metadata and instead to go directly to the display of the EAD file itself.
The display is done using an XSL transform (XSLT), which uses an XSL stylesheet (template) to transform the XML in the EAD file into XHTML for viewing in a browser. Nate has created a set of two stylesheets for the specific needs of the partners in this project, and together they transform the elements recommended by the Stylesheet Subcommittee convened by Dan Davis. The two stylesheets work in tandem; both are necessary. The first stylesheet transforms everything in the EAD file *except for* the container list. The first stylesheet calls the second one, which transforms *only* the container list. Some partners may wish to modify the second stylesheet to reflect their own organization of the container list.
Directions
Download the complete package of files, EAD_LSTA_package-rev13.zip, prepared by Nathan Pugh in 2008 and slightly edited by Sandra McIntyre in November 2009.
Read the README file, LSTA EAD stylesheet installation instructions.pdf, for complete instructions. You will need to copy some files to appropriate folders on your CONTENTdm server and change one of the PHP files. Optionally, you can change the XSL transform file or, more likely, the CSS stylesheet file, to suit your institution's particular needs.
Once your IT staff has configured CONTENTdm, you will be able to do the following:
- Browse your EAD collection in CONTENTdm by going into your Digital Collections page and selecting the EAD collection. You will see a results page listing the first set of your EAD files.
- Click any file in the results list to view it.
Exceptions
- My EAD file shows elements that seem to be misplaced or styled differently than I had expected.
- Please check your encoding against the UMA Best Practices Guidelines. To change an already uploaded EAD file, see the directions below for editing EAD files that have already been uploaded.
- I would like to change the formatting of my container list.
- Your IT staff can change the default container list stylesheet. The default stylesheet is designed to group hierarchically embedded container elements, <c01> through <c06>, and to display levels <c01> through <c04> in the Table of Contents at the left. Partners on this project and in the Northwest Digital Archives have created a variety of stylings for container lists. Contact Nathan Pugh (mailto:nathan.pugh@utah.edu) and he will point you to some resources if you are interested in pursuing alternative stylings.
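To make the container-list grouping concrete, the sketch below walks an EAD document and lists the <c01> through <c04> components with their depth, which is roughly the information a Table of Contents needs. It is written in Python for illustration and is not the project's XSL stylesheet; the function name is hypothetical, and un-namespaced EAD is assumed.

```python
import xml.etree.ElementTree as ET

TOC_LEVELS = ("c01", "c02", "c03", "c04")  # levels shown in the Table of Contents


def toc_entries(ead_xml):
    """Return (depth, title) pairs for <c01>..<c04> components, in document order."""
    root = ET.fromstring(ead_xml)  # assumes un-namespaced EAD
    entries = []

    def walk(node):
        for child in node:
            if child.tag in TOC_LEVELS:
                title = child.find("did/unittitle")
                text = "".join(title.itertext()) if title is not None else ""
                entries.append((int(child.tag[1:]), " ".join(text.split())))
            walk(child)  # deeper levels (<c05>, <c06>) are visited but not listed

    walk(root)
    return entries
```

A stylesheet change that adds or drops a level from the Table of Contents corresponds to editing `TOC_LEVELS` in this sketch.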
Updating EAD Files that have already been uploaded
See the full procedure with screenshots, "Updating an EAD File Using xEAD and CONTENTdm".
Searching and Browsing
Institutional Search and Browse: Once you have created your EAD collection in CONTENTdm and uploaded files to it, your users will be able to use standard CONTENTdm features to search and browse your EAD files. In addition, you may create special search and browse pages using CONTENTdm's Custom Query Report functions. For an example, see the University of Utah Marriott Library's search/browse page in the Special Collections section at http://www.lib.utah.edu (click "Search Finding Aids" on the Special Collections main page).
Central Search and Browse in MWDL [not yet available]: The metadata from all uploaded EAD files will be harvested periodically and aggregated into the Mountain West Digital Library at http://mwdl.org. Search and browse pages within MWDL's interface will allow users to discover finding aids from all partners, or from any selected subset of the partners. Sandra McIntyre and Nathan Pugh are creating interface mockups for both searching and browsing for consideration by the LSTA partners.