Procedure for ETD Transfer, Conversion, and Load into CONTENTdm
Required for this procedure:
- Notices from Proquest that files are available.
- Filezilla.
- Proquest FTP login info.
- ETD Directory on hard drive with pdf and xml subdirectories.
- 7-Zip.
- Adobe Acrobat Standard. Modify the Acrobat settings: in Acrobat, go to Edit, then Preferences. Click "Documents" in the left-hand column. In the main part of the pop-up, under PDF/A View Mode, use the drop-down to select "Never."
- Computer configured to open XML files with WordPad (Right click an XML file and select "Open with" and then "Choose Program." Select WordPad, then click "Always use the selected program to open this kind of file.").
- Microsoft Excel with the Developer tab enabled and macros enabled (Left click on the windows symbol and select "Excel Options." On the popular tab, check "Show developer tab in the Ribbon." Go to the Trust Center Tab. Click "Trust Center Settings." Click "Enable all Macros.").
- Excel Template (attached here).
- CONTENTdm Project Client. Set up and configure as follows: If you haven't used this installation of ContentDM before, you'll need to enter the server URL, http://contentdm.ad.umbc.edu, and the username and password for it (same as ContentDM Admin). Then choose the collection UMBC Electronic Theses and Dissertations and give the project the same name. In Project Settings Manager: Set Metadata Templates to PDF, then click Edit. Fill in Sponsor: University of Maryland, Baltimore County (UMBC). Collection: UMBC Thesis and Dissertations. Rights Statement: This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu. Source: Change drop-down to file name. Proofread and click OK. In Metadata Field Types, select text. In Images & Thumbnails, select JPEG2000. In Processing, select DO NOT convert multiple page PDFs to compound objects. In Project Options, choose use HTTP..., Do not spellcheck full-text, and Show spelling errors. In Finding Collection, check both boxes.
When notified that files have been ftp'ed:
- When you receive notices from Proquest, first ensure that they were able to successfully transfer all of the files. If some transfers fail, they'll likely try again the next day. Wait until you likely have all the files for a semester before beginning work on a set. Be sure to keep track of what you have already loaded into ContentDM and what you haven't loaded. Enter the date range last done to keep track of where we are: -7/15/10.
- If you haven't already done so, give Lindsey copies of everything from your PDF and XML folders for the ContentDM backup. Then delete any old content from your ETD folder.
FTP and Unzip the files:
- Use Filezilla to FTP the new theses and dissertations from Proquest. Open Filezilla. Enter the Proquest FTP IP, 130.85.192.107, into the host field. Enter the username, proquest, into the name field. Enter the password into the password field. Enter port 990 into the port field. Push Enter. The last line of the top box on the screen should say "Directory Listing Successful" and the lower left-hand portion of the screen should be populated with files on the Proquest server. The left side of the screen shows your computer--find the ETD folder on your hard drive. Use the date to identify the new files that we need to obtain. Highlight all of the files we need by holding down the shift key while clicking the first and last files you want highlighted. Drag them to your ETD folder. The progress of the file transfer will show at the bottom of the screen. Wait while all files transfer (you can minimize and do something else).
- Verify that you have all of the files that have been sent by checking the number of files against the number of files the e-mail notices said were successfully downloaded. Add the number of successful downloads. Highlight all the ETD files, right click, and select properties. The number of files stated in the notices should match the total here.
- Use 7-Zip to extract the zip files. Open 7-Zip. The folder that your files are in should be selected in the bar across the top of the window. If not, use the drop-down arrow to find it. Once you are in the correct folder, all of your zip files should display in the window. Highlight all of the zip files by holding down the shift key while clicking the first and last files you want highlighted. Click extract. The destination for the extract opens to C:\ETD\ZIP*\. Delete the *\ so that the files all go into the ZIP folder. Click OK.
- Use Windows Explorer to sort the files by going to the "View" menu and selecting "arrange by file type." Select all of the files of a given type, and move them to the appropriate sub-folder: Highlight all of the PDF files by holding down the shift key while clicking the first and last files you want highlighted. Drag them to the PDF subfolder and drop them there. Highlight all of the XML files by holding down the shift key while clicking the first and last files you want highlighted. Drag them to the XML subfolder and drop them there.
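The unzip, sort, and count steps above can also be sketched as a short script. This is only an illustration, not part of the established procedure: the function name extract_and_sort is hypothetical, and it assumes the zip files sit directly in one folder that also holds the pdf and xml subfolders. Anything that unzips into a subfolder or a nested zip is left in place for the manual review described in the next section.

```python
import shutil
import zipfile
from pathlib import Path


def extract_and_sort(etd_dir: Path) -> dict:
    """Extract every zip in etd_dir, then move loose PDFs and XMLs into subfolders.

    Hypothetical helper: the folder layout (pdf/ and xml/ inside etd_dir)
    is an assumption based on the ETD directory described above.
    """
    pdf_dir = etd_dir / "pdf"
    xml_dir = etd_dir / "xml"
    pdf_dir.mkdir(exist_ok=True)
    xml_dir.mkdir(exist_ok=True)

    # Extract all archives into the same folder, as the 7-Zip step does.
    archives = sorted(etd_dir.glob("*.zip"))
    for archive in archives:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(etd_dir)

    # Move top-level files by extension; subfolders are left for manual review.
    counts = {"zips": len(archives), "pdf": 0, "xml": 0}
    for item in sorted(etd_dir.iterdir()):
        if not item.is_file():
            continue
        suffix = item.suffix.lower()
        if suffix == ".pdf":
            shutil.move(str(item), str(pdf_dir / item.name))
            counts["pdf"] += 1
        elif suffix == ".xml":
            shutil.move(str(item), str(xml_dir / item.name))
            counts["xml"] += 1
    return counts
```

The returned "zips" count can be compared against the total from the e-mail notices, and a mismatch between the "pdf" and "xml" counts is an early hint that a document is missing or embargoed.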
Prepare the files:
- Check for anything unexpectedly left over. ETD's with extra files will unzip into folders or into additional zip files. These will require some initial manipulation to prepare them. Take a look at the extra files and handle each case as appropriate as follows:
Approval sheets--These are forms that the adviser signs approving the thesis or dissertation. These are extraneous files that should simply be deleted. Move the pdf and xml files to your usual directories and process as usual.
Other data that is usually included in the main file and pdf appendices--Combine the extra files with the main file using Adobe Acrobat. Click "Combine" then click "Merge Files Into a Single PDF." Click "Add Files" and select the ones you want to combine. Put the files in the order they should be combined. Click "Combine Files" and overwrite the main PDF with the new one. Move the pdf and xml files to your usual directories and process as usual.
Non-pdf appendices and other files not meeting above criteria--Word documents, Excel Worksheets, and MPEG videos (convert other video formats using Avidemux) may be included. Each should contain a note stating what type(s) of non-pdf files are attached as follows:
Word 97-2003 Document
Excel 97-2003 Worksheet
Quicktime .mov Video
MPEG .mp4 Video
The exact phrase will make the files locatable for forward versioning as necessary. Additional file types should only be added after considering whether this is the best file type for presentation and archiving of the material, and a standard note should be added to the above list.
- Open each PDF and delete everything in front of the abstract using the Delete Pages command in the Document menu (Find the last page before the abstract and note the page number. Click "Document" then "Delete Pages." Input the page range you need to delete, then click "OK" and then "Yes."). Save the edited file with the same filename, overwriting the original file. Some files will be missing the thesis or dissertation. Handle these as follows:
Missing Documents
There are two reasons a thesis or dissertation may be missing. The document may be embargoed, or the document may not have been FTP'ed because it's a large file that couldn't be sent via the Proquest administration page, so it was sent to Proquest on disk. To determine which case this is, take a look at the metadata and the DISS_submission publishing_option tag. This is usually the first field in the metadata. In that tag, the embargo is indicated by a numeric embargo code:
"0" - No embargo
"1" - 6 month embargo
"2" - 1 year embargo
"3" - 2 year embargo
"4" - Until specified date
If the code is 0, we should have the file, and can obtain it by downloading from the ContentDM Administrator Resources & Guidelines page at http://www.etdadmin.com/cgi-bin/main/resources?siteId=75. Click on Dissertations & Theses @ University of Maryland, Baltimore County and search for the missing document. When you find it, download it and process as usual.
If the code is 1-4, the document is embargoed and we won't receive the document until the embargo period has passed.
At the end of the metadata file there is a DISS_sales_restriction tag, and the date in that tag indicates when the embargo will expire and when we should receive that file. Note the file name along with the date the embargo will expire in the embargo list at the end of this procedure so that we can ensure that we receive the file when the time comes. When you process the metadata for embargoed documents in Excel, insert a note into the metadata for the document stating: "At the author's request, this dissertation isn't being made available at this time." The metadata is then uploaded as usual along with the title page. The metadata will be revised to remove this note when we receive the full file.
For other problems with the files Proquest FTPs to us, ask Michelle to call Proquest technical support at 877-408-5027 to find a solution.
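The embargo lookup described above can be sketched as a small script that reads one metadata file. This is a rough illustration under assumptions: the tag names DISS_submission and DISS_sales_restriction come from this procedure, but the attribute names embargo_code and remove are assumptions about how the code and date are stored, so check them against a real XML file before relying on this. The function name embargo_status is hypothetical.

```python
import xml.etree.ElementTree as ET

# Code meanings as listed in this procedure.
EMBARGO_MEANINGS = {
    "0": "No embargo",
    "1": "6 month embargo",
    "2": "1 year embargo",
    "3": "2 year embargo",
    "4": "Until specified date",
}


def embargo_status(xml_text: str) -> dict:
    """Report the embargo code and release date found in one metadata file."""
    root = ET.fromstring(xml_text)
    # DISS_submission is usually the first (root) tag, but search just in case.
    if root.tag == "DISS_submission":
        submission = root
    else:
        submission = root.find(".//DISS_submission")
    # ASSUMPTION: the code is stored in an attribute named "embargo_code".
    code = submission.get("embargo_code") if submission is not None else None

    restriction = root.find(".//DISS_sales_restriction")
    # ASSUMPTION: the expiry date is stored in an attribute named "remove".
    release = restriction.get("remove") if restriction is not None else None

    return {
        "code": code,
        "meaning": EMBARGO_MEANINGS.get(code, "Unknown"),
        "embargoed": code in {"1", "2", "3", "4"},
        "release_date": release,
    }
```

A result with "embargoed" set to True and a release date gives the two values to record in the embargo list; a code of "0" means the file should be downloaded and processed as usual.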
Prepare the metadata:
- Open the Excel Metadata Template spreadsheet and ensure that you're on the ContentDM worksheet. Then run the macro "Delete_Everything" by using CTRL-X. IMPORTANT: This macro will delete everything on whatever worksheet you happen to be on, so be absolutely certain you're in the right place before running it.
- Go to the Proquest worksheet and import each XML file. To import, click "Developer" then "Import" then find and select the file. After each import, run the "Mover" macro by using CTRL-M to move the metadata record into the ContentDM worksheet.
- Ensure that you have a metadata record for each PDF before proceeding. If not, you missed something or did something twice and need to figure out what you did and fix it.
- Run the "ReFormatter" macro by using CTRL-R.
- Check all fields that require checking against the XML files. A metadata map showing how each ContentDM field corresponds to the Proquest XML data is attached. Correct or report any problems. Details:
- Check the advisor field to ensure there is only 1 advisor, as it's unclear whether 2 or more advisors will load correctly.
- Check the document type field to ensure it's thesis or dissertation. The values the macro converts are Ph.D., M.A., M.S., and M.F.A. The macro can be revised to account for other values.
- Separate the keywords with semicolons by changing commas in the keyword field to semicolons where appropriate. This may be most quickly achieved by finding and replacing the commas with semicolons, then manually changing the semicolons back to commas in the few places where you can readily identify coordinated keywords. This requires some judgment to determine where commas are separating keywords versus parts of the same keyword. For example, "University of Maryland, College Park" is likely one keyword and the comma should remain a comma, and "Frederick, Maryland" is likely one keyword and the comma should remain a comma (note that in other instances cities and states are separated and clearly not one keyword). However, more often than not, the commas are separating distinct keywords and need to be semicolons: Antisocial, Antisocial Personality Disorder, Development, Gender, Marital quality, Relationship. Some cases are ambiguous. For example, is "American Jews, History" two separate keywords or one? Since history is such a broad topic, and unlikely to be a valuable keyword by itself, it might be best to interpret this as one keyword and leave the comma. Also note that in some instances, whether the comma belongs may be determined by which reading best fits the topic based on the abstract, but this often doesn't work because subjects are often so technical that it's impossible to determine whether the comma belongs without in-depth knowledge of the field. In case of doubt, change the comma, making the words into separate keywords, as the knowledgeable searcher will be able to put the words together when searching. Note that the authors entering keywords are not skilled at this, some of the keywords aren't entered as library staff would enter them, and entries are inconsistent between authors. All the same, don't edit the keywords themselves; try to get the conversion of the commas correct without second-guessing or spending excessive time.
- Check the identifier to ensure that the algorithm that draws the identifier from the file name is working correctly. If the file name formatting changes, the algorithm may stop working.
- Ensure that the language is English.
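The keyword step above (replace every comma with a semicolon, then restore the commas that belong inside a single keyword) can be sketched as follows. This is a minimal illustration, not the template's macro: the function name split_keywords and the keep_together list are hypothetical, the "; " separator is an assumption, and deciding which phrases go on the list still takes the judgment described above.

```python
def split_keywords(raw: str, keep_together: tuple = ()) -> str:
    """Turn comma-separated keywords into semicolon-separated ones.

    Phrases in keep_together (for example a place name that contains a
    comma) keep their internal commas. Hypothetical helper.
    """
    placeholder = "\x00"  # temporary stand-in for commas we want to keep
    text = raw
    for phrase in keep_together:
        text = text.replace(phrase, phrase.replace(",", placeholder))
    # Every remaining comma separates two keywords.
    parts = [part.strip() for part in text.split(",") if part.strip()]
    return "; ".join(parts).replace(placeholder, ",")


# Example using keywords quoted in this procedure.
converted = split_keywords(
    "University of Maryland, College Park, Gender, Marital quality",
    keep_together=("University of Maryland, College Park",),
)
```

Keeping the protected phrases in an explicit list makes the judgment calls visible and repeatable from one batch to the next.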
Loading the files into your ContentDM client:
- Delete all extraneous information from the spreadsheet until it contains ContentDM field titles in the first row and the data, and nothing else.
- Save the Excel file as a .xlsm, or Macro-enabled workbook.
- Save the Excel file as a tab-delimited text file. When you do this, Excel will warn you that it doesn't support saving multiple-page worksheets as a text file and that only the active page will be saved. So long as you're on the page with the data, this is OK. Excel will also warn you that the format doesn't support all of the features, and this is OK too. Just be sure to save with a different file name to ensure that you still have a template to work with later.
- Close Excel and open the tab-delimited text file. Find and replace all " with nothing. Re-save the file. Note that if Excel is still open, the edited .txt file won't save.
- Open ContentDM. If you want to watch the upload progress, click "View Upload Manager." Click "Add Multiple Items." Select your text metadata file. Click next, select the folder with the PDFs, and click next. Keep clicking next until you get to the "Multiple Items--Map Metadata Fields" screen. The left and right columns should match except for the last field. If they don't, you did something wrong and need to go back and ensure that you have a .txt file with the column headings in the first line. Click "Next" again, then click "Add Items." Either it will begin uploading or you'll get an error. If you get an error, the contents of your txt metadata file don't match the contents of the folder with the pdf files. Go back and fix. If everything goes right, it will begin uploading, and the number of files being uploaded will match the number of files you were trying to upload. Then wait a very long time. If everything goes correctly, you'll see a message saying how many files were added. This should again match how many you were trying to add.
- If everything uploaded correctly, the project will open behind the add message. Close the message screen, and use the project screen to edit each file by double-clicking it. Once you have a file open, you can navigate to the next by using the "Save and Back" or "Save and Next" buttons, or by closing the edit screen and returning to the project tab (it will prompt you to save). In each ETD, change the thumbnail by clicking "Replace Thumbnail" and then click "Autogenerate." In each ETD, select a department from the department list.
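The quote-removal step and the record-count check in this section can be sketched as a short script to run before opening ContentDM. It is an illustration only: the function name clean_export is hypothetical, the cp1252 default encoding is an assumption about how Excel saved the file, and the count assumes one metadata record per line.

```python
from pathlib import Path


def clean_export(txt_path: Path, pdf_dir: Path, encoding: str = "cp1252") -> tuple:
    """Strip double quotes from the tab-delimited file and compare counts.

    Returns (number of metadata records, number of PDF files).
    Hypothetical helper; the encoding default is an assumption.
    """
    text = txt_path.read_text(encoding=encoding)
    cleaned = text.replace('"', "")  # same effect as find-and-replace with nothing
    txt_path.write_text(cleaned, encoding=encoding)

    # The first line holds the ContentDM field titles, so skip it when counting.
    records = [line for line in cleaned.splitlines()[1:] if line.strip()]
    pdf_count = sum(1 for p in pdf_dir.iterdir() if p.suffix.lower() == ".pdf")
    return len(records), pdf_count
```

If the two numbers differ, the metadata file and the PDF folder do not match, which is the condition that makes "Add Items" fail.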
ContentDM Load Errors:
System Error
One of the files being loaded is open in another program and needs to be closed.
Too many fields in delimited text file...
One or more characters in the metadata are creating an extra tab in the middle of a field. To find and fix the problem:
- Open the .txt file in Word.
- Convert the text to a table. Highlight it all, then go to the "Insert" tab and select "Table" then "Convert Text to Table."
- Increase the size of the page so that you can see what's going on. Go to "Page Layout" and select "Size" then "More Paper Sizes." Make the width and height both 22". When prompted, click "Ignore."
- Scroll down your file to find where the table mis-aligns. Likely a portion of the abstract immediately before the misalignment is partially in the identifier field, and could extend to other fields as well.
- Note the author or title of the problem record, and the text immediately surrounding the misalignment.
- Re-open the Excel metadata file and find the problems you noted. Delete and retype the characters immediately surrounding the erroneous tab, eliminating any problem characters such as smart quotes or colons.
- Return to step 2 in the "Loading" procedure.
- If the error occurs again, redo these steps until all of the problems preventing the file from loading have been found and fixed.
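As an alternative to scanning the table in Word by eye, the misaligned rows can be located by counting tabs. The sketch below is only an illustration (the function name find_misaligned_rows is hypothetical); it reports every row whose number of tab-separated fields differs from the header row, using line numbers as they appear in the .txt file.

```python
def find_misaligned_rows(lines: list) -> list:
    """Return (line_number, field_count) for rows that do not match the header.

    lines is the content of the tab-delimited file, one string per line,
    with the ContentDM field titles on the first line. Hypothetical helper.
    """
    expected = lines[0].rstrip("\r\n").count("\t") + 1
    problems = []
    for number, line in enumerate(lines[1:], start=2):
        row = line.rstrip("\r\n")
        if not row:
            continue  # ignore blank lines at the end of the file
        count = row.count("\t") + 1
        if count != expected:
            problems.append((number, count))
    return problems


# Example: the third line has an extra tab in the middle of a field.
sample = [
    "Title\tAuthor\tAbstract\n",
    "First thesis\tSmith\tA short abstract\n",
    "Second thesis\tJones\tPart one\tpart two\n",
]
bad_rows = find_misaligned_rows(sample)
```

Each reported line number points to a record to find and retype in the Excel metadata file.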
Upload for approval and approve:
- When done with an entire set, go back to ContentDM and select all files. Then click upload for approval. This will take a while.
- After files are uploaded for approval, go to ContentDM administration at http://contentdm.ad.umbc.edu/cgi-bin/admin/start.exe. Log in. Click on items, then change the collection to UMBC Theses and Dissertations. Click on approve. A list of all the files ready for approval will appear. You can select all, and approve all, and also edit or delete from here. If everything is in order, select all and approve them. Next, click on index in the menu bar, then click on index now. This will take a while.
- Put all of your files from the batch, as well as any that Michelle has handled, onto a flash drive, and take it to Lindsey. She'll add the new materials to the ContentDM back-drive.