...
Required for this procedure:
Notices from Proquest that files are available.
Filezilla.
Proquest FTP login info.
ETD Directory on hard drive with pdf and xml subdirectories.
7-Zip.
Adobe Acrobat Standard. Modify Acrobat settings: When you're in Acrobat, go to Edit, then Preferences. Click on "Documents" in the left-hand column. In the main part of the pop-up, under PDF/A View Mode, use the drop-down to select "Never."
Computer configured to open XML files with WordPad (Right click an XML file and select "Open with" and then "Choose Program." Select WordPad, then click "Always use the selected program to open this kind of file.").
Editix XML Editor.
XSL file for reformatting the XML files, ETDConversionForDspace.xsl (attached here).
Microsoft Excel with the Developer tab enabled and macros enabled (Left click on the windows symbol and select "Excel Options." On the popular tab, check "Show developer tab in the Ribbon." Go to the Trust Center Tab. Click "Trust Center Settings." Click "Enable all Macros.").
Excel Template, ETDtempDspace.xlsm, (attached here).
SAF Builder program (downloaded from GitHub and installed by LITS) and Java JDK, Git, and Maven. Oracle VM VirtualBox for running it on Linux, and a directory that can be accessed from both Linux and Windows. Instructions for installation here: STEPS_rev1.docx. Use the command git clone https://github.com/DSpace-Labs/SAFBuilder to install it.
Collection File program (attached here in a zip file; unzip it and put it in your ETD directory) and Python to run it. It can also be run on the staff eLumin desktop; this procedure includes that method.
For converting video files to mp4's: Avidemux.
When notified that files have been ftp'ed:
When you receive notices from Proquest, first ensure that they were able to successfully transfer all of the files. If some transfers fail, they'll likely try again the next day. Wait until you likely have all the files for a semester before beginning work on a set. Be sure to keep track of what you have already loaded into ContentDM and what you haven't loaded.
FTP and Unzip the files (about 100 at a time): Downloaded files through
...
After FTPing the files, be sure to delete them from the FTP server.
Use Filezilla to FTP the new theses and dissertations from Proquest. Open Filezilla. Enter the Proquest FTP IP, the username, and port. Push Enter. The last line of the top box on the screen should say "Directory Listing Successful" and the lower left-hand portion of the screen should be populated with files on the Proquest server. The left side of the screen shows your computer--find the ETD folder on your hard drive. Use the date to identify the new files that we need to obtain. Highlight all of the files we need by holding down the Shift key while clicking the first and last files you want highlighted. Drag them to your ETD folder. The progress of the file transfer will show on the bottom of the screen. Wait while all files transfer (you can minimize and do something else).
Verify that you have all of the files that were sent by checking the number of files against the number the e-mail notices said were successfully downloaded. Add up the numbers of successful downloads from the notices. Highlight all the ETD files, right click, and select Properties. The total from the notices should match the number of files shown here.
Use 7-Zip to extract the zip files. Open 7-Zip. The folder that your files are in should be selected in the bar across the top of the window. If not, use the drop-down arrow to find it. Once you are in the correct folder, all of your zip files should display in the window. Highlight all of the zip files by holding down the Shift key while clicking the first and last files you want highlighted. Click Extract. The destination for the extract opens to C:\ETD\ZIP*\. Delete the *\ so that the files all go into the ZIP folder. Click OK.
Use Windows Explorer to sort the files by going to the "View" menu and selecting "arrange by file type." Select all of the files of a given type, and move them to the appropriate sub-folder: Highlight all of the PDF files by holding down the Shift key while clicking the first and last files you want highlighted. Drag them to the PDF subfolder and drop them there (or alternately, copy and paste them). Highlight all of the XML files the same way. Drag them to the XML subfolder and drop them there (or alternately, copy and paste them).
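The sorting step above could also be scripted. The following is only a minimal Python sketch, not part of the established procedure: the function name sort_by_type and the subfolder names "pdf" and "xml" are assumptions for illustration, so adjust them to match your ETD directory.

```python
from pathlib import Path
import shutil


def sort_by_type(etd_dir, mapping=None):
    """Move files in etd_dir into subfolders chosen by file extension.

    Returns a dict of {subfolder name: number of files moved}.
    """
    # Assumed layout: "pdf" and "xml" subfolders directly under the ETD directory.
    mapping = mapping or {".pdf": "pdf", ".xml": "xml"}
    etd_dir = Path(etd_dir)
    moved = {name: 0 for name in mapping.values()}
    # Build the full listing first so newly created subfolders are not revisited.
    for item in sorted(etd_dir.iterdir()):
        if not item.is_file():
            continue  # leave folders (e.g. ETDs with extra files) for manual review
        target_name = mapping.get(item.suffix.lower())
        if target_name is None:
            continue  # leave zip files and anything unexpected in place
        target_dir = etd_dir / target_name
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(item), str(target_dir / item.name))
        moved[target_name] += 1
    return moved
```

Calling sort_by_type(r"C:\ETD") would move the extracted PDF and XML files and report how many of each were moved. Anything else, including folders created by ETDs with extra files, is left where it is for manual handling.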
Transform the Metadata
Combine the XML files into 1 File
DOS prompt:
Click the Windows Start button and type cmd in the box. Push Enter. A DOS window will open.
Change to the directory where you want the new file to go by entering cd followed by the path for the directory. For example, “CD C:\ETD” changes the directory to the ETD directory. To go up one level, enter "CD .." To go to the root directory, enter "cd /"
To copy the individual xml metadata files, use copy path *.xml newfilename. For example, if your xml files are in the ETD\xml\ directory, “copy c:\ETD\xml\*.xml combined.xml”.
Notepad:
Open the new file in Notepad. Copy the declaration <?xml version="1.0" encoding="iso-8859-1"?> from the beginning of the file. Use Find and Replace to remove every occurrence of it: paste <?xml version="1.0" encoding="iso-8859-1"?> into the Find box and replace it with nothing. Then put a single <?xml version="1.0" encoding="iso-8859-1"?> back at the beginning of the file, inserting a line break between it and the remainder of the XML.
Add <ETD> after the <?xml version="1.0" encoding="iso-8859-1"?> at the beginning, with a line break between it and the remainder of the XML.
At the end of the file, add a line break and </ETD>.
TEST: finding all paragraph marks and replacing them with nothing.
Save and close the file.
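The DOS and Notepad steps above (combine the files, strip the repeated XML declarations, and wrap everything in <ETD>) can be sketched as one Python function. This is only an illustrative alternative under stated assumptions: the function name combine_xml is hypothetical, and it assumes every source file begins with the iso-8859-1 declaration shown above.

```python
import re
from pathlib import Path

DECLARATION = '<?xml version="1.0" encoding="iso-8859-1"?>'
# Matches an XML declaration (optionally preceded by whitespace) at the start of a file.
DECL_PATTERN = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def combine_xml(xml_dir, output_path):
    """Concatenate every *.xml file in xml_dir under one <ETD> root element.

    Returns the number of files combined. Write the output somewhere
    outside xml_dir so that a rerun does not pick up the combined file.
    """
    parts = []
    for path in sorted(Path(xml_dir).glob("*.xml")):
        text = path.read_text(encoding="iso-8859-1")
        # Drop each file's own declaration; only one is allowed per document.
        parts.append(DECL_PATTERN.sub("", text, count=1).strip())
    combined = "\n".join([DECLARATION, "<ETD>", *parts, "</ETD>"]) + "\n"
    Path(output_path).write_text(combined, encoding="iso-8859-1")
    return len(parts)
```

For example, combine_xml(r"C:\ETD\xml", r"C:\ETD\combined.xml") would produce a file ready to open in Editix. The return value can be compared against the number of XML files you expect for the set.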
Reformat the XML File using Editix:
Open Editix.
Open the XSL file ETDConversionForDspace.xsl (go to File, Open, then change the file type to XSLT 2.0 document (*.xsl *.xslt)).
Go to XSLT/Xquery transform a document
In XML source find your XML file.
In result find the directory you want the new file to go in and type the name with the extension .xml
Click ok.
Prepare the metadata in Excel:
Open the Excel template ETDtempDspacewithMacrosMASTER.xlsm.
Save the file with a new name for the set you're working on.
Run the macro "Delete_Everything" by using CTRL-X. This will delete the content of sheet1 and any existing XML map. If there is no existing XML map, the macro will raise an error, which can be ignored.
Delete the sheet2 that has old content in it. Create a new worksheet and rename it sheet2 if it isn't named that already.
Return to sheet1, cell A1. Use Import on the Developer tab to import the file you created using Editix.
Press Ctrl-R to run the reformatter macro.
Go to the sheet2.
Change the header of the author column from dc.creator to dc.contributor.author
Separate the keywords by changing commas in the keyword field to || where appropriate.
Ensure that there are no spaces in file names. If there are, you'll need to change them in the spreadsheet, and also change the actual file name to match it.
Sort by the departments column. Check for any department that didn't fill in and any that aren't the correct department names, using the .csv in the Collection File Program to find the definitive version of the department name to use, and correct any that don't match it. Also check the dc.relation.ispartof column and correct it. Watch for a dash in "Marine-Estuarine Environmental Sciences" where it doesn't belong. Fix with find and replace, both in the department field and in the dc.relation.ispartof field with departmental collection names. Add these find-and-replaces to the macro so that they don't have to be done manually each time.
Ensure that all department names are in the Collection File .csv.
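The department check above can be partly automated once the sheet is available as a .csv. The sketch below is an illustration only: the function name and the column names passed to it are hypothetical, so substitute the real headers from your spreadsheet and from the Collection File .csv.

```python
import csv


def unknown_departments(metadata_csv, collection_csv, metadata_column, collection_column):
    """Return department names used in the metadata that the collection file lacks.

    An empty string in the result means at least one row had no department filled in.
    """
    with open(collection_csv, newline="", encoding="utf-8-sig") as handle:
        known = {
            (row.get(collection_column) or "").strip()
            for row in csv.DictReader(handle)
        }
    unknown = set()
    with open(metadata_csv, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            name = (row.get(metadata_column) or "").strip()
            if name not in known:
                unknown.add(name)
    return sorted(unknown)
```

An empty list means every department in the metadata matches an entry in the collection file exactly. Any names returned still need to be corrected by hand, or added to the macro's find-and-replace list.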
Prepare the files:
Check for anything unexpectedly left over. ETD's with extra files will unzip into folders or into additional zip files. These will require some initial manipulation to prepare them. Take a look at the extra files and handle each case as appropriate as follows:
Approval sheets--These are forms that the adviser signs approving the thesis or dissertation. These are extraneous files that should simply be deleted. Move the pdf and xml files to your usual directories and process as usual.
Other data that is usually included in the main file and pdf appendices--Combine the extra files with the main file using Adobe Acrobat. Move the files to your usual directories and process as usual.
Non-pdf appendices and other files not meeting above criteria--Convert them to the most appropriate file format given here: Non-Proprietary File Formats (if not already in one of these formats). If it can't be satisfactorily converted to one of those file formats, leave it in the format that it's in. Put these in a supplement folder in your pdf folder.
Open each PDF. If there is personal information such as phone numbers or addresses included in the CV, delete it. Otherwise, leave it in the document. Some submissions will be missing the thesis or dissertation. Handle these as follows:
Missing Documents
There are two reasons a thesis or dissertation may be missing. The document may be embargoed, or the document may not have been FTP'ed because it's a large file that couldn't be sent via the Proquest administration page, so it was sent to Proquest on disk. To determine which case applies, take a look at the metadata and the DISS_submission publishing_option tag. This is usually the first field in the metadata. In that tag, there is a numeric embargo code:
"0" - No embargo
"1" - 6 month embargo
"2" - 1 year embargo
"3" - 2 year embargo
"4" - Until specified date
If the code is 0, we should have the file, and can obtain it by downloading from the ContentDM Administrator Resources & Guidelines page at http://www.etdadmin.com/cgi-bin/main/resources?siteId=75. Click on Dissertations & Theses @ University of Maryland, Baltimore County and search for the missing document. When you find it, download it and process as usual.
If the code is 1-4, the document is embargoed and we won't receive the document until the embargo period has passed.
At the end of the metadata file there is a DISS_sales_restriction code, and the date in that tag indicates when the embargo will expire and when we should receive that file. Note the file name along with the date the embargo will expire in the embargo list at the end of this procedure so that we can ensure that we receive the file when the time comes. When you process the metadata for embargoed documents in Excel, insert a note into the metadata for the document stating: "At the author's request, this dissertation isn't being made available at this time." The metadata is then uploaded as usual along with the title page. The metadata will be revised to remove this note when we receive the full file.
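Looking up the embargo code for a missing document can be sketched in Python as below. This is an illustration under assumptions, not a description of the Proquest schema: it assumes the code is stored as an attribute named embargo_code on the DISS_submission element, so check one of your XML files and pass a different attribute name if yours differs. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Code meanings as listed in this procedure.
EMBARGO_CODES = {
    "0": "No embargo",
    "1": "6 month embargo",
    "2": "1 year embargo",
    "3": "2 year embargo",
    "4": "Until specified date",
}


def embargo_status(xml_path, attribute="embargo_code"):
    """Return (code, meaning) for the first DISS_submission element in the file."""
    root = ET.parse(xml_path).getroot()
    # The record may be the root itself, or nested (for example under <ETD>
    # in a combined file).
    if root.tag == "DISS_submission":
        element = root
    else:
        element = root.find(".//DISS_submission")
    if element is None:
        return None, "DISS_submission not found"
    code = element.get(attribute)  # None when the attribute is absent
    return code, EMBARGO_CODES.get(code, "Unknown code")
```

A result of ("0", "No embargo") means we should have the file and can download it as described above. Codes 1 through 4 mean the document is embargoed and belongs on the embargo list.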
For other problems with the files Proquest FTPs to us, ask Michelle to call Proquest technical support at 877-408-5027 or 800-889-3358 (or email tsupport@proquest.com or visit http://support.proquest.com/) to find a solution.
Adding Supplements to the metadata in Excel and Moving them to the PDF Directory
Rename supplement files to a simple name that makes sense. Add their file names to the spreadsheet files with || separating file names. Then move them to the main PDF directory (even if they're not PDF).
In the filename column, enter the names of any extra files to be loaded in the appropriate line. Separate each from the existing file with ||. In the dc.description column for these, add a note indicating that there's a supplement and its format, e.g., "Include 1 .jpeg3 supplement". Move the supplement from the supplement folder to the pdf folder after it's added to the metadata.
Adding Publication Forms to the metadata in Excel:
Licenses were last downloaded on: 9/5/24
Log in to https://www.etdadmin.com/main/home and download the licenses from the day after the last downloaded date to today. Add 3 new columns to your spreadsheet: Open Access, Limited Access, and License.
For each publication form:
Find its line on the spreadsheet (they are in alphabetical order, but if you don't see it, search for both the author and part of the title). If you can't find it on the spreadsheet, move it to the "not in this set" folder.
Check the title and remember the first couple of words.
Open the publication form. If the publication form file doesn't contain a publication form, or is blank, delete it.
Ensure that the publication form has the correct title. Remember if there's an embargo.
Do Save As...
Replace any blank spaces in the file name with an underscore. Then replace everything between the author's name and .pdf with Open, e.g.:
Dutrow-Daryl_Open.pdf
For limited access items, copy and paste everything in the Filename column into the Limited Access column as appropriate.
If there's an embargo still in effect, add the title, author, and the date the embargo expires to the embargo list.
Close the license and delete the file without the Open or Lim in it.
Log back in to https://www.etdadmin.com/main/home and search for any missing publication agreements and add any additional ones that you find.
Completing the Licenses
For all open access items, copy and paste the file name into the open access column. For all limited access items, copy and paste the file name into the limited access column. Copy and paste the file name for everything without a license into the limited access column.
Change the header on the filenames column, which should now be blank, to
Change the header of the open access column to filename__permissions:-r'Anonymous'__primary:true
Change the header of the limited access column to filename__permissions:-r'ScholarWorksUMBCIP'__primary:true
Change the header of the license column to filename__bundle:LICENSE__permissions:-r'Anonymous'
*It's important to note that this must include the final quote--ignore any examples that suggest that it's not needed.
In the dcterms.accessRights column, for all limited access items, fill in "Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission." For all open access items, fill in "Distribution Rights granted to UMBC by the author."
Copy and paste the dcterms.accessRights column over the filename column.
Check
Search for any blank spaces in license file names and fix them.
Make sure all licenses were saved to the pdf folder.
Check that all departments are in the collection builder file. Sort and scan.
Check the rights field labels to ensure they are dcterms.accessRights.
Check the author field label is dc.contributor.author.
Fix accessRights issues: Delete the extraneous access rights column. Look for and fix an extra space after dcterms.accessRights in the column with the standard note.
Delete all of the rows where extra data was filled in.
Check departments to ensure they're all in the correct form for the collection program.
Check that dates are in the year-mo-da format. After this step is done, do NOT open in Excel but import selecting "delimited" as type and "comma" as the delimiter. When you get to step 3, make sure ALL the columns with dates are set to TEXT.
Save your Excel file final version.
Save your sheet2 (you must be on it) as a .csv file. While on the "save as" screen, change the character encoding to UTF8 by using the tools drop-down, selecting web options, then encoding, and UTF8.
Note the dates in the Excel file. Close the csv in Excel, and open it with notepad. Use find and replace to change them to the YEAR-MO-DA format. Save and close.
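The find-and-replace on dates described above can be sketched in Python. This is a minimal illustration that assumes the dates appear in the csv as month/day/year with a four-digit year (for example 5/9/2019); any cell that doesn't match that pattern is left untouched. The function names are hypothetical.

```python
import csv
import re

# Assumed input style: month/day/year, e.g. 5/9/2019.
US_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})/(\d{4})$")


def to_iso(value):
    """Convert an M/D/YYYY string to YYYY-MM-DD; return anything else unchanged."""
    match = US_DATE.match(value.strip())
    if not match:
        return value
    month, day, year = match.groups()
    return f"{year}-{int(month):02d}-{int(day):02d}"


def convert_dates(in_csv, out_csv):
    """Copy in_csv to out_csv, rewriting every cell that looks like a date."""
    with open(in_csv, newline="", encoding="utf-8-sig") as src, \
            open(out_csv, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([to_iso(cell) for cell in row])
```

Because the output is written to a separate file, the original csv stays available for comparison. Open the result in Notepad rather than Excel to confirm the dates, since Excel may reformat them again.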
Run the SAF builder (documentation here: https://github.com/DSpace-Labs/SAFBuilder) :
Be sure the csv is closed in all programs.
Put the .csv metadata file and all of the files to be loaded in one directory within the ETD directory. (The directory should include all of the files that compose a work, including supplements, and a csv file with metadata in it. The directory must be in the directory mapped to Linux.)
Open Ubuntu.
Use the command ls to list all the files in the directory, and cd to change directories, to navigate to the directory with the safbuilder.sh file. Use cd .. to go up a level in the directory tree. Remember that directory names, file names, and commands are all case sensitive.
Run the safbuilder by typing "sudo ./safbuilder.sh -c etd/path to metadata file." For example, "sudo ./safbuilder.sh -c etd/Oct2019etds/PDFs/ETDtempDspace_Oct2019Load.csv" would run the safbuilder on that metadata .csv file and all of the files in the directory with it. Note that the etd in the path must be lower case even though it's upper case in Windows. You can use the up arrow to cycle through previous commands so that you don't have to retype. When you push Enter to run the command, you'll be prompted to enter your password.
The program will make a bunch of text appear in the terminal window. If it doesn't, the program didn't run. You probably made a typo when you typed the run command in. Try again, and be sure to type it all correctly. When the program successfully runs, it creates a SimpleArchiveFormat directory within the directory that you ran it on. The SimpleArchiveFormat directory contains numbered subdirectories: Item_000/, Item_001/, Item_002/, etc. Each of those subdirectories should contain a dublin_core.xml file, a contents file, and all the files that make up the work described in the metadata.
When it's run correctly, the last line in the terminal window should indicate that ETDtempDspace.csv has been used 0 times, and that should be the only line with a "File:" error. See below:
A SimpleArchiveFormat directory should appear in the folder that the files and the csv file are in. If there is more than the one "File" error, there is something wrong. See below:
These errors happen when the files in the folder and the filenames in the csv file don't match. Determine if there is a problem that needs to be corrected by comparing your .csv file to the contents of your directory. If necessary, make the corrections, then delete your SimpleArchiveFormat directory, and run the safbuilder again. If you can't fix the problems, or don't know what's causing them, ask Michelle for help. If she's not there, you can copy the errors to Word by pushing the PrtScn and Ctrl keys together to copy your screen to the clipboard, then pasting your screen into Word--if there are many errors, scroll through them until they are all pasted into Word. If other errors occur, it's usually because of a typo in the command/path. Try to run it again.
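Comparing the .csv file to the contents of the directory, as described above, can be sketched in Python. This is an illustration only and not part of the SAFBuilder itself: it assumes that every column whose header starts with "filename" holds file names, with multiple names separated by ||, as set up earlier in this procedure. The function name is hypothetical.

```python
import csv
from pathlib import Path


def compare_files(csv_path, directory=None):
    """Return (missing, unused) file name lists.

    missing: names listed in the csv that are not in the directory.
    unused:  files in the directory that the csv never mentions.
    """
    csv_path = Path(csv_path)
    directory = Path(directory) if directory else csv_path.parent
    listed = set()
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        for row in csv.DictReader(handle):
            for header, value in row.items():
                # Only columns such as filename, filename__bundle:LICENSE, etc.
                if header and header.startswith("filename") and value:
                    listed.update(n.strip() for n in value.split("||") if n.strip())
    on_disk = {
        p.name for p in directory.iterdir()
        if p.is_file() and p.name != csv_path.name
    }
    return sorted(listed - on_disk), sorted(on_disk - listed)
```

Both lists being empty suggests the csv and the directory agree. Names in the first list usually point to a typo or a leftover space in a file name, and names in the second list to a file that was never added to the metadata.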
Run the program to create the collection files (STEPS_rev1.docx) and Send the Load Request
Move the entire current SimpleArchiveFormat directory into your CollectionFilesProgram directory. Give the CollectionFilesProgram Directory a name with the date in it.
Upload the CollectionFilesProgram directory to Google Drive
In your browser, go to https://elum.in/umbc-facstaff and log in. All of the remaining steps are done on eLumin facstaff.
Log into myumbc on elumin and download your CollectionFilesProgram from Google Drive.
Navigate to the zip file in the downloads directory in File Explorer then unzip it using extract.
Open the DOS command prompt by typing cmd into Start.
Navigate to the unzipped CollectionFilesProgram directory in downloads.
Run the Collection File Program by typing "python safscript.py"
Look at the log, saf_log.txt, for any items skipped. If items have been skipped, fix them, or ask Michelle to fix them, then rerun the SAFBuilder and redo all steps after that. If no items have been skipped, zip the SAF directory (need instructions for this).
Rename the SimpleArchiveFormat Directory to indicate that it includes collection files and add the date.
Upload it to google drive.
Download it from Google Drive to the laptop, unzip it, and move it to the ETD directory.
Zip the Final SimpleArchiveFormat directory with collection files and send to UMCP for Load
Open Ubuntu and zip the final SimpleArchiveFormat directory by:
Navigating to the directory that contains the final SimpleArchiveFormat directory, using ls to view the contents of the directory you're in and cd to change directories.
Zip the SimpleArchiveFormat directory using the command zip -r myfilename.zip SimpleArchiveFormat/
Send the zipped SAF directory to MD-SOAR help, mdsoar-help@umd.edu, requesting that they load it.
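Where the zip command isn't available (for example on the eLumin desktop, where Python is already used to run the Collection File Program), the zipping step can be sketched with Python's standard library. This is a minimal illustration; the function name zip_saf and the archive name in the example are hypothetical.

```python
import shutil
from pathlib import Path


def zip_saf(saf_dir, zip_stem):
    """Zip saf_dir, keeping the top-level folder name inside the archive.

    zip_stem is the archive path without the .zip extension.
    Returns the path of the archive that was created.
    """
    saf_dir = Path(saf_dir)
    return shutil.make_archive(
        str(zip_stem),
        "zip",
        root_dir=str(saf_dir.parent),  # archive paths are relative to this folder
        base_dir=saf_dir.name,         # so entries start with SimpleArchiveFormat/
    )
```

For example, zip_saf("SimpleArchiveFormat", "myfilename") creates myfilename.zip with the same layout that zip -r myfilename.zip SimpleArchiveFormat/ produces, ready to send for loading.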