Large File Technology
When we receive large files through alternate transfer mechanisms, we first move them to a dedicated transfer server, which then syncs the files to Amazon S3.
Moving Large Files into DSpace
Bitstream replace method
This method is preferred, but it can be tedious for packages with many bitstreams. It is the best way to update items while they are still in the curation process. A Python script automates most of the process, but you need the bitstream ID and the Amazon S3 key of the file.
Perform these actions while logged in as curator on the production server.
- Locate the bitstream to be replaced in the curation system:
  - Run the script get_bitstreams_for_package.py from the server.
  - Enter the package ID when prompted.
  - It will return a list of all associated bitstreams and files.
  - Find the dummy file you want to replace and note its bitstream ID.
- Locate the large file in AWS S3:
  - Run the script s3_list_ftp_files.py from the server.
  - It will return a list of S3 keys to the files that have been uploaded to the FTP server.
  - Find the key to the replacement file.
- Run the s3_replace_largefile_bitstream.py script:
$ /home/dryad/scripts/s3_replace_largefile_bitstream.py
Enter the bitstream ID or URL: <PASTE BITSTREAM ID OF DUMMY FILE>
Bitstream ID: 123456
Enter the path on the filesystem to the large file: <PASTE S3 KEY OF REPLACEMENT FILE>
Using MIME type application/x-tar
Executing SQL: UPDATE bitstream set size_bytes=12345667, name='filename.tgz', source='filename.tgz' ,checksum='8e9b105a306649361b07c8fe55e1f496', bitstream_format_id=55 where bitstream_id = 123456
UPDATE 1
Copying '/home/transfer/filename.tgz' -> '/opt/dryad-data/assetstore/12/34/56/12345695953157516195038155553895222810'
$
The script verifies the checksums before and after copying the file. Once it finishes, check on the 'Item Bitstreams' page that the name and format have been updated, and that the file is accessible from the item page. If the MIME type cannot be determined from the file, the script will offer application/octet-stream and allow you to enter a MIME type manually. Any MIME type you enter must exist in the bitstreamformatregistry table.
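To make the values in the UPDATE statement less opaque, here is a minimal sketch of how the size, checksum, and MIME type of a replacement file could be computed. It is not the code of s3_replace_largefile_bitstream.py. The function names are hypothetical, and the use of MD5 is an assumption based on the 32-character checksum shown in the transcript.

```python
import hashlib
import mimetypes
import os


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.md5()  # assumption: the bitstream checksum is MD5
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def guess_mime_type(filename, fallback="application/octet-stream"):
    """Guess a MIME type from the file name; use the fallback when it is unknown."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or fallback


def describe_file(path):
    """Collect the kind of values the UPDATE statement in the transcript sets."""
    return {
        "name": os.path.basename(path),
        "size_bytes": os.path.getsize(path),
        "checksum": md5_of_file(path),
        "mime_type": guess_mime_type(path),
    }
```

Running describe_file on the file before the copy and again on the copy in the assetstore, then comparing the two checksum values, is one way to repeat the script's check by hand.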
The script does NOT remove the files from the transfer server, so delete them or move them to a different directory when the replacement has been completed. The S3 version will be automatically deleted.
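When checking that a replaced file actually landed in the assetstore, it helps to know how the destination path is built. The "Copying" line in the transcript above suggests that the first six digits of the bitstream's internal ID become three two-digit directories. The sketch below assumes that layout. The function name and its parameters are hypothetical, and the layout should be confirmed against the DSpace configuration before relying on it.

```python
import posixpath


def assetstore_path(assetstore_root, internal_id, levels=3, digits_per_level=2):
    """Build the on-disk path for a bitstream from its internal ID.

    Assumption: the layout is <root>/<d1d2>/<d3d4>/<d5d6>/<internal_id>,
    as the example transcript suggests.
    """
    prefix_length = levels * digits_per_level
    if len(internal_id) < prefix_length:
        raise ValueError("internal_id is too short for the directory layout")
    # Split the leading digits into one directory name per level.
    directories = [
        internal_id[start:start + digits_per_level]
        for start in range(0, prefix_length, digits_per_level)
    ]
    return posixpath.join(assetstore_root, *directories, internal_id)


print(assetstore_path("/opt/dryad-data/assetstore",
                      "12345695953157516195038155553895222810"))
```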
The alternative, exporting the item and importing a modified copy, is preferred when dealing with many bitstreams, but use it only for items that are already in the Dryad archive. Do not use it for items that are in the user's workspace or the curation workflow system.
- Locate the target data package
- Export the item from DSpace. NOTE: we don't use the package exporter, because its METS/AIP output formats are highly redundant and not suitable for human editing.
sudo /opt/dryad/bin/dspace export -t ITEM -i target_item_handle -d . -m -n 1
- Modify the item to contain the appropriate text/files.
- Import the new item into DSpace.
- The importer must be run from one directory above the target content.
- When using the sample content from Dryad's code repository, .svn directories must be removed first.
sudo /opt/dryad/bin/dspace import -a -s . -c 10255/2 -m map.out -e email@example.com
- This creates a new DSpace object, so remove the old DSpace item (using its handle).
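The .svn cleanup mentioned in the import step is easy to forget, so here is a small sketch that removes every .svn directory below an export directory before the importer is run. The function name is hypothetical and this is not part of Dryad's tooling. It deletes directories irreversibly, so point it only at a working copy you are about to import.

```python
import os
import shutil


def remove_svn_dirs(root):
    """Delete every .svn directory under root and return the paths removed."""
    removed = []
    for current, dirnames, _filenames in os.walk(root):
        if ".svn" in dirnames:
            target = os.path.join(current, ".svn")
            shutil.rmtree(target)
            # Drop it from dirnames so os.walk does not descend into the deleted directory.
            dirnames.remove(".svn")
            removed.append(target)
    return removed
```

For example, remove_svn_dirs(".") run from the directory passed to the importer with -s would clean the whole import tree.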
Current problems with the process:
- HIVE must be disabled, because it doesn't work well with the command-line tools
- DOI registration doesn't work
Technical Server Details
There is a standalone server running on AWS. The transfer user is set up to be accessible only via sftp and scp (locked with rssh). The transfer account's home directory only allows access to the user directories inside it, which are set up by the curators.
The transfer account's home directory is meant only for users to transfer in their data. A crontab script (transfer.sh) runs periodically to move completed uploads to a different directory, removing them from the transfer account's directory. A second crontab script (notify-transfer.sh) emails the curators when new uploads are complete.
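The page does not say how transfer.sh decides that an upload is complete. As an illustration only, the sketch below treats a file as complete once it has not been modified for a quiet period, then moves it out of the upload directory. The function name, the directory arguments, and the 15-minute default are all assumptions, and the real script is a shell script that may work differently.

```python
import os
import shutil
import time


def move_completed_uploads(incoming_dir, completed_dir, quiet_seconds=900, now=None):
    """Move files that have been idle for quiet_seconds out of incoming_dir.

    Assumption: "idle" (no recent modification) stands in for "upload complete".
    Returns the list of destination paths.
    """
    now = time.time() if now is None else now
    os.makedirs(completed_dir, exist_ok=True)
    moved = []
    for entry in sorted(os.listdir(incoming_dir)):
        source = os.path.join(incoming_dir, entry)
        if not os.path.isfile(source):
            continue  # this sketch handles plain files in a single directory only
        if now - os.path.getmtime(source) < quiet_seconds:
            continue  # modified recently, so the upload may still be in progress
        destination = os.path.join(completed_dir, entry)
        shutil.move(source, destination)
        moved.append(destination)
    return moved
```

A modification-time check is a heuristic: a stalled upload looks the same as a finished one, so a longer quiet period trades speed for safety.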