Large File Technology Improvements

The way we handle large uploads is somewhat clumsy.

We can accept deposits up to 1.5GB, but larger files must be transferred through an external system. We need to increase the transfer capabilities in the native Dryad system.

Goal
Ensure that the implementation and use of file transfer during Dryad data submission is in keeping with Dryad’s overall mission and objectives. Specifically, ensure that using it causes minimal difficulty for users, that its running costs do not exceed the revenue expected from large file surcharges, and that implementing it does not divert effort out of proportion to the share of users who need it.

Requirements
Absolute requirements:

 * Compatibility with users’ operating systems and browsers. The technology is compatible with operating systems and browsers that together cover at least 95% of Dryad’s users.
   * At present, this means recent versions of Chrome (36%), Firefox (32%), Safari (17%), and IE (12%) (user share in parentheses).
   * At present, this precludes Java Plugin-based technologies running within the browser.
 * Integrates with users’ web browsers. The technology is integrated with the user's web browser so that, at a minimum, installation of any necessary client software or browser plugin is triggered from within the browser. If a technology requires client software installation on a user’s machine, the installation must not require admin privileges.
 * Seamless and user-friendly 3rd party authentication. If a technology requires authenticating a user to a 3rd party, obtaining the 3rd party account as well as authenticating to it must be straightforward. Specifically, it must meet the following requirements:
   * Account creation and authentication can be completed as part of the "data upload session" started by the user (i.e., no delay of hours or more waiting for approval etc.).
   * Account creation must not require retyping of information that the user has already provided to Dryad.
 * Reliability of file transfer and robustness to variations in network availability and bandwidth. This includes the following:
   * A transfer must not be reported as complete unless it actually has completed and the integrity of the file has been verified.
   * Users must be able to restart a failed transfer without starting over at the beginning.
 * Minimum large size support. The transfer mechanism must support files up to at least 20 GB.
 * No 3rd party costs or charges. The technology must not incur an additional charge or cost to the user beyond the large file surcharge that Dryad levies already.
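The two reliability requirements (completion is reported only after integrity verification, and a failed transfer resumes rather than restarts) can be illustrated with a minimal in-memory sketch. It is not tied to any of the technologies evaluated on this page. The `Receiver` class, the `send` function, the `fail_after_chunks` parameter, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
import io

CHUNK_SIZE = 4  # bytes; deliberately tiny for the demo, a real transfer would use megabytes


def sha256_of(data: bytes) -> str:
    """Hex digest used as the integrity check (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class Receiver:
    """Hypothetical server side: keeps the bytes received so far."""

    def __init__(self) -> None:
        self.buffer = bytearray()

    def offset(self) -> int:
        # The sender asks for this before (re)starting so it can resume.
        return len(self.buffer)

    def append(self, offset: int, chunk: bytes) -> None:
        # Reject chunks that would leave a gap or overwrite received data.
        if offset != len(self.buffer):
            raise ValueError("chunk does not start at the current offset")
        self.buffer.extend(chunk)

    def is_complete(self, expected_size: int, expected_sha256: str) -> bool:
        # Report completion only when both size and checksum match.
        return (
            len(self.buffer) == expected_size
            and sha256_of(bytes(self.buffer)) == expected_sha256
        )


def send(source: io.BytesIO, receiver: Receiver, fail_after_chunks=None) -> None:
    """Send chunks starting at the receiver's current offset.

    fail_after_chunks simulates a dropped connection after that many chunks.
    """
    source.seek(receiver.offset())
    sent = 0
    while True:
        if fail_after_chunks is not None and sent >= fail_after_chunks:
            raise ConnectionError("simulated network failure")
        offset = source.tell()
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            return
        receiver.append(offset, chunk)
        sent += 1
```

In use, a first call such as `send(io.BytesIO(payload), receiver, fail_after_chunks=2)` raises `ConnectionError` and leaves `receiver.is_complete(...)` false. A second call to `send` with the same receiver continues from `receiver.offset()`, after which the size and checksum comparison succeeds.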

Preferences: (these are in addition to the absolute requirements)

 * The technology is seamlessly integrated with the user's web browser.
   * Selection of file, file transfer, and status monitoring all take place from within the browser.
   * No client software installation aside from browser plugins is necessary.
 * Seamless and user-friendly 3rd party authentication.
   * If and once Dryad supports federated ID and auth, the 3rd party technology requiring ID and auth should support that mechanism as well, allowing the user to use the same federated ID at both places.
   * 3rd party account creation should be programmable.
 * Reliability of file transfer and robustness to variations in network availability and bandwidth.
   * Ideally, resumption of a failed transfer from the point of failure is transparent to the user, for example by a server pulling the file from the client endpoint.
 * Directories with multiple files can be transferred.
 * The transfer mechanism can support files beyond 100 GB.
 * Information about file or package properties (in particular size) is obtainable by Dryad prior to transfer of the file. This would be used to present certain information to the user (such as projected large file surcharge, suitability for transfer technology, expected transfer time etc).
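As a rough illustration of the last preference, the sketch below shows what Dryad could compute once the file size is known before transfer. The 1.5 GB and 20 GB thresholds come from this page. The route labels, the decimal definition of a gigabyte, and the caller-supplied bandwidth figure are assumptions. The surcharge is left out because this page does not state the rate.

```python
from dataclasses import dataclass

GB = 10**9  # decimal gigabyte; an assumption, the page does not say which convention applies

NATIVE_LIMIT_BYTES = int(1.5 * GB)  # current native deposit limit stated on this page
REQUIRED_MAX_BYTES = 20 * GB        # minimum size the transfer mechanism must support


@dataclass
class TransferPlan:
    size_bytes: int
    route: str                # hypothetical labels: "native", "large-file", "needs-review"
    estimated_seconds: float  # size divided by the assumed sustained bandwidth


def plan_transfer(size_bytes: int, bandwidth_bits_per_second: float) -> TransferPlan:
    """Classify a file by size and estimate its transfer time before upload starts."""
    if size_bytes < 0:
        raise ValueError("size_bytes must not be negative")
    if bandwidth_bits_per_second <= 0:
        raise ValueError("bandwidth_bits_per_second must be positive")

    if size_bytes <= NATIVE_LIMIT_BYTES:
        route = "native"
    elif size_bytes <= REQUIRED_MAX_BYTES:
        route = "large-file"
    else:
        route = "needs-review"

    # 8 bits per byte; ignores protocol overhead and bandwidth fluctuation.
    estimated_seconds = size_bytes * 8 / bandwidth_bits_per_second
    return TransferPlan(size_bytes, route, estimated_seconds)
```

For example, `plan_transfer(10 * GB, 100e6)` classifies a 10 GB file as `"large-file"` and estimates 800 seconds on a sustained 100 Mbit/s link. Real transfers will be slower because of overhead and varying bandwidth.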

iPlant

 * iPlant's system is based on iRODS.
 * iPlant account creation
 * Users must log in with iPlant credentials -- but iPlant will soon support InCommon, which will be compatible with Dryad's eventual support for DataONE's Tier 2 services.
 * Managing Data Files and Folders
 * Storing and Accessing Your Data
 * Sharing Data Files and Folders

Discussion notes:
 * Call 2013-8-22

Globus

 * GlobusOnline
 * Notes from Lee Taylor's IDCC presentation and Open Repositories 2013 presentation:
 * Exeter leveraged Globus to provide large file transfer for DSpace
 * CILogon helped them get around the issues with separate DSpace & Globus login processes
 * Richard Jones helped them rework SWORD so it separated DSpace ingest from the file transfer process
 * Users must "create a new Globus endpoint" -- essentially this means installing a Globus client and telling DSpace where to find it.
 * Really big files (multi-terabyte) may take a week to transfer
 * It takes several seconds for DSpace to connect with Globus
 * Exeter will add the Globus support into DSpace 4.0.

Aspera
We received a demo of Aspera in the spring of 2016.

Pros:
 * It's fast and robust
 * Resumable uploads
 * Has a nice administrative dashboard
 * We can control the total bandwidth used by Aspera, as well as the bandwidth allowed for each client
 * Supports uploads directly into Amazon S3

Cons:
 * It requires users to install a plugin. (There is supposed to be an HTTP fallback, but the plugin is much better.)
 * It is relatively expensive (their prices are confidential; see (private) pricing sheet)

Resources:
 * (private) notes from Aspera demo
 * http://www-01.ibm.com/software/info/aspera/
 * https://developer.asperasoft.com/
 * http://asperasoft.com/software/transfer-servers/
 * http://cloud.asperasoft.com/aspera-on-demand/

HTML5 resumable upload for DSpace as implemented by DataShare

 * https://github.com/edina/DSpace/tree/xml-html5-upload
 * http://www.slideshare.net/paulineward/growing-open-data-making-the-sharing-of-xxlsized-research-data-files-online-a-reality-using-edinburgh-datashare
 * http://datablog.is.ed.ac.uk/2016/06/17/twentys-plenty-datashare-v2-1-upload-upgrade/

Other technologies that merit research

 * Java large file uploader
 * Jupload
 * Apache Commons file upload
 * aerofs
 * nginx
 * BitTorrent experimental tools, particularly Soshare and BitTorrent Sync
 * PaddleOver and the PaddleOver code
 * Akamai
 * iPlant/Cyberduck