Format Tools Testing Project

Format Tools Testing Project - Fall 2015

Christine Mayo, Dryad Asst Curator

Purpose: To support section 5.1.1 of the Dryad Preservation Policy Draft, I will be testing three different tools for automatic detection and analysis of file formats: FIDO, DROID, and JHOVE[a].

Testing Procedure: Tools will be rated on five criteria:
 * Accuracy of identification: does it correctly identify the file formats? Are specialty scientific formats misidentified as more common types (e.g., vcf)?
 * Ease of use/complexity of tool: is this something that could be integrated into current workflows, or into Dryad’s architecture? A large amount of back-end effort may be worthwhile for a tool that significantly outperforms others.
 * Usefulness of information/integration with other tools: what is the format of the results? Are they machine-readable? Do they include an assessment of the file’s vulnerability, or are they readily integrable with another tool that does?
 * Migration options: will this tool reformat files? Does it link to or integrate with a tool that does?
 * Zip files: how does it handle zip files? Can it tell what’s inside a folder, before or after that folder is unzipped?

Testing will be carried out on Mac OS, potentially followed by a second round of testing (time permitting) on a PC, to ensure that transitions between operating systems do not affect the tools’ performance.

Updated project timeline and tasks:
 * Get each tool scripted & running on a VM copy of the dev server.
   * FIDO: Complete
   * DROID: Complete
   * JHOVE: Complete
   * XENA: Incomplete
 * Meet with Ryan to make the transition to running the tools across the entire dev server.
   * Meeting scheduled for 8/14.
 * Run each tool over all files on Dev.
   * Based on preliminary testing, the time required will vary by tool, from hours in some cases to days in others.
   * Ryan’s assessment is that some or all of the three tools running on the VM can be started on Dev as early as 8/14.
 * Parse and assess results.
   * This will require additional scripting to programmatically assess the results, which the tools provide as text files. This is a useful test in itself: if a tool produces results that aren’t amenable to scripted interaction, its utility is severely decreased.
 * Present recommendations for tool(s) to integrate into the live workflow.
   * To be completed by 9/1.

Results from Running Tools over the Development Server:

JHOVE:
 * Less than 1% of the files on Dev are corrupted. We can generate a list of these files if necessary, but that’s not currently a priority.
 * Two-thirds of the files on Dev were identified as invalid (i.e., the file did not match the profile JHOVE expected it to match).
   * The vast majority of these files (99%) were XML files, likely BEAST files that lacked the appropriate XML header.
   * Other formats that had instances of invalid files: .owl (3), .menu (2), no extension (3), .kml (1), .sdd (4), .trex (1), .html (6), .morphoj (1), .pdf (15), .tif (1), .txt (5), and .xsl (3). In some cases this constitutes a significant proportion or even all files of that type.
 * Performance based on the criteria outlined above:
   * Does it correctly identify file formats? Not really. It can only check formatting for 12 basic file types (some in the list above were checked against its idea of what a text file, or a general binary/proprietary file, should look like), and all file extension information had to be added to the output by script.
   * Ease of use/complexity: needs a developer assessment (ask Debra, add info).
   * Usefulness of information: Decent. Results are text based, but uniform and relatively easily extractable. Does not assess vulnerability or integrate with a tool that does. Most useful for easy detection of corrupted files. The signature recognition would be more useful for a repository that had fewer proprietary filetypes, but the types that it usefully recognizes make up less than half of our holdings.
   * Does it migrate files? No.
   * Can it look inside a zipped folder? No.

Overall assessment of JHOVE: Probably only worth adding if none of the other tools can detect corrupted files.
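As a rough illustration of the "parse and assess results" step for JHOVE, the sketch below tallies status values by file extension, which is also one way the extension information could be added by script. It is not the project's JHOVE_output_interpreter.py. The function name, the sample data, and the assumed layout of the text report (a "RepresentationInformation:" line giving the file path, followed by a "Status:" line) are assumptions for illustration and should be checked against real output.

```python
import os
from collections import Counter, defaultdict


def tally_jhove_status(report_text):
    """Count JHOVE Status values per file extension in a text report.

    Assumes each file record starts with a 'RepresentationInformation:' line
    and contains a 'Status:' line (an assumption about the report layout).
    """
    tallies = defaultdict(Counter)
    current_ext = None
    for raw_line in report_text.splitlines():
        line = raw_line.strip()
        if line.startswith("RepresentationInformation:"):
            # Start of a new file record; recover the extension from the path.
            path = line.split(":", 1)[1].strip()
            ext = os.path.splitext(path)[1].lower()
            current_ext = ext or "(no extension)"
        elif line.startswith("Status:") and current_ext is not None:
            status = line.split(":", 1)[1].strip()
            tallies[current_ext][status] += 1
            current_ext = None  # ignore any further Status lines for this file
    return tallies


# Invented sample, for illustration only (not real Dryad output).
SAMPLE_REPORT = """\
 RepresentationInformation: /data/1234/beast_run.xml
  Format: XML
  Status: Not well-formed
 RepresentationInformation: /data/1235/figure.tif
  Format: TIFF
  Status: Well-Formed and valid
 RepresentationInformation: /data/1236/README
  Format: bytestream
  Status: Well-Formed and valid
"""

if __name__ == "__main__":
    for ext, counts in sorted(tally_jhove_status(SAMPLE_REPORT).items()):
        print(ext, dict(counts))
```

Grouping by extension makes it easy to see when problem files account for most or all files of a given type.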

FIDO:
 * The process failed (was unable to adequately identify the file type) on just over 7% of files. The problematic files were very different from those that gave JHOVE problems: FIDO had no problem with any of the XML files that JHOVE tagged as invalid.
 * Nearly all file recognitions were based on signature. Fewer than 1% of files were recognized by status as a tiff or container (zip) file, which were the only other positive recognition types.
 * Performance based on the criteria outlined above:
   * Does it correctly identify file formats? Uncertain. FIDO does not provide information about why a file would be considered “KO” instead of “OK”, so it’s hard to be sure whether the KO files were ones that didn’t match the PRONOM signatures for their type, or whether something else was going wrong with the process. All file extension information had to be added to the output by script.
   * Ease of use/complexity: needs a developer assessment (ask Debra, add info).
   * Usefulness of information: Good. Results are text based, but uniform and relatively easily extractable. Does not assess vulnerability or integrate with a tool that does. Provides more useful information than JHOVE in that it does not seem to throw false positives based on files with incorrect internal metadata (XML headers, etc.).
   * Does it migrate files? No.
   * Can it look inside a zipped folder? No.

Overall assessment of FIDO: Seems more accurate than JHOVE, or at least the information is more pertinent. We would have to look at some of the failed files and work out why they failed before we could tell whether that information is useful. The lack of explanation for *why* identification failed is a major drawback.
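Since FIDO gives no reason for a KO, one practical first step is to group the KO files by extension and see which types fail. The sketch below is not the project's FIDO_interpreter.py. It assumes FIDO's comma-separated output has nine columns, with the OK/KO status first, the file path seventh, and the match type last. That column layout, the function name, and the sample rows are assumptions to verify against the actual output files.

```python
import csv
import io
import os
from collections import Counter


def summarize_fido(csv_text):
    """Tally FIDO results: OK/KO counts, match types, and extensions of KO files.

    Assumed columns (to be verified): status, time, puid, format name,
    signature name, file size, file path, MIME type, match type.
    """
    status_counts = Counter()
    matchtype_counts = Counter()
    ko_extensions = Counter()
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 9:
            continue  # skip blank or unexpected lines rather than guess
        status, filename, matchtype = row[0], row[6], row[8]
        status_counts[status] += 1
        if status == "OK":
            matchtype_counts[matchtype] += 1
        else:
            # Add the extension information that the tool output lacks.
            ext = os.path.splitext(filename)[1].lower()
            ko_extensions[ext or "(no extension)"] += 1
    return status_counts, matchtype_counts, ko_extensions


# Invented sample rows, for illustration only (not real Dryad output).
SAMPLE_FIDO = (
    'OK,12,fmt/101,"Extensible Markup Language","XML 1.0",2048,'
    '"/data/1234/beast_run.xml","application/xml","signature"\n'
    'OK,8,x-fmt/263,"ZIP Format","ZIP",4096,'
    '"/data/1235/archive.zip","application/zip","container"\n'
    'KO,5,,,,512,"/data/1236/matrix.trex",,"fail"\n'
)

if __name__ == "__main__":
    statuses, matchtypes, ko_exts = summarize_fido(SAMPLE_FIDO)
    print(dict(statuses), dict(matchtypes), dict(ko_exts))
```

Note that this counts rows, not files, so if the tool emits more than one row for a file the totals would need de-duplicating by path.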

Current status of project:

This project is on hold as of 10/21/2015 due to lack of resources. FIDO and JHOVE have been tested. Recommendations for picking this project back up in the future include:
 * Please continue to follow the testing criteria outlined above, so that work does not need to be replicated on JHOVE and FIDO.
 * Testing priorities should include DROID and FITS.
 * Do not bother testing XENA; look for other migration tools. XENA is not widely used or supported, it has not developed the open source community necessary for extended future support and development, and it does not actually have any native migration capabilities: it relies on other programs such as ImageMagick and OpenOffice to carry out migrations.[b]

[a] At iPRES 2015 I was alerted to http://www.scape-project.eu/, which "finished last year but did quite a bit about file characterisation and also built a tool called c3po which visualises FITS output" -- recommended by Catherine Jones

[b] That's not surprising, nor is it a negative as long as the dependencies are for well supported open tools, as these are.

Analysis scripts and outputs (available on GitHub):

JHOVE (http://jhove.openpreservation.org/getting-started/)

RunJHOVE.py: This program queries the Dryad database to get a list of internal_ids for data files currently in the Dryad repository. It loops through the data files and for each file runs JHOVE, a file validation program that reports whether a file is "well-formed" and "valid."

This program is designed to be run on the Dryad server. It creates output files with the results for each file processed. To run the program, JHOVE must be installed on the server. Once JHOVE is installed, update the paths to match the location of the JHOVE executable in runjhovecommand and the variable with the location of the output files (loc_seq). Once the variables are updated, go to the command line of the server and type the following: python RunJHOVE.py

JHOVE_output_interpreter.py: This script reads in and parses JHOVE outputs from RunJHOVE.py.

To run from the command line, type the following: python JHOVE_output_interpreter.py

FIDO (http://openpreservation.org/technology/products/fido/)

RunFIDO.py: This program queries the Dryad database to get a list of internal_ids for data files currently in the Dryad repository. It loops through the data files and for each file runs FIDO, a file format identification program that matches each file against PRONOM signatures and reports the result as "OK" or "KO".

This program is designed to be run on the Dryad server. To run the program, FIDO must be installed on the server. Once FIDO is installed, update the paths to match the location of the FIDO executable in sys.path.append and int_id_path_regexpand and the location of the output files (f). Once the variables are updated, go to the command line of the server and type the following: python RunFIDO.py

FIDO_interpreter.py: This script reads in and parses FIDO outputs from RunFIDO.py.

To run from the command line, type the following: python FIDO_interpreter.py
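The run scripts themselves are not reproduced here. As a minimal sketch of the loop they describe (one tool invocation per data file, one output file per internal_id), something like the following could work. It is not the actual RunJHOVE.py: the function names, the output file naming, and the idea of passing in a ready-made list of (internal_id, path) pairs instead of querying the Dryad database are all hypothetical. The only JHOVE option used is -o, which names the output file. Treating a non-zero exit code as a failed run is also an assumption.

```python
import os
import subprocess


def jhove_command(jhove_path, input_path, output_path):
    """Build a JHOVE command line that writes its report to output_path."""
    return [jhove_path, "-o", output_path, input_path]


def run_over_files(files, output_dir, jhove_path, runner=subprocess.run):
    """Run JHOVE once per (internal_id, path) pair.

    Returns the internal_ids whose run exited with a non-zero code.
    `runner` is injectable so the loop can be tested without JHOVE installed.
    """
    os.makedirs(output_dir, exist_ok=True)
    failed_ids = []
    for internal_id, path in files:
        # Hypothetical naming scheme: one report file per internal_id.
        report = os.path.join(output_dir, "jhove_%s.txt" % internal_id)
        result = runner(jhove_command(jhove_path, path, report))
        if result.returncode != 0:
            failed_ids.append(internal_id)
    return failed_ids
```

Keeping the command construction in its own small function means the same loop could be reused for another tool by swapping in a different command builder.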