Curator Report and Query Technology

Methods for generating various reports that are needed for Dryad presentations and documents.

DSpace Curation Reports
Many of the reports needed by curators can be run through the DSpace curation tasks system.

Summary Reports
Distribution of bitstream formats across all items in the Data Files collection: /opt/dryad/bin/dspace curate -v -t profileformats -i 10255/2 -r - >~/temp/profileFormats.txt

Check that data files/packages have all of the required metadata: /opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/2 -r - >~/temp/requiredFileMetadata.txt /opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/3 -r - >~/temp/requiredFileMetadata.txt

Identifiers

 * datacitechecker
 * datacitesynchronizer

Data Files
Aggregate statistics about data files, including embargo status: /opt/dryad/bin/dspace curate -v -t filestats -i 10255/2 -r - >~/temp/fileStats.txt

Locate embargoed files that do not yet have a designated end date for the embargo:
 * embargoedwithoutpubdate

Complete statistics about individual data files. This includes: handle, fileDOI, journal, journalAllowsEmbargo, journalAllowsReview, fileSize, embargoType, embargoDate, downloads, views, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datafilestats -i 10255/2 -r - >~/temp/dataFileStats.txt

Detailed download statistics for a particular file, identified by its itemID (in this example 40063): http://datadryad.org/solr/statistics/select/?indent=on&rows=10000000&fl=continent,countryCode,city,time,userAgent&q=isBot:false+owningItem:40063

Data Packages
Aggregate statistics about data packages, including counts of integrated submissions: /opt/dryad/bin/dspace curate -v -t packagestats -i 10255/3 -r - >~/temp/packageStats.txt

Simple/quick statistics about individual data packages. This includes: handle, doi, journal, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datapackageinfo -i 10255/3 -r - >~/temp/packageInfo.txt

Statistics in a format that is compatible with DASH: /opt/dryad/bin/dspace curate -v -t dashstats -i 10255/dryad.20 -r - >dashStats.txt

Complete statistics about individual data packages. This includes: handle, doi, journal, numKeywords, numKeywordsJournal, numberOfFiles, packageSize, embargoType, numberOfDownloads, manuscriptNum, numReadmes, wentThroughReview, dateAccessioned. /opt/dryad/bin/dspace curate -v -t datapackagestats -i 10255/3 -r - >~/tmp/dataPackageStats.txt

The report can be processed using grep to find the lines for the CSV file by typing the following from within the /tmp directory: grep '^10255/dryad.' dataPackageStats.txt > dataPackageStats.csv

Journals
Dates for the first archived item associated with each journal in the repository: /opt/dryad/bin/dspace curate -v -t journaldates -i 10255/3 -r - >~/temp/journalDates.txt

Number of data packages for each journal in the repository: /opt/dryad/bin/dspace curate -v -t datapackagesperjournal -i 10255/3 -r - >~/temp/journalPackages.txt

List of all journals associated with archived items: \copy (select distinct text_value from metadatavalue where metadata_field_id =77 and item_id IN (select item_id from item where in_archive=TRUE and item_id IN (select item_id from collection2item where collection_id=2))) to '/home/dryad/currentJournals.txt' with CSV;

List of all journals associated with items in publication blackout: \copy (select text_value from metadatavalue where metadata_field_id =77 and item_id in (select item_id from metadatavalue where text_value like '%Entered publication blackout%' and item_id not in (select item_id from metadatavalue where text_value like '%Approved for entry into archive%'))) to '/home/dryad/allJournalsBlackout.txt' with csv;

Items in curation
Items in the review workflow stage: /opt/dryad/bin/dspace curate -v -t itemsinreview -i 10255/3 -r - >~/temp/itemsInReview.csv

Graphing Dryad submissions over time
First, get all of the deposit dates. On the production server:

postgres-client.sh -c "select text_value from metadatavalue where metadata_field_id=12 and item_id in (select item_id from item where owning_collection=2 and in_archive='t');" > dryadSubmitDates.txt

(collection 2 is "Data Packages" and metadata field 12 is "Date Available")

Edit the file to remove the timestamps (leave only the date portion).

Sort the dates.

Import the dryadSubmitDates.txt into Excel.

Create a column beside the dates that starts at 1 and counts up (this represents the total number of submissions present on each day).

Create a graph of these two columns.

Items in publication blackout
Items in blackout are owned by "Dryad Queue". The queue's ePerson ID is 949.

select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949);

To get metadata about items in the blackout, you must select the metadata directly from the database, one field at a time. Below is a query to select metadata field 17 (DOI). When performing these types of queries, always order by item_id to keep the ordering of the metadata values consistent across queries. select text_value from metadatavalue where metadata_field_id=17 and item_id IN (select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949)) ORDER BY item_id;

Submissions from a particular instituiton
This query will identify archived data packages that are associated with a particular email suffix. Note that this undercounts the number of deposits from an institution, since around 25% of Dryad user create accounts with a generic email address (e.g., GMail). select item_id from item where submitter_id IN (select eperson_id from eperson where email like '%TARGET_INSTITUTION_EMAIL_SUFFIX') and in_archive='t' and item_id IN (select item_id from collection2item where collection_id=2) order by last_modified;

Vouchers
\copy (select voucher.code as voucher_code, shoppingcart.status, voucher.customer, metadatavalue.text_value as DOI, metadatavalue.item_id, shoppingcart.notes from voucher, shoppingcart, metadatavalue where shoppingcart.voucher >0 and shoppingcart.voucher = voucher.voucher_id and shoppingcart.item = metadatavalue.item_id and metadatavalue.metadata_field_id = 17) to '/tmp/voucher_export.csv' with csv;

Item Metadata
select element, qualifier, text_value from metadatavalue, metadatafieldregistry where metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id and item_id =145629;

Items That Have Been Paid, but are Incorrectly in Reauthorization
select item_id from workflowitem, taskowner, shoppingcart where workflowitem.workflow_id = taskowner.workflow_item_id and (taskowner.step_id like 'reAuthorizationPaymentStep' or taskowner.step_id like 'pendingPublicationReAuthorizationPaymentStep') and workflowitem.item_id=shoppingcart.item and shoppingcart.status like 'completed';