Difference between revisions of "Old:Curator Report and Query Technology"
Revision as of 11:41, 10 June 2016
Methods for generating various reports that are needed for Dryad presentations and documents.
DSpace Curation Reports
Many of the reports needed by curators can be run through the DSpace curation tasks system.
Distribution of bitstream formats across all items in the Data Files collection:
/opt/dryad/bin/dspace curate -v -t profileformats -i 10255/2 -r - >~/temp/profileFormats.txt
Check that data files/packages have all of the required metadata:
/opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/2 -r - >~/temp/requiredFileMetadata.txt /opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/3 -r - >~/temp/requiredFileMetadata.txt
Aggregate statistics about data files, including embargo status:
/opt/dryad/bin/dspace curate -v -t filestats -i 10255/2 -r - >~/temp/fileStats.txt
Locate embargoed files that do not yet have a designated end date for the embargo:
Complete statistics about individual data files. This includes: handle, fileDOI, journal, journalAllowsEmbargo, journalAllowsReview, fileSize, embargoType, embargoDate, downloads, views, dateAccessioned.
/opt/dryad/bin/dspace curate -v -t datafilestats -i 10255/2 -r - >~/temp/dataFileStats.txt
Detailed download statistics for a particular file, identified by its itemID (in this example 40063):
Aggregate statistics about data packages, including counts of integrated submissions:
/opt/dryad/bin/dspace curate -v -t packagestats -i 10255/3 -r - >~/temp/packageStats.txt
Simple/quick statistics about individual data packages. This includes: handle, doi, journal, dateAccessioned.
/opt/dryad/bin/dspace curate -v -t datapackageinfo -i 10255/3 -r - >~/temp/packageInfo.txt
Complete statistics about individual data packages. This includes: handle, doi, journal, numKeywords, numKeywordsJournal, numberOfFiles, packageSize, embargoType, numberOfDownloads, manuscriptNum, numReadmes, wentThroughReview, dateAccessioned.
/opt/dryad/bin/dspace curate -v -t datapackagestats -i 10255/3 -r - >~/temp/dataPackageStats.txt
Dates for the first archived item associated with each journal in the repository:
/opt/dryad/bin/dspace curate -v -t journaldates -i 10255/3 -r - >~/temp/journalDates.txt
Number of data packages for each journal in the repository:
/opt/dryad/bin/dspace curate -v -t datapackagesperjournal -i 10255/3 -r - >~/temp/journalPackages.txt
List of all journals associated with archived items:
\copy (select distinct text_value from metadatavalue where metadata_field_id =77 and item_id IN (select item_id from item where in_archive=TRUE and item_id IN (select item_id from collection2item where collection_id=2))) to '/home/dryad/currentJournals.txt' with CSV;
List of all journals associated with items in publication blackout:
\copy (select text_value from metadatavalue where metadata_field_id =77 and item_id in (select item_id from metadatavalue where text_value like '%Entered publication blackout%' and item_id not in (select item_id from metadatavalue where text_value like '%Approved for entry into archive%'))) to '/home/dryad/allJournalsBlackout.txt' with csv;
Items in curation
Items in the review workflow stage:
/opt/dryad/bin/dspace curate -v -t itemsinreview -i 10255/3 -r - >~/temp/itemsInReview.csv
Graphing Dryad submissions over time
First, get all of the deposit dates. On the production server:
postgres-client.sh -c "select text_value from metadatavalue where metadata_field_id=12 and item_id in (select item_id from item where owning_collection=2 and in_archive='t');" > dryadSubmitDates.txt
(collection 2 is "Data Packages" and metadata field 12 is "Date Available")
Edit the file to remove the timestamps (leave only the date portion).
Sort the dates.
Import the dryadSubmitDates.txt into Excel.
Create a column beside the dates that starts at 1 and counts up (this represents the total number of submissions present on each day).
Create a graph of these two columns.
Items in publication blackout
Items in blackout are owned by "Dryad Queue". The queue's ePerson ID is 949.
select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949);
To get metadata about items in the blackout, you must select the metadata directly from the database, one field at a time. Below is a query to select metadata field 17 (DOI). When performing these types of queries, always order by item_id to keep the ordering of the metadata values consistent across queries. select text_value from metadatavalue where metadata_field_id=17 and item_id IN (select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949)) ORDER BY item_id;
Submissions from a particular instituiton
This query will identify archived data packages that are associated with a particular email suffix. Note that this undercounts the number of deposits from an institution, since around 25% of Dryad user create accounts with a generic email address (e.g., GMail).
select item_id from item where submitter_id IN (select eperson_id from eperson where email like '%TARGET_INSTITUTION_EMAIL_SUFFIX') and in_archive='t' and item_id IN (select item_id from collection2item where collection_id=2) order by last_modified;
\copy (select voucher.code as voucher_code, shoppingcart.status, voucher.customer, metadatavalue.text_value as DOI, metadatavalue.item_id, shoppingcart.notes from voucher, shoppingcart, metadatavalue where shoppingcart.voucher >0 and shoppingcart.voucher = voucher.voucher_id and shoppingcart.item = metadatavalue.item_id and metadatavalue.metadata_field_id = 17) to '/tmp/voucher_export.csv' with csv;
select element, qualifier, text_value from metadatavalue, metadatafieldregistry where metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id and item_id =145629;
- Annual Statistics Reports -- describes the statistics we collect for annual reports, and how they are collected
- Statistics Technology -- describes how the statistics tracking system works, including how to access file download statistics
- Batch Metadata Editing -- describes a process for exporting metadata, which can be used for reporting