[Rdap] DOIs for research data in DSpace

Mon Oct 1 13:25:50 EDT 2012

On Oct 1, 2012, at 12:30 PM, Konkiel, Stacy Rose wrote:

> Hello everyone,
> 
> We've got a research group here at Indiana University that is interested in registering their bitstream (data file) URLs for DOIs for each dataset in their project that's been uploaded to our institutional repository.
> 
> They point out that a PURL already exists for the metadata page/item record: that's the DSpace Handle. They want their DOIs to allow for direct data file download. Our IR team is a bit hesitant to endorse this, as bitstream URLs are somewhat less stable than Handles, and DOIs (registered to IRs, SRs, and publishers alike) seem to always point to a landing page of some sort.
> 
> Has anyone dealt with this issue before? I'm familiar with the ANDS/Griffith U example [1] but would like to know more about best practices at other institutions, and the arguments for/against registering bitstream URLs for DOIs.

The topic was discussed at the August 2011 BRDI meeting on data citation. The technical breakout group went over some of the issues and came up with a list of recommendations ... one of which was to prefer 'landing pages' over linking directly to the data.

I'm expecting the official report from the meeting to be released soon (they were targeting September, which ended yesterday) ... but as I thought it was important enough, I also presented on the advantages of landing pages at this year's RDAP:

	http://docs.virtualsolar.org/wiki/Citation

Specifically, from the handout from the poster (with a typo corrected):

	http://vso1.nascom.nasa.gov/rdap/RDAP2012_landingpages_handout.pdf

	Acts as the endpoint for citations
	----------------------------------
	Citations should go to this intermediary landing page,
	rather than directly to the data. Sending researchers
	directly to the data can be a disservice, as the
	data may not be useful on its own without the proper
	software to read it or the proper documentation to
	understand it. The data may be excessive in size, and
	pushing researchers to download data without an
	interstitial warning about what they are downloading
	is a disservice to both interested researchers and to
	the providers hosting the data.

(there's video of the talk, too ... but I really don't want to talk about that ... I have, however, figured out that Apple's Keynote has a setting for what size you want the slides to be rendered at ... it does *not* automatically scale up to the display resolution, which is what screwed me up ... and the talk was really about the advantages of citing the data directly vs. the other alternatives, so wouldn't help with this question really)

Part of the discussion at the BRDI meeting against directly linking to the data was:

	1. Do you want people downloading multiple-GB or TB files without knowing if they really want them, or can even use them?  (I personally don't want people wasting our site's bandwidth)

	2. Where can you put up notices that although the data's still available in the original form used, it's been deprecated by a different calibration?

	3. As file standards change, the data may be packaged in more than one format, and there should be a place where a researcher can select what's appropriate for them.  (and most people don't know how to adjust the HTTP Accept header so that we can make the decision for them).

	4. The data may be stored at more than one location, so we want to prompt them to select an appropriate mirror.

	5. Many 'self-documenting' file formats don't actually carry enough information to really use the data.  (Is it suitable for the type of research I'm doing?  What assumptions did they make in collecting the data?  Are the terms they're using consistent with what I think they mean?)
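
To make point 3 concrete, here is a minimal sketch of what "making the decision for them" from the HTTP Accept header could look like on the server side. It is an illustration only, not how any particular repository does it: the function names, the media types in FORMATS, and the rule that a bare "*/*" falls back to the landing page are all assumptions.

```python
# Hypothetical set of packagings offered for one dataset (assumption).
FORMATS = {"application/fits", "text/csv"}


def parse_accept(header):
    """Split an Accept header into (media_type, q) pairs, best first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # q defaults to 1 when the client does not give one
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs.append((media_type, q))
    # sorted() is stable, so equal q values keep the client's order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_format(accept_header, available):
    """Return the packaging to serve, or None to show the landing page.

    Only exact media types and a bare "*/*" are handled; partial
    wildcards such as "text/*" are ignored in this sketch.
    """
    for media_type, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means the client refuses this type
        if media_type in available:
            return media_type
        if media_type == "*/*":
            return None  # no real preference: let the person choose
    return None


print(choose_format("text/csv;q=0.5, application/fits", FORMATS))
print(choose_format("text/html, */*;q=0.8", FORMATS))
```

A typical browser sends something like the second header, so it would still end up on the landing page; only a client that names a specific packaging skips it.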

We still wanted some sort of machine-readable links from the landing page to the actual data, so that some sort of client *could* go and download all of the data automatically ... but we didn't want someone clicking on a link from a paper to suddenly find out they've just requested a 200TB dataset be staged for them to download.*
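
One way such machine-readable links could work is sketched below; it is an assumption, not something the breakout group specified. It supposes the landing page carries `<link rel="item">` elements pointing at the data files, and that a client collects them before deciding what to fetch. The sample page, the example.org URLs, and the names DataLinkParser and data_links are all made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical landing page; the URLs and media types are placeholders.
SAMPLE_PAGE = """
<html><head>
  <title>Dataset 42</title>
  <link rel="stylesheet" href="/style.css">
  <link rel="item" type="application/fits"
        href="http://data.example.org/set42/a.fits">
  <link rel="item" type="text/csv"
        href="http://data.example.org/set42/a.csv" />
</head><body>Read the documentation before downloading.</body></html>
"""


class DataLinkParser(HTMLParser):
    """Collect (href, type) for every <link rel="item"> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated relation names
        rels = (attr.get("rel") or "").lower().split()
        if "item" in rels and attr.get("href"):
            self.links.append((attr["href"], attr.get("type")))


def data_links(html):
    """Return the data-file links advertised by a landing page."""
    parser = DataLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


for href, media_type in data_links(SAMPLE_PAGE):
    print(media_type, href)
```

The client only learns where the files are; nothing gets downloaded or staged until it explicitly asks for one of the listed URLs.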

-Joe

* I can actually craft a URL to do it ... but right now I try to make sure that I never generate URLs that request more than 2 GB, as not all of the mirrors have appropriate robots.txt files to protect themselves from search engines.
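
For the robots.txt point, a small sketch of how a mirror's rules could be checked with Python's standard urllib.robotparser. The rules, the /cgi-bin/stage path, and the mirror hostname are invented for the example; they are not the actual configuration of any mirror.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that keeps crawlers away from a staging script.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/stage
"""


def crawler_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if a well-behaved crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(crawler_allowed(ROBOTS_TXT, "http://mirror.example.org/cgi-bin/stage?dataset=42"))
print(crawler_allowed(ROBOTS_TXT, "http://mirror.example.org/landing/42"))
```

This only helps against crawlers that honor robots.txt, which is why keeping the size of any single generated URL's request small is still a sensible safeguard.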