[Rdap] What makes an 'Archive Quality' Digital Object?

Ruth Duerr rduerr at nsidc.org
Wed Apr 27 11:44:00 EDT 2011


On Apr 27, 2011, at 8:16 AM, Joe Hourcle wrote:

> 
> 
> On Tue, 26 Apr 2011, Ruth Duerr wrote:
> 
>> I like both Joe's and John's lists - pretty darn comprehensive.  One thing I noted is that while calibration was explicitly listed in John's list, neither list explicitly discussed validation as a step beyond calibration, though I suppose some of the other items on each list might cover that concept.  Validation is particularly important with remote sensing data, since knowing that a sensor is calibrated to some precision may not say anything about how accurately it is actually measuring some physical parameter - ground truth is often needed to judge that.
> 
> I think this is one of those issues where what you're studying comes into play.  There's no way to get the 'ground truth' for the data I deal with. (even with those plans to put a man on the sun ... but they'd do it at night, so it'd be okay).

Yup
> 
> In fact, my talk that spurred the checklist (dealing with some of the problems we were having in processing catalogs) was put in a session on 'data and instrumentation', and one of the talks before mine was from someone who worked on the SXI (Solar X-ray Imager) on the GOES fleet (constellation?  I'm not sure what you call a group of spacecraft, and 'spacecrafts' sounds funny).
> 
> Anyway, they do an intercalibration between GOES launches, so that the next SXI's data is calibrated to be comparable to the previous SXI ... and it seems that the calibration factor they were using to get to physical units (W/m^2) had actually been wrong ... the later instruments were calibrated correctly on the ground, and it was the original ones that should've been adjusted.
> 
> And it was off by about 20-30%.
> 
> They threw the question out to the scientists of how to deal with it -- go back and reclassify all flares (an M8 would now be an X1)?  Change the definition of the flare class (X would now be > 8*10^-5 vs. 10^-4)?  Or something else?  I don't know if a decision was ever made, but if groups are using two different scales for classifying flares, it could be messy.
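
Just to make that arithmetic concrete, here's a toy sketch (mine, not anything from the SXI team; the exact correction factor is assumed to be 30%) of how a correction of that size moves an event across a class boundary.  GOES flare classes are defined by peak 1-8 Angstrom X-ray flux in W/m^2:

    # Toy illustration of the reclassification problem -- not the actual
    # SXI pipeline.  Class thresholds: X >= 1e-4, M >= 1e-5, C >= 1e-6.
    def flare_class(flux_wm2):
        """Return the GOES class (letter plus multiplier) for a peak flux."""
        for letter, threshold in (("X", 1e-4), ("M", 1e-5), ("C", 1e-6),
                                  ("B", 1e-7), ("A", 1e-8)):
            if flux_wm2 >= threshold:
                return "%s%.1f" % (letter, flux_wm2 / threshold)
        return "sub-A"

    old_flux = 8e-5       # reported as M8.0 under the old calibration
    correction = 1.3      # assumed ~30% correction factor
    print(flare_class(old_flux))               # M8.0
    print(flare_class(old_flux * correction))  # X1.0 -- same event, new class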
> 
> 
> 
>> I especially liked John's "subtle characteristics," especially the one about data being annotatable.  Given that the quality of any particular data set varies depending on the use to which you'd like to put it, and that the data originator is rarely in a position to know all of the potential uses and users of their data, capturing the annotations of users is often the only way to start capturing information about the utility of the data to audiences other than the original producer.
> 
> What?  We're supposed to actually *test* the backups?
> 
> And then you're going to tell me that it's a problem when it takes 15 hrs to retrieve a day's worth of data from tape, just because we need to re-calibrate the first 9 months of data from the mission, and someone accidentally flushed the raw data from disk.
> 
> (Hmm ... now I just have to figure out how to get the scientists to actually look at these lists before they build data systems ... catalogs can generally be cleaned up after the fact, but data systems not so much)
> 
> ...
> 
> But for the annotation -- in a way, the various 'catalogs' that I deal with are annotations, but we have some really subtle issues that basically correlate to some of the problems in data citation:
> 
>    What am I annotating?
> 	(a) The world as observed in this data (eg, X2 flare)
> 	(b) The observation (eg, partially obscured by clouds)
> 	(c) The observation as it exists on disk (eg, blocks lost in
> 		transfer; partial image)
> 	(d) The calibrated edition of data (eg, notes on oddities in the
> 		data)
> 	(e) The file on disk (eg, invalid checksum; possible corruption)
> 	(f) The instrument (eg, a discontinuity due to servicing, or even
> 	    the lack of data for a time period)

and possibly all of the above and more!  
> 
> It's possible that some of these might 'trickle down'.  (eg, if I'm annotating the calibrated form, I'm also indirectly annotating the observation and the state of the world ... I saw an X2 flare based on this set of calibrated images ... which means that I'm asserting that there's an X2 flare that might've been visible to other instruments observing that region at that time if they had similar observing characteristics)
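
For what it's worth, here's a minimal sketch of how an annotation record might state its target explicitly, so that 'trickle down' can be computed rather than guessed - the level names and rules are my own invention, not an existing standard:

    # Illustrative only: record which target (a)-(f) an annotation is
    # about, and derive what it indirectly asserts.
    TRICKLE_DOWN = {
        "calibrated": ("stored_observation", "observation", "world"),
        "stored_observation": ("observation", "world"),
        "observation": ("world",),
    }

    def implied_targets(level):
        """Targets an annotation at a given level directly or indirectly covers."""
        return (level,) + TRICKLE_DOWN.get(level, ())

    annotation = {
        "level": "calibrated",    # (d) in Joe's list
        "note": "X2 flare seen in this set of calibrated images",
    }
    print(implied_targets(annotation["level"]))
    # ('calibrated', 'stored_observation', 'observation', 'world')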
> 
> And I mention data citation for a few reasons:
> 
> 	1. Citation should be a type of annotation; if someone is later
> 	   browsing the data, we should be able to tell them what papers
> 	   have been published using that data, so they can avoid
> 	   duplicating work or identify collaborators for additional
> 	   analysis.

Actually, the USGCRP paper includes citations as well as a host of technical documentation - while the paper is long, the list inside it is relatively short:

"Instrument / sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.);
Instrument / sensor calibration data and method;
Processing algorithms and their scientific basis, including complete description of any sampling or mapping algorithm used in the creation of the product (e.g. contained in peer reviewed papers, in some cases supplemented by thematic information introducing the data set or product to scientists unfamiliar with it);
Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product;
Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive;
Quality assessment information;
Validation record, including identification of validation data sets;
Data structure and format, with definition of all parameters and fields;
In the case of earth-based data, station location and any changes in location, instrumentation, controlling agency, surrounding land use and other factors that could influence the long-term record;
A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set;
Information received back from users of the data set or product."
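
As an aside, that list maps fairly naturally onto a machine-checkable ingest record.  A rough sketch - the field names are my own, not a published schema:

    # Hypothetical mapping of the USGCRP documentation list to a record
    # an archive could check for completeness at ingest time.
    documentation = {
        "instrument_characteristics": "pre-flight spectral response, noise",
        "calibration_data_and_method": "calibration data plus method",
        "processing_algorithms": ["peer-reviewed algorithm description"],
        "ancillary_datasets": ["identifiers of data used in generation"],
        "processing_history": {"source_code_version": "v2.3",
                               "product_version": "V002"},
        "quality_assessment": "per-granule quality flags",
        "validation_record": ["validation data set identifiers"],
        "structure_and_format": "parameter and field definitions",
        "station_history": None,    # earth-based data only
        "bibliography": ["refereed publications using the data set"],
        "user_feedback": [],        # accumulates after release
    }
    missing = [field for field, value in documentation.items()
               if not value and field not in ("station_history",
                                              "user_feedback")]
    print("missing at ingest:", missing)    # [] for this record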
> 
> 	2. Downloading should be a type of annotation.  It allows a
> 	   researcher to easily identify what they had downloaded, so they
> 	   can then generate a record for citation.  It can also be used
> 	   to generate periods/locations of interest in general, but even
> 	   anonymous, we get into some issues like Mike Brown's Haumea
> 	   incident [1].

Yes - in ESIP we've discussed this, with the idea that a repository could create a citation for a user that referenced back to a specific set of files.  The issue with that is that generally users wouldn't be citing all of the data they downloaded, but some fraction of it - or, more likely, some fraction of a number of downloads.  Just because it doesn't work perfectly for citation generation doesn't mean that it isn't a good idea in general though...
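
A toy version of that idea, with invented identifiers and fields: record each download as a manifest, and hand the user an id they can cite and later narrow to the subset they actually used:

    import datetime
    import hashlib
    import json

    # Sketch only -- not an existing repository API.
    def record_download(user, files):
        """Describe one download; return an id a citation could reference."""
        manifest = {
            "user": user,
            "retrieved": datetime.datetime.utcnow().isoformat() + "Z",
            "files": sorted(files),
        }
        blob = json.dumps(manifest, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:16], manifest

    manifest_id, manifest = record_download(
        "rduerr", ["sxi_20030101_0000.fits", "sxi_20030101_0001.fits"])
    print("cite as: <repository>, manifest", manifest_id)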
> 
> 	3. But if we know who's downloaded the data, we can inform
> 	   researchers if we've identified problems with the data and/or
> 	   a recalibration run, so they don't get caught unaware when it
> 	   happens after they've downloaded the data, but before they've
> 	   submitted their research paper.

Agreed...
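
Once download manifests like the one sketched above exist, this becomes a simple query.  Again, the structures are invented for illustration:

    # Given per-download manifests, find who should be told that a set
    # of files was recalibrated, and which of their files are affected.
    def users_to_notify(manifests, recalibrated_files):
        affected = {}
        bad = set(recalibrated_files)
        for manifest in manifests:
            overlap = bad.intersection(manifest["files"])
            if overlap:
                affected.setdefault(manifest["user"], set()).update(overlap)
        return affected

    manifests = [
        {"user": "rduerr",   "files": ["sxi_20030101_0000.fits"]},
        {"user": "jhourcle", "files": ["sxi_20040101_0000.fits"]},
    ]
    print(users_to_notify(manifests, ["sxi_20030101_0000.fits"]))
    # {'rduerr': {'sxi_20030101_0000.fits'}}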
> 
> [1] Another researcher announced that they had discovered the 'dwarf'
>    planet, but it was later found that they had downloaded the observing
>    logs and determined where Brown had been looking:
>    http://www.nytimes.com/2005/09/13/science/space/13plan.html
> 
> 
> 
> -Joe
