[Rdap] What makes an 'Archive Quality' Digital Object?

Joe Hourcle oneiros at grace.nascom.nasa.gov
Wed Apr 27 10:16:32 EDT 2011



On Tue, 26 Apr 2011, Ruth Duerr wrote:

> I like both Joe and John's lists - pretty darn comprehensive.  One thing 
> I noted about the lists is that while calibration was explicitly listed 
> in John's list, neither list explicitly discussed validation as a step 
> beyond calibration, though I suppose some of the other items on each 
> list might cover that concept.  Validation is particularly important 
> with remote sensing data since knowing that a sensor is calibrated to 
> some precision may not say anything about how accurately it is actually 
> measuring some physical parameter - ground truth is often needed to 
> judge that.

I think this is one of those issues where what you're studying comes into 
play.  There's no way to get the 'ground truth' for the data I deal with. 
(even with those plans to put a man on the sun ... but they'd do it at 
night, so it'd be okay).

In fact, my talk that spurred the checklist (dealing with some of the 
problems we were having in processing catalogs) was put in a session on 
'data and instrumentation', and one of the talks before mine was from 
someone who worked on the SXI (Solar X-ray Imager) on the GOES fleet 
(constellation?  I'm not sure what you call a group of spacecraft, and 
'spacecrafts' sounds funny).

Anyway, they do an intercalibration between GOES launches, so that the
next SXI's data is calibrated to be comparable to the previous SXI's ...
and it seems that the calibration factor they were using to get to
physical units (W/m^2) had actually been wrong ... the later instruments
were calibrated correctly on the ground, and it was the original ones
that should've been adjusted.

And the discrepancy was about 20-30%.

They threw the question out to the scientists of how to deal with it --
go back and reclassify all flares (an M8 would now be an X1)?  Change
the definition of the flare class (X would now be > 8*10^-5 vs.
10^-4)?  Or something else?  I don't know if a decision was ever made,
but if groups end up using two different scales for classifying flares,
it could get messy.
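
If it helps to see the arithmetic, here's a quick Python sketch -- not
anything the SXI or GOES teams actually use, just the standard class
boundaries and an assumed 25% correction -- of how that pushes an M8
over the X boundary:

    def flare_class(flux_w_m2):
        # Standard GOES boundaries (peak 1-8 Angstrom flux, W/m^2):
        # X >= 1e-4, M >= 1e-5, C >= 1e-6, B >= 1e-7; the number is
        # the flux divided by the boundary value.
        for letter, base in [('X', 1e-4), ('M', 1e-5), ('C', 1e-6),
                             ('B', 1e-7), ('A', 1e-8)]:
            if flux_w_m2 >= base:
                return f"{letter}{flux_w_m2 / base:.1f}"
        return f"A{flux_w_m2 / 1e-8:.1f}"   # weaker events stay on the A scale

    old_flux = 8e-5               # reported as M8.0 under the old calibration
    new_flux = old_flux * 1.25    # assumed ~25% upward correction

    print(flare_class(old_flux))  # M8.0
    print(flare_class(new_flux))  # X1.0 -- same event, different class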



> I especially liked John's "subtle characteristics," especially the one 
> about data being annotatable.  Given that the quality of any particular 
> data set varies depending on the use to which you'd like to put it
> and that the data originator rarely is in a position to know all of the 
> potential uses and users of their data, capturing the annotations of 
> users is often the only way to start capturing information about the 
> utility of the data to audiences other than the original producer.

What?  We're supposed to actually *test* the backups?

And then you're going to tell me that it's a problem when it takes 15 hrs
to retrieve a day's worth of data from tape, just because we need to
re-calibrate the first 9 months of data from the mission and someone
accidentally flushed the raw data from disk.

(Hmm ... now I just have to figure out how to get the scientists to 
actually look at these lists before they build data systems ... catalogs 
can generally be cleaned up after the fact, but data systems not so much)

...

But for the annotation -- in a way, the various 'catalogs' that I deal
with are annotations, but we have some really subtle issues that
basically parallel some of the problems in data citation:

     What am I annotating?
 	(a) The world as observed in this data (eg, X2 flare)
 	(b) The observation (eg, partially obscured by clouds)
 	(c) The observation as it exists on disk (eg, blocks lost in
 		transfer; partial image)
 	(d) The calibrated edition of data (eg, notes on oddities in the
 		data)
 	(e) The file on disk (eg, invalid checksum; possible corruption)
 	(f) The instrument (eg, a discontinuity due to servicing, or even
 	    the lack of data for a time period)

It's possible that some of these might 'trickle down'.  (eg, if I'm 
annotating the calibrated form, I'm also indirectly annotating the 
observation and the state of the world ... I saw an X2 flare based on 
this set of calibrated images ... which means that I'm asserting that 
there's an X2 flare that might've been visible to other instruments
observing that region at that time, if they had similar observing
characteristics)
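
Something like this (just a sketch; the names are mine, not an existing
schema) is what I have in mind for recording which of those levels a
note is about, and which levels it trickles down to:

    from dataclasses import dataclass
    from enum import Enum

    class Target(Enum):
        WORLD = 'world as observed'           # (a) eg, X2 flare
        OBSERVATION = 'observation'           # (b) eg, obscured by clouds
        OBS_ON_DISK = 'observation on disk'   # (c) eg, blocks lost in transfer
        CALIBRATED = 'calibrated edition'     # (d) eg, oddities in the data
        FILE = 'file on disk'                 # (e) eg, invalid checksum
        INSTRUMENT = 'instrument'             # (f) eg, gap after servicing

    # Levels that a note at a given target indirectly asserts something
    # about (debatable; this just encodes the X2-flare example above).
    TRICKLES_DOWN_TO = {
        Target.CALIBRATED:  [Target.OBSERVATION, Target.WORLD],
        Target.OBSERVATION: [Target.WORLD],
    }

    @dataclass
    class Annotation:
        target: Target     # which level the note is about
        object_id: str     # the image / file / instrument being annotated
        note: str
        author: str

    a = Annotation(Target.CALIBRATED, 'sxi_20050913_1200', 'X2 flare', 'jdoe')
    print([t.name for t in TRICKLES_DOWN_TO.get(a.target, [])])
    # ['OBSERVATION', 'WORLD']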

And I mention data citation for a few reasons:

 	1. Citation should be a type of annotation; if someone is later
 	   browsing the data, we should be able to tell them what papers
 	   have been published using that data, so they can avoid
 	   duplicating work or identify collaborators for additional
 	   analysis.

 	2. Downloading should be a type of annotation.  It allows a
 	   researcher to easily identify what they have downloaded, so they
 	   can then generate a record for citation.  It can also be used to
 	   identify periods/locations of general interest, but even when
 	   anonymized, we run into issues like Mike Brown's Haumea
 	   incident [1].

 	3. But if we know who's downloaded the data, we can inform
 	   researchers if we've identified problems with the data and/or
 	   done a recalibration run, so they don't get caught unaware when
 	   it happens after they've downloaded the data but before they've
 	   submitted their research paper.  (A rough sketch of this sort of
 	   tracking follows the list.)
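
As a made-up example of item 3, the table and field names below are
invented, but the idea is just to record downloads with their time
ranges and join them against any later problem/recalibration notices:

    import sqlite3

    db = sqlite3.connect(':memory:')
    db.executescript("""
        CREATE TABLE downloads (user_email TEXT, dataset TEXT,
                                obs_start TEXT, obs_end TEXT);
        CREATE TABLE data_events (dataset TEXT, obs_start TEXT,
                                  obs_end TEXT, note TEXT);
    """)

    # The download itself is the annotation.
    db.execute("INSERT INTO downloads VALUES (?,?,?,?)",
               ('researcher@example.edu', 'sxi_level1',
                '2005-01-01', '2005-09-30'))

    # Later: the first nine months of the mission get recalibrated.
    db.execute("INSERT INTO data_events VALUES (?,?,?,?)",
               ('sxi_level1', '2005-01-01', '2005-09-30',
                'level-1 data recalibrated; fluxes up ~25%'))

    # Everyone whose downloads overlap the affected period gets a heads-up
    # before they submit a paper based on the old calibration.
    affected = db.execute("""
        SELECT DISTINCT d.user_email, e.note
          FROM downloads d JOIN data_events e
            ON d.dataset = e.dataset
           AND d.obs_start <= e.obs_end
           AND d.obs_end   >= e.obs_start
    """).fetchall()
    print(affected)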

[1] Another researcher announced that they had discovered the 'dwarf'
     planet, but it later came out that they had downloaded the observing
     logs and determined where Brown had been looking:
     http://www.nytimes.com/2005/09/13/science/space/13plan.html



-Joe


