[Rdap] What makes an 'Archive Quality' Digital Object?
Joe Hourcle
oneiros at grace.nascom.nasa.gov
Wed Apr 27 10:16:32 EDT 2011
On Tue, 26 Apr 2011, Ruth Duerr wrote:
> I like both Joe and John's lists - pretty darn comprehensive. One thing
> I noted about the lists is that while calibration was explicitly listed
> in John's list, neither list explicitly discussed validation as a step
> beyond calibration, though I suppose some of the other items on each
> list might cover that concept. Validation is particularly important
> with remote sensing data since knowing that a sensor is calibrated to
> some precision may not say anything about how accurately it is actually
> measuring some physical parameter - ground truth is often needed to
> judge that.
I think this is one of those issues where what you're studying comes into
play. There's no way to get the 'ground truth' for the data I deal with.
(even with those plans to put a man on the sun ... but they'd do it at
night, so it'd be okay).
In fact, my talk that spurred the checklist (dealing with some of the
problems we were having in processing catalogs) was put in a session on
'data and instrumentation', and one of the talks before mine was from
someone who worked on the SXI (Solar X-ray Imager) on the GOES fleet
(constellation? I'm not sure what you call a group of spacecraft, and
'spacecrafts' sounds funny).
Anyway, they do an intercalibration between GOES launches, so that the
next SXI's data is calibrated to be comparable to the previous SXI ... and
it seems that the calibration factor they were using to get to physical
units (W/m^2) had actually been wrong ... the later instruments were
calibrated correctly on the ground, and it was the original ones that
should've been adjusted.
And it was off by about 20-30%.
They threw out the question to the scientists of how to deal with it -- go
back and reclassify all flares? (an M8 would now be an X1), change the
definition of the flare class (X would now be > 8*10^-5 vs. 10^-4), or
something else? I don't know if a decision was ever made, but if groups
are using two different scales for classifying flares, it could be messy.
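To make the two options concrete, here's a minimal Python sketch. The class boundaries are the standard GOES soft X-ray ones (peak flux in W/m^2). The 1.3 correction factor is only an assumed value from the upper end of the 20-30% range mentioned above, and the function and constant names are made up for illustration; none of this is what the SXI team actually runs.

```python
# Lower bounds (W/m^2) of the standard GOES flare classes, largest first.
CLASS_FLOORS = [("X", 1e-4), ("M", 1e-5), ("C", 1e-6), ("B", 1e-7), ("A", 1e-8)]

# Assumed correction factor, for illustration only (the post says 20-30%).
ASSUMED_CORRECTION = 1.3


def flare_class(peak_flux, floors=CLASS_FLOORS):
    """Return a label such as 'M8.0' for a peak flux in W/m^2."""
    for letter, floor in floors:
        if peak_flux >= floor:
            # Display rounding near a class boundary is not handled here.
            return f"{letter}{peak_flux / floor:.1f}"
    # Below the A floor: still report it on the A scale.
    letter, floor = floors[-1]
    return f"{letter}{peak_flux / floor:.1f}"


old_flux = 8e-5  # an M8 flare on the old scale

# Option 1: correct the fluxes and reclassify every flare.
print(flare_class(old_flux), "->", flare_class(old_flux * ASSUMED_CORRECTION))

# Option 2: keep the archived fluxes and move the class boundaries instead.
shifted = [(name, floor / ASSUMED_CORRECTION) for name, floor in CLASS_FLOORS]
print("X boundary on the old scale:", shifted[0][1])
print(flare_class(old_flux, shifted))
```

Either way the same flare ends up with the same label; the mess comes when one group applies option 1 and another applies option 2 (or neither) to the same archive.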
> I especially liked John's "subtle characteristics," especially the one
> about data being annotatable. Given that the quality of any particular
> data set varies depending on the use to which you'd like to put it to
> and that the data originator rarely is in a position to know all of the
> potential uses and users of their data, capturing the annotations of
> users is often the only way to start capturing information about the
> utility of the data to audiences other than the original producer.
What? We're supposed to actually *test* the backups?
And then you're going to tell me that it's a problem when it takes 15 hrs
to retrieve a day's worth of data from tape, just because we need to
re-calibrate the first 9 months of data from the mission, and someone
accidentally flushed the raw data from disk.
(Hmm ... now I just have to figure out how to get the scientists to
actually look at these lists before they build data systems ... catalogs
can generally be cleaned up after the fact, but data systems not so much)
...
But for the annotation -- in a way, the various 'catalogs' that I deal
with are annotations, but we have some really subtle issues that basically
correlate to some of the problems in data citation:
What am I annotating?
(a) The world as observed in this data (eg, X2 flare)
(b) The observation (eg, partially obscured by clouds)
(c) The observation as it exists on disk (eg, blocks lost in
transfer; partial image)
(d) The calibrated edition of data (eg, notes on oddities in the
data)
(e) The file on disk (eg, invalid checksum; possible corruption)
(f) The instrument (eg, a discontinuity due to servicing, or even
the lack of data for a time period)
It's possible that some of these might 'trickle down'. (eg, if I'm
annotating the calibrated form, I'm also indirectly annotating the
observation and the state of the world ... I saw an X2 flare based on
this set of calibrated images ... which means that I'm asserting that
there's an X2 flare that might've been visible by other instruments
observing that region at that time if they had similar observing
characteristics)
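One way to picture the list above is as an explicit 'target' field on every annotation, plus a table of which targets trickle down to which. This is just a sketch in Python: the names (Target, TRICKLES_TO, implied_targets) are hypothetical, and the only trickle-down encoded is the one example given above; the rest are left out rather than guessed.

```python
from enum import Enum


class Target(Enum):
    WORLD = "a"          # the world as observed in this data
    OBSERVATION = "b"    # the observation itself
    STORED_OBS = "c"     # the observation as it exists on disk
    CALIBRATED = "d"     # the calibrated edition of the data
    FILE = "e"           # the file on disk
    INSTRUMENT = "f"     # the instrument


# Only the one trickle-down from the text is encoded here (an assumption
# about how it might be modeled, not an established scheme).
TRICKLES_TO = {
    Target.CALIBRATED: {Target.OBSERVATION, Target.WORLD},
}


def implied_targets(target):
    """Targets an annotation might also apply to, per TRICKLES_TO."""
    return set(TRICKLES_TO.get(target, set()))


print(sorted(t.name for t in implied_targets(Target.CALIBRATED)))
print(sorted(t.name for t in implied_targets(Target.FILE)))
```

The point of making the target explicit is that 'invalid checksum' on a file and 'X2 flare' on the world can then live in the same catalog without being confused for one another.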
And I mention data citation for a few reasons:
1. Citation should be a type of annotation; if someone is later
browsing the data, we should be able to tell them what papers
have been published using that data, so they can avoid
duplicating work or identify collaborators for additional
analysis.
2. Downloading should be a type of annotation. It allows a
researcher to easily identify what they had downloaded, so they
can then generate a record for citation. It can also be used
to generate periods/locations of interest in general, but even
anonymous, we get into some issues like Mike Brown's Haumea
incident [1].
3. But if we know who's downloaded the data, we can inform
researchers if we've identified problems with the data and/or
a recalibration run, so they don't get caught unaware when it
happens after they've downloaded the data, but before they've
submitted their research paper.
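As a rough sketch of point 3, assuming downloads are logged with the user and the time range of data they pulled (the Download record and users_to_notify function are hypothetical, not part of any existing system), finding who to warn after a recalibration is just an interval-overlap check:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Download:
    user: str            # hypothetical contact address
    start: datetime      # start of the observation range downloaded
    end: datetime        # end of the observation range downloaded


def users_to_notify(downloads, recal_start, recal_end):
    """Users whose downloads overlap the recalibrated time range."""
    return sorted({
        d.user for d in downloads
        if d.start < recal_end and d.end > recal_start
    })


log = [
    Download("alice@example.org", datetime(2010, 5, 1), datetime(2010, 5, 2)),
    Download("bob@example.org", datetime(2011, 3, 1), datetime(2011, 3, 5)),
]
# Suppose the data from 2010-05-01 up to 2011-02-01 was recalibrated.
print(users_to_notify(log, datetime(2010, 5, 1), datetime(2011, 2, 1)))
```

Of course, this only works if the downloads aren't anonymous, which is exactly the trade-off with point 2.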
[1] Another researcher published that they had discovered the 'dwarf'
planet, but it was later discovered they had downloaded the observing
logs, and determined where Brown had been looking:
http://www.nytimes.com/2005/09/13/science/space/13plan.html
-Joe