[Rdap] Actual cases of data management SNAFUs
oneiros at grace.nascom.nasa.gov
Fri Sep 25 12:20:02 EDT 2015
On Fri, 25 Sep 2015, Stefan Ekman wrote:
> Dear list members,
> I'm looking for actual cases where research has been prevented or
> impaired by poor data management and/or sharing practice, preferably in
> Europe but anywhere is fine. Created scenarios (such as this:
> https://www.youtube.com/watch?v=N2zK3sAtr-4) are easy to find, or indeed
> imagine, but real-life examples would be valuable in discussing RDM with
> researchers (and funders).
What do you consider to be 'impaired'?
One of the distribution systems that I deal with had multiple errors the
year before last, which caused problems for people trying to get the data:
1. The authoritative site had one of their machines that managed a number
of disk arrays fail.
2. Someone accidentally issued a command that deleted almost all of the
   contents of a disk array.
3. There *were* backups (tape), but the distribution system wouldn't
perform a tape load when remote sites attempted to access the file.
(we had to e-mail the list of objects that we were trying to retrieve,
and someone would manually issue the command to load them from tape).
Restoring the deleted data took ~2 weeks.
4. Another (larger) disk array failed.
5. The authoritative system didn't mark the data as unavailable, so remote
systems kept trying to retrieve the files repeatedly (slowing down
total throughput so much that we couldn't keep up w/ the data stream
for new data).
6. The distribution system had no fail-back for what to do when data
wasn't available, so we had no mechanism to inform users what was
going on and why their requests were failing.
7. I was given access to a CGI to call to request a tape load at the
authoritative site & modified my system to make requests if the data
was from the failed storage array.
8. The tape drivers were written in-house, and would only process loads in
   order, rather than doing opportunistic reads (if a tape is already
   loaded and there are requests queued for that tape, serve them
   out-of-order).
9. My sending tape load requests (batched up once per hour) was causing
   the system to thrash so badly it was affecting their normal use of the
   system.
10. They finally agreed to do a full restore of the failed array, about 2-3
months after it had failed. (and it took another month or so to finish
restoring the data)
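To illustrate what I mean by "opportunistic reads" in item 8: a rough sketch (all names here are made up, not the site's actual code) of a load queue that prefers whatever tape is already mounted, rather than strictly processing requests in arrival order:

```python
from collections import deque

def serve_requests(requests, load_tape, read_file):
    """Serve (tape_id, filename) requests, preferring any queued
    request for the tape that is already loaded instead of
    strictly following FIFO order."""
    queue = deque(requests)
    loaded = None          # tape currently in the drive
    served = []            # order in which files were actually read
    while queue:
        # Prefer any queued request for the already-loaded tape.
        match = next((r for r in queue if r[0] == loaded), None)
        if match is None:
            match = queue[0]        # nothing matches: fall back to FIFO
            loaded = match[0]
            load_tape(loaded)       # expensive: mount a different tape
        queue.remove(match)
        served.append(read_file(match[1]))
    return served
```

With requests interleaved across tapes A, B, A, this serves both A files on a single mount, whereas the in-order behavior I described would mount A, then B, then A again.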
I don't know if this is a failure of a data management plan ... I've been
putting the blame on the site for deciding to implement their own system
for managing the storage, rather than using something that's in use at
other sites (and therefore, well-tested).
There was another incident with that site just before the embargo was to
be lifted -- someone issued a command to delete almost all of the data
from one of the instruments, and they had to restore from tape. However,
they had written the controller for the tape drives single-threaded, and
so it was taking more than 15hrs to restore a day's data (and they had
~4 months of data to restore) ... and the tape drives were sitting 2/3
idle, because the contention was in the controller, not the drives.
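The fix for that kind of bottleneck is old hat: dispatch one worker per drive so the drives stay busy while the controller just hands out work. A minimal sketch (hypothetical names, not their code) of what a multi-drive restore dispatcher could look like:

```python
import queue
import threading

def parallel_restore(tapes, restore_one, n_drives=3):
    """Restore `tapes` with up to n_drives concurrent workers,
    instead of one serialized controller leaving drives idle."""
    work = queue.Queue()
    for t in tapes:
        work.put(t)
    done = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return              # no work left: this drive is finished
            result = restore_one(t)  # the slow, per-tape restore
            with lock:
                done.append(result)

    threads = [threading.Thread(target=worker) for _ in range(n_drives)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done
```

With three drives that are each 2/3 idle under a single-threaded controller, something along these lines gets you roughly a 3x speedup for free.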
This is also the group that I had to argue with for 3+ months to try to
get them to put checksums in their data.
They insisted they were using RAID6, and so they didn't need them as
they'd have to lose 2 drives per array for there to be a problem. Of
course, 2 weeks before launch, they had a power outage in their data
center and a third of the arrays lost 2 disks. Suddenly the PI got the
idea that checksums would be a good thing.
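For anyone wondering what "checksums in their data" amounts to in practice: nothing exotic, just a digest stored alongside each file so you can detect the corruption that RAID parity never sees (bit rot, bad restores, truncated transfers). A minimal sketch, independent of any particular archive format:

```python
import hashlib

def checksum(data, algo="sha256"):
    """Return a hex digest of a data block; stored alongside the
    file, it lets you verify the bytes independently of the RAID."""
    h = hashlib.new(algo)
    h.update(data)
    return h.hexdigest()

def verify(data, expected, algo="sha256"):
    """True if the bytes on disk still match the stored digest."""
    return checksum(data, algo) == expected
```

RAID tells you a *drive* failed; a per-file checksum tells you whether the *data* you got back is the data you put in, which is the question that actually matters after a restore.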
The problem is also related to other management issues resulting in
their two main software developers quitting a good year or so before
launch. I suspect it was because of how dismissive the PI was towards IT
expertise. (as was the case when our group was brought in ... and I tried
getting them to implement checksums and add identifiers to allow the
scientific community to better track data from their system, but got
nothing but push-back).
And even if you don't consider that 'impaired' -- dealing with that group
has eaten up ~6 years of effort to support them, and as such, the Virtual
Solar Observatory hasn't integrated a number of other solar data sources
that aren't well-distributed.