[Rdap] data publishing - policy question re: data with errors

Wed Apr 13 15:05:34 EDT 2016

I'd like to take this opportunity to share the following (hopefully
helpful) summary of observations from my work with data scientists:

Many scientists refer to a published/archived/networked set of data as
a 'data release' and append a version number to track changes. The
release version number lets scientists identify exactly which version
of the dataset was used (i.e., for reproducibility).  When the dataset
changes, the data managers create a new release and/or version number.
A record of changes (aka changelog) is often critical for users to
understand what changed between versions.
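
To make the release/changelog idea concrete, here is a minimal Python
sketch.  It is only an illustration: the DataRelease class, its field
names, the changelog_between helper, and the sample entries are all
made up for this example and do not describe any particular
repository's practice.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DataRelease:
    """One published release of a dataset (hypothetical field names)."""
    dataset: str
    version: str
    released: date
    # Changelog entries describing what changed in this release.
    changes: List[str] = field(default_factory=list)


def changelog_between(releases, old, new):
    """Collect changelog entries for releases after `old`, up to and including `new`.

    Assumes `releases` is ordered oldest to newest.
    """
    versions = [r.version for r in releases]
    start, end = versions.index(old), versions.index(new)
    entries = []
    for r in releases[start + 1 : end + 1]:
        entries.extend(f"{r.version}: {c}" for c in r.changes)
    return entries


# Invented sample data, for illustration only.
releases = [
    DataRelease("example-dataset", "1.0", date(2016, 1, 4), ["Initial release."]),
    DataRelease("example-dataset", "1.1", date(2016, 2, 1), ["Corrected units in column 3."]),
    DataRelease("example-dataset", "2.0", date(2016, 4, 1), ["Removed 12 duplicate records."]),
]

print("\n".join(changelog_between(releases, "1.0", "2.0")))
```

A user who cited release 1.0 can then see exactly what changed by the
time release 2.0 came out.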

There are probably only about 6-10 modalities of data ingest/update
that I have observed in the wild, but I haven't seen any publications
that summarize these use cases in terms of data management.  Perhaps
someone else has, though.  I did a brief lit survey at the time of
writing this email and didn't see anything new.  There are quite a few
resources discussing the problem, but you may need to use the term
'data citation' or 'data quality control' in order to find the
publications (e.g. http://docs.virtualsolar.org/wiki/Citation ).  Be
aware that some of the discussions in these publications may not be
appropriate to the type of use and storage modalities present at NIST.

1) The considerations that you have for data errors are part of a
larger discussion that is necessary for your group to make long term
plans around publishing data and creating data release versions for
distribution and/or reuse.  Matt Mayernik, Carole Palmer, and many
others in the fed.gov.us data world have had some excellent hands-on
discussions about data management practices for both their live and
static data sets.  They are very helpful to chat with if you can swing
it.

2) The exact considerations around storage and metadata
(e.g. overwriting vs sequential releases of data on your servers) need
to be tailored to how the data is actually used and structured.  Many
of the very largest datasets are overwritten with detailed changelog
files b/c the network storage and delivery costs would be untenable.
E.g. sometimes it is cheaper and faster to maintain a detailed record
of the changes than it is to maintain multiple versions of a
structured database, but it depends on the dataset and its users/uses.
Some datasets are sequential, e.g. every day more data is loaded into
the set and therefore, the dataset changes every day.  Hence the way
the identifiers are applied depends on how each specific dataset is
structured for loading new data and corrections.

3) The larger discussion around data citation and data release
versions is necessary in order to apply finite identifiers to your
data releases.

4) Finite identifiers are necessary for citing and locating the
dataset for publication purposes.  My IDCC paper
(http://www.ijdc.net/index.php/ijdc/article/view/174 ) focuses on this
problem, if you want to read a more detailed argument about
identifiers and linking to data.  Dataset versioning information is
critical to appending finite identifiers.  This is because records
and/or metadata around dataset versioning are critical to
reproducibility (e.g. so that the reader can correctly identify which
version of the dataset was used in a research publication).
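
As a rough illustration of points 2 and 4 together, the Python sketch
below overwrites data in place, keeps a detailed changelog, and
derives a version-specific identifier for citation.  Everything here
is an assumption made for the example: the OverwrittenDataset class,
the "base/vN" identifier format, and the sample values are
hypothetical, and a real repository would use its own identifier
scheme.

```python
def versioned_identifier(base_id, version):
    """Build a version-specific identifier (hypothetical 'base/vN' format)."""
    return f"{base_id}/v{version}"


class OverwrittenDataset:
    """Keeps only the current records, plus a changelog of every correction."""

    def __init__(self, base_id, records):
        self.base_id = base_id
        self.version = 1
        self.records = dict(records)
        self.changelog = []

    def correct(self, key, new_value, reason):
        """Overwrite one record, bump the version, and log the change."""
        old_value = self.records[key]
        self.records[key] = new_value
        self.version += 1
        self.changelog.append({
            "version": self.version,
            "key": key,
            "old": old_value,
            "new": new_value,
            "reason": reason,
        })

    def identifier(self):
        """Identifier a paper would cite for the current version."""
        return versioned_identifier(self.base_id, self.version)

    def as_of(self, version):
        """Rebuild the records of an earlier version by undoing later changes."""
        snapshot = dict(self.records)
        for entry in reversed(self.changelog):
            if entry["version"] > version:
                snapshot[entry["key"]] = entry["old"]
        return snapshot


# Invented sample data, for illustration only.
ds = OverwrittenDataset("example-dataset", {"row1": 10.0, "row2": 20.0})
ds.correct("row2", 25.0, "transcription error")
print(ds.identifier())
print(ds.as_of(1))
```

The point of the design is that only one copy of the data is stored,
yet a reader holding the version 1 identifier can still recover what
version 1 contained from the changelog.  Whether that trade-off makes
sense depends on the dataset and its users, as noted in point 2.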

Best of luck with your data publication activities,
-l
L. Wynholds
wynholds at ucla.edu
PhD Candidate, Information Studies,
University of California, Los Angeles

On Mon, Apr 11, 2016 at 11:03 AM, Avila, Regina L. (Fed)
<regina.avila at nist.gov> wrote:
> Hello all,
>
>
>
> As we’re formulating our policies and practices for publishing scientific
> research data at NIST, I’m wondering how my colleagues are handling
> published research data that is discovered to have errors.
>
>
>
> My questions are:
>
>
>
> Are you overwriting the erroneous data with the corrected data and then
> publishing changelogs?
>
> Are you keeping the original data but making it inaccessible?
>
> If you’re keeping the original data accessible, how are you noting the
> errors?
>
> If you’re keeping the erroneous data, is it kept permanently or for a set
> amount of time?
>
> Has any authority published a best practice guide on this matter?
>
> Any arguments for or against any of these ideas?
>
>
>
>
>
> Thanks in advance for your assistance,
>
>
>
> Regina Avila
>
> ____________________________________
> Regina L. Avila
>
> Digital Services Librarian
>
> National Institute of Standards and Technology
>
> 301-975-3575