DataONE Preservation Strategy
=============================

.. index:: preservation

.. Note::

  This document is a literal conversion of the DataONE Preservation Strategy
  document developed by the "Preservation and Metadata Working Group" as an
  outcome of the December 2010 meeting held in Chicago. The original document
  may be retrieved from the `DataONE document library`_.


.. _DataONE document library: https://docs.dataone.org/member-area/documents/management/nsf-reviews/nsf-review-february-2011/documents-for-nsf-review/DataONE_Preservation_Strategy_2011.docx

Summary
-------

To meet the objective of “easy, secure, and persistent storage of data”,
DataONE adopts a simple three-tiered approach.

1. Keep the bits safe. Retaining the actual bits that comprise the data is
   paramount, as all other preservation and access questions are moot if the
   bits are lost. Key sub-strategies for this tier are (a) persistent
   identification, (b) replication of data and metadata, (c) periodic
   verification that stored content remains uncorrupted, and (d) reliance on
   member nodes to adhere to DataONE protocols and guidelines consistent with
   widely adopted public and private sector standards for IT infrastructure
   management.

2. Protect the form, meaning, and behavior of the bits. Assuming the bits are
   kept undamaged, users must also be able to make sense of them into the
   future, so protecting their form, meaning, and behavior is critical. In
   this tier we rely on collecting characterization metadata, encouraging use
   of standardized formats, and securing legal rights appropriate to long-term
   archival management, all of which support future access and, as needed to
   preserve meaning and behavior, format migration and emulation.

3. Safeguard the guardians. If the DataONE organization and its member nodes
   were to disappear, that would be equivalent to 100 percent data loss. The
   DataONE network itself provides resiliency against the occasional loss of
   member nodes, and this will be shored up by succession planning, ongoing
   investigations into preservation cost models, and open-source software
   tools that can be sustained by external developer communities.


Preservation Objectives
-----------------------

Fundamentally, DataONE’s preservation goal is to protect the content, meaning,
and behavior of data sets registered in its global network of heterogeneous
data repositories. This is a complex undertaking that warrants a layered,
prioritized approach. To get started on a solid footing, our first objective
was to build a platform that immediately provides a significant degree of
preservation assurance and makes it easy to add more sophisticated
preservation functions over time. Initially, DataONE will focus on preventing
loss due to non-malicious causes, such as:

- Technological obsolescence (e.g., loss of support for rendering software and hardware),
- Accidental loss (human error, natural disaster, etc.), and
- Financial instability (loss of funding).

While malicious threats do exist, many of them are addressed as a side-effect
of DataONE protocols and information technology (IT) management standards in
place at member nodes (MNs). By design, DataONE protocols limit the ability of
any MN or Coordinating Node (CN) to directly alter content on another node,
which in turn limits the havoc that an intruder could wreak. Moreover, MN
guidelines call for the same strong local IT management standards that are
widespread in financial services, manufacturing, and large technology
organizations, and these are typically already in force at well-established
data repositories.

An ancillary objective is to help inform an overall NSF DataNet preservation
strategy, and to that end this strategy was prepared at a DataONE workshop
(Chicago, 2010) with direct input from the Data Conservancy project.

Three DataONE preservation tiers
--------------------------------

The initial DataONE approach to preservation is described in the following
three sections.

1.  Keep the bits safe
~~~~~~~~~~~~~~~~~~~~~~

Retaining the actual bits comprising the data and metadata is paramount, as
all other preservation and access questions are moot if the bits are damaged
or lost. The direct role played by MNs in bit-level preservation is addressed
in the third section describing organizational sustainability.

Identify data persistently
..........................

Persistent identifiers (PIDs) are required for stable reference to all content
stored in DataONE. Without them, reliable data citation and long-term access
would not be possible. Because there are many legacy identifiers to be
accommodated, some of them dating from before the advent of the World Wide
Web, DataONE uses PIDs from a variety of schemes and support systems, such as
purl.org and handle.net. Remaining agnostic about identifier syntax, DataONE
will also rely on scheme-agnostic support systems such as n2t.net/ezid, which
can deal with ARKs, DOIs, and traditionally non-actionable identifiers such as
PMIDs (PubMed Identifiers).
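
As an illustration, the following minimal sketch shows how heterogeneous
identifiers might be normalized to actionable URLs. The helper name, routing
rules, and example identifiers are hypothetical assumptions, not a DataONE
API.

.. code-block:: python

   def resolver_url(pid):
       """Map a persistent identifier string to a resolvable URL."""
       if pid.startswith(("http://", "https://")):
           return pid  # already actionable, e.g., a PURL
       if pid.startswith("hdl:"):
           # Handles resolve through the Handle System proxy.
           return "https://hdl.handle.net/" + pid[len("hdl:"):]
       # ARKs, DOIs, and similar prefixed schemes can be handed to a
       # scheme-agnostic resolver such as n2t.net.
       return "https://n2t.net/" + pid

   print(resolver_url("ark:/99999/fk4example"))  # hypothetical ARK
   print(resolver_url("doi:10.5063/EXAMPLE"))    # hypothetical DOI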

Make lots of copies
...................

To protect against the possible loss of a MN, or a bit-level failure at a
single MN, DataONE replicates both data and metadata. Two replicas of the raw
bits representing each dataset are created upon registration of a dataset by a
MN; the two replicas and the original dataset held at the MN result in a total
of three instances. The instances are kept at three different MNs, which
creates safety through copies that are “de-correlated” by geographic,
administrative, and financial domain. In this way, the instances are not
vulnerable to the same power failure, same earthquake, same funding loss, etc.
As for metadata, three replicas of all metadata are created and held at the
CNs, resulting in a total of four de-correlated metadata instances.
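
To make the notion of “de-correlation” concrete, the following minimal sketch
selects replica targets that share no geographic, administrative, or
financial domain with the original or with each other. The node records and
selection rule are illustrative assumptions, not the actual DataONE
replica-placement algorithm.

.. code-block:: python

   from dataclasses import dataclass

   @dataclass(frozen=True)
   class Node:
       name: str
       region: str   # geographic domain
       admin: str    # administrative domain
       funder: str   # financial domain

   def pick_replica_targets(origin, candidates, n=2):
       """Choose n MNs sharing no domain with the origin or each other."""
       chosen = []
       for node in candidates:
           placed = [origin] + chosen
           if all(node.region != p.region and node.admin != p.admin and
                  node.funder != p.funder for p in placed):
               chosen.append(node)
               if len(chosen) == n:
                   return chosen
       raise RuntimeError("not enough de-correlated member nodes")

The MN holding the original plus the two selected targets yields the three
de-correlated instances described above.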

Depending on its data replication policy, a MN may fall into one of several
classes (a minimal decision sketch follows the list):

1. Read-Only: MNs that are unwilling or unable to hold replicas (whether such
   MNs are admitted by DataONE is up to the Governance Working Group).

2. Replication-Open: MNs that are willing to accept whatever content a CN
   tells them to replicate (up to capacity). These nodes will try to honor
   any access control rules that are applied to the data objects they
   replicate. If a MN is unable to honor a given access rule, the associated
   data object will be “darkived”, namely, stored within a dark archive that
   is not accessible to general users.

3. Replication-Only: MNs that are set up by DataONE specifically to provide
   replication services and that do not provide original content of their own.
   They are designed for capacity and capability, and are willing to accept
   whatever content a CN tells them to replicate (up to capacity). They are
   able to honor any access restriction rules defined within the DataONE
   system.
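
The classes above reduce to a small decision rule when a CN asks a MN to hold
a replica. The following sketch is illustrative only; the enum and flags are
assumptions, not DataONE system metadata fields.

.. code-block:: python

   from enum import Enum

   class NodeClass(Enum):
       READ_ONLY = 1
       REPLICATION_OPEN = 2
       REPLICATION_ONLY = 3

   def replica_disposition(node_class, has_capacity, can_honor_rules):
       """Decide how a MN handles a replica request from a CN."""
       if node_class is NodeClass.READ_ONLY or not has_capacity:
           return "decline"
       if can_honor_rules:
           return "store"
       # A Replication-Open node that cannot enforce an access rule
       # stores the object in a dark archive instead ("darkive").
       return "darkive"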

Replication will be triggered automatically by content registration. While it
is desirable to maintain three instances of each data object, over time there
may arise practical limits to replication due to a number of changeable
factors:

- MN capacities and/or MN capabilities
- willingness of MNs to be targets of replication (to hold copies)
- willingness of MNs to permit outbound copies
- perceived value of data relative to its size
- changes in access control restrictions

Refresh and verify the copies
.............................

MN guidelines also call for the common-sense and customary practice of periodic
“media refresh”, which is the copying of data from old physical recording
devices to new physical recording devices to avoid errors due to media
degradation and vendor de-support.

Damage or corruption in those copies is detected by periodically re-computing
checksums (e.g., SHA-256 digests) for randomly selected datasets and comparing
them with checksums securely stored at the CNs. Any bit-level changes detected
can be repaired by copying from an unchanged copy. This kind of “pop quiz”
cannot be cheated by simply reporting back a previously computed checksum; the
actual MN replica data is requested and the checksum recomputed. Sampling
only a subset of the data is appropriate, as it is not computationally
feasible to exhaustively and continuously check the volume of content that
DataONE anticipates holding.
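
A minimal sketch of this spot-check follows; retrieve_replica() is a
hypothetical stand-in for fetching a replica’s bytes from a MN, not a DataONE
API call.

.. code-block:: python

   import hashlib
   import random

   def retrieve_replica(pid):
       """Hypothetical stand-in: stream the replica's bytes from a MN."""
       yield b"example bytes"  # in practice, chunks fetched over the network

   def spot_check(pids, stored_digests, sample_size=10):
       """Re-fetch a random sample of replicas and recompute SHA-256."""
       corrupted = []
       for pid in random.sample(pids, min(sample_size, len(pids))):
           digest = hashlib.sha256()
           for chunk in retrieve_replica(pid):
               digest.update(chunk)
           if digest.hexdigest() != stored_digests[pid]:
               corrupted.append(pid)  # repair by copying an intact replica
       return corrupted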

2.  Protect the form, meaning, and behavior of the bits
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Digital content has three related aspects that must be considered when
planning and performing preservation functions: content has a specific digital
form; that form encapsulates a given abstract meaning; and that meaning is
recovered for use through appropriate behaviors applied against the form.
Preservation of form ensures that the low-level structure of the content is
preserved; preservation of meaning ensures that the semantics of the content
are recoverable, at least in theory; and preservation of behavior ensures that
the semantics are recoverable in practice.

For example, consider a dataset of environmental samples. At the structural
level these numeric data are organized in a tabular fashion. But their full
meaning is only recoverable by knowing the variables and units of measure
associated with each column in the table. If the data are represented in
binary, rather than textual form, then use of the data also depends on an
appropriate software application that can expose the information in a directly
human-useable form. DataONE metadata standards should incorporate schemas to
document and describe data in terms enabling preservation of form, meaning,
and behavior.
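
For instance, column-level descriptions along the following lines are what
allow the meaning of a bare numeric table to be recovered. This is a sketch
only; the field names are illustrative, loosely in the spirit of attribute
descriptions in schemas such as the Ecological Metadata Language (EML).

.. code-block:: python

   # Illustrative column-level metadata for the environmental-sample table.
   columns = [
       {"name": "site_id", "type": "string",
        "definition": "Identifier of the sampling site"},
       {"name": "air_temp", "type": "float", "unit": "celsius",
        "definition": "Air temperature at 2 m above ground"},
       {"name": "sampled_at", "type": "datetime", "unit": "ISO 8601",
        "definition": "Date and time the sample was taken"},
   ]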

Preservation activities in this tier fall into four conceptual categories: 

- Know your rights
- Know what you have and share that knowledge
- Cope with obsolescence 
- Watch the copies, yourself, and the world 

Know your rights
................

Ultimately, having the bits and their meaning is useless if one doesn’t also
have the legal right (a) to hold the data, (b) to make copies and derivatives
in performance of preservation management (such as replication and migration),
and (c) to transfer those same rights to a successor archive. Just as
important is to know specifically who owns the original data and *whether* those
rights have been granted.

As a start we strongly encourage providers to assign “Creative Commons Zero”
(CC0) licenses to all contributed data, which facilitates preservation and
does not prevent data providers from requiring an attribution statement as a
condition of re-use.

Know what you have and share that knowledge
...........................................

Understanding the content being preserved in the DataONE network is a function
of metadata that describes various aspects of the content. This metadata comes
from a variety of sources, including automated characterization of content at
the point of acquisition or submission to the DataONE network, and direct
contribution from content creators, curators, and consumers. Characterization
is important in order to determine the significant properties of digital
resources that must be preserved over time, and for purposes of
classification, since many preservation actions will take place in an
automated fashion on groups of similar resources. DataONE will encourage use
of data formats that are open, transparent, widely used, and non-encrypted,
since such formats are inherently more amenable to long-term preservation.

Data providers will be strongly encouraged to use automated characterization
tools such as DROID (http://droid.sourceforge.net/) and JHOVE2
(https://bitbucket.org/jhove2/main/wiki/Home). While most existing
characterization tools focus on general formats for cultural heritage
content, the DataONE community can contribute to development that will
result in increased coverage of scientific formats.
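
As a placeholder for the kind of record such tools produce, the following
sketch captures basic form-level properties using only the Python standard
library; real characterization with DROID or JHOVE2 yields far richer output.

.. code-block:: python

   import hashlib
   import mimetypes
   import os

   def characterize(path):
       """Record basic form-level properties of a file."""
       sha256 = hashlib.sha256()
       with open(path, "rb") as f:
           for chunk in iter(lambda: f.read(65536), b""):
               sha256.update(chunk)
       media_type, _ = mimetypes.guess_type(path)
       return {
           "path": path,
           "size_bytes": os.path.getsize(path),
           "media_type": media_type or "application/octet-stream",
           "sha256": sha256.hexdigest(),
       }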

Rich preservation metadata will be managed at varying levels of granularity,
including individual files, aggregations of files that collectively form a
single coherent resource, and meaningful subsets of files. In addition to
describing primary data sets, metadata is also necessary for associated
toolkit workflows, which are themselves first-class data objects needing
DataONE preservation. In some cases, characterization metadata will consist of
references to information managed in external technical registries such as
UDFR (http://udfr.org/) and PRONOM
(http://www.nationalarchives.gov.uk/PRONOM/Default.aspx). DataONE is fortunate
to have partnered with the primary maintainers of JHOVE2 and UDFR.

Preservation metadata will be expressed and managed within the DataONE network
using a variety of science metadata schemas. In addition to providing a
means for accurate description and citation of resources managed by DataONE,
this information also will be exposed for search and retrieval by external
indexing, search, and notification services.

Cope with obsolescence
......................

*Migration* and *emulation* are sub-strategies that DataONE will use in the
event that formats become obsolete. At some time in the future, one may expect
that available contemporary hardware and software will be unable to render or
otherwise use bits saved in some formats.

*Migration* is used to convert from older to newer formats; all converted
content is subject to “before” and “after” characterization to ensure semantic
invariance. *Emulation* effectively preserves older computing environments in
order to retain the experience of rendering older formats; once considered a
specialized intervention, emulation has become a more viable technique with
recent developments in consumer and enterprise server virtualization
solutions.

Migration workflows need only be available on a subset of DataONE member
nodes, which can function as service utilities to the greater network. A
successful migration strategy requires versioning of the content, where all
versions are retained; the versioning that results from migration will be
reflected in the migrated content’s system metadata.
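
A minimal sketch of such a migration step follows; convert() and
significant_properties() are hypothetical stand-ins for a format converter
and for a profile of the properties that must survive the transformation.

.. code-block:: python

   def significant_properties(path):
       """Hypothetical stand-in: profile the properties that must remain
       invariant across migration (e.g., row counts, variables, units)."""
       return {"rows": 0, "variables": [], "units": []}  # placeholder

   def convert(src, dst):
       """Hypothetical stand-in for an old-to-new format converter."""

   def migrate(path_old, path_new):
       before = significant_properties(path_old)
       convert(path_old, path_new)
       after = significant_properties(path_new)
       if before != after:
           raise RuntimeError("migration altered significant properties")
       # On success, register path_new as a new version in system
       # metadata, retaining path_old so that all versions are kept.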

While emulation has become a more viable technique, it is important to
understand the technological dependencies of the component parts of the
workflow underlying the use of a particular resource. Emulation may become
difficult if various workflow components require multiple levels of emulation
support.

Watch the copies, yourself, and the world
.........................................

Action plans for mitigating preservation risks will be developed ahead of
the need for their application. Protection against obsolescence requires an
understanding of the technological dependencies underlying the use of each
resource. While some resources are easily manipulated using relatively standard and
long-lived desktop tools, others require highly specialized applications and
complex workflows.

DataONE will rely on existing notification services, such as AONS II (Pearson
2007), CRiB (Ferreira 2006), and PLATO
(http://www.ifs.tuwien.ac.at/dp/plato/intro.html). These services themselves
depend on other external technical registries such as PRONOM. Existing
coverage by these services may be limited to formats and tools geared towards
general applicability in the cultural heritage realm. DataONE will encourage
community effort to expand the scope of these services to understand technical
components specific to scientific disciplines. It is preferable to enhance
these existing frameworks and services so that obsolescence detection can take
place centrally or in a consistent federated manner, rather than in an ad hoc
and parallel manner.

Responses to obsolescence should be planned in advance of need and captured in
action plans (cf. the Planets template at
http://www.ifs.tuwien.ac.at/dp/plato/docs/plan-template.pdf and the FCLA
template at http://www.fcla.edu/digitalArchive/formatInfo.htm).

3.  Safeguard the guardians
~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the DataONE federated network and its member nodes were to disappear, that
would amount to total data loss.

Safeguard the federation
........................

By design, the DataONE network provides resiliency against the occasional loss
of nodes. While departure of a MN (or even a CN) from DataONE should not be
frequent, it is also not an unexpected occurrence. It is a feature of networks
that they can sustain such events by redistributing the assets and workload
among the remaining nodes. The arrival of a new node is even less disruptive.
The software infrastructure of the DataONE network, including the Investigator
Toolkit and the cyberinfrastructure protocol stacks, is open-source in order
to help ensure a life beyond the end of DataONE funding; open-source community
ownership improves not only buy-in and adoption, but also long-term external
support for the DataONE network.

The Sustainability and Governance Working Group is investigating a range of
issues in protecting the DataONE organization. These include CN and MN
succession planning, an analysis of the costs of preservation, the possibility
of services that offer accreditation for repositories, real-time monitoring,
and external auditing.

Safeguard the member nodes
..........................

Risks to the DataONE federated network are different from risks to individual
nodes. Some risks are reduced by the redundancy and geographic distribution
that the network provides. As for malicious threats that might increase due to
federation, these are addressed by the authentication and authorization
strategies that DataONE is developing with participation of TeraGrid security
experts. It is a core requirement of MNs that they conform to DataONE
authentication and authorization policies and protocols.

Not every data repository can qualify as a DataONE MN. Guidelines call for
organizations to be on a sound technical and financial footing and to adhere
to important standards. The DataONE network is in some respects only as secure
as its weakest link, so local Information Technology (IT) standards at the MNs
are critical.

MNs conform to IT management practices typically found in federal agencies and
higher education, which in turn are based on Payment Card Industry data
security standards (PCI DSS) and the widely adopted ITIL (IT Infrastructure
Library) best practices for such things as physical protection, electronic
perimeter control (firewalls), account management, and event logging for
forensic analysis. Adopters include financial service organizations, and large
technology, pharmaceutical, and manufacturing companies.

These standards call for common-sense practices such as periodic “media
refresh”, which is the copying of data from older to newer physical devices
and media with the aim of avoiding errors due to media degradation and vendor
de-support.