Mutability of Content in DataONE
================================

.. index:: mutability

.. contents::

Overview
--------

All content synchronized by DataONE is immutable, and so resolution of a
:term:`persistent identifier` (PID) will always result in a pointer (URI) to a set of
bytes that are in all respects identical to the original. Version 2.0 of the
DataONE APIs introduced the ability to associate an optional series identifier
(SID) with an object. Unlike a PID, resolution of a SID will always result in a
pointer (URI) to a set of bytes that represent the latest revision of an object.

A revision or obsolescence chain is constructed by setting the ``obsoletes``
property of the new object and the ``obsoletedBy`` property of the old object.
For example, here PID_B represents the latest revision of the object because it
obsoletes PID_A (object PID_A has a value of "PID_B" in its system metadata
:attr:`~Types.SystemMetadata.obsoletedBy` property, and object PID_B has a
value of "PID_A" in its system metadata :attr:`~Types.SystemMetadata.obsoletes`
property)::


    +------------+                      +------------+
    |            | ----- obsoletes ---> |            |
    |   PID_B    |                      |   PID_A    |
    |            | <--- obsoletedBy --- |            |
    +------------+                      +------------+    

                 resolve(PID_A) => PID_A
                 resolve(PID_B) => PID_B


In version 1.x of DataONE, it was necessary to manually follow the obsolescence
chain in order to find the latest version of an object. This process is
simplified in version 2.x and later through the use of series identifiers. The
previous example can be augmented with series identifiers::


    +------------+                      +------------+
    |            | ----- obsoletes ---> |            |
    |   PID_B    |                      |   PID_A    |
    |   SID_1    |                      |   SID_1    |
    |            | <--- obsoletedBy --- |            |
    +------------+                      +------------+    

                 resolve(PID_A) => PID_A
                 resolve(PID_B) => PID_B
                 resolve(SID_1) => PID_B


Each object in the obsolescence chain has the same value for the series
identifier ("SID_1"), and calling :func:`~CNRead.resolve` with the value "SID_1"
will result in the URIs from which the object "PID_B" may be retrieved, since
that object is the latest revision in the obsolescence chain.
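
The manual traversal that version 1.x required can be sketched as follows. This
is a minimal illustration rather than DataONE client code: the function name
``follow_obsoleted_by`` and the plain dictionary standing in for the
``obsoletedBy`` values held in system metadata are assumptions made for the
example.

.. code-block:: python

    def follow_obsoleted_by(pid, obsoleted_by):
        """Return the last PID reached by following obsoletedBy links.

        ``obsoleted_by`` is a hypothetical mapping of PID -> PID of the
        object that replaced it; PIDs with no successor are absent.
        """
        seen = set()
        # Stop at an object with no successor, or if a link loops back.
        while pid in obsoleted_by and pid not in seen:
            seen.add(pid)
            pid = obsoleted_by[pid]
        return pid


    chain = {"PID_A": "PID_B"}  # PID_A is obsoletedBy PID_B
    latest = follow_obsoleted_by("PID_A", chain)

With a series identifier the client no longer needs this loop, since resolving
"SID_1" leads directly to the latest revision.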

The availability of PIDs and SIDs means users may now refer to objects either by
PID, when it is necessary or appropriate to refer to the exact set of bytes
that represents an object, or by SID, when referring to the latest version of
an object. The former is important for repeatable analyses, since the same
content may be reliably referenced and retrieved. The latter is important for
referencing the most up-to-date revision of an object, and so may be useful,
for example, to perform an analysis with the latest information available.

Unless indicated otherwise, the DataONE version 2.x and later APIs will accept
either a PID or a SID when an identifier is specified as a request parameter.


Resolving Series Identifiers
----------------------------

In a perfect world, all obsolescence chains would have complete,
bi-directional links, and the latest version of an object could be determined
simply by examining the set of all objects with the same SID and selecting the
object that is not ``obsoletedBy`` anything else. Obsolescence chains may be
incomplete for various reasons, and in such situations resolution of series
identifiers should still operate consistently.

The following series of scenarios demonstrates the behavior of the DataONE
system when resolving a seriesId to a specific object. Resolution relies
primarily on the ``obsoletes`` and ``obsoletedBy`` entries, falling back to the
date when an object was added to a Member Node
(:attr:`~Types.SystemMetadata.dateUploaded`) to determine the newer version.

The following notation is used herein:

:|Pi|:               Refers to a Persistent Identifier (PID)

:|Si|:               Refers to a Series Identifier (SID)

:|ti|:               The value of :attr:`~Types.SystemMetadata.dateUploaded` for
                     an object

:|t1| < |t2|:        |t1| is older than |t2|

:|PiSjtk|:          An object with
                     :attr:`~Types.SystemMetadata.identifier` (PID) |Pi|, a
                     :attr:`~v2_0.Types.SystemMetadata.seriesId` (SID)
                     of |Sj|, and a :attr:`~Types.SystemMetadata.dateUploaded`
                     of |tk|.

:|Pi| |b| |Pj|:      |Pi| has an :attr:`~Types.SystemMetadata.obsoletedBy`
                     entry that contains the value |Pj|

:|Pi| |o| |Pj|:      |Pj| has an :attr:`~Types.SystemMetadata.obsoletes`
                     entry that contains the value |Pi|

:|Pi| |O| |Pj|:      |Pi| has an
                     :attr:`~Types.SystemMetadata.obsoletedBy` entry that
                     contains the value |Pj| and |Pj| has an
                     :attr:`~Types.SystemMetadata.obsoletes` entry that
                     contains the value |Pi|.

:|Pi| |x| |Pj|:      Neither :attr:`~Types.SystemMetadata.obsoletedBy`
                     nor :attr:`~Types.SystemMetadata.obsoletes` is set by
                     |Pi|  or |Pj|.

:``??``:             Object was not synchronized, and so unknown to DataONE

:|rSi|: Resolving SID |Si| results in PID |Pj|


.. |Pi| replace:: :math:`P_i`
.. |Pj| replace:: :math:`P_j`
.. |P1| replace:: :math:`P_1`
.. |P2| replace:: :math:`P_2`
.. |P3| replace:: :math:`P_3`
.. |P4| replace:: :math:`P_4`
.. |Si| replace:: :math:`S_i`
.. |Sj| replace:: :math:`S_j`
.. |S1| replace:: :math:`S_1`
.. |S2| replace:: :math:`S_2`
.. |ti| replace:: :math:`t_i`
.. |tk| replace:: :math:`t_k`
.. |t1| replace:: :math:`t_1`
.. |t2| replace:: :math:`t_2`
.. |t3| replace:: :math:`t_3`
.. |_| unicode:: 0xA0
   :trim:
.. |PiSjtk| replace:: :math:`P_i \binom{S_j}{t_k}`
.. |rSi| replace:: :math:`resolve(S_i) \Rrightarrow P_j`
.. |o| replace:: :math:`\leftarrow`
.. |b| replace:: :math:`\rightarrow`
.. |O| replace:: :math:`\leftrightarrows`
.. |x| replace:: :math:`\square`



Case 1
~~~~~~


.. math::
   :label: c1

   P_1\binom{S_1}{t_1} & \leftrightarrows P_2\binom{S_1}{t_2} \\
   t_1 & < t_2 \\
   resolve(S_1) & \Rrightarrow P_2

A set of objects :math:`O = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

All objects in :math:`O` are participants in an obsolescence chain since |P2|
:attr:`~Types.SystemMetadata.obsoletes` |P1| and |P1| is
:attr:`~Types.SystemMetadata.obsoletedBy` |P2|.

All elements of the obsolescence chain :math:`P_1 \leftrightarrows P_2` have the 
same series identifier, |S1|.

The :attr:`~Types.SystemMetadata.dateUploaded` of |P1| is older than that of
|P2|.

This is a perfect obsolescence chain and resolving |S1| will result in the
object identified by |P2|.


Case 2
~~~~~~

.. math::
   :label: c2

   P_1\binom{S_1}{t_1}\; & \square \; P_2\binom{S_1}{t_2} \\
   t_1 & < t_2 \\
   resolve(S_1) & \Rrightarrow P_2

A set of objects :math:`O = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

No obsolescence information associates objects in :math:`O`.

The :attr:`~Types.SystemMetadata.dateUploaded` of |P1| is older than that of
|P2|.

No obsolescence assertions are made, so resolution is inferred by the most
recent value of :attr:`~Types.SystemMetadata.dateUploaded`.


Case 3
~~~~~~

.. math::
   :label: c3

   P_1\binom{S_1}{t_1}\; & \leftarrow \; P_2\binom{S_1}{t_2} \\
   t_1 & < t_2 \\
   resolve(S_1) & \Rrightarrow P_2

A set of objects :math:`O = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

All objects in :math:`O` are participants in an obsolescence chain since |P2|
:attr:`~Types.SystemMetadata.obsoletes` |P1| even though |P1| does not assert
it is :attr:`~Types.SystemMetadata.obsoletedBy` |P2|.

All elements of the obsolescence chain :math:`P_1 \leftarrow P_2` have the 
same series identifier, |S1|.

The :attr:`~Types.SystemMetadata.dateUploaded` of |P1| is older than that of
|P2|.

This is a damaged, but consistent obsolescence chain and resolving |S1| will
result in the object identified by |P2|.


Case 4
~~~~~~

.. math::
   :label: c4

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \leftrightarrows 
   P_3\binom{S_2}{t_3}\\ 
   t_1 < t_2 & < t_3 \\
   resolve(S_1) &\Rrightarrow P_2 \\
   resolve(S_2) &\Rrightarrow P_3 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_3 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O = O_{S_1} \cup O_{S_2}` all participate in a full,
bi-directional obsolescence chain.

In this case resolving |S1| will result in |P2|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.

Resolving |S2| will result in |P3|.


Case 5
~~~~~~

.. math::
   :label: c5

   P_1\binom{S_1}{t_1} \leftarrow 
   P_2\binom{S_1}{t_2}& \leftarrow 
   P_3\binom{S_2}{t_3}\\ 
   t_1 < t_2 & < t_3 \\
   resolve(S_1) &\Rrightarrow P_2 \\
   resolve(S_2) &\Rrightarrow P_3 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_3 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O = O_{S_1} \cup O_{S_2}` all participate in a damaged, though
consistent obsolescence chain.

In this case resolving |S1| will result in |P2|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.

Resolving |S2| will result in |P3|.


Case 6
~~~~~~

.. math::
   :label: c6

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \leftrightarrows 
   P_3\binom{}{t_3}\\ 
   t_1 < t_2 & < t_3 \\
   resolve(S_1) &\Rrightarrow P_2 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O = O_{S_1} \cup \lbrace P_3 \rbrace` all participate in an
obsolescence chain.

In this case resolving |S1| will result in |P2|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.


Case 7
~~~~~~

.. math::
   :label: c7

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \leftrightarrows 
   P_3\binom{}{t_3} \leftrightarrows 
   P_4\binom{S_2}{t_4} \\
   t_1 < t_2 & < t_3 < t_4\\
   resolve(S_1) &\Rrightarrow P_2 \\
   resolve(S_2) &\Rrightarrow P_4

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_4 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O = O_{S_1} \cup \lbrace P_3 \rbrace \cup O_{S_2}` all
participate in an obsolescence chain.

In this case resolving |S1| will result in |P2|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.

Resolving |S2| will result in |P4|.


Case 8
~~~~~~

.. math::
   :label: c8

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \rightarrow
   ?? \leftarrow 
   P_4\binom{S_1}{t_4} \\
   t_1 < t_2 & < t_4\\
   resolve(S_1) &\Rrightarrow P_4 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O_{S_1}` all participate in an obsolescence chain; however, the
chain is broken with no way to traverse between |P2| and |P4|, because the
object that |P2| names as ``obsoletedBy`` and |P4| names as ``obsoletes`` is
not recorded by the DataONE Coordinating Nodes (does not resolve).

In this case resolving |S1| will result in |P4| since that is the most recent
object in the set of objects :math:`O_{S_1}`. 


Case 9
~~~~~~

.. math::
   :label: c9

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \;\square\;
   ?? \leftarrow 
   P_4\binom{S_1}{t_4} \\
   t_1 < t_2 & < t_4\\
   resolve(S_1) &\Rrightarrow P_4 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O_{S_1}` all participate in an obsolescence chain; however, the
chain is broken with no way to traverse between |P2| and |P4|, because the
object that |P4| indicates it ``obsoletes`` is not recorded by the DataONE
Coordinating Nodes (does not resolve).

In this case resolving |S1| will result in |P4| since that is the most recent
object in the set of objects :math:`O_{S_1}`. 


Case 10
~~~~~~~

.. math::
   :label: c10

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \rightarrow
   P_{del}\binom{}{} \leftarrow 
   P_4\binom{S_1}{t_4} \\
   t_1 < t_2 & < t_4\\
   resolve(S_1) &\Rrightarrow P_4 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace` have the same series 
identifier, |S1|. 

The object :math:`P_{del}` was deleted from the system, so the identifier is
known, but the object and associated system metadata are no longer available.

Objects :math:`O_{S_1}` all participate in an obsolescence chain; however, the
chain is broken with no way to traverse between |P2| and |P4|, because the
object that |P2| names as ``obsoletedBy`` and |P4| names as ``obsoletes``,
:math:`P_{del}`, has been deleted and can no longer be retrieved.

In this case resolving |S1| will result in |P4| since that is the most recent
object in the set of objects :math:`O_{S_1}`. 


Case 11
~~~~~~~

.. math::
   :label: c11

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2}& \leftrightarrows 
   archived\biggl[P_3\binom{S_1}{t_3}\biggr] \\ 
   t_1 < t_2 & < t_3 \\
   resolve(S_1) &\Rrightarrow P_3 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O_{S_1}` all participate in an obsolescence chain.

Object |P3| has been archived, and so is not discoverable.

In this case resolving |S1| will result in |P3| which is the most recent
object in the obsolescence chain even though it is archived. 


Case 12
~~~~~~~

.. math::
   :label: c12

   P_1\binom{S_1}{t_1} & \leftrightarrows 
   P_2\binom{S_1}{t_2} \rightarrow
   ?? \\ 
   t_1 & < t_2 \\
   resolve(S_1) &\Rrightarrow P_2 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O_{S_1}` participate in an obsolescence chain which is damaged
by |P2| indicating it is ``obsoletedBy`` some object that is not resolvable.

In this case resolving |S1| will result in |P2| which is the most recent
resolvable object in the obsolescence chain.


Case 13
~~~~~~~

.. math::
   :label: c13

   P_1\binom{S_1}{t_1} & \leftarrow 
   P_2\binom{S_1}{t_2} \rightarrow
   ?? \\ 
   t_1 & < t_2 \\
   resolve(S_1) &\Rrightarrow P_2 \\

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

Objects :math:`O_{S_1}` participate in a damaged obsolescence chain since |P2|
indicates it is ``obsoletedBy`` some object that is not resolvable, and |P1|
does not assert it is ``obsoletedBy`` |P2|.

In this case resolving |S1| will result in |P2| which is the most recent
resolvable object in the obsolescence chain.


Case 14
~~~~~~~

.. math::
   :label: c14

   P_1\binom{S_1}{t_1} \leftarrow 
   P_2\binom{S_1}{t_2}& \rightarrow 
   P_3\binom{S_2}{t_3}\\ 
   t_1 < t_2 & < t_3 \\
   resolve(S_1) &\Rrightarrow P_2 \\
   resolve(S_2) &\Rrightarrow P_3

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_3 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O = O_{S_1} \cup O_{S_2}` all participate in a damaged
obsolescence chain, with |P1| not indicating it is obsoleted by |P2|, and |P3|
not indicating that it obsoletes |P2|.

In this case resolving |S1| will result in |P2|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.

|S2| will resolve to |P3|.


Case 15
~~~~~~~

.. math::
   :label: c15

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2} \; & \square \;
   ?? \leftarrow 
   P_4\binom{S_1}{t_4} \leftrightarrows
   P_5\binom{S_2}{t_5}\\
   t_1 < t_2 & < t_4 < t_5\\
   resolve(S_1) &\Rrightarrow P_4 \\
   resolve(S_2) &\Rrightarrow P_5

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_5 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O = O_{S_1} \cup O_{S_2}`, together with an object (``??``)
that was not synchronized, all participate in a damaged obsolescence chain
with no assertion of the relationship between |P2| and |P4|.

In this case resolving |S1| will result in |P4|, which is not the most recent
object in the obsolescence chain but is the newest version in the
obsolescence chain identified by |S1|.

Resolving |S2| will result in :math:`P_5`.


Case 16
~~~~~~~

.. math::
   :label: c16

   P_1\binom{S_1}{t_1} \leftarrow 
   P_2\binom{S_1}{t_2} & \rightarrow 
   ?? \leftarrow 
   P_4\binom{S_2}{t_4} \\
   t_1 < t_2 & < t_4\\
   resolve(S_1) &\Rrightarrow P_2 \\
   resolve(S_2) &\Rrightarrow P_4

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2 \rbrace` have the same series 
identifier, |S1|.

A set of objects :math:`O_{S_2} = \lbrace P_4 \rbrace` has the series 
identifier, |S2|.

Objects :math:`O_{S_1}` and :math:`O_{S_2}` are both damaged obsolescence
chains, though the Coordinating Nodes may infer an association between them:
even though the object that |P2| is ``obsoletedBy`` and the object that |P4|
``obsoletes`` can not be resolved, :math:`P_2.obsoletedBy` and
:math:`P_4.obsoletes` are the same value.

In this case resolving |S1| will result in |P2| which is the most recent
resolvable object in the obsolescence chain. 

Resolving |S2| will result in :math:`P_4`.


Case 17
~~~~~~~

.. math::
   :label: c17

   P_1\binom{S_1}{t_1} \leftarrow 
   P_2\binom{S_1}{t_2} & \rightarrow 
   ?? \leftarrow 
   P_4\binom{S_1}{t_4} \\
   t_1 < t_2 & < t_4\\
   resolve(S_1) &\Rrightarrow P_4

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_4 \rbrace` have the same
series identifier, |S1|.

Objects :math:`O_{S_1}` form a damaged obsolescence chain, though it can be
inferred that |P2| is ``obsoletedBy`` and |P4| ``obsoletes`` the same object:
even though that object can not be resolved, :math:`P_2.obsoletedBy` and
:math:`P_4.obsoletes` are the same value.

In this case resolving |S1| will result in |P4|.



Case 18
~~~~~~~

.. math::
   :label: c18

   P_1\binom{S_1}{t_1} \leftrightarrows 
   P_2\binom{S_1}{t_2} & \rightarrow 
   ?? \; \square \; ?? \leftarrow
   P_5\binom{S_1}{t_5} \\
   t_1 < t_2 & < t_5\\
   resolve(S_1) &\Rrightarrow P_5


A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_5 \rbrace` have the same
series identifier, |S1|.

The obsolescence chain :math:`O_{S_1}` is broken, with no way to traverse from
|P2| to :math:`P_5`.

The :attr:`~Types.SystemMetadata.dateUploaded` places :math:`P_5` as the newest
object with the series Id of |S1|.

Resolving |S1| results in :math:`P_5`.


Case 19
~~~~~~~

.. math::
   :label: c19

   P_1\binom{S_1}{t_1} \leftarrow 
   P_2\binom{S_1}{t_2} & \leftarrow 
   P_3\binom{S_1}{t_3} \\
   t_1 > t_2 & > t_3\\
   resolve(S_1) &\Rrightarrow P_3

A set of objects :math:`O_{S_1} = \lbrace P_1, P_2, P_3 \rbrace` have the same
series identifier, |S1|.

Objects :math:`O_{S_1}` form a damaged obsolescence chain since only
``obsoletes`` values are specified.

The :attr:`~Types.SystemMetadata.dateUploaded` of |P1| is newer than |P2|,
which in turn is newer than |P3|.

In this case resolving |S1| will result in |P3| even though |P1| is the most
recent object since the obsolescence chain overrides the times.



Referencing Content by Identifier
---------------------------------

The use of the PID or SID for either citation or analysis workflows is up to the
user and is context dependent. In general, DataONE anticipates that ``DATA`` and
``RESOURCE_MAP`` objects will be referenced by PID, to ensure reproducibility,
and that ``METADATA`` documents will be referenced by SID, to take advantage of
any data curation or correction efforts that would not otherwise affect
scientific reproducibility. Additionally, clues to the content submitter's
preference can be found in the format of the identifiers themselves. For
example, DOIs and EZIDs take a recognizable format and are often encouraged in
scientific communities for citations, so an end user might take that into
consideration when deciding which identifier to choose.

.. TODO::

    guidance on RESOURCE_MAPS - initial thoughts: depends on references to DATA
    objects, whether they be SIDs or PIDs


Assigning Identifiers
---------------------

Depending on the Member Node used as the primary repository, content originators
may have some choice in assigning identifiers. For those that do, it is advised
that they assign PIDs and SIDs according to the typical usage pattern described
above.

Some Member Nodes may not preserve past versions of content, in which case the
PID is likely to be automatically generated, and the submitter only has to
determine the SID, and may not need to know the difference between the SID and
PID.  Other Member Nodes may still be at v1 of the DataONE APIs and only allow
assignment of the PID.


Limits on the Series
--------------------

The SID is used to represent an object that may vary modestly over time, but
remains conceptually the same. Content contributors should be careful to apply
reasonable limits on the scope of documents such that an entity does not
deviate too much from the original item. If it does, a new, different series
should be initiated.


Requirements on Member Node Implementations
-------------------------------------------

For Member Nodes that employ a mutable content storage model, the only
additional DataONE requirement is that the Member Node generate a SystemMetadata
document for the updated content, containing:

  1. unique PID in systemMetadata.identifier field

  2. new checksum

  3. the previous PID in the systemMetadata.obsoletes field

Ideally, the SystemMetadata of versions that are no longer available will be
maintained, with the ``obsoletedBy`` field populated with the PID of the
version that replaced each one.
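
As a rough illustration of these requirements, the sketch below derives the
system metadata values for an updated object from those of the version it
replaces. It is not a DataONE library call: the function name
``describe_update``, the use of plain dictionaries keyed by system metadata
field names, and the choice of SHA-256 for the checksum are all assumptions
made for the example.

.. code-block:: python

    import hashlib


    def describe_update(previous, new_pid, content, algorithm="sha256"):
        """Return (new, retained) system metadata values as dictionaries.

        ``previous`` is a hypothetical dictionary holding the fields of the
        version being replaced; it is not modified.
        """
        if new_pid == previous["identifier"]:
            raise ValueError("updated content needs a new, unique PID")
        new = {
            "identifier": new_pid,                       # 1. unique PID
            "checksum": hashlib.new(algorithm, content).hexdigest(),  # 2.
            "obsoletes": previous["identifier"],         # 3. previous PID
            "obsoletedBy": None,
        }
        # Ideally the old record is kept and points at its replacement.
        retained = dict(previous, obsoletedBy=new_pid)
        return new, retained


    old = {"identifier": "PID_A", "checksum": "0" * 64,
           "obsoletes": None, "obsoletedBy": None}
    new, retained = describe_update(old, "PID_B", b"revised bytes")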

Some Member Nodes may opt to preserve recent back-versions to aid the complete
capture of versions by the DataONE network via synchronization.


Reassignment of AuthoritativeMemberNode field for unhosted versions
-------------------------------------------------------------------

to be determined


Replication of unhosted back-versions
-------------------------------------

DataONE will attempt to synchronize all versions it is made aware of through the
synchronization process, but may miss short-lived versions that exist only
between two of the Member Node's synchronization runs. Note also that the
synchronization schedule is not guaranteed: periods of DataONE maintenance may
suspend synchronization, and high CN load could prolong the synchronization
interval.

Member Nodes keen to make sure versions have the highest chance of
synchronization can choose to issue a :func:`CNCore.synchronize` command that
will put the item on the synchronization queue instead of waiting for the
harvest interval.

Conversely, if the Member Node expressly does not want DataONE to preserve
back-versions, it can set the systemMetadata.replicationPolicy.numberReplicas
field to 0.


Synchronizing Content from Mutable Member Nodes
-----------------------------------------------

At its core, DataONE is in the business of preserving definite versions of
content through centrally coordinated peer-to-peer replication. That is, DataONE
Coordinating Nodes direct certain Member Nodes to replicate newly synchronized
objects from the originating Member Node to better preserve them. New versions
of objects appear as first-class immutable objects with unique PIDs, even if
originating from mutable Member Nodes.

From the DataONE perspective the only difference between objects from mutable
Member Nodes and immutable Member Nodes is the completeness of the series of
versions it is able to synchronize and replicate.



The Problem
-----------

Current DataONE replication processes and fixity checks depend on content
identified by a PID that does not change. If this were not enforced, mutable
content from a member node would not be differentiated from corrupt copies of
the object and our replication and recovery features would attempt to correct
the byte inconsistency. The immutability requirement helps to ensure
reproducible results of any use of an object. Any analysis on a data set
repeated sometime in the future should yield identical results (within the
limits of precision of the analytical tools) and this is one of the major
guiding principles in creating DataONE as a long term data repository
federation. By simply overwriting existing content using the same identifier,
nodes cannot be relied upon for repeatable retrieval of content.


Proposal
--------

The proposal for supporting "mutable" content is to allow a series identifier
(SID) to facilitate the semantics of citing an object at the conceptual level,
instead of the version level. As content changes over time, new identifiers
(PIDs) will still be used to mark each change, but the conceptual object can
continue to be referred to with an unchanging identifier (SID). The member node
will be responsible for creating each version and assigning a unique PID to it
and these objects will be synchronized and replicated to other DataONE member
nodes as they are today. So instead of allowing content to be directly modified,
we are allowing strongly-versioned chains to be referenced by an identifier; and
relaxing the requirement that all revisions be resolvable forever.


The Series Identifier
---------------------

The proposed solution is to model and implement a "series identifier" (SID)
along with modified services that would work with both SIDs and PIDs. From a
DataONE perspective, the series identifier would be assigned to all versions of
an object, be unique in DataONE (assigned to only one version chain), and be
reserved just as PIDs are, from the same namespace. The series identifier, once
assigned to the version chain, would similarly be immutable, and could apply to
all new versions of the item. To steer users toward a single identifier for
citations, the cardinality of the citation identifier is assumed to be 0..1.
The semantics for making API calls with a SID would, in general, be to return
responses as if the call were made with the most current PID.

Member Nodes that only maintain the latest version of an item would be required
to use a new PID for any updated content, and modify the System Metadata
appropriately so that the new version can be synchronized with the network. The
same SID would typically be used for the updated object, although we would allow
the revision chain to shift to a new SID as desired by the client and/or member
node.

It cannot be assumed that a user with an identifier in hand knows whether it is
a SID or a PID, so DataONE expects the user to refer to the System Metadata once
it has the item to determine if the identifier used in the call matches the PID
or the SID.  Similarly, they could interrogate search results for the same
information.  For high-level interfaces, like D1Client.getD1Object(id), the PID
of the object returned may or may not match the passed in 'id'.  So, high-level
functions or applications that use resolve will have to make sure they handle
the new resolving semantics.
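
The check described above amounts to comparing the identifier used in the call
with two fields of the returned System Metadata. The following is a minimal
sketch under stated assumptions: ``identifier_kind`` is a hypothetical helper,
and the System Metadata is represented as a plain dictionary keyed by the
``identifier`` and ``seriesId`` field names.

.. code-block:: python

    def identifier_kind(requested_id, sysmeta):
        """Report whether ``requested_id`` was the object's PID or its SID.

        ``sysmeta`` is a hypothetical dictionary holding the ``identifier``
        and (optional) ``seriesId`` values of the object that was returned.
        """
        if requested_id == sysmeta.get("identifier"):
            return "PID"
        if requested_id == sysmeta.get("seriesId"):
            return "SID"
        raise ValueError("identifier does not match the returned object")


    returned = {"identifier": "PID_B", "seriesId": "SID_1"}
    kind = identifier_kind("SID_1", returned)

When the result is "SID", the caller should record ``identifier`` from the
returned System Metadata if it later needs to cite the exact bytes retrieved.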

It is recommended that search indexes include a search field for the
series identifier that can also be returned in the results.


Semantics of "Current"
~~~~~~~~~~~~~~~~~~~~~~

A SID chain closes with two types of ends:

Type 1: An object on the SID chain doesn't have the "obsoletedBy" field.

Example::

  P1(S1) ⟺ P2(S1)

``P2`` is a type 1 end.

Type 2: An object on the SID chain does have the "obsoletedBy" field, but the
PID in the "obsoletedBy" field has a different SID (including no SID value).

Examples::

  P1(S1) ⟺ P2(S2)

  P1(S1) ⟺ P2()

``P1`` is a type 2 end on both chains.

It is tricky to determine a type 2 end if the object in the "obsoletedBy" field
is missing. For example, ``P1(S1) ⟺ P2(S1) ⟹ ??``. We don't know the series id
of the object "??". So we generally consider it a type 2 end unless we are sure
it is not an end, that is, unless there is another object in the chain (with
the same series id) that obsoletes the missing object.

In the previous example, ``P1(S1) ⟺ P2(S1) ⟹ ??``, P2 is a type 2 end
(case 12).

However, in ``P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S1)``, P2 is not an end (case 8):
"??" is in the obsoletes field of P4, which has the same series id S1, so we
are sure that "??" has the series id S1 as well.

In ``P1(S1) ⟺ P2(S1) ⟹ ?? ⟸ P4(S2)``, P2 is a type 2 end even though "??" is
in the obsoletes field of P4, because P4 has a different series id, S2, and so
we cannot be sure whether "??" has the series id S1 or S2.

Ideally, if there is one and only one end on a SID chain, this end will be the
HEAD (current) version. Chains of this kind are called ideal chains.

If there is more than one end on a SID chain because the system metadata is
incomplete, it is hard to determine which one is the real end. Such a chain is
not an ideal chain, and the following mechanism is used to determine the HEAD
version:

1. Choose the end with the latest dateUploaded in the chain as the temporary
   HEAD version. This rule works if the upload time stamps of the objects are
   in the correct order.
2. If the time stamps are out of order, test whether any object obsoletes the
   temporary HEAD on the obsolescence chain with the same SID. If nothing
   obsoletes the temporary HEAD, the temporary HEAD is the final HEAD;
   otherwise, the end of the obsolescence chain is the final HEAD.

Take the example ``P1(S1, t1) ⟸ P2(S1, t2) ⟸ P3(S1, t3)`` (case 19), where t1,
t2 and t3 are time stamps and t1 > t2 > t3. The time stamps are out of order:
the newest version, P3, was uploaded the earliest, while the oldest version,
P1, was uploaded the latest.

1. This chain has three type 1 ends: P1, P2 and P3. It is not an ideal chain.
2. Choose P1, which has the latest dateUploaded, as the temporary HEAD.
3. P2 obsoletes P1 and P3 obsoletes P2 on the obsolescence chain
   ``P1 ⟸ P2 ⟸ P3``, so the end of the whole chain, P3, is chosen as the final
   HEAD.
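
The end-detection rules and the HEAD mechanism described in this section can be
sketched in a few lines of Python. This is an illustrative model only, not the
Coordinating Node implementation: the ``Obj`` record, the ``is_end`` and
``resolve_head`` functions, the in-memory ``catalog`` dictionary and the use of
small integers in place of ``dateUploaded`` values are all assumptions made for
the example.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class Obj:
        pid: str
        sid: Optional[str]
        uploaded: int  # stand-in for dateUploaded; larger means later
        obsoletes: Optional[str] = None
        obsoleted_by: Optional[str] = None


    def is_end(obj, catalog):
        """Apply the type 1 / type 2 end rules to one object."""
        if obj.obsoleted_by is None:
            return True  # type 1: no obsoletedBy value
        successor = catalog.get(obj.obsoleted_by)
        if successor is not None:
            return successor.sid != obj.sid  # type 2: SID differs or is unset
        # Successor is missing: an end unless a same-SID object obsoletes it.
        return not any(
            other.sid == obj.sid and other.obsoletes == obj.obsoleted_by
            for other in catalog.values()
        )


    def resolve_head(sid, catalog):
        """Return the PID of the HEAD version for ``sid``."""
        members = [o for o in catalog.values() if o.sid == sid]
        if not members:
            raise KeyError(sid)
        # Fall back to all members if inconsistent links leave no end.
        ends = [o for o in members if is_end(o, catalog)] or members
        head = max(ends, key=lambda o: o.uploaded)  # temporary HEAD
        visited = {head.pid}
        while True:
            # Does a same-SID object obsolete the current candidate?
            newer = next(
                (o for o in members
                 if o.obsoletes == head.pid and o.pid not in visited),
                None,
            )
            if newer is None:
                return head.pid
            visited.add(newer.pid)
            head = newer


    def catalog_of(*objects):
        return {o.pid: o for o in objects}


    # Case 8: the object between P2 and P4 ("P3" here) was never synchronized.
    case_8 = catalog_of(
        Obj("P1", "S1", 1, obsoleted_by="P2"),
        Obj("P2", "S1", 2, obsoletes="P1", obsoleted_by="P3"),
        Obj("P4", "S1", 4, obsoletes="P3"),
    )
    # Case 19: only obsoletes links, with time stamps in reverse order.
    case_19 = catalog_of(
        Obj("P1", "S1", 3),
        Obj("P2", "S1", 2, obsoletes="P1"),
        Obj("P3", "S1", 1, obsoletes="P2"),
    )

Under these assumptions ``resolve_head("S1", case_8)`` selects P4 and
``resolve_head("S1", case_19)`` selects P3, in line with cases 8 and 19.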




Version Storage
^^^^^^^^^^^^^^^
Mutable content implies that back-versions of content may not be readily available
on the nodes that originally produce the content. For metadata and resource maps,
the coordinating nodes will store previous versions of objects during the synchronization process,
but any data updates will result in only the latest version being available at the originating node.
If the data objects were replicated (as is the hope), it is likely that previous versions of
the data can still be resolved from replica target nodes, though this is dependent on replication policies,
synchronization schedules and the availability of replica storage across the federation.

The current DataONE storage model, through the MN_Storage.update method, places
responsibility for storing versions squarely on the submitter. Each update to the object requires
a new unique identifier (PID) and must state which PID the new version is obsoleting.
We will continue to require that unique PIDs are provided for each
and every version of an object, but the member node will not be required to maintain a copy of previous revisions
if it chooses not to. An optional series identifier (SID) can be provided with object SystemMetadata
to group revisions together and to provide a convenient way to refer to the latest version of the object.


Version preservation
~~~~~~~~~~~~~~~~~~~~~
As is currently the case, the member node should maintain all versions of content using
unique identifiers (PID) and synchronization will harvest each new revision to the network.
While there will be no requirement that the member node continue to make available the object identified
by the obsoleted PID, the hope is that it will persist the data history as best it can.
If the objects in the revision chain have a SID assigned, the new PID will be considered the latest
version of this series.

The member node can allow access to the current version of the object using MN_Read.get(sid) as a convenience.
Any reference to the SID resolves to the latest version of the object, which may have a different checksum and PID
from the version that was current when a citation was distributed.

The member node must [minimally] maintain system metadata for the current revision of the object.
Any updated object is still required to be identified by a new unique PID, but would include the same SID used
in the previous version. The obsoletes field should indicate that the new PID replaces the previous PID.
The coordinating node learns about the updated content during synchronization because there are:

- a new PID
- an updated dateSystemMetadataUpdated timestamp
- an updated checksum (other fields may also be updated).

N.B. Multiple revisions between synchronization periods would not
result in multiple versions recorded in the federation - just the revision[s] that happened to be
synchronized would be persisted in DataONE. This leaves open the possibility of an end user retrieving a version from the MN that
will ultimately not be persisted in perpetuity.


Working drafts vs. Repository publishing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
DataONE essentially considers member nodes as the originators of *selected* versions of
content.  That is, not every intermediate revision on the way to a final
product should necessarily be saved for future reference.  Organizations following the mutable
content model for storage may wish to limit the objects returned by listObjects() to those that
are considered to be in their publishable form. Certainly these objects can later be updated as needed,
but minimizing draft-status objects will reduce the amount of [possibly irretrievable] draft content floating around
the federated network.



Types of Mutable Objects
^^^^^^^^^^^^^^^^^^^^^^^^
As illustrated in the optional use cases, the rate and regularity of change of
objects can be widely variable. The more frequent the change, the less likely
that all versions would need to be reproduced, and the utility of complete version
history diminishes.  One can imagine a member node serving up an unrecorded data
stream, such as a web-cam, delaying creating a version until a user calls MN.get()
on the item, by tee'ing the output stream to file while returning the object.

Additionally the need to keep past versions may be less important for metadata
objects (correcting typos that do not change the meaning or interpretation of the data)
than data objects or resource maps.

Accumulating datasets
~~~~~~~~~~~~~~~~~~~~~
The use case of mutable data objects that grow with new records appended to the
end of a table, for example, was given as a common practice for some groups, and
one that would produce progressively redundant information with each persisted
version.  The motivation for rolling up records accumulated over time into a single object, instead
of publishing a new data file for each batch, is ease of use for end users. Using a SID to access
the data object will always give the latest snapshot of the data records, while old revisions
may or may not also be accessible.


Mixed metadata-data objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Objects like NetCDF files that include both metadata and data in the same object
will be managed with the same PID and SID considerations. If only the metadata portion of
the file is modified, the SID may remain the same, but a new PID and checksum must be created and
made available for synchronization. The old revision may immediately become inaccessible using the PID
and that is allowable under the proposal.


Retrieval / Citation Support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Implicit in the support for versioned content is support for retrieval of, or possibly
just resolution to, the current object bytes by the identifier assigned in the
originating system.  At a minimum CNs will be required to support calculating
which is the current version of a series of versions and returning it or its
identifier. This will be accomplished using the series identifier (SID) associated with object[s]
in a revision chain. The "current" version of an object is defined as the non-obsoleted object with
a SID that matches the requested identifier. Objects that are marked as "archived" may be returned as the
most current version, but they should not be seen in default search interfaces.
Since DataONE identifiers have no special formatting semantics, those following a citation
will not know by looking at the identifier whether it is referring to a specific
version (PID) or the latest version of the item (SID), so services may be provided to easily investigate
an entire version series. Existing services allow clients to deduce this information by inspecting the system
metadata for the identifier and following any obsolescence properties as needed.

Retrieval vs. Resolution
~~~~~~~~~~~~~~~~~~~~~~~~
Because the content of an object is retrieved in a separate call from its system
metadata, use of the SID for MN Read API calls is troublesome because the content may be updated
between the two calls.
It would be impossible to tell if the bytes retrieved were incorrect (bit rot) or correct (newer version)
when comparing checksums in this case. If data consistency is important to the caller, the PID should be used
to guarantee that only the expected bytes (or a NotFound exception) are returned by any MN.get calls.
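
A client-side pattern that avoids this race is to resolve the identifier once through the system metadata and then request the bytes by PID. The sketch below is only an illustration: ``fetch_pinned`` is a hypothetical helper, the two callables stand in for getSystemMetadata and get, and the dictionary keys are simplified stand-ins for the corresponding SystemMetadata fields.

```python
import hashlib
from typing import Callable, Dict, Tuple

SysMeta = Dict[str, str]


def fetch_pinned(
    identifier: str,
    get_system_metadata: Callable[[str], SysMeta],
    get_object: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Resolve ``identifier`` (PID or SID) once, then fetch and verify by PID."""
    sysmeta = get_system_metadata(identifier)   # a SID yields the HEAD revision
    pid = sysmeta["identifier"]
    data = get_object(pid)                      # the PID pins the exact bytes
    # Map names such as "SHA-256" onto hashlib names such as "sha256".
    algorithm = sysmeta["checksumAlgorithm"].replace("-", "").lower()
    digest = hashlib.new(algorithm, data).hexdigest()
    if digest != sysmeta["checksum"]:
        raise ValueError(f"checksum mismatch for {pid}")
    return pid, data


# In-memory stand-ins for a member node holding one revision in series SID_1.
content = {"PID_B": b"temperature,12.5\n"}
metadata = {
    "PID_B": {
        "identifier": "PID_B",
        "checksumAlgorithm": "SHA-256",
        "checksum": hashlib.sha256(content["PID_B"]).hexdigest(),
    }
}
metadata["SID_1"] = metadata["PID_B"]   # the SID currently resolves to PID_B

pid, data = fetch_pinned("SID_1", metadata.__getitem__, content.__getitem__)
print(pid)
```

Because the second call uses the PID, a checksum mismatch can only mean that the bytes are damaged, never that a newer revision was published between the two calls.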

Those making a citation may wish to cite a specific version, or the latest current
version.  Followers of citations may wish to, if given an identifier representing
a specific version (PID), find out what is the latest version (another, newer PID, or the SID).
Conversely, if given a series identifier that navigates to the latest version, they may wish to find
out what the content was at some previous point in time (e.g., the time of the citation) by following
the obsolescence chain backward.


Service development plans
~~~~~~~~~~~~~~~~~~~~~~~~~
DataONE will be providing CN services for navigating to the latest version of an object, since the only way
to do it currently is for the clients to serially retrieve the system metadata for
versions in the chain until they reach the head version, which can be inefficient.
A new method to retrieve the entire version history is also under consideration.
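
The cost of the serial approach can be seen in a small sketch. ``latest_pid`` is a hypothetical client helper, the callable stands in for getSystemMetadata, and the ``obsoletedBy`` key is a simplified stand-in for the system metadata property of the same name.

```python
from typing import Callable, Dict, Optional, Tuple

SysMeta = Dict[str, Optional[str]]


def latest_pid(
    pid: str, get_system_metadata: Callable[[str], SysMeta]
) -> Tuple[str, int]:
    """Follow obsoletedBy links from ``pid``; return (head PID, calls made)."""
    calls = 0
    seen = set()
    while pid not in seen:
        seen.add(pid)
        sysmeta = get_system_metadata(pid)   # one round trip per revision
        calls += 1
        successor = sysmeta.get("obsoletedBy")
        if successor is None:
            return pid, calls
        pid = successor
    raise ValueError("obsolescence chain contains a cycle")


# A three-revision chain held in memory instead of on a node.
chain = {
    "PID_A": {"obsoletedBy": "PID_B"},
    "PID_B": {"obsoletedBy": "PID_C"},
    "PID_C": {"obsoletedBy": None},
}
print(latest_pid("PID_A", chain.__getitem__))  # ('PID_C', 3)
```

The number of calls grows with the length of the chain, whereas a lookup by SID needs a single request regardless of how many revisions exist.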


MN API method changes
~~~~~~~~~~~~~~~~~~~~~~

MN.get(Identifier id):
	Identifier can be either a PID or SID, and if a SID, return the bytes of the HEAD PID in the series.

MN.getSystemMetadata(Identifier id):
	If PID, return SystemMetadata of PID.
	If SID, return HEAD PID SystemMetadata.

MN.describe(Identifier id):
	If PID, return header for PID.
	If SID, return header for HEAD PID.

MN.getChecksum(Identifier pid):
	Requires PID to effectively verify data integrity.

MN.create(Identifier pid, object, SystemMetadata):
	Identifier must be PID and included in accompanying systemMetadata.
	SID may be included in accompanying systemMetadata if known at time of creation. The SID must not already exist in the system.

MN.update(Identifier id, Identifier newPid, SystemMetadata):
	Identifier id may be a PID or SID -- in the case of a SID, the method works with the HEAD PID of the chain.
	The new Identifier must be a PID and must match the accompanying SystemMetadata.
	The new SID can match the old SID in previous SystemMetadata (objects are in the same series),
	or it can be any unique SID that does not already exist in the system (newly assigning a SID or shifting the SID because of a "scientifically meaningful change").
	The new system metadata may also omit the SID, regardless of whether the previous version has one.

MN.getLogRecords(?idFilter):
	Filter can be PID or SID. The MN should resolve the SID to the HEAD PID, and return the log records for that PID.
	If a client wishes to retrieve log records for the entire family of objects referenced by a SID, the client should retrieve the list of PIDs for the SID, then call getLogRecords for each PID to assemble the entire set of log records.
	The Log.identifier field will only contain PID values, no SIDs.

MN.delete(Identifier id):
	Identifier can be PID or SID.
	If PID, delete that specific version;
	If a SID, delete the HEAD PID version.

MN.archive(Identifier id):
	Identifier can be PID or SID.
	If PID, archive that specific version.
	If a SID, archive the HEAD PID.

MN.isAuthorized(Identifier id):
	Can accept either PID or SID, but in the case of a SID parameter only reports on the accessPolicy for the HEAD PID.

MN.synchronizationFailed(Identifier pid):
	Inter-node communication should only use PIDs for identifying objects.

MN.replicate(Identifier id):
	No changes in behavior. SystemMetadata object has changed structure so there is a change in signature.
	Replication is based on the PID so that we can ensure content has not been corrupted.

MN.getReplica(Identifier id):
	Can only make requests for PIDs so that checksum integrity can be verified.

MN.systemMetadataChanged(Identifier, serialVersion, dateSysMetaModified):
	May be called on the MN if the CN infers an obsoletes relationship for a new PID based on a shared SID.
	Identifier can be either a PID or a SID.
	If a SID, the MN will fetch SystemMetadata from the CN using SID (which will return the HEAD PID SystemMetadata).

MN.listObjects(?identifier=XXX):
    Returns an ObjectList as usual, but can be filtered by identifier (SID or PID).
    If the Identifier is a PID, it returns just the single entry for that PID.
    If the Identifier is a SID, it returns the entries for all objects (PIDs) that have that SID.

MN.view(Identifier id):
    Can accept either PID or SID. If a PID, get the formatted view for the specified version. If a SID, get the view for the HEAD PID.

MN.getPackage(Identifier id):
    Can accept either PID or SID. If a PID, get the package of the specified version. If a SID, get the package of the HEAD PID.

MN.updateSystemMetadata(Identifier id, SystemMetadata newSysmeta):
    Requires a PID. The SID can exist in the newSysmeta object. Since the SID is immutable, the SID in newSysmeta should match the current SID if one exists. If the current system metadata doesn't have a SID, the new SID can be one of the following cases:

    1. The new SID is null (without a SID).
    2. The new SID is a unique identifier which doesn't exist in the system.
    3. The new SID matches the SID in the system metadata of the object in the "obsoletes" value.
    4. The new SID matches the SID in the system metadata of the object in the "obsoletedBy" value.
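
The SID rules for updateSystemMetadata can be read as a single predicate. The following is a sketch under assumptions, not node code: ``sid_update_allowed`` and its parameters are hypothetical names, and ``existing_sids`` stands for whatever lookup a node uses to learn which identifiers are already in the system.

```python
from typing import Optional, Set


def sid_update_allowed(
    current_sid: Optional[str],
    new_sid: Optional[str],
    existing_sids: Set[str],
    obsoletes_sid: Optional[str] = None,
    obsoleted_by_sid: Optional[str] = None,
) -> bool:
    """Check the SID in newSysmeta against the rules listed above."""
    # An assigned SID is immutable, so it must be carried over unchanged.
    if current_sid is not None:
        return new_sid == current_sid
    # Case 1: the object stays without a SID.
    if new_sid is None:
        return True
    # Case 2: a SID that is new to the system.
    if new_sid not in existing_sids:
        return True
    # Cases 3 and 4: joining the series of the neighbouring revision.
    return new_sid in (obsoletes_sid, obsoleted_by_sid)


known = {"SID_1", "SID_2"}
print(sid_update_allowed("SID_1", "SID_2", known))                     # False
print(sid_update_allowed(None, "SID_1", known, obsoletes_sid="SID_1"))  # True
print(sid_update_allowed(None, "SID_2", known, obsoletes_sid="SID_1"))  # False
```

The last call is rejected because ``SID_2`` already exists in the system and belongs to neither neighbouring revision.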


CN API method changes
~~~~~~~~~~~~~~~~~~~~~~

CN.get(Identifier id):
	Behaves the same as MN

CN.describe(Identifier id):
	Behaves the same as MN

CN.getSystemMetadata(Identifier id):
	Behaves the same as MN.
	N.B. This method can be used with a SID to locate the PID of the latest version which may be sufficient without implementing a
	getHead() method.

CN.getChecksum(Identifier id):
    Behaves the same as MN

CN.getLogRecords(?idFilter):
    Behaves the same as MN

CN.create(Identifier pid, object, SystemMetadata):
    Identifier must be PID and included in accompanying systemMetadata. SID may be included in accompanying systemMetadata if known at time of creation. The SID can be one of the following cases:

    1. The SID is a unique identifier which doesn't exist in the system.
    2. The SID matches the SID in the system metadata of the object in the "obsoletes" value.
    3. The SID matches the SID in the system metadata of the object in the "obsoletedBy" value.

CN.registerSystemMetadata(Identifier pid, SystemMetadata sysmeta):
    Requires a PID, allows a SID in the SystemMetadata. The SID can be one of the following cases:

    1. The SID is a unique identifier which doesn't exist in the system.
    2. The SID matches the SID in the system metadata of the object in the "obsoletes" value.
    3. The SID matches the SID in the system metadata of the object in the "obsoletedBy" value.

CN.updateSystemMetadata(Identifier id, SystemMetadata newSysmeta):
    Behaves the same as MN

CN.delete(Identifier id):
    Behaves the same as MN

CN.archive(Identifier id):
   Behaves the same as MN

CN.reserveIdentifier(Identifier id):
	Accepts PID or SID values and treats them exactly the same.

CN.hasReservation(Identifier id):
	Accepts PID or SID values and treats them exactly the same.

CN.resolve(Identifier):
	If PID, resolve it.
	If a SID, then resolve the HEAD PID.

CN.isAuthorized(Identifier id):
    Behaves the same as MN

CN.isNodeAuthorized(Identifier id):
    Only accept PID since it is a replication related method. No behavior change.

CN.updateReplicationMetadata(Identifier id):
    Only accept PID since it is a replication related method. No behavior change.

CN.deleteReplicationMetadata(Identifier id):
    Only accept PID since it is a replication related method. No behavior change.

CN.setReplicationStatus(Identifier id):
    Only accept PID since it is a replication related method. No behavior change.

CN.setReplicationPolicy():
	Only accept PID since it is a replication related method. No behavior change.

CN.setRightsHolder():
    Ownerships apply to particular revisions, not the entire chain.
    If a SID is passed in to a method that affects one of these policies, the change is applied to the HEAD PID for that series.

CN.setAccessPolicy():
	Policies apply to particular revisions, not the entire chain.
	If a SID is passed in to a method that affects one of these policies, the change is applied to the HEAD PID for that series.

CN.setObsoletedBy(Identifier id, Identifier obsoletedByPid):
	Only PIDs can be used when expressing obsolescence chain.

CN.view(Identifier id):
    Behaves the same as MN

CN.listObjects(?identifier=XXX):
	Behaves the same as MN
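
Most of the MN and CN methods above follow one of two patterns: accept a PID or a SID and act on the HEAD PID, or accept a PID only. A rough sketch of both patterns follows. The function names, the in-memory ``pids`` and ``sid_heads`` structures, and the exceptions raised are assumptions made for illustration; the text above does not specify how a node reports a SID passed to a PID-only method.

```python
from typing import Dict, Set


class NotFound(KeyError):
    """Stand-in for the NotFound exception returned by a node."""


def resolve_to_pid(identifier: str, pids: Set[str], sid_heads: Dict[str, str]) -> str:
    """Pattern used by get, describe, getSystemMetadata, view and similar."""
    if identifier in pids:             # a PID names exactly one revision
        return identifier
    if identifier in sid_heads:        # a SID maps to the HEAD PID of its series
        return sid_heads[identifier]
    raise NotFound(identifier)


def require_pid(identifier: str, pids: Set[str]) -> str:
    """Pattern used by replication-related methods, which accept PIDs only."""
    if identifier not in pids:
        # Assumption: a SID or unknown identifier is simply rejected here.
        raise ValueError(f"a PID is required, got {identifier!r}")
    return identifier


pids = {"PID_A", "PID_B"}
sid_heads = {"SID_1": "PID_B"}         # PID_B obsoletes PID_A in series SID_1

print(resolve_to_pid("PID_A", pids, sid_heads))  # PID_A
print(resolve_to_pid("SID_1", pids, sid_heads))  # PID_B
```

Keeping the SID lookup in one place means that only the ``sid_heads`` mapping needs to change when a new revision joins a series.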



Use Cases
---------

The use cases below organize the identified requirements related to mutable
content, with the most relevant use cases listed first.

Prioritized goals
^^^^^^^^^^^^^^^^^

1. Data preservation
~~~~~~~~~~~~~~~~~~~~
Defined as activities that help ensure continued discoverability and usefulness
and usually in reference to metadata, not data.

- metadata adaptation / improvement
- metadata correction
- absent a "push" notification, users should be able to easily determine if they
  have the most current version of something, and easily and quickly get it.

2. Mutable Content Member Node support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For institutions following a mutable content model:

- Provide a path forward for integrating into DataONE network.
- Minimize the burden of adaptation to working with versioned content.
- Allow use of their identifiers in DataONE in the context they are familiar with
  (if their identifier always points to the latest, in DataONE it should too)
- Options for maintaining past versions
- Differentiating between incremental internal saves, vs. new revision.

3. Citation support
~~~~~~~~~~~~~~~~~~~~
- avoid unnecessary costs associated with obtaining resolvable identifiers (e.g., DOIs) for each version
- coordinating citation by a common identifier for citation tracking
- ensuring that the cited object is the same when accessed as when it was originally used
- ability to cite a version as well as the conceptual object

Optional
^^^^^^^^

4. Support for frequently changing / overwritten data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What is the best way to version mutable data that changes frequently but may or
may not be used?  For example, a "current time" object that is replaced every minute, or
a "current weather radar" object that is replaced every 3 hours.

- preserving every version could be very expensive for very little value
- what mechanisms could be employed to minimize the overhead?

The underlying dynamic here is the rate of mutation vs. the rate of synchronization.

5. Support for accumulating datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This means supporting data objects that add records over time, either:

- within pre-defined bounds  e.g. "2013 year-to-date"  (the metadata could stay
  the same, while data changes)
- without pre-defined bounds e.g. "JGoodall primate observation log"?


6. Support for mixed metadata/data objects
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some formats, for example netCDF, combine data with metadata, so this use case is about allowing the
metadata to change without impacting the consistency assessment of the data itself.

- changes in the file are treated like any other change; they will be versioned,
  but may be referenced using a seriesId


7. Supporting 'unrecorded' data streams
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Mutable content can theoretically include things that are live feeds from
sensors, but are otherwise not captured.

This proposal does not accommodate streams unless they have discrete snapshots that can be referenced as part of
a seriesId.

- Should we allow identifiers to resolve to a URL that returns an input stream?
- Can we prevent it?
- Can we mark it as the user's responsibility to do the mn.create?