Identifiers in DataONE
======================

Identifiers (PIDs, Persistent IDentifiers) are handles that uniquely identify
objects within the DataONE system.

* All data, metadata, and resource map objects in DataONE have a unique
  identifier.

* PIDs will always refer to the same set of bytes accessed through the DataONE
  API methods such as :func:`MNRead.get`.

* The location of content identified by a PID is determined by calling the
  :func:`CNCore.resolve` method.

* PIDs are persistent. Once content is registered with DataONE, the identifier
  for that content will remain in the DataONE system.

* PIDs are unique, and can not be reused once assigned.

* PIDs are generally controlled by Member Nodes, however their uniqueness and
  immutability is enforced primarily by the Coordinating Nodes.


Uniqueness
~~~~~~~~~~

Generation of identifiers in DataONE is largely under the control of the Member
Nodes (i.e. the data providers), with the requirement that an existing
identifier (i.e. one that is already registered in the DataONE system) can not
be reused. This rule is enforced for new content by checking the uniqueness of a
proposed identifier in the :func:`MNStorage.create` method, and for existing
content by ignoring content with identifiers that are already in use. The
:func:`CNCore.reserveIdentifier` method may be used to reserve an identifier, so
that a client may for example compose a composite object prior to committing the
new content to storage on the Member Node. Similarly, Tier 3 and above Member
Nodes may support the :func:`MNStorage.generateIdentifier` which will typically
delegate to a third party persistent identifier service such as EZID [1]_ to
return an identifier guaranteed to be unique within the DataONE system.


Authority
~~~~~~~~~

DataONE treats the original identifier (i.e. the first assignment of the
identifier to an object that becomes known to DataONE) as the authoritative
identifier for an object. Although generally not encouraged, multiple
identifiers may refer to a particular object and in such cases, DataONE will
attempt to utilize the original identifier for all communications about the
object.


Opacity
~~~~~~~

Identifiers utilized by Member Nodes can take many different forms from
automatically generated sequential or random character strings to strings that
conform to schemes such as the LSID [2]_ and DOI [3]_ specifications. DataONE
does not directly utilize implied functionality and services that might be
available for some of the identifier schemes. This is not to say that mechanisms
such as metadata retrieval for LSIDs is not used by any components of the
DataONE infrastructure, but rather that the DataONE infrastructure and services
have no functional dependency on such external services.

Identifiers are treated as opaque strings in the DataONE system, with no meaning
inferred from structure or pattern that may be present in identifiers. The rules
for identifier construction in DataONE are minimal and intended to ensure
practical utility of identifiers. There is a set of characters that can not be
used within an identifier string (non-printing and whitespace characters), and
the maximum number of characters that such a string may contain (800 characters,
#577). Leading and trailing white space is not allowed.


Immutability
~~~~~~~~~~~~

Once assigned and registered in the DataONE infrastructure, an identifier will
always refer to the same sequence of bytes. Generation of other representations
of objects may be supported by services (e.g. an image may be transformed from
TIFF to JPEG), but the identifier will always refer to the original form.


Resolvability
~~~~~~~~~~~~~

A fundamental goal of DataONE is to ensure that any identifier utilized in the
system is resolvable, that is, DataONE provides a mechanism that will enable the
location of the object to be determined. Resolution is handled by the
Coordinating Nodes through the :func:`CNCore.resolve` method, which returns a
list of nodes from which the object may be retrieved.

A guarantee of identifier resolvability is an important, core function of the
DataONE infrastructure upon which many other services may be constructed, both
within DataONE and by third party systems.


Granularity
~~~~~~~~~~~

Identifiers refer to managed objects in DataONE. Initially data, science metadata
documents, and resource maps have identifiers. The definition of "data" is
somewhat arbitrary though, and a single data object may be a single record
within some larger collection, or may refer to an entire set of records
contained within some package.


Structure
~~~~~~~~~

The characters that may appear in an identifier string acceptable to the
DataONE system is constrained by the XMLSchema definition
(:class:`Types.Identifier`), which is essentially a string of length greater
than zero but less than 800 characters with no whitespace (spaces, tabs,
non-printing characters, carriage returns, new lines). Identifiers may be
Unicode provided they conform to the fairly liberal restrictions imposed by
the XML specification [4]_. Examples of valid identifiers in DataONE are shown
in the section *Serializing* below.


Serializing
~~~~~~~~~~~

When identifiers appear in text, the full identifier should be presented
unmodified.

Identifiers appearing in URLs or other representations that have reserved
characters should be escaped according to the rules of the targeted
serialization format. For example, the identifiers::

  10.1000/182
  urn:lsid:ubio.org:namebank:11815
  http://example.com/data/mydata?row=24
  ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
  ฉันกินกระจกได้
  Is_féidir_liom_ithe_gloine

would be serialized in DataONE :func:`MNRead.get` URLs (or any other URL path)
according to RFC3986_ encoding guidelines for URI path segments::

  http://mn.example.com/mn/object/10.1000%2F182
  http://mn.example.com/mn/object/urn:lsid:ubio.org:namebank:11815
  http://mn.example.com/mn/object/http:%2F%2Fexample.com%2Fdata%2Fmydata%3Frow=24
  http://mn.example.com/mn/object/ldap:%2F%2Fldap1.example.net:6666%2Fo=University%2520of%2520Michigan,c=US%3F%3Fsub%3F(cn=Babs%2520Jensen)
  http://mn.example.com/mn/object/%E0%B8%89%E0%B8%B1%E0%B8%99%E0%B8%81%E0%B8%B4%E0%B8%99%E0%B8%81%E0%B8%A3%E0%B8%B0%E0%B8%88%E0%B8%81%E0%B9%84%E0%B8%94%E0%B9%89
  http://mn.example.com/mn/object/Is_f%C3%A9idir_liom_ithe_gloine


.. Note::

   The "+" (plus) character is a special case since it was once treated as a
   space character in URLs, and was changed in RFC3986 [5]_ such that the "+"
   would not be treated as a space. To minimize confusion when the plus
   character appears in an identifier, DataONE recommends that the character
   is percent escaped (``%2B``) when it appears in DataONE service URLs. All
   DataONE libraries and services operate in this manner.


The necessary encoding of URLs can be usually achieved through standard
libraries available in many languages, with the caveat that the encoding
follows the RFC3986 encoding rules. Many packages over-escape, keeping only
the unreserved character set unescaped. For its client libraries, DataONE is
taking a minimal escaping approach within the latitude RFC3986 allows.
Specifically, using [pchar] - ['+'] as the set of unescaped characters for
identifiers in path segments, and [pchar] - ['+', '&', '='] + ['/', '?'] for
identifiers in query segments, (segments in both cases meaning characters
between delimiters). For example::

  example-location-dependent-__/__?__&__=__
  example-common-unescaped-;:@$-_.!*()',~

will be encoded in paths to::

  example-location-dependent-__%2F__%3F__&__=__
  example-common-unescaped-;:@$-_.!*()',~

and encoded in the query section to::

  example-location-dependent-__/__?__%26__%3D__
  example-common-unescaped-;:@$-_.!*()',~

Note that RFC3986 [5]_ treats the query section of the URI as a blackbox, so '&'
and '=' are unescaped (to be used as sub-delimiters). For the purpose of
encoding content, we take the approach of encoding at the segment level, so
need to escape those characters. For those implementations using standard
encoding routines, it is important to know that package's treatment of this.

The following examples in Python and Java illustrate percent encoding of data
such as an identifier appropriate for appending to a URL. Each processes utf-8
encoded input through *stdin* and outputs percent encoded or decoded
responses. In java pseudo-code the general process is as follows.

.. code-block:: java

  // pseudo-code: this will not compile!

  CharacterSet PATH_SAFE = RFC3986_PCHAR and not ['+'];
  CharacterSet QUERY_SAFE = PATH_SAFE and not ['&','='] or ['?','/']; 

  String encodeUtf8_pathSegment(identifier) {
      String utf8ID = identifier.translate("UTF-8");
      return encodedID = percentEscape(utf8ID,PATH_SAFE);
  }

  String encodeUtf8_querySegment(identifier) {
      String utf8ID = identifier.translate("UTF-8");
      return encodedID = percentEscape(utf8ID,QUERY_SAFE);
  }

  String decodeString(string) {
      // older clients may encode spaces with '+'
      // so if we see them in the input, it is due to that
      // and we need to decode them, too.

      String correctedString = string.replace("+","%2B");
      return decodePercentEscaped(correctedString);
  }


.. code-block:: python

  import sys
  import codecs
  import urllib

  def pctEncode(data):
    '''Encode the unicode string data as utf-8 then percent encode that
    ready for appending as a path element to a URL.
    '''
    response = urllib.quote(data.encode("utf-8"), safe=":")
    return response


  def pctDecode(data):
    '''Decode a percent encoded string and return the unicode object.
    but first handle any mistaken '+' in the data string 
    '''
   data = data.replace("+","%2B")
    response = urllib.unquote(data)
    return response


  if __name__ == "__main__":
    '''
    Read utf-8 encoded input from stdin and percent encode or 
    decode (with command line argument -d). 

    e.g. given test_ids.txt, a UTF-8 encoded file with identifiers 
    appearing one per line:
      cat test_ids.txt | python PctEncode.py | python PctEncode.py -d

    should output equivalent to:
      cat test_ids.txt
    '''
    doEncode = True
    try:
      if sys.argv[1] == "-d":
        doEncode = False
    except:
      pass
    id = unicode(sys.stdin.readline(), "utf-8").strip()
    while len(id) > 0:
      if doEncode:
        print pctEncode(id)
      else:
        print pctDecode(id)
      id = unicode(sys.stdin.readline(), "utf-8").strip()
  

.. code-block:: java

  import java.io.*;
  import java.net.*;

  class PctEncode 
  {
    /**
    Simple example of URL path encoding of UTF-8 strings for including as
    path elements in URLs as per RFC3986.

    e.g. given test_ids.txt, a UTF-8 encoded file with identifiers 
    appearing one per line:
      cat test_ids.txt | java PctEncode | java PctEncode -d

    should output equivalent to:
      cat test_ids.txt
    */

    public static String pctDecode(String data) {
      /**
      Decode a percent encoded string, returning a Java Unicode string
      */
      String response = null;
      try {
        data = data.replace("+","%2B");
        response = URLDecoder.decode( data, "UTF-8");
      } catch (java.io.UnsupportedEncodingException e) {
        System.out.println("Error pctDecode : " + e.getMessage());
      } 
      return response;
    }


    public static String pctEncodePathSegment(String data) {
      /**
      Encode a Java string according to the path encoding rules in 
      RFC3986. Note that this does not encode properly for data that 
      is to be the root of the path, it is assumed that the data will 
      be appended to the end of a a URL path.
      */
      String response = null;
      try {
        response = URLEncoder.encode( data, "UTF-8" );
	// fix outdated space-to-+ convention
        response = response.replace("+","%20");
	// now un-escape for minimally escaped result
        response = response.replace("%3A",":").replace("%28","(");
        response = response.replace("%3B",";").replace("%29",")");
        response = response.replace("%40","@").replace("%27","'");
        response = response.replace("%24","$").replace("%2C",",");
        response = response.replace("%21","!").replace("%7E","~");

      } catch (java.io.UnsupportedEncodingException e) {
        System.out.println("Error  pctEncode: " + e.getMessage());
      } 
      return response;
    }


    public static void main( String[] args ) {
      try {
        boolean doEncode = true;
        try {
          if (args[0].equals( "-d" ))
            doEncode = false;
        } catch(ArrayIndexOutOfBoundsException e) {
        }

        PrintStream outs = new PrintStream( System.out, true, "UTF-8" );
        InputStreamReader isr = new InputStreamReader( System.in, "UTF-8" );
        BufferedReader reader = new BufferedReader( isr );
        String id = null;
        String data = null;
        while ( (id = reader.readLine()) != null ) {
          if (doEncode) {
            data = pctEncode( id );
          } else {
            data = pctDecode( id );
          }
          outs.println( data );
        }
      } catch(java.io.IOException e) {
        System.out.println("Error main: " + e.getMessage());
      }
    }
  }


Given this code and a utf-8 encoded source file *test_ids.txt* such as::

  ö
  10.1000/182
  urn:lsid:ubio.org:namebank:11815
  http://example.com/data/mydata?row=24
  ldap://ldap1.example.net:6666/o=University%20of%20Michigan,%20c=US??sub?(cn=Babs%20Jensen)",
  ฉันกินกระจกได้
  Is_féidir_liom_ithe_gloine

The following commands should output the same as ``cat test_ids.txt``::

  cat test_ids.txt | java PctEncode | python PctEncode.py -d
  cat test_ids.txt | python PctEncode.py | java PctEncode -d


.. _guid: http://en.wikipedia.org/wiki/Globally_unique_identifier#Algorithm


.. _OGC WKT: http://en.wikipedia.org/wiki/Well-known_text

.. [1] http://n2t.net/ezid/

.. [2] http://lsids.sourceforge.net/

.. [3] http://www.doi.org/

.. [4] http://www.w3.org/TR/xml11/#charsets

.. [5] http://tools.ietf.org/html/rfc3986


.. 
    OLD Notes follow, preserved here for now but likely to be removed 

    Suggested Strategy
    ------------------

    1. DataONE supports all identifier schemes where the PID can be represented as
       a Unicode string (this should be any identifier).

    2. The original identifier first assigned by a Member Node is the identifier
       promoted as the authoritative identifier for that content. Other identifiers
       that may be assigned by MNs that don't support the original scheme will be
       mapped to the original.

    3. If the original MN discontinues participation in DataONE, then the
       identifier originally used remains as the authoritative identifier.

    4. Any identifiers in use by the DataONE system can be resolved at any node
       (CN or MN). A caching system (e.g. memcached) should be used to improve
       resolution performance (can be primed with existing IDs).

    This strategy will enable the use of any identifier that is represented by a
    string, and will persist the original identifier for the object regardless of
    what happens to the originating Member Node.

    An obvious concern with this strategy is that a single object may have
    multiple identifiers associated with it. Since the original identifier is
    persisted, however, it will be the primary identifier by which that content
    will be referenced, regardless of which node the object is located on.


    .. 
       @startuml images/resolve.png
       title Resolve PID
       actor User
       participant "CRUD API" as m_crud << Member Node >>
       participant "Cache" as m_cache << Member Node >>
       participant "CRUD API" as cn_crud << Coordinating Node >>
       participant "Directory" as cn_dir << Coordinating Node >>
       User -> m_crud: resolve(token, "A5548D")
       m_crud -> m_cache: cache_lookup("A5548D")
       m_cache --> m_crud: FAIL
       m_crud -> cn_crud: resolve(token, "A5548D")
       cn_crud -> cn_dir: lookup("A5548D")
       cn_dir --> cn_crud: metadata
       cn_crud --> m_crud: metadata
       m_crud --> m_cache: addEntry("A5548D", metadata)
       m_crud --> User: metadata
       @enduml

    .. image:: images/resolve.png

    *Figure 1.* Resolving a PID. In this scenario a user is trying to determine
    what the ID "A5548D" refers to, and uses the resolution service of a Member
    Node to that effect.


    .. 
       @startuml images/resolve-detail.png
       title Resolve PID Detail
       actor User
       participant "CRUD API" as m_crud << Member Node >>
       participant "Cache" as m_cache << Member Node >>
       participant "CRUD API" as cn_crud << Coordinating Node >>
       participant "Directory" as cn_dir << Coordinating Node >>
       participant "CRUD API" as m_crud2 << Member Node 2 >>
       User -> m_crud: get(token, "A5548D")
       m_crud -> m_cache: lookup("A5548D")
       note right
         Local resolve failed, defer to CN
       endnote
       m_cache --> m_crud: FAIL
       m_crud -> cn_crud: resolve(token, "A5548D")
       cn_crud -> cn_dir: lookup("A5548D")
       cn_dir --> cn_crud: metadata
       cn_crud --> m_crud: metadata
       m_crud --> m_cache: addEntry(GUID, metadata)
       m_crud -> m_crud: parseMetadata(metadata)
       note right
         Found data URL = http://mn2.dataone.org/objects/A4448D
       endnote
       m_crud --> User: HTTP 302: http://mn2.dataone.org/objects/A4448D
       note right
         Return a redirect to the MN 2 get object 
         interface for the specified object.
       endnote
       User -> m_crud2: GET "http://mn2.dataone.org/objects/A4448D"
       m_crud2 --> User: bytes
       @enduml

    .. image:: images/resolve-detail.png

    *Figure 2.* Detail for object retrieval of an object identified by a PID. In
    this case, the User is requesting a data object from MN 1, though the data is
    actually located on MN 2.


    .. 
       @startuml images/resolve-conflict.png
       title Conflicting IDs
       participant "MN_A" as mn_a
       participant "MN_B" as mn_b
       participant "CN" as cn
       participant "CN OStore" as cn_os

       mn_a -> cn: registerID("435")
       cn -> cn_os: store("mn_a:435")
       cn_os <-- cn: ACK
       mn_a <-- cn: ACK

       mn_b -> cn: registerID("435")
       cn -> cn_os: store("mn_b:435")
       cn_os <-- cn: ACK
       mn_b <-- cn: ACK

       actor user
       user -> cn: resolve("435")
       user <-- cn: "mn_a:435", "mn_b:435"
       @enduml

    .. image:: images/resolve-conflict.png

    *Figure 3.* A scenario where two MNs happen to add different content to the
    system with the same identifier. Resolving the identifier without including
    the namespace results in two matches that must be interpreted by the client.
    The likelihood of such a scenario should be low, given that MNs should be
    utilizing identifier schemes that under ideal circumstances should not
    generate duplicate identifiers.


    Notes from the 20090602 Albuquerque Meeting
    -------------------------------------------

    These lightly edited notes were taken by Bruce Wilson of the group discussion
    about identifiers during the VDC-TWG 20090602 Albuquerque Meeting.

    Original notes are located in subversion at:

    /documents/Projects/VDC/docs/20090602_04_ABQ_Meeting


    Design Goals
    ~~~~~~~~~~~~

    From the DataONE perspective, an identifier is opaque. DataONE does not attach
    any meaning or resolution protocol based on the identifier.

    A call to return the object associated with a particular identifier should
    always return either identically the same object or n/a if that object is no
    longer available. This raises a number of implementation issues, noted below.
    Particular issues include how to handle data which is regularly updated and
    things like status changes.

    A Member Node may use its own internal identification scheme, but must be able
    to retrieve an object based on its DataONE globally unique identifier.

    Member Nodes may generate their own unique identifiers, such as DOIs_,
    Handles_, PURLs_, or UUIDs_. The only requirement is that the identifier is
    unique across the space of DataONE. This implies that CN's must have
    functionality to:

    .. _DOIs: http://www.doi.org/
    .. _Handles: http://www.handle.net/
    .. _PURLs: http://purl.org/docs/index.html
    .. _UUIDs: http://en.wikipedia.org/wiki/UUID


    (a) check that an identifier is unique and 

    (b) to "reserve" or stub-out an identifier while the MN goes through the
        process of assembling the package to submit the object into DataONE.

    When an object is replicated from one MN to another MN, the receiving MN must
    be able to accept and resolve the supplied DataONE identifier. That is, an
    object, no matter where it is within the DataONE network must be retrievable
    by its DataONE identifier, regardless of location. There was a lot of
    discussion on this point, and this is my interpretation of the conclusion. I
    believe we came out with the point that if a receiving Member Node assigns its
    own permanent identifier, then that creates more confusion, requires the MN to
    register that second ID with the CNs, and we can have confusion regarding the
    citation (for example) of the piece of data. It also makes tracking things
    like metrics, since the originating MN must then find out all other
    identifiers for the data and search for all of those. And while it can be
    argued that nobody "owns" the data, there is (currently) a culture and need
    for the original archive to feel like it still can receive credit for that
    investment.

    A system doesn't need to maintain every version, but it does need to be able
    to identify every version.

    Identifiers also apply to metadata as well as data.


    Questions for Further Consideration
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    If a MN uses a DOI for a data set identifier, is it appropriate to include
    doi: in the identifier. For example, 10.3334/ORNLDAAC/840 is the DOI for a
    particular data set at the ORNL DAAC. Both "doi:10.3334/ORNLDAAC/840" and
    "10.3334/ORNLDAAC/840" can be presumed to be unique identifiers. Which should
    be used? 

    BEW: My personal preference is to use the one with the resolution
    protocol included. That does, however, make the identifier more of a "smart"
    identifier, which is generally problematic.

    Where an identifier has a mechanism to resolve to multiple locations (such as
    is possible with an LSID and some DOI mechanisms) and that object is
    replicated from one MN to another MN, this would suggest that the originating
    MN needs to be notified of the additional location and has the option of
    registering the new location with the handle registration authority. This also
    means that if a replication is removed, the original MN should have the option
    of being notified, so that the resolution points are updated. Ideally, this
    should happen before the replica is removed (where possible), so that we
    eliminate (or at least minimize) the amount of time that an invalid resolution
    point is in someone else's system.

    Where an identifier (such as a Handle) has a URL resolution, what should that
    resolution be? ORNL DAAC DOI's resolve to a web page where a user (after
    logging in) can see and download the components of the data set. Our opinion
    is that the DOI resolving to a human interpretable description of the object
    is more important than a machine interpretable resolution point. Some thought
    and guidance on this point for the overall DataONE community of practice is
    desirable.

    Do we want/need a registry of name spaces? Where a MN uses a UUID (for
    example), there may not be a way to describe the name space for identifiers,
    unless the MN prefixes the UUID with some descriptor, which generally violates
    the general admonition about smart identifiers. It might, however, be helpful
    to have something like a set of regexps that describe the name space for a
    MN's identifiers, particularly if an automated way could be developed to look
    for potential collisions (non-null overlaps) between name spaces. BEW: My
    thought is that this is far from an initial feature, but the desirability of
    this as a possible future feature could have implications on the way we do
    things from the start.

    Can the metadata standards support multiple globally unique identifiers? For
    example, what happens in the case that a MN starts down the DOI path and then
    switches to LSID's because of economic costs, for example, and goes back and
    assigns an LSID to historical data sets. Those data sets now have both an LSID
    and a DOI. Where is this in the metadata? Is there a mechanism for indicating
    the preferred ID and the alternate ID's? Likewise, how should things be
    handled when a MN decides to register an object with e.g. GCMD and the
    namespace that GCMD allows for identifiers does not allow for the MN's
    preferred identifier. Can a MN update the metadata to show an alternate key
    with the GCMD identifier (data set is also known as)? What is the implication
    for the metadata identifier in such a case? This is an update operation to the
    metadata, which implies that the metadata identifier is changed. How would one
    update the old metadata record to indicate that it is:

    (a) deprecated and 

    (b) the id of the new metadata record?

    The above also relates to the issue of establishing predecessor-successor
    relationships between identifiers. How should this be done across the system?

    How do versions enter into the identifiers scheme? The general concept is that
    different versions of an object have different identifiers. What about having
    some type of an identifier that aggregates all versions of an object and which
    always points to the latest version of that object? How does D1 know that an
    object is a new version of an existing object? Update operation should take
    the old identifier and the new identifier. That would allow for the tracking
    of updates. A Member Node may track versions. Could create an interface
    specification for "latest version" where the CN calls the authoritative MN for
    the DS and asks for the identifier of the latest version of a particular
    identifier. Points back to the need for what amounts to meta-metadata - where
    the metadata object can be updated to indicate the status level of the data
    set (e.g. deprecated). Where is the identifier for something like World Ocean
    Data Base - this gets updated quarterly. They think of the fundamental unit as
    an observation point, which is either a location (e.g. buoy, possibly with
    different identifiers for different depths) or a leg of a trip, with multiple
    observations along a path.

    For identifiers, we may need to specify the character space. What happens when
    a MN stores unique identifiers in a database field that supports just ASCII,
    but a different MN does its unique identifiers in some other character set?
    PURL is a possible unique identifier, but we can get into cases now where URLs
    have characters from other language character sets (such as Arabic, Kanji, Ö)

    What happens when a request for a replicated version of a data set comes to
    the replicate MN and the data set has been updated and the originating MN has
    not supplied the information about the update (e.g. they did an insert for the
    new version)?

    How do we assign ID's for a continuous data stream or for a subset calculated
    on the fly? Does this mean that every request for a continuous data stream
    gets its own data set identifier, which then gets stored in the D1 system
    someplace? What is the value to the overall enterprise for storing the data
    set identifiers for each request, particularly in the context of something
    like a stream, where the on-the-fly processing is used to get a dynamic subset
    or dynamic reprojection? Examples of this sort of situation include the stream
    gauge data or the Atmospheric Radiation Measurement (ARM) archive. Ameriflux
    Flux tower data is a simpler case, in that they work on the basis of a
    site-year as a unit of data. The World Oceanic DataBase (WODB), however,
    operates on a location (and possibly depth) as a unit of data. Many of these
    are updated quarterly. Each unit of data has an identifier, unique within
    WODB, and WODB publishes a data stream that indicates what data packages were
    updated at what point in time. It is possible to determine whether a
    particular data package changed between two points in time. The differences
    are human interpretable, but it is not possible (in any generally automated
    fashion) to recreate the data stream for a particular data package at an
    arbitrary point in prior time.

    Do the CN's need a method to determine the object type for an identifier? Do
    identifiers need to be unique across all types of identified objects?