<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Identifiers in DataONE &#8212; v2.1.0-beta</title>
    
    <link rel="stylesheet" href="../_static/dataone.css" type="text/css" />
    <link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.1.0-beta',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true,
        SOURCELINK_SUFFIX: '.txt'
      };
    </script>
    <script type="text/javascript" src="../_static/mathjax_pre.js"></script>
    <script type="text/javascript" src="../_static/jquery.js"></script>
    <script type="text/javascript" src="../_static/underscore.js"></script>
    <script type="text/javascript" src="../_static/doctools.js"></script>
    <script type="text/javascript" src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"></script>
    <script type="text/javascript" src="../_static/sidebar.js"></script>
    <link rel="author" title="About these documents" href="../about.html" />
    <link rel="index" title="Index" href="../genindex.html" />
    <link rel="search" title="Search" href="../search.html" />
    <link rel="next" title="Identity Management and Authenticated Session Management" href="Authentication.html" />
    <link rel="prev" title="Immutability of Content in DataONE" href="ContentImmutability.html" />
   
  
  <link media="only screen and (max-device-width: 480px)" href="../_static/small_dataone.css" type= "text/css" rel="stylesheet" />

  </head>
  <body role="document">
  
    <div class="version_notice">
      <p>
      <span class='bold'>Warning:</span> These documents are under active 
      development and subject to change (version 2.1.0-beta).<br />
      The latest release documents are at:
      <a href="https://purl.dataone.org/architecture">https://purl.dataone.org/architecture</a>
      </p>
    </div>

    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="../py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="Authentication.html" title="Identity Management and Authenticated Session Management"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="ContentImmutability.html" title="Immutability of Content in DataONE"
             accesskey="P">previous</a> |</li>
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="identifiers-in-dataone">
<h1>Identifiers in DataONE<a class="headerlink" href="#identifiers-in-dataone" title="Permalink to this headline">¶</a></h1>
<p>Identifiers (PIDs, Persistent IDentifiers) are handles that uniquely identify
objects within the DataONE system.</p>
<ul class="simple">
<li>All data, metadata, and resource map objects in DataONE have a unique
identifier.</li>
<li>PIDs will always refer to the same set of bytes accessed through the DataONE
API methods such as <a class="reference internal" href="../apis/MN_APIs.html#MNRead.get" title="MNRead.get"><code class="xref py py-func docutils literal"><span class="pre">MNRead.get()</span></code></a>.</li>
<li>The location of content identified by a PID is determined by calling the
<code class="xref py py-func docutils literal"><span class="pre">CNCore.resolve()</span></code> method.</li>
<li>PIDs are persistent. Once content is registered with DataONE, the identifier
for that content will remain in the DataONE system.</li>
<li>PIDs are unique and cannot be reused once assigned.</li>
<li>PIDs are generally controlled by Member Nodes; however, their uniqueness and
immutability are enforced primarily by the Coordinating Nodes.</li>
</ul>
<div class="section" id="uniqueness">
<h2>Uniqueness<a class="headerlink" href="#uniqueness" title="Permalink to this headline">¶</a></h2>
<p>Generation of identifiers in DataONE is largely under the control of the Member
Nodes (i.e. the data providers), with the requirement that an existing
identifier (i.e. one that is already registered in the DataONE system) cannot
be reused. This rule is enforced for new content by checking the uniqueness of a
proposed identifier in the <a class="reference internal" href="../apis/MN_APIs.html#MNStorage.create" title="MNStorage.create"><code class="xref py py-func docutils literal"><span class="pre">MNStorage.create()</span></code></a> method, and for existing
content by ignoring content with identifiers that are already in use. The
<a class="reference internal" href="../apis/CN_APIs.html#CNCore.reserveIdentifier" title="CNCore.reserveIdentifier"><code class="xref py py-func docutils literal"><span class="pre">CNCore.reserveIdentifier()</span></code></a> method may be used to reserve an identifier, so
that a client may, for example, assemble a composite object before committing the
new content to storage on the Member Node. Similarly, Tier 3 and above Member
Nodes may support the <a class="reference internal" href="../apis/MN_APIs.html#MNStorage.generateIdentifier" title="MNStorage.generateIdentifier"><code class="xref py py-func docutils literal"><span class="pre">MNStorage.generateIdentifier()</span></code></a> method, which will typically
delegate to a third-party persistent identifier service such as EZID <a class="footnote-reference" href="#id7" id="id1">[1]</a> to
return an identifier guaranteed to be unique within the DataONE system.</p>
</div>
<div class="section" id="authority">
<h2>Authority<a class="headerlink" href="#authority" title="Permalink to this headline">¶</a></h2>
<p>DataONE treats the original identifier (i.e. the first assignment of the
identifier to an object that becomes known to DataONE) as the authoritative
identifier for an object. Although this is generally not encouraged, multiple
identifiers may refer to a particular object; in such cases, DataONE will
attempt to use the original identifier for all communications about the
object.</p>
</div>
<div class="section" id="opacity">
<h2>Opacity<a class="headerlink" href="#opacity" title="Permalink to this headline">¶</a></h2>
<p>Identifiers used by Member Nodes can take many different forms, from
automatically generated sequential or random character strings to strings that
conform to schemes such as the LSID <a class="footnote-reference" href="#id8" id="id2">[2]</a> and DOI <a class="footnote-reference" href="#id9" id="id3">[3]</a> specifications. DataONE
does not directly rely on the implied functionality and services that might be
available for some of the identifier schemes. This is not to say that mechanisms
such as metadata retrieval for LSIDs are never used by any component of the
DataONE infrastructure, but rather that the DataONE infrastructure and services
have no functional dependency on such external services.</p>
<p>Identifiers are treated as opaque strings in the DataONE system, with no meaning
inferred from any structure or pattern that may be present in them. The rules
for identifier construction in DataONE are minimal and intended to ensure
the practical utility of identifiers. Non-printing and whitespace characters
cannot be used within an identifier string, and an identifier may contain at
most 800 characters (#577). Leading and trailing whitespace is not allowed.</p>
</div>
<div class="section" id="immutability">
<h2>Immutability<a class="headerlink" href="#immutability" title="Permalink to this headline">¶</a></h2>
<p>Once assigned and registered in the DataONE infrastructure, an identifier will
always refer to the same sequence of bytes. Generation of other representations
of objects may be supported by services (e.g. an image may be transformed from
TIFF to JPEG), but the identifier will always refer to the original form.</p>
</div>
<div class="section" id="resolvability">
<h2>Resolvability<a class="headerlink" href="#resolvability" title="Permalink to this headline">¶</a></h2>
<p>A fundamental goal of DataONE is to ensure that any identifier utilized in the
system is resolvable, that is, DataONE provides a mechanism that will enable the
location of the object to be determined. Resolution is handled by the
Coordinating Nodes through the <code class="xref py py-func docutils literal"><span class="pre">CNCore.resolve()</span></code> method, which returns a
list of nodes from which the object may be retrieved.</p>
<p>A guarantee of identifier resolvability is an important, core function of the
DataONE infrastructure upon which many other services may be constructed, both
within DataONE and by third party systems.</p>
</div>
<div class="section" id="granularity">
<h2>Granularity<a class="headerlink" href="#granularity" title="Permalink to this headline">¶</a></h2>
<p>Identifiers refer to managed objects in DataONE. Initially, data, science metadata
documents, and resource maps have identifiers. The definition of &#8220;data&#8221; is
somewhat arbitrary though, and a single data object may be a single record
within some larger collection, or may refer to an entire set of records
contained within some package.</p>
</div>
<div class="section" id="structure">
<h2>Structure<a class="headerlink" href="#structure" title="Permalink to this headline">¶</a></h2>
<p>The characters that may appear in an identifier string acceptable to the
DataONE system are constrained by the XML Schema definition
(<a class="reference internal" href="../apis/Types.html#Types.Identifier" title="Types.Identifier"><code class="xref py py-class docutils literal"><span class="pre">Types.Identifier</span></code></a>), which is essentially a string of at least one
and at most 800 characters with no whitespace (spaces, tabs,
non-printing characters, carriage returns, new lines). Identifiers may contain
Unicode characters provided they conform to the fairly liberal restrictions imposed by
the XML specification <a class="footnote-reference" href="#id10" id="id4">[4]</a>. Examples of valid identifiers in DataONE are shown
in the section <em>Serializing</em> below.</p>
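<p>The length and whitespace rules can be checked mechanically before an
identifier is submitted. The following is a minimal Python 3 sketch and not part
of any DataONE library; the name <code class="docutils literal"><span class="pre">is_valid_pid</span></code> is hypothetical. It assumes
an identifier of 1 to 800 characters (the maximum given in the <em>Opacity</em>
section) with no whitespace, and it does not test for other non-printing
characters. Python&#8217;s <code class="docutils literal"><span class="pre">\S</span></code> rejects every Unicode whitespace character, which
may be stricter than the schema pattern.</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>import re

# Assumption: one to 800 characters, none of them whitespace.
_PID_PATTERN = re.compile(r"\S{1,800}")


def is_valid_pid(pid):
    """Return True if pid is non-empty, at most 800 characters long,
    and contains no whitespace characters."""
    return _PID_PATTERN.fullmatch(pid) is not None


if __name__ == "__main__":
    for candidate in ("10.1000/182", "has a space", ""):
        print(repr(candidate), is_valid_pid(candidate))
</pre></div>
</div>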
</div>
<div class="section" id="serializing">
<h2>Serializing<a class="headerlink" href="#serializing" title="Permalink to this headline">¶</a></h2>
<p>When identifiers appear in text, the full identifier should be presented
unmodified.</p>
<p>Identifiers appearing in URLs or other representations that have reserved
characters should be escaped according to the rules of the targeted
serialization format. For example, the identifiers:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,c=US??sub?(cn=Babs%20Jensen)
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine
</pre></div>
</div>
<p>would be serialized in DataONE <a class="reference internal" href="../apis/MN_APIs.html#MNRead.get" title="MNRead.get"><code class="xref py py-func docutils literal"><span class="pre">MNRead.get()</span></code></a> URLs (or any other URL path)
according to the RFC3986 <a class="footnote-reference" href="#id11">[5]</a> encoding guidelines for URI path segments:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/</span><span class="mf">10.1000</span><span class="o">%</span><span class="mi">2</span><span class="n">F182</span>
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/</span><span class="n">urn</span><span class="p">:</span><span class="n">lsid</span><span class="p">:</span><span class="n">ubio</span><span class="o">.</span><span class="n">org</span><span class="p">:</span><span class="n">namebank</span><span class="p">:</span><span class="mi">11815</span>
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/</span><span class="n">http</span><span class="p">:</span><span class="o">%</span><span class="mi">2</span><span class="n">F</span><span class="o">%</span><span class="mi">2</span><span class="n">Fexample</span><span class="o">.</span><span class="n">com</span><span class="o">%</span><span class="mi">2</span><span class="n">Fdata</span><span class="o">%</span><span class="mi">2</span><span class="n">Fmydata</span><span class="o">%</span><span class="mi">3</span><span class="n">Frow</span><span class="o">=</span><span class="mi">24</span>
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/</span><span class="n">ldap</span><span class="p">:</span><span class="o">%</span><span class="mi">2</span><span class="n">F</span><span class="o">%</span><span class="mi">2</span><span class="n">Fldap1</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">net</span><span class="p">:</span><span class="mi">6666</span><span class="o">%</span><span class="mi">2</span><span class="n">Fo</span><span class="o">=</span><span class="n">University</span><span class="o">%</span><span class="mi">2520</span><span class="n">of</span><span class="o">%</span><span class="mi">2520</span><span class="n">Michigan</span><span class="p">,</span><span class="n">c</span><span class="o">=</span><span class="n">US</span><span class="o">%</span><span class="mi">3</span><span class="n">F</span><span class="o">%</span><span class="mi">3</span><span class="n">Fsub</span><span class="o">%</span><span class="mi">3</span><span class="n">F</span><span class="p">(</span><span class="n">cn</span><span class="o">=</span><span class="n">Babs</span><span class="o">%</span><span class="mi">2520</span><span class="n">Jensen</span><span class="p">)</span>
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">89</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="n">B1</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">99</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">81</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="n">B4</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">99</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">81</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="n">A3</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="n">B0</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">88</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">81</span><span class="o">%</span><span class="n">E0</span><span 
class="o">%</span><span class="n">B9</span><span class="o">%</span><span class="mi">84</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B8</span><span class="o">%</span><span class="mi">94</span><span class="o">%</span><span class="n">E0</span><span class="o">%</span><span class="n">B9</span><span class="o">%</span><span class="mi">89</span>
<span class="n">http</span><span class="p">:</span><span class="o">//</span><span class="n">mn</span><span class="o">.</span><span class="n">example</span><span class="o">.</span><span class="n">com</span><span class="o">/</span><span class="n">mn</span><span class="o">/</span><span class="nb">object</span><span class="o">/</span><span class="n">Is_f</span><span class="o">%</span><span class="n">C3</span><span class="o">%</span><span class="n">A9idir_liom_ithe_gloine</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">The &#8220;+&#8221; (plus) character is a special case since it was once treated as a
space character in URLs, and was changed in RFC3986 <a class="footnote-reference" href="#id11" id="id5">[5]</a> such that the &#8220;+&#8221;
would not be treated as a space. To minimize confusion when the plus
character appears in an identifier, DataONE recommends that the character
be percent-escaped (<code class="docutils literal"><span class="pre">%2B</span></code>) when it appears in DataONE service URLs. All
DataONE libraries and services operate in this manner.</p>
</div>
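<p>The ambiguity of an unescaped plus can be seen with the Python 3 standard
library, which offers both decoding conventions. This is an illustrative sketch
only; the identifier <code class="docutils literal"><span class="pre">a+b</span></code> is made up for the example.</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>from urllib.parse import quote, unquote, unquote_plus

pid = "a+b"

# Percent-escape the plus so that no decoder can mistake it for a space.
escaped = quote(pid, safe="")

# Both decoding conventions recover the original identifier.
strict = unquote(escaped)
form_style = unquote_plus(escaped)

# Left unescaped, the plus is ambiguous: form-style decoding yields a space.
ambiguous = unquote_plus(pid)

print(escaped, strict, form_style, repr(ambiguous))
</pre></div>
</div>
<p>Here <code class="docutils literal"><span class="pre">unquote_plus</span></code> follows the older form-encoding convention, so only the
escaped form decodes to the same identifier under both conventions.</p>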
<p>The necessary encoding of URLs can usually be achieved through standard
libraries available in many languages, with the caveat that the encoding
must follow the RFC3986 encoding rules. Many packages over-escape, leaving only
the unreserved character set unescaped. For its client libraries, DataONE
takes a minimal escaping approach within the latitude RFC3986 allows:
it uses [pchar] - [&#8216;+&#8217;] as the set of unescaped characters for
identifiers in path segments, and [pchar] - [&#8216;+&#8217;, &#8216;&amp;&#8217;, &#8216;=&#8217;] + [&#8216;/&#8217;, &#8216;?&#8217;] for
identifiers in query segments (a segment in both cases meaning the characters
between delimiters). For example:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>example-location-dependent-__/__?__&amp;__=__
example-common-unescaped-;:@$-_.!*()&#39;,~
</pre></div>
</div>
<p>will be encoded in paths to:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>example-location-dependent-__%2F__%3F__&amp;__=__
example-common-unescaped-;:@$-_.!*()&#39;,~
</pre></div>
</div>
<p>and encoded in the query section to:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>example-location-dependent-__/__?__%26__%3D__
example-common-unescaped-;:@$-_.!*()&#39;,~
</pre></div>
</div>
<p>Note that RFC3986 <a class="footnote-reference" href="#id11" id="id6">[5]</a> treats the query section of the URI as a black box, so &#8216;&amp;&#8217;
and &#8216;=&#8217; may appear there unescaped (to be used as sub-delimiters). Because DataONE
encodes content at the segment level, those characters must be escaped when they
are part of an identifier. Implementations that use standard
encoding routines need to know how the chosen package treats these characters.</p>
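<p>The two sets of unescaped characters described above can be passed directly
to a standard encoding routine. The following Python 3 sketch is one way to do
so and is not taken from the DataONE client libraries; the names
<code class="docutils literal"><span class="pre">PATH_SAFE</span></code>, <code class="docutils literal"><span class="pre">QUERY_SAFE</span></code>, <code class="docutils literal"><span class="pre">encode_path_segment</span></code> and
<code class="docutils literal"><span class="pre">encode_query_segment</span></code> are hypothetical. Letters and digits are always left
unescaped by <code class="docutils literal"><span class="pre">quote()</span></code>, so only the punctuation needs to be listed.</p>
<div class="highlight-python"><div class="highlight"><pre><span></span>from urllib.parse import quote

# Punctuation that RFC3986 allows in a path segment (pchar), with '+'
# left out. "\x26" is the ampersand character.
PATH_SAFE = "-._~!$\x26'()*,;=:@"

# Query segments: additionally escape the ampersand and '=',
# but leave '/' and '?' unescaped.
QUERY_SAFE = PATH_SAFE.replace("\x26", "").replace("=", "") + "/?"


def encode_path_segment(pid):
    """Percent-encode pid (as UTF-8) for use as a URL path segment."""
    return quote(pid, safe=PATH_SAFE)


def encode_query_segment(pid):
    """Percent-encode pid (as UTF-8) for use as a query key or value."""
    return quote(pid, safe=QUERY_SAFE)


if __name__ == "__main__":
    pid = "10.1000/182"
    print(encode_path_segment(pid))
    print(encode_query_segment(pid))
</pre></div>
</div>
<p>Applied to the two example identifiers above, these functions should
reproduce the path and query encodings shown.</p>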
<p>The following examples in Python and Java illustrate percent encoding of data
such as an identifier so that it can be appended to a URL. Each reads UTF-8
encoded input from <em>stdin</em> and writes percent-encoded or decoded
output. In Java pseudo-code, the general process is as follows.</p>
<div class="highlight-java"><div class="highlight"><pre><span></span><span class="c1">// pseudo-code: this will not compile!</span>

<span class="n">CharacterSet</span> <span class="n">PATH_SAFE</span> <span class="o">=</span> <span class="n">RFC3986_PCHAR</span> <span class="n">and</span> <span class="n">not</span> <span class="o">[</span><span class="sc">&#39;+&#39;</span><span class="o">];</span>
<span class="n">CharacterSet</span> <span class="n">QUERY_SAFE</span> <span class="o">=</span> <span class="n">PATH_SAFE</span> <span class="n">and</span> <span class="n">not</span> <span class="o">[</span><span class="sc">&#39;&amp;&#39;</span><span class="o">,</span><span class="sc">&#39;=&#39;</span><span class="o">]</span> <span class="n">or</span> <span class="o">[</span><span class="sc">&#39;?&#39;</span><span class="o">,</span><span class="sc">&#39;/&#39;</span><span class="o">];</span>

<span class="n">String</span> <span class="nf">encodeUtf8_pathSegment</span><span class="o">(</span><span class="n">identifier</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">String</span> <span class="n">utf8ID</span> <span class="o">=</span> <span class="n">identifier</span><span class="o">.</span><span class="na">translate</span><span class="o">(</span><span class="s">&quot;UTF-8&quot;</span><span class="o">);</span>
    <span class="k">return</span> <span class="n">encodedID</span> <span class="o">=</span> <span class="n">percentEscape</span><span class="o">(</span><span class="n">utf8ID</span><span class="o">,</span><span class="n">PATH_SAFE</span><span class="o">);</span>
<span class="o">}</span>

<span class="n">String</span> <span class="nf">encodeUtf8_querySegment</span><span class="o">(</span><span class="n">identifier</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">String</span> <span class="n">utf8ID</span> <span class="o">=</span> <span class="n">identifier</span><span class="o">.</span><span class="na">translate</span><span class="o">(</span><span class="s">&quot;UTF-8&quot;</span><span class="o">);</span>
    <span class="k">return</span> <span class="n">encodedID</span> <span class="o">=</span> <span class="n">percentEscape</span><span class="o">(</span><span class="n">utf8ID</span><span class="o">,</span><span class="n">QUERY_SAFE</span><span class="o">);</span>
<span class="o">}</span>

<span class="n">String</span> <span class="nf">decodeString</span><span class="o">(</span><span class="n">string</span><span class="o">)</span> <span class="o">{</span>
    <span class="c1">// some decoders still follow the older convention of treating</span>
    <span class="c1">// &#39;+&#39; as a space, so escape any literal &#39;+&#39; as %2B first</span>
    <span class="c1">// to make sure it survives decoding unchanged.</span>

    <span class="n">String</span> <span class="n">correctedString</span> <span class="o">=</span> <span class="n">string</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;+&quot;</span><span class="o">,</span><span class="s">&quot;%2B&quot;</span><span class="o">);</span>
    <span class="k">return</span> <span class="n">decodePercentEscaped</span><span class="o">(</span><span class="n">correctedString</span><span class="o">);</span>
<span class="o">}</span>
</pre></div>
</div>
<div class="highlight-python"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">sys</span>
<span class="kn">import</span> <span class="nn">urllib.parse</span>

<span class="k">def</span> <span class="nf">pctEncode</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
  <span class="sd">&#39;&#39;&#39;Percent encode the string data (as utf-8 bytes) so that it is</span>
<span class="sd">  ready for appending as a path element to a URL.</span>
<span class="sd">  &#39;&#39;&#39;</span>
  <span class="c1"># Python 3: quote() encodes str input as utf-8 by default</span>
  <span class="n">response</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">parse</span><span class="o">.</span><span class="n">quote</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">safe</span><span class="o">=</span><span class="s2">&quot;:&quot;</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">response</span>


<span class="k">def</span> <span class="nf">pctDecode</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>
  <span class="sd">&#39;&#39;&#39;Decode a percent encoded string and return the decoded string,</span>
<span class="sd">  but first escape any literal &#39;+&#39; in the data string.</span>
<span class="sd">  &#39;&#39;&#39;</span>
  <span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s2">&quot;+&quot;</span><span class="p">,</span> <span class="s2">&quot;%2B&quot;</span><span class="p">)</span>
  <span class="n">response</span> <span class="o">=</span> <span class="n">urllib</span><span class="o">.</span><span class="n">parse</span><span class="o">.</span><span class="n">unquote</span><span class="p">(</span><span class="n">data</span><span class="p">)</span>
  <span class="k">return</span> <span class="n">response</span>


<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s2">&quot;__main__&quot;</span><span class="p">:</span>
  <span class="sd">&#39;&#39;&#39;</span>
<span class="sd">  Read utf-8 encoded input from stdin and percent encode or</span>
<span class="sd">  decode (with command line argument -d).</span>

<span class="sd">  e.g. given test_ids.txt, a UTF-8 encoded file with identifiers</span>
<span class="sd">  appearing one per line:</span>
<span class="sd">    cat test_ids.txt | python PctEncode.py | python PctEncode.py -d</span>

<span class="sd">  should output equivalent to:</span>
<span class="sd">    cat test_ids.txt</span>
<span class="sd">  &#39;&#39;&#39;</span>
  <span class="c1"># Python 3.7+: read and write utf-8 regardless of the locale</span>
  <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">reconfigure</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;utf-8&quot;</span><span class="p">)</span>
  <span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">reconfigure</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s2">&quot;utf-8&quot;</span><span class="p">)</span>
  <span class="n">doEncode</span> <span class="o">=</span> <span class="kc">True</span>
  <span class="k">try</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="s2">&quot;-d&quot;</span><span class="p">:</span>
      <span class="n">doEncode</span> <span class="o">=</span> <span class="kc">False</span>
  <span class="k">except</span> <span class="ne">IndexError</span><span class="p">:</span>
    <span class="k">pass</span>
  <span class="n">pid</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
  <span class="k">while</span> <span class="nb">len</span><span class="p">(</span><span class="n">pid</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">doEncode</span><span class="p">:</span>
      <span class="nb">print</span><span class="p">(</span><span class="n">pctEncode</span><span class="p">(</span><span class="n">pid</span><span class="p">))</span>
    <span class="k">else</span><span class="p">:</span>
      <span class="nb">print</span><span class="p">(</span><span class="n">pctDecode</span><span class="p">(</span><span class="n">pid</span><span class="p">))</span>
    <span class="n">pid</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">stdin</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
</pre></div>
</div>
<div class="highlight-java"><div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">java.io.*</span><span class="o">;</span>
<span class="kn">import</span> <span class="nn">java.net.*</span><span class="o">;</span>

<span class="kd">class</span> <span class="nc">PctEncode</span>
<span class="o">{</span>
  <span class="cm">/**</span>
<span class="cm">  Simple example of URL path encoding of UTF-8 strings for including as</span>
<span class="cm">  path elements in URLs as per RFC3986.</span>

<span class="cm">  e.g. given test_ids.txt, a UTF-8 encoded file with identifiers</span>
<span class="cm">  appearing one per line:</span>
<span class="cm">    cat test_ids.txt | java PctEncode | java PctEncode -d</span>

<span class="cm">  should output equivalent to:</span>
<span class="cm">    cat test_ids.txt</span>
<span class="cm">  */</span>

  <span class="kd">public</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">pctDecode</span><span class="o">(</span><span class="n">String</span> <span class="n">data</span><span class="o">)</span> <span class="o">{</span>
    <span class="cm">/**</span>
<span class="cm">    Decode a percent encoded string, returning a Java Unicode string</span>
<span class="cm">    */</span>
    <span class="n">String</span> <span class="n">response</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
    <span class="k">try</span> <span class="o">{</span>
      <span class="n">data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;+&quot;</span><span class="o">,</span><span class="s">&quot;%2B&quot;</span><span class="o">);</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">URLDecoder</span><span class="o">.</span><span class="na">decode</span><span class="o">(</span> <span class="n">data</span><span class="o">,</span> <span class="s">&quot;UTF-8&quot;</span><span class="o">);</span>
    <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">java</span><span class="o">.</span><span class="na">io</span><span class="o">.</span><span class="na">UnsupportedEncodingException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
      <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;Error pctDecode : &quot;</span> <span class="o">+</span> <span class="n">e</span><span class="o">.</span><span class="na">getMessage</span><span class="o">());</span>
    <span class="o">}</span>
    <span class="k">return</span> <span class="n">response</span><span class="o">;</span>
  <span class="o">}</span>


  <span class="kd">public</span> <span class="kd">static</span> <span class="n">String</span> <span class="nf">pctEncodePathSegment</span><span class="o">(</span><span class="n">String</span> <span class="n">data</span><span class="o">)</span> <span class="o">{</span>
    <span class="cm">/**</span>
<span class="cm">    Encode a Java string according to the path segment encoding rules</span>
<span class="cm">    in RFC 3986. Note that this does not encode properly for data that</span>
<span class="cm">    is to be the root of the path; it is assumed that the data will</span>
<span class="cm">    be appended to the end of a URL path.</span>
<span class="cm">    */</span>
    <span class="n">String</span> <span class="n">response</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
    <span class="k">try</span> <span class="o">{</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">URLEncoder</span><span class="o">.</span><span class="na">encode</span><span class="o">(</span> <span class="n">data</span><span class="o">,</span> <span class="s">&quot;UTF-8&quot;</span> <span class="o">);</span>
      <span class="c1">// fix outdated space-to-+ convention</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;+&quot;</span><span class="o">,</span><span class="s">&quot;%20&quot;</span><span class="o">);</span>
      <span class="c1">// now un-escape for minimally escaped result</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%3A&quot;</span><span class="o">,</span><span class="s">&quot;:&quot;</span><span class="o">).</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%28&quot;</span><span class="o">,</span><span class="s">&quot;(&quot;</span><span class="o">);</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%3B&quot;</span><span class="o">,</span><span class="s">&quot;;&quot;</span><span class="o">).</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%29&quot;</span><span class="o">,</span><span class="s">&quot;)&quot;</span><span class="o">);</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%40&quot;</span><span class="o">,</span><span class="s">&quot;@&quot;</span><span class="o">).</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%27&quot;</span><span class="o">,</span><span class="s">&quot;&#39;&quot;</span><span class="o">);</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%24&quot;</span><span class="o">,</span><span class="s">&quot;$&quot;</span><span class="o">).</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%2C&quot;</span><span class="o">,</span><span class="s">&quot;,&quot;</span><span class="o">);</span>
      <span class="n">response</span> <span class="o">=</span> <span class="n">response</span><span class="o">.</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%21&quot;</span><span class="o">,</span><span class="s">&quot;!&quot;</span><span class="o">).</span><span class="na">replace</span><span class="o">(</span><span class="s">&quot;%7E&quot;</span><span class="o">,</span><span class="s">&quot;~&quot;</span><span class="o">);</span>

    <span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="n">java</span><span class="o">.</span><span class="na">io</span><span class="o">.</span><span class="na">UnsupportedEncodingException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
      <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;Error pctEncodePathSegment : &quot;</span> <span class="o">+</span> <span class="n">e</span><span class="o">.</span><span class="na">getMessage</span><span class="o">());</span>
    <span class="o">}</span>
    <span class="k">return</span> <span class="n">response</span><span class="o">;</span>
  <span class="o">}</span>


  <span class="kd">public</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">main</span><span class="o">(</span> <span class="n">String</span><span class="o">[]</span> <span class="n">args</span> <span class="o">)</span> <span class="o">{</span>
    <span class="k">try</span> <span class="o">{</span>
      <span class="kt">boolean</span> <span class="n">doEncode</span> <span class="o">=</span> <span class="kc">true</span><span class="o">;</span>
      <span class="c1">// decode when the first argument is -d, otherwise encode</span>
      <span class="k">if</span> <span class="o">(</span><span class="n">args</span><span class="o">.</span><span class="na">length</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="o">&amp;&amp;</span> <span class="n">args</span><span class="o">[</span><span class="mi">0</span><span class="o">].</span><span class="na">equals</span><span class="o">(</span> <span class="s">&quot;-d&quot;</span> <span class="o">))</span> <span class="o">{</span>
        <span class="n">doEncode</span> <span class="o">=</span> <span class="kc">false</span><span class="o">;</span>
      <span class="o">}</span>

      <span class="n">PrintStream</span> <span class="n">outs</span> <span class="o">=</span> <span class="k">new</span> <span class="n">PrintStream</span><span class="o">(</span> <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">,</span> <span class="kc">true</span><span class="o">,</span> <span class="s">&quot;UTF-8&quot;</span> <span class="o">);</span>
      <span class="n">InputStreamReader</span> <span class="n">isr</span> <span class="o">=</span> <span class="k">new</span> <span class="n">InputStreamReader</span><span class="o">(</span> <span class="n">System</span><span class="o">.</span><span class="na">in</span><span class="o">,</span> <span class="s">&quot;UTF-8&quot;</span> <span class="o">);</span>
      <span class="n">BufferedReader</span> <span class="n">reader</span> <span class="o">=</span> <span class="k">new</span> <span class="n">BufferedReader</span><span class="o">(</span> <span class="n">isr</span> <span class="o">);</span>
      <span class="n">String</span> <span class="n">id</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
      <span class="n">String</span> <span class="n">data</span> <span class="o">=</span> <span class="kc">null</span><span class="o">;</span>
      <span class="k">while</span> <span class="o">(</span> <span class="o">(</span><span class="n">id</span> <span class="o">=</span> <span class="n">reader</span><span class="o">.</span><span class="na">readLine</span><span class="o">())</span> <span class="o">!=</span> <span class="kc">null</span> <span class="o">)</span> <span class="o">{</span>
        <span class="k">if</span> <span class="o">(</span><span class="n">doEncode</span><span class="o">)</span> <span class="o">{</span>
          <span class="n">data</span> <span class="o">=</span> <span class="n">pctEncodePathSegment</span><span class="o">(</span> <span class="n">id</span> <span class="o">);</span>
        <span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
          <span class="n">data</span> <span class="o">=</span> <span class="n">pctDecode</span><span class="o">(</span> <span class="n">id</span> <span class="o">);</span>
        <span class="o">}</span>
        <span class="n">outs</span><span class="o">.</span><span class="na">println</span><span class="o">(</span> <span class="n">data</span> <span class="o">);</span>
      <span class="o">}</span>
    <span class="o">}</span> <span class="k">catch</span><span class="o">(</span><span class="n">java</span><span class="o">.</span><span class="na">io</span><span class="o">.</span><span class="na">IOException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
      <span class="n">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="s">&quot;Error main: &quot;</span> <span class="o">+</span> <span class="n">e</span><span class="o">.</span><span class="na">getMessage</span><span class="o">());</span>
    <span class="o">}</span>
  <span class="o">}</span>
<span class="o">}</span>
</pre></div>
</div>
<p>Given this code and a UTF-8 encoded source file <em>test_ids.txt</em> such as:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>ö
10.1000/182
urn:lsid:ubio.org:namebank:11815
http://example.com/data/mydata?row=24
ldap://ldap1.example.net:6666/o=University%20of%20Michigan,%20c=US??sub?(cn=Babs%20Jensen)&quot;,
ฉันกินกระจกได้
Is_féidir_liom_ithe_gloine
</pre></div>
</div>
<p>Each of the following commands should produce the same output as <code class="docutils literal"><span class="pre">cat</span> <span class="pre">test_ids.txt</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cat</span> <span class="n">test_ids</span><span class="o">.</span><span class="n">txt</span> <span class="o">|</span> <span class="n">java</span> <span class="n">PctEncode</span> <span class="o">|</span> <span class="n">python</span> <span class="n">PctEncode</span><span class="o">.</span><span class="n">py</span> <span class="o">-</span><span class="n">d</span>
<span class="n">cat</span> <span class="n">test_ids</span><span class="o">.</span><span class="n">txt</span> <span class="o">|</span> <span class="n">python</span> <span class="n">PctEncode</span><span class="o">.</span><span class="n">py</span> <span class="o">|</span> <span class="n">java</span> <span class="n">PctEncode</span> <span class="o">-</span><span class="n">d</span>
</pre></div>
</div>
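<p>The same round trip can also be checked in-process. The following is a minimal sketch, not part of the DataONE code base: the class name <code class="docutils literal"><span class="pre">RoundTripCheck</span></code> and the methods <code class="docutils literal"><span class="pre">encode</span></code>, <code class="docutils literal"><span class="pre">decode</span></code> and <code class="docutils literal"><span class="pre">roundTrips</span></code> are hypothetical names, and the bodies simply restate the logic of the two Java methods shown above. The identifier <code class="docutils literal"><span class="pre">a</span> <span class="pre">b+c</span></code> is an added illustration that does not appear in <em>test_ids.txt</em>.</p>

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class RoundTripCheck {

  // Characters that URLEncoder escapes but that are left unescaped here,
  // mirroring the replace() chain in pctEncodePathSegment above.
  private static final String[][] UNESCAPE = {
    {"%3A", ":"}, {"%28", "("}, {"%3B", ";"}, {"%29", ")"}, {"%40", "@"},
    {"%27", "'"}, {"%24", "$"}, {"%2C", ","}, {"%21", "!"}, {"%7E", "~"}
  };

  // Hypothetical helper: same approach as pctEncodePathSegment.
  public static String encode(String data) {
    try {
      // URLEncoder writes spaces as +, so rewrite them as %20
      String response = URLEncoder.encode(data, "UTF-8").replace("+", "%20");
      for (String[] pair : UNESCAPE) {
        response = response.replace(pair[0], pair[1]);
      }
      return response;
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
  }

  // Hypothetical helper: same approach as pctDecode.
  public static String decode(String data) {
    try {
      // protect literal + characters, which URLDecoder would read as spaces
      return URLDecoder.decode(data.replace("+", "%2B"), "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 should always be available", e);
    }
  }

  // True when decoding the encoded identifier gives back the original.
  public static boolean roundTrips(String id) {
    return decode(encode(id)).equals(id);
  }

  public static void main(String[] args) {
    String[] ids = {
      "\u00f6",
      "10.1000/182",
      "urn:lsid:ubio.org:namebank:11815",
      "http://example.com/data/mydata?row=24",
      "Is_f\u00e9idir_liom_ithe_gloine",
      "a b+c"  // illustration only: exercises the space and + handling
    };
    for (String id : ids) {
      System.out.println(encode(id) + " round trips: " + roundTrips(id));
    }
  }
}
```

<p>With this sketch, <code class="docutils literal"><span class="pre">encode(&quot;a</span> <span class="pre">b+c&quot;)</span></code> yields <code class="docutils literal"><span class="pre">a%20b%2Bc</span></code>: the space becomes <code class="docutils literal"><span class="pre">%20</span></code> rather than <code class="docutils literal"><span class="pre">+</span></code>, and the literal plus sign is escaped, so decoding restores both characters.</p>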
<table class="docutils footnote" frame="void" id="id7" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><a class="reference external" href="http://n2t.net/ezid/">http://n2t.net/ezid/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id8" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td><a class="reference external" href="http://lsids.sourceforge.net/">http://lsids.sourceforge.net/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id9" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td><a class="reference external" href="http://www.doi.org/">http://www.doi.org/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id10" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td><a class="reference external" href="http://www.w3.org/TR/xml11/#charsets">http://www.w3.org/TR/xml11/#charsets</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id11" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[5]</td><td><em>(<a class="fn-backref" href="#id5">1</a>, <a class="fn-backref" href="#id6">2</a>)</em> <a class="reference external" href="http://tools.ietf.org/html/rfc3986">http://tools.ietf.org/html/rfc3986</a></td></tr>
</tbody>
</table>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
    <p class="logo"><a href="http://dataone.org">
      <img class="logo" src="../_static/dataone_logo.png" alt="Logo"/>
    </a></p>
  <h3><a href="../index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Identifiers in DataONE</a><ul>
<li><a class="reference internal" href="#uniqueness">Uniqueness</a></li>
<li><a class="reference internal" href="#authority">Authority</a></li>
<li><a class="reference internal" href="#opacity">Opacity</a></li>
<li><a class="reference internal" href="#immutability">Immutability</a></li>
<li><a class="reference internal" href="#resolvability">Resolvability</a></li>
<li><a class="reference internal" href="#granularity">Granularity</a></li>
<li><a class="reference internal" href="#structure">Structure</a></li>
<li><a class="reference internal" href="#serializing">Serializing</a></li>
</ul>
</li>
</ul>
<h3>Related Topics</h3>
<ul>
  <li><a href="../index.html">Documentation Overview</a><ul>
  <li><a href="index.html">&lt;no title&gt;</a><ul>
      <li>Previous: <a href="ContentImmutability.html" title="previous chapter">Immutability of Content in DataONE</a></li>
      <li>Next: <a href="Authentication.html" title="next chapter">Identity Management and Authenticated Session Management</a></li>
  </ul></li>
  </ul></li>
</ul>
<div id="searchbox" style="display: none" role="search">
  <h3>Quick search</h3>
    <form class="search" action="../search.html" method="get">
      <div><input type="text" name="q" /></div>
      <div><input type="submit" value="Go" /></div>
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>

    <div class="footer">
      <div id="copyright">
      &copy; Copyright <a href="http://www.dataone.org">2009-2017, DataONE</a>.
        [ <a href="../_sources/design/PIDs.txt"
               rel="nofollow">Page Source</a> |
          <a href='https://redmine.dataone.org/projects/d1/repository/changes/documents/Projects/cicore/architecture/api-documentation/source/design/PIDs.txt'
            rel="nofollow">Revision History</a> ]&nbsp;&nbsp;
      </div>
      <div id="acknowledgement">
        <p>This material is based upon work supported by the National Science Foundation
          under Grant Numbers <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=0830944">0830944</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1430508">1430508</a>.</p>
        <p>Any opinions, findings, and conclusions or recommendations expressed in this
           material are those of the author(s) and do not necessarily reflect the views
           of the National Science Foundation.</p>
      </div>
    </div>
  </body>
</html>