<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>Very Large Data Packages &#8212; v2.1.0-beta</title>
    
    <link rel="stylesheet" href="../_static/dataone.css" type="text/css" />
    <link rel="stylesheet" href="../_static/pygments.css" type="text/css" />
    
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '../',
        VERSION:     '2.1.0-beta',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true,
        SOURCELINK_SUFFIX: '.txt'
      };
    </script>
    <script type="text/javascript" src="../_static/mathjax_pre.js"></script>
    <script type="text/javascript" src="../_static/jquery.js"></script>
    <script type="text/javascript" src="../_static/underscore.js"></script>
    <script type="text/javascript" src="../_static/doctools.js"></script>
    <script type="text/javascript" src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML"></script>
    <script type="text/javascript" src="../_static/sidebar.js"></script>
    <link rel="author" title="About these documents" href="../about.html" />
    <link rel="index" title="Index" href="../genindex.html" />
    <link rel="search" title="Search" href="../search.html" />
    <link rel="next" title="Spatial Search and Plotting Using Geohashes" href="geohash.html" />
    <link rel="prev" title="Supporting multiple API versions" href="Versions.html" />
   
  
  <link media="only screen and (max-device-width: 480px)" href="../_static/small_dataone.css" type= "text/css" rel="stylesheet" />

  </head>
  <body role="document">
  
    <div class="version_notice">
      <p>
      <span class='bold'>Warning:</span> These documents are under active 
      development and subject to change (version 2.1.0-beta).<br />
      The latest release documents are at:
      <a href="https://purl.dataone.org/architecture">https://purl.dataone.org/architecture</a>
      </p>
    </div>

    <div class="related" role="navigation" aria-label="related navigation">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="../genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="../py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="geohash.html" title="Spatial Search and Plotting Using Geohashes"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="Versions.html" title="Supporting multiple API versions"
             accesskey="P">previous</a> |</li>
        <li class="nav-item nav-item-0"><a href="../index.html"></a> &#187;</li>
          <li class="nav-item nav-item-1"><a href="index.html" accesskey="U">&lt;no title&gt;</a> &#187;</li> 
      </ul>
    </div>  

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="very-large-data-packages">
<h1><a class="toc-backref" href="#id1">Very Large Data Packages</a><a class="headerlink" href="#very-large-data-packages" title="Permalink to this headline">¶</a></h1>
<table class="docutils field-list" frame="void" rules="none">
<col class="field-name" />
<col class="field-body" />
<tbody valign="top">
<tr class="field-odd field"><th class="field-name" colspan="2">Document Status:</th></tr>
<tr class="field-odd field"><td>&nbsp;</td><td class="field-body"><table border="1" class="first last docutils">
<colgroup>
<col width="11%" />
<col width="89%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Status</th>
<th class="head">Comment</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>DRAFT</td>
<td>(rnahf) committed minor modifications shortly (1hr) after email
to <a class="reference external" href="mailto:developers&#37;&#52;&#48;dataone&#46;org">developers<span>&#64;</span>dataone<span>&#46;</span>org</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#very-large-data-packages" id="id1">Very Large Data Packages</a><ul>
<li><a class="reference internal" href="#synopsis" id="id2">Synopsis</a></li>
<li><a class="reference internal" href="#identified-issues" id="id3">Identified Issues</a><ul>
<li><a class="reference internal" href="#resource-map-creation" id="id4">Resource map creation</a></li>
<li><a class="reference internal" href="#rdf-deserialization" id="id5">RDF Deserialization</a></li>
<li><a class="reference internal" href="#indexing" id="id6">Indexing</a></li>
<li><a class="reference internal" href="#whole-package-download" id="id7">Whole-Package Download</a></li>
</ul>
</li>
<li><a class="reference internal" href="#mitigations" id="id8">Mitigations</a><ul>
<li><a class="reference internal" href="#determining-member-count" id="id9">Determining Member Count</a></li>
<li><a class="reference internal" href="#determining-total-package-size-for-download" id="id10">Determining total package size for download</a></li>
<li><a class="reference internal" href="#determining-memory-requirements-for-deserialization" id="id11">Determining Memory Requirements for deserialization</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</div>
<div class="section" id="synopsis">
<h2><a class="toc-backref" href="#id2">Synopsis</a><a class="headerlink" href="#synopsis" title="Permalink to this headline">¶</a></h2>
<p>While many data packages are of modest size (&lt;100 objects), some large studies
generate upwards of 100,000 datasets that form a single data package.   These very large
data packages challenge performance limits in the DataONE data ingest cycle and
can present usability issues in user interfaces not prepared for them.  Both memory
use and processor time increase dramatically with the number of data objects
and relationships expressed.</p>
<p>Potential submitters of packages containing large numbers of data objects must be
mindful that a package with such a large number of objects is likely to be unusable
for the majority of interested parties, and should consider consolidating and
compressing the individual objects into fewer objects to allow easier discovery,
inspection, and download. This should be especially considered if the objects in
the package would not be usefully retrieved individually.</p>
<p>Creation of large resource maps is potentially the most time-consuming activity,
depending on the tool used.  Deserialization is comparatively quick, but the memory
requirements are high, depending on the type of model used during parsing.  At the
indexing stage, the issues are the time needed to process index record updates and
the resulting number of items in certain fields in the solr records.  Lastly,
high-level clients would like to offer whole-package downloads safely, but to do so
they need to be able to detect data packages large enough to overwhelm them.</p>
<p>Below are discussions and test results of the known issues related to very large
resource maps, presented in order of when encountered in the object lifecycle.</p>
</div>
<div class="section" id="identified-issues">
<h2><a class="toc-backref" href="#id3">Identified Issues</a><a class="headerlink" href="#identified-issues" title="Permalink to this headline">¶</a></h2>
<div class="section" id="resource-map-creation">
<h3><a class="toc-backref" href="#id4">Resource map creation</a><a class="headerlink" href="#resource-map-creation" title="Permalink to this headline">¶</a></h3>
<p>Building resource maps with the foresite library involves many checks to make
sure that the map validates.  First the identifiers of the data and metadata are
added to a graph held in memory, then the graph is serialized to RDF/XML format.
For small packages the overhead for building the graph and performing consistency
checks is minimal, but in the tests below build time grows much faster than linearly
with the number of objects in the package, and memory use rises sharply for the largest package.</p>
<p>Test results on different size resource maps are summarized below.  In all cases
there is one metadata object that documents all of the objects.</p>
<table border="1" class="docutils">
<colgroup>
<col width="27%" />
<col width="31%" />
<col width="21%" />
<col width="21%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head"># of objects</th>
<th class="head">time to build</th>
<th class="head">memory</th>
<th class="head">file size</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>10</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>7 K</td>
</tr>
<tr class="row-odd"><td>33</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>20 K</td>
</tr>
<tr class="row-even"><td>100</td>
<td>2  seconds</td>
<td>45 MB</td>
<td>60 K</td>
</tr>
<tr class="row-odd"><td>330</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
<td>192 K</td>
</tr>
<tr class="row-even"><td>1000</td>
<td>6  seconds</td>
<td>20 MB</td>
<td>600 K</td>
</tr>
<tr class="row-odd"><td>3300</td>
<td>24  seconds</td>
<td>23 MB</td>
<td>2 MB</td>
</tr>
<tr class="row-even"><td>10000</td>
<td>4.5  minutes</td>
<td>30 MB</td>
<td>6 MB</td>
</tr>
<tr class="row-odd"><td>33000</td>
<td>66  minutes</td>
<td>142 MB</td>
<td>20 MB</td>
</tr>
</tbody>
</table>
<p>For creating very large resource maps, generation time using the Java foresite
toolkit is an issue.  Directly creating a serialized resource map is much faster.
For example, using an existing resource map as a template and a short Perl script,
a 100,000-member resource map was created in approximately 10 seconds, with the only
memory cost being that of holding an identifier array in memory plus any output buffering.</p>
</div>
<div class="section" id="rdf-deserialization">
<h3><a class="toc-backref" href="#id5">RDF Deserialization</a><a class="headerlink" href="#rdf-deserialization" title="Permalink to this headline">¶</a></h3>
<p>Deserialization happens both on the client side when downloading resource maps,
and on coordinating nodes, both when validating the resource map, and also when
indexing the relationships into the solr index.  Performance metrics obtained
from JUnit tests monitored with Java Visual VM are summarized below.  Fully
expressed resource maps were deserialized using both the default simple model,
and again using an OWL model loaded with the ORE schema to be able to do semantic
reasoning.  The reasoning model adds an additional 268 triples from the ORE schema.</p>
<table border="1" class="docutils">
<colgroup>
<col width="18%" />
<col width="13%" />
<col width="14%" />
<col width="14%" />
<col width="13%" />
<col width="14%" />
<col width="14%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head"></th>
<th class="head" colspan="3">Default model</th>
<th class="head" colspan="3">Reasoning model</th>
</tr>
<tr class="row-even"><th class="head"># objects</th>
<th class="head">triples</th>
<th class="head">time</th>
<th class="head">memory</th>
<th class="head">triples</th>
<th class="head">time</th>
<th class="head">memory</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-odd"><td>10</td>
<td>61</td>
<td>1 sec.</td>
<td>9 Mb</td>
<td>329</td>
<td>2 sec.</td>
<td>13 Mb</td>
</tr>
<tr class="row-even"><td>33</td>
<td>176</td>
<td>1 sec.</td>
<td>10 Mb</td>
<td>444</td>
<td>2 sec.</td>
<td>13 Mb</td>
</tr>
<tr class="row-odd"><td>100</td>
<td>511</td>
<td>2 sec.</td>
<td>15 Mb</td>
<td>779</td>
<td>2 sec.</td>
<td>17 Mb</td>
</tr>
<tr class="row-even"><td>330</td>
<td>1661</td>
<td>2 sec.</td>
<td>20 Mb</td>
<td>1929</td>
<td>3 sec.</td>
<td>17 Mb</td>
</tr>
<tr class="row-odd"><td>1000</td>
<td>5011</td>
<td>2 sec.</td>
<td>17 Mb</td>
<td>5279</td>
<td>3 sec.</td>
<td>24 Mb</td>
</tr>
<tr class="row-even"><td>3300</td>
<td>16511</td>
<td>3 sec.</td>
<td>20 Mb</td>
<td>16779</td>
<td>4 sec.</td>
<td>40 Mb</td>
</tr>
<tr class="row-odd"><td>10000</td>
<td>50011</td>
<td>6 sec.</td>
<td>30 Mb</td>
<td>50279</td>
<td>8 sec.</td>
<td>90 Mb</td>
</tr>
<tr class="row-even"><td>33000</td>
<td>165011</td>
<td>7 sec.</td>
<td>51 Mb</td>
<td>165279</td>
<td>10 sec.</td>
<td>264 Mb</td>
</tr>
<tr class="row-odd"><td>100000</td>
<td>500011</td>
<td>15 sec.</td>
<td>138 Mb</td>
<td>500279</td>
<td>26 sec.</td>
<td>792 Mb</td>
</tr>
</tbody>
</table>
<p>Listing the same results by model size shows that memory requirements are not a
simple function of the number of triples, but also depend on the model type.  The
reasoning model generally uses more memory per triple than the simple model, and the
gap is especially noticeable at very large triple counts, where the reasoning model
uses several times more memory.</p>
<table border="1" class="docutils">
<colgroup>
<col width="24%" />
<col width="21%" />
<col width="24%" />
<col width="30%" />
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">triples</th>
<th class="head">time</th>
<th class="head">memory</th>
<th class="head">model type</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>61</td>
<td>1 sec.</td>
<td>9 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>176</td>
<td>1 sec.</td>
<td>10 Mb</td>
<td>simple</td>
</tr>
<tr class="row-even"><td>329</td>
<td>2 sec.</td>
<td>13 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-odd"><td>444</td>
<td>2 sec.</td>
<td>13 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>511</td>
<td>2 sec.</td>
<td>15 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>779</td>
<td>2 sec.</td>
<td>17 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>1661</td>
<td>2 sec.</td>
<td>20 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>1929</td>
<td>3 sec.</td>
<td>17 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>5011</td>
<td>2 sec.</td>
<td>17 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>5279</td>
<td>3 sec.</td>
<td>24 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>16511</td>
<td>3 sec.</td>
<td>20 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>16779</td>
<td>4 sec.</td>
<td>40 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>50011</td>
<td>6 sec.</td>
<td>30 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>50279</td>
<td>8 sec.</td>
<td>90 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>165011</td>
<td>7 sec.</td>
<td>51 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>165279</td>
<td>10 sec.</td>
<td>264 Mb</td>
<td>reasoning</td>
</tr>
<tr class="row-even"><td>500011</td>
<td>15 sec.</td>
<td>138 Mb</td>
<td>simple</td>
</tr>
<tr class="row-odd"><td>500279</td>
<td>26 sec.</td>
<td>792 Mb</td>
<td>reasoning</td>
</tr>
</tbody>
</table>
<p>The impact is that applications that deserialize RDF files, especially automated
ones such as the index processor, will need to be able to detect when they are
dealing with a resource map that could exceed available system resources.</p>
<p>It also seems wise, given that memory issues weigh more heavily than RDF file size,
to specify that resource maps with more than 50,000 triples need to fully express
relationships, instead of relying on reasoning models to infer semantically-defined
inverse relationships. This implies that if DataONE allows resource maps to sparsely
populate their relationships, there should also be tools to tell whether an RDF
document fully expresses its relationships or relies on semantic reasoning.</p>
</div>
<div class="section" id="indexing">
<h3><a class="toc-backref" href="#id6">Indexing</a><a class="headerlink" href="#indexing" title="Permalink to this headline">¶</a></h3>
<p>When resource maps are synchronized, the map is read and - once all of the package
members are indexed - the relationships in the map are added to the index records
of the data members.  A 10,000-member package will trigger the update of 10,000
index records, adding the metadata object pid to the &#8216;isDocumentedBy&#8217; field.
Additionally, both the &#8216;contains&#8217; field in the resource map and the &#8216;documents&#8217;
field in the metadata records will be updated with the pids of the 10,000 members.
Such many-membered fields are difficult or impossible to display, and are
time-consuming to search when queried.</p>
<p>Indexing is by necessity a single-threaded process, one that can update on the
order of 100 records/minute.  Therefore a package containing 100,000 members will
take about 1000 minutes, or about 17 hours.  During this time, no other updates
will be processed.</p>
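<p>The arithmetic above can be sketched as a small helper. The function name and
its default rate are illustrative assumptions; the rate of 100 records/minute is only
the order-of-magnitude figure quoted above, not a measured constant.</p>

```python
def indexing_time_minutes(member_count, records_per_minute=100):
    """Estimate the minutes needed to update the index records of a package.

    records_per_minute defaults to the order-of-magnitude rate quoted in
    the text; treat it as an assumption and substitute a measured value.
    """
    return member_count / records_per_minute


minutes = indexing_time_minutes(100_000)
# prints: 1000.0 minutes, or about 16.7 hours
print(minutes, "minutes, or about", round(minutes / 60, 1), "hours")
```

<p>Because the estimate is linear in the member count, a 10,000-member package
would occupy the index processor for roughly 100 minutes at the same rate.</p>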
<p>Working around this issue requires a redesign of the index processor so that a
large resource map does not delay other items in the indexing queue.  Ultimately,
the solution would be to implement a different search engine for tracking package
relationships, and to expose another search endpoint using SPARQL
(<a class="reference external" href="http://en.wikipedia.org/wiki/SPARQL">http://en.wikipedia.org/wiki/SPARQL</a>), probably hiding the search query details
behind new DataONE API methods to spare the end user from having to learn another
query language to interact with DataONE.</p>
</div>
<div class="section" id="whole-package-download">
<h3><a class="toc-backref" href="#id7">Whole-Package Download</a><a class="headerlink" href="#whole-package-download" title="Permalink to this headline">¶</a></h3>
<p>The high-level DataPackage.download(packageID) method in d1_libclient implementations
by default downloads the entire collection of data package objects for local usage.
For these very large data packages, the total package size is likely to be gigabytes
of information. To better support such convenience features, there need to be ways
of determining the number of members of a package, and the total package size, prior to download.</p>
<p>This would also help in situations where the number of package members is small,
but the individual data objects are large.</p>
</div>
</div>
<div class="section" id="mitigations">
<h2><a class="toc-backref" href="#id8">Mitigations</a><a class="headerlink" href="#mitigations" title="Permalink to this headline">¶</a></h2>
<p>It is useful for an application to know when a given data package is too large for
it to work with, or will require special handling.  Ideally, this could be
determined before deserializing the XML and, for some clients, even prior to
download of the resource map itself.</p>
<p>Indexing performance is a function of member count, while deserialization performance
is a function of the number of triples. Download performance is a function of
total file size.</p>
<div class="section" id="determining-member-count">
<h3><a class="toc-backref" href="#id9">Determining Member Count</a><a class="headerlink" href="#determining-member-count" title="Permalink to this headline">¶</a></h3>
<p>For indexed resource maps, the easiest way to get the member count is with the
query:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>cn/v1/query/solr/?q=resourceMap:{pid}&amp;rows=0
</pre></div>
</div>
<p>For unindexed resource maps, a count of the occurrences of the term
&#8220;ore:isAggregatedBy&#8221; in the RDF file will suffice.</p>
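<p>A minimal sketch of the term-counting approach for unindexed resource maps,
reading the file one line at a time so that the whole document is never held in
memory. The function name is hypothetical. The sketch assumes that the serializer
binds the prefix &#8220;ore&#8221; to the ORE terms namespace and writes each statement as a
single self-closing element, so that every member contributes exactly one
occurrence of the term.</p>

```python
import io


def count_members(rdf_stream, term="ore:isAggregatedBy"):
    """Count occurrences of `term` in a text stream, one line at a time."""
    return sum(line.count(term) for line in rdf_stream)


# Hypothetical two-member fragment, for illustration only (not real RDF/XML).
sample = io.StringIO(
    "member-1 ore:isAggregatedBy aggregation\n"
    "member-2 ore:isAggregatedBy aggregation\n"
)
print(count_members(sample))  # prints 2
```

<p>For a file on disk, the same function can be applied to the object returned by
open(path, encoding="utf-8"). If a resource map uses a different namespace prefix,
the term argument would need to be changed to match.</p>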
</div>
<div class="section" id="determining-total-package-size-for-download">
<h3><a class="toc-backref" href="#id10">Determining total package size for download</a><a class="headerlink" href="#determining-total-package-size-for-download" title="Permalink to this headline">¶</a></h3>
<p>To get the total size of the package, the following solr queries can be used:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span># returns only sizes of package members
cn/v1/query/solr/?q=resourceMap:{pid}&amp;fl=id,size

# returns sizes for package members and the resource map itself
cn/v1/query/solr/?q=resourceMap:{pid} OR id:{pid}&amp;fl=id,size
</pre></div>
</div>
<p>from which the client could calculate the sum of the sizes returned.</p>
<p>To get the size of the resource map itself (useful for estimating memory requirements):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span># returns size of only the resource map
cn/v1/query/solr/?q=id:{pid}&amp;fl=id,size
</pre></div>
</div>
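<p>Summing the returned sizes on the client could look like the sketch below. It
assumes the query is sent with the standard Solr parameter wt=json, and with a rows
value large enough to return every member in one response (Solr returns only 10
rows by default). The function name and the sample identifiers and sizes are made
up for illustration.</p>

```python
import json


def total_size_bytes(solr_json_text):
    """Sum the `size` field over all documents in a Solr JSON response."""
    response = json.loads(solr_json_text)["response"]
    docs = response["docs"]
    # Refuse to report a partial total when the response was paginated.
    if len(docs) != response["numFound"]:
        raise ValueError("not all documents were returned; increase rows")
    return sum(int(doc.get("size", 0)) for doc in docs)


# Hypothetical response body with made-up identifiers and sizes.
sample = json.dumps({
    "response": {
        "numFound": 2,
        "docs": [
            {"id": "example.1", "size": 1024},
            {"id": "example.2", "size": 2048},
        ],
    }
})
print(total_size_bytes(sample))  # prints 3072
```

<p>The numFound check is a design choice: a silently truncated result would
under-report the download size, which is the failure this estimate is meant to prevent.</p>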
</div>
<div class="section" id="determining-memory-requirements-for-deserialization">
<h3><a class="toc-backref" href="#id11">Determining Memory Requirements for deserialization</a><a class="headerlink" href="#determining-memory-requirements-for-deserialization" title="Permalink to this headline">¶</a></h3>
<p>It is the number of triples and the type of model used, more so than the number of
package members, that best determines the graph model&#8217;s memory requirement, and
so any additional triples expressed for each member would multiply the model size.
The use of ORE proxies, for example, or the inclusion of provenance information
are situations where this would be the case.  DataONE <em>is</em> planning for the
inclusion of provenance statements in the resource maps, so users and developers
alike should take this into consideration.</p>
<p>The number of triples in an RDF/XML file can be determined either by parsing the
XML, or by estimating from the resource map byte count.  To count by parsing, one
would use an XML parser of choice to count all of the sub-elements of all of the
&#8220;rdf:Description&#8221; elements.  In pseudo-code:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">tripleCount</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">descriptionList</span> <span class="o">=</span> <span class="n">getRDFDescriptionElements</span><span class="p">();</span>
<span class="n">foreach</span> <span class="n">descriptionElement</span> <span class="ow">in</span> <span class="n">descriptionList</span> <span class="p">{</span>
   <span class="n">tripleCount</span> <span class="o">+=</span> <span class="n">descriptionElement</span><span class="o">.</span><span class="n">getElementList</span><span class="p">()</span><span class="o">.</span><span class="n">size</span><span class="p">;</span>
<span class="p">}</span>
</pre></div>
</div>
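<p>One possible runnable rendering of the pseudo-code, using the incremental parser
in the Python standard library so that the document tree is not kept in memory.
The function name is hypothetical. Like the pseudo-code, this sketch counts only
the sub-elements of rdf:Description elements; statements written in other RDF/XML
forms (typed node elements or property attributes) would not be counted.</p>

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION_TAG = "{" + RDF_NS + "}Description"


def count_triples(source):
    """Count the sub-elements of every rdf:Description element.

    `source` is a file name or a binary file object containing RDF/XML.
    """
    triple_count = 0
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == DESCRIPTION_TAG:
            # At the "end" event all children of the element have been parsed.
            triple_count += len(element)
            # Release the children so memory use stays roughly constant.
            element.clear()
    return triple_count
```

<p>Calling count_triples with the path of a downloaded resource map returns the
count without building a graph model, so it can be used as a guard before
deserialization.</p>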
<p>To estimate from the file size, an upper limit of the number of triples can be deduced.
RDF/XML organizes triples as predicate-object sub-elements under an rdf:Description
element for each subject. If the ratio of subjects to triples is low, then the number
of bytes per triple is determined by the length of the predicate-object sub-element.
For a 30-character identifier, that sub-element is about 100 characters, and so:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">upper</span> <span class="n">limit</span> <span class="n">on</span> <span class="n">the</span> <span class="n">number</span> <span class="n">of</span> <span class="n">triples</span> <span class="o">=</span> <span class="n">file</span> <span class="n">size</span> <span class="p">(</span><span class="nb">bytes</span><span class="p">)</span> <span class="o">/</span> <span class="mi">100</span> <span class="nb">bytes</span><span class="o">-</span><span class="n">per</span><span class="o">-</span><span class="n">triple</span>
</pre></div>
</div>
<p>So for example, a 5 MB resource map has at most 50,000 triples, assuming an average
identifier size of 30 characters (URL encoded).</p>
<p>For a point of reference, a resource map for 1 metadata object documenting 1000
objects, expressing the &#8216;ore:aggregates&#8217;, &#8216;ore:isAggregatedBy&#8217;, &#8216;cito:documents&#8217;,
&#8216;cito:isDocumentedBy&#8217;, and &#8216;cito:identifier&#8217; predicates creates 5005 triples using
1003 subjects, and was tested to create a 600K file.  Applying the upper limit
approximation (600K / 100 = 6K) gives 6000 triples, an over-estimate of about 1000,
which roughly matches the number of subjects.</p>
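<p>The file-size estimate can be written as a one-line helper. The function name is
hypothetical, the default of 100 bytes per triple is the figure derived above for
30-character identifiers, and sizes are taken in decimal units (1 MB = 1,000,000 bytes).</p>

```python
def max_triples(file_size_bytes, bytes_per_triple=100):
    """Upper limit on the number of triples in an RDF/XML resource map."""
    return file_size_bytes // bytes_per_triple


print(max_triples(5_000_000))  # 5 MB resource map: prints 50000
print(max_triples(600_000))    # 600K resource map: prints 6000
```

<p>For resource maps with much longer or much shorter identifiers, a different
bytes_per_triple value would be needed for the limit to remain meaningful.</p>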
<p>Also note that long identifiers, and identifiers dominated by non-ASCII characters
that would be percent-encoded in the file (at least 3 bytes per character), can lead to
an even higher upper limit than expected. Conversely, short identifiers in
the resource map could lead to a less robust upper limit.</p>
<p>Determining the memory requirement from the number of triples can be done either
by interpolating from the tables above, or by equation.  Curve-fits of the
deserialization performance tests using polynomial equations gave the following:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">simple</span> <span class="n">model</span> <span class="n">memory</span><span class="p">(</span><span class="n">Mb</span><span class="p">)</span> <span class="o">~</span>  <span class="mf">2.6E-15</span> <span class="o">*</span> <span class="n">triples</span><span class="o">^</span><span class="mi">3</span> <span class="o">-</span> <span class="mf">1.7E-09</span> <span class="o">*</span> <span class="n">triples</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mf">0.00044</span> <span class="o">*</span> <span class="n">triples</span> <span class="o">+</span> <span class="mf">12.7</span>
<span class="p">(</span><span class="n">R2</span> <span class="o">=</span> <span class="mf">0.99466</span><span class="p">)</span>

<span class="n">reasoning</span> <span class="n">model</span> <span class="n">memory</span><span class="p">(</span><span class="n">Mb</span><span class="p">)</span> <span class="o">~</span>   <span class="mf">1.25E-10</span> <span class="o">*</span> <span class="n">triples</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mf">0.0015</span> <span class="o">*</span> <span class="n">triples</span> <span class="o">+</span> <span class="mf">14.3</span>
<span class="p">(</span><span class="n">R2</span> <span class="o">=</span> <span class="mf">0.99997</span><span class="p">)</span>
</pre></div>
</div>
<p>Note that the simple model required (rightly or wrongly) a third-order equation
to get a curve-fit with R2 &gt; 0.9, whereas the reasoning model data could be highly
correlated with a second-order (quadratic) equation.</p>
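<p>The two curve-fits transcribe directly into code. The function names are
hypothetical and the coefficients are copied from the equations above. Being
empirical fits, they are best treated as rough guides within the tested range
(up to about 500,000 triples) rather than extrapolated beyond it.</p>

```python
def simple_model_memory_mb(triples):
    """Third-order curve-fit for the default (simple) model, in Mb."""
    return (2.6e-15 * triples**3
            - 1.7e-09 * triples**2
            + 0.00044 * triples
            + 12.7)


def reasoning_model_memory_mb(triples):
    """Second-order curve-fit for the reasoning model, in Mb."""
    return 1.25e-10 * triples**2 + 0.0015 * triples + 14.3


for triples in (5011, 50011, 500011):
    print(triples,
          round(simple_model_memory_mb(triples)),
          round(reasoning_model_memory_mb(triples)))
```

<p>At 500,011 triples the fits give approximately 133 Mb for the simple model and
796 Mb for the reasoning model, compared with the measured 138 Mb and 792 Mb in the
tables above.</p>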
<p>Expressed as a function of file size (bytes):</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">simple</span> <span class="n">model</span> <span class="n">memory</span><span class="p">(</span><span class="n">Mb</span><span class="p">)</span> <span class="o">~</span>  <span class="mf">2.6E-21</span> <span class="o">*</span> <span class="n">size</span><span class="o">^</span><span class="mi">3</span> <span class="o">-</span> <span class="mf">1.7E-13</span> <span class="o">*</span> <span class="n">size</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mf">4.4E-06</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="mf">12.7</span>

<span class="n">reasoning</span> <span class="n">model</span> <span class="n">memory</span><span class="p">(</span><span class="n">Mb</span><span class="p">)</span> <span class="o">~</span>   <span class="mf">1.25E-14</span> <span class="o">*</span> <span class="n">size</span><span class="o">^</span><span class="mi">2</span> <span class="o">+</span> <span class="mf">1.5E-05</span> <span class="o">*</span> <span class="n">size</span> <span class="o">+</span> <span class="mf">14.3</span>
</pre></div>
</div>
</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="sphinxsidebar" role="navigation" aria-label="main navigation">
        <div class="sphinxsidebarwrapper">
    <p class="logo"><a href="http://dataone.org">
      <img class="logo" src="../_static/dataone_logo.png" alt="Logo"/>
    </a></p>
  <h3><a href="../index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">Very Large Data Packages</a><ul>
<li><a class="reference internal" href="#synopsis">Synopsis</a></li>
<li><a class="reference internal" href="#identified-issues">Identified Issues</a><ul>
<li><a class="reference internal" href="#resource-map-creation">Resource map creation</a></li>
<li><a class="reference internal" href="#rdf-deserialization">RDF Deserialization</a></li>
<li><a class="reference internal" href="#indexing">Indexing</a></li>
<li><a class="reference internal" href="#whole-package-download">Whole-Package Download</a></li>
</ul>
</li>
<li><a class="reference internal" href="#mitigations">Mitigations</a><ul>
<li><a class="reference internal" href="#determining-member-count">Determining Member Count</a></li>
<li><a class="reference internal" href="#determining-total-package-size-for-download">Determining total package size for download</a></li>
<li><a class="reference internal" href="#determining-memory-requirements-for-deserialization">Determining Memory Requirements for deserialization</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<h3>Related Topics</h3>
<ul>
  <li><a href="../index.html">Documentation Overview</a><ul>
  <li><a href="index.html">&lt;no title&gt;</a><ul>
      <li>Previous: <a href="Versions.html" title="previous chapter">Supporting multiple API versions</a></li>
      <li>Next: <a href="geohash.html" title="next chapter">Spatial Search and Plotting Using Geohashes</a></li>
  </ul></li>
  </ul></li>
</ul>
<div id="searchbox" style="display: none" role="search">
  <h3>Quick search</h3>
    <form class="search" action="../search.html" method="get">
      <div><input type="text" name="q" /></div>
      <div><input type="submit" value="Go" /></div>
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>
      <div class="clearer"></div>
    </div>

    <div class="footer">
      <div id="copyright">
      &copy; Copyright <a href="http://www.dataone.org">2009-2017, DataONE</a>.
        [ <a href="../_sources/design/VeryLargeDataPackage.txt"
               rel="nofollow">Page Source</a> |
          <a href='https://redmine.dataone.org/projects/d1/repository/changes/documents/Projects/cicore/architecture/api-documentation/source/design/VeryLargeDataPackage.txt'
            rel="nofollow">Revision History</a> ]&nbsp;&nbsp;
      </div>
      <div id="acknowledgement">
        <p>This material is based upon work supported by the National Science Foundation
          under Grant Numbers <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=0830944">0830944</a> and <a href="http://www.nsf.gov/awardsearch/showAward?AWD_ID=1430508">1430508</a>.</p>
        <p>Any opinions, findings, and conclusions or recommendations expressed in this
           material are those of the author(s) and do not necessarily reflect the views
           of the National Science Foundation.</p>
      </div>
    </div>
  </body>
</html>