Äcdocutils.nodes
document
q)Åq}q(U	nametypesq}qX���querying dataoneqNsUsubstitution_defsq}qUparse_messagesq	]q
Ucurrent_sourceqNU
decorationqNUautofootnote_startq
KUnameidsq}qhUquerying-dataoneqsUchildrenq]qcdocutils.nodes
section
q)Åq}q(U	rawsourceqU�UparentqhUsourceqXl���/var/lib/jenkins/jobs/API_Documentation_trunk/workspace/api-documentation/source/design/querying_content.txtqUtagnameqUsectionqU
attributesq}q(Udupnamesq]Uclassesq]Ubackrefsq ]Uidsq!]q"haUnamesq#]q$hauUlineq%KUdocumentq&hh]q'(cdocutils.nodes
title
q()Åq)}q*(hX���Querying DataONEq+hhhhhUtitleq,h}q-(h]h]h ]h!]h#]uh%Kh&hh]q.cdocutils.nodes
Text
q/X���Querying DataONEq0ÖÅq1}q2(hh+hh)ubaubcdocutils.nodes
target
q3)Åq4}q5(hU�hhhNhUtargetq6h}q7(h!]h ]h]h]h#]Urefidq8Uindex-0q9uh%Nh&hh]ubcsphinx.ext.todo
todo_node
q:)Åq;}q<(hX|���- Attribute mapping to the list prepared previously
- Attribute mapping to sysmeta docs
- SOLR examples, specific to Mercuryq=hhhhUexpect_referenced_by_nameq>}hU	todo_nodeq?h}q@(h]h]qAUadmonition-todoqBah ]h!]qCh9ah#]uh%Kh&hUexpect_referenced_by_idqD}qEh9h4sh]qF(h()ÅqG}qH(hX���TodoqIh}qJ(h]h]h ]h!]h#]uhh;h]qKh/X���TodoqLÖÅqM}qN(hU�hhGubahh,ubcdocutils.nodes
bullet_list
qO)ÅqP}qQ(hU�h}qR(UbulletqSX���-h!]h ]h]h]h#]uhh;h]qT(cdocutils.nodes
list_item
qU)ÅqV}qW(hX1���Attribute mapping to the list prepared previouslyqXh}qY(h]h]h ]h!]h#]uhhPh]qZcdocutils.nodes
paragraph
q[)Åq\}q](hhXhhVhhhU	paragraphq^h}q_(h]h]h ]h!]h#]uh%Kh]q`h/X1���Attribute mapping to the list prepared previouslyqaÖÅqb}qc(hhXhh\ubaubahU	list_itemqdubhU)Åqe}qf(hX!���Attribute mapping to sysmeta docsqgh}qh(h]h]h ]h!]h#]uhhPh]qih[)Åqj}qk(hhghhehhhh^h}ql(h]h]h ]h!]h#]uh%Kh]qmh/X!���Attribute mapping to sysmeta docsqnÖÅqo}qp(hhghhjubaubahhdubhU)Åqq}qr(hX"���SOLR examples, specific to Mercuryqsh}qt(h]h]h ]h!]h#]uhhPh]quh[)Åqv}qw(hhshhqhhhh^h}qx(h]h]h ]h!]h#]uh%Kh]qyh/X"���SOLR examples, specific to MercuryqzÖÅq{}q|(hhshhvubaubahhdubehUbullet_listq}ubeubh[)Åq~}q(hXC���This document has been DEPRECATED: Please see :doc:`SearchMetadata`qÄhhhhhh^h}qÅ(h]h]h ]h!]h#]uh%Kh&hh]qÇ(h/X.���This document has been DEPRECATED: Please see qÉÖÅqÑ}qÖ(hX.���This document has been DEPRECATED: Please see hh~ubcsphinx.addnodes
pending_xref
qÜ)Åqá}qà(hX���:doc:`SearchMetadata`qâhh~hhhUpending_xrefqäh}qã(UreftypeX���docqåUrefwarnqçàU	reftargetqéX���SearchMetadataU	refdomainU�h!]h ]Urefexplicitâh]h]h#]UrefdocqèX���design/querying_contentqêuh%Kh]qëcdocutils.nodes
inline
qí)Åqì}qî(hhâh}qï(h]h]qñ(Uxrefqóhåeh ]h!]h#]uhháh]qòh/X���SearchMetadataqôÖÅqö}qõ(hU�hhìubahUinlineqúubaubeubcdocutils.nodes
comment
qù)Åqû}qü(hXk"��Content here is preserved for notes until the search API is completed.
Synopsis
--------

This document provides an outline for approaches to querying content available
in DataONE through the ``/object/`` collection exposed by the CNs and MNs
(i.e. :func:`MN_replication.listObjects` and :func:`CN_query.search`
methods). The same approach can be applied to the ``/log/`` collection exposed
by the CNs and MNs (i.e. the :func:`CN_query.getLogRecords` and
:func:`MN_crud.getLogRecords` methods).

There are three types of query that can be readily supported by CNs
(name-value pairs, Metacat path query, and Mercury SOLR query), and at least
one by MNs (name-value pairs). There may also be additional query types
specified in the future (e.g. CQL, SPARQL).


Overview
--------

The basic model is that a query applied against a collection acts as a filter,
restricting the results to only those objects whose properties match the
supplied query expression. The default, or unfiltered view of the collection
shows all objects (that the user is authorized to access). The query does not
shape the result, i.e. it does not indicate which fields are returned or the
structure of the response.

There seems to be two basic types of query that need to be supported. One is
querying against fairly distinct and controlled object attributes that are for
the most part, defined by the DataONE system ("system queries"). The other is
for queries that apply to the content of objects that are contributed to
DataONE ("content queries"). In this case, the content, structure, and even
representation is essentially uncontrolled, and so may vary considerably
across the universe of objects that are managed by DataONE.

A longterm goal would be to support a query syntax that is expressive enough
to enable precise discovery of content but also simple enough that at least
common queries can be expressed in a URL.

There are three types of query expression that can be supported easily with
the initial version of the DataONE cyber-infrastructure:

1) Simple name-value pairs combined together with a single logical operator
(e.g. AND).

2) The Path Query syntax / structure that is used by Metacat. This is a
potentially very expressive query that is encoded in an XML structure, and so
can be unwieldy for passing in a URL (POST is typically used) or generation by
hand.

3) The SOLR / Lucene query syntax that is supported by Mercury. Fairly
sophisticated queries can be expressed, but there is no mechanism for querying
against structure (e.g. matching the value of a term that is a child of some
other element). SOLR queries are designed to be transmitted in URLs and are
reasonably simple to create by hand.

The different types of query are described in more detail below.

Since it is feasible that MNs and CNs could support multiple query types, it
is desirable that the client provide a hint about the type of query being
transmitted through a URL parameter such as "``qt``" (query type), with::

  qt=nvp    --> Name, value pairs
  qt=path   --> Metacat path query
  qt=solr   --> SOLR query syntax (used by Mercury)


Simple NV Pairs
---------------

The basic approach here is the use name/value pairs (NVPs) in the URL to
construct a query, with names typically mapping to an attribute + comparison
operator (with comparison operator indicated as a suffix to the attribute),
and values being the value to compare against entries in the database.
Multiple NVPs are combined together with either the logical AND operator or
the logical OR operator. The types of queries that can be expressed are quite
limited, though can be sufficient for restricting results to a portion of a
data set modeled as a flat table.

The primary goal of this query syntax is to enable simple implementation of
range restrictions for collections available on MNs.

An example of how a simple query might express "objects of type data that have
been modified since 6AM on the first of January, 2010 UTC"::

  ../object/?qt=nvp&oclass=data&lastModified_gt=20100101T060000+00

Suggestions for comparison operator suffixes:

======= ===========================
Suffix  Comparison Operator
======= ===========================
None    Equals (==) (default)
_eq     Equals (==)
_ne     Not equal (!=)
_lt     Less than (<)
_le     Less than or equals (<=)
_gt     Greater than (>)
_ge     Greater than or equals (>=)
======= ===========================

The presence of one or more wildcard characters in the value for an
equivalence operator would invoke the equivalent of a substring search. For
example::

  ../object/?qt=nvp&oclass=d*

could be mapped to the SQL WHERE clause::

  WHERE oclass LIKE 'd%'

The general grammar of the query can be expressed as:

.. productionlist::
   NVPQuery : { `nvpair` }
   nvpair   : `name` + "=" + `value`
   name     : string [+ `operator`]
   operator : "_eq" | "_ne" | "_lt" | "_le" | "_gt" | "_ge"
   value    : string



An alternative approach is to use enumerated triples, so for the same query as
above (with ``a`` referring to "attribute name", ``c`` to "comparison
operator", and ``v`` to "value")::

  ../object/?qt=nvp&a0=oclass&c0=eq&v0=data&
                    a1=lastModified&c1=gt&v1=20100101T060000+00

This approach has an advantage of specifying simple logical operators, e.g.::

  &lop0_1=AND

which would indicate that the logical operator between the first and second
query elements is "AND". This gets messy pretty quickly though when
considering precedence rules.


Metacat Path Query
------------------

.. TODO::
   - Rewrite this section to use the EarthGrid query syntax, which is more
     readable and expresses the same concepts as the pathquery

Metacat is an XML database, and so must support mechanisms for querying not
just the attribute name, but also its location relative to other elements of
the document (similar to XPath). The path query also indicates the elements
that will be returned in the response. An `example path query`_::

  <pathquery version="1.0">
    <meta_file_id>unspecified</meta_file_id>
    <querytitle>unspecified</querytitle>
    <returnfield>dataset/title</returnfield>
    <returnfield>keyword</returnfield>
    <returnfield>originator/individualName/surName</returnfield>
    <returndoctype>eml://ecoinformatics.org/eml-2.0.1</returndoctype>
    <returndoctype>eml://ecoinformatics.org/eml-2.0.0</returndoctype>
    <querygroup operator="UNION">
      <queryterm casesensitive="false" searchmode="contains">
        <value>Plant</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
      <queryterm casesensitive="false" searchmode="contains">
        <value>plant</value>
        <pathexpr>keyword</pathexpr>
      </queryterm>
    </querygroup>
  </pathquery>

This query states something like return the field values ``dataset/title``,
``keyword``, and ``originator/individualName/surName`` from documents where
the string "plant" appears in the ``keyword`` attribute or the string "Plant"
appears in the ``dataset/title`` attribute. The comparisons are performed
without consideration of case.

Since path queries are expressed as XML documents, they can get quite large
and so can be unwieldy when sending over a HTTP GET request. However, the
types of queries that can be created can be quite precise and expressive, so
these should be supported by the CN services, which shouldn't involve much
more than passing the query through to the Metacat instance operating as the
document store on the CN.

.. _example path query: https://code.ecoinformatics.org/code/metacat/trunk/docs/user/metacatquery.html


SOLR Query Syntax
-----------------

- http://wiki.apache.org/solr/SolrQuerySyntax

- http://lucene.apache.org/java/2_4_0/queryparsersyntax.html




Query Attributes
----------------

- Best if query attributes were consistent across all the query types

- Distinction between searches against system metadata and science metadata
  (though some overlap of attributes)

- Log searches can probably be pretty simple - just slicing by time

- MNs and CNs should support introspection that lists the supported query
  types and the supported query attributes



Misc Notes

Google visualization api query language: http://code.google.com/apis/visualization/documentation/querylanguage.html

SRU/SRW and CQL: http://www.loc.gov/standards/sru/

OpenSearch: http://www.opensearch.org/Home

XPath: http://www.w3.org/TR/xpath and XQuery: http://www.w3.org/TR/xquery/
(appropriate for querying against a general XML model)

SPARQL (assuming you can express content in an RDF model):
http://www.w3.org/TR/rdf-sparql-query/

TAPIR:
http://www.tdwg.org/dav/subgroups/tapir/1.0/docs/TAPIRSpecification_2008-02-07.html

MetaCat (EarthGRID):
https://code.ecoinformatics.org/code/metacat/trunk/docs/user/metacatquery.htmlhhhhhUcommentq†h}q°(U	xml:spaceq¢Upreserveq£h!]h ]h]h]h#]uh%Kˆh&hh]q§h/Xk"��Content here is preserved for notes until the search API is completed.
Synopsis
--------

This document provides an outline for approaches to querying content available
in DataONE through the ``/object/`` collection exposed by the CNs and MNs
(i.e. :func:`MN_replication.listObjects` and :func:`CN_query.search`
methods). The same approach can be applied to the ``/log/`` collection exposed
by the CNs and MNs (i.e. the :func:`CN_query.getLogRecords` and
:func:`MN_crud.getLogRecords` methods).

There are three types of query that can be readily supported by CNs
(name-value pairs, Metacat path query, and Mercury SOLR query), and at least
one by MNs (name-value pairs). There may also be additional query types
specified in the future (e.g. CQL, SPARQL).


Overview
--------

The basic model is that a query applied against a collection acts as a filter,
restricting the results to only those objects whose properties match the
supplied query expression. The default, or unfiltered view of the collection
shows all objects (that the user is authorized to access). The query does not
shape the result, i.e. it does not indicate which fields are returned or the
structure of the response.

There seems to be two basic types of query that need to be supported. One is
querying against fairly distinct and controlled object attributes that are for
the most part, defined by the DataONE system ("system queries"). The other is
for queries that apply to the content of objects that are contributed to
DataONE ("content queries"). In this case, the content, structure, and even
representation is essentially uncontrolled, and so may vary considerably
across the universe of objects that are managed by DataONE.

A longterm goal would be to support a query syntax that is expressive enough
to enable precise discovery of content but also simple enough that at least
common queries can be expressed in a URL.

There are three types of query expression that can be supported easily with
the initial version of the DataONE cyber-infrastructure:

1) Simple name-value pairs combined together with a single logical operator
(e.g. AND).

2) The Path Query syntax / structure that is used by Metacat. This is a
potentially very expressive query that is encoded in an XML structure, and so
can be unwieldy for passing in a URL (POST is typically used) or generation by
hand.

3) The SOLR / Lucene query syntax that is supported by Mercury. Fairly
sophisticated queries can be expressed, but there is no mechanism for querying
against structure (e.g. matching the value of a term that is a child of some
other element). SOLR queries are designed to be transmitted in URLs and are
reasonably simple to create by hand.

The different types of query are described in more detail below.

Since it is feasible that MNs and CNs could support multiple query types, it
is desirable that the client provide a hint about the type of query being
transmitted through a URL parameter such as "``qt``" (query type), with::

  qt=nvp    --> Name, value pairs
  qt=path   --> Metacat path query
  qt=solr   --> SOLR query syntax (used by Mercury)


Simple NV Pairs
---------------

The basic approach here is the use name/value pairs (NVPs) in the URL to
construct a query, with names typically mapping to an attribute + comparison
operator (with comparison operator indicated as a suffix to the attribute),
and values being the value to compare against entries in the database.
Multiple NVPs are combined together with either the logical AND operator or
the logical OR operator. The types of queries that can be expressed are quite
limited, though can be sufficient for restricting results to a portion of a
data set modeled as a flat table.

The primary goal of this query syntax is to enable simple implementation of
range restrictions for collections available on MNs.

An example of how a simple query might express "objects of type data that have
been modified since 6AM on the first of January, 2010 UTC"::

  ../object/?qt=nvp&oclass=data&lastModified_gt=20100101T060000+00

Suggestions for comparison operator suffixes:

======= ===========================
Suffix  Comparison Operator
======= ===========================
None    Equals (==) (default)
_eq     Equals (==)
_ne     Not equal (!=)
_lt     Less than (<)
_le     Less than or equals (<=)
_gt     Greater than (>)
_ge     Greater than or equals (>=)
======= ===========================

The presence of one or more wildcard characters in the value for an
equivalence operator would invoke the equivalent of a substring search. For
example::

  ../object/?qt=nvp&oclass=d*

could be mapped to the SQL WHERE clause::

  WHERE oclass LIKE 'd%'

The general grammar of the query can be expressed as:

.. productionlist::
   NVPQuery : { `nvpair` }
   nvpair   : `name` + "=" + `value`
   name     : string [+ `operator`]
   operator : "_eq" | "_ne" | "_lt" | "_le" | "_gt" | "_ge"
   value    : string



An alternative approach is to use enumerated triples, so for the same query as
above (with ``a`` referring to "attribute name", ``c`` to "comparison
operator", and ``v`` to "value")::

  ../object/?qt=nvp&a0=oclass&c0=eq&v0=data&
                    a1=lastModified&c1=gt&v1=20100101T060000+00

This approach has an advantage of specifying simple logical operators, e.g.::

  &lop0_1=AND

which would indicate that the logical operator between the first and second
query elements is "AND". This gets messy pretty quickly though when
considering precedence rules.


Metacat Path Query
------------------

.. TODO::
   - Rewrite this section to use the EarthGrid query syntax, which is more
     readable and expresses the same concepts as the pathquery

Metacat is an XML database, and so must support mechanisms for querying not
just the attribute name, but also its location relative to other elements of
the document (similar to XPath). The path query also indicates the elements
that will be returned in the response. An `example path query`_::

  <pathquery version="1.0">
    <meta_file_id>unspecified</meta_file_id>
    <querytitle>unspecified</querytitle>
    <returnfield>dataset/title</returnfield>
    <returnfield>keyword</returnfield>
    <returnfield>originator/individualName/surName</returnfield>
    <returndoctype>eml://ecoinformatics.org/eml-2.0.1</returndoctype>
    <returndoctype>eml://ecoinformatics.org/eml-2.0.0</returndoctype>
    <querygroup operator="UNION">
      <queryterm casesensitive="false" searchmode="contains">
        <value>Plant</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
      <queryterm casesensitive="false" searchmode="contains">
        <value>plant</value>
        <pathexpr>keyword</pathexpr>
      </queryterm>
    </querygroup>
  </pathquery>

This query states something like return the field values ``dataset/title``,
``keyword``, and ``originator/individualName/surName`` from documents where
the string "plant" appears in the ``keyword`` attribute or the string "Plant"
appears in the ``dataset/title`` attribute. The comparisons are performed
without consideration of case.

Since path queries are expressed as XML documents, they can get quite large
and so can be unwieldy when sending over a HTTP GET request. However, the
types of queries that can be created can be quite precise and expressive, so
these should be supported by the CN services, which shouldn't involve much
more than passing the query through to the Metacat instance operating as the
document store on the CN.

.. _example path query: https://code.ecoinformatics.org/code/metacat/trunk/docs/user/metacatquery.html


SOLR Query Syntax
-----------------

- http://wiki.apache.org/solr/SolrQuerySyntax

- http://lucene.apache.org/java/2_4_0/queryparsersyntax.html




Query Attributes
----------------

- Best if query attributes were consistent across all the query types

- Distinction between searches against system metadata and science metadata
  (though some overlap of attributes)

- Log searches can probably be pretty simple - just slicing by time

- MNs and CNs should support introspection that lists the supported query
  types and the supported query attributes



Misc Notes

Google visualization api query language: http://code.google.com/apis/visualization/documentation/querylanguage.html

SRU/SRW and CQL: http://www.loc.gov/standards/sru/

OpenSearch: http://www.opensearch.org/Home

XPath: http://www.w3.org/TR/xpath and XQuery: http://www.w3.org/TR/xquery/
(appropriate for querying against a general XML model)

SPARQL (assuming you can express content in an RDF model):
http://www.w3.org/TR/rdf-sparql-query/

TAPIR:
http://www.tdwg.org/dav/subgroups/tapir/1.0/docs/TAPIRSpecification_2008-02-07.html

MetaCat (EarthGRID):
https://code.ecoinformatics.org/code/metacat/trunk/docs/user/metacatquery.htmlq•ÖÅq¶}qß(hU�hhûubaubeubahU�Utransformerq®NU
footnote_refsq©}q™Urefnamesq´}q¨Usymbol_footnotesq≠]qÆUautofootnote_refsqØ]q∞Usymbol_footnote_refsq±]q≤U	citationsq≥]q¥h&hUcurrent_lineqµNUtransform_messagesq∂]q∑cdocutils.nodes
system_message
q∏)Åqπ}q∫(hU�h}qª(h]UlevelKh!]h ]Usourcehh]h#]UlineKUtypeUINFOqºuh]qΩh[)Åqæ}qø(hU�h}q¿(h]h]h ]h!]h#]uhhπh]q¡h/X-���Hyperlink target "index-0" is not referenced.q¬ÖÅq√}qƒ(hU�hhæubahh^ubahUsystem_messageq≈ubaUreporterq∆NUid_startq«KU
autofootnotesq»]q…U
citation_refsq }qÀUindirect_targetsqÃ]qÕUsettingsqŒ(cdocutils.frontend
Values
qœoq–}q—(Ufootnote_backlinksq“KUrecord_dependenciesq”NUrfc_base_urlq‘Uhttps://tools.ietf.org/html/q’U	tracebackq÷àUpep_referencesq◊NUstrip_commentsqÿNU
toc_backlinksqŸUentryq⁄U
language_codeq€Uenq‹U	datestampq›NUreport_levelqfiKU_destinationqflNU
halt_levelq‡KU
strip_classesq·Nh,NUerror_encoding_error_handlerq‚Ubackslashreplaceq„Udebugq‰NUembed_stylesheetqÂâUoutput_encoding_error_handlerqÊUstrictqÁU
sectnum_xformqËKUdump_transformsqÈNU
docinfo_xformqÍKUwarning_streamqÎNUpep_file_url_templateqÏUpep-%04dqÌUexit_status_levelqÓKUconfigqÔNUstrict_visitorqNUcloak_email_addressesqÒàUtrim_footnote_reference_spaceqÚâUenvqÛNUdump_pseudo_xmlqÙNUexpose_internalsqıNUsectsubtitle_xformqˆâUsource_linkq˜NUrfc_referencesq¯NUoutput_encodingq˘Uutf-8q˙U
source_urlq˚NUinput_encodingq¸U	utf-8-sigq˝U_disable_configq˛NU	id_prefixqˇU�U	tab_widthr���KUerror_encodingr��UUTF-8r��U_sourcer��hUgettext_compactr��àU	generatorr��NUdump_internalsr��NUsmart_quotesr��âUpep_base_urlr��U https://www.python.org/dev/peps/r	��Usyntax_highlightr
��Ulongr��Uinput_encoding_error_handlerr��hÁUauto_id_prefixr
��Uidr��Udoctitle_xformr��âUstrip_elements_with_classesr��NU
_config_filesr��]r��Ufile_insertion_enabledr��àUraw_enabledr��KU
dump_settingsr��NubUsymbol_footnote_startr��K�Uidsr��}r��(hhh9h;uUsubstitution_namesr��}r��hh&h}r��(h]h!]h ]Usourcehh]h#]uU	footnotesr��]r��Urefidsr��}r��h9]r ��h4asub.