.. _UC02:

Use Case 02 - List PIDs By Search
----------------------------------

.. index:: Use Case 02, List, Search, Query

.. contents:: Contents
   :local:

Goal
~~~~

Get list of PIDs from metadata search (anonymous and authenticated).

Summary
~~~~~~~

A user performs a search against the DataONE system and receives a list of
object identifiers (PIDs) that match the search criteria. The list of PIDs
is filtered such that only objects for which the user has read permission
will be returned.

Content discovery in DataONE is achieved primarily through the service
interfaces provided by the Coordinating Nodes. Other systems may index
content available in DataONE (:doc:`UC34<34_uc>`), though the
operation of those systems is out of scope for DataONE operations except 
that the exposed APIs enable such functionality.

Actors
~~~~~~

- Client performing search operation
- Coordinating Node

.. uml::

  @startuml images/02_uc.png

  actor User
  usecase "12. Authentication" as authen

   actor "Coordinating Node" as CN 
   usecase "13. Authorization" as author
   usecase "02. Search Metadata" as SEARCH
   User -- SEARCH
   CN -- SEARCH
   SEARCH ..> author: <<includes>>
   SEARCH ..> authen: <<includes>>
  @enduml

**Figure 1.** Actors and dependencies for Use Case 02.


Preconditions 
~~~~~~~~~~~~~

- Client has authenticated to at the desired level (e.g. client may not have
  authenticated, so access might be anonymous).

Triggers
~~~~~~~~

- A search is performed against the DataONE system


Post Conditions
~~~~~~~~~~~~~~~

- The client has a list of PIDs (:class:`Types.objectList`) for which they
  have permission to read and match the supplied query or an error condition.

- The log is updated with information about the request


Implementation
~~~~~~~~~~~~~~

.. uml::

  @startuml images/02_seq.png

   participant "Client" as app_client << Application >>
   participant "Query API" as c_query << Coordinating Node >>
   participant "Authentication API" as c_authenticate << Coordinating Node >>
   participant "Read API" as c_crud << Coordinating Node >>
   app_client -> c_query: search(session, query)
   activate c_query
   c_query -> c_query: search -> objectList
   note right of c_query
     The query response is a list of PIDs.
     Each ID needs to be checked for read 
     access by the authenticated user.
   end note
     loop for pid in objectList
       c_query -> c_authenticate: isAuthorized(session, pid, OP_GET)
       c_query <-- c_authenticate: T or F
     end 
   c_query --> c_crud: log
   app_client <-- c_query: objectList
   deactivate c_query
  @enduml

**Figure 2.** Interaction diagram for Use Case 02. The process for determining
READ access is for illustration purposes only. Actually implementation may
vary (e.g. by augmenting the query used for searching).


Examples
~~~~~~~~

Search is implemented by the Coordinating Nodes and optionally by Member Nodes.
Two discovery endpoints are provided by Coordinating Nodes: query and search.
The search endpoint provides a response that is more constrained than the search
endpoint, with only an ObjetList structure being returned. It is recommended
that general searches be performed against the query endpoint.

The following examples assume a Coordinating Node base URL is set in the ${NODE}
variable, for example:

.. code-block:: bash
  
   export NODE="https://cn.dataone.org/cn"

.. Note:: 
   For more example queries and detailed description of the various fields, 
   please visit :doc:`/design/SearchMetadata`

.. Note::

   The actual response XML may be more compressed than the examples below show.
   For easier viewing, pipe the response throug the xmlstarlet_ command line
   tool using the format ("fo") option. For example:

   .. code-block:: bash

      curl ${NODE}/v1/query | xml fo



Discover Available Query Engines
................................

To discover the query engines (search indexes) supported on the node:

.. code-block:: xml

   $ curl ${NODE}/v1/query

   <?xml version="1.0" encoding="UTF-8"?>
   <?xml-stylesheet type="text/xsl" href="/cn/xslt/dataone.types.v1.xsl"?>
   <d1_v1.1:queryEngineList 
    xmlns:d1="http://ns.dataone.org/service/types/v1" 
    xmlns:d1_v1.1="http://ns.dataone.org/service/types/v1.1">
     <queryEngine>solr</queryEngine>
     <queryEngine>logsolr</queryEngine>
   </d1_v1.1:queryEngineList>

The response show two query engines available ``solr`` and ``logsolr``. The
``solr`` query engine provides access to content (data, metadata, resource maps)
that have been indexed by the Coordinating Nodes. The ``logsolr`` endpoint
provides access to log records that have been aggregated by the Coordinating
Nodes.


List Search Fields Offered
..........................

To determine the search fields provided by a query engine, append the value of a
``queryEngine`` element to the url:

.. code-block:: xml

   $ curl ${NODE}/v1/query/solr

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="/cn/xslt/dataone.types.v1.xsl"?>
    <d1_v1.1:queryEngineDescription 
      xmlns:d1="http://ns.dataone.org/service/types/v1" 
      xmlns:d1_v1.1="http://ns.dataone.org/service/types/v1.1">
      <queryEngineVersion>3.4.0.2011.09.20.17.19.53</queryEngineVersion>
      <querySchemaVersion>1.1</querySchemaVersion>
      <name>solr</name>
      <additionalInfo>https://releases.dataone.org/online/api-documentation-v1.2.0/
      </additionalInfo>
      <queryField>
        <name>abstract</name>
        <description>The full text of the abstract as provided in the science 
        metadata document.</description>
        <type>text</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>false</multivalued>
      </queryField>
      <queryField>
        <name>attribute</name>
        <description>Multi-valued field containing the text from attributeName, 
        attributeLabel, attributeDescription, attributeUnit fields into a single 
        searchable text field.</description>
        <type>text</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>true</multivalued>
      </queryField>
      <queryField>
        <name>attributeDescription</name>
        <description>Multi-valued field containing the attribute descriptive 
        text.</description>
        <type>text</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>true</multivalued>
      </queryField>
      <queryField>
        <name>attributeLabel</name>
        <description>Multi-valued field containing secondary attribute name 
        information.</description>
        <type>string</type>
        <searchable>true</searchable>
        <returnable>true</returnable>
        <sortable>true</sortable>
        <multivalued>true</multivalued>
      </queryField>

      ...

    </d1_v1.1:queryEngineDescription>


Full Text Search
................

The solr endpoint supports standard solr_ query syntax and construct. To search 
all text for the string "water", the query "text:water" could be used. Expressed
as a command line request:

.. code-block:: xml

  $ curl "${NODE}/v1/query/solr/?q=text:water"

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">5</int>
      <lst name="params">
        <str name="q">text:water</str>
        <str name="rows">1</str>
      </lst>
    </lst>
    <result name="response" numFound="139455" start="0">
      <doc>
      ...
      </doc>
    </result>
  </response>

which indicates there were 139455 matches. The response is the standard solr XML
response (json may be returned by adding ``&wt=json`` to the url), with
``<doc>`` elements holding the actual response records. 


Limiting Returned Fields
........................

The default solr response returns all fields of the doc records which can be
quite verbose. To limit the response, against the standard solr syntax is used
with the ``fl`` parameter. For example, to return only the record identifier
(PID) and the date the system metadata was last modified:

.. code-block:: bash

   $ curl "${NODE}/v1/query/solr/?q=text:water&fl=id,dateModified"

.. code-block:: xml

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">4</int>
      <lst name="params">
        <str name="fl">id,dateModified</str>
        <str name="q">text:water</str>
        <str name="row">5</str>
      </lst>
    </lst>
    <result name="response" numFound="139455" start="0">
      <doc>
        <date name="dateModified">2015-03-20T23:18:10.507Z</date>
        <str name="id">https://pasta.lternet.edu/package/metadata/eml/knb-lter-gce/249/34</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:50:33.75Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.23</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:51:00.556Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.16</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:50:21.131Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.19</str>
      </doc>
      ...
    </result>
  </response>


Paging Response Records
.......................

The solr ``rows`` parameter limits the number of records that are returned in a
response, and the ``start`` parameter indicates the 0-based offset of the first
records from the start of the set of matching results. For example the second
page of records with five results per page would use ``start=5`` and 
``count=5``:

.. code-block:: bash

   $ curl "${NODE}/v1/query/solr/?q=text:water&fl=id,dateModified&start=5&rows=5"

.. code-block:: xml

  <?xml version="1.0" encoding="UTF-8"?>
  <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">3</int>
      <lst name="params">
        <str name="fl">id,dateModified</str>
        <str name="start">5</str>
        <str name="q">text:water</str>
        <str name="rows">5</str>
      </lst>
    </lst>
    <result name="response" numFound="139455" start="5">
      <doc>
        <date name="dateModified">2012-06-26T13:51:00.556Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.16</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:50:21.131Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.19</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:49:54.779Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.21</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T13:49:54.409Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.22</str>
      </doc>
      <doc>
        <date name="dateModified">2012-06-26T17:09:59.721Z</date>
        <str name="id">doi:10.6073/AA/knb-lter-gce.249.17</str>
      </doc>
    </result>
  </response>


.. _xmlstarlet: http://xmlstar.sourceforge.net/
.. _solr: https://lucene.apache.org/core/2_9_4/queryparsersyntax.html