Register an Object Format
=========================

.. contents:: Contents
   :local:
   :backlinks: entry

While DataONE recognizes many common object formats, additional formats will
need to be registered over time.  Object formats are categorized into three
types: *DATA*, *METADATA*, and *RESOURCE*,
representing data objects, metadata objects, and resource maps, respectively.
DataONE is responsible for maintaining the extent and categorization of all
individual object formats.

All format identifiers are registered in each Coordinating Node environment via
a manual process by CN operators.  

* RESOURCE format registration

  Currently, DataONE only reads one type of object format for recording data
  package relationships (http://www.openarchives.org/ore/terms). New formats
  also require development, testing and deployment of parsers before they can be
  considered fully registered.


Manually Adding Object Formats
------------------------------

DataONE is primarily concerned with the proper scoping and MIME type
associations of new object formats that represent data objects. Registration is
a straightforward process that requires little testing. Once formats are
registered into the object format list, additional work may have to occur for
further processing of metadata formats.

The DataONE Object Format list is maintained on the Coordinating Nodes for each
environment.  For a given environment, the object format list needs to be added
to a single CN during a fresh install of the CN, and the Metacat application on
each CN handles the replication of the list to the other CNs in the environment.
The production list is maintained in the dataone-cn-metacat buildout package and
is named `objectFormatList.xml`_. The insertOrUpdateObjectFormatList.sh_ script
is also maintained in the same directory, and provides a convenient way to
insert or update the document in Metacat.

.. _objectFormatList.xml: https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-metacat/usr/share/metacat/debian/objectFormatList.xml

.. _insertOrUpdateObjectFormatList.sh: https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-metacat/usr/share/metacat/debian/insertOrUpdateObjectFormatList.sh


First time inserts in a new CN environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When a Coordinating Node is first installed, the object format list needs to be inserted into the Metacat database.  To do so, on one of the CNs in the environment, issue the following commands::

  $ cd /usr/share/metacat/debian
  $ sudo chmod +x insertOrUpdateObjectFormatList.sh
  $ sudo ./insertOrUpdateObjectFormatList.sh objectFormatList.xml
    
When prompted for the password, enter the password for the `uid=dataone_cn_metacat,o=DATAONE,dc=ecoinformatics,dc=org` user, which is stored in the SystemPW.txt.gpg file in subversion.

Note:  We've changed the above DN in the production environment to `cn=dataone_cn_metacat,dc=dataone,dc=org`.  Because of this, before executing the script, change the script to have::

  username="cn=dataone_cn_metacat,dc=dataone,dc=org";
    
Use the password for this DN found in the ProductionPW.txt.gpg file in subversion.

Updating the object format list
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before updating the list, search the `Unified Digital Format Registry <http://udfr.org>`__ for the file format to help decide what the DataONE formatId should be.  The formatId must be unique, and it should be versioned in some manner to accommodate future iterations of the format.  Also look through the existing objectFormatList.xml to ensure the format isn't already registered, perhaps under a different formatId.
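
As a quick pre-check, the shell sketch below lists the formatIds already in the file so that a candidate can be tested for uniqueness.  The function names are hypothetical, and the text-level matching assumes each identifier is written as a plain ``<formatId>...</formatId>`` element with no namespace prefix; it is an aid, not a substitute for reading the list::

  # list_format_ids FILE: print every formatId in FILE, one per line.
  list_format_ids() {
      grep -o '<formatId>[^<]*</formatId>' "$1" | sed 's/<[^>]*>//g'
  }

  # format_id_exists FILE ID: succeed if ID is already present in FILE.
  # -F matches ID literally, -x requires a whole-line match, -q is quiet.
  format_id_exists() {
      list_format_ids "$1" | grep -Fxq -- "$2"
  }

  # duplicate_format_ids FILE: print any formatId that appears more than once.
  duplicate_format_ids() {
      list_format_ids "$1" | sort | uniq -d
  }

For example, ``format_id_exists objectFormatList.xml "text/csv"`` succeeds when that formatId is already in the list, and ``duplicate_format_ids objectFormatList.xml`` should print nothing when every formatId is unique.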

To update the list, do an svn checkout of the dataone-cn-metacat package::

  $ svn co https://repository.dataone.org/software/cicore/trunk/cn-buildout/dataone-cn-metacat

Modify the objectFormatList.xml file by adding new formats according to the `ObjectFormat Type`_.  Never modify an existing format, and never delete an existing format. Update the `total` and `count` attributes of the `ObjectFormatList`_ element.  It can be helpful to use xmlstarlet to count the total as a cross check::

  $ xmlstarlet sel -t -v "count(//objectFormat/formatId)" objectFormatList.xml
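
The declared attributes can be compared against the file contents in the same spirit.  This is a minimal sketch using only grep and sed; ``check_format_counts`` is a hypothetical helper, and it assumes the root element carries ``count="N"`` and ``total="N"`` attributes and that every format has exactly one ``<formatId>`` element::

  # check_format_counts FILE: compare the declared count/total attributes
  # with the number of <formatId> elements actually present in FILE.
  check_format_counts() {
      file="$1"
      # grep -o prints one line per match, so wc -l counts the elements.
      actual=$(grep -o '<formatId>' "$file" | wc -l | tr -d ' ')
      # Take the first count="..." and total="..." values found in the file.
      count=$(sed -n 's/.*count="\([0-9][0-9]*\)".*/\1/p' "$file" | head -n 1)
      total=$(sed -n 's/.*total="\([0-9][0-9]*\)".*/\1/p' "$file" | head -n 1)
      if [ "$actual" = "$count" ] && [ "$actual" = "$total" ]; then
          echo "OK: $actual formats"
      else
          echo "MISMATCH: found=$actual count=$count total=$total"
          return 1
      fi
  }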
    
Commit the changes::

  $ svn commit objectFormatList.xml
    
Copy the new list to the CN you are modifying, and replace the file in /usr/share/metacat/debian/objectFormatList.xml::

  $ scp objectFormatList.xml cn-dev-ucsb-1.test.dataone.org:
  $ ssh cn-dev-ucsb-1.test.dataone.org
  $ sudo cp objectFormatList.xml /usr/share/metacat/debian/objectFormatList.xml

Lastly, run the update script against the new format list document::

  $ cd /usr/share/metacat/debian
  $ sudo chmod +x insertOrUpdateObjectFormatList.sh
  $ sudo ./insertOrUpdateObjectFormatList.sh objectFormatList.xml

After being prompted for the password, the list should be updated in Metacat.

You can verify that each CN has the updated list by visiting the CN's formats REST endpoint::

  https://cn-dev-ucsb-1.test.dataone.org/cn/v1/formats
  https://cn-dev-unm-1.test.dataone.org/cn/v1/formats
  https://cn-dev-orc-1.test.dataone.org/cn/v1/formats
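
Rather than comparing the responses by eye, the check can be scripted.  The following is a sketch under stated assumptions: the helper names are hypothetical, the endpoint is assumed to return the whole list in a single response, and formatIds are assumed to appear as plain ``<formatId>`` elements.  It relies only on curl's ``-s`` (silent) and ``-f`` (fail on HTTP errors) options::

  # format_ids_from_stdin: read an object format list on stdin and print
  # its formatIds sorted, one per line.
  format_ids_from_stdin() {
      grep -o '<formatId>[^<]*</formatId>' | sed 's/<[^>]*>//g' | sort
  }

  # compare_cn_formats REFERENCE_FILE URL...: fetch the format list from each
  # URL and report whether its formatIds match those in REFERENCE_FILE.
  compare_cn_formats() {
      reference=$(format_ids_from_stdin < "$1")
      shift
      rc=0
      for url in "$@"; do
          # A failed request yields no output, which is reported as DIFFERENT.
          remote=$(curl -sf "$url" | format_ids_from_stdin)
          if [ -n "$remote" ] && [ "$remote" = "$reference" ]; then
              echo "OK        $url"
          else
              echo "DIFFERENT $url"
              rc=1
          fi
      done
      return "$rc"
  }

It could be invoked as ``compare_cn_formats objectFormatList.xml`` followed by the three endpoint URLs above; a non-zero exit status means at least one CN did not return the same set of formatIds.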

Maintenance of all format lists
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When updating the object format list, it's best to do so in all environments at
once because the list only gets initially added when first installing the CN.
So, perform the above steps for DEV, SANDBOX, SANDBOX2, STAGE, STAGE2, and
PRODUCTION environments.

Note that the identifier for the object format list XML document may differ
across environments because some environments get wiped clean and re-installed.
For instance, in DEV it might be `OBJECT_FORMAT_LIST.1.1` whereas in PRODUCTION
it might be `OBJECT_FORMAT_LIST.1.8`.

.. _ObjectFormatList: https://releases.dataone.org/online/api-documentation-v1.2.0/apis/Types.html#Types.ObjectFormatList

.. _ObjectFormat Type: https://releases.dataone.org/online/api-documentation-v1.2.0/apis/Types.html#Types.ObjectFormat

.. _Unfied Digital Format Registry: http://udfr.org


METADATA format registration
----------------------------

.. todo:: 

   20150730 - this section needs revision

In addition to the concerns of data format registration, DataONE Coordinating
Nodes must parse metadata objects in order to add their information to the
search index, and so the format needs to be tested, and parsers built and
deployed before it can be considered fully registered.



Overview
~~~~~~~~

While DataONE's architecture is designed to accommodate any metadata format
that Member Nodes use, each new metadata format requires some development
to enable DataONE's discovery mechanisms for those metadata documents.  Both
Content Curator (usually a Member Node administrator) and DataONE developer
effort is required and, more significantly, a patch-level release of the CN
software stack must be performed so that content in the new format can be
synchronized, indexed, and ultimately discovered.  Building, testing, and
deploying the necessary items to the CNs introduces a lag between when the
new format is published and when content using it can be successfully created.
Content Curators adopting a new format, or a new version of
an existing format, need to account for that lag in their own planning.

The process of registering a new metadata format involves the creation and testing of
the following items:

1. a **published schema or DTD** (done by the Content Curator)
2. an **indexing parser** (a DataONE developer responsibility)
3. an **XSLT template** (built by either, depending on time and ability)   // TODO: verify who's responsible

Once all are available and tested, the format can be fully registered into DataONE 
as a new object format.  

When done as part of a new Member Node deployment, it is good to plan for this 
work to be done early on, as final testing of the node requires that all objects 
use a registered format. 


Metadata Format Registration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Irrespective of Member Node deployment, registering a metadata format follows the same steps:

Content Curators:

1. develop and test their schema or DTD.  The schema or DTD needs to pass standard schema 
   validation tests that can be found at numerous testing services online (search
   for "online XML schema validation").
   
2. publish the schema such that the namespace and schemaLocation of the metadata 
   documents point to an immutable copy of the schema, where it can continue to be 
   resolved consistently indefinitely.  
   
3. contact DataONE via support@dataone.org, attaching example metadata documents,
   or providing a link to a test instance of the Member Node that contains them. 
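
For step 2, it can help to see exactly which schema locations a sample document declares before sending it to DataONE.  The sketch below is illustrative only: ``schema_locations`` is a hypothetical helper, it inspects just the first double-quoted ``schemaLocation`` attribute in the file, and it works at the text level rather than parsing the XML::

  # schema_locations FILE: print the "namespace location" pairs declared in
  # the first schemaLocation attribute of FILE, one pair per line.
  schema_locations() {
      # Join lines first, because the attribute value may span several lines.
      tr '\n' ' ' < "$1" \
          | grep -o 'schemaLocation="[^"]*"' \
          | head -n 1 \
          | sed 's/^schemaLocation="//; s/"$//' \
          | xargs -n 2
  }

Each location printed in the second column should resolve to the immutable, published copy of the schema.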

DataONE developers:

4. test the schema format via the examples, iterating with the content curator on any bug fixes.

5. write an indexing parser and / or XSLT template.

6. test the indexing parser and XSLT template (in the DEV environment). 
 
7. review test results with the content curator (showing search results and metadata visualizations).

8. deploy the indexing parser, XSLT templates, and new object format record to additional
   environments (STAGE and/or PRODUCTION).
   (Currently the XSLT template is handed off to ONEMercury maintainers.)

9. notify the content curator when the work is done.

The Content Curator can then start submitting metadata objects using the new format.

// TODO: who names the object format (gives the identifier?)

As part of Member Node deployment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Deployment-phase testing of Member Nodes requires all metadata formats used by the 
prospective Member Node to be registered, so that the processes under test 
(synchronization, indexing, ONEMercury presentation) can be run.  Keeping in mind 
that DataONE will need to build, test, and deploy items to the Coordinating Nodes, 
format registration would ideally be started during the implementation / development
phase of the Member Node on-boarding process.  Specifically, the first item 
(the published schema) needs to be published and tested, and the object format 
registered to the target testing environment before the Member Node itself can 
be tested.  Absent these things, synchronization will fail, and the indexing and 
ONEMercury tests cannot be run.

Typically, the indexing parser and XSLT template are deployed to the
Coordinating Nodes of the DEV environment for testing by DataONE developers
and then, if successful, deployed to the STAGE environment in preparation for
registration of the prospective Member Node in that environment.

Member Node implementers should work out specific timings and placements with 
their primary DataONE contact to optimize their development cycles. 



Notes:
~~~~~~

What information is pulled from metadata into the search index::

  http://mule1.dataone.org/ArchitectureDocs-current/design/SearchMetadata.html#values-extracted-from-science-metadata

Current effort estimation:

- 2 days development, 2 days testing (sandbox, staging), 1 day for the release, 1 day for the ONEMercury upgrade.
- New versions of existing formats require less development and can be tested more quickly.
- What is the process for registering a data format?

Remaining issue
~~~~~~~~~~~~~~~

Because re-synchronizing failed objects is difficult, a Member Node depends
on DataONE to register the data format before it can even begin loading
data onto its node.  This seems like a backwards dependency that puts DataONE
resources on the critical path of external projects.

Q. Is there a more graceful way to handle this situation?