The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) was first developed in the late 1990s as a standard for harvesting metadata from distributed metadata/data repositories. The current version of the OAI-PMH standard is 2.0 as of June 2002, with minor updates in December 2008.
The OAI-PMH standard uses the Hypertext Transfer Protocol (HTTP) as a transport layer and specifies six query methods (called verbs) that must be supported by an OAI-PMH compliant data provider (also referred to as a repository). These methods are:
- GetRecord: retrieves zero or one complete metadata record from a repository;
- Identify: retrieves information about a repository;
- ListIdentifiers: retrieves zero or more metadata record “headers” (not the complete metadata record) from a repository;
- ListMetadataFormats: retrieves a list of available metadata record formats supported by a repository;
- ListRecords: retrieves zero or more complete metadata records from a repository; and
- ListSets: retrieves the set structure from a repository.

The OAI-PMH compliant data provider must accept requests from both HTTP GET and HTTP POST request methods. Responses from the data provider must be returned as an XML-encoded (version 1.0) stream. Error handling must be supported by the data provider and return the correct error response code back to the harvester. Detailed specifications and examples of all six verbs may be viewed in Section 4 of the OAI-PMH standards document.
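
As a rough illustration of how a harvester might encode one of the six verbs, the following Python sketch builds the key=value string that a request carries. The helper names (`encode_request`, `get_url`) and the example host are assumptions made for this illustration; they are not part of Metacat or of the OAI-PMH standard.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = frozenset({
    "GetRecord", "Identify", "ListIdentifiers",
    "ListMetadataFormats", "ListRecords", "ListSets",
})


def encode_request(verb, **arguments):
    """Return the URL-encoded key=value pairs for one OAI-PMH request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb!r}")
    params = {"verb": verb}
    for name, value in arguments.items():
        # "from" is a reserved word in Python, so callers pass it as from_.
        params[name.rstrip("_")] = value
    return urlencode(params)


def get_url(base_url, verb, **arguments):
    """Return the full URL for the HTTP GET form of the request."""
    return f"{base_url}?{encode_request(verb, **arguments)}"


# Hypothetical base URL, for illustration only.
print(get_url("http://example.org/dataProvider", "ListIdentifiers",
              metadataPrefix="oai_dc", from_="2006-01-01"))
```

For the HTTP POST form, the same encoded string would be sent as the request body with a Content-Type of application/x-www-form-urlencoded rather than appended to the URL.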
The OAI-PMH requires that unqualified Dublin Core metadata be supported as a
minimum. Although EML generally provides more fine-grained metadata than Dublin
Core, the two metadata standards do share many of the same (or similar) content
elements. Transformations from EML to Dublin Core performed by Metacat OAI-PMH
produce simple or unqualified Dublin Core, which is associated with the reserved
metadataPrefix symbol oai_dc in the OAI-PMH.
The following table summarizes the element mappings of the EML to Dublin Core crosswalk performed by Metacat OAI-PMH, including notes specific to each element mapping.
EML Element | DC Element | Notes |
---|---|---|
title | title | |
creator | creator | Use only the creator’s name (givenName and surName elements); could be an organization name |
keyword | subject | One subject element per keyword element |
abstract | description | Must extract text formatting tags |
publisher | publisher | Use only the publisher’s name (givenName and surName elements); could be an organization name |
associatedParty | contributor | Use only the party’s name (givenName and surName); could be an organization name |
pubDate | date | One-to-one mapping |
dataset, citation, protocol, software | type | Type value is determined by the type of EML document rather than by a specific field value |
physical | format | Use a mime type as the Format value? For example, if EML has <textFormat> element within <physical>, then use ‘text/plain’ as the Format value? |
packageId | identifier | packageId can be used as the value of one identifier element; a second identifier element can hold a URL to the EML document |
dataSource | source | Use the document URL of the referenced data source? |
citation | relation | Use the document URL of the referenced citation? |
geographicCoverage | coverage | Add separate coverage elements for geographic description and geographic bounding coordinates. For bounding coordinates, use minimal labeling, for example: 81.505000 W, 81.495000 W, 31.170000 N, 31.163000 N |
taxonomicCoverage | coverage | Use only genus/species binomials; place each binomial in a separate coverage element |
temporalCoverage | coverage | Include begin date and end date when available. For example: 1915-01-01 to 2004-12-31 |
intellectualRights | rights | Must extract text formatting tags |
Metacat OAI-PMH includes a set of XSLT stylesheets used for converting specific versions of EML to their Dublin Core equivalents.
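
To make a few rows of the crosswalk concrete, here is a minimal Python sketch that applies the title, keyword, pubDate, creator, and associatedParty mappings to a simplified dataset fragment. It is only an illustration of the mapping rules: Metacat itself performs the crosswalk with XSLT stylesheets, and the function names and sample document below are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# (path below <dataset>, Dublin Core element) pairs taken from the table above.
SIMPLE_MAPPINGS = [
    ("title", "title"),
    ("keywordSet/keyword", "subject"),   # one subject per keyword
    ("pubDate", "date"),
]

# Party-like elements, where only the name is carried over.
PARTY_MAPPINGS = [
    ("creator", "creator"),
    ("associatedParty", "contributor"),
]


def party_name(party):
    """Return 'givenName surName', or the organization name as a fallback."""
    given = party.findtext("individualName/givenName")
    sur = party.findtext("individualName/surName")
    if sur:
        return " ".join(part for part in (given, sur) if part)
    return party.findtext("organizationName")


def dataset_to_dc(dataset):
    """Return a list of (dc_element, value) pairs for a <dataset> element."""
    pairs = []
    for eml_path, dc_name in SIMPLE_MAPPINGS:
        for node in dataset.findall(eml_path):
            pairs.append((dc_name, (node.text or "").strip()))
    for eml_name, dc_name in PARTY_MAPPINGS:
        for party in dataset.findall(eml_name):
            name = party_name(party)
            if name:
                pairs.append((dc_name, name))
    return pairs


# Made-up sample input, much smaller than a real EML document.
SAMPLE = ET.fromstring(
    "<dataset>"
    "<title>Marsh survey</title>"
    "<creator><individualName><givenName>Jane</givenName>"
    "<surName>Doe</surName></individualName></creator>"
    "<creator><organizationName>Example Lab</organizationName></creator>"
    "<pubDate>2009</pubDate>"
    "<keywordSet><keyword>salinity</keyword><keyword>marsh</keyword></keywordSet>"
    "</dataset>"
)

for dc_name, value in dataset_to_dc(SAMPLE):
    print(f"dc:{dc_name} = {value}")
```

The sketch leaves out the rows that need more judgment (abstract, coverage, rights, and the type and format elements), since the table marks several of those as open questions.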
Metacat includes support for two OAI-PMH service interfaces: a data provider (or repository) service interface and a harvester service interface.
The Metacat OAI-PMH Data Provider service interface supports all six OAI-PMH methods (GetRecord, Identify, ListIdentifiers, ListMetadataFormats, ListRecords, and ListSets) as defined in the OAI-PMH Version 2 Specification through a standard HTTP URL that accepts both HTTP GET and HTTP POST requests.
The Metacat OAI-PMH Data Provider service was implemented using the Online Computer Library Center (OCLC) OAICat Open Source Software as the basis for its implementation, with customizations added to facilitate integration with Metacat.
Users of the Metacat OAI-PMH Data Provider should be aware of the following issues:
- Because the Metacat database stores the value of its xml_documents.date_updated field in day granularity, day granularity is the level supported by the Metacat OAI-PMH Data Provider.

The Metacat OAI-PMH Harvester service interface uses OAI-PMH methods to request metadata or related information from an OAI-PMH-compliant data provider using a standard HTTP URL in either an HTTP GET or HTTP POST request.
The Metacat OAI-PMH Harvester client was implemented using OCLC’s OAIHarvester2 open source code as its base implementation, with customizations as needed to support integration with Metacat.
Users of the Metacat OAI-PMH Harvester should be aware of the following issues:
- Because the Metacat database stores the value of its xml_documents.last_updated field in day granularity, day granularity is also the level supported by both the Metacat OAI-PMH Data Provider and the Metacat OAI-PMH Harvester. This has implications when the Metacat OAI-PMH Harvester (MOH) interacts with data providers such as the Dryad repository, which stores its documents with seconds granularity.

To configure and enable the Data Provider servlet:
Stop Tomcat and edit the Metacat properties (metacat.properties) file in the Metacat context directory inside the Tomcat application directory. The Metacat context directory is the name of the application (usually knb):
<tomcat_app_dir>/<context_dir>/WEB-INF/metacat.properties
Change the following properties appropriately:
- oaipmh.repositoryIdentifier - A string that identifies this repository
- Identify.adminEmail - The email address of the repository administrator
Edit the deployment descriptor (web.xml) file, also in the WEB-INF directory. Uncomment the servlet and servlet-mapping entries for the DataProvider servlet by removing the surrounding “<!--” and “-->” strings:
<servlet>
  <servlet-name>DataProvider</servlet-name>
  <description>Processes OAI verbs for Metacat OAI-PMH Data Provider (MODP)</description>
  <servlet-class>edu.ucsb.nceas.metacat.oaipmh.provider.server.OAIHandler</servlet-class>
  <load-on-startup>4</load-on-startup>
</servlet>
<servlet-mapping>
  <servlet-name>DataProvider</servlet-name>
  <url-pattern>/dataProvider</url-pattern>
</servlet-mapping>
Save the metacat.properties and web.xml files and start Tomcat.
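
After Tomcat restarts, one possible way to check the configuration is to request the Identify verb and inspect the response. The Python sketch below parses such a response and pulls out a few fields; the function name and the sample response are assumptions made for illustration (the sample values are borrowed from the property table in this document), and a real response will contain additional elements.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def summarize_identify(xml_text):
    """Return selected fields from the body of an Identify response."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("response does not contain an Identify element")
    fields = ("repositoryName", "adminEmail", "granularity", "deletedRecord")
    return {
        name: identify.findtext(f"oai:{name}", namespaces=OAI_NS)
        for name in fields
    }


# Hand-written sample, trimmed to the elements used above.
SAMPLE_RESPONSE = (
    '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
    "<Identify>"
    "<repositoryName>Metacat OAI-PMH Data Provider</repositoryName>"
    "<adminEmail>mailto:tech_support@someplace.org</adminEmail>"
    "<granularity>YYYY-MM-DD</granularity>"
    "<deletedRecord>no</deletedRecord>"
    "</Identify>"
    "</OAI-PMH>"
)

print(summarize_identify(SAMPLE_RESPONSE))
```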
The following table describes the complete set of metacat.properties settings that are used by the DataProvider servlet.
Property Name | Sample Value | Description |
---|---|---|
oaipmh.maxListSize | 5 | Maximum number of records returned by each call to the ListIdentifiers and ListRecords verbs. |
oaipmh.repositoryIdentifier | metacat.lternet.edu | An identifier string for the repository. |
AbstractCatalog.oaiCatalogClassName | edu.ucsb.nceas.metacat.oaipmh.provider.server.catalog.MetacatCatalog | The Java class that implements the AbstractCatalog interface. This class determines which records exist in the repository and their datestamps. |
AbstractCatalog.recordFactoryClassName | edu.ucsb.nceas.metacat.oaipmh.provider.server.catalog.MetacatRecordFactory | The Java class that extends the RecordFactory class. This class creates OAI-PMH metadata records. |
AbstractCatalog.secondsToLive | 3600 | The lifetime, in seconds, of the resumptionToken. |
AbstractCatalog.granularity | YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ | Granularity of datestamps. Either “days granularity” or “seconds granularity” values can be used. |
Identify.repositoryName | Metacat OAI-PMH Data Provider | A name for the repository. |
Identify.earliestDatestamp | 2000-01-01T00:00:00Z | Earliest datestamp supported by this repository |
Identify.deletedRecord | yes or no | Use “yes” if the repository indicates the status of deleted records; use “no” if it doesn’t. |
Identify.adminEmail | mailto:tech_support@someplace.org | Email address of the repository administrator. |
Crosswalks.oai_dc | edu.ucsb.nceas.metacat.oaipmh.provider.server.crosswalk.Eml2oai_dc | Java class that controls the EML 2.x.y to oai_dc (Dublin Core) crosswalk. |
Crosswalks.eml2.0.0 | edu.ucsb.nceas.metacat.oaipmh.provider.server.crosswalk.Eml200 | Java class that furnishes EML 2.0.0 metadata. |
Crosswalks.eml2.0.1 | edu.ucsb.nceas.metacat.oaipmh.provider.server.crosswalk.Eml201 | Java class that furnishes EML 2.0.1 metadata. |
Crosswalks.eml2.1.0 | edu.ucsb.nceas.metacat.oaipmh.provider.server.crosswalk.Eml210 | Java class that furnishes EML 2.1.0 metadata. |
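
The two values allowed for AbstractCatalog.granularity correspond to two ways of writing a UTC datestamp. The following Python sketch, with helper names invented for this illustration, formats one moment at each granularity and shows what is lost when a seconds-granularity datestamp is reduced to day granularity.

```python
from datetime import datetime, timezone

DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"

# strftime patterns matching the two OAI-PMH granularities.
_PATTERNS = {DAY: "%Y-%m-%d", SECONDS: "%Y-%m-%dT%H:%M:%SZ"}


def format_datestamp(moment, granularity):
    """Render a timezone-aware datetime as a UTC datestamp."""
    return moment.astimezone(timezone.utc).strftime(_PATTERNS[granularity])


def to_day_granularity(datestamp):
    """Reduce a datestamp of either granularity to day granularity."""
    return datestamp[:10]


updated = datetime(2010, 1, 1, 23, 59, 59, tzinfo=timezone.utc)
print(format_datestamp(updated, DAY))
print(format_datestamp(updated, SECONDS))
print(to_day_granularity("2010-01-01T08:00:00Z"))
```

At day granularity, two updates made on the same UTC day carry the same datestamp, so they cannot be told apart by from and until arguments.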
Sample URLs that demonstrate use of the Metacat OAI-PMH Data Provider follow:
OAI-PMH Verb | Description | URL |
---|---|---|
GetRecord | Get an EML 2.0.1 record using its LSID identifier | http://<your_context_url>/dataProvider?verb=GetRecord&metadataPrefix=eml-2.0.1&identifier=urn:lsid:knb.ecoinformatics.org:knb-lter-gce:26 |
GetRecord | Get an oai_dc (Dublin Core) record using its LSID identifier | http://<your_context_url>/dataProvider?verb=GetRecord&metadataPrefix=oai_dc&identifier=urn:lsid:knb.ecoinformatics.org:knb-lter-gce:26 |
Identify | Identify this data provider | http://<your_context_url>/dataProvider?verb=Identify |
ListIdentifiers | List all EML 2.1.0 identifiers in the repository | http://<your_context_url>/dataProvider?verb=ListIdentifiers&metadataPrefix=eml-2.1.0 |
ListIdentifiers | List all oai_dc (Dublin Core) identifiers in the repository between a range of dates | http://<your_context_url>/dataProvider?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2006-01-01&until=2010-01-01 |
ListMetadataFormats | List metadata formats supported by this repository | http://<your_context_url>/dataProvider?verb=ListMetadataFormats |
ListRecords | List all EML 2.0.0 records in the repository | http://<your_context_url>/dataProvider?verb=ListRecords&metadataPrefix=eml-2.0.0 |
ListRecords | List all oai_dc (Dublin Core) records in the repository | http://<your_context_url>/dataProvider?verb=ListRecords&metadataPrefix=oai_dc |
ListSets | List sets supported by this repository | http://<your_context_url>/dataProvider?verb=ListSets |
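
Because oaipmh.maxListSize limits how many records each ListIdentifiers or ListRecords call returns, a client generally has to follow resumptionTokens to see the whole list. The Python sketch below shows one way that loop might look. The `fetch` parameter is a hypothetical caller-supplied function that sends the request arguments to the data provider and returns the response body as text; it is not part of Metacat.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_identifiers(fetch, metadata_prefix):
    """Yield every identifier, following resumptionTokens until the list ends."""
    arguments = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(arguments))
        for header in root.iter(OAI + "header"):
            yield header.findtext(OAI + "identifier")
        token = root.findtext(f"{OAI}ListIdentifiers/{OAI}resumptionToken")
        if not token:
            # A missing or empty resumptionToken marks the final page.
            break
        # resumptionToken is an exclusive argument: it is sent with the verb only.
        arguments = {"verb": "ListIdentifiers", "resumptionToken": token}
```

Passing the transport in as `fetch` keeps the paging logic independent of how the HTTP request is made, which also makes it easy to exercise with canned responses.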
The Metacat OAI-PMH Harvester (MOH) is executed as a command-line program:
sh runHarvester.sh -dn <distinguishedName> \
-password <password> \
-metadataPrefix <prefix> \
[-from <fromDate>] \
[-until <untilDate>] \
[-setSpec <setName>] \
<baseURL>
The following example illustrates how the Metacat OAI-PMH Harvester is run from the command line:
Open a system command window or terminal window.
Set the METACAT_HOME environment variable to the value of the Metacat installation directory. For example:
export METACAT_HOME=/home/somePath/metacat
cd to the following directory:
cd $METACAT_HOME/lib/oaipmh
Run the appropriate Metacat OAI-PMH Harvester shell script, as determined by the operating system:
sh runHarvester.sh \
-dn uid=jdoe,o=myorg,dc=ecoinformatics,dc=org \
-password some_password \
-metadataPrefix oai_dc \
http://baseurl.repository.org/metacat/dataProvider
Command line options and parameters are described in the following table:
Command Option or Parameter | Example | Description |
---|---|---|
-dn | -dn uid=dryad,o=LTER,dc=ecoinformatics,dc=org | Full distinguished name of the LDAP account used when harvesting documents into Metacat. (Required) |
-password | -password some_password | Password of the LDAP account used when harvesting documents into Metacat. (Required) |
-metadataPrefix | -metadataPrefix oai_dc | The type of documents being harvested from the remote repository. (Required) |
-from | -from 2000-01-01 | The lower limit of the datestamp for harvested documents. (Optional) |
-until | -until 2010-12-31 | The upper limit of the datestamp for harvested documents. (Optional) |
-setSpec | -setSpec someSet | Harvest documents belonging to this set. (Optional) |
base_url | http://baseurl.repository.org/metacat/dataProvider | Base URL of the remote repository |
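
When the harvester is launched from a script rather than typed by hand, it can help to assemble the argument list in one place. The Python sketch below is a hypothetical wrapper, not something shipped with Metacat; it assumes the base URL is required because the command syntax shows it without brackets.

```python
def harvester_command(dn, password, metadata_prefix, base_url,
                      from_date=None, until_date=None, set_spec=None):
    """Return the argument list for one run of runHarvester.sh."""
    required = (("-dn", dn), ("-password", password),
                ("-metadataPrefix", metadata_prefix), ("base URL", base_url))
    for label, value in required:
        if not value:
            raise ValueError(f"{label} is required")
    command = ["sh", "runHarvester.sh",
               "-dn", dn,
               "-password", password,
               "-metadataPrefix", metadata_prefix]
    optional = (("-from", from_date), ("-until", until_date),
                ("-setSpec", set_spec))
    for flag, value in optional:
        if value:
            command += [flag, value]
    # The base URL is the final, positional parameter.
    command.append(base_url)
    return command
```

The returned list could then be passed to `subprocess.run(command, cwd=..., check=True)`, with `cwd` pointing at the directory that contains runHarvester.sh.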
The following table lists the OAI-PMH error codes that a data provider can return and the verbs to which each applies:
Error Code | Description | Applicable Verbs |
---|---|---|
badArgument | The request includes illegal arguments, is missing required arguments, includes a repeated argument, or values for arguments have an illegal syntax. | all verbs |
badResumptionToken | The value of the resumptionToken argument is invalid or expired. | ListIdentifiers, ListRecords, ListSets |
badVerb | Value of the verb argument is not a legal OAI-PMH verb, the verb argument is missing, or the verb argument is repeated. | N/A |
cannotDisseminateFormat | The metadata format identified by the value given for the metadataPrefix argument is not supported by the item or by the repository. | GetRecord, ListIdentifiers, ListRecords |
idDoesNotExist | The value of the identifier argument is unknown or illegal in this repository. | GetRecord, ListMetadataFormats |
noRecordsMatch | The combination of the values of the from, until, set and metadataPrefix arguments results in an empty list. | ListIdentifiers, ListRecords |
noMetadataFormats | There are no metadata formats available for the specified item. | ListMetadataFormats |
noSetHierarchy | The repository does not support sets. | ListSets, ListIdentifiers, ListRecords |
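
In an OAI-PMH response, these codes appear as the code attribute of an error element in the response body. As a minimal sketch of how a client might surface them, the Python below turns an error element into an exception; the `OAIError` class and the function name are assumptions introduced for this example.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for the OAI-PMH 2.0 namespace.
OAI = "{http://www.openarchives.org/OAI/2.0/}"


class OAIError(Exception):
    """Raised when a response body carries an OAI-PMH error element."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code


def raise_for_oai_error(xml_text):
    """Parse a response; raise OAIError if it reports an error, else return the root."""
    root = ET.fromstring(xml_text)
    error = root.find(OAI + "error")
    if error is not None:
        raise OAIError(error.get("code"), (error.text or "").strip())
    return root
```

A caller might choose to catch noRecordsMatch separately and treat it as an empty result rather than a failure, since it only indicates that nothing matched the request.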