Ecological Metadata Language (EML) Specification
Preface
Introduction
The Ecological Metadata Language (EML) is a metadata standard developed by
the ecology discipline and for the ecology discipline. It is based on
prior work done by the Ecological Society of America and associated
efforts (Michener et al., 1997, Ecological Applications). EML is
implemented as a series of XML document types that can be used in a
modular and extensible manner to document ecological data. Each EML
module is designed to describe one logical part of the total metadata
that should be included with any ecological dataset.
Purpose Statement
To provide the ecological community with an extensible, flexible,
metadata standard for use in data analysis and archiving that will
allow automated machine processing, searching and retrieval.
Features
The architecture of EML was designed to serve the needs of the
ecological community, and has benefitted from previous work in other
related metadata languages. EML has adopted the strengths of many of
these languages, but also addresses a number of shortcomings that
have inhibited the automated processing and integration of
dataset resources via their metadata.
The following list represents some of the features of EML:
Modularity: EML was designed as a collection of modules rather than
one large standard to facilitate future growth of the language in both
breadth and depth. By implementing EML with an extensible
architecture, groups may choose which of the core modules are
pertinent to describing their data, literature, and software
resources. Also, if EML falls short in a particular area, it may be
extended by creating a new module that describes the resource (e.g. a
detailed soils metadata profile that extends eml-dataset). The intent
is to provide a common set of core modules for information exchange,
but to allow for future customizations of the language without the
need of going through a lengthy 'approval' process.
Detailed Structure: EML strives to balance the risk of too much
detail against the need for enough detail to enable advanced services,
such as processing data by parsing the accompanying metadata.
Therefore, a driving question throughout the design was: 'Will this
particular piece of information be machine-processed, just human
readable, or both?' Information was then broken down into more highly
structured elements when the answer involved machine processing.
Compatibility: EML adopts much of its syntax from the other metadata
standards that have evolved from the expertise of groups in other
disciplines. Whenever possible, EML adopted entire trees of
information in order to facilitate conversion of EML documents into
other metadata languages. EML was designed with the following
standards in mind: the Dublin Core Metadata Initiative, the Content
Standard for Digital Geospatial Metadata (CSDGM, from the US Geological
Survey's Federal Geographic Data Committee (FGDC)), the Biological
Profile of the CSDGM (from the National Biological Information
Infrastructure), the International Organization for Standardization's
Geographic Information Standard (ISO 19115), the ISO 8601 Date and Time
Standard, the OpenGIS Consortium's Geography Markup Language (GML), the
Scientific, Technical, and Medical Markup Language (STMML), and the
Extensible Scientific Interchange Language (XSIL).
Strong Typing: EML is implemented in an Extensible Markup Language
(XML) schema language known as XML Schema, which defines the rules
that govern the EML syntax. XML Schema is a recommendation of the
World Wide Web Consortium (W3C), and a metadata document that is said
to comply with the syntax of EML will structurally meet the criteria
defined in the XML Schema documents for EML. Over and above the
structure (which elements can be nested within others, cardinality,
etc.), XML Schema provides the ability to use strong data typing
within elements. This allows for finer validation of the contents of
an element, not just its structure. For instance, an element may be
of type 'date', and so the value that is inserted in the field will
be checked against XML Schema's definition of a date. Traditionally,
XML documents (including previous versions of EML) have been
validated against Document Type Definitions (DTDs), which do not
provide a means to employ strong validation on field values through
typing.
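As a rough illustration of what strong typing adds, the fragment below
declares a single element with XML Schema's built-in date type. It is a
sketch only: the element name samplingDate is hypothetical and is not
taken from the EML schemas.

```xml
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element for illustration; not part of EML. -->
  <!-- xs:date requires a lexical form such as 2004-07-15 (YYYY-MM-DD). -->
  <xs:element name="samplingDate" type="xs:date"/>
  <!-- Accepted by a validating parser: <samplingDate>2004-07-15</samplingDate> -->
  <!-- Rejected by a validating parser: <samplingDate>15 July 2004</samplingDate> -->
</xs:schema>
```

A DTD could only declare such an element as character data, so both of
the values shown in the comments would pass DTD validation.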
There is a distinction between the content model (i.e. the concepts
behind the structure of a document - which fields go where, cardinality,
etc.) and the syntactic implementation of that model (the technology
used to express the concepts defined in the content model).
The normative sections below define the content model and the
XML Schema documents distributed with EML define the syntactic
implementation. For the foreseeable future, XML Schema will be the
syntactic specification, although it may change later.
Overview of EML modules and their use
Module Overview Foreword
The following section briefly describes each EML module and how they
are logically designed in order to document ecological resources.
Some of the modules are dependent on others, while others may be used
as stand-alone descriptions. This section describes the modules using
a "top down" approach, starting from the top-level eml wrapper
module, followed by modules of increasing detail. However, there are
modules that may be used at many levels, such as eml-access.
These modules are described when it is appropriate.
Root-level structure
Top-level resources
The following four modules are used to describe separate resources:
datasets, literature, software, and protocols. However, note that
the dataset module makes use of the other top-level modules by
importing them at different levels. For instance, a dataset may
have been produced using a particular protocol, and that protocol
may come from a protocol document in a library of protocols.
Likewise, citations are used throughout the top-level resource
modules by importing the literature module.
Supporting Modules - Adding detail to top-level resources
The following six modules are used to qualify the resources being
described in more detail. They are used to describe access control
rules, distribution of the metadata and data themselves, parties
associated with the resource, the geographic, temporal, and
taxonomic extents of the resource, the overall research context of
the resource, and detailed methodology used for creating the
resource. Some of these modules are imported directly into the
top-level resource modules, often in many locations in order to
limit the scope of the description. For instance, the eml-coverage
module may be used for a particular column of a dataset, rather
than the entire dataset as a whole.
Data organization - Modules describing dataset structures
The following three modules are used to document the logical layout
of a dataset. Many datasets are comprised of multiple entities
(e.g. a series of tabular data files, or a set of GIS features, or a
number of tables in a relational database). Each entity within a
dataset may contain one or more attributes (e.g. multiple columns in
a data file, multiple attributes of a GIS feature, or multiple
columns of a database table). Lastly, there may be both simple and
complex relationships among the entities within a dataset. The
relationships, or the constraints that are to be enforced in the
dataset, are described using the eml-constraint module. All
entities share a common set of information (described using
eml-entity), but some discipline-specific entities have
characteristics that are unique to that entity type. Therefore, the
eml-entity module is extended for each of these types (dataTable,
spatialRaster, spatialVector, etc.), which are described
in the next section.
Entity types - Detailed information for discipline-specific entities
The following six modules are used to describe a number of common
types of entities found in datasets. Each entity type uses the
eml-entity module elements as its base set of elements, but then
extends the base with entity-specific elements. Note that the
eml-spatialReference module is not an entity type, but is rather a
common set of elements used to describe spatial reference systems
in both eml-spatialRaster and eml-spatialVector. It is described
here in relation to those two modules.
Utility modules - Metadata documentation enhancements
The following modules are used to highlight the information being
documented in each of the above modules where prose may be needed to
convey the critical metadata. The eml-text module provides a number
of text-based constructs to enhance a document (including sections,
paragraphs, lists, subscript, superscript, emphasis, etc.).
Dependency Chart
The modules in EML depend on each other in complex
ways. These dependencies are summarized in the
EML Dependency Chart.
Internationalization - Metadata in multiple languages
EML supports internationalization using the i18nNonEmptyStringType.
Fields defined as this type include:
Title
Keyword
Contact information (e.g. names, organizations, addresses)
TextType fields also support language translations. These fields include:
Abstract
Methods
Protocol
Internationalization techniques
Core metadata should be provided in English.
The core elements can be augmented with translations in a native language.
Detailed metadata can be provided in the native language as declared using the xml:lang attribute.
Authors can opt to include English translations of this detailed metadata as they see fit.
The following example metadata document is provided primarily in Portuguese but includes English translations
of core metadata fields.
The xml:lang="pt_BR" attribute at the root of the EML document indicates that, unless otherwise specified,
the content of the document is supplied in Portuguese (Brazil).
The xml:lang="en_US" attributes on child elements denote that the content of that element is provided in English.
Core metadata (e.g. title) is provided in English, supplemented with a Portuguese translation using the
value tag with an xml:lang attribute. Note that child elements can override the
root language declaration of the document as well as the language declaration of their containing elements.
The abstract element is primarily given in Portuguese (as inherited from the root language declaration),
with an English translation.
Many EML fields are repeatable (e.g. keyword) so that multiple values can be provided for the same concept.
Translations for these fields should be included as nested value tags to indicate that they are equivalent concepts
expressed in different languages rather than entirely different concepts.
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xml:lang="pt_BR"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <!-- English title with Portuguese translation -->
    <title xml:lang="en_US">
      Sample Dataset Description
      <value xml:lang="pt_BR">Exemplo Descrição Dataset</value>
    </title>
    ...
    <!-- Portuguese abstract with English translation -->
    <abstract>
      <para>
        Neste exemplo, a tradução em Inglês é secundário
        <value xml:lang="en_US">In this example, the English translation is secondary</value>
      </para>
    </abstract>
    ...
    <!-- two keywords, each with an equivalent translation -->
    <keywordSet>
      <keyword keywordType="theme">
        árvore
        <value xml:lang="en_US">tree</value>
      </keyword>
      <keyword keywordType="theme">
        água
        <value xml:lang="en_US">water</value>
      </keyword>
    </keywordSet>
    ...
  </dataset>
</eml:eml>
Technical Architecture (Normative)
Introduction
This section explains the rules of EML. There are some rules that cannot
be written directly into the XML Schemas nor enforced by an XML parser.
These are guidelines that every EML package must follow in order for
it to be considered EML compliant.
Module Structure
Each EML module, with the exception of "eml" itself, has a top-level
choice between the structured content of that module and a
"references" field. This enables the reuse of content
previously defined elsewhere in the document. Methods for defining
and referencing content are described in the
next section.
Reusable Content
EML allows the reuse of previously defined structured content (DOM
sub-trees) through the use of key/keyRef type references. In order
for an EML package to remain cohesive and to allow for the cross
platform compatibility of packages, the following rules with respect
to packaging must be followed.
An ID is required on the eml root element.
IDs are optional on all other elements.
If an ID is not provided, that content must be interpreted as
representing a distinct object.
If an ID is provided for content then that content is distinct from
all other content except for that content that references its ID.
If a user wants to reuse content to indicate the repetition of an
object, a reference must be used. Two identical ids with the same system
attribute cannot exist in a single document.
"Document" scope is defined as identifiers unique only to a
single instance document (if a document does not have a system
attribute or if scope is set to 'document' then all IDs are defined
as distinct content).
"System" scope is defined as identifiers unique to an entire data
management system (if two documents share a system string, then
any IDs in those two documents that are identical refer to the
same object).
If an element references another element, it must not have an
ID itself. The system attribute must have the same value in both the
target and referencing elements or it must be absent in both.
All EML packages must have the 'eml' module as the root.
The system and scope attributes are always optional except on the
'eml' module, where the scope attribute is fixed as 'system'. The scope
attribute defaults to 'document' for all other modules.
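The rule that the system attribute must match on the target and
referencing elements can be illustrated with a small fragment. This is a
sketch rather than a complete EML document: it assumes that the system
attribute of the referencing side is written on the references element,
and it reuses the 'knb' system string and the creator and contact
elements from the examples in this document.

```xml
<!-- Sketch only: a fragment of a dataset, not a complete EML document. -->
<dataset id="ds.1">
  <title>Sample Dataset Description</title>
  <!-- Target element: defines the content and carries the id. -->
  <creator id="23445" system="knb" scope="system">
    <individualName>
      <surName>Smith</surName>
    </individualName>
  </creator>
  <!-- Referencing element: no id of its own. -->
  <!-- Assumption: system is given on references and matches the target. -->
  <contact>
    <references system="knb">23445</references>
  </contact>
</dataset>
```

Under the rules above, dropping system="knb" from only one of the two
elements would make the reference invalid, while dropping it from both
would be acceptable.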
EML Parser
Because some of these rules cannot be enforced in XML-Schema, we have
written a parser which checks the validity of the references and IDs
used in your document. This parser is included with the 2.1.0 release
of EML. To run the parser, you must have Java 1.3.1 or higher. To
execute it, change into the lib directory of the release and run
the 'runEMLParser' script, passing your EML instance file as a
parameter. There is also an online
version of this parser which is publicly accessible. The online
parser will both validate your XML document against the schema
and check the integrity of your references.
ID and Scope Examples
Example Documents
Invalid EML due to duplicate identifiers
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <!-- the two creators have the same id. this should be an error -->
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Myer</surName>
      </individualName>
    </creator>
    ...
  </dataset>
</eml:eml>
This instance document is invalid because both creator
elements have the same id. No two elements can have the
same string as an id.
Invalid EML due to a non-existent reference
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Myer</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23447</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is invalid because the contact
element references an id that does not exist. Any referenced
id must exist.
Invalid EML due to a conflicting id attribute and a <references> element
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Meyer</surName>
      </individualName>
    </creator>
    ...
    <contact id="522">
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is invalid because the contact
element both references another element and has an id itself.
If an element references another element, it may not have
an id. This prevents circular references.
A valid EML document
<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.1.1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd">
  <dataset id="ds.1">
    <title>Sample Dataset Description</title>
    <creator id="23445" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    <creator id="23446" scope="document">
      <individualName>
        <surName>Smith</surName>
      </individualName>
    </creator>
    ...
    <contact>
      <references>23446</references>
    </contact>
    <contact>
      <references>23445</references>
    </contact>
  </dataset>
</eml:eml>
This instance document is valid. Each contact is
referencing one of the creators above and all the ids are
unique.
Module Descriptions (Normative)