'$RCSfile: eml-physical.xsd,v $'
Copyright: 2000 Regents of the University of California and the
National Center for Ecological Analysis and Synthesis
For Details: http://knb.ecoinformatics.org/
'$Author: cjones $'
'$Date: 2001-12-14 20:26:08 $'
'$Revision: 1.8 $'
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
eml-physical
The eml-physical Module defines the structural
characteristics of data formats as delivered over the wire or
as found in a file system. One physical object (which can be a
bytestream or an object in a file system) might contain multiple
entities (for example, this would be typical in an MS Access file
that contained multiple tables of data). However, it is typically
used to describe a file or stream that is in some text-based
format such as ASCII or UTF-8, and includes the information needed
to parse the data stream to extract the entity and its attributes
from the stream.
Physical structure.
Physical structure of an entity or entities.
Physical structure of an entity or entities. This generally is a detailed
description of a text representation that shows how the columns and rows
of a table are represented, or simply the name of a well-known binary or
proprietary format (e.g., Microsoft Excel 2000).
The eml-physical module was introduced into EML 1.4 as eml-file.
Unique identifier
The unique identifier of this metadata file or object.
The identifier field provides a unique identifier for this
metadata documentation. It will most likely be part of a
sequence of numbers or letters that are meaningful in a
larger context, such as a metadata catalog. That larger
system can be identified in the "system" attribute. Multiple
identifiers can be listed corresponding to different catalog
systems.
nceas.3.2
The 'identifier' field is derived from the eml-dataset
meta_file_id field in EML 1.4.
Catalog system
The catalog system in which this identifier is used.
This element gives the name of the catalog system in which
this identifier is used. It is useful to determine the
scope of the identifier, and to determine the semantics
of the various subparts of the identifier. Unresolved issue:
can or should this be a URI/URL pointing to the catalog
system, or just the name?
nceas.3.2
New to EML 2.0.
File format
Contains the name of the format for this file.
This element contains the name of the file's format.
The file's format is typically ASCII, Unicode, or some
well-known binary format (e.g., Microsoft Excel 2000).
This could be a mime-type.
ASCII
The format element was introduced into EML 1.4.
Character Encoding
Contains the name of the character encoding used for the data.
This element contains the name of the character encoding.
This is typically ASCII or UTF-8, or one of the other common encodings.
UTF-8
Introduced in EML 2.0
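As a minimal sketch of how a reader might apply the declared character encoding, the following decodes a byte stream using the encoding name taken from the metadata. The function name "decode_entity" and the sample bytes are hypothetical and are not part of EML.

```python
def decode_entity(raw: bytes, encoding: str = "UTF-8") -> str:
    """Decode a byte stream using the encoding named in the metadata."""
    # Python's codec lookup accepts common names such as "ASCII" and
    # "UTF-8" regardless of letter case.
    return raw.decode(encoding)


# Hypothetical sample data containing one non-ASCII character.
sample = "Ni\u00f1o,12.5\n".encode("utf-8")
text = decode_entity(sample, "UTF-8")
```

Decoding with the wrong encoding name may raise an error or silently produce wrong characters, which is why the metadata records the encoding explicitly.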
Entity size
Describes the physical size of the entity.
This element contains information on the physical size
of the entity, typically in bytes.
13
The entitySize was introduced into EML 1.4.
Unit of measurement
Unit of measurement for the entity size, typically bytes
This element gives the unit of measurement for the
size of the entity, and is typically bytes.
13
The unit was introduced into EML 1.4.
Authentication method
A value, typically a checksum, used to authenticate that the bitstream
delivered to the user is identical to the original.
This element describes authentication procedures or
techniques, typically by giving a checksum method (e.g., MD5) and
checksum value for the bytestream.
f5b2177ea03aea73de12da81f896fe40
The authentication element was introduced into EML 1.4.
Authentication method
The method used to calculate an authentication checksum.
This element names the method used to calculate an
authentication checksum that can be used to validate a
bytestream. Typical checksum methods include MD5 and CRC.
f5b2177ea03aea73de12da81f896fe40
The authentication element was introduced into EML 1.4.
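To illustrate how a checksum of this kind might be used, here is a minimal sketch that recomputes an MD5 digest for a received bytestream and compares it with the value recorded in the metadata. The function name "md5_matches" is hypothetical, and the sketch assumes the checksum is stored as a hexadecimal string.

```python
import hashlib


def md5_matches(data: bytes, expected_hex: str) -> bool:
    """Return True if the MD5 digest of data equals the recorded value."""
    actual = hashlib.md5(data).hexdigest()
    # Assumption: the metadata stores the digest as hex text, possibly
    # surrounded by whitespace or written in upper case.
    return actual == expected_hex.strip().lower()
```

A mismatch indicates only that the delivered bytes differ from the original; it does not say where or how they differ.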
Entity's compression method
Name of the entity's compression method
This element describes any compression methods used to
compress the entity, such as zip, compress, etc.
The compressed element was introduced into EML 1.4.
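As a rough illustration of why the compression method needs to be recorded, the sketch below restores an entity from a zip archive held in memory. It assumes the method is zip and that the archive holds a single member; the function name "unzip_single_entity" and the member name "data.txt" are hypothetical.

```python
import io
import zipfile


def unzip_single_entity(archive: bytes) -> bytes:
    """Return the bytes of the first member of a zip archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        first_member = zf.namelist()[0]
        return zf.read(first_member)


# Build a small archive in memory to exercise the function.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.txt", "May,100\n")
restored = unzip_single_entity(buffer.getvalue())
```

Other methods, such as compress, would need a different decompression routine, so a reader would typically dispatch on the recorded method name.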
Encoding Method
Method used for encoding the entity
This element describes the entity's encoded method, such as
MIME base64 encoding or binhex encoding.
The encoded element was introduced into EML 1.4.
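As a small sketch of reversing one of the encodings mentioned above, the following decodes a base64-encoded entity back to its original bytes. The function name "decode_base64_entity" is hypothetical; binhex and other encodings would need their own decoders.

```python
import base64


def decode_base64_entity(encoded: bytes) -> bytes:
    """Reverse MIME base64 encoding to recover the original bytes."""
    return base64.b64decode(encoded)


# "aGVsbG8=" is the base64 form of the ASCII text "hello".
decoded = decode_base64_entity(b"aGVsbG8=")
```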
Header lines
Header lines in the entity
Number of header lines, or lines of other preparatory information, that precede the data.
3
The numHeaderLines element was introduced into EML 1.4.
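A minimal sketch of how a parser might use this value: drop the stated number of leading lines before reading data records. The function name "skip_header_lines" and the sample text are hypothetical, and the sketch assumes header lines end with ordinary line endings.

```python
from typing import List


def skip_header_lines(text: str, num_header_lines: int) -> List[str]:
    """Return the data lines that follow the header lines."""
    # splitlines() recognizes \n, \r and \r\n line endings.
    lines = text.splitlines()
    return lines[num_header_lines:]


sample = "Site data\nCollected 2001\nmonth,count\nMay,100\nJune,300\n"
data_lines = skip_header_lines(sample, 3)
```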
Record delimiter character
Character used to delimit records.
This element specifies the record delimiter character
when the format is text. The record delimiter is usually a
newline (\n) on UNIX, a carriage return (\r) on MacOS, or
both (\r\n) on Windows/DOS. Multiline records are usually
delimited with two line ending characters, for example on UNIX
it would be two newline characters (\n\n).
\n\r
The recordDelimiter element was introduced into EML 1.4.
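The following sketch shows one way a reader might split a text stream into records using the declared delimiter. It assumes the metadata stores the delimiter as written escape sequences (a backslash followed by "n" or "r"), as in the description above; the function name "split_records" is hypothetical.

```python
from typing import List


def split_records(text: str, record_delimiter: str) -> List[str]:
    """Split text into records on the delimiter named in the metadata."""
    # Assumption: the metadata holds the two-character sequences "\n" and
    # "\r" as text, so translate them to the real control characters first.
    delimiter = record_delimiter.replace("\\r", "\r").replace("\\n", "\n")
    records = text.split(delimiter)
    # A stream that ends with the delimiter yields one empty trailing piece.
    if records and records[-1] == "":
        records.pop()
    return records
```

With a delimiter of two newlines, each record keeps its internal single newlines, which is how multiline records survive the split.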
Quote character
Character used to quote values for delimiter escaping
This element specifies a character to be used in the entity
for quoting values so that field delimiters can be used within
the value. This basically allows delimiter "escaping". The
quoteCharacter is typically a " or '.
"
The quoteCharacter element was taken from the NBII standard.
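To show the effect of a quote character, here is a minimal sketch using Python's standard csv module, which supports a configurable delimiter and quote character. The function name "parse_quoted" and the sample rows are hypothetical.

```python
import csv
import io
from typing import List


def parse_quoted(text: str, field_delimiter: str = ",",
                 quote_character: str = '"') -> List[List[str]]:
    """Parse delimited text, honoring the declared quote character."""
    reader = csv.reader(io.StringIO(text),
                        delimiter=field_delimiter,
                        quotechar=quote_character)
    return list(reader)


# The comma inside the quoted value is data, not a field boundary.
rows = parse_quoted('"Santa Barbara, CA",100\n')
```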
Literal character
Character used to escape other characters
This element specifies a character to be used for escaping
character values so that the following character is treated as its literal
value. This allows "escaping" for special characters like quotes, commas,
and spaces when they aren't intended as a delimiter value. The
literalCharacter is typically a \.
\
Introduced in EML 2.0.
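The sketch below illustrates one possible reading of the literal character: any character that follows it is taken as data, even if it would otherwise act as a field delimiter. The function name "split_with_literal" is hypothetical, and dropping a trailing literal character with nothing after it is a choice made by this sketch.

```python
from typing import List


def split_with_literal(line: str, field_delimiter: str = ",",
                       literal_character: str = "\\") -> List[str]:
    """Split a record into fields, honoring the literal (escape) character."""
    fields: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in line:
        if escaped:
            # The previous character was the literal character, so this
            # one is kept as plain data whatever it is.
            current.append(ch)
            escaped = False
        elif ch == literal_character:
            escaped = True
        elif ch == field_delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Doubling the literal character yields one literal character in the value, so a backslash can itself be stored as data.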
Start column
The starting column number for a fixed format attribute.
FixedWidth fields have a set length, thus
the end of the field can always be determined
by adding the fieldWidth to the starting
column number.
any positive integer, see example in "delimiter" description
Introduced into EML 2.0.
Field width
FieldWidth specification for fixed field length.
FixedWidth fields have a set length, thus
the end of the field can always be determined
by adding the fieldWidth to the starting
column number.
any positive integer, see example in "delimiter"
description
The fieldWidth element was introduced into
EML 1.4. Semantics changed to work identically to
the NBII DTD.
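The arithmetic described above can be sketched as follows, assuming column numbers are 1-based so that adding the fieldWidth to the starting column gives the column at which the next field begins. The function names "next_field_start" and "extract_fixed" are hypothetical.

```python
def next_field_start(field_start_column: int, field_width: int) -> int:
    """Column (1-based) immediately after the end of a fixed-width field."""
    return field_start_column + field_width


def extract_fixed(line: str, field_start_column: int, field_width: int) -> str:
    """Cut one fixed-width field out of a record."""
    begin = field_start_column - 1  # convert 1-based column to 0-based index
    return line[begin:begin + field_width]


# A field starting in column 7 with width 4 occupies columns 7 through 10.
value = extract_fixed("May100aaaa1.2", 7, 4)
```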
Attribute delimiter
The end of the attribute (field) is delimited by a
special character called a field delimiter.
Variable width format fields (attributes) can vary in their
field length, thus the end of the field is
delimited by a special character called a
field delimiter (typically a comma or a space).
Data sets are generally classified as fixedWidth
format or variableWidth format, but we have
determined that this is actually a per-field
classification because one may encounter
fixedWidth fields mixed together in the same
data file with variableWidth fields.
In our encoding scheme, the start of each field
is assumed to be the column after the last column
of the previous field, or the first column
if this is the first field in the dataset, unless
the starting column is explicitly enumerated using the
"fieldStartColumn" element.
The end column for each field is determined
using either a special delimiter character, indicated
using the "fieldDelimiter" element,
or a fixed field length indicated by using the "fieldWidth"
element. The delimiter for the last field in the data set can be omitted.
variableWidth fields can vary in their field length, and the end of
the field is delimited by a special character
called a field delimiter, usually a comma or
a tab character. fixedWidth fields have a set
length, and so the end of the field can always
be determined by adding the fieldWidth to the
starting column number. Here is an example:
Assume we have the following data in a data set:
May,100aaaa,1.2,
April,200aaaa,3.4,
June,300bbbb,4.6,
The metadata indicating the physical layout of the 4 fields would include the
following:
fieldDelimiter ,
fieldWidth 3
fieldWidth 3
fieldDelimiter ,
In a strictly fixed format file, the metadata would be slightly different:
May100aaaa1.2
Apr200aaaa3.4
Jun300bbbb4.6
fieldWidth 3
fieldWidth 3
fieldWidth 4
fieldWidth 3
or, one could explicitly describe the starting columns:
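fieldStartColumn 1, fieldWidth 3
fieldStartColumn 4, fieldWidth 3
fieldStartColumn 7, fieldWidth 4
fieldStartColumn 11, fieldWidth 3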
comma, tab, white space, etc.
The delimiter element was introduced into
EML 1.4. Semantics changed to work identically to
the NBII DTD, and then modified to fit more cases.
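The per-field scheme described above can be sketched as a small parser in which each field is either delimited or fixed width, with an optional 1-based starting column. The names "FieldSpec" and "parse_record" are hypothetical. For the mixed sample row, the field specifications used in the usage lines are an assumption of this sketch, chosen so that the four-character value "aaaa" ends at a comma.

```python
from typing import List, NamedTuple, Optional


class FieldSpec(NamedTuple):
    delimiter: Optional[str] = None      # fieldDelimiter (variable width)
    width: Optional[int] = None          # fieldWidth (fixed width)
    start_column: Optional[int] = None   # fieldStartColumn, 1-based


def parse_record(line: str, specs: List[FieldSpec]) -> List[str]:
    """Split one record into field values according to per-field specs."""
    values: List[str] = []
    pos = 0  # 0-based index of the column after the previous field
    for spec in specs:
        if spec.start_column is not None:
            pos = spec.start_column - 1
        if spec.width is not None:
            end = pos + spec.width
            values.append(line[pos:end])
            pos = end
        elif spec.delimiter is not None:
            end = line.find(spec.delimiter, pos)
            if end == -1:
                # The delimiter for the last field may be omitted.
                values.append(line[pos:])
                pos = len(line)
            else:
                values.append(line[pos:end])
                pos = end + len(spec.delimiter)
        else:
            raise ValueError("field needs a delimiter or a width")
    return values


fixed = [FieldSpec(width=3), FieldSpec(width=3),
         FieldSpec(width=4), FieldSpec(width=3)]
mixed = [FieldSpec(delimiter=","), FieldSpec(width=3),
         FieldSpec(delimiter=","), FieldSpec(delimiter=",")]
fixed_row = parse_record("May100aaaa1.2", fixed)
mixed_row = parse_record("May,100aaaa,1.2,", mixed)
```

Keeping the current position in one variable mirrors the rule that each field starts in the column after the previous field unless a starting column is given explicitly.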