The eml-physical module - Physical file format

'$RCSfile: eml-physical.xsd,v $' Copyright: 1997-2002 Regents of the University of California, University of New Mexico, and Arizona State University Sponsors: National Center for Ecological Analysis and Synthesis and Partnership for Interdisciplinary Studies of Coastal Oceans, University of California Santa Barbara Long-Term Ecological Research Network Office, University of New Mexico Center for Environmental Studies, Arizona State University Other funding: National Science Foundation (see README for details) The David and Lucile Packard Foundation For Details: http://knb.ecoinformatics.org/ '$Author: jones $' '$Date: 2004-07-01 22:09:09 $' '$Revision: 1.68 $' This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA eml-physical

The eml-physical module describes the external and internal physical characteristics of a data object as well as the information required for its distribution. Examples of the external physical characteristics of a data object would be the filename, size, compression, encoding methods, and authentication of a file or byte stream. Internal physical characteristics describe the format of the data object being described. Named binary or otherwise proprietary formats can be cited (e.g., Microsoft Access 2000), and text formats can be precisely described (e.g., ASCII text delimited with commas). For text formats, the module also includes the information needed to parse the data object to extract the entity and its attributes. Distribution information describes how to retrieve the data object. The retrieval information can be either online (e.g., a URL or other connection information) or offline (e.g., a data object residing on an archival tape). The eml-physical module, like other modules, may be "referenced" via the <references> tag. This allows a physical document to be described once, and then used as a reference in other locations within the EML document via its ID.
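As a rough illustration of the external characteristics mentioned above (name, size, and authentication of a byte stream), the following Python sketch gathers them for an in-memory object. The function names and dictionary keys are hypothetical labels chosen for this example, not part of the schema, and MD5 is used only because it is one commonly cited checksum method.

```python
import hashlib


def describe_bytes(name, data):
    """Collect a few external characteristics of a byte stream.

    The dictionary keys are illustrative labels, not schema element names.
    """
    return {
        "name": name,
        "size_in_bytes": len(data),
        "checksum": hashlib.md5(data).hexdigest(),
        "checksum_method": "MD5",
    }


def matches_checksum(data, expected_hex):
    """Return True if the MD5 digest of data equals the expected hex string."""
    return hashlib.md5(data).hexdigest() == expected_hex.lower()
```

A consumer could call matches_checksum on the retrieved bytes to confirm that the delivered stream is identical to the one that was described.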

Any data object that is being described by EML needs this information so the entities and attributes that reside within the data object can be extracted. Physical structure Physical structure of an entity or entities. The content model for physical is a CHOICE between "references" and all of the elements that let you describe the internal/external characteristics and distribution of a data object (e.g., dataObject, dataFormat, distribution). A physical element can contain a reference to a physical element defined elsewhere. Using a reference means that the referenced physical is identical, not just in name but identical in its complete description. Data object name The name of the data object. The name of the data object. This is possibly distinct from the entity name in that one physical object can contain multiple entities, even though that is not a recommended practice. The objectName often is the filename of a file in a filesystem or of a file that is accessible on the network. rainfall-sev-2002-10.txt Data object size Describes the physical size of the data object. This element contains information on the physical size of the entity, by default represented in bytes unless the unit attribute is provided to change the units. 134 Unit of measurement Unit of measurement for the entity size, by default byte This element gives the unit of measurement for the size of the entity, and is by default a byte. byte Authentication value A value, typically a checksum, used to authenticate that the bitstream delivered to the user is identical to the original. This element describes authentication procedures or techniques, typically by giving a checksum value for the object. The method used to compute the authentication value (e.g., MD5) is listed in the method attribute. f5b2177ea03aea73de12da81f896fe40 Authentication method The method used to calculate an authentication checksum. 
This element names the method used to calculate an authentication checksum that can be used to validate a bytestream. Typical checksum methods include MD5 and CRC. MD5 Compression Method Name of a compression method applied This element lists a compression method used to compress the object, such as zip, compress, etc. Compression and encoding methods must be listed in the order in which they were applied, so that decompression and decoding can be performed in the reverse order of the listing. For example, if a file is compressed using zip and then encoded using MIME base64, the compression method would be listed first and the encoding method second. zip gzip compress Encoding Method Name of an encoding method applied This element lists an encoding method used to encode the object, such as base64, binhex, etc. Compression and encoding methods must be listed in the order in which they were applied, so that decompression and decoding can be performed in the reverse order of the listing. For example, if a file is compressed using zip and then encoded using MIME base64, the compression method would be listed first and the encoding method second. base64 uuencode binhex Character Encoding Contains the name of the character encoding used for the data. This element contains the name of the character encoding. This is typically ASCII or UTF-8, or one of the other common encodings. UTF-8 Data format Describes the internal physical format of a data object. This element is the parent which is a CHOICE between four possible internal physical formats which describe the internal physical characteristics of the data object. Using this information the user should be able to parse the physical object to extract the entity and its attributes. Note that this is the format of the physical object itself. Text Format Description of a text formatted object Description of a text formatted object. 
The description includes detailed parsing instructions for extracting attributes from the bytestream for simple delimited file formats (e.g., CSV), fixed format files that use fixed columns for attribute locations, and mixtures of the two. It also supports records that span multiple lines. Number of header lines Number of header lines preceding data. Number of header lines preceding data. Lines are determined by the physicalLineDelimiter, or if it is absent, by the recordDelimiter. This value indicates the number of header lines that should be skipped before starting to parse the data. 4 Number of footer lines Number of footer lines following data. Number of footer lines following data. Lines are determined by the physicalLineDelimiter, or if it is absent, by the recordDelimiter. This value indicates the number of footer lines that should be skipped after parsing the data. If this value is omitted, parsers should assume the data continues to the end of the data stream. 4 Record delimiter character Character used to delimit records. This element specifies the record delimiter character when the format is text. The record delimiter is usually a linefeed (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. Multiline records are usually delimited with two line ending characters, for example on UNIX it would be two linefeed characters (\n\n). As record delimiters are often non-printing characters, one can use the special values "\n" to represent a linefeed (ASCII 0x0a) and "\r" to represent a carriage return (ASCII 0x0d). Alternatively, one can use the hex value to represent character values (e.g., 0x0a). \n\r Physical line delimiter character Character used to delimit physical lines. This element specifies the physical line delimiter character when the format is text. The line delimiter is usually a linefeed (\n) on UNIX, a carriage return (\r) on MacOS, or both (\r\n) on Windows/DOS. 
Multiline records are usually delimited with two line ending characters, for example on UNIX it would be two linefeed characters (\n\n). As line delimiters are often non-printing characters, one can use the special values "\n" to represent a linefeed (ASCII 0x0a) and "\r" to represent a carriage return (ASCII 0x0d). Alternatively, one can use the hex value to represent character values (e.g., 0x0a). If this value is not provided, processors should assume that the physical line delimiter is the same as the record delimiter. \n\r Physical lines per record The number of physical lines in the file spanned by a single logical data record. A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, it is necessary to know the number of lines per record in order to correctly read them. If this value is not provided, processors should assume that records are wholly contained on one physical line. If the value is greater than 1, then processors should examine the lineNumber field for each attribute to determine which line of the record contains the information. 3 Maximum record length The maximum number of characters in any record in the physical file. The maximum number of characters in any record in the physical file. For delimited files, the record length varies and this is not particularly useful. However, for fixed format files that do not contain record delimiters, this field is critical to tell processors when one record stops and another begins. 597 Orientation of attributes Orientation of attributes. Specifies whether the attributes described in the physical stream are found in columns or rows. The valid values are column or row. If set to 'column', then the attributes are in columns. If set to 'row', then the attributes are in rows. Row orientation is rare, but some systems such as Splus and R utilize it. 
For example, some data with column orientation: DATE PLOT SPECIES 2002-01-15 hfr5 acer rubrum 2002-01-15 hfr5 acer xxxx The same data with row orientation: DATE 2002-01-15 PLOT hfr5 SPECIES acer rubrum acer xxxx column row Simple delimited format A simple delimited format. A simple delimited format that uses one of a series of delimiters to indicate the ends of fields in the data stream. More complex formats such as fixed format or mixed delimited and fixed formats can be described using the "complex" element. Field Delimiter character Character used to delimit the end of an attribute This element specifies a character to be used in the object for indicating the ending column for an attribute. The delimiter character itself is not part of the attribute value, but rather is present in the column following the last character of the value. Typical delimiter characters include commas, tabs, spaces, and semicolons. The only time the fieldDelimiter character is not interpreted as a delimiter is if it is contained in a quoted string (see quoteCharacter) or is immediately preceded by a literalCharacter. Non-printable delimiter characters can be provided as their hex values, and a tab character can be given as the ASCII string "\t". Processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. , \t 0x09 0x20 Treat consecutive delimiters as one Specification of how to handle consecutive delimiters while parsing The collapseDelimiters element specifies whether sequential delimiters should be treated as a single delimiter or multiple delimiters. An example is when a space delimiter is used; often there may be several repeated spaces that should be treated as a single delimiter, but not always. The valid values are yes or no. If it is set to yes, then consecutive delimiters will be collapsed to one. 
If set to no or absent, then consecutive delimiters will be treated as separate delimiters. The default behaviour is no; hence, consecutive delimiters are treated as separate delimiters by default. yes no Quote character Character used to quote values for delimiter escaping This element specifies a character to be used in the object for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '. When a processor encounters a quote character, it should not interpret any following characters as a delimiter until a matching quote character has been encountered (i.e., quotes come in pairs). It is an error to not provide a closing quote before the record ends. Non-printable quote characters can be provided as their hex values. " ' Literal character Character used to escape other special characters This element specifies a character to be used for escaping special character values so that they are treated as literal values. This allows "escaping" for special characters like quotes, commas, and spaces when they are intended to be used in an attribute value rather than being intended as a delimiter. The literalCharacter is typically a \. \ Complex text format A complex text format. A complex text format that can describe delimited fields, fixed width fields, and mixtures of the two. This supports multiline records (where one record is distributed across multiple physical lines). When using the complex format, the number of textFixed and textDelimited elements should exactly equal the number of attributes that have been described for the entity, and the order of the textFixed and textDelimited elements should correspond to the order of the attributes as described in the entity. Thus, for a delimited file with fourteen attributes, one should provide exactly fourteen textDelimited elements. 
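The delimiter rules described above (fieldDelimiter, collapseDelimiters, quoteCharacter, literalCharacter) can be sketched in Python roughly as follows. This is a minimal illustration, not a reference parser: the function name split_record and its parameter names are hypothetical, and two behaviours are assumptions of this sketch rather than statements from the schema documentation, namely that quote characters are stripped from the returned values and that the literal character also escapes characters inside a quoted string.

```python
def split_record(record, field_delimiter=",", quote_character=None,
                 literal_character=None, collapse_delimiters=False):
    """Split one record into field values using single-character markers."""
    fields = []
    current = []
    in_quotes = False
    i = 0
    n = len(record)
    while i < n:
        ch = record[i]
        if literal_character is not None and ch == literal_character and i + 1 < n:
            # The character after the literal character is taken as data.
            current.append(record[i + 1])
            i += 2
            continue
        if quote_character is not None and ch == quote_character:
            # Quotes come in pairs; the quote itself is dropped (assumption).
            in_quotes = not in_quotes
            i += 1
            continue
        if ch == field_delimiter and not in_quotes:
            fields.append("".join(current))
            current = []
            i += 1
            if collapse_delimiters:
                # Treat a run of consecutive delimiters as a single delimiter.
                while i < n and record[i] == field_delimiter:
                    i += 1
            continue
        current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("record ended before the closing quote character")
    fields.append("".join(current))
    return fields
```

For example, split_record('a,"b,c",d', quote_character='"') keeps the comma inside the quoted value, while split_record("1  2   3", field_delimiter=" ", collapse_delimiters=True) treats each run of spaces as one delimiter.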
Fixed format text Describes the physical format of data sequences that use a fixed number of characters in a specified position in the stream to locate attribute values. Describes the physical format of data sequences that use a fixed number of characters in a specified position in the stream to locate attribute values. This method is common in sensor-derived data and in legacy database systems. To parse it, one must know the number of characters for each attribute and the starting column and line to begin reading the value. Field width Field width in characters for fixed field length. Fixed width fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number. 7 Physical Line Number The line on which the data field is found, when the data record is written over more than one physical line in the file. A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, the relative location of a data field must be indicated by both relative row and column number. The lineNumber should never be greater than the number of physical lines per record. 3 Start column The starting column number for a fixed format attribute. Fixed width fields have a set length, thus the end of the field can always be determined by adding the fieldWidth to the starting column number. If the starting column is not provided, processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. 58 Delimited format text Describes the physical format of data sequences that use delimiters in the stream to locate attribute values. Describes the physical format of data sequences that use delimiters in the stream to locate attribute values. 
This method is common in data exported from spreadsheets and database systems. To parse it, one must know the character that indicates the end of each attribute and the line to begin reading the value. Field Delimiter character Character used to delimit the end of a particular attribute This element specifies a character to be used in the object for indicating the ending column for an attribute. The delimiter character itself is not part of the attribute value, but rather is present in the column following the last character of the value. Typical delimiter characters include commas, tabs, spaces, and semicolons. The only time the fieldDelimiter character is not interpreted as a delimiter is if it is contained in a quoted string (see quoteCharacter) or is immediately preceded by a literalCharacter. Non-printable delimiter characters can be provided as their hex values, and a tab character can be given as the ASCII string "\t". Processors should assume that the field starts in the column following the previous field if the previous field was fixed, or in the column following the delimiter from the previous field if the previous field was delimited. , \t 0x09 0x20 Treat consecutive delimiters as single Specification of how to handle consecutive delimiters while parsing The collapseDelimiters element specifies whether sequential delimiters should be treated as a single delimiter or multiple delimiters. An example is when a space delimiter is used; often there may be several repeated spaces that should be treated as a single delimiter, but not always. The valid values are yes or no. If it is set to yes, then consecutive delimiters will be collapsed to one. If set to no or absent, then consecutive delimiters will be treated as separate delimiters. The default behaviour is no; hence, consecutive delimiters are treated as separate delimiters by default. 
yes no Physical Line Number The line on which the data field is found, when the data record is written over more than one physical line in the file. A single logical data record may be written over several physical lines in a file, with no special marker to indicate the end of a record. In such cases, the relative location of a data field must be indicated by both relative row and column number. The lineNumber should never be greater than the number of physical lines per record. When parsing the first field on a physical line as a delimited field, processors should assume that the field data starts in the first column. Otherwise, follow the rules indicated under fieldDelimiter. 3 Quote character Character used to quote values for delimiter escaping This element specifies a character to be used in the object for quoting values so that field delimiters can be used within the value. This basically allows delimiter "escaping". The quoteCharacter is typically a " or '. When a processor encounters a quote character, it should not interpret any following characters as a delimiter until a matching quote character has been encountered (i.e., quotes come in pairs). It is an error to not provide a closing quote before the record ends. Non-printable quote characters can be provided as their hex values. " ' Literal character Character used to escape other special characters This element specifies a character to be used for escaping special character values so that they are treated as literal values. This allows "escaping" for special characters like quotes, commas, and spaces when they are intended to be used in an attribute value rather than being intended as a delimiter. The literalCharacter is typically a \. \ Externally Defined Format Information about a non-text or proprietary formatted object. Information about a non-text or proprietary formatted object. 
The description names the format explicitly, but assumes a processor implicitly knows how to parse that format to extract the data. A format version can be included. This is mainly used for proprietary formats, including binary files like Microsoft Excel and text formats like ESRI's ArcInfo export format. This is not a recommended way to permanently archive data because the software to parse the format is unlikely to be available over extended periods, but is included to allow for commonly used physical formats. Format Name Name of the format of the data object Name of the format of the data object Microsoft Excel Format Version Version of the format of the data object Version of the format of the data object 2000 (9.0.2720) Format citation Citation providing more details about the physical format. Citation providing more detail about the physical format, including parsing information or information about the software required for reading the object. Raster image format Contains binary raster data header parameters The binaryRasterInfo element is a container for various parameters used to describe the contents of binary raster image files. In this case, it is based on a white paper on the ESRI site that describes the header information used for BIP and BIL files ("Extendable Image Formats for ArcView GIS 3.1 and 3.2"). Orientation for reading rows and columns Orientation for reading rows and columns. Specifies whether the data should be read across rows or down columns. The valid values are column or row. If set to 'column', then the data are read down columns. If set to 'row', then the data are read across rows. column row Multiple band image Multiple band image information. Information needed to properly interpret a multiband image. Number of Bands The number of spectral bands in the image. The number of spectral bands in the image. Must be greater than 1. 2 Layout The organization of the bands in the image file. The organization of the bands in the image file. 
Acceptable values are bil - Band interleaved by line. bip - Band interleaved by pixel. bsq - Band sequential. bil bip bsq Number of Bits The number of bits per pixel per band. The number of bits per pixel per band. Acceptable values are typically 1, 4, 8, 16, and 32. The default value is eight bits per pixel per band. For a true color image with three bands (R, G, B) stored using eight bits for each pixel in each band, nbits equals eight and nbands equals three, for a total of twenty-four bits per pixel. 8 Byte Order The byte order in which values are stored. The byte order in which values are stored. The byte order is important for sixteen-bit and higher images, that have two or more bytes per pixel. Acceptable values are little-endian (common on Intel systems like PCs) and big-endian (common on Motorola platforms). little-endian big-endian Skip Bytes The number of bytes of data in the image file to skip in order to reach the start of the image data. The number of bytes of data in the image file to skip in order to reach the start of the image data. This keyword allows you to bypass any existing image header information in the file. The default value is zero bytes. 0 Bytes per band per row The number of bytes per band per row. The number of bytes per band per row. This must be an integer. This keyword is used only with BIL files when there are extra bits at the end of each band within a row that must be skipped. 3 Total bytes of data per row The total number of bytes of data per row. The total number of bytes of data per row. Use totalrowbytes when there are extra trailing bits at the end of each row. 8 Bytes between bands The number of bytes between bands in a BSQ format image. The number of bytes between bands in a BSQ format image. The default is zero. 1 Distribution Information Information on how the resource is distributed online and offline This element provides information on how the resource is distributed online and offline. 
Connections to online systems can be described as URLs and as a list of relevant connection parameters. Online Distribution Information Distribution information for accessing the resource online. Distribution information for accessing the resource online, represented either as a URL or as a series of named parameters that are needed in order to connect. The URL field is provided for the simple cases where a file is available for download directly from a web server or other similar server and a complex connection protocol is not needed. The connection field provides an alternative where a complex protocol needs to be named and described, along with the necessary parameters needed for the connection. Download site URL A URL (Uniform Resource Locator) from which this resource can be downloaded or information can be obtained about downloading it. A URL (Uniform Resource Locator) from which this resource can be downloaded or additional information can be obtained. If accessing the URL would directly return the data stream, then the "function" attribute should be set to "download". If the URL provides further information about downloading the object but does not directly return the data stream, then the "function" attribute should be set to "information". If the "function" attribute is omitted, then "download" is implied for the URL function. In more complex cases where a non-standard connection must be established that complies with application specific procedures beyond what can be described in the simple URL, then the "connection" element should be used instead of the URL element. http://data.org/getdata?id=98332 Connection A description of the information needed to make an application connection to a data service. A description of the information needed to make an application connection to a data service. The connection starts with a connectionDefinition which lists all of the parameters needed for the connection and possible default values for each. 
It then includes a list of parameter values, one for each parameter, that override the defaults for this particular connection. One parameter element should exist for every parameterDefinition that is present in the connectionDefinition, except that parameters that were defined with a defaultValue in their parameterDefinition can be omitted from the connection and the default will be used. All information about how to use the parameters to establish a session and extract data is present in the connectionDefinition, possibly implicitly by naming a connection schemeName that is well-known. Connection Definition Definition of the connection protocol to be used for this connection. Definition of the connection protocol to be used for this connection. The definition has a "scheme" which identifies the protocol by name, and a detailed description of the scheme and its required parameters. Parameter A parameter to be used to make this connection. A parameter to be used to make this connection. This value overrides any default value that may have been provided in the connection definition. Parameter Name Name of the parameter to be used to make this connection. The name of the parameter to be used to make this connection. hostname Parameter Value The value of the parameter to be used to make this connection. The value of the parameter to be used to make this connection. This value overrides any default value that may have been provided in the connection definition. nceas.ucsb.edu Offline Distribution Details about the offline medium on which this resource is distributed, either digitally or as hardcopy. Details about the offline medium on which this resource is distributed digitally, such as 3.5" floppy disk, or various tape media types, or 'hardcopy'. CD-ROM, 3.5 in. floppy disk, Zip disk Medium name Name of the medium for this resource distribution. Name of the medium on which this resource is distributed. 
Can be various digital media such as tapes and disks, or printed media which can collectively be termed 'hardcopy'. Tape, 3.5 inch Floppy Disk, hardcopy Density of the digital medium The density of the digital medium if this is relevant. The density of the digital medium if this is relevant. Used mainly for floppy disks or tape. High Density (HD), Double Density (DD) Units of a numerical density A numerical density's units If a density is given numerically, the units should be given here. B/cm Storage volume Total volume of the storage medium. The total volume of the storage medium on which this resource is shipped. 650 MB Medium format Format of the medium on which the resource is shipped. The file system format of the medium on which the resource is shipped. NTFS, FAT32, EXT2, QIK80 Note about the media Note about the media Any additional pertinent information about the media. Inline distribution Object data distributed inline in the metadata. Object data distributed inline in the metadata. Users have the option of including the data inline in the metadata by providing it inside of the "inline" element. For many text formats, the data can simply be included directly in the element. However, certain character sequences are invalid in an XML document (e.g., <), so care will need to be taken to either 1) wrap the data in a CDATA section if needed, or 2) encode the data using a text encoding algorithm such as base64, and then include that in a CDATA section. The latter will be necessary for binary formats. The data should be decoded and decompressed according to the encodingMethod and compressionMethod fields in eml-physical as if the data had been obtained out-of-band (e.g., from a URL).
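The reverse-order rule for encodingMethod and compressionMethod can be sketched in Python as follows. This is only an illustration under stated assumptions: the function recover_object and the INVERSE lookup table are hypothetical names, the methods are assumed to be supplied as a single list in the order in which they were applied, and only gzip and base64 are handled because the Python standard library offers direct inverses for them.

```python
import base64
import gzip

# Hypothetical mapping from a method name to the function that undoes it.
INVERSE = {
    "gzip": gzip.decompress,
    "base64": base64.b64decode,
}


def recover_object(payload, applied_methods):
    """Undo the listed methods in the reverse of the order they were applied."""
    data = payload
    for method in reversed(applied_methods):
        if method not in INVERSE:
            raise ValueError("no inverse registered for method: " + method)
        data = INVERSE[method](data)
    return data


# Example: the object was compressed with gzip first, then base64 encoded.
original = b"DATE,PLOT\n2002-01-15,hfr5\n"
inline_payload = base64.b64encode(gzip.compress(original))
recovered = recover_object(inline_payload, ["gzip", "base64"])
```

Here recovered holds the same bytes as original, because base64 is undone first and gzip second.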