Spatial Search and Plotting Using Geohashes
===========================================

Introduction
------------

A geohash is an encoded character string that is computed from geographic
coordinates. For example the approximate latitude and longitude of The National
Center For Ecological Analysis and Synthesis is 34.419279, -119.698472 from
which the geohash of *9q4gu1y4z* can be derived. The geohash algorithm is
bidirectional, so geographic coordinates can be encoded into geohashes and
geohashes can be decoded to obtain coordinates.
 
Geohashes have the property that characters can be incrementally removed from
the right side of the geohash to represent a geographic location less
precisely. A geohash is an approximation of a point, where each length of the
geohash corresponds to a rectangle (a geohash tile) that is an approximation of
the original encoded geographic coordinate.

This feature of geohashes can be useful for searching and plotting at different
resolutions.

Table 1 shows the relationship between geohash length and the size of the
rectangle represented by that geohash at the equator.

.. table:: **Table 1** Geohash Tile Sizes   

   ======= =====================
   Length  Tile Size
   ======= =====================
    1      5,009.4km x 4,992.6km
    2      1,252.3km x 624.1km
    3      156.5km x 156km
    4      39.1km x 19.5km
    5      4.9km x 4.9km
    6      1.2km x 609.4m
    7      152.9m  x 152.4m
    8      38.2m x 19m
    9      4.8m x 4.8m
   10      1.2m x 59.5cm
   11      14.9cm x 14.9cm
   12      3.7cm x 1.9cm 
   ======= =====================


Table 2 shows the relationship between a geohash and the resulting latitude and
longitude decoded from the different length geohashes. As characters are
removed from the original geohash '9q4gu1y4z' the bounding rectangle and the
accuracy of the decoded geohash becomes less precise. The decoded geohash
corresponds to the centroid of the bounding rectangle.

.. table:: **Table 2.** Geohash length vs Accuracy

  ========== =========================== =============================================
  Geohash    Tile Center lat, long       Tile minlat, minlong, maxlat, maxLong
  ========== =========================== =============================================
  9          22.5, -112.5                0, -135, 45, -90
  9q         36.5625, -118.125           33.75, -123.75, 39.375, -112.5
  9q4        34.45312, -120.23437        33.75, -120.9375, 35.15625, -119.53125
  9q4g       34.36523, -119.70703        34.27734, -119.88281, 34.45312, -119.53125
  9q4gu      34.43115, -119.68505        34.40917, -119.70703, 34.45312, -119.66308
  9q4gu1     34.41741, -119.70153        34.41467, -119.70703, 34.42016, -119.69604
  9q4gu1y    34.41947, -119.69810        34.41879, -119.69879, 34.42016, -119.69741
  9q4gu1y4   34.41922, -119.69861        34.41913, -119.69879, 34.41930, -119.69844
  9q4gu1y4z  34.41928, -119.69846        34.41926, -119.69849, 34.41930, -119.69844
  ========== =========================== =============================================
    
Geohashes comprise a nested spatial indexing system with each level of
geohashes tile containing 32 tiles of the next smaller tile size. The level one
geohashes (length=1) divide the earth into 32 tiles. Each of these 32 tiles is
then subdivided into 32 level 2 tiles and so on.

Geohashes also have the property that all smaller tiles within the enclosing
geohash tile begin with the same leading characters, therefor for the level 1
tile '9', all level 2 sub-tiles begin with '9': '90', '91', '93',..., '9z'.

For example the level 3 geohash tile that encloses much of Santa Barbara County
is '9q4'. Also contained in this bounding rectangle is the city center of Santa
Maria (geohash 9q4qg7j2hmdz), Goleta (geohash 9q4gckb5jxu7) and Santa Barbara
(geohash 9q4gu4n7y5b7) all of which begin with the characters '9q4' and fall
within the '9q4' geohash rectangle. This property is very useful for searching
and sorting datastores that contain geohashes.


DataONE Search Index and Geohashes
----------------------------------

The DataONE search index contains geohashes that have been computed for each
geographic coverage associated with a PID containing geographic coverage
information, which currently includes metadata objects in EML and FGDC formats.
The search index is described here: `<SearchMetadata.html>`_. Each PID in the
search index has a geohash computed at nine different resolutions,
corresponding to the geohash lengths shown *Table 3*. The field names are
appended with the geohash length, so for example the field *geohash_1* has a
string length of one and corresponds to the largest tile size in *Table 1*.
Geohashes are added to the search index at different lengths to allow for
searching and plotting at different resolutions.

For EML documents, the geohashes are computed by determining the centroid of
the XML elements *northBoundingCoordinate, southBoundingCoordinatem,
eastBoundingCoordinate, westBoundingCoordinate* which are child element of
*//dataset/coverage/geographicCoverage/boundingCoordinates*. Because any number
of coverages may be defined with the EML format, the geohashes for these
coverages are stored in a Solr multi-valued field. EML allows for the four
bounding coordinates to specify a single coordinate (i.e.
westBoudingCoordinate=eastBoundingCoordinate and
northBoundingCoordinate=southBoundingCoordinate), in which case this location
is used to compute a geohash.

For FGDC documents, the XML elements *northBoundingCoordinate,
southBoundingCoordinate, eastBoundingCoordinate, westBoundingCoordinate*
(parent element //metadata/idinfo/spdom/bounding) are used to compute the
geohash using the same method as for EML.


Using Geohashes for plotting
----------------------------

Geohashes can be used to efficiently plot the location of the geographic
coverages. The following examples show different search and plotting strategies
that are possible using geohashes. Several public domain Javascript libraries
are available for assisting in developing web clients that could use geohashes.
For example, the Javascript library *node-geohash* (available at
https://github.com/sunng87/node-geohash) contains routines to encode and decode
geohashs in addition to other spatial operators using geohashes. This library
will be used for the examples that follow.

**Example: Retrieve geohashes as facets in Solr search**

In this example the Solr search index is queried for a particular field of
interest, with the associated geohash counts being returned as a field facet.
Used in this way, the facet field of geohashes becomes a spatial bin, with the
size of the geographic area and the spatial resolution of the binning selected
by the geohash level.

For example, if we are interested in plotting the location of PIDs that have
some associated with kelp, we could query the search index with the Solr query:

..

  https://cn.dataone.org/cn/v1/query/solr?q=kelp&facet=true&facet.field=geohash_5&facet.mincount=1&rows=0

The portion of the response that we are interested in are the facet counts:

::

  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="geohash_5">
        <int name="9q4qx">5</int>
        <int name="9q4ce">4</int>
        <int name="9q4cf">4</int>
        <int name="9q4ey">4</int>
        <int name="9q4ge">4</int>
        <int name="9q4gx">4</int>
        <int name="9q4kj">4</int>
        <int name="9q4s4">4</int>
        <int name="9q4ez">3</int>
        <int name="9q4g8">3</int>
        <int name="9q4gb">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
    <lst name="facet_ranges"/>
  </lst>
  
To display these search results each geohash can be decoded to obtain the
latitude, longitude of the geohash. For example, we can obtain the coordinates
of the first geohash returned from the search as shown in the following code
fragment:

::

  // Use the node-geohash Javascript library
  var geohashLib= require('ngeohash');
  
  // Return [minlat, minlon, maxlat, maxlon] of geohash tile
  var coords = geohashLib.decode("9q4qx");

The variable *coords* now contains the latitude, longitude (coords.latitude,
coords.logitude) of the decoded geohash, which is center point of the geohash
tile, in this case for level 5 geohash tiles. We could then place a marker with
counts at these coordinates to indicate how many hits occured in this geohash
tile.

Care must be taken in selecting the right geohash tile level for the Solr
query, with the consideration of smaller geohash tiles providing more accurate
spatial results, but returning a greater number of facet results as each
greater resolution tile covers a smaller geographic area.


Using Geohashes for searching
-----------------------------

Geohashes in the search index are multi-valued, so that geohashes have been
computed for each geographic coverage for a PID. Since the geohashes are
indexed at different resolutions, you can search all coverages at different
spatial resolutions.

**Example: Search using a bounding box**

One appraach to using geohashes for search is to retrieve all PIDs with
geohashes that overlap a search box. First determine which geohashes overlap a
bounding rectangle, in this case the bounding rectangle that encompasses Santa
Cruz Island in the Santa Barbara Channel:

::

  // Use the node-geohash Javascript library
  var geohashLib= require('ngeohash');

  // Search for all geohashes within a geographic bounding box
  // which might be the current browser viewport or alternatively a
  // region of interest

  // Santa Cruz Island bounding coordinates (approximate)
  // lower left

  minlat = 33.959878;
  minlon = -119.914398;

  // upper right
  maxlat = 34.075341;
  maxlon = -119.520264;

  // Find all geohashes that overlap the specified bounding box.
  var geohashes = geohashLib.bboxes (minlat, minlon, maxlat, maxlon, precision=4);


The geohashes that overlap the search bounding box are returned. These
geohashes can then be used to find PIDs that have a coverage that is within
these geohash tiles:

::

  https://cn.dataone.org/cn/v1/query/solr?q=*:*&q.op=OR&fq=geohash_4:(9q4c 9q49 9q51)&fl=id
  
This Solr filter query will find all entries for which one of the level 4
geohashes matches any of the specified geohashes. Because the geohash_* fields
are indexed as Solr multivalued fields, all coverages for a PID are compared to
see if they match.


Geohash algorithm
-----------------

A description of the geohash algorith is outside the scope of this document,
but an excellent description of the can be found at
http://en.wikipedia.org/wiki/Geohash.
 
One caveat of the geohash algorith that may be of interest to end users
however, is that because of the tile ordering ("Z" ordering) where tile
geohashes are incremented in a "Z" pattern and not strictly by row, column, it
is not gauranteed that adjancent tiles have similar geohashes, for example, the
level 1 geohashes at the equator, starting from the International Date Line,
are named "8", "9", "d", "e", "s", "t" and so on.