
The clustering plugin attempts to automatically group similar "documents" and assign human-readable labels to the resulting groups. The clusters can be thought of as "dynamic facets" generated for each unique query and its set of search hits. Take a look at the Carrot2 demo page to see how this can be used in practice.

Each document passed for clustering is composed of several logical parts: the document's identifier, its origin URL, title, the main content and a language code. Only the identifier field is mandatory. Everything else is optional, but at least one of the optional fields is needed for clustering to produce reasonable results.

Important!

Read this section first. It contains important information about clustering that will help you understand what is going on behind the scenes.

Documents indexed in ElasticSearch do not have to follow any predefined schema, so the actual fields of a JSON document need to be mapped to the logical layout required by the clustering plugin. An example mapping is illustrated in the figure below:

Logical field mapping

Note that two document fields are mapped to the title. This is not an error: any number of fields can be mapped to either the title or the content. The content of those fields will be concatenated and used for clustering.
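The concatenation described above can be pictured with a small sketch. It only illustrates the idea and is not the plugin's code: the helper name toLogicalDocument, the sample document, its property names and the single-space separator are all assumptions made for this example.

```javascript
// Illustration only: maps properties of a JSON document to logical fields.
// The helper name, the sample document and the single-space separator are
// assumptions of this sketch; they are not part of the plugin.
function toLogicalDocument(id, doc, mapping) {
  var logical = { id: id };
  Object.keys(mapping).forEach(function (logicalField) {
    var parts = mapping[logicalField]
      .map(function (property) { return doc[property]; })
      .filter(function (value) {
        return typeof value === "string" && value.length > 0;
      });
    // Logical fields with no content are simply left out.
    if (parts.length > 0) {
      logical[logicalField] = parts.join(" ");
    }
  });
  return logical;
}

// Hypothetical document with two properties feeding the logical title.
var sampleDoc = {
  subject: "Data mining",
  headline: "An overview",
  body: "Text about clustering."
};
var logicalDoc = toLogicalDocument("doc-1", sampleDoc, {
  title: ["subject", "headline"],
  content: ["body"]
});
console.log(logicalDoc.title);   // Data mining An overview
```

Only the identifier is always present in the result; the url and language fields are absent here because nothing was mapped to them.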

The logical fields can also be filled with generated content, for example by applying the highlighter to the document's fields. This can be useful to decrease the amount of text passed to the clustering algorithm (which improves performance) or to make the clustered content more query-specific (which typically clusters better). The REST API examples below demonstrate the API for field mapping in detail.

The Java API for clustering search results is fully functional and does the work behind all REST requests described in the following part of this document. For concrete code using the API, see the source code of the plugin on GitHub, especially the unit and integration tests.

The HTTP REST API of the plugin contains several methods reflecting Java API's functionality. Each of these methods is described in detail below.

List available algorithms

  • /_algorithms (GET or POST)

This action lists all available clustering algorithms. The returned identifiers can be used as a parameter to the clustering request.

Request

A request to list the available algorithms is a simple GET or POST request to the /_algorithms URL.

Response

The response is a JSON object with an algorithms property which is a non-empty array of algorithm identifiers. The following example shows the algorithms available on this plugin instance. The default algorithm is the first one on the list.

$.get("/_algorithms", function(response) {
    $("#list-of-algorithms").text(
      response.algorithms.join("\n"));
});
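Since the first identifier on the list is the default, a client can pick an algorithm from the response without hard-coding one. The sketch below assumes a response of the shape described above. The helper name chooseAlgorithm and the sample response are hypothetical, and the actual list depends on the plugin instance.

```javascript
// Picks an algorithm identifier from an /_algorithms response.
// Falls back to the default (the first identifier) when `preferred`
// is not available. Helper name and sample data are assumptions.
function chooseAlgorithm(response, preferred) {
  var available = response.algorithms;
  if (!Array.isArray(available) || available.length === 0) {
    throw new Error("Unexpected /_algorithms response");
  }
  return available.indexOf(preferred) >= 0 ? preferred : available[0];
}

// Hypothetical response; a real instance may list other identifiers.
var sampleAlgorithms = { algorithms: ["lingo", "stc", "byurl"] };
console.log(chooseAlgorithm(sampleAlgorithms, "stc"));      // stc
console.log(chooseAlgorithm(sampleAlgorithms, "missing"));  // lingo
```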
            

            

Search and cluster results

  • /_search_with_clusters (POST, GET)
  • /{index}/_search_with_clusters (POST, GET)
  • /{index}/{type}/_search_with_clusters (POST, GET)

This action performs a search query, fetches matching hits and clusters them on-the-fly.

The index and type URI segments implicitly bind the search request part of the message to a given index and document type, exactly as in the search request API.

A clustering request is an HTTP REST request. The full set of parameters is supported via an HTTP POST request with a JSON body. A limited subset of clustering functionality is also exposed via the HTTP GET method.

Request (HTTP POST)

An HTTP POST request should contain a JSON object with the following properties.

search_request

required The search request to fetch documents to be clustered. This section follows exactly what the search DSL specifies, including all optional bells and whistles such as sorting, filtering, query DSL, highlighter, etc.

query_hint

required This is a string attribute specifying the query terms which were used to fetch the matching documents. This hint helps the clustering algorithm avoid trivial clusters around the query terms. Typically the query hint will be identical to what the user typed in the search box. If possible, it should be stripped of any boolean or search-engine-specific operators which could affect the clustering process. The query hint is obligatory but may be an empty string.

field_mapping

required Defines how to map actual fields of the documents matching the search_request to logical fields of the documents to be clustered. The value should be a hash where keys indicate logical document fields and values are arrays with field source specifications (content of fields defined by these specifications is concatenated). For example this is a valid field mapping specification:

{
  "url":      ["_source.urlSource"],
  "title":    ["fields.subject"],
  "content":  ["_source.abstract", "highlight.main"],
  "language": ["fields.lang"]
}

Any of the following logical document field names are valid:

url

The URL of the document.

title

The title of the document.

content

The main body (content) of the document.

language

Optional language "tag" for the title and content of a document. The language tag is a two-letter ISO 639-1 code, with the exception of Simplified Chinese (zh_cn code). Whether or not the language is supported by a clustering engine depends on the algorithm used. Carrot2 algorithms support languages defined in the LanguageCode class.

A field source specification defines where the value is taken from: the search hit's fields, stored document's content, or from the highlighter's output. The syntax of field source specification is as follows:

fields.{fieldname}
Defines a search hit's field (stored field or field reparsed from source document but requested and returned in the search request).
highlight.{fieldname}
Defines a search hit's highlighted field. The highlighter output must also be configured properly in the search request (see field mapping example).
_source.{fieldname}
Defines a source document's field (top-level property of the JSON document). This will reparse the source document and fetch the appropriate value from there.

algorithm

optional Defines which clustering component (algorithm) should be used for clustering. Names of all built-in clustering algorithms are logged at startup and are also returned from the list algorithms request. If not present, the default algorithm is used.

include_hits

optional If set to false, the clustering response will not contain search hits, only cluster labels and document references. This option may be useful to decrease the size of clustering response in case only cluster labels are needed.

max_hits

optional If set to a non-negative number, the clustering response will contain at most the given number of search hits. The clustering will still run on the full window of results returned by the original search request. This option may be useful to decrease the size of the clustering response when cluster labels are used as facets (for refining the query, but without the immediate link to the search hits).

Note that clusters may still reference documents not present in the returned (trimmed) hits window.

attributes

optional A map of key-value attributes overriding the default algorithm settings per-query (runtime attributes in Carrot2 parlance). Typically the default settings are overridden using init-time XML configuration files.

Very important

Clustering requires at least a few dozen documents (hits) in order to make sense. The clustering plugin clusters search results only (it does not look in the index, it does not fetch additional documents). Make sure to specify the size of the fetch window to be at least 100 documents. If the response does not need so many hits (document references), the hits can be trimmed by using max_hits parameter on the clustering request.
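The required properties and the advice above can be checked on the client before a request is sent. The following is a minimal sketch and not part of the plugin: the function name checkClusteringRequest and the wording of the messages are assumptions, and the check covers only what this section describes (required properties, logical field names, field source prefixes and the size of the fetch window).

```javascript
var LOGICAL_FIELDS = ["url", "title", "content", "language"];
var SOURCE_SPEC = /^(fields|highlight|_source)\..+$/;

// Returns a list of problems found in a clustering request body.
// An empty list means that none of the checks below failed.
function checkClusteringRequest(request) {
  var problems = [];
  if (!request.search_request || typeof request.search_request !== "object") {
    problems.push("search_request is required");
  }
  if (typeof request.query_hint !== "string") {
    problems.push("query_hint is required (it may be an empty string)");
  }
  var mapping = request.field_mapping;
  if (!mapping || typeof mapping !== "object") {
    problems.push("field_mapping is required");
  } else {
    Object.keys(mapping).forEach(function (logicalField) {
      if (LOGICAL_FIELDS.indexOf(logicalField) < 0) {
        problems.push("unknown logical field: " + logicalField);
      }
      var specs = mapping[logicalField];
      if (!Array.isArray(specs)) {
        problems.push(logicalField + " must map to an array of specifications");
        return;
      }
      specs.forEach(function (spec) {
        if (typeof spec !== "string" || !SOURCE_SPEC.test(spec)) {
          problems.push("invalid field source specification: " + spec);
        }
      });
    });
  }
  // Advice from this section: cluster a window of at least 100 hits.
  var size = request.search_request && request.search_request.size;
  if (typeof size !== "number" || size < 100) {
    problems.push("advice: search_request.size should be at least 100");
  }
  return problems;
}

var foundProblems = checkClusteringRequest({
  search_request: { query: { match: { _all: "data mining" } }, size: 10 },
  query_hint: "data mining",
  field_mapping: { title: ["_source.title"], summary: ["body"] }
});
console.log(foundProblems.join("\n"));
```

The sample request is deliberately flawed: it uses an unknown logical field, a specification without a source prefix, and a fetch window of only 10 hits.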

Request (HTTP GET)

An HTTP GET clustering request supports a superset of the HTTP URI parameters defined by ElasticSearch's URI search request. All additional parameters correspond to those typically defined in the body of a clustering request sent via HTTP POST. Namely, the following parameters are supported by HTTP GET:

field_mapping_*

required This is a wildcard (a family) of parameters, each of which defines a logical field mapping, similar to the field_mapping map described in the HTTP POST request. A field_mapping_title will specify the logical title's mapping, whereas field_mapping_url will specify the logical URL's mapping, and so on.

The value of the mapping parameter is a comma-separated list of mapping specifications, as described in the description of the POST request.

algorithm

optional Same semantics as the algorithm attribute described in the HTTP POST request.

query_hint

optional Same semantics as the query_hint attribute described in the HTTP POST request. For GET requests the query hint is optional; if not present, the q attribute is used as the default.

Important

An HTTP GET request offers a subset of the functionality of the full HTTP POST JSON syntax. For example, it is not possible to specify a field mapping to highlighted field values, define custom algorithm attributes, etc. HTTP POST is recommended for production.

An example HTTP GET clustering request is shown below. The snippet renders the labels of the resulting clusters.

var getUrl = "/test/test/_search_with_clusters?"
  + "q=data+mining&"
  + "size=100&"
  + "field_mapping_title=_source.title&"
  + "field_mapping_content=_source.content";

// Run HTTP GET via jquery and render cluster labels.
$.get(getUrl,
  function(response) {
    $("#cluster-httpget-result").text(
      dumpClusters([], response.clusters).join("\n"));
});
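Instead of concatenating the query string by hand, the URL can be assembled from a parameter map. The sketch below is one possible way to do it. The helper name buildGetClusteringUrl and its option names are assumptions of this example, and it relies on the standard URLSearchParams class (available in current browsers and in Node.js), which encodes spaces as "+" and the comma separating several specifications as "%2C".

```javascript
// Builds the URL of an HTTP GET clustering request.
// `basePath` is the index/type prefix, for example "/test/test".
// Helper name and option names are assumptions of this sketch.
function buildGetClusteringUrl(basePath, options) {
  var params = new URLSearchParams();
  params.append("q", options.q);
  params.append("size", String(options.size));
  Object.keys(options.fieldMapping).forEach(function (logicalField) {
    // Several specifications for one logical field form a
    // comma-separated list.
    params.append("field_mapping_" + logicalField,
      options.fieldMapping[logicalField].join(","));
  });
  if (options.algorithm) {
    params.append("algorithm", options.algorithm);
  }
  if (options.queryHint !== undefined) {
    params.append("query_hint", options.queryHint);
  }
  return basePath + "/_search_with_clusters?" + params.toString();
}

var builtUrl = buildGetClusteringUrl("/test/test", {
  q: "data mining",
  size: 100,
  fieldMapping: {
    title: ["_source.title"],
    content: ["_source.content"]
  }
});
console.log(builtUrl);
```

For the options above the helper produces the same URL as the hand-written example.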
            

            
Response

The response format is identical to a plain search request response, with extra properties presented in the schematic output below.

{
  /* Typical search response fields. */
  "hits": { /* ... */ },

  /* Clustering response fields. */
  "clusters": [
    /* Each cluster is defined by the following. */
    {
      "id":    /* identifier */,
      "score": /* numeric score */,
      "label": /* primary cluster label */,
      "other_topics": /* if present, and true, this cluster groups
                         unrelated documents (no related topics) */,
      "phrases": [
        /* cluster label array, will include primary. */
      ],
      "documents": [
        /* This cluster's document ID references.
           May be undefined if this cluster holds sub-clusters only. */
      ],
      "clusters": [
        /* This cluster's subclusters (recursive objects of the same
           structure). May be undefined if this cluster holds documents only. */
      ]
    },
    /* ...more clusters */
  ],
  "info": {
    /* Additional information about the clustering: execution times,
       the algorithm used, etc. */
  }
}
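The document references in a cluster can be joined with the search hits returned in the same response. The sketch below assumes that the references equal the _id of each hit (the usual ElasticSearch hit layout); this assumption, the helper names and the sample response are illustrative only. It also reports references that point outside the returned hits window, which can happen when max_hits trims the hits.

```javascript
// Collects the document references of a cluster and of all its subclusters.
function collectDocumentIds(cluster) {
  var ids = (cluster.documents || []).slice();
  (cluster.clusters || []).forEach(function (sub) {
    ids = ids.concat(collectDocumentIds(sub));
  });
  return ids;
}

// Splits a cluster's references into hits present in the response and
// references with no matching hit (for example after max_hits trimming).
// Assumption: a document reference equals the hit's _id.
function resolveCluster(response, cluster) {
  var hits = (response.hits && response.hits.hits) || [];
  var hitsById = {};
  hits.forEach(function (hit) { hitsById[hit._id] = hit; });

  var result = { hits: [], missing: [] };
  collectDocumentIds(cluster).forEach(function (id) {
    if (Object.prototype.hasOwnProperty.call(hitsById, id)) {
      result.hits.push(hitsById[id]);
    } else {
      result.missing.push(id);
    }
  });
  return result;
}

// Hypothetical, heavily abbreviated response.
var sampleResponse = {
  hits: { hits: [{ _id: "1" }, { _id: "2" }] },
  clusters: [{
    id: 0, label: "Data Mining", documents: ["1"],
    clusters: [{ id: 1, label: "Software", documents: ["2", "3"] }]
  }]
};
var resolved = resolveCluster(sampleResponse, sampleResponse.clusters[0]);
console.log(resolved.hits.length + " hits, missing: " + resolved.missing.join(","));
```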

Given the following function that recursively dumps clusters:

window.dumpClusters = function(arr, clusters, indent) {
  indent = indent ? indent : "";
  clusters.forEach(function(cluster) {
    arr.push(
        indent + cluster.label
        + (cluster.documents ? " [" + cluster.documents.length + " documents]"   : "")
        + (cluster.clusters  ? " [" + cluster.clusters.length  + " subclusters]" : ""));
    if (cluster.clusters) {
      dumpClusters(arr, cluster.clusters, indent + "  ");
    }
  });
  return arr;
}

We can dump all cluster labels of a clustering request with the following snippet of JavaScript:

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "max_hits": 0,
  "query_hint": "data mining",
  "field_mapping": {
    "title": ["_source.title"],
    "content": ["_source.content"]
  }
};

$.post("/test/test/_search_with_clusters",
  JSON.stringify(request),
  function(response) {
    $("#cluster-list-result").text(
      dumpClusters([], response.clusters).join("\n"));
});
            

            

The output will vary depending on the choice of clustering algorithm (and particular documents that made it to the hit list if search is not deterministic). The following example shows a pseudo-clustering algorithm that uses the logical url field to produce clusters based on the components of each document's domain. We don't need every search hit here so we will omit them in the response.

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "max_hits": 0,
  "query_hint": "data mining",
  "field_mapping": {
    "url": ["_source.url"]
  },
  "algorithm": "byurl"
};

$.post("/test/test/_search_with_clusters",
  JSON.stringify(request), function(response) {
    $("#cluster-list-result2").text(
      dumpClusters([], response.clusters).join("\n"));
});
            

            

A full response for a clustering request can look as shown below (note the difference in field mapping in this example).

var request = {
  "search_request": {
    "fields": [ "title", "content" ],
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "query_hint": "data mining",
  "field_mapping": {
    "title": ["fields.title"],
    "content": ["fields.content"]
  }
};

$.post("/test/test/_search_with_clusters",
  JSON.stringify(request),
  function(response) {
    $("#simple-request-result").text(
      JSON.stringify(response, null, "  "));
});
            

            

The field mapping section connects the actual data with the logical data to cluster on. The different field mapping sources (_source.*, highlight.* and fields.*) can be used to tune the amount of data returned in the response and the amount of text passed to the clustering engine (and, as a result, the required processing cost).

  • The _source.* mapping takes data directly from the source document, if _source is available as part of the search hit. The content pointed to by this mapping is not returned as part of the response; it is only used internally for clustering.

    Warning! The _source may not be published by ElasticSearch's internal search infrastructure. In particular, when the search request selects only specific fields, the source will not be available. This issue should be addressed in the future (ES API constraint).

  • The fields.* mapping must be accompanied by an appropriate fields declaration in the search request. The content of those fields is returned with the response and thus can be used for display purposes (for example to show each document's title).

  • The highlight.* mapping must also be accompanied by an appropriate highlight declaration in the search request. The highlighting request specification can be used to tune the amount of content passed to the clustering engine (the number of fragments, their width, boundary, etc.). This is of particular importance when the documents are long (full content is stored). Clustering algorithms typically run perceptually "better" when focused on the context surrounding the query than when presented with the full content of all documents. Any highlighted content will also be returned as part of the response.

Compare the output for the following requests and note the differences outlined above.

var request = {
  "search_request": {
    "fields": ["url", "title", "content"],
    "query": {"match" : { "_all": "computer" }},
    "size": 100
  },

  "query_hint": "computer",
  "field_mapping": {
    "url":     ["fields.url"],
    "title":   ["fields.title"],
    "content": ["fields.content"]
  }
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#fields-request").text(JSON.stringify(response, null, "  "));
});
            
var request = {
  "search_request": {
    "fields": ["url", "title"],
    "query": {"match" : { "_all": "computer" }},
    "size": 100,
    "highlight" : {
      "pre_tags" :  ["", ""],
      "post_tags" : ["", ""],
      "fields" : {
        "content" : { "fragment_size" : 100, "number_of_fragments" : 2 }
      }
    }
  },

  "query_hint": "computer",
  "field_mapping": {
    "url":     ["fields.url"],
    "title":   ["fields.title"],
    "content": ["highlight.content"]
  }
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#highlight-request").text(JSON.stringify(response, null, "  "));
});
            


          


          

The clustering plugin comes with several open-source algorithms from the Carrot2 project and has built-in support for the commercial Lingo3G clustering algorithm.

Which algorithm to choose depends on the amount of traffic (STC is faster than Lingo but arguably produces less intuitive clusters; Lingo3G is the fastest algorithm but is neither free nor open source), the expected result (Lingo3G provides hierarchical clusters; Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer as to which algorithm is "the best".

Compare the clusters dumped by two different algorithms for the same search request.

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "lingo"
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n"));
});
            
var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "stc"
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n"));
});
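The two requests above differ only in the algorithm property, so such variants can be derived from a single base request. This is a small sketch: the helper name requestsPerAlgorithm is an assumption, and the list of identifiers would typically come from the list algorithms request.

```javascript
// Returns one copy of `baseRequest` per algorithm identifier,
// each with its "algorithm" property set accordingly.
function requestsPerAlgorithm(baseRequest, algorithms) {
  return algorithms.map(function (algorithm) {
    // A JSON round-trip is a simple deep copy that is sufficient
    // for plain JSON request bodies.
    var copy = JSON.parse(JSON.stringify(baseRequest));
    copy.algorithm = algorithm;
    return copy;
  });
}

var baseRequest = {
  "search_request": {
    "query": { "match": { "_all": "data mining" } },
    "size": 100
  },
  "query_hint": "data mining",
  "field_mapping": {
    "title": ["_source.title"],
    "content": ["_source.content"]
  }
};

var variants = requestsPerAlgorithm(baseRequest, ["lingo", "stc"]);
variants.forEach(function (variant) {
  console.log(variant.algorithm + ": " + JSON.stringify(variant.field_mapping));
});
```

Each variant can then be posted to /_search_with_clusters in the same way as the requests above.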
            


          


          

The default algorithm suite contains empty stubs for the initialization attributes of every included algorithm. These files follow the naming convention {algorithm-name}-attributes.xml and are resolved relative to the current value of the resources configuration setting (see plugin configuration).

For example, to override the default attributes for all requests to the lingo algorithm, one would edit {plugin.zip}/config/algorithms/lingo-attributes.xml and place any overridden attributes there, as in:

<attribute-sets default="overridden-attributes">
  <attribute-set id="overridden-attributes">
    <value-set>
      <label>overridden-attributes</label>

      <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
        <value type="java.lang.Integer" value="5"/>
      </attribute>
    </value-set>
  </attribute-set>
</attribute-sets>

It is perhaps most convenient to export the configuration XMLs directly from the Carrot2 Workbench.

Every clustering algorithm comes with many attributes that modify its behavior (the Carrot2 Workbench can be used for tuning these). If desired, certain attributes can be modified per request. The following example sets the desired number of clusters to a random value (execute the example a few times to see the difference).

var request = {
  "search_request": {
    "query": {"match" : { "_all": "data mining" }},
    "size": 100
  },

  "query_hint": "data mining",
  "field_mapping": {
    "title":   ["_source.title"],
    "content": ["_source.content"]
  },
  "algorithm": "lingo",
  "attributes": {
     "LingoClusteringAlgorithm.desiredClusterCountBase": Math.round(5 + Math.random() * 5)
  }
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#request-attributes").text(dumpClusters([], response.clusters).join("\n"));
});
            

          

The field mapping specification can include a language element, which defines the ISO 639-1 code of the language in which the title and content of a document are written. This information can be stored in the index based on a priori knowledge of the documents' source or on a language detection filter applied at indexing time.

The algorithms inside Carrot2 framework will accept ISO codes of languages defined in LanguageCode enum.

The language hint makes it easier for clustering algorithms to separate documents from different languages on input and to pick the right language resources for clustering. If you do have multi-lingual query results (or query results in a language different than English), it is strongly advised to map the language field appropriately.
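Before relying on a language mapping, it can help to check what the language field actually contains. The sketch below only checks the shape of the tag as described earlier (a two-letter code, or zh_cn) and counts documents per tag. The helper names, the lower-case requirement, the lang property name and the sample documents are assumptions of this example. Whether an algorithm supports a given language is a separate question.

```javascript
// Checks the shape of a language tag: two lower-case letters, or the
// "zh_cn" exception. Lower case is an assumption of this sketch.
function hasLanguageTagShape(tag) {
  return typeof tag === "string" &&
    (/^[a-z]{2}$/.test(tag) || tag === "zh_cn");
}

// Counts documents per language tag; anything that does not look like
// a language tag is counted under "(unknown)".
function languageHistogram(docs, fieldName) {
  var counts = {};
  docs.forEach(function (doc) {
    var tag = hasLanguageTagShape(doc[fieldName]) ? doc[fieldName] : "(unknown)";
    counts[tag] = (counts[tag] || 0) + 1;
  });
  return counts;
}

// Hypothetical source documents with a "lang" property.
var sampleDocs = [
  { lang: "de" }, { lang: "en" }, { lang: "de" }, { lang: "German" }, {}
];
var histogram = languageHistogram(sampleDocs, "lang");
console.log(JSON.stringify(histogram));
// {"de":2,"en":1,"(unknown)":2}
```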

The following example applies a clustering algorithm to all documents. Some documents are in German (and have a de language code), some are in English (and have an en language code). We additionally set the language aggregation strategy to FLATTEN_NONE so that top-level groups indicate the language of documents contained in sub-groups. Note the top-level group names in the output from the code sample below.

var request = {
  "search_request": {
    "query": {"match_all" : {}},
    "size": 100
  },

  "query_hint": "bundestag",
  "field_mapping": {
    "title":    ["_source.title"],
    "content":  ["_source.content"],
    "language": ["_source.lang"]
  },
  "attributes": {
    "MultilingualClustering.languageAggregationStrategy": "FLATTEN_NONE"
  }
};

$.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
  $("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n"));
});
            

          

The plugin comes with sensible defaults out of the box and should require no additional configuration. Customize only if really necessary.

The following configuration properties can be tweaked at the global ES configuration level.

{es.home}/config/elasticsearch.yml,
{es.home}/config/elasticsearch.json,
{es.home}/config/elasticsearch.properties

The main ES configuration file can be used to enable or disable the plugin and to tweak the resources assigned to clustering requests.

carrot2.enabled
If set to false disables the plugin, even if it is installed.
threadpool.search.*
Clustering requests are executed on the search threadpool inside ES. It may be necessary to tune the settings of this threadpool to limit the number of concurrent clustering requests to the number of computational cores on the node (clustering is CPU-intense). See the relevant threadpool documentation section in ES.

The following configuration files and properties can be found inside the plugin's ZIP file (config/ folder) or, after plugin installation, under {es.home}/config/elasticsearch-carrot2/.

carrot2.yml,
carrot2.json,
carrot2.properties

The master configuration file for the plugin.

suite

Path to the algorithm suite XML. The resource is looked up relative to the configuration folder. The algorithm suite XML is in Carrot2 format and it contains the defaults for all open-source algorithms and Lingo3G.

resources

Resource lookup path for loading Carrot2 lexical resources, Lingo3G's lexical resources and algorithm descriptor files (including any initialization-time attributes).

Any resources not present in this location will be loaded from classpath (defaults).

controller.pool-size

Size of the internal pool of algorithm instances. This pool is sized automatically depending on the configuration of the search threadpool in ElasticSearch. If too many resources are consumed, the pool can be set to a fixed size using this option.

For Lingo3G, the license needs to be installed at any of the following locations.

{es.home}/config/license.xml,
{es.home}/config/.license.xml,
{plugin-zip}/config/license.xml,
{plugin-zip}/config/.license.xml

Note that if the license is installed inside the plugin, it should be copied to the ZIP file before the installation. Once installed, the plugin's configuration is copied to {es.home}/config/plugin-name/ and can be tweaked there.