The clustering plugin attempts to automatically group together similar "documents" and assign human-readable labels to these groups. The clusters can be thought of as "dynamic facets" generated for each unique query and set of search result hits. Take a look at the Carrot2 demo page to see how this can be used in practice.
Each document passed for clustering is composed of several logical parts: the document's identifier, its origin URL, title, the main content and a language code. Only the identifier field is mandatory. Everything else is optional, but at least one of the optional fields must be provided for clustering to produce reasonable results.
Read this section first: it contains important information about clustering that will help you understand what is going on behind the scenes.
Documents indexed in ElasticSearch do not have to follow any predefined schema so actual fields of a JSON document need to be mapped to the logical layout required by the clustering plugin. An example mapping can look as illustrated in the figure below:
Note that two document fields are mapped to the title. This is not an error: any number of fields can be mapped to either the title or the content. The content of those fields will be concatenated and used for clustering.
The logical fields can also be filled with generated content, for example by applying the highlighter to the document's fields. This can be useful to decrease the amount of text passed to the clustering algorithm (improves performance) or to make the clustered content more query-specific (this typically clusters better). The REST API examples below demonstrate the API for field mapping in detail.
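The mapping described above can be pictured with a small client-side sketch. This is not the plugin's own code: the function names `resolveSource` and `toLogicalDocument` are hypothetical, and the separator used to join values is an assumption, since this document does not say how the plugin concatenates them.

```javascript
// Illustrative only: mimics how several mapped fields could be combined
// into one logical document. Function names are hypothetical.
function resolveSource(hit, spec) {
  // A spec looks like "fields.title", "highlight.content" or "_source.title".
  const dot = spec.indexOf(".");
  const source = spec.substring(0, dot);
  const name = spec.substring(dot + 1);
  const container = hit[source] || {};
  const value = container[name];
  if (value === undefined || value === null) {
    return [];
  }
  // Field and highlighter values are often arrays; normalize to an array.
  return Array.isArray(value) ? value : [value];
}

function toLogicalDocument(hit, fieldMapping) {
  const doc = { id: hit._id };
  Object.keys(fieldMapping).forEach(function (logicalField) {
    const parts = [];
    fieldMapping[logicalField].forEach(function (spec) {
      resolveSource(hit, spec).forEach(function (value) {
        parts.push(String(value));
      });
    });
    // Assumption: values are joined with " . "; the real separator is not
    // specified in this document.
    if (parts.length > 0) {
      doc[logicalField] = parts.join(" . ");
    }
  });
  return doc;
}
```

Logical fields whose sources are all missing are simply left out of the result, which matches the rule that only the identifier is mandatory.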
The Java API for clustering search results is fully functional and works the magic behind all REST requests described in the following part of this document. For concrete code utilizing the API see the source code of the plugin on GitHub, especially the unit and integration tests.
The HTTP REST API of the plugin contains several methods reflecting the Java API's functionality. Each of these methods is described in detail below.
/_algorithms (GET or POST)
This action lists all available clustering algorithms. The returned identifiers can be used as a parameter to the clustering request.
A request to list the available algorithms is a simple GET or POST request to the /_algorithms URL.
The response is a JSON object with an algorithms property, which is a non-empty array of algorithm identifiers. The following example shows the algorithms available on this plugin instance. The default algorithm is the first one on the list.
    $.get("/_algorithms", function(response) {
      $("#list-of-algorithms").text(
        response.algorithms.join("\n"));
    });
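Because the default algorithm is the first entry of the returned list, a client can fall back on it when a preferred algorithm is not installed. The following is a minimal sketch; `pickAlgorithm` is a hypothetical helper name, not part of the plugin.

```javascript
// Hypothetical helper: choose an algorithm identifier from the response of
// the /_algorithms action, falling back to the default (first) entry.
function pickAlgorithm(response, preferred) {
  const available = response.algorithms;
  if (!Array.isArray(available) || available.length === 0) {
    throw new Error("No clustering algorithms reported");
  }
  return available.indexOf(preferred) >= 0 ? preferred : available[0];
}
```

The returned identifier can then be passed as the algorithm parameter of a clustering request.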
/_search_with_clusters (POST, GET)
/{index}/_search_with_clusters (POST, GET)
/{index}/{type}/_search_with_clusters (POST, GET)
This action performs a search query, fetches matching hits and clusters them on-the-fly.
The index and type URI segments implicitly bind the search request part of the message to a given index and document type, exactly as in the search request API.
A clustering request is an HTTP REST request. The full set of parameters is supported via an HTTP POST request with a JSON body; a limited subset of clustering functionality is also exposed via the HTTP GET method.
An HTTP POST request should contain a JSON object with the following properties.
search_request
required The search request to fetch documents to be clustered. This section follows exactly what the search DSL specifies, including all optional bells and whistles such as sorting, filtering, query DSL, highlighter, etc.
query_hint
required This is a string attribute specifying query terms which were used to fetch the matching documents. This hint helps the clustering algorithm to avoid trivial clusters around the query terms. Typically the query terms hint will be identical to what the user typed in the search box. If possible, it should be stripped of any boolean or search-engine-specific operators which could affect the clustering process. The query hint is obligatory but may be an empty string.
field_mapping
required Defines how to map actual fields of the documents matching the search_request to logical fields of the documents to be clustered. The value should be a hash where keys indicate logical document fields and values are arrays with field source specifications (content of fields defined by these specifications is concatenated). For example, this is a valid field mapping specification:
    {
      "url": ["_source.urlSource"],
      "title": ["fields.subject"],
      "content": ["_source.abstract", "highlight.main"],
      "language": ["fields.lang"]
    }
Any of the following logical document field names are valid:
url
The URL of the document.
title
The title of the document.
content
The main body (content) of the document.
language
Optional language "tag" for the title and content of a document. The language tag is a two-letter ISO 639-1 code, with the exception of Simplified Chinese (zh_cn code). Whether or not the language is supported by a clustering engine depends on the algorithm used. Carrot2 algorithms support languages defined in the LanguageCode class.
A field source specification defines where the value is taken from: the search hit's fields, the stored document's content, or the highlighter's output. The syntax of a field source specification is as follows:
fields.{fieldname}
highlight.{fieldname}
_source.{fieldname}
algorithm
optional Defines which clustering component (algorithm) should be used for clustering. Names of all built-in clustering algorithms are logged at startup and are also returned from the list algorithms request. If not present, the default algorithm is used.
include_hits
optional If set to false, the clustering response will not contain search hits, only cluster labels and document references. This option may be useful to decrease the size of the clustering response in case only cluster labels are needed.
max_hits
optional If set to a non-negative number, the clustering response will contain at most the given number of search hits. The clustering will still run on the full window of results returned by the original search request. This option may be useful to decrease the size of the clustering response in case cluster labels are used as facets (for refining the query, but without the immediate link to the search hits).
Note that clusters may still reference documents not present in the returned (trimmed) hits window.
attributes
optional A map of key-value attributes overriding the default algorithm settings per-query (runtime attributes in Carrot2 parlance). Typically the default settings are overridden using init-time XML configuration files.
Very important
Clustering requires at least a few dozen documents (hits) in order to make sense. The clustering plugin clusters search results only (it does not look in the index, it does not fetch additional documents). Make sure to specify the size of the fetch window to be at least 100 documents. If the response does not need so many hits (document references), the hits can be trimmed by using the max_hits parameter on the clustering request.
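The POST body properties described above can be checked on the client before a request is sent. The sketch below is only an illustration built from the rules stated in this section; `checkClusteringRequest` and its return shape are hypothetical and are not part of the plugin, which performs its own validation.

```javascript
// Hypothetical client-side check of a clustering request body.
const LOGICAL_FIELDS = ["url", "title", "content", "language"];
const SOURCE_SPEC = /^(fields|highlight|_source)\..+$/;

function isObject(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function checkClusteringRequest(request) {
  const errors = [];
  const warnings = [];

  if (!isObject(request.search_request)) {
    errors.push("search_request is required");
  } else if (!(request.search_request.size >= 100)) {
    // The documentation recommends a fetch window of at least 100 hits.
    warnings.push("search_request.size should be at least 100");
  }

  if (typeof request.query_hint !== "string") {
    errors.push("query_hint is required (it may be an empty string)");
  }

  if (!isObject(request.field_mapping)) {
    errors.push("field_mapping is required");
  } else {
    Object.keys(request.field_mapping).forEach(function (logical) {
      if (LOGICAL_FIELDS.indexOf(logical) < 0) {
        errors.push("unknown logical field: " + logical);
      }
      const specs = request.field_mapping[logical];
      if (!Array.isArray(specs)) {
        errors.push(logical + " must map to an array of field sources");
        return;
      }
      specs.forEach(function (spec) {
        if (!SOURCE_SPEC.test(spec)) {
          errors.push("invalid field source: " + spec);
        }
      });
    });
  }

  return { errors: errors, warnings: warnings };
}
```

Keeping the window-size recommendation as a warning rather than an error reflects that a small window is allowed, even though it rarely clusters well.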
An HTTP GET clustering request supports a superset of HTTP URI parameters defined by ElasticSearch's URI search request. All additional parameters correspond to those typically defined in the body of a clustering request sent via HTTP POST. Namely, the following parameters are supported by HTTP GET:
field_mapping_*
required This is a wildcard (a family) of parameters, each of which defines a logical field mapping, similar to the field_mapping map described in the HTTP POST request. A field_mapping_title parameter will specify the logical title's mapping, whereas field_mapping_url will specify the logical URL's mapping, and so on.
The value of the mapping parameter is a comma-separated list of mapping specifications, as described in the description of the POST request.
algorithm
optional Identical semantics to the algorithm attribute described in the HTTP POST request.
query_hint
optional Identical semantics to the query_hint attribute described in the HTTP POST request. For GET requests the query hint is optional; if not present, the q attribute is used as the default.
Important
An HTTP GET request offers a subset of the functionality of the full HTTP POST JSON syntax. For example, it is not possible to specify a field mapping to highlighted field values, define custom algorithm attributes, etc. HTTP POST is recommended for production.
An example HTTP GET clustering request is shown below.
    var getUrl = "/test/test/_search_with_clusters?" +
      "q=data+mining&" +
      "size=100&" +
      "field_mapping_title=_source.title&" +
      "field_mapping_content=_source.content";

    // Run HTTP GET via jQuery and render cluster labels.
    $.get(getUrl, function(response) {
      $("#cluster-httpget-result").text(
        dumpClusters([], response.clusters).join("\n"));
    });
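The same kind of URL can be assembled programmatically from a field mapping, which avoids manual escaping mistakes. This is a sketch under the rules given for the GET parameters; `buildClusteringGetUrl` is a hypothetical helper. It rejects highlight sources because, as noted above, they cannot be mapped over HTTP GET.

```javascript
// Hypothetical helper: build a GET clustering URL from plain search
// parameters and a logical field mapping.
function buildClusteringGetUrl(basePath, params, fieldMapping) {
  const pairs = [];
  Object.keys(params).forEach(function (key) {
    pairs.push(encodeURIComponent(key) + "=" + encodeURIComponent(params[key]));
  });
  Object.keys(fieldMapping).forEach(function (logical) {
    const specs = fieldMapping[logical];
    specs.forEach(function (spec) {
      if (spec.indexOf("highlight.") === 0) {
        throw new Error("highlight sources require an HTTP POST request: " + spec);
      }
    });
    // Each logical field becomes a field_mapping_{name} parameter whose value
    // is a comma-separated list of field source specifications.
    pairs.push("field_mapping_" + encodeURIComponent(logical) + "=" +
      encodeURIComponent(specs.join(",")));
  });
  return basePath + "?" + pairs.join("&");
}
```

Note that `encodeURIComponent` escapes spaces as `%20` and commas as `%2C`; both are decoded by the server like any other percent-encoded character.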
The response format is identical to a plain search request response, with extra properties presented in the schematic output below.
    {
      /* Typical search response fields. */
      "hits": { /* ... */ },

      /* Clustering response fields. */
      "clusters": [
        /* Each cluster is defined by the following. */
        {
          "id": /* identifier */,
          "score": /* numeric score */,
          "label": /* primary cluster label */,
          "other_topics": /* if present, and true, this cluster groups
                             unrelated documents (no related topics) */,
          "phrases": [ /* cluster label array, will include primary. */ ],
          "documents": [ /* This cluster's document ID references.
                            May be undefined if this cluster holds sub-clusters only. */ ],
          "clusters": [ /* This cluster's subclusters (recursive objects of the same structure).
                           May be undefined if this cluster holds documents only. */ ]
        },
        /* ...more clusters */
      ],
      "info": {
        /* Additional information about the clustering:
           execution times, the algorithm used, etc. */
      }
    }
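Since clusters hold document ID references rather than the documents themselves, a client usually needs to look the references up in the returned hits. The sketch below assumes that the identifiers in a cluster's documents array correspond to the `_id` values of entries in `hits.hits`; that correspondence is an assumption here, and `resolveClusterDocuments` is a hypothetical helper. References with no matching hit (possible when the hits window is trimmed with max_hits) are reported separately.

```javascript
// Hypothetical helper: resolve a cluster's document references against the
// hits returned in the same response.
function resolveClusterDocuments(response, cluster) {
  const hits = (response.hits && response.hits.hits) || [];
  const byId = Object.create(null);
  hits.forEach(function (hit) {
    byId[hit._id] = hit;
  });

  const found = [];
  const missing = [];
  (cluster.documents || []).forEach(function (id) {
    if (id in byId) {
      found.push(byId[id]);
    } else {
      // The referenced document is not part of the returned hits window.
      missing.push(id);
    }
  });
  return { found: found, missing: missing };
}
```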
Given the following function that recursively dumps clusters:
    window.dumpClusters = function(arr, clusters, indent) {
      indent = indent ? indent : "";
      clusters.forEach(function(cluster) {
        arr.push(
          indent + cluster.label +
          (cluster.documents ? " [" + cluster.documents.length + " documents]" : "") +
          (cluster.clusters ? " [" + cluster.clusters.length + " subclusters]" : ""));
        if (cluster.clusters) {
          dumpClusters(arr, cluster.clusters, indent + " ");
        }
      });
      return arr;
    }

We can dump all cluster labels of a clustering request with the following snippet of JavaScript:

    var request = {
      "search_request": {
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "max_hits": 0,
      "query_hint": "data mining",
      "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"]
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#cluster-list-result").text(
        dumpClusters([], response.clusters).join("\n"));
    });
The output will vary depending on the choice of clustering algorithm (and the particular documents that made it to the hit list if search is not deterministic). The following example shows a pseudo-clustering algorithm that uses the logical url field to produce clusters based on the components of each document's domain. We don't need every search hit here so we will omit them in the response.

    var request = {
      "search_request": {
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "max_hits": 0,
      "query_hint": "data mining",
      "field_mapping": {
        "url": ["_source.url"]
      },
      "algorithm": "byurl"
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#cluster-list-result2").text(
        dumpClusters([], response.clusters).join("\n"));
    });
A full response for a clustering request can look as shown below (note the difference in field mapping in this example).
    var request = {
      "search_request": {
        "fields": [ "title", "content" ],
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "query_hint": "data mining",
      "field_mapping": {
        "title": ["fields.title"],
        "content": ["fields.content"]
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#simple-request-result").text(
        JSON.stringify(response, null, " "));
    });
The field mapping section provides a connection between actual data and logical data to cluster on. The different field mapping sources (_source.*, highlight.* and fields.*) can be used to tune the amount of data returned in the response and the amount of text passed to the clustering engine (and, as a result, the required processing cost).
The _source.* mapping takes data directly from the source document, if _source is available as part of the search hit. The content pointed to by this mapping is not returned as part of the response; it is only used internally for clustering.
Warning!
The _source may not be published by ElasticSearch's internal search infrastructure. In particular, when only selected fields are filtered, the source will not be available. This issue should be addressed in the future (ES API constraint).
The fields.* mapping must be accompanied by an appropriate fields declaration in the search request. The content of those fields is returned with the response and thus can be used for display purposes (for example to show each document's title).
The highlight.* mapping must also be accompanied by an appropriate highlight declaration in the search request. The highlighting request specification can be used to tune the amount of content passed to the clustering engine (the number of fragments, their width, boundary, etc.). This is of particular importance when the documents are long (full content is stored): it is typical that clustering algorithms run perceptually "better" when focused on the context surrounding the query, rather than when presented with the full content of all documents. Any highlighted content will also be returned as part of the response.
Compare the output for the following requests and note the differences outlined above.
    var request = {
      "search_request": {
        "fields": ["url", "title", "content"],
        "query": {"match" : { "_all": "computer" }},
        "size": 100
      },
      "query_hint": "computer",
      "field_mapping": {
        "url": ["fields.url"],
        "title": ["fields.title"],
        "content": ["fields.content"]
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#fields-request").text(JSON.stringify(response, null, " "));
    });

    var request = {
      "search_request": {
        "fields": ["url", "title"],
        "query": {"match" : { "_all": "computer" }},
        "size": 100,
        "highlight" : {
          "pre_tags" : ["", ""],
          "post_tags" : ["", ""],
          "fields" : {
            "content" : { "fragment_size" : 100, "number_of_fragments" : 2 }
          }
        }
      },
      "query_hint": "computer",
      "field_mapping": {
        "url": ["fields.url"],
        "title": ["fields.title"],
        "content": ["highlight.content"]
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#highlight-request").text(JSON.stringify(response, null, " "));
    });
The clustering plugin comes with several open-source algorithms from the Carrot2 project and has built-in support for the commercial Lingo3G clustering algorithm.
The question of which algorithm to choose depends on the amount of traffic (STC is faster than Lingo, but arguably produces less intuitive clusters; Lingo3G is the fastest algorithm but is not free or open source), the expected result (Lingo3G provides hierarchical clusters, Lingo and STC provide flat clusters), and the input data (each algorithm will cluster the input slightly differently). There is no single answer as to which algorithm is "the best".
Compare the clusters dumped for the following search requests, which are identical except for the algorithm.
    var request = {
      "search_request": {
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "query_hint": "data mining",
      "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"]
      },
      "algorithm": "lingo"
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#request-algorithm1").text(dumpClusters([], response.clusters).join("\n"));
    });

    var request = {
      "search_request": {
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "query_hint": "data mining",
      "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"]
      },
      "algorithm": "stc"
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#request-algorithm2").text(dumpClusters([], response.clusters).join("\n"));
    });
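When comparing more than two algorithms, the request bodies can be derived from one base request instead of being written out by hand. A minimal sketch, with `requestsPerAlgorithm` as a hypothetical helper name:

```javascript
// Hypothetical helper: derive one clustering request per algorithm from a
// shared base request. The base request itself is left unmodified.
function requestsPerAlgorithm(baseRequest, algorithms) {
  return algorithms.map(function (name) {
    // A shallow copy is sufficient because only the top-level "algorithm"
    // property differs between the derived requests.
    return Object.assign({}, baseRequest, { algorithm: name });
  });
}
```

Each derived request can then be posted to the clustering endpoint in turn, and the resulting cluster labels compared side by side.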
The default algorithm suite contains empty stubs for all initialization attributes of every included algorithm. These files follow the naming convention {algorithm-name}-attributes.xml and are resolved relative to the current value of the resources configuration setting (see plugin configuration).
For example, to override the default attributes for all requests to the lingo algorithm, one would tweak {plugin.zip}/config/algorithms/lingo-attributes.xml and place any overridden attributes in there, as in:
    <attribute-sets default="overridden-attributes">
      <attribute-set id="overridden-attributes">
        <value-set>
          <label>overridden-attributes</label>
          <attribute key="LingoClusteringAlgorithm.desiredClusterCountBase">
            <value type="java.lang.Integer" value="5"/>
          </attribute>
        </value-set>
      </attribute-set>
    </attribute-sets>
It is perhaps most convenient to export the configuration XMLs directly from the Carrot2 Workbench.
Every clustering algorithm comes with tons of attributes that modify its behavior (the Carrot2 Workbench can be used for tuning these). If desired, certain attributes can be modified per-request. The following example does so by setting the number of desired clusters randomly (execute the example a few times to see the difference).
    var request = {
      "search_request": {
        "query": {"match" : { "_all": "data mining" }},
        "size": 100
      },
      "query_hint": "data mining",
      "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"]
      },
      "algorithm": "lingo",
      "attributes": {
        "LingoClusteringAlgorithm.desiredClusterCountBase": Math.round(5 + Math.random() * 5)
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#request-attributes").text(dumpClusters([], response.clusters).join("\n"));
    });
The field mapping specification can include a language element, which defines the ISO 639-1 code of the language in which the title and content of a document are written. This information can be stored in the index based on a priori knowledge of the documents' source or a language detection filter applied at indexing time.
The algorithms inside the Carrot2 framework will accept ISO codes of languages defined in the LanguageCode enum.
The language hint makes it easier for clustering algorithms to separate documents from different languages on input and to pick the right language resources for clustering. If you do have multi-lingual query results (or query results in a language other than English), it is strongly advised to map the language field appropriately.
The following example applies a clustering algorithm to all documents. Some documents are in German (and have a de language code), some are in English (and have an en language code). We additionally set the language aggregation strategy to FLATTEN_NONE so that top-level groups indicate the language of documents contained in sub-groups. Note the top-level group names in the output from the code sample below.
    var request = {
      "search_request": {
        "query": {"match_all" : {}},
        "size": 100
      },
      "query_hint": "bundestag",
      "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"],
        "language": ["_source.lang"]
      },
      "attributes": {
        "MultilingualClustering.languageAggregationStrategy": "FLATTEN_NONE"
      }
    };

    $.post("/test/test/_search_with_clusters", JSON.stringify(request), function(response) {
      $("#language-fieldmapping").text(dumpClusters([], response.clusters).join("\n"));
    });
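To decide whether mapping the language field is worthwhile, it can help to check how language codes are distributed across a set of hits. The sketch below assumes each hit carries its language code in a `_source` field (named `lang` in the request above); `countLanguages` is a hypothetical helper and runs entirely on the client.

```javascript
// Hypothetical helper: count hits per language code stored in a given
// _source field. Hits without a usable code are counted under "(none)".
function countLanguages(hits, fieldName) {
  const counts = Object.create(null);
  hits.forEach(function (hit) {
    const source = hit._source || {};
    const raw = source[fieldName];
    const code = typeof raw === "string" && raw.length > 0
      ? raw.toLowerCase()
      : "(none)";
    counts[code] = (counts[code] || 0) + 1;
  });
  return counts;
}
```

More than one key in the result, or a single key other than `en`, suggests that mapping the language field is likely to improve the clusters.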
The plugin comes with sensible defaults out of the box and should require no additional configuration. Customize only if really necessary.
The following configuration properties can be tweaked at the global ES configuration level.
{es.home}/config/elasticsearch.yml, {es.home}/config/elasticsearch.json, {es.home}/config/elasticsearch.properties
The main ES configuration file can be used to enable or disable the plugin and to tweak the resources assigned to clustering requests.
carrot2.enabled
false disables the plugin, even if it is installed.
threadpool.search.*
The following configuration files and properties can be found inside the plugin's ZIP file (config/ folder) or under {es.home}/config/elasticsearch-carrot2/ (after plugin installation).
carrot2.yml, carrot2.json, carrot2.properties
The master configuration file for the plugin.
suite
Path to the algorithm suite XML. The resource is looked up relative to the configuration folder. The algorithm suite XML is in Carrot2 format and it contains the defaults for all open-source algorithms and Lingo3G.
resources
Resource lookup path for loading Carrot2 lexical resources, Lingo3G's lexical resources and algorithm descriptor files (including any initialization-time attributes).
Any resources not present in this location will be loaded from classpath (defaults).
controller.pool-size
Size of the internal pool of algorithm instances. This pool is sized automatically depending on the configuration of the search threadpool in ElasticSearch. If too many resources are consumed, the pool can be set to a fixed size using this option.
For Lingo3G, the license needs to be installed at any of the following locations.
{es.home}/config/license.xml, {es.home}/config/.license.xml, {plugin-zip}/config/license.xml, {plugin-zip}/config/.license.xml
Note that if the license is installed inside the plugin, it should be copied to the ZIP file before the installation. Once installed, the plugin's configuration is copied to {es.home}/config/plugin-name/ and can be tweaked there.