Wednesday, November 4, 2015

Elasticsearch and stemming

The main use of Elasticsearch is storing logs, and lots of them. Searching through the data using the Kibana frontend is great, and one usually finds what one is looking for.

But let's have a bit of fun with stemming in Elasticsearch.

Elasticsearch provides good support for stemming via numerous token filters, but let's focus on the Hunspell stemmer this time.

Elasticsearch ships with the stemmer already (v1.7.3), but not with the dictionaries. So the first step is to get the dictionaries and install them on every node; I won't go into details here. Bounce the cluster (yep, all nodes) and make sure the dictionaries load in nicely.
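For reference, a minimal sketch of where the dictionaries go. Elasticsearch looks for Hunspell dictionaries under its config directory, one sub-folder per locale, holding the .aff/.dic pair; the ES_HOME path below is an assumption, adjust it for your install.

```shell
# Assumed install root; adjust for your setup.
ES_HOME=/usr/share/elasticsearch

# Expected layout:
#   $ES_HOME/config/hunspell/en_US/en_US.aff
#   $ES_HOME/config/hunspell/en_US/en_US.dic
mkdir -p "$ES_HOME/config/hunspell/en_US"
cp en_US.aff en_US.dic "$ES_HOME/config/hunspell/en_US/"
```

Repeat on every node before bouncing the cluster.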

By default, newly created indices do not use stemming, so one has to configure it when creating the index.

PUT /skjetlein

{
  "settings": {
    "analysis": {
      "filter": {
        "en_US": {
          "type":     "hunspell",
          "language": "en_US" 
        }
      },
      "analyzer": {
        "en_US": {
          "tokenizer":  "standard",
          "filter":   [ "lowercase", "en_US" ]
        }
      }
    }
  }
}

If the dictionaries are missing from one or more nodes, you will get a failure response.

Otherwise:
{
  "acknowledged": true
}

Verify the settings:

GET /skjetlein/_settings

{
  "skjetlein": {
    "settings": {
      "index": {
        "uuid": "ny3n0uJMRKywpvy6OCRmLw",
        "number_of_replicas": "1",
        "analysis": {
          "filter": {
            "en_US": {
              "type": "hunspell",
              "language": "en_US"
            }
          },
          "analyzer": {
            "en_US": {
              "filter": [
                "lowercase",
                "en_US"
              ],
              "tokenizer": "standard"
...

Let's test the stemming:

GET /skjetlein/_analyze?analyzer=en_US -d "driving painting traveling"

The output should be something like this:
{
  "tokens": [
    {
      "token": "drive",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "painting",
      "start_offset": 8,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "paint",
      "start_offset": 8,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "traveling",
      "start_offset": 17,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "travel",
      "start_offset": 17,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}

That is, the stems are drive, paint and travel. Looks good.
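Note that defining the analyzer alone does not stem any documents; a field mapping has to reference it. A minimal sketch for the 1.x mapping API, assuming a type named logs and a hypothetical message field:

```json
PUT /skjetlein/_mapping/logs
{
  "logs": {
    "properties": {
      "message": {
        "type": "string",
        "analyzer": "en_US"
      }
    }
  }
}
```

After this, anything indexed into the message field is run through the en_US analyzer, and full-text queries against it are stemmed the same way.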

So what can the use case be in the context of Elasticsearch, which usually stores vast amounts of logs (events)? Well, let's say I search through the logs for filesystem problems. Elasticsearch as-is would require search strings that include every possible word related to filesystems, since logs (in this context, e.g. syslog) do not present the information in a way that lets the search be expressed consistently.

E.g. filesystem-related terms can be mapped to 'filesystem' with a stemmer_override filter:

"custom_stem": {
  "type": "stemmer_override",
  "rules": [
    "ext2fs=>filesystem",
    "nfs=>filesystem",
    "btrfs=>filesystem"
...

or:

    "postfix=>mail",
    "smtp=>mail",
    "qmail=>mail"
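A sketch of how such an override filter could be wired into the earlier index settings (the index name is just an example). The stemmer_override filter has to run before the hunspell filter, since it marks the rewritten tokens as keywords so later stemmers leave them alone:

```json
PUT /skjetlein-logs
{
  "settings": {
    "analysis": {
      "filter": {
        "custom_stem": {
          "type": "stemmer_override",
          "rules": [
            "ext2fs=>filesystem",
            "nfs=>filesystem",
            "btrfs=>filesystem"
          ]
        },
        "en_US": {
          "type": "hunspell",
          "language": "en_US"
        }
      },
      "analyzer": {
        "en_US": {
          "tokenizer": "standard",
          "filter": [ "lowercase", "custom_stem", "en_US" ]
        }
      }
    }
  }
}
```

With this in place, a search for "filesystems" (stemmed to "filesystem") would also match log lines mentioning ext2fs, nfs or btrfs.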


