How to preform Bucket Aggregations in ElasticSearch on fields where the data can be Missing

Home
Blog
How to preform Bucket Aggregations in ElasticSearch on fields where the data can be Missing

21.11.2022

1.0 Introduction
2.0 Problem
3.0 Solution
4.0 Aggregation
4.1 Grouping Experiences by their userId (Terms Aggregation)
4.2 Calculating the Experience
- 4.2.1 Sum Aggregation
- 4.2.2 Bucket Aggregation
- 4.2.3 Bucket Filtering
- 5.0 Runtime Fields
- 6.0 References
- We will create an aggregation that will sum every start date and every end date, after that we will subtract the end date with the start date for every experience that matches a given user and then we will filter the users by their aggregated experience. Also in the code example above you can see that in our experience property the end date is null, so we need to tell elastic in that case to use todays date (This part will come at the end of the article).
When you have nested properties like in the example above and you want to preform complex queries and aggregations, do yourself a favour and alongside the main index where you save the main data create a new index where you will save the data that is needed to preform these complex operations.

In this way you avoid getting the problems of how to access the data in the nested properties.

Here is a document from the experience index that I created alongside the main index, in here I save only the necessary data that is needed to preform the aggregations:
```
{
  {
    "id": 14214,
    "userId": 135151,
    "startDate": "2022-07-12T18:20:17.439Z",
    "endDate": null,
    "isEmployed": true
  }
}
```
0.2 - Document inside the Experience Index
Aggregation
In this part will deal with the aggregation and each step and aggregation property will be divided into its own code segment where we will analyze it deeper to understand what is going on and how you can tailor it to your needs.

Grouping Experiences by their userId
```
{
  "aggregations": {
    "group_by": {
      "terms": {
        "field": "userId"
      }
    }
  }
}
```
0.3 - Grouping the experience documents by their userId
As you can see in the code example above we declare the aggregation property and then define the terms by which we want to group_by, in our case we group our documents by the field userId which is stored in our index.
```
{
  "aggregations": {
    "group_by": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": 101,
          "doc_count": 1
        }
      ]
    }
  }
}
  
```
0.4 - Documents grouped by their id's and seperated into buckets
We now have the result of our first aggregation, the most important part here is to see the buckets property, which is an array of bucket objects that contain the key which is our userId and the doc_count property which is the number of documents that contain the same userId.

Calculating the Experience
This section will consist of two parts, the first part will describe the sum aggregation of the start and end date fields and the second part that will take those fields subtract them and get the number of months for a given user.

Sum Aggregation
```
{
  "aggregations": {
    "start": {
      "sum": {
        "field": "startDate"
      }
    },
    "end": {
      "sum": {
        "field": "endDate"
      }
    }
  }
}
```
0.5 - Suming up all startDates and endDates
As we saw in the previous example, we start the aggregation by declaring the property aggregation and now as we are doing the sum aggregation we want to declare a property that will be displayed in our bucket when the aggregated is completed.

The sum of our startDate fields will have the property name of start and for the sum of our endDate fields will have the property name of end.

Note that this aggregation will be nested under the first aggregation group_by property that we made in the example above, you can also follow with the full query which is at the end of this article.
```
{
  "buckets": [
    {
      "key": 101,
      "doc_count": 1,
      "start": {
        "value": 1.6451424e12,
        "value_as_string": "2022-02-18T00:00:00.000Z"
      },
      "end": {
        "value": 1.6660512e12,
        "value_as_string": "2022-10-18T00:00:00.000Z"
      }
    }
  ]
}
  
```
0.6 - Sum Aggregation result
Here is the result of our aggregation and now our bucket contains two new properties which are start&end which we defined in the code example 0.5 and they also contain properties for our dates. The value property will be particularly usefull in the next part...

Bucket Aggregation
```
{
  "duration": {
    "bucket_script": {
      "buckets_path": {
        "start": "start.value",
        "end": "end.value"
      },
      "script": {
        "params": {
          "month_in_milliseconds": 2628000000
        },
        "source": "Math.round((params.end - params.start) / params.month_in_milliseconds)"
      }
    }
  }
}
```
0.7 - Bucket Aggregation
Now that we summed up the start and the end dates we need to subtract the end date with the start date to get the number of months that the user has been employeed. The result of that operation will be in milliseconds, to convert the milliseconds into actual integers we will apply the Math.round() method in Java.

This process sounds more complicated then it is, but don't worry will take it step by step:
- - Fist step is to create a bucket script and name it accordingly, in our case we are trying to find the total employment duration for a given user, so will name our bucket duration.
- - Then on the second step will designate our field duration as a bucket_script which will tell Elastic to view it as a bucket aggregation.
- - Inside the bucket script we have the bucket_path property which enables us to define the properties that we want to use in our bucket aggregation but those properties need to be present in the bucket. To make sure that we are using the properties that are in our buckets will take a look at the code snippet 0.6 and there we see the start and end properties, which both have the value property that we will use in the next part of this aggregation. Inside the buckets_path property we define the start and end properties which will point at the sum aggregation bucket properties start.value and end.value.
- - Then we define the script property, this is the main part because all the magic happens here, in here we define the params that can contain anything we want and it is the preffered way to define static variables which will be used in the source property to preform operations.
- - And at the very end the source property. The duration filed in the bucket will contain the value that is calculated in this property in our case as described previously we subtract the end property with the start property and then divide it with the variable month_in_milliseconds then the result will be rounded using the Java Math.round() method.
- Note this aggregation will also be nested inside the second aggregation where we summed the start and end date fields (code snippet 0.5)
```
{
  "buckets": [
    {
      "key": 101,
      "doc_count": 1,
      "start": {
        "value": 1.6451424e12,
        "value_as_string": "2022-02-18T00:00:00.000Z"
      },
      "end": {
        "value": 1.6660512e12,
        "value_as_string": "2022-10-18T00:00:00.000Z"
      },
      "duration": {
        "value": 8.0
      }
    }
  ]
}
  
```
0.8 - Result of the bucket aggregation described in the code snippet 0.7
And as we can see the duration property contains the number 8 which is the difference between the end and start values, so our user has 8 months of experience which is correct.

To make sure that the aggregation result is OK you can look at the string representation in the value_as_string property for both start and end and calculate the difference between them by hand.

Bucket Filtering
At the end of this process we want to filter our duration bucket to return only those users who are matching the condition that is passed. This process is similar to the one above but here we just filter the bucket that has been aggregated and will do it with the following syntax which I will be explaining step by step.
```
{
  "duration_bucket_filter": {
    "bucket_selector": {
      "buckets_path": {
        "durationBucket": "duration"
      },
      "script": {
        "params": {
          "min_number_of_months": 8,
          "max_number_of_months": 8
        },
        "source": "params.durationBucket >= params.min_number_of_months && params.durationBucket <= params.max_number_of_months"
      }
    }
  }
}
  
```
0.9 - Filtering the duration bucket using the duration bucket filter
As you are looking at the code example above in the code snippet 0.9 you can see a pattern that emerges when it comes to bucket aggregations. We first define the name, in this case duration_bucket_filter then we define the type of a bucket aggregation, in the example 0.7 we had a bucket_script which is used for aggregating data, now instead of a bucket script we define bucket_selector that tells Elastic that we want to select a bucket.
To select a property inside a bucket as in code snippet 0.7 we add the buckets_path property where we define the durationBucket as an alias for the duration property that is aggregated in the previous step (code snippet 0.8).
Now that the property is defined we want to filter our buckets. By adding the scripts property we define the params which we want to use inside our source. Here we defined min and max number of months that a user needs to have so that his bucket gets returned after filtering.

And at the end of this code snippet is the source which contains the logic for filtering the buckets:
- params.durationBucket >= params.min_number_of_months - the duration of a given experience needs to be greater than the minimal number of months
- params.durationBucket <= params.max_number_of_months - that experience needs to be less or equal to the maximal number of months
Based on these conditions the buckets are filtered and returned:
```
{
  "buckets": [
    {
      "key": 101,
      "doc_count": 1,
      "start": {
        "value": 1.6451424e12,
        "value_as_string": "2022-02-18T00:00:00.000Z"
      },
      "end": {
        "value": 1.6660512e12,
        "value_as_string": "2022-10-18T00:00:00.000Z"
      },
      "duration": {
        "value": 8.0
      }
    }
  ]
}
```
1.0 - Resoult of bucket filtering
As you see we set the min and max values to match the user with the key 101 to get only the user in the result of our bucket filtering.

Runtime Fields
```
"runtime_mappings": {
  "updatedEndDate": {
    "type": "date",
    "script": {
      "source": "if (doc['isEmployed'].value.equals(true)) { emit(new Date().getTime()) } else { emit(doc['endDate'].value.millis) }"
    }
  }
}
```
1.1 - Populating the updatedEndDate field
Now we almost did it but there is one more thing, what if the user is still employed? How do we calculate the difference between start and end date if the end date is null. The answer is very simple runtime fields which are supported from the version code 7.11 so make sure you have the proper Elastic version.

The runtime fields enable us to populate fields at runtime. That means the value is generated when it is needed in our case when someone is filtering for the users that have some years of experience and are still employed the endDate field insteadd of null will be populated with the date at the moment of querying for the data.

To implement the runtime fields in our query will add the the runtime_mappings property at the top of the query. Then we will add a type to it which is date in our case and then we define the script property following with the source where the runtime_mappings preform their magic, will go over the source property now so you can understand it and tailor it to your needs:
- First we check if the user is employed if (doc['isEmployed'].value.equals(true)) through the field isEmployed which is contained in every document of our index
  - If he is employed that means that endDate field is null and that we need to populate it with the current date. To do this we emit the current date value into the new field called updatedEndDate with the part of the code displayed here { emit(new Date().getTime()) }
- If he isn't employed that means that the endDate field is actually populated and we need to emit it in the new field updatedEndDate
  - This process will be done with the following code { emit(doc['endDate'].value.millis) }, we emit the date in milliseconds.
Now we have assembled our query and delivered a solution to the problem that is described at the beginning of the article. You can checkout the full query down below in the code snippet 1.2
```
"runtime_mappings": {
  "updatedEndDate": {
    "type": "date",
    "script": {
      "source": "if (doc['isEmployed'].value.equals(true)) { emit(new Date().getTime()) } else { emit(doc['endDate'].value.millis) }"
    }
  }
},
"aggregations": {
  "group_by": {
    "terms": {
      "field": "userId"
    },
    "aggregations": {
      "start": {
        "sum": {
          "field": "startDate"
        }
      },
      "end": {
        "sum": {
          "field": "updatedEndDate"
        }
      },
      "duration": {
        "bucket_script": {
          "buckets_path": {
            "start": "start.value",
            "end": "end.value"
          },
          "script": {
            "params": {
              "month_in_milliseconds": 2628000000
            }
            "source": "Math.round((params.end - params.start) / params.month_in_milliseconds)"
          }
        }
      },
      "duration_bucket_filter": {
        "bucket_selector": {
          "buckets_path": {
            "durationBucket": "duration"
          },
          "script": {
            "params": {
              "min_number_of_months": 8,
              "max_number_of_months": 8
            },
            "source": "params.durationBucket >= params.min_number_of_months && params.durationBucket <= params.max_number_of_months"
          }
        }
      }
    }
  }
}
  
```
1.2 - Full Code snippet
References

Turn your Terminal into a Powerful JavaScript Editor. (Node.js REPL)

Verified member of the Pangea community

Latest Posts

11.09.2024

Working with JSON Objects in SQL Server
18.07.2024

Managing Resources with Using in C#
30.04.2024

Node.js PM2 Orchestration Explained
13.03.2024

How to write better documentation
25.01.2024

UX vs UI Design
23.01.2024

Meet Neo4j: Step-by-Step Guide to Graph Database
22.01.2024

CLEAN Architecture with ASP.NET Identity, IdentityServer and MediatR
04.01.2024

React Advanced Patterns
20.12.2023

Truthy and Falsy in JavaScript & How to Toggle Between?
27.10.2023

10 Angular Dos & Don'ts

Table of Contents

Introduction

Problem

Solution

Aggregation

Grouping Experiences by their userId

Calculating the Experience

Sum Aggregation

Bucket Aggregation

Bucket Filtering

Runtime Fields

References

Latest Posts