Aggregation¶
Aggregations can be created on the Search object or inside an existing Aggregation.
from elastipy import Search
s = Search()
agg = s.agg_terms("name_of_agg", field="field", size=100)
supported aggregations¶
bucket
metric
pipeline
value access¶
- Aggregation.keys(key_separator: Optional[str] = None, tuple_key: bool = False)
Iterates through all keys of this aggregation.
For example, a top-level terms aggregation would return all bucketed field values.
For a nested bucket aggregation each key is a tuple of all parent keys as well.
- Parameters
key_separator – str
Optional separator to concat multiple keys into one string.
tuple_key – bool
If True, the key is always a tuple. If False, the key is a string if there is only one key.
- Returns
generator
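A minimal sketch of iterating the keys of an executed terms aggregation (the index and field names here are hypothetical):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical keyword field
s.execute()
for key in agg.keys():
    print(key)                              # e.g. "red", "green", "blue"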
- Aggregation.values(default=None)
Iterates through all values of this aggregation.
- Parameters
default – If not None, any None value will be replaced by this.
- Returns
generator
- Aggregation.items(key_separator: Optional[str] = None, tuple_key: bool = False, default=None) → Iterable[Tuple]
Iterates through all key, value tuples.
- Parameters
key_separator – str
Optional separator to concat multiple keys into one string.
tuple_key – bool
If True, the key is always a tuple. If False, the key is a string if there is only one key.
default – If not None, any None value will be replaced by this.
- Returns
generator
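For a bucket aggregation like terms, the value of each key is the bucket’s doc_count, so a sketch like this (same hypothetical index and field as above) prints term counts:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical field
s.execute()
for key, value in agg.items():
    print(key, value)                       # e.g. red 23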
- Aggregation.rows(header: bool = True, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, default=None) → Iterable[list]
Iterates through all result values from this aggregation branch.
Each row is a list. The first row contains the column names if header is True.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
- Parameters
header – bool
If True, the first row contains the names of the columns.
include – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
default – This value will be used wherever a value is undefined.
- Returns
generator of list
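A sketch with a nested bucket aggregation (hypothetical index and fields; the exact column names derive from the aggregation names):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical fields
agg = agg.agg_terms("shape", field="shape")
s.execute()
for row in agg.rows():
    print(row)                              # a header list first, then one list per bucket combination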
- Aggregation.dict_rows(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False) → Iterable[dict]
Iterates through all result values from this aggregation branch.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics and pipelines).
- Parameters
include – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
- Returns
generator of dict
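The same kind of setup as above, but each generated row is a mapping of column name to value:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical field
s.execute()
for row in agg.dict_rows():
    print(row)                              # one dict per bucket, keyed by column name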
- Aggregation.to_dict(key_separator=None, default=None) → dict
Create a dictionary from all key/value pairs.
- Parameters
key_separator – str, optional separator to concat multiple keys into one string
default – If not None, any None value will be replaced by this.
- Returns
dict
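A sketch (hypothetical index and field):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical field
s.execute()
print(agg.to_dict())                        # e.g. {"red": 23, "green": 42, "blue": 4}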
- Aggregation.to_pandas(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)
Converts the results of dict_rows() to a pandas DataFrame.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym: df
- Parameters
index – bool or str
Sets a specific column as the index of the DataFrame.
If False, no explicit index is set.
If True, the root aggregation’s keys will be the index.
If str, explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use to_index.
to_index – bool or str
Same as index, but the column is removed from the DataFrame.
If False, no explicit index is set.
If True, the root aggregation’s keys will be the index.
If str, explicitly set a certain column as the DataFrame index.
include – str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas DataFrame instance
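A sketch, assuming pandas is installed (index and field hypothetical):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical field
s.execute()
df = agg.to_pandas()                        # keys, doc counts and metrics as columns
df = agg.to_pandas(to_index=True)           # root aggregation keys become the index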
- Aggregation.to_matrix(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None) → Tuple[List[str], List, List]
Generate an N-dimensional matrix from the values of this aggregation.
Each dimension corresponds to one of the parent bucket keys that lead to this aggregation.
The values are gathered through the Aggregation.items method. So the matrix values are either the doc_count of the bucket aggregation or the result of a metric or pipeline aggregation that is inside one of the bucket aggregations.
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
...
names, keys, matrix = a.to_matrix()
names == ["color", "shape"]
keys == [["red", "green", "blue"], ["circle", "triangle"]]
matrix == [[23, 42], [84, 69], [4, 10]]
- Parameters
sort – Can sort one or several keys/axes.
True sorts all keys ascending.
"-" sorts all keys descending.
The name of an aggregation sorts its keys ascending. A "-" prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys.
For example, agg.to_matrix(sort=("color", "-shape", -4)) would sort the color keys ascending, the shape keys descending and the 4th aggregation - whatever that is - descending.
default – If not None, any None value will be replaced by this value.
include – str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.
exclude – str | seq[str]
One or more wildcard patterns that exclude matching keys.
- Returns
A tuple of names, keys and matrix data, each as list.
The names are the names of each aggregation that generates keys.
The keys are a list of lists, each corresponding to all the keys of each parent aggregation.
Data is a list, with other nested lists for each further dimension, containing the values of this aggregation.
Returns three empty lists if no data is available.
aggregation interface¶
The Search class as well as created aggregations themselves support the following interface.
- class elastipy.aggregation.Aggregation(search, name, type, params)[source]¶
Bases: elastipy.aggregation.converter.ConverterMixin, elastipy.aggregation.generated_interface.AggregationInterface
Aggregation definition and response parser.
Do not create instances yourself, use the Search.aggregation() and Aggregation.aggregation() variants.
Once the Search has been executed, the values of the aggregations can be accessed.
- agg(*aggregation_name_type, **params)¶
Alias for aggregation()
- agg_adjacency_matrix(*aggregation_name: Optional[str], filters: Mapping[str, Union[Mapping, QueryInterface]], separator: Optional[str] = None)¶
A bucket aggregation returning a form of adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.
The matrix is said to be symmetric so we only return half of it. To do this we sort the filter name strings and always use the lowest of a pair as the value to the left of the "&" separator.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filters – Mapping[str, Union[Mapping, 'QueryInterface']]
separator – Optional[str]
An alternative separator parameter can be passed in the request if clients wish to use a separator string other than the default of the ampersand.
- Returns
'AggregationInterface'
A new instance is created and returned
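A sketch of the call; the filter expressions are passed as plain query mappings, and all index, filter and field names below are hypothetical:
from elastipy import Search
s = Search(index="interactions")            # hypothetical index
agg = s.agg_adjacency_matrix(
    "groups",
    filters={
        "grpA": {"terms": {"accounts": ["alice", "bob"]}},   # hypothetical filters
        "grpB": {"terms": {"accounts": ["carol", "dave"]}},
    },
)
s.execute()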
- agg_auto_date_histogram(*aggregation_name: Optional[str], field: Optional[str] = None, buckets: int = 10, minimum_interval: Optional[str] = None, time_zone: Optional[str] = None, format: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
A multi-bucket aggregation similar to the Date histogram except instead of providing an interval to use as the width of each bucket, a target number of buckets is provided indicating the number of buckets needed and the interval of the buckets is automatically chosen to best achieve that target. The number of buckets returned will always be less than or equal to this target number.
The buckets field is optional, and will default to 10 buckets if not specified.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
If no field is specified it will default to the ‘timestamp_field’ of the Search class.
buckets – int
The number of buckets that are to be returned.
minimum_interval – Optional[str]
The minimum_interval allows the caller to specify the minimum rounding interval that should be used. This can make the collection process more efficient, as the aggregation will not attempt to round at any interval lower than minimum_interval.
The accepted units for minimum_interval are: year, month, day, hour, minute, second
time_zone – Optional[str]
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.
Warning
When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different sizes than neighbouring buckets. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If the result of the aggregation was daily buckets, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals like e.g. 12h. Here, we will have only an 11h bucket on the morning of 27 March when the DST shift happens.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
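A sketch (hypothetical index; the field is omitted so it falls back to the Search class’s timestamp_field):
from elastipy import Search
s = Search(index="events")                  # hypothetical index
agg = s.agg_auto_date_histogram("timeline", buckets=30)
s.execute()
print(agg.to_dict())                        # at most 30 buckets, interval chosen automatically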
- agg_children(*aggregation_name: Optional[str], type: str)¶
A special single bucket aggregation that selects child documents that have the specified type, as defined in a join field.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
type – str
The child type that should be selected.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_composite(*aggregation_name: Optional[str], sources: Sequence[Mapping], size: int = 10, after: Optional[Union[str, int, float, datetime.datetime]] = None)¶
A multi-bucket aggregation that creates composite buckets from different sources.
Unlike the other multi-bucket aggregations, you can use the composite aggregation to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
The composite buckets are built from the combinations of the values extracted/created for each document and each combination is considered as a composite bucket.
For optimal performance the index sort should be set on the index so that it matches parts or fully the source order in the composite aggregation.
Sub-buckets: Like any multi-bucket aggregations the composite aggregation can hold sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this parent aggregation.
Pipeline aggregations: The composite agg is not currently compatible with pipeline aggregations, nor does it make sense in most cases. E.g. due to the paging nature of composite aggs, a single logical partition (one day for example) might be spread over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets, running something like a derivative on a composite page could lead to inaccurate results as it is only taking into account a “partial” result on that page.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
sources – Sequence[Mapping]
The sources parameter defines the source fields to use when building composite buckets. The order that the sources are defined controls the order that the keys are returned.
The sources parameter can be any of the following types:
Terms
Histogram
Date histogram
GeoTile grid
Note
You must use a unique name when defining sources.
size – int
The size parameter can be set to define how many composite buckets should be returned. Each composite bucket is considered as a single bucket, so setting a size of 10 will return the first 10 composite buckets created from the value sources. The response contains the values for each composite bucket in an array containing the values extracted from each value source.
Pagination: If the number of composite buckets is too high (or unknown) to be returned in a single response it is possible to split the retrieval in multiple requests. Since the composite buckets are flat by nature, the requested size is exactly the number of composite buckets that will be returned in the response (assuming that there are at least size composite buckets to return). If all composite buckets should be retrieved it is preferable to use a small size (100 or 1000 for instance) and then use the after parameter to retrieve the next results.
after – Optional[Union[str, int, float, datetime]]
To get the next set of buckets, resend the same aggregation with the after parameter set to the after_key value returned in the response.
Note
The after_key is usually the key to the last bucket returned in the response, but that isn’t guaranteed. Always use the returned after_key instead of deriving it from the buckets.
In order to optimize the early termination it is advised to set track_total_hits in the request to false. The number of total hits that match the request can be retrieved on the first request and it would be costly to compute this number on every page.
- Returns
'AggregationInterface'
A new instance is created and returned
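A sketch of a two-source composite aggregation; the sources are plain mappings in the usual Elasticsearch format, and the index and field names are hypothetical:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_composite(
    "pairs",
    sources=[
        {"color": {"terms": {"field": "color"}}},            # hypothetical fields
        {"day": {"date_histogram": {"field": "timestamp",
                                    "calendar_interval": "1d"}}},
    ],
    size=100,
)
s.execute()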
- agg_date_histogram(*aggregation_name: Optional[str], field: Optional[str] = None, calendar_interval: Optional[str] = None, fixed_interval: Optional[str] = None, min_doc_count: int = 1, offset: Optional[str] = None, time_zone: Optional[str] = None, format: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
This multi-bucket aggregation is similar to the normal histogram, but it can only be used with date or date range values. Because dates are represented internally in Elasticsearch as long values, it is possible, but not as accurate, to use the normal histogram on dates as well. The main difference in the two APIs is that here the interval can be specified using date/time expressions. Time-based data requires special support because time-based intervals are not always a fixed length.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
If no field is specified it will default to the ‘timestamp_field’ of the Search class.
calendar_interval – Optional[str]
Calendar-aware intervals are configured with the calendar_interval parameter. You can specify calendar intervals using the unit name, such as month, or as a single unit quantity, such as 1M. For example, day and 1d are equivalent. Multiple quantities, such as 2d, are not supported.
fixed_interval – Optional[str]
In contrast to calendar-aware intervals, fixed intervals are a fixed number of SI units and never deviate, regardless of where they fall on the calendar. One second is always composed of 1000ms. This allows fixed intervals to be specified in any multiple of the supported units.
However, it means fixed intervals cannot express other units such as months, since the duration of a month is not a fixed quantity. Attempting to specify a calendar interval like month or quarter will throw an exception.
The accepted units for fixed intervals are:
milliseconds (ms): A single millisecond. This is a very, very small interval.
seconds (s): Defined as 1000 milliseconds each.
minutes (m): Defined as 60 seconds each (60,000 milliseconds). All minutes begin at 00 seconds.
hours (h): Defined as 60 minutes each (3,600,000 milliseconds). All hours begin at 00 minutes and 00 seconds.
days (d): Defined as 24 hours (86,400,000 milliseconds). All days begin at the earliest possible time, which is usually 00:00:00 (midnight).
min_doc_count – int
Minimum documents required for a bucket. Set to 0 to allow creating empty buckets.
offset – Optional[str]
Use the offset parameter to change the start value of each bucket by the specified positive (+) or negative offset (-) duration, such as 1h for an hour, or 1d for a day. See Time units for more possible time duration options.
For example, when using an interval of day, each bucket runs from midnight to midnight. Setting the offset parameter to +6h changes each bucket to run from 6am to 6am.
time_zone – Optional[str]
Elasticsearch stores date-times in Coordinated Universal Time (UTC). By default, all bucketing and rounding is also done in UTC. Use the time_zone parameter to indicate that bucketing should use a different time zone.
For example, if the interval is a calendar day and the time zone is America/New_York then 2020-01-03T01:00:01Z is
converted to 2020-01-02T18:00:01
rounded down to 2020-01-02T00:00:00
then converted back to UTC to produce 2020-01-02T05:00:00Z
finally, when the bucket is turned into a string key it is printed in America/New_York so it’ll display as "2020-01-02T00:00:00"
It looks like:
bucket_key = localToUtc(Math.floor(utcToLocal(value) / interval) * interval)
You can specify time zones as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as an IANA time zone ID, such as America/Los_Angeles.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
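A sketch (hypothetical index; the field falls back to the Search class’s timestamp_field when omitted):
from elastipy import Search
s = Search(index="events")                  # hypothetical index
agg = s.agg_date_histogram("per-day", calendar_interval="1d")
s.execute()
print(agg.to_dict())                        # one doc_count per day bucket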
- agg_date_range(*aggregation_name: Optional[str], ranges: Sequence[Union[Mapping[str, str], str]], field: Optional[str] = None, format: Optional[str] = None, time_zone: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned.
Note
Note that this aggregation includes the from value and excludes the to value for each range.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
ranges – Sequence[Union[Mapping[str, str], str]]
List of ranges to define the buckets. Example:
[
    {"to": "1970-01-01"},
    {"from": "1970-01-01", "to": "1980-01-01"},
    {"from": "1980-01-01"},
]
Instead of date values any Date Math expression can be used as well.
Alternatively this parameter can be a list of strings. The above example can be rewritten as:
["1970-01-01", "1980-01-01"]
Note
This aggregation includes the from value and excludes the to value for each range.
field – Optional[str]
The date field. If no field is specified it will default to the ‘timestamp_field’ of the Search class.
format – Optional[str]
The format of the response bucket keys as available for the DateTimeFormatter
time_zone – Optional[str]
Dates can be converted from another time zone to UTC by specifying the time_zone parameter.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as one of the time zone ids from the TZ database.
The time_zone parameter is also applied to rounding in date math expressions.
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
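A sketch using the string shorthand for ranges (hypothetical index and field):
from elastipy import Search
s = Search(index="people")                  # hypothetical index
agg = s.agg_date_range(
    "decades",
    field="born",                           # hypothetical date field
    ranges=["1970-01-01", "1980-01-01"],    # shorthand for the from/to mappings above
)
s.execute()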
- agg_diversified_sampler(*aggregation_name: Optional[str], field: Optional[str] = None, script: Optional[Mapping] = None, shard_size: int = 100, max_docs_per_value: int = 1)¶
Like the sampler aggregation this is a filtering aggregation used to limit any sub aggregations’ processing to a sample of the top-scoring documents. The diversified_sampler aggregation adds the ability to limit the number of matches that share a common value such as an “author”.
Note
Any good market researcher will tell you that when working with samples of data it is important that the sample represents a healthy variety of opinions rather than being skewed by any single voice. The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer).
Example use cases:
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Removing bias from analytics by ensuring fair representation of content from different sources
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
A choice of field or script setting is used to provide values used for de-duplication and the max_docs_per_value setting controls the maximum number of documents collected on any one shard which share a common value. The default setting for max_docs_per_value is 1.
Note
The aggregation will throw an error if the choice of field or script produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
Cannot be nested under breadth_first aggregations: Being a quality-based filter the diversified_sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.
Limited de-dup logic: The de-duplication logic applies only at a shard level so will not apply across shards.
No specialized syntax for geo/date fields: Currently the syntax for defining the diversifying values is defined by a choice of field or script - there is no added syntactical sugar for expressing geo or date units such as "7d" (7 days). This support may be added in a later release and users will currently have to create these sorts of values using a script.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
The field to search on. Can alternatively be a script
script – Optional[Mapping]
The script that specifies the aggregation. Can alternatively be a ‘field’
shard_size – int
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
max_docs_per_value – int
The max_docs_per_value is an optional parameter and limits how many documents are permitted per choice of de-duplicating value. The default setting is 1.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_filter(*aggregation_name: Optional[str], filter: Union[Mapping, QueryInterface])¶
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filter – Union[Mapping, 'QueryInterface']
- Returns
'AggregationInterface'
A new instance is created and returned
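A sketch, passing the filter as a plain query mapping (index and field names hypothetical):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_filter("in-stock", filter={"term": {"in_stock": True}})
agg = agg.agg_terms("category", field="category")   # sub-aggregation is narrowed to the bucket
s.execute()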
- agg_filters(*aggregation_name: Optional[str], filters: Mapping[str, Union[Mapping, QueryInterface]])¶
Defines a multi bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filters – Mapping[str, Union[Mapping, 'QueryInterface']]
- Returns
'AggregationInterface'
A new instance is created and returned
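A sketch with two named filters as plain query mappings (all names hypothetical):
from elastipy import Search
s = Search(index="logs")                    # hypothetical index
agg = s.agg_filters("severity", filters={
    "errors": {"match": {"body": "error"}},     # hypothetical queries
    "warnings": {"match": {"body": "warning"}},
})
s.execute()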
- agg_geo_distance(*aggregation_name: Optional[str], field: str, ranges: Sequence[Union[Mapping[str, float], float]], origin: Union[str, Mapping[str, float], Sequence[float]], unit: str = 'm', distance_type: str = 'arc', keyed: bool = False)¶
A multi-bucket aggregation that works on geo_point fields and conceptually works very similar to the range aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
ranges – Sequence[Union[Mapping[str, float], float]]
A list of ranges that define the separate buckets, e.g.:
[
    {"to": 100000},
    {"from": 100000, "to": 300000},
    {"from": 300000},
]
Alternatively this parameter can be a list of numbers. The above example can be rewritten as:
[100000, 300000]
origin – Union[str, Mapping[str, float], Sequence[float]]
The origin point can accept all formats supported by the geo_point type:
Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it is the most explicit about the lat & lon values
String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon
Array format: [4.894, 52.3760] - which is based on the GeoJson standard and where the first number is the lon and the second one is the lat
unit – str
By default, the distance unit is m (meters) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
distance_type – str
There are two distance calculation modes: arc (the default), and plane. The arc calculation is the most accurate. The plane is the fastest but least accurate. Consider using plane when your search context is “narrow”, and spans smaller geographical areas (~5km). plane will return higher error margins for searches across very large areas (e.g. cross continent search).
keyed – bool
Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_geohash_grid(*aggregation_name: Optional[str], field: str, precision: Union[int, str] = 5, bounds: Optional[Mapping] = None, size: int = 10000, shard_size: Optional[int] = None)¶
A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid. The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a geohash which is of user-definable precision.
High precision geohashes have a long string length and represent cells that cover only a small area.
Low precision geohashes have a short string length and represent cells that each cover a large area.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point or geo_shape (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
Aggregating on Geo-shape fields works just as it does for points, except that a single shape can be counted for in multiple tiles. A shape will contribute to the count of matching values if any part of its shape intersects with that tile.
precision – Union[int, str]
The required precision of the grid in the range [1, 12]. Higher means more precise.
Alternatively, the precision level can be approximated from a distance measure like "1km", "10m". The precision level is calculated such that cells will not exceed the specified size (diagonal) of the required precision. When this would lead to precision levels higher than the supported 12 levels (e.g. for distances <5.6cm), the value is rejected.
Note
When requesting detailed buckets (typically for displaying a “zoomed in” map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
bounds – Optional[Mapping]
The geohash_grid aggregation supports an optional bounds parameter that restricts the points considered to those that fall within the bounds provided. The bounds parameter accepts the bounding box in all the same accepted formats of the bounds specified in the Geo Bounding Box Query. This bounding box can be used with or without an additional geo_bounding_box query filtering the points prior to aggregating. It is an independent bounding box that can intersect with, be equal to, or be disjoint to any additional geo_bounding_box queries defined in the context of the aggregation.
size – int
The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain.
shard_size – Optional[int]
To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning max(10, (size x number-of-shards)) buckets from each shard. If this heuristic is undesirable, the number considered from each shard can be overridden using this parameter.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_geotile_grid(*aggregation_name: Optional[str], field: str, precision: Union[int, str] = 7, bounds: Optional[Mapping] = None, size: int = 10000, shard_size: Optional[int] = None)¶
A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid. The resulting grid can be sparse and only contains cells that have matching data. Each cell corresponds to a map tile as used by many online map sites. Each cell is labeled using a “{zoom}/{x}/{y}” format, where zoom is equal to the user-specified precision.
High precision keys have a larger range for x and y, and represent tiles that cover only a small area.
Low precision keys have a smaller range for x and y, and represent tiles that each cover a large area.
Warning
The highest-precision geotile of length 29 produces cells that cover less than a 10cm by 10cm of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please first filter the aggregation to a smaller geographic area before requesting high-levels of detail.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
precision – Union[int, str]
The required precision of the grid in the range [1, 29]. Higher means more precise.
Note
When requesting detailed buckets (typically for displaying a “zoomed in” map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
bounds – Optional[Mapping]
The geotile_grid aggregation supports an optional bounds parameter that restricts the points considered to those that fall within the bounds provided. The bounds parameter accepts the bounding box in all the same accepted formats of the bounds specified in the Geo Bounding Box Query. This bounding box can be used with or without an additional geo_bounding_box query filtering the points prior to aggregating. It is an independent bounding box that can intersect with, be equal to, or be disjoint to any additional geo_bounding_box queries defined in the context of the aggregation.
size – int
The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain.
shard_size – Optional[int]
To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning max(10, (size x number-of-shards)) buckets from each shard. If this heuristic is undesirable, the number considered from each shard can be overridden using this parameter.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_global(*aggregation_name: Optional[str])¶
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.
Note
Global aggregators can only be placed as top level aggregators because it doesn’t make sense to embed a global aggregator within another bucket aggregator.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_histogram(*aggregation_name: Optional[str], field: str, interval: int, min_doc_count: int = 0, offset: Optional[int] = None, extended_bounds: Optional[Mapping[str, int]] = None, hard_bounds: Optional[Mapping[str, int]] = None, format: Optional[str] = None, order: Optional[Union[Mapping, str]] = None, keyed: bool = False, missing: Optional[Any] = None)¶
A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document will be evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will “fall” into the bucket that is associated with the key 30. To make this more formal, here is the rounding function that is used:
bucket_key = Math.floor((value - offset) / interval) * interval + offset
For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
The interval must be a positive decimal, while the offset must be a decimal in [0, interval) (a decimal greater than or equal to 0 and less than interval)
Histogram fields: Running a histogram aggregation over histogram fields computes the total number of counts for each interval. See example
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
A numeric field to be indexed by the histogram.
interval – int
A positive decimal defining the interval between buckets.
min_doc_count – int
By default the response will fill gaps in the histogram with empty buckets. It is possible to change that and request buckets with a higher minimum count thanks to the min_doc_count setting.
By default the histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when requesting empty buckets, this causes a confusion, specifically, when the data is also filtered.
To understand why, let’s look at an example:
Let’s say you’re filtering your request to get all docs with values between 0 and 500, in addition you’d like to slice the data per price using a histogram with an interval of 50. You also specify "min_doc_count": 0 as you’d like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than 100, the first bucket you’ll get will be the one with 100 as its key. This is confusing, as many times, you’d also like to get those buckets between 0 - 100.
offset – Optional[int]
By default the bucket keys start with 0 and then continue in even spaced steps of interval, e.g. if the interval is 10, the first three buckets (assuming there is data inside them) will be [0, 10), [10, 20), [20, 30). The bucket boundaries can be shifted by using the offset option.
This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval 10 will result in two buckets with 5 documents each. If an additional offset 5 is used, there will be only one single bucket [5, 15) containing all the 10 documents.
extended_bounds – Optional[Mapping[str, int]]
With the extended_bounds setting, you can “force” the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if the extended_bounds.min is higher than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the same goes for the extended_bounds.max and the last bucket). For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include buckets outside of a query’s range. For example, if your query looks for values greater than 100, and you have a range covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it’s best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the aggregation buckets those documents without regard to how they were selected. See note on bucketing range fields for more information and an example.
hard_bounds – Optional[Mapping[str, int]]
The hard_bounds is a counterpart of extended_bounds and can limit the range of buckets in the histogram. It is particularly useful in the case of open data ranges that can result in a very large number of buckets.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
order – Optional[Union[Mapping, str]]
By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled using the order setting. Supports the same order functionality as the Terms Aggregation.
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
- Returns
'AggregationInterface'
A new instance is created and returned
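A sketch (hypothetical index and numeric field):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_histogram("price-bands", field="price", interval=50)
s.execute()
print(agg.to_dict())                        # e.g. {0: 12, 50: 7, 100: 3}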
- agg_ip_range(*aggregation_name: Optional[str], field: str, ranges: Sequence[Union[Mapping[str, str], str]], keyed: bool = False)¶
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields:
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The IPv4 field
ranges – Sequence[Union[Mapping[str, str], str]]
List of ranges to define the buckets, either as straight IPv4 or as CIDR masks.
Example:
[
    {"to": "10.0.0.5"},
    {"from": "10.0.0.5", "to": "10.0.0.127"},
    {"from": "10.0.0.127"},
]
Alternatively this parameter can be a list of strings. The above example can be rewritten as:
["10.0.0.5", "10.0.0.127"]
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_missing(*aggregation_name: Optional[str], field: str)¶
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The field we wish to investigate for missing values
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_nested(*aggregation_name: Optional[str], path: str)¶
A special single bucket aggregation that enables aggregating nested documents.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
path – str
The field of the nested document(s)
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_range(*aggregation_name: Optional[str], ranges: Sequence[Union[Mapping[str, Any], Any]], field: Optional[str] = None, keyed: bool = False, script: Optional[dict] = None)¶
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and “bucket” the relevant/matching document.
Note
Note that this aggregation includes the from value and excludes the to value for each range.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
ranges – Sequence[Union[Mapping[str, Any], Any]]
List of ranges to define the buckets. Example:
[
    {"to": 10},
    {"from": 10, "to": 20},
    {"from": 20},
]
Alternatively this parameter can be a list of values. The above example can be rewritten as:
[10, 20]
Note
This aggregation includes the from value and excludes the to value for each range.
field – Optional[str]
The field to index by the aggregation
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
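A sketch using the shorthand list of values (hypothetical index and field):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_range("price-bands", field="price", ranges=[10, 20])
# equivalent to ranges=[{"to": 10}, {"from": 10, "to": 20}, {"from": 20}]
s.execute()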
- agg_rare_terms(*aggregation_name: Optional[str], field: str, max_doc_count: int = 1, include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, missing: Optional[Any] = None)¶
A multi-bucket value source based aggregation which finds “rare” terms — terms that are at the long-tail of the distribution and are not frequent. Conceptually, this is like a terms aggregation that is sorted by _count ascending. As noted in the terms aggregation docs, actually ordering a terms agg by count ascending has unbounded error. Instead, you should use the rare_terms aggregation.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The field we wish to find rare terms in
max_doc_count – int
The maximum number of documents a term should appear in.
The max_doc_count parameter is used to control the upper bound of document counts that a term can have. There is not a size limitation on the rare_terms agg like terms agg has. This means that terms which match the max_doc_count criteria will be returned. The aggregation functions in this manner to avoid the order-by-ascending issues that afflict the terms aggregation.
This does, however, mean that a large number of results can be returned if chosen incorrectly. To limit the danger of this setting, the maximum max_doc_count is 100.
include – Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
Partition expressions are also possible.
exclude – Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_sampler(*aggregation_name: Optional[str], shard_size: int = 100)¶
A filtering aggregation used to limit any sub aggregations’ processing to a sample of the top-scoring documents.
Example use cases:
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
shard_size – int
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_significant_terms(*aggregation_name: Optional[str], field: str, size: int = 10, shard_size: Optional[int] = None, min_doc_count: int = 1, shard_min_doc_count: Optional[int] = None, execution_hint: str = 'global_ordinals', include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, script: Optional[dict] = None)¶
An aggregation that returns interesting or unusual occurrences of terms in a set.
Example use cases:
Suggesting “H5N1” when users search for “bird flu” in text
Identifying the merchant that is the “common point of compromise” from the transaction history of credit card owners reporting loss
Suggesting keywords relating to stock symbol $ATI for an automated news classifier
Spotting the fraudulent doctor who is diagnosing more than their fair share of whiplash injuries
Spotting the tire manufacturer who has a disproportionate number of blow-outs
In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term “H5N1” only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
Warning
Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt to load every unique word into RAM. It is recommended to only use this on smaller indices.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
size – int
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).
shard_size – Optional[int]
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).
The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.
min_doc_count – int
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option. Default value is 1.
Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very certain about if the term will actually reach the required min_doc_count. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count – Optional[int]
The parameter shard_min_doc_count regulates the certainty a shard has if the term should actually be added to the candidate list or not with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low frequent terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 per default and has no effect unless you explicitly set it.
Note
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.
Warning
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
execution_hint – str
There are different mechanisms by which terms aggregations can be executed:
by using field values directly in order to aggregate data per-bucket (map)
by using global ordinals of the field and allocating one bucket per global ordinal (global_ordinals)
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
global_ordinals is the default option for keyword fields, it uses global ordinals to allocate buckets dynamically so memory usage is linear to the number of values of the documents that are part of the aggregation scope.
map should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode is significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have ordinals.
include – Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
Partition expressions are also possible.
exclude – Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
-
agg_terms
(*aggregation_name: Optional[str], field: str, size: int = 10, shard_size: Optional[int] = None, show_term_doc_count_error: Optional[bool] = None, order: Optional[Union[Mapping, str]] = None, min_doc_count: int = 1, shard_min_doc_count: Optional[int] = None, include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, missing: Optional[Any] = None, script: Optional[dict] = None)¶ A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
size –
int
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).shard_size –
Optional[int]
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.
show_term_doc_count_error –
Optional[bool]
This shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.
order –
Optional[Union[Mapping, str]]
The order of the buckets can be customized by setting the order parameter. By default, the buckets are ordered by their doc_count descending.Warning
Sorting by ascending _count or by sub aggregation is discouraged as it increases the error on document counts. It is fine when a single shard is queried, or when the field that is being aggregated was used as a routing key at index time: in these cases results will be accurate since shards have disjoint values. However otherwise, errors are unbounded. One particular case that could still be useful is sorting by min or max aggregation: counts will not be accurate but at least the top buckets will be correctly picked.
min_doc_count –
int
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option. The default value is 1. Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging the local terms statistics of all shards. In a way, the decision to add a term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count –
Optional[int]
The parameter shard_min_doc_count regulates the certainty a shard has whether a term should actually be added to the candidate list with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 by default and has no effect unless you explicitly set it.Note
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.
Warning
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
include –
Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the values which will be aggregated. Alternatively, this can be a list of strings.
Partition expressions are also possible.
exclude –
Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the values which will be aggregated. Alternatively, this can be a list of strings.
missing –
Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.script –
Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
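A minimal usage sketch (the field names and the example output are hypothetical):
from elastipy import Search
s = Search()
# top-level terms aggregation over a keyword field
agg = s.agg_terms("brand", field="brand", size=20, min_doc_count=2)
agg = agg.execute()
for key, count in agg.items():
    print(key, count)  # e.g. "acme" 23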
-
aggregation
(*aggregation_name_type, **params) → elastipy.aggregation.aggregation.Aggregation[source]¶ Interface to create sub-aggregations.
This is the generic, undocumented version. Use the agg_*, metric_* and pipeline_* methods for convenience.
- Parameters
aggregation_name_type – one or two strings, meaning either “type” or “name”, “type”
params – all parameters of the aggregation function
- Returns
Aggregation
instance
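For example, the following generic call should be equivalent to using the agg_terms convenience method (field name is hypothetical):
s = Search()
# name first, type second; pass only the type to auto-generate a name
agg = s.aggregation("my_terms", "terms", field="color", size=10)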
-
body_path
() → str[source]¶ Return the dotted path of this aggregation in the request body
- Returns
str
-
property
buckets
¶ Returns the buckets of the aggregation response
Only available for bucket root aggregations!
- Returns
dict or list
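A sketch of accessing the raw buckets after execution (field name hypothetical):
agg = Search().agg_terms("color", field="color").execute()
# raw response buckets, e.g. [{"key": "red", "doc_count": 23}, ...]
print(agg.buckets)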
-
df
(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)¶ Converts the results of
dict_rows()
to a pandas DataFrame.This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym:
to_pandas
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas
DataFrame
instance
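A sketch of converting a terms aggregation with a nested metric to a DataFrame (requires pandas; field names hypothetical):
agg = Search().agg_terms("color", field="color")
agg.metric_avg("avg_size", field="size")
df = agg.execute().df(to_index="color")
print(df)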
-
df_matrix
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None)¶ Returns a pandas DataFrame containing the matrix.
See to_matrix for details.
Only one- and two-dimensional matrices are supported.
- Returns
pandas.DataFrame instance
- Raises
ValueError – If the number of dimensions is 0 or above 2
-
dict_rows
(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False) → Iterable[dict]¶ Iterates through all result values from this aggregation branch.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics and pipelines).
- Parameters
include –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
- Returns
generator of dict
-
property
dump
Access to the
printing
interface- Returns
AggregationDump
instance
-
property
group
¶ Returns the name of the aggregation group.
- Returns
str, either “bucket”, “metric” or “pipeline”
-
items
(key_separator: Optional[str] = None, tuple_key: bool = False, default=None) → Iterable[Tuple]¶ Iterates through all key, value tuples.
- Parameters
key_separator –
str
Optional separator to concat multiple keys into one string.tuple_key –
bool
If True, the key is always a tuple.If False, the key is a string if there is only one key.
default – If not None any None-value will be replaced by this.
- Returns
generator
-
key_name
() → str[source]¶ Return default name of the bucket key field.
Metrics return their parent’s key
- Returns
str
-
keys
(key_separator: Optional[str] = None, tuple_key: bool = False)¶ Iterates through all keys of this aggregation.
For example, a top-level terms aggregation would return all bucketed field values.
For a nested bucket aggregation each key is a tuple of all parent keys as well.
- Parameters
key_separator –
str
Optional separator to concat multiple keys into one stringtuple_key –
bool
If True, the key is always a tuple If False, the key is a string if there is only one key
- Returns
generator
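A sketch of the value access methods on a nested aggregation (field names and output hypothetical):
agg = Search().agg_terms("color", field="color").agg_terms("shape", field="shape")
agg = agg.execute()
list(agg.keys())                    # e.g. [("red", "circle"), ("red", "triangle"), ...]
list(agg.keys(key_separator="|"))   # e.g. ["red|circle", "red|triangle", ...]
list(agg.items())                   # key/value tuples; for bucket aggregations the values are doc_counts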
-
metric
(*aggregation_name_type, **params)¶ Alias for aggregation()
-
metric_avg
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
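A sketch (field names hypothetical). Since return_self defaults to False, the call returns the parent bucket aggregation, so a metric can be attached without breaking the chain:
agg = Search().agg_terms("color", field="color")
agg = agg.metric_avg("avg_price", field="price")  # returns the terms aggregation
for row in agg.execute().dict_rows():
    print(row)  # one dict per bucket, including the metric value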
-
metric_boxplot
(*aggregation_name: Optional[str], field: str, compression: int = 100, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
compression –
int
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_cardinality
(*aggregation_name: Optional[str], field: str, precision_threshold: int = 3000, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
precision_threshold –
int
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_extended_stats
(*aggregation_name: Optional[str], field: str, sigma: float = 3.0, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
sigma –
float
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_geo_bounds
(*aggregation_name: Optional[str], field: str, wrap_longitude: bool = True, return_self: bool = False)¶ A metric aggregation that computes the bounding box containing all geo values for a field.
The Geo Bounds Aggregation is also supported on geo_shape fields.
If wrap_longitude is set to true (the default), the bounding box can overlap the international date line and return a bounds where the top_left longitude is larger than the top_right longitude.
For example, the upper right longitude will typically be greater than the lower left longitude of a geographic bounding box. However, when the area crosses the 180° meridian, the value of the lower left longitude will be greater than the value of the upper right longitude. See Geographic bounding box on the Open Geospatial Consortium website for more information.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
The field defining the geo_point or geo_shapewrap_longitude –
bool
An optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true.return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_geo_centroid
(*aggregation_name: Optional[str], field: str, return_self: bool = False)¶ A metric aggregation that computes the weighted centroid from all coordinate values for geo fields.
The centroid metric for geo-shapes is more nuanced than for points. The centroid of a specific aggregation bucket containing shapes is the centroid of the highest-dimensionality shape type in the bucket. For example, if a bucket contains shapes comprising polygons and lines, then the lines do not contribute to the centroid metric. Each type of shape’s centroid is calculated differently. Envelopes and circles ingested via the Circle ingest processor are treated as polygons.
Warning
Using geo_centroid as a sub-aggregation of
geohash_grid
:The geohash_grid aggregation places documents, not individual geo-points, into buckets. If a document’s geo_point field contains multiple values, the document could be assigned to multiple buckets, even if one or more of its geo-points are outside the bucket boundaries.
If a geo_centroid sub-aggregation is also used, each centroid is calculated using all geo-points in a bucket, including those outside the bucket boundaries. This can result in centroids outside of bucket boundaries.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
The field defining the geo_point or geo_shapereturn_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_matrix_stats
(*aggregation_name: Optional[str], fields: list, mode: str = 'avg', missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.fields –
list
mode –
str
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_max
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_median_absolute_deviation
(*aggregation_name: Optional[str], field: str, compression: int = 1000, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
compression –
int
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_min
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_percentile_ranks
(*aggregation_name: Optional[str], field: str, values: list, keyed: bool = True, hdr__number_of_significant_value_digits: Optional[int] = None, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
values –
list
keyed –
bool
hdr__number_of_significant_value_digits –
Optional[int]
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_percentiles
(*aggregation_name: Optional[str], field: str, percents: list = (1, 5, 25, 50, 75, 95, 99), keyed: bool = True, tdigest__compression: int = 100, hdr__number_of_significant_value_digits: Optional[int] = None, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
percents –
list
keyed –
bool
tdigest__compression –
int
hdr__number_of_significant_value_digits –
Optional[int]
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_rate
(*aggregation_name: Optional[str], unit: str, field: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.unit –
str
field –
Optional[str]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_scripted_metric
(*aggregation_name: Optional[str], map_script: str, combine_script: str, reduce_script: str, init_script: Optional[str] = None, params: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.map_script –
str
combine_script –
str
reduce_script –
str
init_script –
Optional[str]
params –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_stats
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_string_stats
(*aggregation_name: Optional[str], field: str, show_distribution: bool = False, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
show_distribution –
bool
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_sum
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_t_test
(*aggregation_name: Optional[str], a__field: str, b__field: str, type: str, a__filter: Optional[dict] = None, b__filter: Optional[dict] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.a__field –
str
b__field –
str
type –
str
a__filter –
Optional[dict]
b__filter –
Optional[dict]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_top_hits
(*aggregation_name: Optional[str], size: int, sort: Optional[dict] = None, _source: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.size –
int
sort –
Optional[dict]
_source –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_top_metrics
(*aggregation_name: Optional[str], metrics: dict, sort: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.metrics –
dict
sort –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_value_count
(*aggregation_name: Optional[str], field: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg one might be interested in the number of values the average is computed over.
value_count does not de-duplicate values, so even if a field has duplicates (or a script generates multiple identical values for a single document), each value will be counted individually.
Note
Because value_count is designed to work with any field it internally treats all values as simple bytes. Due to this implementation, if the _value script variable is used to fetch a value instead of accessing the field directly (e.g. a “value script”), the field value will be returned as a string instead of its native format.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
Optional[str]
The field whose values should be countedscript –
Optional[dict]
Alternatively counting the values generated by a scriptreturn_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_weighted_avg
(*aggregation_name: Optional[str], value__field: str, weight__field: str, value__missing: Optional[Any] = None, weight__missing: Optional[Any] = None, format: Optional[str] = None, value_type: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents. These values can be extracted from specific numeric fields in the documents.
When calculating a regular average, each datapoint has an equal “weight” … it contributes equally to the final value. Weighted averages, on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the document, or provided by a script.
As a formula, a weighted average is
∑(value * weight) / ∑(weight)
A regular average can be thought of as a weighted average where every value has an implicit weight of 1.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.value__field –
str
The field that values should be extracted fromweight__field –
str
The field that weights should be extracted fromvalue__missing –
Optional[Any]
A value to use if the field is missing entirelyweight__missing –
Optional[Any]
A weight to use if the field is missing entirelyformat –
Optional[str]
value_type –
Optional[str]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
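A sketch (field names hypothetical). The double-underscore parameters presumably map to the nested value and weight objects of the request body:
agg = Search().agg_terms("class", field="class")
agg.metric_weighted_avg(
    "weighted_grade",
    value__field="grade",
    weight__field="weight",
)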
-
metrics
()[source]¶ Iterate through all contained metric aggregations
- Returns
generator of Aggregation
-
pipeline
(*aggregation_name_type, **params)¶ Alias for aggregation()
-
pipeline_avg_bucket
(*aggregation_name: Optional[str], buckets_path: str, gap_policy: str = 'skip', format: Optional[str] = None, return_self: bool = False)¶ A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.buckets_path –
str
The path to the buckets we wish to find the average for.See: bucket path syntax
gap_policy –
str
The policy to apply when gaps are found in the data.See: gap policy
format –
Optional[str]
Format to apply to the output value of this aggregationreturn_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
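A sketch (names hypothetical). As a sibling pipeline it is created next to the multi-bucket aggregation, with buckets_path pointing into it:
s = Search()
agg = s.agg_terms("color", field="color")
agg.metric_sum("total_size", field="size")
# average of the per-bucket "total_size" sums across all "color" buckets
s.pipeline_avg_bucket("avg_total_size", buckets_path="color>total_size")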
-
pipeline_bucket_script
(*aggregation_name: Optional[str], script: str, buckets_path: Mapping[str, str], gap_policy: str = 'skip', format: Optional[str] = None, return_self: bool = False)¶ A parent pipeline aggregation which executes a script which can perform per bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.script –
str
The script to run for this aggregation. The script can be inline, file or indexed. (see Scripting for more details)buckets_path –
Mapping[str, str]
A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax for more details)gap_policy –
str
The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details)format –
Optional[str]
Format to apply to the output value of this aggregationreturn_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
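A sketch (names hypothetical) computing a per-bucket ratio of two metrics:
agg = Search().agg_terms("color", field="color")
agg.metric_sum("total", field="size")
agg.metric_value_count("count", field="size")
# per-bucket script over the two metrics referenced in buckets_path
agg.pipeline_bucket_script(
    "avg_size",
    script="params.t / params.c",
    buckets_path={"t": "total", "c": "count"},
)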
-
pipeline_derivative
(*aggregation_name: Optional[str], buckets_path: str, gap_policy: str = 'skip', format: Optional[str] = None, units: Optional[str] = None, return_self: bool = False)¶ A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.buckets_path –
str
The path to the buckets we wish to find the average for.See: bucket path syntax
gap_policy –
str
The policy to apply when gaps are found in the data.See: gap policy
format –
Optional[str]
Format to apply to the output value of this aggregationunits –
Optional[str]
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response normalized_value which reports the derivative value in the desired x-axis units.return_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
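A sketch, assuming a date_histogram convenience method like agg_date_histogram with a calendar_interval parameter exists (it is not documented in this section):
s = Search()
agg = s.agg_date_histogram("day", calendar_interval="1d")
agg.metric_sum("total", field="size")
# derivative of the per-day sums
agg.pipeline_derivative("change_per_day", buckets_path="total")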
-
pipelines
()[source]¶ Iterate through all contained pipeline aggregations
- Returns
generator of Aggregation
-
property
plot
¶ Access to
pandas plotting interface
.- Returns
PandasPlotWrapper
instance
-
property
response
¶ Returns the response object of the aggregation
Only available for root aggregations!
- Returns
dict
-
rows
(header: bool = True, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, default=None) → Iterable[list]¶ Iterates through all result values from this aggregation branch.
Each row is a list. The first row contains the names if ‘header’ == True.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
- Parameters
header –
bool
If True, the first row contains the names of the columnsinclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
default – This value will be used wherever a value is undefined.
- Returns
generator of list
-
to_dict
(key_separator=None, default=None) → dict¶ Create a dictionary from all key/value pairs.
- Parameters
key_separator – str, optional separator to concat multiple keys into one string
default – If not None any None-value will be replaced by this.
- Returns
dict
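A sketch (field name and output hypothetical):
agg = Search().agg_terms("color", field="color").execute()
agg.to_dict()  # e.g. {"red": 23, "green": 42, "blue": 10}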
-
to_matrix
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None) → Tuple[List[str], List, List]¶ Generate an N-dimensional matrix from the values of this aggregation.
Each dimension corresponds to one of the parent bucket keys that lead to this aggregation.
The values are gathered through the
Aggregation.items
method. So the matrix values are either thedoc_count
of the bucket aggregation or the result of ametric
orpipeline
aggregation that is inside one of the bucket aggregations.
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
...
names, keys, matrix = a.to_matrix()
names == ["color", "shape"]
keys == [["red", "green", "blue"], ["circle", "triangle"]]
matrix == [[23, 42], [84, 69], [4, 10]]
- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.to_matrix(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.
- Returns
A tuple of names, keys and matrix data, each as list.
The names are the names of each aggregation that generates keys.
The keys are a list of lists, each corresponding to all the keys of each parent aggregation.
Data is a list, with other nested lists for each further dimension, containing the values of this aggregation.
Returns three empty lists if no data is available.
-
to_pandas
(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)¶ Converts the results of
dict_rows()
to a pandas DataFrame.This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym:
df
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas
DataFrame
instance
-
values
(default=None)¶ Iterates through all values of this aggregation.
- Parameters
default – If not None any None-value will be replaced by this.
- Returns
generator
-
printing utilities¶
-
class
elastipy.aggregation.aggregation_dump.
AggregationDump
(agg: elastipy.aggregation.aggregation.Aggregation)[source]¶ Bases:
object
-
dict
(key_separator: str = '|', default: Optional[Any] = None, indent: int = 2, file: Optional[TextIO] = None)[source]¶ Print the result of
Aggregation.to_dict
to console.- Parameters
key_separator –
str
Separator to concat multiple keys into one string. Defaults to|
default – If not None any None-value will be replaced by this.
indent – The json indentation, defaults to 2.
file – Optional output stream.
-
hbar
(width: Optional[int] = None, zero_based: bool = True, digits: int = 3, ascii: bool = False, colors: bool = True, file: Optional[TextIO] = None)[source]¶ Print a horizontal bar graphic based on
Aggregation.keys()
andvalues()
to console.- Parameters
width –
int
Maximum width to use. Will be auto-detected ifNone
.zero_based –
bool
IfTrue
start the bars at zero, instead of at the global minimumdigits –
int
Optional number of digits for rounding.colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.file – Optional text stream to print to.
-
heatmap
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, colors: bool = True, ascii: bool = False, **kwargs)[source]¶ Prints a heat-map from a two-dimensional matrix.
- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.heatmap(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.max_width –
int
Will limit the expansion of the table when bars are enabled. If left None, the terminal width is used.file – Optional text stream to print to.
kwargs – TODO list all Heatmap parameters
-
matrix
(indent: int = 2, file: Optional[TextIO] = None, **kwargs)[source]¶ Print a representation of
Aggregation.to_matrix()
to console.- Parameters
indent – The json indentation, defaults to 2.
file – Optional output stream.
kwargs – TODO: list additional to_matrix parameters
-
table
(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, sort: Optional[str] = None, digits: Optional[int] = None, header: bool = True, bars: bool = True, zero: Union[bool, float] = True, colors: bool = True, ascii: bool = False, max_width: Optional[int] = None, max_bar_width: int = 40, file: Optional[TextIO] = None)[source]¶ Print the result of the
Aggregation.dict_rows()
function as table to console.- Parameters
include –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
sort –
str
Optional sort column name which must match a ‘header’ key. Can be prefixed with-
(minus) to reverse orderdigits –
int
Optional number of digits for rounding.header –
bool
if True, include the names in the first row.bars –
bool
Enable display of horizontal bars in each number column. The table width will stretch, limited by ‘max_width’ and ‘max_bar_width’zero –
If
True
: the bar axis starts at zero (or at a negative value if appropriate).If
False
: the bar starts at the minimum of all values in the column.If a number is provided, the bar starts there, regardless of the minimum of all values.
colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.max_width –
int
Will limit the expansion of the table when bars are enabled. If left None, the terminal width is used.max_bar_width –
int
The maximum size a bar should havefile – Optional text stream to print to.
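A sketch of the printing interface, reached through the Aggregation.dump property (field names hypothetical):
agg = Search().agg_terms("color", field="color")
agg.metric_avg("avg_size", field="size")
agg = agg.execute()
agg.dump.table(digits=2, colors=False)  # console table of all rows
agg.dump.hbar()                         # horizontal bar graphic of keys/values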
-
plotting¶
-
class
elastipy.plot.aggregation_plot_pd.
PandasPlotWrapper
(agg: elastipy.aggregation.aggregation.Aggregation)[source]¶ Bases:
object
This is a short-hand accessor to the pandas.DataFrame.plot interface.
The documented parameters below will be passed to
Aggregation.to_pandas
. All other parameters are passed to the respective functions of the pandas interface.
s = Search()
s.agg_terms("idx", field="a").execute().plot(
    to_index="idx",
    kind="bar",
)
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
matplotlib.axes.Axes
or numpy.ndarray of them. If the backend is not the default matplotlib one, the return value will be the object returned by the backend.
-
heatmap
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, replace=None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, transpose: bool = False, figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = None, **kwargs)[source]¶ Plots a heatmap using the data from
Aggregation.df_matrix
Pandas’ default plotting backend is matplotlib. In this case seaborn.heatmap is used and the
seaborn
package must be installed along withpandas
andmatplotlib
.The plotly backend is also supported, in which case the plotly.express.imshow function (https://plotly.com/python/imshow/) is used.
In matplotlib-mode, the
figsize
parameter will create a new Axes before calling seaborn.heatmap. For plotly it’s ignored.The documented parameters below are passed to
Aggregation.df_matrix
, generating a pandas.DataFrame. All other parameters are passed to the heatmap function.Labels can be defined in plotly with the
labels
parameter, e.g.labels={"x": "date", "y": "temperature", "color": "date.doc_count"}
. Iflabels
or any of the keys are not defined they will be set to the name of each aggregation.color
will either be<bucket-agg-name>.doc_count
or<metric-name>
(or pipeline).- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.to_matrix(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.replace –
str, regex, list, dict, Series, int, float, or None
If not None, the
pandas.DataFrame.replace
function will be called with this parameter as theto_replace
parameter.
transpose –
bool
Transposes the matrix, i.e. exchanges the X and Y axis.
- Parameters
figsize –
tuple of ints or floats
Optional tuple to change the size of the plot when the plotting backend ismatplotlib
.int
values will be passed tomatplotlib.axes.Axes
unchanged. Afloat
value defines the size in terms of the number of keys per axis and is converted to int withint(len(keys) * value)
kwargs – Passed to
seaborn.heatmap()
- Returns
matplotlib.axes.Axes
Axis object with the heatmap.
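A sketch for a two-dimensional terms aggregation (field names hypothetical; assumes seaborn is installed and matplotlib is the pandas plotting backend):
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
a.execute().plot.heatmap(sort=True, default=0, figsize=(8, 6))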
-
hexbin
(x, y, C=None, reduce_C_function=None, gridsize=None, **kwargs)[source]¶ Generate a hexagonal binning plot.
-
kde
(bw_method=None, ind=None, **kwargs)[source]¶ Generate Kernel Density Estimate plot using Gaussian kernels.