Aggregation¶
Aggregations can be created on the Search object or inside an existing Aggregation.
from elastipy import Search
s = Search()
agg = s.agg_terms("name_of_agg", field="field", size=100)
supported aggregations¶
bucket
metric
pipeline
value access¶
- Aggregation.keys(key_separator: Optional[str] = None, tuple_key: bool = False)
Iterates through all keys of this aggregation.
For example, a top-level terms aggregation would return all bucketed field values.
For a nested bucket aggregation each key is a tuple of all parent keys as well.
- Parameters
key_separator – str
Optional separator to concat multiple keys into one string.
tuple_key – bool
If True, the key is always a tuple. If False, the key is a string if there is only one key.
- Returns
generator
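A minimal sketch of iterating the keys of an executed terms aggregation (the index and field names here are hypothetical):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical keyword field
s.execute()
for key in agg.keys():
    print(key)                              # e.g. "red", "green", "blue"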
- Aggregation.values(default=None)
Iterates through all values of this aggregation.
- Parameters
default – If not None, any None value will be replaced by this.
- Returns
generator
- Aggregation.items(key_separator: Optional[str] = None, tuple_key: bool = False, default=None) → Iterable[Tuple]
Iterates through all key, value tuples.
- Parameters
key_separator – str
Optional separator to concat multiple keys into one string.
tuple_key – bool
If True, the key is always a tuple. If False, the key is a string if there is only one key.
default – If not None, any None value will be replaced by this.
- Returns
generator
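For a bucket aggregation like terms, the value of each key is the bucket’s doc_count, so a sketch like this (same hypothetical index and field as above) prints term counts:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical field
s.execute()
for key, value in agg.items():
    print(key, value)                       # e.g. red 23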
- Aggregation.rows(header: bool = True, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, default=None) → Iterable[list]
Iterates through all result values from this aggregation branch.
Each row is a list. The first row contains the column names if header is True.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
- Parameters
header – bool
If True, the first row contains the names of the columns.
include – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
default – This value will be used wherever a value is undefined.
- Returns
generator of list
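A sketch with a nested bucket aggregation (hypothetical index and fields; the exact column names derive from the aggregation names):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical fields
agg = agg.agg_terms("shape", field="shape")
s.execute()
for row in agg.rows():
    print(row)                              # a header list first, then one list per bucket combination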
- Aggregation.dict_rows(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False) → Iterable[dict]
Iterates through all result values from this aggregation branch.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics and pipelines).
- Parameters
include – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or sequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
- Returns
generator of dict
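The same kind of setup as above, but each generated row is a mapping of column name to value:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical field
s.execute()
for row in agg.dict_rows():
    print(row)                              # one dict per bucket, keyed by column name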
- Aggregation.to_dict(key_separator=None, default=None) → dict
Create a dictionary from all key/value pairs.
- Parameters
key_separator – str, optional separator to concat multiple keys into one string
default – If not None, any None value will be replaced by this.
- Returns
dict
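A sketch (hypothetical index and field):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("colors", field="color")  # hypothetical field
s.execute()
print(agg.to_dict())                        # e.g. {"red": 23, "green": 42, "blue": 4}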
- Aggregation.to_pandas(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)
Converts the results of dict_rows() to a pandas DataFrame.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym: df
- Parameters
index – bool or str
Sets a specific column as the index of the DataFrame.
If False, no explicit index is set.
If True, the root aggregation’s keys will be the index.
If str, explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use to_index.
to_index – bool or str
Same as index, but the column is removed from the DataFrame.
If False, no explicit index is set.
If True, the root aggregation’s keys will be the index.
If str, explicitly set a certain column as the DataFrame index.
include – str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.
exclude – str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.
flat – bool, str or sequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. If True, all bucket aggregations are flattened.
Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas DataFrame instance
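A sketch, assuming pandas is installed (index and field hypothetical):
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_terms("color", field="color")   # hypothetical field
s.execute()
df = agg.to_pandas()                        # keys, doc counts and metrics as columns
df = agg.to_pandas(to_index=True)           # root aggregation keys become the index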
- Aggregation.to_matrix(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None) → Tuple[List[str], List, List]
Generate an N-dimensional matrix from the values of this aggregation.
Each dimension corresponds to one of the parent bucket keys that lead to this aggregation.
The values are gathered through the Aggregation.items method. So the matrix values are either the doc_count of the bucket aggregation or the result of a metric or pipeline aggregation that is inside one of the bucket aggregations.
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
...
names, keys, matrix = a.to_matrix()
names == ["color", "shape"]
keys == [["red", "green", "blue"], ["circle", "triangle"]]
matrix == [[23, 42], [84, 69], [4, 10]]
- Parameters
sort – Can sort one or several keys/axes.
True sorts all keys ascending.
"-" sorts all keys descending.
The name of an aggregation sorts its keys ascending. A "-" prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys.
For example, agg.to_matrix(sort=("color", "-shape", -4)) would sort the color keys ascending, the shape keys descending and the 4th aggregation - whatever that is - descending.
default – If not None, any None value will be replaced by this value.
include – str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.
exclude – str | seq[str]
One or more wildcard patterns that exclude matching keys.
- Returns
A tuple of names, keys and matrix data, each as list.
The names are the names of each aggregation that generates keys.
The keys are a list of lists, each corresponding to all the keys of each parent aggregation.
Data is a list, with other nested lists for each further dimension, containing the values of this aggregation.
Returns three empty lists if no data is available.
aggregation interface¶
The Search class as well as created aggregations themselves support the following interface.
- class elastipy.aggregation.Aggregation(search, name, type, params)[source]¶
Bases: elastipy.aggregation.converter.ConverterMixin, elastipy.aggregation.generated_interface.AggregationInterface
Aggregation definition and response parser.
Do not create instances yourself, use the Search.aggregation() and Aggregation.aggregation() variants.
Once the Search has been executed, the values of the aggregations can be accessed.
- agg(*aggregation_name_type, **params)¶
Alias for aggregation()
- agg_adjacency_matrix(*aggregation_name: Optional[str], filters: Mapping[str, Union[Mapping, QueryInterface]], separator: Optional[str] = None)¶
A bucket aggregation returning a form of adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.
The matrix is said to be symmetric so we only return half of it. To do this we sort the filter name strings and always use the lowest of a pair as the value to the left of the "&" separator.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filters – Mapping[str, Union[Mapping, 'QueryInterface']]
separator – Optional[str]
An alternative separator parameter can be passed in the request if clients wish to use a separator string other than the default of the ampersand.
- Returns
'AggregationInterface'
A new instance is created and returned
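A sketch of the call; the filter expressions are passed as plain query mappings, and all index, filter and field names below are hypothetical:
from elastipy import Search
s = Search(index="interactions")            # hypothetical index
agg = s.agg_adjacency_matrix(
    "groups",
    filters={
        "grpA": {"terms": {"accounts": ["alice", "bob"]}},   # hypothetical filters
        "grpB": {"terms": {"accounts": ["carol", "dave"]}},
    },
)
s.execute()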
- agg_auto_date_histogram(*aggregation_name: Optional[str], field: Optional[str] = None, buckets: int = 10, minimum_interval: Optional[str] = None, time_zone: Optional[str] = None, format: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
A multi-bucket aggregation similar to the Date histogram except instead of providing an interval to use as the width of each bucket, a target number of buckets is provided indicating the number of buckets needed and the interval of the buckets is automatically chosen to best achieve that target. The number of buckets returned will always be less than or equal to this target number.
The buckets field is optional, and will default to 10 buckets if not specified.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
If no field is specified it will default to the ‘timestamp_field’ of the Search class.
buckets – int
The number of buckets that are to be returned.
minimum_interval – Optional[str]
The minimum_interval allows the caller to specify the minimum rounding interval that should be used. This can make the collection process more efficient, as the aggregation will not attempt to round at any interval lower than minimum_interval.
The accepted units for minimum_interval are: year, month, day, hour, minute, second
time_zone – Optional[str]
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.
Warning
When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different sizes than neighbouring buckets. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If the result of the aggregation was daily buckets, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals like e.g. 12h. Here, we will have only an 11h bucket on the morning of 27 March when the DST shift happens.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
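A sketch (hypothetical index; the field is omitted so it falls back to the Search class’s timestamp_field):
from elastipy import Search
s = Search(index="events")                  # hypothetical index
agg = s.agg_auto_date_histogram("timeline", buckets=30)
s.execute()
print(agg.to_dict())                        # at most 30 buckets, interval chosen automatically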
- agg_children(*aggregation_name: Optional[str], type: str)¶
A special single bucket aggregation that selects child documents that have the specified type, as defined in a join field.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
type – str
The child type that should be selected.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_composite(*aggregation_name: Optional[str], sources: Sequence[Mapping], size: int = 10, after: Optional[Union[str, int, float, datetime.datetime]] = None)¶
A multi-bucket aggregation that creates composite buckets from different sources.
Unlike the other multi-bucket aggregations, you can use the composite aggregation to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.
The composite buckets are built from the combinations of the values extracted/created for each document and each combination is considered as a composite bucket.
For optimal performance the index sort should be set on the index so that it matches parts or fully the source order in the composite aggregation.
Sub-buckets: Like any multi-bucket aggregations the composite aggregation can hold sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this parent aggregation.
Pipeline aggregations: The composite agg is not currently compatible with pipeline aggregations, nor does it make sense in most cases. E.g. due to the paging nature of composite aggs, a single logical partition (one day for example) might be spread over multiple pages. Since pipeline aggregations are purely post-processing on the final list of buckets, running something like a derivative on a composite page could lead to inaccurate results as it is only taking into account a “partial” result on that page.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
sources – Sequence[Mapping]
The sources parameter defines the source fields to use when building composite buckets. The order that the sources are defined controls the order that the keys are returned.
The sources parameter can be any of the following types:
Terms
Histogram
Date histogram
GeoTile grid
Note
You must use a unique name when defining sources.
size – int
The size parameter can be set to define how many composite buckets should be returned. Each composite bucket is considered as a single bucket, so setting a size of 10 will return the first 10 composite buckets created from the value sources. The response contains the values for each composite bucket in an array containing the values extracted from each value source.
Pagination: If the number of composite buckets is too high (or unknown) to be returned in a single response it is possible to split the retrieval in multiple requests. Since the composite buckets are flat by nature, the requested size is exactly the number of composite buckets that will be returned in the response (assuming that there are at least size composite buckets to return). If all composite buckets should be retrieved it is preferable to use a small size (100 or 1000 for instance) and then use the after parameter to retrieve the next results.
after – Optional[Union[str, int, float, datetime]]
To get the next set of buckets, resend the same aggregation with the after parameter set to the after_key value returned in the response.
Note
The after_key is usually the key to the last bucket returned in the response, but that isn’t guaranteed. Always use the returned after_key instead of deriving it from the buckets.
In order to optimize the early termination it is advised to set track_total_hits in the request to false. The number of total hits that match the request can be retrieved on the first request and it would be costly to compute this number on every page.
- Returns
'AggregationInterface'
A new instance is created and returned
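A sketch of a two-source composite aggregation; the sources are plain mappings in the usual Elasticsearch format, and the index and field names are hypothetical:
from elastipy import Search
s = Search(index="world")                   # hypothetical index
agg = s.agg_composite(
    "pairs",
    sources=[
        {"color": {"terms": {"field": "color"}}},            # hypothetical fields
        {"day": {"date_histogram": {"field": "timestamp",
                                    "calendar_interval": "1d"}}},
    ],
    size=100,
)
s.execute()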
- agg_date_histogram(*aggregation_name: Optional[str], field: Optional[str] = None, calendar_interval: Optional[str] = None, fixed_interval: Optional[str] = None, min_doc_count: int = 1, offset: Optional[str] = None, time_zone: Optional[str] = None, format: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
This multi-bucket aggregation is similar to the normal histogram, but it can only be used with date or date range values. Because dates are represented internally in Elasticsearch as long values, it is possible, but not as accurate, to use the normal histogram on dates as well. The main difference in the two APIs is that here the interval can be specified using date/time expressions. Time-based data requires special support because time-based intervals are not always a fixed length.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
If no field is specified it will default to the ‘timestamp_field’ of the Search class.
calendar_interval – Optional[str]
Calendar-aware intervals are configured with the calendar_interval parameter. You can specify calendar intervals using the unit name, such as month, or as a single unit quantity, such as 1M. For example, day and 1d are equivalent. Multiple quantities, such as 2d, are not supported.
fixed_interval – Optional[str]
In contrast to calendar-aware intervals, fixed intervals are a fixed number of SI units and never deviate, regardless of where they fall on the calendar. One second is always composed of 1000ms. This allows fixed intervals to be specified in any multiple of the supported units.
However, it means fixed intervals cannot express other units such as months, since the duration of a month is not a fixed quantity. Attempting to specify a calendar interval like month or quarter will throw an exception.
The accepted units for fixed intervals are:
milliseconds (ms): A single millisecond. This is a very, very small interval.
seconds (s): Defined as 1000 milliseconds each.
minutes (m): Defined as 60 seconds each (60,000 milliseconds). All minutes begin at 00 seconds.
hours (h): Defined as 60 minutes each (3,600,000 milliseconds). All hours begin at 00 minutes and 00 seconds.
days (d): Defined as 24 hours (86,400,000 milliseconds). All days begin at the earliest possible time, which is usually 00:00:00 (midnight).
min_doc_count – int
Minimum documents required for a bucket. Set to 0 to allow creating empty buckets.
offset – Optional[str]
Use the offset parameter to change the start value of each bucket by the specified positive (+) or negative offset (-) duration, such as 1h for an hour, or 1d for a day. See Time units for more possible time duration options.
For example, when using an interval of day, each bucket runs from midnight to midnight. Setting the offset parameter to +6h changes each bucket to run from 6am to 6am.
time_zone – Optional[str]
Elasticsearch stores date-times in Coordinated Universal Time (UTC). By default, all bucketing and rounding is also done in UTC. Use the time_zone parameter to indicate that bucketing should use a different time zone.
For example, if the interval is a calendar day and the time zone is America/New_York then 2020-01-03T01:00:01Z is
converted to 2020-01-02T18:00:01
rounded down to 2020-01-02T00:00:00
then converted back to UTC to produce 2020-01-02T05:00:00Z
finally, when the bucket is turned into a string key it is printed in America/New_York so it’ll display as "2020-01-02T00:00:00"
It looks like:
bucket_key = localToUtc(Math.floor(utcToLocal(value) / interval) * interval)
You can specify time zones as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as an IANA time zone ID, such as America/Los_Angeles.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
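A sketch (hypothetical index; the field falls back to the Search class’s timestamp_field when omitted):
from elastipy import Search
s = Search(index="events")                  # hypothetical index
agg = s.agg_date_histogram("per-day", calendar_interval="1d")
s.execute()
print(agg.to_dict())                        # one doc_count per day bucket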
- agg_date_range(*aggregation_name: Optional[str], ranges: Sequence[Union[Mapping[str, str], str]], field: Optional[str] = None, format: Optional[str] = None, time_zone: Optional[str] = None, keyed: bool = False, missing: Optional[Any] = None, script: Optional[dict] = None)¶
A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned.
Note
Note that this aggregation includes the from value and excludes the to value for each range.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
ranges – Sequence[Union[Mapping[str, str], str]]
List of ranges to define the buckets. Example:
[
    {"to": "1970-01-01"},
    {"from": "1970-01-01", "to": "1980-01-01"},
    {"from": "1980-01-01"},
]
Instead of date values any Date Math expression can be used as well.
Alternatively this parameter can be a list of strings. The above example can be rewritten as:
["1970-01-01", "1980-01-01"]
Note
This aggregation includes the from value and excludes the to value for each range.
field – Optional[str]
The date field. If no field is specified it will default to the ‘timestamp_field’ of the Search class.
format – Optional[str]
The format of the response bucket keys as available for the DateTimeFormatter
time_zone – Optional[str]
Dates can be converted from another time zone to UTC by specifying the time_zone parameter.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as one of the time zone ids from the TZ database.
The time_zone parameter is also applied to rounding in date math expressions.
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
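A sketch using the string shorthand for ranges (hypothetical index and field):
from elastipy import Search
s = Search(index="people")                  # hypothetical index
agg = s.agg_date_range(
    "decades",
    field="born",                           # hypothetical date field
    ranges=["1970-01-01", "1980-01-01"],    # shorthand for the from/to mappings above
)
s.execute()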
- agg_diversified_sampler(*aggregation_name: Optional[str], field: Optional[str] = None, script: Optional[Mapping] = None, shard_size: int = 100, max_docs_per_value: int = 1)¶
Like the sampler aggregation this is a filtering aggregation used to limit any sub aggregations’ processing to a sample of the top-scoring documents. The diversified_sampler aggregation adds the ability to limit the number of matches that share a common value such as an “author”.
Note
Any good market researcher will tell you that when working with samples of data it is important that the sample represents a healthy variety of opinions rather than being skewed by any single voice. The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer).
Example use cases:
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Removing bias from analytics by ensuring fair representation of content from different sources
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
A choice of field or script setting is used to provide values used for de-duplication and the max_docs_per_value setting controls the maximum number of documents collected on any one shard which share a common value. The default setting for max_docs_per_value is 1.
Note
The aggregation will throw an error if the choice of field or script produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
Cannot be nested under breadth_first aggregations: Being a quality-based filter the diversified_sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.
Limited de-dup logic: The de-duplication logic applies only at a shard level so will not apply across shards.
No specialized syntax for geo/date fields: Currently the syntax for defining the diversifying values is defined by a choice of field or script - there is no added syntactical sugar for expressing geo or date units such as "7d" (7 days). This support may be added in a later release and users will currently have to create these sorts of values using a script.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – Optional[str]
The field to search on. Can alternatively be a script
script – Optional[Mapping]
The script that specifies the aggregation. Can alternatively be a ‘field’
shard_size – int
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
max_docs_per_value – int
The max_docs_per_value is an optional parameter and limits how many documents are permitted per choice of de-duplicating value. The default setting is 1.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_filter(*aggregation_name: Optional[str], filter: Union[Mapping, QueryInterface])¶
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filter – Union[Mapping, 'QueryInterface']
- Returns
'AggregationInterface'
A new instance is created and returned
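A sketch, passing the filter as a plain query mapping (index and field names hypothetical):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_filter("in-stock", filter={"term": {"in_stock": True}})
agg = agg.agg_terms("category", field="category")   # sub-aggregation is narrowed to the bucket
s.execute()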
- agg_filters(*aggregation_name: Optional[str], filters: Mapping[str, Union[Mapping, QueryInterface]])¶
Defines a multi bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
filters – Mapping[str, Union[Mapping, 'QueryInterface']]
- Returns
'AggregationInterface'
A new instance is created and returned
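A sketch with two named filters as plain query mappings (all names hypothetical):
from elastipy import Search
s = Search(index="logs")                    # hypothetical index
agg = s.agg_filters("severity", filters={
    "errors": {"match": {"body": "error"}},     # hypothetical queries
    "warnings": {"match": {"body": "warning"}},
})
s.execute()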
- agg_geo_distance(*aggregation_name: Optional[str], field: str, ranges: Sequence[Union[Mapping[str, float], float]], origin: Union[str, Mapping[str, float], Sequence[float]], unit: str = 'm', distance_type: str = 'arc', keyed: bool = False)¶
A multi-bucket aggregation that works on geo_point fields and conceptually works very similar to the range aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
ranges – Sequence[Union[Mapping[str, float], float]]
A list of ranges that define the separate buckets, e.g.:
[
    {"to": 100000},
    {"from": 100000, "to": 300000},
    {"from": 300000},
]
Alternatively this parameter can be a list of numbers. The above example can be rewritten as:
[100000, 300000]
origin – Union[str, Mapping[str, float], Sequence[float]]
The origin point can accept all formats supported by the geo_point type:
Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it is the most explicit about the lat & lon values
String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon
Array format: [4.894, 52.3760] - which is based on the GeoJson standard and where the first number is the lon and the second one is the lat
unit – str
By default, the distance unit is m (meters) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
distance_type – str
There are two distance calculation modes: arc (the default), and plane. The arc calculation is the most accurate. The plane is the fastest but least accurate. Consider using plane when your search context is “narrow”, and spans smaller geographical areas (~5km). plane will return higher error margins for searches across very large areas (e.g. cross continent search).
keyed – bool
Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_geohash_grid(*aggregation_name: Optional[str], field: str, precision: Union[int, str] = 5, bounds: Optional[Mapping] = None, size: int = 10000, shard_size: Optional[int] = None)¶
A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid. The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a geohash which is of user-definable precision.
High precision geohashes have a long string length and represent cells that cover only a small area.
Low precision geohashes have a short string length and represent cells that each cover a large area.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point or geo_shape (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
Aggregating on Geo-shape fields works just as it does for points, except that a single shape can be counted for in multiple tiles. A shape will contribute to the count of matching values if any part of its shape intersects with that tile.
precision – Union[int, str]
The required precision of the grid in the range [1, 12]. Higher means more precise.
Alternatively, the precision level can be approximated from a distance measure like "1km", "10m". The precision level is calculated such that cells will not exceed the specified size (diagonal) of the required precision. When this would lead to precision levels higher than the supported 12 levels (e.g. for distances <5.6cm), the value is rejected.
Note
When requesting detailed buckets (typically for displaying a “zoomed in” map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
bounds – Optional[Mapping]
The geohash_grid aggregation supports an optional bounds parameter that restricts the points considered to those that fall within the bounds provided. The bounds parameter accepts the bounding box in all the same accepted formats of the bounds specified in the Geo Bounding Box Query. This bounding box can be used with or without an additional geo_bounding_box query filtering the points prior to aggregating. It is an independent bounding box that can intersect with, be equal to, or be disjoint to any additional geo_bounding_box queries defined in the context of the aggregation.
size – int
The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain.
shard_size – Optional[int]
To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning max(10, (size x number-of-shards)) buckets from each shard. If this heuristic is undesirable, the number considered from each shard can be overridden using this parameter.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_geotile_grid(*aggregation_name: Optional[str], field: str, precision: Union[int, str] = 7, bounds: Optional[Mapping] = None, size: int = 10000, shard_size: Optional[int] = None)¶
A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid. The resulting grid can be sparse and only contains cells that have matching data. Each cell corresponds to a map tile as used by many online map sites. Each cell is labeled using a “{zoom}/{x}/{y}” format, where zoom is equal to the user-specified precision.
High precision keys have a larger range for x and y, and represent tiles that cover only a small area.
Low precision keys have a smaller range for x and y, and represent tiles that each cover a large area.
Warning
The highest-precision geotile of length 29 produces cells that cover less than a 10cm by 10cm of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please first filter the aggregation to a smaller geographic area before requesting high-levels of detail.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The specified field must be of type geo_point (which can only be set explicitly in the mappings). And it can also hold an array of geo_point fields, in which case all will be taken into account during aggregation.
precision – Union[int, str]
The required precision of the grid in the range [1, 29]. Higher means more precise.
Note
When requesting detailed buckets (typically for displaying a “zoomed in” map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
bounds – Optional[Mapping]
The geotile_grid aggregation supports an optional bounds parameter that restricts the points considered to those that fall within the bounds provided. The bounds parameter accepts the bounding box in all the same accepted formats of the bounds specified in the Geo Bounding Box Query. This bounding box can be used with or without an additional geo_bounding_box query filtering the points prior to aggregating. It is an independent bounding box that can intersect with, be equal to, or be disjoint to any additional geo_bounding_box queries defined in the context of the aggregation.
size – int
The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain.
shard_size – Optional[int]
To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning max(10, (size x number-of-shards)) buckets from each shard. If this heuristic is undesirable, the number considered from each shard can be overridden using this parameter.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_global(*aggregation_name: Optional[str])¶
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.
Note
Global aggregators can only be placed as top level aggregators because it doesn’t make sense to embed a global aggregator within another bucket aggregator.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_histogram(*aggregation_name: Optional[str], field: str, interval: int, min_doc_count: int = 0, offset: Optional[int] = None, extended_bounds: Optional[Mapping[str, int]] = None, hard_bounds: Optional[Mapping[str, int]] = None, format: Optional[str] = None, order: Optional[Union[Mapping, str]] = None, keyed: bool = False, missing: Optional[Any] = None)¶
A multi-bucket values source based aggregation that can be applied on numeric values or numeric range values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document will be evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will “fall” into the bucket that is associated with the key 30. To make this more formal, here is the rounding function that is used:
bucket_key = Math.floor((value - offset) / interval) * interval + offset
For range values, a document can fall into multiple buckets. The first bucket is computed from the lower bound of the range in the same way as a bucket for a single value is computed. The final bucket is computed in the same way from the upper bound of the range, and the range is counted in all buckets in between and including those two.
The interval must be a positive decimal, while the offset must be a decimal in [0, interval) (a decimal greater than or equal to 0 and less than interval)
Histogram fields: Running a histogram aggregation over histogram fields computes the total number of counts for each interval. See example
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
A numeric field to be indexed by the histogram.
interval – int
A positive decimal defining the interval between buckets.
min_doc_count – int
By default the response will fill gaps in the histogram with empty buckets. It is possible to change that and request buckets with a higher minimum count thanks to the min_doc_count setting.
By default the histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when requesting empty buckets, this causes a confusion, specifically, when the data is also filtered.
To understand why, let’s look at an example:
Let’s say you’re filtering your request to get all docs with values between 0 and 500, in addition you’d like to slice the data per price using a histogram with an interval of 50. You also specify "min_doc_count": 0 as you’d like to get all buckets even the empty ones. If it happens that all products (documents) have prices higher than 100, the first bucket you’ll get will be the one with 100 as its key. This is confusing, as many times, you’d also like to get those buckets between 0 - 100.
offset – Optional[int]
By default the bucket keys start with 0 and then continue in even spaced steps of interval, e.g. if the interval is 10, the first three buckets (assuming there is data inside them) will be [0, 10), [10, 20), [20, 30). The bucket boundaries can be shifted by using the offset option.
This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval 10 will result in two buckets with 5 documents each. If an additional offset 5 is used, there will be only one single bucket [5, 15) containing all the 10 documents.
extended_bounds – Optional[Mapping[str, int]]
With the extended_bounds setting, you can “force” the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if the extended_bounds.min is higher than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the same goes for the extended_bounds.max and the last bucket). For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
When aggregating ranges, buckets are based on the values of the returned documents. This means the response may include buckets outside of a query’s range. For example, if your query looks for values greater than 100, and you have a range covering 50 to 150, and an interval of 50, that document will land in 3 buckets - 50, 100, and 150. In general, it’s best to think of the query and aggregation steps as independent - the query selects a set of documents, and then the aggregation buckets those documents without regard to how they were selected. See note on bucketing range fields for more information and an example.
hard_bounds – Optional[Mapping[str, int]]
The hard_bounds is a counterpart of extended_bounds and can limit the range of buckets in the histogram. It is particularly useful in the case of open data ranges that can result in a very large number of buckets.
format – Optional[str]
Specifies the format of the ‘key_as_string’ response. See: mapping date format
order – Optional[Union[Mapping, str]]
By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled using the order setting. Supports the same order functionality as the Terms Aggregation.
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
- Returns
'AggregationInterface'
A new instance is created and returned
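A sketch (hypothetical index and numeric field):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_histogram("price-bands", field="price", interval=50)
s.execute()
print(agg.to_dict())                        # e.g. {0: 12, 50: 7, 100: 3}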
- agg_ip_range(*aggregation_name: Optional[str], field: str, ranges: Sequence[Union[Mapping[str, str], str]], keyed: bool = False)¶
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields:
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The IPv4 field
ranges – Sequence[Union[Mapping[str, str], str]]
List of ranges to define the buckets, either as straight IPv4 or as CIDR masks.
Example:
[
    {"to": "10.0.0.5"},
    {"from": "10.0.0.5", "to": "10.0.0.127"},
    {"from": "10.0.0.127"},
]
Alternatively this parameter can be a list of strings. The above example can be rewritten as:
["10.0.0.5", "10.0.0.127"]
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_missing(*aggregation_name: Optional[str], field: str)¶
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The field we wish to investigate for missing values
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_nested(*aggregation_name: Optional[str], path: str)¶
A special single bucket aggregation that enables aggregating nested documents.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
path – str
The field of the nested document(s)
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_range(*aggregation_name: Optional[str], ranges: Sequence[Union[Mapping[str, Any], Any]], field: Optional[str] = None, keyed: bool = False, script: Optional[dict] = None)¶
A multi-bucket value source based aggregation that enables the user to define a set of ranges - each representing a bucket. During the aggregation process, the values extracted from each document will be checked against each bucket range and “bucket” the relevant/matching document.
Note
Note that this aggregation includes the from value and excludes the to value for each range.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
ranges – Sequence[Union[Mapping[str, Any], Any]]
List of ranges to define the buckets. Example:
[
    {"to": 10},
    {"from": 10, "to": 20},
    {"from": 20},
]
Alternatively this parameter can be a list of values. The above example can be rewritten as:
[10, 20]
Note
This aggregation includes the from value and excludes the to value for each range.
field – Optional[str]
The field to index by the aggregation
keyed – bool
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
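A sketch using the shorthand list of values (hypothetical index and field):
from elastipy import Search
s = Search(index="products")                # hypothetical index
agg = s.agg_range("price-bands", field="price", ranges=[10, 20])
# equivalent to ranges=[{"to": 10}, {"from": 10, "to": 20}, {"from": 20}]
s.execute()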
- agg_rare_terms(*aggregation_name: Optional[str], field: str, max_doc_count: int = 1, include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, missing: Optional[Any] = None)¶
A multi-bucket value source based aggregation which finds “rare” terms — terms that are at the long-tail of the distribution and are not frequent. Conceptually, this is like a terms aggregation that is sorted by _count ascending. As noted in the terms aggregation docs, actually ordering a terms agg by count ascending has unbounded error. Instead, you should use the rare_terms aggregation.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
The field we wish to find rare terms in
max_doc_count – int
The maximum number of documents a term should appear in.
The max_doc_count parameter is used to control the upper bound of document counts that a term can have. There is not a size limitation on the rare_terms agg like terms agg has. This means that terms which match the max_doc_count criteria will be returned. The aggregation functions in this manner to avoid the order-by-ascending issues that afflict the terms aggregation.
This does, however, mean that a large number of results can be returned if chosen incorrectly. To limit the danger of this setting, the maximum max_doc_count is 100.
include – Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
Partition expressions are also possible.
exclude – Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
missing – Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_sampler(*aggregation_name: Optional[str], shard_size: int = 100)¶
A filtering aggregation used to limit any sub aggregations’ processing to a sample of the top-scoring documents.
Example use cases:
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
shard_size – int
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
- Returns
'AggregationInterface'
A new instance is created and returned
- agg_significant_terms(*aggregation_name: Optional[str], field: str, size: int = 10, shard_size: Optional[int] = None, min_doc_count: int = 1, shard_min_doc_count: Optional[int] = None, execution_hint: str = 'global_ordinals', include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, script: Optional[dict] = None)¶
An aggregation that returns interesting or unusual occurrences of terms in a set.
Example use cases:
Suggesting “H5N1” when users search for “bird flu” in text
Identifying the merchant that is the “common point of compromise” from the transaction history of credit card owners reporting loss
Suggesting keywords relating to stock symbol $ATI for an automated news classifier
Spotting the fraudulent doctor who is diagnosing more than their fair share of whiplash injuries
Spotting the tire manufacturer who has a disproportionate number of blow-outs
In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term “H5N1” only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
Warning
Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt to load every unique word into RAM. It is recommended to only use this on smaller indices.
- Parameters
aggregation_name – Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.
field – str
size – int
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).
shard_size – Optional[int]
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).
The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.
min_doc_count – int
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option. Default value is 1.
Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision if a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging local terms statistics of all shards. In a way the decision to add the term as a candidate is made without being very certain about if the term will actually reach the required min_doc_count. This might cause many (globally) high frequent terms to be missing in the final result if low frequent terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count – Optional[int]
The parameter shard_min_doc_count regulates the certainty a shard has if the term should actually be added to the candidate list or not with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low frequent terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will with a reasonable certainty not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 per default and has no effect unless you explicitly set it.
Note
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.
Warning
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
execution_hint – str
There are different mechanisms by which terms aggregations can be executed:
by using field values directly in order to aggregate data per-bucket (map)
by using global ordinals of the field and allocating one bucket per global ordinal (global_ordinals)
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
global_ordinals is the default option for keyword fields, it uses global ordinals to allocate buckets dynamically so memory usage is linear to the number of values of the documents that are part of the aggregation scope.
map should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode is significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have ordinals.
include – Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
Partition expressions are also possible.
exclude – Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the documents which will be aggregated.
Alternatively can be a list of strings.
script – Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
-
agg_terms
(*aggregation_name: Optional[str], field: str, size: int = 10, shard_size: Optional[int] = None, show_term_doc_count_error: Optional[bool] = None, order: Optional[Union[Mapping, str]] = None, min_doc_count: int = 1, shard_min_doc_count: Optional[int] = None, include: Optional[Union[str, Sequence[str], Mapping[str, int]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, missing: Optional[Any] = None, script: Optional[dict] = None)¶ A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
size –
int
The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).shard_size –
Optional[int]
The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).The shard_size parameter can be used to minimize the extra work that comes with bigger requested size. When defined, it will determine how many terms the coordinating node will request from each shard. Once all the shards responded, the coordinating node will then reduce them to a final result which will be based on the size parameter - this way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.
show_term_doc_count_error –
Optional[bool]
This shows an error value for each term returned by the aggregation which represents the worst case error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term.These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.
order –
Optional[Union[Mapping, str]]
The order of the buckets can be customized by setting the order parameter. By default, the buckets are ordered by their doc_count descending.Warning
Sorting by ascending _count or by sub aggregation is discouraged as it increases the error on document counts. It is fine when a single shard is queried, or when the field that is being aggregated was used as a routing key at index time: in these cases results will be accurate since shards have disjoint values. However otherwise, errors are unbounded. One particular case that could still be useful is sorting by min or max aggregation: counts will not be accurate but at least the top buckets will be correctly picked.
min_doc_count –
int
It is possible to only return terms that match more than a configured number of hits using the min_doc_count option. The default value is 1. Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging the local terms statistics of all shards. In a way, the decision to add a term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count –
Optional[int]
The parameter shard_min_doc_count regulates the certainty a shard has whether a term should actually be added to the candidate list with respect to the min_doc_count. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 by default and has no effect unless you explicitly set it.Note
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.
Warning
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
include –
Optional[Union[str, Sequence[str], Mapping[str, int]]]
A regexp pattern that filters the values which will be aggregated. Alternatively, this can be a list of strings.
Partition expressions are also possible.
exclude –
Optional[Union[str, Sequence[str]]]
A regexp pattern that filters the values which will be aggregated. Alternatively, this can be a list of strings.
missing –
Optional[Any]
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.script –
Optional[dict]
Generating the terms using a script
- Returns
'AggregationInterface'
A new instance is created and returned
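A minimal usage sketch (the field names and the example output are hypothetical):
from elastipy import Search
s = Search()
# top-level terms aggregation over a keyword field
agg = s.agg_terms("brand", field="brand", size=20, min_doc_count=2)
agg = agg.execute()
for key, count in agg.items():
    print(key, count)  # e.g. "acme" 23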
-
aggregation
(*aggregation_name_type, **params) → elastipy.aggregation.aggregation.Aggregation[source]¶ Interface to create sub-aggregations.
This is the generic, undocumented version. Use the agg_*, metric_* and pipeline_* methods for convenience.
- Parameters
aggregation_name_type – one or two strings, meaning either “type” or “name”, “type”
params – all parameters of the aggregation function
- Returns
Aggregation
instance
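For example, the following generic call should be equivalent to using the agg_terms convenience method (field name is hypothetical):
s = Search()
# name first, type second; pass only the type to auto-generate a name
agg = s.aggregation("my_terms", "terms", field="color", size=10)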
-
body_path
() → str[source]¶ Return the dotted path of this aggregation in the request body
- Returns
str
-
property
buckets
¶ Returns the buckets of the aggregation response
Only available for bucket root aggregations!
- Returns
dict or list
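A sketch of accessing the raw buckets after execution (field name hypothetical):
agg = Search().agg_terms("color", field="color").execute()
# raw response buckets, e.g. [{"key": "red", "doc_count": 23}, ...]
print(agg.buckets)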
-
df
(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)¶ Converts the results of
dict_rows()
to a pandas DataFrame.This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym:
to_pandas
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas
DataFrame
instance
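A sketch of converting a terms aggregation with a nested metric to a DataFrame (requires pandas; field names hypothetical):
agg = Search().agg_terms("color", field="color")
agg.metric_avg("avg_size", field="size")
df = agg.execute().df(to_index="color")
print(df)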
-
df_matrix
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None)¶ Returns a pandas DataFrame containing the matrix.
See to_matrix for details.
Only one- and two-dimensional matrices are supported.
- Returns
pandas.DataFrame instance
- Raises
ValueError – If the number of dimensions is 0 or above 2
-
dict_rows
(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False) → Iterable[dict]¶ Iterates through all result values from this aggregation branch.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics and pipelines).
- Parameters
include –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
- Returns
generator of dict
-
property
dump
Access to the
printing
interface- Returns
AggregationDump
instance
-
property
group
¶ Returns the name of the aggregation group.
- Returns
str, either “bucket”, “metric” or “pipeline”
-
items
(key_separator: Optional[str] = None, tuple_key: bool = False, default=None) → Iterable[Tuple]¶ Iterates through all key, value tuples.
- Parameters
key_separator –
str
Optional separator to concat multiple keys into one string.tuple_key –
bool
If True, the key is always a tuple.If False, the key is a string if there is only one key.
default – If not None any None-value will be replaced by this.
- Returns
generator
-
key_name
() → str[source]¶ Return default name of the bucket key field.
Metrics return their parent’s key
- Returns
str
-
keys
(key_separator: Optional[str] = None, tuple_key: bool = False)¶ Iterates through all keys of this aggregation.
For example, a top-level terms aggregation would return all bucketed field values.
For a nested bucket aggregation each key is a tuple of all parent keys as well.
- Parameters
key_separator –
str
Optional separator to concat multiple keys into one stringtuple_key –
bool
If True, the key is always a tuple If False, the key is a string if there is only one key
- Returns
generator
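A sketch of the value access methods on a nested aggregation (field names and output hypothetical):
agg = Search().agg_terms("color", field="color").agg_terms("shape", field="shape")
agg = agg.execute()
list(agg.keys())                    # e.g. [("red", "circle"), ("red", "triangle"), ...]
list(agg.keys(key_separator="|"))   # e.g. ["red|circle", "red|triangle", ...]
list(agg.items())                   # key/value tuples; for bucket aggregations the values are doc_counts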
-
metric
(*aggregation_name_type, **params)¶ Alias for aggregation()
-
metric_avg
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
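A sketch (field names hypothetical). Since return_self defaults to False, the call returns the parent bucket aggregation, so a metric can be attached without breaking the chain:
agg = Search().agg_terms("color", field="color")
agg = agg.metric_avg("avg_price", field="price")  # returns the terms aggregation
for row in agg.execute().dict_rows():
    print(row)  # one dict per bucket, including the metric value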
-
metric_boxplot
(*aggregation_name: Optional[str], field: str, compression: int = 100, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
compression –
int
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_cardinality
(*aggregation_name: Optional[str], field: str, precision_threshold: int = 3000, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
precision_threshold –
int
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_extended_stats
(*aggregation_name: Optional[str], field: str, sigma: float = 3.0, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
sigma –
float
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_geo_bounds
(*aggregation_name: Optional[str], field: str, wrap_longitude: bool = True, return_self: bool = False)¶ A metric aggregation that computes the bounding box containing all geo values for a field.
The Geo Bounds Aggregation is also supported on geo_shape fields.
If wrap_longitude is set to true (the default), the bounding box can overlap the international date line and return a bounds where the top_left longitude is larger than the top_right longitude.
For example, the upper right longitude will typically be greater than the lower left longitude of a geographic bounding box. However, when the area crosses the 180° meridian, the value of the lower left longitude will be greater than the value of the upper right longitude. See Geographic bounding box on the Open Geospatial Consortium website for more information.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
The field defining the geo_point or geo_shapewrap_longitude –
bool
An optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true.return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_geo_centroid
(*aggregation_name: Optional[str], field: str, return_self: bool = False)¶ A metric aggregation that computes the weighted centroid from all coordinate values for geo fields.
The centroid metric for geo-shapes is more nuanced than for points. The centroid of a specific aggregation bucket containing shapes is the centroid of the highest-dimensionality shape type in the bucket. For example, if a bucket contains shapes comprising polygons and lines, then the lines do not contribute to the centroid metric. Each type of shape’s centroid is calculated differently. Envelopes and circles ingested via the Circle ingest processor are treated as polygons.
Warning
Using geo_centroid as a sub-aggregation of
geohash_grid
:The geohash_grid aggregation places documents, not individual geo-points, into buckets. If a document’s geo_point field contains multiple values, the document could be assigned to multiple buckets, even if one or more of its geo-points are outside the bucket boundaries.
If a geo_centroid sub-aggregation is also used, each centroid is calculated using all geo-points in a bucket, including those outside the bucket boundaries. This can result in centroids outside of bucket boundaries.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
The field defining the geo_point or geo_shapereturn_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_matrix_stats
(*aggregation_name: Optional[str], fields: list, mode: str = 'avg', missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.fields –
list
mode –
str
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_max
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_median_absolute_deviation
(*aggregation_name: Optional[str], field: str, compression: int = 1000, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
compression –
int
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_min
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_percentile_ranks
(*aggregation_name: Optional[str], field: str, values: list, keyed: bool = True, hdr__number_of_significant_value_digits: Optional[int] = None, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
values –
list
keyed –
bool
hdr__number_of_significant_value_digits –
Optional[int]
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_percentiles
(*aggregation_name: Optional[str], field: str, percents: list = (1, 5, 25, 50, 75, 95, 99), keyed: bool = True, tdigest__compression: int = 100, hdr__number_of_significant_value_digits: Optional[int] = None, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
percents –
list
keyed –
bool
tdigest__compression –
int
hdr__number_of_significant_value_digits –
Optional[int]
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_rate
(*aggregation_name: Optional[str], unit: str, field: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.unit –
str
field –
Optional[str]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_scripted_metric
(*aggregation_name: Optional[str], map_script: str, combine_script: str, reduce_script: str, init_script: Optional[str] = None, params: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.map_script –
str
combine_script –
str
reduce_script –
str
init_script –
Optional[str]
params –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_stats
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_string_stats
(*aggregation_name: Optional[str], field: str, show_distribution: bool = False, missing: Optional[Any] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
show_distribution –
bool
missing –
Optional[Any]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_sum
(*aggregation_name: Optional[str], field: str, missing: Optional[Any] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
str
missing –
Optional[Any]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_t_test
(*aggregation_name: Optional[str], a__field: str, b__field: str, type: str, a__filter: Optional[dict] = None, b__filter: Optional[dict] = None, script: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.a__field –
str
b__field –
str
type –
str
a__filter –
Optional[dict]
b__filter –
Optional[dict]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_top_hits
(*aggregation_name: Optional[str], size: int, sort: Optional[dict] = None, _source: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.size –
int
sort –
Optional[dict]
_source –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_top_metrics
(*aggregation_name: Optional[str], metrics: dict, sort: Optional[dict] = None, return_self: bool = False)¶ -
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.metrics –
dict
sort –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_value_count
(*aggregation_name: Optional[str], field: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg one might be interested in the number of values the average is computed over.
value_count does not de-duplicate values, so even if a field has duplicates (or a script generates multiple identical values for a single document), each value will be counted individually.
Note
Because value_count is designed to work with any field it internally treats all values as simple bytes. Due to this implementation, if the _value script variable is used to fetch a value instead of accessing the field directly (e.g. a “value script”), the field value will be returned as a string instead of its native format.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.field –
Optional[str]
The field whose values should be countedscript –
Optional[dict]
Alternatively counting the values generated by a scriptreturn_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
-
metric_weighted_avg
(*aggregation_name: Optional[str], value__field: str, weight__field: str, value__missing: Optional[Any] = None, weight__missing: Optional[Any] = None, format: Optional[str] = None, value_type: Optional[str] = None, script: Optional[dict] = None, return_self: bool = False)¶ A single-value metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents. These values can be extracted from specific numeric fields in the documents.
When calculating a regular average, each datapoint has an equal “weight” … it contributes equally to the final value. Weighted averages, on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the document, or provided by a script.
As a formula, a weighted average is
∑(value * weight) / ∑(weight)
A regular average can be thought of as a weighted average where every value has an implicit weight of 1.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.value__field –
str
The field that values should be extracted fromweight__field –
str
The field that weights should be extracted fromvalue__missing –
Optional[Any]
A value to use if the field is missing entirelyweight__missing –
Optional[Any]
A weight to use if the field is missing entirelyformat –
Optional[str]
value_type –
Optional[str]
script –
Optional[dict]
return_self –
bool
If True, this call returns the created metric, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
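A sketch (field names hypothetical). The double-underscore parameters presumably map to the nested value and weight objects of the request body:
agg = Search().agg_terms("class", field="class")
agg.metric_weighted_avg(
    "weighted_grade",
    value__field="grade",
    weight__field="weight",
)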
-
metrics
()[source]¶ Iterate through all contained metric aggregations
- Returns
generator of Aggregation
-
pipeline
(*aggregation_name_type, **params)¶ Alias for aggregation()
-
pipeline_avg_bucket
(*aggregation_name: Optional[str], buckets_path: str, gap_policy: str = 'skip', format: Optional[str] = None, return_self: bool = False)¶ A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.buckets_path –
str
The path to the buckets we wish to find the average for.See: bucket path syntax
gap_policy –
str
The policy to apply when gaps are found in the data.See: gap policy
format –
Optional[str]
Format to apply to the output value of this aggregationreturn_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
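A sketch (names hypothetical). As a sibling pipeline it is created next to the multi-bucket aggregation, with buckets_path pointing into it:
s = Search()
agg = s.agg_terms("color", field="color")
agg.metric_sum("total_size", field="size")
# average of the per-bucket "total_size" sums across all "color" buckets
s.pipeline_avg_bucket("avg_total_size", buckets_path="color>total_size")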
-
pipeline_bucket_script
(*aggregation_name: Optional[str], script: str, buckets_path: Mapping[str, str], gap_policy: str = 'skip', format: Optional[str] = None, return_self: bool = False)¶ A parent pipeline aggregation which executes a script which can perform per bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.script –
str
The script to run for this aggregation. The script can be inline, file or indexed. (see Scripting for more details)buckets_path –
Mapping[str, str]
A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax for more details)gap_policy –
str
The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details)format –
Optional[str]
Format to apply to the output value of this aggregationreturn_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
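A sketch (names hypothetical) computing a per-bucket ratio of two metrics:
agg = Search().agg_terms("color", field="color")
agg.metric_sum("total", field="size")
agg.metric_value_count("count", field="size")
# per-bucket script over the two metrics referenced in buckets_path
agg.pipeline_bucket_script(
    "avg_size",
    script="params.t / params.c",
    buckets_path={"t": "total", "c": "count"},
)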
-
pipeline_derivative
(*aggregation_name: Optional[str], buckets_path: str, gap_policy: str = 'skip', format: Optional[str] = None, units: Optional[str] = None, return_self: bool = False)¶ A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).
- Parameters
aggregation_name –
Optional[str]
Optional name of the aggregation. Otherwise it will be auto-generated.buckets_path –
str
The path to the buckets we wish to find the average for.See: bucket path syntax
gap_policy –
str
The policy to apply when gaps are found in the data.See: gap policy
format –
Optional[str]
Format to apply to the output value of this aggregationunits –
Optional[str]
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response normalized_value which reports the derivative value in the desired x-axis units.return_self –
bool
If True, this call returns the created pipeline, otherwise the parent is returned.
- Returns
'AggregationInterface'
A new instance is created and attached to the parent and the parent is returned, unless ‘return_self’ is True, in which case the new instance is returned.
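A sketch, assuming a date_histogram convenience method like agg_date_histogram with a calendar_interval parameter exists (it is not documented in this section):
s = Search()
agg = s.agg_date_histogram("day", calendar_interval="1d")
agg.metric_sum("total", field="size")
# derivative of the per-day sums
agg.pipeline_derivative("change_per_day", buckets_path="total")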
-
pipelines
()[source]¶ Iterate through all contained pipeline aggregations
- Returns
generator of Aggregation
-
property
plot
¶ Access to
pandas plotting interface
.- Returns
PandasPlotWrapper
instance
-
property
response
¶ Returns the response object of the aggregation
Only available for root aggregations!
- Returns
dict
-
rows
(header: bool = True, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, default=None) → Iterable[list]¶ Iterates through all result values from this aggregation branch.
Each row is a list. The first row contains the names if ‘header’ == True.
This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
- Parameters
header –
bool
If True, the first row contains the names of the columnsinclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
default – This value will be used wherever a value is undefined.
- Returns
generator of list
-
to_dict
(key_separator=None, default=None) → dict¶ Create a dictionary from all key/value pairs.
- Parameters
key_separator – str, optional separator to concat multiple keys into one string
default – If not None any None-value will be replaced by this.
- Returns
dict
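A sketch (field name and output hypothetical):
agg = Search().agg_terms("color", field="color").execute()
agg.to_dict()  # e.g. {"red": 23, "green": 42, "blue": 10}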
-
to_matrix
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None) → Tuple[List[str], List, List]¶ Generate an N-dimensional matrix from the values of this aggregation.
Each dimension corresponds to one of the parent bucket keys that lead to this aggregation.
The values are gathered through the
Aggregation.items
method. So the matrix values are either thedoc_count
of the bucket aggregation or the result of ametric
orpipeline
aggregation that is inside one of the bucket aggregations.
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
...
names, keys, matrix = a.to_matrix()
names == ["color", "shape"]
keys == [["red", "green", "blue"], ["circle", "triangle"]]
matrix == [[23, 42], [84, 69], [4, 10]]
- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.to_matrix(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.
- Returns
A tuple of names, keys and matrix data, each as list.
The names are the names of each aggregation that generates keys.
The keys are a list of lists, each corresponding to all the keys of each parent aggregation.
Data is a list, with other nested lists for each further dimension, containing the values of this aggregation.
Returns three empty lists if no data is available.
-
to_pandas
(index: Union[bool, str] = False, to_index: Union[bool, str] = False, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, dtype=None, default=None)¶ Converts the results of
dict_rows()
to a pandas DataFrame.This will include all parent aggregations (up to the root) and all children aggregations (including metrics).
Any columns containing dates will be automatically converted to pandas.Timestamp.
This method has a synonym:
df
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
pandas
DataFrame
instance
-
values
(default=None)¶ Iterates through all values of this aggregation.
- Parameters
default – If not None any None-value will be replaced by this.
- Returns
generator
-
printing utilities¶
-
class
elastipy.aggregation.aggregation_dump.
AggregationDump
(agg: elastipy.aggregation.aggregation.Aggregation)[source]¶ Bases:
object
-
dict
(key_separator: str = '|', default: Optional[Any] = None, indent: int = 2, file: Optional[TextIO] = None)[source]¶ Print the result of
Aggregation.to_dict
to console.- Parameters
key_separator –
str
Separator to concat multiple keys into one string. Defaults to|
default – If not None any None-value will be replaced by this.
indent – The json indentation, defaults to 2.
file – Optional output stream.
-
hbar
(width: Optional[int] = None, zero_based: bool = True, digits: int = 3, ascii: bool = False, colors: bool = True, file: Optional[TextIO] = None)[source]¶ Print a horizontal bar graphic based on
Aggregation.keys()
andvalues()
to console.- Parameters
width –
int
Maximum width to use. Will be auto-detected ifNone
.zero_based –
bool
IfTrue
start the bars at zero, instead of at the global minimumdigits –
int
Optional number of digits for rounding.colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.file – Optional text stream to print to.
-
heatmap
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, colors: bool = True, ascii: bool = False, **kwargs)[source]¶ Prints a heat-map from a two-dimensional matrix.
- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.heatmap(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.max_width –
int
Will limit the expansion of the table when bars are enabled. If left None, the terminal width is used.file – Optional text stream to print to.
kwargs – TODO list all Heatmap parameters
-
matrix
(indent: int = 2, file: Optional[TextIO] = None, **kwargs)[source]¶ Print a representation of
Aggregation.to_matrix()
to console.- Parameters
indent – The json indentation, defaults to 2.
file – Optional output stream.
kwargs – TODO: list additional to_matrix parameters
-
table
(include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, flat: Union[bool, str, Sequence[str]] = False, sort: Optional[str] = None, digits: Optional[int] = None, header: bool = True, bars: bool = True, zero: Union[bool, float] = True, colors: bool = True, ascii: bool = False, max_width: Optional[int] = None, max_bar_width: int = 40, file: Optional[TextIO] = None)[source]¶ Print the result of the
Aggregation.dict_rows()
function as table to console.- Parameters
include –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removed.exclude –
str
orsequence of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removed.flat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
sort –
str
Optional sort column name which must match a ‘header’ key. Can be prefixed with-
(minus) to reverse orderdigits –
int
Optional number of digits for rounding.header –
bool
if True, include the names in the first row.bars –
bool
Enable display of horizontal bars in each number column. The table width will stretch, limited by ‘max_width’ and ‘max_bar_width’zero –
If
True
: the bar axis starts at zero (or at a negative value if appropriate).If
False
: the bar starts at the minimum of all values in the column.If a number is provided, the bar starts there, regardless of the minimum of all values.
colors –
bool
Enable console colors.ascii –
bool
IfTrue
fall back to ascii characters.max_width –
int
Will limit the expansion of the table when bars are enabled. If left None, the terminal width is used.max_bar_width –
int
The maximum size a bar should havefile – Optional text stream to print to.
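A sketch of the printing interface, reached through the Aggregation.dump property (field names hypothetical):
agg = Search().agg_terms("color", field="color")
agg.metric_avg("avg_size", field="size")
agg = agg.execute()
agg.dump.table(digits=2, colors=False)  # console table of all rows
agg.dump.hbar()                         # horizontal bar graphic of keys/values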
-
plotting¶
-
class
elastipy.plot.aggregation_plot_pd.
PandasPlotWrapper
(agg: elastipy.aggregation.aggregation.Aggregation)[source]¶ Bases:
object
This is a short-hand accessor to the pandas.DataFrame.plot interface.
The documented parameters below will be passed to
Aggregation.to_pandas
. All other parameters are passed to the respective functions of the pandas interface.
s = Search()
s.agg_terms("idx", field="a").execute().plot(
    to_index="idx",
    kind="bar",
)
- Parameters
index –
bool
orstr
Sets a specific column as the index of the DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
Note
The column is kept in the DataFrame. If you want to set a column as index and remove it from the columns, use
to_index
.to_index –
bool
orstr
Same asindex
but the column is removed from DataFrame.If
False
no explicit index is set.If
True
the root aggregation’s keys will be the index.if
str
explicitly set a certain column as the DataFrame index.
include –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that does not fit a pattern is removedexclude –
str or list of str
Can be one or more (OR-combined) wildcard patterns. If used, any column that fits a pattern is removedflat –
bool
,str
orsequence of str
Can be one or more aggregation names that should be flattened out, meaning that each key of the aggregation creates a new column instead of a new row. IfTrue
, all bucket aggregations are flattened.Only supported for bucket aggregations!
Note
Currently not supported for the root aggregation!
dtype – Numpy data type to force. Only a single dtype is allowed. If None, infer.
default – This value will be used wherever a value is undefined.
- Returns
matplotlib.axes.Axes
or numpy.ndarray of them. If the backend is not the default matplotlib one, the return value will be the object returned by the backend.
-
heatmap
(sort: Optional[Union[bool, str, int, Sequence[Union[str, int]]]] = None, default: Optional[Any] = None, replace=None, include: Optional[Union[str, Sequence[str]]] = None, exclude: Optional[Union[str, Sequence[str]]] = None, transpose: bool = False, figsize: Optional[Tuple[Union[int, float], Union[int, float]]] = None, **kwargs)[source]¶ Plots a heatmap using the data from
Aggregation.df_matrix
Pandas’ default plotting backend is matplotlib. In this case seaborn.heatmap is used and the
seaborn
package must be installed along withpandas
andmatplotlib
.The plotly backend is also supported, in which case the plotly.express.imshow function (https://plotly.com/python/imshow/) is used.
In matplotlib-mode, the
figsize
parameter will create a new Axes before calling seaborn.heatmap. For plotly it’s ignored.The documented parameters below are passed to
Aggregation.df_matrix
, generating a pandas.DataFrame. All other parameters are passed to the heatmap function.Labels can be defined in plotly with the
labels
parameter, e.g.labels={"x": "date", "y": "temperature", "color": "date.doc_count"}
. Iflabels
or any of the keys are not defined they will be set to the name of each aggregation.color
will either be<bucket-agg-name>.doc_count
or<metric-name>
(or pipeline).- Parameters
sort –
Can sort one or several keys/axes.
True
sorts all keys ascending"-"
sorts all keys descendingThe name of an aggregation sorts its keys ascending. A “-” prefix sorts descending.
An integer defines the aggregation by index. Negative integers sort descending.
A sequence of strings or integers can sort multiple keys
For example, agg.to_matrix(sort=(“color”, “-shape”, -4)) would sort the
color
keys ascending, theshape
keys descending and the 4th aggregation (whatever that is) descending.default – If not None any None-value will be replaced by this value
include –
str | seq[str]
One or more wildcard patterns that include matching keys. All other keys are removed from the output.exclude –
str | seq[str]
One or more wildcard patterns that exclude matching keys.replace –
str, regex, list, dict, Series, int, float, or None
If not None, the
pandas.DataFrame.replace
function will be called with this parameter as theto_replace
parameter.
transpose –
bool
Transposes the matrix, i.e. exchanges the X and Y axis.
- Parameters
figsize –
tuple of ints or floats
Optional tuple to change the size of the plot when the plotting backend ismatplotlib
.int
values will be passed tomatplotlib.axes.Axes
unchanged. Afloat
value defines the size in terms of the number of keys per axis and is converted to int withint(len(keys) * value)
kwargs – Passed to
seaborn.heatmap()
- Returns
matplotlib.axes.Axes
Axis object with the heatmap.
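A sketch for a two-dimensional terms aggregation (field names hypothetical; assumes seaborn is installed and matplotlib is the pandas plotting backend):
a = Search().agg_terms("color", field="color")
a = a.agg_terms("shape", field="shape")
a.execute().plot.heatmap(sort=True, default=0, figsize=(8, 6))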
-
hexbin
(x, y, C=None, reduce_C_function=None, gridsize=None, **kwargs)[source]¶ Generate a hexagonal binning plot.
-
kde
(bw_method=None, ind=None, **kwargs)[source]¶ Generate Kernel Density Estimate plot using Gaussian kernels.