Running metastore queries in a distributed manner using query engine workers:
- Build a distributed metastore plan from GetIndexes and execute it via the v2 scheduler/worker pipeline
- Split the request into per-index PointersScan tasks, then fan-in and CollectSections to produce final section descriptors
- Add physical plan + protobuf support for Merge/PointersScan, improve tracing/cleanup, and add coverage for planner/workflow/proto roundtrips
The new worker package connects to an instance of a scheduler (#19570)
for task assignment and execution. A worker spawns a fixed number of
threads, each of which executes one task at a time.
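The threading model described above can be sketched roughly as follows. This is a minimal illustration only, not the worker package's actual API: `Task`, `runWorker`, and the channel standing in for the scheduler connection are all hypothetical names, and goroutines play the role of the worker threads.

```go
package main

import (
	"fmt"
	"sync"
)

// Task is a hypothetical unit of work assigned by a scheduler.
type Task struct {
	ID int
}

// runWorker starts numThreads goroutines. Each goroutine takes one task at a
// time from the tasks channel and finishes it before taking the next one.
// It returns the IDs of all executed tasks once the channel is closed and
// drained.
func runWorker(numThreads int, tasks <-chan Task) []int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []int
	)
	for i := 0; i < numThreads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range tasks {
				// "Execute" the task; this sketch only records its ID.
				mu.Lock()
				done = append(done, t.ID)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return done
}

func main() {
	// The channel is a stand-in for task assignment by the scheduler.
	tasks := make(chan Task)
	go func() {
		defer close(tasks)
		for id := 1; id <= 5; id++ {
			tasks <- Task{ID: id}
		}
	}()

	done := runWorker(2, tasks)
	fmt.Println("tasks executed:", len(done))
}
```

Because the number of goroutines is fixed, the number of tasks in flight on a single worker never exceeds `numThreads`.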
Signed-off-by: Robert Fratto <robertfratto@gmail.com>
### Summary
The v1 engine has a mechanism to rename labels in case they have the same name but different origin, such as labels, structured metadata, or parsed fields.
1. If a log line has a structured metadata key with the same name as a label of the stream, then the metadata key is suffixed with `_extracted`. For example, if `service` exists in both `labels` and `metadata`, the metadata key becomes `service_extracted`.
2. If a parser creates a parsed field with the same name as a label of the stream, then the parsed key is suffixed with `_extracted` in the same way as in case 1. However, if the field name also collides with a structured metadata key, then the extracted structured metadata is replaced with the extracted parsed field.
This PR only implements the first case. The second case needs to be implemented in a follow-up PR. Additionally, the newly introduced "compatibility node" should be made optional, via a feature flag and/or a per-request setting.
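As a rough illustration of the first case, the renaming rule could look like the sketch below. It operates on plain maps rather than on the engine's record batches, and `resolveCollisions` is a hypothetical helper, not a function from this PR. It also ignores the corner case where the metadata already contains a key ending in `_extracted`.

```go
package main

import "fmt"

// resolveCollisions returns a copy of metadata in which every key that also
// exists in labels is renamed to "<key>_extracted". Labels are left untouched.
func resolveCollisions(labels, metadata map[string]string) map[string]string {
	out := make(map[string]string, len(metadata))
	for k, v := range metadata {
		if _, collides := labels[k]; collides {
			k += "_extracted"
		}
		out[k] = v
	}
	return out
}

func main() {
	labels := map[string]string{"service": "api"}
	metadata := map[string]string{"service": "checkout", "trace_id": "abc"}

	// "service" collides with a label and is renamed; "trace_id" is kept.
	fmt.Println(resolveCollisions(labels, metadata))
}
```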
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
In the new engine, we need fully qualified column names, since columns from different sources can have the same name.
Right now, the distinction between columns with the same name is implemented using the `Metadata` field on the `arrow.Field`. However, it is quite cumbersome to parse the column type and data type from this generic map.
This PR introduces a package with naming conventions for columns. A column is identified by its name, data type, and column type, so this information can be encoded directly into the `Name` field of the `arrow.Field`. The convention is defined as:
```
[DATA_TYPE].[COLUMN_TYPE].[COLUMN_NAME]
```
#### Examples:
* `utf8.label.service_name`
* `timestamp_ns.builtin.timestamp`
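A minimal sketch of encoding and parsing such names is shown below. The `Identifier` type and the `FQN` and `ParseFQN` functions are hypothetical and use plain strings; the package introduced in this PR may differ. The sketch assumes that only the first two dots act as separators, so a column name that itself contains dots is kept intact.

```go
package main

import (
	"fmt"
	"strings"
)

// Identifier is a hypothetical representation of a fully qualified column.
type Identifier struct {
	DataType   string
	ColumnType string
	Name       string
}

// FQN encodes the identifier as [DATA_TYPE].[COLUMN_TYPE].[COLUMN_NAME].
func (id Identifier) FQN() string {
	return id.DataType + "." + id.ColumnType + "." + id.Name
}

// ParseFQN splits a fully qualified name into its three parts. Splitting is
// limited to three parts so that dots inside the column name are preserved.
func ParseFQN(fqn string) (Identifier, error) {
	parts := strings.SplitN(fqn, ".", 3)
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" || parts[2] == "" {
		return Identifier{}, fmt.Errorf("invalid fully qualified column name %q", fqn)
	}
	return Identifier{DataType: parts[0], ColumnType: parts[1], Name: parts[2]}, nil
}

func main() {
	id, err := ParseFQN("utf8.label.service_name")
	if err != nil {
		panic(err)
	}
	fmt.Println(id.DataType, id.ColumnType, id.Name)

	// Encoding the identifier again yields the original name.
	fmt.Println(id.FQN())
}
```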
The column type can easily be converted into a `Scope`, which is defined by an origin and type.
The mapping is as follows:
```
ColumnTypeBuiltin -> Scope{Record, Builtin}
ColumnTypeMetadata -> Scope{Record, Attribute}
ColumnTypeLabel -> Scope{Resource, Attribute}
ColumnTypeParsed -> Scope{Generated, Attribute}
ColumnTypeGenerated -> Scope{Generated, Builtin}
ColumnTypeAmbiguous -> Scope{Unscoped, Attribute}
```
---
Signed-off-by: Christian Haudum <christian.haudum@gmail.com>
This moves packages around to reduce the surface area of the public engine API:
* `pkg/engine/planner` moves to `pkg/engine/internal/planner`
* `pkg/engine/executor` moves to `pkg/engine/internal/executor`
These packages were only used from `pkg/engine` and did not need to be public.
We may make them public again in the future if we want to expose subcomponents
of the engine.
This move means that `pkg/engine/planner/internal/tree` became
`pkg/engine/internal/planner/internal/tree`. To reduce the import path, I also
moved that package to `pkg/engine/internal/util/tree`.
Other than moving files and updating import paths, no code changes are made.
Signed-off-by: Robert Fratto <robertfratto@gmail.com>