Self monitoring metric
Last updated on Mon Jun 24 08:16:10 2024 by stone1100
Each LinDB component exposes self-monitoring metrics that help users understand its running status.
By default, LinDB periodically stores the latest self-monitoring metric data in the _internal database.
Metrics fall into the following types:
- General: General metrics, such as CPU, Mem, network, etc., applicable to Root, Broker, Storage;
- Broker: Broker internal monitoring metrics;
- Storage: Storage internal monitoring metrics;
All metrics are labeled with global tags as follows:
- node: component's node;
TIP
Because LinDB supports multiple storage clusters (Storage) under one compute cluster (Broker), metrics reported by Storage carry an additional 'namespace' tag that identifies the storage cluster.
General
Go Runtime
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.runtime | - | go_goroutines | the number of goroutines |
lindb.runtime | - | go_threads | the number of records in the thread creation profile |
lindb.runtime.mem | - | alloc | bytes of allocated heap objects |
lindb.runtime.mem | - | total_alloc | cumulative bytes allocated for heap objects |
lindb.runtime.mem | - | sys | the total bytes of memory obtained from the OS |
lindb.runtime.mem | - | lookups | the number of pointer lookups performed by the runtime |
lindb.runtime.mem | - | mallocs | the cumulative count of heap objects allocated |
lindb.runtime.mem | - | frees | the cumulative count of heap objects freed |
lindb.runtime.mem | - | heap_alloc | bytes of allocated heap objects |
lindb.runtime.mem | - | heap_sys | bytes of heap memory obtained from the OS |
lindb.runtime.mem | - | heap_idle | bytes in idle (unused) spans |
lindb.runtime.mem | - | heap_inuse | bytes in in-use spans |
lindb.runtime.mem | - | heap_released | bytes of physical memory returned to the OS |
lindb.runtime.mem | - | heap_objects | the number of allocated heap objects |
lindb.runtime.mem | - | stack_inuse | bytes in stack spans |
lindb.runtime.mem | - | stack_sys | bytes of stack memory obtained from the OS |
lindb.runtime.mem | - | mspan_inuse | bytes of allocated mspan structures |
lindb.runtime.mem | - | mspan_sys | bytes of memory obtained from the OS for mspan |
lindb.runtime.mem | - | mcache_inuse | bytes of allocated mcache structures |
lindb.runtime.mem | - | mcache_sys | bytes of memory obtained from the OS for mcache structures |
lindb.runtime.mem | - | buck_hash_sys | bytes of memory in profiling bucket hash tables |
lindb.runtime.mem | - | gc_sys | bytes of memory in garbage collection metadata |
lindb.runtime.mem | - | other_sys | bytes of memory in miscellaneous off-heap |
lindb.runtime.mem | - | next_gc | the target heap size of the next GC cycle |
lindb.runtime.mem | - | last_gc | the time the last garbage collection finished |
lindb.runtime.mem | - | gc_cpu_fraction | the fraction of this program's available CPU time used by the GC since the program started |
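The field names and descriptions above closely follow Go's standard `runtime.MemStats` structure and the `threadcreate` profile. The sketch below shows how such values can be read in any Go program, which may help when interpreting them. It is not taken from LinDB's source: the `runtimeSnapshot` type and the `readRuntime` function are hypothetical names used only for illustration, and how LinDB actually collects these values is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// runtimeSnapshot is a hypothetical container for a few of the values
// listed under lindb.runtime and lindb.runtime.mem.
type runtimeSnapshot struct {
	Goroutines int    // corresponds to go_goroutines
	Threads    int    // corresponds to go_threads
	HeapAlloc  uint64 // corresponds to heap_alloc
	HeapSys    uint64 // corresponds to heap_sys
	HeapInuse  uint64 // corresponds to heap_inuse
	NextGC     uint64 // corresponds to next_gc
}

// readRuntime reads the values from the Go standard library.
func readRuntime() runtimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// The thread creation profile counts records of OS thread creation.
	threads := 0
	if p := pprof.Lookup("threadcreate"); p != nil {
		threads = p.Count()
	}

	return runtimeSnapshot{
		Goroutines: runtime.NumGoroutine(),
		Threads:    threads,
		HeapAlloc:  m.HeapAlloc,
		HeapSys:    m.HeapSys,
		HeapInuse:  m.HeapInuse,
		NextGC:     m.NextGC,
	}
}

func main() {
	s := readRuntime()
	fmt.Printf("goroutines=%d threads=%d heap_alloc=%d heap_sys=%d heap_inuse=%d next_gc=%d\n",
		s.Goroutines, s.Threads, s.HeapAlloc, s.HeapSys, s.HeapInuse, s.NextGC)
}
```

A steadily growing `heap_inuse` relative to `heap_sys`, or a rising `go_goroutines`, is usually the first thing to check when a node's memory footprint increases.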
System
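Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.monitor.system.cpu_stat | - | idle | CPU time that's not actively being used |
lindb.monitor.system.cpu_stat | - | nice | CPU time used by processes that have a positive niceness |
lindb.monitor.system.cpu_stat | - | system | CPU time used by the kernel |
lindb.monitor.system.cpu_stat | - | user | CPU time used by user space processes |
lindb.monitor.system.cpu_stat | - | irq | CPU time spent servicing hardware interrupt requests (IRQs) |
lindb.monitor.system.cpu_stat | - | steal | The percentage of time a virtual CPU waits for a real CPU |
lindb.monitor.system.cpu_stat | - | softirq | CPU time spent servicing software interrupts (softirqs) |
lindb.monitor.system.cpu_stat | - | iowait | CPU time spent waiting for input or output operations |
lindb.monitor.system.mem_stat | - | total | Total amount of RAM on this system |
lindb.monitor.system.mem_stat | - | used | RAM used by programs |
lindb.monitor.system.mem_stat | - | free | Free RAM |
lindb.monitor.system.mem_stat | - | usage | Percentage of RAM used by programs |
lindb.monitor.system.disk_usage_stats | - | total | Total amount of disk |
lindb.monitor.system.disk_usage_stats | - | used | Disk used by programs |
lindb.monitor.system.disk_usage_stats | - | free | Free disk |
lindb.monitor.system.disk_usage_stats | - | usage | Percentage of disk used by programs |
lindb.monitor.system.disk_inodes_stats | - | total | Total number of inodes |
lindb.monitor.system.disk_inodes_stats | - | used | Inodes used by programs |
lindb.monitor.system.disk_inodes_stats | - | free | Free inodes |
lindb.monitor.system.disk_inodes_stats | - | usage | Percentage of inodes used by programs |
lindb.monitor.system.net_stat | interface | bytes_sent | number of bytes sent |
lindb.monitor.system.net_stat | interface | bytes_recv | number of bytes received |
lindb.monitor.system.net_stat | interface | packets_sent | number of packets sent |
lindb.monitor.system.net_stat | interface | packets_recv | number of packets received |
lindb.monitor.system.net_stat | interface | errin | total number of errors while receiving |
lindb.monitor.system.net_stat | interface | errout | total number of errors while sending |
lindb.monitor.system.net_stat | interface | dropin | total number of incoming packets which were dropped |
lindb.monitor.system.net_stat | interface | dropout | total number of outgoing packets which were dropped (always 0 on OSX and BSD) |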
Network
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.traffic.tcp | addr | accept_conns | accept total count |
lindb.traffic.tcp | addr | accept_failures | accept failure |
lindb.traffic.tcp | addr | active_conns | current active connections |
lindb.traffic.tcp | addr | reads | read total count |
lindb.traffic.tcp | addr | read_bytes | read byte size |
lindb.traffic.tcp | addr | read_failures | read failure |
lindb.traffic.tcp | addr | writes | write total count |
lindb.traffic.tcp | addr | write_bytes | write byte size |
lindb.traffic.tcp | addr | write_failures | write failure |
lindb.traffic.tcp | addr | close_conns | close total count |
lindb.traffic.tcp | addr | close_failures | close failure |
lindb.traffic.grpc_client.unary | grpc_service grpc_method | failures | grpc unary client handle msg failure |
lindb.traffic.grpc_client.unary.duration | grpc_service grpc_method | histogram | grpc unary client handle msg duration |
lindb.traffic.grpc_server.unary | grpc_service grpc_method | failures | grpc unary server handle msg failure |
lindb.traffic.grpc_server.unary.duration | grpc_service grpc_method | histogram | grpc unary server handle msg duration |
lindb.traffic.grpc_client.stream | grpc_service grpc_service grpc_method | msg_received_failures | grpc client receive msg failure |
lindb.traffic.grpc_client.stream | grpc_service grpc_service grpc_method | msg_sent_failures | grpc client send msg failure |
lindb.traffic.grpc_client.stream.received_duration | grpc_service grpc_service grpc_method | histogram | grpc client receive msg duration, include receive total count/handle duration |
lindb.traffic.grpc_client.stream.sent_duration | grpc_service grpc_service grpc_method | histogram | grpc client send msg duration, include send total count |
lindb.traffic.grpc_server.stream | grpc_service grpc_service grpc_method | msg_received_failures | grpc server receive msg failure |
lindb.traffic.grpc_server.stream | grpc_service grpc_service grpc_method | msg_sent_failures | grpc server send msg failure |
lindb.traffic.grpc_server.stream.received_duration | grpc_service grpc_service grpc_method | histogram | grpc server receive msg duration, include receive total count/handle duration |
lindb.traffic.grpc_server.stream.sent_duration | grpc_service grpc_service grpc_method | histogram | grpc server send msg duration, include send total count |
lindb.traffic.grpc_server | - | panics | panic when grpc server handle request |
Concurrent
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.concurrent.pool | pool_name | workers_alive | current workers count in use |
lindb.concurrent.pool | pool_name | workers_created | workers created count since start |
lindb.concurrent.pool | pool_name | workers_killed | workers killed count since start |
lindb.concurrent.pool | pool_name | tasks_consumed | workers consumed count |
lindb.concurrent.pool | pool_name | tasks_rejected | workers rejected count |
lindb.concurrent.pool | pool_name | tasks_panic | workers execute panic count |
lindb.concurrent.pool.tasks_waiting_duration | pool_name | histogram | task waiting time |
lindb.concurrent.pool.tasks_executing_duration | pool_name | histogram | task executing time with waiting period |
lindb.concurrent.limit | type | throttle_requests | number of requests throttled after reaching the max concurrency |
lindb.concurrent.limit | type | timeout_requests | number of requests that timed out while pending |
lindb.concurrent.limit | type | processed | number of processed requests |
Coordinator
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.coordinator.state_manager | type,coordinator | handle_events | handle coordinator event success count |
lindb.coordinator.state_manager | type,coordinator | handle_event_failures | handle coordinator event failure count |
lindb.coordinator.state_manager | type,coordinator | panics | panic count when handling coordinator event |
Query
Applicable to Root, Broker.
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.query | - | created_tasks | create query tasks |
lindb.query | - | alive_tasks | current executing tasks(alive) |
lindb.query | - | expire_tasks | task expire, long-term no response |
lindb.query | - | emitted_responses | emit response to parent node |
lindb.query | - | omitted_responses | omit response because task evicted |
lindb.task.transport | - | sent_requests | send request successfully |
lindb.task.transport | - | sent_requests_failures | send request failure |
lindb.task.transport | - | sent_responses | send response successfully |
lindb.task.transport | - | sent_responses_failures | send response failure |
Broker
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.master.shard.leader | - | elections | shard leader elect successfully |
lindb.master.shard.leader | - | elect_failures | shard leader elect failure |
lindb.master.controller | - | failovers | master fail over successfully |
lindb.master.controller | - | failover_failures | master fail over failure |
lindb.master.controller | - | reassigns | master reassign successfully |
lindb.master.controller | - | reassign_failures | master reassign failure |
lindb.http.ingest_duration | path | histogram | ingest duration(include count) |
lindb.ingestion.proto | - | data_corrupted | corrupted when parse |
lindb.ingestion.proto | - | ingested_metrics | ingested metrics |
lindb.ingestion.proto | - | read_bytes | read data bytes |
lindb.ingestion.proto | - | dropped_metrics | drop metrics when append |
lindb.ingestion.flat | - | data_corrupted | corrupted when parse |
lindb.ingestion.flat | - | ingested_metrics | ingested metrics |
lindb.ingestion.flat | - | read_bytes | read data bytes |
lindb.ingestion.flat | - | dropped_metrics | drop metrics when append |
size | block | read data block size | |
lindb.ingestion.influx | - | data_corrupted | corrupted when parse |
lindb.ingestion.influx | - | ingested_metrics | ingested metrics |
lindb.ingestion.influx | - | ingested_fields | ingested fields |
lindb.ingestion.influx | - | read_bytes | read data bytes |
lindb.ingestion.influx | - | dropped_metrics | drop metrics when append |
lindb.ingestion.influx | - | dropped_fields | drop fields when append |
lindb.broker.database.write | db | out_of_time_range | timestamp of metrics out of acceptable write time range |
lindb.broker.database.write | db | shard_not_found | shard not found count |
lindb.broker.family.write | db | active_families | number of current active replica family channel |
lindb.broker.family.write | db | batch_metrics | batch into memory chunk success count |
lindb.broker.family.write | db | batch_metrics_failures | batch into memory chunk failure count |
lindb.broker.family.write | db | pending_send | number of pending send message |
lindb.broker.family.write | db | send_success | send message success count |
lindb.broker.family.write | db | send_failures | send message failure count |
lindb.broker.family.write | db | send_size | bytes of send message |
lindb.broker.family.write | db | retry | retry count |
lindb.broker.family.write | db | retry_drop | number of drop message after too many retry |
lindb.broker.family.write | db | create_stream | create replica stream success count |
lindb.broker.family.write | db | create_stream_failures | create replica stream failure count |
lindb.broker.family.write | db | close_stream | close replica stream success count |
lindb.broker.family.write | db | close_stream_failures | close replica stream failure count |
lindb.broker.family.write | db | leader_changed | shard leader changed |
Storage
Metric Name | Tags | Fields | Description |
---|---|---|---|
lindb.storage.wal | db shard | receive_write_bytes | receive write request bytes(broker->leader) |
lindb.storage.wal | db shard | write_wal | write wal successfully(broker->leader) |
lindb.storage.wal | db shard | write_wal_failures | write wal failure(broker->leader) |
lindb.storage.wal | db shard | receive_replica_bytes | receive replica request bytes(storage leader->follower) |
lindb.storage.wal | db shard | replica_wal | replica wal successfully(storage leader->follower) |
lindb.storage.wal | db shard | replica_wal_failures | replica wal failure(storage leader->follower) |
lindb.storage.replicator.runner | type db shard | active_replicators | number of current active local replicators |
lindb.storage.replicator.runner | type db shard | replica_panics | replica panic count |
lindb.storage.replicator.runner | type db shard | consume_msg | get message successfully count |
lindb.storage.replicator.runner | type db shard | consume_msg_failures | get message failure count |
lindb.storage.replicator.runner | type db shard | replica_lag | replica lag message count |
lindb.storage.replicator.runner | type db shard | replica_bytes | bytes of replica data |
lindb.storage.replicator.runner | type db shard | replicas | replica success count |
lindb.storage.replica.local | db shard | decompress_failures | decompress message failure count |
lindb.storage.replica.local | db shard | replica_failures | replica failure count |
lindb.storage.replica.local | db shard | replica_rows | row number of replica |
lindb.storage.replica.local | db shard | ack_sequence | ack persist sequence count |
lindb.storage.replica.local | db shard | invalid_sequence | invalid replica sequence count |
lindb.storage.replica.remote | db shard | not_ready | remote replicator channel not ready |
lindb.storage.replica.remote | db shard | follower_offline | remote follower node offline |
lindb.storage.replica.remote | db shard | need_close_last_stream | need close last stream, when do re-connection |
lindb.storage.replica.remote | db shard | close_last_stream_failures | close last stream failure |
lindb.storage.replica.remote | db shard | create_replica_cli | create replica client successfully |
lindb.storage.replica.remote | db shard | create_replica_cli_failures | create replica client failure |
lindb.storage.replica.remote | db shard | create_replica_stream | create replica stream successfully |
lindb.storage.replica.remote | db shard | create_replica_stream_failures | create replica stream failure |
lindb.storage.replica.remote | db shard | get_last_ack_failures | get last ack sequence from remote follower failure |
lindb.storage.replica.remote | db shard | reset_follower_append_idx | reset follower append index successfully |
lindb.storage.replica.remote | db shard | reset_follower_append_idx_failures | reset follower append index failure |
lindb.storage.replica.remote | db shard | reset_append_idx | reset current leader local append index |
lindb.storage.replica.remote | db shard | reset_replica_idx | reset current leader replica index successfully |
lindb.storage.replica.remote | db shard | reset_replica_failures | reset current leader replica index failure |
lindb.storage.replica.remote | db shard | send_msg | send replica msg successfully |
lindb.storage.replica.remote | db shard | send_msg_failures | send replica msg failure |
lindb.storage.replica.remote | db shard | receive_msg | receive replica resp successfully |
lindb.storage.replica.remote | db shard | receive_msg_failures | receive replica resp failure |
lindb.storage.replica.remote | db shard | ack_sequence | ack replica successfully sequence count |
lindb.storage.replica.remote | db shard | invalid_ack_sequence | get wrong replica ack sequence from follower |
lindb.tsdb.indexdb | db | build_inverted_index | build inverted index count |
lindb.tsdb.memdb | db | allocated_pages | allocate temp memory page successfully |
lindb.tsdb.memdb | db | allocate_page_failures | allocate temp memory page failure |
lindb.tsdb.database | db | metadb_flush_failures | flush metadata database failure |
lindb.tsdb.database.metadb_flush_duration | db | histogram | flush metadata database duration(include count) |
lindb.tsdb.metadb | db | gen_metric_ids | generate metric id successfully |
lindb.tsdb.metadb | db | gen_metric_id_failures | generate metric id failure |
lindb.tsdb.metadb | db | gen_tag_key_ids | generate tag key id successfully |
lindb.tsdb.metadb | db | gen_tag_key_id_failures | generate tag key id failure |
lindb.tsdb.metadb | db | gen_field_ids | generate field id successfully |
lindb.tsdb.metadb | db | gen_field_id_failures | generate field id failure |
lindb.tsdb.metadb | db | gen_tag_value_ids | generate tag value id successfully |
lindb.tsdb.metadb | db | gen_tag_value_id_failures | generate tag value id failure |
lindb.tsdb.shard | db shard | active_families | number of current active families |
lindb.tsdb.shard | db shard | write_batches | write batch count |
lindb.tsdb.shard | db shard | write_metrics | write metric success count |
lindb.tsdb.shard | db shard | write_fields | write field data point success count |
lindb.tsdb.shard | db shard | write_metrics_failures | write metric failures |
lindb.tsdb.shard | db shard | memdb_total_size | total memory size of memory database |
lindb.tsdb.shard | db shard | active_memdbs | number of current active memory database |
lindb.tsdb.shard | db shard | memdb_flush_failures | flush memory database failure |
lindb.tsdb.shard | db shard | lookup_metric_meta_failures | lookup meta of metric failure |
lindb.tsdb.shard | db shard | indexdb_flush_failures | flush index database failure |
lindb.tsdb.shard.memdb_flush_duration | db shard | histogram | flush memory database duration(include count) |
lindb.tsdb.shard.indexdb_flush_duration | db shard | indexdb_flush_duration | flush index database duration(include count) |
lindb.kv.table.cache | - | evicts | evict reader from cache |
lindb.kv.table.cache | - | cache_hits | get reader hit cache |
lindb.kv.table.cache | - | cache_misses | get reader miss cache |
lindb.kv.table.cache | - | closes | close reader successfully |
lindb.kv.table.cache | - | close_failures | close reader failure |
lindb.kv.table.cache | - | active_readers | number of active reader in cache |
lindb.kv.table.read | - | gets | get data by key successfully |
lindb.kv.table.read | - | get_failures | get data by key failures |
lindb.kv.table.read | - | read_bytes | bytes of read data |
lindb.kv.table.read | - | mmaps | map file successfully |
lindb.kv.table.read | - | mmap_failures | map file failure |
lindb.kv.table.read | - | unmmaps | unmap file successfully |
lindb.kv.table.read | - | unmmap_failures | unmap file failure |
lindb.kv.table.write | - | bad_keys | add bad key count |
lindb.kv.table.write | - | add_keys | add key successfully |
lindb.kv.table.write | - | write_bytes | bytes of write data |
lindb.kv.compaction | type | compacting | number of compacting jobs |
lindb.kv.compaction | type | failure | compact failure |
lindb.kv.compaction.duration | type | histogram | compact duration(include count) |
lindb.kv.flush | - | flushing | number of flushing jobs |
lindb.kv.flush | - | failure | flush job failure |
lindb.kv.flush.duration | - | histogram | flush duration(include count) |
lindb.storage.query | - | metric_queries | execute metric query successfully(just plan it) |
lindb.storage.query | - | metric_query_failures | execute metric query failure |
lindb.storage.query | - | meta_queries | metadata query successfully |
lindb.storage.query | - | meta_query_failures | metadata query failure |
lindb.storage.query | - | omitted_requests | omit request(task no belong to current node, wrong stream etc.) |
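Many of the Storage metrics come in pairs, such as `cache_hits` and `cache_misses`, or a success counter and its `_failures` counterpart. A pair like this is often easier to alert on as a ratio than as two raw counts. The sketch below is a minimal illustration with a hypothetical helper named `ratio`. It assumes both inputs are counts taken over the same time window, which the table above does not specify.

```go
package main

import "fmt"

// ratio returns part / (part + rest), or 0 when both counts are zero.
// Use it for a hit ratio (hits, misses) or a failure rate
// (failures, successes), with both counts from the same time window.
func ratio(part, rest uint64) float64 {
	total := part + rest
	if total == 0 {
		return 0
	}
	return float64(part) / float64(total)
}

func main() {
	// Example values only, not real LinDB output.
	cacheHits, cacheMisses := uint64(900), uint64(100)
	fmt.Printf("reader cache hit ratio: %.2f\n", ratio(cacheHits, cacheMisses))

	getFailures, gets := uint64(5), uint64(995)
	fmt.Printf("get failure rate: %.3f\n", ratio(getFailures, gets))
}
```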