Self monitoring

Each LinDB component exposes self-monitoring metrics that help users understand its running status.

By default, LinDB periodically stores the latest self-monitoring metric data in the '_internal' database.

The metrics fall into the following types:

  • General: general metrics such as CPU, memory, and network, applicable to both Broker and Storage;
  • Broker: internal monitoring metrics of Broker;
  • Storage: internal monitoring metrics of Storage.

All metrics are labeled with global tags as follows:

  • node: the node the component runs on.

TIP

Because LinDB supports multiple storage clusters (Storage) under one compute cluster (Broker), metrics reported by Storage carry an additional 'namespace' tag that identifies the storage cluster.
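
As a rough illustration of how these tags can be used, the Python sketch below sums one field per 'namespace'. The sample records, their values, and the helper name sum_field_by_tag are invented for illustration; they are not LinDB output or a LinDB API.

```python
from collections import defaultdict

# Hypothetical samples as (metric name, tags, fields); all values are invented.
SAMPLES = [
    ("lindb.tsdb.shard", {"node": "node-1", "namespace": "cluster-a"}, {"write_metrics": 120.0}),
    ("lindb.tsdb.shard", {"node": "node-2", "namespace": "cluster-a"}, {"write_metrics": 80.0}),
    ("lindb.tsdb.shard", {"node": "node-3", "namespace": "cluster-b"}, {"write_metrics": 50.0}),
]


def sum_field_by_tag(samples, metric, field, tag):
    """Sum `field` of `metric` across samples, grouped by the value of `tag`."""
    totals = defaultdict(float)
    for name, tags, fields in samples:
        if name == metric and field in fields:
            totals[tags.get(tag, "")] += fields[field]
    return dict(totals)


print(sum_field_by_tag(SAMPLES, "lindb.tsdb.shard", "write_metrics", "namespace"))
```

Grouping by 'node' instead of 'namespace' gives the per-node view of the same field.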

General

Go Runtime

Metric Name | Tags | Fields | Description
lindb.runtime | - | go_goroutines | the number of goroutines
lindb.runtime | - | go_threads | the number of records in the thread creation profile
lindb.runtime.mem | - | alloc | bytes of allocated heap objects
lindb.runtime.mem | - | total_alloc | cumulative bytes allocated for heap objects
lindb.runtime.mem | - | sys | the total bytes of memory obtained from the OS
lindb.runtime.mem | - | lookups | the number of pointer lookups performed by the runtime
lindb.runtime.mem | - | mallocs | the cumulative count of heap objects allocated
lindb.runtime.mem | - | frees | the cumulative count of heap objects freed
lindb.runtime.mem | - | heap_alloc | bytes of allocated heap objects
lindb.runtime.mem | - | heap_sys | bytes of heap memory obtained from the OS
lindb.runtime.mem | - | heap_idle | bytes in idle (unused) spans
lindb.runtime.mem | - | heap_inuse | bytes in in-use spans
lindb.runtime.mem | - | heap_released | bytes of physical memory returned to the OS
lindb.runtime.mem | - | heap_objects | the number of allocated heap objects
lindb.runtime.mem | - | stack_inuse | bytes in stack spans
lindb.runtime.mem | - | stack_sys | bytes of stack memory obtained from the OS
lindb.runtime.mem | - | mspan_inuse | bytes of allocated mspan structures
lindb.runtime.mem | - | mspan_sys | bytes of memory obtained from the OS for mspan structures
lindb.runtime.mem | - | mcache_inuse | bytes of allocated mcache structures
lindb.runtime.mem | - | mcache_sys | bytes of memory obtained from the OS for mcache structures
lindb.runtime.mem | - | buck_hash_sys | bytes of memory in profiling bucket hash tables
lindb.runtime.mem | - | gc_sys | bytes of memory in garbage collection metadata
lindb.runtime.mem | - | other_sys | bytes of memory in miscellaneous off-heap runtime allocations
lindb.runtime.mem | - | next_gc | the target heap size of the next GC cycle
lindb.runtime.mem | - | last_gc | the time the last garbage collection finished
lindb.runtime.mem | - | gc_cpu_fraction | the fraction of this program's available CPU time used by the GC since the program started
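
The field descriptions above match those of Go's runtime.MemStats, so a few derived values documented there can be computed from them. The Python sketch below assumes the fields carry the same meaning as their MemStats counterparts; the helper name heap_summary and the sample numbers are invented for illustration.

```python
def heap_summary(mem):
    """Derive a few values from lindb.runtime.mem fields (assumed to mirror runtime.MemStats)."""
    return {
        # Idle heap memory that the runtime retains instead of returning it to the OS.
        "retained_idle_bytes": mem["heap_idle"] - mem["heap_released"],
        # Memory dedicated to in-use spans that currently holds no allocated objects.
        "unused_in_spans_bytes": mem["heap_inuse"] - mem["heap_alloc"],
        # Number of live heap objects.
        "live_objects": mem["mallocs"] - mem["frees"],
    }


# Invented sample values, in bytes and object counts.
example = {
    "heap_idle": 64_000_000,
    "heap_released": 48_000_000,
    "heap_inuse": 32_000_000,
    "heap_alloc": 30_000_000,
    "mallocs": 1_500_000,
    "frees": 1_200_000,
}
print(heap_summary(example))
```

A steadily growing retained_idle_bytes value usually just means the runtime is holding memory for reuse, while a growing live_objects value is the more direct hint of real heap growth.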

System

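Metric Name | Tags | Fields | Description
lindb.monitor.system.cpu_stat | - | idle | CPU time that is not actively being used
lindb.monitor.system.cpu_stat | - | nice | CPU time used by processes that have a positive niceness
lindb.monitor.system.cpu_stat | - | system | CPU time used by the kernel
lindb.monitor.system.cpu_stat | - | user | CPU time used by user space processes
lindb.monitor.system.cpu_stat | - | irq | CPU time spent servicing hardware interrupt requests (IRQs)
lindb.monitor.system.cpu_stat | - | steal | the percentage of time a virtual CPU waits for a real CPU
lindb.monitor.system.cpu_stat | - | softirq | CPU time spent by the kernel servicing software interrupts (softirqs)
lindb.monitor.system.cpu_stat | - | iowait | CPU time spent waiting for input or output operations
lindb.monitor.system.mem_stat | - | total | total amount of RAM on this system
lindb.monitor.system.mem_stat | - | used | RAM used by programs
lindb.monitor.system.mem_stat | - | free | free RAM
lindb.monitor.system.mem_stat | - | usage | percentage of RAM used by programs
lindb.monitor.system.disk_usage_stats | - | total | total amount of disk space
lindb.monitor.system.disk_usage_stats | - | used | disk space used by programs
lindb.monitor.system.disk_usage_stats | - | free | free disk space
lindb.monitor.system.disk_usage_stats | - | usage | percentage of disk space used by programs
lindb.monitor.system.disk_inodes_stats | - | total | total number of inodes
lindb.monitor.system.disk_inodes_stats | - | used | inodes used by programs
lindb.monitor.system.disk_inodes_stats | - | free | free inodes
lindb.monitor.system.disk_inodes_stats | - | usage | percentage of inodes used by programs
lindb.monitor.system.net_stat | interface | bytes_sent | number of bytes sent
lindb.monitor.system.net_stat | interface | bytes_recv | number of bytes received
lindb.monitor.system.net_stat | interface | packets_sent | number of packets sent
lindb.monitor.system.net_stat | interface | packets_recv | number of packets received
lindb.monitor.system.net_stat | interface | errin | total number of errors while receiving
lindb.monitor.system.net_stat | interface | errout | total number of errors while sending
lindb.monitor.system.net_stat | interface | dropin | total number of incoming packets which were dropped
lindb.monitor.system.net_stat | interface | dropout | total number of outgoing packets which were dropped (always 0 on OSX and BSD)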
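
The 'usage' fields above are described as percentages of 'used' relative to 'total'. The Python sketch below recomputes such a percentage and applies a simple free-space check. It assumes usage is used / total * 100; the exact scale LinDB stores is not stated here, and the helper names and the 10% threshold are hypothetical.

```python
def usage_percent(used, total):
    """Percentage of `total` that is `used`; returns 0.0 when total is not positive."""
    if total <= 0:
        return 0.0
    return used / total * 100.0


def low_on_space(disk, min_free_percent=10.0):
    """True when the free share of the disk drops below `min_free_percent`."""
    free_percent = usage_percent(disk["free"], disk["total"])
    return free_percent < min_free_percent


# Invented sample of lindb.monitor.system.disk_usage_stats fields, in bytes.
disk = {"total": 1000, "used": 950, "free": 50}
print(usage_percent(disk["used"], disk["total"]), low_on_space(disk))
```

The same two helpers apply unchanged to the mem_stat and disk_inodes_stats fields, since they share the total/used/free/usage layout.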

Network

Metric Name | Tags | Fields | Description
lindb.traffic.tcp | addr | accept_conns | accept total count
lindb.traffic.tcp | addr | accept_failures | accept failure count
lindb.traffic.tcp | addr | active_conns | current active connections
lindb.traffic.tcp | addr | reads | read total count
lindb.traffic.tcp | addr | read_bytes | read byte size
lindb.traffic.tcp | addr | read_failures | read failure count
lindb.traffic.tcp | addr | writes | write total count
lindb.traffic.tcp | addr | write_bytes | write byte size
lindb.traffic.tcp | addr | write_failures | write failure count
lindb.traffic.tcp | addr | close_conns | close total count
lindb.traffic.tcp | addr | close_failures | close failure count
lindb.traffic.grpc_client.unary | grpc_service, grpc_method | failures | grpc unary client handle msg failure
lindb.traffic.grpc_client.unary.duration | grpc_service, grpc_method | histogram | grpc unary client handle msg duration
lindb.traffic.grpc_server.unary | grpc_service, grpc_method | failures | grpc unary server handle msg failure
lindb.traffic.grpc_server.unary.duration | grpc_service, grpc_method | histogram | grpc unary server handle msg duration
lindb.traffic.grpc_client.stream | grpc_service, grpc_service, grpc_method | msg_received_failures | grpc client receive msg failure
lindb.traffic.grpc_client.stream | grpc_service, grpc_service, grpc_method | msg_sent_failures | grpc client send msg failure
lindb.traffic.grpc_client.stream.received_duration | grpc_service, grpc_service, grpc_method | histogram | grpc client receive msg duration, includes receive total count/handle duration
lindb.traffic.grpc_client.stream.sent_duration | grpc_service, grpc_service, grpc_method | histogram | grpc client send msg duration, includes send total count
lindb.traffic.grpc_server.stream | grpc_service, grpc_service, grpc_method | msg_received_failures | grpc server receive msg failure
lindb.traffic.grpc_server.stream | grpc_service, grpc_service, grpc_method | msg_sent_failures | grpc server send msg failure
lindb.traffic.grpc_server.stream.received_duration | grpc_service, grpc_service, grpc_method | histogram | grpc server receive msg duration, includes receive total count/handle duration
lindb.traffic.grpc_server.stream.sent_duration | grpc_service, grpc_service, grpc_method | histogram | grpc server send msg duration, includes send total count
lindb.traffic.grpc_server | - | panics | panic count when the grpc server handles a request
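
Fields described as total counts, such as 'reads' or 'read_failures', are most useful as increases between two samples. The Python sketch below shows one way to compute them, assuming the stored values are cumulative counters that restart from zero when a node restarts. If LinDB already stores per-interval values, this step is unnecessary. The helper name counter_deltas and the sample numbers are hypothetical.

```python
def counter_deltas(previous, current):
    """Per-field increase between two samples of cumulative counters."""
    deltas = {}
    for field, value in current.items():
        before = previous.get(field, 0)
        # A value lower than the previous one is treated as a counter reset
        # (for example after a restart), so the current value is the increase.
        deltas[field] = value if value < before else value - before
    return deltas


# Invented samples of lindb.traffic.tcp fields taken one interval apart.
earlier = {"reads": 100, "read_failures": 2}
later = {"reads": 160, "read_failures": 5}
print(counter_deltas(earlier, later))
```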

Concurrent

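Metric Name | Tags | Fields | Description
lindb.concurrent.pool | pool_name | workers_alive | number of workers currently in use
lindb.concurrent.pool | pool_name | workers_created | number of workers created since start
lindb.concurrent.pool | pool_name | workers_killed | number of workers killed since start
lindb.concurrent.pool | pool_name | tasks_consumed | number of tasks consumed by workers
lindb.concurrent.pool | pool_name | tasks_rejected | number of tasks rejected by workers
lindb.concurrent.pool | pool_name | tasks_panic | number of tasks that panicked during execution
lindb.concurrent.pool.tasks_waiting_duration | pool_name | histogram | task waiting time
lindb.concurrent.pool.tasks_executing_duration | pool_name | histogram | task executing time, including the waiting period
lindb.concurrent.limit | type | throttle_requests | number of requests throttled on reaching the max concurrency
lindb.concurrent.limit | type | timeout_requests | number of requests that timed out while pending
lindb.concurrent.limit | type | processed | number of processed requests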

Broker

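Metric Name | Tags | Fields | Description
lindb.broker.state_manager | type | handle_events | handle coordinator event success count
lindb.broker.state_manager | type | handle_event_failures | handle coordinator event failure count
lindb.broker.state_manager | type | panics | panic count when handling coordinator event
lindb.master.shard.leader | - | elections | shard leader elect successfully
lindb.master.shard.leader | - | elect_failures | shard leader elect failure
lindb.master.controller | - | failovers | master fail over successfully
lindb.master.controller | - | failover_failures | master fail over failure
lindb.master.controller | - | reassigns | master reassign successfully
lindb.master.controller | - | reassign_failures | master reassign failure
lindb.http.ingest_duration | path | histogram | ingest duration (includes count)
lindb.ingestion.proto | - | data_corrupted | corrupted when parse
lindb.ingestion.proto | - | ingested_metrics | ingested metrics
lindb.ingestion.proto | - | read_bytes | read data bytes
lindb.ingestion.proto | - | dropped_metrics | drop metrics when append
lindb.ingestion.flat | - | data_corrupted | corrupted when parse
lindb.ingestion.flat | - | ingested_metrics | ingested metrics
lindb.ingestion.flat | - | read_bytes | read data bytes
lindb.ingestion.flat | - | dropped_metrics | drop metrics when append
lindb.ingestion.flat | - | sizeblock | read data block size
lindb.ingestion.influx | - | data_corrupted | corrupted when parse
lindb.ingestion.influx | - | ingested_metrics | ingested metrics
lindb.ingestion.influx | - | ingested_fields | ingested fields
lindb.ingestion.influx | - | read_bytes | read data bytes
lindb.ingestion.influx | - | dropped_metrics | drop metrics when append
lindb.ingestion.influx | - | dropped_fields | drop fields when append
lindb.broker.database.write | db | out_of_time_range | timestamp of metrics out of acceptable write time range
lindb.broker.database.write | db | shard_not_found | shard not found count
lindb.broker.family.write | db | active_families | number of current active replica family channels
lindb.broker.family.write | db | batch_metrics | batch into memory chunk success count
lindb.broker.family.write | db | batch_metrics_failures | batch into memory chunk failure count
lindb.broker.family.write | db | pending_send | number of messages pending send
lindb.broker.family.write | db | send_success | send message success count
lindb.broker.family.write | db | send_failures | send message failure count
lindb.broker.family.write | db | send_size | bytes of sent messages
lindb.broker.family.write | db | retry | retry count
lindb.broker.family.write | db | retry_drop | number of messages dropped after too many retries
lindb.broker.family.write | db | create_stream | create replica stream success count
lindb.broker.family.write | db | create_stream_failures | create replica stream failure count
lindb.broker.family.write | db | close_stream | close replica stream success count
lindb.broker.family.write | db | close_stream_failures | close replica stream failure count
lindb.broker.family.write | db | leader_changed | shard leader changed
lindb.broker.query | - | created_tasks | create query tasks
lindb.broker.query | - | alive_tasks | current executing tasks (alive)
lindb.broker.query | - | expire_tasks | task expired after a long time with no response
lindb.broker.query | - | emitted_responses | emit response to parent node
lindb.broker.query | - | omitted_responses | omit response because task evicted
lindb.broker.query | - | sent_requests | send request successfully
lindb.broker.query | - | sent_requests_failures | send request failure
lindb.broker.query | - | sent_responses | send response successfully
lindb.broker.query | - | sent_responses_failures | send response failure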

Storage

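Metric Name | Tags | Fields | Description
lindb.storage.state_manager | type | handle_events | handle coordinator event success count
lindb.storage.state_manager | type | handle_event_failures | handle coordinator event failure count
lindb.storage.state_manager | type | panics | panic count when handling coordinator event
lindb.storage.wal | db, shard | receive_write_bytes | receive write request bytes (broker->leader)
lindb.storage.wal | db, shard | write_wal | write wal successfully (broker->leader)
lindb.storage.wal | db, shard | write_wal_failures | write wal failure (broker->leader)
lindb.storage.wal | db, shard | receive_replica_bytes | receive replica request bytes (storage leader->follower)
lindb.storage.wal | db, shard | replica_wal | replica wal successfully (storage leader->follower)
lindb.storage.wal | db, shard | replica_wal_failures | replica wal failure (storage leader->follower)
lindb.storage.replicator.runner | type, db, shard | active_replicators | number of current active local replicators
lindb.storage.replicator.runner | type, db, shard | replica_panics | replica panic count
lindb.storage.replicator.runner | type, db, shard | consume_msg | get message success count
lindb.storage.replicator.runner | type, db, shard | consume_msg_failures | get message failure count
lindb.storage.replicator.runner | type, db, shard | replica_lag | replica lag message count
lindb.storage.replicator.runner | type, db, shard | replica_bytes | bytes of replica data
lindb.storage.replicator.runner | type, db, shard | replicas | replica success count
lindb.storage.replica.local | db, shard | decompress_failures | decompress message failure count
lindb.storage.replica.local | db, shard | replica_failures | replica failure count
lindb.storage.replica.local | db, shard | replica_rows | row number of replica
lindb.storage.replica.local | db, shard | ack_sequence | ack persist sequence count
lindb.storage.replica.local | db, shard | invalid_sequence | invalid replica sequence count
lindb.storage.replica.remote | db, shard | not_ready | remote replicator channel not ready
lindb.storage.replica.remote | db, shard | follower_offline | remote follower node offline
lindb.storage.replica.remote | db, shard | need_close_last_stream | need to close the last stream when reconnecting
lindb.storage.replica.remote | db, shard | close_last_stream_failures | close last stream failure
lindb.storage.replica.remote | db, shard | create_replica_cli | create replica client successfully
lindb.storage.replica.remote | db, shard | create_replica_cli_failures | create replica client failure
lindb.storage.replica.remote | db, shard | create_replica_stream | create replica stream successfully
lindb.storage.replica.remote | db, shard | create_replica_stream_failures | create replica stream failure
lindb.storage.replica.remote | db, shard | get_last_ack_failures | get last ack sequence from remote follower failure
lindb.storage.replica.remote | db, shard | reset_follower_append_idx | reset follower append index successfully
lindb.storage.replica.remote | db, shard | reset_follower_append_idx_failures | reset follower append index failure
lindb.storage.replica.remote | db, shard | reset_append_idx | reset current leader local append index
lindb.storage.replica.remote | db, shard | reset_replica_idx | reset current leader replica index successfully
lindb.storage.replica.remote | db, shard | reset_replica_failures | reset current leader replica index failure
lindb.storage.replica.remote | db, shard | send_msg | send replica msg successfully
lindb.storage.replica.remote | db, shard | send_msg_failures | send replica msg failure
lindb.storage.replica.remote | db, shard | receive_msg | receive replica resp successfully
lindb.storage.replica.remote | db, shard | receive_msg_failures | receive replica resp failure
lindb.storage.replica.remote | db, shard | ack_sequence | ack replica successfully sequence count
lindb.storage.replica.remote | db, shard | invalid_ack_sequence | get wrong replica ack sequence from follower
lindb.tsdb.indexdb | db | build_inverted_index | build inverted index count
lindb.tsdb.memdb | db | allocated_pages | allocate temp memory page successfully
lindb.tsdb.memdb | db | allocate_page_failures | allocate temp memory page failure
lindb.tsdb.database | db | metadb_flush_failures | flush metadata database failure
lindb.tsdb.database.metadb_flush_duration | db | histogram | flush metadata database duration (includes count)
lindb.tsdb.metadb | db | gen_metric_ids | generate metric id successfully
lindb.tsdb.metadb | db | gen_metric_id_failures | generate metric id failure
lindb.tsdb.metadb | db | gen_tag_key_ids | generate tag key id successfully
lindb.tsdb.metadb | db | gen_tag_key_id_failures | generate tag key id failure
lindb.tsdb.metadb | db | gen_field_ids | generate field id successfully
lindb.tsdb.metadb | db | gen_field_id_failures | generate field id failure
lindb.tsdb.metadb | db | gen_tag_value_ids | generate tag value id successfully
lindb.tsdb.metadb | db | gen_tag_value_id_failures | generate tag value id failure
lindb.tsdb.shard | db, shard | active_families | number of current active families
lindb.tsdb.shard | db, shard | write_batches | write batch count
lindb.tsdb.shard | db, shard | write_metrics | write metric success count
lindb.tsdb.shard | db, shard | write_fields | write field data point success count
lindb.tsdb.shard | db, shard | write_metrics_failures | write metric failures
lindb.tsdb.shard | db, shard | memdb_total_size | total memory size of memory database
lindb.tsdb.shard | db, shard | active_memdbs | number of current active memory databases
lindb.tsdb.shard | db, shard | memdb_flush_failures | flush memory database failure
lindb.tsdb.shard | db, shard | lookup_metric_meta_failures | lookup meta of metric failure
lindb.tsdb.shard | db, shard | indexdb_flush_failures | flush index database failure
lindb.tsdb.shard.memdb_flush_duration | db, shard | histogram | flush memory database duration (includes count)
lindb.tsdb.shard.indexdb_flush_duration | db, shard | indexdb_flush_duration | flush index database duration (includes count)
lindb.kv.table.cache | - | evicts | evict reader from cache
lindb.kv.table.cache | - | cache_hits | get reader hit cache
lindb.kv.table.cache | - | cache_misses | get reader miss cache
lindb.kv.table.cache | - | closes | close reader successfully
lindb.kv.table.cache | - | close_failures | close reader failure
lindb.kv.table.cache | - | active_readers | number of active readers in cache
lindb.kv.table.read | - | gets | get data by key successfully
lindb.kv.table.read | - | get_failures | get data by key failures
lindb.kv.table.read | - | read_bytes | bytes of read data
lindb.kv.table.read | - | mmaps | map file successfully
lindb.kv.table.read | - | mmap_failures | map file failure
lindb.kv.table.read | - | unmmaps | unmap file successfully
lindb.kv.table.read | - | unmmap_failures | unmap file failure
lindb.kv.table.write | - | bad_keys | add bad key count
lindb.kv.table.write | - | add_keys | add key successfully
lindb.kv.table.write | - | write_bytes | bytes of written data
lindb.kv.compaction | type | compacting | number of compacting jobs
lindb.kv.compaction | type | failure | compact failure
lindb.kv.compaction.duration | type | histogram | compact duration (includes count)
lindb.kv.flush | - | flushing | number of flushing jobs
lindb.kv.flush | - | failure | flush job failure
lindb.kv.flush.duration | - | histogram | flush duration (includes count)
lindb.storage.query | - | metric_queries | execute metric query successfully (just plan it)
lindb.storage.query | - | metric_query_failures | execute metric query failure
lindb.storage.query | - | meta_queries | metadata query successfully
lindb.storage.query | - | meta_query_failures | metadata query failure
lindb.storage.query | - | omitted_requests | omit request (task does not belong to current node, wrong stream, etc.)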
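
Paired hit and miss counters such as 'cache_hits' and 'cache_misses' of lindb.kv.table.cache are commonly combined into a hit ratio. The Python sketch below shows that calculation under the assumption that both values cover the same time window; the helper name cache_hit_ratio and the sample numbers are hypothetical.

```python
def cache_hit_ratio(cache_hits, cache_misses):
    """Share of reader lookups served from the cache, or None when there were no lookups."""
    lookups = cache_hits + cache_misses
    if lookups == 0:
        return None
    return cache_hits / lookups


# Invented counts for one time window.
print(cache_hit_ratio(90, 10))
```

Returning None for an empty window keeps "no traffic" distinct from "every lookup missed", which would otherwise both show up as 0.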