在接收指标时,我们的 Graphite 服务器有一个奇怪的行为。我们使用 collectd 每 10 秒发送一次指标,但看不到所有指标。storage-scheme 中配置的保留是:
[collectd]
pattern=^collectd\.tera.*
retentions=10s:1d,60s:7d,5m:60d,30m:180d
[default_1min_for_1day]
pattern = .*
retentions = 1m:1d,10m:7d,6h:90d
我们每 10 秒看到一次 Grafana 上的指标,但过了一段时间(约 1 分钟),指标消失了,只剩下 1 个点。 图表示例
我们正在使用 1 个中继和 2 个缓存。
一个指标的例子是:
collectd.terado.cpu.cpu-user 1 1643015258
未配置aggregation-rules.conf。
这是我们的 carbon.conf:*由于 stackoverflow 字符限制而删除了注释
[cache]
DATABASE = whisper
ENABLE_LOGROTATION = True
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 100000
MAX_CREATES_PER_MINUTE = 50000
MIN_TIMESTAMP_RESOLUTION = 1
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
#UDP_RECEIVER_PORT = 2003
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
#CACHE_QUERY_PORT = 7002
USE_FLOW_CONTROL = True
#METRIC_CLIENT_IDLE_TIMEOUT = None
LOG_UPDATES = True
LOG_CREATES = True
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False
CACHE_WRITE_STRATEGY = sorted
# On some systems it is desirable for whisper to write synchronously.
# Set this option to True if you'd like to try this. Basically it will
# shift the onus of buffering writes from the kernel into carbon's cache.
WHISPER_AUTOFLUSH = True
# By default new Whisper files are created pre-allocated with the data region
# filled with zeros to prevent fragmentation and speed up contiguous reads and
# writes (which are common). Enabling this option will cause Whisper to create
# the file sparsely instead. Enabling this option may allow a large increase of
# MAX_CREATES_PER_MINUTE but may have longer term performance implications
# depending on the underlying storage configuration.
# WHISPER_SPARSE_CREATE = False
# Only beneficial on linux filesystems that support the fallocate system call.
# It maintains the benefits of contiguous reads/writes, but with a potentially
# much faster creation speed, by allowing the kernel to handle the block
# allocation and zero-ing. Enabling this option may allow a large increase of
# MAX_CREATES_PER_MINUTE. If enabled on an OS or filesystem that is unsupported
# this option will gracefully fallback to standard POSIX file access methods.
# If enabled, disables WHISPER_SPARSE_CREATE regardless of the value.
WHISPER_FALLOCATE_CREATE = True
# Enabling this option will cause Whisper to lock each Whisper file it writes
# to with an exclusive lock (LOCK_EX, see: man 2 flock). This is useful when
# multiple carbon-cache daemons are writing to the same files.
# WHISPER_LOCK_WRITES = False
# On systems which has a large number of metrics, an amount of Whisper write(2)'s
# pageback sometimes cause disk thrashing due to memory shortage, so that abnormal
# disk reads occur. Enabling this option makes it possible to decrease useless
# page cache memory by posix_fadvise(2) with POSIX_FADVISE_RANDOM option.
# WHISPER_FADVISE_RANDOM = False
# By default all nodes stored in Ceres are cached in memory to improve the
# throughput of reads and writes to underlying slices. Turning this off will
# greatly reduce memory consumption for databases with millions of metrics, at
# the cost of a steep increase in disk i/o, approximately an extra two os.stat
# calls for every read and write. Reasons to do this are if the underlying
# storage can handle stat() with practically zero cost (SSD, NVMe, zRAM).
# Valid values are:
# all - all nodes are cached
# none - node caching is disabled
# CERES_NODE_CACHING_BEHAVIOR = all
# Ceres nodes can have many slices and caching the right ones can improve
# performance dramatically. Note that there are many trade-offs to tinkering
# with this, and unless you are a ceres developer you *really* should not
# mess with this. Valid values are:
# latest - only the most recent slice is cached
# all - all slices are cached
# none - slice caching is disabled
# CERES_SLICE_CACHING_BEHAVIOR = latest
# If a Ceres node accumulates too many slices, performance can suffer.
# This can be caused by intermittently reported data. To mitigate
# slice fragmentation there is a tolerance for how much space can be
# wasted within a slice file to avoid creating a new one. That tolerance
# level is determined by MAX_SLICE_GAP, which is the number of consecutive
# null datapoints allowed in a slice file.
# If you set this very low, you will waste less of the *tiny* bit disk space
# that this feature wastes, and you will be prone to performance problems
# caused by slice fragmentation, which can be pretty severe.
# If you set this really high, you will waste a bit more disk space (each
# null datapoint wastes 8 bytes, but keep in mind your filesystem's block
# size). If you suffer slice fragmentation issues, you should increase this or
# run the ceres-maintenance defrag plugin more often. However you should not
# set it to be huge because then if a large but allowed gap occurs it has to
# get filled in, which means instead of a simple 8-byte write to a new file we
# could end up doing an (8 * MAX_SLICE_GAP)-byte write to the latest slice.
# CERES_MAX_SLICE_GAP = 80
# Enabling this option will cause Ceres to lock each Ceres file it writes to
# to with an exclusive lock (LOCK_EX, see: man 2 flock). This is useful when
# multiple carbon-cache daemons are writing to the same files.
# CERES_LOCK_WRITES = False
# Set this to True to enable whitelisting and blacklisting of metrics in
# CONF_DIR/whitelist.conf and CONF_DIR/blacklist.conf. If the whitelist is
# missing or empty, all metrics will pass through
# USE_WHITELIST = False
# By default, carbon itself will log statistics (such as a count,
# metricsReceived) with the top level prefix of 'carbon' at an interval of 60
# seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation
# CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 10
# Enable AMQP if you want to receve metrics using an amqp broker
# ENABLE_AMQP = False
# Verbose means a line will be logged for every metric received
# useful for testing
# AMQP_VERBOSE = False
# AMQP_HOST = localhost
# AMQP_PORT = 5672
# AMQP_VHOST = /
# AMQP_USER = guest
# AMQP_PASSWORD = guest
# AMQP_EXCHANGE = graphite
# AMQP_METRIC_NAME_IN_BODY = False
# The manhole interface allows you to SSH into the carbon daemon
# and get a python interpreter. BE CAREFUL WITH THIS! If you do
# something like time.sleep() in the interpreter, the whole process
# will sleep! This is *extremely* helpful in debugging, assuming
# you are familiar with the code. If you are not, please don't
# mess with this, you are asking for trouble :)
#
# ENABLE_MANHOLE = False
# MANHOLE_INTERFACE = 127.0.0.1
# MANHOLE_PORT = 7222
# MANHOLE_USER = admin
# MANHOLE_PUBLIC_KEY = ssh-rsa AAAAB3NzaC1yc2EAAAABiwAaAIEAoxN0sv/e4eZCPpi3N3KYvyzRaBaMeS2RsOQ/cDuKv11dlNzVeiyc3RFmCv5Rjwn/lQ79y0zyHxw67qLyhQ/kDzINc4cY41ivuQXm2tPmgvexdrBv5nsfEpjs3gLZfJnyvlcVyWK/lId8WUvEWSWHTzsbtmXAF2raJMdgLTbQ8wE=
# Patterns for all of the metrics this machine will store. Read more at
# http://en.wikipedia.org/wiki/Advanced_Message_Queuing_Protocol#Bindings
#
# Example: store all sales, linux servers, and utilization metrics
# BIND_PATTERNS = sales.#, servers.linux.#, #.utilization
#
# Example: store everything
# BIND_PATTERNS = #
# URL of graphite-web instance, this is used to add incoming series to the tag database
GRAPHITE_URL = http://127.0.0.1:8080
# Tag support, when enabled carbon will make HTTP calls to graphite-web to update the tag index
# ENABLE_TAGS = True
# Tag update interval, this specifies how frequently updates to existing series will trigger
# an update to the tag index, the default setting is once every 100 updates
# TAG_UPDATE_INTERVAL = 100
# Tag hash filenames, this specifies whether tagged metric filenames should use the hash of the metric name
# or a human-readable name, using hashed names avoids issues with path length when using a large number of tags
# TAG_HASH_FILENAMES = True
# Tag batch size, this specifies the maximum number of series to be sent to graphite-web in a single batch
# TAG_BATCH_SIZE = 100
# Tag queue size, this specifies the maximum number of series to be queued for sending to graphite-web
# There are separate queues for new series and for updates to existing series
# TAG_QUEUE_SIZE = 10000
# Set to enable Sentry.io exception monitoring.
# RAVEN_DSN='YOUR_DSN_HERE'.
# To configure special settings for the carbon-cache instance 'b', uncomment this:
#[cache:b]
#LINE_RECEIVER_PORT = 2103
#PICKLE_RECEIVER_PORT = 2104
#CACHE_QUERY_PORT = 7102
# and any other settings you want to customize, defaults are inherited
# from the [cache] section.
# You can then specify the --instance=b option to manage this instance
#
# In order to turn off logging of successful connections for the line
# receiver, set this to False
# LOG_LISTENER_CONN_SUCCESS = True
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7002
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7102
#[cache:c]
#LINE_RECEIVER_PORT = 2303
#PICKLE_RECEIVER_PORT = 2304
#CACHE_QUERY_PORT = 7202
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
# Carbon-relay has several options for metric routing controlled by RELAY_METHOD
#
# Use relay-rules.conf to route metrics to destinations based on pattern rules
#RELAY_METHOD = rules
#
# Use consistent-hashing for even distribution of metrics between destinations
RELAY_METHOD = consistent-hashing
#
# Use consistent-hashing but take into account an aggregation-rules.conf shared
# by downstream carbon-aggregator daemons. This will ensure that all metrics
# that map to a given aggregation rule are sent to the same carbon-aggregator
# instance.
# Enable this for carbon-relays that send to a group of carbon-aggregators
#RELAY_METHOD = aggregated-consistent-hashing
#
# You can also use fast-hashing and fast-aggregated-hashing which are in O(1)
# and will always redirect the metrics to the same destination but do not try
# to minimize rebalancing when the list of destinations is changing.
#RELAY_METHOD = rules
# If you use consistent-hashing you can add redundancy by replicating every
# datapoint to more than one machine.
REPLICATION_FACTOR = 1
# For REPLICATION_FACTOR >=2, set DIVERSE_REPLICAS to True to guarantee replicas
# across distributed hosts. With this setting disabled, it's possible that replicas
# may be sent to different caches on the same host. This has been the default
# behavior since introduction of 'consistent-hashing' relay method.
# Note that enabling this on an existing pre-0.9.14 cluster will require rebalancing
# your metrics across the cluster nodes using a tool like Carbonate.
#DIVERSE_REPLICAS = True
# This is a list of carbon daemons we will send any relayed or
# generated metrics to. The default provided would send to a single
# carbon-cache instance on the default port. However if you
# use multiple carbon-cache instances then it would look like this:
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
#, 127.0.0.1:2304:c
#
# The general form is IP:PORT:INSTANCE where the :INSTANCE part is
# optional and refers to the "None" instance if omitted.
#
# Note that if the destinations are all carbon-caches then this should
# exactly match the webapp's CARBONLINK_HOSTS setting in terms of
# instances listed (order matters!).
#
# If using RELAY_METHOD = rules, all destinations used in relay-rules.conf
# must be defined in this list
#DESTINATIONS = 127.0.0.1:2004
# This define the protocol to use to contact the destination. It can be
# set to one of "line", "pickle", "udp" and "protobuf". This list can be
# extended with CarbonClientFactory plugins and defaults to "pickle".
DESTINATION_PROTOCOL = pickle
# This defines the wire transport, either none or ssl.
# If SSL is used any TCP connection will be upgraded to TLS1. The system's
# trust authority will be used unless DESTINATION_SSL_CA is specified in
# which case an alternative certificate authority chain will be used for
# verifying the remote certificate.
# To use SSL you'll need the cryptography, service_identity, and twisted >= 14
# DESTINATION_TRANSPORT = none
# DESTINATION_SSL_CA=/path/to/private-ca.crt
# This allows to have multiple connections per destinations, this will
# pool all the replicas of a single host in the same queue and distribute
# points accross these replicas instead of replicating them.
# The following example will balance the load between :0 and :1.
## DESTINATIONS = foo:2001:0, foo:2001:1
## RELAY_METHOD = rules
# Note: this is currently incompatible with USE_RATIO_RESET which gets
# disabled if this option is enabled.
# DESTINATIONS_POOL_REPLICAS = False
# When using consistent hashing it sometime makes sense to make
# the ring dynamic when you don't want to loose points when a
# single destination is down. Replication is an answer to that
# but it can be quite expensive.
# DYNAMIC_ROUTER = False
# Controls the number of connection attempts before marking a
# destination as down. We usually do one connection attempt per
# second.
# DYNAMIC_ROUTER_MAX_RETRIES = 5
# This is the maximum number of datapoints that can be queued up
# for a single destination. Once this limit is hit, we will
# stop accepting new data if USE_FLOW_CONTROL is True, otherwise
# we will drop any subsequently received datapoints.
MAX_QUEUE_SIZE = 500000
#previous was 100000
# This defines the maximum "message size" between carbon daemons. If
# your queue is large, setting this to a lower number will cause the
# relay to forward smaller discrete chunks of stats, which may prevent
# overloading on the receiving side after a disconnect.
MAX_DATAPOINTS_PER_MESSAGE = 50500
# Limit the number of open connections the receiver can handle as any time.
# Default is no limit. Setting up a limit for sites handling high volume
# traffic may be recommended to avoid running out of TCP memory or having
# thousands of TCP connections reduce the throughput of the service.
MAX_RECEIVER_CONNECTIONS = inf
# Specify the user to drop privileges to
# If this is blank carbon-relay runs as the user that invokes it
# USER =
# This is the percentage that the queue must be empty before it will accept
# more messages. For a larger site, if the queue is very large it makes sense
# to tune this to allow for incoming stats. So if you have an average
# flow of 100k stats/minute, and a MAX_QUEUE_SIZE of 3,000,000, it makes sense
# to allow stats to start flowing when you've cleared the queue to 95% since
# you should have space to accommodate the next minute's worth of stats
# even before the relay incrementally clears more of the queue
QUEUE_LOW_WATERMARK_PCT = 0.8
# To allow for batch efficiency from the pickle protocol and to benefit from
# other batching advantages, all writes are deferred by putting them into a queue,
# and then the queue is flushed and sent a small fraction of a second later.
TIME_TO_DEFER_SENDING = 0.0001
# Set this to False to drop datapoints when any send queue (sending datapoints
# to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
# default) then sockets over which metrics are received will temporarily stop accepting
# data until the send queues fall below QUEUE_LOW_WATERMARK_PCT * MAX_QUEUE_SIZE.
USE_FLOW_CONTROL = True
# If enabled this setting is used to timeout metric client connection if no
# metrics have been sent in specified time in seconds
#METRIC_CLIENT_IDLE_TIMEOUT = None
# Set this to True to enable whitelisting and blacklisting of metrics in
# CONF_DIR/whitelist.conf and CONF_DIR/blacklist.conf. If the whitelist is
# missing or empty, all metrics will pass through
# USE_WHITELIST = False
# By default, carbon itself will log statistics (such as a count,
# metricsReceived) with the top level prefix of 'carbon' at an interval of 60
# seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation
# CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 10
#
# In order to turn off logging of successful connections for the line
# receiver, set this to False
# LOG_LISTENER_CONN_SUCCESS = True
# If you're connecting from the relay to a destination that's over the
# internet or similarly iffy connection, a backlog can develop because
# of internet weather conditions, e.g. acks getting lost or similar issues.
# To deal with that, you can enable USE_RATIO_RESET which will let you
# re-set the connection to an individual destination. Defaults to being off.
USE_RATIO_RESET=False
# When there is a small number of stats flowing, it's not desirable to
# perform any actions based on percentages - it's just too "twitchy".
MIN_RESET_STAT_FLOW=1000
# When the ratio of stats being sent in a reporting interval is far
# enough from 1.0, we will disconnect the socket and reconnecto to
# clear out queued stats. The default ratio of 0.9 indicates that 10%
# of stats aren't being delivered within one CARBON_METRIC_INTERVAL
# (default of 60 seconds), which can lead to a queue backup. Under
# some circumstances re-setting the connection can fix this, so
# set this according to your tolerance, and look in the logs for
# "resetConnectionForQualityReasons" to observe whether this is kicking
# in when your sent queue is building up.
MIN_RESET_RATIO=0.9
# The minimum time between resets. When a connection is re-set, we
# need to wait before another reset is performed.
# (2*CARBON_METRIC_INTERVAL) + 1 second is the minimum time needed
# before stats for the new connection will be available. Setting this
# below (2*CARBON_METRIC_INTERVAL) + 1 second will result in a lot of
# reset connections for no good reason.
MIN_RESET_INTERVAL=121
# Enable TCP Keep Alive (http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html).
# Default settings will send a probe every 30s. Default is False.
# TCP_KEEPALIVE=True
# The interval between the last data packet sent (simple ACKs are not
# considered data) and the first keepalive probe; after the connection is marked
# to need keepalive, this counter is not used any further.
# TCP_KEEPIDLE=10
# The interval between subsequential keepalive probes, regardless of what
# the connection has exchanged in the meantime.
# TCP_KEEPINTVL=30
# The number of unacknowledged probes to send before considering the connection
# dead and notifying the application layer.
# TCP_KEEPCNT=2
[aggregator]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
# If set true, metric received will be forwarded to DESTINATIONS in addition to
# the output of the aggregation rules. If set false the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
# Filenames of the configuration files to use for this instance of aggregator.
# Filenames are relative to CONF_DIR.
#
# AGGREGATION_RULES = aggregation-rules.conf
# REWRITE_RULES = rewrite-rules.conf
# This is a list of carbon daemons we will send any relayed or
# generated metrics to. The default provided would send to a single
# carbon-cache instance on the default port. However if you
# use multiple carbon-cache instances then it would look like this:
#
#DESTINATIONS = 127.0.0.1:2004
#
# The format is comma-delimited IP:PORT:INSTANCE where the :INSTANCE part is
# optional and refers to the "None" instance if omitted.
#
# Note that if the destinations are all carbon-caches then this should
# exactly match the webapp's CARBONLINK_HOSTS setting in terms of
# instances listed (order matters!).
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
# If you want to add redundancy to your data by replicating every
# datapoint to more than one machine, increase this.
REPLICATION_FACTOR = 1
# This is the maximum number of datapoints that can be queued up
# for a single destination. Once this limit is hit, we will
# stop accepting new data if USE_FLOW_CONTROL is True, otherwise
# we will drop any subsequently received datapoints.
MAX_QUEUE_SIZE = 100000
# Set this to False to drop datapoints when any send queue (sending datapoints
# to a downstream carbon daemon) hits MAX_QUEUE_SIZE. If this is True (the
# default) then sockets over which metrics are received will temporarily stop accepting
# data until the send queues fall below 80% MAX_QUEUE_SIZE.
USE_FLOW_CONTROL = True
# If enabled this setting is used to timeout metric client connection if no
# metrics have been sent in specified time in seconds
#METRIC_CLIENT_IDLE_TIMEOUT = None
# This defines the maximum "message size" between carbon daemons.
# You shouldn't need to tune this unless you really know what you're doing.
MAX_DATAPOINTS_PER_MESSAGE = 20500
#previous was 5500
# This defines how many datapoints the aggregator remembers for
# each metric. Aggregation only happens for datapoints that fall in
# the past MAX_AGGREGATION_INTERVALS * intervalSize seconds.
MAX_AGGREGATION_INTERVALS = 5
#previous 5
# Limit the number of open connections the receiver can handle as any time.
# Default is no limit. Setting up a limit for sites handling high volume
# traffic may be recommended to avoid running out of TCP memory or having
# thousands of TCP connections reduce the throughput of the service.
MAX_RECEIVER_CONNECTIONS = inf
# By default (WRITE_BACK_FREQUENCY = 0), carbon-aggregator will write back
# aggregated data points once every rule.frequency seconds, on a per-rule basis.
# Set this (WRITE_BACK_FREQUENCY = N) to write back all aggregated data points
# every N seconds, independent of rule frequency. This is useful, for example,
# to be able to query partially aggregated metrics from carbon-cache without
# having to first wait rule.frequency seconds.
# WRITE_BACK_FREQUENCY = 0
# Set this to True to enable whitelisting and blacklisting of metrics in
# CONF_DIR/whitelist.conf and CONF_DIR/blacklist.conf. If the whitelist is
# missing or empty, all metrics will pass through
# USE_WHITELIST = False
# By default, carbon itself will log statistics (such as a count,
# metricsReceived) with the top level prefix of 'carbon' at an interval of 60
# seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation
# CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 10
# In order to turn off logging of successful connections for the line
# receiver, set this to False
# LOG_LISTENER_CONN_SUCCESS = True
# In order to turn off logging of metrics with no corresponding
# aggregation rules receiver, set this to False
# LOG_AGGREGATOR_MISSES = False
# Specify the user to drop privileges to
# If this is blank carbon-aggregator runs as the user that invokes it
# USER =
# Part of the code, and particularly aggregator rules, need
# to cache metric names. To avoid leaking too much memory you
# can tweak the size of this cache. The default allow for 1M
# different metrics per rule (~200MiB).
# CACHE_METRIC_NAMES_MAX=1000000
# You can optionally set a ttl to this cache.
# CACHE_METRIC_NAMES_TTL=600
您能帮助我们了解如何每 10 秒获取一次所有指标吗?谢谢。