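I have an issue with my OpenCensus implementation for logging in Python and FastAPI. I want to log incoming requests to Application Insights in Azure, so I added a FastAPI middleware to my code, following the Microsoft docs and this GitHub post: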

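import os

from fastapi import FastAPI, Request
from opencensus.ext.azure.trace_exporter import AzureExporter
from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES
from opencensus.trace.propagation.trace_context_http_header_format import TraceContextPropagator
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.span import SpanKind
from opencensus.trace.tracer import Tracer

HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']

app = FastAPI()  # assumed here; the application object is defined elsewhere in the real code
propagator = TraceContextPropagator()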

@app.middleware('http')
async def middleware_opencensus(request: Request, call_next):
    tracer = Tracer(
        span_context=propagator.from_headers(request.headers),
        exporter=AzureExporter(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']),
        sampler=AlwaysOnSampler(),
        propagator=propagator)

    with tracer.span('main') as span:
        span.span_kind = SpanKind.SERVER
        tracer.add_attribute_to_current_span(HTTP_HOST, request.url.hostname)
        tracer.add_attribute_to_current_span(HTTP_METHOD, request.method)
        tracer.add_attribute_to_current_span(HTTP_PATH, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_ROUTE, request.url.path)
        tracer.add_attribute_to_current_span(HTTP_URL, str(request.url))

        response = await call_next(request)
        tracer.add_attribute_to_current_span(HTTP_STATUS_CODE, response.status_code)

    return response

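This works great when running locally, and all incoming requests to the API are logged to Application Insights. However, since implementing OpenCensus, when deployed in a Container Instance on Azure, an issue arises after a few days (approximately 3 days): it looks like some recursive logging problem occurs (more than 30,000 logs per second!), among other things stating "Queue is full. Dropping telemetry", before the container finally crashes after a few hours of frantic logging.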

Our logger.py file, in which we define the logging handlers, is as follows:

import logging.config
import os
import tqdm
from pathlib import Path
from opencensus.ext.azure.log_exporter import AzureLogHandler


class TqdmLoggingHandler(logging.Handler):
    """
        Class for enabling logging during a process with a tqdm progress bar.
        Using this handler logs will be put above the progress bar, pushing the
        process bar down instead of replacing it.
    """
    def __init__(self, level=logging.NOTSET):
        super().__init__(level)
        self.formatter = logging.Formatter(fmt='%(asctime)s <%(name)s> %(levelname)s: %(message)s',
                                           datefmt='%d-%m-%Y %H:%M:%S')

    def emit(self, record):
        try:
            msg = self.format(record)
            tqdm.tqdm.write(msg)
            self.flush()
        except (KeyboardInterrupt, SystemExit):
            raise
        except Exception:
            self.handleError(record)


logging_conf_path = Path(__file__).parent
logging.config.fileConfig(logging_conf_path / 'logging.conf')

logger = logging.getLogger(__name__)
logger.addHandler(TqdmLoggingHandler(logging.DEBUG))  # Add tqdm handler to root logger to replace the stream handler
if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
    logger.addHandler(AzureLogHandler(connection_string=os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))

warning_level_loggers = ['urllib3', 'requests']
for lgr in warning_level_loggers:
    logging.getLogger(lgr).setLevel(logging.WARNING)

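Does anyone have an idea what could be causing this issue, or has anyone encountered something similar? I don't know what the "first" error log is, because the logs come in so fast.

Please let me know if additional information is required.

Thanks in advance!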

1 Answer

We decided to revisit the problem and found two helpful threads describing behaviour similar, though not identical, to what we were seeing.

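As described in the second thread, it seems that OpenCensus attempts to send a trace to Application Insights and, on failure, the failed logs are batched and sent again after 15 seconds (the default). This continues indefinitely until it succeeds, which may cause the huge and seemingly recursive spam of failure logs.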
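To see why an unreachable endpoint can snowball, here is a toy model of the retry behaviour described above. It is not OpenCensus's actual code: the function simulate_backlog and its parameters are made up for illustration, and the model simply assumes that every failed item stays pending and is attempted again in each retry interval.

```python
def simulate_backlog(intervals, new_per_interval, endpoint_up=lambda i: False):
    """Return how many items are attempted in each retry interval (toy model)."""
    pending = 0
    attempted = []
    for i in range(intervals):
        pending += new_per_interval  # fresh telemetry joins the backlog
        attempted.append(pending)    # the whole backlog is tried again
        if endpoint_up(i):
            pending = 0              # a successful send clears the backlog
    return attempted


# Endpoint never reachable: the amount of work per interval keeps growing.
print(simulate_backlog(4, 10))  # [10, 20, 30, 40]

# Endpoint reachable in the second interval: the backlog resets afterwards.
print(simulate_backlog(4, 10, endpoint_up=lambda i: i == 1))  # [10, 20, 10, 20]
```

In this simplified picture the per-interval work grows for as long as sending keeps failing, which would be consistent with the log volume ramping up over hours rather than appearing all at once.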

A solution introduced and proposed by Izchen in this comment is to set enable_local_storage=False for this issue.
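A minimal sketch of what that could look like, assuming the option is passed as a keyword argument to both the trace exporter and the log handler from opencensus-ext-azure. The helper names azure_options and build_azure_exporters are hypothetical and only serve to keep the two exporters configured the same way.

```python
def azure_options(connection_string: str) -> dict:
    """Keyword arguments shared by both OpenCensus Azure exporters."""
    return {
        "connection_string": connection_string,
        # Do not keep failed telemetry in local file storage for later retries.
        "enable_local_storage": False,
    }


def build_azure_exporters(connection_string: str):
    """Create a trace exporter and a log handler with local storage disabled."""
    # Imported lazily so this sketch can be loaded without opencensus-ext-azure.
    from opencensus.ext.azure.log_exporter import AzureLogHandler
    from opencensus.ext.azure.trace_exporter import AzureExporter

    options = azure_options(connection_string)
    return AzureExporter(**options), AzureLogHandler(**options)
```

The returned exporter would then take the place of the AzureExporter(...) call in the middleware, and the handler the place of the AzureLogHandler(...) call in logger.py. The likely trade-off is that telemetry which fails to send is dropped instead of being retried.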

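Another solution is to migrate to OpenTelemetry, which should not have this potential problem and is the solution we are currently running. Keep in mind that OpenCensus is still the application monitoring solution officially supported by Microsoft, and that OpenTelemetry is still very young. However, OpenTelemetry does seem to have a lot of support and is gaining more and more traction.

As for the OpenTelemetry implementation, we did the following to trace our requests: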

if os.getenv('APPLICATION_INSIGHTS_CONNECTION_STRING'):
    from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
    from opentelemetry import trace
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.propagate import extract
    from opentelemetry.sdk.resources import SERVICE_NAME, SERVICE_NAMESPACE, SERVICE_INSTANCE_ID, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    provider = TracerProvider()

    processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
        os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
    provider.add_span_processor(processor)
    trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app)

OpenTelemetry supports many custom instrumentors that can be used to create spans, for example for Requests, PyMongo, Elastic, Redis, etc. => https://opentelemetry.io/registry/
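As a rough sketch of how one of those instrumentors could be enabled next to the FastAPI one, assuming the opentelemetry-instrumentation-requests package is installed. The helper name instrument_outgoing_requests is hypothetical, and the fallback for a missing package is a design choice of this sketch, not something OpenTelemetry requires.

```python
def instrument_outgoing_requests() -> bool:
    """Enable automatic client spans for calls made with the requests library.

    Returns True if the instrumentor was applied, False if the optional
    instrumentation package is not installed.
    """
    try:
        # Provided by the opentelemetry-instrumentation-requests package.
        from opentelemetry.instrumentation.requests import RequestsInstrumentor
    except ImportError:
        return False

    # Patches requests so each outgoing HTTP call is recorded as a span
    # by the globally configured tracer provider.
    RequestsInstrumentor().instrument()
    return True
```

Calling this once at startup, after trace.set_tracer_provider(provider), should make outgoing HTTP calls show up as child spans of the incoming request.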

If you want to write custom tracers/spans as in the OpenCensus example above, you can try something like this:

# These still come from OpenCensus for convenience
from opencensus.trace.attributes_helper import COMMON_ATTRIBUTES

HTTP_HOST = COMMON_ATTRIBUTES['HTTP_HOST']
HTTP_METHOD = COMMON_ATTRIBUTES['HTTP_METHOD']
HTTP_PATH = COMMON_ATTRIBUTES['HTTP_PATH']
HTTP_ROUTE = COMMON_ATTRIBUTES['HTTP_ROUTE']
HTTP_URL = COMMON_ATTRIBUTES['HTTP_URL']
HTTP_STATUS_CODE = COMMON_ATTRIBUTES['HTTP_STATUS_CODE']

provider = TracerProvider()

processor = BatchSpanProcessor(AzureMonitorTraceExporter.from_connection_string(
        os.environ['APPLICATION_INSIGHTS_CONNECTION_STRING']))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

@app.middleware('http')
async def middleware_opentelemetry(request: Request, call_next):
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span('main',
                                      context=extract(request.headers),
                                      kind=trace.SpanKind.SERVER) as span:
        span.set_attributes({
            HTTP_HOST: request.url.hostname,
            HTTP_METHOD: request.method,
            HTTP_PATH: request.url.path,
            HTTP_ROUTE: request.url.path,
            HTTP_URL: str(request.url)
        })

        response = await call_next(request)
        span.set_attribute(HTTP_STATUS_CODE, response.status_code)

    return response

This solution no longer requires the AzureLogHandler in our logger.py configuration, so it has been removed.

Answered 2021-10-20T13:36:28.547