
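Has anyone here run into a similar error?

I work at a company that uses Airflow, deployed on Azure Kubernetes.

We have a DAG that extracts information from documents. Among the many things we extract from each document, we use Tika to extract XML.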

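The flow is:

  • We upload 10 files.
  • 10 different DAGs are created to extract information from the documents.
  • When they reach the step that uses Tika to extract the XML, some of the DAGs start failing because the Tika server cannot initialize itself.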

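Some facts about the task that uses tika-server:

  • We have set the number of retries to 3.
  • We have limited concurrent executions of this task to 3, so we did not expect it to fail.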

This is our task in Airflow:

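    from datetime import timedelta

    from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
    from kubernetes.client import models as k8s

    # DEFAULT_NAMESPACE, image_text_tools, xcom_pull_folder, BASIC_CONFIG_FACTORY
    # and VOLUME are defined elsewhere in our DAG file.
    text_extraction = KubernetesPodOperator(
        task_id="text_extraction",
        name="text_extraction",
        namespace=DEFAULT_NAMESPACE,
        image_pull_secrets=[k8s.V1LocalObjectReference('acr-pull')],
        image=image_text_tools,
        arguments=[
            "tika-text-extract",
            "--input-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.input_file_name}",
            "--xml-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.xml_file_name}",
            "--metadata-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.metadata_file_name}",
            "--ocr"
        ],
        get_logs=True,
        is_delete_operator_pod=True,
        startup_timeout_seconds=300,
        volumes=[VOLUME.volume],
        volume_mounts=[VOLUME.volume_mount1],
        # At most 3 instances of this task run at the same time.
        max_active_tis_per_dag=3,
        retries=3,
        retry_delay=timedelta(minutes=1),
    )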

I'm leaving the error here, although I don't think it helps much:

[2022-03-02, 09:27:33 UTC] {pod_manager.py:203} INFO - [cli.py: - parse_document() ] Extracting text with OCR enabled from: /opt/airflow/data/61d45f641b57d80819f9448f/6218edbbe40ccbfe96c6bdcd/20220225-145515_file/file
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:34 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:39 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:44 UTC [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread  ] [ERROR]  Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread  ] [ERROR]  Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - checkTikaServer() ] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 128, in <module>
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     app()
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return get_command(self)(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return self.main(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     rv = self.invoke(ctx)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return __callback(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     return callback(**use_params)  # type: ignore
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 99, in tika_text_extract
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     parse_document(
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/text-tools/cli.py", line 28, in parse_document
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     parsed_pdf = parser.from_file(ip, xmlContent=True, requestOptions={"headers": headers, "timeout": timeout})
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/parser.py", line 42, in from_file
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     status, response = callServer('put', serverEndpoint, service, f,
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -   File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO -     raise RuntimeError("Unable to start Tika server.")
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - RuntimeError: Unable to start Tika server.
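
To make the timing in the log easier to read: the three "retrying..." warnings arrive 5 seconds apart and the error follows 5 seconds after the last one, which looks like a fixed-interval polling loop that gives up after 3 tries (roughly 15 seconds in total). The sketch below is only my reading of that pattern, not the actual tika-python code; wait_for_startup, check_server and their parameters are hypothetical names, and the defaults of 3 tries and 5 seconds are inferred from the timestamps above.

```python
import time
from typing import Callable


def wait_for_startup(
    is_started: Callable[[], bool],
    max_tries: int = 3,            # inferred from "after 3 tries" in the log
    delay_seconds: float = 5.0,    # inferred from the 5 s gaps between warnings
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_started() up to max_tries times, sleeping after each failed poll."""
    for _ in range(max_tries):
        if is_started():
            return True
        print("Failed to see startup log message; retrying...")
        sleep(delay_seconds)
    print(f"Tika startup log message not received after {max_tries} tries.")
    return False


def check_server(is_started: Callable[[], bool], **kwargs) -> None:
    """Raise the same error seen in the traceback if startup is never confirmed."""
    if not wait_for_startup(is_started, **kwargs):
        raise RuntimeError("Unable to start Tika server.")
```

If the real loop behaves like this, the server process gets about 15 seconds to report that it has started before the client raises, regardless of the 3 Airflow retries configured on the task.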