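I'd like to ask whether any of you have run into a similar error.
I work at a company that uses Airflow, deployed on Azure Kubernetes.
We have a DAG that extracts information from different documents. Among the many things we extract from each document, we use Tika to extract the XML.
The flow is: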
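- We upload 10 files.
- 10 different DAGs are created to extract the information from the documents.
- When they reach the step that extracts the XML with Tika, some of the DAGs start failing because the Tika server is unable to initialize itself.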
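Some facts about the task that uses tika-server:
- We have set the number of retries to 3.
- We have limited the concurrent executions of this task to 3, so that it would never fail.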
This is our task in Airflow:
text_extraction = KubernetesPodOperator(
    task_id="text_extraction",
    name="text_extraction",
    namespace=DEFAULT_NAMESPACE,
    image_pull_secrets=[k8s.V1LocalObjectReference('acr-pull')],
    image=image_text_tools,
    arguments=[
        "tika-text-extract",
        "--input-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.input_file_name}",
        "--xml-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.xml_file_name}",
        "--metadata-path", f"{xcom_pull_folder}/{BASIC_CONFIG_FACTORY.metadata_file_name}",
        "--ocr",
    ],
    get_logs=True,
    is_delete_operator_pod=True,
    startup_timeout_seconds=300,
    volumes=[VOLUME.volume],
    volume_mounts=[VOLUME.volume_mount1],
    max_active_tis_per_dag=3,
    retries=3,
    retry_delay=timedelta(minutes=1),
)
I'm leaving the error here, although I don't think it helps much:
[2022-03-02, 09:27:33 UTC] {pod_manager.py:203} INFO - [cli.py: - parse_document() ] Extracting text with OCR enabled from: /opt/airflow/data/61d45f641b57d80819f9448f/6218edbbe40ccbfe96c6bdcd/20220225-145515_file/file
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:34 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:34 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:39 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:39 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:44 UTC [MainThread ] [WARNI] Failed to see startup log message; retrying...
[2022-03-02, 09:27:44 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Failed to see startup log message; retrying...
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - startServer() ] Tika startup log message not received after 3 tries.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - 2022-03-02, 09:27:49 UTC [MainThread ] [ERROR] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - [tika.py: - checkTikaServer() ] Failed to receive startup confirmation from startServer.
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - Traceback (most recent call last):
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 128, in <module>
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - app()
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return get_command(self)(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return self.main(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - rv = self.invoke(ctx)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return _process_result(sub_ctx.command.invoke(sub_ctx))
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return ctx.invoke(self.callback, **ctx.params)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return __callback(*args, **kwargs)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - return callback(**use_params) # type: ignore
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 99, in tika_text_extract
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parse_document(
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/text-tools/cli.py", line 28, in parse_document
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - parsed_pdf = parser.from_file(ip, xmlContent=True, requestOptions={"headers": headers, "timeout": timeout})
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/parser.py", line 42, in from_file
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - output = parse1(service, filename, serverEndpoint, services={'meta': '/meta', 'text': '/tika', 'all': '/rmeta/xml'},
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 336, in parse1
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - status, response = callServer('put', serverEndpoint, service, f,
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 531, in callServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath, config_path)
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - File "/opt/venv/lib/python3.9/site-packages/tika/tika.py", line 601, in checkTikaServer
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - raise RuntimeError("Unable to start Tika server.")
[2022-03-02, 09:27:49 UTC] {pod_manager.py:203} INFO - RuntimeError: Unable to start Tika server.
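To make the log above easier to read: the pattern it shows (three startup checks about five seconds apart, then a RuntimeError) looks like a bounded poll-and-sleep loop. Below is a minimal sketch of that pattern, not the actual tika-python code. The function name `wait_for_startup`, its parameters, and the injected `sleep` callable are all hypothetical and only there for illustration.

```python
import time
from typing import Callable


def wait_for_startup(
    is_ready: Callable[[], bool],
    max_tries: int = 3,
    sleep_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_ready up to max_tries times, sleeping after each failed check.

    Returns the number of the attempt that succeeded.
    Raises RuntimeError if no attempt succeeded.
    (Hypothetical helper; it only mirrors the timing seen in the log.)
    """
    for attempt in range(1, max_tries + 1):
        if is_ready():
            return attempt
        # Failed check: wait before trying again (or before giving up).
        sleep(sleep_seconds)
    raise RuntimeError(f"startup not confirmed after {max_tries} tries")
```

With a server that needs longer than `max_tries * sleep_seconds` to come up (for example on a busy node running several pods at once), a loop like this gives up even though the server might have started a few seconds later, which would be consistent with what we see, though that is only a guess.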