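I am using a custom model with labels (created with the sample labeling tool) and fetching results with the "Python Form Recognizer Async Analyze" V2 SDK code from the bottom of this page. It basically works, but a single-page PDF takes more than 20 seconds to return a result (6 labels, S0 pricing tier). Processing 150 single-page PDF files took over an hour. We also tested the V1 SDK preview of Form Recognizer (without labels), which was much faster than V2.
I understand that V2 is now asynchronous, but is there anything that can be done to speed up form recognition? Below is essentially the code I am using: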
########### Python Form Recognizer Async Analyze #############
import json
import time
from requests import get, post
# Endpoint URL
endpoint = r"<endpoint>"
apim_key = "<subscription key>"
model_id = "<model_id>"
post_url = endpoint + "/formrecognizer/v2.0-preview/custom/models/%s/analyze" % model_id
source = r"<file path>"
params = {
    "includeTextDetails": True
}
headers = {
    # Request headers
    'Content-Type': '<file type>',
    'Ocp-Apim-Subscription-Key': apim_key,
}
with open(source, "rb") as f:
    data_bytes = f.read()
try:
    resp = post(url = post_url, data = data_bytes, headers = headers, params = params)
    if resp.status_code != 202:
        print("POST analyze failed:\n%s" % json.dumps(resp.json()))
        quit()
    print("POST analyze succeeded:\n%s" % resp.headers)
    get_url = resp.headers["operation-location"]
except Exception as e:
    print("POST analyze failed:\n%s" % str(e))
    quit()
n_tries = 15
n_try = 0
wait_sec = 5
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = {"Ocp-Apim-Subscription-Key": apim_key})
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET analyze results failed:\n%s" % json.dumps(resp_json))
            quit()
        status = resp_json["status"]
        if status == "succeeded":
            print("Analysis succeeded:\n%s" % json.dumps(resp_json))
            quit()
        if status == "failed":
            print("Analysis failed:\n%s" % json.dumps(resp_json))
            quit()
        # Analysis still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)
    except Exception as e:
        msg = "GET analyze results failed:\n%s" % str(e)
        print(msg)
        quit()
print("Analyze operation did not complete within the allocated time.")
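
One thing I noticed while writing this up: the loop above sleeps 5 s, then 10 s, then 20 s between status checks, so a result that is ready shortly after a check is only seen at the next, later check. That may account for part of the 20+ seconds per file, and the files are also processed one after another. Below is a rough sketch of what I am considering instead: a shorter polling interval plus a thread pool so several files are in flight at once. The names `poll_until_done`, `analyze_many`, `fetch_status` and `analyze_one` are hypothetical helpers of my own, not part of the SDK, and the interval and `max_workers` values are guesses; the service may throttle concurrent requests on a given tier.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def poll_until_done(fetch_status, wait_sec=1.0, max_wait_sec=5.0,
                    timeout_sec=120.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until it reports "succeeded" or "failed".

    fetch_status is a hypothetical callable that returns the parsed JSON
    body of a GET on the operation-location URL.
    """
    deadline = clock() + timeout_sec
    while True:
        result = fetch_status()
        if result.get("status") in ("succeeded", "failed"):
            return result
        if clock() >= deadline:
            raise TimeoutError("Analyze operation did not complete in time.")
        sleep(wait_sec)
        # Grow the interval gently so a short job is still noticed quickly.
        wait_sec = min(1.5 * wait_sec, max_wait_sec)


def analyze_many(paths, analyze_one, max_workers=4):
    """Run analyze_one(path) for every path on a thread pool.

    analyze_one is a hypothetical function that POSTs one file and polls
    for its result. Results come back in the same order as paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze_one, paths))
```

With the script above, `fetch_status` would be something like `lambda: get(url=get_url, headers={"Ocp-Apim-Subscription-Key": apim_key}).json()`, and `analyze_one` would wrap the POST and the polling for a single file. I have not measured whether this actually helps with the V2 service.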