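Your suspicion about an already-running process is correct. Because tika keeps running in the background, your script does not restart the java process with the new flags when it starts, so the heap size never increases.
To fix this, we can do everything in Python on Windows with the help of psutil: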
from typing import Optional

import psutil
from tika import tika as tika_server
from tika import parser


def get_tika_process() -> Optional[psutil.Process]:
    """Return the first java process whose command line mentions tika, if any."""
    for process in psutil.process_iter(["name", "cmdline"]):
        if "java" in process.name():
            for part in process.cmdline():
                if "tika" in part:
                    return process
    return None


# Stop any server left over from an earlier run so the new JVM flags can apply.
if existing_tika_process := get_tika_process():
    print("Found tika process:", existing_tika_process)
    print("Existing process args:", existing_tika_process.cmdline())
    existing_tika_process.terminate()
    terminate_result = existing_tika_process.wait(10)
    print(f"Terminated tika; exit code {terminate_result}")
else:
    print("No existing tika process found")

# The leading space keeps the flag separate from any args that are already set.
tika_server.TikaJavaArgs += " -Xmx1G"  # See note {1}

# The first parse call starts the server if none is running.
parsed = parser.from_file("spam.txt")
print("Tika server started")

new_tika_process = get_tika_process()
if new_tika_process:
    print("New process args:", new_tika_process.cmdline())

print(parsed["metadata"])
print(parsed["content"])
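{1} I append directly to tika_server.TikaJavaArgs because the environment variable is parsed when tika_server is imported. If you delay the import (as in the first attempt in the question), you can set the environment variable instead.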
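As a rough sketch of that alternative, assuming the variable tika-python reads is TIKA_JAVA_ARGS and that nothing has imported tika.tika yet, the flag can go into the environment first and the import can be deferred. The helper names add_tika_java_args and parse_with_bigger_heap are made up for illustration:

```python
import os


def add_tika_java_args(*flags: str) -> str:
    """Append JVM flags to TIKA_JAVA_ARGS, skipping ones already present."""
    args = os.environ.get("TIKA_JAVA_ARGS", "").split()
    for flag in flags:
        if flag not in args:
            args.append(flag)
    os.environ["TIKA_JAVA_ARGS"] = " ".join(args)
    return os.environ["TIKA_JAVA_ARGS"]


def parse_with_bigger_heap(path: str) -> dict:
    """Set the heap flag, then import tika so it can pick the flag up."""
    add_tika_java_args("-Xmx1G")
    # Deferred import: the environment is read when tika.tika is first
    # imported, so this has no effect if it was imported earlier.
    from tika import parser

    return parser.from_file(path)
```

This only changes what a newly started server receives; a server that is already running still has to be terminated first, as in the script above.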
Result:
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
No existing tika process found
2021-10-22 22:50:04,476 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '54', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
Found tika process: psutil.Process(pid=11244, name='java.exe', status='running', started='22:50:04')
Existing process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
Terminated tika; exit code 15
2021-10-22 22:54:40,016 [MainThread ] [WARNI] Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-Xmx1G', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '55', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!
(venv) PS E:\DevProjects\stack-exchange-answers\69637621>
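You can definitely improve on this (for example, check whether the existing process already has the same args and skip the termination if it does), but this should at least get you going.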
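One possible shape for that check, as a minimal sketch: compare the running process's argument list against the flags you want, and only restart when one is missing. The function name needs_restart is hypothetical, and the comparison is an exact string match, so a server running with -Xmx2G would still count as missing -Xmx1G:

```python
from typing import Iterable, Sequence


def needs_restart(cmdline: Sequence[str], wanted_flags: Iterable[str]) -> bool:
    """Return True if any wanted JVM flag is missing from the process args."""
    current = set(cmdline)
    return any(flag not in current for flag in wanted_flags)


# Argument lists modelled on the two runs in the output (jar path shortened).
old_args = ["java", "-cp", "tika-server.jar", "org.apache.tika.server.TikaServerCli"]
new_args = ["java", "-Xmx1G", "-cp", "tika-server.jar", "org.apache.tika.server.TikaServerCli"]

print(needs_restart(old_args, ["-Xmx1G"]))  # True
print(needs_restart(new_args, ["-Xmx1G"]))  # False
```

In the script, the terminate() and wait() calls would then run only when needs_restart(existing_tika_process.cmdline(), ["-Xmx1G"]) is true.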
Additionally, you should consider adding a call to tika.tika.killServer() at the end of your script to stop the server once you are done.
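A small sketch of how that could look, wrapping the parse in try/finally so the server is stopped even when parsing raises. The wrapper name parse_and_stop_server is an assumption for illustration; only parser.from_file and killServer come from the answer itself:

```python
def parse_and_stop_server(path: str) -> dict:
    """Parse one file, then stop the tika server even if parsing fails."""
    # Imported inside the function so loading this sketch starts nothing.
    from tika import parser, tika as tika_server

    try:
        return parser.from_file(path)
    finally:
        # Runs on success and on error alike.
        tika_server.killServer()
```

If the script parses many files, it is cheaper to put the whole batch inside a single try block and call killServer() once at the end.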