I call a Java program from my Python code as follows:

subprocess.check_output(["java", "-classpath", "/Users/feralvam/Programas/semanticvectors-3.4/semanticvectors-3.4.jar:/Users/feralvam/Programas/lucene-3.5.0/lucene-core-3.5.0.jar:/Users/feralvam/Programas/lucene-3.5.0/contrib/demo/lucene-demo-3.5.0.jar:", "pitt.search.semanticvectors.CompareTerms", "-queryvectorfile","/Users/feralvam/termvectors.bin",term1,term2])

"term1" and "term2" are strings read from a UTF-8 encoded text file.

When I run this command from PyDev (version 2.5, in Eclipse 3.7.2), I get the following output (here, "term1" = "Eles" and "term2" = "é"):

Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: /Users/feralvam/termvectors.bin
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at 
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms will have same weight.
Didn't find vector for 'Eles'
No vector for 'Eles'
Didn't find vector for '??'
No vector for '??'
Jun 26, 2012 11:20:55 AM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Eles" with "??" ...

But if I run the same command from the terminal, I get:

Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Opened query vector store from file: /Users/feralvam/termvectors.bin
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Couldn't open Lucene index at 
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: No Lucene index for query term weighting, so all query terms will have same weight.
Didn't find vector for 'Eles'
No vector for 'Eles'
Found vector for 'é'
Jun 26, 2012 11:30:26 AM pitt.search.semanticvectors.CompareTerms main
INFO: Outputting similarity of "Eles" with "é" ...

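Leaving aside how SemanticVectors works, the problem is that in the second case "term2" is passed with the correct encoding, while in the first case it is not.

Now, running this command: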

print locale.getpreferredencoding(), sys.getdefaultencoding()

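I get the following: US-ASCII utf-8 (in PyDev) and UTF-8 ascii (in the terminal).

So I think what is happening is that US-ASCII is being used to pass the arguments, and the result is wrong because the words do not arrive in the correct encoding. By the way, I'm using Python 2.7.

Is there any way to change this?

Thanks in advance for any help.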
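
As a rough illustration of why "é" could show up as '??', the sketch below encodes the character as UTF-8 and then decodes those bytes as ASCII. It assumes the receiving side treats the argument bytes as US-ASCII; that is one plausible reading of the log above, not something the log confirms. The variable names are made up for this example.

```python
# -*- coding: utf-8 -*-
# Illustrative only: assumes the argument bytes are decoded as US-ASCII.

term = u"\u00e9"                    # the character 'é'
utf8_bytes = term.encode("utf-8")   # how 'é' is stored in a UTF-8 text file

# Each non-ASCII byte is invalid in ASCII, so each one is replaced separately.
as_ascii = utf8_bytes.decode("ascii", "replace")

print(len(utf8_bytes))   # number of bytes used for a single character
print(len(as_ascii))     # number of replacement characters after decoding
```

One character becoming two unknown characters matches the two question marks in the PyDev output.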

1 Answer

When starting the process, you can pass a locale name in the LANG environment variable. Do something like:

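import os
import subprocess

# Copy the current environment and request a UTF-8 locale for the child process.
env = os.environ.copy()
env['LANG'] = 'en_US.UTF-8'
subprocess.check_output(..., env=env)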
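
The environment handling above can be checked without Java. The following is a minimal sketch that starts a child Python process and reports the LANG value it sees. The helper names `utf8_env` and `child_lang` are hypothetical and exist only for this example. Note that on POSIX systems LC_ALL and LC_CTYPE take precedence over LANG, so if either is already set in the parent environment it may need to be overridden as well.

```python
import os
import subprocess
import sys


def utf8_env(locale_name="en_US.UTF-8"):
    """Return a copy of the current environment with LANG set to locale_name."""
    env = os.environ.copy()
    env["LANG"] = locale_name
    return env


def child_lang(env):
    """Start a child Python process and return the LANG value it sees."""
    code = "import os; print(os.environ.get('LANG', ''))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


if __name__ == "__main__":
    # The child inherits the modified copy, not the parent's original LANG.
    print(child_lang(utf8_env()))
```

Copying `os.environ` rather than passing a dictionary that contains only LANG keeps PATH and the other variables the child may need.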
answered 2012-06-26T10:53:22.347