python - Unable to process this regular expression

Question

I have the following "greekSymbols.txt":

Α α alpha
Β β beta
Γ γ gamma
Δ δ delta
Ε ε epsilon
Ζ ζ zeta
Η η eta
Θ θ theta
Ι ι iota
Κ κ kappa
Λ λ lambda
Μ μ mu  
Ν ν nu
Ξ ξ xi
Ο ο omicron
Π π pi
Ρ ρ rho
Σ σ sigma
Τ τ tau
Υ υ upsilon
Φ φ phi
Χ χ chi
Ψ ψ psi
Ω ω omega

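I am trying to convert it into an Anki plain-text file that uses tabs as the separator. I turn each line into two cards, where the front is the symbol (uppercase or lowercase) and the back is the name. This is what I have: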

#!/usr/local/bin/python

import re

pattern = re.compile(r"(.)\s+(.)\s+(.+)", re.UNICODE)

input = open("./greekSymbols.txt", "r")

output = open("./greekSymbolsFormated.txt", "w+")

line = input.readline()

while line:

    string = line.rstrip()

    m = pattern.match(string)

    if m:
        output.write(m.group(1) + "\t" + m.group(3) + "\n")
        output.write(m.group(2) + "\t" + m.group(3) + "\n")
    else:
        print("I was unable to process line '" + string + "' [" +  str(m) + "]")

    line = input.readline()

input.close();
output.close();

Unfortunately, I currently get the "I was unable to process..." message for every line, and the value of str(m) is None. What am I doing wrong?

> localhost:Anki stephen$ python ./convertGreekSymbols.py 
I was unable to process line 'Α α   alpha' [None]
I was unable to process line 'Β β   beta' [None]
...

score 5 · Accepted Answer

You don't really need a regular expression:

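with open("./greekSymbols.txt") as infile, \
     open("./greekSymbolsFormated.txt", "w") as outfile:
    for line in infile:
        # split() with no arguments splits on any run of spaces or tabs
        up, low, name = line.split()
        # one card per symbol; each card needs its own trailing newline
        outfile.write("{0}\t{1}\n".format(up, name))
        outfile.write("{0}\t{1}\n".format(low, name))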

If you want to stick with a regular expression, try the following pattern instead of yours (yours should work IMO, but may not be explicit enough):

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE)
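The split-based approach above can also be sketched as a small function that works on any iterable of lines, which makes it easy to try without touching files. This is only an illustration written for Python 3: the name `to_cards` and the decision to skip lines that do not have exactly three fields are assumptions, not part of the original answer.

```python
def to_cards(lines):
    """Turn 'UPPER lower name' lines into tab-separated card lines."""
    cards = []
    for line in lines:
        # split() with no arguments splits on any run of spaces or tabs
        fields = line.split()
        if len(fields) != 3:
            # Assumption: silently skip blank or malformed lines
            continue
        upper, lower, name = fields
        cards.append("{0}\t{1}".format(upper, name))
        cards.append("{0}\t{1}".format(lower, name))
    return cards


# Sample lines, including trailing spaces and a blank line
sample = ["Α α alpha\n", "Μ μ mu  \n", "\n"]
cards = to_cards(sample)
```

Each returned string is one card without a trailing newline, so writing them out would be a matter of joining with "\n".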

score 2 · Accepted Answer

This looks to me like incorrect whitespace parsing. Shouldn't it be (.)\s(.)\s(.+), rather than \t? There don't seem to be any tabs in your input.

score 2 · Accepted Answer

You have a \t where there is no tab; it should be \s:

>>> matcher = re.compile(r"(.)\s(.)\t(.+)", re.UNICODE) 
>>> phi = "Φ φ phi" 
>>> matcher.match(phi)
>>> matcher = re.compile(r"(.)\s(.)\s+(.+)", re.UNICODE)
>>> matcher.match(phi)
<_sre.SRE_Match object at 0x1018d8290>
>>>

score 0 · Accepted Answer

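Here is the code that finally got things working. It turns out my original file was UTF-8, which was causing the problem. This is the working solution that let me create a tab-separated (\t) import file for Anki.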

#!/usr/local/bin/python

import re
import codecs

pattern = re.compile(r"(\S+)\s+(\S+)\s+(.+)", re.UNICODE)

input = codecs.open("./greekSymbols.txt", "r", encoding="utf-8")

output = codecs.open("./greekSymbolsFormated.txt", "w+", encoding="utf-8")

line = input.readline()

while line:

    string = line.rstrip()

    m = pattern.match(string)

    if m:
        output.write(unicode(m.group(1) + "\t" + m.group(3) + "\n"))
        output.write(unicode(m.group(2) + "\t" + m.group(3) + "\n"))
    else:
        print("I was unable to process line '" + string + "' [" +  str(m) + "]")

    line = input.readline()

input.close();
output.close();
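A likely explanation for why decoding the file helped: when the line is handled as raw UTF-8 bytes, each Greek letter occupies two bytes, so `(.)` captures only the first byte and the following `\s+` then fails on the second one. The sketch below reproduces that effect in Python 3 with an explicit bytes pattern; the variable names are illustrative, and it is a demonstration of the idea rather than the exact Python 2 behavior of the script above.

```python
import re

line = "Φ φ phi"
raw = line.encode("utf-8")  # each Greek letter becomes two bytes

# Same pattern as in the question, once for bytes and once for text
byte_pattern = re.compile(rb"(.)\s+(.)\s+(.+)")
text_pattern = re.compile(r"(.)\s+(.)\s+(.+)")

# On bytes, "." consumes one byte, and the next byte is not whitespace
byte_match = byte_pattern.match(raw)

# On decoded text, "." consumes the whole letter
text_match = text_pattern.match(raw.decode("utf-8"))
```

Here `byte_match` is `None`, while `text_match` captures the two symbols and the name.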
