python - How do I ignore Unicode characters with a regular expression in Python 3?

Question

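I am trying to parse a vocabulary list that contains a lot of Unicode characters. I don't want to handle these characters specially; if possible, I'd like to treat them like ordinary characters. My data: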

en  antropologi     /ɑntrupulu¹giː/     antropologien, antropologier, antropologiene    an anthropology     01A
en  arkitektur  /ɑrkitek¹tʉːr/  arkitekturen, arkitekturerarkitekturene     an architecture     01A
ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P
    Barcelona   /bɑʃe¹luːnɑ/        proper name     01M
    bare    /²bɑːre/        just, only  01M
    bare bra!   /bɑre ¹brɑː/        just fine!  01M
en  bensinstasjon   /ben¹siːnstɑˌʃuːn/  bensinstasjonen, bensinstasjoner, bensinstasjonene  a petrol station    01P

I want a regular expression with two groups: group (1) includes all of the vocabulary without the trailing "chapter ID"; group (2) contains only the "chapter ID".

Example: group (1): en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology
group (2): 01A

I tried the following patterns, both of which work fine on https://regex101.com/, which I use for debugging: "(.+)(01\S)\n" and "(\D+)(01\S)\n"

Here is my code and the error I get:

import re

def readTemplate(filepath): #reading a file
    try:
        with open(filepath, "r") as template:
            data = template.read()
        return data
    except:
        return False

def parseData(data): #parse file data
    voc = []
    cap = []

    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
    for matches in regexMatch:
        voc.append(str(matches.group(1)))
        cap.append(str(matches.group(2)))

    return voc, cap

#-----------------------------Main Prog.-----------------------------

data = readTemplate('Vocubulary.txt') #open file
voc, cap = parseData(data) #parse Data

Traceback (most recent call last):
  File "C:/User...Vocabulary.py", line 25, in <module>
    voc, cap = parseData(data) #parse Data
  File "C:/Users...Vocabulary.py", line 15, in parseData
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
TypeError: expected string or bytes-like object

Process finished with exit code 1

score -1 · Accepted Answer

The code runs without error on my machine (Linux), but you could try using a raw string for the pattern:

regexMatch = re.compile(r"(.+)(01\S)\n").finditer(data)
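A note beyond the suggestion above: the traceback says `finditer` received something that is not a string, which fits `readTemplate` returning `False` after its bare `except` swallowed an error (for example a wrong file name, or a decoding error under the platform's default encoding). That is an inference from the posted code, not something confirmed in the question. The raw string is still good practice, but `"\S"` is not a recognized string escape in Python, so it should not change what the pattern matches here. Below is a minimal sketch under those assumptions. The names `read_template`, `parse_data` and `LINE_PATTERN`, the UTF-8 encoding, and the use of `re.MULTILINE` with `$` instead of a literal `\n` are illustrative choices, not part of the original code.

```python
import re

# Raw string, so \S reaches the regex engine unchanged. Python 3 str patterns
# are Unicode-aware by default, so the IPA symbols need no special handling.
# Assumption: every wanted line ends in a chapter ID of the form 01 + one character.
LINE_PATTERN = re.compile(r"^(.+?)[ \t]*(01\S)$", re.MULTILINE)


def read_template(filepath):
    """Return the file contents as str and let I/O or decoding errors propagate."""
    # encoding="utf-8" is an assumption about how the vocabulary file was saved.
    with open(filepath, "r", encoding="utf-8") as template:
        return template.read()


def parse_data(data):
    """Split each matching line into (vocabulary text, chapter ID)."""
    voc, cap = [], []
    for match in LINE_PATTERN.finditer(data):
        voc.append(match.group(1))  # everything before the chapter ID
        cap.append(match.group(2))  # the chapter ID itself, e.g. "01P"
    return voc, cap


# Two lines taken from the data in the question.
sample = (
    "ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P\n"
    "    bare    /²bɑːre/        just, only  01M\n"
)
voc, cap = parse_data(sample)
```

Unlike the original pattern, this variant drops the spaces and tabs between the vocabulary text and the chapter ID, and it also matches a final line that has no trailing newline. Because `read_template` no longer hides exceptions, a missing or misnamed file would surface as `FileNotFoundError` at the `open` call instead of a `TypeError` later in the regex.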

