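I am trying to parse a vocabulary list that contains many Unicode characters. I do not want to handle these characters specially; if possible, I would like to treat them like ordinary characters. My data: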
en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology 01A
en arkitektur /ɑrkitek¹tʉːr/ arkitekturen, arkitekturerarkitekturene an architecture 01A
ei avis /ɑ¹viːs/ avisa, aviser, avisene a newspaper 01P
Barcelona /bɑʃe¹luːnɑ/ proper name 01M
bare /²bɑːre/ just, only 01M
bare bra! /bɑre ¹brɑː/ just fine! 01M
en bensinstasjon /ben¹siːnstɑˌʃuːn/ bensinstasjonen, bensinstasjoner, bensinstasjonene a petrol station 01P
I want a regular expression with two groups:
Group (1): the whole vocabulary entry, without the trailing chapter ID
Group (2): only the chapter ID
Example:
Group (1): en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology
Group (2): 01A
I tried the following patterns, which work fine on https://regex101.com/ (the site I use for debugging): "(.+)(01\S)\n" and "(\D+)(01\S)\n"
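For reference, here is a minimal sketch of what the first pattern does in Python when it is given a plain string. The two sample lines are copied from the data above; the names `sample`, `pattern` and `pairs` are only illustrative. It suggests that the IPA characters need no special handling, since Python 3 strings are Unicode and `.` matches them like any other character.

```python
import re

# Two lines copied from the vocabulary data, joined the way file.read() would return them.
sample = (
    "en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology 01A\n"
    "bare /²bɑːre/ just, only 01M\n"
)

# Raw string, so the backslashes in \S and \n reach the regex engine unchanged.
# (.+) is greedy, then backtracks until "01" plus one non-space character
# followed by a newline can be matched by the second group.
pattern = re.compile(r"(.+)(01\S)\n")

# Group 1 keeps the space before the chapter ID, so it is stripped here.
pairs = [(m.group(1).rstrip(), m.group(2)) for m in pattern.finditer(sample)]

for entry, chapter in pairs:
    print(chapter, "|", entry)
```

Stripping group 1 is a design choice of this sketch; moving the space into the pattern, for example `(.+) (01\S)\n`, should work as well.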
Here is my code and the error I get:
import re

def readTemplate(filepath):  # reading a file
    try:
        with open(filepath, "r") as template:
            data = template.read()
        return data
    except:
        return False

def parseData(data):  # parse file data
    voc = []
    cap = []
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
    for matches in regexMatch:
        voc.append(str(matches.group(1)))
        cap.append(str(matches.group(2)))
    return voc, cap

#-----------------------------Main Prog.-----------------------------
data = readTemplate('Vocubulary.txt')  # open file
voc, cap = parseData(data)  # parse Data
Traceback (most recent call last):
  File "C:/User...Vocabulary.py", line 25, in <module>
    voc, cap = parseData(data) #parse Data
  File "C:/Users...Vocabulary.py", line 15, in parseData
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
TypeError: expected string or bytes-like object

Process finished with exit code 1
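One possible reading of the traceback, sketched under assumptions: the pattern itself may not be what fails. `finditer` raises this `TypeError` whenever its argument is not a string, and `readTemplate` returns `False` as soon as its bare `except` fires, for example on a missing file or a decoding error. The sketch below reproduces that with a file that does not exist, then reads a small test file with an explicit encoding. `read_template_silent` and `read_text` are hypothetical names, and UTF-8 is an assumption about how the vocabulary file was saved.

```python
import os
import re
import tempfile

def read_template_silent(filepath):
    # Same shape as readTemplate above: any failure is turned into False.
    try:
        with open(filepath, "r") as template:
            return template.read()
    except Exception:
        return False

def read_text(filepath, encoding="utf-8"):
    # Hypothetical variant: explicit encoding (assumed UTF-8), and errors are
    # allowed to propagate so the real cause shows up in the traceback.
    with open(filepath, "r", encoding=encoding) as template:
        return template.read()

pattern = re.compile(r"(.+)(01\S)\n")

with tempfile.TemporaryDirectory() as tmp:
    # 1) A path that does not exist: the helper hides the error and returns False.
    data = read_template_silent(os.path.join(tmp, "missing.txt"))
    try:
        list(pattern.finditer(data))
        error_message = None
    except TypeError as exc:
        error_message = str(exc)

    # 2) A small UTF-8 file holding one line of the vocabulary data.
    path = os.path.join(tmp, "vocabulary.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("bare /²bɑːre/ just, only 01M\n")
    matches = [(m.group(1).rstrip(), m.group(2))
               for m in pattern.finditer(read_text(path))]

print(repr(data), "->", error_message)
print(matches)
```

If this reading is right, printing `data` before calling `parseData` in the original script would show `False` rather than the file contents, which would point at the file name or its encoding rather than at the regular expression.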