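I am trying to parse a vocabulary list that contains many Unicode characters. I don't want to treat these characters specially; if possible, I would like to handle them like ordinary characters. My data: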

en  antropologi     /ɑntrupulu¹giː/     antropologien, antropologier, antropologiene    an anthropology     01A
en  arkitektur  /ɑrkitek¹tʉːr/  arkitekturen, arkitekturerarkitekturene     an architecture     01A
ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P
    Barcelona   /bɑʃe¹luːnɑ/        proper name     01M
    bare    /²bɑːre/        just, only  01M
    bare bra!   /bɑre ¹brɑː/        just fine!  01M
en  bensinstasjon   /ben¹siːnstɑˌʃuːn/  bensinstasjonen, bensinstasjoner, bensinstasjonene  a petrol station    01P

I want a regular expression with two groups. Group (1): the whole vocabulary entry without the trailing chapter ID. Group (2): only the chapter ID.

Example: Group (1): en antropologi /ɑntrupulu¹giː/ antropologien, antropologier, antropologiene an anthropology
Group (2): 01A

I tried the following patterns, both of which work fine on https://regex101.com/, which I use for debugging: "(.+)(01\S)\n" and "(\D+)(01\S)\n"

Here is my code and the error I get:

import re

def readTemplate(filepath): #reading a file
    try:
        with open(filepath, "r") as template:
            data = template.read()
        return data
    except:
        return False

def parseData(data): #parse file data
    voc = []
    cap = []

    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
    for matches in regexMatch:
        voc.append(str(matches.group(1)))
        cap.append(str(matches.group(2)))

    return voc, cap

#-----------------------------Main Prog.-----------------------------

data = readTemplate('Vocubulary.txt') #open file
voc, cap = parseData(data) #parse Data

Traceback (most recent call last):
  File "C:/User...Vocabulary.py", line 25, in <module>
    voc, cap = parseData(data) #parse Data
  File "C:/Users...Vocabulary.py", line 15, in parseData
    regexMatch = re.compile("(.+)(01\S)\n").finditer(data)
TypeError: expected string or bytes-like object

Process finished with exit code 1

1 Answer

The code runs without errors on my machine (Linux), but you could try using a raw string for the pattern:

regexMatch = re.compile(r"(.+)(01\S)\n").finditer(data)
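
Note that the TypeError in the traceback means `data` was not a string when it reached `finditer`. In the posted code `readTemplate` returns False on any exception, so a missing file (check the spelling of 'Vocubulary.txt') or a decoding error would produce exactly this message. A decoding error is plausible on Windows, where `open()` without an `encoding` argument uses the platform default rather than UTF-8. The following is a minimal sketch under those assumptions, not a confirmed diagnosis: the names `read_template`, `parse_data` and `LINE_PATTERN` are illustrative, the file is assumed to be UTF-8, and the chapter ID is assumed to start with "01" as in the original pattern.

```python
import re

# Raw string so that \S reaches the regex engine unchanged.
# Group 1: the vocabulary entry; group 2: the chapter ID (e.g. 01A).
# Assumption: chapter IDs start with "01", as in the question's pattern.
LINE_PATTERN = re.compile(r"^(.+?)[ \t]+(01\S)[ \t]*$", re.MULTILINE)


def read_template(filepath):
    """Read the whole file as text, assuming it is UTF-8 encoded.

    Errors (missing file, wrong encoding) are allowed to propagate
    instead of being replaced by a False return value.
    """
    with open(filepath, "r", encoding="utf-8") as template:
        return template.read()


def parse_data(data):
    """Return (voc, cap): entries without the chapter ID, and the chapter IDs."""
    voc, cap = [], []
    for match in LINE_PATTERN.finditer(data):
        voc.append(match.group(1).strip())
        cap.append(match.group(2))
    return voc, cap


# Two lines taken from the sample data in the question.
sample = (
    "ei  avis    /ɑ¹viːs/    avisa, aviser, avisene  a newspaper     01P\n"
    "    bare    /²bɑːre/        just, only  01M\n"
)
voc, cap = parse_data(sample)
print(cap)  # ['01P', '01M']
```

With the real file, `voc, cap = parse_data(read_template('Vocubulary.txt'))` would then either succeed or fail with a FileNotFoundError or UnicodeDecodeError that names the actual problem.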
Answered 2019-09-08T11:06:45.730