I have the following Python script, which preprocesses the text in a CSV file with four columns.

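import csv
import re
import string
from bs4 import BeautifulSoup as BS  # assumed import: BS is not defined in the original snippet

bad_tags=['code','a','img']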
stops=['the','i',"i'd",'cannot','like','if','an','a','is','or','no','that',"i'm",'and','you','which','there','way','to','if','from','certain','quite',
        'help','me','how','should','why','can','what','in','on','where','thanks','thank','want','need','so','could','would','when','do','using',
        'another',"i've",'gives','still','while','this','for','but','actually','that','into','these','something','some','want','not','please','me',
        'know','it','have','stuff','with','each','able','wondering','such','finding','matter','question','as','make','use','my','any','be','more','than',
        'was','of','etc','find','answer','myself','since','work','without','kinds','very','then','think','thinking','thought','although','however','which',
        'anyway','anyways','more','at','every','everyone','never',"can't","won't","shouldn't","couldn't","there's",'sure','no','already','works','problem',
        'most','mostly','turned','am','create',"that's",'whole','putting','getting','good','bad','great','worst','best','worse','only','better',
        'now','often','happen','happens','happening','out','in','all','appreciate','basically','given','gives','gave','somewhere','try','tried','takes','taking',
        'e.g','question','trouble','based','guess','after','enough','has','them','ie','eg','having','weird','those','trying','wants','said','its','giving','whats','later',
        'used',"isn't",'gonna','will','explain','once','take','after','unfortunately','fortunately','receive','they','suppose','being','hence','did','wanna','usual',
        'questions','before','by','are',"aren't",'almost','wanted','does','someone','containing','because','within','just','own','easier','much','appreciated']
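# Read the input CSV and write the preprocessed rows to the output file
with open(r"input_file.csv") as r, open(r"output_file.csv", "w") as w:
    reader=csv.reader(r)
    next(r)  # skip the header line
    for row in reader:
        # Column 3: parse the HTML and drop the unwanted tags together with their contents
        soup=BS(row[2],'html')
        for tag in soup.findAll(True):
            if tag.name in bad_tags:
                tag.extract()
        new_string=soup.renderContents()
        # Strip any remaining markup, keeping only the text
        final0=re.sub(r'<[^>]+>', '', new_string)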

        parsing=re.findall(r"[\w+]+(?:[-'/.][\w+]+)*|'|[-.(]+|\S[\w+]*",re.sub(r'<[^>]+>', '', row[1]))
        final=' '.join(w.lower() for w in parsing if w not in string.punctuation)
        parsing2=[b for b in final.split(' ') if not b in stops]
        final2=' '.join(parsing2)

        parsing3=re.findall(r"[\w+]+(?:[-'/.][\w+]+)*|'|[-.(]+|\S[\w+]*",final0)
        final3=' '.join(w.lower() for w in parsing3 if w not in string.punctuation)
        parsing4=[b for b in final3.split(' ') if not b in stops]
        final4=' '.join(parsing4)


        w.write("{},{},{},{}\n".format(row[0],final2,final4,row[3]))

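It works fine for most examples, but occasionally it joins several lines of the input file into a single line of the output file, and I don't know why this happens. Can anyone help me fix this?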
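
One thing that may be worth checking, offered only as an assumption because the input file is not shown: if a quoted field in the input contains a line break, `csv.reader` returns it as part of a single record, and tokenizing and re-joining the text with spaces then removes the embedded newline. The sketch below uses made-up sample data and a hypothetical helper `flatten` to show the effect; it is not taken from the script above.

```python
import csv
import io

# Hypothetical sample data: the third column of the first record is a
# quoted field that spans two physical lines of the file.
sample = (
    'id,title,body,tags\n'
    '1,Title one,"first line\nsecond line",python\n'
    '2,Title two,plain body,csv\n'
)

def flatten(text):
    """Split on whitespace and re-join with single spaces (drops newlines)."""
    return ' '.join(text.split())

physical_lines = sample.splitlines()
rows = list(csv.reader(io.StringIO(sample)))

print(len(physical_lines))   # number of physical lines in the input
print(len(rows))             # number of records seen by csv.reader
print(repr(rows[1][2]))      # the field still contains the newline
print(repr(flatten(rows[1][2])))
```

If this matches the input, the joined output lines would correspond to records whose fields span several physical lines, rather than to separate records being merged.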
