5

I have a file containing multiple entries. Each entry has the following format:

"field1","field2","field3","field4","field5"

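All fields are guaranteed not to contain any quotation marks, but they can contain commas. The problem is that field4 can be split across multiple lines. So an example file might look like this: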

"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"

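I want to extract the fields using Python. If the fields were never split across multiple lines, this would be simple: extract the strings between the quotes. But with multi-line fields present, I can't seem to find a simple way to do it.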

EDIT: There are actually five fields. Sorry for any confusion. The question has been edited to reflect this.

4 Answers

6

I think the csv module can solve this problem. It handles newlines inside quoted fields correctly:

import csv

# newline='' lets the csv module handle line breaks inside quoted fields itself
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))

It yields:

-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz
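
As a self-contained check of the same idea, here is a minimal sketch that parses the sample data from the question out of an in-memory string instead of a file. The `SAMPLE` constant and the `parse_records` helper are names made up for this illustration; only `csv.reader` and `io.StringIO` come from the standard library.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)


def parse_records(text):
    """Return a list of rows, each row being a list of field strings."""
    # newline="" keeps the line breaks untouched so csv can interpret them.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = parse_records(SAMPLE)
for row in rows:
    # Each row should have five fields, even when field4 spans several lines.
    print(len(row), repr(row[3]))
```

Each record comes back as one row of five fields, with the line breaks of field4 preserved inside the string.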
answered 2013-08-31T22:46:02.200
1

The answer to the question you linked worked for me:

import re

with open("test.txt") as f:
    text = f.read()

# Capture everything between each pair of double quotes, newlines included
string_list = re.findall(r'"([^"]*)"', text)

At this point, string_list contains your strings. Now, these strings can have line breaks in them, but you can use

new_strings = [s.replace("\n", " ") for s in string_list]

to clean that up.
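
Note that `re.findall` returns one flat list of fields, not one list per entry. A minimal sketch of regrouping the fields into records, assuming every entry has exactly five fields as stated in the question (the `extract_records` helper and the `FIELDS_PER_RECORD` constant are hypothetical names for this illustration):

```python
import re

FIELDS_PER_RECORD = 5  # assumption taken from the question's format


def extract_records(text):
    """Split quoted fields out of text and group them five at a time."""
    # Works only because fields are guaranteed not to contain quotes.
    fields = re.findall(r'"([^"]*)"', text)
    # Flatten embedded line breaks into spaces.
    cleaned = [field.replace("\n", " ") for field in fields]
    return [
        cleaned[i:i + FIELDS_PER_RECORD]
        for i in range(0, len(cleaned), FIELDS_PER_RECORD)
    ]


sample = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)
records = extract_records(sample)
for record in records:
    print(record)
```

This relies entirely on the no-quotes guarantee; if a field could ever contain a quotation mark, the grouping would silently shift.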

answered 2013-08-31T22:45:07.460
0

Try:

awk 'BEGIN { FS = "," } /pattern if needed/ { print $0 }' fname
answered 2013-08-31T22:38:35.227
0

If you control the input to this file, you need to sanitize it beforehand by replacing \n with something ([\n]?) before putting the values into a comma-separated list.

Or, instead of saving the strings with real line breaks, save them with the newlines escaped, the way a raw string (r-string) would show them.

Then use the csv module to parse it quickly, with a predefined delimiter, encoding, and quotechar.
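
A rough sketch of that sanitize-then-parse idea, assuming you control how the file is written. The helper names and the choice of a literal backslash-n as the placeholder are assumptions made for this illustration, and the round trip is ambiguous if a field already contains that two-character sequence.

```python
import csv
import io

PLACEHOLDER = "\\n"  # literal backslash + n; hypothetical choice of marker


def write_sanitized(rows):
    """Write rows as quoted CSV with real newlines replaced by PLACEHOLDER."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        writer.writerow([field.replace("\n", PLACEHOLDER) for field in row])
    return buf.getvalue()


def read_sanitized(text):
    """Parse the sanitized CSV and restore the original newlines."""
    reader = csv.reader(io.StringIO(text, newline=""))
    return [[field.replace(PLACEHOLDER, "\n") for field in row] for row in reader]


original = [["john", "male US", "done", "line one\nline two", "foo bar baz"]]
sanitized = write_sanitized(original)
restored = read_sanitized(sanitized)
print(sanitized)
```

With the newlines escaped, every entry occupies exactly one physical line, so line-oriented tools can process the file as well.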

answered 2013-08-31T22:50:06.313