python - Extracting strings between quotes across multiple lines in Python

Question

I have a file that contains multiple entries. Each entry has the following format:

"field1","field2","field3","field4","field5"

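All fields are guaranteed not to contain any quotes, but they can contain commas. The problem is that field4 can be split across multiple lines. So a sample file might look like this: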

"john","male US","done","Some sample text
across multiple lines. There
can be many lines of this","foo bar baz"
"jane","female UK","done","fields can have , in them","abc xyz"

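I want to extract the fields using Python. If no field were split across multiple lines, this would be simple: extract the strings between the quotes. But with multi-line fields present, I can't seem to find a simple way to do it.

Edit: There are actually five fields. Sorry for any confusion. The question has been edited to reflect this.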

score 6 · Accepted Answer

I think the csv module can solve this problem. It correctly handles quoted fields that contain newlines:

import csv

# newline='' leaves line breaks untouched so the csv module can handle them itself
with open('infile', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        for field in row:
            print('-- {}'.format(field))

It yields:

-- john
-- male US
-- done
-- Some sample text
across multiple lines. There
can be many lines of this
-- foo bar baz
-- jane
-- female UK
-- done
-- fields can have , in them
-- abc xyz
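
To check this behavior without a file on disk, the same approach can be run against an in-memory copy of the sample from the question. This is only a minimal sketch: `SAMPLE` and `parse_records` are illustrative names, not part of the original answer.

```python
import csv
import io

# Sample data copied from the question; the fourth field of the first
# record spans three lines.
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)


def parse_records(text):
    """Return one list of fields per record (hypothetical helper)."""
    # newline='' keeps embedded line breaks intact for csv.reader.
    return list(csv.reader(io.StringIO(text, newline='')))


records = parse_records(SAMPLE)
for name, details, status, text, tag in records:
    # repr() makes the line breaks inside the fourth field visible.
    print(name, '->', repr(text))
```

Each record comes back as a list of five strings, and the line breaks inside the fourth field are preserved rather than treated as record separators.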

score 1 · Accepted Answer

The answer to the question you linked worked for me:

import re

with open("test.txt") as f:
    text = f.read()

# Capture everything between each pair of double quotes, newlines included
string_list = re.findall(r'"([^"]*)"', text)

At this point, string_list contains your strings. Now, these strings can have line breaks in them, but since string_list is a list, you can apply replace to each string in it:

new_strings = [s.replace("\n", " ") for s in string_list]

to clean that up.
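
The regex returns one flat list of strings, so the fields still have to be regrouped into entries. A minimal sketch of that step, assuming every entry has exactly five fields as the question states (`SAMPLE`, `FIELDS_PER_RECORD`, and `records` are illustrative names):

```python
import re

# Sample data copied from the question.
SAMPLE = (
    '"john","male US","done","Some sample text\n'
    'across multiple lines. There\n'
    'can be many lines of this","foo bar baz"\n'
    '"jane","female UK","done","fields can have , in them","abc xyz"\n'
)

FIELDS_PER_RECORD = 5  # assumption taken from the question's format

# Each match is the text between one pair of double quotes;
# the class [^"] also matches newline characters.
fields = re.findall(r'"([^"]*)"', SAMPLE)

# Flatten embedded line breaks into spaces, one string at a time.
fields = [field.replace("\n", " ") for field in fields]

# Regroup the flat list into records of five fields each.
records = [
    fields[i:i + FIELDS_PER_RECORD]
    for i in range(0, len(fields), FIELDS_PER_RECORD)
]
```

This relies on the guarantee that no field contains a quote character; if that guarantee does not hold, the pairing of quotes breaks and a real CSV parser is the safer choice.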

score 0 · Accepted Answer

Try:

awk -F',' '/pattern if needed/ {print $0}' fname

Answered 2013-08-31T22:38:35.227

score 0 · Accepted Answer

If you control the input to this file, you need to sanitize it beforehand by replacing \n with something ([\n]?) before putting the values into a comma-separated list.

Or, instead of saving strings -- save them as r-strings.

Then use the csv module to parse it quickly with a predefined separator, encoding, and quotechar.
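
One possible reading of this suggestion, sketched under stated assumptions: line breaks are swapped for a placeholder before writing, and swapped back after parsing. The placeholder `[\n]` (as literal characters), the helper names `write_records` and `read_records`, and the use of `csv.QUOTE_ALL` are illustrative choices, not something the answer specifies.

```python
import csv
import io

# Literal characters "[", "\", "n", "]"; an assumed marker that must
# never occur in the real data.
PLACEHOLDER = '[\\n]'


def write_records(records):
    """Serialize records as fully quoted CSV, one physical line per record."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator='\n')
    for record in records:
        writer.writerow([field.replace('\n', PLACEHOLDER) for field in record])
    return buffer.getvalue()


def read_records(text):
    """Parse the sanitized text and restore the original line breaks."""
    reader = csv.reader(io.StringIO(text, newline=''),
                        delimiter=',', quotechar='"')
    return [[field.replace(PLACEHOLDER, '\n') for field in row]
            for row in reader]


original = [
    ['john', 'male US', 'done',
     'Some sample text\nacross multiple lines.', 'foo bar baz'],
    ['jane', 'female UK', 'done', 'fields can have , in them', 'abc xyz'],
]
sanitized = write_records(original)
restored = read_records(sanitized)
```

With the line breaks replaced, every record occupies exactly one line of the file, so line-oriented tools can process it as well.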
