0

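I'm working on a script that parses a text file and normalizes it so that it can be inserted into a database. The data represents articles written by one or more authors. The problem is that, because the number of authors isn't fixed, I get a variable number of columns in my output text file. For example: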

author1, author2, author3, this is the title of the article
author1, author2, this is the title of the article
author1, author2, author3, author4, this is the title of the article

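These results give me a maximum of 5 columns. So for the first 2 articles I need to add blank columns so that every row in the output has the same number of columns. What's the best way to do this? My input text is tab-delimited, and I can easily iterate over the fields by splitting on tabs.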


2 Answers

2

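Assuming you already know the maximum number of columns and have already split each line into a list (I'm going to assume you've collected those into a list of their own), you should be able to just use list.insert(-1, item) to add the empty columns: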

def columnize(mylists, maxcolumns):
    # Pad each row in place, inserting None just before the title (the last item).
    for row in mylists:
        while len(row) < maxcolumns:
            row.insert(-1, None)

mylists = [["author1","author2","author3","this is the title of the article"],
           ["author1","author2","this is the title of the article"],
           ["author1","author2","author3","author4","this is the title of the article"]]

columnize(mylists,5)
print(mylists)

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]

An alternative version that uses a list comprehension and doesn't modify the original lists:

def columnize(mylists, maxcolumns):
    # Build new rows: authors, then None padding, then the title. The input is left untouched.
    return [j[:-1] + [None] * (maxcolumns - len(j)) + j[-1:] for j in mylists]

print(columnize(mylists, 5))

[['author1', 'author2', 'author3', None, 'this is the title of the article'], ['author1', 'author2', None, None, 'this is the title of the article'], ['author1', 'author2', 'author3', 'author4', 'this is the title of the article']]
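
If the maximum column count isn't known up front, it can be derived from the rows themselves. Here is a minimal sketch, assuming tab-delimited input lines with the title as the last field. The name `pad_rows` and the sample data are made up for illustration and aren't from the question:

```python
def pad_rows(lines, filler=None):
    """Split tab-delimited lines and pad each row to the widest row's length.

    Assumption: the last field of every line is the title, so the filler
    values go between the authors and the title.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    return [row[:-1] + [filler] * (width - len(row)) + row[-1:] for row in rows]


# Hypothetical sample input, standing in for lines read from the real file.
sample = [
    "author1\tauthor2\tauthor3\ttitle A\n",
    "author1\tauthor2\ttitle B\n",
    "author1\tauthor2\tauthor3\tauthor4\ttitle C\n",
]
padded = pad_rows(sample)
for row in padded:
    print(row)
```

This keeps the title in the last column of every row, which is probably what you want if the output is going to be loaded into a fixed-width table.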
answered 2012-05-19T02:58:22.757
1

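Forgive me if I've misunderstood, but it sounds like you're approaching this the hard way. It's pretty easy to convert your text file into a dictionary that maps each title to a list of its authors: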

>>> lines = ["auth1, auth2, auth3, article1", "auth1, auth2, article2","auth1, article3"]
>>> d = {x[-1]: x[:-1] for x in (line.split(', ') for line in lines)}
>>> d
{'article1': ['auth1', 'auth2', 'auth3'], 'article2': ['auth1', 'auth2'], 'article3': ['auth1']}
>>> total_articles = len(d)
>>> total_articles
3
>>> max_authors = max(len(val) for val in d.values())
>>> max_authors
3
>>> for k, v in d.items():
...     print(k)
...     print(v + [None] * (max_authors - len(v)))
...
article1
['auth1', 'auth2', 'auth3']
article2
['auth1', 'auth2', None]
article3
['auth1', None, None]

Then, if you really want to, you can output this data using Python's built-in csv module. Or you could directly output the SQL you need.
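
For the csv route, something along these lines might work. It's only a sketch: the `articles` dictionary is hard-coded sample data, the tab delimiter and empty-string padding are my assumptions about the output you want, and `io.StringIO` stands in for a real output file (which you would open with `newline=""`):

```python
import csv
import io

# Sample data in the title -> authors shape built above (hypothetical values).
articles = {
    "article1": ["auth1", "auth2", "auth3"],
    "article2": ["auth1", "auth2"],
    "article3": ["auth1"],
}
max_authors = max(len(authors) for authors in articles.values())

buffer = io.StringIO()  # stand-in for a real file opened with newline=""
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for title, authors in articles.items():
    # Pad with empty strings so every row has max_authors + 1 fields.
    writer.writerow(authors + [""] * (max_authors - len(authors)) + [title])

output = buffer.getvalue()
print(output)
```

Letting `csv.writer` do the joining also means titles that happen to contain the delimiter or quotes get escaped for you.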

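You're opening the same file several times and reading it several times, just to get counts that can be derived from data you already have in memory. Please don't read the file multiple times for that.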
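
As a rough illustration of the single-pass idea, assuming tab-delimited lines with the title in the last field: `load_articles` is a hypothetical helper name, and `io.StringIO` stands in for the file object you would get from `open()`:

```python
import io


def load_articles(handle):
    """Read a tab-delimited file once; return a list of (title, authors) pairs."""
    articles = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields == [""]:
            continue  # skip blank lines
        articles.append((fields[-1], fields[:-1]))
    return articles


# Stand-in for a real file object; the contents are made-up sample data.
source = io.StringIO("a1\ta2\ta3\tfirst title\n\na1\ta2\tsecond title\n")
articles = load_articles(source)

# Both counts come from the in-memory list, with no second pass over the file.
total_articles = len(articles)
max_authors = max(len(authors) for _, authors in articles)
print(total_articles, max_authors)
```

A list of pairs is used here instead of a dictionary so that two articles with the same title don't overwrite each other.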

answered 2012-05-19T03:24:29.477