python - Python beginner: read elements from one file and use them to modify another file

Question

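I am an economist with no programming background. I am trying to learn Python because I was told it is very powerful for parsing data from websites. At the moment I am stuck with the following code, and I would be grateful for any suggestions.

First, I wrote some code to parse the data in this table: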

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

The code I wrote is the following:

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



outfile = open("milano.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

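The code reads the table, keeps only the information I need, and writes it to a txt file. It is quite crude, but it does the job.

This is where my problem starts. The URL I posted above is just one of about 200 URLs from which I need to parse data. All of the URLs differ in only two elements. Taking the previous URL: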

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

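The two elements that uniquely identify this page are MILANO (the city name) and 15146 (a bureaucratic code).

What I would like to do is, first, create a file with two columns:

the city names I need in the first;
the bureaucratic codes in the second.

Then I would like to write a loop in Python that reads each line of this file, modifies the URL in my code accordingly, and runs the parsing task separately for each city.

Do you have any suggestions on how to proceed? Thanks in advance for any help and advice!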

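[Update]

Thanks, everyone, for the useful suggestions. Given my knowledge of Python, I found Thomas K's answer the easiest to implement. I still have a problem, though. I modified the code as follows: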

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

citylist = csv.reader(open("citycodes.csv", "rU"), dialect = csv.excel)
for city in citylist:
    outfile = open("%s.txt", "w") % city
    br = Browser()
    br.set_handle_robots(False)
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % city
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)
    outfile.close()

where citycodes.csv has the following format:

MILANO;12345
MODENA;67891

I get the following error:

Traceback (most recent call last):
  File "modena2.py", line 25, in <module>
    outfile = open("%s.txt", "w") % city
TypeError: unsupported operand type(s) for %: 'file' and 'list'

Thanks again!

score 1 · Accepted Answer

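One small thing you need to fix. The % operator is being applied to the file object returned by open() instead of to the filename string.

This: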

for city in citylist:
    outfile = open("%s.txt", "w") % city
#                                 ^^^^^^

should be this:

for city in citylist:
    outfile = open("%s.txt" % city, "w")
#                           ^^^^^^
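One caveat, offered as a sketch rather than as part of the original answer: csv.reader yields each row as a list, so formatting the whole row into "%s.txt" would put the list's repr (brackets and quotes included) into the filename. Assuming the city name is the first field of each row, indexing it explicitly avoids that. The helper name output_name is hypothetical, and the snippet is written for Python 3.

```python
def output_name(row):
    """Build the output filename from one CSV row (city name assumed first)."""
    # Apply % to the filename string, not to the file object returned by
    # open(), and format a single field rather than the whole list.
    return "%s.txt" % row[0]


row = ["MILANO", "15146"]  # the shape csv.reader produces for one line
print(output_name(row))
```

The result would then be passed to open(output_name(city), "w") inside the loop.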

score 0 · Accepted Answer

Just sketching out the basics...

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

outfile = open("milano.txt", "w")

def extract(soup):
    global outfile
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



br = Browser()
br.set_handle_robots(False)

# fill in your cities here, for example:
ListOfCityCodePairs = [('MILANO', 15146)]

for (city, code) in ListOfCityCodePairs:
    # %s takes the city name, %d the integer code
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d" % (city, code)
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)

outfile.close()

score 0 · Accepted Answer

If the file is in CSV format, you can use csv to read it. Then just use urllib.urlencode() to generate the query string, and urlparse.urlunparse() to generate the full URL.
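A minimal sketch of that approach, assuming Python 3, where both functions live in urllib.parse (in Python 2 they are urllib.urlencode and urlparse.urlunparse, as named above). The function name build_url is hypothetical; the host, path, and parameter names are taken from the URL in the question.

```python
from urllib.parse import urlencode, urlunparse


def build_url(city, code):
    """Return the Tavola07 page URL for one city name and code."""
    # urlencode escapes spaces and other special characters in the values.
    query = urlencode([("comune", city), ("cod_istat", code)])
    # urlunparse takes (scheme, netloc, path, params, query, fragment).
    return urlunparse(
        ("http", "www.webifel.it", "/sifl/Tavola07.asp", "", query, "")
    )


print(build_url("MILANO", "15146"))
```

The main benefit over plain string formatting is that city names containing spaces or accented characters are escaped for you.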

score 0 · Accepted Answer

There is no need to create a separate file; use a Python dictionary instead, which holds the relation city -> code.

See: http://docs.python.org/tutorial/datastructures.html#dictionaries
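Roughly, that could look like the sketch below (Python 3). Only the MILANO entry comes from the question; the names city_codes, BASE_URL, and urls_by_city are hypothetical, and the remaining cities would have to be typed into the dictionary by hand.

```python
# Mapping of city name -> code; only MILANO is taken from the question.
city_codes = {"MILANO": "15146"}

BASE_URL = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"


def urls_by_city(codes):
    """Map each city name to the URL of its page."""
    return {city: BASE_URL % (city, code) for city, code in codes.items()}


for city, url in urls_by_city(city_codes).items():
    print(city, url)
```

With about 200 cities, keeping the pairs in a separate file may still be easier to maintain than a literal dictionary in the script.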

score 0 · Accepted Answer

Quick and dirty:

import csv
citylist = csv.reader(open("citylist.csv"))
for city in citylist:
    # csv.reader yields lists; % needs a tuple to fill both placeholders
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % tuple(city)
    # open the page and extract the information

This assumes you have a csv file that looks like this:

MILANO,15146
ROMA,12345

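There are more powerful tools, like urllib.urlencode(), which Ignacio mentioned, but they are probably overkill for this.

P.S. Congratulations: you have already done the hard part, which is scraping the data out of the HTML. Looping over a list is easy.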
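As a hedged footnote on the file format: the sample file in the question's update separates fields with semicolons, while csv.reader splits on commas by default, so each row would come back as a single field. A sketch of one way to handle both layouts in Python 3 follows; read_city_codes and URL_TEMPLATE are hypothetical names, and the in-memory sample stands in for the real file.

```python
import csv
import io

URL_TEMPLATE = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"


def read_city_codes(lines, delimiter=","):
    """Yield (city, code) tuples, skipping rows with fewer than two fields."""
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) >= 2:
            yield row[0].strip(), row[1].strip()


# Stand-in for open("citycodes.csv"); values copied from the question's sample.
sample = io.StringIO("MILANO;12345\nMODENA;67891\n")

for city, code in read_city_codes(sample, delimiter=";"):
    # One output filename and one URL per city.
    print("%s.txt" % city, URL_TEMPLATE % (city, code))
```

Because the function yields tuples, the result can be passed straight to the % operator without the list problem from the traceback.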
