python - Multiple image downloader using a CSV file and Python

Question

I am getting an error with this code. Can anyone help me automate downloading all of the images listed in a CSV file that contains every image URL?

The error I get is:

        URLError                                  Traceback (most recent call last)
      <ipython-input-320-dcd87f841181> in <module>
         19         urlShort = re.search(filejpg, str(r)).group()
         20         print(urlShort)
    ---> 21         download(x, f'{di}/{urlShort}')
         22         print(type(x))
         URLError: <urlopen error unknown url type: {'https>

This is the code I am using:

from pathlib import Path
from shutil import rmtree as delete
from urllib.request import urlretrieve as download
from gazpacho import get, Soup
import re
import pandas as pd
import numpy as np


#import data
df = pd.read_csv('urlReady1.csv')
df.shape
#locate folder
di = 'Dubai'
Path(di).mkdir(exist_ok=True)

#change data to dict
dict_copy = df.to_dict('records')

#iterate over every row of the data and download the jpg file
for r in dict_copy:
    if r == 'urlready':
        print("header")
    else:
        x = str(r)
        filejpg = "[\d]{1,}\.jpg"
        urlShort = re.search(filejpg, str(r)).group()
        print(urlShort)
        download(x, f'{di}/{urlShort}')
        print(type(x))

score 0 · Accepted Answer

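I can't see your dataset, but I think pandas to_dict('records') is returning a list of dicts to you (which you are storing as dict_copy). Then, when you iterate over it with for r in dict_copy:, r is not a URL but a dict that contains the URL in some way. So str(r) converts that dict {<stuff>} into the string '{<stuff>}', and you then pass that string along as your URL.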

I think that is why you are seeing the error URLError: <urlopen error unknown url type: {'https>
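To make the difference concrete, here is a minimal sketch. It assumes a hand-written list that mimics what to_dict('records') returns (one dict per CSV row, keyed by column name); the example.com URLs are placeholders, not the asker's data.

```python
# Hand-written stand-in for what df.to_dict('records') returns:
# one dict per CSV row, keyed by the column name.
# The URLs are placeholders chosen for illustration only.
dict_copy = [
    {"urlReady": "https://example.com/images/43153.jpg"},
    {"urlReady": "https://example.com/images/46137.jpg"},
]

r = dict_copy[0]
as_text = str(r)      # what the original loop handed to download()
url = r["urlReady"]   # the value download() actually needs

print(type(r))   # each record is a dict, not a str
print(as_text)   # begins with "{", so it cannot be parsed as a URL
print(url)       # the plain URL string
```

The string form of the record starts with a brace and a quoted key, which is why urllib cannot recognise a URL scheme in it.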

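Adding print statements after the df dump (print(dict_copy) right after dict_copy = df.to_dict('records')) and at the start of the iteration (print(r) right after for r in dict_copy:) will help you see what is going on and test/confirm my hypothesis.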

Thanks for adding the sample data! So dict_copy looks like this: [{'urlReady': 'mobile.****.***.**/****/43153.jpg'}, {'urlReady': 'mobile.****.***.**/****/46137.jpg'}]

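So yes, dict_copy is a list of dicts, each of which appears to have a 'urlReady' key with the URL string as its value. You therefore want to use that key to retrieve the URL from each dict. The best approach may depend on whether your data contains entries without a valid URL, and so on, but this should get you started and give you a view of the data so you can see whether anything looks odd: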

for r in dict_copy:
    urlstr = r.get('urlReady', '') # .get with default return of '' means you know you can use string methods to validate data
    print('\nurl check: type is', type(urlstr), 'url is', urlstr)
    if isinstance(urlstr, str) and '.jpg' in urlstr: # check to make sure the url has a jpg, you can replace with `if True` or another check if it makes sense
        filejpg = r"[\d]{1,}\.jpg" # raw string so \d and \. reach the regex engine unchanged
        urlShort = re.search(filejpg, urlstr).group()
        print('downloading from', urlstr, 'to', f'{di}/{urlShort}')
        download(urlstr, f'{di}/{urlShort}')
    else:
        print('bad data! dict:', r, 'urlstr:', urlstr)
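
One gap in the loop above: a URL such as .../photo.jpg passes the '.jpg' check but has no digits before the extension, so re.search returns None and .group() raises AttributeError. A possible way to guard against that is sketched below. The helper name target_path, the constant JPG_NAME and the example.com URLs are made up for illustration; only the regex pattern and the 'Dubai' folder come from the thread.

```python
import re
from pathlib import Path
from typing import Optional

# Same pattern as in the answer: one or more digits followed by ".jpg".
JPG_NAME = re.compile(r"\d+\.jpg")


def target_path(url: object, folder: str) -> Optional[Path]:
    """Return folder/<digits>.jpg for a usable URL string, else None."""
    if not isinstance(url, str):
        return None  # e.g. NaN, which pandas uses for an empty CSV cell
    match = JPG_NAME.search(url)
    if match is None:
        return None  # no "<digits>.jpg" in the URL, nothing to name the file
    return Path(folder) / match.group()


# Placeholder inputs: a usable URL, an empty cell, and a URL without digits.
print(target_path("https://example.com/images/43153.jpg", "Dubai"))
print(target_path(float("nan"), "Dubai"))
print(target_path("https://example.com/photo.jpg", "Dubai"))
```

Inside the loop you would call download(urlstr, target) only when target_path(urlstr, di) returns a path, and print the "bad data" message otherwise.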
