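I have a dictionary that maps company names to the Better Business Bureau (BBB) link associated with each company. I also have a CSV file that pairs BBB links with those companies' phone numbers. I need to combine the two, using the BBB link associated with each company name as the join key.

My end goal is a DataFrame containing:

Company Name, Link, Phone Number

Dictionary: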

{'A. G. Builders, Inc.': 'https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923', 'A. R. Russell': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691', 'A. R. Russell Builders, Inc.': 'https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691', 'A.C.A. Enterprises, LLC': 'https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401', 'A.D. Myers Builders, LLC': 'https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405', 'ABS Construction Group': 'https://www.bbb.org/us/nc/newport/profile/general-contractor/ab-building-remodeling-llc-0593-90293532', 'Absolute Construction Group, LLC': 'https://www.bbb.org/us/nc/durham/profile/home-improvement/absolute-construction-group-llc-0593-90282628'}

Code:

from random import randint
from time import sleep

from selenium import webdriver
from selenium.webdriver.common.by import By

phone_list = []
url_with_phone = []

def phone_numbers():
    # url_list (the list of BBB links) is assumed to be defined elsewhere
    driver = webdriver.Chrome()
    for url in url_list:  # loop through the list of BBB links
        print(url)  # print the URL currently being visited
        driver.get(url)
        sleep(randint(4, 6))
        phone = driver.find_elements(By.CLASS_NAME, "dtm-phone")  # find the phone number elements
        sleep(randint(4, 8))
        print('looking for number')
        for p in phone:
            results = p.text
            print(results)
            sleep(randint(3, 5))
            phone_list.append(results)  # add phone number to phone_list
            sleep(randint(5, 9))
            url_with_phone.append(url)  # record the URL alongside each phone number found
    driver.quit()  # close the browser when done

phone_numbers()

CSV output of links and phone numbers:

URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409

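For example, the first result in the CSV file belongs to AG Home Builders. Is there a way to add the key from the dictionary (the company name) to the CSV based on the matching value?

I want to add the company name to the CSV. What is the best way to do this? I have read the following questions to try to work out my own solution, but I haven't had any luck: "append multiple values for one key in a dictionary" and "list to dictionary conversion with multiple values per key?".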

1 Answer

You can use re to extract all the business names from the string:

import re

a = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409'''

print([re.findall(r'.*c(?=-\d)', n.split('/')[-1])[0] for n in a.split('\n')[1:]])

Output:

['ag-builders-inc',
 'russell-l-judy-builder-inc',
 'russell-l-judy-builder-inc',
 'aca-enterprises-llc',
 'aca-enterprises-llc',
 'meyer-builders-llc']

For a prettier presentation:

print([re.findall(r'.*c(?=-\d)', n.split('/')[-1])[0].replace('-', ' ').title() for n in a.split('\n')[1:]])

Output:

['Ag Builders Inc',
 'Russell L Judy Builder Inc',
 'Russell L Judy Builder Inc',
 'Aca Enterprises Llc',
 'Aca Enterprises Llc',
 'Meyer Builders Llc']
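
The pattern above depends on every slug ending in the letter c (as in inc or llc). As an alternative, here is a minimal sketch that strips the two trailing numeric groups instead. It assumes every profile URL ends in `-<digits>-<digits>`, which holds for the sample data but is not confirmed for BBB URLs in general. `slug_from_url` is a hypothetical helper name, not part of any library.

```python
import re

# Assumption: every BBB profile URL ends in "-<digits>-<digits>".
_TRAILING_IDS = re.compile(r"-\d+-\d+$")


def slug_from_url(url):
    """Return the last path segment with the trailing numeric IDs removed."""
    last_segment = url.rstrip("/").rsplit("/", 1)[-1]
    return _TRAILING_IDS.sub("", last_segment)


url = "https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923"
print(slug_from_url(url))                            # ag-builders-inc
print(slug_from_url(url).replace("-", " ").title())  # Ag Builders Inc
```

Note that the slug is derived from the URL, so it will not always match the dictionary key (for example, 'A.D. Myers Builders, LLC' points at a meyer-builders-llc URL).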

Update:

import re

a = '''URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/nc/raleigh/profile/general-contractor/russell-l-judy-builder-inc-0593-90082691,(919) 625-7841
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
https://www.bbb.org/us/nc/charlotte/profile/general-contractor/meyer-builders-llc-0473-219405,(704) 737-8409'''

for line in a.split('\n')[1:]:  # iterate over each line of the string, skipping the header line

    name = line.split('/')[-1]  # the business name is in the last slash-separated substring of each line

    name = re.findall(r'.*c(?=-\d)', name)  # .*c matches everything up to and including a 'c'; the lookahead (?=-\d) requires that 'c' to be followed by a dash and a digit, without including them in the match

    print(name[0])  # findall returns a list; index 0 gets the matched string

Output:

ag-builders-inc
russell-l-judy-builder-inc
russell-l-judy-builder-inc
aca-enterprises-llc
aca-enterprises-llc
meyer-builders-llc
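
If the goal is to attach the exact dictionary keys rather than names derived from the URL, one option is to invert the dictionary and look each CSV link up in it. The following is a minimal sketch using only the standard library. The names `company_links`, `csv_text`, and `names_by_link` are made up for illustration, and the CSV is held in a string here as a stand-in for the real file.

```python
import csv
import io
from collections import defaultdict

# Two entries from the question's dictionary (company name -> BBB link).
company_links = {
    "A. G. Builders, Inc.": "https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923",
    "A.C.A. Enterprises, LLC": "https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401",
}

# Stand-in for the CSV file; for a real file, use open(path, newline="").
csv_text = """URL Searched,Phone Numbers
https://www.bbb.org/us/nc/durham/profile/home-builders/ag-builders-inc-0593-6037923,(919) 384-7005
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 248-0597
https://www.bbb.org/us/fl/ponce-de-leon/profile/building-contractors/aca-enterprises-llc-0683-90029401,(850) 527-1767
"""

# Invert the dictionary: one link can belong to several company names.
names_by_link = defaultdict(list)
for name, link in company_links.items():
    names_by_link[link].append(name)

# Build one (name, link, phone) row per CSV record and matching company name.
rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    link = record["URL Searched"]
    for name in names_by_link.get(link, []):
        rows.append((name, link, record["Phone Numbers"]))

for row in rows:
    print(row)
```

If pandas is installed, `pandas.DataFrame(rows, columns=["Company Name", "Link", "Phone Number"])` should turn these rows into the desired DataFrame. Because the full dictionary maps two names ('A. R. Russell' and 'A. R. Russell Builders, Inc.') to the same link and the CSV repeats some rows, expect duplicate rows unless you remove them, for example with `drop_duplicates()`.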
Answered 2020-06-24T00:50:45.030