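I am trying to match product listings stored in JSON Lines format against products in another file, also in JSON format. This is sometimes called record linkage, entity resolution, reference reconciliation, or simply matching.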

The goal is to match product listings from third-party retailers, e.g. "Nikon D90 12.3MP Digital SLR Camera (Body Only)", against a set of known products, e.g. "Nikon D90".

Details

Data objects

Product

{
"product_name": String // A unique id for the product
"manufacturer": String
"family": String // optional grouping of products
"model": String
"announced-date": String // ISO-8601 formatted date string, e.g. 2011-04-28T19:00:00.000-05:00
}

Listing

{
"title": String // description of product for sale
"manufacturer": String // who manufactures the product for sale
"currency": String // currency code, e.g. USD, CAD, GBP, etc.
"price": String // price, e.g. 19.99, 100.00
}

Result

{
"product_name": String
"listings": Array[Listing]
}
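
To make the Result shape concrete, here is a minimal sketch that groups matched listings under their product and prints one Result object per line. The helper name `build_results` and the sample values (including the price) are invented for illustration and are not part of the data set.

```python
import json
from collections import defaultdict


def build_results(matches):
    """Group (product_name, listing) pairs into Result objects."""
    grouped = defaultdict(list)
    for product_name, listing in matches:
        grouped[product_name].append(listing)
    # One Result per product: its name plus every listing matched to it.
    return [{"product_name": name, "listings": items}
            for name, items in grouped.items()]


# Invented sample match; real pairs would come from the matching step.
matches = [
    ("Nikon_D90", {"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)",
                   "manufacturer": "Nikon", "currency": "USD", "price": "899.99"}),
]
for result in build_results(matches):
    print(json.dumps(result))
```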

The data consists of two files: products.txt, which contains around 700 products, and listings.txt, which contains around 20,000 product listings.

Current code (using Python):

import jsonlines
import json
import re
import logging, sys

logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

with jsonlines.open('products.jsonl') as products:
  for prod in products:
    jdump = json.dumps(prod)
    jload = json.loads(jdump)
    regpat = re.compile(r"^\s+|\s*-| |_\s*|\s+$")  # raw string avoids invalid-escape warnings
    prodmatch = [x for x in regpat.split(jload["product_name"].lower()) if x]
    manumatch = [x for x in regpat.split(jload["manufacturer"].lower()) if x]
    modelmatch = [x for x in regpat.split(jload["model"].lower()) if x]
    wordmatch = prodmatch + manumatch + modelmatch
    #print (wordmatch)
    #logging.debug('product first output')
    with jsonlines.open('listings.jsonl') as listings:
      for entry in listings:
        jdump2 = json.dumps(entry)
        jload2 = json.loads(jdump2)
        wordmatch2 = [x for x in regpat.split(jload2["title"].lower()) if x]
        #print (wordmatch2)
        #logging.debug('listing first output')
        contained = [x for x in wordmatch2 if x in wordmatch]
        if contained:
          print(contained)
        #logging.debug('contained first match')

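The code above splits the product name, model, and manufacturer from the products file into words and tries to match them against the title strings in the listings file, but it feels far too slow, and there must be a better way to do this. Any help is appreciated.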

1 Answer

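First, I'm not sure what is going on with dumps() and loads(). If you can find a way to avoid serializing and deserializing everything on every iteration, that will be a big win, because in the code you've posted here the round trip looks completely redundant.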
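
As a rough sketch of what dropping the round trip could look like: each record is already a dict once it has been parsed, so its fields can be used directly. The helper names `tokenize` and `product_words` are made up for illustration, and the sketch reads JSON Lines with the standard `json` module from any iterable of strings instead of going through `jsonlines`.

```python
import io
import json
import re

# Same splitting pattern as in the question, written as a raw string.
SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokenize(text):
    """Lower-case text and split it into non-empty words."""
    return [w for w in SPLITTER.split(text.lower()) if w]


def product_words(lines):
    """Yield (product, words) for each JSON Lines record in `lines`."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        prod = json.loads(line)  # parsed once; no dumps()/loads() round trip
        words = (tokenize(prod["product_name"])
                 + tokenize(prod["manufacturer"])
                 + tokenize(prod["model"]))
        yield prod, words


# Invented one-line sample standing in for the products file.
sample = io.StringIO(
    '{"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}\n'
)
for prod, words in product_words(sample):
    print(prod["product_name"], words)
```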

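Second, the listings: since they never change, why not parse them once before the loop into some data structure (maybe a dict that maps the contents of wordmatch2 to the listing they were derived from) and reuse that structure while going through the products file?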
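
One way to picture that structure is an inverted index from each title word to the listings that contain it, built a single time. This is only a sketch: `build_index` and `candidates` are hypothetical names, the sample records are invented, and "shares at least one word" is the same loose criterion as the `contained` check in the question, not a proper matching rule.

```python
import re
from collections import defaultdict

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")


def tokenize(text):
    """Return the set of lower-cased, non-empty words in text."""
    return {w for w in SPLITTER.split(text.lower()) if w}


def build_index(listings):
    """Map each title word to the positions of the listings containing it."""
    index = defaultdict(set)
    for pos, listing in enumerate(listings):
        for word in tokenize(listing["title"]):
            index[word].add(pos)
    return index


def candidates(product, index):
    """Return positions of listings sharing at least one word with product."""
    words = (tokenize(product["product_name"])
             | tokenize(product["manufacturer"])
             | tokenize(product["model"]))
    hits = set()
    for word in words:
        hits |= index.get(word, set())  # .get() avoids growing the defaultdict
    return sorted(hits)


listings = [
    {"title": "Nikon D90 12.3MP Digital SLR Camera (Body Only)"},
    {"title": "Canon PowerShot A1200"},
]
index = build_index(listings)  # built once, before looping over products
product = {"product_name": "Nikon_D90", "manufacturer": "Nikon", "model": "D90"}
print(candidates(product, index))
```

With the index in place, each product costs a handful of dict lookups instead of a full pass over the listings file.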

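Next: if there is a way to rework this to use multiprocessing, I strongly recommend you do so. You are entirely CPU-bound here, and you can easily have this run in parallel across all your cores.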
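
A minimal sketch of that idea, assuming the per-listing work can be written as a plain top-level function: `multiprocessing.Pool.map` hands chunks of titles to worker processes. `PRODUCT_WORDS`, `match_title`, and `match_all` are hypothetical names, and the tiny in-memory table stands in for data loaded from the products file.

```python
import re
from multiprocessing import Pool

SPLITTER = re.compile(r"^\s+|\s*-| |_\s*|\s+$")

# Hypothetical in-memory sample; real code would load the products file once.
PRODUCT_WORDS = {
    "Nikon_D90": {"nikon", "d90"},
    "Canon_PowerShot_A1200": {"canon", "powershot", "a1200"},
}


def match_title(title):
    """Return (title, names of products sharing a word with the title)."""
    words = {w for w in SPLITTER.split(title.lower()) if w}
    names = sorted(n for n, pw in PRODUCT_WORDS.items() if words & pw)
    return title, names


def match_all(titles, processes=None, chunksize=256):
    """Spread match_title over worker processes; results keep input order."""
    with Pool(processes=processes) as pool:
        return pool.map(match_title, titles, chunksize=chunksize)


if __name__ == "__main__":
    titles = [
        "Nikon D90 12.3MP Digital SLR Camera (Body Only)",
        "Canon PowerShot A1200",
    ]
    for title, names in match_all(titles):
        print(title, names)
```

The worker function has to live at module level so it can be pickled, and the `if __name__ == "__main__":` guard keeps worker processes from re-running the script on platforms that spawn rather than fork.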

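Finally, I took a stab at some fancy regex shenanigans. The goal here is to push as much of the logic as possible into the regex, on the idea that re is implemented in C and will therefore be more performant than doing all that string work in Python.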

import json
import re

PRODUCTS = """
[
{
"product_name": "Puppersoft Doggulator 5000",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5000",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5001",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5001",
"announced-date": "ymd"
},
{
"product_name": "Puppersoft Doggulator 5002",
"manufacturer": "Puppersoft",
"family": "Doggulator",
"model": "5002",
"announced-date": "ymd"
}
]
"""


LISTINGS = """
[
{
"title": "Doggulator 5002",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Doggulator 5005",
"manufacturer": "Puppersoft",
"currency": "Pupper Bux",
"price": "420"
},
{
"title": "Woofer",
"manufacturer": "Shibasoft",
"currency": "Pupper Bux",
"price": "420"
}
]
"""

SPLITTER_REGEX = re.compile(r"^\s+|\s*-| |_\s*|\s+$")
product_re_map = {}
product_re_parts = []

# get our matching keywords from products.json
for idx, product in enumerate(json.loads(PRODUCTS)):
    matching_parts = [x for x in SPLITTER_REGEX.split(product["product_name"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["manufacturer"]) if x]
    matching_parts += [x for x in SPLITTER_REGEX.split(product["model"]) if x]

    # store the product object for outputting later if we get a match
    group_name = 'i{idx}'.format(idx=idx)
    product_re_map[group_name] = product
    # create a giganto-regex that matches anything from a given product.
    # the group name is a reference back to the matching product.
    # I use set() here to deduplicate repeated words in matching_parts.
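    # re.escape() keeps words that contain regex metacharacters from breaking the pattern.
    product_re_parts.append("(?P<{group_name}>{words})".format(group_name=group_name, words="|".join(re.escape(w) for w in set(matching_parts))))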
# Do the case-insensitive matching in C code
product_re = re.compile("|".join(product_re_parts), re.I)

for listing in json.loads(LISTINGS):
    # we match against split words in the regex created above so we need to
    # split our source input in the same way
    matching_listings = []
    for word in SPLITTER_REGEX.split(listing['title']):
        if word:
            product_match = product_re.match(word)
            if product_match:
                # groupdict() lists every named group; groups that did not
                # take part in the match have the value None, so skip them
                for k, v in product_match.groupdict().items():
                    if v is None:
                        continue
                    matching_listing = product_re_map[k]
                    if matching_listing not in matching_listings:
                        matching_listings.append(matching_listing)
    print(listing['title'], matching_listings)
answered 2017-10-14T02:01:56.960