
I am trying to categorize customer feedback. I ran an LDA in Python and got the following output for 10 topics:

(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"') 

Is there a way to label them automatically? I do have a CSV file in which the feedback is manually labeled, but I don't want to supply those labels myself. I want the model to create the labels. Is that possible?


1 Answer


A comment here links to another SO answer, which in turn links to a paper. Assuming you want to keep things as simple as possible, here is an MVP-style solution that worked for me: search Google for the topic terms, then look for keywords in the response.

Here is some working, if hacky, code:

pip install cssselect lxml requests

Then

from urllib.parse import urlencode
from lxml.html import fromstring
from requests import get
from collections import Counter


def get_srp_text(search_term):
    # use the function argument (the original accidentally referenced
    # the global `topic_terms` here) and URL-encode the query
    raw = get(f"https://www.google.com/search?{urlencode({'q': search_term})}").text
    page = fromstring(raw)

    # concatenate the text of result snippets into one blob
    blob = ""
    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob


def blob_cleaner(blob):
    # str.replace does not accept regexes, so replace each punctuation
    # character individually, then keep only alphanumerics and whitespace
    for ch in '/(),:_-':
        blob = blob.replace(ch, ' ')
    return ''.join(e for e in blob if e.isalnum() or e.isspace())


def get_name_from_srp_blob(clean_blob):
    # keep tokens longer than two characters, then count them
    blob_tokens = [t for t in clean_blob.split() if len(t) > 2]
    most_common = Counter(blob_tokens).most_common(10)

    # name the topic after the two most frequent tokens
    return f"{most_common[0][0]}-{most_common[1][0]}"

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))

Then you can feed it the topic terms from your model, e.g.

topic_terms = "delivery area mile option partner traffic hub thanks city way"

name = pipeline(topic_terms)
print(name)

>>> City-Transportation

topic_terms = "package address time customer apartment delivery number item support door"

name = pipeline(topic_terms)
print(name)

>>> Parcel-Package
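Since the pipeline above depends on live Google results (which vary over time and by region), here is the same cleaning-and-counting logic run on a canned blob, so you can see how the name is derived without a network call. The blob text and the `name_from_blob` helper are made up for illustration; lowercasing the blob is a small tweak on the answer's code.

```python
from collections import Counter


def name_from_blob(blob, top_n=2):
    # mirror the answer's steps offline: turn punctuation into spaces,
    # keep only alphanumerics/whitespace, drop tokens of two characters
    # or fewer, then join the most frequent tokens into a name
    for ch in '/(),:_-':
        blob = blob.replace(ch, ' ')
    clean = ''.join(c for c in blob.lower() if c.isalnum() or c.isspace())
    tokens = [t for t in clean.split() if len(t) > 2]
    most_common = Counter(tokens).most_common(top_n)
    return '-'.join(word for word, _ in most_common)


blob = "Delivery area map - delivery zones: check delivery area and city limits"
print(name_from_blob(blob))  # delivery-area
```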

There is a lot you could improve. For example, you could use POS tags to find only the most common nouns and use those as the name. Or find the most common adjectives and nouns and name the topic "adjective noun". Better still, you could fetch the text from the linked sites and run YAKE to extract keywords.

Either way, this demonstrates a simple way to name clusters automatically without using machine learning directly (although Google certainly uses it to generate the search results, so you benefit from that).

answered 2019-09-04T19:38:13.993