tags - Tag hierarchies and handling

Question

This is a real question that applies to tagging items in general (and yes, this applies to StackOverflow too, and no, it is not a question about StackOverflow).

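The whole point of tagging is to help cluster similar items, whatever those items may be (jokes, blog posts, questions, etc.). However, there is usually (but not strictly) a hierarchy of tags, meaning that some tags imply other tags. To use a familiar example, the "c#" SO tag also implies ".net"; as another example, in a jokes database a "blonde" tag implies the "derisive" tag, as do "irish", "belgian", "canadian" and so on, depending on the joke's country.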

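How have you handled this in your projects, if you have? I will supply an answer describing two approaches I used in two separate cases (actually the same mechanism, implemented in two different environments), but I am interested not only in similar mechanisms but also in your opinion on the hierarchy issue.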

score 7 · Accepted Answer

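This is a tough question. The two extremes are an ontology (everything is hierarchical) and a folksonomy (tags have no hierarchy). I have answered this question on WikiAnswers, with a reference to Clay Shirky's "Ontology is Overrated" article, which claims that you should not set up any hierarchy at all.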

score 4 · Accepted Answer

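Actually I would say it is not so much a hierarchical system as a semantic net, with distances between the meanings of tags. What I mean: mathematics is closer to experimental physics than it is to gardening.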

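One possible way to build such a net: form pairs of tags and let people judge the perceived distance (using a measure such as 1-10, meaning [synonym, similar, ..., antonym]), and when searching, search for all tags within a certain distance.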
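A minimal sketch of the distance-based search described above, assuming the perceived distances have already been collected into a plain dictionary. The tag names, the numbers, and the `tags_within` helper are illustrative assumptions, not part of the original answer.

```python
# Hypothetical perceived distances on a 1-10 scale (1 = synonym, 10 = antonym).
# Keys are ordered (from_tag, to_tag) pairs; the values are made up.
DISTANCES = {
    ("mathematics", "experimental physics"): 3,
    ("mathematics", "statistics"): 2,
    ("mathematics", "gardening"): 9,
}


def tags_within(tag, max_distance, distances=DISTANCES):
    """Return the tag itself plus every tag rated within max_distance of it."""
    found = {tag}
    for (source, target), distance in distances.items():
        if source == tag and distance <= max_distance:
            found.add(target)
    return found


nearby = tags_within("mathematics", 3)
print(sorted(nearby))
```

A search for items tagged "mathematics" could then match any tag in `nearby` instead of the single tag.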

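Does the measured distance have to be the same when coming from the opposite direction ([a,b] close -> [b,a] close)? And is closeness transitive, i.e. do [a,b] close and [b,c] close imply [a,c] close?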

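Maybe the first word triggers a different semantic field by default? If you start from "social worker", "analyst" is nearby. If you start from "programmer", "analyst" is nearby too. But starting from either of those points, you would probably not consider the other one close ("social worker" is by no means close to "programmer").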

So you would have to have each pair judged in both directions, presented in random order.
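One way to read this, as a sketch only: present every ordered pair of tags to the judges in shuffled order, and average the ratings per direction so that [a,b] and [b,a] can end up with different distances. The helper names `judging_order` and `average_directed` and all ratings below are assumptions made for illustration.

```python
import itertools
import random
from collections import defaultdict
from statistics import mean


def judging_order(tags, rng):
    """Return every ordered pair of distinct tags, shuffled for presentation."""
    pairs = list(itertools.permutations(tags, 2))
    rng.shuffle(pairs)
    return pairs


def average_directed(judgements):
    """Average ratings per ordered pair, so [a,b] and [b,a] stay separate."""
    ratings = defaultdict(list)
    for source, target, rating in judgements:
        ratings[(source, target)].append(rating)
    return {pair: mean(values) for pair, values in ratings.items()}


tags = ["social worker", "analyst", "programmer"]
pairs = judging_order(tags, random.Random(0))  # 6 ordered pairs

# Made-up ratings on a 1-10 scale (1 = synonym, 10 = antonym).
judgements = [
    ("social worker", "analyst", 3),
    ("social worker", "analyst", 4),
    ("analyst", "social worker", 6),
]
distances = average_directed(judgements)
print(distances)
```

Keeping the two directions as separate keys leaves the symmetry question open: the data can show whether judges actually rate both directions alike.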

[TagRelations]
tagId integer
closeTagId integer
proximity integer

Example for selecting similar tags:

select closeTagId from TagRelations where tagId = :tagID and proximity < 3
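To try the query, here is a small self-contained sketch using Python's built-in `sqlite3` module and an in-memory database. The choice of SQLite and the sample rows are assumptions; the table and column names follow the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TagRelations ("
    " tagId INTEGER, closeTagId INTEGER, proximity INTEGER)"
)
# Made-up rows: tag 1 is close to tags 2 and 3, and far from tag 4.
conn.executemany(
    "INSERT INTO TagRelations (tagId, closeTagId, proximity) VALUES (?, ?, ?)",
    [(1, 2, 1), (1, 3, 2), (1, 4, 8)],
)

# Same query as above, with the :tagID placeholder bound by name.
rows = conn.execute(
    "SELECT closeTagId FROM TagRelations"
    " WHERE tagId = :tagID AND proximity < 3"
    " ORDER BY closeTagId",
    {"tagID": 1},
).fetchall()
close_tag_ids = [row[0] for row in rows]
print(close_tag_ids)  # [2, 3]
conn.close()
```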

score 2 · Accepted Answer

The mechanism I implemented does not use the given tags directly, but goes through an indirect lookup table (not strictly in the DBMS sense) which links each tag to many implied tags (obviously, a tag is also linked to itself for this to work).

In a Python project, the lookup table is a dictionary keyed on tags, whose values are sets of tags (where tags are plain strings).
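A minimal sketch of such a dictionary, using the examples from the question. The table contents and the `expand_tags` helper are hypothetical; the original code is not shown in the answer.

```python
# Hypothetical lookup table: each tag maps to the set of tags it implies.
# Every tag also maps to itself, as the answer requires.
IMPLIED_TAGS = {
    "c#": {"c#", ".net"},
    ".net": {".net"},
    "blonde": {"blonde", "derisive"},
    "derisive": {"derisive"},
}


def expand_tags(tags, lookup=IMPLIED_TAGS):
    """Return the given tags together with every tag they imply."""
    expanded = set()
    for tag in tags:
        # Unknown tags fall back to implying only themselves.
        expanded |= lookup.get(tag, {tag})
    return expanded


expanded = expand_tags(["c#", "blonde", "python"])
print(sorted(expanded))
```

Because each tag maps to itself, the expansion never loses a tag the user actually supplied.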

In a database project (regardless of which RDBMS engine it was), there were the following tables:

[Tags]
tagID integer primary key
tagName text

[TagRelations]
tagID integer # first part of two-field key
tagID_parent integer # second part of key
trlValue float

where trlValue was a value in the interval (0, 1], used to give a weight to each linked tag; a self-to-self tag relation always carries 1.0 in trlValue, while the rest are calculated algorithmically (it's not important how exactly). Think of the example jokes database I gave; a ['blonde', 'derisive', 0.5] record would correlate with a ['pondian', 'derisive', 0.5] record, and therefore all derisive jokes can be suggested given any one of them.
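The two tables can be tried out with Python's built-in `sqlite3` module, as in the sketch below. The choice of SQLite, the tag IDs, and the scoring rule (product of the two trlValue weights) are assumptions made for illustration; the answer does not say how related tags were ranked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Self-relations carry 1.0; the two 0.5 rows follow the example above.
conn.executescript(
    """
    CREATE TABLE Tags (tagID INTEGER PRIMARY KEY, tagName TEXT);
    CREATE TABLE TagRelations (
        tagID INTEGER,
        tagID_parent INTEGER,
        trlValue REAL,
        PRIMARY KEY (tagID, tagID_parent)
    );
    INSERT INTO Tags VALUES (1, 'blonde'), (2, 'pondian'), (3, 'derisive');
    INSERT INTO TagRelations VALUES
        (1, 1, 1.0), (2, 2, 1.0), (3, 3, 1.0),
        (1, 3, 0.5), (2, 3, 0.5);
    """
)

# Find other tags that share an implied tag with 'blonde'.
# Scoring by the product of the two weights is an assumption of this sketch.
rows = conn.execute(
    """
    SELECT dst.tagName, rel_a.trlValue * rel_b.trlValue AS score
    FROM Tags AS src
    JOIN TagRelations AS rel_a ON rel_a.tagID = src.tagID
    JOIN TagRelations AS rel_b ON rel_b.tagID_parent = rel_a.tagID_parent
    JOIN Tags AS dst ON dst.tagID = rel_b.tagID
    WHERE src.tagName = ? AND dst.tagID <> src.tagID
    ORDER BY score DESC, dst.tagName
    """,
    ("blonde",),
).fetchall()
print(rows)  # [('derisive', 0.5), ('pondian', 0.25)]
conn.close()
```

Starting from 'blonde', the query reaches 'derisive' directly and 'pondian' through the shared 'derisive' parent, which is the correlation the answer describes.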
