nlp - Finding similar texts based on paraphrase detection

Question

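I am interested in finding similar content (text) based on paraphrasing. How do I do this? Are there any specific tools that can do it? Preferably in Python.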

score 6 · Accepted Answer

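I believe the tool you are looking for is Latent Semantic Analysis (LSA).

Since this post is going to be quite long, I won't explain the theory behind it in detail. If you think it is indeed what you are looking for, I recommend you look it up. The best place to start is here: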

http://staff.scm.uws.edu.au/~lapark/lt.pdf

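In short, LSA tries to uncover the underlying (latent) meaning of words and phrases, based on the assumption that similar words appear in similar documents. I'll use R to demonstrate how it works.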
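For anyone who would rather see the idea in Python, as the question asks, here is a minimal sketch that is independent of the R code in this answer. It rests on several assumptions: the five-document corpus is invented, the term-document matrix holds raw counts with no weighting, and the top singular vectors are approximated with plain power iteration so that only the standard library is needed (in practice you would use a proper SVD routine). The query is folded in by projecting it onto the same latent basis as the documents, which is a simpler convention than the one used by the R function.

```python
import math
import random


def build_term_document_matrix(docs):
    """Return (vocab, A) where A[i][j] counts term i in document j."""
    vocab = sorted({tok for doc in docs for tok in doc.lower().split()})
    row = {term: i for i, term in enumerate(vocab)}
    A = [[0.0] * len(docs) for _ in vocab]
    for j, doc in enumerate(docs):
        for tok in doc.lower().split():
            A[row[tok]][j] += 1.0
    return vocab, A


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0


def latent_basis(A, k, iters=200, seed=0):
    """Approximate the top-k left singular vectors of A by power iteration."""
    rng = random.Random(seed)
    n_terms, n_docs = len(A), len(A[0])
    basis = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n_terms)]
        for _ in range(iters):
            # One step of v <- A (A^T v).
            doc_side = [sum(A[i][j] * v[i] for i in range(n_terms))
                        for j in range(n_docs)]
            v = [dot(A[i], doc_side) for i in range(n_terms)]
            # Keep v orthogonal to the vectors already found.
            for u in basis:
                c = dot(v, u)
                v = [x - c * y for x, y in zip(v, u)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        basis.append(v)
    return basis


def lsa_similarities(docs, query, k=2):
    """Cosine similarity between the query and each document in latent space."""
    vocab, A = build_term_document_matrix(docs)
    basis = latent_basis(A, k)
    query_terms = set(query.lower().split())
    q = [1.0 if term in query_terms else 0.0 for term in vocab]
    q_latent = [dot(q, u) for u in basis]
    sims = []
    for j in range(len(docs)):
        column = [A[i][j] for i in range(len(vocab))]
        sims.append(cosine(q_latent, [dot(column, u) for u in basis]))
    return sims


# Invented toy corpus: three "car" documents and two "fruit" documents.
docs = [
    "car engine wheel",
    "automobile engine wheel",
    "engine wheel repair",
    "banana fruit apple",
    "apple fruit orange",
]
for doc, sim in zip(docs, lsa_similarities(docs, "car")):
    print(f"{sim:+.3f}  {doc}")
```

With two latent dimensions, the query "car" scores close to 1 against "automobile engine wheel" even though the two share no word, while the fruit documents score close to 0. In the raw term space the same pair would have a cosine of exactly 0.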

I'm going to set up a function that retrieves similar documents based on their latent meaning:

# Setting up all the needed functions:

SemanticLink = function(text,expression,LSAS,n=length(text),Out="Text"){ 

  # Query Vector
  LookupPhrase = function(phrase,LSAS){ 
    lsatm = as.textmatrix(LSAS) 
    QV = function(phrase){ 
      q = query(phrase,rownames(lsatm)) 
      t(q)%*%LSAS$tk%*%diag(LSAS$sk) 
    } 

    q = QV(phrase) 
    qd = 0 

    for (i in 1:nrow(LSAS$dk)){ 
      qd[i] <- cosine(as.vector(q),as.vector(LSAS$dk[i,])) 
    }  
    qd  
  } 

  # Handling Synonyms
  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","", 
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","", 
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)  
  } 

  ex = unlist(strsplit(expression," "))
  for(i in seq(ex)){ex = c(ex,Syns(ex[i]))}
  ex = unique(wordStem(ex))

  cache = LookupPhrase(paste(ex,collapse=" "),LSAS) 

  top = sort(cache,decreasing=TRUE)[1:n]  # the n highest cosine values

  if(Out=="Text"){return(text[which(cache %in% top)])} 
  if(Out=="ValuesSorted"){return(top)} 
  if(Out=="Index"){return(which(cache %in% top))} 
  if(Out=="ValuesUnsorted"){return(cache)} 

}

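Note that we use synonyms when assembling the query vector. This approach is not perfect, because some of the synonyms in the qdap library are remote at best, and that may interfere with your search query. To get more accurate but less generalizable results, you can simply drop the synonym step and hand-pick all the relevant terms that make up your query vector.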
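The trade-off between the two query-building modes can be sketched in Python as follows. This is only an illustration under assumptions: `SYNONYMS` is a hypothetical hand-written table standing in for the qdap lookup, `expand_query` and its `use_synonyms` flag are names made up for this example, and the stemming step of the R code is left out.

```python
# Hypothetical hand-written synonym table; the R code queries qdap instead.
SYNONYMS = {
    "support": ["back", "aid", "hold up"],
    "trade": ["commerce", "exchange"],
}


def expand_query(query, synonyms=SYNONYMS, use_synonyms=True):
    """Return the query terms, optionally followed by single-word synonyms."""
    terms = query.lower().split()
    if use_synonyms:
        for term in list(terms):
            # Multi-word synonyms are skipped, which is what the R helper
            # appears to do as well.
            terms += [s for s in synonyms.get(term, []) if " " not in s]
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(terms))


print(expand_query("support trade"))
print(expand_query("support trade", use_synonyms=False))
```

Switching `use_synonyms` off corresponds to hand-picking the terms: the query vector then contains only the words you typed.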

Let's try it out. I'll be using the US Congress dataset from the RTextTools package:

library(tm)
library(RTextTools)
library(lsa)
library(data.table)
library(stringr)
library(qdap)

data(USCongress)

text = as.character(USCongress$text)

corp = Corpus(VectorSource(text)) 

parameters = list(minDocFreq        = 1, 
                  wordLengths       = c(2,Inf), 
                  tolower           = TRUE, 
                  stripWhitespace   = TRUE, 
                  removeNumbers     = TRUE, 
                  removePunctuation = TRUE, 
                  stemming          = TRUE, 
                  stopwords         = TRUE, 
                  tokenize          = NULL, 
                  weighting         = function(x) weightSMART(x,spec="ltn"))

tdm = TermDocumentMatrix(corp,control=parameters)
tdm.reduced = removeSparseTerms(tdm,0.999)

# setting up LSA space - this may take a little while...
td.mat = as.matrix(tdm.reduced) 
td.mat.lsa = lw_bintf(td.mat)*gw_idf(td.mat) # you can experiment with weightings here
lsaSpace = lsa(td.mat.lsa,dims=dimcalc_raw()) # you don't have to keep all dimensions
lsa.tm = as.textmatrix(lsaSpace)

l = 50 
exp = "support trade" 
SemanticLink(text,exp,n=5,lsaSpace,Out="Text") 

[1] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small businesses, and for other purposes."                                                                       
[2] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel AJ."           
[3] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the yacht EXCELLENCE III."
[4] "A bill to authorize the Secretary of Transportation to issue a certificate of documentation with appropriate endorsement for employment in the coastwise trade for the vessel M/V Adios."    
[5] "A bill to amend the Internal Revenue Code of 1986 to provide tax relief for small business, and for other purposes."

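As you can see, although the phrase "support trade" may not appear verbatim in the examples above, the function has retrieved a set of documents that are relevant to the query. The function is designed to retrieve documents with semantic links rather than exact matches.

We can also see how "close" these documents are to the query vector by plotting the cosine similarities: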

plot(1:l,SemanticLink(text,exp,lsaSpace,n=l,Out="ValuesSorted") 
     ,type="b",pch=16,col="blue",main=paste("Query Vector Proximity",exp,sep=" "), 
     xlab="observations",ylab="Cosine")

I don't have enough reputation yet to post the plot, though. Sorry.

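In the plot, the first 2 entries appear to be more closely associated with the query vector than the rest (although about 5 are particularly relevant), even though you wouldn't have said so from reading them. I'd say this is the effect of using synonyms to build the query vector. Ignoring that, however, the graph lets us see how many other documents are remotely similar to the query vector.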
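The same reading of the sorted cosine values can be done without a plot. The Python sketch below assumes you already have one cosine score per document; the scores, the function names, and the 0.3 cut-off are all made up for illustration.

```python
def rank_scores(scores, n):
    """Return the n highest (index, score) pairs, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:n]]


def count_above(scores, threshold):
    """Count documents whose cosine similarity exceeds the threshold."""
    return sum(1 for s in scores if s > threshold)


# Made-up cosine similarities for five documents.
scores = [0.12, 0.81, 0.05, 0.79, 0.33]
print(rank_scores(scores, n=3))
print(count_above(scores, threshold=0.3))
```

Unlike the "Text" and "Index" modes of the R function, which return hits in document order, `rank_scores` returns them best first.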

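EDIT:

Just recently I had to solve the same problem you are trying to solve, but the function above did not work well, simply because the data was atrocious: the texts were short, there was very little of it, and not many topics were covered. So, to find relevant entries in that kind of situation, I developed another function that is based purely on regular expressions.

Here it goes: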

HLS.Extract = function(pattern,text=active.text){


  require(qdap)
  require(tm)
  require(RTextTools)

  p = unlist(strsplit(pattern," "))
  p = unique(wordStem(p))
  p = gsub("(.*)i$","\\1y",p)

  Syns = function(word){   
    wl    =   gsub("(.*[[:space:]].*)","",      
                   gsub("^c\\(|[[:punct:]]+|^[[:space:]]+|[[:space:]]+$","",  
                        unlist(strsplit(PlainTextDocument(synonyms(word)),",")))) 
    wl = wl[wl!=""] 
    return(wl)     
  } 

  trim = function(x){

    temp_L  = nchar(x)
    if(temp_L < 5)                {N = 0}
    if(temp_L > 4 && temp_L < 8)  {N = 1}
    if(temp_L > 7 && temp_L < 10) {N = 2}
    if(temp_L > 9)                {N = 3}
    x = substr(x,0,nchar(x)-N)
    x = gsub("(.*)","\\1\\\\\\w\\*",x)

    return(x)
  }

  # SINGLE WORD SCENARIO

  if(length(p)<2){

    # EXACT
    p = trim(p)
    ndx_exact  = grep(p,text,ignore.case=T)
    text_exact = text[ndx_exact]

    # SEMANTIC
    p = unlist(strsplit(pattern," "))

    express  = new.exp = list()
    express  = c(p,Syns(p))
    p        = unique(wordStem(express))

    temp_exp = unlist(strsplit(express," "))
    temp.p = double(length(seq(temp_exp)))

    for(j in seq(temp_exp)){
      temp_exp[j] = trim(temp_exp[j])
    }

    rgxp   = paste(temp_exp,collapse="|")
    ndx_s  = grep(paste(temp_exp,collapse="|"),text,ignore.case=T,perl=T)
    text_s = as.character(text[ndx_s])

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s)
  }

  # TWO OR MORE WORDS

  if(length(p)>1){

    require(combinat)

    # EXACT
    for(j in seq(p)){p[j] = trim(p[j])}

    fp     = factorial(length(p))
    pmns   = permn(length(p))
    tmat   = matrix(0,fp,length(p))
    permut = double(fp)
    temp   = double(length(p))
    for(i in 1:fp){
      tmat[i,] = pmns[[i]]
    }

    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(p[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=" ")
    }

    permut = gsub("[[:space:]]",
                  "[[:space:]]+([[:space:]]*\\\\w{,3}[[:space:]]+)*(\\\\w*[[:space:]]+)?([[:space:]]*\\\\w{,3}[[:space:]]+)*",permut)

    ndx_exact  = grep(paste(permut,collapse="|"),text)
    text_exact = as.character(text[ndx_exact])


    # SEMANTIC

    p = unlist(strsplit(pattern," "))
    express = list()
    charexp = permut = double(length(p))
    for(i in seq(p)){
      express[[i]] = c(p[i],Syns(p[i]))
      express[[i]] = unique(wordStem(express[[i]]))
      express[[i]] = gsub("(.*)i$","\\1y",express[[i]])
      for(j in seq(express[[i]])){
        express[[i]][j] = trim(express[[i]][j])
      }
      charexp[i] = paste(express[[i]],collapse="|")
    }

    charexp  = gsub("(.*)","\\(\\1\\)",charexp)
    charexpX = double(length(p))
    for(i in 1:fp){
      for(j in seq(p)){
        temp[j] = paste(charexp[tmat[i,j]])
      }
      permut[i] = paste(temp,collapse=
                          "[[:space:]]+([[:space:]]*\\w{,3}[[:space:]]+)*(\\w*[[:space:]]+)?([[:space:]]*\\w{,3}[[:space:]]+)*")
    }
    rgxp   = paste(permut,collapse="|")
    ndx_s  = grep(rgxp,text,ignore.case=T)
    text_s = as.character(text[ndx_s])

    temp.f = function(x){
      if(length(x)==0){x=0}
    }

    temp.f(ndx_exact);  temp.f(ndx_s)
    temp.f(text_exact); temp.f(text_s)

    f.object = list("ExactIndex"    = ndx_exact,
                    "SemanticIndex" = ndx_s,
                    "ExactText"     = text_exact,
                    "SemanticText"  = text_s,
                    "Synset"        = express)

  }
  # Note: the cat() calls below come after return(), so they never execute.
  return(f.object)
  cat(paste("Exact Matches:",length(ndx_exact),sep=""))
  cat(paste("\n"))
  cat(paste("Semantic Matches:",length(ndx_s),sep=""))
}

Let's try it out:

HLS.Extract("buy house",
            c("we bought a new house",
              "I'm thinking about buying a new home",
              "purchasing a brand new house"))[["SemanticText"]]

$SemanticText
[1] "I'm thinking about buying a new home" "purchasing a brand new house"

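As you can see, the function is quite flexible. It would also pick up "home buying". It did not pick up "we bought a new house", though, because "bought" is an irregular verb. That is the kind of thing LSA would have picked up.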
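The core trick of the regex approach, and the reason "bought" slips through, can be sketched in a few lines of Python. This is a simplified sketch, not a port: `SYNONYMS` is a hypothetical table in place of the qdap lookup, stemming is omitted, a leading `\b` word boundary is added, and a text counts as a match when every query word (or one of its synonyms) occurs anywhere in it, whereas the R function builds ordered permutations with limited gaps between the words.

```python
import re

# Hypothetical synonym table standing in for qdap's synonyms().
SYNONYMS = {"buy": ["purchase"], "house": ["home"]}


def trim_to_pattern(word):
    """Cut a length-dependent suffix and allow any word characters after it."""
    length = len(word)
    if length < 5:
        cut = 0
    elif length < 8:
        cut = 1
    elif length < 10:
        cut = 2
    else:
        cut = 3
    stem = word[: length - cut]
    return r"\b" + re.escape(stem) + r"\w*"


def semantic_matches(query, texts, synonyms=SYNONYMS):
    """Return the texts in which every query word or a synonym occurs."""
    groups = []
    for word in query.lower().split():
        variants = [word] + synonyms.get(word, [])
        pattern = "|".join(trim_to_pattern(v) for v in variants)
        groups.append(re.compile(pattern, re.IGNORECASE))
    return [t for t in texts if all(g.search(t) for g in groups)]


texts = [
    "we bought a new house",
    "I'm thinking about buying a new home",
    "purchasing a brand new house",
]
for hit in semantic_matches("buy house", texts):
    print(hit)
```

The pattern built from "buy" matches "buying" but not "bought", and "purchase" is trimmed to "purcha" so that it matches "purchasing". No amount of suffix trimming recovers an irregular form, which is why the first text is missed.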

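So you might want to try both approaches and see which one works better. The SemanticLink function also needs a lot of memory, so you won't be able to use it on a particularly large corpus.

Cheers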

score 0 · Answer

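I recommend reading the answers to this question; the first two in particular are really good.
I can also recommend the Natural Language Toolkit (NLTK), although I haven't tried it personally.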

score 0 · Answer

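For similarity between news articles, you can extract keywords using part-of-speech tagging. NLTK provides a good POS tagger. Use nouns and noun phrases as keywords, and represent each news article as a keyword vector.

Then use cosine similarity, or some similar text similarity measure, to quantify the similarity.

Further enhancements include handling synonyms, stemming, handling adjectives if needed, using TF-IDF as the keyword weights in the vector, and so on.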
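A minimal Python sketch of this pipeline, using only the standard library, might look like the following. Note the assumptions: a tiny stop-word filter stands in for real POS-based noun extraction (with NLTK you would keep only the tokens tagged as nouns), the three "articles" are invented, and the weight is plain term frequency times log(N/df).

```python
import math
from collections import Counter

# Tiny stop-word list used instead of a real part-of-speech tagger.
STOPWORDS = {"a", "an", "and", "as", "in", "of", "the", "to"}


def keywords(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def tfidf_vectors(docs):
    """Represent each document as a sparse {keyword: tf * idf} dictionary."""
    token_lists = [keywords(d) for d in docs]
    df = Counter(term for toks in token_lists for term in set(toks))
    n = len(docs)
    vectors = []
    for toks in token_lists:
        tf = Counter(toks)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


# Invented example articles.
articles = [
    "stock market falls as investors sell shares",
    "investors buy shares as stock market rallies",
    "local team wins football championship final",
]
vecs = tfidf_vectors(articles)
print(cosine(vecs[0], vecs[1]))  # two finance articles
print(cosine(vecs[0], vecs[2]))  # finance vs. sports
```

The two finance articles get a clearly positive similarity, while the finance and sports articles share no keyword and score 0.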
