
I'm working on a research project for school. I've written some text-mining software that analyzes legal texts in a collection and outputs a score indicating how similar they are. I ran the program to compare each text with every other text, and I have data like this (though there are many more data points):

codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582

Now I need to plot them on a graph. I can easily invert the scores so that a small value now indicates similar texts and a large value indicates dissimilar texts: the value can then serve as the distance between the points representing the texts on the graph (the conversion is sketched in R below):

codeofhammurabi.txt crete.txt      1.75212
codeofhammurabi.txt iraqi.txt      0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt         0.7965
crete.txt iraqi.txt                3.0344
crete.txt magnacarta.txt           1.6955
crete.txt us.txt                   2.0329
iraqi.txt magnacarta.txt           1.1983
iraqi.txt us.txt                   0.7261
magnacarta.txt us.txt              0.9125

Short version: the values directly above are distances between points on a scatterplot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What's the best way to make this graph? I have MATLAB, R, and Excel, and can get access to just about any software I might need.
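For what it's worth, the inversion above is consistent with simply taking the reciprocal of each score; a minimal R sketch, where scores stands in for the third column of the first table:

scores <- c(0.570737, 1.13475, 0.945746, 1.25546, 0.329545,
            0.589786, 0.491903, 0.834488, 1.37718, 1.09582)
distances <- 1/scores  # small value = similar texts, large value = dissimilar
# e.g. 1/0.570737 = 1.75212, matching the second table (up to rounding)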

If you could point me in the right direction, I'd be very grateful.


7 Answers


If the question is "how can I do what this guy did?" (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on the Euclidean distances between documents' topic posterior probabilities.

"This guy" is Matt Jockers, an innovative scholar in the digital humanities. He has documented some of his methods on his blog and elsewhere. Jockers works mostly in R and shares some of his code. His basic workflow seems to be:

  1. break the plain text into 1000-word chunks,
  2. remove stopwords (don't stem),
  3. do part-of-speech tagging and keep only the nouns,
  4. build a topic model (using LDA),
  5. calculate the Euclidean distances between documents based on their topic proportions, subset the distances to keep only those below a certain threshold, and then
  6. visualize with a force-directed graph.

Here's a small-scale reproducible example in R (with an export to Gephi) that might be close to what Jockers did:

#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()

Get the data...

#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05 
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]

Clean and reshape...

#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
    xmlValue(xmlRoot(doc))
}, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation, whitespaces, words <3 characters long..
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, wordLengths=c(3,Inf), FUN = tm_reduce, tmFuns = funcs)

Part-of-speech tagging and subsetting the nouns...

#### Part-of-speech tagging to extract nouns only
library("openNLP", "NLP")
# function for POS tagging
tagPOS <-  function(x) {

  s <- NLP::as.String(x)
  ## Need sentence and word token annotations.

  a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
  a3 <- NLP::annotate(s,  openNLP::Maxent_POS_Tag_Annotator(), a2)

  ## Determine the distribution of POS tags for word tokens.
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

  ## Extract token/POS pairs (all of them): easy - not needed
  # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  return(unlist(POStags))
} 
# a  loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list",  length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
  corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
  print(i) # nice to see what we're up to
  gc()
}

# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[8] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"

Topic modelling with latent Dirichlet allocation...

#### topic modelling with LDA
# Jockers uses the lda package and MALLET, maybe topicmodels also, I'm not sure.
# I'm most familiar with the topicmodels package, so here it is.
# Note that MALLET can be run from R: https://gist.github.com/benmarwick/4537873
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix 
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model 
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells 
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma),  paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))

Calculate the Euclidean distances between documents, using the topic probabilities as each document's 'DNA'...

#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <-  as.matrix(daisy(JSS_topic_df, metric =  "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only 
# closely related documents and avoid a dense spaghetti diagram 
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0

Visualize with a force-directed graph...

#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
g <- as.undirected(graph.adjacency(JSS_topic_df_dist, weighted = TRUE)) # weighted=TRUE since the matrix holds real-valued distances
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1,  vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)

[figure: the force-directed network plot produced by the code above]

If you want to use the Force Atlas 2 algorithm in Gephi, you simply export the R graph object to a graphml file, then open it in Gephi and set the layout to Force Atlas 2:

# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml") 

Here is the Gephi plot with the Force Atlas 2 algorithm:

[figure: Gephi network diagram laid out with Force Atlas 2]

Answered on 2013-04-15T08:19:03.893

Your data are really distances (of some form) in multivariate space, spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-d mapping of the dissimilarities. Principal coordinates analysis and non-metric multidimensional scaling are two such methods. I would suggest you plot the results of applying one or the other of these methods to your data. I provide examples of both below.

First, load the data you supplied (without labels at this stage):

con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)

What you effectively have is the following distance matrix:

mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
  c("codeofhammurabi","crete","iraqi","magnacarta","us")

> mat
                codeofhammurabi  crete  iraqi magnacarta us
codeofhammurabi              NA     NA     NA         NA NA
crete                   1.75212     NA     NA         NA NA
iraqi                   0.88120 3.0344     NA         NA NA
magnacarta              1.05730 1.6955 1.1983         NA NA
us                      0.79650 2.0329 0.7261     0.9125 NA

R, in general, wants a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight for the "dist" object like this:

class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE

> vec
           codeofhammurabi   crete   iraqi magnacarta
crete              1.75212                           
iraqi              0.88120 3.03440                   
magnacarta         1.05730 1.69550 1.19830           
us                 0.79650 2.03290 0.72610    0.91250
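As an aside, the first route is a one-liner (a sketch; as.dist() reads the lower triangle of the mat built above):

d <- as.dist(mat)  # the same "dist" object, built from the lower triangle of mat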

Now that we have an object of the correct type, we can ordinate it. R has many packages and functions that can do this (see the Multivariate or Environmetrics Task Views on CRAN), but I'll use the vegan package as I am somewhat familiar with it...

require("vegan")

Principal coordinates

First up, I illustrate how to do principal coordinates analysis on your data using vegan.

pco <- capscale(vec ~ 1, add = TRUE)
pco

> pco
Call: capscale(formula = vec ~ 1, add = TRUE)

              Inertia Rank
Total           10.42     
Unconstrained   10.42    3
Inertia is squared Unknown distance (euclidified) 

Eigenvalues for unconstrained axes:
 MDS1  MDS2  MDS3 
7.648 1.672 1.098 

Constant added to distances: 0.7667353

So far this is showing that the first PCO axis is by far the most important in explaining the differences between the texts, as indicated by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method:

plot(pco)

which produces

[figure: PCO ordination plot of the five texts]

Non-metric multidimensional scaling

Non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above); an iterative algorithm is required to provide a solution. Random starts are recommended to make sure the algorithm hasn't converged to a sub-optimal, locally optimal solution. Vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, see isoMDS in the MASS package.
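If you do want plain nMDS, here is a minimal isoMDS sketch (an illustration only, reusing the vec "dist" object from above):

library("MASS")
sol0 <- isoMDS(vec, k = 2)  # list with $points (2-d coordinates) and $stress
plot(sol0$points, type = "n", asp = 1)
text(sol0$points, labels = attr(vec, "Labels"))

Back to metaMDS: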

set.seed(42)
sol <- metaMDS(vec)

> sol

Call:
metaMDS(comm = vec) 

global Multidimensional Scaling using monoMDS

Data:     vec 
Distance: user supplied 

Dimensions: 2 
Stress:     0 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation 
Species: scores missing

With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be achieved using the plot method:

plot(sol, type = "text", display = "sites")

which produces

[figure: nMDS ordination plot of the five texts]

In both cases the distances on the plot between the samples are the best 2-d approximation of their dissimilarity. In the case of the PCO plot, it is a 2-d approximation of the real dissimilarities (3 dimensions are needed to represent all of the dissimilarities fully), whereas in the nMDS plot the distances between samples on the plot reflect the rank dissimilarities, not the actual dissimilarities between the observations. But essentially, distances on the plot represent the computed dissimilarities: texts close together are most similar, and texts located far apart on the plot are the most dissimilar to one another.
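If you want to check how faithful the nMDS mapping is, vegan's stressplot() draws a Shepard diagram for the fit, plotting the ordination distances against the observed dissimilarities (a quick diagnostic sketch):

stressplot(sol)  # Shepard diagram: ordination distances vs observed dissimilarities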

Answered on 2013-04-14T23:00:20.230

You could do a network graph using igraph. The Fruchterman-Reingold layout has a parameter for supplying edge weights. Weights greater than 1 produce more "attraction" along the edges; weights less than 1 do the opposite. In your example, crete.txt has the shortest distances and sits in the middle, with shorter edges to the other vertices. In fact, it is closest to iraqi.txt. Note that you have to invert the data for E(g)$weight in order to get the correct distances.

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582")
par(mar=c(3,7,3.5,5), las=1)

library(igraph)
g <- graph.data.frame(data1, directed = FALSE)
E(g)$weight <- 1/data1[,3] # inverted: higher weight = more attraction along the edge
l <- layout.fruchterman.reingold(g, weights=E(g)$weight)
plot(g, layout=l)

[figure: igraph network of the five texts with Fruchterman-Reingold layout]

Answered on 2013-04-14T23:18:53.653

If you want to try a 3D bar view, this Matlab snippet should work:

% Load data from file 'dist.dat', with values separated by spaces
fid = fopen('dist.dat');
data = textscan(                            ...
        fid,                   '%s%s%f', ...
        'Delimiter',           ' ',      ...
        'MultipleDelimsAsOne', true      ...
);
fclose(fid);

% Find all unique sources
text_bodies = unique(reshape([data{1:2}],[],1));

% Iterate through the records and complete the similarity matrix
N = numel(text_bodies);
similarity = NaN(N,N);
for k = 1:size(data{1},1)
        n1 = find(strcmp(data{1}{k}, text_bodies));
        n2 = find(strcmp(data{2}{k}, text_bodies));

        similarity(n1, n2) = data{3}(k); % Symmetrical part ignored
end;

% Display 3D bar chart
bar3(similarity);

Answered on 2013-04-14T23:02:43.607

Here is a potential solution for Matlab:

You could arrange your data into a formal 5x5 similarity matrix S, where element S(i,j) represents the similarity (or dissimilarity) between document i and document j. Assuming your distance measure is an actual metric, you could apply multidimensional scaling to this matrix via mdscale(S,2).

This function will attempt to find a 5x2 representation of your data that preserves the similarities (or dissimilarities) between your classes found in the higher dimensions. You could then visualize this data as a scatterplot of 5 points.

You could also try mdscale(S,3) to project into a 5x3 matrix, which you could then visualize with plot3().
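The same idea can be sketched in R for comparison, since both tools are available (cmdscale() does classical MDS; this is an illustration, not the Matlab route):

labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
S <- matrix(0, 5, 5, dimnames = list(labs, labs))
S[lower.tri(S)] <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
                     1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
S <- S + t(S)                       # symmetrise the matrix
xy <- cmdscale(as.dist(S), k = 2)   # classical MDS into 2 dimensions
plot(xy, type = "n", asp = 1)
text(xy, labels = labs)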

Answered on 2013-04-14T22:58:02.727

If you want circles representing the distances between the points, this would work in R (I used the first table in your example):

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582")
par(mar=c(3,7,3.5,5), las=1)

symbols(data1[,1],data1[,2], circles=data1[,3], inches=0.55, bg="lightblue", xaxt="n", yaxt="n", ylab="")
axis(1, at=data1[,1],labels=data1[,1])
axis(2, at=data1[,2],labels=data1[,2])
text(data1[,1], data1[,2], round(data1[,3],2), cex=0.9)

[figure: bubble chart drawn with symbols(), circle size showing each pairwise score]

Answered on 2013-04-14T22:58:10.433

Are you doing all pairwise comparisons? Depending on how you calculate the distance (similarity), I am not sure it is possible to make such a scatterplot. With only 3 text files to consider, the scatterplot is easy to make (a triangle whose side lengths equal the distances). But when you add a fourth point, you might not be able to place it anywhere such that its distances to the existing 3 points satisfy all the constraints.

But if you can do that, then you have a solution: just keep adding new points... I think... Alternatively, if you don't need the distances on the scatterplot to be exact, you can simply draw a web and label it with the distances.
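As it happens, a quick check in R shows that the inverted values from the question are not even a metric: several triples violate the triangle inequality, so no exact placement of the points exists in any number of dimensions (a sketch using the second table's values):

labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
                     1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
m <- m + t(m)
# report every triple that violates d(i,k) <= d(i,j) + d(j,k)
for (i in 1:4) for (k in (i + 1):5) for (j in setdiff(1:5, c(i, k)))
    if (m[i, k] > m[i, j] + m[j, k])
        cat(labs[i], "-", labs[k], "via", labs[j], ":",
            m[i, k], ">", m[i, j] + m[j, k], "\n")
# e.g. crete - iraqi (3.0344) > 1.75212 + 0.8812 via codeofhammurabi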

Answered on 2013-04-14T22:19:06.320