r - Generating a network graph from correlations between words in a document

Question

I would like to create a network graph similar to the first one shown on this page: http://minimaxir.com/2016/12/interactive-network/

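The nodes of the graph should be the words in a .txt document (after removing stop words and other preprocessing). The edges should be the correlations between words in the document (for example, the word "word" often appears next to the word "up"), keeping only the stronger correlations. I am thinking of node size = frequency of the word across the whole document, and distance between nodes = strength or weakness of the relationship between the two words.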
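To make the mapping above concrete, here is a minimal base R sketch, not a definitive implementation. It assumes that "correlation" means the Pearson correlation of per-sentence word presence (1 if the word occurs in a sentence, 0 otherwise), which is only one of several reasonable definitions. The toy text, the short stop word list, and the `min_cor` threshold are all illustrative assumptions.

```r
# Toy text standing in for the contents of the .txt file (assumption).
text <- paste(
  "The cat sat on the mat. The cat chased the mouse.",
  "The dog sat on the rug. The dog chased the cat.",
  "The mouse hid under the rug."
)

# Tiny hand-written stop word list (assumption; a real list is much longer).
stop_words <- c("the", "on", "under")

# Split into sentences, then into lower-case word tokens without stop words.
sentences <- trimws(unlist(strsplit(tolower(text), "[.!?]+")))
sentences <- sentences[nzchar(sentences)]
tokens_by_sentence <- lapply(strsplit(sentences, "[^a-z]+"), function(w) {
  w[nzchar(w) & !(w %in% stop_words)]
})

# Node attribute: how often each word occurs in the whole text.
word_freq <- table(unlist(tokens_by_sentence))
vocab <- names(word_freq)
nodes <- data.frame(word = vocab, frequency = as.integer(word_freq),
                    stringsAsFactors = FALSE)

# Sentence-by-word presence matrix (rows = sentences, columns = words).
presence <- t(vapply(
  tokens_by_sentence,
  function(w) as.numeric(vocab %in% w),
  numeric(length(vocab))
))
colnames(presence) <- vocab

# Pairwise correlation between the word columns.
word_cor <- cor(presence)

# Edge list: keep each pair once (upper triangle) and only strong correlations.
min_cor <- 0.5  # illustrative threshold (assumption)
idx <- which(upper.tri(word_cor) & word_cor >= min_cor, arr.ind = TRUE)
edges <- data.frame(
  from = vocab[idx[, "row"]],
  to = vocab[idx[, "col"]],
  correlation = word_cor[idx],
  stringsAsFactors = FALSE
)

print(nodes)
print(edges)
```

The `nodes` and `edges` data frames are in the shape most graph packages accept, so `frequency` can drive node size and `correlation` can drive edge weight or layout distance.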

I am currently using a combination of R, quanteda, and ggplot2, along with a few other dependencies.

If anyone has suggestions on how to generate the word correlations in R (preferably with quanteda) and then plot them as a graph, I would be very grateful.

Of course, please let me know if I can improve this question in any way. Here is my attempt so far:

library(quanteda)
library(readtext)
library(ggplot2)
library(stringi)

## Load the .txt doc 
document <- texts(readtext("file1.txt"))

## Make everything lowercase... store in a separate variable
## texts() returns a plain character vector, so there is no $text element to extract
documentlower <- char_tolower(document)

## Tokenize the lower-case document
documenttokens <- as.character(tokens(documentlower, remove_punct = TRUE))
(total_length <- length(documenttokens))

## Create the document-feature matrix (dfm) - here we can also remove stopwords and stem
docudfm <- dfm(documentlower, remove_punct = TRUE, remove = stopwords("english"), stem = TRUE)

## Inspect the top 10 Words by Count
textstat_frequency(docudfm, n = 10)

## Create a sorted list of tokens by frequency count
sorted_document <- topfeatures(docudfm, n = nfeat(docudfm))

## Normalize the data points to find their percentage of occurrence in the documents
sorted_document <- sorted_document / sum(sorted_document) * 100

## Also normalize the data points in the DFM
docudfm_pct <- dfm_weight(docudfm, scheme = "prop") * 100
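One possible direction from here, offered as a sketch rather than a definitive answer, is quanteda's feature co-occurrence matrix together with `textplot_network()`. Note that `fcm()` counts how often words appear near each other within a window; it does not compute a correlation coefficient, so this swaps co-occurrence in for correlation. The sketch assumes quanteda 3 or later, where `textplot_network()` lives in the separate quanteda.textplots package, and it uses `data_char_ukimmig2010`, a sample text bundled with quanteda, as a stand-in for file1.txt. The window size, the number of words kept, and the label scaling are illustrative choices.

```r
library(quanteda)
library(quanteda.textplots)  # assumption: quanteda >= 3 is installed

# Stand-in for the contents of file1.txt (assumption).
txt <- data_char_ukimmig2010

# Tokenize, lower-case, and drop punctuation and English stop words.
toks <- tokens(char_tolower(txt), remove_punct = TRUE)
toks <- tokens_remove(toks, pattern = stopwords("english"))

# Word frequencies, used below to scale the node labels.
top_words <- topfeatures(dfm(toks), n = 20)

# Co-occurrence counts within a 5-token window (not a correlation measure).
fcmat <- fcm(toks, context = "window", window = 5, tri = FALSE)

# Keep only the most frequent words so the plot stays readable.
fcmat_top <- fcm_select(fcmat, pattern = names(top_words), selection = "keep")

# Align the frequencies with the feature order of the reduced matrix.
freq_aligned <- top_words[rownames(fcmat_top)]
label_size <- 2 + 4 * freq_aligned / max(freq_aligned)

# min_freq drops weaker co-occurrences; the layout seed is fixed for repeatability.
set.seed(100)
p <- textplot_network(fcmat_top, min_freq = 0.5, vertex_labelsize = label_size)
print(p)
```

Raising `min_freq` keeps only the stronger links, which corresponds to "only the stronger correlations" in the question. The plot is a ggplot object, so it can be adjusted further with ggplot2.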
