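I have to analyse scientific papers published in a list of more than 20,000 journals. My list has over 450,000 records, but it contains a number of duplicates (e.g., a paper with more than one author from different institutions appears more than once).
Well, I need to count the number of distinct papers per journal, but the problem is that different authors do not always provide the information in the same way, so I can end up with a table like the following: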
JOURNAL PAPER
0001-1231 A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
0001-1231 THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 COMPOSITIONAL AND LOCAL LIVELOCK ANALYSIS FOR CSP
0001-1231 AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING
0001-1231 DECYCLING WITH A MATCHING.
0001-1231 DECYCLING WITH A MATCHING
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH
0001-1231 ON THE HARDNESS OF FINDING THE GEODETIC NUMBER OF A SUBCUBIC GRAPH.
0001-1232 DECISION TREE CLASSIFICATION WITH BOUNDED NUMBER OF ERRORS
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 AN INCREMENTAL LINEAR-TIME LEARNING ALGORITHM FOR THE OPTIMUM-PATH
0001-1232 COOPERATIVE CAPACITATED FACILITY LOCATION GAMES
0001-1232 OPTIMAL SUFFIX SORTING AND LCP ARRAY CONSTRUCTION FOR ALPHABETS
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M )
0001-1232 FAST MODULAR REDUCTION AND SQUARING IN GF (2 M)
0001-1232 ON THE GEODETIC NUMBER OF COMPLEMENTARY PRISMS
0001-1232 DESIGNING MICROTISSUE BIOASSEMBLIES FOR SKELETAL REGENERATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION
0001-1232 GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION
My goal is to use something like:
library(dplyr)

# One row per journal with its number of distinct (JOURNAL, PAPER) pairs
data %>%
  distinct(JOURNAL, PAPER) %>%
  group_by(JOURNAL) %>%
  summarise(papers_in_journal = n())
So that I would have information like this:
JOURNAL papers_in_journal
0001-1231 6
0001-1232 8
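The problem is that, as you can see, there are some errors in the names of the published papers. Some have a period at the end; some have extra spaces or substituted symbols; some have other small variations, such as W[1]-HARD versus W-HARD. So, if I run the code as is, what I get is: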
JOURNAL papers_in_journal
0001-1231 10
0001-1232 10
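My question: is there any way to take a similarity margin into account when using distinct() or a similar command, so that I could have something like distinct(JOURNAL, PAPER %within% 0.95)?
In that sense, I want the command to consider: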
A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
=
A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS
THE P3 INFECTION TIME IS W[1]-HARD PARAMETERIZED BY THE TREEWIDTH
=
THE P3 INFECTION TIME IS W-HARD PARAMETERIZED BY THE TREEWIDTH
DECYCLING WITH A MATCHING
=
DECYCLING WITH A MATCHING.
etc.
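To make the idea of a "similarity margin" concrete, here is a minimal sketch of one way it could be measured. It assumes that similarity means one minus the Levenshtein edit distance divided by the length of the longer title; `title_similarity` is a hypothetical helper name, not an existing function, and only base R's `adist()` is used.

```r
# Hypothetical helper: similarity in [0, 1] between two titles,
# defined here (as an assumption) as 1 - edit distance / longer length.
title_similarity <- function(a, b) {
  d <- utils::adist(a, b)[1, 1]      # Levenshtein distance between a and b
  1 - d / max(nchar(a), nchar(b))
}

sim <- title_similarity("DECYCLING WITH A MATCHING",
                        "DECYCLING WITH A MATCHING.")
sim > 0.95  # TRUE: the two titles differ by a single trailing period
```

Under this definition a 0.95 margin tolerates roughly one edited character per 20 characters of title, so very short titles would need a looser threshold.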
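I suppose there is no solution as simple as using distinct(), and I could not find any alternative command to do this. So, if this is not possible, I would also appreciate suggestions for any disambiguation algorithm I might use.
Thank you.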
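For reference, one direction that might work is sketched below, offered only as an illustration and not as a tested solution for 450,000 records. It assumes that titles within a journal can be grouped by single-linkage clustering on normalised edit distance, and that each cluster counts as one paper. `count_fuzzy_distinct` and the `papers` data frame are hypothetical names; `adist()`, `hclust()` and `cutree()` are base R functions.

```r
# Hypothetical helper: number of "fuzzy-distinct" titles in a character vector.
# Titles whose normalised edit distance is at most 1 - min_similarity are
# linked into the same cluster (single linkage, so chains of near-duplicates merge).
count_fuzzy_distinct <- function(titles, min_similarity = 0.95) {
  titles <- unique(titles)                 # exact duplicates collapse first
  if (length(titles) < 2) return(length(titles))
  len <- nchar(titles)
  d <- utils::adist(titles) / outer(len, len, pmax)  # distance / longer length
  tree <- stats::hclust(stats::as.dist(d), method = "single")
  length(unique(stats::cutree(tree, h = 1 - min_similarity)))
}

# Small illustrative sample taken from the table in the question
papers <- data.frame(
  JOURNAL = c(rep("0001-1231", 5), rep("0001-1232", 3)),
  PAPER = c(
    "A PRE-TEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
    "A PRETEST FOR FACTORING BIVARIATE POLYNOMIALS WITH COEFFICIENTS",
    "DECYCLING WITH A MATCHING",
    "DECYCLING WITH A MATCHING.",
    "AIDING EXPLORATORY TESTING WITH PRUNED GUI MODELS",
    "COOPERATIVE CAPACITATED FACILITY LOCATION GAMES",
    "GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS: ILLEGAL ALLOCATION",
    "GOVERNANCE OF BRAZILIAN PUBLIC ENVIRONMENTAL FUNDS - ILLEGAL ALLOCATION"
  ),
  stringsAsFactors = FALSE
)

# Named integer vector: one fuzzy-distinct count per journal
counts <- sapply(split(papers$PAPER, papers$JOURNAL), count_fuzzy_distinct)
counts
```

In a dplyr pipeline the same helper could presumably be called as summarise(papers_in_journal = count_fuzzy_distinct(PAPER)) after group_by(JOURNAL). Note that adist() builds a full pairwise matrix per journal, so journals with very many titles may be slow, and single linkage can merge genuinely different papers whose titles are nearly identical.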