r - PubMed XML parsing using entrez_fetch in rentrez

Question

I am collecting author and article information for a search term in PubMed. I can successfully get author names, publication year and other information using entrez_fetch in the rentrez package. Here is my example code:

library(rentrez)
library(XML)

pubmedSearch <- entrez_search("pubmed", term = "flexible ureteroscope", retmax = 100)
SearchResults <- entrez_fetch(db="pubmed", pubmedSearch$ids, rettype="xml", parsed=TRUE)
First_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["ForeName"]])})
Last_Name <- xpathSApply(SearchResults, "//Author", function(x) {xmlValue(x[["LastName"]])})
PubYear <- xpathSApply(SearchResults, "//PubDate", function(x) {xmlValue(x[["Year"]])})
PMID <- xpathSApply(SearchResults, "//ArticleIdList", function(x) {xmlValue(x[["ArticleId"]])})

Although I get all the information I need, I cannot figure out which authors belong to which PMID, because the number of authors differs from article to article. For example, if I parse author information for 100 articles as in my code, I get more than 100 author names and cannot associate them with their respective PMIDs. Overall, I would like an output data frame like this:

 PMID       First_Name   Last_Name          PubYear
 28221147   Carlos       Torrecilla Ortiz   2017
 28221147   Sergi        Colom Feixas       2017
 28208536   Dean G       Assimos            2017
 28203551   Chad M       Gridley            2017
 28203551   Bodo E       Knudsen            2017

This way, I would know which authors are associated with which PMID, which is useful for further analysis.

Note that this is only a small example of my code. I am collecting more information using XML parsing via entrez_fetch in the rentrez package.

This problem is really bugging me and I would really appreciate any help or guidance. Thank you for your efforts and help in advance.

score 2 · Accepted Answer

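This is really a question about XPath (the language used to specify nodes in an XML file), and I don't claim to be an expert in it. But I think I can help in this case.

You want to make sure you extract information from one publication record (a PubmedArticle entry) at a time. You can write a function that does this for a single record: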

parse_paper <- function(paper){
  # The leading "." keeps each query relative to this one record
  last_names <- xpathSApply(paper, ".//Author/LastName", xmlValue)
  first_names <- xpathSApply(paper, ".//Author/ForeName", xmlValue)
  pmid <- xpathSApply(paper, ".//ArticleId[@IdType='pubmed']", xmlValue)
  # pmid is recycled so that every author row carries it
  data.frame(pmid=pmid, last_names=last_names, first_names=first_names)
}

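This should give you one row per author, with every row carrying the same pmid. We can now extend this to the whole set of articles by calling that function on each one.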

parse_multiple_papers <- function(papers){
  res <- xpathApply(papers, "/PubmedArticleSet/*", parse_paper)
  do.call(rbind.data.frame, res)
}

head(parse_multiple_papers(SearchResults))


      pmid       last_names first_names
1 28221147 Torrecilla Ortiz      Carlos
2 28221147     Colom Feixas       Sergi
3 28208536          Assimos      Dean G
4 28203551          Gridley      Chad M
5 28203551          Knudsen      Bodo E
6 28101159               Li    Zhi-Gang
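
The functions above assume that every Author node has both a LastName and a ForeName, and that each record contains exactly one matching ArticleId. As far as I know, neither is guaranteed: some authors are listed only under CollectiveName, and ArticleId elements can also appear inside a record's reference list. The following is a minimal sketch of a more defensive variant that also adds the PubYear column from the question. The helper names (first_value, parse_author, parse_paper_full, parse_papers_full) and the tiny inline XML are made up for illustration, and reading the PMID from ./MedlineCitation/PMID is my assumption about the record layout, not something taken from the original answer.

```r
library(XML)

# Hypothetical helper: text of the first node matching `path` below `node`,
# or NA when nothing matches (e.g. an author without a ForeName).
first_value <- function(node, path) {
  hits <- xpathSApply(node, path, xmlValue)
  if (length(hits) == 0) NA_character_ else hits[[1]]
}

# One row for a single Author node, tagged with the record's PMID and year.
parse_author <- function(author, pmid, year) {
  data.frame(
    PMID = pmid,
    First_Name = first_value(author, "./ForeName"),
    Last_Name = first_value(author, "./LastName"),
    PubYear = year,
    stringsAsFactors = FALSE
  )
}

# One row per author for a single PubmedArticle node.
parse_paper_full <- function(paper) {
  pmid <- first_value(paper, "./MedlineCitation/PMID")  # assumed location
  year <- first_value(paper, ".//PubDate/Year")         # NA if only MedlineDate
  authors <- getNodeSet(paper, ".//Author")
  do.call(rbind, lapply(authors, parse_author, pmid = pmid, year = year))
}

# Apply the per-record parser to every PubmedArticle and stack the results.
parse_papers_full <- function(doc) {
  papers <- getNodeSet(doc, "/PubmedArticleSet/PubmedArticle")
  do.call(rbind, lapply(papers, parse_paper_full))
}

# Made-up miniature document, only to exercise the edge cases offline.
xml_text <- '<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
    <Journal><JournalIssue><PubDate><Year>2017</Year></PubDate></JournalIssue></Journal>
    <AuthorList>
      <Author><LastName>Smith</LastName><ForeName>Ann</ForeName></Author>
      <Author><CollectiveName>Example Group</CollectiveName></Author>
    </AuthorList>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article>
    <Journal><JournalIssue><PubDate><MedlineDate>2016 Nov-Dec</MedlineDate></PubDate></JournalIssue></Journal>
    <AuthorList>
      <Author><LastName>Lee</LastName><ForeName>Bo</ForeName></Author>
    </AuthorList>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>'

demo <- parse_papers_full(xmlParse(xml_text, asText = TRUE))
print(demo)
```

With real data you would call parse_papers_full(SearchResults) instead of parsing the inline string. Building one row per Author node, rather than pulling out separate name vectors, keeps first and last names aligned even when one of them is missing.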

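By the way, I don't usually search Stack Overflow, but I will answer any questions about rentrez that are filed as issues on its GitHub repository (they don't have to be "bugs" to go there).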
