json - R: parsing JSON/XML compound properties exported from PubChem

Question

I would like to parse, in R, all the chemical properties that PubChem lists for a given compound, using its JSON (or XML) export.

Example: ALPHA-IONONE, PubChem compound ID 5282108

https://pubchem.ncbi.nlm.nih.gov/compound/5282108

library("rjson")
data <- rjson::fromJSON(file="https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")

or

library("RJSONIO")
data <- RJSONIO::fromJSON("https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/JSON/?response_type=display")

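gives me a nested tree of lists, but how do I get from this fairly complex list of lists to a tidy data frame, or a list of data frames?

In this case, what I am after is everything under

3.1 Computed Descriptors

3.2 Other Identifiers

3.3 Synonyms

4.1 Computed Properties

in a single row of a data frame, with each element in its own named column, and with multi-valued elements (e.g. multiple synonyms) pasted together using "|" as the separator. For this compound, something like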

pubchemid      IUPAC_Name    InChI       InChI_Key     Canonical SMILES      Isomeric SMILES     CAS     EC Number     Wikipedia      MeSH Synonyms     Depositor-Supplied Synonyms   Molecular_Weight    Molecular_Formula    XLogP3   Hydrogen_Bond_Donor_Count ... 
5282108        (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one       InChI=1S/C13H20O/c1-10-6-5-9-13(3,4)12(10)8-7-11(2)14/h6-8,12H,5,9H2,1-4H3/b8-7+ ....

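Fields with multiple items, such as Depositor-Supplied Synonyms, can be pasted together with "|", so the value would be ALPHA-IONONE|Iraldeine|...

Second, I would also like to import section 4.2.2, Kovats Retention Index, as a data frame: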

pubchemid      column_class            kovats_ri
5282108        Standard non-polar      1413
5282108        Standard non-polar      1417
...
5282108        Semi-standard non-polar 1427
...

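(Section 4.3.1, GC-MS, would be nice too, but since it only shows the top 3 peaks it is of little use for now, so I will skip it.)

Does anyone know how to achieve this in an elegant way?

PS: Note that not all of these fields will necessarily be present for any given query.

The 2D structure and some properties can also be obtained from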

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=2d&response_type=display

and the 3D structure from

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/5282108/record/SDF/?record_type=3d&response_type=display

The data can also be exported as XML, using

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5282108/XML/?response_type=display

if that would be easier.

Note: I also tried the R package rpubchem, but it seems to import only a small part of the available information:

library("rpubchem")
get.cid(5282108)
CID  IUPACName CanonicalSmile MolecularFormula MolecularWeight TotalFormalCharge XLogP HydrogenBondDonorCount HydrogenBondAcceptorCount HeavyAtomCount    TPSA
2 5282108 (E)-4-(2,6,6-trimethylcyclohex-2-en-1-yl)but-3-en-2-one        C13H20O       192.297300               0                 3     0                      1                        14             17 5282108

score 1 · Accepted Answer

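My suggestion works on the XML files because (thanks to XPath) I find them easier to traverse and to select nodes from.

Note that this is neither fast (it took a few seconds in testing) nor optimal (I parse each file twice: once for the names and so on, and once for the Kovats Retention Index). But I guess you will want to parse some set of files once and get on with your real business, and premature optimization is the root of all evil.

I have put the main tasks into separate functions. If you want the data for one specific PubChem record, they are ready to use. If you want the data from several PubChem records at once, you can define a vector of pointers to the data and use the examples at the bottom to merge the results. In my case the vector contains paths to files on my local disk. URLs are supported as well, although I would discourage them (remember that each URL will be requested twice, and with a larger number of records you will probably want to handle network failures somehow).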
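One way to handle failing inputs when looping over many records is to wrap each call in tryCatch() and drop the records that failed. The sketch below is only an illustration: safe_parse() and fake_parser() are hypothetical names, and fake_parser() is a stand-in for the real parsing functions so that the example runs without any files or network access.

```r
# Hypothetical helper: run `parser` on one path and return NULL on error.
safe_parse <- function(path, parser) {
  tryCatch(parser(path), error = function(e) {
    message("skipping ", path, ": ", conditionMessage(e))
    NULL
  })
}

# Stand-in parser for illustration only; it fails when the path has no digits.
fake_parser <- function(path) {
  if (!grepl("[0-9]+", path)) stop("no compound ID in path")
  data.frame(pubchemid = sub(".*/([0-9]+)/?.*", "\\1", path))
}

paths <- c("./5282108.xml", "./broken.xml", "./5282148.xml")
parsed <- lapply(paths, safe_parse, parser = fake_parser)

# Keep only the successful results and bind them into one data frame.
combined <- do.call("rbind", Filter(Negate(is.null), parsed))
```

The same wrapper could be passed either of the parsing functions in place of fake_parser(); retrying failed URLs would need extra logic that is not shown here.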

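The compound you linked to has multiple entries under "EC Number". They differ in ReferenceNumber, but not in Name. I am not sure why that is or how I should handle it (your example output contains only one EC Number entry), so I left it to R. R adds suffixes to the duplicated names and creates EC.Number.1, EC.Number.2 and so on. These suffixes do not correspond to the ReferenceNumber in the file, so the same column in the main data frame may refer to different ReferenceNumbers for different compounds.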
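The renaming of duplicated names can be reproduced with base R alone. The sketch below uses placeholder values (not real EC or CAS numbers) to show the default data.frame() behaviour, plus one possible alternative that joins duplicates with "|" instead; that alternative is an assumption about what might be wanted, not something taken from the code in this answer.

```r
# Placeholder values only; the names mimic two entries sharing one Name.
props <- c("EC Number" = "placeholder-a",
           "EC Number" = "placeholder-b",
           "CAS"       = "placeholder-c")

# Default behaviour: data.frame() makes the names syntactic and unique.
default_df <- data.frame(as.list(props))
names(default_df)

# Possible alternative: collapse values that share a name into one field.
collapsed <- vapply(split(unname(props), names(props)),
                    paste, character(1), collapse = "|")
collapsed_df <- data.frame(as.list(collapsed), check.names = FALSE)
```

Collapsing keeps one column per field name across compounds, at the cost of losing which ReferenceNumber each value came from.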

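It seems that PubChem uses the format <type>Value[List] for its value tags. In a few places I have hardcoded StringValue, but some compounds may have a different type in the same field. I generally do not handle lists, except where you asked for it. So further modifications may be needed as more data is thrown at this code.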
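A rough way to avoid hardcoding the value type is to match any child element whose name contains "Value" and join whatever is found. The fragment below is hand-written in the shape described here, with a made-up namespace URI; it is not real PubChem output, and value_of() is a hypothetical helper name.

```r
library("xml2")

# Hand-written fragment for illustration; namespace and values are assumptions.
doc <- read_xml('<Record xmlns="http://example.org/assumed-ns">
  <Information><Name>Molecular Weight</Name><NumValue>192.3</NumValue></Information>
  <Information><Name>Synonyms</Name>
    <StringValueList>name-a</StringValueList>
    <StringValueList>name-b</StringValueList>
  </Information>
</Record>')
ns <- xml_ns(doc)
info <- xml_find_all(doc, "//d1:Information", ns)

# Join every *Value or *ValueList child with "|", whatever its type prefix.
value_of <- function(x) {
  nodes <- xml_find_all(x, "./*[contains(name(), 'Value')]")
  paste(xml_text(nodes, trim = TRUE), collapse = "|")
}

values <- vapply(info, value_of, character(1))
names(values) <- xml_text(xml_find_first(info, "./d1:Name", ns))
```

Everything comes back as character, so numeric fields would still need as.numeric() afterwards.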

If you have any questions, please post them in the comments. I am not sure whether I should explain the code in more detail.

library("xml2")
library("data.table")

compound.attributes <- function(file=NULL) {
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  # Select every Information node under the four requested sections
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Computed Descriptors'",
    " or text()='Other Identifiers'",
    " or text()='Synonyms'",
    " or text()='Computed Properties']",
    "/following-sibling::d1:Section/d1:Information"
  ), ns)

  properties <- sapply(information, function(x) {
    # xml_find_first() replaces the deprecated xml_find_one()
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    value_list <- xml_find_all(x, "./d1:StringValueList", ns)
    value <- if (length(value_list) > 0) {
      # Multi-valued fields (e.g. synonyms) are joined with "|"
      paste(xml_text(value_list, trim=TRUE), collapse="|")
    } else {
      xml_text(
        xml_find_first(x, "./*[contains(name(),'Value')]", ns),
        trim=TRUE)
    }
    names(value) <- name
    return(value)
  })
  rm(compound, information)
  properties <- as.list(properties)
  # Take the compound ID from the file path or URL
  properties$pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  return(data.frame(properties))
}

compound.retention.index <- function(file=NULL) {
  # Take the compound ID from the file path or URL
  pubchemid <- sub(".*/([0-9]+)/?.*", "\\1", file)
  compound <- read_xml(file)
  ns <- xml_ns(compound)
  information <- xml_find_all(compound, paste0(
    "//d1:TOCHeading[text()='Kovats Retention Index']",
    "/following-sibling::d1:Information"
  ), ns)
  # One data frame per column class, one row per retention index value
  indexes <- lapply(information, function(x) {
    # xml_find_first() replaces the deprecated xml_find_one()
    name <- xml_text(xml_find_first(x, "./d1:Name", ns))
    values <- as.numeric(sapply(
      xml_find_all(x, "./*[contains(name(), 'NumValue')]", ns),
      xml_text))

    data.frame(pubchemid=pubchemid,
               column_class=name,
               kovats_ri=values)
  })

  return( do.call("rbind", indexes) )
}

compounds <- c("./5282108.xml", "./5282148.xml", "./91754124.xml")

cd <- rbindlist(
  lapply(compounds, compound.attributes),
  fill=TRUE
)

rti <- do.call("rbind",
               lapply(compounds, compound.retention.index))
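
For readers who would rather stay with the JSON export, the nested list returned by fromJSON() can be searched recursively for a section by its TOCHeading. This is only a sketch: the list below is a hand-built, heavily simplified stand-in (the real record is deeper and its field names may differ), and find_section() is a hypothetical helper.

```r
# Hand-built stand-in for a parsed record; structure and values are assumptions.
record <- list(Record = list(Section = list(
  list(TOCHeading = "Names and Identifiers", Section = list(
    list(TOCHeading = "Synonyms", Information = list(
      list(Name = "Depositor-Supplied Synonyms",
           StringValueList = c("name-a", "name-b"))
    ))
  )),
  list(TOCHeading = "Chemical and Physical Properties", Section = list())
)))

# Depth-first search for the first list node with the given TOCHeading.
find_section <- function(node, heading) {
  if (!is.list(node)) return(NULL)
  if (identical(node$TOCHeading, heading)) return(node)
  for (child in node) {
    hit <- find_section(child, heading)
    if (!is.null(hit)) return(hit)
  }
  NULL
}

syn <- find_section(record, "Synonyms")
synonyms <- paste(syn$Information[[1]]$StringValueList, collapse = "|")
```

A missing section simply yields NULL, which fits the note in the question that not every field is present for every compound.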
