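I would like to retrieve the information associated with a given CAS registry number (Chemical Abstracts Service number) from the NIST WebBook website in R, using the API it provides.
For example, for CAS no. "19431-79-9" (Caryophylladienol II), http://webbook.nist.gov/cgi/cbook.cgi?ID=19431-79-9&Units=SI&Mask=2000#Gas-Chrom, I have got as far as: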
casno = "19431-79-9"
casno2 = gsub("-", "", casno)  # CAS number without hyphens, as used in the file URLs
# compound page, including the gas chromatography section
raw = readLines(paste0("http://webbook.nist.gov/cgi/cbook.cgi?ID=", casno, "&Units=SI&Mask=2000#Gas-Chrom"))
# mass spectrum: empty for this compound, but not e.g. for casno2 = "630035"
casno2 = "630035"
jcampfile = readLines(paste0("http://webbook.nist.gov/cgi/cbook.cgi?JCAMP=C", casno2, "&Index=0&Type=Mass"))
if (length(jcampfile) == 0 || jcampfile[[1]] == "##TITLE=Spectrum not found.") jcampfile = NA
casno2 = gsub("-", "", casno)  # switch back to the original compound
# molecular structure
molfile2d = readLines(paste0("http://webbook.nist.gov/cgi/cbook.cgi?Str2File=C", casno2))
# an empty response is character(0); test its length, since `== character(0)` makes if() fail
if (length(molfile2d) == 0) molfile2d = NA
molfile3d = readLines(paste0("http://webbook.nist.gov/cgi/cbook.cgi?Str3File=C", casno2))
if (length(molfile3d) == 0) molfile3d = NA
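The download steps above repeat the same pattern, so they could be factored into two small helpers. This is only a minimal sketch: `nist_url()` and `safe_read()` are hypothetical names made up for illustration, and the URL patterns are simply the ones used in the code above.

```r
# Hypothetical helper: build a WebBook URL for a CAS number.
# The URL patterns are copied from the code above, not from any official spec.
nist_url <- function(casno, type = c("page", "jcamp", "mol2d", "mol3d")) {
  type <- match.arg(type)
  base <- "http://webbook.nist.gov/cgi/cbook.cgi"
  id <- paste0("C", gsub("-", "", casno, fixed = TRUE))  # e.g. "C19431799"
  switch(type,
    page  = paste0(base, "?ID=", casno, "&Units=SI&Mask=2000"),
    jcamp = paste0(base, "?JCAMP=", id, "&Index=0&Type=Mass"),
    mol2d = paste0(base, "?Str2File=", id),
    mol3d = paste0(base, "?Str3File=", id))
}

# Hypothetical helper: read a URL (or file), returning NA instead of
# failing when the download errors out or the response is empty.
safe_read <- function(url) {
  out <- tryCatch(readLines(url, warn = FALSE),
                  error = function(e) character(0))
  if (length(out) == 0) NA_character_ else out
}

# Example: print the URL for the 2D structure file of the compound above
print(nist_url("19431-79-9", "mol2d"))
```

With these, `molfile2d = safe_read(nist_url(casno, "mol2d"))` would replace the two-line read-and-check pattern.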
Next, I would like to extract the following variables and lists from these parts of the raw output:
"name=\" Top \">Caryophylladienol II</a></h1>"
-> name="Caryophylladienol II"
"Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>\n \n \n<li><strong>"
-> formula="C15H24O"
"Molecular weight</a>:</strong> 220.3505</li>\n \n \n<li>"
-> MW=220.3505
"IUPAC Standard InChI:</strong>\n \n<br /><table>\n<tr><td>\n<ul style=\" list-style-type: circle;\">\n<li><tt>InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1</tt></li>\n"
-> InChI="InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1"
"IUPAC Standard InChIKey:</strong>\n<tt>CIIYOYPOMGIECX-JXQTWKCFSA-N</tt>"
-> InChiKey="CIIYOYPOMGIECX-JXQTWKCFSA-N"
"Stereoisomers:....<strong>"
-> stereoisomers=XXX (list of stereoisomers)
"Other names:...\n"
-> synonyms=XXX (list of synonyms)
"Normal alkane RI..."
-> list of measured RIs plus on which column they were measured
e.g. here RIs=c(1637,1631,1627,1656,1615,1638,1628,1602,1611,1635,1622,1622,1627); columns=c("HP-5 MS","DB-5","RTX-1","Col-Elite 5MS","DB-5","DB-5","DB-5","DB-1","DB-5","CP Sil 5 CB","BP-1","RTX-1","DB-5")
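For the single-valued fields (name, formula, MW, InChI, InChIKey), one possible starting point is plain regular expressions in base R. This is a rough sketch, not a robust parser: `first_match()` and `parse_nist_page()` are hypothetical names, and the patterns are written only against the fragments quoted above, so they assume the live page keeps that markup. The stereoisomer, synonym and RI lists are not covered here.

```r
# Hypothetical helper: first capture group of `pattern` in `html`, or NA.
first_match <- function(pattern, html) {
  m <- regmatches(html, regexec(pattern, html, perl = TRUE))[[1]]
  if (length(m) < 2) NA_character_ else m[2]
}

# Hypothetical parser for the scalar fields; `lines` is readLines() output.
# Patterns assume the markup shown in the fragments quoted in the question.
parse_nist_page <- function(lines) {
  html <- paste(lines, collapse = "\n")
  formula_html <- first_match("Formula</a>:</strong>\\s*(.*?)</li>", html)
  list(
    name     = trimws(first_match(">([^<]+)</a>\\s*</h1>", html)),
    formula  = trimws(gsub("<[^>]+>", "", formula_html)),  # drop <sub> tags
    MW       = as.numeric(first_match(
                 "Molecular weight</a>:</strong>\\s*([0-9.]+)", html)),
    InChI    = first_match("<tt>(InChI=[^<]+)</tt>", html),
    InChIKey = first_match("<tt>([A-Z]{14}-[A-Z]{10}-[A-Z])</tt>", html)
  )
}

# Offline check on the fragments quoted in the question
sample_lines <- c(
  'name=" Top ">Caryophylladienol II</a></h1>',
  'Formula</a>:</strong> C<sub>15</sub>H<sub>24</sub>O</li>',
  'Molecular weight</a>:</strong> 220.3505</li>',
  '<li><tt>InChI=1S/C15H24O/c1-10-6-8-14(16)11(2)5-7-13-12(10)9-15(13,3)4/h12-14,16H,1-2,5-9H2,3-4H3/t12?,13?,14-/m1/s1</tt></li>',
  'IUPAC Standard InChIKey:</strong>',
  '<tt>CIIYOYPOMGIECX-JXQTWKCFSA-N</tt>'
)
info <- parse_nist_page(sample_lines)
str(info)
```

Fields whose pattern does not match come back as NA rather than raising an error, which keeps a batch run going when a page lacks a section.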
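Any ideas on how best to do the latter parsing? Ideally, this would all be wrapped into a function that takes a list of CAS numbers as input, annotates them with the information from the NIST WebBook, and writes the result to a text file. It does not need to be that polished, though: anything that gets me started would help!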
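The wrapper described here might look roughly like the sketch below. `annotate_cas()` is a hypothetical name, and the downloading and parsing steps are passed in as functions (`fetch`, `parse`) so that the skeleton can be tried offline; the stubs in the example are made up and do not contact the WebBook.

```r
# Hypothetical wrapper: annotate a vector of CAS numbers.
#   fetch(cas)  -> raw page lines (e.g. a readLines() call on the WebBook URL)
#   parse(page) -> named list with one scalar value per field
# A compound whose download or parsing fails gets NA in every field.
annotate_cas <- function(casnos, fetch, parse,
                         fields = c("name", "formula", "MW"),
                         outfile = NULL) {
  rows <- lapply(casnos, function(cas) {
    info <- tryCatch(parse(fetch(cas)), error = function(e) list())
    vals <- lapply(fields, function(f) {
      v <- info[[f]]
      if (is.null(v) || length(v) != 1) NA else v
    })
    names(vals) <- fields
    data.frame(cas = cas, vals, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  if (!is.null(outfile)) {
    # tab-separated text file, one row per CAS number
    write.table(out, outfile, sep = "\t", quote = FALSE, row.names = FALSE)
  }
  out
}

# Offline example with made-up stubs standing in for download and parsing
stub_fetch <- function(cas) {
  if (cas == "bad") stop("not found")
  cas
}
stub_parse <- function(page) {
  list(name = paste("compound", page), formula = "X", MW = 1.5)
}
res <- annotate_cas(c("19431-79-9", "bad"), stub_fetch, stub_parse)
print(res)
```

In real use, `fetch` would wrap the `readLines()` call on the compound page and `parse` would be whatever extraction routine turns out to work; list-valued fields such as synonyms or RIs would need to be collapsed into one string per compound before they fit this one-row-per-CAS layout.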
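EDIT: I have been trying to parse the HTML file with htmlTreeParse from the XML package, but without much success. Could anyone with more experience of that function help me out?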
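One thing that may help with the XML package: `htmlTreeParse()` returns an R-level tree by default (`useInternalNodes = FALSE`), whereas `htmlParse()` returns an internal document that can be queried with XPath via `xpathSApply()`. The sketch below is only an illustration on a made-up miniature page. It assumes the retention-index table is laid out with a row label in a `<th>` cell ("Active phase", "I") and one measurement per `<td>` column, which should be checked against the real page. `ri_row()` is a hypothetical helper.

```r
# Hypothetical helper: values of the table row whose header cell is `label`.
# Assumes a transposed layout (row label in <th>, one measurement per <td>).
ri_row <- function(doc, label) {
  xp <- sprintf("//tr[th[normalize-space(.)='%s']]/td", label)
  vals <- XML::xpathSApply(doc, xp, XML::xmlValue)
  trimws(as.character(unlist(vals)))
}

# Made-up miniature page for illustration; the real markup may differ.
sample_html <- paste0(
  "<html><body><h1><a name='Top'>Caryophylladienol II</a></h1>",
  "<table>",
  "<tr><th>Active phase</th><td>HP-5 MS</td><td>DB-5</td></tr>",
  "<tr><th>I</th><td>1637.</td><td>1631.</td></tr>",
  "</table></body></html>")

# Only run the example when the XML package is installed
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(sample_html, asText = TRUE)
  name <- XML::xpathSApply(doc, "//h1", XML::xmlValue)
  columns <- ri_row(doc, "Active phase")
  RIs <- as.numeric(ri_row(doc, "I"))
  print(data.frame(column = columns, RI = RIs))
}
```

For the live page, the lines returned by `readLines()` could be collapsed with `paste(raw, collapse = "\n")` and passed to `htmlParse(..., asText = TRUE)` in the same way.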
EDIT: I have since found a solution for importing the data in Mathematica, see https://mathematica.stackexchange.com/questions/37091/look-up-info-associated-with-a-given-cas-chemical-identifier-from-the-nist-webbo. If anyone is able to port that code to R, please let me know!