regex - Extracting a column of numbers from a website that contains other numbers

Question

Biochemists and bioinformaticians use this website (http://dgpred.cbr.su.se/index.php?p=TMpred). After you enter a protein sequence, you get output like this:

http://dgpred.cbr.su.se/analyze.php?with_length=on&seq=RGFTPLQWECVMASDFGHH

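There is some data at the top and bottom, and in the middle are 4 columns, of which the 4th column is the data we want. I want to take the numbers from the 4th column (for many protein sequences) and put them into Excel.

My current workflow (Mac OS X) is to copy everything into a rich text document in TextEdit, alt+drag over the numbers (so that only the numbers in the 4th column are selected), and then run my AppleScript: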

do shell script "pbpaste | sed 's/[^0-9.-]//g' | pbcopy"
do shell script "pbpaste | sed '/^$/d' | pbcopy"

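I'm only a beginner with regular expressions, but this successfully leaves me with a nice newline-separated list of numbers that can be pasted into Excel.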
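For readers who want to see what the two sed calls do without the clipboard, here is a minimal Python sketch of the same two steps. The function name clean_selection and the sample string are made up for illustration; note that the character class also drops the leading "+" sign, just as the sed version does.

```python
import re

def clean_selection(text):
    # Step 1, like sed 's/[^0-9.-]//g': on every line, delete each character
    # that is not a digit, a dot, or a minus sign.
    stripped = [re.sub(r"[^0-9.-]", "", line) for line in text.splitlines()]
    # Step 2, like sed '/^$/d': drop lines that are now empty.
    return "\n".join(line for line in stripped if line)

# Hypothetical selection copied from TextEdit.
print(clean_selection("   +0.03\n\n-0.26   \t\n"))
```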

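What would be really sweet is to drop the TextEdit step and have the regex grab the numbers directly from the website. That is beyond my level, though. Can anyone help me with this, i.e. selecting only the numbers from the 4th column?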

score 0 · Accepted Answer

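I've noticed that browsers copy tables differently. When I want to copy tabular data from a web page, I tend to try IE, Chrome, or Opera, because (at least on Windows) I can simply paste the copied table straight into Excel with all columns preserved. Firefox, on the other hand, tends to make a mess of the table.

Copying the table in question with Opera and pasting it into Excel starting at cell A1, I get all the green numbers in column F and the red numbers in column H. I then type the following formula into a column to the right, in row 1, and drag the cell's lower corner down to copy it to the subsequent rows: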

=IF(AND(ISBLANK(F1), ISBLANK(H1)), "", IF(ISBLANK(F1), H1, F1))

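Now this new column shows the data. I can paste a new table over the original data and the formulas on the right recalculate. (With other browsers the actual columns may differ.)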
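The logic of that formula can also be written out as a small Python sketch, in case it helps to see the three cases side by side. The names pick_value, green and red are illustrative only; empty strings stand in for blank cells.

```python
def pick_value(green, red):
    # Mirrors =IF(AND(ISBLANK(F1), ISBLANK(H1)), "", IF(ISBLANK(F1), H1, F1))
    if green == "" and red == "":
        return ""            # both cells blank: leave the result blank
    return red if green == "" else green

# Hypothetical (column F, column H) pairs as they might arrive from the paste.
rows = [("", "+0.03"), ("-0.26", ""), ("", "")]
print([pick_value(g, r) for g, r in rows])
```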

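I admit this isn't a fully automated solution, but I've found this approach quick and useful in many situations, and I thought it was worth sharing. The motto: if at first the browser of your choice doesn't do the right thing, try another!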

score 0 · Accepted Answer

When I copy this data, I get this result:

R   1   -9.00           
       +0.03
G   2   -8.00           
       +0.36
F   3   -7.00       
-0.26

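Every odd line has 3 columns and starts with a letter, [A-Z]; the data you want is on the line that follows.

The numbers you want come in two forms: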

^\t {3}([-+][0-9]+\.[0-9]{2})$  //for the red numbers

and:

^([-+][0-9]+\.[0-9]{2}) {3}\t$   //the green numbers

You can extract both types like this:

^(\t {3})?([-+][0-9]+\.[0-9]{2})( {3}\t)?$

The second capture group, ([-+][0-9]+\.[0-9]{2}), is what you're after:

s/^(\t {3})?([-+][0-9]+\.[0-9]{2})( {3}\t)?$/$2/g
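As a rough illustration, the same combined pattern can be tried out in Python. This is only a sketch: the SAMPLE string is an assumption about how the copied text is laid out (a tab plus three spaces before the red values, three spaces plus a tab after the green ones), and extract_column4 is a made-up helper name.

```python
import re

# Assumed shape of the copied text; the real whitespace may differ per browser.
SAMPLE = (
    "R\t1\t-9.00\t\n"
    "\t   +0.03\n"
    "G\t2\t-8.00\t\n"
    "\t   +0.36\n"
    "F\t3\t-7.00\t\n"
    "-0.26   \t\n"
)

# Same pattern as above; re.MULTILINE makes ^ and $ match at every line.
VALUE_LINE = re.compile(
    r"^(\t {3})?([-+][0-9]+\.[0-9]{2})( {3}\t)?$", re.MULTILINE
)

def extract_column4(text):
    # Group 2 is the signed number; groups 1 and 3 are the optional padding.
    return [m.group(2) for m in VALUE_LINE.finditer(text)]

print("\n".join(extract_column4(SAMPLE)))
```

Lines that start with a letter never match, because the pattern requires a sign (optionally preceded by the padding) right at the start of the line.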

Instead of AppleScript, consider BBEdit or TextWrangler; you may find them easier to use.

Put this in the search field:

\r[A-Z].*\r(\t {3})?([-+][0-9]+\.[0-9]{2})( {3}\t)?$

And this in the replace field:

\r\2

Choose "Replace All".

How this works

 \r        // carriage return
 [A-Z]     // any character from A to Z (the lines you DON'T want all start with a letter)
 .         // any character
 *         // any number of times
 \r        // carriage return
           // that deals with the lines you DON'T want to keep
 (         // grouping
 \t        // tab character
  {3}      // space character repeated 3 times
 )         // close grouping
 ?         // zero or one occurrences of the previous grouping
 (         // grouping (this is the bit you are after)
 [-+]      // character class - one of any of the [enclosed characters]
 [0-9]     // one of any of 0-9
 +         // repeated one or more times
 \.        // full stop (escaped as it has special meaning in regex)
 [0-9]{2}  // exactly two occurrences of any of 0-9
 )         // close capture parens (end of the group you are after)
 ( {3}\t)? // 3 spaces followed by a tab, occurring 0 or 1 time
 $         // end of line (in BBEdit/TextWrangler you often use \r)

An important detail in BBEdit/TextWrangler: captured groups are referred to as \1, \2, \3, not $1, $2, $3...
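Python's re module happens to use the same \2 style in replacement strings, so the two-line search and replace can be sketched there as well. This is an adaptation rather than the exact BBEdit pattern: it assumes "\n" line endings and uses ^ with re.MULTILINE in place of the leading \r, and the names PAIR, collapse_pairs and copied are invented for the example.

```python
import re

# A letter-led line, a line break, then the value line with optional padding.
PAIR = re.compile(
    r"^[A-Z].*\n(\t {3})?([-+][0-9]+\.[0-9]{2})( {3}\t)?$",
    re.MULTILINE,
)

def collapse_pairs(text):
    # Replace each two-line pair with just the captured number (group 2).
    return PAIR.sub(r"\2", text)

# Hypothetical copied text with one red-style and one green-style value.
copied = "R\t1\t-9.00\t\n\t   +0.03\nF\t3\t-7.00\t\n-0.26   \t\n"
print(collapse_pairs(copied))
```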
