html - Regex to extract all starred item URLs from a Google Reader JSON file

Question

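Unfortunately, Google has announced that Google Reader will be shut down in the middle of this year. Since I have a large number of starred items in Google Reader, I would like to back them up. This is possible via Google Takeout for Reader, which produces a file in JSON format.

Now I would like to extract all of the article URLs from this file, which is several MB in size.

At first I thought it would be best to use a general regex for URLs, but it seems better to use a regex that matches only the article URLs I need. That prevents other, unwanted URLs from being extracted.

Here is a short example of what the relevant parts of the JSON file look like: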

"published" : 1359723602,
"updated" : 1359723602,
"canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],
"alternate" : [ {
  "href" : "http://feeds.arstechnica.com/~r/arstechnica/everything/~3/EphJmT-xTN4/",
  "type" : "text/html"
} ],

I only need the URLs that appear here:

 "canonical" : [ {
  "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
} ],

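Perhaps someone is in the mood to tell me what a regex would have to look like to extract all of these URLs?

The benefit would be a quick and dirty way to extract the starred item URLs from Google Reader, so that after processing they can be imported into services such as Pocket or Evernote.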

score 3 · Accepted Answer

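I know you asked about regex, but I think there is a better way to handle this problem. Multi-line regular expressions are a PITA, and in this case there is no need for that kind of brain damage.

I would start with grep rather than a regex. The -A1 parameter says "return the line that matches, and one line after it":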

grep -A1 "canonical" <file>

This will return lines like the following:

"canonical" : [ {
    "href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
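To make the effect of -A1 concrete, here is a minimal sketch that runs the same grep against a small sample in the layout shown in the question. The temporary file and the example.com URLs are made up for illustration, and -A is an option of GNU and BSD grep rather than something every grep is guaranteed to have.

```shell
# Build a small sample in the same layout as the Takeout file.
# The file contents and URLs here are invented for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
"canonical" : [ {
  "href" : "http://example.com/first/"
} ],
"alternate" : [ {
  "href" : "http://feeds.example.com/first/",
  "type" : "text/html"
} ],
"canonical" : [ {
  "href" : "http://example.com/second/"
} ],
EOF

# -A1 prints each matching line plus one line of trailing context,
# so only the href directly under "canonical" comes along.
context=$(grep -A1 '"canonical"' "$sample")
printf '%s\n' "$context"

rm -f "$sample"
```

The "alternate" href is not printed, because it does not follow a line containing "canonical". GNU grep also prints a `--` line between separate groups of matches; the next grep step discards those lines.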

Then I would grep again to keep only the href lines:

grep -A1 "canonical" <file> | grep "href"

giving:

"href" : "http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"

Now I can use awk to get just the URL:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }'

which strips everything up to and including the first quote of the URL:

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/"
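To see why the trailing quote is still there, it may help to print both fields that awk produces when the separator is the literal string `" : "`. This is only a sketch: the sample line and the variable names `line` and `second_field` are assumptions made for the demonstration.

```shell
# A single href line in the same shape as the grep output (URL is invented).
line='  "href" : "http://example.com/article/"'

# Show how the separator '" : "' splits the line into two fields.
printf '%s\n' "$line" |
    awk -F'" : "' '{ printf "field 1: [%s]\nfield 2: [%s]\n", $1, $2 }'

# Keep the second field, as the pipeline in the answer does.
second_field=$(printf '%s\n' "$line" | awk -F'" : "' '{ print $2 }')
printf '%s\n' "$second_field"
```

The separator consumes the quote that opens the URL, but the closing quote is part of the second field, which is why one more cleanup step is needed.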

Now I just need to get rid of the extra quote:

grep -A1 "canonical" <file> | grep "href" | awk -F'" : "' '{ print $2 }' | tr -d '"'

That's it!

http://arstechnica.com/apple/2013/02/omni-group-unveils-omnifocus-2-omniplan-omnioutliner-4-for-mac/
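For a file with many starred items, the pipeline above could be wrapped in a small function and the result written to a file for later import. The following is a sketch under stated assumptions: the function name `extract_canonical_urls` and the file names `starred.json` and `urls.txt` are hypothetical, and the `sort -u` step for dropping duplicate URLs is an addition that is not part of the original pipeline.

```shell
# extract_canonical_urls FILE
# Hypothetical helper: prints one canonical URL per line, duplicates removed.
extract_canonical_urls() {
    grep -A1 '"canonical"' "$1" |        # "canonical" line plus the line after it
        grep '"href"' |                  # keep only the href lines
        awk -F'" : "' '{ print $2 }' |   # keep the text after the separator
        tr -d '"' |                      # drop the trailing quote
        sort -u                          # added step: sort and de-duplicate
}

# Usage (file names are assumptions):
#   extract_canonical_urls starred.json > urls.txt
```

This depends on the pretty-printed layout shown in the question, where the href sits on the line directly after "canonical". If the export were formatted differently, a real JSON parser would be the safer choice.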
