regex - Trimming spaces inside angle brackets with sed

Question

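I actually solved this while writing up the question, but I suspect it could be done more neatly than the way I did it.

I want to strip whitespace and most punctuation, keeping only URL-legal characters, from the text inside <>s (rdf/n3 entities).

An example of the source text: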
<this is a problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .

In the output, spaces need to become underscores, and quotes and anything else that is not legal in a url/iri need to be stripped.

<http://This is a "problem">=><http://This_is_a_problem>

Neither of these worked:
sed -e 's/\(<[^ ]*\) \(.*>\)/\1_\2/g' badDoc.n3 | head
sed '/</,/>/{s/ /_/g}' badDoc.n3 | head
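
For readers wondering why those two attempts fall short, here is a small reproduction on a made-up one-line input (the sample line is an assumption, not taken from badDoc.n3). It is only a sketch and assumes GNU sed, since the second command closes its block with `}` directly after the `s` command.

```shell
line='<a b c> "x y"'

# Attempt 1: after the first space, the greedy .*> runs to the last ">" on
# the line, so /g has nothing left to match and only one space is replaced.
printf '%s\n' "$line" | sed -e 's/\(<[^ ]*\) \(.*>\)/\1_\2/g'
# prints: <a_b c> "x y"

# Attempt 2: /</,/>/ is an address range over whole lines, not a span inside
# a line, so every space on the selected lines is replaced, literals included.
printf '%s\n' "$line" | sed '/</,/>/{s/ /_/g}'
# prints: <a_b_c>_"x_y"
```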

My final solution, which seems to work, is:
sed -e ':a;s/\(<[^> ]*\) \(.*>\)/\1_\2/g;ta' badDoc.n3 | sed -e ':b;s/\(<[:/%_a-zA-Z0-9.\-]*\)[^><:/%_a-zA-Z0-9.\-]\(.*>\)/\1\2/g;tb' > goodDoc.n3
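
The `:a` ... `ta` pair is sed's loop idiom: `:a` defines a label, and `ta` jumps back to it whenever the preceding `s` made a substitution, so each pass fixes one more character until nothing matches. A minimal sketch of the same two passes in a single sed process might look like the following. The function name `fix_iris` is hypothetical. The separate `-e` fragments are there because some non-GNU seds read everything after `:` up to the end of the line as the label name. `|` is used as the delimiter so the slash in the character set needs no special treatment, and the backslash is dropped from the set on the assumption that it was only meant to escape the hyphen.

```shell
# fix_iris: hypothetical name; reads N3 text on stdin, writes cleaned text on stdout.
fix_iris() {
  # Loop a: turn one space inside a <...> group into "_" per pass.
  # Loop b: delete one character outside the allowed set per pass.
  sed -e ':a' -e 's|\(<[^> ]*\) \(.*>\)|\1_\2|' -e 'ta' \
      -e ':b' -e 's|\(<[a-zA-Z0-9:/%_.-]*\)[^><a-zA-Z0-9:/%_.-]\(.*>\)|\1\2|' -e 'tb'
}

printf '%s\n' '<http://This is a "problem">' | fix_iris
# prints: <http://This_is_a_problem>
```

It would be used as `fix_iris < badDoc.n3 > goodDoc.n3`.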

Is there a better way?

score 1 · Accepted Answer

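First of all, I'd say this is an interesting question. It looks like a simple substitution problem, but once you dig into it, it is not as easy as I thought. While looking for a solution, I really missed vim!!! ... :)

I don't know whether sed is a must for this problem. I would do it with awk: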

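awk '{t=$0;
        # collect every <...> group on the line; the 3-argument match() needs GNU awk
        while (match(t,/<[^>]*>/,a)>0){
                m[++i]=a[0];n[i]=a[0];t=substr(t,RSTART+RLENGTH)
        }
        for(x in n){
                gsub(/[\x22\x27]/,"",n[x])      # strip double and single quotes
                gsub(/ /,"_",n[x])              # spaces to underscores
                sub(m[x],n[x])                  # swap the cleaned group back into $0
        }}1' file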

Testing it with your example:

kent$  cat file
<this is a problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContains"Quotes'ThatWillBreakThings> "This should be 'left alone'." .

kent$  awk '{t=$0;
        while (match(t,/<[^>]*>/,a)>0){
                m[++i]=a[0];n[i]=a[0];t=substr(t,RSTART+RLENGTH)
        }
        for(x in n){
                gsub(/[\x22\x27]/,"",n[x])
                gsub(/ /,"_",n[x])
                sub(m[x],n[x])
        }}1' file
<this_is_a_problem> <this_is_fine> "this is ok too" . <http://WeDontNeedToTouchThis.> <http:ThisContainsQuotesThatWillBreakThings> "This should be 'left alone'." .

Well, it is not really a one-liner; let's see whether anyone has a shorter solution.
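
A few things in the awk version above are worth keeping in mind: it relies on GNU awk's three-argument `match()`, the `m` and `n` arrays are never cleared between input lines, and `sub(m[x], ...)` treats the original group as a regex, so characters such as `.` or `?` inside a URL are interpreted as metacharacters. One possible variant that avoids these is sketched below. It rebuilds each line piece by piece using only POSIX awk features. The function name `clean_iris` is hypothetical, and the allowed character set is an assumption modelled on the one in the question; real IRIs permit more characters (for example `#` and `?`), so widen it as needed. It also treats any `<...>` as an entity, including one that happens to sit inside a quoted literal.

```shell
# clean_iris: hypothetical name; reads N3 text on stdin, writes cleaned text on stdout.
clean_iris() {
  awk '{
    out = ""; rest = $0
    # Walk the line one <...> group at a time.
    while (match(rest, /<[^>]*>/)) {
      iri = substr(rest, RSTART, RLENGTH)
      gsub(/ /, "_", iri)                       # spaces become underscores
      gsub("[^<>a-zA-Z0-9:/%_.-]", "", iri)     # drop anything outside the assumed set
      out = out substr(rest, 1, RSTART - 1) iri # text before the group stays untouched
      rest = substr(rest, RSTART + RLENGTH)
    }
    print out rest
  }'
}

printf '%s\n' '<http://This is a "problem">' | clean_iris
# prints: <http://This_is_a_problem>
```

It would be used as `clean_iris < badDoc.n3 > goodDoc.n3`.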
