regex - Trimming a file using regular expressions / sed

Question

I have a file containing several lines like this:

*wordX*-Sentence1.;Sentence2.;Sentence3.;Sentence4.

Any of these sentences may or may not contain wordX. I want to trim the file so that each line looks like this:

*wordX*-Sentence1.;Sentence2.

where Sentence3 is the first sentence that contains wordX.

How can I do this with sed/awk?

Edit:

Here is a sample file:

*WordA*-This sentence does not contain what i want.%Neither does this one.;Not here either.;Not here.;Here is WordA.;But not here.
*WordB*-WordA here.;WordB here, time to delete everything.;Including this sentece.
*WordC*-WordA, WordB. %Sample sentence one.;Sample Sentence 2.;Sample sentence 3.;Sample sentence 4.;WordC.;Discard this.

Here is the desired output:

*WordA*-This sentence does not contain what i want.%Neither does this one.;Not here either.;Not here.
*WordB*-WordA here.
*WordC*-WordA, WordB. %Sample sentence one.;Sample Sentence 2.;Sample sentence 3.;Sample sentence 4.

score 1 · Accepted Answer

This task is better suited to awk. Use the following awk command:

awk -F ";" '/^ *\*.*\*/ {printf("%s;%s\n", $1, $2)}' inFile

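This assumes the word you are trying to match is always enclosed in asterisks (*). Note that it prints only the first two ;-separated fields of each matching line, which fits the first example in the question but not the sample file added in the edit.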
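
To cut at the first sentence that contains the header word, as the sample file in the question requires, one could loop over the ;-separated sentences instead. What follows is only a minimal sketch: the function name trim_at_word is made up for illustration, and it assumes every line has the layout *word*-Sentence1.;Sentence2.;... with a "-" directly after the closing asterisk. Lines that do not start with an asterisk are printed unchanged.

```shell
# Hypothetical helper (name is illustrative, not from the answer above).
# Assumes the "*word*-s1.;s2.;..." layout shown in the question.
trim_at_word() {
  awk '{
    # Pass through lines that do not start with "*".
    if (substr($0, 1, 1) != "*") { print; next }
    rest = substr($0, 2)
    j = index(rest, "*")               # closing asterisk, relative to rest
    if (j < 2) { print; next }         # no closing asterisk or empty word
    word = substr(rest, 1, j - 1)      # text between the two asterisks
    head = substr($0, 1, j + 2)        # "*word*-"
    n = split(substr($0, j + 3), s, ";")
    out = ""
    for (k = 1; k <= n; k++) {
      if (index(s[k], word) > 0) break # first sentence with the word: stop
      sep = (k > 1) ? ";" : ""
      out = out sep s[k]
    }
    print head out
  }' "$@"
}
```

Called as trim_at_word inFile (or with the data on standard input), this keeps the header and every sentence before the first one that contains the word. It uses index() for a plain substring test, so the word is not interpreted as a regular expression.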

score 0 · Accepted Answer

sed -r -e 's/\.;/\n/g' \
       -e 's/-/\n/' \
       -e 's/^(\*([^*]*).*\n)[^\n]*\2.*/\1/' \
       -e 's/\n/-/' \
       -e 's/\n/.;/g' \
       -e 's/;$//'

(Edit: added the swap between - and \n to handle a match in the first sentence.)
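
Since the command above comes without a walkthrough, here is the same sequence of substitutions with each step annotated. This is a sketch, not a different solution: the wrapper name trim_sed_newline is hypothetical, GNU sed is assumed (for -r and for \n inside a bracket expression), and sentences are assumed to end in "." before each ";".

```shell
# Hypothetical wrapper around the six substitutions above (GNU sed assumed).
trim_sed_newline() {
  sed -r '
    # 1. Replace every ".;" sentence separator with a newline.
    s/\.;/\n/g
    # 2. Replace the "-" after the header with a newline as well, so that
    #    every sentence, including the first, is preceded by a newline.
    s/-/\n/
    # 3. \2 is the header word. Keep everything up to the newline in front of
    #    a sentence that contains it, and drop that sentence and the rest.
    s/^(\*([^*]*).*\n)[^\n]*\2.*/\1/
    # 4. Put the "-" back (the first newline) ...
    s/\n/-/
    # 5. ... and the ".;" separators (the remaining newlines).
    s/\n/.;/g
    # 6. A cut leaves a trailing ";" behind; remove it.
    s/;$//
  ' "$@"
}
```

One thing to watch, as far as I can tell: the .* in step 3 is greedy, so when the word appears in several sentences the cut happens at the last of them rather than the first. The sample file has only one match per line, so it does not show up there.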

score 0 · Accepted Answer

This might work for you (GNU sed):

sed -r 's/-/;/;:a;s/^(\*([^*]+)\*.*);[^;]*\2.*/\1;/;ta;s/;/-/;s/;$//' file

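Convert the - following wordX into a ;. Delete the sentences containing wordX (working from back to front). Restore the original -. Delete the trailing ;.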
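
The back-to-front loop is easier to follow when the one-liner is spread over several lines. The sketch below is an illustration under a few assumptions: the wrapper name trim_sed_loop is made up, GNU sed is assumed, and it uses [^;]* in front of the back-reference so that a sentence beginning with the word (such as "WordB here, ..." in the sample file) is matched too.

```shell
# Hypothetical wrapper showing the loop step by step (GNU sed assumed).
trim_sed_loop() {
  sed -r '
    # Turn the "-" after the header into ";" so that every sentence,
    # including the first, is preceded by ";".
    s/-/;/
    :a
    # \2 is the header word. Cut from the last sentence that contains it
    # to the end of the line, leaving a single ";" as a marker.
    s/^(\*([^*]+)\*.*);[^;]*\2.*/\1;/
    # If something was cut, try again: an earlier sentence may contain
    # the word too.
    ta
    # Restore the "-" (now the first ";") and drop the trailing ";".
    s/;/-/
    s/;$//
  ' "$@"
}
```

Because each pass removes the last matching sentence and everything after it, repeating until nothing matches ends up cutting at the first sentence that contains the word, even when several sentences do.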
