java - Regex to remove single quotes while keeping apostrophes

Question

I want to parse words from a text file. Apostrophes should be kept, but single quotes should be removed. Here is some test data:

john's apostrophe is a 'challenge'

I am experimenting with grep as follows:

grep -o "[a-z'A-Z]*" file.txt

It produces:

john's
apostrophe
is
a
'challenge'

I need to get rid of the quotes around the word challenge.

The correct/desired output should be:

john's
apostrophe
is
a
challenge

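Edit: Since the consensus seems to be that apostrophes are hard to identify, I am now looking for a way to strip every kind of apostrophe (leading, trailing, embedded) from all words. The words will be added to a lexical index. Phrase searches should have their apostrophes stripped as well. This may need a separate question.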

score 4 · Accepted Answer

Do you have to use grep? Here is a sed example, just in case:

$ echo "john's apostrophe is a 'challenge'" | sed -re "s/'(\S*)'/\1/g"
john's apostrophe is a challenge

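sed is a stream editor; I use it here to perform a substitution (the format is s/pattern/subst/, and g stands for global). I match any number (*) of non-whitespace characters (\S) between two single quotes and replace the whole match with that same group of characters, referred to as \1 (it is captured by the parentheses (...)).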
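Since the question is tagged java, the same substitution can be sketched with String.replaceAll. This is only an illustration of the sed expression above, not part of the original answer; the class and method names are made up for the example.

```java
public class StripQuotes {
    // Mirrors the sed expression s/'(\S*)'/\1/g: a quote, a run of
    // non-whitespace characters, and a closing quote are replaced by
    // the captured run alone.
    static String stripQuotes(String text) {
        return text.replaceAll("'(\\S*)'", "$1");
    }

    public static void main(String[] args) {
        // prints: john's apostrophe is a challenge
        System.out.println(stripQuotes("john's apostrophe is a 'challenge'"));
    }
}
```

The apostrophe in john's survives only because no second quote follows it before the next whitespace; a line containing two apostrophes inside a single whitespace-free token would be treated as a quoted pair.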

Edit: OK, here is an ugly Perl-like grep example:

$ echo "john's apostrophe is a 'challenge'" | grep -oP "(?<=')\S*(?=')|\w+'?\w*"
john's
apostrophe
is
a
challenge

I don't really know what I'm doing, so unexpected behavior is quite likely :)

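With grep I use positive lookaround assertions to match a word inside single quotes (assertions are used so that the quotes are not part of the match), or (|) a word with an optional apostrophe, expressed as one or more word characters (\w+) followed by a ' (or not), and then optionally some more word characters.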
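Java's regex engine supports the same lookaround syntax, so the pattern can be tried there unchanged. A minimal sketch, assuming a hypothetical helper named words that collects every non-empty match (grep -o likewise prints only non-empty matches):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundWords {
    // Same pattern as the grep -P example: either text enclosed in quotes
    // (the quotes are asserted, not consumed), or a word with at most one
    // embedded or trailing apostrophe.
    private static final Pattern WORD =
            Pattern.compile("(?<=')\\S*(?=')|\\w+'?\\w*");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Skip empty matches, which can occur between two adjacent quotes.
            if (!m.group().isEmpty()) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [john's, apostrophe, is, a, challenge]
        System.out.println(words("john's apostrophe is a 'challenge'"));
    }
}
```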

More edits: here is a sed command that seems to do the job and also handles @tchrist's example:

$ echo "john's apostrophe is a 'challenge'" | sed -re "s/(\W|^)'(\w*)'(\W|$)/\1\2\3/g"
john's apostrophe is a challenge
$ echo "’Tis especially hard, ’tisn’t it now, to leave it for the dogs’ breakfast, let a lone for the cats'" | sed -re "s/(\W|^)'(\w*)'(\W|$)/\1\2\3/g"
’Tis especially hard, ’tisn’t it now, to leave it for the dogs’ breakfast, let a lone for the cats'
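The boundary-aware substitution above can also be approximated in Java. The sketch below is an assumption-laden translation rather than the answerer's code: the class and method names are hypothetical, and Java's \w and \W are ASCII-only by default, which may differ from sed's locale-dependent behavior.

```java
public class QuotedWordStripper {
    // A quoted word must be preceded by a non-word character or the start
    // of the input, and followed by a non-word character or the end.
    // The delimiters are captured and written back; only the quotes go.
    static String unquoteWords(String text) {
        return text.replaceAll("(\\W|^)'(\\w*)'(\\W|$)", "$1$2$3");
    }

    public static void main(String[] args) {
        // prints: john's apostrophe is a challenge
        System.out.println(unquoteWords("john's apostrophe is a 'challenge'"));
        // An unpaired trailing quote is left alone.
        // prints: let alone for the cats'
        System.out.println(unquoteWords("let alone for the cats'"));
    }
}
```

Because each match consumes its trailing delimiter, two quoted words separated by a single space are not both unquoted in one pass; only the first one is.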

score 4 · Accepted Answer

Here is a simpler grep approach:

grep -E -o "[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?" file.txt

In Java, that is:

Pattern.compile("[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?")

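(Both mean "an ASCII letter, optionally followed by a mix of ASCII letters and/or apostrophes and then an ASCII letter". The idea is that the matched substring must start with a letter and end with a letter, but may contain apostrophes if it is more than two characters long.)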

To accept non-ASCII letters as well, the Java version can be written as:

Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?")
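To show how the compiled pattern might be applied, here is a minimal sketch that collects every match with Matcher.find(). The class name LetterWords and the method words are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LetterWords {
    // A letter, optionally followed by letters/apostrophes that end in a
    // letter, so leading and trailing quotes can never be part of a match.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?");

    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [john's, apostrophe, is, a, challenge]
        System.out.println(words("john's apostrophe is a 'challenge'"));
    }
}
```

Note that this pattern only knows about the straight apostrophe ('); a typographic apostrophe (’) would split a word unless it is added to the character class.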

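Edit for the updated question (stripping apostrophes): I don't think you can do that with grep alone; but if we expand our repertoire a bit, you can write: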

tr -d "'" < file.txt | grep -E -o "[a-zA-Z]+"

Or in Java:

String apostrippedStr = str.replace("'", "");

Pattern wordPattern = Pattern.compile("[a-zA-Z]+"); // or "\\p{L}+" for non-ASCII support
// ... apply wordPattern to apostrippedStr
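Filling in the "apply pattern" step, a complete version might look like the sketch below. It assumes that only the straight apostrophe (') needs stripping and uses the hypothetical names ApostropheFreeWords and words; other apostrophe characters such as ’ would have to be removed separately.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ApostropheFreeWords {
    // Runs of letters; use "[a-zA-Z]+" instead to restrict to ASCII.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static List<String> words(String str) {
        // Step 1: delete every straight apostrophe (leading, trailing, embedded).
        String apostrippedStr = str.replace("'", "");
        // Step 2: collect the remaining runs of letters.
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(apostrippedStr);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints: [johns, apostrophe, is, a, challenge]
        System.out.println(words("john's apostrophe is a 'challenge'"));
    }
}
```

Applying the same words method to a search phrase before looking it up keeps index terms and query terms consistent.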
