python - Extract all complete words within a given number of characters

Question

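I want to take a block of text and extract as many whole words as possible from a given number of characters. Which tools/libraries can I use for this?

For example, given this block of text: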

Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? 
It should be smooth sailing from here, with the occasional firmware update being 
your only critical acquisition going forward. D4 firmware 1.02 brings a handful of 
minor fixes, but if you're in need of any of the enhancements listed below, it's 
surely a must have:

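If I assign it to a string and then do string = string[0:100], I get the first 100 characters, but the word "sailing" is cut off to "sailin". I would like the text to be cut off just before or just after the space preceding "sailing".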
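
To make the problem concrete, here is a small sketch that reproduces it. The variable names `text` and `clipped` are illustrative, and the sample is the excerpt above joined with single spaces, which is an assumption about how the original text is laid out.

```python
# Sample text from the question, joined with single spaces (assumed layout).
text = (
    "Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? "
    "It should be smooth sailing from here, with the occasional firmware update."
)

# A plain slice counts characters only, so it can stop in the middle of a word.
clipped = text[0:100]
print(clipped[-13:])  # → smooth sailin
```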

score 3 · Accepted Answer

Use a regular expression:

>>> import re
>>> re.match(r'(.{,100})\W', text).group(1)
"Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? It should be smooth"

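This approach lets you break on any non-word character between words (punctuation as well as spaces). The captured group matches 100 characters or fewer.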

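The pattern above requires a non-word character after the captured text, so on a string shorter than the limit it drops the final word or fails to match at all. To handle short strings, the following regex is better: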

re.match(r'(.{,100})(\W|$)', text).group(1)
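
The regex approach can be wrapped in a small helper, sketched here under a few assumptions. The function name `truncate_at_boundary` and its `limit` parameter are hypothetical and not part of the answer. The `re.DOTALL` flag and the `None` check are additions: without the flag `.` stops at the first newline, and the match fails when the very first word is longer than the limit.

```python
import re

def truncate_at_boundary(text, limit=100):
    """Return the longest prefix of at most `limit` characters that is
    followed by a non-word character or by the end of the text."""
    # Same pattern as in the answer, with the limit filled in.
    pattern = r'(.{,%d})(\W|$)' % limit
    # re.DOTALL (an addition) lets '.' match newlines in multi-line text.
    match = re.match(pattern, text, re.DOTALL)
    # No match means the text starts with a word longer than `limit`.
    return match.group(1) if match else ""

text = (
    "Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? "
    "It should be smooth sailing from here, with the occasional firmware update."
)
print(truncate_at_boundary(text))
# → Have you managed to get your hands on Nikon's elusive D4 full-frame DSLR? It should be smooth
print(truncate_at_boundary("short text"))  # → short text
```

Returning an empty string when nothing fits is one possible design choice; falling back to a hard cut with `text[:limit]` would be another.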

score 1 · Accepted Answer

If you really only want to break the string on whitespace, use this:

my_string = my_string[:100].rsplit(None, 1)[0]

But keep in mind that you may actually want to break on more than just whitespace.
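
One caveat with the `rsplit` one-liner is that it always drops the last whitespace-separated token of the slice, even when that token is a complete word (for example when the text is shorter than 100 characters, or when the cut happens to land exactly at the end of a word). The sketch below is one way to guard against that. The name `truncate_on_whitespace` and the `limit` parameter are hypothetical, and returning an empty string when no whole word fits is an assumption about the desired behaviour.

```python
def truncate_on_whitespace(text, limit=100):
    """Cut `text` to at most `limit` characters without splitting a word,
    treating only whitespace as a word separator."""
    if len(text) <= limit:
        return text  # nothing to cut, keep every word
    head = text[:limit]
    # If the cut falls next to whitespace, the slice already ends on a whole word.
    if head[-1:].isspace() or text[limit].isspace():
        return head.rstrip()
    parts = head.rsplit(None, 1)
    # Two parts: drop the partial last word. Fewer: no whole word fits.
    return parts[0] if len(parts) == 2 else ""

print(truncate_on_whitespace("hello world foo", 11))  # → hello world
print("hello world foo"[:11].rsplit(None, 1)[0])      # → hello
```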

score 0 · Accepted Answer

This cuts the string off at the last space within the first 100 characters, if there is one.

lastSpace = string[:100].rfind(' ')
string = string[:lastSpace] if (lastSpace != -1) else string[:100]
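
A possible refinement of the `rfind` snippet, offered as a sketch and not as the answerer's method: searching one character past the limit lets a space that sits exactly at the cut point count, so a slice that already ends on a whole word is kept intact. The function name `truncate_at_last_space` and the `limit` parameter are hypothetical.

```python
def truncate_at_last_space(text, limit=100):
    """Cut `text` at the last space that keeps the result within `limit` characters."""
    if len(text) <= limit:
        return text  # short input: nothing to cut
    # Look one character past the limit: a space at index `limit` means
    # the first `limit` characters already end on a whole word.
    last_space = text[:limit + 1].rfind(' ')
    # Same fallback as the snippet above: hard cut when there is no space.
    return text[:last_space] if last_space != -1 else text[:limit]

print(truncate_at_last_space("hello world foo", 11))  # → hello world
print(truncate_at_last_space("hello world foo", 8))   # → hello
```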
