python - How can I optimize matching a substring of length n, but only on whole words?

Question

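We need to display some "preview text" of at most "n" characters taken from a larger string. Unfortunately, I couldn't find an existing module on PyPI that handles this.

I'd like to come up with a proper solution. My quick-and-dirty solution below works, but it isn't very efficient: it performs a lot of repeated joins and comparisons. Does anyone know how to improve it? I tried a regex but gave up after 20 minutes.

The clumsy solution I came up with is good enough for most needs; I just know this can be done faster and more concisely, and I'd love to know how.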

sample = "This is a sample string and I would like to break it down by words, without exceeding max_chars."

def clean_cut( text , max_chars ):
    rval = []
    words = text.split(' ')
    for word in words:
        len_rval = len(' '.join(rval))
        if len_rval + 1 + len(word) > max_chars :
            break
        rval.append(word)
    return ' '.join(rval)

for i in (15, 16, 17, 30, 35):
    cutdown = clean_cut(sample, i)
    print("%s | %s" % (i, cutdown))

And the output is correct...

15 | This is a
16 | This is a sample
17 | This is a sample
30 | This is a sample string and I
35 | This is a sample string and I would

score 3 · Accepted Answer

The following implementation may be useful to you:

def clean_cut(st, end):
    st += ' ' #In case end > len(st)
    return st[:st[:end + 1].rfind(' ')]

for i in (15, 16, 17, 30, 35):
    cutdown = clean_cut(sample, i)
    print("%s | %s" % (i, cutdown))

Output

15 | This is a
16 | This is a sample
17 | This is a sample
30 | This is a sample string and I
35 | This is a sample string and I would
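One edge case worth keeping in mind: when the first `end + 1` characters contain no space at all, `rfind` returns -1, and the slice `st[:-1]` then yields the entire original string rather than a short preview. The sketch below is one possible guard, not part of the original answer. The name `clean_cut_safe` is hypothetical, and returning an empty string when no whole word fits is an assumption about the desired behavior.

```python
sample = ("This is a sample string and I would like to break it down "
          "by words, without exceeding max_chars.")

def clean_cut_safe(text, max_chars):
    """Cut text at the last space at or before index max_chars."""
    # Nothing to cut if the whole string already fits.
    if len(text) <= max_chars:
        return text
    # Search only the first max_chars + 1 characters, without copying them.
    cut = text.rfind(' ', 0, max_chars + 1)
    if cut == -1:
        # Assumption: if no whole word fits, return an empty preview.
        return ''
    return text[:cut]

for i in (15, 16, 17, 30, 35):
    print("%s | %s" % (i, clean_cut_safe(sample, i)))
```

Passing the bounds to `rfind` directly avoids building the intermediate slice, although for short previews the difference is likely negligible.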

Note

In the timing below, this implementation is roughly 50 times faster than the textwrap approach:

>>> stmt_ab = """
for i in ( 15, 16, 17,30,35):
    cutdown = sample[:sample[:i + 1].rfind(' ')]
"""
>>> stmt_mg = """
for i in ( 15, 16, 17,30,35):
    cutdown =  textwrap.wrap(sample[:i+1],i)[0]
"""
>>> import timeit
>>> t1_ab = timeit.Timer(stmt=stmt_ab, setup = "from __main__ import sample")
>>> t1_mg = timeit.Timer(stmt=stmt_mg, setup = "from __main__ import sample, textwrap")
>>> t1_ab.timeit(10000)
0.10367805429780219
>>> t1_mg.timeit(10000)
5.597085870104877
>>>

score 1 · Accepted Answer

import re

def substring_match(length, string):
    return re.search('(.{1,%d}) ' % length, string).group(0).strip()

This should work; it did in my trivial tests.

score 1 · Accepted Answer

You can use textwrap:

textwrap.wrap(yourstring[:length+1],length)[0]

Slicing the string isn't strictly necessary, but it may make the whole thing more efficient...

>>> textwrap.wrap(sample[:15+1],15)[0]
'This is a'
>>> textwrap.wrap(sample[:16+1],16)[0]
'This is a sample'
>>> textwrap.wrap(sample[:17+1],17)[0]
'This is a sample'
>>> textwrap.wrap(sample[:30+1],30)[0]
'This is a sample string and I'
>>> textwrap.wrap(sample[:35+1],35)[0]
'This is a sample string and I would'
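Note that `textwrap.wrap` returns an empty list for an empty string, so indexing the result with `[0]` raises `IndexError` in that case. A minimal sketch of a guarded wrapper follows. The helper name `preview` is hypothetical, and falling back to an empty string is an assumption.

```python
import textwrap

sample = ("This is a sample string and I would like to break it down "
          "by words, without exceeding max_chars.")

def preview(text, length):
    """Return the first wrapped line of text, or '' if there is none."""
    # Slice first so textwrap only has to process the relevant prefix.
    lines = textwrap.wrap(text[:length + 1], length)
    # wrap() yields [] for an empty string; fall back to '' (assumption).
    return lines[0] if lines else ''

for i in (15, 16, 17, 30, 35):
    print("%s | %s" % (i, preview(sample, i)))
```

With the default settings, `textwrap` breaks a word longer than the width, so a string that starts with one very long word still produces a preview of `length` characters.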

score 1 · Accepted Answer

There are good library functions that do this job for you, such as textwrap, as @mgilson's answer points out.

I'll add a regex answer just for fun:

^.{0,n}(?<=\S)(?!\S)

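Replace n with the limit and use this regex to search for the first match (there is at most one match). I treat any non-space character as part of a word. The positive look-behind ensures that the last character of the match is not a space, and the negative look-ahead ensures that the character following the match is either a whitespace character or the end of the string.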
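As a rough illustration of how the pattern above might be used from Python, the sketch below substitutes the limit for n with string formatting. The function name `regex_cut` is hypothetical, and returning an empty string when nothing matches is an assumption.

```python
import re

sample = ("This is a sample string and I would like to break it down "
          "by words, without exceeding max_chars.")

def regex_cut(text, n):
    """Longest prefix of at most n characters that ends on a word boundary."""
    # (?<=\S): the last matched character is not whitespace.
    # (?!\S):  the next character is whitespace or the end of the string.
    pattern = r'^.{0,%d}(?<=\S)(?!\S)' % n
    match = re.search(pattern, text)
    # Assumption: return '' when no whole word fits within the limit.
    return match.group(0) if match else ''

for i in (15, 16, 17, 30, 35):
    print("%s | %s" % (i, regex_cut(sample, i)))
```

Because `.` does not match a newline by default, this sketch only considers the first line of a multi-line string.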

In case you want to match something when the string starts with a long sequence of non-space, this regex will just break the sequence of non-space characters at the character limit:

^.{0,n}(?<=\S)(?!\S)|^\S{n}
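The fallback alternative can be exercised the same way. In this sketch the limit is substituted in both places where n appears. The name `regex_cut_or_break` is hypothetical.

```python
import re

def regex_cut_or_break(text, n):
    """Cut on a word boundary; break a leading long word at n characters."""
    # The first alternative is tried first; the second applies only when the
    # string starts with at least n non-space characters and no boundary fits.
    pattern = r'^.{0,%d}(?<=\S)(?!\S)|^\S{%d}' % (n, n)
    match = re.search(pattern, text)
    return match.group(0) if match else ''

print(regex_cut_or_break('Supercalifragilistic word', 5))
print(regex_cut_or_break('This is a sample', 12))
```

The first call falls through to the `^\S{n}` branch and breaks the long word, while the second is handled by the word-boundary branch.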
