I need to take a string and shorten it to 140 characters. This is what I currently do:


if len(tweet) > 140:
    tweet = re.sub(r"\s+", " ", tweet)  # normalize whitespace
    footer = "… " + utils.shorten_urls(post['url'])
    avail = 140 - len(footer)
    words = tweet.split()
    result = ""
    for word in words:
        word += " "
        if len(word) > avail:
            break  # the next word does not fit; stop here
        result += word
        avail -= len(word)
    tweet = (result + footer).strip()
    assert len(tweet) <= 140

This works well for English and English-like strings, but it fails for a Chinese string, because tweet.split() returns a single-element list:


>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> s
>>> s.split()

How should I do this so that it handles I18N? Does this make sense in all languages?

I'm on Python 2.5.4, if that matters.


9 Answers



Answered 2009-11-15T20:57:22.313

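For Chinese word segmentation, and for other advanced natural-language processing tasks, consider NLTK as a good starting point, if not a complete solution. It is a rich Python-based toolkit, particularly good for learning about NL processing techniques (and often good enough to offer a viable solution to some of these problems).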

Answered 2009-11-15T21:05:37.367

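The re.U flag makes \s follow the Unicode character properties database.

However, according to Python's Unicode database, the given string apparently does not contain any whitespace characters: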

>>> x = u'\u7b80\u8baf\uff1a\u65b0\u83ef\u793e\u5831\u9053\uff0c\u7f8e\u570b\u7e3d\u7d71\u5967\u5df4\u99ac\u4e58\u5750\u7684\u300c\u7a7a\u8ecd\u4e00\u865f\u300d\u5c08\u6a5f\u665a\u4e0a10\u664242\u5206\u9032\u5165\u4e0a\u6d77\u7a7a\u57df\uff0c\u9810\u8a08\u7d0430\u5206\u9418\u5f8c\u62b5\u9054\u6d66\u6771\u570b\u969b\u6a5f\u5834\uff0c\u958b\u5c55\u4ed6\u4e0a\u4efb\u5f8c\u9996\u6b21\u8a2a\u83ef\u4e4b\u65c5\u3002'
>>> re.compile(r'\s+', re.U).split(x)
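
The same check can be sketched on a shorter piece of the headline. This is only an illustration, written for Python 3 (where str is already Unicode); the variable names are assumptions of mine, not part of the question:

```python
import re

# A shortened piece of the sample headline; \uff0c is the fullwidth comma.
sample = "新華社報道\uff0c美國總統奧巴馬"

# Even with Unicode-aware \s there is nothing to split on,
# so split() hands back the whole string as a single element.
parts = re.compile(r"\s+", re.UNICODE).split(sample)

# Cross-check against the Unicode database, character by character.
has_whitespace = any(ch.isspace() for ch in sample)

print(len(parts), has_whitespace)
```

So the regex is behaving correctly; the text simply has no whitespace to break on.
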
Answered 2009-11-16T22:43:39.363

I tried out the solution with PyAPNS for push notifications and just wanted to share what worked for me. The issue I had was that truncating at 256 bytes in UTF-8 resulted in the notification getting dropped. I had to make sure the notification was encoded as "unicode_escape" to get it to work. I'm assuming this is because the result is sent as JSON and not as raw UTF-8. Anyway, here is the function that worked for me:

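def unicode_truncate(s, length, encoding='unicode_escape'):
    # Encode first so that the limit applies to the encoded size, then cut.
    encoded = s.encode(encoding)[:length]
    # 'ignore' drops a sequence that the cut left incomplete at the end.
    return encoded.decode(encoding, 'ignore')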
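
For comparison, when the limit really is counted in raw UTF-8 bytes, the same encode, cut, decode pattern can be sketched like this. The name utf8_truncate is hypothetical, and this is written for Python 3:

```python
def utf8_truncate(text, max_bytes):
    """Cut text so that its UTF-8 encoding fits in max_bytes bytes."""
    encoded = text.encode("utf-8")[:max_bytes]
    # A multi-byte character split by the cut is dropped, not mangled.
    return encoded.decode("utf-8", "ignore")


print(utf8_truncate("新華社", 4))
```

Each of these Han characters takes 3 bytes in UTF-8, so a 4-byte budget keeps only the first one.
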
Answered 2010-01-21T03:19:52.657




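The only change to my original implementation would be to not force a space after the last word, since it is not needed in any language (and to use the Unicode character … &#x2026 instead of ... three dots to save 2 characters).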
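
A minimal sketch of that change, assuming a plain footer string in place of the shortened URL from the question (shorten_on_spaces is a name I made up for illustration):

```python
import re


def shorten_on_spaces(tweet, footer="\u2026", limit=140):
    """Trim tweet at a space so that tweet plus footer fits in limit."""
    tweet = re.sub(r"\s+", " ", tweet.strip())  # normalize whitespace
    if len(tweet) <= limit:
        return tweet
    avail = limit - len(footer)
    result = ""
    for word in tweet.split(" "):
        # Only put a space between words, never after the last one.
        candidate = word if not result else result + " " + word
        if len(candidate) > avail:
            break
        result = candidate
    return result + footer


print(shorten_on_spaces("aaa bbb ccc", limit=8))
```

Note that a string with no spaces at all, such as the Chinese headline, still collapses to just the footer here, which is the limitation the question is about.
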

Answered 2009-11-16T22:33:44.877

Basically, in CJK (except Korean, which uses spaces), you need dictionary look-ups to segment words properly. Depending on your exact definition of "word", Japanese can be more difficult than that, since not all inflected variants of a word (e.g. "行こう" vs. "行った") will appear in the dictionary. Whether it's worth the effort depends on your application.
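
To give a feel for what a dictionary look-up means here, this is a toy sketch of greedy forward maximum matching. The function name and the four-word dictionary are assumptions for illustration; real segmenters use large dictionaries and statistical models:

```python
def max_match(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching against a word set."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


toy_dictionary = {"美國", "總統", "上海", "空域"}  # hypothetical word list
print("/".join(max_match("美國總統進入上海空域", toy_dictionary)))
```

Anything missing from the dictionary falls apart into single characters, which is why dictionary coverage (and inflected forms in Japanese) matters so much.
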

Answered 2012-02-03T06:24:07.133

This punts the word-breaking decision to the re module, but it may work well enough for you.

import re

def shorten(tweet, footer="", limit=140):
    """Break tweet into two pieces at roughly the last word break
    before limit.
    """
    lower_break_limit = limit / 2
    # limit under which to assume breaking didn't work as expected

    limit -= len(footer)

    tweet = re.sub(r"\s+", " ", tweet.strip())
    m = re.match(r"^(.{,%d})\b(?:\W|$)" % limit, tweet, re.UNICODE)
    if not m or m.end(1) < lower_break_limit:
        # no suitable word break found
        # cutting at an arbitrary location,
        # or if len(tweet) < lower_break_limit, this will be true and
        # returning this still gives the desired result
        return tweet[:limit] + footer
    return m.group(1) + footer
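
To see where re actually finds word breaks in this kind of text, here is a small sketch (Python 3, with a shortened sample of my own choosing). With Unicode matching, Han characters count as \w, so \b only falls next to punctuation or whitespace:

```python
import re

# \uff0c is the fullwidth comma used in the headline.
text = "新華社報道\uff0c美國總統奧巴馬"

# Each run of \w characters is what the regex treats as one "word".
runs = re.findall(r"\w+", text, re.UNICODE)

print(len(runs))
```

In other words, the breaks land at clause boundaries rather than at real word boundaries, which is why the lower_break_limit fallback to an arbitrary cut is needed.
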
Answered 2009-11-15T21:27:15.673

What you're looking for is a Chinese word segmentation tool. Word segmentation is not an easy task and is still not perfectly solved. There are several tools:

  1. CkipTagger

    Developed by Academia Sinica, Taiwan.

  2. jieba

    Developed by Sun Junyi, a Baidu engineer.

  3. pkuseg

    Developed by the Language Computing and Machine Learning Group, Peking University.

If all you want is character segmentation, that can be done, although it is not very useful.

>>> s = u"简讯:新華社報道,美國總統奧巴馬乘坐的「空軍一號」專機晚上10時42分進入上海空域,預計約30分鐘後抵達浦東國際機場,開展他上任後首次訪華之旅。"
>>> chars = list(s)
>>> chars
[u'\u7b80', u'\u8baf', u'\uff1a', u'\u65b0', u'\u83ef', u'\u793e', u'\u5831', u'\u9053', u'\uff0c', u'\u7f8e', u'\u570b', u'\u7e3d', u'\u7d71', u'\u5967', u'\u5df4', u'\u99ac', u'\u4e58', u'\u5750', u'\u7684', u'\u300c', u'\u7a7a', u'\u8ecd', u'\u4e00', u'\u865f', u'\u300d', u'\u5c08', u'\u6a5f', u'\u665a', u'\u4e0a', u'1', u'0', u'\u6642', u'4', u'2', u'\u5206', u'\u9032', u'\u5165', u'\u4e0a', u'\u6d77', u'\u7a7a', u'\u57df', u'\uff0c', u'\u9810', u'\u8a08', u'\u7d04', u'3', u'0', u'\u5206', u'\u9418', u'\u5f8c', u'\u62b5', u'\u9054', u'\u6d66', u'\u6771', u'\u570b', u'\u969b', u'\u6a5f', u'\u5834', u'\uff0c', u'\u958b', u'\u5c55', u'\u4ed6', u'\u4e0a', u'\u4efb', u'\u5f8c', u'\u9996', u'\u6b21', u'\u8a2a', u'\u83ef', u'\u4e4b', u'\u65c5', u'\u3002']
>>> print('/'.join(chars))
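
Character segmentation does at least give a simple fallback for the original problem: cut at a character boundary when there are no spaces to cut at. A rough sketch, with a hypothetical function name and an ellipsis footer assumed:

```python
def truncate_chars(text, limit=140, footer="\u2026"):
    """Cut text at a character boundary so the result fits in limit."""
    if len(text) <= limit:
        return text
    # Reserve room for the footer, then slice by characters.
    return text[:limit - len(footer)] + footer


print(truncate_chars("上海空域機場", limit=4))
```

This may split a word in the middle, so it is only reasonable when proper segmentation is not worth the effort.
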
Answered 2020-10-15T13:38:20.590
Answered 2009-11-16T22:49:44.053