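I am trying to build a test harness to stress test a very large publishing management implementation. I would like to use NLTK to generate paragraphs on different subjects, along with random titles for the articles.

Is NLTK capable of something like this? I want each article to be unique so that I can test different layout sizes. I would like to do the same for the topics.

P.S. I need to generate 1+ million articles, which will eventually be used to test many things (performance, search, layout, etc.).

Can anyone advise?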

1 Answer

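I have used this. It takes phrases from Noam Chomsky and generates random paragraphs. You can change the raw material text to anything you like. The more text you use, the better, of course.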

# List of LEADINs to buy time.
leadins = """To characterize a linguistic level L,
        On the other hand,
        This suggests that
        It appears that
        Furthermore """

# List of SUBJECTs chosen for maximum professorial macho.
subjects = """ the notion of level of grammaticalness
        a case of semigrammaticalness of a different sort
        most of the methodological work in modern linguistics
        a subset of English sentences interesting on quite independent grounds
        the natural general principle that will subsume this case """

# List of VERBs chosen for autorecursive obfuscation.
verbs = """can be defined in such a way as to impose
        delimits
        suffices to account for
        cannot be arbitrary in
        is not subject to """


# List of OBJECTs selected for profound sententiousness.

objects = """ problems of phonemic and morphological analysis.
        a corpus of utterance tokens upon which conformity has been defined by the paired utterance test.
        the traditional practice of grammarians.
        the levels of acceptability from fairly high (e.g. (99a)) to virtual gibberish (e.g. (98d)).
        a stipulation to place the constructions into these various categories.
        a descriptive fact.
        a parasitic gap construction."""

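import random
import textwrap
from itertools import chain, islice

def chomsky(times=1, line_length=72):
    """Return `times` random sentences, wrapped to `line_length` columns."""
    parts = []
    for part in (leadins, subjects, verbs, objects):
        # One phrase per line; strip the indentation left by the triple-quoted strings.
        phraselist = [line.strip() for line in part.splitlines()]
        random.shuffle(phraselist)
        parts.append(phraselist)
    # zip pairs the i-th phrase of each shuffled list into one sentence,
    # islice keeps the first `times` sentences, and chain flattens them.
    output = chain.from_iterable(islice(zip(*parts), 0, times))
    return textwrap.fill(' '.join(output), line_length)

print(chomsky())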

For me, this returned:

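This suggests that a case of semigrammaticalness of a different sort is not subject to a corpus of utterance tokens upon which conformity has been defined by the paired utterance test.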
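If you want a rough idea of how much variety to expect, note that a single sentence takes one phrase from each list, so the number of distinct sentences is just the product of the list sizes. A quick check, assuming the four lists exactly as posted above (5, 5, 5 and 7 phrases):

```python
from math import prod

# Phrase counts taken from the lists above: leadins, subjects, verbs, objects.
list_sizes = [5, 5, 5, 7]

# One phrase is chosen from each list, so the count is the product of the sizes.
distinct_sentences = prod(list_sizes)
print(distinct_sentences)  # 875
```

That is far short of a million, so for unique articles you would want multi-sentence paragraphs, much longer phrase lists, or both.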

For a title, you can of course do

chomsky().split('\n')[0]
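For the bulk case in the question, the same idea can be wrapped in a generator so the articles are produced one at a time instead of being held in memory. The following is only a sketch, not part of the code above: the names `sentence`, `article` and `articles`, the `seed` parameter, and the shortened phrase lists are all made up for illustration, and nothing here guarantees that every article is unique.

```python
import random
import textwrap

# Shortened phrase lists so this sketch runs on its own; in practice you would
# reuse the full lists from the answer or build much larger ones.
LEADINS = ["On the other hand,", "This suggests that", "It appears that", "Furthermore"]
SUBJECTS = [
    "the notion of level of grammaticalness",
    "most of the methodological work in modern linguistics",
    "the natural general principle that will subsume this case",
]
VERBS = ["delimits", "suffices to account for", "is not subject to"]
OBJECTS = [
    "the traditional practice of grammarians.",
    "a descriptive fact.",
    "a parasitic gap construction.",
]


def sentence(rng):
    """Build one sentence from one randomly chosen phrase per list."""
    return " ".join(rng.choice(part) for part in (LEADINS, SUBJECTS, VERBS, OBJECTS))


def article(rng, min_paragraphs=1, max_paragraphs=8, line_length=72):
    """Return (title, body) with a random number of paragraphs and sentences."""
    title = sentence(rng).rstrip(".")
    paragraphs = []
    for _ in range(rng.randint(min_paragraphs, max_paragraphs)):
        text = " ".join(sentence(rng) for _ in range(rng.randint(2, 6)))
        paragraphs.append(textwrap.fill(text, line_length))
    return title, "\n\n".join(paragraphs)


def articles(count, seed=0):
    """Lazily yield `count` articles; the seed makes a test run repeatable."""
    rng = random.Random(seed)
    for _ in range(count):
        yield article(rng)


for title, body in articles(3):
    print(title)
    print(body)
    print("-" * 72)
```

Varying the paragraph and sentence counts gives you articles of different sizes for the layout tests, and a fixed seed lets you regenerate the same data set when a test fails.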
Answered 2012-12-13T13:12:01.563