0

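I have a large text file in which words are interspersed with numbers and two kinds of characters, '|' and '.'. I searched StackOverflow and found how to take such a string and keep only the letters. For example, if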

old_fruits='apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'

then

re.sub("[^A-Za-z]","",old_fruits)

returns

'applekiwibananapear'

I am trying to write these words to a file, one word per line, each followed by a newline, for example:

apple
kiwi
banana
pear

Any ideas or pointers in the right direction are appreciated.

4

5 Answers

1

Try this:

import re

old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'

with open('fruits.out', 'w') as f:
    fruits = re.findall(r'[^\W\d]+', old_fruits)
    f.write('\n'.join(fruits))
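
One caveat worth checking: the class [^\W\d] means "word characters that are not digits", so it still accepts the underscore. The sketch below compares it with a stricter variant; the sample string and the names loose and strict are made up for illustration, and whether the underscore matters depends on your data.

```python
import re

# Hypothetical sample containing an underscore-joined token
sample = 'apple|0.00|snake_case|0.5|pear'

# [^\W\d]+ : word characters that are not digits (underscore still counts)
loose = re.findall(r'[^\W\d]+', sample)

# [^\W\d_]+ : additionally excludes the underscore, leaving letters only
strict = re.findall(r'[^\W\d_]+', sample)

print(loose)
print(strict)
```

For the fruit string in the question both patterns give the same result, since it contains no underscores.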
answered 2012-08-13T03:41:38.883
1

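You can do this without regular expressions. Split the string on the pipe character, use a generator expression with the built-in str.isalpha() method to keep only the words that consist entirely of alphabetic characters, and join them to form the final output: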

old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'
words = (word for word in old_fruits.split('|') if word.isalpha())
new_fruits = '\n'.join(words)

print(new_fruits)

The output is

apple
kiwi
banana
pear

as required (not written to a file, but I assume you can handle that part).
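
In case the file-writing part is useful too, here is a minimal sketch of the same split-and-filter idea applied to a large file. It assumes every field is separated by '|' and that no word spans a line break; the function names extract_words and write_words and both path arguments are hypothetical, not something from the question.

```python
def extract_words(lines):
    """Yield the purely alphabetic fields from '|'-separated lines."""
    for line in lines:
        for field in line.strip().split('|'):
            if field.isalpha():
                yield field


def write_words(in_path, out_path):
    # Both paths are placeholders. The input is read line by line, so a
    # large file never has to fit in memory at once.
    with open(in_path) as src, open(out_path, 'w') as dst:
        for word in extract_words(src):
            dst.write(word + '\n')
```

Calling write_words with your input and output file names should produce one word per line, each followed by a newline.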

Edit: I knocked up a quick script to compare timings for the regex and non-regex approaches:

import timeit

# Setup - not counted in the timing so it doesn't matter we include regex for both tests
setup = r"""old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'
import re
fruit_re=re.compile(r'[^\W\d]+')
"""

no_re = r"""words = (word for word in old_fruits.split('|') if word.isalpha())
new_fruits = '\n'.join(words)"""

with_re = r"""new_fruits = '\n'.join(fruit_re.findall(old_fruits))"""

num = 10000

print("Short input")
t = timeit.timeit(no_re, setup, number=num)
print("No regex: {0:.2f} microseconds to run".format((t*1e6)/num))
t = timeit.timeit(with_re, setup, number=num)
print("With regex: {0:.2f} microseconds to run".format((t*1e6)/num))

print("")
print("100 times longer input")

setup = r"""old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'*100
import re
fruit_re=re.compile(r'[^\W\d]+')"""

t = timeit.timeit(no_re, setup, number=num)
print("No regex: {0:.2f} microseconds to run".format((t*1e6)/num))
t = timeit.timeit(with_re, setup, number=num)
print("With regex: {0:.2f} microseconds to run".format((t*1e6)/num))

Results on my computer:

Short input
No regex: 18.31 microseconds to run
With regex: 15.37 microseconds to run

100 times longer input
No regex: 793.79 microseconds to run
With regex: 999.08 microseconds to run

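So the precompiled regex is faster on the short input string, while the generator expression is faster on the longer one (at least on my computer, running Ubuntu Linux and Python 2.7; your results may vary).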
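
Since the two approaches are fairly close, a single timing run can be misleading. Here is a hedged sketch of a slightly more robust measurement using timeit.repeat and taking the minimum of several runs; the loop count, the repeat count, and the dictionary name best are arbitrary choices of mine, and the numbers will differ between machines and Python versions.

```python
import timeit

setup = r"""
import re
old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear' * 100
fruit_re = re.compile(r'[^\W\d]+')
"""

no_re = r"""'\n'.join(w for w in old_fruits.split('|') if w.isalpha())"""
with_re = r"""'\n'.join(fruit_re.findall(old_fruits))"""

num = 200  # loops per measurement; an arbitrary choice
best = {}
for label, stmt in (("No regex", no_re), ("With regex", with_re)):
    # repeat() returns one total time per run; the minimum is the run
    # least disturbed by other activity on the machine
    times = timeit.repeat(stmt, setup, repeat=5, number=num)
    best[label] = min(times) * 1e6 / num
    print("{0}: {1:.2f} microseconds per loop".format(label, best[label]))
```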

answered 2012-08-13T03:52:58.610
0
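old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'
of = old_fruits.split("|")
with open('fruits.out', 'w') as f:
    for word in of:
        if word.isalpha():  # a fixed step of 2 would also pick up '0.5369'
            f.write(word + '\n')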
answered 2012-08-13T03:40:58.920
0

Using the OP's code as a base:

import re
old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'

with open('outdata.txt', 'w') as f:
    f.write('\n'.join(re.sub("[^A-Za-z]"," ",old_fruits).split()))

apple
kiwi
banana
pear

is what ends up in the file 'outdata.txt'.

answered 2012-08-13T03:38:07.373
0

The answer isn't hard. I don't know whether this is best practice, but why not

import re
print(re.sub("[^A-Za-z]+", "\n", old_fruits))  # re.sub("[^A-Za-z]+", "\n", old_fruits) is the string you want

The "+" means that each run of one or more non-letter characters is replaced with a single \n
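
To make the effect of the "+" concrete, here is a small sketch comparing the substitution with and without it on the string from the question. The variable names without_plus, with_plus and cleaned are just for illustration, and the final strip is an extra step I am assuming you would want if your real data starts or ends with non-letters.

```python
import re

old_fruits = 'apple|0.00|kiwi|0.00|0.5369|-0.2437|banana|0.00|pear'

# Without "+", every single non-letter becomes its own newline,
# so '|0.00|' turns into six consecutive newlines.
without_plus = re.sub("[^A-Za-z]", "\n", old_fruits)

# With "+", each whole run of non-letters collapses into one newline.
with_plus = re.sub("[^A-Za-z]+", "\n", old_fruits)

# Input that begins or ends with non-letters would leave a stray newline
# at that end; stripping removes it (a no-op for this particular string).
cleaned = with_plus.strip("\n")

print(cleaned)
```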

answered 2012-08-13T03:39:18.380