python - Python, working with strings

Question

I need to build a program for my class that reads messy text from a file and gives that text a book-like form, so that this input:

This    is programing   story , for programmers  . One day    a variable
called
v  comes    to a   bar    and ordred   some whiskey,   when suddenly 
      a      new variable was declared .
a new variable asked : "    What did you ordered? "

becomes this output:

This is programing story,
for programmers. One day 
a variable called v comes
to a bar and ordred some 
whiskey, when suddenly a 
new variable was 
declared. A new variable
asked: "what did you 
ordered?"

I am a beginner at programming, and my code is here:

def vypis(t):
    cely_text = ''
    for riadok in t:
        cely_text += riadok.strip()
    a = 0     
    for i in range(0,80):
        if cely_text[0+a] == " " and cely_text[a+1] == " ":
            cely_text = cely_text.replace ("  ", " ")
        a+=1
    d=0    
    for c in range(0,80):
        if cely_text[0+d] == " " and (cely_text[a+1] == "," or cely_text[a+1] == "." or cely_text[a+1] == "!" or cely_text[a+1] == "?"):
            cely_text = cely_text.replace (" ", "")
        d+=1   
def vymen(riadok):
    for ch in riadok:
        if ch in '.,":':
            riadok = riadok[ch-1].replace(" ", "")
x = int(input("Zadaj x"))
t = open("text.txt", "r")
v = open("prazdny.txt", "w")
print(vypis(t))

This code removes some of the spaces, and I tried to remove the spaces before signs such as ".,_?", but it does not work. Why? Thanks for the help :)

score 3 · Accepted Answer

You want to do quite a number of things, so let's take them in order:

Let's first get the text in a convenient form (a list of strings):

>>> with open('text.txt', 'r') as f:
...     lines = f.readlines()

>>> lines
['This    is programing   story , for programmers  . One day    a variable', 
 'called', 'v  comes    to a   bar    and ordred   some whiskey,   when suddenly ',
 '      a      new variable was declared .', 
 'a new variable asked : "    What did you ordered? "']
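A side note on this step: readlines() normally keeps the trailing newline on every line, even though the listing above does not display it. The following is a minimal sketch of that behaviour, assuming an in-memory file (io.StringIO) as a stand-in for text.txt; the names fake_file and stripped are only illustrative:

```python
import io

# Stand-in for open('text.txt', 'r'); the content is a shortened sample.
fake_file = io.StringIO('This    is programing   story ,\ncalled\nv  comes\n')

lines = fake_file.readlines()
print(lines)  # every element still ends with '\n'

# One way to drop the newlines explicitly while reading.
stripped = [line.rstrip('\n') for line in lines]
print(stripped)
```

Either form works with the steps that follow, since the text is split on whitespace later anyway.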

You have newlines all over the place. Let's replace them with spaces and join everything into one big string:

>>> text = ' '.join(line.replace('\n', ' ') for line in lines)

>>> text
'This    is programing   story , for programmers  . One day    a variable called v  comes    to a   bar    and ordred   some whiskey,   when suddenly        a      new variable was declared . a new variable asked : "    What did you ordered? "'

Now we want to remove every run of multiple spaces. We split on whitespace (spaces, tabs, and so on) and keep only the non-empty words:

>>> words = [word for word in text.split() if word]
>>> words
['This', 'is', 'programing', 'story', ',', 'for', 'programmers', '.', 'One', 'day', 'a', 'variable', 'called', 'v', 'comes', 'to', 'a', 'bar', 'and', 'ordred', 'some', 'whiskey,', 'when', 'suddenly', 'a', 'new', 'variable', 'was', 'declared', '.', 'a', 'new', 'variable', 'asked', ':', '"', 'What', 'did', 'you', 'ordered?', '"']

Let's join our words with a space... (only one this time)

>>> text = ' '.join(words)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'

We now want to remove every <SPACE>., <SPACE>, and so on:

>>> for char in (',', '.', ':', '"', '?', '!'):
...     text = text.replace(' ' + char, char)
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. a new variable asked:" What did you ordered?"'

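OK, the job is not finished yet: the " is still misplaced, the capital letters are not set, and so on. You can keep updating your text step by step. For the capitalization, consider for instance: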

>>> sentences = text.split('.')
>>> sentences
['This is programing story, for programmers', ' One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared', ' a new variable asked:" What did you ordered?"']

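See how you could fix it? The trick is to apply only string transformations such that:

a correct sentence is NOT changed by the transformation
an incorrect sentence is IMPROVED by the transformation

This way you can compose them and improve your text step by step.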
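To make the idea of composable transformations concrete, here is a minimal sketch. The helper names (collapse_whitespace, strip_space_before_punctuation, capitalize_sentences, compose) are my own assumptions and not part of any library; each step leaves already-correct text unchanged, so running the whole pipeline twice gives the same result as running it once:

```python
import re

def collapse_whitespace(text):
    # Split on any whitespace and re-join with single spaces.
    return ' '.join(text.split())

def strip_space_before_punctuation(text):
    # Remove the whitespace that precedes , . : ? or !
    return re.sub(r'\s+([,.:?!])', r'\1', text)

def capitalize_sentences(text):
    # Upper-case the first letter of the text and of every sentence
    # that follows a '.', '?' or '!' plus whitespace.
    return re.sub(r'(^|[.?!]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(), text)

def compose(*steps):
    # Build one transformation that applies the given steps in order.
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

tidy = compose(collapse_whitespace,
               strip_space_before_punctuation,
               capitalize_sentences)

messy = 'one day   a variable  was declared .  a new variable asked : why ?'
clean = tidy(messy)
print(clean)
```

Because every step is safe to re-apply, new rules (for quotes, for instance) can be appended to the pipeline without revisiting the earlier ones.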

Once you have a nicely formatted text, like this:

>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. A new variable asked: "what did you ordered?"'

you have to define similar syntactic rules to print it out in book format. Consider for example the function:

>>> def prettyprint(text):
...     return '\n'.join(text[i:i+50] for i in range(0, len(text), 50))

It prints each line with an exact length of 50 characters:

>>> print(prettyprint(text))
This is programing story, for programmers. One day
 a variable called v comes to a bar and ordred som
e whiskey, when suddenly a new variable was declar
ed. A new variable asked: "what did you ordered?"

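Not bad, but it can be better. Just as we previously juggled text, lines, sentences and words to match the syntactic rules of English, we want to do exactly the same to match the syntactic rules of printed books.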
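As a quick illustration of breaking lines at word boundaries rather than at a fixed character count, the standard library's textwrap.fill can be used. This is only a sketch, assuming a line width of 25 characters (my assumption; the question reads the width from input as x):

```python
import textwrap

text = ('This is programing story, for programmers. One day a variable '
        'called v comes to a bar and ordred some whiskey, when suddenly '
        'a new variable was declared. A new variable asked: '
        '"what did you ordered?"')

# Break only between words, never inside one, keeping lines within 25 characters.
wrapped = textwrap.fill(text, width=25)
print(wrapped)
```

This handles the wrapping only; it knows nothing about sentences or punctuation rules, which is where working on words and sentences directly becomes useful.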

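In this case, English and printed books both work on the same units: words, arranged in sentences. This suggests we might want to work on these directly. A simple way to do that is to define your own objects: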

>>> class Sentence(object):
...     def __init__(self, content, punctuation):
...         self.content = content
...         self.endby = punctuation
...     def pretty(self):
...         nice = []
...         content = self.content.pretty()
...         # A sentence starts with a capital letter
...         nice.append(content[0].upper())
...         # The rest has already been prettified by the content
...         nice.extend(content[1:])
...         # Do not forget the punctuation sign
...         nice.append(self.endby)
...         return ''.join(nice)

>>> class Paragraph(object):
...     def __init__(self, sentences):
...         self.sentences = sentences
...     def pretty(self):
...         # Separating our sentences by a single space
...         return ' '.join(sentence.pretty() for sentence in self.sentences)

And so on... This way you can represent your text as:

>>> Paragraph(
...   Sentence(
...     Propositions([Proposition(['this', 
...                                'is', 
...                                'programming', 
...                                'story']),
...                   Proposition(['for',
...                                'programmers'])],
...                   ','),
...     '.'),
...   Sentence(...

ETC...
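A self-contained, runnable version of this tree could look like the sketch below. The bodies of Proposition and Propositions are my assumptions (the answer only hints at them), and Paragraph is given a list of sentences:

```python
class Proposition(object):
    def __init__(self, words):
        self.words = words
    def pretty(self):
        # Words are separated by a single space.
        return ' '.join(self.words)

class Propositions(object):
    def __init__(self, propositions, punctuation):
        self.propositions = propositions
        self.punctuation = punctuation
    def pretty(self):
        # Propositions are separated by the punctuation sign and one space.
        separator = self.punctuation + ' '
        return separator.join(p.pretty() for p in self.propositions)

class Sentence(object):
    def __init__(self, content, punctuation):
        self.content = content
        self.endby = punctuation
    def pretty(self):
        content = self.content.pretty()
        # Capital letter first, closing punctuation sign last.
        return content[0].upper() + content[1:] + self.endby

class Paragraph(object):
    def __init__(self, sentences):
        self.sentences = sentences
    def pretty(self):
        return ' '.join(s.pretty() for s in self.sentences)

paragraph = Paragraph([
    Sentence(Propositions([Proposition(['this', 'is', 'programing', 'story']),
                           Proposition(['for', 'programmers'])], ','), '.'),
    Sentence(Propositions([Proposition(['one', 'day', 'a', 'variable',
                                        'was', 'declared'])], ','), '.'),
])
result = paragraph.pretty()
print(result)
```

Each class only knows the formatting rule for its own level, which keeps every rule small and easy to test.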

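Converting a string (even a messy one) into such a tree is relatively straightforward, since you only break it down into the smallest possible elements. When you want to print it in book format, you can define your own book method on each element of the tree, for example like this, passing the current line, the output lines and the current offset on the current line: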

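 class Proposition(object):
     ...
     def book(self, line, lines, offset, line_length):
         for word in self.words:
             if offset + len(word) > line_length:
                 # The word does not fit: flush the current line
                 lines.append(' '.join(line))
                 line = []
                 offset = 0
             line.append(word)
             # Account for the word and the space that follows it
             offset += len(word) + 1
         return line, lines, offset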

 ...

 class Propositions(object):
     ...
     def book(self, line, lines, offset, line_length):
         line, lines, offset = self.Proposition1.book(line, lines, offset, line_length)
         # The punctuation sign is attached to the last word
         word = line.pop()
         if offset - 1 + len(self.punctuation) > line_length:
             # No room left: move the last word, together with the
             # punctuation sign, to a new line
             lines.append(' '.join(line))
             line = []
             offset = len(word) + 1
         line.append(word + self.punctuation)
         offset += len(self.punctuation)
         line, lines, offset = self.Proposition2.book(line, lines, offset, line_length)
         return line, lines, offset

and work your way up to Sentence, Paragraph, Chapter...

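This is a very simplistic implementation (of what is actually a non-trivial problem): it does not take syllabification or justification into account (which you would probably like to have), but this is the way to go.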
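To give an idea of what the justification step could involve, here is a hedged sketch. The justify helper is hypothetical and not part of the tree classes above; it pads the gaps between words so that a line is exactly width characters long, giving any leftover spaces to the leftmost gaps:

```python
def justify(line, width):
    words = line.split()
    gaps = len(words) - 1
    # Number of spaces needed to reach the requested width.
    spaces = width - sum(len(word) for word in words)
    if gaps < 1 or spaces < gaps:
        # A single word, or a line that is already too long: nothing to pad.
        return ' '.join(words)
    base, extra = divmod(spaces, gaps)
    padded = []
    for index, word in enumerate(words[:-1]):
        # The first `extra` gaps receive one additional space.
        padded.append(word + ' ' * (base + (1 if index < extra else 0)))
    padded.append(words[-1])
    return ''.join(padded)

print(justify('a variable called v', 25))
print(justify('to a bar and', 14))
```

In a real book layout the last line of a paragraph is usually left unjustified, so a caller would skip that line.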

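Note that I did not mention the string module, string formatting or regular expressions, which are tools you can use once you have defined your syntactic rules or transformations. These are extremely powerful tools, but the most important thing here is to know exactly the algorithm that turns an invalid string into a valid one. Once you have some working pseudocode, regular expressions and format strings can help you implement it more easily than plain character iteration. (In my previous example of a tree of words, for instance, regular expressions can greatly ease the construction of the tree, and Python's powerful string formatting functions can make the book or pretty methods much easier to write.)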
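As a small illustration of how a regular expression can help build the tree, the sketch below splits a cleaned-up text into (content, punctuation) pairs, which are the two arguments the Sentence class expects. The split_sentences name is my own, and the pattern is an assumption that only handles sentences ending in '.', '?' or '!':

```python
import re

def split_sentences(text):
    # Each match is (sentence content, closing punctuation sign).
    pairs = re.findall(r'([^.?!]+)([.?!])', text)
    return [(content.strip(), sign) for content, sign in pairs]

text = 'This is programing story, for programmers. One day a variable was declared.'
pairs = split_sentences(text)
print(pairs)
```

Text after the last punctuation sign (a closing quote, for instance) is dropped by this pattern, so quoted speech would need a rule of its own.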

score 1 · Accepted Answer

To remove the multiple spaces, you can use a simple regular expression substitution.

import re
cely_text = re.sub(' +',' ', cely_text)

Then, for the punctuation, you can run a similar substitution:

cely_text = re.sub(r' +([,.:])', r'\g<1>', cely_text)
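Putting the two substitutions together, and extending them a little, might look like the sketch below. The clean function name, the extra ? and ! in the character class, and the rule for double quotes are my additions and assume that quotes always come in pairs:

```python
import re

def clean(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r'\s+', ' ', text).strip()
    # Drop the spaces in front of punctuation signs.
    text = re.sub(r' +([,.:?!])', r'\g<1>', text)
    # Trim the spaces just inside a pair of double quotes.
    text = re.sub(r'"\s*(.*?)\s*"', r'"\g<1>"', text)
    return text

print(clean('a new variable asked : "    What did you ordered? "'))
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regular expression engine sees them.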
