python - Python: re.compile and re.sub

Question

Question part 1

I got this file f1:

<something @37>
<name>George Washington</name> 
<a23c>Joe Taylor</a23c>
</something @37>

and I want to use re.compile so that it looks like this (with spaces):

George Washington Joe Taylor

I tried this code but it kinda deletes everything:

import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()

match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)

fixed.write(fixed_doc)

My guess is the re.compile line but I'm not quite sure what to do with it. I'm not supposed to use 3rd party extensions. Any ideas?

Question part 2

I had a different question about comparing 2 files. I got this code from Alfe:

from collections import Counter

def test():
    with open('f1.txt') as f:
        contentsI = f.read()
    with open('f2.txt') as f:
        contentsO = f.read()

    tokensI = Counter(value for value in contentsI.split()
                        if value not in [])
    tokensO = Counter(value for value in contentsO.split()
                        if value not in [])
    return not (tokensI - tokensO) and not (set(tokensO) - set(tokensI))

Is it possible to implement the re.compile and re.sub in the 'if value not in []' section?

score 35 · Accepted Answer

I will explain what happens with your code:

import re
file = open('f1.txt')
fixed = open('fnew.txt','w')
text = file.read()

match = re.compile('<.*>')
for unwanted in text:
    fixed_doc = match.sub(r' ',text)

fixed.write(fixed_doc)

The instruction text = file.read() creates an object of type string and binds it to the identifier text.
Note that I distinguish between the OBJECT itself and the name (== IDENTIFIER) that refers to it, even though both are written text here.
As a consequence of the instruction for unwanted in text:, the identifier unwanted is successively assigned to each character of the string referenced by text.

Besides, re.compile('<.*>') creates an object of type RegexObject (which I personally call a compiled regex, or simply a regex, <.*> being only the regex pattern).
You assign this compiled regex object to the identifier match: that's a very bad practice, because match is already the name of a method of regex objects in general, and of the one you created in particular, so you could then write match.match without error.
match is also the name of a function of the re module.
Using this name for your particular need is very confusing. You should avoid it.

There's the same flaw in using file as the name for the file handle of f1.txt. file is already an identifier used by the language (it is a built-in name in Python 2), so you should avoid it.

Well. Now that this badly named match object is defined, the instruction fixed_doc = match.sub(r' ',text) replaces every occurrence found by the regex match in text with the replacement r' '.
Note that it's completely superfluous to write r' ' instead of just ' ', because there's absolutely nothing in ' ' that needs to be escaped. Some anxious people have the habit of writing raw strings every time they write a string in a regex problem.

Because of its pattern <.*>, in which the dot means "greedily eat every character situated between a < and a >, except a newline character", the occurrences caught in the text by match each run from the first < of a line to the last > on that line.
As the name unwanted doesn't appear in this instruction, the very same operation is performed once for each character of the text, one after the other. That is to say: nothing interesting.
To analyze the execution of a program, you should put some printing instructions in your code to understand what happens. For example, if you do print repr(fixed_doc), you'll see the repeated printing of this: ' \n \n \n '. As I said: nothing interesting.
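
To see the greedy behaviour in isolation, here is a minimal sketch in Python 3 syntax. The sample string and the names sample, greedy, matches and stripped are mine, chosen for illustration; the trailing space after </name> in the original file is left out.

```python
import re

# Sample input modelled on the question's file (trailing spaces omitted).
sample = (
    "<something @37>\n"
    "<name>George Washington</name>\n"
    "<a23c>Joe Taylor</a23c>\n"
    "</something @37>"
)

greedy = re.compile('<.*>')

# Without re.DOTALL the dot stops at newlines, so each match is confined
# to one line and runs from the first '<' to the LAST '>' on that line.
matches = greedy.findall(sample)
for m in matches:
    print(repr(m))

# Every line is one whole match, so the names vanish together with the tags.
stripped = greedy.sub(' ', sample)
print(repr(stripped))
```

Each of the four lines comes back as a single match, which is why nothing but spaces and newlines is left after the substitution.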

There's one more defect in your code: you open files, but you don't close them. You should always close files, otherwise weird things can happen, as I personally observed in some of my own code before I realized this need. Some people claim it isn't mandatory, but that's false.
By the way, the best way to open and close files is the with statement. It does all the work without you having to worry about it.
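
As a rough illustration of the with statement for this kind of job, here is a small sketch. The function name rewrite_file and its parameters are hypothetical, not part of the question's code; the caller passes in whatever compiled regex it wants applied.

```python
import re

def rewrite_file(src_path, dst_path, regex, repl=' '):
    """Copy src_path to dst_path, replacing every match of regex with repl."""
    # Both files are closed automatically when the block is left,
    # even if an exception is raised inside it.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        dst.write(regex.sub(repl, src.read()))

# Example call (assumes f1.txt exists in the current directory):
# rewrite_file('f1.txt', 'fnew.txt', re.compile('<[^>]*>'))
```

No explicit close() call is needed anywhere, and the identifiers src and dst avoid shadowing file.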

So, now I can propose some code for your first problem:

import re

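def ripl(mat=None,li = []):
    # Called with no argument: empty the shared list of recorded spans.
    if mat is None:
        li[:] = []
        return
    # Start tag that has a matching end tag further on: record that end tag's span.
    if mat.group(1):
        li.append(mat.span(2))
        return ''
    # End tag recorded earlier as the partner of a start tag: remove it.
    elif mat.span() in li:
        return ''
    # Orphan end tag: leave it in the text.
    else:
        return mat.group()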

r = re.compile('</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\1>))',
               re.DOTALL)

text = '''<something @37>
<name>George <wxc>Washington</name> 
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl,text))
print('3------------------------------------3')

result

1------------------------------------1
<something @37>
<name>George <wxc>Washington</name> 
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2

George <wxc>Washington 
Joe </zazaza>Taylor

3------------------------------------3

The principle is as follows:

When the regex detects a tag:
- if it's an end tag, it matches
- if it's a start tag, it matches only if there is a corresponding end tag somewhere further in the text

For each match, the method sub() of the regex r calls the function ripl() to perform the replacement.
If the match is with a start tag (which is necessarily followed somewhere in the text by its corresponding end tag, by construction of the regex), then ripl() returns ''.
If the match is with an end tag, ripl() returns '' only if this end tag has previously been detected as being the corresponding end tag of an earlier start tag. This is made possible by recording in a list li the span of the corresponding end tag each time a start tag is detected and matched.

The recording list li is defined as a default argument so that the same list is used at each call of the function ripl() (please read up on how default arguments work to understand this, because it's subtle).
As a consequence of defining li as a parameter with a default argument, the list object would retain all the spans recorded from earlier texts if several texts were analyzed successively. To prevent li from retaining spans of past matches, the list must be emptied. I wrote the function so that the first parameter has the default argument None: that allows calling ripl() without an argument before any use of it in a regex's sub() method.
So one must remember to call ripl() with no argument before each such use.
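
The default-argument behaviour that ripl() relies on can be seen in a much smaller function. This is only an illustrative sketch; remember and seen are hypothetical names that play the roles of ripl and li.

```python
def remember(item=None, seen=[]):
    # 'seen' is created once, when the def statement is executed,
    # and the same list object is reused by every later call.
    if item is None:
        seen[:] = []          # empty the shared list in place
        return seen
    seen.append(item)
    return seen

print(remember('a'))   # ['a']
print(remember('b'))   # ['a', 'b']  (state survived between the calls)
remember()             # reset, like calling ripl() with no argument
print(remember('c'))   # ['c']
```

Without the reset call, the third print would show all three items, which is exactly the stale-span problem described above.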

If you want to remove the newlines of the text in order to obtain the precise result you showed in your question, the code must be modified to:

import re

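def ripl(mat=None,li = []):
    # Called with no argument: empty the shared list of recorded spans.
    if mat is None:
        li[:] = []
        return
    # Newline with its surrounding spaces: remove it.
    if mat.group(1):
        return ''
    # Start tag that has a matching end tag further on: record that end tag's span.
    elif mat.group(2):
        li.append(mat.span(3))
        return ''
    # End tag recorded earlier as the partner of a start tag: remove it.
    elif mat.span() in li:
        return ''
    # Orphan end tag: leave it in the text.
    else:
        return mat.group()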


r = re.compile('( *\n *)'
               '|'
               '</[^>]+>'
               '|'
               '<([^>]+)>(?=.*?(</\\2>)) *',
               re.DOTALL)

text = '''<something @37>
<name>George <wxc>Washington</name> 
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>'''
print('1------------------------------------1')
print(text)
print('2------------------------------------2')
ripl()
print(r.sub(ripl,text))
print('3------------------------------------3')

result

1------------------------------------1
<something @37>
<name>George <wxc>Washington</name> 
<a23c>Joe </zazaza>Taylor</a23c>
</something @37>
2------------------------------------2
George <wxc>WashingtonJoe </zazaza>Taylor
3------------------------------------3

score 2 · Accepted Answer

You can do this easily with Beautiful Soup:

from bs4 import BeautifulSoup
file = open('f1.txt')
fixed = open('fnew.txt','w')

#now for some soup

soup = BeautifulSoup(file)

fixed.write(str(soup.get_text()).replace('\n',' '))

The output of the line above will be:

George Washington Joe Taylor

(At least this works for the sample you gave.)

Sorry, I don't understand part 2. Good luck!

score 1 · Accepted Answer

There's no need for re.compile:

import re  

clean_string = ''  

with open('f1.txt') as f1:  
    for line in f1:  
        match = re.search('.+>(.+)<.+', line)  
        if match:  
            clean_string += (match.group(1))  
            clean_string += ' '  

print(clean_string) # 'George Washington Joe Taylor'

score 0 · Accepted Answer

Figured the first part out: it was the missing '?'.

match = re.compile('<.*?>')

does the trick.
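
For anyone landing here later, a minimal sketch of that fix in Python 3 syntax. The sample string stands in for f1.txt, and the final whitespace clean-up with split() and join() is my own addition, not part of the one-line fix above.

```python
import re

# Stand-in for the contents of f1.txt.
sample = (
    "<something @37>\n"
    "<name>George Washington</name> \n"
    "<a23c>Joe Taylor</a23c>\n"
    "</something @37>"
)

tag = re.compile('<.*?>')     # lazy: stop at the first '>' after each '<'
without_tags = tag.sub(' ', sample)

# Collapse the leftover runs of spaces and newlines into single spaces.
result = ' '.join(without_tags.split())
print(result)   # George Washington Joe Taylor
```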

Anyway, I'm still not sure about the second question. :/

score 0 · Accepted Answer

For part 1, try the code snippet below. But consider using a library like beautifulsoup, as Moe Jan suggested.

import re
import os
def main():
    f = open('sample_file.txt')
    fixed = open('fnew.txt','w')



    #pattern = re.compile(r'(?P<start_tag>\<.+?\>)(?P<content>.*?)(?P<end_tag>\</.+?\>)')
    pattern = re.compile(r'(?P<start><.+?>)(?P<content>.*?)(</.+?>)')
    output_text = []
    for text in f:
        match = pattern.match(text)
        if match is not None:
            output_text.append(match.group('content'))

    fixed_content = ' '.join(output_text)


    fixed.write(fixed_content)
    f.close()
    fixed.close()

if __name__ == '__main__':
    main()

For part 2:

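I'm not entirely clear on what you're asking, but my guess is that you want to do something like if re.sub(value) not in []. However, note that you only need to call re.compile once, before initializing the Counter instances. It would be better if you clarified the second part of your question.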
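
If that guess is right, one possible reading is to strip the tags before the words are counted. The sketch below is Python 3 and rests on several assumptions of mine: TAG, count_words and same_words are hypothetical names, the function takes strings rather than file names so it can be tried without files, and the return expression is copied unchanged from Alfe's code.

```python
import re
from collections import Counter

TAG = re.compile('<.*?>')     # compiled once, outside the comparison

def count_words(contents):
    """Count whitespace-separated words after replacing tags with spaces."""
    return Counter(TAG.sub(' ', contents).split())

def same_words(contents_i, contents_o):
    tokens_i = count_words(contents_i)
    tokens_o = count_words(contents_o)
    # Same test as in the question's code, applied to the tag-free words.
    return not (tokens_i - tokens_o) and not (set(tokens_o) - set(tokens_i))

print(same_words('<name>George Washington</name>', 'Washington George'))
print(same_words('<name>George</name>', 'George Washington'))
```

The first call prints True and the second False, since the second file contains a word the first one lacks.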

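Actually, I'd suggest using the built-in Python diff module to find the differences between the two files. That is better than writing your own diff algorithm, because the diff logic is well tested and widely used, and it is less vulnerable to logic or programming errors caused by stray newlines, tabs and space characters.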
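
Assuming the module meant here is difflib from the standard library, a minimal sketch could look like this. The two line lists are made-up sample data standing in for the contents of f1.txt and f2.txt.

```python
import difflib

# Made-up sample data standing in for the lines of the two files.
old_lines = ['George Washington', 'Joe Taylor']
new_lines = ['George Washington', 'Joe  Taylor', 'John Adams']

# unified_diff yields header lines, then context, removed (-) and added (+) lines.
diff = list(difflib.unified_diff(old_lines, new_lines,
                                 fromfile='f1.txt', tofile='f2.txt',
                                 lineterm=''))
for line in diff:
    print(line)
```

Note that in this sketch a line differing only by an extra space is still reported as changed, so any whitespace normalization has to be done before the lines are handed to difflib.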
