1

下载页面并对其进行小幅编辑,将本段中的前65更改为68

在此处输入图像描述

然后我用 BeauifulSoup 解析这两个源并用difflib区分它们。

url = 'https://secure.ssa.gov/apps10/reference.nsf/links/02092016062645AM'
response = urllib2.urlopen(url)
content = response.read()  # get response as list of lines

url2 = 'file:///Users/Pyderman/projects/temp/02092016062645AM-modified.html'
response2 = urllib2.urlopen(url2)
content2 = response2.read()  # get response as list of lines
import difflib
d = difflib.Differ()

diffed = d.compare(content, content)

soup = bs4.BeautifulSoup(content, "lxml")
soup2= bs4.BeautifulSoup(content2, "lxml")
diff = d.compare(list(soup.stripped_strings), list(soup2.stripped_strings))
changes = [change for change in diff if change.startswith('-') or  change.startswith('+')]
for change in changes:
    print change

打印更改给出:

- The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.
+ The Achieving a Better Life Experience (ABLE) Act, H.R. 5771, legislation passed on December 19, 2014. It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).  This provision will apply to any individual who attains age 65 on or after December 19, 2015 (the one year anniversary of enactment of this bill).  Two new Universal Text Identifiers (UTIs), UTI WCP060 and WCP061 were created to comply with this change.

所以它打印了整个段落,尽管变化很小。我想它通过整段而不是句子来显示差异是一件好事,但是我们能以某种方式使输出更细化吗?就目前而言,如果我只想突出显示更改的文本,我将不得不对这两个几乎相同的字符串进行一些额外的增量比较。

4

1 回答 1

1

您可以使用nltk.sent_tokenize()将汤字符串拆分为句子:

from nltk import sent_tokenize

sentences = [sentence for string in soup.stripped_strings for sentence in sent_tokenize(string)]
sentences2 = [sentence for string in soup2.stripped_strings for sentence in sent_tokenize(string)]

diff = d.compare(sentences, sentences2)
changes = [change for change in diff if change.startswith('-') or  change.startswith('+')]
for change in changes:
    print(change)

仅打印检测到更改的适当句子:

- It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 65 to full retirement age (FRA).
+ It contains a Title II provision that changes the age at which workers compensation/public disability offset ends for disability beneficiaries from age 68 to full retirement age (FRA).
于 2016-02-13T03:00:27.183 回答