1

我正在寻找某种方法来确定来自任何文章网站的字符串类型,例如这个。类型可以是标题、作者、日期、文章本身。我使用 BeautifulSoup 和 Boilerpipe 来抓取相关内容:

from boilerpipe.extract import Extractor
from bs4 import BeautifulSoup
import urllib

urlvar = 'http://goarticles.com/article/Finding-Out-What-Makes-a-UK-VPN-Tick/7454423/'

result = urllib.urlopen(urlvar)
soup = BeautifulSoup(result.read())

extractor = Extractor(extractor='ArticleExtractor', html=soup.prettify())

extracted_html = extractor.getHTML()
print(extracted_html.encode("utf-8"))

现在我有一些看起来像这样的输出:

<HTML prefix="og: http://ogp.me/ns#">
<style type="text/css">
A:before { content:' '; } 
A:after { content:' '; } 
SPAN:before { content:' '; } 
SPAN:after { content:' '; } 
</style>
<BODY><DIV id="content"><DIV class="wrapper"><DIV id="article-content"><DIV id="main-col"><DIV align="left" class="article"><H1 class="art_head">
        Finding Out What Makes a UK VPN Tick
        <EM>
         by John H. Hickman
        </EM></H1><H2 class="art_cat"><B>
         in
         <A href="/category/technology/">
          Technology
         </A></B>
        (submitted 2013-03-09)
       </H2><P>
        How a UK VPN Works
       </P><P>
        A VPN creates a virtual tunnel through the Internet, through which encrypted data passes through. No matter how a user connects to the web, a VPN secures the connection with high grade encryption. This is important for public hotspots and other unsecure networks that put users at risk for digital eavesdropping. The benefits don't stop there; below is a description of the benefits of using a British VPN.
       </P><P>
        Can You Access UK Websites with a UK VPN?
       </P><P>
        Many websites are geo-restricted or blocked from users who don't have a UK IP address. A British VPN provider bypasses these restrictions so that users can visit any UK website.
       </P><P>
        Quality VPN providers will also offer other server connections that appear to be in several different countries, including the US, Europe and Asia. Each British VPN provider will offer a different number of servers in different locations. The more servers that the service provides, the more options users have.
       </P><P>
        Is Security That Important?
       </P><P>
        Many think that they have nothing to hide, but information of any kind is valuable to hackers and phishers on the Internet. Each time a person transfers data through the Internet, that information is open to interception. But a United Kingdom VPN provides an encrypted, secure connection so that data stays private. This is how a secure connection works:
       </P><P>
        ΓÇó The United Kingdom VPN provider provides an application that connects to the server. With a few clicks, the connection is configured and ready to go.

        ΓÇó The VPN replaces the actual IP address with one that isn't tied to a physical location.

        ΓÇó The United Kingdom VPN encrypts all information that the user sends and receives online.

        ΓÇó The VPN allows the user to bypass location based IP blocking.

        Many different British VPNs will feature 64bit, 128bit and 256bit encryption; the higher the bit, the better the encryption.

        Only the server and the client can decrypt the secure connection. If a person uses a Wi-Fi hotspot at work, school, the airport or in a restaurant, an unsecure connection can leave them vulnerable to phishing and packet sniffing. A UK VPN will keep this information safe and secure wherever a user accesses the Internet. It will keep passwords, the websites history and online activity secure. A good UK VPN provider will secure the user's Internet connection so that they can browse websites in the UK and other countries safe from hackers and enjoy the web without any worries.
       </P><H3 class="about_author">
        About the Author
       </H3><P>
        Thomas R. Hickman believes the internet should be free and open. The
        <A href="https://www.goldenfrog.com/vyprvpn/uk-vpn" target="_new">
         UK VPN Service
        </A>
        he mentions above is one tool you can use to increase your online freedom. Thomas strives to inform others about the tools available for avoiding restrictions on the web. The fastest UK VPN Provider can be found at
        <A href="https://www.goldenfrog.com/vyprvpn/uk-vpn" target="_new">
         https://www.goldenfrog.com/vyprvpn/uk-vpn
        </A></P><DIV id="a_disclaimer">
        Use and distribution of this article is subject to our
        <A href="/publisher.html" target="_blank">
         Publisher Guidelines
        </A>
        whereby the original author's information and copyright must be included.
       </DIV></DIV></DIV><DIV id="left-col"><P><B>
        John H. Hickman
       </B></P></DIV></DIV></DIV></DIV></BODY></HTML>

或这个:

<HTML lang="en-US" xmlns:FB="http://www.facebook.com/2008/fbml" xmlns:OG="http://ogp.me/ns#">
<style type="text/css">
A:before { content:' '; } 
A:after { content:' '; } 
SPAN:before { content:' '; } 
SPAN:after { content:' '; } 
</style>
<BODY class="no-js yog-ltr yog-blog" dir="ltr"><DIV class="yog-bd" id="yog-bd" role="main"><DIV class="yog-wrap yog-grid yog-24u"><DIV class="yog-col yog-16u yom-primary"><DIV class="yog-wrap yom-art-bd"><DIV class="yog-col yog-11u"><DIV class="yom-mod yom-art-content "><DIV class="bd"><P class="first"><SPAN class="yom-figure yom-fig-right" style="width:630px;"><SPAN class="legend">
             A clock hangs in Plantation, Fla. (Joe Raedle/Getty Images)
            </SPAN></SPAN></P><P>
           Here's everything you wanted to know about the time change this weekend.
          </P><P><STRONG>
            When is it?
           </STRONG>
           The time change begins on Sunday, March 10, at 2 a.m., when clocks are moved forward by one hour.
          </P><P><STRONG>
            Why 2 a.m.?
           </STRONG>
           The time change is set for 2 a.m. because it was decided to be the
           <A href="http://www.lifeslittlemysteries.com/81-why-does-daylight-saving-time-begin-at-2-am.html">
            least disruptive
           </A>
           time of day. Moving time forward or back an hour at that time doesnΓÇÖt change the date, which avoids confusion, and most people are asleep, or if people do work on a Sunday, itΓÇÖs usually later than 2 a.m.
          </P><P><STRONG>
            Do all states observe daylight saving time?
           </STRONG>
           Hawaii and most of Arizona donΓÇÖt
           <A href="http://www.timeanddate.com/time/us/arizona-no-dst.html">
            observe the time change
           </A>
           . U.S. territories that donΓÇÖt go on daylight saving time include American Samoa, Guam, Puerto Rico and the Virgin Islands.
          </P><P><STRONG>
            Why do we have it?
           </STRONG>
           The idea is to save electricity because there are more hours of natural light.
           <A href="http://news.nationalgeographic.com/news/2008/03/080307-daylight-saving.html">
            Studies have shown
           </A>
           the savings to be fairly nominalΓÇöthe time change leading people to switch on the lights earlier in the morning instead or cranking up the air conditioning, for example.
          </P><P><STRONG>
            What is the history of daylight saving time?
           </STRONG>
           Fun fact: The idea was first floated in 1784 by one Benjamin Franklin. While minister of France, he
           <A href="http://www.livescience.com/11069-started-daylight-saving-time.html">
            wrote the essay
           </A>
           &quot;An Economical Project for Diminishing the Cost of Light.&quot;
          </P><P>
           The idea failed to see the light of day until 1883, when the U.S. railroads instituted a standardized time for their train schedules. That time change was imposed nationally during the First World War to conserve energy, but it was repealed after the war. It became the national time again during World War II.
          </P><P>
           After that, it was up to the states to decide if they wanted it, and when it would start and end. Congress finally enacted the Uniform Time Act in 1966, which standardized the beginning and end of daylight time for the states that observed it. In 1974 and 1975, the energy crisis moved Congress to enact earlier daylight start times, which were reversed when the crisis was over.
          </P><P><A href="http://aa.usno.navy.mil/faq/docs/daylight_time.php">
            Daylight saving time
           </A>
           since then had always been in AprilΓÇöuntil the Energy Policy Act of 2005 ordered the earlier start time to begin in March 2007.
          </P></DIV></DIV></DIV></DIV><DIV class="yom-mod yom-bcarousel ymg-carousel" id="MediaInfiniteBrowse"><DIV class="hd"><DIV class="yui-carousel-nav"><DIV class="heading"><H3>
           Explore Related Content
          </H3></DIV></DIV></DIV></DIV></DIV></DIV></DIV></BODY></HTML>

我正在寻找任何方法来确定哪个标签包含标题字符串、作者字符串、发布日期和文章字符串(如果有)。
我在胡闹scrapy,但它不包含任何用于从不同站点获取此信息的通用算法。
我开始怀疑这样的事情是否可能,除了某种疯狂的评估字符串长度并希望标题标签总是比作者标签更多的字符,但少于文章标签。但这样想是非常幼稚的,输出可能非常不准确。
关于如何做的任何指示?

4

0 回答 0