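I am trying to retrieve the first paragraph of text from a Wikipedia article, in this example UNIX, but the API returns output I do not want.

From what I have read in the Wikipedia API documentation and on StackOverflow, this is the request URL to use: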

http://en.wikipedia.org/w/api.php?format=php&action=query&titles=unix&redirects=1&prop=revisions&rvprop=content&rvsection=0&rvlimit=1

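My expected output would be:

Unix (officially trademarked as UNIX, sometimes also written as Unix in small caps) is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, Michael Lesk and Joe Ossanna.[1] The Unix operating system was first developed in assembly language, but by 1973 had been almost entirely recoded in C, greatly facilitating its further development and porting to other hardware. Today's Unix system evolution is split into various branches, developed over time by AT&T as well as various commercial vendors, universities (such as University of California, Berkeley's BSD), and non-profit organizations.

My current result: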

{{Use dmy dates|date=August 2012}}
{{Infobox OS
|name               = Unix
|logo               = 
|screenshot         = [[File:Unix history-simple.svg|250px]]
|caption            = Evolution of Unix and Unix-like systems
|website            = [http://www.unix.org unix.org]
|developer          = [[Ken Thompson (computer programmer)|Ken Thompson]], [[Dennis Ritchie]], [[Brian Kernighan]], [[Douglas McIlroy]], and [[Joe Ossanna]] at [[Bell Labs]]
|source_model       = Historically [[Closed source software|closed source]], now some Unix projects ([[Berkeley Software Distribution|BSD]] family and [[Illumos]]) are [[open source]]d.
|frequently_updated = yes <!-- Release version update? Don't edit this page, just click on the version number! -->
|programmed_in      = [[C (programming language)|C]] 
|kernel_type        = [[Monolithic Kernel|Monolithic]]
|ui                 = [[Command-line interface]] & [[Graphical user interface|Graphical]] ([[X Window System]])
|language           = English 
|family             = Unix
|released           = {{start date and age|df=yes|1969}}
|license            = [[Proprietary software|Proprietary]]
|working_state      = Current 
}}

'''Unix''' (officially trademarked as '''UNIX''', sometimes also written as '''<span style="font-variant: small-caps;">Unix</span>''' in small caps) is a [[Computer multitasking|multitasking]], [[multi-user]] computer [[operating system]] originally developed in 1969 by a group of [[American Telephone & Telegraph|AT&T]] employees at [[Bell Labs]], including [[Ken Thompson]], [[Dennis Ritchie]], [[Brian Kernighan]], [[Douglas McIlroy]], [[Michael Lesk]] and [[Joe Ossanna]].<ref name=" Ritchie">{{cite journal
  | last = Ritchie
  | first = D.M.
  | authorlink = 
  | coauthors = Thompson, K.
  | title = The UNIX Time-Sharing System
  | journal = Bell System Tech. J.
  | volume = 57
  | issue = 6
  | pages = 1905-1929
  | publisher = American Tel. & Tel.
  | location = USA
  | date = July 1978
  | url = http://www.alcatel-lucent.com/bstj/vol57-1978/articles/bstj57-6-1905.pdf
  | issn = 
  | doi = 
  | id = 
  | accessdate = December 9, 2012}}</ref>  The Unix operating system was first developed in [[assembly language]], but by 1973 had been almost entirely recoded in [[C (programming language)|C]], greatly facilitating its further development and [[Software portability|porting]] to other hardware. Today's Unix system evolution is split into various branches, developed over time by AT&T as well as various commercial vendors, universities (such as [[University of California, Berkeley]]'s [[BSD]]), and [[non-profit]] organizations.

[[The Open Group]], an industry standards consortium, owns the UNIX trademark. Only systems fully compliant with and certified according to the [[Single UNIX Specification]] are qualified to use the trademark; others might be called ''Unix system-like'' or ''[[Unix-like]]'', although the Open Group disapproves<ref>[http://www.unix.org/questions_answers/faq.html#7a  What is a "Unix-like" operating system?] Unix.org FAQ</ref> of this term.  However, the term ''Unix'' is often used informally to denote any operating system that closely resembles the trademarked system.

During the late 1970s and early 1980s, the influence of Unix in academic circles led to large-scale adoption of Unix (particularly of the [[Berkeley Software Distribution|BSD]] variant, originating from the [[University of California, Berkeley]]) by commercial startups, the most notable of which are [[Solaris (operating system)|Solaris]], [[HP-UX]], [[Sequent Computer Systems|Sequent]], and [[AIX operating system|AIX]], as well as [[Darwin (operating system)|Darwin]], which forms the core set of components upon which [[Apple Inc.|Apple]]'s [[OS X]], [[Apple TV]], and [[IOS (Apple)|iOS]] are based.<ref>{{cite web|url=http://marketshare.hitslink.com/operating-system-market-share.aspx?qprid=8&qpcustomd=0 |title=Operating system market share |publisher=Marketshare.hitslink.com |date= |accessdate=2012-08-22}}</ref><ref>{{cite web|url=http://developer.apple.com/library/mac/#documentation/MacOSX/Conceptual/OSX_Technology_Overview/SystemTechnology/SystemTechnology.html#//apple_ref/doc/uid/TP40001067-CH207-BCICAIFJ |title=Loading |publisher=Developer.apple.com |date= |accessdate=2012-08-22}}</ref> Today, in addition to certified Unix systems such as those already mentioned, [[Unix-like]] operating systems such as [[MINIX]], [[Linux]], and [[BSD]] descendants ([[FreeBSD]], [[NetBSD]], [[OpenBSD]], and [[DragonFly BSD]]) are commonly encountered. The term ''traditional Unix'' may be used to describe an operating system that has the characteristics of either [[Version 7 Unix]] or [[UNIX System V]]."

What is the correct way to retrieve the article?

Thanks in advance!

4 Answers

If you just want plain text, use TextExtracts: http://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&explaintext=1&titles=Unix

This will produce: Unix is a multitasking, multi-user computer operating system that exists in many variants. The original Unix was developed at AT&T's Bell Labs research center by Ken Thompson, Dennis Ritchie, and others. From the power user's or programmer's perspective, Unix systems are characterized by a modular design that is sometimes called the "Unix philosophy," meaning the OS provides a set of simple tools that each perform a limited, well-defined function, with a unified filesystem as the main means of communication and a shell scripting and command language to combine the tools to perform complex workflows.
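
A rough sketch of how this request could be made from Python, using only the standard library. The `exintro` and `redirects` parameters are additions that are not in the URL above (the assumption is that you only want the lead section and want "unix" to resolve to "Unix"), and the helper names and User-Agent string are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title):
    """Build a TextExtracts query URL for the plain-text lead section of `title`."""
    params = {
        "format": "json",
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # plain text instead of HTML
        "exintro": 1,      # assumption: only the text before the first heading
        "redirects": 1,    # assumption: follow redirects such as "unix" -> "Unix"
        "titles": title,
    }
    return API_URL + "?" + urlencode(params)


def first_paragraph(response):
    """Return the first non-empty paragraph of the first page that has an extract."""
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for paragraph in page.get("extract", "").split("\n"):
            if paragraph.strip():
                return paragraph.strip()
    return None  # page missing or no extract returned


def fetch_first_paragraph(title):
    """Perform the HTTP request (needs network access) and parse the reply."""
    request = Request(build_extract_url(title),
                      headers={"User-Agent": "extract-sketch/0.1"})
    with urlopen(request) as reply:
        return first_paragraph(json.load(reply))
```

Calling `fetch_first_paragraph("unix")` should then return just the opening paragraph as a string, or `None` if the page does not exist.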

answered 2014-04-22T06:32:30.873

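I have been working on this in Python. The first task is to get the text you want; after that, you need to parse the HTML and strip out all the extraneous information.

The following function gives you the nth text section (n=0 returns the summary):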

import requests

def getWikiSection(topic, n):
    # Request the rendered HTML of section n of the page (n=0 is the lead section).
    # Passing params lets requests URL-encode the topic.
    params = {'action': 'parse', 'page': topic, 'format': 'json',
              'prop': 'text', 'section': n}
    response = requests.get('http://en.wikipedia.org/w/api.php', params=params).json()
    # Look keys up by name instead of relying on the order of the dict items.
    if 'error' in response:
        print(response['error']['info'])
        return None
    return stripTags(response['parse']['text']['*'])

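Quick walkthrough: first, we build the URL for the given topic; then we fetch the JSON response; if we queried an invalid section or topic (that is, no page exists for the topic, or the section number is past the end of the page), we print an error; otherwise, we clean up the response.

The cleanup is handled by the stripTags function on the last line, which removes the HTML tags. Here it is: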

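try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser (this also calls reset()).
        HTMLParser.__init__(self)
        self.fed = []

    def handle_data(self, d):
        # Called for each run of text between tags.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)

def stripTags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any buffered trailing text
    return s.get_data()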

Of course, this can be extended to parse the text however you like. For example, I removed references like so:

import re

def removeReferences(s):
    return re.sub(r'\[[0-9]+\]', '', s)
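
To make the two cleanup steps concrete, here is a small self-contained sketch (Python 3) that chains tag stripping and reference removal on a made-up HTML fragment. The class and function names and the sample string are illustrative, not part of the code above.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


def remove_references(text):
    # Drop numeric citation markers such as [1] or [23].
    return re.sub(r"\[[0-9]+\]", "", text)


# Hypothetical fragment resembling what action=parse returns.
sample = ('<p><b>Unix</b> is an operating system.'
          '<sup><a href="#cite_note-1">[1]</a></sup></p>')
print(remove_references(strip_tags(sample)))  # prints: Unix is an operating system.
```

Stripping tags first matters here: the citation marker only becomes the plain text `[1]` once the surrounding `<sup>` and `<a>` tags are gone.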

Hope this helps.

answered 2013-05-31T11:49:06.250

Try this URL

(Note: pagename is the page you are requesting)

But this returns a bunch of HTML junk... you can filter the HTML by doing the following:

$.getJSON("http://en.wikipedia.org/w/api.php?"+"action=parse&format=json&prop=text&section=all&page=" + entry + "&redirects&callback=?", function(data)
    {   

        if (!data.error)
        {
            var markup = data.parse.text["*"];
            if (typeof markup !== "undefined")
            {
                $("#entry").text(entry).show();
                var blurb = $('<div id="articleText"></div>').html(markup);

                // remove links as they will not work
                blurb.find('a').each(function() { $(this).replaceWith($(this).html()); });

                // remove any references
                blurb.find('sup').remove();

                // remove cite error
                blurb.find('.mw-ext-cite-error').remove();
                $('#article').html($(blurb).find('p'));

                $("#article").append(link);
                // console.log(markup);
            }
        }
    });
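
For comparison, the same kind of filtering could be sketched outside the browser in Python with the standard library's `html.parser`. This is a simplified variant, not a port of the jQuery code: it keeps only the first non-empty `<p>`, drops anything inside `<sup>`, and unwraps links simply by keeping their text. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collects the text of the first non-empty <p>, skipping <sup> references."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.sup_depth = 0
        self.parts = []
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "p" and self.result is None:
            self.in_p = True
            self.parts = []
        elif tag == "sup":
            self.sup_depth += 1

    def handle_endtag(self, tag):
        if tag == "sup" and self.sup_depth:
            self.sup_depth -= 1
        elif tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self.parts).strip()
            if text:  # ignore empty placeholder paragraphs
                self.result = text

    def handle_data(self, data):
        # Keep text only while inside the paragraph and outside any <sup>.
        if self.in_p and not self.sup_depth:
            self.parts.append(data)


def first_paragraph_from_html(html):
    parser = FirstParagraph()
    parser.feed(html)
    parser.close()
    return parser.result


# Hypothetical markup resembling data.parse.text["*"].
markup = ('<table><tr><td>Infobox</td></tr></table><p></p>'
          '<p><b>Unix</b> is a <a href="/wiki/OS">multiuser system</a>.'
          '<sup>[1]</sup></p><p>Second.</p>')
print(first_paragraph_from_html(markup))  # prints: Unix is a multiuser system.
```

Because only text inside `<p>` is collected, infobox tables and other non-paragraph markup are skipped without any extra rules.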

You can read more here

answered 2014-09-29T11:26:27.533

I would check out the MobileFrontend extension, which at least gives you some <p> tags to work with. Using http://en.wikipedia.org/w/api.php?format=json&action=mobileview&page=Unix&sections=0&prop=text|sections gives:

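{"mobileview":{"sections":[{"id":0,"text":"\nUnix\n\n
\nEvolution of Unix and Unix-like systems\n\n\n\nCompany / developer\nKen Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna at Bell Labs\n\n\nProgrammed in\nC and assembly language\n\n\nOS family\nUnix\n\n\nWorking state\nCurrent\n\n\nSource model\nHistorically closed source, now some Unix projects (BSD family and Illumos) are open sourced.\n\n\nInitial release\n20 April 1969; 44 years ago (20 April 1969)\n\n\nAvailable language(s)\nEnglish\n\n\nKernel type\nMonolithic\n\n\nDefault user interface\nCommand-line interface and Graphical (X Window System)\n\n\nLicense\nProprietary\n\n\nOfficial website\nhttp://www.unix.org\">unix.org\n\n\n

Unix (officially trademarked as UNIX, sometimes also written as Unix in small caps) is a multitasking, multi-user computer operating system originally developed in 1969 by a group of

(snip)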

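You would have to take that and parse it some other way (perl, bash, etc.), but at that point you might as well drop the API and just use curl or wget, which would be just as easy.

answered 2013-07-23T22:19:46.733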