c# - Parsing RSS feeds recently started throwing Document Type Definition (DTD) errors

Question

This is an error that recently started plaguing my RSS feed parser. This morning four of my RSS feeds began throwing this exception:

For security reasons DTD is prohibited in this XML document. To enable DTD processing set the DtdProcessing property on XmlReaderSettings to Parse and pass the settings into XmlReader.Create method.

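The code used to work fine, but I believe these four particular RSS feeds changed in a way that causes the problem. Either feeds that previously did not use a DTD now use one, or there was some kind of schema change that SyndicationFeed cannot parse.

So I changed the code to: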

string url = RssFeed.AbsoluteUri;
XmlReaderSettings st = new XmlReaderSettings();

st.DtdProcessing = DtdProcessing.Parse;
st.ValidationType = ValidationType.DTD;

XmlReader reader = XmlReader.Create(url,st);

SyndicationFeed feed = SyndicationFeed.Load(reader);

reader.Close();

Then I started getting this error:

The 'html' element is not declared.
   at System.Xml.XmlValidatingReaderImpl.ValidationEventHandling.System.Xml.IValidationEventHandling.SendEvent(Exception exception, XmlSeverityType severity)
   at System.Xml.Schema.BaseValidator.SendValidationEvent(String code, String arg)
   at System.Xml.Schema.DtdValidator.ProcessElement()
   at System.Xml.Schema.DtdValidator.ValidateElement()
   at System.Xml.Schema.DtdValidator.Validate()
   at System.Xml.XmlValidatingReaderImpl.ProcessCoreReaderEvent()
   at System.Xml.XmlValidatingReaderImpl.Read()
   at System.Xml.XmlReader.MoveToContent()
   at System.Xml.XmlReader.IsStartElement(String localname, String ns)
   at System.ServiceModel.Syndication.Atom10FeedFormatter.CanRead(XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load[TSyndicationFeed](XmlReader reader)
   at System.ServiceModel.Syndication.SyndicationFeed.Load(XmlReader reader)

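I have no idea where this 'html' element is coming from, since neither the feed nor any visible DTD definition in the feed (http://jobs.huskyenergy.com/RSS) mentions it. I also tried setting DtdProcessing to DtdProcessing.Ignore, but that results in the following error: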

The element with name 'html' and namespace '' is not an allowed feed format.

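This is even more confusing, because the namespace is blank and I am not sure where this godforsaken html element is coming from.

I am very close to writing my own XML reader and scrapping SyndicationFeed, but I want to make sure I have exhausted every possible solution before going down that road.

One of the RSS feeds, if it helps: http://jobs.huskyenergy.com/RSS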

score 3 · Accepted Answer

Here is a solution that gives you a new, populated SyndicationFeed object for a given RSS URL:

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
try
{
    var webClient = new WebClient();
    // pose as a browser ;-) the site appears to reject non-browser clients
    webClient.Headers.Add ("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
    // fetch feed as string
    var content = webClient.OpenRead(feedUrl);
    var contentReader = new StreamReader(content);
    var rssFeedAsString = contentReader.ReadToEnd();
    // convert feed to XML using LINQ to XML and finally create new XmlReader object
    var feed = SyndicationFeed.Load(XDocument.Parse(rssFeedAsString).CreateReader());
    // take the info from the first feed entry
    var firstFeedItem = feed.Items.FirstOrDefault();
    Console.WriteLine(firstFeedItem.Title.Text);
    Console.WriteLine(firstFeedItem.Links.FirstOrDefault().Uri.AbsoluteUri);
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}

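The site apparently only handles requests from "browsers", so the code disguises itself as one by sending a browser user-agent header. The result is: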

Summer Student UEO Regulatory & Environment Strategy - (Calgary, AB)
http://jobs.huskyenergy.com/ca/alberta/student/jobid4444904-summer-student-ueo-regulatory--environment-strategy-jobs

The WebClient class also supports asynchronous operations with events and tasks, so making the reader non-blocking is no problem.
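
As a rough illustration of the non-blocking variant, here is a minimal sketch that assumes .NET 4.5 or later, where WebClient.DownloadStringTaskAsync is available. The FeedLoader class and LoadFeedAsync method are hypothetical names introduced for this example; they are not part of the original answer.

```csharp
using System.Net;
using System.ServiceModel.Syndication;
using System.Threading.Tasks;
using System.Xml.Linq;

static class FeedLoader
{
    // Hypothetical helper: downloads the feed without blocking the caller,
    // then parses it the same way as the synchronous code above.
    public static async Task<SyndicationFeed> LoadFeedAsync(string feedUrl)
    {
        using (var webClient = new WebClient())
        {
            // Same browser-like user-agent as in the synchronous version.
            webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");

            // Awaiting the download frees the calling thread while the request is in flight.
            string rssFeedAsString = await webClient.DownloadStringTaskAsync(feedUrl);

            // Convert the string to XML with LINQ to XML and hand an XmlReader to SyndicationFeed.
            using (var reader = XDocument.Parse(rssFeedAsString).CreateReader())
            {
                return SyndicationFeed.Load(reader);
            }
        }
    }
}
```

From an async method it could be called as `var feed = await FeedLoader.LoadFeedAsync("http://jobs.huskyenergy.com/RSS");`. Exceptions from the download or from parsing surface at the await, so the try/catch from the synchronous version still applies.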

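The explanation for the html problem is as follows: the site changed something and/or they somehow no longer allow automated feed requests. The html element comes from a service interruption page. I tried to access the service (using LINQ to XML in LINQPad, so don't be surprised by the Dump method):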

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
var feedContent = XDocument.Load(feedUrl);
feedContent.Dump();
//var feed = SyndicationFeed.Load(feedContent.CreateReader());
//feed.Dump();

and got this response:

<!DOCTYPE html []>
<!--[if IE 7]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9 lt-ie8"><![endif]-->
<!--[if IE 8]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en" prefix="og: http://ogp.me/ns#" class="non-js">
  <!--<![endif]-->
  <head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width" />
    <title>
    Service Interruption
</title>
    <link rel="stylesheet" href="http://seostatic.tmp.com/SiteOutage/style.css" />
  </head>
  <body>
    <p id="outageMessage">This system is currently experiencing a service interruption. <br />We apologize for any inconvenience.</p>
  </body>
</html>

And so the html element reveals itself. :-) The site looks fine when opened in a browser, which means that XmlReader and LINQ to XML, respectively, are working correctly.
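
To catch this situation before SyndicationFeed.Load throws, one option is to look at the root element of the downloaded content first. The following is only a sketch: FeedSniffer and LooksLikeFeed are hypothetical names, and the check is deliberately simplified to the local name of the root element (rss for RSS 2.0, feed for Atom 1.0), ignoring namespaces.

```csharp
using System.Xml;
using System.Xml.Linq;

static class FeedSniffer
{
    // Hypothetical helper: returns true when the content parses as XML and
    // its root element is named "rss" or "feed"; an html page returns false.
    public static bool LooksLikeFeed(string content)
    {
        try
        {
            var root = XDocument.Parse(content).Root;
            var name = root.Name.LocalName;
            return name == "rss" || name == "feed";
        }
        catch (XmlException)
        {
            // Not well-formed XML at all (many HTML pages are not), so not a feed.
            return false;
        }
    }
}
```

With the service interruption page shown above, the root element is html, so the helper returns false and the caller can log the response body instead of getting the "not an allowed feed format" error.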
