
I want to use .NET's WebClient class to download a web page, extract the title (i.e. the content between <title> and </title>), and save the page to a file.

The problem is that the page is encoded in UTF-8, and System.IO.StreamWriter throws an exception when given a file name that contains such characters.

I googled and tried several ways of converting UTF-8 to ANSI, but to no avail. Does anyone have working code for this?

'Using WebClient asynchronous downloading
Private Sub AlertStringDownloaded(ByVal sender As Object, 
                                  ByVal e As DownloadStringCompletedEventArgs)
    If e.Cancelled = False AndAlso e.Error Is Nothing Then
        Dim Response As String = CStr(e.Result)

        'Doesn't work               
        Dim resbytes() As Byte = Encoding.UTF8.GetBytes(Response)
        Response = Encoding.Default.GetString(Encoding.Convert(Encoding.UTF8, 
                                              Encoding.Default, resbytes))

        Dim title As Regex = New Regex("<title>(.+?) \(", 
                                       RegexOptions.Singleline)
        Dim m As Match
        m = title.Match(Response)
        If m.Success Then
            Dim MyTitle As String = m.Groups(1).Value

            'Illegal characters in path.
            Dim objWriter As New System.IO.StreamWriter("c:\" & MyTitle & ".txt")
            objWriter.Write(Response)
            objWriter.Close()
        End If
    End If
End Sub

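Edit: Thanks for the help, everyone. It turns out the error was not caused by UTF-8 but by an LF character hidden in the page's title section, which is evidently an illegal character in a path.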


Edit: Here's a simple way to remove some of the illegal characters from a file name/path:

Dim MyTitle As String = m.Groups(1).Value
Dim InvalidChars As String = New String(Path.GetInvalidFileNameChars()) + New String(Path.GetInvalidPathChars())
For Each c As Char In InvalidChars
    MyTitle = MyTitle.Replace(c.ToString(), "")
Next

Edit: Here's how to tell WebClient to expect UTF-8:

Dim webClient As New WebClient
AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded
webClient.Encoding = Encoding.UTF8
webClient.DownloadStringAsync(New Uri("http://www.acme.com"))

1 Answer


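I don't think the problem has anything to do with UTF-8. I think your regular expression will include </title> if it appears on the same line. The characters < and > are not valid in Windows file names.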
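
To illustrate the point about the pattern running past the closing tag, here is a minimal sketch of a capture that cannot cross a "<" and that also collapses line breaks in the title. The names TitleSketch and ExtractTitle and the sample HTML are made up for this example, and it assumes the title element contains no nested markup. Unlike the pattern in the question, it keeps the whole title rather than stopping at " (".

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TitleSketch
    ' Hypothetical helper: returns the text between <title> and </title>,
    ' or Nothing when no title element is found.
    Function ExtractTitle(ByVal html As String) As String
        ' [^<]* cannot match "<", so the capture always stops before </title>.
        Dim m As Match = Regex.Match(html, "<title[^>]*>([^<]*)</title>",
                                     RegexOptions.IgnoreCase)
        If Not m.Success Then Return Nothing

        ' Collapse CR, LF, tabs and runs of spaces into single spaces.
        Return Regex.Replace(m.Groups(1).Value, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' Made-up sample input with an LF hidden inside the title.
        Dim html As String = "<html><head><title>Acme" & vbLf &
                             "  News (beta)</title></head></html>"
        Console.WriteLine(ExtractTitle(html)) ' prints: Acme News (beta)
    End Sub
End Module
```

The result can still contain characters such as ":" or "?" that Windows rejects in file names, so it would still need to be filtered before being used in a path.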

If that's not the problem, it would help to see some sample input and output values for MyTitle.
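
One way to produce such sample values is to dump the string character by character, which makes invisible characters such as LF show up. This is only a sketch: the names TitleDump and DumpChars and the sample string are invented for illustration.

```vbnet
Imports System

Module TitleDump
    ' Hypothetical helper: prints each character of s with its UTF-16 code,
    ' replacing control characters with "?" so they remain visible.
    Sub DumpChars(ByVal s As String)
        For Each c As Char In s
            Dim shown As String = If(Char.IsControl(c), "?", c.ToString())
            Console.WriteLine("U+{0:X4} {1}", AscW(c), shown)
        Next
    End Sub

    Sub Main()
        ' Made-up title ending in a hidden LF; the last line printed is "U+000A ?".
        DumpChars("Acme" & vbLf)
    End Sub
End Module
```

Calling DumpChars(MyTitle) just before the StreamWriter is created would show whether the value contains control characters or stray markup.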

Answered 2013-01-07T11:59:42.060