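I'd like to use .NET's WebClient class to download a web page, extract its title (i.e. the text between <title> and </title>), and save the page to a file.
The problem is that the page is encoded in UTF-8, and System.IO.StreamWriter throws an exception when the file name contains such characters.
I've googled and tried several ways of converting UTF-8 to ANSI, to no avail. Does anyone have working code for this?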
'Using WebClient asynchronous downloading
Private Sub AlertStringDownloaded(ByVal sender As Object,
                                  ByVal e As DownloadStringCompletedEventArgs)
    If e.Cancelled = False AndAlso e.Error Is Nothing Then
        Dim Response As String = CStr(e.Result)
        'Doesn't work
        Dim resbytes() As Byte = Encoding.UTF8.GetBytes(Response)
        Response = Encoding.Default.GetString(Encoding.Convert(Encoding.UTF8,
                                                               Encoding.Default, resbytes))
        Dim title As Regex = New Regex("<title>(.+?) \(",
                                       RegexOptions.Singleline)
        Dim m As Match
        m = title.Match(Response)
        If m.Success Then
            Dim MyTitle As String = m.Groups(1).Value
            'Illegal characters in path.
            Dim objWriter As New System.IO.StreamWriter("c:\" & MyTitle & ".txt")
            objWriter.Write(Response)
            objWriter.Close()
        End If
    End If
End Sub
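Edit: Thanks everyone for the help. It turns out the error wasn't caused by UTF-8 but by an LF character hidden in the page's title section, which is evidently an illegal character in a path.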
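For anyone hitting the same exception, a rough way to check whether a title hides such a character is to compare each character against Path.GetInvalidFileNameChars(). This is only a sketch: the module name and the sample title are made up for illustration, and the set of invalid characters depends on the platform (on Windows it includes control characters such as LF, so the line feed below should be reported there).

```vb
Imports System
Imports System.IO

Module TitleCheckSketch
    Sub Main()
        'Hypothetical title with a hidden line feed at the end
        Dim title As String = "Some page" & vbLf
        Dim invalid As Char() = Path.GetInvalidFileNameChars()

        'Report every character that is not allowed in a file name
        For i As Integer = 0 To title.Length - 1
            If Array.IndexOf(invalid, title(i)) >= 0 Then
                Console.WriteLine("Invalid character U+{0:X4} at index {1}",
                                  AscW(title(i)), i)
            End If
        Next
    End Sub
End Module
```

Printing the code point rather than the character itself makes invisible characters like LF easy to spot.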
Edit: Here's a simple way to remove some of the illegal characters from a file name/path:
Dim MyTitle As String = m.Groups(1).Value
Dim InvalidChars As String = New String(Path.GetInvalidFileNameChars()) & New String(Path.GetInvalidPathChars())
For Each c As Char In InvalidChars
    MyTitle = MyTitle.Replace(c.ToString(), "")
Next
Edit: Here's how to tell WebClient to expect UTF-8:
Dim webClient As New WebClient
AddHandler webClient.DownloadStringCompleted, AddressOf AlertStringDownloaded
webClient.Encoding = Encoding.UTF8
'Uri needs an absolute address, including the scheme
webClient.DownloadStringAsync(New Uri("http://www.acme.com"))