html - Reading an HTML page using LibreOffice Basic

Question

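I'm new to LibreOffice Basic. I'm trying to write a macro in LibreOffice Calc that reads the name of a noble house of Westeros from a cell (e.g. Stark) and outputs that house's words by looking them up on the relevant page of A Wiki of Ice and Fire. It should work like this: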

[image]

Here is the pseudocode:

Read HouseName from column A
Open HtmlFile at "http://www.awoiaf.westeros.org/index.php/House_" & HouseName
Iterate through HtmlFile to find line which begins "<table class="infobox infobox-body"" // Finds the info box for the page.
Read Each Row in the table until Row begins Words
Read the contents of the next <td> tag, and return this as a string.

My problem is with the second line: I don't know how to read an HTML file. How should I do this in LibreOffice Basic?

score 0 · Accepted Answer

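There are two main problems with this.

1. Performance: Your UDF (user-defined function) has to fetch the HTTP resource for every cell in which it is used.

2. HTML: Unfortunately, there is no HTML parser in OpenOffice or LibreOffice, only an XML parser. That is why the UDF cannot parse the HTML directly.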

The following works by searching the page text directly, but it is slow and not very general:

Public Function FETCHHOUSE(sHouse as String) as String

   ' Build the page URL from the house name
   sURL = "http://awoiaf.westeros.org/index.php/House_" & sHouse

   ' Open the page as a text stream; if that fails, treat the house name as invalid
   oSimpleFileAccess = createUNOService ("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = createUNOService ("com.sun.star.io.TextInputStream")
   on error goto falseHouseName
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
   on error goto 0
   ' An empty delimiter list makes readString return the whole page
   dim delimiters() as long
   sContent = oInpDataStream.readString(delimiters(), false)

   ' Cut out the infobox table
   lStartPos = instr(1, sContent, "<table class=" & chr(34) & "infobox infobox-body" )
   if lStartPos = 0 then
     FETCHHOUSE = "no infobox on page"
     exit function
   end if   
   lEndPos = instr(lStartPos, sContent, "</table>")
   sTable = mid(sContent, lStartPos, lEndPos-lStartPos + 8)

   ' Cut out the rest of the row that contains the "Words" header
   lStartPos = instr(1, sTable, "Words" )
   if lStartPos = 0 then
     FETCHHOUSE = "no Words on page"
     exit function
   end if        
   lEndPos = instr(lStartPos, sTable, "</tr>")
   sRow = mid(sTable, lStartPos, lEndPos-lStartPos + 5)

   ' Find the opening <td ...> tag of the value cell with a regular expression
   oTextSearch = CreateUnoService("com.sun.star.util.TextSearch")
   oOptions = CreateUnoStruct("com.sun.star.util.SearchOptions")
   oOptions.algorithmType = com.sun.star.util.SearchAlgorithms.REGEXP
   oOptions.searchString = "<td[^<]*>"
   oTextSearch.setOptions(oOptions)
   oFound = oTextSearch.searchForward(sRow, 0, Len(sRow))
   If  oFound.subRegExpressions = 0 then 
     FETCHHOUSE = "Words header but no Words content on page"
     exit function   
   end if
   ' endOffset is zero-based, mid is one-based, hence the + 1
   lStartPos = oFound.endOffset(0) + 1
   lEndPos = instr(lStartPos, sRow, "</td>")
   sWords = mid(sRow, lStartPos, lEndPos-lStartPos)

   FETCHHOUSE = sWords
   exit function

   ' Reached only when the page could not be opened
   falseHouseName:
   FETCHHOUSE = "House name does not exist"

End Function
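
As a UDF, the function is called from a cell, for example =FETCHHOUSE(A1). To ease the performance problem, one option is to run the lookup once from a macro and store the results as plain text, so nothing is fetched again when the sheet recalculates. The following is only a sketch: the name FillHouseWords is made up, and it assumes the house names sit in column A of the first sheet, starting in row 1 and ending at the first empty cell, with the results going into column B.

```basic
' Hypothetical helper: writes the words for each house in column A into column B.
' Assumes the list is on the first sheet, starts in row 1 and ends at the first empty cell.
Sub FillHouseWords
   Dim oSheet As Object
   Dim lRow As Long
   Dim sHouse As String

   oSheet = ThisComponent.Sheets.getByIndex(0)
   lRow = 0                                   ' cell positions are zero-based
   Do
      ' getCellByPosition takes (column, row)
      sHouse = oSheet.getCellByPosition(0, lRow).getString()
      If sHouse = "" Then Exit Do
      ' Store the result as text, not as a formula, so it is fetched only once
      oSheet.getCellByPosition(1, lRow).setString(FETCHHOUSE(sHouse))
      lRow = lRow + 1
   Loop
End Sub
```

Each run still makes one HTTP request per house name, so it is best started by hand (Tools > Macros > Run Macro) rather than on every change.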

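A better approach would be to get the information you need from a Web API, if the wiki offers one. Do you know the people behind the wiki? If so, you could suggest it to them.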
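
Short of a dedicated Web API, a wiki that runs MediaWiki can usually return a page's raw wikitext through the action=raw URL parameter, which is easier to search than rendered HTML. The sketch below rests on several assumptions that are not confirmed for this wiki: that it accepts action=raw, and that the infobox template stores the words on a line of the form "| Words = ...". The function name FETCHHOUSERAW is made up for illustration.

```basic
' Sketch only: reads a page's raw wikitext instead of its rendered HTML.
' Assumptions: the wiki accepts "action=raw", and the infobox template
' holds the words on a line of the form "| Words = ...".
Public Function FETCHHOUSERAW(sHouse As String) As String
   Dim sURL As String, sLine As String
   Dim oSimpleFileAccess As Object, oInpDataStream As Object
   Dim lPos As Long

   sURL = "http://awoiaf.westeros.org/index.php?title=House_" & sHouse & "&action=raw"

   oSimpleFileAccess = CreateUnoService("com.sun.star.ucb.SimpleFileAccess")
   oInpDataStream = CreateUnoService("com.sun.star.io.TextInputStream")
   On Error Goto fetchFailed
   oInpDataStream.setInputStream(oSimpleFileAccess.openFileRead(sURL))
   On Error Goto 0

   ' Walk through the wikitext line by line
   Do While Not oInpDataStream.isEOF()
      sLine = Trim(oInpDataStream.readLine())
      If Left(sLine, 1) = "|" Then
         lPos = InStr(1, sLine, "=")
         If lPos > 0 Then
            ' The parameter name sits between "|" and "="
            If LCase(Trim(Mid(sLine, 2, lPos - 2))) = "words" Then
               FETCHHOUSERAW = Trim(Mid(sLine, lPos + 1))
               oInpDataStream.closeInput()
               Exit Function
            End If
         End If
      End If
   Loop
   oInpDataStream.closeInput()
   FETCHHOUSERAW = "no Words parameter found"
   Exit Function

fetchFailed:
   FETCHHOUSERAW = "page could not be read"
End Function
```

The returned value may still contain wiki markup such as link brackets or quote marks for italics, which would need to be stripped separately.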

Greetings

Axel
