java - Extract annotated text from a text file using Java code

Question

I have an annotated text file in the following format:

<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>
in <location>client/mysql.cc</location> in <application>Oracle</application> 
<application>MySQL</application> and <application>MariaDB</application> 
<version>before</version> <version>5.2</version> <vulnerability>allows
</vulnerability> <vulnerability>remote</vulnerability> 
<application>database</application> <application>servers</application> 
...
...

I would like to write Java code that parses the above text file and outputs it in the following format:

Buffer  weakness
Overflow  weakness
in   O <--- 'O' means the word has no annotation
Oracle  application
MySQL   application
...
...

I tried tokenizing the file, but then I have to parse and format the tokens a second time, and I could lose some useful information along the way.

Any help would be appreciated.

score 1 · Accepted Answer

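You can use an XML parser to read your file, for example dom4j or XOM.

If you know the XPath of the elements you are looking for, you can also use the Java XPath library available in JDK 1.5 and later to extract content from the XML. For example, to extract all weaknesses you can use the following XPath: /paragraph/weakness

Choose the library that best suits your purpose.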
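A minimal sketch of the XPath route using only the JDK (javax.xml.parsers and javax.xml.xpath), assuming the file is well-formed XML with a <paragraph> root as in the question. The class name XPathAnnotationExtractor and the tab-separated output are illustrative choices, not something the question prescribes. It selects //paragraph//text() rather than /paragraph/weakness so that unannotated words can be labelled O as well.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, named here only for illustration.
final class XPathAnnotationExtractor {

    /** Returns one "word TAB tag" line per word; words directly under <paragraph> get "O". */
    static String extract(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select every text node below <paragraph>.
            NodeList textNodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//paragraph//text()", doc, XPathConstants.NODESET);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < textNodes.getLength(); i++) {
                Node text = textNodes.item(i);
                // The enclosing element's name is the annotation.
                String parent = text.getParentNode().getNodeName();
                String tag = parent.equals("paragraph") ? "O" : parent;
                for (String word : text.getNodeValue().trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        out.append(word).append('\t').append(tag).append('\n');
                    }
                }
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse annotated text", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<paragraph><weakness>Buffer</weakness> <weakness>Overflow</weakness>\n"
                + "in <application>Oracle</application></paragraph>";
        System.out.print(extract(sample));
    }
}
```

For the small sample in main, this should print Buffer and Overflow tagged weakness, in tagged O, and Oracle tagged application, one per line.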

score 0 · Accepted Answer

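If the file really is well-formed XML (with balanced tags, all & characters escaped as &amp;, etc.), then this is straightforward with an XSLT 2.0 transformation: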

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output method="text" />
  <!-- ignore text nodes that are _entirely_ whitespace -->
  <xsl:strip-space elements="*" />

  <xsl:template match="/">
    <xsl:apply-templates select="//paragraph//text()" />
  </xsl:template>

  <xsl:template match="text()">
    <!-- name of the element that contains this text node -->
    <xsl:param name="tag" select="local-name(..)"/>
    <!-- for each word in the text node -->
    <xsl:for-each select="tokenize(normalize-space(), ' ')">
      <!-- word-TAB-tag-NL -->
      <xsl:value-of select="concat(., '&#9;', $tag, '&#10;')" />
    </xsl:for-each>
  </xsl:template>

  <!-- special case for nodes directly under <paragraph> - use "O" -->
  <xsl:template match="paragraph/text()">
    <xsl:next-match>
      <xsl:with-param name="tag" select="'O'" />
    </xsl:next-match>
  </xsl:template>

</xsl:stylesheet>

You can run it from Java using Saxon 9 HE.
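A rough sketch of running a stylesheet from Java through the standard JAXP API (javax.xml.transform). The class XsltRunner and its transform method are hypothetical names. The sketch assumes a Saxon jar is on the classpath so that TransformerFactory.newInstance() discovers it. The JDK's built-in processor only implements XSLT 1.0 and will not handle tokenize() or xsl:next-match. If discovery does not pick Saxon up, instantiating Saxon's own factory class (net.sf.saxon.TransformerFactoryImpl) directly is the usual alternative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, named here only for illustration.
final class XsltRunner {

    /** Applies the stylesheet text to the XML text and returns the result as a string. */
    static String transform(String stylesheet, String xml) {
        try {
            // Assumption: with Saxon on the classpath, JAXP returns Saxon's factory here.
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                    factory.newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Calling XsltRunner.transform(stylesheetText, annotatedXml) with the stylesheet above should then return the word/tag lines as one string, which you can write to a file.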

score 0 · Accepted Answer

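Split your text on whitespace into a string array, then for each string in the array look for the "<" symbol. If it is found, parse the string using XPath; otherwise write out the value followed by O.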

...
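String[] tokens = readLine.split("\\s+"); // split the line on runs of whitespace
for (String token : tokens) {
  if (token.isEmpty()) continue; // leading whitespace yields an empty first token
  if (token.contains("<")) {
    // the token carries an annotation tag: XPath parsing goes here
  } else {
    // plain word with no annotation
    System.out.println(token + " O");
  }
}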
...
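One caveat with splitting on whitespace first: an annotation that contains whitespace, such as <vulnerability>allows followed by a line break and </vulnerability> in the sample, ends up in two tokens. As a variation (not the code above, and not a real XML parser), the sketch below scans the whole text with a regular expression instead. The class name RegexAnnotationExtractor is hypothetical, and it assumes annotations are never nested and carry no attributes. Entities such as &amp; are left undecoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
final class RegexAnnotationExtractor {

    // Alternative 1: <tag>text</tag>  (group 1 = tag, group 2 = text)
    // Alternative 2: any other lone tag such as <paragraph>, matched only to be skipped
    // Alternative 3: a bare word outside any annotation (group 3)
    private static final Pattern TOKEN =
            Pattern.compile("<(\\w+)>([^<]*)</\\1>|</?\\w+>|([^\\s<>]+)");

    /** Returns one "word TAB tag" line per word, using "O" for unannotated words. */
    static String extract(String text) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tag;
            String body;
            if (m.group(1) != null) {          // annotated span
                tag = m.group(1);
                body = m.group(2);
            } else if (m.group(3) != null) {   // plain word
                tag = "O";
                body = m.group(3);
            } else {
                continue;                      // wrapper tag, nothing to emit
            }
            for (String word : body.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.append(word).append('\t').append(tag).append('\n');
                }
            }
        }
        return out.toString();
    }
}
```

This keeps the single-pass flavour of the split approach, but for anything beyond simple flat annotations an XML parser is the safer choice.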
