0

I have a large XML file; below is an excerpt from it:

...
<LexicalEntry id="Ait~ifAq_1">
  <Lemma partOfSpeech="n" writtenForm="اِتِّفاق"/>
  <Sense id="Ait~ifAq_1_tawaAfuq_n1AR" synset="tawaAfuq_n1AR"/>
  <WordForm formType="root" writtenForm="وفق"/>
</LexicalEntry>
<LexicalEntry id="tawaA&amp;um__1">
  <Lemma partOfSpeech="n" writtenForm="تَوَاؤُم"/>
  <Sense id="tawaA&amp;um__1_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
  <WordForm formType="root" writtenForm="وأم"/>
</LexicalEntry>    
<LexicalEntry id="tanaAgum_2">
  <Lemma partOfSpeech="n" writtenForm="تناغُم"/>
  <Sense id="tanaAgum_2_AinosijaAm_n1AR" synset="AinosijaAm_n1AR"/>
  <WordForm formType="root" writtenForm="نغم"/>
</LexicalEntry>


<Synset baseConcept="3" id="tawaAfuq_n1AR">
  <SynsetRelations>
    <SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
    <SynsetRelation relType="hyponym" targets="AinosijaAm_n1AR"/>
    <SynsetRelation relType="hypernym" targets="ext_noun_NP_420"/>
  </SynsetRelations>
  <MonolingualExternalRefs>
    <MonolingualExternalRef externalReference="13971065-n" externalSystem="PWN30"/>
  </MonolingualExternalRefs>
</Synset>
...

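I want to extract specific information from it. For a given writtenForm, taken from either <Lemma> or <WordForm>, the program takes the synset value from the <Sense> of that writtenForm (same <LexicalEntry>) and searches for every <Synset> whose id has the same value as that synset. After that, the program gives all the relations of that Synset: it displays each relType value, goes back to the <LexicalEntry> elements, looks for those whose <Sense> synset has the same value as targets, and displays their writtenForm.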

I think it is a bit complicated, but the result should look like this:

اِتِّفاق hyponym تَوَاؤُم, اِنْسِجام

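Because of memory consumption, one of the solutions is to use a stream reader. But I don't know how I should proceed to get what I want. Please help me.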

4

3 Answers

1

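A SAX parser is different from a DOM parser. It looks only at the current item and cannot see future items until they become the current item. It is one of the many tools you can use when the XML file is very large. There are plenty of alternatives as well. To name a few: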

  • SAX parser
  • DOM parser
  • JDOM parser
  • DOM4J parser
  • StAX parser

You can find tutorials for all of them here.

In my opinion, once you have learned SAX you can go straight to DOM4J or JDOM for commercial products.

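The logic of the SAX parser is that you have a handler class (UserHandler in the example below) that extends DefaultHandler and overrides some of its methods with @Override: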

XML file:

<?xml version="1.0"?>
<class>
   <student rollno="393">
      <firstname>dinkar</firstname>
      <lastname>kad</lastname>
      <nickname>dinkar</nickname>
      <marks>85</marks>
   </student>
   <student rollno="493">
      <firstname>Vaneet</firstname>
      <lastname>Gupta</lastname>
      <nickname>vinni</nickname>
      <marks>95</marks>
   </student>
   <student rollno="593">
      <firstname>jasvir</firstname>
      <lastname>singn</lastname>
      <nickname>jazz</nickname>
      <marks>90</marks>
   </student>
</class>

Handler class:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class UserHandler extends DefaultHandler {

   boolean bFirstName = false;
   boolean bLastName = false;
   boolean bNickName = false;
   boolean bMarks = false;

   @Override
   public void startElement(String uri, 
   String localName, String qName, Attributes attributes)
      throws SAXException {
      if (qName.equalsIgnoreCase("student")) {
         String rollNo = attributes.getValue("rollno");
         System.out.println("Roll No : " + rollNo);
      } else if (qName.equalsIgnoreCase("firstname")) {
         bFirstName = true;
      } else if (qName.equalsIgnoreCase("lastname")) {
         bLastName = true;
      } else if (qName.equalsIgnoreCase("nickname")) {
         bNickName = true;
      }
      else if (qName.equalsIgnoreCase("marks")) {
         bMarks = true;
      }
   }

   @Override
   public void endElement(String uri, 
   String localName, String qName) throws SAXException {
      if (qName.equalsIgnoreCase("student")) {
         System.out.println("End Element :" + qName);
      }
   }

   @Override
   public void characters(char ch[], 
      int start, int length) throws SAXException {
      if (bFirstName) {
         System.out.println("First Name: " 
            + new String(ch, start, length));
         bFirstName = false;
      } else if (bLastName) {
         System.out.println("Last Name: " 
            + new String(ch, start, length));
         bLastName = false;
      } else if (bNickName) {
         System.out.println("Nick Name: " 
            + new String(ch, start, length));
         bNickName = false;
      } else if (bMarks) {
         System.out.println("Marks: " 
            + new String(ch, start, length));
         bMarks = false;
      }
   }
}

Main class:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class SAXParserDemo {
   public static void main(String[] args) {
      try {
         // The XML shown above, saved as input.txt
         File inputFile = new File("input.txt");
         SAXParserFactory factory = SAXParserFactory.newInstance();
         SAXParser saxParser = factory.newSAXParser();
         // The parser calls back into UserHandler while it streams through the file
         UserHandler userhandler = new UserHandler();
         saxParser.parse(inputFile, userhandler);
      } catch (Exception e) {
         e.printStackTrace();
      }
   }
}
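
One caveat about the handler above: the SAX API allows a parser to deliver the text of a single element through several characters() calls, so printing on the first call and then resetting the flag can cut a value short. A minimal sketch of a more defensive variant follows. It accumulates text in a StringBuilder and reads it in endElement. The class name BufferingHandler, the helper parseFirstNames and the inline XML string are made up for illustration and are not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical variant of UserHandler: collects element text instead of printing it.
class BufferingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    final List<String> firstNames = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        text.setLength(0); // start with an empty buffer for every element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times for one text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equalsIgnoreCase("firstname")) {
            firstNames.add(text.toString().trim()); // the complete text is known only here
        }
        text.setLength(0);
    }

    // Parses an XML string and returns the text of every <firstname> element.
    static List<String> parseFirstNames(String xml) {
        try {
            BufferingHandler handler = new BufferingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.firstNames;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<class><student rollno='393'><firstname>dinkar</firstname></student></class>";
        System.out.println(parseFirstNames(xml));
    }
}
```

The same pattern extends to lastname, nickname and marks by checking qName in endElement.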
answered 2016-12-22T16:09:44.037
1
answered 2016-12-22T18:07:41.547
0

If this XML file is too large to represent in memory, use SAX.
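
For the lexicon in the question, where all the data sits in attributes, one possible shape is a single SAX pass that fills a few lookup maps and answers queries afterwards. The sketch below is only an illustration under several assumptions: the class LexiconIndex and its methods are invented here, each targets attribute is taken to hold exactly one synset id, a target with no matching lemma is printed as its raw id, and the ASCII words in the demo are placeholders for the Arabic written forms. The maps still live in memory, but they hold only the strings that are needed rather than a full DOM tree.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical index built in one SAX pass over the lexicon.
class LexiconIndex extends DefaultHandler {
    // writtenForm (Lemma or WordForm) -> synset ids of the same LexicalEntry
    final Map<String, List<String>> synsetsByForm = new HashMap<>();
    // synset id -> Lemma writtenForms of the entries whose Sense points to it
    final Map<String, List<String>> lemmasBySynset = new HashMap<>();
    // Synset id -> {relType, targets} pairs
    final Map<String, List<String[]>> relationsBySynset = new HashMap<>();

    private String lemma;                                   // state of the open LexicalEntry
    private final List<String> forms = new ArrayList<>();
    private final List<String> senses = new ArrayList<>();
    private String currentSynset;                           // id of the open Synset

    @Override
    public void startElement(String uri, String localName, String qName, Attributes a) {
        switch (qName) {
            case "Lemma": lemma = a.getValue("writtenForm"); forms.add(lemma); break;
            case "WordForm": forms.add(a.getValue("writtenForm")); break;
            case "Sense": senses.add(a.getValue("synset")); break;
            case "Synset": currentSynset = a.getValue("id"); break;
            case "SynsetRelation":
                relationsBySynset.computeIfAbsent(currentSynset, k -> new ArrayList<>())
                    .add(new String[] {a.getValue("relType"), a.getValue("targets")});
                break;
            default: break;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!qName.equals("LexicalEntry")) return;
        // The entry is complete: commit what was collected, then reset the state.
        for (String form : forms)
            synsetsByForm.computeIfAbsent(form, k -> new ArrayList<>()).addAll(senses);
        if (lemma != null)
            for (String synset : senses)
                lemmasBySynset.computeIfAbsent(synset, k -> new ArrayList<>()).add(lemma);
        lemma = null; forms.clear(); senses.clear();
    }

    // One line per distinct relation: "<form> <relType> <lemma>, <lemma>".
    List<String> relationsOf(String writtenForm) {
        Set<String> lines = new LinkedHashSet<>(); // drops repeated relations, keeps order
        for (String synset : synsetsByForm.getOrDefault(writtenForm, Collections.emptyList())) {
            for (String[] rel : relationsBySynset.getOrDefault(synset, Collections.emptyList())) {
                List<String> lemmas = lemmasBySynset.getOrDefault(rel[1], Collections.emptyList());
                String shown = lemmas.isEmpty() ? rel[1] : String.join(", ", lemmas);
                lines.add(writtenForm + " " + rel[0] + " " + shown);
            }
        }
        return new ArrayList<>(lines);
    }

    static LexiconIndex build(InputSource source) {
        try {
            LexiconIndex index = new LexiconIndex();
            SAXParserFactory.newInstance().newSAXParser().parse(source, index);
            return index;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse lexicon", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Lexicon>"
            + "<LexicalEntry id='e1'><Lemma writtenForm='ittifaq'/><Sense synset='s1'/>"
            + "<WordForm formType='root' writtenForm='wfq'/></LexicalEntry>"
            + "<LexicalEntry id='e2'><Lemma writtenForm='tawaum'/><Sense synset='s2'/></LexicalEntry>"
            + "<LexicalEntry id='e3'><Lemma writtenForm='tanagum'/><Sense synset='s2'/></LexicalEntry>"
            + "<Synset id='s1'><SynsetRelations>"
            + "<SynsetRelation relType='hyponym' targets='s2'/>"
            + "<SynsetRelation relType='hypernym' targets='ext'/>"
            + "</SynsetRelations></Synset></Lexicon>";
        LexiconIndex index = build(new InputSource(new StringReader(xml)));
        index.relationsOf("ittifaq").forEach(System.out::println);
    }
}
```

Because the whole file is indexed before any query runs, it does not matter whether the <Synset> elements come before or after the <LexicalEntry> elements they refer to. For a real file, an InputSource over a file stream would take the place of the StringReader.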

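You will need to write your SAX parser so that it maintains location. To do this, I typically use a StringBuffer, but a Stack of Strings works just as well. This part is important because it lets you track the path back to the document root, which tells you where in the document you are at any given point in time (useful when you are trying to extract only a little of the information).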

The main logic flow is as follows:

 1. When entering a node, add the node's name to the stack.
 2. When exiting a node, pop the node's name (top element) off the stack.
 3. To know your location, read your current branch of the XML from the bottom of the stack to the top of the stack.
 4. When entering a region you care about, clear the buffer you will capture the characters into.
 5. When exiting a region you care about, flush the buffer into the data structure you will return back as your output.

This way, you can efficiently skip all branches of the XML tree that you don't care about.
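
The five steps above could be sketched roughly as follows. This is an illustration only, not the answerer's own code: the class PathCapturingHandler, the slash-separated path syntax and the sample XML are all assumptions made for the example, and a Deque of element names stands in for the Stack of Strings.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that captures text only at one path of interest.
class PathCapturingHandler extends DefaultHandler {
    private final String wantedPath;                        // e.g. "/class/student/marks"
    private final Deque<String> path = new ArrayDeque<>();  // names of the currently open elements
    private final StringBuilder buffer = new StringBuilder();
    private boolean capturing;
    final List<String> captured = new ArrayList<>();

    PathCapturingHandler(String wantedPath) {
        this.wantedPath = wantedPath;
    }

    // Step 3: read the stack from the bottom (root) to the top (current element).
    private boolean atWantedPath() {
        return ("/" + String.join("/", path)).equals(wantedPath);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        path.addLast(qName);                 // step 1: push on entering a node
        capturing = atWantedPath();
        if (capturing) {
            buffer.setLength(0);             // step 4: clear the buffer on entering the region
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (capturing) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (capturing) {
            captured.add(buffer.toString().trim()); // step 5: flush the buffer into the output
        }
        path.removeLast();                   // step 2: pop on exiting a node
        capturing = atWantedPath();
    }

    // Parses an XML string and returns the text found at wantedPath.
    static List<String> capture(String xml, String wantedPath) {
        try {
            PathCapturingHandler handler = new PathCapturingHandler(wantedPath);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.captured;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<class><student><firstname>dinkar</firstname><marks>85</marks></student>"
            + "<student><firstname>Vaneet</firstname><marks>95</marks></student></class>";
        System.out.println(capture(xml, "/class/student/marks"));
    }
}
```

Everything outside the wanted path costs only a push and a pop, which is what keeps memory use flat on a very large file.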

answered 2016-12-22T15:08:49.190