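I have a long list of words like the one below, and I want to split each line into a word and its meaning at the first space. I am using Apache POI because I have to read a docx file and then extract the data from it.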

    abash  humiliate, embarrass
    abdicate  relinquish power or position
    aberrant  abnormal
    abet  aid, encourage (typically of crime)
    abeyance  postponement
    aboriginal  indigenous 
    abridge  shorten
    abstemious  moderate
...

What regular expression would suit this purpose, so that I can display the result like this:

word :abash
meaning : humiliate, embarrass
...

My code is:

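import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class WordFileReader {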

    /**
     * @param args
     */
    public static void main(String[] args) {
         try {
                FileInputStream fis = new FileInputStream("E:\\important.docx");
                org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor = new XWPFWordExtractor(new XWPFDocument(fis));
                System.out.print(oleTextExtractor.getText());            
            } catch (Exception e) {
                    e.printStackTrace();
            }

    }

}

--EDIT-- Based on the suggested answer, I am now using this:

public static void main(String[] args) {
         try {
                FileInputStream fis = new FileInputStream("E:\\Words.docx");
                org.apache.poi.xwpf.extractor.XWPFWordExtractor oleTextExtractor = new XWPFWordExtractor(new XWPFDocument(fis));
                //System.out.print(oleTextExtractor.getText());

                Scanner sc = new Scanner(oleTextExtractor.getText());            
                while(sc.hasNextLine()) {
                 String line = sc.nextLine();
                 int i = line.indexOf(' ');
                 String word = line.substring(0, i);
                 String meaning = line.substring(i).trim();

                 System.out.println("word "+word);
                 System.out.println("meaning "+meaning);
                }

            } catch (Exception e) {
                    e.printStackTrace();
            }

    }

But I get:

java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(Unknown Source)
    at WordFileReader.main(WordFileReader.java:25)

4 Answers


I would use java.util.Scanner to extract the lines from the text:

Scanner sc = new Scanner(oleTextExtractor.getText());            
while(sc.hasNextLine()) {
    String line = sc.nextLine();
    ...

Then I would split each line into the word and its meaning:

 int i = line.indexOf(' ', 2);  // start at index 2 so a leading article "a" is not taken as the word
 String word = line.substring(0, i);
 String meaning = line.substring(i).trim();

Or:

 String[] parts = line.split("(?<!^a)\\s+", 2);
 String word = parts[0];
 String meaning = parts[1];
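
The StringIndexOutOfBoundsException in the question most likely comes from a line that contains no space at all (for example an empty paragraph in the extracted text), in which case indexOf returns -1. A minimal sketch of a guarded version follows; the class name WordMeaningSplitter and the helper splitEntry are made up for illustration, and the sample text stands in for whatever getText() returns. It does not handle the leading-article case covered above.

```java
import java.util.Scanner;

final class WordMeaningSplitter {

    /** Returns {word, meaning}, or null when the line has no word/meaning pair. */
    static String[] splitEntry(String line) {
        String trimmed = line.trim();
        // Limit 2: split at the first run of whitespace only, keep the rest together.
        String[] parts = trimmed.split("\\s+", 2);
        if (trimmed.isEmpty() || parts.length < 2) {
            return null; // blank line, or a single token such as "..."
        }
        return parts;
    }

    public static void main(String[] args) {
        // Stand-in for oleTextExtractor.getText(); note the blank line in the middle.
        String text = "abash  humiliate, embarrass\n\nabet  aid, encourage (typically of crime)";
        Scanner sc = new Scanner(text);
        while (sc.hasNextLine()) {
            String[] entry = splitEntry(sc.nextLine());
            if (entry == null) {
                continue; // skip lines that cannot be split
            }
            System.out.println("word : " + entry[0]);
            System.out.println("meaning : " + entry[1]);
        }
        sc.close();
    }
}
```

Skipping unsplittable lines is one possible policy; logging them instead may be preferable if missing entries matter.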
answered 2013-06-10T08:44:26.043

Use java.lang.String.split(String regex, int limit):

String[] parts = line.split("\\s", 2);  // limit 2: split at the first whitespace only
String word = parts[0];
String meaning = parts[1];
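
The limit argument caps the number of elements in the returned array, so it has to be 2 to get a word plus the rest of the line; with a limit of 1 the whole line comes back as a single element and parts[1] does not exist. A small sketch to illustrate this, using a made-up class SplitLimitDemo and helper splitWithLimit:

```java
final class SplitLimitDemo {

    // Thin wrapper around String.split so the effect of the limit is easy to compare.
    static String[] splitWithLimit(String line, int limit) {
        return line.split("\\s", limit);
    }

    public static void main(String[] args) {
        String line = "abet aid, encourage (typically of crime)";

        // Limit 1: the pattern is never applied, the array holds the whole line.
        System.out.println(splitWithLimit(line, 1).length);

        // Limit 2: one split at the first whitespace character.
        String[] parts = splitWithLimit(line, 2);
        System.out.println(parts.length);
        System.out.println("word : " + parts[0]);
        System.out.println("meaning : " + parts[1].trim());
    }
}
```

The trim() matters when word and meaning are separated by more than one space, because "\\s" matches a single whitespace character and the remainder would otherwise start with a space.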
answered 2013-06-10T08:37:28.877

You can use substring as follows:

int index = line.indexOf(" ");

System.out.println("word : " + line.substring(0, index) + "\nmeaning : " + line.substring(index + 1).trim());

answered 2013-06-10T08:41:04.210

The code below works fine for me. I use a BufferedReader to read the text from a file.

// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader("C:\\test.txt"))) {
    String line;
    while ((line = br.readLine()) != null) {
        // Split at the first space only: parts[0] is the word, parts[1] the meaning
        String[] parts = line.split(" ", 2);
        if (parts.length < 2) {
            continue; // skip blank lines and lines without a space
        }
        String word = parts[0];
        String meaning = parts[1].trim();

        System.out.println("word:" + word);
        System.out.println("meaning:" + meaning);
    }
} catch (IOException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
answered 2013-06-10T09:19:06.837