I am writing a Java program that deminifies an HTML/XML file, turning a single line into multiple, structured lines. The method is simple: I use a regex to split the single string into multiple substrings and append a newline (\n) to each of them. However, the program does not split my string at all. Could anyone help me with this? Below is my program:

package Deminifier;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStreamReader;

public class Deminifier {

    public static void main(String[] args) {

        Deminifier demo = new Deminifier ();
        demo.execute();
    }

    public void execute(){
        BufferedReader br = null;
        String currentLine;
        try {
            br = new BufferedReader(new FileReader("myfile.txt"));


        while((currentLine = br.readLine())!= null){
            System.out.println("Input text is as follows:");
            System.out.println(currentLine);
            Deminifier demo = new Deminifier();
            System.out.println("Output Formatted text is as follows:");
            demo.toDeminify(currentLine);
        }
        } catch (Exception e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
    private void toDeminify(String currentLine) {
        String lineToParse = currentLine;
        String returnString =null;
        String[] splitString = (lineToParse.split("</([A-Z][A-Z0-9_]*)\b[^>]*>"));
        System.out.println("Number of lines:"+splitString.length);
        for (String s : splitString) {
            System.out.println(s+"\n");
        }

    }
}

Can anyone help me with this? Why does my String array "splitString" contain just one element? I have tried the regular expression in another application of mine, and there it works (it identifies all end tags).

3 Answers

Could it be a file encoding issue? If the file is in UTF-8 but FileReader expects US-ASCII, you could run into this problem.
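To rule encoding out, one option is to name the charset explicitly instead of relying on whatever FileReader picks up as the default. The following is only a minimal sketch: it assumes the file really is UTF-8, and the class name Utf8ReadDemo and the helper readFirstLine are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReadDemo {

    // Hypothetical helper: reads the first line of a stream, decoding it as UTF-8
    // (assumption: the input is UTF-8 encoded).
    static String readFirstLine(InputStream in) {
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // "myfile.txt" is the file name used in the question.
        System.out.println(readFirstLine(new FileInputStream("myfile.txt")));
    }
}
```

If the output looks the same as with FileReader, encoding is probably not the cause here.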

Answered 2013-03-13T13:15:29.810

Your regex seems to assume that the HTML is all uppercase. Is that really the case?

Otherwise, try

</([a-zA-Z][a-zA-Z0-9_]*)\b[^>]*>

which can also be written more briefly as

</[a-zA-Z]\w*?>

(I think so; I haven't tested it.)
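One thing to watch when pasting the first pattern above into Java source: inside a string literal, "\b" is the backspace character, not the regex word boundary, so the backslash has to be doubled ("\\b"). The question's code uses a single backslash, which may well be why nothing matches. A small sketch of the escaped form follows; the class name EndTagRegexDemo, the helper countEndTags, and the sample input are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EndTagRegexDemo {

    // "\\b" in Java source reaches the regex engine as \b (word boundary).
    // A single "\b" would be compiled by javac into a backspace character.
    static final Pattern END_TAG =
            Pattern.compile("</([a-zA-Z][a-zA-Z0-9_]*)\\b[^>]*>");

    // Hypothetical helper: counts how many end tags the pattern finds.
    static int countEndTags(String html) {
        Matcher m = END_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample input with three end tags.
        String sample = "<html><body><p>Hi</p></body></html>";
        System.out.println("End tags found: " + countEndTags(sample));
    }
}
```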

Answered 2013-03-13T13:30:45.783

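One problem with your code is that you are splitting on the end tag, which means the tag will not appear in any of the items in the returned array. You probably want to use something like replaceAll instead. Your regex also looks a bit suspicious, but it's hard to tell without seeing a sample input file.

You can adapt the following: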

Pattern p = Pattern.compile("</[^>]+>");
while((currentLine = br.readLine())!= null){
    System.out.println("Input text is as follows:");
    System.out.println(currentLine);
    System.out.println("Output Formatted text is as follows:");
    Matcher m = p.matcher(currentLine);
    System.out.println(m.replaceAll("$0\n"));
}

Also, in your original code you instantiate Deminifier inside the loop; you'll want to move that outside.
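To make the split versus replaceAll difference concrete, here is a small self-contained comparison using the same "</[^>]+>" pattern. It is a sketch on a made-up input string, and the class and method names (SplitVsReplaceDemo, splitOnEndTags, breakAfterEndTags) are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitVsReplaceDemo {

    static final Pattern END_TAG = Pattern.compile("</[^>]+>");

    // split() treats each end tag as a delimiter, so the tags themselves are dropped.
    static String[] splitOnEndTags(String line) {
        return END_TAG.split(line);
    }

    // replaceAll() with "$0\n" writes each matched end tag back, followed by a newline.
    static String breakAfterEndTags(String line) {
        return END_TAG.matcher(line).replaceAll("$0\n");
    }

    public static void main(String[] args) {
        String line = "<a>1</a><b>2</b>"; // made-up sample input
        System.out.println(Arrays.toString(splitOnEndTags(line))); // prints [<a>1, <b>2]
        System.out.print(breakAfterEndTags(line));                 // end tags kept, one per line
    }
}
```

With split, the closing tags are gone from the pieces; with replaceAll, every closing tag survives and is followed by a line break, which is closer to what a deminifier needs.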

Answered 2013-03-13T14:19:36.583