java - Is there a way to identify tokens in a string while also going by the longest substring?

Question

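I am trying to figure out how to correctly identify the tokens in an input file and report what type each one is, using spaces and newlines as delimiters. The four types the lexer should recognize are: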

Identifiers = ([a-z] | [A-Z])([a-z] | [A-Z] | [0-9])* 
Numbers = [0-9]+ 
Punctuation = \+ | \- | \* | / | \( | \) | := | ;
Keywords = if | then | else | endif | while | do | endwhile | skip

For example, if the file has a line that reads:

tcu else i34 2983 ( + +eqdQ

it should be tokenized and printed as:

identifier: tcu
keyword: else
identifier: i34
number: 2983
punctuation: (
punctuation: +
punctuation: +
identifier: eqdQ

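I cannot figure out how to make the lexer go by the longest substring when two tokens of different types sit right next to each other with no delimiter between them (as in +eqdQ above).

This is what I have tried: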

//start
public static void main(String[] args) throws IOException {

//input file//
File file = new File("input.txt");
//output file//
FileWriter writer = new FileWriter("output.txt");

//instance variables
String sortedOutput = "";
String current = "";
Scanner scan = new Scanner(file);
String delimiter = "\\s+ | \\s*| \\s |\\n|$ |\\b\\B|\\r|\\B\\b|\\t";
String[] analyze;
BufferedReader read = new BufferedReader(new FileReader(file));

//lines get read here from the .txt file
while(scan.hasNextLine()){
sortedOutput = sortedOutput.concat(scan.nextLine() + System.lineSeparator());
}
//lines are tokenized here
analyze = sortedOutput.split(delimiter);

//first line is printed here through a separate reader
current = read.readLine();
System.out.println("Current Line: " + current + System.lineSeparator());
writer.write("Current Line: " + current + System.lineSeparator() +"\n");

//string matching starts here
for(String a: analyze) 
    {
        //matches identifiers if it doesn't match with a keyword
        if(a.matches(patternAlpha))
        {
            if(a.matches(one))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(two))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(three))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(four))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(five))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(six))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(seven))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else if(a.matches(eight))
            {
                System.out.println("Keyword: " + a);
                writer.write("Keyword: "+ a + System.lineSeparator());
            }
            else
            {
                System.out.println("Identifier: " + a);
                writer.write("Identifier: "+ a + System.lineSeparator());
            }
        }
        //number check
        else if(a.matches(patternNumber))
        {
            System.out.println("Number: " + a);
            writer.write("Number: "+ a + System.lineSeparator());
        }
        //punctuation check
        else if(a.matches(patternPunctuation))
        {
            System.out.println("Punctuation: " + a);
            writer.write("Punctuation: "+ a + System.lineSeparator());
        }
        //this special case here updates the current line with the next line
        else if(a.matches(nihil)) 
        {
            System.out.println();
            current = read.readLine();
            System.out.println("\nCurrent Line: " + current + System.lineSeparator());
            writer.write("\nCurrent Line: " + current + System.lineSeparator() + "\n");
        }
        //everything not listed in regex is read as an error
        else 
        {
            System.out.println("Error reading: " + a);
            writer.write("Error reading: "+ a + System.lineSeparator());
        }
    }
//everything closes here to avoid errors
scan.close();
read.close();
writer.close();
    }
}

I would appreciate any suggestions. Thank you in advance.

score 1 · Accepted Answer

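This can definitely be done without a parser, since the tokens that are fed to a parser can almost always be defined by a regular language (the Unix tools Lex and Flex have been doing this for years; see Flex (lexical analyser generator)). I didn't want to take the time to hand-translate some Python code I have that does this into Java, but I took a few minutes to adapt it to your example. I did make a few changes that I thought were appropriate. As input to a parser, you would normally want to treat the (, ) and ; characters as distinct tokens. You would also want each reserved word to be its own token class rather than lumping them all together as keywords (or as a single KEYWORD class, which is what I have done).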
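One common way to give each reserved word its own token class is to match every word as an identifier first and then look the lexeme up in a table of reserved words. The sketch below is only an illustration under that assumption; the names ReservedWords, TokenType and classify are invented here and do not appear in the answer's code.

import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ReservedWords {

    // One token class per reserved word, plus a catch-all for identifiers (hypothetical names).
    public enum TokenType { IF, THEN, ELSE, ENDIF, WHILE, DO, ENDWHILE, SKIP, IDENTIFIER }

    private static final Map<String, TokenType> RESERVED = new HashMap<>();
    static {
        // The reserved words are the lower-case spellings of the enum constants.
        for (TokenType type : TokenType.values()) {
            if (type != TokenType.IDENTIFIER) {
                RESERVED.put(type.name().toLowerCase(Locale.ROOT), type);
            }
        }
    }

    // Classifies a complete word that already matched [a-zA-Z][0-9a-zA-Z]*.
    public static TokenType classify(String word) {
        return RESERVED.getOrDefault(word, TokenType.IDENTIFIER);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"else", "elsewhere", "endwhile", "i34"}) {
            System.out.println(classify(word) + ": " + word);
        }
    }
}

Because the whole word is matched before it is classified, a word such as elsewhere is never split into else and where. If an integer is needed to index a parse table, TokenType.ordinal() can serve that purpose.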

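The Method

1. Define your tokens with regular expressions that use named capture groups. Make sure you have one for whitespace and one for comments (if your language defines comments).
2. Include an ERROR token that matches any single character ('.' in a regular expression), so that find() always returns a match until the input is exhausted. This ERROR regex must be the last alternative in the pattern; a match on it signals an unrecognized token.
3. Put these regexes in a list, making sure that the regexes for all reserved words come before the regex for identifiers.
4. Create a single regex from the list in step 3 by joining its items with the "|" operator.
5. Search for the next match. If the match is whitespace or a comment, and those tokens carry no semantic meaning for the parser, keep matching. If it is an ERROR token, return it to the parser, but do not return successive ERROR tokens. When the input is exhausted, return an end-of-file token.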
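The steps above can be sketched with Java's named capturing groups. This is a minimal sketch rather than the answer's own code: the class name NamedGroupLexer, the tokenize helper and the group names are assumptions made for illustration, the punctuation rule follows the question's definition, and the lookahead after the keyword alternatives is an addition that stops a word like elsewhere from being split.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupLexer {

    // Steps 1-3: insertion order is the match priority. KEYWORD precedes IDENTIFIER, ERROR is last.
    private static final Map<String, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put("WHITESPACE", "\\s+");
        RULES.put("PUNCTUATION", ":=|[+*/()-]|;");
        // The lookahead rejects a keyword that is only the prefix of a longer word.
        RULES.put("KEYWORD", "(?:if|then|else|endif|while|do|endwhile|skip)(?![0-9a-zA-Z])");
        RULES.put("IDENTIFIER", "[a-zA-Z][0-9a-zA-Z]*");
        RULES.put("NUMBER", "[0-9]+");
        RULES.put("ERROR", ".");
    }

    // Step 4: wrap each rule in a named group and join the groups with "|".
    private static final Pattern PATTERN = buildPattern();

    private static Pattern buildPattern() {
        List<String> parts = new ArrayList<>();
        for (Map.Entry<String, String> rule : RULES.entrySet()) {
            parts.add("(?<" + rule.getKey() + ">" + rule.getValue() + ")");
        }
        return Pattern.compile(String.join("|", parts), Pattern.DOTALL);
    }

    // Step 5: find successive matches, skip whitespace, report only the first of a run of errors.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PATTERN.matcher(text);
        boolean lastWasError = false;
        while (m.find()) {
            for (String name : RULES.keySet()) {
                String value = m.group(name);
                if (value == null) {
                    continue; // this group did not take part in the match
                }
                boolean isError = name.equals("ERROR");
                if (!name.equals("WHITESPACE") && !(isError && lastWasError)) {
                    tokens.add(name + ": " + value);
                }
                lastWasError = isError;
                break;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("tcu else i34 2983 ( + +eqdQ").forEach(System.out::println);
    }
}

Keeping the rules in a LinkedHashMap makes the priority order explicit, and the group name doubles as the token type, so no separate group-number constants have to be kept in sync with the pattern.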

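Quick Java Implementation

This version is structured so that a next method can be called to return a Token object. Also, it is usually more convenient for the token type to be represented by an integer, since it will eventually be used to index into parse tables: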

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Lexer {

    public static class Token
    {
        public int tokenNumber;
        public String tokenValue;

        public Token(int tokenNumber, String tokenValue)
        {
            this.tokenNumber = tokenNumber;
            this.tokenValue = tokenValue;
        }
    }

    public static int WHITESPACE = 1; // group 1
    public static int PUNCTUATION = 2; // group 2 etc.
    public static int LPAREN = 3;
    public static int RPAREN = 4;
    public static int KEYWORD = 5;
    public static int IDENTIFIER = 6;
    public static int NUMBER = 7;
    public static int SEMICOLON = 8;
    public static int ERROR = 9;
    public static int EOF = 10;

    Matcher m;
    String text;
    boolean skipError;


    public static void main(String[] args) {
        Lexer lexer = new Lexer("tcu else i34 !!!! 2983 ( + +eqdQ!!!!"); // With some error characters "!" thrown in the middle and at the end
        for(;;) {
            Token token = lexer.next();
            System.out.println(token.tokenNumber + ": " + token.tokenValue);
            if (token.tokenNumber == EOF)
                break;
        }
    }

    public Lexer(String text)
    {

        String _WHITESPACE = "(\\s+)";
        String _PUNCTUATION = "((?:[+*/-]|:=))";
        String _LPAREN = "(\\()";
        String _RPAREN = "(\\))";
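        // The lookahead keeps a word such as "elsewhere" from lexing as keyword "else" + identifier "where"
        String _KEYWORD = "((?:if|then|else|endif|while|do|endwhile|skip)(?![0-9a-zA-Z]))";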
        String _IDENTIFIER = "([a-zA-Z][0-9a-zA-Z]*)";
        String _NUMBER = "([0-9]+)";
        String _SEMICOLON = "(;)";
        String _ERROR = "(.)"; // must be last and able to capture one character

        String regex = String.join("|", _WHITESPACE, _PUNCTUATION, _LPAREN, _RPAREN, _KEYWORD, _IDENTIFIER, _NUMBER, _SEMICOLON, _ERROR);

        Pattern p = Pattern.compile(regex);
        this.text = text;
        m = p.matcher(this.text);
        skipError = false;
    }

    public Token next()
    {
        Token token = null;
        for(;;) {
            if (!m.find())
                return new Token(EOF, "<EOF>");
            for (int tokenNumber = 1; tokenNumber <= 9; tokenNumber++) {
                String tokenValue = m.group(tokenNumber);
                if (tokenValue != null) {
                    token = new Token(tokenNumber, tokenValue);
                    break;
                }
            }
            if (token.tokenNumber == ERROR) {
                if (!skipError) {
                    skipError = true; // we don't want successive errors
                    return token;
                }
            }
            else {
                skipError = false;
                if (token.tokenNumber != WHITESPACE)
                    return token;
            }
        }
    }

}

Prints:

6: tcu
5: else
6: i34
9: !
7: 2983
3: (
2: +
2: +
6: eqdQ
9: !
10: <EOF>
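Note that a combined regex takes the first alternative that matches at a position, not the longest one, which is why the order of the rules matters. If a literal longest-match rule is wanted instead, one possible sketch is to try every pattern at the current position with Matcher.lookingAt() and keep the longest result, breaking ties in favour of the earlier rule. The class LongestMatchLexer and its tokenize method are hypothetical names used only for this illustration.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchLexer {

    // Ties go to the earlier rule, so "keyword" is listed before "identifier".
    private static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("keyword", Pattern.compile("if|then|else|endif|while|do|endwhile|skip"));
        RULES.put("identifier", Pattern.compile("[a-zA-Z][0-9a-zA-Z]*"));
        RULES.put("number", Pattern.compile("[0-9]+"));
        RULES.put("punctuation", Pattern.compile(":=|[+*/()-]|;"));
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            if (Character.isWhitespace(text.charAt(pos))) {
                pos++; // whitespace only separates tokens
                continue;
            }
            String bestType = "error";
            int bestLength = 0;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                // lookingAt() anchors the match at the start of the region, i.e. at pos.
                Matcher m = rule.getValue().matcher(text).region(pos, text.length());
                if (m.lookingAt() && m.end() - pos > bestLength) {
                    bestLength = m.end() - pos;
                    bestType = rule.getKey();
                }
            }
            if (bestLength == 0) {
                bestLength = 1; // no rule matched: emit a one-character error token
            }
            tokens.add(bestType + ": " + text.substring(pos, pos + bestLength));
            pos += bestLength;
        }
        return tokens;
    }

    public static void main(String[] args) {
        tokenize("elsewhere else x:=12 +eqdQ").forEach(System.out::println);
    }
}

With this approach elsewhere is an identifier because the identifier rule matches nine characters against the keyword rule's four, while else alone is a keyword because the two rules tie and the keyword rule comes first. The cost is one match attempt per rule at every position, instead of a single pass with one compiled pattern.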

Java Demo
