I am using ANTLR with the Presto grammar to parse SQL queries. This is the original STRING definition I used to parse queries:

STRING
    : '\'' ( '\\' .
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;

This worked for most queries until I came across queries with different escaping rules. For example:

select 
table1(replace(replace(some_col,'\\'',''),'\"' ,'')) as features 
from table1

So I modified my STRING definition, and it now looks like this:

STRING
    : '\'' ( '\\' .
           | '\\\\'  .  {HelperUtils.isNeedSpecialEscaping(this)}?       // match \\ followed by any char, when the predicate is true
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;

However, this does not work for the query above, because I get

'\\'',''),'

as a single string. The predicate returns true for this query. Any idea how I can handle this query as well?
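
To see where the single over-long token comes from, the first alternative `'\\' .` (which the modified rule still contains) can be traced by hand. The following is a minimal Java sketch, not ANTLR itself: the class `NaiveStringScan` and the method `scanString` are hypothetical names, and the scanner always treats `''` as an escaped quote. That coincides with ANTLR's longest-match behavior for this input, but it is an assumption and not a general emulation of the lexer.

```java
class NaiveStringScan {
    // Returns the STRING-like token that starts at `start`, or null if there is none.
    static String scanString(String input, int start) {
        if (start >= input.length() || input.charAt(start) != '\'') {
            return null;
        }
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {
                if (i + 1 >= input.length()) {
                    return null;              // dangling backslash, no token
                }
                i += 2;                       // '\\' . : backslash plus any one char
            } else if (c == '\'') {
                if (i + 1 < input.length() && input.charAt(i + 1) == '\'') {
                    i += 2;                   // '\'\'' : doubled quote stays inside the string
                } else {
                    return input.substring(start, i + 1); // lone quote closes the string
                }
            } else {
                i++;                          // ~[\\'] : any ordinary character
            }
        }
        return null;                          // unterminated string
    }

    public static void main(String[] args) {
        // The inner part of the problematic query, written as a Java literal.
        String query = "replace(replace(some_col,'\\\\'',''),'\\\"' ,'')";
        // prints '\\'',''),'
        System.out.println(scanString(query, query.indexOf('\'')));
    }
}
```

The two backslashes are consumed as one escape pair, so the quote that was meant to close the string is read together with the next quote as a doubled `''`, and the token only ends at the quote that precedes `\"`.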

Thanks, Neil.

2 Answers

I was finally able to solve it. This is the rule I used:

STRING
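    : '\'' ( '\\\\'  .  {HelperUtils.isNeedSpecialEscaping(this)}?           // match \\ plus any char when the predicate is true
           | '\\' (~[\\] | . {!HelperUtils.isNeedSpecialEscaping(this)}?)    // match \ plus a non-\ char, or \ plus any char when the predicate is false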
           | ~[\\']       // match anything other than \ and '
           | '\'\''       // match ''
           )*
      '\''
    ;
answered 2021-01-01T22:12:10.760

grammar Question;

sql
@init {System.out.println("Question last update 2352");}
    : replace+ EOF
    ;

replace
    : REPLACE '(' expr ')'
    ;

expr
    : ( replace | ID ) ',' STRING ',' STRING
    ;

REPLACE : 'replace' DIGIT? ;
ID      : [a-zA-Z0-9_]+ ;
DIGIT   : [0-9] ;
STRING  : '\'' '\\\\\'' '\''                // '\\''
        | '\'' '\'\'' '\''                  // ''''
        | '\'' ~[\\']* '\'\'' ~[\\']* '\''  // 'it is 8 o''clock'
        | '\'' .*? '\'' ;
NL      : '\r'? '\n'  -> channel(HIDDEN) ;
WS      : [ \t]+      -> channel(HIDDEN) ;
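
The four STRING alternatives can be sanity-checked without generating a lexer by approximating them with `java.util.regex`. This is only a sketch under stated assumptions: the class `StringAltSketch` and the method `strings` are hypothetical names, regex alternation picks the first alternative that matches whereas ANTLR picks the longest match, and `find()` simply skips everything that is not a string instead of tokenizing it. For the sample lines used in this answer the two approaches should agree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringAltSketch {
    // One regex alternative per STRING alternative, in the same order as the grammar.
    static final Pattern STRING = Pattern.compile(
          "'\\\\\\\\''"                  // '\\''  : quote, two backslashes, quote, quote
        + "|''''"                        // ''''   : a string holding one escaped quote
        + "|'[^\\\\']*''[^\\\\']*'"      // 'it is 8 o''clock'
        + "|'.*?'",                      // anything else up to the next quote
        Pattern.DOTALL);                 // let . match newlines, as ANTLR's . does

    // Collects every string literal found in the given text, left to right.
    static List<String> strings(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = STRING.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line1 = "replace1(replace(some_col,'\\\\'',''),'\\\"' ,'')";
        String line5 = "replace5(some_col,'it is 8 o''clock','8')";
        System.out.println(strings(line1));
        System.out.println(strings(line5));
    }
}
```

The first alternative is what keeps `'\\''` together as one literal; without it, the non-greedy fallback would stop at the first quote after the backslashes.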

File input.txt (without more examples, I can only guess):

replace1(replace(some_col,'\\'',''),'\"' ,'')
replace2(some_col,'''','')
replace3(some_col,'abc\tdef\tghi','xyz')
replace4(some_col,'abc\ndef','xyz')
replace5(some_col,'it is 8 o''clock','8')

Execution:

$ alias a4='java -jar /usr/local/lib/antlr-4.9-complete.jar'
$ alias grun='java org.antlr.v4.gui.TestRig'
$ a4 Question.g4
$ javac Question*.java
$ grun Question sql -tokens input.txt 
[@0,0:7='replace1',<REPLACE>,1:0]
[@1,8:8='(',<'('>,1:8]
[@2,9:15='replace',<REPLACE>,1:9]
[@3,16:16='(',<'('>,1:16]
[@4,17:24='some_col',<ID>,1:17]
[@5,25:25=',',<','>,1:25]
[@6,26:30=''\\''',<STRING>,1:26]
[@7,31:31=',',<','>,1:31]
[@8,32:33='''',<STRING>,1:32]
[@9,34:34=')',<')'>,1:34]
[@10,35:35=',',<','>,1:35]
[@11,36:39=''\"'',<STRING>,1:36]
[@12,40:40=' ',<WS>,channel=1,1:40]
[@13,41:41=',',<','>,1:41]
[@14,42:43='''',<STRING>,1:42]
[@15,44:44=')',<')'>,1:44]
[@16,45:45='\n',<NL>,channel=1,1:45]
[@17,46:53='replace2',<REPLACE>,2:0]
[@18,54:54='(',<'('>,2:8]
[@19,55:62='some_col',<ID>,2:9]
[@20,63:63=',',<','>,2:17]
[@21,64:67='''''',<STRING>,2:18]
[@22,68:68=',',<','>,2:22]
[@23,69:70='''',<STRING>,2:23]
[@24,71:71=')',<')'>,2:25]
[@25,72:72='\n',<NL>,channel=1,2:26]
[@26,73:80='replace3',<REPLACE>,3:0]
[@27,81:81='(',<'('>,3:8]
[@28,82:89='some_col',<ID>,3:9]
[@29,90:90=',',<','>,3:17]
[@30,91:105=''abc\tdef\tghi'',<STRING>,3:18]
[@31,106:106=',',<','>,3:33]
[@32,107:111=''xyz'',<STRING>,3:34]
[@33,112:112=')',<')'>,3:39]
[@34,113:113='\n',<NL>,channel=1,3:40]
[@35,114:121='replace4',<REPLACE>,4:0]
[@36,122:122='(',<'('>,4:8]
[@37,123:130='some_col',<ID>,4:9]
[@38,131:131=',',<','>,4:17]
[@39,132:141=''abc\ndef'',<STRING>,4:18]
[@40,142:142=',',<','>,4:28]
[@41,143:147=''xyz'',<STRING>,4:29]
[@42,148:148=')',<')'>,4:34]
[@43,149:149='\n',<NL>,channel=1,4:35]
[@44,150:157='replace5',<REPLACE>,5:0]
[@45,158:158='(',<'('>,5:8]
[@46,159:166='some_col',<ID>,5:9]
[@47,167:167=',',<','>,5:17]
[@48,168:185=''it is 8 o''clock'',<STRING>,5:18]
[@49,186:186=',',<','>,5:36]
[@50,187:189=''8'',<STRING>,5:37]
[@51,190:190=')',<')'>,5:40]
[@52,191:191='\n',<NL>,channel=1,5:41]
[@53,192:191='<EOF>',<EOF>,6:0]
Question last update 2352
answered 2021-01-01T23:40:22.993