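Interesting problem.
As you have already seen, you cannot reliably replace % with a space. You need more information about what will be transmitted via the URI, and can then narrow down what must be replaced and what must not, e.g.: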
%ZTest -> a space for sure
%Abababtest -> is it a space? probably... but we need to be sure that no strange characters or sequences are allowed
%23th%Affleck%20Street -> space? hex? what is what?
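You need more information to solve this reliably, for example:
- Which symbols are allowed? Or rather, which hex ranges may be decoded?
- Which query parameters are the ones that contain % as a space? (Then you can transform just those.)
- Do you also need to decode Cyrillic, Arabic, or Chinese characters?
- If a %20 appears in the URI, can we assume that no bare % is meant as a space? Or can both appear as spaces in the same URI?
With that additional information, the problem should be easier to solve.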
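To illustrate the second question: if you happen to know which query parameters carry % as a space, the repair can be limited to those values. The following is only a sketch under loud assumptions: the class ParamSpaceRepair and its repair method are hypothetical names, the URI has no fragment, and inside the listed parameters every % that is not already the start of %20 is meant as a space.

```java
import java.util.Set;
import java.util.regex.Pattern;

final class ParamSpaceRepair {
    // ASSUMPTION: in the listed parameters, any '%' not followed by "20" is a space.
    private static final Pattern PERCENT_AS_SPACE = Pattern.compile("%(?!20)");

    /** Rewrites '%' to "%20", but only in the values of the named query parameters. */
    static String repair(String uri, Set<String> spaceParams) {
        int q = uri.indexOf('?');
        if (q < 0) {
            return uri; // no query part, nothing to repair
        }
        StringBuilder out = new StringBuilder(uri.substring(0, q + 1));
        String[] pairs = uri.substring(q + 1).split("&", -1);
        for (int i = 0; i < pairs.length; i++) {
            String pair = pairs[i];
            int eq = pair.indexOf('=');
            if (i > 0) {
                out.append('&');
            }
            if (eq >= 0 && spaceParams.contains(pair.substring(0, eq))) {
                // Keep "name=" untouched and repair only the value.
                String value = PERCENT_AS_SPACE.matcher(pair.substring(eq + 1)).replaceAll("%20");
                out.append(pair, 0, eq + 1).append(value);
            } else {
                out.append(pair); // other parameters pass through unchanged
            }
        }
        return out.toString();
    }
}
```

Calling repair with Set.of("pn") leaves pa=praksh%40kmbl alone and only touches the value of pn. Whether such a per-parameter rule is valid depends entirely on what the sender actually produces.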
Nonetheless, here is a solution that may get you going in the right direction (but please consider the warnings at the bottom!):
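    import java.io.UnsupportedEncodingException;
    import java.net.URLDecoder;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // A '%' optionally followed by two hex digits; group 1 is null for a bare '%'.
    Pattern HEX_PATTERN = Pattern.compile("(?i)%([A-F0-9]{2})?");
    String CHARSET = "utf-8";
    String ENCODED_SPACE = "%20";
    // Matched against a single decoded character: any letter, any whitespace, or '@'.
    String ALLOWED_SYMBOLS = "\\p{L}|\\s|@";

    String semiDecode(String uri) throws UnsupportedEncodingException {
        Matcher m = HEX_PATTERN.matcher(uri);
        StringBuffer semiDecoded = new StringBuffer();
        while (m.find()) {
            String match = m.group();
            String hexString = m.group(1);
            String replacementString = match;
            if (hexString == null) {
                // bare '%' without two hex digits after it: treat it as a space
                replacementString = ENCODED_SPACE;
            } else {
                // alternatively to the following, just check whether the hex value is in an allowed range...
                // you may want to look up https://en.wikipedia.org/wiki/List_of_Unicode_characters for this
                String decodedSymbol = URLDecoder.decode(match, CHARSET);
                if (!decodedSymbol.matches(ALLOWED_SYMBOLS)) {
                    // not an allowed symbol: read the '%' as a space and keep the hex digits as plain text
                    replacementString = ENCODED_SPACE + hexString;
                }
            }
            // quoteReplacement keeps '$' and '\' literal, should they ever end up in the replacement
            m.appendReplacement(semiDecoded, Matcher.quoteReplacement(replacementString));
        }
        m.appendTail(semiDecoded);
        return semiDecoded.toString();
    }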
Sample usage:

    String uri = "upi://pay?pa=praksh%40kmbl&pn=Prakash%Abmar&cu=INR";
    String semiDecoded = semiDecode(uri);
    System.out.println("Input: " + uri);
    System.out.println("Semi-decoded: " + semiDecoded);
    // java.net.URI; the constructor throws URISyntaxException on invalid input
    System.out.println("Completely decoded query: " + new URI(semiDecoded).getQuery());

This prints:

    Input: upi://pay?pa=praksh%40kmbl&pn=Prakash%Abmar&cu=INR
    Semi-decoded: upi://pay?pa=praksh%40kmbl&pn=Prakash%20Abmar&cu=INR
    Completely decoded query: pa=praksh@kmbl&pn=Prakash Abmar&cu=INR
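The comment inside semiDecode mentions checking the hex value against an allowed range instead of decoding and regex-matching. A minimal sketch of that alternative could look like the following. The class HexRangeCheck and the method isAllowedEscape are hypothetical names, and the ranges (space, @, ASCII letters) are only an assumption about what you want to accept:

```java
final class HexRangeCheck {
    /**
     * Returns true if the two hex digits of an escape encode a byte that we
     * are willing to treat as a genuine escape (ASSUMED ranges, adapt as needed).
     */
    static boolean isAllowedEscape(String hexDigits) {
        int b = Integer.parseInt(hexDigits, 16); // accepts upper- and lower-case digits
        return b == 0x20                  // space
            || b == 0x40                  // '@'
            || (b >= 0x41 && b <= 0x5A)   // 'A'..'Z'
            || (b >= 0x61 && b <= 0x7A);  // 'a'..'z'
    }
}
```

With this, the else branch would rewrite the match to ENCODED_SPACE + hexString whenever isAllowedEscape(hexString) is false. Like the original, it looks at one escape at a time, so it only covers single-byte (ASCII) symbols.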
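Warnings... some things to keep in mind:
- This specific implementation does not work for Cyrillic, Chinese, or other characters that take up more than one %## escape, i.e. more than two hex digits. Each %## is checked in isolation, so a %##%## or %##%##%## sequence for a single character is no longer decoded; its escapes are rewritten as %20## instead.
- You need to adapt the allowed symbols to your needs (see ALLOWED_SYMBOLS; right now the regex accepts any letter, any whitespace, and @).
- The charset is assumed to be UTF-8.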
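Regarding the first warning: one possible direction is to look at a whole run of consecutive escapes and decode it strictly as UTF-8, so that invalid byte sequences are reported instead of silently replaced. The sketch below is only a building block, not a drop-in fix. EscapeRunDecoder and decodeRun are hypothetical names, and the input is assumed to already match (?:%[0-9A-Fa-f]{2})+:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class EscapeRunDecoder {
    /**
     * Decodes a run such as "%D0%9F" as UTF-8.
     * Returns empty if the bytes do not form valid UTF-8.
     * ASSUMPTION: the run consists solely of '%' + two hex digits, repeated.
     */
    static Optional<String> decodeRun(String run) {
        byte[] bytes = new byte[run.length() / 3];
        for (int i = 0; i < bytes.length; i++) {
            // Each escape is three characters long; the hex digits sit at offsets 1 and 2.
            bytes[i] = (byte) Integer.parseInt(run.substring(3 * i + 1, 3 * i + 3), 16);
        }
        try {
            String text = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
            return Optional.of(text);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }
}
```

A variant of semiDecode could match whole runs, keep a run unchanged when decodeRun succeeds and every decoded character is allowed, and otherwise fall back to the per-escape logic shown above. How to split a run that is only partly valid is a separate decision that again depends on what the sender can produce.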