bash - How to decode a URL-encoded string in shell?

Question

I have a file containing a list of encoded user agents. For example:

Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

I want a shell script that reads this file and writes a new file with the decoded strings.

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I have been trying to do this with the following example, but so far it has not worked.

$ echo -e "$(echo "%31+%32%0A%33+%34" | sed 'y/+/ /; s/%/\\x/g')"

My script looks like:

#!/bin/bash
for f in *.log; do
  echo -e "$(cat $f | sed 'y/+/ /; s/%/\x/g')" > y.log
done

score 112 · Accepted Answer

Here is a simple one-line solution.

$ function urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

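It may look like perl :) but it is just pure bash. No awks, no seds ... no overhead. It uses the : builtin, special parameters, pattern substitution and the echo builtin's -e option to translate hex codes into characters. See bash's man page for further details. You can use this function as a separate command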

$ urldecode https%3A%2F%2Fgoogle.com%2Fsearch%3Fq%3Durldecode%2Bbash
https://google.com/search?q=urldecode+bash

or in variable assignments, like so:

$ x="http%3A%2F%2Fstackoverflow.com%2Fsearch%3Fq%3Durldecode%2Bbash"
$ y=$(urldecode "$x")
$ echo "$y"
http://stackoverflow.com/search?q=urldecode+bash
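
To apply this to the file from the question, the function can be called once per input line. The sketch below assumes one encoded string per line; decode_file and its two arguments are hypothetical names, not part of the answer above.

```shell
#!/bin/bash
# Function from the answer above, unchanged.
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }

# decode_file IN OUT (hypothetical helper): decode every line of IN into OUT.
decode_file() {
  local line
  # IFS= and -r keep leading whitespace and backslashes intact while reading;
  # the || clause also handles a last line without a trailing newline.
  while IFS= read -r line || [[ -n $line ]]; do
    urldecode "$line"
  done < "$1" > "$2"
}
```

It would be called as decode_file agents.log agents.decoded.log (both file names are placeholders).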

score 27 · Accepted Answer

If you are a Python developer, this may be preferable:

For Python 3.x (default):

echo -n "%21%20" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"

For Python 2.x (deprecated):

echo -n "%21%20" | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());"

urllib is really good at handling URL parsing.
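
One detail worth knowing: unquote leaves '+' untouched, whereas unquote_plus also turns '+' into a space (the form-encoding convention). Below is a minimal sketch wrapping the latter in a shell function. The name urldecode_py is made up for illustration, and python3 is assumed to be on PATH.

```shell
#!/bin/bash
# urldecode_py STRING (hypothetical name): decode via Python's urllib.
# The string is passed as an argument, so no trailing newline from echo
# gets mixed into the data.
urldecode_py() {
  python3 -c 'import sys
from urllib.parse import unquote_plus
print(unquote_plus(sys.argv[1]))' "$1"
}

urldecode_py 'q%3Durldecode+bash%21'
```

This prints q=urldecode bash!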

score 20 · Accepted Answer

Using BASH, to read percent-encoded URLs from standard input and decode them:

while read; do echo -e ${REPLY//%/\\x}; done

Press CTRL-D to signal end of file (EOF) and exit gracefully.

You can decode the contents of a file by making the file standard input:

while read; do echo -e ${REPLY//%/\\x}; done < file

You can decode input from a pipe, for example:

echo 'a%21b' | while read; do echo -e ${REPLY//%/\\x}; done

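The read builtin reads standard input until it sees a newline character. It sets a variable called REPLY equal to the line of text it just read.
${REPLY//%/\\x} replaces all instances of '%' with '\x'.
echo -e interprets \xNN as the ASCII character with hexadecimal value NN.
while repeats this loop until the read command fails, e.g. because EOF has been reached.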

The above does not change '+' to ' '. To change '+' to ' ' as well, like guest's answer:

while read; do : "${REPLY//%/\\x}"; echo -e ${_//+/ }; done

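: is a BASH builtin command. Here it just takes a single argument and does nothing with it.
The double quotes make everything inside them one single argument.
_ is a special parameter that is equal to the last argument of the previous command, after expansion. Here that is the value of REPLY with all instances of '%' replaced with '\x'.
${_//+/ } replaces all instances of '+' with ' '.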

This uses only BASH without starting any other process, similar to guest's answer.
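
A slightly more defensive variant of the loops above, offered as a sketch rather than a drop-in replacement: read -r with an empty IFS keeps backslashes and leading whitespace from being altered while reading, and quoting the expansion stops the shell from collapsing runs of spaces or expanding globs in the decoded text. The function name decode_stream is an assumption for illustration.

```shell
#!/bin/bash
# decode_stream (hypothetical name): decode stdin line by line to stdout.
decode_stream() {
  local line
  while IFS= read -r line || [[ -n $line ]]; do
    line=${line//+/ }                  # '+' -> space
    printf '%b\n' "${line//%/\\x}"     # '%NN' -> '\xNN', expanded by %b
  done
}
```

It can be used as decode_stream < file or echo 'a%21b' | decode_stream. Like the original loops, it also expands any backslash escapes that are already present in the input.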

score 15 · Accepted Answer

This seems to work for me.

#!/bin/bash
urldecode(){
  echo -e "$(sed 's/+/ /g;s/%\(..\)/\\x\1/g;')"
}

for f in /opt/logs/*.log; do
    name=${f##/*/}
    cat $f | urldecode > /opt/logs/processed/$HOSTNAME.$name
done

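Simply replacing + with a space and % signs with '\x' escapes, then letting echo interpret the \x escapes with the '-e' option, was not working. For some reason, the cat command was printing the % sign in its own encoded form, %25. So sed was simply replacing %25 with \x25. When the -e option was used, it simply evaluated \x25 as %, and the output was the same as the original.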

Trace:

Original: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

sed: Mozilla\x252F5.0\x2520\x2528Macintosh\x253B\x2520U\x253B\x2520Intel\x2520Mac\x2520OS\x2520X\x252010.6\x253B\x2520en

echo -e: Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en

Fix: basically ignore the 2 characters after the % in sed.

sed: Mozilla\x2F5.0\x20\x28Macintosh\x3B\x20U\x3B\x20Intel\x20Mac\x20OS\x20X\x2010.6\x3B\x20en

echo -e: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

I am not sure what complications this might cause under more extensive testing, but it works for now.

score 7 · Accepted Answer

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/pack H2,$1/gie' ./*.log

Pass -i to update the files in place (some sed implementations have borrowed that from perl), with .back as the backup extension.

s/x/y/e replaces x with the result of evaluating y as perl code.

The perl code in this case uses pack to pack the hex number captured in $1 (the first pair of parentheses in the regex) as the corresponding character.

An alternative to pack is chr(hex($1)):

perl -pi.back -e 'y/+/ /;s/%([\da-f]{2})/chr hex $1/gie' ./*.log

If available, you can also use uri_unescape() from URI::Escape:

perl -pi.back -MURI::Escape -e 'y/+/ /;$_=uri_unescape$_' ./*.log

score 6 · Accepted Answer

Bash script for doing it in native Bash (original source):

LANG=C

urlencode() {
    local l=${#1}
    for (( i = 0 ; i < l ; i++ )); do
        local c=${1:i:1}
        case "$c" in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            ' ') printf + ;;
            *) printf '%%%.2X' "'$c"
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\x}"
}

If you want to urldecode file content, just pass the file content as the argument.
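
For example, the file can be read with $(< file). Command substitution drops trailing newlines, so this sketch is only suitable when those do not matter. The wrapper name urldecode_file is hypothetical.

```shell
#!/bin/bash
# Decoder from the script above.
urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\\x}"
}

# urldecode_file FILE (hypothetical name): decode the whole file at once.
urldecode_file() {
    urldecode "$(< "$1")"
}
```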

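Here is a test that stops running if the decoded version of the encoded file content differs from the original (if it keeps running for a few seconds, the script probably works correctly):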

while true
  do cat /dev/urandom | tr -d '\0' | head -c1000 > /tmp/tmp;
     A="$(cat /tmp/tmp; printf x)"
     A=${A%x}
     A=$(urlencode "$A")
     urldecode "$A" > /tmp/tmp2
     cmp /tmp/tmp /tmp/tmp2
     if [ $? != 0 ]
       then break
     fi
done
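
A quicker, deterministic spot check than the random loop is to round-trip one fixed string. This sketch restates the two functions (with printf '%s' instead of printf "$c", a small defensive change that is not in the original) and uses an arbitrary ASCII sample:

```shell
#!/bin/bash
LANG=C

urlencode() {
    local i c
    for (( i = 0; i < ${#1}; i++ )); do
        c=${1:i:1}
        case "$c" in
            [a-zA-Z0-9.~_-]) printf '%s' "$c" ;;   # unreserved: copy as-is
            ' ') printf + ;;                       # space -> '+'
            *) printf '%%%.2X' "'$c" ;;            # anything else -> %XX
        esac
    done
}

urldecode() {
    local data=${1//+/ }
    printf '%b' "${data//%/\\x}"
}

sample='a b&c=d/e?f+g%h'          # arbitrary test input
encoded=$(urlencode "$sample")
decoded=$(urldecode "$encoded")
printf '%s\n%s\n' "$encoded" "$decoded"
[[ $decoded == "$sample" ]] && echo "round trip OK"
```

The encoded form printed on the first line is a+b%26c%3Dd%2Fe%3Ff%2Bg%25h, and the second line is the original sample again.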

score 5 · Accepted Answer

If you have php installed on your server, you can very easily "cat" or even "tail" any file containing url-encoded strings.

tail -f nginx.access.log | php -R 'echo urldecode($argn)."\n";'

score 5 · Accepted Answer

As @barti_ddu said in the comments, \x "should be [double-]escaped".

% echo -e "$(echo "Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en" | sed 'y/+/ /; s/%/\\x/g')"
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

Rather than mixing Bash and sed, I would do this all in Python. Here is a rough cut of how:

#!/usr/bin/env python3

import glob
import os
from urllib.parse import unquote  # Python 3 location of unquote

for logfile in glob.glob(os.path.join('.', '*.log')):
    with open(logfile) as current:
        new_log_filename = logfile + '.new'
        with open(new_log_filename, 'w') as new_log_file:
            for url in current:
                unquoted = unquote(url.strip())
                new_log_file.write(unquoted + '\n')

score 3 · Accepted Answer

Building on some of the other answers, but for the POSIX world, the following function can be used:

url_decode() {
    printf '%b\n' "$(sed -E -e 's/\+/ /g' -e 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}

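It uses printf '%b\n' because there is no echo -e, and it splits the sed call into separate expressions to make it easier to read, using -E so that the group can be referenced with \1. It also requires that whatever follows % looks like a hex code.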
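
The effect of that last point can be seen with an input containing a bare '%'. The sample string is made up. Note also that this relies on the shell's printf %b understanding \xNN, which bash does but which POSIX does not require.

```shell
#!/bin/bash
url_decode() {
    printf '%b\n' "$(sed -E -e 's/\+/ /g' -e 's/%([0-9a-fA-F]{2})/\\x\1/g')"
}

# A '%' that is not followed by two hex digits is passed through unchanged.
printf '%s\n' '100%+sure%21' | url_decode
```

This prints 100% sure!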

score 3 · Accepted Answer

bash idiom for url-decoding

Here is a bash idiom for url-decoding a string held in variable x and assigning the result to variable y:

: "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"

Unlike the accepted answer, it preserves trailing newlines during assignment. (Try assigning the result of url-decoding v%0A%0A%0A to a variable.)

It is also fast. It is 6700% faster than the accepted answer at assigning the result of url-decoding to a variable.

Caveat: a bash variable cannot contain a NUL. For example, any bash solution that attempts to decode %00 and assign the result to a variable will not work.
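
The trailing-newline difference can be made visible by comparing string lengths. A small sketch; the variable names y_subst and y_idiom are chosen here for illustration:

```shell
#!/bin/bash
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }   # accepted answer

x='v%0A%0A%0A'

y_subst=$(urldecode "$x")                                # command substitution
: "${x//+/ }"; printf -v y_idiom '%b' "${_//%/\\x}"      # the idiom

echo "command substitution: ${#y_subst} character(s)"
echo "printf -v idiom:      ${#y_idiom} character(s)"
```

The first line reports 1 character (only the v survives) and the second reports 4 (the v plus three newlines).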

Benchmark details

function.sh

#!/bin/bash
urldecode() { : "${*//+/ }"; echo -e "${_//%/\\x}"; }
x=%21%20
for (( i=0; i<5000; i++ )); do
  y=$(urldecode "$x")
done

idiom.sh

#!/bin/bash
x=%21%20
for (( i=0; i<5000; i++ )); do
  : "${x//+/ }"; printf -v y '%b' "${_//%/\\x}"
done

$ hyperfine --warmup 5 ./function.sh ./idiom.sh
Benchmark #1: ./function.sh
  Time (mean ± σ):      2.844 s ±  0.036 s    [User: 1.728 s, System: 1.494 s]
  Range (min … max):    2.801 s …  2.907 s    10 runs
 
Benchmark #2: ./idiom.sh
  Time (mean ± σ):      42.4 ms ±   1.0 ms    [User: 40.7 ms, System: 1.1 ms]
  Range (min … max):    40.5 ms …  44.8 ms    64 runs
 
Summary
  './idiom.sh' ran
   67.06 ± 1.76 times faster than './function.sh'

If you really want a function ...

If you really want a function, say for readability reasons, I suggest the following:

# urldecode [-v var ] argument
#
#   Urldecode the argument and print the result.
#   It replaces '+' with SPACE and then percent decodes.
#   The output is consistent with https://meyerweb.com/eric/tools/dencoder/
#
# Options:
#   -v var    assign the output to shell variable VAR rather than
#             print it to standard output
#
urldecode() {
  local assign_to_var=
  local OPTIND opt
  while getopts ':v:' opt; do
    case $opt in
      v)
        local var=$OPTARG
        assign_to_var=Y
        ;;
      \?)
        echo "$FUNCNAME: error: -$OPTARG: invalid option" >&2
        return 1
        ;;
      :)
        echo "$FUNCNAME: error: -$OPTARG: this option requires an argument" >&2
        return 1
        ;;
      *)
        echo "$FUNCNAME: error: an unexpected execution path has occurred." >&2
        return 1
        ;;
    esac
  done
  shift "$((OPTIND - 1))"
  if [[ $assign_to_var ]]; then
    : "${1//+/ }"; printf -v "$var" %b "${_//%/\\x}"
  else
    : "${1//+/ }"; printf %b "${_//%/\\x}"
  fi
}

Example of assigning the result of decoding to a shell variable:

x='v%0A%0A%0A'
urldecode -v y "$x"
echo -n "$y" | od -An -tx1

Result:

 76 0a 0a 0a

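This function, while not as fast as the idiom above, is still 1300% faster than the accepted answer at doing assignments because no subshell is involved. In addition, as the example's output shows, it preserves trailing newlines because no command substitution is involved.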

score 2 · Accepted Answer

With GNU awk:

LC_ALL=C gawk -vRS='%[[:xdigit:]]{2}' '
  RT {RT = sprintf("%c",strtonum("0x" substr(RT, 2)))}
  {gsub(/\+/," ");printf "%s", $0 RT}'

This takes URI-encoded input on stdin and prints the decoded output on stdout.

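We set the record separator to a regexp that matches a %XX sequence. In GNU awk, the input that matched it is stored in the RT special variable. We extract the hex digits from there, append them to "0x" and pass the result to strtonum() to convert it to a number, which is in turn passed to sprintf("%c"), which in the C locale converts it to the corresponding byte value.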

score 2 · Accepted Answer

Updating Jay's answer for Python 3.5+:
echo "%31+%32%0A%33+%34" | python -c "import sys; from urllib.parse import unquote ; print(unquote(sys.stdin.read()))"

Still, brendan's bash solution with its explanation seems more direct and elegant.

score 2 · Accepted Answer

With sed:

#!/bin/bash
URL_DECODE="$(echo "$1" | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g')"
echo -e "$URL_DECODE"

s/%([0-9a-fA-F]{2})/\\x\1/g replaces % with \x to turn the URL-encoded sequences into hex escapes
s/\+/ /g replaces + with a space ' ', in case + is used in the query string

Just save it as decodeurl.sh and make it executable with chmod +x decodeurl.sh

If you also need a way to encode, this complete code will help:

#!/bin/bash
#
# URL encoding and decoding with sed
#
# By Daniel Cambría
# daniel.cambria@bureau-it.com
#
# Jul 2021

function url_decode() {
echo "$@" \
    | sed -E 's/%([0-9a-fA-F]{2})/\\x\1/g;s/\+/ /g'
}

function url_encode() {
    # Per RFC 3986
    echo "$@" \
    | sed \
    -e 's/ /%20/g' \
    -e 's/:/%3A/g' \
    -e 's/,/%2C/g' \
    -e 's/\?/%3F/g' \
    -e 's/#/%23/g' \
    -e 's/\[/%5B/g' \
    -e 's/\]/%5D/g' \
    -e 's/@/%40/g' \
    -e 's/!/%21/g' \
    -e 's/\$/%24/g' \
    -e 's/&/%26/g' \
    -e "s/'/%27/g" \
    -e 's/(/%28/g' \
    -e 's/)/%29/g' \
    -e 's/\*/%2A/g' \
    -e 's/\+/%2B/g' \
    -e 's/,/%2C/g' \
    -e 's/;/%3B/g' \
    -e 's/=/%3D/g'
}

echo -e "URL decode: " $(url_decode "$1")
echo -e "URL encode: " $(url_encode "$1")

score 1 · Accepted Answer

Using the zsh shell (instead of bash), the only shell whose variables can hold any byte value, including NUL (encoded as %00):

set -o extendedglob +o multibyte
string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
decoded=${${string//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}

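${var//pattern/replacement}: ksh-style parameter expansion operator that expands to the value of $var with every string matching pattern replaced with replacement.
(#b): activates back-references, so that each parenthesized part of the pattern can be accessed as the corresponding $match[n] in the replacement.
(#c2): equivalent of ERE {2}
${(#)param-expansion}: parameter expansion where the # flag causes the result to be interpreted as an arithmetic expression and the corresponding byte value to be returned.
${var:-value}: expands to value if $var is empty; here it is applied to no variable at all, so we can specify an arbitrary string as the subject of a parameter expansion.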

To make it a function that decodes the contents of a variable in place:

uridecode_var() {
  emulate -L zsh
  set -o extendedglob +o multibyte
  eval $1='${${'$1'//+/ }//(#b)%([[:xdigit:]](#c2))/${(#):-0x$match[1]}}'
}

$ string='Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en'
$ uridecode_var string
$ print -r -- $string
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en

score 1 · Accepted Answer

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(echo -e "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$

score 0 · Accepted Answer

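Here is a solution done in pure bash, where the input and output are bash variables. It decodes '+' as a space and handles the '%20' space, as well as other %-encoded characters.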

#!/bin/bash
#here is text that contains both '+' for spaces and a %20
text="hello+space+1%202"
decoded=$(echo -e `echo $text | sed 's/+/ /g;s/%/\\\\x/g;'`)
echo decoded=$decoded

score 0 · Accepted Answer

python, for zshrc

# Usage: decodeUrl %3A%2F%2F
function decodeUrl(){
    echo "$1" | python3 -c "import sys; from urllib.parse import unquote; print(unquote(sys.stdin.read()));"    
}

# Usage: encodeUrl https://google.com/search?q=urldecode+bash
#          return: https://google.com/search\?q\=urldecode+bash
function encodeUrl(){
    echo "$1" | python3 -c "import sys; from urllib.parse import quote; print(quote(sys.stdin.read()));"
}

score 0 · Accepted Answer

Extending https://stackoverflow.com/a/37840948/8142470
to work with HTML entities

$ htmldecode() { : "${*//+/ }"; echo -e "${_//&#x/\x}" | tr -d ';'; }
$ htmldecode "http://google.com/search&?q=urldecode+bash"
http://google.com/search&?q=urldecode+bash

(the argument must be quoted)

score -1 · Accepted Answer

Just wanted to share this other solution, pure bash:

encoded_string="Mozilla%2F5.0%20%28Macintosh%3B%20U%3B%20Intel%20Mac%20OS%20X%2010.6%3B%20en"
printf -v encoded_string "%b" "${encoded_string//\%/\x}"
echo $encoded_string

score -1 · Accepted Answer

A slightly modified version of the Python answer that takes an input file and an output file in a one-liner.

cat inputfile.txt | python -c "import sys, urllib as ul; print ul.unquote(sys.stdin.read());" > ouputfile.txt

score -5 · Accepted Answer

$ uenc='H%C3%B6he %C3%BCber%20dem%20Meeresspiegel'
$ utf8=$(printf "${uenc//%/\\x}")
$ echo $utf8
Höhe über dem Meeresspiegel
$
