I am writing a program that uses `gsed` to extract multibyte characters from a CSV file.
It works for UTF-8 encoded CSV files, but not for SHIFT_JIS encoded ones.
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8 | gsed -r 's/"(.*)","(.*)"/\1 \2/'
こんにちは hello%
test % cat sjis_sample.csv | gsed -r 's/"(.*)","(.*)"/\1 \2/' | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
LINE 1:
Read the file, converting it from SHIFT_JIS to UTF-8
LINE 2:
Extract the text from the CSV file after converting it from SHIFT_JIS to UTF-8
-> Works well
LINE 3:
Extract the text from the CSV file without converting the encoding first
-> It seems that `gsed` fails to match the pattern, so nothing is captured.
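To see what `gsed` receives in LINE 3, it can help to look at the raw bytes. The sketch below is only an assumed reconstruction of `sjis_sample.csv` (the helper name `make_sjis_sample` is hypothetical), with the Shift_JIS bytes for こんにちは written as octal escapes. Every double-byte character starts with 0x82, which in UTF-8 can only be a continuation byte, so in a UTF-8 locale these are not valid characters. That is probably why `.` does not match them.

```shell
#!/bin/bash
# Assumed reconstruction of sjis_sample.csv: "こんにちは","hello" in Shift_JIS.
# Octal escapes keep the bytes exact regardless of the terminal encoding.
make_sjis_sample() {
  printf '"\202\261\202\361\202\311\202\277\202\315","hello"'
}

# Dump the bytes as one hex string, removing od's spacing and line breaks.
sample_hex=$(make_sjis_sample | od -An -tx1 | tr -d ' \n')
echo "$sample_hex"
```

The quote (0x22) and comma (0x2c) bytes are plain ASCII here, so a byte-oriented match on them is possible once the tool stops interpreting the input as UTF-8.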
Does anyone know how to use `gsed` on a SHIFT_JIS encoded file?
Thank you.
% gsed --version
gsed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.
This sed program was built without SELinux support.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
test % locale
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Solved
Thanks to @KamilCuk:
GNU sed is locale-aware. If you want to work with raw bytes (i.e., you can check which bytes represent " in Shift_JIS and feed those to sed), use:
LC_ALL=C sed ....
I set `LANG` to `C` instead of `LC_ALL`, because I could not set `LC_ALL` to `C`.
test % cat sjis_convert.sh
#!/bin/bash
LANG=C
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
test % ./sjis_convert.sh
こんにちは hello%
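For comparison, the per-command form from the quoted advice (`LC_ALL=C sed ....`) can be sketched as follows. This is a minimal illustration, not the exact setup above: the sample line is an assumed Shift_JIS rendering of the CSV row with a trailing newline added, `sed` stands in for `gsed` (on macOS with Homebrew, substitute `gsed`), and the result is shown as hex so that no `iconv` step is needed to check it.

```shell
#!/bin/bash
# Assumed sample row: "こんにちは","hello" in Shift_JIS, plus a trailing newline.
make_sjis_line() {
  printf '"\202\261\202\361\202\311\202\277\202\315","hello"\n'
}

# The LC_ALL=C prefix applies to this single sed process only,
# so sed treats the input as raw bytes and `.` matches each of them.
result_hex=$(make_sjis_line \
  | LC_ALL=C sed -E 's/"(.*)","(.*)"/\1 \2/' \
  | od -An -tx1 | tr -d ' \n')

# Hex of the extracted fields; pipe through iconv instead of od to read it as text.
echo "$result_hex"
```

One caveat with byte-wise matching: the second byte of a Shift_JIS character can fall in the ASCII range 0x40 to 0x7E (for example `\` or `|`), so patterns that contain those characters may match inside a double-byte character. The `"` and `,` used here are outside that range.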
Appendix
I could not set `LC_ALL` to `C`.
test % cat sjis_convert.sh
#!/bin/bash
LC_ALL=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
"こんにちは","hello"
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
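A likely explanation for the output above, offered as an assumption rather than a confirmed diagnosis: `LC_ALL` was not in the environment when the script started, so the plain assignment `LC_ALL=C` created a shell variable that was never exported, and `locale` and `gsed` did not inherit it. `LANG`, by contrast, was already exported, so assigning to it changed what child processes saw. The sketch below reproduces the difference with a child `sh` process.

```shell
#!/bin/bash
# Mimic the environment shown above, where LC_ALL is not set.
unset LC_ALL

# A plain assignment creates a shell variable that child processes do not inherit.
LC_ALL=C
child_before=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

# Exporting places the variable in the environment of every later command.
export LC_ALL
child_after=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

echo "before export: $child_before"
echo "after export:  $child_after"
```

If this is the cause, writing `export LC_ALL=C` in the script, or prefixing only the `gsed` command with `LC_ALL=C`, should have the same effect as the `LANG=C` workaround.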
Instead, I set `LANG` to `C`, and it worked.
test % cat ./sjis_convert.sh
#!/bin/bash
LANG=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
こんにちは hello
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=