I am trying to detect file encodings using LispWorks.
LispWorks should be capable of this; see External Formats and File Streams.
[Note: details based on comments from @rainer-joswig and @svante]
system:*file-encoding-detection-algorithm* is set to its default value,
(setf system:*file-encoding-detection-algorithm*
      '(find-filename-pattern-encoding-match
        find-encoding-option
        detect-utf32-bom
        detect-unicode-bom
        detect-utf8-bom
        specific-valid-file-encoding
        locale-file-encoding))
and,
;; Specify the correct characters
(lw:set-default-character-element-type 'cl:character)
Some files are available here for verification.
UNICODE and LATIN-1 are detected correctly:
;; UNICODE
;; http://www.humancomp.org/unichtm/tongtwst.htm
(with-open-file (ss "/tmp/tongtwst.htm")
  (stream-external-format ss))
;; => (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)
;; LATIN-1
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
Detecting UTF-8 does not work out of the box,
;; UTF-8 encoding
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
Adding UTF-8 to *specific-valid-file-encodings* makes it work,
(pushnew :utf-8 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:UTF-8)
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :CRLF)
But now the same LATIN-1 file as above is detected as UTF-8,
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :LF)
Pushing LATIN-1 onto *specific-valid-file-encodings* as well,
(pushnew :latin-1 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:LATIN-1 :UTF-8)
;; This one works again
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
;; But this one, which was properly detected as `UTF-8`,
;; is now detected as `LATIN-1`, *which is wrong.*
(with-open-file (ss "/tmp/tongtws8.htm")
  (stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
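To rule out the test files themselves, here is a minimal sketch in portable Common Lisp (no LispWorks-specific calls) that counts the bytes at or above #x80 in a file. The helper name count-non-ascii-bytes is made up for illustration and is not part of LispWorks. A count of 0 would mean the file is pure ASCII, in which case it is byte-for-byte valid as both LATIN-1 and UTF-8, and the content alone cannot tell the two apart.

```lisp
;; Hypothetical helper, not part of LispWorks:
;; count the bytes in the file at PATH whose value is >= #x80.
(defun count-non-ascii-bytes (path)
  ;; Open as a binary stream so no external format is involved.
  (with-open-file (in path :element-type '(unsigned-byte 8))
    (loop for byte = (read-byte in nil nil) ; NIL at end of file
          while byte
          count (>= byte #x80))))

;; Usage with the files from above (results not shown, I have not
;; recorded them here):
;; (count-non-ascii-bytes "/tmp/windows-1252-2000.ucm")
;; (count-non-ascii-bytes "/tmp/tongtws8.htm")
```

Reading the file as raw octets keeps this check independent of whatever the detection algorithm decides.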
What am I doing wrong?
How can I correctly detect file encodings using LispWorks?