TIL: Encode and decode text in Emacs Lisp
Emacs Lisp strings are arrays of characters and support Unicode. The length
function returns the number of characters, while string-bytes returns the
number of bytes in Emacs's internal representation, which is based on UTF-8.
See below for a comparison between two strings, one English, one Japanese.
(setq hello-en "hello")
(setq hello-ja "おはよう")
(length hello-en) ;; 5
(length hello-ja) ;; 4
(string-bytes hello-en) ;; 5
(string-bytes hello-ja) ;; 12
You can see the underlying character codes using string-to-list.
(string-to-list hello-en)
;; (104 101 108 108 111)
(string-to-list hello-ja)
;; (12362 12399 12424 12358)
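Going the other way works too. As a small sketch using only built-in functions (concat, string and aref), the code points above can be turned back into strings:

```elisp
;; concat accepts a list of character codes and returns a string.
(concat '(104 101 108 108 111)) ;; "hello"
;; string builds a string from its character arguments.
(apply #'string '(12362 12399 12424 12358)) ;; "おはよう"
;; aref picks out a single character code by index.
(aref "おはよう" 0) ;; 12362
(string 12362) ;; "お"
```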
How about encoding and decoding? If you run M-x list-coding-systems, it’ll
give you a list of all of the coding systems Emacs knows about. My Emacs
installation has a whole bunch including utf-8, iso-latin-1, us-ascii, and a
variety of CJK encodings. We can use encode-coding-string to turn our string
into an encoded string.
(encode-coding-string hello-en 'us-ascii)
;; "hello"
(encode-coding-string hello-ja 'us-ascii)
;; "????" not helpful
(encode-coding-string hello-ja 'japanese-shift-jis)
;; "\202\250\202\315\202\346\202\244"
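The encoded result is a unibyte string, so each element is a raw byte rather than a character. A minimal sketch to make that visible (hello-utf8 is a name made up for this example, not something defined earlier in the post):

```elisp
;; Encode to UTF-8 and look at the individual bytes.
(setq hello-utf8 (encode-coding-string "おはよう" 'utf-8))
(multibyte-string-p hello-utf8) ;; nil
(length hello-utf8) ;; 12
(string-to-list hello-utf8)
;; (227 129 138 227 129 175 227 130 136 227 129 134)
;; Shift JIS uses two bytes for each of these characters.
(length (encode-coding-string "おはよう" 'japanese-shift-jis)) ;; 8
```

Here length counts bytes, which is why the UTF-8 result has length 12 while the original string has length 4.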
We can decode the latter too, but the result comes back with a text property recording the charset. I’ve stripped that off to show that the string survives a full encode and decode round trip.
(What’s a text property? It’s metadata attached to the characters of a string or buffer, which Emacs uses for formatting: setting colours or fonts, making clickable links or, as in this case, annotating which character set the text is in.)
(setq hello-ja-sjis
      (encode-coding-string hello-ja 'japanese-shift-jis))
(setq hello-ja-decoded
      (decode-coding-string hello-ja-sjis 'japanese-shift-jis))
(setq hello-ja-decoded-wo-props
      (substring-no-properties hello-ja-decoded))
(string-equal hello-ja-decoded-wo-props hello-ja) ;; t
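To see how text properties behave on their own, here is a small sketch that attaches one by hand with propertize. The variable name bold-hello and the choice of the face property are just for illustration:

```elisp
;; Attach a face property to every character of the string.
(setq bold-hello (propertize "hello" 'face 'bold))
(get-text-property 0 'face bold-hello) ;; bold
;; string-equal compares characters only and ignores properties.
(string-equal bold-hello "hello") ;; t
;; equal-including-properties also compares the properties.
(equal-including-properties bold-hello "hello") ;; nil
(substring-no-properties bold-hello) ;; "hello"
```

Since string-equal ignores text properties, stripping them in the round trip above mainly tidies up what gets printed; the comparison would succeed either way. To inspect the properties on the decoded string itself, (text-properties-at 0 hello-ja-decoded) should list them.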
A brief summary of broadly equivalent code in Python for comparison:
hello_en = "hello"
hello_ja = "おはよう"
assert len(hello_en) == 5
assert len(hello_ja) == 4
assert len(hello_en.encode("utf-8")) == 5
assert len(hello_ja.encode("utf-8")) == 12
hello_ja_shiftjis = hello_ja.encode("shift_jisx0213")
hello_ja_roundtripped = hello_ja_shiftjis.decode("shift_jisx0213")
assert hello_ja_roundtripped == hello_ja