TIL: Encode and decode text in Emacs Lisp

Jan 03, 2023

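Emacs Lisp strings are arrays of characters and support Unicode. The length function returns the number of characters, while string-bytes returns the number of bytes the string occupies in Emacs's internal representation, which is a superset of UTF-8. Each hiragana character takes three bytes there, which is why the Japanese string below is 4 characters but 12 bytes. Here is a comparison between two strings, one English, one Japanese.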

(setq hello-en "hello")
(setq hello-ja "おはよう")
(length hello-en) ;; 5
(length hello-ja) ;; 4
(string-bytes hello-en) ;; 5
(string-bytes hello-ja) ;; 12

You can see the underlying character codes using string-to-list.

(string-to-list hello-en)
;; (104 101 108 108 111)
(string-to-list hello-ja)
;; (12362 12399 12424 12358)

How about encoding and decoding? Running M-x list-coding-systems lists every coding system Emacs knows about. My Emacs installation has a whole bunch, including utf-8, iso-latin-1, us-ascii, and a variety of CJK encodings. We can use encode-coding-string to convert a string into its encoded bytes, which come back as a unibyte string.

(encode-coding-string hello-en 'us-ascii)
;; "hello"
(encode-coding-string hello-ja 'us-ascii)
;; "????" (every character was replaced with "?", which is not helpful)
(encode-coding-string hello-ja 'japanese-shift-jis)
;; "\202\250\202\315\202\346\202\244"

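We can decode the Shift JIS bytes too, but the result comes back with a text property that records the charset. I’ve stripped that off with substring-no-properties, just to show that the string survives the full encode and decode cycle. (string-equal ignores text properties anyway, so the stripping only makes the printed value cleaner.)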

(What’s a text property? It’s metadata attached to the characters of a string or buffer, which Emacs uses for formatting: setting colours or fonts, making clickable links or, as in this case, annotating what character set the text is in.)

(setq hello-ja-sjis
    (encode-coding-string hello-ja 'japanese-shift-jis))
(setq hello-ja-decoded
    (decode-coding-string hello-ja-sjis 'japanese-shift-jis))
(setq hello-ja-decoded-wo-props
    (substring-no-properties hello-ja-decoded))
(string-equal hello-ja-decoded-wo-props hello-ja) ;; t

A brief summary of broadly equivalent code in Python for comparison:

hello_en = "hello"
hello_ja = "おはよう"
assert len(hello_en) == 5
assert len(hello_ja) == 4

assert len(hello_en.encode("utf-8")) == 5
assert len(hello_ja.encode("utf-8")) == 12

hello_ja_shiftjis = hello_ja.encode("shift_jisx0213")
hello_ja_roundtripped = hello_ja_shiftjis.decode("shift_jisx0213")

assert hello_ja_roundtripped == hello_ja
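
The Python summary above skips the string-to-list and us-ascii steps, so here is a rough sketch of how those might look. The variable names are made up for illustration, and using errors="replace" is an assumption about which Python error handler comes closest to the "????" result Emacs produced. It also uses the plain shift_jis codec rather than shift_jisx0213, to line the bytes up with the Emacs output.

```python
hello_ja = "おはよう"

# Code points, roughly what string-to-list returns in Emacs Lisp.
code_points = [ord(ch) for ch in hello_ja]
print(code_points)  # [12362, 12399, 12424, 12358]

# str.encode is strict by default and raises on characters
# the target encoding cannot represent.
try:
    hello_ja.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" substitutes "?" for each unencodable character.
ascii_replaced = hello_ja.encode("ascii", errors="replace")
print(ascii_replaced)  # b'????'

# Plain Shift JIS: the same bytes Emacs printed in octal
# (\202\250 is 0x82 0xA8, and so on).
sjis = hello_ja.encode("shift_jis")
print(sjis)  # b'\x82\xa8\x82\xcd\x82\xe6\x82\xa4'
```

One difference worth noting: Python fails loudly on unencodable characters unless you ask for replacement, whereas the Emacs call above quietly returned question marks.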
