Look up the named Unicode encoding. May fail if the named encoding is not known.
The set of known encodings is system-dependent, but includes at least:
- UTF-8
- UTF-16, UTF-16BE, UTF-16LE
- UTF-32, UTF-32BE, UTF-32LE
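
The lookup can be sketched as follows. This is a minimal illustration, not a definitive usage pattern: it assumes the function documented here is mkTextEncoding as exported by System.IO, that a failed lookup surfaces as an IOError, and it uses "NO-SUCH-ENCODING" as a made-up name that is expected (but not guaranteed) to be unknown on a given system.

```haskell
import System.IO (hSetEncoding, mkTextEncoding, stdout)
import System.IO.Error (isDoesNotExistError, tryIOError)

-- Try to look up an encoding by name and report what happened.
-- Assumption: lookup failures are raised as IOError values.
describe :: String -> IO ()
describe name = do
  result <- tryIOError (mkTextEncoding name)
  case result of
    Right enc -> putStrLn (name ++ " -> " ++ show enc)
    Left err
      | isDoesNotExistError err -> putStrLn (name ++ " -> unknown encoding")
      | otherwise               -> putStrLn (name ++ " -> failed: " ++ show err)

main :: IO ()
main = do
  -- The first three names come from the list above; the last one is
  -- a hypothetical name chosen only to exercise the failure path.
  mapM_ describe ["UTF-8", "UTF-16LE", "UTF-32BE", "NO-SUCH-ENCODING"]

  -- A successfully looked-up encoding can then be installed on a handle.
  utf8Enc <- mkTextEncoding "UTF-8"
  hSetEncoding stdout utf8Enc
  putStrLn "stdout now encodes as UTF-8"
```

Wrapping the lookup in tryIOError keeps an unknown name from aborting the program, which matters because the set of available encodings differs between systems.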
There is additional notation (borrowed from GNU iconv) for specifying
how illegal characters are handled:
- a suffix of //IGNORE, e.g. UTF-8//IGNORE, will cause all illegal
sequences on input to be ignored, and on output will drop all code
points that have no representation in the target encoding.
- a suffix of //TRANSLIT will choose a replacement character for
illegal sequences or code points.
- a suffix of //ROUNDTRIP will use a PEP 383-style escape
mechanism to represent any invalid bytes in the input as Unicode
code points (specifically, as lone surrogates, which are normally
invalid in UTF-32). Upon output, these special code points are detected
and turned back into the corresponding original bytes.
In theory, this mechanism allows arbitrary data to be roundtripped via
a String with no loss of data. In practice, there are two
limitations to be aware of:
- This only stands a chance of working for an encoding which is an
ASCII superset, as for security reasons we refuse to escape any bytes
smaller than 128. Many encodings of interest are ASCII supersets (in
particular, you can assume that the locale encoding is an ASCII
superset) but many (such as UTF-16) are not.
- If the underlying encoding is not itself roundtrippable, this
mechanism can fail. Roundtrippable encodings are those which have an
injective mapping into Unicode. Almost all encodings meet this
criterion, but some do not. Notably, Shift-JIS (CP932) and Big5
contain several different encodings of the same Unicode
codepoint.
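
The suffixes and the roundtrip mechanism described above can be tried out with a sketch like the following. It assumes the function documented here is mkTextEncoding from System.IO, and the two file names are hypothetical scratch paths. The program prints the decoded code points for each suffix instead of asserting them, since the exact replacement and escape characters are implementation details.

```haskell
import Control.Monad (forM_)
import System.IO

-- Hypothetical scratch file names, used only for this demonstration.
inPath, outPath :: FilePath
inPath  = "roundtrip-in.bin"
outPath = "roundtrip-out.bin"

-- Read a whole file strictly as raw bytes (one Char per byte).
readBytes :: FilePath -> IO String
readBytes path = withBinaryFile path ReadMode $ \h -> do
  s <- hGetContents h
  length s `seq` return s

-- Decode a whole file strictly using the named encoding.
-- The handle is opened in binary mode to avoid newline translation,
-- then switched to the requested encoding.
decodeFile :: String -> FilePath -> IO String
decodeFile name path = do
  enc <- mkTextEncoding name
  withBinaryFile path ReadMode $ \h -> do
    hSetEncoding h enc
    s <- hGetContents h
    length s `seq` return s

main :: IO ()
main = do
  -- "hi" followed by 0xFF, a byte that never occurs in valid UTF-8.
  let original = "hi\xFF"
  withBinaryFile inPath WriteMode $ \h -> hPutStr h original

  -- Show how each suffix treats the invalid byte on input.
  forM_ ["UTF-8//IGNORE", "UTF-8//TRANSLIT", "UTF-8//ROUNDTRIP"] $ \name -> do
    s <- decodeFile name inPath
    putStrLn (name ++ ": " ++ show (map fromEnum s))

  -- Decode with //ROUNDTRIP, encode again with the same encoding,
  -- and compare the resulting bytes with the original ones.
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  decoded <- decodeFile "UTF-8//ROUNDTRIP" inPath
  withBinaryFile outPath WriteMode $ \h -> do
    hSetEncoding h enc
    hPutStr h decoded
  copy <- readBytes outPath
  putStrLn (if copy == original then "bytes preserved" else "bytes changed")
```

UTF-8 is used here because it is an ASCII superset with an injective mapping into Unicode, so neither of the two limitations listed above applies to it.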
On Windows, you can access supported code pages with the prefix
CP; for example, "CP1250".