Usage
To use ANSEL Codecs in a project:

    import ansel
    ansel.register()
The register function registers each of the encodings supported by the ansel module. Once registered, they can be used with any of the functions of the codecs module, or with other functions that rely on codecs. For example:
    # The keyword argument is "encoding" (singular).
    with open(filename, "r", encoding="ansel") as fp:
        text = fp.read()
This opens the file filename for reading with the “ansel” encoding.
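The registration mechanism itself can be sketched with the standard library alone. The example below assumes that a register function of this kind wraps codecs.register(); the source does not confirm how ansel.register() is implemented, and the codec name "toyupper" and the helper functions are hypothetical stand-ins, not part of the ansel module.

```python
import codecs


def _encode(text, errors="strict"):
    # Hypothetical encoding: upper-case the text, then emit ASCII bytes.
    return text.upper().encode("ascii", errors), len(text)


def _decode(data, errors="strict"):
    # Decode the ASCII bytes back to text (the upper-casing is not undone).
    return bytes(data).decode("ascii", errors), len(data)


def _search(name):
    # The codec registry calls this with the normalized (lower-case) name.
    if name == "toyupper":
        return codecs.CodecInfo(name="toyupper", encode=_encode, decode=_decode)
    return None


def register():
    # Assumption: a register() helper simply installs the search function.
    codecs.register(_search)


register()
encoded = codecs.encode("abc", "toyupper")
decoded = codecs.decode(b"ABC", "toyupper")
print(encoded, decoded)
```

Once the search function is installed, the codec name is resolved by the same lookup that serves the built-in encodings, which is why registered encodings work with the codecs functions.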
Encodings
The following encodings are provided and registered with the codecs module:

Codec | Description |
---|---|
ansel | American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL). |
gedcom | GEDCOM extensions to ANSEL. |
Limitations
Python’s open() uses the codecs.IncrementalEncoder interface, but it does not invoke codecs.IncrementalEncoder.encode() with final=True. As a result, the final character written is never emitted to the stream. For example:
    parts = ["P", "a", "\u030A", "l"]
    with open("tmpfile", "w", encoding="ansel") as fp:
        for part in parts:
            fp.write(part)
will write the bytes:

0x50 | 0xEA | 0x61 |
Note that the last character, ‘l’, does not appear in the byte sequence.
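The effect of the missing final=True call can be reproduced without ansel installed. The sketch below uses a hypothetical ToyReorderingEncoder, which is not the ansel implementation; it assumes only that an encoder of this kind must hold back a base character until it knows whether combining marks follow, and that U+030A maps to the byte 0xEA.

```python
import codecs

# Assumption: U+030A (combining ring above) maps to byte 0xEA; every other
# character in this toy example is plain ASCII.
COMBINING = {"\u030A": b"\xEA"}


class ToyReorderingEncoder(codecs.IncrementalEncoder):
    """Hypothetical encoder that emits combining marks before their base."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.base = b""   # base character held back until its marks are known
        self.marks = b""  # marks collected for the held base character

    def _flush(self):
        out = self.marks + self.base
        self.base = self.marks = b""
        return out

    def encode(self, input, final=False):
        out = b""
        for ch in input:
            if ch in COMBINING:
                self.marks += COMBINING[ch]
            else:
                # A new base character completes the previous one.
                out += self._flush()
                self.base = ch.encode("ascii")
        if final:
            # Only a final call releases the last held character.
            out += self._flush()
        return out


parts = ["P", "a", "\u030A", "l"]

encoder = ToyReorderingEncoder()
without_final = b"".join(encoder.encode(part) for part in parts)
flushed = encoder.encode("", final=True)

print(without_final)  # b'P\xeaa'
print(flushed)        # b'l'
```

Until encode() is called with final=True, the last base character stays in the encoder's buffer, which matches the missing ‘l’ described above.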
Related functions such as codecs.open() have a similar issue. They do not use the codecs.IncrementalEncoder interface and instead rely on the codecs.encode() function. Because each write is encoded on its own, a combining character that arrives in a separate write call from the character it modifies is not handled correctly:
    import codecs

    with codecs.open("tmpfile", "w", encoding="ansel") as fp:
        for part in parts:
            fp.write(part)
will write the bytes:

0x50 | 0x61 | 0xEA | 0x6C |
Note that while all of the bytes were written, the combining character follows the character it modifies. In ANSEL, the combining character must precede the character it modifies.
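The ordering problem comes from encoding each write in isolation, and it can be illustrated without file I/O. The toy_encode function below is a hypothetical stateless encoder, not the ansel codec; it assumes U+030A maps to the byte 0xEA and that a combining mark can only be moved in front of a base character that appears in the same call.

```python
# Assumption: U+030A (combining ring above) maps to byte 0xEA; every other
# character in this toy example is plain ASCII.
COMBINING = {"\u030A": b"\xEA"}


def toy_encode(text):
    """Hypothetical stateless encoder: marks are moved before their base."""
    out = b""
    base = b""
    marks = b""
    for ch in text:
        if ch in COMBINING:
            marks += COMBINING[ch]
        else:
            out += marks + base
            base, marks = ch.encode("ascii"), b""
    return out + marks + base


parts = ["P", "a", "\u030A", "l"]

# Each part encoded on its own: the mark cannot see the preceding "a".
per_write = b"".join(toy_encode(part) for part in parts)

# The whole string encoded in one call: the mark is placed before "a".
whole = toy_encode("".join(parts))

print(per_write)  # b'Pa\xeal'
print(whole)      # b'P\xeaal'
```

In this sketch, joining the parts into one string before a single call yields the correct order, because the base character and its mark are then seen together.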
To avoid these issues, manually encoding and writing the parts is recommended. For example:
    import codecs
    # Open in binary mode so the already-encoded bytes are written unchanged.
    with open("tmpfile", "wb") as fp:
        for part in codecs.iterencode(parts, encoding="ansel"):
            fp.write(part)
will write the bytes:

0x50 | 0xEA | 0x61 | 0x6C |
This version writes the correct byte sequence.
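For readers who want to see what the manual approach does, the loop below is a minimal sketch of the same pattern written with codecs.getincrementalencoder(). The helper name write_parts is hypothetical, and the demonstration uses the built-in "utf-16" codec (which is also stateful, since it emits a byte order mark only once) so that it runs without ansel installed.

```python
import codecs
import io


def write_parts(stream, parts, encoding):
    """Encode parts incrementally and write them to a binary stream."""
    encoder = codecs.getincrementalencoder(encoding)()
    for part in parts:
        stream.write(encoder.encode(part))
    # The final call lets the encoder emit anything it is still buffering.
    stream.write(encoder.encode("", final=True))


parts = ["P", "a", "\u030A", "l"]

buffer = io.BytesIO()
write_parts(buffer, parts, "utf-16")

# codecs.iterencode follows the same pattern, including the final call.
via_iterencode = b"".join(codecs.iterencode(parts, "utf-16"))

print(buffer.getvalue() == via_iterencode)
```

Because the encoder object persists across parts and is flushed explicitly at the end, the result matches encoding the joined string in one step.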