Usage

To use ANSEL Codecs in a project:

import ansel

ansel.register()

The register function registers each of the encodings supported by the ansel module. Once registered, they can be used with any of the functions of the codecs module or other functions that rely on codecs, for example:

with open(filename, "r", encoding="ansel") as fp:
    fp.read()

This opens the file filename for reading with the “ansel” encoding.

Encodings

The following encodings are provided and registered with the codecs module:

Codec	Description
ansel	American National Standard for Extended Latin Alphabet Coded Character Set for Bibliographic Use (ANSEL).
gedcom	GEDCOM extensions to ANSEL.

Limitations

Python’s open() uses the codecs.IncrementalEncoder interface, but it doesn’t invoke codecs.IncrementalEncoder.encode() with final=True. Without that call the encoder is never told that the input has ended, so the final character written is never emitted to the stream. For example:

parts = ["P", "a", "\u030A", "l"]
with open("tmpfile", "w", encoding="ansel") as fp:
    for part in parts:
        fp.write(part)

will write the bytes:

0x50 P

0xEA ◌̊

0x61 a

Note that the last character, ‘l’, does not appear in the byte sequence.
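The effect can be reproduced without the ansel module. The following is a minimal sketch, assuming a hypothetical ToyReorderingEncoder with a one-entry mapping (U+030A to 0xEA, taken from the example above). It is not the ansel module's encoder; it only illustrates why an encoder that places combining marks before their base character has to hold the last base character until it sees more input or final=True.

```python
import codecs

# Illustrative one-entry table (assumption): combining ring above -> 0xEA.
COMBINING = {"\u030A": b"\xEA"}


class ToyReorderingEncoder(codecs.IncrementalEncoder):
    """Hypothetical stand-in that emits combining marks before their base."""

    def __init__(self, errors="strict"):
        super().__init__(errors)
        self.pending = ""  # base character still waiting for possible marks
        self.marks = b""   # encoded marks that belong to the pending base

    def encode(self, text, final=False):
        out = b""
        for ch in text:
            if ch in COMBINING:
                # Hold the mark; it is emitted in front of the pending base.
                self.marks += COMBINING[ch]
            else:
                # A new base character arrived, so the previous one is complete.
                out += self._flush()
                self.pending = ch
        if final:
            # Only final=True releases the last held character.
            out += self._flush()
        return out

    def _flush(self):
        data = self.marks + self.pending.encode("ascii")
        self.pending, self.marks = "", b""
        return data

    def reset(self):
        super().reset()
        self.pending, self.marks = "", b""


parts = ["P", "a", "\u030A", "l"]
enc = ToyReorderingEncoder()

# What a caller that never passes final=True receives: 'l' is still held.
without_final = b"".join(enc.encode(p) for p in parts)  # b'P\xeaa'

# A closing call with final=True releases the held character.
with_final = without_final + enc.encode("", final=True)  # b'P\xeaal'
```

In this sketch the missing byte appears only once encode() is called with final=True, which is the call that open() does not make.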

Related functions like codecs.open() have similar issues. They don’t rely on codecs.IncrementalEncoder and instead use the codecs.encode() function. Since each write is encoded on its own, a combining character and the character it modifies are not handled correctly when they arrive in separate write calls:

import codecs

with codecs.open("tmpfile", "w", encoding="ansel") as fp:
    for part in parts:
        fp.write(part)

will write the bytes:

0x50 P

0x61 a

0xEA ◌̊

0x6C l

Note that while all of the bytes were written, the combining character follows the character it modifies. In ANSEL, a combining character must precede the character it modifies.
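The ordering problem comes from encoding each piece in isolation. The following is a minimal sketch, assuming a hypothetical stateless toy_encode function that knows a single combining character (U+030A as byte 0xEA, taken from the example above). It is not the ansel codec; it only contrasts per-write encoding with encoding the whole text at once.

```python
RING_ABOVE = "\u030A"  # combining ring above; byte 0xEA in the example above


def toy_encode(text):
    """Hypothetical stateless encoder: moves a ring above in front of its base."""
    out = bytearray()
    for ch in text:
        if ch == RING_ABOVE:
            # Place the mark before the most recent base byte, if this call
            # has seen one. A call that receives only the mark has no base
            # to reorder against, so the mark is emitted as-is.
            out.insert(max(len(out) - 1, 0), 0xEA)
        else:
            out += ch.encode("ascii")
    return bytes(out)


parts = ["P", "a", "\u030A", "l"]

# Each part encoded separately, as with one write() call per part.
per_write = b"".join(toy_encode(p) for p in parts)  # b'Pa\xeal'

# The same text encoded in a single call.
whole = toy_encode("".join(parts))  # b'P\xeaal'
```

When the mark arrives in its own call, the encoder cannot see the base character that was already written, so the mark ends up after it.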

To avoid these issues, encode the parts manually and write the resulting bytes to a file opened in binary mode. For example:

with codecs.open("tmpfile", "wb") as fp:
    for part in codecs.iterencode(parts, encoding="ansel"):
        fp.write(part)

will write the bytes:

0x50 P

0xEA ◌̊

0x61 a

0x6C l

This version writes the correct byte sequence.