Sequential Data Streams

At a time when ASCII [] was the only encoding used on computers, letters were synonymous with code points, code values, and their graphical representations. The letter ‘A’, for example, was sent as its ASCII code 65 from the keyboard to the computer. String computations treated ‘A’ as 65. It was stored as 65 on a storage device, and when it had to be displayed, a 65 was written into a memory cell of the display chip. Naturally, lexical analyzers were described as running on letters.

When a data stream is encoded, its elements are no longer letters. The term lexatom is used to distinguish what triggers state transitions from other closely related concepts. This section provides definitions for the surrounding concepts, starting with characters and letters.

Character:

A symbol, such as a letter or number, that represents information [1].

Letter:

A symbol usually written or printed representing a speech sound and constituting a unit of an alphabet [2].

Both concepts are very close to each other. In this text each term, ‘letter’ and ‘character’, shall carry the union of both meanings. In Unicode, some graphical representations are constructed out of more than one element, such as a consonant and its diacritics. Conversely, ligatures combine diacritics with letters in a single code point, e.g. שּ (0xFB2D) for Hebrew ‘sheen’ with dagesh. Some code points even represent multiple letters, such as ffi (0xFB03) for a compact display of ‘f’, ‘f’, ‘i’ or ﷴ (0xFDF4) for ‘م’, ‘ح’, ‘م’, ‘د’ (i.e. ‘Muhammad’) in Arabic. With such a broad scope, it cannot be assumed that lexatoms are letters or characters. More precise definitions are required to distinguish graphical representations from encodings.
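The relation between code points and letters can be probed with a short Python sketch. It relies on the standard `unicodedata` module; the helper name `expand` is made up for this illustration. Compatibility decomposition (NFKD) is used here only as a convenient way to reveal the letters behind a ligature code point, not as something a lexical analyzer is required to do.

```python
import unicodedata

# Single code points that stand for several letters.
LATIN_FFI = "\uFB03"      # LATIN SMALL LIGATURE FFI
ARABIC_NAME = "\uFDF4"    # ARABIC LIGATURE MOHAMMAD ISOLATED FORM


def expand(text):
    """Return the compatibility decomposition (NFKD) of `text`."""
    return unicodedata.normalize("NFKD", text)


ffi_letters = expand(LATIN_FFI)       # one code point becomes three letters
name_letters = expand(ARABIC_NAME)    # one code point becomes four letters

# The opposite direction: two code points (letter + combining accent)
# that are displayed as one graphical unit and compose into one code point.
composed = unicodedata.normalize("NFC", "e\u0301")

print(len(LATIN_FFI), "->", len(ffi_letters))
print(["0x%04X" % ord(c) for c in name_letters])
print(len("e\u0301"), "->", len(composed))
```

Neither direction gives a one-to-one relation between code points and letters, which is why the text does not treat lexatoms as letters.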

Grapheme:

A minimally distinctive unit of writing in the context of a particular writing system []. The Arabic letter ل (lam) and the vocalization mark ُ   (damma) are graphemes. The vocalized ُل is not.

Glyph:

A graphical representation of a grapheme. This may include a specification of font, slant, and style. Input to lexical analysis should be free of style elements; otherwise it requires special markers.

Code Point:

A numeric value representing a grapheme. Unicode [] defines distinct mappings from graphemes to code points for many writing systems.

The definitions of ‘encoding’ and ‘code unit’ depend on each other.

Encoding:

Mapping from a code point to a sequence of code units.

Code Unit:

A code unit is the size of an element of an encoded representation of a code point depending on a given encoding [3].
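As an illustration of an encoding as a mapping from a code point to a sequence of code units, the following Python sketch spells out the UTF16 rule by hand. The function name `utf16_code_units` is hypothetical; the arithmetic follows the surrogate pair scheme of UTF16, and the result is compared against Python's built-in codec as a cross-check.

```python
import struct


def utf16_code_units(code_point):
    """Map one code point to its UTF16 code units (each 16 bits wide)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x10000:
        # Code points of the Basic Multilingual Plane fit into one code unit.
        return [code_point]
    # All others are split into a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf16_by_codec(code_point):
    """Same mapping, obtained from Python's codec for comparison."""
    raw = chr(code_point).encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


for cp in (0x41, 0x2661, 0x1F600):
    print(hex(cp), [hex(u) for u in utf16_code_units(cp)])
```

The number of code units per code point varies, while the size of each code unit stays fixed at 16 bits.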

For a lexical analyzer that runs on Unicode, the lexatoms are code points. If the lexical analyzer runs directly on an encoding, then the lexatoms are instances of the code units that constitute the code points. This is shown in Fig. 4, which presents three different DFAs that detect the hieroglyph P002. The top DFA runs on Unicode code points. The second one runs on UTF16 with code units of 16bit size. Its trigger lexatoms are 0xD80C and 0xDE9D. The last DFA runs on UTF8 with code units of 8bit size. It triggers on the lexatoms 0xF0, 0x93, 0x8A, and 0x9D.

../_images/lexatom-explanation.svg

Fig. 4 Egyptian Hieroglyph P002 and DFAs that detect it, implemented for different encodings: UTF32 (top), UTF16 (middle), and UTF8 (bottom).
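The three DFAs can be imitated with a minimal Python sketch. It assumes that the hieroglyph P002 is the code point 0x1329D, which is the value the UTF8 lexatoms above decode to. The names `trigger_lexatoms`, `build_dfa`, and `accepts` are hypothetical helpers for this illustration and not part of any lexer generator API.

```python
import struct

P002 = 0x1329D  # assumption: code point of Egyptian Hieroglyph P002


def trigger_lexatoms(code_point, encoding):
    """Return the lexatom sequence that represents `code_point`."""
    text = chr(code_point)
    if encoding == "utf-32":
        return [code_point]               # the lexatom is the code point
    if encoding == "utf-16":
        raw = text.encode("utf-16-be")    # lexatoms are 16bit code units
        return list(struct.unpack(">%dH" % (len(raw) // 2), raw))
    if encoding == "utf-8":
        return list(text.encode("utf-8"))  # lexatoms are 8bit code units
    raise ValueError(encoding)


def build_dfa(triggers):
    """Linear DFA: state i goes to state i + 1 on triggers[i]."""
    transitions = {(i, t): i + 1 for i, t in enumerate(triggers)}
    return transitions, len(triggers)     # (transition map, accepting state)


def accepts(dfa, stream):
    """Run the DFA over a lexatom stream, starting in state 0."""
    transitions, accepting_state = dfa
    state = 0
    for lexatom in stream:
        state = transitions.get((state, lexatom))
        if state is None:                 # no transition: drop out
            return False
    return state == accepting_state


for encoding in ("utf-32", "utf-16", "utf-8"):
    triggers = trigger_lexatoms(P002, encoding)
    dfa = build_dfa(triggers)
    print(encoding, [hex(t) for t in triggers], accepts(dfa, triggers))
```

The same pattern thus yields DFAs with one, two, or four transitions, depending only on what the lexatoms of the input stream are.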

A lexical analyzer runs on a computer. There, lexatoms are carried in memory cells. For efficient iteration over a lexatom sequence, memory cells are best arranged adjacently. This arrangement is called a ‘buffer’.

Buffer:

A buffer is a region in computer memory that consists of a sequence of adjacent same-sized memory cells. Each memory cell carries a numeric representation of a lexatom.

Buffer element type:

The buffer element type defines how to interpret the content of a memory cell and defines its extent. The element type must be able to carry the numeric value of any lexatom that may occur.

If, in C/C++, the buffer element type is chosen as an unsigned 8bit integer (uint8_t), then a buffer is an array of 1byte wide numbers. A memory cell with the content 0b01011010 is interpreted as 0x5A and may mean ‘Z’ in ASCII.
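The effect of the buffer element type can be tried out in Python, whose standard `array` module provides typed, contiguous buffers. In this sketch the type code "B" (unsigned char) serves as a stand-in for uint8_t and "H" (unsigned short) for a 16bit type. That "H" is exactly two bytes wide is an assumption which holds on common platforms.

```python
from array import array

cell = 0b01011010                 # bit pattern of one memory cell

# Buffer whose element type is an unsigned 8bit integer.
buffer = array("B", [cell])
print(buffer.itemsize, hex(buffer[0]), chr(buffer[0]))

# A lexatom outside the range of the element type cannot be stored.
fits_in_byte = True
try:
    buffer.append(0x2661)
except OverflowError:
    fits_in_byte = False

# A wider element type is able to carry it.
wide_buffer = array("H", [0x2661])
print(fits_in_byte, wide_buffer.itemsize, hex(wide_buffer[0]))
```

The element type therefore has to be chosen according to the largest lexatom that may occur, not according to the typical one.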

In summary, a code point is a numeric value that stands for a slot in the Unicode table. A code unit is the size of an element in the encoded representation of a code point, e.g. 8bit in UTF8. The buffer element type determines the numeric range of lexatoms. Its size is best chosen to be the same as that of the code unit. A lexatom is an instance of a code unit, or of a code point if no encoding is used. For example, if a code unit is 16bit wide, then a lexatom may be 0x2661.

The previous paragraphs defined the term lexatom in the context of Unicode. For raw DNA or RNA analysis, a lexatom is simply the representation of a nucleotide base such as 0x41 (A), 0x47 (G), 0x43 (C), 0x54 (T), and 0x55 (U). The genetic code translates DNA or mRNA into proteins based on triplets of nucleotides. On this level, a lexatom is equal to a possible combination of three nucleotides, the so-called codon. In general, lexatoms are what trigger state transitions.
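The two levels of lexatoms for genetic sequences can be sketched in Python as follows. The function names and the sample sequence are made up for this illustration. Trailing nucleotides that do not complete a triplet are simply dropped here, which is a simplification.

```python
def nucleotide_lexatoms(sequence):
    """Lexatoms on the raw level: one numeric code per nucleotide base."""
    return list(sequence.encode("ascii"))


def codon_lexatoms(sequence):
    """Lexatoms on the codon level: one triplet of nucleotides each."""
    usable = len(sequence) - len(sequence) % 3   # ignore an incomplete tail
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


mrna = "AUGGCCUAA"
print([hex(x) for x in nucleotide_lexatoms(mrna)])
print(codon_lexatoms(mrna))
```

A state machine for the first level triggers on single values such as 0x41, while one for the second level triggers on whole codons.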

Footnotes