Converter Helper Functions

The lexical analyzer engine generated by quex runs on a buffer whose character encoding may differ from the one the user actually requires. When the default token class is used, the two member functions

const std::string    pretty_char_text() const;
const std::wstring   pretty_wchar_text() const;

convert the member text inside the token into something appropriate for the types char and wchar_t. UTF8 is considered appropriate for char. Depending on the size of wchar_t, the output is either UTF32 (i.e. UCS4) or, on systems where sizeof(wchar_t) == 2, UTF16.
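The size-based choice described above can be sketched as follows. The names WideEncoding, wide_encoding_for and kPlatformWideEncoding are hypothetical and only illustrate the rule; they are not part of the quex API.

```cpp
#include <cstddef>

// Hypothetical names for illustration; not part of the quex API.
enum class WideEncoding { Utf16, Utf32 };

// Pick the encoding for wchar_t output from its size in bytes:
// 2 bytes -> UTF16, anything else -> UTF32 (UCS4).
constexpr WideEncoding wide_encoding_for(std::size_t wchar_size) {
    return wchar_size == 2 ? WideEncoding::Utf16 : WideEncoding::Utf32;
}

// The encoding that applies on the platform this code is compiled for.
constexpr WideEncoding kPlatformWideEncoding =
    wide_encoding_for(sizeof(wchar_t));
```

Because the decision depends only on sizeof(wchar_t), it can be made at compile time.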

Moreover, converters from the buffer’s codec to UTF8, UTF16, UTF32, char and wchar_t are provided for each generated lexical analyzer. For Unicode-based buffers the required functions are declared and implemented by including:

#include <quex/code_base/converter_helper/from-unicode-buffer>

and:

#include <quex/code_base/converter_helper/from-unicode-buffer.i>

These headers depend on the definitions of:

QUEX_TYPE_CHARACTER
QUEX_SETTING_CHARACTER_SIZE
QUEX_SETTING_CHARACTER_CODEC

and are thus analyzer specific. They are placed in the analyzer’s namespace.

When a subset of Unicode (e.g. ASCII, UCS4 or UTF32) is used as input encoding, the buffer encoding is Unicode. Likewise, when converters are used (see Conversion), the buffer is still filled with Unicode characters. In both cases the Unicode converters in the files mentioned above can be used. They are

QUEX_INLINE void   unicode_to_utf8(...);
QUEX_INLINE void   unicode_to_utf16(...);
QUEX_INLINE void   unicode_to_utf32(...);
QUEX_INLINE void   unicode_to_char(...);
QUEX_INLINE void   unicode_to_wchar(...);

which are located in the analyzer’s namespace. In plain C there are no namespaces, so the analyzer’s name precedes the function’s name. For example, if the analyzer is named ‘MyLex’, the function group is

QUEX_INLINE void   MyLex_unicode_to_utf8(...);
QUEX_INLINE void   MyLex_unicode_to_utf16(...);
QUEX_INLINE void   MyLex_unicode_to_utf32(...);
QUEX_INLINE void   MyLex_unicode_to_char(...);
QUEX_INLINE void   MyLex_unicode_to_wchar(...);

The functions above convert a whole string. For single character conversions, the same function group is available with names ending in _character, e.g.

QUEX_INLINE void   unicode_to_utf8_character(...);

converts a single character from Unicode to UTF8.
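As an illustration of what such a single character conversion involves, here is a minimal stand-alone sketch. The function sketch_unicode_to_utf8_character and its parameter list are hypothetical; the real quex signature is not shown in this text and may differ. The sketch assumes a valid code point and at least four free bytes in the drain.

```cpp
#include <cstdint>

// Hypothetical stand-in for a '_character' converter; the real quex
// signature is not shown in this text and may differ.
// Encodes the code point at **source_pp as UTF8 and advances both pointers.
// Assumes a valid code point (<= 0x10FFFF) and at least 4 free drain bytes.
inline void sketch_unicode_to_utf8_character(const std::uint32_t** source_pp,
                                             std::uint8_t**        drain_pp) {
    const std::uint32_t cp = *(*source_pp)++;
    std::uint8_t*       d  = *drain_pp;
    if (cp < 0x80u) {                       // 1 byte: 0xxxxxxx
        *d++ = static_cast<std::uint8_t>(cp);
    } else if (cp < 0x800u) {               // 2 bytes: 110xxxxx 10xxxxxx
        *d++ = static_cast<std::uint8_t>(0xC0u | (cp >> 6));
        *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    } else if (cp < 0x10000u) {             // 3 bytes
        *d++ = static_cast<std::uint8_t>(0xE0u | (cp >> 12));
        *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
        *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    } else {                                // 4 bytes
        *d++ = static_cast<std::uint8_t>(0xF0u | (cp >> 18));
        *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 12) & 0x3Fu));
        *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
        *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
    }
    *drain_pp = d;  // tell the caller where the next character goes
}
```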

The converters towards UTF8, UTF16, and UTF32 are the ‘basis’; the converters towards char and wchar_t are mapped to one of them, depending on what is appropriate for the size of char and wchar_t. The exact signature of each function follows the scheme of converter in the following code fragment

QUEX_INLINE void   converter(const SourceT** source_pp,
                             const SourceT*  SourceEnd,
                             DrainT**        drain_pp,
                             const DrainT*   DrainEnd);

The converter tries to convert as many characters from source to drain as possible, until the source pointer reaches SourceEnd or the drain pointer reaches DrainEnd. The pointer-to-pointer arguments are required because the converter advances both pointers. This facilitates repeated calls to the converter in case either source or drain is fragmented.

Note

Depending on the drain’s size, not all characters may be converted. A character is not converted if the remaining drain size is less than the maximum size of an encoded character. For UTF8 it is 8 bytes, for UTF16 4 bytes and for UTF32 4 bytes.
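The calling pattern for a fragmented drain can be sketched as follows. Both sketch_unicode_to_utf8 and sketch_convert_in_chunks are hypothetical stand-alone functions that merely follow the pointer-to-pointer scheme; they are not the quex implementation. The sketch assumes UTF32 code points as source, valid input, and the 8 byte drain reserve for UTF8 mentioned in the note above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Free drain bytes required before a character is converted; the value
// follows the figure given for UTF8 in the note above.
const std::ptrdiff_t kUtf8DrainReserve = 8;

// Hypothetical stand-alone converter following the scheme of 'converter';
// not the quex implementation. Assumes valid code points (<= 0x10FFFF).
inline void sketch_unicode_to_utf8(const std::uint32_t** source_pp,
                                   const std::uint32_t*  SourceEnd,
                                   std::uint8_t**        drain_pp,
                                   const std::uint8_t*   DrainEnd) {
    const std::uint32_t* s = *source_pp;
    std::uint8_t*        d = *drain_pp;
    while (s != SourceEnd && DrainEnd - d >= kUtf8DrainReserve) {
        const std::uint32_t cp = *s++;
        if (cp < 0x80u) {
            *d++ = static_cast<std::uint8_t>(cp);
        } else if (cp < 0x800u) {
            *d++ = static_cast<std::uint8_t>(0xC0u | (cp >> 6));
            *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
        } else if (cp < 0x10000u) {
            *d++ = static_cast<std::uint8_t>(0xE0u | (cp >> 12));
            *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
            *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
        } else {
            *d++ = static_cast<std::uint8_t>(0xF0u | (cp >> 18));
            *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 12) & 0x3Fu));
            *d++ = static_cast<std::uint8_t>(0x80u | ((cp >> 6) & 0x3Fu));
            *d++ = static_cast<std::uint8_t>(0x80u | (cp & 0x3Fu));
        }
    }
    // Report progress so that the next call can resume where this one stopped.
    *source_pp = s;
    *drain_pp  = d;
}

// Repeated calls with a small fixed buffer, as for a fragmented drain.
inline std::string sketch_convert_in_chunks(const std::uint32_t* begin,
                                            const std::uint32_t* end) {
    std::string          result;
    std::uint8_t         chunk[16];
    const std::uint32_t* s = begin;
    while (s != end) {
        std::uint8_t* d = chunk;  // every call starts with an empty chunk
        sketch_unicode_to_utf8(&s, end, &d, chunk + sizeof(chunk));
        result.append(reinterpret_cast<const char*>(chunk),
                      static_cast<std::size_t>(d - chunk));
    }
    return result;
}
```

Since each call starts with an empty 16 byte chunk, at least one character is converted per call and the loop terminates.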

The previous converter is present in C and C++. In C++ the following converters are available, which are possibly not as fast but more convenient.

QUEX_INLINE string<uint8_t>   unicode_to_utf8(string<qtc>);
QUEX_INLINE string<uint16_t>  unicode_to_utf16(string<qtc>);
QUEX_INLINE string<uint32_t>  unicode_to_utf32(string<qtc>);
QUEX_INLINE string<char>      unicode_to_char(string<qtc>);
QUEX_INLINE string<wchar_t>   unicode_to_wchar(string<qtc>);

where string<X> is a shorthand for std::basic_string<X> and string<qtc> is a shorthand for std::basic_string<QUEX_TYPE_CHARACTER>. This means that they can take a string of the type of the lexeme and return a string which is appropriate for the drain’s codec. Plain C has nothing comparable to std::basic_string, so these functions do not exist in C.
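The shape of such a string-based converter can be sketched as follows. The function sketch_unicode_to_char is hypothetical and stand-alone; it assumes that the lexeme type is a 32 bit Unicode character, so std::u32string stands in for std::basic_string<QUEX_TYPE_CHARACTER>, and that UTF8 is the encoding used for char.

```cpp
#include <string>

// Hypothetical convenience converter in the spirit of the string-based
// functions; not the quex implementation. std::u32string stands in for
// std::basic_string<QUEX_TYPE_CHARACTER>. Assumes valid code points.
inline std::string sketch_unicode_to_char(const std::u32string& source) {
    std::string drain;
    drain.reserve(source.size() * 4);  // worst case: 4 bytes per code point
    for (char32_t cp : source) {
        if (cp < 0x80u) {
            drain.push_back(static_cast<char>(cp));
        } else if (cp < 0x800u) {
            drain.push_back(static_cast<char>(0xC0u | (cp >> 6)));
            drain.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
        } else if (cp < 0x10000u) {
            drain.push_back(static_cast<char>(0xE0u | (cp >> 12)));
            drain.push_back(static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu)));
            drain.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
        } else {
            drain.push_back(static_cast<char>(0xF0u | (cp >> 18)));
            drain.push_back(static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu)));
            drain.push_back(static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu)));
            drain.push_back(static_cast<char>(0x80u | (cp & 0x3Fu)));
        }
    }
    return drain;
}
```

The caller does not have to manage drain buffers or pointers, at the cost of an allocation for the returned string.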

When the internal engine is generated using --codec, the buffer codec is some dedicated character encoding. The Lexeme that is presented to the user has exactly the coding of the internal buffer. More precisely, it is a chain of QUEX_TYPE_CHARACTER objects that are encoded in the buffer’s character encoding. In this case quex has to generate the converters towards UTF8, UTF16, and UTF32. The converters follow the same scheme as for Unicode, except that ‘unicode’ is replaced by the codec’s name, e.g.

QUEX_INLINE void   iso8859_7_to_utf8(...);
QUEX_INLINE void   iso8859_7_to_utf16(...);
QUEX_INLINE void   iso8859_7_to_utf32(...);
QUEX_INLINE void   iso8859_7_to_char(...);
QUEX_INLINE void   iso8859_7_to_wchar(...);

are the generated converters if --codec iso8859-7 was specified. The converters can be included by

#include "MyLexer-converter-iso8859_7"   // Declarations
#include "MyLexer-converter-iso8859_7.i" // Implementations

Here MyLexer is the name of the generated lexical analyzer class and iso8859_7 is the name of the engine’s codec. Furthermore, there is a set of basic functions that were designed to support the aforementioned functions but are also available to anyone who finds them useful. They are accessed by including

#include <quex/code_base/converter_helper/from-utf8>
#include <quex/code_base/converter_helper/from-utf16>
#include <quex/code_base/converter_helper/from-utf32>

for the declarations and:

#include <quex/code_base/converter_helper/from-utf8.i>
#include <quex/code_base/converter_helper/from-utf16.i>
#include <quex/code_base/converter_helper/from-utf32.i>

for the implementations. They work exactly the same way as the dedicated converters generated for --codec. That is, their signatures are for example

QUEX_INLINE void   utf8_to_utf8(...);
QUEX_INLINE void   utf8_to_utf16(...);
QUEX_INLINE void   utf8_to_utf32(...);
QUEX_INLINE void   utf8_to_char(...);
QUEX_INLINE void   utf8_to_wchar(...);

in order to convert UTF8 strings to one of the target codecs. UTF16 and UTF32 work analogously.
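To illustrate the direction from UTF8 to UTF32 with the same pointer-to-pointer scheme, here is a minimal stand-alone sketch. The function sketch_utf8_to_utf32 is hypothetical and not the quex implementation; it assumes well-formed UTF8 input and does not validate continuation bytes.

```cpp
#include <cstdint>

// Hypothetical stand-alone sketch; not the quex implementation.
// Assumes well-formed UTF8 input (continuation bytes are not validated).
inline void sketch_utf8_to_utf32(const std::uint8_t** source_pp,
                                 const std::uint8_t*  SourceEnd,
                                 std::uint32_t**      drain_pp,
                                 const std::uint32_t* DrainEnd) {
    const std::uint8_t* s = *source_pp;
    std::uint32_t*      d = *drain_pp;
    while (s != SourceEnd && d != DrainEnd) {
        const std::uint8_t lead = *s;
        // Length of the sequence, derived from the lead byte.
        const int n = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // An incomplete sequence at the end of a fragment is left untouched,
        // so that a later call can pick it up together with the next bytes.
        if (SourceEnd - s < n) break;
        // Payload bits of the lead byte: 7, 5, 4 or 3 bits.
        std::uint32_t cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
        for (int i = 1; i < n; ++i) {
            cp = (cp << 6) | (s[i] & 0x3Fu);  // 6 payload bits per byte
        }
        *d++ = cp;
        s += n;
    }
    // Report progress so that the next call can resume where this one stopped.
    *source_pp = s;
    *drain_pp  = d;
}
```

If the source fragment ends in the middle of a multi-byte sequence, the source pointer stays at the start of that sequence; the caller has to prepend the remaining bytes to the next fragment.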