Converter Helper Functions
The lexical analyzer engine generated by quex runs on a buffer whose character encoding potentially differs from the one the user actually requires. When using a default token class, the two member functions
const std::string pretty_char_text() const;
const std::wstring pretty_wchar_text() const;
convert the member text inside the token into something appropriate for the types char and wchar_t. UTF8 is considered to be appropriate for char. Depending on the size of wchar_t, the output is either UCS4 (i.e. UTF32) or, on systems where sizeof(wchar_t) == 2, UTF16.
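The size rule above can be sketched as a small selection function. This is only an illustration under stated assumptions: `SketchEncoding` and `sketch_encoding_for_size` are hypothetical names introduced here and are not part of quex.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, introduced for illustration only (not part of quex).
enum SketchEncoding { SKETCH_UTF8, SKETCH_UTF16, SKETCH_UTF32 };

// Picks the output encoding from the size of the target character type,
// following the rule in the text: 1 byte -> UTF8, 2 bytes -> UTF16,
// anything larger -> UTF32 (UCS4).
inline SketchEncoding sketch_encoding_for_size(std::size_t character_size)
{
    assert(character_size >= 1);
    switch (character_size) {
    case 1:  return SKETCH_UTF8;
    case 2:  return SKETCH_UTF16;
    default: return SKETCH_UTF32;
    }
}
```

Calling `sketch_encoding_for_size(sizeof(wchar_t))` then yields the encoding that a `wchar_t` string would carry on the platform at hand.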
Moreover, converters from the buffer’s codec to UTF8, UTF16, UTF32, char and wchar_t are provided for each generated lexical analyzer. For Unicode-based buffers the required functions are declared and implemented by including:
#include <quex/code_base/converter_helper/from-unicode-buffer>
and:
#include <quex/code_base/converter_helper/from-unicode-buffer.i>
These headers depend on the definitions of:
QUEX_TYPE_CHARACTER
QUEX_SETTING_CHARACTER_SIZE
QUEX_SETTING_CHARACTER_CODEC
and are thus analyzer specific. They are placed in the analyzer’s namespace.
When a subset of Unicode (e.g. ASCII, UCS4 or UTF32) is used as input encoding, the buffer encoding is Unicode. Likewise, when converters are used (see Conversion), the buffer is still filled with Unicode characters. In both cases the Unicode converters in the files mentioned above can be used. They are
QUEX_INLINE void unicode_to_utf8(...);
QUEX_INLINE void unicode_to_utf16(...);
QUEX_INLINE void unicode_to_utf32(...);
QUEX_INLINE void unicode_to_char(...);
QUEX_INLINE void unicode_to_wchar(...);
which are located in the analyzer’s namespace. In plain C there are no namespaces, so the analyzer’s name is prefixed to the function’s name. For example, if the analyzer is named ‘MyLex’, the function group is
QUEX_INLINE void MyLex_unicode_to_utf8(...);
QUEX_INLINE void MyLex_unicode_to_utf16(...);
QUEX_INLINE void MyLex_unicode_to_utf32(...);
QUEX_INLINE void MyLex_unicode_to_char(...);
QUEX_INLINE void MyLex_unicode_to_wchar(...);
The functions above convert a whole string. For single character conversions the same function group is present, only with _character appended to the function name, i.e.
QUEX_INLINE void unicode_to_utf8_character(...);
converts a single character from Unicode to UTF8.
The converters towards UTF8, UTF16, and UTF32 are the ‘basis’. The converters towards char and wchar_t are mapped to one of them, depending on what is appropriate for the size of char and wchar_t. The exact signature of each function follows the scheme of converter in the following code fragment:
QUEX_INLINE void converter(const SourceT** source_pp,
                           const SourceT*  SourceEnd,
                           DrainT**        drain_pp,
                           const DrainT*   DrainEnd);
The converter tries to convert as many characters from source to drain as possible, until the source pointer reaches SourceEnd or the drain pointer reaches DrainEnd. The arguments are pointers to pointers because the converter advances the source and drain pointers as it works. This facilitates repeated calls to the converter in case either source or drain is fragmented.
Note
Depending on the drain’s size, not all characters may be converted. A character is not converted if the remaining drain size is less than the maximum size of an encoded character. For UTF8 it is 8 bytes, for UTF16 4 bytes and for UTF32 4 bytes.
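The calling convention and the drain margin described above can be sketched with a stand-in converter. This is a minimal sketch under stated assumptions, not the quex implementation: `sketch_unicode_to_utf8` and `convert_in_chunks` are hypothetical names, the source is assumed to hold valid UTF32 code points, and the margin of 4 bytes is the longest sequence in standard UTF8, chosen for this sketch only (the margin quex applies may differ).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in following the 'converter' scheme; NOT the quex implementation.
// Converts UTF32 code points to UTF8 and advances both pointers.
inline void sketch_unicode_to_utf8(const uint32_t** source_pp,
                                   const uint32_t*  SourceEnd,
                                   uint8_t**        drain_pp,
                                   const uint8_t*   DrainEnd)
{
    const std::ptrdiff_t MaxBytes = 4;   // margin assumed by this sketch
    const uint32_t* src = *source_pp;
    uint8_t*        out = *drain_pp;
    assert(src <= SourceEnd && out <= DrainEnd);

    // Stop at the end of the source, or when the drain may be too small
    // for the largest possible encoded character.
    while (src != SourceEnd && DrainEnd - out >= MaxBytes) {
        const uint32_t c = *src++;
        if (c < 0x80) {
            *out++ = static_cast<uint8_t>(c);
        } else if (c < 0x800) {
            *out++ = static_cast<uint8_t>(0xC0 | (c >> 6));
            *out++ = static_cast<uint8_t>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            *out++ = static_cast<uint8_t>(0xE0 | (c >> 12));
            *out++ = static_cast<uint8_t>(0x80 | ((c >> 6) & 0x3F));
            *out++ = static_cast<uint8_t>(0x80 | (c & 0x3F));
        } else {
            *out++ = static_cast<uint8_t>(0xF0 | (c >> 18));
            *out++ = static_cast<uint8_t>(0x80 | ((c >> 12) & 0x3F));
            *out++ = static_cast<uint8_t>(0x80 | ((c >> 6) & 0x3F));
            *out++ = static_cast<uint8_t>(0x80 | (c & 0x3F));
        }
    }
    // Report progress back to the caller through the pointer-to-pointers.
    *source_pp = src;
    *drain_pp  = out;
}

// Repeated calls with a small, reused drain buffer (a fragmented drain).
inline std::string convert_in_chunks(const std::vector<uint32_t>& text)
{
    std::string     result;
    uint8_t         chunk[8];
    const uint32_t* src = text.data();
    const uint32_t* end = text.data() + text.size();
    while (src != end) {
        uint8_t* out = chunk;
        sketch_unicode_to_utf8(&src, end, &out, chunk + sizeof(chunk));
        result.append(reinterpret_cast<const char*>(chunk),
                      static_cast<std::size_t>(out - chunk));
    }
    return result;
}
```

Because the converter writes the advanced positions back through `source_pp` and `drain_pp`, the loop in `convert_in_chunks` can simply call it again with a fresh drain until the source is exhausted. The drain buffer is larger than the margin, so every call converts at least one character.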
The converter scheme above is available in both C and C++. In C++, the following converters are additionally available; they are possibly not as fast, but more convenient.
QUEX_INLINE string<uint8_t> unicode_to_utf8(string<qtc>);
QUEX_INLINE string<uint16_t> unicode_to_utf16(string<qtc>);
QUEX_INLINE string<uint32_t> unicode_to_utf32(string<qtc>);
QUEX_INLINE string<char> unicode_to_char(string<qtc>);
QUEX_INLINE string<wchar_t> unicode_to_wchar(string<qtc>);
where string<X> is a shorthand for std::basic_string<X> and string<qtc> is a shorthand for std::basic_string<QUEX_TYPE_CHARACTER>.
This means that they take a string of the lexeme’s type and return a string that is appropriate for the drain’s codec. Plain C has nothing comparable to std::basic_string, so these functions do not exist in C.
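The string-based style can be illustrated with a stand-in that takes a whole lexeme and returns a new string. This is a sketch under stated assumptions, not the quex implementation: `sketch_unicode_to_utf16` is a hypothetical name, `char32_t` stands in for QUEX_TYPE_CHARACTER, and `std::u16string` replaces `std::basic_string<uint16_t>` because the standard library only guarantees character traits for its own character types.

```cpp
#include <cassert>
#include <string>

// Stand-in for a string-based converter; NOT the quex implementation.
// Takes a UTF32 lexeme and returns the UTF16 encoded string.
inline std::u16string sketch_unicode_to_utf16(const std::u32string& lexeme)
{
    std::u16string result;
    result.reserve(lexeme.size());
    for (char32_t c : lexeme) {
        assert(c <= 0x10FFFF);  // the sketch assumes valid code points
        if (c < 0x10000) {
            // Basic Multilingual Plane: one 16 bit unit.
            result.push_back(static_cast<char16_t>(c));
        } else {
            // Supplementary planes: high and low surrogate.
            const char32_t v = c - 0x10000;
            result.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
            result.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
        }
    }
    return result;
}
```

The caller does not manage buffers or end pointers here; the price is an allocation for the returned string, which is one reason such a converter may be slower than the pointer-based scheme.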
When the internal engine is designed using --codec, the buffer codec is some dedicated character encoding. The Lexeme that is presented to the user has exactly the coding of the internal buffer. Precisely, it is a chain of QUEX_TYPE_CHARACTER objects that are encoded in the buffer’s character encoding. In that case quex has to generate the converters towards UTF8, UTF16, and UTF32. The converters follow the same scheme as for Unicode, except that ‘unicode’ is replaced by the codec’s name, e.g.
QUEX_INLINE void iso8859_7_to_utf8(...);
QUEX_INLINE void iso8859_7_to_utf16(...);
QUEX_INLINE void iso8859_7_to_utf32(...);
QUEX_INLINE void iso8859_7_to_char(...);
QUEX_INLINE void iso8859_7_to_wchar(...);
are the generated converters if --codec iso8859-7 was specified. The converters can be included by
#include "MyLexer-converter-iso8859_7" // Declarations
#include "MyLexer-converter-iso8859_7.i" // Implementations
Here, MyLexer is the name of the generated lexical analyzer class and iso8859_7 is the name of the engine’s codec. Furthermore, there is a set of basic functions that are designed to support the aforementioned functions, but are also available to anyone who finds them useful. They are accessed by including
#include <quex/code_base/converter_helper/from-utf8>
#include <quex/code_base/converter_helper/from-utf16>
#include <quex/code_base/converter_helper/from-utf32>
for the declarations and:
#include <quex/code_base/converter_helper/from-utf8.i>
#include <quex/code_base/converter_helper/from-utf16.i>
#include <quex/code_base/converter_helper/from-utf32.i>
for the implementations. They function in exactly the same way as the dedicated converters generated for --codec. That is, their signatures are for example
QUEX_INLINE void utf8_to_utf8(...);
QUEX_INLINE void utf8_to_utf16(...);
QUEX_INLINE void utf8_to_utf32(...);
QUEX_INLINE void utf8_to_char(...);
QUEX_INLINE void utf8_to_wchar(...);
in order to convert UTF8 strings to one of the target codecs. The converters from UTF16 and UTF32 work analogously.
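A converter in the opposite direction can be sketched along the same pointer-based scheme. This is a stand-in under stated assumptions, not the quex implementation: `sketch_utf8_to_utf32` is a hypothetical name, the input is assumed to be well-formed UTF8, and stopping in front of a sequence that is cut off by SourceEnd is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in following the 'converter' scheme; NOT the quex implementation.
// Decodes well-formed UTF8 into UTF32 code points.
inline void sketch_utf8_to_utf32(const uint8_t** source_pp,
                                 const uint8_t*  SourceEnd,
                                 uint32_t**      drain_pp,
                                 const uint32_t* DrainEnd)
{
    const uint8_t* src = *source_pp;
    uint32_t*      out = *drain_pp;
    assert(src <= SourceEnd && out <= DrainEnd);

    while (src != SourceEnd && out != DrainEnd) {
        const uint8_t lead   = *src;
        // Sequence length derived from the lead byte.
        const int     length = lead < 0x80 ? 1
                             : lead < 0xE0 ? 2
                             : lead < 0xF0 ? 3 : 4;
        // Sequence continues in the next source fragment: stop before it.
        if (SourceEnd - src < length) break;

        // Payload bits of the lead byte, then 6 bits per continuation byte.
        uint32_t c = (length == 1) ? lead : (lead & (0xFF >> (length + 1)));
        for (int i = 1; i < length; ++i) {
            c = (c << 6) | (src[i] & 0x3F);
        }
        *out++ = c;
        src   += length;
    }
    *source_pp = src;
    *drain_pp  = out;
}
```

After the call, the source pointer rests on the first byte that was not consumed, so a caller with a fragmented source can prepend the unconsumed bytes to the next fragment and call the converter again.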