_sec:edit_distance:

Edit Distance¶

Edit distance quantifies the similarity of two lexemes. It is equal to the minimum number of operations required to transform one lexeme into the other []. Operations subject to edit distance are:

Insertion: E.g. insert r after i in fist delivers first.

Deletion: E.g. delete m in limp delivers lip.

Substitution: E.g. substitute r by t in fear delivers feat.

Transposition: E.g. transpose o and r in from delivers form.

To arrive at the word love from the word olive two operations are required. The i needs to be deleted and the l and o need to be transposed. Therefore, the edit distance is two. Applications of edit distance considerations are spelling correction [], fuzzy matching [], and analysis of similarities in DNA/RNA strands [].

A regular expression and its corresponding DFA represent a set of lexemes. Let the edit distance between two DFAs be defined as below.

DFA Edit Distance: The edit distance between two DFAs R and S is equal to the Hausdorff distance [] between the sets of lexemes matching R and S. This holds only for non-cyclic DFAs.

Let \(d(l_0, l_1)\) be the edit distance between two lexemes \(l_0\) and \(l_1\). Since \(d(l_0, l_1) = d(l_1, l_0)\) the Hausdorff distance, is given as:

\[d_H(R, S) = \sup_{r \in R} \inf_{s \in S} d(r, s)\]

In other words: for any lexeme \(r\) in \(R\) there is at least one lexeme \(s\) in \(S\) with \(d(r,s) \le d_H(R,S)\).

The aforementioned edit operations are translated into insertion, deletion, substitution, or transposition of transition conditions between states. Each lexeme matched by a DFA corresponds to a path along transitions in the DFA. As long as the DFA does not contain loops, the modification of a transition effects each path at most once. Accordingly, N modifications to N transitions effect the set of lexemes at most N times. This is what the Hausdorff distance assumes. But, this only holds for non-cyclic graphs (i.e. those without arbitrary repetitions). The modification of a transition in a loop modifies corresponding paths in an arbitrary number.

Deriving DFAs with a specific edit distance from a nominal DFA enables similarity matches. That is, instead of requiring the exact sequence, a match triggers also for similar sequences. The commands listed here to produce similar-matching DFAs do operate also on cyclic DFAs–while issuing a warning. Precise results, though, can only be expected from operations on non-cyclic DFAs. The commands to to derive DFAs with specific edit distances are the following.

\\EdInsertN{DFA N CS}

Matches only lexemes that can be generated by N insertions of characters from character set CS. For example, the expression:

\EdInsert{"abc" 1 [XY]}

matches abcX, abcY, abXc, abYc, aXbc, aYcb, Xabc, and Yacb. That is, whenever no more than one of X or Y is inserted, the DFA matches.

\\EdInsert{DFA N CS}: Matches any lexeme matched by DFA and those lexemes which can be generated by at maximum N insertions of characters from CS.

The naming scheme for edit distance operations goes as follows. \EdOperation (without the N suffix) means ‘matching the nominal lexemes and those achieved with up to N Operations’. \EdOperationN (with the N suffix) means ‘matching only those lexemes achieved by precisely N Operations’. Accordingly, the remaining operations Delete, Substitute, and Transpose are defined.

\\EdDeleteN{Dfa Number CS}

Matches only lexemes achieved by N deletions of characters from CS. For example the expression:

\EdDeleteN{"abc" 1 [ac]}

matches the lexemes ab, bc. The lexeme ac is not matched because only a and c are deletable. b is not. key is not matched, because it cannot be generated by a deletion from key.

\\EdDelete{Nominal Number CS}: Matches Nominal``s lexemes and and those lexemes achieved by at maximum ``N deletions of characters from CS.

\\EdSubstitute{Nominal Number CS SubstituteCS}: Matches Nominals``s lexemes and those lexemes achieved by at maximum ``N substitutions of characters in CS by characters from SubstituteCS.

\\EdSubstituteN{Dfa Number CS SubstituteCS}: Matches only lexemes achieved by N substitutions of characters in CS by characters from SubstituteCS.

\\EdTranspose{Nominal Number CS}: Matches any nominal lexeme and and those lexemes achieved by at maximum N transpositions of adjacent characters from CS.

\\EdTransposeN{Dfa Number SubstituteCS}: Matches only lexemes achieved by N transpositions of adjacent characters from CS.

The Levenshtein automaton [] limits the total number of insertions, deletions, and substitions []. That is, a lexeme that can be reached by two insertions, one deletion, and one substitution of a nominal lexeme would be part of Levenshtein-generated DFA for distance 4. Figure fig_levenshtein_automaton shows a Levenshtein automata derived from a DFA matching the string world with a maximum distance of 2. If a trigger requires the modification of an operation, the state of the next upper layer is entered. When a state of the highest layer is reached, no further deviation can be tolerated.

\\EdLevenshtein{Dfa N InsertCS DeleteCS SubstitutableCS SubstituteCS}: Matches any lexeme that can be produced by applying at maximum N operations of insertion, deletion, or substitution on a lexeme matched by Dfa.

A variant of the Levenshtein distance is the Damerau-Levenshtein distance. It includes the transposition operation []. To mimic the behavior of a Damerau-Levenshtein automata, a pseudo-damerau DFA may be generated. The automaton is implemented as follows: For each number from 0 to the maximum edit distance N, derive a pair (i, N-i) where i is the number of allowed transpositions and N-i the number of insertions, deletions, or substitutions. The union of DFAs that are generated from Levenshtein automatons of level i on transposed automatons of level N-i shall cover all lexemes with a Damerau-Levenshtein distance of N [#f1]_.

\\EdPseudoDamerau{Dfa N InsertCS DeleteCS SubstitutableCS SubstituteCS TransposeCS}: Matches any lexeme achieved by applying at maximum N operations of insertion, deletion, substitution, or transposition on a lexeme matched by Dfa.

The automaton is called a pseudo, i.e. seemingly, because it does not cover the complete set of lexemes with a specific Damerau-Levenshtein distance of N [#f2]_. It only applies transpositions on the original nominal lexemes. For example, a lexeme achieved by a deletion and a subsequent transposition is not necessarily matched [f#2]_. The computation required for \EdPseudoDamerau tends to be enormous. Whenever practical, a simple Levenshtein automata on a transposed DFA shall do, i.e. instead of:

\EdPseudoDamerau{"world" 4 [a-z] [a-z] [a-z] [a-z] [a-z]}

a simpler version

EdLevenshtein{EdTransposeN{“world” 1 N [a-z]} 3 [a-z] [a-z] [a-z] [a-z]}

might do as well.

Applications¶

The following paragraphs pinpoint the wide range of applications for edit distance based DFA generation.

Diacritic Tolerance¶

The deletion may be applied to match against words in Arabic or Hebrew with and without diacricts, i.e. with and without vowelization, as shown in the following for Arabic.

   define {
       // Arabic Vowelization:
       //    U+064B Fathatan     U+064C Dammatan     U+064D Kasratan
       //    U+064E Fatha        U+064F Damma        U+0650 Kasra
       //    U+0651 Shadda       U+0652 Sukun        U+0653 Maddah Above
       //    U+0654 Hamza Above  U+0655 Hamza Below  U+0656 Subscript Alef
       DIACRITICS [\X064B\X064C\X064D\X064E\X064F\X0650\X0651\X0652\X0653\X0654\X0655\X0656]

       // A maximum number of 100 diacritics/word shall do.
       \macro  ANY(integer N, dfa X): \EdDelete{{X} {N} {DIACRITICS}}
   }

With the aforementioned setup, tolerant matching is achieved by the
specification of the full vowelized word in the ``TOL`` macro call
preceeded by the number of vowels.

   mode TOLERANT_MATCHER {
      {ANY العَملِ 3}  => QUEX_TKN_TEACHER(Lexeme);
   }

Fuzzy Matching and Spelling Error Detection¶

A classical application of fuzzy matching is the detection of spelling errors and the proposition of their correction. For example, it is natural to miss a letter when typing quickly such that, for example, the word shipment is written shpment. A DFA that tolerates a deletion from lexemes matching "shipment would match the deviating lexeme and ask something like ‘did you mean shipment’.

A potential problem arises when fuzzy matching DFAs tolerate insertions. Then, they are able outrun nominal DFAs by matching longer lexemes. In other words, a fuzzy match would be detected while the original pattern is actually present. The solution is a mutual exclusive matching implemented by two modes. The first mode is matches the nominal patterns. If none matches, the input position must be reset and a mode for fuzzy matching is entered. This is shown in the following example.:

mode PRECISE {
    AAACAATAAGAA       => QUEX_TKN_AAACAATAAGAA(Lexeme);
    AAATAAAAATAAGAACAA => QUEX_TKN_AAATAAAAATAAGAACAA(Lexeme);
    AAATAACAAGAA       => QUEX_TKN_AAATAACAAGAA(Lexeme);
    on_failure         { self.undo(); self.enter(FUZZY); }
    // In C:            'self.undo(&self);'
}

mode FUZZY {
    {TOL AAACAATAAGAA}       => QUEX_TKN_FUZZY_AAACAATAAGAA(Lexeme);
    {TOL AAATAAAAATAAGAACAA} => QUEX_TKN_FUZZY_AAATAAAAATAAGAACAA(Lexeme);
    {TOL AAATAACAAGAA}       => QUEX_TKN_FUZZY_AAATAACAAGAA(Lexeme);
    [ACTG]                   => QUEX_TKN_UNKNOWN(Lexeme);
}

where TOL might be defined as:

define {
   ...
   \macro  TOL(dfa X):
           \EdPseudoDamerau{{X} 2 [ACTG] [ACTG] [ACTG] [ACTG] [ACTG]}
   ...
}

In languages with delimiters, such as whitespace between lexemes, the problem of outrunning is reduced, but not categorily outruled. When parsing English texts the lexeme ‘world’ may be taken as a fuzzy version od ‘word’. In such circumstances it may not be necessary to have two separate modes. Instead, the fuzzy matchers may be subjected to subtraction of all nominal patterns.:

   define {
       ...
       CS          [a-z]
       ALL_NOMINAL "word"|"world"

       \macro TOL(dfa X, integer N):
              \Diff{\EdLevenshtein{{X} {N} {CS} {CS} {CS} {CS}} {ALL_NOMINAL}}
       ...
   }

The ``\Diff`` operation ensures that a fuzzy matcher produced bz ``TOL`` does
not match any lexeme contained in ``ALL_NOMINAL``.

   mode ENGLISH {
       "word"          => QUEX_TKN_WORD(Lexeme);
       "world"         => QUEX_TKN_WORLD(Lexeme);
       {TOL 2 "word"}  => QUEX_TKN_FUZZY_WORD(Lexeme);
       {TOL 2 "world"} => QUEX_TKN_FUZZY_WORLD(Lexeme);
   }

Edit Distance Measurement¶

Another application is the very fast determination of a given lexeme with respect to another lexeme, or the determination of the closest lexeme from a given bank of lexemes. For given sets of lexemes, a mode for every edit distance is determined. The modes are then applied sequentially, from the mode for the lowest edit distance to the highest edit distance. The first mode that triggers reports the matching lexeme and its edit distance. This can be done as in the following example for C.

mode ED_0 {
    "handkerchief"         => QUEX_TKN_HANDKERCHIEF(number=0);
    "conscience"           => QUEX_TKN_CONSCIENCE(number=0);
    "playwright"           => QUEX_TKN_PLAYWRIGHT(number=0);
    on_failure             { self.undo(&self); self.enter_mode(&self, ED_1); }
}

mode ED_1 {
    {TOL 1 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=1);
    {TOL 1 "conscience"}   => QUEX_TKN_CONSCIENCE(number=1);
    {TOL 1 "playwright"}   => QUEX_TKN_PLAYWRIGHT(number=1);
    on_failure             { self.undo(&self); self.enter_mode(&self, ED_2); }
}

mode ED_2 {
    {TOL 2 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=2);
    {TOL 2 "conscience"}   => QUEX_TKN_CONSCIENCE(number=2);
    {TOL 2 "playwright"}   => QUEX_TKN_PLAYWRIGHT(number=2);
    on_failure             { self.undo(&self); self.enter_mode(&self, ED_2); }
}

mode ED_3 {
    {TOL 3 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=3);
    {TOL 3 "conscience"}   => QUEX_TKN_CONSCIENCE(number=3);
    {TOL 3 "playwright"}   => QUEX_TKN_PLAYWRIGHT(number=3);
}

Omitting the on_failure handler in the last mode triggers the sending of QUEX_TKN_FAILURE in case that no fuzzy matcher matched. An application interface for such an edit distance measure function is shown below:

const char*
fuzzy_match(const char* Begin, size_t N, int* edit_distance) {
// INPUT:
//    Begin: pointer to first character of lexeme.
//    N:     number of bytes in lexeme.
// ADAPTS:
//    edit_distance: edit distance of the closest lexeme matching
// RETURNS:
//    Pointer to zero-terminated name of the closest lexeme matching.
//    If nothing matches, then "<none>" is returned.
    EdLex        ed(Begin, N, nullptr);
    EdLex_Token* token_p = nullptr;
    ed.receive(token_p);
    switch(token_p->id) {
    case QUEX_TKN_HANDKERCHIEF: name = "handkerchief"; break;
    case QUEX_TKN_CONSCIENCE:   name = "conscience"; break;
    case QUEX_TKN_PLAYWRIGHT:   name = "playwright"; break;
    default:                    name = "<none>"; break;
    }
    *edit_distance = token_p->number;
    return name;
}