_sec:edit_distance:
Edit Distance¶
Edit distance quantifies the similarity of two lexemes. It is equal to the minimum number of operations required to transform one lexeme into the other []. Operations subject to edit distance are:
Insertion: E.g. insert
r
afteri
infist
deliversfirst
.Deletion: E.g. delete
m
inlimp
deliverslip
.Substitution: E.g. substitute
r
byt
infear
deliversfeat
.Transposition: E.g. transpose
o
andr
infrom
deliversform
.
To arrive at the word love
from the word olive
two operations are
required. The i
needs to be deleted and the l
and o
need to be
transposed. Therefore, the edit distance is two. Applications of edit distance
considerations are spelling correction [], fuzzy matching
[], and analysis of similarities in DNA/RNA strands
[].
A regular expression and its corresponding DFA represent a set of lexemes. Let the edit distance between two DFAs be defined as below.
- DFA Edit Distance
The edit distance between two DFAs
R
andS
is equal to the Hausdorff distance [] between the sets of lexemes matchingR
andS
. This holds only for non-cyclic DFAs.
Let \(d(l_0, l_1)\) be the edit distance between two lexemes \(l_0\) and \(l_1\). Since \(d(l_0, l_1) = d(l_1, l_0)\) the Hausdorff distance, is given as:
In other words: for any lexeme \(r\) in \(R\) there is at least one lexeme \(s\) in \(S\) with \(d(r,s) \le d_H(R,S)\).
The aforementioned edit operations are translated into insertion, deletion, substitution, or transposition of transition conditions between states. Each lexeme matched by a DFA corresponds to a path along transitions in the DFA. As long as the DFA does not contain loops, the modification of a transition effects each path at most once. Accordingly, N modifications to N transitions effect the set of lexemes at most N times. This is what the Hausdorff distance assumes. But, this only holds for non-cyclic graphs (i.e. those without arbitrary repetitions). The modification of a transition in a loop modifies corresponding paths in an arbitrary number.
Deriving DFAs with a specific edit distance from a nominal DFA enables similarity matches. That is, instead of requiring the exact sequence, a match triggers also for similar sequences. The commands listed here to produce similar-matching DFAs do operate also on cyclic DFAs–while issuing a warning. Precise results, though, can only be expected from operations on non-cyclic DFAs. The commands to to derive DFAs with specific edit distances are the following.
- \\EdInsertN{DFA N CS}
Matches only lexemes that can be generated by
N
insertions of characters from character setCS
. For example, the expression:\EdInsert{"abc" 1 [XY]}
matches
abcX
,abcY
,abXc
,abYc
,aXbc
,aYcb
,Xabc
, andYacb
. That is, whenever no more than one ofX
orY
is inserted, the DFA matches.
- \\EdInsert{DFA N CS}
Matches any lexeme matched by
DFA
and those lexemes which can be generated by at maximumN
insertions of characters fromCS
.
The naming scheme for edit distance operations goes as follows. \EdOperation
(without the N suffix) means ‘matching the nominal lexemes and those achieved
with up to N
Operation
s’. \EdOperationN
(with the N suffix) means
‘matching only those lexemes achieved by precisely N
Operation
s’.
Accordingly, the remaining operations Delete
, Substitute
, and Transpose
are defined.
- \\EdDeleteN{Dfa Number CS}
Matches only lexemes achieved by
N
deletions of characters fromCS
. For example the expression:\EdDeleteN{"abc" 1 [ac]}
matches the lexemes
ab
,bc
. The lexemeac
is not matched because onlya
andc
are deletable.b
is not.key
is not matched, because it cannot be generated by a deletion fromkey
.
- \\EdDelete{Nominal Number CS}
Matches
Nominal``s lexemes and and those lexemes achieved by at maximum ``N
deletions of characters fromCS
.
- \\EdSubstitute{Nominal Number CS SubstituteCS}
Matches
Nominals``s lexemes and those lexemes achieved by at maximum ``N
substitutions of characters inCS
by characters fromSubstituteCS
.
- \\EdSubstituteN{Dfa Number CS SubstituteCS}
Matches only lexemes achieved by
N
substitutions of characters inCS
by characters fromSubstituteCS
.
- \\EdTranspose{Nominal Number CS}
Matches any nominal lexeme and and those lexemes achieved by at maximum
N
transpositions of adjacent characters fromCS
.
- \\EdTransposeN{Dfa Number SubstituteCS}
Matches only lexemes achieved by
N
transpositions of adjacent characters fromCS
.
The Levenshtein automaton [] limits the total number of
insertions, deletions, and substitions []. That is, a lexeme
that can be reached by two insertions, one deletion, and one substitution of a
nominal lexeme would be part of Levenshtein-generated DFA for distance 4.
Figure fig_levenshtein_automaton shows a Levenshtein automata derived
from a DFA matching the string world
with a maximum distance of 2. If a
trigger requires the modification of an operation, the state of the next upper
layer is entered. When a state of the highest layer is reached, no further
deviation can be tolerated.
- \\EdLevenshtein{Dfa N InsertCS DeleteCS SubstitutableCS SubstituteCS}
Matches any lexeme that can be produced by applying at maximum
N
operations of insertion, deletion, or substitution on a lexeme matched byDfa
.
A variant of the Levenshtein distance is the Damerau-Levenshtein distance. It includes the transposition operation []. To mimic the behavior of a Damerau-Levenshtein automata, a pseudo-damerau DFA may be generated. The automaton is implemented as follows: For each number from 0 to the maximum edit distance N, derive a pair (i, N-i) where i is the number of allowed transpositions and N-i the number of insertions, deletions, or substitutions. The union of DFAs that are generated from Levenshtein automatons of level i on transposed automatons of level N-i shall cover all lexemes with a Damerau-Levenshtein distance of N [#f1]_.
- \\EdPseudoDamerau{Dfa N InsertCS DeleteCS SubstitutableCS SubstituteCS TransposeCS}
Matches any lexeme achieved by applying at maximum
N
operations of insertion, deletion, substitution, or transposition on a lexeme matched byDfa
.
The automaton is called a pseudo, i.e. seemingly, because it does not cover
the complete set of lexemes with a specific Damerau-Levenshtein distance of
N
[#f2]_. It only applies transpositions on the original nominal lexemes.
For example, a lexeme achieved by a deletion and a subsequent transposition is
not necessarily matched [f#2]_. The computation required for \EdPseudoDamerau
tends to be enormous. Whenever practical, a simple Levenshtein automata on
a transposed DFA shall do, i.e. instead of:
\EdPseudoDamerau{"world" 4 [a-z] [a-z] [a-z] [a-z] [a-z]}
a simpler version
EdLevenshtein{EdTransposeN{“world” 1 N [a-z]} 3 [a-z] [a-z] [a-z] [a-z]}
might do as well.
Applications¶
The following paragraphs pinpoint the wide range of applications for edit distance based DFA generation.
Diacritic Tolerance¶
The deletion may be applied to match against words in Arabic or Hebrew with and without diacricts, i.e. with and without vowelization, as shown in the following for Arabic.
define {
// Arabic Vowelization:
// U+064B Fathatan U+064C Dammatan U+064D Kasratan
// U+064E Fatha U+064F Damma U+0650 Kasra
// U+0651 Shadda U+0652 Sukun U+0653 Maddah Above
// U+0654 Hamza Above U+0655 Hamza Below U+0656 Subscript Alef
DIACRITICS [\X064B\X064C\X064D\X064E\X064F\X0650\X0651\X0652\X0653\X0654\X0655\X0656]
// A maximum number of 100 diacritics/word shall do.
\macro ANY(integer N, dfa X): \EdDelete{{X} {N} {DIACRITICS}}
}
With the aforementioned setup, tolerant matching is achieved by the
specification of the full vowelized word in the ``TOL`` macro call
preceeded by the number of vowels.
mode TOLERANT_MATCHER {
{ANY العَملِ 3} => QUEX_TKN_TEACHER(Lexeme);
}
Fuzzy Matching and Spelling Error Detection¶
A classical application of fuzzy matching is the detection of spelling errors
and the proposition of their correction. For example, it is natural to miss
a letter when typing quickly such that, for example, the word shipment
is
written shpment
. A DFA that tolerates a deletion from lexemes matching
"shipment
would match the deviating lexeme and ask something like
‘did you mean shipment’.
A potential problem arises when fuzzy matching DFAs tolerate insertions. Then, they are able outrun nominal DFAs by matching longer lexemes. In other words, a fuzzy match would be detected while the original pattern is actually present. The solution is a mutual exclusive matching implemented by two modes. The first mode is matches the nominal patterns. If none matches, the input position must be reset and a mode for fuzzy matching is entered. This is shown in the following example.:
mode PRECISE {
AAACAATAAGAA => QUEX_TKN_AAACAATAAGAA(Lexeme);
AAATAAAAATAAGAACAA => QUEX_TKN_AAATAAAAATAAGAACAA(Lexeme);
AAATAACAAGAA => QUEX_TKN_AAATAACAAGAA(Lexeme);
on_failure { self.undo(); self.enter(FUZZY); }
// In C: 'self.undo(&self);'
}
mode FUZZY {
{TOL AAACAATAAGAA} => QUEX_TKN_FUZZY_AAACAATAAGAA(Lexeme);
{TOL AAATAAAAATAAGAACAA} => QUEX_TKN_FUZZY_AAATAAAAATAAGAACAA(Lexeme);
{TOL AAATAACAAGAA} => QUEX_TKN_FUZZY_AAATAACAAGAA(Lexeme);
[ACTG] => QUEX_TKN_UNKNOWN(Lexeme);
}
where TOL
might be defined as:
define {
...
\macro TOL(dfa X):
\EdPseudoDamerau{{X} 2 [ACTG] [ACTG] [ACTG] [ACTG] [ACTG]}
...
}
In languages with delimiters, such as whitespace between lexemes, the problem of outrunning is reduced, but not categorily outruled. When parsing English texts the lexeme ‘world’ may be taken as a fuzzy version od ‘word’. In such circumstances it may not be necessary to have two separate modes. Instead, the fuzzy matchers may be subjected to subtraction of all nominal patterns.:
define {
...
CS [a-z]
ALL_NOMINAL "word"|"world"
\macro TOL(dfa X, integer N):
\Diff{\EdLevenshtein{{X} {N} {CS} {CS} {CS} {CS}} {ALL_NOMINAL}}
...
}
The ``\Diff`` operation ensures that a fuzzy matcher produced bz ``TOL`` does
not match any lexeme contained in ``ALL_NOMINAL``.
mode ENGLISH {
"word" => QUEX_TKN_WORD(Lexeme);
"world" => QUEX_TKN_WORLD(Lexeme);
{TOL 2 "word"} => QUEX_TKN_FUZZY_WORD(Lexeme);
{TOL 2 "world"} => QUEX_TKN_FUZZY_WORLD(Lexeme);
}
Edit Distance Measurement¶
Another application is the very fast determination of a given lexeme with respect to another lexeme, or the determination of the closest lexeme from a given bank of lexemes. For given sets of lexemes, a mode for every edit distance is determined. The modes are then applied sequentially, from the mode for the lowest edit distance to the highest edit distance. The first mode that triggers reports the matching lexeme and its edit distance. This can be done as in the following example for C.
mode ED_0 {
"handkerchief" => QUEX_TKN_HANDKERCHIEF(number=0);
"conscience" => QUEX_TKN_CONSCIENCE(number=0);
"playwright" => QUEX_TKN_PLAYWRIGHT(number=0);
on_failure { self.undo(&self); self.enter_mode(&self, ED_1); }
}
mode ED_1 {
{TOL 1 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=1);
{TOL 1 "conscience"} => QUEX_TKN_CONSCIENCE(number=1);
{TOL 1 "playwright"} => QUEX_TKN_PLAYWRIGHT(number=1);
on_failure { self.undo(&self); self.enter_mode(&self, ED_2); }
}
mode ED_2 {
{TOL 2 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=2);
{TOL 2 "conscience"} => QUEX_TKN_CONSCIENCE(number=2);
{TOL 2 "playwright"} => QUEX_TKN_PLAYWRIGHT(number=2);
on_failure { self.undo(&self); self.enter_mode(&self, ED_2); }
}
mode ED_3 {
{TOL 3 "handkerchief"} => QUEX_TKN_HANDKERCHIEF(number=3);
{TOL 3 "conscience"} => QUEX_TKN_CONSCIENCE(number=3);
{TOL 3 "playwright"} => QUEX_TKN_PLAYWRIGHT(number=3);
}
Omitting the on_failure
handler in the last mode triggers the sending of
QUEX_TKN_FAILURE
in case that no fuzzy matcher matched. An application
interface for such an edit distance measure function is shown below:
const char* fuzzy_match(const char* Begin, size_t N, int* edit_distance) { // INPUT: // Begin: pointer to first character of lexeme. // N: number of bytes in lexeme. // ADAPTS: // edit_distance: edit distance of the closest lexeme matching // RETURNS: // Pointer to zero-terminated name of the closest lexeme matching. // If nothing matches, then "<none>" is returned. EdLex ed(Begin, N, nullptr); EdLex_Token* token_p = nullptr; ed.receive(token_p); switch(token_p->id) { case QUEX_TKN_HANDKERCHIEF: name = "handkerchief"; break; case QUEX_TKN_CONSCIENCE: name = "conscience"; break; case QUEX_TKN_PLAYWRIGHT: name = "playwright"; break; default: name = "<none>"; break; } *edit_distance = token_p->number; return name; }