Going the (Levenshtein) Distance
Over on the RPG Cafe website, a nice chap called Smit Dixit published a sample of his RPGLE code calculating the Levenshtein distance between two words.
What is the Levenshtein Distance?
The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.
The CueLogic Blog
The exercise here, is to find a way to find the minimum number of changes that are needed to discover the difference between two words. You can change, delete or move letters.
Since I’m investigating a search technique to allow a flexible database search this code might ( just might ) come in handy… So, after a little spring clean with my RPG modernising brush it ended up looking like this:
How do we call this *SRVPGM?
In RPG we could call it like this:
dcl-pr LEVENSHTN likeds(ds1retn);
*n char(100) const; // p@Source
*n char(100) const; // p@Target
end-pr;
dcl-ds ds1Retn inz;
d1LevDist zoned(9);
end-ds;
SQLRPGLE *SRVPGM (Service Program) called LEVENSRV
**FREE
// Prdgram: LEVENSRV SQLRPGLE *SRVPGM
// The Levenshtein algorithm calculates the least number of edit operations that
// are necessary to modify one string to obtain another string.
// This *SRVPGM does the calc and return a 9char alpha value of either:
// * The value
// * *BLANKS... if the inputs were (duh) blank
// * *SAME..... if the inputs were (yep you guessed it) the same
//
// Compile Compile as a module and bind in *SRVPGM(LEVENSRV)
// 1. CRTSQLRPGI ??OBJ()
// ?*SRCFILE(NICK/QRPGLESRC)
// ?*SRCMBR(LEVENSRV)
// ?*OBJTYPE(*MODULE)
// 2. CRTSRVPGM SRVPGM(NICK/LEVENSRV) MODULE(*SRVPGM) EXPORT(*ALL)
// Credit Based on source from RPG Cafe website, a smart chap called Smit Dixit
// History August 2019 - nick@nicklitten.com - Created
ctl-opt
debug
option(*nodebugio:*srcstmt)
datfmt(*iso-) timfmt(*iso.)
indent('| ') truncnbr(*yes) expropts(*resdecpos)
copyright('| LEVENSRV 2019.08.18 V000 Return the Levenshtein distance between two strings')
nomain;
dcl-proc rtn_levenshtein_distance export;
dcl-pi rtn_levenshtein_distance char(9);
p_source char(100);
p_target char(100);
end-pi;
dcl-ds ds1Retn inz;
d1LevDist zoned(9);
end-ds;
dcl-s w@Curdt7 packed(7) inz(0);
dcl-s w@IsoDate date(*iso);
dcl-s w@TodayUSA date(*usa);
dcl-s SqlStm char(1024) inz;
dcl-c tck const(x'7D');
dcl-s w@Index packed(4) inz;
dcl-c validchars const('ABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890');
dcl-ds matrix dim(100) qualified;
idx packed(4) dim(100);
end-ds;
dcl-s i packed(4) inz;
dcl-s j packed(4) inz;
dcl-s s_i char(1) inz;
dcl-s t_j char(1) inz;
dcl-s n packed(4) inz;
dcl-s m packed(4) inz;
dcl-s w@Source char(100) inz;
dcl-s w@Target char(100) inz;
dcl-s w@Cost packed(1) inz;
dcl-s w@Above1 packed(4) inz;
dcl-s w@Left1 packed(4) inz;
dcl-s w@DiagCost packed(4) inz;
dcl-s w@MinVal packed(4) inz;
dcl-s w@SplPos packed(4) inz;
dcl-s w@SplChar char(1) inz;
clear ds1Retn;
//Validate Parameters
if p_Source = *Blanks or p_Target = *Blanks;
return '*BLANK';
else;
w@Source = p_Source;
w@Target = p_Target;
n = %len(%trim(w@Source));
m = %len(%trim(w@Target));
if n = 0 or m = 0;
d1LevDist = 999999999; //Indicates at least one string is blank
return '*BLANK';
endif;
if n = m and %trim(w@Source) = %trim(w@Target); //Strings are same
d1LevDist = n;
return '*SAME';
endif;
//Set both source and target strings to upper case
EXEC SQL SET :w@Source = UPPER(:w@Source);
EXEC SQL SET :w@target = UPPER(:w@Target);
//Remove special characters from both source and target strings
clear w@SplPos;
clear w@SplChar;
dou %check(validchars:%trim(w@Source)) = 0;
w@SplPos = %check(validchars:%trim(w@Source));
if w@SplPos > 0;
w@Source = %subst(w@Source:1:(w@SplPos-1)) + %subst(w@Source:(w@SplPos+1));
else;
leave;
endif;
enddo;
dou %check(validchars:%trim(w@Target)) = 0;
w@SplPos = %check(validchars:%trim(w@Target));
if w@SplPos > 0;
w@Target = %subst(w@Target:1:(w@SplPos-1)) + %subst(w@Target:(w@SplPos+1));
else;
leave;
endif;
enddo;
//Recalculate lengths of cleaned up strings
n = %len(%trim(w@Source));
m = %len(%trim(w@Target));
//Calculate Levenshtein Distance based on cleaned up strings
//Step 1 - Initialize 1st row for source string
for i = 1 by 1 to (n + 1);
matrix(1).idx(i) = i - 1;
endfor;
//Step 2 - Initialize 1st column for target string
for j = 1 by 1 to (m + 1);
matrix(j).idx(1) = j - 1;
endfor;
//Step 3 - Start calculating distances
for i = 2 by 1 to (n + 1);
s_i = %subst(%trim(w@Source):(i-1):1);
for j = 2 by 1 to (m + 1);
t_j = %subst(%trim(w@Target):(j-1):1);
if (s_i = t_j);
w@Cost = 0;
else;
w@Cost = 1;
endif;
clear w@Above1;
clear w@Left1;
clear w@DiagCost;
w@Above1 = matrix(j-1).idx(i) + 1;
w@Left1 = matrix(j).idx(i-1) + 1;
w@DiagCost = matrix(j-1).idx(i-1) + w@Cost;
w@MinVal = w@Above1;
if (w@Left1 < w@MinVal);
w@MinVal = w@Left1;
endif;
if (w@DiagCost < w@MinVal);
w@MinVal = w@DiagCost;
endif;
matrix(j).idx(i) = w@MinVal;
endfor;
endfor;
d1LevDist = matrix(m + 1).idx(n + 1);
endif;
return %char(ds1Retn);
end-proc;
We compile the Service Program:
Now lets add that service program to a binding directory so we can reference it in other programs. Obviously, if you already have a binding directory to use, skip this step:
CRTBNDDIR BNDDIR(NICK/LEVENBIND) TEXT('This is the binding directory for LEVENSRV')
Add the new *SRVPGM to your binding directory:
Example CLLE calling the *SRVPGM LEVENSRV
I like IBM i Control Language as much as I like Marmite. And I love Marmite:
PGM DCL VAR(&INPUT1) TYPE(CHAR) LEN(100) DCL VAR(&INPUT2) TYPE(CHAR) LEN(100) DCL VAR(&RTN) TYPE(*CHAR) LEN(9) DCLPRCOPT BNDDIR(LEVENBIND) CHGVAR VAR(&INPUT1) VALUE('GILY') CHGVAR VAR(&INPUT2) VALUE('GEELY') CALLPRC PRC(RTN_LEVENSHTEIN_DISTANCE) PARM((&INPUT1) + (&INPUT2)) RTNVAL(&RTN) SNDPGMMSG MSGID(CPF9898) MSGF(QCPFMSG) + MSGDTA('Levenshtein distance between' + *BCAT &INPUT1 *TCAT ' and' *BCAT &INPUT2 + *TCAT ' is' BCAT &RTN) TOPGMQ(PRV) + MSGTYPE(*COMP) CHGVAR VAR(&INPUT1) VALUE('HONDA') CHGVAR VAR(&INPUT2) VALUE('HYUNDAI') CALLPRC PRC(RTN_LEVENSHTEIN_DISTANCE) PARM((&INPUT1) + (&INPUT2)) RTNVAL(&RTN) SNDPGMMSG MSGID(CPF9898) MSGF(QCPFMSG) + MSGDTA('Levenshtein distance between' + *BCAT &INPUT1 *TCAT ' and' *BCAT &INPUT2 + *TCAT ' is' BCAT &RTN) TOPGMQ(PRV) + MSGTYPE(*COMP) ENDPGM
Compile it.
Call It.
#feeltheCLgroove
And those are the exact results I was hoping for 🙂
Now lets look at a quick example of calling it from RPGLE:
Example RPGLE calling the *SRVPGM LEVENSRV
A little piece of RPG code to call the *SRVPGM and see what values we get back:
**FREE // Program: LEVENTEST2 // History: August 2019 - nick@nicklitten.com - Created ctl-opt debug option(nodebugio:srcstmt) datfmt(iso-) timfmt(iso.) bnddir('LEVENBIND') indent('| ') truncnbr(yes) expropts(resdecpos) copyright('| LEVENTEST2 2019.08.18 V000 Test example using the Levenshtein *SRVPGM'); dcl-s source char(100); dcl-s target char(100); dcl-s rtnval char(9); // This is the procedure definition for our *SRVPGM. Obviously you should put this in a COPYBOOK but // since its a test I'm being uber lazy, until I get fired up by this second cup of coffee dcl-pr rtn_levenshtein_distance char(9); *n char(100); *n char(100); end-pr; // now call the rtn_levenshtein_distance procedure and see what it gives us: source = 'GILY'; target = 'GEELY'; rtnval = rtn_levenshtein_distance(source:target); dsply ('1. Value GILY/GEELY is ' + rtnval); source = 'HONDA'; target = 'HYUNDAI'; rtnval = rtn_levenshtein_distance(source:target); dsply ('2. Value HONDA/HYUNDAI is ' + rtnval); *inlr = *on;
We can compile this and then just call it with the two words you want to test:
Levenshtein Distance Test Results
Levenshtein distance between GILY and GEELY is 2
Levenshtein distance between HONDA and HYUNDAI is 3
This is beautiful!
Great! Thanks a lot! 🙂