Going the (Levenshtein) Distance in RPG Free

IBM i

Aug 18

Over on the RPG Cafe website, a nice chap called Smit Dixit published a sample of his RPGLE code calculating the Levenshtein distance between two words.

What is the Levenshtein Distance?

The Levenshtein distance is a string metric for measuring difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. It is named after Vladimir Levenshtein, who considered this distance in 1965.Levenshtein distance may also be referred to as edit distance, although it may also denote a larger family of distance metrics. It is closely related to pairwise string alignments.

The CueLogic Blog

The exercise here, is to find a way to find the minimum number of changes that are needed to discover the difference between two words. You can change, delete or move letters.

Since I’m investigating a search technique to allow a flexible database search this code might ( just might ) come in handy… So, after a little spring clean with my RPG modernising brush it ended up looking like this:

How do we call this *SRVPGM?

In RPG we could call it like this:

   dcl-pr LEVENSHTN likeds(ds1retn);
     *n char(100) const; // p@Source
     *n char(100) const; // p@Target
    end-pr;

    dcl-ds ds1Retn inz;
     d1LevDist zoned(9);
    end-ds; 

SQLRPGLE *SRVPGM (Service Program) called LEVENSRV

**FREE
 
// Prdgram: LEVENSRV SQLRPGLE *SRVPGM                                                     
//          The Levenshtein algorithm calculates the least number of edit operations that 
//          are necessary to modify one string to obtain another string.                  
//          This *SRVPGM does the calc and return a 9char alpha value of either:          
//           * The value                                                                  
//           * *BLANKS... if the inputs were (duh) blank                                  
//           * *SAME..... if the inputs were (yep you guessed it) the same                
//    
// Compile      Compile as a module and bind in *SRVPGM(LEVENSRV)                         
//              1. CRTSQLRPGI ??OBJ()                                                     
//                 ?*SRCFILE(NICK/QRPGLESRC)                                              
//                 ?*SRCMBR(LEVENSRV)                                                     
//                 ?*OBJTYPE(*MODULE)                                                     
//              2. CRTSRVPGM SRVPGM(NICK/LEVENSRV) MODULE(*SRVPGM) EXPORT(*ALL)           
// Credit       Based on source from RPG Cafe website, a smart chap called Smit Dixit     
// History      August 2019 - nick@nicklitten.com - Created                               
                                                                                          
ctl-opt                                                                                   
 debug
 option(*nodebugio:*srcstmt)                                                                
 datfmt(*iso-) timfmt(*iso.)                                                   
 indent('| ') truncnbr(*yes) expropts(*resdecpos)                                           
 copyright('| LEVENSRV 2019.08.18 V000 Return the Levenshtein distance between two strings')
 nomain;                                                                                    
                                                                                            
dcl-proc rtn_levenshtein_distance export;                                                   
 dcl-pi rtn_levenshtein_distance char(9);                                                   
  p_source char(100);                                                                       
  p_target char(100);                                                                       
 end-pi;                                                                                    
                                                                                            
 dcl-ds ds1Retn inz;                                                                        
  d1LevDist zoned(9);                                                                       
 end-ds;                                                                                    
                                                                                            
 dcl-s w@Curdt7 packed(7) inz(0);                                                           
 dcl-s w@IsoDate date(*iso);                                                                
 dcl-s w@TodayUSA date(*usa);                                                               
 dcl-s SqlStm char(1024) inz;                                                               
 dcl-c tck const(x'7D');                                                                    
 dcl-s w@Index packed(4) inz;                                                               
 dcl-c validchars const('ABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890');                                                            
 dcl-ds matrix dim(100) qualified;              
 idx packed(4) dim(100);                        
 end-ds;                                        
 dcl-s i packed(4) inz;                         
 dcl-s j packed(4) inz;                         
 dcl-s s_i char(1) inz;                         
 dcl-s t_j char(1) inz;                         
 dcl-s n packed(4) inz;                         
 dcl-s m packed(4) inz;                         
 dcl-s w@Source char(100) inz;                  
 dcl-s w@Target char(100) inz;                  
 dcl-s w@Cost packed(1) inz;                    
 dcl-s w@Above1 packed(4) inz;                  
 dcl-s w@Left1 packed(4) inz;                   
 dcl-s w@DiagCost packed(4) inz;                
 dcl-s w@MinVal packed(4) inz;                  
 dcl-s w@SplPos packed(4) inz;                  
 dcl-s w@SplChar char(1) inz;                   
                                                
 clear ds1Retn;                                 
                                                
 //Validate Parameters                          
 if p_Source = *Blanks or p_Target = *Blanks;   
   return '*BLANK';                             
 else;                                                               
                                                                     
   w@Source = p_Source;                                              
   w@Target = p_Target;                                              
   n = %len(%trim(w@Source));                                        
   m = %len(%trim(w@Target));                                        
                                                                     
   if n = 0 or m = 0;                                                
    d1LevDist = 999999999;  //Indicates at least one string is blank 
    return '*BLANK';                                                 
   endif;                                                            
                                                                     
   if n = m and %trim(w@Source) = %trim(w@Target); //Strings are same
    d1LevDist = n;                                                   
    return '*SAME';                                                  
   endif;                                                            
                                                                     
   //Set both source and target strings to upper case                
   EXEC SQL SET :w@Source = UPPER(:w@Source);                        
   EXEC SQL SET :w@target = UPPER(:w@Target);                        
                                                                     
   //Remove special characters from both source and target strings   
   clear w@SplPos;                                                   
   clear w@SplChar;                                                  
   dou %check(validchars:%trim(w@Source)) = 0;                                   
    w@SplPos = %check(validchars:%trim(w@Source));                               
    if w@SplPos > 0;                                                             
     w@Source = %subst(w@Source:1:(w@SplPos-1)) + %subst(w@Source:(w@SplPos+1)); 
    else;                                                                        
     leave;                                                                      
    endif;                                                                       
   enddo;                                                                        
                                                                                 
   dou %check(validchars:%trim(w@Target)) = 0;                                   
    w@SplPos = %check(validchars:%trim(w@Target));                               
    if w@SplPos > 0;                                                             
     w@Target = %subst(w@Target:1:(w@SplPos-1)) + %subst(w@Target:(w@SplPos+1)); 
    else;                                                                        
     leave;                                                                      
    endif;                                                                       
   enddo;                                                                        
                                                                                 
   //Recalculate lengths of cleaned up strings                                   
   n = %len(%trim(w@Source));                                                    
   m = %len(%trim(w@Target));                                                    
                                                                                 
   //Calculate Levenshtein Distance based on cleaned up strings                  
   //Step 1 - Initialize 1st row for source string                               
   for i = 1 by 1 to (n + 1);                         
    matrix(1).idx(i) = i - 1;                         
   endfor;                                            
                                                      
   //Step 2 - Initialize 1st column for target string 
   for j = 1 by 1 to (m + 1);                         
    matrix(j).idx(1) = j - 1;                         
   endfor;                                            
                                                      
   //Step 3 - Start calculating distances             
   for i = 2 by 1 to (n + 1);                         
                                                      
    s_i = %subst(%trim(w@Source):(i-1):1);            
                                                      
    for j = 2 by 1 to (m + 1);                        
                                                      
     t_j = %subst(%trim(w@Target):(j-1):1);           
                                                      
     if (s_i = t_j);                                  
      w@Cost = 0;                                     
     else;                                            
      w@Cost = 1;                                     
     endif;                                           
                                                      
     clear w@Above1;                               
     clear w@Left1;                                
     clear w@DiagCost;                             
     w@Above1    = matrix(j-1).idx(i) + 1;         
     w@Left1     = matrix(j).idx(i-1) + 1;         
     w@DiagCost  = matrix(j-1).idx(i-1) + w@Cost;  
     w@MinVal = w@Above1;                          
                                                   
     if (w@Left1 < w@MinVal);                      
      w@MinVal = w@Left1;                          
     endif;                                        
                                                   
     if (w@DiagCost < w@MinVal);                   
      w@MinVal = w@DiagCost;                       
     endif;                                        
                                                   
     matrix(j).idx(i) = w@MinVal;                  
                                                   
    endfor;                                        
                                                   
   endfor;                                         
                                                   
   d1LevDist = matrix(m + 1).idx(n + 1);           
 endif;                                            
                               
 return %char(ds1Retn);        
                               
end-proc;                      

We compile the Service Program:

Now lets add that service program to a binding directory so we can reference it in other programs. Obviously, if you already have a binding directory to use, skip this step:

CRTBNDDIR BNDDIR(NICK/LEVENBIND) TEXT('This is the binding directory for LEVENSRV')

Add the new *SRVPGM to your binding directory:

Example CLLE calling the *SRVPGM LEVENSRV

I like IBM i Control Language as much as I like Marmite. And I love Marmite:

PGM

 DCL        VAR(&INPUT1) TYPE(CHAR) LEN(100)
 DCL        VAR(&INPUT2) TYPE(CHAR) LEN(100)              
 DCL        VAR(&RTN) TYPE(*CHAR) LEN(9)

 DCLPRCOPT  BNDDIR(LEVENBIND)

 CHGVAR     VAR(&INPUT1) VALUE('GILY')                     
 CHGVAR     VAR(&INPUT2) VALUE('GEELY')

 CALLPRC    PRC(RTN_LEVENSHTEIN_DISTANCE) PARM((&INPUT1) + 
              (&INPUT2)) RTNVAL(&RTN)

 SNDPGMMSG  MSGID(CPF9898) MSGF(QCPFMSG) +                 
              MSGDTA('Levenshtein distance between' +      
              *BCAT &INPUT1 *TCAT ' and' *BCAT &INPUT2 +   
              *TCAT ' is' BCAT &RTN) TOPGMQ(PRV) +       
              MSGTYPE(*COMP)

 CHGVAR     VAR(&INPUT1) VALUE('HONDA')                    
 CHGVAR     VAR(&INPUT2) VALUE('HYUNDAI')

 CALLPRC    PRC(RTN_LEVENSHTEIN_DISTANCE) PARM((&INPUT1) + 
              (&INPUT2)) RTNVAL(&RTN) 

 SNDPGMMSG  MSGID(CPF9898) MSGF(QCPFMSG) +              
              MSGDTA('Levenshtein distance between' +   
              *BCAT &INPUT1 *TCAT ' and' *BCAT &INPUT2 +
              *TCAT ' is' BCAT &RTN) TOPGMQ(PRV) +    
              MSGTYPE(*COMP)                            
 ENDPGM

Compile it.

Call It.

#feeltheCLgroove

And those are the exact results I was hoping for 🙂

Now lets look at a quick example of calling it from RPGLE:

Example RPGLE calling the *SRVPGM LEVENSRV

A little piece of RPG code to call the *SRVPGM and see what values we get back:

**FREE                                                                                              
 // Program: LEVENTEST2                                                                              
 // History: August 2019 - nick@nicklitten.com - Created                                             
 ctl-opt                                                                                             
  debug                                                                                              
  option(nodebugio:srcstmt)                                                                        
  datfmt(iso-) timfmt(iso.)                                                                        
  bnddir('LEVENBIND')                                                                                
  indent('| ') truncnbr(yes) expropts(resdecpos)                                                   
  copyright('| LEVENTEST2 2019.08.18 V000 Test example using the Levenshtein *SRVPGM');     
         
 dcl-s source char(100);                                                                             
 dcl-s target char(100);                                                                             
 dcl-s rtnval char(9);       
                                                                        
 // This is the procedure definition for our *SRVPGM. Obviously you should put this in a COPYBOOK but
 // since its a test I'm being uber lazy, until I get fired up by this second cup of coffee          
 dcl-pr rtn_levenshtein_distance char(9);                                                            
   *n char(100);                                                                                     
   *n char(100);                                                                                     
 end-pr;                                                                                             
        
// now call the rtn_levenshtein_distance procedure and see what it gives us: 
 source = 'GILY';                                                             
 target = 'GEELY';                                                            
 rtnval = rtn_levenshtein_distance(source:target);                            
 dsply ('1. Value GILY/GEELY is ' + rtnval);   
                               
 source = 'HONDA';                                                            
 target = 'HYUNDAI';                                                          
 rtnval = rtn_levenshtein_distance(source:target);                            
 dsply ('2. Value HONDA/HYUNDAI is ' + rtnval);    
                           
 *inlr = *on;                                                                                                                 

We can compile this and then just call it with the two words you want to test:

Levenshtein Distance Test Results

Levenshtein distance between GILY and GEELY is 2

Levenshtein distance between HONDA and HYUNDAI is 3

Follow

About the Author

IBM i Software Developer, Digital Dad, AS400 Anarchist, RPG Modernizer, Alpha Nerd and Passionate Eater of Cheese and Biscuits. Nick Litten Dot Com is a mixture of blog posts that can be sometimes serious, frequently playful and probably down-right pointless all in the space of a day. Enjoy your stay, feel free to comment and in the words of the most interesting man in the world: Stay thirsty my friend.