NAME
    Text::Fuzzy::PP - partial or fuzzy string matching using edit distances
    (Pure Perl)

SYNOPSIS
        use Text::Fuzzy::PP;
        my $tf = Text::Fuzzy::PP->new ('boboon');
        print "Distance is ", $tf->distance ('babboon'), "\n";
        # Prints "Distance is 2"
        my @words = qw/the quick brown fox jumped over the lazy dog/;
        my $nearest = $tf->nearest (\@words);
        print "Nearest array entry is ", $words[$nearest], "\n";
        # Prints "Nearest array entry is brown"

DESCRIPTION
    This module is a drop-in, pure-Perl substitute for Text::Fuzzy. All
    documentation is taken directly from Text::Fuzzy.

    This module calculates the Levenshtein edit distance between words, and
    does edit-distance-based searching of arrays and files to find the
    nearest entry. It can handle either byte strings or character strings
    (strings containing Unicode), treating each Unicode character as a
    single entity.

    It is designed for high performance when searching an array of words or
    a file for the entry nearest to a particular search term, which it
    achieves by reducing the number of calculations that need to be
    performed.

    It supports either bytewise edit distances or Unicode-based edit
    distances:

        use utf8;
        my $tf = Text::Fuzzy::PP->new ('あいうえお☺');
        print $tf->distance ('うえお☺'), "\n";
        # prints "2".

    The default edit distance is the Levenshtein edit distance, which
    applies an equal weight of one to additions (`cat' -> `cart'),
    substitutions (`cat' -> `cut'), and deletions (`carp' -> `cap').
    Optionally, the Damerau-Levenshtein edit distance, which additionally
    allows transpositions (`salt' -> `slat'), may be selected using the
    method transpositions_ok.

METHODS
  new
        my $tf = Text::Fuzzy::PP->new ('bibbety bobbety boo');

    Create a new Text::Fuzzy::PP object from the supplied word.

  distance
        my $dist = $tf->distance ($word);

    Return the edit distance to `$word' from the word used to create the
    object in new.

  nearest
        my $index = $tf->nearest (\@words);

    This returns the index of the nearest element in the array to the
    argument to new.
    If none of the elements is less than the maximum distance away from the
    word, `$index' is -1.

        if ($index >= 0) {
            printf "Found at $index, distance was %d.\n",
            $tf->last_distance ();
        }

    Use set_max_distance to alter the maximum distance used.

    If there is more than one word with the same distance in `@words', this
    returns the first of them.

  last_distance
        my $last_distance = $tf->last_distance ();

    The edit distance to the closest match found by the previous search.
    This is used in conjunction with nearest to find the edit distance to
    the match it returned.

  set_max_distance
        # Set the max distance.
        $tf->set_max_distance (3);

    Set the maximum edit distance of `$tf'. The default maximum distance is
    10.

    Set the maximum distance to a low value to improve the speed of
    searches over lists with nearest, or to reject unlikely matches. When
    searching for a near match, anything with an edit distance at least as
    high as the maximum is rejected without computing the exact distance.
    To compute exact distances, call this method with zero or an undefined
    value. This switches off the maximum edit distance, so the nearest
    match is accepted whatever its distance.

  get_max_distance
        # Get the maximum edit distance.
        print "The max distance is ", $tf->get_max_distance (), "\n";

    Get the maximum edit distance of `$tf'. The default is set to 10. The
    maximum distance may be set with set_max_distance.

  scan_file
        $tf->scan_file ('/usr/share/dict/words');

    Scan a file to find the nearest match to the word used in new. This
    assumes that the file contains lines of text separated by newlines and
    finds the closest match in the file. This does not currently support
    Unicode-encoded files.

  transpositions_ok
        $tf->transpositions_ok (1);

    A true value in the argument changes the type of edit distance used to
    allow transpositions, such as `clam' and `calm'. Initially
    transpositions are not allowed, giving the Levenshtein edit distance.
    If transpositions are used, the edit distance becomes the
    Damerau-Levenshtein edit distance.
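
    The difference between the two distances can be seen in a small
    standalone sketch. The function `edit_distance' below is a hypothetical
    helper written for illustration; it is not part of Text::Fuzzy::PP and
    is not necessarily how the module computes its result. It assumes the
    restricted "optimal string alignment" form of the transposition rule,
    in which a swapped pair of adjacent characters counts as one edit.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper, not part of Text::Fuzzy::PP: edit distance between
# $s and $t. If $trans_ok is true, swapping two adjacent characters costs
# one edit (restricted "optimal string alignment" variant, an assumption).
sub edit_distance {
    my ($s, $t, $trans_ok) = @_;
    my @s = split //, $s;
    my @t = split //, $t;
    my ($m, $n) = (scalar @s, scalar @t);
    my @d;
    # The distance to or from an empty prefix is the other prefix's length.
    $d[$_][0] = $_ for 0 .. $m;
    $d[0][$_] = $_ for 0 .. $n;
    for my $i (1 .. $m) {
        for my $j (1 .. $n) {
            my $cost = $s[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,            # deletion
                $d[$i][$j - 1] + 1,            # addition
                $d[$i - 1][$j - 1] + $cost,    # substitution or match
            );
            if ($trans_ok && $i > 1 && $j > 1
                && $s[$i - 1] eq $t[$j - 2]
                && $s[$i - 2] eq $t[$j - 1]) {
                # transposition of two adjacent characters
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }
    return $d[$m][$n];
}

print edit_distance('clam', 'calm', 0), "\n";    # prints "2"
print edit_distance('clam', 'calm', 1), "\n";    # prints "1"
```

    Without transpositions, `clam' -> `calm' needs two substitutions; with
    them, the single swap of `l' and `a' is enough.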

    A false value disallows transpositions:

        $tf->transpositions_ok (0);

PRIVATE METHODS
    These methods are not expected to be useful for the general user. They
    may be useful in benchmarking the module and checking its correctness.

  no_alphabet
        $tf->no_alphabet (1);

    This turns off alphabetizing of the string. Alphabetizing is a filter
    used in nearest: the intersection of all the characters in the two
    strings is computed, and if the alphabetical difference of the two
    strings is greater than the maximum distance, the match is rejected
    without applying the dynamic programming algorithm. This increases
    speed, because the dynamic programming algorithm is slow.

    The alphabetizing should never reject a legitimate match, and it should
    make the program run faster in almost every case. The only envisaged
    uses of switching this off are checking that the algorithm is working
    correctly, and benchmarking performance.

  get_trans
        my $trans_ok = $tf->get_trans ();

    This returns the value set by transpositions_ok.

  unicode_length
        my $length = $tf->unicode_length ();

    This returns the length in characters (not bytes) of the string used in
    new. If the string is not marked as Unicode, it returns the undefined
    value. In the following, `$l1' should be equal to `$l2'.

        use utf8;
        my $word = 'ⅅⅆⅇⅈⅉ';
        my $l1 = length $word;
        my $tf = Text::Fuzzy::PP->new ($word);
        my $l2 = $tf->unicode_length ();

  ualphabet_rejections
        my $rejected = $tf->ualphabet_rejections ();

    After running nearest over an array, this returns the number of entries
    of the array which were rejected using only the alphabet. Its value is
    reset to zero each time nearest is called.

  length_rejections
        my $rejected = $tf->length_rejections ();

    After running nearest over an array, this returns the number of entries
    of the array which were rejected because the length difference between
    them and the target string was larger than the maximum distance
    allowed.

ACKNOWLEDGEMENTS
    Text::Fuzzy is authored by Ben Bullock (BKB).
    The Levenshtein algorithm, the documentation, and Text::Fuzzy's tests
    were taken directly from Text::Fuzzy.

BUGS
    Please report bugs to:

    https://rt.cpan.org/Public/Dist/Display.html?Name=Text-Fuzzy-PP

AUTHOR
    Nick Logan <ugexe@cpan.org>

LICENSE AND COPYRIGHT
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.