Multiple keyword (100s to 1000s) search (string-search algorithm) in PHP
I have this problem to solve in my PHP project where some keywords (from a few hundreds to a few thousands, lengths can vary) need to be searched in a string about 100-300 characters long, sometimes of lesser length 30-50 chars. I can preprocess the keywords for reusing for new instances of search strings. I am kind of new to PHP and did not find a way to do this in the PHP library. Doing a bit of searching, I found a few good candidates in Aho Corasick algorithm and then this improvement by Sun Wu and Udi Manber, which also seems to be known as agrep (or is a part of agrep): http://webglimpse.net/pubs/TR94-17.pdf
There is Rabin Karp, Suffix Trees etc too but they did not look quite suitable as first was for fixed length keywords and latter seems quite generic and will need rather a lot of work.
Can anyone let me know if implementing the Agrep/Sun Wu-Manber on my own in php is a good way to solve this problem? Any other feedback?
EDIT: as I mentioned below in a comment, there are hundreds or more of distinct search keywords, so regex will not help. So that response is not helpful.
I think you can solve this problem by using "Levenshtein distance" metric.
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences.
Plus, PHP has a levenshtein() method. Use your keyword list as array & searchable string as input and iterate over your array and use levenshtein() in each iteration for matching.
As of PHP 5.5, PHP's strtr uses the Wu-Manbers algorithm for multi-pattern matching. See commit ccf15cf2 in the PHP git repository for details about the implementation. It is quite efficient, in my experience.
A pure-PHP implementation of the Aho-Corasick algorithm is available here: https://packagist.org/packages/wikimedia/aho-corasick