Lucene.NET PorterStemFilter: source examples, and is it right for me?

First, I should say that the version of Lucene.NET we are using is out of date: it came packaged with Sitecore 6.4.1, and until now we haven't had to dig far into Analyzers and stemming (big mistake!).

Basically, we are trying to implement some form of stemming, either at index time or at query time (advice on which is best would be welcome). The main problem is that all of the documentation about stemming is written for Java, and I am struggling to port it to C#. I am hoping someone can provide source examples or links to resources in this area.

Because our version of Lucene.NET is very old, I don't think the Snowball Analyzer is an option (it isn't even available in our version), which is why we are considering the PorterStemFilter.

Can anyone provide any assistance / advice on how I can make Stemming work without having to upgrade Lucene?

Kind Regards

Steve

Answers


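Usually you write your own Analyzer that builds the TokenStream chain. You need to stem at both index time and search time, so that the stemmed query terms match the stemmed terms stored in the index.

You then use your Analyzer like any other: pass the same one to the IndexWriter and to the QueryParser.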

Example Analyzer:

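using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;

public class MyAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        // PorterStemFilter expects lowercase tokens, so lowercase before stemming.
        TokenStream stream = new StandardTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        return new PorterStemFilter(stream);
    }
}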
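To make "stem at index and search time" concrete, here is a rough sketch of passing the same analyzer to both sides. It assumes a 2.3-era Lucene.NET API (the `IndexWriter(Directory, Analyzer, bool)` and `QueryParser(string, Analyzer)` constructors, `Hits`, and `Field.Index.TOKENIZED`), so check the signatures against the assembly that ships with your Sitecore install. The `StemmingDemo` class, the `CountHits` helper and the `content` field name are made up for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class StemmingDemo
{
    // Indexes one document in memory and returns how many hits the query produces.
    // MyAnalyzer is the class from the example above; "content" is a hypothetical field.
    public static int CountHits(string documentText, string queryText)
    {
        Analyzer analyzer = new MyAnalyzer();
        Directory directory = new RAMDirectory();

        // Index time: the analyzer stems every token before it is written.
        IndexWriter writer = new IndexWriter(directory, analyzer, true);
        Document doc = new Document();
        // TOKENIZED is the older name; later versions call it ANALYZED.
        doc.Add(new Field("content", documentText, Field.Store.YES, Field.Index.TOKENIZED));
        writer.AddDocument(doc);
        writer.Close();

        // Search time: the same analyzer stems the query terms.
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.Parse(queryText);

        IndexSearcher searcher = new IndexSearcher(directory);
        Hits hits = searcher.Search(query);
        int count = hits.Length();
        searcher.Close();
        return count;
    }
}
```

With this setup, `StemmingDemo.CountHits("She was running quickly", "runs")` should find the document, because the Porter algorithm reduces both "running" and "runs" to "run".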

Snowball is the best option as far as I know. There are implementations in C, Java and other languages, so you can build your own Analyzer/TokenFilter around one in your project.


With an old version of Lucene, you might want to consider taking a copy of the C# implementation of the PorterStemmerAlgorithm class, e.g. from here:

http://tartarus.org/~martin/PorterStemmer/csharp2.txt

You can use that to stem your key field values at index time to store the stemmed version of all the words in a "stemmed field" in the index.

At query time you can then use the same class to stem the search terms and search the "stemmed field" with the stemmed query.
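As a rough sketch of that manual approach, the helper below turns a piece of text into a space-separated string of stems. It does not assume anything about the downloaded class: whatever per-word stemming method that class exposes is passed in as the `stemWord` delegate. The names `StemmedFieldHelper` and `StemText` are hypothetical, and the simple `[a-z]+` word split is an assumption that only suits plain English text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class StemmedFieldHelper
{
    // Lowercases the text, extracts the alphabetic words and stems each one.
    // stemWord stands in for the per-word method of the Porter stemmer class
    // copied into your project (its exact name is not assumed here).
    public static string StemText(string text, Func<string, string> stemWord)
    {
        List<string> stems = new List<string>();
        foreach (Match word in Regex.Matches(text.ToLowerInvariant(), "[a-z]+"))
        {
            stems.Add(stemWord(word.Value));
        }
        // One space-separated string, ready to store in the "stemmed field".
        return string.Join(" ", stems.ToArray());
    }
}
```

At index time the result would go into the "stemmed field"; at query time the user's search terms would go through the same helper before you build the query against that field. The stemmed field should then be indexed and queried with an analyzer that leaves the tokens alone (a whitespace-based one, for instance), so the stems are not altered a second time.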

This way, you bypass all the specific analyser stuff that was implemented in later versions of Lucene, and because the original, unstemmed field is still in the index you can still search on the full versions of all the words.

It's a bit more manual than would be ideal - but it'll get the job done :-)

Good Luck!

