SQL Server 2008 Full Text Search (FTS) versus Lucene.NET
I know there have been questions in the past about SQL 2005 versus Lucene.NET but since 2008 came out and they made a lot of changes to it and was wondering if anyone can give me pros/cons (or link to an article).
I built a medium-size knowledge base (maybe 2GB of indexed text) on top of SQL Server 2005's FTS in 2006, and have now moved it to 2008's iFTS. Both situations have worked well for me, but the move from 2005 to 2008 was actually an improvement for me.
My situation was NOT like StackOverflow's in the sense that I was indexing data that was only refreshed nightly, however I was trying to join search results from multiple CONTAINSTABLE statements back in to each other and to relational tables.
In 2005's FTS, this meant each CONTAINSTABLE would have to execute its search on the index, return the full results and then have the DB engine join those results to the relational tables (this was all transparent to me, but it was happening and was expensive to the queries). 2008's iFTS improved this situation because the database integration allows the multiple CONTAINSTABLE results to become part of the query plan which made a lot of searches more efficient.
I think that both 2005 and 2008's FTS engines, as well as Lucene.NET, have architectural tradeoffs that are going to align better or worse to a lot of project circumstances - I just got lucky that the upgrade worked in my favor. I can completely see why 2008's iFTS wouldn't work in the same configuration as 2005's for the highly OLTP nature of a use case like StackOverflow.com. However, I would not discount the possibility that the 2008 iFTS could be isolated from the heavy insert transaction load... but it also sounds like it could be as much work to accomplish that as move to Lucene.NET ... and the cool factor of Lucene.NET is hard to ignore ;)
Anyway, for me, the ease and efficiency of SQL 2008's iFTS in the majority of situations probably edges out Lucene's 'cool' factor (though it is easy to use, I've never used it in a production system so I'm reserving comment on that). I would be interesting in knowing how much more efficient Lucene is (has turned out to be? is it implemented now?) in StackOverflow or similar situations.
SQL Server FTS is going to be easier to manage for a small deployment. Since FTS is integrated with the DB, the RDBMS handles updating the index automatically. The con here is that you don't have an obvious scaling solution short of replicating DB's. So if you don't need to scale, SQL Server FTS is probably "safer". Politically, most shops are going to be more comfortable with a pure SQL Server solution.
On the Lucene side, I would favor SOLR over straight-up Lucene. With either solution you have to do more work yourself updating the index when the data changes, as well as mapping data yourself to the SOLR/Lucene index. The pros are that you can easily scale by adding additional indexes. You could run these indexes on very lean linux servers, which eliminates some license costs. If you take the Lucene/SOLR route, I would aim to put ALL the data you need directly into the index, rather than putting pointers back to the DB in the index. You can include data in the index that is not searchable, so for example you could have pre-built HTML or XML stored in the index, and serve it up as a search result. With this approach your DB could be down but you are still able to serve up search results in a disconnected mode.
I've never seen a head-to-head performance comparison between SQL Server 2008 and Lucene, but would love to see one.
Haven't used SQL Server 2008 personally, though based on that blog entry, it looks like the full-text search functionality is slower than it was in 2005.
we use both full-text-search possibilities, but in my opinion it depends on the data itself and your needs.
we scale with web-servers, and therefore i like lucene, because i don't have that much load on the sql-server.
for starting at null and wanting to have a full-textsearch i would prefer the sql-server solution, because i think it is really fast to get results, if you want lucene you have to implement more at start (and also get some know-how).
One consideration that you need to keep in mind is what kind of search constraints you have in addition to the full-text constraint. If you are doing constraints that lucene can't provide, then you will almost certainly want to use FTS. One of the nice things about 2008 is that they improved the integration of FTS with standard sql server queries so performance should be better with mixed database and FT constraints than it was in 2005.