For umbraco versions: Not Version related

Misc
Codegarden 2008 open space minutes.

Chapters

Advanced Lucene Search

Present

Ismail Mayat (facilitator)

Between 10 and 15 people, including Morten Bock, Michel Dumontier and Len Dierickx

This was my first time working with the open space format, and also my first time presenting anything Umbraco related to other members of the Umbraco community. It was a bit nerve-racking, especially when the web connection went down and I could not demo anything! Big thanks to Morten Bock for the use of his 3G broadband dongle.

My topic for the open space session was Lucene search within Umbraco. As described in my blog post, I covered the recent updates I have made to the original umbracoUtilities project. To summarise the demo:

  1. Gave a quick introduction to the evolution of the umbracoUtilities project: umbracoUtilities -> my version of umbracoUtilities (with media indexing based on core code from Niels) -> umbSearch (on CodePlex, with media indexing based on a field in the document type that points to the media file to index) -> my umbSearch (which is umbracoUtilities with media indexing via a scheduled task; NB the scheduled tasks node in umbracoSettings.config was new to a few people). Also discussed that quite a few people have had problems getting Lucene search to work.
  2. Mentioned that for sites with fewer than 200 pages and no PDF indexing requirements, Doug Robar's excellent XSLT search would probably be a better option.
  3. Demonstrated an Ajax autosuggest text box based on the Microsoft Ajax Control Toolkit.
  4. Demonstrated suggested term search. If you search for the word "tunnel" but spell it "tunnle", the search results page suggests a number of terms found in your index that it thinks you meant.
  5. Demonstrated FindSimilar search. You have a search result set, and for a given document in that set you want to find similar documents. By default the user control searches on the content field, but you can pass in any other field that exists in the index. By default the results of the similar search are displayed using a repeater, but you can pass in an XSLT file to format them. Also discussed another use for FindSimilar, namely as a "see also" list on a page, either in the right-hand navigation or at the bottom of the page (conceptual linking as opposed to hard linking using a content picker). There was an excellent suggestion from someone (I can't remember the name) regarding conceptual linking: in the Umbraco back end, let Lucene suggest the similar items but let the content creator select which items to actually include in the list. This would allow fine control over the FindSimilar list.
  6. Demonstrated Google-like advanced search. Discussed the format of the HTML advanced search form, which uses JavaScript to construct the Lucene query. Users can put their own fields onto the form and, provided they follow the naming convention and the fields exist in the index, build their own custom advanced search. Soren Tidmund mentioned a site he is currently working on, which is basically a job site for job seekers and employers. The site would need CV and job search with any number of parameters. I mentioned that this would be quite straightforward using the form.
  7. Went briefly through the code and also showed WordNet. I have put the WordNet code into the project but it is not actually used. Also showed the XSLT extension methods that wrap some of the search functionality. I mentioned that the WordNet synonym lookup was possibly overkill, and that synonym lookup with a customised lookup index would probably only be useful for chemical / pharmaceutical organisations. However, Len Dierickx mentioned a possible use case for something like this at his organisation, Eurofins.
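
The suggested term search in point 4 can be illustrated with a small sketch. This is not the umbSearch code: it assumes the index vocabulary is available as a plain list of terms and ranks candidates by Levenshtein edit distance, which is one common way to build a "did you mean" feature. The function names and the sample vocabulary are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def suggest_terms(query, index_terms, max_distance=2, limit=5):
    """Return index terms within max_distance edits of query, closest first."""
    scored = []
    for term in set(index_terms):
        distance = edit_distance(query.lower(), term.lower())
        # Distance 0 means the query itself is in the index, so skip it.
        if 0 < distance <= max_distance:
            scored.append((distance, term))
    scored.sort()
    return [term for _, term in scored[:limit]]


# Hypothetical vocabulary; in practice this would be read from the index.
vocabulary = ["tunnel", "funnel", "channel", "tuning", "umbraco"]
print(suggest_terms("tunnle", vocabulary))
```

A plain scan like this is only sensible for a small vocabulary; a larger index would need a cheaper way to shortlist candidates before computing distances.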

Another excellent suggestion came from Gert Abislov: it would be nice to index the metadata of images that have been tagged in the media section.

Currently images are ignored during the indexing process, but it shouldn't be too difficult to update the indexing code to add these properties. This could be useful if you are creating a website that provides image galleries, so that you could, say, get all images that are landscapes. I mentioned that I have also seen a PHP library that can compare images but could not find the link; there may possibly be a .NET port.

Another suggestion was to use EXIF to extract metadata from images and index that as well. Most digital cameras already insert metadata into images. (Google / Flickr mash-up.)

Michel Dumontier of Dixxit also discussed his experiences of working with Lucene on French content.

A good question from one of the audience was about multilingual sites: would you need separate indexes? I discussed one idea, which was to have a language code setting on each language's home page node, e.g. FR for French. During indexing, when the content for a document is extracted, you get its home page ancestor, read the language code and put it into the index. You could then either offer language as a field on an advanced search, or give each language version its own search page with the language code as a filter.
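
The language code idea above could look roughly like this. It is a sketch under assumptions: content nodes are modelled as a simple in-memory tree, and the property aliases `languageCode` and `bodyText` and the index field name `language` are hypothetical, not names taken from Umbraco or umbSearch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    properties: dict = field(default_factory=dict)


def language_code(node: Node) -> Optional[str]:
    """Walk up the tree and return the first languageCode found, if any."""
    current = node
    while current is not None:
        code = current.properties.get("languageCode")
        if code:
            return code
        current = current.parent
    return None


def build_index_document(node: Node) -> dict:
    """Fields to store in the index for one page, including its language."""
    return {
        "name": node.name,
        "content": node.properties.get("bodyText", ""),
        "language": language_code(node) or "",
    }


def filter_by_language(documents, code):
    """Keep only documents whose language field matches the given code."""
    return [doc for doc in documents if doc["language"] == code]


# Two language home pages, each with one child page.
fr_home = Node("Accueil", properties={"languageCode": "FR"})
en_home = Node("Home", properties={"languageCode": "EN"})
fr_page = Node("Recherche", parent=fr_home, properties={"bodyText": "recherche avancée"})
en_page = Node("Search", parent=en_home, properties={"bodyText": "advanced search"})

index = [build_index_document(n) for n in (fr_home, en_home, fr_page, en_page)]
print([doc["name"] for doc in filter_by_language(index, "FR")])
```

In a Lucene query the same filter would be a required clause on that field, for example `+language:FR`, which keeps everything in a single index.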

I also discussed my to-do list.

  • First thing was to look at implementing a security filter. Currently secure documents are not put in the index, but there may be a need to make secure documents searchable if you are implementing a big extranet.
  • Indexing databases (such as a forum, so that you could include it in your website search results). I mentioned that I have found some Java code on SourceForge that indexes MySQL into Lucene, but have not yet found a .NET / SQL Server equivalent.
  • Also pointed people to the linqtolucene project on CodePlex.
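
One possible shape for the security filter in the first to-do item is sketched below. It assumes, purely for illustration, that each indexed document carries a list of roles allowed to read it and that an empty list means the page is public. None of these names come from umbSearch or Umbraco.

```python
def visible_hits(hits, user_roles):
    """Drop search hits the current user is not allowed to see."""
    roles = set(user_roles)
    visible = []
    for hit in hits:
        allowed = set(hit.get("allowed_roles", []))
        # An empty allowed_roles list is treated as a public page.
        if not allowed or allowed & roles:
            visible.append(hit)
    return visible


# Hypothetical search hits as they might come back from the index.
hits = [
    {"name": "Public news", "allowed_roles": []},
    {"name": "Partner price list", "allowed_roles": ["partner"]},
    {"name": "Staff handbook", "allowed_roles": ["staff"]},
]
print([hit["name"] for hit in visible_hits(hits, ["partner"])])
```

Filtering after the search keeps the index simple, at the cost of result counts and paging having to be worked out after the filter has run.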

Actions:

A few people asked when this will be released as a package. I could not give a firm date because there are a few issues that need to be resolved, and the code still needs some refactoring. The package would also need to be documented; Michel Dumontier of Dixxit has kindly volunteered to do this. So hopefully soon.

My thoughts about the open space format.

It was an excellent way to quickly get new ideas, network and gather feedback. The only downside was that I could not attend the other sessions.

Think that covers everything.

