c# - Which Lucene SearchAnalyzer to be used for special character search


I am using the Lucene.NET StandardAnalyzer for search in my ASP.NET project. The search does not return results for keywords like C#, .NET, etc. If I type C or NET (removing the . and #), it works. On Stack Overflow (which uses Lucene too), I noticed that when I type .NET it changes to [.NET] while searching. I found links saying that the StandardAnalyzer is not able to handle special-character search, and the WhitespaceAnalyzer does not work for us because it does not give the expected results. Can anyone help explain how SO manages this search?

I'll characterize what SO is doing a bit more closely here:

While I'm not privy to the implementation details of Stack Overflow, you'll note the same behavior when searching for "java" or "hibernate", though these would have no issues with the StandardAnalyzer. They are transformed into "[java]" and "[hibernate]". That just denotes a tag search. It doesn't happen when searching for "lucene" or "junit", so it probably has to do with the popularity of the tags. I would suspect that tag titles are indexed in an un-analyzed form.

For an interesting example, try out "j++". This dead-end Java implementation has a mere 8 questions using the tag on SO, so it won't trigger the automatic tag search. Search for "[j++]", and you'll see those 8. Search for "j++", and you'll have a rough time finding anything relevant to that particular language, but you'll find plenty of references to other things.

Onward, to fixing your problem:

Yes, StandardAnalyzer will (speaking imprecisely, see UAX-29 for the precise rules) get rid of all your punctuation. The typical approach to this is to use the same analyzer when querying. If you use StandardAnalyzer to analyze your queries as well as the indexed documents, your searched terms will match: the two query terms mentioned above will be reduced to net and c, and you should get results.

But now you've hit upon perhaps the classic example of a problem with StandardAnalyzer. It means that c, c++, and c# are all represented precisely the same way in the index, so there is no way to search for one without matching the other two!
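To make the collision concrete, here is a minimal plain-Java sketch. It does not use Lucene: the hypothetical `tokenize` helper only approximates StandardAnalyzer by lowercasing and splitting on anything that is not a letter or digit, which is cruder than the real UAX-29 rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class CrudeTokenizerDemo {

    // Rough stand-in for StandardAnalyzer (an assumption, not Lucene code):
    // lowercase the text, then split on every run of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Print the tokens each query is reduced to.
        for (String query : new String[] {"C", "C++", "C#", ".NET"}) {
            System.out.println(query + " -> " + tokenize(query));
        }
    }
}
```

Under these assumptions, all three C-family inputs come out as the single token c, and .NET comes out as net, which is why the three languages become indistinguishable in the index.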

There are a few ways to deal with this, to my mind:

  1. Throw the baby out with the bathwater: use WhitespaceAnalyzer or some such, and lose all the nice, fancy things StandardAnalyzer does to help you out.

  2. Just handle those few little edge cases: okay, so Lucene doesn't like punctuation, and you have some known terms that have a problem with that. Luckily, you have String.Replace. Replace them with something a little more Lucene-friendly, like "c", "cplusplus" and "csharp". Again, make sure it gets done at both query time and index time. The problem is: since you are doing this outside of the analyzer, the transformation will affect the stored version of the field as well, forcing you to reverse it before you display results to the user.

  3. Do the same as #2, but a bit fancier: #2 might work all right, but you've already got these analyzers handling the transformation of data for consumption by Lucene, and they only affect the indexed version of the field, rather than the stored one. Why not use them? Analyzer has a call, initReader, in which you can slap a CharFilter on the front of the analyzer stack (see the example way down at the bottom of the analysis package documentation). The text run through the analyzer will be transformed by the CharFilter before the StandardTokenizer (which gets rid of punctuation, among other things) gets its hands on it. MappingCharFilter, for instance.
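Option 2 in the list above could look roughly like the following sketch. The class name `TermEscaper`, its method names, and the mapping table are made up for illustration; only the replacement strings come from the answer. It uses plain JDK string handling and no Lucene calls.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class TermEscaper {

    // Hypothetical mapping table: problem term -> Lucene-friendly replacement.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("c++", "cplusplus");
        REPLACEMENTS.put("c#", "csharp");
    }

    // Apply to field text before indexing AND to the query string before parsing.
    static String escape(String text) {
        String result = text;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            // Pattern.quote treats '+' and '#' literally; matching ignores case.
            result = Pattern.compile(Pattern.quote(e.getKey()), Pattern.CASE_INSENSITIVE)
                    .matcher(result)
                    .replaceAll(e.getValue());
        }
        return result;
    }

    // Reverse the mapping on stored field values before showing them to the user.
    static String unescape(String text) {
        String result = text;
        for (Map.Entry<String, String> e : REPLACEMENTS.entrySet()) {
            result = result.replace(e.getValue(), e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        String escaped = escape("I use C# and c++");
        System.out.println(escaped);
        System.out.println(unescape(escaped));
    }
}
```

The round trip is lossy in this sketch: the original letter case of the replaced term is not restored, and any text that naturally contains "csharp" or "cplusplus" would be rewritten on display. That is part of the cost of doing the transformation outside the analyzer.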

To go with option 3, note that you can't subclass StandardAnalyzer (it is declared final), the thinking being that you should be implementing Analyzer, rather than subclassing implementations of it (see the discussion here, if you're interested in a more complete discussion of the thought process there). So, assuming we want to make sure we get absolutely all the functionality of StandardAnalyzer in the deal, just copy-paste the source code, and add an override of the initReader method:

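    import java.io.IOException;
    import java.io.Reader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.charfilter.MappingCharFilter;
    import org.apache.lucene.analysis.charfilter.NormalizeCharMap;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.StopAnalyzer;
    import org.apache.lucene.analysis.core.StopFilter;
    import org.apache.lucene.analysis.standard.StandardFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.util.CharArraySet;
    import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
    import org.apache.lucene.util.Version;

    public class ExtraFancyStandardAnalyzer extends StopwordAnalyzerBase {

        public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
        private int maxTokenLength = DEFAULT_MAX_TOKEN_LENGTH;
        public static final CharArraySet STOP_WORDS_SET = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

        public ExtraFancyStandardAnalyzer(Version matchVersion, CharArraySet stopWords) {
            super(matchVersion, stopWords);
            buildMap();
        }

        public ExtraFancyStandardAnalyzer(Version matchVersion) {
            this(matchVersion, STOP_WORDS_SET);
        }

        public ExtraFancyStandardAnalyzer(Version matchVersion, Reader stopwords)
                throws IOException {
            this(matchVersion, loadStopwordSet(stopwords, matchVersion));
        }

        public void setMaxTokenLength(int length) {
            maxTokenLength = length;
        }

        public int getMaxTokenLength() {
            return maxTokenLength;
        }

        // The following two methods, and the call to buildMap() in the constructor,
        // are the only things changed from StandardAnalyzer.

        private NormalizeCharMap map;

        public void buildMap() {
            NormalizeCharMap.Builder builder = new NormalizeCharMap.Builder();
            builder.add("c++", "cplusplus");
            builder.add("c#", "csharp");
            // The char filter sees the raw text, before LowerCaseFilter runs,
            // so the upper-case spellings need their own entries.
            builder.add("C++", "cplusplus");
            builder.add("C#", "csharp");
            map = builder.build();
        }

        // Wrap the incoming reader so the mapping is applied before tokenization.
        @Override
        protected Reader initReader(String fieldName, Reader reader) {
            return new MappingCharFilter(map, reader);
        }

        // Same chain as StandardAnalyzer: tokenizer, standard, lowercase and stop filters.
        @Override
        protected TokenStreamComponents createComponents(final String fieldName,
                final Reader reader) {
            final StandardTokenizer src = new StandardTokenizer(matchVersion, reader);
            src.setMaxTokenLength(maxTokenLength);
            TokenStream tok = new StandardFilter(matchVersion, src);
            tok = new LowerCaseFilter(matchVersion, tok);
            tok = new StopFilter(matchVersion, tok, stopwords);
            return new TokenStreamComponents(src, tok) {
                @Override
                protected void setReader(final Reader reader) throws IOException {
                    src.setMaxTokenLength(ExtraFancyStandardAnalyzer.this.maxTokenLength);
                    super.setReader(reader);
                }
            };
        }
    }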

Note: This was written and tested in Java, with Lucene version 4.7. A C# implementation shouldn't be much different. Copy the StandardAnalyzer, build a MappingCharFilter (which is a hair simpler to deal with in version 3.0.3), and wrap the reader with it in an override of the initReader method.
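To illustrate why mapping inside the analysis chain leaves the stored text alone, here is a toy in-memory index in plain Java. Nothing in it is Lucene API: `MappedIndexDemo`, `analyze`, `add` and `search` are hypothetical names, and the tokenizer is the same crude letter-and-digit split rather than StandardTokenizer. Unlike the Lucene chain, where the char filter sees the raw text, this sketch lowercases first to keep the mapping table short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class MappedIndexDemo {

    private final List<String> stored = new ArrayList<>();              // original text, never modified
    private final Map<String, Set<Integer>> postings = new HashMap<>(); // term -> document ids

    // Toy analysis chain: character mapping first, crude tokenizing second.
    static List<String> analyze(String text) {
        String mapped = text.toLowerCase(Locale.ROOT)
                .replace("c++", "cplusplus")
                .replace("c#", "csharp");
        List<String> tokens = new ArrayList<>();
        for (String part : mapped.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Index the analyzed terms, but keep the text exactly as given.
    int add(String text) {
        int id = stored.size();
        stored.add(text);
        for (String term : analyze(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
        }
        return id;
    }

    // Analyze the query the same way, then return the stored text of
    // every document that contains all query terms.
    List<String> search(String query) {
        Set<Integer> hits = null;
        for (String term : analyze(query)) {
            Set<Integer> docs = postings.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        List<String> results = new ArrayList<>();
        if (hits != null) {
            for (int id : hits) {
                results.add(stored.get(id));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        MappedIndexDemo index = new MappedIndexDemo();
        index.add("Learning C# generics");
        index.add("C++ templates");
        index.add("Plain C pointers");
        System.out.println(index.search("c#"));
        System.out.println(index.search("C"));
    }
}
```

Because the same `analyze` step runs at index time and at query time, a search for c# only reaches the C# document, and the text handed back is the original, with no reverse mapping needed.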

