Class ICUTokenizerFactory

java.lang.Object
  org.apache.lucene.analysis.AbstractAnalysisFactory
    org.apache.lucene.analysis.TokenizerFactory
      org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory
All Implemented Interfaces:
org.apache.lucene.util.ResourceLoaderAware

public class ICUTokenizerFactory extends org.apache.lucene.analysis.TokenizerFactory implements org.apache.lucene.util.ResourceLoaderAware
Factory for ICUTokenizer. Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the DefaultICUTokenizerConfig.
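
As a rough illustration of what "broken across script boundaries" means, the following sketch splits a string into runs of a single Unicode script using only the JDK's `Character.UnicodeScript`. It is not the Lucene implementation: the class name `ScriptRunSketch`, the method `splitByScript`, and the rule that COMMON and INHERITED characters attach to the preceding run are all assumptions made for this example, and the real tokenizer goes on to segment each run with a per-script BreakIterator.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

/** Illustrative only (hypothetical helper): splits text into single-script runs. */
final class ScriptRunSketch {

    /** Returns maximal runs in which every non-neutral code point has the same script. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        int runStart = 0;
        UnicodeScript runScript = null; // null until the first non-neutral script is seen
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            // Assumption for this sketch: COMMON (spaces, digits, punctuation) and
            // INHERITED (combining marks) never start a new run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;
            if (!neutral) {
                if (runScript != null && script != runScript) {
                    // Script changed: close the current run at this boundary.
                    runs.add(text.substring(runStart, i));
                    runStart = i;
                }
                runScript = script;
            }
            i += Character.charCount(cp); // step by code point, not by char
        }
        if (runStart < text.length()) {
            runs.add(text.substring(runStart));
        }
        return runs;
    }
}
```

For instance, calling `ScriptRunSketch.splitByScript` on a Latin word immediately followed by a Cyrillic word yields two runs, one per script.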

To use the default set of per-script rules:

 <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory"/>
   </analyzer>
 </fieldType>

You can customize this tokenizer's behavior by specifying per-script rule files, which are compiled by the ICU RuleBasedBreakIterator. See the ICU RuleBasedBreakIterator syntax reference.

To add per-script rules, add a "rulefiles" argument containing a comma-separated list of code:rulefile pairs. Each pair consists of a four-letter ISO 15924 script code, followed by a colon, then a resource path. For example, to specify rules for Latin (script code "Latn") and Cyrillic (script code "Cyrl"):

 <fieldType name="text_icu_custom" class="solr.TextField" positionIncrementGap="100">
   <analyzer>
     <tokenizer class="solr.ICUTokenizerFactory" cjkAsWords="true"
                rulefiles="Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"/>
   </analyzer>
 </fieldType>
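
The shape of the "rulefiles" value can be sketched in plain Java as below. This is a minimal illustration of the code:rulefile format described above, not the factory's actual parsing code: the class `RuleFilesArgSketch`, its `parse` method, and the validation and whitespace handling are hypothetical, and the real factory may treat malformed input differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only (hypothetical helper): parses a "rulefiles"-style argument. */
final class RuleFilesArgSketch {

    /** Maps each four-letter script code to its resource path, in input order. */
    static Map<String, String> parse(String rulefiles) {
        Map<String, String> pathsByScript = new LinkedHashMap<>();
        for (String pair : rulefiles.split(",")) {
            String trimmed = pair.trim();
            int colon = trimmed.indexOf(':');
            // The script code is exactly four characters, so the colon sits at index 4.
            if (colon != 4) {
                throw new IllegalArgumentException(
                        "Expected <4-letter script code>:<resource path>, got: " + pair);
            }
            String scriptCode = trimmed.substring(0, colon);
            for (int i = 0; i < scriptCode.length(); i++) {
                if (!Character.isLetter(scriptCode.charAt(i))) {
                    throw new IllegalArgumentException("Not a script code: " + scriptCode);
                }
            }
            String resourcePath = trimmed.substring(colon + 1).trim();
            if (resourcePath.isEmpty()) {
                throw new IllegalArgumentException("Missing resource path in: " + pair);
            }
            pathsByScript.put(scriptCode, resourcePath);
        }
        return pathsByScript;
    }
}
```

Passing the value from the example above, `"Latn:my.Latin.rules.rbbi,Cyrl:my.Cyrillic.rules.rbbi"`, gives a map with the keys `Latn` and `Cyrl`, each pointing at its rule file path.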
Since:
3.1
SPI Name (case-insensitive: if the name is 'htmlStrip', 'htmlstrip' can be used when looking up the service).
"icu"
  • Field Summary

    Fields
    Modifier and Type      Field    Description
    static final String    NAME     SPI name

    Fields inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory

    LUCENE_MATCH_VERSION_PARAM, luceneMatchVersion
  • Constructor Summary

    Constructors
    Constructor                                     Description
    ICUTokenizerFactory()                           Default ctor for compatibility with SPI
    ICUTokenizerFactory(Map<String,String> args)    Creates a new ICUTokenizerFactory
  • Method Summary

    Modifier and Type    Method
    ICUTokenizer         create(org.apache.lucene.util.AttributeFactory factory)
    void                 inform(org.apache.lucene.util.ResourceLoader loader)

    Methods inherited from class org.apache.lucene.analysis.TokenizerFactory

    availableTokenizers, create, findSPIName, forName, lookupClass, reloadTokenizers

    Methods inherited from class org.apache.lucene.analysis.AbstractAnalysisFactory

    defaultCtorException, get, get, get, get, get, getBoolean, getChar, getClassArg, getFloat, getInt, getLines, getLuceneMatchVersion, getOriginalArgs, getPattern, getSet, getSnowballWordSet, getWordSet, isExplicitLuceneMatchVersion, require, require, require, requireBoolean, requireChar, requireFloat, requireInt, setExplicitLuceneMatchVersion, splitAt, splitFileNames

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • NAME

      public static final String NAME
      SPI name

  • Constructor Details

    • ICUTokenizerFactory

      public ICUTokenizerFactory(Map<String,String> args)
      Creates a new ICUTokenizerFactory
    • ICUTokenizerFactory

      public ICUTokenizerFactory()
      Default ctor for compatibility with SPI
  • Method Details

    • inform

      public void inform(org.apache.lucene.util.ResourceLoader loader) throws IOException
      Specified by:
      inform in interface org.apache.lucene.util.ResourceLoaderAware
      Throws:
      IOException
    • create

      public ICUTokenizer create(org.apache.lucene.util.AttributeFactory factory)
      Specified by:
      create in class org.apache.lucene.analysis.TokenizerFactory