Class ICUTokenizer

java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.icu.segmentation.ICUTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable

public final class ICUTokenizer extends org.apache.lucene.analysis.Tokenizer
Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/).

Words are broken across script boundaries, then segmented according to the BreakIterator and typing provided by the ICUTokenizerConfig.

WARNING: This API is experimental and might change in incompatible ways in the next release.
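
A minimal usage sketch of the consumer workflow, assuming the lucene-analysis-icu module and its ICU4J dependency are on the classpath. The class name ICUTokenizerExample and the helper tokenize are hypothetical names chosen for illustration. The constructor takes no Reader, so the input is supplied through the inherited setReader method before reset() is called.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ICUTokenizerExample {

    // Hypothetical helper: collects the term text of every token in the input.
    static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // Tokenizer implements Closeable, so try-with-resources calls close().
        try (ICUTokenizer tokenizer = new ICUTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            tokenizer.setReader(new StringReader(text)); // supply the input
            tokenizer.reset();                           // required before incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term.toString());
            }
            tokenizer.end();                             // finalize end-of-stream state
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(tokenize("Hello world"));
    }
}
```

The call order (setReader, reset, incrementToken until it returns false, end, close) follows the general TokenStream contract rather than anything specific to this class.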
  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

    org.apache.lucene.util.AttributeSource.State
  • Field Summary

    Fields inherited from class org.apache.lucene.analysis.Tokenizer

    input

    Fields inherited from class org.apache.lucene.analysis.TokenStream

    DEFAULT_TOKEN_ATTRIBUTE_FACTORY
  • Constructor Summary

    Constructors
    Constructor
    Description
    ICUTokenizer()
    Construct a new ICUTokenizer that breaks text into words from the given Reader.
    ICUTokenizer(ICUTokenizerConfig config)
    Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
    ICUTokenizer(org.apache.lucene.util.AttributeFactory factory, ICUTokenizerConfig config)
    Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    end()
    boolean
    incrementToken()
    void
    reset()

    Methods inherited from class org.apache.lucene.analysis.Tokenizer

    close, correctOffset, setReader, setReaderTestPoint

    Methods inherited from class org.apache.lucene.util.AttributeSource

    addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • ICUTokenizer

      public ICUTokenizer()
      Construct a new ICUTokenizer that breaks text into words from the given Reader.

      The default script-specific handling is used.

      The default attribute factory is used.

    • ICUTokenizer

      public ICUTokenizer(ICUTokenizerConfig config)
      Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.

      The default attribute factory is used.

      Parameters:
      config - Tailored BreakIterator configuration
    • ICUTokenizer

      public ICUTokenizer(org.apache.lucene.util.AttributeFactory factory, ICUTokenizerConfig config)
      Construct a new ICUTokenizer that breaks text into words from the given Reader, using a tailored BreakIterator configuration.
      Parameters:
      factory - AttributeFactory to use
      config - Tailored BreakIterator configuration
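
The three constructors can be compared with a short sketch. It assumes a Lucene version in which DefaultICUTokenizerConfig (same package) offers a two-argument constructor (cjkAsWords, myanmarAsWords); that class is not described on this page, so treat its use here as an assumption. The class name ICUTokenizerConstructors and the helper terms are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeFactory;

public class ICUTokenizerConstructors {

    // Hypothetical helper: runs any already-constructed tokenizer over the text.
    static List<String> terms(ICUTokenizer tokenizer, String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (ICUTokenizer t = tokenizer) {
            CharTermAttribute term = t.addAttribute(CharTermAttribute.class);
            t.setReader(new StringReader(text));
            t.reset();
            while (t.incrementToken()) {
                out.add(term.toString());
            }
            t.end();
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: two-argument DefaultICUTokenizerConfig constructor is available.
        ICUTokenizerConfig config = new DefaultICUTokenizerConfig(true, true);
        // Inherited constant listed in the Field Summary of this class.
        AttributeFactory factory = TokenStream.DEFAULT_TOKEN_ATTRIBUTE_FACTORY;

        System.out.println(terms(new ICUTokenizer(), "testing one two"));
        System.out.println(terms(new ICUTokenizer(config), "testing one two"));
        System.out.println(terms(new ICUTokenizer(factory, config), "testing one two"));
    }
}
```

For plain Latin-script input such as this, all three forms are expected to produce the same terms; the config and factory arguments matter when script-specific segmentation or custom attribute implementations are needed.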
  • Method Details

    • incrementToken

      public boolean incrementToken() throws IOException
      Specified by:
      incrementToken in class org.apache.lucene.analysis.TokenStream
      Throws:
      IOException
    • reset

      public void reset() throws IOException
      Overrides:
      reset in class org.apache.lucene.analysis.Tokenizer
      Throws:
      IOException
    • end

      public void end() throws IOException
      Overrides:
      end in class org.apache.lucene.analysis.TokenStream
      Throws:
      IOException
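
The interplay of reset(), incrementToken() and end() can be illustrated by tracking offsets. This is a sketch under the assumption that, as with other Lucene tokenizers, end() sets the final offset to the total number of characters consumed, including trailing whitespace. The class name ICUTokenizerOffsets and the helper offsets are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

public class ICUTokenizerOffsets {

    // Hypothetical helper: returns "term:start-end" per token, then "<end>:N"
    // where N is the final offset reported after end().
    static List<String> offsets(String text) throws IOException {
        List<String> out = new ArrayList<>();
        try (ICUTokenizer tokenizer = new ICUTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                out.add(term + ":" + offset.startOffset() + "-" + offset.endOffset());
            }
            tokenizer.end(); // after this call the offset attribute holds the final offset
            out.add("<end>:" + offset.endOffset());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(offsets("foo bar "));
    }
}
```

Reading the offset after end() is what lets downstream consumers, such as highlighters, account for trailing characters that did not become tokens.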