Class HMMChineseTokenizer

java.lang.Object
org.apache.lucene.util.AttributeSource
org.apache.lucene.analysis.TokenStream
org.apache.lucene.analysis.Tokenizer
org.apache.lucene.analysis.util.SegmentingTokenizerBase
org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer
All Implemented Interfaces:
Closeable, AutoCloseable

public class HMMChineseTokenizer extends org.apache.lucene.analysis.util.SegmentingTokenizerBase
Tokenizer for Chinese or mixed Chinese-English text.

This tokenizer uses probabilistic knowledge, in the form of a hidden Markov model (HMM), to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, and each sentence is then segmented into words.
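The sentence-then-word pipeline described above can be sketched with the JDK alone. This is only an illustration of the two-stage structure: the class name TwoStageSegmentationSketch and its methods are hypothetical, and the word stage uses java.text.BreakIterator as a stand-in, not the HMM-based segmenter this class actually relies on.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch: split text into sentences, then each sentence into words. */
public class TwoStageSegmentationSketch {

    /** Stage 1: find sentence spans with the JDK sentence BreakIterator. */
    public static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    /** Stage 2 (stand-in for the HMM segmenter): split one sentence into words. */
    public static List<String> words(String sentence) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(sentence);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String span = sentence.substring(start, end);
            // Keep only spans that contain a letter or digit; drop spaces and punctuation.
            if (span.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(span);
            }
        }
        return out;
    }

    /** Runs both stages and returns one word list per sentence. */
    public static List<List<String>> segment(String text) {
        List<List<String>> out = new ArrayList<>();
        for (String sentence : sentences(text)) {
            out.add(words(sentence));
        }
        return out;
    }

    public static void main(String[] args) {
        for (List<String> sentenceWords : segment("Lucene indexes text. Tokenizers split it.")) {
            System.out.println(sentenceWords);
        }
    }
}
```

Working one sentence at a time keeps the unit handed to the word segmenter small, which is the same division of labor suggested by the setNextSentence and incrementWord methods listed below.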

  • Nested Class Summary

    Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource

    org.apache.lucene.util.AttributeSource.State
  • Field Summary

    Fields inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase

    buffer, BUFFERMAX, offset

    Fields inherited from class org.apache.lucene.analysis.Tokenizer

    input

    Fields inherited from class org.apache.lucene.analysis.TokenStream

    DEFAULT_TOKEN_ATTRIBUTE_FACTORY
  • Constructor Summary

    Constructors
    Constructor
    Description
    HMMChineseTokenizer()
    Creates a new HMMChineseTokenizer
    HMMChineseTokenizer(org.apache.lucene.util.AttributeFactory factory)
    Creates a new HMMChineseTokenizer, supplying the AttributeFactory
  • Method Summary

    Modifier and Type
    Method
    Description
    protected boolean
    incrementWord()
    void
    reset()
    protected void
    setNextSentence(int sentenceStart, int sentenceEnd)

    Methods inherited from class org.apache.lucene.analysis.util.SegmentingTokenizerBase

    end, incrementToken, isSafeEnd

    Methods inherited from class org.apache.lucene.analysis.Tokenizer

    close, correctOffset, setReader, setReaderTestPoint

    Methods inherited from class org.apache.lucene.util.AttributeSource

    addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, endAttributes, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, removeAllAttributes, restoreState, toString

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • HMMChineseTokenizer

      public HMMChineseTokenizer()
      Creates a new HMMChineseTokenizer
    • HMMChineseTokenizer

      public HMMChineseTokenizer(org.apache.lucene.util.AttributeFactory factory)
      Creates a new HMMChineseTokenizer, supplying the AttributeFactory
  • Method Details

    • setNextSentence

      protected void setNextSentence(int sentenceStart, int sentenceEnd)
      Specified by:
      setNextSentence in class org.apache.lucene.analysis.util.SegmentingTokenizerBase
    • incrementWord

      protected boolean incrementWord()
      Specified by:
      incrementWord in class org.apache.lucene.analysis.util.SegmentingTokenizerBase
    • reset

      public void reset() throws IOException
      Overrides:
      reset in class org.apache.lucene.analysis.util.SegmentingTokenizerBase
      Throws:
      IOException
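
Usage sketch: the following shows one way to drive the tokenizer through the standard TokenStream workflow (setReader, reset, incrementToken, end, close). It assumes the Lucene core, analysis-common, and smartcn analysis jars are on the classpath. The class name HMMChineseTokenizerExample, the helper method tokenize, and the sample sentence are illustrative and not part of the Lucene API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

/** Hypothetical example class; only the Lucene types above are real API. */
public class HMMChineseTokenizerExample {

    /** Returns each token as "term[start,end)" using the tokenizer's offsets. */
    public static List<String> tokenize(String text) throws IOException {
        List<String> tokens = new ArrayList<>();
        // Tokenizer is Closeable, so try-with-resources calls close() for us.
        try (Tokenizer tokenizer = new HMMChineseTokenizer()) {
            CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
            OffsetAttribute offsets = tokenizer.addAttribute(OffsetAttribute.class);
            tokenizer.setReader(new StringReader(text));
            tokenizer.reset(); // must be called before the first incrementToken()
            while (tokenizer.incrementToken()) {
                tokens.add(term + "[" + offsets.startOffset() + "," + offsets.endOffset() + ")");
            }
            tokenizer.end(); // finalizes end-of-stream state such as the final offset
        }
        return tokens;
    }

    public static void main(String[] args) throws IOException {
        for (String token : tokenize("我购买了道具和服装。")) {
            System.out.println(token);
        }
    }
}
```

To reuse one instance for several inputs, call close() and then setReader with the new Reader before the next reset().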