Class DatasetSplitter

java.lang.Object
org.apache.lucene.classification.utils.DatasetSplitter

public class DatasetSplitter extends Object
Utility class for creating training / test / cross validation indexes from the original index.
  • Constructor Summary

    Constructors
    Constructor
    Description
    DatasetSplitter(double testRatio, double crossValidationRatio)
    Creates a DatasetSplitter from the sizes of the test and cross validation indexes, given as ratios of the original index
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    split(org.apache.lucene.index.IndexReader originalIndex, org.apache.lucene.store.Directory trainingIndex, org.apache.lucene.store.Directory testIndex, org.apache.lucene.store.Directory crossValidationIndex, org.apache.lucene.analysis.Analyzer analyzer, boolean termVectors, String classFieldName, String... fieldNames)
    Splits a given index into three indexes for training, test, and cross validation tasks, respectively

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • DatasetSplitter

      public DatasetSplitter(double testRatio, double crossValidationRatio)
      Creates a DatasetSplitter from the sizes of the test and cross validation indexes, given as ratios of the original index
      Parameters:
      testRatio - the ratio of the original index to be used for the test index, as a double between 0.0 and 1.0
      crossValidationRatio - the ratio of the original index to be used for the cross validation index, as a double between 0.0 and 1.0
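      To make the two ratios concrete, the sketch below works out how many documents each split would be allowed to hold for a given index size. It is only an illustration of the arithmetic implied by the parameter descriptions: SplitSizes, cap, and trainingSize are hypothetical names that are not part of Lucene, and rounding down is an assumption rather than documented behavior of this class.

```java
// Illustrative only: SplitSizes is a hypothetical helper, not part of Lucene.
final class SplitSizes {

  /** Number of documents covered by `ratio` of `totalDocs`, rounded down (assumed rounding). */
  static int cap(int totalDocs, double ratio) {
    if (ratio < 0.0 || ratio > 1.0) {
      throw new IllegalArgumentException("ratio must be between 0.0 and 1.0: " + ratio);
    }
    return (int) Math.floor(totalDocs * ratio);
  }

  /** Documents left for training once the test and cross validation shares are taken. */
  static int trainingSize(int totalDocs, double testRatio, double crossValidationRatio) {
    int remaining = totalDocs - cap(totalDocs, testRatio) - cap(totalDocs, crossValidationRatio);
    if (remaining < 0) {
      throw new IllegalArgumentException("testRatio + crossValidationRatio must not exceed 1.0");
    }
    return remaining;
  }

  public static void main(String[] args) {
    int totalDocs = 1000;          // hypothetical size of the original index
    double testRatio = 0.25;
    double crossValidationRatio = 0.125;
    System.out.println("test: " + cap(totalDocs, testRatio));
    System.out.println("cross validation: " + cap(totalDocs, crossValidationRatio));
    System.out.println("training: " + trainingSize(totalDocs, testRatio, crossValidationRatio));
  }
}
```

      Under these assumptions, whatever is not claimed by the test and cross validation ratios remains for training, so the two ratios together should not exceed 1.0.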
  • Method Details

    • split

      public void split(org.apache.lucene.index.IndexReader originalIndex, org.apache.lucene.store.Directory trainingIndex, org.apache.lucene.store.Directory testIndex, org.apache.lucene.store.Directory crossValidationIndex, org.apache.lucene.analysis.Analyzer analyzer, boolean termVectors, String classFieldName, String... fieldNames) throws IOException
      Splits a given index into three indexes for training, test, and cross validation tasks, respectively
      Parameters:
      originalIndex - an IndexReader on the source index
      trainingIndex - a Directory used to write the training index
      testIndex - a Directory used to write the test index
      crossValidationIndex - a Directory used to write the cross validation index
      analyzer - Analyzer used to create the new docs
      termVectors - true if term vectors should be kept
      classFieldName - name of the field used as the label for classification; this must be indexed with sorted doc values
      fieldNames - names of the fields to copy into the new indexes, or null if all fields should be used
      Throws:
      IOException - if any writing operation fails on any of the indexes