Class KnnVectorDict

java.lang.Object
org.apache.lucene.demo.knn.KnnVectorDict
All Implemented Interfaces:
Closeable, AutoCloseable

public class KnnVectorDict extends Object implements Closeable
Manages a map from token to numeric vector for use with KnnVector indexing and search. The map is stored as an FST: token-to-ordinal plus a dense binary file holding the vectors.
  • Constructor Summary

    Constructors
    Constructor
    Description
    KnnVectorDict(org.apache.lucene.store.Directory directory, String dictName)
    Sole constructor
  • Method Summary

    Modifier and Type
    Method
    Description
    static void
    build(Path gloveInput, org.apache.lucene.store.Directory directory, String dictName)
    Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
    void
    close()
    void
    get(org.apache.lucene.util.BytesRef token, byte[] output)
    Get the vector corresponding to the given token.
    int
    getDimension()
    Get the dimension of the vectors returned by this dictionary.
    long
    ramBytesUsed()
    Return the size of the dictionary in bytes.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • KnnVectorDict

      public KnnVectorDict(org.apache.lucene.store.Directory directory, String dictName) throws IOException
      Sole constructor
      Parameters:
      directory - the Lucene directory from which the knn dictionary should be read.
      dictName - the base name of the directory files that store the knn vector dictionary. The file with extension '.bin' holds the vectors, and the file with extension '.fst' maps tokens to offsets in the '.bin' file.
      Throws:
      IOException
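
      The two-file layout described above can be pictured with a toy model. This is a sketch only, not the Lucene implementation: ToyVectorDict is a hypothetical class in which a TreeMap stands in for the '.fst' file and a byte array stands in for the '.bin' file. It assumes each vector occupies a fixed-size record of getDimension() * Float.BYTES bytes, so that an ordinal converts to an offset by multiplication; this page does not confirm that layout.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for the described layout; hypothetical, not the Lucene implementation. */
final class ToyVectorDict {
  private final Map<String, Integer> ordinals = new TreeMap<>(); // stands in for the '.fst' file
  private final byte[] vectors;                                  // stands in for the '.bin' file
  private final int dimension;

  ToyVectorDict(String[] sortedTokens, byte[] vectors, int dimension) {
    if (vectors.length != sortedTokens.length * dimension * Float.BYTES) {
      throw new IllegalArgumentException("vector data does not match token count");
    }
    for (int i = 0; i < sortedTokens.length; i++) {
      ordinals.put(sortedTokens[i], i); // ordinal = position of the token in sorted order
    }
    this.vectors = vectors;
    this.dimension = dimension;
  }

  /** Byte offset of a vector, assuming dense fixed-size records (an assumption). */
  long offsetOf(int ordinal) {
    return (long) ordinal * dimension * Float.BYTES;
  }

  /** Copies the token's vector bytes into output, or zero-fills it for an unknown token. */
  void get(String token, byte[] output) {
    if (output.length != dimension * Float.BYTES) {
      throw new IllegalArgumentException("output must hold dimension * Float.BYTES bytes");
    }
    Integer ordinal = ordinals.get(token);
    if (ordinal == null) {
      Arrays.fill(output, (byte) 0);
      return;
    }
    System.arraycopy(vectors, (int) offsetOf(ordinal), output, 0, output.length);
  }
}
```

      With dimension 2, for example, the vector with ordinal 1 would start at byte 8 under this assumption.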
  • Method Details

    • get

      public void get(org.apache.lucene.util.BytesRef token, byte[] output) throws IOException
      Get the vector corresponding to the given token. NOTE: the returned array is shared and its contents will be overwritten by subsequent calls. The caller is responsible for copying the data as needed.
      Parameters:
      token - the token to look up
      output - the array in which to write the corresponding vector. Its length must be getDimension() * Float.BYTES. It will be filled with zeros if the token is not present in the dictionary.
      Throws:
      IllegalArgumentException - if the output array is incorrectly sized
      IOException - if there is a problem reading the dictionary
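
      Because get fills a byte array rather than a float array, callers typically need to decode the result. The sketch below shows one way to do that with java.nio. VectorBytes is a hypothetical helper, and the little-endian byte order is an assumption: this page does not state the byte order of the '.bin' file.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper for decoding the bytes written by get(token, output). */
final class VectorBytes {
  /** Decodes packed 32-bit floats, assuming little-endian order (not stated on this page). */
  static float[] toFloats(byte[] bytes) {
    if (bytes.length % Float.BYTES != 0) {
      throw new IllegalArgumentException("length must be a multiple of Float.BYTES");
    }
    float[] floats = new float[bytes.length / Float.BYTES];
    ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer().get(floats);
    return floats;
  }

  /** True if every byte is zero, which is what get writes for a token that is not present. */
  static boolean isAllZero(byte[] bytes) {
    for (byte b : bytes) {
      if (b != 0) {
        return false;
      }
    }
    return true;
  }
}
```

      A caller would allocate the array once with new byte[dict.getDimension() * Float.BYTES], pass it to get, and then decode it. Note that an all-zero result cannot be distinguished from a stored vector whose components are all zero.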
    • getDimension

      public int getDimension()
      Get the dimension of the vectors returned by this dictionary.
      Returns:
      the vector dimension
    • close

      public void close() throws IOException
      Specified by:
      close in interface AutoCloseable
      Specified by:
      close in interface Closeable
      Throws:
      IOException
    • build

      public static void build(Path gloveInput, org.apache.lucene.store.Directory directory, String dictName) throws IOException
      Convert from a GloVe-formatted dictionary file to a KnnVectorDict file pair.
      Parameters:
      gloveInput - the path to the input dictionary. The dictionary is delimited by newlines, and each line is space-delimited. The first column has the token, and the remaining columns are the vector components, as text. The dictionary must be sorted by its leading tokens (considered as bytes).
      directory - a Lucene directory to write the dictionary to.
      dictName - Base name for the knn dictionary files.
      Throws:
      IOException
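
      The sort requirement on gloveInput is easy to get wrong, because String.compareTo orders by UTF-16 code units, which can differ from byte order for characters outside the Basic Multilingual Plane. The sketch below sorts dictionary lines by the unsigned bytes of their leading token. GloveSort is a hypothetical helper, and UTF-8 encoding of the input file is an assumption not stated on this page.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper that orders GloVe-style lines by their leading token's bytes. */
final class GloveSort {
  /** Returns the UTF-8 bytes (an assumed encoding) of the token before the first space. */
  static byte[] tokenBytes(String line) {
    int space = line.indexOf(' ');
    String token = space < 0 ? line : line.substring(0, space);
    return token.getBytes(StandardCharsets.UTF_8);
  }

  /** Returns a copy of the lines sorted by token, comparing tokens as unsigned bytes. */
  static List<String> sortByTokenBytes(List<String> lines) {
    List<String> sorted = new ArrayList<>(lines);
    sorted.sort((a, b) -> Arrays.compareUnsigned(tokenBytes(a), tokenBytes(b)));
    return sorted;
  }
}
```

      Arrays.compareUnsigned requires Java 9 or later. For a large dictionary, an external sort on the raw bytes may be more practical than loading every line into memory.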
    • ramBytesUsed

      public long ramBytesUsed()
      Return the size of the dictionary in bytes.