Class TreeDigest


  • public class TreeDigest
    extends AbstractTDigest
    Adaptive histogram based on something like streaming k-means crossed with Q-digest.
    The special characteristics of this algorithm are:
    a) smaller summaries than Q-digest
    b) works on doubles as well as integers.
    c) provides part per million accuracy for extreme quantiles and typically <1000 ppm accuracy for middle quantiles
    d) fast
    e) simple
    f) test coverage > 90%
    g) easy to adapt for use with map-reduce
    • Constructor Summary

      Constructors 
      Constructor Description
      TreeDigest​(double compression)
      A histogram structure that will record a sketch of a distribution.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      void add​(double x, int w)
      Adds a sample to a histogram.
      void add​(double x, int w, Centroid base)  
      void asBytes​(java.nio.ByteBuffer buf)
      Outputs a histogram as bytes using a particularly cheesy encoding.
      void asSmallBytes​(java.nio.ByteBuffer buf)
      Serialize this TDigest into a byte buffer.
      int byteSize()
      Returns an upper bound on the number bytes that will be required to represent this histogram.
      double cdf​(double x)
      Returns the fraction of all points added which are <= x.
      int centroidCount()
      The number of centroids currently in the TDigest.
      java.lang.Iterable<? extends Centroid> centroids()
      An iterable that lets you go through the centroids in ascending order by mean.
      void compress()
      Re-examines a t-digest to determine whether some centroids are redundant.
      void compress​(GroupTree other)  
      double compression()
      Returns the current compression factor.
      static TreeDigest fromBytes​(java.nio.ByteBuffer buf)
      Reads a histogram from a byte buffer
      static TDigest merge​(double compression, java.lang.Iterable<TDigest> subData, java.util.Random gen)  
      double quantile​(double q)
      Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
      long size()
      Returns the number of samples represented in this histogram.
      int smallByteSize()
      Returns an upper bound on the number of bytes that will be required to represent this histogram in the tighter representation.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • TreeDigest

        public TreeDigest​(double compression)
        A histogram structure that will record a sketch of a distribution.
        Parameters:
        compression - How should accuracy be traded for size? A value of N here will give quantile errors almost always less than 3/N with considerably smaller errors expected for extreme quantiles. Conversely, you should expect to track about 5 N centroids for this accuracy.
    • Method Detail

      • add

        public void add​(double x,
                        int w)
        Description copied from class: TDigest
        Adds a sample to a histogram.
        Specified by:
        add in class TDigest
        Parameters:
        x - The value to add.
        w - The weight of this point.
      • merge

        public static TDigest merge​(double compression,
                                    java.lang.Iterable<TDigest> subData,
                                    java.util.Random gen)
      • compress

        public void compress()
        Description copied from class: TDigest
        Re-examines a t-digest to determine whether some centroids are redundant. If your data are perversely ordered, this may be a good idea. Even if not, this may save 20% or so in space.
        The cost is roughly the same as adding as many data points as there are centroids. This is typically < 10 * compression, but could be as high as 100 * compression.
        This is a destructive operation that is not thread-safe.
        Specified by:
        compress in class TDigest
      • size

        public long size()
        Returns the number of samples represented in this histogram. If you want to know how many centroids are being used, try centroids().size().
        Specified by:
        size in class TDigest
        Returns:
        the number of samples that have been added.
      • cdf

        public double cdf​(double x)
        Description copied from class: TDigest
        Returns the fraction of all points added which are <= x.
        Specified by:
        cdf in class TDigest
        Parameters:
        x - the value at which the CDF should be evaluated
        Returns:
        the approximate fraction of all samples that were less than or equal to x.
      • quantile

        public double quantile​(double q)
        Description copied from class: TDigest
        Returns an estimate of the cutoff such that a specified fraction of the data added to this TDigest would be less than or equal to the cutoff.
        Specified by:
        quantile in class TDigest
        Parameters:
        q - The quantile desired. Can be in the range [0,1].
        Returns:
        The minimum value x such that we think that the proportion of samples is <= x is q.
      • centroidCount

        public int centroidCount()
        Description copied from class: TDigest
        The number of centroids currently in the TDigest.
        Specified by:
        centroidCount in class TDigest
        Returns:
        The number of centroids
      • centroids

        public java.lang.Iterable<? extends Centroid> centroids()
        Description copied from class: TDigest
        An iterable that lets you go through the centroids in ascending order by mean. Centroids returned will not be re-used, but may or may not share storage with this TDigest.
        Specified by:
        centroids in class TDigest
        Returns:
        The centroids in the form of an Iterable.
      • compression

        public double compression()
        Description copied from class: TDigest
        Returns the current compression factor.
        Specified by:
        compression in class TDigest
        Returns:
        The compression factor originally used to set up the TDigest.
      • byteSize

        public int byteSize()
        Returns an upper bound on the number bytes that will be required to represent this histogram.
        Specified by:
        byteSize in class TDigest
        Returns:
        The number of bytes required.
      • smallByteSize

        public int smallByteSize()
        Returns an upper bound on the number of bytes that will be required to represent this histogram in the tighter representation.
        Specified by:
        smallByteSize in class TDigest
        Returns:
        The number of bytes required.
      • asBytes

        public void asBytes​(java.nio.ByteBuffer buf)
        Outputs a histogram as bytes using a particularly cheesy encoding.
        Specified by:
        asBytes in class TDigest
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • asSmallBytes

        public void asSmallBytes​(java.nio.ByteBuffer buf)
        Description copied from class: TDigest
        Serialize this TDigest into a byte buffer. Some simple compression is used such as using variable byte representation to store the centroid weights and using delta-encoding on the centroid means so that floats can be reasonably used to store the centroid means.
        Specified by:
        asSmallBytes in class TDigest
        Parameters:
        buf - The byte buffer into which the TDigest should be serialized.
      • fromBytes

        public static TreeDigest fromBytes​(java.nio.ByteBuffer buf)
        Reads a histogram from a byte buffer
        Returns:
        The new histogram structure