These are from algebra.Monoid
Creates a sketch out of multiple items.
Creates a sketch out of a single item.
Combines the two sketches.
The sketches must use the same hash functions.
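To see why combining requires the same hash functions, here is a minimal toy Count-Min sketch that uses only the Scala standard library. It is a sketch under stated assumptions, not Algebird's implementation: ToyCms, emptyCms, createCms and the tuple-hash scheme are hypothetical names chosen for illustration. Merging adds the counting tables cell by cell, which is only meaningful when both sketches map every item to the same cells.

```scala
// Toy Count-Min sketch for illustration only; not Algebird's implementation.
final case class ToyCms(depth: Int, width: Int, seed: Int, table: Vector[Vector[Long]]) {

  // One hash function per row, derived from (seed, row). Hypothetical scheme.
  private def column(row: Int, item: Long): Int =
    java.lang.Math.floorMod((seed, row, item).hashCode, width)

  // Counts a single item by incrementing one cell in every row.
  def +(item: Long): ToyCms = {
    val updated = table.zipWithIndex.map { case (cells, row) =>
      val col = column(row, item)
      cells.updated(col, cells(col) + 1L)
    }
    copy(table = updated)
  }

  // Merges two sketches cell by cell; rejects sketches with different hashing.
  def ++(other: ToyCms): ToyCms = {
    require(
      depth == other.depth && width == other.width && seed == other.seed,
      "sketches must use the same hash functions"
    )
    val merged = table.zip(other.table).map { case (l, r) =>
      l.zip(r).map { case (x, y) => x + y }
    }
    copy(table = merged)
  }

  // Estimated count: the minimum over all rows, never below the true count.
  def frequency(item: Long): Long =
    table.indices.map(row => table(row)(column(row, item))).min
}

// Creates an empty sketch with all cells set to zero.
def emptyCms(depth: Int, width: Int, seed: Int): ToyCms =
  ToyCms(depth, width, seed, Vector.fill(depth, width)(0L))

// Creates a sketch out of multiple items by adding them one at a time.
def createCms(items: Seq[Long], depth: Int, width: Int, seed: Int): ToyCms =
  items.foldLeft(emptyCms(depth, width, seed))(_ + _)

val left = createCms(Seq(1L, 1L, 2L), depth = 4, width = 64, seed = 1)
val right = createCms(Seq(1L, 3L), depth = 4, width = 64, seed = 1)
val merged = left ++ right

// A sketch built with a different seed hashes items to different cells.
val otherSeed = createCms(Seq(1L), depth = 4, width = 64, seed = 2)
val mismatch = scala.util.Try(left ++ otherSeed)
```

In this toy model the merge of left and otherSeed fails the require check, because adding cells that were filled by different hash functions would produce meaningless estimates.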
Returns an instance of T calculated by summing all instances in iter in one pass. Returns None if iter is empty, else Some[T].
None if iter is empty, else an option value containing the summed T.
Override if there is a faster way to compute this sum than iter.reduceLeftOption using plus.
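The contract above can be illustrated with a small, self-contained sketch. ToySemigroup and StringConcat are hypothetical names and do not mirror Algebird's actual trait definitions; the default sums with reduceLeftOption using plus, and the override shows one case where a faster single-pass sum exists.

```scala
// Hypothetical stand-in for a semigroup; not Algebird's actual trait.
trait ToySemigroup[T] {
  def plus(l: T, r: T): T

  // Default: None for an empty input, else Some of the left-to-right sum.
  def sumOption(iter: Iterable[T]): Option[T] =
    iter.reduceLeftOption(plus(_, _))
}

// Repeated String concatenation copies the accumulated prefix at every step,
// so this instance overrides sumOption with a single StringBuilder pass.
object StringConcat extends ToySemigroup[String] {
  def plus(l: String, r: String): String = l + r

  override def sumOption(iter: Iterable[String]): Option[String] =
    if (iter.isEmpty) None
    else {
      val sb = new StringBuilder
      iter.foreach(s => sb.append(s))
      Some(sb.toString)
    }
}

val emptySum = StringConcat.sumOption(Nil)
val abcSum = StringConcat.sumOption(List("a", "b", "c"))
```

The override returns the same results as the default would, which is the property any faster implementation needs to preserve.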
Monoid for top-N based TopCMS sketches. Use with care! (see warning below)
Warning: Adding top-N CMS instances (++) is an unsafe operation
Top-N computations are not associative. The effect is that a top-N CMS has an ordering bias (with regard to heavy hitters) when merging CMS instances (e.g. via ++). This means merging heavy hitters across CMS instances may lead to incorrect, biased results: the outcome is biased by the order in which CMS instances / heavy hitters are being merged, with the rule of thumb being that the earlier a set of heavy hitters is merged, the more likely the end result is biased towards these heavy hitters.
The warning above only applies when adding CMS instances (think: cms1 ++ cms2). In comparison, heavy hitters are correctly computed when a top-N CMS instance is created from a Seq[K], or when items are added or counted individually, i.e. cms + item or cms + (item, count).
See the discussion in Algebird issue 353 for further details.
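The ordering bias can be reproduced with a toy model that uses only the Scala standard library. This is a sketch under stated assumptions, not Algebird's code: ToyTopN and createTopN are hypothetical names, exact counts stand in for the CMS counting table, and the merge is assumed to pick the new top-N only from the heavy hitters that both inputs already track.

```scala
// Toy model for illustration: exact counts stand in for the CMS counting table.
final case class ToyTopN(counts: Map[String, Long], heavyHitters: Set[String], n: Int) {

  // Merge: sum the counts, then keep the top-n among the heavy hitters that
  // either side already tracks. Items dropped earlier cannot come back.
  def ++(other: ToyTopN): ToyTopN = {
    val merged = other.counts.foldLeft(counts) { case (acc, (item, c)) =>
      acc.updated(item, acc.getOrElse(item, 0L) + c)
    }
    val candidates = heavyHitters ++ other.heavyHitters
    val top = candidates.toSeq.sortBy(item => -merged(item)).take(n).toSet
    ToyTopN(merged, top, n)
  }
}

// Creating from a full sequence sees every item, so the top-n is exact here.
def createTopN(items: Seq[String], n: Int): ToyTopN = {
  val counts = items.groupBy(identity).map { case (item, occ) => item -> occ.size.toLong }
  val top = counts.toSeq.sortBy { case (_, c) => -c }.take(n).map(_._1).toSet
  ToyTopN(counts, top, n)
}

val a = createTopN(Seq.fill(5)("x") ++ Seq.fill(4)("y"), 1) // top-1 is x
val b = createTopN(Seq.fill(4)("y"), 1)                     // top-1 is y
val c = createTopN(Seq.fill(6)("z"), 1)                     // top-1 is z

// Overall counts are x = 5, y = 8, z = 6, so the true top-1 item is y.
val leftFirst = ((a ++ b) ++ c).heavyHitters
val rightFirst = (a ++ (b ++ c)).heavyHitters
```

In this model, merging a and b first keeps y, the true top item. Merging b and c first drops y before its counts from a are seen, so the final result reports z instead. The two groupings disagree, which is what non-associativity means in practice.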
Alternatives
The following, alternative data structures may be better picks than a top-N based CMS given the warning above:
Usage
The type K is the type of items you want to count. You must provide an implicit CMSHasher[K] for K, and Algebird ships with several such implicits for commonly used types such as Long and BigInt.
If your type K is not supported out of the box, you have two options: 1) You provide a "translation" function to convert items of your (unsupported) type K to a supported type such as Double, and then use the contramap function of CMSHasher to create the required CMSHasher[K] for your type (see the documentation of CMSHasher for an example); 2) You implement a CMSHasher[K] from scratch, using the existing CMSHasher implementations as a starting point.
Note: Because Arrays in Scala/Java do not have sane equals and hashCode implementations, you cannot safely use types such as Array[Byte]. Extra work is required for Arrays. For example, you may opt to convert Array[T] to a Seq[T] via toSeq, or you can provide appropriate wrapper classes. Algebird provides one such wrapper class, Bytes, to safely wrap an Array[Byte] for use with CMS.
The type used to identify the elements to be counted. For example, if you want to count the occurrence of user names, you could map each username to a unique numeric ID expressed as a Long, and then count the occurrences of those Longs with a CMS of type K=Long. Note that this mapping between the elements of your problem domain and their identifiers used for counting via CMS should be bijective. We require a CMSHasher context bound for K, see CMSHasher for available implicits that can be imported. Which type K should you pick in practice? For domains that have fewer than 2^64 unique elements, you'd typically use Long. For larger domains you can try BigInt, for example.
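The note about Arrays can be checked with the Scala standard library alone. The snippet below is a small illustration, not part of Algebird; the value names are chosen for this example. It shows that two arrays with identical contents are treated as different keys, while their toSeq conversions compare by content.

```scala
val first = Array[Byte](1, 2, 3)
val second = Array[Byte](1, 2, 3)

// Arrays compare by reference, so equal contents are not enough.
val arraysEqual = first == second

// Seq compares element by element.
val seqsEqual = first.toSeq == second.toSeq

// Used as keys, the two arrays stay distinct, while the Seqs collapse into one.
val distinctArrayKeys = Set(first, second).size
val distinctSeqKeys = Set(first.toSeq, second.toSeq).size
```

Any structure that identifies items through equals and hashCode would therefore treat the two arrays as different items, which is why a content-based representation such as Seq[T] or a wrapper class is needed.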