public class CountSketch<T> extends BaseFrequency<T>
Implementation of the CountSketch algorithm from the paper 'Finding frequent items in data streams' written by 'Charikar, M., Chen, K., and Farach-colton, M. (2002)'.
Modifier and Type | Field and Description |
---|---|
protected int[][] |
data
Data structure to estimate the frequency.
|
protected long |
elementsCounted
Total number of occurences of all elements counted so far.
|
protected Class<? extends HashFunction<?>> |
functionClass
The hash function which will be used.
|
protected List<HashFunction<T>> |
h
Hashfunctions which determine the bucket to add the
s-hashfunctions.
|
protected int |
k
As CountSketch also provides top-k estimation a parameter k can be provided.
|
protected List<HashFunction<T>> |
s
Hashfunctions which determine the value which will be added.
|
protected Map<T,CountEntry<T>> |
topItems
Map for the top-k-algorithm.
|
defaultMinSupport
Constructor and Description |
---|
CountSketch(int domain,
int numberOfHashFunctions,
int numberOfBuckets,
int k)
Constructor of the CountSketch algorithm.
|
Modifier and Type | Method and Description |
---|---|
boolean |
add(T item,
long incrementCount)
Counts the item by passing it to the internal
data strucutre.
|
boolean |
contains(T item)
Tells whether the data structure contains the given item or not.
|
long |
estimateCount(T item)
Estimate the frequency for the given item.
|
long |
estimateFrequency(T item)
Estimates the frequency of the provided item.
|
List<CountEntry<T>> |
getFrequentItems(double minSupport) |
Collection<T> |
getTopK()
Returns the top-k items which were
counted so far.
|
protected void |
initializeHashes(int domain,
int nrOfHashFunctions,
int nrOfbuckets,
HashFunctionFactory<T> factory)
Initialize all necessary hash functions.
|
Set<T> |
keySet() |
long |
size() |
protected boolean |
updateData(T item)
Updates the data structure with the given item.
|
add, getDefaultMinSupport, getFrequentItems, peek, peek, setDefaultMinSupport
protected long elementsCounted
protected int[][] data
Data structure to estimate the frequency.
protected int k
As CountSketch also provides top-k estimation a parameter k can be provided. If <= 0 top-k overhead is disabled.
protected Map<T,CountEntry<T>> topItems
Map for the top-k-algorithm.
protected Class<? extends HashFunction<?>> functionClass
The hash function which will be used.
protected List<HashFunction<T>> h
Hashfunctions which determine the bucket to add the s-hashfunctions.
protected List<HashFunction<T>> s
public CountSketch(int domain, int numberOfHashFunctions, int numberOfBuckets, int k)
Constructor of the CountSketch algorithm. This construction can take quite a long time since the construction of the hashfunctions is rather time-consuming.
domain
- The (estim.) domain, i.e. how many different items are expectednumberOfHashFunctions
- The number of hashfunctions which determine a bucketnumberOfBuckets
- The number of buckets where a counter will be maintainedk
- parameter for the top-k variant. If you want to disable
the top-k overhead, than set k to 0 or lowerprotected void initializeHashes(int domain, int nrOfHashFunctions, int nrOfbuckets, HashFunctionFactory<T> factory)
Initialize all necessary hash functions.
domain
- the (estim.) domain, i.e. how many different items are expectednrOfHashFunctions
- the number of hashfunctions which determine a bucketnrOfbuckets
- the number of buckets where a counter will be maintainedpublic boolean add(T item, long incrementCount)
Counts the item by passing it to the internal data strucutre.
If a k greater than zero was set, the top-k map will be maintained also.
item
- The item to countincrementCount
- public long estimateCount(T item)
ISimpleFrequency
item
- The item that will have the frequency estimatedpublic long size()
public List<CountEntry<T>> getFrequentItems(double minSupport)
public Collection<T> getTopK()
Returns the top-k items which were counted so far.
If k was set to 0 or lower, an empty Collection will be returned.
public long estimateFrequency(T item)
Estimates the frequency of the provided item.
item
- the item which frequency shall be estimatedpublic boolean contains(T item)
IRichFrequency
item
- The item to be verifiedprotected boolean updateData(T item)
Updates the data structure with the given item.
item
- the item to insert into the data structureCopyright © 2014. All rights reserved.