Streaminer

A collection of algorithms for mining data streams, including frequent itemsets, quantiles, sampling, moving averages, set membership and cardinality.

Releases

Maven:

<dependency>
    <groupId>com.github.mayconbordin</groupId>
    <artifactId>streaminer</artifactId>
    <version>1.0</version>
</dependency>

Frequent Itemsets

Algorithms

CountSketch [1]
CountMinSketch [2]
LossyCounting [3]
Majority [4]
MisraGries [5]
SpaceSaving [6]
StickySampling [3]
RealCounting
SimpleTopKCounting

Usage

Except for the CountMinSketchAlt class, all algorithms implement the IRichFrequency interface. Here's an example using the SpaceSaving algorithm:

Random r = new Random();
int counters = 20;
double support = 0.01;
double maxError = 0.1;

IRichFrequency<Integer> counter = new SpaceSaving<Integer>(counters, support, maxError);
for (int i=0 i<1000; i++) {
    counter.add(r.nextInt(100), 1);
}

// get the top 10 items
List<CountEntry<Integer>> topk = counter.peek(10);

// print the items
for (CountEntry<Integer> item : topk) {
    System.out.println(item);
}

// get the frequency of a single item
int item = 25;
long freq = counter.estimateCount(item);
System.out.println(item + ": " + freq);

Top-K

Algorithms

StreamSummary [6]
ConcurrentStreamSummary
Frequent
StochasticTopper

Usage

The basic usage of a Top-K algorithm is basically the same as the frequent itemset, except that these algorithms do not support the estimateCount method.

ITopK<String> counter = new StreamSummary<String>(3);

String[] stream = {"X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "C", "A", "A"};
for (String i : stream) {
    counter.add(i);
}

List<CountEntry<String>> topk = counter.peek(3);
for (CountEntry<String> item : topk) {
    System.out.println(item);
}

Quantiles

Algorithms

CKMSQuantiles [7]
Frugal2U [8]
GKQuantiles [9]
MPQuantiles [10]
QDigest [11]
WindowSketchQuantiles [12]
RSSQuantiles [13]
EnsembleQuantiles
ExactQuantiles
ExactQuantilesAll
SimpleQuantiles
SumQuantiles
TDigest

Usage

double[] quantiles = new double[]{0.05, 0.25, 0.5, 0.75, 0.95};
IQuantiles<Integer> instance = new Frugal2U(quantiles, 0);

RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(100, 50, r);
int numSamples = 1000;

for(int i = 0; i < numSamples; ++i) {
    int num = (int) Math.max(0, dist.nextDouble());
    instance.offer(num);
}

for (double q : quantiles) {
    System.out.println(q + ": " + instance.getQuantile(q));
}

Cardinality

Algorithms

AdaptiveCounting [14]
LogLog [15]
HyperLogLog [16]
HyperLogLogPlus [17]
LinearCounting [18]
CountThenEstimate

Usage

ICardinality card = new LogLog(8);

for (int i=0; i<100; i++) {
    card.offer(Math.random()*100.0);
}

System.out.println("Cardinality: " + card.cardinality());

Average

Algorithms

MovingAverage
SimpleEWMA
VariableEWMA
TEWMA [25]

Usage

// create a EWMA with 15 seconds of age for the metrics in the period
IAverage avg = new VariableEWMA(15.0);

for (int i=0; i<100; i++) {
    avg.add(Math.random()*100.0);
    if (i%10 == 0)
        System.out.println("Average: " + avg.getAverage());
}

Membership

Algorithms

BloomFilter [22]
CountingBloomFilter [19]
DynamicBloomFilter [20]
RetouchedBloomFilter [21]
StableBloomFilter [23]
TimingBloomFilter [24]

References

[1] Charikar, Moses, Kevin Chen, and Martin Farach-Colton. "Finding frequent items in data streams." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 693-703.

[2] Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.

[3] Manku, Gurmeet Singh, and Rajeev Motwani. "Approximate frequency counts over data streams." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.

[4] M. J. Fischer and S. L. Salzberg. "Finding a Majority Among N Votes: Solution to Problem 81-5(Journal of Algorithms, June 1981)", Journal of Algorithms, 3:4, December 1982, pp. 362-380.

[5] Misra, Jayadev, and David Gries. "Finding repeated elements." Science of computer programming 2.2 (1982): 143-152.

[6] Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Efficient computation of frequent and top-k elements in data streams." Database Theory-ICDT 2005. Springer Berlin Heidelberg, 2005. 398-412.

[7] Cormode, Graham, et al. "Effective computation of biased quantiles over data streams." Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005.

[8] Ma, Qiang, S. Muthukrishnan, and Mark Sandler. "Frugal Streaming for Estimating Quantiles." Space-Efficient Data Structures, Streams, and Algorithms. Springer Berlin Heidelberg, 2013. 77-96.

[9] Greenwald, Michael, and Sanjeev Khanna. "Space-efficient online computation of quantile summaries." ACM SIGMOD Record. Vol. 30. No. 2. ACM, 2001.

[10] Munro, J. Ian, and Mike S. Paterson. "Selection and sorting with limited storage." Theoretical computer science 12.3 (1980): 315-323.

[11] Shrivastava, Nisheeth, et al. "Medians and beyond: new aggregation techniques for sensor networks." Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM, 2004.

[12] Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles over sliding windows." Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2004.

[13] Gilbert, Anna C., et al. "How to summarize the universe: Dynamic maintenance of quantiles." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.

[14] Cai, Min, et al. "Fast and accurate traffic matrix measurement using adaptive cardinality counting." Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data. ACM, 2005.

[15] Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.

[16] Flajolet, Philippe, et al. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).

[17] Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.

[18] Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems (TODS) 15.2 (1990): 208-229.

[19] Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3), 281-293.

[20] Guo, Deke, Jie Wu, Honghui Chen, and Xueshan Luo. "Theory and Network Applications of Dynamic Bloom Filters." In INFOCOM, pp. 1-12. 2006.

[21] Donnet, Benoit, Bruno Baynat, and Timur Friedman. "Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives." In Proceedings of the 2006 ACM CoNEXT conference, p. 13. ACM, 2006.

[22] Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13, no. 7 (1970): 422-426.

[23] Deng, Fan, and Davood Rafiei. "Approximately detecting duplicates for streaming data using stable bloom filters." Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006.

[24] Dautrich Jr, Jonathan L., and Chinya V. Ravishankar. "Inferential time-decaying Bloom filters." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.

[25] Martin, Ruediger, and Michael Menth. "Improving the Timeliness of Rate Measurements." In MMB, pp. 145-154. 2004.