A collection of algorithms for mining data streams
A collection of algorithms for mining data streams, including frequent itemsets, quantiles, sampling, moving averages, set membership and cardinality.
Maven:
<dependency>
<groupId>com.github.mayconbordin</groupId>
<artifactId>streaminer</artifactId>
<version>1.0</version>
</dependency>
Except for the CountMinSketchAlt
class, all algorithms implement the IRichFrequency
interface. Here's an example using the SpaceSaving
algorithm:
Random r = new Random();
int counters = 20;
double support = 0.01;
double maxError = 0.1;
IRichFrequency<Integer> counter = new SpaceSaving<Integer>(counters, support, maxError);
for (int i=0 i<1000; i++) {
counter.add(r.nextInt(100), 1);
}
// get the top 10 items
List<CountEntry<Integer>> topk = counter.peek(10);
// print the items
for (CountEntry<Integer> item : topk) {
System.out.println(item);
}
// get the frequency of a single item
int item = 25;
long freq = counter.estimateCount(item);
System.out.println(item + ": " + freq);
The basic usage of a Top-K algorithm is basically the same as the frequent itemset, except that these algorithms do not support the estimateCount
method.
ITopK<String> counter = new StreamSummary<String>(3);
String[] stream = {"X", "X", "Y", "Z", "A", "B", "C", "X", "X", "A", "C", "A", "A"};
for (String i : stream) {
counter.add(i);
}
List<CountEntry<String>> topk = counter.peek(3);
for (CountEntry<String> item : topk) {
System.out.println(item);
}
double[] quantiles = new double[]{0.05, 0.25, 0.5, 0.75, 0.95};
IQuantiles<Integer> instance = new Frugal2U(quantiles, 0);
RandomEngine r = new MersenneTwister64(0);
Normal dist = new Normal(100, 50, r);
int numSamples = 1000;
for(int i = 0; i < numSamples; ++i) {
int num = (int) Math.max(0, dist.nextDouble());
instance.offer(num);
}
for (double q : quantiles) {
System.out.println(q + ": " + instance.getQuantile(q));
}
ICardinality card = new LogLog(8);
for (int i=0; i<100; i++) {
card.offer(Math.random()*100.0);
}
System.out.println("Cardinality: " + card.cardinality());
// create a EWMA with 15 seconds of age for the metrics in the period
IAverage avg = new VariableEWMA(15.0);
for (int i=0; i<100; i++) {
avg.add(Math.random()*100.0);
if (i%10 == 0)
System.out.println("Average: " + avg.getAverage());
}
[1]
Charikar, Moses, Kevin Chen, and Martin Farach-Colton. "Finding frequent items in data streams." Automata, Languages and Programming. Springer Berlin Heidelberg, 2002. 693-703.
[2]
Cormode, Graham, and S. Muthukrishnan. "An improved data stream summary: the count-min sketch and its applications." Journal of Algorithms 55.1 (2005): 58-75.
[3]
Manku, Gurmeet Singh, and Rajeev Motwani. "Approximate frequency counts over data streams." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.
[4]
M. J. Fischer and S. L. Salzberg. "Finding a Majority Among N Votes: Solution to Problem 81-5(Journal of Algorithms, June 1981)", Journal of Algorithms, 3:4, December 1982, pp. 362-380.
[5]
Misra, Jayadev, and David Gries. "Finding repeated elements." Science of computer programming 2.2 (1982): 143-152.
[6]
Metwally, Ahmed, Divyakant Agrawal, and Amr El Abbadi. "Efficient computation of frequent and top-k elements in data streams." Database Theory-ICDT 2005. Springer Berlin Heidelberg, 2005. 398-412.
[7]
Cormode, Graham, et al. "Effective computation of biased quantiles over data streams." Data Engineering, 2005. ICDE 2005. Proceedings. 21st International Conference on. IEEE, 2005.
[8]
Ma, Qiang, S. Muthukrishnan, and Mark Sandler. "Frugal Streaming for Estimating Quantiles." Space-Efficient Data Structures, Streams, and Algorithms. Springer Berlin Heidelberg, 2013. 77-96.
[9]
Greenwald, Michael, and Sanjeev Khanna. "Space-efficient online computation of quantile summaries." ACM SIGMOD Record. Vol. 30. No. 2. ACM, 2001.
[10]
Munro, J. Ian, and Mike S. Paterson. "Selection and sorting with limited storage." Theoretical computer science 12.3 (1980): 315-323.
[11]
Shrivastava, Nisheeth, et al. "Medians and beyond: new aggregation techniques for sensor networks." Proceedings of the 2nd international conference on Embedded networked sensor systems. ACM, 2004.
[12]
Arasu, Arvind, and Gurmeet Singh Manku. "Approximate counts and quantiles over sliding windows." Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2004.
[13]
Gilbert, Anna C., et al. "How to summarize the universe: Dynamic maintenance of quantiles." Proceedings of the 28th international conference on Very Large Data Bases. VLDB Endowment, 2002.
[14]
Cai, Min, et al. "Fast and accurate traffic matrix measurement using adaptive cardinality counting." Proceedings of the 2005 ACM SIGCOMM workshop on Mining network data. ACM, 2005.
[15]
Durand, Marianne, and Philippe Flajolet. "Loglog counting of large cardinalities." Algorithms-ESA 2003. Springer Berlin Heidelberg, 2003. 605-617.
[16]
Flajolet, Philippe, et al. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
[17]
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.
[18]
Whang, Kyu-Young, Brad T. Vander-Zanden, and Howard M. Taylor. "A linear-time probabilistic counting algorithm for database applications." ACM Transactions on Database Systems (TODS) 15.2 (1990): 208-229.
[19]
Fan, L., Cao, P., Almeida, J., & Broder, A. Z. (2000). Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON), 8(3), 281-293.
[20]
Guo, Deke, Jie Wu, Honghui Chen, and Xueshan Luo. "Theory and Network Applications of Dynamic Bloom Filters." In INFOCOM, pp. 1-12. 2006.
[21]
Donnet, Benoit, Bruno Baynat, and Timur Friedman. "Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives." In Proceedings of the 2006 ACM CoNEXT conference, p. 13. ACM, 2006.
[22]
Bloom, Burton H. "Space/time trade-offs in hash coding with allowable errors." Communications of the ACM 13, no. 7 (1970): 422-426.
[23]
Deng, Fan, and Davood Rafiei. "Approximately detecting duplicates for streaming data using stable bloom filters." Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006.
[24]
Dautrich Jr, Jonathan L., and Chinya V. Ravishankar. "Inferential time-decaying Bloom filters." Proceedings of the 16th International Conference on Extending Database Technology. ACM, 2013.
[25]
Martin, Ruediger, and Michael Menth. "Improving the Timeliness of Rate Measurements." In MMB, pp. 145-154. 2004.