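DEIM Forum 2016 D8-4

158 0094 14 1 060 0814 14 9 101 8430 2 1 2
E-mail: yusaku.kaneta@rakuten.com, arim@ist.hokudai.ac.jp, uno@nii.jp

Let $U = [0, u)$ be a universe and let $f_x$ be the frequency of $x \in U$ in a multiset of $n$ elements. Given $\phi \in (0, 1]$ and $\varepsilon \in (0, \phi]$, we study the problem of reporting a set $\hat{F} \subseteq U$ that contains every $x$ with $f_x > \phi n$ and in which every $x \in \hat{F}$ satisfies $f_x > (\phi - \varepsilon) n$. We give a data structure with $O(\log(k/\delta)\log(u)/\varepsilon)$ space, $O(\log(k/\delta))$ update time, and $O(\log^2(k/\delta)/\varepsilon)$ output time, where $\delta \in (0, 1]$ and $k = 1/\phi$. It improves on the data structure of Cormode and Muthukrishnan (ACM Trans. Database Syst., 2005).

1.

Given a multiset $S$ and $\phi \in (0, 1]$, the frequent elements problem asks for the set $F = \{x \in U \mid f_x > \phi n\}$, where $U = \{0, \ldots, u - 1\}$, $f_x$ is the frequency of $x \in U$ in $S$, and $n = \sum_{x \in U} f_x$. The problem has been studied extensively [3, 6-10, 14-17]. Solving it exactly requires $\Omega(u)$ space [7], so data structures with $o(u)$ space solve an approximate version [7-9, 14, 15, 17] (see also [6]): given $n$, $S$, $\phi \in (0, 1]$, and $\varepsilon \in (0, \phi]$, report a set $\hat{F} \subseteq U$ such that every $x \in \hat{F}$ satisfies $f_x > (\phi - \varepsilon) n$ and $F \subseteq \hat{F}$.

Cormode and Muthukrishnan [7] proposed a data structure based on Combinatorial Group Testing (CGT). CGT is related to the Count-Min Sketch (CMS) [8]. CGT uses $O(\log(k/\delta)\log(u)/\varepsilon)$ space, takes $O(\log(k/\delta)\log(u))$ time per update, and outputs $\hat{F}$ in $O(\log(k/\delta)(\log(u) + \log(k/\delta))/\varepsilon)$ time [7], where $k = 1/\phi$ and $\delta \in (0, 1]$. Regarding the $\log u$ factor, Cormode and Muthukrishnan [7] also give a variant with $O(\tau \log(k/\delta)\log_\tau(u)/\varepsilon)$ space, $O(\log(k/\delta)\log_\tau(u))$ update time, and $O(\log(k/\delta)(\tau \log_\tau(u) + \log(k/\delta))/\varepsilon)$ output time for $\tau \ge 2$.

In this paper we improve the data structure of Cormode and Muthukrishnan: (1) $o(u)$; $O(\log u)$. (2) Compared with CGT [7], the update time improves by a factor of $O(\log u)$ and the output time by a factor of $O(\log(u)/\log(k/\delta))$. We call the resulting data structure Bit-Parallel Combinatorial Group Testing (BP-CGT).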
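BP-CGT has $O(\log(k/\delta))$ update time, $O(\log^2(k/\delta)/\varepsilon)$ output time, and $O(\log(k/\delta)\log(u)/\varepsilon)$ space, where $k = 1/\phi$ and $\delta \in (0, 1]$, and it succeeds with probability at least $1 - \delta$.

The improvement over Cormode and Muthukrishnan [7] comes from maintaining $m = O(w)$ counters $C[0..m-1]$ on the unit-cost word RAM with word length $w \ge \log u$. Related counter techniques: Baeza-Yates and Gonnet [1] ($k$; $O(w/\log k)$; $O(\log k)$), Grabowski and Fredriksson [12] ($O(\log k)$; $O(\log\log k)$; $O(w/\log\log k)$; $O(\log k)$; $O(w)$), and Bille and Thorup [2] ($O(w)$; $O(w)$).

1.1.

See [6]. Frandsen and Skyum [10]: $\phi = 0.5$, $O(n)$; Section 4: $O(\log u)$. Cormode and Muthukrishnan [7]: $\phi = \varepsilon = 0.5$, $O(\log u)$, $O(\log u)$; Section 4 builds on the data structure of Cormode and Muthukrishnan [7]. Cormode and Muthukrishnan [8] proposed CMS; for CMS and CGT see also [13]. Cormode and Hadjieleftheriou [6]: $b = 16$, CGT [6].

1.2.

Section 2 gives preliminaries. Section 3 describes the maintenance of $O(w)$ counters. Section 4 treats the case $\phi = \varepsilon = 0.5$ by applying Section 3 to [7]. Section 5 applies Section 4 to [7].

2.

Let $\mathbb{Z} = \{0, \pm 1, \ldots\}$ and $\mathbb{N} = \{0, 1, \ldots\}$. For integers $i, j$ with $i \le j$, let $[i..j] = \{i, i + 1, \ldots, j\} \subseteq \mathbb{Z}$, and for $p$ let $[p] = [0..p-1] \subseteq \mathbb{N}$. We assume the unit-cost word-RAM with word length $w \ge \log u$ ($w$; $w$; $m$; $w/m$; $m$). For $x \in \mathbb{N}$, $\mathrm{bit}(x, i) \in \{0, 1\}$ is the $i$-th bit of $x$. For $S \subseteq [m]$, $\mathrm{int}(S) \in [0..2^m - 1]$ is the integer with $\mathrm{bit}(\mathrm{int}(S), i) = 1$ if and only if $i \in S$. For $x \in \mathbb{Z}$ ($x \ne 0$), let

$\mathrm{lsb}(x) = \min\{i \in \mathbb{N} \mid x \bmod 2^{i+1} \ne 0\}$,

that is, $\mathrm{lsb}(x)$ is the position of the lowest 1 bit of $x$, so that $\mathrm{lsb}(1) = 0$; we set $\mathrm{lsb}(0) = \infty$.

Figure 1: the counters $C[0..m-1]$, where each $C[i]$ is stored as digits $c_{i,j}$ in the words $c_0, \ldots, c_n$ (the case $m \bmod 3 = 0$).

1: $n$; $\mathrm{lsb}(1), \ldots, \mathrm{lsb}(n)$.

Lemma 2. For $x, y$:
(1) $\mathrm{lsb}(x + y) = \mathrm{lsb}(x)$ if $\mathrm{lsb}(x) < \mathrm{lsb}(y)$.
(2) $\mathrm{lsb}(x + y) > \mathrm{lsb}(x)$ if $\mathrm{lsb}(x) = \mathrm{lsb}(y)$.
(3) $\mathrm{lsb}(x + y) < \mathrm{lsb}(x)$ if $\mathrm{lsb}(x) > \mathrm{lsb}(y)$.

3. $O(w)$

Let $m = O(w)$ and $n = O(w)$. We maintain $m$ counters $C[0..m-1]$ with $C[0], \ldots, C[m-1] \in \mathbb{Z}$ under the following operations, where $x \in [0..2^m - 1]$.

Problem 1 ( ).
increment(C, x): for all $i \in [m]$, $C[i] \leftarrow C[i] + \mathrm{bit}(x, i)$.
decrement(C, x): for all $i \in [m]$, $C[i] \leftarrow C[i] - \mathrm{bit}(x, i)$.
iszero(C): return $\mathrm{int}(\{i \in [m] \mid C[i] = 0\})$.

Executing increment, decrement, and iszero on $C[0..m-1]$ one counter at a time takes $\Theta(m)$ time; in [7] this is $O(\log u)$.

Theorem 2 ( ). $w$; word RAM; $O(w)$; $O(n)$; $\mathrm{lsb}(\mathrm{pre}(t))$.

3.1.

Each counter $C[i] \in \mathbb{Z}$ of $C[0..m-1]$ is represented by $n$ digits $c_{i,0}, \ldots, c_{i,n-1}$ with

$C[i] = \sum_{j=0}^{n-1} c_{i,j} 2^j$.

At time $t$, increment and decrement touch only the digits $c_{i,0}, \ldots, c_{i,\mathrm{lsb}(t)}$ of $C[i]$. The counters $C$ are stored in the words $c_0, \ldots, c_{n-1}$ ($m$; $C[i]$; $j$; $i \bmod 3$; $i$; Figure 1; $m/3$), with digits in $\{0, \pm 1\}$ or $\{0, \pm 1, \pm 2\}$. An increment/decrement at time $t \ge 0$ processes the levels $0, \ldots, \min\{\mathrm{lsb}(t), n - 1\}$. For $t \ge 0$: $c_l \in \{0, \pm 1\}$ for $l < \mathrm{lsb}(t)$, and $c_l \in \{0, \pm 1, \pm 2\}$ for $\mathrm{lsb}(t) \le l < n$. Each level is updated with a carry $x$ by

$(c_i, x) \leftarrow (c_i + x - 2, +1)$ if $c_i + x \ge +2$,
$(c_i, x) \leftarrow (c_i + x + 2, -1)$ if $c_i + x \le -2$,
$(c_i, x) \leftarrow (c_i + x, 0)$ otherwise.

For time $t$, let $\mathrm{pre}(t) = \max\{t - 2^{\mathrm{lsb}(t)}, 0\}$ ($c_{\mathrm{lsb}(t)}$).

Lemma 3. At time $t$:
(1) For $0 \le i < \mathrm{lsb}(t)$: $c_i \in \{0, \pm 1, \pm 2\}$.
(2) For $\mathrm{lsb}(t) \le i < \mathrm{lsb}(\mathrm{pre}(t))$: $c_i \in \{0, \pm 1\}$.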
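(3) For $\mathrm{lsb}(\mathrm{pre}(t)) \le i < n$: $c_i \in \{0, \pm 1, \pm 2\}$.

Proof. By induction on $t$. For $t = 0$, $\mathrm{lsb}(t) = \infty \ge n$ and $c_l = 0$ for all $0 \le l < n$. For $t \ge 1$, consider the step to $t + 1$.
Case 1: $\mathrm{lsb}(t + 1) > \mathrm{lsb}(t)$. By Lemma 2 (2), $\mathrm{lsb}(t) = 0$. pre(t+1); $c_{\mathrm{lsb}(t+1)} \in \{0, \pm 1\}$; pre(t + 1) < t < t; for $0 \le i < \mathrm{lsb}(t)$, $c_i \in \{0, \pm 1, \pm 2\}$; $\mathrm{lsb}(t + 1) > \mathrm{lsb}(t)$.
Case 2: $\mathrm{lsb}(t + 1) < \mathrm{lsb}(t)$. By Lemma 2 (3), $\mathrm{lsb}(t + 1) = 0$. $\mathrm{lsb}(t + 1) = 0 \le i < n$; c i 2; from $t$ to $t + 1$; c 0 2; t; $\mathrm{lsb}(t + 1) < \mathrm{lsb}(t)$.

t t lsb(t) + 1. Let $L_{\mathrm{lsb}(t)} = \sum_{i=0}^{\mathrm{lsb}(t)} c_i 2^i$. L lsb(t) 3 2 lsb(t) 1.

iszero. For $0 \le i < n$, let $U_i = \sum_{l=i}^{n} c_l 2^l$ ($U_i$; $n$; $t$; $U_i$), so that $C = U_0 = 0$ is the condition to test. Write $U_i = u_i 2^i$ with $u_i \in \mathbb{Z}$. t L lsb(t) = 3 2 i 1 1 u lsb(t)+1 +1 u i U i t. $C = U_{\mathrm{lsb}(t)+1} + L_{\mathrm{lsb}(t)} = 0$. For $x \in \mathbb{Z}$, iszero tests $u_0 = 0$. At time $t$, after updating $c_0, \ldots, c_{\mathrm{lsb}(t)}$, the values $u_0, \ldots, u_{\mathrm{lsb}(t)}$ are recomputed: for $0 \le i \le \mathrm{lsb}(t)$,

$u_i \leftarrow \infty$ if $|2u_{i+1} + c_i| > 2$,
$u_i \leftarrow 2u_{i+1} + c_i$ otherwise.

C[0..m 1] p p 0 p

Algorithm 1
procedure update($c_0, \ldots, c_{n-1}$, $u_0, \ldots, u_{n-1}$, $x_t$):
    $x \leftarrow x_t$
    for $i = 0$ to $\min\{\mathrm{lsb}(t), n - 1\}$ do
        $(c_i, x) \leftarrow (c_i + x - 2, +1)$ if $c_i + x \ge +2$;
        $(c_i, x) \leftarrow (c_i + x + 2, -1)$ if $c_i + x \le -2$;
        $(c_i, x) \leftarrow (c_i + x, 0)$ otherwise.
    for $i = \min\{\mathrm{lsb}(t), n - 1\}$ downto $0$ do
        $u_i \leftarrow \infty$ if $(u_{i+1} = \infty) \lor (|2u_{i+1} + c_i| > 2)$;
        $u_i \leftarrow 2u_{i+1} + c_i$ otherwise.

p p (iszero(c) & x) p p & (iszero(c) & x). For $C[0..m-1]$ and thresholds $\alpha[i]$: $C[i]$, $\alpha[i]$; $C[i] = \alpha[i]$, $C[i] > \alpha[i]$, $C[i] < \alpha[i]$; $O(\max_{i \in [m]} \alpha[i])$.

4. $\phi = \varepsilon = 0.5$

$O(\log u)$; $m = \log u$.

Problem 2 ( ).
insert_2(S, x): insert $x$ into $S$; $f_x \leftarrow f_x + 1$.
delete_2(S, x): delete $x$ from $S$; $f_x \leftarrow f_x - 1$.
output_2(S): output $\hat{F}$ for $\phi = \varepsilon = 0.5$.

4.1. Cormode and Muthukrishnan

The data structure of Cormode and Muthukrishnan [7] uses $O(\log u)$ space. It maintains counters $C[0..m-1]$ and $n \in \mathbb{N}$ such that

$C[i] = \sum_{x \in U} f_x\,\mathrm{bit}(x, i)$ and $n = \sum_{x \in U} f_x$.

If an element $x$ occurs more than $0.5n$ times, it can be read off from the counters $C[i]$: for $0 \le i < m$, $\mathrm{bit}(x, i) = 1$ if and only if $C[i] > 0.5n$.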
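The bit-counter structure of Section 4.1 can be sketched in a few lines of Python. This is a minimal illustration that updates the $m$ counters one at a time ($\Theta(m)$ per update), not the bit-parallel version of Section 3. The class name `MajoritySketch` and its method names are hypothetical and chosen only for this example.

```python
class MajoritySketch:
    """Per-bit counters C[0..m-1] and the total count n, as in Section 4.1.

    Assumption for this sketch: keys are integers in [0, 2**m).
    """

    def __init__(self, m):
        self.m = m
        self.n = 0          # n = sum of all frequencies
        self.C = [0] * m    # C[i] = sum over x of f_x * bit(x, i)

    def _update(self, x, d):
        self.n += d
        for i in range(self.m):
            self.C[i] += d * ((x >> i) & 1)

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def output(self):
        # Bit i of the candidate is 1 iff C[i] > 0.5 * n.
        # The result is meaningful only when some element has f_x > 0.5 * n.
        return sum(1 << i for i in range(self.m) if 2 * self.C[i] > self.n)
```

For example, after inserting 5 three times and 2 twice, `output()` returns 5. After deleting 5 twice, the element 2 holds the majority and `output()` returns 2.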
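Algorithm 2
procedure insert_2($S = (C[0..m-1], n)$, $x$)
    $n \leftarrow n + 1$
    increment(C, x)
    if $n$ is odd then
        increment(C, $2^m - 1$)

Algorithm 3
procedure delete_2($S = (C[0..m-1], n)$, $x$)
    $n \leftarrow n - 1$
    decrement(C, x)
    if $n$ is even then
        decrement(C, $2^m - 1$)

Algorithm 4
procedure output_2($S = (C[0..m-1], n)$)
    return check$_{>0}$(C)

4.2.

We apply the $O(w)$ counters of Section 3 to the counters $C[0..m-1]$ of [7]. To test $C[i] > 0.5n$ for $O(w)$ counters at once, the counters $C[i]$ are stored with an offset: C[i] = f x bit(x, i) 0.5n, x U. The test then becomes $C[i] > 0$, and the offset changes by $\pm 1$. Algorithms 2-4 show the operations. check$_{>0}$($C[0..m-1]$) returns the integer $x$ with $\mathrm{bit}(x, i) = 1$ if and only if $C[i] > 0$.

By Theorem 2:

Theorem 3 ( ). $O(\log u)$.

5. $0 < \phi \le 1$

BP-CGT uses $O(\log(k/\delta)\log(u)/\varepsilon)$ space, $O(\log(k/\delta))$ update time, and $O(\log^2(k/\delta)/\varepsilon)$ output time, with failure probability $\delta$.

Problem 3 ( ).
insert(S, x): insert $x$ into $S$; $f_x \leftarrow f_x + 1$.
delete(S, x): delete $x$ from $S$; $f_x \leftarrow f_x - 1$.
output(S): output $\hat{F}$ for $\phi$, $\varepsilon$.

5.1. Cormode and Muthukrishnan

The CGT of Cormode and Muthukrishnan [7] has, for $b > 0$, $O(\log(k/\delta)\log_\tau(u))$ update time and $O(\log(k/\delta)(\tau\log_\tau(u) + \log(k/\delta))/\varepsilon)$ output time, and succeeds with probability $1 - \delta$. With $r$ rows and $c$ columns, CGT [7] maintains:

$n \in \mathbb{N}$;
$N[1..r][1..c]$;
$C[1..r][1..c][0..m-1]$;
$h_1, \ldots, h_r$: $r$ hash functions $h_i : U \to [1..c]$,

where

$n = \sum_{x \in U} f_x$,
$N[i][j] = \sum_{x \in U} f_x\,[h_i(x) = j]$,
$C[i][j][l] = \sum_{x \in U} f_x\,[h_i(x) = j]\,\mathrm{bit}(x, l)$.

CGT hashes each $x \in U$ into one of the $c$ groups in each of the $r$ rows. Fix a row $i$ and an element $x \in F$, and let $\bar{f}_x$ denote the total frequency of the elements other than $x$ in the group of $x$. Then $E[\bar{f}_x] \le n/c$, and by Markov's inequality

$P[\bar{f}_x \ge \phi n] \le E[\bar{f}_x]/(\phi n) \le 1/(c\phi)$.

For $c \ge 2/\phi$ we get $E[\bar{f}_x] \le \phi n/2$ and $P[\bar{f}_x \ge \phi n] \le 1/2$, so with probability at least $1/2$, $\bar{f}_x < \phi n$. The probability that $\bar{f}_x \ge \phi n$ holds in all $r$ rows is at most $(1/2)^r = \delta/k$. By Boole's inequality over the at most $k = 1/\phi$ elements of $F$, all of them are found with probability at least $1 - \delta$. To ensure $f_x > (\phi - \varepsilon)n$ for every $x \in \hat{F}$, a candidate $x$ is added to $\hat{F}$ only if $N[i][h_i(x)] > \phi n$ for all $i \in [1..r]$.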
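The plain (non-bit-parallel) CGT structure of Section 5.1 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of [7] or of BP-CGT. The class name `CGTSketch`, the `seed` parameter, and the hash family $((ax + b) \bmod p) \bmod c$ are choices made only for this example. The sizes $r = \lceil \log_2(k/\delta) \rceil$ and $c = \lceil 2/\varepsilon \rceil$ follow the analysis above. Indices are 0-based in the code.

```python
import math
import random


class CGTSketch:
    """r x c groups; group (i, j) keeps a count N[i][j] and m bit counters C[i][j][0..m-1]."""

    def __init__(self, phi, eps, delta, m, seed=0):
        self.phi, self.m = phi, m
        k = math.ceil(1 / phi)
        self.r = max(1, math.ceil(math.log2(k / delta)))  # so that (1/2)^r <= delta/k
        self.c = math.ceil(2 / eps)                        # c >= 2/eps
        rng = random.Random(seed)
        self.p = (1 << 61) - 1  # a prime larger than any key (assumed hash family)
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(self.r)]
        self.n = 0
        self.N = [[0] * self.c for _ in range(self.r)]
        self.C = [[[0] * m for _ in range(self.c)] for _ in range(self.r)]

    def _h(self, i, x):
        a, b = self.ab[i]
        return ((a * x + b) % self.p) % self.c

    def _update(self, x, d):
        self.n += d
        for i in range(self.r):
            j = self._h(i, x)
            self.N[i][j] += d
            for l in range(self.m):
                self.C[i][j][l] += d * ((x >> l) & 1)

    def insert(self, x):
        self._update(x, +1)

    def delete(self, x):
        self._update(x, -1)

    def output(self):
        threshold = self.phi * self.n
        result = set()
        for i in range(self.r):
            for j in range(self.c):
                # Decode the majority candidate of group (i, j) bit by bit.
                x = sum(1 << l for l in range(self.m)
                        if 2 * self.C[i][j][l] > self.N[i][j])
                # Keep x only if its group exceeds phi * n in every row.
                if all(self.N[k][self._h(k, x)] > threshold
                       for k in range(self.r)):
                    result.add(x)
        return result
```

Each update touches $r$ groups and $m$ bit counters per group, which is the $O(\log(k/\delta)\log u)$ update cost that BP-CGT reduces by handling the $m$ bit counters of a group together.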
Algorithm 5 BP-CGT
procedure insert($S = (C[1..r][1..c][0..m-1], N[1..r][1..c], n)$, $x$)
    $n \leftarrow n + 1$
    for $i = 1$ to $r$ do
        insert_2($(C[i][h_i(x)], N[i][h_i(x)])$, $x$)

Algorithm 6 BP-CGT
procedure delete($S = (C[1..r][1..c][0..m-1], N[1..r][1..c], n)$, $x$)
    $n \leftarrow n - 1$
    for $i = 1$ to $r$ do
        delete_2($(C[i][h_i(x)], N[i][h_i(x)])$, $x$)

Algorithm 7 BP-CGT
procedure output($S = (C[1..r][1..c][0..m-1], N[1..r][1..c], n)$)
    $\hat{F} \leftarrow \emptyset$
    for $(i, j) = (1, 1)$ to $(r, c)$ do
        $x \leftarrow$ output_2($(C[i][j], N[i][j])$)
        if $N[k][h_k(x)] > \phi n$ for all $k = 1, \ldots, r$ then
            $\hat{F} \leftarrow \hat{F} \cup \{x\}$
    return $\hat{F}$

Consider $x \in U$ with $f_x < (\phi - \varepsilon)n$, and let $\bar{f}_x$ denote the total frequency of the elements other than $x$ in the group of $x$. Such an $x$ passes the test in row $i$ only if $\bar{f}_x \ge \varepsilon n$ in row $i$. Since $\varepsilon < \phi$ and $c \ge 2/\varepsilon$, Markov's inequality with $E[\bar{f}_x] \le \varepsilon n/2$ gives $P[\bar{f}_x \ge \varepsilon n] \le 1/2$, so $x$ passes a single row with probability at most $1/2$. The probability that $x$ passes all $r$ rows and enters $\hat{F}$ is at most $(1/2)^r = \delta/k$. Hence, with probability at least $1 - \delta$, every $x \in \hat{F}$ satisfies $f_x \ge (\phi - \varepsilon)n$.

5.2. BP-CGT

BP-CGT replaces the $m = O(w)$ counters $C[i][j][0..m-1]$ of each group of CGT [7] by the data structure of Theorem 3, which gives $O(\log(k/\delta))$ update time. CGT; $\phi n$; $x$; $r$; $\phi n$; $x$; $r(c - 1)$. BP-CGT is shown in Algorithms 5-7 (compare CGT).

Theorem 4. BP-CGT outputs $\hat{F}$ correctly with probability at least $1 - \delta$.

Proof. We show that the $\hat{F}$ output by BP-CGT satisfies $F \subseteq \hat{F}$ with probability at least $1 - \delta$. Let $x \in F$, so $f_x > \phi n$. By Markov's inequality, $\bar{f}_x \ge \phi n$ holds with probability at most $1/2$, so with probability at least $1/2$ we have $\bar{f}_x < \phi n$ and therefore $f_x > N[i][h_i(x)]/2 = (f_x + \bar{f}_x)/2$. In that case output_2($C[i][h_i(x)]$, $N[i][h_i(x)]$) returns $x$, and output adds $x$ to $\hat{F}$. That every $x \in \hat{F}$ satisfies $f_x \ge (\phi - \varepsilon)n$ with probability at least $1 - \delta$ follows for BP-CGT as for CGT.

6.

We proposed BP-CGT, a data structure over $U = [0, u)$ with $O(\log(k/\delta)\log(u)/\varepsilon)$ space, $O(\log(k/\delta))$ update time, and $O(\log^2(k/\delta)/\varepsilon)$ output time, where $\delta \in (0, 1]$; $n$; $O((n\log(k/\delta)\log u + \log^2(k/\delta))/\varepsilon)$. Compared with the CGT of Cormode and Muthukrishnan [7], the update time improves by a factor of $O(\log u)$ and the output time by a factor of $O(\log(u)/\log(k/\delta))$. The step from CGT to BP-CGT relies on maintaining $m = O(w)$ counters $C[0], \ldots, C[m-1]$ under the operations $C[i] \leftarrow C[i] + \mathrm{bit}(x, i)$ and $C[i] \leftarrow C[i] - \mathrm{bit}(x, i)$ for all $i \in [0, m)$, and on testing $C[i] = 0$, where $x \in [0, 2^m)$; the test returns the $m$-bit integer $\sum_{i=0}^{m-1} [C[i] = 0]\,2^i$.

[1] R. A. Baeza-Yates, G. H. Gonnet: A new approach to text searching. Commun. ACM, 35(10), pp. 74-82, 1992.
[2] P. Bille, M. Thorup: Regular expression matching with multi-strings and intervals. In Proc. 21st SODA, pp. 1297-1308, 2010.
[3] R. S. Boyer, J. S. Moore: MJRTY: a fast majority vote algorithm. Automated Reasoning: Essays in Honor of Woody Bledsoe, pp. 105-118, 1991.
[4] J. L. Carter, M. N. Wegman: Universal classes of hash functions. J. Comput. Syst. Sci., 18(2), pp. 143-154, 1979.
[5] M. Charikar, K. C. Chen, M. Farach-Colton: Finding frequent items in data streams. Theor. Comput. Sci., 312(1), pp. 3-15, 2004.
[6] G. Cormode, M. Hadjieleftheriou: Methods for finding frequent items in data streams. VLDB J., 19(1), pp. 3-20, 2010.
[7] G. Cormode, S. Muthukrishnan: What's hot and what's not: tracking most frequent items dynamically. ACM Trans. Database Syst., 30(1), pp. 249-278, 2005.
[8] G. Cormode, S. Muthukrishnan: An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1), pp. 58-75, 2005.
[9] E. D. Demaine, A. López-Ortiz, J. I. Munro: Frequency estimation of internet packet streams with limited space. In Proc. 10th ESA, pp. 348-360, 2002.
[10] G. S. Frandsen, S. Skyum: Dynamic maintenance of majority information in constant time per update. Inf. Process. Lett., 63(2), pp. 75-78, 1997.
[11] K. Fredriksson, S. Grabowski: Nested counters in bit-parallel string matching. In Proc. 3rd LATA, pp. 338-349, 2009.
[12] S. Grabowski, K. Fredriksson: Bit-parallel string matching under Hamming distance in $O(n\lceil m/w \rceil)$ worst case time. Inf. Process. Lett., 105(5), pp. 182-187, 2008.
[13] P. Indyk: Sketching via hashing: from heavy hitters to compressed sensing to sparse Fourier transform. In Proc. 32nd PODS, pp. 87-90, 2013.
[14] R. M. Karp, S. Shenker, C. H. Papadimitriou: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1), pp. 51-55, 2003.
[15] A. Metwally, D. Agrawal, A. E. Abbadi: An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst., 31(3), pp. 1095-1133, 2006.
[16] J. Misra, D. Gries: Finding repeated elements. Sci. Comput. Program., 2(2), pp. 143-152, 1982.
[17] G. S. Manku, R. Motwani: Approximate frequency counts over data streams. In Proc. 28th VLDB, pp. 346-357, 2002.
[18] A. Pagh, R. Pagh, S. S. Rao: An optimal Bloom filter replacement. In Proc. 16th SODA, pp. 823-829, 2005.
[19] , : ., 46(1), pp. 4-11, 2005.