Max/Min Online Aggregation in the Cloud

Journal of Chinese Computer Systems, October 2015, Vol. 36, No. 10
Article ID: 1000-1220(2015)10-2177-06. CLC number: TP311. Document code: A.

WANG Feng-ming, CI Xiang, MENG Xiao-feng
(School of Information, Renmin University of China, Beijing 100872, China)
E-mail: wangfengmingqq@163.com

Abstract: As an important part of data analysis, data exploration must give efficient access to key indicators of a data set, such as the maximum, minimum and average. In a relational database these indicators are obtained with SQL aggregate functions. To compute them over massive data sets, researchers proposed online aggregation, and in the era of big data, online aggregation in the cloud has attracted attention. Most existing research focuses on aggregate functions such as Count and Sum, while little work addresses Max/Min online aggregation. In this paper we use a quantile, derived from Chebyshev's inequality and the central limit theorem, to measure the accuracy of Max/Min online aggregation. The experimental results demonstrate the efficiency of the method and show that it adapts well to online aggregation over big data.

Key words: online aggregation; cloud computing; Chebyshev's inequality; central limit theorem

Received 2014-07-21; revised 2014-09-09. Project numbers: 61379050, 91224008, 2013AA013204, 20130004130001, 11XNL010.

Related techniques and systems include ripple join [3], HOP (Hadoop Online Prototype) [7], which extends Hadoop MapReduce, and COLA [8, 9].

3.1 The online aggregation queries considered have the form

    SELECT op(exp(t_ij)), col FROM R WHERE predicate GROUP BY col

where R is the queried relation, op is the aggregate operator (Max or Min), exp is an arithmetic expression over the attributes of R, predicate is a condition on the attributes of R, and col is one or more columns of R.

3.2 Chebyshev's inequality

For 0 < p < 1 and a random variable X, the quantile Z_p is the value that satisfies P(X > Z_p) = p.
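The query template can be read as filter, group, then aggregate. The following is a minimal sketch of its exact (non-online) semantics, not the paper's implementation. The helper name `grouped_aggregate` and the sample rows are hypothetical, chosen only for illustration; the column names `language` and `pageviews` follow the `visit_log` queries used in the experiments.

```python
from collections import defaultdict

def grouped_aggregate(rows, op, exp, predicate, col):
    """Exact answer to: SELECT op(exp(t)), col FROM R WHERE predicate GROUP BY col."""
    groups = defaultdict(list)
    for t in rows:
        if predicate(t):                    # WHERE predicate
            groups[col(t)].append(exp(t))   # GROUP BY col, evaluate exp(t)
    return {g: op(vals) for g, vals in groups.items()}

# Hypothetical rows, for illustration only.
visit_log = [
    {"language": "english", "pageviews": 12},
    {"language": "english", "pageviews": 3574},
    {"language": "french", "pageviews": 40},
    {"language": "french", "pageviews": 7},
]

pageviews = lambda t: t["pageviews"]
language = lambda t: t["language"]
keep_all = lambda t: True                   # no WHERE clause

q1 = grouped_aggregate(visit_log, max, pageviews, keep_all, language)
q2 = grouped_aggregate(visit_log, min, pageviews, keep_all, language)
print(q1)
print(q2)
```

An online version returns estimates of these per-group values before all rows have been scanned, together with a measure of how trustworthy each estimate is.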

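Let X be a random variable with mean \mu and variance \sigma^2. For any \varepsilon > 0, Chebyshev's inequality states

\[ P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2} \tag{1} \]

and its one-sided form is

\[ P(X - \mu \ge t) \le \frac{\sigma^2}{\sigma^2 + t^2} \ (t \ge 0), \qquad P(X - \mu \le t) \le \frac{\sigma^2}{\sigma^2 + t^2} \ (t < 0) \tag{2} \]

Let M be the Max estimate. Since M - \mu > 0, Eq. (2) gives

\[ P(X - \mu \ge M - \mu) \le \frac{\sigma^2}{\sigma^2 + (M - \mu)^2} \tag{3} \]

\[ P(X \ge M) \le \frac{\sigma^2}{\sigma^2 + (M - \mu)^2} \tag{4} \]

so the quantile of M is 1 - \frac{\sigma^2}{\sigma^2 + (M - \mu)^2}.

Let N be the Min estimate. Since N - \mu < 0, Eq. (2) gives

\[ P(X - \mu \le N - \mu) \le \frac{\sigma^2}{\sigma^2 + (N - \mu)^2} \tag{5} \]

\[ P(X \le N) \le \frac{\sigma^2}{\sigma^2 + (N - \mu)^2} \tag{6} \]

\[ P(X \ge N) \ge 1 - \frac{\sigma^2}{\sigma^2 + (N - \mu)^2} \tag{7} \]

so the quantile of N is \frac{\sigma^2}{\sigma^2 + (N - \mu)^2}.

Equations (4) and (7) depend on \mu and \sigma^2, which must be estimated in MapReduce from the n tuples processed so far. Let \bar{X} be the sample mean. For a confidence \delta, the error bound \varepsilon_\mu of the mean satisfies

\[ P(|\bar{X} - \mu| \le \varepsilon_\mu) = \delta \tag{8} \]

\[ P\left( \frac{|\bar{X} - \mu|}{\sigma / \sqrt{n}} \le \frac{\varepsilon_\mu}{\sigma / \sqrt{n}} \right) = \delta \tag{9} \]

\[ P\left( \left| \frac{\sqrt{n} (\bar{X} - \mu)}{\sigma} \right| \le \frac{\sqrt{n} \varepsilon_\mu}{\sigma} \right) = \delta \tag{10} \]

Replacing \sigma^2 by the sample variance T_{n,2} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 gives

\[ P\left( \left| \frac{\sqrt{n} (\bar{X} - \mu)}{T_{n,2}^{1/2}} \right| \le \frac{\sqrt{n} \varepsilon_\mu}{T_{n,2}^{1/2}} \right) = \delta \tag{11} \]

By the central limit theorem,

\[ P\left( \left| \frac{\sqrt{n} (\bar{X} - \mu)}{T_{n,2}^{1/2}} \right| \le \frac{\sqrt{n} \varepsilon_\mu}{T_{n,2}^{1/2}} \right) \approx 2 \Phi\left( \frac{\sqrt{n} \varepsilon_\mu}{T_{n,2}^{1/2}} \right) - 1 \tag{12} \]

Let Z_\delta be the (\delta + 1)/2 quantile of the standard normal distribution. Then

\[ \frac{\sqrt{n} \varepsilon_\mu}{T_{n,2}^{1/2}} = Z_\delta, \qquad \varepsilon_\mu = \left( \frac{Z_\delta^2 T_{n,2}}{n} \right)^{1/2} \tag{13} \]

In the same way, the error bound \varepsilon_\sigma of the variance is

\[ \varepsilon_\sigma = \left( \frac{Z_\delta^2 (T_{n,4} - T_{n,2}^2)}{n} \right)^{1/2}, \qquad T_{n,4} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^4 \tag{14} \]

Substituting the estimates and their error bounds, the quantile \varphi_M of M and the quantile \varphi_N of N are

\[ \varphi_M = 1 - \frac{\sigma^2 + \varepsilon_\sigma}{\sigma^2 + \varepsilon_\sigma + (M - \mu - \varepsilon_\mu)^2} \tag{15} \]

\[ \varphi_N = \frac{\sigma^2 + \varepsilon_\sigma}{\sigma^2 + \varepsilon_\sigma + (N - \mu - \varepsilon_\mu)^2} \tag{16} \]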
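The quantile bounds for the Max and Min estimates can be computed directly from a sample. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the sample variance plus its error bound as the variance term, subtracts the error bound of the mean inside the squared distance for both Max and Min, and clamps a slightly negative T_{n,4} - T_{n,2}^2 to zero. The function name `quantile_bounds` and the sample data are hypothetical.

```python
import math

def quantile_bounds(sample, z=1.96):
    """Quantile bounds for the max and min of `sample` (needs n >= 2, non-constant data).

    z is Z_delta; 1.96 corresponds to a confidence of 0.95.
    """
    n = len(sample)
    mean = sum(sample) / n
    t_n2 = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
    t_n4 = sum((x - mean) ** 4 for x in sample) / (n - 1)   # fourth central moment
    eps_mu = math.sqrt(z * z * t_n2 / n)                    # error bound of the mean
    # Assumption: clamp at 0, since t_n4 - t_n2^2 can be slightly negative for tiny samples.
    eps_sigma = math.sqrt(z * z * max(t_n4 - t_n2 ** 2, 0.0) / n)
    var_hi = t_n2 + eps_sigma                               # variance plus its error bound
    m, lo = max(sample), min(sample)
    phi_max = 1 - var_hi / (var_hi + (m - mean - eps_mu) ** 2)
    phi_min = var_hi / (var_hi + (lo - mean - eps_mu) ** 2)
    return {"max": m, "min": lo, "eps_mu": eps_mu, "eps_sigma": eps_sigma,
            "phi_max": phi_max, "phi_min": phi_min}

# Hypothetical sample: the integers 1..100.
bounds = quantile_bounds(list(range(1, 101)))
print(bounds)
```

Because the bound comes from Chebyshev's inequality, it holds for any distribution with finite variance, which also makes it conservative: for the evenly spread sample above, the quantile reported for the true maximum stays well below 1.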

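The method is implemented in MapReduce. The Map function (Algorithm 1) keeps the tuples that satisfy the predicate and emits the grouping column as the key and the aggregated attribute as the value. The Reduce function (Algorithm 2) maintains the Max and Min estimates and their quantiles.

Algorithm 1. Map
Input: Object t
Output: Text key, Text value
    if t satisfies the predicate then
        key.set(t.tuple.lang)
        value.set(t.tuple.size)
        output.collect(key, value)
    end if

Algorithm 2. Reduce
Input: Text key, Iterator<Text> values
Output: Max, Min, fi_max, fi_min
    // size_n: number of tuples processed by the reducer
    // sum: sum of the variables in the last iteration
    // Max: estimated max
    // Min: estimated min
    // fi_max: max quantile
    // fi_min: min quantile
    while values.hasNext() do
        Text it = values.getNext()
        sum += val                      // val: numeric value of it
        t_n2 += (list.get(i) - avg)^2
        t_n4 += (list.get(i) - avg)^4
        sega2 = t_n2 / (size_n - 1)
        avg_err = (1.96^2 * sega2 / size_n)^(1/2)
        sega4 = t_n4 / (size_n - 1)
        dev_err = (1.96^2 * (sega4 - sega2^2) / size_n)^(1/2)
        fi_max = 1 - (sega2 + dev_err) / (sega2 + dev_err + (Max - avg - avg_err)^2)
        fi_min = (sega2 + dev_err) / (sega2 + dev_err + (Min - avg - avg_err)^2)
    end while
    output.collect(key, new Text(res))

5 Experiments

5.1 The experiments run on a cluster of 11 nodes connected by 1 Gbit Ethernet: one master and 10 slaves. Each node has a 2.33 GHz CPU, 7 GB of memory and 1.8 TB of disk. The HDFS block size is 64 MB. Max/Min online aggregation is implemented in MapReduce on top of COLA. The data come from the Wikipedia page traffic statistics data set [14], which is 320 GB in size; the experiments use 100 GB of it as the table visit_log, which has the columns language and pageviews. The two test queries are

    Q1 = SELECT Max(pageviews), language FROM visit_log GROUP BY language
    Q2 = SELECT Min(pageviews), language FROM visit_log GROUP BY language

The confidence is set to 0.95. By Eq. (13), Z_\delta is then the 0.975 quantile of the standard normal distribution, so Z_\delta = 1.96.

5.2 Accuracy is measured by the relative error

\[ \text{relative\_error} = \frac{|\text{estimate\_value} - \text{actual\_value}|}{\text{actual\_value}} \tag{17} \]

Query time is measured by avgtime_max and avgtime_min. With the confidence set to 0.95, avgtime_max uses a quantile threshold of 0.99 and avgtime_min uses a quantile threshold of 0.01.

Tables 1 and 2 list the Max and Min estimates returned by online aggregation, the actual values and the corresponding quantiles.

Table 1 Quantile of Max

    Online aggregation    Actual value    Quantile
    165                   3574            0.7288
    552                   3574            0.8691
    552                   3574            0.8870
    552                   3574            0.9170
    911                   3574            0.9704
    1229                  3574            0.9769
    1441                  3574            0.9818
    2922                  3574            0.9804
    3574                  3574            0.9935
    3574                  3574            0.9954

Table 2 Quantile of Min

    Online aggregation    Actual value    Quantile
    325                   12              0.7241
    194                   12              0.6571
    85                    12              0.4593
    59                    12              0.2069
    59                    12              0.2128
    33                    12              0.1002
    33                    12              0.0994
    24                    12              0.0226
    12                    12              0.0124
    12                    12              0.0135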
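A reducer sees values one at a time, so the estimates have to be maintained incrementally. The following single-process sketch shows one way to do that. It is an illustration, not the paper's Hadoop code: it keeps raw power sums so that the central moments can be recovered without a second pass, which is an assumption of this sketch and can lose precision for very large values. The class name `OnlineMaxMin`, the checkpoint interval and the shuffled test data are hypothetical; `relative_error` follows the definition in Eq. (17).

```python
import math
import random

class OnlineMaxMin:
    """Running Max/Min estimates with quantile bounds, updated one value at a time."""

    def __init__(self, z=1.96):
        self.z = z                      # Z_delta; 1.96 for confidence 0.95
        self.n = 0
        self.s1 = self.s2 = self.s3 = self.s4 = 0.0   # raw power sums
        self.max = -math.inf
        self.min = math.inf

    def add(self, x):
        self.n += 1
        self.s1 += x
        self.s2 += x ** 2
        self.s3 += x ** 3
        self.s4 += x ** 4
        self.max = max(self.max, x)
        self.min = min(self.min, x)

    def estimate(self):
        """Return (max, min, fi_max, fi_min); needs n >= 2 and non-constant data."""
        n, m = self.n, self.s1 / self.n
        c2 = self.s2 - n * m * m                                       # sum of (x - m)^2
        c4 = self.s4 - 4 * m * self.s3 + 6 * m * m * self.s2 - 3 * n * m ** 4  # sum of (x - m)^4
        t2, t4 = c2 / (n - 1), c4 / (n - 1)
        eps_mu = math.sqrt(self.z ** 2 * t2 / n)
        eps_sigma = math.sqrt(self.z ** 2 * max(t4 - t2 ** 2, 0.0) / n)
        var_hi = t2 + eps_sigma
        fi_max = 1 - var_hi / (var_hi + (self.max - m - eps_mu) ** 2)
        fi_min = var_hi / (var_hi + (self.min - m - eps_mu) ** 2)
        return self.max, self.min, fi_max, fi_min

def relative_error(estimate_value, actual_value):
    return abs(estimate_value - actual_value) / abs(actual_value)

# Hypothetical data: the integers 1..100 in random arrival order.
data = list(range(1, 101))
random.Random(0).shuffle(data)

agg = OnlineMaxMin()
for i, x in enumerate(data, start=1):
    agg.add(x)
    if i % 25 == 0:                     # report an intermediate estimate
        cur_max, cur_min, fi_max, fi_min = agg.estimate()
        print(i, cur_max, relative_error(cur_max, max(data)), round(fi_max, 4), round(fi_min, 4))

final = agg.estimate()
```

A user watching these intermediate reports can stop the query once fi_max is high enough (or fi_min low enough) for their purpose, which is the point of online aggregation.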

Fig. 1 Query error of Q1
Fig. 2 Query error of Q2
Fig. 3 Query time of Q1 (english)
Fig. 4 Query time of Q1 (french)
Fig. 5 Query time of Q2 (english)
Fig. 6 Query time of Q2 (french)
Fig. 7 Scalability of data (data size from 20G to 100G)
Fig. 8 Scalability of cluster (2 to 10 slaves)

References

[1] Joseph M Hellerstein, Peter J Haas, Helen J Wang. Online aggregation[C]. Proceedings of ACM Conference on Management of Data, New York: ACM, 1997: 171-182.
[2] Peter J Haas. Large-sample and deterministic confidence intervals for online aggregation[C]. Proceedings of International Conference on Scientific and Statistical DB Management, Piscataway, NJ: IEEE, 1997: 51-63.
[3] Peter J Haas, Joseph M Hellerstein. Ripple joins for online aggregation[C]. Proc of SIGMOD 1999, New York: ACM, 1999: 287-298.
[4] Gang Luo, Curt J Ellmann, Peter J Haas, et al. A scalable hash ripple join algorithm[C]. Proceedings of ACM Conference on Management of Data, New York: ACM, 2005: 252-262.
[5] Chris Jermaine, Alin Dobra, Subramanian Arumugam, et al. A disk-based join with probabilistic guarantees[C]. Proceedings of ACM Conference on Management of Data, New York: ACM, 2005: 563-574.
[6] Wu Sai, Jiang Shou-xu, Beng Chin Ooi, et al. Distributed online aggregation[J]. The Proceedings of the VLDB Endowment, 2009, 2(1): 443-454.
[7] Tyson Condie, Neil Conway, Peter Alvaro, et al. Online aggregation and continuous query support in MapReduce[C]. Proceedings of ACM Conference on Management of Data, New York: ACM, 2010: 1115-1118.
[8] Shi Ying-jie, Meng Xiao-feng, Wang Fu-sheng, et al. You can stop early with COLA: online processing of aggregate queries in the cloud[C]. Proceedings of ACM International Conference on Information and Knowledge Management, New York: ACM, 2012: 1223-1232.
[9] COLA[EB/OL]. http://idke.ruc.edu.cn/COLA/, 2014.
[10] Niketan Pansare, Vinayak R Borkar, Chris Jermaine, et al. Online aggregation for large MapReduce jobs[J]. The Proceedings of the VLDB Endowment, 2011, 4(11): 1135-1145.
[11] Vasiliki Kalavri, Vaidas Brundza, Vladimir Vlassov. Block sampling: efficient accurate online aggregation in MapReduce[C]. Proc of CloudCom'13, Piscataway, NJ: IEEE, 2013: 250-257.
[12] Wang Yu-xiang, Luo Jun-zhou, Song Ai-bo, et al. OATS: online aggregation with two-level sharing strategy in cloud[J]. Distributed and Parallel Databases, 2014, 32(1): 1-39.
[13] Wu Ming-xi, Chris Jermaine. Guessing the extreme values in a data set: a Bayesian method and its applications[J]. The VLDB Journal, 2009, 18(2): 571-597.
[14] Wikipedia page traffic statistics[EB/OL]. http://aws.amazon.com/datasets/2596, 2014.