DEWS2007 L5-4

Light-weight acceleration for streaming XML document filtering

Shuichi MITARAI, Akira ISHINO, and Masayuki TAKEDA
Department of Informatics, Kyushu University
Department of System Information Sciences, Tohoku University
E-mail: {mitarai,takeda}@i.kyushu-u.ac.jp, ishino@ecei.tohoku.ac.jp

Abstract This paper proposes a scalable XML filtering method based on preprocessing of XML data. The XML data is preprocessed and transformed into a pair consisting of a path trie and binary XML data. The path trie is the trie representing the set of strings of tag names along root-to-leaf paths. The binary XML data is obtained from the original XML data by replacing every start tag and the corresponding end tag with special byte codes. Each occurrence of the special byte code for a start tag is followed by the ID of the corresponding node of the path trie. Path pattern matching is performed against the path trie. Since the path trie is much smaller than the XML data, a drastic speedup is possible. Query pattern processing is done by combining the keyword occurrences found while scanning the binary XML data with the information attached to the path trie nodes, which are identified by the node IDs embedded in the binary XML data. Experimental results show that the processing time and memory requirement of our method are, respectively, 1/6 to 1/4 and 1/6 of those of XMLTK, a state-of-the-art streaming XML processor.

Key words stream processing, XML, XML filtering, Path-trie, DFA

1.
SDI (Selective Dissemination of Information) [6] publish/subscribe bag-of-words XML (Extensible Markup Language) XPath [20] XPath W3C XML
XML DOM DOM 50 200MB XML ([3], [4], [8], [15]) XML DOM XML YFilter [8] XMLTK [4] XPath (forward axes) (ancestor axes), (sibling axes) [5] XML YFilter XMLTK XML

YFilter XFilter [3] (NFA) XFilter NFA (DFA) NFA DFA XMLTK XML NFA DFA Lazy-DFA [4] XMLTK DFA XML

XML [7] XML [2] XMLTK (Stream Index; SIX) [4] XML 1 2 XML SIX DataGuide [9] XML DataGuide

2. XML 3. YFilter, XMLTK SIX XMLTK SIX 4. predicate, count 5.

2.
[22], [23] XQuery DFA XML // * DFA

2.1
$\Sigma$ N XML N $\Sigma$ () 1 N * / // / // - $\pi$ XML 1 name=value value @name
x y e = true 1 XML (x, y)

XPattern $\pi_1[\pi_2] : e$ * // $\pi$ $\pi$ * $\pi$ $\pi$ $\pi_1$ $\pi_2$ $\pi_1[\pi_2]$ XML x y x = y $\pi_1[\pi_2]$ (x, y) $\pi_1$ $\pi_2$ x x y x y x

$e = e(w_1, \ldots, w_m)$ $\Sigma$ $w_1, \ldots, w_m$ $\Sigma$ e $w_1, \ldots, w_m$ (truth assignment) e XPattern $\pi_1[\pi_2]$ e $\pi_1[\pi_2] : e$ XPattern $\pi_1[\pi_2] : e$ XML x x y $\pi_1[\pi_2]$ (x, y) y e $\pi_1[\pi_2]$ XPattern $\pi_1[\pi_2] : \mathrm{true}$ 1 XPattern $\pi_1[\pi_2] : e$ (x, y)

Problem 1
Given: XML T.
Query: XPatterns $P_1, \ldots, P_l$.
Answer: $i = 1, \ldots, l$ T XML $P_i$ x

XML T XML XML XML x 4. XPattern x [19]

2.2 XML XML XML XML

Figure 2 (tag names b and d are reconstructed from the path trie labels):
Input XML file: <a> <b> <c>...</c> </b> <b> <d>...</d> </b> <b> <c> <d>...</d> <b> <d>...</d> </b> </c> </b> </a>
Binary XML file: [1 [2 [3 ... ] ] [2 [4 ... ] ] [2 [3 [5 ... ] [6 [7 ... ] ] ] ] ]
Path trie (node ID: path): 1: a, 2: a/b, 3: a/b/c, 4: a/b/d, 5: a/b/c/d, 6: a/b/c/b, 7: a/b/c/b/d

XML XML 2 2 XML XML XML ID XML [ [ XML $O(n \log N + T)$ 2 n XML T N T XML XML 1, 2
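The transformation of Section 2.2 can be illustrated with a small Python sketch. It is not the authors' implementation: the function name `encode`, the regex tokenizer (which ignores attributes, comments, and self-closing tags), the tokens "[" and "]" standing in for the two special byte codes, and the tag names b and d in the sample document are all assumptions made for this example.

```python
import re

# Matches a start tag, an end tag, or a run of text (simplified XML only).
TOKEN = re.compile(r"<(/?)([A-Za-z_][\w.-]*)\s*>|([^<]+)")

# Sample document shaped like the Figure 2 input (tag names b, d assumed).
DOC = ("<a><b><c>x</c></b><b><d>y</d></b>"
       "<b><c><d>z</d><b><d>w</d></b></c></b></a>")

def encode(xml_text):
    """Return (trie, tokens).

    trie maps (parent node ID, tag name) -> child node ID, where ID 0 is a
    virtual root; IDs are assigned in order of first occurrence.
    tokens is the "binary XML" as a list: "[" followed by a trie node ID for
    each start tag, "]" for each end tag, and text content kept as strings.
    """
    trie = {}
    tokens = []
    stack = [0]                      # trie node IDs of the currently open elements
    for m in TOKEN.finditer(xml_text):
        closing, tag, text = m.groups()
        if text is not None:
            if text.strip():
                tokens.append(text.strip())
        elif closing:
            stack.pop()
            tokens.append("]")
        else:
            # Reuse the trie node for this root-to-element path, or create it.
            node = trie.setdefault((stack[-1], tag), len(trie) + 1)
            stack.append(node)
            tokens += ["[", node]
    return trie, tokens

if __name__ == "__main__":
    trie, tokens = encode(DOC)
    print(len(trie), "trie nodes")
    print(" ".join(str(t) for t in tokens))
```

The trie grows only when a new root-to-element path appears, which is why it stays far smaller than the document when the same paths repeat many times.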
Table 1: DBLP [11], xmlgen [17]
XML        (MB)
DBLP       352    8,632,812    35
random     111    1,666,310    74

Table 2
XML        (MB)    XML    CPU (sec)
DBLP       208     138    100.41
random     84      515    73.26

Figure 3: XML, $P_1$ = a//b[//d]. Path trie annotations: node 4: <P1, 1>; node 5: <P1, 2>; node 7: <P1, 1>, <P1, 3>.

XML XML 76%

2.3 XML
2 XML XML $P_1$ = a//b[//d] 3 $P_1$ = a//b[//d] XML (x, y) = (4, 5), (6, 8), (6, 10), (9, 10) (x, y) = (2, 5), (2, 7), (6, 7), (2, 4) (x, y) y x, y XML XML (6, 8) $P_1$ 5 <P1, 2> XML XML y = 8 5 5 <P1, 2> y 2 x = 6 (x, y) [21]

XML (NFA) NFA NFA [14] $\pi$ NFA $|\pi| + 1$ NFA 32 64 NFA $O(N|\pi|)$ n, h $O(h\,n)$ (NFA)

$\pi_1[\pi_2]$ $\pi_1 \pi_2$ NFA $M^{12}_f$ $\pi_2$ $\pi_2^R$ NFA $M_2$ $M^{12}_f$ $M_2$ XML NFA $M^{12}_f$, $M_2$ $P_i = \pi_1[\pi_2]$ (x, y) (x, y) y <P_i, x, y> XML XML YFilter XMLTK

2.4
XPath
(i) /a/[contains(name, "mickey")]
(ii) /a//name[./text()="mickey mouse"]
(iii) /a/[@month=december]
(i) /a//name mickey (ii) name mickey mouse (iii) /a/ December XML Aho-Corasick (AC) AC AC
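The path-trie matching of Section 2.3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `compile_nfa`, `advance`, `annotate`, and `resolve` are hypothetical, the pattern syntax is limited to `/`, `//`, and `*`, and the tag names b and d are an assumed reconstruction of the Figure 2 example. The NFA for a pattern with k steps has k + 1 states, held in an integer bitmask. `annotate` attaches to each trie node y the distances up to a matching ancestor x, like the <P1, k> labels of Figure 3, and `resolve` recovers the (x, y) pairs of XML nodes during one scan of the token stream.

```python
import re
from collections import defaultdict

# Path trie and token stream of the Figure 2 example (text content omitted).
TRIE = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (2, "d"): 4,
        (3, "d"): 5, (3, "b"): 6, (6, "d"): 7}
TOKENS = ["[", 1, "[", 2, "[", 3, "]", "]", "[", 2, "[", 4, "]", "]",
          "[", 2, "[", 3, "[", 5, "]", "[", 6, "[", 7, "]", "]", "]", "]", "]"]

def compile_nfa(pattern):
    """Return (steps, loop_mask, accept_bit) for a pattern such as 'a//b'."""
    steps = [(axis or "/", name)
             for axis, name in re.findall(r"(//|/)?([^/]+)", pattern)]
    # Bit i of loop_mask: state i may skip a label because step i uses '//'.
    loop = sum(1 << i for i, (axis, _) in enumerate(steps) if axis == "//")
    return steps, loop, 1 << len(steps)

def advance(nfa, state, label):
    """One NFA transition, taken when descending to a child named label."""
    steps, loop, _ = nfa
    match = sum(1 << i for i, (_, name) in enumerate(steps)
                if name in ("*", label))
    return ((state & match) << 1) | (state & loop)

def annotate(trie, pi1, pi2):
    """Map trie node y -> distances k such that the ancestor k levels up is x."""
    children = defaultdict(list)
    for (parent, tag), node in trie.items():
        children[parent].append((tag, node))
    n1, n2 = compile_nfa(pi1), compile_nfa(pi2)
    notes = defaultdict(list)

    def visit(node, depth, s1, runs):
        for tag, child in children[node]:
            c1 = advance(n1, s1, tag)
            # Advance every pi2 run that was started at an ancestor x.
            c_runs = [(d, advance(n2, s2, tag)) for d, s2 in runs]
            if c1 & n1[2]:                 # child matches pi1: it is an x
                c_runs.append((depth + 1, 1))
            for d, s2 in c_runs:
                if s2 & n2[2]:             # child matches pi2 relative to x
                    notes[child].append(depth + 1 - d)
            visit(child, depth + 1, c1, c_runs)

    visit(0, 0, 1, [])
    return dict(notes)

def resolve(tokens, notes):
    """Scan the token stream once and return (x, y) pairs of XML node numbers."""
    pairs, open_ids, count = [], [], 0
    it = iter(tokens)
    for tok in it:
        if tok == "[":
            node = next(it)                # trie node ID follows the start code
            count += 1                     # XML nodes numbered in document order
            open_ids.append(count)
            for k in notes.get(node, ()):
                pairs.append((open_ids[-1 - k], count))
        elif tok == "]":
            open_ids.pop()
    return pairs

if __name__ == "__main__":
    notes = annotate(TRIE, "a//b", "//d")
    print(notes)
    print(sorted(resolve(TOKENS, notes)))
```

The pattern is matched once against the seven trie nodes rather than against every element of the document; the scan then needs only a stack lookup per start code.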
2.5 XML
XML XPattern $P_1, \ldots, P_l$ e $W = w_1, \ldots, w_m$ W AC M XML offset XML depth 2 Occ Q Occ $w_i$ Occ[d][i] Q XPattern $P_q$ Q[d][q] XML 1 c c [ v (v, offset) S depth 1 Occ[depth][1...m] Q[depth][1...l] M c S (v, offset) v q, XPattern $P_q$ e Occ[depth][1...m] e Q[depth][q] depth 1 M c M c Occ[depth][1...m]

3.
RedHat Linux Advanced Server 2.1, CPU 2.4GHz Intel Pentium4, 2.0GB RAM XML xmlgen [17] 111MB XMLTK (location step) YFilter pathgenerator [1] // * (1%, 1%) (10%, 10%)
3.1
3.2 XML
3.3 YFilter XMLTK
3.4 SIX
3.5 AC

Figure 4: processing time (msec) versus number of queries; series: path patterns, simple path patterns.

3.1
4 NFA XML ( 5) XML 1,666,310 515 2 NFA

3.2 XML
XML 3 ( ) XML XML [21]

Table 3 (sec)
Prob(//)=Prob(*)=0.01
  1        0.51    0.86
  10       0.52    1.29
  100      0.54    4.30
  1,000    0.53    21.81
Prob(//)=Prob(*)=0.01
  1        0.51    0.86
  10       0.52    1.36
  100      0.52    5.51
  1,000    0.54    34.50
l NFA NFA l

3.3 XMLTK YFilter
4 10,000, 100,000 (//, * (1%, 1%), (10%, 10%) 2 ) Linux RSS (Resident Set Size) 100,000 YFilter 10,000 (//, * 10%) 4,925KB XMLTK 34,412KB, YFilter 1,494,845KB 6 5 1 100,000 YFilter XMLTK 100,000 XMLTK

Table 4 (KB)
                            YFilter      XMLTK
1%     10,000     3,320     1,169,975    30,288
1%     100,000    18,632    -            285,328
10%    10,000     4,952     1,494,845    34,412
10%    100,000    28,543    -            318,560

Table 5 (sec)
                      YFilter    XMLTK
1          0.57       39.24      2.27
10         0.57       42.54      2.57
100        0.57       45.22      3.09
1,000      0.67       61.30      4.14
10,000     1.85       155.30     11.83
100,000    142.04     -          270.81

3.4 SIX
Stream IndeX (SIX) [4] SIX 6 5 SIX 1.6 7.2 5 XMLTK SIX XMLTK // *

Table 6: SIX (sec)
                      XMLTK
1          0.00       0.14
10         0.07       1.53
100        0.18       2.49
1,000      0.39       4.37
10,000     1.81       12.25
100,000    139.62     275.74

3.5 AC
AC (SA) [13] (CSA) [10], [16] 5N bytes N 8 () (1) (2) (3)
7 occ m D CSA, l CSA N ( , l) = (8, 128) CSA, SA http://pizzachili.dcc.uchile.cl/texts/nlang 100MB 100 ( ) 50 ( ) 100 ( ) AC 2 AC

Table 7: AC, SA, CSA
AC: $O(m + N + occ)$.
         SA                              CSA
(1)      $O(m \log N + occ)$             $O(m \log N + occ \log^{\epsilon} N)$
(2)      $O(m \log N + occ)$             $O(m \log N + occ \log^{\epsilon} N)$
(3)      $O(m \log N + occ \log D)$      $O(m \log N + occ(\log^{\epsilon} N + \log D))$
         5N (bytes)                      N (bytes)

Table 8: AC, CSA ( = 8, l = 128), SA
39,670,500    2,532,290    2,159

(sec)    AC         CSA        SA
100      1.31800    0.00450    0.00161
100      0.81000    0.00525    0.00156
100      0.36300    0.00630    0.00051

(sec)    AC         CSA        SA
100      1.31800    445.866    0.00161
100      0.81000    31.027     0.00156
100      0.36300    0.0394     0.00051

(sec)    AC         CSA        SA
100      1.59500    517.113    22.880
100      0.91300    36.364     1.460
100      0.44900    0.0429     0.002

4.
4.1
XPattern $P_1 = \pi^1_1[\pi^1_2] : e_1, \ldots, P_l = \pi^l_1[\pi^l_2] : e_l$ $e_i$ $w_1, \ldots, w_m$ $e_i$ $w_1, \ldots, w_m$ $P_1, \ldots, P_{i-1}$ $e_i$ 2.5 depth Occ[depth][1..m] Q[depth][1..i-1]

a) $\pi[\pi_1 \text{ and } \pi_2]$ XPath
$P_1 = \pi[\pi_1] : \mathrm{true}$
$P_2 = \pi[\pi_2] : \mathrm{true}$
$P_3 = \pi[\varepsilon] : P_1 \wedge P_2$
XPath $\pi[\pi_1 \text{ and } \pi_2]$ Q[depth][3] XPath $\pi[\pi_1 \text{ or } \pi_2]$ $\pi[\pi_1 \text{ and } \pi_2 \text{ or } \pi_3]$

b) XPath $\pi_1[\pi_2[\pi_3 \text{ and } \pi_4]]$
$P_1 = \pi_1\pi_2[\pi_3] : \mathrm{true}$
$P_2 = \pi_1\pi_2[\pi_4] : \mathrm{true}$
$P_3 = \pi_1[\pi_2] : (P_1 \wedge P_2)$
XPath $\pi_1[\pi_2[\pi_3 \text{ and } \pi_4]]$ Q[depth][3]
$P_4 = \pi_1[\pi_2\pi_3] : \mathrm{true}$
$P_5 = \pi_1[\pi_2\pi_4] : \mathrm{true}$
$P_6 = \pi_1[\varepsilon] : (P_4 \wedge P_5)$
XPath $\pi_1[\pi_2[\pi_3 \text{ and } \pi_4]]$ Q[depth][6]

4.2
$P_i$ XML 1 $e_i$ true, false 1, 0, , and 2 $P_i = \pi_1[\pi_2] : e_i$ x, $P_i$ (x, y) $P_i$ y $e_i$ x $P_i$ x 0 XPath $\pi_1[\mathrm{count}(\pi_2) > 1]$ count sum, max, min, and avg (average)

5.
AC XPath 3. YFilter XMLTK
SDI SDI SIX XML XML (1) (2) (3) (twig pattern) (1) [12], [18] (2) XML [22], [23] DFA XMLTK (3) 4. XPattern XML

[1] : Filtering and Transformation for High-Volume XML Message Brokering, http://yfilter.cs.berkeley.edu/code_release.htm.
[2] : Report From the W3C Workshop on Binary Interchange of XML Information Item Sets, http://www.w3.org/2003/08/binary-interchange-workshop/report.html (2003).
[3] Altinel, M. and Franklin, M.: Efficient filtering of XML documents for selective dissemination, VLDB'00, pp. 53-64 (2000).
[4] Avila-Campillo, I., Green, T. J., Gupta, A., Onizuka, M., Raven, D. and Suciu, D.: XMLTK: An XML Toolkit for Scalable XML Processing, PLANX'02 (2002).
[5] Barton, C., Charles, P., Goyal, D., Raghavachari, M., Josifovski, V. and Fontoura, M.: Streaming XPath processing with forward and backward axes, ICDE, pp. 455-466 (2003).
[6] Carzaniga, A., Rosenblum, D. and Wolf, A.: Design and Evaluation of a Wide-Area Event Notification Service, Vol. 19, No. 3, pp. 332-383 (2000).
[7] Chen, Y., Mihaila, G. A., Davidson, S. B. and Padmanabhan, S.: EXPedite: A System for Encoded XML Processing, CIKM'04, pp. 108-117 (2004).
[8] Diao, Y., Altinel, H., Franklin, M. J., Zhang, H. and Fischer, P. M.: Path Sharing and Predicate Evaluation for High-Performance, ACMTOD (2003).
[9] Goldman, R. and Widom, J.: DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases, VLDB'97, pp. 436-445 (1997).
[10] Grossi, R. and Vitter, J. S.: Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching, STOC'00, pp. 397-406 (2000).
[11] Ley, M.: DBLP Computer Science Bibliography, http://dblp.uni-trier.de/.
[12] M. Takeda, et al.: Speeding up string pattern matching by text compression: The dawn of a new era, Trans. Information Processing Society of Japan, Vol. 42, No. 3, pp. 370-384 (2001). Special issue for IPSJ 40th anniversary award papers.
[13] Manber, U. and Myers, G.: Suffix arrays: A new method for on-line string searches, SIAM J. Computing, Vol. 22, No. 5, pp. 935-948 (1993).
[14] Navarro, G. and Raffinot, M.: Flexible pattern matching in strings: Practical on-line search algorithms for texts and biological sequences, Cambridge University Press (2002).
[15] Peng, F. and Chawathe, S. S.: XPath queries on streaming data, SIGMOD'03, pp. 431-442 (2003).
[16] Sadakane, K.: Compressed Text Databases with Efficient Query Algorithms Based on the Compressed Suffix Array, Proc. of 11th International Symposium on Algorithms and Computation (ISAAC'00), LNCS 1969, pp. 410-421 (2000).
[17] Schmidt, A., Waas, F., Kersten, M. L., Carey, M. J., Manolescu, I. and Busse, R.: XMark: A benchmark for XML data management, VLDB'02, pp. 974-985 (2002).
[18] Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T. and Arikawa, S.: Speeding up pattern Matching by Text Compression, CIAC'00, LNCS 1767, pp. 306-315 (2000).
[19] Takeda, M., Ishino, A. and Mitarai, S.: A Light-Weight XML Query Processor for a Large Number of Structural and Textual Patterns, Technical Report DOI-TR-CS-226, Department of Informatics, Kyushu University (2006).
[20] W3C: XQuery 1.0 and XPath 2.0 Full-Text Use Cases, http://www.w3.org/tr/xmlquery-full-text-use-cases.
[21] XML, DBWeb2003 (2003).
[22] XQuery, DBWeb2005 (2005).
[23] XQuery, Letters Vol. 4, No. 4 (2006).