Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs

DEIM Forum 2011 F2-2 HTML 305-8573 1-1-1 305-8573 1-1-1 101-8457 2-2 ( ) 141-0031 8-3-6 135-0064. 2-3-26 HTML ID HTML ID SVM HTML Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs Akihito MORIJIRI, Taichi KATAYAMA,SoichiISHII,TakehitoUTSURO, Yasuhide KAWADA, and Tomohiro FUKUHARA College of Engineering Systems, School of Science and Engineering, University of Tsukuba Graduate School of Systems and Information Engineering,University of Tsukuba Graduate School of Science and Technology for Future Life, Tokyo Denki University Navix Co., Ltd. Center for Service Research, National Institute of Advanced Industrial Science and Technology Abstract Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the number of in-links of target sites. It has been shown that splogs can be detected based on similarity of HTML structures and affiliate IDs. The similarity of HTML structures of splogs is effective in splog detection, and the identity of affiliate IDs extracted from splogs can identify spammers much more directly than similarity of HTML structures, although it is not easy to achieve high coverage in extracting affiliate IDs. The coverage of the intersection of the two clues, similarity of HTML structures and affiliate IDs, is relatively low, and it is necessary to apply them in a complementary strategy. This paper studies how to detect splogs which cannot be detected based on either similarity of HTML structures nor affiliate IDs. We apply SVMs to this task and show that splogs of above type can be detected with high precision. Key words spam blog detection, HTML structure, affiliate, machine learning 1.

Technorati BlogPulse [2] kizasi.jp Globe of Blogs Best Blogs in Asia Directory ( ) [3], [9], [10], [12] [10] 88% 75% [11] [9], [10], [12] [12] TREC Blog06 / [9], [10] BlogPulse [4], [8], [11], [13] HTML HTML ID [7] HTML 1 13 [5] ID ID 1 23 2 [6] 2 HTML ID ( 1 4) Support Vector Machines (SVMs) [16] 2. HTML 2. 1 HTML DOM [18] HTML DOM [7] 2 HTML s HTML HTML P DIV P DIV P DIV [18] BODY P DIV BODY HTML [18] SCRIPT STYLE HTML HTML s DOM dm(s) 2. 2 DOM HTML s t DOM dm(s) dm(t) DP DP 1 2 DP ( ) edit distance (dm(s),dm(t)) DOM dm(s) dm(s) s t DOM Rdiff(s, t) Rdiff(s, t) = edit distance (dm(s),dm(t)) dm(s) + d m(t) HTML s HTML T t T DOM Rdiff(s, t) MinDF(s, T ) MinDF(s, T ) = min Rdiff(s, t T ) t =s 3. ID 3 ID ID ID ID ID [5] ASP( ) ID 10 1 ID 1 Am At D Gl I Lk R St Tr V

1 KANSHIN [1] RSS Atom S F 2 11 S 48,183 14,352 ( 30%) F 60,977 6,231 ( 10%) ID ASP ID ID ID ID ID ID [5] ID ID [5] ID ASP10 ID 20,583 (S F ) 10 1 10 ID ID H ID SP af (H, x) x 3 S 72 1,101 F 56 953 ID ID 129 ID ID 2,472 ID ID 121 ID 2,054 ( ID 93.8% 83.3%) 2 121 2 ID ID ID ID ID

1 2 HTML DOM DOM ID x H ID x SP af (H,x) ID ID 1 4. SVM 4. 1 / URL / HTML URL URL i) HTML URL ii) HTML 2 URL URL u URL log u u u URL 4. 2 [14], [17] / 3 / w w φ 2. ID ID 3 (http://chasen-legacy.sourceforge.jp/) IPAdic (http://sourceforge.jp/projects/ipadic/)

φ 2 (, w)= w w freq(, w freq(, w )=a )=b freq(, freq(, w) =c w) =d (ad bc) 2 (a + b)(a + c)(b + d)(c + d) ( log φ 2 (, w) w w 4. 3 URL / URL URL ( ) w s AncfB(w, s) AncfW (w, s) s w AncfB(w, s) = URL s w AncfW (w, s) = URL AncfB(w, s) 2 s URL w URL t t URL ( ) log AncfB(w, s) AncfB(w, t) w s AncfW (w, s) 2 s URL w URL t t URL ( ) log AncfW (w, s) AncfW (w, t) w s 5. SVM TinySVM (http://chasen.org/~taku/software/tinysvm/) ) 2 2 6. 2. SVM [15] LBD s LBD ab 6. 6. 1 3. S 48,183 F 60,977 10 ID 1 S 1,101 F 953 ID ID ID ID ID S 72 ID F 56 ID ID ID 500 ( ) SVM ID ID S 13 ID F

(a) 4 (S ) (b) (a) 5 (F ) (b) 12 ID ID ID 500 ( ) SVM 500 4 6. 2 3. S 48,183 F 60,977 10 ID 10 ID HTML (MinDF > 0.15) S 47,029 F 59,982 1 4 S 4 ID 283 F 268 / 6. 3 LBD s LBD s LBD s S 4(a) F 5(a) LBD ab LBD ab LBD ab S 4(b) F 5(b) 4 5 ID ID ID ID

6. 4 4 5 ID ID ID 4(a) 5(a) S 35% F 60% 90% 4(b) 5(b) S 17% F 15% 90% 6. 2 1 4 HTML ID HTML [7] ID [5] 7. HTML ID HTML ID SVM [1],,,.. 21, 2007. [2] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In Proc. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004. [3] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pp. 39 47, 2005. [4].. Letters, Vol. 6, No. 4, pp. 37 40, 2008. [5],,,. ID. Web (WebDB2010)., 2010. [6],,,,,. HTML. Web (WebDB2010)., November 2010. [7],,,,. HTML. 2 DEIM, 2010. [8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pp. 92 99, 2006. [9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007. [10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proc. 3rd Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006. [11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pp. 1 8, 2007. [12] C. Macdonald and I. Ounis. The TREC Blogs06 collection : Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006. [13] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proc. 1st AIRWeb, 2005. [14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analyzing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, pp. 33 40, 2008. [15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. 17th ICML, pp. 999 1006, 2000. [16] V. N. Vapnik. Statistical Learning Theory. Wiley- Interscience, 1998. [17] Y.M. Wang, M. Ma, Y. Niu, and H. Chen. Spam doublefunnel: Connecting web spammers with advertisers,. In Proc. 16th WWW, pp. 291 300, 2007. [18],.., Vol. 8, No. 1, pp. 29 34, 2009.