HTML Utilizing Similarities of HTML Structures in Splog Detection by Machine Learning

Σχετικά έγγραφα
HTML. Evaluating Effects of Similarities of HTML Structures in Splog Detection

Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs


Automatic extraction of bibliography with machine learning

Web. Web p OutDegree(p) log 7 1/OutDegree(p) A New Difinition of Subjective Distance between Web Pages

ER-Tree (Extended R*-Tree)


2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems

{takasu, Conditional Random Field

Maxima SCORM. Algebraic Manipulations and Visualizing Graphs in SCORM contents by Maxima and Mashup Approach. Jia Yunpeng, 1 Takayuki Nagai, 2, 1

ΣΔΥΝΟΛΟΓΗΚΟ ΔΚΠΑΗΓΔΤΣΗΚΟ ΗΓΡΤΜΑ ΗΟΝΗΧΝ ΝΖΧΝ «ΗΣΟΔΛΗΓΔ ΠΟΛΗΣΗΚΖ ΔΠΗΚΟΗΝΧΝΗΑ:ΜΔΛΔΣΖ ΚΑΣΑΚΔΤΖ ΔΡΓΑΛΔΗΟΤ ΑΞΗΟΛΟΓΖΖ» ΠΣΤΥΗΑΚΖ ΔΡΓΑΗΑ ΔΤΑΓΓΔΛΗΑ ΣΔΓΟΤ

SocialDict. A reading support tool with prediction capability and its extension to readability measurement

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

Twitter 6. DEIM Forum 2014 A Twitter,,, Wikipedia, Explicit Semantic Analysis,


Wiki. Wiki. Analysis of user activity of closed Wiki used by small groups

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

Development of a Seismic Data Analysis System for a Short-term Training for Researchers from Developing Countries

Buried Markov Model Pairwise

Applying Markov Decision Processes to Role-playing Game

1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

IMES DISCUSSION PAPER SERIES

[4] 1.2 [5] Bayesian Approach min-max min-max [6] UCB(Upper Confidence Bound ) UCT [7] [1] ( ) Amazons[8] Lines of Action(LOA)[4] Winands [4] 1

Resurvey of Possible Seismic Fissures in the Old-Edo River in Tokyo

Πτυχιακή Εργασι α «Εκτι μήσή τής ποιο τήτας εικο νων με τήν χρή σή τεχνήτων νευρωνικων δικτυ ων»

Anomaly Detection with Neighborhood Preservation Principle

* ** *** *** Jun S HIMADA*, Kyoko O HSUMI**, Kazuhiko O HBA*** and Atsushi M ARUYAMA***

Topic Structure Mining based on Wikipedia and Web Search

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices

ES440/ES911: CFD. Chapter 5. Solution of Linear Equation Systems

þÿÿ ÁÌ» Â Ä Å ¹µÅ Å½Ä ÃÄ

þÿ Ç»¹º ³µÃ ± : Ãż²» Ä Â

J. of Math. (PRC) 6 n (nt ) + n V = 0, (1.1) n t + div. div(n T ) = n τ (T L(x) T ), (1.2) n)xx (nt ) x + nv x = J 0, (1.4) n. 6 n

Feasible Regions Defined by Stability Constraints Based on the Argument Principle

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

Vol. 31,No JOURNAL OF CHINA UNIVERSITY OF SCIENCE AND TECHNOLOGY Feb

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

Schedulability Analysis Algorithm for Timing Constraint Workflow Models

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

Ενεργητική Μάθηση Με Χρήση Μηχανών ιανυσµάτων Υποστήριξης. Ανδρέας Βλάχος Πανεπιστήµιο του Εδιµβούργου

VBA Microsoft Excel. J. Comput. Chem. Jpn., Vol. 5, No. 1, pp (2006)

Toward a SPARQL Query Execution Mechanism using Dynamic Mapping Adaptation -A Preliminary Report- Takuya Adachi 1 Naoki Fukuta 2.

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

Quick algorithm f or computing core attribute

Ερευνητική+Ομάδα+Τεχνολογιών+ Διαδικτύου+

Web DEIM Forum 2009 A7-1. Web. Web. Web. Web. 4 Wikipedia. Wikipedia. Web.

Stabilization of stock price prediction by cross entropy optimization

ΔΙΑΧΕΙΡΙΣΗ ΠΕΡΙΕΧΟΜΕΝΟΥ ΠΑΓΚΟΣΜΙΟΥ ΙΣΤΟΥ ΚΑΙ ΓΛΩΣΣΙΚΑ ΕΡΓΑΛΕΙΑ. Data Mining - Classification

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

ΦΥΛΛΟ ΕΡΓΑΣΙΑΣ Α. Διαβάστε τις ειδήσεις και εν συνεχεία σημειώστε. Οπτική γωνία είδησης 1:.

ΖΩΝΟΠΟΙΗΣΗ ΤΗΣ ΚΑΤΟΛΙΣΘΗΤΙΚΗΣ ΕΠΙΚΙΝΔΥΝΟΤΗΤΑΣ ΣΤΟ ΟΡΟΣ ΠΗΛΙΟ ΜΕ ΤΗ ΣΥΜΒΟΛΗ ΔΕΔΟΜΕΝΩΝ ΣΥΜΒΟΛΟΜΕΤΡΙΑΣ ΜΟΝΙΜΩΝ ΣΚΕΔΑΣΤΩΝ

GPGPU. Grover. On Large Scale Simulation of Grover s Algorithm by Using GPGPU

Newman Modularity Newman [4], [5] Newman Q Q Q greedy algorithm[6] Newman Newman Q 1 Tabu Search[7] Newman Newman Newman Q Newman 1 2 Newman 3

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

Detection and Recognition of Traffic Signal Using Machine Learning

The Algorithm to Extract Characteristic Chord Progression Extended the Sequential Pattern Mining

: Active Learning 2017/11/12

Τεχνολογικά υποβοηθούμενη μάθηση: Εργαλεία και τεχνολογίες

Probabilistic Approach to Robust Optimization

Global energy use: Decoupling or convergence?

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Simplex Crossover for Real-coded Genetic Algolithms

Web 論 文. Performance Evaluation and Renewal of Department s Official Web Site. Akira TAKAHASHI and Kenji KAMIMURA

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

J. of Math. (PRC) Banach, , X = N(T ) R(T + ), Y = R(T ) N(T + ). Vol. 37 ( 2017 ) No. 5

Ανάκτηση Εικόνας βάσει Υφής με χρήση Eye Tracker

Evolution of Novel Studies on Thermofluid Dynamics with Combustion

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ. «Θεσμικό Πλαίσιο Φωτοβολταïκών Συστημάτων- Βέλτιστη Απόδοση Μέσω Τρόπων Στήριξης»

CorV CVAC. CorV TU317. 1

A research on the influence of dummy activity on float in an AOA network and its amendments

SVM. Research on ERPs feature extraction and classification

GF GF 3 1,2) KP PP KP Photo 1 GF PP GF PP 3) KP ULultra-light 2.KP 2.1KP KP Fig. 1 PET GF PP 4) 2.2KP KP GF 2 3 KP Olefin film Stampable sheet

Identification of Fish Species using DNA Method

Molecular evolutionary dynamics of respiratory syncytial virus group A in

Splice site recognition between different organisms

Homomorphism in Intuitionistic Fuzzy Automata

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΓΕΩΠΟΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΒΙΟΤΕΧΝΟΛΟΓΙΑΣ ΚΑΙ ΕΠΙΣΤΗΜΗΣ ΤΡΟΦΙΜΩΝ. Πτυχιακή εργασία

ΨΗΦΙΑΚΗ ΕΠΕΞΕΡΓΑΣΙΑ ΕΙΚΟΝΑΣ

Ψηφιακή ανάπτυξη. Course Unit #1 : Κατανοώντας τις βασικές σύγχρονες ψηφιακές αρχές Thematic Unit #1 : Τεχνολογίες Web και CMS

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΤΜΗΜΑ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑΣ ΥΠΟΛΟΓΙΣΤΩΝ ΤΟΜΕΑΣ ΣΥΣΤΗΜΑΤΩΝ ΗΛΕΚΤΡΙΚΗΣ ΕΝΕΡΓΕΙΑΣ

, -.

Σύγκριση Πραγματολογικών Ικανοτήτων σε Παιδιά Σχολικής Ηλικίας με Διάχυτες Αναπτυξιακές Διαταραχές και Νοητική Υστέρηση

Αλγοριθµική και νοηµατική µάθηση της χηµείας: η περίπτωση των πανελλαδικών εξετάσεων γενικής παιδείας 1999

Εκπαίδευση και Πολιτισμός: έρευνα προκαταρκτικής αξιολόγησης μίας εικονικής έκθεσης

Εκτεταμένη περίληψη Περίληψη

Professional Tourism Education EΠΑΓΓΕΛΜΑΤΙΚΗ ΤΟΥΡΙΣΤΙΚΗ ΕΚΠΑΙΔΕΥΣΗ. Ministry of Tourism-Υπουργείο Τουρισμού

Ζητήματα Τυποποίησης στην Ορολογία - ο ρόλος και οι δράσεις της Επιτροπής Ορολογίας ΤΕ21 του ΕΛΟΤ

ΑΝΙΧΝΕΥΣΗ ΓΕΓΟΝΟΤΩΝ ΒΗΜΑΤΙΣΜΟΥ ΜΕ ΧΡΗΣΗ ΕΠΙΤΑΧΥΝΣΙΟΜΕΤΡΩΝ ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

ΕΛΕΓΧΟΣ ΤΩΝ ΠΑΡΑΜΟΡΦΩΣΕΩΝ ΧΑΛΥΒ ΙΝΩΝ ΦΟΡΕΩΝ ΜΕΓΑΛΟΥ ΑΝΟΙΓΜΑΤΟΣ ΤΥΠΟΥ MBSN ΜΕ ΤΗ ΧΡΗΣΗ ΚΑΛΩ ΙΩΝ: ΠΡΟΤΑΣΗ ΕΦΑΡΜΟΓΗΣ ΣΕ ΑΝΟΙΚΤΟ ΣΤΕΓΑΣΤΡΟ

Technical Research Report, Earthquake Research Institute, the University of Tokyo, No. +-, pp. 0 +3,,**1. No ,**1

PUBLICATION. Participation of POLYTECH in the 10th Pan-Hellenic Conference on Informatics. April 15, Nafplio

ST5224: Advanced Statistical Theory II

Research on Economics and Management

Χρήση οντολογιών στη χαρτογράφηση γνώσης: Μελέτη περίπτωσης σε μία ακαδημαϊκή βιβλιοθήκη

ΠΑΝΔΠΙΣΗΜΙΟ ΜΑΚΔΓΟΝΙΑ ΠΡΟΓΡΑΜΜΑ ΜΔΣΑΠΣΤΥΙΑΚΧΝ ΠΟΤΓΧΝ ΣΜΗΜΑΣΟ ΔΦΑΡΜΟΜΔΝΗ ΠΛΗΡΟΦΟΡΙΚΗ

Working Paper Series 06/2007. Regulating financial conglomerates. Freixas, X., Loranth, G. and Morrison, A.D.

1 (forward modeling) 2 (data-driven modeling) e- Quest EnergyPlus DeST 1.1. {X t } ARMA. S.Sp. Pappas [4]

Πανεπιστήµιο Πειραιώς Τµήµα Πληροφορικής

ΠΑΝΔΠΗΣΖΜΗΟ ΠΑΣΡΩΝ ΣΜΖΜΑ ΖΛΔΚΣΡΟΛΟΓΩΝ ΜΖΥΑΝΗΚΩΝ ΚΑΗ ΣΔΥΝΟΛΟΓΗΑ ΤΠΟΛΟΓΗΣΩΝ ΣΟΜΔΑ ΤΣΖΜΑΣΩΝ ΖΛΔΚΣΡΗΚΖ ΔΝΔΡΓΔΗΑ

Transcript:

2009 Technical Report on Information-Based Induction Sciences 2009 (IBIS2009) HTML Utilizing Similarities of HTML Structures in Splog Detection by Machine Learning Taichi Katayama Takayuki Yoshinaka Takehito Utsuro Yasuhide Kawada Tomohiro Fukuhara Abstract: Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the PageRank of target sites. Among those splogs, this paper focuses on detecting a group of splogs which are estimated to be created by an identical spammer. We especially show that similarities of html structures among those splogs created by an identical spammer contribute to improving the performance of splog detection. In measuring similarities of html structures, we extract a list of blocks (minimum unit of content) from the DOM tree of a html file. We show that the html files of splogs estimated to be created by an identical spammer tend to have similar DOM trees and this tendency is quite effective in splog detection. Keywords: spam blog, machine learning, HTML structure, confidence measure, SVM 1, 305-8573 1-1-1, email: {katayamataichi2008, utsuro} @ nlp.iit.tsukuba.ac.jp Graduate School of Systems and Information Engineering,University of Tsukuba, Tsukuba, 305-8573, Japan, 101-8457 2-2, email: yoshinaka @ cdl.im.dendai.ac.jp Graduate School of Engineering, Tokyo Denki University, Tokyo, 101-8457, Japan ( ), 141-0031 8-3-6, Navix Co., Ltd., 8-3-6 Nishi-Gotanda, Shinagawa-Ku Tokyo 141-0031, Japan, 277-8568 5-1-5, email: fukuhara @ race.u-tokyo.ac.jp Research into Artifacts, Center for Engineering, University of Tokyo Kashiwa, Chiba 277-8568, Japan Technorati 1 BlogPulse 2 kizasi.jp 3 blogwatcher 4 [1] Globe of Blogs 5 Best Blogs in Asia Directory 6 Blogwise 7 ( ) [2,3,4,5,6] 1 http://technorati.com/ 2 http://www.blogpulse.com/ 3 http://kizasi.jp/ ( ) 4 http://blogwatcher.pi.titech.ac.jp/ ( ) 5 http://www.globeofblogs.com/ 6 http://www.misohoni.com/bba/ 7 http://www.blogwise.com/

1: / 1 (a) / CC SS 198 293 277 768 210 259 2849 3318 408 552 3126 4086 (b) CC SS ID 1 2 3 4 163 26 31 33 [4] 88% 75% [3, 7] [5] TREC 8 Blog06 / [4, 6] BlogPulse [8, 9, 10, 4] HTML Support Vector Machines [11] (SVM) SVM SVM [12] HTML SVM 8 http://trec.nist.gov/ 1: HTML DOM DOM 2 / 2007 9 2008 2 / / [13, 14] / / [13, 14] ID ID 1 CC ID=1 SS ID= 2, 3, 4 [13, 14] 2 HTML

3 HTML 3.1 HTML DOM [15] HTML DOM 1 HTML s HTML HTML P DIV P DIV P DIV [15] BODY P DIV BODY HTML [15] SCRIPT STYLE HTML HTML s DOM dm(s) 3.2 DOM HTML s t DOM dm(s) dm(t) DP DP 1 2 DP edit distance (dm(s),dm(t)) s t DOM Rdiff(s, t) Rdiff(s, t) = edit distance (dm(s),dm(t)) dm(s) + dm(t) 1 HTML DOM 3.3 DOM HTML S T HTML s S t T DOM DOM HTML s S HTML T t T Rdiff(s, t) 10 AvMinDF 10 (s, T ) 9 AvMinDF 10 (s, T ) = T Rdiff(s, t T ) 10 t Rdiff(s, t) (ID=1, CC ) (ID=2, 3, 4, SS ) DOM 2 ID ( ID=1) S T S s AvMinDF 10 (s, T ) (ID=1, CC ) ID=1 S CC T S s AvMinDF 10 (s, T ) SC SS 2(a) CC S T S s AvMinDF 10 (s, T ) 2(b) (c) (d) SS ( 2(b) (c) (d) ) 2(a),(c),(d) 2(a) AvMinDF 10 (s, T ) 0 0.15 9 AvMinDF k (s, T ) k =1,...25 k =10

(a) (ID = 1 CC ) (b) (ID = 2 SS ) (c) (ID = 3 SS ) (d) (ID = 4 SS ) 2: DOM 2(b) ID=2, AvMinDF 10 (s, T ) ID HTML DOM 4 SVM 4.1 DOM 3 HTML DOM s 3.3 AvMinDF 10 (s, T ) log AvMinDF 10 (s, T ) T 1. T DOM ( ) 2. T DOM ( ) 4.2 [16, 17] 4.2.1 / URL / HTML URL URL i) HTML URL ii) HTML 2 URL

URL u URL log u u u 4.2.2 [18, 13, 14] / 10 / w w φ 2. w w freq(, freq(, w )=a w )=b freq( freq(, w) =c φ 2 (, w) =, w) =d (ad bc) 2 (a + b)(a + c)(b + d)(c + d) log ( ) φ 2 (, w) w w 4.2.3 URL / URL URL ( ) w s AncfB(w, s) AncfW (w, s) s w Ancf B(w, s) = URL 10 (http://chasen-legacy. sourceforge.jp/) ipadic s w AncfW (w, s) = URL AncfB(w, s) 2 s URL w URL t t URL log w ( s ) Ancf B(w, s) AncfB(w, t) AncfW (w, s) 2 s URL w URL t t URL log ( ) AncfW (w, s) AncfW (w, t) w s 5 5.1 SVM SVM TinySVM (http://chasen.org/~taku/software/tinysvm/) 2 2 6 2.

5.2 SVM [12] 11 LBD p LBD n 6 6.1 1(a) CC SS / ( 3) / ( 4) /CC 408 SS 552 / CC 163 163 SS 90 90 10 6.2 / / LBD p LBD p LBD p LBD n LBD n 11 ( [19, 12, 20] ) LBD n 6.3 3 4 +DOM ( ) DOM ( ) +DOM ( ) DOM 3 4 (a-1) (a-2) 2 DOM ( ) URL DOM ( ) 3 4 (b-1) (b-2) 2 DOM ( ) URL URL DOM ( ) 3 (a-1) (b-1) (a-2) (b-2) DOM ( ) (a-2) DOM ( ) 4 DOM ( ) DOM ( ) 12 3 4 2 DOM ( ) +DOM ( ) 12 3 4 +DOM ( ) +DOM ( ) 4/5 +DOM ( ) +DOM ( )

(a-1) (CC ) (a-2) (CC ) (b-1) (SS ) (b-2) (SS ) 3: / (a-1) (CC ) (a-2) (CC ) (b-1) (SS ) (b-2) (SS ) 4: /

DOM ( ) 2 DOM ( ) DOM ( ) 7 HTML SVM HTML SVM DOM [21] HTML DOM DOM [1] T. Nanno, T. Fujiki, Y. Suzuki, and M. Okumura. Automatically collecting, monitoring, and mining Japanese weblogs. In WWW Alt. 04: Proc. 13th WWW Conf. Alternate Track Papers & Posters, pp. 320 321, 2004. [2] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pp. 39 47, 2005. [3] Wikipedia, Spam blog. http://en.wikipedia.org/wiki/spam blog. [4] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proc. 3rd Ann. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006. [5] C. Macdonald and I. Ounis. The TREC Blogs06 collection : Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006. [6] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007. [7] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pp. 1 8, 2007. [8].. Letters, Vol. 6, No. 4, pp. 37 40, 2008. [9].. Web (WebDB Forum)2008., 2008. [10] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pp. 92 99, 2006. [11] V. N. Vapnik. Statistical Learning Theory. Wiley- Interscience, 1998. [12] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. 17th ICML, pp. 999 1006, 2000. [13],,,,,,.. DEWS2008, 2008. [14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analysing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, pp. 33 40, 2008. [15],.., Vol. 8, No. 1, pp. 29 34, 2009. [16],,,,,.. DEIM, 2009. [17] T. Katayama, Y. Sato, T. Utsuro, T. Yoshinaka, Y. Kawada, and T. Fukuhara. An empirical study on selective sampling in active learning for splog detection. In Proc. 5th AIRWeb, pp. 29 36, April 2009. [18] Y.M. Wang, M. Ma, Y. Niu, and H. Chen. Spam double-funnel: Connecting web spammers with advertisers,. In Proc. 16th WWW, pp. 291 300, 2007. [19] D. D. Lewis and W. A. Gale. A sequential algorithm for training text classifiers. In Proc. 17th SIGIR, pp. 3 12, 1994. [20] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Proc. 17th ICML, pp. 839 846, 2000. [21] J. Suzuki, H. Isozaki, and E. Maeda. Convolution kernels with feature selection for natural language processing. In Proc. 42nd ACL, pp. 110 126, 2004.