Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs



Σχετικά έγγραφα
HTML. Evaluating Effects of Similarities of HTML Structures in Splog Detection

HTML Utilizing Similarities of HTML Structures in Splog Detection by Machine Learning

Automatic extraction of bibliography with machine learning

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices


Wiki. Wiki. Analysis of user activity of closed Wiki used by small groups

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

SocialDict. A reading support tool with prediction capability and its extension to readability measurement

2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems

Web 論 文. Performance Evaluation and Renewal of Department s Official Web Site. Akira TAKAHASHI and Kenji KAMIMURA

{takasu, Conditional Random Field

Maxima SCORM. Algebraic Manipulations and Visualizing Graphs in SCORM contents by Maxima and Mashup Approach. Jia Yunpeng, 1 Takayuki Nagai, 2, 1

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

ER-Tree (Extended R*-Tree)

Web. Web p OutDegree(p) log 7 1/OutDegree(p) A New Difinition of Subjective Distance between Web Pages

Resurvey of Possible Seismic Fissures in the Old-Edo River in Tokyo

ΣΔΥΝΟΛΟΓΗΚΟ ΔΚΠΑΗΓΔΤΣΗΚΟ ΗΓΡΤΜΑ ΗΟΝΗΧΝ ΝΖΧΝ «ΗΣΟΔΛΗΓΔ ΠΟΛΗΣΗΚΖ ΔΠΗΚΟΗΝΧΝΗΑ:ΜΔΛΔΣΖ ΚΑΣΑΚΔΤΖ ΔΡΓΑΛΔΗΟΤ ΑΞΗΟΛΟΓΖΖ» ΠΣΤΥΗΑΚΖ ΔΡΓΑΗΑ ΔΤΑΓΓΔΛΗΑ ΣΔΓΟΤ

Technical Research Report, Earthquake Research Institute, the University of Tokyo, No. +-, pp. 0 +3,,**1. No ,**1

Policy Coherence. JEL Classification : J12, J13, J21 Key words :

IMES DISCUSSION PAPER SERIES

[4] 1.2 [5] Bayesian Approach min-max min-max [6] UCB(Upper Confidence Bound ) UCT [7] [1] ( ) Amazons[8] Lines of Action(LOA)[4] Winands [4] 1

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

Schedulability Analysis Algorithm for Timing Constraint Workflow Models

Μηχανισμοί πρόβλεψης προσήμων σε προσημασμένα μοντέλα κοινωνικών δικτύων ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

ST5224: Advanced Statistical Theory II

ΦΥΛΛΟ ΕΡΓΑΣΙΑΣ Α. Διαβάστε τις ειδήσεις και εν συνεχεία σημειώστε. Οπτική γωνία είδησης 1:.

The Research on Sampling Estimation of Seasonal Index Based on Stratified Random Sampling

J. of Math. (PRC) 6 n (nt ) + n V = 0, (1.1) n t + div. div(n T ) = n τ (T L(x) T ), (1.2) n)xx (nt ) x + nv x = J 0, (1.4) n. 6 n

Κοινότητες μάθησης και πρακτικής σε επίπεδο Ευρωπαϊκής Ένωσης

Buried Markov Model Pairwise

ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ ΕΠΑΝΑΣΧΕΔΙΑΣΜΟΣ ΓΡΑΜΜΗΣ ΣΥΝΑΡΜΟΛΟΓΗΣΗΣ ΜΕ ΧΡΗΣΗ ΕΡΓΑΛΕΙΩΝ ΛΙΤΗΣ ΠΑΡΑΓΩΓΗΣ REDESIGNING AN ASSEMBLY LINE WITH LEAN PRODUCTION TOOLS

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΤΜΗΜΑ ΝΟΣΗΛΕΥΤΙΚΗΣ

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

Topic Structure Mining based on Wikipedia and Web Search

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΓΕΩΠΟΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΒΙΟΤΕΧΝΟΛΟΓΙΑΣ ΚΑΙ ΕΠΙΣΤΗΜΗΣ ΤΡΟΦΙΜΩΝ. Πτυχιακή εργασία

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

Modern Greek Extension

* ** *** *** Jun S HIMADA*, Kyoko O HSUMI**, Kazuhiko O HBA*** and Atsushi M ARUYAMA***

Ερευνητική+Ομάδα+Τεχνολογιών+ Διαδικτύου+

ΠΤΥΧΙΑΚΗ ΕΡΓΑΣΙΑ ΒΑΛΕΝΤΙΝΑ ΠΑΠΑΔΟΠΟΥΛΟΥ Α.Μ.: 09/061. Υπεύθυνος Καθηγητής: Σάββας Μακρίδης

The challenges of non-stable predicates

Homomorphism in Intuitionistic Fuzzy Automata

Twitter 6. DEIM Forum 2014 A Twitter,,, Wikipedia, Explicit Semantic Analysis,

Feasible Regions Defined by Stability Constraints Based on the Argument Principle


A summation formula ramified with hypergeometric function and involving recurrence relation

Μαθαίνοντας µε την Τεχνολογία των Πολυµέσων Υπόσχεση ή Πραγµατικότητα;

þÿÿ ÁÌ» Â Ä Å ¹µÅ Å½Ä ÃÄ

Development of a Tiltmeter with a XY Magnetic Detector (Part +)

Vol. 31,No JOURNAL OF CHINA UNIVERSITY OF SCIENCE AND TECHNOLOGY Feb

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΕΠΙΣΤΗΜΩΝ ΥΓΕΙΑΣ

No. 7 Modular Machine Tool & Automatic Manufacturing Technique. Jul TH166 TG659 A

Development of a Seismic Data Analysis System for a Short-term Training for Researchers from Developing Countries



Global energy use: Decoupling or convergence?

Evolution of Novel Studies on Thermofluid Dynamics with Combustion

Phys460.nb Solution for the t-dependent Schrodinger s equation How did we find the solution? (not required)

ΑΡΙΣΤΟΤΕΛΕΙΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΘΕΣΣΑΛΟΝΙΚΗΣ

ΠΕΡΙΕΧΟΜΕΝΑ. Μάρκετινγκ Αθλητικών Τουριστικών Προορισμών 1

Research on Economics and Management

Π Τ Υ Χ Ι Α Κ Η /ΔΙ Π Λ Ω Μ ΑΤ Ι Κ Η Ε Ρ ΓΑ Σ Ι Α

ΑΓΓΛΙΚΑ Ι. Ενότητα 7α: Impact of the Internet on Economic Education. Ζωή Κανταρίδου Τμήμα Εφαρμοσμένης Πληροφορικής

ΑΝΙΧΝΕΥΣΗ ΓΕΓΟΝΟΤΩΝ ΒΗΜΑΤΙΣΜΟΥ ΜΕ ΧΡΗΣΗ ΕΠΙΤΑΧΥΝΣΙΟΜΕΤΡΩΝ ΔΙΠΛΩΜΑΤΙΚΗ ΕΡΓΑΣΙΑ

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

Ηλεκτρονικός Βοηθός Οδηγού Μοτοσυκλέτας

ΜΕΛΕΤΗ ΤΗΣ ΗΛΕΚΤΡΟΝΙΚΗΣ ΣΥΝΤΑΓΟΓΡΑΦΗΣΗΣ ΚΑΙ Η ΔΙΕΡΕΥΝΗΣΗ ΤΗΣ ΕΦΑΡΜΟΓΗΣ ΤΗΣ ΣΤΗΝ ΕΛΛΑΔΑ: Ο.Α.Ε.Ε. ΠΕΡΙΦΕΡΕΙΑ ΠΕΛΟΠΟΝΝΗΣΟΥ ΚΑΣΚΑΦΕΤΟΥ ΣΩΤΗΡΙΑ

Quick algorithm f or computing core attribute

A research on the influence of dummy activity on float in an AOA network and its amendments

ΕΛΕΓΧΟΣ ΚΑΙ ΤΡΟΦΟΔΟΤΗΣΗ ΜΕΛΙΣΣΟΚΟΜΕΙΟΥ ΑΠΟ ΑΠΟΣΤΑΣΗ

1 h, , CaCl 2. pelamis) 58.1%, (Headspace solid -phase microextraction and gas chromatography -mass spectrometry,hs -SPME - Vol. 15 No.

Detection and Recognition of Traffic Signal Using Machine Learning

«ΑΓΡΟΤΟΥΡΙΣΜΟΣ ΚΑΙ ΤΟΠΙΚΗ ΑΝΑΠΤΥΞΗ: Ο ΡΟΛΟΣ ΤΩΝ ΝΕΩΝ ΤΕΧΝΟΛΟΓΙΩΝ ΣΤΗΝ ΠΡΟΩΘΗΣΗ ΤΩΝ ΓΥΝΑΙΚΕΙΩΝ ΣΥΝΕΤΑΙΡΙΣΜΩΝ»

Dynamic types, Lambda calculus machines Section and Practice Problems Apr 21 22, 2016

Simplex Crossover for Real-coded Genetic Algolithms

Splice site recognition between different organisms

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ. «Θεσμικό Πλαίσιο Φωτοβολταïκών Συστημάτων- Βέλτιστη Απόδοση Μέσω Τρόπων Στήριξης»

Τ.Ε.Ι. ΔΥΤΙΚΗΣ ΜΑΚΕΔΟΝΙΑΣ ΠΑΡΑΡΤΗΜΑ ΚΑΣΤΟΡΙΑΣ ΤΜΗΜΑ ΔΗΜΟΣΙΩΝ ΣΧΕΣΕΩΝ & ΕΠΙΚΟΙΝΩΝΙΑΣ

Elements of Information Theory

Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη HΥ463 - Συστήματα Ανάκτησης Πληροφοριών Information Retrieval (IR) Systems

1000 VDC 1250 VDC 125 VAC 250 VAC J K 125 VAC, 250 VAC

The Nottingham eprints service makes this work by researchers of the University of Nottingham available open access under the following conditions.

Queensland University of Technology Transport Data Analysis and Modeling Methodologies

ΜΕΛΕΤΗ ΑΠΟΤΙΜΗΣΗΣ ΠΑΡΑΓΟΜΕΝΗΣ ΕΝΕΡΓΕΙΑΣ Φ/Β ΣΥΣΤΗΜΑΤΩΝ ΕΓΚΑΤΕΣΤΗΜΕΝΩΝ ΣΕ ΕΛΛΑ Α ΚΑΙ ΤΣΕΧΙΑ

TAMIL NADU PUBLIC SERVICE COMMISSION REVISED SCHEMES

Prey-Taxis Holling-Tanner

( ) , ) , ; kg 1) 80 % kg. Vol. 28,No. 1 Jan.,2006 RESOURCES SCIENCE : (2006) ,2 ,,,, ; ;

CorV CVAC. CorV TU317. 1

Applying Markov Decision Processes to Role-playing Game

GF GF 3 1,2) KP PP KP Photo 1 GF PP GF PP 3) KP ULultra-light 2.KP 2.1KP KP Fig. 1 PET GF PP 4) 2.2KP KP GF 2 3 KP Olefin film Stampable sheet

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

Development of a basic motion analysis system using a sensor KINECT

Study of urban housing development projects: The general planning of Alexandria City

Comparison of Evapotranspiration between Indigenous Vegetation and Invading Vegetation in a Bog

High order interpolation function for surface contact problem

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

Παραδοτέο 6.2 Παρουσίαση PPT. RenoValue. Κίνητρα για αλλαγή: Ενισχύοντας το ρόλο των εκτιμητών ακίνητης περιουσίας στις μεταβολές της αγορά

Transcript:

DEIM Forum 2011 F2-2 HTML 305-8573 1-1-1 305-8573 1-1-1 101-8457 2-2 ( ) 141-0031 8-3-6 135-0064. 2-3-26 HTML ID HTML ID SVM HTML Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs Akihito MORIJIRI, Taichi KATAYAMA,SoichiISHII,TakehitoUTSURO, Yasuhide KAWADA, and Tomohiro FUKUHARA College of Engineering Systems, School of Science and Engineering, University of Tsukuba Graduate School of Systems and Information Engineering,University of Tsukuba Graduate School of Science and Technology for Future Life, Tokyo Denki University Navix Co., Ltd. Center for Service Research, National Institute of Advanced Industrial Science and Technology Abstract Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the number of in-links of target sites. It has been shown that splogs can be detected based on similarity of HTML structures and affiliate IDs. The similarity of HTML structures of splogs is effective in splog detection, and the identity of affiliate IDs extracted from splogs can identify spammers much more directly than similarity of HTML structures, although it is not easy to achieve high coverage in extracting affiliate IDs. The coverage of the intersection of the two clues, similarity of HTML structures and affiliate IDs, is relatively low, and it is necessary to apply them in a complementary strategy. This paper studies how to detect splogs which cannot be detected based on either similarity of HTML structures nor affiliate IDs. We apply SVMs to this task and show that splogs of above type can be detected with high precision. Key words spam blog detection, HTML structure, affiliate, machine learning 1.

Technorati BlogPulse [2] kizasi.jp Globe of Blogs Best Blogs in Asia Directory ( ) [3], [9], [10], [12] [10] 88% 75% [11] [9], [10], [12] [12] TREC Blog06 / [9], [10] BlogPulse [4], [8], [11], [13] HTML HTML ID [7] HTML 1 13 [5] ID ID 1 23 2 [6] 2 HTML ID ( 1 4) Support Vector Machines (SVMs) [16] 2. HTML 2. 1 HTML DOM [18] HTML DOM [7] 2 HTML s HTML HTML P DIV P DIV P DIV [18] BODY P DIV BODY HTML [18] SCRIPT STYLE HTML HTML s DOM dm(s) 2. 2 DOM HTML s t DOM dm(s) dm(t) DP DP 1 2 DP ( ) edit distance (dm(s),dm(t)) DOM dm(s) dm(s) s t DOM Rdiff(s, t) Rdiff(s, t) = edit distance (dm(s),dm(t)) dm(s) + d m(t) HTML s HTML T t T DOM Rdiff(s, t) MinDF(s, T ) MinDF(s, T ) = min Rdiff(s, t T ) t =s 3. ID 3 ID ID ID ID ID [5] ASP( ) ID 10 1 ID 1 Am At D Gl I Lk R St Tr V

1 KANSHIN [1] RSS Atom S F 2 11 S 48,183 14,352 ( 30%) F 60,977 6,231 ( 10%) ID ASP ID ID ID ID ID ID [5] ID ID [5] ID ASP10 ID 20,583 (S F ) 10 1 10 ID ID H ID SP af (H, x) x 3 S 72 1,101 F 56 953 ID ID 129 ID ID 2,472 ID ID 121 ID 2,054 ( ID 93.8% 83.3%) 2 121 2 ID ID ID ID ID

1 2 HTML DOM DOM ID x H ID x SP af (H,x) ID ID 1 4. SVM 4. 1 / URL / HTML URL URL i) HTML URL ii) HTML 2 URL URL u URL log u u u URL 4. 2 [14], [17] / 3 / w w φ 2. ID ID 3 (http://chasen-legacy.sourceforge.jp/) IPAdic (http://sourceforge.jp/projects/ipadic/)

φ 2 (, w)= w w freq(, w freq(, w )=a )=b freq(, freq(, w) =c w) =d (ad bc) 2 (a + b)(a + c)(b + d)(c + d) ( log φ 2 (, w) w w 4. 3 URL / URL URL ( ) w s AncfB(w, s) AncfW (w, s) s w AncfB(w, s) = URL s w AncfW (w, s) = URL AncfB(w, s) 2 s URL w URL t t URL ( ) log AncfB(w, s) AncfB(w, t) w s AncfW (w, s) 2 s URL w URL t t URL ( ) log AncfW (w, s) AncfW (w, t) w s 5. SVM TinySVM (http://chasen.org/~taku/software/tinysvm/) ) 2 2 6. 2. SVM [15] LBD s LBD ab 6. 6. 1 3. S 48,183 F 60,977 10 ID 1 S 1,101 F 953 ID ID ID ID ID S 72 ID F 56 ID ID ID 500 ( ) SVM ID ID S 13 ID F

(a) 4 (S ) (b) (a) 5 (F ) (b) 12 ID ID ID 500 ( ) SVM 500 4 6. 2 3. S 48,183 F 60,977 10 ID 10 ID HTML (MinDF > 0.15) S 47,029 F 59,982 1 4 S 4 ID 283 F 268 / 6. 3 LBD s LBD s LBD s S 4(a) F 5(a) LBD ab LBD ab LBD ab S 4(b) F 5(b) 4 5 ID ID ID ID

6. 4 4 5 ID ID ID 4(a) 5(a) S 35% F 60% 90% 4(b) 5(b) S 17% F 15% 90% 6. 2 1 4 HTML ID HTML [7] ID [5] 7. HTML ID HTML ID SVM [1],,,.. 21, 2007. [2] N. Glance, M. Hurst, and T. Tomokiyo. Blogpulse: Automated trend discovery for Weblogs. In Proc. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2004. [3] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proc. 1st AIRWeb, pp. 39 47, 2005. [4].. Letters, Vol. 6, No. 4, pp. 37 40, 2008. [5],,,. ID. Web (WebDB2010)., 2010. [6],,,,,. HTML. Web (WebDB2010)., November 2010. [7],,,,. HTML. 2 DEIM, 2010. [8] P. Kolari, T. Finin, and A. Joshi. SVMs for the Blogosphere: Blog identification and Splog detection. In Proc. 2006 AAAI Spring Symp. Computational Approaches to Analyzing Weblogs, pp. 92 99, 2006. [9] P. Kolari, T. Finin, and A. Joshi. Spam in blogs and social media. In Tutorial at ICWSM, 2007. [10] P. Kolari, A. Joshi, and T. Finin. Characterizing the splogosphere. In Proc. 3rd Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 2006. [11] Y.-R. Lin, H. Sundaram, Y. Chi, J. Tatemura, and B. L. Tseng. Splog detection using self-similarity analysis on blog temporal dynamics. In Proc. 3rd AIRWeb, pp. 1 8, 2007. [12] C. Macdonald and I. Ounis. The TREC Blogs06 collection : Creating and analysing a blog test collection. Technical Report TR-2006-224, University of Glasgow, Department of Computing Science, 2006. [13] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In Proc. 1st AIRWeb, 2005. [14] Y. Sato, T. Utsuro, T. Fukuhara, Y. Kawada, Y. Murakami, H. Nakagawa, and N. Kando. Analyzing features of Japanese splogs and characteristics of keywords. In Proc. 4th AIRWeb, pp. 33 40, 2008. [15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In Proc. 17th ICML, pp. 999 1006, 2000. [16] V. N. Vapnik. Statistical Learning Theory. Wiley- Interscience, 1998. [17] Y.M. Wang, M. Ma, Y. Niu, and H. Chen. Spam doublefunnel: Connecting web spammers with advertisers,. In Proc. 16th WWW, pp. 291 300, 2007. [18],.., Vol. 8, No. 1, pp. 29 34, 2009.