HTML. Evaluating Effects of Similarities of HTML Structures in Splog Detection

Σχετικά έγγραφα
HTML Utilizing Similarities of HTML Structures in Splog Detection by Machine Learning

Analyzing Similarity of HTML Structures and Affiliate in Automatic Collection of Splogs

Automatic extraction of bibliography with machine learning

{takasu, Conditional Random Field



2016 IEEE/ACM International Conference on Mobile Software Engineering and Systems

Web. Web p OutDegree(p) log 7 1/OutDegree(p) A New Difinition of Subjective Distance between Web Pages

SocialDict. A reading support tool with prediction capability and its extension to readability measurement

ΣΔΥΝΟΛΟΓΗΚΟ ΔΚΠΑΗΓΔΤΣΗΚΟ ΗΓΡΤΜΑ ΗΟΝΗΧΝ ΝΖΧΝ «ΗΣΟΔΛΗΓΔ ΠΟΛΗΣΗΚΖ ΔΠΗΚΟΗΝΧΝΗΑ:ΜΔΛΔΣΖ ΚΑΣΑΚΔΤΖ ΔΡΓΑΛΔΗΟΤ ΑΞΗΟΛΟΓΖΖ» ΠΣΤΥΗΑΚΖ ΔΡΓΑΗΑ ΔΤΑΓΓΔΛΗΑ ΣΔΓΟΤ

Buried Markov Model Pairwise

Maxima SCORM. Algebraic Manipulations and Visualizing Graphs in SCORM contents by Maxima and Mashup Approach. Jia Yunpeng, 1 Takayuki Nagai, 2, 1

Wiki. Wiki. Analysis of user activity of closed Wiki used by small groups

ER-Tree (Extended R*-Tree)

Twitter 6. DEIM Forum 2014 A Twitter,,, Wikipedia, Explicit Semantic Analysis,

1 n-gram n-gram n-gram [11], [15] n-best [16] n-gram. n-gram. 1,a) Graham Neubig 1,b) Sakriani Sakti 1,c) 1,d) 1,e)

Anomaly Detection with Neighborhood Preservation Principle

Ερευνητική+Ομάδα+Τεχνολογιών+ Διαδικτύου+

IMES DISCUSSION PAPER SERIES

Study of In-vehicle Sound Field Creation by Simultaneous Equation Method

þÿÿ ÁÌ» Â Ä Å ¹µÅ Å½Ä ÃÄ

Applying Markov Decision Processes to Role-playing Game

MIDI [8] MIDI. [9] Hsu [1], [2] [10] Salamon [11] [5] Song [6] Sony, Minato, Tokyo , Japan a) b)

User Behavior Analysis for a Large2scale Search Engine

Topic Structure Mining based on Wikipedia and Web Search

: Monte Carlo EM 313, Louis (1982) EM, EM Newton-Raphson, /. EM, 2 Monte Carlo EM Newton-Raphson, Monte Carlo EM, Monte Carlo EM, /. 3, Monte Carlo EM

IPSJ SIG Technical Report Vol.2014-CE-127 No /12/6 CS Activity 1,a) CS Computer Science Activity Activity Actvity Activity Dining Eight-He

Web 論 文. Performance Evaluation and Renewal of Department s Official Web Site. Akira TAKAHASHI and Kenji KAMIMURA

Stabilization of stock price prediction by cross entropy optimization


[1] DNA ATM [2] c 2013 Information Processing Society of Japan. Gait motion descriptors. Osaka University 2. Drexel University a)

Αντώνης Βεντούρης. Επίκουρος Καθηγητής Διδακτικής των Γλωσσών Τμήμα Ιταλικής Γλώσσας και Φιλολογίας Αριστοτέλειο Πανεπιστήμιο Θεσσαλονίκης

Detection and Recognition of Traffic Signal Using Machine Learning

Nov Journal of Zhengzhou University Engineering Science Vol. 36 No FCM. A doi /j. issn

3: A convolution-pooling layer in PS-CNN 1: Partially Shared Deep Neural Network 2.2 Partially Shared Convolutional Neural Network 2: A hidden layer o

VBA Microsoft Excel. J. Comput. Chem. Jpn., Vol. 5, No. 1, pp (2006)

Web DEIM Forum 2009 A7-1. Web. Web. Web. Web. 4 Wikipedia. Wikipedia. Web.

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΤΜΗΜΑ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΤΕΧΝΟΛΟΓΙΑΣ ΥΠΟΛΟΓΙΣΤΩΝ ΤΟΜΕΑΣ ΣΥΣΤΗΜΑΤΩΝ ΗΛΕΚΤΡΙΚΗΣ ΕΝΕΡΓΕΙΑΣ

Development of a Seismic Data Analysis System for a Short-term Training for Researchers from Developing Countries

CorV CVAC. CorV TU317. 1

An Automatic Modulation Classifier using a Frequency Discriminator for Intelligent Software Defined Radio

J. of Math. (PRC) 6 n (nt ) + n V = 0, (1.1) n t + div. div(n T ) = n τ (T L(x) T ), (1.2) n)xx (nt ) x + nv x = J 0, (1.4) n. 6 n

HIV HIV HIV HIV AIDS 3 :.1 /-,**1 +332

þÿ Ç»¹º ³µÃ ± : Ãż²» Ä Â

ΔΙΠΛΩΜΑΤΙΚΕΣ ΕΡΓΑΣΙΕΣ

Retrieval of Seismic Data Recorded on Open-reel-type Magnetic Tapes (MT) by Using Existing Devices

ΔΙΑΧΕΙΡΙΣΗ ΠΕΡΙΕΧΟΜΕΝΟΥ ΠΑΓΚΟΣΜΙΟΥ ΙΣΤΟΥ ΚΑΙ ΓΛΩΣΣΙΚΑ ΕΡΓΑΛΕΙΑ. Data Mining - Classification

* ** *** *** Jun S HIMADA*, Kyoko O HSUMI**, Kazuhiko O HBA*** and Atsushi M ARUYAMA***

Ψηφιακή ανάπτυξη. Course Unit #1 : Κατανοώντας τις βασικές σύγχρονες ψηφιακές αρχές Thematic Unit #1 : Τεχνολογίες Web και CMS

Probabilistic Approach to Robust Optimization

A Method for Creating Shortcut Links by Considering Popularity of Contents in Structured P2P Networks

, -.

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΠΟΛΙΤΙΚΩΝ ΜΗΧΑΝΙΚΩΝ. «Θεσμικό Πλαίσιο Φωτοβολταïκών Συστημάτων- Βέλτιστη Απόδοση Μέσω Τρόπων Στήριξης»

Πτυχιακή Εργασι α «Εκτι μήσή τής ποιο τήτας εικο νων με τήν χρή σή τεχνήτων νευρωνικων δικτυ ων»

Χρήση οντολογιών στη χαρτογράφηση γνώσης: Μελέτη περίπτωσης σε μία ακαδημαϊκή βιβλιοθήκη

Vol. 31,No JOURNAL OF CHINA UNIVERSITY OF SCIENCE AND TECHNOLOGY Feb

[15], [16], [17] [6] [2] [5] Jiang [6] 2.1 [6], [10] Score(x, y) y ( 1) ( 1 ) b e ( 1 ) b e. O(n 2 ) Jiang [6] (word lattice reranking)

ES440/ES911: CFD. Chapter 5. Solution of Linear Equation Systems

ΚΑΤΑΣΚΕΥΑΣΤΙΚΟΣ ΤΟΜΕΑΣ

(Υπογραϕή) (Υπογραϕή) (Υπογραϕή)

GPGPU. Grover. On Large Scale Simulation of Grover s Algorithm by Using GPGPU

Schedulability Analysis Algorithm for Timing Constraint Workflow Models

Τοποθέτηση τοπωνυµίων και άλλων στοιχείων ονοµατολογίας στους χάρτες

ΖΩΝΟΠΟΙΗΣΗ ΤΗΣ ΚΑΤΟΛΙΣΘΗΤΙΚΗΣ ΕΠΙΚΙΝΔΥΝΟΤΗΤΑΣ ΣΤΟ ΟΡΟΣ ΠΗΛΙΟ ΜΕ ΤΗ ΣΥΜΒΟΛΗ ΔΕΔΟΜΕΝΩΝ ΣΥΜΒΟΛΟΜΕΤΡΙΑΣ ΜΟΝΙΜΩΝ ΣΚΕΔΑΣΤΩΝ

Faruqui [7] WordNet [15] FrameNet [2] PPDB [8]

Molecular evolutionary dynamics of respiratory syncytial virus group A in

Τεχνολογικά υποβοηθούμενη μάθηση: Εργαλεία και τεχνολογίες

Random Forests Leo. Hitoshi Habe 1

Simplex Crossover for Real-coded Genetic Algolithms

Πανεπιστήμιο Κρήτης, Τμήμα Επιστήμης Υπολογιστών Άνοιξη HΥ463 - Συστήματα Ανάκτησης Πληροφοριών Information Retrieval (IR) Systems

Quick algorithm f or computing core attribute

Resurvey of Possible Seismic Fissures in the Old-Edo River in Tokyo

SVM. Research on ERPs feature extraction and classification

Evolution of Novel Studies on Thermofluid Dynamics with Combustion

A research on the influence of dummy activity on float in an AOA network and its amendments

EM Baum-Welch. Step by Step the Baum-Welch Algorithm and its Application 2. HMM Baum-Welch. Baum-Welch. Baum-Welch Baum-Welch.

[4] 1.2 [5] Bayesian Approach min-max min-max [6] UCB(Upper Confidence Bound ) UCT [7] [1] ( ) Amazons[8] Lines of Action(LOA)[4] Winands [4] 1

ΠΑΝΕΠΙΣΤΗΜΙΟ ΠΑΤΡΩΝ ΠΟΛΥΤΕΧΝΙΚΗ ΣΧΟΛΗ ΤΜΗΜΑ ΜΗΧΑΝΙΚΩΝ Η/Υ & ΠΛΗΡΟΦΟΡΙΚΗΣ. του Γεράσιμου Τουλιάτου ΑΜ: 697

Research on Economics and Management

ΑΓΓΛΙΚΑ Ι. Ενότητα 7α: Impact of the Internet on Economic Education. Ζωή Κανταρίδου Τμήμα Εφαρμοσμένης Πληροφορικής

The Algorithm to Extract Characteristic Chord Progression Extended the Sequential Pattern Mining

Bayesian Discriminant Feature Selection

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ

«ΑΓΡΟΤΟΥΡΙΣΜΟΣ ΚΑΙ ΤΟΠΙΚΗ ΑΝΑΠΤΥΞΗ: Ο ΡΟΛΟΣ ΤΩΝ ΝΕΩΝ ΤΕΧΝΟΛΟΓΙΩΝ ΣΤΗΝ ΠΡΟΩΘΗΣΗ ΤΩΝ ΓΥΝΑΙΚΕΙΩΝ ΣΥΝΕΤΑΙΡΙΣΜΩΝ»

Splice site recognition between different organisms

Indexing Methods for Encrypted Vector Databases

Re-Pair n. Re-Pair. Re-Pair. Re-Pair. Re-Pair. (Re-Merge) Re-Merge. Sekine [4, 5, 8] (highly repetitive text) [2] Re-Pair. Blocked-Repair-VF [7]

ΤΕΧΝΟΛΟΓΙΚΟ ΠΑΝΕΠΙΣΤΗΜΙΟ ΚΥΠΡΟΥ ΣΧΟΛΗ ΓΕΩΠΟΝΙΚΩΝ ΕΠΙΣΤΗΜΩΝ ΒΙΟΤΕΧΝΟΛΟΓΙΑΣ ΚΑΙ ΕΠΙΣΤΗΜΗΣ ΤΡΟΦΙΜΩΝ. Πτυχιακή εργασία

ΕΘΝΙΚΟ ΜΕΤΣΟΒΙΟ ΠΟΛΥΤΕΧΝΕΙΟ ΣΧΟΛΗ ΗΛΕΚΤΡΟΛΟΓΩΝ ΜΗΧΑΝΙΚΩΝ ΚΑΙ ΜΗΧΑΝΙΚΩΝ ΥΠΟΛΟΓΙΣΤΩΝ

Automatic Domain2Specific Term Extraction and Its Application in Text Cla ssification

HOSVD. Higher Order Data Classification Method with Autocorrelation Matrix Correcting on HOSVD. Junichi MORIGAKI and Kaoru KATAYAMA

Toward a SPARQL Query Execution Mechanism using Dynamic Mapping Adaptation -A Preliminary Report- Takuya Adachi 1 Naoki Fukuta 2.

How to register an account with the Hellenic Community of Sheffield.

Prey-Taxis Holling-Tanner

PUBLICATION. Participation of POLYTECH in the 10th Pan-Hellenic Conference on Informatics. April 15, Nafplio

GPU. CUDA GPU GeForce GTX 580 GPU 2.67GHz Intel Core 2 Duo CPU E7300 CUDA. Parallelizing the Number Partitioning Problem for GPUs

ΦΥΛΛΟ ΕΡΓΑΣΙΑΣ Α. Διαβάστε τις ειδήσεις και εν συνεχεία σημειώστε. Οπτική γωνία είδησης 1:.

Development of a basic motion analysis system using a sensor KINECT

ELIXIR-GR / BiP! Finder

ΠΑΝΔΠΗΣΖΜΗΟ ΠΑΣΡΩΝ ΣΜΖΜΑ ΖΛΔΚΣΡΟΛΟΓΩΝ ΜΖΥΑΝΗΚΩΝ ΚΑΗ ΣΔΥΝΟΛΟΓΗΑ ΤΠΟΛΟΓΗΣΩΝ ΣΟΜΔΑ ΤΣΖΜΑΣΩΝ ΖΛΔΚΣΡΗΚΖ ΔΝΔΡΓΔΗΑ

Transcript:

HTML 1 2 1 3 4 ( ) HTML HTML DOM SVM Evaluating Effects of Similarities of HTML Structures in Splog Detection Taichi Katayama, 1 Takayuki Yoshinaka, 2 Takehito Utsuro, 1 Yasuhide Kawada 3 and Tomohiro Fukuhara 4 Spam blogs or splogs are blogs hosting spam posts, created using machine generated or hijacked content for the sole purpose of hosting advertisements or raising the number of inward of target sites. Among those splogs, this paper focuses on detecting a group of splogs which are estimated to be created by an identical spammer. We especially show that similarities of html structures among those splogs created by an identical spammer contribute to improving the performance of splog detection. In measuring similarities of html structures, we extract a list of blocks (minimum unit of content) from the DOM tree of a html file. We show that the html files of splogs estimated to be created by an identical spammer tend to have similar DOM trees and this tendency is quite effective in splog detection. 1. Technorati BlogPulse 5) kizasi.jp blog- Watcher 16) Globe of Blogs Best Blogs in Asia Directory Blogwise ( ) 1),6),11),12),14) 12) 88% 75% 2) 1) 13) 11),12),14) 14) TREC Blog06 / 11) 12) BlogPulse 7) 8) 10) 12) 13) 15) HTML 1 Graduate School of Systems and Information Engineering, University of Tsukuba 2 Graduate School of Engineering, Tokyo Denki University 3 ( ) Navix Co., Ltd. 4 Research into Artifacts, Center for Engineering, University of Tokyo 1 c 2009 Information Processing Society of Japan

1 / (a) / CC SS 198 293 277 768 210 259 2849 3318 408 552 3126 4086 (b) CC SS ID 1 2 3 4 163 26 31 33 HTML HTML HTML DOM DOM DOM DOM Support Vector Machines 21) (SVMs) HTML SVM 2. / 2007 9 2008 2 / / 17) / / 17) ID ID 1 CC ID=1 SS ID= 2, 3, 4 17) 2 HTML 1 3. HTML HTML DOM DOM 3.1 HTML DOM 23) HTML DOM 1 2. 1 23) 3) 4) HTML 23) HTML 2 HTML 20) 20) HTML HTML 2 c 2009 Information Processing Society of Japan

1 HTML s HTML HTML P DIV P DIV P DIV 23) BODY P DIV BODY HTML 23) SCRIPT STYLE HTML HTML s DOM dm(s) 3.2 DOM HTML s t DOM dm(s) dm(t) DP DP 1 2 DP ( ) edit distance (dm(s),dm(t)) DOM dm(s) dm(s) s t DOM Rdiff(s, t) edit distance (dm(s),dm(t)) Rdiff(s, t) = dm(s) + dm(t) 1 HTML DOM 3.3 DOM HTML S T HTML s S t T DOM DOM HTML s S HTML T t T Rdiff(s, t) 10 AvMinDF 10(s, T ) 1 AvMinDF 10(s, T ) = T Rdiff(s, t T ) 10 t Rdiff(s, t) (ID=1, CC ) (ID=2, 3, 4, SS ) DOM 2 ID ( ID=1) S T S s AvMinDF 10(s, T ) (ID=1, CC ) ID=1 S CC T S s AvMinDF 10(s, T ) SC SS 2(a) CC S T S s AvMinDF 10(s, T ) 2(b) (c) (d) SS ( 2(b) (c) (d) ) 2(a),(c),(d) 2(a) AvMinDF 10(s, T ) 0 0.15 2(b) ID=2, AvMinDF 10(s, T ) ID HTML 1 AvMinDF k (s, T ) k =1,...25 k =10 3 c 2009 Information Processing Society of Japan

(a) (ID = 1 CC ) (b) (ID = 2 SS ) (c) (ID = 3 SS ) (d) (ID = 4 SS ) 2 DOM DOM 4. SVM 4.1 DOM 3 HTML DOM s 3.3 AvMinDF 10(s, T ) log AvMinDF 10(s, T ) 1. T (1) T DOM ( ) (2) T DOM 1 ( ) 4.2 9) 4.2.1 / URL / HTML URL URL i) HTML URL ii) HTML 2 URL URL u URL ( ) ( ) u log u u URL 4.2.2 17) 22) 4 c 2009 Information Processing Society of Japan

/ 1 / w w φ 2. w w freq(, w )=a freq(, w )=b freq(, w) =c freq(, w) =d φ 2 (, w)= (ad bc) 2 (a + b)(a + c)(b + d)(c + d) ( ) log φ 2 (, w) w w 4.2.3 URL / URL URL ( ) w s AncfB(w, s) AncfW (w, s) s w AncfB(w, s) = URL s w AncfW (w, s) = URL AncfB(w, s) 2 URL w URL 1 (http://chasen-legacy.sourceforge.jp/) ipadic s t t URL ( ) log AncfB(w, s) AncfB(w, t) w s AncfW (w, s) 2 URL w URL t t URL ( ) log AncfW (w, s) AncfW (w, t) w s 5. 5.1 SVM SVM TinySVM (http://chasen.org/~taku/ software/tinysvm/) 2 2 6 2. 5.2 SVM 19) 2 LBD s LBD ab 2 s 5 c 2009 Information Processing Society of Japan

(a-1) (CC ) (a-2) (CC ) CC (b-1) (SS ) (b-2) (SS ) SS 3 / (a-1) (CC ) (a-2) (CC ) (b-1) (SS ) (b-2) (SS ) CC SS 4 / 6 c 2009 Information Processing Society of Japan

6. 6.1 1(a) CC SS / ( 3) / ( 4) / CC 408 SS 552 / CC 163 163 SS 90 90 10 6.2 / / LBD p LBD p LBD p LBD n LBD n LBD n 6.3 3 4 +DOM ( ) DOM ( ) +DOM ( ) DOM 3 4 (a-1) (a-2) 2 DOM ( ) URL DOM ( ) 3 4 (b-1) (b-2) 2 DOM ( ) URL URL DOM ( ) 3 (a-2) CC DOM ( ) DOM ( ) (a-1) (b-1) (b-2) DOM ( ) DOM ( ) HTML / 3 DOM ( ) DOM ( ) 4 (a-2) SS DOM ( ) (a-1) (b-1) (b-2) DOM ( ) DOM ( ) HTML 4 DOM ( ) DOM ( ) 3 4 DOM ( ) 2 DOM ( ) DOM ( ) 2 7 c 2009 Information Processing Society of Japan

7. HTML SVM HTML SVM DOM DOM 24) 18) HTML DOM DOM 1) : Wikipedia, Spam blog. http://en.wikipedia.org/wiki/spam blog. 2) : Wikipedia, Ping (blogging). http://en.wikipedia.org/wiki/ping (blogging). 3) Bing, L., Wang, Y., Zhang, Y. and Wang, H.: Primary Content Extraction with Mountain Model, Proc. 8th IEEE CIT, pp.479 484 (2008). 4) Debnath, S., Mitra, P., Pal, N. and Giles, C.L.: Automatic Identification of Informative Sections of Web Pages, IEEE Transactions on Knowledge and Data Engineering, Vol.17, No.9, pp.1233 1246 (2005). 5) Glance, N., Hurst, M. and Tomokiyo, T.: BlogPulse: Automated Trend Discovery for Weblogs, Proc. WWW 2004 Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2004). 6) Gyöngyi, Z. and Garcia-Molina, H.: Web Spam Taxonomy, Proc. 1st AIRWeb, pp. 39 47 (2005). 7) Letters Vol.6, No.4, pp.37 40 (2008). 8) Web (WebDB Forum)2008 (2008). 9) Katayama, T., Sato, Y., Utsuro, T., Yoshinaka, T., Kawada, Y. and Fukuhara, T.: An Empirical Study on Selective Sampling in Active Learning for Splog Detection, Proc. 5th AIRWeb, pp.29 36 (2009). 10) Kolari, P., Finin, T. and Joshi, A.: SVMs for the Blogosphere: Blog Identification and Splog Detection, Proceedings of the 2006 AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs, pp.92 99 (2006). 11) Kolari, P., Finin, T. and Joshi, A.: Spam in Blogs and Social Media, Tutorial at ICWSM (2007). 12) Kolari, P., Joshi, A. and Finin, T.: Characterizing the Splogosphere, Proc. 3rd Ann. Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics (2006). 13) Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J. and Tseng, B.L.: Splog Detection using Self-similarity Analysis on Blog Temporal Dynamics, Proc. 3rd AIRWeb, pp. 1 8 (2007). 14) Macdonald, C. and Ounis, I.: The TREC Blogs06 Collection : Creating and Analysing a Blog Test Collection, Technical Report TR-2006-224, University of Glasgow, Department of Computing Science (2006). 15) Mishne, G., Carmel, D. and Lempel, R.: Blocking Blog Spam with Language Model Disagreement, Proc. 1st AIRWeb (2005). 16) Nanno, T., Fujiki, T., Suzuki, Y. and Okumura, M.: Automatically Collecting, Monitoring, and Mining Japanese Weblogs, WWW Alt. 04: Proc. 13th WWW Conf. Alternate Track Papers & Posters, pp.320 321 (2004). 17) Sato, Y., Utsuro, T., Fukuhara, T., Kawada, Y., Murakami, Y., Nakagawa, H. and Kando, N.: Analysing Features of Japanese Splogs and Characteristics of Keywords, Proc. 4th AIRWeb, pp.33 40 (2008). 18) Suzuki, J., Isozaki, H. and Maeda, E.: Convolution Kernels with Feature Selection for Natural Language Processing, Proc. 42nd ACL, pp.110 126 (2004). 19) Tong, S. and Koller, D.: Support Vector Machine Active Learning with Applications to Text Classification, Proc. 17th ICML, pp.999 1006 (2000). 20) Urvoy, T., Lavergne, T. and Filoche, P.: Tracking Web Spam with Hidden Style Similarity, Proc. 2nd AIRWeb, pp.25 30 (2006). 21) Vapnik, V.N.: Statistical Learning Theory, Wiley-Interscience (1998). 22) Wang, Y., Ma, M., Niu, Y. and Chen, H.: Spam Double-Funnel: Connecting Web Spammers with Advertisers,, Proc. 16th WWW, pp.291 300 (2007). 23) Vol.8, No.1, pp.29 34 (2009). 24) HTML (2009). 8 c 2009 Information Processing Society of Japan