J. Wuhan Univ. (Nat. Sci. Ed.), Vol. 61, No. 2, Apr. 2015, pp. 124-130
DOI: 10.14188/j.1671-8836.2015.02.004
CLC number: TP 391    Document code: A    Article ID: 1671-8836(2015)02-0124-07

Stock Research Reports Classification Based on Sentiment Analysis

PENG Min 1,2, WANG Qing 1, HUANG Jimin 1, ZHOU Li 1, HU Xinhui 1
(1. School of Computer, Wuhan University, Wuhan 430072, Hubei, China; 2. Shenzhen Institute of Wuhan University, Shenzhen 518057, Guangdong, China)

Abstract: Stock research reports are an important source of professional investment advice in the stock market. Building on web information extraction and retrieval, automatic analysis of the investment advice in massive numbers of stock reports can significantly influence investors' behavior. In this paper, we propose a classification strategy for stock research reports based on sentiment analysis methods. First, we extract integrated features from the stock research reports. Second, we perform feature selection with an improved CHI statistic and classify the reports with SVM and Naive Bayes classifiers. Finally, we evaluate the classification results with respect to feature weight, feature dimension, and sample number. Experiments on 14 000 research reports collected from www.eastmoney.com show that the strategy of integrated feature selection, dimension reduction, and training-set resampling achieves higher performance.

Key words: sentiment analysis; feature selection; support vector machine (SVM); naive Bayes; imbalanced data; stock research report

Received: 2014-07-08
Foundation items: 61472291, 61303115
E-mail: pengm@whu.edu.cn

0 Introduction

[…] Schumaker et al. [1] […] O'Hare et al. [2] […] [3] […] Naive Bayes, k-Nearest Neighbor, and Support Vector Machine (SVM) […] Pang et al. [5] […] SVM […] [6] […] SVM and Naive Bayes […] [7] […]
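[…] vector space model (VSM) […] uni-gram […] [8] […] DF, IG, MI, and CHI […] Qiu et al. [9] […] CHI […]

2 […]

2.1 […]

[…] 12 000, 1 400, and 600 […]

$$TF(b_k) = \sum_{i=1}^{m} tf(b_k, d_i) \tag{1}$$

where $tf(b_k, d_i)$ is the frequency of $b_k$ in document $d_i$, and $TF(b_k)$ is its total frequency over the $m$ documents. […] [10] […] 6 119, 3 874, 4 510, and 97 125 […] 6 000, 3 000, and 1 500 […] $TF(b_k)$ […]

2.2 […]

[…] n-gram […]

2.3 […] CHI […]

The CHI statistic $\chi^2$ between a term $t_k$ and a class $c_i$ […] uni-gram […] is computed as

$$\chi^2(t_k, c_i) = \frac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)} \tag{2}$$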
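where $A$ is the number of documents in class $c_i$ that contain $t_k$, $B$ is the number of documents outside $c_i$ that contain $t_k$, $C$ is the number of documents in $c_i$ that do not contain $t_k$, and $D$ is the number of documents outside $c_i$ that do not contain $t_k$. […] When $t_k$ and $c_i$ are independent, $\chi^2(t_k, c_i) = 0$. […]

[…] (1) $AD - BC$ […] CHI […] [11] […] (2) […] [12] […] [13] […] CHI […] FI, CI, and DI […]:

$$\chi^2(t_k, c_i) = \begin{cases} \dfrac{N(AD - BC)^2}{(A + C)(B + D)(A + B)(C + D)}\,(FI + CI + DI), & AD - BC > 0 \\ 0, & AD - BC \le 0 \end{cases} \tag{3}$$

$$FI = \frac{TF(t_k, c_i)}{A + C}, \qquad CI = \frac{A}{A + B}, \qquad DI = \frac{A}{A + C}$$

where $TF(t_k, c_i)$ is the frequency of $t_k$ in class $c_i$. FI […] $t_k$ […]; CI […] $t_k$ […]; DI […] $t_k$ […].

$$S(t_k) = \max_{1 \le i \le m} \{\chi^2(t_k, c_i)\} \tag{4}$$

2.4 […]

A document $d_i$ in the collection $D$ is represented as

$$d_i = (w_{1i}, w_{2i}, \ldots, w_{ki}) \tag{5}$$

where $w_{ki}$ is the weight of the $k$-th feature in $d_i$. […] TF […] TF-IDF […] $IDF(t_k)$ […] $t_k$ […]:

$$w_{ki} = TF(t_{ki}) \times IDF(t_k) \tag{6}$$

where $IDF(t_k)$ is a lg-based function of $N$ and $n_{t_k}$ with additive 0.5 terms […], $N$ is the total number of documents, and $n_{t_k}$ is the number of documents containing $t_k$. […] [14] […] TF-IDF […] IDF […] TF-CHI […]:

$$w_{ki} = TF(t_{ki}) \times S(t_k) = TF(t_{ki}) \times \max_{1 \le i \le m} \{\chi^2(t_k, c_i)\} \tag{7}$$

[…] TF, TF-IDF […] TF-CHI […]

2.5 […]

[…] SVM and Naive Bayes […]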
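The feature-selection and weighting formulas of this section can be illustrated with a minimal Python sketch. It assumes that $A$, $B$, $C$, $D$ are the usual 2x2 document counts (term present or absent, inside or outside class $c_i$), that $N = A + B + C + D$, and that the improved statistic multiplies the standard CHI value by (FI + CI + DI) when $AD - BC > 0$. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
def chi_square(A, B, C, D):
    """Standard CHI statistic from the 2x2 document-count table."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return N * (A * D - B * C) ** 2 / denom


def improved_chi(A, B, C, D, tf_in_class):
    """Improved CHI (assumed form): zero for non-positive correlation,
    otherwise the standard value scaled by FI + CI + DI."""
    if A * D - B * C <= 0:
        return 0.0
    FI = tf_in_class / (A + C)  # term frequency per document of the class
    CI = A / (A + B)            # share of term-bearing documents in the class
    DI = A / (A + C)            # share of class documents bearing the term
    return chi_square(A, B, C, D) * (FI + CI + DI)


def tf_chi_weight(tf_in_doc, chi_by_class):
    """TF-CHI weight: term frequency times the maximum per-class CHI score."""
    return tf_in_doc * max(chi_by_class)


if __name__ == "__main__":
    # Hypothetical counts: the term occurs in 40 of 50 class documents
    # and in 10 of 50 documents outside the class.
    print(chi_square(40, 10, 10, 40))
    print(improved_chi(40, 10, 10, 40, tf_in_class=100))
    print(tf_chi_weight(3, [0.0, 36.0, 1.5]))
```

Setting the score to zero when $AD - BC \le 0$ discards terms that are negatively correlated with the class, which the plain statistic cannot distinguish from positively correlated ones because the difference is squared.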
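[…] Naive Bayes […]

2.5.1 SVM

[…] Given training samples $(x_i, y_i)$, $i = 1, \ldots, n$, with real-valued feature vectors $x_i$ and labels $y_i \in \{1, -1\}$, SVM solves

$$\min \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \zeta_i \right\} \tag{8}$$

$$\text{subject to } y_i[w \cdot x_i + b] \ge 1 - \zeta_i$$

where $C$ is the penalty parameter and $\zeta_i$ are the slack variables. […] SVM […] one-against-one and one-against-all […] For $k$ classes, one-against-one trains $k(k-1)/2$ binary SVMs […], whereas one-against-all trains $k$ binary SVMs [15]. […] one-against-all […] one-against-one […]

2.5.2 Naive Bayes

For a class $c_i$ and a document $X = (x_1, x_2, \ldots, x_n)$, […]

$$P(c_i \mid X) = \frac{P(X \mid c_i)\,P(c_i)}{P(X)} \tag{9}$$

[…] $P(c_i \mid X)$ […] $P(X \mid c_i)$ […] $P(X)$ […]

$$P(X \mid c_i) = \prod_{k=1}^{n} p(x_k \mid c_i) \tag{10}$$

where $p(x_k \mid c_i)$ is the probability of $x_k$ in class $c_i$. […]

3 […]

3.1 […]

[…] 14 000 […] 600, 1 400, and 12 000 […] 80% […] 20% […] ICTCLAS […] P (Precision), R (Recall), F1, and Macro F1 […]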
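The evaluation measures named above can be made concrete with a short Python sketch. It uses the standard definitions: per-class precision and recall, F1 as their harmonic mean, and Macro F1 as the unweighted mean of the per-class F1 values. The class labels `buy`, `hold`, and `sell` and the toy predictions are hypothetical placeholders, not the paper's categories or data.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts of one class."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for c in labels:
        tp = sum(1 for g, q in zip(gold, pred) if g == c and q == c)
        fp = sum(1 for g, q in zip(gold, pred) if g != c and q == c)
        fn = sum(1 for g, q in zip(gold, pred) if g == c and q != c)
        scores.append(precision_recall_f1(tp, fp, fn)[2])
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["buy", "buy", "hold", "sell"]   # hypothetical reference labels
    pred = ["buy", "hold", "hold", "sell"]  # hypothetical classifier output
    print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to Macro F1, a classifier that neglects the small classes of an imbalanced collection is penalized even when its overall accuracy is high.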
[…] TF-IDF […] TF […] 600, 1 400, and 1 500 […] 40% […] TF-IDF […] 1 400 […] TF […] SVM […]

Table 3 […] (%)
        P       R       F1
[…]     72.8    78.0    75.3
[…]     71.2    82.2    76.3

Table 6 […] Macro F1 […]
              10%      20%      40%      60%
SVM           0.579    0.633    0.658    0.668
Naive Bayes   0.476    0.501    0.535    0.587

[…] 11.6% […] F1 […] Naive Bayes […] SVM […] Naive Bayes […] Pang et al. [5] […]

Table 4 […] F1 […] (600, 1 400, and 1 500 […]; TF; SVM)
        10%      20%      40%      50%      60%
[…]     0.521    0.608    0.597    0.606    0.603
[…]     0.462    0.52     0.613    0.612    0.629
[…]     0.755    0.771    0.763    0.774    0.774

Table 5 […] Macro F1 […] (600, 1 400, and 1 500 […]; SVM)
          10%      20%      40%      60%
TF        0.579    0.633    0.658    0.668
TF-IDF    0.627    0.634    0.632    0.637
TF-CHI    0.546    0.591    0.624    0.616

[…] TF […] TF-CHI […] 40% […]

[…] (%)
        P       R       F1
[…]     63.4    49.7    55.8
[…]     74.0    50.1    59.7
[…]     62.3    55.6    58.7
[…]     64.7    58.2    61.3

[…] TF […] SVM […] Naive Bayes […] 40% […] SVM […]

Table 7 […] F1 […] (SVM; TF; 40%)
        600 / 1 400 / 1 500    600 / 1 400 / 3 000    600 / 1 400 / 6 000
[…]     0.597                  0.569                  0.490
[…]     0.613                  0.475                  0.303
[…]     0.763                  0.896                  0.874
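The sample-number settings compared in the experiments (600 / 1 400 / 1 500, 600 / 1 400 / 3 000, 600 / 1 400 / 6 000) are consistent with random under-sampling of the largest class. The Python sketch below shows one way such a step could look. It is an assumption about the procedure rather than the authors' implementation, and the class labels `c1`, `c2`, `c3`, the `cap` parameter, and the fixed seed are hypothetical.

```python
import random


def undersample(docs_by_class, cap, seed=0):
    """Randomly keep at most `cap` documents per class.

    Classes with `cap` documents or fewer are left unchanged; larger
    classes are reduced by sampling without replacement.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sampled = {}
    for label, docs in docs_by_class.items():
        if len(docs) > cap:
            sampled[label] = rng.sample(docs, cap)
        else:
            sampled[label] = list(docs)
    return sampled


if __name__ == "__main__":
    # Hypothetical corpus with the class sizes reported for the data set.
    corpus = {
        "c1": list(range(600)),
        "c2": list(range(1400)),
        "c3": list(range(12000)),
    }
    for cap in (1500, 3000, 6000):
        sizes = {k: len(v) for k, v in undersample(corpus, cap).items()}
        print(cap, sizes)
```

Capping only the majority class keeps every minority document in the training set, which matters when the smallest class has only a few hundred examples.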
4 Conclusion

[…] CHI […]

References:
[1] Schumaker R P, Zhang Y, Huang C N, et al. Evaluating sentiment in financial news articles [J]. Decision Support Systems, 2012, 53(3): 458-464.
[2] O'Hare N, Davy M, Bermingham A, et al. Topic-dependent sentiment analysis of financial blogs [C]//Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion. New York: ACM, 2009: 9-16.
[3] Yang L G, Zhu J, Tang S P. Survey of text sentiment analysis [J]. Journal of Computer Applications, 2013, 33(6): 1574-1607 (Ch).
[4] Zhao Y Y, Qin B, Liu T. Sentiment analysis [J]. Journal of Software, 2010, 21(8): 1834-1848 (Ch).
[5] Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques [C]//Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Volume 10. Stroudsburg: Association for Computational Linguistics, 2002: 79-86.
[6] Debole F, Sebastiani F. An analysis of the relative hardness of Reuters-21578 subsets [J]. Journal of the American Society for Information Science and Technology, 2005, 56(6): 584-596.
[7] Wang Z Q, Li S S, Zhu Q M, et al. Chinese sentiment classification on imbalanced data distribution [J]. Journal of Chinese Information Processing, 2012, 26(3): 33-37 (Ch).
[8] Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A. A review of feature selection methods on synthetic data [J]. Knowledge and Information Systems, 2013, 34(3): 483-519.
[9] Qiu L Q, Zhao R Y, Zhou G, et al. An extensive empirical study of feature selection for text categorization [DB/OL]. [2014-02-03]. http://ieeexplore.ieee.org/xpls/icp.jsp?arnumber=4529838.
[10] Drummond C, Holte R C. C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling [C]//Workshop on Learning from Imbalanced Datasets II. Washington D C: ICML, 2003: 11.
[11] Galavotti L, Sebastiani F, Simi M. Experiments on the use of feature selection and negative evidence in automated text categorization [C]//Research and Advanced Technology for Digital Libraries. Berlin Heidelberg: Springer-Verlag, 2000: 59-68.
[12] Pei Y B, Liu X X. Study on improved CHI for feature selection in Chinese text categorization [J]. Computer Engineering and Applications, 2011, 47(4): 128-130 (Ch).
[13] Xiong Z Y, Zhang P Z, Zhang Y F. Improved approach to CHI in feature extraction [J]. Journal of Computer Applications, 2008, 28(2): 513-514 (Ch).
[14] Debole F, Sebastiani F. Supervised term weighting for automated text categorization [C]//Text Mining and its Applications. Berlin Heidelberg: Springer-Verlag, 2004: 81-97.
[15] Wang Z H, Wang Z Q, Li S S, et al. Feature selection for imbalanced sentiment classification [J]. Journal of Chinese Information Processing, 2013, 27(4): 113-118 (Ch).