ISSN 000-9825, CODEN RUXUEW E-ail: jos@iscas.ac.cn Journal of Software,207,28():3072 3079 [doi: 0.3328/j.cnki.jos.005345] http://www.jos.org.cn. Tel: +86-0-62562563, ((), 20023) :, E-ail: li@lada.nju.edu.cn :,,,.,,.,,,,,.,(cost-sensitive argin distribution optiization, CSMDO),,,. : ;;;; : TP3 :,..,207,28():3072 3079. http://www.jos. org.cn/000-9825/5345.ht : Xie Z, Li M. Cost-Sensitive argin distribution optiization for software bug localization. Ruan Jian Xue Bao/ Journal of Software, 207,28():3072 3079 (in Chinese). http://www.jos.org.cn/000-9825/5345.ht Cost-Sensitive Margin Distribution Optiization for Software Bug Localization XIE Zheng, LI Ming (State Key Laboratory for Novel Software Technology (Nanjing University), Nanjing 20023, China) Abstract: It is costly to identify bugs fro nuerous source code files in a large software project. Thus, locating bug autoatically and effectively becoes a worthy proble. Bug report is one of the ost valuable source of bug description, and precisely locating related source codes linked to the bug reports can help reducing software developent cost. Currently, ost of the research on bug localization based on deep neural networks focus on design of network structures while lacking attention to the loss function, which ipacts the perforance significantly in prediction tasks. In this paper, a cost-sensitive argin distribution optiization (CSMDO) loss function is proposed and applied to deep neural networks. This new ethod is capable of handling the ibalance of software defect data sets, and iproves the accuracy significantly. Key words: software ining; achine learning; bug localization; convolutional neural network; argin distribution optiization,,,.,, [,2]. [3 0],,,;,. : (6422304, 627227) Foundation ite: National Natural Science Foundation of China (6422304, 627227). :207-05-09; : 207-06-6; : 207-08-23
: 3073,,.,,,.,.,,,.,.,,.,,,.,Poshyvanyk [5] ;Lukins [6] LDA(latent Dirichlet allocation),;gay [7] (vector space odel, VSM);Zhou [8] BugLocator, VSM.,.La [9] auto encoder ;Huo [0],,.,,. [9 ] sigoid ( softax ),.,., AdaBoost [2,3].Zhang [4],(optial argin distribution achine, ODM),,., ODM.,(cost-sensitive argin distribution optiization, CSMDO),,,. CSMDO.,CSMDO.. 2. 3. 4..,.,.,,. C={c,c 2,,c n },R={r,r 2,,r n2 },y ij Y={+, }., f:r C, sgn(f(r i,c j )) Y={+, } r i c j.: in L ( f ) = λ L( f( r, c ), y ) + Ω( f) () f i, j i j ij,l(, ),Ω( ).λ,..2,[5,6]. [0] :
3074 Journal of Software Vol.28, No., Noverber 207,.,,., x ij =φ(r i,c j ),.,, f(r i,c j ). Fig. Convolutional neural network structure for bug localization one-hot, [7]., V={ bug, class, contains, file, is, this }, S= this file contains bug. 0 0 0 0 0 this 0 0 0 0 0 file x = (2) 0 0 0 0 0 contains 0 0 0 0 0 bug one-hot,.,., [8]. 2 2 : 2,; 2,.,.,,.,,,.
: 3075 2 2.,,.,, AdaBoost.Schapire [9] AdaBoost.Breian [20], boosting Arc-gv,.,Rayzin [2],Arc-gv,,., AdaBoost, [2,3].Zhang, ODM [4].ODM w, ;,ODM.,,.,,: 2 2 2 in w w+ C+ i i 0 i,, 2 ξ + C ξ + C ε w ξε yi=+ yi = i= s.t. yw x D ξ, yw x + D+ ε, i i i i i i ξ 0, ε 0, i =,...,. i i,,; 2,. D,. C + C., C + >C,. C 0. 2.2,(3) : C in L( w) w w ax{0, D yw x} w + 2 = + i i + 2 yi =+ C 2 0 2 ax{0, D yiw xi} + ax{0, yiw xi D} yi = i=,... (x i,y i ), L(w,x i,y i ) L(w). :L(w) 2C+ L( w) = w+ ( yiw xi + D ) yixii( yiw xi < D) I( yi =+ ) + 2C 2C 0 i= i= i= ( yw x + D ) y xi( yw x < D) I( y = ) + i i i i i i i ( yw x D ) y xi( yw x > + D) C i i i i i i (3) (4) (5)
3076 Journal of Software Vol.28, No., Noverber 207,I( ),, 0.,: E[ L( wx, i, yi)] = w+ 2 C+ E[( yiw xi + D ) yixii( yiw xi < D) I( yi =+ )] + 2 C E[( yiw xi + D ) yixii( yiw xi < D) I( yi = )] + 2 C0E[( yiw xi D ) yixii( yiw xi > + D)] 2C+ = w+ ( yiw xi + D ) yixii( yiw xi < D) I( yi =+ ) + i= (6) 2C ( yiw xi + D ) yixii( yiw xi < D) I( yi = ) + i= 2C0 ( yiw xi D ) yixii( yiw xi > + D) i= = L( w), L(w,x i,y i ) L(w).,., (ini-batch gradient descent),. 3,.. 3. 4,,.,, [22]. [8,9],,JDT(Java developent tools) Eclipse Java IDE,PF(eclipse platfor) Eclipse,PDE(plug-in developent environent) Eclipse,AspectJ Java.. Table Statistics of the data sets JDT 2 826 2 272 4.39 PF 4 893 02 6.79 PDE 4 034 2 970 8.34 AspectJ 734 36.73,,,., Top 0 Rank,AUC(area under ROC curve) MAP(ean average precision) [8].,,BugLocator [8] Zhou,;Two- Phase [8] Ki,, ;HyLoc [9] La, auto encoder,..,,.
: 3077 2,,.,NP-CNN logistic (cross entropy), Logistic logistic.np-cnn hinge SVM hinge,,.., CSMDO. 3.2 0,.Top 0 Rank,MAP AUC 2~ 4.,/ CSMDO /(Wilcoxon, 0.05). Table 2 2 Top 0 Rank of copared ethods on all data sets Top 0 Rank Data set BugLocator Two-Phase HyLoc NP-CNN logistic NP-CNN higne CSMDO JDT 0.742 0.642 0.789 0.938 0.933 0.935 PF 0.725 0.732 0.760 0.894 0.867 0.897 PDE 0.673 0.67 0.78 0.842 0.85 0.86 AspectJ 0.622 0.570 0.74 0.848 0.842 0.87 Avg. 0.69 0.640 0.752 0.880 0.873 0.89 Table 3 3 MAP of copared ethods on all data sets MAP Data set BugLocator Two-Phase HyLoc NP-CNN logistic NP-CNN higne CSMDO JDT 0.44 0.372 0.455 0.522 0.525 0.533 PF 0.350 0.398 0.40 0.537 0.532 0.539 PDE 0.422 0.38 0.552 0.624 0.632 0.674 AspectJ 0.48 0.397 0.433 0.545 0.546 0.577 Avg. 0.408 0.387 0.463 0.557 0.559 0.58 Table 4 4 AUC of coparedethods on all data sets AUC Data set BugLocator Two-Phase HyLoc NP-CNN logistic NP-CNN higne CSMDO JDT 0.78 0.724 0.83 0.88 0.97 0.927 PF 0.772 0.803 0.830 0.93 0.895 0.94 PDE 0.753 0.763 0.824 0.94 0.925 0.927 AspectJ 0.683 0.664 0.76 0.857 0.852 0.926 Avg. 0.747 0.739 0.807 0.89 0.897 0.930,CSMDO. 4, BugLocator,Two-Phase HyLoc 3,CSMDO AUC 8.3%,9.% 2.3%,.:,CSMDO,. NP-CNN logistic,csmdo Top 0 Rank,MAP AUC 3.%,2.4% 3.9%, hinge NP-CNN hinge,.8%,2.2% 3.3%., CSMDO MAP AUC 3 Top 0 Rank.,,
3078 Journal of Software Vol.28, No., Noverber 207,. 4, (CSMDO),.,,.,,,.,,.,.,,. References: [] Li M, Huo X. Software defect ining based on sei-supervised learning. Journal of Data Acquisition and Processing, 206,3(): 56 64 (in Chinese with English abstract). [doi: 0.6337/j.004-9037.206.0.005] [2] Li M. Software defect ining. Chinese Association for Artificial Intelligence Conunication, 206,2(8):44 49 (in Chinese with English abstract). [3] Tu WW, Li M, Zhou ZH. Mining software defect factor. Journal of Jilin University (Engineering and Technology Edition), 202, 42(s):383 386 (in Chinese with English abstract). [4] Ding H, Chen L, Qian J, Xu L, Xu BW. Fault localization ethod using inforation quantity. Ruan Jian Xue Bao/Journal of Software, 203,24(7):484 494 (in Chinese with English abstract). http://www.jos.org.cn/000-9825/4294.ht [doi: 0.3724/SP.J. 00.203.04294] [5] Poshyvanyk D, Gueheneuc YG, Marcus A, Antoniol G, Rajlich VC. Feature location using probabilistic ranking of ethods based on execution scenarios and inforation retrieval. IEEE Trans. on Software Engineering, 2007,33(6):420 432. [doi: 0.09/tse. 2007.06] [6] Lukins SK, Kraft N, Etzkorn LH. Source code retrieval for bug localization using latent dirichlet allocation. In: Proc. of 5th Working Conf. on Reverse Engineering. IEEE, 2008. 55 64. [doi: 0.09/wcre.2008.33] [7] Gay G, Haiduc S, Marcus A, Menzies T. On the use of relevance feedback in IR-based concept location. In: Proc. of Int l Conf. on Software Maintenance. 2009. 35 360. [doi: 0.09/ics.2009.530635] [8] Zhou J, Zhang H, Lo D. Where should the bugs be fixed? More accurate inforation retrieval-based bug localization based on bug reports. In: Proc. of the 34th Int l Conf. on Software Engineering. 202. 4 24. [doi: 0.09/icse.202.622720] [9] La AN, Nguyen AT, Nguyen HA, Nguyen TN. Cobining deep learning with inforation retrieval to localize buggy files for bug reports. In: Proc. of the 28th Int l Conf. on Autoated Software Engineering. 205. [doi: 0.09/ase.205.73] [0] Huo X, Li M, Zhou ZH. Learning unified features fro natural and prograing languages for locating buggy source code. In: Proc. of the 25th Int l Joint Conf. on Artificial Intelligence (IJCAI 206). New York, 206. 606 62. [] Mou L, Li G, Zhang L, Wang T, Jin Z. Convolutional neural networks over tree structures for prograing language processing. In: Proc. of the 30th AAAI Conf. on Artificial Intelligence. 206. 287 293. [2] Wang L, Sugiyaa M, Jing Z, Yang C, Zhou ZH, Feng J. A refined argin analysis for boosting algoriths via equilibriu argin. Journal of Machine Learning Research, 20,2:835 863. [3] Gao W, Zhou ZH. On the doubt about argin explanation of boosting. Artificial Intelligence, 203,203: 8. [doi: 0.06/j.artint. 203.07.002] [4] Zhang T, Zhou ZH. Optial argin distribution achine. arxiv:604.03348, 206.
: 3079 [5] Johnson R, Zhang T. Effective use of word order for text categorization with convolutional neural networks. In: Proc. of the Conf. of the North Aerican Chapter of the Association for Coputational Linguistics: Huan Language Technologies. 205. 03 2. [doi: 0.35/v/n5-0] [6] Zhang X, LeCun Y. Text understanding fro scratch. arxiv:502.070, 205. [7] Ki Y. Convolutional neural networks for sentence classification. In: Proc. of the Conf. on Epirical Methods in Natural Language Processing. ACL, 204. 746 75. [doi: 0.35/v/d4-8] [8] Ki D, Zeller A, Tao Y, Ki S. Where should we fix this bug? A two-phase recoendation odel. IEEE Trans. on Software Engineering, IEEE, 203,39():597 60. [doi: 0.09/tse.203.24] [9] Schapire RE, Freund Y, Barlett P, Lee WS. Boosting the argin: A new explanation for the effectiveness of voting ethods. In: Fisher DH, ed. Proc. of the 4th Int l Conf. on Machine Learning (ICML 97). Morgan Kaufann Publishers, 997. 322 330. [20] Breian L. Prediction gaes and arcing algoriths. Neural Coputation, 999,(7):493 57. [doi: 0.62/08997669930006 06] [2] Reyzin L, Schapire RE. How boosting the argin can also boost classifier coplexity. In: Cohen WW, Moore A, eds. Proc. of the 23rd Int l Conf. on Machine Learning (ICML 2006), Vol.48. 2006. 753 760. [doi: 0.45/43844.43939] [22] Fischer M, Pinzger M, Gall H. Populating a release history database fro version control and bug tracking systes. In: Proc. of the Int l Conf. on Software Maintenance. IEEE, 2003. 23 32. [doi: 0.09/ics.2003.235403] : [],..,206,3():56 64. [doi: 0.6337/j.004-9037.206.0.005] [2]..,206,2(8):44 49. [3],,..(),202,42(s):383 386. [4],,,,..,203,24(7):484 494. http://www.jos.org.cn/000-982 5/4294.ht [doi: 0.3724/SP.J.00.203.04294] (994),,,, (980),,,,,.,CCF,,.