SocialDict 1 2 2 2 Web SocialDict A reading support tool with prediction capability and its extension to readability measurement Yo Ehara, 1 Takashi Ninomiya, 2 Nobuyuki Shimizu 2 and Hiroshi Nakagawa 2 We have been proposing a system to support reading English Web pages for ESL (English as a second language) users by annotate the words they may not know by personalized prediction of those words using the word click logs. In this paper, we propose its extension to readability measurement. As this paper is not refereed, the content of this paper is not counted as a conference paper. Web Web 1 pop 1 popin 2 Web SocialDict 3 4) Web 2 4 1. Web Web Web 1 pop 1 Graduate School of Information Science and Technology, The University of Tokyo 2 Information Technology Center, The University of Tokyo 1 http://www.popjisyo.com/ 2 http://www.popin.cc/ 3 http://www.socialdict.com/ 4 1 c 2010 Information Processing Society of Japan
SocialDict Web Gmail Web URL http://www.socialdict.com/ 1 SocialDict / ESL (English as a Second Language) I.S.P. Nation 8) 8) I.S.P. Nation text coverage 98% 2. 3 98% text coverage text coverage 1-gram 8) 98% I.S.P. Nation 7) non-native speakers (NNS) text coverage text coverage 80% 90% 95% 90% 100% 8) text coverage98% 7) native speakers (NS) 3) 2 SocialDict 1 http://www.socialdict.com/http://www.cnn.com/ 3 (0) u Gmail (1) l URL SocialDict URL URL URL Web URL (2) l Web (Web Server) (3) Web l l Web D D = find(l) 2 c 2010 Information Processing Society of Japan
(4) D Web 4 (a) Web Dom D Dom T Web D Dom D R Dom T R = P arse(d) R Dom T Web D Dom D D = Unparse(R ) t y = 0 (4) t y = 1 (6) y = 0 t g(t) g(t) (u, t, y) w (k) Update(w (k), (u, t, y)) 3 SocialDict Web Google App Engine (GAE) 2 3. 4 (a) Web, (b) 3(4) 4(a) 4(b) Tokenize PredictGloss 2 Tokenize R Dom T R R Dom T 4(a) 4(b) 4(b) <span> 3(5),(6) JavaScript PredictGloss R Dom T g u w (k) R u t h(u, t, w (k) ) t g(t) R g t T t g(t) 3(5), (6) AJAX asynchronous JavaScript and XML 1 (5) D t (4) y (4) 1 jquery JavaScript http://semooh.jp/jquery/ 10) 3 item response theory, IRT TOEFL (Test of English as a Foreign Language) T T U U u U t T y Y (u, t, y) 1 Y N {(u n, t n, y n ) n {1,..., N}} {(u n, t n, y n ) n {1,..., N}} u n U t n T y n Y Y Y = {0, 1} y n = 1 u n t n y n = 0 u n t n Rasch Rasch P (y n = 1 u n, t n ) = σ(θ un d tn )p(θ un, d tn ) 1 σ(x) = p(θ 1+exp( x) u n, d tn ) θ un u n θ un 2 http://code.google.com/intl/ja/appengine/ 3 3 c 2010 Information Processing Society of Japan
u n d tn t n d tn u n t n h h(u n, t n, w (k) ) = log P (y n = 1 u n, t n ; w (k) ) log P (y n = 0 u n, t n ; w (k) ) h(u n, t n, w (k) ) 0 h(u n, t n, w (k) ) < 0 e u u 1 0 U e t t 1 0 T w rasch = (θ d) T ϕ rasch (u, t) = (e u e t) T (1) θ = (θ 1,..., θ u,..., θ U ) d = ( d 1,..., d t,..., d T ) (1) Rasch P (y n = 1 u n, t n ) = σ (θ un d tn p(θ un, d tn )) = σ(w T raschϕ rasch (u n, t n ))p(w) (1) p(w) w m m N (0, 1 I) C Rasch Rasch u θ u θ u = θ T e u t t d t d t = d T e t u u t Rasch Rasch Rasch Rasch (y n, u n, t n) (y n, u n, t n ) (1) Rasch w rasch w ext = (θ w d ) T, ϕ rasch ϕ ext (u, t) = (e u w ext ϕ t ) T {(u n, t n, y n) n {1,..., N}} N 1 Rasch MAP (1) MAP w N O(N) N Update N O(N) N O(1) Rasch Stochastic Gradient Descent (SGD) 1) SGD n Update w (k+1) = w (k) η k E n (w (k) ) (2) (2) η k = 1 λ(k+k 0 ) λ, k 0 SGD 4 c 2010 Information Processing Society of Japan
SocialDict NEWikipedia 4. N 1 % / 1 1 SVL12000 11,999 5 1 16 11, 999 Google 1-gram 11, 271 271 10, 400 600 600 N 1 {10, 30, 100, 300, 600} N 1 3 C {0.0315, 0.125, 0.5, 1.0, 2.0, 8.0, 32.0, 128.0, 512.0, 2048.0, 8192.0, 32768.0} 3 Rasch EXT IRT+EXT IRT EXT 3 Rasch 3 e t 3 Rasch Google 1-gram SVL12000 2 Google 1-gram 1 Web 2) p logp SVL12000 12,000 12 1 9) IRT+EXT IRT EXT disjoint IRT EXT 1 (%) N 1 = 10 30 100 300 600 IRT 65.84 66.23 66.38 67.23 66.58 EXT 74.82 79.06 79.48 79.50 79.76 IRT+EXT 74.63 77.66 79.27 79.23 79.52 5 4 5 IRT N 1 IRT 5 c 2010 Information Processing Society of Japan
IRT EXT EXT+IRT EXT IRT EXT EXT 600 16 C 12 192 0.01331 0.02281 1 CPUXeon 5160 3.0GHz 2core x 2 6) 5) 5. EXT Brown corpus Brown corpus 44 news, 17 reviews, 29 fiction 16 11,271 98% / studying study nltk WordNet Lemmatizer / / N 1 = 600 5 / / 10 Flesch-Kincaid F-K 6. 2 user 1 44 / 44 17 / 17 29 / 29 user 2 44 / 44 17 / 17 28 / 29 user 3 26 / 44 12 / 17 13 / 29 user 4 42 / 44 17 / 17 28 / 29 user 5 41 / 44 17 / 17 25 / 29 user 6 34 / 44 14 / 17 19 / 29 user 7 44 / 44 17 / 17 29 / 29 user 8 42 / 44 17 / 17 27 / 29 user 9 4 / 44 7 / 17 10 / 29 user 11 4 / 44 7 / 17 11 / 29 user 12 44 / 44 17 / 17 29 / 29 user 13 44 / 44 17 / 17 29 / 29 user 14 44 / 44 17 / 17 29 / 29 user 15 43 / 44 14 / 17 21 / 29 user 16 44 / 44 17 / 17 29 / 29 F-K 65.35 61.6647058824 82.0689655172 Web Web SocialDict SocialDict 8) 98% text coverage 5 fiction Flesch-Kincaid 6 c 2010 Information Processing Society of Japan
Fiction non-fiction text coverage 8) 1) Bishop, C.M.: Pattern recognition and machine learning, Springer (2006). 2) Brants, T. and Franz, A.: Web 1T 5-gram Version 1, Linguistic Data Consortium, Philadelphia (2006). 3) Carver, R.: Percentage of unknown vocabulary words in text as a function of the relative difficulty of the text: Implications for instruction., Journal of Reading Behavior, Vol.26, No.4, pp.413 437 (1994). 4) Ehara, Y., Shimizu, N., Ninomiya, T. and Nakagawa, H.: Personalized reading support for second-language web documents by collective intelligence, 2010 International Conference on Intelligent User Interfaces (IUI 2010), pp.51 60 (2010). 5) Fan, R., Chang, K., Hsieh, C., Wang, X. and Lin, C.: LIBLINEAR: A library for large linear classification, The Journal of Machine Learning Research, Vol.9, pp. 1871 1874 (2008). 6) Lin, C., Weng, R. and Keerthi, S.: Trust region Newton method for logistic regression, The Journal of Machine Learning Research, Vol.9, pp.627 650 (2008). 7) M.Hu, I.N.: Vocabulary density and reading comprehension, Reading in a Foreign Language, Vol.13, No.1, pp.403 430 (2000). 8) Nation, I.: How large a vocabulary is needed for reading and listening?, Canadian Modern Language Review, Vol.63, No.1, pp.59 82 (2006). 9) SPACE ALC Inc.: Standard Vocabulary List 12,000 (1998). Data available at http://www.alc.co.jp/goi/pw top all.htm. 10) [ ]- - (2005). 7 c 2010 Information Processing Society of Japan