Indexing Methods for Encrypted Vector Databases

Computer Security Symposium 2013, 21-23 October 2013

Junpei Kawamoto
Faculty of Engineering, Information and Systems, University of Tsukuba
1-1-1 Tennodai, Tsukuba, Ibaraki 305-0006, JAPAN
junpei.kawamoto@acm.org

Abstract

We introduce a filtering method based on locality sensitive hashing (LSH) and a whitening transformation. It reduces the number of candidate tuples for which an encrypted vector database (EVDB) must compute similarities during query processing. LSH is a hashing technique that efficiently estimates the similarity between two vectors. It partitions a vector space using randomly chosen vectors. By recording which partition each vector belongs to, we can filter out vectors that are less similar to the query vector. However, if the vectors in an EVDB are concentrated in a small region, most of them fall into the same partition and the filter does not work. A whitening transformation handles such cases by spreading the vectors broadly, so the proposed filter works effectively on any vector space. We also show that the filter reduces the server's query processing cost.

1

A tuple $(k, v)$ consists of a key $k$ and a value $v$.

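Locality sensitive hashing (LSH) [2]; LSH [3, 4].

2

$VDB(Key, Value)$ denotes a vector database with a key attribute $Key$ and a value attribute $Value$. R-tree [1].

A query consists of a query vector $q$ and a threshold $\alpha$, and selects the keys $k$ that satisfy $sim(k, q) \ge \alpha$.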

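$EVDB(Key_e, Value_e)$ denotes the encrypted vector database, with encrypted key attribute $Key_e$ and encrypted value attribute $Value_e$. A key $k \in Key$ is stored as $k_e \in Key_e$, produced by the function $Enc_k$ as $k_e = Enc_k(k)$. A value $v \in Value$ is stored as $v_e \in Value_e$, produced by $Enc_v$ as $v_e = Enc_v(v)$. A query vector $q$ with threshold $\alpha$ is encrypted by $Enc_q$ as $q_e = Enc_q(q)$, and the tuples $(k_e, v_e)$ that satisfy $sim(k_e, q_e) \ge \alpha$ are returned. The database therefore works with $k_e$, $q_e$ and $v_e$ rather than with $k$, $q$ and $v$. The returned tuples $(k_e, v_e)$ are decrypted with $Dec_k$ and $Dec_v$: $k = Dec_k(k_e)$ and $v = Dec_v(v_e)$. An EVDB is thus defined by the five functions $Enc_k$, $Enc_q$, $Enc_v$, $Dec_k$ and $Dec_v$.

3 LSH

3.1 LSH

Several LSH schemes have been proposed [5, 2, 6]; we use the one by Charikar [2]. It takes $m$ randomly chosen base vectors $b_i$ and defines the hash functions

\[ h_i(v) = \begin{cases} 1; & v \cdot b_i \ge 0 \\ 0; & \text{otherwise} \end{cases} \]

for each base vector $b_i$. The LSH value $lsh(v)$ of a vector $v$ is the sequence of the $m$ hash values:

\[ lsh(v) = (h_1(v), h_2(v), \dots, h_m(v)). \tag{1} \]

For two vectors $u$ and $v$ with LSH values $lsh(u)$ and $lsh(v)$,

\[ \Pr[lsh(u) = lsh(v)] \approx 1 - \theta(u, v)/\pi, \]

where $\Pr[lsh(u) = lsh(v)]$ is the fraction of indices $i$ for which $h_i(u) = h_i(v)$, and $\theta(u, v)$ is the angle between $u$ and $v$.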
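The hash of equation (1) and the agreement fraction $\Pr[lsh(u) = lsh(v)]$ can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `make_base_vectors`, `lsh` and `agreement` are hypothetical, and drawing the base vectors $b_i$ with Gaussian components is an assumption (the text only says they are chosen randomly).

```python
import math
import random


def make_base_vectors(m, dim, seed=0):
    """Draw m random base vectors b_1..b_m (Gaussian components: an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def lsh(v, base_vectors):
    """Equation (1): (h_1(v), ..., h_m(v)) with h_i(v) = 1 if v . b_i >= 0 else 0."""
    return tuple(
        1 if sum(x * y for x, y in zip(v, b)) >= 0 else 0
        for b in base_vectors
    )


def agreement(hu, hv):
    """Fraction of positions i with h_i(u) == h_i(v)."""
    return sum(1 for a, b in zip(hu, hv) if a == b) / len(hu)


if __name__ == "__main__":
    base = make_base_vectors(m=2048, dim=3)
    u = [1.0, 0.0, 0.0]
    v = [1.0, 1.0, 0.0]  # the angle between u and v is pi / 4
    p = agreement(lsh(u, base), lsh(v, base))
    # The agreement fraction should be close to 1 - theta / pi = 0.75.
    print("agreement:", p, "theta estimate:", math.pi * (1.0 - p))
```

Because each bit depends only on the sign of $v \cdot b_i$, scaling a vector does not change its LSH value; only its direction matters.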

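The LSH values therefore give an estimate of $\cos(u, v)$:

\[ \cos(u, v) \approx \cos\bigl(\pi (1 - \Pr[lsh(u) = lsh(v)])\bigr). \tag{2} \]

With $m$ base vectors there are at most $2^m$ distinct LSH values, and the estimate (2) becomes more accurate as $m$ grows.

3.2

Let $\mu$ be the mean of the vectors $v$ and $\Sigma$ their covariance matrix:

\[ \Sigma = E\bigl((v - \mu)(v - \mu)^T\bigr). \tag{3} \]

Decompose $\Sigma = \Phi \Lambda \Phi^{-1}$, where the $i$-th column of $\Phi$ is the $i$-th eigenvector of $\Sigma$ and the $i$-th diagonal entry of $\Lambda$ is the corresponding eigenvalue. The whitening matrix $W_k$ is

\[ W_k = \Phi \Lambda^{-1/2}. \tag{4} \]

A vector $v$ is whitened to $v_w = W_k^T (v - \mu)$. The whitened vectors have the identity matrix as their covariance:

\[ E(v_w v_w^T) = E\bigl(W_k^T (v - \mu)(v - \mu)^T W_k\bigr) = \Lambda^{-1/2} \Phi^T \Sigma \Phi \Lambda^{-1/2} = I. \]

Section 4 examines how whitening affects the LSH values.

3.3

Of the five functions $Enc_k$, $Enc_q$, $Enc_v$, $Dec_k$ and $Dec_v$ introduced in Section 2, the LSH filter replaces $Enc_k$, $Enc_q$ and $Dec_k$ with $Enc'_k$, $Enc'_q$ and $Dec'_k$. They are constructed from the keys $k \in Key$ of the $VDB$ as follows.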
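As an illustration of the whitening transformation of equations (3) and (4), the following sketch estimates $\mu$ and $\Sigma$ from sample vectors, builds $W_k = \Phi \Lambda^{-1/2}$, and applies $v_w = W_k^T (v - \mu)$. It assumes NumPy is available and that $\Sigma$ has full rank (all eigenvalues positive); the names `fit_whitening` and `whiten` and the sample data are hypothetical.

```python
import numpy as np


def fit_whitening(X):
    """Return (mu, W) with W = Phi Lambda^{-1/2}, from the rows of X.

    Assumes the sample covariance has full rank, so no eigenvalue is zero.
    """
    mu = X.mean(axis=0)
    # Equation (3): Sigma = E[(v - mu)(v - mu)^T], estimated from the sample.
    sigma = np.cov(X, rowvar=False, bias=True)
    # Sigma is symmetric, so eigh returns orthonormal eigenvectors (Phi^-1 = Phi^T).
    eigvals, phi = np.linalg.eigh(sigma)
    # Equation (4): W = Phi Lambda^{-1/2}.
    W = phi @ np.diag(eigvals ** -0.5)
    return mu, W


def whiten(X, mu, W):
    """Apply v_w = W^T (v - mu) to every row v of X."""
    return (X - mu) @ W


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical data: correlated vectors clustered far from the origin.
    A = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.2]])
    X = rng.normal(size=(1000, 3)) @ A + np.array([10.0, -5.0, 3.0])
    mu, W = fit_whitening(X)
    Xw = whiten(X, mu, W)
    # The covariance of the whitened vectors should be close to the identity.
    print(np.round(np.cov(Xw, rowvar=False, bias=True), 6))
```

Vectors clustered away from the origin all lie on the same side of most random hyperplanes through the origin, so their LSH bits rarely differ. Centering and rescaling spreads them around the origin in every direction.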

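Let $\mu$ be the mean of the encrypted keys. By (3), their covariance matrix is

\[ \Sigma = E\bigl((Enc_k(k) - \mu)(Enc_k(k) - \mu)^T\bigr). \]

Decomposing $\Sigma = \Phi \Lambda \Phi^{-1}$ gives the whitening matrix $W_k$ by (4). The modified functions are

\[ Enc'_k(k) = W_k^T (Enc_k(k) - \mu), \]
\[ Enc'_q(q) = W_k^{-1} Enc_q(q), \]
\[ Dec'_k(k'_e) = Dec_k\bigl((W_k^T)^{-1} k'_e + \mu\bigr). \]

The $EVDB$ is built from the $VDB$ with $Enc'_k$ and $Enc_v$; queries use $Enc'_q$, and results are decrypted with $Dec'_k$ and $Dec_v$. The mean $\mu$ is also needed to process a query. A query $(q, \alpha)$, which asks for the keys $k$ with $sim(k, q) \ge \alpha$, is evaluated on $k'_e = Enc'_k(k)$ and $q'_e = Enc'_q(q)$ with the threshold

\[ \alpha' = \alpha - \mu \cdot Enc_q(q). \]

For an inner-product similarity the shift by $\mu \cdot Enc_q(q)$ compensates for the centering, because $k'_e \cdot q'_e = (Enc_k(k) - \mu)^T W_k W_k^{-1} Enc_q(q) = Enc_k(k) \cdot Enc_q(q) - \mu \cdot Enc_q(q)$.

The LSH filter uses $m$ base vectors. By (1), the LSH value $lsh(k'_e)$ is computed for every $k'_e$, and the $EVDB$ stores tuples of the form $(LSH, Key_e, Value_e)$. Let $S_{lsh}$ be the set of LSH values stored in the $EVDB$. For a query $q'_e$ with threshold $\alpha' = \alpha - \mu \cdot Enc_q(q)$, the LSH value $h_q = lsh(q'_e)$ is computed, and the candidate set $S_{cand} \subseteq S_{lsh}$ is the set of LSH values $h$ that satisfy

\[ \cos\bigl(\pi (1 - \Pr[h = h_q])\bigr) \ge \alpha'. \tag{5} \]

By (2), the left-hand side of (5) estimates the cosine between $q'_e$ and the vectors whose LSH value is $h$. Only the tuples whose LSH value belongs to $S_{cand}$ are compared with the query, that is, the similarity between $k'_e$ and $q'_e$ is checked against $\alpha'$ for those tuples only.

4

The evaluation uses IPP [7].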
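The candidate selection of equation (5) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: plain vectors stand in for the encrypted and whitened keys $k'_e$, the threshold is passed in directly as `alpha`, and the helper names (`make_base_vectors`, `build_index`, `candidate_rows`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def make_base_vectors(m, dim, seed=0):
    """Draw m random base vectors (Gaussian components: an assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]


def lsh(v, base_vectors):
    """Equation (1): one bit per base vector, 1 if v . b_i >= 0 else 0."""
    return tuple(
        1 if sum(x * y for x, y in zip(v, b)) >= 0 else 0
        for b in base_vectors
    )


def estimated_cos(h, h_q):
    """Left-hand side of equation (5): cos(pi * (1 - Pr[h = h_q]))."""
    agree = sum(1 for a, b in zip(h, h_q) if a == b) / len(h)
    return math.cos(math.pi * (1.0 - agree))


def build_index(keys, base_vectors):
    """Group row ids by LSH value; the dict keys play the role of S_lsh."""
    index = defaultdict(list)
    for row_id, k in enumerate(keys):
        index[lsh(k, base_vectors)].append(row_id)
    return index


def candidate_rows(index, h_q, alpha):
    """Row ids whose LSH value h is in S_cand, i.e. estimated_cos(h, h_q) >= alpha."""
    rows = []
    for h, row_ids in index.items():
        if estimated_cos(h, h_q) >= alpha:
            rows.extend(row_ids)
    return rows


if __name__ == "__main__":
    base = make_base_vectors(m=64, dim=4)
    rng = random.Random(1)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(1000)]
    index = build_index(keys, base)
    q = keys[0]
    rows = candidate_rows(index, lsh(q, base), alpha=0.9)
    print(len(rows), "of", len(keys), "tuples remain after filtering")
```

Only the rows returned by `candidate_rows` would then go through the exact similarity check. Since (5) relies on an estimate, the filter can discard some tuples that would have satisfied the exact check.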

Figure 1. Panels: (a) n = 10000, (b) n = 100000. Horizontal axis: the number of base vectors m (16 to 1024). Series: size, min., max.

Figure 2. Panels: (a) n = 10000, (b) n = 100000. Horizontal axis: the number of base vectors m (16 to 1024). Series: size, min., max.

The experiments were implemented in Python 2.7 and run on an Intel(R) Core(TM) i7-860 processor (8M cache, 2.80 GHz) with 8 GB of RAM under Ubuntu 12.04 LTS. The data sizes are n = 10000 and n = 100000.

Figure 3: recall (n = 1000). Panels: (a) recall against the number of base vectors m (16 to 1024); (b) recall against the requesting width (1 to 50) for m = 256.

The accompanying discussion refers to m < 64, to m > 128 for n = 10000 (Figure 4(a)), to m = 256 (Figure 3(b)), and to m = 512 for n = 100000.

Figure 4: time (sec). Panels: (a) n = 10000, with m = 16, 32, 64, 128, 256 and without the LSH filter; (b) n = 100000, with m = 32, 64, 128, 256, 512 and without the LSH filter. Horizontal axis: requesting width (10 to 50). Vertical axis: logarithmic, 10^-4 to 10^0.

5

References

[1] Wang, J., Wu, S., Gao, H., Li, J., Ooi, B.C.: Indexing Multi-dimensional Data in a Cloud System. In: Proc. of the 30th ACM SIGMOD International Conference on Management of Data, pp. 591-602. ACM Press, Indianapolis, IN, USA (2010)

[2] Charikar, M.S.: Similarity Estimation Techniques from Rounding Algorithms. In: Proc. of the 34th Annual ACM Symposium on Theory of Computing, pp. 380-388. ACM Press, Montreal, Quebec, Canada (2002)

[3] Kirsch, A., Mitzenmacher, M.: Distance-Sensitive Bloom Filters. In: The 18th Workshop on Algorithm Engineering and Experiments. Miami, FL, USA (2006)

[4] Hua, Y., Xiao, B., Veeravalli, B., Feng, D.: Locality-Sensitive Bloom Filter for Approximate Membership Query. IEEE Transactions on Computers 61(6), 817-830 (2011)

[5] Gionis, A., Indyk, P., Motwani, R.: Similarity Search in High Dimensions via Hashing. In: Proc. of the 25th International Conference on Very Large Data Bases, pp. 518-529. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)

[6] Kulis, B., Grauman, K.: Kernelized Locality-Sensitive Hashing for Scalable Image Search. In: Proc. of the 12th IEEE International Conference on Computer Vision, pp. 2130-2137. IEEE Computer Society, Kyoto, Japan (2009)

[7] Kawamoto, J., Yoshikawa, M.: Private Range Query by Perturbation and Matrix Based Encryption. In: Proc. of the Sixth IEEE International Conference on Digital Information Management, pp. 211-216. IEEE Computer Society, Melbourne, Australia (2011)