Free-viewpoint Video Rendering in Large Outdoor Space such as Soccer Stadium based on Object Extraction and Tracking Technology

Hiroshi Sankoh and Sei Naito

Abstract
In this paper, we propose a robust object tracking scheme across multi-view cameras and consecutive frames for rendering an immersive free-viewpoint video in a large outdoor space such as a soccer stadium. For a free-viewpoint video that provides users with an immersive experience, each object has to be identified consistently among all cameras for every frame, so that the textures of the same object can be shared and replaced when an occlusion occurs. To satisfy this requirement, the proposed method extracts object silhouette regions and tracks each identified object by associating a closed silhouette region with a tracking ID for every camera. During the frame-by-frame process, our method checks whether an occlusion occurs for each tracking region and modifies the texture region by projecting the world coordinate of the object in 3D space, which can be estimated from a camera image without occlusion if one is available. The experimental results revealed that the proposed method achieved more robust texture extraction of multiple objects, especially for occluded regions, compared to the conventional methods. Furthermore, it was confirmed that the proposed scheme can improve the subjective image quality of free-viewpoint video as a result of precise reconstruction of occluded regions.

(Received June 14, 2013; revised November 26, 2013; accepted January 8, 2014)
KDDI R&D Laboratories Inc., 2-1-15 Ohara, Fujimino-shi, Saitama 356-8502, Japan (TEL 049-278-7685)

1. Introduction
[Japanese text not preserved; the introduction cites prior free-viewpoint video rendering work 1)-4).]
2. Related Work
[Japanese text not preserved; the section reviews prior multi-object tracking approaches 5)-10), including multi-view integration for soccer players 10).]

3. Proposed Method

3.1 Object Extraction and Tracking-ID Assignment
Fig. 1: Flowchart of the proposed method.
Fig. 2: Extraction of object regions ((a), (b)).
Fig. 3: Setting of object IDs ((a), (b)).

The 2-D world coordinate $(X, Y)$ of a point on the ground plane is related to its image coordinate $(u, v)$ by the planar homography $H_{cam}$ of each camera 3), with a scale factor $s$:

$(X, Y, 1)^T = s H_{cam} (u, v, 1)^T$  (1)

For object extraction, a background model is built per pixel $k$ and per color channel $c$ (RGB or YUV) with mean $u_c^{(k)}$ and standard deviation $\sigma_c^{(k)}$. With a threshold $th_c$, a pixel of intensity $I_c^{(k)}$ is judged to belong to the background when

$u_c^{(k)} - \sigma_c^{(k)} - th_c < I_c^{(k)} < u_c^{(k)} + \sigma_c^{(k)} + th_c$  (2)

Each closed silhouette region obtained in this way is assigned a tracking ID (Fig. 3), and each identified object is tracked with a particle filter whose state for camera $c$ at frame $t$ is

$P_c(t) = (u(t), v(t), \Delta u(t), \Delta v(t))^T$  (3)

where $(u(t), v(t))$ is the 2-D image position at frame $t$ and $(\Delta u(t), \Delta v(t))$ is the inter-frame displacement.
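As a concrete illustration of Eqs. (1) and (2), the sketch below (an assumption-laden reconstruction, not the authors' code) applies a homography to an image point and runs the per-channel background test; `image_to_world`, `is_background`, and all parameter shapes are illustrative names.

```python
import numpy as np

def image_to_world(H_cam, u, v):
    # Eq. (1): (X, Y, 1)^T = s * H_cam * (u, v, 1)^T, where the scale s
    # normalizes the third homogeneous component to 1.
    p = H_cam @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]

def is_background(I, mean, sigma, th):
    # Eq. (2): the pixel is background when, in every color channel c,
    # u_c - sigma_c - th_c < I_c < u_c + sigma_c + th_c.
    return bool(np.all(np.abs(I - mean) < sigma + th))
```

For the identity homography the world coordinate equals the image coordinate, which makes the mapping easy to sanity-check.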
The state transition follows a constant-velocity model:

$c(t) = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} c(t-1) + \omega(t)$  (4)

where the system noise follows a Gaussian distribution:

$\omega(t) \sim N(0, \Sigma_4)$  (5)

Fig. 4: Occlusion detection in camera 1 ((a), (b): tracker IDs).

3.2 Particle Weighting
At frame $t$, the weight $w_{tr,p}$ of particle $p$ of tracker $tr$ is a blend, controlled by $\lambda$, of an object-existence likelihood $\rho^{(k_p)}_{c,exi}$ (how strongly the pixel $k_p$ at the particle position deviates from the background model) and a color-similarity likelihood $\rho^{(tr,k_p)}_{c,uni}$ (how well the pixel matches the color model of tracker $tr$):

$w_{tr,p} = \lambda \rho^{(k_p)}_{c,exi} + (1 - \lambda) \rho^{(tr,k_p)}_{c,uni}$  (6)

$\rho^{(k_p)}_{c,exi} = 1 - \exp\!\left( -\frac{(I_c^{(k_p)} - u_c^{(k_p)})^2}{2 \sigma_c^{(k_p)}} \right)$  (7)

$\rho^{(tr,k_p)}_{c,uni} = \exp\!\left( -\frac{(I_c^{(k_p)} - \bar{I}_c^{(tr)})^2}{2 \sigma_c^{(tr)}} \right)$  (8)

where $\bar{I}_c^{(tr)}$ and $\sigma_c^{(tr)}$ are the mean and standard deviation of the color model of tracker $tr$.

Fig. 5: Non-occlusion detection in camera 2 ((a), (b): tracker IDs).
Fig. 6: 2-D world coordinate of each object estimated in two cameras ((a) cam 1, (b) cam 2).
Fig. 7: Modification of tracker regions.

[Japanese text not preserved; the surviving fragments indicate that when trackers tr = 1 and tr = 2 merge under occlusion, the merged region is given a label ID_label (e.g., ID_label = 3) distinct from the individual tracker IDs ID_tr, and the world coordinates of Fig. 6 are used to modify the tracker regions as in Fig. 7.]
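A minimal sketch of the transition and weighting steps of Eqs. (3)-(8), under stated assumptions: single-channel (scalar) color models and the 2σ normalization exactly as written in Eqs. (7) and (8); the function names are illustrative, not the authors' implementation.

```python
import numpy as np

# Constant-velocity transition matrix of Eq. (4): the position is advanced
# by the displacement, and the displacement itself is carried over.
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])

def propagate(state, Sigma4, rng):
    # Eqs. (4)-(5): c(t) = A c(t-1) + w(t), with w(t) ~ N(0, Sigma4).
    # state = (u, v, du, dv)^T as in Eq. (3).
    return A @ state + rng.multivariate_normal(np.zeros(4), Sigma4)

def particle_weight(I, bg_mean, bg_sigma, tr_mean, tr_sigma, lam):
    # Eq. (7): existence likelihood -- deviation from the background model.
    rho_exi = 1.0 - np.exp(-(I - bg_mean) ** 2 / (2.0 * bg_sigma))
    # Eq. (8): uniqueness likelihood -- similarity to the tracker's color model.
    rho_uni = np.exp(-(I - tr_mean) ** 2 / (2.0 * tr_sigma))
    # Eq. (6): blend the two likelihoods with weight lambda.
    return lam * rho_exi + (1.0 - lam) * rho_uni
```

With zero noise covariance the propagation is deterministic, which makes the constant-velocity behavior easy to verify by hand.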
When two trackers overlap in a camera image, an occlusion is detected in that camera, while the same objects may remain separated in another camera. [Japanese text not preserved; the fragments describe how trackers ID_tr = 1 and tr = 2 merge during an occlusion (Fig. 4(b)) while remaining distinct in the non-occluded camera (Fig. 5), and how the tracker regions are modified accordingly (Fig. 7).]

3.3 Texture Modification for Occluded Regions
[Japanese text not preserved; as stated in the abstract, the texture of an occluded tracker region ID_tr is modified by projecting the object's world coordinate, estimated from a camera image without occlusion, back into the occluded view.]

Fig. 8: Camera arrays of sequence A and sequence B ((a) sequence A, (b) sequence B).

4. Experiments
Two multi-view sequences were used: sequence A, captured with 2 cameras (Fig. 8(a)), and sequence B, captured with 4 cameras (Fig. 8(b)). Each camera records 4096 × 2304 video at 30 fps. The initial frames of the sequences are shown in Figs. 9 and 10.
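The occlusion-recovery idea from the abstract (estimate the object's world coordinate from a camera image without occlusion and project it into the occluded view) can be sketched as two homography applications in the sense of Eq. (1); `transfer_position` and both homographies here are hypothetical illustrations, not the paper's calibration data.

```python
import numpy as np

def apply_homography(H, point):
    # Apply a 3x3 homography to a 2-D point and renormalize (cf. Eq. (1)).
    q = H @ np.array([point[0], point[1], 1.0])
    return np.array([q[0] / q[2], q[1] / q[2]])

def transfer_position(H_clear, H_occluded, uv_clear):
    # Estimate the 2-D world coordinate from the non-occluded camera,
    # then map it back into the occluded camera via the inverse homography.
    world = apply_homography(H_clear, uv_clear)                # (u, v) -> (X, Y)
    return apply_homography(np.linalg.inv(H_occluded), world)  # (X, Y) -> (u, v)
```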
Fig. 9: Initial frames of sequence A ((a) camera 1, (b) camera 2).
Fig. 10: Initial frames of sequence B ((a) camera 1, (b) camera 2, (c) camera 3, (d) camera 4).

4.1 Experiment 1: Tracking Accuracy
Tracking accuracy was evaluated over 70 frames of sequence A and 660 frames of sequence B. The proposed method was compared with two conventional trackers: conventional method 1 9) and conventional method 2 10). Following 5), the number of ground-truth objects $gt_t$ and the number of ID mismatch errors $gmme_t$ were counted for every frame $t$, and the global mismatch-error rate over $T$ frames was computed as

$gmme = \frac{\sum_{t=1}^{T} gmme_t}{\sum_{t=1}^{T} gt_t} \times 100$  (9)

The tracking results for sequence A are shown in Figs. 11-13 and those for sequence B in Figs. 14-16. [Japanese text not preserved; the surviving fragments indicate that the proposed method kept all tracking IDs consistent in sequence A, whereas conventional method 1 confused IDs such as ID14/ID15 in camera 1 and ID2/ID5 in camera 2.]
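The metric of Eq. (9) is straightforward to compute from the per-frame counts; the helper below is an illustrative sketch with hypothetical names for the per-frame lists.

```python
def gmme_rate(gmme_per_frame, gt_per_frame):
    # Eq. (9): total ID mismatch errors over all frames, expressed as a
    # percentage of the total number of ground-truth objects.
    return 100.0 * sum(gmme_per_frame) / sum(gt_per_frame)
```

For example, 2 mismatch errors against 40 ground-truth objects over three frames gives a rate of 5%.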
[Japanese text not preserved. The fragments discuss sequence B around frames 200-230, where trackers ID6, ID8, ID9, ID20, and ID21 undergo occlusions; Figs. 14(a) and 14(b) show that the proposed method keeps these IDs consistent through the occlusions, whereas the conventional methods lose or swap them. The quantitative comparison by Eq. (9) is summarized in Table 1.]

Fig. 11: Results of the proposed method for seq. A ((a) frame 25 (cam 1), (b) frame 52 (cam 1), (c) frame 8 (cam 2), (d) frame 47 (cam 2)).
Fig. 12: Results of conventional method 1 for seq. A ((a) frame 25 (cam 1), (b) frame 52 (cam 1), (c) frame 8 (cam 2), (d) frame 47 (cam 2)).
Fig. 13: Results of conventional method 2 for seq. A ((a) frame 25 (cam 1), (b) frame 52 (cam 1), (c) frame 8 (cam 2), (d) frame 47 (cam 2)).
Fig. 14: Results of the proposed method for seq. B ((a)-(d): frame 230, cameras 1-4).
4.2 Experiment 2: Subjective Quality of Free-viewpoint Video

Fig. 15: Results of conventional method 1 for seq. B ((a)-(d): frame 230, cameras 1-4).
Fig. 16: Results of conventional method 2 for seq. B ((a)-(d): frame 230, cameras 1-4).

Table 1: Comparison of quantitative measurement (gmme [%], Eq. (9)).

                          Seq. A    Seq. B
  Proposed method          0         1.617
  Conventional method 1    6.875     9.265
  Conventional method 2    0        16.548

[Japanese text not preserved. The fragments compare free-viewpoint videos of sequence A rendered by the proposed method (Fig. 17) and by three comparative methods (Figs. 18-20), together with close-ups of each camera image (Fig. 21); Figs. 17(c) and 17(d) indicate that the proposed method reconstructs occluded textures more precisely than the comparative methods (cf. Figs. 20(c) and 20(d)).]
Fig. 17: Free-viewpoint video rendered by the proposed method ((a) viewpoint 1, (b) viewpoint 2, (c) zoom-up of viewpoint 1, (d) zoom-up of viewpoint 2).
Fig. 18: Free-viewpoint video rendered by comparative method 1 ((a)-(d): same viewpoints as Fig. 17).
Fig. 19: Free-viewpoint video rendered by comparative method 2 ((a)-(d): same viewpoints as Fig. 17).
Fig. 20: Free-viewpoint video rendered by comparative method 3 ((a)-(d): same viewpoints as Fig. 17).
Fig. 21: Close-up of each camera ((a) cam 1, (b) cam 2, (c) cam 3, (d) cam 4).

5. Conclusion
[Japanese text not preserved.]

References
1) T. Kanade, P. W. Rander, and P. J. Narayanan: "Virtualized Reality: Constructing Virtual Worlds from Real Scenes", IEEE Multimedia, 4, 1, pp. 34-47 (1997)
2) T. Koyama, I. Kitahara, and Y. Ohta: "Live Mixed-reality 3D Video in Soccer Stadium", Proc. of IEEE/ACM ISMAR, pp. 178-187 (2003)
3) K. Hayashi and H. Saito: "Synthesizing Free-viewpoint Images from Multiple View Videos in Soccer Stadium", Proc. of IEEE CGIV, pp. 220-225 (2004)
4) M. Germann, A. Hornung, R. Keiser, R. Ziegler, S. Wurmlin, and M. Gross: "Articulated Billboards for Video-based Rendering", Proc. of EUROGRAPHICS, pp. 585-594 (2010)
5) H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua: "Tracking Multiple People under Global Appearance Constraints", Proc. of IEEE ICCV (2011)
6) A. Mittal and L. Davis: "M2Tracker: A Multi-view Approach to Segmenting and Tracking People in a Cluttered Scene", IJCV, 51, 3, pp. 189-203 (2003)
7) C. Yang, R. Duraiswami, and L. Davis: "Fast Multiple Object Tracking via a Hierarchical Particle Filter", Proc. of IEEE ICCV (2005)
8) K. Okuma, A. Taleghani, N. de Freitas, J. J. Little, and D. G. Lowe: "A Boosted Particle Filter: Multitarget Detection and Tracking", Proc. of ECCV, pp. 28-39 (2004)
9) M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool: "Robust Tracking-by-detection Using a Detector Confidence Particle Filter", Proc. of IEEE ICCV, pp. 1515-1522 (2009)
10) S. Iwase and H. Saito: "Parallel Tracking of All Soccer Players by Integrating Detected Positions in Multiple View Images", Proc. of IEEE ICPR, pp. 751-754 (2004)
11) C. Rother, V. Kolmogorov, and A. Blake: "GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts", ACM SIGGRAPH, 23, pp. 309-314 (2004)
12) [Japanese reference; authors and title not preserved], CVIM, pp. 193-204 (2007)