Multimedia Signal Processing
Our expertise in multimedia signal processing covers speech and audiovisual signal processing, human-computer interaction, and pattern recognition. We apply signal processing and statistical machine learning techniques to study correlations, dependencies, and independencies across modalities that are either loosely synchronized, such as speech and gestures or speech and emotion, or strongly synchronized, such as acoustic and throat-microphone recordings, or music and dance.
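As a toy illustration of measuring such cross-modal dependence, the sketch below estimates the lag at which two loosely synchronized feature streams are most correlated. The features (`speech_energy`, `gesture_velocity`) are synthetic stand-ins, not the lab's actual descriptors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# synthetic stand-ins for two loosely synchronized modality features:
# the "gesture" stream echoes the "speech" stream two frames later, plus noise
speech_energy = rng.normal(size=n)
gesture_velocity = 0.7 * np.roll(speech_energy, 2) + 0.5 * rng.normal(size=n)

def lagged_correlation(a, b, max_lag=5):
    """Pearson correlation of a[i] against b[i + lag] for each candidate lag;
    returns the lag with the strongest correlation and its value."""
    lags = list(range(-max_lag, max_lag + 1))
    corrs = [np.corrcoef(a, np.roll(b, -lag))[0, 1] for lag in lags]
    best = int(np.argmax(np.abs(corrs)))
    return lags[best], corrs[best]

best_lag, best_corr = lagged_correlation(speech_energy, gesture_velocity)
```

With the synthetic data above, the strongest correlation is found at a lag of two frames, recovering the built-in asynchrony between the two streams.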
Digital Image and Video Processing
Digital image and video processing has become an important enabling technology for many applications, including digital multimedia services over the wired and wireless Internet, digital surveillance, smart homes/vehicles/environments, digital cinema, HDTV, and HD-DVD. In the MVGL, we conduct active research in all aspects of digital video processing, including video analysis, compression, filtering, and data embedding. Some example projects are: i) Object Motion Segmentation and Tracking – Motion estimation and object segmentation are fundamental to all digital video processing applications. Video object segmentation and tracking methods may be classified as those that follow a coarse outline of an object (e.g., a bounding box or silhouette) versus those that extract a pixel-accurate contour. We have developed automatic object segmentation methods that integrate motion and color information (silhouette-level), as well as 2-D mesh- and snake-based tracking (pixel-accurate) of objects using a piecewise-adaptive snake model of contours that are manually marked or refined on key frames. ii) Multi-Camera Human Motion Monitoring and Gait Analysis – The main subjects of most videos are humans. We are developing algorithms for real-time segmentation and tracking of humans from single and multiple cameras, as well as for understanding their motions. We employ both 2-D and 3-D human models for human tracking applications, including gait analysis. iii) Multi-Camera Surveillance and Super-Resolution Recovery – We are studying new methods for obtaining super-resolution frames from one or more cameras capturing uncompressed or compressed video under different scene-motion and multi-camera registration models. iv) Sports Video Processing for Indexing and Summarization – Semantic-level analysis of video is possible only for specific domains that provide strong context information.
Sports video has strong context originating from the rules of the game as well as the cinematographic conventions used by broadcasters. We are developing algorithms for field (dominant-color) detection, shot boundary detection, shot type classification, slow-motion replay detection, and important-event detection for various sports, including soccer, basketball, and football. v) Digital Watermarking and Security – Digital images and video have become increasingly susceptible to spatio-temporal manipulations as a result of recent advances in editing tools. We propose secure and flexible fragile digital authentication watermarking methods that enable self-recovery of video after malicious manipulations. In particular, we are studying fragile and semi-fragile methods for generalized lossless (invertible) data embedding in images and videos, used to embed metadata or hierarchical authentication watermarks. We are also developing collusion-resilient fingerprinting applications.
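The fragile-watermarking idea can be sketched minimally as follows. This is an illustrative scheme, not the lab's actual method: a signature of each 8x8 block's content (its seven most significant bit-planes) is embedded into the block's least significant bits, so any later manipulation of a block invalidates its stored signature.

```python
import hashlib
import numpy as np

BLOCK = 8

def block_signature_bits(block_msb):
    # hash the content bit-planes of one block; keep 64 bits, one per pixel
    digest = hashlib.sha256(block_msb.tobytes()).digest()
    bits = np.unpackbits(np.frombuffer(digest, dtype=np.uint8))[:BLOCK * BLOCK]
    return bits.reshape(BLOCK, BLOCK)

def embed(img):
    """Fragile watermark: write each block's signature into its pixel LSBs."""
    out = img.copy()
    for y in range(0, img.shape[0], BLOCK):
        for x in range(0, img.shape[1], BLOCK):
            blk = out[y:y + BLOCK, x:x + BLOCK]
            msb = blk & 0xFE                     # content with LSBs cleared
            blk[:] = msb | block_signature_bits(msb)
    return out

def verify(img):
    """Return the (row, col) origins of blocks whose signature no longer matches."""
    tampered = []
    for y in range(0, img.shape[0], BLOCK):
        for x in range(0, img.shape[1], BLOCK):
            blk = img[y:y + BLOCK, x:x + BLOCK]
            msb = blk & 0xFE
            if not np.array_equal(blk & 1, block_signature_bits(msb)):
                tampered.append((y, x))
    return tampered

rng = np.random.default_rng(0)
frame = rng.integers(0, 256, (32, 32), dtype=np.uint8)  # synthetic test frame
marked = embed(frame)
```

Because only LSBs carry the signature, the watermark is invisible at roughly one gray level of distortion, yet flipping any content bit in a block breaks that block's signature and localizes the manipulation.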
Multimodal Signal Processing: Multimodal Human-Computer Interfaces
Multimodal signal processing refers to the combined processing of signals from multiple modalities such as speech, still images, video, and other sources. It plays a key role in the design of future human-computer interfaces and intelligent systems, such as intelligent vehicles. The ultimate goal of human-computer interface research is to develop a machine that can identify humans, analyze and understand them from biometric input signals, and synthesize a human-like output in response, much as in human-to-human communication. The joint use of multiple modalities derived from biometric signals, in particular from audiovisual sensors such as voice, face, fingerprint, iris, gestures, body and head motion, gait, speech, and lip movements, reinforces recognition, provides robustness, and yields more natural interaction with computers. The study of relations and correlations between different modality signals plays an important role in the effective use of this multimodal information (see the SIMILAR project). Speech recognition, person identification, speaker recognition, body motion analysis, speech-driven face gesture analysis and synthesis, speaker animation, and audio-driven body animation are some of the MVGL's active research areas in multimodal signal processing.
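A common baseline for combining modalities in recognition is score-level (late) fusion, sketched below. The matcher scores and weights are illustrative numbers, not outputs of any particular system:

```python
import numpy as np

def fuse_scores(modality_scores, weights):
    """Score-level (late) fusion: z-normalize each modality's match scores
    against the gallery, then combine them with a weighted sum."""
    fused = np.zeros(len(modality_scores[0]))
    for scores, w in zip(modality_scores, weights):
        s = np.asarray(scores, dtype=float)
        s = (s - s.mean()) / (s.std() + 1e-9)   # per-modality normalization
        fused += w * s
    return fused

# hypothetical match scores of one probe against a 3-person gallery
face_scores  = [0.9, 0.2, 0.1]   # face matcher favors person 0
voice_scores = [0.4, 0.8, 0.3]   # voice matcher favors person 1
fused = fuse_scores([face_scores, voice_scores], weights=[0.6, 0.4])
identity = int(np.argmax(fused))
```

Normalizing each modality before summing is what makes the fusion meaningful: raw face and voice scores live on different scales, and without it one modality would silently dominate.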
3D Computer Vision & Graphics
Vision technologies can create replicas of the real world and analyze and understand them, not only through the geometry and appearance of objects and surfaces but also through their motion and behavior. 3D acquisition and display devices and techniques have advanced significantly in recent decades. In parallel with this progress, 3D information has become an essential component of vision and graphics technologies as a means to better analyze and understand the visible world, as well as to convey information during human-computer interaction, with applications in numerous fields such as medicine, education, entertainment, bioinformatics, biometric security, and surveillance. Since 3D in principle carries more information than conventional 2D images and, when used effectively, is better suited to human perception, and given the continuing advancement of 3D acquisition and display devices, 3D information is expected in the near future to be a key component of most vision technologies and the primary means of conveying visual information. The MVGL is actively involved in research addressing these problems, especially those common to 3D vision and graphics. Topics of interest include (but are not limited to) 3D shape correspondence, 3D reconstruction from depth/RGB video, deep learning for computer vision, 3D scene capture and analysis, surface reconstruction, 3D object retrieval and recognition, and multi-camera motion capture.
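A building block shared by several of these topics (multi-camera capture, surface reconstruction, shape correspondence) is rigid registration of two corresponding 3D point sets. A minimal sketch of the classic Kabsch/Procrustes solution, shown here on synthetic points rather than real capture data:

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid alignment: find rotation R and translation t
    minimizing sum ||R p_i + t - q_i||^2 over corresponding points."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))    # guard against a reflected solution
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

# synthetic check: recover a known rotation about the z-axis and a translation
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
c, s = np.cos(0.3), np.sin(0.3)
R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t_true = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + t_true
R_est, t_est = kabsch(P, Q)
```

In practice the correspondences themselves are unknown and must be estimated (e.g. iteratively, as in ICP); this closed-form step is the inner solver once correspondences are fixed.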
The Internet Protocol (IP) architecture is highly flexible in accommodating a wide range of multimedia communication applications, ranging from the ongoing replacement of classical telephone services by Voice over IP (VoIP) to new video-over-IP services that aim to replace classical TV with Internet TV. Transmission of video and graphics over IP is currently an active research and development area where significant results have already been achieved. Video-on-demand services, for both news and entertainment applications, are already offered over the Internet, and 2.5G and 3G mobile network operators have started to use IP successfully to offer wireless video services. Flexible transport of a variety of 3DTV representations over IP networks is a natural extension of monoscopic video-over-IP applications and an active research area in the MVGL.
Video streaming architectures can be classified as i) server unicasting to one or more clients, ii) server multicasting to several clients, iii) peer-to-peer (P2P) distribution, where each peer forwards packets to another peer, and iv) P2P multicasting, where each peer forwards packets to several other peers. Multi-view video streaming protocols include RTP/UDP/IP, the current state of the art, and RTP/DCCP/IP, the next-generation protocol stack. Multicasting can be supported at the network layer or the application layer. We are actively working on monocular and multi-view video streaming over DCCP and on P2P video streaming using various video codecs, including scalable and multiple-description video coders.
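To see why P2P multicasting offloads the server relative to plain unicasting, the toy model below compares the server's upload load against the forwarding depth of a complete fan-out-k distribution tree. The deployment numbers are hypothetical:

```python
def server_unicast_load(n_clients, stream_rate_kbps=1000):
    """Server unicasting: the server uploads one full-rate stream per client."""
    return n_clients * stream_rate_kbps

def p2p_tree_depth(n_peers, fanout):
    """Forwarding levels in a complete fan-out-k tree where the server feeds
    `fanout` peers and every peer forwards the stream to `fanout` others."""
    depth, reached, frontier = 0, 0, 1
    while reached < n_peers:
        frontier *= fanout       # peers reached at this new level
        reached += frontier
        depth += 1
    return depth

# hypothetical deployment: 1000 viewers of a 1 Mbps stream
unicast_upload = server_unicast_load(1000)   # kbps the server must upload alone
p2p_depth = p2p_tree_depth(1000, fanout=4)   # in P2P, the server uploads only 4 streams
```

The trade-off is visible in the two numbers: unicasting costs the server upload bandwidth linear in the audience size, while the P2P tree caps the server at `fanout` streams but adds a logarithmic number of forwarding hops (and thus latency and churn sensitivity) before the last peer is reached.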
Streaming video involves the encoding and packetization of digital video. Streaming systems may be one-way (video-on-demand) or two-way (interactive), and they may employ real-time or off-line video encoding. The main problem in streaming is that channel conditions in Internet and wireless applications are time-varying and often change rapidly. It is therefore of interest to adapt the rate of the video source, either through real-time encoding or by rate shaping, to best match the channel conditions. Channel feedback can be received either through application-layer mechanisms such as RTCP or directly from the physical layer, assuming a cross-layer design. We are working on rate-distortion-optimized and content-based rate adaptation of the H.264 video encoder, based on several strategies, to optimize the quality of video delivered by a streaming server. These strategies include transmission buffer management, switching between a fixed number of pre-encoded streams, multiple description coding, and error-resilient encoding at the server, as well as post-processing at the receiver.
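One of the strategies above, switching among a fixed number of pre-encoded streams, can be sketched as follows. The bitrate ladder, the safety margin, and the EWMA smoothing of channel feedback are illustrative choices, not the lab's actual algorithm:

```python
def ewma(prev_estimate, new_sample, alpha=0.3):
    """Smooth noisy per-chunk throughput feedback (e.g. from RTCP reports)."""
    return alpha * new_sample + (1 - alpha) * prev_estimate

def select_stream(bitrates_kbps, throughput_kbps, safety=0.8):
    """Pick the highest pre-encoded bitrate below a safety fraction of the
    estimated throughput; fall back to the lowest stream if none fits."""
    usable = [b for b in sorted(bitrates_kbps) if b <= safety * throughput_kbps]
    return usable[-1] if usable else min(bitrates_kbps)

streams = [250, 500, 1000, 2000]        # hypothetical pre-encoded bitrate ladder (kbps)
estimate = 1500.0                       # initial throughput estimate
for sample in [900.0, 700.0, 650.0]:    # channel degrades over three feedback reports
    estimate = ewma(estimate, sample)
chosen = select_stream(streams, estimate)
```

The safety margin keeps the selected rate below the raw estimate so that the transmission buffer can absorb short-term throughput dips instead of underflowing the decoder.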