Facial feature extraction has been attracting considerable attention over the past decade. Among the key facial features, the mouth is of particular importance, as its shape and shape dynamics convey the content of a communicated message, biometric information about the speaker, as well as the subject's emotion. For this reason, the problem of detecting, modeling and tracking lips has been studied intensively in the context of different applications. These include lip reading to enhance automatic speech recognition, lip modeling for speaking face synthesis in low bit rate communication systems and for communicating speech to people hard of hearing, expression recognition for affective computing, facial feature extraction for face retrieval from image and video databases, face modeling for photo fit kits and personal identity recognition and verification.
The aim of the project is to develop a system for automatic detection, modeling and tracking of lips in video of talking faces. The project will build on previous work carried out at the Centre for Vision, Speech and Signal Processing (CVSSP) by Ulises Ramos Sanchez and Budhaditya Goswami. The main goal of the project is to make the detection, modeling and tracking more robust, and to demonstrate the utility of the resulting system in two applications: personal identity recognition and verification, and man-machine interfaces.
The project can be modularised into the following sections:
A prerequisite for lip shape modelling is that the mouth region in a face image is reliably detected and the lip pixels are extracted. The current implementation of the lip detection, modelling and tracking system uses a very simple technique of mouth region extraction based on a coarse face model. The aim of this task will be to investigate more sophisticated techniques of image segmentation and object detection, such as those based on local binary masks for texture representation and AdaBoost trained detectors.
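As a rough illustration of the kind of texture representation mentioned above, the following sketch computes a basic 8-neighbour local binary pattern (LBP) code for each interior pixel of a grey-level image and collects the codes into a histogram. It assumes that "local binary masks" refers to the LBP operator; the function names `lbp_code` and `lbp_histogram`, the neighbour ordering and the list-of-lists image format are illustrative choices, not part of the existing system. Training an AdaBoost detector on such features is not shown.

```python
def lbp_code(image, row, col):
    """Return the basic 8-neighbour LBP code of the pixel at (row, col).

    `image` is a list of rows of grey levels; (row, col) must be an
    interior pixel so that all eight neighbours exist.
    """
    centre = image[row][col]
    # Neighbours are visited clockwise, starting from the top-left pixel.
    # This ordering is an arbitrary (assumed) convention.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Return the 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist
```

A histogram of this kind, computed over a candidate window, could serve as the feature vector passed to a detector; the choice of window size and classifier is left open here.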
A variety of techniques were studied in the literature to solve the problem of lip region segmentation. A novel technique for lip segmentation has been proposed. The system description is shown in Figure 1.

The system makes use of a robust statistical estimator, the Minimum Covariance Determinant (MCD) estimator, to delineate the largest chromatically correlated region in the lower half of the face. A binary image can thus be created, consisting of the dominant cluster and the remaining pixels. Performing connected components analysis on this binary image, followed by a simple cost function (based on the distance from the image centre and the area of the connected region), yields a reliable classification of the lip region. Example results are shown in Figure 2.
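The last two steps of this pipeline can be sketched as follows. This is a minimal sketch, not the project's implementation: it assumes the MCD step has already produced a binary mask in which 1 marks pixels outside the dominant chromatic cluster, it uses 4-connectivity, and the cost function (centroid distance from the image centre minus a weighted square root of the area) is a hypothetical stand-in, since the exact form of the cost is not given here. The names `connected_components`, `region_cost`, `pick_lip_region` and `area_weight` are illustrative.

```python
import math
from collections import deque


def connected_components(mask):
    """Return the 4-connected regions of truthy pixels as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


def region_cost(pixels, shape, area_weight=1.0):
    """Hypothetical cost: low for large regions close to the image centre."""
    rows, cols = shape
    cy = sum(y for y, _ in pixels) / len(pixels)
    cx = sum(x for _, x in pixels) / len(pixels)
    distance = math.hypot(cy - (rows - 1) / 2, cx - (cols - 1) / 2)
    return distance - area_weight * math.sqrt(len(pixels))


def pick_lip_region(mask, area_weight=1.0):
    """Return the lowest-cost connected region, or None if the mask is empty."""
    regions = connected_components(mask)
    if not regions:
        return None
    shape = (len(mask), len(mask[0]))
    return min(regions, key=lambda p: region_cost(p, shape, area_weight))
```

With a cost of this shape, small isolated blobs near the image border are penalised twice, once for their distance and once for their small area, which is the intuition behind combining the two terms.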

The above results show example images from the stages in the processing in the following order:
The remaining tasks are briefly described in the next few sections.
The system developed previously within the CVSSP uses splines to create lip shape models. Currently, the location of the nodes of the model during fitting is purely data driven. This may result in unrealistic models of lip shape being generated. The aim of this task will be to develop a statistical model of spline node interactions and use this model to enhance the process of spline fitting.
The current system fits a lip shape model on a frame-by-frame basis. Clearly, the motion of the lips can be predicted, to a certain extent, from knowledge of their past evolution. For instance, when the mouth starts opening, it is likely to continue opening. When it reaches a fully open state, it can be expected to start closing. A dynamic model of lip shape expressed in terms of a Kalman filter, particle filter or hidden Markov model should be able to capture the lip dynamics. The aim of this task is to explore the use of some of these models to exploit the temporal context in lip dynamics to enhance the quality and speed of the modeling process.
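To make the idea of temporal prediction concrete, the sketch below tracks a single scalar lip measurement (for example, mouth opening height in pixels) with a constant-velocity Kalman filter. It is a minimal sketch under stated assumptions: the state is position and velocity only, the process noise is taken as a diagonal matrix with equal entries, and the class name `ConstantVelocityKalman` and all parameter values are illustrative rather than taken from the existing system. A full lip model would track many shape parameters jointly.

```python
class ConstantVelocityKalman:
    """Kalman filter for one scalar measurement with state [position, velocity]."""

    def __init__(self, position=0.0, velocity=0.0, dt=1.0,
                 process_var=1e-3, measurement_var=1e-1):
        self.pos = position
        self.vel = velocity
        self.dt = dt
        self.q = process_var      # assumed: same variance added to both states
        self.r = measurement_var  # variance of the measurement noise
        # State covariance P, stored element-wise; starts as the identity.
        self.p00, self.p01, self.p10, self.p11 = 1.0, 0.0, 0.0, 1.0

    def predict(self):
        """Advance the state by one frame and return the predicted position."""
        dt = self.dt
        self.pos += dt * self.vel
        # P <- F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos

    def update(self, measurement):
        """Correct the state with a new measurement and return the position."""
        innovation = measurement - self.pos
        s = self.p00 + self.r          # innovation variance, H = [1, 0]
        k0 = self.p00 / s              # Kalman gain for position
        k1 = self.p10 / s              # Kalman gain for velocity
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # P <- (I - K H) P
        p00 = (1.0 - k0) * self.p00
        p01 = (1.0 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.pos
```

In use, `predict()` would be called once per frame to obtain a starting estimate for model fitting, and `update()` would be called with the value measured from the fitted model. While the mouth opens at a steady rate, the velocity estimate settles near that rate, so the prediction for the next frame continues the opening motion.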
The approaches proposed in the above tasks can be compared with other approaches, such as the face appearance model, which fits and tracks the complete face simultaneously, as well as approaches which model lip dynamics in terms of mouth region pixel motion patterns. As far as possible, the comparative evaluation will be carried out using publicly available software, or in collaboration with research groups elsewhere.
The lip shape modeling and tracking system will be evaluated in the context of personal identity verification. Two multimodal video databases, namely the XM2VTS and BANCA databases, are available for use. The biometric models of subjects based on lip dynamic properties will be evaluated using the standard evaluation protocols defined for these databases.
The lip modeling and tracking system will be evaluated either in face expression recognition or in speech recognition. If we opt for the latter application, a commercial speech recognition system will be used for the study, with acoustic information replaced or augmented by lip shape dynamics.
For more information, please contact Prof. Josef Kittler.