Multimodal Hierarchical Face Recognition using Information from 2.5D Images

Face recognition under uncontrolled acquisition environments faces major challenges that limit the deployment of real-life systems. 2.5D information can improve the discriminative power of such systems in conditions where RGB information alone would fail. In this paper we propose a multimodal extension of a previous work, based on SIFT descriptors of RGB images, integrated with LBP information obtained from depth scans, and modeled by a hierarchical framework motivated by principles of human cognition. The framework was tested on the EURECOM dataset, where the inclusion of depth information significantly improved the results in nearly all tested conditions, compared to independent unimodal approaches.


Introduction
Over the past few years, face recognition has been in the spotlight of many research works in pattern recognition, due to its wide array of real-world applications. The face is a natural, easily acquirable trait with a high degree of uniqueness, representing one of the main sources of information during human interaction. These marked advantages, however, fall short when images of limited quality, acquired under unconstrained environments, are presented to the system. Whereas technological improvements in image capturing and transmitting equipment managed to attenuate most noise factors, partial face occlusions still pose a genuine challenge to automated face recognition (Li et al., 2014). Facial occlusions may occur for a multiplicity of deliberate or unintentional reasons. Whereas accessories, such as sunglasses and scarves, and facial hair represent quite common sources of occlusion in daily life, they can also be exploited by bank robbers and shop thieves in an attempt to avoid recognition. Furthermore, the use of some accessories might be enforced in restricted environments (such as medical masks in hospitals and protection helmets in construction areas) or by religious or cultural constraints (Min et al., 2014a).

The fact that humans perform and rely on face recognition routinely and effortlessly throughout their daily lives leads to an increased interest in replicating this process in an automated way, even when such limitations are known to occur frequently. Even though there is no consensus in the cognitive science field as to how the human brain recognizes faces, either based on their individual local features or, more holistically, on the basis of their overall shape, several works have shown that both levels of information play a non-negligible role in human face perception (Schwaninger et al., 2007; Gold et al., 2012). In previous works (Monteiro and Cardoso, 2015a,b), the authors explored the global precedence hypothesis for human perception as the basis for a new decision strategy to guide the face recognition process, in a hierarchical manner, in RGB color images. This hypothesis claims that face recognition is performed by the human brain in a global-to-local flow, with holistic information taking precedence over a more detailed local analysis.

In the present work, we build upon the aforementioned research, incorporating information from the three-dimensional structure of the face, through the use of 2.5D depth images acquired using the low-cost Microsoft Kinect sensor. By exploring information from an extra spatial dimension we aim to make the original algorithm more robust in scenarios, such as critically low illumination, where the acquisition of color images is severely limited. With this goal in mind, we performed a detailed analysis of the state-of-the-art works on 3D face recognition, in order to identify research trends that help guide the design of the extension of the referred previous works, as well as to define prospects for future research.

We start by presenting in Section 2 a thorough review of the state-of-the-art concerning the evolution of face recognition to 3D scenarios, with special focus on recent works on Kinect depth images. We then present, in Section 3, a detailed description of the extension of the original hierarchical algorithm to incorporate depth information. The most relevant preliminary results are presented and discussed in Section 4, while the main conclusions and prospects for future work can be found in Section 5.

State-of-the-art in 3D Face Recognition
As referred in the previous section, face recognition is a challenging pattern recognition problem, especially in the presence of variations in illumination conditions, occlusions, pose and facial expression changes and disguises. Due to the inherent 3D structure of the face, changes in illumination and non-frontal poses can alter the appearance of some facial features, thus conditioning the performance of the system. To overcome the decrease of performance in these situations, 3D face recognition can be used to improve the recognition rate, giving a more robust facial description that is not affected by illumination variation and possibly leading to a greater discriminative power.

There are two main ways of representing 3D facial structure (Abate et al., 2007): 2.5D images and 3D images. The 3D images retain all the facial geometry information, whereas the 2.5D or range images are a bi-dimensional representation of a set of 3D points in which each pixel in the XY plane stores the depth z value. The disadvantage of the latter representation is that it only takes information from one point of view, allowing only a single facial model. Also, the 3D image depends only on internal anatomical structure, while 2.5D scans are affected by environmental conditions and external appearance. Both of these representations can increase the performance of recognition algorithms, but it is important to evaluate in which type of systems the acquisition of 3D facial data is feasible.

Table 1 lists some of the acquisition solutions most often used in the creation of the datasets found in the literature. The offered solutions can be stereoscopic camera systems, structured light systems or laser range systems, obtaining both 3D and intensity information. While the Minolta and Inspeck sensors are generic 3D sensors, CyberWare and 3dMD were designed specifically for 3D face scanning. All of those solutions are very precise, yet they are also very expensive. The natural evolution of 3D sensors is towards low-cost sensors, with a decrease in resolution, that could be used for the creation of real-time systems that are cheap but at the same time robust enough to perform face recognition in adverse conditions.

Kinect is one of the most used sensors, and it contains an infra-red (IR) laser emitter and an IR camera in addition to an RGB camera. The RGB camera captures the RGB images directly, whereas the laser emitter and IR camera work together to capture the depth map. The depth map is obtained via a triangulation process based on those two sensors. First the IR laser projects a predesigned pattern of spots in the scene (using a raster) and the reflection of the pattern is captured by the IR camera (Min et al., 2014b). Although the Kinect is the most commonly used low-cost sensor for this type of application, the SoftKinetic DS325 (Mracek et al., 2014), the Structure sensor (a mobile sensor for tablets) (Gutfeter and Pacut, 2015) and PrimeSense (Min et al., 2012) (recently acquired by Apple and currently not available for purchase) have also been used in some facial recognition datasets, being alternatives to the Microsoft sensor.

Using these sensors, many datasets are available for algorithm testing. All these datasets can be divided in two groups: the high-resolution scans that use high-quality and expensive scanners like the Minolta and 3dMDface systems, and the low-resolution scans that use low-cost sensors with lower precision and resolution like the Kinect, SoftKinetic and Structure sensors. The databases of facial surfaces should have a large variety of subjects and conditions in order to simulate the most important challenges in facial recognition (pose, facial expression, illumination and occlusion). The information regarding these sensors was obtained from (Min et al., 2014b), where most of these sensors and databases were analysed in detail.

The first datasets created for the 3D facial recognition problem used high-precision sensors. Some of the most important datasets are the Bosphorus, York, FRGC, GavabDB, Binghamton University, Texas-3D, UMB-DB and 3D-RMA (Abate et al., 2007). All these datasets use expensive, high-resolution sensors. Alongside the evolution of sensors towards low cost, lower resolution and faster acquisition, recent databases were also constructed with this type of sensor. Some examples are the CurtinFaces (Li et al., 2013), NASK-StructureFacebase (Gutfeter and Pacut, 2015), BIWI Kinect Head Pose Dataset (Hayat et al., 2015), UWA Kinect dataset (Hayat et al., 2015), FaceWareHouse (Cao et al., 2014) and EURECOM dataset (Min et al., 2014b). Although 2D+3D datasets are still few in number compared to 2D datasets, these databases are increasing in number and in variety and are fundamental to the testing and performance assessment of new algorithms. The specifications of these datasets are shown in Table 2. Through the analysis of Table 2 we can observe that the EURECOM database seems to be the most complete one, although its number of scans is limited. Testing with different types of sensors and conditions is crucial for the construction of a good framework.

The generation of 2.5D or 3D datasets requires frameworks designed for 2D images to be adapted so that they can receive three-dimensional information as input. Most of the datasets presented above were built due to the need for an objective assessment of how newly designed algorithms work on a variety of new 3D face recognition problems. There are three main types of approaches for this pattern recognition problem: 2D-based, 3D-based and multimodal. The first type uses synthetic 3D face models to increase the robustness with respect to pose variations as well as to changes in illumination and facial expression. 3D-based methodologies do not use intensity information and rely only on 3D or 2.5D data. Finally, multimodal approaches fuse the information used by the two previous types.
The 2D-based approaches were at the genesis of 3D facial recognition and, although they only use a 2D input query face, a 3D model is used to improve the robustness of the system. In approaches such as (Blanz and Vetter, 2003), (Lu et al., 2004) and (Hu et al., 2004), many virtual 3D models are generated to simulate the variations in pose and facial expression. The problems with these approaches are the several limitations in constructing a model from a single frame, and the lack of realism of the generated models.

The use of methodologies based only on 3D data, thus called unimodal, has been shown to be a good alternative to RGB in conditions of varying illumination, facial expression and pose. The main problem with such approaches concerns the need for a correct alignment of 3D data between two face surfaces. In 1992, Besl (Besl and McKay, 1992) introduced the Iterative Closest Point (ICP) algorithm, which can be used to align facial models. One of the first works on 3D facial recognition was introduced by Gordon (Gordon, 1991), based on the calculation of distance measures between some regions (like the shape of the forehead, jaw line, eye corner cavities and cheeks). A few years later, Tanaka (Tanaka et al., 1998) proposed a curvature-based approach. By extracting the principal curvatures and their orientations in a facial model, some features are extracted and mapped onto two unit spheres, the Extended Gaussian Images (EGI). Chua (Chua et al., 2000) identified some regions (nose, eye socket and forehead) through a two-by-two Point Signature comparison among different facial expressions of the same person; a similarity measure is then used with a rank vote process based on a training indexed table. Local shape descriptors in this type of scan were introduced by Moreno (Moreno et al., 2003), where different regions are segmented using the signs of the median and Gaussian curvatures, isolating regions with significant curvatures. Some features are extracted (areas, distances, angles, area ratios, mean of areas, mean curvatures, variances, etc.) in order to achieve a good description of these regions.

In recent works, Rui Min (Min et al., 2012), using the Apple PrimeSense, proposed a canonical face based system using only frontal pose images. The facial region obtained is divided into nose, eye region, cheeks and the remaining parts (each region is associated with a respective weight). A feature vector is formed containing the L2 distances between each facial region and their corresponding areas. Naveen (Naveen and Moni, 2015), working on the FRAV3D database, proposed the use of 2D spectral and 2D spatial domain information to solve the problem of facial recognition, based on the 2D DWT (Discrete Wavelet Transform) and 2D DCT (Discrete Cosine Transform). Using landmark detection based on three principal curvatures, Tang (Tang et al., 2015) determines the geometric properties of each vertex using an asymptotic cone in order to generate three curvature faces, to which Local Normal Patterns are applied. Neto (Cardia Neto and Marana, 2015) used 3D Local Binary Patterns (LBP) with Histograms of Oriented Gradients (HOG) on Kinect EURECOM images. Bondi (Bondi et al., 2015) also used real-time Kinect sequences, generating high-resolution models every time someone passes by the sensor. Keypoints are detected using SIFT and spatial clustering, and used in pairs to evaluate the facial curves between pairs of points.
The inclusion of two modalities has been shown to be the most promising option for real-time systems and uncontrolled environments. The results have consistently improved with the fusion of 2D and 3D modalities (Abate et al., 2007). In 2003, Chang et al. (Chang et al., 2003) investigated the benefits of integrating 3D data (using a Minolta Vivid 900 sensor) with 2D images, using PCA separately on 2D and 3D data. The authors state that 2D and 3D individually achieve similar performance, but when combined (with a simple weighting system) they achieve a significant increase in performance. Tsalakanidou et al. (Tsalakanidou et al., 2003) applied Eigenfaces on both 2.5D and 2D scans. Here the multimodal approach has shown significant improvements when compared with independent 2.5D and 2D recognition. Later, Mian (Mian et al., 2007) (Mian et al., 2008) proposed some new approaches for multimodal face recognition. In 2007 (Mian et al., 2007), using a combination of the 3D Spherical Face Representation (SFR) and 2D SIFT, a large part of the candidate faces is discarded. Then the eyes-forehead and the nose regions are automatically segmented. One year later he proposed a new method based on a new keypoint detector, which exploits the high shape variations in 3D data, and on Local Feature Matching (Mian et al., 2008). Bennamoun (Hayat et al., 2015) proposed a new raw depth pose estimation, not assuming a strong statistical relationship between the training data and the query faces, followed by the application of a Riemannian manifold for feature selection.
As we can observe, recent developments rely on both unimodal 3D and multimodal approaches for developing face recognition frameworks, although the multimodal ones seem to be the most promising strategies for real-time systems. Table 3 summarizes the most relevant information extracted from the works described above, regarding a series of parameters (feature extraction, classifiers, datasets, ...) whose choice must be carefully considered when designing and assessing a new approach to 3D face recognition.

Original algorithm overview
The hierarchical recognition algorithm that we build upon in the present work was first proposed and explored in two previous works (Monteiro and Cardoso, 2015a,b) and is schematically represented in Figure 1. Figures 1(a) and 1(b) depict the enrollment process in the proposed approach. During enrollment, a new individual's biometric data is added to a previously existing database of individuals. For each individual, a hierarchical ensemble of M partial face models is trained. The M individual-specific models are built by maximum a posteriori (MAP) adaptation of the corresponding set of M universal background models (UBM) using individual-specific data. The UBM is a representation of the distribution that a biometric trait presents in the universe of all individuals. MAP adaptation works as a specialization of the UBM based on each subject's biometric data. The idea of MAP adaptation of the UBM was first proposed by Reynolds (Reynolds et al., 2000), for speaker verification.

The database is probed during the recognition process to assess either the validity of an identity claim (verification) or the k most probable identities (identification) given an unknown sample of biometric data. In the aforementioned previous works, the authors proposed an innovative approach to the recognition process based on the global precedence hypothesis of face perception by the human brain. Recognition is performed hierarchically, as depicted in Figure 1(c), with global models taking precedence over more detailed ones. Partial models are hierarchically organized into levels. Each level is composed of a set of non-overlapping subregions, $I_l$, of equal size, with the subregions at the same level summing to the full-face image, $I_0$.
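As a rough illustration of the MAP adaptation step, the sketch below adapts only the means of a diagonal-covariance GMM towards a subject's data, using the mean-update rule commonly associated with Reynolds et al. (2000). The relevance factor `r`, the function names and the toy UBM are assumptions made for this example; the original works may adapt other parameters or use different settings.

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at point x."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def map_adapt_means(weights, means, variances, data, r=16.0):
    """Mean-only MAP adaptation of a UBM; r is an assumed relevance factor."""
    n_comp, dim = len(means), len(means[0])
    counts = [0.0] * n_comp                       # soft counts n_k
    sums = [[0.0] * dim for _ in range(n_comp)]   # sum of gamma_k(x) * x
    for x in data:
        lik = [w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances)]
        total = sum(lik)
        if total == 0.0:
            continue  # numerical underflow: sample ignored
        for k in range(n_comp):
            gamma = lik[k] / total                # component posterior
            counts[k] += gamma
            for d in range(dim):
                sums[k][d] += gamma * x[d]
    adapted = []
    for k in range(n_comp):
        alpha = counts[k] / (counts[k] + r)       # data-dependent adaptation weight
        if counts[k] > 0:
            expected = [s / counts[k] for s in sums[k]]
        else:
            expected = list(means[k])
        adapted.append([alpha * e + (1.0 - alpha) * m
                        for e, m in zip(expected, means[k])])
    return adapted

# Toy UBM with two components and a few descriptors from one subject
ubm_weights = [0.5, 0.5]
ubm_means = [[0.0, 0.0], [5.0, 5.0]]
ubm_vars = [[1.0, 1.0], [1.0, 1.0]]
subject_data = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]]
subject_means = map_adapt_means(ubm_weights, ubm_means, ubm_vars, subject_data)
```

In this toy case the component close to the subject's data is pulled towards it, while the distant component stays practically at its UBM value, which is the behaviour expected when little enrollment data is available.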
During recognition, a test image from an unknown source follows the hierarchical flow depicted in Figure 1(c), until a decision can be made with a significant degree of certainty. The significance of a decision carried out at a single level is assessed from the likelihood-ratio values obtained for each possible identity claim, through the computation of a certainty index (Equation 1). This index relates the highest observed likelihood-ratio value (presumed true identity) to the average of all other values (average impostor). If the certainty index exceeds a previously optimized threshold, the maximum likelihood-ratio decision is accepted.

When the certainty index does not exceed the threshold, however, the algorithm considers that an analysis at a more detailed level is necessary to achieve a decision with a higher degree of confidence. At this point, the algorithm proceeds to the next level, working on the subregions of the second level in the hierarchical chain depicted in Figure 1. When a level is composed of multiple subregions, each one of them is treated independently, and only the maximum certainty index among them is considered. All models are trained using Gaussian Mixture Models (GMM) and sets of SIFT keypoint descriptors for feature representation. In the next section we present some alterations to these choices in order to adapt the outlined framework to depth images.
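The global-to-local decision flow described above can be sketched as follows. The exact form of the certainty index is not reproduced here, so the function `certainty_index` uses an assumed stand-in (the gap between the highest likelihood-ratio value and the mean of the remaining ones); the data layout, function names and threshold values are likewise hypothetical.

```python
def certainty_index(scores):
    """Assumed stand-in: best score minus the mean of all other scores.

    scores maps each enrolled identity to its likelihood-ratio value and
    must contain at least two identities.
    """
    best_id = max(scores, key=scores.get)
    others = [s for ident, s in scores.items() if ident != best_id]
    return best_id, scores[best_id] - sum(others) / len(others)

def hierarchical_decision(levels, threshold):
    """Walk the hierarchy from the global level to the most local one.

    levels: list ordered from global to local; each entry is a list with one
    score dictionary per subregion of that level.
    Returns (identity, level_index, confident).
    """
    best = None
    for depth, subregions in enumerate(levels):
        # Subregions are treated independently; keep the most certain one
        candidates = [certainty_index(s) for s in subregions]
        best = max(candidates, key=lambda c: c[1])
        if best[1] > threshold:
            return best[0], depth, True
    # Last level reached without a sufficiently certain decision
    return best[0], len(levels) - 1, False

levels = [
    [{"A": 1.0, "B": 0.9, "C": 0.8}],                                   # full face
    [{"A": 1.1, "B": 1.0, "C": 0.9}, {"A": 0.2, "B": 2.0, "C": 0.1}],   # two subregions
]
decision = hierarchical_decision(levels, threshold=1.0)
```

Here the full-face level is ambiguous, so the decision is deferred to the second level, where one subregion clearly favours identity "B". The `confident` flag is the natural hook for a reject option, since it marks images that reach the last level without a certain classification.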

Proposed extension to depth images
In the present work we carried out some preliminary experiments using the framework detailed in the previous section, but using Kinect depth images as the input for the whole system. The architecture of the system remains as described above and depicted in Figure 1. After analysing some of the recent works listed in Section 2, we decided to focus the extension of our framework on feature description. With this in mind, two feature descriptors were chosen to describe Kinect depth images, taking the place of the SIFT keypoint descriptor from the original works:
− Dense SIFT grid: while the original SIFT algorithm includes a keypoint detection block, the noisy nature of depth images, associated with their low intrinsic detail, might severely hinder the correct functioning of this detection. Therefore, we used a dense grid of equally spaced keypoints to compute the SIFT descriptors and guarantee that enough information is present for robust modeling.
− Local Binary Patterns (LBP): as an alternative to dense SIFT, we also perform uniform LBP description locally on a set of 4 × 4 sub-images. The resulting histograms are then concatenated to achieve a full description of the image. We chose LBP not only due to its vast array of applications in computer vision, in works concerning texture description, but also because of the promising performance it presented in some recent datasets built with Kinect depth images (Min et al., 2014b).
With this extension we end up with two instances of the whole hierarchical framework, based on either RGB or depth images. In the next section we discuss how information from both sources is integrated into a single decision.
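To make the LBP description more concrete, the sketch below computes uniform LBP codes with 8 neighbours at radius 1 and concatenates the normalised histograms of a 4 × 4 partition of the image. The neighbourhood size, the 59-bin layout (58 uniform patterns plus one shared bin) and the histogram normalisation are assumptions chosen for the example, as these settings are not specified in the text; the synthetic image only stands in for a cropped depth scan.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code of pixel (r, c), neighbours read clockwise."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code

def is_uniform(code):
    """True when the circular 8-bit pattern has at most two 0/1 transitions."""
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8)) <= 2

# Each uniform pattern gets its own bin; all other patterns share the last bin
UNIFORM_BINS = {code: i for i, code in
                enumerate(c for c in range(256) if is_uniform(c))}
N_BINS = len(UNIFORM_BINS) + 1

def lbp_descriptor(img, grid=4):
    """Concatenate normalised uniform-LBP histograms over a grid x grid partition."""
    rows, cols = len(img), len(img[0])
    # Border pixels are skipped because they lack a full neighbourhood
    codes = [[lbp_code(img, r, c) for c in range(1, cols - 1)]
             for r in range(1, rows - 1)]
    h, w = len(codes), len(codes[0])
    descriptor = []
    for gr in range(grid):
        for gc in range(grid):
            hist = [0] * N_BINS
            for r in range(gr * h // grid, (gr + 1) * h // grid):
                for c in range(gc * w // grid, (gc + 1) * w // grid):
                    hist[UNIFORM_BINS.get(codes[r][c], N_BINS - 1)] += 1
            total = sum(hist) or 1
            descriptor.extend(v / total for v in hist)
    return descriptor

# Synthetic 34 x 34 "depth image" used only to exercise the code
depth = [[(3 * r + 5 * c) % 17 for c in range(34)] for r in range(34)]
features = lbp_descriptor(depth)
```

With these settings each sub-image contributes one 59-bin histogram, so the full descriptor has 16 × 59 = 944 entries.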

Multimodal fusion
In this work we performed fusion at the score level, using the likelihood-ratio values from two hierarchical pipelines: one for RGB images, $s_{RGB}$, and one for depth images, $s_{D}$. The final fusion score, $s_{F}$, is obtained by a weighted average of the two scores, $s_{F} = w_{RGB} \times s_{RGB} + w_{D} \times s_{D}$. The optimal values for the weighting parameters were found through grid search, under the constraint $\sum_i w_i = 1$.
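The score-level fusion and the grid search over the weights can be sketched as below. Because the two weights must sum to one, a single weight for the RGB pipeline is searched and the depth weight is its complement. The selection criterion (rank-1 rate), the grid resolution and the toy scores are assumptions for illustration only; in practice the weights would normally be tuned on data separate from the final test set.

```python
def fuse(score_rgb, score_depth, w_rgb):
    """Weighted average of the two scores, with weights summing to one."""
    return w_rgb * score_rgb + (1.0 - w_rgb) * score_depth

def rank1_rate(rgb_scores, depth_scores, labels, w_rgb):
    """Fraction of probes whose top fused score belongs to the true identity."""
    hits = 0
    for rgb, dep, true_id in zip(rgb_scores, depth_scores, labels):
        fused = {ident: fuse(rgb[ident], dep[ident], w_rgb) for ident in rgb}
        hits += max(fused, key=fused.get) == true_id
    return hits / len(labels)

def grid_search_weight(rgb_scores, depth_scores, labels, steps=10):
    """Return the RGB weight on a regular grid that maximises the rank-1 rate."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda w: rank1_rate(rgb_scores, depth_scores, labels, w))

# Toy likelihood-ratio scores for two probes and two enrolled identities
rgb_scores = [{"A": 2.0, "B": 1.0}, {"A": 1.0, "B": 0.8}]
depth_scores = [{"A": 0.0, "B": 0.5}, {"A": 0.0, "B": 2.0}]
labels = ["A", "B"]

best_w = grid_search_weight(rgb_scores, depth_scores, labels)
best_rate = rank1_rate(rgb_scores, depth_scores, labels, best_w)
```

In this toy example each modality alone misclassifies one of the two probes, while an intermediate weight classifies both correctly, which mirrors the motivation for fusing the two pipelines.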

EURECOM Dataset
The experiments were conducted on the EURECOM database. Acquired with a Kinect sensor, this database contains a set of well-aligned 2D, 2.5D, 3D and video data. It includes scans from 52 subjects (38 males and 14 females) acquired in two sessions separated by 5 to 14 days. Each session has nine types of scans: neutral face, open mouth, smiling, strong illumination, occlusion with sunglasses, occlusion by hand, occlusion by paper, right face profile and left face profile. The acquisition environment is controlled in terms of luminosity, with the individuals always at a distance of 0.7 to 0.9 meters from the sensor. A blank background was chosen to make the processing of the data easier. An example of the 2D and 2.5D images from a single individual is presented in Figure 2. We chose not to consider the profile images, as the designed framework is still limited as far as pose variations are concerned.

Experimental setup
In our framework, neutral face images were used to train the models and the remaining scans were used as query faces (profile images were excluded). The images were manually cropped so that only the facial region is analyzed.
We chose to assess the rate of correctly identified individuals by checking whether the true identity is present among the N highest ranked identities. The parameter N is generally referred to as the rank. This allows us to define the rank-1 recognition rate, r1, as the recognition rate at N=1.
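The rank-N recognition rate defined above can be computed as in the short sketch below. The data layout (one dictionary of scores per query image, keyed by enrolled identity) and the toy values are assumptions made for the example.

```python
def rank_n_rate(score_table, true_ids, n=1):
    """Fraction of queries whose true identity is among the n best-scored ones."""
    hits = 0
    for scores, true_id in zip(score_table, true_ids):
        # Identities sorted from the highest to the lowest score
        ranked = sorted(scores, key=scores.get, reverse=True)
        hits += true_id in ranked[:n]
    return hits / len(true_ids)

# Three query images scored against three enrolled identities
queries = [
    {"A": 0.9, "B": 0.5, "C": 0.1},
    {"A": 0.6, "B": 0.7, "C": 0.2},
    {"A": 0.3, "B": 0.8, "C": 0.5},
]
truth = ["A", "A", "C"]

r1 = rank_n_rate(queries, truth, n=1)
r2 = rank_n_rate(queries, truth, n=2)
```

In this toy case only the first query is correct at rank 1, whereas all three true identities appear among the two best candidates, so the rate grows from 1/3 at N=1 to 1 at N=2.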

Performance analysis
The main results obtained with the framework detailed above, for both RGB and depth images, are summarized in Tables 4-7. For each tested scenario we present the individual performance observed for each condition present in the EURECOM dataset: light on (LO), occluded eyes (OE), occluded mouth (OM), occluded paper (OP), open mouth (OPM) and smile (S). For each of these conditions and scenarios we define three reference values extracted from our framework:
− Full-face, L0: performance observed when considering only the first level of the hierarchical framework.
− Optimal, L*: performance observed for the full hierarchical framework, optimized with regard to the certainty threshold parameter.
− Reject option, L0.2: performance observed with the option of not classifying an image if it reaches the last level of the hierarchical framework with no certain classification being achieved. We chose to assess performance in the specific case of 20% rejection.

From the results obtained using RGB images, we can conclude that our 2D approach has a similar or higher performance in almost all tested conditions, even though a fair comparison can only be performed between our results and the SIFT approach presented in (Min et al., 2014b); the LGBP results are displayed in Table 4 because they were the best ones obtained in the same work. The SIFT algorithm tested by Min et al. (Min et al., 2014b) was outperformed by our GMM modeling approach to SIFT description. Only for faces occluded with paper can we observe worse results when compared to the literature. This indicates that more work needs to be done to overcome this limitation.

Tables 5 and 6 summarize the most relevant results concerning the application of our hierarchical framework to depth images. One common observation is that the rejection mode does not bring about any improvement, as it did in the original RGB scenario. This might relate to the higher probability of getting strong false positives from depth images. A very high score for a wrong identity exerts a strong limitation over the computation of the quality criterion defined in Equation 1. It is also interesting to note that the optimal performance from the whole hierarchical flow shows very little improvement for all the test scenarios, using dense SIFT, when compared to the holistic representation from the first level. The advantages of using our approach for this specific modeling strategy can, therefore, be questioned. However, when comparing to the traditional SIFT detector and descriptor used by Min et al. (Min et al., 2014b), we can see that our approach of modeling a densely sampled grid of SIFT descriptors achieves a considerably higher performance and should be considered in its simplest form (using only the holistic representation from the first level) as an aid to more traditional RGB-based approaches.

When considering the LBP results, presented in Table 6, we observe that, as opposed to the dense SIFT modeling, the hierarchical framework brings about a considerable increase in performance for almost all test scenarios. All observed results are also slightly to considerably better than the performance observed for their SIFT counterpart, corroborating the observations presented by Min et al. (Min et al., 2014b) in the original work on the EURECOM dataset. For that reason, the multimodal fusion results shown below were obtained by considering the original SIFT formulation for 2D images and only the LBP version of the framework applied to 2.5D depth images.

Table 7 :
Multimodal fusion results obtained using information from RGB and depth images

Method                             LO     OE     OM     OP     OPM    S
Fusion LGBP (Min et al., 2014b)    1.000  0.894  0.981  0.856  0.981  1.000
Fusion LBP (Min et al., 2014b)     0.990  0.933  0.962  0.817  0.981  1.000
The analysis of the multimodal fusion results, presented in Table 7, shows that significant improvement was obtained with respect to both 2D and 2.5D unimodal alternatives, for all test scenarios except OP, where the discrepancy between the individual performances of the unimodal approaches serves as a simple justification for this observation. When comparing with the results from the state-of-the-art we can observe that, once again with the exception of the OP scenario, we achieved performance either in the same range as or slightly better than that reported in the literature. With these observations in mind we can conclude that our framework follows the trend observed in previous works, where multimodal fusion of multiple sources of information leads to an improvement over the individual performances. If we manage to improve the individual performance of each framework we should, thus, also be able to improve the discriminative power of multimodal fusion and, consequently, increase the real-life applicability of systems based on such approaches.

Conclusions and Future Work
In the present work we propose an extension of previous works on hierarchical face recognition to 2.5D Kinect depth images. We approach this problem by proposing alternative feature description strategies, such as dense SIFT and LBP. We matched or improved over state-of-the-art performance in most tested scenarios. However, some potential improvements can be suggested in order to achieve higher performance and also to further assess the effectiveness of the proposed algorithm in a wider variety of interesting scenarios. First of all, exploring feature descriptors other than the proposed ones, or even exploring the fusion of multiple features, might result in a more complete description and, thus, in better performance in a wider variety of acquisition conditions. One such condition, not yet tested because it is not present in the EURECOM dataset, is severe low illumination. This case, in theory, represents a situation where 3D information should considerably improve over RGB alone.

Figure 1 :
Schematic representation of the proposed algorithm and its main blocks: (a) training of the universal background models (UBM) using data from multiple individuals; (b) maximum a posteriori (MAP) adaptation of the UBM to generate individual-specific models; and (c) testing with new data from unknown sources. From (Monteiro and Cardoso, 2015a).

Figure 2 :
Example images from the EURECOM dataset, for a single subject: (a-g) RGB (h-n) depth

Table 1 :
List of some sensors used in 3D Facial Recognition

Table 2 :
List of some databases created for the assessment of 3D Facial Recognition algorithms (Ajmera et al., 2014)

[...]sentation for depth data. Using Kinect scans, Li et al. (Li et al., 2013) obtained canonical frontal views (shape and textural). Here the RGB data is also transformed to the discriminant color space and a sparse representation classifier (SRC) is applied to both types of scans. In high-resolution scans, Hiremath et al. (Hiremath and Manjunatha, 2013) used the Radon transform on both texture and depth images in order to obtain binary maps to crop the facial region. Gabor features are extracted from both types of scans, and the resulting vectors, to which PCA is applied, are used as the input to an AdaBoost classifier that selects the most discriminant features. Also using Kinect, Ajmera (Ajmera et al., 2014) proposed the use of SURF-based descriptors in Kinect scans (tested on the EURECOM and CurtinFaces datasets). Here, images with variation in pose are generated, and SURF is also used for face matching independently on depth and intensity images. Mrácek (Mracek et al., 2014) used Gabor and Gauss-Laguerre filters to describe texture and depth information. In recent works, Elaiwat (Elaiwat et al., 2015), in high-resolution scans, used curvelet coefficients to represent the facial geometrical features, to identify keypoints and to extract local information about their neighbourhood. Nair (Naveen et al., 2015) used a Local Polynomial Approximation (LPA) filter to obtain directional faces. These faces are optimized using the Intersection of Confidence Intervals (ICI) rule and feature extraction is done using mLBP. Krishnan (Krishnan and Naveen, 2015) introduced a new framework using entropy maps of the texture and depth maps independently and using saliency maps on texture images. Dai et al. (Dai et al., 2015), using Kinect data, proposed a new local descriptor for feature extraction after the use of Gabor filters: the Enhanced Local Mixed Derivative Pattern (ELMDP). Finally,

Table 3 :
Summary of the most relevant works concerning 3D face recognition

Table 4 :
Main results obtained using RGB images and SIFT feature description

Table 5 :
Main results obtained using depth images and dense SIFT feature description

Table 6 :
Main results obtained using depth images and LBP feature description