Dense-HOG-based drift-reduced 3D face tracking for infant pain monitoring

This paper presents a new algorithm for 3D face tracking intended for clinical infant pain monitoring. The algorithm uses a cylinder head model and recovers 3D head pose by aligning dynamically extracted templates based on dense-HOG features. The algorithm includes extensions for drift reduction, using re-registration in combination with multi-pose state estimation by means of a square-root unscented Kalman filter. The paper reports experimental results on videos of moving infants in hospital who are relaxed or in pain. Results show good tracking behavior for poses up to 50 degrees from upright-frontal. In terms of eye location error relative to inter-ocular distance, the mean tracking error is below 9%.


1 INTRODUCTION
Continuous pain monitoring of infants will benefit many clinical contexts. For example, in the context of gastro-esophageal reflux disease (GERD), infants suspected of GERD undergo 24-hour reflux monitoring, and pain monitoring will allow detailed analysis of the time relations between pain and reflux to improve diagnosis. Recent work on detecting discomfort [1] and acute pain [2] of infants shows that automatic monitoring can be based on video analysis of facial expressions (cf. [3]).
In this context, we study infant face tracking in this paper, as a prerequisite for facial expression analysis. Face tracking, in general, has many applications and many proposed solutions. Most solutions aim at tracking faces of adults with limited pose variations in front of a single camera (e.g. in human-computer interaction and vehicle driver monitoring). However, in our context the tracking problem has three characteristics that existing solutions cannot handle. Firstly, for infants, and especially very young ones, facial texture is less pronounced than for adults. For example, infants have no prominent eyebrows, wrinkles or creases, and their eyes are closed for long periods of time. Secondly, in clinical settings, parts of the face may be covered. For example, in case of monitoring for GERD, there are plasters on the face and a tube into the nose for a pH probe in the esophagus. Also, infants may have a pacifier in their mouth to reduce stress, with a large variety of appearances. In addition, cuddles, toys or blankets may partly occlude the face. Thirdly, infants being monitored are not oriented towards a camera, and more cameras may be needed to keep the face in view. For example, in case of monitoring for GERD, infants are in bed, where they can freely shift and turn their head. As a result, tracking has to handle much larger pose variations.
Here, we propose a face tracking algorithm that accommodates the above-mentioned characteristics. In order to show results focusing on the first two characteristics (less pronounced texture and partial occlusion), we present its single-camera version here. However, it is purposely developed for extension to multiple cameras, so as to fully accommodate the third characteristic. The algorithm is based on tracking 3D head pose using dense-HOG features, and it extends our work of [4] by including drift reduction. Below, Section 2 discusses our approach in relation to other work. Section 3 describes the basic algorithm, while Section 4 presents its extensions for drift reduction. Section 5 provides experiments and results.

2 APPROACH AND RELATED WORK
Our face tracking approach models the face as part of the head and solves the more general problem of tracking 3D head pose. Our main motivation is that this allows maximum use of image information for robustness, because visible features of both face and non-face parts of the head can be used. It also allows extension to multi-camera setups. As a further motivation, 3D head tracking informs us about head movements, which may serve as an extra parameter for pain detection.
For general face tracking, state-of-the-art approaches are based on aligning a deformable 2D or 3D face model to input images. Facial alignment methods, e.g. the Supervised Descent Method (SDM) [5] or the Constrained Local Model (CLM) method of [6], all require a form of training, and it is essential that the training result can capture all variations of shape and appearance. In our case this is hardly feasible, because of wide pose ranges and unknown appearances of plasters, pacifiers, etc. For this reason, we need an approach without a pre-determined model of appearance. Many such approaches [7] recover head motion using an assumed 3D head shape. Some, e.g. [8], recover motion from keypoints. For infant heads this is not feasible, as they have few and unstable points. Others recover motion by Lucas-Kanade template alignment [9], e.g. with a cylinder [10] or ellipsoid head model, and for multi-camera setups [11]. We adopt the same principle for our algorithm.
In [10,11], the principle of head pose recovery by 3D alignment is used with pixel intensity templates. This may give problems with less-pronounced texture and less-uniform illumination, as e.g. in the case of a sleeping infant in bed. For this reason, we use densely sampled Histogram-of-Oriented-Gradient ('dense-HOG') features [12], which can improve Lucas-Kanade alignment [13]. In addition, we use insights from [14] to introduce a probabilistic algorithm for drift reduction. Our main contribution is the use of dense-HOG features for 3D tracking, in a refined algorithm with drift reduction, and its evaluation for real-life infant monitoring conditions. The paper offers a comprehensive description of the algorithm for the single-camera case, with main innovations at the end of Section 3.1 and in Section 4.2, and minor innovations in Sections 3.2, 3.3 and 4.1.

3 THE BASIC ALGORITHM

3.1 Modeling
We represent 2D image locations as u = [u, v]^T ∈ U, where U = R². For an image I, we use U(I) ⊂ U to denote its pixel locations. We represent 3D points in homogeneous coordinates as x = [x, y, z, 1]^T ∈ X, where X = R³ × {1}. We model the infant head as an oriented 3D surface with centroid at [0, 0, 0, 1]^T. The head pose is defined by applying a 3D rigid-body transformation g ∈ SE(3), while using the vector p = [ω_x, ω_y, ω_z, t_x, t_y, t_z]^T ∈ R⁶ of exponential coordinates [15] of g as our representation of pose. From p we obtain the homogeneous matrix G for g as follows [15]:

G = e^p = I + p̂ + (1/2!) p̂² + (1/3!) p̂³ + · · · ,   (1)

where p̂ denotes the 4×4 twist matrix of p. Here, I is the 4×4 identity matrix. For practical calculation of the exponential mapping e^p and its inverse log G, see [15]. We assume full-perspective image projection as a function W from points x to locations u:

u = W(x; C) = [(Cx)₁/(Cx)₃, (Cx)₂/(Cx)₃]^T.   (2)

Here, C is the 3×4 camera projection matrix that combines the intrinsic and extrinsic characteristics of the camera. Because parts of a projected surface may not be visible, we also assume a function w(x; C) that yields positive values for visible surface points and zero for all other points. For posed-head points e^p x, we write W(x; Ce^p) instead of W(e^p x; C), etc., to associate the rigid-body transformation e^p with the camera. As a result, head point references x remain in the un-posed coordinate system centered at [0, 0, 0, 1]^T. The equivalence follows from (2).
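For illustration, the exponential-coordinate pose representation and the projection W can be sketched in Python; the helper names (`twist_matrix`, `pose_to_matrix`, `project`) are hypothetical, and `scipy.linalg.expm` is used as a generic stand-in for the closed-form exponential mapping of [15]:

```python
import numpy as np
from scipy.linalg import expm

def twist_matrix(p):
    """4x4 twist (hat) matrix of exponential coordinates p = [wx, wy, wz, tx, ty, tz]."""
    wx, wy, wz, tx, ty, tz = p
    return np.array([[0.0, -wz,  wy, tx],
                     [ wz, 0.0, -wx, ty],
                     [-wy,  wx, 0.0, tz],
                     [0.0, 0.0, 0.0, 0.0]])

def pose_to_matrix(p):
    """Homogeneous rigid-body matrix G = e^p via the matrix exponential of the twist."""
    return expm(twist_matrix(p))

def project(x, C):
    """Full-perspective projection W(x; C) of a homogeneous 3D point x by a 3x4 camera matrix C."""
    s = C @ x
    return s[:2] / s[2]
```

Note that W(x; Ce^p) is then simply `project(x, C @ pose_to_matrix(p))`, which keeps template points x in the un-posed coordinate system as described above.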
For an image I, we use I(u) to denote the appearance value a ∈ A associated with a pixel location u ∈ U(I). As a main innovation over [10,11], we use dense-HOG feature vectors as appearance values instead of pixel intensities. We use 36-dimensional vectors, from a HOG variant with blocks of 2×2 cells, cells of 8×8 pixels, and 9-bin histograms for signed orientations. The variant uses trilinear interpolation of location and orientation, and L2-Hys normalization [12]. In [4], we have experimentally shown how dense-HOG yields more accurate tracking than intensity, for a similar basic algorithm. We also use I(u) for non-pixel locations u ∈ U; it is then defined by linear inter-/extrapolation of pixel appearance values.
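The dense-HOG computation can be sketched as follows. This is a minimal illustration only: it uses hard nearest-neighbor binning instead of the trilinear interpolation used in the paper, and a simple clip-and-renormalize step for L2-Hys:

```python
import numpy as np

def dense_hog(img, cell=8, bins=9):
    """Minimal dense-HOG sketch: 9-bin signed-orientation cell histograms,
    2x2-cell blocks with L2-Hys-style normalization. Returns a grid of
    36-dimensional block descriptors."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)      # signed orientations in [0, 2*pi)
    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):
        for j in range(cw):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bin_idx[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            hist[i, j] = np.bincount(b.ravel(), weights=m.ravel(), minlength=bins)
    # 2x2-cell blocks -> 36-D descriptors with clip-and-renormalize (L2-Hys style)
    feats = np.zeros((ch - 1, cw - 1, 4 * bins))
    for i in range(ch - 1):
        for j in range(cw - 1):
            v = hist[i:i + 2, j:j + 2].ravel()
            v = v / (np.linalg.norm(v) + 1e-6)
            v = np.minimum(v, 0.2)                   # Hys clipping
            feats[i, j] = v / (np.linalg.norm(v) + 1e-6)
    return feats
```

An off-the-shelf alternative is `skimage.feature.hog` with `block_norm='L2-Hys'`, although it bins unsigned orientations, whereas the paper uses signed orientations.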

3.2 Template alignment using Lucas-Kanade gradient descent and IRLS
Our aim for tracking is to estimate the pose of the head in each new image based on pose knowledge derived from previous images. For this, we repeatedly align templates derived from previous images with new images. For initialization, we need an initial pose p_0 and a specification of the un-posed head surface in the initial image I_0. We present the algorithm for any general shape, but in the sequel we employ a cylinder shape, as it is least sensitive to initial pose error [10].
We define a template T ⊂ X × A as a set X(T) of 3D points situated on the un-posed head surface with associated appearance values, with the intuition that it represents a textured part of the head. We use T(x) to denote the appearance value a associated with 3D point x in template T. We also define a template extraction function τ(I, p; C) that yields a template by reverse-projecting pixel locations of image I (with their appearance values) onto the surface of the head, for a presumed pose p. Here, reverse projection of a location u is defined to yield (if it exists) the unique 3D point x = R(u) on the un-posed head surface that is visible and mapped to u by image projection for pose p, i.e. such that W(R(u); Ce^p) = u.
We define an alignment step as a process that takes an image I, a template T, and a start pose p_start and yields a stop pose p such that, for all template points x, the image appearance I(u) at projected locations u = W(x; Ce^p) is close to the template appearance T(x). We formulate this as an optimization for minimum sum-of-weighted-squared-error cost:

p = argmin_{p′} Σ_{x ∈ Ω(T, Ce^{p′})} w(x; Ce^{p′}) ‖I(W(x; Ce^{p′})) − T(x)‖²,   (3)

where Ω(T, Ce^{p′}) = {x ∈ X(T) : w(x; Ce^{p′}) > 0}. Here, summation is performed only over those template points x that are visible in the new pose, as defined by Ω, and the positive values of function w specify weights that adjust the influence of individual points in the alignment process.

To implement an alignment step, we use the Lucas-Kanade (LK) method of gradient descent optimization [9]. Starting from p = p_start, LK iteratively updates p by approximately solving a reformulation of (3) in terms of an update ∆p for p:

∆p_opt = argmin_{∆p} Σ_{x ∈ Ω(T, Ce^p)} w(x; Ce^p) ‖I(W(x; Ce^p e^{∆p})) − T(x)‖².   (4)

Comparing (4) with the standard LK formulation in [16], there is a difference in the function W and its arguments, and in the Ω-restriction on contributing points. Our motion model is a 3D→2D warp function W with three parameters ∆p, p, C, defined by W(x; ∆p, p, C) = W(x; Ce^p e^{∆p}). As a result, updating also differs from [16]: we update p using e^p ← e^p e^{∆p}. An LK-iteration approximates ∆p_opt in (4) using a first-order Taylor expansion of W(x; Ce^p e^{∆p}) for small ∆p to yield

∆p_opt ≈ H⁻¹ Σ_{x ∈ Ω(T, Ce^p)} w(x; Ce^p) [∇I (∂W/∂∆p)]^T (T(x) − I(W(x; Ce^p))),   (5)

where ∇I is the gradient [∂I/∂u, ∂I/∂v] of image I, ∂W/∂∆p is the Jacobian of the warp (for ∆p = 0 and given x, p and C), and H is the 6×6 Gauss-Newton approximation of the Hessian:

H = Σ_{x ∈ Ω(T, Ce^p)} w(x; Ce^p) [∇I (∂W/∂∆p)]^T [∇I (∂W/∂∆p)].   (6)

Partly similar to [10], we adapt w per LK-iteration as the product of a density term w_D and a robustness term w_R: w = w_D · w_R.
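A single LK-iteration as in (5)-(6) then amounts to a weighted Gauss-Newton step. A minimal sketch, assuming the per-point steepest-descent rows and residuals have already been stacked (in practice the residuals are 36-dimensional HOG differences; here they are treated as generic stacked rows):

```python
import numpy as np

def lk_update(J, r, w):
    """One weighted Gauss-Newton step of LK alignment.
    J: (n, 6) stacked steepest-descent rows, grad(I) * dW/d(dp), per point,
    r: (n,) residuals T(x) - I(W(x; C e^p)),
    w: (n,) per-point weights w_D * w_R.
    Returns the pose update dp that approximately minimizes (4)."""
    Jw = J * w[:, None]
    H = Jw.T @ J          # 6x6 Gauss-Newton Hessian, cf. (6)
    g = Jw.T @ r          # weighted steepest-descent parameter sum, cf. (5)
    return np.linalg.solve(H, g)
```

The caller then composes the update as e^p ← e^p e^{∆p} and iterates until ∆p is small.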
The density term relates to image projection of the head surface. For this, consider an infinitesimal area around a template point x and its projection in the image plane. We define w_D(x; C) as the area ratio of the latter divided by the former. It varies with the direction of the surface normal at x and the distance of x to the image plane. As a result, points seen from the side contribute less than points seen from the front (and, unlike in [10], our definition here extends to multiple cameras). The robustness term relates to the IRLS method of [10], which is used to handle noise, non-rigid motion and occlusions. At each LK-iteration, this method adapts w_R based on the alignment error under the current pose estimate p, using the median error m and robust scale σ defined in (7).
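As an illustration only, a common IRLS choice (Huber-style weights with a MAD-based scale, following the robust-statistics spirit of [10] but not necessarily the paper's exact weight function (7)) can be sketched as:

```python
import numpy as np

def irls_weights(err, c=1.345):
    """Huber-style IRLS robustness weights from per-point alignment errors.
    Scale is the usual robust MAD-style estimate sigma = 1.4826 * median(|err|).
    Small errors get weight 1; large errors are down-weighted as c / (|err|/sigma)."""
    sigma = 1.4826 * np.median(np.abs(err)) + 1e-12
    a = np.abs(err) / sigma
    return np.where(a <= c, 1.0, c / a)
```

At each LK-iteration these weights would be recomputed from the residuals under the current pose estimate and multiplied with the density term w_D.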

3.3 Template use and outlier removal
For tracking, we supply images I_k sequentially, for k = 0, 1, 2, . . . . With p_0 given as input, poses p_k are produced as outputs by repeating a single routine for each k, k ≥ 1. In the basic algorithm, this routine performs one alignment step per image I_k. The template and start pose for this step are derived from the previous image: T_{k−1} = τ(I_{k−1}, p_{k−1}; C) and p_start = p_{k−1}. For robustness, and especially for handling temporary occlusions, each alignment step is preceded by outlier removal. Outlier removal takes the candidate template T for alignment plus an outlier reference image index r (for which pose p_r was already computed) and replaces T by a smaller template T′ by removing template points x with a large difference between their appearance I_r(W(x; Ce^{p_r})) in the outlier reference image I_r and their template appearance T(x), as follows:

T′ = {x ∈ X(T) : ‖I_r(W(x; Ce^{p_r})) − T(x)‖_∞ ≤ max(min(c·σ_r, d), m_r)},   (8)

where σ_r and m_r are defined as in (7), with I and p indexed by r and using the L_∞-norm. This is similar to outlier removal for intensity appearance in [10], but we use max-min thresholds because dense-HOG appearances are normalized vectors.
Threshold c·σ_r (with c = 1.5) is based on robust statistics as in [10] (for low-median cases), d (set at 0.35) avoids ineffectively high values (for mid-median cases), and m_r guarantees that never more than half of the template is removed (for high-median cases). In the basic algorithm, outlier removal of template T_{k−1} in step k (k ≥ 2) uses outlier reference r = k−2, which refers to the image preceding image I_{k−1} from which T_{k−1} was derived (for k = 1, outlier removal does not apply).
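The max-min thresholding described above can be sketched as follows; the robust scale σ_r is assumed here to be the MAD-style estimate 1.4826·m_r (an assumption, since its exact definition is in (7)):

```python
import numpy as np

def remove_outliers(errors, c=1.5, d=0.35):
    """Sketch of max-min outlier removal. errors: per-point L-inf appearance
    differences vs. the outlier reference image. Keeps point i iff
    errors[i] <= max(min(c * sigma_r, d), m_r), with m_r the median error and
    sigma_r a robust scale estimate (assumed: 1.4826 * m_r). Returns a keep-mask."""
    m_r = np.median(errors)
    sigma_r = 1.4826 * m_r
    thresh = max(min(c * sigma_r, d), m_r)   # c*sigma_r for low medians, capped by d,
    return errors <= thresh                  # floored by m_r (keeps at least half)
```

Because the threshold is never below the median m_r, at least half of the template points always survive, matching the guarantee stated above.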

4 ALGORITHM EXTENSIONS FOR DRIFT REDUCTION
4.1 Selective storage and re-use of templates

The basic algorithm from Section 3.3 may drift because alignment errors accumulate during tracking. As a first measure for drift reduction, we store a limited number of templates as key references, and re-use them to correct this. This is partly similar to re-registration as mentioned (but not detailed) in [10]. We use S ⊂ N_0 as a shorthand for the set of stored key references, where an element s ∈ S is an index of an input image I_s from which a key reference template was extracted (using pose p_s). Also, for the selection of key references for storage and re-use, we define a boolean function close for a pair of poses p, p′ and a factor a, such that

close(p, p′; a) ⇔ angle(g′⁻¹g) < a·α ∧ dist(g′⁻¹g) < a·ρ, with g = e^p and g′ = e^{p′},   (9)

where angle and dist denote rotation angle and translation distance, with α set at 10 degrees and ρ set at the head radius.
For storage, we initialize the key reference set from the initial image: S = {0}. During tracking, we add a new reference s whenever we output a pose p_s that is far from all of the references stored so far, viz. when ∀s′ ∈ S : ¬close(p_s, p_{s′}; 2). For re-use, we extend the routine for image I_k from Section 3.3 with a second alignment step (with preceding outlier removal). The second alignment step re-uses a key reference template T_s and uses the stop pose p of the first step as its start pose, and its preceding outlier removal uses outlier reference r = k−1. We select the key reference s ∈ S such that its pose p_s is close to the current pose, viz. so that close(p, p_s; 1) holds. (If more candidates exist, we select the most recent one. In the rare cases where none exist, we skip the second alignment step. This is based on the intuition that texture from T_s may not sufficiently resemble head texture in the current image if there is a large difference in pose.)
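The close() predicate can be sketched as follows. This treats the first three exponential coordinates as a rotation vector and the last three as a translation, which is an approximation (for SE(3) twists the translation part of the coordinates is not exactly the translation of the posed head); the head radius RHO is an assumed value in meters:

```python
import numpy as np
from scipy.spatial.transform import Rotation

ALPHA = np.deg2rad(10.0)   # rotation threshold alpha (10 degrees)
RHO = 0.06                 # translation threshold rho: assumed head radius in meters

def close(p, q, a):
    """Sketch of close(p, q; a): poses as exponential coordinates
    [wx, wy, wz, tx, ty, tz]. True iff the relative rotation angle is below
    a * ALPHA and the translation distance is below a * RHO."""
    rel = Rotation.from_rotvec(q[:3]).inv() * Rotation.from_rotvec(p[:3])
    angle = rel.magnitude()                     # relative rotation angle in radians
    dist = np.linalg.norm(p[3:] - q[3:])        # translation distance (approximation)
    return angle < a * ALPHA and dist < a * RHO
```

With factor a = 2 this implements the storage test, and with a = 1 the re-use test described above.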

4.2 Probabilistic poses and square-root unscented Kalman filter for multi-pose estimation
As a second measure for drift reduction, we introduce a probabilistic version of the algorithm-with-re-use from Section 4.1. Based on method 3 in [14], the idea is to consider pose outputs as estimates, and to keep improving estimates from the past that may be used as references for the future. For this, consider the multi-pose state vector p*_k that represents the collection of pose variables that had to be stored in the routine for image I_k in the algorithm-with-re-use. This collection consists of the poses of the key references in S = {s_1, s_2, . . ., s_|S|}, plus the poses p_{k−2} and p_{k−1} of the outlier references, plus the output pose p_k. From these, we define p*_k (with dimension varying with k) by vertical concatenation, in order of increasing image index:

p*_k = [p_{i_1}^T, p_{i_2}^T, . . ., p_{i_n}^T]^T, with i_1 < i_2 < · · · < i_n and {i_1, . . ., i_n} = S ∪ {k−2, k−1, k}.   (10)

In the probabilistic algorithm, we maintain a probabilistic counterpart of this multi-pose state. For this, we model the same collection of poses as a multivariate Gaussian distribution Pr(p*_k) = Norm_{p*_k}[µ*_k, Σ*_k], and we use the combination of mean µ*_k and covariance Σ*_k as the new state. By analogy with (10), it has parts µ_i, Σ_{i,j} etc. relating to image indices i, j.
The probabilistic algorithm uses templates as before, but template extraction now takes a pose mean µ_i as an estimate where it used a value p_i as a parameter before. Also, key templates are not stored but re-extracted from stored image parts, so that they can improve as the estimates µ_i improve. Similarly, alignment steps are used as before, but an alignment step now starts from a pose mean µ_start. Based on Appendix A in [14], we can approximate the uncertainty in its stop pose p in terms of a zero-mean Gaussian distribution for the optimization variable ∆p from (4), with covariance Σ_∆p given by

Σ_∆p = σ̂² H⁻¹,   (11)

where H is the Gauss-Newton Hessian of (6) and

σ̂² = (1/|Ω(T, Ce^p)|) Σ_{x ∈ Ω(T, Ce^p)} w(x; Ce^p) ‖T(x) − I(W(x; Ce^p))‖².   (12)

Using the above, we define the algorithm as a recursive state estimator that uses an alignment step as its measurement step. The underlying world states for this estimator are composite multi-pose vectors p* as in (10). After an alignment step for image I_k, we update the state µ*_k, Σ*_k by means of a Kalman measurement incorporation step. Because a measurement (i.e. stop pose p) is highly non-linear as a function of both the measurement noise (i.e. alignment uncertainty ∆p) and the world state (viz. start pose p_start), we have to use a version of a Square-Root Unscented Kalman Filter for a system in general form (see [17]). This filter uses an unscented transform that first creates so-called sigma points to approximate the probability distribution of the current multi-pose world state (plus the noise state ∆p), then applies the non-linear function to the set of sigma points, and finally computes a new µ*_k, Σ*_k from the function outputs. Using a square-root implementation, we actually compute and store square roots of the Σ's, to avoid non-positive-definite Σ results.
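The sigma-point construction at the core of the unscented transform can be sketched as follows; this is a generic textbook form from a mean and a covariance square root, not the exact parameterization of [17]:

```python
import numpy as np

def sigma_points(mu, sqrt_cov, kappa=1.0):
    """Sigma points for the unscented transform from a mean mu (n,) and a
    square root sqrt_cov (n, n) of the covariance, as used by square-root
    UKF variants. Returns 2n+1 points and their weights."""
    n = len(mu)
    scale = np.sqrt(n + kappa)
    pts = [mu]
    for i in range(n):
        col = scale * sqrt_cov[:, i]
        pts.append(mu + col)   # positive spread along column i
        pts.append(mu - col)   # negative spread along column i
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w
```

The filter pushes these points through the non-linear measurement function and recombines the outputs, weighted by w, into a new mean and covariance (square root).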

5 EXPERIMENTS

5.1 Dataset and evaluation criteria
For the experiments, we used sequences captured with handheld cameras at Máxima Medical Center Veldhoven, with parental consent [2]. They show infants in bed, in various conditions: relaxed, experiencing acute pain during interventions, experiencing post-operative pain, etc. For quantitative evaluation, we also selected 10 videos of different infants with a lot of motion and, for some, also large changes of expression from acute pain. In order to judge tracking accuracy quantitatively from both eyes, we selected sub-sequences without extremely non-frontal poses. For this, we used the angle θ(p) of rotation from pose p to an upright-frontal pose (with respect to the camera) as a measure of non-frontality, and limit θ < 50 degrees. (Note that a θ-value can correspond with many rotation axes and therefore with many combinations of yaw, pitch and roll.) Video (30 fps) was input as a down-scaled gray-level signal with 480×270 pixels for Sequence 1 and 320×180 for the others.
For qualitative judgment, we consider faces in normalized frontal view (NFV). For this, the visible head texture is projected into an image that corresponds with a camera positioned upright-frontally before the head (cf. Figure 1). For quantitative judgment, we consider center-of-eye locations in NFVs. As ground truth, center-of-eye annotations in input images are reverse-projected as 3D points on the posed head and then NFV-projected. We compare them with NFV-projections of fixed 3D head points, viz. the reverse-projected annotations of the first image of the sequence. As comparison metric, we use the eye location error (ELE), defined for an NFV as the distance between same-eye locations divided by the distance between the left/right-eye ground truths in the NFV of the first image. This metric allows comparison of results across different sequences, as well as with alternative ground-truth annotations. (We annotated eye centers at full input resolution in every 10th frame (1/3 s). Re-annotation trials have shown low annotation noise: mean ELE below 0.02, with outliers up to 0.08.)
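The ELE metric itself reduces to a normalized distance; a minimal sketch (function name hypothetical):

```python
import numpy as np

def eye_location_error(tracked, truth, iod_ref):
    """Eye location error (ELE) in a normalized frontal view: distance between
    the tracked and ground-truth locations of the same eye, divided by the
    left/right ground-truth inter-ocular distance of the first frame (iod_ref)."""
    return np.linalg.norm(np.asarray(tracked) - np.asarray(truth)) / iod_ref
```

Because the error is normalized by the first-frame inter-ocular distance, values are comparable across sequences with different head sizes and camera distances.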

§ Sequence 5 has 1 frame where the re-use output has 1 invisible eye (excluded from the mean/max computation).
As the temporal model for the estimator, we model the relationship between two successive world states as follows: pose components referring to the same image in both world states remain unchanged, and the value of a new pose component p_k depends only on the value of p_{k−1} in the preceding world state. For the latter, we use a simple transition model Pr(p_k | p_{k−1}) = Norm_{p_k}[p_{k−1}, Σ_trans], which represents Brownian head motion in exponential-coordinate space. We can now use Kalman techniques, as follows: before input of a new image I_k, we predict a new state µ*_k, Σ*_k from the previous state by adding new state parts for image index k, and by removing (through function reduce) parts that are no longer needed (viz. for index k−3 if k−3 ∉ S).
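The prediction step that appends a new pose block under the Brownian transition model can be sketched as follows (the reduce step that drops obsolete blocks is omitted, and the full-covariance form is used for brevity instead of the square-root form):

```python
import numpy as np

def predict(mu, Sigma, Sigma_trans):
    """Prediction sketch: augment the multi-pose state with a new 6-D pose
    block p_k following p_k ~ N(p_{k-1}, Sigma_trans).
    mu: (n,) stacked pose means ending with p_{k-1}; Sigma: (n, n) covariance;
    Sigma_trans: (6, 6) transition noise. Returns the augmented mean and covariance."""
    n = len(mu)
    J = np.vstack([np.eye(n), np.zeros((6, n))])  # existing blocks stay unchanged
    J[n:, n - 6:] = np.eye(6)                     # new block copies p_{k-1}
    mu_new = J @ mu
    Sigma_new = J @ Sigma @ J.T
    Sigma_new[n:, n:] += Sigma_trans              # add Brownian transition noise
    return mu_new, Sigma_new
```

Note that the new block is fully correlated with the p_{k−1} block before the measurement step, which is what lets later alignments improve past pose estimates.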

Table 1: Tracking accuracy. Values for θ are from the output of the probabilistic algorithm. † Mean/max ELE are over all left and right eyes with ground truth. ‡ |S| is the number of key references at the end of the sequence.