Scene based camera pose estimation in Manhattan worlds
Darko Vehar, Rico Nestler, Karl-Heinz Franke
Proceedings Volume 11144, Photonics and Education in Measurement Science 2019; 111440L (17 September 2019). https://doi.org/10.1117/12.2530875
Event: Joint TC1 - TC2 International Symposium on Photonics and Education in Measurement Science 2019, Jena, Germany
Abstract
This paper presents a principle for scene-related camera calibration in Manhattan worlds. The proposed estimation of extrinsic camera parameters from vanishing points represents a useful alternative to traditional target-based calibration methods, especially in large urban or industrial environments. We analyse the effects of errors in the calculation of camera poses and derive general restrictions for the use of our approach. In addition, we present methods for calculating the position and orientation of several cameras relative to a world coordinate system and discuss the effect of imprecise or incorrectly calculated vanishing points. Our approach was evaluated with real images of a prototype for human-robot collaboration installed at ZBS e.V. The results were compared with a Perspective-n-Point (PnP) method.

1. INTRODUCTION AND MOTIVATION

A prerequisite and an important challenge for 3D reconstruction with multiple cameras according to the geometric principle of triangulation is the precise estimation of the cameras' imaging properties and of their spatial relation to a common world coordinate system. The latter is also referred to as the camera pose.

Traditional techniques determine the intrinsic imaging properties of a camera from a picture of a calibration object with known geometry. Using known world-to-pixel correspondences, the internal camera parameters can be computed precisely with calibration methods.

Calibration techniques can also be used to determine the extrinsic camera parameters, i.e. the position and orientation of the camera in world coordinates. However, when wide-baseline camera systems for the observation of large workspaces have to be calibrated extrinsically, this form of calibration becomes impractical. An easy-to-handle calibration object should ideally cover the entire measuring space but is often imaged too small or is only partially visible in the camera frustums. In such cases, the estimation of the target point positions is uncertain or inaccurate due to the limited image resolution. Furthermore, scene objects such as machines and equipment significantly interfere with the positioning and visibility of the calibration objects. Thus, in practice, often only a part of the measuring volume can be included in the system calibration.

In urban and industrial environments, a variety of 3D object edges usually appear parallel to each other. Furthermore, these parallel lines can be assigned to three orthogonal directions, as they occur on 3D objects with mainly right angles. A scene with these stated constraints is referred to as a Manhattan world1 and is represented with a Manhattan world model. In the perspective image, the characteristics of this world model form the basis for the proposed approach of scene-related camera calibration. We exemplify how a camera pose with respect to the Manhattan world and, subsequently, the relative orientation and position between two cameras are calculated, given images of a Manhattan world taken with intrinsically precalibrated, distortion-free cameras. The method described here is part of the calibration toolkit 3D-EasyCalib™ developed by ZBS e.V., which solves several geometrical calibration problems and can be used with minimal expert knowledge.

Figure 1. Prototype for a human-robot collaboration installed at the ZBS e.V. The line segments in the left image are shown in colours according to the assigned axes of a Manhattan world. The image on the right shows a world coordinate system that was derived from vanishing points.

2. THEORY

2.1 Pinhole camera model and camera calibration

The foundation for the proposed approach is the perspective projection of a 3D Manhattan world scene modelled by an ideal pinhole camera. The pinhole camera model defines the transformation of the three-dimensional Euclidean space onto the two-dimensional image plane according to the principle of central projection. If the 3D coordinate system of the world is set equal to that of the camera with the origin at point C, then, using the notation of Ref. 2, the projection can be described compactly with the equation

$$ x = K \, [\, I \mid \mathbf{0} \,] \, X_C \qquad (1) $$

Image point x and world point X_C are represented by homogeneous vectors. Both are uniquely determined up to a scale factor. The index C shows that the world point X_C is defined in the camera coordinate system. The 3×3 matrix

$$ K = \begin{bmatrix} f & 0 & p_x \\ 0 & f & p_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (2) $$

is called the camera matrix or the calibration matrix of the pinhole camera model. In its basic form, with zero skew and unit aspect ratio, it consists of the focal length f, i.e. the distance of the image plane from the camera centre, and the principal point (px, py), both expressed here in pixel units.

If the point X is defined in a 3D world, for example in a Manhattan world, and the camera pose (orientation and position) in this world is defined by the rotation matrix R and the translation t, then X must first be rigidly transformed into the camera coordinate frame with X_C = RX + t before it is projected onto the image plane with (1). In summary, the well-known algebraic relationship of the projective camera is derived as (see Fig. 2, left)

$$ x = K \, [\, R \mid t \,] \, X \qquad (3) $$
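As a compact numeric illustration of equations (1)-(3), the following sketch assembles K from the intrinsics reported in Sec. 4 and projects a world point onto the image plane; the pose and the world point are illustrative placeholders, not values from the paper.

```python
# Minimal numeric sketch of the projective camera x = K [R | t] X.
import numpy as np

f, px, py = 1721.11, 1001.15, 753.91            # intrinsics as reported in Sec. 4
K = np.array([[f, 0.0, px],
              [0.0, f, py],
              [0.0, 0.0, 1.0]])

R = np.eye(3)                                   # placeholder orientation
t = np.array([[0.0], [0.0], [2000.0]])          # placeholder position: 2 m in front of the scene

X = np.array([100.0, 50.0, 0.0, 1.0])           # homogeneous world point [mm]

P = K @ np.hstack((R, t))                       # 3x4 projection matrix K [R | t]
x_h = P @ X                                     # homogeneous image point
print(x_h[:2] / x_h[2])                         # pixel coordinates
```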

Figure 2. Pinhole geometry of a camera placed at the coordinate origin C. Camera and Manhattan world coordinate frames are related via rotation and translation (left). Formation of a vanishing point as intersection of two imaged parallel world lines (right).

Traditional calibration approaches, such as Ref. 3, calculate the unknown camera parameters K, R, t with the help of a geometric calibration target whose known world points {Xi} and corresponding image points {xi} are used to solve (3). For the method presented in this article, the cameras are also precalibrated intrinsically (K) according to this principle.

2.2 Vanishing points in a Manhattan world

A set of parallel lines with direction d defined in the 3D camera coordinate system intersects in the ideal point, the point at infinity, X = (dᵀ, 0)ᵀ of the associated projective space. The perspective projection of this ideal point onto the image plane is called a vanishing point. It can be calculated by inserting the ideal point into equation (1) as υ = Kd. Analogously, the inverse projection of the vanishing point is defined as a ray through the camera centre and the point υ with the direction d = K−1υ, as shown in Fig. 2, right. Consequently, for all vanishing points found in the image of a Manhattan world that belong to three mutually perpendicular directions, the vectors d1 = K−1υ1, d2 = K−1υ2 and d3 = K−1υ3 are parallel to the mutually orthogonal axes of the Manhattan world (see Fig. 3, left).
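A minimal sketch of the back-projection d = K⁻¹υ, here applied to one of the finite vanishing points reported in Tab. 1; the intrinsics are the values given in Sec. 4.

```python
# Back-project a vanishing point (pixel coordinates) to a world line direction.
import numpy as np

K = np.array([[1721.11, 0.0, 1001.15],
              [0.0, 1721.11, 753.91],
              [0.0, 0.0, 1.0]])

v = np.array([2672.13, -20.03, 1.0])   # finite vanishing point from Tab. 1, homogeneous pixels
d = np.linalg.inv(K) @ v               # ray direction through the camera centre
d /= np.linalg.norm(d)                 # normalize to unit length
print(d)
```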

Figure 3. The derivation of the rotation (left) and the translation (right) from vanishing points. The 3D points O′ and P′ lie on the image plane and are defined in the camera coordinate system with its origin in C.

2.3 Derivation of the camera pose's rotation matrix R

The normalized vectors d1, d2, d3 derived according to the above considerations form an orthonormal basis of a world coordinate system

$$ r_i = \frac{K^{-1} \upsilon_i}{\lVert K^{-1} \upsilon_i \rVert}, \quad i = 1, 2, 3 \qquad (4) $$

The sought rotation matrix R = [r1, r2, r3] is composed of these basis vectors. To construct a valid rotation in a right-handed coordinate system, the order of r1, r2 and r3 has to be chosen accordingly.
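A possible construction of R from three finite vanishing points is sketched below. It assumes the back-projected directions are already nearly orthogonal (the orthogonalization itself is discussed in Sec. 4.3), and the sign flip used to enforce right-handedness is one possible convention, not necessarily the authors' ordering.

```python
# Sketch of equation (4): columns of R are normalized back-projected vanishing
# point directions, adjusted so that det(R) = +1 (right-handed basis).
import numpy as np

def rotation_from_vanishing_points(K, v1, v2, v3):
    d = [np.linalg.inv(K) @ np.append(np.asarray(v, float), 1.0) for v in (v1, v2, v3)]
    r = [di / np.linalg.norm(di) for di in d]
    R = np.column_stack(r)
    if np.linalg.det(R) < 0:          # flip one axis to obtain a right-handed frame
        R[:, 2] *= -1.0
    return R

# Example with the LSD vanishing points from Tab. 1:
K = np.array([[1721.11, 0.0, 1001.15],
              [0.0, 1721.11, 753.91],
              [0.0, 0.0, 1.0]])
R = rotation_from_vanishing_points(K, (-1087.41, 30.96), (2653.04, -5.65), (1086.31, 4711.57))
print(R)
```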

2.4 Derivation of the camera pose's translation vector t

The translation vector t cannot be computed unambiguously from an image without additional knowledge of the scene. A practical way to estimate the translation is to use known distances in the scene. For example, two known point pairs {xi, Xi} can be used to solve (3) with the previously determined rotation matrix R. Consequently, errors from the calculation of the rotation can propagate into the determination of the translation.

An alternative was presented in Ref. 4. The required parameters consist of a known direction d that is parallel to one of the axes of the Manhattan world, the real length of a line segment along the direction d, and the image coordinates of the start and end points of the depicted line segment. The principle of this approach is shown in Fig. 3, right.

From the similarity of the triangles CO′I and COP, the distance of the point O from the camera centre is

$$ \lVert CO \rVert = \lVert OP \rVert \, \frac{\lVert CO' \rVert}{\lVert O'I \rVert} $$

and thus the translation vector t can be determined as

$$ t = \lVert CO \rVert \, \frac{O'}{\lVert O' \rVert} \qquad (5) $$
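The translation estimate could be implemented as in the following sketch; the helper name is hypothetical, and the two image-plane points, the axis direction and the segment length are assumed inputs.

```python
# Sketch of equation (5): translation from a known segment length along a
# Manhattan axis. O_img, P_img are the 3D positions of the imaged end points
# on the image plane (camera frame, origin at C), d is the unit axis direction.
import numpy as np

def translation_from_known_length(O_img, P_img, d, length_OP):
    # Point I: intersection (in the least-squares sense) of the ray C + l1*P'
    # with the line O' + l2*d.
    _, l2 = np.linalg.lstsq(np.column_stack((P_img, -d)), O_img, rcond=None)[0]
    I = O_img + l2 * d
    dist_CO = length_OP * np.linalg.norm(O_img) / np.linalg.norm(I - O_img)  # |CO| = |OP| |CO'| / |O'I|
    return dist_CO * O_img / np.linalg.norm(O_img)                            # t = |CO| * O' / |O'|
```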

3. VANISHING POINT DETECTION

It was shown above how vanishing points can be used to derive extrinsic camera parameters. A vanishing point results as the intersection of all parallel lines defined in 3-space which are projected perspectively onto an image plane, as shown in Fig. 2, right. As the accuracy of the vanishing point positions is crucial for the precision of the parameters derived from them, a short review of methods for vanishing point detection is in order. This is followed by a review of approaches for line segment extraction in the section below, a common prerequisite for vanishing point detection algorithms.

Since the first successful attempt by Barnard7 in 1983, many methods have been proposed for the detection of vanishing points in images. Most approaches consist of three steps: line extraction, line-to-vanishing-point assignment and vanishing point position estimation. In particular, the clustering of the lines into groups sharing the same vanishing point distinguishes the methods, since this can be done either in the image space or in a transformed space, for example in Hough space or on the Gaussian sphere. The Manhattan world assumption can be used either in the clustering step or in the vanishing point computation.

According to our own research,8 the agglomerative clustering method J-Linkage9, 10 produced the best results because of its robustness against outliers, which occur in images of non-ideal Manhattan worlds. Similar to a RANSAC scheme, J-Linkage evaluates vanishing point hypotheses using consensus sets of lines. The separation of the clusters is based on the Jaccard distance.
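The following simplified sketch illustrates the two ingredients of such a clustering: a consensus test that decides whether a line segment supports a vanishing point hypothesis, and the Jaccard distance between preference sets used to merge clusters. The threshold and the consensus criterion are illustrative simplifications, not the exact formulation of Refs. 9, 10.

```python
# Simplified building blocks of a J-Linkage-style vanishing point clustering.
import numpy as np

def supports(segment, vp, thresh_deg=2.0):
    """segment: ((x1, y1), (x2, y2)); vp: (vx, vy) in pixels."""
    p1, p2 = np.asarray(segment[0], float), np.asarray(segment[1], float)
    mid = 0.5 * (p1 + p2)
    seg_dir = (p2 - p1) / np.linalg.norm(p2 - p1)
    vp_dir = np.asarray(vp, float) - mid
    vp_dir /= np.linalg.norm(vp_dir)
    # Angle between the segment and the line joining its midpoint to the hypothesis.
    angle = np.degrees(np.arccos(np.clip(abs(seg_dir @ vp_dir), -1.0, 1.0)))
    return angle < thresh_deg

def jaccard_distance(pref_a, pref_b):
    """Jaccard distance between two preference sets (sets of hypothesis indices)."""
    union = pref_a | pref_b
    return (1.0 - len(pref_a & pref_b) / len(union)) if union else 1.0
```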

3.1 Procedures for line segment extraction

An established and widespread approach uses the robust Canny edge detector and a parametric line search, e.g. in Hough space, to extract lines from an image. Its susceptibility to faults (i.e. false-positive lines), especially in regions with fine textures, leads to erroneous results in the subsequent vanishing point detection.

We evaluated two alternative methods, the Fast Line Detector5 (FLD) and the Line Segment Detector6 (LSD). Both output line segments with sub-pixel accuracy and produce subjectively correct results. The FLD processes the results of the Canny detector by using a collinearity constraint to track straight line segments along contours. The LSD splits the image into so-called line support regions, i.e. regions of similar local gradient orientation. These regions are merged into line segments by a region growing approach.
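Both detectors have implementations in OpenCV (LSD in the core module, FLD in the opencv-contrib ximgproc module); a hedged usage sketch, assuming an OpenCV build that ships both and a hypothetical input file, looks as follows.

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input image

# LSD (core module; not available in every OpenCV release)
lsd = cv2.createLineSegmentDetector()
lsd_lines = lsd.detect(img)[0]                         # N x 1 x 4 array of x1, y1, x2, y2

# FLD (opencv-contrib ximgproc module); first argument is the minimum segment length in pixels
fld = cv2.ximgproc.createFastLineDetector(20)
fld_lines = fld.detect(img)

print(len(lsd_lines), len(fld_lines))
```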

4. PROPOSED METHOD AND EVALUATION

The theory for the camera pose estimation from an image of a Manhattan world and the J-Linkage algorithm for the vanishing point detection from pre-detected line segments were merged into a processing pipeline of our proposed method. It consists of four consecutive steps: line segment extraction, line clustering, vanishing point calculation and camera pose (rotation and translation) computation.

The process steps were carried out, by way of example, on an image of the sample scene shown in Fig. 4, left. Colour images were converted to grayscale before processing. The intrinsic parameters of the camera are: f = 1721.11 pixels, principal point (px, py) = (1001.15, 753.91) at an image resolution of 2016 × 1512 pixels. The intrinsic calibration, including a lens distortion correction, was done beforehand.
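A sketch of this preprocessing with the reported intrinsics is given below; the distortion coefficients and the file name are placeholders, not the values actually used for the evaluation.

```python
import cv2
import numpy as np

K = np.array([[1721.11, 0.0, 1001.15],
              [0.0, 1721.11, 753.91],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.1, 0.05, 0.0, 0.0, 0.0])           # hypothetical k1, k2, p1, p2, k3

img = cv2.imread("scene.jpg")                           # hypothetical file name
undistorted = cv2.undistort(img, K, dist)               # remove lens distortion
gray = cv2.cvtColor(undistorted, cv2.COLOR_BGR2GRAY)    # colour images are converted to grayscale
```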

Figure 4. Input image of a test scene (left), line segments extracted with FLD (center) and LSD (right).

4.1 Line segment extraction

Utilizing an intensity image as an input, both algorithms (LSD and FLD) parameterize line segments as start and end points with subpixel-precision. Some parameters, e.g. minimum line length, may be adjusted depending on the resolution of the input image and scene properties to obtain the best results.

For the example image (Fig. 4, left) the FLD outputs 563 and the LSD 611 line segments. According to the subjective evaluation, the consistency of the results is similar. Statistical evaluation of the line segment lengths shows that the LSD favours shorter line segments with a smaller length variance compared to the FLD. Since no preference can be made on this basis at this point, the final method selection takes place in the context of the subsequent processing steps.

4.2 Edge clustering and vanishing point calculation

The estimated line segments were input to the J-Linkage algorithm and output labelled according to their corresponding vanishing points. Since a Manhattan world was assumed, the three clusters with the largest number of assigned line segments correspond to the sought Manhattan world axes (vanishing points).

The intersection of the lines in a cluster was computed using a least squares method. In addition to the calculated vanishing point, the root mean square (RMS) of the smallest distances of the lines to the intersection (vanishing point) was computed. This represents the accuracy of the line detectors and gives an indication of the validity of the estimated vanishing points in the case of degenerate camera orientations (Sec. 4.3). Results for the FLD and LSD methods are shown in Tab. 1.
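A sketch of the least-squares intersection and the RMS measure for one cluster of line segments follows; the segment representation as point pairs is an assumption.

```python
# Least-squares vanishing point of one cluster plus the RMS of the
# point-to-line distances used as accuracy measure in Tab. 1.
import numpy as np

def vanishing_point_lsq(segments):
    """segments: iterable of ((x1, y1), (x2, y2)) pixel coordinates."""
    A, rhs = [], []
    for (x1, y1), (x2, y2) in segments:
        a, b = y2 - y1, x1 - x2                  # line normal
        n = np.hypot(a, b)
        A.append([a / n, b / n])
        rhs.append((a * x1 + b * y1) / n)        # so that A @ vp = rhs places vp on each line
    A, rhs = np.asarray(A), np.asarray(rhs)
    vp, *_ = np.linalg.lstsq(A, rhs, rcond=None) # minimizes the sum of squared distances
    rms = np.sqrt(np.mean((A @ vp - rhs) ** 2))  # RMS of signed point-to-line distances
    return vp, rms
```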

Table 1. Comparison of the two favoured line extraction methods with the coordinates of the estimated vanishing points and their accuracy (RMS) for the scene shown in Fig. 4. The value #v is the number of distinctive vanishing points found by the J-Linkage algorithm.

Method (#v) | υ1 (RMS)                 | υ2 (RMS)                | υ3 (RMS)
FLD (28)    | (−1054.88, 43.10)  40.57 | (2672.13, −20.03) 21.86 | (1087.19, 4668.74) 59.54
LSD (11)    | (−1087.41, 30.96)  27.47 | (2653.04, −5.65)  21.44 | (1086.31, 4711.57) 36.40

The RMS of the LSD line segments was slightly smaller than that of the FLD in every case. J-Linkage output nearly twice as many clusters for lines extracted with the FLD (28) as for the LSD (11). Both results depend on the precision of the extracted line segments. The evaluation of additional images yielded similar differences in favour of the LSD, so it is the recommended method for the line segment extraction step.

4.3 Estimation of the camera pose's rotation

For the calculation of the rotation matrix according to (4), the vanishing points were arranged clockwise with respect to the principal point. For the direction vectors computed from the left image in Fig. 4, the pairwise angles between the basis vectors were close to the ideal 90°. The resulting rotation matrix was orthogonalized with the help of Rodrigues' rotation formula.

Depending on the orientation of the camera in the Manhattan world, a rotation matrix according to (4) cannot always be estimated. These degenerate cases occur when one axis, or two and hence all three axes, of the camera coordinate system are parallel to axes of the world coordinate system. The projections of the corresponding parallel world lines are then also parallel on the image plane, and the associated vanishing point lies at infinity.

If only one infinite vanishing point is present, the corresponding direction vector can be calculated as the cross product of the other two. The second case should be avoided when possible, for example by changing the camera pose, as further analysis is otherwise impossible.
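A sketch of both corrections is given below; note that the projection onto the nearest rotation is done here via an SVD, whereas the paper uses Rodrigues' rotation formula for the orthogonalization.

```python
# Complete the basis when one vanishing point is at infinity (cross product of
# the other two directions) and re-orthogonalize the assembled matrix so that
# it is a proper rotation.
import numpy as np

def complete_and_orthogonalize(d1, d2, d3=None):
    r1 = d1 / np.linalg.norm(d1)
    r2 = d2 / np.linalg.norm(d2)
    r3 = np.cross(r1, r2) if d3 is None else d3 / np.linalg.norm(d3)
    R = np.column_stack((r1, r2, r3))
    U, _, Vt = np.linalg.svd(R)                # nearest orthogonal matrix
    R_ortho = U @ Vt
    if np.linalg.det(R_ortho) < 0:             # enforce a right-handed rotation
        U[:, -1] *= -1.0
        R_ortho = U @ Vt
    return R_ortho
```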

4.4 Estimation of the camera pose's translation

The evaluation of equation (5) for the camera pose's translation is straightforward. Only the point I might be misinterpreted: it results as the intersection of the ray C + λ1P′ and the line O′ + λ2d in 3-space, as shown in Fig. 3, right.

We selected points O and P on the upper side of the robot frame (see Fig. 5, left), determined their image coordinates and computed the translation vector t according to equation (5). The length of the side of the frame is ||OP|| = 880 mm. The resulting distance from the camera centre to the origin of the Manhattan world is ||t|| = 2411.37 mm and the orientation, presented as Euler angles (z, y′, x″ convention) for easier interpretation, is: yaw = 138.98°, pitch = −0.68°, roll = 66.72°.
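A numeric sketch of this computation with the reported segment length of 880 mm is shown below; the pixel coordinates of O and P and the axis direction d are hypothetical placeholders, so the printed distance will not reproduce the value above.

```python
import numpy as np

f, px, py = 1721.11, 1001.15, 753.91
O_img = np.array([850.0 - px, 620.0 - py, f])     # hypothetical image of O on the image plane
P_img = np.array([1240.0 - px, 660.0 - py, f])    # hypothetical image of P
d = np.array([0.974, 0.118, 0.195])               # hypothetical Manhattan axis direction
d /= np.linalg.norm(d)

# Point I: intersection (in the least-squares sense) of the ray through P'
# with the line through O' in direction d.
_, l2 = np.linalg.lstsq(np.column_stack((P_img, -d)), O_img, rcond=None)[0]
I = O_img + l2 * d
dist_CO = 880.0 * np.linalg.norm(O_img) / np.linalg.norm(I - O_img)   # |OP| = 880 mm
t = dist_CO * O_img / np.linalg.norm(O_img)
print(np.linalg.norm(t))   # distance from the camera centre to the world origin [mm]
```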

Figure 5. Input image (left), line clustering based on the lines extracted with FLD (middle) and LSD (right). The edges are shown in colour according to the assigned vanishing points. Black lines could not be assigned to vanishing points of the imaged Manhattan world.

4.5 Estimation of the relative pose between two cameras

The motivation for the proposed scene-based calibration was not only to estimate the exterior orientation of multiple cameras in a common world frame but also to determine their relative orientation and position to each other. Given a camera pair with known extrinsics, a selected point in the world is transformed into the respective camera frames with the equations X_C1 = R1X + t1 and X_C2 = R2X + t2. Eliminating the world point X from these two equations gives the orientation and translation of the camera C2 with respect to C1 as

$$ R_{21} = R_2 R_1^{\top}, \qquad t_{21} = t_2 - R_2 R_1^{\top} t_1 \qquad (6) $$

It should be noted that R1, R2, t1 and t2 have to be defined with respect to the same coordinate system. In the procedure described in Sec. 4.3 the rotation matrices were computed by manually assigning the world axes. This step will be automated with image-based matching methods in the future.
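A minimal sketch of equation (6), assuming both poses map world points into the camera frames as X_Ci = RiX + ti:

```python
# Relative orientation and translation of camera 2 with respect to camera 1.
import numpy as np

def relative_pose(R1, t1, R2, t2):
    R21 = R2 @ R1.T
    t21 = t2 - R21 @ t1
    return R21, t21
```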

5. COMPARISON OF THE PROPOSED APPROACH WITH A PNP METHOD

To compare the presented scene-based calibration approach with a traditional target-based method, a calibration target was placed in a test scene and aligned with the Manhattan world. The coordinate system of the Manhattan world is in this case equal to the coordinate system of the calibration target. We determined the camera pose (Rref, tref) using an iterative PnP algorithm11 and then using our proposed method (R, t). The control points of the target were calculated with sub-pixel accuracy. The measured length of the line segment required for the calculation of the translation according to (5) is ||OP|| = 720 mm. The intrinsic parameters of the cameras are the same as in the previous section.

We chose the following metrics to compare the results. The error in the translation is represented by the Euclidean norm of the vector difference ||tref − t||. The difference of the orientations is determined by the rotation angle θ of the relative rotation Rref Rᵀ, which defines a metric for 3D rotations.12
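A sketch of both metrics, using the rotation angle of Rref Rᵀ as the angular measure (one of the distance measures analysed in Ref. 12):

```python
import numpy as np

def pose_errors(R_ref, t_ref, R, t):
    dt = np.linalg.norm(t_ref - t)                                 # translation error [mm]
    cos_theta = (np.trace(R_ref @ R.T) - 1.0) / 2.0
    theta = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))   # rotation error [deg]
    return dt, theta
```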

The vanishing points with associated RMS distances are shown in Tab. 2. In Fig. 6, second row, right, the green line segments are almost parallel to each other and the calculated vanishing point υ2 lies distinctly outside of the image. The associated RMS error is, as expected, the largest. Nonetheless, the derived basis vectors are still almost perpendicular to each other, so no additional processing steps were necessary.

Figure 6. Input images of two different camera views of the same scene with a coordinate system of the target found by our method (top row). The classified line segments from the LSD and J-Linkage algorithms are shown in the second row. The 3D view at the bottom shows both cameras and the Manhattan world coordinate system.

Table 2. Vanishing points and RMS of the shortest distances of the clustered lines to the corresponding vanishing points.

               | υ1 (RMS)                 | υ2 (RMS)                     | υ3 (RMS)
Fig. 6, top    | (818.41, 4805.54)  43.81 | (−1551.14, −137.14)  40.54   | (2435.64, 53.72)  30.03
Fig. 6, bottom | (639.56, 2826.98)  33.06 | (−9053.44, −2453.98)  113.97 | (1696.53, −559.24)  23.95

The results of the camera poses with respect to the world and the resulting relative orientation and position of the cameras to each other are shown in Tab. 3. The observed differences in the translations were only a few millimetres. The error in the rotation was less than one degree. The overall accuracy of the proposed method was only slightly worse than that of the PnP method. The evaluation of additional images confirmed these results. In the case of the stereo setup (Tab. 3, bottom), the computation errors of the relative pose between the two cameras accumulate as expected.

Table 3. Comparison of the camera poses with respect to the Manhattan world (first two rows) and the relative translation and rotation of the camera in Fig. 6, bottom with respect to the camera in Fig. 6, top (last row).

               | ||tref|| [mm]    | ||t|| [mm]   | ||tref − t|| [mm]      | θ [°]
Fig. 6, top    | 2712.89          | 2710.76      | 2.13                   | 0.60
Fig. 6, bottom | 2092.83          | 2091.54      | 1.34                   | 0.28
               | ||t21,ref|| [mm] | ||t21|| [mm] | ||t21,ref − t21|| [mm] | θ21 [°]
Fig. 6, stereo | 2750.18          | 2746.08      | 7.57                   | 0.33

6. SUMMARY AND FUTURE WORK

In environments that do not allow target-based extrinsic calibration and are represented by a Manhattan world, the proposed method of scene-related calibration is a useful alternative. The comparison with a traditional approach shows that our procedure achieves an accuracy that is suitable for many applications.

The Manhattan world assumption is a prerequisite for the techniques presented in this paper. When a scene only partially fulfils the constraints of this model, for example when only two orthogonal directions can be extracted from the imaged scene, the Manhattan world can still be identified. We can supplement the missing information with suitable heuristics or improve erroneous results. With the error metrics presented in this article, such situations can be reliably identified and automatically corrected. These challenges are the subject of our further research.

ACKNOWLEDGMENTS

The results presented in this article are based on work that was supported by the German Federal Ministry of Education and Research (BMBF) as part of the funding program “Twenty20 Partnership for Innovation” in the joint project “Ergonomics Assistance Systems for Contactless Human Machine Operation” (EASY COHMO) of the consortium 3Dsensation.

REFERENCES

[1] Coughlan, J. M. and Yuille, A. L., "Manhattan world: Compass direction from a single image by bayesian inference," in Proceedings of the International Conference on Computer Vision, 941-947 (1999).
[2] Hartley, R. and Zisserman, A., Multiple View Geometry in Computer Vision, second ed., Cambridge University Press, New York (2004). https://doi.org/10.1017/CBO9780511811685
[3] Zhang, Z., "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11), 1330-1334 (2000). https://doi.org/10.1109/34.888718
[4] Guillou, E., Meneveaux, D., Maisel, E., and Bouatouch, K., "Using vanishing points for camera calibration and coarse 3D reconstruction from a single image," The Visual Computer, 16(7), 396-410 (2000). https://doi.org/10.1007/PL00013394
[5] Lee, J. H., Lee, S., Zhang, G., Lim, J., Chung, W. K., and Suh, I. H., "Outdoor place recognition in urban environments using straight lines," in IEEE International Conference on Robotics and Automation, 5550-5557 (2014).
[6] Grompone von Gioi, R., Jakubowicz, J., Morel, J.-M., and Randall, G., "LSD: a line segment detector," Image Processing On Line, 2, 35-55 (2012). https://doi.org/10.5201/ipol
[7] Barnard, S. T., "Interpreting perspective images," Artificial Intelligence, 21, 435-462 (1983). https://doi.org/10.1016/S0004-3702(83)80021-6
[8] Rehawi, L., Geometrical Estimation of Multiple Cameras in a Manhattan-World (2018).
[9] Toldo, R. and Fusiello, A., "Robust multiple structures estimation with J-linkage," in Computer Vision - ECCV 2008, 537-547, Springer Berlin Heidelberg, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88682-2
[10] Tardif, J., "Non-iterative approach for fast and accurate vanishing point detection," in 2009 IEEE 12th International Conference on Computer Vision (ICCV), 1250-1257 (2009).
[11] Bradski, G., "The OpenCV library," Dr. Dobb's Journal of Software Tools (2000).
[12] Huynh, D. Q., "Metrics for 3D rotations: Comparison and analysis," Journal of Mathematical Imaging and Vision, 35, 155-164 (2009). https://doi.org/10.1007/s10851-009-0161-2
© (2019) Society of Photo-Optical Instrumentation Engineers (SPIE).