In minimally invasive surgery (MIS), precise measurement and understanding of human organs is very important; even a slight wobble of the instrument can cause a great deal of error. Digital 3D reconstruction technology can help doctors confirm real distances. Since depth data are difficult to acquire and annotate at scale, especially inside the human body, much research has focused on self-supervised networks. Most of these methods rely on the widely adopted hypothesis that image brightness remains constant across adjacent frames, which does not hold inside the body, where lighting is unstable in narrow cavities and the scenes are complicated. To solve this problem, this paper proposes a high-precision depth reconstruction method for endoscopic surgical scenarios based on AFNet. First, to overcome the limitations of the constant-illumination assumption, we extract local features with illumination invariance and introduce quantized illumination-invariant feature descriptors into the loss function, reducing the impact of illumination changes during supervision. Second, leveraging the synchronous movement of the light source and the lens in an endoscope, we establish an association model between brightness variation and depth prediction, which helps the network grasp image context and produce smoother depth. Experiments show that our method improves the accuracy of depth estimation in endoscopic environments compared to baseline methods. Reconstruction results on the ENDOVIS laparoscopic dataset and the ENDOSLAM gastrointestinal endoscopy dataset are significantly improved.
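The abstract does not spell out AFNet's loss, but the idea of replacing raw brightness with an illumination-invariant local descriptor in a photometric loss can be illustrated with a minimal sketch. The sketch below uses the classic census transform as the descriptor; the function names and the choice of descriptor are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def census_transform(img, radius=1):
    """Binary descriptor comparing each pixel to its neighbors.

    Invariant to monotonic brightness changes, so it is a common
    substitute for raw intensity in photometric losses.
    (np.roll wraps at borders; a real implementation would pad.)
    """
    bits = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
            bits.append((shifted > img).astype(np.uint8))
    return np.stack(bits, axis=-1)  # (H, W, 8) for radius=1

def illumination_invariant_loss(target_frame, warped_source_frame):
    """Hamming distance between census descriptors of the target frame
    and the source frame warped by predicted depth and pose."""
    ct = census_transform(target_frame)
    cs = census_transform(warped_source_frame)
    return np.mean(ct != cs)
```

Because the descriptor depends only on local intensity orderings, the loss is unchanged under the brightness shifts an endoscope's moving light source induces, which is the property the abstract's illumination-invariant descriptors are after.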
Semantic segmentation in aquatic scenes is a key technology for water environment monitoring. Detecting and segmenting small-scale objects is a major challenge in semantic segmentation of water bodies. Typical semantic segmentation methods often use multi-scale feature fusion, aggregating features of different scales from different network layers so that the result combines the strong semantic representation of high-level features with the fine detail of low-level features. However, although current methods attend to the details of small-scale objects, they rely primarily on low-level features to determine the presence of objects when adapting network scale for small-object detection, losing accuracy when high-level semantic features are used for prediction. Moreover, cross-scale fusion does not depend on category characteristics. Existing methods are therefore not ideal for semantically constrained small-object segmentation, such as water-surface garbage and plant debris. Our method focuses on aggregating and exploiting cross-level semantic information for object segmentation in aquatic scenes, providing a new approach to small-object segmentation in complex semantic environments. In aquatic scenes, object categories have strong contextual relevance, so this paper proposes a cross-level semantic aggregation network to address small-object segmentation in such scenes. The cross-level semantic aggregation method uses low-level features to guide semantic aggregation of high-level features, so that features are aggregated with high-level semantic features of the same category as the small objects while relevant contextual features of other categories are introduced. Compared with traditional scale fusion, this introduces a new aggregation method within a semantic framework for small-object segmentation under complex contextual relationships. We conducted extensive experiments on our self-built water-scene dataset, ColorWater, and the public dataset Aeroscapes. In addition to achieving state-of-the-art overall segmentation performance, we obtain particularly significant advantages on the small-object categories that are the focus of this paper, such as floating garbage on the water surface and plant debris.
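As a rough illustration of how low-level features might guide semantic aggregation of high-level features, the sketch below computes pixel-to-pixel affinities from low-level features and uses them to re-weight high-level features. This is a hypothetical simplification for illustration, not the paper's architecture:

```python
import numpy as np

def cross_level_aggregation(low_feat, high_feat):
    """Aggregate high-level features using affinities computed from
    low-level features.

    low_feat:  (N, c_low)  flattened low-level features, N = H*W
    high_feat: (N, c_high) flattened high-level features
    """
    # Affinity from low-level features: pixels with similar fine detail
    # (edges, texture) are assumed likely to share a category.
    norm = low_feat / (np.linalg.norm(low_feat, axis=1, keepdims=True) + 1e-8)
    affinity = norm @ norm.T                                   # (N, N) cosine
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)              # row softmax
    # Each pixel's high-level feature becomes a weighted sum over pixels
    # that look similar at the low level, pulling in same-category context
    # even for objects only a few pixels wide.
    return weights @ high_feat
```

The design intuition matches the abstract: a small object may occupy too few pixels to carry strong high-level semantics on its own, but low-level similarity lets it borrow semantics from same-category regions elsewhere in the scene.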
The static-world assumption is common in most Simultaneous Localization and Mapping (SLAM) algorithms. However, this assumption introduces errors in real-world environments, which are rarely static. Furthermore, explicit motion information about the surroundings helps with decision making and scene understanding. In this paper, we present a robust dynamic SLAM system for RGB-D cameras that tracks rigid objects in a scene, generates their 3D bounding box proposals without any prior knowledge, and incorporates this information into the SLAM formulation, improving trajectory accuracy in dynamic environments. To achieve this, our system combines instance segmentation and dense optical flow to detect and track dynamic objects. We evaluate our algorithm on the TUM and KITTI datasets. The results show that the absolute trajectory accuracy of our system improves significantly over ORB-SLAM2. We also compare our algorithm with DynaSLAM and VDO-SLAM, which are likewise designed for dynamic environments, and achieve significant improvements over both.
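One common way to combine instance segmentation with dense optical flow for dynamic-object detection is to compare the observed flow against the flow explained by camera ego-motion alone. The following is a minimal sketch along those lines; the threshold and all names are assumptions, not the paper's pipeline:

```python
import numpy as np

def find_dynamic_objects(instance_masks, flow, ego_flow, thresh=1.5):
    """Flag instances whose observed optical flow deviates from the flow
    induced by estimated camera ego-motion.

    instance_masks: list of (H, W) boolean masks from instance segmentation
    flow:           (H, W, 2) dense optical flow between consecutive frames
    ego_flow:       (H, W, 2) flow predicted from camera motion and depth
    """
    dynamic = []
    residual = np.linalg.norm(flow - ego_flow, axis=-1)  # per-pixel residual
    for i, mask in enumerate(instance_masks):
        if not mask.any():
            continue
        # An object is dynamic if its median residual flow is large,
        # i.e. its apparent motion cannot be explained by the camera alone.
        if np.median(residual[mask]) > thresh:
            dynamic.append(i)
    return dynamic
```

Points on dynamic instances can then be excluded from camera-pose estimation and tracked separately, which is how such systems keep trajectory accuracy in dynamic scenes.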
This paper proposes a method for generating pedestrian ROI regions, aimed mainly at pedestrian segmentation in far-infrared (FIR) images from in-vehicle systems. Since an FIR image is a grayscale image in which pedestrians usually appear brighter than the background, previous segmentation methods have relied mainly on thresholding. However, thresholding fails when pedestrian brightness is uneven, for example because of clothing. We propose a new method for generating pedestrian ROI regions that combines image region merging with vertical projection of pixel intensity, and adopts a temporal semantic model to constrain the parameter space. Experiments show that our method achieves good results in urban scenes.
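The vertical-projection step can be illustrated with a short sketch: column-wise mean intensities are thresholded to propose candidate pedestrian column spans. This omits the region-merging and temporal-model stages of the paper, and the relative threshold is a placeholder:

```python
import numpy as np

def vertical_projection_rois(fir_img, rel_thresh=1.2):
    """Candidate pedestrian column spans from a far-infrared grayscale image.

    Pedestrians are warmer than the background in FIR imagery, so their
    columns stand out in the vertical intensity projection.
    """
    col_profile = fir_img.mean(axis=0)           # mean intensity per column
    hot = col_profile > rel_thresh * col_profile.mean()
    rois, start = [], None
    for x, flag in enumerate(hot):
        if flag and start is None:
            start = x                            # span of hot columns begins
        elif not flag and start is not None:
            rois.append((start, x))              # [start, x) column span
            start = None
    if start is not None:
        rois.append((start, len(hot)))
    return rois
```

Unlike a global pixel threshold, grouping contiguous hot columns tolerates some intensity unevenness within a pedestrian, which is the failure mode of plain thresholding that the abstract describes.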
In this paper we present an application solution for real-time augmented reality. We achieve almost 30 frames per second on our device while maintaining good augmentation quality. We use a textured planar object as the target. Considering the computational complexity, we use patch features and ZSSD template matching for point matching. In addition, we maintain a database of target templates as semantic information; it contains multiple keyframe images of the target object at different poses. With this semantic database, we resolve the problems caused by viewpoint changes and achieve robust performance.
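ZSSD (zero-mean sum of squared differences) subtracts each patch's mean before comparison, giving cheap robustness to additive brightness offsets. A minimal reference sketch, not the paper's optimized implementation, follows:

```python
import numpy as np

def zssd(patch_a, patch_b):
    """Zero-mean SSD: subtracting each patch's mean makes the score
    invariant to additive brightness offsets while staying as cheap
    as plain SSD."""
    a = patch_a.astype(np.float64) - patch_a.mean()
    b = patch_b.astype(np.float64) - patch_b.mean()
    return np.sum((a - b) ** 2)

def match_patch(template, search_img):
    """Exhaustive ZSSD match of a template over a search image;
    returns the best (y, x) position and its score."""
    th, tw = template.shape
    sh, sw = search_img.shape
    best, best_pos = np.inf, (0, 0)
    for y in range(sh - th + 1):
        for x in range(sw - tw + 1):
            score = zssd(template, search_img[y:y + th, x:x + tw])
            if score < best:
                best, best_pos = score, (y, x)
    return best_pos, best
```

In a real-time tracker the search would be restricted to a small window around the predicted feature location rather than the whole frame, which is what makes this simple matcher fast enough for ~30 fps budgets.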
This article presents a method for object classification that combines a generative template with a discriminative classifier. The method is a variant of the support vector machine (SVM) that uses Multiple Kernel Learning (MKL). The features are extracted from a generative template, the so-called Active Basis template. Before using them for object classification, we construct a visual vocabulary by clustering a set of training features according to their orientations. To keep the spatial information, a "spatial pyramid" is used. The strength of this approach is that it combines the rich information encoded in the generative template, the Active Basis, with the discriminative power of the SVM algorithm. We show promising experimental results on images from the LHI dataset.
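As an illustration of a spatial-pyramid histogram over an orientation-based vocabulary, the sketch below bins feature orientations into uniform bins (standing in for the learned clusters of the abstract) and concatenates per-cell histograms across pyramid levels. All names and details are assumptions for illustration:

```python
import numpy as np

def spatial_pyramid_histogram(positions, orientations, n_bins=8, levels=2):
    """Spatial-pyramid histogram over an orientation vocabulary.

    positions:    (N, 2) feature locations (x, y), normalized to [0, 1)
    orientations: (N,)   feature orientations in [0, pi)
    """
    # Quantize orientations into vocabulary "words" (uniform bins here,
    # standing in for clustered training features).
    words = np.minimum((orientations / np.pi * n_bins).astype(int), n_bins - 1)
    hists = []
    for level in range(levels + 1):
        cells = 2 ** level                       # cells per axis at this level
        cell_idx = np.minimum((positions * cells).astype(int), cells - 1)
        for cy in range(cells):
            for cx in range(cells):
                in_cell = (cell_idx[:, 0] == cx) & (cell_idx[:, 1] == cy)
                hists.append(np.bincount(words[in_cell], minlength=n_bins))
    # Concatenated vector preserves coarse spatial layout of the words;
    # one such histogram per pyramid level can feed one kernel in MKL.
    return np.concatenate(hists).astype(np.float64)
```

Keeping the per-level histograms separate is what lets an MKL-SVM weight spatial resolutions differently, one kernel per pyramid level.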