Simultaneous Underwater Navigation and Mapping

The use of autonomous underwater vehicles has been growing, allowing tasks that pose inherent risks to humans to be performed, namely inspection processes near structures. With the growing use of systems with autonomous navigation, visual acquisition methods have also become more developed, because they have an appealing cost and show interesting results when operating at short distance. The quality of navigation can be improved through visual SLAM techniques, which map and localize simultaneously and whose key aspect is the detection of revisited areas. These techniques are not usually applied to underwater scenarios and, therefore, their performance in this environment is unknown. This paper presents a more reliable navigation system for underwater vehicles, resorting to visual SLAM techniques from the literature. The results, obtained in a realistic scenario, demonstrate that the system can be applied to the underwater environment.


Introduction
The growing use of autonomous underwater vehicles (AUVs) is related to the fact that their current features (Wynn et al. 2014) allow them to be applied in tasks that may involve risks for humans, such as environmental monitoring, inspection, and demining. However, to ensure the use of an AUV in diverse applications, these vehicles must be able to navigate autonomously and the data must be obtained accurately (Bosch et al. 2016). Since the underwater environment is unknown, there has been an increased effort in the exploration and development of techniques that allow underwater vehicles to navigate. In this context, Simultaneous Localization and Mapping (SLAM) techniques (Paull et al. 2014) have arisen to help autonomous navigation, namely in unstructured environments and when the initial information is poor. In this technique, the robot constructs a coherent map of its environment while, at the same time, determining its location within that same map. However, this approach has to deal with some problems (Aulinas et al. 2008), namely data association, which becomes harder as the number of possible hypotheses that identify a landmark grows. The adversities of the underwater environment reduce the set of usable sensors. In this environment, radio signals only propagate over short distances, which makes the use of the Global Positioning System (GPS) and techniques based on Wi-Fi communications less appealing. Although expensive and subject to some errors, acoustic sensors present better performance in this environment. However, autonomous vehicles are crucial in depth applications, such as monitoring and inspection of underwater structures. In this context, visual sensors have been an important focus to ensure that vehicles can execute close-range missions autonomously and in real time.
Although images can be affected by poor visibility and light attenuation (Prados Gutiérrez 2013), some techniques have been applied to restore and enhance image quality (Corchs and Schettini 2010; Pérez-Alcocer et al. 2016). The main goal of the SLAM approach with visual sensors (Pi et al. 2014) is to estimate the camera motion while reconstructing the environment. This technique relies on the extraction of features to estimate the position of the vehicle (Taketomi, Uchiyama, and Ikeda 2017), and for that there are two main methodologies: filter-based, in which the motion is estimated from all frames, processed by a filter; and keyframe-based, in which the motion is estimated from a set of previously selected frames. In addition, it is common to use Bundle Adjustment (BA) to adjust the trajectory at certain moments, depending on the implementation. Moreover, to limit the estimation error, the vehicle must be able to recognize revisited areas, known as loop-closure detection (Lowry et al. 2016). With visual systems, however, this is considered the greatest challenge, since it implies the association of nonconsecutive images. As a solution, the visual vocabulary concept, considered an efficient approach (Nister and Stewenius 2006), is used to determine the similarity between images, namely through the Bag-of-Words (BoW) technique (Nicosevici and Garcia 2012; Law, Thome, and Cord 2014). The vocabulary can be created offline or online. In the first case, it is constructed a priori from a large set of training images. The online approach does not require human intervention: the vocabulary is constructed according to the robot motion. This approach stands out by accurately modeling the objects and scenes present in the surroundings and by being constructed as visual information becomes available. This paper presents a robust, accurate, and efficient visual system for simultaneous navigation and mapping in the underwater environment.
Thus, it is possible to contribute to a more reliable and safer localization of vehicles and, therefore, to allow the use of AUVs in long-term operations in an unknown environment. The paper is organized as follows: section 2 describes the steps taken to obtain the developed system. Afterwards, the results obtained with the system in the underwater environment are illustrated in section 3. Moreover, this section presents a comparative analysis between vocabulary and Bundle Adjustment approaches to detect revisited areas. Finally, the major conclusions of this work are discussed in section 4.
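The Bag-of-Words comparison described in the introduction can be illustrated with a minimal sketch. It assumes that the local features of each image have already been quantized into integer visual-word identifiers, and it scores two images by the cosine similarity of their word histograms. The function names, the threshold, and the `min_gap` parameter are hypothetical and chosen for illustration only; vocabulary-based SLAM implementations typically use weighted vocabulary trees and their own scoring functions, so this should not be read as the method of any of the systems discussed here.

```python
from collections import Counter
from math import sqrt


def bow_vector(word_ids):
    """Return an L2-normalized visual-word histogram as {word_id: weight}."""
    counts = Counter(word_ids)
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {w: c / norm for w, c in counts.items()}


def bow_similarity(words_a, words_b):
    """Cosine similarity in [0, 1] between two images given as word-id lists."""
    va, vb = bow_vector(words_a), bow_vector(words_b)
    return sum(weight * vb.get(w, 0.0) for w, weight in va.items())


def loop_candidates(current, past_frames, threshold=0.8, min_gap=2):
    """Indices of past frames similar to `current`.

    The most recent `min_gap` frames are skipped (hypothetical parameter),
    because consecutive images are similar without being a revisited area.
    """
    last = max(len(past_frames) - min_gap, 0)
    return [i for i, words in enumerate(past_frames[:last])
            if bow_similarity(current, words) >= threshold]
```

Skipping the most recent frames reflects the point made above that loop closure concerns the association of nonconsecutive images; the similarity threshold would have to be tuned for each vocabulary and environment.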

Visual Navigation in Close-Range Scenarios
To develop the visual navigation system for underwater vehicles, the most appropriate SLAM technique from the literature was selected. After that, to recognize revisited areas and thus allow a more reliable motion estimation, a vocabulary method was developed. Figure 1 presents the modules that compose the developed system. The first implementation, RTAB-Map, allows the use of external odometry methods, such as viso2, to estimate 6DOF motion, but it is not suited to pure rotations and it assumes that the camera is always at the same height. Moreover, it resorts to a bucketing technique to ensure that feature extraction is uniform over the entire image. S-PTAM divides the SLAM problem into parallel tasks: tracking (detects and matches features to estimate the camera motion) and map optimization (removes the outliers). ORB-SLAM2 uses the same features not only for mapping and tracking but also for place recognition, which makes the system more efficient and reliable. These two methods use a keyframe-based approach for motion estimation, which avoids growth in computational cost. For loop-closure detection, RTAB-Map and ORB-SLAM2 use vocabulary approaches, which evaluate the similarity between the current frame and the others. S-PTAM only performs an iterative bundle adjustment. The ORB-SLAM2 implementation uses an offline vocabulary, whereas RTAB-Map constructs a vocabulary as the images become available. These implementations were compared qualitatively and quantitatively to determine which of them is best suited to the intended context (real-time conditions). Thus, the quality of the motion estimation (Euclidean error), Central Processing Unit (CPU) and Random-Access Memory (RAM) utilization, and the processing time were evaluated. For that purpose, some datasets available online were used, namely the KITTI dataset (Geiger, Lenz, and Urtasun 2012), which depicts outdoor environments (acquired by a car in urban areas and on highways); see Figure 2.
This dataset includes the ground truth of some trajectories and loop-closure situations. ORB-SLAM2 is the only implementation that estimates all camera positions, detects the loop closures and, consequently, adjusts its trajectory. This implementation presents lower errors between the estimated trajectory and the ground truth. RTAB-Map, with viso2 for motion estimation, tries to replicate the motion but with some deviations, which can be explained by its susceptibility to sudden rotations of the camera. S-PTAM is far from achieving the intended trajectory, since frames at the beginning were lost. Thus, it is not possible to draw conclusions about loop-closure detection for these last two implementations. With other sequences of KITTI and with other datasets available online (Stereo_20Hz and MIT Stata Center), it was evident that the ORB-SLAM2 implementation was the best in motion estimation and computational requirements, which are crucial in real-time operations. In terms of processing time, RTAB-Map and S-PTAM give better results. However, as they do not use all the frames for the motion estimation, the obtained values are not significant. Moreover, these implementations have worse performance in Euclidean error and CPU usage. S-PTAM only estimated the motion correctly in simple trajectories (without direction changes), and it is the approach with the highest computational requirements. Finally, qualitatively, ORB-SLAM2 is the only one suitable for the reliable recognition of revisited areas (loop-closure detection). Table 1 presents an overview of the performance obtained by the different methods.

Table 1. Performance overview (columns: RTAB-Map, S-PTAM, ORB-SLAM2; first row: Euclidean error).
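As an illustration of the Euclidean error used in this comparison, the following sketch computes the per-pose distance between an estimated trajectory and its ground truth, together with the mean and maximum error. It assumes that both trajectories are already time-aligned and expressed in the same coordinate frame; the exact evaluation protocol is not detailed in the text, and the function names are hypothetical.

```python
from math import dist  # available from Python 3.8


def euclidean_errors(estimated, ground_truth):
    """Per-pose Euclidean distance between two aligned position lists.

    Assumption: pose i of `estimated` corresponds to pose i of
    `ground_truth`, and both use the same frame and units.
    """
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must contain the same number of poses")
    return [dist(e, g) for e, g in zip(estimated, ground_truth)]


def summarize(errors):
    """Mean and maximum of a non-empty list of errors."""
    return {"mean": sum(errors) / len(errors), "max": max(errors)}
```

When an implementation loses frames, as reported for S-PTAM, the estimated and ground-truth poses would first need to be associated by timestamp before such a comparison is meaningful.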
Moreover, the behavior in ideal conditions (offline processing) was also considered. As expected, when the ratio between the processed frames and the acquired frames is higher, the performance is better. Under these conditions, RTAB-Map and S-PTAM significantly improved their behavior by detecting existing loops; see Figure 4. In general, the ORB-SLAM2 and S-PTAM implementations are the most complete. S-PTAM emphasizes minimizing the dependency between its two threads and the use of binary features, which decrease the computational requirements (accelerating the detection process).
ORB-SLAM2 demonstrates efficient and effective loop detection. The possibility of adapting the vocabulary to the context and the real-time performance of its motion estimation make the ORB-SLAM2 implementation stand out. Thus, given the intended application, this implementation was selected for the underwater context. To evaluate the selected SLAM technique and, consequently, the impact of the visual vocabulary in the underwater environment, data were acquired by a visual system (composed of two cameras and two lasers) in a scenario that simulates this environment. Moreover, two anchors were placed as control points, and points with known distance information were marked to evaluate the obtained motion estimation quantitatively. Figure 5 depicts the data acquisition scenario. Some trajectories were then acquired to evaluate the behavior of ORB-SLAM2 in the underwater environment. The trajectories present some direction changes, and the distance to the ground was kept constant as far as possible. Loop-closure situations were included to assess the developed visual vocabulary method and, consequently, to understand the impact of these approaches on the performance of the motion estimation. Figure 6 illustrates a trajectory performed in the underwater scenario to verify the performance of ORB-SLAM2 with the developed vocabulary method in this environment. The system starts on the left side (point i) and moves about 2 meters (passing by the anchors). There, it performs a direction change (loop-closure situation) and then moves diagonally, with a direction change, until it is near the initial position (point f), but without a loop situation at this point. Figure 7 shows the good motion estimation obtained by ORB-SLAM2. However, this estimation does not include the intended initial point, because the initialization requirements of this implementation were not yet satisfied at that moment. It is also noticeable that a direction change (point mid) near the final position was estimated incorrectly.
However, this is not relevant, because the trajectory ends close to the initial point without loop detection, as expected. The loop-closure situations present in this trajectory were detected (after the direction changes). In addition, the z-axis value is kept and the obtained scale is suitable, as shown in Table 2. To analyze the influence of a vocabulary approach compared with the use of bundle adjustment alone, this experiment was also conducted with the S-PTAM implementation; see Figure 8. As shown, performance was poor, since the motion estimation was not as desired and the revisited areas were not detected. Consequently, the obtained scale is far from the real values, and it was not intended that the x and y values at the end points be similar. Besides that, the final point is closer to the initial point than desired. Thus, iterative Bundle Adjustment alone is not able to detect revisited areas along the camera motion.
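The kind of scale check discussed above can be sketched as follows, assuming that the estimated positions of two marked points are available and that the real distance between them was measured in the scenario. The function names and any values passed to them are hypothetical; they are not taken from Table 2.

```python
from math import dist  # available from Python 3.8


def scale_ratio(est_a, est_b, measured_distance_m):
    """Ratio between an estimated and a measured distance (1.0 is ideal)."""
    if measured_distance_m <= 0:
        raise ValueError("measured distance must be positive")
    return dist(est_a, est_b) / measured_distance_m


def closure_gap(trajectory):
    """Distance between the first and last estimated positions."""
    return dist(trajectory[0], trajectory[-1])
```

A ratio close to 1 indicates that the estimated scale agrees with the measured scene, while the closure gap quantifies how far the final estimated position lies from the initial one.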

Conclusions
Several visual SLAM methods that are typically applied to the navigation of mobile robots were analyzed in this paper. The results showed that ORB-SLAM2 is the most suitable technique. The comparative study of these methods was important for developing the final system, since it allowed better results to be achieved. ORB-SLAM2 also proved its good performance in the underwater scenario, since it was able to follow the motion with correct quantitative information. It presented low errors, namely 40 cm between the initial and mid points and 10 cm between the initial and final positions. Moreover, it detected all loop-closure situations, adjusting its trajectory at these moments. This showed that the developed vocabulary method performs well. In addition, it was visible that iterative Bundle Adjustment is not enough to recognize revisited areas, since S-PTAM did not detect any loop. Thus, the importance of the visual vocabulary approach for increasing navigation performance was demonstrated. It is therefore possible to use the presented system in a real environment, allowing operational flexibility.
In the future, it is intended to increase the robustness of the system, to increase the number of experiments in the underwater environment, and to reduce the computational requirements of the ORB-SLAM2 method.