Abstract We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all inst

2025/05/30 19:31:40

Abstract

We propose a novel, conceptually simple and general framework for instance segmentation on 3D point clouds. Our method, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses 3D bounding boxes for all instances in a point cloud, while simultaneously predicting a point-level mask for each instance. It consists of a backbone network followed by two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Moreover, it is computationally efficient because, unlike existing approaches, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. Extensive experiments show that our approach surpasses existing work on both the ScanNet and S3DIS datasets while being approximately 10 times more computationally efficient. Comprehensive ablation studies demonstrate the effectiveness of our design.

1 Introduction

Enabling machines to understand 3D scenes is a fundamental necessity for autonomous driving, augmented reality and robotics. Core problems on 3D geometric data such as point clouds include semantic segmentation, object detection and instance segmentation. Of these problems, instance segmentation has only started to be tackled in the literature. The primary obstacle is that point clouds are inherently unordered, unstructured and non-uniform. Widely used convolutional neural networks require the 3D point cloud to be voxelized, which incurs high computational and memory costs.

The first neural algorithm to directly tackle 3D instance segmentation is SGPN [50], which groups per-point features by learning a similarity matrix. Similarly, ASIS [51], JSIS3D [34], MASC [30], 3D-BEVIS [8] and [28] apply the same per-point feature grouping pipeline to segment 3D instances. Mo et al. formulate instance segmentation as a per-point feature classification problem in PartNet [32]. However, the segments learnt by these proposal-free methods do not have high objectness, because the methods do not explicitly detect object boundaries. In addition, they inevitably require a post-processing step such as mean shift clustering [6] to obtain the final instance labels, which is computationally heavy. Another pipeline is the proposal-based 3D-SIS [15] and GSPN [58], which usually rely on two-stage training and expensive non-maximum suppression to prune dense object proposals.

In this paper, we present an elegant, efficient and novel framework for 3D instance segmentation, in which objects are loosely but uniquely detected in a single forward stage using efficient MLPs, and each instance is then precisely segmented by a simple point-level binary classifier. To this end, we introduce a new bounding box prediction module together with a series of carefully designed loss functions to directly learn object boundaries. Our framework differs significantly from existing proposal-based and proposal-free approaches, since we are able to efficiently segment all instances with high objectness without relying on expensive and dense object proposals. Our code and data are available at https://github.com/Yang7879/3D-BoNet.


Figure 1: The 3D-BoNet framework for instance segmentation on 3D point clouds.

The bounding box prediction branch is the core of our framework. This branch aims to predict a unique, unoriented rectangular bounding box for each instance in a single forward stage, without relying on predefined spatial anchors or a region proposal network [39]. As shown in Figure 2, we believe that roughly drawing a 3D bounding box for an instance is relatively achievable, because the input point cloud explicitly contains 3D geometric information. Doing so before tackling point-level instance segmentation is also highly beneficial, since a reasonable bounding box guarantees high objectness for the learnt segments. However, learning instance boxes involves two critical issues: 1) the total number of instances is variable, from one to many, and 2) the instances have no fixed order. These issues pose great challenges for correctly optimizing the network, because no information directly links the predicted boxes to the ground-truth labels to supervise the network. We show how to solve these issues elegantly. The box prediction branch simply takes the global feature vector as input and directly outputs a large, fixed number of bounding boxes together with confidence scores. The scores indicate whether a box contains a valid instance.

To supervise the network, we design a novel bounding box association layer followed by a multi-criteria loss function. Given a set of ground-truth instances, we need to determine which predicted box best fits each of them. We formulate this association process as an optimal assignment problem that can be solved with existing solvers. After the boxes have been optimally associated, our multi-criteria loss function not only minimizes the Euclidean distance between paired boxes, but also maximizes the coverage of valid points inside the predicted boxes.
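The output format of the box prediction branch can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: it uses a single fully connected layer with random weights, a hypothetical feature size of 256 and H = 24 boxes (the value reported for S3DIS in Section 3.2). The function name `predict_boxes` and the min/max reordering step are assumptions made for this example.

```python
import numpy as np

def predict_boxes(global_feature, w_box, b_box, w_score, b_score, num_boxes):
    """Map one global feature vector to a fixed number of boxes and scores.

    Hypothetical single-layer version of the box prediction branch;
    the real network may use more layers and different activations.
    """
    # Regress 6 numbers per box, then view them as two 3D vertices.
    raw = global_feature @ w_box + b_box             # shape: (num_boxes * 6,)
    vertices = raw.reshape(num_boxes, 2, 3)          # shape: (H, 2, 3)
    # Reorder per axis so that index 0 holds the min vertex and index 1
    # the max vertex (an assumption made for this sketch).
    boxes = np.stack([vertices.min(axis=1), vertices.max(axis=1)], axis=1)
    # One confidence score in (0, 1) per box via a sigmoid.
    logits = global_feature @ w_score + b_score      # shape: (num_boxes,)
    scores = 1.0 / (1.0 + np.exp(-logits))
    return boxes, scores

rng = np.random.default_rng(0)
feat_dim, H = 256, 24                                # hypothetical sizes
feature = rng.normal(size=feat_dim)
boxes, scores = predict_boxes(
    feature,
    rng.normal(size=(feat_dim, H * 6)) * 0.01, np.zeros(H * 6),
    rng.normal(size=(feat_dim, H)) * 0.01, np.zeros(H),
    num_boxes=H,
)
print(boxes.shape, scores.shape)
```

Because the number of outputs is fixed at H regardless of how many objects the scene contains, the scores are what allow unused boxes to be discarded at test time.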


Figure 2: Rough instance boxes.

We then feed the predicted boxes, together with the point and global features, into the subsequent point mask prediction branch to predict a point-level binary mask for each instance. The purpose of this branch is to classify whether each point inside a bounding box belongs to the valid instance or to the background. Assuming the estimated instance box is reasonably good, an accurate point mask is very likely to be obtained, because this branch only has to reject the points that do not belong to the detected instance. Even random guessing would be correct for about 50% of the points.
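The intuition that a good box already does most of the work can be illustrated with a box-only baseline. The sketch below is not the mask branch itself: it simply marks every point inside a min-max box as a candidate for the instance, leaving the learned classifier the job of rejecting the remaining background points. The function name `points_in_box` and the toy coordinates are hypothetical.

```python
import numpy as np

def points_in_box(points, box_min, box_max):
    """Return a boolean mask of the points lying inside an axis-aligned box."""
    points = np.asarray(points, dtype=float)
    inside = (points >= box_min) & (points <= box_max)   # per-axis test, (N, 3)
    return inside.all(axis=1)                            # inside on x, y and z

points = np.array([
    [0.2, 0.2, 0.2],   # instance point
    [0.8, 0.5, 0.4],   # instance point
    [0.5, 0.9, 0.1],   # background point that happens to fall in the box
    [2.0, 2.0, 2.0],   # background point far away from the box
])
candidate_mask = points_in_box(points, box_min=[0, 0, 0], box_max=[1, 1, 1])
print(candidate_mask)
```

In this toy case the box alone already rejects the distant background point, so the binary classifier only has to decide among the three points that fall inside the box.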

Overall, our framework differs from all existing 3D instance segmentation approaches in three ways. 1) Compared with the proposal-free pipeline, our method segments instances with high objectness by explicitly learning 3D object boundaries. 2) Compared with widely used proposal-based approaches, our framework does not require expensive and dense proposals. 3) Our framework is remarkably efficient, since the instance-level masks are learnt in a single forward pass without any post-processing steps. Our key contributions are:

  • We propose a new framework for instance segmentation on 3D point clouds. The framework is single-stage, anchor-free and end-to-end trainable, without requiring any post-processing steps.
  • We design a novel bounding box association layer followed by a multi-criteria loss function to supervise the box prediction branch.
  • We demonstrate significant improvements over baselines and provide the intuition behind our design choices through extensive ablation studies.


Figure 3: The general workflow of the 3D-BoNet framework.

2 3D-BoNet

2.1 Overview


2.2 Bounding Box Prediction

"Bounding Box Encoding:" In existing object detection networks, a bounding box is usually represented by its center location and the lengths of its three dimensions [3], or by the corresponding residuals [60], together with its orientation. Instead, for simplicity, we parameterize the rectangular bounding box by only its two min-max vertices.
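As a small illustration of this encoding, the sketch below derives the two min-max vertices from the points of one instance and converts them to the center-and-size form used by other detectors. The helper names `box_from_points` and `to_center_size` and the toy coordinates are assumptions for this example; the original text does not spell out how the ground-truth boxes are computed.

```python
import numpy as np

def box_from_points(instance_points):
    """Encode an instance as its min and max vertices, shape (2, 3)."""
    pts = np.asarray(instance_points, dtype=float)
    return np.stack([pts.min(axis=0), pts.max(axis=0)])

def to_center_size(box):
    """Convert a min-max box to the (center, size) form used by other detectors."""
    box_min, box_max = box
    return (box_min + box_max) / 2.0, box_max - box_min

# Three hypothetical points belonging to one object.
chair = [[1.0, 2.0, 0.0], [1.5, 2.4, 0.9], [1.2, 2.1, 0.45]]
box = box_from_points(chair)
center, size = to_center_size(box)
print(box)
print(center, size)
```

The min-max form needs only six numbers per box and no orientation, which is what keeps the regression target simple.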


Figure 4: Architecture of the bounding box regression branch. The predicted boxes are optimally associated with the ground-truth boxes before the multi-criteria loss is calculated.


To solve the above optimal association problem, the existing Hungarian algorithm [20; 21] is applied.

"Association Matrix Calculation:" To evaluate the similarity between the i-th predicted box and the j-th ground-truth box, a simple and intuitive criterion is the Euclidean distance between the two pairs of min-max vertices. However, this alone is not optimal. Basically, we want the predicted box to include as many valid points as possible. As shown in Figure 5, the input point cloud is usually sparse and unevenly distributed in 3D space. For the same ground-truth box #0 (blue), candidate box #2 (red) is considered much better than candidate box #1 (black), because box #2 has more valid points in common with #0. Therefore, the coverage of valid points should be included when calculating the cost matrix. In this paper, we consider three criteria: the Euclidean distance between vertices, a soft intersection-over-union (sIoU) over points, and a cross-entropy score.
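The association step can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the cost adds the mean squared vertex distance to a hard (non-differentiable) point-coverage IoU term, which stands in for the criteria listed above, and the assignment is solved with SciPy's `linear_sum_assignment`, a solver for the same assignment problem the Hungarian algorithm addresses. The function names and toy boxes are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def vertex_distance_cost(pred_boxes, gt_boxes):
    """Mean squared distance between min-max vertices, shape (H, T)."""
    diff = pred_boxes[:, None] - gt_boxes[None, :]        # (H, T, 2, 3)
    return (diff ** 2).mean(axis=(2, 3))

def coverage_cost(pred_boxes, gt_boxes, points):
    """1 - IoU of the point sets enclosed by each pair of boxes, shape (H, T)."""
    def inside(boxes):
        # (num_boxes, N) boolean: is point n inside box b?
        return ((points[None] >= boxes[:, None, 0]) &
                (points[None] <= boxes[:, None, 1])).all(axis=2)
    p, g = inside(pred_boxes), inside(gt_boxes)
    inter = (p[:, None] & g[None]).sum(axis=2)
    union = (p[:, None] | g[None]).sum(axis=2)
    return 1.0 - inter / np.maximum(union, 1)

def associate(pred_boxes, gt_boxes, points):
    """Return matched (predicted index, ground-truth index) arrays."""
    cost = (vertex_distance_cost(pred_boxes, gt_boxes)
            + coverage_cost(pred_boxes, gt_boxes, points))
    return linear_sum_assignment(cost)

gt_boxes = np.array([[[0, 0, 0], [1, 1, 1]],
                     [[2, 2, 2], [3, 3, 3]]], dtype=float)
pred_boxes = np.array([[[2.1, 2.0, 1.9], [3.0, 3.1, 3.0]],   # close to gt 1
                       [[5.0, 5.0, 5.0], [6.0, 6.0, 6.0]],   # spare box
                       [[0.0, 0.1, 0.0], [1.1, 1.0, 0.9]]],  # close to gt 0
                      dtype=float)
points = np.array([[0.5, 0.5, 0.5], [0.2, 0.3, 0.4],
                   [2.5, 2.5, 2.5], [2.8, 2.2, 2.6], [4.0, 4.0, 4.0]])
pred_idx, gt_idx = associate(pred_boxes, gt_boxes, points)
print(list(zip(pred_idx.tolist(), gt_idx.tolist())))
```

With more predicted boxes than ground-truth boxes, the solver leaves the surplus predictions unmatched; in this toy case the spare box is the one left out.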


Figure 5: Sparse input point cloud.


2.3 Point Mask Prediction


Table 1: Instance segmentation results on the ScanNet(v2) benchmark (hidden test set). The metric is AP (%) with an IoU threshold of 0.5. Accessed on 2 June 2019.


Figure 6: Architecture of the point mask prediction branch. The point features are fused with each bounding box and its score, after which a point-level binary mask is predicted for each instance.

2.4 End-to-End Implementation


3 Experiments

3.1 Evaluation on ScanNet Benchmark

We first evaluate our approach on the ScanNet(v2) 3D semantic instance segmentation benchmark [7]. Similar to SGPN [50], we divide the raw input point clouds into 1 m × 1 m blocks for training, while using all points for testing, followed by the BlockMerging algorithm [50] to assemble the blocks into complete 3D scenes. In our experiments, we observe that the performance of the vanilla PointNet++ based semantic prediction sub-branch is limited and cannot provide satisfactory semantics. Thanks to the flexibility of our framework, we can easily train a parallel SCN network [11] to estimate more accurate per-point semantic labels for the instances predicted by 3D-BoNet. The average precision (AP) with an IoU threshold of 0.5 is used as the evaluation metric.
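The block division used for training can be sketched as follows. This is a minimal illustration that assumes the 1 m × 1 m blocks are laid out on the x-y plane with unbounded height, as is common for indoor scans; the text does not state the axes explicitly. The function name `split_into_blocks` is hypothetical, and the BlockMerging step of [50] is not reproduced here.

```python
import numpy as np
from collections import defaultdict

def split_into_blocks(points, block_size=1.0):
    """Group point indices by the block (in the x-y plane) they fall into."""
    points = np.asarray(points, dtype=float)
    # Integer block coordinates along x and y; z is left unbounded.
    cells = np.floor(points[:, :2] / block_size).astype(int)
    blocks = defaultdict(list)
    for index, (bx, by) in enumerate(cells):
        blocks[(int(bx), int(by))].append(index)
    return dict(blocks)

scene = [[0.2, 0.3, 1.0],
         [0.9, 0.1, 0.5],
         [1.4, 0.2, 2.0],
         [1.6, 1.7, 0.1]]
blocks = split_into_blocks(scene)
print(blocks)
```

An object that straddles a block border ends up split across several blocks, which is why a merging step is needed to assemble scene-level instances afterwards.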

We compare with the leading approaches on 18 object categories in Table 1. In particular, SGPN [50], 3D-BEVIS [8], MASC [30] and [28] are point feature clustering based approaches; R-PointNet [58] learns to generate dense object proposals followed by point-level segmentation; 3D-SIS [15] is a proposal-based approach that uses both point clouds and color images as input. PanopticFusion [33] learns to segment instances on multiple 2D images with Mask-RCNN [13] and then uses a SLAM system to reproject them back into 3D space. Our approach surpasses all of them using point clouds only. Remarkably, our framework performs relatively well on all categories without favoring specific classes, demonstrating the superiority of our framework.


Figure 7: A lecture room with hundreds of objects (e.g. chairs, tables), highlighting the challenge of instance segmentation. Different colors indicate different instances. The same instance may not have the same color in different results. Our framework predicts more precise instance labels than the other approaches.

3.2 Evaluation on S3DIS Dataset

We further evaluate the semantic instance segmentation of our framework on S3DIS [1], which consists of complete 3D scans of 271 rooms in 6 large areas. Our data preprocessing and experimental settings strictly follow PointNet [37], SGPN [50], ASIS [51] and JSIS3D [34]. In our experiments, H is set to 24 and we follow the 6-fold evaluation [1; 51].

We compare with ASIS [51], the state of the art on S3DIS, and with the PartNet baseline [32]. For a fair comparison, we carefully train the PartNet baseline with the same PointNet++ backbone and other settings as used in our framework. For evaluation, the classical metrics mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5 are reported. Note that we use the same BlockMerging algorithm [50] to merge instances from different blocks for both our approach and the PartNet baseline. The final scores are averaged over a total of 13 categories. Table 2 presents the mPrec/mRec scores and Figure 7 shows qualitative results. Our approach surpasses the PartNet baseline [32] by a large margin and also outperforms ASIS [51], although not significantly, mainly because our semantic prediction branch (based on vanilla PointNet++) is inferior to that of ASIS, which tightly fuses semantic and instance features for mutual optimization. We leave such feature fusion for future exploration.
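How precision and recall at an IoU threshold of 0.5 can be computed for one category is sketched below. This is an illustrative sketch rather than the official evaluation script: instances are represented as sets of point indices, a prediction counts as correct when its IoU with a not yet matched ground-truth instance exceeds the threshold, and the function names are hypothetical. mPrec and mRec would then be the averages of these values over the categories.

```python
def instance_iou(pred, gt):
    """IoU between two instances given as sets of point indices."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0

def precision_recall(pred_instances, gt_instances, iou_threshold=0.5):
    """Greedy one-to-one matching of predictions to ground-truth instances."""
    matched_gt = set()
    true_positives = 0
    for pred in pred_instances:
        for j, gt in enumerate(gt_instances):
            if j not in matched_gt and instance_iou(pred, gt) > iou_threshold:
                matched_gt.add(j)
                true_positives += 1
                break
    precision = true_positives / len(pred_instances) if pred_instances else 0.0
    recall = true_positives / len(gt_instances) if gt_instances else 0.0
    return precision, recall

ground_truth = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8}]
predictions = [{0, 1, 2}, {4, 5, 9, 10}, {7, 8}, {11}]
precision, recall = precision_recall(predictions, ground_truth)
print(precision, recall)
```

In this toy case two of the four predictions pass the threshold, so precision is 2/4 and recall is 2/3; the second prediction overlaps its target with an IoU of only 0.4 and is therefore counted as a miss.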


Table 2: Instance segmentation results on S3DIS dataset.

3.3 Ablation Study

To evaluate the effectiveness of each component of our framework, we conduct six groups of ablation experiments on the largest area, Area 5, of the S3DIS dataset.


Table 3: Instance segmentation results of all ablation experiments on Area 5 of S3DIS.


"Analysis." Table 3 shows the scores of the ablation experiments. (1) The box score sub-branch does benefit the overall instance segmentation performance, as it tends to penalize duplicated box predictions. (2) Compared with the Euclidean distance and cross-entropy score, the sIoU cost tends to be better for box association and supervision, thanks to our differentiable Algorithm 1. Since the three individual criteria prefer different types of point structures, a simple combination of them may not always be optimal on a specific dataset. (3) Without supervision on box prediction, the performance drops significantly, primarily because the network is unable to infer satisfactory 3D instance boundaries, and the quality of the predicted point masks deteriorates accordingly. (4) Compared with focal loss, the standard cross-entropy loss is less effective for point mask prediction because of the imbalance between instance and background points.
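The effect described in point (4) can be illustrated numerically. The sketch below compares the standard binary cross-entropy with the binary focal loss for a single point; the weighting `alpha = 0.75` and focusing parameter `gamma = 2.0` are hypothetical defaults chosen for this example, since the text above does not state the values used.

```python
import math

def cross_entropy(p, label):
    """Binary cross-entropy for one point; p is the predicted instance probability."""
    p_t = p if label == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, label, alpha=0.75, gamma=2.0):
    """Binary focal loss for one point (alpha and gamma are assumed values)."""
    p_t = p if label == 1 else 1.0 - p
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified points.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_background = (0.05, 0)   # background point that is confidently rejected
hard_instance = (0.30, 1)     # instance point the classifier is unsure about
for name, (p, y) in [("easy background", easy_background),
                     ("hard instance", hard_instance)]:
    print(name, round(cross_entropy(p, y), 4), round(focal_loss(p, y), 6))
```

Under these assumed settings the easy background point contributes almost nothing to the focal loss, while the hard instance point keeps a sizeable share of its cross-entropy value, so the many easy background points no longer dominate the gradient.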

3.4 Computation Analysis


4 Related Work

To extract features from 3D point clouds, traditional approaches usually craft features by hand [5; 42]. Recent learning-based approaches mainly include voxel-based [42; 46; 41; 23; 40; 11; 4] and point-based schemes [37; 19; 14; 16; 45].

"Semantic Segmentation" PointNet [37] shows leading results on classification and semantic segmentation, but it does not capture context features. To address this, many approaches [38; 57; 43; 31; 55; 49; 26; 17] have been proposed recently. Another pipeline consists of convolutional kernel based approaches [55; 27; 47]. Basically, most of these approaches can be used as our backbone network and trained in parallel with our 3D-BoNet to learn per-point semantics.

"Object Detection" The common way to detect objects in 3D point clouds is to project the points onto 2D images and regress bounding boxes there [25; 48; 3; 56; 59; 53]. Detection performance is further improved by fusing RGB images [3; 54; 36; 52]. Point clouds can also be divided into voxels for object detection [9; 24; 60]. However, most of these approaches rely on predefined anchors and a two-stage region proposal network [39], and extending them to 3D point clouds is inefficient. Without relying on anchors, the recent PointRCNN [44] learns to detect via foreground point segmentation, while VoteNet [35] detects objects via point feature grouping, sampling and voting. In contrast, our box prediction branch is completely different from all of them: our framework directly regresses 3D object bounding boxes from compact global features in a single forward pass.

"Instance Segmentation" SGPN [50] is the first neural algorithm to segment instances on 3D point clouds by grouping point-level embeddings. ASIS [51], JSIS3D [34], MASC [30], 3D-BEVIS [8] and [28] use the same strategy of grouping point-level features for instance segmentation. Mo et al. introduce a segmentation algorithm in PartNet [32] that classifies point features. However, the segments learnt by these proposal-free methods do not have high objectness, because the methods do not explicitly detect object boundaries. Drawing on the successful 2D RPN [39] and RoI [13], GSPN [58] and 3D-SIS [15] are proposal-based methods for 3D instance segmentation. However, they usually rely on two-stage training and a post-processing step for dense proposal pruning. In contrast, our framework directly predicts a point-level mask for each instance within an explicitly detected object boundary, without requiring any post-processing steps.

5 Conclusion

Our framework is simple, effective and efficient for instance segmentation on 3D point clouds. However, it also has some limitations, which point to future work. (1) Instead of using an unweighted combination of the three criteria, it would be better to design a module that automatically learns the weights, so as to adapt to different types of input point clouds. (2) Instead of training a separate branch for semantic prediction, more advanced feature fusion modules could be introduced so that semantic and instance segmentation improve each other. (3) Our framework follows the MLP design and is therefore agnostic to the number and order of input points. Drawing on recent work [10][22], it is desirable to train and test directly on large-scale input point clouds rather than on small divided blocks.

Abstract

We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all instances in the point cloud, while predicting the point-level mask of each instance. It consists of a backbone network and two parallel network branches for 1) bounding box regression and 2) point mask prediction. 3D-BoNet is single-stage, anchor-free and end-to-end trainable. Furthermore, it is computationally efficient because unlike existing methods, it does not require any post-processing steps such as non-maximum suppression, feature sampling, clustering or voting. A large number of experiments show that our method goes beyond existing work on ScanNet and S3DIS datasets, while improving computational efficiency by about 10 times. Comprehensive ablation studies demonstrate the effectiveness of our design.

1 Introduction

enables machines to understand 3D scenes as a basic necessity for autonomous driving, augmented reality and robotics. The core issues of 3D geometric data such as point cloud include semantic segmentation, object detection and instance segmentation. In these problems, instance segmentation begins to be solved in the literature. The main obstacle is that point clouds are inherently disordered, unstructured, and uneven. The widely used convolutional neural networks require voxelization of 3D point clouds, resulting in high computing and memory costs.

The first neural algorithm to directly handle 3D instance segmentation is SGPN [50], which groups the features of each point through similarity matrix learning. Similarly, ASIS [51], JSIS3D [34], MASC [30], 3D-BEVIS [8], and [28] apply the same per-point feature grouping pipeline to segmented 3D instances. Mo et al. described instance segmentation as a point-by-point feature classification problem in PartNet [32]. However, the learning fragments of these proposal-free methods are not very objective because they do not explicitly detect the target boundaries. Furthermore, they inevitably require post-processing steps such as mean offset clustering [6] to obtain the final instance label, which is computationally arduous. Another pipeline is 3D-SIS [15] and GSPN [58] based on proposals, which typically rely on two-stage training and expensive non-maximum suppression to prune dense target proposals.

In this paper, we propose an elegant, efficient and novel 3D instance segmentation framework that uses single forward phases of efficient MLPs to perform loose but unique detection of objects, and then accurately segment each instance through a simple point-level binary classifier. To this end, we introduce a new bounding box prediction module along with a series of carefully designed loss functions to directly learn the target boundaries. Our framework is very different from the existing proposal and proposal-free methods because we are able to efficiently segment all instances with high objectives, but do not rely on expensive and intensive target proposals. Our code and data are available at https://github.com/Yang7879/3D-BoNet.

Abstract We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all inst - DayDayNews

Figure 1: 3D-BoNet framework for instance segmentation on 3D point cloud.

bounding box prediction branch is the core of our framework. This branch is intended to predict a unique, directionless rectangular bounding box for each instance in the single forward stage without relying on predefined spatial anchors or regional proposal networks [39]. As shown in Figure 2, we believe that roughly drawing a 3D bounding box for an instance is relatively possible, because the input point cloud explicitly contains 3D geometric information, which is very beneficial before processing point-level instance segmentation, because a reasonable bounding box can ensure a high degree of objectiveness of the learning segment. However, the Learning Examples box involves key issues: 1) The total number of instances is variable, i.e. from 1 to many, 2) There is no fixed order of all instances. These issues present a huge challenge to correctly optimizing the network, as there is no information that can directly link the prediction box to the ground truth tag to oversee the network. However, we show how to solve these problems gracefully. This box prediction branch simply takes the global eigenvector as input and directly outputs a large number of bounding boxes and confidence scores. These scores are used to indicate whether the box contains a valid instance.To supervise the network, we designed a novel bounding box association layer, followed by a multi-standard loss function. Given a set of ground-truth instances, we need to determine which prediction box is best for them. We describe this association process as an optimal allocation problem with existing solvers. After the boxes are optimally associated, our multicriteria loss function not only minimizes the Euclidean distance of the paired boxes, but also maximizes the coverage of effective points in the prediction box.

Abstract We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all inst - DayDayNews

Figure 2: A rough example box.

then input the predicted box along with the points and global features into the subsequent point mask prediction branch to predict a point-level binary mask for each instance. The purpose of this branch is to classify whether each point in the bounding box belongs to a valid instance or a background. Assuming the estimated instance box is quite good, it is very likely to get an accurate point mask, because this branch simply rejects points that do not belong to the detected instance. Random guessing may lead to a 50% correction.

Overall, our framework is different from all existing 3D instance segmentation methods in three aspects. 1) Compared with the proposal-free pipeline, our method segments instances with high objectiveness by explicitly learning the 3D target boundaries. 2) Our framework does not require expensive and intensive proposals compared to widely used proposal-based methods. 3) Our framework is very efficient because instance-level masks are learned in a single-forward pass without any post-processing steps. Our main contribution is:

  • We propose a new framework for instance segmentation on 3D point cloud. The framework is single-stage, anchor-free and end-to-end trainable without any post-processing steps.
  • We designed a novel bounding box association layer, followed by a multi-standard loss function to supervise the box prediction branch.
  • We demonstrate significant improvements to baselines and provide an intuitive basis for our design choices through extensive ablation studies.

Abstract We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all inst - DayDayNews

Figure 3: General workflow of 3D-BoNet framework.

2 3D-BoNet

2.1 Overview

Abstract We propose a novel, conceptually simple general framework for instance segmentation on 3D point clouds. Our approach, called 3D-BoNet, follows the simple design philosophy of a multi-layer perceptron (MLP). The framework directly regresses the 3D bounding box of all inst - DayDayNews

2.2 Bounding Box Prediction

"Border Box Encoding:" In the existing target detection network, the bounding box is usually represented by the center position and the length [3] of the three dimensions or the corresponding residual [60] and the direction. Instead, for simplicity, we parameterize the rectangular bounding box only through two min-max vertices:

Figure 4: Architecture of the bounding box regression branch. The predicted boxes are optimally associated with the ground truth boxes before the multi-criteria loss is calculated.

To solve the above optimal association problem, the existing Hungarian algorithm [20; 21] is applied.

Association matrix calculation: To evaluate the similarity between a predicted box and a ground truth box, a simple and intuitive criterion is the Euclidean distance between the two pairs of min-max vertices. However, this criterion alone is not optimal. Basically, we want the predicted box to contain as many valid points as possible. As shown in Figure 5, input point clouds are usually sparse and unevenly distributed in 3D space. For the same ground truth box #0 (blue), candidate box #2 (red) is considered much better than candidate box #1 (black), because box #2 has more valid points overlapping with #0. Therefore, the coverage of valid points should be included when calculating the cost matrix. In this paper, we consider three criteria: the Euclidean distance between vertices, a soft Intersection-over-Union (sIoU) on points, and a cross-entropy score.
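The association step can be sketched as follows, using only the Euclidean vertex distance as the cost. This is a simplified illustration under stated assumptions: the helper names `vertex_distance` and `associate` are hypothetical, there are at least as many predicted boxes as ground truth boxes, and an exhaustive search over permutations stands in for the Hungarian algorithm. The exhaustive search returns the same optimal assignment but is only practical for a handful of boxes.

```python
from itertools import permutations
from math import dist


def vertex_distance(pred, gt):
    """Sum of Euclidean distances between the two min and the two max vertices."""
    return dist(pred[0], gt[0]) + dist(pred[1], gt[1])


def associate(pred_boxes, gt_boxes):
    """Find the minimum-cost pairing of ground truth boxes with predicted boxes.

    Returns (assignment, total_cost), where assignment[j] is the index of the
    predicted box paired with ground truth box j. Brute force, so only usable
    for small inputs; assumes len(pred_boxes) >= len(gt_boxes).
    """
    cost = [[vertex_distance(p, g) for g in gt_boxes] for p in pred_boxes]
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost


# Three hypothetical predictions and two ground truth boxes.
pred = [((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)),
        ((5.0, 5.0, 5.0), (6.0, 6.0, 6.0)),
        ((2.0, 2.0, 2.0), (3.0, 3.0, 3.0))]
gt = [((5.0, 5.0, 5.0), (6.0, 6.0, 6.5)),
      ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))]

assignment, total = associate(pred, gt)
print(assignment, total)  # (1, 0) 0.5
```

For realistic numbers of boxes, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the natural replacement for the permutation loop.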

Figure 5: Sparse input point cloud.
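The situation in Figure 5 can be reproduced numerically with a small sketch. Everything here is an illustrative assumption: the point coordinates and boxes are made up, and `point_iou` uses a hard inside/outside test on points rather than the differentiable soft version used for training. It shows a candidate that is closer in vertex distance yet covers fewer valid points.

```python
from math import dist


def inside(point, box):
    """True if the point lies within the axis-aligned min-max box."""
    lo, hi = box
    return all(l <= c <= h for c, l, h in zip(point, lo, hi))


def point_iou(points, box_a, box_b):
    """IoU measured on the points that fall inside each box (hard version)."""
    in_a = [inside(p, box_a) for p in points]
    in_b = [inside(p, box_b) for p in points]
    inter = sum(a and b for a, b in zip(in_a, in_b))
    union = sum(a or b for a, b in zip(in_a, in_b))
    return inter / union if union else 0.0


def vertex_distance(a, b):
    """Sum of distances between the two min and the two max vertices."""
    return dist(a[0], b[0]) + dist(a[1], b[1])


# Sparse, uneven points: most of them sit near one corner of the ground truth.
points = [(0.2, 0.2, 0.2), (3.0, 3.0, 3.0), (3.5, 3.5, 3.5),
          (3.2, 3.8, 3.1), (3.9, 3.1, 3.6)]
ground_truth = ((0.0, 0.0, 0.0), (4.0, 4.0, 4.0))
candidate_1 = ((0.0, 0.0, 0.0), (2.5, 2.5, 2.5))
candidate_2 = ((2.8, 2.8, 2.8), (4.0, 4.0, 4.0))

# Candidate 1 wins on vertex distance but loses on point coverage.
print(vertex_distance(candidate_1, ground_truth) <
      vertex_distance(candidate_2, ground_truth))   # True
print(point_iou(points, candidate_1, ground_truth))  # 0.2
print(point_iou(points, candidate_2, ground_truth))  # 0.8
```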

2.3 Point Mask Prediction

Table 1: Instance segmentation results on the ScanNet(v2) benchmark (hidden test set). The metric is AP (%) with an IoU threshold of 0.5. Accessed on June 2, 2019.

Figure 6: Architecture of the point mask prediction branch. Point features are fused with each bounding box and its score, and a point-level binary mask is then predicted for each instance.

2.4 End-to-End Implementation

3 Experiments

3.1 Evaluation on ScanNet Benchmark

We first evaluate our method on the ScanNet(v2) 3D semantic instance segmentation benchmark [7]. Similar to SGPN [50], we divide the original input point clouds into 1m × 1m blocks for training, use all points for testing, and then assemble the blocks into a complete 3D scene with the BlockMerging algorithm [50]. In our experiments, we observe that the semantic prediction sub-branch based on vanilla PointNet++ is limited in performance and cannot provide satisfactory semantics. Thanks to the flexibility of our framework, we can easily train a parallel SCN network [11] to estimate more accurate semantic labels for the instances predicted by 3D-BoNet. Average precision (AP) with an IoU threshold of 0.5 is used as the evaluation metric.
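The block division can be sketched as a simple grid partition in the horizontal plane. This is only an illustration: the function name `split_into_blocks` is hypothetical, the blocks here do not overlap, and no point sampling is applied, whereas an actual preprocessing pipeline may do both. The BlockMerging step [50] is not shown.

```python
from collections import defaultdict
from math import floor


def split_into_blocks(points, block_size=1.0):
    """Group (x, y, z) points by the block_size x block_size cell of their (x, y)."""
    blocks = defaultdict(list)
    for p in points:
        key = (floor(p[0] / block_size), floor(p[1] / block_size))
        blocks[key].append(p)
    return dict(blocks)


# Hypothetical scene points, coordinates in meters.
scene = [(0.2, 0.3, 1.0), (0.9, 0.1, 0.5), (1.4, 0.2, 2.0), (-0.5, 2.5, 0.0)]
blocks = split_into_blocks(scene)
for key in sorted(blocks):
    print(key, len(blocks[key]))
# (-1, 2) 1
# (0, 0) 2
# (1, 0) 1
```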

We compare with the leading approaches on 18 object categories in Table 1. In particular, SGPN [50], 3D-BEVIS [8], MASC [30] and [28] are methods based on point feature clustering; R-PointNet [58] learns to generate dense object proposals and then performs point-level segmentation; 3D-SIS [15] is a proposal-based method that uses both point clouds and color images as inputs. PanopticFusion [33] learns to segment instances on multiple 2D images with Mask-RCNN [13] and then reprojects them back to 3D space using a SLAM system. Our approach surpasses all of them while using point clouds only. It is worth noting that our framework performs relatively well across all categories without favoring specific classes, which demonstrates the superiority of our framework.

Figure 7: A lecture room with hundreds of objects (e.g. chairs, tables), highlighting the challenge of instance segmentation. Different colors indicate different instances. The same instance may not have the same color. Our framework predicts more precise instance labels than other frameworks.

3.2 Evaluation on S3DIS Dataset

We further evaluate the semantic instance segmentation of our framework on S3DIS [1], which consists of complete 3D scans of 271 rooms from 6 large areas. Our data preprocessing and experimental setup strictly follow PointNet [37], SGPN [50], ASIS [51], and JSIS3D [34]. In our experiments, H is set to 24 and we follow the 6-fold evaluation [1; 51].

We compare with ASIS [51], the state of the art on S3DIS, and the PartNet baseline [32]. For a fair comparison, we carefully train the PartNet baseline using the same PointNet++ backbone and other settings as used in our framework. For evaluation, the classical metrics mean precision (mPrec) and mean recall (mRec) with an IoU threshold of 0.5 are reported. Note that for both our method and the PartNet baseline, we use the same BlockMerging algorithm [50] to merge instances from different blocks. The final score is the average over a total of 13 categories. Table 2 shows the mPrec/mRec scores, and Figure 7 shows the qualitative results. Our approach exceeds the PartNet baseline [32] by a large margin and is also better than ASIS [51], though not by a significant margin, mainly because our semantic prediction branch (based on vanilla PointNet++) is inferior to that of ASIS, which tightly fuses semantic and instance features for mutual optimization. We leave such feature fusion to future exploration.
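How precision and recall at an IoU threshold of 0.5 are obtained can be sketched for a single category as follows. This is a minimal illustration, not the benchmark's evaluation script: instances are represented as sets of point indices, the names `iou` and `precision_recall` are hypothetical, each ground truth instance can be matched at most once, and the averaging over the 13 categories is omitted.

```python
def iou(a, b):
    """Intersection over union of two sets of point indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def precision_recall(pred_instances, gt_instances, threshold=0.5):
    """Precision and recall of predicted instances for one category."""
    matched_gt = set()
    true_pos = 0
    for pred in pred_instances:
        for j, gt in enumerate(gt_instances):
            # A prediction counts as correct if it overlaps an unmatched
            # ground truth instance with IoU at or above the threshold.
            if j not in matched_gt and iou(pred, gt) >= threshold:
                matched_gt.add(j)
                true_pos += 1
                break
    precision = true_pos / len(pred_instances) if pred_instances else 0.0
    recall = true_pos / len(gt_instances) if gt_instances else 0.0
    return precision, recall


# Hypothetical instances given as sets of point indices.
gt_instances = [{0, 1, 2, 3}, {4, 5, 6}, {7, 8}]
pred_instances = [{0, 1, 2}, {4, 5, 9, 10}, {7, 8}, {11}]

precision, recall = precision_recall(pred_instances, gt_instances)
print(precision)  # 0.5
```

In this example two of the four predictions are correct and two of the three ground truth instances are recovered, so precision is 2/4 and recall is 2/3.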

Table 2: Instance segmentation results on S3DIS dataset.

3.3 Ablation Study

To evaluate the effectiveness of each component of our framework, we conduct 6 groups of ablation experiments on Area 5, the largest area of the S3DIS dataset.

Table 3: Instance segmentation results of all ablation experiments on Area 5 of S3DIS.

"Analysis." Table 3 shows the scores of the ablation experiments. (1) The box score sub-branch does benefit the overall instance segmentation performance, as it tends to penalize duplicate box predictions. (2) Compared with the Euclidean distance and cross-entropy score, the sIoU cost tends to be better for box association and supervision, thanks to our differentiable Algorithm 1. Since the three individual criteria prefer different types of point structures, an unweighted combination of the three may not always be optimal on a particular dataset. (3) Without supervision of the box prediction, performance drops significantly, mainly because the network cannot infer satisfactory 3D instance boundaries and the quality of the predicted point masks deteriorates accordingly. (4) Because of the imbalance between instance and background points, the standard cross-entropy loss is less effective than focal loss for point mask prediction.
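The effect described in point (4) can be illustrated with the binary focal loss of [29], FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The sketch below is illustrative only: the values γ = 2 and α = 0.25 are the defaults reported in [29], not necessarily the ones used in our experiments, and the predicted probabilities are made up. It assumes 0 < p < 1.

```python
from math import log


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy down-weighted for well-classified points."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * log(p_t)


# An easy background point and a hard (misclassified) instance point.
easy = (0.05, 0)
hard = (0.10, 1)

# Ratio of the hard point's loss to the easy point's loss under each loss.
ce_ratio = cross_entropy(*hard) / cross_entropy(*easy)
fl_ratio = focal_loss(*hard) / focal_loss(*easy)
print(fl_ratio > ce_ratio)  # True
```

Because the many easy background points contribute far less under focal loss, the comparatively few instance points are not drowned out during training.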

3.4 Computation Analysis

4 Related Work

To extract features from 3D point clouds, traditional methods usually rely on hand-crafted features [5; 42]. Recent learning-based approaches mainly include voxel-based [42; 46; 41; 23; 40; 11; 4] and point-based schemes [37; 19; 14; 16; 45].

"Semantic Segmentation" PointNet [37] shows leading results on classification and semantic segmentation, but it does not capture context features. To address this, many methods [38; 57; 43; 31; 55; 49; 26; 17] have been proposed recently. Another pipeline consists of convolution kernel-based methods [55; 27; 47]. Basically, most of these methods can be used as our backbone network and trained in parallel with our 3D-BoNet to learn per-point semantics.

"Object Detection" The common way to detect objects in 3D point clouds is to project the points onto 2D images and regress bounding boxes [25; 48; 3; 56; 59; 53]. Detection performance is further improved by fusing RGB images [3; 54; 36; 52]. Point clouds can also be divided into voxels for object detection [9; 24; 60]. However, most of these methods rely on predefined anchors and two-stage region proposal networks [39]. Extending them to 3D point clouds is inefficient. Without relying on anchors, the recent PointRCNN [44] learns to detect via foreground point segmentation, while VoteNet [35] detects objects through point feature grouping, sampling, and voting. In contrast, our box prediction branch is completely different from all of these approaches. Our framework regresses 3D object bounding boxes directly from compact global features in a single forward pass.

"Instance Segmentation" SGPN [50] is the first neural algorithm to segment instances on 3D point clouds by grouping point-level embeddings. ASIS [51], JSIS3D [34], MASC [30], 3D-BEVIS [8], and [28] use the same strategy of grouping point-level features for instance segmentation. Mo et al. introduce a segmentation algorithm in PartNet [32] that classifies point features. However, the segments learned by these proposal-free methods do not have high objectness, because the methods do not explicitly detect object boundaries. Drawing on the successful 2D RPN [39] and RoI [13], GSPN [58] and 3D-SIS [15] are proposal-based methods for 3D instance segmentation. However, they usually rely on two-stage training and a post-processing step to prune dense proposals. In contrast, our framework directly predicts a point-level mask for each instance within an explicitly detected object boundary, without any post-processing steps.

5 Conclusion

Our framework is simple, effective and efficient for instance segmentation on 3D point clouds. However, it also has some limitations that lead to future work. (1) Instead of using an unweighted combination of the three criteria, a module could be designed to automatically learn the weights so as to adapt to different types of input point clouds. (2) Rather than training a separate branch for semantic prediction, more advanced feature fusion modules could be introduced so that semantic and instance segmentation improve each other. (3) Our framework follows the MLP design and is therefore agnostic to the number and order of input points. Drawing on recent work [10; 22], it is desirable to train and test directly on large-scale input point clouds rather than on divided small blocks.

Original link: https://arxiv.org/abs/1906.01140

References

[1] I. Armeni, O. Sener, A. Zamir, and H. Jiang. 3D Semantic Parsing of Large-Scale Indoor Spaces. CVPR, 2016.

[2] Y. Bengio, N. Léonard, and A. Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv, 2013.

[3] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-View 3D Object Detection Network for Autonomous Driving. CVPR, 2017.

[4] C. Choy, J. Gwak, and S. Savarese. 4D Spatio-Temporal ConvNets: Minkowski Convolutional Neural Networks. CVPR, 2019.

[5] C. S. Chua and R. Jarvis. Point signatures: A new representation for 3d object recognition. IJCV, 25(1):63–85, 1997.

[6] D. Comaniciu and P. Meer. Mean Shift: A Robust Approach towards Feature Space Analysis. TPAMI, 24(5):603–619, 2002.

[7] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes. CVPR, 2017.

[8] C. Elich, F. Engelmann, J. Schult, T. Kontogianni, and B. Leibe. 3D-BEVIS: Birds-Eye-View Instance Segmentation. GCPR, 2019.

[9] M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. ICRA, 2017.

[10] F. Engelmann, T. Kontogianni, A. Hermans, and B. Leibe. Exploring Spatial Context for 3D Semantic Segmentation of Point Clouds. ICCV Workshops, 2017.

[11] B. Graham, M. Engelcke, and L. v. d. Maaten. 3D Semantic Segmentation with Submanifold Sparse Convolutional Networks. CVPR, 2018.

[12] A. Grover, E. Wang, A. Zweig, and S. Ermon. Stochastic Optimization of Sorting Networks via Continuous Relaxations. ICLR, 2019.

[13] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. ICCV, 2017.

[14] P. Hermosilla, T. Ritschel, P.-P. Vazquez, A. Vinacua, and T. Ropinski. Monte Carlo Convolution for Learning on Non-Uniformly Sampled Point Clouds. ACM Transactions on Graphics, 2018.

[15] J. Hou, A. Dai, and M. Nießner. 3D-SIS: 3D Semantic Instance Segmentation of RGB-D Scans. CVPR, 2019.

[16] B.-S. Hua, M.-K. Tran, and S.-K. Yeung. Pointwise Convolutional Neural Networks. CVPR, 2018.

[17] Q. Huang, W. Wang, and U. Neumann. Recurrent Slice Networks for 3D Segmentation of Point Clouds. CVPR, 2018.

[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.

[19] R. Klokov and V. Lempitsky. Escape from Cells: Deep Kd-Networks for The Recognition of 3D Point Cloud Models. ICCV, 2017.

[20] H. W. Kuhn. The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

[21] H. W. Kuhn. Variants of the Hungarian method for assignment problems. Naval Research Logistics Quarterly, 3(4):253–258, 1956.

[22] L. Landrieu and M. Simonovsky. Large-scale Point Cloud Semantic Segmentation with Superpoint Graphs. CVPR, 2018.

[23] T. Le and Y. Duan. PointGrid: A Deep Network for 3D Shape Understanding. CVPR, 2018.

[24] B. Li. 3D Fully Convolutional Network for Vehicle Detection in Point Cloud. IROS, 2017.

[25] B. Li, T. Zhang, and T. Xia. Vehicle Detection from 3D Lidar Using Fully Convolutional Network. RSS, 2016.

[26] J. Li, B. M. Chen, and G. H. Lee. SO-Net: Self-Organizing Network for Point Cloud Analysis. CVPR, 2018.

[27] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen. PointCNN: Convolution On X-Transformed Points. NeurIPS, 2018.

[28] Z. Liang, M. Yang, and C. Wang. 3D Graph Embedding Learning with a Structure-aware Loss Function for Point Cloud Semantic Instance Segmentation. arXiv, 2019.

[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal Loss for Dense Object Detection. ICCV, 2017.

[30] C. Liu and Y. Furukawa. MASC: Multi-scale Affinity with Sparse Convolution for 3D Instance Segmentation. arXiv, 2019.

[31] S. Liu, S. Xie, Z. Chen, and Z. Tu. Attentional ShapeContextNet for Point Cloud Recognition. CVPR, 2018.

[32] K. Mo, S. Zhu, A. X. Chang, L. Yi, S. Tripathi, L. J. Guibas, and H. Su. PartNet: A Large-scale Benchmark for Fine-grained and Hierarchical Part-level 3D Object Understanding. CVPR, 2019.

[33] G. Narita, T. Seno, T. Ishikawa, and Y. Kaji. PanopticFusion: Online Volumetric Semantic Mapping at the Level of Stuff and Things. IROS, 2019.

[34] Q.-H. Pham, D. T. Nguyen, B.-S. Hua, G. Roig, and S.-K. Yeung. JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields. CVPR, 2019.

[35] C. R. Qi, O. Litany, K. He, and L. J. Guibas. Deep Hough Voting for 3D Object Detection in Point Clouds. ICCV, 2019.

[36] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum PointNets for 3D Object Detection from RGB-D Data. CVPR, 2018.

[37] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. CVPR, 2017.

[38] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. NIPS, 2017.

[39] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-time Object Detection with Region Proposal Networks. NIPS, 2015.

[40] D. Rethage, J. Wald, J. Sturm, N. Navab, and F. Tombari. Fully-Convolutional Point Networks for Large-Scale Point Clouds. ECCV, 2018.

[41] G. Riegler, A. O. Ulusoy, and A. Geiger. OctNet: Learning Deep 3D Representations at High Resolutions. CVPR, 2017.

[42] R. B. Rusu, N. Blodow, and M. Beetz. Fast point feature histograms (fpfh) for 3d registration. ICRA, 2009.

[43] Y. Shen, C. Feng, Y. Yang, and D. Tian. Mining Point Cloud Local Structures by Kernel Correlation and Graph Pooling. CVPR, 2018.

[44] S. Shi, X. Wang, and H. Li. PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud. CVPR, 2019.

[45] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M.-H. Yang, and J. Kautz. SPLATNet: Sparse Lattice Networks for Point Cloud Processing. CVPR, 2018.

[46] L. P. Tchapmi, C. B. Choy, I. Armeni, J. Gwak, and S. Savarese. SEGCloud: Semantic Segmentation of 3D Point Clouds. 3DV, 2017.

[47] H. Thomas, C. R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. KPConv: Flexible and Deformable Convolution for Point Clouds. ICCV, 2019.

[48] V. Vaquero, I. Del Pino, F. Moreno-Noguer, J. Solà, A. Sanfeliu, and J. Andrade-Cetto. Deconvolutional Networks for Point-Cloud Vehicle Detection and Tracking in Driving Scenarios. ECMR, 2017.

[49] C. Wang, B. Samari, and K. Siddiqi. Local Spectral Graph Convolution for Point Set Feature Learning. ECCV, 2018.

[50] W. Wang, R. Yu, Q. Huang, and U. Neumann. SGPN: Similarity Group Proposal Network for 3D Point Cloud Instance Segmentation. CVPR, 2018.

[51] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia. Associatively Segmenting Instances and Semantics in Point Clouds. CVPR, 2019.

[52] Z. Wang, W. Zhan, and M. Tomizuka. Fusing Bird View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection. arXiv, 2018.

[53] B. Wu, A. Wan, X. Yue, and K. Keutzer. SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud. arXiv, 2017.

[54] D. Xu, D. Anguelov, and A. Jain. PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation. CVPR, 2018.

[55] Y. Xu, T. Fan, M. Xu, L. Zeng, and Y. Qiao. SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters. ECCV, 2018.

[56] G. Yang, Y. Cui, S. Belongie, and B. Hariharan. Learning Single-View 3D Reconstruction with Limited Pose Supervision. ECCV, 2018.

[57] X. Ye, J. Li, H. Huang, L. Du, and X. Zhang. 3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation. ECCV, 2018.

[58] L. Yi, W. Zhao, H. Wang, M. Sung, and L. Guibas. GSPN: Generative Shape Proposal Network for 3D Instance Segmentation in Point Cloud. CVPR, 2019.

[59] Y. Zeng, Y. Hu, S. Liu, J. Ye, Y. Han, X. Li, and N. Sun. RT3D: Real-Time 3D Vehicle Detection in LiDAR Point Cloud for Autonomous Driving. IEEE Robotics and Automation Letters, 3(4):3434–3440, 2018.

[60] Y. Zhou and O. Tuzel. VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. CVPR, 2018.
