LiSD: An Efficient Multi-Task Learning Framework for LiDAR Segmentation and Detection


Abstract

With the rapid proliferation of autonomous driving, lidar-based 3D semantic segmentation and object detection have attracted increasing research attention, as both are essential for the safety of traffic participants. In recent years, learning-based approaches have emerged that deliver remarkable performance gains over conventional algorithms. However, segmentation and detection have traditionally been studied in isolation, each tuned for the best precision on its own task. To this end, we propose LiSD, an efficient multi-task learning framework that addresses both segmentation and detection while optimizing the overall performance. LiSD is a voxel-based encoder-decoder framework that contains a hierarchical feature collaboration module and a holistic information aggregation module. Different integration strategies are adopted to preserve sparsity for segmentation while densifying features for query initialization in detection. In addition, cross-task information is exploited in an instance-aware refinement module to obtain more accurate predictions. Experimental results on the nuScenes dataset and the Waymo Open Dataset demonstrate the effectiveness of the proposed model. Notably, LiSD achieves state-of-the-art performance of 83.3% mIoU on the nuScenes segmentation benchmark among lidar-only methods.

Index Terms— multi-task learning, semantic segmentation, object detection

1 Introduction

Semantic segmentation and 3D object detection play pivotal roles in autonomous driving, serving as foundational components of a comprehensive environmental perception system that reduces labor costs and ensures traffic safety. To facilitate advancements in this field, large-scale databases such as nuScenes [1] and the Waymo Open Dataset (WOD) [2] have been created, providing invaluable resources for developing and evaluating perception algorithms under complicated road conditions. On these benchmarks, semantic segmentation and object detection are conventionally researched independently to attain the best accuracy; for example, Cylinder3D [3] is purposefully tailored for semantic segmentation and cannot yield remarkable performance compared to frameworks explicitly designed for object detection. Hence, there is a pressing need for unified frameworks that achieve strong performance on both semantic segmentation and object detection. Moreover, producing segmentation and detection results in a single inference pass saves time compared to executing the two tasks separately [4].

[Fig. 1]

[Fig. 2]

Recently, deep learning frameworks have achieved considerable success in lidar perception; they can be roughly categorized into point-based, voxel-based, Range View (RV) based, Bird’s Eye View (BEV) based, and hybrid methodologies. Voxel-based methods have emerged as the predominant paradigm in segmentation and detection tasks, owing to the development of sparse convolution techniques [5]. Standard sparse convolution yields an output at every location whose receptive field contains at least one active input point. In contrast, submanifold sparse convolution marks an output location as active only if the corresponding input location is active [6]. Submanifold convolution is indispensable in 3D networks to reduce memory consumption, but it also constrains the receptive field [5]. VoxelNeXt [7] incorporates additional down-sampling and sparse height compression to generate robust feature representations with sufficient receptive fields. Nevertheless, these operations alter the density of the input sparse features, which complicates the implementation of inverse sparse convolution. To address this problem, Ye et al. propose global context pooling (GCP) in the multi-task framework LidarMultiNet [8]; GCP transforms the 3D sparse features into 2D dense BEV features to extract global information. Moreover, a cross-space transformer module is adopted in LiDARFormer [9] to learn long-range information in the BEV feature. However, extracting global information through the conversion to 2D dense BEV features increases memory consumption, since inactive locations must also be stored.
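To make the distinction concrete, the toy sketch below (our illustration, not code from any of the cited frameworks) contrasts where the two convolution types produce outputs on a 2D grid with a single active site: standard sparse convolution dilates the active set, whereas submanifold convolution keeps it fixed.

```python
# Conceptual sketch: the active-site rules of standard vs. submanifold sparse convolution.
import numpy as np

def active_sites(active, kernel=3, submanifold=False):
    """Return the boolean map of output sites for a sparse 2D convolution."""
    h, w = active.shape
    r = kernel // 2
    out = np.zeros_like(active)
    for y in range(h):
        for x in range(w):
            window = active[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            if window.any():          # standard rule: any active input in the receptive field
                out[y, x] = True
    # submanifold rule: additionally require the output location itself to be active
    return out & active if submanifold else out

grid = np.zeros((7, 7), dtype=bool)
grid[3, 3] = True                                      # a single active voxel
print(active_sites(grid).sum())                        # 9: standard conv dilates the active set
print(active_sites(grid, submanifold=True).sum())      # 1: submanifold keeps the sparsity pattern
```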

Regarding the cross-task information interaction during multi-task learning, LidarMTL [4] adopts a conventional approach in which only low-level features are shared. In LidarMultiNet [8], a second-stage refinement module is introduced to enhance the first-stage semantic segmentation and produce the panoptic segmentation results. Zhou et al. adopt a cross-task module [9] to transfer high-level features through cross-task attention mechanisms with high computational complexity.

In this paper, to reduce memory consumption and computational complexity while preserving precision, we propose an efficient multi-task learning framework, denoted as LiSD, for lidar semantic segmentation and object detection, as shown in Fig. 1. Instead of directly placing voxels from various scales onto the ground as in VoxelNeXt, we introduce a memory-friendly holistic information aggregation module, which interpolates high-level features onto the active positions of low-level features. In this way, sparsity is preserved while global information is acquired. Besides, hierarchical feature collaboration is adopted in LiSD to enhance the voxel feature representation. Moreover, in contrast to the aforementioned cross-task interaction methodologies, we propose a straightforward yet effective instance-aware refinement module, specifically designed to enhance the feature representation of foreground points through the incorporation of proposal features. LiSD is evaluated on two databases, namely nuScenes and WOD, demonstrating competitive performance on both segmentation and detection tasks. Notably, LiSD attains a leading segmentation performance of 83.3% mIoU on nuScenes, outperforming all the lidar-based methods currently ranked on the leaderboard.

The main contributions can be listed as follows:

  • We propose an efficient multi-task learning framework LiSD for lidar semantic segmentation and object detection.

  • We introduce a memory-friendly holistic information aggregation module (HIAM) to integrate global information suitable for segmentation and detection tasks, and a hierarchical feature collaboration module (HFCM) to enhance the voxel feature representation.

  • We present an instance-aware refinement module (IARM) to improve the foreground point feature representation with the assistance of object proposals.

  • The proposed LiSD achieves competitive performance in segmentation and detection tasks on the nuScenes and WOD datasets.

2 Method

In this section, we delineate the structure of our multi-task learning framework, LiSD, which seamlessly integrates three perception tasks, namely, semantic segmentation, object detection, and the auxiliary BEV segmentation through a single feed-forward pass, as illustrated in Fig. 2.

2.1 Overview

Given the input point cloud $P=\{p_i \mid p_i \in \mathbb{R}^{3+c}\}_{i=1}^{N}$, the proposed LiSD yields semantic segmentation labels $L=\{l_i \mid l_i \in (1 \cdots K)\}_{i=1}^{N}$ and object detection bounding boxes $B=\{b_i \mid b_i \in \mathbb{R}^{9}\}_{i=1}^{M}$, where $N$ represents the number of points, and $M$, $K$ denote the number of predicted boxes and semantic classes, respectively. Each point is endowed with $(3+c)$-dimensional features, e.g., 3D coordinates, intensity, elongation, timestamp, etc. The predicted boxes are characterized by their center coordinates, sizes, orientations, and velocities.

Firstly, the voxelized point cloud is fed into the Voxel Feature Encoder (VFE) to produce the sparse voxel feature representation $F_{v_j} = \mathrm{mean}(p_i),\ i = 1 \cdots n$ through an average pooling layer, where $p_i$ denotes the feature representation of the $i$-th point within the voxel $v_j$, and $v_j$ contains a total of $n$ points. Then, a voxel-based encoder-decoder built from 3D sparse convolutions is adopted to generate voxel and BEV feature representations for the segmentation and detection tasks. The encoder comprises four stages of sparse convolution blocks to downsample the spatial resolution, thereby acquiring high-level voxel features for the detection head. Conversely, the decoder is equipped with four symmetrical stages of sparse inverse convolution blocks to recover the original voxel resolution for the segmentation head. The holistic information aggregation module and the hierarchical feature collaboration module are introduced in the encoder-decoder to enlarge the receptive field and enhance the feature representation. Ultimately, the segmentation and detection heads produce the semantic labels and object bounding boxes, and an instance-aware refinement module integrates cross-task information to improve the accuracy of the predictions.
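For clarity, a minimal PyTorch sketch of the average-pooling VFE described above is given below; the tensor layout (per-point features plus a precomputed voxel index) is an assumption for illustration.

```python
import torch

def voxel_feature_encoder(point_feats: torch.Tensor, voxel_idx: torch.Tensor, num_voxels: int):
    """Average-pool point features into their voxels: F_{v_j} = mean(p_i), p_i in v_j.

    point_feats: (N, 3 + c) per-point features; voxel_idx: (N,) voxel id of each point.
    """
    feat_sum = torch.zeros(num_voxels, point_feats.shape[1], device=point_feats.device)
    feat_sum.index_add_(0, voxel_idx, point_feats)
    counts = torch.zeros(num_voxels, device=point_feats.device)
    counts.index_add_(0, voxel_idx, torch.ones_like(voxel_idx, dtype=torch.float))
    return feat_sum / counts.clamp(min=1).unsqueeze(1)

# toy usage: 5 points scattered into 3 voxels
feats = torch.randn(5, 5)
idx = torch.tensor([0, 0, 1, 2, 2])
voxel_feats = voxel_feature_encoder(feats, idx, num_voxels=3)   # (3, 5)
```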

2.2 Holistic Information Aggregation Module

As inspired by VoxelNeXt [7], sufficient receptive fields are required to ensure correct predictions when dealing with sparse voxel features. Two additional stages of down-sampling are introduced in VoxelNeXt to generate features $\{F_5, F_6\}$ with strides $\{16, 32\}$, while the multi-scale features of the original encoder with strides $\{1, 2, 4, 8\}$ are denoted as $\{F_1, F_2, F_3, F_4\}$. The enhanced feature representation $F_d$ for the detection head is obtained as follows:

$$F_d = F_4 \cup F_5 \cup F_6, \qquad P_d = P_4 \cup P_5' \cup P_6' \tag{1}$$

where $P_d$ represents the positions of the enhanced voxel features, and $P_4$, $P_5$, $P_6$ correspond to the positions of $F_4$, $F_5$, $F_6$, respectively. $P_5'$ is aligned to $P_4$ by doubling the 3D coordinates $(x_{p_5}, y_{p_5}, z_{p_5})$ of each position $p_5 \in P_5$, and $P_6'$ is obtained in the same manner. Nevertheless, the resulting feature $F_d$ is significantly denser than $F_4$. The different sparsity of $F_d$ and $F_4$ poses challenges for the implementation of inverse sparse convolution.
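A minimal sketch of how Eq. (1) can be realized is shown below. It assumes coordinates are stored as integer (batch, x, y, z) tuples and that stride-16 and stride-32 coordinates are mapped onto the stride-8 grid by factors of 2 and 4; the exact implementation in LiSD/VoxelNeXt may differ.

```python
import torch

def union_multi_scale(feats, coords, scales):
    """Union sparse voxel features from several strides onto one grid (Eq. 1 sketch).

    feats:  list of (N_i, C) feature tensors for {F4, F5, F6}
    coords: list of (N_i, 4) integer coords (batch, x, y, z) at each stride
    scales: factors mapping each coord set to the stride-8 grid, e.g. (1, 2, 4)
    """
    aligned = []
    for c, s in zip(coords, scales):
        c = c.clone()
        c[:, 1:] *= s                   # align spatial coordinates to the finest grid in the union
        aligned.append(c)
    f_d = torch.cat(feats, dim=0)       # F_d = F4 ∪ F5' ∪ F6'
    p_d = torch.cat(aligned, dim=0)     # P_d = P4 ∪ P5' ∪ P6' (positions may repeat)
    return f_d, p_d
```

Because the concatenated position set $P_d$ is denser than $P_4$, it cannot be fed directly to the inverse sparse convolutions of the decoder, which motivates the interpolation scheme described next.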

[Fig. 3]

To solve this problem, we take another approach to integrate holistic information for the voxel-based decoder, as depicted in Fig. 3. Specifically, the voxel features in $F_5'$ corresponding to the positions $P_4$ are interpolated from neighboring voxel features in $F_5$ to maintain sparsity, and this process is replicated for $F_6'$. The enhanced feature representation $F_s$ for the voxel-based decoder is denoted as follows:

$$F_s = F_4 + F_5' + F_6', \qquad P_s = P_4 \tag{2}$$

where $P_s$ denotes the positions of the refined voxel features, identical to $P_4$. Consequently, the implementation of inverse sparse convolution becomes straightforward. Note that $F_d$ is further projected to the BEV feature map $\overline{F}_d$ by putting all voxels onto the ground and summing up the features at the same positions. Leveraging the HIAM, the proposed LiSD significantly extends the receptive fields, which is crucial for improving semantic segmentation. Furthermore, the BEV feature map obtained through HIAM is further leveraged by the object detection and BEV segmentation tasks.
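The sketch below illustrates one possible realization of Eq. (2) and the BEV projection: a nearest parent-voxel gather is used as a simple stand-in for the neighborhood interpolation described above, and the BEV map is formed by dropping the height coordinate and summing features that land on the same (x, y) cell (a single batch element is assumed for brevity).

```python
import torch

def gather_parent_features(p4, coords_coarse, feats_coarse, scale):
    """Nearest-voxel stand-in for interpolating F5/F6 at the positions P4 (Eq. 2 sketch).

    p4:            (N4, 4) int coords (batch, x, y, z) at stride 8
    coords_coarse: (Nc, 4) int coords at stride 8*scale; feats_coarse: (Nc, C)
    """
    def key(c):  # pack (batch, x, y, z) into a single integer key for lookup
        return ((c[:, 0] * 2048 + c[:, 1]) * 2048 + c[:, 2]) * 2048 + c[:, 3]

    parents = p4.clone()
    parents[:, 1:] = torch.div(parents[:, 1:], scale, rounding_mode="floor")
    table = {int(k): i for i, k in enumerate(key(coords_coarse))}
    out = torch.zeros(p4.shape[0], feats_coarse.shape[1])
    for row, k in enumerate(key(parents)):
        if int(k) in table:             # inactive parents simply contribute nothing
            out[row] = feats_coarse[table[int(k)]]
    return out                          # F5' (or F6') defined on P4, so F_s = F4 + F5' + F6'

def to_bev(feats, coords, bev_shape):
    """Project F_d to a BEV map by dropping z and summing features at equal (x, y)."""
    bev = torch.zeros(bev_shape[0], bev_shape[1], feats.shape[1])
    flat = coords[:, 1] * bev_shape[1] + coords[:, 2]
    bev.view(-1, feats.shape[1]).index_add_(0, flat, feats)
    return bev
```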

2.3 Hierarchical Feature Collaboration Module

As indicated in [10], hierarchical features possess robust semantic information across various scales, which is beneficial for the semantic segmentation task. Apart from the 3D U-Net architecture, we design an additional hierarchical feature collaboration module to augment the voxel feature representation for the segmentation head. As previously noted, the multi-scale features of the encoder and decoder with strides $\{1, 2, 4, 8\}$ are represented as $\{F_1, F_2, F_3, F_4\}$ and $\{F_1', F_2', F_3', F_s\}$, respectively. The enhanced feature $\overline{F}_s$ is obtained via the top-down pathway and lateral connections as follows:

$$\overline{F}_s = \mathrm{cat}\left(\theta_1(F_1 + F_1'),\ \theta_2(F_2 + F_2'),\ \theta_3(F_3 + F_3'),\ \theta_s F_s\right) \tag{3}$$

where $\theta_1$, $\theta_2$, $\theta_3$, $\theta_s$ represent the encoding and up-scaling functions applied to the hierarchical features, and $\mathrm{cat}$ denotes concatenation across feature channels.
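As a dense-tensor sketch (the actual module operates on sparse tensors), Eq. (3) can be read as: add each encoder/decoder feature pair, encode it with $\theta_i$ (assumed here to be a 1×1×1 convolution), up-scale everything to stride 1, and concatenate along channels. The channel widths below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HFCMSketch(nn.Module):
    """Dense-tensor sketch of Eq. (3)."""

    def __init__(self, channels=(16, 32, 64, 128), out_channels=32):
        super().__init__()
        # theta_i: assumed here to be a 1x1x1 convolution that unifies the channel width
        self.theta = nn.ModuleList(nn.Conv3d(c, out_channels, kernel_size=1) for c in channels)

    def forward(self, enc_feats, dec_feats):
        # enc_feats: [F1, F2, F3]; dec_feats: [F1', F2', F3', Fs] at strides 1, 2, 4, 8
        full_size = dec_feats[0].shape[2:]
        fused = []
        for i, theta in enumerate(self.theta):
            x = dec_feats[i] if i == 3 else enc_feats[i] + dec_feats[i]   # theta_s acts on Fs alone
            x = F.interpolate(theta(x), size=full_size, mode="nearest")   # up-scale to stride 1
            fused.append(x)
        return torch.cat(fused, dim=1)                                    # concat across channels
```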

2.4 Instance-Aware Refinement Module

Given that the devised LiSD is a multi-task learning framework, there exists the potential for the integration of cross-task information between the segmentation and detection heads. As illustrated in Fig. 2, voxel features are first converted to point features for foreground probability estimation, which is guided by the semantic segmentation labels during the training phase. The foreground mask $m_{fi}$ of the $i$-th point is denoted as:

$$m_{fi} = \begin{cases} 1, & p_{fi} \geq \delta_f \\ 0, & p_{fi} < \delta_f \end{cases} \tag{4}$$

where $p_{fi}$ represents the foreground probability of the $i$-th point, and $\delta_f$ indicates the probability threshold distinguishing foreground from background points, e.g., $\delta_f = 0.5$ in this paper. Simultaneously, the detection head predicts $M$ initial boxes, and the proposal mask $m_{bij}$ of the $i$-th point $p_i$ with respect to the $j$-th box $b_j$ is computed as:

$$m_{bij} = \begin{cases} 1, & p_i \ \text{inside} \ b_j \\ 0, & p_i \ \text{outside} \ b_j \end{cases} \tag{5}$$

Then, the refined point feature $f_{pi}'$ of the $i$-th point is obtained as:

$$f_{pi}' = f_{pi} + \sum_{j=0}^{M} m_{fi} \cdot m_{bij} \cdot \mathrm{MLP}(f_{bj}) \tag{6}$$

where $f_{pi}$ represents the feature of the $i$-th point before instance-aware refinement, and $f_{bj}$ represents the feature of the $j$-th box generated from the prediction head. $\mathrm{MLP}$ denotes the multi-layer perceptron that adjusts the feature dimension of $f_{bj}$ to align with that of $f_{pi}$. Benefiting from the IARM, the feature representation of the foreground points is enhanced with the incorporation of the proposal features, and constraints are applied to the box regression process concurrently.
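A compact sketch of Eqs. (4)-(6) is given below. For brevity it uses axis-aligned boxes for the point-in-box test, whereas the actual proposals are oriented boxes; the MLP dimensions are likewise illustrative.

```python
import torch
import torch.nn as nn

def instance_aware_refinement(point_feats, fg_prob, points, boxes, box_feats, mlp, delta_f=0.5):
    """Sketch of Eqs. (4)-(6): add proposal features to foreground points inside each box.

    point_feats: (N, C), fg_prob: (N,), points: (N, 3)
    boxes: (M, 6) axis-aligned (cx, cy, cz, dx, dy, dz) -- a simplification of oriented boxes
    box_feats: (M, Cb); mlp maps Cb -> C
    """
    m_f = (fg_prob >= delta_f).float()                               # Eq. (4): foreground mask
    centers, dims = boxes[:, :3], boxes[:, 3:6]
    inside = (points[:, None, :] - centers[None]).abs() <= dims[None] / 2
    m_b = inside.all(dim=-1).float()                                 # Eq. (5): (N, M) point-in-box mask
    refined = point_feats + m_f[:, None] * (m_b @ mlp(box_feats))    # Eq. (6): sum over proposals
    return refined

# toy usage with assumed dimensions
mlp = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 16))
pts, feats = torch.rand(100, 3) * 50, torch.randn(100, 16)
boxes, box_feats, fg = torch.rand(5, 6) * 10, torch.randn(5, 64), torch.rand(100)
out = instance_aware_refinement(feats, fg, pts, boxes, box_feats, mlp)
```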

2.5 Joint Training

LiSD is trained in an end-to-end manner via a multi-task loss function. Specifically, for the semantic and BEV segmentation tasks, the optimization is guided by the cross-entropy loss and the Lovász loss [11]. For the detection task, the binary cross-entropy loss, $L_1$ loss, and IoU loss [12] are employed to minimize the classification, regression, and IoU costs, respectively. Subsequently, the final loss $L$ is defined as a weighted sum of the task-specific losses:

$$L = \sum_{i \in \{seg,\ bev,\ det\}} \frac{1}{2\sigma_i^2} L_i + \frac{1}{2} \log \sigma_i^2 \tag{7}$$

where $\sigma_i$ denotes the noise parameter of task $i$ used to compute the task-dependent uncertainty [13]. Hence, as the uncertainty of task $i$ increases, the contribution of its task-specific loss $L_i$ to $L$ diminishes.
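Eq. (7) corresponds to the homoscedastic uncertainty weighting of Kendall et al. [13]. A sketch with learnable log-variance parameters (a common, numerically stable parameterization; the variable names are ours) is shown below.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Eq. (7): L = sum_i L_i / (2 * sigma_i^2) + 0.5 * log(sigma_i^2), with learnable sigma_i.

    Parameterized as s_i = log(sigma_i^2) for numerical stability.
    """

    def __init__(self, tasks=("seg", "bev", "det")):
        super().__init__()
        self.log_vars = nn.ParameterDict({t: nn.Parameter(torch.zeros(())) for t in tasks})

    def forward(self, losses):
        total = 0.0
        for task, loss in losses.items():
            s = self.log_vars[task]
            total = total + 0.5 * torch.exp(-s) * loss + 0.5 * s  # high uncertainty downweights L_i
        return total

# usage: criterion = UncertaintyWeightedLoss(); criterion({"seg": l_seg, "bev": l_bev, "det": l_det})
```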

Table 1: Per-class IoU and mIoU (%) for lidar semantic segmentation on the nuScenes test split.

| Model | mIoU | barrier | bicycle | bus | car | construction | motorcycle | pedestrian | traffic cone | trailer | truck | driveable | other flat | sidewalk | terrain | manmade | vegetation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PolarNet [14] | 69.8 | 80.1 | 19.9 | 78.6 | 84.1 | 53.2 | 47.9 | 70.5 | 66.9 | 70.0 | 56.7 | 96.7 | 68.7 | 77.7 | 72.0 | 88.5 | 85.4 |
| PolarStream [15] | 73.4 | 71.4 | 27.8 | 78.1 | 82.0 | 61.3 | 77.8 | 75.1 | 72.4 | 79.6 | 63.7 | 96.0 | 66.5 | 76.9 | 73.0 | 88.5 | 84.8 |
| SPVNAS [16] | 77.4 | 80.0 | 30.0 | 91.9 | 90.8 | 64.7 | 79.0 | 75.6 | 70.9 | 81.0 | 74.6 | 97.4 | 69.2 | 80.0 | 76.1 | 89.3 | 87.1 |
| Cylinder3D++ [3] | 77.9 | 82.8 | 33.9 | 84.3 | 89.4 | 69.6 | 79.4 | 77.3 | 73.4 | 84.6 | 69.4 | 97.7 | 70.2 | 80.3 | 75.5 | 90.4 | 87.6 |
| AF2S3Net [17] | 78.3 | 78.9 | 52.2 | 89.9 | 84.2 | 77.4 | 74.3 | 77.3 | 72.0 | 83.9 | 73.8 | 97.1 | 66.5 | 77.5 | 74.0 | 87.7 | 86.8 |
| SPVCNN++ [16] | 81.1 | 86.4 | 43.1 | 91.9 | 92.2 | 75.9 | 75.7 | 83.4 | 77.3 | 86.8 | 77.4 | 97.7 | 71.2 | 81.1 | 77.2 | 91.7 | 89.0 |
| LidarMultiNet [8] | 81.4 | 80.4 | 48.4 | 94.3 | 90.0 | 71.5 | 87.2 | 85.2 | 80.4 | 86.9 | 74.8 | 97.8 | 67.3 | 80.7 | 76.5 | 92.1 | 89.6 |
| LidarFormer [9] | 81.5 | 84.4 | 40.8 | 84.7 | 92.6 | 72.7 | 91.0 | 84.9 | 81.7 | 88.6 | 73.8 | 97.9 | 69.3 | 81.4 | 77.4 | 92.4 | 89.6 |
| UDeerPep | 81.8 | 85.5 | 55.5 | 90.5 | 91.6 | 72.2 | 85.6 | 81.4 | 76.3 | 87.3 | 74.0 | 97.7 | 70.2 | 81.1 | 77.4 | 92.7 | 90.2 |
| Proposed LiSD | 83.3 | 82.1 | 67.1 | 89.8 | 92.2 | 80.5 | 86.9 | 87.4 | 79.3 | 86.6 | 76.1 | 97.5 | 67.2 | 80.5 | 77.0 | 92.3 | 89.7 |

3 Experiments

In this section, the databases used in our experiment are introduced at first. Then, we conduct the performance comparison of LiSD and other methods on these databases. Finally, the effectiveness of the specially designed modules is verified via the ablation study.

3.1 Datasets

Two large-scale autonomous driving databases equipped with point-wise semantic labels and 3D object bounding box annotations are utilized in our experiment, namely nuScenes dataset [1] and WOD dataset [2].

NuScenes dataset [1]: This database includes 1000 scenarios, each lasting 20 seconds and captured using a 32-beam lidar sensor at a sampling rate of 20Hz. Keyframes within each scenario are annotated with a 2Hz sampling rate. For the semantic segmentation task, point-wise semantic labels are provided for 16 categories, including 10 foreground classes and 6 background classes, and mean Intersection over Union (mIoU) is employed as the evaluation metric. For the object detection task, bounding box annotations are provided for the same 10 foreground categories as those in the segmentation task, and the evaluation metrics include mean Average Precision (mAP) and NuScenes Detection Score (NDS).

WOD dataset [2]: This dataset contains 2000 scenarios captured by a 64-beam lidar sensor at a sampling rate of 10Hz. Similar to nuScenes, each scenario spans 20 seconds, with detection annotations available for all frames, while point-wise semantic labels are only provided for selected keyframes. For the semantic segmentation task, semantic labels are provided for 23 classes, and the evaluation metric is mIoU as well. For the object detection task, bounding box annotations are provided for 3 categories, i.e., vehicles, pedestrians, and cyclists. Average Precision weighted by Heading (APH) is used as the evaluation metric for detection, and the ground truth objects are categorized as LEVEL_1 (L1) and LEVEL_2 (L2) samples based on the detection difficulty. mAPH L1 is calculated by considering samples labeled as L1, while mAPH L2 is computed by incorporating both L1 and L2 samples.

3.2 Experiment Setup

The AdamW optimizer, coupled with a one-cycle scheduler, is adopted to train the proposed LiSD for 65 epochs, with a maximum learning rate of 3e-3 and a weight decay of 0.01. For nuScenes, the point cloud range is $[-54.0, 54.0]\,m \times [-54.0, 54.0]\,m \times [-5.0, 4.6]\,m$ with a voxel size of $[0.075, 0.075, 0.2]\,m$; for WOD, the range is $[-75.2, 75.2]\,m \times [-75.2, 75.2]\,m \times [-2, 4]\,m$ with a voxel size of $[0.1, 0.1, 0.15]\,m$. Standard data augmentation techniques, including flipping, scaling, rotation, translation, and ground-truth sampling [5] with the fade strategy [18], are utilized during training. In our experiment, 6 Nvidia A30 GPUs are employed, with a batch size of 18 for nuScenes and 12 for WOD.
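As a quick sanity check of these settings, the implied voxel grid dimensions can be computed directly from the ranges and voxel sizes quoted above (the helper below is for illustration only).

```python
# Grid dimensions implied by the ranges and voxel sizes above (arithmetic check only).
def grid_size(extent, voxel):
    return [round((hi - lo) / v) for (lo, hi), v in zip(extent, voxel)]

nuscenes = grid_size([(-54.0, 54.0), (-54.0, 54.0), (-5.0, 4.6)], [0.075, 0.075, 0.2])
waymo = grid_size([(-75.2, 75.2), (-75.2, 75.2), (-2.0, 4.0)], [0.1, 0.1, 0.15])
print(nuscenes)  # [1440, 1440, 48]
print(waymo)     # [1504, 1504, 40]
```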

Table 2: Segmentation (mIoU, %) and detection (mAP/NDS, %) results on the nuScenes val split.

| Type | Model | mIoU | mAP | NDS |
| --- | --- | --- | --- | --- |
| Segmentation | PolarNet [14] | 71.0 | - | - |
| Segmentation | Cylinder3D [3] | 76.1 | - | - |
| Segmentation | RPVNet [19] | 77.6 | - | - |
| Detection | CenterPoint [20] | - | 57.4 | 65.2 |
| Detection | VoxelNeXt [7] | - | 60.5 | 66.6 |
| Detection | TransFusion-L [21] | - | 60.0 | 66.8 |
| Multi-task | LidarMultiNet [8] | 82.0 | 63.8 | 69.5 |
| Multi-task | LidarFormer [9] | 82.7 | 66.6 | 70.8 |
| Multi-task | Proposed LiSD | 83.0 | 65.0 | 69.5 |

3.3 Experiment Results

Segmentation and detection results for both the nuScenes and WOD datasets are presented to substantiate the effectiveness of LiSD.

NuScenes dataset: Performance comparisons between LiSD and other state-of-the-art methods are listed in Table 1, where LiSD achieves the top segmentation performance of 83.3% mIoU on the test split of nuScenes. Its mIoU is 1.5% higher than that of UDeerPep, the best-performing lidar-based method currently positioned atop the leaderboard. Among multi-task models, LiSD requires no second-stage refinement, unlike LidarMultiNet [8], and demonstrates competitive performance compared with the complicated cross-space and cross-task transformers featured in LidarFormer [9]. Furthermore, the detection and segmentation performance on the val split is illustrated in Table 2. LiSD outperforms models tailored for segmentation, including PolarNet [14], Cylinder3D [3], and RPVNet [19], in terms of mIoU. Concurrently, LiSD surpasses models specifically designed for detection, such as CenterPoint [20], VoxelNeXt [7], and TransFusion-L [21], by achieving a higher mAP.

Table 3: Segmentation (mIoU, %) and detection (L2 mAPH, %) results on the WOD val split.

| Type | Model | mIoU | L2 mAPH |
| --- | --- | --- | --- |
| Segmentation | PolarNet [14] | 61.6 | - |
| Segmentation | Cylinder3D [3] | 66.6 | - |
| Detection | CenterPoint++ [20] | - | 71.6 |
| Detection | CenterFormer [22] | - | 73.7 |
| Detection | MPPNet [23] | - | 74.9 |
| Multi-task | LidarMultiNet [8] | 71.9 | 75.2 |
| Multi-task | LidarFormer [9] | 72.2 | 76.2 |
| Multi-task | Proposed LiSD | 72.6 | 76.1 |

WOD dataset: Table 3 illustrates the performance comparison of semantic segmentation and object detection on the val split of the WOD dataset. The segmentation results of PolarNet [14] and Cylinder3D [3] are reproduced by [9]. As observed in Table 3, the proposed LiSD attains a mIoU of 72.6% for the segmentation task and an L2 mAPH of 76.1% for the detection task, outperforming the single-task models. Meanwhile, LiSD exhibits competitive performance in comparison with multi-task models such as LidarMultiNet [8] and LidarFormer [9].

The experimental results of LiSD on both datasets illustrate that multi-task learning facilitates the interaction of information across tasks, thereby contributing to the high performance for both tasks.

3.4 Ablation study

The ablation study, depicted in Table 4, systematically validates the effectiveness of each key component within our proposed LiSD. The baseline segmentation model employed in our experiment is based on the standard 3D sparse U-Net architecture, and the baseline detection model is TransFusion-L [21]. The performance improvements brought by the specially designed modules, including HIAM, HFCM, and IARM, are further verified on the val split of the nuScenes dataset.

Table 4: Ablation study on the nuScenes val split (components added cumulatively).

| Baseline | HIAM | HFCM | Multi-Task | BEV Loss | IARM | TTA | mIoU | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | | | | 77.1 | 60.0 |
| ✓ | ✓ | | | | | | 77.9 | 61.7 |
| ✓ | ✓ | ✓ | | | | | 79.0 | 62.1 |
| ✓ | ✓ | ✓ | ✓ | | | | 80.4 | 62.8 |
| ✓ | ✓ | ✓ | ✓ | ✓ | | | 81.1 | 64.1 |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | 81.9 | 64.5 |
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 83.0 | 65.0 |

As we can observe from Table 4, the baseline model achieves a mIoU of 77.1% and a mAP of 60.0% on the validation set. Building upon the baseline model, the incorporation of HIAM, aimed at enlarging the receptive field, yields an improvement of 0.8% in mIoU and 1.7% in mAP. Incorporating HFCM leads to an increase of 1.1% in mIoU and 0.4% in mAP. The combination of the segmentation and detection tasks further improves mIoU by 1.4% and mAP by 0.7%. Besides, the BEV segmentation loss coupled with uncertainty weighting increases mIoU and mAP by 0.7% and 1.3%, respectively. The IARM, designed for the collaboration of cross-task information, brings a further 0.8% mIoU and 0.4% mAP improvement. The visualization results are depicted in Fig. 4. Building on this optimal configuration, applying Test-Time Augmentation (TTA) improves the mIoU and mAP to 83.0% and 65.0%, respectively.

[Fig. 4]

4 Conclusion

In this paper, we propose an efficient multi-task learning framework named LiSD for lidar segmentation and detection, two tasks that have predominantly been addressed separately in previous works. Comprehensive experimental results verify the effectiveness of LiSD’s design, including the key components HIAM, HFCM, and IARM. Moreover, LiSD achieves a higher mIoU than the top-performing lidar-based method currently positioned on the leaderboard of the nuScenes lidar segmentation task. We hope that LiSD can serve as an inspiration for future endeavors in the development of multi-modal multi-task learning frameworks.

References

  • [1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom, "nuScenes: A multimodal dataset for autonomous driving," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
  • [2] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al., "Scalability in perception for autonomous driving: Waymo Open Dataset," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 2446–2454.
  • [3] Xinge Zhu, Hui Zhou, Tai Wang, Fangzhou Hong, Yuexin Ma, Wei Li, Hongsheng Li, and Dahua Lin, "Cylindrical and asymmetrical 3D convolution networks for lidar segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9939–9948.
  • [4] Di Feng, Yiyang Zhou, Chenfeng Xu, Masayoshi Tomizuka, and Wei Zhan, "A simple and efficient multi-task network for 3D object detection and road understanding," in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 7067–7074.
  • [5] Yan Yan, Yuxing Mao, and Bo Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, pp. 3337, 2018.
  • [6] Benjamin Graham and Laurens van der Maaten, "Submanifold sparse convolutional networks," arXiv preprint arXiv:1706.01307, 2017.
  • [7] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia, "VoxelNeXt: Fully sparse VoxelNet for 3D object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21674–21683.
  • [8] Dongqiangzi Ye, Zixiang Zhou, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, and Hassan Foroosh, "LidarMultiNet: Towards a unified multi-task network for lidar perception," in Proceedings of the AAAI Conference on Artificial Intelligence, 2023, vol. 37, pp. 3231–3240.
  • [9] Zixiang Zhou, Dongqiangzi Ye, Weijia Chen, Yufei Xie, Yu Wang, Panqu Wang, and Hassan Foroosh, "LidarFormer: A unified transformer-based multi-task network for lidar perception," arXiv preprint arXiv:2303.12194, 2023.
  • [10] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
  • [11] Maxim Berman, Amal Rannen Triki, and Matthew B. Blaschko, "The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4413–4421.
  • [12] Dingfu Zhou, Jin Fang, Xibin Song, Chenye Guan, Junbo Yin, Yuchao Dai, and Ruigang Yang, "IoU loss for 2D/3D object detection," in 2019 International Conference on 3D Vision (3DV). IEEE, 2019, pp. 85–94.
  • [13] Alex Kendall, Yarin Gal, and Roberto Cipolla, "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491.
  • [14] Yang Zhang, Zixiang Zhou, Philip David, Xiangyu Yue, Zerong Xi, Boqing Gong, and Hassan Foroosh, "PolarNet: An improved grid representation for online lidar point clouds semantic segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9601–9610.
  • [15] Qi Chen, Sourabh Vora, and Oscar Beijbom, "PolarStream: Streaming object detection and segmentation with polar pillars," Advances in Neural Information Processing Systems, vol. 34, pp. 26871–26883, 2021.
  • [16] Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, and Song Han, "Searching efficient 3D architectures with sparse point-voxel convolution," in European Conference on Computer Vision. Springer, 2020, pp. 685–702.
  • [17] Ran Cheng, Ryan Razani, Ehsan Taghavi, Enxu Li, and Bingbing Liu, "(AF)2-S3Net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12547–12556.
  • [18] Chunwei Wang, Chao Ma, Ming Zhu, and Xiaokang Yang, "PointAugmenting: Cross-modal augmentation for 3D object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11794–11803.
  • [19] Jianyun Xu, Ruixiang Zhang, Jian Dou, Yushi Zhu, Jie Sun, and Shiliang Pu, "RPVNet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 16024–16033.
  • [20] Tianwei Yin, Xingyi Zhou, and Philipp Krähenbühl, "Center-based 3D object detection and tracking," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11784–11793.
  • [21] Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai, "TransFusion: Robust lidar-camera fusion for 3D object detection with transformers," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1090–1099.
  • [22] Zixiang Zhou, Xiangchen Zhao, Yu Wang, Panqu Wang, and Hassan Foroosh, "CenterFormer: Center-based transformer for 3D object detection," in European Conference on Computer Vision. Springer, 2022, pp. 496–513.
  • [23] Xuesong Chen, Shaoshuai Shi, Benjin Zhu, Ka Chun Cheung, Hang Xu, and Hongsheng Li, "MPPNet: Multi-frame feature intertwining with proxy points for 3D temporal object detection," in European Conference on Computer Vision. Springer, 2022, pp. 680–697.
© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.