Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
Computer Vision and Pattern Recognition (2024)
Abstract
Contrastive learning has emerged as a promising paradigm for 3D open-world understanding, i.e., aligning point cloud representation to image and text embedding space individually. In this paper, we introduce MixCon3D, a simple yet effective method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training. In contrast to using the point cloud only, we develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images with the point cloud. Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment. Additionally, we conduct the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments conducted on three representative benchmarks reveal that our method significantly improves over the baseline, surpassing the previous state-of-the-art performance on the challenging 1,156-category Objaverse-LVIS dataset by 5.7%. The approach also extends to applications such as text-to-3D retrieval and point cloud captioning, further evidencing its efficacy in diverse scenarios. The code is available at https://github.com/UCSC-VLAA/MixCon3D.
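As a rough illustration of the idea described in the abstract (combine point cloud and multi-view image features into one 3D object representation, then align it with text contrastively), the following is a minimal sketch and not the authors' implementation. The fusion by simple averaging, the function names fuse and info_nce, the temperature value, and the toy 3-dimensional embeddings are all assumptions made for this example; the actual method is in the linked repository.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def fuse(point_emb, view_embs):
    """Hypothetical fusion: average the multi-view image embeddings,
    then average the result with the point cloud embedding."""
    dim = len(point_emb)
    view_mean = [sum(v[i] for v in view_embs) / len(view_embs) for i in range(dim)]
    return l2_normalize([(p + m) / 2 for p, m in zip(point_emb, view_mean)])


def info_nce(queries, keys, temperature=0.07):
    """Symmetric InfoNCE loss; the i-th query is the positive for the i-th key."""
    logits = [
        [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        for q in queries
    ]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    columns = [list(c) for c in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Toy batch of two objects (assumed values, for illustration only).
point = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
views = [
    [[0.9, 0.1, 0.0], [1.0, 0.0, 0.1]],
    [[0.1, 0.9, 0.0], [0.0, 1.0, 0.1]],
]
text = [l2_normalize([1.0, 0.1, 0.0]), l2_normalize([0.1, 1.0, 0.0])]

fused = [fuse(p, v) for p, v in zip(point, views)]
aligned_loss = info_nce(fused, text)          # matching 3D/text pairs
shuffled_loss = info_nce(fused, text[::-1])   # deliberately mismatched pairs
```

In this toy setup the loss for correctly paired 3D and text embeddings should come out lower than the loss for the mismatched pairing, which is the behavior a contrastive objective is meant to reward.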
Keywords
Contrastive learning, Language-Image-3D Pre-training, Multi-Modality Training