HiT: Hierarchical Transformers for Unsupervised 3D Shape Abstraction

¹Simon Fraser University   ²University of Toronto




We present an attention-based architecture for hierarchical part abstraction of 3D objects. Our model flexibly adapts the number of parts without supervision, allowing semantically similar regions (e.g., chair legs) to be decomposed, through the hierarchy, into different numbers of child parts depending on their geometry. While the resulting parts at finer levels of abstraction may no longer be semantic (e.g., splits of the chair seat), part correspondences between shapes remain meaningful across levels of the hierarchy.

Abstract


We introduce HiT, a novel hierarchical neural field representation for 3D shapes that learns general hierarchies in a coarse-to-fine manner across different shape categories in an unsupervised setting. Our key contribution is the hierarchical transformer HiT, in which each level learns the parent–child relationships of the tree hierarchy using a compressed codebook. This codebook enables the network to automatically identify common substructures across potentially diverse shape categories. Unlike previous works that constrain the task to a fixed hierarchical structure (e.g., binary trees), we impose no such restriction except for a limit on the total number of nodes at each tree level. This flexibility allows our method to infer the hierarchical structure directly from data, over multiple shape categories, and to represent more general and complex hierarchies than prior approaches. When trained at scale with a reconstruction loss, our model captures meaningful containment relationships between parent and child nodes. We demonstrate its effectiveness on an unsupervised shape segmentation task over all 55 ShapeNet categories, where our method successfully segments shapes into multiple levels of granularity.
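To make the codebook-per-level idea concrete, here is a minimal, hypothetical PyTorch sketch of one hierarchy level: learnable child-part queries cross-attend to the parent tokens of the coarser level together with a shared, compressed codebook. All names and sizes (HiTLevel, num_parts, codebook_size) are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class HiTLevel(nn.Module):
    """One level of the hierarchy: child-part queries attend to the
    parent tokens and a learned codebook shared across shapes."""

    def __init__(self, dim=256, num_parts=8, codebook_size=64):
        super().__init__()
        # Compressed codebook: candidate substructures common across shapes.
        self.codebook = nn.Embedding(codebook_size, dim)
        # One learnable query per candidate child part at this level
        # (an upper bound on the node count, not a fixed tree structure).
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim))
        # Cross-attention connecting this level to its parent level.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, parent_tokens):
        # parent_tokens: (B, P, dim) part tokens from the coarser level.
        B = parent_tokens.shape[0]
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)
        # Context = parent tokens plus the level's codebook entries.
        context = torch.cat(
            [parent_tokens,
             self.codebook.weight.unsqueeze(0).expand(B, -1, -1)], dim=1)
        child_tokens, attn_weights = self.attn(queries, context, context)
        # Attention weights over parent tokens act as soft
        # parent-child assignments within the tree.
        return child_tokens, attn_weights

Stacking such levels gives the coarse-to-fine decomposition; in this sketch the per-level part budget is the only structural constraint, mirroring the node limit described above.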


We propose a hierarchical transformer that learns a part codebook at each level, representing shapes from coarse to fine when trained across shapes. Cross-attention "connects" the levels, establishing learnable part–subpart relationships. The decoded parts are mapped to 3D convex primitives that provide geometric explanations of the shape.
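As a rough illustration of the last step, the sketch below decodes a part token into a convex primitive represented as an intersection of learned half-spaces, in the spirit of convex-decomposition decoders such as CvxNet and BSP-Net. ConvexDecoder and its parameters (num_halfspaces, sharpness) are hypothetical, not the paper's API.

import torch
import torch.nn as nn

class ConvexDecoder(nn.Module):
    def __init__(self, dim=256, num_halfspaces=16, sharpness=75.0):
        super().__init__()
        # Predict a (normal, offset) hyperplane per half-space from the token.
        self.to_planes = nn.Linear(dim, num_halfspaces * 4)
        self.num_halfspaces = num_halfspaces
        self.sharpness = sharpness

    def forward(self, part_token, points):
        # part_token: (B, dim); points: (B, N, 3) query points.
        planes = self.to_planes(part_token).view(-1, self.num_halfspaces, 4)
        normals, offsets = planes[..., :3], planes[..., 3]
        # Signed distance of each point to each half-space: (B, N, H).
        sd = torch.einsum('bnd,bhd->bnh', points, normals) + offsets.unsqueeze(1)
        # A point lies inside the convex iff it is inside every half-space,
        # i.e. its largest signed distance is negative.
        occupancy = torch.sigmoid(-self.sharpness * sd.max(dim=-1).values)
        return occupancy  # (B, N): ~1 inside the primitive, ~0 outside.

Taking the maximum signed distance keeps the inside test differentiable through the sigmoid, which is what allows an occupancy reconstruction loss to train the decoder, and the hierarchy above it, end to end.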

Results on Objaverse Shapes


Characters

[Four character examples, each shown as: Input, Level 0, Level 1, Level 2.]

Rifles

[Two rifle examples, each shown as: Input, Level 0, Level 1, Level 2.]

Ground Vehicles

[Two ground-vehicle examples, each shown as: Input, Level 0, Level 1, Level 2.]

Results on ShapeNet Shapes


Aeroplanes

[Two aeroplane examples, each shown as: Input, Level 0, Level 1, Level 2.]

Faucets

[Two faucet examples, each shown as: Input, Level 0, Level 1, Level 2.]