Learning Physical Graph Representations from Visual Scenes

Bear et al., 2020


  • Attempts to address unsupervised learning of visual scene representations that can support scene segmentation tasks on complex real world images
  • Introduces “Physical Scene Graphs” (PSGs) that represent scenes as hierarchical graphs, with nodes that correspond to object parts at different scales and edges to physical connections between parts
  • PSGNet, a novel architecture, outperforms alternative models and generalizes to unseen object types and scene arrangements


  • While CNNs have been very successful at learning visual representations for tasks like object classification, their success has been limited on tasks that require a structured understanding of the visual scene
  • Humans group visual scenes into object-centric representations in which many properties (e.g. object parts, poses, material properties) are explicitly available, which supports high-level planning and inference
  • Recent work is limited by architectural choices, learning from static images, or requiring detailed supervision (e.g. meshes)
  • PSGs aim to be geometrically rich and explicit enough to handle reasoning about complex shapes, but also flexible enough to be learned from real-world data through self-supervision


  • “A PSG is a vector-labeled hierarchical graph whose nodes are registered to non-overlapping locations in a base spatial tensor”
  • PSGNet takes in RGB movie inputs and outputs, for each frame, an RGB reconstruction, depth and normal maps, an object segmentation map, and next-frame RGB deltas
  • Feature extraction:
    • Generate a feature map for the input movie with a ConvRNN, using features from the first conv layer
    • Concatenate the one-timestep backward differential (the change from the previous frame) to the input
  • Graph construction:
    • Hierarchical sequence of learnable graph pooling and graph vectorization operations
    • Graph pooling combines nodes from the previous layer into nodes of the new layer with corresponding child-parent edges
      • Generate within-layer edges by thresholding a learnable affinity function on attribute vectors
      • Affinity functions are based on perceptual grouping principles: feature similarity, statistical co-occurrence, motion-driven similarity, and similarity learned from motion
      • Cluster nodes based on these edges without needing to specify the final number of groups
    • Graph vectorization aggregates the attributes of the merged nodes and transforms them into attributes for the new nodes
      • Uses a combination of statistical summary functions
  • Graph rendering:
    • “Paint-by-numbers” using node attributes and spatial registration from graph construction
    • Produce shapes from node attributes to generate output image
  • Loss on each graph level and on rendered scene reconstructions
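
To make the pool, vectorize, render loop above concrete, here is a minimal toy sketch of one graph level on a 1D strip of "pixels". It is not PSGNet's implementation: the affinity function `toy_affinity` is a fixed, non-learned stand-in, connected components (via union-find) stand in for the paper's clustering step, and mean plus variance are just two example summary statistics. All function names and the threshold `tau` are hypothetical.

```python
import math
from statistics import mean, pvariance


def threshold_edges(attrs, candidate_pairs, affinity, tau):
    """Keep a within-layer edge (i, j) only if its affinity exceeds tau."""
    return [(i, j) for i, j in candidate_pairs if affinity(attrs[i], attrs[j]) > tau]


def pool(num_nodes, edges):
    """Group nodes by connected components; the group count is not fixed in advance."""
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)
    roots = sorted({find(i) for i in range(num_nodes)})
    relabel = {r: k for k, r in enumerate(roots)}
    # parent_of[i] is the child-to-parent edge for node i
    return [relabel[find(i)] for i in range(num_nodes)]


def vectorize(attrs, parent_of):
    """Summarize each parent's children with per-dimension mean and variance."""
    out = []
    for p in range(max(parent_of) + 1):
        children = [a for a, q in zip(attrs, parent_of) if q == p]
        dims = list(zip(*children))
        out.append([mean(d) for d in dims] + [pvariance(d) for d in dims])
    return out


def render(parent_of, parent_attrs, dim):
    """Paint-by-numbers: each base location takes its parent's mean attributes."""
    return [parent_attrs[p][:dim] for p in parent_of]


def toy_affinity(a, b):
    """Assumed stand-in for a learned affinity: high when attributes are close."""
    return math.exp(-math.dist(a, b))


# Six pixels with one intensity attribute each; two visually distinct regions.
pixels = [[0.0], [0.1], [0.05], [0.9], [1.0], [0.95]]
pairs = [(i, i + 1) for i in range(len(pixels) - 1)]  # adjacent pixels only

edges = threshold_edges(pixels, pairs, toy_affinity, tau=0.7)
parent_of = pool(len(pixels), edges)
parent_attrs = vectorize(pixels, parent_of)
painted = render(parent_of, parent_attrs, dim=1)
```

In this toy case the only pair that falls below the threshold is the one spanning the intensity jump, so the strip splits into two parent nodes without the number of groups ever being specified. Stacking the same three steps on `parent_attrs` would give the next level of the hierarchy.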


  • Datasets:
    • Primitives: synthetic dataset of primitive shapes (e.g. spheres, cubes) in simple 3D room
    • Playroom: synthetic dataset of complex shapes with realistic textures (e.g. animals, furniture, tools)
    • Gibson: RGB-D interior scans of buildings on Stanford campus
  • Metrics:
    • mean intersection over union (mIoU)
    • Recall: proportion of ground truth foreground objects whose IoU with predicted mask is > 0.5
    • BoundF: average F1-score on ground truth and predicted boundary pixels of each segment
    • Adjusted Rand Index (ARI)
  • Baselines:
    • MONet
    • IODINE
    • OP3
    • Quickshift++: non-learned
  • PSGNet outperforms the baselines on the Primitives and Gibson datasets in static training, where models are given single RGB frames and trained with RGB reconstruction plus depth and normal estimation
  • PSGNet with motion-based grouping also outperforms other models for the Playroom dataset with four-frame movie inputs
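
For reference, the segmentation metrics listed above (except BoundF) can be sketched on flat label maps as follows. This is an illustrative sketch, not the paper's evaluation code: matching each ground-truth object to its best-overlapping predicted segment, and treating label `0` as background, are assumptions, and all function names are hypothetical.

```python
from collections import Counter
from math import comb


def masks(labels, ignore=None):
    """Convert a flat label map into {segment id: set of pixel indices}."""
    out = {}
    for idx, lab in enumerate(labels):
        if lab != ignore:
            out.setdefault(lab, set()).add(idx)
    return out


def iou(a, b):
    """Intersection over union of two masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_ious(gt_labels, pred_labels, background=0):
    """For each ground-truth foreground object, IoU with its best-matching prediction."""
    gt = masks(gt_labels, ignore=background)
    pred = masks(pred_labels)
    return [max((iou(g, p) for p in pred.values()), default=0.0) for g in gt.values()]


def mean_iou(gt_labels, pred_labels):
    scores = best_ious(gt_labels, pred_labels)
    return sum(scores) / len(scores)


def recall(gt_labels, pred_labels, thresh=0.5):
    """Proportion of ground-truth foreground objects with best IoU above thresh."""
    scores = best_ious(gt_labels, pred_labels)
    return sum(s > thresh for s in scores) / len(scores)


def adjusted_rand_index(gt_labels, pred_labels):
    """ARI computed from the contingency table of the two labelings."""
    n = len(gt_labels)
    index = sum(comb(c, 2) for c in Counter(zip(gt_labels, pred_labels)).values())
    a = sum(comb(c, 2) for c in Counter(gt_labels).values())
    b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster in both
        return 1.0
    return (index - expected) / (max_index - expected)


# Toy 8-pixel example: two foreground objects (1 and 2) on background 0.
gt = [0, 0, 1, 1, 1, 2, 2, 2]
pred = [5, 5, 7, 7, 7, 7, 8, 8]  # predicted ids are arbitrary

miou_score = mean_iou(gt, pred)
recall_score = recall(gt, pred)
ari_self = adjusted_rand_index(gt, gt)
ari_pred = adjusted_rand_index(gt, pred)
```

Here predicted segment `7` leaks one pixel into object `2`, so the best IoUs are 3/4 and 2/3; both exceed 0.5, so Recall counts both objects even though neither is segmented perfectly. ARI is invariant to how segment ids are named, which is why the arbitrary predicted ids do not matter.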


  • Segmentations are still not great, especially for more complex scenes
  • PSGs provide a flexible representation for encoding scenes
Elias Z. Wang