All we want is an empty room: How emptying your digital twin unlocks design possibilities

Matterport enables the transformation of any room in your home into a redecorated space, all without moving a single item of furniture.

Matterport Vision & Learning Team

June 28, 2024

Imagine being able to completely redecorate your living room without lifting a single piece of furniture. This is what we are building.

We recently introduced Matterport’s Three Pillars of AI - our philosophy on applications of AI at Matterport. To build on the conversation, we’re exploring how advanced technologies in 3D semantic understanding and inpainting are enabling a host of exciting new applications for our digital twins.

Matterport initially focused on creating photorealistic, yet static, reconstructions of real-world spaces, providing an excellent foundation for virtual tours and a variety of consumer applications. However, to truly transform these spaces, assess their potential uses, or manage their daily maintenance and operations, static reconstructions are not enough.

To address this, we've been developing advanced Property Intelligence tools. These tools use semantic understanding to deliver deeper insights and valuable information about properties.

Now, with the latest breakthroughs in generative AI, we have expanded our focus to creating new content and experiences within Matterport spaces, enriching the way users interact with and perceive these digital environments.

Combining Matterport’s decade of machine learning and AI experience with new generative AI tools for interior design, we are bringing new design and furnishing ideas to life at the click of a button through Project Genesis, starting with the ability to defurnish any space instantly.

What is Defurnishing?

Defurnishing, a key technique in digital image processing and 3D modeling, entails the removal of furniture and movable items from images of a space, rendering it empty.

This approach is crucial for applications that require the visualization of unoccupied spaces, including interior design, real estate, and virtual staging, offering a clear view of the potential of a space.

Defurnishing is a capability under development for all Matterport digital twins. It has three steps:

1. Reconstruction: We first capture and reconstruct a space to build the digital twin.
2. Understanding: We then gain semantic understanding of the reconstructed space, specifically determining the pixels (in our images) and mesh faces (in our dollhouse view) that belong to items of furniture, which we wish to remove.
3. Synthesis: Since the areas obstructed by the furniture were never captured directly, after removing the furniture, we end up with blank pixels in the images and holes in the mesh. The images require that plausible “empty space” content be inpainted, while the holes in the mesh require filling and texturing.

Experience a preview of our defurnishing capabilities in our 2024 Winter Release. In this part of our blog series, we focus on semantic segmentation - a crucial first step to automatic defurnishing.

Understanding Semantic Segmentation

Semantic segmentation, a critical computer vision task, involves dividing an image into distinct regions and assigning each a specific category. The objective is to label every pixel with a class (e.g., "floor," "wall," "window," "table"), facilitating a comprehensive understanding of the scene by pinpointing objects and delineating their boundaries.

Unlike object detection, which focuses on surrounding objects with bounding boxes, or image classification, where a single label is applied to the entire image, semantic segmentation achieves a granular analysis of the scene, enhancing the depth of interpretation.
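
To make the per-pixel idea concrete, here is a minimal sketch in plain Python. It assumes a model has already produced one score per class for every pixel; the class list, the scores, and the helper names `segment` and `class_mask` are illustrative assumptions, not Matterport's ontology or code.

```python
# Hypothetical class list; the real ontology is not given in the post.
CLASSES = ["floor", "wall", "window", "table"]


def segment(scores):
    """Return a label map: for each pixel, the index of the highest-scoring class.

    scores[y][x] is a list holding one score per entry in CLASSES.
    """
    return [
        [max(range(len(px)), key=px.__getitem__) for px in row]
        for row in scores
    ]


def class_mask(label_map, class_name):
    """Binary mask that is True wherever a pixel carries the given class."""
    idx = CLASSES.index(class_name)
    return [[label == idx for label in row] for row in label_map]


# A toy 2x3 "image" of per-pixel class scores (made-up numbers).
scores = [
    [[0.1, 0.7, 0.1, 0.1], [0.1, 0.2, 0.6, 0.1], [0.1, 0.7, 0.1, 0.1]],
    [[0.8, 0.1, 0.0, 0.1], [0.2, 0.1, 0.0, 0.7], [0.9, 0.0, 0.0, 0.1]],
]
label_map = segment(scores)
table_mask = class_mask(label_map, "table")
```

Every pixel ends up with exactly one class, which is what distinguishes this output from a bounding box or a single image-level label.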

Semantic segmentation stands as a cornerstone technique in computer vision, finding application in autonomous vehicles, medical imaging, robotics, and beyond.

Recently, it has emerged as a critical element in virtual interior design. Upon the initial capture of a space, the primary data available outlines the space's overall structure and aesthetics. Semantic segmentation plays a pivotal role in enriching our understanding of a Matterport space's contents, enabling precise manipulation of its elements, whether that means moving, editing, indexing, or removing them.

To effectively alter any aspect of a Matterport space, a detailed semantic segmentation is essential, distinguishing the key components of the space from one another.

The Role of Segmentation in Inpainting for Defurnishing

To remove furniture from the images and 3D structures of our digital twins, we must first identify the pixels/mesh faces that belong to items of furnishing.

Removing those pixels/faces often results in missing information. This is because the area behind/under pieces of furniture cannot be seen during the capture of the digital twin.

Consequently, we need to generate some plausible image/3D content to fill these holes, after the furniture has been removed. This process is referred to as image inpainting (more on this in Part 2 of our blog series). Inpainting is an advanced technique used in image editing and restoration, designed to fill in missing or damaged sections of an image, ensuring it looks complete and natural. Its primary objective is to seamlessly reconstruct these areas, making them blend indistinguishably with the surrounding image, thereby maintaining its structural integrity and visual continuity.
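
As a rough illustration of what filling a hole means, the sketch below uses a deliberately naive classical technique: masked pixels are filled by repeatedly averaging their already-known neighbours. This is not the generative inpainting discussed in this series, and the function name `diffuse_fill` and the toy grayscale image are assumptions made for the example.

```python
def diffuse_fill(image, mask, iterations=50):
    """Fill masked pixels of a grayscale image by repeated neighbour averaging.

    image: list of rows of floats.
    mask:  same shape, True where content is missing.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    known = [[not m for m in row] for row in mask]
    for _ in range(iterations):
        nxt = [row[:] for row in out]
        nxt_known = [row[:] for row in known]
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:
                    continue  # captured pixels are never modified
                neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                vals = [
                    out[ny][nx]
                    for ny, nx in neighbours
                    if 0 <= ny < h and 0 <= nx < w and known[ny][nx]
                ]
                if vals:
                    nxt[y][x] = sum(vals) / len(vals)
                    nxt_known[y][x] = True
        out, known = nxt, nxt_known
    return out


# A uniform 5x5 "wall" with a 3x3 hole where furniture was removed.
mask = [[1 <= y <= 3 and 1 <= x <= 3 for x in range(5)] for y in range(5)]
image = [[0.0 if mask[y][x] else 0.5 for x in range(5)] for y in range(5)]
filled = diffuse_fill(image, mask)
```

A fill like this can only smear surrounding colours into the hole. It cannot invent floorboards, skirting, or wall corners, which is why learned inpainting models are used for realistic results.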

Many inpainting methods depend on precise segmentation masks for the areas designated for removal and subsequent inpainting. Any discrepancies or artifacts impacting the furniture segmentation masks can greatly influence the inpainting outcomes. For example:

Removing a portion of a building's structure rather than its furnishings can lead to significant structural hallucinations (e.g., instead of inpainting some floor and wall content, we may end up creating a doorway to a nonexistent room).
Incorrect segmentation of furniture, where object parts are not properly masked, can result in the unintended inpainting of spurious objects rather than the desired "empty space" (typically interpreted as walls and floors, depending on the perspective).
False negatives, which occur when actual pieces of furniture are not segmented, can result in remnants of furniture appearing in the final outcome.
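
The failure modes above can be quantified by comparing a predicted furniture mask against a reference mask. The following is a minimal sketch under that assumption; the function name `mask_errors` and the toy masks are hypothetical, and intersection-over-union is a standard segmentation metric rather than a measure the post says Matterport uses.

```python
def mask_errors(pred, truth):
    """Compare a predicted furniture mask with a reference mask, pixel by pixel."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1  # structure wrongly marked for removal
            elif t and not p:
                fn += 1  # furniture that would be left behind
    union = tp + fp + fn
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "iou": tp / union if union else 1.0,
    }


# Toy masks: 1 marks a furniture pixel.
truth = [[0, 1, 1, 0], [0, 1, 1, 0]]
pred = [[0, 1, 1, 1], [0, 0, 1, 0]]
report = mask_errors(pred, truth)
```

Here one false positive would remove a pixel of structure, and one false negative would leave a furniture remnant, matching the first and third failure modes listed above.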

Consequently, ensuring that we can obtain an accurate semantic segmentation is crucial to achieving a high-quality defurnishing result.

Our Approach to Semantic Segmentation

1. Data

We conduct semantic segmentation on 360-degree panoramic images using equirectangular projection to capture the widest possible visual context in a single frame. Context plays a critical role in computer vision tasks, especially when using contemporary neural network frameworks like Vision Transformers.
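
For readers unfamiliar with equirectangular projection, the sketch below maps a panorama pixel to a viewing direction on the unit sphere. The axis convention (y up, z forward, longitude increasing left to right) and the function name `pixel_to_direction` are assumptions for illustration; the post does not specify the convention Matterport uses.

```python
import math


def pixel_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit viewing direction (x, y, z).

    Assumed convention (not from the post): longitude runs from -pi to pi
    left to right, latitude runs from pi/2 to -pi/2 top to bottom,
    y points up and z points forward.
    """
    lon = ((u + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (v + 0.5) / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


# The centre of a 4096x2048 panorama looks straight ahead.
forward = pixel_to_direction(2047.5, 1023.5, 4096, 2048)
# A pixel in the top row looks almost straight up.
upward = pixel_to_direction(0, 0, 4096, 2048)
```

Because every direction around the camera maps to some pixel, a single equirectangular frame carries the full context of the room.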

2. Custom Ontology

Initially, we utilized a segment of the ADE20k ontology, which includes 150 categories commonly found in built environments. However, this approach was not perfectly suited for our specific needs.

In our scenario, our objective is to eliminate all detachable furnishings while retaining those that are built-in. Public datasets often group these distinct types together under general categories (for example, categorizing both freestanding and built-in wardrobes simply as "wardrobe").

Therefore, to address our specific needs, we had to consider several additional task-specific factors and compile a custom dataset with annotations for furniture segmentation.
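
One way to picture a task-specific ontology is as a split of labels into items to remove and items to keep. The labels below are invented for illustration, since the post does not publish the actual ontology, and `removal_mask` is a hypothetical helper.

```python
# Hypothetical labels: the post does not publish the actual custom ontology.
REMOVABLE = {"sofa", "chair", "table", "freestanding_wardrobe", "rug"}
KEEP = {"floor", "wall", "window", "door", "builtin_wardrobe", "kitchen_counter"}


def removal_mask(label_map):
    """True where a pixel belongs to a freestanding item that should be removed."""
    unknown = {lbl for row in label_map for lbl in row} - REMOVABLE - KEEP
    if unknown:
        # Fail loudly rather than silently keeping or removing an unlisted class.
        raise ValueError(f"labels missing from the ontology: {sorted(unknown)}")
    return [[lbl in REMOVABLE for lbl in row] for row in label_map]


label_map = [
    ["wall", "builtin_wardrobe"],
    ["freestanding_wardrobe", "floor"],
]
to_remove = removal_mask(label_map)
```

With a generic "wardrobe" class, the two wardrobe pixels in this example could not be told apart, which is the limitation of public ontologies described above.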

3. Network Architecture

We decided to leverage the capabilities of Vision Transformer architectures, which have been successfully used in various AI applications within our projects. In particular, we chose the Vision Transformer Adapter as the foundation for our segmentation experiments.

This model modifies the Vision Transformer, originally designed to produce a single feature vector from an image input, enabling it to handle image-to-image tasks that require a feature map instead of just a single vector.
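
The difference between a single feature vector and a feature map can be sketched with toy patch tokens. This is not the ViT-Adapter itself, whose design is more involved; it only contrasts the two output shapes, and the function names and token values are assumptions.

```python
def pooled_vector(tokens):
    """Classification-style output: average all patch tokens into one vector."""
    n = len(tokens)
    channels = len(tokens[0])
    return [sum(t[c] for t in tokens) / n for c in range(channels)]


def feature_map(tokens, grid_h, grid_w):
    """Dense-prediction-style output: put each token back at its grid position."""
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Six toy patch tokens with two channels each, taken from a 2x3 patch grid.
tokens = [[float(i), float(i) * 10.0] for i in range(6)]
vec = pooled_vector(tokens)        # one vector for the whole image
fmap = feature_map(tokens, 2, 3)   # one vector per spatial location
```

The pooled vector discards where each patch came from, while the feature map preserves the layout that a per-pixel segmentation head needs.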

Although it was not specifically trained on 360-degree equirectangular images, the ViT-Adapter performed impressively on this data type. Out of the box, however, it does not address the ontological discrepancies mentioned earlier.

4. Deployment

We recently elevated semantic segmentation to a primary position in our pipeline, along with depth estimation, so that it is now executed for each image captured. Consequently, our inference operates in the cloud, providing resilience against abrupt traffic fluctuations, simplifying maintenance, and enabling smoother updates.

5. 3D Semantic Understanding

Matterport uniquely excels at understanding 3D spaces semantically. By weaving 3D context into our semantic segmentation, we offer a deeper insight into both the spatial and semantic connections within any captured space.

Our innovative use of a 3D dollhouse view allows us to combine perspectives from multiple angles, significantly enhancing the precision of our predictions. This advanced approach empowers us to execute more accurate and meaningful modifications.

A prime example is the defurnishing scenario, which demands an intricate and accurate comprehension of the environment's 2D and 3D characteristics.

Technical Challenges and Limitations of Defurnishing

Even the most advanced semantic segmentation models fall short of perfection, struggling to generalize effectively to new, unseen data. This reality requires the development of strategies to either correct errors or create workarounds.

While supervised approaches to semantic segmentation often yield the best results, the task of defining and managing ontologies presents a significant challenge. These ontologies are prone to shifts and changes based on the specific application, requiring frequent data annotation when significant adjustments are made.

Consequently, the more a model can be trained in a self-supervised manner, the more we can reduce the time, effort, and financial resources needed to adapt a segmentation model to new ontologies.

Designing these ontologies presents numerous challenges. Take, for example, the task of defurnishing, where the goal is to remove "freestanding" furniture while retaining "built-in" fixtures.

Determining when a piece of furniture qualifies as "built-in" is a complex task that often requires a comprehensive set of rules for consistent and reproducible decision-making. Without a clear set of guidelines, data annotation efforts are likely to yield poor-quality results, thereby compromising the performance of the segmentation model.

Looking Forward

Self-Supervised Learning

We have been exploring self-supervised learning for some time, and with a variety of image-based models successfully launched, the timing is ideal to deepen our investment in this domain.

Self-supervised learning presents substantial advantages, such as minimizing the need for annotated data, accelerating training processes, and enhancing performance in our specific tasks.

Integrating 3D Context

Exploring the integration of 3D context into our workflows offers a promising avenue for advancing our processes. Presently, our approach to data aggregation is passive, relying on a heuristic-based method to weight features projected from multiple views.
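
A heuristic weighting of this kind might look like the sketch below, which fuses per-view predictions for a single mesh face. It assumes the projected features are class probabilities and that each view comes with a scalar weight (for example, derived from viewing angle or distance); both choices, and the name `fuse_views`, are illustrative assumptions rather than details from the post.

```python
def fuse_views(observations):
    """Fuse per-view class probabilities for one mesh face.

    observations: list of (weight, probs) pairs, where probs maps a class
    name to a probability. The weights stand in for a heuristic such as
    viewing angle or distance to the surface.
    """
    total = sum(w for w, _ in observations)
    if total <= 0:
        raise ValueError("at least one view needs a positive weight")
    fused = {}
    for w, probs in observations:
        for cls, p in probs.items():
            fused[cls] = fused.get(cls, 0.0) + w * p / total
    return fused


views = [
    (0.9, {"sofa": 0.8, "floor": 0.2}),  # close, head-on view
    (0.1, {"sofa": 0.3, "floor": 0.7}),  # distant, grazing view
]
fused = fuse_views(views)
best = max(fused, key=fused.get)
```

The fusion is passive in the sense that it happens after each view has been predicted independently; the model itself never sees the 3D relationship between views.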

By examining methods to integrate 3D context during the training phase, we have the opportunity to develop features that are independent of the viewing angle, thereby enhancing our models' comprehension.

Furthermore, we are exploring the potential of end-to-end 3D techniques to see if processing semantic understanding directly through 3D representations can improve our outcomes. This includes reevaluating our reconstruction methods.

Adopting cutting-edge techniques such as Neural Radiance Fields (NeRFs) or other innovative strategies could radically transform our current practices, leading to significantly enhanced model understanding and performance.

Multi-task Models

The idea of multi-task models, capable of executing multiple tasks at once, continues to captivate interest. Nevertheless, these models necessitate upkeep as a cohesive system, which makes the strategy of employing a shared backbone across multiple models more attractive.
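
The shared-backbone strategy can be sketched as follows: an expensive encoder runs once per image, and separate lightweight heads consume its output. Everything here is a toy stand-in (the "features" are row means, and the class and function names are hypothetical); it illustrates only the structure, not a real network.

```python
class SharedBackbone:
    """Stand-in for an expensive image encoder; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Toy "features": the mean intensity of each image row.
        return [sum(row) / len(row) for row in image]


def segmentation_head(features, threshold=0.5):
    """Toy head that labels each row from the shared features."""
    return ["furniture" if f > threshold else "structure" for f in features]


def depth_head(features, scale=4.0):
    """Toy head that turns the same features into a depth-like value."""
    return [scale * (1.0 - f) for f in features]


backbone = SharedBackbone()
image = [[0.9, 0.7], [0.1, 0.3]]
features = backbone(image)             # computed once
labels = segmentation_head(features)   # each head stays a separate unit
depths = depth_head(features)
```

Each head can be retrained or replaced on its own while the backbone cost is paid only once, which is the maintenance advantage over a single monolithic multi-task model.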

As we progress, striking the right balance between the advantages and complexities of multi-task models will be key to enhancing our workflows and results.

Open Vocabulary Models

Another exciting area of development is open vocabulary models. Traditional models, which are tethered to a fixed ontology, can be restrictive due to the wide range of customer needs.

However, open vocabulary models break free from these limitations. They possess the capability to recognize a much wider array of objects and concepts, unconfined by predefined categories.
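
A common way to implement open vocabulary recognition is to compare an image-region embedding with embeddings of free-text labels and pick the closest one. The sketch below assumes such embeddings already exist; the three-dimensional vectors are made up, and `label_region` is a hypothetical helper, not a Matterport API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def label_region(region_embedding, text_embeddings):
    """Pick the free-text label whose embedding is closest to the region."""
    return max(
        text_embeddings,
        key=lambda name: cosine(region_embedding, text_embeddings[name]),
    )


# Made-up embeddings; a real system would obtain them from a
# vision-language model rather than hard-coding them.
text_embeddings = {
    "standing desk": [0.9, 0.1, 0.0],
    "bean bag": [0.0, 0.2, 0.9],
    "fireplace": [0.1, 0.9, 0.1],
}
label = label_region([0.1, 0.3, 0.8], text_embeddings)
```

Supporting a new category only requires adding another text entry to the dictionary, with no re-annotation or retraining against a fixed class list.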

This adaptability is invaluable for Matterport, enabling a more generalized semantic understanding across a variety of spaces and applications. The adoption of open vocabulary methods promises substantial improvements in meeting diverse customer demands and enhancing the interoperability of our assets with other tools.

Exploring multi-modal approaches, such as vision-language models, represents another frontier for our technology. By combining visual and textual data, these models can understand and generate descriptions of complex scenes in a more nuanced manner.

By integrating multi-modal models, Matterport could substantially improve its semantic understanding of spaces. This integration would link visual elements with descriptive language, providing richer and more nuanced insights. Such an advancement could lead to more intuitive user interactions and enhance the capabilities of applications like virtual staging and property intelligence.

Conclusion

Expanding our semantic understanding of spaces will unlock a range of applications across multiple industries. We recognize that a single ontology cannot meet all customer needs. Thus, we see value in open vocabulary techniques and other methods that are not limited by strict ontological frameworks.

Another objective is enhancing the compatibility of our resources with various tools, and to this end, we are developing multiple integrations.

In this first installment of our blog series, we explored the critical role of semantic understanding and segmentation in our defurnishing process. Through semantic segmentation, we precisely identify and classify each component within a space, enabling the effective removal of furniture and other movable items.

This step is vital for the rest of our defurnishing workflow, guaranteeing that the final representation of the empty space is accurate and visually cohesive.

Up next in our series, we will delve into the intriguing realm of inpainting and reveal more about our inpainting techniques and their applications in creating realistic, defurnished spaces. We will examine the process of filling in voids left by object removal and showcase our latest publication, which sets a new benchmark in this field.