Integration of convolutional and adversarial networks into building design: A review

Convolutional and adversarial networks are found in various fields of knowledge and activity. One such field is building design, a multi-disciplinary and multi-task process involving many different requirements and preferences. Although showing several advantages over traditional computational methods, these networks are still far from being part of daily design practice. Nevertheless, if fully integrated, these methods are expected to accelerate design and automate procedures. This paper reviews these methods' latest advances and applications to identify current barriers and suggests future developments. For that, a systematic literature review extended with forward and backward snowballing was carried out. The focus was on the first design phases, including site layout, floor planning, and furniture arrangement.


Introduction
Researchers have strived for several decades to develop generative design methods that create candidate solutions from the sometimes conflicting project requirements and physical constraints of the building [1]. These methods require the user to specify numerous requirements and constraints while demanding user experience in fine-tuning the method's parameters to produce feasible solutions [2]. In contrast, current learning methods find new solutions with minimal user input, relying on past training data to learn implicit patterns [3]. Once the model learns the probability distribution, the method synthesizes new examples with a high degree of realism [4], an efficient supervised approach in terms of time and cost [5].
Building design is a creative process that cannot be fully described mathematically and may express personal preferences and judgments [6]. Although several technical, functional, and performance aspects may be objectively formulated [7], its associated subjectivity poses an obstacle to any automated generation or optimization procedure [8]. Nonetheless, researchers have made many efforts to use heuristic models. For example, the shape of a building may result from the interaction between project requirements and the physical constraints of the building, as in automated floor planning [9,10]. However, such an approach requires the user to specify many design inputs and to choose adequate algorithm parameters to produce feasible solutions [2]. Other examples of automated procedures are facade design [11,12], generation of indoor 3D scenes [13], and building energy and thermal optimization [14,15].
The literature divides the generative methods used in building design into image analysis (or computer vision) and image synthesis (or computer graphics) [16]. While computer vision can analyze an image and create a model, computer graphics provides tools that enable the analysis model to create images [17]. Image analysis and synthesis are foundational procedures in image processing [18].
Image analysis has achieved relevant results in visual recognition tasks [19]. In general, image analysis methods can be divided into four steps [20]: (i) image acquisition, (ii) image input and pre-processing, (iii) feature extraction and segmentation, and (iv) recognition and application. In image acquisition, digital images are collected or produced from satellites, cameras in aerial vehicles, photographic cameras, or 3D scanners. In pre-processing, image denoising or enhancement is carried out to improve the quality of the images. In feature extraction, specific image information, such as edges, corners, and contours, is obtained, and images are segmented into homogeneous, non-overlapping groups. Lastly, in image recognition, images are classified (categorizing and labeling pixels or regions in an image into several groups), objects are detected (bounding and classifying objects), and images are registered (overlaying images into a single image), retrieved (searching for images in a database), and reconstructed (creating a 3D model from images).
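The four steps above can be illustrated end-to-end with a deliberately minimal sketch in plain NumPy on a synthetic image. The filter size and threshold below are arbitrary choices for the example, not values taken from the reviewed literature.

```python
import numpy as np

def mean_filter(img, k=3):
    """Step (ii) pre-processing: simple mean-filter denoising."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def sobel_edges(img):
    """Step (iii) feature extraction: gradient magnitude via Sobel kernels
    reveals edges and contours."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(img, 1, mode="edge")
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    for y in range(3):
        for x in range(3):
            patch = padded[y:y + img.shape[0], x:x + img.shape[1]]
            gx += kx[y, x] * patch
            gy += ky[y, x] * patch
    return np.hypot(gx, gy)

def segment(img, thresh):
    """Step (iii) segmentation: threshold into two non-overlapping groups."""
    return (img > thresh).astype(np.uint8)

# Step (i) acquisition: a synthetic 'aerial' image with one bright square building
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
mask = segment(mean_filter(img), 0.5)   # step (iv) input: building vs. background
edges = sobel_edges(img)                # footprint contours
```

A recognition step (iv) would then classify or label the segmented regions, for instance by feeding `mask` or `edges` into a trained classifier.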
Unlike traditional generative design methods, deep learning methods discover implicit patterns from the training data [3]. Once the probability distribution is learned, these models synthesize new examples with high realism [4]. Deep learning is an efficient supervised approach in terms of time and cost, as evidenced by the increasing accuracy rates in various technology areas [5] and in the following building design tasks: image classification [21], semantic segmentation [22], and object detection [23].
Two deep learning methods are predominant in the literature: Convolutional Neural Networks (CNN) and Generative Adversarial Networks (GAN). CNN is the most widely used deep learning approach in computer vision applications [24], and its architecture involves three main components: the visible, hidden, and output layers. In the visible layer, the image's descriptor is the input, and the hidden layers (convolutional, pooling, and fully connected) learn high-level representations by gradually extracting low-level features [25,26]. Convolutional layers extract features by convolving the input with filters (kernels) composed of weights and biases, generating feature maps [26]. Then, an activation function is applied, making it possible to model non-linear functions in the network; this process repeats in every convolutional layer. Next, the pooling layers replace a small neighborhood of the feature maps with an invariant function, such as a max or mean operation. This replacement reduces the dimension of the feature maps and the number of network parameters, making the representation invariant to small shifts and distortions [26]. Lastly, fully connected layers recognize and classify the image by adjusting the weights, which determine the degree of influence of the input layer on the output layer.
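The components just described (convolution with a kernel of weights and a bias, a non-linear activation, pooling, and a fully connected output) can be sketched as a toy forward pass in plain NumPy. The weights are random and the layer sizes are arbitrary illustrative choices; a real CNN learns these weights by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(img, kernel, bias):
    """Convolutional layer: slide a weighted kernel over the image,
    producing a feature map (valid padding, stride 1)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y + kh, x:x + kw] * kernel) + bias
    return out

relu = lambda z: np.maximum(z, 0.0)   # non-linear activation function

def max_pool(fmap, s=2):
    """Pooling layer: replace each s x s neighborhood with its maximum,
    shrinking the feature map and adding shift invariance."""
    h, w = (fmap.shape[0] // s) * s, (fmap.shape[1] // s) * s
    return fmap[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

# Visible layer: a toy 8x8 grayscale input
img = rng.random((8, 8))
fmap = relu(conv2d(img, rng.standard_normal((3, 3)), 0.1))   # hidden: 6x6 map
pooled = max_pool(fmap)                                      # hidden: 3x3 map
# Fully connected output layer: weights map flattened features to 2 classes
logits = pooled.ravel() @ rng.standard_normal((9, 2))
probs = np.exp(logits) / np.exp(logits).sum()
```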
A model generates a new image by learning the distribution of features from past information [16,18]. For example, a house layout may be created using a model that learns the general rules of possible arrangements from a dataset with a specific building typology. The Boltzmann machine, deep belief network, Variational AutoEncoder (VAE), and GAN are examples of methods that can be used. GANs comprise two competing networks, a generator and a discriminator [36]. These networks play a min-max game in an iterative optimization process. While the generator produces synthetic data from random noise, the discriminator evaluates the results as real or fake. As the two models compete, the generator learns to deceive the discriminator, culminating in synthetic images that are indistinguishable from real ones.
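The min-max game can be made concrete with a deliberately tiny numerical sketch: a generator with two scalar parameters and a logistic discriminator, trained by alternating gradient steps on 1D Gaussian "real" data. All hyperparameters are arbitrary toy choices; real GANs use deep networks and automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# Real data: samples from N(3, 0.5); the generator must learn to mimic them
real = lambda n: rng.normal(3.0, 0.5, n)

a, b = 1.0, 0.0    # generator g(z) = a*z + b maps noise to synthetic samples
w, c = 0.1, 0.0    # discriminator d(x) = sigmoid(w*x + c): real (1) vs. fake (0)
lr, n = 0.05, 64

for step in range(500):
    z = rng.standard_normal(n)
    x_real, x_fake = real(n), a * z + b
    # --- discriminator ascent on log d(real) + log(1 - d(fake)) ---
    d_r, d_f = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * np.mean((1 - d_r) * x_real - d_f * x_fake)
    c += lr * np.mean((1 - d_r) - d_f)
    # --- generator ascent on log d(fake): learn to deceive the critic ---
    d_f = sigmoid(w * (a * z + b) + c)
    a += lr * np.mean((1 - d_f) * w * z)
    b += lr * np.mean((1 - d_f) * w)

# After training, generated samples should cluster near the real mean of 3
fake_mean = np.mean(a * rng.standard_normal(10_000) + b)
```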
The number of studies using deep learning in building design has been increasing steadily. The applications are vast, including object detection in floor plans [23], indoor scene segmentation [22], and indoor image classification [21]. Other building design-related applications are the design of indoor scenes [37], facade layouts [38], and floor plan arrangements [39]. However, a critical literature review on the application of these methods in architectural design is still missing. The reviews closest to the topic cover artificial intelligence in general [40], real estate [41], and construction [42].
This paper attempts to fill the gap mentioned above and answers the following research questions: Q1 - 'Which are the latest advances in CNN and GAN that are oriented towards architectural design?' Q2 - 'What are the challenges and barriers to integrating these methods into architectural design?' The present paper is divided into six sections. After the current introduction, the systematic literature review method is described in the material and methods section. Then, the literature analysis is carried out by covering different building design-related applications and benchmarking sections. Next, the research questions are answered, and the implications of the findings are debated in the discussion section. Finally, the main conclusions of this study are presented.

Material and methods
In order to answer the stated questions, a systematic literature review on the latest advancements in CNN and GAN in architectural design was carried out, following a three-step methodology.
In the first step, 'Literature Survey,' title words such as "architectural design," "building footprint," "deep learning," "CNN," "GAN," "networks," and "image" were combined to search for publications in the Scopus and Google Scholar databases. A period filter from January 2016 to December 2022 was applied to limit the analysis to the last seven years. An initial sample of 1201 publications was obtained (Fig. 1). After eliminating review articles, theses, and publications dealing with different topics but sharing the same nomenclature, 155 relevant publications were selected and used as a gold set. Lastly, the snowballing method was used, an iterative process with two complementary parts: backward and forward [43]. Backward snowballing starts by selecting other articles from the reference lists of the 155 publications. The relevance of each identified publication is then analyzed, and its reference list is once again screened. When new publications match the search goal, they are included in the overall sample; this iterative process ends when new papers can no longer be found. Forward snowballing proceeds similarly, but selects papers that cite the initial set of documents. After completing both parts, the final sample included 394 publications, divided into 109 conference proceedings, 23 preprints (arXiv), and 262 journal articles. Foundational publications in deep learning are also included in this sample, despite some being published before 2016. As the initial search had produced a number of false positives due to taxonomy similarities among the different topic areas, a method that allowed us to find and screen each publication was required. This explains the use of snowballing methods, which resulted in a significant increase in publications from the initial set.
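The backward/forward snowballing loop described above is essentially a traversal of the citation graph. The sketch below runs it on a hypothetical toy graph, with a placeholder relevance check standing in for the manual screening step; all paper names are invented for illustration.

```python
# `cites` maps each paper to the papers it references (hypothetical toy data)
cites = {
    "gold1": ["refA", "refB"],
    "refA": ["refC"],
    "refB": [],
    "refC": [],
    "other1": ["gold1"],       # cites a gold-set paper (forward direction)
    "other2": ["other1"],
    "noise": ["refB"],
}
is_relevant = lambda p: p != "noise"   # screening decision (manual in practice)

def snowball(gold_set, cites):
    cited_by = {}                       # invert the graph for forward snowballing
    for paper, refs in cites.items():
        for r in refs:
            cited_by.setdefault(r, []).append(paper)
    sample, frontier = set(gold_set), list(gold_set)
    while frontier:                     # iterate until no new papers are found
        paper = frontier.pop()
        backward = cites.get(paper, [])         # its reference list
        forward = cited_by.get(paper, [])       # papers that cite it
        for cand in backward + forward:
            if cand not in sample and is_relevant(cand):
                sample.add(cand)
                frontier.append(cand)   # newly added papers are screened again
    return sample

final = snowball(["gold1"], cites)
```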
During the second stage, 'Data Extraction and Tabulation,' information from the retrieved publications was extracted according to four applications of deep computer vision in building design: building footprint (perimeter of a building measured on a horizontal plane), building floor plan (visual representation of a home layout on a horizontal plane), 3D indoor scene (3D representation of an interior room), and building facade (representation of the external faces of a building in an elevation). The information is then grouped in each category according to topics in computer vision, such as segmentation, retrieval, generation, vectorization, fusion, detection, classification, and reconstruction. Finally, references, method architecture, input and output variables, and publication conclusions are tabulated.
In the third and last step, 'Synthesis of Findings,' the tabulated information is analyzed and synthesized to answer the research questions. Lastly, information about network architectures and dataset sources is compiled to support the development of deep learning approaches.

CNN and GAN-based applications
The initial stage of this study entailed the analysis of publications on CNN and GAN in building design. The statistical analysis included: (i) quantification of publications that used CNN or GAN-based methods in image analysis or image synthesis procedures and (ii) distribution of publications per computer vision task into four building design categories: building footprint, building floor plan, building facade, and 3D indoor scene.
This initial assessment indicated that the CNN model was the most used technique in image analysis (Fig. 2). The model is particularly effective in building footprint and 3D indoor scene tasks due to its ability to extract features accurately and rapidly from remote sensing data or other imagery without human supervision. Typical applications include identifying roofs, facades, furniture, and other building system components. Several derived architectures were implemented, including VGGNet, FCN, Faster R-CNN, Mask R-CNN, U-Net, and residual connections (ResNet).
Rather than extracting information, the GAN model creates new synthetic images. An example is creating building floor plans from a simple building boundary or the location of the openings in the perimeter. Another example is recreating a building facade or reconstructing a 3D indoor scene. This capability to produce or complete imagery is particularly helpful when the data sample is insufficient or the image resolution needs to be enhanced.
As shown in Fig. 3, building extraction is the most common application in the building footprint category, matching the number of all other applications combined (segmentation, detection, classification, and generation). CNN and GAN-based techniques for generating new building footprints account for nine cases.
The generation process stands out in the building floor plan category with 34 studies (nine CNN and 25 GAN), while segmentation has 14 (13 CNN and one GAN). In this case, the number of applications matches one of the most creative yet most common tasks among designers: the spatial organization of a building layout. A segmentation task in a layout design involves recognizing building floor plan elements, such as walls, doors, windows, and furniture, and their relationships by learning semantic information. Other applications (two GAN and nine CNN in total) include the detection of spaces, retrieval of information, reconstruction of designs, and classification of spaces.
In the building facade category, segmentation is dominant with 21 studies (19 CNN and two GAN). It fulfills similar functions as in the floor plan category but is applied to the exterior surfaces of the building and its components. This is followed by generation (seven CNN and one GAN publications), which plays an essential role in facade composition studies. Finally, detection, reconstruction, and retrieval complete the remaining studies (12 CNN). In the 3D indoor scene category, reconstruction (49 CNN studies), which estimates the 3D arrangement of a room from 2D images gathered using robots and virtual or augmented reality, is followed by segmentation (27 CNN studies) and classification (20 CNN and one GAN). Generation also stands out with ten publications (five studies in each method), in which 3D layouts for indoor scenes are created or a layout is changed from a 3D image. Finally, detection and retrieval add up to nine CNN cases.
The following subsections compare the latest CNN and GAN methods with traditional ones for each main building design task. Segmentation (CNN) and generation (GAN) stand out among all building categories. However, other tasks relevant to specific categories are also covered, such as building extraction (footprint), retrieval and detection (floor plan), and classification and reconstruction (facade and 3D indoor scene). Furthermore, these comparisons show the evolution of the analysis (CNN) and synthesis (GAN) processes in the different building categories. Thus, it is also possible to identify integration paths between computer graphics and computer vision in building design.

Building footprint
The building footprint may be retrieved from remote sensing techniques, such as through digital sensors aboard satellites and airplanes [44] or other sources, including unmanned aerial vehicles (such as light detection and ranging), crowdsourcing (phone imagery), and advanced driver-assistance systems.
Thanks to CNN, mainly Fully Convolutional Networks (FCN), the interest in image segmentation in remote sensing has grown recently [45]. Some examples of CNN architectures for building footprints include FCN, Faster/Mask R-CNN [46], and Res-U-Net, which are based on ResNet and SegNet. The main advantage of CNN is the capacity to accumulate contextual information about large objects over vast receptive fields. However, the process becomes challenging with low spatial resolution and blurry object boundaries [47]. Nonetheless, several strategies were employed in many recent studies to improve the semantic segmentation process in remote sensing [46].
For decades, non-deep learning methods such as Random Forest [48], Adaboost, and Support Vector Machines [49] were used in building extraction. However, these were not applied to complex, high-diversity urban regions [50]. Moreover, these methods were based on handcrafted features, which poses a problem for non-experts, since experience is required to define specific design features [51], particularly given buildings' contour features, complex morphology, and occlusions from shadows. CNN-based methods presented higher accuracy [51][52][53][54][55].
Automated building extraction reduces time and cost when used to produce large-scale maps. In addition to FCN, other CNN approaches are widely explored in remote sensing tasks, such as ResNet [56][57][58] and U-Net [59,60]. Nevertheless, these methods still demanded efforts to reduce the prediction error relative to the ground-truth pixels [61]. To address this issue, GAN has a specific architecture for classifying output images as real or fake. Applying a GAN to the segmentation map [62] improves accuracy, with an F1-score of 96.8% compared with 95.2% for an FCN model. For image completion, the Wasserstein GAN architecture (WGAN) [63] creates patches to fill gaps in an unfinished image, using residual learning procedures to improve the produced picture. In a remote sensing application, WGAN and CGAN were combined into the CWGAN, improving the quality of building footprint generation. Exploring the advantages of WGAN and CGAN, the CWGAN-GP method [61] improved the quality of building footprint generation compared with methods such as CGAN and U-Net.
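For reference, the F1-score used to compare these segmentation models is the harmonic mean of pixel-wise precision and recall. A minimal sketch on toy binary building masks (the mask shapes here are invented for illustration):

```python
import numpy as np

def f1_score(pred, truth):
    """Pixel-wise F1 = 2*P*R / (P + R) for building (1) vs. background (0)."""
    tp = np.sum((pred == 1) & (truth == 1))   # true positives
    fp = np.sum((pred == 1) & (truth == 0))   # false positives
    fn = np.sum((pred == 0) & (truth == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = np.zeros((10, 10), dtype=int)
truth[2:8, 2:8] = 1                   # ground-truth footprint: 36 pixels
pred = np.zeros_like(truth)
pred[3:8, 2:8] = 1                    # prediction misses one row of the footprint
score = f1_score(pred, truth)         # precision 1.0, recall 30/36
```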
Res2-UNet [64] is a recent and prominent approach to building footprint detection from high-resolution remotely sensed images. The algorithm aims to handle intricate background objects and issues such as small buildings, which can be missed in an urban context. Res2-UNet achieved state-of-the-art performance over three public datasets: WHU Aerial building, Satellite II Dataset, and Massachusetts building. However, the method struggles to distinguish some confusing bare roads from buildings, causing background objects to be classified erroneously as buildings.

Building floor plan
Image recognition of floor plans is an effective procedure in automated document analysis, part of computer vision tasks [6], from which techniques such as floor plan retrieval [65], floor plan vectorization [66], and automatic floor plan generation [39] can be developed.
Architectural floor plan analysis is a specific application domain of graphics recognition whose automation was incorporated into computer-aided design platforms [67]. In recent decades, several works explored techniques for floor plan recognition, such as room and wall segmentation [67][68][69][70][71][72]. Other steps follow floor plan analysis in deep computer vision, such as information segmentation, structural analysis, and semantic information extraction and alignment [70]. These steps enabled graphic segmentation with VGG [73], graphic and text segmentation with an FCN architecture [6], and wall extraction and detection with U-Net [74] and Faster R-CNN [23]. Beyond the traditional orthogonal forms of floor plan representation, irregular shapes may also be recognized from kernels and the GAN training process [75].
The floor plan retrieval procedure serves as (i) an image-based recommendation that includes the preferences of a home purchaser or (ii) a creative source of information to help the designer in the initial stage of the design process. Traditionally, researchers use these techniques for symbol recognition [76,77] or segmentation of the floor plan [69,78]. For example, the a.SCatch system is a sketch-based application that automatically extracts the semantic structure from old projects or an architect's sketch to retrieve a similar floor plan from a repository [78]. In recent years, modelers used a modified AlexNet [65], FRCNN [79], and Cyclic GAN [80,81] to capture semantic features from floor plans. The latter is more efficient in matching the sketch and the image query [80]: while the Conditional GAN needs aligned image pairs to be trained, the Cyclic GAN does not.
An architectural drawing, such as a floor plan or cross-section, is often created in a vector format and later converted into a raster image for publication in a digital medium. This procedure limits post-processing, such as analysis, synthesis, or modification, and results in the loss of structured geometry and semantic information [82]. Recovering this information is a challenging process known as vectorization, which converts raster data into vector data by transforming pixels into geometric primitives, such as lines, polylines, and curves. This problem may be solved using several methods, including a skeleton-based approach [68] and the coupling of Hough transforms with vectorization [69]. For example, a CNN algorithm developed for low-level rasterized semantic and geometric information achieved 90% accuracy in a vectorization process [66]. In another example, an application of vectorization used a multi-task ResNet to create a large-scale floor plan dataset, known as CubiCasa5K, with 5000 samples arranged into 80 categories [82].
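The pixel-to-primitive idea behind vectorization can be illustrated with a deliberately naive pass that collapses each horizontal run of foreground pixels into a line segment. Real vectorizers fit richer primitives (polylines, curves) and merge collinear runs; this sketch only shows the raster-to-vector conversion in its simplest form.

```python
import numpy as np

def vectorize_rows(mask):
    """Collapse each horizontal run of True pixels into a segment
    (row, x_start, x_end) -- a minimal raster-to-vector conversion."""
    segments = []
    for y in range(mask.shape[0]):
        x = 0
        while x < mask.shape[1]:
            if mask[y, x]:
                start = x
                while x < mask.shape[1] and mask[y, x]:
                    x += 1
                segments.append((y, start, x - 1))
            else:
                x += 1
    return segments

# A toy rasterized wall drawing: one long stroke and one short stub
mask = np.zeros((4, 8), dtype=bool)
mask[1, 1:7] = True
mask[2, 3:5] = True
segs = vectorize_rows(mask)   # [(1, 1, 6), (2, 3, 4)]
```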
Recently, researchers have been developing generative models to create novel floor plans. The primary method used in layout and furniture generation is GAN, which can be adopted alone [39,83,84] or linked to some complementary source, such as a graph of the connections between the building spaces (each node represents a room, and each edge an adjacency). For example, because GAN has only the noise signal to control the generation process, InfoGAN [85] improves the method by using a given graph to produce and control a variety of design solutions. Using this method in floor plan generation allows the creation of early conceptual designs through latent code learning, incorporating topology features or functional configurations obtained from various designs [86]. Another example of this application is House-GAN [87], developed specifically to generate house floor plans by combining a GAN with a graph constraint. The output quality is measured according to realism, diversity, and compatibility with the input graph constraint. Similarly, even though Graph2graph is a CNN-based approach, it is also graph-constrained and allows the user to generate a variety of floor plans for the same input boundary [3].
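The graph constraint used by these graph-conditioned generators can be sketched as a plain adjacency structure: nodes are rooms and edges are required adjacencies, and a generated layout is "compatible" if it realizes every required edge. The room names and the check below are a hypothetical illustration, not House-GAN's actual encoding.

```python
# Nodes are rooms (with a type label); edges are required adjacencies
rooms = {0: "living", 1: "kitchen", 2: "bedroom", 3: "bathroom"}
adjacency = [(0, 1), (0, 2), (2, 3)]   # living-kitchen, living-bedroom, ...

def neighbors(adjacency, node):
    """Rooms that must share a wall with `node` in any generated layout."""
    out = set()
    for a, b in adjacency:
        if a == node:
            out.add(b)
        elif b == node:
            out.add(a)
    return out

def satisfies(layout_adjacency, required):
    """Compatibility check: every required adjacency appears among the
    adjacencies actually realized by a generated layout (order-free)."""
    realized = {frozenset(e) for e in layout_adjacency}
    return all(frozenset(e) in realized for e in required)

# A generated layout may realize extra adjacencies, but must cover all required ones
ok = satisfies([(1, 0), (0, 2), (2, 3), (1, 2)], adjacency)
```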
Conditional GAN (cGAN) [88] is a notable generative method that creates synthetic layouts satisfying geometric constraints from bubble diagrams and space allocation heat maps to achieve topological preferences. In addition, this method gives designers greater control over the result, which is often a challenge for unconditional GAN methods [89]. However, the main limitation of this work is that it does not address typologies with more than one floor. In multi-story generation, vertical circulation spaces and their relationship with the remaining rooms must be addressed to obtain coherent designs.

3D indoor scene
The 3D scene understanding of indoor environments has grown in recent years, mainly in applications such as augmented reality, virtual reality, robotics, games, and interior design. The artificial intelligence agent must recognize the scene's functional attributes and semantic labels and understand apparent and hidden relationships between its components [90]. Therefore, these applications consider the complex geometric and semantic context of all parts of the analyzed space and their relationships.
Regarding human visual scene interaction, modelers should consider human perception beyond the analysis of image features and include the user's memory, language application, and the constraints of the visual system [91]. For example, measuring a person's feelings in interior design is promising research on human behavior and decision-making. In this case, the emotional responses are determined using an electroencephalography-based deep learning model [92].
The reconstruction of 3D indoor scenes, which consists of building a three-dimensional shape from two-dimensional images, was frequently studied in the last decade. The Manhattan world assumption [93], Bayesian network recognition [94], and geometric context [95] were some of the methods used. Recent methods, such as CNN, improve 3D reconstruction, which can be accomplished with a single input image [96][97][98][99][100][101][102]. However, incomplete images, low depth resolutions, missing data, and sensor noise still pose challenges [90]. For example, LayoutNet [98] works directly on the panoramic image, unlike other recent studies that decompose it into perspective images. This method is similar to RoomNet [102] but has better accuracy with perspective images. The improvement of LayoutNet comes from image alignment based on vanishing points and the prediction of multiple layout elements under Manhattan constraints. Another example is HorizonNet [101], which also uses panoramic images but outperforms LayoutNet by representing the room layout as three 1D vectors that encode the wall-floor, ceiling-wall, and wall-wall edges.
A relevant way to understand a 3D scene is semantic segmentation of a 3D point cloud. A volumetric 3D point cloud representation can be achieved with a voxel grid, a regular grid in 3D space. Several investigations have combined voxel grids with convolutional frameworks, such as DGCNN [103], 3D-FCN [104], SCSS-Net [105], and CNN with conditional random fields (CRFs) [106]. CRFs combine the advantages of classification, graphical modeling, and efficient parameter optimization, making them a common resource for 3D point semantic segmentation [107][108][109][110][111]. In addition, 3D fully convolutional networks (3D-FCN) combined with CRFs are potential deep learning methods for the classifier stage in 3D point cloud segmentation [112].
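The voxel-grid representation that these convolutional frameworks consume amounts to quantizing point coordinates onto a regular lattice. A minimal NumPy sketch on a synthetic room-sized point cloud (the room dimensions and voxel size are arbitrary example values):

```python
import numpy as np

def voxelize(points, voxel_size):
    """Quantize a 3D point cloud into a regular voxel grid: each point
    falls into cell floor((p - origin) / voxel_size). Returns the occupied
    voxel indices and the number of points in each occupied voxel."""
    origin = points.min(axis=0)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    voxels, counts = np.unique(idx, axis=0, return_counts=True)
    return voxels, counts

rng = np.random.default_rng(2)
# A toy indoor scan: 1000 points scattered in a 4 m x 3 m x 2.5 m room
points = rng.uniform([0, 0, 0], [4.0, 3.0, 2.5], size=(1000, 3))
voxels, counts = voxelize(points, voxel_size=0.5)
```

A 3D CNN would then operate on this grid, e.g. as a dense occupancy tensor with `grid[tuple(voxels.T)] = 1`.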
There are several approaches to 3D indoor scene modeling (the arrangement of a room from the relationships of a given set of furniture and elements), such as human-activity-centric modeling [113], example-based data-driven scene synthesis [114], and action-driven 3D indoor scene evolution [115]. Data-driven scene modeling may also be based on other types of resources, such as text [116,117], sketches [118][119][120], and images [121]. Recently, researchers used a graph conditional layout prediction network (GC-LPN) to estimate the layout of a room and a language conditional texture GAN (LCT-GAN) to generate interior textures [122]. Generative recursive autoencoders [123] and hybrid generative models (GAN and VAE) [124] are also alternatives for indoor scene synthesis.
To achieve high accuracy in 3D indoor scene parsing, methods such as PGDENet [125] brought substantial improvements. In the RGB-D case, the method fuses multimodal information (color and depth) into discriminable features that allow correct scene classification. Trained on the NYUv2 and SUN datasets, the model achieved superior performance to previous methods. However, despite the availability of structured models which integrate various data types to enhance scene understanding, there are still challenges in effectively retrieving and fully utilizing the different modalities to ensure accurate and efficient results. Additionally, such methods often suffer from reduced performance under changes in illumination, resulting in unclear object boundaries and difficulties in detecting small objects [126].

Building facade
One of the applications in building facade design is classification according to the facades' historical style.Modelers used a multimodal latent logistic regression [127], Support Vector Machines [128], and k-means [129][130][131][132] to group images according to a building's historical period in Mexico [133].In addition, these techniques, among others, allow the grouping of architectural elements into categories, such as doors, windows, and columns.
The parsing of building facades has evolved over the last years in applications such as 3D city modeling. The challenge of this task lies in the variations of scenarios and the changes in illumination, visual perspective, and occlusions [134]. One of the most common techniques for 3D city modeling is the reconstruction of the building facade, which, together with scene understanding, is of interest to architectural design, digital scenarios in movies, and virtual environments in games. For many of these deep learning applications, facade segmentation has been a fundamental vision task for detecting components of the facade with architectures such as CNN [135], ConvNets [38], Mask R-CNN [136], Faster R-CNN [137], RPN [138], SegNet [135,139], FCN [140], and Pyramid ALKNet [141]. Pyramid ALKNet has performed better with occluded regions, ambiguities, and other image issues.
The 3D reconstruction of a facade from a 2D image or point cloud is another application of these methods. Researchers applied computer vision methods in location-based services, architectural restoration, and urban planning. Automated facade reconstruction may employ a grammar-based algorithm describing the visual structure [142][143][144][145][146]. Compared to global methods, this approach is flexible in accounting for facade variation [145]. In recent years, researchers reconstructed 3D buildings from facade point clouds using CNN [147], Faster R-CNN in complex scenarios that included texture [148], and the GAN-based FCRN-Depth in cases with a higher number of segmentation classes [149].
The generation of facades is a complex challenge because their design must relate environmental, functional, structural, and aesthetic parameters. For this reason, a multi-objective genetic algorithm is a helpful optimization procedure to find the best-performing solution from a combined set of two or more decision variables. For example, in daylight optimization, the building shape results from the relationship between parameters such as the window-to-wall ratio, number of windows, window distribution, and length of shading devices [150]. The split grammar, a variety of shape grammar [151], is also a shape control approach for generating facades which can create different building styles [152,153]. This approach's main limitation is its inability to learn, although it is suitable for automating design creation from rules defined by the designer [154]. Pix2pix, a variety of cGAN, is a framework that learns from semantic facade images during network training and creates new 2D designs [155]; City-GAN [156] and DCGAN [154] are further GAN-based alternatives.
Deep learning with prior knowledge of facades [157] can detect their elements automatically, pixel-wise, in complex environments. The framework outperforms other state-of-the-art models on two public datasets: Ecole Centrale Paris (ECP) and ArtDeco. One limitation of this study is the laborious labeling of building facade elements. Transfer learning can be used as an alternative to address this issue, yielding improved accuracy with less labeled data.

Recognition and other applications
The architectural design process is often divided into analysis, synthesis, and appraisal, as demonstrated in Fig. 4. Complementarily, the decision made in an early design stage sets the starting point from which the designer elaborates the next stage of the design process. The CNN and GAN methods, in turn, can be divided into recognition (e.g., image segmentation, object detection, and image classification) and other applications (e.g., automation in vectorization, image retrieval, reconstruction, and generation). The integration between CNN-GAN methods and early architectural design is represented in Fig. 5, which summarizes the main steps of this approach.
CNN and GAN allow automation in different steps of the design process, which can be identified as automation in recognition, automation in the query process, automation in the conversion process, and automation in the generation process.
Automation in recognition is the basis of applications in building design, since all applications use at least object detection (e.g., of chairs, doors, and beds [23]), facade segmentation [158], land use [159] and land cover [160] classification, or 3D scene recognition [161] in their computer vision process. Most of the methods used in these applications are CNN-based. The role of GAN in this process, on the other hand, can be to improve the accuracy of the segmentation map [62], the floor plan [75], and image completion [63].
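The feature extraction underlying these recognition tasks rests on convolution. The sketch below, a plain-Python valid-mode convolution with a hypothetical vertical-edge kernel, shows how a filter responds at the boundary between dark (e.g., wall) and bright (e.g., opening) pixels; real CNNs learn such kernels from data rather than hand-coding them.

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (strictly, cross-correlation, as in
    most deep-learning frameworks) over a grayscale image given as
    nested lists of numbers."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# Dark region (0s) meeting a bright region (1s); the vertical-edge
# kernel produces a strong response only at the boundary.
image = [[0, 0, 0, 1, 1, 1]] * 3
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
response = conv2d(image, kernel)
```

The response is zero over the flat regions and peaks over the edge, which is exactly the locality that makes convolutional features effective for detecting building elements.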
Through the different CNN and GAN methods for recognition, it is possible to transform the image representation into data, which the analysis stage requires for decision-making in building design. Then, in sequence, the application for the synthesis stage starts.
Automation in the query process is the comparative part of design and can count on the support of several references in the analysis process. Sketch-based image retrieval (SBIR) applied to floor plans [65,80,162,163] and facades [164,165], for example, is the task that explores the intuition humans express through the relationship between a drawing and the resulting technical image.
The use of sketches as an aid in the creative stage is common when developing an idea in architectural design. Therefore, automation in the query process can be seen as a methodological approximation between the traditional design model and the CNN-GAN method. With this result, it is possible to automate the conversion process, the next step to make the result manageable in CAD.
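Modern SBIR pipelines typically embed both the sketch and the database images with a CNN and then rank by similarity in the embedding space. The sketch below assumes such embeddings already exist (the 4-D vectors and plan names are invented) and shows only the ranking step, using cosine similarity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, database, top_k=2):
    """Rank database items by cosine similarity to the query embedding."""
    ranked = sorted(database.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical 4-D embeddings for three stored floor plans and one sketch.
database = {
    "plan_L_shaped": [0.9, 0.1, 0.0, 0.2],
    "plan_open":     [0.1, 0.8, 0.3, 0.0],
    "plan_corridor": [0.2, 0.1, 0.9, 0.1],
}
sketch = [0.85, 0.15, 0.05, 0.1]   # resembles the L-shaped plan
results = retrieve(sketch, database)
```

In a real system the embeddings would come from a network trained so that a rough sketch and its corresponding technical drawing land close together in this space.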
Automation in the conversion process can be divided into vectorization and reconstruction. While vectorization converts a raster format into a vector file, reconstruction converts a real object into a 3D model through sensors or by estimating spatial depth from a set of 2D images. For example, a 3D layout estimation from a 2D image helps building design analysis and the creation of new datasets [166]. In addition, these samples can be used to train and test generative networks for new building designs.
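A minimal illustration of the vectorization idea: the toy function below converts horizontal runs of wall pixels in a binary raster into line segments. Real raster-to-vector systems [66,167,168] must handle both orientations, junctions, and curves; this sketch covers only the simplest possible case.

```python
def vectorize_rows(mask):
    """Convert horizontal runs of wall pixels (1s) in a binary raster
    into segments (row, col_start, col_end) — a simplified stand-in
    for raster-to-vector conversion of floor plan drawings."""
    segments = []
    for r, row in enumerate(mask):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1
                segments.append((r, start, c - 1))
            else:
                c += 1
    return segments

# Tiny invented wall mask: top wall, two side-wall stubs, bottom wall.
mask = [
    [1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]
segments = vectorize_rows(mask)
```

Once pixels become parametric segments like these, the geometry is directly manipulable in CAD.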
In architectural design, the analysis phase usually requires building recognition, such as indoor layout information [167], which may be implemented in complex drawings with better scores in a one-to-one match compared to existing methods. Therefore, a vectorization function [66,167,168] is required to enable CAD manipulation. These results are then used in the automation of 3D modeling.

Fig. 4. Early architectural design method.
When the CNN-GAN method is used for building design, automation in the generation process occurs between the synthesis and decision-making stages. For example, with the GAN method, feasible layouts [169], facades [170], or different 3D indoor scene styles [171] are created according to the training image input and the network architecture. The generative process is also fundamental when using synthetic datasets and increased dataset sizes to train and evaluate the classification model [92].
Generation is an iterative process whose results need to be evaluated. If a result is not suitable, the process should return to the analysis stage and resume until a new evaluation is carried out.
In the following, Figs. 6-9 present the main structure identified throughout the paper when analyzing the relationship between CNN-GAN methods and building design. In this structure, a pattern composed of data input, analysis, and synthesis tasks was found. Furthermore, these results have two flow categories: required and possible. A required flow is the path necessary to completely fulfill a task or a set of tasks, while a possible flow is only an alternative.
Modelers use image analysis for urban planning, due diligence in law, real estate for land development, landscape architecture, environmental remediation in engineering, and building design for architectural concepts. In addition, the result of possible flows between the analysis and generation processes (Fig. 6) allows decision-making in building design, such as on the dimensional aspects of the site: boundaries, location, zoning classification, and dimensional implications.
With the land size, zone, and constraints defined, the information flows through retrieval, conversion, and analysis to make floor plan components, such as walls and doors, recognizable and manipulable (Fig. 7).
The analysis of the 3D indoor scene (Fig. 8) is composed of 3D reconstruction (considering geometric, semantic, and topological modeling) and recognition (including semantics, detection, and classification) of indoor scene components. The possible flow between the analysis and synthesis processes, or other inputs such as 3D images and natural-language text, allows the generation of a new 3D indoor scene and its furniture arrangement.
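Generative indoor-scene methods must respect hard geometric constraints before any question of style arises. The sketch below checks the two simplest ones, containment in the room and pairwise non-overlap of axis-aligned furniture boxes; the room and furniture dimensions are invented for illustration.

```python
def overlaps(a, b):
    """Axis-aligned rectangles given as (x, y, w, h);
    True if their interiors intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def valid_arrangement(room, furniture):
    """Every piece fully inside the room and no two pieces overlapping —
    the kind of hard constraint a generative scene model must satisfy."""
    rx, ry, rw, rh = room
    for fx, fy, fw, fh in furniture:
        if not (rx <= fx and ry <= fy
                and fx + fw <= rx + rw and fy + fh <= ry + rh):
            return False
    return all(not overlaps(a, b)
               for i, a in enumerate(furniture)
               for b in furniture[i + 1:])

room = (0, 0, 10, 8)
bed, desk = (0, 0, 4, 3), (6, 0, 3, 2)
ok = valid_arrangement(room, [bed, desk])           # feasible layout
bad = valid_arrangement(room, [bed, (2, 1, 3, 3)])  # piece overlaps the bed
```

Learned scene generators effectively internalize such checks from training data, but an explicit validator like this is still useful for filtering infeasible samples.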
Reconstruction creates a facade model from a 2D image and a 3D point cloud (Fig. 9). This process results in facade segmentation, component detection (such as frames for doors and windows), and architectural style classification. Finally, the last possible flow runs from the analysis to the synthesis process, when the generator is able to create a new building facade or 3D outdoor scene, completing the analysis and synthesis stages of the conception of form in architectural design. In summary, the whole process consists of determining the location of the building, arranging the rooms, distributing equipment and fixtures in indoor scenarios, and designing the facade of the building by integrating deep learning into architectural design conception.

Advantages, challenges, and barriers
The main advantages of CNN are being free from hand-crafted features and having a better generalization capability than other classification-based approaches, both in accuracy and efficiency [172]. CNN frequently performs better than other machine learning techniques for sketch-based image retrieval in building design [173]. Other sketch recognition techniques, such as histograms of oriented gradients and the scale-invariant feature transform, cannot capture the abstract nature of sketches as well as CNN [80]; moreover, those techniques are hand-crafted image classification approaches. CNN-based methods applied in building design have in their favor the ability to deal with complicated background details and to detect small buildings in building footprint datasets [64], in addition to detecting elements in complex environments in building facade datasets [157] and achieving high accuracy in 3D scene recognition [125]. As a hypothetical example, a CNN may reconstruct a 3D scene using meshes, textures [174], and complex shapes [175] from a floor plan image.
In building footprint, the advantages of GAN can be found in improving image quality [61] or completing images affected by occlusions [62]. In floor plans [86] and 3D indoor scenes [122], the combination of GAN methods with graphs, or with bubble graphs and space allocation heat maps [176], allows greater control over the outcome of the design process. Together with CNN, for example, GAN-based methods can convert raster to vector with combinatorial optimization at the junction units of floor plans with complex drawings [177] and, combined with U-Net for the vectorization of the building, can generate 3D models directly [168]. The GAN method is frequently used for the creative process in the early design phase, especially for facades [178] and floor plans [3].
However, integrating these methods also faces several challenges and barriers, such as the lack of a ready-to-use tool for most users and the lack of interpretability of deep models, also known as the black-box issue [179]. Other limitations arise when the method is unable to distinguish bare roads and buildings from the building footprint [64], handle light changes, detect small objects, or retrieve RGB-D data from different 3D modalities of an indoor scene [125]. Although CNN's key ingredient is training data [180], technical challenges include the limited size and quality of publicly available building footprint datasets [46]; the lack of representative floor plan datasets [181]; insufficient data for indoor scenes compared to outdoor scenes [182]; and the difficulty of collecting large-scale annotated datasets for segmentation [183]. High-quality datasets are still uncommon despite the rise in free datasets [180]; several present scenes with occlusions or low-quality images, creating the need to pre-treat the data. Data annotation, on the other hand, is an arduous, manual task requiring precision and is thus prone to error [184]. Furthermore, having human experts label and annotate building design data regarding architectural elements, styles, and spatial relationships is indispensable. Supervised methods may require human labeling [181,185], which is time-consuming and costly.
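One common mitigation for the small annotated datasets discussed above is label-preserving data augmentation. The sketch below generates the eight flip/rotation variants of a tiny invented binary floor plan mask; production pipelines would add crops, color jitter, and noise on real imagery.

```python
def hflip(grid):
    """Mirror a 2D grid horizontally."""
    return [list(reversed(row)) for row in grid]

def rot90(grid):
    """Rotate a 2D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def augment(grid):
    """Generate the eight flip/rotation variants of one sample —
    a cheap, label-preserving way to stretch a small annotated dataset."""
    variants = []
    for base in (grid, hflip(grid)):
        g = base
        for _ in range(4):
            variants.append(g)
            g = rot90(g)
    return variants

plan = [[1, 0],
        [1, 1]]
samples = augment(plan)
```

Augmentation only stretches existing data, of course; it does not remove dataset bias, which is a separate concern.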
It is also important to highlight the ethical challenges associated with creating models from datasets. First, datasets may be biased [186]. For example, they may favor a particular architectural style, architect, or cultural aspect that will then be prevalent in the generated designs. If how the model was trained is opaque, in the sense that the source data are unknown, the user may be unaware that the produced result may segregate or have negative cultural impacts. Second, these models may limit the role of humans in the creative design process [187]. Human creativity in building design involves applying judgment, intuition, and subjective decision-making based on experience, values, and ethical considerations that vary among societies and users. Lastly, it is important to understand how these models will respect the intellectual property and privacy rights of data creators and data subjects [188]. Data creators have authorship and intellectual rights for which they are not credited or monetarily compensated if the model is used commercially. In addition, the source data may contain imagery subject to privacy rights, which may vary depending on each nation's legislation. Thus, transparency, explainability, and mechanisms for human supervision [189] are crucial to address some of these ethical issues.

Methods and benchmarks
In addition to mastering network architectures, deep learning requires large image datasets when applied to architectural design. Although many papers use proprietary data, several open-access datasets are available. Benchmarking compares different algorithms or methods using free, high-quality training data. Computer vision has a long tradition of using datasets with RGB, RGB-D, and HSV images, panoramic images, 3D point clouds, 3D mesh segmentation, and vector images in robotics and remote sensing. Modelers use these data in computer vision tasks such as image and point cloud segmentation, scene classification, object detection, 3D indoor or outdoor scene reconstruction, image retrieval, and object generation. For this purpose, they test diverse network architectures to find the best results, as each method's accuracy depends on the size and quality of the dataset. In addition, they may also split the data to guarantee the diversity of objects in training.
Benchmarking can present the state of the art based on quantitative or qualitative criteria. Among the analyzed works, the CNN method most often reports metrics such as Overall Accuracy (OA), Precision (P), Recall (R), F1-Score (F1), and Intersection over Union (IoU), defined from the confusion counts (true/false positives and negatives) as OA = (TP + TN)/(TP + TN + FP + FN), P = TP/(TP + FP), R = TP/(TP + FN), F1 = 2PR/(P + R), and IoU = TP/(TP + FP + FN). Depending on the purpose of the method (classification or regression), different evaluation metrics may be used. For example, when a user inputs a facade image, the result may be a value indicating the level or intensity of the attribute being predicted, e.g., indoor daylighting; in that case, the usual metrics are MSE and MAE. Conversely, in classification, the same facade image may be used to categorize its architectural style, e.g., modernism, classical architecture, or postmodernism; in this case, the metrics are OA, P, R, and F1.
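The classification metrics above can be computed directly from binary confusion counts, as in this short sketch (the example counts are invented):

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard evaluation metrics from binary confusion counts:
    tp/fp = true/false positives, fn/tn = false/true negatives."""
    oa = (tp + tn) / (tp + fp + fn + tn)   # Overall Accuracy
    p = tp / (tp + fp)                     # Precision
    r = tp / (tp + fn)                     # Recall
    f1 = 2 * p * r / (p + r)               # F1-Score (harmonic mean of P and R)
    iou = tp / (tp + fp + fn)              # Intersection over Union
    return {"OA": oa, "P": p, "R": r, "F1": f1, "IoU": iou}

m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
```

Note that IoU ignores true negatives, which is why it is preferred over OA for segmentation tasks where the background class dominates.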
Table 1 compares the main methods used in different building categories in recent years. The OA, P, R, F1, and IoU metrics are the most common for CNN, while FID is the most frequent among GAN applications. In some cases, more than one dataset is used to compare the same methods and verify how they behave under different data conditions. The state-of-the-art results follow the order of the metrics in the table, with bold font indicating the best result compared to the other benchmarked methods.
In the building footprint category, the extraction task achieves high accuracy on small buildings and complex roof configurations. However, it has a low R and does not use land cover classification in its applications [203]. In the detection task, shadow extraction methods may falsely detect dark land covers as dense vegetation [205] and do not achieve the best P and R.
In the building floor plan category, the detection task has demonstrated the advantages of graph-based representations for analysis, supported by quantitative results [207]. However, the drop in performance of the baseline method on the applied dataset demonstrates the difficulty of constructing a repository that supports both graph- and raster-based approaches. Furthermore, recognition fails on villas when a single figure contains different floors. It also does not perform well in the semantic segmentation of basic elements such as walls, windows, and rooms, and of icons such as a sink or fire [75]. Finally, in the retrieval task, the multi-oriented framework handles rotations at scan time and is also invariant to scale [79]. However, performance is reduced when severe noise is present; in this case, new descriptors with specific resources are needed to annotate the global figure shape automatically.
In the building facade category, misclassifications may occur when classes of houses are similar or indistinguishable [210], such as "villa" and "detached house," suggesting that a better class definition is needed. Some approaches overcome misclassification by adding several pre-processing features to extract different channel information; however, this procedure uses more memory and a longer training time. Furthermore, in the generation task, there are limitations in the functionality of the assignment, such as location, neighborhood, culture, and climate, to customize the architectural design [178].
The generation task in the 3D indoor scene category has two noticeable limitations [171]. First, the network cannot model view-dependent light effects because it assigns one color to each 3D coordinate; in addition, the model does not recognize transparency, thus considering all colors opaque. Second, large scenes take a long time to train, reaching up to 15 h for one scene of the SceneNet dataset.

Conclusion
This paper reviews the application of CNN and GAN algorithms in several tasks of automated building design, such as building footprint, floor plan, 3D indoor scene, and facade. The increasing number of publications reflects the growing interest in deep learning in building design, particularly in computer vision and computer graphics techniques. In addition, the increasing availability of public datasets and source code will encourage more professionals to adopt these techniques, thus opening a path for radical and innovative solutions in building design. Nonetheless, their use is still in its infancy and is uncommon among building engineering and architectural design professionals. CNN- and GAN-based approaches have great potential for integration into building design. Once trained, they can be employed without prior experience and expertise, thus overcoming the drawbacks of traditional artificial intelligence methods. These approaches also have a high degree of automation and produce good-quality results (still highly dependent on the type of task). Although these methods display more advantages in the analyzed studies than other methods, the workflow between the different stages of building design is neither smooth nor accessible to many designers. In conclusion, CNN-based networks provide the necessary understanding, manipulation, and results when adequately applied, while GANs were found to help complete images, generate building layouts, and create facades in different architectural styles.
Furthermore, CNN and GAN improve performance compared to traditional methods, reducing classification errors and improving results. In addition, combining CNN and GAN automation can overcome common barriers in image processing by facilitating the participation of non-specialists in specific steps, such as remote sensing or floor plan vectorization for 3D reconstruction.
As observed in the previous section, the number of deep-learning approaches used in building design is growing, indicating an increase in the automation flow and in the integration between the different building representations of the design process. Several advances are described in the literature, particularly in building extraction, the integration of CNN and GAN in vectorization, the reconstruction of 3D models from 2D imagery, and the broader and more reliable retrieval of models from sketches. In addition, the quality and accuracy of the results give non-experts and designers an incentive to use them.
However, these methods face several challenges and barriers, such as low quality and occlusions in datasets, and present obstacles for some users due to the lack of a user-friendly interface.In addition, several ethical issues are associated with the datasets used to create these methods that will eventually be regulated by societies and may pose a challenge to integrating these models in building design.
Future research paths may seek further integration between these networks and building design, going beyond computer vision to explore different types of data that incorporate building performance criteria, such as thermal and visual comfort. For example, such studies may consider building energy simulation from input data such as weather conditions and building characteristics, or daylight optimization by training GANs on building orientation, window size, and shading design.
Another research path may be the customization of architectural design according to the location, culture, and neighborhood context.In other words, the model may be able to produce design solutions that consider the surroundings by integrating the new building stylistically.Other aspects may be included as well.For example, it may generate building facades that consider alignments and floor heights of the adjacent buildings.These features will be particularly helpful in sensitive situations involving heritage sites and historical architecture.In floor plan design, the generation process could consider multiple floors, structural alignments, and the relationship between vertical circulation and the functional program of the building.
For this, the CNN-GAN tools must assume multi-task functionalities, and multiscale datasets must be available for different categories of building design representations.Furthermore, another path may integrate both CNN and GAN in different categories since analysis, evaluation, and synthesis are requirements in the decision-making process.Lastly, as the number of experts and non-experts continues to grow in this field, it is essential to understand how these methods may be smoothly integrated into building design methodology without disrupting the creative process.

Fig. 2. Distribution of CNN and GAN-based methods in image analysis or synthesis procedures.

Fig. 6. Analysis and decision-making from the relationship between deep learning and remote sensing.

Fig. 7. Layout generation and furniture from the relationship between deep learning and 2D layout plan.

Fig. 8. 3D indoor scene generation from the relationship between deep learning and 3D indoor scene plan.

Fig. 9. Facade generation from the relationship between deep learning and 3D indoor scene plan.
Table A 1 (Appendix A) lists these datasets according to different task categories: footprint image, facade image, 3D indoor scene image, 3D point cloud outdoor scene, 3D point cloud indoor scene, and floor plan image. Table A 2 (Appendix A) presents the most common CNN and GAN methods found in the literature review. Several network architectures are used, such as ResNet-based methods in the building footprint category [56,57,190-195], GAN-based methods in building floor plans [80,81,83,84,86,167,169,196], AlexNet-based methods in building facades [38,197-200], and SegNet-based methods in 3D indoor scenes [98, …].

Table 1
Benchmark comparison of the main methods for different tasks and building categories.