Learning, Memory, and the Role of Neural Network Architecture

The performance of information processing systems, from artificial neural networks to natural neuronal ensembles, depends heavily on the underlying system architecture. In this study, we compare the performance of parallel and layered network architectures during sequential tasks that require both acquisition and retention of information, thereby identifying tradeoffs between learning and memory processes. During the task of supervised, sequential function approximation, networks produce and adapt representations of external information. Performance is evaluated by statistically analyzing the error in these representations while varying the initial network state, the structure of the external information, and the time given to learn the information. We link performance to complexity in network architecture by characterizing local error landscape curvature. We find that variations in error landscape structure give rise to tradeoffs in performance; these include the ability of the network to maximize accuracy versus minimize inaccuracy and produce specific versus generalizable representations of information. Parallel networks generate smooth error landscapes with deep, narrow minima, enabling them to find highly specific representations given sufficient time. While accurate, however, these representations are difficult to generalize. In contrast, layered networks generate rough error landscapes with a variety of local minima, allowing them to quickly find coarse representations. Although less accurate, these representations are easily adaptable. The presence of measurable performance tradeoffs in both layered and parallel networks has implications for understanding the behavior of a wide variety of natural and artificial learning systems.


Introduction
Learning, the assimilation of new information, and memory, the retention of old information, are competing processes; the first requires flexibility and the second stability in the presence of external stimuli. Varying structural complexity could uncover tradeoffs between flexibility and stability, particularly when comparing the functional performance of structurally distinct learning systems. We use neural networks as model learning systems to explore these tradeoffs in system architectures inspired by both biology and computer science, considering layered structures like those found in cortical lamina [1] and parallel structures such as those used for clustering [2], image processing [3], and forecasting [4]. We find inherent tradeoffs in network performance, most notably between acquisition versus retention of information and between the ability of the network to maximize success versus minimize failure during sequential learning and memory tasks. Identifying tradeoffs in performance that arise from complexity in architecture is crucial for understanding the relationship between structure and function in both natural and artificial learning systems.
Natural neuronal systems display a complex combination of serial and parallel [5] structural motifs which enable the performance of disparate functions [6][7][8][9]. For example, layered [1] and hierarchical [10] architectures theoretically important for sustained limited activity [11] have been consistently identified over a range of spatial scales in primate cortical systems [12]. Neurons themselves are organized into layers, or ''lamina,'' and both intra-laminar [13] and inter-laminar [14] connectivity differentially impact function. Similarly, information processing systems developed by technological innovation rather than natural evolution have structures designed to match their functionality. For example, the topological complexity of very-large-scale integrated circuits scales with the function to be performed [15]. Likewise, the internal structure of artificial neural networks can be carefully constructed [16] to enable these systems to learn a variety of complex relationships. While parallel, rather than serial, structures are appealing in artificial neural networks because of their efficiency and speed, variations in structure may provide additional benefits or drawbacks during the performance of sequential tasks.
The dependence of functional performance on structural architecture can be systematically examined within the framework of neural networks, where the complexity of both the network architecture and the external information can be precisely varied. In this study, we evaluate the representations of information produced by feedforward neural networks during supervised, sequential tasks that require both acquisition and retention of information. Our approach is quite different from studies in which large, dense networks are given an extended period of time to produce highly accurate representations of information (e.g. [17,18]). Instead, we investigate the links between structure and function by performing a statistical analysis of the error in the representations produced by small networks during short training sessions, thereby identifying mechanisms that underlie tradeoffs in performance. Our work therefore has important implications for understanding the behavior of larger, more complicated systems in which statistical studies of performance would be impossible.
In the remainder of the paper, we discuss the extent to which network architectures differ in their ability to both learn and retain information. We first describe the network model and architectures considered in this study. We then quantify the best, worst, and average performance achieved by each network during sequential tasks that vary in both their duration and complexity. We consider the adaptability of these networks to variable initial states, thereby probing the structure of functional error landscapes. Finally, we explore how landscape variations that arise from structural complexity lead to differences in performance.

Sequential Learning Approach
Our approach differs from traditional machine learning studies in that our goal is not to design the optimal network system for performing a specific task. Rather, we identify tradeoffs in network performance across a range of architectures that share a common algorithmic framework. In this context, the term ''architecture'' refers specifically to the structural organization of network connections and not, as is found in engineering studies, to the broader set of constraints governing the interactions of network components.
In evaluating network performance, we use techniques relevant to both artificial and biological systems. Artificial network systems often favor high accuracy and consistency during a single task, regardless of the time required to achieve such a solution. In biological systems, however, speed and generalizability are often more important than absolute accuracy when dynamically adapting to a variety of tasks. To probe features such as network accuracy, consistency, speed, and adaptability, we examine the representations of information produced by neural networks during competing learning and memory tasks.
We choose to study learning and memory within the biologically motivated framework of feedforward, backpropagation (FFBP) artificial neural networks that perform the task of supervised, one-dimensional function approximation. The training process, which consists of adjusting internal connection strengths to minimize the network error on a set of external data points, can be mapped to motion within a continuous error landscape. Within this context, ''learning'' refers to the ability of the network to successfully navigate this landscape and produce an accurate functional representation of a set of data points, while ''memory'' refers to the ability to store a representation of previously-learned information. Additional details of this framework are described in the following subsection.
To simultaneously study learning and memory processes, information must be presented to the network sequentially. ''Catastrophic forgetting,'' in which a network learns new information at the cost of forgetting old information, is a longstanding problem in sequential training of neural networks and has been addressed with several types of rehearsal methods [19][20][21]. Standard rehearsal involves training the network with both the original and new information during sequential training sessions. We use a more biologically motivated approach, the pseudorehearsal method [22], in which the network trains with a representation of the original information. Pseudorehearsal has been shown to prevent catastrophic forgetting in both feedforward and recurrent networks and does not require extensive storage of examples [22,23].
In training FFBP networks, local minima and plateaus within the error landscape can prevent the network from finding a global optimum [24,25]. While considered disadvantageous in machine learning studies, the existence of local minima may provide benefits during the training process, particularly in biological systems for which highly accurate global optima may be unnecessary or undesirable. Additionally, FFBP networks can suffer from overfitting, a problem in which the creation of highly specific representations of information hinders the ability of the network to generalize to new situations [26]. While also considered disadvantageous, failure to generalize has important biological consequences and has been linked to neurodevelopmental disorders such as autism [27]. Instead of attempting to eliminate these sensitivities, we seek to understand the architectural basis for differences in landscape features and examine their impact on representational capabilities such as specificity and generalizability.

Neural Network Model
The construction of our network model is consistent with standard FFBP neural network models [26]. We consider the five distinct architectures shown in Figure 1(a), all of which obey identical training rules. Each network has 12 hidden nodes arranged into h layers of ℓ nodes per layer. Nodes in adjacent layers are connected via variable, unidirectional weights. The ''fan'' and ''stacked'' networks are both fully connected and have the same total number of connections. The connectivities of the ''intermediate'' networks, which have slightly greater numbers of connections, were chosen in order to roughly maintain the same total number of adjustable parameters per network, N_p, noted in Figure 1(a).
Each node has a sigmoid transfer function s(x) = 1/(1 + e^(−x)) with a variable threshold θ. The output y of each node is a function of the weighted sum of its inputs x_p, given by y = s(Σ_p w_p x_p + θ), where w_p gives the weight of the pth input connection.

Author Summary
Information processing systems, such as natural biological networks and artificial computational networks, exhibit a strong interdependence between structural organization and functional performance. However, the extent to which variations in structure impact performance is not well understood, particularly in systems whose functionality must be simultaneously flexible and stable. By statistically analyzing the behavior of network systems during flexible learning and stable memory processes, we quantify the impact of structural variations on the ability of the network to learn, modify, and retain representations of information. Across a range of architectures drawn from both natural and artificial systems, we show that these networks face tradeoffs between the ability to learn and retain information, and the observed behavior varies depending on the initial network state and the time given to process information. Furthermore, we analyze the difficulty with which different network architectures produce accurate versus generalizable representations of information, thereby identifying the structural mechanisms that give rise to functional tradeoffs between learning and memory.

Representing the threshold as θ = w_0 x_0, where x_0 = 1 for all nodes, allows us to organize all adjustable parameters into a single, N_p-dimensional weight vector w. During training, each network is presented with a training pattern of N_d pairs of input x_d and target y_d values, denoted (x̃, ỹ). We restrict the input x space to the range (0,1), and the sigmoid transfer function restricts the output y space to the range (0,1). The set of variable weights w is iteratively updated via the Polak-Ribiere conjugate gradient descent method with an adaptive step size [28][29][30] in order to minimize the output error E(w). We use online training, for which E(w) is the sum of squared errors between the network output y(w) and target output y, calculated after all N_d points are presented to the network:

E(w) = Σ_{d=1}^{N_d} [y(x_d; w) − y_d]²

Task Implementation
Each network shown in Figure 1(a) is trained over two sequential sessions. In describing parameter choices for each training session, we use U(a,b) to denote a continuous uniform probability distribution over the interval (a,b). The steps of the sequential training process are shown schematically in Figure 1(b) and are described below:
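Before detailing the training sessions, the network model above can be made concrete. The following is a minimal sketch (our illustration, not the authors' code) of the forward pass of a fan network and the output error, assuming numpy; all function names here are our own:

```python
import numpy as np

def sigmoid(x):
    # Transfer function s(x) = 1 / (1 + e^(-x)); bounds every output to (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def fan_forward(x, w_in, b_hidden, w_out, b_out):
    # 'Fan' network: a single layer of parallel hidden nodes, each seeing the
    # scalar input directly; thresholds are folded in as bias weights (x_0 = 1).
    hidden = sigmoid(w_in * x + b_hidden)
    return sigmoid(np.dot(w_out, hidden) + b_out)

def output_error(y_pred, y_target):
    # E(w): sum of squared errors after all N_d points are presented.
    return float(np.sum((np.asarray(y_pred) - np.asarray(y_target)) ** 2))
```

A stacked network would instead compose sigmoids serially, one node feeding the next, which biases its outputs toward linear or step-like functions.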

First Training Session
Step 1.1 -Initialize. Network weights are randomly chosen from U(−5,5). We refer to this state of the network as the ''randomly initialized state''.
Step 1.2 -Train. The network trains on six ''original'' points (x̃^(o), ỹ^(o)) whose values remain fixed for all simulations. The original points are chosen to be evenly spaced in x (x̃^(o) = (0.1, 0.26, 0.42, 0.58, 0.74, 0.9)) and random in y (ỹ^(o) = (0.55, 0.92, 0.53, 0.78, 0.33, 0.49)). Similar behavior is observed for different choices, including permutations, of the specific values used here (see Figure S3). The original points represent the information we wish the network to remember during subsequent training. The network is given 10^5 iterations to generate a functional representation f_o of (x̃^(o), ỹ^(o)) (see second panel of Figure 1(b) and Figures 2(a) and 2(b)), and training ceases if the error plateaus (ΔE < 10^−5 for 1000 iterations). We refer to this situation as allowing ''unlimited'' training time because in practice, the network finds a solution before reaching the maximum number of iterations.
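The plateau criterion can be implemented in several ways; one minimal reading of it (our own sketch, not the authors' implementation) stops training once every per-iteration change in error over the last 1000 iterations falls below 10^−5:

```python
def error_plateaued(error_history, window=1000, tol=1e-5):
    # True once the error has changed by less than tol between every pair of
    # consecutive iterations across the last `window` iterations.
    if len(error_history) < window + 1:
        return False
    recent = error_history[-(window + 1):]
    return all(abs(b - a) < tol for a, b in zip(recent, recent[1:]))
```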

Second Training Session
Step 2.1 -Sample. The set of weights that produce f_o forms the starting point for the second training session. We refer to this state of the network as the ''sampled state'' in order to distinguish it from the randomly initialized state chosen prior to the first training session. In this state, the network randomly samples a pool of 1000 buffer points (x^(b), y^(b)) from f_o (see third panel of Figure 1(b)). This is accomplished by (i) randomly choosing input x^(b) values from U(0,1) and (ii) computing the corresponding output y^(b) = f_o(x^(b)) values using the set of network weights that produce f_o. Subsets of buffer points, which lie along the functional representation f_o of the original points, are used in the following step to simulate memory rehearsal.
Step 2.2 -Re-train. The network re-trains on six new points (x̃^(n), ỹ^(n)) and six buffer points (x̃^(b), ỹ^(b)) (see fourth panel of Figure 1(b)). New points are chosen by randomly selecting six independent x^(n) and y^(n) values from U(0,1). Buffer points are chosen by randomly selecting, with uniform probability, six (x^(b), y^(b)) pairs from the pool of buffer points generated in Step 2.1.
Notation. We use the super- and subscripts ''o'' and ''n'' to refer respectively to the ''original'' and ''new'' points, (x̃^(o), ỹ^(o)) and (x̃^(n), ỹ^(n)), and functional approximations, f_o and f_n. Each function f_o produces a single error value E_o^(o). Each set of functions {f_n} produces two sets of error values, {E_n^(o)} and {E_n^(n)}, measured with respect to (x̃^(o), ỹ^(o)) and (x̃^(n), ỹ^(n)), respectively.
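Steps 2.1 and 2.2 can be sketched as follows. This is a minimal illustration assuming numpy; the function names, and the callable f_o standing in for the trained network, are our own:

```python
import numpy as np

def sample_buffer_pool(f_o, n_pool=1000, rng=None):
    # Step 2.1: draw a pool of buffer points from the learned representation
    # f_o, so rehearsal needs no stored copies of the original points.
    rng = rng if rng is not None else np.random.default_rng()
    x_b = rng.uniform(0.0, 1.0, size=n_pool)   # inputs drawn from U(0, 1)
    y_b = f_o(x_b)                             # outputs read off f_o
    return x_b, y_b

def build_retraining_set(x_b, y_b, n_new=6, n_buffer=6, rng=None):
    # Step 2.2: six random new points plus six buffer points drawn uniformly
    # from the pool, combined into a single re-training pattern.
    rng = rng if rng is not None else np.random.default_rng()
    x_new = rng.uniform(0.0, 1.0, size=n_new)
    y_new = rng.uniform(0.0, 1.0, size=n_new)
    idx = rng.choice(len(x_b), size=n_buffer, replace=False)
    return np.concatenate([x_new, x_b[idx]]), np.concatenate([y_new, y_b[idx]])
```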

Tradeoffs in Learning and Memory Tasks
We train the five networks shown in Figure 1(a), first considering the differences between the boundary fan (parallel) and stacked (layered) networks. Given the large number of adjustable parameters N_p relative to the small number of training points N_d, we expect all five networks to fit the points with high accuracy. Instead, the networks show significant differences in performance both within individual training sessions and measured statistically over many sessions. These results, discussed in detail below, show the same qualitative features for larger networks (Figures S1 and S2) and for different sets of original points (Figures S3 and S4).
Fan and stacked architectures. Examples of the solutions f_o and {f_n} produced by the fan and stacked networks are shown in Figures 2(a) and 2(b). Each set {f_n} is characterized by errors {E_n^(o)} and {E_n^(n)}, which measure the ability of the network to retain and learn information, respectively. The cumulative distribution functions (CDFs) of these errors are shown in Figures 2(c) and 2(d), where the CDF gives the probability that the network produces an error greater than E for any value of E.
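Note that this is the complementary convention for a CDF. As a sketch (ours, not the authors'), the curve plotted for a finite sample of errors would be:

```python
import numpy as np

def error_cdf(errors, thresholds):
    # P(error > E) for each threshold E, estimated from a finite sample of
    # errors -- the convention used for the curves in Figures 2(c) and 2(d).
    errors = np.asarray(errors, dtype=float)
    return np.array([np.mean(errors > e) for e in thresholds])
```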
The fan and stacked networks produce qualitatively different types of solutions f_o and {f_n}. While the specific functional form of f_o depends on the randomly initialized network state (see the following section), the f_o solutions shown here have errors that are representative of the average network performance over a range of randomly initialized states. The stacked solution f_o averages over the variation in the original points (Figure 2(b)). In contrast, the fan solution f_o accurately fits all six original points with a high order polynomial (Figure 2(a)). In both networks, subsequent solutions {f_n} retain the features of f_o. Because the sigmoid transfer function (see Methods) is identical for all nodes, the differences between the fan and stacked solutions arise solely from variations in network architecture. As the sigmoid function maps an infinite input space to a finite output space bounded between 0 and 1, successive applications of sigmoids produced by serial (stacked) computations tend to result in linear or step function outputs, while a sum of sigmoids produced by parallel (fan) computations tends to result in highly variable outputs.
The interference between the two training sessions results in the deviation of {f_n} from f_o, which tends to increase {E_n^(o)} relative to E_o^(o). We find that in its best case, the stacked network shows no deviation of {f_n} from f_o. In contrast, the fan network shows a minimum deviation of 130% and a higher deviation on average compared to the stacked network. This deviation measures the ability of the network to retain the original representation f_o, regardless of how erroneous that representation may be. Although the stacked network generates a higher error representation of the original points during the first training session, it can more accurately retain this representation when presented with new points.
The minimum and maximum values of {E_n^(o)} measure the best success and worst failure of the network in retaining old information while avoiding interference from new information. While the bounded output space limits the maximum error, linear solutions tend to further restrict these bounds. As a result, the stacked network has a lower maximum error at the cost of having a higher minimum error, as shown in Figure 2(c). In contrast, the fan network can retain the original information more accurately by achieving a lower minimum error, but it can also fail more catastrophically with a higher maximum error.
Similar features are observed in the distributions of {E_n^(n)} shown in Figure 2(d). The minimum and maximum values of {E_n^(n)} measure the best success and worst failure of the network in learning new information while attempting to retain old information. While both networks achieve low minimum error at their best, the fan network produces a much larger maximum error than does the stacked network. In addition to achieving more extreme best and worst cases, the fan network also has higher average error values ⟨{E_n^(o)}⟩ and ⟨{E_n^(n)}⟩.
Intermediate architectures: Tradeoffs in learning and memory. We extend this analysis to the intermediate architectures shown in Figure 1(a), organizing the results based on the degree of network serialization h/ℓ (a purely geometrical factor). Tradeoffs in performance are observed across the range of architectures. For example, in Figure 3(a), we see a tradeoff between the minimum and maximum values of {E_n^(o)}. As h/ℓ increases, the network does not fail as badly in its worst case but also does not succeed as well in its best case. Figure 3(b) shows that increasing h/ℓ decreases the maximum error in both {E_n^(o)} and {E_n^(n)}, indicating that the stacked architecture is best suited for minimizing failure in both learning and memory. Figure 3(c) shows that increasing h/ℓ decreases both the average solution variance ⟨{(Δf_n)²}⟩ and the average errors ⟨{E_n^(n)}⟩ and ⟨{E_n^(o)}⟩. While we might naively expect that high solution variance (fan) would indicate a flexible network able to accurately fit nonlinear data, we instead find that high variance leads to high average error. In contrast, low variance, linear solutions (stacked) tend to minimize average error.
Furthermore, we find a tradeoff in performance between the first and second sessions, shown in Figure 3(d). Increasing h/ℓ worsens performance during the first session by increasing E_o^(o) but improves average performance during the second session by decreasing both ⟨{E_n^(n)}⟩ and ⟨{E_n^(o)}⟩, suggesting a tradeoff between the accuracy and generalizability of network solutions. The fan network, which produces a very accurate, specific representation of the original points, shows a much higher average error when it tries to generalize this representation. In contrast, the coarser representation produced by the stacked network is better able to incorporate new information.

Adaptation to Variable Learning Conditions
Both natural and artificial systems can be found in a variety of states when presented with new information. The success in learning this information may depend both on the initial state of the system and on the learning conditions. We explore these possible dependencies by varying both the randomly initialized network state and the training conditions.
Variable initialized states. Because the conjugate gradient descent algorithm (see Methods) is deterministic, the randomly initialized state determines f_o, which then influences subsequent solutions {f_n}.
To study the influence of random initialization on network performance, we train each network on the original points from 500 sets of randomly chosen weights, producing the distributions of first-session errors {E_o^(o)} shown in Figure 4(a). If we inspect the solutions produced by each network, we find that low, medium, and high error solutions correspond respectively to fitting all, some, or none of the points with a high order polynomial and fitting the remaining points with a horizontal line. To emphasize differences in network performance, the solutions f_o used to generate the results shown in Figures 2 and 3 were chosen because their error was representative of the distribution averages shown in Figure 4(a).
Temporal constraints. In natural systems, the time allowed to gather information from the environment is often limited, and a highly specific representation of information may not be desirable or even attainable. To investigate the effect of temporal constraints, we train the five networks on the original points with 5000 sets of randomly chosen weights, now terminating training after 100 iterations. The increased number of randomly initialized states allows us to better resolve the edges of the error distributions shown in Figure 4(b).
Once training time is limited, all distributions shift toward higher error values, again revealing a tradeoff between speed and accuracy. As before, SfE

Dependence on Error Landscape Structure
Given unlimited training time, the distributions in Figure 4(a) mark the error of local minima found within the error landscape of each network. Each minimum can be characterized by the degree of local landscape curvature, where directions of high curvature specify combinations of weight adjustments that produce large changes in error. We adopt the terminology used in previous studies and refer to directions with high and low curvature as stiff and sloppy, respectively [31,32]. Stiff and sloppy directions are found by diagonalizing the error Hessian H_pq = ∂²E/∂w_p∂w_q evaluated at the set of weights that produces the local error minimum. For computational efficiency, we use the approximate Levenberg-Marquardt (LM) Hessian [33], defined as:

H_pq^LM = Σ_d (∂r_d^(o)/∂w_p)(∂r_d^(o)/∂w_q),

where r_d^(o) is the residual of the dth original point. The LM Hessian is a good approximation to H when the error of local minima, and thus the residual r_d^(o), is small and the additional Hessian term r_d^(o) ∂²r_d^(o)/∂w_p∂w_q can be neglected. For a given model and data set, the LM Hessian agrees well with the stiffest eigenvectors of H and is equivalent to H when the model perfectly fits the data. In addition, it has a known number of exactly zero eigenvalues equal to the difference between the number of model parameters N_p and the number of data points N_d [31,32].
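In matrix form the LM approximation is J^T J, with J the Jacobian of the residuals with respect to the weights. A sketch of the computation and of the zero-eigenvalue count (our illustration; numpy assumed):

```python
import numpy as np

def lm_hessian(jacobian):
    # Levenberg-Marquardt approximation to the error Hessian:
    # H_pq ~ sum_d (dr_d/dw_p)(dr_d/dw_q) = (J^T J)_pq, dropping the term
    # r_d * d^2 r_d / dw_p dw_q, which is small near a low-error minimum.
    J = np.asarray(jacobian)          # shape (N_d, N_p)
    return J.T @ J                    # shape (N_p, N_p), symmetric PSD

def num_zero_eigenvalues(H, tol=1e-10):
    # With N_p > N_d, the LM Hessian has rank at most N_d, so exactly
    # N_p - N_d eigenvalues are zero for a generic Jacobian.
    eigvals = np.linalg.eigvalsh(H)
    return int(np.sum(np.abs(eigvals) < tol))
```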
We diagonalize the LM Hessian about each of the 500 minima with the error values {E_o^(o)} shown in Figure 4(a). Each error minimum produces a set of N_p eigenvalues λ and normalized eigenvectors j, which give the degrees and directions of stiffness in weight space.
As an illustrative example of landscape features observed along these relevant directions, Figures 5(a) and 5(b) show the projection of the error landscape onto the two stiffest eigenvector directions j^(1) and j^(2), centered on zero error minima produced by the fan and stacked networks, respectively.
The fan landscape shows a single deep basin surrounded by smoothly varying peaks. In contrast, the stacked landscape is rugged, showing a deep valley with several minima separated by small barriers. While these minima appear to be distinct, they may be connected by higher dimensional pathways that cannot be seen in this reduced space.
Participation of network connections. The ability of a network to move along relevant eigenvector directions may depend on the number of weights that must be significantly adjusted, or equivalently the localization of eigenvector components. To quantify the degree of localization of the pth eigenvector j^(p), we calculate its participation ratio r^(p) = Σ_q (j_q^(p))^4 [34], where individual eigenvector components j_q^(p) correspond to specific weights w_q in the network. r^(p) is a dimensionless quantity that ranges between a completely delocalized minimum of 1/N_p, for which all components have equal weight 1/√N_p, and a completely localized maximum of 1, for which a single component carries unit weight.
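A direct transcription of this quantity (our sketch, assuming numpy):

```python
import numpy as np

def participation_ratio(eigvec):
    # r = sum_q (j_q)^4 for a normalized eigenvector: equals 1/N_p when all
    # N_p components share weight equally (delocalized), and 1 when a single
    # component carries unit weight (localized).
    j = np.asarray(eigvec, dtype=float)
    j = j / np.linalg.norm(j)   # enforce normalization
    return float(np.sum(j ** 4))
```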
For the set of minima with error values {E_o^(o)}, we quantify {r^(1)} and {λ^(1)} of the stiffest eigenvectors {j^(1)}, as combinations of weight changes specified by these eigenvector directions produce the largest changes in error. The covariances C_{E,r} = Cov(E_o^(o), r^(1)) and C_{E,λ} = Cov(E_o^(o), λ^(1)) in these quantities are shown by the ellipses centered about their average values in Figures 6(a) and 6(b), respectively. Figure 6 highlights the variability in basin structure within and between the networks. As h/ℓ increases, both the average and variance in {E_o^(o)}, {r^(1)}, and {λ^(1)} increase. Higher variance leads to lower confidence in predicting the success of the network, but it also suggests that the network has more options when exploring its error landscape.
The orientations of the covariance ellipses in Figures 6(a) and 6(b) provide information regarding the relationships between E_o^(o), r^(1), and λ^(1). The semi-major axis of each C_{E,r} ellipse in Figure 6(a) has a positive slope, indicating that error and participation ratio are positively correlated. While this might suggest that {λ^(1)} and {r^(1)} are also positively correlated, Figure 6(b) shows the opposite trend: higher error basins (larger E_o^(o)) tend to be shallower (smaller λ^(1)) and require the adjustment of fewer weights (larger r^(1)).
Landscape characteristics and successful learning. Variations in landscape structure provide insight into the way in which each network searches for solutions. In particular, fan solutions are characterized by low error and participation ratio, indicating that the fan network must adjust nearly all of its weights in order to navigate zero error basins. In contrast, stacked solutions span a range of error values. The corresponding basins are characterized by a variety of eigenvalues and participation ratios, indicating that the stacked network can navigate many types of basins by adjusting variable numbers of weights. Larger participation ratios correspond to higher error and lower eigenvalues, suggesting that the stacked network can navigate shallow, high error basins by adjusting only a few of its connections. Narrow, low error basins, found by both the fan and stacked networks, require fine tuning of a larger number of connections.
In combination, landscape characteristics help explain the results shown in Figures 3 and 4. Given unlimited training time, landscape variability is disadvantageous and can prevent a network from finding a low error minimum. Once time is limited, landscape variability can be advantageous in preventing failure by providing the network with high error, shallow basins that can be navigated with the adjustment of relatively few connections. If limited training time is coupled with extremely noisy information, landscapes with high error basins can be advantageous by decreasing average error relative to landscapes with no easily reachable basins. Because our sequential sessions combined both limited and unlimited training time and both clean and noisy data, we see an additional tradeoff between the two sessions. Unlimited training time and well constrained data favor the fan over the stacked network in minimizing average error, while limited time and noisy data favor the stacked network over the fan.

Discussion
In this study, we investigated the tradeoffs in learning and memory performance that arise from structural complexity. Importantly, none of the architectures considered here simultaneously mastered both learning and memory tasks, which suggests that systems whose function depends on such simultaneous success might require architectures that are complex combinations of both parallel and serial structures. Indeed, this inherent sensitivity of function to underlying architecture may help to explain the high degree of variability evident in architectural motifs of large-scale biological and technical systems. For instance, in natural neuronal networks, cortical connection patterns display a variety of architectural complexities at varying spatial scales. Examples of fan architectures are found in hub-and-spoke motifs, which form an important part of the small-world architecture [35][36][37], as well as in the decomposition of cortical network architectures into subnetworks or modules which may simultaneously process differential information [10,[38][39][40][41]. Moreover, stacked architectures are evident within cortical lamina [1], within the hierarchical organization displayed in the sequential ordering of the visual system [42], and within the nested modularity of large-scale cortical connectivity [10,41,43]. Similarly, artificial neural networks display complex combinations of fan and stacked motifs including modularity [44], hierarchy [45], and small-worldness [46,47].

Parallel versus Layered Architectures
Given the wealth of structural motifs present in real world systems, it is of interest to first isolate the tradeoffs in performance associated with small parallel and layered network structures which together form the complex architectural landscape of larger systems and thereby constrain their overall performance. Here we found that the deep, narrow basins within the error landscape enabled the fan network to produce very accurate solutions. However, the difficulty of simultaneously adjusting many network connections in order to escape deep basins may have hindered the ability of the fan network to adapt, a result that helps explain the susceptibility of parallel networks to the problems of overfitting and failure to generalize [26]. In contrast, higher variability in the width and depth of local minima enabled the stacked network to quickly find coarse but generalizable solutions through the adjustment of a smaller fraction of weights. In combination, these results support the hypothesis that the number and width of local landscape minima may increase with increasing number of hidden layers [4], and we suggest that this variability helps explain why layered networks may require fewer computational units and may better generalize than parallel networks [49,50]. However, the impact of structural variations on functional tradeoffs, for example between specificity and generalizability, extends beyond artificial network studies and is crucial for understanding the interaction of learning processes in large scale models of the brain [51]. While parallel architectures are often preferred in artificial network studies due to their consistency and accuracy [48,50], our results highlight the advantages of layered architectures when performance criteria favor generalizability and minimization of failure.

Intermediate Architectures
Building on the intuition gained from the two benchmark extremes, fan and stacked, we further assessed the characteristics of intermediate networks, which can be used to more directly probe the expected behavior of structurally complex composite systems. In particular, our intermediate structures were composed of several adjacent stacked networks and therefore shared principal features of both parallel and layered systems. Additionally, these networks had slightly larger numbers of connections than the fan and stacked networks.
Due to these structural differences, the depth of local minima within the intermediate landscapes displayed more variation than fan minima but more continuity than stacked minima. As landscape variability was linked to improved generalization capabilities, a continuous range of basin depths may have enabled the more successful balance between flexible learning and stable memory observed in the intermediate networks. This performance supports the hypothesis that short path lengths [52] and low connection densities may facilitate simultaneous performance of information segregation (memory retention) and integration (generalization) within natural neuronal systems [53]. These competing processes are also maintained in natural neuronal systems and neural circuit models through homeostatic plasticity mechanisms such as synaptic scaling [54,55] and redistribution [56,57], in addition to the rehearsal methods employed here [19–23]. Even in the absence of such homeostatic plasticity mechanisms, we found that the architectural combination of parallel and layered connectivity helped foster a balance between learning and memory.
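The graph measure invoked here, characteristic path length, can be computed directly by breadth-first search. The sketch below uses two illustrative four-node graphs, a chain (loosely layered-like) and a hub (loosely fan-like); both graphs are hypothetical and not the networks analyzed in the study:

```python
from collections import deque

def avg_path_length(adj):
    """Mean shortest-path length over all ordered reachable node pairs,
    computed by breadth-first search from every node."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != src:
                total += d
                pairs += 1
    return total / pairs

chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}      # serial, layered-like
hub = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}        # hub-and-spoke, fan-like
# The hub's central node shortens paths relative to the chain.
```

Here the hub graph yields a shorter characteristic path length (1.5) than the chain (5/3), illustrating how motif choice alone shifts this measure even at fixed node count.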

Variable Learning Conditions and Network Efficiency
We extended our analysis from the case of unlimited training time, which revealed information about error landscape structure, to the biologically motivated case of limited training time. Comparison of these two cases revealed a tradeoff in performance between training speed and solution accuracy. In the absence of temporal constraints, the production of highly accurate representations required longer training times. Conversely, temporal constraints led to larger solution errors. This tradeoff between speed and accuracy has been observed in cortical networks, where emphasis on performance speed during perceptual learning tasks increased the baseline activity but decreased the transient task-related activity of neurons within the decision-making regions of the human brain [58,59]. Here we found that network architecture played a significant role in the manifestation of this tradeoff, and the presence of additional hidden layers helped minimize network susceptibility to changes in training time. In particular, the fan network demonstrated the greatest change in performance under temporal constraints, showing a decrease in consistency coupled with occasional catastrophic error values. In contrast, the intermediate and stacked networks improved consistency and minimized inaccuracy once training time was limited.
Upon closer inspection, we found that the intermediate networks produced solutions more quickly given unlimited time, and with greater potential for accuracy when time was limited, than the fan and stacked extremes. The presence of additional connections may have influenced the number of iterations required to find a solution, or similarly the minimum error found with a fixed number of iterations. While the graph measure of path length is known to influence network efficiency [52], these results imply that the number of network connections may additionally enable the network to quickly find an accurate solution.
In addition to static variations in connectivity, dynamic structural changes such as synapse formation [60] can facilitate learning and memory processes. The converse case of network degradation, or disruptions to structural connectivity, is also known to have widespread consequences for functional properties of the brain [61–63]. A more detailed study of the relationship between connection number and robustness could provide additional insight into the effects of synapse formation and degradation on functional performance. Our analysis of error landscape features revealed that different architectures showed variable localization properties in the eigenvectors associated with local error minima, and we therefore expect robustness to depend on both the architecture and the location of growth or damage within the network.
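Eigenvector localization of the kind referred to above can be quantified with a participation ratio, which is near 1 when a Hessian eigenvector concentrates on a single weight and near the dimension when it spreads evenly. The sketch below applies this to a hypothetical 3×3 Hessian (one stiff, isolated weight plus a coupled pair), not one derived from the study's networks:

```python
import numpy as np

def participation_ratio(v):
    """Participation ratio of an eigenvector: ~1 when the vector is
    localized on one component, ~len(v) when spread evenly."""
    p = v ** 2 / np.sum(v ** 2)
    return 1.0 / np.sum(p ** 2)

# Hypothetical Hessian at a minimum: weight 0 is stiff and isolated,
# weights 1 and 2 are softly coupled.
H = np.array([[50.0, 0.0, 0.0],
              [0.0, 1.0, 0.9],
              [0.0, 0.9, 1.0]])
_, vecs = np.linalg.eigh(H)
prs = [participation_ratio(vecs[:, k]) for k in range(3)]
# The two soft modes spread over the coupled pair (ratio ~2);
# the stiff mode is localized on a single weight (ratio ~1).
```

Under this measure, damage to the single weight carrying a localized stiff mode would distort the basin far more than damage elsewhere, consistent with the expectation that robustness depends on where growth or damage occurs.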

Methodological Considerations
We found that parallel networks suffered from the creation of excessively detailed representations of information, an ''overfitting'' problem that is often addressed through the use of cross-validation [64] and weight regularization [65] techniques. As one goal of this study was to uncover the structural basis for differences in representational capabilities, it was crucial to understand network behavior in the absence of task-specific cross-validation schemes. Additionally, as the number of parameters was roughly constant across all network structures (and identical for the fan and stacked networks), we were able to draw comparisons across network architectures in the absence of additional weight regularization constraints.
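Parameter parity of this kind between a wide single-hidden-layer (fan-like) network and a multi-layer (stacked-like) network is straightforward to arrange. The layer sizes below are illustrative, not those used in the study:

```python
import numpy as np

def init_layers(sizes, rng):
    """Weight matrices for a fully connected feedforward network with
    the given layer sizes, e.g. [1, 8, 1] = input, hidden, output."""
    return [rng.standard_normal((sizes[i + 1], sizes[i]))
            for i in range(len(sizes) - 1)]

def n_params(weights):
    return sum(w.size for w in weights)

rng = np.random.default_rng(0)
fan = init_layers([1, 8, 1], rng)               # one wide hidden layer
stacked = init_layers([1, 2, 2, 2, 2, 1], rng)  # same 8 units, four layers
print(n_params(fan), n_params(stacked))         # prints: 16 16
```

With parameter counts matched in this way, any performance difference can be attributed to how the parameters are arranged rather than to how many there are.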
While parallel network models have commonly been used in machine learning studies, multi-layer ''deep'' networks have recently gained interest due to their potential ability to compactly represent (using fewer computational units and parameters) highly variable functions [49,50]. The ''deep belief'' framework has been successful for training large, multi-layered networks, and training methods often couple unsupervised, layer-wise (greedy) training with supervised fine-tuning [66]. Recent studies of deep belief networks found that classification performance improved with the addition of layers [48]. In addition, it was suggested that a reduction in the number of hidden layers would require an exponential increase in the number of hidden units in order to achieve similar network performance [50]. These results emphasize the capabilities of layered networks and provide an additional framework in which to explore structure-function tradeoffs.
Although biologically motivated, the feedforward backpropagation (FFBP) framework includes several simplifying assumptions that could be modified to include additional, realistic complexity. First, we assumed that only the connection weights, analogous to synaptic strengths, were variable. Real neurons also exhibit changes in intrinsic dynamics [67] that interact with network architecture to constrain functionality in the brain [68]. Accounting for such relationships could be particularly relevant, for example, in the study of neuron response profiles within different cortical layers [13]. Second, we assumed that signals passed between nodes had no temporal structure, analogous to representing steady-state neuron firing rates. Temporally varying signals could be included to study the dependence of dynamic properties, such as synchronization [68–70] and signal propagation [71], on structural organization [72]. Lastly, we assumed feedforward connectivity. The addition of recurrent connections could be used to study the relationship between recurrent structure and oscillatory functions such as cortical sleep rhythms [73] and oscillation couplings relevant for associative learning and memory [74]. In each of these directions, we anticipate that underlying structural complexity will continue to impact performance through functional tradeoffs.
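For reference, a minimal single-hidden-layer update embodying these three assumptions (weight-only plasticity, static rate-like signals, strictly feedforward connectivity) might look like the sketch below; the shapes, learning rate, and target are illustrative rather than the study's settings:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ffbp_step(W1, W2, x, t, eta=0.5):
    """One feedforward/backpropagation update on squared error.
    Only the connection weights change (assumption 1); activations are
    static rates with no temporal structure (assumption 2); connectivity
    is strictly feedforward (assumption 3)."""
    h = sigmoid(W1 @ x)                         # hidden-layer rates
    y = sigmoid(W2 @ h)                         # output rate
    delta_y = (y - t) * y * (1.0 - y)           # output error signal
    delta_h = (W2.T @ delta_y) * h * (1.0 - h)  # backpropagated signal
    return W1 - eta * np.outer(delta_h, x), W2 - eta * np.outer(delta_y, h)

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((3, 2)), rng.standard_normal((1, 3))
x, t = np.array([0.5, -0.2]), np.array([0.8])
for _ in range(500):
    W1, W2 = ffbp_step(W1, W2, x, t)
```

Each of the proposed extensions replaces one line of this scheme: intrinsic dynamics would make `sigmoid` itself adaptive, temporal structure would make `x` and `h` time series, and recurrence would add connections from `h` or `y` back into earlier layers.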

Conclusion
In summary, different network architectures produce error landscapes with distinguishable characteristics, such as the height and width of local minima, which in turn determine performance features such as speed, accuracy, and adaptability. Inherent tradeoffs, observed across a range of architectures, arise as a consequence of the underlying error landscape structure. The presence of local landscape minima enables greater speed, more generalizable solutions, and minimization of catastrophic failure. However, these successes come at the cost of decreased accuracy. Understanding how both the landscape characteristics and the resulting performance features vary across a range of architectures is crucial for both understanding and guiding the design of more complex biological and technical systems.