Submitted to: New J. Phys.

Hypertext was originally conceived essentially as a personal, single-user tool, intended for the management of the information needed by a single researcher for his/her own purposes. As a consequence, the user of a hypertext was assumed to have complete awareness of the content and the organisation of the information. This implies that all the links, whether they represent a "structural" organisation or merely some association among different information nodes, should be quite evident and natural to the user. In addition, we must stress that the richness of hypertext resides in the representation, by means of links, of the associative mechanism that was its inspiring principle; the association among the nodes is therefore probably the most relevant aspect of a hypertext. However, user disorientation and cognitive overhead are well-known problems arising in hypertext navigation: in our understanding, they can be caused by poor design of the hypertext. Too often designers consider hypertext design to be simply a "creative" task, and implement the hypertext/hypermedia (HT/HM) just following their "inspiration", without careful thought about the organisation of the information and the modelling of the associations. This leads to inconsistent design, in spite of some brilliant solutions taken step by step, where mere technical effects (sound, colour, animation, etc.) mask obscure design choices, if any. These problems can become particularly relevant when the designer realises that a large number of links can enhance the associative capabilities of the hypertext. The obvious result is that the designer's knowledge is hard-coded in the hypertext, forcing the user to follow paths that can appear irrelevant, or even unreasonable. On the other hand, too few links result in a "flat" and unstimulating hypertext.
These problems are emphasised in the WWW, where information is distributed across the world and organised by different persons, for different purposes. As a consequence, the modelling of the knowledge and the structuring of information nodes and related links become a major design issue in the implementation of WWW hypertext applications. Database design, information retrieval, artificial intelligence and cognitive psychology are all "mature" disciplines that can contribute to the definition of a design methodology for implementing effective WWW applications. Their contribution can help in the information structuring process, free-text indexing techniques, the implementation of connections among data items, and the design of effective user interfaces. Rather, the designer should concentrate on avoiding obliged connections among the information nodes. It appears much more helpful to give the user the possibility of


-Introduction
Software engineering is evolving towards the implementation of tools that can improve the maintainability of existing software applications, possibly through their reconfiguration.
An important aspect to consider in maintenance work is the reuse of the products of software development, and of the code in particular.
Existing code can therefore be recovered by setting up a re-engineering process ([Chikofsky90]) or, better, a reuse re-engineering process, which aims to re-design a system reusing knowledge and design elements taken from the previous products.

-Related work
In spite of the novelty of the field, relevant work has already been done by several authors, and we can distinguish two main approaches in the area of Object Oriented Re-engineering.
In their paper, Jacobson and Lindström ([Jacobson91]) suggest that an object oriented development method can be used to gradually modernise an old system via a three-step process. The first step consists of a reverse engineering phase, which identifies how the components of the system relate to each other and produces a more abstract description of the system. In the second step, reasoning about changes in functionality is carried out at this more abstract level. Finally, in the third step, a forward engineering phase takes place, redesigning the system from the abstract representation to the concrete one.
In the whole process, the informal documentation (manuals, requirements specifications, etc.) is taken into account in order to reconstruct the knowledge about the system functionalities.
In this approach, it is assumed that it will be possible to migrate from a top-down design environment to an object-oriented one. This implies a hybrid Software Life Cycle model. As far as the approach proposed by Jacobson and Lindström is concerned, we notice that, even if in principle the informal documentation may be of great importance, in practice it may be lacking, incomplete, inconsistent or misleading. Therefore, information obtained from the informal documentation should be carefully examined and validated.
In their paper, Liu and Wilde ([Liu90]) themselves raise the question of whether their approach may produce "too big" objects. This is due to the intrinsic characteristics of the proposed method: procedures and data structures are considered strongly connected if they are used together, and in this case the set of data structures is identified as an object and the procedures as methods of this object.
Because of this, we completely agree with the authors that "a further stage of refinement will be necessary in which human intervention or heuristically guided search procedures improve the candidate objects".

-Problems
The basic concepts of the object oriented approach are objects, methods, encapsulation, inheritance. On the other hand, in conventional programs, all the knowledge that is contained in the definition of an object and its methods, namely the static and dynamic constraints and the procedural knowledge, is dispersed among the software modules.
Therefore, re-engineering towards an object-oriented environment raises several problems, as the identification of the objects and their methods requires a semantic analysis of the code, and the identification of similarities, exceptions, rules, etc. It is obvious that some human intervention is required; however, the application of some general rules can provide a reasonable understanding of the existing software, and reliable suggestions about the identification of the right components.
In this section, we will discuss some of these problems and sketch some possible solutions.
First of all, we can face many difficulties when we intend to associate the concept of object to any component of a traditional system, developed according to the top-down design style.
In addition, we can consider two possible alternatives:
• all the data, or part of them, are to be considered as objects;
• a set of sub-functionalities acting on several data objects may be considered as a single "software object".
A third problem arises from the fact that traditional systems have been designed on the basis of a functional decomposition. This makes it difficult to identify the functionalities that operate on shared data, while it is very frequent to encounter modules where several subfunctionalities are intermixed or fragmented.
Finally, it must be stressed that an object oriented design style is something more than objects and methods: fundamental properties like encapsulation, information hiding, inheritance, polymorphism and dynamic binding constitute the actual innovative aspect of the object oriented methodology, and can be implemented by the identification of class hierarchies. In the following, we will briefly discuss some potential ways these problems can be faced and solved.

-Identification of the objects
As far as the data structures are concerned, we can distinguish between the global and the local data structures.
As a first step, it seems reasonable to identify the global data structures as objects, while local data structures can be taken as data local to the methods. However, local data structures can be shared by different methods, and this fact influences the program slicing phase.

-Identification of fragments
Object methods must be derived by the analysis of the functionalities embedded in the procedures that manipulate the data structures previously identified as "objects".
As was pointed out when discussing the [Liu90] proposal, it is necessary to adopt a "fine granularity" method to identify the modules that implement the various subfunctionalities and to separate, as far as possible, the corresponding fragments. Therefore, the identification of the methods can be accomplished in two steps:
• slicing of the functionalities of a single module;
• clustering of the functionalities acting on the same object.
After the completion of these two steps, we can obtain a detailed map of the different "code chunks" acting on the various global data structures.
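As a purely illustrative sketch of this clustering step (the fragment and variable names below are invented for the example, not taken from any real system), grouping code chunks around the global data they manipulate could look as follows:

```python
from collections import defaultdict

def cluster_by_globals(fragments):
    """fragments: list of (fragment_id, set of globals it touches).

    Groups each fragment under every global data structure it
    manipulates: each global becomes a candidate object, and the
    fragments grouped under it become its candidate methods.
    """
    candidates = defaultdict(list)
    for frag_id, globals_touched in fragments:
        for g in sorted(globals_touched):
            candidates[g].append(frag_id)
    return dict(candidates)

# Invented example: two fragments act on "symbol_table", one on "error_buffer".
fragments = [
    ("insert_item", {"symbol_table"}),
    ("lookup_item", {"symbol_table"}),
    ("log_error",   {"error_buffer"}),
]
print(cluster_by_globals(fragments))
# {'symbol_table': ['insert_item', 'lookup_item'], 'error_buffer': ['log_error']}
```

A fragment touching several globals is deliberately listed under each of them; deciding its final owner is exactly the kind of ambiguity left to human intervention.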

-Identification of abstractions
Up to now, the identification of the functionalities of a piece of code has been performed by analysing its control flow graph and isolating special strongly connected sub-graphs (primes), each associated with an atomic subfunctionality. Afterwards, the comparison of these primes with a standard atom library ([Wills90]) leads to the comprehension of the program behaviour.
However, we think that this approach is somewhat limited, as two programs with a different logical structure may in fact implement the same functionality. Therefore, it is necessary to understand the semantics of the program. The usage of pre- and post-conditions has been proposed ([Hausler90]) as a technique for the precise semantic specification of the code.
Therefore, in order to identify the inheritance hierarchies, we essentially have to solve the following problems:
• identification of "program chunks" which are identical from the structural point of view;
• identification of "program chunks" which are identical from the syntactical point of view;
• definition of hierarchies of global data structures.

-The proposed approach
As it has been pointed out in the previous paragraphs, object oriented re-engineering requires the identification of the objects, the methods and the inheritance hierarchies.
This task can be fulfilled partly automatically and partly with human support. The knowledge about the system must be extracted both from formal sources, i.e. the code, and from informal documentation.
In the following, we will give an outline of the proposed approach, where, starting from the analysis of the code, we obtain a list of ......
The architecture of our approach is depicted in Fig.2.
The first step is to start from "well structured" code, i.e. code without GOTOs, written in the C language. However, these assumptions do not reduce the generality of the approach, as code restructuring tools are available on the market, and the peculiarities of the C language affect only a minimal part of the process. In fact, we look for data structures ("external variables") that are available in other programming languages, too, even if under different names.
Afterwards, we try to identify "similarities" between portions of code, building the nesting trees of the procedures and looking for identical regular expressions that represent the code.
The third step is program slicing, which is performed by constructing the Program Dependence Graph.
Finally, we can proceed to the identification of the functionalities. In this phase, human intervention is required to validate the choices. At this stage, the informal documentation may be taken into account.

-The regular expression
We propose to use regular expressions to represent the control logic of the code. This approach is also used by [Cimitile91] and [Wegman83] for different purposes. The great advantage of this kind of representation is that the identification of structurally identical subgraphs reduces to the finding of identical substrings. In addition, we can take advantage of information retrieval techniques (i.e. term significance), concentrating our attention on some general structural features of the code.
The grammar for the code "expressions" is reported in Appendix A.
In Appendix B we report an example of the identification of portions of code that belong to two different procedures but are identical from the structural point of view.
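The following sketch illustrates the idea in a deliberately simplified form; it does not use the grammar of Appendix A, and the node labels are invented for the example. Serialising the nesting tree of control constructs into a string makes structurally identical sub-trees appear as identical substrings:

```python
def to_expr(node):
    """Serialise a nesting tree: a node is an atomic statement label (str)
    or a pair (construct, children) with construct in 'seq'/'if'/'while'."""
    if isinstance(node, str):
        return node
    kind, children = node
    return kind + "(" + ".".join(to_expr(c) for c in children) + ")"

# Two invented procedures sharing a structurally identical while-loop.
proc_a = ("seq", ["s", ("while", [("if", ["s", "s"])]), "s"])
proc_b = ("seq", [("while", [("if", ["s", "s"])])])

# Both loop sub-trees serialise to the same substring, "while(if(s.s))",
# so structural identity reduces to substring identity.
assert to_expr(proc_a[1][1]) == to_expr(proc_b[1][0]) == "while(if(s.s))"
```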

-The Program Dependence Graph
To solve the problems related to the search of candidate methods, we need a form of representation showing the relationship between sliced code and managed global data.
Therefore, such a representation must show both data dependencies and control dependencies.
A form of representation matching these requirements is the Program Dependence Graph (PDG) ([Ferrante87]).
The PDG makes explicit both the essential data and control relationships without the unnecessary sequencing present in the control flow graph.
The PDG represents a program as a graph in which nodes are statements or predicate expressions and the edges incident to a node represent both the data flows and the conditions that control the execution of the operations.
In fact, there are two types of dependencies in a program.
First, a dependence exists between two statements whenever a variable appearing in one statement may assume an incorrect value if the order of the two statements is reversed.
For example, given
A = B * C (S1)
D = A * E + 1 (S2)
S2 depends on S1 because executing S2 before S1 would produce an incorrect value for A in S2. Dependencies of this type are named data dependencies.
Another type of dependence exists between a statement and a predicate whose value immediately controls the execution of the statement. For example, in the sequence
if ( A ) then (S1)
B = C * D (S2)
endif
S2 depends on predicate A because the value of A determines whether S2 is executed.
Dependencies of this type are named control dependencies.
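The data dependence in the first example can be illustrated with a small sketch; the statement labels follow the example above, while the function itself is only an illustration of how def-use edges arise from a straight-line listing:

```python
def data_deps(stmts):
    """stmts: list of (label, set of defined vars, set of used vars),
    in program order. Returns def-use (data dependence) edges."""
    edges, last_def = [], {}
    for label, defs, uses in stmts:
        for v in uses:
            if v in last_def:
                edges.append((last_def[v], label))  # last writer -> reader
        for v in defs:
            last_def[v] = label
    return edges

# A = B * C (S1);  D = A * E + 1 (S2): S2 reads the A written by S1.
assert data_deps([("S1", {"A"}, {"B", "C"}),
                  ("S2", {"D"}, {"A", "E"})]) == [("S1", "S2")]
```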

-Control dependencies
First of all, we have to give some definitions.

Definition 1.
A control flow graph is a directed graph G augmented with a unique entry node START and a unique exit node STOP, such that each node in the graph has at most two successors. We assume that nodes with two successors have attributes "T" (true) and "F" (false) associated with the outgoing edges in the usual way. We further assume that for any node N in G there exists a path from START to N and a path from N to STOP.

Definition 2.
A node V is post-dominated by a node W in G if every directed path from V to STOP (not including V) contains W.
Note that this definition of post-dominance does not include the initial node on the path.
In particular, a node never post-dominates itself.
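As an illustrative sketch of Definition 2, post-dominator sets can be computed by a standard fixed-point iteration; the node names below are invented for the example:

```python
def post_dominators(succ, stop="STOP"):
    """succ: successor map of the CFG (stop has no entry).
    Returns, for each node, its set of strict post-dominators,
    so that a node never post-dominates itself (Definition 2)."""
    nodes = set(succ) | {stop}
    pdom = {n: set(nodes) for n in nodes}      # start from "everything"
    pdom[stop] = {stop}
    changed = True
    while changed:                              # shrink to a fixed point
        changed = False
        for n in nodes - {stop}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return {n: pdom[n] - {n} for n in nodes}

# START -> P -> {A, B} -> STOP (P is a two-way predicate).
pd = post_dominators({"START": ["P"], "P": ["A", "B"],
                      "A": ["STOP"], "B": ["STOP"]})
assert pd["P"] == {"STOP"}          # neither branch post-dominates P
assert pd["START"] == {"P", "STOP"}
```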

Definition 3.
Let G be a control flow graph. Let X and Y be nodes in G. Y is control dependent on X iff:
(1) there exists a directed path P from X to Y with every Z in P (excluding X and Y) post-dominated by Y, and
(2) X is not post-dominated by Y.
If Y is control dependent on X, then X must have two exits. Following one of the exits from X always results in Y being executed, while taking the other exit may result in Y not being executed.
When applied to a loop in the control flow graph, our definition of control dependence determines a strongly connected region (SCR) of control dependencies whose nodes consist of predicates that determine the exits from the loop.
While in the control flow graph nested loops appear as nested SCRs, in the PDG they appear as distinct SCRs with a control dependence edge between the outer loop and each immediate inner loop. In this way the nesting hierarchy is more evident, since loops at the same level appear as SCRs with a common ancestor.
How do we determine control dependencies? The last step in the construction of the control dependence subgraph consists of the addition of region nodes, which summarise the set of control conditions for a node and group together all nodes with the same set of control conditions (Fig.4).
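An illustrative, self-contained sketch of Definition 3, using the standard criterion based on post-dominators (node names are invented for the example):

```python
def strict_pdom(succ, stop="STOP"):
    """Strict post-dominator sets by fixed-point iteration."""
    nodes = set(succ) | {stop}
    pdom = {n: set(nodes) for n in nodes}
    pdom[stop] = {stop}
    changed = True
    while changed:
        changed = False
        for n in nodes - {stop}:
            new = {n} | set.intersection(*(pdom[s] for s in succ[n]))
            if new != pdom[n]:
                pdom[n], changed = new, True
    return {n: pdom[n] - {n} for n in nodes}

def control_deps(succ, stop="STOP"):
    """Y is control dependent on X iff X has a successor S such that
    Y post-dominates S (or Y is S), while Y does not post-dominate X."""
    pd = strict_pdom(succ, stop)
    deps = set()
    for x, succs in succ.items():
        if len(succs) < 2:              # only two-exit nodes matter
            continue
        for s in succs:
            for y in {s} | pd[s]:
                if y != x and y not in pd[x]:
                    deps.add((y, x))
    return sorted(deps)

# if (P) then S1 endif; S2 : only S1 is controlled by P.
succ = {"START": ["P"], "P": ["S1", "S2"], "S1": ["S2"], "S2": ["STOP"]}
assert control_deps(succ) == [("S1", "P")]
```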

-Data dependencies
The construction of the data dependence subgraph, whose nodes consist of statements and predicates, has to face further problems caused by side-effects due to pointers, shared variables, or procedure calls with parameters passed other than by value.
Directed acyclic graphs (DAGs) ([Aho77]) are constructed for each basic block during the initial parse of the source program. Leaf nodes are labelled by unique identifiers, either variable names or constants, which are initially assigned the value "undefined" at program entry.
While computing the set of reaching definitions for each basic block, interior nodes are labelled by an operator symbol and given an extra set of identifiers as labels.
Finally, the individual DAGs are connected to one another using the results of the data flow computation, making the definition-use chains explicit.
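The data flow computation mentioned above can be sketched with a classic reaching-definitions fixpoint; the block and definition names are invented for the example, and for simplicity each block is assumed to contain at most one definition per variable:

```python
def reaching_definitions(blocks, preds):
    """blocks: {name: [(def_id, var), ...]} with at most one definition
    per variable per block; preds: {name: [predecessor names]}.
    Returns the set of definitions reaching the entry of each block."""
    var_of = {d: v for defs in blocks.values() for d, v in defs}
    gen = {b: {d for d, _ in defs} for b, defs in blocks.items()}
    killed = {b: {v for _, v in defs} for b, defs in blocks.items()}
    out = {b: set(gen[b]) for b in blocks}
    rin = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new_in = set().union(*(out[p] for p in preds.get(b, [])))
            new_out = gen[b] | {d for d in new_in
                                if var_of[d] not in killed[b]}
            if new_in != rin[b] or new_out != out[b]:
                rin[b], out[b], changed = new_in, new_out, True
    return rin

# Two definitions of x on converging paths both reach B3.
blocks = {"B1": [("d1", "x")], "B2": [("d2", "x")], "B3": []}
preds = {"B3": ["B1", "B2"]}
assert reaching_definitions(blocks, preds)["B3"] == {"d1", "d2"}
```

The reaching-in sets computed here are what links a use in one block's DAG back to the defining nodes in other blocks' DAGs.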
In this process, I/O operations are treated as operations on implicit file objects, so that the sequencing of operations is correctly represented.
In addition, the definite iteration statement (for) becomes a single operator whose operands are the initial, final and increment values, and which has two output values: the index value stream and the predicate value stream.
A major problem in static flow analysis is that modification of, and reference to, the elements of an array or elements specified by a pointer are considered as references to the whole object.
To face this problem, "dummy" variables are introduced for each pointer variable; for example, dummy variables (1)p and (2)p will be introduced for the pointer variable **p in C. To correctly propagate aliasing information based on the assignment of the address of a variable to another variable, a dummy literal is introduced for each variable whose address is copied. If p = &x, this literal will be denoted by (-1)x.
The value of a dummy variable (i)p is either modified or used by a statement, whereas the values of the pointer variable p and of the dummy variables (1)p, (2)p, ..., (i-1)p are always used.
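The dummy-variable convention can be sketched as follows; the helper function is ours, introduced only to illustrate the naming scheme described above:

```python
def pointer_access(p, level):
    """For a dereference at level `level` of pointer `p`, return the
    dummy variable that is defined or used, and the variables whose
    values are always used (p and the shallower dummies)."""
    target = "(%d)%s" % (level, p)
    always_used = [p] + ["(%d)%s" % (k, p) for k in range(1, level)]
    return target, always_used

# An assignment through **p touches (2)p and always uses p and (1)p.
assert pointer_access("p", 2) == ("(2)p", ["p", "(1)p"])
```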
Aliasing and side-effects present obvious problems in accurately representing dependencies in the PDG.
Explicit aliasing of scalars is easily handled by treating aliases as synonyms; implicit aliasing, induced by procedure parameter binding, is detected using interprocedural data flow analysis ([Cooper85], [Weihl80]). Obviously, interprocedural analysis must be performed before building the basic block DAGs.

-Slicing
The extraction of slices is based on data dependence, even if control dependence is considered in the construction of the slice as well.
A slice is directly obtained by a linear-time backward walk from some point in the graph, visiting all predecessors.
Nodes must be annotated with references to the source code in order to permit the identification of the resulting slice.
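The backward walk can be sketched as simple reachability against the dependence edges; the dependence map and node labels are invented for the example:

```python
def backward_slice(deps, criterion):
    """deps: {node: [nodes it directly depends on]} (reversed PDG edges);
    criterion: the statement of interest. Returns all nodes in the slice."""
    seen, stack = set(), [criterion]
    while stack:
        n = stack.pop()
        if n in seen:
            continue
        seen.add(n)
        stack.extend(deps.get(n, []))   # walk against the dependence edges
    return seen

# S3 depends on S1 (data) and on P1 (control); S2 is unrelated.
deps = {"S3": ["S1", "P1"], "S1": [], "P1": [], "S2": []}
assert backward_slice(deps, "S3") == {"S3", "S1", "P1"}
```

Since every visited node carries its source-code annotation, the resulting set maps directly back to the sliced code.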

-Conclusions
Object orientation is claimed to be the most suitable method for producing software systems that are robust, reliable and reusable. On the other hand, the existing software patrimony and the related investments are so relevant that there is no doubt that we have to recover as much as possible of the effort put into the development of old software systems.
As a consequence, object oriented re-engineering appears a promising research area.
The main difficulty we are faced with when re-engineering old software towards an object oriented environment, is to understand the semantics of the existing code, so that it will be possible to identify objects, methods and inheritance hierarchies.
In this paper, we have presented a general framework for the implementation of a re-engineering cycle. Its main characteristics are the identification of "candidate objects" from the global data structures, and of the methods via program slicing and clustering by the global data which are manipulated.
Human intervention is supposed to take place to solve ambiguous or not automatically decidable cases.

Acknowledgements
We thank A. Cimitile and U. De Carlini from the University of Naples for useful discussions and suggestions.
We also thank M. Gregori, who participated in the early stages of this work.

[Ferrante87] Ferrante J., Ottenstein K.J., Warren J.D., "The Program Dependence Graph and Its Use in Optimization", ACM Transactions on Programming Languages and Systems, 9(3), 1987.

If we consider the nesting trees as parsing trees for the expressions, we can obtain the regular expressions corresponding to the procedures by repeatedly applying the grammar rules.
Furthermore, this form of representation allows the identification of structurally equivalent code portions at both the abstract and the detailed level.