Introduction to Numerical Computing

The aim of this paper is to introduce numerical computing, illustrated through worked examples. Numerical computational techniques are techniques by which mathematical problems are formulated so that they can be solved with arithmetic operations; such techniques are, in essence, numerical methods. Numerical methods support the solution of almost every type of problem, and they are classified according to the type of problem they solve.

Citation: Dhere P (2018) Introduction to Numerical Computing. J Appl Computat Math 7: 423. doi: 10.4172/2168-9679.1000423


Number Systems
In the decimal system we use the 10 numeric symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 to represent numbers. The relative position of each symbol determines the magnitude, or value, of the number.

Example:
6594 is expressible as 6594 = 6·10^3 + 5·10^2 + 9·10^1 + 4·10^0.

Example: 436.578 is expressible as 436.578 = 4·10^2 + 3·10^1 + 6·10^0 + 5·10^-1 + 7·10^-2 + 8·10^-3.

We call the number 10 the base of the system. In general we can represent numbers in any system in the polynomial form:

(a_n a_(n-1) ... a_1 a_0 . b_0 b_1 ...)_B = a_n·B^n + ... + a_1·B^1 + a_0·B^0 + b_0·B^-1 + b_1·B^-2 + ...

where B is the base of the system and the period in the middle is the radix point. In computers, numbers are represented in the binary system. In the binary system, the base is 2 and the only numeric symbols we need to represent numbers in binary are 0 and 1.
Example: 0100110 is a binary number whose value in the decimal system is calculated as follows:

(0100110)_2 = 0·2^6 + 1·2^5 + 0·2^4 + 0·2^3 + 1·2^2 + 1·2^1 + 0·2^0 = 32 + 4 + 2 = 38

The hexadecimal system is another useful number system in computer work. In this system the base is 16, and the 16 symbols used to represent numbers are: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Example: Let 26 be a hexadecimal number. Its decimal value is: (26)_16 = 2·16^1 + 6·16^0 = 38.
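These positional expansions can be checked directly; a small Python sketch (the helper name `from_base` is ours, and the literals are the examples from the text):

```python
# Evaluate the positional (polynomial) expansion of a digit string in base B.
def from_base(digits, B):
    value = 0
    for d in digits:
        value = value * B + int(d, 16)  # int(d, 16) maps '0'-'9' and 'A'-'F' to values
    return value

assert from_base("6594", 10) == 6*10**3 + 5*10**2 + 9*10**1 + 4
assert from_base("0100110", 2) == 38   # the binary example above
assert from_base("26", 16) == 38       # the hexadecimal example above
```

Python's built-in `int(s, B)` performs the same conversion, e.g. `int("26", 16) == 38`.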

Fixed Point and Floating Point Numbers
The term fixed point implies that the radix point is always placed to the right of the list of digits (usually implicitly), i.e., in the representation above the fractional digits b_0 b_1 ... are all zero. The set of numbers thus obtained is analogous to the set of integers. The floating-point representation of numbers is based on the familiar scientific notation. For example, the numbers 635.837 and 0.0025361 are expressed in scientific notation in the form:

0.635837 × 10^3 and 0.25361 × 10^-2 (1)

The general form for decimal numbers is a × 10^b where, in normalized form, a is determined such that 1 ≤ a < 10. Thus the above numbers in normalized form are 6.35837 × 10^2 and 2.5361 × 10^-3, respectively. The radix point floats to the right of the first nonzero digit and the exponent of the base is adjusted accordingly; hence the name floating-point. In normalized form, the exponent b gives the number's order of magnitude. The convention is that floating-point numbers should always be normalized except during calculations. In this note, for simplicity, the following convention is adopted when representing numbers in any floating-point system.
A floating-point number in base β is written as an ordered pair (±d_0.d_1 ... d_(t-1), e)_β and has the value ±d_0.d_1 ... d_(t-1) × β^e. For example:

using t = 4 and decimal, i.e., β = 10, 13.25 is represented as (+1.325, 1)_10
using t = 5 and binary, i.e., β = 2, −38.0 is represented as (−1.0011, 5)_2
using t = 6 and hexadecimal, i.e., β = 16, 0.8 is represented as (+C.CCCCC, −1)_16

The choice of β, the way the bits of a memory location are allocated for storing e, and how many digits t are stored as the significand vary among different computer architectures. Together, the base β and the number of digits t determine the precision of the numbers stored. Many computer systems provide two different lengths for the number of base-β digits t of the significand, called single precision and double precision, respectively. The size of the exponent e, which is limited by the number of digits available to store it, and the form in which it is stored determine the range, or magnitude, of the numbers stored.

Representation of Integers on Computers
An integer is stored as a binary number in a string of m contiguous bits. Negative numbers are stored as their 2's complement, i.e., as 2^m minus the number. On many computers, a bit string of length 32 bits is provided for storing integers. We shall discuss the integer representation using 32 bits; other representations are similar.
It is convenient to picture the 32-bit string as a sequence of binary positions arranged from left to right and labelled 0 to 31 (as b_0, b_1, ..., b_31) for purposes of reference. When a positive integer is to be represented, its binary equivalent is placed in this bit string with the least significant digit in the rightmost bit position, i.e., as b_31. For example, (38)_10 = (00100110)_2 is stored as:

000 ............ 00100110

The left-most bit, labelled b_0, is called the sign bit, and for positive integers the sign bit will always be 0. Thus the largest positive integer that can be stored is

011 ............ 11 = 2^31 − 1 = 2,147,483,647

A negative integer is stored as its two's complement. The two's complement of the negative integer −i is the 32-bit binary number formed by 2^32 − i. For example, (−38)_10 is stored as:

111 ............ 11011010

Note that the sign bit is now automatically set to 1, and this pattern is therefore identified by the computer as representing a negative number. Thus the smallest negative integer that can be stored is 100 ........... 00 = −2^31 = −2,147,483,648.
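The bit patterns above can be reproduced with bit masking; a small Python sketch (the helper name `int32_bits` is ours):

```python
# 32-bit two's complement pattern of an integer: n mod 2^32, written in binary.
def int32_bits(n):
    return format(n & 0xFFFFFFFF, '032b')

assert int32_bits(38)  == '0' * 26 + '100110'     # sign bit 0
assert int32_bits(-38) == '1' * 24 + '11011010'   # 2^32 - 38, sign bit 1
assert int32_bits(2**31 - 1) == '0' + '1' * 31    # largest storable positive integer

# Subtraction as addition in 8 bits: 48 - 21 done as 48 + (2^8 - 21),
# with the carry out of the sign bit discarded.
assert (48 + ((-21) & 0xFF)) & 0xFF == 27
```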
From the above discussion it is clear that the set of numbers that can be stored as integers on 32-bit machines is the set of integers in the range −2^31 to 2^31 − 1. The two's complement representation of negative numbers allows subtraction to be treated as addition in integer arithmetic. For example, consider the subtraction of +21 from +48 with the numbers stored in strings of 8 bits, i.e., m = 8 (for convenience, with the leading bit as sign bit): 48 + (2^8 − 21) = 00110000 + 11101011 = 1 00011011. The sum 283 cannot be stored in 8 bits, but a carry was made both into and out of the sign bit, so when the overflow carry is discarded the answer, 00011011 = 27, is deemed correct. When a carry is made into the sign bit but not out of it, the answer is incorrect.

Floating-point Representation (FP)

The IEEE standard specifies that hardware should be able to represent a subset F of real numbers, called floating-point numbers, using three basic components: s, the sign, e, the exponent, and m, the mantissa. The value of a number in F is calculated as (−1)^s × m × 2^e. The exponent base (2) is implicit and need not be stored. Most computers store floating-point numbers in two formats: single (32-bit) and double (64-bit) precision. In the single precision format, numbers are stored in 32-bit strings (i.e., 4 bytes) with the available bits partitioned as follows:

• s = 0 or 1 denotes the sign of the number; the sign bit is 0 for positive and 1 for negative
• m = 1.f in base 2, where f is the fraction stored in 23 bits, is called the mantissa or the significand
• E = e + 127 (i.e., the exponent is stored in binary with 127 added to it, or in biased form)

When numbers are converted to binary floating-point format, the most significant bit of the mantissa will always be a 1 in the normalized form, making the storage of this bit redundant.
It is sometimes called the hidden bit. Hence it is not stored but implicitly assumed to be present, and only the fraction part, denoted f above, is stored. Thus, in effect, to represent a normalized 24-bit binary mantissa, only 23 bits are needed. The bit thus saved is used to increase the space available to store the biased exponent E. Because E is stored in 8 bits (as shown above), possible values for E are in the range 0 ≤ E ≤ 255. Since the exponent is stored in biased form, both positive and negative exponents can be represented without using a separate sign bit. A bias (sometimes called the excess) is added to the actual exponent so that the result stored as E is always positive. For IEEE single-precision floats, the bias is 127. Thus, for example, to represent an exponent of zero, the value 0 + 127 is stored in binary in the exponent field. A stored value of 100 implies an exponent of 100 − 127, or −27. Exponents of −127 (E = 0, the E field all 0's) and +128 (E = 255, the E field all 1's) are reserved for representing the results of special calculations.
In double precision, numbers are stored in 64-bit strings (i.e., 8 bytes), with the exponent biased by 1023, i.e., E = e + 1023 and stored in 11 bits, and f, the fraction part of the mantissa, stored in 52 bits. Some examples of numbers stored in single precision floating-point follow:

• (38.0)_10 is represented as (1.0011, 5)_2 and is stored in 32-bit single precision floating-point as: 0 10000100 0011000 ··· 0
• (−149.25)_10 is represented as (−1.001010101, 7)_2 and is stored in 32-bit single precision floating-point as: 1 10000110 0010101010 ··· 0
• (1.0)_10 is represented as (1.0, 0)_2 and is stored in 32-bit single precision floating-point as: 0 01111111 0000000 ··· 0
• (0.022460938)_10 is represented as (1.01110, −6)_2 and is stored in 32-bit single precision floating-point as: 0 01111001 0111000 ··· 0

The range of the positive numbers that can be stored in single precision is 2^−126 to (2 − 2^−23) × 2^127, or in decimal, ≈ 1.175494351 × 10^−38 to ≈ 3.4028235 × 10^38. In double precision, the range is 2^−1022 to (2 − 2^−52) × 2^1023 or, in decimal, ≈ 2.2250738585072014 × 10^−308 to ≈ 1.7976931348623157 × 10^308. Since the sign of a floating-point number is given in a separate bit, the range for negative numbers is the negation of the above values.
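The single precision patterns above can be verified with Python's struct module, which exposes the IEEE bytes of a value; a small sketch (the helper name `float32_bits` is ours):

```python
import struct

# Return the 32 bits of a number stored in IEEE single precision, as a string.
def float32_bits(x):
    (u,) = struct.unpack('>I', struct.pack('>f', x))
    return format(u, '032b')

b = float32_bits(38.0)
assert b[0] == '0'                  # sign bit
assert b[1:9] == '10000100'         # E = 5 + 127 = 132
assert b[9:] == '0011' + '0' * 19   # f, from the normalized 1.0011 x 2^5

b = float32_bits(1.0)
assert b[1:9] == '01111111'         # E = 0 + 127 = 127
```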
Note that −126 is the smallest exponent for a normalized number because 1 is the smallest possible nonzero value of E for normalized numbers. Similarly, +127 is the largest exponent for a normalized number because 254 is the largest possible value of E for normalized numbers. When E is outside this allowable range but still representable in 8 bits, the result is encoded as follows:

1. E = 0 and f = 0 encodes ±0
2. E = 0 and f ≠ 0 encodes ±0.f × 2^−126
3. E = 255 and f ≠ 0 encodes NaN
4. E = 255 and f = 0 encodes ±∞

NaN is an abbreviation for the phrase not a number, implying that the encoding represents the result of an operation that cannot be represented in floating-point. This means that there are two zeroes, +0 (s set to 0) and −0 (s set to 1), and two infinities, +∞ (s set to 0) and −∞ (s set to 1). NaNs may have a sign and a significand, but these have no meaning other than for diagnostics; the first bit of the significand is often used to distinguish signaling NaNs from quiet NaNs. Note that NaNs and infinities have all 1's in the E field.
An interesting case is the second situation described above, which allows numbers smaller than the smallest normalized number representable in the floating-point format to be stored as unnormalized floating-point numbers. As noted above, the smallest normalized number representable in the floating-point format is (±1.0, −126)_2. Numbers with exponents smaller than −126 can be stored in the floating-point format by denormalizing the number, shifting the radix point a number of places to the left so that the exponent is adjusted to −126. For example, the number (+1.0, −127)_2 can be stored in denormalized form because it is equal to (+0.1, −126)_2, and will be stored in single precision 32-bit floating-point as: 0 00000000 1000000 ··· 0. Thus, these numbers are representable in floating-point format with E = 0 and m = 0.f, and are called subnormal or denormalized numbers. The smallest nonzero positive and largest nonzero negative denormalized numbers (represented by all 0's in the E field and the binary value 1 in the f field) are ±2^−149 (or ≈ ±1.4012985 × 10^−45 in decimal).
Operations on special numbers are well-defined by the IEEE standard. In the simplest case, any operation involving a NaN yields a NaN result. These and other operations are summarized in the following table:

Floating-point Arithmetic
Because of the finite precision of floating-point numbers, floating-point arithmetic can only approximate real arithmetic. In particular, floating-point arithmetic is commutative but not associative, so arithmetic expressions evaluated in different orders of operations may produce different results. Since many real numbers do not have exact floating-point equivalents, floating-point operations such as addition and multiplication may produce only approximations to the exact answer obtained using exact arithmetic. The IEEE standard includes rounding modes, which are methods for selecting a floating-point number to approximate the true result. The symbols ⊕, ⊖, ⊗ and ⊘ are used to represent the floating-point operations equivalent to the real operations +, −, ×, and ÷, respectively. The IEEE standard requires that these operations, plus square root, remainder, and conversion between integer and floating-point, be correctly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number.
The IEEE standard has four different rounding modes, the default being round to even, or unbiased rounding, which rounds to the nearest value. If the number falls midway, it is rounded to the nearest value with an even (or zero) least significant bit. The other rounding modes allowed are toward zero, toward positive infinity, and toward negative infinity.
While different machines may implement floating-point arithmetic somewhat differently, the basic steps are common, and the operations are carried out in floating-point registers capable of holding more bits than the source operands require. This means that floating-point arithmetic operations may use more bits than are used to represent the individual operands. However, some of the operations discussed below may result in bits falling off the least significant end even when additional bits are used. Usually the FP register has a guard digit to hold at least one overflowed least significant bit so that it can be used for rounding later. As an example, consider the addition (or subtraction) of two numbers. The steps needed can be summarized as follows:

1. The two operands are first brought into FP registers.
2. The exponents E of the two operands are compared, and the mantissa of the number with the smaller exponent is shifted to the right by a number of digits equal to the difference between the two exponents. This operation is called alignment and may result in a carry into the guard digit.
3. The larger of the two exponents is made the exponent of the result.
4. Add (or subtract) the two mantissas. This uses the guard digit and may alter its value.
5. The correct sign of the result is determined (using an algorithm).
6. The resultant mantissa is normalized by shifting digits to the left, and the exponent is decreased by the number of digits shifted.
In much of the following discussion, hypothetical computers implementing decimal floating-point representation and floating-point arithmetic with finite precision and rounding are used to illustrate various implications of floating-point arithmetic. In general, this helps to fix ideas and avoids the complication of dealing with binary arithmetic and large numbers of binary digits.
Implications of floating-point representation and floating-point arithmetic

a) As observed earlier, not all real numbers can be represented exactly as floating-point numbers conforming to the IEEE standard. For example, it may be observed that the numbers between 6.3247 and 6.3248 are not representable in a 5-digit decimal floating-point format. Also note that the same number of real numbers in each of the ranges [10^0, 10^1], [10^1, 10^2], ... can be represented in the 5-digit decimal floating-point format.
In the IEEE floating-point format the corresponding intervals are [2^0, 2^1], [2^1, 2^2], and so on. Thus the final answer is computed as Σx_i² − n x̄² = (+5.305, 4) − (+5.305, 4) = 0! This occurs because only 4 digits of the result of each computation are carried, and these digits cancel each other out. The true answer depends entirely on those digits that were not carried.

Stability of Algorithms
In the previous example the data is not the problem; the algorithm is. The formula used does not provide a good algorithm for computing the sample variance. Algorithms that produce the best answer in FP for a wide array of input data are called stable; in particular, small perturbations in the data must not produce large changes in the value computed by the algorithm. For the computation of the sample variance it is easily checked that, on the same machine and using the same precision arithmetic, the algorithm based on the formula Σ(x_i − x̄)²/(n − 1) gives the exact answer, 2.5, for the sample variance of the data considered in the previous example. In a later section, results of a study comparing the stability of these two algorithms are presented.
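The instability of the one-pass formula Σx_i² − n x̄² can be seen even in double precision by taking data whose mean is large relative to its spread; a small sketch (the data set below is ours, with true sample variance 1.0):

```python
# Data with a large mean relative to its spread: true sample variance is 1.0.
x = [1e8, 1e8 + 1, 1e8 + 2]
n = len(x)
mean = sum(x) / n

# One-pass formula: the leading digits of the two large terms cancel catastrophically.
var_onepass = (sum(v * v for v in x) - n * mean * mean) / (n - 1)

# Two-pass formula: the deviations from the mean are computed exactly here.
var_twopass = sum((v - mean) ** 2 for v in x) / (n - 1)

assert var_twopass == 1.0
assert var_onepass != 1.0   # the cancellation destroys the answer
```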
Other well-known examples are the computation of e^−a for a > 0 using a series expansion, and the evaluation of polynomials. Consider computing exp(−5.5) using the series:

e^−a = 1 − a + a²/2! − a³/3! + a⁴/4! − ···

The required computations are summarized in the table given below.
An alternative is to use the reciprocal of exp(5.5) to compute exp(−5.5). Using this method, exp(−5.5) = 1/exp(5.5) = 1/244.61 = 0.0040881 is obtained using only 19 terms. The true value is 0.0040867. Thus using the truncated series directly to compute exp(−5.5) does not produce a single correct significant digit, whereas the reciprocal of exp(5.5), computed using the same series, gives 3 correct digits. When evaluating polynomials in the power form

f(x) = a_0 + a_1 x + a_2 x² + ··· + a_n x^n    (1)

re-expressing f(x) in other forms, some of which are discussed below, often leads to increased accuracy. The reason is that these representations result in more numerically stable algorithms than direct use of the power form given above. One such representation is called the shifted form, where f(x) is expressed as

f(x) = b_0 + b_1 (x − c) + b_2 (x − c)² + ··· + b_n (x − c)^n    (2)

where the center c is a predetermined constant. It is easy to see that, given c, by comparing coefficients in (1) and (2) one can obtain the values of b_0, b_1, ..., b_n from those of a_0, a_1, ..., a_n. Note that this shifting can be accomplished by making the change of variable t = x − c. Consider a simple parabola centered at x = 5555.5,

f(x) = (x − 5555.5)²

whose power form representation is

f(x) = 30863580.25 − 11111x + x²

If we now compute f(5555) and f(5554.5) in 6-digit decimal arithmetic directly from the power form, we get

f(5555) = 3.08636 × 10^7 − 1.11110 × 10^4 (5.5550 × 10^3) + (5.5550 × 10^3)² = 3.08636 × 10^7 − 6.17216 × 10^7 + 3.08580 × 10^7 = 0

f(5554.5) = 3.08636 × 10^7 − 6.17160 × 10^7 + 3.08525 × 10^7 = 0.00001 × 10^7 = 100

whereas the true values are 0.25 and 1, respectively. The same arithmetic on the shifted form shows that its evaluation produces accurate results. Given a polynomial f(x) in the power form, the center c is usually chosen somewhere in the 'center' of the x values at which f(x) is expected to be evaluated. Once a value for c is determined, one converts f(x) from the power form to the shifted form by evaluating the coefficients b_0, b_1, ..., b_n.
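The parabola example can be reproduced by rounding every operation to IEEE single precision (about 7 significant decimal digits, close in spirit to the 6-digit arithmetic of the text); a sketch using only the standard library:

```python
import struct

# Round a value to IEEE single precision (~7 significant decimal digits).
def f32(v):
    return struct.unpack('f', struct.pack('f', v))[0]

x = 5555.0
# Power form, every intermediate rounded to single precision:
power = f32(f32(30863580.25) - f32(f32(11111.0) * x) + f32(x * x))
# Shifted form with center c = 5555.5:
shifted = f32(f32(x - 5555.5) ** 2)

assert shifted == 0.25   # the true value f(5555) = (-0.5)^2
assert power != 0.25     # cancellation wipes out the answer in the power form
```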
The coding of the shifted form may be accomplished by first rewriting f(x), expressed in shifted form as in (2), in nested form. Given the coefficients b_0, b_1, ..., b_n and the center c, this computation may be coded using the following algorithm. In the simplest implementation of this algorithm, the center c is taken to be zero in expression (3), and the coefficients b_i are then equal to a_i for all i = 1, ..., n; f(x) of (1) is then re-expressed in the form

f(x) = a_0 + x(a_1 + x(a_2 + ··· + x(a_(n−1) + x a_n) ···))

The resulting algorithm is often called Horner's rule. Consider the evaluation of the polynomial in (4). A C expression for evaluating (4) using Horner's rule follows the nested pattern above.
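The original C expression is not reproduced above; as a sketch of the same rule (with center c = 0 and a sample polynomial of ours):

```python
# Horner's rule: evaluate a_0 + a_1 x + ... + a_n x^n with n multiplies and n adds.
def horner(a, x):
    result = 0.0
    for coeff in reversed(a):   # process a[n], a[n-1], ..., a[0]
        result = result * x + coeff
    return result

# f(x) = 2 - 3x + x^2, so f(5) = 2 - 15 + 25 = 12
assert horner([2.0, -3.0, 1.0], 5.0) == 12.0
```

For the shifted form with a nonzero center c, the update becomes `result = result * (x - c) + coeff`.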

Condition Number
The term condition is used to denote a characteristic of input data relative to a computational problem.It measures the sensitivity of computed results to small relative changes in the input data.
Ill-conditioned problems produce a large relative change in the computed result from a small relative change in the data. The condition number thus measures the relative change in the computed result due to a unit relative change in the data. Suppose that a numerical procedure is described as producing output from some input, i.e., output = f(input). Then we define a constant κ such that

|relative change in output| ≈ κ |relative change in input|

κ is called the condition number, and small condition numbers are obviously desirable. If this number is large for a particular computation performed on some data, then that data set is said to be ill-conditioned with respect to the computation.
For some problems it is possible to define a condition number in closed form. Suppose, for example, that the computational problem is to evaluate a function f() at a specified point x. If x is perturbed slightly to x + h, the relative change in f(x) can be approximated by

(f(x + h) − f(x))/f(x) ≈ h f′(x)/f(x)

so that, comparing this with the relative change h/x in x, we may take

κ = |x f′(x)/f(x)|

to represent the condition number for this problem.

Example:
Evaluate the condition number for computing f(x) = sin⁻¹ x near x = 1. Obviously,

κ = |x f′(x)/f(x)| = |x / (√(1 − x²) sin⁻¹ x)|

For x near 1, sin⁻¹ x ≈ π/2 and √(1 − x²) → 0, so the condition number becomes infinitely large. Thus small relative errors in x lead to large relative errors in sin⁻¹ x near x = 1.

Example:
Chan and Lewis (1978) have derived the condition number for the sample variance computation using the one-pass machine algorithm (see later in this note). It follows that the coefficient of variation (CV) is a good measure of the condition of a data set relative to the computation of s².
It is important to note that the same problem may be well-conditioned with respect to another set of data.

Example:
Consider the computation of f(x) = (1.01 + x)/(1.01 − x) for x near 2.0 and for x near 1.0.
For x near 2.0: a .5% change in the data causes only a .67% change in the computed result. Thus the problem is well-conditioned for x near 2.
For x near 1.0: a .5% change in x produces a 100% change in f(x). Thus the problem is seen to be ill-conditioned for x near 1. We may verify this by computing the condition number κ for this f(x).
Since f′(x) = 2.02/(1.01 − x)², we have

κ = |x f′(x)/f(x)| = |2.02 x / ((1.01 − x)(1.01 + x))| = |2.02 x / (1.0201 − x²)|

Thus we see that κ is large near x = 1, leading to the conclusion that this computation is ill-conditioned near x = 1. Note, however, that the algorithm used (i.e., direct computation) is stable in FP. It gives "right" answers in both cases.
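The two cases can be checked numerically with the .5% perturbations used in the text; a small sketch:

```python
# f is ill-conditioned where the denominator 1.01 - x is near zero.
def f(x):
    return (1.01 + x) / (1.01 - x)

def rel_change(a, b):
    return abs((b - a) / a)

# Near x = 2: a 0.5% change in x moves f by well under 1%.
assert rel_change(f(2.0), f(2.01)) < 0.01

# Near x = 1: the same 0.5% change in x changes f by roughly 100%.
assert rel_change(f(1.0), f(1.005)) > 0.5
```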
Thus the aim should be to develop algorithms that are stable for both well-conditioned and ill-conditioned data with respect to a specified computational problem. In many cases simply re-expressing the computational formulas may work; however, for difficult problems more complex approaches may be necessary. As we shall see later, the Youngs-Cramer algorithm is an example of a stable algorithm for the computation of the sample variance s².

Order of Convergence
Order Relations

These are central to the development of asymptotics. Consider functions f(x) and g(x) defined on an interval I (∞ or −∞ can be a boundary of I). Let x_0 be either an interior point or a boundary point of I, with g(x) ≠ 0 for x near x_0 (though g(x_0) itself may be zero). By definition, if

lim_{x→x_0} f(x)/g(x) = 1

then f(x) ∼ g(x), i.e., f(x) and g(x) are asymptotically equivalent.
In practice, f(x) and g(x) are often defined on {1, 2, ...} instead of an interval, with x_0 = ∞.

Example: (x² + 1)/(x + 1) ∼ x as x → ∞.

Proof: lim_{x→∞} (x² + 1)/(x(x + 1)) = lim_{x→∞} (1 + 1/x²)/(1 + 1/x) = 1. Thus we can say (x² + 1)/(x + 1) ∼ x.

Applications in Computing
If an infinite series (e.g., for log) is approximated by a finite sum, the omitted terms represent the approximation error. Here we need to estimate this error (or a bound for it) theoretically.
Usually this error depends on some parameters, e.g., n = the number of terms in the finite sum, or h = the "step size" in numerical differentiation or integration. In an iterative procedure we need to be able to test whether the process has converged.
In addition, we need to be able to measure the rate of convergence; e.g., to say that a method converges like 1/n or h², or that the approximation error is of the order 1/n². We use "big O" notation to measure the order of convergence. For sequences {a_n} and {b_n}, we say that a_n = O(b_n) if |a_n| ≤ C|b_n| for some constant C and all sufficiently large n. For example,

5/n², 10/n² + 1/n³, −6.2/n² + e^−n/n are all O(1/n²) as n → ∞

and

4h, 3h + h² log h, −h + h² − h³ are all O(h) as h → 0

Definition of Convergence Rate:
If s is the solution and ε_i = |x_i − s|, where x_1, x_2, ... are successive iterates, then the iteration is said to exhibit convergence rate β if the limit

lim_{i→∞} ε_{i+1}/ε_i^β = c

exists. c is called the rate constant. We speak of linear convergence, superlinear convergence, quadratic convergence, etc., depending on the values of β and c. For example, linear convergence implies that the error of the (i+1)th iterate is linearly related to the error of the ith iterate, so the error decreases at a linear rate. That is, for β = 1, if 0 < c < 1 the sequence is said to converge linearly. If 1 < β < 2 and c is finite, the convergence is said to be superlinear. If c is finite for β = 2, the iteration converges quadratically.
For example, consider the Newton-Raphson algorithm for solving f(x) = 0. The iteration is given by

x_{i+1} = x_i − f(x_i)/f′(x_i)

and a Taylor expansion of f about the root shows that

ε_{i+1} ≈ (f″(s)/(2f′(s))) ε_i²

This shows that Newton-Raphson iteration is a 2nd order method, i.e., that it is quadratically convergent.

Numerical Root Finding
Consider the problem of solving f (x) = 0 where f (x) is a (univariate) nonlinear function in x.There are several approaches to obtaining iterative algorithms for this problem.We shall look at a few methods briefly.

Newton's Method
From the diagram, f′(x_0) ≈ f(x_0)/(x_0 − x_1), so that x_1 = x_0 − f(x_0)/f′(x_0). This is generalized to obtain the iterative formula

x_{i+1} = x_i − f(x_i)/f′(x_i)

where f′(x_i) is the first derivative of f(x) evaluated at x_i and x_0 is an initial solution. To begin the iteration a suitable value for x_0 is needed. As an illustration of the use of this algorithm, it is applied to find a solution of a cubic equation. To estimate a value for x_0, first evaluate f(x) for a few values of x. This type of search is known as a grid search, the terminology coming from the use of the technique in the multidimensional case. It is clear that a zero of f(x) exists between 5 and 6; hence a good starting value is x_0 = 5.0. To use the above algorithm the first derivative of f(x) is also needed. It is known that a true root of the equation is 5.005265097. Newton-Raphson iteration is quadratically convergent when x_0 is selected close to an actual root.
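The cubic used in the text is not reproduced here, so as a sketch the same iteration is applied to the classic cubic x³ − 2x − 5 = 0 (a substitute example of ours, with a root near 2.0945514815):

```python
# Newton-Raphson: x_{i+1} = x_i - f(x_i)/f'(x_i)
def newton(f, fprime, x0, tol=1e-12, maxit=50):
    x = x0
    for _ in range(maxit):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

f = lambda x: x**3 - 2*x - 5
fprime = lambda x: 3*x**2 - 2
root = newton(f, fprime, x0=2.0)   # a grid search suggests a root near 2
assert abs(f(root)) < 1e-9
```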

Secant Method
If the approximation f′(x_i) ≈ (f(x_i) − f(x_{i−1}))/(x_i − x_{i−1}) is plugged into Newton's formula, we obtain the iteration

x_{i+1} = x_i − f(x_i)(x_i − x_{i−1})/(f(x_i) − f(x_{i−1}))

This is thus a variation of Newton's method that does not require the availability of the first derivative of f(x). The iterative formula can also be obtained directly by noting that the triangles formed by x_2, x_1, f(x_1) and x_2, x_0, f(x_0) in the diagram are similar. An equivalent iterative formula is

x_{i+1} = (x_{i−1} f(x_i) − x_i f(x_{i−1}))/(f(x_i) − f(x_{i−1}))

Usually the Secant method takes a few more iterations than Newton-Raphson.
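A sketch of the secant iteration on the same kind of problem (the cubic x³ − 2x − 5 = 0 is a sample of ours; no derivative is needed):

```python
# Secant: x_{i+1} = x_i - f(x_i) (x_i - x_{i-1}) / (f(x_i) - f(x_{i-1}))
def secant(f, x0, x1, tol=1e-12, maxit=100):
    f0, f1 = f(x0), f(x1)
    for _ in range(maxit):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1

f = lambda x: x**3 - 2*x - 5
root = secant(f, 2.0, 3.0)
assert abs(f(root)) < 1e-9
```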

Fixed Point Iteration
Re-expressing f(x) = 0 in the form x = g(x), one can obtain an iterative formula x^(i+1) = g(x^(i)) for finding roots of nonlinear equations. Thisted examines solving an equation with solution ξ = .61489 by expressing g(x) in several different forms. The iteration based on g(x) fails to converge; g_4(x) is a scaled version of g(x).
Table 1: Convergence of two methods for fixed-point iteration. The first pair of columns gives the results using g_2(x); the second pair gives the results for g_4(x). Both of these versions are extremely stable, although convergence is not rapid when x_0 is far from the solution s. The iterative process will produce a sequence of iterates x^(1), x^(2), ... that lie in the specified interval; however, there is no guarantee that the sequence converges to a root of f(x) = x − g(x) = 0. For this to occur, there must be at most one root of x − g(x) = 0 in [a, b]. We can check whether g(x) and the interval [a, b] satisfy one of two conditions to verify whether x − g(x) has a unique root in [a, b]: that g maps [a, b] into itself, and either g satisfies a Lipschitz condition on [a, b] with constant L < 1, or g is differentiable with |g′(x)| ≤ L < 1 on [a, b].
If one of the above is satisfied and the iterative process has a fixed point in [a, b], then it converges to the unique root of f(x) = 0 in [a, b].

Example: Find the root of f(x) = x² − 4x + 2.3 = 0 contained in the interval [0, 1] by an iterative method. Consider the possible iterative processes that may be obtained by writing f(x) = 0 in the forms: (i) x = (x² + 2.3)/4, (ii) x = (4x − 2.3)^(1/2), and (iii) x = −2.3/(x − 4).
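Form (iii), x = −2.3/(x − 4), maps [0, 1] into itself and has |g′(x)| = 2.3/(4 − x)² ≤ 2.3/9 < 1 there, so the iteration converges; a sketch:

```python
import math

g = lambda x: -2.3 / (x - 4)   # form (iii): x = g(x)

x = 0.5                        # starting point inside [0, 1]
for _ in range(60):
    x = g(x)

# The smaller root of x^2 - 4x + 2.3 = 0, by the quadratic formula.
root = (4 - math.sqrt(16 - 4 * 2.3)) / 2
assert abs(x - root) < 1e-10
assert abs(x * x - 4 * x + 2.3) < 1e-9
```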

Aitken Acceleration
First consider n + 1 iterates of a linearly convergent iteration: x^(0), x^(1), ..., x^(n). Suppose the method converges to the unique solution X at rate α; that is, by our previous definition, for a linearly convergent sequence,

ε^(i+1) ≈ α ε^(i), where ε^(i) = X − x^(i).

Thus, as i → ∞, we should have

(X − x^(i+2))/(X − x^(i+1)) ≈ (X − x^(i+1))/(X − x^(i))

From this we have

X ≈ x^(i) − (x^(i+1) − x^(i))² / (x^(i+2) − 2x^(i+1) + x^(i))

Defining the first and second order forward differences ∆x^(i) = x^(i+1) − x^(i) and ∆²x^(i) = x^(i+2) − 2x^(i+1) + x^(i), this becomes

x*^(i) = x^(i) − (∆x^(i))²/∆²x^(i)

This is called Aitken's ∆² method. If a sequence x*^(0), x*^(1), ... is defined in this way, then it can be shown that the new sequence converges to X faster than the convergent sequence x^(0), x^(1), ... in the sense that

lim_{i→∞} (X − x*^(i))/(X − x^(i)) = 0

We can observe that Aitken's method attempts to eliminate the first order term in the error by the correction (∆x^(i))²/∆²x^(i). However, since the assumption above that the error rate α is constant is only an approximation, one would expect the convergence to be not exactly quadratic, although considerably better than linear.
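As a sketch, take the linearly convergent fixed-point iteration x = 2.3/(4 − x) (whose fixed point is the smaller root of x² − 4x + 2.3 = 0, an example of ours) and apply the ∆² correction to three successive iterates:

```python
import math

g = lambda x: 2.3 / (4 - x)    # linearly convergent fixed-point iteration

x0 = 0.5
x1 = g(x0)
x2 = g(x1)

# Aitken's Delta^2 estimate: x* = x0 - (dx)^2 / d2x
dx = x1 - x0
d2x = x2 - 2 * x1 + x0
x_star = x0 - dx * dx / d2x

s = (4 - math.sqrt(16 - 4 * 2.3)) / 2   # the true fixed point
assert abs(x_star - s) < abs(x2 - s)    # accelerated estimate beats the raw iterate
```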
Steffensen's method is a slight modification of this straight Aitken acceleration of the original sequence. Instead of using all of the original iterates to obtain the new sequence, a new x*^(0) is computed using (1), and it is then used to generate 2 new iterates by applying the iterating function successively. From these three iterates a new x*^(0) is computed, from which two iterates x*^(1) and x*^(2) are generated, and so on. This results in an algorithm that is closer to achieving quadratic convergence than the direct application of Aitken acceleration above.

Steffensen's Algorithm:
Select g such that x = g(x), a starting value x_0, and set i = 0. While i < N do:

1. Set x_1 = g(x_0) and x_2 = g(x_1)
2. Set x_0 = x_0 − (x_1 − x_0)²/(x_2 − 2x_1 + x_0) and increment i

Example: For solving x³ − 3x² + 2x − 0.375 = 0 using fixed point iteration, write the equation in the form x = g(x). It can be shown that this has a fixed point and results in a convergent iteration. We compare below the output from using the 5 iteration schemes: fixed point, Aitken, Newton-Raphson, Steffensen, and Secant.
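The g used for the cubic example is not reproduced above, so as a sketch Steffensen's algorithm is applied to the convergent scheme x = 2.3/(4 − x) for x² − 4x + 2.3 = 0 (an example of ours):

```python
import math

# Steffensen: two g-evaluations per sweep, then one Aitken correction.
def steffensen(g, x0, sweeps=8):
    x = x0
    for _ in range(sweeps):
        x1 = g(x)
        x2 = g(x1)
        denom = x2 - 2 * x1 + x
        if denom == 0:             # already converged to machine precision
            return x2
        x = x - (x1 - x) ** 2 / denom
    return x

g = lambda x: 2.3 / (4 - x)
s = (4 - math.sqrt(16 - 4 * 2.3)) / 2   # the true fixed point
assert abs(steffensen(g, 0.5) - s) < 1e-9
```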

Bracketing Methods
Methods such as Newton-Raphson may be categorized as analytical iterative methods for finding roots of f(x) = 0, since they depend on the behaviour of the first derivative f′(x) even if it is not explicitly calculated. Methods in which only the values of the function at different points are used are known as search methods. The best known iterative search methods are called bracketing methods, since they attempt to reduce the width of the interval in which the root is located while keeping the successive iterates bracketed within that interval. These methods are safe, in the sense that they avoid the possibility of an iterate being thrown far away from the root, as may happen with Newton's or the Secant method when a flat part of the function is encountered. Thus these methods are guaranteed to converge to the root.
The simplest bracketing method is the bisection algorithm, in which the interval of uncertainty is halved at each iteration and, of the two resulting intervals, the one that straddles the root is chosen as the next interval. Consider a function f(x) continuous on the interval [x_0, x_1], where x_0 < x_1, and suppose it is known that a root of f(x) = 0 lies in this interval.

Bisection Algorithm:
Select starting values x_0 < x_1 such that f(x_0)·f(x_1) < 0. Repeat: compute the midpoint x_2 = (x_0 + x_1)/2 and f(x_2); if f(x_0)·f(x_2) < 0 set x_1 = x_2, otherwise set x_0 = x_2; until the interval [x_0, x_1] is sufficiently small.

The Golden-section search is a bracketing algorithm that reduces the length of the interval of uncertainty by a fixed factor α in each iteration (instead of by .5 as in the bisection method), where α satisfies α² = 1 − α. This gives α = (√5 − 1)/2 = .618033989, which is known as the golden ratio. The Fibonacci search, which is not discussed here, is the optimal method of this type of search method.
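The bisection step above can be sketched as follows (applied to x² − 2 = 0 on [0, 2], a sample equation of ours):

```python
# Bisection: halve the bracketing interval, keep the half that straddles the root.
def bisect(f, x0, x1, tol=1e-10):
    f0 = f(x0)
    while x1 - x0 > tol:
        x2 = (x0 + x1) / 2.0
        if f0 * f(x2) <= 0:        # root lies in [x0, x2]
            x1 = x2
        else:                      # root lies in [x2, x1]
            x0, f0 = x2, f(x2)
    return (x0 + x1) / 2.0

root = bisect(lambda x: x * x - 2.0, 0.0, 2.0)
assert abs(root - 2.0 ** 0.5) < 1e-8
```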
The main problem with bisection is that convergence is extremely slow. A search algorithm that retains the safety feature of the bisection algorithm but achieves a superlinear convergence rate is the Illinois method. As in bisection, this algorithm begins with two points [x_0, x_1] that straddle the root, and uses the Secant formula to obtain the next iterate x_2.
Then instead of automatically discarding x 0 as in Secant, the next two iterates are either [x 0 , x 2 ] or [x 2 , x 1 ] depending on which interval straddles the root.This is determined by examining the sign of f (x 2 ) and comparing with the sign of f (x 1 ).Thus one of the end-points of the previous interval is retained in forming the next interval.If the same end-point is retained in the next iteration also that iteration will not be a true Secant iteration (since the oldest iterate will always be discarded in Secant).If this occurs, the value of f () at that end-point is halved to force a true Secant step to be performed.The halving of an end-point is continued until a true Secant iteration takes place.
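A minimal sketch of the Illinois scheme just described (the tolerance, iteration limit, and sample function are illustrative assumptions):

```python
def illinois(f, x0, x1, tol=1e-12, max_iter=100):
    """Illinois method: Secant steps on a bracketing interval; when
    the same end-point is retained twice in a row, its stored
    f-value is halved to force a true Secant step eventually."""
    f0, f1 = f(x0), f(x1)
    if f0 * f1 >= 0:
        raise ValueError("f(x0) and f(x1) must have opposite signs")
    for _ in range(max_iter):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)   # Secant step
        f2 = f(x2)
        if f1 * f2 < 0:        # root now in [x2, x1]: discard x0
            x0, f0 = x1, f1
        else:                  # end-point x0 retained again:
            f0 = f0 / 2.0      # halve its stored f-value
        x1, f1 = x2, f2
        if abs(f2) < tol:
            break
    return x1

# Same test problem as for bisection: root of x^2 - 2 on [1, 2]
root = illinois(lambda x: x * x - 2.0, 1.0, 2.0)
```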

Iterative Solution of Linear Systems
We will introduce direct methods for the solution of linear systems of equations, such as those based on Gauss-Jordan row operations or on the Cholesky factorization, during the discussion of matrix methods later. Compared to algorithms based on direct methods, which produce exact solutions with a fixed number of computations, the methods introduced here are iterative in nature: they produce a sequence of approximations to the solution that converge to the exact solution. These methods are important since variations of them are used in many computer-intensive nonlinear statistical methods. Consider the solution of the linear system of equations Ax = b, where A is an n × n matrix of coefficients of full rank and b is a given n × 1 vector of constants. By rewriting the first row, we have x_1 = (b_1 − Σ_{j≠1} a_1j x_j)/a_11, provided a_11 ≠ 0. Similarly, we can write x_i = (b_i − Σ_{j≠i} a_ij x_j)/a_ii for i = 1, . . ., n, provided a_ii ≠ 0 for all i. This relation suggests an iterative algorithm of the fixed-point type for obtaining an updated value x_i^(k) given x^(k−1), for i = 1, . . ., n. In particular, starting from an initial guess x^(0), we can compute the sequence x^(k), k = 1, 2, . . ., where x_i^(k) = (b_i − Σ_{j≠i} a_ij x_j^(k−1))/a_ii. (1) This procedure is called the Jacobi iteration. It is also called a method of simultaneous displacement, since no element of x^(k) is used until every element of x^(k) has been computed, and then x^(k) replaces x^(k−1) entirely for the next cycle. Let us consider A to be a real symmetric positive definite matrix and let D = diag(A) and N = A − D.
Then the above iteration (1) can be expressed in vector form as x^(k) = D^−1(b − N x^(k−1)). (2) The error e^(k) at the kth step of the Jacobi iteration is easily derived to be e^(k) = H e^(k−1), where H = −D^−1 N. Notice that H is a matrix that is fixed throughout the iteration, which is a characteristic of stationary methods. This also gives e^(k) = H^k e^(0), which shows that for this iteration to converge it is clearly necessary that the largest eigenvalue of H in absolute value, called the spectral radius, be less than unity. However, it can be shown that even in a convergent iteration, individual elements of e^(k) may exceed the corresponding elements of e^(0) in absolute value. These elements will not decrease exponentially as k increases unless the largest eigenvalue dominates the others. That is, unless the absolute values of the off-diagonals of A are negligible compared to the diagonals, convergence is slow.
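Iteration (1) may be sketched as follows (an illustrative version; the small diagonally dominant test system is our own choice):

```python
def jacobi(A, b, x0=None, tol=1e-10, max_iter=500):
    """Jacobi iteration (simultaneous displacement): every element
    of x^(k) is computed from x^(k-1) before any value is replaced."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for _ in range(max_iter):
        # x_i^(k) = (b_i - sum_{j != i} a_ij x_j^(k-1)) / a_ii
        x_new = [(b[i] - sum(A[i][j] * x[j]
                             for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

# Diagonally dominant system: 4x1 + x2 = 1, x1 + 3x2 = 2
x = jacobi([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```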
In equation (1), if the elements of x^(k) replace those of x^(k−1) as soon as they have been computed, this becomes a method of successive displacement. The resulting algorithm is called the Gauss-Seidel iteration and is given by x_i^(k) = (b_i − Σ_{j<i} a_ij x_j^(k) − Σ_{j>i} a_ij x_j^(k−1))/a_ii, for k = 1, 2, . . . . Again, by setting D = diag(A) and N = A − D = L + U, where L and U are strictly lower triangular and strictly upper triangular, respectively, the iteration can be expressed in the form x^(k) = B^−1(b − U x^(k−1)), where B = D + L (i.e., B is the lower triangle of A) and H = −B^−1 U. The error e^(k) at the kth step of the Gauss-Seidel iteration is given by e^(k) = H e^(k−1) = H^k e^(0), as before. In particular, convergence of Gauss-Seidel will occur for any starting value x^(0) if and only if H = −B^−1 U has all eigenvalues less than unity in absolute value. Convergence is rapid when the diagonals of A dominate the off-diagonals, but is quite slow if the largest eigenvalue of H is close to one. When this is the case, Gauss-Seidel can be accelerated using the method of successive over-relaxation (SOR). Suppose that the kth Gauss-Seidel iterate is x̃^(k). Then the kth SOR iterate is given by x^(k) = (1 − w) x^(k−1) + w x̃^(k), where w ≠ 0. When w = 1, the SOR iteration reduces to the Gauss-Seidel. Typically, in SOR, w > 1, which means we are "overcorrecting", i.e., taking a bigger step away from the previous estimate than the Gauss-Seidel iteration would (but in the same direction). In general, the relaxation factor w must be estimated; various choices are discussed in Chapter 6 of Young (1971). As a practical consideration in implementing these algorithms, the original system Ax = b is first rescaled using D^1/2 = diag(√a_ii), so that all divisions involved are performed beforehand. Let us assume that this new system is denoted by the same notation, i.e., Ax = b; D = diag(A) is now an identity matrix, and L and U are strictly lower triangular and strictly upper triangular matrices, respectively. The resulting Jacobi iteration is then given by x^(k) = b − (L + U) x^(k−1), and the Gauss-Seidel iteration by x^(k) = (I + L)^−1(b − U x^(k−1)). To exploit the symmetry that exists in certain
statistical problems, the Gauss-Seidel method has been applied using a "double sweep", i.e., after an iteration updating x_1^(k) through x_n^(k), the elements x_n^(k) through x_1^(k) are then updated in that order. This double sweep ensures an exponential decrease in the size of e.
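The successive-displacement update, with the SOR blend reducing to plain Gauss-Seidel at w = 1, may be sketched as follows (illustrative only; the test system is our choice):

```python
def gauss_seidel(A, b, w=1.0, x0=None, tol=1e-10, max_iter=500):
    """Gauss-Seidel (successive displacement) with an optional
    relaxation factor: w = 1 is plain Gauss-Seidel, w > 1 is SOR."""
    n = len(b)
    x = [0.0] * n if x0 is None else list(x0)
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            # updated values x[j] for j < i are used immediately
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gs = (b[i] - s) / A[i][i]          # Gauss-Seidel value
            new = (1.0 - w) * x[i] + w * gs    # SOR blend
            change = max(change, abs(new - x[i]))
            x[i] = new
        if change < tol:
            break
    return x

# Same system as the Jacobi example: 4x1 + x2 = 1, x1 + 3x2 = 2
x = gauss_seidel([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Note that the only structural difference from the Jacobi sketch is that `x[i]` is overwritten inside the inner loop, so later components already use the updated values.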

Example
Formulate the Jacobi and the point Gauss-Seidel methods for the system of linear equations. Using the definition of the Jacobi method, Eq. (1), we obtain the component-wise updates, where x_1^(0), x_2^(0), and x_3^(0) are initial guesses.
In contrast, the Gauss-Seidel method, Eq. (6), may be written for the first iteration with x_2^(0) and x_3^(0) as initial guesses.
Starting with the initial guesses x_1^(0) = x_2^(0) = x_3^(0) = 0, the two algorithms produce the results given in Tables 1 and 2. The faster convergence of the Gauss-Seidel method is evident from the tables. This is often the case, although examples can be constructed where the Jacobi method is faster.

Convergence Criteria
As we have seen, many numerical algorithms are iterative. When iterative procedures are implemented in a program, it is necessary to check whether the iteration has converged to a solution, or is near a solution to a desired degree of accuracy. Such a test is called a stopping rule and is derived using a convergence criterion. In the root-finding problem, we may use the convergence criterion that the change in successive iterates x^(n) and x^(n+1) is sufficiently small or, alternatively, that f(x^(n)) is close enough to zero. In the univariate root-finding problem, the stopping rules thus may be one or more of the tests |x^(n+1) − x^(n)| < ε, |x^(n+1) − x^(n)|/|x^(n)| < ε, or |f(x^(n))| < ε. A variety of reasons exist for these criteria to fail. Some of these are discussed with examples in Rice (1983), where it is recommended that composite tests be used. In the case that x is a vector, some norm of the difference of iterates, denoted by ||x^(n+1) − x^(n)||, is used in the stopping rule. Usually the norm used is the Euclidean distance between the two vectors (the square root of the sum of squared differences), called the 2-norm.
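A composite test of the kind recommended above might look like the following sketch (the particular combination and tolerance handling are illustrative choices, not prescribed by the text):

```python
def converged(x_new, x_old, f_new, eps=1e-10):
    """Composite stopping rule for univariate root finding: require
    BOTH a small (relative) change in the iterates and a small
    residual |f(x)| before declaring convergence."""
    small_step = abs(x_new - x_old) <= eps * (1.0 + abs(x_old))
    small_residual = abs(f_new) <= eps
    return small_step and small_residual
```

Requiring several criteria simultaneously guards against the failure modes of any single test, e.g. a flat function making |f(x)| small far from the root.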

Standard algorithms
The sample variance of a given set of n data values x_1, x_2, . . ., x_n is denoted by s² and is defined by the formula s² = Σ_{i=1}^n (x_i − x̄)²/(n − 1), where x̄ = Σ_{i=1}^n x_i / n.
The numerator of the formula for s², Σ_{i=1}^n (x_i − x̄)², which will be denoted by S, is the sum of squared deviations of the n data values from their sample mean x̄. The computation of S using the above formula requires two passes through the data: once to compute x̄ and then again to compute S. Hence this method of computation is called a two-pass algorithm. If the sample size is large and the data are not stored in memory, a two-pass algorithm may turn out to be inefficient.
The following is generic code for implementing the two-pass algorithm for computing the sample variance s². In this and other programs given in this note, the data vector x_1, x_2, . . ., x_n is assumed to be available in a standard data structure, such as a properly dimensioned data array x. The one-pass machine (or textbook) formula for computing s² is obtained by expressing S in the form S = Σ_{i=1}^n x_i² − (Σ_{i=1}^n x_i)²/n. It is easy to show that Σ_{i=1}^n (x_i − x̄)² and the above formula are mathematically equivalent. The formula requires that the quantities Σ x_i² and Σ x_i be computed, and this can be achieved, as is easily seen, by a single pass through the data. However, the quantities Σ x_i² and (Σ x_i)²/n may be quite large in practice and will generally be computed with some round-off error. In this case, as seen by observing the above expression, the value of S is computed by the subtraction of two (positive-valued) quantities, each of which may have been computed inaccurately. Thus the possibility exists for S to be computed incorrectly due to the cancellation of leading digits of the two numbers being subtracted. (Cancellation is a topic discussed in detail elsewhere in these notes.) The following generic code may be used to program the one-pass machine formula for computing s² (and x̄). Several alternative one-pass algorithms which are more stable than the above algorithm have been suggested. In the next section one of these algorithms is presented and its Fortran and C implementations are discussed.
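The two formulas may be sketched as follows (illustrative versions, not the note's own generic code; the n − 1 divisor and function names are our choices):

```python
def variance_two_pass(x):
    """Two-pass algorithm: the first pass computes the mean, the
    second accumulates S = sum of squared deviations from it."""
    n = len(x)
    mean = sum(x) / n                       # pass 1
    S = sum((xi - mean) ** 2 for xi in x)   # pass 2
    return S / (n - 1)

def variance_one_pass(x):
    """One-pass 'machine' formula: S = sum(x_i^2) - (sum(x_i))^2 / n.
    A single sweep, but vulnerable to catastrophic cancellation when
    the two accumulated sums are large and nearly equal."""
    n = len(x)
    sum_x = sum(x)
    sum_x2 = sum(xi * xi for xi in x)
    S = sum_x2 - sum_x * sum_x / n
    return S / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, S = 32
```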

Updating Algorithms
During the 1960s and 1970s, several researchers proposed algorithms to overcome the numerical instability associated with the machine formula for computing s². These algorithms involve the computation of the numerator S for subsets of the data increasing in size, at each stage adjusting the previously computed value to account for the effect of adding successive data values. These are called updating algorithms. Youngs and Cramer in 1971 introduced the following algorithm, which has turned out to be the most frequently implemented one-pass algorithm for computing the sample variance.
For single-precision computations this algorithm is more stable than the one-pass machine algorithm because it avoids the computation of Σ x_i² and (Σ x_i)² as intermediate values. Moreover, in this algorithm S is computed as a sum of nonnegative quantities, which avoids any problems arising from cancellation of significant digits.
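The Youngs-Cramer update maintains the running sum T_k and the running sum of squared deviations S_k, using the published recurrence S_k = S_{k−1} + (k x_k − T_k)²/(k(k−1)); a sketch (variable names are our own):

```python
def variance_youngs_cramer(x):
    """Youngs-Cramer (1971) updating algorithm for the sample
    variance.  S grows only by nonnegative increments, so there is
    no cancellation of significant digits."""
    T = x[0]          # running sum of the first k values
    S = 0.0           # running sum of squared deviations
    for k in range(2, len(x) + 1):
        xk = x[k - 1]
        T += xk
        # nonnegative update term: (k*x_k - T_k)^2 / (k*(k-1))
        S += (k * xk - T) ** 2 / (k * (k - 1))
    return S / (len(x) - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # mean 5, S = 32
```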
Continuing the hexadecimal example, (26)_16 = 2 · 16^1 + 6 · 16^0 = 32 + 6 = (38)_10. It is easy to convert numbers from hexadecimal to binary and back, since a maximum of four binary digits can represent one hex digit. To convert from hex to binary, one replaces each hex digit by its binary equivalent.

The IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE) defines formats for representing floating-point numbers and a set of operations for performing floating-point arithmetic in computers. It is the most widely used standard for floating-point operations on modern computers, including Intel-based PC's, Macintoshes, and most Unix platforms.

a) The same number of real numbers is representable in intervals that become progressively wider, so the density of representable numbers decreases as the range increases. This implies that floating-point representable numbers are not evenly distributed over the set of real numbers. Nevertheless, the IEEE floating-point format allows the representation of most real numbers to an acceptable degree of accuracy. For example, the decimal number 3.1415926535 is stored in single-precision floating-point as 0 10000000 10010010000111111011011, resulting in a value of 3.1415927, but in double-precision floating-point representation as 0 10000000000 1001001000011111101101010100010000010001011101000100, which results in the decimal equivalent 3.1415926535000000, i.e., exactly.

b) The error in representing a real number x as the nearest floating-point number fl(x) is measured by the relative error ε = |fl(x) − x|/|x|, which can be expressed in the form fl(x) = x(1 ± ε). For example, the relative error in representing 3.1415926535 in 5-digit decimal floating-point is |3.1416 − 3.1415926535|/|3.1415926535| = .000002339, implying that 3.1416 is actually 3.1415926535 times the factor 1.000002339. It can be shown that |ε| ≤ u, where u, called the machine unit, is the upper bound on the relative error in representing a number x in floating-point. For the IEEE binary floating-point standard with rounding to the nearest, u = (1/2) · 2^−23.

c) In addition to the creation of special values like NaN's and infinities as a result of arithmetic operations, arithmetic exceptions called underflow and overflow may occur if a computed result is outside the representable range. Negative numbers less than −(2 − 2^−23) × 2^127 cause negative overflow, and positive numbers greater than (2 − 2^−23) × 2^127 cause positive overflow. In the denormalized range, negative numbers greater than −2^−149 result in negative underflow and positive numbers less than 2^−149 in positive underflow. The IEEE standard requires exceptions to be flagged or an overflow/underflow trap handler to be called.

d) Round-off error occurs when floating-point arithmetic is carried out. For example, consider the addition of the two numbers 100.0 and 1.01 in 3-digit decimal floating-point arithmetic (assuming rounding to the nearest and a single guard digit). In effect, the operation is performed as (+1.00, 2)_10 + (+0.0101, 2)_10. The exact sum is 101.01, while the floating-point computation gives (+1.01, 2), i.e., the result of the operation is 101.0. Thus the absolute error is |fl(op) − op| = .01. The number of correct digits can be calculated as −log_10(relative error), which in this example is ≈ 4. Accumulated round-off (or error propagation) is a major problem in computation; for example, consider the computation of the inner product x^T y = Σ x_i y_i.

The number of arithmetic operations also matters: the polynomial 7.0 + 3.0x + 4.0x² − 9.0x³ + 5.0x⁴ + 2.0x⁵ can be evaluated in the nested form ((((2.0 · x + 5.0) · x − 9.0) · x + 4.0) · x + 3.0) · x + 7.0, which requires only 5 multiplications, 10 multiplications fewer than evaluating it using the expression 7.0 + 3.0 · x + 4.0 · x · x − 9.0 · x · x · x + . . .
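The nested evaluation above generalizes to any polynomial; a sketch (the coefficient ordering, from the highest-degree term down, is our own convention):

```python
def horner(coeffs, x):
    """Evaluate a polynomial by nested (Horner) multiplication.
    coeffs run from the highest-degree term down, so
    [2.0, 5.0, -9.0, 4.0, 3.0, 7.0] means
    2x^5 + 5x^4 - 9x^3 + 4x^2 + 3x + 7.
    Only len(coeffs) - 1 multiplications are performed."""
    result = coeffs[0]
    for c in coeffs[1:]:
        result = result * x + c
    return result

# ((((2.0*x + 5.0)*x - 9.0)*x + 4.0)*x + 3.0)*x + 7.0 at x = 2
value = horner([2.0, 5.0, -9.0, 4.0, 3.0, 7.0], 2.0)
```

Besides saving multiplications, the nested form often accumulates less round-off than summing separately computed powers of x.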