Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Abstract: Multiplication by a constant is a common operation for many signal, image, and video processing applications that are implemented in field-programmable gate arrays (FPGAs). Constant-coefficient multipliers (KCMs) are often implemented in the logic fabric using lookup tables (LUTs), reserving embedded hard multipliers for general-purpose multiplication. This paper describes a two-operand addition circuit from previous work and shows how it can be used to generate and add pre-computed partial products to implement KCMs. A novel method for pre-computing partial products for KCMs with a negative constant is also presented. These KCMs are then extended to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that the proposed pipelined KCMs use 27.4% fewer LUTs on average and have a median LUT-delay product that is 12% lower than comparable LogiCORE IP KCMs. Proposed pipelined KCMs with two to eight selectable coefficients use 46% to 70% fewer LUTs than the best LogiCORE IP based alternative, and most are faster than a LogiCORE IP multiplier with a coefficient lookup function. They also outperform the state of the art in the literature, using 22% to 57% fewer slices than the smallest pipelined adder graph (PAG) fusion designs and operating 7% to 30% faster than the fastest PAG fusion designs for the same operand size and number of selectable coefficients. For KCMs and KCMs with selectable coefficients of a given operand size, the placement and routing of LUTs remain the same for all positive and negative constant values, which is advantageous for runtime partial reconfiguration.


Introduction
Field-programmable gate arrays (FPGAs) are often used for computationally intensive applications such as digital-signal processing (DSP), video and image processing, and artificial neural network (ANN) based applications such as machine learning and artificial intelligence. For these applications and others, multiplication is the dominant operation in terms of required resources, delay and power consumption. In many cases, one of the operands is a constant and the multiplier is called a constant-coefficient multiplier (KCM). Most contemporary FPGAs have embedded hard multipliers distributed throughout the fabric due to the importance of multiplication. Even so, soft KCMs based on lookup tables (LUTs) in the configurable logic fabric are often used for high-performance designs for several reasons:
• Embedded multiplier operands are fixed in size and type, such as 25 × 18 two's complement, while LUT-based KCM operands can be any size or type;
• The number and location of embedded multipliers are fixed, while LUT-based KCMs can be placed anywhere and the number is limited only by the size of the reconfigurable fabric;
• Embedded multipliers cannot be modified, while LUT-based KCMs can use techniques such as merged arithmetic [1] and approximate arithmetic [2] to optimize the overall system.
One approach to designing a KCM is to build lookup tables of pre-computed partial products, indexed by one or more bits of the variable operand, and sum them to produce the product. Chapman's KCM algorithm uses LUT-based lookup tables to generate radix-16 partial products, specifically targeting Xilinx FPGAs with 4-input LUTs [3,4]. Wirthlin generalizes this approach and presents a method to merge the lookup with addition logic that is also specific to Xilinx FPGAs with 4-input LUTs [5]. Hormigo et al. extend Wirthlin's work to include runtime self-reconfiguration [6]. These approaches target FPGA implementations.
Another approach to designing a KCM is to sum shifted copies of the variable operand that correspond to non-zero digits of the constant. Canonical signed digit (CSD) recoding gives a structure that requires at most m/2 and on average m/3 add/subtract operations, where m is the number of bits in the constant [7]. Sub-expressions can be shared to further reduce the number of add/subtract operations [8,9]. Turner and Woods present a technique to design reduced coefficient multipliers (RCMs) that operate on a limited set of coefficients [10], exploiting the observation that LUTs used to implement add/subtract operations have unused inputs. This is also known as time-multiplexed multiple-constant multiplication, where a variable input is multiplied by one of several constants selected by a control input to produce a single output. Tummeltshammer et al. present an algorithm for time-multiplexed multiple-constant multiplication, useful for finite-impulse response (FIR) filters and other sum-of-product computations, which fuses directed acyclic graph (DAG) solutions for multiplication by each constant into a time-multiplexed DAG [11]. Their work is optimized for application-specific integrated circuit (ASIC) implementations. Kumm et al. present a heuristic they call reduced pipelined adder graph (RPAG) that includes provisions for pipelining, which is especially important for FPGA implementations [12]. Möller et al. extend the RPAG heuristic by applying the fusion concept of Tummeltshammer et al., which they call pipelined adder graph (PAG) fusion [13]. PAG fusion is a heuristic that specifically targets FPGAs and is able to search for opportunities to use three-input (ternary) adders, which are available on recent Xilinx and Altera FPGAs and use roughly the same resources as two-input adders. The work of Möller et al. also incorporates low-level optimizations using primitives for Xilinx FPGAs that use fewer resources than allowing the tools to interpret hardware description language (HDL) models that do not specify primitives.
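The shift-and-add approach can be illustrated with a short Python sketch of CSD recoding. It is a software illustration only, not taken from the cited works; the function names `csd_digits` and `csd_multiply` are hypothetical. Each non-zero digit of the recoded constant corresponds to one shifted copy of the variable operand that is added or subtracted.

```python
def csd_digits(value):
    """Return the canonical signed digits of a non-negative integer, LSB first.

    Each digit is -1, 0 or +1, and no two adjacent digits are both non-zero.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 gives +1, value mod 4 == 3 gives -1
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def csd_multiply(constant, x):
    """Multiply x by a constant as a sum of shifted, signed copies of x."""
    return sum(d * (x << k) for k, d in enumerate(csd_digits(constant)))


digits = csd_digits(7)          # 7 = 8 - 1, so one subtraction replaces two additions
product = csd_multiply(45, 3)   # same value as 45 * 3
```

The number of add/subtract operations is one less than the number of non-zero digits, which is why recoding a constant such as 7 (binary 111) as 8 − 1 saves an operation.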
This paper describes an approach that uses a novel two-operand addition circuit [14][15][16] that combines generation of a pre-computed partial product with addition of another value, similar to Wirthlin's work but optimized for Xilinx FPGAs with 6-input LUTs. A novel approach is used for the case where the constant is negative. A design variation for KCMs with two, four or eight selectable coefficients is also presented. The discussion and results focus on the Xilinx 7 Series FPGAs, but the technique is applicable to the Spartan-6, Virtex-5, Virtex-6, UltraScale and newer Xilinx FPGAs that use 6-input LUTs.
The paper is organized as follows. Section 2 discusses relevant FPGA architecture and the two-operand adder used to make the proposed KCMs. Section 3 describes the proposed LUT-based constant-coefficient multipliers. Section 4 extends proposed designs to handle two, four or eight selectable coefficients. Synthesis results are discussed in Section 5 and conclusions are given in Section 6.

Background
This section describes details of the Xilinx logic fabric and the proposed two-operand adder.

FPGA Logic Fabric
The main logic resource for implementing combinational and sequential circuits in a Xilinx FPGA is the configurable logic block (CLB). Each CLB has two slices. Figure 1 is a partial diagram of a 7 Series FPGA slice. Each slice has four 6-input lookup tables (LUT6s) designated A, B, C, and D.
Each LUT6 is composed of two 5-input lookup tables (LUT5s) and a 2-to-1 multiplexer. The two LUT5s are 32 × 1 memories that share five inputs designated I5:I1. The memory values are designated M[63:32] in one LUT5 and M[31:0] in the other LUT5. The output of the M[31:0] LUT5 is designated O5. The sixth input, I6, is input to a multiplexer that selects one of the LUT5 outputs. The selected output is designated O6. The LUT6 is normally configured as either two LUT5s with five shared inputs and two outputs by connecting I6 to logic '1', or as one LUT6 with six inputs and one output by connecting the sixth input to I6 [17,18]. A multiplexer (MUXCY) and an XOR gate (XORCY) are associated with each LUT6. Inputs to the MUXCY associated with the A LUT6 are a select signal, $\mathit{prop}_i$, a first data input, $\mathit{gen}_i$, and a second data input, $c_i$. The output of the MUXCY, $c_{i+1}$, is connected to the MUXCY associated with the B LUT6. These connections continue through the C and D LUT6s to form a fast carry chain within the slice. The $c_{i+4}$ output of the slice, COUT, can be connected to the $c_i$ input of the next slice, CIN, to form longer carry chains. The $\mathit{prop}$ signal is driven by the O6 output of the corresponding LUT6. The $\mathit{gen}$ signal is selected by a configuration multiplexer and is either the O5 output of the corresponding LUT6 or the bypass input, which is designated AX, BX, CX, or DX.
Two flip-flops are associated with each LUT6. One flip-flop can be used to register O5 or the bypass input. The other flip-flop can be used to register O5, O6, the bypass input, the MUXCY output, or the XORCY output.

Proposed Two-Operand Adder
Suppose X and Y are to be added using the Xilinx fast carry logic. For the i-th column of the adder, $x_i$ and $y_i$ are the bits of X and Y, respectively, $c_i$ is the carry-in bit, $c_{i+1}$ is the carry-out bit and $s_i$ is the sum bit. The $\mathit{prop}_i$ signal must be set to $x_i \oplus y_i$ and the $\mathit{gen}_i$ signal can be set to either $x_i$ or $y_i$ to add $x_i$ and $y_i$ [14,16]. If $x_i$ and $y_i$ together are a function of five or fewer inputs, then the LUT6 can be configured as two LUT5s, generating either $x_i$ or $y_i$ at O5 and routing it to $\mathit{gen}_i$, and generating $x_i \oplus y_i$ at O6 to drive $\mathit{prop}_i$. If $x_i$ and $y_i$ together are a function of six inputs, then the LUT6 can be configured to generate $x_i \oplus y_i$ at O6 to drive $\mathit{prop}_i$, and $x_i$ or $y_i$ can be applied to the bypass input and configured to drive the $\mathit{gen}_i$ input. A disadvantage of this configuration is that the bypass input is occupied, so the bypass flip-flop cannot be used.
Normally, a LUT6 can be used to either generate a function of six inputs at O6 or to generate two functions of five inputs at O5 and O6 [17,18]. However, in some cases, one function of six variables can be output at O6 and a separate function of five shared variables can be output at O5. Suppose $x_i$ is a function of one variable connected to I6 and $y_i$ is a function of five variables connected to I5:I1. The function $y_i$ is stored in M[31:0], so $y_i$ is output at O5. If $x_i$ is '0', $y_i$ is also output at O6. If $x_i$ is '1', the function stored in M[63:32] is output at O6. If the complement $\overline{y_i}$ is stored in M[63:32], then $x_i \oplus y_i$ is generated at O6 and $y_i$ is generated at O5. This can be used to add $x_i$ and $y_i$ without using the bypass input when $x_i$ is a function of one variable and $y_i$ is a function of up to five variables. Figure 2 shows the connections for this configuration. This frees the bypass input to be connected to the bypass flip-flop to implement additional registers. Input I6 has the shortest delay path and I1 has the longest [17], so this method also allows faster inputs to be used. The carry into the proposed adder, $c_0$, can be used to implement subtraction or to add an extra bit to the least significant column.
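The behaviour of this configuration can be checked with a small bit-level model in Python. This is a minimal sketch under stated assumptions, not the authors' Verilog: the helper names (`lut6`, `make_memory`, `add_generated`) and the example constant `K` are hypothetical, and the five shared inputs I5:I1 are modelled as a single 5-bit address. Each column stores $y_i$ in M[31:0] and its complement in M[63:32], so that O5 drives the MUXCY data input and O6 drives the select.

```python
def lut6(memory, i6, addr):
    """Model a LUT6 as two LUT5s and a 2-to-1 multiplexer. Returns (O6, O5)."""
    low = (memory >> addr) & 1            # M[31:0] LUT5, always visible at O5
    high = (memory >> (32 + addr)) & 1    # M[63:32] LUT5
    o6 = high if i6 else low              # I6 selects which LUT5 reaches O6
    return o6, low


def make_memory(y_of_addr):
    """Store y in M[31:0] and the complement of y in M[63:32]."""
    memory = 0
    for addr in range(32):
        y = y_of_addr(addr) & 1
        memory |= y << addr
        memory |= (y ^ 1) << (32 + addr)
    return memory


def add_generated(x, addr, memories, carry_in=0):
    """Add X (one bit per column on I6) to the value Y generated from addr."""
    carry, total = carry_in, 0
    for i, memory in enumerate(memories):
        prop, gen = lut6(memory, (x >> i) & 1, addr)  # O6 = x XOR y, O5 = y
        total |= (prop ^ carry) << i                  # XORCY forms the sum bit
        carry = carry if prop else gen                # MUXCY forms the carry-out
    return total | (carry << len(memories))


K = 5        # hypothetical constant: Y = K * addr is pre-computed in the LUTs
WIDTH = 8    # number of adder columns
memories = [make_memory(lambda addr, i=i: (K * addr) >> i) for i in range(WIDTH)]
result = add_generated(100, 7, memories)   # intended to equal 100 + 5 * 7
```

Because the generated operand comes entirely from the LUT contents, changing the constant changes only the memory values, not the wiring of the adder.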

Proposed Constant-Coefficient Multipliers
This section describes how the proposed constant-coefficient multipliers (KCMs) are implemented and pipelined.

Radix-2 Multiplication by a Constant
Suppose A is an m-bit constant, B is an n-bit variable and P = A · B is to be computed. If A and B are unsigned integers, then
$$A = \sum_{i=0}^{m-1} a_i 2^i, \qquad B = \sum_{j=0}^{n-1} b_j 2^j$$
and the product is
$$P = \sum_{j=0}^{n-1} \sum_{i=0}^{m-1} a_i b_j 2^{i+j}. \qquad (3)$$
If A is positive and B is signed, then
$$B = -b_{n-1} 2^{n-1} + \sum_{j=0}^{n-2} b_j 2^j$$
and the product can be computed using Baugh and Wooley's approach [19] as
$$P = \sum_{j=0}^{n-2} \sum_{i=0}^{m-1} a_i b_j 2^{i+j} + \sum_{i=0}^{m-1} \overline{a_i b_{n-1}}\, 2^{i+n-1} + 2^{m+n-1} + 2^{n-1}, \qquad (6)$$
with the result taken modulo $2^{m+n}$. Figure 3 shows a (6 × 6)-bit KCM, where A is a positive constant and B is a two's-complement variable as described by Equation (6). The least-significant column has a weight of $2^0$ to simplify equations and column references, but the results in this work are applicable to fixed-point multipliers by applying appropriate shifts and placement of the binary point.
If A is negative, it could be coded in two's complement form and Baugh and Wooley's approach could be used to develop an equation for the product. A would have m − 1 bits of useful precision instead of m bits because the most-significant bit (MSB) would always be '1'. In the proposed designs, the magnitude of A is used with an implicit negative sign bit and Equation (3) is used if B is unsigned or Equation (6) is used if B is signed. The product is then negated by negating each row of partial products.
Each bit, including implicit leading '0's, is complemented and '1' is added to the least-significant bit (LSB) in each row. The constants are then pre-added to simplify the matrix. If A is negative and B is unsigned, then
$$P = \sum_{j=0}^{n-1} \sum_{i=0}^{m-1} \overline{a_i b_j}\, 2^{i+j} + 2^{m+n} + 2^{m} + 2^{n} - 2^{0}, \qquad (7)$$
where $a_i$ are the bits of the magnitude of A. The product is m + n + 1 bits to accommodate the sign bit. The product is always negative so the MSB is always '1' and does not require any logic. If A is negative and B is signed, then
$$P = \sum_{j=0}^{n-2} \sum_{i=0}^{m-1} \overline{a_i b_j}\, 2^{i+j} + \sum_{i=0}^{m-1} a_i b_{n-1} 2^{i+n-1} + 2^{m+n-1} + 2^{m} + 2^{n-1} - 2^{0}. \qquad (8)$$
The product is m + n bits assuming $|A| \le 2^m - 1$. If $|A| = 2^m$, a hard-wired shift and negation of the product would be used instead of a KCM. Figure 4 shows a (6 × 6)-bit KCM, where A is the magnitude of a negative constant and B is a two's-complement variable as described by Equation (8).
Figure 5 shows a dot diagram of a proposed (12 × 12)-bit KCM, where A is a negative constant and B is a two's complement variable. Each dot is a partial-product bit that corresponds to a bit in Equation (8). Each row j of partial-product bits is a function of only one variable bit, $b_j$. The rows of partial-product bits are divided into groups, each of which is summed to produce a partial product, $P_\rho$. Each partial product $P_\rho$ is the sum of $j_\rho$ rows of partial-product bits. In the example of Figure 5, the first five rows of partial-product bits are grouped and their sum is $P_0$. $P_0$ is a function of the constant A, the constant $2^{12} + 2^{5} - 2^{0}$ and a 5-bit sub-vector of the variable B, B[4:0]. The $2^5$ possible values of $P_0$ are pre-computed and generated using LUT6s. Each LUT6 generates two bits of $P_0$, $p_{0,i+1}$ and $p_{0,i}$. The next five rows of partial-product bits are grouped and their sum is $P_1$, which is a function of A, $2^{10} - 2^{5}$ and B[9:5]. The $2^5$ possible values of $P_1$ are pre-computed and generated by a proposed two-operand adder, which adds the generated value to $P_0$ and produces an accumulated sum, $X_1$.
The final two rows of partial-product bits are grouped and their sum is $P_2$, which is a function of A, $2^{23} + 2^{10}$ and B[11:10]. The $2^2$ possible values of $P_2$ are pre-computed, generated by another proposed two-operand adder, and added to $X_1$ to produce an accumulated sum $X_2$. The five least-significant bits of the final product, P[4:0], are the five LSBs of $P_0$. The next five LSBs of the product, P[9:5], are the five LSBs of the accumulated sum $X_1$. The remaining bits of the product, P[23:10], are the accumulated sum, $X_2$.
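The grouping of rows into pre-computed partial products can be mimicked with a functional Python model. This is a sketch under stated assumptions rather than the hardware structure: the names `build_tables` and `kcm_negative` and the example magnitude 1234 are hypothetical, and each table entry is stored at the full product width instead of distributing the constants among the groups as in Figure 5. It shows only that looking up one value per group of bits of B and accumulating the results yields the negated product.

```python
def build_tables(magnitude, m, n, group=5):
    """Pre-compute one table per group of rows for P = -magnitude * B.

    B is an n-bit two's complement value, so row n-1 has negative weight.
    Returns a list of (start_bit, width, table) tuples.
    """
    mask = (1 << (m + n)) - 1
    tables = []
    for start in range(0, n, group):
        rows = range(start, min(start + group, n))
        table = []
        for index in range(1 << len(rows)):
            value = 0
            for offset, j in enumerate(rows):
                bit = (index >> offset) & 1
                weight = -(1 << j) if j == n - 1 else (1 << j)
                value += -magnitude * bit * weight
            table.append(value & mask)    # keep m + n bits, two's complement
        tables.append((start, len(rows), table))
    return tables


def kcm_negative(tables, b_bits, m, n):
    """Look up each partial product from its sub-vector of B and accumulate."""
    mask = (1 << (m + n)) - 1
    accumulated = 0
    for start, width, table in tables:
        index = (b_bits >> start) & ((1 << width) - 1)
        accumulated = (accumulated + table[index]) & mask
    return accumulated


M, N = 12, 12
MAGNITUDE = 1234                     # hypothetical |A|; the constant is -1234
tables = build_tables(MAGNITUDE, M, N)
group_sizes = [width for _, width, _ in tables]   # rows per partial product
p = kcm_negative(tables, (-7) & 0xFFF, M, N)      # B = -7 as a 12-bit pattern
```

For n = 12 the rows fall into groups of five, five and two, matching the three partial products described for Figure 5.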

Design of Proposed Constant-Coefficient Multiplier
In a proposed (m × n)-bit KCM, all of the partial-product bits are grouped into $\lceil (n-1)/5 \rceil$ partial products. Each partial product, $P_\rho$, is the sum of $j_\rho$ rows of partial-product bits. When n − 1 is an exact multiple of five, such as when n = 16, $P_0$ is the sum of six rows and each of the other partial products is the sum of five rows. When n − 1 is not an exact multiple of five, each partial product is the sum of five rows except possibly the last, which is the sum of the remaining rows. $P_0$ is the sum of the first $j_0$ rows of partial-product bits and is generated using LUT6s. When $P_0$ is the sum of six rows, each bit $p_{0,i}$ is a function of six variables, B[5:0], so each LUT6 generates one output bit. When $P_0$ is the sum of five rows, each pair of bits $p_{0,i+1}$ and $p_{0,i}$ are functions of the same five variables, B[4:0], so each LUT6 generates two output bits. $P_0$ is $m + j_0$ bits long, so $m + j_0$ LUT6s are required if $j_0 = 6$ and $\lceil (m + j_0)/2 \rceil$ LUT6s are required if $j_0 \le 5$.
The remaining partial products, $P_\rho$ where $\rho \ge 1$, are each generated using a proposed two-operand adder. The proposed two-operand adder generates a function of up to five variables, so it is most efficient when $P_\rho$ is the sum of five rows of partial-product bits. $P_\rho$ is $m + j_\rho$ bits long, so $m + j_\rho$ LUT6s are required for each two-operand adder.
Constant '1's can be grouped with any partial product and are simply included in each pre-computed value. In practice, groups are selected so that constant '1's do not increase the length of the partial product.
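When n − 1 is an exact multiple of five, each partial product requires $m + j_\rho$ LUT6s. There are (n − 1)/5 partial products and $\sum_\rho j_\rho = n$, so the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m\,\frac{n-1}{5} + n. \qquad (9)$$
When n − 1 is not an exact multiple of five, there are $\lceil (m + j_0)/2 \rceil$ LUT6s instead of $m + j_0$ LUT6s in the first row, so the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m \left\lceil \frac{n-1}{5} \right\rceil + n - \left\lfloor \frac{m+5}{2} \right\rfloor. \qquad (10)$$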
Some LUTs may be optimized away during synthesis, so these equations give the maximum number of required LUT6s.
Figure 6 shows the structure of the proposed (12 × 12)-bit KCM from the example of Figure 5. The top row of LUT6s generates the first five rows of partial-product bits and outputs the sum, $P_0$. The second row of LUT6s implements a proposed two-operand adder that generates the sum of the next five rows of partial-product bits, $P_1$, and adds it to $P_0$ to produce an accumulated sum, $X_1$. The third row of LUT6s implements another two-operand adder that generates the sum of the last two rows of partial-product bits, $P_2$, and adds it to $X_1$ to produce an accumulated sum, $X_2$. The KCM output, P, is composed of the five LSBs of $P_0$, the five LSBs of $X_1$, and $X_2$.
The proposed KCM can be pipelined by placing registers after each row of LUT6s. The first stage registers $m + j_0$ bits of the final product P and $n - j_0$ bits of B, which requires m + n flip-flops. Subsequent stages register $m + j_\rho + 1$ bits of $X_\rho$, $j_\rho$ additional bits of P and $j_\rho$ fewer bits of B, which requires m + n + 1 flip-flops. The last stage registers the output P, which requires m + n flip-flops. There are $\lceil (n-1)/5 \rceil$ stages, and each stage registers m + n + 1 bits except the first and last stages, which register m + n bits each, so the maximum number of flip-flops required is
$$\#\text{FFs} \le (m + n + 1) \left\lceil \frac{n-1}{5} \right\rceil - 2. \qquad (11)$$

Array Structure and Pipelining
Each LUT6 used in the KCM has two available flip-flops so there are more than enough flip-flops available within the footprint of the multiplier to implement pipeline registers. The structure is very regular and easy to place in the logic fabric so that routing paths are short and fast.
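The row grouping and register counts described in this section can be tabulated with a few lines of Python. This is a minimal sketch that follows the prose rules above; the function names `row_groups`, `luts_per_row` and `flip_flops` are hypothetical, and the flip-flop count assumes at least two pipeline stages, with the first and last stages registering m + n bits and every other stage registering m + n + 1 bits.

```python
def row_groups(n, rows_per_group=5):
    """Number of rows of partial-product bits summed by each partial product."""
    count = -(-(n - 1) // rows_per_group)          # ceil((n - 1) / rows_per_group)
    if (n - 1) % rows_per_group == 0:
        # The first partial product absorbs one extra row.
        return [rows_per_group + 1] + [rows_per_group] * (count - 1)
    groups = [rows_per_group] * (count - 1)
    groups.append(n - rows_per_group * (count - 1))  # the last group takes the rest
    return groups


def luts_per_row(m, n, rows_per_group=5):
    """Maximum LUT6s in each row of the array, before synthesis optimization."""
    groups = row_groups(n, rows_per_group)
    j0 = groups[0]
    if j0 == rows_per_group + 1:
        first = m + j0                   # one output bit per LUT6
    else:
        first = -(-(m + j0) // 2)        # two output bits per LUT6
    return [first] + [m + j for j in groups[1:]]


def flip_flops(m, n, rows_per_group=5):
    """Maximum pipeline flip-flops, assuming at least two stages."""
    stages = len(row_groups(n, rows_per_group))
    return (m + n + 1) * stages - 2


groups_12 = row_groups(12)         # grouping for the (12 x 12)-bit example
rows_12 = luts_per_row(12, 12)     # LUT6s in each of the three rows
ffs_12 = flip_flops(12, 12)        # registers for the pipelined version
```

Comparing the flip-flop count with twice the LUT6 count gives a quick check that the pipeline registers fit within the footprint of the multiplier.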

Discussion
When n = 10, the first row of the KCM computes the sum of five partial products using LUT6s. Each LUT6 computes two bits of the sum, except for one LUT6 that computes only one bit if m is even. The second row of the KCM also computes the sum of five partial products and adds them to the sum from the first row. This is very efficient because both rows are computing the maximum number of partial-product bits per LUT6. When n is increased to n = 11, the second row still computes the sum of five partial products, but the first row now computes the sum of six partial products, so each LUT6 only computes one bit of the sum. This causes a jump in the number of LUTs required to implement the KCM. When n is increased to n = 12, the first row computes the sum of five partial products, which reduces the number of LUT6s in that row compared to n = 11. The second row still computes the sum of five partial products. However, a third row is now required, which causes another jump in the number of LUTs required to implement the KCM. When n is increased to n = 13, the first and second rows still compute the sum of five partial products each. The third row computes the sum of three partial products, compared to two for n = 12, which only requires one additional LUT6 plus an additional LUT6 per bit that m increases, so the increase in the number of LUTs required to implement the KCM is not as large as the increase from n = 10 to n = 11 or from n = 11 to n = 12. The situation is similar when n is increased to n = 14 and again when n is increased to n = 15. When n is increased to n = 16, the first row computes the sum of six partial products, which causes a jump in the number of required LUTs as it does when n increases from n = 10 to n = 11. This cycle repeats itself as n is increased. The significance of this is that for a given value of m, KCMs with n ∈ {10, 15, 20, 25, ...} are generally the most efficient in terms of required LUTs, while KCMs with n ∈ {12, 17, 22, 27, ...} are generally the least efficient.
The value of m does not affect the number of rows in the KCM, so there are no jumps in the required number of LUTs as m is increased. If n − 1 is an exact multiple of five, there are (n − 1)/5 rows in the KCM and the first row requires one LUT6 per bit of the sum. As m is increased, each row of the KCM requires one additional LUT6 per bit that m increases, so a total of ∆m((n − 1)/5) additional LUT6s are required. If n − 1 is not an exact multiple of five, there are $\lceil (n-1)/5 \rceil$ rows in the KCM and the first row requires approximately one half of an LUT6 per bit of the sum. As m is increased, the KCM requires approximately one half of an additional LUT6 for the first row and one additional LUT6 for each of the other rows per bit that m increases, so a total of $\lceil (n-1)/5 \rceil - \tfrac{1}{2}$ additional LUT6s are required per bit that m increases. The significance of this is that, for a given value of n, the increase in the number of LUTs required to implement the KCM as m increases is approximately linear, and the value of m has a much lower impact than n on the efficiency of the implementation in terms of required LUTs.
Figure 7 shows the number of LUT6s required for KCMs as m and n are varied, based on Equations (9) and (10). These functions are discrete and the points are connected by lines for readability only, not to imply continuity. The middle set of points is the case where m = n. The total number of LUTs required for the KCM increases as m = n increases, with jumps from n = 10 to n = 11, from n = 11 to n = 12, etc., due to n increasing as discussed earlier. The other sets of points are cases where m ∈ {1.5n, 1.25n, 0.75n, 0.5n}. This results in m having a fractional value for many points, which is not possible. However, those fractional values are used to compute the points because the intent of the graph is to show how the number of LUTs scales with m, not to show an exact number of LUTs.
The graph shows that for a given value of n, the change in the number of LUTs required is roughly proportional to ∆m.
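The cycle described in this discussion can be reproduced numerically. The sketch below is an illustration only: it assumes the closed-form bound m(n − 1)/5 + n when n − 1 is an exact multiple of five and m⌈(n − 1)/5⌉ + n − ⌊(m + 5)/2⌋ otherwise, and the function name `max_luts` is hypothetical. It evaluates the bound for a fixed m = 12 while n runs from 10 to 17.

```python
def max_luts(m, n):
    """Upper bound on LUT6s for a single-coefficient (m x n)-bit KCM."""
    if (n - 1) % 5 == 0:
        return m * (n - 1) // 5 + n
    rows = -(-(n - 1) // 5)              # ceil((n - 1) / 5)
    return m * rows + n - (m + 5) // 2


m = 12
sizes = list(range(10, 18))
counts = [max_luts(m, n) for n in sizes]

# Increase in LUT6s each time n grows by one bit.
jumps = [b - a for a, b in zip(counts, counts[1:])]

# Partial-product bits handled per LUT6, a rough efficiency measure.
efficiency = [m * n / c for n, c in zip(sizes, counts)]

for n, c, e in zip(sizes, counts, efficiency):
    print(f"n={n:2d}  LUT6s<={c:3d}  bits/LUT6={e:.2f}")
```

Under these assumptions the increments are large going from n = 10 to 11 and from 11 to 12, small from 12 through 15, and large again at 16, with the bits-per-LUT6 measure peaking at n = 10 and n = 15.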

Proposed KCMs with Selectable Coefficients
Turner and Woods present a reduced-coefficient multiplier (RCM) that can operate on a limited set of coefficients, selectable at run-time [10]. Their multipliers use canonical signed digit (CSD) recoding and sub-expression elimination to reduce the number of add/subtract operations. This section discusses how the proposed KCMs can be modified to incorporate this idea and operate on a set of two, four or eight coefficients, selectable at run-time.

Proposed KCMs with Two Selectable Coefficients
A KCM with two selectable coefficients requires one input to select the coefficient. Partial products for both coefficients are pre-computed and generated using LUT6s for each $P^{[k]}_0$, and generated using proposed two-operand adders for the rest of the partial products, $P^{[k]}_i$.
One input to each LUT6 used to generate $P^{[k]}_0$ is needed to select the coefficient, so only five inputs are left to select the pre-computed value of $P^{[k]}_0$ if each LUT6 generates one bit, $p^{[k]}_{0,i}$, and only four inputs are left if each LUT6 generates two bits, $p^{[k]}_{0,i+1}$ and $p^{[k]}_{0,i}$. One of the $y_i$ inputs to each LUT6 in each of the adders is needed to select the coefficient, so only four inputs are left to select the pre-computed value of $P^{[k]}_i$. Therefore, all of the partial-product bits in a KCM with two selectable coefficients are grouped into $\lceil (n-1)/4 \rceil$ partial products.
When n − 1 is an exact multiple of four, each partial product requires $m + j^{[k]}_\rho$ LUT6s. There are (n − 1)/4 partial products, and $\sum_\rho j^{[k]}_\rho = n$, so the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m\,\frac{n-1}{4} + n. \qquad (12)$$
When n − 1 is not an exact multiple of four, there are $\lceil (m + j^{[k]}_0)/2 \rceil$ LUT6s instead of $m + j^{[k]}_0$ LUT6s in the first row, so the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m \left\lceil \frac{n-1}{4} \right\rceil + n - \left\lfloor \frac{m+4}{2} \right\rfloor. \qquad (13)$$
Some LUTs may be optimized away during synthesis, so these equations give the maximum number of required LUT6s. Figure 9 shows a dot diagram of a proposed (12 × 12)-bit KCM with two selectable coefficients, where $A^{[k]}$ is a negative constant and $B^{[k]}$ is a two's complement variable (cf. Figure 5). In this example, no additional adders are needed and the unit has a very similar footprint to the single-coefficient KCM. Other operand sizes usually require one or more additional adders.

Proposed KCMs with Four Selectable Coefficients
A KCM with four selectable coefficients requires two inputs to select the coefficient. Partial products for each coefficient are pre-computed and generated using LUT6s for each $P^{[k]}_0$ and the proposed two-operand adders generate and add the rest of the partial products.
Two inputs to each LUT6 used to generate $P^{[k]}_0$ are needed to select the coefficient, so only four inputs are left to select the pre-computed value of $P^{[k]}_0$ if each LUT6 generates one bit and only three inputs are left if each LUT6 generates two bits. Two of the $y_i$ inputs to each LUT6 in each of the adders are needed to select the coefficient, so only three inputs are left to select the pre-computed value of $P^{[k]}_i$. Therefore, all of the partial-product bits in a KCM with four selectable coefficients are grouped into $\lceil (n-1)/3 \rceil$ partial products.
When n − 1 is an exact multiple of three, the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m\,\frac{n-1}{3} + n. \qquad (14)$$
When n − 1 is not an exact multiple of three, the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m \left\lceil \frac{n-1}{3} \right\rceil + n - \left\lfloor \frac{m+3}{2} \right\rfloor. \qquad (15)$$
Figure 10 shows a dot diagram of a proposed (12 × 12)-bit KCM with four selectable coefficients, where $A^{[k]}$ is a negative constant and $B^{[k]}$ is a two's complement variable (cf. Figure 5). In this example, one additional adder is needed compared to the single-coefficient KCM.

Proposed KCMs with Eight Selectable Coefficients
A KCM with eight selectable coefficients requires three inputs to select the coefficient. Partial products for each coefficient are pre-computed and generated using LUT6s for each $P^{[k]}_0$ and the proposed two-operand adders generate and add the rest of the partial products.
Three inputs to each LUT6 used to generate $P^{[k]}_0$ are needed to select the coefficient, so only three inputs are left to select the pre-computed value of $P^{[k]}_0$ if each LUT6 generates one bit and only two inputs are left if each LUT6 generates two bits. Three of the $y_i$ inputs to each LUT6 in each of the adders are needed to select the coefficient, so only two inputs are left to select the pre-computed value of $P^{[k]}_i$. Therefore, all of the partial-product bits in a KCM with eight selectable coefficients are grouped into $\lceil (n-1)/2 \rceil$ partial products.
When n − 1 is an exact multiple of two, the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m\,\frac{n-1}{2} + n. \qquad (16)$$
When n − 1 is not an exact multiple of two, the maximum number of required LUT6s is
$$\#\text{LUT6s} \le m \left\lceil \frac{n-1}{2} \right\rceil + n - \left\lfloor \frac{m+2}{2} \right\rfloor. \qquad (17)$$
Figure 11 shows a dot diagram of a proposed (12 × 12)-bit KCM with eight selectable coefficients, where $A^{[k]}$ is a negative constant and $B^{[k]}$ is a two's complement variable (cf. Figure 5). In this example, three additional adders are needed compared to the single-coefficient KCM.
Table 1 compares proposed KCMs that have a single coefficient to the proposed KCMs with two, four and eight selectable coefficients. The number of partial products and the number of LUTs used by each version are given, based on Equations (12) through (17). The percentage increase in the number of LUTs for two, four and eight-coefficient versions versus single-coefficient versions is also given. One or more of the LUTs used to generate the least-significant bits in the first row can often be optimized away, so the number of LUTs in an actual implementation may be a little lower. For the operand sizes in the table, KCMs with two selectable coefficients use an average of 19% more LUTs, KCMs with four selectable coefficients use an average of 55% more LUTs and KCMs with eight selectable coefficients use an average of 117% more LUTs than single-coefficient KCMs. In designs where a KCM with selectable coefficients can replace two or more single-coefficient KCMs, the increase is more than offset by the reduced number of KCMs required. KCMs with selectable coefficients usually have more partial products than single-coefficient KCMs. This means more adder stages are required, which translates into additional delay in single-cycle units. In pipelined versions, this results in longer latencies. However, cycle times are comparable because the adders are the same width or a little shorter.
Figure 12 shows the number of LUT6s required for KCMs with one, two, four and eight selectable coefficients. These functions are discrete and the points are connected by lines for readability only, not to imply continuity.
The lower set of points is for single-coefficient KCMs and is the same as the middle set of points in Figure 7. As discussed in Section 3.4, there are jumps at every fifth value of n, starting with n = 11, because the first row requires twice as many LUT6s every fifth value of n starting at n = 11 and the number of rows increases every fifth value of n starting at n = 12. KCMs with two selectable coefficients have jumps for the same reasons, except they occur every fourth value of n, KCMs with four selectable coefficients have jumps every third value of n and KCMs with eight selectable coefficients have jumps every second value of n.  Figure 13 shows the number of partial product bits that are computed and summed for a single output per LUT6 for KCMs with one, two, four and eight selectable coefficients. The upper set of points is for single-coefficient KCMs and is the same as the middle set of points in Figure 8. As discussed in Section 3.4, there are local maximums every fifth value of n starting at n = 10 and local minimums every fifth value of n starting at n = 12, indicating most efficient and least efficient units, respectively. KCMs with two selectable coefficients have a similar cycle every fourth value of n. They can be implemented using the same number of LUTs as single-coefficient KCMs for n = 8 and n = 12 because of the different period of each cycle. The cycle for KCMs with four selectable coefficients is every third value of n and the cycle for KCMs with eight selectable coefficients is every second value of n. KCMs with selectable coefficients are less efficient than single-coefficient KCMs by this measure because they require more LUTs to produce a single product in a clock cycle. However, they are more efficient in a design that performs time-multiplexed multiplication because additional single-coefficient KCMs or a general-purpose multiplier would be required to provide the same functionality.
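The LUT counts for one, two, four and eight selectable coefficients follow one pattern: each doubling of the number of coefficients takes one more LUT input for selection, so each partial product covers one fewer row. The Python sketch below generalizes the bound on that basis. It is an assumption-laden illustration: the expressions for two and four coefficients are inferred by analogy with the single-coefficient and eight-coefficient cases, and the function name `max_luts_selectable` is hypothetical.

```python
def max_luts_selectable(m, n, coefficients=1):
    """Upper bound on LUT6s for an (m x n)-bit KCM with selectable coefficients.

    coefficients must be 1, 2, 4 or 8. Each select bit removes one of the five
    inputs that would otherwise index the pre-computed partial product.
    """
    select_bits = {1: 0, 2: 1, 4: 2, 8: 3}[coefficients]
    rows_per_group = 5 - select_bits
    if (n - 1) % rows_per_group == 0:
        return m * (n - 1) // rows_per_group + n
    groups = -(-(n - 1) // rows_per_group)        # ceil((n - 1) / rows_per_group)
    return m * groups + n - (m + rows_per_group) // 2


# Bounds for (12 x 12)-bit and (8 x 8)-bit units with 1, 2, 4 and 8 coefficients.
luts_12 = {c: max_luts_selectable(12, 12, c) for c in (1, 2, 4, 8)}
luts_8 = {c: max_luts_selectable(8, 8, c) for c in (1, 2, 4, 8)}

# Relative increase over the single-coefficient unit, in percent.
increase_12 = {c: 100.0 * (luts_12[c] - luts_12[1]) / luts_12[1] for c in (2, 4, 8)}
```

With these assumptions the two-coefficient bound equals the single-coefficient bound for n = 8 and n = 12, which agrees with the observation in the text that those sizes need no extra LUTs.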

Results
The proposed KCMs are compared to Xilinx LogiCORE IP v12.0 (rev. 12) (Xilinx Inc., San Jose, CA, USA) constant-coefficient multipliers [20] for (n × n)-bit units. Proposed KCMs with two, four and eight selectable coefficients are compared to units composed of a LogiCORE IP general-purpose multiplier and a lookup function to select the coefficient. Proposed KCMs with two and four selectable coefficients are also compared to units composed of two or four LogiCORE IP KCMs and a multiplexer to select the output. Results for 8, 10, 12, 14, 16, 20 and 24-bit operands are given for single-cycle and pipelined units. KCMs are synthesized with a positive constant and again with a negative constant. KCMs with selectable coefficients are synthesized with half of the constants being positive and the other half negative.

Methodology
Version 2016.3 of the Xilinx Vivado Design Suite (Vivado) was used. Designs were synthesized with the strategy set to 'Flow_PerfOptimized_high' and implemented with the strategy set to 'Performance_Retiming'. Designs were synthesized for the Xilinx Virtex-7 XC7VX330T-FFG1157 (-3 speed grade) device with a timing constraint of 1 ns on the inner clock. All results are post place-and-route.
LogiCORE IP constant-coefficient multipliers and general-purpose multipliers were created using the IP Catalog in Vivado. Structural models of the proposed multipliers were implemented in Verilog-2001 (IEEE Standard 1364-2001, IEEE, Piscataway, NJ, USA). Pipelined versions were created for LogiCORE IP multipliers using the optimal number of stages specified in the IP customization dialog. Input and output (I/O) ports were double registered to reduce dependence on I/O placement [21]. A separate clock on the inner level was used to measure the delay through each multiplier.

Synthesis Results
Synthesis results for proposed KCMs are given in Section 5.2.1. Synthesis results for proposed KCMs with two, four and eight selectable coefficients are given in Sections 5.2.2-5.2.4, respectively.

Proposed Constant-Coefficient Multipliers
Synthesis results for single-cycle constant-coefficient multipliers are given in Tables 3 and 4. The total number of LUTs used and the delay are given. The LUT-delay product (LDP), computed by multiplying the number of LUTs by the delay, is also given. LDP is analogous to the area-delay product of a very-large-scale integration (VLSI) design. The reciprocal of LDP gives a metric to compare maximum throughput. The total number of LUTs, delay and LDP are normalized to LogiCORE IP KCMs.
Table 3 gives results for single-cycle KCMs, where the constant A is positive and the variable B is signed. For these units, proposed designs are 10% to 31% smaller than comparable LogiCORE IP KCMs, except for 12-bit units, which are 14% larger. This anomaly occurs because proposed KCMs are less efficient for n = 12, as discussed in Section 3.4, and LogiCORE IP KCMs with positive coefficients are more efficient for n = 12, as shown in Figure 15. Proposed designs have a 23% to 108% increase in delay, so there is a trade-off of fewer LUTs for increased cycle time.
Table 4 gives results for single-cycle KCMs, where the constant A is negative and the variable B is signed. For these KCMs, LogiCORE IP units increase in size while proposed units remain roughly the same. This reduces the relative size, so proposed designs with a negative constant are 17% to 35% smaller than LogiCORE IP units. Normalized delay is similar to proposed KCMs with a positive constant.
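The LUT-delay product and its normalization, as defined at the start of this section, can be written out as a minimal sketch. The function names and the numeric values are hypothetical and chosen only for illustration. They are not taken from Tables 3 to 6.

```python
def ldp(luts, delay_ns):
    """LUT-delay product: number of LUTs multiplied by the delay."""
    return luts * delay_ns


def normalized(value, baseline):
    """Normalize a metric to the corresponding baseline (LogiCORE IP) value."""
    return value / baseline


# Hypothetical values for illustration only (not from the paper's tables).
baseline_luts, baseline_delay = 100, 2.0   # baseline unit
proposed_luts, proposed_delay = 80, 3.0    # 20% fewer LUTs, 50% more delay

norm_ldp = normalized(ldp(proposed_luts, proposed_delay),
                      ldp(baseline_luts, baseline_delay))
print(round(norm_ldp, 3))  # 1.2
```

Because LDP is a product, the normalized LDP equals the normalized LUT count multiplied by the normalized delay (0.8 × 1.5 in this example). A value above 1.0 means the baseline has the better LUT-delay product.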
For most proposed single-cycle units, the normalized LDP is greater than 1.0. This suggests that single-cycle LogiCORE IP units usually offer higher throughput in designs where the KCMs are on the critical path and determine the clock period. However, when the KCMs are not on the critical path and proposed designs meet timing requirements, proposed designs for most operand sizes will improve the system by reducing the number of LUTs required.
Synthesis results for pipelined constant-coefficient multipliers are given in Tables 5 and 6. The number of pipeline stages is reported, as well as the total number of LUTs used, the delay and the LUT-delay product. The number of pipeline stages determines the latency in clock cycles. The reported delay is for one clock cycle. The total number of LUTs, delay and LDP are normalized to LogiCORE IP units.
Table 5 gives results for pipelined KCMs, where the constant A is positive and the variable B is signed. For these units, proposed designs are 15% to 36% smaller than comparable LogiCORE IP units, except for 12-bit units, which are 5% larger. Proposed designs have a 23% to 38% increase in delay, so there is still a trade-off of LUTs for cycle time. However, the extreme cases are significantly reduced and normalized delay is fairly constant as operand size is scaled.
Table 6 gives results for pipelined KCMs, where the constant A is negative and the variable B is signed. As with single-cycle KCMs, negative-constant LogiCORE IP KCMs increase in size, while proposed units remain roughly the same. This again reduces the relative size, so proposed designs with a negative constant are 20% to 42% smaller than LogiCORE IP units. Even 12-bit units are 20% smaller. As with single-cycle KCMs, normalized delay is similar to that of proposed KCMs with a positive constant.
The average normalized LDP for proposed pipelined KCMs is 1.025 for units with a positive constant and 0.855 for units with a negative constant. The overall average LDP is 0.940 and the overall median LDP is 0.881. This suggests that, for many operand sizes, proposed pipelined KCMs offer higher throughput in designs where they are on the critical path and determine the clock period. When they are not on the critical path and meet timing requirements, the throughput advantage of proposed units increases because they use 27% fewer LUTs on average than comparable LogiCORE IP units. Proposed KCMs have more pipeline stages than some LogiCORE IP KCMs, especially as n gets larger, because the proposed method uses an array structure to add partial products while LogiCORE IP units appear to use a tree structure. This may be a problem for systems where latency requirements are difficult to meet. However, for systems that can tolerate the increased latency, this is less of an issue.
Table 5. Synthesis results for pipelined (m × n)-bit KCMs, where m = n, A = π/4 · 2^n and B is a signed variable. Proposed KCMs use 22% fewer LUTs on average compared to LogiCORE IP KCMs and the increase in delay is less significant than it is for single-cycle units. Acronyms: lookup tables (LUTs), LUT-delay product (LDP).
Figure 14 combines the graph of Figure 12 with actual values for LogiCORE IP KCMs obtained by synthesis. The graph shows that, for many operand sizes, the proposed KCMs with two selectable coefficients use fewer LUTs than LogiCORE IP KCMs that only handle a single coefficient.

Proposed KCMs with Two Selectable Coefficients
Synthesis results for single-cycle KCMs with two selectable coefficients are given in Table 7. Results for units composed of a LogiCORE IP general-purpose multiplier and a lookup function to select the coefficient are given. Results for units composed of two LogiCORE IP KCMs with a multiplexer to select the output are also given. Results are normalized to LogiCORE IP KCM units because they use 27% fewer LUTs and are 2.08 times faster on average than LogiCORE IP multiplier units. Proposed KCMs with two selectable coefficients use only 20% more LUTs on average than proposed KCMs with a single coefficient, while units based on LogiCORE IP KCMs use more than twice as many LUTs because they cannot be combined and require a multiplexer to select the product. Delay for proposed 8- and 12-bit units is about the same but increases for other sizes because an additional row is required to compute the product. Delay for all LogiCORE IP KCM-based units increases due to the multiplexer and because the variable operand must be routed to two KCMs, which doubles the fanout. Proposed units use 57% to 70% fewer LUTs compared to LogiCORE IP KCM-based units at the expense of a 13% to 74% increase in delay. The LDP for proposed units is 28% to 67% lower, indicating that significantly higher throughput can be achieved compared to LogiCORE IP KCM-based units. LogiCORE IP multiplier-based units are not competitive for two selectable coefficients.
Table 8 gives synthesis results for pipelined KCMs with two selectable coefficients. Proposed units use the same number of LUTs as single-cycle versions, except for 20- and 24-bit units, which use some additional LUTs as shift registers (SRLs) to replace flip-flops. This optimization can be avoided if desired using the -shreg_min_size setting in synthesis options. Similar to single-cycle units, proposed designs use 60% to 70% fewer LUTs than LogiCORE IP KCM-based units.
However, proposed units benefit relatively more from pipelining than LogiCORE IP units and are only 3% to 37% slower, and the relative delay tends to improve as n increases. Proposed units have 45% to 63% lower LDP, which is consistently lower for all operand sizes. The LDP suggests that proposed units offer more than double the throughput versus LogiCORE IP KCM-based units for most operand sizes. LogiCORE IP multiplier-based units are still not competitive.

Proposed KCMs with Four Selectable Coefficients
Table 10 gives synthesis results for pipelined KCMs with four selectable coefficients. Proposed units benefit relatively more from pipelining than LogiCORE IP KCM-based and multiplier-based units with regard to delay. They are faster than most LogiCORE IP multiplier-based units and slower than LogiCORE IP KCM-based units, but more comparable than they were for single-cycle units. Proposed pipelined units use 61% to 66% fewer LUTs than LogiCORE IP multiplier-based units and 72% to 76% fewer LUTs than LogiCORE IP KCM-based units. They have a 63% to 72% lower LDP than LogiCORE IP multiplier-based units and a 63% to 76% lower LDP than LogiCORE IP KCM-based units.

Proposed KCMs with Eight Selectable Coefficients
Table 11 gives synthesis results for single-cycle KCMs with eight selectable coefficients and Table 12 gives synthesis results for pipelined KCMs with eight selectable coefficients. Results for LogiCORE IP KCM-based units are not given because they would require eight KCMs and do not scale well as the number of coefficients increases. LogiCORE IP multiplier-based units only require a small amount of additional logic for the lookup function, so they scale very well.
Proposed units use 51% to 52% fewer LUTs than LogiCORE IP for single-cycle units. They are slower than LogiCORE IP for most units, and the relative delay generally increases as n increases. The LDP for proposed single-cycle units is 13% to 53% lower than LogiCORE IP, with better results for smaller operand sizes.
Proposed pipelined units use 46% to 52% fewer LUTs and are faster, having 10% to 16% lower delay than LogiCORE IP. The LDP for proposed units is 51% to 59% lower than LogiCORE IP and performance is consistently better for all operand sizes.

Comparison to Möller et al.
Möller et al. [13] present synthesis results for (16 × 16)-bit constant-coefficient multipliers with two to fourteen selectable coefficients. They compare units generated using their proposed PAG fusion heuristic to units based on DAG fusion [11], using a Xilinx CoreGen multiplier with a distributed RAM to store coefficients as a baseline for comparison. Results for pipelined PAG fusion with ternary adders and PAG fusion with only two-operand adders are given. Results for single-cycle DAG fusion are given, as well as pipelined DAG fusion with registers after each adder, subtractor, adder/subtractor and multiplexer, plus additional registers as needed for pipeline balancing. The Xilinx CoreGen multiplier-based unit is pipelined to the same depth as pipelined PAG fusion units. The number of slices required for implementation is shown on one graph and the maximum clock frequency for each method is shown on another graph in their paper. Numerical values are estimated from these graphs and tabulated in Tables 13 and 14 for units with two to eight selectable coefficients.
Table 13. Slice utilization for directed acyclic graph (DAG) fusion [11,13], pipelined adder graph (PAG) fusion [13] and proposed (16 × 16)-bit KCMs with two to eight selectable coefficients. DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit.
Results presented by Möller et al. were obtained using Xilinx ISE v13.4, targeting a Virtex 6 FPGA (xc6vlx75t-2ff484-2) [13]. Slices are used as the metric for resource usage, and a Xilinx CoreGen-based unit is used as a baseline for comparison. This paper presents results obtained using Xilinx Vivado 2016.3, targeting a Virtex 7 FPGA, and uses LUTs as the metric for resource usage. In order to compare results, a Xilinx LogiCORE IP multiplier with a function to look up coefficients is used in this work as a baseline. The LogiCORE IP multiplier is pipelined to the optimal depth as given in the IP customization dialog and a pipeline register is inserted between the coefficient lookup and the multiplier. Results given by Möller et al. are normalized to their CoreGen-based unit and results in this work are normalized to the LogiCORE IP-based units to account for differences in the synthesis tools, target device and IP implementation. Tables 13 and 14 summarize these results. Figures 16 and 17 compare results from Möller et al. to this work by plotting normalized values.
With the proposed approach, a KCM with three selectable coefficients would have the same structure as the KCM with four selectable coefficients described in Section 4.2, except the table of precomputed partial products would use zeros or don't cares for the unused coefficient. This may allow some LUTs in the first row to be optimized away, but the LUTs in the other rows would still be required, so the resources consumed by the unit would be identical to or slightly less than those of a KCM with four selectable coefficients. For this reason, proposed KCMs with three selectable coefficients are graphed using the same values as KCMs with four selectable coefficients. Likewise, proposed KCMs with five, six or seven selectable coefficients are graphed using the same values as KCMs with eight selectable coefficients.
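The rule used for graphing (three coefficients plotted as four, and five to seven plotted as eight) amounts to rounding the coefficient count up to the next power of two. A minimal sketch follows. The function name `structure_coefficients` is hypothetical, and applying the rule outside the range of two to eight coefficients is an assumption that this section does not cover.

```python
def structure_coefficients(k):
    """Return the smallest power of two >= k.

    This is the number of selectable coefficients of the KCM structure whose
    resource values are used when graphing a unit with k coefficients.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1 << (k - 1).bit_length()


for k in range(2, 9):
    print(k, structure_coefficients(k))
```

Unused table entries (for example, the fourth coefficient of a three-coefficient unit) would hold zeros or don't cares, as described in the text.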
Figure 16. Slice utilization for directed acyclic graph (DAG) fusion [11,13], pipelined adder graph (PAG) fusion [13] and proposed (16 × 16)-bit KCMs with two to eight selectable coefficients. One slice contains four LUT6s and eight flip-flops (see Figure 1). DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit.
PAG fusion units with two-operand adders use fewer slices than CoreGen for two to four selectable coefficients and PAG fusion units with ternary adders use fewer slices than CoreGen for two to six selectable coefficients. All PAG fusion units use fewer slices than pipelined DAG fusion units. All PAG fusion units with two-operand adders have a maximum frequency comparable to CoreGen units, ranging from 6% slower to 4% faster. PAG fusion units with ternary adders are 22% to 31% slower than CoreGen, mainly because ternary adders are slower than two-operand adders. However, for many applications, they would not be on the critical path and would be better than CoreGen because they use 6% to 60% fewer slices for units with two to six selectable coefficients.
Figure 17. Maximum operating frequency for DAG fusion [11,13], PAG fusion [13] and proposed (16 × 16)-bit KCMs with two to eight selectable coefficients. DAG fusion and PAG fusion units are normalized to CoreGen as presented in Möller et al. [13]; proposed units are normalized to a LogiCORE IP multiplier-based unit.
Proposed KCMs with selectable coefficients use significantly fewer slices than LogiCORE IP and pipelined versions are faster than LogiCORE IP for units with two to eight selectable coefficients. PAG fusion units outperform CoreGen units in most cases, so it is important to compare proposed units to PAG fusion units. Table 15 compares required slices and Table 16 compares maximum operating frequency for PAG fusion units, normalized to CoreGen, with proposed units, normalized to LogiCORE IP. These values are then normalized to PAG fusion with two-operand adders and PAG fusion with ternary adders. Comparing normalized values, proposed KCMs with selectable coefficients use 47% to 65% fewer slices than PAG fusion with two-operand adders and 22% to 57% fewer slices than PAG fusion with ternary adders. Proposed KCMs with selectable coefficients can operate 7% to 30% faster than PAG fusion with two-operand adders and 28% to 52% faster than PAG fusion with ternary adders.
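The second normalization step described above can be sketched as a ratio of two already-normalized values. The function name and all numbers below are hypothetical and are not the values in Tables 15 and 16. The sketch also assumes, as the comparison in the text does, that the CoreGen and LogiCORE IP baselines are comparable.

```python
def relative_to(proposed_norm, reference_norm):
    """Re-normalize a proposed-unit value to a reference (PAG fusion) value.

    proposed_norm is normalized to LogiCORE IP and reference_norm is
    normalized to CoreGen.
    """
    return proposed_norm / reference_norm


# Hypothetical normalized values for illustration only.
pag_slices, proposed_slices = 0.80, 0.40   # slices relative to each baseline
pag_fmax, proposed_fmax = 1.00, 1.20       # max frequency relative to baseline

slice_ratio = relative_to(proposed_slices, pag_slices)
fmax_ratio = relative_to(proposed_fmax, pag_fmax)

print(f"{1 - slice_ratio:.0%} fewer slices")  # 50% fewer slices
print(f"{fmax_ratio - 1:.0%} faster")         # 20% faster
```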