The comments were getting too long.
Consider the following block diagram:
This represents an unrolled loop (for i in 0 to 7 loop) and shows that no add +3 occurs before i = 2 for the LS BCD digit, no add +3 occurs before i = 5 for the middle BCD digit, and no adjustment occurs on the MS BCD digit, which is comprised in part of static '0' values.
This gives us a total of 7 add3 modules: five for the LS digit and two for the middle digit, each representing the enclosing if statement and its conditional add +3.
This is demonstrated in VHDL:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity bin8bcd is
port (
bin: in std_logic_vector (7 downto 0);
bcd: out std_logic_vector (11 downto 0)
);
end entity;
architecture struct of bin8bcd is
procedure add3 (signal bin: in std_logic_vector (3 downto 0);
signal bcd: out std_logic_vector (3 downto 0)) is
variable is_gt_4: std_logic;
begin
is_gt_4 := bin(3) or (bin(2) and (bin(1) or bin(0)));
if is_gt_4 = '1' then
-- if to_integer(unsigned (bin)) > 4 then
bcd <= std_logic_vector(unsigned(bin) + "0011");
else
bcd <= bin;
end if;
end procedure;
signal U0bin,U1bin,U2bin,U3bin,U4bin,U5bin,U6bin:
std_logic_vector (3 downto 0);
signal U0bcd,U1bcd,U2bcd,U3bcd,U4bcd,U5bcd,U6bcd:
std_logic_vector (3 downto 0);
begin
U0bin <= '0' & bin (7 downto 5);
U1bin <= U0bcd(2 downto 0) & bin(4);
U2bin <= U1bcd(2 downto 0) & bin(3);
U3bin <= U2bcd(2 downto 0) & bin(2);
U4bin <= U3bcd(2 downto 0) & bin(1);
U5bin <= '0' & U0bcd(3) & U1bcd(3) & U2bcd(3);
U6bin <= U5bcd(2 downto 0) & U3bcd(3);
U0: add3(U0bin,U0bcd);
U1: add3(U1bin,U1bcd);
U2: add3(U2bin,U2bcd);
U3: add3(U3bin,U3bcd);
U4: add3(U4bin,U4bcd);
U5: add3(U5bin,U5bcd);
U6: add3(U6bin,U6bcd);
OUTP:
bcd <= '0' & '0' & U5bcd(3) & U6bcd & U4bcd & bin(0);
end architecture;
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity bin8bcd_tb is
end entity;
architecture foo of bin8bcd_tb is
signal bin: std_logic_vector (7 downto 0) := (others => '0');
-- (initialized to prevent those annoying metavalue reports)
signal bcd: std_logic_vector (11 downto 0);
begin
DUT:
entity work.bin8bcd
port map (
bin => bin,
bcd => bcd
);
STIMULUS:
process
begin
for i in 0 to 255 loop
bin <= std_logic_vector(to_unsigned(i,8));
wait for 1 ns;
end loop;
wait for 1 ns;
wait;
end process;
end architecture;
When the accompanying test bench is run, this yields:
If you were to scroll through the entire waveform you'd find that all bcd outputs from 000 to 255 are present and accounted for (no holes), with no 'X's or 'U's anywhere.
From the representation in the block diagram showing i = 7 we see that no add +3 occurs after the final shift.
Also note that the LSB of bcd is always the LSB of bin, and that bcd(11) and bcd(10) are always '0'.
The add3 can be hand optimized further, implementing the conditional add +3 with logic operators instead of a numeric_std addition, which gets rid of any possibility of reporting metavalues derived from bin (and there'd be a lot of them).
As far as I can tell, this is the most optimized representation of 8-bit binary to 12-bit BCD conversion.
Some time ago I wrote a C program to generate input for espresso (a logic minimizer):
/*
* binbcd.c - generates input to espresso for 8 bit binary
* to 12 bit bcd.
*
*/
#include <stdlib.h>
#include <stdio.h>
int main (int argc, char **argv)
{
int binary;
int bit;
char bcd_buff[4];
int digit;
int bcd;
printf(".i 8\n");
printf(".o 12\n");
for (binary = 0; binary < 256; binary++) {
for ( bit = 7; bit >= 0; bit--) {
if ((1 << bit) & binary)
printf("1");
else
printf("0");
}
digit = snprintf(bcd_buff,4,"%03d",binary); /* leading zeros */
if (digit != 3) {
fprintf(stderr,"%s: binary to string conversion failure, digit = %d\n",
argv[0],digit);
exit (-1);
}
printf (" "); /* input to output space */
for ( digit = 0; digit <= 2; digit++) {
bcd = bcd_buff[digit] - 0x30;
for (bit = 3; bit >= 0; bit--) {
if ((1 << bit) & bcd)
printf("1");
else
printf("0");
}
}
/* printf(" %03d",binary); */
printf("\n");
}
printf (".e\n");
exit (0);
}
I then started poking around with intermediate terms, which leads directly to what is represented in the block diagram above.
And of course you could use an actual add3 component, along with nested generate statements, to hook everything up.
You won't get the same minimized hardware from a loop statement representation without constraining the if statements (2 <= i <= 6 for the LS BCD digit, 5 <= i <= 6 for the middle BCD digit).
You'd want the subsidiary nested generate statement to provide the same constraints for a shortened structural representation.
A logic operator version of add3 is shown on PDF page 5 of the university lecture slides for Binary to BCD Conversion using double dabble, where a forward tick (') denotes negation, "+" signifies OR, and adjacency signifies AND.
The add3 then looks like:
procedure add3 (signal bin: in std_logic_vector (3 downto 0);
signal bcd: out std_logic_vector (3 downto 0)) is
begin
bcd(3) <= bin(3) or
(bin(2) and bin(0)) or
(bin(2) and bin(1));
bcd(2) <= (bin(3) and bin(0)) or
(bin(2) and not bin(1) and not bin(0));
bcd(1) <= (bin(3) and not bin(0)) or
(not bin(2) and bin(1)) or
(bin(1) and bin(0));
bcd(0) <= (bin(3) and not bin(0)) or
(not bin(3) and not bin(2) and bin(0)) or
(bin(2) and bin(1) and not bin(0));
end procedure;
Note this would allow package numeric_std (or equivalent) to be dropped from the context clause.
If you write the signals in the AND terms in the same order (in this case left to right), the duplicated AND terms show up well, as they also do using espresso. There is no value in using intermediate AND terms in an FPGA implementation; these all fit in LUTs just the way they are.
espresso input for add3:
.i 4
.o 4
0000 0000
0001 0001
0010 0010
0011 0011
0100 0100
0101 1000
0110 1001
0111 1010
1000 1011
1001 1100
1010 ----
1011 ----
1100 ----
1101 ----
1110 ----
1111 ----
.e
And espresso's output (espresso -eonset):
.i 4
.o 4
.p 8
-100 0100
00-1 0001
--11 0010
-01- 0010
-110 1001
-1-1 1000
1--1 1100
1--0 1011
.e
When you consider the combinatorial 'depth' of the binary to BCD conversion, for an FPGA it's 6 LUTs (the 6th being the LUT that consumes the result). That likely limits the clock speed to something shy of 100 MHz if the conversion occurs in one clock.
By pipelining or using sequential logic (a clocked loop) you'd be able to run an FPGA at its fastest speed while executing in 6 clocks.