Base41: A proposal for printable encoding of bit strings

This article presents the proposal of a novel encoding method for representing binary data in printable format. The encoding procedure reversibly transforms binary strings into printable letter sequences using a mapping from bit strings to an alphabet of 41 symbols taken from a subset of the English uppercase and lowercase letters. The decoding process reversibly recovers an original bit string from the encoded printable string. The encoding and decoding algorithms along with the properties of this coding are examined and discussed. In particular, the use of only URL‐safe and clearly visually distinguishable characters is emphasized.


INTRODUCTION
In the field of computer science there are many applications that need a representation of data in printable form, that is, encoded with a set of characters that can be printed and read by a human. Examples of such software and hardware are some electronic mail clients and servers or QR-code readers. To cope with these systems many frameworks have been developed to translate (i.e., encode and decode) data from binary to printable form and vice versa.
In the following we will introduce such an encoding system and compare it with existing ones. In this context encoding means representing a number written as a numeral in a certain system with a numeral written in another system (e.g., the number of legs of a dog is represented with, among others, the numerals 4 (Arabic numerals), IV (Roman numerals), four (English language)). Within these systems, positional systems have a noticeable relevance. A positional system uses a base B (with B > 1, B ∈ N) having B symbols to represent numbers as a weighted sum of integer powers of B, that is, a number M is decomposed in $A_n B^n + A_{n-1} B^{n-1} + \dots + A_0 B^0$ and written as $A_n A_{n-1} A_{n-2} \dots A_0$.
In the following the term Base Y refers to a method that uses a base of Y symbols. The objective of this article is to present an encoding format of binary data by means of printable symbol sequences using an alphabet of 41 symbols.
The proposal in this article recalls the Base45 [1] encoding and the Base41 proposal [2]. As in Reference 1 this proposal encodes two octets with three symbols and also allows for encoding one octet or a shorter bit string in three symbols.
Differently from Reference 3, where the ten digits and the uppercase and lowercase letters of the English alphabet (totaling 62 symbols) are used in an encoding system, called UTF-62, for multilingual identifiers, the present proposal employs 41 letters only. The motivation for the use of 41 symbols is that 41 is the minimum base that may be used to encode 2 octets ($2^{16} = 65,536$ configurations) with 3 symbols, that is, $40^3 = 64,000 < 2^{16} < 41^3 = 68,921$. Moreover, part of the exceeding configurations allows for the representation of bit strings with length up to 8 bits. The alphabet is composed of the 20 uppercase letters "ABCDFGHJKLMNQRSTUVXZ" and the 21 lowercase letters "abcdefhikmnopqrstuvxz" from the English alphabet, excluding "EIOPWYgjlwy": this has the effect of using only URL-safe characters whose graphical representation does not give rise to ambiguities in the visual interpretation of the glyphs by a human. In fact, as done in the Base58 encoding [4] for some letters, the proposed Base41 encoding avoids the possible visual uncertainty in the printed string between:
• "Q" (capital q), "O" (capital o), and "0" (digit zero),
• "B" and "E",
• "E" and "F",
• "P" and "R",
• "g" (lowercase G) and "q" (lowercase Q),
• "l" (lowercase L), "I" (capital i), and "1" (digit one),
• "i" (lowercase I) and "j" (lowercase J),
• "vv" and "w", both lowercase and uppercase,
• "v" and "y", both lowercase and uppercase.
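The claim that 41 is the smallest usable base, and the composition of the alphabet, can be checked with a few lines of Python. This is only an illustrative sketch: the names `min_base`, `UPPER`, `LOWER`, and `EXCLUDED` are hypothetical and do not come from the authors' implementation.

```python
import string

UPPER = "ABCDFGHJKLMNQRSTUVXZ"      # 20 uppercase letters of the proposal
LOWER = "abcdefhikmnopqrstuvxz"     # 21 lowercase letters of the proposal
ALPHABET = UPPER + LOWER
EXCLUDED = "EIOPWYgjlwy"            # letters left out of the alphabet


def min_base(bits: int, symbols: int) -> int:
    """Smallest base B such that B**symbols covers all 2**bits values."""
    base = 2
    while base ** symbols < 2 ** bits:
        base += 1
    return base


print(min_base(16, 3))              # 41
print(40 ** 3, 2 ** 16, 41 ** 3)    # 64000 65536 68921
print(len(ALPHABET))                # 41

# The alphabet and the excluded letters together cover the 52 English letters.
print(set(ALPHABET) | set(EXCLUDED) == set(string.ascii_letters))  # True
```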
The main differences with Base45 [1] encoding are:
• a smaller set of symbols;
• the use of uppercase and lowercase letters from the English alphabet;
• the use of URL-safe characters only;
• a smaller number of unused encoded sequences (considered to be rejected in Reference 1);
• limiting the similar glyphs in the encoding alphabet;
• the mode of encoding of octet strings of odd octet length;
• the encoding of bit strings of any length (not necessarily an integer number of octets).
The presented Base41 encoding differs from the one in Reference 2 for the advantages of the alphabet chosen to represent the 41 symbols, which uses URL-safe characters only (that is, characters that have no special meaning or function in coding URLs and thus do not need to be escaped with the % sign), and for the possibility to uniformly encode bit strings of any length.
In addition to the cited works on the bases 41, 45, 58, and 62, it is important to note that in previous years many works have been published on the theme of encoding binary strings using a printable format.
Base 62 is also used in Reference 5 to build a printable encoding from a bit stream that is read 6 bits at a time. When the input stream has a bit length that is not a multiple of 6, care is taken to perform the proper padding.
The Base64 [6] encoding represents three octets with four symbols, leading to a compact representation that uses the lowercase and uppercase letters of the English alphabet, the ten digits and two special characters. The same document [6] defines two more encodings, one based on 16 symbols and the other using 32 symbols.
An encoding that uses 85 symbols (the ASCII characters from code 33 to code 117) to represent 4 octets with 5 characters has been proposed in Reference 7, where the encoding is called Ascii85.
A readable representation of IPv6 addresses is also obtained with a base of 85 symbols: the alphabet used and the encoding/decoding are defined in Reference 8.
Two works [9,10] use base 91. Reference 9 represents blocks of 13 bits with pairs of signs from an alphabet of 91 characters that is a subset of printable ASCII symbols; some pairs are used to indicate how many bits to discard in the last block in case it has a length different from 13 bits. Differently from Reference 9, which has some unused Base91 pairs, Reference 10 makes use of all the pairs to encode blocks of 13 bits: in some cases, blocks of 14 bits are encoded, saturating all the available $91^2$ configurations.
Base 36 is also frequently used in many programming languages (e.g., Python [11], PHP [12], JavaScript [13]) that have routines for its conversions: in general, the alphabet is composed of the 26 letters of the English alphabet and the 10 digits.
The structure of the article is the following: first, we introduce some notation used throughout the article. Section 2 presents the proposed Base41 alphabet. Section 3 illustrates the encoding and decoding procedures detailing them with pseudo-code algorithms; Section 4 shows some encoding and decoding examples. In Section 5 some considerations about the protocol and security issues are discussed, and Section 6 presents some details regarding the implementation and provides experimental results. Finally, Section 7 draws some conclusions.

Notation
In the rest of the article the following variables will be used:
• C1, C2, C3 represent Base41 symbols; an instantiation of a Base41 symbol is written with a different font like "x" or "z";
• M is a number to be converted from binary to Base41 and vice versa;
• N1, N2 are nibbles, that is, sequences of 4 bits;
• O1, O2 are octets, that is, sequences of 8 bits;
• P1, P2 represent numbers of 5 bits;
• V1, V2, V3 represent numeric values of Base41 symbols ranging from 0 to 40.

BASE41 ALPHABET
The Base41 alphabet is composed of the following 41 letters (note that no digits are used):

ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz

Each letter is associated with a numerical value according to Table 1. To convert a number to base 41 any method developed for such purpose may be used: for example, using successive divisions by 41, the obtained sequence of remainders will be made of numbers with values between 0 and 40; each value is used as an index in the sequence of letters of the proposed Base41 alphabet and the extracted letters are concatenated to get the Base41 representation of the number (in the present proposed alphabet).
Let us see an example: given the decimal numeral 2,357,293 to convert it to Base41 we start dividing it by 41 obtaining 57,494 and remainder 39; the 39th letter (starting from 0) in the proposed Base41 alphabet is x. Then divide 57,494 by 41 obtaining 1402 and remainder 12; the 12th letter in the proposed Base41 alphabet is Q. Continuing to divide 1402 by 41 we have 34 and remainder 8 which is associated to letter K. Dividing 34 by 41 results in 0 (stopping the iteration) with remainder 34 which is coded with r. Writing from left to right the letters obtained from last to first we have rKQx which is the Base41 numeral for the decimal numeral 2,357,293. In case the numeral must be written with more symbols it may be left padded with the Base41 symbol A which has a corresponding value of 0: rKQx, ArKQx, AArKQx, AAArKQx, … all represent the same number.
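The successive-division procedure of this example can be sketched in Python as follows. The function names `to_base41` and `from_base41` and the `min_length` parameter are hypothetical, introduced only for illustration; the letter values are assumed to be the positions in the alphabet string, as the worked example implies.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def to_base41(number: int, min_length: int = 1) -> str:
    """Convert a non-negative integer to a Base41 numeral by repeated division."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    symbols = []
    while True:
        number, remainder = divmod(number, 41)
        symbols.append(ALPHABET[remainder])   # remainders come out last digit first
        if number == 0:
            break
    # Reverse the remainders and left-pad with "A" (value 0) if requested.
    return "".join(reversed(symbols)).rjust(min_length, ALPHABET[0])


def from_base41(numeral: str) -> int:
    """Inverse conversion: evaluate the numeral as a polynomial in 41."""
    value = 0
    for symbol in numeral:
        value = value * 41 + ALPHABET.index(symbol)
    return value


print(to_base41(2357293))        # rKQx
print(to_base41(2357293, 6))     # AArKQx
print(from_base41("AAArKQx"))    # 2357293
```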

BASE41 ENCODING AND DECODING
Binary data encoding may be performed according to the data size. Base41 encoding allows the representation of pairs of octets (16 bits), single octets (8 bits) and groups of 1 to 7 bits. All these kinds of data are encoded in 3 Base41 symbols: thus, for sequences of a large number of bits the size of the resulting data is approximately 1.5 times the size of the original binary data. A detailed analysis of this data expansion factor is reported in Appendix A.
A pair of octets O1, O2 is interpreted as a 16-bit number M = O1 * 256 + O2 (first octet most significant). Then, M is converted to base 41 and represented by the three Base41 symbols C1, C2, C3, whose values V1, V2, V3 satisfy M = V1 * 41 * 41 + V2 * 41 + V3 (first Base41 symbol most significant). The minimum 16-bit value, 0, is represented in Base41 as AAA. The maximum 16-bit value, 65,535, is represented in Base41 as vzV. Algorithm 1 presents the encoding procedure pseudo-code for 2 octets (see flow chart in Figure B1).
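A minimal Python sketch of the two-octet case described above is given here. The function name `encode_pair` is hypothetical, and the symbol values are assumed to be the positions in the alphabet string.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_pair(o1: int, o2: int) -> str:
    """Encode two octets (first one most significant) as three Base41 symbols."""
    if not (0 <= o1 <= 255 and 0 <= o2 <= 255):
        raise ValueError("octets must be in the range 0..255")
    m = o1 * 256 + o2                 # 16-bit number M
    v1, rest = divmod(m, 41 * 41)     # most significant Base41 digit
    v2, v3 = divmod(rest, 41)         # middle and least significant digits
    return ALPHABET[v1] + ALPHABET[v2] + ALPHABET[v3]


print(encode_pair(0x00, 0x00))   # AAA
print(encode_pair(0xFF, 0xFF))   # vzV
```

Since 65,535 = 38 * 1681 + 40 * 41 + 17, the first symbol never exceeds value 38 ("v"), which is why "x" and "z" remain free as prefixes.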
Note that neither "x" nor "z" may occur as first symbol. These characters will be used as prefixes of sequences of three symbols that encode, respectively, bit strings and single octets.
A single octet O1 is considered composed of two nibbles N1 and N2, N1 most significant, that is, O1 = N1 * 16 + N2. The decimal values of N1 and N2 are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are concatenated to the Base41 symbol "z", producing a string of three characters z C1 C2 that is the Base41 representation of the single octet. The minimum 8-bit value, 0, is represented in Base41 as zAA. The maximum 8-bit value, 255, is represented in Base41 as zTT. Algorithm 2 presents the encoding procedure pseudo-code for a single octet (see flow chart in Figure B2).
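The single-octet case can be sketched in the same way. The name `encode_octet` is hypothetical; only the "z" prefix and the nibble mapping come from the text.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_octet(o1: int) -> str:
    """Encode one octet as "z" followed by one symbol per nibble."""
    if not 0 <= o1 <= 255:
        raise ValueError("octet must be in the range 0..255")
    n1, n2 = o1 >> 4, o1 & 0x0F       # O1 = N1 * 16 + N2
    return "z" + ALPHABET[n1] + ALPHABET[n2]


print(encode_octet(0x00))   # zAA
print(encode_octet(0xFF))   # zTT
```

Only the first 16 letters of the alphabet ("A" to "T") can follow the "z" prefix in this mode.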
A bit string composed of 1 bit (b1), 2 bits (b1 b2), … or 7 bits (b1 b2 b3 b4 b5 b6 b7), leftmost bit most significant, is first represented with ten bits, the leading three bits encoding the length as in Table 2 (note the zero-valued bits that fill the ten bits).

Algorithm 1. Encoding procedure pseudo-code for two octets
begin
  … Table 1
  Output C1, C2, C3
end

Algorithm 2. Encoding procedure pseudo-code for one octet O1
begin
  Extract the two nibbles N1, N2 of O1, that is O1 = N1 * 16 + N2
  Use Table 1 to convert the values N1 and N2 to base 41, obtaining symbols C1 and C2
  Output "z", C1, C2
end

TABLE 2 Representation of bit strings for the proposed Base41 encoding using ten bits: the first two columns encode five bits that will be represented by a decimal number P1, the last column encodes the remaining five bits represented by a decimal number P2

Length    Bit string (zero filled)
001       b1 0      0  0  0  0  0
010       b1 b2     0  0  0  0  0
011       b1 b2     b3 0  0  0  0
100       b1 b2     b3 b4 0  0  0
101       b1 b2     b3 b4 b5 0  0
110       b1 b2     b3 b4 b5 b6 0
111       b1 b2     b3 b4 b5 b6 b7

The decimal values P1 (representing the five bits in the first two columns of Table 2) and P2 (representing the five bits in the last column of Table 2) of the two groups of five bits are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are concatenated to the Base41 symbol "x" to obtain the Base41 representation of the bit string. The minimum value single bit string, 0, is represented in Base41 as xFA. The maximum value seven-bit string, 1111111, is represented in Base41 as xoo. Note that not all the intermediate values are possible: for example, a bit length of 2 with any of the trailing five bits valued 1 may not happen. Moreover, it is possible to extend this code to represent the empty string with a 0 bit length: the Base41 representation following the previous encoding mode will be xAA. Algorithm 3 presents the encoding procedure pseudo-code for a bit string of maximum length 7 bits (see flow chart in Figure B3).
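The ten-bit layout for short bit strings can be sketched in Python as follows. The function name `encode_bits` and the choice of passing the bit string as text made of "0" and "1" characters are assumptions made for illustration; accepting the empty string follows the optional extension mentioned above.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_bits(bits: str) -> str:
    """Encode a string of 0 to 7 bits (e.g. "101") as "x" plus two symbols."""
    length = len(bits)
    if length > 7 or any(bit not in "01" for bit in bits):
        raise ValueError("expected at most 7 characters, each '0' or '1'")
    value = int(bits, 2) if bits else 0
    # Three bits of length, then the bits themselves, then zero filling.
    ten_bits = (length << 7) | (value << (7 - length))
    p1, p2 = ten_bits >> 5, ten_bits & 0x1F   # two groups of five bits
    return "x" + ALPHABET[p1] + ALPHABET[p2]


print(encode_bits("0"))         # xFA
print(encode_bits("1111111"))   # xoo
print(encode_bits(""))          # xAA
print(encode_bits("11"))        # xNA
```

Because P1 and P2 are at most 31, only the first 32 letters of the alphabet ("A" to "o") can follow the "x" prefix.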
Encoding of a bit stream should be performed by first considering it as a sequence of contiguous pairs of octets, each one coded separately, then encoding the trailing octet, if any, and then encoding the possibly remaining 1 to 7 bits. Nonetheless, an application may divide the input at its convenience and encode every part as a bit stream on its own.

Algorithm 3. Encoding procedure pseudo-code for bit string b1, b2, … , bn, with 0 < n < 8
begin
  Represent the bit string and its length according to Table 2
  Halve the ten bits into P1 and P2 and map their values to base 41 using Table 1 obtaining symbols C1 and C2
  Output "x", C1, C2
end

Appendix B reports flow charts and examples of executions of these algorithms.
One side effect of this encoding is that by inspecting the first character in each group of three symbols it is possible to immediately infer which kind of data is encoded (two octets, one octet or a bit string). In fact, decoding may be performed by splitting the stream of Base41 symbols into groups of 3 letters, then:
• if the first symbol of the group is "z" then it encodes a single octet; the values of the two nibbles of this octet are obtained from Table 1 using as indexes the second and third symbols of the group;
• if the first symbol of the group is "x" then it represents a bit string; two groups of five bits each are acquired from Table 1 using as indexes the second and third symbols of the group, then the resulting ten bits are decoded according to Table 2 to get the number of bits and their values to form the encoded bit string;
• otherwise, the three symbols encode two octets; first, the three symbols are transformed into numbers V1, V2, V3 according to Table 1, then the number M = V1 * 41 * 41 + V2 * 41 + V3 is computed and split into the two octets O1 and O2 such that M = O1 * 256 + O2.
Algorithm 4 shows the decoding procedure pseudo-code. Please note that an error is issued in case of trailing ones when decoding a bit string, but this is only the suggested behavior.
In Appendix C the reversibility of the encoding/decoding process is proven.
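The three decoding cases can be sketched in Python as below. This is not the authors' Algorithm 4: the names `decode_group` and `decode`, the representation of the result as a text of "0" and "1" characters, and the choice to raise `ValueError` on every malformed group (not only on trailing ones) are assumptions made for this illustration.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"
VALUE = {symbol: index for index, symbol in enumerate(ALPHABET)}


def decode_group(group: str) -> str:
    """Decode one three-symbol group into a string of '0'/'1' characters."""
    if len(group) != 3:
        raise ValueError("a group must contain exactly three symbols")
    try:
        v1, v2, v3 = (VALUE[symbol] for symbol in group)
    except KeyError as exc:
        raise ValueError(f"symbol outside the alphabet: {exc}") from exc

    if group[0] == "z":                       # single octet: two nibbles
        if v2 > 15 or v3 > 15:
            raise ValueError("nibble value out of range")
        return format(v2 * 16 + v3, "08b")

    if group[0] == "x":                       # bit string of 0 to 7 bits
        if v2 > 31 or v3 > 31:
            raise ValueError("five-bit value out of range")
        ten_bits = v2 * 32 + v3
        length = ten_bits >> 7                # leading three bits
        payload = ten_bits & 0x7F             # remaining seven bits
        fill_width = 7 - length
        if payload & ((1 << fill_width) - 1):
            raise ValueError("one-valued bits in the zero filling")
        return format(payload >> fill_width, "b").zfill(length) if length else ""

    m = v1 * 41 * 41 + v2 * 41 + v3           # two octets
    if m > 0xFFFF:
        raise ValueError("value does not fit in two octets")
    return format(m, "016b")


def decode(text: str) -> str:
    """Decode a whole Base41 string, three symbols at a time."""
    if len(text) % 3:
        raise ValueError("encoded length must be a multiple of three")
    return "".join(decode_group(text[i:i + 3]) for i in range(0, len(text), 3))


print(decode_group("vzV"))   # 1111111111111111
print(decode_group("zTT"))   # 11111111
print(decode_group("xFA"))   # 0
```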

EXAMPLES
In

PROTOCOL AND SECURITY CONSIDERATIONS
This document only defines an alphabet and an encoding mode. How a real binary string is encoded is left to the application defining the communication/representation protocol. In particular, this document:
• does not specify if a sequence of octets must be encoded as pairs (with the exception of a possible single final octet) or as single octets, or as a mix of pairs and single octets;
• does not specify the behavior of the application if a zero-length bit string is found when decoding;
• does not specify the behavior of the application if one-valued bits are found in the filling of a decoded bit string, for example, 010 11 00110;
• does not specify the behavior of the application if a symbol not contained in the alphabet is found in an encoded string: care must be taken to avoid security attacks.
Obviously, space efficiency suggests encoding data as pairs of octets leaving, if necessary, a single octet and then a 1 to 7 bit string (if present) only as final data to encode.
Implementations involving Base41 encoding must prevent attacks leveraging the symbol representation of the different kinds of data (single octets, octet pairs, and bit strings).
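To give an idea of how many three-symbol groups a strict decoder would have to reject, the sketch below enumerates all $41^3$ groups and counts those that are not produced by any of the three encoding modes. The names `is_valid_group` and `count_groups` are hypothetical, and treating the empty bit string (xAA) as valid is an assumption, since the text leaves that choice to the application.

```python
from itertools import product


def is_valid_group(v1: int, v2: int, v3: int) -> bool:
    """Tell whether three symbol values (0..40) form a group an encoder can emit."""
    if v1 == 40:                                  # "z": one octet, two nibbles
        return v2 < 16 and v3 < 16
    if v1 == 39:                                  # "x": bit string of 0..7 bits
        if v2 > 31 or v3 > 31:
            return False
        ten_bits = v2 * 32 + v3
        length = ten_bits >> 7
        # The 7 - length filling bits must all be zero.
        return ten_bits & ((1 << (7 - length)) - 1) == 0
    return v1 * 1681 + v2 * 41 + v3 <= 0xFFFF     # two octets


def count_groups() -> tuple:
    """Return (valid, rejected) counts over all 41**3 possible groups."""
    valid = sum(is_valid_group(*values) for values in product(range(41), repeat=3))
    return valid, 41 ** 3 - valid


print(count_groups())               # (66047, 2874)
# The filling example from the text, 010 11 00110, gives P1 = 11 and P2 = 6.
print(is_valid_group(39, 11, 6))    # False
```

The 66,047 valid groups are the 65,536 octet pairs, the 256 single octets, and the 255 bit strings of length 0 to 7.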

IMPLEMENTATION CONSIDERATIONS AND EXPERIMENTAL RESULTS
We built two C programs of a Base41 encoder/decoder (available from the authors upon request): these functions may be used as a starting point for other software or hardware implementations of Base41.
The first program uses a table that keeps the encodings of:
• the 65,536 two octet bit strings,
• the 256 octet strings,
• the 254 1 to 7 bit strings,
• the empty (0 bit) string.
The encoder may use this table sorted according to the input bit string; the decoder may sort this table according to the three Base41 characters encodings.
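A table-driven variant of this kind can be sketched in Python with two dictionaries. This is not the authors' C program: the name `build_tables` and the key format `(number_of_bits, value)` are assumptions chosen for illustration.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def build_tables():
    """Build the encoding table and its inverse for all supported inputs."""
    a = ALPHABET
    encode_table = {}
    for m in range(65536):                       # two octets
        encode_table[(16, m)] = a[m // 1681] + a[m // 41 % 41] + a[m % 41]
    for octet in range(256):                     # one octet
        encode_table[(8, octet)] = "z" + a[octet >> 4] + a[octet & 0x0F]
    for length in range(8):                      # 0 to 7 bits
        for value in range(1 << length):
            ten_bits = (length << 7) | (value << (7 - length))
            encode_table[(length, value)] = "x" + a[ten_bits >> 5] + a[ten_bits & 0x1F]
    # The decoder's table is the same data indexed by the three symbols.
    decode_table = {group: key for key, group in encode_table.items()}
    return encode_table, decode_table


encode_table, decode_table = build_tables()
print(len(encode_table))              # 66047 = 65536 + 256 + 254 + 1
print(encode_table[(16, 65535)])      # vzV
print(decode_table["zTT"])            # (8, 255)
```

The two tables having the same number of entries means that no two inputs share an encoding, which is the reversibility property discussed in Appendix C.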
The second implementation does not keep any table and performs the divisions by 41 for encoding the two octet bit strings. In this case it is possible to reduce the cost of a division by applying the division-by-multiplication technique. We ran the software on some files, obtaining almost the same performance for both kinds of programs: on an Intel® Core™ i7-1165G7 at 2.80 GHz, encoding was performed at more than 80 MB/s and decoding at more than 120 MB/s.
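Division by multiplication replaces the division by 41 with a multiplication by a precomputed constant followed by a right shift. The sketch below illustrates the idea in Python; the constants `SHIFT = 22` and `MAGIC = 102301` are one possible choice worked out for this example, not values taken from the authors' programs.

```python
SHIFT = 22
MAGIC = -(-(1 << SHIFT) // 41)     # ceil(2**22 / 41) = 102301


def div41(m: int) -> int:
    """Quotient of m by 41 for 0 <= m < 65536, without a division."""
    # In C the product needs a 64-bit intermediate: 65535 * 102301 exceeds 2**32.
    return (m * MAGIC) >> SHIFT


def divmod41(m: int) -> tuple:
    """Quotient and remainder of m by 41 using one multiply and one subtract."""
    quotient = div41(m)
    return quotient, m - 41 * quotient


def check_all_16_bit_values() -> bool:
    """Exhaustively compare against the ordinary integer division."""
    return all(divmod41(m) == divmod(m, 41) for m in range(65536))


print(MAGIC)                       # 102301
print(divmod41(65535))             # (1598, 17)
print(check_all_16_bit_values())   # True
```

The constant works because 41 * 102301 exceeds $2^{22}$ by only 37, and 65,535 * 37 is smaller than $2^{22}$, so the rounding error never reaches the next quotient.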

CONCLUSIONS
This article has presented a format for representing binary data using URL-safe printable symbol sequences by means of a specific alphabet of 41 symbols taken from the uppercase and lowercase letters of the English alphabet (Figure D1 in Appendix D reports the logo we made for the proposed method). The number 41 is chosen because it is the minimal number of symbols that allows two octets to be encoded in a sequence of three symbols.
The main characteristics and advantages of the proposed method with respect to the pertinent works on Base45 [1] and base 41 [2] are:
• use of a set of glyphs that does not leave ambiguities in interpretation from a human point of view;
• use of a minimum set of symbols (41) required for encoding pairs of octets;
• use of URL-safe characters;
• ability to represent bit strings of any length (not only an integer number of octets).
The presented encoding is suggested in the contexts of representing binary data of any length in printable form, for example, in the encoding of data to be represented in a QR code.
The first symbol of each group of three output symbols immediately reveals the kind of data encoded (one or two octets, or a shorter bit string): in this way the decoding may be performed in a very simple and efficient manner. Moreover, the printable symbols are chosen from an alphabet of URL-safe characters whose glyphs are also unlikely to be confused by a human reading the encoding.

APPENDIX A
The following discussion presents the computation of the data expansion factor as a function of the number of octets and of the possible trailing bits that make up the original binary data.
Let us suppose that the original data is composed of m pairs of octets, m ≥ 0, then a possible single octet (in case the data length is an odd number of octets) indicated by q ∈ {0, 1} and that there are possibly following n bits, with 0 ≤ n ≤ 7.
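The number of bits in the original data is $16m + 8q + n$. According to the proposed Base41 encoding, each pair of octets, the possible single octet, and the possible trailing bit string are each encoded in three symbols; storing every symbol in one octet, the number of bits in the resulting bit string is $24\,(m + q + \lceil n/7 \rceil)$, where $\lceil n/7 \rceil$ is 0 for $n = 0$ and 1 otherwise. The data expansion factor is defined as the ratio $24\,(m + q + \lceil n/7 \rceil) / (16m + 8q + n)$ and it is easy to see that it tends to $24/16 = 1.5$ as $m \to \infty$, that is, for a large number of octets the increase in size is 50% because the possible trailing octet and bits are negligible. Table A1 reports the values of the data expansion factor for many values of m when the trailing single octet is not present and when it is present. These values are computed for n = 1 because this is the worst case, which requires three octets to encode a single bit.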
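The data expansion factor can be evaluated numerically with a short Python sketch. The function name `expansion_factor` is hypothetical, and the computation assumes that every Base41 symbol is stored in one octet, as suggested by the remark that three octets are needed to encode a single bit.

```python
def expansion_factor(m: int, q: int, n: int) -> float:
    """Ratio between encoded and original size, both measured in bits.

    m: number of octet pairs (m >= 0)
    q: 1 if a trailing single octet is present, else 0
    n: number of trailing bits (0..7)
    """
    if m < 0 or q not in (0, 1) or not 0 <= n <= 7:
        raise ValueError("expected m >= 0, q in {0, 1}, 0 <= n <= 7")
    original_bits = 16 * m + 8 * q + n
    if original_bits == 0:
        raise ValueError("the factor is undefined for empty input")
    groups = m + q + (1 if n > 0 else 0)   # one three-symbol group per piece
    encoded_bits = 24 * groups             # three symbols of 8 bits each
    return encoded_bits / original_bits


print(expansion_factor(1, 0, 0))   # 1.5
print(expansion_factor(0, 0, 1))   # 24.0
print(expansion_factor(0, 1, 0))   # 3.0
```

For growing m the result approaches 1.5 whatever the values of q and n.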

APPENDIX B
We provide flow charts of the encoding process for the three kinds of data, namely two octets, one octet, and a bit string. An example of the corresponding coding is reported next to each flow chart (Figures B1-B3).

APPENDIX C
The reversibility of encoding to and decoding from Base41 may be proved by examining three cases separately.
• In the case of two octets, the process is reversible like any other conversion between different bases (e.g., decimal and binary). Note that the maximum possible value, 65,535, has neither x nor z as its most significant digit, and the same holds for all the possible two octet values.
• For the same reason one octet encoding/decoding is reversible, producing two Base41 symbols each one associated with a nibble of the octet: prefixing the two Base41 symbols with z makes it possible to distinguish this case from the two octets one, making the process reversible.
• Encoding of a bit string (of 1 up to 7 bits) starts by formatting it with a three-bit prefix specifying its length, then filling it with trailing zeroes to reach a length of 10 bits. Halving the obtained string into two parts of 5 bits each and encoding every part with a Base41 symbol (note that this is possible because 5 bits allow for a maximum of 32 configurations, which is smaller than 41) makes it possible to reversibly obtain the 5 bits from the Base41 symbol; prefixing these two Base41 symbols with x distinguishes this case from the previous two, leading to a reversible process.
Having shown the reversibility of the encoding/decoding process for the three cases which are always distinguishable proves the reversibility of the proposed Base41 conversion.

APPENDIX D
This appendix reports the logo for the proposed Base41 encoding/decoding ( Figure D1).

FIGURE D1 Logo for the proposed Base41 encoding/decoding