Builders are used to efficiently construct sequences of bytes from smaller parts. Typically, such a construction is part of the implementation of an encoding, i.e., a function for converting Haskell values to sequences of bytes. Examples of encodings are the generation of the sequence of bytes representing an HTML document to be sent in an HTTP response by a web application, or the serialization of a Haskell value using a fixed binary format.
For an efficient implementation of an encoding, it is important that (a) little time is spent on converting the Haskell values to the resulting sequence of bytes, and (b) the representation of the resulting sequence is such that it can be consumed efficiently.
Builders support (a) by providing an O(1) concatenation operation and efficient implementations of basic encodings for Chars, Ints, and other standard Haskell values. They
support (b) by providing their result as a
LazyByteString,
which is internally just a linked list of pointers to
chunks of
consecutive raw memory.
LazyByteStrings can be efficiently
consumed by functions that write them to a file or send them over a
network socket. Note that each chunk boundary incurs expensive extra
work (e.g., a system call) that must be amortized over the work spent
on consuming the chunk body.
Builders therefore take special care to ensure that the average chunk size is large enough. The precise meaning of "large enough" is application dependent. The current implementation is tuned for an average chunk size between 4KB and 32KB, which should suit most applications.
As a simple example of an encoding implementation, we show how to efficiently convert the following representation of mixed-data tables to a UTF-8 encoded Comma-Separated-Values (CSV) table.
data Cell = StringC String
          | IntC Int
          deriving (Eq, Ord, Show)

type Row   = [Cell]
type Table = [Row]
We use the following imports.
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Builder
import Data.List (intersperse)
CSV is a character-based representation of tables. For maximal
modularity, we could first render
Tables as
Strings
and then encode this
String using some Unicode character
encoding. However, this sacrifices performance due to the intermediate
String representation being built and thrown away right
afterwards. We get rid of this intermediate
String
representation by fixing the character encoding to UTF-8 and using
Builders to convert
Tables directly to UTF-8 encoded
CSV tables represented as
LazyByteStrings.
encodeUtf8CSV :: Table -> L.LazyByteString
encodeUtf8CSV = toLazyByteString . renderTable

renderTable :: Table -> Builder
renderTable rs = mconcat [renderRow r <> charUtf8 '\n' | r <- rs]

renderRow :: Row -> Builder
renderRow []     = mempty
renderRow (c:cs) =
    renderCell c <> mconcat [charUtf8 ',' <> renderCell c' | c' <- cs]

renderCell :: Cell -> Builder
renderCell (StringC cs) = renderString cs
renderCell (IntC i)     = intDec i

renderString :: String -> Builder
renderString cs = charUtf8 '"' <> foldMap escape cs <> charUtf8 '"'
  where
    escape '\\' = charUtf8 '\\' <> charUtf8 '\\'
    escape '\"' = charUtf8 '\\' <> charUtf8 '\"'
    escape c    = charUtf8 c
Note that the ASCII encoding is a subset of the UTF-8 encoding, which
is why we can use the optimized function
intDec to encode an
Int as a decimal number with UTF-8 encoded digits. Using
intDec is more efficient than
stringUtf8 .
show, as it avoids constructing an intermediate
String. Avoiding this intermediate data structure significantly
improves performance because encoding Cells is the core operation for rendering CSV tables. See
Data.ByteString.Builder.Prim for further information on how to
improve the performance of
renderString.
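As a sketch of what such an improvement looks like (adapted from the documentation of Data.ByteString.Builder.Prim; the renderStringPrim and fixed2 names are ours), the escaping loop can be expressed as a single bounded-size primitive instead of one Builder per character:

import qualified Data.ByteString.Builder.Prim as P
import Data.ByteString.Builder.Prim ((>$<), (>*<))

-- Run the whole escaping loop with one BoundedPrim, avoiding the
-- overhead of constructing a Builder for every character.
renderStringPrim :: String -> Builder
renderStringPrim cs =
    charUtf8 '"' <> P.primMapListBounded escape cs <> charUtf8 '"'
  where
    escape :: P.BoundedPrim Char
    escape =
        P.condB (== '\\') (fixed2 ('\\', '\\')) $
        P.condB (== '\"') (fixed2 ('\\', '\"')) $
        P.charUtf8
    -- emit a fixed pair of 7-bit ASCII characters
    fixed2 x = P.liftFixedToBounded $ const x >$< P.char7 >*< P.char7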
We demonstrate our UTF-8 CSV encoding function on the following table.
strings :: [String]
strings = ["hello", "\"1\"", "λ-wörld"]
table :: Table
table = [map StringC strings, map IntC [-3..3]]
The expression encodeUtf8CSV table results in the following LazyByteString.
Chunk "\"hello\",\"\\\"1\\\"\",\"\206\187-w\195\182rld\"\n-3,-2,-1,0,1,2,3\n" Empty
We can clearly see that we are converting to a
binary format.
The 'λ' and 'ö' characters, which have Unicode codepoints above 127, are expanded to their corresponding multi-byte UTF-8 representations.
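Consuming this result is a matter of handing it to one of the standard output functions, which process it one chunk at a time; for example (the writeCSV name and the file name are ours):

-- Write the encoded table to stdout and to a file; both consume the
-- LazyByteString chunk by chunk, without flattening it first.
writeCSV :: IO ()
writeCSV = do
    L.putStr (encodeUtf8CSV table)
    L.writeFile "table.csv" (encodeUtf8CSV table)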
We use the criterion library (http://hackage.haskell.org/package/criterion) to benchmark the efficiency of our encoding function on the following table.
import Criterion.Main  -- add this import to the ones above

maxiTable :: Table
maxiTable = take 1000 $ cycle table

main :: IO ()
main = defaultMain
  [ bench "encodeUtf8CSV maxiTable (original)" $
      whnf (L.length . encodeUtf8CSV) maxiTable
  ]
On a Core 2 Duo 2.20 GHz running 32-bit Linux, the above code takes 1 ms to generate the 22'500-byte LazyByteString. Looking again at
the definitions above, we see that we took care to avoid intermediate
data structures, as otherwise we would sacrifice performance. For
example, the following (arguably simpler) definition of
renderRow is about 20% slower.
renderRow :: Row -> Builder
renderRow = mconcat . intersperse (charUtf8 ',') . map renderCell
Similarly, using O(n) concatenations like ++ or the equivalent concat operations on strict ByteStrings and LazyByteStrings should be avoided. The following definition of renderString is also about 20% slower.
renderString :: String -> Builder
renderString cs = stringUtf8 $ "\"" ++ concatMap escape cs ++ "\""
  where
    escape '\\' = "\\\\"
    escape '\"' = "\\\""
    escape c    = return c
Apart from removing intermediate data structures, encodings can be optimized further by fine-tuning their execution parameters using the functions in Data.ByteString.Builder.Extra, and their "inner loops" using the functions in Data.ByteString.Builder.Prim.