If you're accustomed to working with text in almost any other
programming language, you'd be aware that a "string" typically refers
to an in-memory
array of characters. Traditionally this was a
single ASCII byte per character; more recently UTF-8 variable byte
encodings which dramatically complicates finding offsets but which
gives efficient support for the entire Unicode character space. In
Haskell, the original text type,
String, is implemented as a
list of
Char which, because a Haskell list is implemented as a
linked-list of boxed values, is wildly inefficient at any kind
of scale.
In modern Haskell there are two primary ways to represent text.
First is via the [rather poorly named]
ByteString from the
bytestring package (which is an array of bytes in pinned
memory). The
Data.ByteString.Char8 submodule gives you ways to
manipulate those arrays as if they were ASCII characters. Confusingly
there are both strict (
Data.ByteString) and lazy
(
Data.ByteString.Lazy) variants which are often hard to tell
the difference between when reading function signatures or haddock
documentation. The performance problem an immutable array backed data
type runs into is that appending a character (that is, ASCII byte) or
concatonating a string (that is, another array of ASCII bytes) is very
expensive and requires allocating a new larger array and copying the
whole thing into it. This led to the development of "builders" which
amortize this reallocation cost over time, but it can be cumbersome to
switch between
Builder, the lazy
ByteString that
results, and then having to inevitably convert to a strict
ByteString because that's what the next function in your
sequence requires.
The second way is through the opaque
Text type of
Data.Text from the
text package, which is well tuned and
high-performing but suffers from the same design; it is likewise
backed by arrays. (Historically, the storage backing Text objects was
encoded in UTF-16, meaning every time you wanted to work with unicode
characters that came in from
anywhere else and which inevitably
were UTF-8 encoded they had to be converted to UTF-16 and copied into
a further new array! Fortunately Haskell has recently adopted a UTF-8
backed
Text type, reducing this overhead. The challenge of
appending pinned allocations remains, however.)
In this package we introduce
Rope, a text type backed by the
2-3
FingerTree data structure from the
fingertree
package. This is not an uncommon solution in many languages as finger
trees support exceptionally efficient appending to either end and good
performance inserting anywhere else (you often find them as the
backing data type underneath text editors for this reason). Rather
than
Char the pieces of the rope are
ShortText from the
text-short package, which are UTF-8 encoded and in normal
memory managed by the Haskell runtime. Conversion from other Haskell
text types is not
O(1) (UTF-8 validity must be checked, or
UTF-16 decoded, or...), but in our benchmarking the performance has
been comparable to the established types and you may find the
resultant interface for combining chunks is comparable to using a
Builder, without being forced to use a Builder.
Rope is used as the text type throughout this library. If you
use the functions within this package (rather than converting to other
text types) operations are quite efficient. When you do need to
convert to another type you can use
fromRope or
intoRope
from the
Textual typeclass.
Note that we haven't tried to cover the entire gamut of operations or
customary convenience functions you would find in the other libraries;
so far
Rope is concentrated on aiding interoperation, being
good at appending (lots of) small pieces, and then efficiently taking
the resultant text object out to a file handle, be that the terminal
console, a file, or a network socket.