Rope package:core-text

If you're accustomed to working with text in almost any other programming language, you'd be aware that a "string" typically refers to an in-memory array of characters. Traditionally this was a single ASCII byte per character; more recently UTF-8 variable byte encodings which dramatically complicates finding offsets but which gives efficient support for the entire Unicode character space. In Haskell, the original text type, String, is implemented as a list of Char which, because a Haskell list is implemented as a linked-list of boxed values, is wildly inefficient at any kind of scale. In modern Haskell there are two primary ways to represent text. First is via the [rather poorly named] ByteString from the bytestring package (which is an array of bytes in pinned memory). The Data.ByteString.Char8 submodule gives you ways to manipulate those arrays as if they were ASCII characters. Confusingly there are both strict (Data.ByteString) and lazy (Data.ByteString.Lazy) variants which are often hard to tell the difference between when reading function signatures or haddock documentation. The performance problem an immutable array backed data type runs into is that appending a character (that is, ASCII byte) or concatonating a string (that is, another array of ASCII bytes) is very expensive and requires allocating a new larger array and copying the whole thing into it. This led to the development of "builders" which amortize this reallocation cost over time, but it can be cumbersome to switch between Builder, the lazy ByteString that results, and then having to inevitably convert to a strict ByteString because that's what the next function in your sequence requires. The second way is through the opaque Text type of Data.Text from the text package, which is well tuned and high-performing but suffers from the same design; it is likewise backed by arrays. (Historically, the storage backing Text objects was encoded in UTF-16, meaning every time you wanted to work with unicode characters that came in from anywhere else and which inevitably were UTF-8 encoded they had to be converted to UTF-16 and copied into a further new array! Fortunately Haskell has recently adopted a UTF-8 backed Text type, reducing this overhead. The challenge of appending pinned allocations remains, however.) In this package we introduce Rope, a text type backed by the 2-3 FingerTree data structure from the fingertree package. This is not an uncommon solution in many languages as finger trees support exceptionally efficient appending to either end and good performance inserting anywhere else (you often find them as the backing data type underneath text editors for this reason). Rather than Char the pieces of the rope are ShortText from the text-short package, which are UTF-8 encoded and in normal memory managed by the Haskell runtime. Conversion from other Haskell text types is not O(1) (UTF-8 validity must be checked, or UTF-16 decoded, or...), but in our benchmarking the performance has been comparable to the established types and you may find the resultant interface for combining chunks is comparable to using a Builder, without being forced to use a Builder. Rope is used as the text type throughout this library. If you use the functions within this package (rather than converting to other text types) operations are quite efficient. When you do need to convert to another type you can use fromRope or intoRope from the Textual typeclass. Note that we haven't tried to cover the entire gamut of operations or customary convenience functions you would find in the other libraries; so far Rope is concentrated on aiding interoperation, being good at appending (lots of) small pieces, and then efficiently taking the resultant text object out to a file handle, be that the terminal console, a file, or a network socket.
A type for textual data. A rope is text backed by a tree data structure, rather than a single large continguous array, as is the case for strings. There are three use cases: Referencing externally sourced data Often we interpret large blocks of data sourced from external systems as text. Ideally we would hold onto this without copying the memory, but (as in the case of ByteString which is the most common source of data) before we can treat it as text we have to validate the UTF-8 content. Safety first. We also copy it out of pinned memory, allowing the Haskell runtime to manage the storage. Interoperating with other libraries The only constant of the Haskell universe is that you won't have the right combination of {strict, lazy} × {Text, ByteString, String, [Word8], etc} you need for the next function call. The Textual typeclass provides for moving between different text representations. To convert between Rope and something else use fromRope; to construct a Rope from textual content in another type use intoRope. You can get at the underlying finger tree with the unRope function. Assembling text to go out This involves considerable appending of data, very very occaisionally inserting it. Often the pieces are tiny. To add text to a Rope use the appendRope method as below or the (<>) operator from Data.Monoid (like you would have with a Builder). Output to a Handle can be done efficiently with hWrite.
Append some text to this Rope. The default implementation is basically a convenience wrapper around calling intoRope and mappending it to your text (which will work just fine, but for some types more efficient implementations are possible).
Copy the pieces underlying a Rope into a single piece object. Warning This function was necessary to have a reliable Hashable instance. Currently constructing this new Rope is quite inefficient if the number of pieces or their respective lengths are large. Usually, however, we're calling hash so the value can be used as a key in a hash table and such keys are typically simple (or at least not ridiculously long), so this is not an issue in normal usage.
A zero-length Rope. You can also use "" presuming the OverloadedStrings language extension is turned on in your source file.
Convert a Rope into another text-like type.
Insert a new piece of text into an existing Rope at the specified offset. Examples:
λ> insertRope 3 "Con" "Def 1"
"DefCon 1"
λ> insertRope 0 "United " "Nations"
"United Nations"
Take another text-like type and convert it to a Rope.
A Rope built from a list of characters. Equivalent to calling intoRope on the String, but can help you avoid ambiguious type errors when composing functions or working with literals.
Repeat the input Rope n times. The follows the same semantics as other replicate functions; if you ask for zero copies you'll get an empty text and if you ask for lots of "" you'll get ... an empty text. Implementation note Rather than copying the input n times, this will simply add structure to hold n references to the provided input text.
A Rope with but a single character.
Break the text into two pieces at the specified offset. Examples:
λ> splitRope 0 "abcdef"
("", "abcdef")
λ> splitRope 3 "abcdef"
("abc", "def")
λ> splitRope 6 "abcdef"
("abcdef","")
Going off either end behaves sensibly:
λ> splitRope 7 "abcdef"
("abcdef","")
λ> splitRope (-1) "abcdef"
("", "abcdef")
Take the first _n_ characters from the beginning of the Rope.
λ> takeRope 3 "123456789"
"123"
Access the finger tree underlying the Rope. You'll want the following imports:
import qualified Data.FingerTree as F  -- from the fingertree package
import qualified Data.Text.Short as S  -- from the text-short package
Read the first character from a Rope, assuming it's length 1 or greater, returning Just that character and the remainder of the text. Returns Nothing if the input is 0 length.
If you know the input bytes are valid UTF-8 encoded characters, then you can use this function to convert to a piece of Rope.
Get the length of this text, in characters.
Given a piece of Rope and a predicate, break the text into two pieces at the first site where that predicate returns True.