Skip to content

pgaskin/go-marisa

Repository files navigation

go-marisa

Go Reference Test Attest marisa build

Go bindings for marisa-trie.

MARISA is a read-only space-efficient trie data structure optimized for lookup, reverse lookup, commmon prefix search (keys which are prefixes of the query), and predictive search (keys starting with the query).

This library wraps a WebAssembly build of MARISA using wazero.

Getting started

var trie marisa.Trie
if err := trie.Build(EnglishWords(), marisa.Config{}); err != nil {
    panic(err)
}

id, ok, err := trie.Lookup("iterate")
if err != nil {
    panic(err)
}
if !ok {
    fmt.Println("not found")
}

key, ok, err := trie.ReverseLookup(id)
if err != nil {
    panic(err)
}
if !ok {
    fmt.Println("not found")
}

fmt.Println("l", id, key)

for id, key := range trie.PredictiveSearchSeq("iterat")(&err) {
    fmt.Println("p", id, key)
}
if err != nil {
    panic(err)
}

for id, key := range trie.CommonPrefixSearchSeq("iterated")(&err) {
    fmt.Println("c", id, key)
}
if err != nil {
    panic(err)
}

Limitations

This library supports little-endian MARISA dictionaries up to 4 GiB. On 32-bit systems, the size is limited to around 2 GiB. These are limitations of MARISA itself.

Big-endian dictionaries (i.e., ones generated with the native tools on big-endian hosts) are not supported.

Memory-mapped dictionaries are only supported on unix-like platforms.

Performance

On platforms which support wazero's JIT compiler, it's about 2-3x slower than the native library. Using the interpreter, it's about 70-150x slower.

The memory usage should be around the same other than a ~115K overhead per trie. Keys are copied to/from Go strings when used.

Design

The API is stable, type-safe, idiomatic, and does not leak implementation details of the marisa-trie library. All errors are handled appropriately and returned.

It supports common Go interfaces like encoding.BinaryMarshaler, encoding.BinaryUnmarshaler, encoding.BinaryAppender, io.WriterTo, and io.WriterFrom.

It provides optimized iterable APIs for queries.

Tries are garbage collected automatically along with other Go objects when there are no more references to it or iterators derived from it.

This module also includes drop-in replacements for the native command-line tools. The have compatible input/output and exit codes, but the error messages may differ.

Concurrency

Each trie can be used across goroutines, but must not be used concurrently by multiple goroutines.

Multiple iterators can be active at once, but must not be called concurrently by multiple goroutines with each other or the trie it came from.

Testing

The wasm blob is fully reproducible and verified.

The tries written by this wrapper will be bit-identical to the ones generated by the native marisa-build.

There are comprehensive unit tests for all exposed functionality.