Arcaneum

Arcaneum is a local-first semantic search and knowledge management tool used to index source code, research papers and references (PDFs), and Markdown files. The latter are often the result of deep analysis using the other data sources, so ensuring citations and assertions are valid is important.

Arcaneum was developed for two reasons.

First, as an experiment in building a full-featured project only using the RDR model. I didn’t write a single line of code, nor have I read any of it, but I did iterate extensively on the RDRs I wanted to have implemented on my behalf.

Second, I needed a tool to help me write better RDRs and perform deeper research. This meant I needed a RAG for both research/academic papers and for source code. Arcaneum supports both.

A third bonus reason was that I wanted to explore the AI tooling ecosystem.

Arcaneum relies on a full-text search engine and a vector search engine, both loaded and run locally via Docker. When a corpus is created, both a full-text and a vector search collection are created, and both are loaded during indexing.
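To make the dual-collection idea concrete, here is a minimal sketch in Python. It is not Arcaneum's code or API: the `Corpus` class, its method names, and the bag-of-words `embed` function are all hypothetical stand-ins for a real full-text engine, a real vector engine, and a real embedding model.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Hypothetical stand-in for an embedding model: a normalized bag of words."""
    counts = Counter(tokenize(text))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


class Corpus:
    """Toy corpus that keeps a full-text collection and a vector collection in step."""

    def __init__(self, name):
        self.name = name
        self.docs = {}                     # doc id -> original text
        self.fulltext = defaultdict(set)   # term -> ids of documents containing it
        self.vectors = {}                  # doc id -> embedding

    def index(self, doc_id, text):
        """Load one document into both collections."""
        self.docs[doc_id] = text
        for term in tokenize(text):
            self.fulltext[term].add(doc_id)
        self.vectors[doc_id] = embed(text)

    def search_text(self, phrase):
        """Exact phrase lookup, narrowed first through the inverted index."""
        terms = tokenize(phrase)
        if not terms:
            return []
        candidates = set.intersection(*(self.fulltext.get(t, set()) for t in terms))
        return sorted(d for d in candidates if phrase.lower() in self.docs[d].lower())

    def search_semantic(self, query, limit=3):
        """Rank documents by dot product of normalized vectors (cosine similarity)."""
        q = embed(query)
        scored = [
            (sum(w * vec.get(t, 0.0) for t, w in q.items()), doc_id)
            for doc_id, vec in self.vectors.items()
        ]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for score, doc_id in scored[:limit] if score > 0]
```

The point of the sketch is only that a single indexing call feeds both collections, so exact-match lookups and similarity ranking always see the same set of documents.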

The semantic search is for finding nuanced topics and relationships between documents. The full-text search is there so that citations and references can be extracted and embedded in the resulting analysis documents. This is especially helpful when researching standards-body documents that rely on one another and cross-referencing them with reference implementations.

Using Arcaneum within an LLM harness is the most powerful way to use it, but only for search. Indexing can be time-consuming, depending on the embedding model chosen.

I use Arcaneum every day and have a dozen topic-related corpora. One top-level corpus maintains an index of 4k PDF research papers, using a reasonably fast embedding model. When I zero in on a topic, I use this corpus to find related documents and citations, which I then re-index into a new corpus using a stronger, larger embedding model.
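The two-stage workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not how Arcaneum implements it: every function name here is hypothetical, and the "fast" and "strong" embedders are toy word counters that differ only in vocabulary size, standing in for a small and a large embedding model.

```python
import math
import re


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def make_embedder(vocabulary):
    """Build a toy embedder that counts occurrences of each vocabulary word.

    A larger vocabulary plays the role of a stronger (and slower) model.
    """
    def embed(text):
        words = re.findall(r"[a-z0-9]+", text.lower())
        return [float(words.count(term)) for term in vocabulary]
    return embed


def build_index(docs, embed):
    """Embed every document: doc id -> vector."""
    return {doc_id: embed(text) for doc_id, text in docs.items()}


def top_matches(index, embed, query, keep):
    """Return the ids of the `keep` documents most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda d: (-cosine(q, index[d]), d))
    return ranked[:keep]


def reindex_subset(docs, doc_ids, strong_embed):
    """Create a new, smaller index over the selected documents with the stronger embedder."""
    return build_index({d: docs[d] for d in doc_ids}, strong_embed)
```

The broad index is cheap to build because of the fast embedder, and the expensive embedder only ever runs over the handful of documents that survive the first pass.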

Chris K Wensel
Data and Systems Architect