Literature Vault: Project Overview
I'm planning to build "Literature Vault", a local search engine for research papers stored as PDFs on my machine. It will ingest and index the content of these PDFs and let me search them with keywords or natural language through a simple CLI and an MCP server. That should make it a good fit for agentic flows.
If I were doing my PhD again, I would be building something like this. I recently signed a contract to write a book on AI and STEM education, so I'm building this to help with the research for the book, for my own use. I'll document the process of building it here too, in case it's useful to others, and I'll share the code on GitHub when it's ready.
This is not a "chat-with-your-PDF" tool. Those tools exist! For example, Google's NotebookLM lets you upload your PDFs and then ask questions about them. But NotebookLM is more of a synthesis tool, and that is a layer one can build on top of my literature vault. Besides, I prefer to do that kind of work myself (let the human do some of the work).
I want to be able to run this entirely on-device, without relying on any AI service providers. I know it's possible, but at the moment I'm not sure whether it's feasible on modest hardware. I'll figure it out as I go, and I'll try to build it in a modular way so that I can swap out components as needed.
There are two main components to this system: the ingestion pipeline and the search engine. I'm going to jot down some thoughts on each of these components here, since it helps me organize my ideas. These may change as I go, and that's fine.
Search
I want three modes of search: keyword, semantic, and hybrid.
The keyword search is plain full-text search. This is the classic approach, from before embeddings, and it's still the standard today. It is not just the exact string match that you get when you press Ctrl+F in a PDF viewer. It will be more like the search you get in Google Scholar, which includes stemming, stop-word removal, and other NLP techniques to make it more effective.
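To make the difference from Ctrl+F concrete, here is a minimal sketch using SQLite's FTS5 module with its Porter stemming tokenizer. The table name `passages`, the column `body`, and the helper `keyword_search` are made up for illustration, and it assumes the Python build's SQLite was compiled with FTS5 (most are).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The porter tokenizer stems English words, so "transformers" and
# "transformer" end up as the same index term.
conn.execute(
    "CREATE VIRTUAL TABLE passages USING fts5(body, tokenize='porter unicode61')"
)
conn.executemany(
    "INSERT INTO passages (body) VALUES (?)",
    [
        ("Transformers rely on attention to mix information across tokens.",),
        ("Convolutional networks use local filters over images.",),
    ],
)


def keyword_search(query: str, limit: int = 10) -> list[str]:
    """Hypothetical helper: full-text search ranked by BM25."""
    # bm25() returns lower values for better matches, so ascending order
    # puts the best match first.
    rows = conn.execute(
        "SELECT body FROM passages WHERE passages MATCH ? "
        "ORDER BY bm25(passages) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [body for (body,) in rows]


# The singular query still finds the passage that only contains the plural.
hits = keyword_search("transformer")
```

A plain substring search for "transformer" would also hit here by accident, but stemming is what makes a query like "relies" find "rely" in the same passage.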
Semantic search is what the era of LLMs gave us. The process is to break the papers into smaller chunks (e.g., paragraphs), turn each chunk into a vector of numbers (an embedding) using a language model, and store those embeddings in a vector database. The idea is that the vector captures the meaning (semantics) of the text, so that passages with similar meanings have similar vectors even if they use different words. When I search, my query is turned into an embedding too, and the engine finds the passages whose embeddings are closest to it in the vector space (vector search). This is the part that lets you search in natural language.
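The "closest in vector space" step can be sketched with cosine similarity, one common closeness measure (the post does not commit to a particular one). The vectors below are made-up three-dimensional toys; real embeddings come from a model and have hundreds of dimensions. The function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def semantic_search(query_vec, passages, top_k=2):
    """Brute-force vector search: score every passage, keep the best top_k."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in passages.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings keyed by passage ID (illustrative values only).
passages = {
    "p1": [0.9, 0.1, 0.0],
    "p2": [0.1, 0.9, 0.2],
    "p3": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

top = semantic_search(query_vec, passages)
```

Here `p1` and `p3` point in nearly the same direction as the query, so they rank above `p2`. A vector database does the same comparison, just with an index so it does not have to score every passage.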
So you might think: now that we can do semantic search, why do we need keyword search? Well, semantic search can miss some things that keyword search would catch. For example, if I search for exact phrases or specific technical terms, keyword search will certainly find those, while semantic search might not if those terms get diluted in the embedding. So the common wisdom is to use hybrid search, which combines the keyword and semantic search. It runs both of them and then combines the rankings. For example, it might take the top 100 results from each search and then re-rank them based on some combination of their keyword relevance and their semantic relevance to the query.
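One widely used way to combine the two rankings is reciprocal rank fusion (RRF). The post does not say which fusion method it will use, so treat this as one possible choice rather than the plan. Each passage gets 1 / (k + rank) from every list it appears in, and the sums decide the final order; `k = 60` is a conventional default, and the passage IDs are invented.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of passage IDs into one ranked list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # Passages near the top of any list contribute the most.
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda pid: scores[pid], reverse=True)


# Hypothetical results from the two searches, best match first.
keyword_ranking = ["p7", "p2", "p9"]
semantic_ranking = ["p2", "p4", "p7"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

Passages that show up in both lists (`p2`, `p7`) rise above those found by only one search. RRF only needs ranks, which is convenient because BM25 scores and vector distances are not on comparable scales.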
I don't want three separate commands for this. They're really the same thing: I type a query, I get back a ranked list of passages. So I'll make it one search command with a --mode flag.
lv search "attention in transformers" # hybrid by default
lv search --mode keyword "transformer architecture"
lv search --mode semantic "how models handle long context"
lv is the tentative name for the CLI tool.
A few things I'm deciding early:
- Hybrid is the default, so a plain lv search "..." just does the sensible thing without me picking a mode.
- The results are passages with some ID that we can use to retrieve the paper (its metadata and location on disk) later.
- We should have --json and --files output formats that are more machine-readable, for use in agentic flows.
- We should have some other flags for controlling the search, like how many results to return, whether to restrict the search based on certain metadata (e.g., only papers from 2020-2022), etc.
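The command shape described above can be sketched as an argument parser. The Tech stack section names Typer for the real CLI; this sketch uses the standard library's argparse only so that it runs with no extra dependencies. The flags `--mode`, `--json`, and `--files` come from the text. The names `--limit`, `--year-from`, and `--year-to`, and the choice to make the two output formats mutually exclusive, are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="lv")
    subcommands = parser.add_subparsers(dest="command", required=True)

    search = subcommands.add_parser("search", help="search indexed passages")
    search.add_argument("query")
    # Hybrid is the default, so a plain `lv search "..."` needs no mode.
    search.add_argument(
        "--mode", choices=["keyword", "semantic", "hybrid"], default="hybrid"
    )

    # Assumption: only one machine-readable output format at a time.
    output = search.add_mutually_exclusive_group()
    output.add_argument("--json", action="store_true")
    output.add_argument("--files", action="store_true")

    # Hypothetical flag names for result count and metadata filtering.
    search.add_argument("--limit", type=int, default=10)
    search.add_argument("--year-from", type=int, default=None)
    search.add_argument("--year-to", type=int, default=None)
    return parser


args = build_parser().parse_args(["search", "attention in transformers"])
filtered = build_parser().parse_args(
    ["search", "--mode", "keyword", "--year-from", "2020", "--year-to", "2022",
     "transformer architecture"]
)
```

With this shape, `args.mode` falls back to `"hybrid"`, while `filtered` carries the keyword mode and the 2020-2022 year range from the example above.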
Ingestion
Before I can search anything, I need to get the papers into the vault. This is the ingestion side.
The first thing is to figure out what a paper is, as far as the system is concerned. I think the right move is to give each paper a fingerprint by hashing its content. That gives me a stable ID, and it also solves a practical problem: if I add the same paper twice, the system can tell it has seen it before and skip it. So lv add becomes idempotent almost for free. The fingerprint is also what lv remove and lv list refer to when they talk about a paper.
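A minimal sketch of the fingerprint idea, assuming "content" means the raw bytes of the PDF file and that SHA-256 is the hash (the post fixes neither). The in-memory `vault` dictionary and the function names are placeholders for whatever storage the real system uses.

```python
import hashlib
from pathlib import Path


def fingerprint(pdf_path: Path) -> str:
    """Hash the file's bytes in blocks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with pdf_path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def add_paper(pdf_path: Path, vault: dict[str, Path]) -> bool:
    """Add a paper unless its fingerprint is already known.

    Returns True if the paper was added, False if it was skipped.
    """
    paper_id = fingerprint(pdf_path)
    if paper_id in vault:
        return False  # seen before: adding again is a no-op
    vault[paper_id] = pdf_path
    return True
```

One consequence of hashing raw bytes: two downloads of the same paper that differ by even one byte (a different watermark, say) get different fingerprints. Hashing the extracted text instead would be more forgiving, at the cost of depending on the extractor.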
The next thing is to get the text out of the PDF. This is harder than it sounds, because PDFs are messy, but it's a solved problem and I don't need to solve it myself. I'll treat text extraction as a component I plug in, so I can start with something simple and swap in a better extractor later without touching the rest of the system. The same goes for pulling out metadata like the title, authors, and year, which I'll need for lv list and for the metadata filters on search.
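The "component I plug in" idea could look something like the sketch below: an interface that any extractor has to satisfy, plus a placeholder implementation. All names here (`ExtractedPaper`, `Extractor`, `PlaceholderExtractor`, `ingest_text`) are hypothetical, and the placeholder does no real PDF parsing; a real one would wrap a PDF library.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Protocol


@dataclass
class ExtractedPaper:
    text: str
    title: Optional[str] = None
    authors: Optional[list[str]] = None
    year: Optional[int] = None


class Extractor(Protocol):
    """Anything with a matching extract() method counts as an extractor."""

    def extract(self, pdf_path: Path) -> ExtractedPaper: ...


class PlaceholderExtractor:
    """Stand-in that only shows the interface; it never opens the file."""

    def extract(self, pdf_path: Path) -> ExtractedPaper:
        # Use the file name as a crude title guess and return no text.
        return ExtractedPaper(text="", title=pdf_path.stem)


def ingest_text(pdf_path: Path, extractor: Extractor) -> ExtractedPaper:
    # The rest of the pipeline only sees ExtractedPaper, never the extractor.
    return extractor.extract(pdf_path)


paper = ingest_text(Path("example.pdf"), PlaceholderExtractor())
```

Swapping in a better extractor later means writing another class with the same `extract` method and passing it to `ingest_text`; nothing downstream changes.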
Once I have the text, the rest of ingestion is a pipeline that turns it into the things search needs. It chunks the paper into passages, builds the keyword index over them, computes the embeddings, and stores everything alongside the paper's fingerprint and the location of each passage. When that pipeline finishes, the paper is searchable. That's really all "ingested" means here: the paper has been fingerprinted, its text extracted, and its passages indexed for both kinds of search.
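The chunking step can be sketched as follows, assuming passages are paragraphs separated by blank lines and that a passage's "location" is its character range in the extracted text. Both are assumptions; the post leaves the chunking strategy open, and the `Passage` fields are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    paper_id: str  # the paper's fingerprint
    index: int     # position of the passage within the paper
    start: int     # character offset where the passage begins
    end: int       # character offset just past the passage
    text: str


def chunk_paragraphs(paper_id: str, text: str) -> list[Passage]:
    """Split text on blank lines, remembering where each paragraph came from."""
    passages: list[Passage] = []
    offset = 0
    for block in text.split("\n\n"):
        stripped = block.strip()
        if stripped:
            start = offset + block.index(stripped)
            passages.append(
                Passage(paper_id, len(passages), start, start + len(stripped), stripped)
            )
        # Move past this block and the two-character separator split() removed.
        offset += len(block) + 2
    return passages


sample = "Intro paragraph.\n\nMethods paragraph\nspans two lines.\n"
chunks = chunk_paragraphs("abc123", sample)
```

Each `Passage` is then what gets written to the keyword index and embedded, with `paper_id` tying it back to the paper.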
Tech stack
I'm going to build this in Python, since the libraries for embeddings, PDF handling, and everything on the ingestion side mostly live there.
For storage, I'll use a single SQLite file. SQLite has full-text search built in, and with the sqlite-vec extension it can store and search vectors too. So both the keyword index and the embeddings live in the same file, next to the paper metadata and fingerprints. One file, no server to run, everything on disk. That fits the on-device goal nicely, and it's easy to swap out later if I outgrow it.
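As a rough picture of "everything in one file", here is a sketch of a possible schema. The table and column names are invented. It deliberately does not use sqlite-vec: embeddings are stored as packed float32 bytes in an ordinary BLOB column as a stand-in, and the keyword index is left out to keep the sketch short.

```python
import sqlite3
import struct


def open_vault(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the single-file vault with a hypothetical schema."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS papers (
            fingerprint TEXT PRIMARY KEY,
            path        TEXT NOT NULL,
            title       TEXT,
            year        INTEGER
        );
        CREATE TABLE IF NOT EXISTS passages (
            id          INTEGER PRIMARY KEY,
            fingerprint TEXT NOT NULL REFERENCES papers(fingerprint),
            body        TEXT NOT NULL,
            embedding   BLOB  -- packed float32 values; stand-in for sqlite-vec
        );
        """
    )
    return conn


def pack_vector(vec: list[float]) -> bytes:
    """Serialize a vector as little-endian float32 values."""
    return struct.pack(f"<{len(vec)}f", *vec)


def unpack_vector(blob: bytes) -> list[float]:
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


conn = open_vault()
conn.execute("INSERT INTO papers VALUES (?, ?, ?, ?)", ("abc123", "/papers/a.pdf", "A", 2021))
conn.execute(
    "INSERT INTO passages (fingerprint, body, embedding) VALUES (?, ?, ?)",
    ("abc123", "Some passage text.", pack_vector([0.5, -1.0, 0.25])),
)
stored = conn.execute("SELECT embedding FROM passages").fetchone()[0]
```

Passing a real file path to `open_vault` instead of `":memory:"` gives the one-file-on-disk setup; a vector extension would replace the BLOB column and add nearest-neighbour queries.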
For the embeddings, I'll use a small local embedding model that runs on CPU. The exact model is a detail I'll figure out later; the point is that it runs on my machine and nothing leaves it.
For the CLI, I'll use Typer. It lets me write each command as a plain function with type hints, which keeps things small and readable. And for the MCP server, I'll use FastMCP, which works the same way: a tool is just a typed function with a decorator.
That last bit is the part I'm most happy about. Because both Typer and FastMCP wrap plain typed functions, I can write the core logic once and expose it through both the CLI and the MCP server without duplicating anything. And if I ever want to put a web API in front of it, the same logic is right there to wrap again. So the searching and ingesting live in one place, and the CLI, the MCP server, and a future API are all just thin layers on top.
That's the plan for now.