Adding and Reading Papers
Adding and Reading Papers
In the previous post, I set up the database module: a connect() function that opens a SQLite file and an init_db() function that creates the papers table. The module could create the table, but it could not yet do anything with it.
In this post, I want to add the first real database operations: inserting a paper record and reading it back. I am still going to pass the fingerprint in by hand, like I did before. Computing a real fingerprint from a PDF is its own topic, and I want to get the read-and-write path working first.
There is one decision in this post that is more interesting than it looks, and it is the reason I am doing insert before anything else: what should happen when I add the same paper twice?
Adding a paper
Let's start with the insert. Add this function to src/literature_vault/db.py:
def add_paper(
connection: sqlite3.Connection,
fingerprint: str,
path: str,
title: str | None = None,
year: int | None = None,
) -> bool:
cursor = connection.execute(
"""
INSERT INTO papers (fingerprint, path, title, year)
VALUES (:fingerprint, :path, :title, :year)
ON CONFLICT (fingerprint) DO NOTHING
""",
{
"fingerprint": fingerprint,
"path": path,
"title": title,
"year": year,
},
)
connection.commit()
return cursor.rowcount == 1
The function takes the connection and the fields for one paper. The title and year are optional, because there will be papers where I do not have that metadata yet. It runs an INSERT, commits, and returns a boolean.
A couple of things here are worth slowing down on.
The first is how I pass the values. Instead of putting the data directly into the SQL string, I use named placeholders like :fingerprint and hand the actual values in a separate dictionary. SQLite fills them in for me. This is the standard way to avoid SQL injection, and it also means I do not have to worry about quoting strings or escaping characters myself. The same idea works with ? placeholders and a tuple, but I find the named version easier to read once there are more than two or three columns.
The second is the ON CONFLICT (fingerprint) DO NOTHING clause, which is the heart of this post.
What happens when I add the same paper twice
Back in the overview, I said I wanted lv add to be idempotent: if I add the same paper twice, the system should notice it has seen it before and skip it. The papers table already has the piece that makes this possible, which is the UNIQUE constraint on fingerprint. SQLite will not let me store two rows with the same fingerprint.
The question is what happens when I try anyway. By default, SQLite treats it as an error and raises an exception. That is not what I want for ingestion. Adding a paper I already have is a normal thing to do, not a failure.
So I have a few options:
- Catch the exception. Let the
INSERTfail and wrap it in atry/exceptforsqlite3.IntegrityError. This works, but it uses an exception for something that is not really an error, and it makes the normal "already seen" path go through error handling. - Use
INSERT OR IGNORE. This tells SQLite to silently skip the row on any constraint violation. It is short, but it is also blunt: it ignores every constraint, so if I later add a constraint I actually want to hear about, this would swallow it too. - Use
ON CONFLICT (fingerprint) DO NOTHING. This says exactly which conflict I am choosing to ignore. If a row with this fingerprint already exists, do nothing. Any other problem still raises.
I went with the last one. It is the most explicit: anyone reading the function can see that the duplicate-fingerprint case is handled on purpose, and that nothing else is being quietly swallowed.
That leaves one loose end. If the insert quietly does nothing on a duplicate, how does the caller know whether the paper was new or already there? That is what the return value is for. After the statement runs, cursor.rowcount tells me how many rows were actually inserted: 1 if it was new, 0 if the conflict skipped it. So add_paper returns True when it added a new paper and False when the paper was already in the vault. Later, the add command can use that to print "added" or "already in vault" without doing a separate lookup first.
Reading a paper back
I also want to read a paper back out, so I can confirm it is really there and so the rest of the app has a way to look one up. Add this function next:
def get_paper(
connection: sqlite3.Connection, fingerprint: str
) -> sqlite3.Row | None:
return connection.execute(
"""
SELECT id, fingerprint, path, title, year, created_at
FROM papers
WHERE fingerprint = :fingerprint
""",
{"fingerprint": fingerprint},
).fetchone()
This looks a paper up by its fingerprint and returns the row, or None if there is no match. The fetchone() call returns the first matching row, and since fingerprint is unique there is at most one. Because I set row_factory = sqlite3.Row back in connect(), the row I get back lets me read columns by name, so the caller can write paper["title"] instead of counting column positions.
Returning None when the paper is not found keeps the function simple. The caller decides what a missing paper means, which is usually different in different places: a lookup might want to print an error, while an "is this already here?" check just wants the absence.
Testing it
Now I can test the two functions together, since reading back is how I check that writing worked. Add these to src/tests/test_db.py.
First, the basic round trip: add a paper, read it back, and check the fields survived.
def test_add_paper_inserts_and_reads_back(tmp_path: Path) -> None:
db_path = tmp_path / "vault.db"
connection = connect(db_path)
try:
init_db(connection)
is_new = add_paper(
connection,
fingerprint="abc123",
path="/papers/attention.pdf",
title="Attention Is All You Need",
year=2017,
)
assert is_new is True
paper = get_paper(connection, "abc123")
assert paper is not None
assert paper["fingerprint"] == "abc123"
assert paper["path"] == "/papers/attention.pdf"
assert paper["title"] == "Attention Is All You Need"
assert paper["year"] == 2017
finally:
connection.close()
This uses the same tmp_path fixture as the previous post, so each test gets its own throwaway database file. It adds one paper, checks that add_paper reported it as new, then reads it back and checks every field.
The next test is the one I actually care about: adding the same fingerprint twice.
def test_add_paper_is_idempotent(tmp_path: Path) -> None:
db_path = tmp_path / "vault.db"
connection = connect(db_path)
try:
init_db(connection)
first = add_paper(
connection,
fingerprint="abc123",
path="/papers/attention.pdf",
title="Attention Is All You Need",
year=2017,
)
second = add_paper(
connection,
fingerprint="abc123",
path="/papers/moved.pdf",
title="A Different Title",
year=2020,
)
assert first is True
assert second is False
# The second add was ignored, so the original row is untouched.
paper = get_paper(connection, "abc123")
assert paper is not None
assert paper["path"] == "/papers/attention.pdf"
assert paper["title"] == "Attention Is All You Need"
assert paper["year"] == 2017
finally:
connection.close()
The second add uses the same fingerprint but different values for everything else. Two things should be true. The first add returns True and the second returns False, so the function correctly reports that it had seen the paper before. And the row in the database still has the original values, not the second set. That second part matters: DO NOTHING means exactly that. It does not update the existing row with the new values, it leaves it alone. If I ever want adding a known paper to update its metadata, that would be a different choice (DO UPDATE), and I would make it on purpose.
The last test covers the miss: looking up a fingerprint that was never added.
def test_get_paper_returns_none_when_missing(tmp_path: Path) -> None:
db_path = tmp_path / "vault.db"
connection = connect(db_path)
try:
init_db(connection)
assert get_paper(connection, "does-not-exist") is None
finally:
connection.close()
Run the tests:
uv run pytest
The tests from the earlier posts should still pass, and now there are three new ones for the database operations.
9 passed
This is another small step, but the database can now do the two things it could not do at the end of the last post: store a paper and find it again. And it does it in a way that already matches the idempotent behavior I want from lv add, before there is an lv add to speak of.
I have still been passing the fingerprint and path in by hand, and these functions still live only in the database layer. Nothing in the CLI or the MCP server touches them yet. In the next post, I will wire these operations up through the core layer and expose the first real command, so that adding and listing papers works from the terminal the way greet does now.