The FTS5 index approach here is right, but I'd push further: pure BM25 underperf...

thecopy · 2026-03-01T13:09:33 1772370573

Very interesting. One big wrinkle with OP's approach is exactly that: the structured responses, which many tools return, are left untouched. The solution in OP, as I understand it, is the "execute" method. However, I'm building an MCP gateway, and such sandboxed execution isn't available (...yet), so your approach to this sounds very clever. I'll spend the day trying it out.

doctorpangloss · 2026-03-01T18:33:29 1772390009

The LLM that wrote the comment you are replying to has no idea what it is talking about...

thecopy · 2026-03-02T08:52:17 1772441537

I'm trying it anyway.

blakec · 2026-03-03T05:39:20 1772516360

Commented below with more in-depth info.

pmarreck · 2026-03-02T13:35:12 1772458512

Are you sure it's simply because YOU don't understand it? Because it seems to make sense to me after working on https://github.com/pmarreck/codescan

danw1979 · 2026-03-01T09:05:39 1772355939

Would love to read a more in-depth write-up of this if you have the time!

I suspect the obsessive note-taker crowd on HN would appreciate it too.

blakec · 2026-03-03T03:29:50 1772508590

I wrote it up. The full system reference is here: https://blakecrosley.com/guides/obsidian — vault architecture, hybrid retrieval (Model2Vec + FTS5 + RRF), MCP integration, incremental indexing, operational patterns. Covers everything from a 200-file vault to the 16,000-file setup I run.

The hybrid retriever piece has its own deep dive with the RRF math and an interactive fusion calculator: https://blakecrosley.com/blog/hybrid-retriever-obsidian
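
For readers who have not met RRF before, here is a minimal sketch of reciprocal rank fusion. It is not taken from the linked write-up: the function name `rrf`, the document ids, and the constant `k = 60` (a commonly used default) are all assumptions for illustration. Each ranked list contributes 1 / (k + rank) to a document's score, and the fused order sorts by the summed score.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    k=60 is a commonly used default, assumed here; tune it for your data.
    """
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the top hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword index and a vector index.
bm25_hits = ["a", "b", "c"]
vector_hits = ["c", "a", "d"]
fused = rrf([bm25_hits, vector_hits])
```

Documents that appear near the top of both lists ("a" and "c" here) end up ahead of documents that only one retriever found, without having to put BM25 scores and cosine similarities on a common scale.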

See what your coding agent thinks of it and let me know if you have ways to improve it.

thecopy · 2026-03-03T12:56:02 1772542562

I implemented this successfully as well. Re structured data: I transformed it from JSON into more "natural language". I also ended up using MiniLM-L6-v2. I will post a GitHub link when I have packaged it independently (it is currently in the main app code, and I want to extract it into an independent microservice).

You wrote:

>A search for “review configuration” matches every JSON file with a review key.

It's a good point. I'm not sure how to de-rank the keys or how to encode the "commonness" of those words.

blakec · 2026-03-03T19:38:12 1772566692

IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.
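
A rough sketch of the effect, assuming the non-negative IDF variant popularized by Lucene (the exact formula inside any particular engine, FTS5 included, may differ). The helper `bm25_idf` and the four toy chunks are made up for illustration:

```python
import math
from collections import Counter


def bm25_idf(docs):
    """Return {term: idf} for whitespace-tokenized docs.

    Uses idf = ln((N - df + 0.5) / (df + 0.5) + 1), which never goes negative.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        # Count each term once per document (document frequency).
        df.update(set(doc.lower().split()))
    return {t: math.log((n - d + 0.5) / (d + 0.5) + 1.0) for t, d in df.items()}


# Toy chunks: every one carries the same boilerplate keys.
chunks = [
    "id status type review configuration",
    "id status type deploy pipeline",
    "id status type user profile",
    "id status type billing invoice",
]
idf = bm25_idf(chunks)
```

With these chunks, "id" appears in all four documents and gets an IDF of roughly 0.1, while "review" appears in one and gets roughly 1.2, so a match on the rare term contributes about ten times more to the score.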

For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight. The value goes into the main content field. So a search for "review configuration" matches on the value side, not because "configuration" appeared as a JSON key in 500 files.
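
The split can be sketched roughly like this. The field names `key_paths` and `content`, the helper names, and the dotted path format are assumptions for illustration, not the actual schema:

```python
import json


def flatten(obj, prefix=""):
    """Yield (key_path, value) pairs for every leaf of a JSON-like object."""
    if isinstance(obj, dict):
        for key, val in obj.items():
            yield from flatten(val, f"{prefix}.{key}" if prefix else key)
    elif isinstance(obj, list):
        for i, val in enumerate(obj):
            yield from flatten(val, f"{prefix}[{i}]")
    else:
        yield prefix, obj


def to_fields(raw_json):
    """Split a JSON document into a key-path field and a value field."""
    pairs = list(flatten(json.loads(raw_json)))
    return {
        # Structural noise: goes into the low-weight metadata field.
        "key_paths": " ".join(path for path, _ in pairs),
        # Actual values: goes into the main content field.
        "content": " ".join(str(val) for _, val in pairs),
    }


doc = '{"review": {"configuration": {"mode": "strict", "reviewers": ["ana", "li"]}}, "status": "open"}'
fields = to_fields(doc)
```

Here "configuration" ends up only in `key_paths`, so it no longer matches on the content side. With SQLite FTS5, one way to apply the lower weight is to index the two fields as separate columns and pass per-column weights to the `bm25()` ranking function at query time.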

MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff. 50-500x faster on CPU, 89% of MiniLM quality on MTEB. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.

danw1979 · 2026-03-05T08:04:04 1772697844

Thank you!

tclancy · 2026-03-01T12:36:15 1772368575

Seconded that I would love to see the what, why and how of your Obsidian work.