Show HN: Semantic search over Hacker News, built on pgvector

(ask.rivestack.io)

5 points | stranger90 | 1mo ago | 2 comments

I built https://ask.rivestack.io, a semantic search engine over Hacker News posts. Instead of keyword matching, it finds results by meaning, so you can search things like "best way to handle authentication in microservices" and get relevant threads even if they don't contain those exact words.

How it works:

- Indexed HN posts and comments into PostgreSQL with pgvector (HNSW index)
- Embeddings generated with OpenAI's embedding model
- Queries run as nearest-neighbor vector searches, with typical responses under 50ms
- The whole thing runs on a single Postgres instance, no separate vector DB
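A minimal sketch of what a setup like the one described above might look like. This is not the site's actual code: the table name `hn_items`, the column names, the 1536-dimension embedding, and the choice of cosine distance are all assumptions for illustration. The SQL itself uses pgvector's `vector` type, its `hnsw` index method, and the `<=>` cosine-distance operator.

```python
"""Sketch: pgvector schema, HNSW index, and a nearest-neighbor query.

Assumptions (not from the post): table/column names, a 1536-dimensional
embedding, and cosine distance as the metric.
"""

SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS hn_items (
    id         bigint PRIMARY KEY,
    title      text,
    body       text,
    score      integer,
    created_at timestamptz,
    embedding  vector(1536)
);

CREATE INDEX IF NOT EXISTS hn_items_embedding_hnsw
    ON hn_items USING hnsw (embedding vector_cosine_ops);
"""

# Nearest-neighbor search: order rows by cosine distance to the query vector.
SEARCH_SQL = """
SELECT id, title, embedding <=> %(query_vec)s::vector AS distance
FROM hn_items
ORDER BY embedding <=> %(query_vec)s::vector
LIMIT %(limit)s;
"""


def to_pgvector_literal(values):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(query_embedding, limit=10):
    """Build the named-parameter dict for SEARCH_SQL."""
    return {"query_vec": to_pgvector_literal(query_embedding), "limit": limit}
```

With a driver that supports `%(name)s` parameters (psycopg, for example), the query would run as `cursor.execute(SEARCH_SQL, search_params(query_embedding))`, where `query_embedding` comes from the same embedding model used at indexing time.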

I built this partly because I wanted a better way to search HN, and partly to dogfood my own project: Rivestack (https://rivestack.io), a managed PostgreSQL service with pgvector baked in. I wanted to see how pgvector holds up with a real dataset at a reasonable scale.

A few things I learned along the way:

- HNSW vs IVFFlat matters a lot at this scale. HNSW gave me much better recall with acceptable index build times.
- Storing embeddings alongside relational data in the same DB simplifies things enormously: no syncing between a vector store and your main DB.
- pgvector has gotten surprisingly fast in recent versions. For most use cases, you really don't need a dedicated vector database.
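For anyone wanting to reproduce the recall comparison on their own data, one common approach is to compare the ids an approximate index returns against exact brute-force results for the same query. The sketch below shows that calculation in plain Python. The function names and toy vectors are hypothetical, and the post does not say how recall was actually measured.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity), the metric behind pgvector's <=>."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exact_top_k(query, vectors, k):
    """Brute-force ground truth: ids of the k vectors closest to the query."""
    ranked = sorted(
        vectors, key=lambda item_id: cosine_distance(query, vectors[item_id])
    )
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true top-k that the approximate index also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Toy data: ids mapped to 2-d embeddings (real embeddings have far more dimensions).
vectors = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
truth = exact_top_k([1.0, 0.05], vectors, k=2)

# Pretend an approximate index returned ids 1 and 3 for the same query.
print(recall_at_k([1, 3], truth))  # 0.5: only one of the two true neighbors was found
```

In Postgres itself, the exact result set can be obtained by running the same query with index scans disabled for the transaction (`SET LOCAL enable_indexscan = off`). The recall/speed trade-off is then tuned with pgvector's `hnsw.ef_search` setting for HNSW or `ivfflat.probes` for IVFFlat.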

The search is free to use. Rivestack has a free tier too if anyone wants to try something similar. Happy to answer questions about the architecture, pgvector tuning, or anything else.

2 comments

Niko901ch | 1mo ago

This is a great practical application of pgvector! The HN corpus is perfect for semantic search because the discussions tend to be technical and well-structured.

I'm curious about the embedding model you chose - did you compare different options (OpenAI ada-002, Cohere, open-source models like all-MiniLM)? And how's the query performance with pgvector at scale?

One feature that would be valuable: filtering by time range or karma score. Sometimes you want recent discussions vs. classic threads with high engagement.
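Since the embeddings live next to ordinary columns, filters like these could in principle be plain WHERE clauses on the same nearest-neighbor query. A rough sketch, assuming a hypothetical `hn_items` table with `score` and `created_at` columns (not the site's actual schema or query):

```python
from datetime import datetime, timezone


def build_filtered_search(query_vec_literal, min_score=None, since=None, limit=10):
    """Return (sql, params) for a vector search with optional relational filters.

    Table and column names (hn_items, score, created_at) are hypothetical.
    """
    conditions = []
    params = {"query_vec": query_vec_literal, "limit": limit}
    if min_score is not None:
        conditions.append("score >= %(min_score)s")
        params["min_score"] = min_score
    if since is not None:
        conditions.append("created_at >= %(since)s")
        params["since"] = since

    # Only emit a WHERE clause when at least one filter was requested.
    where = ("WHERE " + " AND ".join(conditions) + "\n") if conditions else ""
    sql = (
        "SELECT id, title\n"
        "FROM hn_items\n"
        + where
        + "ORDER BY embedding <=> %(query_vec)s::vector\n"
        "LIMIT %(limit)s;"
    )
    return sql, params


sql, params = build_filtered_search(
    "[0.1,0.2]", min_score=100, since=datetime(2024, 1, 1, tzinfo=timezone.utc)
)
print(sql)
```

One caveat worth knowing: with an approximate index, the filter is applied to the candidates the index scan returns, so a very selective filter can yield fewer rows than the requested limit. The pgvector documentation covers options for handling this.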

malandin | 1mo ago

Hey, great project! You mention that you didn't want to use a vector database in this project. Any particular reason for this? Have you also thought about using a search engine like Elastic or OpenSearch?
