That’s pretty cool. How many pages do you have in your database so far? Also, since you are working on a search engine, I would recommend reading this article: https://archive.org/details/search-timeline
No problem. You might consider using data from the Common Crawl to boost your index size. If you get the extracted text files (called WET instead of WARC), they don’t take up much space: I have one from 2014 with about 73,000 pages in it, and it only takes up about 300 MB uncompressed. Those files are surprisingly easy and fun to work with, and downloading them will probably always be faster than crawling on your own.

If you use files from the older crawls it will probably make your product more distinctive, but there will probably be a lot of 404s, so you might have to give people an option to view the cached page or go to the Wayback Machine. You probably don’t have the resources for this, but I would love it if someone made a search engine that lets you search through all 115 or so crawls they have, which would be around 100 billion pages and take up around 816 TB.
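If it helps, here’s a rough sketch of how you might pull the extracted text out of a WET file using the warcio library (the filename is just a placeholder, and you’d swap the print for whatever feeds your indexer):

    # Sketch: iterate over a Common Crawl WET file with warcio
    # (pip install warcio). Filename below is a placeholder.
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.wet.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            # WET files store the extracted plain text as 'conversion' records
            if record.rec_type == 'conversion':
                url = record.rec_headers.get_header('WARC-Target-URI')
                text = record.content_stream().read().decode('utf-8', errors='replace')
                # hand url/text off to your indexer here
                print(url, len(text))

ArchiveIterator handles the gzip decompression for you, so you can point it straight at the .wet.gz files as downloaded.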