Show HN: Open-source search engine with 2bn-page index

(deusu.org)

229 points | deusu | 9y ago | 139 comments

throwaway133379y ago

Alternative general purpose search engines are an exciting idea.

It seems a lot like we're back at about the time when Yahoo was dominant and searching was sort of awful. When you searched, what ranked highest was market-driven sorts of stuff.

Right now, for topics normal people (not techies) search for, all you get are content-farm sites with JS popups asking for your email address. Try searching for anything health related, for example. We've regressed.

My own half-solution is to look only for sites that are discussions - reddit, hn, etc. It could be better. A search engine that favored non-marketing content could really steal some thunder.

This doesn't look like that, but maybe it's a start?

thr0waway12399y ago

Speaking of which, it seems possible for a computer to detect content which is just mostly marketing, versus content which is not (based on how spam filters work). The search engine should just show a "marketing index" score right next to the result. Even better is to whitelist certain sites (Wikipedia, popular .edu and .org domains) to begin with and prioritize those results.

It would likely be really niche, but it could become the anti-Google, which would be great when people actually seek an alternative to all the noise.

I think it has to be a non-profit like Wikipedia itself, I cannot imagine a model where it can also make money. The submitted site is a candidate but it has to improve the search quality as other commenters pointed out.

pjc509y ago

> seems possible for a computer to detect content which is just mostly marketing

But based on the spam race, marketers will then tune content so that it doesn't trip those filters. Paid news and journal articles, etc.

pbhjpbhj9y ago

It seems that you could just use Google's algorithms and modify the site trust metric using a front-page spam-score, whilst reducing the effect of link-juice from links with associated marketing keywords ("buy the doohickey on this link", or whatever).

Keeping marketing sites high in your SERPs would make you way more money on referrals though.

ganeshkrishnan9y ago

This search engine seems to use only a tf-idf inverted index for its searches and then a vector space model for ranking the similarity.

A search for "java twitter bot" places the most emphasis on "bot", then on "java", and then on "twitter", which is what tf-idf would do.

A good start like you said but it's miles away even from yahoo or bing.
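To illustrate the tf-idf point, here is a minimal sketch. The four-document corpus is made up for the example (it is not DeuSu's index), and the function names are hypothetical. It only shows why the rarest query term ends up with the largest weight.

```python
import math

# Toy corpus, invented for illustration only.
DOCS = [
    "java tutorial for beginners",
    "java twitter client library",
    "twitter api example",
    "python twitter bot example",
]

def idf(term, docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / df) if df else 0.0

def query_weights(query, docs):
    """Weight each query term by its idf (each term occurs once in the query)."""
    return {term: idf(term, docs) for term in query.split()}

weights = query_weights("java twitter bot", DOCS)
# Terms ordered from most to least emphasized.
ranked_terms = sorted(weights, key=weights.get, reverse=True)
```

In this toy corpus "bot" appears in one document, "java" in two and "twitter" in three, so the emphasis comes out in the order described above.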

6502nerdface9y ago

Wow, the contrast between what this engine returns for that query and what google returns is amazing. Literally zero relevant links from the former and only relevant links from the latter. Search relevance is a serious high-science research problem, and it's going to be tough to compete with established players that have probably man-centuries' worth of proprietary research IP and some of the world's best scientists.

dood9y ago

I'd love a search engine which only indexes forums. Something I've been thinking of doing for years, but it'd be a lot of work.

dvirsky9y ago

There were a few attempts at that in the past, one being http://omgili.com/ that now seems to return pretty much garbage.

BTW About 12 years ago I was building this search engine, and I was toying with the idea of building a classifier that classifies web pages based on their "genre" rather than category, so you can limit your search for shopping websites, forums, blogs, news sites, social media, etc. It was a bitch to train, and my classifier's algorithm was pretty crappy, but it showed some potential.

I think today modern search engines do that behind the scenes, and try to diversify the results to include pages from multiple genres, but they usually don't let you choose.

siegecraft9y ago

I thought about doing the same thing, found http://boardreader.com/ and then moved on. No idea how useful it is.

mox19y ago

Often times when searching for something, I would love this as well. I usually add phpbb, forums or discussion to the search keywords, but it's never 100%.

I would love something like this as well...

ap222139y ago

Oh, is this the search wish-list thread? Ok: I just want a search engine that only indexes the sites that would be of interest to people like me.

For most of my searches over the last year, Google has been so broken that it's almost unusable. At this point, to get any relevant results, I have to anticipate how Google will work, and then trial and error 10-15 times until I find what I'm looking for.

But, if I happen to be looking for incorrect information from 2005, Google works like a charm.

beejiu9y ago

Another product (Discussions) that Google discontinued.

MasterScrat9y ago

Probably the easiest way would be to use Google Search APIs and do custom queries and/or filtering.

tspike9y ago

That's been my pet idea, as well. Would love to support it however I can. boardreader.com is OK but could be a lot better.

Keyframe9y ago

I always had this, maybe stupid, idea of a search engine built on vouched referrals. You pick a few sites you know are good, and they (their 'webmasters') would vouch for new ones, and those could vouch for new ones, etc. The catch is, if one or two (whatever) of your child or grandchild vouched sites screw up, then they're toast, out of the index, but so are you. That way you would pick wisely.

Same idea would probably work for online commenting. Vouch with a chain of responsibility. That's essentially how pagerank did its thing, but with no repercussions and vouching was automatic based on links from initial seed of what they thought was good. I'd do it with humans.
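A rough sketch of how such a vouching chain could be modeled, purely as an illustration. The class and method names are hypothetical, and the choice to penalize exactly two generations of vouchers is an assumption based on the "child or grandchild" wording.

```python
class VouchIndex:
    """Index where every site (except seeds) was vouched for by another site."""

    def __init__(self, seeds):
        # Maps site -> the site that vouched for it (None for seed sites).
        self.voucher = {seed: None for seed in seeds}

    def vouch(self, parent, child):
        """Let an indexed site vouch for a new one."""
        if parent not in self.voucher:
            raise ValueError(f"{parent} is not in the index")
        self.voucher.setdefault(child, parent)

    def report(self, site, levels=2):
        """Drop a misbehaving site plus up to `levels` generations of vouchers."""
        removed = []
        current = site
        for _ in range(levels + 1):
            if current is None or current not in self.voucher:
                break
            parent = self.voucher.pop(current)
            removed.append(current)
            current = parent
        return removed
```

This sketch leaves other sites vouched for by a removed site in the index; whether they should also go is a policy decision the comment above does not settle.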

austingulati9y ago

This is very similar to PageRank except "vouch" is accomplished by linking to the other page.
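For comparison, a minimal sketch of the textbook PageRank power iteration, where a link acts as the "vouch". This is the simplified published formulation with the commonly cited damping factor of 0.85, not Google's production ranking; the three-page graph in the note below is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: rank}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank
```

With `{"a": ["c"], "b": ["c"], "c": ["a"]}`, page "c" collects the most rank because two pages link to it, and "b" the least because nothing links to it.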

Jaruzel9y ago

I always had an inkling for a Search Engine that ONLY indexed the root page of every domain. Not sure if I'm right about this, but it sure would sort the chaff from the wheat for general purpose queries.

tempestn9y ago

Seems like that would just give you all those made-for-SEO sites that tend to have second-rate content at best. I.e., you search for 'best electric lawn mower', and you'll get bestelectriclawnmowers.com, 10bestelectricmowers.com, etc. Those sorts of sites exist for every imaginable topic, and in my experience are rarely worth visiting.

I would almost want the opposite. The best content on most topics I've found tends to be a page on a discussion forum of some sort, followed by blogs and more general editorial sites.

richmarr9y ago

The sheer scale required to attempt a new search engine is pretty staggering... it seems like one area where decentralisation might actually be worthwhile; the key obstacle being everyone's interest in gaming search results. I wonder if there's a useful application of ledgers that'd be useful in there somewhere...

samstave9y ago

Why not have a search engine with "sub-reddits" that can be subscribed to...

Whereby a site would self-identify as being in a particular genre, say "healthcare", and I could launch a tab to the engine and set my sub to "health, health-tech, healthcare, medicine, etc." and then do my search, and only those sites that set their category would show up in that search. But if I don't find what I'm searching for, I can then easily slide out to other areas that I may not have thought what I was looking for would have identified with. Further, any post by any company/site could individually be given a topic to self-declare as, so even if the company or site isn't necessarily in that space, their page or object could at least be a part of that result ranking....

Or has this been tried/found to be stupid?

allendoerfer9y ago

> Or has this been tried/found to be stupid?

You are describing the keywords meta tag.

While it is often claimed that competitors before Google did not use anything like PageRank (which is not true), Google's PageRank algorithm was better and cheaper than the competitors' and effectively killed your idea 20 years ago.

iplaw9y ago

I've started appending "forum" to many of my queries when I am looking for answers from users.

jobigoud9y ago

A few years ago Google had a "discussions" category that you could pick alongside "images", "videos", etc. I wonder why they removed it. It was indexing forums, Google and Yahoo groups, etc.

ma2rten9y ago

I would love to have a search engine that would allow you supply your own ranking function.

amirouche9y ago

How would this work? You could boost a query term, for instance, like it is possible to boost a column's score in PostgreSQL, but that is all. Otherwise, allowing users to provide their own ranking function (which is itself an art) would not be practical performance-wise. It should be noted that the search engine interface, the search box, is already a DSL for the underlying algorithm that supports OR/AND and NOT.

mack739y ago

That is a lovely idea. Unfortunately, a scoring scheme has one foot in the indexing process (that thing that the google bot does) and another in the querying part, so switching schemes would often mean you would need to re-index your data to cater for the new metrics you now need for a new type of scoring.

Razengan9y ago

The "ideal" search engine is probably not possible without some sort of AI having access to all the content on internet.

rspeer9y ago

> some sort of AI having access to all the content on internet

Isn't this a description of any decent search engine?

CM309y ago

Well, I admire the work behind it, and I think the idea is good (especially how having this open source means multiple sites can build on the same data set and get it more and more accurate over time).

But I have to be honest and say that it's just not working for me.

I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.

Typing in Wikipedia gives me the Dutch version of Wikipedia.

Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.

Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.

It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this as it is, but it's just not really usable as a search engine just yet.

laurent1234569y ago

They need to filter porn out of their search results (even for common queries like "hat", there's only porn) and perhaps be more resilient to SEO techniques, since it looks like there's a lot of spam in the top results. Queries with common words such as "cat" return almost only irrelevant results.

I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.

deusuOP9y ago

I hadn't even thought about that. But it should be pretty easy to do in post-processing. I just have to take a list of "porn" keywords. If none of them occurs in the query, but in a search-result, then that result gets downranked.
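A minimal sketch of that post-processing step. The keyword set, the `(score, text)` result shape and the penalty factor are all placeholders chosen for the example, not taken from the DeuSu source.

```python
import re

# Placeholder keyword list; a real list would be much longer.
ADULT_WORDS = {"porn", "xxx"}

def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def postprocess(query, results, penalty=0.1):
    """results: list of (score, text). Returns results sorted best first.

    If the query itself contains no adult keyword, any result that does
    contain one has its score multiplied by `penalty`.
    """
    if tokens(query) & ADULT_WORDS:
        rescored = list(results)  # user asked for it: leave scores alone
    else:
        rescored = [
            (score * penalty if tokens(text) & ADULT_WORDS else score, text)
            for score, text in results
        ]
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)
```

Downranking rather than dropping keeps such results reachable when nothing else matches.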

runarb9y ago

If you want, you can use the list of adult words from the now-defunct web search engine boitho.com. It is available at https://github.com/searchdaimon/adult-words .

We mostly filtered out porn by using a two-word phrase method. There were a lot of edge cases because many potentially dirty concepts are made up of words that are not bad when used alone. For example, a text can have both "girls" and "nude" in it without being vulgar, but if it has the phrase "nude girls" the chance of it being pornographic is much higher.
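The two-word phrase idea can be sketched like this. The phrase set holds only the one example from the comment above, and the function name is hypothetical; this is not the boitho.com implementation.

```python
import re

# Adjacent word pairs that count as a signal; single words alone do not.
ADULT_PHRASES = {("nude", "girls")}

def has_adult_phrase(text):
    """True if any two adjacent words form a listed phrase."""
    words = re.findall(r"[a-z]+", text.lower())
    return any(pair in ADULT_PHRASES for pair in zip(words, words[1:]))
```

A text that contains both words far apart is not flagged, which is the edge case described above.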

laurent1234569y ago

Yes, I guess filtering them out would at least make the website SFW, and it would make it easier to show it to people. The issue seems to happen mainly with common words (whose results also appear to be polluted with heavily SEO-ed websites).

I've also searched for less generic things like "xperia z5" and the results looked good.

Taek9y ago

Is the two billion page index open source?

I've been thinking a lot about data recently. Seems to me like Pandora's box is open. Google knows where you live, where you eat, what your fetishes are, all of your sexual partners. Facebook knows most of those things too, via different methods. And if you run Windows, Microsoft probably has access to most of that as well. Apple will too, because if they don't they won't be able to compete. Tesla, Uber, Waze also have a huge amount of data on your life.

Everyone is pushing the envelope on how much data they are collecting, and the companies which collect more data will compete better. As tech gets better we will increasingly be unable to resist sharing our whole lives with the companies who are powering modern living.

Even worse, there's a huge monopolization effect to having data. Nobody else has anywhere near as much data as Google. That means nobody else can compete. Never mind the engineering: your algorithms can be 2x as good, but you won't have 0.1% of the data of a company with billions of daily users.

So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?

We can fight back by giving up the privacy war and blowing the doors open instead. Take your data (as much as you dare) and make it public. Let every startup have access to it. Let every doctor have access to it. Give the small players a fighting chance.

That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.

I don't see any other way out of this though. Surveillance technology is getting better faster than privacy technology, because surveillance tech has the entire tech industry behind it. Smarter phones, smarter TVs, smarter grocery stores, smarter credit cards, smarter shoes... smarter everything. Privacy is melting away and we aren't getting it back.

deusuOP9y ago

The software is already open-source.

A free search API will be fully available probably next week. It's in testing already. It's just a matter of putting the finishing touches on the documentation.

And the crawl- and index-data will be available for download in a few weeks. It's also just a matter of documenting the data-format.

BTW: I disagree with your points about privacy. I see DeuSu as a way of fighting back.

siegecraft9y ago

How does the search / indexing compare to sphinx or lucene?

jahewson9y ago

If you're looking for an open source web crawl, commoncrawl.org has billions of pages.

greglindahl9y ago

... and Common Search made an index of the homepages, too.

mi100hael9y ago

Interesting perspective. What about the other extreme, though? This circumstance has only really existed in the past 20 years or so, maybe less. Why not just revert some of your behaviors?

- Delete your Facebook account. If you really need it to keep in touch with people across the country, at least delete the app from your phone and don't leave it open in a browser tab.

- Don't place asinine Amazon orders just because they ship free. Stop by a drug store or hardware store on your way home from work for odds & ends. You can even pay cash and kill the CC data bird with the same stone. Bonus points for instant gratification.

- Don't use GMail. Use an email provider like Protonmail or Tutanota that doesn't index all of your emails for advertising or other purposes.

- Don't sign in to Google (or Waze or whoever else's site) when navigating or searching so all of your actions online aren't automatically tied to a single account.

- Don't buy some internet-of-shit appliance that inevitably phones home with a bunch of telemetry data just so you can get a push notification when your laundry is dry.

Want to go really extreme? Now that you've uninstalled Facebook and aren't getting push notifications from your toaster, ditch your smart phone. Pay $15/mo for calls/texts on a Nokia 3310 and save some cash while you're at it. Want to listen to music on the go? Buy an MP3 player like everyone did 10 years ago. Want to play games on the go? Buy a GameBoy and experience the wonder of physical buttons while gaming. Really need directions on the go? You can probably get by with a Garmin or even gasp a paper map. Want to read HN or Reddit on the toilet? Try a book. Yes, I've written mobile apps and my iPhone is sitting on the desk right in front of me, but I hate the fucking thing and its always-connected mobile data. I've done the rest of the above bullet points and don't plan on buying another smart phone when this one hits planned-obsolescence in another year or two. Aggressive data collection depends on user engagement. The easiest way to fight it is to just quit engaging. Usually it'll save you money, too, which is nice.

allendoerfer9y ago

I am on your side. Except I need to contact some friends and - more importantly - customers, so I cannot entirely ditch everything. I cannot get rid of Facebook (Pages, API, relatives, even some customers), Skype (customers) and Whatsapp (friends). I am waiting for Whatsapp to become available on either Ubuntu Touch or Firefox OS. I would use these before buying a dumbphone and an MP3 player.

fnord1239y ago

It's written in pascal. Neat.

However, it's not very good. If I search for "banana" I get information about a sex shop rather than about bananas.

0xmohit9y ago

Appears to be a "smart" search engine. Tries to infer a lot (maybe based on data collected earlier).

mstolpm9y ago

In addition to porn not being removed and the ordering of the results not prioritizing "quality" sources, some of the indexed site data is at least 4-6 months old and the sites have changed heavily since the last crawl. I even got 404 errors. That makes it very hard to really find use in the project other than for academic interest.

deusuOP9y ago

A fresh recrawl is currently running. Should take about 2-3 months. Newly crawled data will gradually replace older data during that time.

webtechgal9y ago

Great work, congrats. :-)

Here is some input based on my experience building a similar project at my former company. (We did not quite get to 2B pages, but were close to ~300M):

For creating a really viable (alternative) search engine, the freshness of your index is going to be a fairly important factor. Now, obviously, re-crawling a massive index frequently/regularly is going to need/consume some huge amounts of bandwidth + CPU cycles. Here is how we had optimized the resource utilization:

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.

Corresponding to each indexed URL, also store a sort-of 'crawl-history' (If space is a constraint, don't store each version of the URL, store only the latest one). On each re-crawl, store two data fields: time-stamp and a boolean if the URL content has changed since last crawl. As more re-crawl cycles run, you will be able to calculate/predict the 'update frequency' of each URL. Then, prioritize the re-crawls based on the update frequency score (i.e. re-crawl those with higher scores more frequently and the others less frequently).
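The bookkeeping described above could look roughly like this. It is a sketch under assumptions: the class and field names are invented, and "update frequency" is approximated as the fraction of re-crawls on which the content had changed.

```python
from dataclasses import dataclass, field

@dataclass
class CrawlRecord:
    url: str
    # One (timestamp, changed) pair per re-crawl.
    history: list = field(default_factory=list)

    def record(self, timestamp, changed):
        """Store the outcome of one re-crawl."""
        self.history.append((timestamp, changed))

    @property
    def last_crawled(self):
        """Timestamp of the most recent crawl, or None if never crawled."""
        return self.history[-1][0] if self.history else None

    def update_frequency(self):
        """Fraction of re-crawls on which the content had changed."""
        if not self.history:
            return 1.0  # never crawled yet: treat as highest priority
        changed = sum(1 for _, was_changed in self.history if was_changed)
        return changed / len(self.history)

def recrawl_order(records):
    """URLs that change most often come first."""
    return sorted(records, key=lambda r: r.update_frequency(), reverse=True)
```

A real scheduler would presumably also factor in the time since `last_crawled`, so that rarely changing pages are still revisited eventually.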

If you need any more help/input, let me know and I'll be happy to do what I can.

HTH and all the best moving forward.

jbb5559y ago

I think projects like this are really important because they help reduce the impression that big server projects are only meant to be done by big companies. The internet is becoming a content consumption medium for many people.

I'm not sure I'll use this, but I'll try to... it all depends on how good it is. But I approve of the project so I sent a (very) small bitcoin donation to hopefully help fund it for a few more minutes :)

deusuOP9y ago

Thank you!

Depending on who you are (there were 2 bitcoin donations today), you funded either about 18 or 28 hours of operations. :)

ccleve9y ago

You get really good performance on not much hardware. Can you share some technical details?

- file formats, particularly the postings

- query evaluation strategy

- update strategy

I poked around in the source code a bit, but couldn't find these things.

deusuOP9y ago

File formats will be documented when I publish the data-files in a few weeks.

What do you mean with postings?

The main index is split into 32 shards (there is also an additional news-index which is updated about every 5-10 minutes). Each shard is updated and queried separately. The query actually runs 2/3 on a Windows server and 1/3 on a Linux server. The latter in Docker containers. I want to move everything to Linux over time.

Query has two phases. First only a rough - but fast - ranking is done. Then the top results of all shards are combined and completely re-ranked. This is basically a meta search engine hidden within.

First query phase is in src/searchservernew.dpr, and the second phase is in src/cgi/PostProcess.pas.
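The two-phase scheme can be sketched in a few lines. This is not a port of the Pascal code in the files named above: the scoring functions, the document shape (`title` and `text` keys) and the parameter `k` are assumptions made for the illustration.

```python
import heapq

def phase_one(shard, terms, k):
    """Cheap per-shard ranking: count matching query terms, keep the top k."""
    scored = []
    for doc in shard:
        words = doc["text"].lower().split()
        score = sum(term in words for term in terms)
        if score > 0:
            scored.append((score, doc))
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [doc for _, doc in top]

def phase_two(candidates, terms):
    """More expensive re-ranking over the merged candidates only."""
    def full_score(doc):
        text_hits = sum(term in doc["text"].lower().split() for term in terms)
        title_hits = sum(term in doc["title"].lower().split() for term in terms)
        return text_hits + 2 * title_hits  # title matches count double here
    return sorted(candidates, key=full_score, reverse=True)

def search(shards, query, k=2):
    """Query every shard roughly, merge the top results, then re-rank them."""
    terms = query.lower().split()
    candidates = [doc for shard in shards for doc in phase_one(shard, terms, k)]
    return phase_two(candidates, terms)
```

The point of the split is that the expensive scoring only ever sees `k` documents per shard instead of every match.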

ccleve9y ago

Thank you. "Postings" is another word for the format of the doc ids and related information in the inverted file. A google for "inverted index postings" will turn up a bunch of references.
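As a small illustration of the term, here is a toy in-memory inverted index where each term maps to its postings list of `(doc_id, term_frequency)` pairs. Real engines store postings in compressed on-disk formats; the layout and function names here are assumptions for the example.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: sorted [(doc_id, term_frequency)]}."""
    counts = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[term][doc_id] += 1
    return {term: sorted(per_doc.items()) for term, per_doc in counts.items()}

def intersect(index, term_a, term_b):
    """Doc ids whose postings appear under both terms (an AND query)."""
    ids_a = {doc_id for doc_id, _ in index.get(term_a, [])}
    return [doc_id for doc_id, _ in index.get(term_b, []) if doc_id in ids_a]
```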

pmontra9y ago

Written in Delphi. I might be wrong but I don't see many people downloading and working on it. 30 day free trial and then you have to pay for the development environment. IMHO it's a non starter for an open source project but if it's the only language the author is comfortable with, well that's OK.

deusuOP9y ago

Originally it was written in Delphi. But I now use FreePascal for the development. I'm even compiling both Windows and Linux versions on my Linux machine.

pmontra9y ago

Great choice! Thanks.

RobAley9y ago

It appears to have moved over to FreePascal now, which is the free, open-source Delphi look-alike.

NKCSS9y ago

Fun, but overall quality seems a bit lacking.

When I search for myself, the top 10 results don't even have my last name ('Kusters') and just show pages that have the word 'Nick'. I suppose you don't use a form of LSA to score the search results? Maybe it's too specific, but afaik mainstream search engines seem to give somewhat consistent results here.

https://deusu.org/query?q=nick+kusters

Looking at the code (https://github.com/MichaelSchoebel/DeuSu/) I notice that you have ranking modifiers based on the .tld; why not store the reported content language and score based on that? Isn't that more relevant?

deusuOP9y ago

In my experience this is usually caused by the fact that even 2bn pages aren't that many nowadays. The index needs to get bigger to better find (and rank) long-tail results for queries like this.

gkst9y ago

Pascal is an interesting language choice. I think it is the 1st time I see an open source project that is actually used in production written in Pascal.

skykooler9y ago

It shows snippets of the web pages under each result; however, generally not the particular snippets that contain the search term. I would think that would be useful.

deusuOP9y ago

Yes, it would be better.

The snippets are currently the first 255 characters of the page's text. For snippets to be customized to the search term, I would have to store all the text of the page. And that would require a lot more disk space. Space that I can't afford at the moment.
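To make the trade-off concrete, here is a small sketch of both variants. The static one needs only the stored first 255 characters; the query-biased one needs the full page text at query time, which is the disk-space cost mentioned above. The function names are hypothetical.

```python
def static_snippet(text, length=255):
    """Current behaviour as described: the first `length` characters."""
    return text[:length]

def query_snippet(text, term, length=255):
    """Window of `length` characters roughly centred on the first match."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return text[:length]  # term not found: fall back to the page start
    start = max(0, pos - length // 2)
    return text[start:start + length]
```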

yati9y ago

Looking at the source code took me back to days when I used to do stuff in Delphi :)

Neat project -- Loads of room for improvement, but a great initiative!

swiley9y ago

The site's interface is just incredibly pleasant compared to Google.com. I really hope the author sticks with it. Unfortunately I'm not sure it's usable right now, searching "group theory Wikipedia" never brings up a Wikipedia page (although maybe I should just be directly searching Wikipedia if that's what I wanted).

DanBC9y ago

DuckDuckGo's approach of !bang searches, making duckduckgo the place[1] I go when I want to search another site, is really useful.

[1] It's my default search engine in Chrome, so I use bang searching in the address bar.

Cyph0n9y ago

Same here. The problem is that I find myself using `!g` way too often... I guess I'm not used to the DDG results page.

rshm9y ago

As of Aug 16, Common Crawl has 1.73bn pages. For the complementary set of URLs, if it is of any benefit, you can use their data dump as a seed.

If the metadata (such as last modified) size of your index is small enough to upload to aws, you can also reduce your re-crawl efforts when they have a fresh release.

greglindahl9y ago

It doesn't have to be small to donate to Common Crawl, they have a free S3 bucket.

supersan9y ago

Hi, I find the Blog more interesting right now since I hope to find write-ups about how you were able to manage such a herculean task on your own?

Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learnt, pitfalls you faced, etc. would be a great read.

deusuOP9y ago

Some issues that appeared over the years:

Block outgoing connects to local IP nets in your firewall. Otherwise your hosting provider might think you are trying to hack them. Apparently there are a lot of links out there that point to hosts which resolve to private IP ranges.

Another problem with following links is that you are bound to run across some that are malware command & control servers. Had several complaints to my ISP after authorities took over control of one and used the C&C server's domain as a honeypot. My crawler is on a whitelist now.

I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)

Make sure your robots.txt parsing is working correctly. I had an undiscovered bug in the software at some time which basically caused it to think everything is allowed. Luckily someone was nice enough to let me know. And he was really nice about it. And he would have had every right to be angry.

A major bottleneck is DNS queries. Run your own DNS server and even cache the hostname/IP pairs yourself. Do not even think about using your ISP's DNS server. If you bombard them with 100+ DNS requests/s then they WILL be angry. :)
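Two of the points above (private IP ranges and caching DNS results) can be sketched at the application level. This is only an illustration and not the DeuSu code: the advice above is to block private ranges in the firewall, and the cache here is a plain in-process dict with no expiry. The function names and the injectable `resolver` parameter are assumptions.

```python
import ipaddress
import socket

_dns_cache = {}  # hostname -> IP address string

def resolve(hostname, resolver=socket.gethostbyname):
    """Resolve a hostname, asking the resolver only once per hostname."""
    if hostname not in _dns_cache:
        _dns_cache[hostname] = resolver(hostname)
    return _dns_cache[hostname]

def is_crawlable_ip(ip):
    """False for private, loopback and link-local addresses."""
    addr = ipaddress.ip_address(ip)
    return not (addr.is_private or addr.is_loopback or addr.is_link_local)
```

A crawler would call `resolve()` first and skip the URL when `is_crawlable_ip()` returns False, in addition to (not instead of) the firewall rule.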

webtechgal9y ago

> Run your own DNS server and even cache the hostname/IP pairs yourself.

This[1] might be a useful resource to get started:

[1] https://scans.io/

(Register and download the IPv4 Address Space data file to use as an initial cache and then append/update as you go.)

ommunist9y ago

DeuSu does not seem to index the Cyrillic part of the Internet, and cannot give you results for Greek; try https://deusu.org/query?q=ελιά . Is it a Latin/ANSI-only index?

deusuOP9y ago

Only ASCII and German umlauts (äöüß) at the moment. The parser needs rewriting. It was originally written in pre-unicode times. :)

tychuz9y ago

And all javascript related questions still have w3schools as first result, god dammit.

gkilmain9y ago

I think for newbs who want to learn the fundamentals of web dev w3schools is a good resource. Even the people over at w3fools admit it. For a deeper dive though clearly MDN is the winner.

kowdermeister9y ago

Strange, the Wikipedia article is not on the first page, and don't blame me for searching for something non-German :)

https://deusu.org/query?q=berlin

semi-extrinsic9y ago

It's pretty obvious that Google et al. do a lot of "custom" filtering like prioritising Wikipedia, removing porn from "obviously non-porn" searches etc. (That "Berlin" search gives porn as the 8th result.)

gkst9y ago

I doubt that Google prioritizes Wikipedia deliberately. Wikipedia has tons of backlinks, authority, trust, typically a high text-to-HTML ratio, probably a low bounce rate. Moreover, it is fast, works well on mobile and on and on. It is just a very well done and useful site for users and search engines.

0xmohit9y ago

Earlier discussion: https://news.ycombinator.com/item?id=9122397

ommunist9y ago

DeuSu does not crawl social pages it seems. No traces of linkedin profiles and no facebook. From a certain point of view - this is a good thing.

billconan9y ago

I searched "meta programming c++" and the top returns are all about java.

I'm curious, is it expensive to run a search site like this?

deusuOP9y ago

Currently €300/month. More details on https://deusu.org/donate.html

vain9y ago

Google's secret ingredient to stay relevant and informational is Wikipedia.

Deusu on the other hand seems to weight words in urls highly.

If you search for scientology only on Deusu, you might end up wearing a funky hat https://deusu.org/query?q=scientology

amirouche9y ago

Have you thought about using database dumps of popular services like HN, SO or Wikipedia to speed up crawling and improve relevance?

deusuOP9y ago

Yes. I have downloaded several data dumps, but haven't gotten around to importing them yet.

outpan9y ago

Awesome job!

For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server with €300 per month. Especially the crawler part...

rbjorklin9y ago

What makes this better than https://duckduckgo.com ?

diggan9y ago

Not saying that it's better but one of the main selling-points of DeuSu seems to be that it's fully open source and independent search index. Duckduckgo, if I remember correctly, is not 100% open source and get their search index from Yahoo (or maybe Bing, not sure)

kowdermeister9y ago

If it's not good, then it doesn't matter if it's OSS or not.

vcool079y ago

Any specific reason you've used Pascal? I thought that language went extinct long ago.

deusuOP9y ago

It's alive and well. The TIOBE index still lists it ahead of Ruby, Swift, Objective-C, GoLang...

And I started this software 20 years ago. Granted, a LOT of the software has changed since then. But I don't see a reason to throw away existing code unless it is in need of so much change that rewriting from scratch would be easier. And even then I might stick to what I know best, and what fits best with other parts of the software.

malinens9y ago

works really fast!

deusuOP9y ago

Thx.

But all the traffic from here is currently driving the servers to their limit. Queries are already slowing down a bit because of imminent overload. Usually the average query takes about 250ms. Currently the average is at 334ms.

scandox9y ago

Every time I see new search engine projects I remember this: https://en.wikipedia.org/wiki/Cuil

I note that Dr Anna Patterson is back with Google. She wrote this in 2004: http://queue.acm.org/detail.cfm?id=988407

hvo9y ago

I am not sure many of the issues Dr. Anna Patterson raised here are applicable now. The web is way different now compared to 2004.

micwo9y ago

Deusu can't find deusu (or deusu.org)

https://deusu.org/query?q=deusu

deusuOP9y ago

And why should it? You are already at the destination. No need to find it. :)

micwo9y ago

Try to find any other site by url:

https://deusu.org/query?q=news.ycombinator.com

ashitlerferad9y ago

Another open source search engine:

http://yacy.net/

ytjohn9y ago

Thanks, I was trying to remember that one. I think that for any new, non-profit search engine to be viable, it has to be decentralized. deusu.org takes 2-3 months to crawl 2bn pages. Yacy claims to be at 1.4bn. I don't know how long it takes for that index to get refreshed, but it has 600 peer operators. Even if Yacy has a weaker indexing algorithm, I imagine that 600 peers, each crawling and contributing their own set of sites, must be faster than a single deusu node.

Yacy is also quite a bit more resilient.

I will say that I don't buy Yacy's "no censoring" statement. If I was a bad actor, I could run yacy on a computer with false dns and false certificates, and yacy could index my fake content with official looking URLs.

j / k navigate · click thread line to collapse

139 comments


pbhjpbhj9y ago

Keeping marketing sites high in your SERPs would make you way more money on referrals though.


ganeshkrishnan9y ago

This search engine seems to use only a tf-idf inverted index for its searches and then a vector space model for ranking by similarity.

A search for "java twitter bot" places the most emphasis on "bot", then on "java", and then on "twitter", which is what tf-idf would do.

A good start like you said but it's miles away even from yahoo or bing.
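To make the tf-idf point concrete, here is a minimal sketch in Python. It is not DeuSu's code; the toy corpus, the smoothed idf formula and all function names are assumptions chosen for illustration. It shows why the rarest query term ("bot" here) dominates a tf-idf score, and how cosine similarity in a vector space model turns those weights into a ranking.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, not taken from any real index).
DOCS = {
    "d1": "twitter client library",
    "d2": "java tutorial for beginners",
    "d3": "twitter bot written in java",
    "d4": "twitter api example",
}

def tokenize(text):
    return text.lower().split()

def idf(term, docs):
    """Smoothed inverse document frequency: rarer terms get a larger weight."""
    df = sum(1 for text in docs.values() if term in tokenize(text))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def tfidf_vector(text, docs):
    """Sparse tf-idf vector as a dict: term -> term frequency * idf."""
    counts = Counter(tokenize(text))
    return {term: tf * idf(term, docs) for term, tf in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs):
    """Rank all documents by cosine similarity to the query, best first."""
    q = tfidf_vector(query, docs)
    scored = [(cosine(q, tfidf_vector(text, docs)), doc_id)
              for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)
```

In this corpus "bot" appears in one document, "java" in two and "twitter" in three, so the weights come out in exactly that order. Real engines add signals beyond term statistics, which is the gap the comment is pointing at.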

6502nerdface9y ago


dood9y ago

I'd love a search engine which only indexes forums. Something I've been thinking of doing for years, but it'd be a lot of work.

dvirsky9y ago

There were a few attempts at that in the past, one being http://omgili.com/ that now seems to return pretty much garbage.

I think modern search engines do that behind the scenes today, and try to diversify the results to include pages from multiple genres, but they usually don't let you choose.


siegecraft9y ago

I thought about doing the same thing, found http://boardreader.com/ and then moved on. No idea how useful it is.

mox19y ago

Often times when searching for something, I would love this as well. I usually add phpbb, forums or discussion to the search keywords, but it's never 100%.

I would love something like this as well...

ap222139y ago

Oh, Is this the search wish-list thread? Ok: I just want a search engine that only indexes the sites that would be of interest to people like me.

But, if I happen to be looking for incorrect information from 2005, Google works like a charm.

beejiu9y ago

Another product (Discussions) that Google discontinued.

MasterScrat9y ago

Probably the easiest way would be to use Google Search APIs and do custom queries and/or filtering.

tspike9y ago

That's been my pet idea, as well. Would love to support it however I can. boardreader.com is OK but could be a lot better.

Keyframe9y ago

austingulati9y ago

This is very similar to PageRank except "vouch" is accomplished by linking to the other page.
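For readers who have not seen it, a minimal sketch of the PageRank idea in Python, where each link counts as a "vouch" for the target page. The link graph, the damping factor of 0.85 and the function name are assumptions for illustration, not anything taken from DeuSu or Google.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration over a dict: page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vouching power evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page with no outgoing links: spread evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: three pages vouch for "c", and "c" vouches for "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
RANKS = pagerank(LINKS)
```

Page "c" ends up with the highest rank because it collects the most vouches, and "a" comes second because the one vouch it receives is from a highly ranked page.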


Jaruzel9y ago

tempestn9y ago

I would almost want the opposite. The best content on most topics I've found tends to be a page on a discussion forum of some sort, followed by blogs and more general editorial sites.


richmarr9y ago

samstave9y ago

Why not have a search engine with "sub-reddits" that can be subscribed to...

Or has this been tried/found to be stupid?

allendoerfer9y ago

> Or has this been tried/found to be stupid?

You are describing the keywords meta tag.


iplaw9y ago

I've started appending "forum" to many of my queries when I am looking for answers from users.

jobigoud9y ago

A few years ago Google had a "discussions" category that you could pick alongside "images", "videos", etc. I wonder why they removed it. It was indexing forums, Google and Yahoo groups, etc.

ma2rten9y ago

I would love to have a search engine that would allow you supply your own ranking function.

amirouche9y ago


mack739y ago


Razengan9y ago

The "ideal" search engine is probably not possible without some sort of AI having access to all the content on internet.

rspeer9y ago

> some sort of AI having access to all the content on internet

Isn't this a description of any decent search engine?

CM309y ago

But I have to be honest and say that it's just not working for me.

I type in Reddit, and it shows links to the NSFW subreddits instead of the main site or anything else on it.

Typing in Wikipedia gives me the Dutch version of Wikipedia.

Mario Wiki? The page on Mario Wiki about Mario Clash, then the Smash Bros Wiki and a bunch of SEO spam pages.

Pokemon Go gets me no relevant results at all. Certainly not anything official, that's for sure.

It's a decent start, and having 2 billion pages indexed is pretty impressive for a project like this as it is, but it's just not really usable as a search engine just yet.

laurent1234569y ago

I'd really like to see that kind of project working as a good alternative to Google, but as it is it's not really usable.

deusuOP9y ago

runarb9y ago

If you want, you can use the list of adult words from the now-defunct web search engine boitho.com. It is available at https://github.com/searchdaimon/adult-words .

laurent1234569y ago

I've also searched for less generic things like "xperia z5" and the results looked good.


Taek9y ago

Is the two billion page index open source?

So Google and Facebook are left untouchable. Microsoft, Apple, and maybe Amazon can get in range. Is there anyone else?

That does mean a massive cultural shift. It means your neighbors will be able to look up your salary, your fetishes, your personal affairs. It's a big deal.

deusuOP9y ago

The software is already open-source.

A free search API will be fully available probably next week. It's in testing already. It's just a matter of putting the finishing touches on the documentation.

And the crawl- and index-data will be available for download in a few weeks. It's also just a matter of documenting the data-format.

BTW: I disagree with your points about privacy. I see DeuSu as a way of fighting back.

siegecraft9y ago

How does the search / indexing compare to sphinx or lucene?


jahewson9y ago

If you're looking for an open source web crawl, commoncrawl.org has billions of pages.

greglindahl9y ago

... and Common Search made an index of the homepages, too.

mi100hael9y ago

Interesting perspective. What about the other extreme, though? This circumstance has only really existed in the past 20 years or so, maybe less. Why not just revert some of your behaviors?

- Delete your Facebook account. If you really need it to keep in touch with people across the country, at least delete the app from your phone and don't leave it open in a browser tab.

- Don't use GMail. Use an email provider like Protonmail or Tutanota that doesn't index all of your emails for advertising or other purposes.

- Don't sign in to Google (or Waze or whoever else's site) when navigating or searching so all of your actions online aren't automatically tied to a single account.

- Don't buy some internet-of-shit appliance that inevitably phones home with a bunch of telemetry data just so you can get a push notification when your laundry is dry.

allendoerfer9y ago

fnord1239y ago

It's written in pascal. Neat.

However, it's not very good. If I search for "banana" I get information about a sex shop rather than about bananas.

0xmohit9y ago

Appears to be a "smart" search engine. Tries to infer a lot (maybe based on data collected earlier).

mstolpm9y ago

deusuOP9y ago

A fresh recrawl is currently running. Should take about 2-3 months. Newly crawled data will gradually replace older data during that time.

webtechgal9y ago

Great work, congrats. :-)

Here is some input based on my experience building a similar project at my former company. (We did not get to 2B pages, but were close to 300M):

Corresponding to each indexed URL, store a 'Last Crawled' time-stamp.
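A minimal sketch of how such a timestamp could drive recrawl scheduling, in Python. The 90-day threshold, the in-memory dict and the function name are assumptions for illustration; a real crawler would keep this in its URL database.

```python
from datetime import datetime, timedelta, timezone

def due_for_recrawl(last_crawled, now, max_age=timedelta(days=90)):
    """Return URLs that were never crawled or are older than max_age.

    last_crawled maps URL -> timezone-aware datetime, or None if the URL
    has never been fetched. Never-crawled URLs come first, then the rest
    ordered from stalest to freshest.
    """
    never = datetime.min.replace(tzinfo=timezone.utc)
    due = [
        (timestamp or never, url)
        for url, timestamp in last_crawled.items()
        if timestamp is None or now - timestamp >= max_age
    ]
    return [url for _, url in sorted(due)]
```

Usage would be along the lines of `due_for_recrawl(table, datetime.now(timezone.utc))`, feeding the result into the crawl queue.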

If you need any more help/input, let me know and I'll be happy to do what I can.

HTH and all the best moving forward.


jbb5559y ago

deusuOP9y ago

Thank you!

Depending on who you are (there were 2 bitcoin donations today), you funded either about 18 or 28 hours of operations. :)

ccleve9y ago

You get really good performance on not much hardware. Can you share some technical details?

- file formats, particularly the postings

- query evaluation strategy

- update strategy

I poked around in the source code a bit, but couldn't find these things.

deusuOP9y ago

File formats will be documented when I publish the data-files in a few weeks.

What do you mean by postings?

Query has two phases. First only a rough - but fast - ranking is done. Then the top results of all shards are combined and completely re-ranked. This is basically a meta search engine hidden within.

First query phase is in src/searchservernew.dpr, and the second phase is in src/cgi/PostProcess.pas.
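The two-phase idea can be sketched roughly as follows, in Python. This is not a translation of the Pascal sources named above; the scoring functions, the shard layout and every name here are assumptions made up for illustration. Phase one takes a cheap score inside each shard and keeps only the top few documents; phase two merges those candidates and re-ranks them with a costlier score.

```python
import heapq

def words(text):
    return text.lower().split()

def rough_score(terms, doc):
    """Phase 1 (cheap): number of query terms found in the body."""
    body = set(words(doc["body"]))
    return sum(1 for t in terms if t in body)

def full_score(terms, doc):
    """Phase 2 (costlier, hypothetical): body matches plus a title bonus."""
    title = set(words(doc["title"]))
    return rough_score(terms, doc) + 2 * sum(1 for t in terms if t in title)

def search(query, shards, per_shard=2, limit=3):
    terms = words(query)
    candidates = []
    for shard in shards:
        # Each shard only reports its best few documents by the rough score.
        top = heapq.nlargest(per_shard, shard,
                             key=lambda d: rough_score(terms, d))
        candidates.extend(d for d in top if rough_score(terms, d) > 0)
    # Merge all shard results and re-rank them completely.
    candidates.sort(key=lambda d: (-full_score(terms, d), d["url"]))
    return [d["url"] for d in candidates[:limit]]

# Hypothetical two-shard index.
SHARDS = [
    [
        {"url": "a.example/1", "title": "pascal compiler",
         "body": "free pascal compiler for linux"},
        {"url": "a.example/2", "title": "linux news",
         "body": "pascal mentioned once"},
        {"url": "a.example/3", "title": "cooking",
         "body": "banana bread recipe"},
    ],
    [
        {"url": "b.example/1", "title": "free pascal",
         "body": "free pascal documentation"},
        {"url": "b.example/2", "title": "delphi",
         "body": "delphi and pascal history"},
    ],
]
```

The design choice is that the expensive scoring only ever runs on a handful of candidates per shard, which is presumably part of how the queries stay fast on modest hardware.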

ccleve9y ago

Thank you. "Postings" is another word for the format of the doc ids and related information in the inverted file. A google for "inverted index postings" will turn up a bunch of references.
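As a small illustration of the term, here is a sketch of an inverted index in Python where each term maps to its postings list. The layout (document id plus word positions) is one common textbook choice and an assumption here; it says nothing about DeuSu's actual file format.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to its postings: [(doc_id, [positions]), ...].

    Postings are kept sorted by doc_id, which is what makes merging
    lists for multi-term queries cheap in real engines.
    """
    index = defaultdict(list)
    for doc_id in sorted(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)

def docs_matching_all(index, terms):
    """AND query: ids of documents whose postings exist for every term."""
    doc_sets = [{doc_id for doc_id, _ in index.get(t, [])} for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

# Hypothetical documents keyed by integer id.
DOCS = {
    1: "open source search engine",
    2: "search engine written in pascal",
    3: "pascal compiler",
}
INDEX = build_index(DOCS)
```

Here `INDEX["pascal"]` is the postings list for "pascal": one entry per document containing the word, along with where it occurs.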

pmontra9y ago

deusuOP9y ago

Originally it was written in Delphi. But I now use FreePascal for the development. I'm even compiling both Windows and Linux versions on my Linux machine.

pmontra9y ago

Great choice! Thanks.

RobAley9y ago

It appears to have moved over to FreePascal now, which is the free, open-source Delphi lookalike.

NKCSS9y ago

Fun, but overall quality seems a bit lacking.

https://deusu.org/query?q=nick+kusters

deusuOP9y ago

In my experience this is usually caused by the fact that even 2bn pages aren't that many nowadays. The index needs to get bigger to better find (and rank) long-tail results for queries like this.

gkst9y ago

Pascal is an interesting language choice. I think it is the first time I have seen an open source project written in Pascal that is actually used in production.

skykooler9y ago

It shows snippets of the web pages under each result; however, generally not the particular snippets that contain the search term. I would think that would be useful.
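One simple way to get such query-biased snippets, sketched in Python: slide a fixed-size window over the page text and keep the window containing the most query terms. The window size, the punctuation handling and the function name are assumptions for illustration, not a description of how any particular engine does it.

```python
def make_snippet(text, query, width=8):
    """Return the width-word window of text with the most query-term hits."""
    words = text.split()
    terms = {t.lower() for t in query.split()}

    def norm(word):
        # Strip common punctuation so "Pascal," still matches "pascal".
        return word.strip(".,;:!?()\"'").lower()

    best_start, best_hits = 0, -1
    for start in range(max(1, len(words) - width + 1)):
        hits = sum(1 for w in words[start:start + width] if norm(w) in terms)
        if hits > best_hits:  # the earliest best window wins ties
            best_start, best_hits = start, hits

    snippet = " ".join(words[best_start:best_start + width])
    prefix = "... " if best_start > 0 else ""
    suffix = " ..." if best_start + width < len(words) else ""
    return prefix + snippet + suffix
```

This scans the whole text for every result, so a production version would more likely work from stored term positions.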

deusuOP9y ago

Yes, it would be better.

yati9y ago

Looking at the source code took me back to days when I used to do stuff in Delphi :)

Neat project -- Loads of room for improvement, but a great initiative!

swiley9y ago

DanBC9y ago

DuckDuckGo's approach of !bang searches, making duckduckgo the place[1] I go when I want to search another site, is really useful.

[1] It's my default search engine in Chrome, so I use bang searching in the address bar.

Cyph0n9y ago

Same here. The problem is that I find myself using `!g` way too often... I guess I'm not used to the DDG results page.


rshm9y ago

As of Aug 16, Common Crawl has 1.73bn pages. For the complementary set of URLs, if it is of any benefit, you can use their data dump as a seed.

If the metadata (such as last modified) size of your index is small enough to upload to aws, you can also reduce your re-crawl efforts when they have a fresh release.

greglindahl9y ago

It doesn't have to be small to donate to Common Crawl, they have a free S3 bucket.

supersan9y ago

Hi, I find the blog more interesting right now, since I hope to find write-ups about how you were able to manage such a herculean task on your own.

Crawling 2bn pages could take forever and could generate huge bandwidth bills, so any lessons you learnt, pitfalls you faced, etc. would be a great read.

deusuOP9y ago

Some issues that appeared over the years:

I had one person who vehemently complained that I was trying to hack him, because the software downloaded his robots.txt. I'm NOT kidding! :)

webtechgal9y ago

> Run your own DNS server and even cache the hostname/IP pairs yourself.

This[1] might be a useful resource to get started:

[1] https://scans.io/

(Register and download the IPv4 Address Space data file to use as an initial cache and then append/update as you go.)
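A minimal sketch of the "cache the hostname/IP pairs yourself" part, in Python. The fixed one-hour lifetime, the class name and the injectable resolver and clock are assumptions for illustration; this sketch ignores the real TTL of each DNS record, which a proper resolver would honor.

```python
import socket
import time

class DnsCache:
    """Tiny in-memory hostname -> IP cache with a fixed entry lifetime."""

    def __init__(self, resolver=socket.gethostbyname,
                 ttl_seconds=3600.0, clock=time.monotonic):
        self._resolver = resolver  # called only on a cache miss or expiry
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # hostname -> (ip, expiry time)

    def lookup(self, hostname):
        now = self._clock()
        entry = self._entries.get(hostname)
        if entry is not None and entry[1] > now:
            return entry[0]
        ip = self._resolver(hostname)
        self._entries[hostname] = (ip, now + self._ttl)
        return ip
```

A crawler fetching many pages from the same host then pays for one lookup per hour per host instead of one per page. A downloaded IP list could be used to pre-fill `_entries`.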


ommunist9y ago

DeuSu does not seem to index the Cyrillic part of the Internet, and cannot give you results for Greek either; try https://deusu.org/query?q=ελιά . Is it a Latin/ANSI-only index?

deusuOP9y ago

Only ASCII and German umlauts (äöüß) at the moment. The parser needs rewriting. It was originally written in pre-unicode times. :)
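For comparison, a sketch of what a Unicode-aware tokenizer can look like, here in Python rather than Pascal. The normalization form, the use of case folding and the function name are assumptions for illustration and not a plan for the actual DeuSu parser.

```python
import re
import unicodedata

_WORD = re.compile(r"\w+")  # \w is Unicode-aware for str patterns in Python 3

def tokenize(text):
    """Split text into lowercase-like tokens for any script.

    NFC normalization makes composed and decomposed forms of the same
    character compare equal, and casefold() is a more aggressive lower()
    that also handles cases such as the German sharp s.
    """
    normalized = unicodedata.normalize("NFC", text)
    return _WORD.findall(normalized.casefold())
```

With this, Cyrillic and Greek queries produce ordinary tokens that can be looked up in the index like any ASCII word.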

tychuz9y ago

And all javascript related questions still have w3schools as first result, god dammit.

gkilmain9y ago

I think for newbs who want to learn the fundamentals of web dev w3schools is a good resource. Even the people over at w3fools admit it. For a deeper dive though clearly MDN is the winner.

kowdermeister9y ago

Strange, the Wikipedia article is not on the first page, and don't blame me for searching for something non-German :)

https://deusu.org/query?q=berlin

semi-extrinsic9y ago

gkst9y ago


0xmohit9y ago

Earlier discussion: https://news.ycombinator.com/item?id=9122397

ommunist9y ago

DeuSu does not crawl social pages it seems. No traces of linkedin profiles and no facebook. From a certain point of view - this is a good thing.

billconan9y ago

I searched "meta programming c++" and the top returns are all about java.

I'm curious, is it expensive to run a search site like this?

deusuOP9y ago

Currently €300/month. More details on https://deusu.org/donate.html

vain9y ago

Google's secret ingredient to stay relevant and informational is Wikipedia.

Deusu on the other hand seems to weight words in urls highly.

If you search for scientology only on Deusu, you might end up wearing a funky hat https://deusu.org/query?q=scientology

amirouche9y ago

Did you think about using database dumps of popular services like HN, SO or Wikipedia to speed up crawling and improve relevance?

deusuOP9y ago

Yes. I have downloaded several data dumps, but haven't gotten around to importing them yet.

outpan9y ago

Awesome job!

For the life of me I can't figure out how you manage to crawl over a billion web pages (even in 2-3 months), index the data and run the server on €300 per month. Especially the crawler part...
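To put a number on the crawler part, a back-of-the-envelope calculation in Python using the figures mentioned in this thread (2bn pages, 2-3 months per recrawl). The 30-day month and the assumption of a continuous, evenly paced crawl are simplifications of mine.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def pages_per_second(pages, days):
    """Average fetch rate needed to crawl `pages` pages in `days` days."""
    return pages / (days * SECONDS_PER_DAY)

INDEX_SIZE = 2_000_000_000
FAST = pages_per_second(INDEX_SIZE, 60)  # 2 months, assuming 30-day months
SLOW = pages_per_second(INDEX_SIZE, 90)  # 3 months
```

Under those assumptions the crawler has to sustain roughly 260 to 390 fetches per second, around the clock.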

rbjorklin9y ago

What makes this better than https://duckduckgo.com ?

diggan9y ago

kowdermeister9y ago

If it's not good, then it doesn't matter if it's OSS or not.


vcool079y ago

Any specific reason you've used Pascal? I thought that language went extinct long ago.

deusuOP9y ago

It's alive and well. The TIOBE index still lists it ahead of Ruby, Swift, Objective-C, GoLang...
