'No Trash Search' is very focussed on STEM and not "for daily use". It's surprisingly good when you're looking for certain kinds of information. Under the hood it's little more than a programmable search engine [1] with a whitelist of ~120 sites.
So back to what web search was in the 1990s, roughly: an index from a curated selection of sites.
(BTW, are there any good search engines these days that aren't indirectly using Google or Bing?)
Sure - the web is now a cesspool optimized for advertising and attention. The traditional search engines made a lot more sense at the dawn of the internet when it was more about discovery. Now, for the most part, it's closer to an information retrieval tool, where a finite list of established sites have the bulk of what one is looking for. It only makes sense to have a tool that lets one navigate the established, legit internet, and not have to deal with all the crap.
That doesn't mean there is no use case for google as it is, but some more focused competition is a no brainer.
The code for Gigablast is open-source, including the crawler.
I could be wrong, but I do not think either search.marginalia.nu or wiby.me uses Google or Bing.
The comment about "hundreds of millions" is interesting. Assume hypothetically that a search engine claimed to be searching millions of sites for a given query, but in truth it was only searching the 120 sites that it had determined answered this query (i.e., were the most popular answer sources) for the majority of users. How would a user verify that the search engine's claim about searching millions of sites was true? What if the search engine only allowed the user to retrieve a maximum of about 230 results, no matter how many sites it claimed to search?
I searched for “3 hole punch review” [1] here, and the results have zero relevancy.
The first one is a Chinese cell phone company, the second a Wikipedia page for an episode of The Office, the third a thesaurus page with synonyms for ‘colorful’, and the fourth a link to the Wikipedia page of Yellow Submarine.
I can’t even imagine how you get there from “3 hole punch review”.
Nice SEO campaign ;)
If only I could get NTS to whitelist my domain name (myfirstnamelastname dot com). The Big-G has hated it, seemingly since before I acquired it more than ten years ago, even though it's ad-free and totally benign. Good thing I mostly just host go pkgs with it and use it for my email.
p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).
There is a form [0] on the about page that allows people to suggest websites to add :)
> p.s. OP this is amazing! Would love an article explaining any backstory and details on how you made this (or setup / configured it).
Thanks! I think this is gonna be disappointing from an engineering perspective, and certainly not article worthy :) As further explained in my other comments, the website is basically a wrapper around google programmable search [1] where I whitelisted a set of sites I found useful personally, plus some suggestions from other users. It's really easy to set up.
As to why, I will quote some other comments of mine:
"I built this website a couple of months ago because I was annoyed by how hard it was to find useful things on Google."
"to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). ... When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries."
Let me know if you have other questions!
[0] https://docs.google.com/forms/d/e/1FAIpQLSdf8lAoShQz7Wjl9h60...
[1] https://developers.google.com/custom-search/docs/overview
The site uses a whitelist of URLs to (attempt to) keep results relevant to science and programming. In the context in which I'm using this search engine, I have no interest in (reviews on) 3 hole punches. (That's not to say I never do, but in that case I'd use Google, Reddit, etc.) The fact that results don't show up here means that they also won't show up when I'm not looking for them, which is 100% of the time when I'm using this search engine. That's a plus for me personally.
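To make the "whitelist of URLs" idea concrete, here is a minimal sketch in Python of filtering result URLs against an allowlist of domains. It is not how the site does it (the real filtering happens inside Google's Programmable Search); the `ALLOWED_SITES` set and the function names are made up for illustration.

```python
from urllib.parse import urlparse

# Hypothetical allowlist, for illustration only. The real list is
# configured in Google's Programmable Search, not in code like this.
ALLOWED_SITES = frozenset({
    "stackoverflow.com",
    "developer.mozilla.org",
    "en.wikipedia.org",
})


def is_allowed(url, allowed=ALLOWED_SITES):
    """Return True if the URL's host is an allowed site or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == site or host.endswith("." + site) for site in allowed)


def filter_results(urls, allowed=ALLOWED_SITES):
    """Keep only the URLs whose host is on the allowlist, preserving order."""
    return [u for u in urls if is_allowed(u, allowed)]
```

Matching on the full host or a dot-prefixed suffix avoids letting a lookalike such as `notstackoverflow.com` slip through, which a plain substring check would allow.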
Best case would be to have relevant results in a single search engine, but that's not what I intended when building this site.
One way to improve is a "bring your own list" feature, and the ability to include vetted lists. Maybe some kind of web of trust - if your friends have whitelisted a site, it is whitelisted for you too. If you find a problem with that site, you let your friend know to remove it. If they don't respond you can remove that friend from your trusted person list (maybe they got hacked?). Then maybe you can 'follow' a few lists of famous trusted people (e.g. paulg etc.) to build up a bigger slice of the internet you can search.
A spammer will want to come in then and create something that white lists their spam sites, but they need to convince you to add their list! And when you see the spam you can just unfollow them. They can't succeed.
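A rough sketch of that web-of-trust idea in Python, assuming trust goes only one hop (your own list plus the lists of people you follow directly). The data layout and the name `effective_allowlist` are invented for this example; nothing like this exists in the actual site.

```python
def effective_allowlist(user, own_lists, follows):
    """Union of the user's own sites and the sites of everyone they follow.

    own_lists: dict mapping a user name to the set of sites they whitelisted.
    follows:   dict mapping a user name to the set of users they trust.
    Trust is one hop only (an assumption): friends of friends are ignored.
    """
    sites = set(own_lists.get(user, ()))
    for friend in follows.get(user, ()):
        sites |= set(own_lists.get(friend, ()))
    return sites


own_lists = {
    "me": {"arxiv.org"},
    "friend": {"developer.mozilla.org"},
    "spammer": {"spam.example"},
}
follows = {"me": {"friend", "spammer"}}

before = effective_allowlist("me", own_lists, follows)

# Unfollowing the spammer removes their sites on the next lookup.
follows["me"].discard("spammer")
after = effective_allowlist("me", own_lists, follows)
```

Because the list is recomputed from who you follow, unfollowing someone is enough to drop every site that only they vouched for.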
The GitHub repo was third and had to be scrolled to.
Seems pretty trash to me.
Also search.marginalia.nu puts a smile on my face almost every time I use it :-)
(I should try Neeva, I keep hearing good things about it.)
One suggestion that I have is to remove w3schools.com from the whitelist. MDN is a much better source for information about web development.
Ads are added automatically by Google. The whole thing is little more than a wrapper around the 'Programmable Search Element Control API' which is an HTML element you can just insert into any site, like an iframe. Unfortunately this is the only way to make Programmable Search available at scale as the API is restricted to either 10 sites or 10K queries / day, even when paid!
There is a paid version for the HTML plugin, but that would leak the API key and so it wouldn't work as a business.
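For comparison, here is a sketch in Python of what a request to the Custom Search JSON API looks like, which shows why the key is a problem: it travels in the query string, so any client-side use exposes it. The endpoint and the `key`, `cx` and `q` parameters are the documented ones; the function name and the placeholder values are made up, and the sketch only builds the URL without sending anything.

```python
from urllib.parse import urlencode

# Documented endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Build the request URL for one search.

    api_key   -> the `key` parameter (secret; keep it on a server).
    engine_id -> the `cx` parameter identifying the programmable search engine.
    """
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_ENDPOINT + "?" + urlencode(params)


# Placeholder values, not real credentials.
url = build_search_url("protein folding", api_key="YOUR_KEY", engine_id="YOUR_CX")
```

Anyone who can see this URL can reuse the key, so a hosted version would need a server-side proxy or, as suggested in this thread, users bringing their own keys.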
There is an option to get a share of the revenue generated by a search engine. Maybe it's time for me to figure out how that works.
I was thinking of making a hosted, ad free, customizable version where people upload their own keys. Not sure if people would like that.
As a side-note, it's super easy to remove ads with 1 line of CSS, but I wasn't sure how Google would feel about that so it's not in the online version. TamperMonkey is an extension that allows people to insert their own CSS on different websites. Hmm.
You can view all offerings in the docs [0].
[0] https://developers.google.com/custom-search/docs/overview#su...
Right now, looking at your allow-list config, it feels a bit custom to you, but if I had an easy way to limit search to the sites I myself know and trust, I could see how that would be useful.
I know I could probably pick it out of my browser's history UI & poke it into Google's Programmable Search UI, but that seems like a hassle and a half.
With caching, I think you might be able to reduce the load.
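A minimal sketch of that caching idea in Python, assuming repeated queries within some time window can share one upstream request. The class name, the one-hour default and the `fetch` callback are all hypothetical, and whether caching results fits Google's terms is not something this sketch addresses.

```python
import time


class TTLCache:
    """Tiny in-memory cache that forgets entries after `ttl_seconds`."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so the cache can be tested
        self._store = {}            # normalized query -> (timestamp, value)

    def get_or_fetch(self, query, fetch):
        """Return a cached result for `query`, calling `fetch(query)` on a miss."""
        # Normalize case and whitespace so trivially different queries share an entry.
        key = " ".join(query.lower().split())
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]
        value = fetch(query)
        self._store[key] = (now, value)
        return value
```

Every cache hit is one query that does not count against a daily quota, so even a short TTL helps when many people search for the same thing.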
Also, why is w3fools in the list? It's an awful site.
The poster did say it was mostly for STEM subjects though...
More importantly though, I think "Best smartphone 2021" is really a search that has been conditioned on the crap google gives back now. At best you might expect to find a "best smartphones" listicle or something.
This is just a whitelisted search, so in my 5 min playing with it, it looks like popular or consumer queries are more likely to just provide reddit or wikipedia links, while more technical searches land on SO or documentation sites.
I think with a little tuning, this approach is great. Given the modern internet and all the crap there is, a manual whitelist of sites that are actually legit is always going to be superior to an algorithmic approach.
Honestly I just created this search engine for myself to find things more easily while programming or studying (I study biology, cs and ai; and philosophy in my free time, so expect the best results for queries related to those subjects). I think those subjects also appeal to the HN audience, that's why I shared it here. When I'm not doing those things, I just use Google or DDG because they have better results for day-to-day queries.
That being said, I'm definitely interested in helping improve other people's search as well (reason I'm posting at all), so let me know if you have suggestions for sites to add!
The blob of ads still takes up the top results. This is not the "no trash search" I'm looking for.
As explained in my other comment, this website is a wrapper around google programmable search. The actual searching happens on Google servers, and I can see why people have problems with that. The code you see on the website is the same as the repo, though. It's actually hosted by GitHub! You can verify this by opening the web inspector in any browser or looking at the `.github.io` portion of the URL.
You can learn more about Programmable Search here: https://developers.google.com/custom-search/docs/overview. NoTrashSearch uses the 'Programmable Search Element Control API', which is documented here: https://developers.google.com/custom-search/docs/element and can be used with very little code!
Stupid question though: where is the list of whitelisted sites? Is that something you set up separately with Google? I scanned through the code and expected to find a list somewhere, but obviously you do it in a different way.