Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal
“They’re [Reddit] killing everything for search but Google,” Colin Hayhurst, CEO of the search engine Mojeek, told me on a call.
Hayhurst tried contacting Reddit via email when Mojeek noticed it was blocked from crawling the site in early June, but said he has not heard back.
from @404mediaco
https://www.404media.co/google-is-the-only-search-engine-that-works-on-reddit-now-thanks-to-ai-deal/
@Mojeek I'm too tired to be surprised 😦
I usually search reddit directly or with bangs, so it shouldn't affect me too much, but there are times I've stumbled upon the odd useful Reddit thread in search results. Maybe time for Mojeek to start integrating a #Threadiverse search? 😉
@badrihippo @404mediaco lemmy and other similar tools don't have the same kind of discourse when it comes to searching them, right?
(as far as you are aware)
@Mojeek @404mediaco yeah no, I mean it's there but not in such volume due to fewer users. I'm guessing the relevant threads will already be showing up during ordinary crawling though! In which case you'll already be good 🙂
Of course, not taking away from the fact that it sucks and is unfair of Reddit to only allow Google crawlers 😠
@badrihippo @404mediaco means that for any of us who are trying to show the world we can have search engines outside of GAFAM the task is that little bit more difficult.
That being said, we didn't decide to take this path because we wanted an easy ride.
@Mojeek @404mediaco call me an optimist, but i think this is a self-destructive move for both of them.
@Mojeek @404mediaco Reddit is irrelevant anyway. I’ve been thrown on it by Google various times over the years, but never figured out either how to even navigate it or what I was coming for.
@mirabilos @404mediaco it can often not be useful; for some, though, it can answer questions. Either way the precedent isn't so good 😔
@Mojeek @404mediaco
Reddit kicks up such a fuss with my VPN I've stopped using it.
@Mojeek @404mediaco Reddit is quickly becoming as irrelevant as X.
@marta Abso-fucking-lutely needed. I can't stomach even going anywhere near Google and now even DuckDuckGo's giving shitty results, whether a side effect of being based on Bing or SEOs kicking in, and it's super disruptive when I'm trying to look for information on things.
@marta @justincroser Brave is best sent to /dev/null and never spoken of in polite conversation again.
In other words, fuck Brave.
this should teach us to never build our knowledge bases on proprietary, centralized platforms. but it won't of course.
because conformity with the crowd is just so comfortable.
public communications on #twitter, events and publications on #facebook, videos on #youtube, knowledge base on #reddit, support and real time communications on #discord, — despite existing distributed alternatives.
i have no more hope for humanity, nor enough energy to even be surprised.
@marta@corteximplant.net Sorry, didn't realize you had said index. Really tired
@marta want to start something? 😀 federated search index where the nodes share the “load” of running the indexers and then pass on index data to each other
@marta why not? There are many small things that can be done relatively easily.
I would build a multi-stage pipeline:
The first stage takes a URL, checks robots.txt for that URL, and then just spits out all links from the page into a DB.
The next stage would then load these links if not already in the DB, parse the content type, and put the content into the DB.
Another service would check for URLs that timed out and run HEAD requests to check for changes, putting them back into the queue.
1/2
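The first stage described above could look roughly like this Python sketch (the `toy-crawler` user-agent and function names are my own inventions, and error handling is omitted; the DB write is left out so any storage backend can be plugged in behind it):

```python
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_stage_one(url: str) -> list[str]:
    """Stage 1: honour robots.txt, fetch the page, return absolute links."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser(
        f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch("toy-crawler", url):
        return []  # disallowed by robots.txt: emit nothing for this URL
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative hrefs against the page URL before handing them on
    return [urljoin(url, link) for link in parser.links]
```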
Another service would use the text/html in the DB to build a searchable index.
Another service could calculate index weights for sorting search results, like old Google PageRank did.
If you break it down into such small steps it should be doable.
Creating a protocol for the nodes to share the index and calculate trust is another can of worms though. But to share an index you need one in the first place 🙂
I've been thinking about this for some time now. One could start the index with Wikipedia…
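The last two services could be sketched like this (a toy inverted index with AND search, plus a naive power-iteration PageRank; all names are made up, and dangling-node handling is skipped, so this is a prototype shape, not a production design):

```python
from collections import defaultdict

def build_inverted_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase token to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in text.lower().split():
            index[token].add(url)
    return index

def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return URLs containing every query token (AND semantics)."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results

def pagerank(links: dict[str, list[str]],
             damping: float = 0.85,
             iterations: int = 20) -> dict[str, float]:
    """Toy power-iteration PageRank over a {url: [outgoing urls]} graph."""
    nodes = set(links) | {u for outs in links.values() for u in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Base teleportation mass, then distribute each node's rank
        # across its outgoing links
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            if not outs:
                continue
            share = damping * rank[src] / len(outs)
            for dst in outs:
                new[dst] += share
        rank = new
    return rank
```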
@Mojeek @404mediaco Since when is "now"? Because I have not noticed any difference.
@WhyNotZoidberg @404mediaco if a crawler is blocked then you won't see it immediately, try restricting your searches to the past week
@marta you don’t have to do it alone ;) and for the prototyping stage, to see if it even pans out as one expects, i am fine with the hackiest of hacks. You can iterate on the code if it works out. That’s the reason why i would break it down into such small parts. If it turns out doing x in step y is a bad idea, you can just rip that out and try a different thing. Parsing “the web” is inherently messy, so why overcomplicate things. I think i’ll set something up. Join if you want!
@Mojeek @404mediaco As a non-Google user, I love that I'm now likely to become a non-reddit user too.
@Mojeek @404mediaco Mojeek looks very interesting, I will try it! What's the modern equivalent of putting reddit on the end of your query so you don't just get the terrible SEO websites?
Common Crawl is the closest thing we have to an open index, though it doesn’t meet your requirement of ignoring robots.txt for corporate websites while obeying it for personal sites. Unfortunately, being open and publicly available means that people use it to train LLMs. Google did this for initial versions of Bard, so a lot of sites block its crawler. Most robots.txt guides for blocking GenAI crawlers include an entry for it now.
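For reference, the entry those guides recommend targets Common Crawl's actual crawler user-agent, CCBot:

```
User-agent: CCBot
Disallow: /
```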
Common Crawl powers Alexandria Search and was the basis of Stract’s initial index, both of which are upstart FOSS engines.
A similar EU-focused project is OpenWebSearch/Owler.
Originally posted on seirdy.one: See Original (POSSE).
@hook @marta CC is an index for anybody to use. This reduces the barrier to create alternatives to other indexes, like those owned by Google/Bing/Yandex. On one hand, this makes upstart engines and research like the Web Data Commons possible; on the other hand, it allows all sorts of bad actors (non-consensual GenAI, for instance) to index your pages easily. If Common Crawl embedded information from the upcoming W3C TDM Reservation Protocol in each site/page, this would be partially resolved; bad actors could ignore this but well-behaved actors would at least know what they’re welcome to do with each page. Right now, they don’t even know so they can’t comply with your preferred rules easily; all they have are X-Robots tags, which are a bit limited.
I think that the CC does more good than harm on my site, so I allow it. I can’t speak for everyone else.
@Mojeek @404mediaco I tend to avoid Gargoyle, so unless I'm using Kagi it seems Reddit is off the menu!
@RupertReynolds @404mediaco Kagi should still be able to access it through Google
@Mojeek @404mediaco this sounds like an opportunity for blog authors to publish all the helpful tricks they’ve learned from Reddit over the years and benefit from the increased attention from search engines
@djsf @404mediaco yeah i think they might have ripped off lemmy and like... centralized it or something 🤷
@Mojeek @404mediaco Initial impressions great so far, going to replace Duck as my primary, let's see! 🙏 How are you making money, standard search ads? It's very clean and nice to use!
@implementcontrols @404mediaco API mainly, lots of demand for an index these days
@corruptian @marta That's pretty much the opposite of net neutrality.
@Mojeek very cool, it's a very nice experience to use! I am glad you have found this a viable business model!
Net neutrality says ISPs cannot charge the biggest users of their lines more money.
That empowers these big users to make exclusive deals because now there are no consequences to them.
As usual, normals do not understand net neutrality, which is a legal, not a technical, argument.
Google and Reddit are the companies the ISPs wanted to charge more.
Six big sites make up 90% of the traffic on local ISPs; the ISPs wanted to charge them more. This was rejected.
Now that they have this monopoly, these sites are going to trade on it. The market mechanism that would limit them was removed by government.
@iacore federated search indexing is a thing (YaCy is one attempt, but results are mixed), as is distributed crawling with a centralized search index (mwmbl).
@iacore i already thought about how to get the first “seed” links. One could start with a Wikipedia export, but perhaps harvesting links from public feeds on the fedi would be a more or less curated option (of course without any PII or post text, just as an input to the indexer, with consent checks, etc.). I don’t want to start another slashdot/digg/reddit though.
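Harvesting seed links from public feeds could be as simple as this stdlib-only sketch (the function name is mine; it handles basic Atom and RSS and keeps nothing but the URLs, as the consent caveat above requires):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def seed_links_from_feed(feed_xml: str) -> list[str]:
    """Pull link URLs out of an Atom or RSS document as crawl seeds.

    Only the link URLs are kept -- no post text or personal data.
    """
    root = ET.fromstring(feed_xml)
    links = []
    # Atom: any <link href="..."/> element (feed- or entry-level)
    for link in root.iter(f"{ATOM_NS}link"):
        href = link.get("href")
        if href:
            links.append(href)
    # RSS: <item><link>...</link></item>
    for item in root.iter("item"):
        el = item.find("link")
        if el is not None and el.text:
            links.append(el.text.strip())
    return links
```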