> If you can, check out their code, maybe you can use it here. I don't think this will make searching computationally expensive.

There is no script that can do it in a clever way. In practice there are only 2 approaches you can use (and it uses the second one).
> There is no script that can do it in a clever way. In practice there are only 2 approaches you can use (and it uses the second one). [...]

If that script is that harmful, then I think I should remove it from my reply.
> There is no script that can do it in a clever way. In practice there are only 2 approaches you can use (and it uses the second one). [...]

I understand that you might not want to do anything on either the server or the client side that would affect pagination, as that requires ongoing, repeated recalculation, but surely something that includes the result and flags it would be doable? That way the same number of items would show up on the page, but a quick scan would tell the user that of, say, the 20 items on the page, 17 are already in their Reading List.
> I think progress would be made if we could at least show an indicator.

Yes, that would be significantly easier, and likely what we will eventually do.
> the Novel Updates (novelupdates.com) site for reading light novels and web novels, and they have a very robust Series Finder that allows a lot of parameters, including the ability to filter out or in not only one reading list but several different reading lists, and their search is very quick and responsive and also repaginates on the fly.

We're aware of Novelupdates, yes. First, I don't mean this as a criticism at all, because I don't know their internals nearly enough to judge them, but a few points of comparison.
There is no script that can do it in a clever way. In practice there are only 2 approaches you can use (and it uses the second one).
1. The server filters it out
Meaning that, for every manga returned by the search, it has to check whether you follow it, discard it if you do, and load more results in until the remaining result list is as big as expected.
This doesn't sound crazy at first, until you give it a few minutes of thought:
- There are 60'000 mangas to search through
- And there are a bit more than 3'000'000 users
- And each user can add many manga to their library (say, a rather conservative average of 100 per user).
A search currently has to sift through the 60k manga. It's a lot but not that bad with a lot of optimization work.
A library-aware search still has to sift through the 60k manga, but must then also check each result against the list of user+manga combinations. That list is now 300'000'000 rows long (3m users × ~100 reading statuses each), and it needs to be checked for every single returned manga.

With our current, somewhat low default of 32 titles per search page, that is 32 * 300m checks. Even setting the factor of 32 aside, 300m versus 60k titles means roughly 5'000 times more elements to look at per search.

Now thankfully, it doesn't work out quite that badly in practice (given some adjustments). But while it wouldn't be 5'000 times more work, it would still easily be on the order of 10x more expensive per search at the very least, and it would require changes that make user data take much more space.
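The arithmetic above can be checked on a napkin, using the same assumed figures as in the post (60k titles, ~3m users, ~100 library entries per user):

```python
titles = 60_000       # manga the search sifts through
users = 3_000_000     # registered users
avg_library = 100     # assumed average library size per user

library_rows = users * avg_library   # user+manga reading statuses
print(library_rows)                  # 300000000
print(library_rows // titles)        # 5000 -> the "5'000 times" figure
```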
Just check how fast a search is compared to opening your library page. It will give you an idea of it.
Also, if a user already has all the manga that match the search in their library, the search will keep filtering them all out, load more manga in (basically page 2), filter those out, load more in (now page 3), and so on. It can take many search-equivalents to return a single page of manga that aren't in the user's library. So you can multiply the slowness by the number of internal searches this ends up requiring.
Basically, with the current site design, it's entirely unthinkable. And it is, once again, why no one else with a significant userbase offers that feature either.
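The refill loop from point 1 can be sketched like so; `search_page` and `user_library` are hypothetical stand-ins for illustration, not actual MD internals:

```python
def filtered_search(query, user_library, search_page, page_size=32):
    """Server-side filtering sketch: fetch raw result pages and discard
    followed titles until a full page of unfollowed titles remains."""
    results, page = [], 1
    while len(results) < page_size:
        batch = search_page(query, page)   # hypothetical raw search call
        if not batch:                      # no more candidates anywhere
            break
        # Every discarded hit forces another trip into the index, so one
        # user-visible page can cost several backend search equivalents.
        results.extend(m for m in batch if m not in user_library)
        page += 1
    return results[:page_size]
```

If the user follows everything on pages 1 and 2, this loop quietly runs three or more full searches to render what looks like a single page.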
2. The client filters the data
This is what the MAL script is doing. It just loads more and more pages (so page 1, then page 2, ...) and hides whatever is in your library.

It has to, because if your page 1 has 32 results, all of which you follow, and they all get hidden, then you see an empty search-results page that still shows a "page 2" button. Which is silly.
So to compensate the client massively overfetches. Spamming the crap out of MAL in the process.
So we're back to doing many-times-more searches per search.
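To put a rough number on that amplification: if the user already follows most of what a query matches, the client needs many fetches per visible page (a toy estimate, not measured data):

```python
import math

page_size = 32     # results per fetched page
new_per_page = 3   # assume the user already follows ~29 of every 32 hits

# ceil(32 / 3) fetches needed to fill one screen with unfollowed titles
fetches = math.ceil(page_size / new_per_page)
print(fetches)     # 11 search requests for a single user-visible page
```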
---
In the end, if your site has a small enough userbase, you can brute-force it, but the performance cost grows rapidly with the userbase, until you just cannot do it at all.
While we wish we could offer this feature, it's MUCH more complex to offer than it sounds, and scripts like the one you linked actively harm the websites in question. As far as I know, there just isn't a clever solution to this problem (which is why, again, no one offers that feature in general).
Hope the wall of text at least clarifies why I'm saying it's unlikely we'll offer this, at least not without us having a massive redesign of our search for that single feature, which would come with other features being impossible as a result. It's just not a critical enough feature to make everything else worse for.
> I thought for a single query you don't need a cross join as it is possible to filter/index by user. But I agree that it is more computationally intensive than a filter on a single table. Especially with this number of queries.

That is true, yes; in our case it's compounded by the fact that our search engine is Elasticsearch, rather than MySQL (or another RDBMS). So we get ridiculously fast searches, but cannot use joins. Tradeoffs.
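For concreteness, emulating such a "join" in ES would mean fetching the user's library IDs first and inlining them into the query as an exclusion filter. A hypothetical sketch using the standard `bool`/`terms` query DSL; the field names (`title`, `id`) are invented, not the real MD mapping:

```python
def build_search_body(text, library_ids):
    """Build an ES query body that excludes a user's library app-side.

    The 'join' is emulated by inlining the user's library IDs into a
    must_not terms filter, which must be rebuilt for every user."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],
                # With 100+ entries per library, this clause rides along
                # on every search, making it neither cheap nor cacheable.
                "must_not": [{"terms": {"id": list(library_ids)}}],
            }
        },
        "size": 32,
    }
```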
> That is true yes; in our case it's compounded by the fact that our search engine is Elasticsearch, rather than MySQL (or other RDBMS). So we get ridiculously fast searches, but cannot use joins. Tradeoffs.

Just for the sake of argument, you can have several databases. But robust sync is difficult.
> Just for the sake of argument you can have several databases. But robust sync is difficult.

Oh, but we do; MySQL is our authoritative source of data, and Elasticsearch hosts a copy of it (i.e. the publicly searchable data) and is the one called to handle searches.
> Oh but we do; MySQL is our authoritative source of data, and Elasticsearch hosts a copy of it, i.e. the publicly searchable data, and is the one called to handle searches.
>
> So yes, we could fall back to MySQL if some query is too join-y for ES even with the large amounts of denormalization we do at indexing time (like adding chapter-related data to title documents, e.g. whether a title has chapters available in a language, and vice versa, like chapter documents being searchable based on their title's content rating).
>
> However, we moved towards ES for (most) public requests on purpose, because we know very well that MySQL is not able to comfortably scale to the extent we need it to (there are options like Vitess to shard it, but this gets fiendishly complex and also has limitations). This is why guests weren't able to use the search feature in v3, for example. We have nothing against using it, but we try to do so as sparingly as possible, because we know that there really is a finite "budget" when using it, and once it's hit there's just no magic trick you can pull to speed it up...

Thank you very much for the explanation.
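The denormalization described above (chapter facts copied onto title documents and vice versa at indexing time, so no join is needed at query time) can be pictured roughly like this; the field names are illustrative, not the real MD schema:

```python
def title_doc(title, chapters):
    """Build a searchable title document with chapter facts inlined."""
    return {
        "id": title["id"],
        "name": title["name"],
        # chapter-derived fact denormalized onto the title document
        "available_langs": sorted({c["lang"] for c in chapters}),
    }

def chapter_doc(chapter, title):
    """Build a searchable chapter document with title facts inlined."""
    return {
        "id": chapter["id"],
        "lang": chapter["lang"],
        # title-derived fact denormalized onto the chapter document
        "content_rating": title["content_rating"],
    }
```

The price of this pattern is that whenever the source row changes, every document carrying the copied fact has to be re-indexed.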
> Oh but we do; MySQL is our authoritative source of data, and Elasticsearch hosts a copy of it, ie publicly searchable data, and is the one called to handle searches. [...]

Did you not consider Typesense instead of ES?
> Allows "NOT IN()" searches on indexed content

Now the problem with that is that we'd have to index every single user's library in it. And that is by far the biggest dataset on MD, around a billion records and always growing... But maybe one day we'll have to take that leap, who knows.
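For scale, a back-of-the-envelope look at what indexing every library would mean. The billion-record figure is from the post above; the per-record size is a pure guess for illustration:

```python
library_records = 1_000_000_000   # ~a billion user-library rows (from the post)
bytes_per_record = 64             # assumed index overhead per record: a guess

index_size_gb = library_records * bytes_per_record / 1e9
print(index_size_gb)  # 64.0 GB just for library data, before replication
```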
> offers cloud hosting at basically cost

We'd self-host it anyway, but just for the fun of it we can do some napkin maths.
> I have to clear up a misunderstanding. With "indexed content" I meant just the 60,000 mangas.

Fair, though it doesn't really make any difference.
> then just query for the list of manga_id's for a given user

This is the problem, though... That, thousands of times per second, is in fact not cheap at all. And we lose any caching ability for those searches too.
> Therefore, their cloud could still be feasible.

I'm not sure what part of your argument fixes the $10k bandwidth costs we'd incur with their cloud, though...? Even assuming that manga searches are half of the total weight of searches (I doubt it, but I don't know, to be fair), that's still $5k per month 🤔
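The cost figure works out as stated; both numbers are the thread's own rough estimates, not measured data:

```python
monthly_bandwidth_cost = 10_000   # $ per month, the claimed cloud cost
manga_search_share = 0.5          # generous assumed share of total searches

remaining = monthly_bandwidth_cost * manga_search_share
print(remaining)  # 5000.0 -> still $5k per month
```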