Option to exclude mangas from my library in advanced search [Rejected] (technical reasons, dev reply)

Dex-chan lover
Joined
Oct 9, 2019
Messages
2,019
I don’t think they will, because it would probably get too computationally expensive, which is something they hate. Right now they’re using a special database for searches, but it doesn’t have access to user data, so they’d have to run those searches on the more internal RDBMS, which is slower.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
That is indeed not planned at the moment, as it would be technically very challenging. We’ll keep thinking about ways we might be able to do it, but don’t expect anything soon. This is for the same reasons that no other large platform usually allows it :/
 
Active member
Joined
Jan 10, 2023
Messages
54
There is a MAL script that automatically loads the next page, but it has an extra feature which lets you hide titles in your library in advanced search.
Search will happen like it normally does (with titles in your library included); this script just hides them on the viewer's screen. I don't do scripting. If you can, check out their code, maybe you can use it here. I don't think this will make searching computationally expensive.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
If u can, check out their code, maybe you can use it here. I don't think this will make searching computationally expensive.
There is no script that can do it in a clever way. In practice there are only 2 approaches you can use (and it uses the second one).

1. The server filters it out

Meaning that it has to check, for every manga returned by the search, whether you follow it, discard it if so, and load more results in until the remaining results list is as big as expected.

This doesn't sound crazy at first, until you give it a few minutes of thought:
  • There are 60'000 mangas to search through
  • And there are a bit more than 3'000'000 users
  • And each user can add many manga to their library (say, a rather conservative average of 100 per user).

A search currently has to sift through the 60k manga. It's a lot but not that bad with a lot of optimization work.

A library-aware search still has to sift through the 60k manga, but then also check each result against the list of user+manga combinations. That list is about 300'000'000 entries long (users * reading statuses per user). And it needs to be checked for every single returned manga.

With our current somewhat low default of 32 titles per search page, that is 32 lookups into those 300m entries. Compared to the 60k titles, that list alone is 5'000 times more elements to look at per search.
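The arithmetic behind those figures, as a quick sketch. All inputs are the numbers quoted in this post (the average library size is the post's own guess), and the last line assumes a completely unindexed scan, which is a worst case for illustration rather than how any real system would do it.

```python
# Figures quoted in the post; avg_library_size is the post's own rough guess.
titles = 60_000
users = 3_000_000
avg_library_size = 100
page_size = 32

# Number of user+manga combinations the library check has to look into.
library_rows = users * avg_library_size

# How much bigger that list is than the list of titles.
ratio = library_rows // titles

# Worst case for illustration only: scanning the whole list once per returned title.
naive_row_visits = page_size * library_rows

print(library_rows, ratio, naive_row_visits)
```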

Now thankfully, it doesn't work out that badly at all from a technical standpoint (given some adjustments). But while it wouldn't be 5'000 times more work, it would still easily be on the order of 10x more expensive per-search at the very least, and would require us changing things in a way that would make user data take much more space.

Just check how fast a search is compared to opening your library page. It will give you an idea of it.

Also, if a user has all the manga that match the search in their library already, the search will keep filtering them all out, load more manga in (basically page 2), filter them all out, load more manga in (now page 3), etc. It will potentially take many search equivalents to return a single page of manga not in the user's library. So you can multiply the slowness by the number of searches this ends up requiring.
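A minimal sketch of that filter-and-refill loop, to make the "many search equivalents" point concrete. Everything here is hypothetical: `search_page` stands in for whatever runs one page of search, the library is a plain in-memory set, and the toy data is made up.

```python
from typing import Callable, List, Set, Tuple


def filtered_page(
    search_page: Callable[[int], List[str]],  # hypothetical: page number -> title ids
    library: Set[str],
    page_size: int = 32,
) -> Tuple[List[str], int]:
    """Return one page of titles not in `library`, plus how many searches it took."""
    kept: List[str] = []
    page_number = 0
    searches = 0
    while len(kept) < page_size:
        batch = search_page(page_number)
        searches += 1
        if not batch:  # no more matches at all
            break
        kept.extend(t for t in batch if t not in library)
        page_number += 1
    return kept[:page_size], searches


# Toy data: 200 matching titles, of which the user already follows the first 150.
matches = [f"title-{i}" for i in range(200)]


def toy_search(page_number: int) -> List[str]:
    return matches[page_number * 32:(page_number + 1) * 32]


result, searches_used = filtered_page(toy_search, set(matches[:150]))
print(len(result), searches_used)
```

In this toy case a single visible page costs six underlying searches, because the first four pages are filtered out entirely and the fifth only contributes ten titles.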

Basically, with the current site design, it's entirely unthinkable. And it is, once again, why no one else with a significant userbase offers that feature either.

2. The client filters the data

This is what the MAL script is doing. It's just loading more and more pages (so page 1, then page 2, ...) and hiding what is in your library.

Because if your page 1 has 32 results, all of which you follow, and they get hidden, then you see an empty search results page, but still a "page 2" button. Which is silly.

So to compensate the client massively overfetches. Spamming the crap out of MAL in the process.

So we're back to doing many-times-more searches per search.
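A rough way to estimate the overfetching: if a fixed number of the 32 results on each page are already followed (an even spread is assumed purely for simplicity), the client needs this many requests to show one full page. The function name is made up for this sketch.

```python
def requests_per_visible_page(followed_per_page: int, page_size: int = 32) -> int:
    """Requests needed to fill one visible page, assuming an even spread of followed titles."""
    visible = page_size - followed_per_page
    if visible <= 0:
        raise ValueError("every result is hidden; the client would never fill a page")
    return -(-page_size // visible)  # ceiling division on integers


for followed in (0, 16, 28, 31):
    print(followed, requests_per_visible_page(followed))
```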

---

In the end, if your site has a small-enough userbase, you can bruteforce it, but the performance cost of that grows very quickly with the userbase (more users means both more library entries and more searches), until you just cannot do it at all.

While we wish we could offer this feature, it's MUCH more complex to offer than it sounds, and scripts like the one you linked actively harm the websites in question. As far as I know, there just isn't a clever solution to this problem (which is why, again, no one offers that feature in general).

Hope the wall of text at least clarifies why I'm saying it's unlikely we'll offer this, at least not without us having a massive redesign of our search for that single feature, which would come with other features being impossible as a result. It's just not a critical enough feature to make everything else worse for.
 
Active member
Joined
Jan 10, 2023
Messages
54
If that script is that harmful, then I think I should remove it from my reply.
 
Dex-chan lover
Joined
Nov 5, 2018
Messages
170
I understand that you might not want to do anything on either the server or client side that would affect pagination, as that requires ongoing repeated recalculations, but surely something that includes the result but flags it would be doable? That way the same number of results will show up on the page, but a quick scan will tell the user that of the, say, 20 items on the page, 17 are already in their Reading List.
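A minimal sketch of that flagging idea, assuming the page of results and the user's library are already in hand as plain ids (both names below are made up). It leaves pagination untouched; it does not address where the library set comes from or what that lookup costs.

```python
from typing import Dict, List, Set


def flag_page(page_ids: List[str], library: Set[str]) -> List[Dict[str, object]]:
    """Keep every result on the page, only marking the ones already in the library."""
    return [{"id": title_id, "in_library": title_id in library} for title_id in page_ids]


flagged = flag_page(["a", "b", "c"], {"b"})
already_followed = sum(1 for item in flagged if item["in_library"])
print(flagged, already_followed)
```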

On a different note, I'm puzzled by the computational burden, because I also use the Novel Updates (novelupdates.com) site for reading light novels and web novels, and they have a very robust Series Finder that allows a lot of parameters--including the ability to filter out or in not only one reading list, but several different reading lists--and their search is very quick and responsive and also repaginates on the fly. I don't know if they have a sufficiently smaller user base, or it is because a majority of their content has few or zero images per chapter, but they seem to be doing something right. Maybe you could consult with their devs to see if there are actions you can take to implement some of their features without crashing your servers?
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
the Novel Updates (novelupdates.com) site for reading light novels and web novels, and they have a very robust Series Finder that allows a lot of parameters--including the ability to filter out or in not only one reading list, but several different reading lists--and their search is very quick and responsive and also repaginates on the fly.
We're aware of Novelupdates yes. First, I don't mean this as a criticism at all, because I don't know their internals nearly enough to judge them or anything, but a few points of comparison.

If I'm not wrong, they have about 14k titles indexed atm, to our 60k.

Sorting their titles by reader count, their top title (by descending number of readers) is https://www.novelupdates.com/series/trash-of-the-counts-family/ at 33'787 readers (at the time of writing).

Meanwhile, our top followed title (https://mangadex.org/title/32d76d19-8a05-4db0-9fc2-e0b0648fe9d0/solo-leveling) is at 220'758 followers. And you have to go all the way to the 17th page on MD to find 33k follows (and 17 pages * 32 titles per search page means on the order of the 5XXth most popular title).

So there's definitely a massive discrepancy in # of people engaging with our respective library systems. Maybe as a result of the userbase size, or maybe as a result of the userbase's behaviour, I don't know. Also I wasn't able to find their number of accounts (as opposed to guest users), so I can't comment on that, but everything seems to point at it being at least quite a bit lower.

Finally, I'd like to point out that they voluntarily don't seem interested in providing any public API, so they don't have to concern themselves with the # of off-site users as much as we do. And those are in fact a somewhat significant number of users.

In the end, props to them for having that, but I don't think it's a super valid comparison here, even without going into the other features they do not have (a feed per list, an aggregate feed for your lists, ...) etc.

Also, and it might not be the case here, but I want to emphasize that I never said this was impossible to do, but rather that doing it would require heavy tradeoffs and removing other features (since we'd have to optimize for that instead of for other things). Every website has to make these tradeoffs, and we believe that our current tradeoff in that regard is the one that enables 99% of the features we want.
 
VIP
Joined
Apr 29, 2019
Messages
35
I thought for a single query you don't need a cross join as it is possible to filter/index by user. But I agree that it is more computationally intensive than a filter on a single table. Especially with this number of queries.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
I thought for a single query you don't need a cross join as it is possible to filter/index by user. But I agree that it is more computationally intensive than a filter on a single table. Especially with this number of queries.
That is true yes; in our case it's compounded by the fact that our search engine is Elasticsearch, rather than MySQL (or other RDBMS). So we get ridiculously fast searches, but cannot use joins. Tradeoffs.
 
VIP
Joined
Apr 29, 2019
Messages
35
That is true yes; in our case it's compounded by the fact that our search engine is Elasticsearch, rather than MySQL (or other RDBMS). So we get ridiculously fast searches, but cannot use joins. Tradeoffs.
Just for the sake of argument you can have several databases. But robust sync is difficult.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
Just for the sake of argument you can have several databases. But robust sync is difficult.
Oh but we do; MySQL is our authoritative source of data, and Elasticsearch hosts a copy of it, ie publicly searchable data, and is the one called to handle searches.

So yes, we could fall back to MySQL if some query is too join-y for ES, even with the large amounts of denormalization we do at indexing time (like adding chapter-related data to title documents, such as whether it has chapters available in a language, and vice versa, like chapter documents being searchable based on their title's content rating).
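As an illustration of what denormalizing at indexing time can look like, here is a small sketch. The field names (`available_languages`, `title_content_rating`, and so on) are invented for the example and are not the actual document schema.

```python
from typing import Any, Dict, List, Tuple


def build_documents(
    title: Dict[str, Any], chapters: List[Dict[str, Any]]
) -> Tuple[Dict[str, Any], List[Dict[str, Any]]]:
    """Copy data across title and chapter documents so that no join is needed at search time."""
    # Title document learns which languages it has chapters in.
    languages = sorted({c["language"] for c in chapters})
    title_doc = {**title, "available_languages": languages}
    # Each chapter document carries its title's content rating.
    chapter_docs = [
        {**c, "title_content_rating": title["content_rating"]} for c in chapters
    ]
    return title_doc, chapter_docs


title = {"id": "t1", "name": "Example", "content_rating": "safe"}
chapters = [
    {"id": "c1", "title_id": "t1", "language": "en"},
    {"id": "c2", "title_id": "t1", "language": "fr"},
    {"id": "c3", "title_id": "t1", "language": "en"},
]
title_doc, chapter_docs = build_documents(title, chapters)
print(title_doc["available_languages"])
```

The price is that every chapter change may require reindexing the title document too, which is one reason this kind of copy works for public data but gets expensive for per-user data.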

However, we moved towards ES for (most) public requests on purpose, because we know very well that MySQL is not able to comfortably scale to the extent we need it to do (there are options like Vitess to shard it, but this gets fiendishly complex and also has limitations). This is why guests weren't able to use the search feature in v3 for example. We have nothing against using it, but we try to do so as sparingly as possible, because we know that there really is a finite "budget" when using it, and once it's hit there's just no magic trick you can pull to speed it up...
 
VIP
Joined
Apr 29, 2019
Messages
35
Thank you very much for the explanation.
The only good solution for scaling I know is YDB, but it requires more hardware than MySQL and a different SQL dialect.
 
Aggregator gang
Joined
Sep 7, 2019
Messages
303
Did you not consider Typesense instead of ES ?
It is fully open-source but offers cloud hosting at basically cost. It allows "NOT IN()" searches on indexed content while still being fast (on my self-hosted instance, anything above 20 ms is unheard of, and 2-5 ms is common). Recently, more complex querying was supported. The development cycle is fast and it's not uncommon to have several builds per week.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
Hadn't heard of it before, and does seem pretty cool tbh.

Allows "NOT IN()" searches on indexed content
Now the problem with that is that it means we'd have to index every single user's library in it. And that is by far the biggest dataset on MD, around a billion records and always growing... But maybe one day we'll have to take that leap, who knows.

offers cloud hosting at basically cost
We'd selfhost it anyway, but just for the fun of it we can do some napkin maths :)

Looking at our current data (ie without indexing the billion+ records for user libraries), their calculator recommends we use 100-150 GB of RAM for the cluster.

1. Compute: Checking their HA mode, dividing their recommended memory by 3 (so 150/3, rounded down to 32GB to be conservative), and using our current ES cluster per-node CPU average use (around 2.8CPU x 5 nodes, so ~4C/node in their 3-node HA offer) as a guideline, it works out to 1404$/month for the compute

2. Bandwidth: We don't have a great way to check for the network traffic split between ES nodes and between ES and our applications; but we can likely safely assume the latter to be the dominating factor, as MD is by far more of a read workload than a write one. Currently the ES cluster does about 83400GB/month of egress; so we'd be looking at almost 10000$/month in networking fees for it 😅

(edit: their faq reads "For Highly Available multi-node clusters, when you index data, the data first reaches one node and is then replicated to the other nodes in the cluster. This is considered outgoing bandwidth from the perspective of the node that first receives the data and then replicates it out.", so we don't even need to ponder the portion of that which is node-to-node traffic, and they count exactly the metric we have)
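The napkin maths above, written out. Every input is a number from this post; the 10000$ bandwidth figure is treated as an upper bound since the post says "almost", and the per-GB rate is only what those two numbers imply, not a quoted price.

```python
# Compute side
recommended_ram_gb = 150
ha_nodes = 3
ram_per_node_gb = recommended_ram_gb / ha_nodes  # the post then picks the 32 GB tier
es_cpu_total = 2.8 * 5                           # current ES cluster: ~2.8 CPU x 5 nodes
cpu_per_node = es_cpu_total / ha_nodes           # the post uses ~4C/node
compute_usd_per_month = 1404

# Bandwidth side
egress_gb_per_month = 83_400
bandwidth_usd_per_month = 10_000                 # "almost 10000$", so an upper bound
implied_usd_per_gb = bandwidth_usd_per_month / egress_gb_per_month

total_usd_per_month = compute_usd_per_month + bandwidth_usd_per_month
print(ram_per_node_gb, round(cpu_per_node, 2), round(implied_usd_per_gb, 3), total_usd_per_month)
```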

So yeah... This is why anything SaaS is entirely impossible for MD. It's always "contact us" territory. :dogkek:
 
Aggregator gang
Joined
Sep 7, 2019
Messages
303
I have to clear up a misunderstanding. With "indexed content" I meant just the 60,000 mangas.
Assuming each manga has a unique identifier "manga_id", and there is a MySQL table named "user_manga" holding "user_id, manga_id" combinations, then just query for the list of manga_ids for a given user (you can cache this per session or whatever to reduce repeated MySQL queries, and of course invalidate/refresh it on changes). For the search query's NOT IN(), you just feed in that query result, matching it against the field containing "manga_id".
Typesense has built-in query cache support, but I would assume you would control it yourself.
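A small sketch of the per-user id cache with invalidation described here. `LibraryIdCache` and its loader are hypothetical; in practice the loader would run something like a query on the "user_manga" table for one user_id, while this example uses an in-memory dict so it stays self-contained.

```python
from typing import Callable, Dict, FrozenSet, Iterable


class LibraryIdCache:
    """Caches each user's manga_ids until their library changes."""

    def __init__(self, load_ids: Callable[[int], Iterable[str]]) -> None:
        self._load_ids = load_ids  # hypothetical loader, e.g. a query on user_manga
        self._cache: Dict[int, FrozenSet[str]] = {}
        self.loads = 0  # how many times the backing store was actually hit

    def get(self, user_id: int) -> FrozenSet[str]:
        if user_id not in self._cache:
            self._cache[user_id] = frozenset(self._load_ids(user_id))
            self.loads += 1
        return self._cache[user_id]

    def invalidate(self, user_id: int) -> None:
        # Call whenever the user adds or removes a title.
        self._cache.pop(user_id, None)


rows = {1: ["m1", "m2"]}
cache = LibraryIdCache(lambda user_id: rows.get(user_id, []))
first = cache.get(1)
second = cache.get(1)   # served from the cache, no second load
rows[1].append("m3")
cache.invalidate(1)
third = cache.get(1)    # reloaded after the change
print(cache.loads, sorted(third))
```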

Therefore, their cloud could still be feasible. But you could also self-host on your existing servers to make up a cloud solution.
 
Yuri Enjoyer
Staff
Developer
Joined
Feb 16, 2020
Messages
464
I have to clear up a misunderstanding. With "indexed content" I meant just the 60,000 mangas.
Fair, though it doesn't really make any difference.

then just query for the list of manga_id's for a given user
This is the problem though... That, thousands of times per second, is in fact not cheap at all. And we lose any caching ability for those searches too.
But yeah, I see the approach you're suggesting, and it would work yes. Though as I said before, it's not so much complexity but rather performance that is the problem here.
Maybe we'll experiment with it eventually, but realistically it's quite unlikely still.

Also we wouldn't really need Typesense in any meaningful way for that, afaik? A bunch of must/must-not clauses by doc id on ES should do the exact same.
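For illustration, a request body with a must clause for the search text and a must_not clause by document id could be built like this. It uses the Elasticsearch `bool` and `ids` queries; the `title` field name and the helper function are assumptions made for this sketch, not the actual index mapping or code.

```python
from typing import Any, Dict, Iterable


def search_excluding(text: str, excluded_ids: Iterable[str], size: int = 32) -> Dict[str, Any]:
    """Build an Elasticsearch request body that skips the given document ids."""
    return {
        "size": size,
        "query": {
            "bool": {
                # "title" is a hypothetical field name for this sketch.
                "must": [{"match": {"title": text}}],
                "must_not": [{"ids": {"values": sorted(excluded_ids)}}],
            }
        },
    }


body = search_excluding("solo leveling", {"id-2", "id-1"})
print(body)
```

Note that the body differs for every user, which is the caching problem mentioned above: two users running the same search no longer produce the same query.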

Therefore, their cloud could still be feasible.
I'm not sure what part of your argument fixes the 10k$ bandwidth costs we'd incur with their cloud though...? Even assuming that manga searches are half of the total weight of searches (I doubt it, but I don't know, to be fair), that's still 5k$ per month 🤔
 
