Explorations in manga recommendation systems: manga-recsys

Member
Joined
Jan 8, 2023
Messages
2
Hey all,

Over the winter holidays, I built a website to play around with MangaDex data and recommendation systems. It can be found at manga-recsys.geospiza.me, with source code on GitHub. It takes some inspiration from similar manga used by Neko, particularly around the use of static hosting for serving recommendations. I scraped the manga, groups, and chapter endpoints to try different techniques. You can find consolidated datasets in parquet and ndjson on the data page.

There are three smaller sub-projects that I'd like to call out for their interesting results:
  • Association rule mining of tags
  • Visualizations of tag embeddings
  • Re-ranking recent manga updates using a personal library

Association rule mining of tags


Association rule mining is a data mining technique to learn relationships between entities. The most famous example is the association between "diapers and beer" in grocery store purchases. I ran a rule-mining algorithm (FPGrowth) on the tag data. I built a small searchable table to look at the rules.


Here, we see that "Reincarnation" and "Fantasy" often lead to "Isekai". One potential application is to build up sets of tags that frequently appear together and use them to suggest tags in filters. You can play around with this table on the tag-rules page.
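
As a rough sketch of how a single rule like this can be scored: the snippet below computes support, confidence, and lift for one rule by brute force. The tag sets are made-up toy data and the function names are hypothetical. FPGrowth finds the frequent itemsets efficiently; this only illustrates the metrics behind a rule, not the algorithm itself.

```python
# Toy tag sets, one per manga (made up for illustration only).
manga_tags = [
    {"Reincarnation", "Fantasy", "Isekai"},
    {"Reincarnation", "Fantasy", "Isekai", "Action"},
    {"Fantasy", "Action"},
    {"Romance", "Comedy"},
    {"Reincarnation", "Fantasy", "Romance"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    ante = support(antecedent, transactions)
    cons = support(consequent, transactions)
    confidence = both / ante   # how often the rule holds when the antecedent appears
    lift = confidence / cons   # above 1 means the tags co-occur more often than chance
    return both, confidence, lift


sup, conf, lift = rule_metrics({"Reincarnation", "Fantasy"}, {"Isekai"}, manga_tags)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the antecedent tags make the consequent tag more likely than its base rate.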

Visualization of tag embeddings

The basic principle of a recommendation system is to find items (e.g., manga) that are similar to each other. Several methods exist to build recommendations, but I focus on a system based on tag similarity: two manga are similar if they share tags. We can build an embedding to represent each manga as a vector; an embedding maps items into a lower-dimensional geometric space that approximately preserves the distances between them in the original higher-dimensional space. I build embedding models using word2vec, latent semantic indexing, and bipartite networks. After building these, I visualize them in 2D plots.
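
To make the "similar if they share tags" premise concrete, here is a minimal sketch that ranks manga by Jaccard overlap of their tag sets. This is a swapped-in baseline, not one of the embedding models named above, and the catalog and function names are hypothetical toy values.

```python
def jaccard(tags_a, tags_b):
    """Shared tags divided by all distinct tags across both manga."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical catalog: title -> set of tags.
catalog = {
    "manga_a": {"Isekai", "Fantasy", "Reincarnation"},
    "manga_b": {"Isekai", "Fantasy", "Action"},
    "manga_c": {"Romance", "Comedy"},
}


def most_similar(title, catalog):
    """Rank every other manga by tag overlap with `title`, best first."""
    scores = [
        (other, jaccard(catalog[title], tags))
        for other, tags in catalog.items()
        if other != title
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("manga_a", catalog)
print(ranking)
```

Embeddings go a step further than this: they can place two manga close together even when their tags differ, as long as those tags tend to co-occur elsewhere.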


Play around with the embedding plots here; they have been pre-computed for all tags on each model I've built. The plots are a way to eyeball how well the recommendation models do in clustering manga based on their tags.

Re-ranking recent manga updates using a personal library

The last thing I'd like to present is a proof-of-concept application of the tag embedding models. I use the word2vec model to build a client-side re-ranking system of recent updates. You can add manga that you like to your personal library, and it will build a "preference vector" that represents your preferences.


We use this preference vector to sort the last 100 manga updates, so that manga that are most similar to your library are shown at the top. This works well if you have particular tastes (isekai anyone?), and is simple to implement. You can play with the proof-of-concept application here, which updates at most once every 5 minutes, and saves manga to your local storage.

I've pre-computed the word2vec vectors for each tag, and these are served to clients. Each tag vector is a 16-element float array. On the client, we represent a manga as the average of its tag vectors, and a client's personal library as the average of its manga vectors.
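
A minimal sketch of that averaging, written in Python for readability rather than as the site's actual client code. The tag vectors are hypothetical 4-element toy values (the real ones have 16 elements), and the function names are made up for illustration.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical tag vectors; the real ones are 16-element word2vec vectors.
tag_vectors = {
    "Isekai": [1.0, 0.0, 0.0, 0.0],
    "Fantasy": [0.0, 1.0, 0.0, 0.0],
    "Romance": [0.0, 0.0, 1.0, 0.0],
}


def manga_vector(tags, tag_vectors):
    """Average the vectors of the tags we have weights for (None if there are none)."""
    known = [tag_vectors[t] for t in tags if t in tag_vectors]
    return mean_vector(known) if known else None


def preference_vector(library, tag_vectors):
    """Average the manga vectors across the personal library."""
    vectors = [manga_vector(tags, tag_vectors) for tags in library]
    vectors = [v for v in vectors if v is not None]
    return mean_vector(vectors) if vectors else None


# A library is a list of manga, each given by its tag list.
library = [["Isekai", "Fantasy"], ["Fantasy", "Romance"]]
pref = preference_vector(library, tag_vectors)
print(pref)
```

Skipping tags without a known vector is an assumption of this sketch; it keeps a newly added tag from breaking the average.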

We compute a vector for each manga using the tags in the manga listing response. We then score each manga against the preference vector using cosine similarity (the dot product normalized by magnitude, i.e., the sum of element-wise products divided by the product of the two L2 norms). We sort by descending similarity, which ranges from 1 (same direction) to -1 (opposite direction).
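
The scoring and sorting step can be sketched as follows. The vectors and titles are hypothetical toy values, and returning 0.0 for a zero-length vector is an assumption of this sketch rather than something the site is known to do.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(preference, updates):
    """Sort (title, vector) pairs by descending similarity to the preference vector."""
    return sorted(
        updates,
        key=lambda item: cosine_similarity(preference, item[1]),
        reverse=True,
    )


# Hypothetical preference vector and recent updates.
preference = [0.25, 0.5, 0.25, 0.0]
updates = [
    ("update_x", [0.0, 0.0, 0.0, 1.0]),
    ("update_y", [0.5, 0.5, 0.0, 0.0]),
    ("update_z", [0.0, 0.0, 1.0, 0.0]),
]

ranked = [title for title, _ in rerank(preference, updates)]
print(ranked)
```

Because cosine similarity ignores vector length, a manga with many tags is not favored over one with few; only the direction of the averaged vector matters.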

This can all be done client-side, as long as you have the weights for the tags available locally.

The polar plots that are used to compare your library to a particular manga also have an interesting implementation:


Just like we can embed data into 2D to visualize it in a plot, we can also embed data into 1D to obtain an ordering along a line. You can see in these plots that the tags are arranged so that similar tags sit close together (except for the first and last elements, which meet at 0 and 360 degrees).
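
One way to get such a 1D ordering, as a sketch only: project each tag vector onto its first principal axis (approximated here with power iteration), sort by that coordinate, and spread the tags evenly around the circle. The 1D method behind the actual plots is not stated here, so PCA is an assumption, and the toy vectors and function names are hypothetical.

```python
import math


def first_principal_axis(centered, iterations=200):
    """Approximate the top principal component of centered rows by power iteration."""
    dim = len(centered[0])
    axis = [1.0 / (i + 1) for i in range(dim)]  # arbitrary non-zero starting direction
    for _ in range(iterations):
        # Multiply by X^T X without forming the matrix: first X @ axis, then X^T @ that.
        proj = [sum(x * a for x, a in zip(row, axis)) for row in centered]
        new_axis = [sum(p * row[i] for p, row in zip(proj, centered)) for i in range(dim)]
        norm = math.sqrt(sum(a * a for a in new_axis))
        if norm == 0.0:
            break
        axis = [a / norm for a in new_axis]
    return axis


def order_on_circle(tag_vectors):
    """Return (tag, angle in degrees) pairs, ordered by their 1D coordinate."""
    names = list(tag_vectors)
    rows = [tag_vectors[n] for n in names]
    dim = len(rows[0])
    mean = [sum(r[i] for r in rows) / len(rows) for i in range(dim)]
    centered = [[r[i] - mean[i] for i in range(dim)] for r in rows]
    axis = first_principal_axis(centered)
    coord = {n: sum(x * a for x, a in zip(row, axis)) for n, row in zip(names, centered)}
    ordered = sorted(names, key=coord.get)
    step = 360.0 / len(ordered)
    return [(name, i * step) for i, name in enumerate(ordered)]


# Hypothetical 2D tag vectors: two fantasy-like tags and two romance-like tags.
toy_tags = {
    "Isekai": [1.0, 0.1],
    "Fantasy": [0.9, 0.2],
    "Romance": [-1.0, 0.0],
    "Comedy": [-0.9, -0.1],
}
layout = order_on_circle(toy_tags)
print(layout)
```

Since a line has two ends and a circle has none, the first and last tags end up adjacent on the plot even though they are the least similar pair.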

Thoughts

It was a lot of fun to apply different algorithms to MangaDex data, and I learned a lot while reading all sorts of papers. Small projects like this manga recommendation exploration are an excellent way to stretch and learn. There's plenty of potential future work to do. For example, I've been considering automating analysis pipelines on up-to-date datasets. There are also extensions to the models that would be worthwhile to look at, like adding description embeddings (via GloVe/BERT/GPT-2 vectors) to improve recommendations/break symmetry and implementing rigorous quality measures to compare models.

I expect this work to be of real value mostly to myself, but if you find it exciting or applicable, feel free to suggest ideas or give feedback. I'm also available on Discord (geospiza#5912) on the Tachiyomi and MangaDex servers if you want to chat.
 
Double-page supporter
Joined
May 8, 2019
Messages
124
It's a good idea, but I think it would be excellent to train your model on a database that already exists. It might sound a little unethical, but scraping user-submitted MAL and AniList recommendations for manga would be a step in the right direction.
 
Member
Joined
Jan 8, 2023
Messages
2
Thanks for the feedback! I'm not interested in going in this direction -- models built on the outputs of another model aren't going to be very good (nor theoretically sound). And getting to the point of building a collaborative filtering model with all of the user-{anime|manga} data would require a lot more technical planning and breaking the TOS. Honestly, getting a clean dataset was probably the hardest thing about this whole project.

MangaDex could probably implement a half-decent recommendation system at some point, along the lines of "you might like this because other people who follow the same manga as you also follow this". They have all the data they need. But user-level data isn't available on the public API, so content-based recommendations are the way to scratch the curiosity itch. With some basic description-based features (and maybe rating/statistics from the API), the recommendations on my site might become up to par with similar manga's.
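
For illustration, the "other people who follow the same manga also follow this" idea can be sketched as a simple co-follow count. The follow lists below are entirely made up (no such data is available from the public API), and the function name is hypothetical.

```python
from collections import Counter

# Made-up follow lists: user -> set of followed manga.
follows = {
    "user_1": {"manga_a", "manga_b", "manga_c"},
    "user_2": {"manga_a", "manga_b"},
    "user_3": {"manga_b", "manga_d"},
    "me": {"manga_a"},
}


def also_followed(user, follows):
    """Count manga followed by users who share at least one follow with `user`."""
    mine = follows[user]
    counts = Counter()
    for other, theirs in follows.items():
        if other == user or not (mine & theirs):
            continue  # skip the user themselves and users with no overlap
        counts.update(theirs - mine)  # only count manga the user does not follow yet
    return counts.most_common()


suggestions = also_followed("me", follows)
print(suggestions)
```

A real collaborative filtering model would also weight users by how much they overlap and correct for overall popularity; this only shows the raw counting.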
 
Dex-chan lover
Joined
Oct 9, 2019
Messages
2,011
The problem with this is that a lot of people on MangaDex read total trash that they don't wanna recommend
 

rdn

Forum Admin
Staff
Developer
Joined
Jan 18, 2018
Messages
281
Definitely something I have been thinking about since forever, but on the list of things that we need/want on the lower end atm :cry:
 
