Explorations in manga recommendation systems: manga-recsys

geospiza · Jan 8, 2023

Hey all,

Over the winter holidays, I built a website to play around with MangaDex data and recommendation systems. It can be found at manga-recsys.geospiza.me, with source code on GitHub. It takes some inspiration from similar manga used by Neko, particularly around the use of static hosting for serving recommendations. I scraped the manga, groups, and chapter endpoints to try different techniques. You can find consolidated datasets in parquet and ndjson on the data page.

There are three smaller sub-projects that I'd like to call out for their interesting results:

Association rule mining of tags
Visualizations of tag embeddings
Re-ranking recent manga updates using a personal library

Association rule mining of tags

Association rule mining is a data mining technique to learn relationships between entities. The most famous example is the association between "diapers and beer" in grocery store purchases. I ran a rule-mining algorithm (FPGrowth) on the tag data. I built a small searchable table to look at the rules.

https://imgur.com/09pKa11

Here, we see that "Reincarnation" and "Fantasy" often lead to "Isekai". One potential application is to build up sets of tags that frequently appear to make suggestions in filters. You can play around with this table on the tag-rules page.
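To make the rule metrics concrete, here is a minimal sketch of how support and confidence could be computed for a single rule by brute force. It is not FPGrowth (which finds frequent itemsets efficiently), and the tag sets below are made up for illustration rather than taken from the MangaDex data.

```python
# Hypothetical tag sets, one per manga (not taken from the real dataset).
manga_tags = [
    {"Reincarnation", "Fantasy", "Isekai"},
    {"Reincarnation", "Fantasy", "Isekai", "Action"},
    {"Reincarnation", "Fantasy", "Romance"},
    {"Fantasy", "Action"},
    {"Romance", "Comedy"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every tag in itemset."""
    hits = sum(1 for tags in transactions if itemset <= tags)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


# Rule under test: {Reincarnation, Fantasy} -> {Isekai}
antecedent = frozenset({"Reincarnation", "Fantasy"})
consequent = frozenset({"Isekai"})

rule_support = support(antecedent | consequent, manga_tags)
rule_confidence = confidence(antecedent, consequent, manga_tags)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, two of the five manga carry all three tags, and two of the three manga tagged "Reincarnation" and "Fantasy" are also tagged "Isekai".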

Visualization of tag embeddings

The basic principle of a recommendation system is to find items (e.g., manga) that are similar to each other. Several methods exist to build recommendations, but I focus on a system based on tag similarity: two manga are similar if they share tags. We can build an embedding to represent each manga as a vector. An embedding maps items from a high-dimensional space into a lower-dimensional one while approximately preserving the distances between them. I built embedding models using word2vec, latent semantic indexing, and bipartite networks, and then visualized them in 2D plots.
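Before any embedding is involved, the "similar if they share tags" idea can be sketched with plain Jaccard similarity on tag sets. This is a simpler stand-in, not one of the three embedding models listed above, and the tag sets are hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Shared tags divided by all distinct tags across both manga."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / len(union)


# Hypothetical tag sets for three manga.
manga_a = {"Isekai", "Fantasy", "Reincarnation"}
manga_b = {"Isekai", "Fantasy", "Action"}
manga_c = {"Romance", "Comedy"}

print(jaccard(manga_a, manga_b))  # two shared tags out of four distinct tags
print(jaccard(manga_a, manga_c))  # no shared tags
```

An embedding goes a step further than this: tags that never co-occur on the same manga can still end up close together if they appear in similar contexts.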

https://imgur.com/LchE7J6

Play around with the embedding plots here; they have been pre-computed for all tags on each model I've built. The plots are a way to eyeball how well the recommendation models do in clustering manga based on their tags.

Re-ranking recent manga updates using a personal library

The last thing I'd like to present is a proof-of-concept application of the tag embedding models. I use the word2vec model to build a client-side re-ranking system for recent updates. You can add manga that you like to your personal library, and it will build a "preference vector" that represents your preferences.

https://imgur.com/qRKMDet

We use this preference vector to sort the last 100 manga updates, so that manga that are most similar to your library are shown at the top. This works well if you have particular tastes (isekai anyone?), and is simple to implement. You can play with the proof-of-concept application here, which updates at most once every 5 minutes, and saves manga to your local storage.

I've pre-computed a word2vec vector for each tag, and these vectors are served to clients. Each tag vector is a 16-element float array. On the client, we represent a manga as the average of its tag vectors, and a client's personal library as the average of its manga vectors.
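The averaging step could look roughly like the sketch below. It is written in Python for readability (the site itself runs this logic in the browser), the tag vectors are made-up 3-element arrays instead of the real 16-element ones, and the function names are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical 3-element tag vectors; the post uses 16 elements per tag.
tag_vectors = {
    "Isekai": [1.0, 0.0, 0.0],
    "Fantasy": [0.0, 1.0, 0.0],
    "Romance": [0.0, 0.0, 1.0],
}


def manga_vector(tags):
    """A manga is the average of the vectors of its known tags."""
    return mean_vector([tag_vectors[t] for t in tags if t in tag_vectors])


# A personal library of two manga, each given as a list of tags.
library = [["Isekai", "Fantasy"], ["Fantasy", "Romance"]]

# The preference vector is the average of the manga vectors.
preference = mean_vector([manga_vector(tags) for tags in library])
print(preference)
```

Because "Fantasy" appears in both library entries, it ends up with the largest weight in the preference vector.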

We compute a vector for each manga using the tags in the manga listing response. We then measure how close each manga is to the preference vector using cosine similarity (the dot product normalized by magnitude, i.e. the sum of element-wise products divided by the product of the two L2 norms). Cosine similarity ranges from 1 (same direction) to -1 (opposite direction), and we sort by descending similarity.
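A minimal sketch of the scoring and sorting step, again in Python rather than the client-side code the site actually uses. The update list, titles, and vectors are hypothetical, and treating a zero-length vector as similarity 0 is an assumption, not something stated in the post.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an all-zero vector as "no signal"
    return dot / (norm_u * norm_v)


def rerank(preference, updates):
    """Sort (title, vector) pairs by descending similarity to preference."""
    return sorted(
        updates,
        key=lambda item: cosine_similarity(preference, item[1]),
        reverse=True,
    )


# Hypothetical preference vector and recent updates.
preference = [0.25, 0.5, 0.25]
updates = [
    ("manga-a", [0.0, 0.0, 1.0]),
    ("manga-b", [0.5, 0.5, 0.0]),
    ("manga-c", [-0.25, -0.5, -0.25]),
]

ranked = [title for title, _ in rerank(preference, updates)]
print(ranked)
```

Here "manga-c" points in exactly the opposite direction to the preference vector, so it scores -1 and lands at the bottom of the list.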

This can all be done client-side, as long as you have the weights for the tags available locally.

The polar plots that are used to compare your library to a particular manga also have an interesting implementation:

https://imgur.com/ic13Xow

Just as we can embed data into 2D to visualize it in a plot, we can embed data into 1D to obtain an ordering along a line. You can see in these plots that the tags are arranged so that similar tags sit close together (except for the first and last elements, which end up next to each other at 0 and 360 degrees).
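One way the ordering could be turned into positions on a polar plot is sketched below. The 1D coordinates are invented for illustration, the post does not say how they are computed, and spacing the tags evenly around the circle is an assumption.

```python
# Hypothetical 1D embedding coordinates for four tags.
tag_positions = {"Isekai": 0.9, "Romance": -0.7, "Fantasy": 0.8, "Comedy": -0.5}


def polar_angles(positions):
    """Order tags by their 1D coordinate and space them evenly over 360 degrees."""
    ordered = sorted(positions, key=positions.get)
    step = 360.0 / len(ordered)
    return {tag: index * step for index, tag in enumerate(ordered)}


angles = polar_angles(tag_positions)
for tag, angle in angles.items():
    print(f"{tag}: {angle:.0f} degrees")
```

Neighbors along the line stay neighbors on the circle, but the two ends of the line ("Romance" and "Isekai" in this toy data) also become adjacent where the circle closes, even though they are the least similar pair.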

Thoughts

It was a lot of fun to apply different algorithms to MangaDex data, and I learned a lot while reading all sorts of papers. Small projects like this manga recommendation exploration are an excellent way to stretch and learn. There's plenty of potential future work to do. For example, I've been considering automating analysis pipelines on up-to-date datasets. There are also extensions to the models that would be worthwhile to look at, like adding description embeddings (via GloVe/BERT/GPT-2 vectors) to improve recommendations/break symmetry and implementing rigorous quality measures to compare models.

I expect this work to be of real value mostly to myself, but if you find it exciting or applicable, feel free to suggest ideas/feedback. I'm available on Discord (geospiza#5912) on the tachiyomi and mangadex servers if you want to chat, too.

wcfdwfc · Jan 9, 2023

It's a good idea, but I think it would be excellent to train your model on a database that already exists. It might sound a little unethical but scraping user-submitted MAL and Anilist recommendations for manga would be a step in the right direction.

geospiza · Jan 9, 2023

wcfdwfc said:
It's a good idea, but I think it would be excellent to train your model on a database that already exists. It might sound a little unethical but scraping user-submitted MAL and Anilist recommendations for manga would be a step in the right direction.

Thanks for the feedback! I'm not interested in going this direction -- a model built on the outputs of another model isn't going to be very good (or theoretically sound). And getting to the point of building a collaborative filtering model with all of the user-{anime|manga} data would require a lot more technical planning and breaking of the TOS. Honestly, getting a clean dataset was probably the hardest thing about this whole project.

MangaDex could probably implement a half-decent recommendation system at some point, doing a "you might like this because other people who follow the same manga as you also follow this". They have all the data they need. But user-level data isn't available on the public API, so content-based recommendations are the way to scratch the curiosity itch. With some basic description-based features (and maybe rating/statistics from the API), the recommendations on my site might become up-to-par with similar manga's.

Remocracy · Jan 9, 2023

geospiza said:
Thanks for the feedback! I'm not interested in going this direction -- a model built on the outputs of another model isn't going to be very good (or theoretically sound). And getting to the point of building a collaborative filtering model with all of the user-{anime|manga} data would require a lot more technical planning and breaking of the TOS. Honestly, getting a clean dataset was probably the hardest thing about this whole project.

MangaDex could probably implement a half-decent recommendation system at some point, doing a "you might like this because other people who follow the same manga as you also follow this". They have all the data they need. But user-level data isn't available on the public API, so content-based recommendations are the way to scratch the curiosity itch. With some basic description-based features (and maybe rating/statistics from the API), the recommendations on my site might become up-to-par with similar manga's.

The problem with this is that a lot of people on MangaDex read total trash that they don't wanna recommend

rdn · Jan 10, 2023

geospiza said:
MangaDex could probably implement a half-decent recommendation system at some point, doing a "you might like this because other people who follow the same manga as you also follow this". They have all the data they need. But user-level data isn't available on the public API, so content-based recommendations are the way to scratch the curiosity itch. With some basic description-based features (and maybe rating/statistics from the API), the recommendations on my site might become up-to-par with similar manga's.

Definitely something I have been thinking about since forever, but on the list of things that we need/want on the lower end atm
