Hey all,
Over the winter holidays, I built a website to play around with MangaDex data and recommendation systems. It can be found at manga-recsys.geospiza.me, with source code on GitHub. It takes some inspiration from similar manga used by Neko, particularly around the use of static hosting for serving recommendations. I scraped the manga, groups, and chapter endpoints to try different techniques. You can find consolidated datasets in parquet and ndjson on the data page.
There are three smaller sub-projects that I'd like to call out for their interesting results:
- Association rule mining of tags
- Visualizations of tag embeddings
- Re-ranking recent manga updates using a personal library
Association rule mining of tags
Association rule mining is a data mining technique to learn relationships between entities. The most famous example is the association between "diapers and beer" in grocery store purchases. I ran a rule-mining algorithm (FPGrowth) on the tag data. I built a small searchable table to look at the rules.
Here, we see that "Reincarnation" and "Fantasy" often lead to "Isekai". One potential application is to build up sets of tags that frequently appear to make suggestions in filters. You can play around with this table on the tag-rules page.
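To make support and confidence concrete, here is a minimal sketch that counts tag itemsets by brute force on a made-up tag list. It is not FPGrowth (which finds the same frequent itemsets far more efficiently), and the function name, thresholds, and toy data are assumptions for illustration only.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(tag_sets, min_support=0.3, min_confidence=0.6):
    """Find rules of the form (tag_a, tag_b) -> tag_c by brute-force counting.

    Returns tuples of (antecedent, consequent, support, confidence).
    """
    n = len(tag_sets)
    counts = Counter()
    for tags in tag_sets:
        unique = sorted(set(tags))
        # Count every 2- and 3-tag itemset contained in this manga's tags.
        for size in (2, 3):
            counts.update(combinations(unique, size))

    rules = []
    for itemset, count in counts.items():
        support = count / n
        if len(itemset) != 3 or support < min_support:
            continue
        for consequent in itemset:
            antecedent = tuple(t for t in itemset if t != consequent)
            # confidence = P(consequent | antecedent)
            confidence = count / counts[antecedent]
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, support, confidence))
    return sorted(rules, key=lambda rule: -rule[3])


# Hypothetical tag lists, one per manga.
manga_tags = [
    ["Reincarnation", "Fantasy", "Isekai"],
    ["Reincarnation", "Fantasy", "Isekai", "Comedy"],
    ["Fantasy", "Adventure"],
    ["Reincarnation", "Fantasy", "Isekai"],
    ["Romance", "Comedy"],
]
rules = mine_pair_rules(manga_tags)
for antecedent, consequent, support, confidence in rules:
    print(antecedent, "->", consequent, round(support, 2), round(confidence, 2))
```

Support is the fraction of manga containing all three tags, and confidence is how often the consequent appears given that the antecedent does. Brute-force counting like this only scales to small itemsets, which is where an algorithm like FPGrowth earns its keep.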
Visualization of tag embeddings
The basic principle of a recommendation system is to find items (e.g., manga) that are similar to each other. Several methods exist to build recommendations, but I focus on a system based on tag similarity. Two manga are similar if they share tags. We can build an embedding to represent each manga as a vector; embeddings are geometric spaces that preserve distances from a higher dimensional space. I build embedding models using word2vec, latent semantic indexing, and bipartite networks. After building these, I visualize them in 2D plots.
Play around with the embedding plots here; they have been pre-computed for all tags on each model I've built. The plots are a way to eyeball how well the recommendation models do in clustering manga based on their tags.
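As a baseline for the idea that two manga are similar if they share tags, tag overlap can be measured directly with Jaccard similarity. This is not one of the embedding models named above (word2vec, latent semantic indexing, bipartite networks), just a simple point of comparison; the function name and tag lists are made up for illustration.

```python
def jaccard(tags_a, tags_b):
    """Tag overlap between two manga: |A ∩ B| / |A ∪ B|, between 0 and 1."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # convention chosen here: two untagged manga are not similar
    return len(a & b) / len(a | b)


# Hypothetical tag lists.
print(jaccard(["Isekai", "Fantasy"], ["Fantasy", "Romance"]))  # one shared tag out of three
print(jaccard(["Isekai", "Fantasy"], ["Isekai", "Fantasy"]))   # identical tag sets
```

Unlike raw overlap, an embedding can place two manga close together even when they share no tags, as long as their tags tend to co-occur elsewhere in the dataset.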
Re-ranking recent manga updates using a personal library
The last thing I'd like to present is a proof-of-concept application of the tag embedding models. I use the word2vec model to build a client-side re-ranking system for recent updates. You can add manga that you like to your personal library, and it will build a "preference vector" that represents your preferences.
We use this preference vector to sort the last 100 manga updates, so that the manga most similar to your library are shown at the top. This works well if you have particular tastes (isekai, anyone?), and is simple to implement. You can play with the proof-of-concept application here, which updates at most once every 5 minutes, and saves manga to your local storage.
I've pre-computed the word2vec vector for each tag, and these vectors are served to clients. Each tag vector is a 16-element float array. On the client, we represent a manga as the average of its tag vectors. We represent a client's personal library as the average of its manga vectors.
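The averaging described above can be sketched as follows. The real tag vectors have 16 elements; the 3-element vectors, the `TAG_VECTORS` table, and the function names here are hypothetical stand-ins, and skipping tags without a vector is an assumption rather than something the post specifies.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    vectors = list(vectors)
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def manga_vector(tags, tag_vectors):
    """A manga is the average of its tag vectors (unknown tags are skipped)."""
    return mean_vector(tag_vectors[t] for t in tags if t in tag_vectors)


def library_vector(library, tag_vectors):
    """A library (preference vector) is the average of its manga vectors."""
    return mean_vector(manga_vector(tags, tag_vectors) for tags in library)


# Hypothetical 3-element tag vectors standing in for the 16-element ones.
TAG_VECTORS = {
    "Isekai": [1.0, 0.0, 0.0],
    "Fantasy": [0.0, 1.0, 0.0],
    "Romance": [0.0, 0.0, 1.0],
}
library = [["Isekai", "Fantasy"], ["Fantasy"]]
preference = library_vector(library, TAG_VECTORS)
print(preference)
```

Because the library is an average of averages, each manga counts equally regardless of how many tags it has.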
We compute a vector for each manga using the tags in the manga listing response. We then score each manga against the preference vector using cosine similarity (the dot product of the two vectors divided by the product of their L2 norms). We sort by descending similarity, which ranges from 1 (same direction) to -1 (opposite direction).
This can all be done client-side, as long as you have the weights for the tags available locally.
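A minimal sketch of the scoring and sorting step, written in Python for readability even though the real application runs in the browser. The function names, titles, and vectors are made up for illustration, and returning 0 for a zero-length vector is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms; between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rerank(preference, updates):
    """Sort (title, vector) pairs by descending similarity to the preference vector."""
    return sorted(
        updates,
        key=lambda item: cosine_similarity(preference, item[1]),
        reverse=True,
    )


# Hypothetical preference vector and recent updates.
preference = [0.25, 0.75, 0.0]
updates = [
    ("A", [1.0, 0.0, 0.0]),
    ("B", [0.0, 1.0, 0.0]),
    ("C", [0.0, 0.0, 1.0]),
    ("D", [0.0, -1.0, 0.0]),
]
for title, vector in rerank(preference, updates):
    print(title, round(cosine_similarity(preference, vector), 3))
```

Cosine similarity ignores vector length, so a manga with many tags is not favored over one with few; only the direction of the averaged vector matters.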
The polar plots that are used to compare your library to a particular manga also have an interesting implementation:
Just as we can embed data into 2D to visualize it in a plot, we can embed data into 1D to obtain an ordering on a line. You can see in these plots that the tags are arranged so that similar tags sit close together (except for the first and last elements, which meet at 0 and 360 degrees).
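One way to turn a 1D embedding into a polar axis is to sort the tags by their coordinate and space them evenly around the circle. The sketch below assumes the 1D coordinates already exist; the coordinate values and the function name are hypothetical, and how the coordinates are produced is left open.

```python
def polar_angles(tag_positions):
    """Map {tag: 1D coordinate} to {tag: angle in degrees}, evenly spaced by rank."""
    if not tag_positions:
        return {}
    ordered = sorted(tag_positions, key=tag_positions.get)
    step = 360.0 / len(ordered)
    return {tag: i * step for i, tag in enumerate(ordered)}


# Hypothetical 1D embedding coordinates for four tags.
positions = {"Romance": 0.8, "Isekai": -1.2, "Drama": 1.1, "Fantasy": -0.9}
angles = polar_angles(positions)
print(angles)
```

The seam is visible here: the tags at either end of the line ("Isekai" and "Drama" in this toy data) end up as neighbors across the 0/360 degree boundary even though they are the farthest apart in the embedding.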
Thoughts
It was a lot of fun to apply different algorithms to MangaDex data, and I learned a lot while reading all sorts of papers. Small projects like this manga recommendation exploration are an excellent way to stretch and learn. There's plenty of potential future work to do. For example, I've been considering automating analysis pipelines on up-to-date datasets. There are also extensions to the models that would be worthwhile to look at, like adding description embeddings (via GloVe/BERT/GPT-2 vectors) to improve recommendations and break symmetry, and implementing rigorous quality measures to compare models.
I expect this work to be of real value mostly to myself, but if you find it exciting or applicable, feel free to suggest ideas or give feedback. I'm also available on Discord (geospiza#5912) on the tachiyomi and mangadex servers if you want to chat.