MangaDex lets each series have multiple titles/alt-titles. At some point, we also added the ability to mark which language each alt-title is in. The problem is that titles added before that point had no language data, so they were all marked as English, which is not very cool.
I did what anyone would do and ran all titles in the database through a Unicode character script classifier, then mapped each script to a list of languages that use it and labeled every title with the list of languages it is likely to belong to, based on the % of its characters in each script. There are many languages which use the Latin script, so this method is kinda trash for those (we'll possibly get back to that in part two), but I believe we can fix a significant portion of the titles in other scripts with this. I've filtered all titles whose current language does not match their script and put them in a Google Doc so that people can verify them and contribute to the effort. If you've verified/edited a set of titles, comment the row numbers here and I will periodically update the doc.
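For anyone curious what that kind of classification looks like, here is a minimal sketch in Python. It is not the actual script that was run: the script detection is approximated from the first word of each character's Unicode name (the standard library has no direct script lookup), and the `SCRIPT_LANGUAGES` table, the `min_share` threshold, and all function names are made up for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical script -> candidate languages table (illustrative only,
# not the real mapping used for the doc).
SCRIPT_LANGUAGES = {
    "HIRAGANA": ["ja"],
    "KATAKANA": ["ja"],
    "CJK": ["ja", "zh"],
    "HANGUL": ["ko"],
    "CYRILLIC": ["ru", "uk", "bg"],
    "THAI": ["th"],
    "ARABIC": ["ar", "fa"],
    "LATIN": ["en", "es", "fr", "pt", "id", "vi"],
}


def char_script(ch):
    """Approximate a character's script from the first word of its Unicode name.

    Returns None for non-letters (digits, spaces, punctuation) and for
    scripts that are not in SCRIPT_LANGUAGES.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    first_word = name.split(" ")[0] if name else ""
    return first_word if first_word in SCRIPT_LANGUAGES else None


def script_shares(title):
    """Return {script: fraction of classified letters}, most common first."""
    counts = Counter(s for s in map(char_script, title) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.most_common()}


def likely_languages(title, min_share=0.3):
    """List candidate languages for every script above the (assumed) threshold."""
    langs = []
    for script, share in script_shares(title).items():
        if share >= min_share:
            for lang in SCRIPT_LANGUAGES[script]:
                if lang not in langs:
                    langs.append(lang)
    return langs


def needs_review(title, current_lang):
    """True when the stored language is not among the script-based candidates."""
    langs = likely_languages(title)
    return bool(langs) and current_lang not in langs


if __name__ == "__main__":
    for title, lang in [("ワンピース", "en"), ("One Piece", "en"), ("한국어", "en")]:
        print(title, lang, likely_languages(title), needs_review(title, lang))
```

A title marked as English but written in katakana or Hangul gets flagged, while anything in Latin script passes through untouched, which is exactly why this approach can't tell Latin-script languages apart.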