Waiting for User Title Language Rectification Project - Part One

Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
MangaDex lets each series have multiple titles/alt-titles. At some point, we also added the ability to mark which language each alt-title is in. The problem is that we had no language data for any title added before that point, so they were all marked as English, which is not very cool.

I did what anyone would do and ran all titles in the database through a Unicode script classifier, then mapped each script to a list of languages that use it and labeled every title with the languages it is likely to belong to, based on the percentage of its characters in each script. Many languages use the Latin script, so this method is kinda trash for those (we'll possibly get back to that in part two), but I believe we can fix a significant portion of titles in other scripts with this. I've filtered all titles whose current language does not match their script and put them in a Google Doc so that people can verify them and contribute to the effort. If you've verified/edited a set of titles, comment the row numbers here and I will periodically update the doc.
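For anyone curious, here is a rough sketch of what that kind of classification can look like. It is not the actual script that was run: it assumes the first word of each character's Unicode name is a good enough stand-in for its script, and `SCRIPT_LANGUAGES`, `script_shares`, `candidate_languages` and the `min_share` threshold are all made-up names and values for illustration.

```python
import unicodedata
from collections import Counter

# Hypothetical script -> candidate languages table (illustrative, far from complete).
SCRIPT_LANGUAGES = {
    "HIRAGANA": ["ja"],
    "KATAKANA": ["ja"],
    "KATAKANA-HIRAGANA": ["ja"],
    "HANGUL": ["ko"],
    "CJK": ["zh", "ja"],
    "CYRILLIC": ["ru", "uk", "bg"],
    "THAI": ["th"],
    "LATIN": ["en", "es", "fr", "pt", "id", "vi"],
}


def script_shares(title):
    """Return {script: fraction of the title's letters in that script}."""
    counts = Counter()
    for ch in title:
        if not ch.isalpha():  # skip digits, spaces and punctuation
            continue
        name = unicodedata.name(ch, "")
        if name:
            # Assumption: the first word of the Unicode name approximates the script,
            # e.g. "HANGUL SYLLABLE GA" -> "HANGUL".
            counts[name.split()[0]] += 1
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()} if total else {}


def candidate_languages(title, min_share=0.2):
    """List the languages of every script that makes up at least min_share of the title."""
    ranked = sorted(script_shares(title).items(), key=lambda kv: -kv[1])
    langs = []
    for script, share in ranked:
        if share < min_share:
            continue
        for lang in SCRIPT_LANGUAGES.get(script, []):
            if lang not in langs:
                langs.append(lang)
    return langs
```

A title whose stored language is not in `candidate_languages(title)` would then be a candidate row for the doc. As noted above, this says very little for Latin-script titles.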
 
Contributor
Joined
Jan 8, 2023
Messages
922
But the main story's title is: "I'm Doomed if It Can't Be You"



Idk if it's what you want. I hope it helps a little.
 
Contributor
Joined
Jan 8, 2023
Messages
922
I have a question:

Do we need to write the title only once, without the language tag, or do we need to add it again among the alternative titles?

An example:


Here the English title is not in the alternative titles list. Doesn't it need to be marked as an English title?
 
Contributor
Joined
Jan 8, 2023
Messages
922

純情ドッロプ => Dorropu not cider ??!

 
Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
I have a question:

Do we need to write the title only once, without the language tag, or do we need to add it again among the alternative titles?

An example:


Here the English title is not in the alternative titles list. Doesn't it need to be marked as an English title?
If it's already in the main title, there's no need to repeat it on the alt titles.

Thanks for the contributions, I have edited the doc to filter out the verified entries.
 
Contributor
Joined
Jan 8, 2023
Messages
922






 
Contributor
Joined
Jan 8, 2023
Messages
922



 
Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
I don't understand this entry; there was just a main title...
I included both main titles and alt titles in the doc since main titles also have a language attribute, though I now realize the front-end does not have support for editing the lang on those. Keep mentioning them and I'll edit them through the mod tools.

0DgQrHe.png
 
Contributor
Joined
Jan 8, 2023
Messages
922
I now realize the front-end does not have support for editing the lang on those
Ooooooh. I need to review all the main titles to make sure...wait I can't review the language tag. :pepehmm:
 
Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
You can check them in the API, but I don't really expect people to do that, so let's ignore that for now :nyoron:
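If someone does want to look, a small sketch of pulling the language tags out of a series response could go like this. The payload shape (`data.attributes.title` as a language-to-title mapping, `data.attributes.altTitles` as a list of such mappings) is an assumption here, and `title_languages` and `sample` are made-up names with made-up data.

```python
def title_languages(manga_json):
    """Return (kind, language, title) rows for the main title and every alt title."""
    # Assumed shape: {"data": {"attributes": {"title": {...}, "altTitles": [{...}, ...]}}}
    attrs = manga_json["data"]["attributes"]
    rows = [("main", lang, text) for lang, text in attrs.get("title", {}).items()]
    for alt in attrs.get("altTitles", []):
        rows.extend(("alt", lang, text) for lang, text in alt.items())
    return rows


# Hypothetical payload, not a real series.
sample = {
    "data": {
        "attributes": {
            "title": {"en": "Example Title"},
            "altTitles": [{"en": "\u4f8b\u306e\u30bf\u30a4\u30c8\u30eb"}, {"ko": "\uc608\uc2dc"}],
        }
    }
}

for kind, lang, text in title_languages(sample):
    print(kind, lang, text)
```

In the made-up sample, the first alt title is tagged `en` but written in Japanese script, which is the kind of mismatch the doc is listing.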
 
Contributor
Joined
Jan 8, 2023
Messages
922
216 -> 246 are all good now.

id 223 (in the Google Sheet):
01432726-c13b-4f6f-a53d-98e911faec6a
There is no title at this URL; maybe it's this one?

id 242: 015d7710-3924-465a-b302-6731773e9ed2 needs the Uzbek language

I think it's easier for me to do it like this; if that doesn't work for you, I can change it.
 
Contributor
Joined
Jan 8, 2023
Messages
922
Btw, I have a little suggestion:

Maybe you can code something like this.

If you have:
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..* AND 'LATIN': 1..*}) => It's Japanese
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..*}) => It's Japanese
  • Counter({'HANGUL': 1..*}) => It's Korean
  • Counter({'HANGUL': 1..* AND 'LATIN': 1..*}) => It's Korean
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..* AND 'HANGUL': 1..*}) => ERROR
I think it can help a lot with the work.
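In Python, the rules above could be sketched roughly like this. It assumes the input is a `Counter` of script names like the ones shown in the doc; `JAPANESE_SCRIPTS` and `guess_language` are names made up for the example, and any script the rules don't mention is left undecided.

```python
from collections import Counter

# The three kana-related script names are counted together as "Japanese".
JAPANESE_SCRIPTS = {"HIRAGANA", "KATAKANA", "KATAKANA-HIRAGANA"}


def guess_language(script_counts):
    """Return 'ja', 'ko', 'ERROR', or None when the rules above don't apply."""
    present = {script for script, n in script_counts.items() if n > 0}
    has_kana = bool(present & JAPANESE_SCRIPTS)
    has_hangul = "HANGUL" in present

    if has_kana and has_hangul:
        return "ERROR"  # kana and hangul in one title: needs a human
    # Anything besides kana, hangul and Latin is outside the listed rules.
    if present - JAPANESE_SCRIPTS - {"HANGUL", "LATIN"}:
        return None
    if has_kana:
        return "ja"  # kana, with or without Latin
    if has_hangul:
        return "ko"  # hangul, with or without Latin
    return None  # e.g. Latin only


print(guess_language(Counter({"KATAKANA": 3, "LATIN": 2})))
print(guess_language(Counter({"HIRAGANA": 1, "HANGUL": 1})))
```

Titles mixing kana with CJK ideographs are not covered by the list, so this sketch returns `None` for them rather than guessing.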
 
Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
Btw, I have a little suggestion:

Maybe you can code something like this.

If you have:
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..* AND 'LATIN': 1..*}) => It's Japanese
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..*}) => It's Japanese
  • Counter({'HANGUL': 1..*}) => It's Korean
  • Counter({'HANGUL': 1..* AND 'LATIN': 1..*}) => It's Korean
  • Counter({'KATAKANA': 1..* AND/OR 'HIRAGANA': 1..* AND/OR 'KATAKANA-HIRAGANA': 1..* AND 'HANGUL': 1..*}) => ERROR
I think it can help a lot with the work.
Yeah, I guess I could count all the Japanese scripts together, for example, to try and make it more accurate. I'd also like to re-run an updated version to drop the titles which have already been changed in the meantime and maybe make this more digestible, but for now I've gotta wait until database access is available again.
 
Contributor
Joined
Jan 8, 2023
Messages
922
Any update?

I think you can correct a lot of errors.

Also, it would be good to separate out the series that don't have a title in their original language.
 
Watermelon Consumer
Staff
Developer
Joined
Jan 18, 2018
Messages
169
No change on this front for now unfortunately.
 
