Integrate Git-like functionality for chapter release management and potential data mining (LakeFS).

night_ · Jan 20, 2023

There has apparently been some past drama involving rogue members of TL groups defacing chapters on the mainsite, which resulted in the current system for managing releases being somewhat lacking. This apparently also led groups to resort to sharing one account with upload permissions, which is not a good security practice, to say the least.

One solution that I think might be applied here (mostly in the long-term future) is to provide groups with capabilities similar to those of a simplified Git repository - only for files. Each group member (or even any non-member) would be able to commit changes to a separate branch before pushing a merge request to the group admins, who would then review the changes before merging them into the actual master release displayed on the mainsite.
In the event that a mistake is made on any branch or someone goes rogue, there would be a version control system that allows reverting to any of the previous commits.
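To make the branch, review, merge, and revert flow described above a bit more concrete, here is a minimal in-memory sketch. It is only an illustration of the idea, not how MD or LakeFS actually works: the `ChapterRepo` class, its method names, and the file names are all hypothetical, and a "merge" here simply replaces the release with the branch snapshot rather than doing a real three-way merge.

```python
from dataclasses import dataclass, field


@dataclass
class ChapterRepo:
    """Toy model of a per-chapter repository (all names are hypothetical)."""

    # Every entry in master is a full snapshot: {filename: content}.
    master: list = field(default_factory=lambda: [{}])
    branches: dict = field(default_factory=dict)

    def create_branch(self, name):
        # A branch starts as a copy of the current release snapshot.
        self.branches[name] = dict(self.master[-1])

    def commit(self, branch, filename, content):
        # Contributors only ever write to their own branch.
        self.branches[branch][filename] = content

    def merge(self, branch, approved_by_admin):
        # Only an approved merge request reaches the public release.
        if not approved_by_admin:
            raise PermissionError("merge request was not approved")
        self.master.append(dict(self.branches.pop(branch)))

    def revert(self, commit_index):
        # Reverting appends a copy of an old snapshot, so history is kept.
        self.master.append(dict(self.master[commit_index]))

    def released(self):
        return self.master[-1]


repo = ChapterRepo()
repo.create_branch("typesetter-fix")
repo.commit("typesetter-fix", "page01.png", "v1")
repo.merge("typesetter-fix", approved_by_admin=True)

repo.create_branch("rogue-edit")
repo.commit("rogue-edit", "page01.png", "defaced")
repo.merge("rogue-edit", approved_by_admin=True)  # slipped through review
repo.revert(1)  # go back to the last good commit
print(repo.released())
```

The point of the sketch is that nobody but the admins needs write access to the release itself, so sharing one upload account would no longer be necessary.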

One existing tool that could be integrated as part of such a solution is LakeFS. Although it seems primarily tailored for data science, it basically appears to function as an overlay for any existing object storage system (like S3) while providing the aforementioned Git-like functionality for stored files. Seeing that MD does use Ceph as its object storage system, it should be possible to run LakeFS in parallel by utilizing Ceph's S3-compatible API.
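As a rough illustration of what "Git-like on top of object storage" means in practice: as far as I understand it, LakeFS exposes a repository as an S3 bucket and puts the branch name at the front of the object key, and it also has its own `lakefs://` URI form. The helpers below just build those addresses. The repository, branch, and file names are made up, and whether Ceph's S3-compatible endpoint works as the backing store is an assumption that would need to be verified.

```python
def lakefs_s3_address(repo, branch, path):
    """Build the bucket/key pair for an object as seen through an S3-style API.

    Assumes the repository maps to the bucket and the branch is the key prefix.
    """
    return {"Bucket": repo, "Key": f"{branch}/{path.lstrip('/')}"}


def lakefs_uri(repo, ref, path):
    """Build a lakefs:// URI for the same object on a given branch or commit."""
    return f"lakefs://{repo}/{ref}/{path.lstrip('/')}"


# Hypothetical names: one repository per group, one branch per contributor.
draft = lakefs_s3_address("chapter-releases", "fix-typos", "ch012/page01.png")
live = lakefs_s3_address("chapter-releases", "main", "ch012/page01.png")
print(draft)
print(live)
print(lakefs_uri("chapter-releases", "fix-typos", "/ch012/page01.png"))
```

The same file path exists independently on each branch, which is what would let a contributor's draft sit next to the released version until a merge.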

As a by-product, it might also be possible to eventually leverage LakeFS as an actual data mining tool for tricky things like harvesting views (as Panda mentioned on the latest Reddit AMA) or any other data sciencey stuff anyone might fancy at some point.

Now just to make it explicitly clear: It's obvious that adding such complexity to the site ~~might be bats*it insane~~ will require resources that MD currently might not be able to spare and/or afford, not to mention the additional integration and dev time required to implement the whole feature in any usable form within the mainsite app.
That's why I am only suggesting this as a long-term feature of sorts, which, if deemed viable, would only become a thing sometime in the non-immediate future.

tristan9 · Jan 20, 2023

> That's why I am only suggesting this as a long-term feature of sorts, which, if deemed viable, would only become a thing sometime in the non-immediate future.

That is definitely a cool idea, and something to consider perhaps, but it does sound somewhat unlikely we'll be able to do it to be completely honest.

On one hand, we would likely not use LakeFS mostly because we don't use RGW for image storage (we use CephFS for those) even if we use it for other internal concerns (logs, metrics, and embed images cache). One problem with RGW is that it's honestly not that fast and makes you lose out on the filesystem RAM cache, alongside needing userland HTTP requests when CephFS has kernel support, so its network traffic is much more optimized than we ever could do given unlimited time.

On the other hand, we only soft-delete a lot of things, which includes pre-edit chapters, and chapter deletions. There are many reasons for it, like being able to review rejected/deleted chapters when people come lie to our face about how they broke the rules, or in a less malicious way because our experience shows that a lot of groups just don't keep backups of their work at all, so when they come knocking at our door to be reinstated or disband, undeleting their chapters would otherwise often involve scraping terribly compressed aggregators.

Either way, we could take advantage of this soft-deletion approach for it, but we would likely want to keep agency over how much retention we apply, and how many edits back we archive, so if we did it, we'd likely do it without any commitment to how much and how long we archive historical data. And maybe not even make it visible to users should we be concerned about it somehow.
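For illustration only, the soft-deletion-plus-retention idea could look something like the sketch below. This is not MD's actual schema or code: `ChapterVersion`, `soft_delete`, `prune`, and the `max_archived` / `max_age` knobs are all hypothetical names standing in for "how many edits back" and "how long" a site chooses to keep.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ChapterVersion:
    content: str
    created_at: datetime
    deleted_at: Optional[datetime] = None  # set on deletion instead of removing the record


def soft_delete(version, now):
    # The record stays around so staff can still review or restore it.
    version.deleted_at = now


def prune(history, now, max_archived, max_age):
    """Keep all live versions, plus the newest soft-deleted ones within policy."""
    live = [v for v in history if v.deleted_at is None]
    archived = sorted(
        (v for v in history if v.deleted_at is not None),
        key=lambda v: v.deleted_at,
        reverse=True,
    )
    kept = [v for v in archived[:max_archived] if now - v.deleted_at <= max_age]
    return sorted(live + kept, key=lambda v: v.created_at)


now = datetime(2023, 1, 21)
history = [
    ChapterVersion("edit 1", datetime(2022, 1, 1)),
    ChapterVersion("edit 2", datetime(2023, 1, 5)),
    ChapterVersion("edit 3", datetime(2023, 1, 10)),
    ChapterVersion("edit 4", datetime(2023, 1, 15)),
]
soft_delete(history[0], datetime(2022, 6, 1))
soft_delete(history[1], datetime(2023, 1, 10))
soft_delete(history[2], datetime(2023, 1, 15))

remaining = prune(history, now, max_archived=2, max_age=timedelta(days=30))
print([v.content for v in remaining])
```

Because both limits are plain parameters, the site keeps full agency over retention and can tighten or loosen it without making any promise to users.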

But yeah, overall we do want to move towards something more wiki-like for the edit history of things. It's just another lot of data to keep track of

night_ · Jan 21, 2023

One other idea that I had in mind for the git-like approach was more orientated towards collaboration, in which users would have a more open way of sharing resources either within groups or between them.

In one such approach, images could be stored on their own branch/dataset, while all text containing TLs and their respective notes would be saved as separate metadata that could then be overlaid (or encoded) on top of images from the said branch/dataset before getting pushed for release.

Something like this could, for example, allow one group to maintain a single branch of HQ clean raws, which another non-English group could then fork for their own localized TLs. Another example is to use something like this for resolving potential scanlation drama disputes by simply forking and modifying the text instead of having to go through the hassle of creating varying image files from scratch.
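A minimal sketch of the separate text-metadata idea, assuming the text is stored as positioned boxes that reference pages in a shared image set. Every name here (`TextBox`, `TranslationLayer`, `fork`, `release_manifest`, the image set identifier) is hypothetical, and actually drawing the text onto the images is left out.

```python
from collections import defaultdict
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TextBox:
    page: str  # filename of the cleaned page this box belongs to
    x: int     # top-left corner of the box, in pixels
    y: int
    text: str


@dataclass(frozen=True)
class TranslationLayer:
    image_set: str  # identifier of the shared image branch/dataset
    language: str
    boxes: tuple


def fork(layer, language, translate):
    """Create a new layer over the same images with every text replaced."""
    new_boxes = tuple(replace(b, text=translate(b.text)) for b in layer.boxes)
    return TranslationLayer(layer.image_set, language, new_boxes)


def release_manifest(layer):
    """Group boxes per page, ready to be overlaid on the images at release time."""
    pages = defaultdict(list)
    for box in layer.boxes:
        pages[box.page].append((box.x, box.y, box.text))
    return dict(pages)


english = TranslationLayer(
    image_set="clean-pages/ch012",
    language="en",
    boxes=(
        TextBox("page01.png", 40, 60, "Good morning!"),
        TextBox("page01.png", 300, 420, "You're late."),
    ),
)

# A second group forks only the text; the images are shared, not copied.
lookup = {"Good morning!": "Bonjour !", "You're late.": "Tu es en retard."}
french = fork(english, "fr", lambda text: lookup[text])
print(release_manifest(french))
```

Since a fork only touches the text layer, two groups disagreeing over a translation would differ by a few strings rather than by two full sets of image files.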

Seeing that there are optimization concerns with RGW though, keeping all the images tracked using a wiki-like system that stores on CephFS or an RBD would be a better idea. However, does the separate text-metadata approach sound technically viable in your opinion from the perspective of either approach (git/wiki)?

As for agency, with both the git-like and wiki-like approach you could catalog milestones as their own snapshots that would serve as an incremental baseline for further edits, while older ones would ultimately become archived, deprecated over time, and eventually either deleted or merged according to your retention policy.

tristan9 · Jan 21, 2023

That sounds quite similar to https://forums.mangadex.org/threads/reader-ui-for-scanlator-notes.1071771/ in general (just for the pre-publication process rather than as publication notes). While not necessarily a bad idea, it's still quite a lot of complexity though.

So uh... not fundamentally opposed to it, but we won't have the time for seriously working on this anytime soon, pretty much.

BraveDude8 · Jan 21, 2023

Anything involving hosting raws on MD isn't going to happen, fyi. We've made a very deliberate choice to not allow that, for a variety of reasons.

Leaving that aside, this is a massively complicated system that you'd have to persuade people to include in their existing workflows and I don't think it'd get used much. You're also assuming scanlators would want to upload all of their layered files for other people to use, and this concept as a whole is more like something I'd expect from Webtoons than a scanlation site.

night_ · Jan 21, 2023

While I did acknowledge the complexity of the whole thing on the backend side of things, the idea was more towards simplifying the concept behind a Git repository for the lay scanlator who's not versed with CLIs, particularly by using a tailored graphical frontend of sorts that would also work with some form of version control as an alternative to ACLs, with the option of rolling back when needed. However, if it's more complex than beneficial, I suppose the more monolithic wiki-like approach Tristan9 suggested could work better.

In the same regard, I wasn't suggesting hosting whole JP raws, rather only clean and/or redrawn raws that already had all the text removed, in a static and non-layered format (e.g. not whole PSDs, rather just images). Mainly so that more than one person/group could focus on the various TLs in the meanwhile, but also to enable collaboration in an effort to produce better TL quality.
However, I do understand if that's also problematic and not possible with MD's current policy.

As for sharing in my other idea: what I assumed was that introducing a workflow similar to that of FOSS projects, which leverage public volunteer contributions much like scanlators do, could result in a methodology where groups could both give and receive help more easily (be it in resources or skill). If that eventually turned into a platform (like GitHub), it could potentially alleviate their recurring need to beg on credit pages or have people run circles around Discord servers in order to try and offer their help.

Either way, as I mentioned before, I can see why anyone would be averse to attempting to pull off something like this, which is why I only made this suggestion as a broad idea with specific examples that may later help formulate a guideline for a system that someone may or may not fancy in the long-term future.

tristan9 · Jan 21, 2023

To be clear, this remains at the very least an interesting idea that we might eventually explore. The main hurdle, besides figuring out workflow/ux/etc details (which would be a huge task on its own tho), is just that we're already stretched quite thin for ongoing development, so a complex system like that isn't something we'll have much time for in the immediate future.

But maybe in a year, we can all see where MD is, and think again about it in some form
