Super resolution model with simple FastAPI serving

For those of you who might be interested, I just wanted to let you know that I’ve given this project some love over the week :slight_smile: I’m working on improving the model so that it upscales to 4x instead, and I’ve added a Gaussian noise layer to make the model more robust, among other things. I’ll likely push the changes sometime this weekend :slight_smile:
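For anyone curious, the idea is roughly this (a simplified Keras-style sketch, not the exact PerceptiLabs graph - the stddev, filter counts and pixel-shuffle head are just placeholders):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sr_head(input_shape=(56, 56, 3), noise_std=0.1):
    """Toy 4x super-resolution head with input noise for robustness."""
    inputs = layers.Input(shape=input_shape)
    # GaussianNoise adds zero-mean noise during training only and is a no-op
    # at inference, which is why it can stand in for dropout here.
    x = layers.GaussianNoise(noise_std)(inputs)
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(3 * 16, 3, padding="same")(x)  # 16 = 4x4 upscale factor
    # Pixel-shuffle: rearrange channels into a 4x larger spatial grid.
    outputs = layers.Lambda(lambda t: tf.nn.depth_to_space(t, 4))(x)
    return tf.keras.Model(inputs, outputs)
```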


I think the use of Gaussian noise in preference to dropout is a very practical idea - I’d love to know how much difference it makes - and how adding noise works in this use case…

Any chance you can share the PerceptiLabs model.json too?

Can I just share the JSON file found in the PerceptiLabs model folder? Or would you need the .csv and data files too?

model.json alone would be a start, but CSV + datafiles would make a turnkey demo :wink:


Pushed the changes now and also added releases with the PerceptiLabs model.json and the training dataset :slight_smile:


Is there a way to download the dataset in one go? The “downgit” approach applied to “https://github.com/birdstream/Super_Resolution/releases/tag/dataset” gave a “Server failure or wrong URL” error…

Hmm, I’ve never used downgit but I guess maybe that only works for the repo and not for releases :thinking:

My initial thought was to just push the files to the repo, but I guess not everyone is keen on pulling a repo that just jumped up to over 2GB :sweat_smile:

2GB - ha! - a mere bagatelle in these days of multi-terabyte disks! Mention it not! But you’re right, not everyone would want all that. Oh well, I suppose the mouse could do with the exercise :slight_smile:

I wasn’t clear about what was in the different “source code” items… are the zip and tar.gz the same, just packaged differently?

Tbh I don’t really know, because the “source code” archives are something GitHub puts there automatically :sweat_smile: It’s the zip you should go for.


Hi @birdstream

Well, I finally downloaded the releases from GitHub - datasets and PL model - and extracted them. Many thanks - though I would appreciate your input/help on the image size issue below.

Info for others: don’t be confused by seeing the same file used as both input and target in the CSV - when you open the PL model itself you will see that it generates the low-res versions to learn from by down-sampling/rescaling to 1/4 size, to support the target 4x up-sampling.
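In plain TensorFlow, that on-the-fly pairing could look roughly like this (the bicubic interpolation is an assumption on my part; the PerceptiLabs rescale component may use something else):

```python
import tensorflow as tf

def make_pair(hires, scale=4):
    """Build a (low-res input, high-res target) pair from a single image.

    The same file serves as both input and target: the input is just the
    target down-sampled to 1/scale of its size.
    """
    hires = tf.image.convert_image_dtype(hires, tf.float32)
    h, w = tf.shape(hires)[0], tf.shape(hires)[1]
    # Interpolation method is an assumption, not taken from the model.json.
    lowres = tf.image.resize(hires, (h // scale, w // scale), method="bicubic")
    return lowres, hires
```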

Issue 1: For info, the CSV file refers to images in a “cropped” sub-folder, whereas the archives place the images in a “faces” folder. Workaround: a simple global search & replace in the CSV file should take care of the file placement.
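If you’d rather do the replacement programmatically, a few lines of pandas work too (the filename and column names here are guesses - adjust them to whatever the actual CSV uses):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical filename
for col in ("input", "target"):  # assumed column names
    df[col] = df[col].str.replace("cropped/", "faces/", regex=False)
df.to_csv("data.csv", index=False)
```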

Issue 2: the model seems to be expecting images of 224 x 224, but the images in the dataset are 1024 x 1024. Does it matter? (I seem to recall some discussion between you and @robertl on this or a related topic.) If it does matter, how would you recommend dealing with the difference in image sizes?

Thx!

The answer is simple: I accidentally uploaded the original dataset :man_facepalming::man_facepalming: Sorry for that, I’ll fix it when I get back home :rofl:

Ah ok, I was also going to say I had discovered 112k rows in the CSV but only 28k images… and no _00x :slight_smile: All clear now!

Sorry for the delay, but it’s up now. Hopefully it’s right this time :sweat_smile:


7zip says that training_data.z06 is missing… :slight_smile:

Noo hahah you have to be joking, right? :see_no_evil::sweat_smile:

Would you mind trying another zip program? There is no .z06, and p7zip (Ubuntu) opens and extracts the files just fine :thinking:

TL;DR I have sorted it. Mea culpa, Joakim!

The masochists in the audience may enjoy the longer, more painful version; the technical elite may be content with smirks of smug superiority at what follows.

Please bear in mind that I intended to repeat exactly what I had done previously with the .z01 to .z24 files of the initial release - downloaded everything in the release (remember that, it will be important!), selected them all, right-clicked and extracted with 7zip from the context menu. No problems: I got 28k files… 28k good files, even if they weren’t the right files. It follows, logically, that on at least one occasion, I knew what I was doing - and despite the flaws of the inductive method, maybe, just maybe, I was not doing anything wrong this time either.

Anyway, when there was a new release to process I deleted the first 24 zips. Am I sure I did that? Yes… because if I hadn’t done that, the new downloads would have prompted me to overwrite or rename, and that did not happen, so I am sure I had the right files.

Except <sigh> I didn’t download everything in the release… and I might not have downloaded training_data.zip, which I think is in fact index 0 of a set of split archive files - in the zip files there is a Volume column… and looking in the (latest) zip, the files are in volumes 0-5. That might have confused 7zip - I selected only the numbered .zXX files, and it might have inferred .zip for volume 0 and got rather confused when it wasn’t the right .zip… though I would have expected the error behaviour to be different. i.e. Yes! I did have the right files! Alas, not all of them :man_facepalming:

The gory details of the 7zip struggle were: not only did 7zip complain about .z06 being missing, there were also header errors (5,760 files) and ~22k “unavailable data” errors, for a total of 27,579 errored files. 421 files were however extracted, but they were 1k x 1k files, IIRC.

But, rather than assume I had the moral high ground, I did try what @birdstream suggested.

And the lesson is, when tech support ask you to check your device is plugged in and switched on, just go with the flow… regardless of certainty, one day you will find that the cleaner unplugged it for you.

Anyway. First I installed PeaZip; that couldn’t handle the archive format. So I then installed Zipware, and that also complained about errors… but, remember the wrong zip file? That was in play there too.

Finally, I fired up an Ubuntu VM, installed p7zip to be just like @birdstream’s system, and had a look at the files that I downloaded again from GitHub - they were fine!

So I re-downloaded for Windows - but this time I took the .zip and the .zXX files, and lo and behold, there were all the cropped images.


Lovely read :rofl:
Glad to hear you sorted things out :slight_smile:

Btw, I would expect that training this model to produce sharp upscaled images may take hours if not days on a consumer-grade GPU… which is a bit of a pain right now because of the slow-down issue… If you decide to train it, please let me know how it goes for you, as your rig is probably a little more powerful than mine :sweat_smile:

Edit: btw, you probably want to train the model with noise for a couple of epochs, then decrease it, run some more, decrease it and so on until there is practically no noise left… at least that’s what I think right now. Exploring that path is taking some time though :sweat_smile:
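Outside PerceptiLabs, one way to wire up that kind of schedule in plain Keras is a noise layer whose stddev lives in a variable, plus a callback that lowers it as training progresses (the decay interval and factor below are placeholders, not tuned values):

```python
import tensorflow as tf

class AnnealedGaussianNoise(tf.keras.layers.Layer):
    """Gaussian input noise whose stddev sits in a tf.Variable so it can be lowered mid-training."""

    def __init__(self, initial_std=0.1, **kwargs):
        super().__init__(**kwargs)
        self.stddev = tf.Variable(initial_std, trainable=False, dtype=tf.float32)

    def call(self, inputs, training=None):
        if training:
            return inputs + tf.random.normal(tf.shape(inputs), stddev=self.stddev)
        return inputs

class NoiseDecay(tf.keras.callbacks.Callback):
    """Multiply the noise stddev by `factor` every `every` epochs until it is practically zero."""

    def __init__(self, noise_layer, every=2, factor=0.5):
        super().__init__()
        self.noise_layer, self.every, self.factor = noise_layer, every, factor

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.every == 0:
            self.noise_layer.stddev.assign(self.noise_layer.stddev * self.factor)
```

Passing NoiseDecay(noise_layer) to model.fit(…, callbacks=[…]) would let the noise fade out over one run instead of needing manual restarts.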

I have finally started training! It looks as though it will take ~9-10 hours based on 11% progress after ~1hr. Not too bad. (NB I did start training from scratch.)

Question: 112k images and batch size = 8… that’s quite a small batch size.

Was it chosen after experimenting with other batch sizes? I wonder whether training would be faster with larger batches (answer: surely yes - PL has callbacks on batch end and there’s other overhead - but I wonder how much data is being transferred, perhaps inefficiently, to the GPU per batch. In theory, I think my GPU could keep ALL the data in memory; I’m just not sure how this bit works!)
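For reference, in plain tf.data the usual way to stop per-batch host-to-GPU transfer from dominating is to overlap it with compute via prefetching - something like the sketch below. (Generic illustration only; this is not how PerceptiLabs actually feeds its training loop, and the decode assumes PNG files.)

```python
import tensorflow as tf

def build_pipeline(paths, batch_size=32):
    """Generic input pipeline: decode on the CPU, prefetch batches for the GPU."""
    ds = tf.data.Dataset.from_tensor_slices(paths)
    ds = ds.map(lambda p: tf.io.decode_png(tf.io.read_file(p), channels=3),
                num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prefetch overlaps host-to-device transfer and preprocessing with training,
    # so small batches don't leave the GPU idle between steps.
    return ds.prefetch(tf.data.AUTOTUNE)
```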

Info for @robertl… it took ~3 minutes for the training panels to become “active”, during which time there is no indication of what is happening or how long it thinks it will take before training starts - is that something that can be added?

And… recalling a slack discussion about training slow-down there seem to be a number of things that could be done to speed things up…

  1. The info display is absolutely great for model development - it’s a major strength :slight_smile:
  2. However, once the model is “developed” and ready for a full training run, you could
    2a. Still acquire data, but turn off live updates to the screen until e.g. a refresh button is clicked (or update every nth batch, or…)
    2b. Don’t acquire data at all (if that makes it even faster) and just show a message "Disabled for

Could PL give a simple ETA for training completion? One can sort of do it in one’s head based on progress, but it would be nice to have a date/time for completion.
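The arithmetic is the easy part, of course - something along these lines, purely as an illustration:

```python
from datetime import datetime

def eta(start_time, progress):
    """Estimate completion time from the fraction done (e.g. 0.11 after ~1 h => ~9 h total)."""
    elapsed = datetime.now() - start_time
    return datetime.now() + elapsed * (1 - progress) / progress
```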

What’s the internal thinking on performance these days?


Thanks @JulianSMoore!
I think this is the one you are looking for: https://perceptilabs.canny.io/feature-requests/p/more-informative-loading-when-starting-training :slight_smile:
It’s quite high on the roadmap, should come out this year.

As for performance, we are first looking for a solution to fix the progressive slowdown; that shouldn’t be too bad, but it is behind some other fixes.
As for the training speed vs gathering data in general, we used to have something called “headless mode” which did something similar to your 2b suggestion. Whenever you were not looking at the statistics it would only gather accuracy and loss (for history) and nothing else.
We are looking to re-introduce something similar where you can toggle how much visualizations vs how fast model training you want to have.

As for the ETA, we can compute one as soon as the training has started, and will for sure add something like that. I had forgotten to add that feature to Canny; I’ve added it here now (as a Feature Draft until I fill out the spec a bit more): https://perceptilabs.canny.io/drafts/p/allow-the-option-to-train-with-less-visualizations-for-faster-training

I often run into OOMs when using higher batch sizes, so I set it a bit lower to be on the safe side… running this model with batch size 8 still consumes most of the 4GB on my GTX 980 anyway :sweat_smile: Surely, with more VRAM one could use a higher batch size :wink: This model has millions of trainables, and since all activations need to be stored during the forward pass, it adds up pretty quickly. This is something gradient checkpointing could remedy a bit, but unfortunately there doesn’t seem to be an official implementation for this in TF. There is, however, in PyTorch. (Hint: look at the modified U-Net model in my colorization project on GitHub.)
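
For anyone curious what that looks like on the PyTorch side, here’s a minimal sketch of torch.utils.checkpoint wrapped around a generic conv block (the block itself is a placeholder, not the actual U-Net from the colorization repo):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Conv block whose activations are recomputed during backward instead of stored,
    trading a little extra compute for a lower peak VRAM footprint."""

    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch releases;
        # drop the kwarg on older versions.
        return checkpoint(self.block, x, use_reentrant=False)
```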