Could PL 0.12.7 handle this?

I’ve just built and run a very simple model in Jupyter Lab. Before I invest too much time exporting a CSV file to redo it in PerceptiLabs, I just wanted to ask whether PL 0.12.7 is ready for this use case:

Inputs:

  • 16 numeric cols
  • 1 categorical (5 categories to one-hot encode - do you “drop first” for regression??)

Output:

  • 1 float

2.5 million rows.
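For context, roughly what the data prep looks like on the Jupyter side (column names below are placeholders, not the real ones):

```python
import pandas as pd

# Placeholder column names, for illustration only
df = pd.read_csv("data.csv")                  # ~2.5 million rows
numeric_cols = [f"x{i}" for i in range(16)]   # 16 numeric feature columns
cat_col = "category"                          # 1 categorical column, 5 levels
target_col = "y"                              # 1 float target

# One-hot encode the categorical column (drop_first is the open question above)
X = pd.concat(
    [df[numeric_cols], pd.get_dummies(df[cat_col], prefix=cat_col)],
    axis=1,
)
y = df[target_col]
```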

Bonus Question

I have realised that for this type of problem (a regression) one wants the least noisy data in the test set, so I have a special train/test partitioning function. A naive % split won’t work as well here.

  • Would I be able to put the custom split code into PL somehow, or
  • Could I possibly, he asked, expecting the answer “no”, specify two distinct datasets via the data wizard - one for training and one for test?

PS UPDATE

A simple solution to custom train/validation/test splitting/datasets could be to allow the Data Wizard to use info within the CSV, e.g. a single column holding a TRAIN/VALIDATION/TEST category value for each row, or the same thing one-hot encoded across three TRAIN/VALIDATION/TEST columns.
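On the pandas side that would be something like the sketch below (the “split” column name is just illustrative):

```python
import pandas as pd

df = pd.read_csv("data.csv")

# "split" is a hypothetical column holding TRAIN / VALIDATION / TEST per row
train_df = df[df["split"] == "TRAIN"]
val_df   = df[df["split"] == "VALIDATION"]
test_df  = df[df["split"] == "TEST"]
```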

Hi @JulianSMoore,

Multi-input works in PL, but it’s a bit painful at the moment, especially as you get up to higher numbers of columns. We have been more focused on image/multi-modal use cases with fewer columns for a while now, while improving other parts of the tool.
The number of rows should not be an issue though.

I’m unfamiliar with “drop first”, how does it work?

We unfortunately only have the naive % split available at the moment. The next step for us though will be to let you specify two distinct datasets, one for training+validation (which you specify in the Data Wizard) and one for testing (which you get to specify when you run the tests).
It will be a little bit before this change comes out though, as we are currently preparing for the Cloud solution.

A simple solution to custom train/validation/test splitting/datasets could be to allow the Data Wizard to use info within the CSV, e.g. a single column holding a TRAIN/VALIDATION/TEST category value for each row, or the same thing one-hot encoded across three TRAIN/VALIDATION/TEST columns.

That’s a great idea 🙂
We want to make it possible to store more information in the CSV (data type and whether it’s an Input/Target, for example), so this fits nicely with that.

For anyone else who reads this: “drop first” is an argument to pandas.get_dummies such that n categories are encoded in n-1 one-hot columns, with all zeroes representing one category.

Apparently that’s important for regressions.
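A quick illustration with made-up category values:

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d", "e"], name="category")

# Default: 5 categories -> 5 one-hot columns
pd.get_dummies(s)

# drop_first=True: 5 categories -> 4 columns; "a" becomes the all-zeros baseline
pd.get_dummies(s, drop_first=True)
```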

@robertl Given the info at In Machines We Trust, it seems that regularisation avoids collinearity with prob = 1 (“almost always”), and so there’s no need to use drop_first if regularisation is used.

What’s the PL approach to regularisation?

Great to know, thank you!
As you have seen, we have dropout as our current regularization method in the tool, but we will be adding the standard L1 and L2 penalties in the loss function as well when we start building out the next loss-function-centric feature. (We are planning on placing the loss functions inside the Target components so that you can run multi-task problems, and we will at the same time add the ability to customize the loss function with a code snippet.)
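For reference, in plain Keras that kind of L1/L2 penalty looks roughly like the sketch below (purely illustrative, not a commitment to how PL will expose it):

```python
from tensorflow.keras import layers, regularizers

# Dense layer with an L2 penalty on the kernel weights, added to the loss
dense = layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=regularizers.l2(1e-4),  # penalty strength is arbitrary here
)

# Dropout as a complementary regularizer (what PL currently offers)
drop = layers.Dropout(0.5)
```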
