I’ve just built and run a very simple model in Jupyter Lab. Before I invest too much time exporting a CSV file to redo it in PerceptiLabs, I wanted to ask whether PL 0.12.7 is ready for this use case:
- 16 numeric cols
- 1 categorical (5 categories to one-hot encode - do you “drop first” for regression??)
- 1 float
- 2.5 million rows
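
To make the “drop first” question concrete, here is a minimal pandas sketch of the two encodings I mean. The frame and the column names (`x1`, `category`) are made up for illustration and are not my real data. As far as I understand it, dropping one level mainly matters for unregularised linear regression with an intercept, where the full set of indicator columns is perfectly collinear; whether it matters for a neural-network regression in PL is exactly what I’m unsure about.

```python
import pandas as pd

# Hypothetical toy frame; "category" stands in for the 5-level column.
df = pd.DataFrame({
    "x1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "category": ["a", "b", "c", "d", "e"],
})

# Full one-hot encoding: one indicator column per category (5 columns).
full = pd.get_dummies(df, columns=["category"])

# "Drop first": 4 indicator columns; the dropped level becomes the baseline.
reduced = pd.get_dummies(df, columns=["category"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```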
I have realised that for this type of problem (a regression) one wants the least noisy data for the test set, so I have a special train/test partitioning function. A naive percentage split won’t work as well here.
- Would I be able to put the custom split code into PL somehow, or
- Could I possibly, he asked, expecting the answer “no”, specify two distinct datasets via the data wizard - one for training and one for test?
A simple solution to custom train/validation/test splitting/datasets could be to allow the data wizard to use info within the CSV, e.g. a single column holding a TRAIN/VALIDATION/TEST category value for each row, or the same label one-hot encoded across three columns TRAIN/VALIDATION/TEST.
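
As a rough sketch of what such a CSV could look like, the snippet below labels rows with pandas before export. Everything here is an assumption for illustration: the `noise` column, the `add_split_column` helper, and the split fractions are hypothetical stand-ins for my real partitioning function, and nothing in it is an existing PL feature.

```python
import numpy as np
import pandas as pd

def add_split_column(df, noise_col, test_frac=0.1, val_frac=0.1, seed=0):
    """Label each row TRAIN/VALIDATION/TEST; the least noisy rows become TEST.

    Hypothetical helper: `noise_col` is assumed to hold a per-row noise score.
    """
    out = df.copy()
    n = len(out)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    # Row positions ordered from least to most noisy.
    order = np.argsort(out[noise_col].to_numpy(), kind="stable")
    split = np.full(n, "TRAIN", dtype=object)
    split[order[:n_test]] = "TEST"
    # Shuffle the remaining rows and take a random validation subset.
    rng = np.random.default_rng(seed)
    rest = rng.permutation(order[n_test:])
    split[rest[:n_val]] = "VALIDATION"
    out["split"] = split
    return out

# Toy data: noise decreases with row index, so the last rows are the cleanest.
df = pd.DataFrame({"x": np.arange(10.0), "noise": np.linspace(1.0, 0.1, 10)})
labelled = add_split_column(df, "noise", test_frac=0.2, val_frac=0.2)

# Alternative layout: the same label one-hot encoded across three columns.
one_hot = pd.get_dummies(labelled, columns=["split"], prefix="", prefix_sep="")
```

Either frame could then be written out with `to_csv(..., index=False)`, and the data wizard would only need to be told which column(s) carry the split.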