CSV off screen and probably has mental anxiety

I tried to upload a test CSV from a beginner Kaggle set as this is my first time using the platform.
Every time I click to select the target / input it jumps to a random location on the data set.
Also the “next” button is grayed out as I assume I need to manually select all 100+ categories.

This is the first time I rage quit something that was not a game.

Hi @JacoMoolman - nice to see another new face here… do stick around - we’re seeing the evolution of an amazing tool and things change quickly!

Unfortunately I know that issue, it’s bitten me too - and I only have 18 cols for my regression.

And yes, until you’ve told the tool what to do with each column (input, output, ignore) Next is not available.

I did have a workaround but probably won’t work for you: I only had 18 cols. Change the browser scaling, then more columns fit - and maybe keyboard works (can’t recall - sorry, double checked: no dice).

Related Q: are you by any chance also trying to do a regression here? To the uninitiated eye it looks like property valuation from characteristics, in which case I would simply say that the recent PL focus has been more image oriented and I haven’t yet completed my model in PL either… there’s an outstanding issue with the Merge component needed to bring all the inputs together, but I’m told that’s getting some love soon.

That said PL evolves quickly and they’re very responsive so maybe @RobertL can give us a timeline for stronger support for this sort of thing…

Good day @JulianSMoore.
Thank you so much for the most detailed reply. I’ve tried zooming out, but yes like I said the dataset has more than 100 cols.
I was really excited to have found this platform and I’ve gone through most of the videos on their YouTube channel. However I think I will come back in a couple of months to see if some of the bugs have been fixed.
And yes you seem to be correct that this platform is more looking at visual ML at the moment.
I will be following this platform from time to time, but for now I will stick to others like BigML.

Hi @JacoMoolman - I’ve been playing with PL for about 6 months now and have seen a lot of changes…

It would be really helpful if you had time to share your bug list - obviously there’s more than just the data wizard & for me personally it might be nice to say, "See… I’m not the only one with this issue :wink: "

And then, if the PL guys have e.g. an email for you (I know I get notifications of forum posts to me but can’t recall how that got set up!) someone could let you know when major items on your* list are fixed.

I think we are both keen to do some heavier lifting with PL -

I know PL are really keen to create a place where people like us can be supported and support each other to spread the ML goodness - hope to see you back again in the not too distant future!

* or anyone’s… @robertl if there were some way for people to upvote bugs so their names were on it, everyone affected/benefiting from a new release could be notified - > feature request?? :smiley:

I was thinking of cases like this with a lot of inputs it would be nice to have some sort of “advanced load” option where you could load multiple csv’s by each category, input, target etc… Or have some way to declare what is what inside the csv itself. Would that work?

@birdstream I thought about that for a moment and then thought - doesn’t that actually make it more complicated - and difficult to ensure that all your data files are in sync? (I’d hate to try to maintain it - in fact my solution would be to put it all in one file and split out as needed :smiley: )

The datawizard table certainly needs some love and there are a couple of enhancements I would like to see (such as the ability to specify use of train/val/test per row - because I’ve got a funny dataset that doesn’t “split” into proportions nicely) but when there are lots of inputs

  • I think that’s more of a UI/process challenge
  • The number should be drastically cut down
    • I know @JacoMoolman’s data was from Kaggle (wish I knew which one so I could take a look), but if I were given 100 cols to model I’d say, that’s not modelling that’s an ill-specified problem that it would be very inefficient to tackle as-is…
    • several of the columns I can see are multi-value within so there are effectively even more booleans when one-hot encoded.
  • Just because someone set a problem doesn’t mean it’s a good problem… or - if it is - that it aims to teach what you think :wink:
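To make the multi-value point concrete, here is a small pandas sketch. The column names and values are made up for illustration (they are not taken from the Kaggle dataset above); it only shows how a single multi-value column turns into several 0/1 columns once it is one-hot encoded.

```python
import pandas as pd

# Hypothetical toy data: "features" holds several values per cell, separated by ";".
df = pd.DataFrame({
    "price": [100, 150, 120],
    "features": ["garage;pool", "garage", "garden;pool"],
})

# Series.str.get_dummies splits each cell on the separator and
# creates one 0/1 column per distinct value.
encoded = df["features"].str.get_dummies(sep=";")

# Swap the original multi-value column for its encoded columns.
wide = pd.concat([df.drop(columns="features"), encoded], axis=1)
print(wide.columns.tolist())
```

Here one column becomes three, so a 100-column CSV with a few columns like this gets noticeably wider before any modelling starts.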

I like this sort of question and the ensuing debate - so Q1 is - is it a legitimate need, and only then, if yes Q2 is UI or preprocessing & Q3 How?

What do you guys think?

(Hmmm… I think I need to pose a Kaggle-challenging question separately!)

Some interesting discussion in here :slight_smile:

To start off with, @JacoMoolman, really sorry about the issue you encountered!
We have it tracked and are looking to fix it, although we have some other features and bug fixes in the pipeline so it may drop in a bit later.
By the time you are back it will be sorted (based on your “in a couple of months” timeline).

One solution we are looking at for the large CSV files (besides fixing the UI) is providing the ability to convert/concatenate columns into the Array datatype, be that either in the Data Wizard UI or in the CSV file itself.
We also have some thoughts on allowing you to create a new CSV based on existing CSVs @birdstream, although keeping them all in synch is going to be important there, as @JulianSMoore mentioned. We have some thoughts on that as well, but that’s a bit down the line. It could look something like:

  • Existing loaded datasets can’t be modified from within PL, but instead combined to create new ones
  • The new ones can be downloaded
  • Automatically synch existing datasets based on what happens with the source, but update the “data version” when it happens

Very early thoughts though as you can see.

Then it’s an interesting point with the problem being correct or not. Large tabular datasets actually rarely use Deep Learning, but rather classic ML like clustering, SVMs, regression, etc. as it’s easier, performs pretty well and provides smaller, more responsive models. Hence our initial focus on the Computer Vision domain :slight_smile:

@robertl if there were some way for people to upvote bugs so their names were on it, everyone affected/benefiting from a new release could be notified - > feature request??

Haha, I feel like you are setting me up for presenting Canny :sweat_smile:
I’ve started playing around with using Canny for publicly tracking and voting on bugs, with the given benefit that anyone who votes on it gets notified when there is any update to the bug.
I’ll make an official announcement about it as soon as all the bugs are in there and some more internal stuff is sorted (like enabling logging in with your forum account, if that proves possible), but here is what this bug would look like in Canny: https://perceptilabs.canny.io/bug-reports/p/data-wizard-does-not-work-well-for-large-amount-of-columns
Feel free to browse the features and roadmaps we have in there as well :slight_smile:

It was a genuine question :slight_smile: - I only spotted later that voters appear on the LHS of the screen, and it didn’t automatically have to lead to notifications to those who had expressed interest… I just thought that would be a good idea. (no names on the lists of items, only when individual items are open and, initially there was only your name, so I didn’t notice the word [large, all caps title! :scream_cat:] “Voters”)

Glad to see the emergence/convergence of harmony & light :smiley:

Those dataset ideas are interesting… have you (literally!) sketched out these early thoughts, i.e. boxes and lines? Because: more features = more flexibility = good, but I haven’t managed to get my head around what the use-cases are, etc.

I think it’s due to the rail-x and rail-y resetting after you click on the screen.
Somewhere here is my best guess.
For javascript/css.
I am only an Undergrad so take what I say with a decent amount of salt

Hey @bamgm14

Nice to have an undergrad in the mix… youth has not ossified, is less inhibited and generally more creative/passionate so anything you think is valuable is something we want to hear! (We greybeards are dull, cynical and stuck in our ways.*)

… and you can tell your friends that some greybeards do appreciate alternative viewpoints - courteously expressed, naturally! - so do invite them to contribute too :smiley:

What are you doing with PL? Course related or personal interest?

(* we are usually right though :rofl: :rofl: :rofl: - benefit of experience! - but an occasional kick up the backside doesn’t go amiss :smiley: )

I return to this because I have the same issue again XD
Also will do XD
College kinda consumed me for a while.
Oh Request from an Undergrad student, make a Discord Server?
Basically as a secondary community tab?
Might be an idea since most of my gen rarely maintain contact on forums but most people keep an eye on discord
Plus u can notify without too much email spam XD
Edit Here is the new problem:
My data has about 12+ Columns and when I set the Target to be the last column, it sets it to the 3rd column
Along with that, it fails to make a model.

Hi @bamgm14,
Strange that you still have the same issue with the CSV, it was fixed a few versions ago (just tested it locally as well and works fine for a 16 column dataset) :confused:
Just to make sure that you are running on the latest version, do things look like this for you?

Thanks for the suggestion on creating a Discord server! :slight_smile:
I’m using it a fair bit for games but we never tried it out for PL as we already had a Slack channel and later this forum.
Wouldn’t mind trying it out, as long as it doesn’t become too many channels with both the Discord and Slack.

It’s not exactly the same. Just makes it so that I can’t choose Target and Input on the last rows.

Ok, great to know that it happens, thanks for highlighting it! :slight_smile:

I haven’t been able to reproduce it yet on my end, would you be up for sending a quick video where it happens for you (with the browser console open through F12) and maybe your dataset (or a snippet of it) if possible?

This is the data
2021-12-10 23-18-32.zip (81.6 KB)
Also Request: Add 7z file support for uploads Pls Thx
Also Actual Data (It’s on Kaggle for a competition so I think it’s fine to post here as long as none of you give me a solution to solve the actual ML XD)
train.zip (1.8 MB)
Also add csv support XD

Hi @bamgm14,
Thanks for the data! :slight_smile:
I managed to find the issue, it looks like the first column (the indexing) doesn’t have any header on it which causes the data type recommendation to not work (which causes the other issue you mentioned as well).
If you either add a header for the first column or if you remove the indexing column then it will work well.
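If you want to repair an existing file rather than edit it by hand, something like the following pandas sketch should do it. The inline CSV text is a made-up stand-in for a file whose first column has no header, and the name "index" for the new header is just an assumption; any non-empty name should work.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV whose first (indexing) column has no header.
raw = ",a,b\n0,1,2\n1,3,4\n"

df = pd.read_csv(io.StringIO(raw))
# pandas labels the header-less column "Unnamed: 0".
print(df.columns.tolist())

# Option 1: give the first column a proper header.
named = df.rename(columns={"Unnamed: 0": "index"})

# Option 2: remove the indexing column entirely.
dropped = df.drop(columns=["Unnamed: 0"])

# Write back with index=False so pandas does not add
# another header-less index column of its own.
out = io.StringIO()
named.to_csv(out, index=False)
fixed_csv = out.getvalue()
print(fixed_csv)
```

With a real file you would pass the file paths to read_csv and to_csv instead of the in-memory buffers.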

As for your other requests:

Also Request: Add 7z file support for uploads Pls Thx
Also add csv support XD

You mean to the forum? :slight_smile: Will look into adding it

For anyone else with similar issues, I think I’ve had exactly the same problem in the past when using pandas dataframe to_csv to prepare data.

The fix is easy enough, just do something like this to give the index a label (assuming you want the index - I find it’s a handy way of ordering/keeping order in CSV etc.)

aDf.to_csv(opPath, index_label='index')

where aDf = a dataframe, opPath = the full path for the target CSV
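For anyone who wants to see the difference, here is a small self-contained sketch comparing the header row with and without index_label. The dataframe contents are invented for illustration, and it writes to in-memory buffers rather than a real path.

```python
import io
import pandas as pd

# Hypothetical toy dataframe standing in for aDf.
a_df = pd.DataFrame({"a": [1, 3], "b": [2, 4]})

# Default: the index is written, but its header cell is left empty.
plain = io.StringIO()
a_df.to_csv(plain)
header_plain = plain.getvalue().splitlines()[0]

# With index_label the index column gets a header.
labelled = io.StringIO()
a_df.to_csv(labelled, index_label="index")
header_labelled = labelled.getvalue().splitlines()[0]

print(repr(header_plain))
print(repr(header_labelled))
```

The first header starts with a bare comma, which is the header-less first column described above; the second starts with index.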