What follows are two common data analysis tasks using the popular pandas Python library, recorded for personal posterity. Enjoy!
Situation number one: you’ve just loaded a DataFrame and are poking about. You have some categorical columns and you want to graphically understand the distribution. You can plot it easily with the following:
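A minimal sketch of one way to do this, assuming a DataFrame with a categorical column called color (a made-up name with made-up data) and assuming matplotlib is installed so that pandas can draw the plot:

```python
import pandas as pd

# Hypothetical toy data; `color` stands in for any categorical column.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Count how often each category occurs, then draw one bar per category.
counts = df["color"].value_counts()
ax = counts.plot(kind="bar")
ax.set_xlabel("color")
ax.set_ylabel("count")
```

value_counts sorts the categories from most to least frequent, so the bars come out in descending order.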
The result is a chart showing one value for each category.
Alternatively, if the categorical column you’re trying to understand has high cardinality, you might want a histogram rather than a value for each category:
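One reading of this, sketched below under assumptions, is to histogram the per-category counts themselves: instead of one bar per category, you see how many categories occur once, twice, and so on. The column name user_id and the data are invented for illustration, and matplotlib is assumed to be installed.

```python
import pandas as pd

# Hypothetical high-cardinality column: 7 IDs that appear 10 times each,
# plus 60 IDs that appear exactly once.
df = pd.DataFrame({"user_id": [i % 7 for i in range(70)] + list(range(100, 160))})

# Rows per category: one entry for each distinct user_id.
counts = df["user_id"].value_counts()

# Histogram of those counts: the x-axis is "rows per category",
# the y-axis is "number of categories with that many rows".
ax = counts.plot(kind="hist", bins=10)
ax.set_xlabel("rows per user_id")
ax.set_ylabel("number of user_ids")
```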
Situation number two: you have a training DataFrame you'd like to split into training and validation sets for cross-validation. Naively splitting the rows, however, would leak information between the training set and the validation set. Say, for example, you have a dataset of quiz attempts with a column installation_id that ties each quiz attempt to a mobile device (and presumably a unique user). Users might take multiple quizzes, and eventually your model will be evaluated on a test set of users you've never seen before.
What would happen if you randomly split the data? You'd end up with some users who have quiz attempts in both the training and validation sets. When evaluating these users during cross-validation, your model might simply memorize that a specific user performed well in a training set instance, and so predict that the same user will perform well again. Obviously, this type of heuristic won't generalize to unknown users, so your validation score will look better than your real test score.
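To make the problem concrete, here is a small sketch on made-up data (10 devices with 5 quiz attempts each; the score column and the fixed random_state are assumptions for illustration). It uses scikit-learn's train_test_split on the rows directly and then counts how many devices ended up on both sides.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 devices with 5 quiz attempts each.
df = pd.DataFrame({
    "installation_id": [f"device_{i}" for i in range(10) for _ in range(5)],
    "score": list(range(50)),
})

# Naive split: rows are shuffled without regard to who produced them.
train_rows, valid_rows = train_test_split(df, test_size=0.2, random_state=0)

# Devices that appear on both sides of the split are the source of leakage.
shared = set(train_rows["installation_id"]) & set(valid_rows["installation_id"])
print(f"{len(shared)} of {df['installation_id'].nunique()} devices appear in both sets")
```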
To avoid introducing this leakage, do the train/test split on the unique values of a specific column rather than on the rows, so that every row for a given value lands on the same side. Here is how to do this:
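from sklearn.model_selection import train_test_split

# Split the unique installation IDs, not the rows, so each user lands on exactly one side.
train_installations, test_installations = train_test_split(df.installation_id.unique(), test_size=0.2)
train_df = df[df.installation_id.isin(train_installations)]
test_df = df[df.installation_id.isin(test_installations)]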
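To check that a split like the one above really keeps users apart, here is a self-contained sketch on made-up data (the toy DataFrame, the score column, and the fixed random_state are assumptions for illustration, not part of the original dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 10 devices with 5 quiz attempts each.
df = pd.DataFrame({
    "installation_id": [f"device_{i}" for i in range(10) for _ in range(5)],
    "score": list(range(50)),
})

# Split the unique IDs, then select the rows belonging to each group of IDs.
train_ids, test_ids = train_test_split(
    df["installation_id"].unique(), test_size=0.2, random_state=0
)
train_df = df[df["installation_id"].isin(train_ids)]
test_df = df[df["installation_id"].isin(test_ids)]

# No device should appear on both sides, and no row should be lost.
shared = set(train_df["installation_id"]) & set(test_df["installation_id"])
assert not shared
assert len(train_df) + len(test_df) == len(df)
```

Note that test_size=0.2 here applies to users, not rows. If some users have many more attempts than others, the row counts of the two DataFrames can drift away from an 80/20 ratio.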
Those are my two tricks for the day. Code on!