
I have 18 categories to classify.

Each category contains anywhere from 3 to 140+ classes.

Fifteen of my categories routinely score 90% or above when back-tested. Two others score between 80% and 90%, and the last is my problem child. It is my largest category, with 140+ classes for the model to choose from, and it started at 25% accuracy.

PROBLEM 1: DIRTY DATA

One of the first things I noticed was mistakes in the labels for this category: misspellings or just straight-up incorrect class choices. One of the most common mistakes was making a singular item plural.

To combat these problems, I created a script that goes through every class in each category and compares it to the other classes within its own category. It computes a “similarity score” for each pair and returns any classes that are over 80% similar. This uncovered all the misspelled and pluralized classes, which I then corrected in the dataset.

Fixing these increased accuracy by 5-10%.
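
The original script is not shown, so here is a minimal sketch of what such a check could look like. It assumes the similarity score is the ratio from Python's standard difflib.SequenceMatcher, which may not be the measure actually used; the function name similar_class_pairs and the sample labels are made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_class_pairs(classes_by_category, threshold=0.80):
    """Return (category, class_a, class_b, score) for suspiciously similar labels."""
    flagged = []
    for category, classes in classes_by_category.items():
        # Compare each distinct label only against labels in its own category.
        for a, b in combinations(sorted(set(classes)), 2):
            # Assumption: character-level similarity, case-insensitive.
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score > threshold:
                flagged.append((category, a, b, round(score, 2)))
    return flagged


# Hypothetical labels: one plural and one misspelling hiding among valid classes.
labels = {
    "Product": ["Hex Bolt", "Hex Bolts", "Hex Blot", "Washer", "Lock Nut"],
}
flagged = similar_class_pairs(labels)
for row in flagged:
    print(row)
```

Every flagged pair still needs a human decision. Two labels can be over 80% similar and both be legitimate classes, so the output is a review list rather than an automatic merge.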

PROBLEM 2: VARIANCE

My next problem was dealing with variance in similar classes within one category. For example: 1/4in Hex Bolt is similar to 1/2in Hex Bolt, yet they are two completely different items.

For a time I was attempting to classify both within the same category. With 30 1/4in examples and 42 1/2in examples, compared to other classes with 100+ examples, the model struggled to identify either. That’s when I decided to create a “Sizing” category. Hex Bolt is the “Product” and 1/4in or 1/2in is the “Sizing”. So now the model has 72 examples (30 + 42) of Hex Bolt as a “Product”. And since “Sizing” is only dealing with two classes in this example, it has an easier time identifying them.

I continued this for all other classes that fit this scenario, consolidating similar items and separating out whatever characteristics I could.

This increased the “Product” accuracy by 15% while “Sizing” came in above 90%.
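
As a rough sketch of the relabeling step, the snippet below splits a combined label into a “Product” label and a “Sizing” label. It assumes every sized label starts with a size token such as 1/4in followed by the product name; the regular expression, the split_label helper, and the sample data are illustrative and not taken from the actual dataset.

```python
import re
from collections import Counter

# Assumed label shape: a leading size token ("1/4in", "10mm") and then the product name.
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:/\d+)?\s*(?:in|mm))\s+(.+?)\s*$", re.IGNORECASE)


def split_label(label):
    """Split '1/4in Hex Bolt' into ('Hex Bolt', '1/4in'); size is None if absent."""
    match = SIZE_PATTERN.match(label)
    if match is None:
        return label.strip(), None
    size, product = match.groups()
    return product, size.replace(" ", "")


# Same example counts as in the text: 30 of one size, 42 of the other.
original_labels = ["1/4in Hex Bolt"] * 30 + ["1/2in Hex Bolt"] * 42
pairs = [split_label(label) for label in original_labels]

product_counts = Counter(product for product, _ in pairs)
size_counts = Counter(size for _, size in pairs)
print(product_counts)  # one pooled Product class
print(size_counts)     # two Sizing classes
```

The two thin classes pool into a single “Product” class, while the size becomes its own small target for a separate “Sizing” model.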

Now my problem child is coming in at 45% after a single round of training and nearly 50% after a second round.

I think the next thing I need to focus on is the parameters for training.

I’ve already consolidated as much as I can. The only other idea I have is to break out different models for different item types, so that all the Bronze items have their own model, Silver items have their own, and so on. But the computing power for the actual classifying is limited. Running 5+ models per classification would take over an hour, and I want this to be quicker.

NLP Model: Setting up for success

David J Boronow EMAIL