Machine Learning

NLP Model: Setting up for success

I have 18 categories to classify.

Each category contains anywhere from 3 to 140+ classes.

15 of my categories routinely score 90% and above when back-tested. 2 others score between 80% and 90%, and the last is my problem child. That category is my largest, with 140+ classes for the model to choose from, and it started at 25%.

PROBLEM 1: DIRTY DATA

One of the first things I noticed was mistakes in the classifications for this category: misspellings or just straight-up incorrect class choices. One of the most common misspellings was making a singular item plural.

To combat these potential problems, I created a script to go through every class of each category and compare it to the other classes within its own category. It would compute a “similarity score” and return any classes that were over 80% similar. This uncovered all the misspelled and pluralized classes, which I then corrected in the dataset.
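For reference, the comparison can be done with nothing fancier than Python’s built-in difflib. This is a minimal sketch of the idea, not my exact script, and the class names here are made up:

```python
from difflib import SequenceMatcher
from itertools import combinations

def find_similar_classes(class_names, threshold=0.8):
    """Compare every pair of class names in a category and flag pairs
    whose similarity ratio is above the threshold."""
    suspects = []
    for a, b in combinations(class_names, 2):
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score > threshold:
            suspects.append((a, b, round(score, 2)))
    return suspects

# Pluralized near-duplicates score well above 0.8
print(find_similar_classes(["Hex Bolt", "Hex Bolts", "Washer", "Washers"]))
```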

Fixing these increased accuracy by 5-10%.

PROBLEM 2: VARIANCE

My next problem was dealing with variance in similar classes within one category. For example: 1/4in Hex Bolt is similar to 1/2in Hex Bolt, yet they are two completely different items.

For a time I was attempting to classify both within the same category. With only 30 1/4in examples and 42 1/2in examples, compared to other classes with 100+ examples, the model struggled to identify either. That’s when I decided to create a “Sizing” category: Hex Bolt is the “Product” and 1/4in or 1/2in is the size. Now the model has 72 examples of Hex Bolt as a “Product”, and since “Sizing” only deals with two classes, it has an easier time identifying them.

I continued this for all other classes that fit this scenario, consolidating similar items and separating out whatever characteristics I could.
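In practice the split can be as simple as pulling the size token out of the label so the two new categories can be trained separately. A rough sketch of that idea; the regex and labels here are illustrative, not my production code:

```python
import re

# Illustrative size pattern: fractions or whole numbers followed by "in"
SIZE_PATTERN = re.compile(r"\b(\d+(?:/\d+)?in)\b", re.IGNORECASE)

def split_product_and_size(label):
    """Split a combined label into a "Product" label and a "Sizing" label."""
    match = SIZE_PATTERN.search(label)
    size = match.group(1) if match else "N/A"
    product = SIZE_PATTERN.sub("", label).strip()
    return product, size

print(split_product_and_size("1/4in Hex Bolt"))  # ('Hex Bolt', '1/4in')
print(split_product_and_size("1/2in Hex Bolt"))  # ('Hex Bolt', '1/2in')
```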

This increased the “Product” accuracy by 15% while “Sizing” came in above 90%.

Now my problem child is coming in at 45% after a single training run and nearly 50% after a second run.

I think the next thing I need to focus on is the parameters for training.

I’ve already consolidated as much as I can. The only other idea I have is to break out different models for different item types, such as one model for all the Bronze items, another for the Silver items, and so on. But the computing power for the actual classifying is limited; having to run 5+ models per classification would take over an hour. I want this to run faster.

NLP Model: A problem set one year in

I first had the idea for this particular project about three years ago. I'm now a year into the project and I have learned much. Yet I feel like I know nothing.

What is an NLP?

A machine learning technology that enables computers to understand, process, and manipulate human language. NLP is a branch of artificial intelligence, computer science, and linguistics. It uses techniques like machine learning, neural networks, and text mining to interpret language, translate between languages, and recognize patterns. NLP is used in many everyday products and services, including search engines, chatbots, voice-activated digital assistants, and translation apps.

Why an NLP?

I had tried several approaches and tested a few ideas. Ultimately I realized that my raw inputs would be too messy for most models. I needed something flexible and capable of comparing words within a sentence. NLP seemed like the best option that required the least physical resources.

What model am I using?

I am currently using a pre-trained DistilBERT model that I am fine-tuning with my custom data.
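For the curious, loading that kind of model only takes a few lines. This sketch assumes the Hugging Face transformers library, which is my tooling choice rather than anything required, and the label count below is just the problem-child category as an example:

```python
from transformers import (
    DistilBertTokenizerFast,
    DistilBertForSequenceClassification,
)

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=140,  # one output per class in the category being trained
)
```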

How do I train it?

I will probably get into this in more detail at a later date, mostly because I want to update this part. For now, I am converting the training data into a DataFrame with Python and then splitting it into train and validation sets. But I feel like I can improve this significantly.
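The current version is roughly this (a simplified sketch with made-up rows; the real data has far more lines):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows: raw description text plus the manually assigned label
rows = [
    ("1/4in hex bolt zinc", "Hex Bolt"),
    ("1/2in hex bolt", "Hex Bolt"),
    ("3/8in flat washer", "Washer"),
    ("1/2in flat washer galv", "Washer"),
]
df = pd.DataFrame(rows, columns=["text", "label"])

# 50/50 only because the toy set is tiny; the real split is much larger.
# stratify keeps every class represented in both train and validation.
train_df, val_df = train_test_split(
    df, test_size=0.5, stratify=df["label"], random_state=42
)
```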

Goal

Classify an incredibly large dataset with at least 85% accuracy on an hourly basis without human assistance.

Problems

  1. There are no training datasets for public consumption. I need to create my own.

  2. The dataset needs to be fairly big in order to get the best results.

  3. The custom dataset needs to be classified manually, which takes longer the bigger it is.

  4. There will be over 360 different classes in the dataset, and they need to be reasonably balanced.

  5. My processing power is limited to a six-year-old GTX 1070.

The first couple of problems have already been solved, or are still in the process of being solved. I have created 36,000+ lines of training data by scraping data that gets sent directly to me on a daily basis. And since I’m a data hoarder, I still have three years’ worth of raw data to convert into usable training data.

The third problem is ongoing. It takes a long, long time to classify 36,000+ lines of training data, and my plan for the summer is to reach 40,000 lines. The other issue is that while I gain good training data for some classes, I’m still lacking for others, so I have to hunt for examples of the lesser-used classes. And they are lesser used for a reason. This slows the overall progress of the project, since it takes time to claw together examples of those classes.
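Finding which classes are still short on examples is at least easy to measure. Assuming the training data sits in a DataFrame with a “label” column (the file name below is hypothetical):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical file name
counts = df["label"].value_counts()

# Classes with fewer than 100 examples are the ones I have to hunt for
print(counts[counts < 100].sort_values())
```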

CUE THE PONZI SCHEME

This is when I came up with an idea. The NLP model reads the raw data and makes a classification attempt for 18 different categories. Each category could relate to anywhere from 1 to 5 words in the raw data. Those words can be spelt or abbreviated several different ways that all mean the same thing. By swapping the source words with their alternatives, I can inflate the training data. Depending on the number of alternative words in a single source sentence, that sentence can be transformed as many as 15 times. Now the model not only gets reinforcement training, but exposure to every spelling and abbreviation variant.
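The swap itself is straightforward. A minimal sketch of the idea; the alternative-spelling table here is made up and much smaller than the real one:

```python
from itertools import product

# Made-up alternative spellings / abbreviations for a few source words
ALTERNATIVES = {
    "bolt": ["bolt", "blt"],
    "stainless": ["stainless", "ss", "stnls"],
    "galvanized": ["galvanized", "galv"],
}

def expand_line(line):
    """Generate every variant of a training line by swapping each known
    word for each of its alternative spellings."""
    words = line.split()
    options = [ALTERNATIVES.get(w.lower(), [w]) for w in words]
    return [" ".join(combo) for combo in product(*options)]

# 3 spellings of "stainless" x 2 of "bolt" = 6 variants of one source line
for variant in expand_line("1/4in stainless hex bolt"):
    print(variant)
```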

That approach turned my 36,000 lines into 293,000.

NEXT

So now I need to ponder my processing power problem. My GTX 1070 doesn’t do a terrible job. But the bigger the model gets, the longer it takes to train. A few ways I think I can approach this without buying hardware:

  1. Adjust Training Parameters (see the sketch after this list)

  2. Play with the padding / truncation

  3. Clean up the training data

  4. Consolidate the Category with 140+ possible classes

  5. Research
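For items 1 and 2, the knobs I am looking at are roughly these (again assuming the Hugging Face tooling; the values are guesses to experiment with, not settings I have validated):

```python
from transformers import DistilBertTokenizerFast, TrainingArguments

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

# Cap sequence length: short part descriptions rarely need DistilBERT's 512 tokens
encodings = tokenizer(
    ["1/4in stainless hex bolt", "3/8in flat washer galv"],
    padding="max_length",
    truncation=True,
    max_length=64,
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,  # effective batch of 32 with less memory
    fp16=True,                      # saves memory; speedup on a GTX 1070 is not guaranteed
    num_train_epochs=3,
)
```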