
NLP Model: A problem set one year in

I first had the idea for this particular project about three years ago. I'm now a year into the project and I have learned much. Yet I feel like I know nothing.

What is an NLP?

Natural language processing (NLP) is a machine learning technology that enables computers to understand, process, and manipulate human language. It sits at the intersection of artificial intelligence, computer science, and linguistics, and it uses techniques like machine learning, neural networks, and text mining to interpret language, translate between languages, and recognize patterns. NLP is used in many everyday products and services, including search engines, chatbots, voice-activated digital assistants, and translation apps.

Why an NLP?

I had tried several approaches and tested a few ideas. Ultimately I realized that my raw inputs would be too messy for most models; I needed something flexible and capable of comparing words within a sentence. NLP seemed like the best option, and the one that required the fewest physical resources.

What model am I using?

I am currently using a pre-trained DistilBERT model that I am fine-tuning on my custom data.
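For reference, loading a model like that with Hugging Face's transformers library looks roughly like the sketch below. The checkpoint name and label count are placeholders; the post talks about 18 categories and 360+ classes overall, and I'm not pinning down the exact head configuration here.

```python
# Rough sketch of loading a pre-trained DistilBERT checkpoint for
# sequence classification with Hugging Face transformers. The checkpoint
# name and NUM_LABELS are placeholders, not the exact configuration.
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast

NUM_LABELS = 360  # placeholder; the real label layout spans 18 categories

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=NUM_LABELS,
)
```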

How do I train it?

I will probably get into this in more detail at a later date, mostly because I want to rework this part. For now, I convert the training data into a DataFrame with Python and then split it into train and validation sets. I feel like I can improve this step significantly.
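In rough terms, that step looks like the sketch below, assuming the labeled data sits in a CSV with text and label columns and that the split uses scikit-learn's train_test_split. The file name and column names are placeholders.

```python
# Sketch of the DataFrame -> train/validation split. File and column
# names are placeholders for whatever the real pipeline produces.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # hypothetical labeled data: text, label

train_df, val_df = train_test_split(
    df,
    test_size=0.2,          # hold out 20% for validation
    random_state=42,        # reproducible split
    stratify=df["label"],   # keep class proportions; needs >=2 rows per class
)
print(len(train_df), "train rows /", len(val_df), "validation rows")
```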

Goal

Classify an incredibly large dataset with at least 85% accuracy on an hourly basis without human assistance.

Problems

  1. There are no training datasets for public consumption. I need to create my own.

  2. The dataset needs to be fairly big in order to get the best results.

  3. The custom dataset needs to be classified manually, which takes longer the bigger it is.

  4. There will be over 360 different classes in the dataset, and they need to be reasonably balanced.

  5. My processing power is limited to a six-year-old GTX 1070.

The first couple of problems are solved, or at least well on the way. I have created 36,000+ lines of training data by scraping data that gets sent directly to me on a daily basis. And since I’m a data hoarder, I still have three years’ worth of raw data to convert into usable training data.

The third problem is ongoing. It takes a long, long time to classify 36,000+ lines of training data, and my plan for the summer is to reach 40,000 lines. The next wrinkle is that while some classes accumulate good training data on their own, I’m still lacking examples for others, so I have to hunt for training points in the lesser-used classes. They are lesser used for a reason. Clawing for those examples slows down the overall progress of the project.
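A quick count of examples per class makes the gaps obvious. Assuming the labeled data is in a DataFrame with a label column (the same placeholder schema as above), something like this will do:

```python
# Count examples per class to see which ones are starving for data.
import pandas as pd

df = pd.read_csv("training_data.csv")  # hypothetical labeled data

counts = df["label"].value_counts()
print(counts.tail(20))                 # the 20 rarest classes
print((counts < 50).sum(), "classes have fewer than 50 examples")
```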

CUE THE PONZI SCHEME

This is when I came up with an idea. The NLP model reads the raw data and makes a classification attempt across 18 different categories. Each category can relate to anywhere from 1 to 5 words in the raw data, and those words can be spelled or abbreviated in several different ways that all mean the same thing. By swapping the source words with their alternatives, I can inflate the training data. Depending on the number of alternative words in a single source sentence, that sentence can be transformed as many as 15 times. Now the model not only gets reinforcement on the same examples, but exposure to every spelling and abbreviation variant.

That approach turned my 36,000 lines into 293,000.
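The swap itself is simple enough to sketch. The alternatives dictionary below is made up; the real lists of spellings and abbreviations are specific to my data.

```python
# Sketch of the spelling/abbreviation swap augmentation. The ALTERNATIVES
# map is illustrative; the real alternative lists are project-specific.
import itertools

ALTERNATIVES = {
    "avenue": ["ave", "av"],
    "north": ["n", "no"],
}

def augment(sentence: str, max_variants: int = 15) -> list[str]:
    """Return the original sentence plus variants with alternative spellings."""
    tokens = sentence.lower().split()
    # Each token maps to itself plus any known alternatives.
    options = [[tok] + ALTERNATIVES.get(tok, []) for tok in tokens]
    variants = []
    for combo in itertools.product(*options):
        candidate = " ".join(combo)
        if candidate != sentence.lower():
            variants.append(candidate)
        if len(variants) >= max_variants:
            break
    return [sentence] + variants

print(augment("123 North Main Avenue"))
```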

NEXT

So now I need to ponder my processing power problem. My GTX 1070 doesn’t do a terrible job, but the bigger the training set gets, the longer training takes. A few ways I think I can approach this without buying hardware (a rough sketch of the first two follows the list):

  1. Adjust training parameters

  2. Play with the padding / truncation

  3. Clean up the training data

  4. Consolidate the category with 140+ possible classes

  5. Research
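For the first two items, the levers live in the tokenizer call and the training arguments. Here is a rough sketch using Hugging Face's Trainer-style setup; every value is a guess meant to show where the knobs are, not a tuned configuration.

```python
# Illustrative knobs for items 1 and 2 above. All values are guesses
# meant to show where the levers are, not a tuned setup.
from transformers import AutoTokenizer, DataCollatorWithPadding, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate to a shorter max_length so sequences (and padding) stay small.
    return tokenizer(batch["text"], truncation=True, max_length=64)

# Pad each batch only to its longest sequence instead of a fixed length.
collator = DataCollatorWithPadding(tokenizer=tokenizer)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,  # as big as the 1070's 8 GB will tolerate
    gradient_accumulation_steps=2,   # simulate a larger batch without more VRAM
    num_train_epochs=3,
    fp16=True,                       # saves memory; speedup on Pascal is limited
    learning_rate=5e-5,
)
```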