NLP models typically work by autoregressively predicting the next word in a text sequence. This is a highly specific task, and it does not appear to be optimized for generalization. To encourage more generalization, I propose two changes.
Training Method to Produce More Generalization:
- Start with a pre-trained model (e.g., GPT-3).
- Create a strong paraphrase detection model.
- Create an emotion embedding model.
- Fine-tune on the original training data, but use the paraphrase detection model's output to train the model to be less exact and more conceptually related. Multiple levels could be targeted simultaneously (sentence, paragraph). Perhaps keep presenting the same text until paraphrase scores increase. Importantly, the paraphrase score should be lower when the output is exactly the same as the original text.
- Optimize for the sentence, paragraph, and full text producing matching emotion embeddings. As with the paraphrase step, autoregressively revisit the same text until the emotion embeddings reach a certain threshold of similarity.
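The combined objective described above can be sketched in a few lines. This is a minimal illustration, not an implementation: the function name, the external paraphrase and emotion scores, and the weighting scheme are all my assumptions about how the proposal might be wired together.

```python
def generalization_loss(lm_loss: float,
                        paraphrase_score: float,
                        exact_match: bool,
                        emotion_similarity: float,
                        w_para: float = 1.0,
                        w_emo: float = 1.0) -> float:
    """Combine the language-modeling loss with paraphrase and emotion terms.

    lm_loss: the usual autoregressive (cross-entropy) loss.
    paraphrase_score: similarity in [0, 1] between the model's output and
        the reference, produced by an external paraphrase-detection model
        (hypothetical here).
    exact_match: True when the output reproduces the reference verbatim;
        per the proposal, verbatim copies should not receive full credit.
    emotion_similarity: similarity in [-1, 1] between emotion embeddings
        of the output and the reference (also hypothetical).
    """
    if exact_match:
        # Penalize exact reproduction so the model favors conceptual
        # rephrasing over memorization.
        paraphrase_score *= 0.5
    # Reward high paraphrase and emotion similarity by subtracting the
    # weighted scores from the base loss.
    return lm_loss - w_para * paraphrase_score - w_emo * emotion_similarity
```

Under this scheme, a verbatim copy of the reference yields a higher (worse) loss than a good paraphrase of it, which is the intended pressure toward conceptual rather than exact reproduction.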
Another recommendation is to train on a school curriculum spanning primary, secondary, and post-secondary school. The full set of texts and questions would be valuable training material, because children and adults are exposed to a very wide variety of reading content and conceptual question types throughout their education. In my opinion, increasing the number of concepts, and the variety of question types, will lead to a higher level of generalization.
It is also worth considering how individual humans are trained, which is very different from how NLP models are trained. NLP models are trained on very large amounts of symbolic data, whereas the amount of symbolic data humans receive is very limited in comparison. That means the network architecture, the method of training, or both need to be revised. Of these, the training data is the easiest to address. Datasets designed to match the scope and size of what humans receive may also help drive changes in network architecture. Such a dataset could include:
- Text transcriptions of specific parent–child interactions throughout early childhood, prior to school.
- Text transcripts of classroom lectures, peer interactions, quizzes, and tests.
- All textbook material.
- Social interaction via text at all age levels and with a variety of people.
- Television transcripts integrated into the above for specific time periods (pretending that the AI is a certain age and going back in time from the present).
- Radio transcripts, handled the same way.
- Group interactions with family and peers, integrated throughout.
- Dating interactions, integrated throughout.
- Sports and extracurricular activities (this might be very difficult to obtain).
- Transcripts of interactions with various professionals (doctors, therapists, etc.).
- Transcripts of stream of consciousness (thinking in words).
In other words, the training data would ideally be the textual content received and produced by an individual over their lifetime. Given that spoken language transfers information at about 39 bits per second, we can work out an upper bound on the total amount of language processing done by a human. Thinking in words is up to 4 times faster than speaking, which puts the inner-speech data rate at approximately 160 bits per second. Assume, generously, that this thinking occurs 24/7 (including sleep); the total language data processed by age 30 then comes to at most about 17.6 GB of text. This is very likely an overestimate, by up to an order of magnitude, since we are not constantly thinking in words at our highest rate and verbal processing is greatly reduced during sleep. Compare this 17.6 GB maximum to the 45 TB of text data used to train GPT-3: 2,500 to 25,000 times more data than the total language processing done by a human by age 30. One limitation of this analysis is that it excludes other forms of thinking and learning (e.g., visual, kinesthetic, performance).
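The arithmetic above can be checked in a few lines. The 39 bit/s speech rate and the 4x inner-speech multiplier come from the text; the unit conventions are my assumption (the 17.6 figure matches binary gibibytes, while the ratio to 45 TB works out to roughly 2,500 when both are treated as decimal units).

```python
# Back-of-the-envelope check of the lifetime language-processing estimate.
SPEECH_BITS_PER_SEC = 39
THINKING_BITS_PER_SEC = 4 * SPEECH_BITS_PER_SEC   # 156, rounded to 160 in the text
SECONDS_IN_30_YEARS = 30 * 365 * 24 * 3600        # 946,080,000 s (ignoring leap days)

total_bits = 160 * SECONDS_IN_30_YEARS            # using the rounded 160 bit/s rate
total_bytes = total_bits / 8                      # ~1.89e10 bytes

gib = total_bytes / 2**30                         # ~17.6 (binary GiB)
ratio = 45_000 / 17.6                             # 45 TB vs 17.6 GB, ~2,557x
```

The result reproduces the ~17.6 GB upper bound, and dividing 45 TB by it gives the low end of the stated 2,500 to 25,000x range; the high end follows from the order-of-magnitude overestimate noted above.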
What is clear from this analysis is that we need better datasets for training and better network architectures. I have outlined initial steps for using paraphrase detection and emotion embeddings to improve the generalization and performance of existing models, and I have argued, from an analysis of the amount of data used to train language models, that scaling toward ever-larger datasets may be missing important opportunities. It will certainly be interesting to keep applying more data and compute to get better results, but that does not get us closer to a more efficient, general model with a chance of being accessible to the masses.