Thursday 11 February 2021

Machine Learning: Soft Skill: Personality Detection using text information

1.   Executive Summary

The Myers-Briggs Type Indicator (MBTI) is a widely available personality test that describes a person’s preferences and behaviour in terms of a set of key personality traits. It is one of the most popular personality tests in the world, used in business, online, in research, and much more, and it helps in identifying one’s strengths and weaknesses in terms of personality. However, its validity has been questioned because of unreliable results in many experiments; even so, it is still held onto as a very useful tool in a lot of areas.

2.   Problem Statement

The ‘Personality Test Model’, built on MBTI-labelled social media comment data, is used to probe the relevance of the famous MBTI system. The purpose is to see whether any patterns can be detected between specific types and their style of writing, which overall explores the validity of the test for analyzing, predicting, or categorizing behaviour, and later to test the models by feeding them a few writings and predicting the corresponding trait.

3.   Analytics Rationale Statement

Personality plays a key role in predicting many aspects of an individual, such as mental and physical health, career fit, and well-being. Hence, getting deep insight into a person’s personality type is of key importance. The consideration taken here is that when we create teams to fight crime, develop amazing software, or simply play sports, it is important to consider everyone’s technical position on the team, but it is also deeply important to explore soft skills to ensure that strengths and weaknesses are balanced. This allows another layer of interdisciplinarity, not only in technical skill set but also in mindset and personality.

Myers Briggs Classification Problem

 The Myers Briggs Type Indicator (or MBTI for short) is a personality type system that divides everyone into 16 distinct personality types across 4 axes:


Introversion (I) – Extroversion (E)        (the attitudes of consciousness)

Intuition (N) – Sensing (S)                (the functions of consciousness)

Thinking (T) – Feeling (F)                 (the functions of consciousness)

Judging (J) – Perceiving (P)

(Note that each axis pairs two opposite preferences, shown side by side above to give a sense of how their meanings differ when compared with each other.)

Introversion

Introverts tend to be inward-turning, or focused more on internal thoughts, feelings, and moods rather than seeking out external stimulation.

Extroversion

It is a personality trait typically characterized by outgoingness, high energy, and/or talkativeness.

Intuition

It refers to a deeper perception of inherent possibilities and inner meanings. It never directly reflects reality but actively, creatively, insightfully, and imaginatively adds meaning by reading things into the situation.

Sensing

It refers to our immediate experience of the objective world that is available to the senses.

Thinking

It is based upon the intellectual comprehension of things through analysis and logical inference.

Feeling

It involves judging the value of things or having an opinion about them based on our likes and dislikes.

Judging

It reflects a closed, organized, decisive approach.

Perceiving

It reflects a more open, flexible, and curious approach.



For example, the label INTP denotes an I - introversion, N - intuition, T - thinking, and P - perceiving nature.

There are lots of personality-based components that can model or describe a person’s preferences or behaviour based on such a label.
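Since each of the four positions in the code is a binary choice, the label can be decomposed into four separate traits. Below is a minimal Python sketch of that idea (the function and axis names are illustrative, not taken from this write-up):

# Illustrative sketch: split a 4-letter MBTI code into its four binary traits.
# The axis order (I/E, N/S, T/F, J/P) follows the dichotomies listed above.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

def split_mbti(code):
    """Map e.g. 'INTP' to {'I-E': 'I', 'N-S': 'N', 'T-F': 'T', 'J-P': 'P'}."""
    code = code.upper()
    return {f"{a}-{b}": code[i] for i, (a, b) in enumerate(AXES)}

print(split_mbti("INTP"))   # {'I-E': 'I', 'N-S': 'N', 'T-F': 'T', 'J-P': 'P'}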

Data Acquisition

We use an MBTI dataset from Kaggle, the “(MBTI) Myers-Briggs Personality Type Dataset”, containing posts written by users of the PersonalityCafe online forum, labelled with their personality type. The dataset is composed of 8675 unique users, each with their 4-letter MBTI code along with 50 post utterances. Each record contains:

·       Type (the person’s 4-letter MBTI code)

·       A section of each of the last 50 things they have posted (Each entry separated by “|||” (3 pipe characters))
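As a rough sketch of loading this data with pandas (the file name mbti_1.csv and the column names 'type' and 'posts' are assumptions about the Kaggle CSV, not taken from the text above):

import pandas as pd

# Load the Kaggle MBTI dataset (assumed file/column names: mbti_1.csv, 'type', 'posts')
df = pd.read_csv("mbti_1.csv")

# Each row holds one user's 4-letter type and their last 50 posts joined by "|||"
df["post_list"] = df["posts"].apply(lambda p: p.split("|||"))

print(df.shape)                     # expected: 8675 rows
print(df["type"].value_counts())    # distribution of users across the 16 types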

Feature Engineering

Feature engineering is the process of using your knowledge about the data and about the machine-learning algorithms at hand to make the algorithm work better by applying hardcoded transformations to the data before it goes to the machine learning model.

Text Preprocessing

·       Remove all irrelevant characters such as any nonalphanumeric characters

·       Tokenize your text by separating it into individual words

·       Remove words that are not relevant, such as “@” Twitter mentions or URLs

·       Convert all characters to lowercase, to treat words such as “hello”, “Hello”, and “HELLO” the same

·       Consider combining misspelled or alternately spelled words to a single representation (e.g. “cool”/”kewl”/”cooool”)

·       Consider lemmatization (reduce words such as “am”, “are”, and “is” to a common form such as “be”)

·       After following these steps and checking for additional errors, we can start using the clean, labeled data to train models!
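A minimal sketch of these steps in Python, assuming NLTK is used for tokenization and lemmatization (the helper name clean_text and the exact regular expressions are illustrative):

import re
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Requires one-time downloads: nltk.download('punkt'); nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Apply the preprocessing steps listed above to a single post."""
    text = re.sub(r"https?://\S+", " ", text)        # remove URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)         # remove non-alphabetic characters
    text = text.lower()                              # normalise case
    tokens = word_tokenize(text)                     # split into individual words
    return [lemmatizer.lemmatize(tok) for tok in tokens]   # reduce words to base forms

print(clean_text("Hello!!! Check https://example.com, these cats ARE cool"))
# ['hello', 'check', 'these', 'cat', 'are', 'cool']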

Text Vectorization

Machine learning with natural language is faced with one major hurdle – its algorithms usually deal with numbers, and natural language is, well, text. So we need to transform that text into numbers, otherwise known as text vectorization.

 

One-hot encoding (Bag of Words)

We build a vocabulary of all the unique words in our dataset and associate a unique index with each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary; at each index in this list, we mark how many times the given word appears in our sentence. This is called a Bag of Words model, since it is a representation that completely ignores the order of words in our sentence. This is illustrated below.

Figure: Representing sentences as a Bag of Words. Sentences on the left, their vector representations on the right; each index in the vectors represents one particular word in the vocabulary.
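A short sketch of this representation using scikit-learn’s CountVectorizer (the example sentences are illustrative; note that the default tokenizer ignores single-character words):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the cat sat", "the cat sat on the cat"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)       # word counts per sentence

print(vectorizer.get_feature_names_out())       # ['cat' 'on' 'sat' 'the']
print(bow.toarray())
# [[1 0 1 1]
#  [2 1 1 2]]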

 

TF/IDF (term frequency-inverse document frequency)

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document (term frequency), and the inverse document frequency of the word across the set of documents. Multiplying these two numbers gives the TF-IDF score of a word in a document; the higher the score, the more relevant that word is to that particular document.

 

Text vectorization transforms text within documents into numbers, so TF-IDF algorithms can rank articles in order of relevance.
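A minimal sketch using scikit-learn’s TfidfVectorizer (the toy documents are illustrative; in this project the vectorizer would be fitted on the cleaned forum posts):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(documents)       # sparse matrix: documents x vocabulary

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))              # higher score = more relevant to that document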


Model creation

Import Libraries
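A representative set of imports for this pipeline (a sketch; the exact libraries used in the original notebook are not shown in this document):

import re                                   # text cleaning
import numpy as np                          # numerics
import pandas as pd                         # data handling
from nltk.stem import WordNetLemmatizer     # lemmatization
from sklearn.feature_extraction.text import TfidfVectorizer   # text vectorization
from sklearn.model_selection import train_test_split          # train/test split
from sklearn.linear_model import LogisticRegression           # baseline classifier
from sklearn.metrics import accuracy_score                    # evaluation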


Preprocessing
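A sketch of the preprocessing stage applied to the whole dataset, reusing the assumed mbti_1.csv columns and the clean_text helper from the earlier sketches:

import pandas as pd

# Assumes mbti_1.csv with columns 'type' and 'posts', and the clean_text helper above
df = pd.read_csv("mbti_1.csv")
df["clean_posts"] = df["posts"].apply(lambda p: " ".join(clean_text(p)))

X_text = df["clean_posts"]      # cleaned text, one string per user
y = df["type"]                  # target: the 4-letter MBTI code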

We worked with multiple ML algorithms to reach a good output.
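An illustrative sketch of such a comparison (the specific models are assumptions for demonstration, not the project’s reported results; X_text and y come from the preprocessing sketch above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Vectorize the cleaned posts and split into train/test sets
tfidf = TfidfVectorizer(max_features=5000, stop_words="english")
X = tfidf.fit_transform(X_text)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Compare several classifiers on the same split
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=200),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))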