1. Executive Summary
The Myers-Briggs Type Indicator (MBTI) is a personality test that describes a person’s
preferences and behavior in terms of key personality traits. It is
one of the most popular personality tests in the world, used in business,
online, in research, and in many other settings. It helps in identifying one’s strengths and
weaknesses in terms of personality. However, its
use and validity have come into question because of unreliable results in many experiments,
yet it is still regarded as a very useful tool in a lot of areas.
2. Problem Statement
This project builds a ‘Personality Test Model’ based on the MBTI assessment,
using social media comment data, in order to gauge the relevance of the well-known
MBTI system. The purpose is to see whether any
patterns can be detected between specific types and their style of writing, which
overall explores the validity of the test in analyzing, predicting, or categorizing
behavior, and later to test the models by feeding them a few writing samples and
predicting the trait.
3. Analytics Rationale Statement
Personality plays a key role in predicting many aspects
of an individual, such as mental and physical health, career fit, and well-being.
Hence, gaining deep insight into a person’s personality type is of key
importance. Consider, for example, creating teams
to fight crime, develop amazing software, or simply play sports: it is important
to consider everyone’s technical position on the team, but it is also
deeply important to explore soft skills to ensure strengths and weaknesses
are balanced. This allows another layer of interdisciplinarity, not only in
technical skill set but also in mindset and personality.
Myers-Briggs Classification Problem
Introversion (I) – Extroversion (E)  |  The Attitudes of Consciousness
Intuition (N) – Sensing (S)  |  The Functions of Consciousness
Thinking (T) – Feeling (F)  |  The Functions of Consciousness
Judging (J) – Perceiving (P)
(Note that the opposing personality preferences are paired above
to give a sense of how their meanings differ when compared with each other.)
Introversion | Tend to be inward turning, or focused more on internal thoughts, feelings, and moods rather than seeking out external stimulation.
Extroversion | A personality trait typically characterized by outgoingness, high energy, and/or talkativeness.
Intuition | Refers to a deeper perception of inherent possibilities and inner meanings. It never directly reflects reality but actively, creatively, insightfully, and imaginatively adds meaning by reading things into the situation.
Sensing | Refers to our immediate experience of the objective world that is available to the senses.
Thinking | Based upon the intellectual comprehension of things through analysis and logical inference.
Feeling | Involves judging the value of things or having an opinion about them based on our likes and dislikes.
Judging | Reflects a closed, organized, decisive approach.
Perceiving | More open, flexible, and curious.
For example, an INTP label (I - introversion, N - intuition, T - thinking, and P - perceiving)
bundles four personality-preference components that together model or describe the person’s
preferences or behavior; a small decoding sketch follows below.
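As an illustrative sketch (the helper name and dimension labels below are assumptions for clarity, not part of the project code), a four-letter label such as INTP can be decoded into its four preference dimensions in Python:

# Map each possible letter in a position to its preference name.
DIMENSIONS = [
    {"I": "Introversion", "E": "Extroversion"},
    {"N": "Intuition",    "S": "Sensing"},
    {"T": "Thinking",     "F": "Feeling"},
    {"J": "Judging",      "P": "Perceiving"},
]

def decode_mbti(code):
    # Pair each letter of the code with its dimension and look up the preference.
    return [dim[letter] for dim, letter in zip(DIMENSIONS, code.upper())]

print(decode_mbti("INTP"))  # ['Introversion', 'Intuition', 'Thinking', 'Perceiving']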
Data Acquisition
We use an MBTI dataset from Kaggle containing posts
written by users on the PersonalityCafe online forum, labeled by their personality
type. The dataset is composed of 8675 unique users, each with their 4-letter
MBTI code along with 50 post utterances. The distribution of the data by type is
displayed in the accompanying figure. The “(MBTI) Myers-Briggs Personality Type Dataset” contains:
·         Type (the person’s 4-letter MBTI code/type)
·         A section of each of the last 50 things they have posted (each entry separated by “|||”, 3 pipe characters); a loading sketch follows below
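As a minimal loading sketch (assuming the Kaggle file is named mbti_1.csv with the columns type and posts; adjust names to the actual download):

import pandas as pd

# One row per user: a 4-letter "type" label and a "posts" string
# holding that user's last 50 posts joined by "|||".
df = pd.read_csv("mbti_1.csv")

# Split each user's posts back into a list of individual utterances.
df["post_list"] = df["posts"].apply(lambda p: p.split("|||"))

print(df["type"].value_counts())   # distribution of the 16 MBTI types
print(df["post_list"].iloc[0][:3]) # first few posts of the first user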
Feature Engineering
Feature engineering is the process of using your knowledge about the data
and about the machine-learning algorithms at hand to make the algorithm work
better by applying hardcoded transformations to the data before it goes to the
machine learning model.
·         Remove all irrelevant characters, such as any non-alphanumeric characters
·         Tokenize your text by separating it into individual words
·         Remove words that are not relevant, such as “@” Twitter mentions or URLs
·         Convert all characters to lowercase, to treat words such as “hello”, “Hello”, and “HELLO” the same
·         Consider combining misspelled or alternately spelled words into a single representation (e.g. “cool”/“kewl”/“cooool”)
·         Consider lemmatization (reduce words such as “am”, “are”, and “is” to a common form such as “be”)
·         After following these steps and checking for additional errors, we can start using the clean, labeled data to train models (a sketch of these cleaning steps appears after this list)!
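A minimal cleaning sketch along these lines, assuming NLTK’s WordNet lemmatizer is available (the regular expressions and the function name are illustrative, not the project’s exact pipeline):

import re
from nltk.stem import WordNetLemmatizer  # requires the NLTK 'wordnet' corpus to be downloaded

lemmatizer = WordNetLemmatizer()

def clean_post(text):
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove @ mentions
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove non-alphanumeric characters
    tokens = text.lower().split()                # lowercase and tokenize on whitespace
    # Note: without part-of-speech tags the lemmatizer treats words as nouns,
    # so verbs like "is" are not reduced to "be" in this simple sketch.
    return [lemmatizer.lemmatize(tok) for tok in tokens]

print(clean_post("Check this out http://example.com @friend, that is SO cool!"))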
Machine learning with natural language is faced with one major hurdle –
its algorithms usually deal with numbers, and natural language is, well, text.
So we need to transform that text into numbers, otherwise known as text
vectorization.
One-hot encoding (Bag of Words)
We build a vocabulary of all the unique words in our dataset and
associate a unique index with each word in the vocabulary. Each sentence is then
represented as a list that is as long as the number of distinct words in our
vocabulary. At each index in this list, we mark how many times the given word
appears in our sentence. This is called a Bag of Words model since it is a
representation that completely ignores the order of words in our sentence. This
is illustrated below.
[Figure: Representing sentences as a Bag of Words. Sentences on the left, representation on the right; each index in the vectors represents one particular word.]
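A minimal bag-of-words sketch using scikit-learn’s CountVectorizer (the example sentences are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "quiet evenings are great for reading",
    "great parties and great people",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)   # one row per sentence, one column per vocabulary word

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(bow.toarray())                        # how many times each word appears in each sentence

Note that the counts capture which words appear and how often, but not their order, which is exactly the bag-of-words simplification described above.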
TF/IDF (term frequency-inverse document frequency)
TF-IDF is a statistical measure that evaluates how relevant a word is to
a document in a collection of documents. This is done by multiplying two
metrics: how many times a word appears in a document, and the inverse document
frequency of the word across a set of documents. Multiplying these two numbers
results in the TF-IDF score of a word in a document. The higher the score, the
more relevant that word is in that particular document.
Text vectorization transforms text within documents into numbers, so
TF-IDF algorithms can rank articles in order of relevance.
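A minimal TF-IDF sketch using scikit-learn’s TfidfVectorizer (again with made-up documents):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "i prefer quiet evenings with a good book",
    "i love loud parties and meeting new people",
    "a good book and good coffee make a good evening",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # rows: documents, columns: vocabulary terms

# Words frequent in one document but rare across the collection receive higher scores.
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))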
Preprocessing