Personality Prediction from Text

Description

Data on personality types was gathered (MBTI and big five) for further information, see below.
The situation on the data was evaluated. There is much more MBTI data available which is scientifically less reliant, but there is only very few data on the BIG FIVE traits. Machine Learning algorithms thrive on data so a approach created to combine MBTI and BIG Five data.
The data from three different sources was converted in mutual form and preprocessed to the needs of ML algorithms
features from text were extracted to vectorize the data with bags of words and GloVe approach
several supervised classification learning algorithm were used and trained to predict on future unknown text
the results of the classifiers were evaluated
a predictor was developed who predicts traits and visualizes them:

Inspired by the paper from sentic.net this topic was chosen during studies at University of applied Sciences Wiener Neustadt, Computer Science, Data Science. Some time was spent on Dr. Jordan B. Petersons Personality lectures and some understanding was gathered on the BIG FIVE personality model and its applications. As I'm continuously stunned by the complexity of the human psyche, the motivations and desires of human beings I am of course also fascinated by Machine Learning applications, which are nothing else than the pursue to reverse engineer the human brain and discover the yet unknown algoritm of the human brain. So this first machine learning python application marks my start in the huge world of machine learning and artificial intelligence, please be critical.

The big five personality model

The Big Five personality traits, also known as the five-factor model (FFM) and the OCEAN model, is a taxonomy, or grouping, for personality traits. The five factors are:

Openness to experience (inventive/curious vs. consistent/cautious)
Conscientiousness (efficient/organized vs. easy-going/careless)
Extraversion (outgoing/energetic vs. solitary/reserved)
Agreeableness (friendly/compassionate vs. challenging/detached)
Neuroticism (sensitive/nervous vs. secure/confident)

Quoted from and further information: https://en.wikipedia.org/wiki/Big_Five_personality_traits

The Myers–Briggs Type Indicator

The Myers–Briggs Type Indicator (MBTI) is an introspective self-report questionnaire indicating differing psychological preferences in how people perceive the world and make decisions.

_	Subjective	Objective
Deductive	Intuition/Sensing	Introversion/Extraversion
Inductive	Feeling/Thinking	Perception/Judging

The combinations as four pairs of preferences lead to 16 possible combinations aka types. The 16 types are typically referred to by an abbreviation of four letters—the initial letters of each of their four type preferences (except in the case of intuition, which uses the abbreviation "N" to distinguish it from introversion). For instance:

ESTJ: extraversion (E), sensing (S), thinking (T), judgment (J) INFP: introversion (I), intuition (N), feeling (F), perception (P)

Quoted from and further information: https://en.wikipedia.org/wiki/Myers%E2%80%93Briggs_Type_Indicator

Differences and commonalities of the Big Five and MBTI

Adrian Furnham 1996 concludes a corellation of his 1996 paper The big five versus the big four: the relationship between the Myers-Briggs Type Indicator (MBTI) and NEO-PI five factor model of personality concludes a correlation between these traits from:

MBTI	Big Five
Intuition/Sensing	Openness to experience (corellates with N)
Feeling/Thinking	Agreeableness (correlates with F)
Perception/Judging	Conscientiousness (correlates with J)
Introversion/Extraversion	Extraversion (correlates with E)
not available in MBTI	Neuroticism

Goal of this project

Predicting personality traits in high accuracy with classifiers trained from text data which is labeled with the personality types.
gathering familiarty with machine learning core concepts
trying to find an approach to combine MBTI data with BIG FIVE data to increase amount of data to train machine learning classifiers

Results

EXT	NEU	AGR	CON	OPN
77.18	61.74	75.51	70.34	80.39

for detailed results

Tech overview

data for training

stream of consciousness essays "data/essays.csv"

This is the scientific gold standard from psychology, controlled environment collected stream of consciousness by James Pennebaker and Laura King labelled with Big Five personality traits. See: http://web.archive.org/web/20160519045708/http://mypersonality.org/wiki/doku.php?id=wcpr13

emotion lexicon "data/Emotion_Lexicon.csv"

For scentence filtering a lexicon containing ~ 14,000 words was used. Further Information: https://www.saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm

(MBTI) Myers-Briggs Personality Type Dataset "mbti_1.csv"

From Kaggle: This data was collected through the [PersonalityCafe forum(https://www.personalitycafe.com/forum/), as it provides a large selection of people and their MBTI personality type, as well as what they have written.

scraped data from reddit from "typed_comments.csv"

Props to Matej Gjurković, and his 2018 paper Reddit: A Gold Mine for Personality Prediction who provided me his scraped data from personality subreddits, where people show their personality types in the forum and therfore provide labelled text comments and posts. I cannot share the data.

Used Methods

Classification

SVM (sklearn)
Decision Tree (sklearn)
Naive Bayes (sklearn)
Logistic Regression (sklearn)
Random Forest (sklearn)

Feature extraction

Bags of Words (sklearn CountVectorizer)
GloVe pretrained https://nlp.stanford.edu/projects/glove/

scentence filtering

scentences which contain no emotional charge (meaning they contain no word of the emotion lexicon) will be removed before further preprocessing.

combining MBTI and BIG FIVE data

MBTI and BIG FIVE data was combined on the corellating traits. therefore the trait "neuroticism" from big five was lost. this explains the weaker results in the trait Neuroticism (NEU)

repo overview / how to use

if you just want to run with my pretrained models

just work with predict.ipynb and use your own text on the variable "text"
done, have fun predicting
if you want, check analysis_results.ipynb - this compares the feature extractions and classifiers in their score

if you want to train on your own (with gloVe)

download glove pretrained models and put in the folder data/pretrained:

run preprocessing.ipynb
- choose the data you want to combine (essays, kaggle mbti, and if you have access reddit)
- this saves the preprocessed data in data/essays
- this is required for further use to run the models
run model_glove.ipynb for using preprocessing with GloVe OR run model_bow.ipynb for using preprocessing with Bags of Words and sklearn CountVectorizer
work with predict.ipynb and use your own text on the variable "text"
if you want, check analysis_results.ipynb - this compares the feature extractions and classifiers in their score

further info: essay.py the class to save various data about the essays required

jkwieser/personality-prediction-from-text

jkwieser

Reviews

Repository Details