  • Stars: 124
  • Rank 288,207 (Top 6 %)
  • Language: Python
  • License: MIT License
  • Created about 6 years ago
  • Updated over 1 year ago

Repository Details

Social Analysis based on WhatsApp data

SoAn

Code for applying natural language processing methods to WhatsApp conversations

SoAn (Social Analysis) can be used to extract word frequency, word clouds, TF-IDF, sentiment analysis, and more from WhatsApp conversations. The main application was initially used to analyze the messages between my wife and me, but I extended it so that it can be used for your own messages.

Table of Contents

  1. Instructions

  2. Output

    a. General Plots

    b. TF-IDF

    c. Emoji

    d. Sentiment

    e. Word Clouds

    f. Topic Modeling

1. Instructions

There are several steps for using this repository:

  • Download or fork this repository
  • Install the requirements with pip install -r requirements.txt
  • Save your whatsapp.txt file in the data folder
    • To export your WhatsApp messages, open WhatsApp, go to a conversation, tap the three vertical dots, and export the chat
  • Finally, from the command line, run the following:
    • python soan.py --file whatsapp.txt --language english
  • The results will be saved as images and text files in the results folder

In the notebooks folder, you will also find the soan.ipynb where you can run individual pieces of the code.
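
The layout of an exported chat differs between platforms and locales, so the snippet below is only a minimal sketch of how such a file might be read. It assumes one common Android-style layout (`date, time - author: message`); the regular expression and the `parse_chat` helper are illustrative and are not taken from soan.py.

```python
import re

# Assumed line layout (it varies by phone and locale):
#   12/31/20, 21:15 - Her: Happy new year!
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")


def parse_chat(lines):
    """Return a list of (date, time, author, message) tuples.

    Lines that do not match the assumed layout are treated as
    continuations of the previous message (multi-line messages).
    Timestamped system notices without an author are not handled.
    """
    messages = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            messages.append(list(match.groups()))
        elif messages:
            messages[-1][3] += "\n" + line
    return [tuple(m) for m in messages]


sample = [
    "12/31/20, 21:15 - Her: Happy new year!\n",
    "and see you tomorrow\n",
    "12/31/20, 21:16 - Me: You too\n",
]
parsed = parse_chat(sample)
for date, time, author, message in parsed:
    print(date, time, author, repr(message))
```

If your export uses a different layout (for example square brackets around the timestamp), the pattern in `LINE_RE` would need to be adapted.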

2. Output

2.a General Plots

Four types of plots are generated:

  • Messages over time
  • Active days of each user
    • Spider
    • Histogram
  • Active hours of each user
  • Calendar plot

Two types of statistics are generated:

  • General statistics (text frequency, etc.)
  • Timing

Below are some examples of the plots above:

Below are some examples of the text generated:

##########################
Number of Messages
##########################

4444 Her
3266 Me

#########################
Messages per hour
#########################

Her: 0.1259887165820883
Me: 0.09259206758710628
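
As a rough illustration of how figures like these could be derived, the sketch below counts messages per user and divides by the number of hours between the first and last message. That definition of "messages per hour" is an assumption, since the formula is not stated here, and the `messages` data is made up.

```python
from collections import Counter
from datetime import datetime

# Hypothetical parsed messages: (timestamp, author)
messages = [
    (datetime(2020, 1, 1, 9, 0), "Her"),
    (datetime(2020, 1, 1, 9, 30), "Me"),
    (datetime(2020, 1, 1, 12, 0), "Her"),
    (datetime(2020, 1, 1, 19, 0), "Her"),
]

# Number of messages per user
counts = Counter(author for _, author in messages)

# Assumed definition: messages divided by the hours spanned by the chat
timestamps = [ts for ts, _ in messages]
hours = (max(timestamps) - min(timestamps)).total_seconds() / 3600
per_hour = {author: n / hours for author, n in counts.items()}

for author, n in counts.most_common():
    print(n, author)
for author, rate in per_hour.items():
    print(f"{author}: {rate}")
```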

2.b TF-IDF

Using a class-based TF-IDF, I extract the most important words per person and plot them in a horizontal bar chart that uses an image as a mask. The chart consists of two bars stacked on top of each other, both plotted on a background image. I started with the background image and plotted the actual values on the left as fully transparent bars with a white border to separate them. Then, on top of that, I plotted white bars so that the right part of the image is hidden.

NOTE: In the notebook, you will see more instructions on how to use your own image.
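
To make the idea of a class-based TF-IDF more concrete, here is a minimal sketch using only the standard library. It treats all messages of one user as a single document and uses the weighting tf × log(1 + A / f), where A is the average number of words per class and f is the total frequency of the word. This is one published variant (the one described for BERTopic) and may differ from what SoAn computes; the `class_tfidf` helper and the sample chat are hypothetical.

```python
import math
from collections import Counter


def class_tfidf(docs_per_class):
    """Score words per class with a class-based TF-IDF variant.

    docs_per_class maps a class label (here: a user) to a list of messages.
    All messages of one class are joined into a single bag of words.
    """
    counts = {
        label: Counter(" ".join(docs).lower().split())
        for label, docs in docs_per_class.items()
    }

    # Total frequency of each word across all classes
    total_per_word = Counter()
    for counter in counts.values():
        total_per_word.update(counter)

    # Average number of words per class
    avg_words = sum(total_per_word.values()) / len(counts)

    scores = {}
    for label, counter in counts.items():
        n_words = sum(counter.values())
        scores[label] = {
            word: (freq / n_words) * math.log(1 + avg_words / total_per_word[word])
            for word, freq in counter.items()
        }
    return scores


chat = {
    "Her": ["good morning love", "love you"],
    "Me": ["good morning", "groceries today"],
}
scores = class_tfidf(chat)
top_her = max(scores["Her"], key=scores["Her"].get)
print(top_her)
```

Words that both users share, such as "good", receive a lower weight than words that are frequent for one user only.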

2.c Emoji

These analyses are based on the emoji used in each message. Below you can find the following:

  • Unique Emoji per user
  • Commonly used Emoji per user
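
The two emoji summaries above could be computed roughly as follows. This is a sketch under assumptions: emoji are detected with a simple regular expression covering a few common code point ranges (a dedicated emoji library would be more complete), and "unique" is interpreted as emoji used by one user and nobody else. The sample messages are made up.

```python
import re
from collections import Counter, defaultdict

# Rough approximation: a few common emoji code point ranges only
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

messages = [
    ("Her", "Good morning \U0001F618\U0001F618"),
    ("Me", "On my way \U0001F618\U0001F697"),
    ("Her", "See you \U0001F389"),
]

# Count every emoji occurrence per user
emoji_counts = defaultdict(Counter)
for author, text in messages:
    emoji_counts[author].update(EMOJI_RE.findall(text))

# Most commonly used emoji per user
common = {author: c.most_common(1)[0][0] for author, c in emoji_counts.items()}

# Emoji used by one user and nobody else (assumed meaning of "unique")
unique = {}
for author, counter in emoji_counts.items():
    others = set()
    for other, other_counter in emoji_counts.items():
        if other != author:
            others.update(other_counter)
    unique[author] = set(counter) - others

print(common)
print(unique)
```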

2.d Sentiment Analysis

The sentiment of each sentence in the messages is extracted per user using VADER and visualized in a plot.
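
A per-user sentiment score could be sketched as below. The aggregation (a plain mean per user, scoring each message text as a whole) is an assumption, and both helper names are hypothetical. The VADER call uses the `vaderSentiment` package and its `SentimentIntensityAnalyzer.polarity_scores` method, imported lazily so the averaging logic can be tried without the package installed.

```python
from collections import defaultdict


def mean_sentiment_per_user(messages, score_fn):
    """Average a text-level score per user.

    messages: iterable of (author, text); score_fn: text -> float.
    """
    scores = defaultdict(list)
    for author, text in messages:
        scores[author].append(score_fn(text))
    return {author: sum(vals) / len(vals) for author, vals in scores.items()}


def vader_compound():
    """Return a scorer backed by VADER (needs `pip install vaderSentiment`)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # "compound" is VADER's normalized overall score between -1 and 1
    return lambda text: analyzer.polarity_scores(text)["compound"]


# Usage with VADER installed:
#   messages = [("Her", "I love this!"), ("Me", "That was awful.")]
#   print(mean_sentiment_per_user(messages, vader_compound()))
```

Passing the scorer as an argument keeps the aggregation independent of the sentiment model, so another scorer can be swapped in for languages that VADER's English lexicon does not cover.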

2.e Word Clouds

For each user, a word cloud is made based on frequent and important words. Stopwords are removed if you have supplied the language.
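
The input to a word cloud is essentially a table of word frequencies with stopwords removed. A minimal sketch of that step, with a tiny made-up stopword list and a hypothetical `word_frequencies` helper, might look like this:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one depends on the chosen language
STOPWORDS = {"the", "a", "to", "and", "you", "i"}


def word_frequencies(texts, stopwords=STOPWORDS):
    """Count lowercase word tokens across texts, skipping stopwords."""
    # [^\W\d_]+ matches runs of letters, including accented ones
    words = re.findall(r"[^\W\d_]+", " ".join(texts).lower())
    return Counter(w for w in words if w not in stopwords)


freqs = word_frequencies(["I love the beach", "Love to walk", "the beach and you"])
print(freqs.most_common(3))
```

A frequency mapping like `freqs` can then be handed to a plotting library, for example the `generate_from_frequencies` method of the `wordcloud` package.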

2.f Topic Modeling

For each user, the most frequent topics are modeled using LDA and NMF and saved to a .txt file:

Me

Topics in nmf model:
Topic #0: ga boodschappen nodig lieverd halen uurtje half
Topic #1: thuis wel goed haha lekker we morgen
Topic #2: lieverd dank hey fijn allerliefste plezier verwacht
Topic #3: gezellig jeey super jeeeey erg hartstikke samen
Topic #4: love you most more schattie much very

Visualizations Wife
Below, you will find an overview of the visualizations I made for my wife, in part using this package:
