authorshipAttributionInUrduPoetry
This work addresses authorship attribution on an Urdu-language poetry dataset. The dataset is collected from five non-social websites, with sources such as Urdu Library, Iqbal, and Rekhta. It is then preprocessed both manually and programmatically by removing punctuation such as commas, semicolons, and colons. Machine learning algorithms and neural networks are then trained and tested on the cleaned dataset using the gensim and sklearn libraries. The models used are Support Vector Machine, Multinomial Naïve Bayes, Multilayer Perceptron (MLP), and a pre-trained word2vec model. Of these, the Support Vector Machine performed best, with an accuracy of 82.85% and a precision of 83.0%. The main focus of the thesis was to enlarge the dataset, which consists of unique couplets by three poets, and to increase the accuracy of the trained models.
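The programmatic part of the preprocessing might look roughly like the sketch below. It is only an illustration, not the project's actual code: the function names `clean_couplet` and `unique_couplets` are hypothetical, and the choice to strip every character in a Unicode punctuation category (which covers both Latin marks and Urdu marks such as "،", "؛" and "۔") is an assumption about what "punctuation" includes here. It uses only the Python standard library.

```python
import unicodedata


def clean_couplet(text: str) -> str:
    """Replace punctuation with spaces, then collapse runs of whitespace.

    Assumption: "punctuation" means any character whose Unicode general
    category starts with "P". This covers ",", ";", ":" as well as the
    Arabic-script comma, semicolon, question mark and full stop.
    """
    chars = []
    for ch in text:
        if unicodedata.category(ch).startswith("P"):
            chars.append(" ")  # keep word boundaries intact
        else:
            chars.append(ch)
    # split() with no argument drops leading, trailing and repeated whitespace
    return " ".join("".join(chars).split())


def unique_couplets(couplets: list[str]) -> list[str]:
    """Clean each couplet and drop duplicates, preserving first-seen order.

    Hypothetical helper: two couplets count as duplicates when their
    cleaned text is identical.
    """
    cleaned = (clean_couplet(c) for c in couplets)
    # dict keys keep insertion order, so this is an order-preserving dedupe
    return list(dict.fromkeys(c for c in cleaned if c))
```

Calling `unique_couplets` on the raw lines gathered from each website would give a cleaned, duplicate-free list that can then be vectorized and passed to the classifiers. Replacing punctuation with a space rather than deleting it avoids accidentally joining two words that were separated only by a comma.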