  • Stars: 112
  • Rank: 312,240 (Top 7%)
  • Language: Jupyter Notebook
  • Created: almost 4 years ago
  • Updated: 12 months ago


Repository Details

Data Science algorithms and topics that you must know. (Newly Designed) Recommender Systems, Decision Trees, K-Means, LDA, RFM-Segmentation, XGBoost in Python, R, and Scala.

TopN Movie Recommender Case Study

View Paper here

View Python Code here

The MovieLens dataset, consisting of 20M ratings and 466K tag applications across 27K movies spanning a period of 20 years (January 1995 to March 2015), is used to construct a simple Top-N Movie Recommender that generates N recommendations for a user using item-based nearest-neighbor Collaborative Filtering. In item-based CF, the system finds similar items and, assuming that similar items will be rated in a similar way by the same person, predicts the rating a user would assign to an item. Item-based CF is usually preferred over user-based CF because users can change their preferences and choices (aging, change of environment), whereas items remain (relatively) stable over time.
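As a rough illustration of the item-based nearest-neighbor idea (a minimal numpy sketch, not the case study's actual code; the tiny ratings matrix and all names here are hypothetical):

```python
import numpy as np

# Hypothetical user-item ratings matrix (rows: users, cols: movies; 0 = unrated).
# For simplicity, unrated entries are left in the similarity computation.
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Item-item cosine similarity (columns are items).
norms = np.linalg.norm(R, axis=0)
sim = (R.T @ R) / np.outer(norms, norms)

def predict(user, item, k=2):
    """Similarity-weighted average of the user's ratings on the k most similar rated items."""
    rated = np.nonzero(R[user])[0]                        # items this user has rated
    neighbors = rated[np.argsort(sim[item, rated])][-k:]  # k nearest rated items
    w = sim[item, neighbors]
    return w @ R[user, neighbors] / w.sum()

print(predict(user=0, item=2))  # predicted rating of user 0 for movie 2
```

Top-N recommendations then follow by predicting ratings for all of a user's unrated items and keeping the N highest.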

The files consist of the following parts:

  • Introduction to recommender systems
  • Description and Descriptive Statistics for MovieLens data
  • MovieLens data visualization
  • Methodology: description of Collaborative Filtering (CF) and KNN algorithms
  • Top-N Movie Recommender System Algorithm step-by-step
  • Evaluation of the results

Publications:

  • Aggarwal, C. (2016). Recommender Systems. Thomas J. Watson Research Center.
  • Billsus, D. and Pazzani, M. J. (1998). Learning collaborative information filters. In Proceedings of the Fifteenth International Conference on Machine Learning, pages 46–54. Morgan Kaufmann Publishers Inc.

Linear Discriminant Analysis (LDA) Algorithm

View R Code here
Note: the code contains manually written LDA and robust LDA functions (checked against the library functions' output).

Linear discriminant analysis (LDA) (not to be confused with Latent Dirichlet Allocation, which is a topic-modelling technique) is a generalization of Fisher's linear discriminant, a statistical method for finding a linear combination of features that characterizes or separates two or more classes of objects. The resulting combination may be used as a linear classifier. LDA is closely related to analysis of variance (ANOVA) and regression analysis, which also attempt to express one (dependent) variable as a linear combination of other (independent) variables. However, ANOVA uses a continuous dependent variable and categorical independent variables, whereas LDA uses a categorical dependent variable (the classes of LDA) and continuous independent variables. Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical (dependent) variable by the values of continuous (independent) variables. The key difference between logistic/probit regression and LDA is the assumption about the probability distribution of the explanatory (independent) variables: in LDA, the fundamental assumption is that the independent variables are normally distributed. This can be checked by inspecting the probability distribution of the variables.
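A minimal sketch of LDA as a classifier, using scikit-learn's implementation rather than the manually written R functions linked above (dataset choice is an arbitrary illustration):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # continuous features, categorical target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

lda = LinearDiscriminantAnalysis()  # assumes normally distributed features
lda.fit(X_tr, y_tr)
print(lda.score(X_te, y_te))        # classification accuracy on held-out data

# The fitted discriminants also give a supervised low-dimensional projection:
X_proj = lda.transform(X_tr)        # shape (n_samples, n_classes - 1)
```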

Publications:

  • Nasar, S., Aldian, A., Nuredin, J., and Abusaeeda, I. (2016). Classification depend on linear discriminant analysis using desired outputs. 1109(10).
  • Zhao, H., Wang, Z., and Nie, F. (2019). A New Formulation of Linear Discriminant Analysis for Robust Dimensionality Reduction. 31(4):629–640.

K-Means Algorithm

View Paper here

View R Code here

K-means clustering is a method of vector quantization whose goal is to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. K-means clustering minimizes within-cluster variances (squared Euclidean distances). The algorithm is also referred to as Lloyd's algorithm, particularly in the computer science community, and sometimes as "naïve k-means", because much faster alternatives exist. The target number k must be pre-determined; it is the number of centroids you need in the dataset. A centroid is the imaginary or real location representing the center of a cluster. Every data point is allocated to one of the clusters by reducing the in-cluster sum of squares. In other words, the k-means algorithm identifies k centroids and then allocates every data point to the nearest cluster, while keeping the within-cluster variation as small as possible. The "means" in k-means refers to averaging of the data, that is, finding the centroid.
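A minimal sketch of naïve k-means (Lloyd's algorithm) in numpy; a production implementation would also handle empty clusters and use multiple restarts:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: each point goes to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        # (Empty-cluster handling is omitted for brevity.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs as toy data.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```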

Publications:

  • Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21:768–769.
  • Na, S., Xumin, L., and Yong, G. (2010). Research on k-means Clustering Algorithm: An Improved k-means Clustering Algorithm. pp. 63–67.

Decision Tree Algorithm

View Paper here

View R Code here

A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including event probabilities. In machine learning, this algorithm is often referred to as Decision Tree Learning, one of the predictive modelling approaches used in statistics, data mining, and machine learning. It uses a decision tree (as a predictive model) to partition the sample of observations into groups (represented by the leaves of the tree). There are two types of decision trees: classification and regression trees. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (usually real numbers) are called regression trees. Because of their intelligibility and simplicity, decision tree algorithms are considered among the most popular ML algorithms.
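A minimal classification-tree sketch with scikit-learn (the repository's own code is in R; this is only an illustration, and the dataset is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Classification tree: leaves hold class labels, branches hold feature tests.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The fitted tree is directly readable as feature-threshold rules,
# which is the "intelligibility" advantage mentioned above.
print(export_text(clf))

# A regression tree (continuous target) would use DecisionTreeRegressor instead.
```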

Publications:

  • Haughton, D. and Oulabi, S. (1993). Direct marketing modeling with cart and chaid. Journal of direct marketing, 7(3):16–26
  • Duchessi, P. and Lauria, E. (2013). Decision tree models for profiling ski resorts’ promotional and advertising strategies and the impact on sales. 40(15):5822–5829.

Cluster Dynamics Algorithm

View R Code here

Once customers have been clustered into classes, for example Good, Better, and Best, it can also be very helpful to predict the likelihood of each customer moving to another cluster. If you have computed the likelihood of a customer belonging to each of these three classes, for instance by using Decision Trees, then all you have to do is compare the customer's current class with the class this customer is most likely to move to.

If the two are the same, the customer is likely to stay in the same class. If, for example, the customer belongs to class Better and has the likelihood distribution Good 0.8, Better 0.1, Best 0.1, then the client is most likely to move from the Better class to the Good class. This is what this algorithm is designed to do for a set of customers, given their class distributions.
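A minimal sketch of that comparison, assuming the class likelihoods have already been estimated (all names and numbers below are hypothetical, e.g. the output of a decision tree's predict_proba):

```python
# Classes ordered as (Good, Better, Best); probabilities sum to 1 per customer.
classes = ("Good", "Better", "Best")
customers = {
    # name: (current_class, (P_Good, P_Better, P_Best))
    "A": ("Better", (0.8, 0.1, 0.1)),  # the example from the text above
    "B": ("Best",   (0.1, 0.2, 0.7)),
}

for name, (current, probs) in customers.items():
    likely = classes[probs.index(max(probs))]  # most likely class
    if likely == current:
        print(f"Customer {name} is likely to stay in {current}")
    else:
        print(f"Customer {name} is likely to move from {current} to {likely}")
```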

XGBoost Algorithm

View Scala Code here

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now. XGBoost and Gradient Boosting Machines (GBMs) are both ensemble tree methods that apply the principle of boosting weak learners (CARTs generally) using the gradient descent architecture. However, XGBoost improves upon the base GBM framework through systems optimization and algorithmic enhancements.
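The repository's implementation is in Scala; for illustration, here is a minimal Python sketch with the xgboost package (the synthetic dataset and hyperparameter values are arbitrary assumptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Gradient-boosted CARTs: each new tree fits the gradient of the loss
# of the ensemble built so far (boosting of weak learners).
model = xgb.XGBClassifier(
    n_estimators=200,   # number of boosting rounds (trees)
    max_depth=4,        # depth of each weak learner
    learning_rate=0.1,  # shrinkage applied to each tree's contribution
)
model.fit(X_tr, y_tr)
print(model.score(X_te, y_te))  # held-out accuracy
```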

(Figure: Linear Regression plot)

Publications:

  • Chen, T. and Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.

RFM Customer Segmentation Algorithm

View Paper here

View R Code here

RFM stands for recency, frequency, and monetary value. RFM segmentation is a great method to identify groups of customers for special treatment. It allows marketers to target specific clusters of customers with communications that are much more relevant to their particular behavior, and thus to generate much higher response rates, plus increased loyalty and customer lifetime value. Data such as purchase history, browsing history, prior campaign response patterns, and demographics can all be used to identify specific groups of customers that can be addressed with offers relevant to each. The three dimensions are defined below, followed by a minimal scoring sketch.

  • Recency: How much time has elapsed since a customer’s last activity or transaction with the brand?
  • Frequency: How often has a customer transacted or interacted with the brand during a particular period of time?
  • Monetary: Also referred to as “monetary value,” this factor reflects how much a customer has spent with the brand during a particular period of time.
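A minimal pandas sketch of RFM scoring on a hypothetical transaction log (the 1-3 tercile scoring scheme here is an illustrative assumption, not necessarily the paper's exact method):

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "date": pd.to_datetime(["2024-01-05", "2024-03-01", "2024-02-10",
                            "2024-02-20", "2024-03-05", "2023-11-30"]),
    "amount": [50.0, 30.0, 20.0, 25.0, 40.0, 300.0],
})
now = tx["date"].max()  # reference date for recency

rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (now - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                       # number of purchases
    monetary=("amount", "sum"),                        # total spend
)

# Score each dimension 1-3 by tercile; more recent = higher R score.
rfm["R"] = pd.qcut(rfm["recency"], 3, labels=[3, 2, 1]).astype(int)
rfm["F"] = pd.qcut(rfm["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
rfm["M"] = pd.qcut(rfm["monetary"], 3, labels=[1, 2, 3]).astype(int)

# Concatenated scores give segment labels such as "311" or "133".
rfm["segment"] = rfm["R"].astype(str) + rfm["F"].astype(str) + rfm["M"].astype(str)
print(rfm)
```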

More Repositories

1. mathematics-statistics-for-data-science (R, 122 stars)
   Mathematical & Statistical topics to perform statistical analysis and tests; Linear Regression, Probability Theory, Monte Carlo Simulation, Statistical Sampling, Bootstrapping, Dimensionality Reduction techniques (PCA, FA, CCA), Imputation techniques, Statistical Tests (Kolmogorov–Smirnov), Robust Estimators (FastMCD), and more in Python and R.

2. TatevKaren-data-science-portfolio (Jupyter Notebook, 59 stars)
   Data Science Portfolio of Tatev Karen Aslanyan, including Case Studies and Research Projects that solve business problems or introduce new products. Case Study papers, code, and additional resources are all included.

3. recurrent-neural-network-pricing-model (Python, 50 stars)
   Price Prediction Case Study predicting the Bitcoin price and the Google stock price using Deep Learning: an RNN with LSTM layers, with TensorFlow and Keras in Python. (Includes: Data, Case Study Paper, Code)

4. BabyGPT-Build_GPT_From_Scratch (Python, 49 stars)
   BabyGPT: Build Your Own GPT Large Language Model from Scratch. Pre-training generative transformer models: building GPT from scratch with a step-by-step guide to generative AI in PyTorch and Python.

5. artificial-neural-network-business_case_study (Python, 47 stars)
   Business Case Study to predict customer churn rate based on an Artificial Neural Network (ANN), with TensorFlow and Keras in Python. A customer churn analysis that contains training, testing, and evaluation of an ANN model. (Includes: Case Study Paper, Code)

6. free-resources-books-papers (38 stars)
   Books and papers in Mathematics, Econometrics, Machine Learning, Finance, etc. for different levels, useful for Data Scientists, Developers, and everyone who is interested in STEM.

7. econometric-algorithms (Stata, 34 stars)
   Popular Econometrics content with code; Simple Linear Regression, Multiple Linear Regression, OLS, Event Study including Time Series Analysis, Fixed Effects and Random Effects Regressions for Panel Data, Heckman 2-Step for selection bias, and the Hausman-Wu test for Endogeneity in Python, R, and STATA.

8. convolutional-neural-network-image_recognition_case_study (Python, 19 stars)
   Computer Vision Case Study in image recognition, classifying an image to a binary class based on Convolutional Neural Networks (CNN), with TensorFlow and Keras in Python: identifying from an image whether it is a dog or a cat. (Includes: Data, Case Study Paper, Code)

9. Finance-Projects (MATLAB, 17 stars)
   Case Studies in Finance: Stock Price Valuation using Black-Scholes with Brownian Motions, an Investment Project comparing Stocks and Bonds, and Determining a Pension Fund's Premium. (Case Study Papers and Code)

10. CaseStudies (Jupyter Notebook, 12 stars)

11. multivariate-statistics (R, 8 stars)
    Case Study in ranking U.S. cities based on a single linear combination of rating variables. Dimensionality techniques used in the analysis: Principal Component Analysis (PCA), Factor Analysis (FA), and Canonical Correlation Analysis (CCA).

12. PySpark_Tutorial (Python, 8 stars)
    PySpark Tutorial.

13. Deep-Learning-for-Data-Science (Python, 8 stars)
    Deep Learning Case Studies with TensorFlow and Keras for Beginners to Advanced: ANN, CNN, RNN, Self-Organizing Maps, Boltzmann Machines, Stacked Autoencoders.

14. DataStructuresAlgorithmsCourse (Python, 4 stars)

15. Python-For-Data-Science (Python, 4 stars)

16. Simple-convolutional-neural-network (3 stars)
    Simple Convolutional Neural Network.

17. What-makes-playlist-successful (Python, 2 stars)
    Product Data Science Case Study: What Makes a Playlist Successful. Uses EDA (Exploratory Data Analysis) and simple Machine Learning to identify the features of successful playlists.

18. Predicting-Jop-Postings-Salary (Python, 1 star)
    Predicting Salaries of Job Applications for the Job Search Engine Indeed using Machine Learning, with Python implementation.