Building ML Powered Applications
Welcome to the companion code repository for the O'Reilly book Building ML Powered Applications. The book is available on Amazon.
This repository consists of three parts:
-
A set of Jupyter notebooks in the
notebook
folder serve to illustrate concepts covered in the book. -
A library in the
ml_editor
folder contains core functions for the book's case study example, a Machine Learning driven writing assistant. -
A Flask app demonstrates a simple way to serve results to users
-
The
images/bmlpa_figures
folder contains reproductions of a few figures which were hard to read in the first print version.
Credit and thanks go to Bruno Guisard who conducted a thorough review of the code in this repository.
Setup instructions
Python environment
This repository has been tested on Python 3.6 and 3.7. It aims to support any Python 3 version.
To setup, start by cloning the repository:
git clone https://github.com/hundredblocks/ml-powered-applications.git
Then, navigate to the repository and create a python virtual environment using virtualenv:
cd ml-powered-applications
virtualenv ml_editor
You can then activate it by running:
source ml_editor/bin/activate
Then, install project requirements by using:
pip install -r requirements.txt
The library uses a few models from spacy. To download the small and large English model (required to run the app and the notebooks), run these commands from a terminal with your virtualenv activated:
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_lg
Finally, the notebooks and library leverage the nltk
package.
The package comes with a set of resources that need to be individually downloaded.
To do so, open a Python session in an activated virtual environment, import
nltk
, and download the required resource.
Here is an example of how to do this for the punkt
package from an active
virtual environment with nltk
installed:
python
import nltk
nltk.download('punkt')
Notebook examples
The notebook folder contains usage examples for concepts covered in the book. Most of the examples only use one of the subfolders in archive (the one that contains data for writers.stackexchange.com).
I've included a processed version of the data as a .csv
for convenience.
If you wanted to generate this data yourself, or generate it for another subfolder, you should:
-
Download a subfolder from the stackoverflow archives
-
Run
parse_xml_to_csv
to convert it to a DataFrame -
Run
generate_model_text_features
to generate a DataFrames with precomputed features
The notebooks belong to a few categories of concepts, described below.
Data Exploration and Transformation
- Dataset Exploration
- Splitting Data
- Vectorizing Text
- Clustering Data
- Tabular Data Vectorization
- Exploring Data To Generate Features
Initial Model Training and Performance Analysis
Improving the Model
Model Comparison
Generating Suggestions from Models
Pretrained models
You can train and save models using the notebooks in the notebook
folder.
For convenience, I've included three trained models and two vectorizers,
serialized in the models
folder.
These models are loaded by notebooks demonstrating methods to compare model
results, as well as in the flask app.
Running the prototype Flask app
To run the app, simply navigate to the root of the repository and run:
FLASK_APP=app.py flask run
The above command should spin up a local web-app you can access at
http://127.0.0.1:5000/
Troubleshooting
If you have any questions or encounter any roadblocks, please feel free to open an issue or email me at [email protected].
Project structure inspired by the great Cookiecutter Data Science.