LogClass
This repository provides an open-source toolkit for the LogClass framework from W. Meng et al., "LogClass: Anomalous Log Identification and Classification with Partial Labels," in IEEE Transactions on Network and Service Management, doi: 10.1109/TNSM.2021.3055425.
LogClass automatically and accurately detects and classifies anomalous logs based on partial labels.
Requirements
Requirements are listed in requirements.txt. To install them, run:
pip install -r requirements.txt
Quick Start
Run LogClass
Several example experiments using LogClass are included in this repository.
Here is an example that runs one of them: training the global experiment, which performs anomaly detection and classification. Run the following command in the home directory of this project:
python -m LogClass.logclass --train --kfold 3 --logs_type "bgl" --raw_logs "./Data/RAS_LOGS" --report macro
Arguments
python -m LogClass.logclass --help
usage: logclass.py [-h] [--raw_logs raw_logs] [--base_dir base_dir]
[--logs logs] [--models_dir models_dir]
[--features_dir features_dir] [--logs_type logs_type]
[--kfold kfold] [--healthy_label healthy_label]
[--features features [features ...]]
[--report report [report ...]]
[--binary_classifier binary_classifier]
[--multi_classifier multi_classifier] [--train] [--force]
[--id id] [--swap]
Runs binary classification with PULearning to detect anomalous logs.
optional arguments:
-h, --help show this help message and exit
--raw_logs raw_logs input raw logs file path (default: None)
--base_dir base_dir base output directory for pipeline output files
(default: ['{your_logclass_dir}\\output'])
--logs logs input logs file path and output for raw logs
preprocessing (default: None)
--models_dir models_dir
trained models input/output directory path (default:
None)
--features_dir features_dir
trained features_dir input/output directory path
(default: None)
--logs_type logs_type
Input type of logs. (default: ['open_Apache'])
--kfold kfold kfold crossvalidation (default: None)
--healthy_label healthy_label
the labels of unlabeled logs (default: ['unlabeled'])
--features features [features ...]
Features to be extracted from the logs messages.
(default: ['tfilf'])
--report report [report ...]
Reports to be generated from the model and its
predictions. (default: None)
--binary_classifier binary_classifier
Binary classifier to be used as anomaly detector.
(default: ['pu_learning'])
--multi_classifier multi_classifier
Multi-clas classifier to classify anomalies. (default:
['svm'])
--train If set, logclass will train on the given data.
Otherwiseit will run inference on it. (default: False)
--force Force training overwriting previous output with same
id. (default: False)
--id id Experiment id. Automatically generated if not
specified. (default: None)
--swap Swap testing/training data in kfold cross validation.
(default: False)
Directory Structure
.
├── data
│   └── open_source_logs    # Included open-source log datasets
│       ├── Apache
│       ├── bgl
│       ├── hadoop
│       ├── hdfs
│       ├── hpc
│       ├── proxifier
│       └── zookeeper
├── output    # Example output folder
│   ├── preprocessed_logs    # Saved preprocessed logs for reuse
│   │   ├── open_Apache.txt
│   │   └── open_bgl.txt
│   └── train_multi_open_bgl_2283696426    # Example experiment output
│       ├── best_params.json
│       ├── features
│       │   ├── tfidf.pkl
│       │   └── vocab.pkl
│       ├── models
│       │   └── multi.pkl
│       └── results.csv
├── feature_engineering
│   ├── __init__.py
│   ├── length.py
│   ├── tf_idf.py
│   ├── tf_ilf.py
│   ├── tf.py
│   ├── registry.py
│   ├── vectorizer.py    # Log message vectorizing utilities
│   └── utils.py
├── models
│   ├── __init__.py
│   ├── base_model.py    # BaseModel class extended by all models
│   ├── pu_learning.py
│   ├── regular.py
│   ├── svm.py
│   ├── binary_registry.py
│   └── multi_registry.py
├── preprocess
│   ├── __init__.py
│   ├── bgl_preprocessor.py
│   ├── open_source_logs.py
│   ├── registry.py
│   └── utils.py
├── reporting
│   ├── __init__.py
│   ├── accuracy.py
│   ├── confusion_matrix.py
│   ├── macrof1.py
│   ├── microf1.py
│   ├── multi_class_acc.py
│   ├── top_k_svm.py
│   ├── bb_registry.py
│   └── wb_registry.py
├── puLearning    # PULearning third party implementation
│   ├── __init__.py
│   └── puAdapter.py
├── __init__.py
├── LICENSE
├── README.md
├── requirements.txt
├── init_params.py    # Parses arguments, initializes global parameters
├── logclass.py    # Performs training and inference of LogClass
├── test_pu.py    # Compares robustness of LogClass
├── train_multi.py    # Trains LogClass for anomalies classification
├── train_binary.py    # Trains LogClass for log anomaly detection
├── run_binary.py    # Loads trained LogClass and detects anomalies
├── decorators.py
└── utils.py
Datasets
This repository includes several open-source log datasets in the data folder, along with their corresponding preprocessing module in the preprocess package. An additional preprocessor is provided for BGL logs data, which can be downloaded directly from here.
How to
This section explains how to use and extend this toolkit.
How to add a new dataset
Add a new preprocessor module in the preprocess package.
The module should implement a function that follows the preprocess_dataset(params) function template included in all preprocessors. It should be decorated with @register(f"{dataset_name}"), e.g. open_Apache, and call the process_logs(input_source, output, process_line) function. The process_line function should also be defined in the preprocessor module.
When done, add the module name to the list of modules in the __init__.py of the preprocess package, and add the name used in the decorator to the argparse options for the logs type. For example, --logs_type open_Apache.
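The steps above can be sketched as follows. This is a minimal, self-contained illustration and not the toolkit's own code: the register decorator, get_preprocessor and process_logs are simplified stand-ins defined here, the params keys raw_logs and logs are assumed to mirror the command-line arguments, and the my_dataset name, the raw line layout and the sample lines are made up for the example.

```python
import os
import tempfile

# Stand-in registry; the real one lives in the preprocess package (assumption).
_PREPROCESSORS = {}


def register(name):
    """Store the decorated function in the registry under `name`."""
    def decorator(func):
        _PREPROCESSORS[name] = func
        return func
    return decorator


def get_preprocessor(name):
    return _PREPROCESSORS[name]


def process_logs(input_source, output, process_line):
    """Stand-in: apply `process_line` to every raw line and write the results."""
    with open(input_source, encoding="utf-8") as src, \
            open(output, "w", encoding="utf-8") as dst:
        for line in src:
            result = process_line(line.rstrip("\n"))
            if result:  # skip lines the preprocessor could not parse
                dst.write(result + "\n")


def process_line(line):
    """Hypothetical raw layout: '<label> <timestamp> <message>'."""
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return ""
    label, _timestamp, msg = parts
    return f"{label} {msg}"


@register("my_dataset")  # hypothetical dataset name
def preprocess_dataset(params):
    process_logs(params["raw_logs"], params["logs"], process_line)


def demo():
    """Preprocess two made-up raw lines and return the output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = os.path.join(tmp, "raw.log")
        out = os.path.join(tmp, "preprocessed.txt")
        with open(raw, "w", encoding="utf-8") as f:
            f.write("unlabeled 1117838570 instruction cache parity error corrected\n")
            f.write("KERNDTLB 1117838573 data TLB error interrupt\n")
        get_preprocessor("my_dataset")({"raw_logs": raw, "logs": out})
        with open(out, encoding="utf-8") as f:
            return f.read().splitlines()
```

Calling demo() looks the preprocessor up by its registered name, which is the same lookup the --logs_type argument is described as driving.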
Preprocessed Logs Format
This format is ensured by the process_line function, which is to be defined in each preprocessor.
def process_line(line):
    """
    Processes a given line from the raw logs.

    Parameter
    ---------
    line : str
        One line from the raw logs.

    Returns
    -------
    str
        String with the format f"{label} {msg}" where the `label` indicates whether
        the log is anomalous and if so, which anomaly category, and `msg` is the
        filtered log message without parameters.
    """
    # your code
To filter the log message parameters, use the remove_parameters(msg) function from the utils.py module in the preprocess package.
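As an illustration of the expected output format, here is one possible process_line. It is only a sketch: the remove_parameters below is a simplified stand-in that drops every token containing a digit, which may differ from what the toolkit's helper does, and the raw layout (label first, with "-" marking a healthy line) is an assumption made for the example.

```python
def remove_parameters(msg):
    """Stand-in for the toolkit helper: drop tokens that contain any digit."""
    kept = [tok for tok in msg.split() if not any(ch.isdigit() for ch in tok)]
    return " ".join(kept)


def process_line(line):
    """Hypothetical raw layout: '<label> <message>', '-' meaning no anomaly."""
    label, _, msg = line.partition(" ")
    if label == "-":
        # 'unlabeled' is the default value of --healthy_label
        label = "unlabeled"
    return f"{label} {remove_parameters(msg)}"
```

With this sketch, a made-up line such as "- generating core.2275 at 0x0040fa3c" becomes "unlabeled generating at": the label comes first and the variable parts of the message are gone.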
How to run a new experiment
Several example experiments are included in the repository. The best way to start creating a new one is to follow the existing examples, especially the structure of the main function and its experiment function, whether it is focused on training or testing.
An experiment should include the following key elements:
- Args parsing: create custom init_args() and parse_args(args) functions for your experiment that call init_main_args() from the init_params.py module.
- Output file handling: use the file_handling(params) function (see utils.py in the main directory of the repo).
- Preprocessing raw logs: if the --raw_logs argument is provided, get the preprocessing function for the --logs_type argument from the preprocess module registry by calling the get_preprocessor(f'{logs_type}') function.
- Load logs: call the load_logs(params, ...) function to get the preprocessed logs from the directory specified in the --logs parameter. It returns a tuple of x, y, and the target label names.
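The argument-parsing step might look like the sketch below. It assumes that init_main_args() returns an argparse parser holding the shared flags; the version here is a stand-in with only four of them and simplified defaults, and the default value given to --ratio is a placeholder. The --ratio flag itself is the one used by the test_pu.py command further down.

```python
import argparse


def init_main_args():
    """Stand-in for init_params.init_main_args(): parser with shared flags."""
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--raw_logs", metavar="raw_logs", type=str,
                        help="input raw logs file path")
    parser.add_argument("--logs_type", metavar="logs_type", type=str,
                        default="open_Apache", help="Input type of logs.")
    parser.add_argument("--kfold", metavar="kfold", type=int,
                        help="kfold crossvalidation")
    parser.add_argument("--train", action="store_true",
                        help="If set, train on the given data.")
    return parser


def init_args(argv=None):
    """Add the experiment-specific flags on top of the shared ones."""
    parser = init_main_args()
    parser.add_argument("--ratio", metavar="ratio", type=int, default=8,
                        help="anomalous:healthy ratio (placeholder default)")
    return parser.parse_args(argv)


def parse_args(args):
    """Turn the parsed namespace into the params dict used by the experiment."""
    return dict(vars(args))


if __name__ == "__main__":
    print(parse_args(init_args()))
```

Keeping the shared flags in one function means every experiment accepts the same core options and only declares what is specific to it.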
Custom experiment
These are the main functions to consider for a custom experiment, which usually lives in its own function.
Feature Engineering
- extract_features(x, params) from the feature_engineering package's utils.py module: Extracts all specified features in the --features parameter from the preprocessed logs. See the function definition for further details.
- build_vocabulary(x) from the feature_engineering package's vectorizer.py module: Divides log into tokens and creates vocabulary. See the function definition for further details.
- log_to_vector(x, vocabulary) from the feature_engineering package's vectorizer.py module: Vectorizes each log message using a dict of words to index. See the function definition for further details.
- get_features_vector(x_vector, vocabulary, params) from the feature_engineering package's utils.py module: Extracts all specified features from the vectorized logs. See the function definition for further details.
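To make the vocabulary and vectorization steps concrete, the sketch below re-implements them in a simplified form. The function names match the ones listed above, but the bodies are assumptions (whitespace tokenization, unknown tokens skipped) and not the toolkit's code; the log messages are made up.

```python
def build_vocabulary(logs):
    """Map each distinct whitespace-separated token to an integer index."""
    vocabulary = {}
    for log in logs:
        for token in log.split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def log_to_vector(logs, vocabulary):
    """Replace each known token by its index; unknown tokens are skipped."""
    return [[vocabulary[tok] for tok in log.split() if tok in vocabulary]
            for log in logs]


train_logs = ["data TLB error interrupt", "data storage interrupt"]
vocabulary = build_vocabulary(train_logs)
train_vectors = log_to_vector(train_logs, vocabulary)

# At inference time the saved vocabulary is reused, so unseen words
# such as "cache" simply do not appear in the vector.
test_vectors = log_to_vector(["data cache error"], vocabulary)
```

Building the vocabulary only from the training logs and reusing it for inference keeps the feature indices stable between the two phases.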
Model training and inference
Each model extends the BaseModel class from the base_model.py module. See the class definition for further details.
There are two registries in the models package: one for binary models meant to be used for anomaly detection, and another for multi-class models that classify the anomalies. Get the constructor from either one using the specified --binary_classifier or --multi_classifier argument, e.g. binary_classifier_registry.get_binary_model(params['binary_classifier']).
Because it extends BaseModel, a model is always saved when it fits the data. Load a model by calling its load() method. It uses the params attribute of the BaseModel class to get the experiment id and load the corresponding model.
To save the params of an experiment, call the save_params(params) function from the utils.py module in the main directory. Call load_params(params) instead when the module is only used for inference.
Reporting
There are two kinds of reports, black box and white box, with a registry for each in the reporting module.
To use them, call the corresponding registry and obtain the report wrapper using, for example, black_box_report_registry.get_bb_report('acc').
To add new reports, see the analogous explanations for models and features below.
Saving results
Among the provided experiments, test_pu.py and train_multi.py save their results by creating a dict that maps column names to lists of results. The save_results function from the utils.py module is then used to save them to a CSV file.
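The dict-of-columns layout maps naturally onto CSV rows. The helper below is a hypothetical stand-in written with the standard csv module, not the toolkit's function, and the column names and values are placeholders.

```python
import csv
import io


def write_results_csv(results, stream):
    """Write a dict of column name -> list of values as CSV, one row per index."""
    columns = list(results)
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(columns)
    # zip(*...) turns the per-column lists into per-row tuples
    for row in zip(*(results[column] for column in columns)):
        writer.writerow(row)


# Placeholder results: one entry per kfold iteration
results = {"fold": [0, 1], "macro_f1": [0.5, 0.75]}
buffer = io.StringIO()
write_results_csv(results, buffer)
csv_text = buffer.getvalue()
```

Writing to a stream keeps the helper easy to test; passing an open file (created with newline="") would produce a file such as the results.csv shown in the directory structure.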
How to add a new model
To add a new model, implement a class that extends the BaseModel class and include its module in the models package. See the class definition for further details.
Decorate a function that calls its constructor and returns an instance of the model with the @register(f"{model_name}") decorator from either the binary_registry.py or the multi_registry.py module of the models package, depending on whether the model is for anomaly detection or classification, respectively.
Finally, make sure you add the module's name to the __init__.py module of the models package and the model option to the list for either the --binary_classifier or the --multi_classifier argument in the init_params.py module. This way the constructor can be obtained with, for example, binary_classifier_registry.get_binary_model(params['binary_classifier']).
How to extract a new feature
To add a new feature extractor, create a module in the feature_engineering package that wraps your feature extractor function and returns the features. See the length.py module for an example.
As in the other cases, decorate the wrapper function with @register(f"{feature_name}"), add the module name to the __init__.py of the feature_engineering package, and add the feature as an option for the --features argument in the init_params.py module.
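A feature module following this description could be as small as the sketch below. The wrapper signature (params, vectors, vocabulary), the token_count feature name and the extract helper are all assumptions made for illustration; check length.py for the signature the toolkit actually expects.

```python
_FEATURE_EXTRACTORS = {}  # stand-in for the feature registry


def register(name):
    def decorator(func):
        _FEATURE_EXTRACTORS[name] = func
        return func
    return decorator


def token_count(vectors):
    """The plain feature function: one column with each log's token count."""
    return [[len(vector)] for vector in vectors]


@register("token_count")  # hypothetical feature name
def create_token_count_feature(params, vectors, vocabulary):
    """Wrapper with an assumed signature; it only forwards to the function."""
    return token_count(vectors)


def extract(feature_names, params, vectors, vocabulary):
    """Concatenate the columns produced by every requested extractor."""
    rows = [[] for _ in vectors]
    for name in feature_names:
        columns = _FEATURE_EXTRACTORS[name](params, vectors, vocabulary)
        for row, extra in zip(rows, columns):
            row.extend(extra)
    return rows
```

Separating the plain function from the registered wrapper keeps the feature computation reusable outside the pipeline.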
Included Experiments
This section gives a high-level overview of each of the experiments included in the repository.
Testing PULearning
test_pu.py is mainly focused on demonstrating the robustness of LogClass for anomaly detection when only a few logs are labeled as anomalous.
It compares PULearning+RandomForest with any other given anomaly detection algorithm. Using the given data, it starts with only healthy logs in the unlabeled data and gradually increases the share of anomalous logs in it up to 10%. To test PULearning, run the following command in the home directory of this project:
python -m LogClass.test_pu --logs_type "bgl" --raw_logs "./Data/RAS from Weibin/RAS_raw_label.dat" --binary_classifier regular --ratio 8 --step 1 --top_percentage 11 --kfold 3
This first preprocesses the logs. Then, for each kfold iteration, it performs feature extraction and forces a 1:8 ratio of anomalous:healthy logs. Finally, with a step of 1%, it goes from 0% to 10% anomalous logs in the unlabeled set and compares the accuracy of both anomaly detection algorithms. If no other algorithm is specified, the comparison defaults to a plain RandomForest (RF).
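The sweep described above can be pictured with the following sketch. It is one interpretation and does not reproduce the code in test_pu.py: it assumes the percentage is measured as the share of hidden anomalies within the unlabeled set, that the percentages are range(0, top_percentage, step), and that the 1:8 ratio is enforced by random subsampling. The function names are hypothetical.

```python
import random


def force_ratio(anomalous, healthy, ratio, rng):
    """Subsample so that there are `ratio` healthy logs per anomalous log."""
    n_anomalous = min(len(anomalous), len(healthy) // ratio)
    return (rng.sample(anomalous, n_anomalous),
            rng.sample(healthy, n_anomalous * ratio))


def hide_anomalies(anomalous, healthy, percentage):
    """Move anomalies into the unlabeled set until they are `percentage`% of it.

    Solving n / (len(healthy) + n) = percentage / 100 for n gives the count.
    Assumes percentage < 100.
    """
    n_hidden = round(percentage * len(healthy) / (100 - percentage))
    n_hidden = min(n_hidden, len(anomalous))
    unlabeled = healthy + anomalous[:n_hidden]
    labeled = anomalous[n_hidden:]
    return labeled, unlabeled


def sweep(anomalous, healthy, ratio=8, step=1, top_percentage=11, seed=0):
    """Yield (percentage, labeled anomalies, unlabeled logs) for each step."""
    rng = random.Random(seed)
    anomalous, healthy = force_ratio(anomalous, healthy, ratio, rng)
    for percentage in range(0, top_percentage, step):
        labeled, unlabeled = hide_anomalies(anomalous, healthy, percentage)
        yield percentage, len(labeled), len(unlabeled)
```

Under these assumptions, 100 anomalous and 1000 healthy logs are first reduced to 100 and 800; at 10% the unlabeled set then holds 889 logs, 89 of which are hidden anomalies, leaving 11 labeled ones.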
Testing Anomaly Classification
train_multi.py is focused on showing the robustness of LogClass' TF-ILF feature extraction approach for multi-class anomaly classification. The main detail is that when using --kfold N, one can swap the training and testing data slices using the --swap flag. This way, for instance, it can train on 10% of the logs and test on the remaining 90% when pairing --swap with --kfold 10. To run such an experiment, use the following command from the parent directory of the project:
python -m LogClass.train_multi --logs_type "open_Apache" --raw_logs "./Data/open_source_logs/" --kfold 10 --swap
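The effect of --swap on the split sizes can be seen in a small sketch. It uses plain contiguous folds without shuffling or stratification, which is an assumption made to keep the example short; the toolkit may split the data differently, and the function names are hypothetical.

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def kfold_with_swap(n_samples, k, swap=False):
    """Same folds, but with the roles of the two slices exchanged when swapping."""
    for train, test in kfold_indices(n_samples, k):
        yield (test, train) if swap else (train, test)


# With 100 logs and 10 folds, swapping trains on 10 logs and tests on 90.
swapped_sizes = [(len(train), len(test))
                 for train, test in kfold_with_swap(100, 10, swap=True)]
```

Without swap each iteration trains on 90% of the data; with swap the single held-out fold becomes the training set, which is how the 10%/90% setup arises.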
Global LogClass
logclass.py is set up so that it performs either training or testing of the learned models, depending on the flags. For example, to preprocess the logs and train, run the following command in the home directory of this project:
python -m LogClass.logclass --train --kfold 3 --logs_type "bgl" --raw_logs "./Data/RAS_LOGS"
This would first preprocess the raw BGL logs and extract their TF-ILF features, then train and save both PULearning with a RandomForest for anomaly detection and an SVM for multi-class anomaly classification.
For running inference simply run:
python -m LogClass.logclass --logs_type
In this case it loads the learned feature extraction approach and both trained models, then runs inference on all the logs.
Binary training/inference
train_binary.py and run_binary.py simply separate the binary part of logclass.py into two modules: one for training both the feature extraction and the models, and another for loading these and running inference.
Citing
If you find LogClass useful for your research, please consider citing the paper:
@ARTICLE{9339940, author={Meng, Weibin and Liu, Ying and Zhang, Shenglin and Zaiter, Federico and Zhang, Yuzhe and Huang, Yuheng and Yu, Zhaoyang and Zhang, Yuzhi and Song, Lei and Zhang, Ming and Pei, Dan},
journal={IEEE Transactions on Network and Service Management},
title={LogClass: Anomalous Log Identification and Classification with Partial Labels},
year={2021},
doi={10.1109/TNSM.2021.3055425}
}
This code was completed by @Weibin Meng and @Federico Zaiter.