• Stars
    star
    1,727
  • Rank 26,999 (Top 0.6 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created almost 7 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

AI on Hadoop

license Release Version PRs Welcome

XLearning is a convenient and efficient scheduling platform combined with the big data and artificial intelligence, support for a variety of machine learning, deep learning frameworks. XLearning is running on the Hadoop Yarn and has integrated deep learning frameworks such as TensorFlow, MXNet, Caffe, Theano, PyTorch, Keras, XGBoost. XLearning has the satisfactory scalability and compatibility.

中文文档

Architecture

architecture
There are three essential components in XLearning:

  • Client: start and get the state of the application.
  • ApplicationMaster(AM): the role for the internal schedule and lifecycle manager, including the input data distribution and containers management.
  • Container: the actual executor of the application to start the progress of Worker or PS(Parameter Server), monitor and report the status of the progress to AM, and save the output, especially start the TensorBoard service for TensorFlow application.

Functions

1 Support Multiple Deep Learning Frameworks

Besides the distributed mode of TensorFlow and MXNet frameworks, XLearning supports the standalone mode of all deep learning frameworks such as Caffe, Theano, PyTorch. Moreover, XLearning allows the custom versions and multi-version of frameworks flexibly.

2 Unified Data Management Based On HDFS

XLearning is enable to specify the input strategy for the input data --input by setting the --input-strategy parameter or xlearning.input.strategy configuration. XLearning support three ways to read the HDFS input data:

  • Download: AM traverses all files under the specified HDFS path and distributes data to workers in files. Each worker download files from the remote to local.
  • Placeholder: The difference with Download mode is that AM send the related HDFS file list to workers. The process in worker read the data from HDFS directly.
  • InputFormat: Integrated the InputFormat function of MapReduce, XLearning allows the user to specify any of the implementation of InputFormat for the input data. AM splits the input data and assigns fragments to the different workers. Each worker passes the assigned fragments through the pipeline to the execution progress.

Similar with the read strategy, XLearning allows to specify the output strategy for the output data --output by setting the --output-strategy parameter or xlearning.output.strategy configuration. There are two kinds of result output modes:

  • Upload: After the program finished, each worker upload the local directory of the output to specified HDFS path directly. The button, "Saved Model", on the web interface allows user to upload the intermediate result to remote during the execution.
  • OutputFormat: Integrated the OutputFormat function of MapReduce, XLearning allows the user to specify any of the implementation of OutputFormat for saving the result to HDFS.

More detail see data management

3 Visualization Display

The application interface can be divided into four parts:

  • All Containers:display the container list and corresponding information, including the container host, container role, current state of container, start time, finish time, current progress.
  • View TensorBoard:If set to start the service of TensorBoard when the type of application is TensorFlow, provide the link to enter the TensorBoard for real-time view.
  • Save Model:If the application has the output, user can upload the intermediate output to specified HDFS path during the execution of the application through the button of "Save Model". After the upload finished, display the list of the intermediate saved path.
  • Worker Metrix:display the resource usage information metrics of each worker.
    As shown below:

yarn1

4 Compatible With The Code At Native Frameworks

Except the automatic construction of the ClusterSpec at the distributed mode TensorFlow framework, the program at standalone mode TensorFlow and other deep learning frameworks can be executed at XLearning directly.

Compilation & Deployment Instructions

1 Compilation Environment Requirements

  • jdk >= 1.7
  • Maven >= 3.3

2 Compilation Method

Run the following command in the root directory of the source code:

mvn package

After compiling, a distribution package named xlearning-1.1-dist.tar.gz will be generated under target in the root directory.
Unpacking the distribution package, the following subdirectories will be generated under the root directory:

  • bin: scripts for application commit
  • lib: jars for XLearning and dependencies
  • conf: configuration files
  • sbin: scripts for history service
  • data: data and files for examples
  • examples: XLearning examples

3 Deployment Environment Requirements

  • CentOS 7.2
  • Java >= 1.7
  • Hadoop = 2.6, 2.7, 2.8
  • [optional] Dependent environment for deep learning frameworks at the cluster nodes, such as TensorFlow, numpy, Caffe.

4 XLearning Client Deployment Guide

Under the "conf" directory of the unpacking distribution package "$XLEARNING_HOME", configure the related files:

  • xlearning-env.sh: set the environment variables, such as:

    • JAVA_HOME
    • HADOOP_CONF_DIR
  • xlearning-site.xml: configure related properties. Note that the properties associated with the history service needs to be consistent with what has configured when the history service started.For more details, please see the Configuration part。

  • log4j.properties:configure the log level

5 Start Method of XLearning History Service [Optional]

  • run $XLEARNING_HOME/sbin/start-history-server.sh.

Quick Start

Use $XLEARNING_HOME/bin/xl-submit to submit the application to cluster in the XLearning client.
Here are the submit example for the TensorFlow application.

1 upload data to hdfs

upload the "data" directory under the root of unpacking distribution package to HDFS

cd $XLEARNING_HOME  
hadoop fs -put data /tmp/ 

2 submit

cd $XLEARNING_HOME/examples/tensorflow
$XLEARNING_HOME/bin/xl-submit \
   --app-type "tensorflow" \
   --app-name "tf-demo" \
   --input /tmp/data/tensorflow#data \
   --output /tmp/tensorflow_model#model \
   --files demo.py,dataDeal.py \
   --launch-cmd "python demo.py --data_path=./data --save_path=./model --log_dir=./eventLog --training_epochs=10" \
   --worker-memory 10G \
   --worker-num 2 \
   --worker-cores 3 \
   --ps-memory 1G \
   --ps-num 1 \
   --ps-cores 2 \
   --queue default \

The meaning of the parameters are as follows:

Property Name Meaning
app-name application name as "tf-demo"
app-type application type as "tensorflow"
input input file, HDFS path is "/tmp/data/tensorflow" related to local dir "./data"
output output file,HDFS path is "/tmp/tensorflow_model" related to local dir "./model"
files application program and required local files, including demo.py, dataDeal.py
launch-cmd execute command
worker-memory amount of memory to use for the worker process is 10GB
worker-num number of worker containers to use for the application is 2
worker-cores number of cores to use for the worker process is 3
ps-memory amount of memory to use for the ps process is 1GB
ps-num number of ps containers to use for the application is 1
ps-cores number of cores to use for the ps process is 2
queue the queue that application submit to

For more details, set the Submit Parameter part。

FAQ

XLearning FAQ

Authors

XLearning is designed, authored, reviewed and tested by the team at the github:

@Yuance Li, @Wen OuYang, @Runying Jia, @YuHan Jia, @Lei Wang

Contact us

Mail: [email protected]
QQ群:588356340
qq

More Repositories

1

RePlugin

RePlugin - A flexible, stable, easy-to-use Android Plug-in Framework
Java
7,261
star
2

Atlas

A high-performance and stable proxy for MySQL, it is developed by Qihoo's DBA and infrastructure team
C
4,650
star
3

wayne

Kubernetes multi-cluster management and publishing platform
TypeScript
3,706
star
4

evpp

A modern C++ network library for developing high performance network services in TCP/UDP/HTTP protocols.
C++
3,564
star
5

ArgusAPM

Powerful, comprehensive (Android) application performance management platform. 360线上移动性能检测平台
Java
2,673
star
6

safe-rules

详细的C/C++编程规范指南,由360质量工程部编著,适用于桌面、服务端及嵌入式软件系统。
2,363
star
7

Quicksql

A Flexible, Fast, Federated(3F) SQL Analysis Middleware for Multiple Data Sources
Java
2,057
star
8

poseidon

A search engine which can hold 100 trillion lines of log data.
Go
1,966
star
9

QConf

Qihoo Distributed Configuration Management System
C++
1,865
star
10

phptrace

A tracing and troubleshooting tool for PHP scripts.
C
1,677
star
11

mysql-sniffer

mysql-sniffer is a network traffic analyzer tool for mysql, it is developed by Qihoo DBA and infrastructure team
C
845
star
12

huststore

High-performance Distributed Storage
C
823
star
13

doraemon

Doraemon is a Prometheus based monitor system
JavaScript
655
star
14

logkafka

Collect logs and send lines to Apache Kafka
C++
500
star
15

zeppelin

A Scalable, High-Performance Distributed Key-Value Platform
C++
399
star
16

tensornet

C++
316
star
17

qbusbridge

The Apache Kafka Client SDK
C++
292
star
18

360zhinao

360zhinao
Python
274
star
19

XSQL

Unified SQL Analytics Engine Based on SparkSQL
Scala
210
star
20

WatchAD2.0

WatchAD2.0是一款针对域威胁的日志分析与监控系统
CSS
206
star
21

zendAPI

The C++ wrapper of zend engine
C++
183
star
22

mongosync

mongosync is simple && useful tool to sync data between mongo replicaSet, it is developed by Qihoo's DBA and infrastructure team
C++
154
star
23

artdumper

从oat文件中dump出来dex的工具
C++
138
star
24

influx-proxy

influxdb HA
Go
128
star
25

kmemcache

linux kernel memcache server
C
126
star
26

XLearning-XDML

extremely distributed machine learning
Scala
123
star
27

simcc

A simple C++ common base library used in Qihoo 360
C++
116
star
28

nemo

A library that provide multiply data structure. Such as map, hash, list, set. We build these data structure base on rocksdb as the storage layer for Pika https://github.com/OpenAtomFoundation/pika .
C++
115
star
29

ngx_http_subrange_module

Split one big HTTP/Range request to multiple subrange requesets
C
107
star
30

blackwidow

A library implements REDIS commands(Strings, Hashes, Lists, Sorted Sets, Sets, Keys, HyperLogLog) based on rocksdb, as the storage layer for Pika https://github.com/OpenAtomFoundation/pika .
C++
99
star
31

QNAT

C
88
star
32

Mario

A Library that make the write from synchronous to asynchronous.
C++
78
star
33

Luwak

利用预训练语言模型从非结构化威胁报告中提取 MITRE ATT&CK TTP 信息
Python
68
star
34

mpic

A C++ embedded library of multiple processes framework developed and used at Qihoo360.
C++
50
star
35

nemo-rocksdb

Add TTL feature on rocksdb, and compatible with rocksdb
C++
44
star
36

dgl-operator

The DGL Operator makes it easy to run Deep Graph Library (DGL) graph neural network training on Kubernetes
Go
44
star
37

ironwill

Useful iOS components for your project. 健壮且有用的OC代码, 可以直接在你的iOS应用中使用.
Objective-C
37
star
38

elog

A erlang log nif
C++
28
star
39

rust-jsonnet

rust-jsonnet - The Google Jsonnet( operation data template language) for rust
Rust
24
star
40

zeppelin-gateway

Object Gateway Provide Applications with a RESTful Gateway to zeppelin
C++
23
star
41

zeppelin-client

Client Library for zeppelin
C++
21
star
42

luajit-jsonnet

The Google Jsonnet( operation data template language) for Luajit
C++
16
star
43

HTTPSLayer

PHP
16
star
44

CReSS

Cross-model Retrieval between 13C NMR Spectrum and Structure
Python
15
star
45

wayne-backend-plugins

Wayne backend plugins
Go
13
star
46

gpstall

Stall Postgres' insert command
C++
8
star
47

cloud-website

360 cloud official website
PHP
8
star
48

wayne-frontend-plugins

Wayne UI Plugins
TypeScript
7
star
49

SEEChat

一见多模态对话模型
Python
5
star
50

wiki

wiki for qihoo infrastructure team
2
star
51

se-office

se-office扩展,提供基于开放标准的全功能办公生产力套件,基于浏览器预览和编辑office。
JavaScript
1
star