XDML: extremely distributed machine learning

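XDML is a distributed machine learning platform built on a parameter server, with a purpose-built caching mechanism. It incorporates recent research results from academia and can substantially accelerate convergence while keeping model quality stable, which significantly improves the performance of models and algorithms. XDML also integrates a number of excellent open-source projects along with technology developed in-house at 360, drawing on the strengths of each. It is compatible with the Hadoop ecosystem, offers a better experience with big-data frameworks, and frees developers from tedious work. XDML has been extensively tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and ultra-high-dimensional features.

Anyone interested in machine learning or distributed systems is welcome to contribute code and to submit Issues or Pull Requests.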

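Architecture

To address very-large-scale machine learning scenarios, Qihoo 360 open-sourced XDML, its internal computing framework for very-large-scale machine learning. XDML is a distributed machine learning platform built on a parameter server, with a purpose-built caching mechanism. It has been tested and tuned on massive datasets inside 360, and it shows good stability, scalability, and compatibility on machine learning tasks with large data volumes and ultra-high-dimensional features.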

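Features

1.Provides functional modules for feature preprocessing and analysis, offline training, and model management

2.Implements commonly used machine learning algorithms for large-data scenarios

3.Makes full use of existing mature technologies to keep the whole framework efficient and stable

4.Fully compatible with the Hadoop ecosystem and integrates seamlessly with existing big-data tools, improving the ability to process massive data

5.Applies deep engineering optimizations at both the system architecture and the algorithm level, greatly improving performance without loss of accuracy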

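Code Structure

1.ps

The core parameter server architecture of XDML and its components.

2.conf

The XDML configuration package, covering the parameter server settings as well as job- and model-related settings.

3.task

Jobs that XDML submits to the PS, including pulls and pushes. The tasks are:

  • Task
  • PullTask
  • PushTask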
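
The pull and push operations can be pictured with a small, single-process sketch. Everything in it is hypothetical and for illustration only: `ToyParameterServer`, `pull`, `push`, and `PullPushDemo` are made-up names, the store is an in-memory map rather than Kudu or Hazelcast, and none of it reflects the actual `Task`, `PullTask`, or `PushTask` classes.

```scala
import scala.collection.mutable

// Hypothetical in-memory stand-in for a parameter server; not XDML's API.
class ToyParameterServer {
  private val weights = mutable.Map.empty[Long, Double]

  // Pull: return the current weight of each requested feature id.
  // Ids that were never pushed default to 0.0.
  def pull(keys: Set[Long]): Map[Long, Double] =
    keys.iterator.map(k => k -> weights.getOrElse(k, 0.0)).toMap

  // Push: apply a gradient step w <- w - learningRate * g for each id.
  def push(gradients: Map[Long, Double], learningRate: Double): Unit =
    gradients.foreach { case (k, g) =>
      weights(k) = weights.getOrElse(k, 0.0) - learningRate * g
    }
}

object PullPushDemo {
  // One worker round trip: push gradients, then pull the updated weights.
  def run(): Map[Long, Double] = {
    val ps = new ToyParameterServer
    ps.push(Map(1L -> 0.5, 7L -> -2.0), learningRate = 0.1)
    ps.pull(Set(1L, 7L))
  }
}
```

In a real deployment each worker would pull only the keys that appear in its current mini-batch, which is what keeps traffic manageable when the feature space is very high-dimensional.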

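4.optimization

The package of optimization algorithms for XDML models.

5.ml

The machine learning models already implemented in XDML.

6.feature

The feature analysis and feature processing modules of XDML.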

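  • Feature analysis

    Feature analysis covers common metrics, such as skewness, kurtosis, and quantiles for numeric features, and label-related metrics such as AUC, NDCG, mutual information, and correlation coefficients.

  • Feature processing

    Feature processing covers common preprocessing methods for numeric and categorical features. It includes the following operators: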

    • CategoryEncoder
    • MultiCategoryEncoder
    • NumericBuckter
    • NumericStandardizer
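
To make a few of these metrics and operators concrete, here is a minimal sketch in plain Scala. It is not the code behind `CategoryEncoder`, `NumericBuckter`, or `NumericStandardizer`; the object name `FeatureSketch` and all function names are assumptions chosen for illustration, and the statistics use the population (biased) formulas.

```scala
object FeatureSketch {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Population central moment of order k.
  private def centralMoment(xs: Seq[Double], k: Int): Double = {
    val m = mean(xs)
    xs.map(x => math.pow(x - m, k)).sum / xs.size
  }

  // Skewness m3 / m2^1.5 (NaN for a constant column).
  def skewness(xs: Seq[Double]): Double =
    centralMoment(xs, 3) / math.pow(centralMoment(xs, 2), 1.5)

  // Excess kurtosis m4 / m2^2 - 3 (NaN for a constant column).
  def kurtosis(xs: Seq[Double]): Double =
    centralMoment(xs, 4) / math.pow(centralMoment(xs, 2), 2) - 3.0

  // Standardization: subtract the mean, divide by the population std dev.
  def standardize(xs: Seq[Double]): Seq[Double] = {
    val m = mean(xs)
    val sd = math.sqrt(centralMoment(xs, 2))
    xs.map(x => (x - m) / sd)
  }

  // Bucketing: the bucket index is the number of split points <= x.
  // The split points are assumed to be sorted in ascending order.
  def bucketize(x: Double, splits: Seq[Double]): Int = splits.count(_ <= x)

  // Category encoding: index each distinct value by first appearance.
  def encodeCategories(values: Seq[String]): Map[String, Int] =
    values.distinct.zipWithIndex.toMap
}
```

For example, `FeatureSketch.bucketize(2.5, Seq(1.0, 2.0, 3.0))` places the value in bucket 2, because two split points are less than or equal to it.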

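7.model

XDML includes linear models trained with the Scope optimization algorithm proposed by Professor Wu-Jun Li of Nanjing University, as well as Spark pipeline wrappers for some H2O models. Specifically, it includes the following models:

Model: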

  • LinearScope
  • MultiLinearScope
  • OVRLinearScope
  • H2ODRF
  • H2OGBM
  • H2OGLM
  • H2OMLP

8.example

Job submission examples for XDML; see Example.

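Build & Deployment Guide

XDML is a parameter-server-based distributed machine learning platform with a purpose-built caching mechanism, built on Kudu, Hazelcast, and the Hadoop ecosystem.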

Requirements

  • CentOS >= 7
  • JDK >= 1.8
  • Maven >= 3.5.4
  • Scala >= 2.11
  • Hadoop >= 2.7.3
  • Spark >= 2.3.0
  • sparkling-water-core >= 2.3.0
  • Kudu >= 1.9
  • Hazelcast >= 3.9.3

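Installing and Deploying Kudu

XDML is built on Kudu, so deploy Kudu first. For installation and deployment instructions, refer to the Kudu documentation.

Downloading the Source

  git clone https://github.com/Qihoo360/XLearning-XDML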

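Build

  mvn clean package -Dmaven.test.skip=true

After the build completes, the target directory under the source root contains several files, including xdml-1.0.jar and xdml-1.0-jar-with-dependencies.jar. xdml-1.0.jar does not bundle third-party dependencies such as Spark and Kudu, while xdml-1.0-jar-with-dependencies.jar includes them.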

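Running the Example

Submission Parameters

  • Algorithm parameters
    • spark.xdml.learningRate: learning rate
  • Training parameters
    • spark.xdml.job.type: job type
    • spark.xdml.train.data.path: path of the training data
    • spark.xdml.train.data.partitionNum: number of partitions for the training data
    • spark.xdml.model.path: path where the model is stored
    • spark.xdml.train.iter: number of training iterations
    • spark.xdml.train.batchsize: batch size of the training data
  • PS-related parameters
    • spark.xdml.hz.clusterNum: number of machines in the Hazelcast cluster
    • spark.xdml.table.name: name of the Kudu table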
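
As a rough illustration of how a job might read these settings, the sketch below parses them from a plain key/value map into a typed record. The names `JobSettings` and `JobSettingsParser` are hypothetical and are not part of XDML. Treating every key as required is also an assumption of this sketch. In a Spark driver the map could come from `SparkConf.getAll.toMap`.

```scala
// Hypothetical container for the submission parameters; not an XDML class.
final case class JobSettings(
    learningRate: Double,
    jobType: String,
    trainDataPath: String,
    partitionNum: Int,
    modelPath: String,
    trainIter: Int,
    batchSize: Int,
    hzClusterNum: Int,
    tableName: String)

object JobSettingsParser {
  // Parses key/value pairs into a typed record.
  // Every key is treated as required; that is an assumption of this sketch.
  def fromMap(conf: Map[String, String]): JobSettings = {
    def required(key: String): String =
      conf.getOrElse(key, throw new IllegalArgumentException(s"missing setting: $key"))

    JobSettings(
      learningRate = required("spark.xdml.learningRate").toDouble,
      jobType = required("spark.xdml.job.type"),
      trainDataPath = required("spark.xdml.train.data.path"),
      partitionNum = required("spark.xdml.train.data.partitionNum").toInt,
      modelPath = required("spark.xdml.model.path"),
      trainIter = required("spark.xdml.train.iter").toInt,
      batchSize = required("spark.xdml.train.batchsize").toInt,
      hzClusterNum = required("spark.xdml.hz.clusterNum").toInt,
      tableName = required("spark.xdml.table.name"))
  }
}
```

Failing fast on a missing or malformed value surfaces configuration mistakes in the driver before any executor starts work.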

Submit Command

The example training job can be submitted with the following command:

  $SPARK_HOME/bin/spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --class net.qihoo.xitong.xdml.example.LRTest \
    --num-executors 50 \
    --executor-memory 40g \
    --executor-cores 2 \
    --driver-memory 4g \
    --conf "spark.xdml.table.name=lrtest" \
    --conf "spark.xdml.job.type=train" \
    --conf "spark.xdml.train.data.path=$trainpath" \
    --conf "spark.xdml.train.data.partitionNum=50" \
    --conf "spark.xdml.hz.clusterNum=50" \
    --conf "spark.xdml.model.path=$modelpath" \
    --conf "spark.xdml.train.iter=5" \
    --conf "spark.xdml.train.batchsize=10000" \
    --conf "spark.xdml.learningRate=0.1" \
    --jars xdml-1.0-jar-with-dependencies.jar \
    xdml-1.0-jar-with-dependencies.jar

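Note: $SPARK_HOME, $trainpath, and $modelpath in the submit command stand for the Spark client path, the HDFS path of the training data, and the HDFS path where the model is stored, respectively.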
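
The iteration count, batch size, and learning rate usually map onto a mini-batch gradient descent loop. The sketch below shows that mapping for logistic regression on a single machine. It is plain mini-batch SGD written for illustration, not the `LRTest` implementation, and the names `LrSgdSketch`, `train`, and `predict` are assumptions.

```scala
import scala.collection.mutable

object LrSgdSketch {
  // A training example: sparse features (index -> value) and a 0/1 label.
  type Example = (Map[Int, Double], Double)

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Predicted probability of label 1; absent weights count as 0.0.
  def predict(w: Map[Int, Double], x: Map[Int, Double]): Double =
    sigmoid(x.iterator.map { case (i, v) => w.getOrElse(i, 0.0) * v }.sum)

  // iter full passes over the data, one weight update per mini-batch.
  def train(data: Seq[Example], iter: Int, batchSize: Int,
            learningRate: Double): Map[Int, Double] = {
    var w = Map.empty[Int, Double]
    for (_ <- 1 to iter; batch <- data.grouped(batchSize)) {
      // Average log-loss gradient over the batch, stored sparsely.
      val grad = mutable.Map.empty[Int, Double].withDefaultValue(0.0)
      for ((x, y) <- batch) {
        val err = predict(w, x) - y
        for ((i, v) <- x) grad(i) += err * v / batch.size
      }
      // Gradient step on the features touched by this batch only.
      w = grad.foldLeft(w) { case (acc, (i, g)) =>
        acc.updated(i, acc.getOrElse(i, 0.0) - learningRate * g)
      }
    }
    w
  }
}
```

In a parameter-server setting, the weight read inside `predict` would correspond to a pull and the final update to a push, with only the feature indices present in the batch being exchanged.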

FAQ

XDML frequently asked questions

References

XDML draws on many excellent results from academia and industry, and we are grateful for them!

Contact Us

Mail: [email protected]
QQ group: 874050710
