  • Language
    Java
  • License
    Artistic License 2.0
  • Created over 8 years ago
  • Updated 5 months ago

Repository Details

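This project contains the following examples:

MapReduce

  • WordCount: word count

Hive

  • sample.hive: a simple table query

Pig

  • sample.pig: an example of processing OSS data with Pig

Spark

  • SparkPi: computes Pi
  • SparkWordCount: word count
  • LinearRegression: linear regression
  • OSSSample: OSS usage example
  • ONSSample: ONS usage example
  • ODPSSample: ODPS usage example
  • MNSSample: MNS usage example
  • LoghubSample: Loghub usage example

PySpark

  • WordCount: word count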

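Dependencies

Test data (in the data directory):

  • The_Sorrows_of_Young_Werther.txt: can be used as input data for WordCount (MapReduce/Spark)
  • patterns.txt: the characters to filter out in the WordCount (MapReduce) job
  • u.data: test table data for the sample.hive script
  • abalone: test data for the linear regression algorithm

Dependency JARs (in the lib directory):

  • tutorial.jar: the dependency JAR required by the sample.pig job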
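Preparation

This project provides some test data that you can simply upload to OSS and use. For the other examples, such as ODPS, MNS, ONS, and Loghub, you need to prepare the data yourself.

Basic concepts:

  • OSSURI: oss://accessKeyId:[email protected]/a/b/c.txt. Used to specify input and output data sources in a job; analogous to hdfs://.
  • The Alibaba Cloud AccessKeyId/AccessKeySecret pair is the key you use to access Alibaba Cloud APIs. You can obtain it from the Alibaba Cloud console.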

Running on a cluster

  • Spark

    • SparkWordCount: spark-submit --class SparkWordCount examples-1.0-SNAPSHOT-shaded.jar <inputPath> <outputPath> <numPartition>
      • inputPath: input data path
      • outputPath: output path
      • numPartition: number of partitions for the input data RDD
    • SparkPi: spark-submit --class SparkPi examples-1.0-SNAPSHOT-shaded.jar
    • SparkOssDemo: spark-submit --class SparkOssDemo examples-1.0-SNAPSHOT-shaded.jar <accessKeyId> <accessKeySecret> <endpoint> <inputPath> <numPartition>
      • accessKeyId: Alibaba Cloud AccessKeyId
      • accessKeySecret: Alibaba Cloud AccessKeySecret
      • endpoint: Alibaba Cloud OSS endpoint
      • inputPath: input data path
      • numPartition: number of partitions for the input data RDD
    • SparkRocketMQDemo: spark-submit --class SparkRocketMQDemo examples-1.0-SNAPSHOT-shaded.jar <accessKeyId> <accessKeySecret> <consumerId> <topic> <subExpression> <parallelism>
      • accessKeyId: Alibaba Cloud AccessKeyId
      • accessKeySecret: Alibaba Cloud AccessKeySecret
      • consumerId: see the Consumer ID documentation
      • topic: the topic of the message queue (every message queue has one)
      • subExpression: see the message filtering documentation
      • parallelism: the number of receivers used to consume messages from the queue
    • SparkMaxComputeDemo: spark-submit --class SparkMaxComputeDemo examples-1.0-SNAPSHOT-shaded.jar <accessKeyId> <accessKeySecret> <envType> <project> <table> <numPartitions>
      • accessKeyId: Alibaba Cloud AccessKeyId
      • accessKeySecret: Alibaba Cloud AccessKeySecret
      • envType: 0 for the public network environment, 1 for the internal network environment. Choose 0 for local debugging and 1 when running on E-MapReduce.
      • project: see the ODPS quick start guide
      • table: see the ODPS glossary
      • numPartitions: number of partitions for the input data RDD
    • SparkMNSDemo: spark-submit --class SparkMNSDemo examples-1.0-SNAPSHOT-shaded.jar <queueName> <accessKeyId> <accessKeySecret> <endpoint>
      • queueName: queue name; see the MNS glossary
      • accessKeyId: Alibaba Cloud AccessKeyId
      • accessKeySecret: Alibaba Cloud AccessKeySecret
      • endpoint: the access address for queue data
    • SparkSLSDemo: spark-submit --class SparkSLSDemo examples-1.0-SNAPSHOT-shaded.jar <sls project> <sls logstore> <loghub group name> <sls endpoint> <access key id> <access key secret> <batch interval seconds>
      • sls project: LogService project name
      • sls logstore: logstore name
      • loghub group name: the name of the consumer group the job uses to consume log data; any name will do. When the sls project and sls logstore are the same, jobs with the same group name consume the data in the logstore cooperatively, while jobs with different group names consume it in isolation from one another.
      • sls endpoint: see the Log Service endpoints documentation
      • access key id: Alibaba Cloud AccessKeyId
      • access key secret: Alibaba Cloud AccessKeySecret
      • batch interval seconds: the batch interval of the Spark Streaming job, in seconds
    • LinearRegression: spark-submit --class LinearRegression examples-1.0-SNAPSHOT-shaded.jar <inputPath> <numPartitions>
      • inputPath: input data
      • numPartitions: number of partitions for the input data RDD
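
The Spark commands above all share one shape: spark-submit, a --class option naming the main class, the shaded JAR, and then the positional arguments. As a rough illustration, the Python sketch below assembles such a command as an argument list. The helper names build_spark_submit and run are hypothetical and not part of this project; only the class name, the JAR name, and the argument order are taken from the list above.

```python
import shlex
import subprocess

# JAR name exactly as it appears in the commands above.
JAR = "examples-1.0-SNAPSHOT-shaded.jar"


def build_spark_submit(main_class, *job_args, jar=JAR):
    """Return the argv list for `spark-submit --class <main_class> <jar> <args...>`."""
    return ["spark-submit", "--class", main_class, jar, *[str(a) for a in job_args]]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(shlex.quote(a) for a in argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholders stand in for real paths; substitute your own values.
    run(build_spark_submit("SparkWordCount", "<inputPath>", "<outputPath>", 4))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a path or key contains special characters.
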
  • PySpark

    • WordCount: spark-submit wordcount.py <inputPath> <outputPath> <numPartition>
      • inputPath: input data path
      • outputPath: output path
      • numPartition: number of partitions for the input data RDD
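
For orientation, a PySpark word count that accepts the same three arguments might look roughly like the sketch below. It is not the project's wordcount.py; the function names and the application name are assumptions made for illustration, and it presumes pyspark is available on the machine that runs spark-submit.

```python
import sys
from operator import add


def split_words(line):
    """Split a line on whitespace; an empty line yields no words."""
    return line.split()


def main(argv):
    input_path, output_path, num_partition = argv[1], argv[2], int(argv[3])

    # Imported here so split_words can be exercised without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="PySparkWordCountSketch")  # hypothetical app name
    try:
        counts = (
            sc.textFile(input_path, num_partition)  # second argument: minimum partitions
            .flatMap(split_words)
            .map(lambda word: (word, 1))
            .reduceByKey(add)
        )
        counts.saveAsTextFile(output_path)
    finally:
        sc.stop()


if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(sys.argv)
    else:
        print("Usage: wordcount.py <inputPath> <outputPath> <numPartition>", file=sys.stderr)
```
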
  • MapReduce

    • WordCount: hadoop jar examples-1.0-SNAPSHOT-shaded.jar WordCount -Dwordcount.case.sensitive=true <inputPath> <outputPath> -skip <patternPath>
      • inputPath: input data path
      • outputPath: output path
      • patternPath: file of characters to filter out; data/patterns.txt can be used
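
The effect of the wordcount.case.sensitive flag and the -skip pattern file can be pictured with a small local sketch. It assumes the job behaves like the standard Hadoop WordCount tutorial: when case sensitivity is off the line is lowercased, and every pattern from the skip file is removed from the line before it is split into words. That behavior is an assumption about this project's WordCount class, and count_words is a hypothetical name.

```python
import re
from collections import Counter


def count_words(lines, skip_patterns=(), case_sensitive=True):
    """Count whitespace-separated tokens after removing every skip pattern."""
    counts = Counter()
    for line in lines:
        if not case_sensitive:
            line = line.lower()
        for pattern in skip_patterns:
            # Each pattern is treated as a regular expression (assumption).
            line = re.sub(pattern, "", line)
        counts.update(line.split())
    return counts


if __name__ == "__main__":
    sample = ["Hello, world. hello!"]
    print(count_words(sample, skip_patterns=[r"\.", r",", r"!"], case_sensitive=False))
```
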
  • Hadoop Streaming

    • WordCount: hadoop jar /usr/lib/hadoop-current/share/hadoop/tools/lib/hadoop-streaming-*.jar -file <mapperPyFile> -mapper mapper.py -file <reducerPyFile> -reducer reducer.py -input <inputPath> -output <outputPath>
      • mapperPyFile: the mapper file; see the mapper sample
      • reducerPyFile: the reducer file; see the reducer sample
      • inputPath: input data path
      • outputPath: output path
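
Hadoop Streaming passes records to the mapper and reducer as lines on standard input and reads tab-separated key/value lines from their standard output, with the reducer input sorted by key. The sketch below shows a word-count mapper and reducer written against that contract. It is not the project's mapper or reducer sample; the function names and the "map"/"reduce" role switch are hypothetical, and in practice the two functions would live in the separate mapper.py and reducer.py files named in the command.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record per whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_records(records):
    """Reducer: sum the counts per word; records must arrive sorted by key."""
    pairs = (r.rstrip("\n").split("\t", 1) for r in records if r.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Hypothetical role switch so one file can play either part locally.
    role = sys.argv[1] if len(sys.argv) > 1 else ""
    if role == "map":
        for out in map_lines(sys.stdin):
            print(out)
    elif role == "reduce":
        for out in reduce_records(sys.stdin):
            print(out)
```

Because both stages only read and write text lines, the pair can be checked locally by piping a file through the mapper, the sort command, and the reducer before submitting the job.
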
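  • Hive

    • hive -f sample.hive -hiveconf inputPath=<inputPath>
      • inputPath: input data path
  • Pig

    • pig -x mapreduce -f sample.pig -param tutorial=<tutorialJarPath> -param input=<inputPath> -param result=<resultPath>
      • tutorialJarPath: the dependency JAR; lib/tutorial.jar can be used
      • inputPath: input data path
      • resultPath: output path
  • Notes:

    • When running on E-MapReduce, upload the test data and dependency JARs to OSS. Paths follow the OSSURI format defined above.
    • If you are running on a cluster directly, the files can be kept locally on the machines.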

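Running locally

This section describes how to run Spark programs locally against Alibaba Cloud data sources such as OSS. For local debugging, it is best to use a development tool such as IntelliJ IDEA or Eclipse, especially on Windows; otherwise you have to configure a Hadoop and Spark runtime environment on the Windows machine, which is tedious.

  • IntelliJ IDEA

    • Prerequisites: install IntelliJ IDEA, Maven, the IntelliJ IDEA Maven plugin, Scala, and the IntelliJ IDEA Scala plugin
    • Double-click SparkWordCount.scala to open it (screenshot: idea5)
    • Open the job configuration dialog from the location marked by the arrow (screenshot: idea1)
    • Select SparkWordCount and enter the required job arguments in the arguments field (screenshot: idea2)
    • Click “OK”
    • Click the Run button to execute the job (screenshot: idea3)
    • View the job execution log (screenshot: idea4)
  • Scala IDE for Eclipse

    • Prerequisites: install Scala IDE for Eclipse, Maven, and the Eclipse Maven plugin
    • Import the project (screenshots: eclipse2, eclipse3, eclipse4)
    • Run As Maven build; the shortcut is “Alt + Shift + X, M”. Alternatively, right-click the project name and choose “Maven build” under “Run As”
    • After the build finishes, right-click the job you want to run and choose “Run Configuration” to open the configuration page
    • On the configuration page, select Scala Application and set the job's Main Class, arguments, and so on (screenshot: eclipse5)
    • Click “Run”
    • View the console output log (screenshot: eclipse6)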
