recsys_spark
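Spark SQL implementations of ItemCF, UserCF, and Swing: the collaborative filtering (CF) recall module of a recommender system.
Data format
Item transaction data with three fields: user ID, item ID, and transaction date (userid, itemid, date). Blacklisted users and items are filtered out and take no part in the computation.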
date | userid | itemid |
---|---|---|
2019-05-09 | 1901140040225006 | 103943 |
2019-05-09 | 1806041288325006 | 56610 |
2019-05-09 | 1812060050236636 | 16368 |
2019-05-09 | 1812060050261006 | 101562 |
2019-05-09 | 1901160070407006 | 79874 |
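The blacklist filtering and the grouping of transactions into per-user and per-item sets can be sketched in plain Python. This is only an illustration of the logic, not the Spark SQL used in this repo; the function name `build_histories` and the blacklisted user chosen below are hypothetical.

```python
from collections import defaultdict


def build_histories(rows, user_blacklist=frozenset(), item_blacklist=frozenset()):
    """Group (date, userid, itemid) rows into purchase histories.

    Returns (user_items, item_users) as dicts of sets.
    Rows that mention a blacklisted user or item are skipped.
    """
    user_items = defaultdict(set)
    item_users = defaultdict(set)
    for _date, userid, itemid in rows:
        if userid in user_blacklist or itemid in item_blacklist:
            continue
        user_items[userid].add(itemid)
        item_users[itemid].add(userid)
    return dict(user_items), dict(item_users)


rows = [
    ("2019-05-09", "1901140040225006", "103943"),
    ("2019-05-09", "1806041288325006", "56610"),
    ("2019-05-09", "1812060050236636", "16368"),
]
# Hypothetical blacklist: pretend the second user must be excluded.
user_items, item_users = build_histories(rows, user_blacklist={"1806041288325006"})
```

Both mappings are needed later: `user_items` plays the role of N(u) and `item_users` the role of N(i).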
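ItemCF (item-based collaborative filtering)
An i2i2u algorithm: the items a user has already purchased act as the bridge between that user and other items. Item co-occurrence is used as the similarity, and the long purchase sequences of highly active users are penalized. Similarity formula: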
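The formula itself does not survive in this text. A common form that matches the description (an assumption, not necessarily the exact formula used in this repo) is $w_{ij} = \frac{\sum_{u \in N(i) \cap N(j)} 1/\log(1+|N(u)|)}{\sqrt{|N(i)|\,|N(j)|}}$, where $N(i)$ is the set of users who bought item $i$ and $N(u)$ is the set of items bought by user $u$. The plain-Python sketch below computes it on a toy input; the name `itemcf_similarity` is hypothetical, and the repo itself works in Spark SQL.

```python
import math
from collections import defaultdict
from itertools import combinations


def itemcf_similarity(user_items):
    """user_items: dict userid -> set of itemids, i.e. N(u).

    Returns dict (i, j) -> similarity, stored in both directions.
    """
    co = defaultdict(float)        # (i, j) -> penalized co-occurrence count
    item_count = defaultdict(int)  # i -> |N(i)|, number of buyers of i
    for items in user_items.values():
        if not items:
            continue
        # Active users (long sequences) contribute less per co-occurrence.
        penalty = 1.0 / math.log(1.0 + len(items))
        for i in items:
            item_count[i] += 1
        for i, j in combinations(sorted(items), 2):
            co[(i, j)] += penalty
            co[(j, i)] += penalty
    return {
        (i, j): w / math.sqrt(item_count[i] * item_count[j])
        for (i, j), w in co.items()
    }


toy = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"a"}}
sim = itemcf_similarity(toy)
```

In the toy input, items `a` and `b` co-occur for two users while `a` and `c` co-occur only for the most active user, so `sim[("a", "b")]` comes out larger than `sim[("a", "c")]`.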
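UserCF (user-based collaborative filtering)
A u2u2i algorithm: users act as the bridge between other users and items. User co-occurrence is used as the similarity, and the long user sequences of popular items are penalized. The similarity formula is obtained from the ItemCF formula by replacing i, j (item 1, item 2) in the numerator and denominator with u, v (user 1, user 2), and replacing the user sequence N(u) with the item sequence N(i).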
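A plain-Python sketch of that substitution follows. It assumes the commonly used log-penalized, cosine-normalized form, i.e. each shared item $i$ contributes $1/\log(1+|N(i)|)$ and the sum is divided by $\sqrt{|N(u)|\,|N(v)|}$; this exact form and the name `usercf_similarity` are assumptions, not taken from the repo.

```python
import math
from collections import defaultdict
from itertools import combinations


def usercf_similarity(item_users):
    """item_users: dict itemid -> set of userids, i.e. N(i).

    Returns dict (u, v) -> similarity, stored in both directions.
    """
    co = defaultdict(float)        # (u, v) -> penalized co-occurrence count
    user_count = defaultdict(int)  # u -> |N(u)|, number of items bought by u
    for users in item_users.values():
        if not users:
            continue
        # Popular items (many buyers) contribute less per co-occurrence.
        penalty = 1.0 / math.log(1.0 + len(users))
        for u in users:
            user_count[u] += 1
        for u, v in combinations(sorted(users), 2):
            co[(u, v)] += penalty
            co[(v, u)] += penalty
    return {
        (u, v): w / math.sqrt(user_count[u] * user_count[v])
        for (u, v), w in co.items()
    }


toy = {"a": {"u1", "u2", "u3"}, "b": {"u1", "u2"}, "c": {"u2"}}
user_sim = usercf_similarity(toy)
```

Here `u1` and `u2` share two items, one of them less popular, so they end up more similar than `u1` and `u3`, who share only the most popular item.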
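Swing (graph-based collaborative filtering)
An i2i2u algorithm: the items a user has already purchased act as the bridge between that user and other items. To measure the similarity of items i and j, consider each pair of users u and v who both purchased i and j: the fewer items these two users have purchased in common, the higher the similarity of i and j. Similarity formula: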
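The formula itself does not survive in this text. A commonly cited form of Swing that matches the description is $s(i,j) = \sum_{u \in U_i \cap U_j} \sum_{v \in U_i \cap U_j} \frac{1}{\alpha + |I_u \cap I_v|}$, where $U_i$ is the set of users who bought item $i$, $I_u$ is the set of items bought by user $u$, and $\alpha$ is a smoothing constant. The sketch below is a plain-Python illustration under those assumptions: it counts each unordered user pair once, uses a default `alpha=1.0`, and the name `swing_similarity` is hypothetical. The repo's Spark SQL version may differ, for example by adding per-user weights.

```python
from collections import defaultdict
from itertools import combinations


def swing_similarity(user_items, alpha=1.0):
    """user_items: dict userid -> set of itemids.

    Returns dict (i, j) -> Swing score, stored in both directions.
    """
    item_users = defaultdict(set)
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)

    sim = {}
    for i, j in combinations(sorted(item_users), 2):
        common_users = sorted(item_users[i] & item_users[j])
        score = 0.0
        # Each pair of users who both bought i and j adds more weight
        # when the two users otherwise have little in common.
        for u, v in combinations(common_users, 2):
            overlap = len(user_items[u] & user_items[v])
            score += 1.0 / (alpha + overlap)
        if score > 0:
            sim[(i, j)] = sim[(j, i)] = score
    return sim


toy = {
    "u1": {"a", "b"},
    "u2": {"a", "b"},
    "u3": {"c", "d", "e", "f"},
    "u4": {"c", "d", "e", "f"},
}
swing = swing_similarity(toy)
```

In the toy input, the pair (`a`, `b`) is supported by two users who share only 2 items, while (`c`, `d`) is supported by two users who share 4, so (`a`, `b`) receives the higher score. The double loop over user pairs is quadratic in the number of common buyers, which is why production versions usually cap or sample the users per item.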
Computing similar items
ItemCF results, as returned by a Phoenix query against HBase:
item => [[item1, score],[item2, score]...]
spu | recommend |
---|---|
00017_201209 | [[201210,0.07535],[221502,0.03041],[215272,0.01753],[212219,0.01753],[228212,0.01688] |
00042_103060 | [[61212,0.03611],[10525,0.02616],[101486,0.03138],[91764,0.01898],[95527,0.02186],[661 |
0006d_25593 | [[6598,0.00319],[11129,0.00762],[178,0.00696],[8558,0.0041],[11398,0.0029],[25536,0.012 |
00077_35837 | [[25518,0.01044],[36420,0.41703],[36357,0.15762],[83810,0.02686],[103838,0.02686],[1038 |
0007c_9700 | [[227970,0.03401],[219462,0.02626],[219401,0.02626],[223635,0.02247],[223641,0.02247],[2 |
000cb_33363 | [[8572,0.00877],[19665,0.00756],[12812,0.01092],[11853,0.0094],[8528,0.01173],[1705,0.0 |
000d0_50738 | [[119582,0.03503],[100296,0.02922],[97248,0.02309],[72044,0.02153],[79245,0.02023],[119 |
000d5_68111 | [[50729,0.00632],[67871,0.02315],[68081,0.01277],[9624,0.01253],[57234,0.00996],[67983, |
000dd_45311 | [[3721,0.02095],[21908,0.0156],[25633,0.01145],[5002,0.01438],[28633,0.02605],[17088,0. |
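Turning pairwise scores into the `item => [[item1, score],[item2, score]...]` layout can be sketched as follows. The sorting by score, the cutoff `k`, the rounding to 5 decimals, and the name `top_k_neighbors` are all assumptions made for illustration; the rows above are not strictly sorted, so the repo may order or truncate differently.

```python
from collections import defaultdict


def top_k_neighbors(sim, k=2):
    """sim: dict (i, j) -> score. Returns dict i -> [[j, score], ...].

    Keeps the k highest-scoring neighbors per item (ties broken by item id).
    """
    by_item = defaultdict(list)
    for (i, j), score in sim.items():
        by_item[i].append([j, round(score, 5)])
    return {
        i: sorted(neighbors, key=lambda pair: (-pair[1], pair[0]))[:k]
        for i, neighbors in by_item.items()
    }


# Toy scores, not taken from the table above.
toy_sim = {("a", "b"): 0.07535, ("a", "c"): 0.03041, ("a", "d"): 0.5, ("b", "a"): 0.07535}
neighbors = top_k_neighbors(toy_sim, k=2)
```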
Computing recommendation results
user => [item1, item2, item3...]
userid | recommend |
---|---|
00000_180731 | [50648,14253,211049,14255,209517,112985,48507,13458,206846,35472,18769,97610,78105,21 |
00003_532933 | [203038,78262,81480,120623,203040,81447,100994,203009,101491,81457,114550,55115,80139 |
00007_552871 | [105023,10199,100894,100565,99769,96980,30781,115965,230960,95059,11129,104702,51831,6 |
0000b_194813 | [231082,60365,101950,57700,209504,113725,101939,5906,94771,59979,237823,102324,229264 |
0000e_398677 | [210020,210019,74081,91787,48428,90769,17449,91800,91822,17448,91823,91803,17437,1162 |
0000e_590120 | [106907,72369,94907,74972,79603,97245,202614,97243,207393,229353,74063,78596,210969,11 |
00010_180604 | [73633,24509,24507,7481,101877,107612,116350,100115,34379,229431,113725,229618,236254, |
00011_536634 | [209481,210381,112120,234451,113968,119215,64699,121035,106867,121057,103750,48503,12, |
00013_180604 | [212154,212156,212157,17141,62421,69801,232732,62407,211132,211029,37857,215047,8741,6 |
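How the item similarity lists are combined with a user's purchases is not spelled out here. One simple possibility is sketched below in plain Python: sum the similarity scores of every neighbor of every purchased item, drop items the user already bought, and keep the top N. The aggregation by sum, the exclusion rule, and the name `recommend` are assumptions for illustration only.

```python
from collections import defaultdict


def recommend(purchased, item_neighbors, top_n=10):
    """purchased: set of itemids the user already bought.
    item_neighbors: dict itemid -> [[neighbor, score], ...].

    Returns a list of recommended itemids, best first.
    """
    scores = defaultdict(float)
    for item in purchased:
        for neighbor, score in item_neighbors.get(item, []):
            if neighbor not in purchased:
                scores[neighbor] += score
    # Highest accumulated score first; ties broken by item id.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Toy neighbor lists, not taken from the tables above.
toy_neighbors = {
    "A": [["B", 0.5], ["C", 0.2]],
    "D": [["C", 0.4], ["A", 0.9]],
}
recs = recommend({"A", "D"}, toy_neighbors, top_n=10)
```

Item `C` is reachable from both purchased items, so its accumulated score (0.2 + 0.4) beats `B` (0.5), and `A` is excluded because it was already purchased.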