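Tencent 2018 Social Ads Competition. Together with a friend, I reached the finals using the basic architecture from Facebook's "Practical Lessons from Predicting Clicks on Ads at Facebook". The main pipeline is summarized below.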
import numpy as np
import pandas as pd
import scipy
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from scipy import sparse
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
import lightgbm as lgb
import time
from sklearn.metrics import roc_auc_score

print('Start loading Train, Test, Adfeature...')

# Data Loading
train = pd.read_csv("/home/riren/Documents/4fun/preliminary_contest_data/train.csv", sep=",")
test = pd.
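The architecture in the Facebook paper feeds the leaf indices of boosted trees into a logistic regression as one-hot features. The following is a minimal sketch of that idea, not the competition code: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for LightGBM, runs on synthetic data from `make_classification`, and the helper name `leaf_indices` and all hyperparameters are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the ad click data (assumption, not the contest data).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit the trees and the linear model on different rows to limit overfitting.
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=30, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)


def leaf_indices(model, features):
    """Return the leaf each sample falls into, one column per tree."""
    # apply() has shape (n_samples, n_estimators, 1) for binary classification.
    return model.apply(features)[:, :, 0]


# One-hot encode the leaf indices; unseen leaves at predict time become zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(leaf_indices(gbdt, X_gbdt))

# Logistic regression on top of the encoded leaves.
lr = LogisticRegression(max_iter=1000)
lr.fit(encoder.transform(leaf_indices(gbdt, X_lr)), y_lr)

test_scores = lr.predict_proba(
    encoder.transform(leaf_indices(gbdt, X_test)))[:, 1]
auc = roc_auc_score(y_test, test_scores)
print(f"GBDT + LR test AUC: {auc:.4f}")
```

With LightGBM, the leaf indices would come from predicting with `pred_leaf=True` instead of `apply()`; the encoding and logistic regression steps stay the same.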
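In most scenarios, when we need to run some simple ETL on data in HDFS, we usually reach directly for Hive or Apache Pig Latin. In a few other cases, when we want to plug in a module written in another scripting language, such as Python, to handle more complex work, we generally have two options: a UDF (User Defined Function) or Hadoop Streaming.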
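For the Hadoop Streaming option, the script is an ordinary executable that reads records from standard input and writes tab-separated key/value lines to standard output. Here is a minimal mapper sketch under that contract; the function name `map_lines` and the choice of the first tab-separated field as the key are assumptions for illustration.

```python
import sys


def map_lines(lines):
    """Emit 'key<TAB>1' for the first tab-separated field of each input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and records with an empty key.
        if not fields[0]:
            continue
        yield f"{fields[0]}\t1"


if __name__ == "__main__":
    # Hadoop Streaming pipes input splits to stdin and collects stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

A script like this is passed to the streaming job as the mapper, with a matching reducer that sums the counts per key.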
UDF with Python
Using a Python UDF with Pig is simple: put a .py file and a .pig file in the same directory.
E.g.
udf_testing.py
@outputSchema('word:chararray')
def hi_world():
    return "hello world"

# No @outputSchema here, so Pig treats the return value as bytearray.
def bingo(s):
    return s + 'bingo'

udf_testing.pig
REGISTER 'udf_testing.py' using jython as my_udfs;

page_views = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range', 'start.date=20171001;end.date=20171002;error.on.missing=false');

hello_users = FOREACH page_views GENERATE
    requestHeader.pageKey,
    my_udfs.hi_world(),
    my_udfs.bingo(requestHeader.pageKey);

DUMP hello_users;

Note: limitation from Jython
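Because Pig supplies the `outputSchema` decorator only when the script is registered under Jython, running udf_testing.py directly with a plain Python interpreter fails with a `NameError`. One way to exercise the UDF logic locally is to define a stand-in decorator first. The sketch below is a hypothetical test harness, not part of Pig: the stub simply records the schema string on the function and makes no attempt to reproduce Pig's own behavior.

```python
# Hypothetical local harness for the UDFs above; the stub decorator is an
# assumption for testing outside Pig and is not Pig's implementation.
def outputSchema(schema):
    """Stand-in decorator: attach the schema string and return the function."""
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


@outputSchema('word:chararray')
def hi_world():
    return "hello world"


def bingo(s):
    return s + 'bingo'


if __name__ == "__main__":
    print(hi_world())
    print(bingo("home_"))
    print(hi_world.outputSchema)
```

Keeping the UDF bodies free of Pig-specific calls, as here, makes this kind of local check possible before submitting the job.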
I reserve my headline for @Leanne and propose a toast to her luck in the next few months.
Praise the sun