
Using Machine Learning to Identify Cheating Bot Users in Online Auctions (Part 2)


Abstract: Continuing the previous article on using machine learning to identify cheating bot users in auctions, this post covers the rest of the project, a Kaggle competition hosted by Facebook (competition page under Data Source, complete code on my GitHub): data exploration, data preprocessing and feature engineering, model design and evaluation. The extra packages required include mlxtend for the stacking ensemble, and the whole project runs in roughly 60 minutes.

This article continues the previous post: Using Machine Learning to Identify Cheating Bot Users in Online Auctions (Part 1).

This project is based on a Kaggle competition hosted by Facebook; see Data Source below for the competition page and my GitHub for the complete code. Feel free to explore.

Code

Data exploration — Data_Exploration.ipynb

Data preprocessing & feature engineering — Feature_Engineering.ipynb & Feature_Engineering2.ipynb

Model design and evaluation — Model_Design.ipynb

Data Source

kaggle

Additional Packages Required

numpy

pandas

matplotlib

sklearn

xgboost

lightgbm

mlxtend: provides the Stacking ensemble method
The whole project takes roughly 60 minutes to run on an Ubuntu machine with 8 GB of RAM; the results can be found in the submitted Jupyter notebook files.


Because the full write-up is quite long, it is split into two articles covering four parts in total:

Data exploration

Data preprocessing and feature engineering

Model design

Evaluation and summary


Feature Engineering (continued)
import numpy as np
import pandas as pd
import pickle
%matplotlib inline
from IPython.display import display
# bids = pd.read_csv("bids.csv")
bids = pickle.load(open("bids.pkl", "rb"))  # load the bids DataFrame cached in part one
print bids.shape
display(bids.head())
(7656329, 9)

bid_id bidder_id auction merchandise device time country ip url
0 0 8dac2b259fd1c6d1120e519fb1ac14fbqvax8 ewmzr jewelry phone0 9759243157894736 us 69.166.231.58 vasstdc27m7nks3
1 1 668d393e858e8126275433046bbd35c6tywop aeqok furniture phone1 9759243157894736 in 50.201.125.84 jmqlhflrzwuay9c
2 2 aa5f360084278b35d746fa6af3a7a1a5ra3xe wa00e home goods phone2 9759243157894736 py 112.54.208.157 vasstdc27m7nks3
3 3 3939ac3ef7d472a59a9c5f893dd3e39fh9ofi jefix jewelry phone4 9759243157894736 in 18.99.175.133 vasstdc27m7nks3
4 4 8393c48eaf4b8fa96886edc7cf27b372dsibi jefix jewelry phone5 9759243157894736 in 145.138.5.37 vasstdc27m7nks3
bidders = bids.groupby("bidder_id")
Convert the single multi-category features, country and merchandise, into multiple independent indicator features and compute them per bidder
cates = (bids["merchandise"].unique()).tolist()
countries = (bids["country"].unique()).tolist()

def dummy_coun_cate(group):
    # merchandise columns hold bid counts per category; country columns are 0/1 occurrence flags
    coun_cate = dict.fromkeys(cates, 0)
    coun_cate.update(dict.fromkeys(countries, 0))
    for cat, value in group["merchandise"].value_counts().iteritems():
        coun_cate[cat] = value

    for c in group["country"].unique():
        coun_cate[c] = 1

    coun_cate = pd.Series(coun_cate)
    return coun_cate
bidder_coun_cate = bidders.apply(dummy_coun_cate)
display(bidder_coun_cate.describe())
bidder_coun_cate.to_csv("coun_cate.csv")
(Output of bidder_coun_cate.describe(): 8 rows × 209 columns — one 0/1 indicator column per country code (ad, ae, af, …, zz) plus one count column per merchandise category; most columns are 0 for most bidders, so the quartiles are dominated by zeros.)
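For reference, the same country/merchandise indicator features can usually be built faster with a vectorized one-hot encoding instead of the groupby-apply above. This is a rough sketch of my own (not part of the original notebook); note that materializing roughly 200 dummy columns for 7.6 million bids takes a fair amount of memory.

# Alternative sketch (not in the original notebook): vectorized one-hot encoding.
country_flags = pd.get_dummies(bids["country"]).groupby(bids["bidder_id"]).max()    # 1 if the bidder ever bid from that country
merch_counts = pd.get_dummies(bids["merchandise"]).groupby(bids["bidder_id"]).sum() # bid counts per merchandise category
bidder_coun_cate_alt = pd.concat([merch_counts, country_flags], axis=1)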

Similarly, for each bidder we also compute statistics on the time intervals between his or her consecutive bids

def bidder_interval(group):
    time_diff = np.ediff1d(group["time"])
    bidder_interval = {}
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)
    bidder_interval["tmean"] = diff_mean
    bidder_interval["tstd"] = diff_std
    bidder_interval["tmedian"] = diff_median
    bidder_interval["tzeros"] = diff_zeros
    bidder_interval = pd.Series(bidder_interval)
    return bidder_interval
bidder_inv = bidders.apply(bidder_interval)
display(bidder_inv.describe())
bidder_inv.to_csv("bidder_inv.csv")
tmean tmedian tstd tzeros
count 6.609000e+03 6.609000e+03 6.609000e+03 6609.000000
mean 2.933038e+12 1.860285e+12 3.440901e+12 122.986231
std 8.552343e+12 7.993497e+12 6.512992e+12 3190.805229
min 0.000000e+00 0.000000e+00 0.000000e+00 0.000000
25% 1.192853e+10 2.578947e+09 1.749995e+09 0.000000
50% 2.641139e+11 5.726316e+10 5.510107e+11 0.000000
75% 1.847456e+12 6.339474e+11 2.911282e+12 0.000000
max 7.610295e+13 7.610295e+13 3.800092e+13 231570.000000
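One caveat: np.ediff1d takes differences in row order, so the interval statistics above assume each bidder's bids are already in chronological order (as prepared in part one). If that ordering were ever not guaranteed, the same computation could sort first — a minimal sketch of my own, not in the original notebook:

# Hedged sketch: sort each bidder's bid times before differencing, then compute the
# same four statistics as bidder_interval above.
def bidder_interval_sorted(group):
    time_diff = np.ediff1d(np.sort(group["time"].values))
    n = len(time_diff)
    return pd.Series({
        "tmean": np.mean(time_diff) if n else 0,
        "tstd": np.std(time_diff) if n else 0,
        "tmedian": np.median(time_diff) if n else 0,
        "tzeros": n - np.count_nonzero(time_diff),
    })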

Further Analysis Grouped by Bidder-Auction Pairs

The statistics so far grouped the bids by bidder and characterized each bidder's overall bidding behavior. We now split further by auction to look at each bidder's behavior pattern within each individual auction. Analogous to the per-bidder statistics above, the following features are computed for every bidder-auction pair:

Basic counts: for each bidder in each auction, the numbers of distinct devices, countries, IPs, URLs and merchandise categories, plus the number of bids, are taken as new features

Time-interval statistics: the mean, standard deviation, median and number of zero gaps between the bidder's consecutive bids within the auction

Merchandise category and country are again expanded into multiple indicator features

def auc_features_count(group):
    time_diff = np.ediff1d(group["time"])
    
    if len(time_diff) == 0:
        diff_mean = 0
        diff_std = 0
        diff_median = 0
        diff_zeros = 0
    else:
        diff_mean = np.mean(time_diff)
        diff_std = np.std(time_diff)
        diff_median = np.median(time_diff)
        diff_zeros = time_diff.shape[0] - np.count_nonzero(time_diff)

    row = dict.fromkeys(cates, 0)
    row.update(dict.fromkeys(countries, 0))

    row["devices_c"] = group["device"].unique().shape[0]
    row["countries_c"] = group["country"].unique().shape[0]
    row["ip_c"] = group["ip"].unique().shape[0]
    row["url_c"] = group["url"].unique().shape[0]
#     row["merch_c"] = group["merchandise"].unique().shape[0]
    row["bids_c"] = group.shape[0]
    row["tmean"] = diff_mean
    row["tstd"] = diff_std
    row["tmedian"] = diff_median
    row["tzeros"] = diff_zeros

    for cat, value in group["merchandise"].value_counts().iteritems():
        row[cat] = value

    for c in group["country"].unique():
        row[c] = 1

    row = pd.Series(row)
    return row
bidder_auc = bids.groupby(["bidder_id", "auction"]).apply(auc_features_count)
bidder_auc.to_csv("bids_auc.csv")
print bidder_auc.shape
(382336, 218)
Model Design and Evaluation

Merging the Features

Merge all the previously generated features to form the final feature space.

import numpy as np
import pandas as pd
%matplotlib inline
from IPython.display import display

First the per-bidder statistics are merged together, then joined with the bidder-auction pair features (averaged per bidder), and finally the time features are added; the result is joined with the training and test sets to produce the feature files used for training and prediction later on.

def merge_data():    
    train = pd.read_csv("train.csv")
    test = pd.read_csv("test.csv")

    time_differences = pd.read_csv("tdiff.csv", index_col=0)
    bids_auc = pd.read_csv("bids_auc.csv")

    bids_auc = bids_auc.groupby("bidder_id").mean()
    
    bidders = pd.read_csv("cnt_bidder.csv", index_col=0)
    country_cate = pd.read_csv("coun_cate.csv", index_col=0)
    bidder_inv = pd.read_csv("bidder_inv.csv", index_col=0)
    bidders = bidders.merge(country_cate, right_index=True, left_index=True)
    bidders = bidders.merge(bidder_inv, right_index=True, left_index=True)

    bidders = bidders.merge(bids_auc, right_index=True, left_index=True)
    bidders = bidders.merge(time_differences, right_index=True,
                            left_index=True)

    train = train.merge(bidders, left_on="bidder_id", right_index=True)
    train.to_csv("train_full.csv", index=False)

    test = test.merge(bidders, left_on="bidder_id", right_index=True)
    test.to_csv("test_full.csv", index=False)    
merge_data()
train_full = pd.read_csv("train_full.csv")
test_full = pd.read_csv("test_full.csv")
print train_full.shape
print test_full.shape
(1983, 445)
(4626, 444)

train_full["outcome"] = train_full["outcome"].astype(int)
ytrain = train_full["outcome"]
train_full.drop("outcome", 1, inplace=True)

test_ids = test_full["bidder_id"]

labels = ["payment_account", "address", "bidder_id"]
train_full.drop(labels=labels, axis=1, inplace=True)
test_full.drop(labels=labels, axis=1, inplace=True)
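Before moving on to the models, a small sanity check (my addition, not in the original notebook) can confirm that the training and test feature matrices line up column-for-column and flag any missing values left over from the merges.

# Sanity check (not in the original): the models below expect identical feature columns
# in train_full and test_full and purely numeric data.
assert list(train_full.columns) == list(test_full.columns)
print(train_full.isnull().sum().sum())  # total missing values remaining after the merges
print(test_full.isnull().sum().sum())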
Designing the Cross-Validation

Model Selection

Based on the earlier analysis, the dataset suffers from class imbalance between positive and negative examples, so four models — RandomForestClassifier, GradientBoostingClassifier, xgboost and lightgbm — are chosen for training and prediction. The basic procedure for settling on the final model is as follows:

Cross-validate each of the four models with roc_auc as the scoring function and plot the ROC curves, then average each model's scores over the folds and report them

Decide which model or models to keep based on those scores

If one model clearly outperforms all the others, tune that model further

If several models all perform well, ensemble them into an aggregated model

Use GridSearchCV to pick the best parameter combination from a hand-specified grid and fix the final model

from scipy import interp
import time
import matplotlib.pyplot as plt
from itertools import cycle

# from sklearn.cross_validation import StratifiedKFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, roc_curve, auc

def kfold_plot(train, ytrain, model):
#     kf = StratifiedKFold(y=ytrain, n_folds=5)
    kf = StratifiedKFold(n_splits=5)
    scores = []
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100)
    exe_time = []
    
    colors = cycle(["cyan", "indigo", "seagreen", "yellow", "blue"])
    lw = 2
    
    i=0
    for (train_index, test_index), color in zip(kf.split(train, ytrain), colors):
        X_train, X_test = train.iloc[train_index], train.iloc[test_index]
        y_train, y_test = ytrain.iloc[train_index], ytrain.iloc[test_index]
        begin_t = time.time()
        predictions = model(X_train, X_test, y_train)
        end_t = time.time()
        exe_time.append(round(end_t-begin_t, 3))
#         model = model
#         model.fit(X_train, y_train)    
#         predictions = model.predict_proba(X_test)[:, 1]        
        scores.append(roc_auc_score(y_test.astype(float), predictions))        
        fpr, tpr, thresholds = roc_curve(y_test, predictions)
        mean_tpr += interp(mean_fpr, fpr, tpr)
        mean_tpr[0] = 0.0
        roc_auc = auc(fpr, tpr)
        plt.plot(fpr, tpr, lw=lw, color=color, label="ROC fold %d (area = %0.2f)" % (i, roc_auc))
        i += 1
    plt.plot([0, 1], [0, 1], linestyle="--", lw=lw, color="k", label="Luck")
    
    mean_tpr /= kf.get_n_splits(train, ytrain)
    mean_tpr[-1] = 1.0
    mean_auc = auc(mean_fpr, mean_tpr)
    plt.plot(mean_fpr, mean_tpr, color="g", linestyle="--", label="Mean ROC (area = %0.2f)" % mean_auc, lw=lw)
    
    plt.xlim([-0.05, 1.05])
    plt.ylim([-0.05, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic")
    plt.legend(loc="lower right")
    plt.show()
    
#     print "scores: ", scores
    print "mean scores: ", np.mean(scores)
    print "mean model process time: ", np.mean(exe_time), "s"
    
    return scores, np.mean(scores), np.mean(exe_time)

Collect each model's cross-validation results: the AUC score of every fold, the mean AUC, and the model's training time

dct_scores = {}
mean_score = {}
mean_time = {}
RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import time
from sklearn.ensemble import RandomForestClassifier

def forest_model(X_train, X_test, y_train):
#     begin_t = time.time()
    model = RandomForestClassifier(n_estimators=160, max_features=35, max_depth=8, random_state=7)
    model.fit(X_train, y_train)    
#     end_t = time.time()
#     print "train time of forest model: ",round(end_t-begin_t, 3), "s"
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["forest"], mean_score["forest"], mean_time["forest"] = kfold_plot(train_full, ytrain, forest_model)
# kfold_plot(train_full, ytrain, model_forest)

mean scores:  0.909571935157
mean model process time:  0.643 s

from sklearn.ensemble import GradientBoostingClassifier
def gradient_model(X_train, X_test, y_train):
    model = GradientBoostingClassifier(n_estimators=200, random_state=7, max_depth=5, learning_rate=0.03)
    model.fit(X_train, y_train)
    predictions = model.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["gbm"], mean_score["gbm"], mean_time["gbm"] = kfold_plot(train_full, ytrain, gradient_model)

mean scores:  0.911847771023
mean model process time:  4.1948 s

import xgboost as xgb
def xgboost_model(X_train, X_test, y_train):
    X_train = xgb.DMatrix(X_train.values, label=y_train.values)
    X_test = xgb.DMatrix(X_test.values)
    params = {"objective": "binary:logistic", "eval_metric": "auc", "silent": 1, "seed": 7,
              "max_depth": 6, "eta": 0.01}    
    model = xgb.train(params, X_train, 600)
    predictions = model.predict(X_test)
    return predictions

dct_scores["xgboost"], mean_score["xgboost"], mean_time["xgboost"] = kfold_plot(train_full, ytrain, xgboost_model)

mean scores:  0.915372340426
mean model process time:  3.1482 s

import lightgbm as lgb
def lightgbm_model(X_train, X_test, y_train):
    X_train = lgb.Dataset(X_train.values, y_train.values)
    params = {"objective": "binary", "metric": {"auc"}, "learning_rate": 0.01, "max_depth": 6, "seed": 7}
    model = lgb.train(params, X_train, num_boost_round=600)
    predictions = model.predict(X_test)
    return predictions
dct_scores["lgbm"], mean_score["lgbm"], mean_time["lgbm"] = kfold_plot(train_full, ytrain, lightgbm_model)

mean scores:  0.921512158055
mean model process time:  0.3558 s

Model Comparison

Compare the four models' average roc_auc scores on the cross-validation folds and their training times

def plot_model_comp(title, y_label, dct_result):
    data_source = dct_result.keys()
    y_pos = np.arange(len(data_source))
    # model_auc = [0.910, 0.912, 0.915, 0.922]
    model_auc = dct_result.values()
    barlist = plt.bar(y_pos, model_auc, align="center", alpha=0.5)
    # get the index of highest score
    max_val = max(model_auc)
    idx = model_auc.index(max_val)
    barlist[idx].set_color("r")
    plt.xticks(y_pos, data_source)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
    print "The highest auc score is {0} of model: {1}".format(max_val, data_source[idx])
plot_model_comp("Model Performance", "roc-auc score", mean_score)

The highest auc score is 0.921512158055 of model: lgbm

def plot_time_comp(title, y_label, dct_result):
    data_source = dct_result.keys()
    y_pos = np.arange(len(data_source))
    # model_auc = [0.910, 0.912, 0.915, 0.922]
    model_auc = dct_result.values()
    barlist = plt.bar(y_pos, model_auc, align="center", alpha=0.5)
    # get the index of highest score
    min_val = min(model_auc)
    idx = model_auc.index(min_val)
    barlist[idx].set_color("r")
    plt.xticks(y_pos, data_source)
    plt.ylabel(y_label)
    plt.title(title)
    plt.show()
    print "The shortest time is {0} of model: {1}".format(min_val, data_source[idx])
plot_time_comp("Time of Building Model", "time(s)", mean_time)

The shortest time is 0.3558 of model: lgbm

auc_forest = dct_scores["forest"]
auc_gb = dct_scores["gbm"]
auc_xgb = dct_scores["xgboost"]
auc_lgb = dct_scores["lgbm"]
print "std of forest auc score: ",np.std(auc_forest)
print "std of gbm auc score: ",np.std(auc_gb)
print "std of xgboost auc score: ",np.std(auc_xgb)
print "std of lightgbm auc score: ",np.std(auc_lgb)
data_source = ["roc-fold-1", "roc-fold-2", "roc-fold-3", "roc-fold-4", "roc-fold-5"]
y_pos = np.arange(len(data_source))
plt.plot(y_pos, auc_forest, "b-", label="forest")
plt.plot(y_pos, auc_gb, "r-", label="gbm")
plt.plot(y_pos, auc_xgb, "y-", label="xgboost")
plt.plot(y_pos, auc_lgb, "g-", label="lightgbm")
plt.title("roc-auc score of each epoch")
plt.xlabel("epoch")
plt.ylabel("roc-auc score")
plt.legend()
plt.show()
std of forest auc score:  0.0413757504568
std of gbm auc score:  0.027746291638
std of xgboost auc score:  0.0232931322563
std of lightgbm auc score:  0.0287156755513

Looking just at the per-fold roc-auc scores over the 5 cross-validation rounds, xgboost's scores are relatively the most stable
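For reference, the same comparison can be summarized in one place by printing each model's mean and standard deviation from the dct_scores dictionary collected above:

# Summarize level and stability of each model's CV scores (uses dct_scores from above).
for name in ["forest", "gbm", "xgboost", "lgbm"]:
    print("%s auc: %.4f +/- %.4f" % (name, np.mean(dct_scores[name]), np.std(dct_scores[name])))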

Ensemble Models

The comparison above shows that all four models perform well under cross-validation, but overall xgboost and lightgbm come out ahead, and both also train faster. We therefore move on to model ensembling, with the following plan:

First tune each of the four models on the full training set with GridSearchCV to obtain the best sub-models

Then combine the sub-models in two ways:

stacking: use the stacking method from the third-party mlxtend library to aggregate the sub-models, and score the resulting model with the same CV procedure as before

voting: use sklearn's built-in VotingClassifier to combine the four models

Finally, run the same CV scoring on the ensembles once more and pick the final model based on the results

First choose each model's parameter combination via cross-validation

def choose_xgb_model(X_train, y_train): 
    tuned_params = [{"objective": ["binary:logistic"], "learning_rate": [0.01, 0.03, 0.05], 
                     "n_estimators": [100, 150, 200], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(xgb.XGBClassifier(seed=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print "train time: ",round(end_t-begin_t, 3), "s"
    print "current best parameters of xgboost: ",clf.best_params_
    return clf.best_estimator_
bst_xgb = choose_xgb_model(train_full, ytrain)
train time:  48.141 s
current best parameters of xgboost:  {"n_estimators": 150, "objective": "binary:logistic", "learning_rate": 0.05, "max_depth": 4}

def choose_lgb_model(X_train, y_train): 
    tuned_params = [{"objective": ["binary"], "learning_rate": [0.01, 0.03, 0.05], 
                     "n_estimators": [100, 150, 200], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(lgb.LGBMClassifier(seed=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print "train time: ",round(end_t-begin_t, 3), "s"
    print "current best parameters of lgb: ",clf.best_params_
    return clf.best_estimator_
bst_lgb = choose_lgb_model(train_full, ytrain)
train time:  12.543 s
current best parameters of lgb:  {"n_estimators": 150, "objective": "binary", "learning_rate": 0.05, "max_depth": 4}

We first use stacking to combine the two models with the best overall performance, lgb and xgb. A fairly simple logistic regression is used as the meta-classifier, to further refine the predictions of the already trained and tuned base models.

from mlxtend.classifier import StackingClassifier
from sklearn import linear_model

def stacking_model(X_train, X_test, y_train):    
    lr = linear_model.LogisticRegression(random_state=7)
    sclf = StackingClassifier(classifiers=[bst_xgb, bst_lgb], use_probas=True, average_probas=False, 
                              meta_classifier=lr)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["stacking_1"], mean_score["stacking_1"], mean_time["stacking_1"] = kfold_plot(train_full, ytrain, stacking_model)

mean scores:  0.92157674772
mean model process time:  0.7022 s

Compared with lightgbm, the best-scoring single model so far, stacking lightgbm and xgboost with LR as the meta-classifier gives a slight improvement in AUC. Next we also add the RandomForest and GBDT models to see whether a bit more model diversity improves the stacking result further.

def choose_forest_model(X_train, y_train):    
    tuned_params = [{"n_estimators": [100, 150, 200], "max_features": [8, 15, 30], "max_depth":[4, 8, 10]}]
    begin_t = time.time()
    clf = GridSearchCV(RandomForestClassifier(random_state=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print "train time: ",round(end_t-begin_t, 3), "s"
    print "current best parameters: ",clf.best_params_
    return clf.best_estimator_
bst_forest = choose_forest_model(train_full, ytrain)
train time:  42.201 s
current best parameters:  {"max_features": 15, "n_estimators": 150, "max_depth": 8}

def choose_gradient_model(X_train, y_train):    
    tuned_params = [{"n_estimators": [100, 150, 200], "learning_rate": [0.03, 0.05, 0.07], 
                     "min_samples_leaf": [8, 15, 30], "max_depth":[4, 6, 8]}]
    begin_t = time.time()
    clf = GridSearchCV(GradientBoostingClassifier(random_state=7), tuned_params, scoring="roc_auc")
    clf.fit(X_train, y_train)
    end_t = time.time()
    print "train time: ",round(end_t-begin_t, 3), "s"
    print "current best parameters: ",clf.best_params_
    return clf.best_estimator_
bst_gradient = choose_gradient_model(train_full, ytrain)
train time:  641.872 s
current best parameters:  {"n_estimators": 100, "learning_rate": 0.03, "max_depth": 8, "min_samples_leaf": 30}

def stacking_model2(X_train, X_test, y_train):    
    lr = linear_model.LogisticRegression(random_state=7)
    sclf = StackingClassifier(classifiers=[bst_xgb, bst_forest, bst_gradient, bst_lgb], use_probas=True, 
                              average_probas=False, meta_classifier=lr)
    sclf.fit(X_train, y_train)
    predictions = sclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["stacking_2"], mean_score["stacking_2"], mean_time["stacking_2"] = kfold_plot(train_full, ytrain, stacking_model2)

mean scores:  0.92686550152
mean model process time:  4.0878 s

Stacking all four models works noticeably better than stacking only two. Next we try combining the four models with voting.

from sklearn.ensemble import VotingClassifier

def voting_model(X_train, X_test, y_train):    
    vclf = VotingClassifier(estimators=[("xgb", bst_xgb), ("rf", bst_forest), ("gbm",bst_gradient),
                                       ("lgb", bst_lgb)], voting="soft", weights=[2, 1, 1, 2])
    vclf.fit(X_train, y_train)
    predictions = vclf.predict_proba(X_test)[:, 1]
    return predictions
dct_scores["voting"], mean_score["voting"], mean_time["voting"] = kfold_plot(train_full, ytrain, voting_model)

mean scores:  0.926889564336
mean model process time:  4.055 s

Compare the scores of the single models and the ensembles once more

plot_model_comp("Model Performance", "roc-auc score", mean_score)

The highest auc score is 0.926889564336 of model: voting

As shown above, the voting ensemble of the four models achieves the highest score, so it is chosen as the final model for prediction.

Use the ensemble model to make the final prediction on the test file
# predict(train_full, test_full, y_train)
def submit(X_train, X_test, y_train, test_ids):
    predictions = voting_model(X_train, X_test, y_train)

    sub = pd.read_csv("sampleSubmission.csv")
    result = pd.DataFrame()
    result["bidder_id"] = test_ids
    result["outcome"] = predictions
    sub = sub.merge(result, on="bidder_id", how="left")

    # Fill missing values with mean
    mean_pred = np.mean(predictions)
    sub.fillna(mean_pred, inplace=True)

    sub.drop("prediction", 1, inplace=True)
    sub.to_csv("result.csv", index=False, header=["bidder_id", "prediction"])
submit(train_full, test_full, ytrain, test_ids)

The final result was submitted to Kaggle for scoring; the score is shown below.

That is the complete end-to-end workflow. There are of course many more models and ensembling methods worth trying, and the feature engineering still leaves plenty of room to dig into — I leave that to the reader to explore.

References

Chen, K. T., Pao, H. K. K., & Chang, H. C. (2008, October). Game bot identification based on manifold learning. In Proceedings of the 7th ACM SIGCOMM Workshop on Network and System Support for Games (pp. 21-26). ACM.

Alayed, H., Frangoudes, F., & Neuman, C. (2013, August). Behavioral-based cheating detection in online first person shooters using machine learning techniques. In Computational Intelligence in Games (CIG), 2013 IEEE Conference on (pp. 1-8). IEEE.

https://www.kaggle.com/c/face...

http://stats.stackexchange.co...

https://en.wikipedia.org/wiki...

https://en.wikipedia.org/wiki...

https://en.wikipedia.org/wiki...

https://xgboost.readthedocs.i...

https://github.com/Microsoft/...

https://en.wikipedia.org/wiki...

http://stackoverflow.com/ques...

http://pandas.pydata.org/pand...

http://stackoverflow.com/a/18...

http://www.cnblogs.com/jasonf...

Changelog

Thanks to @Frank in the comments for the correction: the stacking mistake in the original text has been fixed, and some details such as the plotting have been polished.
