Tianchi Learning Notes: O2O Coupon Usage Prediction Challenge [1]

soasme, published 2019-07-30 16:52 / 2598 reads

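Abstract: Randomly distributed coupons are a meaningless disturbance for most users. Below we analyze how each of the 7 fields in the training set affects coupon usage. Discount_rate has two discount forms: x in [0,1] is a discount rate, and x:y means "spend x, get y off". The x:y form is converted to a discount rate with 1 - y/x. We then predict and compute the average AUC to get the result.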

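Problem description

Background: Using coupons to re-engage existing users or attract new customers into stores is an important marketing approach in O2O (Online to Offline). However, randomly distributed coupons are a meaningless disturbance for most users. For merchants, indiscriminately issued coupons can damage brand reputation and make marketing costs hard to estimate. Personalized distribution is an important technique for raising the coupon redemption rate: it gives consumers with particular preferences a real benefit and gives merchants stronger marketing capability.

Goal: Using the rich data provided for the O2O scenario, build a model that accurately predicts whether a user will use a given coupon within the specified time window.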

Data analysis

Read the data:
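The loading code itself is not shown in the post. The following is a minimal sketch, assuming pandas; the toy rows and the helper name `load_offline` are hypothetical. The point it illustrates is `keep_default_na=False`: without it pandas parses the literal string "null" as NaN, and comparisons such as `!= "null"` used throughout the rest of the post would no longer select the intended rows.

```python
import io
import pandas as pd

# Toy stand-in for the offline training file (hypothetical rows).
csv_text = """User_id,Merchant_id,Coupon_id,Discount_rate,Distance,Date_received,Date
1439408,2632,null,null,0,null,20160217
1439408,4663,11002,150:20,1,20160528,null
1439408,2632,8591,20:1,0,20160217,20160301
"""

def load_offline(source):
    # keep_default_na=False keeps the literal string "null" instead of NaN,
    # so comparisons such as df["Coupon_id"] != "null" behave as expected.
    return pd.read_csv(source, keep_default_na=False)

dfoff = load_offline(io.StringIO(csv_text))
print(dfoff.dtypes)
print(dfoff["Coupon_id"].tolist())
```

With the real data, `source` would be the path to the offline training CSV downloaded from the competition page.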

The offline training set contains the following 7 fields:
User_id
Merchant_id
Coupon_id
Discount_rate
Distance
Date_received
Date

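When Coupon_id is null, the purchase was made without a coupon, and the Discount_rate and Date_received fields are meaningless.

See the competition link for the exact meaning of each field.

Depending on whether Coupon_id and Date are null, the data falls into four types: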

print("Coupon received, purchase made:", dfoff[(dfoff["Coupon_id"] != "null") & (dfoff["Date"] != "null")].shape[0])
print("No coupon, purchase made:", dfoff[(dfoff["Coupon_id"] == "null") & (dfoff["Date"] != "null")].shape[0])
print("Coupon received, no purchase:", dfoff[(dfoff["Coupon_id"] != "null") & (dfoff["Date"] == "null")].shape[0])
print("No coupon, no purchase:", dfoff[(dfoff["Coupon_id"] == "null") & (dfoff["Date"] == "null")].shape[0])

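Result:

Here, 75382 is the number of purchases made with a coupon, i.e. positive samples; 977900 is the number of coupons that were received but not used, so these coupons were wasted, i.e. negative samples; 701602 is the number of ordinary purchases made without a coupon.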
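To make the class imbalance concrete, here is a small arithmetic sketch using only the two counts quoted above. It treats every coupon purchase as a positive, which is an upper bound: the label defined later also requires the purchase to happen within 15 days.

```python
used = 75382      # coupon received and a purchase recorded
unused = 977900   # coupon received, no purchase recorded

rate = used / (used + unused)
print(f"share of received coupons that were used: {rate:.4f}")
```

Roughly 7% of received coupons end up being used, so the negative class dominates.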

Below we analyze how each of the 7 fields in the training set affects coupon usage.

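1. Coupons and distance

print("Discount_rate values:", dfoff["Discount_rate"].unique())
print("Distance values:", dfoff["Distance"].unique())

The output shows str data, which needs to be converted to numeric types.

Discount_rate comes in two forms: x in [0,1] is a discount rate, and x:y means "spend x, get y off". The x:y form is converted to a discount rate with 1 - y/x. We also build the coupon-related features discount_rate, discount_man (the threshold x), discount_jian (the reduction y), and discount_type. The code is as follows: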

# convert Discount_rate and Distance

def getDiscountType(row):
    if row == "null":
        return "null"
    elif ":" in row:
        return 1
    else:
        return 0

def convertRate(row):
    """Convert discount to rate"""
    if row == "null":
        return 1.0
    elif ":" in row:
        rows = row.split(":")
        return 1.0 - float(rows[1])/float(rows[0])
    else:
        return float(row)

def getDiscountMan(row):
    if ":" in row:
        rows = row.split(":")
        return int(rows[0])
    else:
        return 0

def getDiscountJian(row):
    if ":" in row:
        rows = row.split(":")
        return int(rows[1])
    else:
        return 0

def processData(df):
    
    # convert discount_rate
    df["discount_rate"] = df["Discount_rate"].apply(convertRate)
    df["discount_man"] = df["Discount_rate"].apply(getDiscountMan)
    df["discount_jian"] = df["Discount_rate"].apply(getDiscountJian)
    df["discount_type"] = df["Discount_rate"].apply(getDiscountType)
    print(df["discount_rate"].unique())
    
    # convert distance
    df["distance"] = df["Distance"].replace("null", -1).astype(int)
    print(df["distance"].unique())
    return df

dfoff = processData(dfoff)
dftest = processData(dftest)
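As a quick sanity check of the 1 - y/x conversion, the sketch below applies the same rule as convertRate to a few hand-written values. The function name `discount_to_rate` and the sample strings are illustrative, not taken from the data.

```python
def discount_to_rate(value):
    """Map a Discount_rate string to a rate in [0, 1]."""
    if value == "null":
        return 1.0  # no coupon: treated as "no discount"
    if ":" in value:
        threshold, reduction = value.split(":")
        return 1.0 - float(reduction) / float(threshold)
    return float(value)

samples = ["null", "0.95", "150:20", "20:1"]
rates = {s: round(discount_to_rate(s), 4) for s in samples}
print(rates)
```

"150:20" becomes 1 - 20/150, about 0.8667, while "20:1" and "0.95" both map to 0.95, which is why keeping discount_man and discount_jian as separate features preserves information the rate alone loses.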

2. Time
Process the coupon received date date_received and the purchase date date_buy:

date_received = dfoff["Date_received"].unique()
date_received = sorted(date_received[date_received != "null"])

date_buy = dfoff["Date"].unique()
date_buy = sorted(date_buy[date_buy != "null"])

date_buy = sorted(dfoff[dfoff["Date"] != "null"]["Date"])

and print the result:

Count the coupons customers received each day:

couponbydate = dfoff[dfoff["Date_received"] != "null"][["Date_received", "Date"]].groupby(["Date_received"], as_index=False).count()
couponbydate.columns = ["Date_received","count"]
couponbydate.head()

Count how many of those coupons were used for a purchase:

buybydate = dfoff[(dfoff["Date"] != "null") & (dfoff["Date_received"] != "null")][["Date_received", "Date"]].groupby(["Date_received"], as_index=False).count()
buybydate.columns = ["Date_received","count"]
buybydate.head()

Visualize the data above:

import matplotlib.pyplot as plt

plt.figure(figsize = (12,8))
date_received_dt = pd.to_datetime(date_received, format="%Y%m%d")

plt.subplot(211)
plt.bar(date_received_dt, couponbydate["count"], label = "number of coupon received" )
plt.bar(date_received_dt, buybydate["count"], label = "number of coupon used")
plt.yscale("log")
plt.ylabel("Count")
plt.legend()

plt.subplot(212)
plt.bar(date_received_dt, buybydate["count"]/couponbydate["count"])
plt.ylabel("Ratio(coupon used/coupon received)")
plt.tight_layout()

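Feature extraction

The plots above show the counts for each individual day. People generally go out shopping more on Sundays and are then more likely to use coupons, so we now build new features based on the day of the week.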

from datetime import date

def getWeekday(row):
    # row is a "YYYYMMDD" string; returns 1 (Monday) to 7 (Sunday).
    if row == "null":
        return row
    else:
        return date(int(row[0:4]), int(row[4:6]), int(row[6:8])).weekday() + 1

dfoff["weekday"] = dfoff["Date_received"].astype(str).apply(getWeekday)
dftest["weekday"] = dftest["Date_received"].astype(str).apply(getWeekday)

# weekday_type: 1 for Saturday and Sunday, 0 for weekdays
dfoff["weekday_type"] = dfoff["weekday"].apply(lambda x : 1 if x in [6,7] else 0 )
dftest["weekday_type"] = dftest["weekday"].apply(lambda x : 1 if x in [6,7] else 0 )

# change weekday to one-hot encoding 
weekdaycols = ["weekday_" + str(i) for i in range(1,8)]
print(weekdaycols)

tmpdf = pd.get_dummies(dfoff["weekday"].replace("null", np.nan))
tmpdf.columns = weekdaycols
dfoff[weekdaycols] = tmpdf

tmpdf = pd.get_dummies(dftest["weekday"].replace("null", np.nan))
tmpdf.columns = weekdaycols
dftest[weekdaycols] = tmpdf

The resulting tmpdf has the following form:
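The original output is not reproduced here. As an illustration of what one-hot encoding of the weekday produces, here is a sketch on a hypothetical five-row column. It also shows a caveat: pd.get_dummies only creates columns for values that actually occur, so renaming its output to 7 fixed names works only when all seven weekdays are present. The manual comparison below is an alternative (not the post's method) that always yields 7 columns.

```python
import pandas as pd

# Toy weekday column (hypothetical): only days 1, 3 and 7 occur.
weekday = pd.Series([1, 3, 7, "null", 3])
weekdaycols = ["weekday_" + str(i) for i in range(1, 8)]

# get_dummies creates one column per value that is present.
dummies = pd.get_dummies(pd.to_numeric(weekday, errors="coerce"))
print(dummies.shape)

# Manual one-hot: one 0/1 column per weekday, all zeros for "null" rows.
onehot = pd.DataFrame(
    {col: (weekday == i).astype(int) for i, col in enumerate(weekdaycols, start=1)}
)
print(onehot)
```

Each row has a single 1 in the column of its weekday, and the "null" row is all zeros.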

Label each row based on Date_received and Date, converting the outcome to a numeric value:

def label(row):
    if row["Date_received"] == "null":
        return -1
    if row["Date"] != "null":
        td = pd.to_datetime(row["Date"], format="%Y%m%d") -  pd.to_datetime(row["Date_received"], format="%Y%m%d")
        if td <= pd.Timedelta(15, "D"):
            return 1
    return 0
dfoff["label"] = dfoff.apply(label, axis = 1)

If Date_received == "null", then y = -1; if Date != "null" and Date - Date_received <= 15 days, then y = 1; otherwise y = 0.

These values are now stored in the label column as 0, 1, and -1.
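The row-wise apply above can be slow on a table of this size. The following is a sketch of the same labelling rule written with vectorized pandas operations, run on hypothetical toy rows; it is an alternative formulation, not the code the post used.

```python
import pandas as pd

# Toy rows (hypothetical) covering the three outcomes of the rule.
df = pd.DataFrame({
    "Date_received": ["null", "20160101", "20160101", "20160101"],
    "Date":          ["20160105", "null", "20160110", "20160201"],
})

# "null" cannot be parsed as a date, so errors="coerce" turns it into NaT.
received = pd.to_datetime(df["Date_received"], format="%Y%m%d", errors="coerce")
bought = pd.to_datetime(df["Date"], format="%Y%m%d", errors="coerce")
gap = bought - received  # NaT wherever either date is missing

label = pd.Series(0, index=df.index)
label.loc[gap <= pd.Timedelta(days=15)] = 1  # comparisons with NaT are False
label.loc[received.isna()] = -1              # no coupon received
print(label.tolist())
```

The four toy rows cover no coupon (-1), coupon never used (0), used after 9 days (1), and used after 31 days (0).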

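Model training

Before fitting the model, split the data. Here, data from 20160101 to 20160515 is used as the training set (train), and data from 20160516 to 20160615 is used as the validation set (valid).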

df = dfoff[dfoff["label"] != -1].copy()
train = df[(df["Date_received"] < "20160516")].copy()
valid = df[(df["Date_received"] >= "20160516") & (df["Date_received"] <= "20160615")].copy()
print(train["label"].value_counts())
print(valid["label"].value_counts())

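Use the linear model SGDClassifier for prediction.

# Assumption: the post never shows the definition of original_feature;
# this list is simply the features built in the sections above.
original_feature = ["discount_rate", "discount_type", "discount_man", "discount_jian",
                    "distance", "weekday", "weekday_type"] + weekdaycols
predictors = original_feature
print(predictors)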

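import os
import pickle

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def check_model(data, predictors):
    
    # Logistic regression trained with SGD and an elastic-net penalty.
    classifier = lambda: SGDClassifier(
        loss="log_loss",  # spelled "log" in scikit-learn versions before 1.1
        penalty="elasticnet", 
        fit_intercept=True, 
        max_iter=100, 
        shuffle=True, 
        n_jobs=1,
        class_weight=None)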

    model = Pipeline(steps=[
        ("ss", StandardScaler()),
        ("en", classifier())
    ])

    parameters = {
        "en__alpha": [ 0.001, 0.01, 0.1],
        "en__l1_ratio": [ 0.001, 0.01, 0.1]
    }

    folder = StratifiedKFold(n_splits=3, shuffle=True)
    
    grid_search = GridSearchCV(
        model, 
        parameters, 
        cv=folder, 
        n_jobs=-1, 
        verbose=1)
    grid_search = grid_search.fit(data[predictors], 
                                  data["label"])
    
    return grid_search

if not os.path.isfile("1_model.pkl"):
    model = check_model(train, predictors)
    print(model.best_score_)
    print(model.best_params_)
    with open("1_model.pkl", "wb") as f:
        pickle.dump(model, f)
else:
    with open("1_model.pkl", "rb") as f:
        model = pickle.load(f)

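Next, compute the AUC of the predictions for each coupon, then average over all coupons. A coupon whose labels contain only one class is skipped, because AUC is undefined in that case.

Make predictions: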

y_valid_pred = model.predict_proba(valid[predictors])
valid1 = valid.copy()
valid1["pred_prob"] = y_valid_pred[:, 1]

Compute the average AUC:

import numpy as np
from sklearn.metrics import roc_curve, auc

vg = valid1.groupby(["Coupon_id"])
aucs = []
for _, tmpdf in vg:
    # AUC is undefined when a coupon's labels contain a single class.
    if len(tmpdf["label"].unique()) != 2:
        continue
    fpr, tpr, thresholds = roc_curve(tmpdf["label"], tmpdf["pred_prob"], pos_label=1)
    aucs.append(auc(fpr, tpr))
print(np.average(aucs))

The result is 0.5348655160896371.
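To see how the per-coupon average behaves, here is a self-contained sketch on hypothetical toy predictions. Instead of roc_curve it uses the pairwise definition of AUC (the fraction of positive/negative pairs in which the positive gets the higher score, with ties counted as half), which gives the same value; the helper names are made up for this example.

```python
import numpy as np
import pandas as pd

def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def mean_auc_by_coupon(df):
    aucs = []
    for _, group in df.groupby("Coupon_id"):
        if group["label"].nunique() != 2:
            continue  # single-class coupon: AUC undefined, skip it
        aucs.append(pairwise_auc(group["label"], group["pred_prob"]))
    return float(np.mean(aucs))

# Toy predictions (hypothetical): coupon "b" has only negatives and is skipped.
toy = pd.DataFrame({
    "Coupon_id": ["a", "a", "a", "a", "b", "b", "c", "c"],
    "label":     [1, 0, 0, 1, 0, 0, 1, 0],
    "pred_prob": [0.9, 0.2, 0.6, 0.5, 0.3, 0.4, 0.1, 0.8],
})
print(mean_auc_by_coupon(toy))
```

Coupon "a" scores 0.75 and coupon "c" scores 0.0, so the mean is 0.375. Every coupon counts equally regardless of how many rows it has, which means small coupons can move the average a lot.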

Predict on the test set and write the submission file:

y_test_pred = model.predict_proba(dftest[predictors])
dftest1 = dftest[["User_id","Coupon_id","Date_received"]].copy()
dftest1["label"] = y_test_pred[:,1]
dftest1.to_csv("submit1.csv", index=False, header=False)

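At this point we have a submission. The features used were the coupon, the distance, and the time. The predictions are poor (an average AUC of about 0.53 is barely better than random guessing), and further feature engineering is needed to get better results.

Summary of the approach

To summarize: first analyze the data, where plots show its characteristics more intuitively; then extract features based on that analysis and train the chosen model on them. For training, the data is split into a training set and a validation set. Finally, the test set is fed to the trained model to get predictions, which are written out as a .csv file for submission.

That is the general workflow of a data analysis task. For this problem, there is still plenty of room to think about feature engineering and model choice in order to improve prediction performance.

Concepts used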

one-hot encoding
AUC

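Problems encountered

For the author's own learning, this write-up exposed the following 3 issues:

Not familiar enough with the data visualization code, even though plots can suggest many ideas

Need to understand in depth what parameters each model has; to avoid merely calling libraries, one has to grasp the principles behind parameter tuning

Feature engineering is the key to winning and takes continual practice and study

Reference links: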
https://tianchi.aliyun.com/no...
https://tianchi.aliyun.com/no...

Corrections for any shortcomings are welcome.


When reposting, please cite this article's address: https://www.ucloud.cn/yun/41879.html
