
Cluster Analysis: KMeans

Scholer / 1806 reads

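Abstract: Import the data, preprocess it, compute the average distortion for each value of K (using scipy to compute the distances), use the elbow method to choose the best K, and then fit the KMeans model.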

Import the data
import pandas as pd

# `customer` is assumed to be a pandas DataFrame that has already been loaded.
# wm_poi_id is kept as the first column of every subset.
# General attributes
cus_general = customer[["wm_poi_id","city_type","pre_book","aor_type","is_selfpick_poi","is_selfpick_trade_poi"]]
# Order features
cus_ord = customer[["wm_poi_id","month_original_price","month_order_cnt","service_fee_30day","abnor_rate_30day"]]
# Review features
cus_comment = customer[["wm_poi_id","comment_1star","comment_5star","pic_comment_cnt"]]
# Delivery (waybill) features
cus_waybill = customer[["wm_poi_id","waybill_received_ratio","waybill_delivered_ratio","waybill_ontime_ratio","waybill_normal_arrived_delivery_total_interval_avg","waybill_normal_poi_push_interval_avg","waybill_normal_receive_interval_avg","waybill_normal_fetch_interval_avg","waybill_normal_delivery_interval_avg","waybill_delivery_ontime_ratio","loss_amt"]]
# Full feature set used for the final model
cus_all = customer[["wm_poi_id","c5","ol_time","primary_first_tag_id","city_level",
                    "month_original_price","month_order_cnt","service_fee_30day","abnor_cnt_30day",
                    "comment_1star","comment_5star","pic_comment_cnt",
                    "area_30day","waybill_grab_5mins_ratio","waybill_delivered_ratio","waybill_normal_arrived_delivery_total_interval_avg","waybill_normal_receive_interval_avg",
                    "call.call_cnt","call.call_cnt_ord","call.call_cnt_poi","call.call_cnt_oth"]]
Preprocessing
import pandas as pd
from sklearn import preprocessing

# Standardize the features (zero mean, unit variance), skipping the wm_poi_id column.
# Run this step for one feature group at a time; the full feature set is active here.
# cus = pd.DataFrame(preprocessing.scale(cus_general.iloc[:,1:6]))
# cus = pd.DataFrame(preprocessing.scale(cus_ord.iloc[:,1:5]))
cus = pd.DataFrame(preprocessing.scale(cus_all.iloc[:,1:21]))

# Restore the column names that match the group chosen above.
# cus.columns = ["city_type","pre_book","aor_type","is_selfpick_poi","is_selfpick_trade_poi"]
# cus.columns = ["month_original_price","month_order_cnt","service_fee_30day","abnor_rate_30day"]
# cus.columns = ["comment_1star","comment_5star","pic_comment_cnt"]
# cus.columns = ["waybill_push_ratio","waybill_delivered_ratio","waybill_ontime_ratio","waybill_normal_arrived_delivery_total_interval_avg","waybill_normal_poi_push_interval_avg","waybill_normal_receive_interval_avg","waybill_normal_fetch_interval_avg","waybill_normal_delivery_interval_avg","waybill_delivery_ontime_ratio","loss_amt"]
cus.columns = ["c5","ol_time","primary_first_tag_id","city_level",
               "month_original_price","month_order_cnt","service_fee_30day","abnor_cnt_30day",
               "comment_1star","comment_5star","pic_comment_cnt",
               "area_30day","waybill_grab_5mins_ratio","waybill_delivered_ratio","waybill_normal_arrived_delivery_total_interval_avg","waybill_normal_receive_interval_avg",
               "call.call_cnt","call.call_cnt_ord","call.call_cnt_poi","call.call_cnt_oth"]
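
To see what `preprocessing.scale` does to each column, here is a minimal sketch on a tiny made-up frame. The values and the `demo` name are illustrative assumptions, not taken from the customer data. Each column is shifted to zero mean and rescaled to unit variance, so that features measured in large units do not dominate the Euclidean distances that KMeans relies on.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data: two features on very different scales.
demo = pd.DataFrame({
    "month_order_cnt": [10.0, 20.0, 30.0, 40.0],
    "month_original_price": [1000.0, 3000.0, 5000.0, 7000.0],
})

# scale() standardizes each column: (x - column mean) / column std,
# using the population standard deviation (ddof=0).
scaled = pd.DataFrame(preprocessing.scale(demo), columns=demo.columns, index=demo.index)

# The same result computed by hand with pandas.
manual = (demo - demo.mean()) / demo.std(ddof=0)

print(scaled.round(3))
print(np.allclose(scaled.to_numpy(), manual.to_numpy()))
```

Passing `columns=` and `index=` when rebuilding the DataFrame keeps the original labels, which avoids retyping the column names by hand.
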
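Compute the average distortion for each K from 1 to 14, using scipy to compute the distances
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

K = range(1, 15)
meandistortions = []
for k in K:
    kmeans = KMeans(n_clusters=k)
    kmeans.fit(cus)
    # Average distance from each sample to its nearest cluster centre
    meandistortions.append(sum(np.min(cdist(cus, kmeans.cluster_centers_, "euclidean"), axis=1)) / cus.shape[0])
plt.plot(K, meandistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average distortion")
plt.title("Choosing the best K with the elbow method")
plt.show()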
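
As a sanity check of the elbow method, the sketch below runs the same kind of loop on synthetic data with a known number of groups. `make_blobs`, the four centres, the range of K, and `random_state=0` are all assumptions chosen for illustration. It also records `inertia_`, scikit-learn's sum of squared distances to the nearest centre, which is related to but not the same as the average distance used for the elbow plot.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 groups (illustrative assumption).
X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.6, random_state=0)

ks = range(1, 9)
mean_distortions = []
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # Average Euclidean distance from each sample to its nearest centre.
    nearest = np.min(cdist(X, km.cluster_centers_, "euclidean"), axis=1)
    mean_distortions.append(nearest.mean())
    # Sum of squared distances to the nearest centre, as reported by scikit-learn.
    inertias.append(km.inertia_)

for k, d, i in zip(ks, mean_distortions, inertias):
    print(f"k={k}  mean distortion={d:.3f}  inertia={i:.1f}")
```

On data like this, the curve should drop steeply up to the true number of groups and flatten afterwards; that bend is the "elbow". On real data the bend is often less sharp, so the chosen K remains a judgment call.
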
KMeans modeling
import pandas as pd
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=12)
clf.fit(cus)
# Number of samples in each cluster
pd.Series(clf.labels_).value_counts()

# Cluster centres in standardized units, one bar chart per feature
centres = pd.DataFrame(clf.cluster_centers_)
centres.columns = cus_all.iloc[:,1:21].columns
centres.plot(kind="bar", subplots=True, figsize=(6,15))
# Within-cluster sum of squared distances
clf.inertia_

# Attach the fitted labels as a new column (axis=1) instead of refitting the model.
# cus was built from cus_all; for cus_general or cus_ord, fit on that group
# and name the column "general" or "order" instead.
cus_all = pd.concat([cus_all, pd.DataFrame(clf.labels_, index=cus_all.index)], axis=1)
cus_all = cus_all.rename(columns={0:"cluster"})

# Per-cluster feature means in the original units
centres = cus_all.drop(columns="wm_poi_id").groupby(["cluster"]).mean()

cus_all.to_csv("cluster.csv")

# Inspect the members of a single cluster, for example cluster 2
result = cus_all[cus_all["cluster"]==2]
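
The values in `cluster_centers_` are in standardized units, which makes them hard to interpret. One possible way to read them in the original units is to keep a fitted scaler and call `inverse_transform` on the centres. The sketch below shows this on made-up data. Note that `StandardScaler` is swapped in here for `preprocessing.scale` (both apply the same standardization), and that the column values, the two-cluster setup, and `random_state=0` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two clearly separated groups.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "month_order_cnt": np.concatenate([rng.normal(100, 5, 50), rng.normal(500, 5, 50)]),
    "comment_5star": np.concatenate([rng.normal(20, 2, 50), rng.normal(80, 2, 50)]),
})

scaler = StandardScaler()
scaled = scaler.fit_transform(raw)

# random_state fixes the initialization so the labels are reproducible between runs.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Map the standardized centres back to the original units.
centres_original = pd.DataFrame(
    scaler.inverse_transform(model.cluster_centers_), columns=raw.columns
)

# Once KMeans has converged, each centre is the mean of its members,
# so the result should agree with a groupby on the raw data.
labelled = raw.assign(cluster=model.labels_)
group_means = labelled.groupby("cluster").mean()

print(centres_original.round(1))
print(group_means.round(1))
```

Setting `random_state` also matters when labels are written to a file: without it, refitting the model can assign different cluster numbers to the same groups.
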

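The copyright of this article belongs to the author. Do not reproduce it without permission. If this article violates any rules, you can contact the administrator to have it removed.

When reproducing this article, please cite its address: https://www.ucloud.cn/yun/44576.html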
