
Python Study Notes --- Learning scikit-learn [1]

dingding199389 / 2887 reads


Feature extraction

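The detailed explanation is in the official scikit-learn documentation.

Here I only give a rough summary of the APIs I have used or studied.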

Loading features from dicts

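This makes it easy to extract features when the data is stored as dicts. For example, if each record has a city field that takes three different values, that field can be one-hot encoded.

The class used for this is DictVectorizer.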

>>> measurements = [
...     {"city": "Dubai", "temperature": 33.},
...     {"city": "London", "temperature": 12.},
...     {"city": "San Fransisco", "temperature": 18.},
... ]

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()

>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

>>> # get_feature_names() in scikit-learn versions before 1.2
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature'], ...)
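A small sketch of what happens at transform time, assuming a recent scikit-learn: feature values that were not seen during fitting are silently ignored, so an unknown city produces all-zero one-hot columns. The city "Paris" below is a made-up value for illustration and is not part of the original example.

```python
from sklearn.feature_extraction import DictVectorizer

measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Fransisco", "temperature": 18.0},
]

vec = DictVectorizer()
vec.fit(measurements)

# "city=Paris" was never seen during fit, so none of the three city
# columns is set; only the numeric temperature column carries a value.
unseen = vec.transform([{"city": "Paris", "temperature": 20.0}]).toarray()
print(unseen)
```

Numeric values such as temperature pass through unchanged, while string values become one column per distinct key=value pair.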

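The official documentation then gives another usage example, built around pos_window. I have never worked with part-of-speech tagging. At first I thought the point was that this approach does not work in that situation because it produces so many zeros, but after reading more closely I no longer think so. I hope someone can help clear this up.

The English text below is quoted from the original documentation.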

For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word ‘sat’ in the sentence ‘The cat sat on the mat.’:

    

>>> pos_window = [
...     {
...         "word-2": "the",
...         "pos-2": "DT",
...         "word-1": "cat",
...         "pos-1": "NN",
...         "word+1": "on",
...         "pos+1": "PP",
...     },
...     # in a real application one would extract many such dictionaries
... ]

This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):

>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> # get_feature_names() in scikit-learn versions before 1.2
>>> vec.get_feature_names_out()
array(['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat',
       'word-2=the'], ...)

As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
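To make the point about zeros concrete, here is a minimal sketch with three context windows instead of one. The windows and their PoS tags are made up for illustration. Each distinct key=value pair becomes its own column, so every row sets only a few columns and leaves the rest at zero, which is why a sparse matrix is returned by default.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction import DictVectorizer

# Hypothetical one-word-each-side windows around "cat", "sat" and "on".
windows = [
    {"word-1": "the", "pos-1": "DT", "word+1": "sat", "pos+1": "VBD"},
    {"word-1": "cat", "pos-1": "NN", "word+1": "on", "pos+1": "PP"},
    {"word-1": "sat", "pos-1": "VBD", "word+1": "the", "pos+1": "DT"},
]

vec = DictVectorizer()
X = vec.fit_transform(windows)

n_rows, n_cols = X.shape
# Fraction of cells that actually hold a value.
density = X.nnz / (n_rows * n_cols)
print(X.shape, X.nnz, density)
```

With only three windows, two thirds of the cells are already zero; over a real corpus the number of columns grows with the vocabulary while each row still sets only a handful of them.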

I will skip over this part for now and move on.

Feature hashing

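The FeatureHasher class performs fast, low-memory vectorization using a technique called feature hashing. I have not worked with it much yet, so I will not go into detail.

It is based on MurmurHash, which is fairly well known and which I have come across before. Because of a limitation in scipy.sparse, the maximum number of features is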

$$2^{31}-1$$
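A minimal sketch of FeatureHasher on dict input, reusing the city data from earlier. The value n_features=16 is an arbitrary small number chosen for illustration; real uses normally keep a much larger table to limit hash collisions.

```python
from sklearn.feature_extraction import FeatureHasher

# FeatureHasher keeps no vocabulary: the column of each feature is
# derived from a hash of its name, so no fitting pass is needed.
hasher = FeatureHasher(n_features=16, input_type="dict")

samples = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
]

X_hashed = hasher.transform(samples)
print(X_hashed.shape)
```

The trade-off is that the mapping cannot be inverted: there is no list of feature names to inspect afterwards, and distinct features may share a column.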

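Text feature extraction
Common Vectorizer usage

Vectorization means turning a collection of text documents into numerical feature vectors. This particular strategy is also called "Bag of words" or "Bag of n-grams"; it completely ignores the relative positions of the words in the document.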

The first one to look at is CountVectorizer.

>>> from sklearn.feature_extraction.text import CountVectorizer

It takes many parameters:

>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
    dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
    lowercase=True, max_df=1.0, max_features=None, min_df=1,
    ngram_range=(1, 1), preprocessor=None, stop_words=None,
    strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
    tokenizer=None, vocabulary=None)

Let's try it out briefly:

>>> corpus = [
...     "This is the first document.",
...     "This is the second second document.",
...     "And the third one.",
...     "Is this the first document?",
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>

Result:

>>> # get_feature_names() in scikit-learn versions before 1.2
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)

>>> X.toarray()           
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)

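As you can see, each feature is simply the number of times a word occurs in a document. Note the 2 for "second" in the second row: strictly speaking this is a count encoding rather than one-hot. Raw counts like these are generally not very practical on their own.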
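Since the bag-of-words model drops word order, one common way to keep a little local context is to count bigrams as well. The sketch below follows the pattern used in the scikit-learn documentation; the token_pattern shown is a choice that also keeps one-letter words, not the default.

```python
from sklearn.feature_extraction.text import CountVectorizer

# ngram_range=(1, 2) extracts single words and pairs of adjacent words.
bigram_vectorizer = CountVectorizer(
    ngram_range=(1, 2),
    token_pattern=r"\b\w+\b",
    min_df=1,
)

# build_analyzer() returns the callable that turns a string into features.
analyze = bigram_vectorizer.build_analyzer()
features = analyze("Bi-grams are cool!")
print(features)
# ['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']
```

The vocabulary grows quickly with n-grams, so the resulting matrix becomes wider and sparser than the unigram version.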

Tf–idf term weighting

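This works somewhat better. I will not explain tf-idf itself here; the principle is simple.

Here is an example. counts already holds the computed number of occurrences of each word, and there are only three words.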

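>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> # smooth_idf=False is the setting that reproduces the values shown below
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>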

>>> tfidf.toarray()                        
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])
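The numbers above are consistent with the unsmoothed formula idf(t) = ln(n / df(t)) + 1 followed by L2 normalization of each row, which is what TfidfTransformer computes with smooth_idf=False. As a sanity check, here is a sketch that redoes the calculation by hand with NumPy only; the variable names are mine.

```python
import numpy as np

counts = np.array([[3, 0, 1],
                   [2, 0, 0],
                   [3, 0, 0],
                   [4, 0, 0],
                   [3, 2, 0],
                   [3, 0, 2]], dtype=float)

n_docs = counts.shape[0]
# Document frequency: in how many rows does each term appear at least once?
doc_freq = (counts > 0).sum(axis=0)
# Unsmoothed idf: ln(n / df) + 1
idf = np.log(n_docs / doc_freq) + 1.0
# Weight the raw counts, then scale every row to unit Euclidean length.
weighted = counts * idf
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(tfidf_manual)
```

The first term appears in every document, so its idf is exactly 1; rows that contain only that term therefore normalize to [1, 0, 0].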

In an actual project it is used as follows.

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
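TfidfVectorizer combines the two earlier steps: it behaves like a CountVectorizer followed by a TfidfTransformer. The sketch below checks that on the small corpus used earlier, with default parameters on both sides.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# One step: raw text straight to tf-idf weights.
direct = TfidfVectorizer().fit_transform(corpus)

# Two steps: word counts first, then tf-idf weighting of those counts.
word_counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(word_counts)

same = np.allclose(direct.toarray(), two_step.toarray())
print(direct.shape, same)
```

The two-step form is useful when the counts are needed for something else as well; otherwise the single class is more convenient.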

Vectorizing a large text corpus with the hashing trick

This uses the hashing trick to cope with large datasets. I have not used it myself, but it looks impressive.

The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:

the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset.
building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner.
pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers’ performance to the point of making them slower than the sequential variant.

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
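The practical consequence is that HashingVectorizer is stateless: there is no fit step and no vocabulary_, so separate batches (or separate workers) map the same token to the same column without sharing anything. A minimal sketch, where n_features=2**10 is an arbitrary size picked for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(n_features=2**10)

# Two batches transformed independently, as they might be when
# streaming a corpus that does not fit in memory.
batch_a = hv.transform(["This is the first document."])
batch_b = hv.transform(["Is this the first document?"])

# Both sentences contain the same five words, so with word order
# ignored they hash to exactly the same vector.
identical = np.allclose(batch_a.toarray(), batch_b.toarray())
print(batch_a.shape, identical)
```

The price is the same as with FeatureHasher: column indices cannot be mapped back to words, and a table that is too small leads to collisions between unrelated tokens.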
