Search Engine - ElasticSearch

zengdongbao, published 2019-07-24 10:35 / 1298 views

Note: ES is an open-source Java project. Install a JRE and Node.js beforehand.

I. Introduction

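Elasticsearch is an open-source search engine built on Apache Lucene. It is widely regarded as one of the most advanced, best-performing, and most full-featured search engines available.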

1. Terms

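shard: where the nodes of a cluster store documents. Keeping copies of a shard on different nodes makes data recovery possible. The more CPU, RAM, and I/O each shard has available, the faster indexing is.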

index: similar to a database; multiple indices correspond to multiple databases

type: similar to a table name (mapping types were deprecated in ES 6.x and removed in later versions)

mapping: the table schema

doc (document): the data; one JSON record is one document

ES JSON: the request-body template of the ES API, used for indexing data; ES strictly defines its format (which differs between versions)

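filter: ES evaluates conditions in two contexts. In query context, ES computes a relevance score for every matching document (slower). In filter context, ES only decides whether a document matches, skips scoring, and can cache the result (faster).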

aggs (aggregations): similar to GROUP BY in a database; multiple aggregations can be nested
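
As a hedged illustration of the filter and aggs terms, the sketch below builds a search request body as a Python dict. The field names (`age`, `isLeader`) come from the student mapping used later in this article. The aggregation names (`by_leader`, `avg_age`) and the age range are made up for illustration, and the `bool`/`filter` form assumes ES 2.x or later (older versions use a different query DSL).

```python
import json


def build_search_body(min_age, max_age):
    """Build a search body: a non-scoring range filter plus nested aggregations."""
    return {
        "size": 0,  # return only aggregation results, no hits
        "query": {
            "bool": {
                # filter context: yes/no matching, no relevance scoring
                "filter": [
                    {"range": {"age": {"gte": min_age, "lte": max_age}}}
                ]
            }
        },
        "aggs": {
            # outer aggregation: group by the isLeader field (like GROUP BY)
            "by_leader": {
                "terms": {"field": "isLeader"},
                # nested aggregation: average age inside each group
                "aggs": {"avg_age": {"avg": {"field": "age"}}},
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_body(18, 25), indent=2))
```

The resulting JSON would be sent as the body of a `_search` request.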

II. Installation and Configuration

The following is a single-node configuration:

1. Download the ES archive and extract it locally.

2. Open elasticsearch.yml under /ES/config/

For readability, the comments and unused settings have been removed.

# ---------------------------------- Cluster -----------------------------------
cluster.name: elasticsearch # ES uses this name to group nodes into a cluster

# ------------------------------------ Node ------------------------------------
node.name: node-master # node name; must differ for each node in a cluster!

# ----------------------------------- Paths ------------------------------------
#path.data: /path/to/data
#path.logs: /path/to/logs

# ----------------------------------- Memory -----------------------------------
#bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 127.0.0.1 # IP address the node binds to
transport.tcp.port: 9301 # must be changed per node in a cluster!
http.port: 9401 # must be changed per node in a cluster!

# --------------------------------- Discovery ----------------------------------
#discovery.zen.ping.unicast.hosts: ["host1", "host2"] # list of master-eligible nodes
##########Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):##########
discovery.zen.minimum_master_nodes: 1 # at least 1 master-eligible node

# ---------------------------------- Gateway -----------------------------------
#gateway.recover_after_nodes: 3

# ---------------------------------- Various -----------------------------------
#action.destructive_requires_name: true

1. Commands

1. In a terminal, go to /ES/bin/ and run elasticsearch, or elasticsearch -d to run it in the background

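2. A foreground instance can be stopped with Ctrl+C. For a background instance, use ps -ef | grep elastic or jps to find the process ID (for example, to stop it with kill)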

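3. If shards in the cluster show up as red or UNASSIGNED, investigate and fix the problem (shard and node status can be observed, among other operations, with the ES plugin described below)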

(1) View cluster information

curl "localhost:9401/_nodes/process?pretty"

(2) Find the UNASSIGNED shards

curl -XGET localhost:9401/_cat/shards|grep UNASSIGNED

(3) Reassign each of the UNASSIGNED shards found above in turn

curl -XPOST "localhost:9401/_cluster/reroute" -d '{
    "commands" : [ {
        "allocate" : {
            "index" : "graylog_83",
            "shard" : 1,
            "node" : "Auq82gfGQVWgOBw6S7ajRQ",
            "allow_primary" : true
        }
    }]
}'
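
Steps (2) and (3) can be scripted. The following is a minimal sketch, not an official tool: it assumes that each `_cat/shards` row starts with the columns index, shard, prirep, and state, and it builds the same legacy `allocate` command used above (newer ES versions replace it with other reroute commands). The function names and the sample rows are made up for illustration.

```python
import json


def unassigned_shards(cat_shards_text):
    """Return (index, shard_number) pairs for rows whose state is UNASSIGNED."""
    pairs = []
    for line in cat_shards_text.splitlines():
        cols = line.split()
        # Assumed column order: index shard prirep state ...
        if len(cols) >= 4 and cols[3] == "UNASSIGNED":
            pairs.append((cols[0], int(cols[1])))
    return pairs


def build_reroute_body(pairs, node_id, allow_primary=True):
    """Build the JSON body for POST /_cluster/reroute, one command per shard."""
    commands = [
        {
            "allocate": {
                "index": index,
                "shard": shard,
                "node": node_id,
                "allow_primary": allow_primary,
            }
        }
        for index, shard in pairs
    ]
    return json.dumps({"commands": commands})


if __name__ == "__main__":
    # Made-up sample in the shape of _cat/shards output
    sample = (
        "graylog_83 0 p STARTED 120 1.2mb 127.0.0.1 node-master\n"
        "graylog_83 1 p UNASSIGNED\n"
    )
    print(build_reroute_body(unassigned_shards(sample), "Auq82gfGQVWgOBw6S7ajRQ"))
```

Forcing a primary with `allow_primary` can lose data, so it is worth reviewing the generated body before posting it.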

2. Installing ES Monitoring

1. Download the open-source project elasticsearch-head

2. Go into the elasticsearch-head directory and run npm install grunt-cli to install the Grunt command-line client

3. Open Gruntfile.js in the elasticsearch-head directory

4. Run the monitoring plugin and check the result

III. ES API

1. Creating an Index

{
    "student": {
        "properties": {
            "no": {
                "type": "string",
                "fielddata": true,
                "index": "analyzed"
            },
            "name": {
                "type": "string",
                "index": "analyzed"
            },
            "age": {
                "type": "integer"
            },
            "birth": {
                "type": "date",
                "format": "yyyy-MM-dd"
            },
            "isLeader": {
                "type": "boolean"
            }
        }
    }
}

Then call the ES REST API to create the index and the type:
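
A minimal sketch of such a call, using only the Python standard library. It assumes the type-based API matching the mapping above, where the mapping is wrapped in a `mappings` object and sent with `PUT /<index>`. The index name `school` and the helper name are hypothetical; port 9401 is the HTTP port from the configuration above.

```python
import json
import urllib.request


def build_create_index_request(base_url, index, type_mappings):
    """Build (but do not send) a PUT request that creates an index with mappings."""
    body = json.dumps({"mappings": type_mappings}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Shortened version of the student mapping, for illustration only
    mapping = {"student": {"properties": {"age": {"type": "integer"}}}}
    req = build_create_index_request("http://localhost:9401", "school", mapping)
    print(req.get_method(), req.full_url)
    # To actually send it to a running node:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```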

The new index then appears in the ES monitoring plugin.

2. Bulk Processing

The bulk API allows multiple create, index, update, or delete requests to be performed in a single step.

curl -XPOST "http://172.16.13.4:9401/_bulk?pretty" -d '
{"delete": {"_index": "megacorp", "_type": "employee", "_id": "2"}}
{"create": {"_index": "megacorp", "_type": "employee", "_id": "2"}}
{"name": "first"}
{"index": {"_index": "megacorp", "_type": "employee"}}
{"name": "second"}
'
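
The bulk body is newline-delimited JSON: each action line is followed by its source document (delete has none), and the body must end with a newline. The helper below is a small sketch of how such a body could be assembled in Python; the function name and the tuple layout are assumptions made for this example.

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, source) tuples into a bulk request body.

    source is None for actions that carry no document, such as delete.
    """
    lines = []
    for action, meta, source in operations:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    # Every line, including the last one, must end with a newline
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    meta = {"_index": "megacorp", "_type": "employee", "_id": "2"}
    ops = [
        ("delete", meta, None),
        ("create", meta, {"name": "first"}),
    ]
    print(build_bulk_body(ops), end="")
```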

3. ES Analyzers

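An analyzer performs three functions: character filters (strip HTML, convert special characters), a tokenizer (splits the text into terms), and token filters (remove or modify tokens, for example dropping stop words). Built-in analyzers include the standard, simple, whitespace, and language analyzers. See the chapter on ES analyzers for details.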
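
To make the three stages concrete, here is a toy pipeline in plain Python. It is only an illustration of the idea and not how ES or Lucene implements analysis; the stop-word list and the function names are made up for this example.

```python
import html
import re

STOP_WORDS = {"the", "a", "an", "and", "of"}  # illustrative stop-word list


def char_filter(text):
    """Character filter: strip HTML tags and convert special characters."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = html.unescape(text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Token filter: lowercase each token and drop stop words."""
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order, as an analyzer does."""
    return token_filter(tokenize(char_filter(text)))


if __name__ == "__main__":
    print(analyze("<p>The Quick &amp; Brown Foxes</p>"))
    # prints ['quick', 'brown', 'foxes']
```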

IV. ES Cluster

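The configuration is simple, so it is not explained in detail here. The principle is similar to a Redis cluster: detect node timeouts and vote to elect a master node.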

#####################################Master node 1#####################################
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node1
node.master: true
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data
path.logs: /path/to/logs

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9301
transport.tcp.compress: true
http.port: 9401
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

#####################################Master node 2#####################################
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node2
node.master: true
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data2
path.logs: /path/to/logs2

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9302
transport.tcp.compress: true
http.port: 9402
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

#####################################Data node######################################
# ---------------------------------- Cluster -----------------------------------
cluster.name: alex-es

# ------------------------------------ Node ------------------------------------
node.name: node3
node.master: false
node.data: true

# ----------------------------------- Path ------------------------------------
path.data: /path/to/data3
path.logs: /path/to/logs3

# ----------------------------------- Memory -----------------------------------
bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------
network.host: 172.16.13.4
transport.tcp.port: 9303
transport.tcp.compress: true
http.port: 9403
http.max_content_length: 100mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"

# --------------------------------- Discovery ----------------------------------
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.unicast.hosts: ["172.16.13.4:9301", "172.16.13.4:9302"]

# ---------------------------------- Gateway -----------------------------------
gateway.recover_after_nodes: 3
gateway.recover_after_time: 5m
gateway.expected_nodes: 3

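The configuration entries above must not contain stray spaces (YAML is whitespace-sensitive). Once everything is configured, start all the nodes and check the cluster in ES-head.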
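
The `discovery.zen.minimum_master_nodes` values used in this article follow the majority rule quoted in the single-node configuration (total number of master-eligible nodes / 2 + 1, with integer division). A small sketch of that arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum: master-eligible nodes / 2 + 1, using integer division."""
    if master_eligible < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible // 2 + 1


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "master-eligible ->", minimum_master_nodes(n))
```

With one master-eligible node the result is 1, and with the two master-eligible nodes of the cluster above it is 2, which matches the configured values.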

V. ES Client Issues

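Official clients are available for Python, Java, and other languages. They implement round-robin over a pool of ES connections, as well as search, index, bulk, and other operations.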

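I recently used multiple processes to query ES concurrently. When the number of requests rose over a period of time, several processes ran into response timeouts.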

After investigation, the following possible causes were ruled out:

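1. Java GC behavior (including concurrent GC, Full GC, G1, etc.), since different GC mechanisms affect ES performance differently
2. The ES queue size
3. The process pool: the ES queries are issued asynchronously at essentially the same time, so this is not the problem
4. CPU, memory, ES configuration tuning, etc.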

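Finally, a packet capture on the server showed that some requests took a while to reach ES, and that this delay tended to grow as the number of requests increased. That narrowed the problem down to the point where the ES client sends its requests.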

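After some research, it seemed likely that the transport used by the ES client was what lengthened the request time, so I decided to replace the ES client with Python's pycurl. The code is below; round-robin over the ES nodes can be implemented on top of it: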

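import io
import random

import pycurl


def es_pool():
    # Placeholder "ip:port" entries; fill in the HTTP addresses of the ES nodes
    return ["ip:port", "ip:port"]


# Send a search request through curl
def curl_req(index="", rtype="", body=""):
    buf = io.BytesIO()  # pycurl writes the response as bytes on Python 3
    c = pycurl.Curl()

    # Pick a node at random from the ES pool. random.choice is uniform,
    # unlike randint(0, n) % n, which selects index 0 twice as often.
    host = random.choice(es_pool())
    url = host + "/" + index + "/" + rtype + "/_search"

    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.POST, 1)
    c.setopt(pycurl.POSTFIELDS, body.encode("utf-8"))
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    try:
        c.perform()
    finally:
        c.close()  # release the handle even if the request fails
    return buf.getvalue().decode("utf-8")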
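
The code above picks a node at random. For round-robin over the ES nodes, one possible sketch is shown below. The class name and the host strings are made up for illustration, and the rotation state lives in each process separately, so with a process pool every worker cycles through the hosts on its own.

```python
import itertools


class RoundRobinPool:
    """Hand out ES hosts in a fixed rotation, one per call."""

    def __init__(self, hosts):
        hosts = list(hosts)
        if not hosts:
            raise ValueError("hosts must not be empty")
        self._cycle = itertools.cycle(hosts)

    def next_host(self):
        """Return the next host in the rotation."""
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical addresses; replace with the HTTP addresses of real nodes
    pool = RoundRobinPool(["10.0.0.1:9401", "10.0.0.2:9402"])
    for _ in range(4):
        print(pool.next_host())
```

A request function could then call `pool.next_host()` where the random selection is done now.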

When reposting, please cite the address of this article: https://www.ucloud.cn/yun/35817.html
