摘要:在中,一个线程可以处理多个,但是一个只能绑定到一个,这是基于线程安全和同步考虑而设计的。线程阻塞再次进行压力测试,结果如下最终结果没有任何提升,利用率依然不超过,也还是在单个利用率最高不超过,说明这次的瓶颈不是。但是其中出现了软中断。
1. 问题
spring-cloud-gateway 网关新增了一个限流功能,使用的是模块自带的限流过滤器 RequestRateLimiterGatewayFilterFactory,基于令牌桶算法,通过 redis 实现。
其原理是 redis 中针对每个限流要素(比如针对接口限流),保存 2 个 key:tokenKey(令牌数量),timeKey(调用时间)。每次接口调用时,更新 tokenKey 的值为:原先的值 + (当前时间 - 原先时间)* 加入令牌的速度,如果新的 tokenKey 的值大于 1,那么允许调用,否则不允许;同时更新 redis 中 tokenKey,timeKey 的值。整个过程通过 lua 脚本实现。
在加入限流功能之前,500 客户端并发访问,tps 为 6800 req/s,50% 时延为 70ms;加入限流功能之后,tps 为 2300 req/s,50% 时延为 205ms,同时,原先 cpu 占用率几乎 600%(6 核) 变成不到 400%(cpu 跑不满了)。
2. 排查和解决过程查看单个线程的 cpu 占用:
[root@auth-service imf2]# top -Hp 29360 top - 15:16:27 up 102 days, 18:04, 1 user, load average: 1.61, 0.72, 0.34 Threads: 122 total, 9 running, 113 sleeping, 0 stopped, 0 zombie %Cpu(s): 42.0 us, 7.0 sy, 0.0 ni, 49.0 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st KiB Mem : 7678384 total, 126844 free, 3426148 used, 4125392 buff/cache KiB Swap: 6291452 total, 2212552 free, 4078900 used. 3347956 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 29415 root 20 0 6964708 1.1g 14216 R 97.9 15.1 3:01.65 java 29392 root 20 0 6964708 1.1g 14216 R 27.0 15.1 0:45.42 java 29391 root 20 0 6964708 1.1g 14216 R 24.8 15.1 0:43.95 java 29387 root 20 0 6964708 1.1g 14216 R 23.8 15.1 0:46.38 java 29388 root 20 0 6964708 1.1g 14216 R 23.4 15.1 0:48.21 java 29390 root 20 0 6964708 1.1g 14216 R 23.0 15.1 0:45.93 java 29389 root 20 0 6964708 1.1g 14216 R 22.3 15.1 0:44.36 java
线程 29415 几乎跑满了 cpu,查看是什么线程:
[root@auth-service imf2]# printf "%x " 29415 72e7 [root@auth-service imf2]# jstack 29360 | grep 72e7 "lettuce-nioEventLoop-4-1" #40 daemon prio=5 os_prio=0 tid=0x00007f604cc92000 nid=0x72e7 runnable [0x00007f606ce90000]
果然是操作 redis 的线程,和预期一致。
查看 redis:cpu 占用率不超过 15%,没有 10ms 以上的慢查询。应该不会是 redis 的问题。
查看线程栈信息:
通过以下脚本每秒记录一次 jstack:
[root@eureka2 jstack]# cat jstack.sh #!/bin/sh i=0 while [ $i -lt 30 ]; do /bin/sleep 1 i=`expr $i + 1` jstack 29360 > "$i".txt done
查看 lettuce 线程主要执行哪些函数:
"lettuce-nioEventLoop-4-1" #36 daemon prio=5 os_prio=0 tid=0x00007f1eb07ab800 nid=0x4476 runnable [0x00007f1eec8fb000] java.lang.Thread.State: RUNNABLE at sun.misc.URLClassPath$Loader.findResource(URLClassPath.java:715) at sun.misc.URLClassPath.findResource(URLClassPath.java:215) at java.net.URLClassLoader$2.run(URLClassLoader.java:569) at java.net.URLClassLoader$2.run(URLClassLoader.java:567) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findResource(URLClassLoader.java:566) at org.springframework.boot.loader.LaunchedURLClassLoader.findResource(LaunchedURLClassLoader.java:57) at java.lang.ClassLoader.getResource(ClassLoader.java:1096) at org.springframework.core.io.ClassPathResource.resolveURL(ClassPathResource.java:155) at org.springframework.core.io.ClassPathResource.getURL(ClassPathResource.java:193) at org.springframework.core.io.AbstractFileResolvingResource.lastModified(AbstractFileResolvingResource.java:220) at org.springframework.scripting.support.ResourceScriptSource.retrieveLastModifiedTime(ResourceScriptSource.java:119) at org.springframework.scripting.support.ResourceScriptSource.isModified(ResourceScriptSource.java:109) - locked <0x000000008c074d00> (a java.lang.Object) at org.springframework.data.redis.core.script.DefaultRedisScript.getSha1(DefaultRedisScript.java:89) - locked <0x000000008c074c10> (a java.lang.Object) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.eval(DefaultReactiveScriptExecutor.java:113) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.lambda$execute$0(DefaultReactiveScriptExecutor.java:105) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor$$Lambda$1268/1889039573.doInRedis(Unknown Source) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.lambda$execute$6(DefaultReactiveScriptExecutor.java:167) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor$$Lambda$1269/1954779522.get(Unknown Source) at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:46)
可知该线程主要在执行 ReactiveRedisTemplate 类的 execute(RedisScript
猜想:既然是因为 lettuce-nioEventLoop 线程跑满了 cpu,那么通过创建多个 lettuce-nioEventLoop 线程,以充分利用多核的特点,是否可以解决呢?
以下为源码分析阶段:
// 1. RedisConnectionFactory bean 的创建依赖 ClientResources @Bean @ConditionalOnMissingBean(RedisConnectionFactory.class) public LettuceConnectionFactory redisConnectionFactory( ClientResources clientResources) throws UnknownHostException { LettuceClientConfiguration clientConfig = getLettuceClientConfiguration( clientResources, this.properties.getLettuce().getPool()); return createLettuceConnectionFactory(clientConfig); } // 2. ClientResources bean 的创建如下 @Bean(destroyMethod = "shutdown") @ConditionalOnMissingBean(ClientResources.class) public DefaultClientResources lettuceClientResources() { return DefaultClientResources.create(); } public static DefaultClientResources create() { return builder().build(); } // 3. 创建 EventLoopGroupProvider 对象 protected DefaultClientResources(Builder builder) { this.builder = builder; // 默认为 null,执行这块代码 if (builder.eventLoopGroupProvider == null) { // 设置处理 redis 连接的线程数:默认为 // Math.max(1, // SystemPropertyUtil.getInt("io.netty.eventLoopThreads", // Math.max(MIN_IO_THREADS, Runtime.getRuntime().availableProcessors()))); // 针对多核处理器,该值一般等于 cpu 的核的数量 int ioThreadPoolSize = builder.ioThreadPoolSize; if (ioThreadPoolSize < MIN_IO_THREADS) { logger.info("ioThreadPoolSize is less than {} ({}), setting to: {}", MIN_IO_THREADS, ioThreadPoolSize, MIN_IO_THREADS); ioThreadPoolSize = MIN_IO_THREADS; } this.sharedEventLoopGroupProvider = false; // 创建 EventLoopGroupProvider 对象 this.eventLoopGroupProvider = new DefaultEventLoopGroupProvider(ioThreadPoolSize); } else { this.sharedEventLoopGroupProvider = true; this.eventLoopGroupProvider = builder.eventLoopGroupProvider; } // 以下代码省略 ... } // 4. 通过 EventLoopGroupProvider 创建 EventExecutorGroup 对象 public staticEventExecutorGroup createEventLoopGroup(Class type, int numberOfThreads) { if (DefaultEventExecutorGroup.class.equals(type)) { return new DefaultEventExecutorGroup(numberOfThreads, new DefaultThreadFactory("lettuce-eventExecutorLoop", true)); } // 我们采用的是 Nio 模式,会执行这个分支 if (NioEventLoopGroup.class.equals(type)) { return new NioEventLoopGroup(numberOfThreads, new DefaultThreadFactory("lettuce-nioEventLoop", true)); } if (EpollProvider.isAvailable() && EpollProvider.isEventLoopGroup(type)) { return EpollProvider.newEventLoopGroup(numberOfThreads, new DefaultThreadFactory("lettuce-epollEventLoop", true)); } if (KqueueProvider.isAvailable() && KqueueProvider.isEventLoopGroup(type)) { return KqueueProvider.newEventLoopGroup(numberOfThreads, new DefaultThreadFactory("lettuce-kqueueEventLoop", true)); } throw new IllegalArgumentException(String.format("Type %s not supported", type.getName())); } // 5. NioEventLoopGroup 继承了 MultithreadEventLoopGroup; // 创建了多个 NioEventLoop; // 每个 NioEventLoop 都是单线程; // 每个 NioEventLoop 都可以处理多个连接。 public class NioEventLoopGroup extends MultithreadEventLoopGroup { ... } public abstract class MultithreadEventLoopGroup extends MultithreadEventExecutorGroup implements EventLoopGroup { ... } public final class NioEventLoop extends SingleThreadEventLoop { ... }
以上分析可知,默认创建的 RedisConnectionFactory bean 其实是支持多线程的,但通过 jstack 等方式查看 lettuce-nioEventLoop 线程却只有一个。
[root@ ~]# ss | grep 6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:36184 ::ffff:10.201.0.30:6379
查看 redis 连接,发现只有一个。在 Netty 中,一个 EventLoop 线程可以处理多个 Channel,但是一个 Channel 只能绑定到一个 EventLoop,这是基于线程安全和同步考虑而设计的。这解释了为什么只有一个 lettuce-nioEventLoop。
下面继续分析为什么会只有一个连接呢?继续源码分析:
// 1. 创建 RedisConnectionFactory bean @Bean @ConditionalOnMissingBean(RedisConnectionFactory.class) public LettuceConnectionFactory redisConnectionFactory( ClientResources clientResources) throws UnknownHostException { LettuceClientConfiguration clientConfig = getLettuceClientConfiguration( clientResources, this.properties.getLettuce().getPool()); return createLettuceConnectionFactory(clientConfig); } // 2. 查看 createLettuceConnectionFactory(clientConfig) 方法 private LettuceConnectionFactory createLettuceConnectionFactory( LettuceClientConfiguration clientConfiguration) { if (getSentinelConfig() != null) { return new LettuceConnectionFactory(getSentinelConfig(), clientConfiguration); } if (getClusterConfiguration() != null) { return new LettuceConnectionFactory(getClusterConfiguration(), clientConfiguration); } // 没有哨兵模式,没有集群,执行这块代码 return new LettuceConnectionFactory(getStandaloneConfig(), clientConfiguration); } // 3. 获取 redis 连接 private boolean shareNativeConnection = true; public LettuceReactiveRedisConnection getReactiveConnection() { // 默认为 true return getShareNativeConnection() ? new LettuceReactiveRedisConnection(getSharedReactiveConnection(), reactiveConnectionProvider) : new LettuceReactiveRedisConnection(reactiveConnectionProvider); } LettuceReactiveRedisConnection(StatefulConnectionsharedConnection, LettuceConnectionProvider connectionProvider) { Assert.notNull(sharedConnection, "Shared StatefulConnection must not be null!"); Assert.notNull(connectionProvider, "LettuceConnectionProvider must not be null!"); this.dedicatedConnection = new AsyncConnect(connectionProvider, StatefulConnection.class); this.pubSubConnection = new AsyncConnect(connectionProvider, StatefulRedisPubSubConnection.class); // 包装 sharedConnection this.sharedConnection = Mono.just(sharedConnection); } protected Mono extends StatefulConnection > getConnection() { // 直接返回 sharedConnection if (sharedConnection != null) { return sharedConnection; } return getDedicatedConnection(); } // 4. shareNativeConnection 是怎么来的 protected StatefulConnection getSharedReactiveConnection() { return shareNativeConnection ? getOrCreateSharedReactiveConnection().getConnection() : null; } private SharedConnection getOrCreateSharedReactiveConnection() { synchronized (this.connectionMonitor) { if (this.reactiveConnection == null) { this.reactiveConnection = new SharedConnection<>(reactiveConnectionProvider, true); } return this.reactiveConnection; } } StatefulConnection getConnection() { synchronized (this.connectionMonitor) { // 第一次通过 getNativeConnection() 获取连接;之后直接返回该连接 if (this.connection == null) { this.connection = getNativeConnection(); } if (getValidateConnection()) { validateConnection(); } return this.connection; } }
分析以上源码,关键就在于 shareNativeConnection 默认为 true,导致只有一个连接。
更改 shareNativeConnection 的值为 true,并开启 lettuce 连接池,最大连接数设置为 6;再次测试,
[root@eureka2 jstack]# ss | grep 6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:48937 ::ffff:10.201.0.30:6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:35842 ::ffff:10.201.0.30:6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:48932 ::ffff:10.201.0.30:6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:48930 ::ffff:10.201.0.30:6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:48936 ::ffff:10.201.0.30:6379 tcp ESTAB 0 0 ::ffff:10.201.0.27:48934 ::ffff:10.201.0.30:6379 [root@eureka2 jstack]# jstack 23080 | grep lettuce-epollEventLoop "lettuce-epollEventLoop-4-6" #69 daemon prio=5 os_prio=0 tid=0x00007fcfa4012000 nid=0x5af2 runnable [0x00007fcfa81ef000] "lettuce-epollEventLoop-4-5" #67 daemon prio=5 os_prio=0 tid=0x00007fcf94003800 nid=0x5af0 runnable [0x00007fcfa83f1000] "lettuce-epollEventLoop-4-4" #60 daemon prio=5 os_prio=0 tid=0x00007fcfa0003000 nid=0x5ae9 runnable [0x00007fcfa8af8000] "lettuce-epollEventLoop-4-3" #59 daemon prio=5 os_prio=0 tid=0x00007fcfb00b8000 nid=0x5ae8 runnable [0x00007fcfa8bf9000] "lettuce-epollEventLoop-4-2" #58 daemon prio=5 os_prio=0 tid=0x00007fcf6c00f000 nid=0x5ae7 runnable [0x00007fcfa8cfa000] "lettuce-epollEventLoop-4-1" #43 daemon prio=5 os_prio=0 tid=0x00007fcfac248800 nid=0x5a64 runnable [0x00007fd00c2b9000]
可以看到已经建立了 6 个 redis 连接,并且创建了 6 个 eventLoop 线程。
再次进行压力测试,结果如下:
[root@hystrix-dashboard wrk]# wrk -t 10 -c 500 -d 30s --latency -T 3s -s post-test.lua "http://10.201.0.27:8888/api/v1/json" Running 30s test @ http://10.201.0.27:8888/api/v1/json 10 threads and 500 connections Thread Stats Avg Stdev Max +/- Stdev Latency 215.83ms 104.38ms 1.00s 75.76% Req/Sec 234.56 49.87 434.00 71.45% Latency Distribution 50% 210.63ms 75% 281.30ms 90% 336.78ms 99% 519.51ms 69527 requests in 30.04s, 22.43MB read Requests/sec: 2314.14 Transfer/sec: 764.53KB [root@eureka2 jstack]# top -Hp 23080 top - 10:08:10 up 162 days, 12:31, 2 users, load average: 2.92, 1.19, 0.53 Threads: 563 total, 9 running, 554 sleeping, 0 stopped, 0 zombie %Cpu(s): 50.5 us, 10.2 sy, 0.0 ni, 36.2 id, 0.1 wa, 0.0 hi, 2.9 si, 0.0 st KiB Mem : 7677696 total, 215924 free, 3308248 used, 4153524 buff/cache KiB Swap: 6291452 total, 6291452 free, 0 used. 3468352 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 23280 root 20 0 7418804 1.3g 7404 R 42.7 17.8 0:54.75 java 23272 root 20 0 7418804 1.3g 7404 S 31.1 17.8 0:44.63 java 23273 root 20 0 7418804 1.3g 7404 S 31.1 17.8 0:44.45 java 23271 root 20 0 7418804 1.3g 7404 R 30.8 17.8 0:44.63 java 23282 root 20 0 7418804 1.3g 7404 S 30.5 17.8 0:44.96 java 23119 root 20 0 7418804 1.3g 7404 R 24.8 17.8 1:27.30 java 23133 root 20 0 7418804 1.3g 7404 R 23.8 17.8 1:29.55 java 23123 root 20 0 7418804 1.3g 7404 S 23.5 17.8 1:28.98 java 23138 root 20 0 7418804 1.3g 7404 S 23.5 17.8 1:44.19 java 23124 root 20 0 7418804 1.3g 7404 R 22.8 17.8 1:32.21 java 23139 root 20 0 7418804 1.3g 7404 R 22.5 17.8 1:29.49 java
最终结果没有任何提升,cpu 利用率依然不超过 400%,tps 也还是在 2300 request/s;单个 cpu 利用率最高不超过 50%,说明这次的瓶颈不是 cpu。
通过 jstack 查看线程状态,
"lettuce-epollEventLoop-4-3" #59 daemon prio=5 os_prio=0 tid=0x00007fcfb00b8000 nid=0x5ae8 waiting for monitor entry [0x00007fcfa8bf8000] java.lang.Thread.State: BLOCKED (on object monitor) at org.springframework.data.redis.core.script.DefaultRedisScript.getSha1(DefaultRedisScript.java:88) - waiting to lock <0x000000008c1da690> (a java.lang.Object) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.eval(DefaultReactiveScriptExecutor.java:113) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.lambda$execute$0(DefaultReactiveScriptExecutor.java:105) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor$$Lambda$1317/1912229933.doInRedis(Unknown Source) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor.lambda$execute$6(DefaultReactiveScriptExecutor.java:167) at org.springframework.data.redis.core.script.DefaultReactiveScriptExecutor$$Lambda$1318/1719274268.get(Unknown Source) at reactor.core.publisher.FluxDefer.subscribe(FluxDefer.java:46) at reactor.core.publisher.FluxDoFinally.subscribe(FluxDoFinally.java:73) at reactor.core.publisher.FluxOnErrorResume.subscribe(FluxOnErrorResume.java:47) at reactor.core.publisher.MonoReduceSeed.subscribe(MonoReduceSeed.java:65) at reactor.core.publisher.MonoMapFuseable.subscribe(MonoMapFuseable.java:59) at reactor.core.publisher.MonoFlatMap.subscribe(MonoFlatMap.java:60) at reactor.core.publisher.Mono.subscribe(Mono.java:3608) at reactor.core.publisher.FluxFlatMap.trySubscribeScalarMap(FluxFlatMap.java:169) at reactor.core.publisher.MonoFlatMap.subscribe(MonoFlatMap.java:53) at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:52) at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150) at reactor.core.publisher.FluxSwitchIfEmpty$SwitchIfEmptySubscriber.onNext(FluxSwitchIfEmpty.java:67) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1476) at reactor.core.publisher.MonoFlatMap$FlatMapInner.onNext(MonoFlatMap.java:241) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1476) at reactor.core.publisher.MonoProcessor.subscribe(MonoProcessor.java:457) at reactor.core.publisher.MonoFlatMap$FlatMapMain.onNext(MonoFlatMap.java:150) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1476) at reactor.core.publisher.MonoHasElement$HasElementSubscriber.onNext(MonoHasElement.java:74) at reactor.core.publisher.Operators$MonoSubscriber.complete(Operators.java:1476) at reactor.core.publisher.MonoProcessor.onNext(MonoProcessor.java:389) at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:76) at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onNext(FluxDoFinally.java:123) at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114) at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114) at reactor.core.publisher.FluxFilter$FilterSubscriber.onNext(FluxFilter.java:107) at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:76) at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onNext(FluxOnErrorResume.java:73) at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onNext(MonoFlatMapMany.java:238) at reactor.core.publisher.FluxDefaultIfEmpty$DefaultIfEmptySubscriber.onNext(FluxDefaultIfEmpty.java:92) at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:114) at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:76) at io.lettuce.core.RedisPublisher$RedisSubscription.onNext(RedisPublisher.java:270) at io.lettuce.core.RedisPublisher$SubscriptionCommand.complete(RedisPublisher.java:754) at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:59) at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:646) at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:604) at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:556) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433) at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)
有 4 个 lettuce-epollEventLoop 线程都处于 BLOCKED 状态,继续查看源码:
public class DefaultRedisScriptimplements RedisScript , InitializingBean { private @Nullable ScriptSource scriptSource; private @Nullable String sha1; private @Nullable Class resultType; public String getSha1() { // 1. 线程需要先获取 shaModifiedMonitor 锁 synchronized (shaModifiedMonitor) { // 第一次调用时或者 lua 脚本文件被修改时,需要重新计算 sha1 的值 // 否则直接返回sha1 if (sha1 == null || scriptSource.isModified()) { this.sha1 = DigestUtils.sha1DigestAsHex(getScriptAsString()); } return sha1; } } public String getScriptAsString() { try { return scriptSource.getScriptAsString(); } catch (IOException e) { throw new ScriptingException("Error reading script text", e); } } } public class ResourceScriptSource implements ScriptSource { // 只有第一次调用或者 lua 脚本文件被修改时,才会执行这个方法 @Override public String getScriptAsString() throws IOException { synchronized (this.lastModifiedMonitor) { this.lastModified = retrieveLastModifiedTime(); } Reader reader = this.resource.getReader(); return FileCopyUtils.copyToString(reader); } @Override public boolean isModified() { // 2. 每次都需要判断 lua 脚本是否被修改 // 线程需要再获取 lastModifiedMonitor 锁 synchronized (this.lastModifiedMonitor) { return (this.lastModified < 0 || retrieveLastModifiedTime() > this.lastModified); } } }
对于限流操作,重要性并没有那么高,而且计算接口调用次数的 lua 脚本,一般也不会经常改动,所以没必要获取 sha1 的值的时候都查看下脚本是否有改动;如果偶尔改动的话,可以通过新增一个刷新接口,在改动脚本文件后通过手动刷新接口来改变 sha1 的值。
所以这里,可以把同步操作去掉;我改成了这样:
public class CustomRedisScriptextends DefaultRedisScript { private @Nullable String sha1; CustomRedisScript(ScriptSource scriptSource, Class resultType) { setScriptSource(scriptSource); setResultType(resultType); this.sha1 = DigestUtils.sha1DigestAsHex(getScriptAsString()); } @Override public String getSha1() { return sha1; } }
继续测试,结果如下:
[root@hystrix-dashboard wrk]# wrk -t 10 -c 500 -d 30s -T 3s -s post-test.lua --latency "http://10.201.0.27:8888/api/v1/json" Running 30s test @ http://10.201.0.27:8888/api/v1/json 10 threads and 500 connections Thread Stats Avg Stdev Max +/- Stdev Latency 155.60ms 110.40ms 1.07s 67.68% Req/Sec 342.90 64.88 570.00 70.35% Latency Distribution 50% 139.14ms 75% 211.03ms 90% 299.74ms 99% 507.03ms 102462 requests in 30.02s, 33.15MB read Requests/sec: 3413.13 Transfer/sec: 1.10MB
cpu 利用率 500% 左右,tps 达到了 3400 req/s,性能大幅度提升。查看 cpu 状态:
[root@eureka2 imf2]# top -Hp 19021 top - 16:24:09 up 163 days, 18:47, 2 users, load average: 3.03, 1.08, 0.47 Threads: 857 total, 7 running, 850 sleeping, 0 stopped, 0 zombie %Cpu0 : 60.2 us, 10.0 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 25.4 si, 0.0 st %Cpu1 : 64.6 us, 16.3 sy, 0.0 ni, 19.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 65.7 us, 15.8 sy, 0.0 ni, 18.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu3 : 54.5 us, 15.8 sy, 0.0 ni, 29.5 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 55.0 us, 17.8 sy, 0.0 ni, 27.2 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 53.2 us, 16.4 sy, 0.0 ni, 30.0 id, 0.3 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem : 7677696 total, 174164 free, 3061892 used, 4441640 buff/cache KiB Swap: 6291452 total, 6291452 free, 0 used. 3687692 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 19075 root 20 0 7722156 1.2g 14488 S 41.4 15.9 0:55.71 java 19363 root 20 0 7722156 1.2g 14488 R 40.1 15.9 0:41.33 java 19071 root 20 0 7722156 1.2g 14488 R 37.1 15.9 0:56.38 java 19060 root 20 0 7722156 1.2g 14488 S 35.4 15.9 0:52.74 java 19073 root 20 0 7722156 1.2g 14488 R 35.1 15.9 0:55.83 java
cpu0 利用率达到了 95.7%,几乎跑满。但是其中出现了 si(软中断): 25.4%。
查看软中断类型:
[root@eureka2 imf2]# watch -d -n 1 "cat /proc/softirqs" CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 HI: 0 0 0 0 0 0 TIMER: 1629142082 990710808 852299786 606344269 586896512 566624764 NET_TX: 291570 833710 9616 5295 5358 2012064 NET_RX: 2563401537 32502894 31370533 6886869 6530120 6490002 BLOCK: 18130 1681 41404591 8751054 8695636 8763338 BLOCK_IOPOLL: 0 0 0 0 0 0 TASKLET: 39225643 0 0 817 17304 2516988 SCHED: 782335782 442142733 378856479 248794679 238417109 259695794 HRTIMER: 0 0 0 0 0 0 RCU: 690827224 504025610 464412234 246695846 254062933 248859132
其中 NET_RX,CPU0 的中断次数远远大于其他 CPU,初步判断是网卡问题。
我这边网卡是 ens32,查看网卡的中断号:
[root@eureka2 imf2]# cat /proc/interrupts | grep ens 18: 2524017495 0 0 0 0 7 IO-APIC-fasteoi ens32 [root@eureka2 imf2]# cat /proc/irq/18/smp_affinity 01 [root@eureka2 imf2]# cat /proc/irq/18/smp_affinity_list 0
网卡的中断配置到了 CPU0。(01:表示 cpu0,02:cpu1,04:cpu2,08:cpu3,10:cpu4,20:cpu5)
smp_affinity:16 进制;smp_affinity_list:配置到了哪些 cpu。
查看网卡队列模式:
[root@eureka2 ~]# lspci -vvv 02:00.0 Ethernet controller: Intel Corporation 82545EM Gigabit Ethernet Controller (Copper) (rev 01) Subsystem: VMware PRO/1000 MT Single Port Adapter Physical Slot: 32 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx- Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort-SERR- 由于是单队列模式,所以通过修改 /proc/irq/18/smp_affinity 的值不能生效。
可以通过 RPS/RFS 在软件层面模拟多队列网卡功能。
[root@eureka2 ~]# echo 3e > /sys/class/net/ens32/queues/rx-0/rps_cpus [root@eureka2 rx-0]# sysctl net.core.rps_sock_flow_entries=32768 [root@eureka2 rx-0]# echo 32768 > /sys/class/net/ens32/queues/rx-0/rps_flow_cnt/sys/class/net/ens32/queues/rx-0/rps_cpus: 1e,设置模拟网卡中断分配到 cpu1-5 上。
继续测试,
[root@hystrix-dashboard wrk]# wrk -t 10 -c 500 -d 30s -T 3s -s post-test.lua --latency "http://10.201.0.27:8888/api/v1/json" Running 30s test @ http://10.201.0.27:8888/api/v1/json 10 threads and 500 connections Thread Stats Avg Stdev Max +/- Stdev Latency 146.75ms 108.45ms 1.01s 65.53% Req/Sec 367.80 64.55 575.00 67.93% Latency Distribution 50% 130.93ms 75% 200.72ms 90% 290.32ms 99% 493.84ms 109922 requests in 30.02s, 35.56MB read Requests/sec: 3661.21 Transfer/sec: 1.18MB [root@eureka2 rx-0]# top -Hp 19021 top - 09:39:49 up 164 days, 12:03, 1 user, load average: 2.76, 2.02, 1.22 Threads: 559 total, 9 running, 550 sleeping, 0 stopped, 0 zombie %Cpu0 : 55.1 us, 13.0 sy, 0.0 ni, 17.5 id, 0.0 wa, 0.0 hi, 14.4 si, 0.0 st %Cpu1 : 60.1 us, 14.0 sy, 0.0 ni, 22.5 id, 0.0 wa, 0.0 hi, 3.4 si, 0.0 st %Cpu2 : 59.5 us, 14.3 sy, 0.0 ni, 22.4 id, 0.0 wa, 0.0 hi, 3.7 si, 0.0 st %Cpu3 : 58.6 us, 15.2 sy, 0.0 ni, 22.2 id, 0.0 wa, 0.0 hi, 4.0 si, 0.0 st %Cpu4 : 59.1 us, 14.8 sy, 0.0 ni, 22.7 id, 0.0 wa, 0.0 hi, 3.4 si, 0.0 st %Cpu5 : 57.7 us, 16.2 sy, 0.0 ni, 23.0 id, 0.0 wa, 0.0 hi, 3.1 si, 0.0 st KiB Mem : 7677696 total, 373940 free, 3217180 used, 4086576 buff/cache KiB Swap: 6291452 total, 6291452 free, 0 used. 3533812 avail Mem PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 19060 root 20 0 7415812 1.2g 13384 S 40.7 16.7 3:23.05 java 19073 root 20 0 7415812 1.2g 13384 R 40.1 16.7 3:20.56 java 19365 root 20 0 7415812 1.2g 13384 R 40.1 16.7 2:36.65 java可以看到软中断也分配到了 cpu1-5 上;至于为什么还是 cpu0 上软中断比例最高,猜测是因为还有一些其他中断并且默认配置在 cpu0 上?
同时,tps 也从 3400 -> 3600,提升不大。
2.4 增加 redis 连接
经过以上修改,cup 利用率还是不超过 500%,说明在某些地方还是存在瓶颈。
尝试修改了下 lettuce 连接池,
spring: redis: database: x host: x.x.x.x port: 6379 lettuce: pool: max-active: 18 min-idle: 1 max-idle: 18主要是把 max-active 参数 6 增大到了 18,继续测试:
[root@hystrix-dashboard wrk]# wrk -t 10 -c 500 -d 120s -T 3s -s post-test.lua --latency "http://10.201.0.27:8888/api/v1/json" Running 2m test @ http://10.201.0.27:8888/api/v1/json 10 threads and 500 connections Thread Stats Avg Stdev Max +/- Stdev Latency 117.66ms 96.72ms 1.34s 86.48% Req/Sec 485.42 90.41 790.00 70.80% Latency Distribution 50% 90.04ms 75% 156.01ms 90% 243.63ms 99% 464.04ms 578298 requests in 2.00m, 187.01MB read Requests/sec: 4815.57 Transfer/sec: 1.56MB6 核 cpu 几乎跑满,同时 tps 也从 3600 -> 4800,提升明显!
这说明之前的瓶颈出在 redis 连接上,那么如何判断 tcp 连接是瓶颈呢?(尝试通过 ss、netstat 等命令查看 tcp 发送缓冲区、接收缓冲区、半连接队列、全连接队列等,未发现问题。先放着,以后在研究)
文章版权归作者所有,未经允许请勿转载,若此文章存在违规行为,您可以联系管理员删除。
转载请注明本文地址:https://www.ucloud.cn/yun/73051.html
摘要:哪吒社区技能树打卡打卡贴函数式接口简介领域优质创作者哪吒公众号作者架构师奋斗者扫描主页左侧二维码,加入群聊,一起学习一起进步欢迎点赞收藏留言前情提要无意间听到领导们的谈话,现在公司的现状是码农太多,但能独立带队的人太少,简而言之,不缺干 ? 哪吒社区Java技能树打卡 【打卡贴 day2...
摘要:结构型模式适配器模式桥接模式装饰模式组合模式外观模式享元模式代理模式。行为型模式模版方法模式命令模式迭代器模式观察者模式中介者模式备忘录模式解释器模式模式状态模式策略模式职责链模式责任链模式访问者模式。 主要版本 更新时间 备注 v1.0 2015-08-01 首次发布 v1.1 2018-03-12 增加新技术知识、完善知识体系 v2.0 2019-02-19 结构...
摘要:作为面试官,我是如何甄别应聘者的包装程度语言和等其他语言的对比分析和主从复制的原理详解和持久化的原理是什么面试中经常被问到的持久化与恢复实现故障恢复自动化详解哨兵技术查漏补缺最易错过的技术要点大扫盲意外宕机不难解决,但你真的懂数据恢复吗每秒 作为面试官,我是如何甄别应聘者的包装程度Go语言和Java、python等其他语言的对比分析 Redis和MySQL Redis:主从复制的原理详...
摘要:作为面试官,我是如何甄别应聘者的包装程度语言和等其他语言的对比分析和主从复制的原理详解和持久化的原理是什么面试中经常被问到的持久化与恢复实现故障恢复自动化详解哨兵技术查漏补缺最易错过的技术要点大扫盲意外宕机不难解决,但你真的懂数据恢复吗每秒 作为面试官,我是如何甄别应聘者的包装程度Go语言和Java、python等其他语言的对比分析 Redis和MySQL Redis:主从复制的原理详...
阅读 2416·2021-10-09 09:44
阅读 3758·2021-09-22 15:43
阅读 2907·2021-09-02 09:47
阅读 2515·2021-08-12 13:29
阅读 3840·2019-08-30 15:43
阅读 1665·2019-08-30 13:06
阅读 2165·2019-08-29 16:07
阅读 2727·2019-08-29 15:23