spring-data-redis under millions of QPS: the pressure was too high, connections failed, and I was dumbfounded

Hello everyone. Our business volume has skyrocketed recently, which has left me dumbfounded lately. A few days ago, I found that several instances of a core microservice, newly scaled out because of the surge in load, were throwing Redis connection failure exceptions to varying degrees:

  org.springframework.data.redis.RedisConnectionFailureException: Unable to connect to Redis; nested exception is io.lettuce.core.RedisConnectionException: Unable to connect to redis.production.com
      at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1553) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.getConnection(LettuceConnectionFactory.java:1461) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.doGetAsyncDedicatedConnection(LettuceConnection.java:1027) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.getOrCreateDedicatedConnection(LettuceConnection.java:1013) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.openPipeline(LettuceConnection.java:527) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.DefaultStringRedisConnection.openPipeline(DefaultStringRedisConnection.java:3245) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at jdk.internal.reflect.GeneratedMethodAccessor319.invoke(Unknown Source) ~[?:?]
      at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
      at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
      at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:61) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at com.sun.proxy.$Proxy355.openPipeline(Unknown Source) ~[?:?]
      at org.springframework.data.redis.core.RedisTemplate.lambda$executePipelined$1(RedisTemplate.java:318) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:222) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:189) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:176) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.executePipelined(RedisTemplate.java:317) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.executePipelined(RedisTemplate.java:307) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate$$FastClassBySpringCGLIB$$81812bd6.invoke(<generated>) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      // Omit some stacks
  Caused by: org.springframework.dao.QueryTimeoutException: Redis command timed out
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.closePipeline(LettuceConnection.java:592) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      ... 142 more

At the same time, business calls were also hitting Redis command timeout exceptions:

  org.springframework.data.redis.connection.RedisPipelineException: Pipeline contained one or more invalid commands; nested exception is org.springframework.data.redis.connection.RedisPipelineException: Pipeline contained one or more invalid commands; nested exception is org.springframework.dao.QueryTimeoutException: Redis command timed out
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.closePipeline(LettuceConnection.java:594) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.connection.DefaultStringRedisConnection.closePipeline(DefaultStringRedisConnection.java:3224) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at jdk.internal.reflect.GeneratedMethodAccessor198.invoke(Unknown Source) ~[?:?]
      at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
      at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
      at org.springframework.data.redis.core.CloseSuppressingInvocationHandler.invoke(CloseSuppressingInvocationHandler.java:61) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at com.sun.proxy.$Proxy355.closePipeline(Unknown Source) ~[?:?]
      at org.springframework.data.redis.core.RedisTemplate.lambda$executePipelined$1(RedisTemplate.java:326) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:222) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:189) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.execute(RedisTemplate.java:176) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.executePipelined(RedisTemplate.java:317) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate.executePipelined(RedisTemplate.java:307) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.data.redis.core.RedisTemplate$$FastClassBySpringCGLIB$$81812bd6.invoke(<generated>) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:779) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:750) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:692) ~[spring-aop-5.3.7.jar!/:5.3.7]
      at org.springframework.data.redis.core.StringRedisTemplate$$EnhancerBySpringCGLIB$$c9b8cc15.executePipelined(<generated>) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      // Omit part of the stack
  Caused by: org.springframework.data.redis.connection.RedisPipelineException: Pipeline contained one or more invalid commands; nested exception is org.springframework.dao.QueryTimeoutException: Redis command timed out
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.closePipeline(LettuceConnection.java:592) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      ... 142 more
  Caused by: org.springframework.dao.QueryTimeoutException: Redis command timed out
      at org.springframework.data.redis.connection.lettuce.LettuceConnection.closePipeline(LettuceConnection.java:592) ~[spring-data-redis-2.4.9.jar!/:2.4.9]
      ... 142 more

The configuration of our spring-data-redis is:

  spring:
    redis:
      host: redis.production.com
      port: 6379
      # Command timeout
      timeout: 3000
      lettuce:
        pool:
          max-active: 128
          max-idle: 128
          max-wait: 3000

Although these requests failed on the first instance they were sent to, we have a retry mechanism, so they succeeded in the end. But they took 3s longer than a normal request, and such requests accounted for about 3% of all requests.

The exception stack shows that the root cause is a Redis command timeout. But why is a Redis command being executed while a Redis connection is being established?

Lettuce connection establishment process

Our Redis access uses spring-data-redis + Lettuce connection pool. By default, the process of establishing a Redis connection in Lettuce is:

  1. Establish a TCP connection.
  2. Perform the necessary handshake:
    1. For Redis 2.x ~ 5.x:
      1. If a user name and password are required, send the user name and password information.
      2. If the heartbeat before using the connection is turned on, send PING.
    2. For Redis 6.x: starting with 6.x, a new command, HELLO, was introduced to initialize the Redis connection in a uniform way (see the Redis HELLO documentation). The user name and password can be passed as parameters of this command to complete authentication.
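To make the handshake steps above more concrete, here is a minimal sketch of what the handshake commands look like on the wire in the Redis serialization protocol (RESP), where a command is sent as an array of bulk strings. The class and method names are hypothetical and are not part of Lettuce. The sketch follows the description in this post rather than Lettuce's actual handshake code, and it assumes protocol version 3 for HELLO and the user name `default` when only a password is supplied.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not a Lettuce class.
final class RespHandshakeSketch {

    // Encode a command as a RESP array of bulk strings:
    // "*<argc>\r\n" followed by "$<byte length>\r\n<arg>\r\n" for each argument.
    static String encode(String... args) {
        StringBuilder sb = new StringBuilder("*").append(args.length).append("\r\n");
        for (String arg : args) {
            int byteLength = arg.getBytes(StandardCharsets.UTF_8).length;
            sb.append('$').append(byteLength).append("\r\n").append(arg).append("\r\n");
        }
        return sb.toString();
    }

    // Handshake as described for Redis 2.x ~ 5.x: optional AUTH, then optional PING.
    static List<String> legacyHandshake(String username, String password, boolean pingBeforeActivate) {
        List<String> commands = new ArrayList<>();
        if (password != null) {
            commands.add(username == null
                    ? encode("AUTH", password)
                    : encode("AUTH", username, password));
        }
        if (pingBeforeActivate) {
            commands.add(encode("PING"));
        }
        return commands;
    }

    // Handshake as described for Redis 6.x: a single HELLO, with credentials inline.
    // Assumption: protocol version 3, and "default" as the user name if none is given.
    static List<String> helloHandshake(String username, String password) {
        List<String> commands = new ArrayList<>();
        if (password != null) {
            String user = (username == null) ? "default" : username;
            commands.add(encode("HELLO", "3", "AUTH", user, password));
        } else {
            commands.add(encode("HELLO", "3"));
        }
        return commands;
    }
}
```

Either way, the client has to wait for the server's reply to these commands before the connection counts as established, which is why a slow Redis can turn a connection attempt into a command timeout.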

For Redis 2.x ~ 5.x, we can configure whether to send a PING heartbeat before the connection is activated. It is enabled by default:

ClientOptions

  public static final boolean DEFAULT_PING_BEFORE_ACTIVATE_CONNECTION = true;

The Redis version we use is the latest 6.x, so in the connection and handshake phase a HELLO command must be sent, and the connection only counts as established once its response comes back successfully.

So why does this simple command time out?

View Redis command pressure through JFR

In our project, Redis operations go through spring-data-redis + the Lettuce connection pool, and we have enabled and extended JFR monitoring for Lettuce commands. You can refer to this article of mine: "This new way of monitoring the Redis connection pool is really nice~ I'll add a little seasoning to it". My pull request has since been merged, and this feature will be released in version 6.2.x. Let's look at the Redis commands collected around the time of the problem, as shown in the following figure:

We can see that Redis was under fairly heavy pressure at this point (the unit of firstResponsePercentiles in the figure is microseconds). There were 7 instances at the time. This instance had only just started, so its load was small compared with the other instances, yet connection command timeouts were already occurring. Also, only the HGET command is captured here. The GET command is executed on the same order of magnitude as HGET, and all the remaining commands together add up to about half of HGET. From the client's perspective, the QPS of commands sent to Redis had already exceeded one million.

Redis monitoring confirms there was indeed some pressure, which may have made some commands wait too long and time out.

Optimization ideas

Let's be clear first: with spring-data-redis + Lettuce, if we do not use commands that need an exclusive connection (Redis transactions and Redis pipelines), we do not need a connection pool. Lettuce is asynchronous and reactive, so requests that can share a connection all go through the same underlying Redis connection. In this microservice, however, a large number of pipelined commands are used to speed up queries. Without a connection pool, connections would be closed and created constantly (hundreds of thousands of times per second), which would seriously hurt efficiency. The official website does say that Lettuce does not need a connection pool, but that only holds when you use neither transactions nor pipelines.

First, scaling up Redis: our Redis is deployed on a public cloud. If scaling means upgrading the machine configuration, the next tier up doubles the specs of the current one, and the cost almost doubles too. At present, only under instantaneous pressure do fewer than 3% of requests fail, retry on the next instance, and finally succeed. From a cost perspective, scaling up Redis for this is not worth it.

Next, for applications under excessive pressure, we have a dynamic scale-out mechanism, and failed requests are retried. But this problem affects us in two ways:

  1. When instantaneous pressure arrives, a newly started instance may receive a large number of requests right away, so business requests and the heartbeat requests sent after a connection is established get mixed together. Because these requests are not queued fairly, some heartbeat requests get their responses too slowly and fail, and re-establishing the connection may fail again.
  2. Some instances may establish too few connections to meet the concurrency demand. Many requests then block while waiting for a connection, so CPU pressure does not spike and further scale-out is not triggered. This makes scale-out lag even more.

In fact, if we have a way to minimize or avoid connection creation failures, this problem can be greatly optimized. That is, before the microservice instance starts to provide services, all connections in the connection pool are created.

How to pre-create the connections in the Redis connection pool

Let's first see whether we can get this behavior from the connection pool with the official configuration alone.

Checking the official documentation, we find two configuration options:

min-idle is the minimum number of connections kept in the connection pool. time-between-eviction-runs is the interval of a scheduled task that checks that the pool holds at least min-idle connections and no more than max-idle. According to the official documentation, min-idle only takes effect when time-between-eviction-runs is also configured. The reason is that Lettuce's connection pool is built on commons-pool. That pool accepts a min-idle setting, but you have to call preparePool manually to create at least min-idle objects:

GenericObjectPool

  public void preparePool() throws Exception {
      // If a valid min-idle is configured, call ensureMinIdle to make sure
      // that at least min-idle objects are created
      if (this.getMinIdle() >= 1) {
          this.ensureMinIdle();
      }
  }

So when is ensureMinIdle called otherwise? commons-pool has a scheduled task whose initial delay and interval are both set by time-between-eviction-runs. Its code is:

  public void run() {
      final ClassLoader savedClassLoader = Thread.currentThread().getContextClassLoader();
      try {
          if (factoryClassLoader != null) {
              // Set the class loader for the factory
              final ClassLoader cl = factoryClassLoader.get();
              if (cl == null) {
                  // The pool has been dereferenced and the class loader
                  // GC'd. Cancel this timer so the pool can be GC'd as
                  // well.
                  cancel();
                  return;
              }
              Thread.currentThread().setContextClassLoader(cl);
          }

          // Evict from the pool
          try {
              evict();
          } catch (final Exception e) {
              swallowException(e);
          } catch (final OutOfMemoryError oome) {
              // Log problem but give evictor thread a chance to continue
              // in case error is recoverable
              oome.printStackTrace(System.err);
          }
          // Re-create idle instances.
          try {
              ensureMinIdle();
          } catch (final Exception e) {
              swallowException(e);
          }
      } finally {
          // Restore the previous CCL
          Thread.currentThread().setContextClassLoader(savedClassLoader);
      }
  }

We can see that this scheduled task mainly ensures that the number of idle objects in the pool does not exceed max-idle and that there are at least min-idle connections. These are all commons-pool mechanisms. But none of them does what we need: initialize all connections as soon as the connection pool is created.
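The gap described above can be illustrated with a toy model. The sketch below is not commons-pool: `LazyMinIdlePoolSketch` and its methods are hypothetical names, it uses only the JDK, and it ignores max-total, eviction, and thread safety. It only mirrors the roles of `ensureMinIdle()`, `preparePool()`, and one run of the scheduled task, to show that the constructor creates nothing and that the pool stays empty until one of the other two is invoked.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Supplier;

// Toy model for illustration only; not the commons-pool implementation.
final class LazyMinIdlePoolSketch<T> {

    private final Deque<T> idle = new ArrayDeque<>();
    private final Supplier<T> factory;
    private final int minIdle;
    private int created;

    // Like the real pool, constructing the pool creates no objects.
    LazyMinIdlePoolSketch(Supplier<T> factory, int minIdle) {
        this.factory = factory;
        this.minIdle = minIdle;
    }

    // Plays the role of ensureMinIdle(): top the idle queue up to minIdle.
    void ensureMinIdle() {
        while (idle.size() < minIdle) {
            idle.addLast(factory.get());
            created++;
        }
    }

    // Plays the role of preparePool(): only acts if a valid min-idle is set.
    void preparePool() {
        if (minIdle >= 1) {
            ensureMinIdle();
        }
    }

    // Plays the role of one run of the scheduled task (eviction omitted).
    void evictorRun() {
        ensureMinIdle();
    }

    // Borrow an idle object, creating one on demand if none is idle.
    T borrow() {
        T obj = idle.pollFirst();
        if (obj == null) {
            obj = factory.get();
            created++;
        }
        return obj;
    }

    int idleCount() {
        return idle.size();
    }

    int createdCount() {
        return created;
    }
}
```

In this model, a request that arrives between construction and the first `evictorRun()` has to pay for `factory.get()` itself, which corresponds to a business request waiting on a fresh Redis handshake. Calling `preparePool()` right after construction moves that cost to startup.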

We have to implement this ourselves. First we configure min-idle = max-idle = max-active, so that the connection pool holds the maximum number of connections at all times. Then we modify the source code at the place where the connection pool is created and force a call to preparePool to initialize all connections, namely:

ConnectionPoolSupport

  // This method is called when Lettuce creates the connection pool
  public static <T extends StatefulConnection<?, ?>> GenericObjectPool<T> createGenericObjectPool(
          Supplier<T> connectionSupplier, GenericObjectPoolConfig<T> config, boolean wrapConnections) {
      // Omit other code
      GenericObjectPool<T> pool = new GenericObjectPool<T>(new RedisPooledObjectFactory<T>(connectionSupplier), config) {
          @Override
          public T borrowObject() throws Exception {
              return wrapConnections
                      ? ConnectionWrapping.wrapConnection(super.borrowObject(), poolRef.get())
                      : super.borrowObject();
          }

          @Override
          public void returnObject(T obj) {
              if (wrapConnections && obj instanceof HasTargetConnection) {
                  super.returnObject((T) ((HasTargetConnection) obj).getTargetConnection());
                  return;
              }
              super.returnObject(obj);
          }
      };
      // After creation, call preparePool
      try {
          pool.preparePool();
      } catch (Exception e) {
          throw new RedisConnectionException("prepare connection pool failed", e);
      }
      // Omit other code
  }

In this way, all Redis connections are created when Redis is initialized, before the microservice actually starts serving requests. Since this involves modifying library source code, for now you can override the class in the dependency by adding a class with the same name and the same path to your project. For this optimization, I have also filed an issue and a corresponding pull request with Lettuce.
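The configuration half of this change (min-idle = max-idle = max-active) might look like the fragment below. This is a sketch that assumes the Spring Boot `spring.redis.lettuce.pool.*` properties and reuses the pool size of 128 from the configuration shown earlier in this post. The `30s` eviction interval is a placeholder, not a value from our actual setup.

```yaml
spring:
  redis:
    lettuce:
      pool:
        # Keep the pool full at all times: min-idle = max-idle = max-active
        max-active: 128
        max-idle: 128
        min-idle: 128
        max-wait: 3000
        # Placeholder interval for the commons-pool scheduled task,
        # which re-creates connections if the idle count drops below min-idle
        time-between-eviction-runs: 30s
```

With the patched pool creation calling preparePool, the pool is filled at startup, and the scheduled task then only has to top it back up if connections are lost later.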