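Problem
After setting up graceful shutdown on k8s and enabling rolling updates, I found that two instances of the service are in the Running state at the same time. When the old instance starts going offline, there is a window of three to four seconds during which calls to the service's APIs are still load-balanced to it, and those calls fail with a service exception.
Investigation showed the cause: the old pod does enter the Terminating state, but Nacos still has the instance cached, so callers keep routing requests to the old instance.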
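The effect of a cached instance list can be illustrated with a toy model. This is only a sketch of the general idea, not the Nacos client's actual implementation: the `Registry` and `CachingClient` classes, the pod addresses, and the 10-second refresh interval are all hypothetical, and time is passed in explicitly so the run is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

public class StaleCacheDemo {

    /** Hypothetical registry holding the authoritative list of live instances. */
    public static final class Registry {
        private final List<String> instances = new ArrayList<>();

        public void register(String address) { instances.add(address); }

        public void deregister(String address) { instances.remove(address); }

        public List<String> snapshot() { return new ArrayList<>(instances); }
    }

    /** Hypothetical caller that only re-reads the registry every refreshMillis. */
    public static final class CachingClient {
        private final Registry registry;
        private final long refreshMillis;
        private List<String> cached;
        private long lastRefresh;

        public CachingClient(Registry registry, long refreshMillis, long now) {
            this.registry = registry;
            this.refreshMillis = refreshMillis;
            this.cached = registry.snapshot();
            this.lastRefresh = now;
        }

        /** Returns the instance list the caller would load-balance over at time 'now'. */
        public List<String> instances(long now) {
            if (now - lastRefresh >= refreshMillis) {
                cached = registry.snapshot();
                lastRefresh = now;
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        registry.register("old-pod:8080");
        registry.register("new-pod:8080");
        CachingClient client = new CachingClient(registry, 10_000, 0);

        // The old pod leaves the registry, but the caller has not refreshed yet.
        registry.deregister("old-pod:8080");
        System.out.println("before refresh: " + client.instances(2_000));
        System.out.println("after refresh:  " + client.instances(10_000));
    }
}
```

Until the caller refreshes its copy, requests can still be sent to the instance that is going away. The fix described in the rest of this post deregisters the instance first and then keeps the pod alive long enough for callers to catch up.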
Rolling update configuration:
What it looks like when traffic is load-balanced to a pod that has already gone offline:
Dependencies
Health checks and Prometheus support are required:
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <scope>runtime</scope>
</dependency>
bootstrap.yaml
Give actuator its own port so that its endpoints cannot be reached directly through the gateway.
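# Graceful shutdown configuration
management:
  server:
    port: 9081
  endpoints:
    web:
      exposure:
        include: '*' # exposes all endpoints over HTTP; by default only health (and info on older Spring Boot versions) is exposed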
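bootstrap-xx.yaml
The environment-specific configuration file needs its own settings as well; putting everything in bootstrap.yaml alone did not take effect. The endpoints must be exposed here, otherwise the k8s checks configured later will not succeed.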
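management:
  endpoints:
    web:
      exposure:
        # endpoints exposed over HTTP
        # if this list overrides the '*' from bootstrap.yaml, the custom nacos-deregister endpoint must be added here to stay reachable
        include: 'refresh,info,health,logfile,loggers,heapdump,threaddump,metrics,prometheus,mappings,env'
      base-path: /actuator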
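Each service needs a deregistration endpoint, NacosDeRegistryEndpoint
As an improvement, this class can live in the company's common package so that individual services do not each have to add it.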
import com.alibaba.cloud.nacos.NacosDiscoveryProperties;
import com.alibaba.cloud.nacos.registry.NacosRegistration;
import com.alibaba.cloud.nacos.registry.NacosServiceRegistry;
import lombok.extern.slf4j.Slf4j;
import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
import org.springframework.boot.actuate.endpoint.annotation.WriteOperation;
import org.springframework.stereotype.Component;

@Slf4j
@Component
@Endpoint(id = "nacos-deregister")
public class NacosDeRegistryEndpoint {

    private final NacosRegistration nacosRegistration;
    private final NacosServiceRegistry nacosServiceRegistry;
    private final NacosDiscoveryProperties nacosDiscoveryProperties;

    public NacosDeRegistryEndpoint(NacosRegistration nacosRegistration,
                                   NacosServiceRegistry nacosServiceRegistry,
                                   NacosDiscoveryProperties nacosDiscoveryProperties) {
        this.nacosRegistration = nacosRegistration;
        this.nacosServiceRegistry = nacosServiceRegistry;
        this.nacosDiscoveryProperties = nacosDiscoveryProperties;
    }

    /**
     * Proactively takes this instance offline in Nacos. Used during k8s rolling
     * updates to divert traffic away before the pod shuts down.
     */
    @WriteOperation
    public String endpoint() {
        String serviceName = nacosDiscoveryProperties.getService();
        String groupName = nacosDiscoveryProperties.getGroup();
        String clusterName = nacosDiscoveryProperties.getClusterName();
        String ip = nacosDiscoveryProperties.getIp();
        int port = nacosDiscoveryProperties.getPort();
        log.info("deregister from nacos, serviceName:{}, groupName:{}, clusterName:{}, ip:{}, port:{}",
                serviceName, groupName, clusterName, ip, port);
        // Mark this instance as DOWN in Nacos
        nacosServiceRegistry.setStatus(nacosRegistration, "DOWN");
        return "success";
    }
}
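Modify the service configuration in Rancher
The key step is to add a hook that runs before the service is stopped; the hook executes a preconfigured command. The command uses the management port 9081 and the nacos-deregister endpoint added above to take the instance offline in Nacos, then sleeps for 40 seconds so that callers stop routing to the pod before it actually shuts down, which avoids the caching problem. The preStop hook runs inside the pod's termination grace period (terminationGracePeriodSeconds, 30 seconds by default), so that value has to be longer than the sleep.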
/bin/sh -c 'curl -X POST "http://127.0.0.1:9081/actuator/nacos-deregister" && sleep 40'
Editing this through the UI is error-prone, so the change can also be made in YAML:
preStop:
  exec:
    command:
      - /bin/sh
      - -c
      - curl -X POST 'http://127.0.0.1:9081/actuator/nacos-deregister' && sleep 40
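Before relying on the hook, it may be worth confirming that a POST to the endpoint really returns 200 on the management port. curl exits successfully on a 404 unless -f is given, so the sleep would still run even if the endpoint were not exposed. The following is a minimal sketch of such a check, assuming Java 11+ for java.net.http; the class name DeregisterSmokeCheck and the local stub server are hypothetical stand-ins so the example runs without a real service.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

public class DeregisterSmokeCheck {

    /** Sends an empty POST, as `curl -X POST` does, and returns the HTTP status code. */
    public static int post(URI uri) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(2))
                .build();
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(5))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    /** Hypothetical stub for the endpoint: 200 on POST, 405 on any other method. */
    public static HttpServer startStub() throws IOException {
        // Port 0 asks the OS for any free port.
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/actuator/nacos-deregister", exchange -> {
            if ("POST".equals(exchange.getRequestMethod())) {
                byte[] body = "success".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            } else {
                exchange.sendResponseHeaders(405, -1); // -1 means no response body
            }
            exchange.close();
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Check a real service, e.g. the URL used in the preStop hook.
            System.out.println("status: " + post(URI.create(args[0])));
            return;
        }
        // No argument: demonstrate against the local stub.
        HttpServer stub = startStub();
        try {
            int port = stub.getAddress().getPort();
            URI uri = URI.create("http://127.0.0.1:" + port + "/actuator/nacos-deregister");
            System.out.println("status: " + post(uri));
        } finally {
            stub.stop(0);
        }
    }
}
```

Run from inside the pod with the hook's URL as the argument, any status other than 200 suggests the endpoint is not exposed on port 9081 and the hook would only be sleeping.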
The final result: