[Reprint source] https://cloud.tencent.com/developer/article/1638897

Preface

When people talk about Service Mesh, they always think of microservices and service governance: from Dubbo to Spring Cloud (which entered the view of domestic R&D teams around 2016 and flourished in 2017), and then to Service Mesh (which became widely known from 2018 onward). As the saying goes, the waves behind the Yangtze push on the waves ahead; as the newer wave, Service Mesh has no choice but to push forward, and Spring Cloud looks on with envy. The emergence and prosperity of the microservice architecture was a huge breakthrough in architectural form in the Internet era. Service Mesh does carry a certain learning cost, and in fact there are not yet many production deployments in China; most come from cloud vendors and leading enterprises. With improvements in performance and ecosystem and the spread of container scenarios across the major communities, however, Service Mesh has begun to take root in companies large and small, making up for the gaps that the container layer and Kubernetes leave in service governance. This article looks at the mainstream observability practices in Service Mesh from the perspective of someone evaluating the technology.

Starting from Service Mesh: how to do monitoring

The philosophy of observability

Observability is not a new term. It was coined long ago, but it is a relative newcomer to the IT field. Wikipedia defines it as follows: "In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs." The term first appeared in the cloud-native field in 2017, when the concept of cloud native was in full swing. Under the cloud-native trend, traditional terminology was no longer sufficient to capture the monitoring demands of this era, and observability seemed far more appropriate.

Recall the traditional approach to monitoring. Apart from host monitoring, JVM monitoring, and message-queue monitoring at the operations level, how much monitoring is actually designed in advance? Very little. In practice, much of what we do is post-mortem: after a failure occurs, in addition to reproducing and fixing the bug, we add some custom monitoring in the hope of getting a real-time alert the next time the same situation happens, so that engineers can respond quickly and minimize the loss. Most traditional monitoring is therefore reactive, compensating for failures rather than anticipating them.

Things are different in the containerized systems of the cloud-native era. The life cycles of containers and services are tightly coupled, and with the strong isolation containers provide plus the Kubernetes management layer on top, application services look far more like black boxes when running in containers. Compared with traditional physical hosts or virtual machines, troubleshooting becomes much less convenient. That is why the cloud-native era emphasizes observability: such monitoring must always come first, before anything else moves. We need to think about how to observe the services inside containers, the topology between services, and the collection of all kinds of metrics; these monitoring capabilities are essential.

There is no clear point in time at which observability began to catch on in the cloud-native field. The industry generally credits Cindy Sridharan with first proposing it, but in fact Peter Bourgon, an engineer based in Berlin, Germany, had already written about observability in February 2017 and was the first developer in the industry to discuss it publicly. His well-known blog post "Metrics, Tracing, and Logging" has been translated into multiple languages. Observability truly became a standard when Matt Stine of Pivotal listed it among his defining characteristics of cloud native, and it has been a standard theme of the cloud-native era ever since.


The three pillars of observability proposed by Peter Bourgon revolve around Metrics, Tracing, and Logging. These three dimensions cover almost every externally visible behavior of an application; by collecting and examining data in all three, developers can stay on top of how the application is running. The three pillars can be understood as follows:

  • Metrics: an aggregated form of data. The QPS, TP99, TP95, and similar figures we deal with every day all fall into the Metrics category. It is the dimension most closely tied to statistics, and designing good metrics often requires applying statistical principles (a minimal sketch follows this list);
  • Tracing: a concept that largely grew out of the complexity of the SOA era, when it became difficult to locate problems with logs alone, so its representation is more complex than Metrics. Fortunately, several protocols have emerged in the industry to support a unified implementation of the Tracing dimension;
  • Logging: a form of data triggered by a request or an event, used by an application to record snapshots of its state. Simply put, it is the log, but the log is not merely printed out: collecting, storing, and parsing it in a unified way is challenging. Handling structured and unstructured logs, for example, often requires a high-performance parser and buffer;
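To make the Metrics pillar concrete, here is a minimal sketch using the Prometheus Go client (client_golang). The metric name, label, port, and simulated latency are illustrative only; the point is that a histogram records per-request latencies so that aggregates such as QPS and TP95/TP99 can be derived at query time, and /metrics exposes the aggregated data for scraping.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A histogram is an aggregated form of data: Prometheus can later derive
// rates (QPS) and quantiles (TP95/TP99) from it at query time.
var reqDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds", // illustrative metric name
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

func main() {
	prometheus.MustRegister(reqDuration)

	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // simulated work
		reqDuration.WithLabelValues(r.URL.Path).Observe(time.Since(start).Seconds())
		w.Write([]byte("ok"))
	})

	// Prometheus scrapes this endpoint to collect the aggregated metrics.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

A PromQL expression such as histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) then yields TP99 from this data.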

In addition, Peter Bourgon's post also discussed the ideal output forms when the three pillars are combined, as well as their dependence on storage: because their degrees of aggregation differ, Metrics, Tracing, and Logging depend on storage in increasing order. Interested readers can find more detail through the original link at the end of the article.

Peter Bourgon's thinking on the three pillars of observability did not stop there. In his talk at GopherCon EU 2018, he went on to discuss what Metrics, Tracing, and Logging mean in industrial production, this time along four dimensions:

  • CapEx: the up-front cost of collecting a signal. Logging is clearly the cheapest, requiring little more than adding log statements; Metrics comes next; Tracing data is the hardest, because even with protocol support, many data points still have to be defined to assemble the metadata that distributed tracing requires;
  • OpEx: the operating cost, generally meaning storage cost, which was discussed above;
  • Reaction: how quickly a signal reveals an abnormal situation. Aggregated data shows fluctuations directly, so Metrics is the most sensitive to anomalies; exceptions can also be surfaced while cleaning and processing Logging data; Tracing has little to do with reaction speed and is mostly useful for troubleshooting and fault localization;
  • Investigation: the ability to pinpoint a fault. This dimension is Tracing's strength, since faults can be seen intuitively in the call chain and located precisely; Logging comes second; Metrics can only reflect fluctuations and is of little help in pinpointing the fault;

In the CNCF Landscape there is an area dedicated to observability solutions for cloud-native scenarios, divided into several dimensions. The picture shows the map as of May 14, 2020, and more excellent solutions will no doubt keep emerging. Of the ten projects that have currently graduated from CNCF, three are related to observability, which shows how much importance CNCF attaches to it.

[Figure: CNCF Landscape, Observability and Analysis section, as of May 14, 2020]

Speaking of which, many readers may be more interested in the observability-related protocols. Several are prominent at the moment, such as OpenTracing, OpenCensus, OpenTelemetry, and OpenMetrics; the first three are the popular ones, while the OpenMetrics project is no longer maintained.

OpenTracing is arguably the most widely used distributed tracing protocol today; the well-known SkyWalking is built on it. It defines vendor-neutral, language-independent tracing APIs, making cross-platform tracing easier to build, and it is currently thriving in the CNCF incubator.

OpenCensus is a protocol proposed by Google covering both the Tracing and Metrics scenarios. It is backed by Dapper and Google's long history in the area, even Microsoft supports it strongly, and it is currently very popular in the commercial field.

Other protocols, such as W3C Trace Context, are also very popular; it even compresses data in the header and is independent of the implementation layer. Perhaps CNCF realized that with protocols emerging one after another, each likely to gain its own following, every middleware would have to be compatible with all of them, which is not good for the technology ecosystem as a whole, so OpenTelemetry emerged. As the name suggests, CNCF intends to carry observability "telemetry" through to the end. It merges the protocol content of OpenTracing and OpenCensus to unify the collection and processing of observability data in the cloud-native era. OpenTelemetry has currently entered beta. Encouragingly, the Java SDK already has a non-invasive agent based on the byte-buddy framework, similar to SkyWalking, which can currently obtain telemetry data from 47 Java libraries, and APIs and SDKs have also been released for Erlang, Go, Java, JavaScript, and Python. In addition, there is the OpenTelemetry Collector, which can receive data sent by OpenTelemetry clients and collect and process it in a unified way. CNCF has for now suspended work on a Logging protocol, but a working group is also pursuing standardization in that area.
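As a small illustration of the unified API, here is a hedged sketch using the OpenTelemetry Go tracing API. The package paths reflect current Go releases and may differ from the beta described above; the tracer name and attribute are illustrative, and the exporter/Collector wiring is omitted, so the tracer below is a no-op until an SDK is configured.

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// doWork creates a span under the globally configured tracer provider.
// Exporting the data (e.g. to an OpenTelemetry Collector) is done by
// whatever SDK and exporter the application wires up at startup.
func doWork(ctx context.Context) {
	tracer := otel.Tracer("checkout-service") // illustrative instrumentation name

	ctx, span := tracer.Start(ctx, "doWork") // child of any span already in ctx
	defer span.End()

	span.SetAttributes(attribute.String("order.id", "42")) // telemetry metadata
	_ = ctx                                                // pass ctx on so downstream calls join the trace
}

func main() {
	doWork(context.Background())
}
```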


Service Mesh and observability

To put the relationship between Service Mesh and observability simply: observability is a subset of Service Mesh's capabilities. Service Mesh is one of the most popular technical concepts today. It is committed to providing unified service discovery, edge routing, security, traffic control, observability, and other capabilities for large-scale services running in containers in the cloud-native era, supplementing and strengthening Kubernetes' service-governance capabilities. Service Mesh can be said to be an inevitable product of cloud-native containerization, and it will have a profound impact on cloud service architecture. Its architectural idea is to treat each unit of running containers as a cell in a mesh, hijack the traffic of each unit, and process it uniformly through a unified control plane. Every cell in the mesh keeps a connection to the control plane, which allows the control plane to serve as the bridge between observability solutions and the container environment.


The most common Service Mesh implementations on the market include Linkerd, Istio, and Conduit, but to be deployed in a production environment, a solution must stand up to strict evaluation of performance, architectural soundness, and community activity.

Linkerd is developed and maintained by Buoyant and is regarded as the first-generation product in the Service Mesh field. Linkerd 1.x is written in Scala and can run directly on hosts; as everyone knows, the Scala runtime depends on the JDK, so its resource consumption is relatively high. Buoyant subsequently reworked it and launched a new-generation data-plane component, Conduit, written in Rust and Go, which combined with Linkerd to become Linkerd 2.x. Overall, Linkerd 2.x's performance has improved greatly and it provides a visual interface for operations, but it is not popular in China and the community has never really taken off.

Turning to Istio, which appeared in 2017: it was born with a silver spoon in its mouth, initiated by Google, IBM, and Lyft. Although it arrived a year later than Linkerd, it attracted wide attention and enthusiasm as soon as it was launched. Istio is written in Go, fits the Kubernetes environment perfectly, integrates Envoy as its data plane, and has a clear separation of responsibilities in service governance. Its production cases in China are more extensive than Linkerd's.

[Figure: Istio architecture before 1.5 (left) and from 1.5 onward (right)]

Overall, Istio is still a young open-source middleware, and there are considerable differences in component architecture between major versions. For example, 1.0 introduced Galley (shown in the left part of the figure), while 1.5 removed Mixer, consolidated the control plane into a single binary, and added the WASM extension mechanism (shown in the right part of the figure). The overall architecture has not changed much: the data plane still focuses on traffic hijacking and forwarding policy, and the control plane still handles telemetry collection, policy distribution, and security. At present, cloud vendors and leading companies dominate the use of Istio in China. For example, Ant Financial has developed its own Go-based data plane, MOSN, which is compatible with Istio and has undergone a lot of optimization, setting an example for Istio adoption in China; it is worth studying to see how to build a Service Mesh architecture better suited to the domestic Internet. Although Mixer was essentially deprecated in version 1.5, will be maintained only until version 1.7, and is completely disabled by default, most current deployments are still based on versions 1.0 to 1.4. So, without a wholesale upgrade, and with WASM performance still unclear, it seems Mixer cannot be dispensed with just yet. As mentioned earlier, Service Mesh is the bridge between the cloud-native container environment and observability, and Mixer's Adapters can be regarded as the main body of that bridge, with good extensibility. Besides traffic checks, the more important job of a Mixer Adapter is to collect telemetry data during the pre-check and reporting stages. The telemetry data is exposed or transmitted through Adapters to various observation backends, which draw rich traffic trajectories and event snapshots from it. The commonly used observability Adapters cover various commercial solutions, such as Datadog and New Relic, as well as open-source solutions such as Apache SkyWalking, Zipkin, Fluentd, and Prometheus. The relevant content is expanded below.


The data plane, such as Envoy, reports logs (Log), traces (Trace), monitoring metrics (Metric), and other data to Mixer. The raw data reported by Envoy is a set of attributes (Attributes): name-and-type metadata describing inbound and outbound traffic and the environment in which that traffic was generated. Mixer then formats the attributes according to the configured LogEntry, Metric, or TraceSpan template and finally hands them to a Mixer Adapter for further processing. Of course, for high-volume data such as logs and traces, you can also choose to report directly from Envoy to the processing backend; Envoy natively supports several specific components. Different Adapters need different Attributes, and a template defines the schema that maps Attributes to an Adapter's input data; one Adapter can support multiple templates. Three configuration models are abstracted in Mixer:

  • Handler: represents a configured Adapter instance;
  • Instance: defines the mapping rules from Attributes to the fields of a template;
  • Rule: binds Instances to a Handler and defines the conditions under which they are triggered;

The figure below shows a Metric template and a LogEntry template. Default values can also be set on top of the mapping relationships; see the official documentation for further settings.

[Figure: Mixer Metric and LogEntry template configuration]

The figure below shows a TraceSpan template. Readers familiar with OpenTracing will recognize much of the mapped content: many of the fields are standard values from the OpenTracing protocol, such as the various Span descriptions and http.method, http.status_code, and so on; interested readers can also consult the OpenTracing standard definitions. In addition, there is a common problem with distributed tracing in a Service Mesh: no matter how the data plane hijacks traffic, passes information through, or generates and inherits Spans, the mesh alone cannot stitch a service's inbound traffic to its outbound traffic. To solve this, the service's main container still has to do a little instrumentation of its own and propagate the tracing context to the next outbound request (a minimal sketch follows the figure). This problem is unavoidable; the later adoption of OpenTelemetry can at least standardize it.

[Figure: Mixer TraceSpan template configuration]
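The propagation mentioned above usually amounts to copying a handful of tracing headers from the inbound request to every outbound request the service makes. Below is a minimal sketch in Go; the header list follows the Envoy/Zipkin B3 conventions commonly documented for Istio, and the URLs and example values are illustrative, so confirm the exact header set against the tracer you deploy.

```go
package main

import (
	"fmt"
	"net/http"
)

// Headers that Envoy-based meshes commonly expect the application to forward
// so that inbound and outbound spans end up on the same trace.
var traceHeaders = []string{
	"x-request-id",
	"x-b3-traceid",
	"x-b3-spanid",
	"x-b3-parentspanid",
	"x-b3-sampled",
	"x-b3-flags",
	"x-ot-span-context",
}

// propagate copies the tracing context of an inbound request onto an
// outbound request built inside the handler.
func propagate(in *http.Request, out *http.Request) {
	for _, h := range traceHeaders {
		if v := in.Header.Get(h); v != "" {
			out.Header.Set(h, v)
		}
	}
}

func main() {
	in, _ := http.NewRequest("GET", "http://localhost/inbound", nil)
	in.Header.Set("x-b3-traceid", "80f198ee56343ba864fe8b2a57d3eff7") // example value
	out, _ := http.NewRequest("GET", "http://reviews:9080/reviews", nil)
	propagate(in, out)
	fmt.Println(out.Header.Get("x-b3-traceid")) // outbound call now joins the same trace
}
```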

Istio Observability Practice

From the Istio Mixer Adapters we can see that Istio supports tracing with Apache SkyWalking, Zipkin, and Jaeger. All three support the OpenTracing protocol, so any of them can be integrated through the TraceSpan template. The minor differences between the three are:

  • Zipkin is a veteran tracing middleware; the project started in 2012, and the newer versions are also easy to use;
  • Jaeger is a newer project, started in 2016 and written in Go. Backed by the cloud-native community and committed to solving tracing in the cloud-native era, it has developed very quickly; it is extremely simple to integrate with Istio and is Istio's officially recommended solution;
  • SkyWalking has been open source since 2015 and is currently booming, but the slight difference is that its current integration with Istio goes through out-of-process adaptation, which costs somewhat more; the upcoming 8.0 release (not yet out at the time of writing) has a corresponding solution;


Another middleware that is widely used for tracing in the cloud-native field is Jaeger. Open-sourced by Uber and accepted by CNCF, it is now a graduated project. It natively supports the OpenTracing protocol, interoperates with other middleware on the market, supports multiple storage backends, and is flexibly extensible. Jaeger is natively supported in Envoy: when a request reaches Envoy, Envoy either creates a Span or inherits one to keep the trace consistent, and it supports Zipkin's B3 header propagation as well as Jaeger's and LightStep's headers. In the Jaeger UI, a request can be located precisely through its TraceId.

ELK can be said to be a household name; since Spring Cloud became popular, it has been an excellent choice for logging. As technology has developed, EFK has appeared in recent years. The storage component Elasticsearch and the UI Kibana have not changed dramatically, but in strict production and container environments, and in all kinds of resource-sensitive scenarios, the requirements on the log collector keep rising, and the most popular choice is to replace Logstash with Fluentd or Filebeat. A brief comparison of the three:

  • Logstash: runs on the JVM and consumes a lot of resources; it is no longer recommended as the log collector;
  • Fluentd: its core is written in C and its plug-ins in Ruby; it graduated from CNCF in April 2019. Its resource consumption is very small, usually around 30 MB of memory, and it can ship logs to multiple buffers, i.e. multiple receivers; it is currently a common component in container environments;
  • Filebeat: written in Go, but there have been production issues where it drove up the load average of the underlying host, and its resource consumption is relatively large, roughly ten times that of Fluentd; before Fluentd it was widely used on virtual machines;

As for logging solutions in Istio: although Mixer provides a Fluentd Adapter, everyone knows this path is not great, so fetching the raw attribute logs from Envoy directly and then processing and shipping them to the storage backend is friendlier to the application and saves a large share of resources.

In the log dimension, if you want to locate a problem, it is best to bind the logs to the request. Binding a request to its logs requires a specific identifier, which can be a TransactionId or a TraceId, so converging tracing and logging is an imperative industry demand. When choosing tracing middleware, therefore, you must consider how to obtain the TraceId conveniently and combine it with the logs (a minimal sketch follows).
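One way to do that binding, sketched here with the OpenTelemetry Go API; any tracer that exposes the current TraceId works the same way, and the log field names are illustrative.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace prints an application log line that carries the TraceId of the
// span currently active in ctx, so the log entry can be joined to the trace.
func logWithTrace(ctx context.Context, msg string) {
	sc := trace.SpanFromContext(ctx).SpanContext()
	if sc.HasTraceID() {
		log.Printf("traceId=%s msg=%q", sc.TraceID().String(), msg)
		return
	}
	log.Printf("msg=%q", msg)
}

func main() {
	// No active span here, so only the message is printed; inside an
	// instrumented handler the traceId field would be populated.
	logWithTrace(context.Background(), "processing order")
}
```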


So is Fluentd the ultimate solution for log collection and shipping?

It is not. Fluentd's development team has launched the lighter Fluent Bit, written in pure C with an even smaller footprint, going straight from Fluentd's megabytes of memory down to kilobytes, which makes it better suited as a log collector. Fluentd, with nearly a thousand plug-ins available, is better suited as a log aggregation processor, handling processing and forwarding after collection. In practice you may hit some problems with Fluent Bit: earlier versions, for example, cannot reload configuration dynamically. The workaround is to run a separate supervisor process that controls starting and stopping Fluent Bit and watches for configuration changes, reloading whenever the configuration changes (a sketch of such a supervisor follows).
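A rough sketch of such a supervisor in Go, under the assumption that Fluent Bit is available as the fluent-bit binary on a Linux host and that polling the config file's modification time is an acceptable way to detect changes; the path and interval are illustrative.

```go
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cfg := "/fluent-bit/etc/fluent-bit.conf" // illustrative config path

	for {
		cmd := exec.Command("fluent-bit", "-c", cfg)
		cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
		if err := cmd.Start(); err != nil {
			log.Fatalf("start fluent-bit: %v", err)
		}

		prev, err := os.Stat(cfg)
		if err != nil {
			log.Fatalf("stat config: %v", err)
		}

		// Poll for configuration changes; restart Fluent Bit when one is seen.
		for {
			time.Sleep(5 * time.Second)
			cur, err := os.Stat(cfg)
			if err == nil && cur.ModTime() != prev.ModTime() {
				log.Println("config changed, restarting fluent-bit")
				_ = cmd.Process.Signal(syscall.SIGTERM) // ask for a graceful stop
				_ = cmd.Wait()
				break // outer loop starts a fresh process with the new config
			}
		}
	}
}
```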

[Figure: logging architecture with Loki]

As for Loki in the picture above, its core idea is exactly what the project tagline says: "Like Prometheus, but for logs", a log aggregation solution in the spirit of Prometheus, open-sourced in December 2018. In roughly two years it has gained nearly 10,000 stars. It was developed by the Grafana team, which says a lot about Grafana's ambition to unify cloud-native observability.

In the cloud-native era, it no longer seems appropriate to dump large volumes of raw logs directly into expensive storage media as in the past, whether a costly full-text index such as ES or columnar storage such as HBase, because 99% of raw logs will never be queried. Instead, logs should be merged, compressed into gzip after merging, and labeled with various tags, which better fits the cloud-native principle of operating with precision.

Loki, by contrast, can keep large volumes of logs in cheap object storage, and it labels logs into log streams so that the corresponding log entries can be retrieved quickly. Be aware, however, that it is unwise to replace EFK with Loki: they target different scenarios, and their guarantees of data completeness and their retrieval capabilities differ.
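For a feel of the log-stream model, here is a hedged sketch that pushes one log line to Loki over HTTP. The /loki/api/v1/push endpoint and JSON shape follow Loki's documented v1 push API, while the labels and address are illustrative assumptions about your deployment.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type lokiStream struct {
	Stream map[string]string `json:"stream"` // labels identifying the log stream
	Values [][2]string       `json:"values"` // [timestamp in unix nanoseconds, log line]
}

type lokiPush struct {
	Streams []lokiStream `json:"streams"`
}

func main() {
	payload := lokiPush{Streams: []lokiStream{{
		Stream: map[string]string{"app": "demo", "env": "dev"}, // illustrative labels
		Values: [][2]string{{fmt.Sprintf("%d", time.Now().UnixNano()), "hello from demo"}},
	}}}
	body, _ := json.Marshal(payload)

	// 3100 is Loki's default HTTP port; adjust the address to your deployment.
	resp, err := http.Post("http://localhost:3100/loki/api/v1/push",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("loki responded:", resp.Status)
}
```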


Prometheus has firmly held the leading position in metrics monitoring since it appeared; it is probably the most widely used open-source monitoring and alerting platform today. With the growth of container technology centered on Kubernetes, Prometheus's powerful multi-dimensional data model, efficient data collection, flexible query language, extensibility, and ease of integration, especially its fit with the cloud-native ecosystem, have made it more and more widely used.

Prometheus was officially released in 2015, joined CNCF in 2016, and in 2018 became the second project to graduate from CNCF (the first being Kubernetes, which says something about its influence). Envoy currently supports the statsd protocol over TCP and UDP: Envoy first pushes metrics to statsd, and Prometheus then pulls the metrics from statsd (typically through a statsd exporter) for Grafana to visualize. Alternatively, a Mixer Adapter can be provided to receive and process telemetry data for Prometheus to collect.
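To give a sense of how these metrics are consumed downstream, here is a hedged sketch that queries Prometheus over its HTTP API with the official Go client. The Prometheus address, the metric name istio_requests_total, and the PromQL expression are assumptions that depend on how telemetry is wired up in your mesh.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Address of the Prometheus that scrapes the mesh; adjust to your environment.
	client, err := api.NewClient(api.Config{Address: "http://prometheus.istio-system:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Per-service request rate over the last 5 minutes (illustrative PromQL).
	query := `sum(rate(istio_requests_total[5m])) by (destination_service)`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```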

Some problems may come up in actual use of Prometheus. For example, if a pod is killed and has to be brought back up, Prometheus data is lost, which calls for a high-availability and persistence solution for Prometheus data. In CNCF's sandbox there is a project called Thanos whose core idea is to add a database sharding layer on top of Prometheus. It has two architectural modes, Sidecar and Receiver; the official architecture diagram currently shows the Sidecar mode, while Receiver is a component that has not yet been fully released. The Sidecar mode is relatively mature, more efficient, and easier to scale.

[Figure: Thanos architecture (Sidecar mode)]

Among the Service Mesh solutions, Linkerd and Conduit come with visual interfaces. Istio, by comparison, is a black box and has been criticized for it, but the Istio community, together with Kiali, has launched a visualization solution that provides the following functions:

  • Topology: service topology diagram;
  • Health: visual health check;
  • Metrics: indicator visualization;
  • Tracing: distributed tracing visualization;
  • Validations: Configuration verification;
  • Wizards: routing configuration;
  • Configuration: visualization and editing of CRD resources;


Below is Kiali's architecture. It is clearly a front-end/back-end architecture; the back end obtains metric data from Prometheus or cluster-specific APIs. It also integrates the Jaeger tracing interface and Grafana dashboards, but these are not available out of the box: the third-party components Kiali depends on need to be deployed separately.

[Figure: Kiali architecture]

Summary

In many small and medium-sized companies, Service Mesh is still at the research stage, and many factors have to be weighed before an actual rollout; getting a good ratio of output to investment is something everyone doing technology selection must work through. In fact, whatever the implementation, given the cloud-native philosophy of observability, observability can be addressed at the same time as the rollout, avoiding spending too many resources on things that do not matter. To summarize the three pillars of observability and the support for them in Service Mesh:

  • Metrics: use Prometheus sensibly; persistence and high availability are the key;
  • Tracing: choose the right tracing middleware; integration fit, convergence with Logging, storage, and display are the things to weigh;
  • Logging: be clear about which scenarios use raw logs and which use aggregated logs;

Observation Cloud: realize system observability quickly

At present the cloud-computing market has huge demand for system observability, yet there are very few unified real-time monitoring products that are truly observability-oriented. As the first domestic integrated observability platform, Observation Cloud can realize system observability quickly and meet your monitoring needs across cloud, cloud native, applications, and business.

Observation Cloud is a new generation of integrated data platform, completely different from traditional solutions. It supports full-scenario monitoring, is thoroughly data-driven, and uses digital means to support project teams and ensure system reliability and stability.

Since its debut, Observation Cloud has received wide attention in the industry and a warm market reception, and its number of active users has continued to grow rapidly. In April 2022 it passed the "Advanced" evaluation (the highest level) of the "Observability Platform Technical Capability" assessment by the China Academy of Information and Communications Technology, placing it among the leading brands in the domestic observability SaaS track. On April 28, at the "2022 Observation Cloud Product Launch Conference", Observation Cloud emphasized that it would increase investment in spreading and practicing SRE concepts in China and in promoting and popularizing the new generation of observability technologies. The release of the community edition is one of the important measures for fulfilling these commitments; following the SaaS edition, the self-hosted deployment edition is also open for a free trial.


[Experience Observation Cloud now] https://auth.guance.com/register?channel=toutiao
