In the future, the ChaosBlade community will enhance its original fields, such as enhancing cloud-native field scenarios, and will also add more scenarios in more fields, such as: Golang application chaos experiment scenario NodeJS application chaos experiment scenario In additio

2024/05/14 04:20:34
About the author: Xiao Changjun (nicknamed Qionggu) is a technical expert at Alibaba with many years of experience in application performance monitoring development and in high-availability architecture for distributed systems. He now focuses on chaos engineering, where he has many years of development and practical experience. He is the lead of the open source project ChaosBlade, a developer of the Alibaba Cloud Application High Availability Service (AHAS) product, and a chaos engineering evangelist.


Project background

Alibaba introduced chaos engineering internally to solve microservice dependency problems and to verify the steady state of business services and cloud services. The practice was later extended to business continuity guarantees for public and private clouds and to the verification of cloud-native systems, and we have accumulated a wealth of scenarios and practical experience in stability. At that time, the open source chaos engineering tools suffered from scattered scenario capabilities, a steep learning curve, the lack of a standard experiment model, and difficulty in extending and accumulating scenarios. These problems made it hard to bring the tools together under a single platform. Therefore, we open sourced ChaosBlade, a chaos engineering experiment execution tool, to serve the chaos engineering community and jointly promote the development of the field.

Introduction to the project

The ChaosBlade project is hosted on GitHub under the chaosblade-io organization to facilitate project management and community development. ChaosBlade was designed from the start with ease of use and scenario extension in mind, so that everyone can get started easily and add experimental scenarios according to their own needs. It follows the chaos experiment model to provide a unified, simple execution tool, and the scenario implementations are divided by domain and encapsulated into separate projects to make extension within each domain easier. The scenario domains currently included are as follows:

  • Basic resources: experimental scenarios such as CPU, memory, network, disk, and process.

  • Java applications: such as databases, caches, messaging, the JVM itself, and microservices. You can also specify any class method to inject various complex experimental scenarios.

  • C++ applications: such as specifying any method or a certain line of code to inject delay or tamper with variables and return values.

  • Docker containers: experimental scenarios such as killing a container and in-container CPU, memory, network, disk, and process.

  • Kubernetes platform: experimental scenarios for CPU, memory, network, disk, and process on a node; Pod network scenarios and scenarios on the Pod itself, such as killing a Pod; and container scenarios such as the Docker container scenarios above.

  • Cloud resources: experimental scenarios such as Alibaba Cloud ECS downtime.

Each of the scenario domains above is packaged into its own project. The projects currently included are as follows:

  • chaosblade: the chaos experiment management tool, with commands to create, destroy, and query experiments and to prepare and revoke the experiment environment. It is the execution tool for chaos experiments and supports both CLI and HTTP execution. It provides complete descriptions of commands, experimental scenarios, and scenario parameters, making operation simple and clear.

  • chaosblade-spec-go: the Golang definition of the chaos experiment model. Scenarios that are convenient to implement in Golang can be built on this specification.

  • chaosblade-exec-os: implementation of basic resource experimental scenarios.

  • chaosblade-exec-docker: implementation of Docker container experimental scenarios, standardized by calling the Docker API.

  • chaosblade-operator: implementation of Kubernetes platform experimental scenarios. Chaos experiments are defined as standard Kubernetes CRDs, so experimental scenarios can be created, updated, and deleted through ordinary Kubernetes resource operations, for example with kubectl or client-go. They can also be executed with the chaosblade cli tool mentioned above.

  • chaosblade-exec-jvm: implementation of Java application experimental scenarios. It uses Java Agent technology to attach dynamically, requires no integration work from the application, and supports uninstallation that fully reclaims the resources created by the Agent.

  • chaosblade-exec-cplus: implementation of C++ application experimental scenarios, using GDB to inject scenarios at the method and code-line level.

The above projects all follow the chaos experiment model to define experimental scenarios. This achieves horizontal expansion across scenario domains, and because each domain has its own project that designs and implements scenarios using the standard methods of that domain, vertical expansion within a domain is also convenient.

In addition to the projects related to experimental scenarios, there are also related documentation projects:

  • chaosblade-help-doc: ChaosBlade tool and scenario usage documentation

  • chaosblade-dev-doc: ChaosBlade project development documentation

  • awesome-chaosblade: ChaosBlade related external documents

Experimental model

As mentioned earlier, the ChaosBlade project follows the chaos experiment model. The model simplifies the definition of experimental scenarios, makes scenarios easy to extend, and lets every scenario be invoked uniformly through the chaosblade cli tool, which in turn makes it easier to build an upper-level chaos experiment platform. The model is introduced in detail below through its derivation, its description, its significance, and its concrete applications.

Derivation of the experimental model

Chaos experiments currently consist mainly of fault simulation. We generally describe a fault as follows:

  • Disk A mounted on machine 0.0.0.1 is full, causing the service to be unavailable.

  • Dubbo service B on all nodes executes slowly, which delays calls from the upstream Dubbo service A and results in slow user access.

  • All CPU cores on node B in Kubernetes cluster A are fully loaded, causing abnormal Pod scheduling in cluster A.

  • The network of Pod D in Kubernetes cluster C is abnormal, causing access exceptions for the Services related to D.

From the examples above, a fault can be described with the following sentence structure: because a certain component on a certain machine (or on a cluster resource such as a Node or Pod) failed, certain impacts resulted. The breakdown of a fault description is shown in the following figure:


Existing fault scenarios can all be described through these four parts, so we abstracted a fault scenario model, also called the chaos experiment model.


Introduction to the experimental model

The experiment model is described in detail as follows:

  • Scope: the scope in which the experiment is carried out, that is, the specific machines, clusters, and other resources on which the experiment runs.

  • Target: the experiment target, that is, the component on which the experiment takes place, such as CPU, network, and disk in the basic resource scenarios; application components such as Dubbo, Redis, RocketMQ, and the JVM in the Java scenarios; and the Node, Pod, and Container themselves in the container scenarios.

  • Matcher: the experiment rule matcher. Matching rules are defined according to the configured Target, and multiple rules can be configured. Each Target may have its own matching conditions: for example, Dubbo and gRPC in the RPC domain can be matched by the services that providers expose and the services that consumers call, and Redis in the cache domain can be matched by set and get operations. Matchers can also be extended, for example with an experiment execution strategy that controls when the experiment is triggered.

  • Action: the specific scenario simulated by the experiment. Different targets have different scenarios. For a disk, you can simulate a full disk, high disk IO read and write load, or a disk hardware failure. For an application, you can abstract scenarios such as delays, exceptions, returning specified values (error codes, large objects, etc.), parameter tampering, and repeated calls. For a container service, you can simulate exceptions of the Node, Pod, and Container resources themselves, or basic resource exceptions on them.

This model clearly answers the questions that must be settled when running a chaos experiment:

  • What is the scope of the chaos experiment?

  • What is the target of the chaos experiment?

  • Under what conditions does the target trigger the experiment?

  • Which specific experimental scenario is carried out?
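The four parts of the model can be pictured as a small data structure. The following is a minimal sketch in Python, not the definition used by chaosblade-spec-go; the class names `Experiment` and `Matcher` and the `describe` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Matcher:
    """One matching rule for the target, for example percent=60."""
    name: str
    value: str


@dataclass
class Experiment:
    """Hypothetical container for the four parts of the chaos experiment model."""
    scope: str    # where the experiment runs (machine, node, pod, ...)
    target: str   # component under experiment (cpu, network, disk, ...)
    action: str   # what is simulated (loss, delay, ...)
    matchers: List[Matcher] = field(default_factory=list)

    def describe(self) -> str:
        """Render the experiment as a one-line, human-readable summary."""
        rules = ", ".join(f"{m.name}={m.value}" for m in self.matchers) or "no matchers"
        return f"on {self.scope}: {self.action} {self.target} ({rules})"


exp = Experiment(
    scope="node",
    target="network",
    action="loss",
    matchers=[Matcher("percent", "60"), Matcher("interface", "eth0")],
)
print(exp.describe())
```

Each field answers one of the four questions above, which is why a single structure of this shape can describe scenarios from very different domains.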

The significance of the experimental model

This model has the following characteristics:

  • Simple: clear hierarchy and easy to understand

  • Universal: covering all current fault scenarios, including basic resources, application services, container services, cloud resources, etc.

  • Easy to implement: clearly defined interface specifications make it very convenient to extend experimental scenarios

  • Language- and domain-independent: model implementations can be extended to multiple languages and domains

This model has the following significance:

  • Describe chaos experiment scenarios more accurately

  • Better understand chaos experiment injection

  • Facilitate the accumulation of existing experimental scenarios

  • Explore more scenarios based on the model

  • Make chaos experiment tools more standardized and concise

Application of the experimental model

All projects under ChaosBlade follow this chaos experiment model. Note that the model defines how a chaos experiment scenario is designed, while the concrete implementation of a scenario differs from domain to domain, which is why ChaosBlade encapsulates the implementations into independent projects by domain. Each project is implemented according to the best practices of its domain, so it matches the habits of that domain, and it is tied to the chaosblade cli project through the chaos experiment model so that all scenarios can be invoked uniformly with chaosblade. The experimental scenarios of each domain generate yaml description files based on the chaos experiment model and expose them to the upper-level chaos experiment platform. The platform detects changes in experimental scenarios automatically from changes in these description files, so no platform development is needed when a new scenario is added and the platform can focus on the other parts of chaos engineering. The application of the chaos experiment model is introduced below in three parts: the chaosblade cli design, the chaosblade operator design, and the construction of a chaos experiment platform, each based on the chaos experiment model.

chaosblade cli design based on the chaos experiment model


The chaosblade project itself is built with Golang and is ready to use after decompression. The tool runs as a CLI, which is simple to use and has complete command prompts. Based on the definition of the chaos experiment model in the chaosblade-spec-go project, it parses the yaml descriptions of experimental scenarios implemented according to the model and converts each scenario into command parameters supported by the cobra framework. This parameterizes the variables, standardizes the parameters, and turns each experiment into an object with its own UID for easy management.
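The idea of deriving command parameters from a scenario description can be sketched as follows. This is only an illustration in Python with `argparse`; chaosblade itself is written in Golang on top of cobra, and the layout of `SCENARIO_SPEC` is a hypothetical stand-in for the real yaml description.

```python
import argparse

# Hypothetical scenario description; the yaml layout produced by the
# ChaosBlade projects may differ.
SCENARIO_SPEC = {
    "target": "network",
    "actions": [
        {
            "action": "loss",
            "matchers": [
                {"name": "percent", "required": True},
                {"name": "interface", "required": True},
                {"name": "local-port", "required": False},
            ],
        },
    ],
}


def build_parser(spec: dict) -> argparse.ArgumentParser:
    """Turn a scenario description into `create <target> <action> --flags`."""
    parser = argparse.ArgumentParser(prog="create")
    targets = parser.add_subparsers(dest="target", required=True)
    target_parser = targets.add_parser(spec["target"])
    actions = target_parser.add_subparsers(dest="action", required=True)
    for action in spec["actions"]:
        action_parser = actions.add_parser(action["action"])
        # Every matcher in the description becomes one command-line flag.
        for matcher in action["matchers"]:
            action_parser.add_argument(
                "--" + matcher["name"], required=matcher["required"]
            )
    return parser


parser = build_parser(SCENARIO_SPEC)
args = parser.parse_args(
    ["network", "loss", "--percent", "60", "--interface", "eth0"]
)
print(args.target, args.action, args.percent, args.interface, args.local_port)
```

Because the flags are generated from the description, adding a matcher to the description adds a flag without touching the parsing code, which is the property the paragraph above attributes to the model-driven design.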


The use of chaosblade cli is illustrated below through a specific experimental scenario.


The experiment injects a delay fault into calls to the mk-demo database from one of the provider service instances. The command that injects the delay, shown in the lower left corner of the figure, is concise and clear: it states that the experiment target is mysql and the experiment scenario is delay, followed by the matchers for the database, such as the table, the query type, and the number of times the experiment takes effect. With ChaosBlade it is easy to control the blast radius of an experiment effectively. Executing this command injects the fault into the provider service on this machine. After the fault was injected, a DingTalk alarm was received immediately, as the figure shows. This case therefore matches expectations, but even a case that matches expectations is valuable: the relevant development and operations personnel have to investigate the root cause of the delay and recover from it, which helps improve the efficiency of emergency response to faults.

The Chinese usage documentation for chaosblade: https://chaosblade-io.gitbook.io/chaosblade-help-zh-cn

chaosblade operator design based on the chaos experiment model


The chaosblade-operator project is a chaos experiment injection tool for the Kubernetes platform. It follows the chaos experiment model above to standardize experimental scenarios, defines an experiment as a Kubernetes CRD resource, and maps the four parts of the experiment model to Kubernetes resource attributes. The chaos experiment model combines well with the declarative design of Kubernetes: scenarios can be developed conveniently on top of the model while fitting the Kubernetes design philosophy, and chaos experiments can be created, updated, and deleted by calling the Kubernetes API through kubectl or from code. The resource status clearly represents the execution status of the experiment, which standardizes fault injection on Kubernetes. Besides these methods, the chaosblade cli can also be used to execute Kubernetes experimental scenarios and query the experiment status. In addition, the chaosblade operator, implemented following the chaos experiment model, can reuse the scenarios of basic resources, application services, Docker containers, and other domains, which greatly eases the extension of Kubernetes scenarios. So beyond conforming to the standardized Kubernetes way of implementing scenarios, combining it with the chaos experiment model makes chaos experiment scenarios clearer and more convenient to implement and use.

The following case illustrates the use of chaosblade-operator: simulating 60% network packet loss on access to local port 40690 of the node cn-hangzhou.192.168.0.205.

Use yaml configuration and kubectl to execute the experiment

apiVersion: chaosblade.io/v1alpha1
kind: ChaosBlade
metadata:
  name: loss-node-network-by-names
spec:
  experiments:
  - scope: node
    target: network
    action: loss
    desc: "node network loss"
    matchers:
    - name: names
      value: ["cn-hangzhou.192.168.0.205"]
    - name: percent
      value: ["60"]
    - name: interface
      value: ["eth0"]
    - name: local-port
      value: ["40690"]

Execute the experiment:

kubectl apply -f loss-node-network-by-names.yaml

Query the experiment status and return the following information (spec and other contents are omitted):

kubectl get blade loss-node-network-by-names -o json

{
  "apiVersion": "chaosblade.io/v1alpha1",
  "kind": "ChaosBlade",
  "metadata": {
    "creationTimestamp": "2019-11-04T09:56:36Z",
    "finalizers": [
      "finalizer.chaosblade.io"
    ],
    "generation": 1,
    "name": "loss-node-network-by-names",
    "resourceVersion": "9262302",
    "selfLink": "/apis/chaosblade.io/v1alpha1/chaosblades/loss-node-network-by-names",
    "uid": "63a926dd-fee9-11e9-b3be-00163e136d88"
  },
  "status": {
    "expStatuses": [
      {
        "action": "loss",
        "resStatuses": [
          {
            "id": "057acaa47ae69363",
            "kind": "node",
            "name": "cn-hangzhou.192.168.0.205",
            "nodeName": "cn-hangzhou.192.168.0.205",
            "state": "Success",
            "success": true
          }
        ],
        "target": "network"
      }
    ],
    "phase": "Running"
  }
}
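A script that consumes this output could decide whether the injection took effect by inspecting the status fields. The following Python sketch works on a trimmed copy of the status above; treating the "Running" phase together with per-resource `success` flags as the success criterion is an assumption of this sketch, not a documented contract, and `injection_succeeded` is a hypothetical helper name.

```python
import json

# Trimmed copy of the status returned above.
RAW = """
{
  "status": {
    "expStatuses": [
      {
        "action": "loss",
        "resStatuses": [
          {
            "id": "057acaa47ae69363",
            "kind": "node",
            "name": "cn-hangzhou.192.168.0.205",
            "state": "Success",
            "success": true
          }
        ],
        "target": "network"
      }
    ],
    "phase": "Running"
  }
}
"""


def injection_succeeded(blade: dict) -> bool:
    """True when the phase is Running and every resource reports success."""
    status = blade.get("status", {})
    resources = [
        res
        for exp in status.get("expStatuses", [])
        for res in exp.get("resStatuses", [])
    ]
    return (
        status.get("phase") == "Running"
        and bool(resources)
        and all(res.get("success") is True for res in resources)
    )


blade = json.loads(RAW)
print(injection_succeeded(blade))
```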

From the above content, you can clearly see the running status of the chaos experiment. Execute the following command to stop the experiment:

kubectl delete -f loss-node-network-by-names.yaml

Or delete the blade resource directly:

kubectl delete blade loss-node-network-by-names 

You can also edit the yaml file to change the experiment and apply it again; the chaosblade operator will then carry out the update of the experiment.

Use the blade command of chaosblade cli to execute

blade create k8s node-network loss --percent 60 --interface eth0 --local-port 40690 --kubeconfig config --names cn-hangzhou.192.168.0.205

If the execution fails, detailed error information is returned; if the execution succeeds, the UID of the experiment is returned:

{"code":200,"success":true,"result":"e647064f5f20953c"}

You can query the experiment status through the following command:

blade query k8s create e647064f5f20953c --kubeconfig config
{
  "code": 200,
  "success": true,
  "result": {
    "uid": "e647064f5f20953c",
    "success": true,
    "error": "",
    "statuses": [
      {
        "id": "fa471a6285ec45f5",
        "uid": "e179b30d-df77-11e9-b3be-00163e136d88",
        "name": "cn-hangzhou.192.168.0.205",
        "state": "Success",
        "kind": "node",
        "success": true,
        "nodeName": "cn-hangzhou.192.168.0.205"
      }
    ]
  }
}

Destroy the experiment:

blade destroy e647064f5f20953c
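When the CLI is driven from a script, the UID returned by the create command is what links the later query and destroy steps. Here is a minimal Python sketch that parses the create response shown above and derives the two follow-up commands. The helper name `follow_up_commands` is hypothetical, and the assumption that a failed create reports `"success": false` is made only for illustration.

```python
import json
import shlex


def follow_up_commands(create_output: str, kubeconfig: str = "config") -> dict:
    """Parse the JSON printed by `blade create` and derive follow-up commands."""
    response = json.loads(create_output)
    # Assumption: a failed create carries "success": false in the same envelope.
    if not response.get("success"):
        raise RuntimeError(f"experiment was not created: {response}")
    uid = response["result"]
    return {
        "query": shlex.join(
            ["blade", "query", "k8s", "create", uid, "--kubeconfig", kubeconfig]
        ),
        "destroy": shlex.join(["blade", "destroy", uid]),
    }


commands = follow_up_commands(
    '{"code":200,"success":true,"result":"e647064f5f20953c"}'
)
print(commands["query"])
print(commands["destroy"])
```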

In addition to the two methods above, the experiment can also be executed with the Kubernetes client-go library. For details, refer to the code at: https://github.com/chaosblade-io/chaosblade/blob/master/exec/kubernetes/executor.go

As the introduction above shows, cloud-native experimental scenarios were taken into account from the early design of the ChaosBlade project, and the chaos experiment model combines well with the Kubernetes design philosophy. The project can follow the standardized Kubernetes implementation while also reusing the scenarios of other domains and the chaosblade cli invocation method, so the so-called historical baggage does not exist at all :-).
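The correspondence between the two invocation styles shown in this section can be made concrete. The Python sketch below rewrites the experiment entry from the yaml above as a dict and rebuilds an equivalent blade command from it. The mapping rule (scope and target joined with a hyphen, one flag per matcher) is inferred from the single example in this article and is an assumption, as is the helper name `to_blade_command`.

```python
import shlex

# The experiment entry from the yaml above, written as a Python dict.
experiment = {
    "scope": "node",
    "target": "network",
    "action": "loss",
    "matchers": [
        {"name": "names", "value": ["cn-hangzhou.192.168.0.205"]},
        {"name": "percent", "value": ["60"]},
        {"name": "interface", "value": ["eth0"]},
        {"name": "local-port", "value": ["40690"]},
    ],
}


def to_blade_command(exp: dict, kubeconfig: str = "config") -> str:
    """Build a blade CLI invocation for a Kubernetes experiment entry."""
    args = ["blade", "create", "k8s", f"{exp['scope']}-{exp['target']}", exp["action"]]
    for matcher in exp["matchers"]:
        # Joining multiple values with commas is an assumption of this sketch.
        args += ["--" + matcher["name"], ",".join(matcher["value"])]
    args += ["--kubeconfig", kubeconfig]
    return shlex.join(args)


print(to_blade_command(experiment))
```

The flags come out in matcher order rather than in the order used in the command above, but the two forms carry the same scope, target, action, and matchers.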

Build a chaos experiment platform based on the chaos experiment model

As mentioned earlier, experimental scenarios implemented according to the chaos experiment model can be described through yaml files. The upper-layer experiment platform can detect changes in the experimental scenarios automatically, without further platform development, which decouples the experiment platform from the experimental scenarios and lets everyone focus on developing the chaos experiment platform itself. The AHAS Chaos platform is used below as an example of how to build a chaos experiment platform based on the chaos experiment model and ChaosBlade.


The architecture works as follows:

  • chaosblade merges the yaml files of all scenario domains and provides them to the ChaosBlade SDK.

  • The ChaosBlade SDK detects changes in the yaml files, re-parses the scenario description files, and passes the result through to the upper platform, including changes in scenarios and scenario parameters.

  • The ChaosBlade SDK passes the parameters that the user configured on the platform through to the chaosblade tool and calls it for execution.

  • The chaosblade tool calls different executors based on the invocation parameters and on the parsed yaml scenario description files of each domain.
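The first two steps above, merging the per-domain descriptions and detecting what changed, can be sketched in a few lines of Python. This is a hypothetical illustration and not the ChaosBlade SDK: the description layout, the flag names, and the helpers `merge_specs` and `diff_registry` are all assumptions.

```python
def merge_specs(domain_specs: dict) -> dict:
    """Flatten per-domain scenario descriptions into one registry.

    Keys are "<target> <action>"; values are the sorted flag names.
    """
    registry = {}
    for scenarios in domain_specs.values():
        for scenario in scenarios:
            key = f"{scenario['target']} {scenario['action']}"
            registry[key] = sorted(scenario["flags"])
    return registry


def diff_registry(old: dict, new: dict) -> dict:
    """Report scenarios that were added, removed, or whose flags changed."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Hypothetical descriptions before and after a tool upgrade.
before = merge_specs({
    "os": [{"target": "cpu", "action": "fullload", "flags": ["cpu-percent"]}],
})
after = merge_specs({
    "os": [{"target": "cpu", "action": "fullload", "flags": ["cpu-percent", "cpu-count"]}],
    "jvm": [{"target": "dubbo", "action": "delay", "flags": ["time", "service"]}],
})
print(diff_registry(before, after))
```

A platform built this way only needs to render whatever the registry contains, so a new scenario or parameter shows up without platform code changes.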

Summary

The application of the chaos experiment model can be summarized as the following points:

  • The chaos experiment model parameterizes the variables of experimental scenarios and standardizes the parameters.

  • Following the model enables horizontal expansion across experimental scenario domains.

  • Combining the model with the standardized implementation of each domain conveniently enables vertical expansion of scenarios within the domain.

  • Upper-layer domain scenarios can reuse scenarios defined according to the chaos experiment model.

  • Scenario descriptions declared with the chaos experiment model connect well to the chaosblade cli.

  • Following the experiment model makes it easy to build an upper-level chaos experiment platform.

Project significance

Chaos engineering has been around for many years, and everyone in the community has contributed to improving the field as a whole. The introduction of chaos engineering theory in particular has driven the rapid development of the entire field. We have practiced chaos engineering within Alibaba for many years and know that the road to adopting it is full of challenges. We also know that injecting chaos experiments is only one part of chaos engineering; the thinking, implementation plans, and practical experience behind it are equally important. We simply want to contribute the internal tools we find useful to the community and share the practical experience mentioned above through various channels, so that you can combine the tool with that experience as a starting point for adopting chaos engineering in your own enterprise and jointly advance the field, nothing more. The sections above describe in detail the design and thinking behind the ChaosBlade tool, as well as the advantages of combining the chaos experiment model with standardized implementations in each domain. Everyone interested in high-availability architecture is welcome to join the ChaosBlade community and the chaos engineering community. All in all, ChaosBlade believes that in the open source world, any help is a contribution.

Future plan

The ChaosBlade community will not only enhance the existing domains, for example with more cloud-native scenarios, but also add scenarios in more domains, such as:

  • Golang application chaos experiment scenario

  • NodeJS application chaos experiment scenario

In addition to experimental scenarios, the following are also planned:

  • Provide a chaos experiment platform for everyone to use

  • Improve the development documentation of ChaosBlade projects

  • Improve the English documentation of the chaosblade tool

Everyone is welcome to join and contribute, including but not limited to:

  • bug report

  • feature request

  • performance issue

  • help wanted

  • doc incomplete

  • test missing

  • feature design

  • any question on project

The ChaosBlade project has only just begun. Open source enthusiasts are welcome to share any ideas and questions that come up while using ChaosBlade through issues or pull requests on GitHub.


This article was commissioned by High Availability Architecture. For technical originality and architecture practice articles, you are welcome to submit through the "Contact Us" menu of the official account.
