Preface
Application crashes give users a very poor experience. This article starts from the lowest layers of the iOS system, sorts out the core concepts, and explains how the various types of crash are collected, how OOM (out of memory) terminations are monitored and analyzed, and how the results are presented in the APM system.
According to the abstract system architecture diagram in Apple's documentation, OS X and iOS share the same layered architecture, divided into four levels:
- User experience layer: including Aqua, Dashboard/Spotlight, etc.;
- Application framework layer: Cocoa, Carbon, Java;
- Core frameworks layer: sometimes also called the graphics and media layer; includes the core frameworks, OpenGL, and so on;
- Darwin: includes the kernel and the UNIX shell environment.
Of these four levels, Darwin is completely open source. It is the foundation of the entire system and provides the underlying APIs.
Figure 1 System architecture diagram for OSX and iOS
Figure 1 represents the OS X and iOS system architecture in the abstract. The two systems do differ in some details, which are not covered here. What matters most for this article is the Darwin layer.
Figure 2 Darwin framework diagram Source: "In-depth analysis of Mac OS X & iOS operating system"
Darwin's kernel is XNU, which is also the kernel of OS X itself. As Figure 2 shows, XNU has the following components:
- Mach
- BSD
- LibKern
- I/O Kit
The most important of these are Mach and BSD.
Mach is a microkernel that handles only the most basic responsibilities of an operating system:
- Process and thread abstraction
- Virtual memory management
- Task scheduling
- Inter-process communication and message passing
Mach itself exposes a very limited set of APIs, but they are fundamental: without them nothing else could be implemented. Mach's exception handling is also built on the four capabilities above.
In Mach, exceptions are delivered as messages inside the kernel. The faulting task or thread throws the exception through msg_send(), and a handler catches it through msg_recv(). The handler can handle the exception, clear it (mark it as completed and let execution continue), or terminate the thread.
Mach's exception handlers run in a different context from the faulting thread. The faulting thread sends a message to a pre-designated exception port and then waits for a reply. Each task can register an exception port, which applies to all threads in that task. In addition, a single thread can register its own exception port through thread_set_exception_ports. By default, the exception ports of tasks and threads are NULL, which means the exception is not handled at that level. Once created, exception ports can be handed over to other tasks, and even to other hosts, just like any other port in the system.
When an exception occurs, delivery is attempted in the following order:
- first to the exception port of the thread
- then to the exception port of the task
- finally to the exception port of the host (that is, the default port registered by the host).
If no port returns KERN_SUCCESS, the entire task is terminated. As described above, Mach itself does not provide exception handling logic; it only provides a framework for delivering exception notifications.
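The delivery order above can be modeled in a few lines of plain C. This is only a sketch of the lookup logic, not Mach code: exception_ports_t, deliver_exception, accept_exception, and decline_exception are hypothetical names, and a function pointer stands in for each exception port.

```c
#include <stddef.h>

/* Hypothetical stand-ins for Mach concepts; real code would use <mach/mach.h>. */
typedef enum { HANDLER_SUCCESS = 0, HANDLER_FAILURE = 1 } handler_result_t;
typedef handler_result_t (*exception_handler_t)(int exception_type);

typedef struct {
    exception_handler_t thread_port; /* NULL means no port is registered */
    exception_handler_t task_port;
    exception_handler_t host_port;
} exception_ports_t;

typedef enum {
    HANDLED_BY_THREAD,
    HANDLED_BY_TASK,
    HANDLED_BY_HOST,
    TASK_TERMINATED
} delivery_outcome_t;

/* Sample handlers: one that reports success and one that declines. */
handler_result_t accept_exception(int exception_type)
{
    (void)exception_type;
    return HANDLER_SUCCESS;
}

handler_result_t decline_exception(int exception_type)
{
    (void)exception_type;
    return HANDLER_FAILURE;
}

/* Try the thread port, then the task port, then the host port.
 * The first handler that reports success ends the search. */
delivery_outcome_t deliver_exception(const exception_ports_t *ports, int exception_type)
{
    const exception_handler_t order[3] = {
        ports->thread_port, ports->task_port, ports->host_port
    };
    const delivery_outcome_t outcome[3] = {
        HANDLED_BY_THREAD, HANDLED_BY_TASK, HANDLED_BY_HOST
    };
    for (int i = 0; i < 3; i++) {
        if (order[i] != NULL && order[i](exception_type) == HANDLER_SUCCESS) {
            return outcome[i];
        }
    }
    /* No port returned success, so the whole task is terminated. */
    return TASK_TERMINATED;
}
```

With only a task-level handler registered, deliver_exception reports HANDLED_BY_TASK. With no handlers at all it reports TASK_TERMINATED, which mirrors the rule that the task dies when no port returns KERN_SUCCESS.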
The BSD layer is built on top of Mach. It offers a reliable, more modern API with POSIX compatibility and higher-level abstractions, including but not limited to:
- UNIX process model
- POSIX thread model (pthread) and related synchronization primitives
- UNIX users and groups
- Network protocol stack
- File system access
- Device access
For exception handling, Mach provides the low-level trap handling through its exception mechanism, and BSD builds a signal mechanism on top of it. Signals caused by the hardware are captured by the Mach layer and converted into the corresponding UNIX signals. To keep the mechanism uniform, signals generated by the operating system and by users are first converted to Mach exceptions and then converted to signals.
When the BSD process is started by the bsdinit_task() function, the ux_handle_init() function is also called, which sets up a Mach kernel thread named ux_handle. bsdinit_task() can register ux_exception_port only after ux_handle_init() returns. bsdinit_task() redirects all Mach exception messages to ux_exception_port, which is held by the ux_handle thread. Following Mach's exception delivery rules, exceptions in the process with PID 1 are therefore handled outside the process by the ux_handle() thread. Because all user-mode processes created later are descendants of PID 1, they automatically inherit this exception port, so the ux_handle() thread is effectively responsible for every Mach exception raised by UNIX processes on the system. The ux_handle() function is very simple: on entry it sets ux_handle_port and then enters an endless Mach message loop. The loop receives Mach exception messages and calls mach_exc_server() to handle them. The overall flow is shown below:
Figure 3 Mach exception handling and the process of converting to UNIX signals
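To make the conversion step concrete, here is a simplified sketch of how a Mach exception type might be translated into a UNIX signal number. It follows the mapping commonly described for XNU and is not the kernel's code. The SKETCH_ constants are local copies of the values in the Mach headers so that the example builds on any POSIX system, and signal_for_mach_exception is a hypothetical name.

```c
#include <signal.h>

/* Local copies of Mach constants; the numeric values mirror
 * <mach/exception_types.h> and <mach/kern_return.h>. */
enum {
    SKETCH_EXC_BAD_ACCESS      = 1,
    SKETCH_EXC_BAD_INSTRUCTION = 2,
    SKETCH_EXC_ARITHMETIC      = 3,
    SKETCH_EXC_BREAKPOINT      = 6
};
enum { SKETCH_KERN_INVALID_ADDRESS = 1 };

/* Returns the UNIX signal for a Mach exception, or 0 when this
 * simplified table has no corresponding signal. */
int signal_for_mach_exception(int exception, long code)
{
    switch (exception) {
    case SKETCH_EXC_BAD_ACCESS:
        /* An unmapped address becomes SIGSEGV, other access faults SIGBUS. */
        return (code == SKETCH_KERN_INVALID_ADDRESS) ? SIGSEGV : SIGBUS;
    case SKETCH_EXC_BAD_INSTRUCTION:
        return SIGILL;
    case SKETCH_EXC_ARITHMETIC:
        return SIGFPE;
    case SKETCH_EXC_BREAKPOINT:
        return SIGTRAP;
    default:
        return 0;
    }
}
```

The default branch is the interesting part for crash collection: an exception type without an entry in the table never shows up as a signal, so a collector that listens only for signals would not see it.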
2. Crash collection method
To understand crashes, we first need several basic concepts and how they relate to each other:
- Software exception: comes mainly from calls to the two APIs kill() and pthread_kill(). The uncaught NSException and abort() calls often seen on iOS both fall into this category.
- Hardware exception: starts with a processor trap, for example a crash caused by accessing a wild pointer.
- Mach exception: handled through the Mach exception handling flow described above.
- UNIX signal: for example SIGBUS, SIGSEGV, SIGABRT, and SIGKILL.
......
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Triggered by Thread: 0
Last Exception Backtrace:
...
This is an app crash log. From the line Exception Type: EXC_CRASH (SIGABRT) we can tell that an EXC_CRASH exception occurred in the Mach layer and was converted into the SIGABRT signal. This raises a question: since exceptions can be caught both in the Mach layer and by registering UNIX signal handlers, how should we choose between the two? Moreover, Figure 3 shows that Mach exceptions are eventually converted into UNIX signals, so is it enough to intercept only the UNIX signals?
It is not, for two reasons:
- Not every Mach exception type has a corresponding UNIX signal to map to.
- UNIX signal handlers are called back on the crashing thread. If the crash is a stack overflow, there is no stack space left to run the callback.
Is it then enough to intercept only Mach exceptions? Again the answer is no, because user-mode software exceptions go directly through the signal path. If signals are not intercepted, this class of crash is lost.
Therefore, a monitoring system needs several exception handling capabilities to collect crashes. Many such tools exist, and KSCrash is one of the most popular and most complete crash collection tools at present. Most of its source code is written in C. WeChat's open source project Matrix is also built on KSCrash, and the iOS crash monitoring in our APM system is based on it as well.
2.1 Mach layer exception handling
Reading the source code is the fastest way to understand a tool. Here is how KSCrash handles Mach layer exceptions (KSCrashMonitor_MachException.c):
static bool installExceptionHandler()
{
    ......
    // Get the current task
    const task_t thisTask = mach_task_self();
    exception_mask_t mask = EXC_MASK_BAD_ACCESS |
                            EXC_MASK_BAD_INSTRUCTION |
                            EXC_MASK_ARITHMETIC |
                            EXC_MASK_SOFTWARE |
                            EXC_MASK_BREAKPOINT;
    // Back up the current task's exception ports in g_previousExceptionPorts
    kr = task_get_exception_ports(thisTask,
                                  mask,
                                  g_previousExceptionPorts.masks,
                                  &g_previousExceptionPorts.count,
                                  g_previousExceptionPorts.ports,
                                  g_previousExceptionPorts.behaviors,
                                  g_previousExceptionPorts.flavors);
    if(kr != KERN_SUCCESS)
    {
        KSLOG_ERROR("task_get_exception_ports: %s", mach_error_string(kr));
        goto failed;
    }
    if(g_exceptionPort == MACH_PORT_NULL)
    {
        KSLOG_DEBUG("Allocating new port with receive rights.");
        // Allocate an exception port
        kr = mach_port_allocate(thisTask, MACH_PORT_RIGHT_RECEIVE, &g_exceptionPort);
        if(kr != KERN_SUCCESS)
        {
            KSLOG_ERROR("mach_port_allocate: %s", mach_error_string(kr));
            goto failed;
        }
        KSLOG_DEBUG("Adding send rights to port.");
        // Add a send right to the port
        kr = mach_port_insert_right(thisTask, g_exceptionPort, g_exceptionPort, MACH_MSG_TYPE_MAKE_SEND);
        if(kr != KERN_SUCCESS)
        {
            KSLOG_ERROR("mach_port_insert_right: %s", mach_error_string(kr));
            goto failed;
        }
    }
    ...
    // Create the thread that waits for and handles exceptions
    error = pthread_create(&g_secondaryPThread, &attr, &handleExceptions, kThreadSecondary);
    ......
}
The source code can be summarized in the following flow chart:
Section 1.2 described how the Mach layer handles exceptions, and capturing exceptions at the Mach layer follows the same process. The idea is to allocate an exception port, grant the required rights on it, install it as the task's exception port, create a new thread, and wait for exceptions in a loop on that thread. When an exception arrives, the crashing thread stays suspended while the crash information is assembled into a JSON file. To avoid preempting exception ports registered by other SDKs or by the developer's own logic, the previously registered exception ports must be saved first, and once the collection logic completes, exception handling is handed over to the handlers behind those ports.
2.2 Signal Exception Handling
Signal capture lives in the installSignalHandler function in KSCrashMonitor_Signal.c. It first allocates a block of memory on the heap and registers it as the signal stack with sigaltstack. The purpose is to give the signal handler a stack of its own: a process may have many threads, each with its own stack, and a fault on any one of them brings down the whole process. If the faulting thread's stack is exhausted, the handler cannot run on it, so a separate region must be set aside for the handler to run normally.
Next, the sigaction structure for the handler is prepared. The code then iterates over the array of fatal signals, installs the handler for each one with sigaction, and saves each signal's previous handler in g_previousSignalHandlers. While a signal is being handled, the thread's context information is saved.
Finally, once KSCrash has finished handling the signal, the previous signal handlers are restored.
The core code is as follows:
static bool installSignalHandler()
{
    KSLOG_DEBUG("Installing signal handler.");
#if KSCRASH_HAS_SIGNAL_STACK
    // Allocate a block of memory on the heap
    if(g_signalStack.ss_size == 0)
    {
        KSLOG_DEBUG("Allocating signal stack area.");
        g_signalStack.ss_size = SIGSTKSZ;
        g_signalStack.ss_sp = malloc(g_signalStack.ss_size);
    }
    KSLOG_DEBUG("Setting signal stack area.");
    // Move the signal handler's stack to the heap so that it does not share the crashing thread's stack
    if(sigaltstack(&g_signalStack, NULL) != 0)
    {
        KSLOG_ERROR("signalstack: %s", strerror(errno));
        goto failed;
    }
#endif
    const int* fatalSignals = kssignal_fatalSignals();
    int fatalSignalsCount = kssignal_numFatalSignals();
    if(g_previousSignalHandlers == NULL)
    {
        KSLOG_DEBUG("Allocating memory to store previous signal handlers.");
        g_previousSignalHandlers = malloc(sizeof(*g_previousSignalHandlers)
                                          * (unsigned)fatalSignalsCount);
    }
    // Prepare the second argument of sigaction(), a struct sigaction
    struct sigaction action = {{0}};
    action.sa_flags = SA_SIGINFO | SA_ONSTACK;
#if KSCRASH_HOST_APPLE && defined(__LP64__)
    action.sa_flags |= SA_64REGSET;
#endif
    sigemptyset(&action.sa_mask);
    action.sa_sigaction = &handleSignal;
    for(int i = 0; i < fatalSignalsCount; i++)
    {
        KSLOG_DEBUG("Assigning handler for signal %d", fatalSignals[i]);
        // Install the action declared above for each signal and save the signal's previous handler in g_previousSignalHandlers
        if(sigaction(fatalSignals[i], &action, &g_previousSignalHandlers[i]) != 0)
        {
            char sigNameBuff[30];
            const char* sigName = kssignal_signalName(fatalSignals[i]);
            if(sigName == NULL)
            {
                snprintf(sigNameBuff, sizeof(sigNameBuff), "%d", fatalSignals[i]);
                sigName = sigNameBuff;
            }
            KSLOG_ERROR("sigaction (%s): %s", sigName, strerror(errno));
            // Try to reverse the damage
            for(i--; i >= 0; i--)
            {
                sigaction(fatalSignals[i], &g_previousSignalHandlers[i], NULL);
            }
            goto failed;
        }
    }
    KSLOG_DEBUG("Signal handlers installed.");
    return true;
failed:
    KSLOG_DEBUG("Failed to install signal handlers.");
    return false;
}
2.3 C++ Exception Handling
C++ exception handling relies on the standard library function std::set_terminate(CPPExceptionTerminate).
On iOS, if a C or C++ exception can be converted into an NSException, it goes through Objective-C exception handling. If it cannot, default_terminate_handler runs. The default_terminate_handler for C++ exceptions calls abort_message, and the system then generates a SIGABRT signal.
static void CPPExceptionTerminate(void)
{
    ......
    // Exception filter: a type that inherits from NSException would otherwise be treated as a C++ exception
    // if (name == NULL || strcmp(name, "NSException") != 0)
    if (g_capturedStackCursor && (name == NULL || strcmp(name, "NSException") != 0))
    {
        kscm_notifyFatalExceptionCaptured(false);
        KSCrash_MonitorContext* crashContext = &g_monitorContext;
        memset(crashContext, 0, sizeof(*crashContext));
        char descriptionBuff[DESCRIPTION_BUFFER_LENGTH];
        const char* description = descriptionBuff;
        descriptionBuff[0] = 0;
        KSLOG_DEBUG("Discovering what kind of exception was thrown.");
        g_captureNextStackTrace = false;
        try
        {
            throw;
        }
        catch(std::exception& exc)
        {
            strncpy(descriptionBuff, exc.what(), sizeof(descriptionBuff));
        }
        ......
    }
}
Handling NSException at the Objective-C level is relatively easy. Register an NSUncaughtExceptionHandler to capture the exception, collect the crash information from the NSException parameter, and hand it over to the data reporting component. For example:
KSCrash.sharedInstance.uncaughtExceptionHandler = &handleException;
OOM-related concepts
OOM is short for out of memory. It refers to the operating system forcibly terminating the current application on an iOS device because it uses too much memory. To the user the app simply vanishes, which looks no different from an ordinary crash. When we hit this kind of crash during debugging, however, no ordinary crash log appears under Analytics & Improvements on the device. Instead there are logs whose names start with Jetsam, which the system generates after an OOM termination specifically to reflect memory problems.
Depending on the running state of the program, OOM is generally divided into two types:
- Foreground Out Of Memory: the OOM crash occurs while the application is running in the foreground.
- Background Out Of Memory: the OOM crash occurs while the application is in the background.
Jetsam
Jetsam is the resource management mechanism that iOS uses to keep memory usage under control. Unlike desktop operating systems such as macOS, Linux, and Windows, iOS has no memory swap space, a decision made for performance reasons. As a result, when the device as a whole runs short of memory, the system can only terminate processes that have low priority or use too much memory. An excerpt from a Jetsam log follows:
{
  "uuid" : "a02fb850-9725-4051-817a-8a5dc0950872",
  "states" : [
    "frontmost"                      // Application state: running in the foreground
  ],
  "lifetimeMax" : 92802,
  "purgeable" : 0,
  "coalition" : 68,
  "rpages" : 92802,                  // Memory pages in use
  "reason" : "per-process-limit",    // Termination reason: the per-process limit was exceeded
  "name" : "MyCoolApp"
}
For a detailed description, refer to the official documentation.
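The rpages field counts memory pages, so it has to be multiplied by the page size to obtain bytes. The helper below is a small sketch of that arithmetic. The function names are hypothetical, and the 16384-byte page size used in the note after the code is an assumption that should be checked against the page size recorded in the actual report.

```c
#include <stdint.h>

/* Converts a page count from a Jetsam report into bytes.
 * page_size_bytes must come from the report or the device. */
uint64_t jetsam_pages_to_bytes(uint64_t pages, uint64_t page_size_bytes)
{
    return pages * page_size_bytes;
}

/* Converts bytes to mebibytes for easier reading. */
double bytes_to_mib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0);
}
```

Assuming a 16 KB page, the 92802 pages in the excerpt above correspond to 1,520,467,968 bytes, or roughly 1450 MiB, at the moment the per-process limit was hit.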
The Jetsam cleanup strategy covers two situations:
- A single app process exceeds its memory limit.
- The device's overall physical memory is tight, in which case processes are cleaned up by priority:
  - Background applications before foreground applications
  - Applications with a high memory footprint before applications with a low memory footprint
  - User applications before system applications
Function introduction & principle
OOM warning
The OOM warning feature reports memory status information to the APM platform when memory usage reaches a predefined threshold. The flowchart is as follows:
The system kernel provides a structure that represents memory information.
The task_info function returns the memory usage of a task:
kern_return_t task_info(
    task_name_t target_task,
    task_flavor_t flavor,
    task_info_t task_info_out,
    mach_msg_type_number_t *task_info_outCnt
);
The code that reads the memory size is as follows:
int64_t memoryUsageInByte = 0;
task_vm_info_data_t vmInfo;
mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
// Query the VM information of the current task
kern_return_t kernelReturn = task_info(mach_task_self(), TASK_VM_INFO, (task_info_t) &vmInfo, &count);
if(kernelReturn == KERN_SUCCESS) {
    // phys_footprint is the physical memory footprint of the task, in bytes
    memoryUsageInByte = (int64_t) vmInfo.phys_footprint;
}
The logic for handling the memory ceiling is as follows. When the configured threshold is exceeded, the memory information at that moment is reported.
-(void)saveLastSingleLoginMaxMemory{
    if(_hasUpoad){
        NSString* currentMemory = [NSString stringWithFormat:@"%f", _singleLoginMaxMemory];
        NSString* overflowMemoryLimit = [NSString stringWithFormat:@"%f", overflow_limit];
        // Record the time of the first sample that exceeds the threshold
        if(_singleLoginMaxMemory > overflow_limit){
            static BOOL isFirst = YES;
            if(isFirst){
                _firstOOMTime = [[NSDate date] timeIntervalSince1970];
                isFirst = NO;
            }
        }
        NSDictionary *minidumpdata = [NSDictionary dictionaryWithObjectsAndKeys:currentMemory,@"singleMemory",overflowMemoryLimit,@"threshold",[NSString stringWithFormat: @"%.2lf", _firstOOMTime],@"LaunchTime",nil];
        NSString *fileDir = [self singleLoginMaxMemoryDir];
        if (![[NSFileManager defaultManager] fileExistsAtPath:fileDir]) {
            [[NSFileManager defaultManager] createDirectoryAtPath:fileDir withIntermediateDirectories:YES attributes:nil error:nil];
        }
        NSString *filePath = [fileDir stringByAppendingString:@"/apmLastMaxMemory.plist"];
        if(minidumpdata != nil){
            // Replace any previous record with the latest one
            if([[NSFileManager defaultManager] fileExistsAtPath:filePath]){
                [[NSFileManager defaultManager] removeItemAtPath:filePath error:nil];
            }
            [minidumpdata writeToFile:filePath atomically:YES];
        }
    }
}
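The threshold check itself is independent of how the footprint is measured, so it can be sketched in portable C. This is an illustration only: memory_watch_t and memory_watch_update are hypothetical names, and the caller is assumed to pass in a footprint obtained elsewhere, for example from phys_footprint.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t threshold_bytes; /* hypothetical warning threshold */
    int64_t peak_bytes;      /* highest footprint observed so far */
    bool    warned;          /* true once the threshold has been reported */
} memory_watch_t;

/* Feeds one footprint sample into the watcher. Returns true only the
 * first time a sample exceeds the threshold, which is the moment a
 * warning should be reported. The peak is tracked on every call. */
bool memory_watch_update(memory_watch_t *watch, int64_t footprint_bytes)
{
    if (footprint_bytes > watch->peak_bytes) {
        watch->peak_bytes = footprint_bytes;
    }
    if (!watch->warned && footprint_bytes > watch->threshold_bytes) {
        watch->warned = true;
        return true;
    }
    return false;
}
```

Reporting only on the first crossing plays the same role as the isFirst flag in the listing above: repeated samples above the threshold do not produce repeated warnings, while the peak value keeps being updated for the final report.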
Simulating a memory ceiling produces the following log records:
OOM monitoring
OOM monitoring records the stack information at the moment the app is terminated because of OOM and reports it to the APM platform for later analysis.
The Jetsam mechanism terminates the process by sending a SIGKILL signal, which the current process cannot catch, so the conventional crash capture approach of monitoring exception signals does not work. How can OOM be monitored, then? In 2015 Facebook proposed an approach based on elimination.
Each time the app starts, it determines why the previous run ended. The known reasons are:
- The app was updated
- The app crashed
- The user quit the app manually
- The operating system was updated
- The app process was terminated after moving to the background
If the previous run ended for none of these reasons, an OOM crash is judged to have occurred during that run.
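The elimination logic can be written down as a small classifier. The sketch below is one possible arrangement under stated assumptions: last_launch_record_t and its fields are hypothetical, the record is assumed to have been persisted during the previous run, the version strings are assumed to be non-NULL, and the order of the checks is a design choice rather than something the method prescribes.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record persisted during the previous run. */
typedef struct {
    bool crashed;               /* a crash report was written */
    bool user_quit;             /* the user quit the app manually */
    bool in_background_at_exit; /* the app was in the background when it ended */
    const char *app_version;
    const char *os_version;
} last_launch_record_t;

typedef enum {
    CAUSE_APP_UPDATED,
    CAUSE_OS_UPDATED,
    CAUSE_CRASHED,
    CAUSE_USER_QUIT,
    CAUSE_BACKGROUND_TERMINATION,
    CAUSE_FOREGROUND_OOM
} termination_cause_t;

/* Rules out every known cause in turn; whatever is left is treated
 * as a foreground OOM. */
termination_cause_t classify_last_termination(const last_launch_record_t *last,
                                              const char *current_app_version,
                                              const char *current_os_version)
{
    if (strcmp(last->app_version, current_app_version) != 0) {
        return CAUSE_APP_UPDATED;
    }
    if (strcmp(last->os_version, current_os_version) != 0) {
        return CAUSE_OS_UPDATED;
    }
    if (last->crashed) {
        return CAUSE_CRASHED;
    }
    if (last->user_quit) {
        return CAUSE_USER_QUIT;
    }
    if (last->in_background_at_exit) {
        return CAUSE_BACKGROUND_TERMINATION;
    }
    return CAUSE_FOREGROUND_OOM;
}
```

Because the result is reached by exclusion, its accuracy depends entirely on how reliably the other causes are recorded. Any cause that is missing from the record is misreported as an OOM.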
The core logic is as follows:
-(NSDictionary *)parseFoomData:(NSDictionary *)foomDict
{
    ......
    if(appState == APPENTERFORGROUND){
        BOOL isExit = [[foomDict objectForKey:@"isExit"] boolValue];
        BOOL isDeadLock = [[foomDict objectForKey:@"isDeadLock"] boolValue];
        NSString *lastSysVersion = [foomDict objectForKey:@"systemVersion"];
        NSString *lastAppVersion = [foomDict objectForKey:@"appVersion"];
        // No crash, no manual exit, no system or app update: only a deadlock or an OOM remains
        if(!isCrashed && !isExit && [_systemVersion isEqualToString:lastSysVersion] && [_appVersion isEqualToString:lastAppVersion]){
            if(isDeadLock){
                OOM_Log("The app occurred lastTime,detail info:%s",[[foomDict description] UTF8String]);
                [result setObject:@deadlock_crash forKey:@"crash_type"];
                NSDictionary *stack = [foomDict objectForKey:@"deadlockStack"];
                if(stack && stack.count > 0){
                    [result setObject:stack forKey:@"stack_deadlock"];
                    OOM_Log("The app deadlock stack:%s",[[stack description] UTF8String]);
                }
            }
            else {
                OOM_Log("The app occurred lastTime,detail info:%s",[[foomDict description] UTF8String]);
                [result setObject:@foom_crash forKey:@"crash_type"];
                NSString *uuid = [foomDict objectForKey:@"uuid"];
                NSArray *oomStack = [[OOMDector getInstance] getOOMDataByUUID:uuid];
                if(oomStack && oomStack.count > 0)
                {
                    NSData *oomData = [NSJSONSerialization dataWithJSONObject:oomStack options:0 error:nil];
                    if(oomData.length > 0){
                        // NSString *stackStr = [NSString stringWithUTF8String:(const char *)oomData.bytes];
                        OOM_Log("The app foom stack:%s",[[oomStack description] UTF8String]);
                    }
                    [result setObject:[self getAPMOOMStack:oomStack] forKey:@"stack_oom"];
                }
            }
            return result;
        }
    }
    ......
}
Memory portrait
A memory portrait takes a snapshot of memory when the program reaches its memory peak, exports the reference relationships between memory nodes, and helps find out why memory usage is so large. There are two things to do:
1. Obtaining the memory nodes
Memory nodes are obtained by scanning all VM Regions in the process with the Mach kernel's vm_region_recurse/vm_region_recurse_64 function; detailed information about each region is returned in the vm_region_submap_info_64 structure.
2. Analyzing the reference relationships between nodes
There are two cases here. For the VM Regions that hold the heap maintained by libmalloc, the OC objects, C/C++ objects, buffers, and so on that they contain can yield detailed reference relationships and need to be processed separately. VM Regions that are not maintained by libmalloc are treated as separate memory nodes, for which only the start address, the Dirty and Swapped memory sizes, and the reference relationships with other nodes are recorded.
The core code for obtaining the nodes is as follows:
void VMRegionCollect::startCollet() {
    ......
    while (1) {
        struct vm_region_submap_info_64 info;
        mach_msg_type_number_t count = VM_REGION_SUBMAP_INFO_COUNT_64;
        krc = vm_region_recurse_64(mach_task_self(), &address, &size, &depth, (vm_region_info_64_t)&info, &count);
        // No region at or above this address: the scan is complete
        if (krc == KERN_INVALID_ADDRESS) {
            break;
        }
        if (info.is_submap) {
            // Descend into the submap instead of advancing the address
            depth++;
        } else {
            // A leaf region: record it, then move on to the next address
            proc_regionfilename(pid, address, buf, sizeof(buf));
            printf("Found VM Region: %08x to %08x (depth=%d) user_tag:%s name:%s\n", (uint32_t)address, (uint32_t)(address+size), depth, [visualMemoryTypeString(info.user_tag) cStringUsingEncoding:NSUTF8StringEncoding], buf);
            address += size;
        }
    }
}
An example of the data obtained by scanning the nodes is as follows:
The core code for the reference relationships between heap memory nodes is as follows:
It matches the addresses held in class member variables against the isa pointer addresses of the referenced classes, in order to find out whether a reference relationship exists.
static void range_callback(task_t task, void *context, unsigned type, vm_range_t *ranges, unsigned rangeCount) {
    if (!context) {
        return;
    }
    for (unsigned int i = 0; i < rangeCount; i++) {
        vm_range_t range = ranges[i];
        flex_maybe_object_t *tryObject = (flex_maybe_object_t *)range.address;
        Class tryClass = NULL;
#ifdef __arm64__
        // See http://www.sealiesoftware.com/blog/archive/2013/09/24/objc_explain_Non-pointer_isa.html
        extern uint64_t objc_debug_isa_class_mask WEAK_IMPORT_ATTRIBUTE;
        tryClass = (__bridge Class)((void *)((uint64_t)tryObject->isa & objc_debug_isa_class_mask));
#else
        tryClass = tryObject->isa;
#endif
        // If the class pointer matches one in our set of class pointers from the runtime, then we should have an object.
        if (CFSetContainsValue(registeredClasses, (__bridge const void *)(tryClass))) {
            (*(object_enumeration_block_t __unsafe_unretained *)context)((__bridge id)tryObject, tryClass);
        }
    }
}

// The "remote" task is the current process, so its memory can be read in place.
static kern_return_t reader(__unused task_t remote_task, vm_address_t remote_address, __unused vm_size_t size, void **local_memory) {
    *local_memory = (void *)remote_address;
    return KERN_SUCCESS;
}

+ (void)enumerateLiveObjectsUsingBlock:(object_enumeration_block_t)block {
    if (!block) {
        return;
    }
    [self updateRegisteredClasses];
    vm_address_t *zones = NULL;
    unsigned int zoneCount = 0;
    kern_return_t result = malloc_get_all_zones(TASK_NULL, reader, &zones, &zoneCount);
    if (result == KERN_SUCCESS) {
        for (unsigned int i = 0; i < zoneCount; i++) {
            malloc_zone_t *zone = (malloc_zone_t *)zones[i];
            malloc_introspection_t *introspection = zone->introspection;
            if (!introspection) {
                continue;
            }
            void (*lock_zone)(malloc_zone_t *zone) = introspection->force_lock;
            void (*unlock_zone)(malloc_zone_t *zone) = introspection->force_unlock;
            // The callback unlocks the zone so that the block is free to allocate memory
            object_enumeration_block_t callback = ^(__unsafe_unretained id object, __unsafe_unretained Class actualClass) {
                unlock_zone(zone);
                block(object, actualClass);
                lock_zone(zone);
            };
            BOOL lockZoneValid = PointerIsReadable(lock_zone);
            BOOL unlockZoneValid = PointerIsReadable(unlock_zone);
            if (introspection->enumerator && lockZoneValid && unlockZoneValid) {
                lock_zone(zone);
                introspection->enumerator(TASK_NULL, (void *)&callback, MALLOC_PTR_IN_USE_RANGE_TYPE, (vm_address_t)zone, reader, &range_callback);
                unlock_zone(zone);
            }
        }
    }
}
An example of the reference relationship data obtained is as follows:
The reference relationships are output in reverse order, so the result looks like a series of steps.
Taking part of the data and analyzing the reference relationships with a tool gives the result shown in the figure:
In this way, the reference relationships between heap memory nodes and the memory they occupy can be seen clearly.
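The idea of deriving references by matching addresses can be illustrated with a simplified scan in plain C. This is only a sketch under stated assumptions, not the APM implementation: `heap_node_t`, `find_node`, and `collect_references` are hypothetical names, each node is assumed to be a readable block in the current process, and a reference is counted only when a pointer-sized word equals the start address of another node (the isa and class checks described above are left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one heap node (one live allocation). */
typedef struct {
    uintptr_t start;  /* start address of the allocation */
    size_t    size;   /* size of the allocation in bytes */
} heap_node_t;

/* Returns the index of the node that starts exactly at addr, or -1. */
static int find_node(const heap_node_t *nodes, size_t count, uintptr_t addr)
{
    for (size_t i = 0; i < count; i++) {
        if (nodes[i].start == addr) {
            return (int)i;
        }
    }
    return -1;
}

/* Reads node `from` one pointer-sized word at a time and stores the indices
 * of the other nodes those words point to. Returns how many indices were
 * stored in `out` (at most max_out). */
size_t collect_references(const heap_node_t *nodes, size_t count,
                          size_t from, int *out, size_t max_out)
{
    assert(nodes != NULL && out != NULL && from < count);
    size_t found = 0;
    const unsigned char *base = (const unsigned char *)nodes[from].start;
    for (size_t off = 0; off + sizeof(uintptr_t) <= nodes[from].size;
         off += sizeof(uintptr_t)) {
        uintptr_t candidate;
        memcpy(&candidate, base + off, sizeof(candidate));
        int target = find_node(nodes, count, candidate);
        if (target >= 0 && (size_t)target != from && found < max_out) {
            out[found++] = target;
        }
    }
    return found;
}
```

A scan of this kind is conservative: an integer that happens to equal a node address is reported as a reference, which is one reason a real implementation would add checks such as the class matching shown earlier.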
Summary
The above is an introduction to the OOM functions in the APM system, which mainly consist of three functional points:
- OOM warning records when an online App exceeds the memory threshold, so that the risk of an OOM crash can be identified.
- OOM monitoring records the scene promptly when an OOM occurs, providing clues for developers' subsequent troubleshooting.
- Memory portrait exports the reference relationships when an OOM occurs and records information such as node sizes, making it more intuitive to find where memory usage is large.
4. Crash log
4.1 APM reports Crash log process
After a project integrates APM, Crash monitoring is turned on by default when the SDK is initialized. When a Crash occurs, the following steps are performed:
- After KSCrash collects the crash log, the APM's crashCallBack function is executed.
- In the crashCallBack function, the log is written to APMLog and cached.
- On the next launch, after APM is initialized successfully, the Crash file is reported to the server according to the reporting process.
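The cache-then-report steps above can be sketched in portable C. Everything here is an assumption made for illustration: `cache_crash_log`, `report_cached_crash_log`, `upload_fn`, and `print_uploader` are hypothetical names, the log is assumed to fit in a 4 KB buffer, and the real APMLog format and upload protocol are not shown in this article.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical uploader; returns true once the server has accepted the log. */
typedef bool (*upload_fn)(const char *log, size_t length);

/* Stand-in uploader for this sketch: prints the log instead of sending it. */
bool print_uploader(const char *log, size_t length)
{
    return fwrite(log, 1, length, stdout) == length;
}

/* Called from the crash callback. It is limited to open/write/close
 * because the process is already in a crashed state. */
bool cache_crash_log(const char *path, const char *log, size_t length)
{
    if (path == NULL || log == NULL) {
        return false;
    }
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return false;
    }
    ssize_t written = write(fd, log, length);
    close(fd);
    return written == (ssize_t)length;
}

/* Called on the next launch, after initialization has succeeded. */
bool report_cached_crash_log(const char *path, upload_fn upload)
{
    assert(path != NULL && upload != NULL);
    char buffer[4096];              /* sketch only: longer logs are truncated */
    FILE *file = fopen(path, "rb");
    if (file == NULL) {
        return false;               /* no cached log to report */
    }
    size_t length = fread(buffer, 1, sizeof(buffer), file);
    fclose(file);
    if (!upload(buffer, length)) {
        return false;               /* keep the file so a later launch can retry */
    }
    return remove(path) == 0;
}
```

Deleting the cached file only after the uploader reports success means a failed upload is retried on a later launch rather than lost.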
4.2 Log parsing
The APM iOS SDK only needs to report the log successfully; the server then performs symbolication based on information in the Crash, such as the version number, binaryImage, and UUID. Once symbolication succeeds, the corresponding log can be viewed in the management console.
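The server-side implementation is not shown in this article, but a first step that symbolication generally needs is to map each stack address to the binary image that contains it and to that image's offset, which can then be looked up in the symbol file whose UUID matches. The sketch below illustrates only that step; `binary_image_t`, `image_for_address`, and `image_offset` are hypothetical names, not part of the APM server.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry from the binary images section of a crash log. */
typedef struct {
    const char *name;          /* image name, e.g. the App binary */
    const char *uuid;          /* build UUID used to pick the symbol file */
    uint64_t    load_address;  /* where the image was loaded in memory */
    uint64_t    size;          /* size of the image in bytes */
} binary_image_t;

/* Returns the image whose address range contains `address`, or NULL. */
const binary_image_t *image_for_address(const binary_image_t *images,
                                        size_t count, uint64_t address)
{
    assert(images != NULL);
    for (size_t i = 0; i < count; i++) {
        if (address >= images[i].load_address &&
            address - images[i].load_address < images[i].size) {
            return &images[i];
        }
    }
    return NULL;
}

/* Offset of `address` inside its image: the value to look up in symbols. */
uint64_t image_offset(const binary_image_t *image, uint64_t address)
{
    assert(image != NULL && address >= image->load_address);
    return address - image->load_address;
}
```

Working with offsets rather than absolute addresses is what allows one symbol file per UUID to serve every device, since the load address differs from launch to launch.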
References
1. Black, David L. The Mach Exception Handling Facility.
2. iOS Crash Analysis Guide. https://developer.aliyun.com/article/766088
3. "In-depth Analysis of Mac OS X & iOS Operating System"
Author: Zheng Genghao, Lan Haiting
Source: WeChat public account: Yingke technology
Source: https://mp.weixin.qq.com/s/WGod1JhojaWhuOap45QxaA
Preface
Application Crash phenomenon will bring users a very poor user experience. This article will start from the bottom of the iOS system, sort out the core knowledge points, explain the collection of various types of Crash and the monitoring and analysis of OOM (out of memory), as well as the effects presented in the APM system
. Exception processing .1 OSX/iOS System architecture In the document given by Apple , in the abstract system architecture hierarchy diagram shown, the system architecture hierarchy of OSX and iOS is the same, divided into 4 levels:
- User experience layer: including Aqua, Dashboard/Spotlight, etc.;
- Application framework layer: Cocoa, Carbon, Java;
- Core framework: sometimes also called the graphics and media layer. Including core framework, OpenGL;
- Darwin: Including kernel and UNIX shell environment
levels, Darwin is completely open source, the foundation of the entire system, and provides the underlying API.
Figure 1 System architecture diagram for OSX and iOS
OSX and iOS system frameworks can be represented in Figure 1 in abstraction, but in fact there are still some differences in details between them, so I will not introduce them more here. What is more important here is the Darwin framework
Figure 2 Darwin framework diagram Source: "In-depth analysis of Mac OS X & iOS operating system"
Darwin's kernel is XUN, which is also the core of OS X itself. From Figure 2, XUN has the following components:
- Mach
- BSD
- LibKern
- I/O Kithml4
The most important of these are Mach and BSD.
.2 Mach layer Mach is a microkernel, this microkernel can only handle the most basic responsibilities of the operating system:
- process and thread abstract
- virtual memory management
- Any scheduling
- inter-process communication and message delivery mechanism
Mach itself has very limited APIs, but these APIs are very basic, if these are not available API, other tasks cannot be implemented, and Mach's exception handling is also designed based on the above four capabilities.
In Mach, exceptions are passed through messages in the kernel. The exception is thrown by the wrong task or thread through msg_send() and caught by a handler through msg_recv(). The handler can handle exceptions, clear exceptions (mark the exception as completed and continue), and terminate the thread. The exception handler of
Mach runs in different contexts. The error thread sends a message to the pre-specified exception port and then waits for a reply. Each task can register an exception port, which will be effective for all threads in the same task. In addition, a single thread can also register its own exception port through thread_set_exception_ports. Normally, the exception ports of tasks and threads are NULL, which means that the exception will not be processed. And once exception ports are created, these ports can be handed over to other tasks and even other hosts just like other ports in the system.
When an exception occurs, it will follow the following steps:
- tries to throw the exception port in the thread
- tries to throw the exception port of the task
- tries to throw the exception port of the host (i.e. the default port registered by the host).
If no port returns KERN_SUCCESS, then the entire task is terminated. According to the previous description, Mach does not provide exception handling logic - just a framework that provides exception notifications.
.3 BSD layer BSD layer is built on Mach. This layer is a very reliable and more modern API, providing POSIX compatibility and providing higher-level abstractions, including but not limited to:
- UNIX Process model
- POSiX thread model (pthread) and related synchronization primitives
- UNIX Users and groups
- Network protocol stack
- File system access
- Device access
In handling exceptions, Mach has provided the underlying trap processing through the exception mechanism, while BSD has built a signal processing mechanism on top of the exception mechanism. The signals generated by the hardware are captured by the Mach layer and converted into the corresponding UNIX signal. To maintain a unified mechanism, the signals generated by the operating system and user are first converted to Mach exceptions and then converted to signals. When the
BSD process is started by the bsdinit_task() function, the ux_handle_init() function is also called, which sets up a Mach kernel thread named ux_handle. bsdinit_task() can only register to use ux_Exception_port after the ux_handle_init() function returns. bsdinit_task() redirects all Mach exception messages to ux_exception_port, which is held by the ux_handle thread. Following the Mach exception message delivery method, process exception handling with PID of 1 will be handled by the ux_handle() thread outside the process. Since all user-mode processes created later are descendants of PID1, these processes will automatically inherit this exception port, which is equivalent to the ux_handle() thread being responsible for handling every Mach exception generated by UNIX processes on the system. The ux_handle() function is very simple. When entering, this function will first set ux_handle_port, and then enter an infinite loop of Mach message loop. The message loop accepts Mach exception message, and then calls mach_exc_server() to handle the exception. The entire flow chart is as follows:
Figure 3 Mach exception handling and the process of converting to UNIX signals
2. Crash collection method
To understand Crash, we should first understand several basic concepts and their relationship:
- Software exception: mainly comes from the calls to kill() and pthread_kill() of the two APIs, and the NSException not caught and abort() function calls that are often encountered in iOS are all in this case.
- Hardware exception: This type of exception starts with a processor trap, such as crashing access to a wild pointer.
- Mach exception: Mach exception handling process is referred to as
- UNIX signal: such as SIGBUS, SIGSEGV, SIGABRT, SIGKILL, etc.
......
Exception Type: EXC_CRASH (SIGABRT)
Exception Codes: 0x00000000000000000000000000, 0x0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 EXC_CORPSE_NOTIFY
Triggered by Thread: 0
Last Exception Backtrace:
CoreFoundation 0x1843765ac __exceptionPreprocess + 220 (NSException.m:199)
libobjc.A.dylib 0x1983f042c objc_exception_throw + 60 (objc-exception.mm:565)
...
This is an App crash log. From the Exception Type: EXC_CRASH (SIGABRT) in the log, we can know that this is an EXC_CRASH exception occurred in the Mach layer, which was converted to the SIGABRT signal.So you may have a question? Since the Mach layer can catch exceptions and registering UNIX signals can also catch exceptions, how are the two methods systems selected? Moreover, from Figure 3, we can see that Mach exceptions will eventually be converted into UNIX signals. So do we just need to intercept the UNIX signals?
is not actually the case. There are two reasons for this:
- Because not all Mach exception types have corresponding UNIX signals for mapping
- UNIX signals are crashing thread callbacks. If you encounter stack overflow, then there is no stack space to execute the callback code.
So do you just need to intercept Mach exceptions? The answer is the same no, because the user-state software exception directly passes the signal flow. If it is not intercepted, it will cause this part of Crash to be lost.
Therefore, in the collection of Crash, the monitoring system should have a variety of exception handling capabilities. There are many such tools on the market, one of which is KSCrash, which is also the most popular and perfect Crash collection tool at present. Most of the source code is written based on C language . WeChat's open source project Matrix is also developed based on KSCrash, and the iOS crash monitoring in our APM system is also written based on this tool.
2.1 Mach layer exception handling
Reading the source code is the fastest way to understand tools. Let's take a look at how KSCrash handles Mach layer exceptions (KSCrashMonitor_MachException.c) as follows:
statictml4 bool installExceptionHandler()
{
......
//Get the current task
const task_t thisTask = mach_task_self();
exception_mask_t mask = EXC_MASK_BAD_ACCESS |
EXC_MASK_BAD_INstructtION |
EXC_MASK_ARITHMETIC |
EXC_MASK_ARITHMETIC |
EXC_MASK_SOFTWARE |
EXC_MASK_BREAKPOINT;
//Get all exception ports of the current task and save them in the kr attribute
kr = task_get_exception_ports(thisTask,
mask,
g_previousExceptionPorts.masks,
&g_previousExceptionPorts.count,
g_previousExceptionPorts.ports,
g_previousExceptionPorts.behaviors,
g_previousExceptionPorts.flavors);
if(kr != KERN_SUCCESS)
{
KSLOG_ERROR("task_get_exception_ports: %s", mach_error_string(kr));
goto failed;
}
if(g_exceptionPort == MACH_PORT_NULL)
{
KSLOG_DEBUG("Allocating new port with receive rights.");
//Application for an exception port
kr = mach_port_allocate(thisTask,
MACH_PORT_RIGHT_RECEIVE,
&g_exceptionPort);
if(kr != KERN_SUCCESS)
{
KSLOG_ERROR("mach_port_allocate: %s", mach_error_string(kr));
goto failed;
}
KSLOG_DEBUG("Adding send rights to port.");
//Set port permissions
kr = mach_port_insert_right(thisTask,
g_exceptionPort,
g_exceptionPort,
g_exceptionPort,
MACH_MSG_TYPE_MAKE_SEND);
if(kr != KERN_SUCCESS)
{
KSLOG_ERROR("mach_port_insert_right: %s", mach_error_string(kr));
goto failed;
}
}
}
...
//Create the corresponding thread and catch the exception
error = pthread_create(&g_secondaryPThread,
&attr,
&handleExceptions,
kThreadSecondary);
......
According to the source code, the following flow chart can be summarized:
In 1.2, mentioned The Mach layer handles exceptions, so the exception capture in the Mach layer is also carried out according to this processing process. The idea is to first apply for an exception handling port, apply for permissions for this port, then set up an exception port, create a new kernel thread, and wait for an exception in the thread loop, but when an exception occurs, the thread will be suspended and the information when Crash occurs will be assembled into the JSON file.However, in order to prevent the exception port you registered from preempting other SDKs or logic set by developers, you need to save other exception ports first, and after the collection logic is completed, the exception processing is handed over to the logical processing in other ports.
2.2 Signal Exception Handling
Signal Exception Capture is in the installSignalHandler function in KSCrashMonitor_signal.c. The specific solution is to first use the sigaltstack function to allocate a piece of memory on the heap and set the signal stack area. The purpose is to replace the stack of the signal processing function, because a process may have n threads, and each thread has its own task. If a thread executes an error, it will cause the entire process to collapse. Therefore, in order for the signal processing exception function to run normally, a separate running space needs to be set.
Next set the signal processing function sigaction, then traverse the signal array to be processed, bind the processing function of each signal to the sigaction, and use g_previousSignalHandlers to save the processing function of the current signal. During signal processing, save the thread's context information.
Finally wait until the KSCrash signal processing is restored before the signal processing permission is restored.
core code is as follows:
statictml4 bool installSignalHandler()
{
KSLOG_DEBUG("Installing signal handler.");
#if KSCRASH_HAS_SIGNAL_STACK
// Allocate a piece of memory on the heap,
if(g_signalStack.ss_size == 0)
{
KSLOG_DEBUG("Allocating signal stack area.");
g_signalStack.ss_size = SIGSTKSZ;
g_signalStack.ss_sp = malloc(g_signalStack.ss_size);
}
}
KSLOG_DEBUG("Setting signal stack area.");
// The stack of signal processing functions is moved to the heap, and does not share the same stack area as the process
if(sigaltstack(&g_signalStack, NULL) != 0)
{
KSLOG_ERROR("signalstack: %s", strerror(errno));
goto failed;
}
#endif
const int* fatalSignals = kssignal_fatalSignals();
int fatalSignalsCount = kssignal_numFatalSignals();
if(g_previousSignalHandlers == NULL)
{
KSLOG_DEBUG("Allocating memory to store previous signal handlers.");
g_previousSignalHandlers = malloc(sizeof(*g_previousSignalHandlers)
* (unsigned)fatalSignalsCount);
}
}
// Set the second parameter of the signal processing function sigaction, type sigaction structure
struct sigaction action = {{0}};
action.sa_flags = SA_SIGINFO | SA_ONSTACK;
#if KSCRASH_HOST_APPLE && defined(__LP64__)
action.sa_flags |= SA_64REGSET;
#endif
sigemptyset(&action.sa_mask);
action.sa_sigaction = &handleSignal;
for(int i = 0; i fatalSignalsCount; i++)
{
KSLOG_DEBUG("Assigning handler for signal %d", fatalSignals[i]);
// Bind the processing function of each signal to the action declared above, and use g_previousSignalHandlers to save the processing function of the current signal
if(sigaction(fatalSignals[i], &action, &g_previousSignalHandlers[i]) != 0)
{
char sigNameBuff[30];
const char*sigName = kssignal_signalName(fatalSignals[i]);
if(sigName == NULL)
{
snprintf(sigNameBuff, sizeof(sigNameBuff), "%d", fatalSignals[i]);
sigName = sigNameBuff;
}
KSLOG_ERROR("sigaction (%s): %s", sigName, strerror(errno));
// Try to reverse the damage
for(i--;i = 0; i--)
{
sigaction(fatalSignals[i], &g_previousSignalHandlers[i], NULL);
}
goto failed;
}
}
KSLOG_DEBUG("Signal handlers installed.");
return true;
failed:
KSLOG_DEBUG("Failed to install signal handlers.");
return false;
}
...
2.3 C++Exception handling
c++Exception handling relies on the standard library's std::set_terminate(CPPExceptionTerminate) function.
In iOS, if C and C++ exceptions can be converted into NSException, Objective-C exception handling will be performed. If not, it is default_terminate_handler. This C++ exception default_terminate_handler function calls the abort_message function, and the system generates a SIGABRT signal.
statictml4 void CPPExceptionTerminate(void)
{
......
// The conditions of exception, the NSException inherited from NSException will be treated as cpp exception
// if (name == NULL || strcmp(name, "NSException") != 0
if (g_capturedStackCursor && (name == NULL || strcmp(name, "NSException") != 0))
{
kscm_notifyFatalExceptionCaptured(false);
KSCrash_MonitorContext* crashContext = &g_monitorContext;
memset(crashContext, 0, sizeof(*crashContext));
char* description = descriptionBuff;
const char* description = descriptionBuff;
descriptionBuff[0] = 0;
KSLOG_DEBUG("Discovering what kind of exception was thrown.");
g_captureNextStackTrace = false;
try
{
{
throw;
}
catch(std::exception& exc)
{
strncpy(descriptionBuff, exc.what(), sizeof(descriptionBuff));
}
}
......
For NSException exception handling at the OC level is relatively easy. You can register NSUncaughtExceptionHandler to capture exception information, collect Crash information through NSException parameters, and hand it over to the data reporting component. For example,
KSCrash.sharedInstance.uncaughtExceptionHandler = &handleException;
. OOM related concepts
OOM is the abbreviation of out of memory, which refers to iOS The current application on the device is forced to terminate by the operating system due to excessive memory usage. The perception on the user side is that the App crashes in a flash, which is not significantly different from ordinary Crash. However, when we encounter this kind of crash in the debugging stage, we cannot find ordinary crash logs in the device analysis and improvement. You can find logs starting with Jetsam. This log is a log generated by the system after the OOM crash specifically reflects memory exception problems.
According to the running status of the program, OOM is generally divided into the following two types:
Foreground Out Of Memory
OOM crashes in the foreground and the application is running in the foreground and
- Background Out Of Memory
OOM crashes in the background
Jetsam
Jetsam is the resource management mechanism that iOS uses to keep memory usage under control. Unlike macOS, Linux, Windows and other desktop operating systems, iOS does not implement a memory swap space, for performance reasons. Therefore, when the overall memory of the device is tight, the system can only terminate processes that have low priority or use too much memory. An excerpt from a Jetsam log is shown below:
{
  "uuid" : "a02fb850-9725-4051-817a-8a5dc0950872",
  "states" : [
    "frontmost" // Application state: running in the foreground
  ],
  "lifetimeMax" : 92802,
  "purgeable" : 0,
  "coalition" : 68,
  "rpages" : 92802, // Number of memory pages held by the process
  "reason" : "per-process-limit", // Termination reason: the per-process memory limit was exceeded
  "name" : "MyCoolApp"
}
For a detailed description of the fields, refer to the official documentation.
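The rpages value in the excerpt is a page count, so it has to be multiplied by the page size to get a byte figure. The following is a minimal sketch of that arithmetic. The function names are hypothetical, and the 16 KB page size is an assumption (the real value should be taken from the page size reported in the same log).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16 KB pages. Read the actual page size from the
 * Jetsam log instead of hard-coding it in real code. */
#define ASSUMED_PAGE_SIZE 16384u

/* Convert a page count such as "rpages" into bytes. */
uint64_t jetsam_pages_to_bytes(uint64_t pages, uint32_t page_size) {
    assert(page_size != 0);
    return pages * (uint64_t)page_size;
}

/* Convert a byte count into whole mebibytes (rounded down). */
uint64_t bytes_to_mib(uint64_t bytes) {
    return bytes / (1024u * 1024u);
}
```

Under this assumption, the 92802 pages in the excerpt correspond to 1520467968 bytes, or roughly 1450 MB, which gives an idea of the per-process limit that was hit.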
The Jetsam cleanup strategy covers two situations:
- A single App process exceeds its memory limit.
- The physical memory of the whole device is under pressure, in which case processes are cleaned up by priority:
  - Background applications before foreground applications
  - Applications with a high memory footprint before applications with a low memory footprint
  - User applications before system applications
Function introduction & principle
OOM warning
The OOM warning function reports memory status information to the APM platform when memory usage reaches a predefined threshold. The flowchart is as follows:
The system kernel provides a structure that represents memory information.
The task_info function can be used to obtain the current memory usage:
kern_return_t task_info
(
    task_name_t target_task,
    task_flavor_t flavor,
    task_info_t task_info_out,
    mach_msg_type_number_t *task_info_outCnt
);
The code for monitoring the memory size is as follows:
// Requires #import <mach/mach.h>
int64_t memoryUsageInByte = 0;
task_vm_info_data_t vmInfo;
mach_msg_type_number_t count = TASK_VM_INFO_COUNT;
// Query the VM statistics of the current task.
kern_return_t kernelReturn = task_info(mach_task_self(), TASK_VM_INFO, (task_info_t) &vmInfo, &count);
if(kernelReturn == KERN_SUCCESS) {
    // phys_footprint is the physical memory footprint of the process, in bytes.
    memoryUsageInByte = (int64_t) vmInfo.phys_footprint;
}
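The warning itself is a comparison between each sampled footprint and a threshold. The sketch below shows one way to structure that check in portable C, keeping the peak value and firing only once. All names here are hypothetical, and firing a single time is an assumption of this sketch rather than something the APM SDK is confirmed to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical monitor state; the names are illustrative only. */
typedef struct {
    int64_t threshold_bytes; /* warning threshold chosen by the app */
    int64_t peak_bytes;      /* highest footprint sampled so far */
    bool    warned;          /* true once the warning has fired */
} memory_watch_t;

void memory_watch_init(memory_watch_t *w, int64_t threshold_bytes) {
    assert(w != NULL);
    w->threshold_bytes = threshold_bytes;
    w->peak_bytes = 0;
    w->warned = false;
}

/* Feed one footprint sample (for example, phys_footprint).
 * Returns true only the first time the sample exceeds the
 * threshold, so the caller reports a single warning. */
bool memory_watch_sample(memory_watch_t *w, int64_t footprint_bytes) {
    assert(w != NULL);
    if (footprint_bytes > w->peak_bytes) {
        w->peak_bytes = footprint_bytes;
    }
    if (!w->warned && footprint_bytes > w->threshold_bytes) {
        w->warned = true;
        return true;
    }
    return false;
}
```

A timer or background thread would call memory_watch_sample with each new reading and trigger the report when it returns true. peak_bytes stays up to date for later samples as well.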
The logic for handling a memory peak is as follows:
That is, when the configured threshold is exceeded, the memory information at that moment is recorded for reporting.
-(void)saveLastSingleLoginMaxMemory{
    if(_hasUpoad){
        NSString* currentMemory = [NSString stringWithFormat:@"%f", _singleLoginMaxMemory];
        NSString* overflowMemoryLimit =[NSString stringWithFormat:@"%f", overflow_limit];
        // Record the time at which the threshold was exceeded for the first time.
        if(_singleLoginMaxMemory > overflow_limit){
            static BOOL isFirst = YES;
            if(isFirst){
                _firstOOMTime = [[NSDate date] timeIntervalSince1970];
                isFirst = NO;
            }
        }
        NSDictionary *minidumpdata = [NSDictionary dictionaryWithObjectsAndKeys:currentMemory,@"singleMemory",overflowMemoryLimit,@"threshold",[NSString stringWithFormat: @"%.2lf", _firstOOMTime],@"LaunchTime",nil];
        NSString *fileDir = [self singleLoginMaxMemoryDir];
        if (![[NSFileManager defaultManager] fileExistsAtPath:fileDir])
        {
            [[NSFileManager defaultManager] createDirectoryAtPath:fileDir withIntermediateDirectories:YES attributes:nil error:nil];
        }
        NSString *filePath = [fileDir stringByAppendingString:@"/apmLastMaxMemory.plist"];
        if(minidumpdata != nil){
            // Replace any record left over from an earlier write.
            if([[NSFileManager defaultManager] fileExistsAtPath:filePath]){
                [[NSFileManager defaultManager] removeItemAtPath:filePath error:nil];
            }
            [minidumpdata writeToFile:filePath atomically:YES];
        }
    }
}
Log records obtained by simulating a memory peak:
OOM monitoring
OOM monitoring records the stack information in time when the App is terminated because of OOM and reports it to the APM platform for subsequent problem analysis.
The Jetsam mechanism terminates the process by sending a SIGKILL signal, which the current process cannot catch, so the conventional Crash capture approach of monitoring exception signals does not work. So how can it be monitored? In 2015, Facebook proposed an idea based on the process of elimination.
Every time the App starts, it determines why the previous run terminated. The known reasons are:
- The App was updated to a new version
- The App crashed
- The user manually quit the App
- The operating system was updated to a new version
- The App process was terminated after switching to the background
If the previous run did not terminate for any of the above reasons, it is determined that an OOM crash occurred during the previous run.
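The process of elimination described above can be sketched as a small decision function. This is a minimal illustration in portable C. The last_run_t record and all field names are hypothetical, and it assumes the previous run persisted these flags before it ended.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record persisted by the previous run. */
typedef struct {
    bool crashed;            /* a crash handler ran */
    bool user_exited;        /* the user quit the App manually */
    bool was_in_foreground;  /* last recorded application state */
    const char *app_version; /* App version of the previous run */
    const char *os_version;  /* OS version of the previous run */
} last_run_t;

typedef enum {
    LAST_RUN_EXPLAINED,      /* one of the known reasons applies */
    LAST_RUN_SUSPECTED_FOOM  /* nothing explains the termination */
} last_run_verdict_t;

/* Rule out every known termination reason; whatever is left is
 * treated as a foreground OOM. */
last_run_verdict_t classify_last_run(const last_run_t *last,
                                     const char *current_app_version,
                                     const char *current_os_version) {
    assert(last != NULL);
    if (last->crashed) return LAST_RUN_EXPLAINED;
    if (last->user_exited) return LAST_RUN_EXPLAINED;
    if (strcmp(last->app_version, current_app_version) != 0) return LAST_RUN_EXPLAINED;
    if (strcmp(last->os_version, current_os_version) != 0) return LAST_RUN_EXPLAINED;
    if (!last->was_in_foreground) return LAST_RUN_EXPLAINED;
    return LAST_RUN_SUSPECTED_FOOM;
}
```

Because the verdict is reached by elimination, it is only as reliable as the persisted flags: any termination reason that is not recorded will be counted as an OOM.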
The core code logic is as follows:
-(NSDictionary *)parseFoomData:(NSDictionary *)foomDict
{
    ......
    if(appState == APPENTERFORGROUND){
        BOOL isExit = [[foomDict objectForKey:@"isExit"] boolValue];
        BOOL isDeadLock = [[foomDict objectForKey:@"isDeadLock"] boolValue];
        NSString *lastSysVersion = [foomDict objectForKey:@"systemVersion"];
        NSString *lastAppVersion = [foomDict objectForKey:@"appVersion"];
        // No crash, no manual exit, and no system or App upgrade since the last run.
        if(!isCrashed && !isExit && [_systemVersion isEqualToString:lastSysVersion] && [_appVersion isEqualToString:lastAppVersion]){
            if(isDeadLock){
                OOM_Log("The app occurred lastTime,detail info:%s",[[foomDict description] UTF8String]);
                [result setObject:@deadlock_crash forKey:@"crash_type"];
                NSDictionary *stack = [foomDict objectForKey:@"deadlockStack"];
                if(stack && stack.count > 0){
                    [result setObject:stack forKey:@"stack_deadlock"];
                    OOM_Log("The app deadlock stack:%s",[[stack description] UTF8String]);
                }
            }
            else {
                OOM_Log("The app occurred lastTime,detail info:%s",[[foomDict description] UTF8String]);
                [result setObject:@foom_crash forKey:@"crash_type"];
                NSString *uuid = [foomDict objectForKey:@"uuid"];
                NSArray *oomStack = [[OOMDector getInstance] getOOMDataByUUID:uuid];
                if(oomStack && oomStack.count > 0)
                {
                    NSData *oomData = [NSJSONSerialization dataWithJSONObject:oomStack options:0 error:nil];
                    if(oomData.length > 0){
                        // NSString *stackStr = [NSString stringWithUTF8String:(const char *)oomData.bytes];
                        OOM_Log("The app foom stack:%s",[[oomStack description] UTF8String]);
                    }
                    [result setObject:[self getAPMOOMStack:oomStack] forKey:@"stack_oom"];
                }
            }
            return result;
        }
    }
    ...
    ...
}
Memory portrait
When the program reaches a memory peak, the memory portrait takes a snapshot of memory, exports the reference relationships between memory nodes, and helps find the reason for the high memory usage. There are two things to do:
1. Getting the memory nodes
The memory nodes are obtained by scanning all VM Regions in the process with the vm_region_recurse/vm_region_recurse_64 functions of the Mach kernel. Detailed information is obtained through the vm_region_submap_info_64 structure.
2. Analyzing the reference relationships between nodes
There are two cases here. For the VM Regions that hold the heap maintained by libmalloc, the OC objects, C/C++ objects, buffers, etc. contained in them can yield detailed reference relationships and need to be processed separately. VM Regions that are not maintained by libmalloc are treated as separate memory nodes, for which only the start address, the Dirty and Swapped memory sizes, and the reference relationships with other nodes are recorded.
The core code for obtaining the node is as follows:
void VMRegionCollect::startCollet() {
    ......
    while (1) {
        struct vm_region_submap_info_64 info;
        mach_msg_type_number_t count = VM_REGION_SUBMAP_INFO_COUNT_64;
        krc = vm_region_recurse_64(mach_task_self(), &address, &size, &depth, (vm_region_info_64_t)&info, &count);
        // No more regions in the address space.
        if (krc == KERN_INVALID_ADDRESS){
            break;
        }
        if (info.is_submap){
            // Descend into the submap and query the same address again.
            depth++;
        } else {
            //do stuff
            proc_regionfilename(pid, address, buf, sizeof(buf));
            printf("Found VM Region: %08x to %08x (depth=%d) user_tag:%s name:%s\n", (uint32_t)address, (uint32_t)(address+size), depth, [visualMemoryTypeString(info.user_tag) cStringUsingEncoding:NSUTF8StringEncoding], buf);
            // Move on to the next region.
            address += size;
        }
    }
}
Sample data from scanning the nodes is as follows:
The core code for the reference relationships between heap memory nodes is as follows:
It matches the address of a class member variable against the isa pointer address of the referenced class to determine whether a reference relationship exists.
static void range_callback(task_t task, void *context, unsigned type, vm_range_t *ranges, unsigned rangeCount) {
    if (!context) {
        return;
    }
    for (unsigned int i = 0; i < rangeCount; i++) {
        vm_range_t range = ranges[i];
        flex_maybe_object_t *tryObject = (flex_maybe_object_t *)range.address;
        Class tryClass = NULL;
#ifdef __arm64__
        // See http://www.sealiesoftware.com/blog/archive/2013/09/24/objc_explain_Non-pointer_isa.html
        extern uint64_t objc_debug_isa_class_mask WEAK_IMPORT_ATTRIBUTE;
        tryClass = (__bridge Class)((void *)((uint64_t)tryObject->isa & objc_debug_isa_class_mask));
#else
        tryClass = tryObject->isa;
#endif
        // If the class pointer matches one in our set of class pointers from the runtime, then we should have an object.
        if (CFSetContainsValue(registeredClasses, (__bridge const void *)(tryClass))) {
            (*(object_enumeration_block_t __unsafe_unretained *)context)((__bridge id)tryObject, tryClass);
        }
    }
}

static kern_return_t reader(__unused task_t remote_task, vm_address_t remote_address, __unused vm_size_t size, void **local_memory) {
    *local_memory = (void *)remote_address;
    return KERN_SUCCESS;
}

+ (void)enumerateLiveObjectsUsingBlock:(object_enumeration_block_t)block {
    if (!block) {
        return;
    }
    [self updateRegisteredClasses];
    vm_address_t *zones = NULL;
    unsigned int zoneCount = 0;
    kern_return_t result = malloc_get_all_zones(TASK_NULL, reader, &zones, &zoneCount);
    if (result == KERN_SUCCESS) {
        for (unsigned int i = 0; i < zoneCount; i++) {
            malloc_zone_t *zone = (malloc_zone_t *)zones[i];
            malloc_introspection_t *introspection = zone->introspection;
            if (!introspection) {
                continue;
            }
            void (*lock_zone)(malloc_zone_t *zone) = introspection->force_lock;
            void (*unlock_zone)(malloc_zone_t *zone) = introspection->force_unlock;
            // Release the zone lock while the caller's block runs, then take it again.
            object_enumeration_block_t callback = ^(__unsafe_unretained id object, __unsafe_unretained Class actualClass) {
                unlock_zone(zone);
                block(object, actualClass);
                lock_zone(zone);
            };
            BOOL lockZoneValid = PointerIsReadable(lock_zone);
            BOOL unlockZoneValid = PointerIsReadable(unlock_zone);
            if (introspection->enumerator && lockZoneValid && unlockZoneValid) {
                lock_zone(zone);
                introspection->enumerator(TASK_NULL, (void *)&callback, MALLOC_PTR_IN_USE_RANGE_TYPE, (vm_address_t)zone, reader, &range_callback);
                unlock_zone(zone);
            }
        }
    }
}
Sample data for the obtained reference relationships is as follows:
The reference relationships are output in reverse order, so they look like a staircase.
Taking part of the data and analyzing the reference relationships with a tool gives the result shown in the figure:
In this way, you can clearly see the reference relationships between the heap memory nodes and the memory they occupy.
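To make the idea of a reference between memory nodes concrete, here is a simplified sketch in portable C. It is not the isa-matching approach used by the code above: it swaps in a conservative scan that treats every pointer-sized word as a possible address and checks whether it falls inside another node. The mem_node_t type and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory node. */
typedef struct {
    uintptr_t start; /* first address of the node */
    size_t    size;  /* size of the node in bytes */
} mem_node_t;

/* Return the index of the node that contains addr, or -1. */
int find_node(const mem_node_t *nodes, size_t count, uintptr_t addr) {
    assert(nodes != NULL);
    for (size_t i = 0; i < count; i++) {
        if (addr >= nodes[i].start && addr - nodes[i].start < nodes[i].size) {
            return (int)i;
        }
    }
    return -1;
}

/* Count how many pointer-sized words in a node's payload point
 * into the node at index target. A word that merely looks like
 * an address is counted too, so the result is an upper bound. */
size_t count_references(const uintptr_t *words, size_t word_count,
                        const mem_node_t *nodes, size_t node_count,
                        int target) {
    assert(words != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < word_count; i++) {
        if (find_node(nodes, node_count, words[i]) == target) {
            hits++;
        }
    }
    return hits;
}
```

Running count_references for every pair of nodes yields the edges of a reference graph, which is the kind of data a visualization tool can turn into a diagram. With many nodes, a sorted array and binary search would be a better fit than the linear lookup used here.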
Summary
The above is an introduction to the OOM functions in the APM system, which mainly include three features:
- OOM warning records when an online App exceeds the memory threshold, in order to identify the risk of an OOM crash.
- OOM monitoring records the scene in time when an OOM occurs, providing clues for developers' subsequent troubleshooting.
- Memory portrait exports the reference relationships when an OOM occurs and records node sizes and other information, making it more intuitive to find where memory usage is high.
4. Crash log
4.1 APM reports Crash log process
After APM is integrated into the project, Crash monitoring is turned on by default when the SDK is initialized. When a Crash occurs, the following steps are performed:
- After KSCrash collects the crash log, the crashCallBack function of the APM is executed.
- The log is written to APMLog in the crashCallBack function and cached.
- On the next launch, after APM is initialized successfully, the Crash file is reported to the server according to the reporting process.
4.2 Log parsing
The APM iOS SDK only needs to report the log successfully. The server then symbolicates it based on the Crash information, such as the version number, binaryImage, and UUID. After symbolication succeeds, the corresponding log can be viewed in the management console.
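As a rough illustration of why the binary image list and UUID matter for symbolication, the sketch below finds the image that contains a stack address and computes the offset inside it. This is an assumption about how such a server could work, not the APM server's actual implementation. The binary_image_t type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry from the binary image list of a crash log. */
typedef struct {
    const char *name;         /* image name */
    const char *uuid;         /* build UUID, used to pick the symbol file */
    uint64_t    load_address; /* address the image was loaded at */
    uint64_t    size;         /* size of the image in memory */
} binary_image_t;

/* Find the image whose address range contains the given address. */
const binary_image_t *image_for_address(const binary_image_t *images,
                                        size_t count, uint64_t address) {
    assert(images != NULL);
    for (size_t i = 0; i < count; i++) {
        if (address >= images[i].load_address &&
            address - images[i].load_address < images[i].size) {
            return &images[i];
        }
    }
    return NULL;
}

/* Offset of the address inside its image; this is the value that
 * is looked up in the symbol file matching the image UUID. */
uint64_t image_offset(const binary_image_t *image, uint64_t address) {
    assert(image != NULL);
    return address - image->load_address;
}
```

The offset is independent of where the image happened to be loaded on the device, which is why the load address from the log has to be subtracted before the lookup.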
References
1. Black, David L. The Mach Exception Handling Facility.
2. iOS Crash Analysis Guide. https://developer.aliyun.com/article/766088
3. "In-depth Analysis of Mac OS X & iOS Operating System"
Author: Zheng Genghao, Lan Haiting
Source: WeChat public account: Yingke technology
Source: https://mp.weixin.qq.com/s/WGod1JhojaWhuOap45QxaA