| .. hwpoison: | 
 |  | 
 | ======== | 
 | hwpoison | 
 | ======== | 
 |  | 
 | What is hwpoison? | 
 | ================= | 
 |  | 
 | Upcoming Intel CPUs have support for recovering from some memory errors | 
 | (``MCA recovery``). This requires the OS to declare a page "poisoned", | 
 | kill the processes associated with it and avoid using it in the future. | 
 |  | 
 | This patchkit implements the necessary infrastructure in the VM. | 
 |  | 
 | To quote the overview comment:: | 
 |  | 
 | 	High level machine check handler. Handles pages reported by the | 
 | 	hardware as being corrupted usually due to a 2bit ECC memory or cache | 
 | 	failure. | 
 |  | 
 | 	This focusses on pages detected as corrupted in the background. | 
 | 	When the current CPU tries to consume corruption the currently | 
 | 	running process can just be killed directly instead. This implies | 
 | 	that if the error cannot be handled for some reason it's safe to | 
 | 	just ignore it because no corruption has been consumed yet. Instead | 
 | 	when that happens another machine check will happen. | 
 |  | 
 | 	Handles page cache pages in various states. The tricky part | 
 | 	here is that we can access any page asynchronous to other VM | 
 | 	users, because memory failures could happen anytime and anywhere, | 
 | 	possibly violating some of their assumptions. This is why this code | 
 | 	has to be extremely careful. Generally it tries to use normal locking | 
 | 	rules, as in get the standard locks, even if that means the | 
 | 	error handling takes potentially a long time. | 
 |  | 
 | 	Some of the operations here are somewhat inefficient and have non | 
 | 	linear algorithmic complexity, because the data structures have not | 
 | 	been optimized for this case. This is in particular the case | 
 | 	for the mapping from a vma to a process. Since this case is expected | 
 | 	to be rare we hope we can get away with this. | 
 |  | 
 | The code consists of the high level handler in mm/memory-failure.c, | 
 | a new page poison bit and various checks in the VM to handle poisoned | 
 | pages. | 
 |  | 
 | The main target right now is KVM guests, but it works for all kinds | 
 | of applications. KVM support requires a recent qemu-kvm release. | 
 |  | 
 | For the KVM use there was a need for a new signal type so that | 
 | KVM can inject the machine check into the guest with the proper | 
 | address. This in theory allows other applications to handle | 
 | memory failures too. The expectation is that nearly all applications | 
 | won't do that, but some very specialized ones might. | 
 |  | 
 | Failure recovery modes | 
 | ====================== | 
 |  | 
 | There are two (actually three) modes memory failure recovery can be in: | 
 |  | 
 | vm.memory_failure_recovery sysctl set to zero: | 
 | 	All memory failures cause a panic. Do not attempt recovery. | 
 | 	(On x86 this can also be affected by the tolerant level of the | 
 | 	MCE subsystem.) | 
 |  | 
 | early kill | 
 | 	(can be controlled globally and per process) | 
 | 	Send SIGBUS to the application as soon as the error is detected. | 
 | 	This allows applications that can process memory errors in a gentle | 
 | 	way (e.g. drop the affected object) to do so. | 
 | 	This is the mode used by KVM qemu. | 
 |  | 
 | late kill | 
 | 	Send SIGBUS when the application runs into the corrupted page. | 
 | 	This is best for applications that are unaware of memory errors, | 
 | 	and it is the default. | 
 | 	Note that some pages are always handled as late kill. | 
 |  | 
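The difference between the two kill modes is visible to a process in the
``si_code`` of the SIGBUS it receives. The following is a minimal sketch,
assuming a glibc system that exposes ``BUS_MCEERR_AO`` (action optional,
delivered for early kill) and ``BUS_MCEERR_AR`` (action required, delivered
when the process touches the corrupted page). The names ``mce_kind``,
``classify_sigbus``, ``sigbus_handler`` and ``install_sigbus_handler`` are
hypothetical and exist only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names for this sketch; not part of any kernel or libc API. */
enum mce_kind {
	MCE_NONE,		/* SIGBUS not caused by a memory failure */
	MCE_ACTION_OPTIONAL,	/* early kill: corruption not consumed yet */
	MCE_ACTION_REQUIRED	/* late kill: the access hit the bad page */
};

/* Last report recorded by the handler, for the main loop to inspect. */
static volatile sig_atomic_t last_mce_kind = MCE_NONE;
static void *volatile last_mce_addr;

/* Map the si_code of a SIGBUS to the kind of memory failure report. */
enum mce_kind classify_sigbus(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AO:
		return MCE_ACTION_OPTIONAL;
	case BUS_MCEERR_AR:
		return MCE_ACTION_REQUIRED;
	default:
		return MCE_NONE;
	}
}

/* Only records the event; any recovery is left to code outside the handler. */
static void sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;
	last_mce_kind = classify_sigbus(info->si_code);
	last_mce_addr = info->si_addr;	/* address inside the poisoned page */
}

/* Install the handler. Returns 0 on success, -1 with errno set on failure. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

An application would call ``install_sigbus_handler()`` once at startup and
check ``last_mce_kind`` from its main loop. The sketch only records the
event. A real handler for the action-required case would also have to stop
the faulting access from being retried, for example by discarding or
remapping the affected object.
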
 | User control | 
 | ============ | 
 |  | 
 | vm.memory_failure_recovery | 
 | 	See sysctl.txt | 
 |  | 
 | vm.memory_failure_early_kill | 
 | 	Enable early kill mode globally | 
 |  | 
 | PR_MCE_KILL | 
 | 	Set early/late kill mode/revert to system default. | 
 | 	arg1 and arg2 below are the arguments following PR_MCE_KILL. | 
 |  | 
 | 	arg1: PR_MCE_KILL_CLEAR: | 
 | 		Revert to system default | 
 | 	arg1: PR_MCE_KILL_SET: | 
 | 		arg2 defines thread specific mode | 
 |  | 
 | 		PR_MCE_KILL_EARLY: | 
 | 			Early kill | 
 | 		PR_MCE_KILL_LATE: | 
 | 			Late kill | 
 | 		PR_MCE_KILL_DEFAULT: | 
 | 			Use system global default | 
 |  | 
 | 	Note that if you want to have a dedicated thread which handles | 
 | 	the SIGBUS(BUS_MCEERR_AO) on behalf of the process, you should | 
 | 	call prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0) | 
 | 	on the designated thread. Otherwise, the SIGBUS is sent to the | 
 | 	main thread. | 
 |  | 
 | PR_MCE_KILL_GET | 
 | 	Return the current mode | 
 |  | 
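The per-thread controls above map onto ``prctl()`` calls roughly as
follows. This is a sketch rather than a reference: it assumes the
``PR_MCE_*`` constants are provided by ``<sys/prctl.h>``, and the helper
names (``mce_kill_set_early``, ``mce_kill_clear``, ``mce_kill_get``,
``mce_kill_mode_name``) are hypothetical. Unused trailing arguments are
passed as explicit ``0UL`` because the kernel rejects non-zero values
there.

```c
#include <sys/prctl.h>

/* Hypothetical helpers; only prctl() and the PR_MCE_* constants are real. */

/* Request early kill for the calling thread. 0 on success, -1 on error. */
int mce_kill_set_early(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_SET,
		     (unsigned long)PR_MCE_KILL_EARLY, 0UL, 0UL);
}

/* Drop the per-thread policy and fall back to the system default. */
int mce_kill_clear(void)
{
	return prctl(PR_MCE_KILL, (unsigned long)PR_MCE_KILL_CLEAR,
		     0UL, 0UL, 0UL);
}

/* Query the policy: PR_MCE_KILL_EARLY, _LATE or _DEFAULT, or -1 on error. */
int mce_kill_get(void)
{
	return prctl(PR_MCE_KILL_GET, 0UL, 0UL, 0UL, 0UL);
}

/* Human-readable name for a value returned by mce_kill_get(). */
const char *mce_kill_mode_name(int mode)
{
	switch (mode) {
	case PR_MCE_KILL_EARLY:
		return "early";
	case PR_MCE_KILL_LATE:
		return "late";
	case PR_MCE_KILL_DEFAULT:
		return "default";
	default:
		return "unknown";
	}
}
```

A thread meant to receive SIGBUS(BUS_MCEERR_AO) on behalf of the process
would call ``mce_kill_set_early()`` itself, since the setting applies to
the calling thread.
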
 | Testing | 
 | ======= | 
 |  | 
 | * madvise(MADV_HWPOISON, ....) (as root) - Poison a page in the | 
 |   process for testing | 
 |  | 
 | * hwpoison-inject module through debugfs ``/sys/kernel/debug/hwpoison/`` | 
 |  | 
 |   corrupt-pfn | 
 | 	Inject hwpoison fault at PFN echoed into this file. This does | 
 | 	some early filtering to avoid corrupting unintended pages in test suites. | 
 |  | 
 |   unpoison-pfn | 
 | 	Software-unpoison page at PFN echoed into this file. This way | 
 | 	a page can be reused.  This only works for failures injected | 
 | 	by Linux, not for real memory failures. | 
 |  | 
 |   Note these injection interfaces are not stable and might change between | 
 |   kernel versions. | 
 |  | 
 |   corrupt-filter-dev-major, corrupt-filter-dev-minor | 
 | 	Only handle memory failures to pages associated with the file | 
 | 	system defined by block device major/minor.  -1U is the | 
 | 	wildcard value.  This should only be used for testing with | 
 | 	artificial injection. | 
 |  | 
 |   corrupt-filter-memcg | 
 | 	Limit injection to pages owned by a memory cgroup (memcg), | 
 | 	specified by the inode number of the memcg. | 
 |  | 
 | 	Example:: | 
 |  | 
 | 		mkdir /sys/fs/cgroup/mem/hwpoison | 
 |  | 
 | 		usemem -m 100 -s 1000 & | 
 | 		echo `jobs -p` > /sys/fs/cgroup/mem/hwpoison/tasks | 
 |  | 
 | 		memcg_ino=$(ls -id /sys/fs/cgroup/mem/hwpoison | cut -f1 -d' ') | 
 | 		echo $memcg_ino > /sys/kernel/debug/hwpoison/corrupt-filter-memcg | 
 |  | 
 | 		page-types -p `pidof init`   --hwpoison  # shall do nothing | 
 | 		page-types -p `pidof usemem` --hwpoison  # poison its pages | 
 |  | 
 |   corrupt-filter-flags-mask, corrupt-filter-flags-value | 
 | 	When specified, only poison pages if ((page_flags & mask) == | 
 | 	value).  This allows stress testing of many kinds of | 
 | 	pages. The page_flags are the same as in /proc/kpageflags. The | 
 | 	flag bits are defined in include/linux/kernel-page-flags.h and | 
 | 	documented in Documentation/admin-guide/mm/pagemap.rst. | 
 |  | 
 | * Architecture specific MCE injector | 
 |  | 
 |   x86 has mce-inject and mce-test. | 
 |  | 
 |   Some portable hwpoison test programs are in mce-test; see below. | 
 |  | 
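For the ``madvise(MADV_HWPOISON, ....)`` route listed above, a test
program might look roughly like the sketch below. It assumes a kernel
built with CONFIG_MEMORY_FAILURE and a caller with CAP_SYS_ADMIN. The
helper names ``page_start`` and ``poison_page_at`` are hypothetical, and
the fallback definition of ``MADV_HWPOISON`` is only there for C
libraries that do not expose it. Run it only on a test machine: the
poisoned page is taken out of use.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the Linux UAPI headers */
#endif

/* Round addr down to the start of its page; page_size is a power of two. */
uintptr_t page_start(uintptr_t addr, uintptr_t page_size)
{
	return addr & ~(page_size - 1);
}

/*
 * Hypothetical helper: poison the single page containing addr in the
 * calling process. Returns 0 on success, or -1 with errno set (for
 * example EPERM without CAP_SYS_ADMIN).
 */
int poison_page_at(void *addr)
{
	long page_size = sysconf(_SC_PAGESIZE);
	uintptr_t start;

	if (page_size <= 0)
		return -1;

	/* madvise() requires a page-aligned start address. */
	start = page_start((uintptr_t)addr, (uintptr_t)page_size);
	return madvise((void *)start, (size_t)page_size, MADV_HWPOISON);
}
```

A test would typically map and touch a scratch page, install a SIGBUS
handler, call ``poison_page_at()`` on the page, and then check which
signal arrives under the early or late kill setting being tested.
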
 | References | 
 | ========== | 
 |  | 
 | http://halobates.de/mce-lc09-2.pdf | 
 | 	Overview presentation from LinuxCon 09 | 
 |  | 
 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git | 
 | 	Test suite (hwpoison specific portable tests in tsrc) | 
 |  | 
 | git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git | 
 | 	x86 specific injector | 
 |  | 
 |  | 
 | Limitations | 
 | =========== | 
 | - Not all page types are supported, and some never will be. Most kernel | 
 |   internal objects cannot be recovered, only LRU pages for now. | 
 |  | 
 | --- | 
 | Andi Kleen, Oct 2009 |