|  | dm-log-writes | 
|  | ============= | 
|  |  | 
|  | This target takes 2 devices, one to pass all IO to normally, and one to log all | 
|  | of the write operations to.  This is intended for file system developers wishing | 
|  | to verify the integrity of metadata or data as the file system is written to. | 
|  | There is a log_write_entry written for every WRITE request and the target is | 
|  | able to take arbitrary data from userspace to insert into the log.  The data | 
|  | that is in the WRITE requests is copied into the log to make the replay happen | 
|  | exactly as it happened originally. | 
|  |  | 
|  | Log Ordering | 
|  | ============ | 
|  |  | 
|  | We log things in order of completion once we are sure the write is no longer in | 
|  | cache.  This means that normal WRITE requests are not actually logged until the | 
|  | next REQ_PREFLUSH request.  This lets userspace replay the log in a way that | 
|  | reflects what is on disk rather than what is in cache, which makes it easier | 
|  | to detect improper waiting/flushing. | 
|  |  | 
|  | This works by attaching all WRITE requests to a list once the write completes. | 
|  | Once we see a REQ_PREFLUSH request we splice this list onto the request and once | 
|  | the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only | 
|  | WRITEs that have completed by the time the REQ_PREFLUSH is issued are added, to | 
|  | simulate the worst case scenario with regard to power failures.  Consider the | 
|  | following example (W means write, C means complete): | 
|  |  | 
|  | W1,W2,W3,C3,C2,Wflush,C1,Cflush | 
|  |  | 
|  | The log would show the following: | 
|  |  | 
|  | W3,W2,flush,W1.... | 
|  |  | 
|  | Again, this is to simulate what is actually on disk.  It allows us to detect | 
|  | cases where a power failure at a particular point in time would create an | 
|  | inconsistent file system. | 
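
The ordering rules above can be modelled in a few lines of shell.  This is
only a toy sketch of the behaviour described here, not the kernel
implementation.  The function name simulate_log and the event tokens
(W<n> for a write being issued, C<n> for its completion, Wflush and Cflush
for the flush) are assumptions made up for illustration.

```shell
# Toy model of the log ordering rules; it does not touch any device.
simulate_log() {
    completed=""   # writes that completed but are not yet covered by a flush
    spliced=""     # writes captured by the flush that is currently in flight
    log=""         # entries in the order they would appear in the log
    for ev in "$@"; do
        case "$ev" in
            # Flush issued: splice the completed writes onto the flush.
            Wflush) spliced="$completed"; completed="" ;;
            # Flush completed: log the spliced writes, then the flush itself.
            Cflush) log="$log$spliced flush"; spliced="" ;;
            # Write completed: remember it until the next flush is issued.
            C*)     completed="$completed W${ev#C}" ;;
            # Write issued: nothing is logged at this point.
            W*)     : ;;
        esac
    done
    echo "logged:$log"
    echo "pending:$completed"
}

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
```

For the sequence from the example this prints "logged: W3 W2 flush" and
"pending: W1".  W1 completed after the flush was issued, so in this model it
stays pending until a later flush completes.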
|  |  | 
|  | Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as | 
|  | they complete as those requests will obviously bypass the device cache. | 
|  |  | 
|  | Any REQ_DISCARD requests are treated like WRITE requests.  Otherwise the log | 
|  | would contain all of the DISCARD requests first, then the WRITE requests, and | 
|  | then the FLUSH request.  Consider the following example: | 
|  |  | 
|  | WRITE block 1, DISCARD block 1, FLUSH | 
|  |  | 
|  | If we logged DISCARD when it completed, the replay would look like this: | 
|  |  | 
|  | DISCARD 1, WRITE 1, FLUSH | 
|  |  | 
|  | which isn't quite what happened and wouldn't be caught during the log replay. | 
|  |  | 
|  | Target interface | 
|  | ================ | 
|  |  | 
|  | i) Constructor | 
|  |  | 
|  | log-writes <dev_path> <log_dev_path> | 
|  |  | 
|  | dev_path	: Device that all of the IO will go to normally. | 
|  | log_dev_path : Device where the log entries are written to. | 
|  |  | 
|  | ii) Status | 
|  |  | 
|  | <#logged entries> <highest allocated sector> | 
|  |  | 
|  | #logged entries	       : Number of logged entries | 
|  | highest allocated sector   : Highest allocated sector | 
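
As a sketch of how these two fields could be consumed from a script: the
line printed by "dmsetup status <name>" starts with the start sector, the
length and the target name, followed by the target-specific fields listed
above.  The helper name parse_log_writes_status and the sample numbers are
hypothetical and only serve as an illustration.

```shell
# Hypothetical helper: extract the log-writes fields from one status line.
# $1 is a single line as printed by `dmsetup status <name>`.
parse_log_writes_status() {
    # Split the line into positional parameters:
    # <start> <length> <target> <#logged entries> <highest allocated sector>
    set -- $1
    [ "$3" = "log-writes" ] || return 1
    echo "entries=$4 highest_sector=$5"
}

# Made-up sample line; on a real system use: "$(dmsetup status log)"
parse_log_writes_status "0 2097152 log-writes 12 4096"
```

The function returns a non-zero status when the line does not belong to a
log-writes target, so a caller can tell the two cases apart.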
|  |  | 
|  | iii) Messages | 
|  |  | 
|  | mark <description> | 
|  |  | 
|  | You can use a dmsetup message to set an arbitrary mark in a log. | 
|  | For example, say you want to fsck a file system after every | 
|  | write.  You first need to replay up to the mkfs to make sure | 
|  | you are fsck'ing something reasonable, so you would do something | 
|  | like this: | 
|  |  | 
|  | mkfs.btrfs -f /dev/mapper/log | 
|  | dmsetup message log 0 mark mkfs | 
|  | <run test> | 
|  |  | 
|  | This would allow you to replay the log up to the mkfs mark and | 
|  | then replay from that point on doing the fsck check in the | 
|  | interval that you want. | 
|  |  | 
|  | Every log has a mark at the end labeled "dm-log-writes-end". | 
|  |  | 
|  | Userspace component | 
|  | =================== | 
|  |  | 
|  | There is a userspace tool that will replay the log for you in various ways. | 
|  | It can be found here: https://github.com/josefbacik/log-writes | 
|  |  | 
|  | Example usage | 
|  | ============= | 
|  |  | 
|  | Say you want to test fsync on your file system.  You would do something like | 
|  | this: | 
|  |  | 
|  | TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" | 
|  | dmsetup create log --table "$TABLE" | 
|  | mkfs.btrfs -f /dev/mapper/log | 
|  | dmsetup message log 0 mark mkfs | 
|  |  | 
|  | mount /dev/mapper/log /mnt/btrfs-test | 
|  | <some test that does fsync at the end> | 
|  | dmsetup message log 0 mark fsync | 
|  | md5sum /mnt/btrfs-test/foo | 
|  | umount /mnt/btrfs-test | 
|  |  | 
|  | dmsetup remove log | 
|  | replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync | 
|  | mount /dev/sdb /mnt/btrfs-test | 
|  | md5sum /mnt/btrfs-test/foo | 
|  | <verify md5sums match> | 
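
The final verification step could be scripted roughly as follows.  This is
a minimal sketch that assumes md5sum from GNU coreutils is available; the
helper names file_md5 and verify_md5 are made up for this example.

```shell
# Print only the MD5 digest of a file.
file_md5() {
    md5sum "$1" | cut -d' ' -f1
}

# Hypothetical helper: compare a previously saved digest ($1) with the
# current digest of a file ($2).
verify_md5() {
    expected=$1
    actual=$(file_md5 "$2")
    if [ "$expected" = "$actual" ]; then
        echo "OK: $2"
    else
        echo "MISMATCH: $2 (expected $expected, got $actual)"
        return 1
    fi
}

# Intended use in the test above:
#   before=$(file_md5 /mnt/btrfs-test/foo)    # taken before the umount
#   ... replay the log and mount /dev/sdb ...
#   verify_md5 "$before" /mnt/btrfs-test/foo
```

Saving the digest in a variable before the unmount avoids having to compare
the two md5sum outputs by eye.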
|  |  | 
|  | Another option is to do a complicated file system operation and verify the file | 
|  | system is consistent during the entire operation.  You could do this with: | 
|  |  | 
|  | TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc" | 
|  | dmsetup create log --table "$TABLE" | 
|  | mkfs.btrfs -f /dev/mapper/log | 
|  | dmsetup message log 0 mark mkfs | 
|  |  | 
|  | mount /dev/mapper/log /mnt/btrfs-test | 
|  | <fsstress to dirty the fs> | 
|  | btrfs filesystem balance /mnt/btrfs-test | 
|  | umount /mnt/btrfs-test | 
|  | dmsetup remove log | 
|  |  | 
|  | replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs | 
|  | btrfsck /dev/sdb | 
|  | replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \ | 
|  | --fsck "btrfsck /dev/sdb" --check fua | 
|  |  | 
|  | This will replay the log until it sees a FUA request, run the fsck command, | 
|  | and if the fsck passes, replay to the next FUA.  It continues until the log | 
|  | is completely replayed or the fsck command exits abnormally. | 