Understanding and Optimizing Disk IO



Disk IO

In most architectures, disk IO is typically the slowest component. Sizing and optimizing disk IO is therefore by far the most crucial aspect of any application deployment.

Understanding IO types

Disk IO is broadly classified as sequential or random. Sequential I/O is typical of large file reads and writes, and typically involves operating on one block immediately after its neighbor. With this type of I/O, there is little penalty associated with the disk drive head having to move to a new location. Random I/O, on the other hand, involves large numbers of seeks and rotations, and is usually much slower.

You can find out whether your IO needs are predominantly sequential or random by determining the average block size of each IO operation. You can use avgrq-sz (average request size) from iostat to determine this. If the average size per IO operation is around 16 KB or less, your IO is predominantly random. If, on the other hand, the average size per IO operation is greater than 128 KB, your IO is predominantly sequential.

Note that the results may differ between reads and writes, and may differ again for the combined total.

Command recipes:

  • iostat -dkx 1 - shows r/s, w/s, read and write KB/s, and the average size of each request. This value (avgrq-sz) is the average size of each IO operation in sectors (each sector is 512 bytes)
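The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the "mixed" label for sizes between 16 KB and 128 KB is an assumption, since the text does not name that middle range.

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors


def avg_request_kb(avgrq_sz_sectors):
    """Convert iostat's avgrq-sz (in sectors) to kilobytes."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


def classify_io(avgrq_sz_sectors, random_max_kb=16, sequential_min_kb=128):
    """Apply the 16 KB / 128 KB rule of thumb to one avgrq-sz reading."""
    kb = avg_request_kb(avgrq_sz_sectors)
    if kb <= random_max_kb:
        return "random"
    if kb > sequential_min_kb:
        return "sequential"
    return "mixed"  # assumption: the text does not label this range


for sectors in (8, 64, 512):
    print(sectors, "sectors =", avg_request_kb(sectors), "KB ->", classify_io(sectors))
```

An avgrq-sz of 8 sectors is 4 KB per request, which falls well inside the random range, while 512 sectors is 256 KB per request and falls in the sequential range.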

Types of disks

The table below provides an indicative comparison of various storage options, based on a sample selection of each device

Device               Capacity   IOPs      Cost        Cost per GB
SATA 7200 RPM        1024 GB    80        $149        $0.14
SAS 10000 RPM        600 GB     150       $450        $0.75
SAS 15000 RPM        600 GB     185       $450        $0.75
Flash MLC            360 GB     50,000    $989        $2.74
Flash SLC            -          -         2x of MLC   $5
RAM                  8 GB       200,000   $199        $24
Fusion-io/Virident   200 GB     320,000   $7200       $36

  • RPM in SAS and SATA drives refers to their rotational speed. The higher the rotational speed, the lower the seek time and average latency
  • Random seek time for non-flash hard drives refers to the average time taken by the drive head to move to a random location
  • Average latency refers to time taken for head to reach desired location on track to begin read / write
  • Seek time refers to time taken for head to reach desired track to begin read / write
  • IOPs refers to the number of random IO operations the disk is capable of doing per second. This number can vary significantly in actual practice depending on the nature of the IO. For instance while a SATA disk should do about 80 random IO operations per second it could do a significant multiple of that if the IO operations are sequential

Cost of IO

                                   1 KB request   128 KB request
Spindle speed (RPM)                10,000         10,000
Average rotational latency (ms)    3              3
Seek time (ms)                     4              4
Sustained transfer rate (MB/s)     100            100
Transfer rate (KB per ms)          102            102
Transfer time (ms)                 0.01           1.25
Total time per IO (ms)             7.01           8.25
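
The arithmetic behind this table can be reproduced with a short Python sketch. The function name and parameter defaults are illustrative, taken from the 10,000 RPM example above; the model assumes one full seek and half a revolution of rotational delay per request, which is a simplification.

```python
def io_time_ms(request_kb, rpm=10_000, seek_ms=4.0, transfer_mb_per_s=100.0):
    """Estimate the time for one random IO on a spinning disk, in ms."""
    # On average the platter must turn half a revolution to reach the data.
    rotational_latency_ms = 60_000 / rpm / 2
    # 100 MB/s is 102.4 KB per millisecond.
    kb_per_ms = transfer_mb_per_s * 1024 / 1000
    transfer_ms = request_kb / kb_per_ms
    return seek_ms + rotational_latency_ms + transfer_ms


for kb in (1, 128):
    print(f"{kb} KB request: {io_time_ms(kb):.2f} ms")
```

Under these assumptions a 128 KB request takes only about 18% longer than a 1 KB request, because seek time and rotational latency dominate the cost of each IO.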

Understanding RAID

RAID allows combining several homogeneous disks into a single array for easier management, larger storage partitions, improved performance, improved redundancy etc.

Types of RAID, performance impact and redundancy

  • RAID 0 - Read and write IOPs directly proportional to number of disks. Provides no redundancy, and a failure of one drive will result in complete loss of data
  • RAID 1 - Read IOPs proportional to number of disks. Write IOPs equal to that of a single disk. Can tolerate failure of up to one drive
  • RAID 5 - Read IOPs proportional to number of disks less 1. Write IOPs are proportional to number of disks less 1 but suffer from a write penalty due to parity calculations. In smaller RAID 5 arrays (consisting of 3 disks), writes may not improve and may actually degrade in comparison to a single disk. However, in larger RAID 5 arrays write performance will increase with the number of disks. Can tolerate failure of up to one drive
  • RAID 6 - Read IOPs proportional to number of disks less 2. Write IOPs are proportional to number of disks less 2 but suffer from a greater write penalty on account of there being 2 parity disks. Can tolerate failure of any 2 drives
  • RAID 10 - Read IOPs will be the sum of all disks. Write IOPs will be the sum of half the disks in the array. Can tolerate failure of 1 or more drives as long as they are not in the same mirror set. Failure of a drive still maintains the same IOPs since all remaining drives continue to be used
  • RAID 0+1 - Read IOPs will be the sum of all disks. Write IOPs will be the sum of half the disks in the array. Can tolerate failure of 1 or more drives as long as they are in the same stripe set. Failure of a drive reduces IOPs to half since the entire stripe set of the failed drive will not be usable
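
As a rough sketch, the proportionality rules above can be turned into an upper-bound estimator. The function name is hypothetical, and `write_penalty` is an assumed divisor for the parity overhead of RAID 5 and RAID 6: the text gives no figure for it, so it defaults to 1 (no penalty). RAID 6 writes are scaled by the number of data disks (total less 2).

```python
def raid_iops(level, disks, disk_iops, write_penalty=1.0):
    """Return (read_iops, write_iops) upper bounds for a healthy array."""
    if level == "0":
        return disks * disk_iops, disks * disk_iops
    if level == "1":
        # Reads can be served by any mirror; every write goes to all mirrors.
        return disks * disk_iops, disk_iops
    if level == "5":
        usable = disks - 1  # one disk's worth of capacity holds parity
        return usable * disk_iops, usable * disk_iops / write_penalty
    if level == "6":
        usable = disks - 2  # two disks' worth of capacity hold parity
        return usable * disk_iops, usable * disk_iops / write_penalty
    if level in ("10", "0+1"):
        # All disks serve reads; each write lands on both halves.
        return disks * disk_iops, disks // 2 * disk_iops
    raise ValueError(f"unsupported RAID level: {level}")


# Eight SATA disks at roughly 80 random IOPs each.
for level in ("0", "5", "6", "10"):
    print("RAID", level, raid_iops(level, disks=8, disk_iops=80))
```

These are ceilings derived from the rules of thumb only; real arrays will vary with controller cache, stripe size and the IO mix.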

Considerations in choosing a RAID type

  • For pure performance and increased storage capacity, use RAID 0. However, RAID 0 on its own is not recommended for any application, since it provides no redundancy.
  • For maximum storage capacity and redundancy at the cost of write performance, use RAID 6. Ideal for backups. It can work as a backend array to a flashcache where a lot of backend storage is required but the write load on the backend array is mostly sequential, for example mail servers where incoming mails are written to flash and asynchronously flushed to the backend drive. Note that the read performance of a RAID 6 is almost as good as that of a RAID 10
  • For decent storage capacity and redundancy, and high performance, use RAID 10. Ideal for databases, and as a backend array to a flashcache where the random write load to the backend SATA array is likely to be reasonably intensive, for example databases

Hardware RAID vs Software RAID

  • Independent servers used as storage boxes typically have significant spare CPU capacity and can therefore implement software RAID instead of investing in hardware RAID controller cards
  • Software RAID, however, may not provide hot swappability of drives. Additionally, RAID controller cards can be battery backed, offering a writeback mode and therefore increased write performance, which is not available in software RAID
  • Hardware RAID may be more reliable than software RAID.

Types of Storage systems

When the storage is directly connected to one single machine it is known as Direct Attached Storage (DAS). Direct attached storage is the easiest to set up and does not suffer from any latency on account of intermediate layers or network. The downside is that there is limited scope for expansion. DAS can consist of hard disks in a machine, or an external array connected to a RAID controller card in the machine. It can be expanded to some degree by adding additional JBODs to the external array.

An external storage device exposed as an NFS or CIFS mount is known as Network Attached Storage (NAS).

An external storage array exposed as a block device to a machine is known as a SAN. A SAN can consist of a network of storage devices. The servers can connect to a SAN over FC / iSCSI / Infiniband (fastest) or some such protocol.

Hardware appliances vs Server based SAN

A SAN or NAS device can be a hardware appliance, or a server chassis with several drives running an OS and exposing its storage over iSCSI or Infiniband. Hardware appliances may be more stable (though that is debatable). On the other hand, server boxes running Linux provide considerable flexibility. One can implement flashcache in these devices. One can upgrade RAM, which provides considerably more memory than the onboard cache on a hardware appliance. One can also monitor various performance counters on a Linux box, making it easier to determine the utilization and performance of the storage array. These boxes also provide greater flexibility in terms of storage expansion.

Considerations in choosing a storage subsystem

  • Random read and write intensive applications (e.g. small databases) that do not require a lot of space should opt for high capacity flash drives for storage, ideally in a RAID 1 configuration
  • Random read and/or write intensive applications (e.g. large databases / mail servers), where a common portion of the data is read most of the time, should use flashcache with an underlying RAID 10 or RAID 6 array of SATA drives. Once again, depending on the eventual size of the data, one can opt for disks within the server or use an external DAS/SAN. In case of an external DAS/SAN one may want to keep the flash drives within the same server to reduce latency. If the data is expected to grow significantly, use a DAS/SAN that is easily expandable without downtime
  • Applications that engage in a lot of sequential IO (e.g. backup servers) can skip flash drives and directly opt to write to a SAN comprising RAID 6 SATA drives
  • Always ensure the hard drives are hot swappable to the extent possible (some of the high performance flash drives are only available in PCIe form factor and hence may not be hot swappable)
  • Examples of storage subsystems are given below

Databases or Datastores with random reads and writes:

  • A minimum of 3 machines with high performance flash drives
    • Virident / Fusion-io drives are not hot-swappable, while SandForce drives may fail faster OR provide lower performance
  • Depending on the size of the database either the entire database is stored on the flash drive OR we use flashcache
  • Master slave configuration with reads served from all machines and writes sent to a single master
  • If we are using Virident or Fusion-io, the write master has drives in RAID 1 configuration while slaves have single drives
  • If we are using cheaper SandForce drives, all machines have drives in RAID 1 configuration
  • MMM for MySQL multi-master management
  • In case of a flash drive failure on one machine
    • if the drive failure is on the master and we are using Virident/Fusion-io
      • switch the writer role to another machine so that the first machine can be brought down to replace the flash drive
      • switch the writer role back after the machine has been brought up

Large file servers (photos/videos etc) or mail servers:

  • Data size is several terabytes and continues to grow rapidly
  • Most commonly accessed data is a few hundred GB. Older data is seldom accessed/used
  • 1 or 2 frontend machines with 2 or more hot swappable flash drives in RAID 1 or RAID 10 configuration
  • Backend SAN storage array with 0 or more JBODs
  • one or more RAID 6 arrays on SAN storage mounted on each front end box
  • Infiniband connect
  • Flashcache
  • Replication across front end machines done using glusterfs

Web Hosting servers:

  • n frontend machines with hot swappable SATA OS disks in RAID 1 running the host OS
  • Backend storage machine to store VMs with -
    • 2 hot swappable flash drives in RAID 1 for site data
    • 2 hot swappable flash drives in RAID 10 for database
    • 2 RAID 6 arrays, one for site data and one for database divided into partitions mounted on each frontend machine
    • Flashcache
    • ability to expand the storage via DAS/SAN and additional JBODs
    • the storage machine can be a thin box with attached DAS/SAN or a box with all the hard drives in it
      • The former gives the additional flexibility of a standby taking over the responsibility in case of a frontend box failure
      • Cost effectiveness may have to be determined
  • iSCSI connect
  • In case of drive failures, drives must be replaced immediately
  • Backups must be available for DR
  • Advantages
    • A front end machine going down does not impact uptime since the same VM can be instantly rebooted on another server
    • Adequate redundancy as a result of RAID 6 and RAID 10

Backup servers:

  • A front end machine
  • Backend SAN with JBODs connected over Infiniband
  • RAID 6 SATA array mounted on the frontend machine

Considerations in choosing an interconnect

One can connect to an external SAN / NAS using iSCSI / Infiniband / SAS connectors. The choice of connection largely depends on what the device supports. Infiniband is ideal since it is the fastest and costs the same as iSCSI. In fact, iSCSI suffers from the overheads of TCP and other software layers.

Understanding inodes

Unix makes a clear distinction between the contents of a file and the information about a file. With the exception of device files and files of special filesystems, each file consists of a sequence of bytes. The file does not include any control information, such as its length or an end-of-file (EOF) delimiter. All information needed by the filesystem to handle a file is included in a data structure called an inode. Each file has its own inode, which the filesystem uses to identify the file.

While filesystems and the kernel functions handling them can vary widely from one Unix system to another, they must always provide at least the following attributes, which are specified in the POSIX standard:

  • File type
  • Number of hard links associated with the file
  • File length in bytes
  • Device ID (i.e., an identifier of the device containing the file)
  • Inode number that identifies the file within the filesystem
  • UID of the file owner
  • User group ID of the file
  • Several timestamps that specify the inode status change time, the last access time, and the last modify time
  • Access rights and file mode
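
These attributes are what the stat() family of system calls returns. As an illustration, the following Python sketch reads them for a temporary file using os.stat(); the function name and the dictionary keys are hypothetical labels chosen to mirror the list above.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return the POSIX inode attributes of a file as a dictionary."""
    st = os.stat(path)
    mode_string = stat.filemode(st.st_mode)  # e.g. "-rw-------"
    return {
        "file_type": mode_string[0],  # "-" regular file, "d" directory, ...
        "hard_links": st.st_nlink,
        "length_bytes": st.st_size,
        "device_id": st.st_dev,
        "inode_number": st.st_ino,
        "owner_uid": st.st_uid,
        "group_gid": st.st_gid,
        "status_change_time": st.st_ctime,
        "last_access_time": st.st_atime,
        "last_modify_time": st.st_mtime,
        "access_rights": mode_string[1:],
    }


with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()  # make sure the five bytes have reached the file
    info = describe_inode(f.name)

for key, value in info.items():
    print(f"{key}: {value}")
```

Note that the file name is not part of this list: names live in directory entries that point at the inode, which is why one inode can have several hard links.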

Understanding IOPs

Figure 14.1

Each operation on a block device driver involves a large number of kernel components; the most important ones are shown in the above figure.

Let us suppose, for instance, that a process issued a read( ) system call on some disk file. We'll see that write requests are handled essentially in the same way. Here is what the kernel typically does to service the process request:

  • The service routine of the read( ) system call activates a suitable VFS function, passing to it a file descriptor and an offset inside the file. The Virtual Filesystem is the upper layer of the block device handling architecture, and it provides a common file model adopted by all filesystems supported by Linux.
  • The VFS function determines if the requested data is already available and, if necessary, how to perform the read operation. Sometimes there is no need to access the data on disk, because the kernel keeps in RAM the data most recently read from or written to a block device.
  • Let's assume that the kernel must read the data from the block device, thus it must determine the physical location of that data. To do this, the kernel relies on the mapping layer, which typically executes two steps:
    • It determines the block size of the filesystem including the file and computes the extent of the requested data in terms of file block numbers. Essentially, the file is seen as split into many blocks, and the kernel determines the numbers (indices relative to the beginning of file) of the blocks containing the requested data.
    • Next, the mapping layer invokes a filesystem-specific function that accesses the file's disk inode and determines the position of the requested data on disk in terms of logical block numbers. Essentially, the disk is seen as split in blocks, and the kernel determines the numbers (indices relative to the beginning of the disk or partition) corresponding to the blocks storing the requested data. Because a file may be stored in nonadjacent blocks on disk, a data structure stored in the disk inode maps each file block number to a logical block number.
  • The kernel can now issue the read operation on the block device. It makes use of the generic block layer, which starts the I/O operations that transfer the requested data. In general, each I/O operation involves a group of sectors that are adjacent on disk. Because the requested blocks are not necessarily adjacent on disk, the generic block layer might start several I/O operations.
  • The generic block layer hides the peculiarities of each hardware block device, thus offering an abstract view of the block devices. Because almost all block devices are disks, the generic block layer also provides some general data structures that describe "disks" and "disk partitions."
  • Below the generic block layer, the "I/O scheduler" sorts the pending I/O data transfer requests according to predefined kernel policies. The purpose of the scheduler is to group requests for data that lie near each other on the physical medium, and to order the requests so as to minimize seek time (by ordering them such that the heads move from inner to outer sectors as linearly as possible, as opposed to jumping all over the place). Each request to the I/O scheduler essentially describes the requested sectors and the kind of operation to be performed on them (read or write). The scheduler does not satisfy a request as soon as it is created; the I/O operation is just scheduled and will be performed at a later time. This artificial delay is paradoxically the crucial mechanism for boosting the performance of block devices. When a new block data transfer is requested, the I/O scheduler checks whether it can be satisfied by slightly enlarging a previous request that is still waiting (i.e., whether the new request can be satisfied without further seek operations). Because disks tend to be accessed sequentially, this simple mechanism is very effective.
  • Finally, the block device drivers take care of the actual data transfer by sending suitable commands to the hardware interfaces of the disk controllers.

As you can see, there are many kernel components that are concerned with data stored in block devices; each of them manages the disk data using chunks of different length:

  • The controllers of the hardware block devices transfer data in chunks of fixed length called "sectors." Therefore, the I/O scheduler and the block device drivers must manage sectors of data.
  • The Virtual Filesystem, the mapping layer, and the filesystems group the disk data in logical units called "blocks." A block corresponds to the minimal disk storage unit inside a filesystem.
  • The disk caches work on "pages" of disk data, each of which fits in a page frame.
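
The first step performed by the mapping layer, turning a byte range into file block numbers, is simple integer arithmetic. The sketch below illustrates it in Python; the function names and the 4096-byte block size are assumptions for the example. The second step, mapping file block numbers to logical block numbers on disk, depends on the filesystem's inode structures and is not shown.

```python
SECTOR_SIZE = 512  # bytes, the unit used by the controller and I/O scheduler


def file_blocks(offset, length, block_size=4096):
    """Indices (relative to the start of the file) of the blocks a read touches."""
    if length <= 0:
        return range(0)
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return range(first, last + 1)


def sectors_per_block(block_size=4096, sector_size=SECTOR_SIZE):
    """How many hardware sectors make up one filesystem block."""
    return block_size // sector_size


# A 5000-byte read starting at byte 6000 straddles two 4 KB blocks.
blocks = list(file_blocks(offset=6000, length=5000))
print("file blocks:", blocks)
print("sectors per block:", sectors_per_block())
```

If the two file blocks happen to be stored in adjacent logical blocks, the generic block layer can fetch them with one I/O operation; otherwise it has to start two.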

Command recipes:

  • iostat -dkx <interval> shows the number of read requests merged per second (rrqm/s) and the number of write requests merged per second (wrqm/s). Note that this is not the number of resulting merged requests but rather the number of times a request could be merged into another. Multiple merges result in a single request. The actual request counts after merging are shown in r/s and w/s

Understanding the disk buffer cache or page cache

The page cache is the main disk cache used by the Linux kernel. In most cases, the kernel refers to the page cache when reading from or writing to disk. New pages are added to the page cache to satisfy User Mode processes' read requests. If the page is not already in the cache, a new entry is added to the cache and filled with the data read from the disk. If there is enough free memory, the page is kept in the cache for an indefinite period of time and can then be reused by other processes without accessing the disk.

Similarly, before writing a page of data to a block device, the kernel verifies whether the corresponding page is already included in the cache; if not, a new entry is added to the cache and filled with the data to be written on disk. The I/O data transfer does not start immediately: the disk update is delayed for a few seconds, thus giving a chance to the processes to further modify the data to be written (in other words, the kernel implements deferred write operations).

As we have seen, the kernel keeps filling the page cache with pages containing data of block devices. Whenever a process modifies some data, the corresponding page is marked as dirty; that is, its PG_dirty flag is set.

Unix systems allow the deferred writes of dirty pages into block devices, because this noticeably improves system performance. Several write operations on a page in cache could be satisfied by just one slow physical update of the corresponding disk sectors. Moreover, write operations are less critical than read operations, because a process is usually not suspended due to delayed writings, while it is most often suspended because of delayed reads. Thanks to deferred writes, each physical block device will service, on the average, many more read requests than write ones.

A dirty page might stay in main memory until the last possible moment, that is, until system shutdown. However, pushing the delayed-write strategy to its limits has two major drawbacks:

  • If a hardware or power supply failure occurs, the contents of RAM can no longer be retrieved, so many file updates that were made since the system was booted are lost.
  • The size of the page cache, and hence of the RAM required to contain it, would have to be huge: at least as big as the size of the accessed block devices.

Therefore, dirty pages are flushed (written) to disk under the following conditions:

  • The page cache gets too full and more pages are needed, or the number of dirty pages becomes too large.
  • Too much time has elapsed since a page has stayed dirty.
  • A process requests all pending changes of a block device or of a particular file to be flushed; it does this by invoking a sync( ), fsync( ), or fdatasync( ) system call
    • sync( ) - Allows a process to flush all dirty buffers to disk
    • fsync( ) - Allows a process to flush all blocks that belong to a specific open file to disk
    • fdatasync( ) - Very similar to fsync( ), but doesn't flush the inode block of the file
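
From an application's point of view, forcing a flush looks like the following Python sketch, which uses the os.fsync() and os.fdatasync() wrappers around the system calls above. The helper name and its metadata_too parameter are hypothetical; os.fdatasync() is not available on every platform, so the sketch falls back to os.fsync() where it is missing.

```python
import os
import tempfile


def durable_write(path, data, metadata_too=True):
    """Write data to path and block until the kernel has flushed it to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # move Python's userspace buffer into the kernel page cache
        if metadata_too or not hasattr(os, "fdatasync"):
            os.fsync(f.fileno())      # flush the file's data and its inode
        else:
            os.fdatasync(f.fileno())  # flush the data, skip the inode block


with tempfile.TemporaryDirectory() as directory:
    target = os.path.join(directory, "example.dat")
    durable_write(target, b"payload", metadata_too=False)
    with open(target, "rb") as f:
        contents = f.read()

print(contents)
```

Without the explicit flush, the write would return as soon as the page was marked dirty in the page cache, and the data would reach the disk only when one of the other conditions above was met.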

Practically all read( ) and write( ) file operations rely on the page cache. The only exception occurs when a process opens a file with the O_DIRECT flag set: in this case, the page cache is bypassed and the I/O data transfers make use of buffers in the User Mode address space of the process. Several database applications make use of the O_DIRECT flag so that they can use their own disk caching algorithm.

Note: Using the O_DIRECT flag for writes means that data freshly written to disk is not available to read from RAM. Therefore, applications that need to read back freshly written data (e.g. mail delivery agents) should not use the O_DIRECT flag for writes. On the other hand, when writing large sequential backups or performing large sequential reads, one should use the O_DIRECT flag so as to prevent wastage of the page cache. Additionally, if the application has its own caching, it should use the O_DIRECT flag on reads so that it does not waste memory by caching the data in both its own cache and the page cache.

Note that while O_DIRECT mode bypasses the page cache, it still operates via the I/O scheduler. Therefore it should still optimize merging of reads and writes.

Flash drives, filesystems and I/O schedulers

Conventional I/O scheduler algorithms have been largely written for optimizing performance on mechanical drives. The optimizations include -

  • Merging smaller adjacent writes together to result in a bigger sequential write. This in general helps reduce disk seek time. Coincidentally, this optimization can also help flash storage to a certain extent, since flash storage writes may require erase cycles of the entire block on the flash drive, and therefore merging adjacent writes can help reduce the number of erase cycles as well as increase performance. However, in the case of reads this may not make a difference in performance, and skipping the merging of reads may actually result in lower latencies.
  • Read-ahead - where the kernel reads a few sectors ahead of the requested sector on the assumption that a request for data in one sector will result in subsequent requests in adjacent sectors. This is not required in flash drives since flash drives support high random IO speeds.
  • O_DIRECT mode for writes - in this mode data bypasses the buffer cache and is directly written to the underlying drive. O_DIRECT mode may result in higher performance by bypassing the additional layer of page cache as well as in turn conserving page cache space by not using it for writes. However in applications where data written is typically immediately accessed (eg mail servers) one should not use O_DIRECT mode as such.

Understanding journaling

The idea behind journaling is to perform each high-level change to the filesystem in two steps. Journaling comes into the picture when the filesystem data actually needs to be written to the disk from memory (either periodically or on an fsync() call). First, a copy of the blocks to be written is stored in the journal; then, when the I/O data transfer to the journal is completed (in short, data is committed to the journal), the blocks are written in the filesystem. When the I/O data transfer to the filesystem terminates (data is committed to the filesystem), the copies of the blocks in the journal are discarded.

The recovery process checks -

  • If the system failure occurred before a commit to the journal: Either the copies of the blocks relative to the high-level change are missing from the journal or they are incomplete; in both cases, the recovery process ignores them
  • If the system failure occurred after a commit to the journal: The copies of the blocks are valid, and are written into the filesystem

In the first case, the high-level change to the filesystem is lost, but the filesystem state is still consistent. In the second case, e2fsck applies the whole high-level change, thus fixing every inconsistency due to unfinished I/O data transfers into the filesystem.

Journaling filesystems however do not usually copy all blocks into the journal. In fact, each filesystem consists of two kinds of blocks: those containing the so-called metadata and those containing regular data. In the case of Ext2 and Ext3, there are six kinds of metadata: superblocks, group block descriptors, inodes, blocks used for indirect addressing (indirection blocks), data bitmap blocks, and inode bitmap blocks. Other filesystems may use different metadata.

Several journaling filesystems, such as SGI's XFS and IBM's JFS, limit themselves to logging the operations affecting metadata. In fact, metadata's log records are sufficient to restore the consistency of the on-disk filesystem data structures. However, since operations on blocks of file data are not logged, nothing prevents a system failure from corrupting the contents of the files.

ext3 and ext4 offer 3 different journaling modes -

writeback mode: In data=writeback mode, ext4 does not journal data at all. This mode provides a similar level of journaling to that of XFS, JFS, and ReiserFS in its default mode: metadata journaling only. A crash and recovery can cause incorrect data to appear in files that were written shortly before the crash. This mode typically provides the best ext4 performance, but it is not recommended unless you can live with some individual files getting corrupted in case of a system crash.

ordered mode: In data=ordered mode, ext4 only officially journals metadata, but it logically groups metadata information related to data changes with the data blocks into a single unit called a transaction. When it's time to write the new metadata out to disk, the associated data blocks are written first. In general, this mode performs slightly slower than writeback but significantly faster than journal mode. When appending data to files, data=ordered mode provides all of the integrity guarantees offered by ext3's full data journaling mode. However, if part of a file is being overwritten and the system crashes, it's possible that the region being written will contain a combination of original blocks interspersed with updated blocks. This is because data=ordered provides no guarantees as to which blocks are overwritten first, so you can't assume that just because overwritten block x was updated, overwritten block x-1 was updated as well. Instead, data=ordered leaves the write ordering up to the hard drive's write cache. In general, this limitation doesn't end up negatively impacting people very often, since file appends are generally much more common than file overwrites. For this reason, data=ordered mode is a good higher-performance replacement for full data journaling. The usage pattern where data that you really care about is overwritten regularly (as opposed to logs, which simply append) is rare, except in the case of database servers, which are covered by their own (append-only) write ahead logs. So in almost all circumstances one will only need data=ordered mode.

journal mode: data=journal mode provides full data and metadata journaling. All new data is written to the journal first, and then to its final location. In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state. This mode is the slowest, except when data needs to be read from and written to disk at the same time (e.g. mail servers), where it outperforms all other modes. Currently ext4 does not have delayed allocation support if this data journaling mode is selected.

Measuring IO


[user@server proc]$ cat /proc/diskstats
8 0 sda 138015 12167 5327670 1170008 2996261 19388203 179057288 102422068 0 2497264 103630084
8 1 sda1 334 672 5388 10776
8 2 sda2 4 17 152 1216
8 3 sda3 293 2323 162807 1302456
8 5 sda5 149543 5324514 22217747 177741952
8 16 sdb 2514 832 26747 14124 3128 3384 52096 45588 0 16304 59712

Field 1 – # of reads completed - This is the total number of reads completed successfully.
Field 2 – # of reads merged, field 6 – # of writes merged. Reads and writes which are adjacent to each other may be merged for efficiency. Thus two 4K reads may become one 8K read before it is ultimately handed to the disk, and so it will be counted (and queued) as only one I/O. This field lets you know how often this was done.
Field 3 – # of sectors read. This is the total number of sectors read successfully.
Field 4 – # of milliseconds spent reading. This is the total number of milliseconds spent by all reads (as measured from __make_request() to end_that_request_last())
Field 5 – # of writes completed. This is the total number of writes completed successfully.
Field 7 – # of sectors written. This is the total number of sectors written successfully.
Field 8 – # of milliseconds spent writing. This is the total number of milliseconds spent by all writes (as measured from __make_request() to end_that_request_last()).
Field 9 – # of I/Os currently in progress. The only field that should go to zero. Incremented as requests are given to appropriate struct request_queue and decremented as they finish.
Field 10 – # of milliseconds spent doing I/Os. This field increases as long as field 9 is nonzero, i.e. as long as even one IO request is being serviced. Note: iostat uses this field to determine %util for the underlying device. This can be misleading, since the underlying array of disks may be capable of servicing multiple requests in parallel. A non-zero value in field 9 therefore does not mean that the underlying system cannot take any more IO requests, and a %util of 100% does not necessarily mean that the disk subsystem is fully utilized; the device may still be capable of handling more requests.
Field 11 – weighted # of milliseconds spent doing I/Os. This field is incremented at each I/O start, I/O completion, I/O merge, or read of these stats by the number of I/Os in progress (field 9) times the number of milliseconds spent doing I/O since the last update of this field. This can provide an easy measure of both I/O completion time and the backlog that may be accumulating.
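
As a minimal sketch, the eleven fields described above can be parsed into a named structure with Python. The `DiskStats` type and its field names are illustrative labels, not kernel identifiers. The parser skips the short four-counter partition lines shown in the sample output, and ignores any columns beyond field 11, on the assumption that newer kernels may append extra counters.

```python
from collections import namedtuple

DiskStats = namedtuple("DiskStats", [
    "major", "minor", "name",
    "reads", "reads_merged", "sectors_read", "ms_reading",           # fields 1-4
    "writes", "writes_merged", "sectors_written", "ms_writing",      # fields 5-8
    "in_progress", "ms_doing_io", "weighted_ms_doing_io",            # fields 9-11
])


def parse_diskstats_line(line):
    """Parse one /proc/diskstats line; return None for short partition lines."""
    parts = line.split()
    if len(parts) < 14:
        return None
    major, minor, name = int(parts[0]), int(parts[1]), parts[2]
    return DiskStats(major, minor, name, *map(int, parts[3:14]))


sample = ("8 0 sda 138015 12167 5327670 1170008 2996261 19388203 "
          "179057288 102422068 0 2497264 103630084")
sda = parse_diskstats_line(sample)
print(sda.name, "reads:", sda.reads, "writes:", sda.writes)
print("sectors per read since boot:", sda.sectors_read / sda.reads)
```

On a live system the same function can be applied to each line of open("/proc/diskstats"). All counters are cumulative since boot, so only differences between two readings are meaningful for monitoring.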


IO utilization tools like iostat and vmstat that monitor overall IO typically use /proc/diskstats to do so. This file contains IO counters for each disk. iostat provides a meaningful representation of these stats; its values are entirely calculated from /proc/diskstats. Below is a description of the relevant values. Note: iostat values are only meaningful as the difference between two samples. Therefore when running iostat the first set of values (which covers the time since boot) should always be discarded.

  • rrqm/s - The number of read requests merged per second that were queued to the device.
  • wrqm/s - The number of write requests merged per second that were queued to the device.
  • r/s - The number (after merges) of read requests completed per second for the device. This value together with w/s provides total IOPs per second.
  • w/s - The number (after merges) of write requests completed per second for the device.
  • rsec/s (rkB/s, rMB/s) - The number of sectors (kilobytes, megabytes) read from the device per second.
  • wsec/s (wkB/s, wMB/s) - The number of sectors (kilobytes, megabytes) written to the device per second.
  • avgrq-sz - The average size (in sectors) of the requests that were issued to the device in the time interval. Each sector is 512 bytes. This value provides the average size of each IO request and hence denotes whether the IO is largely random (<16 KB) or largely sequential (>128 KB)
  • avgqu-sz - The average queue length of the requests that were issued to the device
  • await - The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. This is equal to the cumulative time taken by all IO requests completed in the interval divided by the number of those requests, ie (field8'+field4')/(field5'+field1'), where a prime denotes the change in that /proc/diskstats field over the interval
  • r_await - The average time (in milliseconds) for read requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. This is equal to the cumulative time taken by all read requests completed in the interval divided by the number of those read requests
  • w_await - The average time (in milliseconds) for write requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them. This is equal to the cumulative time taken by all write requests completed in the interval divided by the number of those write requests
  • svctm - The average service time (in milliseconds) for I/O requests that were issued to the device. This field is inaccurate for two reasons. Firstly the kernel seems to have a bug where the underlying fields used to compute this sometimes become negative. Secondly this field is calculated based on %util, which has its own inaccuracies. In general, on systems with multiple disks svctm may be over-reported, ie the actual service time may be lower than what you see by a significant factor.
  • %util - Percentage of elapsed time during which I/O requests were issued to the device (bandwidth utilization for the device). This field is calculated by incrementing a timer for each millisecond where there is at least one IO request in the queue. The counter's increase during the interval represents the number of milliseconds during which at least one request was in queue. Dividing this value by the interval time provides device utilization. This number would be entirely accurate on a single disk system. However on a multi-disk system, when the IO requests are small and multiple IO requests can be serviced independently by the underlying disks, this value is inaccurate since the counter is incremented irrespective of whether the current queue depth is 1 or 10. For example, if your disk array can service 10 requests at a time but the queue holds only 1 request every millisecond, %util will show as 100% even though your disks are only 10% utilized. The inaccuracy however may not be that drastic, since some IO requests may get serviced in under 1 ms and hence a queue depth of 1 will generally represent a higher number of requests. For a more elaborate explanation refer to my comments on http://www.xaprb.com/blog/2010/09/06/beware-of-svctm-in-linuxs-iostat/
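
The iostat values described above can be derived from two /proc/diskstats samples. The following is a rough sketch of that arithmetic, not iostat's actual source: the function name, the tuple layout (fields 1 to 11 in order) and the sample numbers are assumptions made for illustration, and the avgqu-sz formula reflects my understanding that it comes from field 11.

```python
def iostat_from_deltas(prev, curr, interval_s):
    """Derive iostat-style values from two samples of diskstats fields 1..11.

    prev and curr are sequences of the 11 counters in field order.
    """
    d = [c - p for p, c in zip(prev, curr)]
    reads, rmerged, rsect, rms, writes, wmerged, wsect, wms = d[:8]
    # d[8] is field 9 (I/Os in progress), a gauge rather than a counter.
    busy_ms, weighted_ms = d[9], d[10]
    ios = reads + writes
    interval_ms = interval_s * 1000.0
    return {
        "rrqm/s": rmerged / interval_s,
        "wrqm/s": wmerged / interval_s,
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "rkB/s": rsect * 512 / 1024 / interval_s,  # sectors are 512 bytes
        "wkB/s": wsect * 512 / 1024 / interval_s,
        "avgrq-sz": (rsect + wsect) / ios if ios else 0.0,  # in sectors
        "avgqu-sz": weighted_ms / interval_ms,
        "await": (rms + wms) / ios if ios else 0.0,  # (field8'+field4')/(field5'+field1')
        "r_await": rms / reads if reads else 0.0,
        "w_await": wms / writes if writes else 0.0,
        "%util": 100.0 * busy_ms / interval_ms,
    }


# Two invented samples taken one second apart.
PREV = (1000, 10, 16000, 4000, 500, 20, 8000, 3000, 0, 5000, 7000)
CURR = (1100, 15, 17600, 4500, 600, 40, 11200, 4500, 2, 5800, 9000)
report = iostat_from_deltas(PREV, CURR, 1.0)
for key in ("r/s", "w/s", "avgrq-sz", "await", "%util"):
    print(key, report[key])
```

In this sample avgrq-sz is 24 sectors (12 KB), which by the rule of thumb above would indicate predominantly random IO.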

Analyzing system wide IO stats:
While the above stats provide good trends they do not allow one to monitor whether a disk subsystem is saturated. We can analyze the values as follows -

  • IOPs: One can typically calculate a rough estimate of the max IOPs that your disks are capable of. Flash drives provide this data (though one should run tests to confirm). In case of SATA drives one can calculate this using disk seek time and rotational latency. In case of RAID arrays one can add up the IOPs of all the disks based on the RAID type (refer to the RAID section above) and determine a rough count of the read IOPs and write IOPs the system is capable of. Using this one can determine roughly whether the disk is saturated based on the number of IOPs being sent to it.
  • await: This time accurately represents the average time each request took from the time it was issued to the time it completed. One can determine if the number here is acceptable. await times in double digit milliseconds may be acceptable in terms of latency. This analysis is not accurate but merely indicative. If await times go up significantly it may result in unacceptable response times for your application. Note that await includes the entire time taken for queuing and servicing a request. Hence in case of network filesystems it includes the time taken for the data to be transferred between the SAN and the machine.
  • %util: While this figure is inaccurate on multi-disk systems with lots of random IO, it is indicative. On such systems %util is generally over-reported by a factor, as explained above. However a consistent increase in %util does demonstrate higher and higher utilization of the underlying disks. One should look at %util in conjunction with await. If at 100% util the await has not substantially increased from before then it signifies that the disks are not saturated. If however await has taken a significant jump it may signify a saturation of disks
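
The seek-and-latency estimate and the "%util together with await" check described above can be sketched as follows. This is a back-of-the-envelope model only: the function names, the 99% cut-off and the jump factor of 2 are arbitrary assumptions for illustration, and average rotational latency is taken as half a revolution.

```python
def rotational_iops(avg_seek_ms, rpm):
    """Rough random-IOPS ceiling for a single spinning disk."""
    # Half a revolution on average before the sector passes under the head.
    rotational_latency_ms = 0.5 * 60000.0 / rpm
    return 1000.0 / (avg_seek_ms + rotational_latency_ms)


def looks_saturated(util_pct, await_ms, baseline_await_ms, jump_factor=2.0):
    """Heuristic: high %util only suggests saturation if await has also jumped.

    jump_factor and the 99% threshold are arbitrary values for this sketch.
    """
    return util_pct >= 99.0 and await_ms >= jump_factor * baseline_await_ms


# A 7200 RPM drive with an assumed 8.5 ms average seek time.
print(round(rotational_iops(8.5, 7200)))
# 100% util with await near its baseline: probably not saturated.
print(looks_saturated(100.0, 8.0, 7.0))
# 100% util with await several times its baseline: possibly saturated.
print(looks_saturated(100.0, 40.0, 7.0))
```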

Note: A typical flash drive can perform 50000 random IOPs in a second. This means 20 microseconds per IO operation. The %util number is based on computing the number of milliseconds during which the IO queue had at least one request. This counter is measured in increments of milliseconds. However if operations take significantly less than a millisecond this number may not reflect the utilization accurately, especially if the time interval between measurements is not long enough. This is conjecture on my part.


[user@server ~]$ cat /proc/23720/io
rchar: 169015898848
wchar: 29381578785
syscr: 146493522
syscw: 11102300
read_bytes: 19410272256
write_bytes: 29825761280
cancelled_write_bytes: 809271296

Field Descriptions:
rchar - bytes read
wchar - bytes written
syscr - number of read syscalls
syscw - number of write syscalls
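read_bytes - number of bytes this process caused to be read from underlying storage
write_bytes - number of bytes this process caused to be written to underlying storage
cancelled_write_bytes - number of bytes of write IO this process caused not to happen by truncating pagecache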

I/O counter: chars read
The number of bytes which this task has caused to be read from storage. This is simply the sum of bytes which this process passed to read() and pread(). It includes things like tty IO and it is unaffected by whether or not actual physical disk IO was required (the read might have been satisfied from pagecache)

I/O counter: chars written
The number of bytes which this task has caused, or shall cause to be written to disk. Similar caveats apply here as with rchar.

I/O counter: read syscalls
Attempt to count the number of read I/O operations, i.e. syscalls like read() and pread().

I/O counter: write syscalls
Attempt to count the number of write I/O operations, i.e. syscalls like write() and pwrite().

I/O counter: bytes read
Attempt to count the number of bytes which this process really did cause to be fetched from the storage layer. Done at the submit_bio() level, so it is accurate for block-backed filesystems

I/O counter: bytes written
Attempt to count the number of bytes which this process caused to be sent to the storage layer. This is done at page-dirtying time.

I/O counter: cancelled write bytes
The big inaccuracy here is truncate. If a process writes 1MB to a file and then deletes the file, it will in fact perform no writeout, but it will have been accounted as having caused 1MB of write. This counter therefore records the number of bytes which this process caused to not happen, by truncating pagecache. A task can cause "negative" IO too: if this task truncates some dirty pagecache, some IO which another task has been accounted for (in its write_bytes) will not be happening. We could just subtract that from the truncating task's write_bytes, but there is information loss in doing that.

This command is the holy grail of IO and disk cache monitoring. If I interpret this correctly, (rchar - read_bytes) / rchar represents the fraction of the bytes read by a particular process that were served from the disk cache rather than from the underlying storage.
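
The ratio described above can be computed from the output of /proc/<pid>/io. Here is a minimal sketch using the sample output shown earlier; the function names are invented for illustration, and the result is only an approximation since rchar also counts things like tty IO.

```python
def parse_proc_io(text):
    """Turn the 'name: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def page_cache_read_ratio(counters):
    """Approximate fraction of read bytes served without hitting storage."""
    rchar = counters["rchar"]
    if rchar == 0:
        return None
    # read_bytes can exceed rchar (e.g. readahead), so clamp at zero.
    return max(0.0, (rchar - counters["read_bytes"]) / rchar)


# The sample output from this page.
SAMPLE_IO = """rchar: 169015898848
wchar: 29381578785
syscr: 146493522
syscw: 11102300
read_bytes: 19410272256
write_bytes: 29825761280
cancelled_write_bytes: 809271296
"""
io_counters = parse_proc_io(SAMPLE_IO)
ratio = page_cache_read_ratio(io_counters)
print("served from cache: %.1f%%" % (100.0 * ratio))
```

For the sample process above this works out to roughly 88.5% of read bytes being satisfied without storage IO.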


pidstat can be used to monitor per process IO stats. pidstat uses /proc/<pid>/io to provide IO counters per process and per thread. Note: the values of pidstat are only meaningful between two intervals. Therefore when running pidstat the first set of values should always be completely discarded. pidstat provides the following stats per process and LWP -

  • PID - The identification number of the task being monitored.
  • kB_rd/s - Number of kilobytes the task has caused to be read from disk per second in the interval
  • kB_wr/s - Number of kilobytes the task has caused, or shall cause to be written to disk per second in the interval
  • kB_ccwr/s - Number of kilobytes whose writing to disk has been cancelled by the task in the interval. This may occur when the task truncates some dirty pagecache. In this case, some IO which another task has been accounted for will not be happening.
  • Note: actual kB/s written would be (kB_wr/s - kB_ccwr/s)
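
The per-process rates above, including the net write figure from the note, can be sketched from two samples of the process IO counters. This is an illustration of the arithmetic rather than pidstat's actual implementation; the function name, the "net_kB_wr/s" key and the sample values are made up.

```python
def pidstat_io_rates(prev, curr, interval_s):
    """Compute pidstat-style kB/s rates from two samples of per-process counters.

    prev and curr are dicts holding read_bytes, write_bytes and
    cancelled_write_bytes, as found in /proc/<pid>/io.
    """
    def kb_per_s(key):
        return (curr[key] - prev[key]) / 1024.0 / interval_s

    rd = kb_per_s("read_bytes")
    wr = kb_per_s("write_bytes")
    ccwr = kb_per_s("cancelled_write_bytes")
    return {
        "kB_rd/s": rd,
        "kB_wr/s": wr,
        "kB_ccwr/s": ccwr,
        # Writes that will actually reach the disk: kB_wr/s - kB_ccwr/s
        "net_kB_wr/s": wr - ccwr,
    }


# Two invented samples taken two seconds apart.
BEFORE = {"read_bytes": 0, "write_bytes": 0, "cancelled_write_bytes": 0}
AFTER = {"read_bytes": 2048000, "write_bytes": 4096000, "cancelled_write_bytes": 1024000}
rates = pidstat_io_rates(BEFORE, AFTER, 2.0)
print(rates)
```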

Optimizing Storage

Planning your storage system

  • Sequential IO
    • You could get away with a RAID array of SATA/SAS drives
    • eg sequential backups / video servers / large file servers
  • Small data set with random IO
    • Use directly connected flash drives in RAID 1 for your storage
    • Provide as much RAM as can store the most frequently used data in RAM
    • eg databases
  • Large dataset with random IO
    • Use flashcache. Depending on whether your application is read intensive or write intensive you can set the flashcache threshold based on the following scenarios
      • If your app is read intensive set the dirty threshold to a lower value. This will free up recently written pages on the flashcache faster and create room for new pages to be read into flashcache from the underlying system eg databases
      • If your app is write intensive, set the dirty threshold to a higher value. This will ensure that when data is written back from the flash drive to the backend SATA array, the writes are larger sequential writes.
      • If your app is read and write intensive, and the data read is most likely to be the data last written, set the dirty threshold to a higher value. This will ensure that the most recently written data is always in flashcache and older data is sent to the backend disk. eg mail servers
    • Provide as much RAM as can store the most frequently used data in RAM
  • Type of SAN/DAS to be used
    • Use a simple DAS within the main machine chassis if the total storage of the box is not intended to go beyond a certain amount
    • If the storage requirement will constantly grow you can use a SAN which is expandable to meet the storage needs
    • If multiple machines are intended to share the storage and the storage needs to be flashcached at the storage layer you will need to use a server based SAN (eg delta-v's or Jackrabbits) preferably connected via InfiniBand
  • RAID selection
    • As described in the RAID section above
  • Filesystem based replication
    • In certain applications you may want to set up filesystem based replication for redundancy, for instance mail servers, file servers, image servers and video servers. Glusterfs provides a user space based filesystem with replication so that in case of downtime of a storage system you do not have any data loss

Optimizing your storage system

  • Is your storage system maxing out
    • High %iowait
    • High %util
  • Determine whether IO is sequential or random
  • Determine what is killing you - reads/writes
  • Sequential reads or writes
    • You should not see a high IOPs value, however the throughput in terms of KB/s should be high
    • Reducing IOPs may not be feasible and may not even help if most IOPs are already large enough in size that the disks are not spending much of their time seeking
    • Additional rotational drives / faster rotational drives will help
    • If it is sequential reads then additional RAM can help too
    • data that will be accessed together should be stored together
    • in write heavy operations try to merge writes into sequential writes
  • Random reads or writes
    • Flash drives
    • Flashcache
    • Increase RAM (helps mostly for reads unless your application is not performing immediate fsync()'s for each write)
    • Optimize code to reduce IOPs
    • Ensure that the cache is not invalidated by full backup / vacuum and other such processes. I believe you may be able to figure this out by taking a look at the "Active memory" graph during such processes. Under normal circumstances, since Linux uses an LRU/2 algorithm for its page cache, a page will need to be accessed at least twice for it to make it to the active list (refer to memory section). A typical backup script which reads the entire filesystem will not result in pages being accessed twice and therefore may not impact your disk cache drastically. One can also check the page cache constituents per file using fincore immediately before and after a backup script is run and see how different the outputs are. Lastly, if a large chunk of your memory is used by disk cache and a reasonable chunk of that is inactive at all times, it may signify that most of the data that your application needs is available in the disk cache. This line of thought is speculative and has not been verified
  • Ensure that flashcache is not invalidated by full backup / vacuum and other such processes. You can figure this out by your cache hit/miss ratios during and after backup times. Ensure that backup processes and such like always bypass flashcache. Refer to the section on flashcache for further details
  • Optimize flashcache (refer to the flashcache section)
  • noatime: Linux records information about when files were created and last modified as well as when it was last accessed. There is a cost associated with recording the last access time. Linux has a special mount option for file systems called noatime which disables this and therefore read accesses to the file system will no longer result in an update to the atime information associated with the file.
  • nobarrier: This disables the use of write barriers (the barrier option enables them). Write barriers enforce proper on-disk ordering of journal commits, making volatile disk write caches safe to use, at some performance penalty. If your disks are battery-backed or you are not using your harddisk cache, disabling barriers will improve performance. Barriers themselves improve the integrity of the filesystem at the cost of some performance. From this LWN article: "The filesystem code must, before writing the [journaling] commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times."
  • commit=nrsec: Ext4 can be told to sync all its data and metadata every 'nrsec' seconds. The default value is 5 seconds. This means that if you lose your power, you will lose as much as the latest 5 seconds of work (your filesystem will not be damaged though, thanks to the journaling). This default value (or any low value) will hurt performance, but it's good for data-safety. Setting it to 0 will have the same effect as leaving it at the default (5 seconds). Setting it to very large values will improve performance. If you have writes being replicated to multiple machines and you do not expect all of them to crash at the same time due to power issues or any such concerns you could theoretically set this value substantially higher and enhance performance considerably. The performance improvement occurs both due to the ability to combine multiple writes into one single larger write, as well as cancelling out updates to previous writes within the commit time window. Note: I believe this setting only governs how often the pdflush threads are woken up. Whether dirty pages get written out to disk or not depends on the pdflush options set (refer below)
  • Hard disk writeback cache: Enabling harddisk writeback caches can significantly enhance write performance since the harddisk will cache writes and send them to the media in one go. If the harddisks are connected to a battery backed unit that will ensure that they flush their data in case of a power failure then one should enable writeback caches. However if there are chances of random power failures or cold machine reboots then with the harddisk cache enabled you can lose data and, much worse, end up with filesystem inconsistencies. The latter can be prevented by enabling write barriers.
  • journaling mode (slowest to fastest): journal, ordered (recommended), writeback
  • bh/nobh: "bh" option forces use of buffer heads. "nobh" option tries to avoid associating buffer heads (supported only for "writeback" mode)
  • journal_checksum: seemingly the documentation states this may increase performance, though the same is yet to be investigated
  • scheduler: select the appropriate scheduler based on drive and IO operation types
  • pdflush options - dirty_writeback_centisecs, dirty_expire_centisecs, dirty_background_ratio and dirty_ratio. These parameters govern how often dirty pages get written to the underlying disk. For write intensive workloads one may optimize this by increasing thresholds at the risk of data loss. If one has adequate data replication this may not be of concern. Refer http://www.westnet.com/~gsmith/content/linux-pdflush.htm for a detailed explanation
  • vm tuning options - http://www.redhat.com/magazine/001nov04/features/vm/
  • swappiness - /proc/sys/vm/swappiness. Check http://www.westnet.com/~gsmith/content/linux-pdflush.htm for a detailed explanation
  • fsync() - Many applications use fsync() to confirm commit of data to disk (eg mail servers, database servers wrt WAL etc). This is required in a critical environment where a confirmation of a transaction to a user or client can only be given once the system knows the same has been persisted to disk, otherwise in case of a crash the data that is in the page cache will be lost even though the client has received a confirmation that the same has been committed. However if one has multiple replicated machines (file servers) each of which is storing the same data, then one can in certain cases afford to turn fsync() off. This is because in the worst case if one system crashes the other replicas still have a copy of the data in memory. One can therefore delay committing to disk. One should ensure however that the replicas draw independent power or better yet are geo-distributed. Note: This only helps if your disk IO is write intensive
  • IO alignment: This is to be researched in detail. However conceptually one should try and ensure that sizes of blocks at various layers align for higher performance - ie filesystem block size, RAID chunk size, RAID stripe size and harddisk sector/block size. The harddisk sector/block size is fixed. You need to derive optimum values for everything else based on research and device type. For instance this may be different for a flash device as compared to a hard disk
  • RAID stripe size: Based on your RAID type and median IO request size you will need to optimize your RAID stripe size. In fact the stripe count and sizes will also depend on access patterns (is it multiple processes accessing a single file or multiple processes accessing multiple files)
  • Separating Journals, WALs and other IO types: In your deployment you may have filesystem journals, write-ahead logs for databases and in general different types of IO access patterns for different devices. Each of these may need a different setting for journaling, and different RAID/filesystem settings. Therefore it may be prudent to separate these judiciously based on research and testing.
  • Selecting a filesystem: Each filesystem has specific features that make it suitable to a certain usage pattern. Perform adequate research in selecting your filesystem for your workload.
  • Interconnect: On SANs select the right interconnect for your requirements (SAS/iSCSI/InfiniBand)
  • Network settings: When using a storage system over ethernet (iSCSI) there are various network topologies and settings that can optimize performance (eg jumbo frames, packet sizes)
  • Choosing the right IO scheduler
  • Tuning page cache settings
  • fadvise: Using fadvise() in your disk i/o calls allows the kernel to make more intelligent decisions on read-ahead and caching techniques
  • Use http://code.google.com/p/pagecache-mangagement/ - This tool allows the user to limit the amount of pagecache used by applications under Linux. This is similar to nice, ionice etc. in that it usually doesn't make an application go faster, but does reduce the impact of the application on other applications performance. This is especially useful for applications that walk sequentially through data sets larger than memory, as discarding their pagecache does not reduce their performance (although this tool does add overhead of about 2%).
  • Use ionice to reduce the IO priority of runaway processes that are hogging IO. Refer http://linux.die.net/man/1/ionice
    • Note this is supported only with the CFQ scheduler
  • Track kernel updates. In almost all updates there are updates to kernel code for Block devices as well as filesystems which can result in significant IO performance enhancements
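
As an illustration of the fadvise item above, here is a minimal sketch of a backup-style sequential read that hints its access pattern to the kernel and then asks it to drop the pages it pulled in, so that the page cache is less likely to be disturbed. It assumes Python's os.posix_fadvise, which is only available on Unix, and falls back to a plain read elsewhere. The function name and the temporary demo file are made up, and the advice is only a hint that the kernel is free to ignore.

```python
import os
import tempfile


def read_without_caching(path, chunk_size=1 << 20):
    """Read a file start to end with fadvise hints; return the bytes read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        has_fadvise = hasattr(os, "posix_fadvise")
        if has_fadvise:
            # Offset 0 and length 0 mean "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        if has_fadvise:
            # Tell the kernel these pages are not expected to be reused.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total


# Demo on a throwaway 3,000,000 byte file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 3000000)
demo_total = read_without_caching(tmp.name)
os.unlink(tmp.name)
print(demo_total)
```

Note that POSIX_FADV_DONTNEED applies to the file's cached pages regardless of which process brought them in, so this approach suits files that other applications are not actively relying on.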

Optimizing your filesystem for flash drives

  • If you can live without journalling support then disable journalling for higher performance
  • Use the NOOP scheduler. The NOOP scheduler is best used with devices that do not depend on mechanical movement to access data. Flash drives do not require re-ordering of multiple I/O requests, a technique used by other schedulers to group together I/O requests that are physically close together on the disk.
  • A log-structured filesystem is a file system designed for high write throughput: all updates to data and metadata are written sequentially to a continuous stream, called a log. On flash memory, where seek times are usually negligible, the log structure may not confer a worthwhile performance gain because write fragmentation has much less of an impact on write throughput. However many flash based devices can only write a complete block at a time, and they must first perform a (slow) erase cycle of each block before being able to write. Putting all the writes in one block can therefore help performance, as opposed to writes scattered into various blocks, each one of which must be copied into a buffer, erased, and written back. Currently log-structured filesystems are not mainstream production grade. You may want to try them out but should perform adequate benchmarking and testing. Options include LogFS and UBIFS
  • RAID stripe size: Ideally RAID 5/6 should never be used for flash drives due to the extra parity writes as well as the requirement of performing a read-read-write-write cycle for each write. However if one does use RAID 5/6 then ensure the default chunk size is small and equivalent to the flash block size (like say 4 KB)
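
To check which scheduler a device is using before switching to NOOP, one can read /sys/block/<dev>/queue/scheduler, where the active scheduler is shown in square brackets. The sketch below only parses that text; the function name is invented, and the sample strings are typical examples rather than output captured from a real machine.

```python
def parse_scheduler(text):
    """Parse scheduler file contents such as 'noop deadline [cfq]'.

    Returns (active_scheduler, list_of_available_schedulers).
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


active, available = parse_scheduler("noop deadline [cfq]\n")
print("active:", active, "available:", available)
if active != "noop" and "noop" in available:
    print("NOOP is available but not selected")
```

Switching is done by writing the scheduler name to the same file as root.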

Some flowcharts


  • considerations when modeling data
  • write operations - O_DIRECT
    • are you going to read this data shortly?
    • are you committing write to disk immediately?
  • write operations - delayed fsync
    • can you live with data loss
    • do you have a way to recover data loss
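
The "delayed fsync" consideration above can be sketched as follows: instead of calling fsync() after every record, the writer syncs once per batch, trading a window of possible data loss for fewer forced writes. The function name, batch size and demo file are assumptions for illustration only; whether the loss window is acceptable depends on the questions listed above.

```python
import os
import tempfile


def append_with_delayed_fsync(path, records, sync_every):
    """Append records to path, calling fsync once per sync_every records.

    Returns the number of fsync calls made. Records written since the last
    fsync may be lost if the machine crashes.
    """
    syncs = 0
    pending = 0
    with open(path, "ab") as f:
        for record in records:
            f.write(record)
            pending += 1
            if pending >= sync_every:
                f.flush()             # push Python's buffer to the kernel
                os.fsync(f.fileno())  # ask the kernel to persist it
                syncs += 1
                pending = 0
        if pending:
            # Persist the final partial batch.
            f.flush()
            os.fsync(f.fileno())
            syncs += 1
    return syncs


# Demo: 10 records synced in batches of 4 (two full batches plus a remainder).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
demo_syncs = append_with_delayed_fsync(demo_path, [b"record\n"] * 10, sync_every=4)
demo_size = os.path.getsize(demo_path)
os.unlink(demo_path)
print(demo_syncs, demo_size)
```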





All content in the Directi Wiki is licensed under a Creative Commons Attribution-Share Alike 3.0 .