| ======================================= | 
 | Oracle Data Analytics Accelerator (DAX) | 
 | ======================================= | 
 |  | 
 | DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 | 
 | (DAX2) processor chips, and has direct access to the CPU's L3 caches | 
 | as well as physical memory. It can perform several operations on data | 
 | streams with various input and output formats.  A driver provides a | 
 | transport mechanism and has limited knowledge of the various opcodes | 
 | and data formats. A user space library provides high level services | 
 | and translates these into low level commands which are then passed | 
 | into the driver and subsequently the Hypervisor and the coprocessor. | 
 | The library is the recommended way for applications to use the | 
 | coprocessor, and the driver interface is not intended for general use. | 
 | This document describes the general flow of the driver, its | 
 | structures, and its programmatic interface. It also provides example | 
 | code sufficient to write user or kernel applications that use DAX | 
 | functionality. | 
 |  | 
 | The user library is open source and available at: | 
 |  | 
 |     https://oss.oracle.com/git/gitweb.cgi?p=libdax.git | 
 |  | 
 | The Hypervisor interface to the coprocessor is described in detail in | 
 | the accompanying document, dax-hv-api.txt, which is a plain text | 
 | excerpt of the (Oracle internal) "UltraSPARC Virtual Machine | 
 | Specification" version 3.0.20+15, dated 2017-09-25. | 
 |  | 
 |  | 
 | High Level Overview | 
 | =================== | 
 |  | 
 | A coprocessor request is described by a Command Control Block | 
 | (CCB). The CCB contains an opcode and various parameters. The opcode | 
 | specifies what operation is to be done, and the parameters specify | 
 | options, flags, sizes, and addresses.  The CCB (or an array of CCBs) | 
 | is passed to the Hypervisor, which handles queueing and scheduling of | 
 | requests to the available coprocessor execution units. A status code | 
 | returned indicates if the request was submitted successfully or if | 
 | there was an error.  One of the addresses given in each CCB is a | 
 | pointer to a "completion area", which is a 128 byte memory block that | 
 | is written by the coprocessor to provide execution status. No | 
 | interrupt is generated upon completion; software must poll the | 
 | completion area to find out when a transaction has finished. To make | 
 | this polling efficient, the M7 and later processors provide a | 
 | mechanism to pause the virtual processor until the coprocessor has | 
 | updated the completion status. This is done using the monitored load | 
 | and mwait instructions, which are described in more detail later. | 
 | The DAX coprocessor was designed so that after a request is | 
 | submitted, the kernel is no longer involved in processing it.  The | 
 | polling is done at the user level, which results in almost zero | 
 | latency between completion of a request and resumption of execution | 
 | of the requesting thread. | 
 |  | 
 |  | 
 | Addressing Memory | 
 | ================= | 
 |  | 
 | The kernel does not have access to physical memory in the Sun4v | 
 | architecture, as there is an additional level of memory virtualization | 
 | present. This intermediate level is called "real" memory, and the | 
 | kernel treats this as if it were physical.  The Hypervisor handles the | 
 | translations between real memory and physical so that each logical | 
 | domain (LDOM) can have a partition of physical memory that is isolated | 
 | from that of other LDOMs.  When the kernel sets up a virtual mapping, | 
 | it specifies a virtual address and the real address to which it should | 
 | be mapped. | 
 |  | 
 | The DAX coprocessor can only operate on physical memory, so before a | 
 | request can be fed to the coprocessor, all the addresses in a CCB must | 
 | be converted into physical addresses. The kernel cannot do this since | 
 | it has no visibility into physical addresses. Instead, a CCB may | 
 | contain virtual addresses, real addresses, or a combination of the | 
 | two. An "address type" field is available for each address that may | 
 | be given in the CCB. In all cases, the Hypervisor translates every | 
 | address to physical before dispatching to hardware. Address | 
 | translations are performed using the context of the process initiating | 
 | the request. | 
 |  | 
 |  | 
 | The Driver API | 
 | ============== | 
 |  | 
 | An application makes requests to the driver via the write() system | 
 | call, and gets results (if any) via read(). The completion areas are | 
 | made accessible via mmap(), and are read-only for the application. | 
 |  | 
 | The request may be either an immediate command or an array of CCBs to | 
 | be submitted to the hardware. | 
 |  | 
 | Each open instance of the device is exclusive to the thread that | 
 | opened it, and must be used by that thread for all subsequent | 
 | operations. The driver open function creates a new context for the | 
 | thread and initializes it for use.  This context contains pointers and | 
 | values used internally by the driver to keep track of submitted | 
 | requests. The completion area buffer is also allocated, and this is | 
 | large enough to contain the completion areas for many concurrent | 
 | requests.  When the device is closed, any outstanding transactions are | 
 | flushed and the context is cleaned up. | 
 |  | 
 | On a DAX1 system (M7), the device will be called "oradax1", while on a | 
 | DAX2 system (M8) it will be "oradax2". If an application requires one | 
 | or the other, it should simply attempt to open the appropriate | 
 | device. Only one of the devices will exist on any given system, so the | 
 | name can be used to determine what the platform supports. | 
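
The probing described above can be sketched as follows. This is only an
illustration: the helper name dax_open_any is hypothetical and not part
of the driver API. It tries a list of paths in order and reports the
index of the one that opened.

```c
#include <fcntl.h>
#include <stddef.h>

/*
 * Try each candidate path in order and return the first descriptor that
 * opens. On success *which is set to the index of that path. Returns -1
 * if none could be opened (errno is from the last open() attempt).
 * Hypothetical helper, not part of the oradax driver interface.
 */
int dax_open_any(const char *const paths[], size_t npaths, size_t *which)
{
	for (size_t i = 0; i < npaths; i++) {
		int fd = open(paths[i], O_RDWR);

		if (fd >= 0) {
			*which = i;
			return fd;
		}
	}
	return -1;
}
```

Called with the list { "/dev/oradax1", "/dev/oradax2" }, a reported
index of 0 would indicate a DAX1 system and 1 a DAX2 system.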
 |  | 
 | The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For | 
 | all of these, success is indicated by a return value from write() | 
 | equal to the number of bytes given in the call. Otherwise -1 is | 
 | returned and errno is set. | 
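
A minimal sketch of this success test, written as a generic wrapper
around write(). The function name dax_immediate is hypothetical. The
command buffer is passed as opaque bytes so that no layout is assumed
here; treating a short write as an error is also an assumption, since
the text above only describes the two outcomes "length" and "-1".

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Issue one immediate command. Returns 0 when write() reports exactly
 * len bytes, otherwise -1 with errno describing the failure.
 * Hypothetical helper, not part of the oradax driver interface.
 */
int dax_immediate(int fd, const void *cmd, size_t len)
{
	ssize_t n = write(fd, cmd, len);

	if (n == (ssize_t)len)
		return 0;
	if (n >= 0)
		errno = EIO;	/* assumption: a short write counts as failure */
	return -1;
}
```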
 |  | 
 | CCB_DEQUEUE | 
 | ----------- | 
 |  | 
 | Tells the driver to clean up resources associated with past | 
 | requests. Since no interrupt is generated upon the completion of a | 
 | request, the driver must be told when it may reclaim resources.  No | 
 | further status information is returned, so the user should not | 
 | subsequently call read(). | 
 |  | 
 | CCB_KILL | 
 | -------- | 
 |  | 
 | Kills a CCB during execution. The CCB is guaranteed to not continue | 
 | executing once this call returns successfully. On success, read() must | 
 | be called to retrieve the result of the action. | 
 |  | 
 | CCB_INFO | 
 | -------- | 
 |  | 
 | Retrieves information about a currently executing CCB. Note that some | 
 | Hypervisors might return 'notfound' when the CCB is in 'inprogress' | 
 | state. To ensure a CCB in the 'notfound' state will never be executed, | 
 | CCB_KILL must be invoked on that CCB. Upon success, read() must be | 
 | called to retrieve the details of the action. | 
 |  | 
 | Submission of an array of CCBs for execution | 
 | --------------------------------------------- | 
 |  | 
 | A write() whose length is a multiple of the CCB size is treated as a | 
 | submit operation. The file offset is treated as the index of the | 
 | completion area to use, and may be set via lseek() or using the | 
 | pwrite() system call. If -1 is returned then errno is set to indicate | 
 | the error. Otherwise, the return value is the length of the array that | 
 | was actually accepted by the coprocessor. If the accepted length is | 
 | equal to the requested length, then the submission was completely | 
 | successful and there is no further status needed; hence, the user | 
 | should not subsequently call read(). Partial acceptance of the CCB | 
 | array is indicated by a return value less than the requested length, | 
 | and read() must be called to retrieve further status information.  The | 
 | status will reflect the error caused by the first CCB that was not | 
 | accepted, and status_data will provide additional data in some cases. | 
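
The three possible outcomes of a submit can be told apart from the
return value alone. The sketch below is illustrative only; the enum and
function names are hypothetical, and the CCB size of 64 bytes follows
from the description of a CCB as 8 64-bit words.

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_BYTES 64	/* a CCB is 8 64-bit words */

enum submit_outcome {
	SUBMIT_ERROR,		/* -1 returned; consult errno */
	SUBMIT_COMPLETE,	/* whole array accepted; do not call read() */
	SUBMIT_PARTIAL		/* array partly accepted; call read() for status */
};

/*
 * Classify the return value of write()/pwrite() for a CCB array of
 * `requested` bytes. *accepted_ccbs receives the number of CCBs that
 * were accepted, which is also the index of the first rejected CCB
 * when the outcome is SUBMIT_PARTIAL.
 */
enum submit_outcome classify_submit(ssize_t ret, size_t requested,
				    size_t *accepted_ccbs)
{
	*accepted_ccbs = 0;
	if (ret < 0)
		return SUBMIT_ERROR;
	*accepted_ccbs = (size_t)ret / CCB_BYTES;
	return (size_t)ret == requested ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```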
 |  | 
 | MMAP | 
 | ---- | 
 |  | 
 | The mmap() function provides access to the completion area allocated | 
 | in the driver.  Note that the completion area is not writeable by the | 
 | user process, and the mmap call must not specify PROT_WRITE. | 
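
A sketch of a read-only mapping and of locating one completion area
inside it. Both helper names are hypothetical. The second helper assumes
that the mapped buffer is a contiguous array of 128 byte completion
areas addressed by the same index that is used as the file offset on
submission; check this against the driver before relying on it.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define CA_BYTES 128	/* each completion area is a 128 byte block */

/* Map the completion areas read-only. Returns NULL on failure. */
const volatile uint8_t *map_completion_areas(int fd, size_t len)
{
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : (const volatile uint8_t *)p;
}

/*
 * Address of the completion area with the given index, or NULL when the
 * index lies outside the mapped length (assumed array layout).
 */
const volatile uint8_t *completion_area_at(const volatile uint8_t *base,
					   size_t len, size_t index)
{
	if (base == NULL || index >= len / CA_BYTES)
		return NULL;
	return base + index * CA_BYTES;
}
```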
 |  | 
 |  | 
 | Completion of a Request | 
 | ======================= | 
 |  | 
 | The first byte in each completion area is the command status which is | 
 | updated by the coprocessor hardware. Software may take advantage of | 
 | new M7/M8 processor capabilities to efficiently poll this status byte. | 
 | First, a "monitored load" is achieved via a Load from Alternate Space | 
 | (ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a | 
 | "monitored wait" is achieved via the mwait instruction (a write to | 
 | %asr28). This instruction is like pause in that it suspends execution | 
 | of the virtual processor for the given number of nanoseconds, but in | 
 | addition will terminate early when one of several events occurs. If the | 
 | block of data containing the monitored location is modified, then the | 
 | mwait terminates. This causes software to resume execution immediately | 
 | (without a context switch or kernel to user transition) after a | 
 | transaction completes. Thus the latency between transaction completion | 
 | and resumption of execution may be just a few nanoseconds. | 
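
For testing on machines without the monitored load and mwait
instructions, the same polling logic can be written portably. The
sketch below is a stand-in under that assumption, not the recommended
SPARC sequence: it uses an ordinary volatile read and a bounded number
of polls, and the name poll_completion is hypothetical.

```c
#include <stdint.h>

/*
 * Poll the status byte until it becomes non-zero or max_polls reads
 * have been made. Returns the status byte, or 0 if the request is
 * still in progress when the poll budget runs out.
 */
uint8_t poll_completion(const volatile uint8_t *status_byte,
			unsigned long max_polls)
{
	for (unsigned long i = 0; i < max_polls; i++) {
		/* On M7/M8 a monitored load would be used here. */
		uint8_t status = *status_byte;

		if (status != 0)
			return status;
		/* On M7/M8 an mwait would replace this busy spin. */
	}
	return 0;
}
```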
 |  | 
 |  | 
 | Application Life Cycle of a DAX Submission | 
 | ========================================== | 
 |  | 
 |  - open dax device | 
 |  - call mmap() to get the completion area address | 
 |  - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc. | 
 |  - submit CCB via write() or pwrite() | 
 |  - go into a loop executing monitored load + monitored wait and | 
 |    terminate when the command status indicates the request is complete | 
 |    (CCB_KILL or CCB_INFO may be used any time as necessary) | 
 |  - perform a CCB_DEQUEUE | 
 |  - call munmap() for completion area | 
 |  - close the dax device | 
 |  | 
 |  | 
 | Memory Constraints | 
 | ================== | 
 |  | 
 | The DAX hardware operates only on physical addresses. Therefore, it is | 
 | not aware of virtual memory mappings and the discontiguities that may | 
 | exist in the physical memory that a virtual buffer maps to. There is | 
 | no I/O TLB or any scatter/gather mechanism. All buffers, whether input | 
 | or output, must reside in a physically contiguous region of memory. | 
 |  | 
 | The Hypervisor translates all addresses within a CCB to physical | 
 | before handing off the CCB to DAX. The Hypervisor determines the | 
 | virtual page size for each virtual address given, and uses this to | 
 | program a size limit for each address. This prevents the coprocessor | 
 | from reading or writing beyond the bound of the virtual page, even | 
 | though it is accessing physical memory directly. A simpler way of | 
 | saying this is that a DAX operation will never "cross" a virtual page | 
 | boundary. If an 8k virtual page is used, then the data is strictly | 
 | limited to 8k. If a user's buffer is larger than 8k, then a larger | 
 | page size must be used, or the transaction size will be truncated to | 
 | 8k. | 
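
An application can check a buffer against this rule before building a
CCB. The following is a minimal sketch with a hypothetical helper name;
it assumes the page size is a power of two and that addr + len does not
wrap around.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True when the byte range [addr, addr + len) lies within a single page
 * of page_size bytes. page_size must be a power of two.
 */
bool fits_in_one_page(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t page_mask = ~(uintptr_t)(page_size - 1);

	if (len == 0)
		return true;
	if (len > page_size)
		return false;
	return (addr & page_mask) == ((addr + len - 1) & page_mask);
}
```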
 |  | 
 | Huge pages. A user may allocate huge pages using standard interfaces. | 
 | Memory buffers residing on huge pages may be used to achieve much | 
 | larger DAX transaction sizes, but the rules must still be followed, | 
 | and no transaction will cross a page boundary, even a huge page.  A | 
 | major caveat is that Linux on SPARC presents 8MB as one of the huge | 
 | page sizes. SPARC does not actually provide an 8MB hardware page size, | 
 | and this size is synthesized by pasting together two 4MB pages. The | 
 | reasons for this are historical, and it creates an issue because only | 
 | half of this 8MB page can actually be used for any given buffer in a | 
 | DAX request, and it must be either the first half or the second half; | 
 | it cannot be a 4MB chunk in the middle, since that crosses a | 
 | (hardware) page boundary. Note that this entire issue may be hidden by | 
 | higher level libraries. | 
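
The usable length inside such a synthesized huge page can be computed
from the buffer address. This sketch assumes that the huge page is
itself aligned so that its two halves start on 4 MB boundaries; the
helper name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_4MB ((uintptr_t)4 << 20)	/* underlying hardware page size */

/*
 * Largest number of bytes, starting at addr, that stay inside one 4 MB
 * hardware page and can therefore be covered by a single DAX buffer.
 */
size_t max_len_in_4mb_page(uintptr_t addr)
{
	return (size_t)(HW_PAGE_4MB - (addr & (HW_PAGE_4MB - 1)));
}
```

A buffer that starts 2 MB into the first half, for example, can be at
most 2 MB long even though 6 MB of the huge page follow it.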
 |  | 
 |  | 
 | CCB Structure | 
 | ------------- | 
 | A CCB is an array of 8 64-bit words. Several of these words provide | 
 | command opcodes, parameters, flags, etc., and the rest are addresses | 
 | for the completion area, output buffer, and various inputs:: | 
 |  | 
 |    struct ccb { | 
 |        u64   control; | 
 |        u64   completion; | 
 |        u64   input0; | 
 |        u64   access; | 
 |        u64   input1; | 
 |        u64   op_data; | 
 |        u64   output; | 
 |        u64   table; | 
 |    }; | 
 |  | 
 | See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of | 
 | each of these fields, and see dax-hv-api.txt for a complete description | 
 | of the Hypervisor API available to the guest OS (i.e., the Linux kernel). | 
 |  | 
 | The first word (control) is examined by the driver for the following: | 
 |  | 
 |  - CCB version, which must be consistent with hardware version | 
 |  - Opcode, which must be one of the documented allowable commands | 
 |  - Address types, which must be set to "virtual" for all the addresses | 
 |    given by the user, thereby ensuring that the application can | 
 |    only access memory that it owns | 
 |  | 
 |  | 
 | Example Code | 
 | ============ | 
 |  | 
 | The DAX is accessible to both user and kernel code.  The kernel code | 
 | can make hypercalls directly while the user code must use wrappers | 
 | provided by the driver. The setup of the CCB is nearly identical for | 
 | both; the only difference is in preparation of the completion area. An | 
 | example of user code is given now, with kernel code afterwards. | 
 |  | 
 | In order to program using the driver API, the file | 
 | arch/sparc/include/uapi/asm/oradax.h must be included. | 
 |  | 
 | First, the proper device must be opened. For M7 it will be | 
 | /dev/oradax1 and for M8 it will be /dev/oradax2. The simplest | 
 | procedure is to attempt to open both, as only one will succeed:: | 
 |  | 
 | 	fd = open("/dev/oradax1", O_RDWR); | 
 | 	if (fd < 0) | 
 | 		fd = open("/dev/oradax2", O_RDWR); | 
 | 	if (fd < 0) { | 
 | 		/* No DAX found, bail out */ | 
 | 	} | 
 |  | 
 | Next, the completion area must be mapped:: | 
 |  | 
 |       completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0); | 
 |  | 
 | All input and output buffers must be fully contained in one hardware | 
 | page since, as explained above, the DAX is strictly constrained by | 
 | virtual page boundaries.  In addition, the output buffer must be | 
 | 64-byte aligned and its size must be a multiple of 64 bytes because | 
 | the coprocessor writes in units of cache lines. | 
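
One possible way to obtain an output buffer that meets the alignment
and size rules is sketched here using C11 aligned_alloc(). The helper
names are hypothetical. Note that this only addresses the 64-byte
requirements; whether the buffer also stays within one page has to be
checked separately.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DAX_OUT_ALIGN 64	/* output is written in 64-byte cache lines */

/* Round a byte count up to the next multiple of 64. */
size_t round_up_64(size_t n)
{
	return (n + DAX_OUT_ALIGN - 1) & ~(size_t)(DAX_OUT_ALIGN - 1);
}

/*
 * Allocate a zeroed output buffer that is 64-byte aligned and whose
 * size is padded to a multiple of 64 bytes. *padded receives the padded
 * size. Returns NULL on failure; release the buffer with free().
 */
void *alloc_output(size_t wanted, size_t *padded)
{
	void *p;

	*padded = round_up_64(wanted ? wanted : 1);
	p = aligned_alloc(DAX_OUT_ALIGN, *padded);
	if (p != NULL)
		memset(p, 0, *padded);
	return p;
}
```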
 |  | 
 | This example demonstrates the DAX Scan command, which takes as input a | 
 | vector and a match value, and produces a bitmap as the output. For | 
 | each input element that matches the value, the corresponding bit is | 
 | set in the output. | 
 |  | 
 | In this example, the input vector consists of a series of single bits, | 
 | and the match value is 0. Each 0 bit in the input therefore produces | 
 | a 1 in the output, and vice versa, so the output bitmap is the | 
 | inverse of the input bitmap. | 
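
A software model of the expected result can be handy for checking the
hardware output. The sketch below assumes that input element i and
output bit i occupy the same bit position within their bytes, in which
case the expected output is simply the bitwise complement of the input.
The function name is hypothetical, and only whole bytes are handled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Reference result for a scan over 1-bit elements with match value 0:
 * every input bit is inverted. Operates on whole bytes only.
 */
void scan_match_zero_ref(const uint8_t *in, uint8_t *out, size_t nbytes)
{
	for (size_t i = 0; i < nbytes; i++)
		out[i] = (uint8_t)~in[i];
}
```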
 |  | 
 | For details of all the parameters and bits used in this CCB, please | 
 | refer to section 36.2.1.3 of the DAX Hypervisor API document, which | 
 | describes the Scan command in detail:: | 
 |  | 
 | 	ccb->control =       /* Table 36.1, CCB Header Format */ | 
 | 		  (2L << 48)     /* command = Scan Value */ | 
 | 		| (3L << 40)     /* output address type = primary virtual */ | 
 | 		| (3L << 34)     /* primary input address type = primary virtual */ | 
 | 		             /* Section 36.2.1, Query CCB Command Formats */ | 
 | 		| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */ | 
 | 		| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */ | 
 | 		| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */ | 
 | 		| (0 <<  5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */ | 
 | 		| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */ | 
 |  | 
 | 	ccb->completion = 0;    /* Completion area address, to be filled in by driver */ | 
 |  | 
 | 	ccb->input0 = (unsigned long) input; /* primary input address */ | 
 |  | 
 | 	ccb->access =       /* Section 36.2.1.2, Data Access Control */ | 
 | 		  (2 << 24)    /* Primary input length format = bits */ | 
 | 		| (nbits - 1); /* number of bits in primary input stream, minus 1 */ | 
 |  | 
 | 	ccb->input1 = 0;       /* secondary input address, unused */ | 
 |  | 
 | 	ccb->op_data = 0;      /* scan criteria (value to be matched) */ | 
 |  | 
 | 	ccb->output = (unsigned long) output;	/* output address */ | 
 |  | 
 | 	ccb->table = 0;	       /* table address, unused */ | 
 |  | 
 | The CCB submission is a write() or pwrite() system call to the | 
 | driver. If the CCB is not accepted in full, then a read() must be | 
 | used to retrieve the status:: | 
 |  | 
 | 	if (pwrite(fd, ccb, 64, 0) != 64) { | 
 | 		struct ccb_exec_result status; | 
 | 		read(fd, &status, sizeof(status)); | 
 | 		/* bail out */ | 
 | 	} | 
 |  | 
 | After a successful submission of the CCB, the completion area may be | 
 | polled to determine when the DAX is finished. Detailed information on | 
 | the contents of the completion area can be found in section 36.2.2 of | 
 | the DAX HV API document:: | 
 |  | 
 | 	while (1) { | 
 | 		/* Monitored Load */ | 
 | 		__asm__ __volatile__("lduba [%1] 0x84, %0\n" | 
 | 				     : "=r" (status) | 
 | 				     : "r"  (completion_area)); | 
 |  | 
 | 		if (status)	     /* 0 indicates command in progress */ | 
 | 			break; | 
 |  | 
 | 		/* MWAIT */ | 
 | 		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */ | 
 | 	} | 
 |  | 
 | A completion area status of 1 indicates successful completion of the | 
 | CCB and validity of the output bitmap, which may be used immediately. | 
 | All other non-zero values indicate error conditions which are | 
 | described in section 36.2.2:: | 
 |  | 
 | 	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */ | 
 | 		/* completion_area[0] contains the completion status */ | 
 | 		/* completion_area[1] contains an error code, see 36.2.2 */ | 
 | 	} | 
 |  | 
 | After the completion area has been processed, the driver must be | 
 | notified that it can release any resources associated with the | 
 | request. This is done via the dequeue operation:: | 
 |  | 
 | 	struct dax_command cmd = { .command = CCB_DEQUEUE };	/* other fields zeroed */ | 
 | 	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) { | 
 | 		/* bail out */ | 
 | 	} | 
 |  | 
 | Finally, normal program cleanup should be done, i.e., unmapping the | 
 | completion area, closing the DAX device, and freeing memory. | 
 |  | 
 | Kernel example | 
 | -------------- | 
 |  | 
 | The only difference in using the DAX in kernel code is the treatment | 
 | of the completion area. Unlike user applications which mmap the | 
 | completion area allocated by the driver, kernel code must allocate its | 
 | own memory to use for the completion area, and this address and its | 
 | type must be given in the CCB:: | 
 |  | 
 | 	ccb->control |=      /* Table 36.1, CCB Header Format */ | 
 | 	        (3L << 32);     /* completion area address type = primary virtual */ | 
 |  | 
 | 	ccb->completion = (unsigned long) completion_area;   /* Completion area address */ | 
 |  | 
 | The dax submit hypercall is made directly. The flags used in the | 
 | ccb_submit call are documented in the DAX HV API in section 36.3.1. | 
 |  | 
 | :: | 
 |  | 
 |   #include <asm/hypervisor.h> | 
 |  | 
 | 	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64, | 
 | 				 HV_CCB_QUERY_CMD | | 
 | 				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY | | 
 | 				 HV_CCB_VA_PRIVILEGED, | 
 | 				 0, &bytes_accepted, &status_data); | 
 |  | 
 | 	if (hv_rv != HV_EOK) { | 
 | 		/* hv_rv is an error code, status_data contains */ | 
 | 		/* potential additional status, see 36.3.1.1 */ | 
 | 	} | 
 |  | 
 | After the submission, the completion area polling code is identical to | 
 | that in user land:: | 
 |  | 
 | 	while (1) { | 
 | 		/* Monitored Load */ | 
 | 		__asm__ __volatile__("lduba [%1] 0x84, %0\n" | 
 | 				     : "=r" (status) | 
 | 				     : "r"  (completion_area)); | 
 |  | 
 | 		if (status)	     /* 0 indicates command in progress */ | 
 | 			break; | 
 |  | 
 | 		/* MWAIT */ | 
 | 		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */ | 
 | 	} | 
 |  | 
 | 	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */ | 
 | 		/* completion_area[0] contains the completion status */ | 
 | 		/* completion_area[1] contains an error code, see 36.2.2 */ | 
 | 	} | 
 |  | 
 | The output bitmap is ready for consumption immediately after the | 
 | completion status indicates success. | 
 |  | 
 | Excerpt from UltraSPARC Virtual Machine Specification | 
 | ===================================================== | 
 |  | 
 |  .. include:: dax-hv-api.txt | 
 |     :literal: |