|  | ======================================= | 
|  | Oracle Data Analytics Accelerator (DAX) | 
|  | ======================================= | 
|  |  | 
|  | DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 | 
|  | (DAX2) processor chips, and has direct access to the CPU's L3 caches | 
|  | as well as physical memory. It can perform several operations on data | 
|  | streams with various input and output formats.  A driver provides a | 
|  | transport mechanism and has limited knowledge of the various opcodes | 
|  | and data formats. A user space library provides high level services | 
|  | and translates these into low level commands which are then passed | 
|  | into the driver and subsequently the Hypervisor and the coprocessor. | 
|  | The library is the recommended way for applications to use the | 
|  | coprocessor, and the driver interface is not intended for general use. | 
|  | This document describes the general flow of the driver, its | 
|  | structures, and its programmatic interface. It also provides example | 
|  | code sufficient to write user or kernel applications that use DAX | 
|  | functionality. | 
|  |  | 
|  | The user library is open source and available at: | 
|  |  | 
|  | https://oss.oracle.com/git/gitweb.cgi?p=libdax.git | 
|  |  | 
|  | The Hypervisor interface to the coprocessor is described in detail in | 
|  | the accompanying document, dax-hv-api.txt, which is a plain text | 
|  | excerpt of the (Oracle internal) "UltraSPARC Virtual Machine | 
|  | Specification" version 3.0.20+15, dated 2017-09-25. | 
|  |  | 
|  |  | 
|  | High Level Overview | 
|  | =================== | 
|  |  | 
|  | A coprocessor request is described by a Command Control Block | 
|  | (CCB). The CCB contains an opcode and various parameters. The opcode | 
|  | specifies what operation is to be done, and the parameters specify | 
|  | options, flags, sizes, and addresses.  The CCB (or an array of CCBs) | 
|  | is passed to the Hypervisor, which handles queueing and scheduling of | 
|  | requests to the available coprocessor execution units. A status code | 
|  | returned indicates if the request was submitted successfully or if | 
|  | there was an error.  One of the addresses given in each CCB is a | 
|  | pointer to a "completion area", which is a 128 byte memory block that | 
|  | is written by the coprocessor to provide execution status. No | 
|  | interrupt is generated upon completion; the completion area must be | 
|  | polled by software to find out when a transaction has finished, but | 
|  | the M7 and later processors provide a mechanism to pause the virtual | 
|  | processor until the completion status has been updated by the | 
|  | coprocessor. This is done using the monitored load and mwait | 
|  | instructions, which are described in more detail later.  The DAX | 
|  | coprocessor was designed so that after a request is submitted, the | 
|  | kernel is no longer involved in the processing of it.  The polling is | 
|  | done at the user level, which results in almost zero latency between | 
|  | completion of a request and resumption of execution of the requesting | 
|  | thread. | 
|  |  | 
|  |  | 
|  | Addressing Memory | 
|  | ================= | 
|  |  | 
|  | The kernel does not have access to physical memory in the Sun4v | 
|  | architecture, as there is an additional level of memory virtualization | 
|  | present. This intermediate level is called "real" memory, and the | 
|  | kernel treats this as if it were physical.  The Hypervisor handles the | 
|  | translations between real memory and physical so that each logical | 
|  | domain (LDOM) can have a partition of physical memory that is isolated | 
|  | from that of other LDOMs.  When the kernel sets up a virtual mapping, | 
|  | it specifies a virtual address and the real address to which it should | 
|  | be mapped. | 
|  |  | 
|  | The DAX coprocessor can only operate on physical memory, so before a | 
|  | request can be fed to the coprocessor, all the addresses in a CCB must | 
|  | be converted into physical addresses. The kernel cannot do this since | 
|  | it has no visibility into physical addresses. So a CCB may contain | 
|  | either the virtual or real addresses of the buffers or a combination | 
|  | of them. An "address type" field is available for each address that | 
|  | may be given in the CCB. In all cases, the Hypervisor will translate | 
|  | all the addresses to physical before dispatching to hardware. Address | 
|  | translations are performed using the context of the process initiating | 
|  | the request. | 
|  |  | 
|  |  | 
|  | The Driver API | 
|  | ============== | 
|  |  | 
|  | An application makes requests to the driver via the write() system | 
|  | call, and gets results (if any) via read(). The completion areas are | 
|  | made accessible via mmap(), and are read-only for the application. | 
|  |  | 
|  | The request may either be an immediate command or an array of CCBs to | 
|  | be submitted to the hardware. | 
|  |  | 
|  | Each open instance of the device is exclusive to the thread that | 
|  | opened it, and must be used by that thread for all subsequent | 
|  | operations. The driver open function creates a new context for the | 
|  | thread and initializes it for use.  This context contains pointers and | 
|  | values used internally by the driver to keep track of submitted | 
|  | requests. The completion area buffer is also allocated, and this is | 
|  | large enough to contain the completion areas for many concurrent | 
|  | requests.  When the device is closed, any outstanding transactions are | 
|  | flushed and the context is cleaned up. | 
|  |  | 
|  | On a DAX1 system (M7), the device will be called "oradax1", while on a | 
|  | DAX2 system (M8) it will be "oradax2". If an application requires one | 
|  | or the other, it should simply attempt to open the appropriate | 
|  | device. Only one of the devices will exist on any given system, so the | 
|  | name can be used to determine what the platform supports. | 
|  |  | 
|  | The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For | 
|  | all of these, success is indicated by a return value from write() | 
|  | equal to the number of bytes given in the call. Otherwise -1 is | 
|  | returned and errno is set. | 
|  |  | 
|  | CCB_DEQUEUE | 
|  | ----------- | 
|  |  | 
|  | Tells the driver to clean up resources associated with past | 
|  | requests. Since no interrupt is generated upon the completion of a | 
|  | request, the driver must be told when it may reclaim resources.  No | 
|  | further status information is returned, so the user should not | 
|  | subsequently call read(). | 
|  |  | 
|  | CCB_KILL | 
|  | -------- | 
|  |  | 
|  | Kills a CCB during execution. The CCB is guaranteed to not continue | 
|  | executing once this call returns successfully. On success, read() must | 
|  | be called to retrieve the result of the action. | 
|  |  | 
|  | CCB_INFO | 
|  | -------- | 
|  |  | 
|  | Retrieves information about a currently executing CCB. Note that some | 
|  | Hypervisors might return 'notfound' when the CCB is in 'inprogress' | 
|  | state. To ensure a CCB in the 'notfound' state will never be executed, | 
|  | CCB_KILL must be invoked on that CCB. Upon success, read() must be | 
|  | called to retrieve the details of the action. | 
|  |  | 
|  | Submission of an array of CCBs for execution | 
|  | --------------------------------------------- | 
|  |  | 
|  | A write() whose length is a multiple of the CCB size is treated as a | 
|  | submit operation. The file offset is treated as the index of the | 
|  | completion area to use, and may be set via lseek() or using the | 
|  | pwrite() system call. If -1 is returned then errno is set to indicate | 
|  | the error. Otherwise, the return value is the length of the array that | 
|  | was actually accepted by the coprocessor. If the accepted length is | 
|  | equal to the requested length, then the submission was completely | 
|  | successful and there is no further status needed; hence, the user | 
|  | should not subsequently call read(). Partial acceptance of the CCB | 
|  | array is indicated by a return value less than the requested length, | 
|  | and read() must be called to retrieve further status information.  The | 
|  | status will reflect the error caused by the first CCB that was not | 
|  | accepted, and status_data will provide additional data in some cases. | 
|  |  | 
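The three possible outcomes of a submit can be told apart from the write() return value alone. The following is a minimal sketch, not part of the driver API: the names `classify_submit`, `submit_outcome`, and `CCB_SIZE` are hypothetical, and it assumes that lengths are counted in bytes, as the single-CCB example in this document suggests.

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64	/* 8 words of 8 bytes each (hypothetical name) */

enum submit_outcome {
	SUBMIT_ERROR,		/* write() returned -1; consult errno */
	SUBMIT_COMPLETE,	/* every CCB accepted; do not call read() */
	SUBMIT_PARTIAL		/* some CCBs not accepted; call read() for status */
};

/*
 * Classify the return value of a submit write().
 * ret:       value returned by write() or pwrite()
 * requested: byte length passed to write() (a multiple of CCB_SIZE)
 * accepted:  set to the number of whole CCBs that were accepted
 */
static enum submit_outcome classify_submit(ssize_t ret, size_t requested,
					   size_t *accepted)
{
	*accepted = 0;
	if (ret < 0)
		return SUBMIT_ERROR;
	*accepted = (size_t)ret / CCB_SIZE;
	return ((size_t)ret == requested) ? SUBMIT_COMPLETE : SUBMIT_PARTIAL;
}
```

Only the SUBMIT_PARTIAL case should be followed by a read(); the other two cases leave no status to collect.
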
|  | MMAP | 
|  | ---- | 
|  |  | 
|  | The mmap() function provides access to the completion area allocated | 
|  | in the driver.  Note that the completion area is not writeable by the | 
|  | user process, and the mmap call must not specify PROT_WRITE. | 
|  |  | 
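Since each completion area is 128 bytes and a submit names its completion area by index, the status block for a given request can be located inside the mapping with simple pointer arithmetic. This is a sketch under the assumption that index N selects the Nth 128-byte slot of the mapped region; `completion_area_for` and `COMPLETION_AREA_SIZE` are hypothetical names, not driver symbols.

```c
#include <stddef.h>
#include <stdint.h>

#define COMPLETION_AREA_SIZE 128	/* bytes per completion area */

/*
 * Return a pointer to the completion area with the given index.
 * mmap_base is the address returned by mmap() on the DAX device.
 * The result is const because the mapping is read-only, and volatile
 * because the coprocessor updates it behind the compiler's back.
 */
static const volatile uint8_t *completion_area_for(const void *mmap_base,
						   size_t index)
{
	return (const volatile uint8_t *)mmap_base +
	       index * COMPLETION_AREA_SIZE;
}
```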
|  |  | 
|  | Completion of a Request | 
|  | ======================= | 
|  |  | 
|  | The first byte in each completion area is the command status which is | 
|  | updated by the coprocessor hardware. Software may take advantage of | 
|  | new M7/M8 processor capabilities to efficiently poll this status byte. | 
|  | First, a "monitored load" is achieved via a Load from Alternate Space | 
|  | (ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a | 
|  | "monitored wait" is achieved via the mwait instruction (a write to | 
|  | %asr28). This instruction is like pause in that it suspends execution | 
|  | of the virtual processor for the given number of nanoseconds, but in | 
|  | addition will terminate early when one of several events occurs. If the |
|  | block of data containing the monitored location is modified, then the | 
|  | mwait terminates. This causes software to resume execution immediately | 
|  | (without a context switch or kernel to user transition) after a | 
|  | transaction completes. Thus the latency between transaction completion | 
|  | and resumption of execution may be just a few nanoseconds. | 
|  |  | 
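For testing on machines without the monitored load and mwait instructions, the same loop can be written with a plain volatile load and a bounded retry count. This is only a portable stand-in for illustration: it busy-waits instead of pausing the virtual processor, and the name `poll_status` and the retry bound are assumptions of this sketch.

```c
#include <stdint.h>

/*
 * Poll the first byte of a completion area until it becomes non-zero.
 * Returns the non-zero status byte, or 0 if max_polls attempts passed
 * while the command was still in progress.
 */
static uint8_t poll_status(const volatile uint8_t *status_byte,
			   unsigned long max_polls)
{
	while (max_polls--) {
		uint8_t status = *status_byte;	/* plain load, not monitored */

		if (status)
			return status;
		/* On M7/M8, an mwait would suspend the strand here. */
	}
	return 0;
}
```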
|  |  | 
|  | Application Life Cycle of a DAX Submission | 
|  | ========================================== | 
|  |  | 
|  | - open dax device | 
|  | - call mmap() to get the completion area address | 
|  | - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc. | 
|  | - submit CCB via write() or pwrite() | 
|  | - go into a loop executing monitored load + monitored wait and |
|  |   terminate when the command status indicates the request is complete |
|  |   (CCB_KILL or CCB_INFO may be used any time as necessary) |
|  | - perform a CCB_DEQUEUE | 
|  | - call munmap() for completion area | 
|  | - close the dax device | 
|  |  | 
|  |  | 
|  | Memory Constraints | 
|  | ================== | 
|  |  | 
|  | The DAX hardware operates only on physical addresses. Therefore, it is | 
|  | not aware of virtual memory mappings and the discontiguities that may | 
|  | exist in the physical memory that a virtual buffer maps to. There is | 
|  | no I/O TLB or any scatter/gather mechanism. All buffers, whether input | 
|  | or output, must reside in a physically contiguous region of memory. | 
|  |  | 
|  | The Hypervisor translates all addresses within a CCB to physical | 
|  | before handing off the CCB to DAX. The Hypervisor determines the | 
|  | virtual page size for each virtual address given, and uses this to | 
|  | program a size limit for each address. This prevents the coprocessor | 
|  | from reading or writing beyond the bound of the virtual page, even | 
|  | though it is accessing physical memory directly. A simpler way of | 
|  | saying this is that a DAX operation will never "cross" a virtual page | 
|  | boundary. If an 8k virtual page is used, then the data is strictly | 
|  | limited to 8k. If a user's buffer is larger than 8k, then a larger | 
|  | page size must be used, or the transaction size will be truncated to | 
|  | 8k. | 
|  |  | 
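The page-boundary rule can be checked before a buffer is handed to the coprocessor. The helpers below are a minimal sketch with hypothetical names; they assume the caller already knows the size of the virtual page backing the buffer and that this size is a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr up to the end of its page (page_size: power of two). */
static size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	return page_size - (size_t)(addr & (page_size - 1));
}

/*
 * Largest length, not exceeding len, that a single DAX operation could
 * cover starting at addr without crossing a virtual page boundary.
 */
static size_t dax_usable_len(uintptr_t addr, size_t len, size_t page_size)
{
	size_t room = bytes_to_page_end(addr, page_size);

	return len < room ? len : room;
}
```

A result smaller than `len` means the request would be truncated, so the buffer should be split or moved to a larger page.
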
|  | Huge pages. A user may allocate huge pages using standard interfaces. | 
|  | Memory buffers residing on huge pages may be used to achieve much | 
|  | larger DAX transaction sizes, but the rules must still be followed, | 
|  | and no transaction will cross a page boundary, even a huge page.  A | 
|  | major caveat is that Linux on SPARC presents 8MB as one of the huge |
|  | page sizes. SPARC does not actually provide an 8MB hardware page size, |
|  | and this size is synthesized by pasting together two 4MB pages. The |
|  | reasons for this are historical, and it creates an issue because only |
|  | half of this 8MB page can actually be used for any given buffer in a |
|  | DAX request, and it must be either the first half or the second half; |
|  | it cannot be a 4MB chunk in the middle, since that crosses a |
|  | (hardware) page boundary. Note that this entire issue may be hidden by | 
|  | higher level libraries. | 
|  |  | 
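For buffers on an 8MB huge page, the constraint is that the buffer must lie entirely within one of the two underlying 4MB hardware pages. A sketch of that check follows; `fits_in_hw_page` and `HW_PAGE_SIZE` are hypothetical names, and the sketch assumes the 4MB hardware pages are naturally aligned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HW_PAGE_SIZE (4UL * 1024 * 1024)	/* 4MB hardware page */

/*
 * Return true if [addr, addr + len) lies within a single 4MB hardware
 * page, i.e. in the first half or the second half of an 8MB huge page.
 */
static bool fits_in_hw_page(uintptr_t addr, size_t len)
{
	if (len == 0 || len > HW_PAGE_SIZE)
		return false;
	return (addr / HW_PAGE_SIZE) == ((addr + len - 1) / HW_PAGE_SIZE);
}
```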
|  |  | 
|  | CCB Structure | 
|  | ------------- | 
|  | A CCB is an array of 8 64-bit words. Several of these words provide | 
|  | command opcodes, parameters, flags, etc., and the rest are addresses | 
|  | for the completion area, output buffer, and various inputs:: | 
|  |  | 
|  |     struct ccb { |
|  |         u64   control; |
|  |         u64   completion; |
|  |         u64   input0; |
|  |         u64   access; |
|  |         u64   input1; |
|  |         u64   op_data; |
|  |         u64   output; |
|  |         u64   table; |
|  |     }; |
|  |  | 
|  | See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of | 
|  | each of these fields, and see dax-hv-api.txt for a complete description | 
|  | of the Hypervisor API available to the guest OS (i.e., the Linux kernel). |
|  |  | 
|  | The first word (control) is examined by the driver for the following: |
|  |  |
|  | - CCB version, which must be consistent with hardware version |
|  | - Opcode, which must be one of the documented allowable commands |
|  | - Address types, which must be set to "virtual" for all the addresses |
|  |   given by the user, thereby ensuring that the application can |
|  |   only access memory that it owns |
|  |  | 
|  |  | 
|  | Example Code | 
|  | ============ | 
|  |  | 
|  | The DAX is accessible to both user and kernel code.  The kernel code | 
|  | can make hypercalls directly while the user code must use wrappers | 
|  | provided by the driver. The setup of the CCB is nearly identical for | 
|  | both; the only difference is in preparation of the completion area. An | 
|  | example of user code is given now, with kernel code afterwards. | 
|  |  | 
|  | In order to program using the driver API, the file | 
|  | arch/sparc/include/uapi/asm/oradax.h must be included. | 
|  |  | 
|  | First, the proper device must be opened. For M7 it will be | 
|  | /dev/oradax1 and for M8 it will be /dev/oradax2. The simplest | 
|  | procedure is to attempt to open both, as only one will succeed:: | 
|  |  | 
|  |     fd = open("/dev/oradax1", O_RDWR); |
|  |     if (fd < 0) |
|  |         fd = open("/dev/oradax2", O_RDWR); |
|  |     if (fd < 0) { |
|  |         /* No DAX found */ |
|  |     } |
|  |  | 
|  | Next, the completion area must be mapped:: | 
|  |  | 
|  | completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0); | 
|  |  | 
|  | All input and output buffers must be fully contained in one hardware | 
|  | page, since as explained above, the DAX is strictly constrained by | 
|  | virtual page boundaries.  In addition, the output buffer must be | 
|  | 64-byte aligned and its size must be a multiple of 64 bytes because | 
|  | the coprocessor writes in units of cache lines. | 
|  |  | 
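The alignment and size rules for the output buffer can be met with a small allocation helper. This is a sketch using C11 aligned_alloc(); the helper names are hypothetical, and it does not by itself guarantee that the buffer stays within one hardware page, which must be checked separately.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DAX_OUTPUT_ALIGN 64	/* output is written in 64-byte units */

/* Round n up to the next multiple of 64 (at least 64). */
static size_t round_up_64(size_t n)
{
	if (n == 0)
		n = 1;
	return (n + DAX_OUTPUT_ALIGN - 1) & ~(size_t)(DAX_OUTPUT_ALIGN - 1);
}

static bool is_aligned_64(const void *p)
{
	return ((uintptr_t)p % DAX_OUTPUT_ALIGN) == 0;
}

/*
 * Allocate an output buffer of at least nbytes that is 64-byte aligned
 * and whose size is a multiple of 64. The padded size is returned in
 * *alloc_len. Release the buffer with free().
 */
static void *alloc_output(size_t nbytes, size_t *alloc_len)
{
	*alloc_len = round_up_64(nbytes);
	return aligned_alloc(DAX_OUTPUT_ALIGN, *alloc_len);
}
```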
|  | This example demonstrates the DAX Scan command, which takes as input a | 
|  | vector and a match value, and produces a bitmap as the output. For | 
|  | each input element that matches the value, the corresponding bit is | 
|  | set in the output. | 
|  |  | 
|  | In this example, the input vector consists of a series of single bits, | 
|  | and the match value is 0. So each 0 bit in the input will produce a 1 | 
|  | in the output, and vice versa, which produces an output bitmap which | 
|  | is the input bitmap inverted. | 
|  |  | 
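Because the expected output of this particular request is simply the inverted input, the result can be verified in software. The check below is a sketch with a hypothetical name; it assumes the bit count is a multiple of 8 so that no assumption is needed about bit order or about padding in a trailing partial byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if output is the bitwise inverse of input over nbits bits.
 * nbits must be a multiple of 8 in this sketch.
 */
static bool scan_output_matches(const uint8_t *input, const uint8_t *output,
				size_t nbits)
{
	size_t nbytes = nbits / 8;

	for (size_t i = 0; i < nbytes; i++) {
		if (output[i] != (uint8_t)~input[i])
			return false;
	}
	return true;
}
```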
|  | For details of all the parameters and bits used in this CCB, please | 
|  | refer to section 36.2.1.3 of the DAX Hypervisor API document, which | 
|  | describes the Scan command in detail:: | 
|  |  | 
|  |     ccb->control =       /* Table 36.1, CCB Header Format */ |
|  |           (2L << 48)     /* command = Scan Value */ |
|  |         | (3L << 40)     /* output address type = primary virtual */ |
|  |         | (3L << 34)     /* primary input address type = primary virtual */ |
|  |         /* Section 36.2.1, Query CCB Command Formats */ |
|  |         | (1 << 28)      /* 36.2.1.1.1 primary input format = fixed width bit packed */ |
|  |         | (0 << 23)      /* 36.2.1.1.2 primary input element size = 0 (1 bit) */ |
|  |         | (8 << 10)      /* 36.2.1.1.6 output format = bit vector */ |
|  |         | (0 <<  5)      /* 36.2.1.3 First scan criteria size = 0 (1 byte) */ |
|  |         | (31 << 0);     /* 36.2.1.3 Disable second scan criteria */ |
|  |  |
|  |     ccb->completion = 0;    /* Completion area address, to be filled in by driver */ |
|  |  |
|  |     ccb->input0 = (unsigned long) input; /* primary input address */ |
|  |  |
|  |     ccb->access =       /* Section 36.2.1.2, Data Access Control */ |
|  |           (2 << 24)    /* Primary input length format = bits */ |
|  |         | (nbits - 1); /* number of bits in primary input stream, minus 1 */ |
|  |  |
|  |     ccb->input1 = 0;       /* secondary input address, unused */ |
|  |  |
|  |     ccb->op_data = 0;      /* scan criteria (value to be matched) */ |
|  |  |
|  |     ccb->output = (unsigned long) output;	/* output address */ |
|  |  |
|  |     ccb->table = 0;	       /* table address, unused */ |
|  |  | 
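The header and access words above can also be assembled from named fields, which makes the layout easier to audit. The following is a sketch only: the macro and function names are hypothetical, and the shift values are copied from the example above rather than taken from any header. It uses explicit 64-bit constants so that it does not depend on the width of `long`.

```c
#include <stdint.h>

struct ccb {
	uint64_t control, completion, input0, access;
	uint64_t input1, op_data, output, table;
};

_Static_assert(sizeof(struct ccb) == 64, "a CCB is 8 64-bit words");

/* Field positions as used in the example (names are hypothetical). */
#define CCB_CMD_SHIFT		48
#define CCB_OUT_AT_SHIFT	40
#define CCB_IN0_AT_SHIFT	34
#define CCB_IN0_FMT_SHIFT	28
#define CCB_IN0_ELEM_SHIFT	23
#define CCB_OUT_FMT_SHIFT	10
#define CCB_CRIT1_SIZE_SHIFT	5
#define CCB_CRIT2_SIZE_SHIFT	0

/* Control word for the 1-bit Scan Value request shown above. */
static uint64_t scan_control_word(void)
{
	return (2ULL << CCB_CMD_SHIFT)		/* command = Scan Value */
	     | (3ULL << CCB_OUT_AT_SHIFT)	/* output: primary virtual */
	     | (3ULL << CCB_IN0_AT_SHIFT)	/* input: primary virtual */
	     | (1ULL << CCB_IN0_FMT_SHIFT)	/* fixed width bit packed */
	     | (0ULL << CCB_IN0_ELEM_SHIFT)	/* element size = 1 bit */
	     | (8ULL << CCB_OUT_FMT_SHIFT)	/* output = bit vector */
	     | (0ULL << CCB_CRIT1_SIZE_SHIFT)	/* first criteria = 1 byte */
	     | (31ULL << CCB_CRIT2_SIZE_SHIFT);	/* second criteria disabled */
}

/* Access word: length format = bits, length = nbits - 1 (nbits >= 1). */
static uint64_t scan_access_word(uint64_t nbits)
{
	return (2ULL << 24) | (nbits - 1);
}
```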
|  | The CCB submission is a write() or pwrite() system call to the | 
|  | driver. If the call fails, then a read() must be used to retrieve the | 
|  | status:: | 
|  |  | 
|  |     if (pwrite(fd, ccb, 64, 0) != 64) { |
|  |         struct ccb_exec_result status; |
|  |         read(fd, &status, sizeof(status)); |
|  |         /* bail out */ |
|  |     } |
|  |  | 
|  | After a successful submission of the CCB, the completion area may be | 
|  | polled to determine when the DAX is finished. Detailed information on | 
|  | the contents of the completion area can be found in section 36.2.2 of | 
|  | the DAX HV API document:: | 
|  |  | 
|  |     while (1) { |
|  |         /* Monitored Load */ |
|  |         __asm__ __volatile__("lduba [%1] 0x84, %0\n" |
|  |                              : "=r" (status) |
|  |                              : "r"  (completion_area)); |
|  |  |
|  |         if (status)	     /* 0 indicates command in progress */ |
|  |             break; |
|  |  |
|  |         /* MWAIT */ |
|  |         __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */ |
|  |     } |
|  |  | 
|  | A completion area status of 1 indicates successful completion of the | 
|  | CCB and validity of the output bitmap, which may be used immediately. | 
|  | All other non-zero values indicate error conditions which are | 
|  | described in section 36.2.2:: | 
|  |  | 
|  |     if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */ |
|  |         /* completion_area[0] contains the completion status */ |
|  |         /* completion_area[1] contains an error code, see 36.2.2 */ |
|  |     } |
|  |  | 
|  | After the completion area has been processed, the driver must be | 
|  | notified that it can release any resources associated with the | 
|  | request. This is done via the dequeue operation:: | 
|  |  | 
|  |     struct dax_command cmd; |
|  |     cmd.command = CCB_DEQUEUE; |
|  |     if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) { |
|  |         /* bail out */ |
|  |     } |
|  |  | 
|  | Finally, normal program cleanup should be done, i.e., unmapping the |
|  | completion area, closing the dax device, freeing memory, etc. |
|  |  | 
|  | Kernel example | 
|  | -------------- | 
|  |  | 
|  | The only difference in using the DAX in kernel code is the treatment | 
|  | of the completion area. Unlike user applications which mmap the | 
|  | completion area allocated by the driver, kernel code must allocate its | 
|  | own memory to use for the completion area, and this address and its | 
|  | type must be given in the CCB:: | 
|  |  | 
|  |     ccb->control |=      /* Table 36.1, CCB Header Format */ |
|  |         (3L << 32);     /* completion area address type = primary virtual */ |
|  |  |
|  |     ccb->completion = (unsigned long) completion_area;   /* Completion area address */ |
|  |  | 
|  | The dax submit hypercall is made directly. The flags used in the | 
|  | ccb_submit call are documented in the DAX HV API in section 36.3.1. |
|  |  | 
|  | :: | 
|  |  | 
|  |     #include <asm/hypervisor.h> |
|  |  |
|  |     hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64, |
|  |                              HV_CCB_QUERY_CMD | |
|  |                              HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY | |
|  |                              HV_CCB_VA_PRIVILEGED, |
|  |                              0, &bytes_accepted, &status_data); |
|  |  |
|  |     if (hv_rv != HV_EOK) { |
|  |         /* hv_rv is an error code, status_data contains */ |
|  |         /* potential additional status, see 36.3.1.1 */ |
|  |     } |
|  |  | 
|  | After the submission, the completion area polling code is identical to | 
|  | that in user land:: | 
|  |  | 
|  |     while (1) { |
|  |         /* Monitored Load */ |
|  |         __asm__ __volatile__("lduba [%1] 0x84, %0\n" |
|  |                              : "=r" (status) |
|  |                              : "r"  (completion_area)); |
|  |  |
|  |         if (status)	     /* 0 indicates command in progress */ |
|  |             break; |
|  |  |
|  |         /* MWAIT */ |
|  |         __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */ |
|  |     } |
|  |  |
|  |     if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */ |
|  |         /* completion_area[0] contains the completion status */ |
|  |         /* completion_area[1] contains an error code, see 36.2.2 */ |
|  |     } |
|  |  | 
|  | The output bitmap is ready for consumption immediately after the | 
|  | completion status indicates success. | 
|  |  | 
|  | Excerpt from UltraSPARC Virtual Machine Specification |
|  | ===================================================== | 
|  |  | 
|  | .. include:: dax-hv-api.txt | 
|  |    :literal: |