 =======================================
 Oracle Data Analytics Accelerator (DAX)
 =======================================

 DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
 (DAX2) processor chips, and has direct access to the CPU's L3 caches
 as well as physical memory. It can perform several operations on data
 streams with various input and output formats.  A driver provides a
 transport mechanism and has limited knowledge of the various opcodes
 and data formats. A user space library provides high level services
 and translates these into low level commands which are then passed
 into the driver and subsequently the Hypervisor and the coprocessor.
 The library is the recommended way for applications to use the
 coprocessor, and the driver interface is not intended for general use.
 This document describes the general flow of the driver, its
 structures, and its programmatic interface. It also provides example
 code sufficient to write user or kernel applications that use DAX
 functionality.

 The user library is open source and available at:

     https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

 The Hypervisor interface to the coprocessor is described in detail in
 the accompanying document, dax-hv-api.txt, which is a plain text
 excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
 Specification" version 3.0.20+15, dated 2017-09-25.


 High Level Overview
 ===================

 A coprocessor request is described by a Command Control Block
 (CCB). The CCB contains an opcode and various parameters. The opcode
 specifies what operation is to be done, and the parameters specify
 options, flags, sizes, and addresses.  The CCB (or an array of CCBs)
 is passed to the Hypervisor, which handles queueing and scheduling of
 requests to the available coprocessor execution units. A status code
 returned indicates if the request was submitted successfully or if
 there was an error.  One of the addresses given in each CCB is a
 pointer to a "completion area", which is a 128 byte memory block that
 is written by the coprocessor to provide execution status. No
 interrupt is generated upon completion; the completion area must be
 polled by software to find out when a transaction has finished, but
 the M7 and later processors provide a mechanism to pause the virtual
 processor until the completion status has been updated by the
 coprocessor. This is done using the monitored load and mwait
 instructions, which are described in more detail later.  The DAX
 coprocessor was designed so that after a request is submitted, the
 kernel is no longer involved in the processing of it.  The polling is
 done at the user level, which results in almost zero latency between
 completion of a request and resumption of execution of the requesting
 thread.


 Addressing Memory
 =================

 The kernel does not have access to physical memory in the Sun4v
 architecture, as there is an additional level of memory virtualization
 present. This intermediate level is called "real" memory, and the
 kernel treats this as if it were physical.  The Hypervisor handles the
 translations between real memory and physical so that each logical
 domain (LDOM) can have a partition of physical memory that is isolated
 from that of other LDOMs.  When the kernel sets up a virtual mapping,
 it specifies a virtual address and the real address to which it should
 be mapped.

 The DAX coprocessor can only operate on physical memory, so before a
 request can be fed to the coprocessor, all the addresses in a CCB must
 be converted into physical addresses. The kernel cannot do this since
 it has no visibility into physical addresses. So a CCB may contain
 either the virtual or real addresses of the buffers or a combination
 of them. An "address type" field is available for each address that
 may be given in the CCB. In all cases, the Hypervisor will translate
 all the addresses to physical before dispatching to hardware. Address
 translations are performed using the context of the process initiating
 the request.


 The Driver API
 ==============

 An application makes requests to the driver via the write() system
 call, and gets results (if any) via read(). The completion areas are
 made accessible via mmap(), and are read-only for the application.

 The request may either be an immediate command or an array of CCBs to
 be submitted to the hardware.

 Each open instance of the device is exclusive to the thread that
 opened it, and must be used by that thread for all subsequent
 operations. The driver open function creates a new context for the
 thread and initializes it for use.  This context contains pointers and
 values used internally by the driver to keep track of submitted
 requests. The completion area buffer is also allocated, and this is
 large enough to contain the completion areas for many concurrent
 requests.  When the device is closed, any outstanding transactions are
 flushed and the context is cleaned up.

 On a DAX1 system (M7), the device will be called "oradax1", while on a
 DAX2 system (M8) it will be "oradax2". If an application requires one
 or the other, it should simply attempt to open the appropriate
 device. Only one of the devices will exist on any given system, so the
 name can be used to determine what the platform supports.

 The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
 all of these, success is indicated by a return value from write()
 equal to the number of bytes given in the call. Otherwise -1 is
 returned and errno is set.

 CCB_DEQUEUE
 -----------

 Tells the driver to clean up resources associated with past
 requests. Since no interrupt is generated upon the completion of a
 request, the driver must be told when it may reclaim resources.  No
 further status information is returned, so the user should not
 subsequently call read().

 CCB_KILL
 --------

 Kills a CCB during execution. The CCB is guaranteed to not continue
 executing once this call returns successfully. On success, read() must
 be called to retrieve the result of the action.

 CCB_INFO
 --------

 Retrieves information about a currently executing CCB. Note that some
 Hypervisors might return 'notfound' when the CCB is in 'inprogress'
 state. To ensure a CCB in the 'notfound' state will never be executed,
 CCB_KILL must be invoked on that CCB. Upon success, read() must be
 called to retrieve the details of the action.

 Submission of an array of CCBs for execution
 ---------------------------------------------

 A write() whose length is a multiple of the CCB size is treated as a
 submit operation. The file offset is treated as the index of the
 completion area to use, and may be set via lseek() or using the
 pwrite() system call. If -1 is returned then errno is set to indicate
 the error. Otherwise, the return value is the length of the array that
 was actually accepted by the coprocessor. If the accepted length is
 equal to the requested length, then the submission was completely
 successful and there is no further status needed; hence, the user
 should not subsequently call read(). Partial acceptance of the CCB
 array is indicated by a return value less than the requested length,
 and read() must be called to retrieve further status information.  The
 status will reflect the error caused by the first CCB that was not
 accepted, and status_data will provide additional data in some cases.
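
 The three outcomes described above can be told apart from the return
 value alone. The following is a minimal sketch of that decision; the
 enum and function names are illustrative and are not part of the
 driver API::

    #include <stddef.h>
    #include <sys/types.h>

    /* Hypothetical names, not taken from oradax.h. */
    enum submit_outcome {
        SUBMIT_ERRNO,     /* write() returned -1; consult errno */
        SUBMIT_COMPLETE,  /* every CCB accepted; do not call read() */
        SUBMIT_PARTIAL    /* fewer bytes accepted; call read() for status */
    };

    /* ret is the value returned by write()/pwrite(), requested is the
     * byte count that was passed to it. */
    static enum submit_outcome classify_submit(ssize_t ret, size_t requested)
    {
        if (ret < 0)
            return SUBMIT_ERRNO;
        if ((size_t)ret == requested)
            return SUBMIT_COMPLETE;
        return SUBMIT_PARTIAL;
    }

 A caller would invoke read() only when the result is SUBMIT_PARTIAL.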

 MMAP
 ----

 The mmap() function provides access to the completion area allocated
 in the driver.  Note that the completion area is not writeable by the
 user process, and the mmap call must not specify PROT_WRITE.
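
 Since each completion area is 128 bytes and a submission selects one
 by index, the status bytes for a given request can be located inside
 the mapping with simple pointer arithmetic. The sketch below assumes
 that the mapping is laid out as consecutive 128-byte completion areas;
 the macro and function names are illustrative only::

    #include <stddef.h>
    #include <stdint.h>

    #define CCA_SIZE 128    /* size of one completion area, per the overview */

    /* Hypothetical helper: start of the completion area with the given
     * index, where base is the address returned by mmap(). */
    static const volatile uint8_t *completion_slot(const void *base,
                                                   size_t index)
    {
        return (const volatile uint8_t *)base + index * CCA_SIZE;
    }

 The pointer is volatile-qualified because the coprocessor updates the
 area behind the compiler's back.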


 Completion of a Request
 =======================

 The first byte in each completion area is the command status which is
 updated by the coprocessor hardware. Software may take advantage of
 new M7/M8 processor capabilities to efficiently poll this status byte.
 First, a "monitored load" is achieved via a Load from Alternate Space
 (ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY).  Second, a
 "monitored wait" is achieved via the mwait instruction (a write to
 %asr28). This instruction is like pause in that it suspends execution
 of the virtual processor for the given number of nanoseconds, but in
 addition will terminate early when one of several events occurs. If the
 block of data containing the monitored location is modified, then the
 mwait terminates. This causes software to resume execution immediately
 (without a context switch or kernel to user transition) after a
 transaction completes. Thus the latency between transaction completion
 and resumption of execution may be just a few nanoseconds.


 Application Life Cycle of a DAX Submission
 ==========================================

  - open dax device
  - call mmap() to get the completion area address
  - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
  - submit CCB via write() or pwrite()
  - go into a loop executing monitored load + monitored wait and
    terminate when the command status indicates the request is complete
    (CCB_KILL or CCB_INFO may be used any time as necessary)
  - perform a CCB_DEQUEUE
  - call munmap() for completion area
  - close the dax device


 Memory Constraints
 ==================

 The DAX hardware operates only on physical addresses. Therefore, it is
 not aware of virtual memory mappings and the discontiguities that may
 exist in the physical memory that a virtual buffer maps to. There is
 no I/O TLB or any scatter/gather mechanism. All buffers, whether input
 or output, must reside in a physically contiguous region of memory.

 The Hypervisor translates all addresses within a CCB to physical
 before handing off the CCB to DAX. The Hypervisor determines the
 virtual page size for each virtual address given, and uses this to
 program a size limit for each address. This prevents the coprocessor
 from reading or writing beyond the bound of the virtual page, even
 though it is accessing physical memory directly. A simpler way of
 saying this is that a DAX operation will never "cross" a virtual page
 boundary. If an 8k virtual page is used, then the data is strictly
 limited to 8k. If a user's buffer is larger than 8k, then a larger
 page size must be used, or the transaction size will be truncated to
 8k.

 Huge pages. A user may allocate huge pages using standard interfaces.
 Memory buffers residing on huge pages may be used to achieve much
 larger DAX transaction sizes, but the rules must still be followed,
 and no transaction will cross a page boundary, even a huge page.  A
 major caveat is that Linux on Sparc presents 8 MB as one of the huge
 page sizes. Sparc does not actually provide an 8 MB hardware page size,
 and this size is synthesized by pasting together two 4 MB pages. The
 reasons for this are historical, and it creates an issue because only
 half of this 8 MB page can actually be used for any given buffer in a
 DAX request, and it must be either the first half or the second half;
 it cannot be a 4 MB chunk in the middle, since that crosses a
 (hardware) page boundary. Note that this entire issue may be hidden by
 higher level libraries.
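
 An application that manages its own buffers can check the page
 constraint before building a CCB. The following is a minimal sketch,
 assuming the caller already knows the hardware page size backing the
 buffer (for example 8 KB, or 4 MB for either half of a synthesized
 huge page); the function names are illustrative::

    #include <stddef.h>
    #include <stdint.h>

    /* Bytes available from addr up to the end of the page containing it.
     * page_size must be a power of two. */
    static size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
    {
        return page_size - (size_t)(addr & (uintptr_t)(page_size - 1));
    }

    /* Nonzero if the range [addr, addr + len) lies within a single page. */
    static int fits_in_page(uintptr_t addr, size_t len, size_t page_size)
    {
        return len != 0 && len <= bytes_to_page_end(addr, page_size);
    }

 With a 4 MB page size, a 4 MB buffer that starts 2 MB into a huge page
 fails this check, which matches the "chunk in the middle" case above.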


 CCB Structure
 -------------
 A CCB is an array of 8 64-bit words. Several of these words provide
 command opcodes, parameters, flags, etc., and the rest are addresses
 for the completion area, output buffer, and various inputs::

    struct ccb {
        u64   control;
        u64   completion;
        u64   input0;
        u64   access;
        u64   input1;
        u64   op_data;
        u64   output;
        u64   table;
    };

 See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
 each of these fields, and see dax-hv-api.txt for a complete description
 of the Hypervisor API available to the guest OS (i.e., the Linux kernel).

 The first word (control) is examined by the driver for the following:
  - CCB version, which must be consistent with hardware version
  - Opcode, which must be one of the documented allowable commands
  - Address types, which must be set to "virtual" for all the addresses
    given by the user, thereby ensuring that the application can
    only access memory that it owns


 Example Code
 ============

 The DAX is accessible to both user and kernel code.  The kernel code
 can make hypercalls directly while the user code must use wrappers
 provided by the driver. The setup of the CCB is nearly identical for
 both; the only difference is in preparation of the completion area. An
 example of user code is given now, with kernel code afterwards.

 In order to program using the driver API, the file
 arch/sparc/include/uapi/asm/oradax.h must be included.

 First, the proper device must be opened. For M7 it will be
 /dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
 procedure is to attempt to open both, as only one will succeed::

 	fd = open("/dev/oradax1", O_RDWR);
 	if (fd < 0)
 		fd = open("/dev/oradax2", O_RDWR);
 	if (fd < 0) {
 		/* No DAX found */
 	}

 Next, the completion area must be mapped::

       completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

 All input and output buffers must be fully contained in one hardware
 page, since as explained above, the DAX is strictly constrained by
 virtual page boundaries.  In addition, the output buffer must be
 64-byte aligned and its size must be a multiple of 64 bytes because
 the coprocessor writes in units of cache lines.
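
 One way to satisfy the alignment and size rule for a bit vector output
 is sketched below. It uses the C11 aligned_alloc() function; the
 helper names are illustrative, and the sketch does not by itself
 guarantee that the buffer stays within one hardware page::

    #include <stddef.h>
    #include <stdlib.h>

    #define DAX_OUT_ALIGN 64    /* alignment and size granule stated above */

    /* Bytes needed to hold nbits output bits, rounded up to 64 bytes. */
    static size_t dax_output_bytes(size_t nbits)
    {
        size_t bytes = (nbits + 7) / 8;

        return (bytes + DAX_OUT_ALIGN - 1) / DAX_OUT_ALIGN * DAX_OUT_ALIGN;
    }

    /* Allocate a 64-byte aligned output buffer; release it with free().
     * Returns NULL for nbits == 0 or on allocation failure. */
    static void *dax_alloc_output(size_t nbits)
    {
        if (nbits == 0)
            return NULL;
        return aligned_alloc(DAX_OUT_ALIGN, dax_output_bytes(nbits));
    }

 Because a 64-byte aligned allocation can still straddle a page
 boundary, the page constraint should be checked separately.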

 This example demonstrates the DAX Scan command, which takes as input a
 vector and a match value, and produces a bitmap as the output. For
 each input element that matches the value, the corresponding bit is
 set in the output.

 In this example, the input vector consists of a series of single bits,
 and the match value is 0. So each 0 bit in the input will produce a 1
 in the output, and vice versa, which produces an output bitmap which
 is the input bitmap inverted.
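
 When testing, it can help to compare the coprocessor output against a
 plain software model of the behaviour just described. The sketch below
 is such a model for 1-bit elements. It numbers bits from the most
 significant bit of each byte and clears unused trailing bits; both
 choices are assumptions of this sketch rather than statements about
 the hardware, and the function name is illustrative::

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Reference model: output bit i is 1 iff input bit i equals match.
     * Bit i is taken MSB-first within each byte (an assumption). */
    static void scan_bits_model(const uint8_t *in, uint8_t *out,
                                size_t nbits, unsigned match)
    {
        size_t i;

        memset(out, 0, (nbits + 7) / 8);
        for (i = 0; i < nbits; i++) {
            unsigned bit = (in[i / 8] >> (7 - i % 8)) & 1u;

            if (bit == (match & 1u))
                out[i / 8] |= (uint8_t)(0x80u >> (i % 8));
        }
    }

 With a match value of 0 the model inverts its input, so an input byte
 of 0xA5 yields 0x5A.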

 For details of all the parameters and bits used in this CCB, please
 refer to section 36.2.1.3 of the DAX Hypervisor API document, which
 describes the Scan command in detail::

 	ccb->control =       /* Table 36.1, CCB Header Format */
 		  (2L << 48)     /* command = Scan Value */
 		| (3L << 40)     /* output address type = primary virtual */
 		| (3L << 34)     /* primary input address type = primary virtual */
 		             /* Section 36.2.1, Query CCB Command Formats */
 		| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
 		| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
 		| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
 		| (0 <<  5)	/* 36.2.1.3 First scan criteria size = 0 (1 byte) */
 		| (31 << 0);	/* 36.2.1.3 Disable second scan criteria */

 	ccb->completion = 0;    /* Completion area address, to be filled in by driver */

 	ccb->input0 = (unsigned long) input; /* primary input address */

 	ccb->access =       /* Section 36.2.1.2, Data Access Control */
 		  (2 << 24)    /* Primary input length format = bits */
 		| (nbits - 1); /* number of bits in primary input stream, minus 1 */

 	ccb->input1 = 0;       /* secondary input address, unused */

 	ccb->op_data = 0;      /* scan criteria (value to be matched) */

 	ccb->output = (unsigned long) output;	/* output address */

 	ccb->table = 0;	       /* table address, unused */
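
 The shifts in the example above rely on long being 64 bits wide. A
 width-independent variant can be sketched with uint64_t. The field
 positions are copied from the example; the macro and function names
 are illustrative and are not taken from oradax.h or the HV API
 document::

    #include <stdint.h>

    #define CTL_COMMAND(x)       ((uint64_t)(x) << 48)
    #define CTL_OUT_ATYPE(x)     ((uint64_t)(x) << 40)
    #define CTL_IN0_ATYPE(x)     ((uint64_t)(x) << 34)
    #define CTL_IN0_FORMAT(x)    ((uint64_t)(x) << 28)
    #define CTL_IN0_ELEM_SIZE(x) ((uint64_t)(x) << 23)
    #define CTL_OUT_FORMAT(x)    ((uint64_t)(x) << 10)
    #define CTL_CRIT1_SIZE(x)    ((uint64_t)(x) << 5)
    #define CTL_CRIT2_SIZE(x)    ((uint64_t)(x) << 0)

    /* Control word for the 1-bit Scan Value example. */
    static uint64_t scan_control_word(void)
    {
        return CTL_COMMAND(2)        /* Scan Value */
             | CTL_OUT_ATYPE(3)      /* primary virtual */
             | CTL_IN0_ATYPE(3)      /* primary virtual */
             | CTL_IN0_FORMAT(1)     /* fixed width bit packed */
             | CTL_IN0_ELEM_SIZE(0)  /* 1 bit */
             | CTL_OUT_FORMAT(8)     /* bit vector */
             | CTL_CRIT1_SIZE(0)     /* 1 byte */
             | CTL_CRIT2_SIZE(31);   /* second criteria disabled */
    }

    /* Access word: length format "bits", length nbits - 1.
     * The caller must ensure nbits >= 1 and within the field's range. */
    static uint64_t scan_access_word(uint32_t nbits)
    {
        return ((uint64_t)2 << 24) | (uint64_t)(nbits - 1);
    }

 The results would be assigned to ccb->control and ccb->access in place
 of the open-coded expressions.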

 The CCB submission is a write() or pwrite() system call to the
 driver. If the call fails, then a read() must be used to retrieve the
 status::

 	if (pwrite(fd, ccb, 64, 0) != 64) {
 		struct ccb_exec_result status;
 		read(fd, &status, sizeof(status));
 		/* bail out */
 	}

 After a successful submission of the CCB, the completion area may be
 polled to determine when the DAX is finished. Detailed information on
 the contents of the completion area can be found in section 36.2.2 of
 the DAX HV API document::

 	while (1) {
 		/* Monitored Load */
 		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
 				     : "=r" (status)
 				     : "r"  (completion_area));

 		if (status)	     /* 0 indicates command in progress */
 			break;

 		/* MWAIT */
 		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
 	}

 A completion area status of 1 indicates successful completion of the
 CCB and validity of the output bitmap, which may be used immediately.
 All other non-zero values indicate error conditions which are
 described in section 36.2.2::

 	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
 		/* completion_area[0] contains the completion status */
 		/* completion_area[1] contains an error code, see 36.2.2 */
 	}
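
 Where the monitored load and mwait instructions are not available, for
 instance when exercising the surrounding logic off-target, the same
 check can be written as a bounded loop over a volatile read. This is
 only a sketch: the function name and the poll limit are illustrative,
 and the nanosleep() call enters the kernel, so it gives up the
 low-latency property of the mwait loop::

    #include <stdint.h>
    #include <time.h>

    /* Poll the status byte with a plain volatile load, sleeping about
     * 1000 ns between reads. Returns the status byte, or 0 if it was
     * still "in progress" after max_polls reads. */
    static uint8_t poll_status(const volatile uint8_t *completion_area,
                               unsigned long max_polls)
    {
        const struct timespec delay = { 0, 1000 };
        uint8_t status = 0;

        while (max_polls--) {
            status = completion_area[0];
            if (status)    /* 0 indicates command in progress */
                break;
            nanosleep(&delay, NULL);
        }
        return status;
    }

 A return value of 1 indicates success, 0 indicates that the limit was
 reached, and any other value is an error status as described above.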

 After the completion area has been processed, the driver must be
 notified that it can release any resources associated with the
 request. This is done via the dequeue operation::

 	struct dax_command cmd;
 	cmd.command = CCB_DEQUEUE;
 	if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
 		/* bail out */
 	}

 Finally, normal program cleanup should be done, i.e., unmapping
 completion area, closing the dax device, freeing memory etc.

 Kernel example
 --------------

 The only difference in using the DAX in kernel code is the treatment
 of the completion area. Unlike user applications which mmap the
 completion area allocated by the driver, kernel code must allocate its
 own memory to use for the completion area, and this address and its
 type must be given in the CCB::

 	ccb->control |=      /* Table 36.1, CCB Header Format */
 	        (3L << 32);     /* completion area address type = primary virtual */

 	ccb->completion = (unsigned long) completion_area;   /* Completion area address */

 The dax submit hypercall is made directly. The flags used in the
 ccb_submit call are documented in the DAX HV API in section 36.3.1.

 ::

   #include <asm/hypervisor.h>

 	hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
 				 HV_CCB_QUERY_CMD |
 				 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
 				 HV_CCB_VA_PRIVILEGED,
 				 0, &bytes_accepted, &status_data);

 	if (hv_rv != HV_EOK) {
 		/* hv_rv is an error code, status_data contains */
 		/* potential additional status, see 36.3.1.1 */
 	}

 After the submission, the completion area polling code is identical to
 that in user land::

 	while (1) {
 		/* Monitored Load */
 		__asm__ __volatile__("lduba [%1] 0x84, %0\n"
 				     : "=r" (status)
 				     : "r"  (completion_area));

 		if (status)	     /* 0 indicates command in progress */
 			break;

 		/* MWAIT */
 		__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
 	}

 	if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
 		/* completion_area[0] contains the completion status */
 		/* completion_area[1] contains an error code, see 36.2.2 */
 	}

 The output bitmap is ready for consumption immediately after the
 completion status indicates success.

 Excerpt from UltraSPARC Virtual Machine Specification
 =====================================================

  .. include:: dax-hv-api.txt
     :literal:
	=======================================
	Oracle Data Analytics Accelerator (DAX)
	=======================================

	DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
	(DAX2) processor chips, and has direct access to the CPU's L3 caches
	as well as physical memory. It can perform several operations on data
	streams with various input and output formats. A driver provides a
	transport mechanism and has limited knowledge of the various opcodes
	and data formats. A user space library provides high level services
	and translates these into low level commands which are then passed
	into the driver and subsequently the Hypervisor and the coprocessor.
	The library is the recommended way for applications to use the
	coprocessor, and the driver interface is not intended for general use.
	This document describes the general flow of the driver, its
	structures, and its programmatic interface. It also provides example
	code sufficient to write user or kernel applications that use DAX
	functionality.

	The user library is open source and available at:

	https://oss.oracle.com/git/gitweb.cgi?p=libdax.git

	The Hypervisor interface to the coprocessor is described in detail in
	the accompanying document, dax-hv-api.txt, which is a plain text
	excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
	Specification" version 3.0.20+15, dated 2017-09-25.


	High Level Overview
	===================

	A coprocessor request is described by a Command Control Block
	(CCB). The CCB contains an opcode and various parameters. The opcode
	specifies what operation is to be done, and the parameters specify
	options, flags, sizes, and addresses. The CCB (or an array of CCBs)
	is passed to the Hypervisor, which handles queueing and scheduling of
	requests to the available coprocessor execution units. A status code
	returned indicates if the request was submitted successfully or if
	there was an error. One of the addresses given in each CCB is a
	pointer to a "completion area", which is a 128 byte memory block that
	is written by the coprocessor to provide execution status. No
	interrupt is generated upon completion; the completion area must be
	polled by software to find out when a transaction has finished, but
	the M7 and later processors provide a mechanism to pause the virtual
	processor until the completion status has been updated by the
	coprocessor. This is done using the monitored load and mwait
	instructions, which are described in more detail later. The DAX
	coprocessor was designed so that after a request is submitted, the
	kernel is no longer involved in the processing of it. The polling is
	done at the user level, which results in almost zero latency between
	completion of a request and resumption of execution of the requesting
	thread.


	Addressing Memory
	=================

	The kernel does not have access to physical memory in the Sun4v
	architecture, as there is an additional level of memory virtualization
	present. This intermediate level is called "real" memory, and the
	kernel treats this as if it were physical. The Hypervisor handles the
	translations between real memory and physical so that each logical
	domain (LDOM) can have a partition of physical memory that is isolated
	from that of other LDOMs. When the kernel sets up a virtual mapping,
	it specifies a virtual address and the real address to which it should
	be mapped.

	The DAX coprocessor can only operate on physical memory, so before a
	request can be fed to the coprocessor, all the addresses in a CCB must
	be converted into physical addresses. The kernel cannot do this since
	it has no visibility into physical addresses. So a CCB may contain
	either the virtual or real addresses of the buffers or a combination
	of them. An "address type" field is available for each address that
	may be given in the CCB. In all cases, the Hypervisor will translate
	all the addresses to physical before dispatching to hardware. Address
	translations are performed using the context of the process initiating
	the request.


	The Driver API
	==============

	An application makes requests to the driver via the write() system
	call, and gets results (if any) via read(). The completion areas are
	made accessible via mmap(), and are read-only for the application.

	The request may either be an immediate command or an array of CCBs to
	be submitted to the hardware.

	Each open instance of the device is exclusive to the thread that
	opened it, and must be used by that thread for all subsequent
	operations. The driver open function creates a new context for the
	thread and initializes it for use. This context contains pointers and
	values used internally by the driver to keep track of submitted
	requests. The completion area buffer is also allocated, and this is
	large enough to contain the completion areas for many concurrent
	requests. When the device is closed, any outstanding transactions are
	flushed and the context is cleaned up.

	On a DAX1 system (M7), the device will be called "oradax1", while on a
	DAX2 system (M8) it will be "oradax2". If an application requires one
	or the other, it should simply attempt to open the appropriate
	device. Only one of the devices will exist on any given system, so the
	name can be used to determine what the platform supports.

	The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
	all of these, success is indicated by a return value from write()
	equal to the number of bytes given in the call. Otherwise -1 is
	returned and errno is set.

	CCB_DEQUEUE
	-----------

	Tells the driver to clean up resources associated with past
	requests. Since no interrupt is generated upon the completion of a
	request, the driver must be told when it may reclaim resources. No
	further status information is returned, so the user should not
	subsequently call read().

	CCB_KILL
	--------

	Kills a CCB during execution. The CCB is guaranteed to not continue
	executing once this call returns successfully. On success, read() must
	be called to retrieve the result of the action.

	CCB_INFO
	--------

	Retrieves information about a currently executing CCB. Note that some
	Hypervisors might return 'notfound' when the CCB is in 'inprogress'
	state. To ensure a CCB in the 'notfound' state will never be executed,
	CCB_KILL must be invoked on that CCB. Upon success, read() must be
	called to retrieve the details of the action.

	Submission of an array of CCBs for execution
	---------------------------------------------

	A write() whose length is a multiple of the CCB size is treated as a
	submit operation. The file offset is treated as the index of the
	completion area to use, and may be set via lseek() or using the
	pwrite() system call. If -1 is returned then errno is set to indicate
	the error. Otherwise, the return value is the length of the array that
	was actually accepted by the coprocessor. If the accepted length is
	equal to the requested length, then the submission was completely
	successful and there is no further status needed; hence, the user
	should not subsequently call read(). Partial acceptance of the CCB
	array is indicated by a return value less than the requested length,
	and read() must be called to retrieve further status information. The
	status will reflect the error caused by the first CCB that was not
	accepted, and status_data will provide additional data in some cases.

	MMAP
	----

	The mmap() function provides access to the completion area allocated
	in the driver. Note that the completion area is not writeable by the
	user process, and the mmap call must not specify PROT_WRITE.


	Completion of a Request
	=======================

	The first byte in each completion area is the command status which is
	updated by the coprocessor hardware. Software may take advantage of
	new M7/M8 processor capabilities to efficiently poll this status byte.
	First, a "monitored load" is achieved via a Load from Alternate Space
	(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
	"monitored wait" is achieved via the mwait instruction (a write to
	%asr28). This instruction is like pause in that it suspends execution
	of the virtual processor for the given number of nanoseconds, but in
	addition will terminate early when one of several events occur. If the
	block of data containing the monitored location is modified, then the
	mwait terminates. This causes software to resume execution immediately
	(without a context switch or kernel to user transition) after a
	transaction completes. Thus the latency between transaction completion
	and resumption of execution may be just a few nanoseconds.


	Application Life Cycle of a DAX Submission
	==========================================

	- open dax device
	- call mmap() to get the completion area address
	- allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
	- submit CCB via write() or pwrite()
	- go into a loop executing monitored load + monitored wait and
	terminate when the command status indicates the request is complete
	(CCB_KILL or CCB_INFO may be used any time as necessary)
	- perform a CCB_DEQUEUE
	- call munmap() for completion area
	- close the dax device


	Memory Constraints
	==================

	The DAX hardware operates only on physical addresses. Therefore, it is
	not aware of virtual memory mappings and the discontiguities that may
	exist in the physical memory that a virtual buffer maps to. There is
	no I/O TLB or any scatter/gather mechanism. All buffers, whether input
	or output, must reside in a physically contiguous region of memory.

	The Hypervisor translates all addresses within a CCB to physical
	before handing off the CCB to DAX. The Hypervisor determines the
	virtual page size for each virtual address given, and uses this to
	program a size limit for each address. This prevents the coprocessor
	from reading or writing beyond the bound of the virtual page, even
	though it is accessing physical memory directly. A simpler way of
	saying this is that a DAX operation will never "cross" a virtual page
	boundary. If an 8k virtual page is used, then the data is strictly
	limited to 8k. If a user's buffer is larger than 8k, then a larger
	page size must be used, or the transaction size will be truncated to
	8k.

	Huge pages. A user may allocate huge pages using standard interfaces.
	Memory buffers residing on huge pages may be used to achieve much
	larger DAX transaction sizes, but the rules must still be followed,
	and no transaction will cross a page boundary, even a huge page. A
	major caveat is that Linux on Sparc presents 8Mb as one of the huge
	page sizes. Sparc does not actually provide a 8Mb hardware page size,
	and this size is synthesized by pasting together two 4Mb pages. The
	reasons for this are historical, and it creates an issue because only
	half of this 8Mb page can actually be used for any given buffer in a
	DAX request, and it must be either the first half or the second half;
	it cannot be a 4Mb chunk in the middle, since that crosses a
	(hardware) page boundary. Note that this entire issue may be hidden by
	higher level libraries.


	CCB Structure
	-------------
	A CCB is an array of 8 64-bit words. Several of these words provide
	command opcodes, parameters, flags, etc., and the rest are addresses
	for the completion area, output buffer, and various inputs::

	struct ccb {
	u64 control;
	u64 completion;
	u64 input0;
	u64 access;
	u64 input1;
	u64 op_data;
	u64 output;
	u64 table;
	};

	See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
	each of these fields, and see dax-hv-api.txt for a complete description
	of the Hypervisor API available to the guest OS (ie, Linux kernel).

	The first word (control) is examined by the driver for the following:
	- CCB version, which must be consistent with hardware version
	- Opcode, which must be one of the documented allowable commands
	- Address types, which must be set to "virtual" for all the addresses
	given by the user, thereby ensuring that the application can
	only access memory that it owns


	Example Code
	============

	The DAX is accessible to both user and kernel code. The kernel code
	can make hypercalls directly while the user code must use wrappers
	provided by the driver. The setup of the CCB is nearly identical for
	both; the only difference is in preparation of the completion area. An
	example of user code is given now, with kernel code afterwards.

	In order to program using the driver API, the file
	arch/sparc/include/uapi/asm/oradax.h must be included.

	First, the proper device must be opened. For M7 it will be
	/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
	procedure is to attempt to open both, as only one will succeed::

	fd = open("/dev/oradax1", O_RDWR);
	if (fd < 0)
	fd = open("/dev/oradax2", O_RDWR);
	if (fd < 0)
	/* No DAX found */

	Next, the completion area must be mapped::

		completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);

	All input and output buffers must be fully contained in one hardware
	page, since as explained above, the DAX is strictly constrained by
	virtual page boundaries. In addition, the output buffer must be
	64-byte aligned and its size must be a multiple of 64 bytes because
	the coprocessor writes in units of cache lines.
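
	The buffer constraints above can be checked before a CCB is built.
	The helpers below are a hedged sketch: the function names are
	hypothetical, and ``page_size`` must be supplied by the caller as the
	size of the hardware page that actually backs the buffer, which this
	code has no way to discover on its own.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Return 1 if [addr, addr + len) lies entirely within one page.
 * page_size is assumed to be a power of two.
 */
static int within_one_page(uintptr_t addr, size_t len, size_t page_size)
{
	uintptr_t mask = ~(uintptr_t)(page_size - 1);

	if (len == 0 || len > page_size)
		return 0;
	return (addr & mask) == ((addr + len - 1) & mask);
}

/*
 * Output buffers additionally need 64-byte alignment and a size that
 * is a multiple of 64 bytes, as stated above.
 */
static int output_buffer_ok(uintptr_t addr, size_t len, size_t page_size)
{
	return (addr % 64 == 0) && (len % 64 == 0) &&
	       within_one_page(addr, len, page_size);
}
```

	An allocator such as aligned_alloc() can satisfy the 64-byte
	alignment, but it does not by itself guarantee that the buffer stays
	within one hardware page.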

	This example demonstrates the DAX Scan command, which takes as input a
	vector and a match value, and produces a bitmap as the output. For
	each input element that matches the value, the corresponding bit is
	set in the output.

	In this example, the input vector consists of a series of single bits,
	and the match value is 0. Each 0 bit in the input therefore produces
	a 1 in the output, and vice versa, so the output bitmap is the
	bitwise inverse of the input bitmap.
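
	Since the expected result is simply the inverted input, a software
	reference is easy to write and can be used to verify the DAX output
	while experimenting. The function below is an illustrative sketch
	with a hypothetical name; it assumes ``nbits`` is a multiple of 8 and
	makes no claim about how the hardware treats trailing bits in a
	partially used byte.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Software model of "scan 1-bit elements for the value 0":
 * every output bit is the complement of the matching input bit.
 * ASSUMPTION: nbits is a multiple of 8, so whole bytes are inverted.
 */
static void scan_zero_reference(const uint8_t *in, uint8_t *out, size_t nbits)
{
	for (size_t i = 0; i < nbits / 8; i++)
		out[i] = (uint8_t)~in[i];
}
```

	Comparing this result against the DAX output buffer with memcmp()
	over ``nbits / 8`` bytes gives a quick sanity check.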

	For details of all the parameters and bits used in this CCB, please
	refer to section 36.2.1.3 of the DAX Hypervisor API document, which
	describes the Scan command in detail::

		ccb->control =       /* Table 36.1, CCB Header Format */
			  (2L << 48)     /* command = Scan Value */
			| (3L << 40)     /* output address type = primary virtual */
			| (3L << 34)     /* primary input address type = primary virtual */
			/* Section 36.2.1, Query CCB Command Formats */
			| (1 << 28)     /* 36.2.1.1.1 primary input format = fixed width bit packed */
			| (0 << 23)     /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
			| (8 << 10)     /* 36.2.1.1.6 output format = bit vector */
			| (0 <<  5)     /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
			| (31 << 0);    /* 36.2.1.3 Disable second scan criteria */

		ccb->completion = 0;    /* Completion area address, to be filled in by driver */

		ccb->input0 = (unsigned long) input; /* primary input address */

		ccb->access =       /* Section 36.2.1.2, Data Access Control */
			  (2 << 24)    /* Primary input length format = bits */
			| (nbits - 1); /* number of bits in primary input stream, minus 1 */

		ccb->input1 = 0;       /* secondary input address, unused */

		ccb->op_data = 0;      /* scan criteria (value to be matched) */

		ccb->output = (unsigned long) output;	/* output address */

		ccb->table = 0;	       /* table address, unused */

	The CCB submission is a write() or pwrite() system call to the
	driver. If the call fails, then a read() must be used to retrieve the
	status::

		if (pwrite(fd, ccb, 64, 0) != 64) {
			struct ccb_exec_result status;
			read(fd, &status, sizeof(status));
			/* bail out */
		}

	After a successful submission of the CCB, the completion area may be
	polled to determine when the DAX is finished. Detailed information on
	the contents of the completion area can be found in section 36.2.2 of
	the DAX HV API document::

		while (1) {
			/* Monitored Load */
			__asm__ __volatile__("lduba [%1] 0x84, %0\n"
					     : "=r" (status)
					     : "r"  (completion_area));

			if (status)	     /* 0 indicates command in progress */
				break;

			/* MWAIT */
			__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
		}

	A completion area status of 1 indicates successful completion of the
	CCB and validity of the output bitmap, which may be used immediately.
	All other non-zero values indicate error conditions which are
	described in section 36.2.2::

		if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
			/* completion_area[0] contains the completion status */
			/* completion_area[1] contains an error code, see 36.2.2 */
		}
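
	The polling loop shown earlier uses SPARC-specific monitored load and
	MWAIT instructions and spins until the status changes. For testing
	the surrounding logic on other machines, or for bounding the wait, a
	plain volatile read with a poll limit can stand in. This is a
	portable sketch only, not a replacement for the monitored load; the
	function name and the timeout sentinel are hypothetical.

```c
#include <stdint.h>

/* Hypothetical sentinel; it is not a DAX completion status. */
enum { DAX_POLL_TIMEOUT = -1 };

/*
 * Poll the first byte of the completion area up to max_polls times.
 * Returns the non-zero completion status (1 = success, other values
 * are errors per section 36.2.2), or DAX_POLL_TIMEOUT if the status
 * byte is still 0 (command in progress) when the limit is reached.
 */
static int poll_completion(const volatile uint8_t *completion_area,
			   unsigned long max_polls)
{
	while (max_polls--) {
		uint8_t status = completion_area[0];

		if (status)
			return status;
	}
	return DAX_POLL_TIMEOUT;
}
```

	A caller would treat a return value of 1 as success, any other
	positive value as a DAX error, and the sentinel as a reason to keep
	waiting or give up.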

	After the completion area has been processed, the driver must be
	notified that it can release any resources associated with the
	request. This is done via the dequeue operation::

		struct dax_command cmd;
		cmd.command = CCB_DEQUEUE;
		if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
			/* bail out */
		}

	Finally, normal program cleanup should be done, i.e., unmapping the
	completion area, closing the DAX device, and freeing memory.

	Kernel example
	--------------

	The only difference in using the DAX in kernel code is the treatment
	of the completion area. Unlike user applications which mmap the
	completion area allocated by the driver, kernel code must allocate its
	own memory to use for the completion area, and this address and its
	type must be given in the CCB::

		ccb->control |=      /* Table 36.1, CCB Header Format */
			(3L << 32);     /* completion area address type = primary virtual */

		ccb->completion = (unsigned long) completion_area;   /* Completion area address */

	The DAX submit hypercall is made directly. The flags used in the
	ccb_submit call are documented in the DAX HV API in section 36.3.1.

	::

		#include <asm/hypervisor.h>

		hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
					 HV_CCB_QUERY_CMD |
					 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
					 HV_CCB_VA_PRIVILEGED,
					 0, &bytes_accepted, &status_data);

		if (hv_rv != HV_EOK) {
			/* hv_rv is an error code, status_data contains */
			/* potential additional status, see 36.3.1.1 */
		}

	After the submission, the completion area polling code is identical to
	that in user land::

		while (1) {
			/* Monitored Load */
			__asm__ __volatile__("lduba [%1] 0x84, %0\n"
					     : "=r" (status)
					     : "r"  (completion_area));

			if (status)	     /* 0 indicates command in progress */
				break;

			/* MWAIT */
			__asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::);    /* 1000 ns */
		}

		if (completion_area[0] != 1) {	/* section 36.2.2, 1 = command ran and succeeded */
			/* completion_area[0] contains the completion status */
			/* completion_area[1] contains an error code, see 36.2.2 */
		}

	The output bitmap is ready for consumption immediately after the
	completion status indicates success.

	Excerpt from UltraSPARC Virtual Machine Specification
	=====================================================

	.. include:: dax-hv-api.txt
	   :literal: