Diskless Oracle Virtual Machines with OVM 3.0.2

This article reviews a method for running Oracle's Xen-based Virtual Machine Server 3.0.2 offering (OVM) without local disks. The approach discussed here boots the hypervisor over the iSCSI protocol from a remote array. Although it is also possible to boot and run disklessly from Fibre Channel storage, iSCSI is chosen because it eliminates the need for extra Fibre Channel gear. Similarly, this article does not discuss using an iSCSI Host Bus Adapter (HBA), as that would also require extra hardware.

The setup I used to test these procedures included an Oracle Sun Blade 6000 chassis with two X6270 blades and two Sun Blade 6000 Network Express Module 24p 10Gb/s switches. For the management server I used an old Sun X4150 with four 600 GB disks running Solaris 11. The Oracle VM Manager was run on the X4150 under the VirtualBox type 2 hypervisor. Please note that running the OVM Manager under VirtualBox is not a configuration that is tested and supported by Oracle.

Why diskless

The first question to answer, of course, is why you would want to run your servers diskless. As always, the answer boils down to money. In this case the savings come from:

  • Hardware costs
  • Reduced disk failure costs
  • Ease of management

Let's look at each of these in turn. Since the default OVM boot image is only 4.4 GB, placing all your boot images on a shared array means you do not have to waste space by using only 4.4 GB of a 300 GB boot disk. In addition, if your array has compression and copy-on-write (COW) cloning, you can reduce the storage needs to a few megabytes per image (not counting the first image). To put it simply, you can boot over 60 servers from just a pair of mirrored 300 GB disks, and that is without compression or cloning. You can use the rest of the storage on that array to store your VMs, templates, and other repository images.

With fewer disks you will also have fewer disk failures to debug and locate.  Since most arrays these days do a better job than a simple server of reporting and displaying disk failures, you also save time fixing the failures that do happen.  Also, with a shared array, you can now use options like triple mirroring, multi-parity RAID, or remote replication without having to buy expensive RAID controllers for each server.

Managing all these boot images is simplified too. Since you have everything in one place, you only need to back up the images on the array, not each server. If your array offers snapshots, you can snapshot your images before a software upgrade and quickly roll back when things go wrong. Also, since all the images are available over the network, you can easily mount any boot image and repair it by hand if need be.

Naturally, you would not save any money if you were to buy an expensive array to boot just a couple of servers. Likewise, you would not realize much (if any) of the savings in ease of management or reduced hardware costs if you only have a few servers. However, for something like a small lab environment, you could scale back from an array and serve your iSCSI LUNs from a single server. You could even use the same server that runs the OVM manager, as was done while researching this article.

Boot method

Before we get into the nuts and bolts of how to implement this solution, let's look at a high-level overview. Getting the boot image up and running takes six phases. First, the PXE firmware uses DHCP to find the server from which we will TFTP load the open source Etherboot gPXE network boot loader. Next, the PXE firmware chain loads the gPXE boot loader using TFTP and transfers control to it. Then gPXE uses DHCP to find the iSCSI target parameters. In the next phase, gPXE loads the GRUB boot loader over iSCSI, and GRUB loads the kernel and initrd image. In the last two phases, the kernel executes the initrd image: the initrd loads the necessary kernel network drivers, initializes the network plumbing, and runs DHCP; finally, initrd mounts the needed file systems (including the root disk over iSCSI), changes root to the boot disk, and transfers control to the init process.

To simplify the diagram below, I show the same server providing DHCP, TFTP, and the iSCSI targets; this need not be the case in your network. I also give only a high-level view of the various protocols and do not show all the messages. For example, the DHCP request and ack are not shown, nor are all the TFTP requests/responses, nor any of the iSCSI protocol.

You may notice that DHCP is used three times. In the first invocation it is used to find the gPXE binary, and in the second and third invocations it is used to find the iSCSI server parameters. Different parameters are returned depending on whether the request originated from PXE or gPXE. To do this we use the capabilities of the ISC DHCP server, as described below.

For the boot method discussed here, it is assumed the Network Interface Card (NIC) does not support iSCSI. Many NICs can be set up to use iSCSI through BIOS settings, and some NICs that do not support iSCSI directly in the BIOS may still provide iSCSI support using the NIC vendor's tools. In this description we only use the NIC to load the gPXE boot loader. This allows us to use any NIC, and we do not have to modify BIOS settings on all the servers. If you choose to use the native NIC iSCSI capabilities, you would skip the first two phases of the boot and use the native PXE in the NIC instead of gPXE for phases three and four.

Installing the image

Before we can boot our image, we need to have it installed in an iSCSI LUN somewhere. If the OVM server installer were capable of adding iSCSI disks to the list of install disks, we could install directly to the iSCSI LUN. However, that is not currently possible, and modifying the install software to do so is beyond the scope of this article. Instead, we will install an image to a local disk and use it as a "golden" image that we copy into our iSCSI LUNs.

In order for this image to boot over iSCSI we will need to modify it slightly. First, we will need to modify the initrd image to load the necessary network drivers and mount the boot disk. If we are careful not to hard-code IP addresses and iSCSI target IDs, we only need to modify initrd once and can re-use it for all the clients. Second, we must modify the client-specific configuration settings in the image for each client: the iSCSI initiator ID, the Ethernet addresses, and possibly the host name and IP addresses.

You should note that this image will only work for one type of hardware. A server with different hardware components will likely need different drivers.  You should plan to make a different golden image for each hardware configuration.

Network configurations

There is one more thing we need to consider before we start setting up a boot image, and that is our network configuration. If our boot disk is mounted over the network, making network changes on the fly could bring down the network we use to access the root disk. For example, in the OVM manager under the network tab, if you add virtual machines to the same network used for management, a virtual bridge is added to the network device. If our boot disk were mounted over this device at the time, the iSCSI session would be lost and the root disk would hang, which would hang the whole server.

One way around this is to use a separate NIC (or pair of bonded NICs) for the boot network. Alternatively, you could use VLANs and boot over the management VLAN while using separate VLANs for all the other traffic. You might be tempted to use a dedicated VLAN just for booting and other VLANs for everything else. Unfortunately, this would require having two VLANs configured when the server is brought under control of the OVM manager (discovered), and that would confuse the OVM manager.

For this article, we will use one NIC (actually one pair of bonded NICs) with separate VLANs for management and other traffic. The server will boot over the management VLAN. This means we cannot modify the management VLAN while the server is running. Note that it is possible to skip VLANs and combine all traffic on one network by creating and booting over a virtual bridge device; this is also discussed below.

Because gPXE does not currently support VLANs, in our example we must boot over the default LAN and then switch over to the management VLAN when we mount the iSCSI drive.  If you use the native PXE firmware to iSCSI boot, it may be possible to boot directly over the VLAN.

Create the initial image

Before we can modify the image we have to create it. In this case we install the OVM server to a real hard disk and then copy it to a file-based image on an NFS server where it can be accessed remotely. When we do the copy, the disk we are copying should not be mounted. One of the easier ways to do this is to boot from a live CD image, such as KNOPPIX, and use dd to copy the image to our remote location. For example, if the OVM server was installed on disk sda, then after booting from KNOPPIX we could run the commands:

# Mount the remote NFS share
mkdir /tmp/workingdir
mount nfsserver:/workingdir /tmp/workingdir

# Calculate image size using the last block
fdisk -lu /dev/sda
Disk /dev/sda: 4409 MB, 4409393152 bytes
...
  Device Boot    Start       End   Blocks   Id  System
/dev/sda1  *        63    208844   104391   83   Linux
/dev/sda2       208845   6506324  3148740   83   Linux
/dev/sda3      6506325   8610839 1052257+   82   Linux swap / Solaris
expr 8610839 / 2048
4204
# Copy the image, add an extra MB
dd if=/dev/sda of=/tmp/workingdir/ovm.img bs=1024k count=4205

One note about the image: since the system will be diskless, the iSCSI disk will show up as sda. Therefore, the image we build should be installed on sda so that all the mounts are correct. Also, since we are booting over a VLAN (VLAN 3 in this example), we must use the VLAN option during the network setup of the install.

Modify initrd

Now we want to modify our initrd image. Since we need to copy the correct drivers, libraries, and commands into the modified initrd image, we want to copy all these components from our installed image. The easiest way to do this is to boot the OVM server we just installed and construct the modified initrd image there. The initrd image is just a gzipped cpio archive that gets unpacked into a RAM disk on boot, so on the now-booted OVM server we run:

mkdir /tmp/root
cd /tmp/root
gunzip -c /boot/initrd*.img | cpio -icudmv

The entry point into the initrd image is a nash script, init, that sits in the top-level directory. While nash contains a DHCP client and other useful facilities, it is not a full scripting language, and we need one to avoid hard-coding iSCSI targets and addresses. To get around this, we will use bash. In addition, tools to bring up the VLANs will be needed. Also, the device drivers for VLAN, iSCSI, and the NIC hardware are not present in the initrd image and will need to be installed. Let's install the drivers first:

KVER=2.6.32.21-41xen # Kernel version from uname -r on Dom0
KRNDIR=/lib/modules/${KVER}/kernel # Location of kernel modules
DRVDIR=${KRNDIR}/drivers # Location of driver modules
ADDMODS="
 ${DRVDIR}/scsi/scsi_transport_iscsi.ko
 ${DRVDIR}/scsi/libiscsi.ko
 ${DRVDIR}/scsi/libiscsi_tcp.ko
 ${DRVDIR}/scsi/iscsi_tcp.ko
 ${DRVDIR}/firmware/iscsi_ibft.ko
 ${DRVDIR}/dca/dca.ko
 ${DRVDIR}/net/ixgbe/ixgbe.ko
 ${KRNDIR}/crypto/crc32c.ko
 ${KRNDIR}/lib/libcrc32c.ko
 ${KRNDIR}/net/ipv6/ipv6.ko
 ${DRVDIR}/net/bonding/bonding.ko
 ${KRNDIR}/net/llc/llc.ko
 ${KRNDIR}/net/802/stp.ko
 ${KRNDIR}/net/802/garp.ko
 ${KRNDIR}/net/8021q/8021q.ko
"

# Copy in needed modules.
for mod in ${ADDMODS}; do
 FILE=$(basename $mod)
 cp ${mod} ./lib/${FILE}
 chmod 600 ./lib/${FILE}
done

If you are going to combine everything on one LAN, then you will also need the bridge driver "${KRNDIR}/net/bridge/bridge.ko". If you need a different network driver than ixgbe, load that driver instead. Do not forget to also load any dependencies; you can find a driver's dependencies using the modinfo command.
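For example, on the booted OVM server you can check which modules the boot NIC's driver depends on (the output will vary with your kernel and driver version); each dependency must also be copied into the initrd:

# List the dependencies of the ixgbe driver
modinfo -F depends ixgbe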

Next we load the tools and libraries needed:

ADDBINS="
 /sbin/iscsistart
 /bin/bash
 /sbin/ip
 /sbin/vconfig
"

# Find needed libraries.
for bfile in ${ADDBINS}; do
 if file ${bfile} | egrep -q '(statically linked)|(shell script)'; then
     continue
 fi
 LIBS=$(ldd ${bfile} | awk '
     /^[ ]*linux-vdso/{ next }
     /ld-linux-x86-64/{ print $1; next }
     {print $3}
 ')

 # Copy in the needed libraries
 for lib in ${LIBS}; do
     if [ -r ./${lib} ]; then
         continue
     fi
     mkdir -p ./$(dirname ${lib}) >/dev/null 2>&1
     cp ${lib} ./${lib}
 done

done

# Copy in needed binaries.
for bfile in ${ADDBINS}; do
 mkdir -p ./$(dirname ${bfile}) >/dev/null 2>&1
 cp ${bfile} ./${bfile}
done

Again, if you are just running everything on one network, you would need to add /usr/sbin/brctl too.

Now that we have the drivers in the image, we need to modify the init script to load them. We will also need to set up the network and attach the iSCSI target, which we will do in a separate bash shell script called startnet. First, the changes to the init script are shown; the additions are the iSCSI and network module loads plus the call to /startnet, and much of the unchanged script is elided:

...

echo "Loading usb-storage.ko module"
insmod /lib/usb-storage.ko
echo Waiting for driver initialization.
stabilized /proc/bus/usb/devices

# Load the modules needed for iSCSI
echo "Loading scsi_transport_iscsi.ko module"
insmod /lib/scsi_transport_iscsi.ko
echo "Loading libiscsi.ko module"
insmod /lib/libiscsi.ko
echo "Loading libiscsi_tcp.ko module"
insmod /lib/libiscsi_tcp.ko
echo "Loading iscsi_tcp.ko module"
insmod /lib/iscsi_tcp.ko
echo "Loading iscsi_ibft.ko module"
insmod /lib/iscsi_ibft.ko
echo "Loading dca.ko module"
insmod /lib/dca.ko
echo "Loading ixgbe.ko module"
insmod /lib/ixgbe.ko
echo "Loading crc32c.ko module"
insmod /lib/crc32c.ko
echo "Loading libcrc32c.ko module"
insmod /lib/libcrc32c.ko
echo "Loading ipv6.ko module"
insmod /lib/ipv6.ko
echo "Loading bonding.ko module"
insmod /lib/bonding.ko
echo "Loading llc.ko module"
insmod /lib/llc.ko
echo "Loading stp.ko module"
insmod /lib/stp.ko
echo "Loading garp.ko module"
insmod /lib/garp.ko
echo "Loading 8021q.ko module"
insmod /lib/8021q.ko 

# Bring up the network and attach the disk
mkdir /var
mkdir /var/lib
mkdir /var/lib/dhclient
/startnet

echo "Loading ide-core.ko module"
insmod /lib/ide-core.ko
echo "Loading ide-gd_mod.ko module"
insmod /lib/ide-gd_mod.ko
 ...

Notice that /var/lib/dhclient is created in the RAM disk. This directory will be needed by the startnet script, but we create it in the nash script because nash has a built-in mkdir command.

The startnet script must also be placed in the top-level directory of the initrd image. Remember that the gPXE boot loader has already run DHCP before it starts the kernel and initrd. The gPXE boot loader (and PXE for that matter) will pass the iSCSI boot parameters through an iSCSI Boot Firmware Table (iBFT) that is retained by the kernel. We can read these parameters using "iscsistart -f" (or by reading the variables under /sys/firmware/ibft provided by the iscsi_ibft module).

#!/bin/bash
getiBFTparm() {
 typeset parm opt e val
 parm=$1
 iscsistart -f | while read opt e val; do
    if [ "${opt}" = "${parm}" ]; then
        echo "${val}"
        return
    fi
 done
 echo ""
}

getDHCPopt() {
 typeset parm so opt val
 parm=$1
 while read so opt val; do

     if [ "${so}" = option ] && [ "${opt}" = "${parm}" ]; then
         echo "${val/;/}"
         return
     fi

     if [ "${so}" = "${parm}" ]; then
         echo "${opt/;/}"
         return
     fi
 done < /var/lib/dhclient/dhclient.leases
 echo ""
}

shcp() {
 typeset in out ln
 in=$1
 out=$2
 while read ln; do
     echo $ln
 done < $in > $out
}

# Read the interface and initiator from the iBFT
iface=$(getiBFTparm iface.net_ifacename)
inm=$(getiBFTparm iface.initiatorname)
tgn=$(getiBFTparm node.name)
ip addr flush dev ${iface}
ip link set dev ${iface} down

# Bring up the bonding device
echo "Bringing up bonding device"
echo "+${iface}" > /sys/class/net/bond0/bonding/slaves
echo 1 > /sys/class/net/bond0/bonding/mode
echo 250 > /sys/class/net/bond0/bonding/miimon
echo 1 > /sys/class/net/bond0/bonding/use_carrier
echo 500 > /sys/class/net/bond0/bonding/updelay
echo 500 > /sys/class/net/bond0/bonding/downdelay
ip link set dev ${iface} up
ip link set dev bond0 up

# Bring up VLAN 3
vconfig set_name_type DEV_PLUS_VID_NO_PAD
vconfig add bond0 3

# Run dhcp on the VLAN and pass the lease to the boot scripts
echo 'network --device bond0.3 --bootproto dhcp' > /dhup
nash /dhup
shcp /var/lib/dhclient/dhclient.leases /dev/.dhclient-bond0.3.leases

# Read the target parameters from DHCP
# <servername>:[protocol]:[port]:[LUN]:<targetname>
rpath=$(getDHCPopt filename)
rlist=(${rpath//:/\" \"})
tip=${rlist[0]//\"/}
prot=${rlist[1]//\"/}
port=${rlist[2]//\"/}
lun=${rlist[3]//\"/}

# Attach the disk
echo "attaching iSCSI device"
iscsistart -i ${inm} -t ${tgn} -g ${lun} -a ${tip} ${port:+-p ${port}}

As you can see, this script reads the boot NIC device from the iBFT, sets up a bonding device with only that NIC, and then sets up VLAN 3 on top of the bonding device. Note that any other NICs in the bonding device will be added later by the normal boot scripts. Also, the startnet script uses nash to run DHCP and copies the DHCP lease to a location where it will be picked up by the boot scripts.

In the last step, we wish to attach the iSCSI LUN. However, as discussed above, we booted using gPXE over the default VLAN.  This means the IP addresses in the iBFT are not the addresses we wish to use for the mount.  To get the correct addresses, we again pass the “filename” parameter in DHCP with the information we need.  To simplify parsing, though, the iSCSI target from the iBFT is used.

If you choose not to use VLANs and instead run everything over one network, the startnet script would look like this:

#!/bin/bash

getiBFTparm() {
 	typeset parm opt e val
 	parm=$1

 	iscsistart -f | while read opt e val; do
 		if [ "${opt}" = "${parm}" ]; then
 			echo "${val}"
 			return
 	fi
 	done
 	echo ""
}

getDHCPopt() {
 	typeset parm so opt val
 	parm=$1

 	while read so opt val; do
 		if [ "${so}" = option ] && [ "${opt}" = "${parm}" ]; then
 			echo "${val/;/}"
 			return
 		fi
 		if [ "${so}" = "${parm}" ]; then
 			echo "${opt/;/}"
 			return
 		fi
 	done < /var/lib/dhclient/dhclient.leases
 	echo ""
}

maskAddr() {
 	typeset mask addr i out val
 	mask=(${1//\./ })
 	addr=(${2//\./ })
 	unset out

 	for(( i = 0; i < ${#addr[@]}; i++ )); do
  		((val = ${addr[$i]} & ${mask[$i]}))
  		out="${out:+${out}.}$val"
  	done
  	echo ${out}
 }
 calcBCAST() {
  	typeset net i bcast
  	net=(${1//\./ })
  	for(( i = ${#net[@]} - 1; i >= 0; i-- )); do
 		if [ ${net[$i]} != 0 ]; then
 			break;
 		fi
 		net[$i]=255
 	done

 	bcast=${net[@]}
 	echo ${bcast// /\.}
}

mask2CIDR() {
 	typeset mask oct cidr tst
 	mask=${1//\./ }

 	cidr=0
 	for oct in ${mask}; do
 		tst=128
 		while [ $(($oct & $tst)) != 0 ]; do
 			((cidr++))
 			((tst >>= 1))
 		done
 	done

 	echo ${cidr}
}

shcp() {
 	typeset in out ln
 	in=$1
 	out=$2

 	while read ln; do
 		echo $ln
 	done < $in > $out
}

# Read the interface and DHCP options from the iBFT
iface=$(getiBFTparm iface.net_ifacename)
addr=$(getiBFTparm iface.ipaddress)
mask=$(getiBFTparm iface.subnet_mask)
route=$(getiBFTparm iface.gateway)

ip addr flush dev ${iface}
ip link set dev ${iface} down

# Bring up the bonding device
echo "Bringing up bonding device"
echo "+${iface}" > /sys/class/net/bond0/bonding/slaves
echo 1 > /sys/class/net/bond0/bonding/mode
echo 250 > /sys/class/net/bond0/bonding/miimon
echo 1 > /sys/class/net/bond0/bonding/use_carrier
echo 500 > /sys/class/net/bond0/bonding/updelay
echo 500 > /sys/class/net/bond0/bonding/downdelay
ip link set dev ${iface} up
ip link set dev bond0 up

# Setup virtual bridge
bdev=$(maskAddr ${mask} ${addr})
brctl addbr ${bdev}
/sbin/ip addr flush dev bond0
brctl addif ${bdev} bond0
brctl setfd ${bdev} 0
brctl stp ${bdev} off

cidr=$(mask2CIDR ${mask})
bcast=$(calcBCAST ${bdev})
ip link set dev ${bdev} up
ip addr add ${addr}/${cidr} broadcast ${bcast} dev ${bdev}
ip route add default via ${route} dev ${bdev}

# Attach the disk
echo "attacing iSCSI devices iscsistart -b"
iscsistart -b

Now that we have the initrd layout, we need to pack it up.

chmod 555 ./startnet
find . | cpio -oc --quiet | gzip -c -9 > /boot/initrd-${KVER}.img.iSCSI
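Before going further, it can be worth listing the new archive to confirm that startnet and the extra modules actually made it in:

# Quick sanity check of the repacked initrd
gunzip -c /boot/initrd-${KVER}.img.iSCSI | cpio -it --quiet | egrep 'startnet|8021q|iscsi'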

Now we want to copy the initrd image we just created into our golden image. To do this we must mount the golden image, so we need a machine that understands the ext3 file system, most likely a Linux system. This could be the NFS server on which we stored the image, or some other machine. We could even do this on the OVM server if we set up NFS or copy the image over by hand. To mount the /boot partition from the image, we calculate the partition offset from the start sector and perform a loopback mount:

mkdir /tmp/workingdir
mount nfsserver:/workingdir /tmp/workingdir
cd /tmp/workingdir
fdisk -lu ovm.img
 ...
  Device Boot    Start       End   Blocks   Id  System
ovm.img1 *          63    208844   104391   83  Linux
ovm.img2        208845   6506324  3148740   83  Linux
ovm.img3       6506325   8610839  1052257+  82  Linux swap / Solaris

expr 63 \* 512
32256

mount -o loop,offset=32256 ovm.img /mnt

Now copy the /boot/initrd-${KVER}.img.iSCSI file we just created over to the machine where our golden image is mounted and install it in the /boot partition of the golden image:

# Keep a copy of the original initrd, then install the iSCSI-capable one
mv /mnt/initrd-${KVER}.img /mnt/Oinitrd-${KVER}.img
cp initrd-${KVER}.img.iSCSI /mnt/initrd-${KVER}.img
umount /mnt
umount /tmp/workingdir # Might report busy until all changes are flushed

iSCSI target setup

Because some of the partitions are mounted using disk labels (e.g., swap), we either have to modify the disk labels or restrict which disks each diskless client sees. In our case we will restrict which disks each diskless client sees by using multiple iSCSI targets with one LUN per target.

The next step is to take our golden image, copy it into an iSCSI LUN, mount it, and fix the client-specific information such as addresses. Before we can do this, we need to set up the LUN. How you do this will depend on what you are using for storage. In this case I am running Solaris 11 on an old Sun X4150 with some storage, ZFS, and the OVM manager on the same server using VirtualBox (via VBoxHeadless). You do not have to implement your back-end storage this way; this is just one example of what could be done.

To take advantage of compression and cloning, first create a ZFS volume, turn on compression, and copy in the image:

# Create volume to store golden image
zfs create -s -o volblocksize=128K -V 4306048K pool1/golden
zfs set compression=on pool1/golden
dd if=ovm.img of=/dev/zvol/dsk/pool1/golden bs=128k
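Once the copy completes, you can see how effective the compression was; the exact numbers will depend on the image contents:

# Check the space actually consumed by the golden image
zfs get compressratio,used pool1/golden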

The host name of the diskless client in this example is x6270a, so we will also use that name as the image name. To create the diskless client image for x6270a, we clone the image, create a LUN using the cloned image, create an iSCSI target, and then restrict the target to the one LUN.

# Clone the golden image
zfs snapshot pool1/golden@x6270a
zfs clone pool1/golden@x6270a pool1/x6270a

# Create the iSCSI LUN, must use rdsk device for performance
sbdadm create-lu /dev/zvol/rdsk/pool1/x6270a
Created the following LU:
              GUID                  DATA SIZE            SOURCE
--------------------------------  -------------------  ----------------
600144f00800275417364ee79e2f0036  4409393152           /dev/zvol/rdsk/pool1/x6270a
LUN=600144f00800275417364ee79e2f0036

# Create the target
itadm create-target --alias x6270a
Target iqn.1986-03.com.sun:02:61eef508-70b9-ee18-9684-eac468458c8d successfully created
TGT=iqn.1986-03.com.sun:02:61eef508-70b9-ee18-9684-eac468458c8d
stmfadm offline-target ${TGT}
stmfadm create-tg x6270a # Create target group
stmfadm add-tg-member -g x6270a ${TGT}
stmfadm online-target ${TGT}

# Associate the restricted target group with the LUN
stmfadm add-view -t x6270a ${LUN}
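Before moving on, it is worth confirming that the LUN is only exposed through the x6270a target group:

# Verify the view entry and the target state
stmfadm list-view -l ${LUN}
itadm list-target -v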

Now that we have an iSCSI LUN with a bootable image, we have to tailor the image to boot with the identity of the diskless client, x6270a. To do this, we mount the new LUN on a Linux machine, add the client to the hosts table, give the client a unique iSCSI initiator ID, and make sure the MAC addresses in the client are correct.

# Attach the iSCSI LUN from the iSCSI server (x4150a), and mount it
TGT=iqn.1986-03.com.sun:02:61eef508-70b9-ee18-9684-eac468458c8d
iscsiadm -m node -o new -p x4150a -T ${TGT}
service iscsi restart
mount /dev/disk/by-path/*${TGT}*-part2 /mnt

# Fix the hosts table.
echo "192.168.3.3 x6270a x6270a.mydomain.com" >> /mnt/etc/hosts

Now we modify /mnt/etc/iscsi/initiatorname.iscsi to give the client a unique initiator ID, and edit /mnt/etc/sysconfig/network-scripts/ifcfg-eth* to make sure we are using the correct MAC addresses. We can then unmount the image.
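As a rough sketch of those edits (the initiator IQN below is only an example; use whatever naming convention you follow, and set each HWADDR to the MAC of the matching NIC on the client):

# Give x6270a its own initiator ID (example IQN)
echo "InitiatorName=iqn.1988-12.com.oracle:x6270a" > /mnt/etc/iscsi/initiatorname.iscsi

# Point the first interface at the client's MAC address
sed -i 's/^HWADDR=.*/HWADDR=00:1B:21:6F:D5:90/' /mnt/etc/sysconfig/network-scripts/ifcfg-eth0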

# Unmount the disk and remove all iSCSI connections.
umount /mnt
iscsiadm -m node --logoutall=all
iscsiadm -m node -o delete
service iscsi restart

TFTP and DHCP setup

Finally, TFTP and DHCP must be set up. TFTP is needed so the diskless client can download the gPXE binary; if your NIC supports iSCSI directly and you do not wish to use gPXE, you do not need to set up TFTP. If you are using gPXE, however, you can use the rom-o-matic web site to generate a gPXE image to place in the TFTP directory. Select the latest version (this article was tested with 1.0.1), choose the PXE bootstrap loader format (.kpxe) and the undionly driver, and click "Get Image". Put the binary in your TFTP root directory. Setting up TFTP is straightforward and there are quite a few tutorials available through your favorite search engine, so it will not be covered in detail here.
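As a minimal sketch, on a Solaris 11 host such as the x4150a used here, TFTP can be enabled through inetd (package and service names may vary by release):

# Install the TFTP service, add the standard inetd entry, and convert it to SMF
pkg install service/network/tftp
mkdir /tftpboot
echo "tftp dgram udp6 wait root /usr/sbin/in.tftpd in.tftpd -s /tftpboot" >> /etc/inet/inetd.conf
inetconv

# Drop the gPXE binary in the TFTP root
cp gpxe-1.0.1-undionly.kpxe /tftpboot/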

For convenience, we can set up the DHCP server on the OVM manager machine; just be sure all the VLANs are visible from this server as well. As mentioned, the ISC DHCP server was used for this article. First, there are a few gPXE-specific options that need to be defined in the global scope:

# DHCP Server Configuration file /etc/dhcpd.conf
ddns-update-style none;
option domain-name "mydomain.com";
option domain-name-servers 192.168.2.254;
get-lease-hostnames on;
default-lease-time 10800;
max-lease-time 86400;
authoritative;

# gPXE-specific encapsulated options
option space gpxe;
option gpxe-encap-opts code 175 = encapsulate gpxe;
option gpxe.priority code 1 = signed integer 8;
option gpxe.keep-san code 8 = unsigned integer 8;
option gpxe.no-pxedhcp code 176 = unsigned integer 8;
option gpxe.bus-id code 177 = string;
option gpxe.bios-drive code 189 = unsigned integer 8;
option gpxe.username code 190 = string;
option gpxe.password code 191 = string;
option gpxe.reverse-username code 192 = string;
option gpxe.reverse-password code 193 = string;
option gpxe.version code 235 = string;

# Other options that may be useful
option iscsi-initiator-iqn code 203 = string;
option vendor-encapsulated-options 3c:09:45:74:68:65:72:62:6f:6f:74:ff;

As described above, because we are using gPXE, we must perform the initial phases of the boot over the default VLAN, and then switch over to our management VLAN for the final phase of the boot. This means we need a DHCP setup for the default VLAN (192.168.2.0 in this case) that provides the TFTP and gPXE options, and a DHCP setup for the management VLAN (192.168.3.0) that provides the settings used in the final mount phase.  First the default VLAN settings:

subnet 192.168.2.0 netmask 255.255.255.0 {
 option routers 192.168.2.1;
 group {

   option gpxe.no-pxedhcp 1;
   if not exists gpxe.bus-id {
       filename "gpxe-1.0.1-undionly.kpxe";
       next-server x4150a;
   } else {
      filename "";
      option gpxe.keep-san 1;
   }

   host x6270a {

      hardware ethernet 00:1B:21:6F:D5:90;
      fixed-address x6270a;
      # iSCSI root path syntax:
      # iscsi:server:[protocol]:[port]:[LUN]:target
      if exists gpxe.bus-id {
          option root-path "iscsi:192.168.2.98:::0:iqn.1986-03 ...";
      }

   }

   host x6270b {

     hardware ethernet 00:1B:21:6F:D3:24;
     fixed-address x6270b;
     if exists gpxe.bus-id {
         option root-path "iscsi:192.168.2.98:::0:iqn.1986-03 ...";
     }

   }

 }

}

As you can see, we use the gpxe.bus-id parameter to decide whether the request is coming from native PXE or from gPXE. If the request is from native PXE, we supply the information needed to TFTP load the gPXE binary. If the request comes from gPXE, we supply the information needed to do the iSCSI bootstrap loading in the root-path option. Note that the full iSCSI target name is elided for brevity and that we are using x4150a as our TFTP server.

Next we need to supply the DHCP options for the management VLAN:

subnet 192.168.3.0 netmask 255.255.255.0 {

 group {
    filename "192.168.3.1:::0";

    host x6270a-v3 {
        hardware ethernet 00:1B:21:6F:D5:90;
        fixed-address 192.168.3.3;
    }

    host x6270b-v3 {
        hardware ethernet 00:1B:21:6F:D3:24;
        fixed-address 192.168.3.4;
    }

 }

}

The parameters for the final mount are supplied in the filename option and are read by the startnet script shown above. Since we have complete control over the startnet script, we are free to use any format we wish for this option. In this case we only need to specify the iSCSI target server (192.168.3.1) and the LUN (which will always be 0). Notice that our iSCSI target server, x4150a, is 192.168.2.98 on the default VLAN and 192.168.3.1 on VLAN 3. Now we just need to start up DHCP and the diskless client will be ready to boot:

service dhcpd start
chkconfig dhcpd on
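If dhcpd refuses to start, a syntax check of the configuration usually points at the offending line:

# Test the configuration without starting the daemon
dhcpd -t -cf /etc/dhcpd.conf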

One final note: things often go wrong on the first several tries. It will speed your debugging to run wireshark or some other network packet sniffer to make sure TFTP and DHCP are working as expected. Also, in the initrd scripts, you can place a call to bash anywhere to start up an interactive shell. At that point, you can run "iscsistart -f" to see what iBFT parameters were set, "ip addr list" or "ip link list" to see what interfaces have been set up, and "echo *" to do a primitive ls. If you need more debugging tools, you can always add them to your initrd image.
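For example, a temporary debugging hook in startnet (remove it once the boot works) could be as simple as:

# Hypothetical debug hook: drop to an interactive shell before attaching the disk
echo "DEBUG: dropping to a shell; type exit to continue the boot"
bash

Remember to repack the initrd and copy it back into the client image after removing the hook.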
