Karamba
16 September 2004

Summary

Karamba boots to both CPUs, the OS is on /dev/hda1, and there's 250GB of software RAID1 with the XFS file system on /dev/md0.

The raid drives are Maxtor 7Y250M0.

The raid is currently defined in /etc/mdadm/mdadm.conf and possibly mdadm.conf.init.

Hardware
Software RAID related commands
  • cfdisk
    • cfdisk /dev/sda
    • cfdisk /dev/sdb
  • md
    • cat /proc/mdstat
  • raidtools2
    • pico /etc/raidtab
    • mkraid /dev/md0
    • lsraid -R -p  (generate a raidtab)
    • just start raid2
    • just stop raid2 (/etc/init.d/raid2)
    • pico /etc/cron.daily/raidtools2
  • mdadm
    • mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    • pico /etc/mdadm/mdadm.conf
    • mdadm -S /dev/md0   (to stop the array)
    • mdadm -Q /dev/md0   (to query the array to see if it's ok)
    • mdadm -D /dev/md0 | more  (to see details)
    • just restart mdadm
    • mdadm --assemble --run /dev/md0 /dev/sdb1 /dev/sda1 (to assemble an array that wasn't autodetected, or that stopped)
    • just reconfigure mdadm
    • mdadm --zero-superblock /dev/sda1
    • mdadm --examine /dev/sda1
    • mdadm --examine /dev/sdb1
  • e2fs (XFS is preferred)
    • mke2fs  -b 4096 -R stride=8 /dev/md0
  • XFS
    • mkfs.xfs /dev/md0
Guides
  • md ("multiple devices" -- the kernel's software RAID)
  • raidtools2
    • man raidstart
    • man lsraid
    • just list-files raidtools2 | grep man
  • mdadm
    • man mdadm  |  info mdadm
  • XFS
  • kernel -- Frederik Schueler writes on 22 Sept 2004 (not sure what to make of it):
    • em64t-p4 - uniprocessor kernel image for intel P4 based systems with one processor and HT disabled
    • em64t-p4-smp - smp kernel image for Xeon based systems with 2 or more processors or uniprocessor P4 systems with HT enabled.
Software
  • mdadm
  • raidtools2
  • xfsprogs
  • xfsdump
Procedure in a nutshell
  • Use cfdisk to create partitions type FD on all individual drives and reboot
  • Build all components into the kernel, not as modules, install and reboot
    • sata_sil, libata, scsi, md, raid1, raid5, whatever it takes
  • Create /etc/raidtab and run mkraid /dev/md0
  • Add the XFS file system -- mkfs.xfs /dev/md0
  • Mount -- mount /md0
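The steps above, collected into one shell sketch. This is a sketch only, not tested on karamba: it uses sfdisk non-interactively where the notes use interactive cfdisk (an assumption), and it assumes the raidtab shown further down already exists.

```shell
# Sketch -- run as root on the target machine; DESTROYS data on sda/sdb.
# 1. Give each component drive a single type-FD (Linux RAID autodetect)
#    partition. sfdisk is used here so the step is scriptable (assumption;
#    interactive cfdisk does the same job).
echo ',,fd' | sfdisk /dev/sda
echo ',,fd' | sfdisk /dev/sdb
# 2. Reboot so the kernel rereads the partition tables, with sata_sil,
#    libata, scsi, md and raid1 built into the kernel (not modules).
# 3. With /etc/raidtab describing the array, create it:
mkraid /dev/md0
# 4. Put XFS on the array and mount it:
mkfs.xfs /dev/md0
mount /dev/md0 /md0
```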
SATA

For information on SATA support in Linux, see http://www.linuxmafia.com/faq/Hardware/sata.html:
Silicon Image 3112 / 3114 (integrated), and 3512 (PCI) (CMD Technology, Inc.) — libata driver set provides beta-level support (as of 2004-07-08) via the sata_sil driver. Note that enabling libata support for this chipset requires enabling CONFIG_BROKEN (under "Code maturity level options") in your kernel configuration, for reasons Jeff Garzik has explained.

Note that as of 06/2004, Silicon Image chipsets have a bug in "lba48" addressing (of drives over 137GB), necessitating a patch that will, as a necessary consequence, limit performance.

...or probably the kernel 2.4.x siimage driver (originally developed for the pre-SATA CMD680 chip used in many ATA host adapters), or alternatively the 2.4.21 or later kernel's low-level silraid (Arjan van de Ven's) driver or its superior replacement, medley (by Thomas Horsten — see below). "medley" or "silraid" works with the 2.4.x-only "ataraid" mid-level driver, and results in your partitions being addressed using a /dev/ataraid/d0p1 (etc.) device-naming convention.

Proprietary drivers are available from the manufacturer.

These chipsets can do a type of software RAID called "Medley", for which Linux 2.4.26 and later kernels include a low-level "medley" driver, which (like the older silraid driver that it replaces) works with the 2.4.x-only mid-level "ataraid" driver, and results in your partitions being addressed using a /dev/ataraid/d0p1 (etc.) device-naming convention. The kernel "medley" driver originated in Thomas Horsten's open-source medley driver. Note: So far, the medley driver supports only Medley's RAID0 "striped" mode, and not its RAID1 "mirrored" or RAID0+1 (AKA "RAID10") modes. Alternatively, you can use Linux's "md" software-RAID driver.

It is not entirely clear from this whether sata_sil is a RAID driver, which I had assumed, or you need a RAID driver on top of sata_sil -- and in that case, which? Turns out we need md, or Linux software RAID.

RAID
We should get a third identical drive and define them as RAID5 at the level of the RAID card (SII 3112) by pressing F4 at boot and hopefully making sense of the menu choices. Then we'll need to define them as RAID5 in Linux and finally put a single file system on the whole thing -- the three 250GB drives will appear as a single 500GB drive in RAID5.

I've checked out RAID a bit and found this:
  •  RAID0 is striping, more speed, less security
  •  RAID1 is mirroring, more security, less speed (Karamba's current configuration)
  •  RAID5 is useful for three or more drives -- more security, neutral for speed.
    • A RAID-5 set of N drives with a capacity of C MB per drive provides the capacity of C * (N - 1) MB
  • It looks like it's not possible to change the size of an existing RAID 5 array without removing all data on it
This last point means that once we define a three-disk array, we're locked into that size and won't be able to expand the array later (which is why starting bigger might otherwise be attractive). If you think we need more than 500GB of storage, buy two or three more Maxtor 250 drives now and get 750GB to 1TB of usable storage in the array. The "lost" (redundant) capacity remains constant at one drive, whether the array has three or five drives.
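The capacity formula above can be checked with shell arithmetic -- C * (N - 1), with C the per-drive capacity in GB:

```shell
# RAID5 usable capacity is C * (N - 1): one drive's worth goes to parity.
# Three 250GB drives:
echo $((250 * (3 - 1)))    # -> 500
# Five 250GB drives -- the parity overhead is still just one drive:
echo $((250 * (5 - 1)))    # -> 1000
```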

By the same token, a set of RAID1 arrays, which just does mirroring, is safer: it survives without data loss even if half the drives fail, as long as both drives of any mirrored pair don't fail together.

Software RAID

It turns out that the Silicon Image 3112 is not a full hardware RAID card but a hybrid software/hardware RAID solution. Under the 2.4 kernel it can be run with the Medley driver, but people report it runs faster under 2.6 as a SCSI system using MD, or Linux software RAID. This is what we should be using.

Without MD, the sata_sil driver just sees the individual drives, not the RAID array.

MD ('multiple devices') uses the admin program mdadm, which I've installed. The current setup is RAID1 (mirroring), so I've loaded the RAID1 module (using modconf), just so we get used to how MD works.

To configure mdadm, run "just reconfigure mdadm" -- it has some interesting parameters. For instance it will e-mail you if a disk fails.
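The mail-on-failure behaviour lives in mdadm.conf. A sketch of the relevant lines -- the address is a placeholder, and "just reconfigure mdadm" writes the Debian equivalent for you:

```shell
# /etc/mdadm/mdadm.conf fragment (sketch; address is a placeholder)
MAILADDR root@localhost    # mdadm --monitor mails here when a disk fails
DEVICE /dev/sda1 /dev/sdb1
ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1
```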

Configuration file -- I started by issuing
cp /usr/share/doc/mdadm/examples/mdadm.conf-example /etc/mdadm.conf
Once the array is created, change the configuration file to reflect the details. You don't need it to create the array, it's just used for reassembling it later, if required.
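Rather than typing the details in by hand, they can be captured from the running array (a sketch; the array must be assembled, and the config path here follows the /etc/mdadm/mdadm.conf location used elsewhere in these notes):

```shell
# Append an ARRAY line describing the running array to the config file.
# --detail --scan prints a line of the form:
#   ARRAY /dev/md0 level=raid1 num-devices=2 UUID=...
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
```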

I ended up using raidtools2 instead of mdadm -- as the latter gave me trouble.

For details, see /etc/raidtab.

Installation history

Update 22 September 2004

The kernel should from now on be compiled with gcc 3.4; I changed the symlink.

It turns out lowmem only handles 896MB of memory, so I enabled highmem -- and 2GB of RAM showed up! I also removed DRI, since the mach64 driver isn't included in the kernel (though we could patch it).

Comparing dmesg files, I noticed that swap is no longer being initialized, ever since devfs mount at boot was included in the kernel. In fact swapon -a doesn't work, as the /dev/hda2 partition isn't even seen! On the other hand, the system boots off /dev/hda1, so it's not as if there's a problem seeing the disk.
# grep swap *
dmesg-2004-09-09:Adding 2096472k swap on /dev/hda2.  Priority:-1 extents:1
dmesg-2004-09-15-md:Adding 2096472k swap on /dev/hda2.  Priority:-1 extents:1
But swap is enabled in all kernels -- and the Real Time Clock Driver also doesn't show. Now,
# fdisk -l
produces nothing, and cfdisk /dev/hda says "FATAL ERROR: Cannot open disk drive". Recall this is the drive the operating system is currently running on -- that is, /dev/hda1. The system in fact can't see the other partitions:
#swapon -a
  swapon: cannot stat /dev/hda2: No such file or directory
I rebooted with devfs=nomount to test this hypothesis. Indeed, that was the problem -- dmesg now shows,
Adding 2096472k swap on /dev/hda2.  Priority:-1 extents:1
Real Time Clock Driver v1.12
Whew! That's a relief. Make sure never to use devfs automount. fdisk now works fine.
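To make devfs=nomount stick across reboots it has to go on the kernel command line permanently. A sketch -- the notes don't say which boot loader karamba uses, so both common options are shown as assumptions:

```shell
# Option 1: lilo (assumption) -- in the image section of /etc/lilo.conf:
#   append="devfs=nomount"
# then reinstall the boot map:
#   lilo
# Option 2: grub legacy (assumption) -- add devfs=nomount to the kernel
# line in /boot/grub/menu.lst.
# Alternatively, rebuild the kernel without the devfs automount option
# (CONFIG_DEVFS_MOUNT) so the flag isn't needed at all.
```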

The kernel should now be in good shape; it's tweaked and checked.

19 September 2004: ATI Rage XL AGP card

The ATI Rage XL is in the Mach64 family. In official XFree86 releases there is currently no hardware-accelerated 3D support for Mach64; however, the mach64 branch in DRI CVS has an almost complete 3D driver.
It may not be worth installing 3D capability, but it's there if we want it. For instructions, see mini-HOWTO: Compiling and Installing the mach64 Branch of DRI.

In XF86Config-4 use "ati" as driver name. It automatically selects the correct driver.

16 September 2004: XFS

Karamba is now running sata raid (md software raid) and the xfs file system; the raid array is now autodetected at boot. XFS is the file system developed by SGI for Hollywood and other IRIX users -- a powerful, fast, journaled file system. Ideally we'd put the metadata (the journal) in a small, separate RAID array, but that seems a bit too complicated at this point. This is the file system we should be using for the archives.

Hardware RAID is currently limited to around 12 drives. Software RAID has a limit somewhere, but it's something like a couple of hundred drives. For practical purposes, however, we'll probably want to create several arrays of five or six drives.

RAID systems are currently limited to the size you define when you establish them. To create file systems that can be shrunk and grown at will, we should use LVM, or logical volume manager. I don't think we need it, but there are cases where it may be useful.

On karamba, the current setup is RAID1 and is ready for stress-testing. The RAID config file is at /etc/raidtab and you can see the status in cat /proc/mdstat.

15 September 2004:

To have the software RAID array found at boot, you have to partition the individual drives with the FD partition type (Linux RAID autodetect). Then you have to build every component that's required for the array to get going into the kernel itself, not as modules.

To get this working I had to build sata_sil, libata, scsi, md, and raid1 into the kernel, and use cfdisk to set up the component drives (sda and sdb) with partition type FD (Linux RAID autodetect).

I found mdadm a bit limited and installed raidtools2, which seems more robust -- though mdadm has some superior features. I created a raidtab:
raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        persistent-superblock   1
        device                  /dev/sda1
        raid-disk               0
        device                  /dev/sdb1
        raid-disk               1
Using raidtools2, created the raid with this command:
mkraid /dev/md0
Nice and simple -- with good error messages. Use the -R switch to force overwriting an old array.

13 September 2004 update: creating a software RAID array

We now need to configure the drives to be managed by MD. Once that is done, we can set mdadm to identify the RAID1 system at boot (and the RAID5 array once we have the new drive).

I created the RAID1 array with this command:
mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
It reported:
mdadm: size set to 245111616K
mdadm: array /dev/md0 started
Then put the XFS file system on it:
mkfs.xfs /dev/md0
That took no time -- a lot faster than ext2 and ext3.
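To have the new file system come up at boot, an /etc/fstab line along these lines should do it (a sketch; /md0 is the mount point used elsewhere in these notes):

```shell
# /etc/fstab entry (sketch):
#   /dev/md0   /md0   xfs   defaults   0   2
# First mount by hand to check it:
mkdir -p /md0
mount /dev/md0 /md0
```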

Possible remaining issues: I've not tested that this setup survives a reboot -- it probably won't, as some piece of configuration is likely missing. We should run tests on this to make sure the configuration is robust.

9 September 2004 update

Andrey went straight for 2.6, and things looked good -- but the drives were seen as separate drives, not a raid array. The kernel was configured for a single 386, so I made a new kernel and booted to the dual Xeons. Note that you need either an initrd or the drivers built into the kernel rather than loaded as modules.

Karamba appears to have a so-called watchdog card, a built-in chip in this case. Its function appears to be to reboot the machine under certain conditions of failure. It's supported by the i8xx_tco module. I've loaded the module, which we might regret, but not looked at the configuration.


30 August update

Andrey booted into Knoppix 2.4 on 30 August 2004 and sent me a brief report, and gave me access. (Note that you can't ping the machines, but you can ssh to paco.)

The ataraid module was loaded, but not medley. I attempted to insert medley, but it didn't find the hardware. This may be because it only supports RAID0 and the machine was configured for RAID1. In this case, we have to use the libata and sata_sil drivers.

I advised Andrey to install a new hard drive for the OS, and use the new Debian installer, with the 2.4 kernel with ataraid and medley for the RAID. Later, we should switch to the 2.6 kernel with libata and sata_sil.

Hardware inventory

IDE drives


There's a CDRW and two Maxtor 250GB drives:

hdparm -i /dev/hdc

/dev/hdc (CDRW):

 Model=FX54++M, FwRev=Y01G, SerialNo=
 Config={ Fixed Removeable DTR<=5Mbs DTR>10Mbs nonMagnetic }
 RawCHS=0/0/0, TrkSize=0, SectSize=0, ECCbytes=0
 BuffType=unknown, BuffSize=0kB, MaxMultSect=0
 (maybe): CurCHS=0/0/0, CurSects=0, LBA=yes, LBAsects=0
 IORDY=yes, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 *mdma2
 UDMA modes: udma0 udma1 udma2
 AdvancedPM=no

hdparm -i /dev/hde

/dev/hde:

 Model=Maxtor 7Y250M0, FwRev=YAR51EW0, SerialNo=Y62QTH3E
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: (null):


hdparm -i /dev/hdg

/dev/hdg:

 Model=Maxtor 7Y250M0, FwRev=YAR51EW0, SerialNo=Y62QT4XE
 Config={ Fixed }
 RawCHS=16383/16/63, TrkSize=0, SectSize=0, ECCbytes=4
 BuffType=DualPortCache, BuffSize=7936kB, MaxMultSect=16, MultSect=16
 CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=268435455
 IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
 PIO modes:  pio0 pio1 pio2 pio3 pio4
 DMA modes:  mdma0 mdma1 mdma2
 UDMA modes: udma0 udma1 udma2
 AdvancedPM=yes: disabled (255) WriteCache=enabled
 Drive conforms to: (null):

Knoppix 2.4 boot

lspci

0000:00:00.0 Host bridge: Intel Corp. E7505 Memory Controller Hub (rev 03)
0000:00:00.1 Class ff00: Intel Corp. E7000 Series RAS Controller (rev 03)
0000:00:01.0 PCI bridge: Intel Corp. E7000 Series Processor to AGP Controller (rev 03)
0000:00:02.0 PCI bridge: Intel Corp. E7000 Series Hub Interface B PCI-to-PCI Bridge (rev 03)
0000:00:02.1 Class ff00: Intel Corp. E7000 Series Hub Interface B PCI-to-PCI Bridge RAS Controller (rev 03)
0000:00:1d.0 USB Controller: Intel Corp. 82801DB (ICH4) USB UHCI #1 (rev 02)
0000:00:1d.1 USB Controller: Intel Corp. 82801DB (ICH4) USB UHCI #2 (rev 02)
0000:00:1d.2 USB Controller: Intel Corp. 82801DB (ICH4) USB UHCI #3 (rev 02)
0000:00:1d.7 USB Controller: Intel Corp. 82801DB (ICH4) USB2 EHCI Controller (rev 02)
0000:00:1e.0 PCI bridge: Intel Corp. 82801BA/CA/DB/EB/ER Hub interface to PCI Bridge (rev 82)
0000:00:1f.0 ISA bridge: Intel Corp. 82801DB (ICH4) LPC Bridge (rev 02)
0000:00:1f.1 IDE interface: Intel Corp. 82801DB (ICH4) Ultra ATA 100 Storage Controller (rev 02)
0000:00:1f.3 SMBus: Intel Corp. 82801DB/DBM (ICH4) SMBus Controller (rev 02)
0000:02:1c.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 04)
0000:02:1d.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 04)
0000:02:1e.0 PIC: Intel Corp. 82870P2 P64H2 I/OxAPIC (rev 04)
0000:02:1f.0 PCI bridge: Intel Corp. 82870P2 P64H2 Hub PCI Bridge (rev 04)
0000:04:02.0 Ethernet controller: Intel Corp. 82540EM Gigabit Ethernet Controller (rev 02)
0000:05:02.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
0000:05:03.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d)
0000:05:04.0 RAID bus controller: Silicon Image, Inc. (formerly CMD Technology Inc) Silicon Image Serial ATARaid Controller [ CMD/Sil 3112/3112A ] (rev 02)


 

 

CogWeb