Tuesday 17 July 2012

Linux Boot Process in a nutshell


When a computer is powered on, the CPU (Central Processing Unit) receives power but does not yet know what it is supposed to do. Meanwhile, a special hardware circuit asserts the RESET pin of the CPU. Once RESET is asserted, the CPU sets some of its special registers to their default values and starts executing the code at address 0xFFFFFFF0, which belongs to the system BIOS (Basic Input Output System). The BIOS code resides in flash memory (ROM) on the motherboard. The BIOS then performs the POST (Power On Self Test) to check the hardware needed for the boot process and looks for a boot device. When a boot device is found, the BIOS loads the boot sector (the 1st stage boot loader) from that device into RAM (Random Access Memory) and executes it; the code that resides in the MBR (Master Boot Record) is known as the boot loader. The 1st stage boot loader is less than 512 bytes in size (a single sector), and its main job is to load the 2nd stage boot loader into RAM.

When the 2nd stage boot loader is in RAM and executing, a splash screen is commonly displayed, and Linux and an optional RAM disk (temporary root file system) are loaded into memory. (NB: the boot loader is also responsible for letting the user choose between several installed operating systems at this point, while the 2nd stage boot loader is executing.) Once the kernel image is loaded into RAM, the 2nd stage boot loader passes control to it, and the kernel is decompressed inside RAM and initialized. During this time the kernel checks the system hardware, enumerates (numbers) the attached hardware devices, mounts the root device and then loads the necessary kernel modules. When all of these tasks are complete, the 1st user-space program (init) starts.



Brief description and Functionality

Basic Input Output System [BIOS] :

The BIOS is a program stored in ROM/flash memory; the CPU starts executing it at address 0xFFFFFFF0. It consists of interrupt-driven low-level procedures that operating systems can use to handle hardware devices. Once Linux has finished initializing, it no longer uses the BIOS. The BIOS has two parts: the POST code and the runtime services. The POST code is flushed from memory once the POST has completed, but the runtime services remain in memory and are available to the target OS.

·         The BIOS executes a series of tests on the computer hardware to establish which devices are present and whether they are working properly. This series of tests is called the POST (Power On Self Test). During this phase the user may see the BIOS version banners on the screen.

·         It initializes the hardware devices and ensures that they all operate without conflicts on the IRQ (interrupt) lines and I/O ports. On PCI-based architectures, it shows a table of all installed PCI devices.

·         The BIOS runtime service searches for devices that are both active and bootable, in the order of preference defined by the CMOS (complementary metal oxide semiconductor) settings. The boot device can be a floppy disc, a CD-ROM, a partition on a hard disc, a device on the network, or a USB flash memory stick.

·         Linux is commonly booted from a hard disc, where the MBR (Master Boot Record) contains the primary boot loader. The MBR is a 512-byte sector located in the 1st sector of the disc (sector = 1; cylinder = 0; head = 0).

·         As soon as the BIOS finds a valid boot device, it copies the 1st sector of that device into RAM, starting at physical address 0x7C00. It then jumps to that address and executes the instructions it has just loaded.
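The CHS address quoted above for the MBR (cylinder 0, head 0, sector 1) corresponds to the very first logical block of the disc. A minimal sketch of the usual CHS-to-LBA conversion is shown here; the function name chs_to_lba and the geometry used in the usage note (16 heads, 63 sectors per track) are illustrative assumptions and do not come from the text above.

```c
#include <stdint.h>

/* Convert a cylinder/head/sector address to a logical block address (LBA).
 * Cylinders and heads are numbered from 0, sectors from 1.
 * heads_per_cyl and sectors_per_track describe an assumed disc geometry. */
uint32_t chs_to_lba(uint32_t cyl, uint32_t head, uint32_t sector,
                    uint32_t heads_per_cyl, uint32_t sectors_per_track)
{
    return (cyl * heads_per_cyl + head) * sectors_per_track + (sector - 1);
}
```

With the assumed geometry, chs_to_lba(0, 0, 1, 16, 63) evaluates to 0, which is why the MBR is simply "the 1st sector" regardless of the geometry chosen.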

Master Boot Record [MBR] :

Every hard disc must have a consistent starting point where the key information about the disc (how many partitions it has, what kind of partitions they are, etc.) is stored. The place where this information is stored is known as the MBR, master boot sector, or boot sector. The MBR is always located at cylinder 0, head 0, sector 1 (the 1st sector of the disc). The BIOS always looks in this sector to boot the OS. The MBR contains the following structures:

·         Partition Table : this table describes the partitions contained on the hard disc. It has room for 4 entries, which means the hard disc can have at most 4 true partitions, known as primary partitions; any further partitions are logical partitions, which live inside one of the primary partitions (an extended partition).

·         Master Boot Code : the small initial boot program that the BIOS loads and executes to start the boot process.

Stage-1 Boot Loader :

The primary boot loader that resides in the MBR is a 512-byte image containing both program code and a small partition table. The 1st 446 bytes are the primary boot loader code. The next 64 bytes are the partition table, which contains a record for each of four partitions (16 bytes each). The last 2 bytes of the MBR contain the magic number (0xAA55), which serves as a validation check of the MBR. The job of the primary boot loader is to find and load the secondary boot loader. It 1st searches the partition table for an active partition. When it finds one, it scans the remaining partitions to make sure that they are all inactive. Once this is verified (exactly one active partition exists), the active partition’s boot record is read from the device into RAM and executed.
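The MBR layout and the active-partition scan described above can be sketched in C. This is an illustrative sketch, not the actual boot loader code (which is 16-bit assembly): the function and macro names are made up for the example, and the value 0x80 for the "active" status byte of a partition entry is the conventional one and is not stated in the text above.

```c
#include <stdint.h>

#define MBR_SIZE          512
#define MBR_CODE_SIZE     446     /* boot loader code */
#define MBR_PART_ENTRIES  4       /* four partition records ... */
#define MBR_PART_SIZE     16      /* ... of 16 bytes each */
#define MBR_MAGIC         0xAA55  /* stored little-endian in bytes 510-511 */
#define PART_ACTIVE       0x80    /* assumed status byte of a bootable entry */

/* Return 1 if the last two bytes of the sector hold the magic number. */
int mbr_is_valid(const uint8_t mbr[MBR_SIZE])
{
    uint16_t magic = (uint16_t)(mbr[510] | (mbr[511] << 8));
    return magic == MBR_MAGIC;
}

/* Return the index (0..3) of the single active partition,
 * or -1 if there is no active partition or more than one. */
int mbr_find_active(const uint8_t mbr[MBR_SIZE])
{
    int active = -1;

    for (int i = 0; i < MBR_PART_ENTRIES; i++) {
        const uint8_t *entry = mbr + MBR_CODE_SIZE + i * MBR_PART_SIZE;

        if (entry[0] == PART_ACTIVE) {
            if (active != -1)
                return -1;   /* a second active entry makes the table invalid */
            active = i;
        }
    }
    return active;
}
```

Note that 446 + 4 × 16 + 2 = 512, so the three regions exactly fill the sector.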

Stage-2 Boot Loader :

The secondary boot loader is also known as the kernel loader. Its task is to load the Linux kernel into RAM, along with an optional RAM disk. The 1st and 2nd stage boot loaders combined are called the Linux Loader (LILO) or the GRand Unified Bootloader (GRUB). The advantage of GRUB is that it understands Linux file systems. Instead of using raw sectors on the disk, as LILO does, GRUB can load the Linux kernel from an ext2 or ext3 file system. GRUB does this by turning the 2-stage boot loader into a 3-stage boot loader: after the MBR (1st stage), it boots a stage 1.5 boot loader that understands the particular file system containing the Linux kernel image.

Kernel Boot Procedures:

With the kernel image in memory and control handed over by the stage-2 boot loader, the kernel stage begins. The kernel image is a compressed image. Typically it is a zImage (compressed image, less than 512 KB) or a bzImage (big compressed image, more than 512 KB) that has previously been compressed with zlib. At the head of this image is a routine that does a minimal amount of hardware setup, then decompresses the kernel contained within the image and places it into high memory. If an initial RAM disk image is present, this routine moves it into memory and notes it for later use. The routine then calls the kernel, and the kernel boot process begins.

Kernel Start-up Procedures

Setup.S [/arch/i386/boot/setup.S]

Setup.S is assembly code responsible for getting system data from the BIOS and putting it into the appropriate location in system memory. The code asks the BIOS for memory, disc and other parameters and puts them into a safe memory region (0x90000-0x901FF). It also re-initializes the hardware and switches from real mode to protected mode memory addressing. Along the way it sets up a provisional GDT and IDT, re-programs the Programmable Interrupt Controller (PIC) and maps the 16 IRQ lines (0 to 15).

Code Flow:

In the setup.S file, the 1st instruction is a jump.

1.    The entry point is “start: jmp    trampoline”, followed by a set of initialized data fields. At trampoline, the procedure start_of_setup is called (“trampoline: call     start_of_setup”), which starts the actual work.

start_of_setup:

# Bootlin depends on this being done early

            movw    $0x01500, %ax

            movb    $0x81, %dl

            int         $0x13

2.    Resets the disk controller.

#ifdef SAFE_RESET_DISK_CONTROLLER

# Reset the disk controller.

            movw    $0x0000, %ax

            movb    $0x80, %dl

            int         $0x13

#endif

3.    Sets the data segment register equal to the code segment register, which holds SETUPSEG (0x9020) at this point.

# Set %ds = %cs, we know that SETUPSEG = %cs at this point

            movw    %cs, %ax                      # aka SETUPSEG

            movw    %ax, %ds

4.    Looks for the signature at the end of the setup block (SIG1 0xAA55, SIG2 0x5A5A) to ensure that the loader (LILO) loaded the setup code correctly.

# Check signature at end of setup

            cmpw    $SIG1, setup_sig1

            jne        bad_sig



            cmpw    $SIG2, setup_sig2

            jne        bad_sig



            jmp       good_sig1

5.    If the signature is missing, the rest of the setup code has to be found. If it cannot be found, setup gives up, prints the message “No setup signature found …” and puts the processor in a halt state.

6.    Changes the data segment register to INITSEG (0x9000).

good_sig:

            movw    %cs, %ax                                  # aka SETUPSEG

            subw     $DELTA_INITSEG, %ax               # aka INITSEG

            movw    %ax, %ds

7.    Checks whether the loader version is suitable, to ensure that the loader can deal with a kernel loaded high. Jumps to ‘loader_ok’ if it is; otherwise prints the message “Wrong loader, giving up…”.

# Check if an old loader tries to load a big-kernel

            testb     $LOADED_HIGH, %cs:loadflags   # Do we have a big kernel?

            jz          loader_ok                                  # No, no danger for old loaders.



            cmpb    $0, %cs:type_of_loader              # Do we have a loader that

                                                                        # can deal with us?

            jnz        loader_ok                                  # Yes, continue.



            pushw   %cs                                          # No, we have an old loader,

            popw     %ds                                          # die.

            lea        loader_panic_mess, %si

            call       prtstr



            jmp       no_sig_loop

8.    Gets the extended memory size in KB, which can be found at offset 0x1E0. Three memory detection schemes are tried in turn: first e820h, which lets us assemble a memory map; then e801h, which returns a 32-bit memory size; and finally 88h, which returns 0-64 MB.

loader_ok:

# Get memory size (extended mem, kB)

................................................................

................................................................

mem88:



#endif

            movb    $0x88, %ah

            int         $0x15

            movw    %ax, (2)




9.    Set the keyboard repeat rate to the max.

# Set the keyboard repeat rate to the max

            movw    $0x0305, %ax

            xorw     %bx, %bx

            int         $0x16

10. Checks for the video adapter and its parameters and allows the user to browse video modes. This is done by calling video, which is defined in video.S.

# Check for video adapter and its parameters and allow the

# user to browse video modes.

            call       video                                        # NOTE: we need %ds pointing

                                                                        # to bootsector

11. Gets hd0 data, checks whether hd1 is present, and scans for an MCA bus.

# Get hd0 data...

..........................

..........................

# Get hd1 data...

............................

............................



12. After some more checks, the code finally prepares to move to protected mode. If there is a valid pointer to a real-mode switch routine at offset “realmode_switch”, that routine is called; otherwise the “default_switch” routine is used. The default_switch routine disables interrupts [cli] and NMI.

13. Now the system is moved to its rightful place, unless we have a big kernel, in which case it must not be moved. The “code32_start” address is read and “code32” is modified accordingly; the default is [0x1000, 4 KB] for a zImage or [0x100000, 1 MB] for a big kernel, and it can be changed by the loader.

14. Now we will set up the GDT and IDT.

15. Make sure any possible coprocessor is properly reset.

16. Now all interrupts are masked; the rest is done in init_IRQ(), called from start_kernel(). All IRQs are masked except IRQ2, which is the cascade.

17. This is the point where we actually switch to protected mode by setting the PE bit [movw $1, %ax and lmsw %ax].

18. The last line executed in this file is a jump to an assembly function called “startup_32”, which performs additional initialization [/arch/i386/boot/compressed/head.S].
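The segment values that appear in the steps above (INITSEG 0x9000, SETUPSEG 0x9020) are real-mode segments, so the physical address is the segment shifted left by four bits plus the offset. The sketch below is only an illustration of that arithmetic; the helper name real_mode_phys is made up for the example, and the definition of DELTA_INITSEG as SETUPSEG minus INITSEG is an assumption consistent with how step 6 uses it.

```c
#include <stdint.h>

#define INITSEG        0x9000
#define SETUPSEG       0x9020
#define DELTA_INITSEG  (SETUPSEG - INITSEG)   /* assumed definition: 0x0020 */

/* Real-mode physical address: segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}
```

Under these assumptions real_mode_phys(INITSEG, 0) is 0x90000 and real_mode_phys(INITSEG, 0x1FF) is 0x901FF, which matches the parameter region mentioned earlier, while real_mode_phys(SETUPSEG, 0) is 0x90200, the first byte after that 512-byte region.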

Head.S [ /arch/i386/boot/compressed/head.S ]

It performs the following operations:

1.   Initializes the segment registers.

startup_32:

            cld

            cli

            movl $(__BOOT_DS),%eax

            movl %eax,%ds

            movl %eax,%es

            movl %eax,%fs

            movl %eax,%gs



            lss stack_start,%esp

            xorl %eax,%eax

1:         incl %eax                      # check that A20 really IS enabled

            movl %eax,0x000000     # loop forever if it isn't

            cmpl %eax,0x100000

            je 1b

...............................

................................

2.   Sets up a provisional stack (the lss instruction in the listing above) and initializes eflags.

/*

 * Initialize eflags.  Some BIOS's leave bits like NT set.  This would

 * confuse the debugger if this code is traced.

 * XXX - best to initialize before switching to protected mode.

 */

            pushl $0

            popfl

3.   Decompresses the kernel Image [decompress_kernel() misc.c]

/*

 * Do the decompression, and jump to the new kernel..

 */

            subl $16,%esp   # place for structure on the stack

            movl %esp,%eax

            pushl %esi         # real mode pointer as second arg

            pushl %eax       # address of structure as first arg

            call decompress_kernel

            orl  %eax,%eax

            jnz  3f

            popl %esi          # discard address

            popl %esi          # real mode pointer

            xorl %ebx,%ebx

            ljmp $(__BOOT_CS), $0x100000

4.   “decompress_kernel” returns a value telling whether the kernel was loaded high or not. If not, we jump straight to the startup_32 function of the decompressed kernel, in /arch/i386/kernel/head.S [0x100000]. Otherwise, we copy the “move_in_place” routine to address 0x1000 [4 KB]; this routine then moves the kernel to its final destination [0x100000].

................

..................

3:

            movl $move_routine_start,%esi

            movl $0x1000,%edi

            movl $move_routine_end,%ecx

            subl %esi,%ecx

            addl $3,%ecx

            shrl $2,%ecx

            cld

...................

...................
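The A20 test at the top of startup_32 (step 1 above) relies on the fact that, with the A20 line disabled, address bit 20 is forced to zero, so address 0x100000 aliases address 0x000000. The following is a small simulation of that idea in C, not kernel code: the memory array, the cell() helper and the max_tries limit are all assumptions introduced for the sketch (the real loop spins forever when A20 is off).

```c
#include <stdint.h>

#define MEM_WORDS (0x200000 / 4)   /* callers provide 2 MB of simulated memory */

/* Simulated physical memory access. When A20 is disabled, address bit 20
 * is masked off, so 0x100000 refers to the same cell as 0x000000. */
static uint32_t *cell(uint32_t *mem, uint32_t addr, int a20_enabled)
{
    if (!a20_enabled)
        addr &= ~(1u << 20);
    return &mem[addr / 4];
}

/* Mirror the loop in startup_32: write a changing value to address 0 and
 * compare it with the word at 0x100000. Returns 1 as soon as they differ
 * (A20 is on), or 0 if they stayed equal for max_tries attempts. */
int a20_is_enabled(uint32_t *mem, int a20_enabled, int max_tries)
{
    uint32_t value = 0;

    for (int i = 0; i < max_tries; i++) {
        value++;
        *cell(mem, 0x000000, a20_enabled) = value;
        if (*cell(mem, 0x100000, a20_enabled) != value)
            return 1;
    }
    return 0;
}
```

Incrementing the value on every pass is what makes the test robust: even if the word at 0x100000 happened to equal the first value written, it cannot equal the next one unless the two addresses really are the same cell.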



Head.S [ /arch/i386/kernel/head.S ]

The second startup_32() continues the initialization sequence. Its main job is to set up an environment within which the first process can execute. This includes:

1. Initializes the segmentation registers with their final values.

/*

 * Set segments to known values.

 */

            cld

            lgdt boot_gdt_descr - __PAGE_OFFSET

            movl $(__BOOT_DS),%eax

            movl %eax,%ds

            movl %eax,%es

            movl %eax,%fs

            movl %eax,%gs

.................

.................

2. Sets up the Kernel Mode stack for Process 0. (The excerpt that follows is the loop that zero-fills the bss segment, described in step 5.)

xorl %eax,%eax

            movl $__bss_start - __PAGE_OFFSET,%edi

            movl $__bss_stop - __PAGE_OFFSET,%ecx

            subl %edi,%ecx

            shrl $2,%ecx

            rep ; stosl

3. Initializes the provisional kernel Page Tables and creates the page directory entries (PDEs).

page_pde_offset = (__PAGE_OFFSET >> 20);



            movl $(pg0 - __PAGE_OFFSET), %edi

            movl $(swapper_pg_dir - __PAGE_OFFSET), %edx

            movl $0x007, %eax                               /* 0x007 = PRESENT+RW+USER */

10:

            leal 0x007(%edi),%ecx                           /* Create PDE entry */

            movl %ecx,(%edx)                                /* Store identity PDE entry */

            movl %ecx,page_pde_offset(%edx)                     /* Store kernel PDE entry */

            addl $4,%edx

            movl $1024, %ecx

.............

..............

..............

4. Stores the address of the Page Global Directory in the cr3 register, and enables paging by setting the PG bit in the cr0 register.

5. Fills the bss segment of the kernel with zeros.

6. Invokes setup_idt() to fill the IDT with null interrupt handlers.

7. The first page frame is loaded with the system parameters learned from the BIOS and the
parameters passed to the operating system from the boot loader.

8. Loads the gdtr and idtr registers with the addresses of the GDT and IDT tables.

9. The first CPU calls “start_kernel”, which does the rest of the initialization; all other CPUs call “initialize_secondary”.

.....................

......................

#ifdef CONFIG_SMP

            movb ready, %cl           

            cmpb $1,%cl

            je 1f                              # the first CPU calls start_kernel

                                                # all other CPUs call initialize_secondary

            call initialize_secondary

            jmp L6

1:

#endif /* CONFIG_SMP */

            call start_kernel

L6:

            jmp L6                          # main should never return here, but

                                                # just in case, we know what happens.
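In step 3 above, each page directory entry is stored twice: once at the start of the page directory (the identity mapping) and once at page_pde_offset, which is __PAGE_OFFSET >> 20. The sketch below illustrates why a shift by 20 gives the right byte offset. It assumes 4-byte entries that each map 4 MB, and a __PAGE_OFFSET of 0xC0000000; that value is an assumption (the text above does not give it), and the function names are made up for the example.

```c
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u  /* assumed value of __PAGE_OFFSET */
#define PGDIR_SHIFT  22           /* each directory entry maps 4 MB */
#define PDE_SIZE     4u           /* bytes per page directory entry */
#define PDE_FLAGS    0x007u       /* PRESENT + RW + USER, as in the listing */

/* Index of the page directory entry that maps vaddr. */
uint32_t pde_index(uint32_t vaddr)
{
    return vaddr >> PGDIR_SHIFT;
}

/* Byte offset of that entry from the start of the page directory. */
uint32_t pde_byte_offset(uint32_t vaddr)
{
    return pde_index(vaddr) * PDE_SIZE;
}

/* Build an entry from a page-aligned page table address, mirroring
 * the "leal 0x007(%edi),%ecx" instruction in the listing. */
uint32_t make_pde(uint32_t page_table_phys)
{
    return page_table_phys + PDE_FLAGS;
}
```

For a 4 MB-aligned address, (vaddr >> 22) * 4 equals vaddr >> 20, so with the assumed value the kernel mapping starts at entry 768, byte offset 0xC00, of the page directory.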

start_kernel() [ /init/main.c ]

start_kernel() is the first function written in C. It performs the following tasks.
1. Take a global kernel lock (it is needed so that only one CPU goes through initialisation).

/* __init module */

asmlinkage void __init start_kernel(void)

{

            ...................

            ...................

            lock_kernel();

            ...................

}


2. Perform arch-specific setup (memory layout analysis, copying boot command line again, etc.).

page_address_init();

..................

            setup_arch(&command_line);

            setup_per_cpu_areas();


3. Print the Linux kernel “banner”, containing the version, the compiler used to build it, etc., to the kernel ring buffer for messages. This is taken from the variable linux_banner defined in init/version.c and is the same string as displayed by cat /proc/version.


printk(linux_banner);

4. Initialise traps.


..............

................

trap_init();

................

5. Initialise irqs.


.................

.................

init_IRQ();

................

6. Initialise data required for scheduler.


.................

.................

sched_init();

................

7. Initialise time keeping data.

..................

time_init();
.................

8. Initialise softirq subsystem.

...............

softirq_init(); 

..................


9. Parse boot command line options.


..................

...................

parse_args("Booting kernel", command_line, __start___param,

                           __stop___param - __start___param,

                           &unknown_bootoption);

..................

10. Initialise console.

............

console_init();

..............


11. If module support was compiled into the kernel, initialise the dynamic module loading facility.


12. If the “profile=” command line option was supplied, initialise profiling buffers.

............

............

profile_init();

............


13. kmem_cache_init(), initialise most of slab allocator.

............

..............

kmem_cache_init();

................


14. Enable interrupts.

.........

............

if (panic_later)

                        panic(panic_later, panic_param);

            profile_init();

            local_irq_enable();

..................


15. Calculate the BogoMIPS value for this CPU.

............

..............

calibrate_delay();

................


16. Call mem_init(), which calculates max_mapnr, totalram_pages and high_memory and prints out the “Memory: …” line.

...................

.....................

mem_init();

.......................


17. kmem_cache_sizes_init(), finish slab allocator initialisation (the excerpt below shows a call to kmem_cache_init(), the same call as in step 13).

..............

..............

kmem_cache_init();

.................


18. Initialise data structures used by procfs.

.................

..................

proc_root_init();

....................


19. fork_init(): create uid_cache, initialise max_threads based on the amount of memory available, and configure RLIMIT_NPROC for init_task to be max_threads/2.

.................

..................

fork_init(num_physpages);

...................


20. Create various slab caches needed for VFS, VM, buffer cache, etc.

............

anon_vma_init();

...............

vfs_caches_init_early();

..............

vfs_caches_init(num_physpages);

.............

buffer_init();

.............


21. If System V IPC support is compiled in, initialise the IPC subsystem. Note that for System V shm, this includes mounting an internal (in-kernel) instance of the shmfs filesystem.


22. If quota support is compiled into the kernel, create and initialise a special slab cache for it.


23. Perform the arch-specific “check for bugs” and, whenever possible, activate workarounds for processor/bus/etc. bugs. Comparing various architectures reveals that “ia64 has no bugs” and “ia32 has quite a few bugs”; a good example is the “f00f bug”, which is only checked for if the kernel is compiled for less than 686 and is worked around accordingly.


24. Set a flag to indicate that a schedule should be invoked at the “next opportunity” and create a kernel thread init(), which execs execute_command if supplied via the “init=” boot parameter, or tries to exec /sbin/init, /etc/init, /bin/init, /bin/sh in this order; if all of these fail, panic with a “suggestion” to use the “init=” parameter.


25. Go into the idle loop; this is the idle thread with pid=0.

static void noinline rest_init(void)

{

            kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);

            numa_default_policy();

            unlock_kernel();

            cpu_idle();

}
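The fallback order that the init() kernel thread follows in step 24 can be sketched in plain C. This is a user-space illustration, not the kernel's code: choose_init, the can_exec_fn predicate (a stand-in for "exec of this path succeeds") and the two sample predicates are hypothetical names introduced here. The sketch also assumes that a failed “init=” program falls through to the default list, which the text above does not spell out.

```c
#include <stddef.h>
#include <string.h>

/* Candidate paths, tried in the order listed in step 24. */
static const char *const init_candidates[] = {
    "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
};

/* Hypothetical predicate: returns nonzero if the path could be exec'ed. */
typedef int (*can_exec_fn)(const char *path);

/* Return the program that would become init: the "init=" override if it is
 * given and usable, otherwise the first usable candidate, or NULL when
 * everything fails (the point where the kernel would panic). */
const char *choose_init(const char *execute_command, can_exec_fn can_exec)
{
    size_t n = sizeof(init_candidates) / sizeof(init_candidates[0]);

    if (execute_command != NULL && can_exec(execute_command))
        return execute_command;

    for (size_t i = 0; i < n; i++) {
        if (can_exec(init_candidates[i]))
            return init_candidates[i];
    }
    return NULL;
}

/* Sample predicates for trying the sketch out. */
int exec_only_bin_sh(const char *path) { return strcmp(path, "/bin/sh") == 0; }
int exec_nothing(const char *path) { (void)path; return 0; }
```

For example, choose_init(NULL, exec_only_bin_sh) skips the first three candidates and returns "/bin/sh", while choose_init(NULL, exec_nothing) returns NULL, the case in which the kernel panics and suggests the “init=” parameter.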