Detecting URL Rewriting (part 2)

This post is a continuation of my documenting the process I go through to come up with some way a client of a web site can first: determine if URL rewriting is occurring on a given web server, and second: in cases where it is used, determine what the rewrite rules are.
I left off with Apache configured, and a simple rule established for mod_rewrite. I now need to decide whether to use mod_rewrite to handle the rewrite using a redirect (via an HTTP 302 response), or to process it internally. As I mentioned, the difference between these two methods is quite large.
For example, if I choose to send a redirect (e.g. by amending the rule to include an [R] flag), like so …

RewriteRule    /litterbox/(.*)  /sandbox/$1 [R]

… the rewrite rule will cause an incoming request to be redirected to the new location via the HTTP response headers.
Examining the relevant portion of the HTTP request and response headers associated with this process, the conversation looks like this:
Initial request:

GET /litterbox/bar1.php HTTP/1.1

Initial response:

HTTP/1.1 302 Found
Date: Wed, 06 Oct 2010 04:50:18 GMT
Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch
Location: http://<host>/sandbox/bar1.php

In the response above, notice that the server has returned an HTTP 302 status response, and included a Location: header which contains the URL to the content. The browser receives this, and sends a new request to that location:
Redirected request:

GET /sandbox/bar1.php HTTP/1.1

This request is met with the final response, which includes the content at /sandbox/bar1.php:

HTTP/1.1 200 OK
Date: Wed, 06 Oct 2010 04:50:29 GMT

This is how I’ve used mod_rewrite in the past. The rules I’ve set to enforce SSL have been very similar to the one given in the example. At first glance, it seems that it will be easy to tell when rewriting is occurring… all that’s required is to look for the 302 response!

Not so fast

There are a couple of problems with this theory. The first is that there are other mechanisms which can be used to produce this same HTTP response code. For example, PHP's header() function causes an HTTP 302 response to be sent by the server whenever a Location header is set (302 is the default status in that case); a minimal example:

<?php header('Location: /sandbox/bar1.php'); ?>

When I put code like that into a file on the server, the response to a GET request for that file looks pretty much exactly like the one generated natively by Apache above:

HTTP/1.1 302 Found
Date: Wed, 06 Oct 2010 05:38:09 GMT
Server: Apache/2.2.9 (Debian) PHP/5.2.6-1+lenny9 with Suhosin-Patch
X-Powered-By: PHP/5.2.6-1+lenny9

From this, it would seem that there is no way to distinguish between a redirect coming from mod_rewrite, and one stemming from some other mechanism.
More importantly though, and a bigger blow to my high hopes for an easy answer, is that the [R] flag is optional. By default, a redirect header isn’t returned by Apache at all when mod_rewrite is used. Looking up how Apache handles rewriting, there’s a fair amount of documentation on the process specific to the 2.2 version of Apache I’m using:

The nutshell version is this: requests which are rewritten without sending a 302 to the client are processed completely within the Apache kernel. There's no indication given to the client that a rewrite has occurred.
In fact, it appears that the only way an application hosted on the server can know that it has been reached via a rewritten request is by checking for the presence of one or both of two server variables which only appear when Apache has processed a rewrite … despite their names, they do not appear on a redirect =)
(Recall that I can see these because the PHP script I wrote includes a printout of every server header. It seems that doing this was a good idea indeed!):

REDIRECT_URL = /litterbox/bar1.php

Note that these variables are different from the ones the Apache documentation says it adds. I'm not sure why that is, but since they are only available as server variables on the host, they are completely outside the reach of a client accessing a given URL.

That sucks.

At this point, I give up on the 302 response and Location: header theory: it’s both misleading (in that a 302 response may not be the result of a URL rewrite), and inconsistent in that rewritten URLs may not provide a 302 response at all.
I start thinking of other mechanisms I could use. One that comes immediately to mind is the Referer header. This is an HTTP header which is provided to a web server when, for example, a user clicks a link. The destination host the link resolves to receives the request for a URL, along with where the user came from. An example of this can be seen here:
Initial Request:

GET /litterbox/bar1.php HTTP/1.1

Initial Response:

HTTP/1.1 200 OK
Date: Fri, 08 Oct 2010 05:51:54 GMT

  <div><a href="bar2.php">bar2</a></div>

The content served in the response contains a link to bar2.php. When I click that link, the fact that I’m coming from the bar1.php page is sent in the request, as shown below:
Request to bar2.php:

GET /sandbox/bar2.php HTTP/1.1
Referer: http://<host>/litterbox/bar1.php

That’s all well and good, but as you can see, the Referer still shows /litterbox as the URL I was coming from. That’s because the referer is specified by the user agent (a browser in this case). Since the browser didn’t receive any indication that the content it is being served has come from a different location than it requested, it thinks it’s still at /litterbox and so sends that location in the headers.
So much for using that as a detection of rewriting. What’s next…
So far, I’ve tried a couple of different ideas to try to determine if a client can tell whether URL rewriting is in use or not. I’ve ruled out using a 302 response and accompanying Location: header as being unfit for this purpose. I’ve also briefly played with the idea of using Referer, and quickly ruled that out as an option as well. I need to come up with some more creative way to try to tell.

How about timing?

Thinking about this problem a bit, it occurs to me that, since the Apache kernel has to map rewritten URLs internally to come up with a computed URL to serve content from, I may be able to use how long a request takes to load as an indicator.
To test this theory out, I’m going to use ruby, because I’m familiar with it, and it allows me to quickly throw together some proof-of-concept code.
Since I have the advantage in this case of knowing for sure what is being rewritten and what is not, I can use the benchmark module in ruby to measure the time it takes to get a file where rewriting is occurring, and where it is not. I can then compare the two to see if this theory bears further investigation.
For the initial test, I decide to use the bmbm method of the benchmark module for two reasons: 1) it automatically gives me two iterations to compare; but more importantly, 2) it initializes the environment and tries to minimize skewed results by going through a rehearsal process before benchmarking “for reals”. Once I decided that, I came up with the following script:

#!/usr/bin/env ruby
require 'net/http'
require 'uri'
require 'benchmark'

include Benchmark

# URLs assume the local server; the real hostname is omitted here.
bmbm do |test|"rewrite:") do
    Net::HTTP.get_response URI.parse('http://localhost/litterbox/bar1.php')
  end"non-rewrite:") do
    Net::HTTP.get_response URI.parse('http://localhost/sandbox/bar1.php')
I’ve created two labels in this benchmark: one for the known rewritten URL, and one for the known non-rewritten URL. When I run this script, I get the following results:

Rehearsal ------------------------------------------------
rewrite:       0.010000   0.000000   0.010000 (  0.001429)
non-rewrite:   0.000000   0.000000   0.000000 (  0.000876)
--------------------------------------- total: 0.010000sec
                   user     system      total        real
rewrite:       0.000000   0.000000   0.000000 (  0.001105)
non-rewrite:   0.000000   0.000000   0.000000 (  0.000907)

That’s pretty interesting! When I run this on the same host the web server is located at, I can definitely tell a difference between rewritten and non-rewritten content!
I need to look into this further. The first thing that needs to happen is, I need to perform these requests many more times and look at the timing. A single request is useful for a quick “is there merit to this”, but the fact that it appears this may work could just be a fluke in the given requests at that particular time. I need to increase the number of times I perform this test and prove whether, statistically, there is a difference in the time it takes to serve a rewritten URL vs a non-rewritten one.
I also need to look at what factors may affect the results. Some immediate considerations that come to mind are:

  1. is the Apache server caching content, causing it to be served faster the second time?
  2. Am I able to prevent that if so?
  3. On a local machine, this may work, but what happens across a LAN?
  4. What happens to the timing when requests go across the Internet?
  5. How much does “heavy” content (video, images, etc.) affect the timing?
  6. Can I time just getting the HTTP headers, to avoid loading content?

I need to answer some of these before testing, and some of these will be answered as the testing progresses.
[to be continued]

Detecting URL Rewriting (part 1)

[edit 2010-10-02]: I realized after replying to cdman’s comment that I had neglected to include the goals of this project in this post, but had included them in this one instead. I’ve edited the beginning here to include the first part of that post.
As I mentioned earlier: I’ve been pondering URL rewriting for the past couple of days – trying to come up with some way a client of a web site can first: determine if URL rewriting is occurring on a given web server, and second: in cases where it is used, determine what the rewrite rules are.
I started this process by doing some homework to learn more about how URL rewriting occurs. I’ve used Apache’s mod_rewrite in the past to accomplish some basic tasks like redirecting incoming http:// requests to their https:// counterpart to enforce SSL usage, but I had never done much beyond that.
I decided (as I often do) that the best way to learn was to play. To determine whether URL rewriting is in use, and to try to map the rules, means that I need to have a portion of a web site that is using URL rewriting, and one that is not (so I can compare the two). I further need to have some rewrite rules. Coming up with a random set of rules is difficult, so I gave myself what was, in my mind, a likely scenario:

The Bar, Inc. marketing dept. has realized that their ‘litterbox’ product line has a name which creates a negative impression. It’s decided that ‘sandbox’ is a much better brand for the products. Of course, with the rebranding, the web site has to be updated; it simply won’t do to have links going to the old litterbox URLs now that the name has changed.
Begrudgingly, the developers of the Bar, Inc. website put in a ton of overtime to change all the links in the code. Then someone realizes that all the Bar, Inc. customers and business partners also have links that are going to break. The developers can’t do anything about that, it’s outside their control. It now falls to the sysadmin to make sure that no critical third party links get broken.
As the sysadmin, my task is simple: take any requests for /litterbox/whatever and have them go to /sandbox/whatever instead.

Excellent! I now have an interesting story to keep me from getting bored. (OK, fine… interesting is subjective 😉
More importantly, the fictitious set of requirements dictated in the scenario means that I have a framework established for how to approach setting up this research project.
That means it’s time to get to work.

Preparing The Environment

To get this set up in a way that meets the criteria of the scenario, I first need to have a website. I have a Linux box handy, so I decide to do my testing using Apache. The specific version and OS I’m using is Apache 2.2.9 on Debian Linux, with the Suhosin Patch. In other words, I’m using the default apache2 (mpm-prefork) package on Debian ‘lenny’.
I create a directory named sandbox in the Apache web root (which is /var/www on Debian). I then create 4 files in that directory: bar1.php, bar2.php, bar3.php, and bar4.php. Next I edit each of these files to contain some generic code similar to the following (changing the title and h1 tags to correspond to the file name):

<head><title>bar1</title></head>
<h1>bar1</h1>
<div><a href="bar1.php">bar1</a></div>
<div><a href="bar2.php">bar2</a></div>
<div><a href="bar3.php">bar3</a></div>
<div><a href="bar4.php">bar4</a></div>
<hr />
foreach ($_SERVER as $key_name => $key_value) {
    print $key_name . " = " . $key_value . "<br>";
The PHP code in these files simply prints the $_SERVER key/value pairs to the page. This may prove useful to review, so I’m including it in each page.
Now that I have the Bar, Inc. “website” in place it’s time to contemplate how to proceed – I have at least four options:
Edit 2010-10-04: I’d neglected to consider the Apache Alias directive. I’ve added that to the list.

  1. I can enable the SymLinks option and create a link from litterbox to sandbox.
  2. I can use mod_rewrite to change requests for litterbox to sandbox.
  3. I can use mod_rewrite to send an HTTP 302 response redirecting requests to the new location.
  4. I can use the Apache Alias directive to map requests for litterbox to a specific path on the file system.

After considering these for a bit, I decide that leaving a bunch of stale links lying around the directory tree is a BadThing. For similar reasons, I decide not to use the Alias directive, so that future sysadmins don’t become confused. Accordingly, I select mod_rewrite as the way to go. (Thankfully, since that’s the whole point of this project 😉

Setting up mod_rewrite

The first thing I need is for the mod_rewrite module to be loaded in the Apache configuration. How this occurs varies based on the installation of Apache. In Debian it’s extremely simple to accomplish this task; a single command (and later, a reload of the Apache server) will suffice:

# a2enmod rewrite

Now that the module is enabled, I need to define some rules. This can be done by editing the configuration file that defines the web site. In Debian, this means editing the file /etc/apache2/sites-available/<site-name>. Because I’m just using the default configuration, I place my changes in /etc/apache2/sites-available/default.
The syntax for mod_rewrite can be quite complex, and there are some very powerful features that it provides. However, the scenario I set for myself dictates what I need to establish as far as the rewrite rules… that is, I need to change “litterbox” to “sandbox”. Configuring this in Apache is easy enough, it looks like this:

RewriteEngine on
RewriteRule    /litterbox/(.*)  /sandbox/$1

The first line turns on the rewrite engine. The second establishes that I want to replace any instance of “/litterbox/” followed by any characters (including none) with “/sandbox/” followed by whatever characters were present when the request came in.
That single line should accomplish the goal of my scenario, however I still have one choice left to make: I need to decide whether I should use mod_rewrite to accomplish this task via an HTTP redirect, or to rewrite the requests.
The difference between these two is not trivial.
Before I go any further, I need to gain a better understanding of how URL rewriting works in Apache.
[to be continued]

on security research

I’ve been pondering URL rewriting for the past couple of days – trying to come up with some way a client of a web site can first: determine if URL rewriting is occurring on a given web server, and second: in cases where it is used, determine what the rewrite rules are.
As I have been thinking about this, it occurred to me that, despite the proliferation of security research whitepapers and blog posts, there is a scarcity of ‘this is the process I went through to do this research’ information out there.
There are mountains of articles and documents, with dizzying arrays of statistics and metrics (often intermingled with a fair amount of marketing fluff), and yet most of the whitepapers, and certainly the various conference presentations, simply don’t talk about the process – preferring instead to present the end results.
As security professionals, we gather together at a multitude of conferences where we do a wonderful job displaying all of this shiny data and showing off new marvelous tricks to each other with varying degrees of self-indulgence. Yet most of how we came to have such cool stuff is left out of the picture entirely.
I understand why that is, of course. Simply put, the process is boring! It’s full of failure, and repeatedly throwing things at a wall and observing what happens. Nobody wants to sit in a small room with a couple hundred hackers listening to someone drone on for an hour about how “this didn’t work…and neither did this”, I get that. Added to that is the fact that, in some cases, the research is being done for a corporate (or government) entity. In such a situation, the process may be withheld not from a lack of desire to share on the researcher’s part, but because they are not permitted to do so by the organization for which the work was done.
Despite these reasons, in my opinion it is a disservice to ourselves, to the profession, and to others who may be interested in performing their own research, when all we do is deliver an end product in a glossy PDF or a shiny PowerPoint presentation. That is simply not research, it’s promotion. Research, in an academic sense, implies documenting the entire process: both success and failure. This is not what I find when I look at the typical infosec industry output.
Accordingly, I’ve decided that I will share how I go about this particular project, and not just release some PDF or tool as a result of it. I’ll post my process here, any notes and thoughts, as well as any code I come up with. (Well, links to code anyway. I’ll probably keep the code itself in github).
One of the reasons I’m doing this is that I expect to fail. =)
As I’ve considered how one can detect URL rewriting, and as I’ve started investigating the details of how it works, my initial thought is that detecting it simply won’t be possible.
If that’s correct, I think it’s important that I present what I tried, along with the fact that ultimately it didn’t work. That’s vital information, in that it prevents someone else from wasting cycles repeating a process that’s already been done.
As well, understanding why something failed may lead to discovering a way to succeed.
OK… this rant being done now, my next post will start the process of documenting my research into detecting URL rewriting.