Wireshark-users: Re: [Wireshark-users] filter for ONLY initial get request
From: Jeffs <jeffs@xxxxxxxxxxxxx>
Date: Thu, 12 Aug 2010 09:53:43 -0400
On 8/12/2010 6:06 AM, Sake Blok wrote:
On 12 aug 2010, at 11:32, Thierry Emmanuel wrote:
The best I have come up with so far is to look only at requested objects of type "text/html" and then look at the referer instead of the host header (falling back to the host header if the referer is empty). But this is also far from perfect: it lets in false positives and might have some false negatives too. But you can give it a shot to see how it compares to what you already have...
I don't know how you want to use the referer header. It is filled in whether the object was requested by the browser to complete the display of the page or by the user clicking on a link. The only case in which it isn't sent by the browser is when the user explicitly types a URL into the address bar of his favorite browser.
The thought behind using the referer header is that it filters out the objects that the user did not manually request. Give it some thought: a user types a URL into the browser, the referer is empty, so we need to count this request. The page contains several objects. They are requested with the requested page as the referer. It is safe to count these, as this is what the user requested, even when some of the requested objects are advertisements (which the OP wants to skip): counting by referer attributes them to the referring site rather than to the ad server. As long as the user clicks on links that point to pages on the same site, we are fine.

Then the user clicks on a link to another site. OK, the referer still points to the original site (so we have a miscount of 1), but assuming the user clicks on at least one link within the new site, the new site still gets listed, only with one count less.
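To make the rule concrete, here is a minimal sketch of what the awk part of the script further down does with two made-up host/referer pairs (the hostnames are fabricated for illustration):

printf 'www.example.com \nads.example.net http://www.example.com/\n' | awk '$2=="" {print $1;next} {print $2}'

The typed-in request (empty referer) is counted by its host, the ad request by its referer, so after the domain-stripping sed steps both end up under example.com.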

In addition, I followed your handy tip to not count every object, but only objects of type html, by filtering on the Accept: header.
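To see what that filter will match in a given trace, the Accept values on all requests can be tallied first (this assumes the http.cap from the example below):

tshark -nlr http.cap -R 'http.request' -T fields -e http.accept | sort | uniq -c | sort -rn

Only the requests whose Accept value contains text/html survive the contains test.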

Cheers,


Sake
Here is a perfect example of what I don't want to happen:

dumpcap -f "port 80 or port 443" -w http.cap
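(In case dumpcap grabs the wrong network interface by default, the interfaces can be listed and one selected explicitly; the interface number 1 below is only an example:)

dumpcap -D
dumpcap -i 1 -f "port 80 or port 443" -w http.cap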

start capturing and go ONLY to www.nytimes.com with your favorite browser.

click on a few links for stories (I clicked only on the story about "Huge Ice Island Splits From Greenland") :-(

Then run Sake's latest tshark script:

# keep only requests whose Accept header asks for HTML, print Host and Referer
tshark -nlr http.cap -R 'http.request and http.accept contains "text/html"' \
    -T fields -e http.host -e http.referer |
# use the referer when present, otherwise fall back to the host
awk '$2=="" {print $1;next} {print $2}' |
# strip the URL down to its hostname
sed -e 's#^http://\([^\/]*\).*$#\1#' |
# keep only the last two dot-separated labels of the hostname
sed -e 's/^.*\.\([^\.]*\.[^\.]*\)$/\1/' |
# count occurrences and show the top 100 domains
sort | uniq -c | sort -rn | head -100

You will see:

37 nytimes.com
6 brightcove.com
2 llnwd.net
1 tubemogul.com

I'm really only interested in nytimes.com.

I can't understand how those other domains get in there with Sake's filter of 'http.request and http.accept contains "text/html"'.
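One way to see where they come from is to print the request URI alongside host and referer and grep for one of the unexpected domains (brightcove is taken from the output above; the rest of the command reuses Sake's flags):

tshark -nlr http.cap -R 'http.request and http.accept contains "text/html"' \
    -T fields -e http.host -e http.referer -e http.request.uri | grep brightcove

That way the host, referer and URI of each offending request can be inspected directly.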