Wireshark-dev: Re: [Wireshark-dev] Autodetection of file types
From: Guy Harris <guy@xxxxxxxxxxxx>
Date: Sat, 2 Jul 2011 11:32:05 -0700
On Jul 1, 2011, at 8:07 AM, Matt Godbolt wrote:

> I've just hit an issue where an Endace packet file (ERF) that I'm trying to load into Wireshark is being incorrectly loaded as a "packetlogger" file type.
> 
> From looking at the source, the packetlogger_open() call doesn't seem to be very restrictive - I can see how it could generate false positives.  I can also see from file_access.c that packetlogger files have sometimes been mis-identified as mpegs.

Part of the problem is that "magic numbers" are ultimately just a form of heuristic, as there's no guarantee that a file that has the magic number in the appropriate location is a file of the type corresponding to that magic number.

Some magic numbers are probably pretty strong - I suspect relatively few non-pcap files start with A1 B2 C3 D4 or D4 C3 B2 A1.

Some magic numbers, not so much - there are probably plenty of files beginning with 00 00 01 that aren't MPEG-2 packetized elementary streams.
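To make the difference in strength concrete, here is a minimal sketch of the two checks. The function names looks_like_pcap() and looks_like_mpeg_pes() are made up for illustration and are not the actual wiretap routines; the byte values are the ones mentioned above. The pcap check needs four fairly unusual bytes to match, while the MPEG-2 PES check needs only three bytes, two of which are zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the real wiretap code: returns 1 if the
 * buffer starts with the pcap magic number in either byte order. */
static int looks_like_pcap(const unsigned char *buf, size_t len)
{
    static const unsigned char magic_be[4] = { 0xA1, 0xB2, 0xC3, 0xD4 };
    static const unsigned char magic_le[4] = { 0xD4, 0xC3, 0xB2, 0xA1 };

    assert(buf != NULL);
    if (len < sizeof magic_be)
        return 0;
    return memcmp(buf, magic_be, sizeof magic_be) == 0 ||
           memcmp(buf, magic_le, sizeof magic_le) == 0;
}

/* Hypothetical helper: returns 1 if the buffer starts with 00 00 01,
 * the prefix discussed above for MPEG-2 packetized elementary streams.
 * Many unrelated binary files also begin this way, so a match here is
 * a much weaker hint than a pcap magic number match. */
static int looks_like_mpeg_pes(const unsigned char *buf, size_t len)
{
    static const unsigned char prefix[3] = { 0x00, 0x00, 0x01 };

    assert(buf != NULL);
    if (len < sizeof prefix)
        return 0;
    return memcmp(buf, prefix, sizeof prefix) == 0;
}
```

Either function can still return 1 for a file of some other type; the only difference is how likely that is.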

> An obvious solution would be to move the erf_open routine above packetlogger_open, which would also appear to require moving netscreen_open above too (false positives there too)...
> 
> Given how fragile this whole process is, would that be safe - and how might I go about testing that I haven't broken anything else if I were to do so?
> 
> Failing all that; there's quite a simple way to detect ERFs (in the case that I'm seeing...) - relying on the '.erf' at the end of the filename. Presumably that's a no-go for other reasons.

The file suffix is not an *absolute* guarantee of file type, for several reasons:

	1) some files are generated by UN*X command-line programs (e.g., tcpdump) that don't even attempt to enforce a file suffix on the files they write;

	2) some files were generated by classic Mac OS applications (e.g., EtherPeek) and didn't use suffixes (relying on type and creator code, probably, which is why the old *Peek format didn't have a magic number, either);

	3) some files are text files, so if they have suffixes at all, it's probably ".txt";

	4) some suffixes are used by multiple programs with their own different binary formats, such as ".cap".

However, it can be used as a *hint*, just as data in the file can be used as a hint.  For example, files whose "standard" creator gives them a suffix, such as PacketLogger, could perhaps be sorted later in the list, *but* have their open routine called *before* the open routine for weak-heuristic or no-heuristic file formats *if* the file suffix matches their specified file suffix.

I.e., for a file whose name ends in ".pklg", packetlogger_open() would be called before erf_open() or mpeg_open(), but, for a file whose name *doesn't* end in ".pklg", it would be called *after* erf_open() or mpeg_open().
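A rough sketch of that ordering, to show the idea rather than how file_access.c is or should be written: everything here (the struct opener table, the strength classification of each format, has_suffix(), and build_try_order()) is hypothetical, and the ".erf" suffix is taken from the original question. Formats with strong magic numbers go first, then any format whose suffix matches the file name, then everything else in table order, with PacketLogger sorted last in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative classification only; the real formats are not tagged
 * this way anywhere in the source. */
enum strength { STRONG_MAGIC, WEAK_HEURISTIC };

struct opener {
    const char   *name;     /* stands in for the xxx_open() routine */
    enum strength strength;
    const char   *suffix;   /* suffix the "standard" creator uses, or NULL */
};

/* Hypothetical table; PacketLogger is deliberately sorted last. */
static const struct opener openers[] = {
    { "pcap",         STRONG_MAGIC,   NULL    },
    { "erf",          WEAK_HEURISTIC, ".erf"  },
    { "mpeg",         WEAK_HEURISTIC, NULL    },
    { "packetlogger", WEAK_HEURISTIC, ".pklg" },
};
#define N_OPENERS (sizeof openers / sizeof openers[0])

/* Returns 1 if filename ends with suffix (case-sensitive). */
static int has_suffix(const char *filename, const char *suffix)
{
    size_t flen, slen;

    if (suffix == NULL)
        return 0;
    flen = strlen(filename);
    slen = strlen(suffix);
    return flen >= slen && strcmp(filename + flen - slen, suffix) == 0;
}

/* Fills order[] with indices into openers[] in the order the open
 * routines would be tried, and returns how many were written. */
static size_t build_try_order(const char *filename, size_t order[], size_t cap)
{
    size_t n = 0, i;

    assert(filename != NULL && cap >= N_OPENERS);

    /* Pass 1: strong magic numbers are always checked first. */
    for (i = 0; i < N_OPENERS; i++)
        if (openers[i].strength == STRONG_MAGIC)
            order[n++] = i;

    /* Pass 2: weaker formats whose suffix matches the file name. */
    for (i = 0; i < N_OPENERS; i++)
        if (openers[i].strength != STRONG_MAGIC &&
            has_suffix(filename, openers[i].suffix))
            order[n++] = i;

    /* Pass 3: the remaining weaker formats, in table order. */
    for (i = 0; i < N_OPENERS; i++)
        if (openers[i].strength != STRONG_MAGIC &&
            !has_suffix(filename, openers[i].suffix))
            order[n++] = i;

    return n;
}
```

With this table, "trace.pklg" would be tried as pcap, packetlogger, erf, mpeg, while a file with no matching suffix would be tried as pcap, erf, mpeg, packetlogger. The suffix only changes the order in which routines are tried; each routine still has to accept the file's contents.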

(We might also want to split mpeg_open() into separate routines, so that the routine that checks for MPEG-2 packetized elementary streams comes later in the list, due to the relative weakness of its magic number.)

The file formats with reasonably strong magic numbers would still be checked first (especially formats such as pcap or pcap-ng, where many of the files have no suffix or a non-standard suffix).