Wireshark-dev: Re: [Wireshark-dev] Proposal to improve filtration speed by caching fields that
From: Jaap Keuter <jaap.keuter@xxxxxxxxx>
Date: Mon, 15 Jun 2020 06:24:18 +0200
Hi,

I'm not sure when the filtering system was last worked on in this depth, but I suspect it has been a while. Finding someone who is completely up to speed on it may be a challenge.

Thanks,
Jaap


> On 15 Jun 2020, at 05:38, Sidhant Bansal <sidhbansal@xxxxxxxxx> wrote:
> 
> Hi all,
> 
> I want to propose an improvement to speed up display filters: avoid re-dissecting all the packets again and again when it is not required, and instead maintain a cache of the fields that have been queried recently.
> 
> Motivation: Benchmarking filtering on capture files larger than 100 MB shows that the re-dissection step, i.e. the time spent inside the dissectors, is substantial: upwards of roughly 40-50% of the total time is spent re-dissecting. I believe we can make huge savings here.
> 
> Example:
> 1st Filter applied: tcp.srcport >= 1200 && tcp.dstport <= 1500
> This filter runs as it does right now AND stores tcp.srcport and tcp.dstport for all the packets in memory in Wireshark.
> 2nd Filter applied: tcp.srcport == 80
> We don't need to re-dissect all the packets again and can simply refer to the stored information to apply the filter.
> 3rd Filter applied: tcp.srcport == 120 || udp.srcport == 80
> Since we haven't stored "udp.srcport" in our cache, we need to re-dissect again AND we will also store udp.srcport for all the packets (to speed up future filter queries).
> 4th Filter applied: tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500
> Since all of these fields are in the cache, we can read them directly from the information stored in memory and don't need to re-dissect any of the packets.
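The four-filter walkthrough above can be sketched roughly as follows. This is only an illustration of the decision "re-dissect or not", not Wireshark code: the regular expression, `fields_in` and `apply_filters` are hypothetical names, and the real filter engine parses expressions quite differently.

```python
import re

# Hypothetical helper: pull field names such as "tcp.srcport" out of a filter string.
FIELD_RE = re.compile(r"[a-z_][a-z0-9_]*(?:\.[a-z0-9_]+)+")

def fields_in(filter_text):
    """Return the set of field names referenced by a filter expression."""
    return set(FIELD_RE.findall(filter_text))

def apply_filters(filters):
    """Return, per filter, whether a full re-dissection would be needed."""
    cached = set()        # fields whose values are already stored for every packet
    decisions = []
    for text in filters:
        needed = fields_in(text)
        missing = needed - cached
        decisions.append(bool(missing))
        cached |= needed  # a re-dissection pass also stores the missing fields
    return decisions

decisions = apply_filters([
    "tcp.srcport >= 1200 && tcp.dstport <= 1500",
    "tcp.srcport == 80",
    "tcp.srcport == 120 || udp.srcport == 80",
    "tcp.srcport == 40 || udp.srcport >= 1000 || tcp.dstport <= 500",
])
print(decisions)  # [True, False, True, False]
```

Only the 1st and 3rd filters trigger a re-dissection in this sketch, which matches the example.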
> 
> We can limit the number of fields we store in memory at any given moment depending on how many packets we have and how much memory we can afford to allocate. Fields can be deleted from the cache according to a specific cache replacement policy (I haven't thought about which one will be the most apt, input is welcome).
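As one possible answer to the replacement-policy question, here is a minimal sketch of least-recently-used (LRU) eviction over field names. LRU is an assumption made purely for illustration; the proposal does not pick a policy, and `FieldLRU` and `max_fields` are hypothetical names.

```python
from collections import OrderedDict

class FieldLRU:
    """Tracks which fields are cached; evicts the least recently used one."""

    def __init__(self, max_fields):
        self.max_fields = max_fields
        self._fields = OrderedDict()  # field name -> None, oldest first

    def touch(self, name):
        """Mark a field as used; return the evicted field name, if any."""
        if name in self._fields:
            self._fields.move_to_end(name)
            return None
        self._fields[name] = None
        if len(self._fields) > self.max_fields:
            evicted, _ = self._fields.popitem(last=False)
            return evicted
        return None

    def cached(self):
        """Return cached field names, least recently used first."""
        return list(self._fields)

lru = FieldLRU(max_fields=2)
lru.touch("tcp.srcport")
lru.touch("tcp.dstport")
lru.touch("tcp.srcport")            # refresh: tcp.dstport is now the oldest
evicted = lru.touch("udp.srcport")  # exceeds the limit, evicts tcp.dstport
```

When a field is evicted, its stored values would have to be dropped from every packet as well, so the limit would in practice depend on the packet count as described above.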
> 
> Most fields tend to be fixed-length and small, i.e. <= 8 bytes. For fields such as strings, which are variable-length and can be arbitrarily large, we can skip this caching procedure and instead re-dissect all the packets whenever the filter expression contains such a field.
> 
> From an implementation point of view: the cached field information can be stored inside frame_data, since that persists for as long as a single capture file is open in Wireshark. Whenever we encounter a new filter query, we check whether all of its fields are in the cache. If yes, then once the abstract syntax tree of the filter query has been converted to DFVM, we evaluate it by looking up the cache instead of re-dissecting. If no, then we do what we do currently, i.e. re-dissect, but we also store the new fields in our cache (according to the chosen replacement policy).
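The per-frame storage described above might look roughly like this. It is a toy model under stated assumptions: `Frame` merely stands in for frame_data, `dissect` is a fake dissector that counts its invocations, filters are plain Python predicates rather than DFVM programs, and the 8-byte limit is taken from the "<= 8 bytes" remark earlier in the proposal.

```python
MAX_CACHED_BYTES = 8  # assumption, taken from the "<= 8 bytes" remark

class Frame:
    """Stand-in for frame_data: raw packet info plus a per-frame field cache."""
    def __init__(self, raw):
        self.raw = raw    # a dict pretending to be undissected packet bytes
        self.cache = {}   # field name -> value (None if absent in this frame)

dissect_calls = 0

def dissect(frame):
    """Fake dissector: returns all fields of the frame and counts invocations."""
    global dissect_calls
    dissect_calls += 1
    return dict(frame.raw)

def cacheable(value):
    # Only absent fields and small integers are cached; strings are skipped.
    return value is None or (
        isinstance(value, int) and value.bit_length() <= 8 * MAX_CACHED_BYTES
    )

def run_filter(frames, fields, predicate):
    """Evaluate predicate(values) per frame, dissecting only on a cache miss."""
    matches = []
    for index, frame in enumerate(frames):
        if all(name in frame.cache for name in fields):
            values = {name: frame.cache[name] for name in fields}
        else:
            tree = dissect(frame)
            values = {name: tree.get(name) for name in fields}
            for name, value in values.items():
                if cacheable(value):
                    frame.cache[name] = value
        if predicate(values):
            matches.append(index)
    return matches

frames = [
    Frame({"tcp.srcport": 80, "tcp.dstport": 1400}),
    Frame({"tcp.srcport": 1300, "tcp.dstport": 443}),
    Frame({"udp.srcport": 53}),
]

first = run_filter(
    frames, ["tcp.srcport", "tcp.dstport"],
    lambda v: v["tcp.srcport"] is not None
    and v["tcp.srcport"] >= 1200 and v["tcp.dstport"] <= 1500,
)
calls_after_first = dissect_calls
second = run_filter(frames, ["tcp.srcport"], lambda v: v["tcp.srcport"] == 80)
```

In this sketch the first filter dissects every frame once, and the second filter is answered entirely from the per-frame caches, so `dissect_calls` does not grow. Note that "field absent in this frame" is cached too (as `None`); otherwise non-TCP frames would be re-dissected on every query.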
> 
> I would welcome any feedback on, or objections to, this optimization.
> 