Wireshark-dev: Re: [Wireshark-dev] [Wireshark-core] Hash Tables and Algorithmic Complexity Atta
From: Anders Broman <anders.broman@xxxxxxxxxxxx>
Date: Tue, 22 Apr 2014 12:14:28 +0000

-----Original Message-----
From: wireshark-dev-bounces@xxxxxxxxxxxxx [mailto:wireshark-dev-bounces@xxxxxxxxxxxxx] On Behalf Of Bálint Réczey
Sent: den 22 april 2014 13:53
To: Developer support list for Wireshark
Subject: Re: [Wireshark-dev] [Wireshark-core] Hash Tables and Algorithmic Complexity Attacks

[Bringing the discussion to -dev with Evan's permission]

2014-04-22 10:15 GMT+02:00 Anders Broman <anders.broman@xxxxxxxxxxxx>:
>
>
> -----Original Message-----
> From: wireshark-core-bounces@xxxxxxxxxxxxx 
> [mailto:wireshark-core-bounces@xxxxxxxxxxxxx] On Behalf Of Evan Huus
> Sent: den 22 april 2014 05:36
> To: Wireshark core developers
> Subject: Re: [Wireshark-core] Hash Tables and Algorithmic Complexity 
> Attacks
>
> On Mon, Apr 21, 2014 at 6:28 PM, Balint Reczey <rbalint@xxxxxxxxx> wrote:
>>
>> On Apr 22, 2014 12:11 AM, Evan Huus <eapache@xxxxxxxxx> wrote:
>>>
>>> On Mon, Apr 21, 2014 at 3:59 PM, Bálint Réczey <balint@xxxxxxxxxxxxxxx> wrote:
>>> > Hi Evan,
>>> >
>>> > 2014-04-21 18:21 GMT+02:00 Evan Huus <eapache@xxxxxxxxx>:
>>> >> (sending to -core because of potential security implications)
>>> >>
>>> >> I was recently investigating implementing a wmem-based hash 
>>> >> table, since many of the current users of the wmem-tree do not 
>>> >> actually need the predecessor-lookup feature which is the only 
>>> >> advantage it provides over a hash table (whereas a hash table is otherwise much faster).
>>> >>
>>> >> I ended up wandering down the rabbit-hole of hash collisions, 
>>> >> algorithmic complexity attacks, and universal hashing ([1], [2], 
>>> >> [3] and more).
>>> >>
>>> >> Then I read the Glib hash table documentation: [4]. A quick grep 
>>> >> through the Wireshark source reveals numerous locations where we 
>>> >> use packet data in hash tables with insecure hash functions. As 
>>> >> such, I believe that Wireshark is vulnerable to a whole class of 
>>> >> denial-of-service attacks in this area.
>>> >>
>>> >> Has this come up before? Am I overlooking something clever we're doing?
>>> > It is true that Wireshark is vulnerable to hash collision attacks, 
>>> > and I think it did not come up before because we had enough 
>>> > vulnerabilities of other simpler classes. ;-)
>>>
>>> Makes sense :)
>>>
>>> > I think we could create wrappers to provide random seed to 
>>> > standard glib hash functions which is the standard way of handling 
>>> > such vulnerabilities.
>>>
>>> Unfortunately (based on my reading) that will not be sufficient.
>>> There are two vectors for an attack via collisions: the mapping from 
>>> key to guint, and the mapping from guint to bucket index. The first
>>> (key->guint) is user-settable so we can provide a random seed. The 
>>> second (guint->bucket) is not user-controllable. From the glib
>>> documentation:
>>>
>>> "The modulus is taken with the hash table size (a prime number) to 
>>> find the 'bucket' to place each key into."
>>> and then a few paragraphs down
>>> "Even cryptographic hashes are very easy to find collisions for when 
>>> the remainder is taken modulo a somewhat predictable prime number."
>>>
>>> So basically, it doesn't matter how strong we make the initial 
>>> mapping because the secondary bucket mapping is predictable anyways.
>>> Fortunately there are easy and efficient ways to make the secondary 
>>> mapping unpredictable (it's actually simpler than the initial
>>> mapping) so I guess that a good secure wmem map implementation is 
>>> actually fairly important to have.
>> Luckily ghashtable resizes itself increasing the number of buckets when the collision rate is high, thus this attack can't be performed effectively.
>> The described problem is valid only for hash tables with fixed bucket count.
>>
>> Regarding the wmem hash tables I think C wrapping C++ STL hash tables with wmem based custom allocators would do the job.
>>
>>Unfortunately this doesn't work; the free-all operation which wmem provides doesn't play nice with C++ destructors (besides which we are still pure-C for libwireshark for the time being).
We lost being C only with the Qt UI. :-) It can be implemented several ways which work, one is registering each hash table to the proper wmem pool and calling their destructor when freeing the pool - this way we don't have to play with a custom allocator.
Other technique would be not calling destructors, just freeing all the allocated connected objects. This is not nice, but would work as well IMO, since we would not hold refs to the free()-d objects.

>>
>>I have whipped up a very basic wmem-map implementation at [1]. It uses universal multiply-shift hashing [2] for bucket placement, and provides a wmem_secure_hash function for replacing >g_str_hash and friends, based on work done by the Perl community [3] in hardening their hash tables. To the best of my knowledge the resulting hash map is secure against complexity attacks.
>>
>>It still needs cleanup (comments, tests, etc). I would also like to replace one tree somewhere and run benchmarks to make sure it's actually faster. Suggestions and concerns are welcome.
It would be nice to see how it compares to other implementations:
http://incise.org/hash-table-benchmarks.html

I prefer using existing building building blocks instead of inventing our own.


One advantage would be the use of wmem which would release the memory atutomagically. I would also like to see benchmarks if the speed gain is good then more + for our own functions. 

Cheers,
Balint

>>
>>[1] https://code.wireshark.org/review/1272
>>[2] 
>>https://en.wikipedia.org/wiki/Universal_hashing#Avoiding_modular_arith
>>metic [3] http://blog.booking.com/hardening-perls-hash-function.html
>
>
> Would it be good to have initial hastable size in the API? So a table could be pre allocated to a custom size for the cases where we know the size will be larger than standard default.
I think it would not be good. I would save a constant number resizes
(O(1) gain) but we would reserve more memory for each hash table initially.

Yes but that would be an informed design decision, if we "know" that we will get at least x resizes for a "normal" trace file why not pre allocate for x if the memory "cost" is low for the
Unlikely small trace file cases where overall memory usage is low anyway? In some cases we might even know the final size, say a hf hash table. 

Cheers,
Balint

> Regards
> Anders
>
>
>> Cheers,
>> Balint
>>
>>>
>>> > AFAIK ghashtable does not employ randomization of hash function seed:
>>> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=655044
>>> >
>>> > Cheers,
>>> > Balint
>>> >
>>> >>
>>> >> Evan
>>> >>
>>> >> P.S. I believe it's still perfectly OK to use hash tables with 
>>> >> non-packet data such as static value_strings or as in [5].
>>> >>
>>> >> [1] http://lwn.net/Articles/474912/ [2] 
>>> >> http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attac
>>> >> k s [3] https://en.wikipedia.org/wiki/Universal_hashing
>>> >> [4]
>>> >> https://developer.gnome.org/glib/unstable/glib-Hash-Tables.html#G
>>> >> H
>>> >> ashFunc [5]
>>> >> https://code.wireshark.org/review/gitweb?p=wireshark.git;a=commit
>>> >> ;
>>> >> h=5983cda769fa6615dda5fc7b8f87819d40f0a8d5
>>> >
___________________________________________________________________________
Sent via:    Wireshark-dev mailing list <wireshark-dev@xxxxxxxxxxxxx>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request@xxxxxxxxxxxxx?subject=unsubscribe