Wireshark-dev: Re: [Wireshark-dev] Regenerating packet-parlay.c
From: Jaap Keuter <jaap.keuter@xxxxxxxxx>
Date: Mon, 4 May 2020 14:46:58 +0200
On 5/4/20 11:16 AM, Luke Mewburn wrote:
> On 20-05-04 10:55, Jaap Keuter wrote:
>   | Hi Luke,
>   | 
>   | > Yes, I regenerated the code using that patch to
>   | > tools/wireshark_gen.py, and it builds fine across a couple of
>   | > platforms.
>   | 
>   | Yes, I could build it too, so the generated code itself was okay. My
>   | quest is to regenerate the _same_ source files.
> 
> I don't think that's going to be possible to regenerate the same as the
> currently committed file, because it was generated with wireshark_gen.py
> whose implementation that did not guarantee reproducible output.
> 
> I think that unless you have access to the exact system setup that
> was used to generate packet-parlay.c that you won't be able reproduce
> the files as-is with the existing tool anyway.
> 
Good point. Also this drives home the point that generated code needs to be
reproducible.


> 
>   | > I wrote a simple wrapper tool idl-regen (attached) to assist each
>   | > regen from the idl, used with something like:
>   | > 	$ ./idl-regen ../epan/dissectors/
>   | > 	mmccs.idl:120: Warning: Identifier 'EventType' clashes with CORBA 3 keyword 'eventtype'
>   | > 	policy_data.idl:119: Warning: Anonymous sequences for recursive unions are deprecated. Use a forward declaration instead.
>   | > 	policy_data.idl:123: Warning: Anonymous sequences for recursive unions are deprecated. Use a forward declaration instead.
>   | > 
>   | > I get the same .c files generated from either:
>   | > - CentOS 7,  python 2.7.5, omniidl 4.2.0
>   | > - Fedora 31, python 3.7.6, omniidl 4.2.3
>   | > 
>   | 
>   | So the code generation is identical between Python 2 and 3, that's a
>   | good sign.
> 
> Yes.
> 
> 
>   | > Three files get changes from git master: - packet-gias.c -
>   | > packet-parlay.c - packet-tango.c
>   | > 
>   | > The generated code for packet-gias.c and packet-tango.c differ to
>   | > git master because of the following fix to idl2wrs that doesn't
>   | > appear to have been rerun to regenerate those files: commit
>   | > 443df93896 Author: Yannik Enss <Yannik.Enss@xxxxxxxxxxxxxxxxx>
>   | > Date:   2019-05-29 14:20:42 +0200
>   | > 
>   | > 	idl2wrs: fix 'undeclared identifier' error
>   | > 
>   | > 	the 'x_octetx' variables were removed a few years back, replace
>   | > 	them with get_CDR_xxx()
>   | > 	I8cf3410d8a152c834e7019f7d1d80de3798530c3
>   | > 
>   | 
>   | Good find. I've applied the anti-patch for this and reran code
>   | generation. Now the same source files come out (with the exception
>   | below). I've also build with these files without issue, so can
>   | anyone tell me what this commit achieves? It references something 'a
>   | few years back' which doesn't help either. If nothing I'm inclined
>   | to apply this anti-patch.
>   | 
>   | Gerald, you seem to have 'rubber stamped' this patch series about a
>   | year ago. I know it's much to ask, but I can't find anything on
>   | this, no bug report, no wireshark-dev posts, can you tell what this
>   | was for and why the repository code is not the same as the code
>   | generated?
> 
> I actually think the change is probably fine to leave, but I'll defer
> to others.
> 
True, but than it becomes a solution searching for a problem. I'd rather
understand the problem, or find there's no problem at all, so the solution can
be dropped.


> 
>   | > packet-parlay.c gets that above change as well as many other
>   | > changes to generated code for /* User exception filters */,
>   | > mostly in the ordering of the declarations and the list of
>   | > array entries added to proto_register_giop_parlay() variable
>   | > 	static hf_register_info hf[] 
>   | > 
>   | 
>   | Yes, this was the trigger for my initial question. Why is that ordering
>   | different than what's committed in the repo. My position is that
>   | generated files must always be possible to be regenerated.
> 
> I completely agree that generated files must be able to be reliably
> regenerated from the same inputs and tools without change.
> 
> As we both know, the issue is that the existing tool did not guarantee
> reproducible output, due to the use of a python dict which did not
> guarantee a particular ordering before python 3.6.
> 
> 
> 
>   | > As far as I can tell, the new packet-parlay.c has the same
>   | > number of lines, just in different order.
>   | 
>   | That was my impression too, but there are just too many to check by hand.
>   | 
>   | 
>   | > and compared the regenerated file with that, they both still have same
>   | > size and number of lines, and if I sort the lines in both the files are
>   | > identical. (Of course, that doesn't compile... :)
>   | > 
>   | 
>   | Oh, that is indeed a nice trick to see if at least the same lines are in the
>   | generated code files. I'm going to use that.
> 
> It wasn't super robust, but it was just part of a broader group of
> tests I ran.
> 
Of course, it's partial, but significant.

> 
>   | > I tried doing some comparison of the generated object files but
>   | > they differed too much, probably because of the reordering of the
>   | > source lines, even though the hf[] array ends up with the same
>   | > content in different order.
>   | > 
>   | 
>   | Okay, so your statement is that we just have to accept that the
>   | ordering of the the currently generated source file is going to be
>   | different from the committed source file. Do you think we still
>   | should introduce the ordering statements?
> 
> I think we should use the OrderedDict change I proposed, which means
> that the ordering will be in order the data structure is populated.
> This matches all of the other data structures in wireshark_gen.py
> and wireshark_be.py (lists).
> 
> I don't think we need to change the code to sort all the different
> identifiers and generated code, however. That will result in a bigger
> change to all the source files generated from IDL.
> 
Agreed, so your improvements go in, as far as I'm concerned.


> 
>   | > I don't know how much more verification is required, however.  It
>   | > looks like in the past there have been changes to wireshark_gen.py
>   | > and then regen of packet-parlay.c that reordered a lot of the
>   | > content.  For example, see commit c99bee9b5d on 2019-06-05.
>   | > 
>   | 
>   | Well, for me now reproducibility is the main goal. With your 'line
>   | sorting trick' we can at least confidently state that the actual
>   | items are not changed, and have to assume that the code generation
>   | is solid enough to create equivalent code. I'm okay with that.
> 
> There may be some more testing that could be done with processing the
> output of objdump, but I'm not sure it's worth it.
> 
> As mentioned previously, I suspect that this level of verification
> of regenerated files has not been applied to the previously
> committed versions of these regenerated files based on reviewing
> their commit history.
> 
> 
> Another other test, of course, is to process some parlay captures with
> tshark with a full decode, and compare the output :)
> 
> 
Sure thing. I'll see what I can do.

thanks,
Jaap