Some experience from playing around with both 2x8GB and 2x16GB over the last few weeks:
It's hard to find an IMC that does 2x8GB well on air (2000+). On LN2 it should not be a problem for most CPUs. But it's much harder to find an IMC that does well with 2x16GB when cold! For example my ES, which has the best IMC so far, can loop PYP with 2x16GB at C12-11, very tight, at 2050MHz, all on air. At full pot the IMC tops out at just 2000. The same CPU can do 2x8GB at 2200++ C12 on LN2 full pot.
You have to pay attention to timings. Lower is definitely not always faster, and you have to balance performance vs stability. For example, with my timings TRFC 180 is worse than 220 for both PYP and 32M. Same goes for some subtimings. Some timings give slightly faster times but end in a crash 9 times out of 10. For me that is not worth it when another setting passes 10 out of 10.
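A rough way to put that trade-off into numbers is sketched below. This is only an illustration: the `Setting` class, the `pick` helper, the 90% pass-rate cutoff and all the times and pass counts are made up for the example, not my actual results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Setting:
    name: str
    time_s: float   # benchmark time when the run passes (lower is better)
    passes: int     # number of runs that finished
    attempts: int   # number of runs started

    @property
    def pass_rate(self) -> float:
        return self.passes / self.attempts


def pick(settings: list[Setting], min_pass_rate: float = 0.9) -> Optional[Setting]:
    """Return the fastest setting among those that pass often enough."""
    stable = [s for s in settings if s.pass_rate >= min_pass_rate]
    if not stable:
        return None
    return min(stable, key=lambda s: s.time_s)


# Hypothetical numbers, for illustration only.
candidates = [
    Setting("tight subtimings", time_s=250.0, passes=1, attempts=10),
    Setting("relaxed subtimings", time_s=250.4, passes=10, attempts=10),
]

best = pick(candidates)
print(best.name if best else "nothing stable enough")
```

With these made-up numbers the relaxed set wins: it is a bit slower per run, but the tight set would need about 10 attempts on average for one finished run.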
For me A2 is better than A0/A1 for 2x8GB. But I haven't played much with A0/A1 because my A2 sets are fantastic. I know A0/A1 likes slightly different timings, so maybe I didn't tweak that enough... For 2x16GB I only have A2. Best 2x16GB kit so far did around 1970MHz C12-11 32M, 2050MHz PYP. Best 2x8GB does 4300+ C12-12 32M and 4400+ PYP.
32M with BenchMate is really fast on Win7. I use the same OS as I do for 3D, so nothing special really. Don't have to do waza either. A good Win7 run can compete with XP for low clock challenges, but it varies more than XP. For all-out runs XP will give you higher clocks, so XP is still better there.
On ASUS, make sure Round Trip Latency is enabled in the memory training settings. Also use the latest SPI BIOS and Maximus Tweak Mode 2.
For 32M with 2x16GB on XP, a maxmem higher than 600 is better.
When benching cold I would recommend 2x8GB, because unless you have a killer IMC it will have trouble handling 2x16GB when cold. At least when you bench full pot. You will most likely lose MHz compared to air, unlike 2x8GB where you can see massive gains.