@finnw: to help us determine whether profiling or language overhead is overshadowing the timing result, please try implementing the bit-counting code in a lower-level language such as C, C++ or assembly, run the timing test with 1 billion elements on the same machine, and post the timing results for both the Java and the native implementations. (I understand that your final implementation still nee