I replaced the blocking queue that sits between pipeline actors (threaded consumers/producers) with my new ping-pong buffered version, and the running time of my word freq program dropped from 5.2s to 3.5s on a 5M file. Nothing else changed. That is close to the 3.0s achieved by the nonthreaded version.
// threaded, pipeline version; 3.5s (from 5.2s) on 5M file
f => Words() => { string w | ... }

// nonthreaded, nested map operation on big list; 3.0s
f.lines() => a;
a:{ string line | line.split():{ string w | ... }};
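
For readers who have not seen the idea, here is a rough sketch of what a ping-pong (double) buffer between a producer and a consumer can look like. It is written in Python rather than the language above, and it is not my actual implementation: the names `word_freq`, `BUF_SIZE`, and `_DONE` are made up for illustration, and the buffer size is an arbitrary guess. The point it tries to show is that the two threads synchronize once per full buffer instead of once per word, which is presumably where the savings over a per-item blocking queue come from.

```python
import queue
import threading
from collections import Counter

BUF_SIZE = 1024   # hypothetical batch size, not a tuned value
_DONE = None      # sentinel marking the end of the stream


def word_freq(lines, buf_size=BUF_SIZE):
    """Count words with a producer and a consumer thread that trade
    two reusable buffers back and forth (ping-pong buffering)."""
    empty = queue.Queue()   # buffers ready to be filled
    full = queue.Queue()    # buffers ready to be drained
    empty.put([])           # exactly two buffers circulate
    empty.put([])
    counts = Counter()

    def producer():
        buf = empty.get()
        for line in lines:
            for w in line.split():
                buf.append(w)
                if len(buf) >= buf_size:
                    full.put(buf)       # one hand-off per buffer, not per word
                    buf = empty.get()   # waits until the consumer returns one
        full.put(buf)                   # flush the partially filled last buffer
        full.put(_DONE)

    def consumer():
        while True:
            buf = full.get()
            if buf is _DONE:
                break
            counts.update(buf)
            buf.clear()
            empty.put(buf)              # hand the drained buffer back

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling `word_freq(open(path))` on a text file (with `path` being whatever file you want to count) would return a `Counter` of word frequencies. While the consumer drains one buffer, the producer is free to fill the other, so neither side blocks unless it gets a full buffer ahead.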