I'm cross-posting the following from the "flowlang" Google Group, with minor modification.
--
It might be helpful to think about the creation of a syntax for Flow that maps onto the semantics described in the Flow Manifesto as follows:
- Take a very simple purely functional programming language syntax -- maybe a subset of Haskell, or a functional subset of Python.
- Add one operator, a scatter/push operator, say "->", that gives the language more of an imperative feel.
- [You can also add timestamps, e.g. x(t) = x(t-1) * 2. These also give the language more of an imperative feel, but I won't talk about those here for brevity. They're just basically a convenient way of aliasing immutable variables that has semantic value to both the user and the compiler.]
Then assignment, "=", is a pull / gather operation: y = f(x) pulls a value from x, applies f, and pulls the computed value into y, as expected.
The way that "->" operates is that it pushes / scatters values to locations, which is typical of the imperative style of programming. For example, take the following Java code:
    int[] ys = new int[] {1, 5, 2, 9, 1, 1, 4, 2};
    int maxVal = 0;
    for (int i = 0; i < ys.length; i++)
        maxVal = Math.max(maxVal, ys[i]);
    int[] hist = new int[maxVal + 1];
    for (int y : ys)
        hist[y]++;
You would do something like this in Flow (making up the syntax):
    ys = {1, 5, 2, 9, 1, 1, 4, 2}
    for y in ys
        1 -> ones[y]        // each entry ones[y] is a set of integers (with value 1)
    hist[i] = sum ones[i]   // implicit iteration: "for all i", i.e. hist = map sum ones
This histogram example is one I keep coming back to in my own mind because it's a common and minimal test case for the "push" model of programming. Basically you're scattering counts into a histogram at indices corresponding to list values. Concurrent increments of the same bin are not thread-safe in general, so the loop is not parallelizable in imperative languages without extra work.
In the Java case I just incremented the histogram values directly; in the Flow case I output a stream of 1s, which were then collected by mapping the "sum" operator over the bins ones[i]. The key thing to realize here is that ones[i] is constrained (by being the target of the "->" operator) to be an unordered collection, because otherwise you would get a race condition if you were pushing different values that had to stay in order relative to the input. It doesn't matter here whether it is ordered or not, of course, because everything that is getting pushed is a 1, but in more complicated cases it can matter. (If you force the target of a "->" operator to be ordered, then the compiler can still parallelize, but it will have to do some extra work to keep things in order.)
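To make the two phases concrete, here is a minimal Java sketch of what the Flow snippet means, before any optimization: first scatter a 1 into a bag per bin, then sum each bag. This is only an emulation under my own assumptions (the names ScatterThenSum, scatterOnes and sumBins are made up, the values in ys are assumed to be non-negative, and a List stands in for the unordered collection); it is not how a Flow compiler would necessarily implement it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ScatterThenSum {
    // Scatter phase: "1 -> ones[y]" for every y in ys.
    // Each bag is conceptually unordered; a List is just a convenient container.
    // Assumes every y is non-negative.
    public static List<List<Integer>> scatterOnes(int[] ys) {
        int maxVal = 0;
        for (int y : ys) {
            maxVal = Math.max(maxVal, y);
        }
        List<List<Integer>> ones = new ArrayList<>();
        for (int i = 0; i <= maxVal; i++) {
            ones.add(new ArrayList<>());
        }
        for (int y : ys) {
            ones.get(y).add(1);
        }
        return ones;
    }

    // Gather phase: "hist[i] = sum ones[i]" for all i, i.e. map sum over the bins.
    public static int[] sumBins(List<List<Integer>> ones) {
        int[] hist = new int[ones.size()];
        for (int i = 0; i < hist.length; i++) {
            int total = 0;
            for (int one : ones.get(i)) {
                total += one;
            }
            hist[i] = total;
        }
        return hist;
    }

    public static void main(String[] args) {
        int[] ys = {1, 5, 2, 9, 1, 1, 4, 2};
        // prints [0, 3, 2, 0, 1, 1, 0, 0, 0, 1]
        System.out.println(Arrays.toString(sumBins(scatterOnes(ys))));
    }
}
```

Because addition does not care about the order of the 1s inside each bag, nothing in sumBins depends on the order in which the scatter loop ran.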
For smaller lists, i.e. ys.length < L for some threshold L, the compiler can just run this as a single-threaded program. For larger lists, the compiler is free to parallelize this in a lot of different ways, for example:
- It can put a lock on each bin, ones[i], to prevent race conditions.
- By realizing that the "sum" operator is just folded addition, and by realizing that addition is associative and commutative, the compiler can build one copy of the "ones" array-of-sets in each thread's TLS (thread local storage), and then combine these separate copies at the end.
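Both strategies can be written by hand in Java, which gives a rough idea of what a compiler might generate. The sketch below is an illustration under my own assumptions: the class and method names are hypothetical, the number of bins is passed in rather than computed, and every y is assumed to lie in [0, bins). For the second strategy I lean on IntStream.collect, which gives each parallel task its own partial histogram and merges them at the end; that is per-task rather than strictly thread-local storage, but the idea is the same.

```java
import java.util.Arrays;

public class ParallelHistogram {
    // Strategy 1: one lock per bin, so two threads only contend
    // when they increment the same bin at the same time.
    public static int[] withPerBinLocks(int[] ys, int bins) {
        int[] hist = new int[bins];
        Object[] locks = new Object[bins];
        for (int i = 0; i < bins; i++) {
            locks[i] = new Object();
        }
        Arrays.stream(ys).parallel().forEach(y -> {
            synchronized (locks[y]) {
                hist[y]++;
            }
        });
        return hist;
    }

    // Strategy 2: each parallel task fills a private histogram, and the
    // partial histograms are added together at the end. This is only
    // valid because addition is associative and commutative.
    public static int[] withLocalCopies(int[] ys, int bins) {
        return Arrays.stream(ys).parallel().collect(
                () -> new int[bins],              // fresh private copy
                (local, y) -> local[y]++,         // unsynchronized local increment
                (a, b) -> {                       // combine two partial results
                    for (int i = 0; i < a.length; i++) {
                        a[i] += b[i];
                    }
                });
    }

    public static void main(String[] args) {
        int[] ys = {1, 5, 2, 9, 1, 1, 4, 2};
        System.out.println(Arrays.toString(withPerBinLocks(ys, 10)));
        System.out.println(Arrays.toString(withLocalCopies(ys, 10)));
    }
}
```

The second version needs no locks at all, at the cost of one extra array per task and a final merge pass, so which one wins depends on the number of bins and how often threads collide on the same bin.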
There are other optimizations that can be performed, e.g. the compiler can perform something similar to "hoisting" by realizing that you're just pumping the ones through a set collection and then into a sum function, and that can be converted to an increment operation.
UPDATE:
The "->" operator should be able to support arbitrary MapReduce-style mapping, scattering, shuffling, grouping by key and reducing. For example, suppose each y in ys is actually an id for a Person record of some form, and you want to group together everybody who has the same first name. You should be able to do something like the following:
    for y in ys
        p = persons[y]
        p -> firstNameGroups[p.firstName]
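For comparison, the closest thing in present-day Java is probably a stream with Collectors.groupingBy. The sketch below is just that, a comparison: the Person class, its fields and the persons lookup map are all hypothetical, since I haven't defined what a Person record looks like.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FirstNameGroups {
    // Hypothetical Person record; only firstName matters for the grouping.
    public static final class Person {
        public final String firstName;
        public final String lastName;

        public Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // Rough equivalent of: for y in ys { p = persons[y]; p -> firstNameGroups[p.firstName] }
    public static Map<String, List<Person>> groupByFirstName(int[] ys, Map<Integer, Person> persons) {
        return Arrays.stream(ys)
                .mapToObj(y -> persons.get(y))                    // p = persons[y]
                .collect(Collectors.groupingBy(p -> p.firstName)); // p -> firstNameGroups[p.firstName]
    }

    public static void main(String[] args) {
        Map<Integer, Person> persons = new HashMap<>();
        persons.put(1, new Person("Ada", "Lovelace"));
        persons.put(2, new Person("Alan", "Turing"));
        persons.put(3, new Person("Ada", "Yonath"));
        Map<String, List<Person>> groups = groupByFirstName(new int[] {1, 2, 3}, persons);
        System.out.println(groups.get("Ada").size());  // prints 2
        System.out.println(groups.get("Alan").size()); // prints 1
    }
}
```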
It might be more helpful in general to extend "->" to take (key,value) pairs, and group by keys:
    for y in ys
        p = persons[y]
        (p.firstName, p) -> firstNameGroups
...or, for the histogram example:
    for y in ys
        (y, 1) -> ones
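The (key, value) form is easy to emulate with a single generic helper, which shows that the grouping example and the histogram example really are the same operation with different keys and values. The following Java sketch is illustrative only; push is a hypothetical stand-in for "(key, value) -> target", and the bags are plain lists rather than true unordered collections.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedPush {
    // Hypothetical stand-in for "(key, value) -> target":
    // append the value to the bag stored under the key, creating the bag if needed.
    public static <K, V> void push(K key, V value, Map<K, List<V>> target) {
        target.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Histogram via keyed push: "(y, 1) -> ones", then sum each bag.
    public static Map<Integer, Integer> histogram(int[] ys) {
        Map<Integer, List<Integer>> ones = new TreeMap<>();
        for (int y : ys) {
            push(y, 1, ones);
        }
        Map<Integer, Integer> hist = new TreeMap<>();
        ones.forEach((key, bag) ->
                hist.put(key, bag.stream().mapToInt(Integer::intValue).sum()));
        return hist;
    }

    public static void main(String[] args) {
        int[] ys = {1, 5, 2, 9, 1, 1, 4, 2};
        // prints {1=3, 2=2, 4=1, 5=1, 9=1}
        System.out.println(histogram(ys));
    }
}
```

Unlike the array version, this map-based histogram only has entries for values that actually occur, so empty bins are simply absent.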