Channel: The Flow Programming Language

On ordering and properties of domains and maps

Here is some of the thinking behind the "push" operator "->".  It basically maps to a "parallel-add-to-collection" operation, ConcurrentCollection.add() or similar. The question of what the type of this operator is was raised on the flowlang mailing list recently. There's quite a lot going on with type inference in Flow, so it's worth collecting those specific concepts in one place.  Also, perhaps by rewording this, it may come across more clearly than in the Flow Manifesto.

To recap relevant background:

Morphisms, domains and co-domains

Algorithms can basically be broken down into morphisms (elemental function applications) and control flow (loops/conditionals: the logic that connects data flowing into and out of morphisms together in some sort of fixed or dynamic graph). If the data dependency graph is viewed as a pipeline with DAG topology, then data can be pushed through the pipeline continuously but data that is all somehow mutually associated needs to be pushed through in "waves", and this gives rise to the need for collections at the nodes, not just single values.

[A morphism is fundamentally a map from a domain to a range, or from a domain to a co-domain if you are talking about only the values in the range that are actually mapped to from the domain. (Standard mathematical usage is the reverse: the co-domain is the declared target set, and the image is the subset of it that is actually reached. This post keeps "co-domain" for the values actually produced.)  A specific co-domain is basically a collection (set or list, not a HashMap or other map type). A morphism is basically the same thing as a function if it maps from a domain to an unknown range, or a map (HashMap or similar) if it maps from a finite domain to a specific co-domain.]

Flow tries hard to unify functions and maps -- and this makes a lot of sense, because in many functional programming languages functions can be "memoized", which basically means you cache a function's values in a HashMap to avoid having to re-compute them where possible. (Eventually we'll blur the line between what's computed at runtime and what's computed at compiletime, and some functions may be partially or wholly evaluated at compiletime and be transparently turned into a precomputed map, but that's a subject for another post.)
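
As a rough illustration of the function/map unification (Python is used only as a stand-in, since no Flow syntax appears in this post, and `square` and `precompute` are hypothetical names), a function over a finite domain can be tabulated into a plain dictionary, after which a lookup and a call are interchangeable:

```python
from typing import Callable, Dict, Iterable


def square(x: int) -> int:
    """An ordinary function: maps a domain value to a result."""
    return x * x


def precompute(f: Callable[[int], int], domain: Iterable[int]) -> Dict[int, int]:
    """Tabulate f over a finite domain, turning the function into a map."""
    return {x: f(x) for x in domain}


# Over the finite domain [0..49] the function and the map agree everywhere.
square_table = precompute(square, range(50))
agrees = all(square_table[x] == square(x) for x in range(50))
```

This is only the runtime half of the idea; the compiletime evaluation mentioned above is not modelled here.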

Tracking properties of variables

Every typed language keeps track of some information about variables (whether a variable represents an integer or a float, etc.), but Flow will take this much further. For example, if a variable x can only represent an integer in the range [0..49] inclusive because it is the index of an item in an ordered collection of fixed constant size 50, then everywhere x is used, other things that are calculated from x have their own domains calculated relative to this number range.  E.g. if you set y = x/2 (using integer division), then y's range will be [0..24] inclusive.
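
The y = x/2 example can be sketched as interval arithmetic. This is a minimal illustration, not how Flow's inference is actually implemented; `IntRange` and `floordiv` are names invented here, and only division by a positive constant is handled:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntRange:
    """Closed integer interval [lo..hi], standing in for a tracked domain."""
    lo: int
    hi: int

    def floordiv(self, divisor: int) -> "IntRange":
        """Range of x // divisor for x in this range (positive divisor only)."""
        if divisor <= 0:
            raise ValueError("this sketch only handles positive divisors")
        # Floor division by a positive constant is monotonic, so the
        # endpoints of the input range map to the endpoints of the output.
        return IntRange(self.lo // divisor, self.hi // divisor)


x_range = IntRange(0, 49)       # index into a collection of size 50
y_range = x_range.floordiv(2)   # domain inferred for y = x / 2
```

Here `y_range` comes out as `IntRange(lo=0, hi=24)`, matching the range given in the text.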

Tracking domains in close detail allows for some really powerful language features: for example, the isqrt() integer square root function might require a parameter that is an integer in the range [0..], so if you pass in an integer with unknown range or with a known range that includes negative numbers, you'll get a compiletime error. No more division by zero etc. -- and you can even be warned about possible overflow, underflow, NaN situations etc. at compiletime.  (It will take some experimentation to figure out how pedantic to make this without annoying the programmer in some cases with variables whose value isn't known at compiletime.)  Also this can be extended to other types to eliminate NullPointerExceptions: you will know whether a reference's value can ever be null just by checking whether the "exceptional value" #Null is in its domain.
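
Python has no compiletime domain checking, so the sketch below only imitates the isqrt() rule with a check that runs before the call. The `Domain` tuple, `check_isqrt_argument` and `checked_isqrt` are hypothetical names for illustration; `math.isqrt` is the real standard-library function:

```python
import math
from typing import Optional, Tuple

# A domain is (lo, hi); None means "unbounded or unknown on that side".
Domain = Tuple[Optional[int], Optional[int]]


def check_isqrt_argument(domain: Domain) -> None:
    """Reject any argument domain that is not provably within [0..]."""
    lo, _hi = domain
    if lo is None or lo < 0:
        raise TypeError(f"isqrt requires a domain within [0..], got {domain}")


def checked_isqrt(value: int, domain: Domain) -> int:
    """Run the domain check first, then compute the integer square root."""
    check_isqrt_argument(domain)
    return math.isqrt(value)
```

A domain of (0, 49) is accepted, while (-5, 5) and the unknown domain (None, None) are both rejected, which is the behaviour the paragraph describes as a compiletime error.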

Tracking properties of collections and maps

The really powerful things that Flow will track are the properties of collections (co-domains, i.e. the specific collections that are produced by morphisms) and maps (morphisms).  This is where a lot of the magic will happen in Flow.

Some of the properties that will be tracked for collections include: orderedness; countability/iterability; sparseness; finiteness; whether or not duplicate objects are allowed.

Some of the properties that will be tracked for maps include: surjectivity; injectivity; bijectivity.
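
For a finite map these properties are easy to state concretely. The following is a small sketch using Python dictionaries; the function names and the `target` parameter (the set of values the map is supposed to cover) are assumptions made for the example:

```python
from typing import Dict, Hashable, Set


def is_injective(m: Dict[Hashable, Hashable]) -> bool:
    """True if no two keys share a value."""
    values = list(m.values())
    return len(set(values)) == len(values)


def is_surjective(m: Dict[Hashable, Hashable], target: Set[Hashable]) -> bool:
    """True if every element of the target set is hit by some key."""
    return target <= set(m.values())


def is_bijective(m: Dict[Hashable, Hashable], target: Set[Hashable]) -> bool:
    """True if the map is both injective and surjective onto target."""
    return is_injective(m) and is_surjective(m, target)


array_like = {0: "a", 1: "b", 2: "c"}   # distinct values: injective
collapsing = {0: "a", 1: "a", 2: "b"}   # two keys share "a": not injective
```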

A couple of examples of traditional data representations that map to these:

(1) An array is a map from an ordered, dense, finite, iterable integer domain (the array indices) to a set of unordered values (the set of values in the array). (It is possible to iterate through the keys and pull out the values in array index order, but only if needed; if you don't need the values in a specific order, you can just consider the values in unordered set form).

(2) A sparse named array is a map from an ordered or unordered, sparse, finite, possibly iterable domain to some set of values.

(3) A map or morphism type could be created to read in lines from a text file, and the domain would be line numbers and the range would be lines of text.

Orderedness of domains

In most languages, sorting a collection is a manually-initiated operation: you start with an ArrayList<String>, for example, and you call Collections.sort() if you want that collection sorted.  There are also usually special collections that keep their elements sorted: for example, priority queues and TreeMaps.

In Flow, orderedness is a first-class attribute of a collection. Sorting is almost never something you "do"; it's simply a property of a domain or collection. And wherever possible, if orderedness is not required, collections are left unordered.  Collections are treated as unordered by default.

The reason for this is that there are a few basic algorithmic patterns that you see arising again and again: maps, folds and filters to name a few. Map operations are elementwise so don't impose any ordering constraints.  Folds only impose ordering constraints if the function being applied is not commutative (and they only impose "divide-and-conquer"-type grouping constraints if it's a parallel fold and the function being applied is not associative).  And, interestingly, filter operations only require their input to be ordered if something that depends on their output requires that the output is ordered relative to the input -- so the need for orderedness propagates lazily back through filter operations.  (The need for orderedness also propagates lazily back through map operations, if the output of the map is (key,value) pairs and not just values.)
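
The distinction between commutativity and associativity in folds can be seen with three ordinary operators. This is just a plain Python illustration using `functools.reduce`; the variable names are arbitrary:

```python
from functools import reduce
import operator

numbers = [3, 1, 4, 1, 5]
words = ["a", "b", "c"]

# Integer addition is commutative and associative:
# any element order gives the same fold result.
sum_forward = reduce(operator.add, numbers)
sum_reversed = reduce(operator.add, reversed(numbers))

# String concatenation is associative but not commutative:
# regrouping is safe, reordering is not.
concat_forward = reduce(operator.add, words)
concat_reversed = reduce(operator.add, reversed(words))
concat_regrouped = words[0] + (words[1] + words[2])

# Subtraction is not associative: even the grouping is constrained.
left_grouped = (8 - 4) - 2
right_grouped = 8 - (4 - 2)
```

Addition tolerates any order, concatenation needs the order kept but can be regrouped, and subtraction allows neither.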

Orderedness of domains and implicit parallelization

In fact, most sorting incurs overhead of some sort, but, worse than that, saying that you want a sorted collection when you never apply a non-commutative operator to it unduly constrains the compiler's ability to find good parallelizations of your code.  So Flow will find the places that really need sorted data, and back-propagate the requirement as far as necessary up the data dependency graph to find other things that need to be ordered to produce output in the right order.

For example, if you are writing data out to a file, the method for writing lines of data will require an ordered collection, because the act of serializing elements of a collection is a non-commutative operation.  If you try to pass in an unordered collection of lines to be written to a file, you will get a compiletime error. This is how the compiler statically guarantees that operations like this are threadsafe. (Note however that it is possible for an operation to be non-commutative but still associative, for example string concatenation. In this case, let's say you have 80 strings to concatenate into one string, and you have 8 threads. The compiler knows it has to keep all the strings in order relative to each other, so it requires the input to be ordered, but it doesn't have to run the concatenation in series: it can split the list of 80 strings into 8 pieces containing 10 strings each, one per thread, and combine the results at the end to produce a single string. The compiler can transparently generate skip-lists or similar data structures under the hood to allow this sort of work splitting to be performed quickly.)
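
The 80-strings, 8-threads case can be sketched by hand with a thread pool. This is only meant to show the shape of the split (contiguous chunks, joined independently, recombined in chunk order), not what a Flow compiler would generate; `parallel_concat` is a made-up name, and the sketch shows the structure rather than a real speedup:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def parallel_concat(strings: List[str], workers: int) -> str:
    """Concatenate in order by joining contiguous chunks independently."""
    chunk_size = max(1, -(-len(strings) // workers))  # ceiling division
    chunks = [strings[i:i + chunk_size]
              for i in range(0, len(strings), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the chunks
        # come back in the same relative order they were submitted.
        partials = list(pool.map("".join, chunks))
    return "".join(partials)


strings = [f"s{i};" for i in range(80)]
result = parallel_concat(strings, workers=8)
```

Because concatenation is associative, joining the eight partial results gives the same string as a single serial pass over all 80 inputs.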

There could be several ways to specify that a non-commutative operator should first order a given unordered collection in an appropriate way.  One way would be to specify a comparator or "order by" clause, and this would produce a "view" of the collection that is remembered by the compiler, and reused without re-generating if the same ordering is used elsewhere in the program.  Another method would be to specify that the natural ordering of the elements in the collection should be used (e.g. lexicographic order for strings, increasing size for numbers).  Another method would be to have a function that takes a map with naturally-orderable keys and unordered values, and produces a list of values in the corresponding key order (this would be useful for dumping out values in arrays in order, for example) -- but I want to avoid this where possible because it implies launching a sort operation manually. I think the compiler should almost always make the decision as to what should be sorted and when, to minimize sorting wherever possible, again because this not only incurs overhead but it reduces parallelizability.  *Everything* in Flow should be designed with maximum parallelizability as THE driver.

Back to the "->" push operator

Back to "->": The reason given in the previous post for having "->" push values into an unordered collection by default is mostly that it allows maximum degrees of freedom in parallelizing.

Also hopefully the above description begins to answer questions about type inference for push operations.  The type of the target of a push operation is simply a set of elements whose type is the same as the type of the values being pushed into the collection.

Note that you can push into the same collection from multiple places in your source, and the compiler will try to unify the types that are being pushed from different places. If they are not unifiable, you will get a compiletime error.
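
A hand-written approximation of pushing into one unordered target from two places might look like the following. `PushTarget` is a hypothetical class, the lock stands in for whatever concurrent collection a real implementation would use, and nothing here models the type unification step:

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Set


class PushTarget:
    """Unordered collection that accepts concurrent pushes."""

    def __init__(self) -> None:
        self._items: Set[int] = set()
        self._lock = threading.Lock()

    def push(self, value: int) -> None:
        """Add one value; arrival order is not recorded."""
        with self._lock:
            self._items.add(value)

    def snapshot(self) -> frozenset:
        """Return an immutable copy of the current contents."""
        with self._lock:
            return frozenset(self._items)


target = PushTarget()
with ThreadPoolExecutor(max_workers=4) as pool:
    # Two separate "push sites" feed the same target.
    list(pool.map(target.push, range(0, 50)))
    list(pool.map(lambda n: target.push(n * n), range(0, 10)))
```

Both push sites produce integers, so the element type of the target is simply "integer"; the set semantics also mean that values pushed from both sites appear only once.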

As mentioned above, any requirement for orderedness is back-propagated by the compiler from operations that require orderedness because of non-commutativity.  So it is possible that ultimately the push target will end up ordered.  In that case, a sort or "shuffle" phase (to use Google's MapReduce terminology) will automatically be inserted by the compiler, or the collection will be turned into an ordered collection like a TreeSet, depending on what is most efficient.
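
One way to picture the back-propagation is a collection that stays unordered until some consumer actually asks for order. The sketch below does this at runtime, whereas the text describes a compiletime decision; `LazyCollection`, `count_items` and `write_lines` are invented for the example, and natural (lexicographic) ordering is assumed:

```python
from typing import Iterable, List, Set


class LazyCollection:
    """Holds items unordered; sorts only if a consumer asks for order."""

    def __init__(self, items: Iterable[str]) -> None:
        self._items: Set[str] = set(items)
        self.sort_count = 0  # how many times a sort was actually performed

    def unordered(self) -> Set[str]:
        """View for consumers that do not care about order."""
        return set(self._items)

    def ordered(self) -> List[str]:
        """View for non-commutative consumers; this is where the sort lands."""
        self.sort_count += 1
        return sorted(self._items)


def count_items(c: LazyCollection) -> int:
    """Order-insensitive consumer: never triggers a sort."""
    return len(c.unordered())


def write_lines(c: LazyCollection) -> str:
    """Order-sensitive consumer (serialization): requires the ordered view."""
    return "\n".join(c.ordered())


lines = LazyCollection(["cherry", "apple", "banana"])
n = count_items(lines)
sorts_after_count = lines.sort_count
text = write_lines(lines)
```

Counting never triggers a sort; only the serializing consumer does, which is the point at which an ordering step would be inserted.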
