Google Summer of Code 2009

Google will allocate funded student positions as outlined here:

http://socghop.appspot.com/document/show/program/google/gsoc2009/studentallocations

Meaning that it is in our interest to get as many student applications as possible. We don't know how many will be approved by google.

Please post your project ideas here, where interested students will be most likely to stumble on them.

Boost.Serialization

Here are a few small projects which would be really helpful.

Performance Testing and Profiling

I've managed to setup performance profiling using the following:

  • current (as I write this) Boost.Build tools.
  • the gcc compiler.
  • and a shell script - profile.sh
  • library_status program from the tools/regression/src directory

Invoking profile script produces a table which shows the results of each test and links to the actual profile.

The first thing I did was include some of the serialization library tests. It became immediately apparent that these tests were totally unsuitable for performance testing and that new tests needed to be written for this purpose. These tests would highlight the location of any performance bottlenecks in the serialization library. Whenever I've subjected my code in the past to this type of analysis, I've always been suprised to find bottlenecks in totally unanticipated places and fixing those has always lead to large improvements in performance. I expect that this project would have a huge impact on the utility of the serialization library.

Back Versioning

It has been suggested that a useful feature of the library would be the ability to create "older versions" of archives. Currently, the library permits one make programs that are guarenteed the ability to load archives with classes of a previous version. But there is not way to save classes in accordance with a previous version. At first I dismissed this a a huge project with small demand. A cursory examination of the code revealed that this would not be very difficult. It would require some small changes in code and some additional tests. Also it would require special treatment in the documentation - perhaps a case study.

Environments without RTTI

I note that some have commented that this library requires RTTI. This is not strictly true. The examples and almost all the tests presume the existence of RTTI. But it should be possible to use the library without it. The example used for testing is an extended_typeinfo implemenation which presumes that all classes names have been exported. So, to make this library compatible for platforms without RTTI, a set of tests, examples and new manual section would have to be created

Portable Archives (straszheim)

A method for testing portability was suggested (save and load to a portable binary archive, and verify that the checksum of the binary archive matches some checksum. This would involve development of the archive and tests.

Export name aliasing (straszheim)

If you change the name under which a class is BOOST_CLASS_EXPORT'ed you break backwards compatibility... you can't read older archives. In addition there is no way to have more than one class exported under the same name.

Boost.Python

  • Python 3.0 support
  • Ability to extend the fundamental PyTypeObject? used by boost.python
  • Thread safety
  • PyFinalize? support
  • Easier methods to write to_python/from_python converters

Boost.Proto

support for two-level (van Wijngaarden) grammars (niebler)

Proto is essentially a compiler construction toolkit for DSELs. It allows you to define the grammar for the DSEL, but currently has no native support for DSEL type systems. Support for two-level grammars would fill that hole.

The job would need a student who has experience with type theory and a solid grasp of C++ template metaprogramming. The can read more about the problem in this thread:

http://groups.google.com/group/boost-list/browse_frm/thread/df6ecfb0089b28fd

Serialization/Frames? (viboes)

Library based on an extension of the Archive concept making it bidirectional. From wikipedia "A frame is a data packet of fixed or variable length which has been encoded by a data link layer communications protocol for digital transmission over a node-to-node link. Each frame consists of a header frame synchronization and perhaps bit synchronization, payload (useful information, or a packet at higher protocol layer) and trailer. Examples are Ethernet frames and Point-to-point protocol (PPP) frames."

The Boost.serialization saving archive concept allows to save serializable data at the end of the archive, and the loading archive concept allows to read serializable data from the beginning of the archive. The saving frame concept will allows to save serializable data either at the end or the begin of the frame, and the loading frame concept allows to read serializable data from the beginning or the end of the archive. I'm not sure which syntax will be the more appropriated. The serialization library use the <<, >>, and & operators that are associative from left to right. We need the equivalent operators from right to left. <<=, >>=, and &= seams to be the more natural candidates but I don't know if it is correct in C++ to define this operators with this prototype

template <typename T> frame& operator<<=(const T&, frame&);
template <typename T> frame& operator>>=(const T&, frame&);
template <typename T> frame& operator&=(const T&, frame&);
h1 >>= h2 >>= sf << t2 << t1

is equivalent to

(h1 >>= (h2 >>= ((sf << t2) << t1)))

and should be equivalent to

sa & h1 & h2 & t2 & t1

if sf and sa were empty.

The main difference is that we can do it hierarchically. The top layer will create a frame, and serialize its own information elements.

frame sf;
h_n >>= sf  << p_n << t_n;

Then this top layer will use a primitive of the lower level having as parameter the frame as payload.

primitive_n-1(sf);

A primitive at the k level will add its own header and trailer information element to the payload of the upper level

void primitive_k(frame& sf) {
    // ...
    h_k >>= sf  << t_k;
    // ...
    another_primitive_k_1(sf);
    // ...
}

So the frame allows to serialize top-down. To get the same result with the archive concept the serialization must be done bottom-up, needing to chain each one of the information element in a list and only when we have all off them we can start the serialization. I think that the frame approach should be more efficient because it avoid to store in dynamic memory the information elements to be serialized, instead they are serialized directly.

Loading a frame works as loading an archive, except that we can load also at the end of the frame. This avoids to load the complete archive to load the trailer of the lower levels.

lf >>= h_1;
// ...
t_1 << lf;

In addition, it would be great to have saving/loading frames (I'm not sure but I think that an archive can not be saving and loading at the same time). The same frame can be used as loading, analyzing only some lower levels, and as saving in order to construct other lower levels. This will be very useful for gateways and routers.

| L4 |<------------------------------>| L4 |
| L3 |    |     ADAPTATION     |      | L3 |
| L2 |<-->| L2 |          | X2 | <--> | X2 |
| L1 |<-->| L1 |          | X1 | <--> | X1 |

It would be great also that the same data that can be serialized on archives, could be serializable on a frame using the same save function. But this works only when we save at the end of the frame. Let me see this using a little example: Class C has 3 information elements to serialize (a, b and c). So the save functions could be something like

template <typename ARCHIVE>
void C::save(ARCHIVE& sa) {
    sa & a & b & c;
}

This save function works well from left to right, but can not be used when saving at the beginning of a frame, because the expected result when saving at the beginning is

a b c sa

but the result will be

c b a sa

So unfortunately we need a different save function

template <typename FRAME>
void C::save(FRAME& sa, begin_tag&) {
    a >>=  b >>= c >>= sa;
    // a >>=  (b >>= (c >>= sa));
}

Boost.Accumulators

Dependents accumulators (viboes)

The accumulator library allows to determine dependency between accumulator, but not between accumulator_sets. I would like to define an accumulator_set c so when we cumulate in two others c1 and c2 accumulator_sets we cumulate also in c, some thing like:

typedef dependable_accumulator_set <double, ...> dependable_acc_type;

dependable_acc_type c1, c2;
dependable_acc_type c=c1+c2

dependable_acc_type c3;

c+=c3;

c1(1);
c2(1);
c3(2);
assert(count(c)==3);
assert(sum(c)==4);

How dependable_accumulator_set can be defined? Here follows the interfaces of such a class and the pseudo code, I've named the class dependable_accumulator_set, sorry but I have not found a shorter and better name (may be observed/listened?)

template <typename T, typename F, typename W>
class dependable_accumulator_set : public acumulator_set<T,F,W>
{
public:
    dependable_accumulator_set();
    void add_dependent(dependable_acumulator_set&);
    void remove_dependent(dependable_acumulator_set&);

    template<typename A1>
    void operator ()(A1 const &a1) {
        for (acc in dependents) {
            acc(a1);
        }
        this->accumulator_set_type::operator()(a1);
    }
    dependable_accumulator_set<T,F,W>& operator+=(dependable_accumulator_set<T,F,W>);
    dependable_accumulator_set<T,F,W>& operator-=(dependable_accumulator_set<T,F,W>);
};

template <typename T, typename F, typename W>
dependable_accumulator_set<T,F,W> 
operator+()(dependable_accumulator_set<T,F,W>,dependable_accumulator_set<T,F,W>);

Another variant could be also to have a temporary accumulator that cyclically push its current value on another accumulator.

It will also interesting to have a dependable and cyclic accumulator set. It would be great to integrate these features on a unique class (accumulator_set?) in a clean way.

template <typename T, typename F, typename W, 
          typename DependablePolicy, 
          typename CyclicPolicy>
class accumulator_set;

Dependent variables (viboes)

Manage with the underlying concepts of a Spreadsheet but without displaying them. Defines cells that can be organized in a classical grid consisting of rows and columns, or any other structure as for example a tree. Each cell could contains alphanumeric text or numeric values but can also contain any other value. A spreadsheet cell may alternatively contain a formula that defines how the contents of that cell is to be calculated from the contents of any other cell (or combination of cells) each time any cell is updated.

It seems natural to use Boost.Proto to define the DSL for the cell formulas.

space sh;
space::cell<int> c1(sh), c2(sh), c3(sh);
c1.expr=1;
c2.expr=2;
c3.expr=c1+c2;
sh.refresh();
std::cout<< c1.value() << ", " << c2.value() << ", " << c3.value() << std::endl;
sheet sh(2,3);
sheet::cell<int> c01(sh,0,1), c02(sh,0,2), c03(sh,0,3);
c01.expr=1;
c02.expr=2;
c03.expr=rel_cell(c03,0,1)+rel_cell(c03,0,2);
sh.refresh();
std::cout<< c01.value() << ", " << c02.value() << ", " << c03.value() << std::endl;
c11.expr=5;
c12.expr=8;
c13.expr=c03.copy(); 
sh.refresh();
std::cout<< c11 << ", " << c12 << ", " << c13 << std::endl;

Boost.Devector

The purpose of this project would be to write a super-efficient alternative to std::deque, named boost::devector. As the name suggest, this would be something of a mixture between std::deque and std::vector, bringing the best of them both together.

One of the prime uses of std::deque is with intrusive containers, whether are home-made or based on Boost.Intrusive (http://www.boost.org/doc/libs/1_38_0/doc/html/intrusive.html). The property that is so imporant here is reference stability (when the container grows, references are not invalidated).

The exsting std::deque has a number of short-comings that should be solved for this container:

  • slow iteration (see e.g. http://lafstern.org/matt/segmented.pdf)
  • inability to specify the size of the chunk and/or change them at runtime
  • inability to reserve chunks of memory in advance (at both ends)
  • overencapsulated for certain low-level operations.

Writing such a container is non-trivial, but you will learn to master the following important topics

Futhermore, you will learn to write canonical exception-safe code which is something that you can make use of no matter what language you will be using in the future. You will also learn to write rigorous tests. And you will learn how to write precise and clear documentation at a level suitable for an expert. All of these abilities should be very useful for you in your future career.

If there is more time, we can look at additional advanced data-structures.

-Thorsten

Boost Graph Library

In addition to the graph library that is currently in Boost, Indiana University has also created parallel extensions to BGL (the Parallel Boost Graph Library) for distributed-memory computers. Several of the projects below would also be worthwhile to integrate into the parallel framework.

Hierarchical graph layout algorithms

The Boost Graph Library (BGL) currently only implements a small number of graph layout algorithms, primarily spring-based layout algorithms. This project would involve implementing other layout algorithms, especially hierarchical algorithms such as that used in the dot program of Graphviz (http://www.graphviz.org/).

Bindings to Python or other languages

BGL has some bindings to Python and R, but they do not exist for all algorithms. This project would be to either extend the Python or R bindings, or write a set of BGL bindings to another language.

Semantic graphs

A semantic graph is a graph that can have multiple distinct types of vertices and edges, with restrictions on connections between vertices and edges based on their types. There may also be more efficient representations of the graph based on storing different vertex and edge types separately. This project would be to define appropriate concepts, algorithms, and data structures for semantic graphs in BGL.

Graph partitioning

This project would involve implementing graph partitioning algorithms, such as those used in METIS (http://www.cs.umn.edu/~metis), using BGL. Integrating this support into the Parallel BGL would also be useful.

Computer vision algorithms

This project would be to implement computer vision algorithms based on max-flow problems on graphs using BGL. Implementing a specialized graph type for grids would also be needed in order to avoid storing the graph topology explicitly.

Network Workbench integration

The Network Workbench (http://nwb.slis.indiana.edu/) is a GUI-based framework for analysis of large-scale networks. This project would involve implementing components for NWB based on BGL and/or Parallel BGL.

Incidence Matrices

An incidence matrix (http://en.wikipedia.org/wiki/Incidence_matrix) is a graph data structure that relates vertices to the set of edges that connect them. This project would entail the implementation of a generic incidence matrix data structure that fits into the BGL generic graph interface. This project can include the development and definition of concepts related to graphs represented by incidence matrices (e.g., signed and bidirected graphs). Incidence matrices can also be adapted to implement hypergraphs.

Hypergraphs

A hypergraph (http://en.wikipedia.org/wiki/Hypergraph) is a generalization of graphs where an edge can connect any number of vertices (instead of exactly two). Hypergraphs can also be directed or undirected, and have their own algorithms (e.g., traversal and partitioning). This project minimally involves the definition of one hypergraph data structure and a generic traversal algorithm. This project would also involve the identification and definition of generic concepts related to hypergraphs.

Note: This is a non-trivial project that requires substantial knowledge of graph theory, algorithms, and generic programming.

Implicit Graphs

Design and develop a framework for adapting functions and function objects into graphs. For example, a binary predicate function can be made to behave like an adjacency matrix, returning an "edge" if the predicate is true. This project will require a substantial amount of metaprogramming to adapt function concepts (e.g., domain, range, etc.), onto graph concepts (vertices, edges, etc.).

Closures and Reductions

Implement a suite of generic graph algorithms that computer closures and reductions on graphs. The BGL already contains a transitive_closure algorithm, but not (yet) a transitive_reduction. Other kinds of computable closures and reductions include reflexive, symmetric, and congruence. This project can also include predicates to test whether or not a graph exhibits these properties (e.g., is_symmetric).

Lattices

Define and implement a generic lattice abstraction over a graph or relation. A lattice represents relations between objects in a partially ordered set (poset). Lattices are frequently used (outside of abstract mathematics) to descibe and reason about the relations of discrete objects. One such example is ontologies. Implement algorithms over the lattice that return its minima and maxima, and compute the meet and join of elements in the lattice.

Earlier Boost SOC Project Proposals

You might want to browse the project proposals from earlier years. They can be found here:

Last modified 8 years ago Last modified on Mar 23, 2009, 7:53:15 PM