Monday, December 1, 2008

Update on the Kompimi C Data Structure Library: One of the most useful common collections is the associative array. Today I completed work on two of the three data structures planned to support this abstraction: the hash table and the "dynamic array set", which is simply a dynamic array that uses linear search for lookup, intended for very small collections and as a performance baseline.
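
For reference, a dynamic array set amounts to little more than the sketch below; the names and layout here are illustrative only, not the library's actual interface:

    /* Illustrative sketch only; the Kompimi API and struct layout differ. */
    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const char *key;
        void       *value;
    } das_pair;

    typedef struct {
        das_pair *items;  /* growable backing array */
        size_t    count;
    } dynamic_array_set;

    /* Linear-search lookup: O(n), but with no hashing overhead and good
     * locality, so it can compete for a handful of elements. */
    void *das_lookup(const dynamic_array_set *s, const char *key)
    {
        size_t i;
        for (i = 0; i < s->count; i++) {
            if (strcmp(s->items[i].key, key) == 0)
                return s->items[i].value;
        }
        return NULL;
    }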

Performance results are summarized below. The first test measures insertion into a hash table preallocated to be large enough to hold all elements, the second measures lookup in a full hash table, and the third measures insertion while growing the hash table from its smallest size. The results are relatively unsurprising: the naive dynamic array set is almost never worthwhile, even on very small lists, except when growing, since it does not need to redo insertions as it grows. Direct chaining performs particularly poorly in the growing test, both because it has to free the memory for the old chains and allocate new ones whenever the table is resized (coalesced hashing may do better here), and because resizing an open addressing hash table by a factor of 2 has good cache behavior (each old bucket maps to only 2 new buckets). Time per operation is roughly constant for a while, but increases once the table becomes too large to fit in the L2 cache. Quadratic probing is probably the best choice across the board.
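
For concreteness, quadratic probing just perturbs the home bucket by a growing offset on each collision. Here is a minimal sketch of one common probe sequence for a power-of-two table; the names are hypothetical and this is not the library's code:

    #include <stddef.h>

    /* Triangular-number probing (offsets 1, 3, 6, 10, ...) visits every
     * bucket exactly once when table_size is a power of two, unlike the
     * naive i*i sequence, which can miss buckets. */
    size_t quadratic_probe(size_t hash, size_t attempt, size_t table_size)
    {
        return (hash + attempt * (attempt + 1) / 2) & (table_size - 1);
    }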

The slowdown for growing versus insertion into a preallocated array ranges from 1.5x to 3x, and is worst for small lists - this is why the dynamic array set dominates here. There might be an interesting way to speed up resizing.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Tuesday, November 25, 2008

Kompimi.org website and tool philosophy

In some earlier incarnations of the Kompimi idea, I aimed to make the entire development and publishing process public domain, as a matter of principle. That approach was overambitious and doomed to failure - I need to take advantage of mature, well-maintained services like Google Blogger, SourceForge, and Freshmeat to stay productive and compete on equal footing with licensed solutions.

Today I launched the Kompimi.org website, which does not provide services but merely serves as a simple index, information source, and project list for Kompimi. It's a bit lacking in prettiness at the moment, but it is functional, and it will act as a hub tying together all the projects and resources, like this blog. Over time it will be expanded as necessary, but it will remain an index rather than a home for projects.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Monday, November 24, 2008

Orphaned works and unexploited copyright

While I work on my next additions to the C Data Structure Library, today I'm going to talk about a more philosophical problem: resource underutilization due to wasted exclusive rights.

Although each country in the world has its own laws and rules, nearly every country today adheres to the Berne Convention, an international treaty that sets uniform policy governing copyright (with some exceptions in Africa, Southeast Asia, and the Middle East). In the United States, the most notable change introduced by the Berne Convention was that works no longer need to be registered to receive copyright protection; as soon as they are written or stored in some fixed form, exclusive rights are instantly and automatically granted to the author. This is quite convenient for some works - for example, a thief couldn't legally steal a book author's rough drafts and publish them. But it can get pretty strange: even if the author is a 4-year-old child who knows nothing about copyright at all, they still get exclusive rights.

A serious practical problem arises if you find an abandoned work lying around, and want to create a derived work based on it; chances are, some author somewhere still has exclusive rights to it. If you don't contact the author to secure rights, and they later find out about your derived work (say because you make a lot of money off it), they can sue you. But if the author has abandoned it and vanished, it may be very difficult to determine who the author is, much less contact them. These works are called orphaned works.

Another serious practical problem arises with unexploited works: copyright is intended to protect the creative work of an author so that they can exploit it to make a living. But under the Berne convention, an author - perhaps one with no knowledge of copyright at all - can sit on top of a huge pile of work, most of which they have no intention of exploiting.

The cost here is to society at large: these abandoned and unexploited works aren't used and benefit no one at all, when they could be contributing value directly through use and indirectly through derived works. And this isn't just a problem with books and paintings; how many abandoned software projects languish in the vaults of corporations, never to see the light of day again? How many abandoned open-source projects are up on SourceForge under old licenses that are incompatible with modern ones? That could be code that you're rewriting right now.

The United States has always emphasized growing the public domain; all works of the federal government, for example, are automatically in the public domain. When U.S. copyright was established in 1790, registered works gained copyright protection for 14 years, and the term could be renewed for another 14 years. Besides the much more limited term, this meant that authors only registered works that they considered to have potential value. It did mean that people who didn't know to register their work could have it stolen with no legal recourse, but overall it struck an effective balance between allowing authors to make money and making old and abandoned works available to society for reuse.

One obvious question to ask is: do people really create derived works, and what is their value? In 1790 this may not have been obvious, but one look at the Internet, where copyright infringement is rampant and derived works are ubiquitous and frequently highly creative, proves their value. Indeed, these works could be said to be at the center of Internet culture. Usenet, mailing list, and forum posts are routinely copied, archived, and republished without permission from their participants, and securing that permission would be prohibitively costly. Photoshopped images, YouTube montages and fansubs, YTMNDs, musical reinterpretations and parodies, much of the artwork on deviantART, and many popular Flash videos are derivatives of copyrighted works. Derived works also include code: although few software developers dare to blatantly infringe copyright, forks maintained by different developers - the X.Org fork of XFree86, the XEmacs fork of GNU Emacs, the OpenSSH fork of SSH, and forks of Linux and BSD distributions (including Mac OS X) - are ubiquitous. Enabling derived works like these is part of the mission of open source software.

In 2003, a bill called the Public Domain Enhancement Act (PDEA) was proposed but never put to a vote. Without changing the term of copyright, it would have reinstituted copyright registration in the U.S. for domestic works at a dramatically reduced fee of $1 per work. This helps solve the problem of orphaned works: if the author registers a work, they must provide contact information that others can use to reach them and license it. If a work is abandoned or unexploited, the author will most likely not invest the effort to register it, and it will enter the public domain and become available for reuse.

A large part of getting an act like the PDEA right is in the implementation. I imagine a government website where you can fill out a web form to register a work. You create an account associated with your name and contact information. Each time you want to register a work, you go to a page where you supply the work's title and description, and an electronic copy of the work, if available. Payment would work much like an MP3 purchasing site: you enter several works, and the cost is summed up and billed to your credit card at the end (to keep credit card transaction fees down). In return you get a permanent URL that you can hand out to anyone, proving that you own the work and giving the next renewal date. If $1 per work seems like too much for prolific authors, the system could automatically offer a "volume discount," reducing the per-work fee as the number of registered works grows. For companies that routinely register many works, a web service interface could automate the procedure; this could also be incorporated as a feature into popular software like blogs, wikis, photo galleries, and version control systems. The fees would be more than sufficient to cover the bandwidth, hardware, and storage needed to run the service.

What about the case where an author's unregistered drafts are stolen and registered by someone else? The key is to make the registering of unpublished works that you did not produce a punishable offense that also invalidates the claim of copyright. This gives authors the opportunity to develop a work privately and only register it immediately before publication.

What about authors who don't have access to the Internet? Most tax filing is electronic today, but the government retains paper forms for those who are unable or choose not to file electronically; copyright registration would be no different.

In short, I believe this system is practical to implement and would be effective in growing the public domain with many works that have no value to their authors, but might be quite valuable to someone else. All we need is the legal framework to make it reality.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Friday, November 21, 2008

Preliminary release of the C Data Structure Library

Now that I've got two data structures down solid, I went ahead and worked out the release process. The 0.1-alpha release of the C Data Structure Library is now available on the SourceForge website. A separate package was uploaded with the Doxygen docs (for those who don't want to build them themselves). Please download it and try it out!

To prepare for release I got the code compiling as C89, C99, and C++, and got it building under Windows in Visual Studio 2005 and 2008. Interestingly, the performance comparison of the unrolled list versus a plain doubly-linked list is quite one-sided on Windows: the doubly-linked list is much slower, for some reason (perhaps a more expensive allocator). It's a little tricky asserting that the .vcproj/.sln files are public domain, since they're partly boilerplate generated by the tools, but I stripped out any content that didn't serve a functional purpose.

What next? Lots to do. Probably the most painfully needed addition is some kind of string type. Strings are pretty straightforward until I start worrying about encoding - at the very least UTF-8 and UTF-16 support would be handy. Beyond that are associative array data structures, like hash tables, some kind of cache-friendly self-balancing search tree, and a simple association list for small dictionaries. What about after that? Well, I look to other libraries for inspiration.
  • Bit arrays.
  • A deque built on a dynamic array used as a circular buffer (see the sketch after this list).
  • Priority queues (a buffer tree or funnel heap is a good candidate for cache efficiency).
  • Stacks and queues, simpler interfaces on top of existing structures.
  • Sets supporting union and intersection.
  • Ropes.
  • Disjoint-set data structures.
  • Maybe collections for parallel programming? This could be hard to do in a platform-independent way.
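
On the deque bullet above: the trick is just modular indexing into the backing array, so both ends can grow and shrink in amortized O(1). A minimal sketch of the indexing, with hypothetical names rather than a committed design:

    #include <stddef.h>

    typedef struct {
        void  **slots;     /* backing dynamic array */
        size_t  capacity;  /* kept as a power of two */
        size_t  head;      /* index of the first element */
        size_t  count;
    } deque;

    /* Element i (counting from the front) lives at (head + i) mod capacity,
     * so pushing or popping at either end never shifts existing elements. */
    void *deque_get(const deque *d, size_t i)
    {
        return d->slots[(d->head + i) & (d->capacity - 1)];
    }
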
All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Thursday, November 20, 2008

C Data Structure Library: list is done

Today I finished my list implementation for the C Data Structure Library, based on unrolled doubly-linked lists with external storage. As a relatively complex data structure, it includes deep invariant checking in debug mode and has a good set of unit tests to stress it. I built a release version for performance testing and compared it against a simple doubly-linked list doing similar operations. Preliminary results seem to indicate the complexity was worthwhile:
  • An insert at the end requires only 29 ns on average, as compared to 110 ns for a standard doubly-linked list.
  • An iteration over a million-element list requires only 3.7 ms, as compared to 9.5 ms for a standard doubly-linked list.
It's a good thing I did the perf test though - a few critical functions had to be converted into macros to get this performance (before, it was slower than doubly-linked lists with similarly inlined calls).

After this I varied ELEMENTS_PER_LIST_NODE a bit to examine how it affects the performance of these simple tests. The default value is 25. At the minimum, 2, it required about 75 ns per insert and 7 ms per iteration. Around 18, iteration time drops to 2.9 ms. For particularly large values, like 128, iteration drops to 2.2 ms and inserts drop to 24.4 ns, but other benchmarks like insert_after would probably show a loss at these sizes.
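
For context, each node in an unrolled list packs up to ELEMENTS_PER_LIST_NODE entries, which is where the cache friendliness comes from. A rough sketch of what such a node can look like (the library's real layout and field names may differ):

    #include <stddef.h>

    #define ELEMENTS_PER_LIST_NODE 25  /* the library's default */

    /* "External storage" means the node holds pointers to elements rather
     * than storing the elements themselves inline. */
    typedef struct list_node {
        struct list_node *prev;
        struct list_node *next;
        size_t            count;  /* slots in use, <= ELEMENTS_PER_LIST_NODE */
        void             *elements[ELEMENTS_PER_LIST_NODE];
    } list_node;

Larger nodes mean fewer pointer hops and fewer allocations per element during iteration and appends, at the cost of more copying within a node for insertions in the middle - which is consistent with the numbers above.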

Update: Improvements to the perf test framework eliminated some of the overhead of the framework itself, which makes a difference for the very short insertion test. Currently inserts are down to 23.5 ns, compared to 98 ns for the standard doubly-linked list (a 4.1x speedup).

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Tuesday, November 18, 2008

Back and working on the C Data Structure Library

After a long hiatus I'm getting back in the game and looking to finish up the Kompimi C Data Structure Library, the first Kompimi public domain component. C is a language that has traditionally suffered from a reinvent-the-wheel problem due to its dearth of basic data structures, and this module aims to supply some modern, well thought-out, fully tested data structures for this purpose.

Recent changes include:
  • Modifying the build to deploy all derived files into a single directory. It's a good general policy to separate derived files from source files, so that they can be easily cleaned up and so new source files can be easily identified.
  • Documenting the interfaces for each module using Doxygen doc comments (see the example after this list). Although Doxygen is not a public domain tool, it is optional and installed separately. These docs will be deployed on Kompimi's website as well, when kompimi.org goes up.
  • I'm working on a new list module, based on unrolled doubly-linked lists with external storage. This will provide a great generic solution for programs that do a lot of insertion and deletion.
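
For anyone unfamiliar with Doxygen, the doc comments look roughly like this; the function shown here is hypothetical, not part of the library's actual interface:

    struct list;  /* opaque handle, purely illustrative */

    /**
     * @brief  Appends a value to the end of the list.
     *
     * @param  lst    List to append to.
     * @param  value  Pointer to the element to append.
     * @return Nonzero on success, zero if memory allocation fails.
     */
    int list_append(struct list *lst, void *value);
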
More to come!

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Wednesday, July 23, 2008

The gridlock economy and software licensing

Michael Heller wrote a book entitled The Gridlock Economy: How Too Much Ownership Wrecks Markets, Stops Innovation and Costs Lives. Gridlock is a general economic concept describing what happens when there are so many owners of units of private property that it becomes infeasible to negotiate with all the owners needed to assemble those units into a new work. In one example, a drug company was unable to bring a drug to market because a long list of patents had to be secured before required testing could even begin, and the cost of doing so was prohibitive. Gridlock leads to resource underutilization and decreased innovation.

The original BSD license is another example of gridlock: the license included a seemingly innocuous term requiring any advertisement for a product incorporating the code to credit the original authors. This sounds harmless enough until you assemble a multitude of BSD-licensed modules and realize that there's no longer room on your advertisements for anything else. The result was that companies avoided using any BSD-licensed code. This was repaired in the modified BSD license used today.

Software today is in a similar gridlock: there is an immense body of code out in the world, but it is impossible to legally combine code released under incompatible licenses, and even where licenses are compatible, it rapidly becomes infeasible to keep up with an increasingly long and complex list of requirements. When you build a product out of two or three large open-source systems, this isn't a problem; when you build it out of 200 small open-source modules, each by a separate author, you end up deciding you really ought to have written it all yourself. This pushes more and more code into the proprietary space - code that has been written many times before and should only have needed to be written once.

Public domain open-source movements like Kompimi aim to reverse this trend by pushing more license-free code into the ecosystem. With no ownership, nothing limits free assembly and synergy of an unlimited number of modules by many authors, leading to more sophisticated software - and more reliable software, since the effort spent on generating many versions of a module can be concentrated on creating one high-quality module and getting it right. This leads in the end to more wealth for everyone.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.