Tuesday, November 25, 2008

Kompimi.org website and tool philosophy

In some earlier incarnations of the Kompimi idea, I aimed to make the entire development and publishing process public domain, as a matter of principle. That approach was overambitious and doomed to failure - I need to take advantage of mature, well-maintained services like Google Blogger, SourceForge, and Freshmeat to stay productive and compete on equal footing with licensed solutions.

Today I launched the Kompimi.org website, which does not provide services but merely serves as a simple index, information source, and project list for Kompimi. It's a bit lacking in prettiness at the moment, but it is functional, and it will act as a hub tying together all the projects and resources, like this blog. Over time it will be expanded as necessary, but it will remain an index rather than a home for projects.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Monday, November 24, 2008

Orphaned works and unexploited copyright

While I work on my next additions to the C Data Structure Library, today I'm going to talk about a more philosophical problem: resource underutilization due to wasted exclusive rights.

Although each country has its own laws and rules, nearly every country in the world today adheres to the Berne Convention, a uniform international policy governing copyright (with some exceptions in Africa, Southeast Asia, and the Middle East). In the United States, the most notable change introduced by the Berne Convention was that works need no longer be registered to receive copyright protection; as soon as they are written or stored in some fixed form, exclusive rights are instantly and automatically granted to the author. This is quite convenient for some works - for example, a thief couldn't legally steal the rough drafts written by the author of a book and publish them. But it can get pretty strange: even if the author is a 4-year-old child who knows nothing about copyright at all, they still get exclusive rights.

A serious practical problem arises if you find an abandoned work lying around, and want to create a derived work based on it; chances are, some author somewhere still has exclusive rights to it. If you don't contact the author to secure rights, and they later find out about your derived work (say because you make a lot of money off it), they can sue you. But if the author has abandoned it and vanished, it may be very difficult to determine who the author is, much less contact them. These works are called orphaned works.

Another serious practical problem arises with unexploited works: copyright is intended to protect the creative work of an author so that they can exploit it to make a living. But under the Berne convention, an author - perhaps one with no knowledge of copyright at all - can sit on top of a huge pile of work, most of which they have no intention of exploiting.

The cost here is to society at large: these abandoned and unexploited works aren't used and benefit no one at all, when they could be contributing value directly through use and indirectly through derived works. And this isn't just a problem with books and paintings; how many abandoned software projects languish in the vaults of corporations, never to see the light of day again? How many abandoned open-source projects are up on SourceForge under old licenses that are incompatible with modern ones? That could be code that you're rewriting right now.

The United States has always been a nation that emphasized growing the public domain; all works of the federal government are automatically in the public domain. When U.S. copyright was established in 1790, works gained copyright protection for 14 years when they were registered, and the term could be renewed for another 14 years. Besides the much more limited term, this meant that authors only registered works that they considered to have potential value. This did mean that sometimes people who didn't know to register their work would have it stolen with no legal recourse, but overall it struck an effective balance between allowing authors to make money and making old and abandoned works available to society for reuse.

One obvious question to ask is: do people really create derived works? What is the value of derived works? In 1790 this may not have been obvious, but one look at the Internet, where copyright infringement is rampant and derived works are ubiquitous and frequently highly creative, proves their value. Indeed, these works could be said to be at the center of Internet culture. Usenet, mailing list, and forum posts are routinely copied, archived, and republished without permission from their participants, an exercise that would be prohibitively costly if permission had to be secured from everyone involved. Photoshopped images, YouTube montages and fansubs, YTMNDs, musical reinterpretations and parodies, much of the artwork on deviantART, and many popular Flash videos are derivatives of copyrighted works. Derived works also include code: although few software developers dare to blatantly infringe copyright, forks maintained by different developers - the X.Org fork of XFree86, the XEmacs fork of GNU Emacs, the OpenSSH fork of SSH, and forks of Linux and BSD distributions (including Mac OS X) - are ubiquitous. Enabling derived works like this is part of the mission of open source software.

In 2003, a bill called the Public Domain Enhancement Act (PDEA) was proposed but never put to a vote. It would have, without changing the term of copyright, reinstituted copyright registration in the U.S. for domestic works at a dramatically reduced fee of $1 per work. This helps solve the problem of orphaned works: if an author registers a work, they must provide contact information through which they can be reached and the work licensed. If a work is abandoned or unexploited, the author will (most likely) not invest effort in registering it, and it will enter the public domain and become available for reuse.

A large part of getting an act like the PDEA right is in the implementation. I imagine a government website where you can go and fill out a web form to register a work. You create an account associated with your name and contact information. Each time you want to register a work, you go to a page where you supply the work's title and description, and an electronic copy of the work, if available. The payment would work similarly to an MP3 purchasing site: you enter several works, and the cost is summed up and billed to your credit card at the end (in order to mitigate point-of-sale credit card fees). In return you get a permanent URL that you can hand out to anyone, proving that you own the work and giving the next renewal date. If you think $1 per work is too much for prolific authors, the system could automatically offer a "volume discount," reducing the fee per work as the number of works grows. For companies that routinely register many works, a web service interface could automate the procedure; this could also be incorporated as a feature into popular software like blogs, wikis, photo galleries, and version control systems. The fees would be more than sufficient to cover the bandwidth, hardware, and storage needed to run this service.

What about the case where an author's unregistered drafts are stolen and registered by someone else? The key is to make the registering of unpublished works that you did not produce a punishable offense that also invalidates the claim of copyright. This gives authors the opportunity to develop a work privately and only register it immediately before publication.

What about authors who don't have access to the Internet? Most tax filing is electronic today, but the government retains paper forms for those who can't or choose not to file electronically; copyright registration would be no different.

In short, I believe this system is practical to implement and would be effective in growing the public domain with many works that have no value to their authors, but might be quite valuable to someone else. All we need is the legal framework to make it reality.

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Friday, November 21, 2008

Preliminary release of the C Data Structure Library

Now that I've got two data structures down solid, I went ahead and worked out the release process. The 0.1-alpha release of the C Data Structure Library is now available on the SourceForge website. A separate package was uploaded with the Doxygen docs (for those who don't want to build them themselves). Please download it and try it out!

To prepare for release I got the code compiling as C89, C99, and C++, and got it building under Windows in Visual Studio 2005 and 2008. Interestingly, the performance comparison of the unrolled list versus a standard doubly-linked list is quite one-sided on Windows: the doubly-linked list is much slower, for some reason (perhaps a more expensive allocator). It's a little tricky asserting that the .vcproj/.sln files are public domain, since they're partly boilerplate generated by the tools, but I stripped out any of the content that didn't serve a functional purpose.

What next? Lots to do. Probably the most painfully needed addition is some kind of string type. Strings are pretty straightforward until I start worrying about encoding - at the very least UTF-8 and UTF-16 support would be handy. Beyond that are associative array data structures, like hash tables, some kind of cache-friendly self-balancing search tree, and a simple association list for small dictionaries. What about after that? Well, I look to other libraries for inspiration.
  • Bit arrays.
  • A deque data structure that uses a dynamic array as a circular buffer (see the sketch after this list).
  • Priority queues (a buffer tree or funnel heap is a good candidate for cache efficiency).
  • Stacks and queues, simpler interfaces on top of existing structures.
  • Sets supporting union and intersection.
  • Ropes.
  • Disjoint-set data structures.
  • Maybe collections for parallel programming? This could be hard to do in a platform-independent way.
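
To make the deque item concrete, here is a minimal sketch of the dynamic-array-as-circular-buffer idea. The names and layout (deque, deque_push_back, and so on) are placeholders of my own, not anything from the codebase; the point is just that wrapping a head index around the array gives O(1) pops at the front and amortized O(1) pushes at the back:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout; not part of the C Data Structure Library. */
typedef struct {
    char  *data;      /* backing dynamic array */
    size_t elem_size; /* size of one element in bytes */
    size_t capacity;  /* number of slots in data */
    size_t head;      /* physical index of the first element */
    size_t count;     /* number of elements stored */
} deque;

/* Map a logical position to a physical slot, wrapping around the end. */
static size_t deque_slot(const deque *d, size_t pos)
{
    return (d->head + pos) % d->capacity;
}

/* Append an element at the back; doubles the buffer when it is full. */
static int deque_push_back(deque *d, const void *elem)
{
    if (d->count == d->capacity) {
        size_t new_cap = d->capacity ? d->capacity * 2 : 8;
        char *new_data = (char *)malloc(new_cap * d->elem_size);
        size_t i;
        if (new_data == NULL)
            return -1;
        /* Copy elements out in logical order so the deque starts at
         * physical slot 0 in the new buffer. */
        for (i = 0; i < d->count; i++)
            memcpy(new_data + i * d->elem_size,
                   d->data + deque_slot(d, i) * d->elem_size,
                   d->elem_size);
        free(d->data);
        d->data = new_data;
        d->capacity = new_cap;
        d->head = 0;
    }
    memcpy(d->data + deque_slot(d, d->count) * d->elem_size,
           elem, d->elem_size);
    d->count++;
    return 0;
}

/* Remove the front element in O(1) by advancing head. */
static int deque_pop_front(deque *d, void *out)
{
    if (d->count == 0)
        return -1;
    memcpy(out, d->data + d->head * d->elem_size, d->elem_size);
    d->head = (d->head + 1) % d->capacity;
    d->count--;
    return 0;
}
```
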
All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Thursday, November 20, 2008

C Data Structure Library: list is done

Today I finished my list implementation for the C Data Structure Library, based on unrolled doubly-linked lists with external storage. As a relatively complex data structure, it includes deep invariant checking in debug mode, and has a good set of unit tests to stress it. I built a release version for performance testing and compared it against a simple doubly-linked list doing similar operations. Preliminary results seem to indicate the complexity was worthwhile:
  • An insert at the end requires only 29 ns on average, as compared to 110 ns for a standard doubly-linked list.
  • An iteration over a million-element list requires only 3.7 ms, as compared to 9.5 ms for a standard doubly-linked list.
It's a good thing I did the perf test though - a few critical functions had to be converted into macros to get this performance (before, it was slower than doubly-linked lists with similarly inlined calls).

After this I varied ELEMENTS_PER_LIST_NODE a bit to examine how it affects the performance of these simple tests. The default value is 25. At the minimum, 2, it required about 75 ns per insert and 7 ms per iteration. Around 18, iteration time drops to 2.9 ms. For particularly large values, like 128, iteration drops to 2.2 ms and inserts drop to 24.4 ns, but other benchmarks like insert_after would probably show a loss at these sizes.
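
For anyone unfamiliar with the structure, here's a rough sketch of what an unrolled doubly-linked node with external storage can look like. Apart from ELEMENTS_PER_LIST_NODE, the type and field names and the append routine below are illustrative placeholders of mine, not the library's actual definitions:

```c
#include <stdlib.h>

#define ELEMENTS_PER_LIST_NODE 25  /* default; the real value is configurable */

/* Each node stores up to ELEMENTS_PER_LIST_NODE pointers to
 * caller-owned elements ("external storage"), so a million-element
 * list touches far fewer nodes and cache lines than a
 * one-element-per-node list. */
typedef struct list_node {
    struct list_node *prev;
    struct list_node *next;
    unsigned int      count;                            /* slots used in this node */
    void             *elements[ELEMENTS_PER_LIST_NODE]; /* pointers to caller data */
} list_node;

typedef struct {
    list_node *head;
    list_node *tail;
} list;

/* Append a pointer at the end: a new node is allocated only once every
 * ELEMENTS_PER_LIST_NODE inserts, which keeps the amortized cost well
 * below one malloc per element. */
static int list_push_back(list *l, void *value)
{
    list_node *tail = l->tail;
    if (tail == NULL || tail->count == ELEMENTS_PER_LIST_NODE) {
        list_node *node = (list_node *)calloc(1, sizeof *node);
        if (node == NULL)
            return -1;
        node->prev = tail;
        if (tail != NULL)
            tail->next = node;
        else
            l->head = node;
        l->tail = node;
        tail = node;
    }
    tail->elements[tail->count++] = value;
    return 0;
}
```

The intuition behind the numbers above: a larger ELEMENTS_PER_LIST_NODE means fewer prev/next hops and fewer allocations per element, which helps iteration and appends, while operations like insert_after presumably suffer because mid-node insertions have more element pointers to shift (or more to copy when a node splits).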

Update: Improvements to the perf test framework eliminated some of the overhead of the framework itself, which makes a difference for the very short insertion test. Currently inserts are down to 23.5 ns, compared to 98 ns for the standard doubly-linked list (4.1x speedup).

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.

Tuesday, November 18, 2008

Back and working on the C Data Structure Library

After a long hiatus I'm getting back in the game and looking to finish up the Kompimi C Data Structure Library, the first Kompimi public domain component. C is a language that has traditionally suffered from a reinvent-the-wheel problem due to its dearth of basic data structures, and this module aims to supply some modern, well thought-out, fully tested data structures for this purpose.

Recent changes include:
  • Modifying the build to deploy all derived files into a single directory. It's a good general policy to separate derived files from source files, so that they can be easily cleaned up and so new source files can be easily identified.
  • Documenting the interfaces for each module using Doxygen doc comments (see the example after this list). Although Doxygen is not a public domain tool, it is optional and installed separately. These docs will be deployed at Kompimi's website as well, when kompimi.org goes up.
  • Working on a new list module, based on unrolled doubly-linked lists with external storage. This will provide a great generic solution for programs that do a lot of insertion and deletion.
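
For reference, a Doxygen doc comment on a declaration looks something like the following; the function and type shown are just an illustration, not necessarily the library's actual API:

```c
#include <stddef.h>

struct list;  /* opaque list type; the name is illustrative */

/**
 * Returns the number of elements currently stored in the list.
 *
 * @param l  Pointer to an initialized list.
 * @return   The element count, or 0 if the list is empty.
 */
size_t list_size(const struct list *l);
```
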
More to come!

All content in this document is granted into the public domain. Where this is not legally possible, the copyright owner releases all rights. This notice may be modified or removed.