Programming Breaks Things

Computer scientist Edsger Dijkstra famously said "It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."

I disagree, in principle and in practice. (I disagree so strongly that I work on a project to teach programming to children.)

I believe it's almost impossible to teach programming to someone who hasn't experienced what we USians call "Geometry". That's mathematics: not the specific behavior of triangles and angles and their relationships, but the hard work and creativity and even beauty of following a set of logical rules to a desirable conclusion. People who can do that can program effectively. People who can't do that will struggle.

Before you can solve a big problem, you have to break it.

One of my work projects is a document categorization system. I've written before that it uses a pipeline processing model, where a document moves through the pipeline in various named stages. One stage might be "NEW", while another might be "EXTRACT METADATA". As the system runs, documents make their way through the pipeline in various stages and eventually enter a search index and an archive intended for users.

Documents come from various places, and it's possible for identical (or near-identical) documents to enter the system at various times. I've long had an exact title match filter as a first approach to remove duplicates, but it's never filtered out enough duplicates. (Some documents are essentially press releases barely edited and republished by multiple news organizations. These documents are almost never interesting and are frustrating in their sameness within the archive, but into the system they go regardless.)

We've talked about several approaches to finding duplicate and near-duplicate articles, with everything from heuristics to identify title similarity to maintaining multiple latent semantic indexes for each unique category of documents. I dragged my feet on the latter because documents expire after 90 days, and managing an n-dimensional corpus search space where one of those dimensions is also time was more work than I wanted.

Wednesday I realized that a naïve approach could give really good results while being easy to code and, more importantly, very quick to run. I coded and deployed it yesterday, and tuned it and deployed an improved version as I was writing this very paragraph. I added a new processing stage which makes a word histogram for every new document entering the system and compares those histograms to existing articles. If they're similar enough, the new article gets invalidated before it enters the search index or undergoes any further processing.

It's silly, but it works. It's 108 lines of code, per sloccount.

I realized something while writing it: programming is breaking things.

Long years of programming experience have taught me that most problems are too big. Most functions are too long. Most methods are too long. Most entities in the system do too much.

If you read much novice code, you see long functions (if you see functions at all) with deeply nested conditionals and mutable state mutating all over the place, because a variable at the top of the program gets used all throughout the entire program. You see a mess, and you see a maintenance burden, and you see someone flailing to control something that's grown way out of hand.

(You see this in part because people trying to learn how to program are also learning the syntax and semantics of a programming language, and until you know the vocabulary rules, you're going to have trouble understanding nuance of meaning and metaphor and idioms.)

I had no trouble writing this code in the small because I know the tools Perl provides for me: hashes and arrays and methods.

I had little trouble writing this code, because I understand the pattern of fetching a document at a time from an iterator and processing it to get a histogram and putting that histogram in an array for later processing.

I had an easy time testing this code because I know how to write testable code: each of my methods has a well-defined input and a well-defined output and I can test only at those boundaries to see what happens.

Even though you don't know the details of this system, if you're a decent programmer, you can probably write an outline of how the code works just from how I've described it already:

  • Get a collection of all active extant documents
  • Iterate over them
  • Fetch a histogram of each
  • Get a collection of all new documents
  • Iterate over them
  • Fetch a histogram of each
  • Compare each to every document in the histogram array
  • Invalidate the document if it matches any histogram too closely
  • Add the document's histogram to the array

You can probably guess the names of my methods. If you're not exactly right, you're close.
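As a rough sketch of that outline (not the actual code, which isn't shown here), something like the following would do. Everything named in it is an assumption for illustration: the function names, the documents as hash references with id and text keys, the use of cosine similarity to compare histograms, and the 0.9 cutoff.

```perl
use strict;
use warnings;
use List::Util 'sum';

# Hypothetical cutoff; the real system's tuned value isn't given.
use constant TOO_SIMILAR => 0.9;

# Count how often each lowercased word appears in a document's text.
sub histogram_for
{
    my ($doc) = @_;
    my %count;
    $count{ lc $1 }++ while $doc->{text} =~ /(\w+)/g;
    return \%count;
}

# Cosine similarity (an assumed measure): 1 for identical word
# proportions, 0 for no words in common.
sub similarity
{
    my ($left, $right) = @_;
    my $dot   = sum( 0, map { $left->{$_} * ( $right->{$_} // 0 ) } keys %$left );
    my $norms = sqrt( sum( 0, map { $_ * $_ } values %$left  ) )
              * sqrt( sum( 0, map { $_ * $_ } values %$right ) );
    return $norms ? $dot / $norms : 0;
}

# True if the histogram is too close to any histogram already seen.
sub matches_any
{
    my ($histogram, $seen) = @_;
    return scalar grep { similarity( $histogram, $_ ) >= TOO_SIMILAR } @$seen;
}

# Returns the ids of new documents which should be invalidated.
sub find_duplicates
{
    my ($active_docs, $new_docs) = @_;
    my @seen = map { histogram_for($_) } @$active_docs;
    my @invalid;

    for my $doc (@$new_docs)
    {
        my $histogram = histogram_for($doc);
        push @invalid, $doc->{id} if matches_any( $histogram, \@seen );

        # Later new documents get compared against this one too.
        push @seen, $histogram;
    }

    return @invalid;
}

my @active = ( { id => 1, text => 'Acme Corp announces record quarterly profits' } );
my @new    = (
    { id => 2, text => 'Acme Corp announces record quarterly profits today' },
    { id => 3, text => 'Local team wins the championship game' },
    { id => 4, text => 'LOCAL TEAM WINS THE CHAMPIONSHIP GAME' },
);

my @invalid = find_duplicates( \@active, \@new );
print "Invalidated: @invalid\n";    # Invalidated: 2 4
```

Each function has one input and one output, so each can be tested at its boundary without touching a database or a search index.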

This is the discipline and experience that sets a good programmer apart from a novice. Sure, a novice (or an undisciplined programmer) could write twice as much code to do the same thing and get it working. Maybe he or she could write four times as much code. (I don't pretend that my factoring of this code is the rightest way to do it, but I do know that it passes multiple tests.)

That's my writing in the small. My writing in the large is even more interesting.

Each stage in the pipeline is its own self-contained class. I call them app classes. Every app class conforms to an interface and gets run by a runner. Every app connects to a defined logger and performs its own registration and reporting.

Every app has a method to fetch its basic resultset (every app is part of a processing pipeline; obviously it's going to iterate over documents in a certain state). Every app has method hooks to fire before this iteration and after it. Every app has a process() method which performs the iteration.
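A minimal sketch of that shape, using plain Perl classes rather than the roles and plugin registration the real system uses. The package names, the process_one() method, and the in-memory documents are all hypothetical; only the resultset, the before and after hooks, and process() come from the description above.

```perl
use strict;
use warnings;

package App::Base;

sub new
{
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# Subclasses override these; the defaults do nothing.
sub resultset      { return [] }
sub before_process { }
sub after_process  { }
sub process_one    { }

# Fire the hooks around one pass over the app's resultset.
sub process
{
    my ($self) = @_;

    $self->before_process;
    $self->process_one($_) for @{ $self->resultset };
    $self->after_process;

    return $self;
}

package App::ExtractMetadata;

our @ISA = ('App::Base');

# This stage cares only about documents in the NEW state.
sub resultset
{
    my ($self) = @_;
    return [ grep { $_->{stage} eq 'NEW' } @{ $self->{documents} } ];
}

sub before_process
{
    my ($self) = @_;
    $self->{processed} = 0;
}

sub process_one
{
    my ($self, $doc) = @_;
    $doc->{stage} = 'EXTRACT METADATA';
    $self->{processed}++;
}

package main;

my @documents = ( { id => 1, stage => 'NEW' }, { id => 2, stage => 'ARCHIVED' } );
my $app       = App::ExtractMetadata->new( documents => \@documents )->process;

print "$app->{processed} document(s) moved\n";    # 1 document(s) moved
```

Because the base class owns the iteration, a new stage only has to say which documents it wants and what to do with one of them.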

I've extracted and formalized the thirteen app classes over the past several months. They started as a series of individual scripts. Then they had a common base class. Now they share code with roles, take configuration out of a common configuration file, and register themselves when loaded as plugins. They can run separately (great for testing) or all together (as is normal).

I knew from the start that I was writing suboptimal code I'd eventually have to change, but that's because I didn't know enough about the problem yet. I'd discover that as the project went on. I'd gain more insight as I saw what kinds of documents we'd have to handle (and how very strange some of them are compared to what we expected).

The original concept of refactoring always reminds me of math. We rearrange things to make them clearer, to prepare us to do other work, or harder work, or at least further work. It's not change for change's sake, and it's not change to add or remove or modify behavior. It's nothing more or less than changing the design of things without changing their behavior.

It's the same skill, from writing functions of the right name and size to putting modules in the right places with the right contents. It's about breaking big things into smaller things. It's about breaking things into the right things.

(Dijkstra is right that BASIC affords few abstraction possibilities to break programs into effective and distinct components, but for novice programmers the experience of turning what seems like a simple task into the steps required to accomplish it is an important experience. That's also one reason why Modern Perl: The Book uses small test programs to demonstrate language features: working in small steps is too important to ignore.)

Time Will Tell

The May 2012 Dr. Dobb's interview with Ward Cunningham has an interesting quote about Ward's notion of technical debt:

I was really devoted to finding great code, especially when objects were new. Objects gave us an extra dimension beyond functional decomposition. And the question was, "Are these the right objects or not?" And the answer was, "Time will tell."

I work off and on with a handful of great programmers in the Portland area. Several years ago, James Shore and Dave Woldrich created CardMeeting, an agile remote collaboration tool. Jim and Dave are both very good programmers. For this project, they decided to forgo their usual test-driven development and just write code so as to deliver a working prototype on a very strict deadline.

Jim took to calling that experience "leveraged technical debt". My estimate (not having read the code, but having tested a lot of code written without testing in mind) is that it takes at least as long to write tests for untested code as it took to write the code, and much longer the more time has passed between writing the code and writing the tests.

Jim, Dave, and I have all worked on small, software-driven businesses doing things we've never seen anyone else do before. We've all had to deal with the risk of building lots of code that may or may not solve the problems of real customers with real money. When I say write the wrong code first, I don't mean "deliberately do things you know won't work" or "paint yourself into a corner" or even "use the fact you don't know everything you're doing as an excuse to play with completely new technologies you don't know how to use". (Not that the latter is a bad thing, but if you decide to do that, do so only after you've considered the risks and the rewards.)

Last night, we had a short conversation with John Wilger, another PDXer. He works with a successful and relatively young startup with a huge software component. I don't want to put words in his mouth, but it sounds like their software is, colloquially, a mess. Their developer team is trying to get to the point of slapping hands whenever someone needs to make a change and starts by copying and pasting code.

Four years after founding (and two years after discovering its cash cow business), the company was worth at least $3 billion.

It's irresponsible to derive meaningful statistics from a single data point, but we can say this: the technical debt of their codebase didn't entirely prevent the company from achieving its current measure of success. (You can also say that the liberal application of candy-flavored magical unicorn shavings of Ruby and Rails didn't prevent people from making an unholy mess.)

Time will tell if changing the development culture and refactoring the code and paying down all of the technical debt will help the company adapt and take advantage of new opportunities.

Time will tell if the codebase collapses under its own weight.

Time will tell if a competitor (and several exist!) will prove more agile and nimble because it has much better flexibility thanks, in part, to better code.

The whole situation reminds me of Facebook's HipHop virtual machine, where it's apparently cheaper and easier and faster and less risky to hire lots of developers to create and maintain a compatibility layer for the existing code than to rewrite existing code in a better language, or in a better fashion, or to improve it meaningfully.

I'm not suggesting that the only way to build a big business from nothing is to write bad code. I'm not suggesting that scaling to billions in revenue is the goal of all software-driven businesses. I'm not suggesting that you have to choose between test-driven development and business success.

In an ideal world, I can write the right software the first time. I can have sufficient test coverage to have complete confidence in the behavior of the code. I can deliver a feature which gets me paying customers in an afternoon without having to rewrite other parts of the code or taking shortcuts I know that I'll have to clean up when I get a spare weekend afternoon.

For a profession where some of us call ourselves "engineers", we certainly spend a lot of time discussing practical concerns as if the risks and rewards and limitations of the real world did not apply. (I wonder if the academic/practical divide between computer science and software development has some relationship to this.)

In the real world, I have to remind myself every day of two things. When I'm working on proof of concept code, proving my concept workable is more important than solidifying my code into well-tested and well-designed software. When I'm working on code I intend to keep, doing things as right as possible now will help me modify it to get it more right in the future.

None of this guarantees success. All of this benefits from the hard-won experiences I have from doing things the wrong way—and occasionally getting it very right. (In the real world, I spent part of the day finding and deploying a shim to turn SVG into VML for Internet Explorer 8 and earlier.)

Maybe Jim and Dave could have thrown out a couple of features and spent more time writing tests for the most valuable parts of their application. Maybe I'm wasting my time optimizing SQL queries for a search feature no one will ever use. Maybe John's company waited too long to untangle the admin and the user sides of their application.

If we're honest with ourselves, the best answer we can give is that time will tell. May we pay attention when it does.

A couple of comments on Simple Attribute-Based Template Exporting have asked for an example. I'll show off more of this code in my YAPC::NA 2012 and Open Source Bridge 2012 talk about how to write the wrong code (along with a handful of other techniques).

(I assume some knowledge of Template Toolkit (besides far too many books about finance, accounting, and investing, the Template Toolkit book is always within reach these days); I've set up a wrapper template which provides the standard look and feel of my application and I include/process other templates liberally. If you understand that much, you'll be able to follow along.)

One of the interesting templates in the system displays a list of chapters of a book in progress. A cron job rebuilds a static page from this template once a day. The template looks much like:

[% USE Bootstrap -%]
[%- canonical_url = 'http://sitename.example.com/book/' _ link -%]

[%- add_og_properties({
    'fb:admins'      => '436500086365356',
    'og:title'       => title _ ' | sitename.example.com',
    'og:type'        => 'article',
    'og:image'       => 'http://static.sitename.example.com/images/logo.png',
    'og:url'         => canonical_url,
    'og:description' => text.chunk(300).0,
    'og:site_name'   => 'Sitename: site tag line',
   })
-%]
[%- add_meta(
    'pagetitle'     => title _ ' | sitename.example.com',
    'feed_url'      => 'http://static.sitename.example.com/book/atom.xml',
    'canonical_url' => canonical_url
) -%]

[% article_text = BLOCK -%]
<article>
<h2>[% title | html %]</h2>
<p>Published: <time datetime="[% date %]">[% nice_date %]</time></p>
[% text %]
</article>

<ul class="pager">
[%- IF prev -%]
    <li><a href="[% prev.link %].html">← [% prev.title | html %]</a></li>
[%- END -%]
    <li><a href="/onehourinvestor">index</a></li>
[%- IF next -%]
    <li><a href="[% next.link %].html">[% next.title | html %] →</a></li>
[%- END -%]
</ul>

[% INCLUDE 'components/social_links.tt', title => title %]
[%- END -%]

[%- row(
    maincontent( article_text ),
    sidebar(
        sideblock( process( 'components/cached/book_latest_chapters.tt' ) ),
        sideblock( process( 'components/cached/book_drafts.tt'          ) )
    )
) -%]

The row() call at the end of the template is most important; it puts all of the content produced or assembled by this template in the HTML structure the site needs. That is to say, everything on the site needs to fit into something I call a row. A row can contain multiple elements, such as maincontent and a sidebar, or fullcontent by itself with no sidebar. A sidebar can contain multiple sideblocks.

(You can ignore the other functions; they put metadata in the right places to pass to wrapper templates.)

Within my template plugin (called Bootstrap), each of these elements is a simple Perl function which takes one or more arguments and interpolates them into some HTML:

sub row :Export
{
    return <<END_HTML;
<div class="row">
    @_
</div>
END_HTML
}

sub sidebar :Export
{
    return <<END_HTML;
<div class="span4">
    @_
</div>
END_HTML
}

(I initially tried to write these functions as templates within Template Toolkit itself, but there comes a point at which you want a real language. That point came very early for me.)
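To see how these functions nest outside of a template, here is a small standalone sketch. It reuses row() and sidebar() as shown, minus the :Export attribute, and the sample HTML strings passed in are made up for illustration.

```perl
use strict;
use warnings;

# row() and sidebar() as shown above, without the :Export attribute
# (which needs the plugin's attribute handler) so this runs standalone.
sub row
{
    return <<END_HTML;
<div class="row">
    @_
</div>
END_HTML
}

sub sidebar
{
    return <<END_HTML;
<div class="span4">
    @_
</div>
END_HTML
}

# Multiple arguments are joined with a space when @_ interpolates.
my $html = row( sidebar( '<p>Latest chapters</p>', '<p>Drafts</p>' ) );
print $html;
```

The inner call produces a string, the outer call wraps it, and the template only ever sees the names row and sidebar.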

I lose no love over the varname = BLOCK pattern necessary to populate variables to pass to these plugin functions, but it works for now. In some of my templates—usually those with lots of text I might end up changing later—I extract that text into a separate template under components/content/ to make it easy to edit. (This idea came up during a client project where the client wanted to edit the legal clickthrough arrangement after users create accounts. I didn't want lawyers or anyone to have the ability to mess up the templating language, so I said "Edit this single file as plain HTML and you'll be fine." It worked great.)

While my programmer brain says "This is ugly, and you're a horrible person for committing this hack upon the world—you're calling Perl from your template system to generate HTML you're stuffing into a template and that puts your presentation elements in Perl code, you awful human being!", it keeps the presentation code in a single place where I can update it infrequently (being that I don't change the layout of the site dramatically) without having to change the divs and classes of multiple templates.

I'm not arguing that this technique as expressed here is right. It's probably not optimal; there may be easier approaches to achieve the same effects.

I am saying that this currently works very well for me. I'm not typing the same HTML over and over and over again, and I can tweak it much more easily than I did before when I was refining the look and feel. In fact, I've even forgotten the exact details of the layout, from the HTML/CSS point of view, and now think only in terms of rows, maincontent, and sidebars.

Working abstractions are very nice.

If you're like me and your design skills are sufficient to modify something decent to look nice but insufficient to create something from first principles, you can do a lot worse than to play with Twitter Bootstrap for your next web site.

I've used it successfully for a few projects and it's been great.

It's a lot better now that I've written my own silly little Template Toolkit plugin to reduce the need for writing lots of repetitive HTML in my templates. (It's like Haml but less ugly and more Perlish and easier to extend.)

Writing a TT2 plugin is relatively easy. Of course I do it the wrong way; when you initialize your plugin, you have the ability to manipulate TT2's stash. This is the data structure representing the variables in scope in your templates. Where a well-behaved template should use object methods to perform its operations, my code stuffs function references in the stash. Here's the relevant code:

sub new
{
    my ($class, $context, @params) = @_;

    $class->add_functions( $context );

    return $class->SUPER::new( $context, @params );
}

sub add_functions
{
    my ($class, $context) = @_;
    my $stash             = $context->stash;

    while (my ($name, $ref) = each %exports)
    {
        $stash->set( $name, $ref );
    }

    $stash->set( process => sub { $context->process( @_ ) } );
}

I'll fix this eventually, but the process of making this work was interesting.

In my first attempt (see Write the Wrong Code First for the justification), I'd write the function I needed, like row(), which creates a new Bootstrap row, or maincontent(), which creates the main content area of the page. Then I'd add that function to the %exports hash and everything would work.

After the sixth function, keeping that list up to date was tedious. Then I kept forgetting it. After all, any time you have to update the same data in two places, you're doing something wrong.

Now the code looks more like:

sub row :Export
{
    return <<END_HTML;
<div class="row">
    @_
</div>
END_HTML
}

... with a single code attribute marking those functions which I want to stuff into the template stash. I've used Attribute::Handlers before, but I always end up reading the manual and playing with things to get them to work correctly. (Something about the way you have to write another package and inherit from it to get your attributes to work correctly always confuses me.)

My second attempt lasted no longer than ten minutes. I switched to Attribute::Lexical. This is almost as trivial to use as to explain:

use Attribute::Lexical 'CODE:Export' => \&export_code;

Whenever any function has the :Export attribute, Perl will call my export_code() function:

use Sub::Identify ();

my %exports;

sub export_code
{
    my $referent = shift;
    my $name     = Sub::Identify::sub_name( $referent );

    return unless $name;
    $exports{$name} = $referent;
}

The first argument to this function is a reference to the exported function. I use Sub::Identify to get the name of the function reference. (That wouldn't work for anonymous functions, but I can control that here.) Then I store the name of the function and the function reference in a hash.

It took as long to write as it does to explain.

A lot of people dislike the use of attributes. Used poorly, they create weird couplings and plenty of action at a distance. Attribute::Handlers can be confusing.

I like to think that I'm using attributes well here (even if I'm abusing TT2 more than a little), and that they've simplified my code so that I can avoid repeating myself and performing manual busywork that I'm likely to forget. Even better, the code to use them isn't magical at all: it's all hidden behind the pleasant interfaces of Attribute::Lexical and Sub::Identify.

Write the Wrong Code First

I rewrite code often.

If I were a better programmer, designer, or businessman, I would rewrite my code much less frequently—but I get things wrong about as often as I get them right. Even with years of practical experience, software's still too difficult to predict with any degree of accuracy.

As a case in point, I've been revising some financial software in the past week. In reviewing the calculations, I found a way to simplify them dramatically. Even better, these simplifications allow me to simplify the interface and user experience.

That means rewriting a lot of code. That means throwing out code and revising the storage model and making a lot of changes.

I'm fortunate to have a good test suite that runs in 15 to 20 seconds and lets me know that everything I most need to work continues to work. That's a lot of confidence. People who like to talk about test-driven development and refactoring tout this as one of the benefits of well-tested software: you can refactor with confidence.

I'm not refactoring. I'm throwing away parts of this application and adding others. I'm changing how it behaves. Even though my test suite helps, that's not refactoring.

As part of this project, I've added an SVG graph to a class of web pages. I started by creating the SVG in Inkscape. Then I exported it as plain SVG. Then I made a template for that SVG to include from the page template.

That was still the example SVG with sample data, still the proof of concept.

I then extracted one piece of hard-coded data and made it a templated value. One. Everything still worked. Then I extracted the second piece of data and so on.

It's one step at a time. It's one change at a time. I'm using Git, so I could even commit after every single change, no matter that it's a few characters or even merely changing the color of a bar in the graph. I can work in steps as small and discrete as possible, and then squash them into one big commit or rewrite them into functional units, or do whatever I want with them.

That's the same principle behind test-driven development (or test-driven design or even behavior-driven development, if you need to hang a new name on the same idea). Do one thing at a time. Make your code do a little more of what it needs to do. Prove that it all hangs together, that it all works, that it does what you intended.

Then clean up a little bit. That's refactoring, in your code and in your tests. That's rebasing in Git.

Sure, I wish I could know exactly what I needed to write from the start. I wish sometimes that programming were mere transcription of the voice of an ephemeral muse (though I find it difficult to imagine a muse dictating Perl or JavaScript or Haskell or J aloud). I wish I were the Beethoven of programming (without the mercurial temperament and the hearing loss).

Usually I don't get things right from the start. Fortunately, with a little discipline and the willingness to work in small steps, to erect and replace the scaffolding as I go, I usually get a lot closer to the right code than if I guessed.

Maybe that means I've thrown out more code than I've written. (It's satisfying to delete unused code, after all.) Maybe any project which starts as a proof of concept, then has to pivot in other directions to do what it's always needed to do, eventually becomes a Ship of Theseus.

I'm okay with that. It's more important to me to create something useful and then make it right than to wait on getting it right before other people can find value in it. I may never write the right code from the start, but I believe I can make almost-right code much, much more right, with discipline and care and feedback.
