Wednesday, May 13, 2015

Ruby, Rust, and Rake: A Small Tale

So in getting Ruby to call Rust functions I came across a small headache when using Rake to compile my Rust and run my Ruby in one fell swoop.

I got my Rust file to compile correctly by making a system call like so:

    rake :run do
      `rustc func.rs --crate-type=dylib`
      load 'script.rb'
    end

But of course, being new to Rust, I invariably started causing the compiler to error out.  I wanted my Ruby script to run only if my Rust file compiled correctly, so I tried something like this:

    rake :run do
      code = `rustc func.rs --crate-type=dylib`
      load 'script.rb' if code
    end

But it didn't work!  Come to find out, the compilation was always storing the empty string "" into error.

The solution was to use system('command') instead of the backticks:

    rake :run do
      code = system('rustc func.rs --crate-type=dylib')
      load 'script.rb' if code
    end

And now it works perfectly!  Now to clean it up:

    rake :run do
      load 'script.rb' if system('rustc func.rs --crate-type=dylib')
    end

I hope that helps someone else running into this (tiny) headache.

Tl:dr: use Kernel.system to compile Rust from Rake if you want to conditionally run more code.

Thursday, May 7, 2015

Git Commits are Not Transactions

Git commits are not a list of differences between the last commit and itself.  A branch is not a transactional database of file content changes.  Git does not store your files as differences between each other (until later during garbage collection).

Each Git commit contains a compressed snapshot of your entire working directory*.  To be more precise, it contains a reference to a list of compressed files.  Each of those compressed files are what you "git add"ed in the last iteration including the files that you did not change.

To make this more clear and concrete, run the following command.

    $ git cat-file -p HEAD
    tree b1da7229394351fa209533717de8a741d227aa1b
    parent b6b293e9bc2dacb114463b36286b33e83c79c0b7
    author Dominic Muller <nicklink483@gmail.com> 1431043752 -0700
    committer Dominic Muller <nicklink483@gmail.com> 1431043752 -0700

    removed new_file

This will output the last commit** in plain text as it is stored in git***.  What you're looking at is literally how git saves a commit.  You may notice that the top line says the word "tree" followed by a sha1 hash.  This sha1 hash is the tree object that is the complete list of files that your index knew about at the time of the commit.  To show this, use the following command.

  $ git cat-file -p HEAD^{tree}
  100644 blob 6b5ffcfb6de1421c1b7a90b0b98febbceb75e70e    .gitignore
  100644 blob f73693a16cdf594532ee4c423a46d32ce3430c4e    blah.txt
  040000 tree 86c2509f4c12c5d3bf9a486925ed051666ee2d97    new_dir
  100644 blob 0861b9114fba8c82892d89e53f2a34447bd4c9e7    new_file.txt

That is the list of files and directories that your index knew about at the moment of that commit.  For any that say "blob" in them, if you runthe following (I'm doing it on blah.txt here):

    $ git cat-file -p f73693a16
    "test"

You will see an entire file of yours. Mine here only contained the content "test" (quotes included).

Every commit in Git can reconstruct the working directory at the time of it's creation without the knowledge of any other commit.  The reason commits have parents is a) for convenience on your part to know a kind of history, and b) to help git know if there has been any data loss/change at some point.  Every time you add a changed file, git makes a complete new copy of that file, so each commit has its own file references without the help from any other commit.  This may seem like file bloat over time, but fear not, as git will garbage collect to produce diffs in the database, but the commits still reference their own reconstructable blobs.

TL;DR If anyone tells you that a commit contains the difference between the last commit and that one, they are wrong, and you can prove it.  I've heard this falsehood more than once even from well-seasoned git enthusiasts.

For more on this, please read chapter 10 of Pro Git, available online for free here: https://git-scm.com/book/tr/v2/Git-Internals-Plumbing-and-Porcelain

*well, only the files that your index knew about from your working directory, but if you git add ./ and don't have anything in your .gitignore file, these will almost certainly be the same.

**git cat-file -p HEAD might not show you the most recent commit if your HEAD isn't pointing to the most recent commit, but you get the point.  You can always see what HEAD is referencing with "cat .git/HEAD" (unless you're on Windows, I guess)

***except for the header.  so it would be "commit 241\0" + that plain text.  The number 241 is the length in bytes of that commit on my computer.

Wednesday, April 29, 2015

Refactoring By Any Other Name

I like philosophy.  Specifically, I like semantics and ontology.  I like thinking about naming things, and naming them distinctively.

"There are only two hard things in computer science: cache invalidation, and naming things." - Phil Karlton[1]

Normally, "naming things" in this quote refers to naming your variables and namespaces, but we can imagine it applying equally to programming concepts themselves.

Refactoring, as defined by Martin Fowler et. al.[2], is the process of improving a code base through incremental changes to the source code.

I don't like terms whose definitions require subjectivity, but I'll be a nice guy and let Fowler have the word 'refactoring'.

Today I'm going to talk about the more objective definition of refactoring: changing code without changing its behavior.  I'm even going to give it a new name, so that we have some common (and hopefully sane) vocabulary to discuss changes to code and the affects of those changes on the behavior of the code.

So what are we talking about anyway?  We're talking about changing code.  We're talking about a transformation of the source code of a file from one state to a different state.

Definition time:

    Def.
      Transformation:
        Some change or alteration to the source code.

As you can see, this is not limited to changes which result in working/compilable code.  Adding a comment is a transformation.  Fixing a bug (usually) requires one or more transformations.

Okay, so we know what 'change' means.  But there's another word in the objective definition of refactoring.  "Behavior".  Sure, yea, we all *know* what 'behavior' is.  It's what the code does!

Here is a more formal definition:

    Def.
      Behavior:
        The mapping of the state of the program at one point of computation to another point of computation.

A "point of computation" can usually be thought of as a particular line of code, if you're programming in a structured fashion. In fact, most times we will be assuming the beginning and ending lines of a function/method definition.  As a matter of convenience, I argue that this should be the default.  So in a colloquial sense, the behavior is the mapping of your program's state from before a function/method is run to after that function/method is run.

But do note that "behavior" must be discussed with respect to two points of computation.  So the definition "what the program does" is not enough information to convey 'behavior'.

So now that we have a common ground for transformations and behaviors, we can start to classify transformation by their affects on behavior.

The first will be the Behavior-Maintaining Transformation, or BMT.

    Def.
      Behavior-Maintaining Transformation (BMT):
        A change in the source code such that the behavior (mapping of program state between two points) is maintained for all execution paths between those two points.

An example of a BMT is adding a comment (unless there's something gravely wrong with your compiler/interpreter).

When people talk about refactoring, they're usually referring to BMTs.  However, this definition also applies to minification and linting.  I doubt many proponents of refactoring would consider minification an example of it.

The other classification of transformations are those that do change the behavior of the code between two points.  I'll call these Behavior-Changing Transformations, or BCTs.

    Def.
      Behavior-Changing Transformation (BCT):
        A change in the source code such that the behavior (mapping of program state between two points) is altered for any execution paths between those two points.

Note here I said any execution paths.  Even if your change changes one special case, it is a BCT.  Fixing a bug is a BCT.  A bug fix alters the current behavior, so it must be behavior-changing (otherwise you wouldn't fix the bug).

Since we live in the real world where real work is done, the situation is more hairy.  For example, you may introduce a BCT whose changes don't affect you because you never use the code in a manner in which the change is evident.  For example, take the case of a function which you change and becomes incorrect for integer values that cannot be expressed with the total matter of the universe.  This can safely be considered a BMT.  But of course, there are closer cases, such as ones where you aren't currently affected by the change, but you could write a test which exposes the new bug, but it may never be the case that you use the code that way.  Is this a BMT or a BCT? I'll leave the answer up to the reader to ponder.  But I maintain (no pun intended) that it is an important question and should be considered.



So who cares?  What can we do with these new-found definitions?  I'd say it's useful in the following sense.  If you know all of the transformations in your programming language which will not change code behavior, then you can freely alter your source code to a form which is more palatable.  We already know a few of these transformations.  Extract method, extract class, replace literal with variable, etc.  almost every refactoring is a BMT (one which is not a BMT is extract class, since you have a new class you can talk about elsewhere in your code).

Also, we can start to construct an Algebra of Programming (Transformations).  In trigonometry, we learned about identities, and how to show that one trigonometric expression was identical to another through a series of algebraic transformations, each of which were provably correct transformations (like multiplying by sin(x)/sin(x) or breaking tan(x) into sin(x)/cos(x)).

The only difference was that we were never allowed to make changes in math that resulted in incorrectness, but in programming we do it all the time, since things can work well enough without being "correct".

Here is an implementation of square root in Ruby through a series of easily digestible steps/transformations.  I've labelled each step as a BCT or a BMT.  It's surprising how much work you can get done with only BMTs.  Of course, BCTs have to be written eventually, otherwise your code won't actually do anything!
https://gist.github.com/nicklink483/aefec4a7fec814300ba4

[1] https://twitter.com/timbray/status/506146595650699264
[2] http://martinfowler.com/books/refactoring.html

Wednesday, January 28, 2015

What is a Master Programmer?


After seeing Bob Martin's talk on professionalism in computer programming, it got me thinking about programming as a craft.

We're all aware of the notion of a craft, and varying levels of proficiency within a craft.  The most intriguing level is that of the master.  The master craftsman has a level of knowledge worth revering, but what it means to be a master is different from craft to craft.


So, what does it mean to be a master programmer?  Here is an inexhaustive, slightly opinionated list of traits of a master in programming.  Feel free to add to this list, and share it around.  And if you aspire to be an expert in the field of programming, then set this list aside and reason about it and choose some of them to abide by.

Note that some of these bullet points are true of master in general, not about programming in particular.

A master programmer...

  • knows several languages and language paradigms and leverages each to inform the others and his/her overall product
    • i.e. imperative vs declarative, and functional, OO, procedural, etc., and various type structures
  • puts some effort into typing quickly
  • knows how long it will take to complete a project, or at least has some reasonable amount of insight
  • knows which of his/her beliefs are "opinions", and is versed in the tradeoffs of both sides
  • (optional) takes on an apprentice
  • is constantly learning and improving
  • knows the limits of his/her knowledge
  • can convert one problem into another
  • knows that tomorrow he/she may be proved wrong on any of his/her beliefs of the craft
  • knows where to get more information on almost any subject

In short, a master knows their tools, knows how to use them, and is aware of the right tool for the right job.  There are some things I'm hesitant to add to the list, but seem important, though I don't know how to word them or modify them to fit in:

  • is comfortable in a full stack
  • knows the O-times for various algorithms and their implementations

Do note that nowhere in either list is the master considered "professional".  This is not a mistake.  Professionalism is not a trait of the master craftsman.  Just ask this fella:


(Note that TDD isn't on the list either)  Finally, this list will not help you become good at programming.  For that list, check out this page.