Wednesday, May 13, 2015

Ruby, Rust, and Rake: A Small Tale

So in getting Ruby to call Rust functions I came across a small headache when using Rake to compile my Rust and run my Ruby in one fell swoop.

I got my Rust file to compile correctly by making a system call like so:

    task :run do
      `rustc func.rs --crate-type=dylib`
      load 'script.rb'
    end

But of course, being new to Rust, I inevitably started causing the compiler to error out.  I wanted my Ruby script to run only if my Rust file compiled correctly, so I tried something like this:

    task :run do
      code = `rustc func.rs --crate-type=dylib`
      load 'script.rb' if code
    end

But it didn't work!  Come to find out, the backticks were always storing the empty string "" into code: backticks return the command's standard output, rustc writes its diagnostics to stderr, and in Ruby even an empty string is truthy.

The solution was to use system('command') instead of the backticks, since system returns true when the command exits successfully and false otherwise:

    task :run do
      code = system('rustc func.rs --crate-type=dylib')
      load 'script.rb' if code
    end

And now it works perfectly!  Now to clean it up:

    task :run do
      load 'script.rb' if system('rustc func.rs --crate-type=dylib')
    end
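
As an aside: if you ever want the compiler's output as well as its status, you can keep the backticks and check Ruby's $? afterward, which holds the exit status of the last spawned command.  A minimal sketch:

    task :run do
      # Backticks capture stdout (rustc writes its errors to stderr);
      # the exit status of the last command is available in $?.
      output = `rustc func.rs --crate-type=dylib`
      load 'script.rb' if $?.success?
    end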

I hope that helps someone else running into this (tiny) headache.

TL;DR: use Kernel#system to compile Rust from Rake if you want to conditionally run more code.

Thursday, May 7, 2015

Git Commits are Not Transactions

Git commits are not lists of differences between one commit and the previous one.  A branch is not a transactional database of file content changes.  Git does not store your files as diffs against each other (until much later, during garbage collection).

Each Git commit contains a compressed snapshot of your entire working directory*.  To be more precise, it contains a reference to a list of compressed files.  Each of those compressed files is what you "git add"ed at some point, including the files that you did not change since the previous commit.

To make this more clear and concrete, run the following command.

    $ git cat-file -p HEAD
    tree b1da7229394351fa209533717de8a741d227aa1b
    parent b6b293e9bc2dacb114463b36286b33e83c79c0b7
    author Dominic Muller <nicklink483@gmail.com> 1431043752 -0700
    committer Dominic Muller <nicklink483@gmail.com> 1431043752 -0700

    removed new_file

This will output the last commit** in plain text, exactly as it is stored in git***.  What you're looking at is literally how git saves a commit.  You may notice that the top line says the word "tree" followed by a SHA-1 hash.  That hash names the tree object: the complete list of files that your index knew about at the time of the commit.  To see it, use the following command.

    $ git cat-file -p HEAD^{tree}
    100644 blob 6b5ffcfb6de1421c1b7a90b0b98febbceb75e70e    .gitignore
    100644 blob f73693a16cdf594532ee4c423a46d32ce3430c4e    blah.txt
    040000 tree 86c2509f4c12c5d3bf9a486925ed051666ee2d97    new_dir
    100644 blob 0861b9114fba8c82892d89e53f2a34447bd4c9e7    new_file.txt

That is the list of files and directories that your index knew about at the moment of that commit.  For any entry that says "blob", you can run the following (I'm doing it on blah.txt here):

    $ git cat-file -p f73693a16
    "test"

You will see an entire file of yours. Mine here only contained the content "test" (quotes included).
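
In fact, a blob's ID isn't magical: it's the SHA-1 of a short header plus the file's exact contents.  Here's a minimal Ruby sketch of the idea; you can check its output against git hash-object blah.txt:

    require 'digest/sha1'

    # A blob's ID is the SHA-1 of "blob <size-in-bytes>\0" + contents.
    content = File.read('blah.txt')
    header  = "blob #{content.bytesize}\0"
    puts Digest::SHA1.hexdigest(header + content)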

Every commit in Git can reconstruct the working directory at the time of its creation without knowledge of any other commit.  The reasons commits have parents are a) for your convenience, so you have a kind of history, and b) to help git detect data loss or change at some point.  Every time you add a changed file, git makes a complete new copy of that file, so each commit has its own file references without help from any other commit.  This may seem like file bloat over time, but fear not: git's garbage collection will eventually pack objects and store diffs in the database, while the commits still reference their own reconstructable blobs.

TL;DR: If anyone tells you that a commit contains the difference between the last commit and that one, they are wrong, and you can prove it.  I've heard this falsehood more than once, even from well-seasoned git enthusiasts.

For more on this, please read chapter 10 of Pro Git, available online for free here: https://git-scm.com/book/tr/v2/Git-Internals-Plumbing-and-Porcelain

*well, only the files that your index knew about from your working directory, but if you git add ./ and don't have anything in your .gitignore file, these will almost certainly be the same.

**git cat-file -p HEAD might not show you the most recent commit if your HEAD isn't pointing to the most recent commit, but you get the point.  You can always see what HEAD is referencing with "cat .git/HEAD" (unless you're on Windows, I guess)

***except for the header.  So it would be "commit 241\0" + that plain text.  The number 241 is the length in bytes of that commit on my computer.

Wednesday, April 29, 2015

Refactoring By Any Other Name

I like philosophy.  Specifically, I like semantics and ontology.  I like thinking about naming things, and naming them distinctively.

"There are only two hard things in computer science: cache invalidation, and naming things." - Phil Karlton[1]

Normally, "naming things" in this quote refers to naming your variables and namespaces, but we can imagine it applying equally to programming concepts themselves.

Refactoring, as defined by Martin Fowler et al.[2], is the process of improving a code base through incremental changes to the source code.

I don't like terms whose definitions require subjectivity, but I'll be a nice guy and let Fowler have the word 'refactoring'.

Today I'm going to talk about the more objective definition of refactoring: changing code without changing its behavior.  I'm even going to give it a new name, so that we have some common (and hopefully sane) vocabulary to discuss changes to code and the effects of those changes on the behavior of the code.

So what are we talking about anyway?  We're talking about changing code.  We're talking about a transformation of the source code of a file from one state to a different state.

Definition time:

    Def.
      Transformation:
        Some change or alteration to the source code.

As you can see, this is not limited to changes which result in working/compilable code.  Adding a comment is a transformation.  Fixing a bug (usually) requires one or more transformations.

Okay, so we know what 'change' means.  But there's another word in the objective definition of refactoring: "behavior".  Sure, yeah, we all *know* what 'behavior' is.  It's what the code does!

Here is a more formal definition:

    Def.
      Behavior:
        The mapping of the state of the program at one point of computation to another point of computation.

A "point of computation" can usually be thought of as a particular line of code, if you're programming in a structured fashion. In fact, most times we will be assuming the beginning and ending lines of a function/method definition.  As a matter of convenience, I argue that this should be the default.  So in a colloquial sense, the behavior is the mapping of your program's state from before a function/method is run to after that function/method is run.

But do note that "behavior" must be discussed with respect to two points of computation.  So the definition "what the program does" is not enough information to convey 'behavior'.

So now that we have a common ground for transformations and behaviors, we can start to classify transformations by their effects on behavior.

The first will be the Behavior-Maintaining Transformation, or BMT.

    Def.
      Behavior-Maintaining Transformation (BMT):
        A change in the source code such that the behavior (mapping of program state between two points) is maintained for all execution paths between those two points.

An example of a BMT is adding a comment (unless there's something gravely wrong with your compiler/interpreter).
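
For a meatier (hypothetical) example, here's an Extract Method in Ruby; for every input, the mapping of program state from the method's entry to its exit is identical before and after, so it's a BMT.  The method names are mine, purely for illustration:

    # Before: the tax computation is inlined.
    def receipt(prices)
      total = prices.sum * 1.08
      "Total: $#{format('%.2f', total)}"
    end

    # After an Extract Method BMT: the entry-to-exit mapping of
    # receipt is unchanged for all inputs.
    def total_with_tax(prices)
      prices.sum * 1.08
    end

    def receipt(prices)
      "Total: $#{format('%.2f', total_with_tax(prices))}"
    end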

When people talk about refactoring, they're usually referring to BMTs.  However, this definition also applies to minification and linting.  I doubt many proponents of refactoring would consider minification an example of it.

The other class of transformations is those that do change the behavior of the code between two points.  I'll call these Behavior-Changing Transformations, or BCTs.

    Def.
      Behavior-Changing Transformation (BCT):
        A change in the source code such that the behavior (mapping of program state between two points) is altered for any execution paths between those two points.

Note here I said any execution paths.  Even if your change alters only one special case, it is a BCT.  Fixing a bug is a BCT.  A bug fix alters the current behavior, so it must be behavior-changing (otherwise you wouldn't fix the bug).
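
A tiny hypothetical bug fix to make this concrete:

    # Before: off-by-one bug that drops the last element.
    def first_n(arr, n)
      arr[0...n - 1]
    end

    # After the fix: the entry-to-exit mapping changed for at least
    # one execution path, so this is a BCT by definition.
    def first_n(arr, n)
      arr[0...n]
    end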

Since we live in the real world where real work is done, the situation is hairier.  You may introduce a BCT whose changes never affect you, because you never use the code in a manner that makes the change evident.  Take, for example, a function you change so that it becomes incorrect only for integer values too large to be expressed with all the matter in the universe.  That can safely be considered a BMT.  But there are closer cases: ones where you aren't currently affected by the change, and you could write a test that exposes the new bug, yet you may never actually use the code that way.  Is this a BMT or a BCT?  I'll leave the answer up to the reader to ponder, but I maintain (no pun intended) that it is an important question and should be considered.



So who cares?  What can we do with these new-found definitions?  I'd say they're useful in the following sense: if you know all of the transformations in your programming language that will not change code behavior, then you can freely alter your source code into a more palatable form.  We already know a few of these transformations: Extract Method, Extract Class, Replace Literal with Variable, etc.  Almost every refactoring is a BMT (one that is not is Extract Class, since you gain a new class you can refer to elsewhere in your code).

Also, we can start to construct an Algebra of Programming (Transformations).  In trigonometry, we learned about identities, and how to show that one trigonometric expression was identical to another through a series of algebraic transformations, each of which was a provably correct transformation (like multiplying by sin(x)/sin(x) or breaking tan(x) into sin(x)/cos(x)).

The only difference is that in math we were never allowed to make changes that resulted in incorrectness, but in programming we do it all the time, since things can work well enough without being "correct".

Here is an implementation of square root in Ruby through a series of easily digestible steps/transformations.  I've labelled each step as a BCT or a BMT.  It's surprising how much work you can get done with only BMTs.  Of course, BCTs have to be written eventually, otherwise your code won't actually do anything!
https://gist.github.com/nicklink483/aefec4a7fec814300ba4

[1] https://twitter.com/timbray/status/506146595650699264
[2] http://martinfowler.com/books/refactoring.html

Wednesday, January 28, 2015

What is a Master Programmer?


Seeing Bob Martin's talk on professionalism in computer programming got me thinking about programming as a craft.

We're all aware of the notion of a craft, and varying levels of proficiency within a craft.  The most intriguing level is that of the master.  The master craftsman has a level of knowledge worth revering, but what it means to be a master is different from craft to craft.


So, what does it mean to be a master programmer?  Here is a non-exhaustive, slightly opinionated list of traits of a master in programming.  Feel free to add to this list, and share it around.  And if you aspire to be an expert in the field of programming, set this list aside, reason about it, and choose some of its entries to abide by.

Note that some of these bullet points are true of masters in general, not of programming in particular.

A master programmer...

  • knows several languages and language paradigms and leverages each to inform the others and his/her overall product
    • i.e. imperative vs declarative, and functional, OO, procedural, etc., and various type structures
  • puts some effort into typing quickly
  • knows how long it will take to complete a project, or at least has some reasonable amount of insight
  • knows which of his/her beliefs are "opinions", and is versed in the tradeoffs of both sides
  • (optional) takes on an apprentice
  • is constantly learning and improving
  • knows the limits of his/her knowledge
  • can convert one problem into another
  • knows that tomorrow he/she may be proved wrong on any of his/her beliefs of the craft
  • knows where to get more information on almost any subject

In short, a master knows their tools, knows how to use them, and is aware of the right tool for the right job.  There are some things I'm hesitant to add to the list but that seem important, though I don't know how to word or modify them to fit in:

  • is comfortable in a full stack
  • knows the big-O running times of various algorithms and their implementations

Do note that nowhere in either list is the master considered "professional".  This is not a mistake.  Professionalism is not a trait of the master craftsman.  Just ask this fella:


(Note that TDD isn't on the list either.)  Finally, this list will not help you become good at programming.  For that list, check out this page.

Tuesday, December 16, 2014

Ruby is better than Assembly


Take two systems A and B.  A and B are identical except that B has a CPU that is twice as fast.

The benefit that a Ruby program gains from running on B instead of A outweighs the benefit that an Assembly program gains from running on B instead of A.  That is to say, Ruby receives a greater marginal utility from the move to a faster system than Assembly does.

Since A and B are identical except for their clock speed, writing a Ruby program for one takes the same amount of time and effort as writing it for the other.  The same is true for Assembly.

Assembly takes more time and effort to code in than Ruby.

Computers get faster over time.  Therefore, Ruby will improve relative to Assembly over time.

In addition, Assembly can't get much faster, but Ruby (the implementation) can and will.

Further, one can only do so much work, as there is only so much work to do.  Consider the following: one can spend a month writing a program that runs for 1 week, or spend 1 week writing a program that runs for 6 months.  If the programmer's services aren't needed again for 1 year, surely they should opt to spend the 1 week coding.

On the other hand, a certain amount of work needs to get done.  And Ruby is doing enough work in various domains.

Tuesday, November 11, 2014

A Better Typing Tutor

I've spent quite a bit of time doing typing tests to get better at touch typing.  2 years ago I could only type around 25-30 wpm without looking.  Today I can type around 70-80 wpm (on a good day, at least).  While a large portion of this increase is due to hard work and elbow grease, I also credit a particular website for much of my improvement.  That site is 10fastfingers.com.  It has a great, reasonably responsive typing interface, and the creator responds to feedback.  He even implemented my suggestion of being able to hide the timer (for us more anxious folk).

The main typing test gives you a random assortment of the 200 most common words in whatever language you choose.  It then gives you a minute to type out as many words from that list as possible.  It's mostly lowercase, and other than an occasional apostrophe there is no punctuation.

The "advanced" test has a pool of 1000 words to randomly choose from, which contain many more occurrences of capital letters, and even some punctuation from abbreviations.

The other features of the tests are that you can only backspace within the current word, the typing area is separate from the word list, and every time you type a word, it disappears and the word in the list turns from black to either red or green depending on whether you typed it correctly.

Each of these features and their particular mixture can be different (for example, writing over the list instead of separately), but I'm not going to talk about those.

The features I am going to discuss are those of word choice/vocabulary, readability of word lists, and the computer's awareness of your skill level.

Arguably, these "features" are present in the 10fastfingers test.  For example, the word list has a readability of zero, the computer has zero awareness of your skill (in terms of using that knowledge to alter word choice), and the vocabulary is offered in 2 levels, the second being the advanced test's 1,000-word pool.

Other sites offer more readable word lists, but they suffer a different problem as a result.  Their method of offering readability comes in the form of the word list being a direct quote that a human has written somewhere.  The cost of this is a maldistribution of letter frequency.  For example, in a Moby Dick quote the letter 'w' might come up more often than average, due to the word "whale" showing up so much.  Now, this isn't too big of a problem, but that solution forgoes another possible feature: skill awareness, and word choice as a result.  If the computer knows you struggle at typing "ae", it has a much harder time generating targeted word lists when all it has to choose from are quotes.

And that brings us to the second feature I want to talk about: the computer's awareness of your skill level.  That is, your skill at typing particular letters or letter groups.  For example, a lot of people are much more skilled at typing the letter group "the" than any other three-letter group, because they type it so often.  Try it now!  I define "skill" as the amount of time in milliseconds it takes to type a letter group, starting with the keystroke of the first letter and ending with the keystroke following the last letter.  So if, say, you're typing the word "there", then the letter group "the" would be measured by starting the timer at typing the "t" and stopping the timer at typing the "r".  JavaScript can differentiate to the millisecond, and if someone were typing at 1 millisecond per character, their wpm would be around 12,000, so milliseconds are a good enough unit to measure by.
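
Here's a minimal sketch of that bookkeeping in Ruby (the class and method names are mine, purely for illustration):

    # Tracks how long each letter group takes: the clock starts on the
    # group's first keystroke and stops on the keystroke following its
    # last letter, per the definition above.
    class SkillTracker
      GROUP_SIZE = 3

      def initialize
        @timings = Hash.new { |h, k| h[k] = [] }
        @history = [] # [character, timestamp_ms] pairs
      end

      # Call once per keystroke with a millisecond timestamp.
      def keystroke(char, timestamp_ms)
        @history << [char, timestamp_ms]
        return if @history.size <= GROUP_SIZE

        # The group that ended one keystroke ago is now fully timed.
        window = @history.last(GROUP_SIZE + 1)
        group  = window[0...GROUP_SIZE].map(&:first).join
        @timings[group] << window.last[1] - window.first[1]
      end

      def average_ms(group)
        times = @timings[group]
        times.sum.to_f / times.size unless times.empty?
      end
    end

So while typing "there", the group "the" is timed from the "t" keystroke to the "r" keystroke, exactly as defined above.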

Finally there is the matter of vocabulary choice.  This is straightforward.  A vocabulary pool would grow to include words which occur less and less frequently in culture.  This is calculable several ways, from doing word occurrence analysis on books to magazines to blogs, etc.

Sentence readability is a harder problem, I grant you, but nowhere near impossible, especially with naive solutions.  For example, one could arbitrarily generate a handful of syntaxes in a language and, given a dictionary with sufficiently granular parts of speech, generate sentences on the fly which maintain syntactic validity.  The only thing left is to remember to conjugate, though even that could be subsumed into the arbitrary syntax list.

Now, it's one thing for the computer to be aware of your skill level, but it's another thing to do something about it.  Let's say it takes you 38 milliseconds to type the letter group "le" but it takes everyone else 25 milliseconds (if we're comparing like wpm's).  In that case, the computer should opt to give you sentences in which "le" shows up more often, at least until you get better at it, or perhaps just for some temporary amount of time.  This will strengthen your weakest muscle motions and even out your overall typing skill.
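
A sketch of how the word selection might use those measurements (all names and the weighting scheme here are mine; a real implementation would tune this):

    # Words containing your slow letter groups get larger weights.
    # user_ms and baseline_ms map letter groups to average milliseconds.
    def pick_word(words, user_ms, baseline_ms)
      weights = words.map do |word|
        ratios = word.chars.each_cons(3).map(&:join).map do |group|
          user_ms[group].to_f / baseline_ms[group] if user_ms[group] && baseline_ms[group]
        end.compact
        ratios.empty? ? 1.0 : ratios.max # weight by the word's worst group
      end
      # Weighted random selection.
      target = rand * weights.sum
      words.zip(weights).each do |word, weight|
        return word if (target -= weight) <= 0
      end
      words.last
    end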



Monday, July 21, 2014

Mandelbrot rendering optimization method

I thought of a new optimization for rendering the Mandelbrot set on a computer screen.  It involves prioritizing pixels based on whether or not their color has changed since the last iteration, and it is useful for real-time rendering.

Consider an unrendered 1,000x1,000 px canvas.  On it are 1,000,000 pixels to calculate.  The traditional naive method is to iterate over every pixel and stop only when the dwell is reached or the orbit escapes.  Such a method is computationally costly when the dwell is considerably large and large portions of the pixels in the current canvas are in the set (i.e. the traditionally black pixels).  One purpose of real-time rendering is to see just outside the set, not where the set is per se.  This new rendering method algorithmically prioritizes colored pixels so as to tend to render the bands first, before spending time verifying that pixels in the set will cap out at the current dwell.

For the purposes of illustration, let us also assume that our 1000x1000 canvas is centered at the origin with horizontal and vertical ranges of [-2,2].

My optimization algorithm will start with the assumption that all pixels are in the same band, so we will only calculate one iteration of the top-left-most pixel.

This is easy to display, as the point (-2,2) escapes before the first iteration:





Next, we break down this "one" pixel into 4 sub-pixels and check what their colors are.  As a result, you get a 2x2 Mandelbrot set render:

Next, of the pixels we just rendered, we break down into subpixels only those whose color differs from their parent's.  So we will calculate the children of the top-right, bottom-left, and bottom-right pixels, but ignore the top-left, since it's the same color as last time.  Note that this is a naive implementation of the algorithm, for illustrative purposes.  The next run gives a 4x4 render of the Mandelbrot set:



As you can see, we are already starting to see the shape of the set, and we've only calculated 13 pixels.  The next pass will again select only those smaller pixels whose color differs from their parent's and break them down, giving an 8x8 render:

As a result of selectively ignoring stagnant pixels, we can rush to render those regions which are interesting.  Now, for 4 iterations this particular algorithm isn't any more impressive than if you simply rendered pixels in subchunks with no prioritizing, but the method starts to shine where there are large regions of black, like a zoom-in on a mini-Mandelbrot.

The details of the prioritizing process are left as an exercise for developers, but a few notes may aid the process.

1. It may be useful to deprioritize regions for only a few iterations; as seen in the example above, a naive deprioritization will leave interesting areas unrendered until the end.
2. Be sure to calculate each pixel all the way to its dwell, as coupling the dwell to the render's current iteration will cause the renderer to focus on Mandelbrot set members.
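
For the curious, here is a compact, naive Ruby sketch of the subdivision scheme (this is my own illustration of the idea, not the code behind the images below):

    DWELL = 100

    # Escape-time coloring: the iteration at which the orbit of
    # c = re + i*im escapes, or DWELL if it never does.
    def escape_time(re, im)
      zr = zi = 0.0
      DWELL.times do |i|
        zr, zi = zr * zr - zi * zi + re, 2 * zr * zi + im
        return i if zr * zr + zi * zi > 4.0
      end
      DWELL
    end

    # Each level doubles the resolution. A pixel is recomputed only if
    # its parent's color changed relative to the grandparent's;
    # otherwise it inherits the parent's color (the stagnant case).
    def render(levels)
      colors  = { [0, 0] => escape_time(-2.0, 2.0) } # the 1x1 canvas
      changed = { [0, 0] => true }
      (1..levels).each do |level|
        size = 2**level
        next_colors, next_changed = {}, {}
        size.times do |y|
          size.times do |x|
            parent = [x / 2, y / 2]
            if changed[parent]
              re = -2.0 + 4.0 * (x + 0.5) / size
              im =  2.0 - 4.0 * (y + 0.5) / size
              c  = escape_time(re, im)
              next_colors[[x, y]]  = c
              next_changed[[x, y]] = (c != colors[parent])
            else
              next_colors[[x, y]] = colors[parent] # skip the math
            end
          end
        end
        colors, changed = next_colors, next_changed
      end
      colors
    end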

Got it working!






You can see that it very quickly finds a lot of the noisy parts of the image, and defers computation for areas which have neighboring colors that are the same.

Update:  Already got it quite a bit faster!





The big blue area is where the dwell is maxed out.