A chat about side projects from a Boston Python project night: choose your paths and forgive yourself.
Last night was a Boston Python project night where I
had a good conversation with a few people that was mostly guided by questions
from a nice guy named Mark.
How to write nice code in research
Mark works in research and made the classic observation that research code is
often messy, and asked about how to make it nicer.
I pointed out that for software engineers, the code is the product. For
research, the results are the product, so there’s a reason the code can be and
often is messier. It’s important to keep the goal in mind. I mentioned it might
not be worth it to add type annotations, detailed docstrings, or whatever else
would make the code “nice”.
But the more you can make “nice” a habit, the less work it will be to do it
as a matter of course. Even in a result-driven research environment, you’ll be
able to write code the way you want, or at least push back a little bit. Code
usually lives longer than people expect, so the nicer you can make it,
the better it will be.
Side projects
Side projects are a good opportunity to work differently. If work means messy
code, your side project could be pristine. If work is very strict, your side
project can be thrown together just for fun. You get to set the goals.
And different side projects can be different. I develop
coverage.py very differently
than fun math art
projects. Coverage.py has an extensive test suite run on many versions of
Python (including nightly builds of the tip of main). The math art projects
usually have no tests at all.
Side projects are a great place to decide how you want to code and to
practice that style. Later you can bring those skills and learnings back to a
work environment.
Forgive yourself
Mark said one of his difficulties with side projects is perfectionism. He’ll
come back to a project and find he wants to rewrite the whole thing.
My advice is: forgive yourself. It’s OK to rewrite the whole thing. It’s OK
to not rewrite the whole thing. It’s OK to ignore it for months at a time. It’s
OK to stop in the middle of a project and never come back to it. It’s OK to
obsess about “irrelevant” details.
The great thing about a side project is that you are the only person who
decides what and how it should be.
How to stay motivated
But how to stay motivated on side projects? For me, it’s very motivating that
many people use and get value from coverage.py. It’s a service to the community
that I find rewarding. Other side projects will have other motivations: a
chance to learn new things, flex different muscles, stretch myself in new
ways.
Find a reason that motivates you, and structure your side projects to lean
into that reason. Don’t forget to forgive yourself if it doesn’t work out the
way you planned or if you change your mind.
How to write something people will use
Sure, it’s great to have a project that many people use, but how do you find
a project that will end up like that? The best way is to write something that
you find useful. Then talk about it with people. You never know what will catch
on.
I mentioned my cog project,
which I first wrote in 2004 for one reason, but which is now being used by other
people (including me) for different purposes. It
took years to catch on.
Of course there’s no guarantee something like that will happen: it most
likely won’t. But I don’t know of a better way to make something people will
use than to start by making something that you will use.
Other topics
The discussion wasn’t as linear as this. We touched on other things along the
way: unit tests vs system tests, obligations to support old versions of
software, how to navigate huge code bases. There were probably other tangents
that I’ve forgotten.
Project nights are almost never just about projects: they are about
connecting with people in lots of different ways. This discussion felt like a
good connection. I hope the ideas of choosing your own paths and forgiving
yourself hit home.
This post continues where Hobby Hilbert Simplex left
off. If you haven’t read it yet, start there. It explains the basics of Hobby
curves, Hilbert sorting and Simplex noise that I’m using.
Animation
To animate one of our drawings, instead of considering 40 lines, we’ll think
about 140 lines. The first frame of the animation draws lines 1 through 40,
the second draws lines 2 through 41, and so on until the 100th frame draws lines
100 through 139:
I’ve used a single Hilbert sorter for all of the frames to remove some
jumping, but the Hobby curves still hop around. Also the animation doesn’t loop
smoothly, so there’s a giant jump from frame 100 back to frame 1.
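The windowing itself is simple. Here's a rough sketch (not the actual project
code); curves stands in for a hypothetical list of already-computed curves:

WINDOW = 40

def frame_slices(curves, window=WINDOW):
    # Frame k draws curves[k] through curves[k + window - 1].
    for start in range(len(curves) - window + 1):
        yield curves[start:start + window]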
Natural cubics
Hobby curves look nice, but have this unfortunate discontinuity where a small
change in a point can lead to a radical change in the curve. There’s another way
to compute curves through points automatically, called natural cubic curves.
These curves don’t jump around the way Hobby curves can.
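If you want to try natural cubics yourself, here's a rough sketch (not the
code behind these images) using scipy's CubicSpline with natural boundary
conditions and a chord-length parameterization:

import numpy as np
from scipy.interpolate import CubicSpline

def natural_cubic_curve(points, samples=200):
    # Return `samples` (x, y) pairs on a natural cubic curve through `points`.
    pts = np.asarray(points, dtype=float)
    # Parameterize by cumulative chord length so spacing between points matters.
    deltas = np.diff(pts, axis=0)
    t = np.concatenate([[0.0], np.cumsum(np.hypot(deltas[:, 0], deltas[:, 1]))])
    spline_x = CubicSpline(t, pts[:, 0], bc_type="natural")
    spline_y = CubicSpline(t, pts[:, 1], bc_type="natural")
    ts = np.linspace(t[0], t[-1], samples)
    return np.column_stack([spline_x(ts), spline_y(ts)])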
Jake Low’s page about Hobby curves has interactive
examples of natural cubic curves which you should try. Natural cubics don’t
look as nice to our eyes as Hobby curves. Below is a comparison. Each row has
the same points, with Hobby curves on the left and natural cubic curves on the
right:
The “natural” cubics actually have a quite unnatural appearance. But in an
animation, those quirks could be a good trade-off for smooth transitions. Here’s
an animation with the same points as our first one, but with natural cubic
curves:
Now the motion is smooth except for the jump from frame 100 back to frame 1.
Let’s do something about that.
Circular Simplex
So far, we’ve been choosing points by sampling the simplex noise in small steps along
a horizontal line: use a fixed u value, then take tiny steps along the v axis.
That gave us our x coordinates, and a similar line with a different u value gave
us the y coordinates. The ending point will be completely unrelated to the
starting point. To make a seamlessly looping animation, we need our x,y values
to cycle seamlessly, returning to where they started.
We can make our x,y coordinates loop by choosing u,v values in a circle.
Because the u,v values return to their starting point in the continuous simplex
noise, the x,y coordinates will return as well. We use two circles: one for the
x coordinates and another for the y. The circles are far from each other to
keep x and y independent of each other. The size of the circle is determined by
the distance we want for each step and how many steps we want in the loop.
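Here's a rough sketch of circular sampling, where noise2(u, v) is a stand-in
for whatever simplex implementation you use (returning values in -1 to 1):

import math

def looping_coords(noise2, nframes, step=0.02, center=(0.0, 0.0)):
    # Walk a circle in (u, v) space whose circumference is nframes * step,
    # so each frame moves about `step` through the noise and frame nframes
    # lands exactly back on frame 0.
    radius = nframes * step / (2 * math.pi)
    for frame in range(nframes):
        angle = 2 * math.pi * frame / nframes
        u = center[0] + radius * math.cos(angle)
        v = center[1] + radius * math.sin(angle)
        yield noise2(u, v)

# One circle for the x coordinates, another far-away circle for the y
# coordinates, so the two stay independent:
#   xs = list(looping_coords(noise2, 100, center=(0, 0)))
#   ys = list(looping_coords(noise2, 100, center=(1000, 1000)))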
Here are three point paths created two ways, with linear sampling on the
right and circular sampling on the left. Because simplex provides values between
-1 and 1, the points wander within a square:
It can get a bit confusing at this point: these traces are not the curves we
are drawing. They are the paths of the control points for successive curves. We
draw curves through corresponding sets of points to get our animation. The first
curve connects the first red/green/blue points, the second curve connects the
second set, and so on.
Using circular sampling of the simplex noise, we can make animations that
loop perfectly:
Colophon
If you are interested, the code is available on GitHub at
nedbat/fluidity.
An exploration and explanation of how to generate interesting swoopy art.
I saw a generative art piece I liked and wanted to learn how it was made.
Starting with the artist’s Kotlin code, I dug into three new algorithms, hacked
together some Python code, experimented with alternatives, and learned a lot.
Now I can explain it to you.
I love how these lines separate and reunite. And the fact that I can express this idea in 3 or 4 lines of code.
For me they’re lives represented by closed paths that end where they started, spending part of the journey together, separating while we go in different directions and maybe reconnecting again in the future.
The drawing is made by choosing 10 random points, drawing a curve through
those points, then slightly scooching the points and drawing another curve.
There are 40 curves, each slightly different than the last. Occasionally
the next curve makes a jump, which is why they separate and reunite.
Eventually I made something similar:
Along the way I had to learn about three techniques I got from the Kotlin
code: Hobby curves, Hilbert sorting, and simplex noise.
Each of these algorithms tries to do something “natural” automatically, so
that we can generate art that looks nice without any manual steps.
Hobby curves
To draw swoopy curves through our random points, we use an algorithm
developed by John Hobby as part of Donald Knuth’s Metafont type design system.
Jake Low has a great interactive page for playing with Hobby
curves, you should try it.
Here are three examples of Hobby curves through ten random points:
The curves are nice, but kind of a scribble, because we’re joining points
together in the order we generated them (shown by the green lines). If you
asked a person to connect random points, they wouldn’t jump back and forth
across the canvas like this. They would find a nearby point to use next,
producing a more natural tour of the set.
We’re generating everything automatically, so we can’t manually intervene
to choose a natural order for the points. Instead we use Hilbert sorting.
Hilbert sorting
The Hilbert space-filling fractal visits every square in a 2D grid.
Hilbert sorting uses a Hilbert fractal traversing
the canvas, and sorts the points by when their square is visited by the fractal.
This gives a tour of the points that corresponds more closely to what people
expect. Points that are close together in space are likely (but not guaranteed)
to be close in the ordering.
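Here's a rough sketch of the idea using the hilbertcurve package (not
necessarily the exact code behind these images): map each point to its
distance along the curve, then sort by that distance:

from hilbertcurve.hilbertcurve import HilbertCurve

def hilbert_sort(points, iterations=5):
    # Sort (x, y) points with coordinates in [0, 1) by their position along
    # a Hilbert curve covering a (2**iterations by 2**iterations) grid.
    side = 2 ** iterations
    curve = HilbertCurve(iterations, 2)  # iterations, dimensions
    def visit_order(pt):
        # Snap the point to its grid square, then ask when the curve visits it.
        square = [min(int(c * side), side - 1) for c in pt]
        return curve.distance_from_point(square)
    return sorted(points, key=visit_order)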
If we sort the points using Hilbert sorting, we get much nicer curves. Here
are the same points as last time:
Here are pairs of the same points, unsorted and sorted side-by-side:
If you compare closely, the points in each pair are the same, but the sorted
points are connected in a better order, producing nicer curves.
Simplex noise
Choosing random points would be easy to do with a random number generator,
but we want the points to move in interesting graceful ways. To do that, we use
simplex noise. This is a 2D function (let’s call the inputs u and v) that
produces a value from -1 to 1. The important thing is the function is
continuous: if you sample it at two (u,v) coordinates that are close together,
the results will be close together. But it’s also random: the continuous curves
you get are wavy in unpredictable ways. Think of the simplex noise function as
a smooth hilly landscape.
To get an (x,y) point for our drawing, we choose a (u,v) coordinate to
produce an x value and a completely different (u,v) coordinate for the y. To
get the next (x,y) point, we keep the u values the same and change the v values by
just a tiny bit. That makes the (x,y) points move smoothly but interestingly.
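A rough sketch of that scheme, again with noise2(u, v) standing in for your
simplex implementation:

import random

def wandering_points(noise2, npoints, nsteps, step=0.01):
    # Each coordinate of each point gets its own fixed u "row" in the noise;
    # every step nudges v a tiny bit, so the (x, y) values drift smoothly.
    rows = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(npoints)]
    for i in range(nsteps):
        v = i * step
        yield [(noise2(ux, v), noise2(uy, v)) for ux, uy in rows]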
Here are the trails of four points taking 50 steps using this scheme:
If we use seven points taking five steps, and draw curves through the seven
points at each step, we get examples like this:
I’ve left the points visible, and given them large steps so the lines are
very widely spaced to show the motion. Taking out the points and drawing more
lines with smaller steps gives us this:
With 40 lines drawn wider with some transparency, we start to see the smoky
fluidity:
Jumps
In his Mastodon post, aBe commented on the separating of the lines as one of
the things he liked about this. But why do they do that? If we are moving the
points in small increments, why do the curves sometimes make large jumps?
The first reason is because of Hobby curves. They do a great job drawing a
curve through a set of points as a person might. But a downside of the
algorithm is sometimes changing a point a small amount makes the entire curve
take a different route. If you play around with the interactive examples on
Jake Low’s page you will see the curve can unexpectedly
take a different shape.
As we inch our points along, sometimes the Hobby curve jumps.
The second reason is due to Hilbert sorting. Each of our lines is sorted
independently of how the previous line was sorted. If a point’s small motion
moves it into a different grid square, it can change the sorting order, which
changes the Hobby curve even more.
If we sort the first line, and then keep that order of points for all the
lines, the result has fewer jumps, but the Hobby curves still act
unpredictably:
Colophon
This was all done with Python, using other people’s implementations of the
hard parts:
hobby.py,
hilbertcurve, and
super-simplex. My code
is on GitHub
(nedbat/fluidity), but it’s a
mess. Think of it as a woodworking studio with half-finished pieces and wood
chips strewn everywhere.
A lot of the learning and experimentation was in
my Jupyter
notebook. Part of the process for work like this is playing around with
different values of tweakable parameters and seeds for the random numbers to get
the effect you want, either artistic or pedagogical. The notebook shows some of
the thumbnail galleries I used to pick the examples to show.
I went on to play with animations, which led to other learnings, but those
will have to wait for another blog post.
Update: I animated these in Natural cubics, circular Simplex.
People should spend less time learning DSA, more time learning testing.
I see new learners asking about “DSA” a lot. Data Structures and Algorithms
are of course important: considered broadly, they are the two ingredients that
make up all programs. But in my opinion, “DSA” as an abstract field of study
is over-emphasized.
I understand why people focus on DSA: it’s a concrete thing to learn about,
there are web sites devoted to testing you on it, and most importantly, because
job interviews often involve DSA coding questions.
Before I get to other opinions, let me make clear that anything you can do to
help you get a job is a good thing to do. If grinding
leetcode will land you a position, then do it.
But I hope companies hiring entry-level engineers aren’t asking them to
reverse linked lists or balance trees. Asking about techniques that can be
memorized ahead of time won’t tell them anything about how well you can work.
The stated purpose of those interviews is to see how well you can figure out
solutions, in which case memorization will defeat the point.
The thing new learners don’t understand about DSA is that actual software
engineering almost never involves implementing the kinds of algorithms that
“DSA” teaches you. Sure, it can be helpful to work through some of these
puzzles and see how they are solved, but writing real code just doesn’t involve
writing that kind of code.
Here is what I think in-the-trenches software engineers should know about
data structures and algorithms:
Data structures are ways to organize data. Learn some of the basics: linked
list, array, hash table, tree. By “learn” I mean understand what it does
and why you might want to use one.
Different data structures can be used to organize the same data in different
ways. Learn some of the trade-offs between structures that are similar.
Algorithms are ways of manipulating data. I don’t mean named algorithms
like Quicksort, but algorithms as any chunk of code that works on data and
does something with it.
How you organize data affects what algorithms you can use to work with the
data. Some data structures will be slow for some operations where another
structure will be fast.
Python has a number of built-in data structures. Learn how they work, and
the time complexity of their operations (there's a small example after this
list).
Learn how to think about your code to understand its time complexity.
Read a little about more esoteric things like Bloom
filters, so you can find them later in the unlikely case you need them.
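To make the complexity point concrete, here's a tiny illustration: membership
tests are O(n) for a list but roughly O(1) for a set:

import timeit

items_list = list(range(100_000))
items_set = set(items_list)

# Looking for a value near the end of the list scans the whole list;
# the set finds it by hashing.
print(timeit.timeit("99_999 in items_list", globals=globals(), number=1_000))
print(timeit.timeit("99_999 in items_set", globals=globals(), number=1_000))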
Here are some things you don’t need to learn:
The details of a dozen different sorting algorithms. Look at two to see
different ways of approaching the same problem, then move on.
The names of “important” algorithms. Those have all been implemented for
you.
The answers to all N problems on some quiz web site. You won’t be asked
these exact questions, and they won’t come up in your real work. Again: try a
few to get a feel for how some algorithms work. The exact answers are not what
you need.
Of course some engineers need to implement hash tables, or sorting algorithms
or whatever. We love those engineers: they write libraries we can use off the
shelf so we don’t have to implement them ourselves.
There have been times when I implemented something that felt like An
Algorithm (for example, Finding fuzzy floats), but it was
more about considering another perspective on my data, looking at the time
complexity, and moving operations around to avoid quadratic behavior. It wasn’t
opening a textbook to find the famous algorithm that would solve my problem.
Again: if it will help you get a job, deep-study DSA. But don’t be
disappointed when you don’t use it on the job.
If you want to prepare yourself for a career, and also stand out in job
interviews, learn how to write tests:
This will be a skill you use constantly. Real-world software means writing
tests much more than school teaches you to.
In a job search, testing experience will stand out more than DSA depth. It
shows you’ve thought about what it takes to write high-quality software instead
of just academic exercises.
It’s not obvious how to test code well. It’s a puzzle and a problem to
solve. If you like figuring out solutions to tricky questions, focus on how to
write code so that it can be tested, and how to test it.
Testing not only gives you more confidence in your code, it helps you write
better code in the first place.
Testing applies everywhere, from tiny bits of code to entire architectures,
assisting you in design and implementation at all scales.
If pursued diligently, testing is an engineering discipline in its own
right, with a fascinating array of tools and techniques.
A proof-of-concept tool for finding unneeded coverage.py exclusion pragmas.
To answer a long-standing coverage.py feature request, I
threw together an experiment: a tool to identify lines that have been excluded
from coverage, but which were actually executed.
The program is a standalone file in the coverage.py repo. It is unsupported.
I’d like people to try it to see what they think of the idea. Later we can
decide what to do with it.
To try it: copy warn_executed.py from
GitHub. Create a .toml file that looks something like this:
# Regexes that identify excluded lines:
warn-executed = [
    "pragma: no cover",
    "raise AssertionError",
    "pragma: cant happen",
    "pragma: never called",
]

# Regexes that identify partial branch lines:
warn-not-partial = [
    "pragma: no branch",
]
These are exclusion regexes that you’ve used in your coverage runs. The
program will print out any line identified by a pattern and that ran during your
tests. It might be that you don’t need to exclude the line, because it ran.
In this file, none of your coverage settings or the default regexes are
assumed: you need to explicitly specify all the patterns you want flagged.
Run the program with Python 3.11 or higher, giving the name of the coverage
data file and the name of your new TOML configuration file. It will print the
lines that might not need excluding:
$ python3.12 warn_executed.py .coverage warn.toml
The reason for a new list of patterns instead of just reading the existing
coverage settings is that some exclusions are “don’t care” rather than “this
will never happen.” For example, I exclude “def __repr__” because some
__repr__’s are just to make my debugging easier. I don’t care if the test suite
runs them or not. It might run them, so I don’t want it to be a warning that
they actually ran.
This tool is not perfect. For example, I exclude “if TYPE_CHECKING:” because
I want that entire clause excluded. But the if-line itself is actually run. If
I include that pattern in the warn-executed list, it will flag all of those
lines. Maybe I’m forgetting a way to do this: it would be good to have a way to
exclude the body of the if clause while understanding that the if-line itself is
executed.
Pytest’s parametrize feature is powerful but it looks scary. I hope this step-by-step explanation helps people use it more.
Writing tests can be difficult and repetitive. Pytest has a feature called
parametrize that can reduce duplication, but it can be hard to
understand if you are new to the testing world. It’s not as complicated as it
seems.
Let’s say you have a function called add_nums() that adds up a list of
numbers, and you want to write tests for it. Your tests might look like
this:
def test_123():
    assert add_nums([1, 2, 3]) == 6

def test_negatives():
    assert add_nums([1, 2, -3]) == 0

def test_empty():
    assert add_nums([]) == 0
This is great: you’ve tested some behaviors of your add_nums()
function. But it’s getting tedious to write out more test cases. The names of the
functions have to be different from each other, and they don't mean anything, so
it’s extra work for no benefit. The test functions all have the same structure,
so you’re repeating uninteresting details. You want to add more cases but it
feels like there’s friction that you want to avoid.
If we look at these functions, they are very similar. In any software, when
we have functions that are similar in structure, but differ in some details, we
can refactor them to be one function with parameters for the differences. We can
do the same for our test functions.
Here the functions all have the same structure: call add_nums() and
assert what the return value should be. The differences are the list we pass to
add_nums() and the value we expect it to return. So we can turn those
into two parameters in our refactored function:
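def test_add_nums(nums, expected_total):
    assert add_nums(nums) == expected_total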
Unfortunately, tests aren’t run like regular functions. We write the test
functions, but we don’t call them ourselves. That’s the reason the names of the
test functions don’t matter. The test runner (pytest) finds functions named
test_* and calls them for us. When they have no parameters, pytest can
call them directly. But now that our test function has two parameters, we have
to give pytest instructions about how to call it.
To do that, we use the @pytest.mark.parametrize decorator. Using it
looks like this:
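import pytest

@pytest.mark.parametrize(
    "nums, expected_total",
    [
        ([1, 2, 3], 6),
        ([1, 2, -3], 0),
        ([], 0),
    ],
)
def test_add_nums(nums, expected_total):
    assert add_nums(nums) == expected_total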
There’s a lot going on here, so let’s take it step by step.
If you haven’t seen a decorator before, it starts with @ and is like a
prologue to a function definition. It can affect how the function is defined or
provide information about the function.
The parametrize decorator is itself a function call that takes two arguments.
The first is a string (“nums, expected_total”) that names the two arguments to
the test function. Here the decorator is instructing pytest, “when you call
test_add_nums, you will need to provide values for its nums and expected_total parameters."
The second argument to parametrize is a list of the values to supply
as the arguments. Each element of the list will become one call to our test
function. In this example, the list has three tuples, so pytest will call our
test function three times. Since we have two parameters to provide, each
element of the list is a tuple of two values.
The first tuple is ([1, 2, 3], 6), so the first time pytest calls
test_add_nums, it will call it as test_add_nums([1, 2, 3], 6). All together,
pytest will call us three times, like this:
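test_add_nums([1, 2, 3], 6)
test_add_nums([1, 2, -3], 0)
test_add_nums([], 0)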
This will all happen automatically. With our original test functions, when
we ran pytest, it showed the results as three passing tests because we had three
separate test functions. Now even though we only have one function, it still
shows as three passing tests! Each set of values is considered a separate test
that can pass or fail independently. This is the main advantage of using
parametrize instead of writing three separate assert lines in the body of a
simple test function.
What have we gained?
We don’t have to write three separate functions with different names.
We don’t have to repeat the same details in each function (assert,
add_nums(), ==).
The differences between the tests (the actual data) are written succinctly
all in one place.
Adding another test case is as simple as adding another line of data to the
decorator.