
Ghosts in the machine / 2023-10-19

So much news, so little time.
A fantastically carved jack-o'-lantern, next to an unnatural/clearly-AI-generated pumpkin vine. Via Midjourney.

Welcome new subscribers!

You might start with my essays on why machine learning might be as big as the printing press (and how we underrate the impact of the printing press!) and on why I am not (yet) a stickler for precise usage of “open” in this space.

Recordings: past

  • The Open Source Initiative has published videos of its recent deep dive series, including me on the RAIL licenses (where, despite what I said above, I spent a lot of time being a stickler about open).

Events and recordings: future

Values

In this section: what values have helped define open? are we seeing them in ML?

WTF is open anyway?

I’m not the only one looking at the possible facets that get bundled into “what is open anyway”—because lots of things we assumed came together in the past may no longer be that way, unless we do it consciously. Here are a few recent lists of those facets:

  • Here’s one spectrum, though I’m not sure I agree with the conclusions (and not sure it’s even a spectrum).
  • Here's a wildly detailed chart that not only provides a list of facets of openness, but analyzes a pile of actually-existing models by that list.

Lowers barriers to technical participation

This giant list of open LLMs is a good starting place for looking at the actual shape of open(ish)-in-the-wild, with the changelog showing how the space is growing. But I can’t stress enough that this list is based on self-reported “open” and so may not actually grant standard-OSI-open rights.

Enables broad participation in governance

More about this below, but it’s interesting to call out this one slide from a Wikimania talk on ML at Wikipedia—about how they’re thinking about model cards as part of community governance of models.

Chris Albon of WMF has talked more about cards-as-governance as well; I’m super-curious to see how this evolves.
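
For the curious, here’s a minimal sketch of what treating cards as governance surfaces can look like programmatically, using the huggingface_hub library. The repo id and the license allowlist are illustrative placeholders, not anything WMF actually uses:

```python
from huggingface_hub import ModelCard

# Illustrative placeholders; swap in a real repo id and whatever
# license list your community has actually agreed on.
REPO_ID = "bigscience/bloom"
ACCEPTABLE_LICENSES = {"apache-2.0", "mit", "bigscience-bloom-rail-1.0"}

# ModelCard.load fetches the card (the README plus its YAML
# metadata) from the Hugging Face Hub and parses it.
card = ModelCard.load(REPO_ID)

# card.data holds the structured metadata: license, tags, datasets,
# evaluation results, and so on.
metadata = card.data.to_dict()
license_id = metadata.get("license")

if license_id in ACCEPTABLE_LICENSES:
    print(f"{REPO_ID}: license {license_id!r} is on the agreed list")
else:
    print(f"{REPO_ID}: license {license_id!r} needs community review")
```

Because the card lives in the model’s repo, it is versioned and reviewable like any other artifact, which is part of what makes it plausible as a governance tool.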

Improves public ability to understand (transparency-as-culture)

This section has often been empty/missing in these newsletters, perhaps indicative of the lack of focus on this area in the broader community?

But this paper is an interesting example of how ML’s architecture, so different from traditional software’s, makes reverse-engineering—and so accountability/transparency—possible. It reverse-engineers CLIP (a key part of image training models) to help understand and improve it. I expect we’ll see more of this (though more next week on what kind of research gets prioritized, or not, and what that says about the prospects of open).
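
As a toy illustration of why this kind of inspection is possible at all: CLIP’s weights are openly available, so anyone can load the model and probe how it relates images to text. A minimal sketch using the transformers library (the checkpoint is the standard OpenAI release; none of this is from the paper itself, which goes far deeper):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The openly released OpenAI checkpoint; being able to download and
# run it is the precondition for reverse-engineering work like this.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A blank stand-in image; in practice you would load real images.
image = Image.new("RGB", (224, 224), color="orange")
texts = ["a pumpkin", "a ghost", "a machine"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image scores the image against each text prompt.
probs = outputs.logits_per_image.softmax(dim=-1)
for text, p in zip(texts, probs[0].tolist()):
    print(f"{text}: {p:.3f}")
```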

Shifts power

Techniques

In this section: open software defined, and was defined by, new techniques in software development. What parallels are happening in ML?

Deep collaboration

  • Moderation is part of how communities grow, so here’s Hugging Face announcing their new community policy. It emphasizes consent—which is somewhat weird in a scrape-first, ask-questions-later industry. Consent for me but not for thee?
  • Nathan Lambert has been on fire lately; here he is from earlier in the summer on the path-dependency of different LLM talent pools, which I think is going to be very interesting to watch—just as open source software really appealed to hard-core systems guys (using guys on purpose here!) and failed to attract UX researchers, it seems certain that the shape of the humans who work on this will affect the shape of the outputs. More on this next week, via a paper that hit while I was writing this…

Model improvement

Transparency-as-technique

Open access does not automatically create transparency—you need to be able not just to access, but also to evaluate. Some links on how the space is maturing here:

Availability

  • Wikipedia has a good presentation on how they’re using Machine Learning. Interestingly for the availability point, for translation, they’re moving away from third-party hosted models and towards self-hosted models—something that would have been difficult to picture a few years ago. But they also note that their new LLM-based spam-prevention tooling is “heavy on computation resources”.
  • Allen Institute drops a truckload of training data under an open(ish) “medium-risk” license. Good news: the documentation is, as best as I can tell, best-in-class. Bad news: there’s a lot of it and I haven’t fully digested it yet.
  • Meta steamrolls on with a new translation model and a new (very powerful?) code model.

Joys

In this section: open is at its best when it is joyful. ML has both joy, and darker counter-currents—let’s discuss them.

Humane

  • One of my favorite areas for open content is so-called “Galleries, Libraries, Archives, and Museums”—GLAM. Here’s a guide from Hugging Face on using ML for GLAM.

Radical

  • I’ve been interested in technology as a tool for democratic deliberation since the late 1990s, and here’s an attempt at using “LLMs for Scalable Deliberation”. I don’t think you can reason your way out of the deep bias problems here (that’s what torpedoed my writing about it in 2000-2001), but it’s still an interesting thought experiment.
  • Using LLMs to generate SQL is the kind of thing that might be very empowering for a lot of people who want to query data but don’t know SQL. Or (given that state-of-the-art accuracy is only 80%) it might drive a lot of frustration; a sketch of one guard-rail pattern follows this list. In either case, note that this is being done with open(ish) foundation models; I suspect Snowflake would otherwise not have the capacity to do this.
  • Today in impacting the “real world”: using models to reduce the climate impact of plane flights.
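
On that accuracy point: roughly one in five generated queries being wrong means any real deployment needs a guard rail between the model and the database. Here’s a minimal sketch of one such pattern, validating generated SQL against SQLite’s planner before executing it; generate_sql is a hypothetical stand-in for whatever model call you would actually make:

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Hypothetical stand-in for an LLM call; a real system would
    # prompt a model with the question plus the database schema.
    return "SELECT name, total FROM orders WHERE total > 100"

def safe_query(db_path: str, question: str) -> list:
    sql = generate_sql(question)
    conn = sqlite3.connect(db_path)
    try:
        # EXPLAIN QUERY PLAN parses and plans the statement without
        # running it, so syntax errors and missing tables/columns
        # surface here rather than against live data.
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return conn.execute(sql).fetchall()
    except sqlite3.Error as err:
        raise ValueError(f"model produced invalid SQL: {err}") from err
    finally:
        conn.close()
```

Note the limits: this catches queries that fail to parse or reference missing tables, but a syntactically valid, semantically wrong query sails straight through, which is exactly where the frustration would come from.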

Changes

In this section: ML is going to change open—not just how we understand it, but how we practice it.

Creating new things

I’ve enjoyed toying with this new ML tool from a team of Mozilla vets, focused on a fun frontend for customization. Recommend playing with it. The core insight: “view and edit prompt” now has a lot of the same vibes as “view source” did in the early days of the web.

Ethically-focused practitioners

Changing regulatory landscape

Misc.