
Themes and Conferences per Pacoid, Episode 11

Paco Nathan | 2019-07-02 | 20 min read


Paco Nathan's latest article covers program synthesis, AutoPandas, model-driven data queries, and more.

Introduction

Welcome back to our monthly burst of themespotting and conference summaries. BTW, videos for Rev2 are up: https://rev.dominodatalab.com/rev-2019/

On deck this time ’round the Moon: program synthesis. In other words, using metadata to generate code. In this case, code gets generated for data preparation, where so much of the “time and labor” in data science work is concentrated. SQL optimization provides helpful analogies, given how SQL queries get translated into query graphs internally, and then the real smarts of a SQL engine work over that graph. One of the longer-term trends that we’re seeing with Airflow, and so on, is to externalize graph-based metadata and leverage it beyond the lifecycle of a single SQL query, making our workflows smarter and more robust. Let’s save the details for another time, but in brief this work leads toward knowledge graphs about datasets, and new features coming in Jupyter support that direction.

Program Synthesis

If you haven’t seen AutoPandas yet, go check it out. The gist is this: given a pandas.DataFrame, suppose you need to aggregate the data by performing a `groupby()` on one of the columns and then calculating the means of the grouped values. You can provide “before” and “after” versions of a DataFrame to AutoPandas as an input-output example pair. Then click the Synthesize button in AutoPandas to get back a snippet of code in Python. Done and done.
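To make that concrete, here’s roughly what such an input-output pair looks like, along with the snippet that gets synthesized. To be clear, the column names and values below are my own made-up example, not taken from the AutoPandas demo:

```python
import pandas as pd

# "Before": the input example you provide to AutoPandas
before = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "sales":  [100, 200, 50, 150],
})

# "After": the output example you provide
after = pd.DataFrame(
    {"sales": [150.0, 100.0]},
    index=pd.Index(["east", "west"], name="region"),
)

# The kind of snippet AutoPandas aims to synthesize from that pair
result = before.groupby("region").mean()
assert result.equals(after)
```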

As background for program synthesis, it seems a bit strange that we program computers by coding detailed instructions. Armies of software engineers, heads down in their IDEs 24/7—does it really have to be this way? Don Knuth famously posed a related question in his “Literate Programming” article:

"Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

BTW, that Knuth article from 1983 was probably the first time that I ever saw the word “Web” used with a computer-related meaning. No idea whether Tim Berners-Lee riffed off that when he created the World Wide Web. In any case, Knuth’s description of literate programming has served to guide lots of subsequent work, including Project Jupyter. The end goal is to have people explain what they need to do, then have computers generate, or synthesize, the code to perform what’s needed. Doesn’t this seem like a worthy goal for machine learning—to make the machines learn to work more effectively? That’s the gist of program synthesis.

Much of the early work in program synthesis tended to focus on domain-specific languages (DSLs) since those are relatively constrained problem spaces. For example, you can write a BASIC programming language interpreter as a small DSL in Scala. In practice, Scala is popular for implementing a wide variety of DSLs.

However, AutoPandas has a bolder agenda. They set out to implement program synthesis for “wide” APIs, not the “narrow” constrained DSLs. This has implications for data science work, where so much of the heavy lifting of data preparation gets done in libraries like pandas, NumPy, etc., which seemingly fit the intentions of the AutoPandas research.

AutoPandas: Origins

AutoPandas was created at UC Berkeley RISElab and the general idea is described in the NeurIPS 2018 paper “Neural Inference of API Functions from Input–Output Examples” by Rohan Bavishi, Caroline Lemieux, Neel Kant, Roy Fox, Koushik Sen, and Ion Stoica. See also: Caroline Lemieux’s slides for that NeurIPS talk, and Rohan Bavishi’s video from the RISE Summer Retreat 2019.

The authors of AutoPandas observed that:

  • The APIs for popular data science packages tend to have relatively steep learning curves.
  • People look toward online resources such as StackOverflow to find out how to use APIs when the documentation doesn’t have an example that fits.
  • Those online resources can be less than ideal for answering questions: versioning issues, stale answers, rudeness/trolling, etc.

Instead, program synthesis can address these issues. In other words, people could describe their intent for how they want to use an API by showing examples and then receive back synthesized code. In the paper and video, the authors describe how they want to “Automate StackOverflow for APIs” with specific goals:

  • Develop a program synthesis engine for realistic, wide APIs (not narrow DSLs).
  • Scale the problem to handle complex data structures.
  • Scale the problem to synthesize for hundreds of functions, and thousands of potential arguments.

Their approach is based on using components of deep learning, specifically with graph embedding:

"At a high-level, our method works by (1) preprocessing I/O examples into a graph, (2) feeding these examples into a trainable neural network which learns a high-dimensional representation for each node of the graph, and (3) pooling to output of the neural network and applying softmax to select a pandas function. After this, we use exhaustive search to find the correct arguments; those details are beyond the scope of this paper."

For the training and evaluation data, they use examples from StackOverflow, Python for Data Analysis by Wes McKinney (who created pandas), and a Data School video series. In the first part of the problem, a graph gets used to represent the input-output examples of DataFrames. In the second part of the problem, some search is involved and the potential search space can be large and complex. Even for simple “depth 1” parse trees, the search space has a 10^17 branching factor. Early, brute-force attempts by the RISElab researchers were able to reduce that search space down to 10^4—reduced by 13 orders of magnitude. That’s impressive. Using ML models to search more effectively brought the search space down to 10^2—which can run on modest hardware. That’s even more impressive.

AutoPandas: Results

So far AutoPandas supports 119 pandas.DataFrame transformation functions and the results for its synthesized code are compelling. Going back to StackOverflow, the authors found they could solve for 65% of the pandas.DataFrame questions. That’s based on “top-1” evaluation of results, meaning that the best predicted output of synthesized code must match the selected answer on StackOverflow. Alternatively, in terms of “success rate” evaluation of results, AutoPandas predicted correct functional equivalents for 82% of the StackOverflow answers.

In terms of architecture, the front-end for AutoPandas runs on serverless computing, while its back-end is based on Ray. Part of the back-end processing needs deep learning (graph embedding) while other parts make use of reinforcement learning.
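As a minimal sketch of why Ray fits a search-heavy back-end like this (this is my own illustration, not the AutoPandas code), candidate programs can be evaluated against the output example in parallel as remote tasks:

```python
import ray
import pandas as pd

ray.init()

@ray.remote
def try_candidate(code, before, target):
    # Evaluate one synthesized snippet against the output example.
    # exec() over generated code is purely for illustration here.
    scope = {"df": before, "pd": pd}
    try:
        exec(code, scope)
        return code if scope["result"].equals(target) else None
    except Exception:
        return None

before = pd.DataFrame({"region": ["east", "west"], "sales": [1, 2]})
target = before.groupby("region").mean()

candidates = [
    "result = df.groupby('region').mean()",
    "result = df.pivot(columns='region')",
]
hits = ray.get([try_candidate.remote(c, before, target) for c in candidates])
print([h for h in hits if h])   # candidates whose output matched the example
```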

Looking ahead, the AutoPandas team intends to extend this system to synthesize longer programs, and may move this into open source. The approach they’ve used applies to other popular data science APIs such as NumPy, TensorFlow, and so on. This field is guaranteed to get interesting.

A Sampler of Program Synthesis

For more background about program synthesis, check out “Program Synthesis Explained” by James Bornholt from 2015, as well as the more recent “Program Synthesis in 2017-18” by Alex Polozov from 2018.

Here’s a sampler of related papers and articles if you’d like to dig in further:

All together, these links should provide some light summer reading fun.

BTW, to minimize possible confusion, note that there are other Python projects called “autopandas” out in the wild that are not related to the RISElab work:

Let’s talk about SQL

Speaking of DSLs, one of the most recognizable DSLs is SQL. The internals of SQL have similarities with program synthesis and help point toward an emerging theme for open source used for data science work.

SQL is a declarative language: it focuses on describing “what” must be performed—what a result set should look like, not the low-level specifics of “how” that work should be performed. To build a SQL query, one describes the data sources involved and the high-level operations (SELECT, JOIN, WHERE, etc.) requested. Then a query engine gets busy, with much happening under the covers: query history and stats inform various optimization strategies; indexes get used automagically; intermediate results get cached; algorithm variants get substituted; and so on.

To make all those magical transformations happen, first the SQL engine parses a query, generating a query graph, which is a directed acyclic graph (DAG). Then the query engine identifies the tables to use (aka “relations” if you go back to the original paper by Edgar Codd). The query graph provides metadata that gets leveraged for optimizations at multiple layers of the relational database stack. Ultimately, some code gets generated to manipulate the data in the tables and indexes, with a result set as output if all goes well. That’s not entirely unlike program synthesis, but way more OG.
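You can peek at a small slice of that machinery from Python. Here’s a minimal sketch using the built-in sqlite3 module (the table and index are made up), where `EXPLAIN QUERY PLAN` shows the plan the engine derives from a declarative query:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, region TEXT, total REAL)")
con.execute("CREATE INDEX idx_region ON orders (region)")

# Declare *what* you want; the engine decides *how* (note which index it picks)
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT region, SUM(total) FROM orders WHERE region = 'east' GROUP BY region"
).fetchall()

for row in plan:
    print(row)   # e.g. (..., 'SEARCH orders USING INDEX idx_region (region=?)')
```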

A happy, productive SQL developer usually doesn’t see that magic happening, focusing instead on declaring what the result set should be, not so much on under-the-covers minutiae. That speaks to SQL’s remarkably gentle learning curve: oh-so-much data munging can be performed without having to sweat the details. Clearly, SQL helps reduce cognitive load for those who are learning about data management and analytics. The long history and pervasiveness of SQL have helped make data-driven work much more accessible to a wider audience.

SQL and Spark

For a thorough view of how SQL optimizers work, take a look at the Catalyst optimizer in Spark SQL. For details, see their SIGMOD 2015 paper where Michael Armbrust & co. spectacularly knocked one out of the park:

"At the core of Spark SQL is the Catalyst optimizer, which leverages advanced programming language features (e.g. Scala’s pattern matching and quasi quotes) in a novel way to build an extensible query optimizer."

Spark programs are DAGs by definition, designed to use a wide variety of data sources. Spark programs also have relatively interchangeable notions of “DataFrame” (not exactly pandas.DataFrame, but close) and SQL query result sets. As the Catalyst paper states, it is “tailored for the complex needs of modern data analysis,” which comes in quite handy.
For example, suppose you have a bunch of customer profile records in JSON loaded through some ETL process, and you need to join those with log files stored in Parquet. Odds are high that those log files have more columns than you’ll need for most queries. Catalyst is smart enough to infer across that DAG that it can use pushdown predicates with the Parquet data, only deserializing the exact columns needed for the query. Less data gets decompressed, deserialized, loaded into memory, run through the processing, etc. Key takeaway: smart work with graphs and metadata about specific datasets can lead to optimizations at multiple levels. That’s good stuff.
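Here’s a hedged sketch of that scenario in PySpark; the paths and column names are hypothetical, and calling `explain()` on the joined DataFrame surfaces the pushed filters and pruned read schema that Catalyst works out from the query graph:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("catalyst_pushdown_sketch").getOrCreate()

profiles = spark.read.json("s3://your-bucket/customer_profiles/")   # hypothetical path
logs = spark.read.parquet("s3://your-bucket/access_logs/")          # wide Parquet table

joined = (
    logs.filter(logs.status == 500)               # predicate Catalyst can push into the Parquet scan
        .join(profiles, "customer_id")
        .select("customer_id", "segment", "url")  # only these columns need deserializing
)

joined.explain()   # look for PushedFilters and the pruned ReadSchema in the physical plan
```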

Model-Driven Data Queries

As Tim Kraska et al. pointed out in “The Case for Learned Index Structures” (see video), the internal smarts (B-trees, etc.) of relational databases represent early forms of machine learning. Moreover, they can be replaced with machine learning models to improve performance dramatically:

"We have demonstrated that machine learned models have the potential to provide significant benefits over state-of-the-art indexes, and we believe this is a fruitful direction for future research."

The research in Learned Indexes highlights another interesting point. One of the trade-offs about SQL and its declarative nature—in other words, its ease of use for such a wide variety of applications—is that much of the “smarts” in the system happens under the covers. As the theory goes, if you centralize all of your data processing within the same relational database management system, then that system can collect metadata (e.g., query stats) about your use cases and leverage that as feedback for the SQL optimizer.

Of course, if you use several different data management frameworks within your data science workflows—as just about everybody does these days—much of that RDBMS magic vanishes in a puff of smoke. Some may ask: “Can’t we all just go back to the glory days of business intelligence, OLAP, and enterprise data warehouses?” Nope, that genie is out of the bottle. One clear lesson of the early 21st century: strategies at scale that rely on centralization are generally risky (John Robb explores that in detail in Brave New War, which I’ve just been reading – good stuff). One-size-fits-all (OSFA) is a harmful myth in data science work, and no one data framework will fit all of your use cases and needs.

Apache Spark innovated on integrating a wide range of data sources and sinks, especially for unstructured data, and on structuring the “application code” as SQL statements, with their result sets becoming DataFrames. Even so, there is a problem: Spark, much like any other SQL engine, recreates a query graph every time it runs a program. That represents runtime overhead. The query graph gets used for optimizing the generated application code in terms of parallelization, compute time costs, memory resources, etc. Afterwards, the graph vanishes. The benefits of this approach (lower cognitive load, etc.) get offset against increased runtime overhead and vanishing metadata. The latter could be especially valuable if it were externalized more formally.
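As a crude illustration of what externalizing that metadata might mean, you could at least capture the plan text that Spark prints and stash it alongside other lineage records. This sketch simply snapshots the `explain()` output for the `joined` DataFrame from the earlier example; a real metadata service would do far more:

```python
import io
from contextlib import redirect_stdout

# `joined` is the DataFrame from the Catalyst example above
buf = io.StringIO()
with redirect_stdout(buf):
    joined.explain(extended=True)    # logical + physical plans, printed to stdout

# Stash the plan text somewhere durable (a file here; a metadata store in practice)
with open("query_plan_snapshot.txt", "w") as f:
    f.write(buf.getvalue())
```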

Loopback

Here’s where we come full-circle. Check out the “Scythe” demo referenced above and the related paper by Chenglong Wang, Alvin Cheung, and Ras Bodik from U Washington. They created a program synthesis system for SQL queries, using input-output examples similar to AutoPandas, and also leveraging StackOverflow questions. Although not specifically cited by the AutoPandas project (apologies if I missed a reference?), several aspects of that earlier U Washington project seem remarkably similar, including the experimental design, train/test data source, and even the slides. The U Washington team developed a SQL programming-by-example system, which is an interesting way to bring even more machine learning into SQL queries. It addresses the point of externalizing some of the metadata about the use of SQL queries over time—reusing that metadata for machine learning. In contrast, the RISElab team focused on pandas.DataFrame and leveraging Ray. They address a more challenging problem (“wide” APIs, not “narrow” DSLs such as SQL), plus they take advantage of more SOTA tooling. I look forward to watching the RISElab research unfold and make lots of progress.

By this point you’re probably starting to connect the dots between Catalyst, Scythe, Learned Indexes, AutoPandas, etc. Machine learning plays a big role in this movie. Graphs also play a big role, except that SQL uses an internalized graph (read: mostly invisible to mere mortals); however, there’s a growing case for leveraging an externalized graph instead.

Externalizing Graphs about Datasets

One of the longer-term trends that we’re seeing with Airflow, etc., is to externalize graph-based metadata. Are there ways to leverage metadata about datasets to make our workflows smarter and more robust? In other words, are there uses beyond the lifecycle of a single SQL query? Given the flurry of open source projects currently at the intersection of “data catalog and lineage” meets “knowledge graph about datasets,” the answer is a resounding “YAAASSS!”

Here’s where I’d like to share some related news about Project Jupyter. There’s a new set of “Rich Context” features in development to support projects as top-level entities, real-time collaboration (à la GDocs) and commenting, data registry, metadata handling, annotations, and usage tracking.

Hypothetically speaking, suppose you have a bunch of data scientists working in Jupyter and your organization is getting serious about data governance. Suppose you wanted to track metadata about dataset usage, e.g., so that you could subsequently build a knowledge graph about datasets used in your organization, then leverage machine learning for data governance applications. With me so far? It makes sense to work through this amazingly popular open source data infrastructure as a way to infuse knowledge graph practices into mainstream IT reality.
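To make that concrete in the simplest possible way, here’s a toy sketch (my own illustration, not the Jupyter Rich Context design) that records which notebook touched which dataset as edges in a graph, using networkx; the notebook names and dataset paths are hypothetical:

```python
import datetime
import networkx as nx

usage_graph = nx.DiGraph()

def record_dataset_read(notebook, dataset_path):
    """Log one 'notebook read this dataset' edge, stamped with a time."""
    usage_graph.add_edge(
        notebook, dataset_path,
        last_read=datetime.datetime.utcnow().isoformat(),
    )

# Hypothetical reads from two different notebooks
record_dataset_read("churn_model.ipynb", "s3://corp-data/customers.parquet")
record_dataset_read("ltv_report.ipynb", "s3://corp-data/customers.parquet")

# A simple governance question answered from the graph: who touches this dataset?
print(list(usage_graph.predecessors("s3://corp-data/customers.parquet")))
```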

Better yet, Jupyter is quintessentially about open standards, so feel free to jump into the dialogue:

Plus, here’s an issue/thread on GitHub which provides more backstory.

Overall, your use of pandas.DataFrame and other popular open source APIs for data science work may become smarter and automagic soon-ish. Workflows are evolving to capture metadata about datasets over a longer lifecycle so that data science workflows themselves can become more model-driven.
