
Data Geek

Unlocking the Power of Data

 

Welcome! I'm a software architect with formal training in both computer science and physics. I have a passion for applying elegant algorithms to huge data sets. I consider myself a data geek.

I also enjoy backpacking.

This website shares highlights from my professional past and present.


I am currently the CTO at Lore IO, an early-stage data startup, where I lead product and engineering. At Lore, we believe that collaboration is the key to unlocking the power of enterprise data. Learn more at www.getlore.io.

My contributions to Lore include:

  • Hired and built product, engineering, and QA teams.

  • Led engineering in the development of a novel data preparation and analytics platform.

  • Invented and developed an early version of Lore's Schema Virtualization, used to dynamically generate complex transformations over tens of terabytes of semi-structured data drawn from thousands of sources across dozens of customers. This technology drove Lore's pivot into the ETL/data-prep space.

  • Developed product vision and roadmap. Worked to align marketing, sales, and engineering on GTM strategy.

  • Led the development of core IP including Schema Virtualization, the Semantic Layer, AQL (an SQL extension utilizing our Semantic Layer), SmartSearch (NLP-based data discovery), and Lore's Storyteller service framework, a websocket-based pub/sub API architecture serving Lore's entire multi-tenant cloud (a sketch of the subscribe pattern follows this list).
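
To give a feel for the pub/sub pattern Storyteller is built on, here is a minimal Python sketch of a websocket subscriber. The endpoint URL, "subscribe" action, topic name, and message fields are all illustrative assumptions on my part, not Lore's actual API.

    # Minimal websocket pub/sub subscriber (requires: pip install websockets).
    # The URI and message format below are hypothetical.
    import asyncio
    import json

    import websockets

    async def subscribe(uri: str, topic: str) -> None:
        async with websockets.connect(uri) as ws:
            # Announce interest in a topic; the server then pushes
            # events for that topic over the same socket.
            await ws.send(json.dumps({"action": "subscribe", "topic": topic}))
            async for raw in ws:
                event = json.loads(raw)
                print(f"[{event.get('topic')}] {event.get('payload')}")

    if __name__ == "__main__":
        asyncio.run(subscribe("wss://example.invalid/storyteller", "dataset.updated"))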

And we’re hiring! Contact me at bill@getlore.io

Join us and help build the data tools of tomorrow!

earth-hollow-conccept122.jpg

HeadSpin

I led data science and developed the initial analytics engine at HeadSpin, a Google Ventures-backed startup whose mission is to let developers instantly test mobile apps on real cell networks around the world. My contributions include designing and implementing the following:

  • A protocol-extensible data structure called a Message ARchive (MAR) along with function libraries used to capture, represent, and efficiently analyze traffic on remote cell networks.

  • A novel online query engine that exposes an SQL-like interface and employs a variety of indexes to answer questions about network traffic easily and efficiently (a toy sketch of the archive-plus-index idea follows this list).

  • A command-oriented endpoint that allows client apps to subscribe to a variety of metrics, time series, and query results for live sessions on remote cell networks.

  • Numerous algorithms for efficiently identifying and extracting potentially problematic patterns on the network.
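
For a flavor of what the archive-plus-index combination buys you, here is a toy Python sketch: a tiny in-memory "message archive" with one secondary index answering an SQL-like question without a full scan. The record fields and names are my own illustrative assumptions, not HeadSpin's actual MAR format.

    # Toy "message archive" with one secondary index; fields and names
    # are illustrative, not HeadSpin's actual MAR format.
    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Message:
        ts: float         # capture timestamp (seconds)
        host: str         # remote endpoint
        latency_ms: float

    class MessageArchive:
        def __init__(self, messages):
            self.messages = list(messages)
            # Secondary index: host -> message positions, so per-host
            # queries avoid a full scan of the archive.
            self._by_host = defaultdict(list)
            for i, m in enumerate(self.messages):
                self._by_host[m.host].append(i)

        def slow_requests(self, host, threshold_ms):
            """Roughly: SELECT * FROM mar WHERE host = ? AND latency_ms > ?"""
            return [self.messages[i] for i in self._by_host[host]
                    if self.messages[i].latency_ms > threshold_ms]

    mar = MessageArchive([
        Message(0.0, "api.example.com", 42.0),
        Message(0.5, "api.example.com", 910.0),
        Message(0.9, "cdn.example.com", 12.0),
    ])
    print(mar.slow_requests("api.example.com", 500))  # -> the 910 ms message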


I studied computer science at Stanford, specializing in Information Management and Analytics.

MSCS 2014

Coursework

I've completed graduate courses in Machine Learning, Data Mining, Natural Language Processing, Information Retrieval, and Artificial Intelligence. My cumulative GPA was 3.80. While attending Stanford I wrote copious amounts of code in Python, C, Java, MATLAB, and SQL, working with a wide variety of algorithms and techniques including MapReduce, linear and logistic regression, SVMs, Naive Bayes, decision trees, PageRank, SVD/PCA, and neural networks.

Projects

I received world-class mentorship on research that spanned multiple project courses. This work explored the problem of improving estimates of item-item similarity in the context of search engines and recommender systems. In CS 341, Mining Massive Data Sets, under the brilliant guidance of Professor Jeffrey Ullman, partners Charles Celerier and Jamie Irvine and I prototyped a system called Session Re-Rank (SRR). SRR leveraged historic data (>100M query results) and real-time user-session information to improve the quality of Walmart.com's search results. It added a re-ranking layer to the search engine that compared each result to items the user had previously expressed interest in. In this way, we improved the overall CTR on Walmart.com search results by an estimated 4.5%.

Session Re-Rank paper

Session Re-Rank code
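
The core re-ranking idea can be sketched in a few lines of Python. The additive scoring formula and weight below are illustrative assumptions of mine; the actual system is described in the paper linked above.

    # Session re-ranking sketch: boost each (item, base_score) pair by its
    # best similarity to items the user engaged with earlier in the session.
    def rerank(results, session_items, similarity, weight=0.5):
        def adjusted(pair):
            item, base = pair
            boost = max((similarity(item, s) for s in session_items), default=0.0)
            return base + weight * boost
        return sorted(results, key=adjusted, reverse=True)

    # Toy run: equal base scores, but the kayak resembles an item already
    # viewed this session, so it rises to the top.
    sim = lambda a, b: 1.0 if a.split()[-1] == b.split()[-1] else 0.0
    print(rerank([("blue tent", 1.0), ("red kayak", 1.0)],
                 session_items=["green kayak"], similarity=sim))
    # -> [('red kayak', 1.0), ('blue tent', 1.0)]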

Continuing on this theme, in the first of two quarters of independent study (CS 399), partner Jamie Irvine and I developed techniques for improving estimates of item-item similarity within collaborative filtering when challenged by data sparsity. Our advisor on this project was adjunct professor and entrepreneur Anand Rajaraman, whose insight and advice proved invaluable. We introduced two novel approaches to this problem. The first employs multiple linear regressions against a model of noisy similarity. The second utilizes latent similarity distributions and the notion of user-predictivity. Both techniques were shown to significantly improve estimates of future item-item similarity.

Item-Item Similarity Estimator paper

Item-Item Similarity Estimator code
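
To see why sparsity is the enemy here, consider the standard shrinkage remedy sketched below: damp a raw similarity in proportion to how little evidence supports it. This is only a simple stand-in for comparison, not the regression or latent-distribution methods developed in the paper above; the constant lam is a hypothetical regularization parameter.

    # Shrinkage estimator: damp similarities backed by little evidence.
    def shrunk_similarity(raw_sim: float, n_common: int, lam: float = 25.0) -> float:
        return (n_common / (n_common + lam)) * raw_sim

    # Same raw similarity, very different evidence behind it:
    print(shrunk_similarity(0.9, n_common=3))    # ~0.10: barely trusted
    print(shrunk_similarity(0.9, n_common=300))  # ~0.83: mostly trusted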

In our subsequent quarter of independent study, Jamie and I set out to develop algorithms that could make good item recommendations from the catalog of retailer A based on a user's expressed preferences for items in the catalog of retailer B. Importantly, these algorithms were limited to public data: they could leverage only data shown on the websites of retailer A and retailer B. Such algorithms could provide valuable business intelligence to retailers or power a third-party online shopping service. We succeeded in creating a novel latent-feature recommender that leverages publicly available intra-retailer recommendations to make inter-retailer recommendations, outperforming a nontrivial content-based approach.

Inter-Retailer Recommender paper

Inter-Retailer Recommender code
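
Here is a deliberately tiny Python sketch of the overall shape of the problem: bridge the two catalogs with a crude content match, then let retailer A's own public "also viewed" graph refine the candidates. The item names, Jaccard bridge, and blending weight are all illustrative assumptions, not the latent-feature model from the paper.

    # Toy inter-retailer recommender: content bridge + one hop through
    # retailer A's public intra-retailer recommendation graph.
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    def recommend(liked_b, items_a, also_viewed_a, alpha=0.3):
        # Step 1: content bridge from B-likes to A-candidates.
        base = {a: max(jaccard(a, b) for b in liked_b) for a in items_a}
        # Step 2: propagate one hop through A's graph, so items A itself
        # recommends alongside strong matches also rise.
        scores = {}
        for a in items_a:
            neighbors = also_viewed_a.get(a, [])
            spill = max((base.get(n, 0.0) for n in neighbors), default=0.0)
            scores[a] = (1 - alpha) * base[a] + alpha * spill
        return sorted(scores, key=scores.get, reverse=True)

    graph = {"trail running shoes": ["hydration vest"],
             "hydration vest": ["trail running shoes"]}
    print(recommend(["running shoes"],
                    ["trail running shoes", "cast iron pan", "hydration vest"],
                    graph))
    # -> ['trail running shoes', 'hydration vest', 'cast iron pan']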

Other Projects

CS 229 Machine Learning: Using PCA to detect Dark Matter

CS 221 Artificial Intelligence: Pattern Recognition with Neural Networks

 

While at Stanford, I worked at the SLAC National Accelerator Laboratory. There, working with a team of software engineers, I contributed to Qserv, a distributed query engine for the future Large Synoptic Survey Telescope (LSST).

The LSST is an 8.4-meter optical telescope that will sit atop Chile's Cerro Pachón and is expected to be fully operational by 2022. Wielding the largest digital camera ever constructed (3200 megapixels!), the LSST will be the largest and fastest-imaging telescope ever built. It will scan and catalog billions of distant galaxies across the entire night sky. By the end of its run, the LSST's database is expected to include tens of billions of identified stars and galaxies together with tens of trillions of detections.

Astronomers require the ability to efficiently perform spatial self-joins as well as cross-match different observations of the same patch of sky. Achieving this at the scale of tens of petabytes presents a novel challenge. Toward this end, a team of engineers and computer scientists at SLAC is implementing Qserv, a distributed, shared-nothing SQL query system (written primarily in Python, C++, and MySQL).

Download Qserv design document

Check out Qserv code on GitHub
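
As a back-of-the-envelope illustration of the cross-match primitive involved, the naive Python sketch below pairs detections from two catalogs that lie within a small angular radius. It is O(n·m) and only pins down the semantics; the whole point of a system like Qserv is to shard and index this kind of work across the sky so it scales.

    # Naive angular cross-match: every pair within radius_deg.
    from math import asin, cos, pi, radians, sin, sqrt

    def ang_sep_deg(ra1, dec1, ra2, dec2):
        """Angular separation in degrees via the haversine formula."""
        ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
        h = sin((dec2 - dec1) / 2) ** 2 \
            + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
        return 2 * asin(sqrt(h)) * 180 / pi

    def cross_match(cat_a, cat_b, radius_deg=1.0 / 3600):  # 1 arcsecond
        return [(a[0], b[0]) for a in cat_a for b in cat_b
                if ang_sep_deg(a[1], a[2], b[1], b[2]) <= radius_deg]

    # (id, ra, dec) rows: b1 is ~0.5 arcsec from a1; b2 is ~1 degree away.
    print(cross_match([("a1", 150.0, 2.2)],
                      [("b1", 150.0, 2.2 + 0.5 / 3600), ("b2", 151.0, 2.2)]))
    # -> [('a1', 'b1')]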

My main contributions included implementing Qserv's core error handling and fault-tolerance capabilities. I also developed a logging solution used by the entire LSST community. It is built atop Apache log4cxx and is optimized for performance as well as flexibility of use. Read about the logging solution here.

Check out LSST logging code on GitHub


I studied experimental physics at Caltech. My research explored thermopower as a probe of strongly correlated quantum electron systems.

PhD 2016

 

Life in a lab

From 2006 to 2012 most of my days were spent in a Caltech sub-basement working in Prof. Jim Eisenstein's lab fabricating semiconductor devices, cooling them to sub-Kelvin temperatures within cryostats, constructing circuits to measure the most delicate signals, writing software to automate data acquisition and regulate experimental parameters, writing more software to analyze experimental data, solving physics problems on white boards, and machining custom parts for future experiments.

My Research

When a metal or any other electron system is subjected to a temperature gradient, a voltage appears across it. This is known as the thermoelectric effect, and the ratio of this voltage to the temperature difference is known as the thermopower. It so happens that this thermopower is often proportional to a system's entropy, a very informative thermodynamic quantity. My research exploits this fact by using thermopower to probe some of the most exotic many-body quantum states known.
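
In symbols (a hedged sketch: sign conventions vary, and the second relation is the idealized, disorder-free limit), with ΔV the voltage that develops, ΔT the applied temperature difference, s the entropy density, n the carrier density, and e the electron charge:

    S \equiv -\frac{\Delta V}{\Delta T}, \qquad S \simeq -\frac{s}{n e}

It is this second relation, thermopower as entropy per carrier per unit charge, that makes the measurement such an informative probe.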

One of these states, known as the fractional quantum Hall state at filling factor ν=5/2, is believed to exhibit quantum exchange statistics that could be harnessed for the creation of a particularly robust type of quantum computer. Consequently, there is a great deal of interest within the scientific community in establishing whether this state indeed obeys these quantum statistics. It was recently predicted that due to these quantum statistics the ν=5/2 state should have an enhanced entropy compared to otherwise similar states. One goal of my work is to measure this entropy enhancement via thermopower.

Download my dissertation

My Publications

Thermoelectric response of fractional quantized Hall and reentrant insulating states in the N = 1 Landau level
W.E. Chickering, J.P. Eisenstein, L.N. Pfeiffer, and K.W. West, Phys. Rev. B 87, 075302 (2013).

Thermopower of two-dimensional electrons at filling factors ν = 3/2 and 5/2
W.E. Chickering, J.P. Eisenstein, L.N. Pfeiffer, and K.W. West, Phys. Rev. B 81, 245319 (2010).

Hot-electron thermocouple and the diffusion thermopower of two-dimensional electrons in GaAs
W.E. Chickering, J.P. Eisenstein, and J.L. Reno, Phys. Rev. Lett. 103, 046807 (2009). 

 

While attending Caltech, I provided consulting services to the IT solutions firm HyperTrends.

HyperTrends develops web and other IT solutions for corporate and medium-sized business customers. A key project of mine involved solving a problem in the data access layers of web applications written using Microsoft's .NET framework. At the time (circa 2009), a common software stack for .NET web apps included ASP.NET and LINQ to SQL. These technologies offer a great deal of flexibility. However, we found that otherwise straightforward CRUD pages could require data access logic that took half a developer-day or more to deliver. Compounding the problem, different developers would make slightly different design decisions about how data was manipulated or persisted, resulting in a lack of uniformity across an application. These issues directly impacted the company's bottom line. Working with co-founder Sean Green, I developed a dynamic data access layer that typically reduced development time from half a day to roughly 30 minutes or less. This was achieved via design decisions that traded some flexibility for uniformity, allowing us to solve many of the company's data access problems once and for all.

This inspired a personal project to develop a more generalized and comprehensive solution for data access on the .NET stack. The goal was to create an intuitive utility that wrapped LINQ to SQL and would all but liberate an application developer from SQL and the RDBMS. This was achieved by embracing certain design patterns, thereby forgoing some of the flexibility offered by .NET in exchange for rapid application development, uniformity, and more robust software. I called it the VirtualContext (a reference to LINQ to SQL's DataContext object). The VirtualContext simplifies data persistence by providing change tracking within the disconnected regime of web applications. In addition, it allows developers to easily specify the shape of fetched object graphs (e.g., whether to include foreign keys, dependencies, etc.) without writing a query.

Check out the VirtualContext on GitHub
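
The VirtualContext itself is C#/.NET, but the disconnected change-tracking idea at its core is easy to sketch in Python: snapshot an object's state as it leaves the data layer, then diff on save so only genuine changes are persisted. The names and structure below are illustrative, not the actual API.

    # Illustrative only: disconnected change tracking in miniature.
    import copy

    class ChangeTracker:
        def __init__(self):
            self._originals = {}  # keyed on a business identifier

        def attach(self, key, obj):
            """Snapshot the object's fields as it leaves the data layer."""
            self._originals[key] = copy.deepcopy(vars(obj))
            return obj

        def diff(self, key, obj):
            """On save, return only the fields that actually changed."""
            before = self._originals.get(key, {})
            return {f: v for f, v in vars(obj).items() if before.get(f) != v}

    class Product:
        def __init__(self, sku, price):
            self.sku, self.price = sku, price

    tracker = ChangeTracker()
    p = tracker.attach("sku-42", Product("sku-42", 19.99))
    p.price = 24.99  # mutated later, e.g. after a web round trip
    print(tracker.diff("sku-42", p))  # {'price': 24.99} -> a minimal UPDATE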


I studied physics at UC Berkeley.

BA Physics 2005

Attending Cal was one of the most challenging and rewarding experiences of my life. As part of the physics curriculum I completed courses in classical, quantum, and statistical mechanics, electromagnetism, optics, solid state physics, particle physics, and electronics. I also took upper-division courses in applied mathematics, computer science, and history. My cumulative GPA was 3.82.

 

While attending Berkeley, I worked as a software engineer at Lawrence Berkeley National Laboratory.

Following a summer internship, I was hired to work with the Nearby Supernova Factory (SNF), a scientific collaboration dedicated to exploring the nature of the universe's dark energy through observations of type Ia supernovae. I supported their scientific mission by improving and integrating multiple camera and telescope control systems (written primarily in C/C++). A particularly noteworthy project involved enhancing their telescope controller to enable automatic tracking of moving objects. This allowed SNF to train their unique spectrographic cameras and data processing pipeline on the debris of comet Tempel 1 that resulted from NASA's Deep Impact space probe. This project resulted in a co-authored article in Icarus.

Visible and near-infrared spectrophotometry of the Deep Impact ejecta of Comet 9P/Tempel 1 
Klaus W. Hodapp, Greg Aldering, Karen J. Meech, Anita L. Cochran, Pierre Antilogus, Emmanuel Pécontal, William Chickering, Nathalie Blanc, Yannick Copin, David K. Lynch, Richard J. Rudy, S. Mazuk, Catherine C. Venturini, Richard C. Puetter, Raleigh B. Perry, Icarus 187, 185 (2007).


Contact Me!

Email me at bill@getlore.io

Find me on LinkedIn

www.linkedin.com/in/billchickering/