Prompt-Based Feature Engineering (Part 1) - Generative AI Generates Data
Adding Large Language Models to your data pipeline can feel like magic.
Getting data sucks. Getting good data sucks even more. In 2024, this is still a fundamental truth about working in the data industry. It applies to data analysts, data scientists, and data engineers more or less equally and is the reason for memes such as the classic "data science is 90% data cleaning". Thanks to generative AI, however, this might finally be changing.
Here is a pretty basic thought: generative AI can generate data. This might sound trivial at first. Many people have experienced AI models generating texts, images, and videos. If you put your data hat on, however, this can have big implications for your workflow and pipelines.
Up until now, if you wanted to add new information to your data set, you would typically search on Google (good luck!), visit official statistics websites with UIs straight from the 90s, and so on. Thankfully, the good people at OpenAI, Mistral, and others have done the hard work for us and encoded this information in their models. We only need to reconstruct it via the right prompts.
From Berlin to the Rhine and then Adidas
To appreciate the convenience and efficiency of this new workflow, let's build a data set from scratch. We start from a column of three major European cities with great data scenes: Berlin, London, and Paris.
import pandas as pd

df = pd.DataFrame(['Berlin', 'London', 'Paris'], columns=['city'])
For the analysis, we use gpt-3.5-turbo in combination with dpq, a Python package for prompt-based feature engineering. We initialise the agent as follows.
import dpq

dpq_agent = dpq.Agent(
    url="https://api.openai.com/v1/chat/completions",
    api_key="YOUR_API_KEY",
    model="gpt-3.5-turbo"
)
We only need three inputs for this: an endpoint URL, an API key, and a model name. Now, we are ready to add new columns.
One-to-One Information
Let’s add the country for each city. The simplest prompt is:
"You return the country of a city."
To implement this functionality, we create a new function. We specify the messages, including a short example, and pass them to dpq as follows.
messages = [
    {
        "role": "system",
        "content": "You return the country of a city."
    },
    {
        "role": "user",
        "content": "Rome"
    },
    {
        "role": "assistant",
        "content": "Italy"
    },
]
# Add new function
dpq_agent.return_country = dpq_agent.generate_function(messages)
Here, dpq creates a new function called return_country. When calling this function with a column as input, it appends each row's content to the messages and sends the API requests in parallel. When we run this, we get the following output.
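As a rough sketch, assuming return_country accepts the city column and returns a matching series (the exact call signature is an assumption based on the description above), the call and result look something like this:

# Hypothetical call signature; return_country maps each city to its country.
df['country'] = dpq_agent.return_country(df['city'])
print(df)
#      city         country
# 0  Berlin         Germany
# 1  London  United Kingdom
# 2   Paris          France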
Well, that was pretty easy. We can think about this type of information as a one-to-one mapping, with one piece of information matching one row in the data frame. Other examples of this would be adding the population size, time zone, and so on.
System: Do More Stuff
In this first example, we saved the step of trying to find the information in another data set and merging in the relevant column. Next, we have the LLM do more work for us. Consider the following prompt.
“You return the longest river in a country.”
Creating a dpq function for this prompt in the same way as above, we get the following results.
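Sketching this the same way, with the caveat that the function name and call signature are again assumptions and the values are merely illustrative (answers can vary from model to model), the result might look like this:

# Hypothetical function built from the river prompt above.
df['longest_river'] = dpq_agent.return_longest_river(df['country'])
print(df)
#      city         country longest_river
# 0  Berlin         Germany         Rhine
# 1  London  United Kingdom        Severn
# 2   Paris          France         Loire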
Think about what this would typically require: finding the data, then cleaning, sorting, and merging it. Now, we literally write a single sentence. In contrast to the previous case, this mapping is not directly one-to-one: there are many rivers in each country, and we let the LLM select the longest one.
Kick It Up a Notch
As a final, slightly contrived example, let’s try this:
“For a given country, you return the company in the main stock exchange that is among the top 5 in terms of market capitalisation and comes first in the alphabet.”
Implementing this prompt, we get these three companies.
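Again as a sketch, with a hypothetical function name and the caveat that the answers depend heavily on the model and its knowledge cutoff:

# Hypothetical function built from the stock exchange prompt above.
df['top_company'] = dpq_agent.return_top_company(df['country'])
# e.g. Adidas for Germany (hence the section title);
# the other answers will vary by model and date.
print(df[['country', 'top_company']])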
Thinking about how much work this saves us, it genuinely feels like magic.
Magic but not Perfect
In this first part, we saw that adding LLMs to your data pipeline is a potential game-changer. Using prompts to populate columns with information is incredibly convenient. Of course, like anything in life, this approach comes with trade-offs.
The most obvious factor: hallucinations. Since LLMs are probabilistic in nature, we currently have no guarantees on the outputs. Therefore, until we have better tools for evaluation and data validation, we will have to check the output and make sure it is what we expect. In general, however, data is never perfect, just good enough. Hence, there are many use cases where the added value of using an LLM will be substantial. On a related note, we also haven't touched on the topic of prompt engineering. It can take a couple of iterations to get this right if we, for example, want to obtain outputs in a certain format.
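As a minimal illustration of such a check (the reference set here is only an example; in practice it would come from a trusted source), one could validate a generated column before relying on it:

# Illustrative sanity check: compare LLM-generated countries
# against a trusted reference set and flag anything unexpected.
known_countries = {'Germany', 'United Kingdom', 'France'}

invalid = df[~df['country'].isin(known_countries)]
if not invalid.empty:
    print("Rows that need manual review:")
    print(invalid)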
We will look into prompt-based data processing, including formatting and other topics, in the upcoming parts of this series! In the meantime, feel free to open pull requests on the dpq repo on GitHub with prompts you have found useful!