Iterative, the company behind the command-line tool DVC (Data Version Control), has announced another open source project: DataChain, a Python library designed for processing and evaluating unstructured data. It is intended to help ML and data professionals optimize their workflows.
Curate data with LLM and local ML models
According to a recent survey by McKinsey, more and more companies are using generative AI, but only 15 percent of the companies surveyed could confirm that it had a significant positive impact on their business. One of the reasons cited was the difficulty of processing unstructured data. This is exactly where Iterative wants to come in with DataChain. The open source tool aims to help convert data into more easily usable formats and to improve its quality, for example through AI-assisted data curation. To enrich their data, developers can use both local ML models and API calls to large language models (LLMs such as GPT, hosted in the cloud).
Other typical areas of application for DataChain include LLM analytics and the validation of multimodal AI applications. When validating data, DataChain lets developers work with strictly typed Pydantic objects instead of raw JSON. In the project’s GitHub repository, the development team shows an example of how chatbot dialogues from different data sources can be analyzed and evaluated using different approaches, for example with local ML models or with an LLM that acts as a universal classifier.
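To illustrate what strictly typed Pydantic objects offer over raw JSON, here is a minimal sketch that uses Pydantic alone, without DataChain. It assumes Pydantic v2; the class name DialogEval and its fields are hypothetical and are not taken from the DataChain repository.

```python
from pydantic import BaseModel, ValidationError


class DialogEval(BaseModel):
    """Hypothetical schema for one evaluated chatbot dialogue."""

    result: str  # e.g. "Success" or "Failure"
    tokens: int  # number of tokens the evaluation consumed


# A well-formed JSON answer is parsed into a typed Python object.
raw_ok = '{"result": "Success", "tokens": 42}'
evaluation = DialogEval.model_validate_json(raw_ok)
print(evaluation.result, evaluation.tokens)

# A malformed answer is rejected instead of passing through unnoticed.
raw_bad = '{"result": "Success", "tokens": "many"}'
try:
    DialogEval.model_validate_json(raw_bad)
    rejected = False
except ValidationError:
    rejected = True
print("rejected:", rejected)
```

With plain JSON, the wrong type in the second record would only surface later, wherever the value is first used as a number. The typed model catches it at the point where the data enters the pipeline.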
To store datasets, DataChain uses an embedded SQLite database that is automatically versioned. If needed, developers can serialize the entire LLM response (class ChatCompletionResponse) to the internal database instead of extracting the information directly into a data structure. The following code example shows how to retrieve the stored records and iterate over the objects.
from datachain import DataChain
# ChatCompletionResponse is the response class of the LLM client that
# produced the stored responses; import it from that client library.

chain = DataChain.from_dataset("response")

# Iterate one by one: supports out-of-memory workflows
for file, response in chain.limit(5).collect("file", "response"):
    # Verify the collected Python objects
    assert isinstance(response, ChatCompletionResponse)

    status = response.choices[0].message.content[:7]
    tokens = response.usage.total_tokens
    print(f"{file.get_uri()}: {status}, file size: {file.size}, tokens: {tokens}")
More information about DataChain can be found in the official announcement of Iterative’s open source project. If you want to get a solid impression of DataChain, take a look at the GitHub repo. In addition to detailed information about the Python library, it also links to an introductory tutorial.
(Map)