Using ChatGPT to document my datasets
Over the last few weeks I’ve had fun playing around with the OpenAI API. This time I got interested in what the tool could do if I just fed it a dataset.
To my surprise (though I’m less and less surprised the more I try things out), the response was impressive. It could describe the dataset just by being fed the data. It could also do some light analysis: for a pothole request dataset, I asked it to detail the record that had been open the longest, and it identified the right one.
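An answer like that is easy to double-check by hand. Here is a minimal sketch of such a check, not the code from my repo. The sample rows and the column names (`request_id`, `opened_date`, `status`) are made up for illustration, so a real pothole dataset will likely need different names and date parsing.

```python
import csv
import io
from datetime import date

# Hypothetical sample data; real column names and formats may differ.
SAMPLE_CSV = """request_id,opened_date,status
101,2023-01-15,Open
102,2022-11-03,Closed
103,2022-12-20,Open
"""


def longest_open_request(csv_text, today):
    """Return (request_id, days_open) for the oldest request still open, or None."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Keep only requests that have not been closed yet.
    open_rows = [r for r in rows if r["status"].strip().lower() == "open"]
    if not open_rows:
        return None
    # The earliest opened_date among open requests has been open the longest.
    oldest = min(open_rows, key=lambda r: date.fromisoformat(r["opened_date"]))
    days_open = (today - date.fromisoformat(oldest["opened_date"])).days
    return oldest["request_id"], days_open


# Fix "today" so the result is reproducible.
result = longest_open_request(SAMPLE_CSV, today=date(2023, 2, 1))
print(result)
```

Comparing this result against the model's answer is a quick sanity check before trusting any of its other observations.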
Then I wondered: if it could describe the dataset, could it also document the individual columns? Of course the answer is yes.
I asked for the output to give a description of the dataset, document each column, tell me each column’s data type, suggest a couple of tests I’d want to run on each column, and format it all as a YAML file. I also asked it to make an observation about the dataset and report that as well.
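To give a sense of what a request like that can look like, here is a rough sketch of building the prompt from the first few rows of a CSV. It is an illustration under my own assumptions, not the exact code or wording from the repo: the function name `build_documentation_prompt`, the sample data, and the instruction text are all hypothetical. It only builds the prompt string, and sending it to the API is left out.

```python
import csv
import io
from itertools import islice

# Hypothetical sample data, used only to show the shape of the prompt.
SAMPLE_CSV = """request_id,opened_date,status
101,2023-01-15,Open
102,2022-11-03,Closed
103,2022-12-20,Open
"""

# Illustrative instructions; the real prompt wording may differ.
INSTRUCTIONS = (
    "Below is a sample of a dataset in CSV form.\n"
    "1. Describe the dataset.\n"
    "2. Document each column and give its data type.\n"
    "3. Suggest a couple of tests to run on each column.\n"
    "4. Report one observation about the data.\n"
    "Format the whole answer as a YAML file.\n"
)


def build_documentation_prompt(csv_text, max_rows=5):
    """Combine the instructions with the header and the first max_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample_rows = list(islice(reader, max_rows))

    # Write the sample back out as CSV so quoting is preserved.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(sample_rows)

    return INSTRUCTIONS + "\n" + buffer.getvalue()


prompt = build_documentation_prompt(SAMPLE_CSV, max_rows=2)
print(prompt)
```

Sending only a handful of rows keeps the prompt short, at the cost of the model seeing less of the data, so `max_rows` is worth tuning per dataset.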
It is not always perfect, but it is pretty close, and a really good option for a first pass at documenting a dataset. I also didn’t spend much time building this, so with more time, you could likely customize the format and output even further.
I’m normally a bit skeptical about tech tools when they have this much hype just after being released, but this feels transformative.
All of the code lives here: https://github.com/samedelstein/documentation_gpt