I convert a ton of text documents like PDFs to spreadsheets. So every time a new iteration of AI technology arrives, I wonder if it’s capable of doing what so many people ask for: to hand off a PDF, ask for a spreadsheet, and get one back. After throwing a couple programming problems at OpenAI’s ChatGPT and getting a viable result, I wondered if we were finally there.

Back when OpenAI’s GPT-3 was the hot new thing, I saw Montreal journalist Roberto Rocha attempt a similar test. The results were lackluster, but ChatGPT, OpenAI’s newest model, has several improvements that make it better suited to extraction: It’s 10 times larger than GPT-3 and is generally more coherent as a result, it’s been trained to explicitly follow instructions, and it understands programming languages.

To test how well ChatGPT could extract structured data from PDFs, I wrote a Python script (which I’ll share at the end!) to convert two document sets to spreadsheets:

- A 7,000-page PDF of New York data breach notification forms. There were five different forms, bad OCR, and some freeform letters mixed in.
- 1,400 memos from internal police investigations. These were completely unstructured and contained emails and document scans.

My workflow had four steps:

1. Redo the OCR, using the highest quality tools possible. This was critically important because ChatGPT refused to work with poorly OCR’d text.
2. Clean the data as well as I could, maintaining physical layout and removing garbage characters and boilerplate text.
3. Break the documents into individual records.
4. Ask ChatGPT to turn each record into JSON.

I spent about a week getting familiarized with both datasets and doing all this preprocessing. Once it’s done, getting ChatGPT to convert a piece of text into JSON is really easy. You can paste in a record and say “return a JSON representation of this” and it will do it. But doing this for multiple records is a bad idea because ChatGPT will invent its own schema, using randomly chosen field names from the text. It will also decide on its own way to parse values. Addresses, for example, will sometimes end up as a string and sometimes as a JSON object or an array, with the constituent parts of an address split up.

Prompt design is the most important factor in getting consistent results, and your language choices make a huge difference. One tip: Figure out what wording ChatGPT uses when referring to a task and mimic that. (If you don’t know, you can always ask: “Explain how you’d _ using _.”)

Because ChatGPT understands code, I designed my prompt around asking for JSON that conforms to a given JSON schema. I tried to extract a JSON object from every response and run some validation checks against it. Two checks were particularly important: 1) making sure the JSON was complete, not truncated or broken, and 2) making sure the keys and values matched the schema. I retried if the validation check failed, and usually I’d get valid JSON back on the second or third attempt. If it continued to fail, I’d make a note of it and skip the record.

Impressively, ChatGPT built a mostly usable dataset. At first glance, I even thought I had a perfectly extracted dataset. But once I went through the pages and compared values, I started to notice errors.

The totality of these problems makes most uses of ChatGPT editorially impractical, especially at scale. The errors, although subtle and relatively infrequent, were enough to prevent me from doing the basic analyses that most data journalists want to do. Averages, histograms, mins and maxes were out.

But for my projects, the mistakes were tolerable. I wanted to find big players in the breach database, so I didn’t care if some of the names were wrong or if some numeric values were off by a zero. For the police data, I was basically looking for a summary to identify certain incidents and the individuals involved.

Overall, these are the types of errors ChatGPT introduced:

ChatGPT hallucinated data, meaning it made things up. Often in subtle and hard-to-detect ways. For example, it turned “2222 Colony Road, Moorcroft” (note the “r”) into “2222 Colony Road, Mooncroft.” The word “Mooncroft” (with an “n”) doesn’t appear anywhere in the text.
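
To make the extract, validate and retry loop described above more concrete, here is a minimal Python sketch. It is not the script mentioned in this post, just an illustration under some assumptions: `ask_model` is a hypothetical stand-in for whatever function sends a record to ChatGPT and returns its reply, the field names in `SCHEMA` are invented, and the hand-rolled type check is a simplified substitute for a real JSON Schema validator.

```python
import json
from typing import Callable, Optional

# Hypothetical, simplified "schema": field name -> required Python type.
# A real JSON Schema can express far more than this.
SCHEMA = {"name": str, "address": str, "num_affected": int}


def extract_json(response: str) -> Optional[dict]:
    """Return the first complete JSON object found in a reply, or None."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None  # truncated or broken JSON at this position
        if isinstance(obj, dict):
            return obj
        start = response.find("{", start + 1)
    return None


def matches_schema(record: dict, schema: dict = SCHEMA) -> bool:
    """Check that the keys match exactly and each value has the expected type."""
    if set(record) != set(schema):
        return False
    return all(type(record[key]) is expected for key, expected in schema.items())


def extract_record(
    text: str, ask_model: Callable[[str], str], max_attempts: int = 3
) -> Optional[dict]:
    """Ask for JSON, validate it, retry on failure. None means: note it and skip."""
    for _ in range(max_attempts):
        record = extract_json(ask_model(text))
        if record is not None and matches_schema(record):
            return record
    return None


# Usage with a fake model: the first reply is truncated, the second is valid.
replies = iter([
    '{"name": "Acme", "address": "1 Main St", "num_aff',
    'Here you go: {"name": "Acme", "address": "1 Main St", "num_affected": 12}',
])
record = extract_record("raw record text", lambda _prompt: next(replies))
print(record)
```

Note what these checks can and cannot catch. A truncated reply or an address returned as a nested object fails validation and triggers a retry, but a hallucinated value like “Mooncroft” is still a perfectly valid string, so it passes.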