Escaping Characters

PaulH · October 24, 2019, 8:37am

I am working on importing a large food database.

So many problems seem to come up with branded food names.

Some have + signs
There is T.G.I.F. Friday’s with a single quote in the name
Then some food names have double quotes "

I am pulling the names using PHP/SQL. Wrapping it up in JSON, then parsing that JSON.
I guess I will try to escape characters on the PHP/SQL query part and see if that fixes things.

This post is half ranting, half asking: what is your experience with escaping these trouble characters? (As it were)

Josh · October 24, 2019, 4:35pm

You could probably use regular expressions to remove the extra characters.

How many items are in the list?

PaulH · October 24, 2019, 6:00pm

Interesting, thanks for the input.

I often hear of people using Regex, but I really have no idea. How would I use it?

Looks like ~130,000 names. Some seemed to be duplicates, but I think it is because the strings were getting broken.

Josh · October 24, 2019, 6:27pm

If you want, I could show you how to use regex to clean the list. It could probably be done interactively in Vim, including removing the duplicates.

Here are a couple of tutorials on how to use regular expressions in JavaScript:

(It’s similar in PHP.)

This tool is useful for testing the patterns:

PaulH · October 24, 2019, 6:28pm

Cool, thanks, I will go through these materials and see. Yeah, I will have to test duplicates…

Josh · October 24, 2019, 6:32pm

If there is one food per line, then you can pipe the content through the sort and uniq commands:

$ cat original_file.txt | sort | uniq > sorted_unique_copy.txt

(That won’t modify the original_file.txt but will save the results to sorted_unique_copy.txt.)

PaulH · October 24, 2019, 6:35pm

Sweet, it is in a .sql format right now, but I think that bash would be way faster… I guess I could even do find replace like that.

I am going to manually create a bunch of code for edge cases like

Mcdonalds > McDonalds

and so on.

I am reading more about regex. I do like this quote:

Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.

Jamie Zawinski

Getting there. I will try the sort unique as well.

PaulH · October 24, 2019, 6:44pm

Look at this!

PaulH · October 24, 2019, 6:44pm

Probably just going to delete all those entries…

OutsourcedGuru · October 25, 2019, 7:42pm

Wow. I do not love data-grooming-before-import. bleh

I think I usually went through a process, though:

define the stringifying character like apostrophe and replace this first with, say, an escaped version of that
it’s probably then best to also replace the other one (double-quote) with its equivalent
review how the database was installed from a collation-indexing standpoint because things may sort in a strange way if this is wrong versus the encoding of the incoming data
decide if you can store unicode or not. if not then “Black Bean Olé Seasoning” could be a problem (I sometimes see DBAs do a replacement of é => e’ which is probably more like e’ or e0x39 possibly but I honestly think that looks just wrong to me.)
decide if you want to store the “&” equivalent of an ampersand (the HTML-friendly display code itself) for things like this. this isn’t a good solution if you’re also doing EDI with other businesses, creating sales orders and such with them since that may or may not look good in those reports.

In a case like this I usually ask the client to begin with the end: what kinds of reports do you expect to see from this and how are they produced? Okay, it’s just a web page… then you could in theory just store HTML-friendly equivalents. These days, though, it seems like unicode is the way to go.