Search for \" As Rogue Delimiter in csv file

carlarogers · Jan 17, 2022

I have a few csv files containing about 2 million records. Between the commas delimiting fields, strings can be enclosed in double quotes. Unfortunately, a few hundred records use \" to delimit end of a string. At the command line, I can find the \" sequence with

grep -r '\\''\"' .

How can I back that up with something to replace \" with "

In other words, how can those slashes ahead of the double quote mark be shown the chute to the dust bin?

Any help would be much appreciated.

KGIII · Jan 17, 2022

I think grep isn't really the right tool for this. I want to say it's something like sed 's/\\\/g' but don't quote me on that, I'm going to ping @JasKinasis for this one. I'm pretty sure sed is the correct tool for this one, but I'm not that good with the actual mechanics. You can probably find a sample that will remove a single character and modify that example.

You could also try opening it up with your favorite text editor and using find/replace, make sure the first one or two works and then just click the replace all. Search for \", and replace with ", or just search for \ and replace with nothing. Find and replace is awesome like that.

With two million results, it's gonna take a minute but it shouldn't be too bad.

JasKinasis · Jan 17, 2022

@carlarogers - From your description, I’m assuming the text fields look something like this:

Code:

2,5,"bananas\",15.5,"I don’t know\",17

So you have numeric types (or perhaps boolean types) and then string values in certain places, surrounded by double quotes, but with \" for the closing quotes?
And you want to replace all instances of \" at the end of the strings with "?
If so, you need to run the following command:

Bash:

sed -i.bak 's/\\\"/\"/g' /path/to/*.csv

Where /path/to/ is the path to the directory containing the .csv files and *.csv is a globbing pattern, which tells sed to edit ALL of the .csv files in the specified directory.
Again, I'm assuming that you want sed to edit ALL of the files in one go. If that is not the case and you only want to edit one at a time, or to only edit specific files, you could list the files individually after the sed command, instead of using a globbing pattern as I have.
In the above sed command we use the -i.bak option, which tells sed to edit and overwrite the original file (edit in place). We also specify .bak as a parameter to the -i switch. NOTE: There is no space between -i and .bak. It's -i.bak. The addition of .bak tells sed to create a backup file of each file it edits (JIC any of my assumptions about your files are incorrect.),which will be called originalfile.ext.bak.
So if your file was called mycsvfile.csv, then sed will create a backup of the original called mycsvfile.csv.bak before overwriting the original file.

If you use sed -i without specifying a backup extension to use, it simply overwrites the original file without creating a backup. Normally before using sed - I take some time to make sure that my search/replace patterns are correct and are not going to bork things. And if I'm ever slightly unsure, I use sed's -i.bak option to edit in place AND create a backup.
And I accept no responsibility for borked files, so at least this way - if any of my assumptions about your files are wrong, you at least end up with an unmolested backup of the original files!

The search/replace pattern 's/\\\"/\"/g' can be broken down like this:
' is the opening single quote, specifying the start of the actions sed should take
s tells sed that we're doing a search and replace operation
/ is the field separator between the command (search+replace) and the search pattern
\\\" is the search pattern - i.e. The pattern of characters we're searching for
Where:
\\ is the escape sequence for a literal backslash characer \
\" is the escape sequence for a literal double quote character "
/ is the field separator between the search pattern and the replace pattern
\" is the replace pattern, which is the escape sequence for a literal double quote character "
/ is another field separator
g tells sed to perform the search/replace pattern globally***
' is the closing single quote, marking the end of the operations sed should perform

*** Without the g at the end, sed would only search/replace the first occurrence of the pattern on each line. So if there are multiple strings on a single line, then you'll NEED to specify g at the end.

So essentially it says 'find the pattern \" and replace it globally with "'

So after running the sed command I've posted, my initial example-string would end up looking like this:

Code:

2,5,"bananas",15.5,"I don’t know",17

I've just mocked up a small .csv file and checked it. And the sed command I've posted works properly for me. So it should work for you. At least as long as all of my assumptions about your .csv file are correct!

If anything is not quite right, please give me a more concrete example of the data you're working with and I can help you to work out an appropriate sed command to use.

KGIII · Jan 18, 2022

Thanks, Jas! Trying to wrap my head around how to use sed with a slash was just more than my brain can handle. Brilliant write up, as always.

carlarogers · Jan 20, 2022

JasKinasis said:
So you have numeric types (or perhaps boolean types) and then string values in certain places, surrounded by double quotes, but with \" for the closing quotes?

The closing delimiter is supposed to be just "

In a dataset with about 2 million records, with 22 fields per record, approximately 100 records contain strings delimited like this:

2,5,"bananas",15.5,"I don’t know\",17

Only one of the strings will have the rogue 2nd delimiter.

JasKinasis said:
And you want to replace all instances of \" at the end of the strings with "?

Yes! Exactly. Thank you for taking time to work this through with me.

JasKinasis said:
Bash:
sed -i.bak 's/\\\"/\"/g' /path/to/*.csv

Thank you, this is what I was looking for. I have used sed and regex for purposes similar to this many times. I don't understand regex, but I usually figure out a regex that gets it done. This particular search target stumped me.I have used sed close to this way many times. For some reason, I could not figure out the regex to use for this particular search target

Looking at the solution, I don't get how

\\\"

searches for

\"

I expected the solution to be

\\"
update: now I get it. Thank you for spelling it out.

You know, whenever I have tried looking up regex so I coud figure out this stuff on my own, I have no found a document that explains in a way that doesn't give me a headache. However, the way

JasKinasis said:
Code:
2,5,"bananas",15.5,"I don’t know",17

That looks like the right result. I am going to go run it on my data right now.
you have spelled it out is great. You have a gift for this kind of writing.

Just one extra \, to escape the slash I want to escape away from. Question: is "escape" the right word?

JasKinasis said:
Normally before using sed - I take some time to make sure that my search/replace patterns are correct and are not going to bork things.

doi.

JasKinasis said:
\\ is the escape sequence for a literal backslash characer \
\" is the escape sequence for a literal double quote character "

JasKinasis said:
tells sed to edit ALL of the .csv files in the specified directory.

absolutely what I need, since I don't know which files or records where this problem is going to appear.

JasKinasis said:
The addition of .bak tells sed to create a backup file of each file it edits (JIC any of my assumptions about your files are incorrect.),which will be called originalfile.ext.bak.

It is very generous of you to have spelled all this out. In the case I am working now, there is no need to save backups of the target files, because the operation is being done on a set of copies of the original file. When this sed runs, the original files are stashed in another directory. After the data is uploaded to mySql from the files, the files are deleted.

Did you test \\\" ?

I just tried that expression with grep. Does it work differently with sed?

whoops, I saw you tested it, so I am going to run it now.
***********************
I did not find success with that test.

I have attached a file containing 2 records that have this problem. The example you spelled out looks perfect. I read somewhere that regex varies by all kinds of things, so it is not always portable.

carlarogers · Jan 20, 2022

KGIII said:
Thanks, Jas! Trying to wrap my head around how to use sed with a slash was just more than my brain can handle. Brilliant write up, as always.

This topic is bazzar. Hard to believe it is the easiest way to have a general search tool.

KGIII · Jan 20, 2022

carlarogers said:
This topic is bazzar. Hard to believe it is the easiest way to have a general search tool.

LOL There's a reason why I pinged him to the thread. He's very, very good at this sort of thing.

f33dm3bits · Jan 20, 2022

carlarogers said:
This topic is bazzar. Hard to believe it is the easiest way to have a general search tool.

Sed is the best tool for searching and replacing things like this in a file with a lot of lines. You can probably do the same with Notepad++ but you will endup having to write a search and replace regex just as you have to do with sed.

JasKinasis · Jan 21, 2022

carlarogers said:
The closing delimiter is supposed to be just "

In a dataset with about 2 million records, with 22 fields per record, approximately 100 records contain strings delimited like this:

2,5,"bananas",15.5,"I don’t know\",17

Only one of the strings will have the rogue 2nd delimiter.

Yes! Exactly. Thank you for taking time to work this through with me.

Thank you, this is what I was looking for. I have used sed and regex for purposes similar to this many times. I don't understand regex, but I usually figure out a regex that gets it done. This particular search target stumped me.I have used sed close to this way many times. For some reason, I could not figure out the regex to use for this particular search target

Looking at the solution, I don't get how

\\\"

searches for

\"

I expected the solution to be

\\"
update: now I get it. Thank you for spelling it out.

You know, whenever I have tried looking up regex so I coud figure out this stuff on my own, I have no found a document that explains in a way that doesn't give me a headache. However, the way

That looks like the right result. I am going to go run it on my data right now.
you have spelled it out is great. You have a gift for this kind of writing.

Just one extra \, to escape the slash I want to escape away from. Question: is "escape" the right word?

doi.

absolutely what I need, since I don't know which files or records where this problem is going to appear.

It is very generous of you to have spelled all this out. In the case I am working now, there is no need to save backups of the target files, because the operation is being done on a set of copies of the original file. When this sed runs, the original files are stashed in another directory. After the data is uploaded to mySql from the files, the files are deleted.

Did you test \\\" ?

I just tried that expression with grep. Does it work differently with sed?

whoops, I saw you tested it, so I am going to run it now.
***********************
I did not find success with that test.

I have attached a file containing 2 records that have this problem. The example you spelled out looks perfect. I read somewhere that regex varies by all kinds of things, so it is not always portable.

From looking at that .zip file you've attached - the file does contain two records that have strings with the \" endings.
On my Linux laptop and in Cygwin on my Windows work PC, the sed command I posted in my original reply works and replaces the \" with ".

But when I extracted the zip, I did notice that your zip contains a __MACOS directory.
If you did the sed replacement on a MAC - it's worth noting that MAC's version of sed is one of the BSD versions, not the GNU version that we use in Linux. And there are some subtle differences between the behaviours of the GNU and BSD versions. Also the GNU version has a bunch of handy extensions too.

But I've never used a MAC, so I couldn't tell you exactly what those differences are.
So if you're on a MAC and it's not working, it could be that the MAC version of sed doesn't expand the escape sequences for \ and ". Or it could be a difference in the way that the -i option works. I don't know.
But if you run the sed command I posted in Linux, or in Cygwin on Windows, using GNU sed - it definitely works!

JasKinasis · Jan 21, 2022

f33dm3bits said:
Sed is the best tool for searching and replacing things like this in a file with a lot of lines. You can probably do the same with Notepad++ but you will endup having to write a search and replace regex just as you have to do with sed.

The advantage of using sed is, the process can be completely automated.
Tell it what to look for and what to do when it finds it - and it just does it! Much quicker and easier than opening the files in an editor and manually searching/replacing instances.

WRT editing by hand - it depends on the editor and the size and number of files you're working with.
Doing it with a really basic editor like Nano would be a PITA.
But in something like vim or emacs, or one of the more powerful, modern, GUI editors, you could probably do it in a single operation for ALL files at once.

For example, in vim you could do it in a single operation. By loading all of the files into separate buffers at the same time.
e.g.

Bash:

vim *.csv

That will load ALL of the .csv files in the current directory into vim in separate buffers.
Then you could do a global search/replace in ALL files using something like bufdo (if the files are all loaded into buffers), or tabdo (if the files are loaded into tabs).

In this case, we've just loaded all of the files into separate buffers, we don't have any tabs.
So in Vim's normal mode, we could use:

Code:

:bufdo %s/\\\"/\"/g

The above vim command performs the search/replace that Carla specified, on ALL buffers that are currently in memory. Then you'd simply use :xa to save all of the buffers and exit.

I haven't used Emacs for a while. There's a similar method for searching and replacing in multiple files simultaneously, but I can't remember it offhand.

And regardless of whether you use sed or an editor to do it in a single operation - you still have to build an appropriate regex for the things you're searching for.

Generally speaking though, if I need to edit a lot of large files quickly - I'll use sed.

KGIII · Jan 21, 2022

Like I mentioned way up near the top, lacking the sed skills to do this without asking for help, I'd just use a text editor. Some of them will let you do find/replace on all the files in a directory. I'd just replace the / with a space or with nothing. It might take a minute, but I bet it'd be done sooner than this thread will be - even on two million rows.

Search for \" As Rogue Delimiter in csv file

carlarogers

Member

KGIII

Super Moderator

JasKinasis

Super Moderator

KGIII

Super Moderator

carlarogers

Member

Attachments

carlarogers

Member

KGIII

Super Moderator

f33dm3bits

Gold Member

JasKinasis

Super Moderator

JasKinasis

Super Moderator

KGIII

Super Moderator

Staff online

Members online

Latest posts