How to find all patterns between two characters?

I'm trying to find all patterns between a pair of double quotes. Let's say I have a file with contents like the following:

first matched is "One". the second is here"Two "
and here are in second line" Three ""Four".

I want the words below as output:

One
Two
Three
Four

As you can see, all strings in the output are between a pair of quotes.

What I tried is this command:

grep -Po ' "\K[^"]*' file

The above command works fine if there is a space before each opening " mark. For example, it works if my input file contains the following:

first matched is "One". the second is here "Two "
and here are in second line " Three " "Four".

I know I can do this by combining multiple commands, but I'm looking for a single command, without running it more than once. For example, I want to avoid something like the command below:

grep -oP '"[^"]*"' file | grep -oP '[^"]*'

How can I achieve/print all of my patterns just using one command?

Reply to comments: Removing whitespace around the matched pattern inside a pair of quotes isn't important to me, but it would be better if the command supported it too. Also, my files contain nested quotes like "foo "bar" zoo". All of the quoted words are on separate lines and do not span multiple lines.

Thanks in advance.

5 Answers

First of all, your grep -Po '"\K[^"]*' file idea fails because the closing quote of each match is never consumed, so grep treats it as the next opening quote and sees both "One" and ". the second is here" as being inside quotes. Personally, I'd probably just do

$ grep -oP '"[^"]+"' file | tr -d '"'
One
Two 
 Three 
Four

But that is two commands. To do it with a single command, you could use one of:

Perl
```
$ perl -lne '@F=/"\s*([^"]+)\s*"/g; print for @F' file
One
Two
Three
Four
```
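Here, the @F array holds all matches of the regex: a quote, optional whitespace, then as many non-" characters as possible up to the next ". Leading whitespace is dropped, but because [^"]+ is greedy, any trailing whitespace stays in the capture. The print for @F just means "print each element of @F".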
Perl
```
$ perl -F'"' -lane 'for($i=1;$i<=$#F;$i+=2){print $F[$i]}' file
One
Two 
 Three 
Four
```
To remove leading/trailing spaces from each match, use this:
```
perl -F'"' -lane 'for($i=1;$i<=$#F;$i+=2){$F[$i]=~s/^\s+|\s+$//g; print $F[$i]}' file
```
Here, Perl is behaving like awk. The -a switch causes it to automatically split input lines into fields on the character given by -F. Since I have given it ", the fields are:
```
$ perl -F'"' -lane 'for($i=0;$i<=$#F;$i++){print "Field $i: $F[$i]"}' file
Field 0: first matched is
Field 1: One
Field 2: . the second is here
Field 3: Two
Field 0: and here are in second line
Field 1: Three
Field 2:
Field 3: Four
Field 4: .
```
Because we are looking for text between two consecutive field separators, we know we want every second field. So, for($i=1;$i<=$#F;$i+=2){print $F[$i]} will print the ones we care about.

The same idea but in awk:

$ awk -F'"' '{for(i=2;i<=NF;i+=2){print $(i)}}' file
One
Two 
 Three 
Four
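
For comparison, here is a minimal Python sketch of the same split-on-quote idea. The function name quoted_fields and its strip parameter are made up for this illustration; it only assumes that quotes are not nested, like the Perl and awk versions.

```python
def quoted_fields(line, strip=False):
    """Return the odd-numbered fields of `line` after splitting on double quotes."""
    fields = line.split('"')
    # Fields 1, 3, 5, ... sit after an opening quote, so they are the quoted text.
    # With unbalanced quotes, the last odd field has no closing quote
    # but is still returned.
    inside = fields[1::2]
    return [f.strip() for f in inside] if strip else inside


sample = [
    'first matched is "One". the second is here"Two "',
    'and here are in second line" Three ""Four".',
]

for line in sample:
    for word in quoted_fields(line, strip=True):
        print(word)
```

With strip=True the surrounding whitespace inside the quotes is removed; leaving it at the default keeps the fields exactly as they appear in the input.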

The key is to consume the quotes in your expression. Hard to do that with a single grep command. Here's a perl one-liner:

perl -0777 -nE 'say for /"(.*?)"/sg' file

That slurps the whole input and prints out the captured part of each match. It will work even if there's a newline inside the quotes, although it then becomes difficult to tell apart elements with and without newlines. To help with that, use a different character as the output record separator, the null character for instance:

perl -0777 -lne 'print for /"(.*?)"/sg} BEGIN {$\="\0"' <<DATA | od -c
blah "first" blah "second
quote with newline" blah "third"
DATA

0000000   f   i   r   s   t  \0   s   e   c   o   n   d  \n   q   u   o
0000020   t   e       w   i   t   h       n   e   w   l   i   n   e  \0
0000040   t   h   i   r   d  \0
0000046
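
The same slurp-and-match approach can be sketched in Python, assuming the whole input fits in memory. The sample text mirrors the here-document above; the variable names are only for illustration.

```python
import re

text = 'blah "first" blah "second\nquote with newline" blah "third"\n'

# re.DOTALL lets "." match newlines, like the /s flag in the Perl one-liner.
# ".*?" is non-greedy, so each match stops at the nearest closing quote.
matches = re.findall(r'"(.*?)"', text, flags=re.DOTALL)

# Terminate every match with NUL so embedded newlines stay unambiguous.
out = "".join(m + "\0" for m in matches)
print(repr(out))
```

Because the pattern consumes both the opening and the closing quote, the text between two quoted strings is never mistaken for a match.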

This is possible with the grep one-liner below, assuming that your quotation marks are balanced.

grep -oP '"\s*\K[^"]+?(?=\s*"(?:[^"]*"[^"]*")*[^"]*$)' file

Example:

$ cat file
first matched is "One". the second is here"Two "
and here are in second line" Three ""Four".
$ grep -oP '"\s*\K[^"]+?(?=\s*"(?:[^"]*"[^"]*")*[^"]*$)' file
One
Two
Three
Four

Another hair-pulling solution, using the PCRE verbs (*SKIP)(*F):

$ grep -oP '[^"]+(?=(?:"[^"]*"[^"]*)*[^"]*$)(*SKIP)(*F)|\s*\K[^"]+(?=\b\s*)' file
One
Two
Three
Four
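
To see how the balanced-quote lookahead in the first one-liner behaves, here is a rough Python sketch. Python's re module has no \K, so a capture group stands in for it; the function name find_quoted is invented for this example, and the code assumes balanced quotes on each line, as stated above.

```python
import re

# The lookahead accepts a candidate only if the quote that follows it is
# followed by an even number of further quotes up to the end of the line,
# i.e. the candidate really sits between an opening and a closing quote.
PATTERN = re.compile(r'"\s*([^"]+?)(?=\s*"(?:[^"]*"[^"]*")*[^"]*$)')


def find_quoted(line):
    """Return the trimmed quoted strings on one line (no trailing newline)."""
    return PATTERN.findall(line)


lines = [
    'first matched is "One". the second is here"Two "',
    'and here are in second line" Three ""Four".',
]

for line in lines:
    for word in find_quoted(line):
        print(word)
```

The closing quote is only inspected by the lookahead, never consumed, so the parity check is what stops the text between two quoted strings from matching. If a line has an odd number of quotes, the parity check is off and the results are unreliable.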

Using sed:

sed 's/[^"]*"\([^"]\+\)"[^"]*/\1\n/g' file

[^"]*

The ^ at the beginning of the character class [^"] negates it: the class matches any single character except ". The * means such characters can occur zero or more times, so [^"]* matches everything before the opening quote.

"\([^"]\+\)"

Everything inside \(...\) is a matching group. The " just before the group matches the opening quote. The character class [^"] that follows matches every character except ". The quantifier \+ (a GNU sed extension) means there must be at least one character between the quotes ("...") in your input file. Then comes \), the end of the matching group. This matching group can be accessed by its index via \1.

The last part, [^"]*, is the same as the first part: it matches everything up to the next ".
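
The three parts of that expression can be tried out in Python as well. This is only an illustrative sketch using re.sub with the same pattern; the name extract is made up here.

```python
import re

# Same three parts as the sed expression:
#   [^"]*        text before the opening quote
#   "([^"]+)"    the quoted text, captured as group 1
#   [^"]*        text up to the next opening quote
PATTERN = re.compile(r'[^"]*"([^"]+)"[^"]*')


def extract(line):
    """Replace each matched chunk with its quoted part plus a newline."""
    return PATTERN.sub(r'\1\n', line)


lines = [
    'first matched is "One". the second is here"Two "',
    'and here are in second line" Three ""Four".',
]

for line in lines:
    print(extract(line), end="")
```

Whitespace inside the quotes is kept, and a line without any quoted text passes through unchanged.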

An alternative approach in Python that doesn't require regular expressions (although it isn't exactly robust) is to process each line of your text file character by character.

The basic idea: if we see a double quote and the flag is not raised, raise the flag; if we see one again while the flag is raised, lower it. While the flag is raised, we know we're within double quotes, so we store the subsequent characters. Once the flag is lowered, print what we have read.

#!/usr/bin/env python
from __future__ import print_function
import sys
flag=False
quoted_string=[]
for line in sys.stdin:
    for char in line.strip():
        if char == '"':
            if flag:
                flag=False
                if quoted_string:
                    print("".join(quoted_string))
                    quoted_string=[]
            else:
                flag=True
            continue
        if flag:
            quoted_string.append(char)

And test run:

$ cat input.txt
first matched is "One". the second is here"Two "
and here are in second line" Three ""Four".
$ ./get_quoted_words.py < input.txt
One
Two 
 Three 
Four
