Extract Nth Words in Google Sheets
2022-08-31 • 4 min read
In this problem, we have two columns, one containing a list of phrases, and the other containing a list of comma-separated numbers specifying which words to extract.
In this article I will explain how to achieve the above Desired result using a single ArrayFormula.
Generic Formula
=ARRAYFORMULA(
TRANSPOSE(
QUERY(
TRANSPOSE(
IFERROR(
REGEXREPLACE(
SUBSTITUTE(" "&PHRASES," ","❄️",1/SPLIT(EXTRACT,",")^-1),
".*❄️(\w+).*",
"$1")))
,,9^9)))
- PHRASES: Array containing the list of phrases.
- EXTRACT: Array containing the list of comma-separated numbers specifying which words to extract.
Warning
PHRASES and EXTRACT must have the same size otherwise the formula will return an error.
Note
This formula was designed as an ArrayFormula but it returns the desired result even if PHRASES and EXTRACT are not arrays.
Explanation
Step 1 - Split all the numbers in EXTRACT into their own cell
The first step is to split every comma-separated number contained in the EXTRACT array into its own cell. This is easily done with the SPLIT() function.
=ARRAYFORMULA(SPLIT(B2:B10,","))
Step 2 - Turn the blank cells into something that returns an error when entered as the 4th parameter of SUBSTITUTE()
The numbers in Step 1 will later on be used as the 4th parameter of SUBSTITUTE(). However, we want SUBSTITUTE() to ignore the blank cells since they are by default treated as 0 and interpreted as "replace all occurrences".
To solve this problem, we can turn those blank cells into something that when entered as the 4th parameter of SUBSTITUTE() returns an error, which could be a text value, a negative number or an error directly generated with NA() or N/0 where N is any number. I will be using the latter.
There are two ways to tackle this down:
First way
If the cell is blank, evaluate 1/0, otherwise, return the content of the cell.
=ARRAYFORMULA(IF(SPLIT(B2:B10,",")="",1/0,SPLIT(B2:B10,",")))
Second way
Divide 1 by the cell value raised to the power of negative one.
=ARRAYFORMULA(1/SPLIT(B2:B10,",")^-1)
This works because Sheets interprets 1/N^-1 = N if N ≠ 0, whereas if N = 0, 1/N^-1 returns the following error:
* It's important to keep in mind that when performing mathematical operations, blank cells are treated as 0.
Step 3 - Insert a space before the first character in the phrases list and substitute the nth occurrences of the spaces with a placeholder character
For clarity, the nth occurrences I am referring to are simply the numerical values from the previous step(s). The reason we insert a space at the beginning of each string it's because in case the nth occurrence is a 1, we want the placeholder character to be placed before the 1st word. (We will see why in the following step)
I will be using a snowflake ❄️ as the placeholder character.
* We can also decide to insert the word separator at the end of each string but this will require a slightly different approach in the next step. I will assume that it was entered at the beginning
=ARRAYFORMULA(SUBSTITUTE(" "&A2:A10," ","❄️",1/SPLIT(B2:B10,",")^-1))
Step 4 - Remove all the words not preceded by the placeholder character
* In case the space was inserted after the last character, remove all the words not succeeded by the placeholder character
This is done by applying the following regular expressions to the result of the previous step:
Find: .*❄️(\w+).*
Replace with: $1
=ARRAYFORMULA(REGEXREPLACE(SUBSTITUTE(" "&A2:A10," ","❄️",1/SPLIT(B2:B10,",")^-1),".*❄️(\w+).*","$1"))
Tip
To learn more about regular expressions give a look at this as well as the additional resources mentioned there.
Step 5 - Get rid of all the errors
Wrap the previous formula inside an IFERROR() to remove all the errors.
=ARRAYFORMULA(IFERROR(REGEXREPLACE(SUBSTITUTE(" "&A2:A10," ","❄️",1/SPLIT(B2:B10,",")^-1),".*❄️(\w+).*","$1")))
Step 6 - Perform an array formula row-by-row TEXTJOIN()
The sixth and final step is to merge all the words into a single column while preserving their relative positioning. This is done using a technique known as query smush.
=ARRAYFORMULA(TRANSPOSE(QUERY(TRANSPOSE(IFERROR(REGEXREPLACE(SUBSTITUTE(" "&A2:A10," ","❄️",1/SPLIT(B2:B10,",")^-1),".*❄️(\w+).*","$1"))),,9^9)))
That's all. Thanks for reading.