# Advent of Code 2022 Day 7 in Google Sheets

2022-12-22 • 9 min read

In this article I will explain how I solved both parts of Advent of Code 2022 Day 7 in Google Sheets using a single formula.

Advent of Code is an Advent calendar of small programming puzzles for a variety of skill sets and skill levels that can be solved in any programming language you like. People use them as a speed contest, interview prep, company training, university coursework, practice problems, or to challenge each other. Read more...

In this puzzle we are given an input in the following form:

```
$ cd /
$ ls
dir a
14848514 b.txt
8504156 c.dat
dir d
$ cd a
$ ls
dir e
29116 f
2557 g
62596 h.lst
$ cd e
$ ls
584 i
$ cd ..
$ cd ..
$ cd d
$ ls
4060174 j
8033020 d.log
5626152 d.ext
7214296 k
```

Which represents a shell session of a user trying to figure out which files to delete to free up space for a system update.

The numbers before the files represent their size and dir xyz indicates that the current directory contains a directory named xyz.

Our goal is to (1) find the sum of the total sizes of the directories with a total size of at most 1e+5 and (2) given that the total disk space available to the filesystem is 7e+7 and that the space required to run the update is 3e+7, find the total size of the smallest directory that, if deleted, would free up enough space to run the update.

* The full problem can be found here.

# Final Formula

Assuming that the input is in cell A1, the formula that solves both parts is:

`=ARRAYFORMULA(LAMBDA(full_sizes,INDEX(QUERY(IF({1,-1}*full_sizes>{1e+5,4e+7-SUM(--REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"(\d+)","($1)"),"$","\$")))},,full_sizes),"SELECT SUM(Col1), MIN(Col2)"),2))(LAMBDA(full_dirs,BYROW(--REGEXEXTRACT(full_dirs,REGEXREPLACE(full_dirs,"(\d+)","($1)")),LAMBDA(sizes,SUM(sizes))))(LAMBDA(paths,contents,MAP(REGEXREPLACE(contents,"dir (.*)","dir "&paths&"/$1"),LAMBDA(content,REDUCE(content,paths,LAMBDA(acc,null,IFNA(SUBSTITUTE(acc,"dir "®EXEXTRACT(acc,"dir (.*)"),VLOOKUP(REGEXEXTRACT(acc,"dir (.*)"),{paths,REGEXREPLACE(contents,"dir (.*)","dir "&paths&"/$1")},2,0)),acc))))))(QUERY(UNIQUE(SCAN(,FLATTEN(REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"\$ cd ([\w.]+)","$ cd ($1)"),"$","\$"))),LAMBDA(abs,cur,IF(cur<>"..",abs&"/"&cur,REGEXEXTRACT(abs,"(.*)/"))))),"WHERE Col1 IS NOT NULL"),QUERY(FLATTEN(REGEXEXTRACT(A1&CHAR(10)&"$",SUBSTITUTE(REGEXREPLACE(A1&CHAR(10)&"$","(?ms)(\$ ls\n)(.*?)(\n\$)","$1($2)$3"),"$","\$"))),"OFFSET 1")))))`

# Explanation

## Step 1 - Extract the visited directories

The first step is to extract all the visited directories in the order they were visited. We will be doing this using regular expressions.

In order to extract all the occurrences of a match from a string, we first have to capture them, which means to place them inside a set of parentheses. The function we will use for this is REGEXREPLACE.

`=REGEXREPLACE(A1, "\$ cd ([\w.]+)","$ cd ($1)")`

The regular expression \$ cd ([\w.]+) is telling our function to capture all the word characters or a literal dot one or more times ([\w.]+) but only if they are preceded by $ cd .

In the second argument $ cd ($1) we are simply telling our function to enclose the previously captured substring $1 in a set of round brackets while keeping everything else the same $ cd .

Now that we have captured all the visited directories, we can extract them with REGEXEXTRACT.

`=REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"\$ cd ([\w.]+)","$ cd ($1)"),"$","\$"))`

The reason we also need SUBSTITUTE is because $ is a special character in regex so we have to escape it \$.

### Alternative approach without REGEX

Another way to do this is without resorting to regex is to use a combination of SPLIT, FLATTEN and QUERY.

`=ARRAYFORMULA(QUERY(SPLIT(FLATTEN(SPLIT(A1,"$ cd ",0)),CHAR(10)),"SELECT Col1 OFFSET 1"))`

The formula is splitting by $ cd , flattening the output, splitting again by the newline, selecting the first column of the new output and ditching the root directory with OFFSET 1.

## Step 2 - Transform the visited directories into their full path

We can now create an absolute path of all the visited directories, this is done with a simple IF statement: if the current directory is not .., append it to the previous directory, otherwise extract everything from the previous directory until the last occurrence of /.

`=IF(B1<>"..",A2&"/"&B1,REGEXEXTRACT(A2,"(.*)/"))`

The formula above is dragged across the range B2:F2. In order to do it with a single formula we have to use SCAN.

`=SCAN(,B1:F1,LAMBDA(abs,cur,IF(cur<>"..",abs&"/"&cur,REGEXEXTRACT(abs,"(.*)/"))))`

Where abs indicates the absolute path and cur indicates the current path.

## Step 3 - Remove duplicate paths and extract the content of the listed directories

The third step consists of removing the duplicate paths and extracting the content of the listed directories in the order they were listed.

To remove the duplicates, we will be using the UNIQUE function

`=QUERY(UNIQUE(SCAN(,FLATTEN(B1:F1),LAMBDA(abs,cur,IF(cur<>"..",abs&"/"&cur,REGEXEXTRACT(abs,"(.*)/"))))),"WHERE Col1 IS NOT NULL")`

The QUERY function is there to remove the blank cell (which indicates the root directory).

Now we extract the content of the listed directories in the order they were listed. We will be doing this in a very similar way to how we extracted the visited directories in Step 1.

`=FLATTEN(REGEXEXTRACT(A1&CHAR(10)&"$",SUBSTITUTE(REGEXREPLACE(A1&CHAR(10)&"$","(?ms)(\$ ls\n)(.*?)(\n\$)","$1($2)$3"),"$","\$")))`

The regular expression \$ ls\n(.*?)\n\$ is telling our function to capture everything .*?starting from the character after the newline \n succeeding $ ls until the character preceding the next occurrence of newline \n followed by $. This regular expression is essentially capturing the content of all the listed directories which we then extract using REGEXEXTRACT.

## Note

The reason we also captured $ ls\n and \n$ in our original formula is because in Google Sheets there isn't a nice way to insert newlines in formulas, so we take advantage of the newlines captured using the regex syntax to reuse them in the replacement field of REGEXREPLACE.

The goal here is to have the visited directories aligned with their content. This is easily doable because every $ cd xyz is followed by $ ls and no directory is visited or listed more than once, the problem would have been more complex if that wasn't the case. Additionally, for the purpose of this puzzle, we can completely ignore the content of the root directory, so our new formula becomes:

`=QUERY(FLATTEN(REGEXEXTRACT(A1&CHAR(10)&"$",SUBSTITUTE(REGEXREPLACE(A1&CHAR(10)&"$","(?ms)(\$ ls\n)(.*?)(\n\$)","$1($2)$3"),"$","\$"))),"OFFSET 1")`

We can now see that the content of each directory is aligned with its full path, which sets us up for the next step.

### Alternative approach without REGEX

Just like in Step 1, we can solve this part without relying on regex.

`=ARRAYFORMULA(TRIM(QUERY(SPLIT(FLATTEN(SPLIT(A1,"$ ls",0)),"$ cd",0),"SELECT Col1 OFFSET 2")))`

This formula is conceptually the same as the formula used in Step 1.

## Step 4 - Replace all the dir xyz with the content of xyz

The goal of this step to remove all the occurrences of dir xyz and replace them with the actual content of xyz which will allow us to know the full size of each path.

This step would have been trivial if each directory didn't contain more than one directory. However, in the final input, this is not the case, so we have to keep replacing dir xyz with the content of xyz until there are no directories left.

The function designed to solve these type of problems is REDUCE which I consider to be one of the most powerful functions in Sheets.

We will be doing this in two steps.

### Step 4.1 - Replace xyz with its full path

First, we have to determine what's the full path of xyz so we can use that information to look it up in our list of absolute paths B2:B4 and return its content.

The function we will use for this is once again REGEXREPLACE.

`=ARRAYFORMULA(REGEXREPLACE(C2:C4,"dir (.*)","dir "&B2:B4&"/$1"))`

The above regular expressions translates to: capture whatever .* comes after dir and replace it with dir followed by the full path up until that point B2:B4 followed by / followed by what was previously captured $1.

### Step 4.2

Now that we know the full path of each nested directory and the content of that full path, we can proceed with the replacement.

We will use REGEXEXTRACT to extract the substring to be replaced (the full path of the nested directory).

`=REGEXEXTRACT(D2,"dir (.*)")`

Followed by VLOOKUP to return its content.

`=VLOOKUP(REGEXEXTRACT(D2,"dir (.*)"),{B2:B4,D2:D4},2,0)`

Now we can proceed with the actual replacement.

`=SUBSTITUTE(D2,"dir "®EXEXTRACT(D2,"dir (.*)"),E2)`

As previously mentioned, if we only had to worry about replacing one directory like in the case of the test input, the above formula would be fine; however, in the final input, that's not the case and here's where REDUCE comes into play.

We need to repeatedly apply the above formula until there are no dir left. The maximum amount of times we will ever have to apply the formula corresponds to the total amount of unique paths B2:B4, so we will use that information to tell REDUCE how many iterations to run.

`=REDUCE(D2,B2:B4,LAMBDA(acc,null,IFNA(SUBSTITUTE(acc,"dir "®EXEXTRACT(acc,"dir (.*)"),VLOOKUP(REGEXEXTRACT(acc,"dir (.*)"),{B2:B4,D2:D4},2,0)),acc)))`

I added an extra directory dir /d inside
/a (cell D2)
to show the difference between using only
`SUBSTITUTE`

(cell
F2) and
SUBSTITUTE + REDUCE (cell
G2).

You may have also noticed that there is an IFNA wrapper on SUBSTITUTE, this is needed because the majority of the times, REDUCE will run more iterations than necessary and when REGEXEXTRACT tries to extract the next path, it returns a #N/A error because there would be nothing to extract as every path would have already been replaced with its content. When that occurs, we just tell REDUCE to keep returning the accumulated value acc until all the iterations have been completed.

Finally, to apply this formula to the entire array we can either use BYROW or MAP.

`=MAP(D2:D4,LAMBDA(content,REDUCE(content,B2:B4,LAMBDA(acc,null,IFNA(SUBSTITUTE(acc,"dir "®EXEXTRACT(acc,"dir (.*)"),VLOOKUP(REGEXEXTRACT(acc,"dir (.*)"),{B2:B4,D2:D4},2,0)),acc)))))`

## Step 5 - Determine the total size of each directory

We now have the required information to determine the total size of each directory, which is simply given by the sum of all the numeric values inside that directory.

So the first step we take is to extract all the numeric values using REGEXEXTRACT in combination with REGEXREPLACE.

`=ARRAYFORMULA(REGEXEXTRACT(E2:E4,REGEXREPLACE(E2:E4,"(\d+)","($1)")))`

Then we can use BYROW to sum them.

```
=ARRAYFORMULA(BYROW(--REGEXEXTRACT(E2:E4,REGEXREPLACE(E2:E4,"(\d+)",
"($1)")),LAMBDA(sizes,SUM(sizes))))
```

* By default REGEXEXTRACT returns a string so we are first converting the "numbers" to real numbers with -- which is the equivalent of multiplying by -1 twice.

## Step 6 - Solve

Now that we know the total size of each directory we can solve both parts of the puzzle.

### Part 1

For the first part, we have to sum the total sizes of the directories with a total size of at most 1e+5. We will do this by filtering out the directories whose size exceed 1e+5 and then taking the sum.

`=ARRAYFORMULA(SUM(IF(F2:F4>1e+5,,F2:F4)))`

### Part 2

For the second part, given that the total disk space available to the filesystem is 7e+7 and that the space required to run the update is 3e+7, we have to find the total size of the smallest directory that, if deleted, would free up enough space to run the update.

We can find the amount needed to be freed up with required_space - (total_space - used_space) = 3e+7 - (7e+7 - used_space) = -4e+7 + used_space where used_space is the sum of all the numeric values in our input.

So to find the space needed to be freed we can use the following formula:

`=ARRAYFORMULA(-4e+7+SUM(--REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"(\d+)","($1)"),"$","\$"))))`

Now all we have to do is to filter out the directories whose size does not exceed the value we've just calculated and then take the minimum.

`=ARRAYFORMULA(MIN(IF(F2:F4<H2,,F2:F4)))`

And that is our answer to the second part.

## Step 7 - Condense everything into one formula

We can solve both parts with a single formula by taking advantage of how Google Sheets deals with array formulas. First, we flip the comparison operator of either part 1 or part 2 so they both match. I will flip the operator of part 2:

`=ARRAYFORMULA(MIN(IF(H2>F2:F4,,F2:F4)))`

Then we remove both the SUM and MIN functions, so our new formulas in cells G2 and I2 respectively become:

`=ARRAYFORMULA(IF(F2:F4>1e+5,,F2:F4))`

`=ARRAYFORMULA(IF(H2>F2:F4,,F2:F4))`

Now we choose one of the two formulas and swap the sides of the inequality, I will again choose the second one so our formulas become:

`=ARRAYFORMULA(IF(F2:F4>1e+5,,F2:F4))`

`=ARRAYFORMULA(IF(-F2:F4>−H2,,F2:F4))`

Which we can now condense into a single formula:

`=ARRAYFORMULA(IF({1,-1}*F2:F4>{1e+5,-I2},,F2:F4))`

And finally we can use QUERY to take the SUM of the directories with size <= 1e+5 and the MIN of the directories with size >= I2.

`=INDEX(QUERY(IF({1,-1}*F2:F4>{1e+5,-I2},,F2:F4),"SELECT SUM(Col1), MIN(Col2)"),2)`

From now on it's quite trivial as it's just a matter of substituting the ranges/cells with the formulas that are generating them.

After doing all the substitutions the final formula becomes:

`=ARRAYFORMULA(LAMBDA(full_sizes,INDEX(QUERY(IF({1,-1}*full_sizes>{1e+5,4e+7-SUM(--REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"(\d+)","($1)"),"$","\$")))},,full_sizes),"SELECT SUM(Col1), MIN(Col2)"),2))(LAMBDA(full_dirs,BYROW(--REGEXEXTRACT(full_dirs,REGEXREPLACE(full_dirs,"(\d+)","($1)")),LAMBDA(sizes,SUM(sizes))))(LAMBDA(paths,contents,MAP(REGEXREPLACE(contents,"dir (.*)","dir "&paths&"/$1"),LAMBDA(content,REDUCE(content,paths,LAMBDA(acc,null,IFNA(SUBSTITUTE(acc,"dir "®EXEXTRACT(acc,"dir (.*)"),VLOOKUP(REGEXEXTRACT(acc,"dir (.*)"),{paths,REGEXREPLACE(contents,"dir (.*)","dir "&paths&"/$1")},2,0)),acc))))))(QUERY(UNIQUE(SCAN(,FLATTEN(REGEXEXTRACT(A1,SUBSTITUTE(REGEXREPLACE(A1,"\$ cd ([\w.]+)","$ cd ($1)"),"$","\$"))),LAMBDA(abs,cur,IF(cur<>"..",abs&"/"&cur,REGEXEXTRACT(abs,"(.*)/"))))),"WHERE Col1 IS NOT NULL"),QUERY(FLATTEN(REGEXEXTRACT(A1&CHAR(10)&"$",SUBSTITUTE(REGEXREPLACE(A1&CHAR(10)&"$","(?ms)(\$ ls\n)(.*?)(\n\$)","$1($2)$3"),"$","\$"))),"OFFSET 1")))))`

And that's it. Thanks for reading.