Beating my Friends in FruitBox
Beating my Friends in FruitBox, using OCR and AI.

In FruitBox, there's a grid of apples, each assigned a value from 1–9. Your goal as a player is to draw rectangles whose apples sum to exactly 10. As you make these groupings, the grid of apples gradually becomes more sparse, and your final score is based on how many apples you collect within the time limit. As a human player there's some strategy: each value appears in roughly equal numbers across the grid, so you might favor two-apple pairs (3+7, 4+6) when possible. If you're playing more slowly, you might accept that you'll hit a point ceiling and instead optimize for larger clusters of apples (1+1+1+7 instead of 3+7).
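To make the rule concrete, here's a minimal sketch (function and variable names are my own, not from the project) of what makes a rectangle a valid move: its cells must sum to exactly 10 and contain at least one remaining (non-zero) apple.

```python
import numpy as np

def is_valid_move(grid, r1, c1, r2, c2):
    """A rectangle is playable if its cells sum to exactly 10
    and it still contains at least one non-zero apple."""
    box = grid[r1:r2 + 1, c1:c2 + 1]
    return box.sum() == 10 and bool(np.any(box != 0))

grid = np.array([
    [3, 7, 1],
    [1, 1, 7],
    [2, 8, 5],
])

print(is_valid_move(grid, 0, 0, 0, 1))  # 3 + 7 = 10 -> True
print(is_valid_move(grid, 1, 0, 1, 2))  # 1 + 1 + 7 = 9 -> False
```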

Some of my friends got into the game, and for my taste they got too good too fast. Since I was late to the game and had ground to make up on the high-score board, I decided my best bet at catching up was to train my computer to catch up for me. I went through a number of different approaches, which I'll outline as briefly as possible below. I started with a greedy algorithm that just plays hundreds of games, picking the first valid cluster it finds over and over, but I quickly found I needed more sophisticated algorithms to compete with my friends, which eventually led me to train a neural network to play the game...
But first, here's how we did in the end, topping out at 143/170 points:

Heuristic solvers
Area-priority greedy: Each pass, enumerate all valid rectangles (sum = 10, at least one non-zero cell). Sort by area descending; take the largest valid one, zero it out, repeat. No backtracking; a linear number of passes; runs in under a second. Quality is good but not optimal.
Digit-priority greedy: Same enumeration, but rectangles are scored by the number of non-zero digits they clear (with small random tie-breaking). Runs multiple attempts (three by default) with different random seeds and keeps the best digit count. Typically beats area-priority; still no optimality guarantee.
Backtracking: Recursive search over valid rectangles. At each state, generate all valid rectangles, sort by non-zero count descending, then recurse on each choice. Pruning: if digits_cleared + remaining_nonzero ≤ best_count, the branch is skipped. An optional time limit returns the best solution found so far. Can find the optimal solution, but the complexity is exponential; used for analysis, not live play.
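Here's a minimal sketch of the area-priority variant (my own reconstruction, not the project's code): enumerate valid rectangles, clear the largest by area, and repeat until none remain.

```python
import numpy as np

def valid_rectangles(grid):
    """All rectangles whose cells sum to 10 and contain a non-zero apple."""
    rows, cols = grid.shape
    rects = []
    for r1 in range(rows):
        for c1 in range(cols):
            for r2 in range(r1, rows):
                for c2 in range(c1, cols):
                    box = grid[r1:r2 + 1, c1:c2 + 1]
                    if box.sum() == 10 and np.any(box != 0):
                        rects.append((r1, c1, r2, c2))
    return rects

def solve_area_greedy(grid):
    """Repeatedly clear the largest valid rectangle; returns moves and score."""
    grid = grid.copy()
    moves, cleared = [], 0
    while True:
        rects = valid_rectangles(grid)
        if not rects:
            return moves, cleared
        # Largest area first; no backtracking.
        r1, c1, r2, c2 = max(
            rects, key=lambda r: (r[2] - r[0] + 1) * (r[3] - r[1] + 1))
        box = grid[r1:r2 + 1, c1:c2 + 1]
        cleared += int(np.count_nonzero(box))
        box[...] = 0  # slices are views, so this clears the cells in place
        moves.append((r1, c1, r2, c2))
```

Swapping the `max` key for the rectangle's non-zero count (plus a random tiebreak) gives the digit-priority variant.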
Brute-force solver
BruteForceSolver enumerates all valid rectangles (sum = 10, at least one non-zero cell) once for the initial grid. Two search modes:
Optimized: Rectangles sorted by efficiency (non-zero count / area). Greedy solution used as lower bound for pruning. Iterates over combinations with early termination when remaining rectangles cannot beat best; best used for small or medium grids when optimality is needed.
Full brute force: Unpruned enumeration of combinations; guaranteed optimal when run to completion but expensive.
solve_greedy is a simple greedy baseline used internally (e.g. as lower bound).
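As a sketch of the pruned mode, here's a branch-and-bound over the initial rectangle set. Note one simplification of mine: this version requires chosen rectangles to be pairwise disjoint, whereas the real solver re-validates sums as cells are cleared. The efficiency sort, greedy lower bound, and early termination mirror the description above.

```python
import numpy as np

def best_disjoint_clear(grid, rects, lower_bound=0):
    """Branch-and-bound over disjoint rectangles; best digit count wins.
    rects: list of (r1, c1, r2, c2) valid on the initial grid.
    lower_bound: e.g. the greedy score, used to prune from the start."""
    def gain(r):
        r1, c1, r2, c2 = r
        return int(np.count_nonzero(grid[r1:r2 + 1, c1:c2 + 1]))

    # Sort by efficiency: non-zero cells cleared per unit of area.
    rects = sorted(
        rects,
        key=lambda r: gain(r) / ((r[2] - r[0] + 1) * (r[3] - r[1] + 1)),
        reverse=True)
    gains = [gain(r) for r in rects]
    best = lower_bound

    def overlaps(a, b):
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def recurse(i, chosen, cleared):
        nonlocal best
        # Prune: even taking every remaining rectangle can't beat the best.
        if cleared + sum(gains[i:]) <= best:
            return
        if cleared > best:
            best = cleared
        for j in range(i, len(rects)):
            if all(not overlaps(rects[j], c) for c in chosen):
                recurse(j + 1, chosen + [rects[j]], cleared + gains[j])

    recurse(0, [], 0)
    return best
```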
Before we execute a solution, we generate all the rectangles we are going to draw with the mouse and auto-clicker. Here's a preview of the rectangles being drawn on a grid:
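The drag itself depends on screen layout, but the mapping from grid cells to pixels is simple. A sketch, where GRID_ORIGIN and CELL_SIZE are hypothetical calibration values (in practice they come from locating the board during the OCR step):

```python
# Hypothetical calibration values; the real ones come from locating
# the board on screen during the OCR step.
GRID_ORIGIN = (120, 240)  # top-left pixel of cell (0, 0)
CELL_SIZE = 36            # pixels per cell

def rect_to_pixels(r1, c1, r2, c2):
    """Map a grid rectangle to the start/end pixels of a mouse drag,
    from the center of the top-left cell to the center of the bottom-right."""
    ox, oy = GRID_ORIGIN
    start = (ox + c1 * CELL_SIZE + CELL_SIZE // 2,
             oy + r1 * CELL_SIZE + CELL_SIZE // 2)
    end = (ox + c2 * CELL_SIZE + CELL_SIZE // 2,
           oy + r2 * CELL_SIZE + CELL_SIZE // 2)
    return start, end

# The auto-clicker would then perform the drag, e.g. with pyautogui:
#   pyautogui.moveTo(*start); pyautogui.dragTo(*end, duration=0.1)
start, end = rect_to_pixels(0, 0, 1, 2)
print(start, end)
```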

PPO
Environment: Custom Gymnasium RectangleEnv. State: 10×17 grid (shape (10, 17, 1)), values 0–9. Actions: discrete index into all possible rectangles (r1,c1,r2,c2) in row-major order; invalid actions (sum ≠ 10 or all zeros) are masked. Reward: number of non-zero digits cleared per step; optional large bonus for full clear. Episodes run on fixed grids from disk (train/validation); env cycles through a list of grids.
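A stripped-down sketch of how the env's action space, mask, and step logic fit together (a reconstruction of the description above, not the project's RectangleEnv; Gymnasium boilerplate omitted):

```python
import numpy as np

class RectangleEnvSketch:
    """Discrete actions index into a fixed row-major list of all
    rectangles; invalid rectangles are masked out each step."""

    def __init__(self, grid):
        self.rows, self.cols = grid.shape
        self.grid = grid.copy()
        # All rectangles (r1, c1, r2, c2) in row-major order: fixed action space.
        self.rects = [(r1, c1, r2, c2)
                      for r1 in range(self.rows) for c1 in range(self.cols)
                      for r2 in range(r1, self.rows) for c2 in range(c1, self.cols)]

    def action_masks(self):
        """True where the rectangle sums to 10 and clears at least one apple."""
        masks = np.zeros(len(self.rects), dtype=bool)
        for i, (r1, c1, r2, c2) in enumerate(self.rects):
            box = self.grid[r1:r2 + 1, c1:c2 + 1]
            masks[i] = box.sum() == 10 and np.any(box != 0)
        return masks

    def step(self, action):
        r1, c1, r2, c2 = self.rects[action]
        box = self.grid[r1:r2 + 1, c1:c2 + 1]
        reward = int(np.count_nonzero(box))  # digits cleared this step
        box[...] = 0
        done = not self.action_masks().any()  # episode ends with no valid moves
        return self.grid.copy(), reward, done
```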
Training: Stable-Baselines3 PPO. Policy: MLP (e.g. pi/vf [256,256] or [128,128] in fast mode). Action masking applied so the agent only selects valid rectangles. Training loads grids from data/train, validates on data/validation; supports MPS (Apple Silicon), CUDA, or CPU.
Inference: solve_rectangles_model builds a minimal env with the same action space and rectangle map, loads the saved PPO model, and runs the policy with masking until no valid action exists. Used by main.py for live solving.
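The key detail at inference time is that the mask is applied to the policy output before the argmax, so the model can only ever pick a valid rectangle. A sketch of that selection step (hypothetical logits, not the real model):

```python
import numpy as np

def select_masked_action(logits, mask):
    """Pick the highest-scoring VALID action; returns None when nothing
    is valid, which is the solver's termination condition."""
    if not mask.any():
        return None
    masked = np.where(mask, logits, -np.inf)  # invalid actions can never win
    return int(np.argmax(masked))

logits = np.array([2.0, 0.5, 3.0, 1.0])
mask = np.array([False, True, False, True])
print(select_masked_action(logits, mask))  # -> 3 (index 2 is masked out)
```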
GNN/CNN RL
Separate from the PPO pipeline. State: raw grid. Action: continuous (r1, c1, r2, c2) from a BoxActionHead (MLP outputting 4 coordinates, clamped to grid and r2 ≥ r1, c2 ≥ c1). Feature extractor: SpatialAttentionCNN — conv layers with spatial attention (1×1 conv + sigmoid), layer norm, then global pooling to a single vector. Training: experience replay; steps in a FruitBoxEnvironment that validates each box (sum must be 10, at least one non-zero) and rewards by digits cleared. Loss combines policy (e.g. continuous action loss) and value head; no discrete action space. BruteForceSolver (or mock) is used only to list valid rectangles for masking or termination, not for action space.
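Since the head emits raw continuous values, they have to be post-processed into a well-formed box. A sketch of that clamping (my own reconstruction of the behavior described above, using the 10×17 grid dimensions):

```python
import numpy as np

def clamp_box(raw, rows=10, cols=17):
    """Turn 4 raw continuous outputs into a valid (r1, c1, r2, c2):
    round to cells, clamp to the grid, and enforce r2 >= r1, c2 >= c1."""
    r1, c1, r2, c2 = raw
    r1 = int(np.clip(round(r1), 0, rows - 1))
    r2 = int(np.clip(round(r2), 0, rows - 1))
    c1 = int(np.clip(round(c1), 0, cols - 1))
    c2 = int(np.clip(round(c2), 0, cols - 1))
    if r2 < r1:
        r1, r2 = r2, r1  # enforce ordering rather than rejecting the action
    if c2 < c1:
        c1, c2 = c2, c1
    return r1, c1, r2, c2

print(clamp_box([3.6, -2.0, 1.2, 40.0]))  # -> (1, 0, 4, 16)
```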
CNN solver
FruitBoxCNN: Encoder (conv stack + spatial attention) maps grid to latent vector; decoder reconstructs grid (e.g. for unsupervised pretraining). Strategy head: latent → 4-D (rectangle coordinates). Value head: latent → scalar. Can be trained with autoencoder loss on grids and/or supervised/RL objectives on moves. StrategyCNN (in same file) is a patch-based design with move history (LSTM/Transformer) for sequence-aware play. KMeans/TSNE used in scripts for latent analysis. Invoked via Makefile (train-cnn, test-cnn, etc.); not wired into main.py's default solve path.
With several trained models in hand, I let each play several hundred games on the same grids and used the resulting statistics to estimate a theoretical maximum score, which determined the better model. Here's an example of such a comparison:

