TheoreticallyMedia

June 11, 2026

The Cinematic Mega Prompt: From 2×2 Stills to a Moving Shot

The full prompt stack from the Cinematic Prompt Technique video — the Nano Banana Pro 2×2 template, the optional upscale pass, and the handheld video prompt. Copy, paste, direct.

Wall of text incoming — and that’s the point. This is the full prompt stack from the Cinematic Prompt Technique video: a three-stage pipeline that takes you from reference images to a 2×2 grid of cinematic stills, through an optional upscale pass, and out the other side with a video prompt built for handheld tension. It originally ran in Theoretically News, the newsletter — this is its permanent home.

The workflow targets Nano Banana Pro for the image stages, but the front end runs through any LLM you like. The smarter, the better.

Stage 1 — The 2×2 grid

Paste the block below into your LLM along with your reference images of characters, locations, or props. Two things matter here: respect the token limit — two or three reference images, not thirty — and your direction goes in the top line. One or two sentences of what’s happening in the moment. Action and intent only.

USER SCENARIO INPUT (user writes only this)
 [ONE OR TWO SENTENCES: what is happening in the moment. Action/intent only.]
SYSTEM INSTRUCTIONS (model must obey)
HARD OUTPUT
 Output ONLY the final image-generation prompt. No headings, no analysis, no brackets, no "template" text. Do not mention reference images.

REFERENCES (FAIL CLOSED)
 If you cannot access the attached images, output exactly: MISSING REFERENCES

LENGTH (CRITICAL)
 Final prompt must be 2,200–3,000 characters total (including spaces). If too long, compress by removing redundancy. Never remove the 4 frame lines or the Grounding Block.

SOURCE OF TRUTH
 All appearance + environment details must come exclusively from the attached images. Do not invent wardrobe, props, architecture, or lighting sources.
 Scenario elements not visible may be described only generically (no design specifics).

CONTINUITY
 All four frames show the same frozen instant from different angles and distances, unless the Scenario input specifies otherwise.

FINAL OUTPUT FORMAT (follow exactly; keep it tight)
A) HEADER (must start exactly like this)
 Generate a photorealistic cinematic 2x2 grid of still frames from a live-action film depicting [scenario summary from user input]. The imagery must feel grounded, restrained, and emotionally heavy. Realistic lighting, real materials, deliberate camera placement. No animation style, no painterly rendering, no exaggerated fantasy glow. Real weight, real consequence.
 Also specify: 2x2 contact sheet grid, four equal panels, thin borders, no text, no captions, no watermark.
B) FRAME LIST (must appear immediately after header; 1–2 sentences each)
 Frame 1 (WIDE): [establish space + subject placement from the location image]
 Frame 2 (MED/OTS or PROFILE): [maintain screen direction + scale; show relationship between subject(s) and threat/action]
 Frame 3 (CLOSE): [reaction/detail; micro-expression + tension; same instant]
 Frame 4 (REVERSE WIDE / ALT ANGLE): [reinforce continuity; read environment]
C) ENVIRONMENT (1 short paragraph, 3–5 sentences)
 Describe only what's visible in the location image: layout, materials, wear, moisture/haze, practical light sources, atmosphere.
D) CHARACTERS (1 short paragraph total unless multiple distinct subjects, then 1 sentence each)
 Confirm only visually supported traits: silhouette, key clothing layers, materials, wet/dirt/wear, hair shape, defining facial features. Keep minimal.
E) ACTION (1 short paragraph, 2–4 sentences)
 Translate the user scenario into physical posture/tension/gaze in the same frozen instant. No new events.
F) TECHNICAL (1–3 sentences)
 Only if supported by the reference look or metadata: lens type/DoF behavior + lighting direction/falloff. Do not guess film stock/camera brand unless provided.
G) GROUNDING BLOCK (must be verbatim, at the end)
 Photorealistic live-action realism.
 Real materials, real lighting, real physics.
 No fantasy glow, no stylized rendering, no illustrative techniques.
 Everything must feel physically present, heavy, and believable.

The output lands somewhere around 3,000 characters, which works well in Nano Banana Pro.
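
If you want to sanity-check what the LLM hands back before pasting it into Nano Banana Pro, the hard rules in the template are easy to test mechanically. The following is a minimal sketch, not part of the original stack: the function name `check_grid_prompt` and its messages are hypothetical, and it compares the Grounding Block with whitespace collapsed, which is slightly looser than "verbatim".

```python
# Hypothetical helper: checks an LLM-written grid prompt against the
# template's hard rules (length, header, four frame lines, grounding block).
GROUNDING_BLOCK = (
    "Photorealistic live-action realism.\n"
    "Real materials, real lighting, real physics.\n"
    "No fantasy glow, no stylized rendering, no illustrative techniques.\n"
    "Everything must feel physically present, heavy, and believable."
)
HEADER_START = (
    "Generate a photorealistic cinematic 2x2 grid of still frames "
    "from a live-action film depicting"
)
FRAME_LABELS = ("Frame 1", "Frame 2", "Frame 3", "Frame 4")


def check_grid_prompt(text: str) -> list[str]:
    """Return a list of problems; an empty list means the prompt passes."""
    text = text.strip()
    # The template tells the LLM to fail closed with this exact string.
    if text == "MISSING REFERENCES":
        return ["the LLM could not see the reference images"]

    problems = []
    if not 2200 <= len(text) <= 3000:
        problems.append(f"length is {len(text)} characters, expected 2,200-3,000")
    if not text.startswith(HEADER_START):
        problems.append("header does not start with the required sentence")
    for label in FRAME_LABELS:
        if label not in text:
            problems.append(f"missing {label}")

    # Assumption: line breaks vs. spaces do not matter, so collapse whitespace
    # before checking that the grounding block closes the prompt.
    collapsed = " ".join(text.split())
    if not collapsed.endswith(" ".join(GROUNDING_BLOCK.split())):
        problems.append("grounding block is missing or not at the end")
    return problems
```

An empty list means the prompt cleared every check. Anything else is a cue to ask the LLM for another pass rather than fixing the prompt by hand.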

Stage 2 — The upscale pass (optional)

Once you’ve made your selects and cropped your images, run them back through Nano Banana Pro with PJ Ace’s upscale prompt:

Preserve the exact composition, framing, camera angle, color grade, and subject placement. Do not alter or add new elements.

Increase resolution to true high-end cinematic clarity, with natural film-grade sharpness (no AI oversharpening).

Apply blockbuster-style cinematography:

– grounded realism

– soft, motivated lighting

– natural contrast

– restrained highlights

– deep but clean shadows

– subtle atmospheric depth

– Keep the same color temperature and color tone

Texture pass should feel physically real: skin pores, fabric weave, dust, stone, metal, wood — all enhanced without plastic smoothing.

Maintain cinematic depth of field consistent with the original image (natural lens falloff, no artificial blur).

Do not redraw. Do not stylize. Do not beautify. Only enhance realism and resolution.
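
The cropping step at the start of this stage comes down to splitting the 2×2 sheet into four equal panels. Here is a rough sketch of the arithmetic, assuming the panels are equal in size and that a fixed `inset` in pixels is enough to trim the thin borders. The function name `grid_crop_boxes` and the `inset` parameter are hypothetical, and the right inset will depend on how thick the borders came out.

```python
# Hypothetical helper: computes crop boxes for the four panels of a 2x2 grid.
def grid_crop_boxes(
    width: int, height: int, inset: int = 0
) -> list[tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes in reading order:
    top-left, top-right, bottom-left, bottom-right."""
    half_w, half_h = width // 2, height // 2
    boxes = []
    for row in range(2):
        for col in range(2):
            left, top = col * half_w, row * half_h
            # Shrink each panel by `inset` on every side to drop the border.
            boxes.append(
                (left + inset, top + inset,
                 left + half_w - inset, top + half_h - inset)
            )
    return boxes
```

If you work with Pillow, each box can be passed straight to `Image.crop`, which takes the same (left, upper, right, lower) order. Check which panel holds which frame by eye, since nothing in the template fixes where each frame lands on the sheet.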

Stage 3 — The video prompt

Take your upscaled frame and run it back into the LLM to generate the video prompt. Fair warning: I’m calling for a lot of handheld movement and camera shake here. You may not like that — look it over and tweak it to your taste. And once again, user input at the top. You’re the director.

FINAL VIDEO PROMPT (OUTPUT ONLY)
Analyze the attached reference image and extend this exact moment into motion.
Create a single continuous video shot that feels raw, handheld, and unstable, as if filmed under pressure.
CAMERA (CRITICAL — PRIORITY)
 The camera is handheld and shaky at all times.
 Constant micro-jitter, uneven framing, slight rotational wobble, and irregular motion blur caused by real human movement.
 No stabilization. No smooth pans. No dolly or crane motion.
 Occasional small grip adjustments cause brief, imperfect micro-whips.
 Camera shake must be visible throughout the entire shot and drive the tension.
SHOT
 Close-to-medium handheld shot of the subject from the reference image, in the same environment.
 The camera reacts to the subject rather than controlling them.
 If dialogue is provided, the subject delivers "[INSERT DIALOGUE HERE]" with urgency and strain.
PERFORMANCE
 Expression shifts subtly during the shot: jaw tightening, breath visible, eyes flicking, tension rising.
 The action remains one continuous moment — no cuts, no time jump.
MOTION BLUR
 Motion blur is present due to camera shake and movement, not post effects.
 Blur increases slightly during emotional emphasis or camera instability.
LIGHTING & SPACE
 Lighting remains consistent with the reference image.
 Practical light sources only.
 Minor exposure fluctuation occurs naturally due to camera movement, not lighting changes.
STYLE CONSTRAINTS
 Photorealistic live-action.
 No cinematic smoothness.
 No slow motion.
 No stylized effects.
 No animation-like motion.
The shot should feel like a single raw take, captured in the middle of chaos.
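
The dialogue line is the one slot in this template you edit by hand. If you reuse the block often, a small script can fill it in, or drop the line entirely when the shot has no dialogue. This is a sketch under assumptions: the function name `fill_dialogue` is hypothetical, and it assumes the template is stored as plain text with the `[INSERT DIALOGUE HERE]` placeholder intact.

```python
from typing import Optional

# The placeholder exactly as it appears in the video prompt template.
DIALOGUE_SLOT = "[INSERT DIALOGUE HERE]"


def fill_dialogue(template: str, dialogue: Optional[str] = None) -> str:
    """Insert the dialogue into the template, or remove the dialogue
    line when no dialogue is given (hypothetical helper)."""
    lines = template.splitlines()
    if dialogue and dialogue.strip():
        # Swap the placeholder for the actual line of dialogue.
        return "\n".join(
            line.replace(DIALOGUE_SLOT, dialogue.strip()) for line in lines
        )
    # No dialogue: drop any line that still carries the placeholder.
    return "\n".join(line for line in lines if DIALOGUE_SLOT not in line)
```

Dropping the line for silent shots keeps a stray placeholder out of the prompt, rather than relying on the LLM to ignore it.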

That’s the stack

Three stages, each one optional after the first. The full walkthrough — including why the frame list and grounding block earn their place — is in the video. If you want prompts like this in your inbox when they’re fresh, that’s what the newsletter is for.