Programming

More from Less

Too much has been written already about AI Art and how it is killing puppies.

But even people like me who enjoy playing with generative AI tools still don't like finding AI-art-based images mixed in with images made by humans the old-fashioned ways—by hand (analog art or digital painting, or pixel art), by camera, or by CGI. When I'm looking at photos, I don't want to see some random fake non-photo mixed in there. When I'm looking at art made by humans, I want to see art made by humans. Perhaps just share AI art in places where AI art is shared, with a prominent label. If we don't cross the streams, perhaps fewer people will get cranky.

Given that starting place, this article documents the tail-end of a long multi-year process of human-machine hybrid image creation. (I won't call it art because it makes the anti-AI knee-jerk brigade come out of the woodwork.) But assuming the reader is not some sort of Luddite, he may follow along.

And this article is not about the images themselves, per se, or whether this is good or bad for artists or for society as a whole, or even about AI itself; it's about what one programmer figured out how to do with certain tools available today. This is actually a programming journey.

What Are We Even Talking About?

My goal has gradually crystallized into this: overcoming the resolution limits of image generation tools and creating appealing super-high-resolution poster-sized images, up to 8k or in the 30 Megapixel to 60 Megapixel range (example 1, 2), with increasing levels of detail in places that make sense.

An example image

For the purposes of this article, assume the subject is portraiture/figurative and the max level of detail on the image is at the face. The background would naturally have less detail, as in an oil painting or a photograph with a depth of field effect. This artfully tells the viewer where the gaze should return to. This requirement of actually having detail when zoomed way in (for at least some part of the image) precludes simply depending on Topaz Gigapixel AI to turn an image of 1.5 Megapixels (or as low as 0.25 Megapixels) into anything nearly big enough. Gigapixel upscaling can achieve a lot, but without the newer generative features, it can't achieve miracles. But my process can.

Original, Painfully Slow, Too-Manual Process

My initial process involved using Wombo AI art online (from an iPad) to generate an image, then cropping the image repeatedly and doing Image-to-Image to create successively more detailed areas of the image (two nested crops of the original image, the third image generated from a crop of the second). The sad part was trying to take these assets, all the same size, say three images 1080 wide by 1920 tall, and combine them into a single image without upscaling artifacts. Naively upscaling using Bilinear, Bicubic, or even Lanczos will result in "screen-door" artifacts on the "zoomed out" parts (less detailed parts) of the image. (This also creates problems for the final "free" 2x upscale in Gigapixel AI—again no miracles are possible once these big jaggies are introduced.)

Necessary Evil: Manual Scaling & Alignment on my Mac

So I began using Topaz Gigapixel AI and a lot of manual calculation and nudging and scaling to get the three images (for this one very simple example) into a shared pixel space without artifacts. It can be easily done but requires this very manual, very error-prone, very boring, very mechanical step before the images can even be masked to make a single image. But Gigapixel charges $2,000 for API access (on my own machine, offline!) so I cannot automate this painful step. (... or can I? ...)

The Fun Part: Masking and Manual Painting on iPad Affinity Photo

For me, the fun part is to take the PSD file (or in my case, Affinity Photo layers document) from my Mac (the only place where Gigapixel runs) back to my iPad (where I generated the images using an online image service such as Wombo AI Art) so I can manually mask the layers together, fix problems with actual manual digital painting, and apply final adjustments such as levels, white balance, manual local dodging and burning, etc.

Enter Mochi Diffusion

In 2024, less than a year ago, I got my hands on an Apple Silicon Mac—in this case the oldest 2020 M1 MacBook Air, the cheapest Mac laptop you can buy used, a device with no fan and 20-hour battery life. Apple has shipped chips with Neural Engine cores in mobile devices such as iPhone and iPad for many years now, and getting this technology in a laptop opened up some really amazing avenues. Apple ported Stable Diffusion to run on CoreML, meaning it can use GPU cores or lower-power Neural Engine cores to run SD at an almost respectable speed on commodity mobile hardware. What I mean is that you don't need a $2,000 Nvidia GPU inside a massive box with huge loud fans and a 1,000 W power supply; you can play with this technology on a device with no fan, running on battery power.

Some amazing enthusiasts have pushed this forward, in terms of native Mac UI, in the form of an open-source project called Mochi Diffusion. Getting the code for this and compiling it allows a tinkerer like the author to add many crazy new features, to begin glimpsing what people have been doing for years on dedicated kilowatt hardware using A1111.

The first feature I added was the ability to favorite images as Mac-native Finder color tags. This got accepted into Mochi Diffusion and is shipping right now.
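The shipping feature is implemented in Swift inside Mochi Diffusion; as a rough illustration of the macOS mechanism it leans on, here is a Python sketch that writes a Finder tag into a file's com.apple.metadata:_kMDItemUserTags extended attribute. The third-party xattr package and the "name\ncolor-index" convention are my assumptions for illustration, not details from the article.

    import plistlib
    import xattr  # third-party package: pip install xattr

    TAG_XATTR = "com.apple.metadata:_kMDItemUserTags"

    def set_finder_tags(path: str, tags: list[str]) -> None:
        """Write Finder tags as a binary plist into the file's extended
        attributes. Each entry is a tag name, optionally followed by a
        newline and a color index (e.g. "Yellow\n5" on current macOS)."""
        xattr.setxattr(path, TAG_XATTR, plistlib.dumps(tags, fmt=plistlib.FMT_BINARY))

    # Mark a keeper so it shows up color-tagged in Finder:
    set_finder_tags("keepers/face_0042.png", ["Yellow\n5"])

Because the tag travels with the file's metadata, copying or moving the PNG keeps the "favorite" status, which is what makes the curation workflow below possible.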

The Latest Process

My latest process is quite convoluted, but produces the highest quality and quantity of amazing results I can muster. Think controlled chaos, with a certain amount of manual control, leaving room for wonderful surprises.

Batching Prompts & Madlib-style Canned Faces

The first feature required to get piles of useful starting places is the ability to batch run dozens or hundreds of prompts from a text file. (This was simple enough to add to my experimental branch of Mochi Diffusion. A1111 has had this for a while now too, obviously.)

In order to get good results for portraiture/figurative images, it helps to have a pile of "canned faces" filtered from an even bigger pile of "blended faces." I won't go into detail, but imagine a simple madlib Python script with a long list of text prompt fragments which, when combined, generate stable, deterministic results that are unique and repeatable. The resulting faces vary widely, but a good madlib generator might produce perhaps 25% keepers and 75% ugly faces. Generate a giant text file, run this overnight using batching, then comb through the results and favorite the best ones using the feature I added to Mochi Diffusion. Since the color tag is stored in file system metadata (an extended attribute), you can just copy or move PNG files around in various folders (on Apple OSes). So your folder of copies of the best PNGs is actually also a list of the best "canned face" prompts, for the next step. (You also need a command-line prompt-puller Shell/Python script that can use the open source Exiv2 library to turn a folder of PNGs back into a text file.)
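To make the madlib idea concrete, here is a minimal Python sketch; the fragment lists, template, and output file name are invented for illustration and are far shorter than anything a real generator would use.

    import itertools

    # Invented fragment lists for illustration; a real script's lists are
    # much longer and tuned over time.
    AGES = ["young", "middle-aged", "elderly"]
    FEATURES = ["freckles", "high cheekbones", "strong jawline"]
    LIGHTING = ["soft window light", "golden hour rim light"]
    STYLES = ["oil painting portrait", "studio photograph"]

    TEMPLATE = "{style} of a {age} person with {feature}, {lighting}, detailed face"

    def generate_prompts():
        """Deterministically enumerate every combination, so any keeper
        can be exactly reproduced later from its text alone."""
        for style, age, feature, lighting in itertools.product(STYLES, AGES, FEATURES, LIGHTING):
            yield TEMPLATE.format(style=style, age=age, feature=feature, lighting=lighting)

    if __name__ == "__main__":
        with open("canned_face_prompts.txt", "w") as f:
            for prompt in generate_prompts():
                f.write(prompt + "\n")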

Rows and Columns

The next trick is to use the text file of canned high-quality face prompts as rows in an even more complicated x-y batching mode (yet another feature I added to my experimental development branch). The columns are more complicated, but they involve a folder full of collected-up images with an index text file of prompt templates and strength values for those images (with a wildcard to leave a spot for the row prompt “face” in there). X-Y batching begins by shuffling all rows and then runs over N random columns per row, where N is set in the UI as the Number of Images field. Columns can be thought of as poses or compositions or outfits, and the inner "canned face" prompt gets spliced into the larger column's prompt template. This batch can be run for hours or overnight to get hundreds of low-res starter images, which can be curated down to dozens of great starting point images, which can then be upscaled with my complicated process, below.
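Here is a rough Python sketch of what that row/column splicing amounts to, outside the app. The wildcard token, the column index fields, and the job dictionary are my guesses at the shape of the data, not Mochi Diffusion's actual implementation.

    import random

    WILDCARD = "{face}"  # assumed placeholder token inside each column's prompt template

    def xy_batch(row_prompts, columns, n_per_row, seed=None):
        """Rows are 'canned face' prompts; each column pairs a reference image
        with a prompt template and an image strength. Shuffle the rows, then
        splice each face prompt into N randomly chosen column templates."""
        rng = random.Random(seed)
        rows = list(row_prompts)
        rng.shuffle(rows)
        jobs = []
        for face_prompt in rows:
            for col in rng.sample(columns, k=min(n_per_row, len(columns))):
                jobs.append({
                    "prompt": col["template"].replace(WILDCARD, face_prompt),
                    "init_image": col["image_path"],
                    "strength": col["strength"],
                })
        return jobs

    # Example usage with made-up data:
    columns = [
        {"image_path": "poses/standing.png", "template": "full body, {face}, red coat, garden", "strength": 0.55},
        {"image_path": "poses/seated.png", "template": "seated portrait, {face}, window light", "strength": 0.6},
    ]
    faces = open("canned_face_prompts.txt").read().splitlines()
    for job in xy_batch(faces, columns, n_per_row=2, seed=42):
        print(job["prompt"])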

Columns & Cross-painting

Creating the images in the columns folder and related prompts is its own art form, and requires a lot of patience and experimentation. But this is also a huge part of how to get top notch results, building on years of collecting up awesome intermediate images, usually improved by cross-painting (manual inpainting of several nearly identical low-res images into a single optimized image) and correcting all of this messy output. Then the process can create endless variations with different faces using these poses and compositions as starting places.

Project Mode

I added an easy-to-use (but complicated-under-the-hood) Project pane to Mochi Diffusion that can help automate my fiddly upscaling process without relying on Gigapixel AI. Instead it uses the wonderful open source RealESRGAN upscaler, which is "good enough" to glue the whole process together, especially since it can all be automated.
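Inside Mochi Diffusion the upscaler is invoked natively; as a stand-in, this sketch drives the same family of models through the published realesrgan-ncnn-vulkan command-line tool. The flags shown are that tool's documented ones, the file paths are invented, and this is not how the Project pane actually calls it.

    import subprocess
    from pathlib import Path

    def upscale(src: Path, dst: Path, scale: int = 4) -> None:
        """Upscale one PNG with the realesrgan-ncnn-vulkan CLI.
        Assumes the binary is on PATH; -n selects the model, -s the scale."""
        subprocess.run(
            ["realesrgan-ncnn-vulkan", "-i", str(src), "-o", str(dst),
             "-n", "realesrgan-x4plus", "-s", str(scale)],
            check=True,
        )

    upscale(Path("project/1.png"), Path("project/1_up4x.png"))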

Project Mode begins with a single low-res 512x512 PNG generated on my limited machine using just Neural Engine cores, using SD1.5 models. A single keyboard shortcut copies the prompt from the image to the UI, then the remaining parameters can be set by the user (notably Strength, which means Text Strength). The magic values that seem to work for this part of the process range from 0.25 to 0.3, but other values may work, depending on the SD CoreML model file you are using. (0.25 means 75% like the image, so details are added, but mostly this is like a generative upscale at every step.)


The user selects an image in Mochi Diffusion and enters Project Mode, which opens the starter PNG file (captioned "1.", a full, zoomed out composition) and creates a new unique folder where a pile of 512x512 images will be written to disk. First my code uses Apple's built-in computer vision tools for face and pose detection, along with tons of heuristics (captioned "2."), to create a tree of complicated nested rectangles with explicit coordinates in the starter/original image space, all in a split second. Before the complicated automated process (below) runs, the user gets to see how small the smallest rectangle (usually the face) is relative to the starter image, as a scale number. For a typical figure this could be anywhere from 6x to 12x, but generally around 8x or 9x. This multiplier will set the scale for the final layered set of images and is very important in making the whole thing work without artifacts.
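A simplified Python sketch of the geometry involved: given a detected face rectangle (in the app, detection comes from Apple's Vision framework plus pose heuristics, which are omitted here), build the nested crop rectangles and report the zoom multiplier. The expansion factor, the clamping, and the numbers in the example are assumptions for illustration.

    from dataclasses import dataclass

    @dataclass
    class Rect:
        x: float
        y: float
        w: float
        h: float

    def expand(rect: Rect, factor: float, bounds: Rect) -> Rect:
        """Grow a rectangle around its center by `factor`, clamped to the image."""
        cx, cy = rect.x + rect.w / 2, rect.y + rect.h / 2
        w, h = rect.w * factor, rect.h * factor
        x = min(max(cx - w / 2, bounds.x), bounds.x + bounds.w - w)
        y = min(max(cy - h / 2, bounds.y), bounds.y + bounds.h - h)
        return Rect(x, y, w, h)

    def crop_tree(face: Rect, image_size: int = 512, step: float = 2.0):
        """Build nested crops from the full frame down to the face, and report
        the zoom multiplier that will govern the final layered output."""
        full = Rect(0, 0, image_size, image_size)
        scale = image_size / max(face.w, face.h)   # e.g. ~8x for a typical figure
        crops = [full]
        rect = face
        while max(rect.w, rect.h) < image_size:
            crops.insert(1, rect)          # keep order: widest first, face last
            rect = expand(rect, step, full)
        return crops, scale

    # Example: a face box roughly 60x64 px inside a 512x512 starter image
    crops, scale = crop_tree(Rect(220, 90, 60, 64))
    print(f"{len(crops)} nested rectangles, zoom multiplier ~{scale:.1f}x")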

Project Mode has an Enqueue button so batching can be used yet again (for the third time?) after perhaps dozens of great starter images have been found. Enlarging a low-res image from 0.25 Megapixels into perhaps 20 Megapixels (so 80x for just the upscaled input image, and hundreds of times the number of total pixels, counting all the layers) takes about five minutes per project, so batching allows everything to run in succession in a fully automated fashion, perhaps overnight.

PSD for the Win

What is the output of each Project? Each project folder fills up with 512x512 images, which the Project collects location information about so it can create a giant layered PSD file at the end. Each image is first Crop/scaled from a 512x512 image before running SD Img2Img Generate on it. ("Crop/scale" just means "crop some part but scale back to 512x512" before Img2Img.) In this manner, final images will have decent quality for their size, but cannot be combined into a coherent whole without first being intelligently upscaled into the same pixel space (using RealESRGAN and not bilinear or bicubic or whatever). The final step just combines everything in the following manner: first, scale the widest image to the max scale (say, 8.5x), then scale subsequent images in the stack less and less, until the final tiny face images, which are like 1.25x and 1x. These images are added to a PSD document and written out to disk as a giant file, perhaps 100 MB to 500 MB depending on the full resolution. (Affinity Photo documents losslessly compress this to about 3x or 4x smaller, in practice, and PSD files themselves can be compressed more than my PSDWriter library supports.)
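Two bits of bookkeeping from the paragraph above, sketched in Python with Pillow: the "Crop/scale" step, and the arithmetic that decides how much each 512x512 layer must be enlarged so every layer lands in the same final pixel space. The coordinates and file names are invented.

    from PIL import Image

    def crop_scale(img: Image.Image, box: tuple[int, int, int, int], size: int = 512) -> Image.Image:
        """'Crop/scale': crop a region of the 512x512 source and resize it back
        to 512x512 so it can be fed to SD Img2Img."""
        return img.crop(box).resize((size, size), Image.LANCZOS)

    def layer_scales(crop_widths: list[int], max_scale: float) -> list[float]:
        """Given the widths of each crop (in original-image pixels, widest first),
        compute how much to enlarge each 512x512 layer so that everything lands
        in the same final pixel space: the widest layer gets max_scale, and the
        tightest crops end up near 1x."""
        full_width = crop_widths[0]
        return [max_scale * w / full_width for w in crop_widths]

    starter = Image.open("project/1.png")            # 512x512 starter image
    face_layer = crop_scale(starter, (220, 90, 284, 154))

    # Full frame (512 px wide), a mid crop (240 px), and a face crop (64 px):
    print(layer_scales([512, 240, 64], max_scale=8.5))
    # -> [8.5, ~3.98, ~1.06]; each layer then gets a RealESRGAN pass at its scale

The widest layer carries the full multiplier while the face crop stays near 1x, which is exactly why detail survives when you zoom in on the final composite.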

An example image

Here is an example of the automated PSD output of the process (captioned "3.") for one original low-res starter image, 512x512. Again, this process is dependent on the fact that the "canned face" is known to look good as we zoom in. The prompt already has this information and can control the face. It's like cheating and knowing the future. The whole process is very dependent on this part of the prompt, to get controlled results, instead of requiring luck for those details. It's kind of like rendering a fractal at more and more zoomed-in scales, where the formula is deterministic and simple but more detail is revealed with further computation. Similarly, we have mined the SD model for fully structured 3-D face "formulas," each of which can be represented with a few hundred bytes of text and, given enough computation, revealed at any scale, lighting, pose, etc.


The PSD file for this is available as a ZIP here, 150 MB unzipping to 250 MB, if you want to see what the layers look like, and try turning them on and off, and imagine masking all this mess together into one beautiful image. Another reason this process works is that there is so much redundant information that the human manual masker (see below) has enough valid data to reject some of the often bizarre results that will inevitably creep into the PSD file. Those layers can simply be deleted.

I am not opposed to upscaling a finished, adjusted image one final time, manually, using Topaz Gigapixel AI, to get the final 8k image (7680x7680), available here, 59 Mpix @ 8 MB. I think these kinds of results are stunning, and I can create several fully realized images in a day once I have the PSD files in hand, which again, is pretty easy, given all the batching and curating that goes into getting a handful of great starting point images.

Major Caveat: Manual Masking Required

This whole process is heavily dependent on manual human masking of the PSD layers at the end, as many as one or two dozen layers. The results of masking for this example are captioned "4." above, before final adjustments. I've tried to automate this using Inpainting ControlNets, but the version in Mochi Diffusion is unusably slow and produces very poor quality results. But I find this masking a mostly pleasant process, like knitting or whatever, which is very low-key and amenable to music or podcast listening. Take the highly portable iPad anywhere with decent seating, sit down, lean back with the iPad in one hand and Apple Pencil in the other, and mask away. Also, some manual digital painting may be required, and some adjustments at the end can help put things over the finish line (captioned "5." above). Another example of running through this entire process is shown here:

An example image

Another caveat: this requires an SD CoreML model file that matches your style / prompt requirements (oil painting style or pastel painting style is particularly difficult to achieve). This adds another layer of complication with token count limits, in my case 75 positive and 75 negative tokens! In fact this example uses three different SD CoreML model files, which are automatically switched at different crop levels, developed heuristically for my needs.
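A sketch of the kind of heuristic involved; the thresholds and model names below are hypothetical placeholders, not the author's actual values.

    def pick_model(zoom: float) -> str:
        """Pick which SD CoreML model file to load for a given crop level.
        Model names are hypothetical placeholders; the thresholds stand in
        for the heuristics described in the article."""
        if zoom <= 1.5:      # full, zoomed-out composition
            return "style-base_sd15_coreml"
        if zoom <= 4.0:      # mid-level crops: clothing, hands, background
            return "style-detail_sd15_coreml"
        return "style-face_sd15_coreml"   # tight face crops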

Recap & Conclusion


I can create very-high-resolution results on very-low-end hardware. This even runs on battery, with no fan. I can overcome significant technical and creative barriers using human ingenuity and patience, instead of relying on luck or beefy hardware. It's an interesting challenge to create a bespoke process for one narrow type of image generation.

The code for this is in my Mochi Diffusion more-customizations branch but is very unsupported, and very experimental. An exercise for the reader, if ever there was one. You're on your own. Here Be Dragons.


Humor

True Wisdom

He may live without books,
what is knowledge but grieving?
He may live without hope,
what is hope but deceiving?
He may live without love,
what is passion but pining?
But where is the man that can
live without dining?

Robert Bulwer-Lytton, Earl of Lytton


Reference

When Anything Can Happen, Nothing Matters

Film theory: an explanation for why so many overproduced movies are emotionally unsatisfying

If you've ever watched a movie where the climax was approaching and the story just started getting too big for its britches, and the filmmakers kept adding ridiculous twists and turns, and upping the ante, and instead of feeling like the movie was getting more interesting, it became harder to suspend disbelief and the story started to feel disconnecting, or you began wondering how much longer the movie would take, how much more climactic this already hyped-up scene would get... congratulations, you have experienced what I term,

"When Anything Can Happen, Nothing Matters."

When done right, the action in a story feels personal to individual characters, and when we know the stakes (and the stakes are not just "lots of empty emotionless buildings will get smashed, the whole city is under threat!"), then we can feel empathy for specific characters, and we are drawn into the narrative. When the filmmakers keep "pulling back" and doing a lot of what Brad Bird called "God shots," where the camera looks down on the action from above because the spectacle is so vast we have to take the mile-high view, literally, and the focus shifts to spectacle over character, then we are at risk of losing track of why we should care.

A list of movies that suffer somewhat from this problem:

  • Incredibles 2 (literally trying to save the city)
  • Despicable Me 3 (really fun movie with a trumped up, silly, even boring ending)
  • Minions (story gets too big, literally)
  • Penguins of Madagascar (save the city, save everyone)
  • Home (2015 DreamWorks film) - save the planet
  • The Hobbit (fun story, fun story, war campaign, fun story, fun story)
  • The Avengers (2012 film - save the city, punch a building, nuclear threats, endless outsized horrors, and expanding good-guy powers to counter the horrors)

There are many other examples.

An example of a franchise that makes clear rules, in order to avoid this problem, is the Harry Potter books and films. The rules are there to make sure that the writer and reader know that the rules cannot just change at the drop of a hat. Something is at stake. We will not feel emotionally manipulated: "Oh no! Oh no! It's about to get terrible! ... just kidding, the good guys had this, the whole time!" No one wants that.

Note that when this rule is knowingly broken, and new rules or weird backstories are introduced but the writers dig in and explore the ramifications, it can create tension and fun stories; for example, the series Adventure Time and Rick and Morty are truly bananas at times, but the stories explore that craziness instead of pouring it on thick and then magically erasing the craziness a few minutes later.

Note also that superhero stories can feel connecting when individual characters are vulnerable, and we can relate to them and imagine ourselves in their shoes. Spider-Man and Batman are two examples. Spider-Man allows us to wonder what we would do with new powers. And he is always getting banged up and hurt. He struggles even to understand his powers at times. That aspect of a superhero story is relatable and very human. Batman is even better, because he is the ultimate self-made superhero. He created his own superpowers with his intelligence, his gadgets, his study of combat. And he is still very vulnerable. We worry about his fate because he is so human (he is literally not superhuman like Thor or Superman). The opposite is very disconnecting: the struggles of titanic forces against each other, evil gods and good gods fighting at an inhumanly large scale.


Apple

Just a VR Headset?

Or, No True AR Headset Fallacy

Apple has sent review units to a handful of reviewers, who have had a few days to create videos and write reviews of the Apple Vision Pro, before the general public gets their hands on the device in a few days.

Irksome comments by otherwise intelligent people

Sam Kohl and Jon Prosser react to these reviews and have a healthy back and forth about the product, based on the earliest public info that hasn't been filtered through Apple:

These two close Apple watchers offer insightful push-back, which is healthy in the sense that when we try to prognosticate about the future of a product, we need to understand things from every angle and as they really are.

However, Jon repeats something Nilay Patel says in The Verge review, which is that Apple's headset "is just a VR headset." What Nilay is saying is that he has historical context for Apple's headset because he has been using and reviewing such VR headsets for years now, and following the technology closely. He obviously knows what he is talking about, from a hardware point of view. He speaks fairly about the limitations of the device, "magic, until it is not." Everything he explains is based on a close examination of the device itself. He is careful about the details.

However, I wonder if Nilay’s comment that Vision Pro is "just a VR headset" will go down about as well as CmdrTaco regarding the iPod: "No wireless. Less space than a nomad. Lame."

Jon Prosser repeats Nilay's line that the Apple Vision Pro is just a VR headset, and he suggests that Facebook could have made the Apple Vision Pro if they had been willing to charge customers $3,500: they too could have put better outward-facing cameras and better displays in front of the wearer's eyes, and that imaginary headset would have been very similar to Apple's shipping Vision Pro, perhaps better.

The Ship(s) of Theseus

The mistake these reviewers are making is just a twist on the old philosophical problem of identity, famously explained as the thought experiment of the Ship of Theseus. We are asked to consider Theseus on his seafaring journeys, and his crew needing to replace pieces of their ship. Gradually they repair and fix the ship over many years, until every piece of the ship has been replaced. Is this ship the same ship as the original ship? When did it cease being that “same” ship? What if we take all of the pieces that were replaced and collect them up, and rebuild a run-down version of the original ship? Now we have two ships. Which ship is the real Ship of Theseus?

One solution to the problem is to use a different, functional definition of identity, instead of some nominal concept. (In other words, nominally, the only "real" ship of Theseus was the first one, then it stopped being the Ship of Theseus once a single change was made, in fact once it hit the water and started to weather. This is one solution to the thought experiment, but it doesn't match our intuition.) A functional definition might be: that the ship of Theseus is whichever ship takes Theseus and his crew on their adventures. The parts of the ship relate to one another functionally, and as long enough parts of the ship are functioning together as a ship, then it can be considered the ship of Theseus. We hold identity lightly. Any actual sailing ship that Theseus and his crew sail on can be considered the ship of Theseus.

So, applied to headsets, a functional definition (that is less reductive regarding just examining the hardware) might say that any headset that can do augmented reality things and mixed reality things is not “just a VR headset.” (Virtual Reality becomes an immersive, surrounding software experience, not a category of hardware.)

The Book of Face

The fallacy is that Facebook or other manufacturers could have replaced each component of their headsets with better spec'ed components, one by one, until they had arrived at the Apple Vision Pro. Then those vendors could have spent 2024 getting developers to port their apps to Vision Pro, built a headset platform that goes beyond just Virtual Reality games experiences and entertainment, then eaten Apple's lunch.

This may be factual (debatable) and superficially convincing, but still deeply wrong in some sense, because it ignores reality: the fact is Facebook

  • "could have" pushed the state of the art harder;
  • could have produced their own silicon;
  • could have hired better industrial designers instead of letting the skunkworks Vision Pro group at Apple hire them for a more exciting project;
  • could have treated pass-through (reproducing the world around you) as more of a core requirement to create AR / mixed reality in a single headset, instead of an afterthought;
  • could have watched Oblong and John Underkoffler pioneer spatial computing between his 2010 TED talk and when the Apple headset project really got going (Oblong hired a handful of ex-Apple individuals who returned to Apple after a few years at Oblong);
  • and Facebook could have tried to ship a phone (oh wait, ten years ago they tried) and tablet platform so there would be thousands of tablet apps and phone apps to bring over to their headset—

Facebook could have done all these counter-to-their-culture things, but the fact is: they didn't. That timeline is not our timeline. And the same goes for Samsung, Sony, Microsoft and Amazon.

I call this lying with facts. You throw out narrow little facts which can all be verified to be true, but you ignore other big obvious things that show that the facts don't come together to imply the conclusion you claim they do. Otherwise very intelligent people fall for this all the time because they don't step back and look at the big picture. This kind of misunderstanding is a categorical mis-attribution:

  • The automobile is just a horseless carriage, sans having to feed the horse and scoop up mountains of horse manure; and the horse dies one day; and you can't replace the horse's leg if it breaks the way you can replace a tire, you have to shoot the horse.
  • Human beings are just wimpy great apes with long legs and no fur and a bigger cranium; if you added a bigger brain and reason and art and mathematics, and culture and imagination, and religion and architecture and music and taxes and martial law and graveyards, and writing systems and ten thousand years of hard-won knowledge, and the scientific method, which was fought against by religion for centuries, and the Enlightenment, and capitalism—if you just add this to gorillas and bonobos, you would get humans.

These statements are tantalizingly, superficially true, but they miss the heart of identity: when A + B = C, the differences listed in B are what make A and C so different. B is so vast, it doesn't bring A and C together, it pushes them apart.

Apple's headset is clearly very different from the existing headsets on the market in so many already remarked-upon ways, especially price. These pundits undermine their own point by saying that Apple's screens and outward-facing cameras are light-years ahead of their competitors'. That's the starting point you need to create an entirely different first-hand experience for people who are not full-time VR headset reviewers. There is some line crossed that Apple sees, that normal people see, that VR headset reviewers and manufacturers cannot see. To justify purchasing a headset, it needs to do more than just VR; it needs to be more than just games and entertainment. Then it becomes much more useful than just a game console. Then it justifies a higher price too. (Or this is all Apple's hope.)

What makes this all the more infuriating is that Apple has done this so many times before, especially with Macintosh, iPod, and iPhone. (And essentially zero competitors have yet created a viable truly portable tablet platform to compete with iPad—a platform that feels mobile-first, with ten-hour battery life and every app for the platform feeling native. Windows tablets do not satisfy this criterion for battery and legacy (touch target) reasons). Again, this doesn't make the new visionOS platform a shoo-in, but it does mean that Apple gets to lead the technology industry for a lot of reasons, not just their widely feared faithful fan-base; the main reason being vision (pun intended). It just seems that the industry keeps waiting for Apple to come in and show the way, do something initially counter-intuitive but groundbreaking, and then all the vendors will race to catch up and iterate. But somehow only Apple seems capable of this step-function way of thinking.

And Nilay Patel and other smart pundits keep highlighting this type of narrow thinking with their lack of imagination, trying to shove a square new peg into a round old hole. I'm not the one putting words in his mouth; Nilay is really trying to do this! These pundits (Nilay and Jon) are supposed to be the ones with the perspective to understand what is really going on, but they seem to not see the forest, just trees, very close up, in great detail, perhaps just bark (as I said, they are deep into the details), based on these sorts of farcical comments relegating a very different headset to "just VR" when it is clearly designed with a much broader vision in mind, with significantly more capability, in both hardware and software. It's just so uncanny how predictable some pundits are when they try to pretend to be contrarian for the sake of coming across as objective. It makes them seem anything but! Again, the iPhone shipped in 2007 and tons of smart people publicly missed what was right in front of them, for several years, until it was obvious to everyone. I still wonder, why didn't everyone get rich from investing in Apple after the iPhone was publicly announced? The stock price stayed very, very inexpensive for several years, until at least 2008 or 2009, and it has gone up something like 50 or 60 times since then!

Is Software King or Is Hardware King?

I think these tech pundits and reviewers also suffer from an annoying form of reductionism or dualism: they see Apple's products more as pieces of nice hardware where the software is a necessary and limiting evil, often the type to lament that they cannot hack the hardware to run their own software, the way commodity PC hardware can run Windows or Linux.

Apple does not see it this way and never has. Apple starts with the software experience they want to ship, then works backwards from the software to the hardware, then works for like seven years to create the hardware that can enable the experience they want, then they ship the complete package when it's ready. Apple knows their users see the hardware as a necessary evil. Think about it: every piece of hardware that Apple ships has always carried all or most of the downsides or costs (to their users) of their products. (And the software limits and obvious shortcomings are lifted gradually through annual updates, until OS updates become kind of boring fifteen years later.)

Travel back in time and show a caveman your iPhone. Tell him, "Look, iPhone software is amazing! Look at all the many things the software can do! Sadly you must carry around this expensive heavy device in your pocket, with limited battery life, but it is a price worth paying to get access to those amazing software experiences!" Same with the Vision Pro. Apple knows their users kind of don't care about the hardware (inside the case), especially not the specs, which reviewers focus on so unduly. Users only use software, not hardware. Hardware is the price we all pay to use software.

In the 1980s and 1990s, Apple used to have posters on the walls of the product teams' offices that said "Software Sells Systems." But I feel strongly that Apple is at least as software-first, if not more so, these days. From the user's point of view, the experience is so software-centric and the hardware so deliberately minimal, it's hard to argue otherwise. (Yes this is reductive; Apple is not dualistic about this; they design and build products as complete experiences, with their development structured with functionally cross-cutting teams, and not divisional as other companies might do, with Microsoftian fiefdoms.)

It's all about the software, stupid

Any third-party software (that is not Facebook-owned and thus exclusive to Oculus) can be ported to Apple Vision Pro (seated experiences only, based on Apple's documentation and VR experience safety issues). However, probably more than half of the software that developers ship on Apple Vision Pro (especially the rectangle-window software coming over from the iPad, something there is not a lot of on other headset platforms) will not be something that can be ported to Oculus, because it assumes a world of many apps running, many windows, or new experiences that assume a background of the real world with a baseline elaborate (and expensive) pass-through. If new AR experiences created for Apple's headset cannot be ported to any existing VR headset—in principle, in spite of developers wanting to do so (because, for example, they assume you can cook in your kitchen with the headset on)—how does that leave any doubt that Apple Vision Pro is more than just a VR headset?

It's so weird to call out intelligent pundits for saying something irksome and thoughtless, to have to even explain the basics of how words work. If a headset is used for augmented reality and mixed reality activities, and sometimes virtual reality (fully immersive activities), then how is this "just a VR headset"? Yes, Nilay is smart and understands all the specs of the hardware, and to him this is just a high-specced VR headset! But my five-year-old would say: wouldn't headsets that are used mostly for VR experiences be "just VR headsets," and headsets that are mostly used for other things be other-thing headsets? The mind boggles when words cease to work as expected.

This means that this is not just a VR headset. It's a platform. Yes, Apple is achieving mixed reality and even a few augmented reality features by cheating and using pass-through ("no true Augmented Reality fallacy"), but they are shipping something real that developers can create apps for, now, not some future glasses that don't exist. Once all this 2024 software exists, the platform will have much more momentum than any other headset. By the time other vendors catch up to the 2024 Vision Pro, four to six years from now, Apple will be shipping the newest 2028 or 2030 Vision Pro or Vision Air product.

I get the feeling that Nilay Patel and other reviewers and pundits are kind of just annoyed that (1) they didn't see it coming (see my article where I elaborated on this, months before any Apple announcement in 2023, and about a year before the product shipped) and (2) Apple pulled off a trick that no one has done: using VR to "fake" AR and mixed reality. It's like these pundits want to grab people and yell in their face, "This is just a trick, this is just VR warmed over; don't fall for it! It's not true AR! The windows aren't really floating in your room!" Like OK, you are trying so hard to be right, but who cares? Again, it just feels like some sort of "no true AR" fallacy. Here is a gold star, you are technically right, but your words are meaningless; congratulations, you broke the English language. Pat on the back.

Keep it up, pundits; keep misunderstanding leaps in technology, even after Apple publicly explains them thoroughly, for the sake of trying to seem objective, so that by the time it is too late to go back and understand it in real time, Apple's stock will be up too high, and people will think, oops, we missed that train. Without pundits confusing everyone and causing misunderstanding, perhaps competitors would see things more clearly too, and actually give Apple a run for their money, instead of leaving Apple a clear runway, every time.

On Releasing Flawed Products

One last comment: I don’t mean to imply that the commentators are not pointing out legitimate flaws with the product. It is clear that the hardware needs improvement, and that Apple oversold certain features.

There are only two types of products: those that are released too soon, and those that never ship at all.

However, I think industry watchers keep forgetting that this happens every time a product is released: Apple Watch, which iterated in public because Apple needed to learn how people use it; iPhone, because a 1.0 product, however flawed, was better than waiting another five years to ship a “perfect” version of the device; iPad, with the 1.0 version seeming like an aberration in terms of thickness compared to even iPad 2, which feels much more like the iPad we are used to. A perfect device is a boring device, and that only occurs when the platform is more mature, which by definition only occurs when developers can ship apps to users on hardware that can be purchased by the public.

On missing competitors' apps

The three missing third-party apps that keep being mentioned are YouTube, Spotify, and Netflix.

I think Apple may be genuinely annoyed that there is not a native YouTube app for Vision Pro that works better with right-sized controls, because Apple has no competing user-uploaded video social media service. However, I think Apple is not at all annoyed that Netflix and Spotify are not on the platform yet. I think they know that their customers who can shell out nearly $4,000 for a 1.0 product can afford $20 a month to test music and video out using Apple's competing services: Apple TV+ and Apple Music. You might even say that TV+ as a service was probably started long ago with visionOS in mind, to make sure Apple could control the availability of high-quality original content for this (then upcoming) hardware play. And the Disney+ app shipping not just natively—but with the device—is no accident.

Again, I think Apple is happy to see Spotify treat themselves as irrelevant on this platform, at least at this point. Apple can control everything about the experience with their native Apple Music app. I'm sure they see allowing Spotify on their platforms at all as a cost of doing business, as redundant and lesser. I'm not saying they don't see customers wanting it there, but I think from Apple's point of view, they think: just cancel your Spotify subscription and use that money for Apple Music, and you are set. It sounds cold, but can you imagine a Steve Jobs or Eddy Cue email where they act annoyed that they have to let Spotify on the platform at all to appease regulators? I can! The iPhone had no native third-party apps for a year. And even then Apple rejected apps for reproducing existing functionality (Calendar, Email), until at some point they let up on this early rule. Apple can be annoyingly opinionated and controlling, surprise!

Conclusion

This does not mean I think this category is a shoo-in, and I don't think any success of the Vision / future hardware / visionOS platform will be as large a success as iPhone, probably not even close. But I think visionOS could create a business that complements the Mac and iPad and even subsumes and grows that core productivity computing market. I think Mac, iPad, and Vision will be the nucleus of their productivity and creativity platform, which also does include content consumption. And I think that by the time the platform has gotten its legs, and the price has dropped (some?), and battery life is better, and weight has been reduced—by that point it will be hard for competitors to catch up. By the time it becomes obvious that Apple has a clever approach, or a "now it's so obvious" approach, it will be too late to benefit from that knowledge, and Vision Pro will be to other headsets what iPhone is to other smartphones: the main attraction, with the best software. (What new software product launches Android-only before it launches iPhone-only? I don't mean emulators and utilities and stuff like that; I mean mainstream, large, successful businesses. What Android-only high-quality tablet software is there? iPad has tons of this stuff.)
