PlaySDK User Manual

Welcome to the PlayMotion SDK User Manual! In these pages, you'll learn about the steps to producing interactive media using PlayMotion as part of the input mechanism.

Let's begin with a basic overview.

Overview

The PlayMotion SDK is a real-time computer vision library for developing human-scale entertainment applications. It enables programmers and designers to use cameras as a replacement for or augmentation to 'traditional' input mechanisms in games.

But what is computer vision, really? Computer vision is a relatively new field of research and application in computer science. It's a synthesis of artificial intelligence, image processing and computer graphics whose goal is to create systems that can gather meaningful, semantic information from images. If the images being analyzed are a stream from a camera watching a person or group of people, and the semantic information is interpreted like the signal from a gamepad, keyboard or joystick, then the camera becomes a new kind of control device that can be used to get players involved in the digital experience in a more visceral way.

And Human Scale? Glad you asked. Simply, it is the design ideal that virtual world objects should bear a 1:1 scale relationship to their real-world counterparts. There's a lot more to it; you may want to check out our Guide to Human-Centric Experience Design.

Finally, then, what is entertainment? For the purposes of PlaySDK development, we like to think of entertainment as 'video games', but there are lots of other ways that computers entertain people: reading stories, playing music or movies, creating ambient mood lighting, laser shows, or drawing the first-down line on the field of a football game. Controlling any of these types of applications using a camera is the essence of what PlayMotion is all about.

So, you've got a rough idea of what you want to accomplish. Great! One critical difference between a PlayMotion experience and a classical interactive / videogame experience is that vision-based experiences are deeply tied to the physical environment, or PlaySpace, into which they are installed. Vision-based experiences rely on human movement as input; therefore, one of the first design constraints to consider is the space in which that movement takes place.

We'll begin by exploring design considerations for the physical environment.


Designing an Environment

A unique element of the PlayMotion experience design process is taking into consideration the physical environment into which it will be installed. Since the vision input signal is generally dependent upon the physical movement of humans and/or objects within a PlaySpace, careful consideration of the intended physical space and its effects on user movement and/or interaction is in order.

This section presents a few considerations that should be part of the experience design process, as they have significant influence on how the camera signal will be processed and used. Specifically, the interrelationships between display type, camera location, and local lighting need to be well thought out while designing your experience.

Choosing a Display

The quality and size of your display, next to system latency, is perhaps the single most critical component of your physical rig when designing a PlayMotion based experience. Two main factors are physical size and orientation. For instance, consider a projector aimed down 90 degrees to evenly illuminate the surface of a tabletop cut to precisely the right size. Or, consider an array of 12 networked plasma displays positioned along a long corridor. Our purpose here is to open your mind to the design possibilities available to you when you look beyond your desktop or laptop monitor.

    Definition: A display is a device which radiates energy on the visible spectrum wavelength in order to transmit information from a machine to the human brain via the optical cortex. Displays can range in complexity from a simple lightbulb (think: one pixel, one bit - on & off) to as complex as a holographic video projector (multidimensional, multicolor).

Ramesh Raskar is credited with the concept that displays in aggregate, seen as a unified structure, can be considered as windows that enable us to penetrate through the present timespace / physical paradigm and to peer into the extra-dimensional world of machine, code and spirit.

The capabilities of a display can be measured and compared across several major axes:

type:
Major classifications include vector vs. raster, and emissive vs. reflective. Examples include CRTs, LCD panels, OLEDs, plasma panels, rear projection, front projection, video matrices (many individual displays managed by a video control system to function as a single cohesive display), and LED wall / billboards.

orientation:
Classically, displays are oriented in landscape fashion, with the image plane situated vertically so that the vector drawn from the eyes of a standing or sitting person will be perpendicular to the display surface. When creating your PlayMotion experience, we invite you to consider other orientations. For instance, projection based PlayMotion experiences are generally best viewed with the projection coming all the way to the floor, in contrast to the standard home theatre setup of 3-8 feet up. Another very compelling orientation from a design standpoint is the table top display, where the display is placed horizontally on a flat surface. Finally, we invite you to consider the effects of projecting onto three-dimensional or sculptural surfaces, which can produce quite surprising results when explicitly factored into the software design.

physical size:
For small raster screens, consider the color video display on your cellphone. Large LED screens have been installed as large as 240 feet wide, and high definition projections have illuminated entire facades of city buildings. Much of the power of a PlayMotion style experience derives from a 1:1 ratio of virtual world to real world, which we call a human scale experience. If you wish to design in this fashion, you need to strongly consider a display capable of fitting one or more full sized humans within its addressable area.

resolution:
number of addressable pixels used to synthesize an image. Resolutions are commonly defined to match media content standards. XGA is a fairly common 4:3 aspect ratio standard for computers which is 1024 pixels wide by 768 pixels tall. 1080 HD is the new standard of excellence, which is 1920 pixels wide by 1080 pixels tall running at 60Hz. It takes a decent GPU to render complex geometry in real-time at HD resolutions.

aspect ratio:
for rectilinear displays, the ratio of width to height in terms of physical dimension. Traditional aspect ratios include 4:3 (classic television) and 16:9 (cinema, and HD). Many large public displays are created in non-traditional aspect ratios. A simple and effective means of jolting participants out of the "this is video, I'll sit and watch" syndrome is to rotate the display 90 degrees. Seeing a video wall 16 feet tall by 9 feet wide is a good call to attention. :)

bit depth:
The theoretical range of color values that the display will accept. The simplest display is one bit (on or off, white or black). Typical computer displays are 32-bit: 8 bits for each of the red, green, and blue channels, plus an 8-bit "alpha" or transparency channel. Each channel's 8 addressable bits translate to 256 available levels of intensity. When the three color channels are combined (red 0-255, green 0-255, and blue 0-255), the possible combinations yield about 16.7 million colors.

color gamut:
while bit depth determines what signals may be received by the display device, color gamut defines what range of colors is practically reproducible by the device in the real world. Displays with broader color gamuts are said to be of higher fidelity, though brightness, resolution, and contrast are also significant factors.

brightness:
how much light energy can be pumped out of the display. This is typically measured in lumens, and is measured at a fixed distance from the display surface while driving a maximum brightness signal (for instance, 100% white on a 32-bit display).

contrast:
the difference between the brightest bright reproducible, and the darkest black. Higher contrast translates to more realism. Low contrast can result in the "glowing screen" phenomenon.

latency:
how long it takes the display to update from time of signal reception. Also, how long it takes any given pixel to update from one color to another. Displays said to be "video capable" need minimum refresh rates of 24 frames per second, which means a latency of less than about 0.042 seconds (1/24th of a second).

cost:
Cost is a major factor in any installation. A generally good way to think about display cost is either cost per square foot of coverage, or cost per lumen of brightness. Generally the hands down winner in these comparisons are projectors, though these devices do have their limitations, most notably their need for a relatively dark environment. On the other end of the spectrum, some of the most expensive displays in the world are massive LED walls manufactured by the likes of Barco and Daktronics; while these displays can run into tens of millions of dollars, they also have the highest brightness of anything on the market, making them ideal for daylight and outdoor applications.
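
Several of the figures quoted in the list above are simple arithmetic, and it can be handy to rerun them for your own hardware. The following Python sketch is illustrative only: the function names are made up for this manual and are not part of the PlayMotion SDK, and the projector price and wall size in the last line are invented numbers.

```python
def color_count(bits_per_channel=8, channels=3):
    """Number of distinct colors for a given per-channel bit depth."""
    return (2 ** bits_per_channel) ** channels


def frame_period(frames_per_second):
    """Seconds available to show one frame at a given refresh rate."""
    return 1.0 / frames_per_second


def cost_per_square_foot(price, width_ft, height_ft):
    """Display cost divided by the area it covers."""
    return price / (width_ft * height_ft)


colors = color_count()        # 256 ** 3, about 16.7 million
period = frame_period(24)     # about 0.042 seconds
# Hypothetical example: a $2,160 projector covering a 12' x 9' wall.
cost = cost_per_square_foot(2160.0, 12.0, 9.0)
```

Comparing cost_per_square_foot() across candidate displays gives a quick first cut before brightness and ambient light are taken into account.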

Once you've selected your display, it's time to consider the location of your camera.

Selecting a Camera Location

There are three distinct physical "components" which comprise a PlayMotion environment: the camera, the display and the player(s). The arrangement of these three things relative to one another alters which portion and angle of the player(s) is seen by the camera. This information will impact which are the best computer vision methods to retrieve consistent data about the player(s) actions and intentions.

Following are a few typical configurations. In each case it is assumed that the player is facing the display and the camera is facing the player.

The Traditional camera -- player -- display surface
In this arrangement the camera view is of the back of the player. In a front projection setting, co-locating the camera and the projector gives strong optical alignment of the camera's view and the projector's display. If background subtraction is used, there is a clear mapping from the player's projected shadow to the signal generated by the vision pipeline, causing the shadow to be the player's avatar. This configuration and its best practices are embodied in the PlayMotion Reference Environmental Specification.

The Laptop player -- camera/display surface
Here the camera and display surface are close to one another. The name comes from the fact that most laptops sold today come with a camera mounted just above or below the LCD display. In this configuration the camera sees the front of the player. If the camera's view or the segmentation image is displayed as a layer of the graphical output, this orientation emulates a "magic mirror", where the real world is augmented with game elements.

The Pass-Thru camera -- display surface -- player
In this configuration, the camera sees the display and the front of the player superimposed together. This setup requires a display that is not opaque in order for the camera to be able to detect the player on the opposite side of the display. While similar to The Laptop, this configuration provides more accurate registration near the display, allowing tracking of fingers, hands or eyes relative to the display.

This list is not intended to be comprehensive, but rather a taste of the options available in physical configuration. Consider, for example, the case when the camera is not fixed in place. With one or more degrees of freedom, one could switch between the configurations mentioned above, or even use the camera's motion as an input vector.

Now that you've decided on a display type and camera location for your application, let's explore the effects of environmental lighting.

Accounting for Local Lighting

Since cameras provide the input signal for PlayMotion, it's important for their image quality to be optimal. Some aspects of the image quality, such as focus, are straightforward to adjust. The impact of light in the game play space, however, is a much more complex issue.

Varying the light where your game is played is likely to change the way you use PlayMotion to generate control signals. Poor, non-uniform or non-constant lighting can have unpleasant effects on segmentation algorithms. Some of the techniques available in PlayMotion are stronger against adverse lighting than others. But, as always, there is no free lunch. These more robust techniques have their own drawbacks -- they may be more computationally expensive or could provide poorer quality (or just plain different) segmentation than other methods.

    Natural light from the sun, candles, fire, etc. is notoriously inconsistent. This is obviously true for flames, but remember that the movement of clouds and the time of day have a huge impact on the way that the sun lights an environment. Given the intensity of natural sunlight, amazingly small amounts of even indirect sunlight (as in, an open shade 20 feet away) can cause vast irregularities in the input signal. Optimal environments will be completely sealed from natural sunlight (or, optionally, set up for outdoor use during nighttime).

    Some electric lights, such as incandescent, neon and fluorescent bulbs, provide a stable, even light that is almost always compatible with all of PlayMotion's vision algorithms. On the other hand, mercury-vapor and halogen lamps create erratic light that may produce visible artifacts in your camera image. Another thing to observe is that many modern lights fluctuate at a given frequency. While this is generally imperceptible to the human eye, it may appear as a pulsing brightness on your incoming camera signal.

    Finally, there's the light from your display surface, aka visual feedback. Reflection of the light emitted by a computer monitor probably will not have a negative impact on the camera signal. If the camera is facing the display, however, the light in a (presumably large) part of the camera's view will be changing constantly. Similarly, high-brightness displays such as projectors or LED arrays can create significant dynamic variation in the overall illumination of your environment. Traditionally, PlayMotion uses an open-loop feedback scheme where the camera influences the display, but the display has no impact on what the camera detects. If the camera image is changed by the displayed image, we close the feedback loop. One possible way to avoid this scenario is to use a camera that filters out visible light, meaning that the displayed image will not appear in the camera view. This idea is implemented in the PMSP, an industrial class infra-red sensor array available for purchase from PlayMotion.

Now that you've considered your display type, chosen your camera location, and accounted for local lighting conditions, it's time to get into the nitty gritty: configuring the filter graph.

Optionally, you can use the default filter graph and jump straight into development by running the Calibrate Utility and delving into the included Sample Applications.


Calibrating the System

After determining the physical setup for your game and building and arranging the filter graph settings in the siteXML, you can use the Calibrate Utility to adjust filter settings for optimal performance in specific physical environments and lighting conditions. Calibrate also lets you dynamically tune the vision service settings and visualize the data returned by the services. Since these parameters are adjusted in real-time, the impact of the changes can be seen immediately.

Choose an option to learn more about Calibrate:


Configuring the Filter Graph

After adjusting your homography, you can call up the menu and enter the Filter Graph mode of the Calibrate Utility, where you can configure each of the filters defined in the active input filter graph. Simultaneously, Calibrate shows the real time effects of the decisions you make on the overall clarity of the camera input signal.

By changing parameters for the image segmentors and/or image processing filters, you can ensure that the raw camera image is transformed into an optimal segmented output image to be used by the vision services in your game.

Typically this involves configuring the image source(s), image segmentor(s), and any image processing filters.

    Image Sources
    For Lumenera cameras, webcams, PMSPs or other cameras defined as the source in the filter graph, you can mirror the output about the X- and/or Y-axes to correct for camera orientation. Additionally, you can adjust the Lumenera cameras' exposure to achieve the desired mean intensity level in the video stream.

    Image Segmentors
    For background subtraction filters, such as bgSubtract, you can adjust their thresholds to make the segmentation more accurate. A lower threshold makes the background subtraction more lenient in classifying a given pixel as non-zero ("true"). Sometimes in the segmented image, people's shapes can appear to be incomplete, or disconnected. Lowering the threshold can correct this problem. Too low of a threshold, however, will result in many false positives, or pixels classified as non-zero which should be zero.

    Image Processing Filters
    Usually the output data of the segmentor filters needs to be adjusted before it is fit for use in computer vision algorithms. Several image processing filters can be used to accomplish this and can be configured in Calibrate.

    One common image processing filter is the quadWarper. Typically when the camera sees beyond the extents of the projected PlaySpace, a quadWarper can be used to warp the region of the camera's sensed image that falls within the projector's display frustum to match the extents of the display. When a quadWarper is defined for the source, you can move the corners or sides of a quadrilateral to set the area of interest within the camera's raw image, which will be warped to the size of the segmented output image.

    Another common image processing filter is the medianFilter. Many times there is noise in the output of the image segmentors, which can be removed by a median filter. You can set the size of the filter as well as the number of passes the filter makes to smooth out the segmentation.
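
To make the effect of the segmentor threshold and the median filter more concrete, here is a minimal pure-Python sketch. It is not the implementation behind bgSubtract or medianFilter; the function names bg_subtract and median_filter, the per-pixel absolute difference, and the tiny 5x5 grayscale images are all assumptions chosen for illustration.

```python
from statistics import median_low


def bg_subtract(frame, background, threshold):
    """Mark a pixel as 1 ("true") when it differs from the background by more than threshold."""
    return [[1 if abs(f - b) > threshold else 0 for f, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def median_filter(img, size=3, passes=1):
    """Replace each pixel with the median of its size x size neighborhood."""
    r = size // 2
    h, w = len(img), len(img[0])
    for _ in range(passes):
        out = [row[:] for row in img]
        for y in range(h):
            for x in range(w):
                # Neighborhood is clipped at the image border.
                window = [img[yy][xx]
                          for yy in range(max(0, y - r), min(h, y + r + 1))
                          for xx in range(max(0, x - r), min(w, x + r + 1))]
                out[y][x] = median_low(window)
        img = out
    return img


background = [[10] * 5 for _ in range(5)]
frame = [row[:] for row in background]
for y in range(1, 4):
    for x in range(1, 4):
        frame[y][x] = 200          # a bright 3x3 "player"
frame[0][4] = 30                   # one slightly noisy pixel

strict = bg_subtract(frame, background, threshold=50)
lenient = bg_subtract(frame, background, threshold=15)
cleaned = median_filter(lenient, size=3, passes=1)
```

With the lower threshold the noisy pixel is classified as non-zero (a false positive); one pass of the median filter removes it again. The filter also trims the corners of the 3x3 block, which is the usual price of this kind of smoothing.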

In addition to configuring the image source, image segmentors, and image processing filters, the filter graph mode of Calibrate lets you adjust the screen real world width and height dimensions. These have no direct effect on the output segmentation image, but are used in the vision library and can be useful for scaling objects within your games to match real world scale. These values should be set to the dimension of your display surface in real-world units.

NOTE: to add or arrange filters in the vision pipeline, you must manually edit the .corners file.


Configuring the Vision Services

In addition to configuring the parameters of the filter graph, the Calibrate Utility is your command center for configuring and visualizing almost all of PlayMotion's vision services. From within Calibrate, easily switch between service configuration modes by pressing the ctrl + left / right arrow keys. (Helpful hint: print the Calibrate Quick Reference Card.)

Each service has its own real-time visualization to help convey what data the service provides to your application. Most services also have tunable parameters accessible via an on-screen menu (within Calibrate, TAB toggles menu on/off; arrow up/down navigates between parameters; and arrow left/right adjusts individual values).

Choose a service below to begin your journey, or visit our Services Matrix for some wild-style browsing:

Humans Service

Visualization

The Humans Service visualization displays the segmentation image on the screen. Each region of connected white pixels (a blob) whose area is greater than or equal to Minimum Human Area is outlined by a green rectangle. Markers are shown for the estimated positions of the head (green triangle) and hands (red and blue circles) for each such blob, which are also data objects (part of the Human Model) passed by the service to your application.

An image of an example human of the specified pixel area is shown for reference. As Minimum Human Area is changed, this reference human image is scaled appropriately.

For calibration to real world environments, a slide-rule scaled to the WorldSize set in the Filter Graph is positioned at right. Please note the following reference guidelines for Humans in the PlaySpace:

    • a 5'10" Human standing full frontal
    • in a PlaySpace measuring 9' high by 12' wide
    • using an ideal orthographic projection
    • with a camera with resolution of 320x240 pixels

will equate to a Human Area of 3,750 pixels (5.27 sq. ft) on screen
or approximately 5% of the actively warped sensor area of 76,800 pixels (108 sq. ft).
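
The reference numbers above follow from the ratio of sensor pixels to PlaySpace area, so they can be recomputed for other sensor resolutions or room sizes. This is a small sketch under that assumption; the helper names are hypothetical and not part of the SDK, and is_human() only mirrors the "greater than or equal to Minimum Human Area" rule described in the visualization text.

```python
def pixels_per_square_foot(sensor_w, sensor_h, space_w_ft, space_h_ft):
    """Sensor pixels that cover one square foot of the warped PlaySpace."""
    return (sensor_w * sensor_h) / (space_w_ft * space_h_ft)


def area_in_square_feet(pixel_area, density):
    """Convert a blob area in pixels to square feet."""
    return pixel_area / density


def is_human(blob_area, minimum_human_area):
    """A blob qualifies when its area meets or exceeds Minimum Human Area."""
    return blob_area >= minimum_human_area


density = pixels_per_square_foot(320, 240, 12.0, 9.0)   # about 711 pixels per sq. ft
human_sq_ft = area_in_square_feet(3750, density)        # about 5.27 sq. ft
human_fraction = 3750 / (320 * 240)                     # just under 5% of the sensor
```

Doubling the sensor resolution in both directions quadruples the density, so a Minimum Human Area tuned for 320x240 would need to be scaled by four at 640x480.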

Finally, realize that these reference areas (in both pixels and square feet) are based on a person standing flush with the wall, in other words, orthographically projected onto the 2D PlaySpace display plane. In realistic play situations, play generally takes place between 1 and 3 meters away from the wall. Depending on the lensing of your projector and relative position of the PMSP, players moving closer to the projector will increase the relative size of the shadow and the detected segmentation object. In other words, someone holding their hand 1 meter in front of the PMSP can cause the triggering of a Human detection with only one finger. These factors can be taken into consideration in the design of your physical PlaySpace.

Configuration

Minimum Human Area - Minimum number of contiguous segmented pixels required for a blob to be considered a "human," and to receive labels.

Code Reference

Sample Applications

  • Heads -- indicate the head of the first tracked blob
  • Heads-Advanced -- indicate an inferred head of the tracked blob
  • Hands -- a simple method for hand identification and tracking


Distance Transform Service

Visualization

The Distance Transform mode illustrates the distance of any given pixel from the segmentation boundary. The values in the Calibrate visualization image are not precise -- the distances have been quantized to enhance contrast. Points on the edge of the segmentation (the intersection of the bright red and green bands) have a value of zero. Red pixels represent "positive" distance from the edge (exterior), and green pixels are "negative" (interior). Darker values indicate greater distance from the segmentation's edge.
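
For readers who want to see the sign convention in code, the following brute-force Python sketch computes a signed distance for a tiny segmentation image. It is only an illustration: the function name signed_distance, the use of Euclidean distance, and the definition of an "edge" pixel as a segmented pixel with a non-segmented 4-neighbor are assumptions, not a description of how the service is implemented. It also assumes the image contains at least one segmented pixel.

```python
import math


def signed_distance(mask):
    """Distance of every pixel to the nearest edge pixel.

    Exterior pixels are positive, interior pixels negative, edge pixels zero.
    """
    h, w = len(mask), len(mask[0])

    def is_edge(x, y):
        if not mask[y][x]:
            return False
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            # Pixels outside the image count as background.
            if nx < 0 or ny < 0 or nx >= w or ny >= h or not mask[ny][nx]:
                return True
        return False

    edges = [(x, y) for y in range(h) for x in range(w) if is_edge(x, y)]
    result = []
    for y in range(h):
        row = []
        for x in range(w):
            d = min(math.hypot(x - ex, y - ey) for ex, ey in edges)
            row.append(-d if mask[y][x] else d)
        result.append(row)
    return result


mask = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0]]
dist = signed_distance(mask)   # indexed as dist[y][x]
```

Here dist[2][2], the center of the block, is negative (interior), while dist[0][0], the upper left corner of the image, is positive (exterior). The brute-force search is far too slow for real-time use; it is meant only to show what the values mean.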

Configuration

no configurable parameters

Code Reference

Sample Applications


Edge Normals Service

Visualization

In Edge Normals mode, we see a grid of colored arrows on screen. The screen is sampled at regular intervals in the x- and y-directions and at each point an arrow is drawn. This arrow's orientation is the same as the edge normal at that point. Further, the color of the arrow is derived from the HSV color model, with the angle of the normal determining the hue of the fully saturated arrow -- all arrows (normals) pointing in the same direction will have the same color.
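
The angle-to-hue idea can be sketched with Python's standard colorsys module. The exact mapping Calibrate uses (for example, which direction is red) is not documented here, so the choice below, where a normal pointing along +x has hue 0, is an assumption, as is the function name normal_to_rgb.

```python
import colorsys
import math


def normal_to_rgb(nx, ny):
    """Map the direction of a 2D normal to a fully saturated RGB color."""
    angle = math.atan2(ny, nx) % (2.0 * math.pi)   # 0 .. 2*pi
    hue = angle / (2.0 * math.pi)                  # 0 .. 1
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)


right = normal_to_rgb(1.0, 0.0)    # hue 0
left = normal_to_rgb(-1.0, 0.0)    # hue 0.5, the opposite side of the color wheel
```

Because only the direction of the normal is used, normal_to_rgb(2.0, 0.0) gives the same color as normal_to_rgb(1.0, 0.0), which is the "same direction, same color" property described above.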

Configuration

no configurable parameters

Code Reference

Sample Applications


Feature Detection Service

Visualization

The Feature Detection mode displays a red dot for each pixel from the camera image that passes the feature detection test. These features may be displayed against a plain black background, or overlaid on the "raw" input camera image.

Configuration

Feature Threshold - the detected corner's minimum contrast value. Increasing this value finds fewer, higher quality features.

Code Reference

The sole purpose of the Feature Detection service is to serve as a data feed for the Feature Tracking service. Therefore, there are no Method Calls available to directly access the output of the Feature Detection service.

See Also:


Feature Tracking Service

Visualization

When a feature is tracked from frame to frame, an arrow is drawn on top of the warped camera image. The endpoints of each arrow represent the feature's location in the current and previous frame. With each successive frame, previous tracks fade out. The colors of the tracks are derived from the HSV color model and represent the track length compared to the Min Track Length and Max Track Length defined by the service, with blue being the shortest and red the longest.

Configuration

Min Track Length
The smallest distance in pixels that a feature can move from frame to frame for the movement to be considered a "track."

Max Track Length
The largest distance in pixels that a feature can move from frame to frame for the movement to be considered a "track."

Flow Window Size
Determines the size of a search window used iteratively to find features in the current frame. Higher values can more easily find matching features across larger distances, which helps track quickly moving objects and slow camera rates. However, too large a window size can significantly degrade performance.

Attachment Window Size
When using the Feature Tracking service, you can find the expected position in the next frame of any point in the image using getUpdatedPositionUsingFlow(). Adjust Attachment Window Size to change the extents of the window in which tracks are averaged to determine the point's expected position in the next frame. A bigger value can be useful to get a general sense of the optical flow in that area. However, too big of a value could identify flow from tracks that are too far from the point, or it could take into account many conflicting tracks, providing little information.

Track Initialization Interval
It is common to lose features in the sequence of images over time. The Track Initialization Interval determines the number of frames that detected features are tracked before they are discarded, and a new set of features are detected for tracking.
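
The interplay of Min Track Length, Max Track Length and Attachment Window Size can be sketched in a few lines of Python. This is not the code behind getUpdatedPositionUsingFlow(); the function names, the representation of a track as a pair of points, and the use of a square window around the track's current position are all assumptions made for illustration.

```python
import math


def track_length(track):
    """track is ((x0, y0), (x1, y1)): previous and current feature position."""
    (x0, y0), (x1, y1) = track
    return math.hypot(x1 - x0, y1 - y0)


def valid_tracks(tracks, min_len, max_len):
    """Keep only movements whose length lies between the two limits."""
    return [t for t in tracks if min_len <= track_length(t) <= max_len]


def expected_position(point, tracks, window):
    """Shift point by the average displacement of the tracks inside a square window."""
    px, py = point
    half = window / 2.0
    nearby = [t for t in tracks
              if abs(t[1][0] - px) <= half and abs(t[1][1] - py) <= half]
    if not nearby:
        return point               # no flow information near this point
    dx = sum(t[1][0] - t[0][0] for t in nearby) / len(nearby)
    dy = sum(t[1][1] - t[0][1] for t in nearby) / len(nearby)
    return (px + dx, py + dy)


tracks = [((10, 10), (14, 10)),      # moved 4 pixels
          ((12, 12), (16, 12)),      # moved 4 pixels
          ((200, 200), (200, 260)),  # moved 60 pixels, too long
          ((50, 50), (50, 50))]      # did not move, too short
kept = valid_tracks(tracks, min_len=1.0, max_len=20.0)
predicted = expected_position((15, 11), kept, window=8)
```

In this example both surviving tracks fall inside the window and agree on a 4-pixel move to the right, so the point is predicted to move the same way. A much larger window would start averaging in unrelated tracks, which is the trade-off described under Attachment Window Size.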

Code Reference


Difference Service

Visualization

Difference mode presents the segmentation image on screen with an overlaid marching graph. Each difference measurement is added to the right side of the graph, shifting all other values left. This allows you to see some historical measures of the difference value. A low frequency sampling of the scalar difference value is also displayed numerically on the screen.
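
The marching graph behaves like a fixed-length queue. The sketch below shows that behavior with Python's collections.deque. How the service actually computes its scalar difference value is not specified in this section, so the difference() function here (a count of pixels that changed between two segmentation images) is purely an assumed stand-in, and the class name MarchingGraph is hypothetical.

```python
from collections import deque


def difference(prev_mask, curr_mask):
    """Assumed measure: number of pixels whose segmentation value changed."""
    return sum(p != c
               for prow, crow in zip(prev_mask, curr_mask)
               for p, c in zip(prow, crow))


class MarchingGraph:
    """Keeps the most recent `width` values; new values enter on the right."""

    def __init__(self, width):
        self.values = deque(maxlen=width)

    def add(self, value):
        # When the graph is full, the oldest (leftmost) value is dropped.
        self.values.append(value)


graph = MarchingGraph(width=3)
for value in (5, 7, 2, 9):
    graph.add(value)

changed = difference([[0, 1], [1, 1]], [[0, 0], [1, 1]])
```

After four values have been added to a graph three samples wide, the first value has marched off the left edge and only the last three remain.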

Configuration

no configurable parameters

See also:


Moment Service

Visualization

In Moment mode, a gauge is displayed in the lower left-hand corner of the screen, superimposed on the segmentation image. The gauge's needle points in the direction of the principal axis of the segmentation image. The image's skew values are displayed numerically below the gauge.
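
A common way to find the principal axis of a binary image is from its second-order central moments. The Python sketch below uses that textbook formula; whether the Moment Service computes its axis in exactly this way is an assumption, and the function name principal_axis_angle is hypothetical. The image must contain at least one segmented pixel.

```python
import math


def principal_axis_angle(mask):
    """Orientation of the principal axis in radians, measured from the +x axis.

    Image rows grow downward, so a positive angle tilts toward +y on screen.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    mu20 = sum((x - cx) ** 2 for x, _ in pts)
    mu02 = sum((y - cy) ** 2 for _, y in pts)
    mu11 = sum((x - cx) * (y - cy) for x, y in pts)
    return 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)


horizontal = principal_axis_angle([[1, 1, 1, 1]])
vertical = principal_axis_angle([[1], [1], [1]])
diagonal = principal_axis_angle([[1, 0, 0],
                                 [0, 1, 0],
                                 [0, 0, 1]])
```

A horizontal bar gives an angle of zero, a vertical bar a quarter turn, and the diagonal sits halfway between. In a game, an angle like this could be used to let a player "steer" by leaning.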

Configuration

no configurable parameters

Code Reference

Sample Applications


Motion Recognition Service

Visualization

The Motion Recognition service allows up to ten dynamic motions or held poses to be captured and subsequently recognized using a collection of motion history images for each motion. Motions are mapped to an integer index of 0 through 9.

To capture a motion, first make the motion and ensure that its entirety is part of the motion history image (the segmentation and the fading out segmentation from the n previous frames, where n = the Motion Duration service parameter). If the motion duration is acceptable, make the motion and press (or have someone else press) a number 0-9 to capture the motion history image. This image is saved on the file system and mapped to the number pressed. It is not necessary to go sequentially from 0 to 9.
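
As a rough picture of what a motion history image holds, the Python sketch below stamps currently segmented pixels with the duration value and lets everything else fade by one step per frame. This is a generic formulation of the idea, not PlayMotion's implementation; the function name update_history and the integer fade are assumptions.

```python
def update_history(history, mask, duration):
    """Return the next motion history image.

    Segmented pixels are set to `duration`; all others fade by one per frame
    until they reach zero.
    """
    return [[duration if m else max(h - 1, 0) for h, m in zip(hrow, mrow)]
            for hrow, mrow in zip(history, mask)]


# A blob moving left to right across a 1x3 image over three frames.
frames = [[[1, 0, 0]], [[0, 1, 0]], [[0, 0, 1]]]

history = [[0, 0, 0]]
for mask in frames:
    history = update_history(history, mask, duration=3)

short_history = [[0, 0, 0]]
for mask in frames:
    short_history = update_history(short_history, mask, duration=2)
```

With a duration of three frames the whole sweep is still visible as a brightness ramp; with a duration of two the starting position has already faded to zero, which is why a motion must fit within the Motion Duration to be captured in its entirety.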

In practice, it makes great sense to view all captured motion templates within the file system and to edit the groups using common sense heuristics, removing outliers. It also often makes sense to store multiple template sets for different applications you are developing. All current motion sets are accessible as BMP files within the Windows file system, much like Poppet. You can find and manipulate them here:

      C:\playmotion\product\gizmo\config\motions\n

Where C:\playmotion\ is the directory you selected during installation, and n is slot 0-9.

For each motion, repeat the capture process several times as the recognition algorithm works better with a larger data set. At any point, if the Motion Recognition service recognizes a captured motion, an icon containing the motion number will appear on the screen where the motion was detected. If a motion is not easily detected, try adjusting the Motion Feature Threshold for that motion to a lower value, which makes recognition more lenient.

Configuration

Motion Duration
The number of absolute frames which comprise the motion history image. For quick motions, fewer frames are needed. For motions that take longer to complete, more frames might be necessary to capture the motion.

Motion Area Threshold
Size in sensor pixels that a connected component of the motion history image must exceed to be analyzed for motion recognition.

Cell Occupancy Threshold
The number of detected gestures in a cell in the occupancy grid needed to output a motion location in getMotionLocations(). Increase to make less sensitive to spurious detections. Decrease to make more sensitive.

Cell Occupancy Timeout
Time in milliseconds before resetting a cell in the occupancy grid. Once a cell has been reset, it starts aggregating detections again. Once a cell has reported a detection, it no longer aggregates detections.

Occupancy Grid Width
Width of occupancy grid. The occupancy grid is overlaid onto the camera image. So a 2x2 occupancy grid in a 320x240 sensor image aggregates recognized motions into each quadrant of the image.

Occupancy Grid Height
Height of occupancy grid. The occupancy grid is overlaid onto the camera image. So a 2x2 occupancy grid in a 320x240 sensor image aggregates recognized motions into each of four 160x120 quadrants of the image.

Motion Feature Thresholds
Parameter for each motion governing the sensitivity of identifying the motion based on features in the captured templates. Increase to avoid false positives. Decrease to make the recognition more lenient for this motion.
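
The occupancy grid parameters above can be pictured with a short Python sketch. It covers only the grid size and the Cell Occupancy Threshold; the timeout and the "stop aggregating after reporting" behavior are left out. The names cell_for_pixel and OccupancyGrid are hypothetical, and treating the threshold as "at least this many detections" is an assumption.

```python
def cell_for_pixel(x, y, sensor_w, sensor_h, grid_w, grid_h):
    """Return the (column, row) of the grid cell containing a sensor pixel."""
    col = min(x * grid_w // sensor_w, grid_w - 1)
    row = min(y * grid_h // sensor_h, grid_h - 1)
    return col, row


class OccupancyGrid:
    def __init__(self, grid_w, grid_h, sensor_w, sensor_h, threshold):
        self.size = (grid_w, grid_h)
        self.sensor = (sensor_w, sensor_h)
        self.threshold = threshold
        self.counts = [[0] * grid_w for _ in range(grid_h)]

    def add_detection(self, x, y):
        """Record one recognized motion at sensor pixel (x, y)."""
        col, row = cell_for_pixel(x, y, *self.sensor, *self.size)
        self.counts[row][col] += 1

    def motion_cells(self):
        """Cells that have gathered enough detections to be reported."""
        return [(col, row)
                for row, counts in enumerate(self.counts)
                for col, count in enumerate(counts)
                if count >= self.threshold]


grid = OccupancyGrid(2, 2, 320, 240, threshold=2)
grid.add_detection(200, 50)    # upper right quadrant
grid.add_detection(250, 100)   # upper right quadrant again
grid.add_detection(10, 200)    # lower left quadrant, only once
cells = grid.motion_cells()
```

With a threshold of two, only the upper right quadrant is reported; the single detection in the lower left is treated as possibly spurious. Raising the threshold makes the grid less sensitive, as described under Cell Occupancy Threshold.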

Code Reference


Using the PMVE Runtime

The PlayMotion runtime is the library of functions that provide real-time analysis of a camera signal, processed by the filter graph. When called from within your code, these functions produce and report the latest batch of input values to be used to update the game.

In this section, various styles of input processing will be presented. For each style, we will discuss some of its merits as well as how to use PlayMotion's services to recognize and use it. Additionally, the necessary code snippets needed to access the service data are presented.

While perusing this section, it might be a good idea to have the source code of some of the included Sample Applications open in your code editor of choice. If you don't have a code editor, we highly recommend that you download and install PyPE immediately.


Initialize VisionInput

The VisionInput class is the center of the PlayMotion runtime. It marshals the vision pipeline, and generates the input signals used in each update cycle of a game's logic. Accessors for all of the parts of the pipeline belong to the VisionInput class.

So in order to do anything with PlayMotion, you must first initialize the system:

#python
playmotion.init()

or

//c++
VisionInput* input = new VisionInput();

The PlayMotion runtime can be run in its own thread, providing asynchronous access to the data generated by the vision pipeline and decoupling the input and update units of the game. To do this, call

//c++
input->start();

Note the start() call works in Python as well, but the wrappers currently do not relinquish the Global Interpreter Lock, so the performance gain is minimal. Watch the changelog for updates!

Once you've initialized VisionInput, it's time to subscribe to services.


Subscribe to Services

Some computer vision algorithms are quite computationally intensive. To ensure that the games you develop are as responsive as possible, PlayMotion's computer vision services are provided on an as-needed basis. When a game requires a specific type of processing, it can request the vision pipeline to compute the given data and gain access to it.

Controlling which services should and should not be run is the job of the VisionSubscription. Like a magazine subscription, you ask for an item and it begins "arriving" the next time the vision pipeline is updated. Similarly, you can cancel your subscription to one or more of the services. Their processing will be stopped and the data they provide will no longer be available.

Say we want to know how far it is from the upper lefthand corner of the segmentation image to the nearest segmented pixel (i.e. non-zero value). First, create a VisionSubscription, and ask for the distance transform service.

#python
mySubscription = playmotion.VisionSubscription()
mySubscription.subscribe(playmotion.SERVICE_DISTTRANSFORM)

After the next update of the vision pipeline, the functions associated with the distance transform service will provide useful data:

#python
playmotion.update()
d = playmotion.getDistanceFromEdge(0,0)

When we're done with the service, we can manually remove it from the subscription to save processing cycles.

#python
mySubscription.unsubscribe(playmotion.SERVICE_DISTTRANSFORM)

The subscription is automatically cleared on release of the VisionSubscription object.
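
If you prefer the unsubscribe step to happen automatically at the end of a block of code, one option is a small context manager. This is only a sketch: the subscribed() helper is our own invention, and FakeSubscription is a stand-in for playmotion.VisionSubscription so that the example runs without the SDK installed. Only the subscribe() and unsubscribe() calls shown above are assumed to exist.

```python
from contextlib import contextmanager


@contextmanager
def subscribed(subscription, service):
    """Subscribe to a service for the duration of a with-block."""
    subscription.subscribe(service)
    try:
        yield subscription
    finally:
        # Runs even if the block raises, so the service is always released.
        subscription.unsubscribe(service)


class FakeSubscription:
    """Stand-in for playmotion.VisionSubscription (assumption for this demo)."""

    def __init__(self):
        self.active = set()

    def subscribe(self, service):
        self.active.add(service)

    def unsubscribe(self, service):
        self.active.discard(service)


SERVICE_DISTTRANSFORM = "disttransform"  # placeholder for the real constant

sub = FakeSubscription()
with subscribed(sub, SERVICE_DISTTRANSFORM):
    inside = SERVICE_DISTTRANSFORM in sub.active
after = SERVICE_DISTTRANSFORM in sub.active
```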

This is how you call it once. But the PlayMotion input signal is a real-time video stream. Next, we'll find out how to request new vision data on a per-frame basis.


Request New Vision Data

When you want the latest data from the vision pipeline, you must ask for an update:

#python
playmotion.update()

Generally, this call is made exactly once per update cycle within your game. In a single-threaded environment, this actually invokes the vision pipeline's update routine, producing new data.

If VisionInput::start() has been called, though, the vision thread is a tight loop effectively containing only the pipeline update. In this case, the update() call in the application thread acts as a "buffer-flip" in the vision pipeline, fetching the last processed frame and its accompanying service data. We use the same function call for these two purposes to make the transition between single- and multi-threaded environments seamless.
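
The "buffer-flip" idea can be pictured with a small double buffer. The class below is a conceptual sketch written for this manual, not the runtime's actual implementation: a worker thread publishes results as fast as it can, and the game sees a new result only when it calls update().

```python
import threading


class DoubleBuffer:
    """Conceptual model of a buffer flip between a vision thread and a game."""

    def __init__(self):
        self._lock = threading.Lock()
        self._back = None   # most recent result written by the vision thread
        self._front = None  # result the game reads during this update cycle

    def publish(self, frame_data):
        """Called by the vision thread after each pipeline pass."""
        with self._lock:
            self._back = frame_data

    def update(self):
        """Called once per game update: copy the latest result to the front."""
        with self._lock:
            self._front = self._back
        return self._front


buf = DoubleBuffer()
# The worker stands in for the vision thread and publishes 100 "frames".
worker = threading.Thread(target=lambda: [buf.publish(i) for i in range(100)])
worker.start()
worker.join()
latest = buf.update()
```

Because the game reads only the front copy, the data it sees stays stable for a whole update cycle even if the vision thread finishes another frame in the meantime.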

Once you've updated the data, you're ready to perform some signal analysis, and get the desired data from the subscribed services.


Signal Analysis with Services

The ultimate purpose of the filter graph is to transform the input video signals into one or more image data streams that can be analyzed to produce numerical input values to a game each frame. Many analysis techniques are provided by PlayMotion's vision services.

Generally, these techniques fall into one of two categories: static analysis, where only data from a single frame is analyzed, and dynamic analysis, where subsequent frames are compared in some meaningful way in order to generate time-based data abstractions.

Static Analysis

Static analysis is processing that can be performed to gather information from a single frame of video without knowledge of any surrounding frames. During static analysis, characteristics and metrics intrinsic to the image are computed and made available to subscribers.

Blob Labeling / GetHumans

Connected regions of non-zero pixels of sufficient (user-defined within Calibrate: Humans) size are called blobs. Blobs may be full human figures, highly reflective or actively lit tokens, or more abstract representations of objects or movement, depending on the configuration of the filter graph. Blobs, simply, are regions of contiguous pixels which are deemed of interest to a computer vision system.

Each blob typically represents a distinct object in the real world, and blobs' characteristics can tell you a lot about what's actually happening in front of the camera.

Blobs 1.0 are identified and labeled statically, so there is no expressed correlation between blobs from one frame to the next frame. Blobs are identified in raster-scan order (left to right across a row beginning at the top of the sensor image and moving downward one row at a time) and placed into a list that developers can access randomly.

Subscribing to SERVICE_HUMANS within your code grants access to the following information for each enumerated blob, within a structure called HumanModel:

    • bounding box
      the smallest rectangle that fully contains the blob (represented as an IPL Region of Interest within a CvRect)
    • area
      float -- the number of (non-zero) pixels comprising the blob
    • centroid
      the weighted center (X, Y in vision space) of the blob (not the bounding box)
    • head
      the center of the top edge of the bounding box
    • hands
      the uppermost non-zero pixel in each of the left- and rightmost columns of the bounding box

Access to these labels comes in two flavors, as seen here:

    #python
    #approach 1
    h = playmotion.getHumans()[0]
    
    area = h.ROI.area
    centerOfMass = h.centroid
    head = h.head
    lHand = h.leftHand
    rHand = h.rightHand
    bounds = h.ROI.rect
    # individual parts of this CvRect can now be accessed via variables
    # bounds.x, bounds.y, bounds.width, bounds.height
    
    # further, proper transform of the ROI is executed as follows:
    # bounding box of the identified Human
    boundsXY = playmotion.xformPointToWorld(h.ROI.rect.x,
                                            h.ROI.rect.y,
                                            base.getAspectRatio()*2, 2)
    boundsWH = playmotion.xformVectorToWorld(h.ROI.rect.width,
                                            h.ROI.rect.height,
                                            base.getAspectRatio()*2, 2)
    boundsX1 = boundsXY.x
    boundsY1 = boundsXY.y
    boundsW = boundsWH.x
    boundsH = boundsWH.y
    boundsX2 = boundsX1 + boundsW #rightFinger.x
    boundsY2 = boundsY1 + boundsH 
    
    #approach 2
    centerOfMass = playmotion.getCentroids()
    head = playmotion.getTopOfBlobs()[0]
    lHand = playmotion.getHands()[0]
    rHand = playmotion.getHands()[1]
    

Three things are important to note in the above code. First, only centroids, heads, and hands are available via direct accessors. Also, getCentroids() returns the centroids of the blobs in the scene, and should not be confused with getCentroid(), which we'll discuss next. Finally, all coordinates and dimensions are returned in vision space. Read more below to see how to convert between coordinate spaces.
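
To make the blob fields described in this section more concrete, here is a minimal pure-Python sketch of raster-scan blob labeling on a tiny binary image. It assumes 4-connectivity and uses a hypothetical min_area parameter in place of the size configured within Calibrate: Humans. The dictionary keys are illustrative and are not the layout of HumanModel.

```python
def label_blobs(image, min_area=1):
    """Find 4-connected blobs of non-zero pixels in raster-scan order."""
    h, w = len(image), len(image[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for y in range(h):          # top to bottom
        for x in range(w):      # left to right within a row
            if not image[y][x] or seen[y][x]:
                continue
            # Flood fill from the first unvisited non-zero pixel.
            stack, pixels = [(x, y)], []
            seen[y][x] = True
            while stack:
                px, py = stack.pop()
                pixels.append((px, py))
                for nx, ny in ((px + 1, py), (px - 1, py),
                               (px, py + 1), (px, py - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            if len(pixels) < min_area:
                continue  # too small to count as a blob
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            x0, y0 = min(xs), min(ys)
            bw, bh = max(xs) - x0 + 1, max(ys) - y0 + 1
            blobs.append({
                "bounds": (x0, y0, bw, bh),                  # x, y, width, height
                "area": float(len(pixels)),                  # non-zero pixel count
                "centroid": (sum(xs) / len(xs), sum(ys) / len(ys)),
                "head": (x0 + (bw - 1) / 2, y0),             # center of top edge
            })
    return blobs


image = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
blobs = label_blobs(image)
```

The blob whose first pixel is met earliest in the scan appears first in the list, which is why index 0 is not guaranteed to refer to the same person from one frame to the next.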

Position and Orientation of Segmented Pixels

The other two controls come from the centroid of the segmented pixels. The centroid is the "balance point" of the image -- the average position of all of its non-zero pixels.

    #python
    playmotion.getCentroid()
    

generates an (x,y) coordinate pair, which can be used just like the location of a mouse or a joystick's angle measures. Moving left, right, up or down in the camera's view moves the centroid accordingly. Obviously, 'up' and 'down' motions are limited in a horizontal camera configuration, but with the camera pointed vertically, both of these axes can be controlled completely.

When the filter graph is configured to identify silhouettes of people and many players are in the camera's view at once, direction(s) can be chosen by majority rule. Simple crowd-based interactions can be controlled in this way by simply "following the mob". As with a single player, the centroid of the segmentation image will report the center of mass of the image. But, since each player is represented by (approximately) the same number of pixels, each of their positions counts as (roughly) one 'vote' for the location of the centroid. Hence, the highest concentration of people will govern the location of the centroid.
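
The "majority rule" behavior follows directly from how a centroid is computed. The sketch below is a plain-Python illustration on a toy image, not the SDK's implementation; each player is reduced to a single pixel for brevity.

```python
def centroid(image):
    """Mean (x, y) position of the non-zero pixels, or None if there are none."""
    total = sum_x = sum_y = 0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if value:
                total += 1
                sum_x += x
                sum_y += y
    if total == 0:
        return None
    return sum_x / total, sum_y / total


# Three "players" on the left edge, one on the right edge.
crowd = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0],
]
cx, cy = centroid(crowd)
```

Here cx lands well to the left of the image's horizontal middle (3.5), because three of the four votes are on the left.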

Additionally, the orientation of the segmented pixels en masse can be determined by calling the Moment Service.


Distance Measurements

For any point in the segmentation image, we can determine the distance to the nearest boundary between non-zero pixels and zeros. If SERVICE_DISTTRANSFORM is subscribed to,

#python
dist = playmotion.getDistanceFromEdge(0,0)

returns the distance from the upper lefthand corner of the image to the nearest segmentation boundary in pixels (using vision space units). Since the distance is the straight-line Euclidean distance between a pair of pixels, the return value is a floating point (real) number.

The "orientation" of the pixel is also encoded in the distance value. Pixels that are outside the segmented regions (zero pixels) have a positive distance to edge (i.e. dist >0) and non-zero pixels have negative distance to edge (dist <0). This way, it's not necessary to test the distance and also sample the segmentation image to see if a pixel is part of a segmented region or not.
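
The sign convention is easy to see with a brute-force sketch. The function below is our own illustration and approximates "distance to the boundary" as the distance to the nearest pixel of the opposite class; the runtime's distance transform may measure the boundary slightly differently and is certainly faster.

```python
import math


def signed_distance(image, x, y):
    """Distance from (x, y) to the nearest pixel of the opposite class.

    Positive for zero (background) pixels, negative for non-zero pixels.
    Returns +/- infinity if the image contains only one class.
    """
    inside = bool(image[y][x])
    best = math.inf
    for py, row in enumerate(image):
        for px, value in enumerate(row):
            if bool(value) != inside:
                best = min(best, math.hypot(px - x, py - y))
    return -best if inside else best


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
corner = signed_distance(image, 0, 0)   # background pixel: positive
center = signed_distance(image, 2, 2)   # segmented pixel: negative
```

A game can therefore test a single value: dist < 0 means the point is covered by a player.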

Segmentation Edge-Normal Computation

The boundary of a group of segmented pixels is a curve in the plane of the image. Each point on that curve has a normal vector -- the vector that points directly (that is, perpendicularly) away from the segmented region at that point. Normal vectors are valuable in a number of scenarios, such as computation of collision and reflection angles, or judging orientations.

PlayMotion provides normal vector information at all points in image space rather than only on the boundary of segmented pixels. If we consider the normal to be the vector that points "furthest away" from segmented pixels, then this generalization makes sense: the normal vector at a pixel p points in the direction that takes you further from segmented pixels than any other direction.

When subscribed to SERVICE_EDGENORMALS, the normal at a point in image space can be fetched simply:

#python
#x1, y1 indicates the point we wish to query, in world space 

v = cv.cvPoint2D32f(0,0) #out parameter (pointer)
w = 16 #worldWidth
h = 9  #worldHeight

testpoint = playmotion.xformPointToSegmentation(x1,y1,w,h)
dist = playmotion.getDistanceFromEdge(testpoint,v)
result = Vec3(v.x, 0, -v.y)

sets v to the x- and y-components of the segmentation normal vector at the queried point (x1, y1), and result to the same vector expressed as a Panda3D Vec3.
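
As one example of putting a normal to use, the sketch below reflects a 2D velocity about a normal vector, the standard r = v - 2(v.n)n formula used for simple bounces. The helper names are ours, and the normal here is hard-coded rather than fetched from the runtime.

```python
import math


def normalize(v):
    """Scale a 2D vector to unit length."""
    length = math.hypot(v[0], v[1])
    return (v[0] / length, v[1] / length)


def reflect(velocity, normal):
    """Reflect a 2D velocity about a surface normal: r = v - 2 (v . n) n."""
    n = normalize(normal)
    dot = velocity[0] * n[0] + velocity[1] * n[1]
    return (velocity[0] - 2 * dot * n[0], velocity[1] - 2 * dot * n[1])


# A ball moving right and toward the surface hits an edge whose normal
# points along +y; the y-component of its velocity is flipped.
bounced = reflect((3.0, -1.0), (0.0, 2.0))
```

The normal does not need to be unit length on input, since reflect() normalizes it first.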

Pose Identification

Being able to identify poses of the human body provides an opportunity for rich, custom interfaces that are suited to a particular player. Just like configuring hot keys for a game, players could have the opportunity to decide what posture will elicit each of the actions available in the game.

Pose identification is available in PlayMotion when subscribed to SERVICE_SNAPSHOT. After capturing a collection of snapshots

#python
playmotion.captureSnapshot("PointLeft")

and later

#python
playmotion.captureSnapshot("StandOnOneFoot")

the current camera frame is compared against each. For any chosen pose, we can determine a measure of similarity between this instant in time and the time of capture.

#python
playmotion.getScalarDifference("PointLeft")

The higher the return value, the more similar the current pose is to the pose in question.

Since the pose analysis is a comparison of images, the snapshot service is fairly delicate. A series of poses captured when a player is on the lefthand side of the camera's view will return poor (unusable) results if that player moves to the righthand side. Even with a fixed location (such as an unmoving chair), a different player's poses will generate poor results when compared to the original snapshots. Furthermore, differences between stored poses should be large, as comparison of a given frame to two references that are very similar to one another will produce results that are difficult to interpret.
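
The sensitivity to player position can be demonstrated with a toy comparison. The metric used by the snapshot service is not documented here, so the similarity() function below is purely a stand-in (the fraction of pixels on which two binary images agree), chosen to show why a shifted silhouette scores poorly.

```python
def similarity(snapshot, frame):
    """Fraction of pixels on which two same-sized binary images agree (0 to 1)."""
    total = matches = 0
    for row_a, row_b in zip(snapshot, frame):
        for a, b in zip(row_a, row_b):
            total += 1
            matches += bool(a) == bool(b)
    return matches / total


# A small cross-shaped "pose" on the left side of the image...
pose = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
]
# ...and the same pose after the player moves three pixels to the right.
shifted = [row[-3:] + row[:-3] for row in pose]

same_place = similarity(pose, pose)
moved = similarity(pose, shifted)
```

The shape is identical in both images, yet the score drops sharply once it no longer overlaps the stored snapshot.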

Because of these limitations, games that use the snapshot service are typically short (<5 minutes) to minimize the chance of players displacing themselves during play. Also, it is common for snapshot service games to be calibrated on the fly at the beginning of each run of the game, as opposed to other services, which may be configured once per environment and have their parameters stored in the SiteXML file.

Dynamic Analysis

Some measurements require knowledge of the scene over time. A computation may rely on a number of previously processed frames or may need to know how long it's been between frames in the vision pipeline, etc. Typically, these dynamic analysis methods convey information about motion in the camera's view.

Energy Estimation

The simplest dynamic analysis method is to simply identify the gross amount of motion in the scene. PlayMotion measures this global motion by computing the percentage of pixels in the segmentation image which have changed between the previous frame and the current.

To get this data, subscribe to SERVICE_DIFFERENCE and call

#python
diff = playmotion.getScalarDifference()

Averaging this value over several frames (15, say) gives a good indication of the amount of energy in the scene.
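
A rolling average like the one suggested above might be sketched as follows. Both changed_fraction() and EnergyMeter are illustrative names of our own; in a game you would feed the meter with the value returned by getScalarDifference() each frame instead of computing the fraction yourself.

```python
from collections import deque


def changed_fraction(prev, curr):
    """Fraction of pixels whose on/off state differs between two frames."""
    total = changed = 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += 1
            changed += bool(p) != bool(c)
    return changed / total


class EnergyMeter:
    """Rolling mean of the most recent per-frame difference values."""

    def __init__(self, window=15):
        self._samples = deque(maxlen=window)  # old samples drop off automatically

    def add(self, value):
        """Record one frame's value and return the current average."""
        self._samples.append(value)
        return sum(self._samples) / len(self._samples)


one_pixel = changed_fraction([[0, 1], [1, 1]], [[0, 0], [1, 1]])

meter = EnergyMeter(window=3)  # short window so the example is easy to follow
readings = [meter.add(v) for v in (0.0, 0.3, 0.6, 0.9)]
```

A longer window gives a steadier reading at the cost of reacting more slowly to sudden bursts of movement.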

Gesture Detection

Gesture detection is the dynamic analysis analog of pose detection. It allows players to use movements, rather than postures, to generate input signals to the game. This provides greater subtlety in the available library of actions a player can perform.

Consider, for instance, the difference between reaching for an object and striking that same object. The pose at the end of each of these actions is very similar, but the movements (in particular, the speeds) of the two actions are very different. Gesture detection provides the facility for differentiating between the two.

Using PlayMotion's SERVICE_MOTIONRECOGNITION allows users to identify up to 10 distinct (user-defined) gestures. We can see if the gesture has been detected by calling

#python
gList = playmotion.getMotionLocations(0)
detected = bool(gList)  # True when at least one location was reported

If detected is true, each element in gList is a point in image space where the first (i.e. index 0) gesture was identified this frame.

Feature Tracking

...

Do Your Own Analysis

You may find that the information you want is not available through PlayMotion's services. We welcome developers to perform their own computer vision processing, using the direct matrix output stream from the filter graph as input. If you're particularly proud of your technique or would like to see it optimized by inclusion in the runtime, please share it on the forums.

The images used by the PlayMotion vision pipeline are stored in the popular IplImage matrix structure, which makes them perfectly suited for third party analysis using OpenCV. Below you'll find a simple example of how to do analysis beyond what's available from PlayMotion.

#python
from opencv.cv import cvCountNonZero

sImg = playmotion.getSegmentationImage1()
w = playmotion.getSegmentationWidth()
h = playmotion.getSegmentationHeight()

occupiedArea = float(cvCountNonZero(sImg)) / (w * h)

What's generated here is the fraction of the segmentation image that is occupied by "good" (i.e. non-zero) pixels, as a number between 0 and 1. There are plenty of much more interesting things that can be done with PlayMotion's input signal and OpenCV.

Another excellent (and elegant) example of inline vision processing is provided in the Sample Application: Heads. In this case additional processing is computed within Python to find a more accurate Head object than the PlaySDK natively provides.

Of course, you don't have to use OpenCV to do your processing. Other computer vision primitives libraries are available, and you can directly manipulate the data if you choose.

All About Coordinate Spaces

The canonical coordinate spaces for raster images and the 3D virtual spaces in which most video games exist differ significantly.

The coordinate description for images, which will be referred to as vision space, is an integer lattice (no values like 0.5) with origin in the upper lefthand corner. The positive x direction is rightward across the screen, and the positive y direction points down.

The 3D graphics for the game live in what is called 3D world space. World space coordinates are floating point triples (x,y,z). Axes are typically arranged to obey the "right hand rule". In Panda3D's default coordinate system, X points right, Y is depth (normal to the display surface), and Z points up.

Often, PlayMotion games employ a fixed virtual camera in the 3D world that faces the origin. In these cases, world space can be considered a plane orthogonal to the camera's view with its origin in the center. For this discussion, let's assume the positive x direction is to the right and positive y is up, as in the Cartesian plane. For simplicity, let's also assume that the virtual camera's extent is [-1,1] in both axes.

Points in these two spaces have a slightly complicated relationship to one another because of the differences in the coordinate representations. Interpreting values from the vision pipeline in the virtual world requires coordinate space transformation. Some shortcuts for such transformations are available through the runtime.

Here's a simple example using the distance transform service. Let x and y be the 2D coordinates of a point in the world space plane. The function getDistanceFromEdge() operates in vision space, so we must transform the point to its vision space equivalent.

#python
visionSpacePoint = playmotion.xformPointToSegmentation(x,y)

All that we are doing here is making sure that the world point corresponds to the correct point in vision space. If the world point is in the bottom right corner of the screen, then the point should be in the bottom right corner of the segmentation image too. If the world space point is in the middle, then the vision space point should be in the middle as well.

Now that we have this information, we can simply make the call

#python
visionSpaceDistance = playmotion.getDistanceFromEdge(visionSpacePoint)

The distance returned is relative to the vision space, but we need it in a world space context. To accomplish this, we can just use another xform...() function

#python
worldDistance = playmotion.xformScalarToWorld(visionSpaceDistance)

Now we have an accurate distance value in units that are meaningful in our world.
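
For readers curious about what such a transformation involves, here is a hand-written sketch under the simplifying assumptions made earlier: world extents of [-1,1] on both axes with y up, and a 320x240 vision image with y down. These functions are our own and are not the runtime's xform...() implementations.

```python
def world_to_vision(x, y, img_w=320, img_h=240):
    """Map a world point in [-1, 1] x [-1, 1] (y up) to pixel coordinates (y down)."""
    px = (x + 1.0) / 2.0 * (img_w - 1)
    py = (1.0 - y) / 2.0 * (img_h - 1)   # flip: world "up" is vision "down"
    return int(round(px)), int(round(py))


def vision_to_world(px, py, img_w=320, img_h=240):
    """Inverse mapping from pixel coordinates back to the world plane."""
    x = px / (img_w - 1) * 2.0 - 1.0
    y = 1.0 - py / (img_h - 1) * 2.0
    return x, y


def scalar_to_world(pixels, img_w=320):
    """Convert a length in pixels to world units, using the horizontal scale."""
    return pixels * 2.0 / (img_w - 1)


top_left = world_to_vision(-1.0, 1.0)
bottom_right = world_to_vision(1.0, -1.0)
middle = world_to_vision(0.0, 0.0)
```

Note that when the image is not square, horizontal and vertical pixels correspond to different world lengths, so a single scalar conversion has to pick one axis; this sketch picks the horizontal one.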

Note that the xform...() functions do not require a subscription to any vision service.

Exploring the Sample Apps

The PlaySDK is packaged with a number of sample applications which you can open to learn about how to work with and visualize the output from PlayMotion services within Python. Please note: these Sample Applications all require Panda3D in order to run. Download Panda3D here.

Basics include:
* Framework -- the barebones: show segmentation on screen
* Heads -- indicate the head of the first tracked blob
* Heads-Advanced -- indicate an inferred head of the tracked blob
* Hands -- a simple method for hand identification and tracking
* DTbasic -- a Panda-independent demo of distance transform

Demos include:
* 8ball - a simple physics simulation of a billiards table
* Chakra - a viz connecting spring-loaded chakras to humans

COMING SOON:
* Islands - a visualization of Distance Transform into 3d objects
* Tank92 - a PlayMotionized tribute to Scorched Earth (1991)
* Maxwell - a visualization of electromagnetic forces using hands


Framework

This is the simplest of all the Sample Apps. In a nutshell, it:

    • sets up the PlayMotion services
    • creates a card in Panda3D on which to place the segmentation as a texture
    • maps the texture to the card
    • sets up a task in Panda3D to update the texture with each frame refresh

segmentation of PMSP signal on screen.

location:

filename: PMFramework.py
directory: C:\PlayMotion\PlaySDK\demos\Framework


Heads

Heads is a very basic demo which:

    • shows how one can call the Humans Service to find a head, and
    • maps the results to an on-screen text indicator beacon.



a red "H" indicates the headpoint as returned by the getHumans service. This segmentation image is delivered via a PMSP, a high performance machine vision input device available for purchase from PlayMotion.

location:

filename: pm-heads.py
directory: C:\playmotion\PlaySDK\demos\Heads

Heads-Advanced

Heads-Advanced takes the Heads demo and upgrades it with some internal logic:

    • shows how one can call the Humans Service to find heads
    • directly accesses the IplImage structure from within Python, then
    • utilizes an interpolation function to find a more accurate head reading, and
    • maps both results to on screen text indicator beacons.



a yellow "V" indicates the interpolated head, and a red "H" indicates the raw head as returned by the getHumans service. This segmentation image is delivered via a PMSP, a high performance machine vision input device available for purchase from PlayMotion.

location:

filename: PMHeadsAdvanced.py
directory: C:\playmotion\PlaySDK\demos\Heads

Hands

Hands shows how one can call the Humans service to see hands, and map those to an on screen render. It uses the same basic logic as the Heads demo, and adds support for multiple Human detections on-screen simultaneously.

location:

filename: PMhands.py
directory: C:\PlayMotion\PlaySDK\demos\Hands


8ball

8ball demonstrates how a very basic physics engine can have the player's virtual shadow inserted into the scene and uses the Edge Normals service to calculate collisions between virtual billiard balls and real world silhouettes.


8ball calculates rigid body physics collisions between segmentation edges of physical bodies, and the surfaces of the virtual billiard balls

any number of players can interact simultaneously with the virtual objects

location:

filename: PM8ball.py
directory: C:\PlayMotion\PlaySDK\demos\8ball\


Chakra

Chakra is an art piece which connects visualizations of the body's energetic centers with detected individuals in the PlaySpace using the Humans service. It is optimized for a rear projection environment, where the rendered chakra symbols act as a sort of 1:1 spiritual mirror for the humans playing in front of it. Two sets of images are available, depending on the directory set in the config.py file.

location:

filename: PMchakra.py
directory: C:\PlayMotion\PlaySDK\demos\Chakra


Tank92

VERSION 1.1 PREVIEW

Tank92 is a prototype videogame-style experience which takes advantage of the Moment Service as its primary form of player input.

filename: scorchmain.py
directory: C:\playmotion\PlaySDK\demos\Scorch\src\


Maxwell

VERSION 1.1 PREVIEW

Maxwell is a dynamic real-time visualization of electromagnetic forces, using variable charges attached to players' hands.

Features include:

    • a real-time visualization of Maxwell's equations
    • interpolation of hand movement
    • building / decaying charge magnitude based upon how long an individual charge is attached to / detached from an actively detected hand
    • keyboard configurable variables to adjust performance / resolution.

location:

filename: max.py
directory: C:\playmotion\PlaySDK\demos\Maxwell


Big

VERSION 1.1 PREVIEW

Big is a sample application demonstrating a floor firing environmental setup. It is inspired by the humongous toy keyboard installed in the floor of the toy store in the movie Big (1988).

Islands

VERSION 1.1 PREVIEW

Islands is an elegant demonstration which accesses the Distance Transform Service to morph the 2D camera input signal into a 3D model output in real-time.


the colorized distance transform image of a human in the scene

the Panda3D camera begins rotation to gain perspective on the scene

the final 3D island as derived from the original distance transform data


location:

filename: islands.py
directory: C:\playmotion\PlaySDK\demos\DistTran\src
author: Jordan Killpack


Building a Custom Filter Graph

Just as different games for Nintendo Wii or PC make different use of their respective controllers (Wii remote, classic controller, keyboard, mouse, joystick, etc), cameras and computer vision can be used in many different ways to generate input for your games. PlayMotion's SDK facilitates using cameras as input to video games.

Rarely is the stream of images generated by a camera actually of much use to a computer in its raw form. This section discusses how to build a custom filter graph pipeline in order to generate input signals that are actually useful to your specific application.

Filter graph basics

The purpose of PlayMotion's filter graph is to transform the source images into a signal that is meaningful to your program and/or PlayMotion's computer vision services. If you've used Photoshop, then you probably know a bit about image filtering. Filters are software processes that are applied to each pixel of an image to create a new image. Often, the resultant image bears strong resemblance to the original, merely extracting features or slicing the image into discrete segments. Other filters perform complex mathematical operations on the image, providing output that bears little or no resemblance to the original from a human's perspective.

The image processing filters can be thought of as little factories. Each factory takes a number of images (possibly zero) and some parameters as input and produces a number of images as output. Consider this cartoon:

A conveyor belt brings images (black fields with the letter P in white) in from the left. The filter inverts the color palette, producing black Ps on white fields rolling out on a conveyor belt to the right. These factories can be linked together, with the product of one factory feeding directly into another as materials. To accomplish this, the first filter in the chain is declared to be the "source" of the next filter. In the next cartoon

a factory creates white Ps on black fields out of thin air (notice that there is no input image stream). These images roll off the assembly line and move straight to the input of the "invert" filter described above. It is easy to imagine very long lines of factories moving images around to produce a final result.

Some filters take multiple images as input. This is analogous to factories needing more than one kind of good to build a product. Similarly, some filters have multiple images as output. We can extend the metaphor here to factories that create multiple products simultaneously.
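
The factory metaphor maps naturally onto chained Python generators. The sketch below is only an analogy for how filters feed one another; the filter names and the tiny hard-coded image are made up for this example and do not correspond to SiteXML filters.

```python
def letter_source():
    """A source filter: takes no input and produces images forever."""
    frame = [[0, 255, 0],
             [255, 255, 255]]
    while True:
        yield [row[:] for row in frame]  # hand out a fresh copy each time


def invert(source):
    """Invert the palette of every 8-bit image coming from the source."""
    for image in source:
        yield [[255 - v for v in row] for row in image]


def threshold(source, level):
    """Turn every image into a binary one: 1 at or above level, else 0."""
    for image in source:
        yield [[1 if v >= level else 0 for v in row] for row in image]


# Each filter names the previous one as its source, forming a chain.
pipeline = threshold(invert(letter_source()), level=128)
first = next(pipeline)
```

Nothing is computed until the end of the chain asks for an image, much like a conveyor belt that only moves when the last factory is ready for more.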

Segmentation Methods

Image segmentation is any process whereby an image is divided into parts (or segments). These parts need not be polygonal (arbitrary shapes are possible) or connected (two or more different parts of an image can be part of a single segment). Computer vision makes heavy use of image segmentation to add extra information to images, such as identifying body parts, grouping pixels of similar color, etc.

PlayMotion's segmentation methods identify only two segment classes. Hence these algorithms effectively transform an image from any arbitrary color depth into 1-bit color depth. Typically, we consider the zero ("false") values to be "non-input" or regions not explicitly influencing the system, while the non-zero ("true") values are the "interesting", user-generated pixels which we'll mostly focus on.

Identifying Full Figures

To identify and report still objects in the camera's view, we must be able to differentiate between these objects of interest and "uninteresting" background elements. PlayMotion's methods for such differentiation are part of a category of techniques called background subtraction. The result of background subtraction is a segmentation image where the identified figures' silhouettes are "1" and other pixels are 0.

Background subtraction is similar to a technique called chroma key, which is what Hollywood studios use when they film actors on greenscreen and digitally insert a background in the footage. The difference between the two ideas is that background subtraction does not require a monochromatic background of some exotic color.

The bgSubtract filter provides a background subtraction mechanism. This background subtraction method assumes that the background elements and the light on these elements are static and unmoving. A reference image is taken using the captureBackground() method (this function is bound to the ctrl-space key combination in the provided frameworks). This reference image is the "background" that bgSubtract uses to compute differences between the default scene and the current camera frame. Anything in the camera's view at the time of capture (people, furniture, visible light) becomes part of this background, so be sure that anything that will change is not in view during this calibration step. If the background changes, a new image can be saved as the background at any time.

Due to the rear-projection nature of PlayMotion's first party installations, background subtraction is our most common means of segmentation. As such, the following filter graph

<inputFilterGraph>
   <medianFilter>
      <source>
         <bgSubtract>
            <source ref="myCamera"/>
         </bgSubtract>
      </source>
   </medianFilter>
</inputFilterGraph>

is the default pipeline distributed with the SDK. It performs background subtraction and smooths the output to improve the quality of the segmentation image under adverse lighting conditions.

Frame Differencing (Identifying Motion)

Where background subtraction compares images from a video stream against a fixed image to identify changes, frame differencing compares successive frames of video against one another. The result is an image that is zero everywhere the two frames match and non-zero in the places where they differ. Here's a simple SiteXML example:

<absDiffFilter>
   <source ref="myCamera"/>
</absDiffFilter>

The image generated by the above filter graph does not bear strong resemblance to either of the original images in its non-zero parts. To segment this image you may apply a binary threshold, counting differences below some value as 0 and those equal to or above as 1. This can be done without an extra filter like so:

<absDiffFilter isThresholded="true" threshold="13">
   <source ref="myCamera"/>
</absDiffFilter>

The threshold is set to roughly 5% of the dynamic range of the image (13/255 ≈ 0.05, assuming an 8-bit stream) to reject signal noise in the video.

One strong advantage of frame differencing is its robustness against changes in lighting. Since each frame is compared against the frame immediately preceding it, turning a light on or off will affect just one frame in the result.
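
The difference in behavior under a lighting change can be seen in a toy comparison. The sketch below is plain Python written for illustration and reuses the threshold of 13 from the example above; it is not the code behind bgSubtract or absDiffFilter.

```python
def changed_mask(reference, frame, threshold=13):
    """1 where frame differs from reference by at least threshold, else 0."""
    return [[1 if abs(f - r) >= threshold else 0 for f, r in zip(fr, rr)]
            for fr, rr in zip(frame, reference)]


def count(mask):
    """Number of non-zero pixels in a binary mask."""
    return sum(map(sum, mask))


dark = [[20, 20], [20, 20]]
lit = [[90, 90], [90, 90]]
frames = [dark, dark, lit, lit, lit]  # a light is switched on at frame 2

# Background subtraction: every frame is compared with the first frame.
background = frames[0]
bg_counts = [count(changed_mask(background, f)) for f in frames[1:]]

# Frame differencing: every frame is compared with its predecessor.
diff_counts = [count(changed_mask(p, c)) for p, c in zip(frames, frames[1:])]
```

With a fixed reference image, every frame after the light comes on is flagged until a new background is captured, whereas frame differencing flags only the single frame in which the change happened.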

Intensity Threshold

The simplest segmentation method is a binary intensity threshold. Any pixels whose values exceed a given limit are the 1s and all others are 0. This method requires careful selection of the threshold level _and_ complete knowledge (and possibly control) of the lighting in the camera's view. Here's an example of the setup:

<thresholdFilter threshold="191" name="cutoff">
   <source ref="myCamera"/>
</thresholdFilter>

This scenario is best suited to dimly lit environments and players with some auxiliary device to add light to the scene (e.g. flashlights, LEDs, etc). Players shine their lights on objects or spaces, which places these things into the system's input. Also, this is the segmentation method used in the multitouch displays based on frustrated total internal reflection (FTIR).

When the threshold is chosen to be particularly low, noise in the camera signal can have significant detrimental effects on quality of the segmentation image. In this case it might be wise to add a medianFilter to remove some of this noise.

<medianFilter kernelSize="7">
   <source ref="cutoff"/>
</medianFilter>
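
To see why a median filter removes this kind of noise, consider the following pure-Python sketch. It assumes an odd kernel size and handles image borders by repeating the nearest edge pixel; both choices are ours and may differ from what medianFilter does internally.

```python
from statistics import median


def median_filter(image, kernel_size=3):
    """Replace each pixel with the median of its kernel_size x kernel_size window."""
    h, w = len(image), len(image[0])
    r = kernel_size // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Clamp coordinates so that windows at the border reuse edge pixels.
            window = [image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-r, r + 1)
                      for dx in range(-r, r + 1)]
            row.append(median(window))
        out.append(row)
    return out


# A single stray "on" pixel in an otherwise empty segmentation image.
noisy = [[0] * 5 for _ in range(5)]
noisy[2][2] = 1
cleaned = median_filter(noisy)
```

An isolated pixel is always outvoted by its neighbors, so it disappears, while large solid regions pass through essentially unchanged. Larger kernels remove larger specks but also round off fine details of a silhouette.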