053 iPhreaks Show - High Performance Computing with Colin Cornaby

The panelists talk about high performance computing with Colin Cornaby.

Transcript

CHUCK: Hey everybody and welcome to episode 53 of the iPhreaks Show. This week on our panel we have Andrew Madsen. ANDREW: Hi, from Salt Lake City. CHUCK: Jaim Zuber. JAIM: Hello, from Minneapolis. CHUCK: I'm Charles Max Wood from DevChat.tv, and we have a special guest this week, and that's Colin Cornaby. COLIN: Hi, from Portland. CHUCK: Do you wanna introduce yourself really quickly since you haven't been on the show before? COLIN: Sure, yeah. I'm an iOS software developer, currently. My background – I did shareware when I was younger and was in the Mac labs for a little bit, but my academic background is I actually went to school and worked on a lot of multi-core, GPU-based programming, and then I actually ended up working –. I work at a company called Digimarc right now. We do augmented reality sorts of stuff on iOS, so it's a nice dovetail of that stuff back into mobile, because I get to work on really computationally-intensive stuff on mobile devices and really optimize that out. CHUCK: And so if you wanna optimize something on mobile, you just write it in C, right? COLIN: Yes, that is totally –. No, that's not –. I mean, yes, it will get you a little bit, but no, that's not necessarily where you go. CHUCK: So you wanna give us a brief introduction on how to think about this stuff? COLIN: Sure. On mobile, you're a little bit more limited in your choices than on the desktop. I'll start with where you typically start on the desktop: on the desktop, the two paths that you kinda have right now are CPU-based optimizations and GPU-based optimizations. GPU-based optimizations are still kind of developing. Part of the problem when you're looking at GPU vs CPU is the GPU is typically on the other end of a PCI card, and you've got to transfer all this data over to the card, and then after your computations transfer the data back, and that can take some time. So a lot of people these days will kind of – the lay person will say there's a lot of [inaudible] surrounding GPU-based optimizations, and the path there isn't so clear. On CPU-based optimizations, the path is really clear; you've got multi-core, and you've got something called SIMD, which I can talk about – it stands for Single Instruction, Multiple Data – which kinda lets you work on big chunks of data. On iOS, the path is a little more clear because you don't have tools like OpenCL available to you, so you're likely looking at CPU-based optimizations. Also, iOS is on OpenGL ES 2.0 right now, but the 3.0 spec has some stuff in there that kind of relates back to OpenCL as well. But OpenCL is a framework that lets you take a GPU and run general computations on it to crunch your data instead. Again, you don't have that on iOS. I'm hoping maybe this year at WWDC they'll announce OpenCL for iOS, but you never know. JAIM: So maybe we can back up a little bit. Can you tell us a little bit about the differences between GPU and CPU optimization? Like, at a high level, what the difference is, and when you might wanna use one versus the other? COLIN: Sure. Well, it really comes down to how they were designed. CPUs, traditionally, started as single-core processors, then they moved up to dual-core and quad-core, but they were really designed for the way most of us program: they run a program in linear order, they go through step by step, and they run one thread of execution. Even with a multi-core machine – we've talked about that – you might have several threads of execution going.
GPUs were designed quite differently; a GPU is built for an entirely different purpose. It's built for getting pixels out to the screen, and you're going to have probably at least a few hundred thousand pixels, probably more, on your screen. So when they built the GPU, they built it to compute all these pixels – as many pixels as they could at the same time. So you've ended up with these GPUs that actually have a few thousand cores on them; they're very small cores – they're not very computationally complex, so you can't run big, complex programs on each core. But if you can split your data up into thousands of parts, you can actually run little programs on each bit of data, and the GPU might be a better option. I'm trying to give an example here. Say you have an image – [inaudible] GPUs are good at – and you want to adjust the color in the image. You might take the image, send it to the GPU, and then each little core on that GPU will take one of the pixels in the image, do the adjustment, and then send the image back. Those sorts of things, a GPU is good at. A CPU is typically better if you have larger chunks of data that can't be broken up into smaller chunks, and you need to run very intensive operations on that data. I'm trying to give another example here – something like unzipping a file. That's going to be larger chunks of data that's going to take a little bit more computation, so that's much better for the CPU to do. JAIM: So the CPU is more general-purpose, can handle a larger body of work, with more complex instruction sets; the GPU is going to have a bunch of different processors that are a little more focused – is that about right? COLIN: Yeah. I mean, it really comes back to: if you can't break your data down into small enough chunks for the GPU to operate on with really tiny operations, the CPU right now is still your best bet. There's this other cool technology – it's called SIMD. Single Instruction, Multiple Data is what it stands for; it's a complicated acronym. Basically, think about the case we talked about with GPUs, where maybe you have a lot of little, tiny bits of data, like an image with a bunch of different pixels or something, and you wanna do something like color correct the image, and you gotta take each one of those little pixel values and change it a little bit – there's an instruction set on CPUs that can actually help you with that as well, called SIMD. Each processor vendor has their own version of SIMD. Back in the day, when we were on PowerPC, they usually called this AltiVec or Velocity Engine, whichever [inaudible] it was. Intel calls it SSE; they have a newer version that has a newer name – I don't remember – but on ARM, where I spend most of my time, it's called NEON. It's a special instruction set where, normally, when you work with a processor, you're thinking about your code linearly – maybe you're going through an array and you're multiplying every value by two in that array. You're probably going to have a for loop where you loop on through and you multiply every value by two. SIMD is this really interesting instruction set where you can actually take entire chunks of your array, load them into the processor, and the processor is going to – with a single instruction – do whatever math you need to do on those: multiply them all by two, or add a number, something like that. Where you commonly see this used is, you know, if you're working on audio.
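[A quick illustration of the array-times-two case Colin is describing: with ARM's NEON, the "single instruction" operates on a chunk of four floats at a time. The intrinsics below are real NEON calls from arm_neon.h; the function itself, and its assumption that the length is a multiple of four, are just a sketch and not code from the episode.]

    #include <arm_neon.h>
    #include <stddef.h>

    /* Multiply every float in `data` by two, four values at a time.
       For brevity this assumes `count` is a multiple of four. */
    void scale_by_two_neon(float *data, size_t count) {
        for (size_t i = 0; i < count; i += 4) {
            float32x4_t chunk = vld1q_f32(data + i);   /* load four floats at once */
            chunk = vmulq_n_f32(chunk, 2.0f);          /* one instruction, four multiplies */
            vst1q_f32(data + i, chunk);                /* store the four results back */
        }
    }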
Maybe you need to adjust the volume of an audio clip; maybe you need to double the volume. When you double the volume of an audio clip, you're going to go through the audio clip and multiply the values in the audio clip by two. So you might use SIMD for something like that. Unfortunately, SIMD, traditionally, has only been available as assembly instructions. And worse than that, there are different assembly instructions for each processor type, because ARM, like I said, has their own NEON flavor, and PowerPC had AltiVec, and Intel's got SSE. So typically, if you're a Mac developer, you've got to write out the assembly for each platform. But Apple has done us all a favor and they have this framework called Accelerate, which everybody should go check out if you haven't checked it out. It's a great framework; it's written in C, so it's a little hairy. But it's this great framework where basically Apple's written the assembly code for each platform for you, and they've got a bunch of common functions you might need to do with this technology. To you they're C, but they've actually been written with proper, well-optimized back ends, and if you use Accelerate to do this stuff, it has DSP functionality, matrix math functionality – all that fun stuff you might need if you're doing really technical things. You can use this framework over all your processor architectures – over the Mac, iOS, everything – and you'll be able to take advantage of all these significantly faster functions. One place I was using this the other day is for image scaling – they've got a really good image scaling function in there. So if you're writing an app that does a lot of image scaling, you might find that the UIKit UIImage methods are actually a little slow. They have a really good scale routine – Accelerate is a C-based API, so it doesn't understand UIImage and NSData, so you need to do a little bit of translation work there, but once you've done that, it's got a really fast scaling routine and a really fast rotation routine that you can use to do a lot of work really fast, and then bring your image back into a UIImage and push it into your UI wherever you need it. That's an example of one thing that I used Accelerate for the other day. JAIM: Okay. So you've got image processing – I've also heard people use it for digital signal processing, when people are trying to do the Fast Fourier Transform, a lot of math stuff. What other things, that maybe aren't such a special interest, might a regular developer also find useful? COLIN: Yeah, the Fast Fourier Transform is a good example – it's a little bit more technical, but it's also there. I've seen a lot of people write that by hand and I'm like, "Well, it's in Accelerate; you don't have to write it by hand." But I think the most common example is if you have arrays of data and you need to just do a mathematical operation on that data – you need to multiply it all by two or something – I'm trying to think of a general example where you do that. But we've all run into that sort of situation where you get an array of numbers and you need to add one to them or subtract one, or maybe they're referenced off some index in an array and you need to correct for that sort of thing, so [crosstalk]. JAIM: [Inaudible] audio application where you're trying to create a volume control [crosstalk]. COLIN: Yeah, in an audio application – that sort of thing.
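[The same operation through Accelerate's vDSP half, the portable path Colin recommends: vDSP_vsmul is a real Accelerate routine that picks the right SIMD back end for each architecture; the buffer names and the separate output buffer are illustrative only.]

    #include <Accelerate/Accelerate.h>

    /* Double the volume of an audio buffer, the case Colin describes.
       vDSP_vsmul computes out[i] = in[i] * gain using whatever SIMD
       back end (NEON, SSE, ...) fits the architecture you built for. */
    void double_volume(const float *in, float *out, vDSP_Length count) {
        const float gain = 2.0f;
        vDSP_vsmul(in, 1, &gain, out, 1, count);
    }

[The image scaling and rotation routines he mentions live in the framework's vImage half – vImageScale_ARGB8888 and friends – and they take the same kind of plain C buffers, which is where the UIImage translation work comes in.]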
It really is: if you've got arrays of numbers and you do any math to each value in the array – which could be a lot of situations – it might be an API to look at. The downside of Accelerate is it's very picky about your number formats. If you use the raw assembly instructions you can get away with a little more, but Accelerate, [inaudible] generalized, can be very picky about what formats of numbers you bring in. So some functions will only work with floats, some functions will only work with ints, and you may have to convert your data between different [inaudible] of data formats, so that is one downside. It's not necessarily a slam dunk in all situations, but if you have an operation you do over an array of data, you might take a look at it, and if your data format matches cleanly with functions already in Accelerate, it could be a really viable option for an application. Image data is also predictably picky. I had an image come back from the camera that was in this special format, and Accelerate only wanted RGB data, which makes sense for a lot of people, but you do have those corner cases where image data may not fit cleanly into Accelerate. You've got a little bit of overhead to convert the image into a format that Accelerate wants. JAIM: We need an Accelerate for Accelerate, I think. COLIN: Yeah. Another set of APIs to get everything into the right data format for Accelerate – yeah, that would be great. So, I also do Android programming on the side. If this technology works for your application, we're really lucky to have it, because on Android and on other platforms, there isn't very much like Accelerate. It's kinda this toy box that's been hidden off in the corner. There's a thick layer of dust over it, everyone's kinda forgotten about it, but it's over there, and if you open it there are tons of great toys in there to make your application a little bit faster – but no other platform really has this toy box. There are a few third-party APIs for Android that are kinda starting to try to build something like Accelerate, but really it's kinda this hidden treasure trove of functionality that not many people know about; but it's one of those things that, when people know about it, makes iOS a really great platform. JAIM: Yeah, definitely. I met a developer who was working on a piano tutor application, and when he started working on that application, I don't think Accelerate was available yet, so he ended up writing all of his routines by hand using the known algorithms. But you don't need to do that, so it's very cool, very powerful to have it all there [inaudible]. COLIN: I mean, not only does it make your application faster, but it saves you a lot of time, because, like I said with the FFTs – the Fast Fourier Transforms – you can write your own. You can go on StackOverflow and there are a lot of threads on "here's the algorithm you should use; you can write it in C." Not only are you going to have to write it yourself, but it's going to be slow because it's going to be in plain C. Meanwhile, there are these nice built-in methods for doing them really fast using these special instruction sets, so. ANDREW: I'm curious if you can talk a little bit about how OpenCL sort of fits into the whole picture. I have this vague idea that OpenCL is sort of an API where you can write – I guess it's really sort of almost like a language, even though it's C-like, is that right? But you can write stuff and then it can run on the GPU, but it may fall back to the CPU depending on the hardware you're on?
You just don't have to worry about that – but I don't really understand what it can be used for and in what ways it compares to Accelerate and differs. So could you talk about OpenCL? I'm certainly interested in that. COLIN: I'll back up a bit. There are two competing standards out there right now – there's CUDA and OpenCL. Both are kind of efforts to –. CUDA is a framework and a language that are specific to Nvidia GPUs. And it all kinda started with CUDA. CUDA was actually where I did my first work on GPU sorts of programming. Basically, what you're doing is you're writing a little program; if you've worked with OpenGL shaders – I don't know if anyone here has – but if you have, they're kind of a similar concept, where you're writing a little tiny program that's going to get uploaded to the GPU, and it's going to get run on each of the little processors in the GPU. Which, again, these days is thousands of processors. You can take this little, tiny app and upload it to the GPU, and put it on these thousands of little processors, and then you load data into the GPU and these thousands of little processors are going to run on your data with the little app you've uploaded, and basically chew on your data and send it back to you – it's going to alter your data in whatever way the program you've written wants the data altered. CUDA was Nvidia's effort to do this with their GPUs. They had this framework and it was specific to them, and it helped them sell a lot of Nvidia GPUs. So Apple, with several other companies, came along and they said, "That's great, but we have a lot of different kinds of GPUs we're shipping and we don't want to have this technology specific to Nvidia GPUs." And back then, they shipped a lot of GPUs that actually weren't very good as well – there were a lot of not-very-great Intel integrated graphics chips they were using – so they really wanted something that would run really well on the CPU as well. If you wrote something in CUDA, typically you had to write it once in CUDA, and then write it once in C or C++, so you could actually run the same functionality on machines that did not have Nvidia hardware. So having the CPU fallback was very important to these companies, because if a machine didn't have the right hardware, they didn't want the program to just stop working. You had to build these apps around machines that didn't have any GPU at all that was reasonable. So everybody got together – AMD, Intel, all of them – and they came up with this new standard called OpenCL. The idea behind OpenCL is you're going to write – it's a very similar thing where you write these tiny programs, sometimes called kernels, and they get uploaded to the GPU. The OpenCL language is a little different. The other big difference with OpenCL is the little programs you're writing are actually getting compiled when your application's running. Typically, when you write an application, you compile it in Xcode before you ship the application. Under the OpenCL model, typically what's happening is your app is running on the user's computer, and because you don't know the kind of hardware they have ahead of time, their computer is going to take it and say, "Okay, I have an AMD GPU," or "I don't have any really decent GPU at all – I've just got an Intel CPU," and it's going to take your program on the fly and compile it for the right architecture. JAIM: Kinda like a just-in-time type thing. COLIN: Totally just-in-time. Yeah, that's exactly what it is. JAIM: Is that even available on iOS?
COLIN: It is not – well –. So, I had an interesting experience the other day where – no, publicly, it's not available on iOS – but I was doing some Core Image work the other day, and apparently put Core Image in a really weird situation. And Core Image started spitting out OpenCL errors at me, so I was like, "Hmm. That's interesting." So it seems to be there; we don't have access to it, but Apple seems to be using it for the Core Image back end. And definitely, the processor architectures Apple's using have advertised OpenCL support, so the hardware they're using is supposed to support OpenCL. The framework is definitely there on iOS; it's just not publicly available. My guess is, because OpenCL is, in its nature, a just-in-time compiler, Apple may not feel great about having just-in-time compilation in apps on iOS devices. You kinda see that in WebKit, where WebKit has a just-in-time version of JavaScript, and iOS applications don't get access to that either. I think Apple's very worried about security concerns around just-in-time compilers. I did see there's a new version of OpenCL which, I believe –. Because, again, OpenCL is not on iOS, I don't do any of my [inaudible] work in OpenCL, but I saw there was a new version of OpenCL that's supposed to let you compile to some mid-level language, and then it will just-in-time compile from that. Maybe that'll make Apple feel better, I'm not sure. We may never see OpenCL on iOS. I hope it happens, but if Apple's worried about the just-in-time nature of OpenCL, they just may never ship it. ANDREW: Oh, and OpenCL is not the first or only thing that's on OS X and also on iOS but private on iOS. I think XPC is also on iOS and yet not exposed publicly. So they don't seem to really have a problem with keeping certain things for themselves, but I hope we see it someday too. COLIN: Yeah. JAIM: There's nothing stopping people – the just-in-time thing is, they don't want apps in the App Store shipping with that, but if you're doing something like enterprise or doing something outside the App Store, you have a little bit more flexibility with that, because they're not checking that type of stuff. COLIN: Yeah. There was actually a really interesting development that happened around people trying to get around this – around OpenCL not being there. So I mentioned earlier that OpenGL has this concept called shaders, and shaders are really a similar concept, where you're writing a little program, but instead of being designed to operate on raw data, this little program is designed to operate on pixels. But people said, "Gee, we've got OpenGL and shaders on iOS, and I could probably represent the data I need to compute as pixels. I could – not really trick OpenGL, but misrepresent what OpenGL is actually working on. I can take my general data, pretend they're pixels, shove them into OpenGL, and write a shader to actually do computations on my data." And actually, I think there are a few example projects out there – I can probably go and find one – where a few people did it and demonstrated that this works. You can do the same sort of stuff OpenCL does, but write it as OpenGL and pretend that your data is pixels. The current version of OpenGL ES on iOS is 2.0 – is that 2.0? No, it's 3.0 – 3.0 is the current version; the new version is 3.1. I always get my OpenGL numbers confused.
So 3.0 is the current one on iOS; 3.1 actually includes this bullet item called compute shaders. So it looks like what they're doing is they're taking the hack that people were using on iOS – of pretending their data was pixels and throwing it onto the GPU and having the GPU work on it – and they're actually formalizing that hack and saying, "we're also going to have just a general computing component of the OpenGL ES standard." On one hand, that's great, because that means if Apple adopts OpenGL ES 3.1, we can do general computing on the GPU just fine. On the other hand, that makes me sad because it's yet another standard, and I'd rather just write my code once and have it work everywhere, and OpenCL seems to be the best way to do that. It still makes me sad that there's no OpenCL on iOS, because I'd rather be using that instead of some OpenGL ES extension. CHUCK: So do you have examples of things that you've optimized using these techniques? Like, specific programs that you've worked on? COLIN: Yeah, well, most of my work has been on iOS recently, and again, there's no OpenCL support on iOS. Back on the Mac – I used to work at a company that did a lot of Final Cut plugins, After Effects plugins, that sort of thing – and that's really where you wanna start taking advantage of that stuff. Back at that time, GPUs were still a new thing, but they were experimenting with GPU-based computations, so there was a lot of, again, image adjustments – that sort of thing. GPUs are really designed for that – that was the basis of their design: to take images or 3D graphics and do a bunch of pixel computations at once. So typically, that's where you see a lot of GPU work going right now. I've seen a few projects that are just starting to get into, well, you know, let's take audio buffers and put them on the GPU and see if we can change them or adjust them or do effects. But still, most of the time you see this sort of work done, it's on images. There are still a few other projects out there that are trying to use it for other purposes; there are a few of those, like the Folding@home client, I think, might have an OpenCL version. There are a few of those sorts of projects out there that are trying to tap into GPUs. Bitcoin mining, apparently, is big on GPUs. I don't know much about that; I don't know much about Bitcoin in general, but I've heard a lot about Bitcoin miners using GPUs. It's probably OpenCL, I'm not sure, but I'm guessing that's going to be OpenCL they're using. JAIM: I think at this point, the only way to make money mining coins is to use some kind of GPU-based solution. COLIN: Right. JAIM: Because the CPU-based ones are so processor-intensive, it takes so much energy to mine that you don't actually make any money. But if you have a really serious GPU setup, you can actually efficiently mine the coins and do the transactions. COLIN: Right. ANDREW: I think it's actually gotten even worse than that. Now, people are using custom ASICs that are designed just for Bitcoin mining, which seems completely crazy to me – that it got big enough that it was worth it to design custom silicon to mine these things – but that seems to be what people are doing now. JAIM: So I've been trying to mine Bitcoins with a Beowulf cluster of iPhone 4Ss – am I doing that right? Is that going to work? ANDREW: I think it would have worked three or four years ago, but not anymore. JAIM: Oh well. Alright. Back to the drawing board.
[Chuckling] COLIN: So I think one reason you don't see GPUs used more often as a general technique is because, as I mentioned earlier in the podcast, when you're talking about CPU optimizations, your data and the CPU are very close. You've got your RAM, and you've got your CPU, and it's very easy to take your data out of RAM, put it into the CPU, do all the work you need, and then store it back into RAM. Typically, a lot of GPUs, especially on desktops, are on some sort of PCI card, so if you just look at the raw mechanics of getting your program and then your data into the GPU, you've got to pull it out of RAM and then send it over the PCI bus, into the card, which takes a lot of time – it's not a lot of time, but in computer time it takes a lot of time – to get it over the PCI bus. And then you're going to run your program on the graphics card, and then your graphics card typically has to send the data back. Now, if you're working on graphics, which, again, people typically are, you don't have to send the data back, because the next stage for that data is actually going to be the screen, so that makes things very easy on you. But if you're trying to do general work, you've got to send the data back to the CPU because you're going to need to save it in RAM or save it to disk or something. So that's the other reason you don't typically see the GPU workflow used as much as CPU optimizations: you've got to do all this work of sending your data up to the GPU and getting it back, which takes time. That is something that is changing, and actually that's changing courtesy of integrated graphics, where your GPU is not on a PCI card at some other end of the system; your GPU's actually right next to the CPU, on the same chip. I did a research project back in school; I had one of those little MacBook Pros that had the integrated 9400M card and the discrete 9600M card – I think it was the first time that Apple had two graphics cards in a laptop. I don't remember how much RAM was in those machines, but you couldn't use both cards at the same time for OpenGL; you could use both cards at the same time for CUDA, though. So I had an experiment where I had a set of data, and I split it in half, and I sent one set of data to the 9400 card and one set of data to the 9600 card, and I figured this would be a great experiment because the 9600 would be faster, because it's a faster GPU. Surprisingly, the 9400, which is the integrated GPU, which I thought would be not great, was actually the faster GPU. The reason is because there's much less effort in loading that data onto something so close to the CPU. So you look at your machines – you look at the benchmark scores – something like the Iris Pro, which is the new integrated GPU Apple's using in their machines: the Iris Pro typically has better OpenCL benchmark scores than the Nvidia discrete cards that the new MacBook Pros come with. Part of that is because Nvidia is not well-optimized in their OpenCL right now, so part of it is just Nvidia being bad at OpenCL right now. The other part of it is because the work you have to do to do OpenCL with an integrated processor, or an integrated GPU, is much less. And you're starting to see that in game consoles as well; game consoles have moved to integrated GPUs as opposed to discrete GPUs not just for cost reasons, but because there's actually a lot of potential performance there if you bring your GPU closer to the CPU.
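[To make the OpenCL model described above concrete, here is a compressed sketch of a host program: the kernel ships as source text and is compiled at run time for whatever device the machine turns out to have, and the data makes the round trip over the bus that Colin talks about. Error handling is omitted and the names and sizes are illustrative; this is a generic example, not code from the episode.]

    #include <OpenCL/opencl.h>   /* <CL/cl.h> on non-Apple platforms */
    #include <stdio.h>

    /* The "tiny program" (kernel): each of the GPU's many small cores
       runs this on one element of the buffer. */
    static const char *kSource =
        "__kernel void scale_by_two(__global float *data) {\n"
        "    size_t i = get_global_id(0);\n"
        "    data[i] = data[i] * 2.0f;\n"
        "}\n";

    int main(void) {
        float data[4096];
        for (int i = 0; i < 4096; i++) data[i] = (float)i;

        /* Grab whatever GPU is available. */
        cl_device_id device;
        clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

        /* The just-in-time step: the kernel source is compiled here, on the
           user's machine, for whatever hardware was actually found. */
        cl_program program = clCreateProgramWithSource(ctx, 1, &kSource, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "scale_by_two", NULL);

        /* Copy the data up to the device, run one work-item per element,
           then copy the results back over the bus - the round trip that
           makes small jobs not worth sending to the GPU. */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        size_t global = 4096;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

        printf("data[1] is now %f\n", data[1]);   /* prints 2.0 */
        return 0;
    }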
JAIM: So Colin, I wonder if you could walk us through what the steps are to actually test two different cards or chips? How would you do that? Are you writing this in C? What's the process for that? COLIN: Typically, what you do is you have a switch. What I did back in the day was you'd just have the two cards, and a file somewhere that would say, gee, I wanna test on this card and on this card, maybe at about the same time. These days, typically what you do is you have an automated testing infrastructure. For example, at work we do a lot of – not GPU stuff, because it's not available on iOS – but we do a lot of SIMD, Accelerate, raw NEON assembly, that sort of thing, and we're very curious about how that performs over the different Apple architectures. So we'll typically do automated testing through Xcode CI, where we'll say, first, we wanna verify that these routines are actually running fast on each processor architecture, and [inaudible] when one of the processor architectures runs slow. The second important thing when you're writing all of this computationally-intensive code is to verify your results are actually correct. It may be great that you've written this highly-optimized, fast routine, but if the numbers coming out the other end aren't correct, it doesn't buy you very much. You've gotten the wrong answer very fast as opposed to the right answer very slowly, so we also have rigs that will verify correctness and that sort of thing. JAIM: That's very cool. So that's a different setup for Xcode CI than what we usually hear about. We usually hear about how to write a unit test, but you're using it to actually run tests on different hardware. How does that work? Or how is that working for you? COLIN: It's working pretty well. There's a debate in the continuous integration sphere on iOS currently; it seems to be between Jenkins and Xcode CI, and we're still using Xcode CI. Xcode CI still has some minor problems, where every once in a while I'll come in in the morning and the testing rig will have stopped on a test; I'll just unplug the device and plug it back in, and then it'll keep going on its way. But the nice thing about Xcode CI is if you're in this sort of situation where you need to –. At work, we have a larger office, so we have a core technologies team and we have an applications team, and I'm on the applications side, so I'm doing the higher-level optimizations, you might say – I've got these two SDKs, and I do the multi-threading or something on top of them. But on the lower end, they're writing all sorts of assembly and raw stuff, and I may need to get feedback to them that, gee, your assembly's running great on this 32-bit machine, but on this 64-bit, new [inaudible] iPhone, it's not working great. And the great thing about Xcode CI is that it gives you very clear reports, and they're web-based as well. It has an Xcode integration component, but there's a web-based component as well, so you can just grab a link from your web browser, send it over to the other team and go, "Hey, look at this test; [inaudible] have failed because probably something's wrong with this algorithm, or it's not performing very fast." In Xcode CI, there's no timing data as part of the reports, but you can also pull the raw logs out of Xcode CI – the raw testing logs from your unit tests – and they'll actually have the timing data in there.
We do have certain test rigs set to time out after a certain amount of time, so we'll say this test should take no longer than 20 seconds, and if it does, something is severely wrong, so we're going to just fail the test after 20 seconds. We do stuff like that, but correctness is also a big part of what we do in our –. The background on what we do is we basically take images – or packaging is our new thing – and we can embed something like an invisible QR code right on top, where we're taking the image and we're very quietly changing the values of certain pixels in the image. So we change these pixel values very subtly, so that the human eye can't see them, and we can embed a pattern in the image that your phone can see, or a checkout machine could see in the case of packaging – so it's basically like an invisible barcode right on top of your image. So for high-performance computing, the easy connection there is we have to embed these images very quickly and we have to read these images very quickly. We have algorithms that are going over these images trying to put the signal in, and then we have algorithms that are looking at the image trying to take the signal back out. So in that case, you have a very clear set of testing – you've got these images with the embedded signal for reading, so you can push them onto your server, and you can run the algorithms on these phones that are slaved to your testing server, and it can come back out and say, "okay, on this phone I was able to find a pattern, and on this phone I was able to find a pattern; on this one I wasn't able to find a pattern." The other great thing is not only do we have this Xcode CI server, but we have a robot hand downstairs, which is great. It's in a room somewhere; I got to play with it when they first bought it. But the other important thing is that when you're trying to read these images, if you're scanning them in with a flatbed scanner, that's easier – you'll only be looking at them from one perspective. But if you've got a phone, you'll be holding the phone at all sorts of different angles all over this image, trying to read it, and so not only do we have to evaluate for correctness in all these algorithms, but we have to evaluate for how hardened these algorithms are. If we look at it under low light, if we look at it under bright light, wrong angle, right angle – do these algorithms still work? This assembly we have written in SIMD and NEON and all this other stuff – [inaudible] wasn't complicated enough to optimize it, you've got to make it a really aggressive algorithm that will find its pattern. So we have this robot hand downstairs, and the robot hand has an iPhone in its hand, and the robot hand just goes to all sorts of different angles and moves the phone, trying to read the image. That's a more complicated automated test that's run, where it's actually got an app on the phone that's watching the results, sending them back over the network to a server, and then the server is collecting the results and may end up with a heat map – I think it's a heat map organized by angle and light intensity and stuff like that. It gives us a nice spread of, okay, this new algorithm you wrote reads less well than the old algorithm, or it's better than the old algorithm. So not only are we in a situation where we have to write these math algorithms, but they have to work under all sorts of obscure circumstances. JAIM: That's like CI to the extreme. I'm glad you're having success with the Xcode version of that.
I've heard different reports, but what you guys are doing is pretty cool – something's pretty cool with it. COLIN: Yeah. I have two big problems right now with Xcode CI, and I will still choose Xcode CI if I have to choose at the end of the day, but my two big problems – again, as I mentioned – every so often the unit tests will just stop. And the server's fine; I mean, the server's not testing on any new devices, but the server itself isn't locked up or anything, so usually I'll come in in the morning, [inaudible] get stuck and I'll just unplug it and the server will continue testing. Of course [crosstalk] JAIM: Of course, it never happens with Jenkins, you know. COLIN: Yeah, no, we do have some Jenkins rigs deployed. So we have our lower-level tests that test our algorithms through unit tests and things like that. Because we ship applications on the store, we have higher-level tests that are validating strings and buttons and transitions – that sort of thing. And Apple has two testing technologies for that; they've got the lower-level unit tests, and they've got this higher-level thing called UI Automation, which is really great. Jonathan Penn – has he been a guest on this podcast? CHUCK: Yup. COLIN: Okay, yeah. Jonathan Penn works on that, and it's really great. We have a QA engineer who writes all these tests; they're all written in JavaScript, and it's basically an automated person using the application who'll go through and click buttons, and it'll verify [inaudible] on your screen. Great technology – not supported at all by Xcode CI; I don't know why. We've complained to Apple about it before; Xcode CI only does unit tests. And they're both Apple technologies, so, you know – UI Automation is part of the Instruments package and unit tests are part of the Xcode package – so we have to run those outside of Xcode CI, so we're working on a secondary service to do that. But as you've kinda hinted at, Xcode CI is the only testing tool that has been officially sanctioned by Apple to load apps onto devices and test. So Jenkins has its own flavor of weirdness where it's using – I always forget the name; I think it's called fruitstrap? It's a tool called fruitstrap, which is basically someone's attempt to reverse-engineer how loading apps onto an iOS device works. It's this tool that Jenkins uses to load apps on devices and test, and we've had varying degrees of success with fruitstrap. Jenkins comes with its own issues; it's kind of a shame, because there is no 100% bulletproof testing framework for iOS right now. I wish there was, but with Xcode CI, you have cleaner reports; your data's formatted a little nicer; the front end's a little nicer. It's actually sanctioned by Apple to load apps on devices, so I'd still probably go with Xcode CI. But yeah, Jenkins and fruitstrap are out there as well. JAIM: So one of the things you talked about earlier is GPU optimization and breaking down your problems so they can be solved by a GPU. How do you do that? I've got a problem set I wanna solve, and I think maybe I can get performance benefits from going the GPU route. How do you start breaking down your problem? Do you use Objective-C, and does it help you with that? How do you do it? COLIN: There isn't much at all in Objective-C for these sorts of performance improvements right now; they're really all C-based, so that's the first part of breaking down your data, typically –.
We're talking about running these operations over arrays. An NSArray isn't going to cut it; you've gotta have it in a pure C array. Aside from that, usually you can hope that your data is kinda naturally broken down into small parts you can work on, and that's really why stuff like multi-threading on CPUs is alive and well, because there just are some problems that you can't break down. Again, typically the use cases are: if you have a piece of data that's just naturally in a bunch of small pieces – images are still the go-to example; you've got a bunch of pixels. Audio's another example; you might have an audio file full of audio samples, and you want to take all these audio samples and do something to them. I mean, there are a lot of cases where you have arrays of data, indexes, that sort of thing; maybe you're doing something like an Excel worksheet does, where you've got just a table of data and you do a bunch of operations on the table of data. It probably should be a big table – big data – if you're going to be sending it over to a GPU, because you're going to [inaudible] and then send it over to the GPU. But if you have a really big table of data, it might be worth it. I haven't really talked about it, but there is, of course, still traditional CPU multi-threading available to you. A CPU doesn't have thousands of cores, but these days, if you're on iOS there are two cores; if you're on the Mac, you have anywhere from 4 to 12, on down to two cores, to work with. So if you can't break your data into thousands of little chunks, you might just say, I'm going to break it into four or five chunks, and that makes a lot more sense for me – and you also don't have the overhead of sending your data up to the GPU. The good news is, in that case, there are Objective-C ways of doing that available to you. There's also Grand Central Dispatch, which is a C API, but it's still pretty close to the ease of use of Objective-C. Grand Central Dispatch is another great API that is on iOS that, if people haven't checked it out, they should check out. It basically is intelligently aware of how many cores are on the machine you're on, and it will – you give it a problem set, and you say –. It's really great because there's stuff that's just a drop-in replacement for for loops. If you have a for loop in your code that could potentially run a while – maybe it goes up to a few dozen iterations, hundreds of iterations, thousands of iterations – Grand Central Dispatch has a drop-in for-loop replacement function for you that will say: take the contents of this for loop, and instead of running it on one core over and over again, break it up for me and run it over all the cores on this processor at the same time. So Grand Central Dispatch is, again, something you have on the Mac and iOS, and it's very handy if you have a problem set that doesn't necessarily break itself up into small enough pieces to [inaudible] use the GPU for, but is breakable into chunks that you could split over a CPU. JAIM: Do you have to break up the chunks explicitly, or does it kinda guess for you? And how can you provide more hints? COLIN: So at least when I've worked with GPUs, both GPUs and CPUs will split it out for you. The one exception with GPUs is if you bring multiple GPUs into the equation; typically, if you wanted to use more than one GPU at a time, you've gotta do the splitting yourself.
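[The drop-in for-loop replacement Colin mentions is Grand Central Dispatch's dispatch_apply. A minimal sketch – the array and the work inside the loop are illustrative:]

    #include <dispatch/dispatch.h>
    #include <stddef.h>

    /* Parallel version of: for (size_t i = 0; i < count; i++) data[i] *= 2.0f;
       dispatch_apply fans the iterations out across the cores GCD knows
       about and blocks until they have all finished. */
    void scale_by_two_parallel(float *data, size_t count) {
        dispatch_queue_t queue =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        dispatch_apply(count, queue, ^(size_t i) {
            data[i] *= 2.0f;
        });
    }

[In practice you would usually hand each block a stride of several elements rather than a single index, so the per-block overhead doesn't swamp the work – which is the chunk-size question that comes up next.]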
The GPUs aren't going to talk to each other; you've got to actually take your data, break it into chunks, and send it to each GPU. That's a rabbit hole you can go down, though – multiple GPUs with OpenCL. But for both a single GPU with OpenCL and Grand Central Dispatch, it will break it up into chunks for you, and the great thing again about Grand Central Dispatch is, based on how many cores you've got on the hardware, or other conditions on the system, it will choose the best chunk size for you. So it's going to do all that optimization; you don't have to worry about, "do I wanna do two or three chunks of data inside this for loop, or do I just wanna have every single loop iteration be its own piece of work?" It does all that for you. So definitely, if you're a beginner and you're just starting to get interested in these technologies – Grand Central Dispatch, if you're not already using it, which, if you're an iOS developer, you really should check it out and work it into your workflow. Grand Central Dispatch is probably the first solution I'd jump to; those will be the easiest, lowest-level optimizations to make. And then, really, at that point it becomes a choice between Accelerate and the GPU. And again, if you're on iOS, that's an easy choice because there is no GPU computing. So if you're on iOS, Accelerate is your option. If you're on the Mac – and it's been a while since I've looked at benchmarks and what the automated chunk size is – but if you can break your data into a few thousand chunks at least, the GPU might be a better option for you. JAIM: Very cool. So we've talked about problems in high-performance computing; one of them is parallelization, being able to split up your algorithms; we've talked about keeping things closer on the actual hardware, so moving data back and forth is less time-consuming – are there any other avenues that we haven't talked about? COLIN: Yeah, those are really the two key problems of parallelization. I think there's another problem we haven't quite hit yet but that will hit us in the future, which is that one of the reasons we're looking at parallel processing – one of the reasons that developers are looking at parallel processing – is because if you take a task and you put it on one core of a CPU, and it maxes out that CPU core – so you're taking the CPU to 100% on one core – the energy drain for doing that is quite substantial. What's commonly said is that energy usage, as you use more of the CPU, is exponential: basically, on a very simple level – these may not be the exact numbers – but on a very simple level, if you get two times the performance, you're going to get four times the energy usage and heat output. So if you're ramping up a single core, you're going to be dramatically increasing your energy output, so it's better to split a task over two cores running at 50% than it is to max out a single core. So that's typically why we see these models; it's because Intel – Intel, or Apple, or whoever's making the processor – is able to get a lot of power savings by, instead of making these CPUs clock higher and higher, just adding more cores, because two cores are only going to use double the energy of one core. You get back to a linear relationship instead of an exponential relationship. Now the problem is, in order to split your code up to run over these different cores, you have to write more code.
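[For the arithmetic behind that claim: a common back-of-envelope model – an illustration, not a figure from the episode – is that a core's dynamic power scales roughly as P ≈ α · C · V² · f, where f is the clock frequency, V the supply voltage, and α · C a constant for the chip. Raising f generally forces V up too, so doubling single-core speed costs well more than double the power – roughly the "two times performance, four times energy" figure – while two cores at the original clock and voltage draw about twice the power of one, which is the linear relationship Colin describes.]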
Either you're writing the code, or Apple's written the code in Grand Central Dispatch to do that for you, so you've got to use GCD to do the multi-core optimizations to make everything run faster. At a certain point, sometime in the future, it's thought that we're going to add so many cores that the cost of coordinating all these cores is going to be more than the gain you get back. So that's definitely one thing to watch out for today: are you spending more performance on trying to get your data into these chunks, to get them onto all these different cores, than it would cost to just run the computation on one core? And the thought is that eventually, if we keep adding more and more cores, we'll hit a limit where the overhead of managing these cores is just too much. I think back when I was in school – and this number may have changed – I think the number was somewhere between 80 and 120 cores is where we're going to max out. That number may have changed; I may be totally out of date, but there was a number that was like, once we get to this level, we see no way we can manage this number of cores. Now, GPUs can do it, because GPUs aren't quite as tightly managed – [inaudible] you send your data up, and you've got a bunch of discrete cores all doing their thing – but definitely with CPUs that was the worry. Portland, where I'm from, is Intel-town. We have a lot of Intel engineers thinking about these problems, and they would come and talk to us in school about everything they were dealing with. The other classical problem that Intel was dealing with, and that they talked a lot about at that time – but they've kinda solved it for now – is when you have all these cores and they're all reading from memory, they're all going over the same RAM connection. So at a certain point, you start to max out, because not all these cores are going to have the bandwidth to talk to memory at once. And for Intel, that was a real problem with the Core 2 Quad – I think that was the last time there was a problem there. It was one of their first quad-core processors, and even though you had four cores, you didn't have enough bandwidth to actually pull stuff out of RAM fast enough to feed all four cores. If you look at more modern machines, they've used multiple memory channels; I think the Mac Pro has four memory channels that you can use all at once. I think it's four; it's about three or four – maybe three. But you start seeing these architectures where they're really increasing the amount of bandwidth to memory, because as you're trying to feed all these cores, you've got to have the connection to RAM to actually pull your data out and process it. It's just another thing to watch out for: depending on what era of machine you're on, you might be looking at trying to max out these cores but not having enough bandwidth to actually get your data into the cores. I've never run into that on an iOS device. Again, iOS devices are only two cores, so you're not really slamming memory as hard as you could be on a four-core device. But it's something to keep an eye on as processor architectures evolve: as we add more cores, the RAM on the machine has really got to keep up, and if it doesn't, you're not going to be able to get the performance improvements you probably want. CHUCK: Awesome. Is there anything else that we should go over before we get into the picks? COLIN: No, I think that's it. CHUCK: Alright. JAIM: I think my brain is full, so I think we're good. [Chuckling] CHUCK: You have to loosen your mental belt.
COLIN: Yeah, it's a lot of information. ANDREW: Yeah, I feel like this has been really a lot of good stuff and I've learned a lot today, so thank you. CHUCK: Awesome. Alright, well Andrew, why don't you start us off with picks then? ANDREW: Okay, so I've got a few picks. One of them I kinda ripped off from Colin, actually. We were talking before we started recording and he got himself set up with Shush, and I'd never heard of it before, so I got it. It's pretty cool; it's just a little menu bar app on the Mac App Store that gives you a push-to-talk hotkey for your Mac, and it's nice for something like this where we're recording but wanna be able to mute ourselves. Anyway, it's like $3 on the App Store. And then the next pick is kind of a competing podcast, so I'm a little worried about it – but no, I'm just kidding. It's the Core Intuition jobs board, and this is a jobs board that was put up by Manton Reece and Daniel Jalkut. They have a Mac and iOS development podcast together, and I think it's quickly gotten pretty popular. Apple's posting jobs there, and a bunch of other people – a bunch of other kinda big, well-known companies – are posting jobs there, and it seems like they have a high-quality audience, so it might be a good place to look if you're trying to find a job or if you're trying to find someone to hire. And it actually leads me to my next pick, which is that we actually have a job posting. Mixed In Key has a job posting there right now, so we're hiring a Mac developer. You get to work with me; I don't know if that's a plus or a minus, but we work on some really cool stuff, it's a great work environment, and it's the best job I've ever had. So if you're looking for a job, or just looking for a change, check it out. Those are my picks. JAIM: You do music and audio software, right? ANDREW: Yeah, we write apps for DJs, so there's a lot of audio processing, a lot of custom UI. We've got a fair amount of MIDI code at this point for interfacing with external MIDI devices, and just a lot of really cool, fun stuff that people enjoy. People use our apps because they like them, not because they have to, so it's fun to work on. CHUCK: Cool. Jaim, what are your picks? JAIM: I've got one pick. A few weeks ago, I went to the record store and picked out some records I was going to buy – picked up a copy of Led Zeppelin II and paid like $15, which isn't a terrible amount, but I wasn't sure if it was worth it – there are so many things that could go wrong with vinyl. Even if it's in good condition, there's the mastering and how it was recorded – just a lot of stuff. So I looked into it and found out I actually had a pretty old copy of Led Zeppelin II – not like the really first version that everyone wants, which is like $200 even having been run over by a semi, but [inaudible] a little bit after that. But if you get into vinyl, you kind of get into how stuff is recorded and how it actually ends up as sound on a platter. And there are people that obsess over this stuff, and they exist at the Steve Hoffman forums. So if you end up digging for vinyl, you can learn a lot about how the record that you bought was recorded and mastered and things like that. Very useful, so I go in quite a bit just to learn a bit about how things were recorded. If you're an audio geek and get into vinyl, or even CDs, and like to discuss how things are recorded, check out the Steve Hoffman forums – a lot of cool stuff. ANDREW: +1. CHUCK: Awesome. Alright.
I haven't really picked these on this show, but I've been listening to a lot of audiobooks lately on Audible.com, and this is a terrific resource for me as far as being able to get business information and stuff, since I'm a coder for hire. I've really been enjoying some of the books that I've picked up, and I'll pick one of those books. And then I'm also going to pick an iOS course that I just picked up; I wanted to check it out. The book I'm going to pick is The Total Money Makeover by Dave Ramsey, and I'll put links to both Audible and Amazon so that you can get it in either place. And then the course that I picked up – I'm trying to remember what it's called – it's by Ray Wenderlich, and he's got a tutorial for iOS programming, and I'm really looking forward to digging into that and seeing what's there. I haven't tried it yet; I got in trouble a little bit – I had some people complaining that I pick stuff that I haven't tried yet – so I haven't tried it yet, and this is your warning, but it looks really good. JAIM: It's a future pick. CHUCK: Colin, you have some picks? COLIN: Sure. Traditionally I've been using GitHub, but recently I've been checking out this service called Bitbucket. It's another git host, but they give you – it says right here on their webpage, "Unlimited private code repositories," which is nice. So what's nice right off is that you don't have to create a pro account to host your own code. The second thing is the tools are just a lot nicer; the layout's a lot nicer on their web front end. So I'm enjoying using Bitbucket much more than I have GitHub. So if you're a git user, check out Bitbucket – really great service. Another thing I've been doing is I've been playing Hearthstone a lot on my iPad. Blizzard just shipped an iPad client for Hearthstone, and I would not suggest downloading that if you have anything to do, or any projects to do, because it is a wonderful, addictive little game that I've been playing. I used to play Threes all the time, which is a great app, but now I'm playing Hearthstone. To take the place of Threes, you have to be a pretty, pretty good game, so I've been doing that. And then, another app I've been using that I'm actually beta testing right now – it's not out publicly, but their website's up; you can go check it out. It's this app called Pixel Winch. It does screen measurements and stuff like that – there's another popular one out there by a different company, which I will not name by name, that does pixel measurements. You can [inaudible] take it into a UI review and measure the distance between different elements of the screen to make sure everything is lined up correctly. The workflow in this app is a little cleaner, and we have been beta testing it at work and everybody loves it. It just helps you measure, and it does everything on the fly; it takes screenshots of your UI, so you have a screenshot to refer to instead of trying to measure everything live on the screen. A really great application; it should be out soon. I hope it is, because I'd like to be able to use it – a really great app for everybody to keep an eye on. ANDREW: I've been using Bitbucket for several years for my own personal repositories. They are private, and I have been super happy with it. They also support Mercurial, which I consider a plus. So, +1. COLIN: Yeah. JAIM: Let me +1 that. Most of my stuff's on Bitbucket too.
COLIN: If you're behind firewalls – so if you're a corporate user – they have a service called Stash, which I'm looking at right now, and Stash is really great as well. So if you wanna host your own repositories on your own server, and you have some money to buy a commercial license, Stash is a pretty great solution. CHUCK: Cool. Alright, well, we'll go ahead and wrap up the show. Thanks for coming, Colin. COLIN: No problem. ANDREW: Thank you, I thought that was great. JAIM: Yeah, great stuff. [Hosting and bandwidth provided by the Blue Box Group. Check them out at BlueBox.net.] [Bandwidth for this segment is provided by CacheFly, the world's fastest CDN. Deliver your content fast with CacheFly. Visit cachefly.com to learn more.] [Would you like to join the conversation with the iPhreaks and their guests? Want to support the show? We have a forum that allows you to join the conversations and support the show at the same time. You can sign up at iphreaksshow.com/forum]
