Monday, December 8, 2014

A type-safe definition for OpenCL's enqueue_kernel function

I want to share with you something that I initially thought it wouldn't work... but it does. No reason behind it, just to prove (once again) that C++11 is indeed fantastic and it can handle (almost) whatever you throw at it.

Since 1 year the OpenCL 2.0 standard was ratified. The thing which is most exciting for me is device-side enqueue. This is a functionality which allows a kernel to submit new work directly on the device without the need for host intervention.

However there is something fishy with the way the function is defined and I am going to explain why. The new enqueue_kernel function (defined in the OpenCL 2.0 language specification) has several overloads:
Ok, we like it. But wait for the next two ones:
Emh.. wait a second.. what? How many variadic arguments are in there?

The issue here are the two sets of "...". How many arguments should we accept? It is funny since the specs are not saying much about these additional arguments. However the only way to read this in a way that makes sense (in my understanding) is the following:
"If a closure function (or block) is passed which accepts N OpenCL "local" pointers, then their size is defined by an equal number of unsigned values (i.e., size0,..., sizeN-1). It is responsibility of the runtime support to allocate memory before the nested kernel is executed."
All of this to say that the length of the two variadic argument lists (the lambda's and the one internal to enqueue_kernel) must match. This means that it is responsibility of the compiler to perform this additional check. 

I can see many people being happy with this...  but couldn't we use the type system to enforce that? Can our beloved meta-programming fix this? Let's assume we were in C++? Would the API designer able to to express this concept (number of arguments in the closure equal number of arguments passed) just with the means of the type system? You will be glad to ear that with C++11, YES WE CAN! ...and I am going to show you how to do that.

For this example we use the sizes as input to the lambda (not for allocating device local memory as the actual implementation of enqueue_kernel is supposed to do). This is just a proof of concept, we are not interested in the actual implementation of OpenCL's device-side enqueue. Execution of the program will produce the following expected output:

> 10
> 20
> 30
> Calling closure
> Computed value: 60

This highlight the power of the ... expansion operator of C++11 variadic templates. For example if we try to call this function using an invalid number of sizes a compiler error will be generated:

@ThinkPad-X1-Carbon:~$ g++ -std=c++11 test.cpp 
test.cpp: In function ‘int main(int, char**)’:
test.cpp:25:10: error: too few arguments to function ‘int enqueue_kernel(std::function<void(Args ...)>, typename to_int<Args>::type ...) [with Args = {int, int, int}]’
    10, 20);
          ^
test.cpp:10:5: note: declared here
 int enqueue_kernel(std::function<void (Args... )> block, typename to_int<Args>::type... sizes)
     ^
And there you have it. A type-safe definition of OpenCL's enqueue_kernel using C++11. Just because in C++11 we can! Hate on that C lovers! :)

C++ <3

Tuesday, October 14, 2014

Serious programming on ChromeOS

In my previous post I explained how to get a working OpenCL for ARM's Mali GPUs on the Samsung Chromebook 2. That's great I know and you are very welcome. :) 

However, do you seriously want to write OpenCL goodness on that small 11" screen? And don't get me started on the keyboard... I mean the Samsung Chromebook 2's keyboard is not that bad considering this a 250$ laptop we are talking about... however I am currently typing on a Lenovo X1 carbon chiclet style keyboard and the experience is the closest thing to a nerdgasm (nerd-orgasm... yeah I just made it up, deal with it! ...ah no actually it already exists).  

First thing... we need to enable ssh server. Lucky enough, ChromeOS comes by default with an ssh daemon (/usr/sbin/sshd) , however it is not enabled by default. The way to enable it is described in [1,2]. For short there are the steps:
  1. Remove rootfs verification (!!please backup your stuff!!):
  2. $ sudo /usr/share/vboot/bin/make_dev_ssd.sh --remove_rootfs_verification --partitions 4
    $ reboot
  3. Mount the rootfs in rw mode (remember you will need to do this every time you reboot the device and want to write in the root partition):
    $ sudo mount -o remount,rw /
    
  4. Generate SSH keys:
  5. $ sudo mkdir /mnt/stateful_partition//etc/ssh
    $ ssh-keygen -t dsa -f /mnt/stateful_partition//etc/ssh/ssh_host_dsa_key
    $ ssh-keygen -t rsa -f mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
    
  6. Allow incoming traffic on PORT 22:
  7. $ sudo /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
    
  8. We can now create a new user used to remotely login (alternatively you can set the chronos user password). To create a new user you need to follow these steps:
    
    $ sudo useradd -G wheel -s /bin/bash mali_compute
    $ sudo passwd mali_compute
    $ sudo mkdir /home/mali_compute
    $ sudo chown /home/mali_compute mali_compute
    
    
    Now we have a user, however the user cannot run sudo and we already know we need to be able to run sudo in order to enter the Arch linux chroot. To solve this we need to make user belonging to the wheel group part of the sudoers. This is done using the visudo command.
    $ sudo su
    $ visudo
    
    Search and uncomment one of following lines:
    ## Uncomment to allow members of the group wheel to execute any command
    # %wheel ALL=(ALL) ALL
    
    ## Same thing without a password
    # %wheel ALL=(ALL) NOPASSWD: ALL
    
...and voila' you are done.

Now get your ip address using the ifconfig command

chronos@localhost / $ ifconfig 
lo: flags=73  mtu 65536
        [...]

mlan0: flags=4163  mtu 1500
        inet 192.168.1.80  netmask 255.255.255.0  broadcast 192.168.1.255
        ether xxx:xxx:xxx:xxx  txqueuelen 1000  (Ethernet)
        RX packets 1908  bytes 802994 (784.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1524  bytes 415774 (406.0 KiB)
        TX errors 4  dropped 0 overruns 0  carrier 0  collisions 0

Now from your desktop machine simply type:
$ ssh mali_compute@192.168.1.80
motonacciu@ThinkPad-X1-Carbon:~$ ssh chronos@192.168.1.80
The authenticity of host '192.168.1.80 (192.168.1.80)' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.80' (RSA) to the list of known hosts.
Password: 
mali_compute@localhost ~ $ uname -a
Linux localhost 3.8.11 #1 SMP Tue Sep 30 23:28:35 PDT 2014 armv7l SAMSUNG EXYNOS5 (Flattened Device Tree) GNU/Linux

Super cool right? ...well.. not really! Unfortunately you are going to loose the configuration once you reboot your chromebook. Or better, you will need to rerun the iptables and sshd commands manually if you wish to enable the daemon again. No worry, we can automatically start the SSH daemon by adding a script under the /etc/init directory, e.g., sshd.conf; it should contain the following lines (remember to remount the rootfs in rw mode in the case you rebooted the device in the meantime).

start on started system-services
script
     /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
     /usr/sbin/sshd
end script 

You should be able to login again from your next reboot.

Great, we can login into the super underpowered ChromOS shell... what's the deal? Right, but we are not done yet. The 'prestige' hasn't come yet :). Well, the idea of the last step is to let the user we just created jump into the Arch Linux chroot at login and therefore bypass the ChromeOS shell. This is rather simple to do as well (assuming you have created the chroot, otherwise go here):

$ su mali_compute
$ echo "sudo enter-chroot" > ~/.bash_profile


That's it! Wonder what's going to happen next time you log into your chromebook?

motonacciu@ThinkPad-X1-Carbon:~$ ssh mali_compute@192.168.1.80
Password: 
Last login: Mon Oct 13 22:17:40 BST 2014 from 192.168.1.71 on pts/1
Entering /mnt/stateful_partition/crouton/chroots/arch...
mali_compute@arch ~ $ uname -a
Linux arch 3.8.11 #1 SMP Tue Sep 30 23:28:35 PDT 2014 armv7l GNU/Linux

Limited chrosh... gone!! Welcome to fully fledged Linux environment! :)

[1]: http://www.de7ec7ed.com/2013/05/ssh-daemon-samsung-chromebook-exynos.html
[2]: https://sites.google.com/site/cr48ite/getting-technical/remove-rootfs-verification-make-read-write

C++ <3

Saturday, October 11, 2014

Run OpenCL on the new Samsung Chromebook 2 in 5(-ish) simple steps

Recently a colleague and friend of mine posted a great tutorial on how to run OpenCL on Samsung's Chromebook in 30 minutes. He has tested this tutorial on the older (Series 3) Chromebook.

I bought myself the newer version, the Samsung Chromebook 2 (11" version). The main difference between these two laptops is that the former chromebook hosts a Mali-T604 GPU while the latest model uses a bifier Mali T628-MP6 chip. The Mali-T604 GPU has 4 cores vs the 6 in the latest chip. The latter is definitely an interesting chip for OpenCL's folks since the 6 cores are going to be split into two physical devices with 4 and 2 cores respectively.

In this blog I will present a slightly different way of setting up a working OpenCL environment for the Chromebook 2 (which should be also working on the former chromebook).

Prerequisites

  • A Samsung Chromebook 2 (or any other ARM-based device running ChromeOS)
  • Some free space on your drive (2/4 GBs)
  • Knowledge of Arch Linux package manager (pacman)
  • Optional: I will install the development system in the internal SSD. If you think you are going to need more space for development, then you can use the microSD)

Step 1: Enable Developer mode


Enter Recovery Mode by holding the ESC and REFRESH (↻ or F3) buttons, and pressing the POWER button. In Recovery Mode, press Ctrl+D and ENTER to confirm and enable Developer Mode.


Step 2: Install chroarg

chroarg is a fork of the cruton process. It is based on the chroot command available in ChromeOS which allows to spawn lightweight virtual OSs, a more technical explanation follows:
What's a chroot?
Like virtualization, chroots provide the guest OS with their own, segregated file system to run in, allowing applications to run in a different binary environment from the host OS. Unlike virtualization, you are not booting a second OS; instead, the guest OS is running using the Chromium OS system. The benefit to this is that there is zero speed penalty since everything is run natively, and you aren't wasting RAM to boot two OSes at the same time. [...] 
While cruton will install by default Ubuntu, chroarg is based on Arch Linux. I personally prefer Arch Linux, but if you feel more confident with Ubuntu feel free the use cruton. Follows the creation of a chroot (more options are available from the project github page)

  1. Launch a crosh shell (Ctrl+Alt+T, you can paste in the console using Ctrl+Shift+V), then enter the command shell.
  2. Download and extract chroagh:
    $ cd ~/Downloads
    $ wget https://api.github.com/repos/drinkcat/chroagh/tarball -O chroagh.tar.gz
    $ tar xvf chroagh.tar.gz
    $ cd drinkcat-chroagh-*
    
    
  3. Create the rootfs:
    $ sudo sh -e installer/main.sh -r arch -t cli-extra 
The tool will install a minimal Arch, at some point it will ask to give the user name and password for the main user. If everything went fine (it often does), then you are ready to start your Arch installation within ChromeOS.

NOTE: If you want to install the chroot to a different location (e.g., an SD/microSD card or USB) then use the -p option to specify a destination folder.

Step 3: Enter-chroot and environment setup

After chroarg finishes installing a base Arch Linux installation we can enter this virtual environment using the command (from any crosh shell):

$ sudo enter-chroot

You should see the following output:
chronos@localhost ~/Downloads/drinkcat-chroagh-380f361 $ sudo enter-chroot
Entering /mnt/stateful_partition/crouton/chroots/arch...
[motonacciu@localhost ~]$ 

And magic magic, we are not in Arch linux. At this point you can install a bounce of packages which are going to be useful for OpenCL development.

$ sudo pacman -S gcc vim cmake base-devel git opencl-headers

Next (and final) step is downloading the Mali userspace drivers from. They are available from malideveloper.com and continuously updated. This is the moment where you should be aware of the Mali device you have installed in your chromebook. In my case the driver marked as MALI-T62x will do the trick. For the older chromebook the MALI-T604 driver should be used instead. Since we are using the command line we can download the fbdev version of the drivers:

$ wget http://malideveloper.arm.com/downloads/drivers/binary/r4p0-02rel0/mali-t62x_r4p0-02rel0_linux_1+fbdev.tar.gz
$ tar -xf mali-t62x_r4p0-02rel0_linux_1+fbdev.tar.gz
$ ls fbdev
libEGL.so  libGLESv1_CM.so  libGLESv2.so  libmali.so  libOpenCL.so


Next we can either edit the ~/.bashrc file to include this folder among the LD_LIBRARY_PATH, alternatively you can copy the libraries in your /usr/lib folder (but you will need to add the user among the sudoers). Or simply manually specify its path to GCC.

Step 4: Compile and Run your CL program

When you compile your program make sure the linker can find the libmali.so. The libOpenCL.so library is just a wrapper, the only library needed for running CL program is the libmali.so.

Compile your program as follows:
$ g++ -std=c++11 main.cpp -Iinclude -L/home/compute/fbdev -lmali -o clInfo

This is a simple CL program which prints the list of devices. We run using the following command (you can avoid specifying the LD_LIBRARY_PATH if you placed the libmali.so in a default location):

$ LD_LIBRARY_PATH=/home/compute/fbdev/:$LD_LIBRARY_PATH ./clInfo
Total number of CL devices: 2

Device 0 info
 - Vendor:            'ARM'
 - Name:              'Mali-T628'
 - Type:              'CL_DEVICE_TYPE_GPU'
 - Max frequency:     '533'
 - Max compute units: '2'
 - Global mem size:   '2097192960'
 - Local mem size:    '32768'
 - Profile:           'FULL_PROFILE'
 - Driver version:    '1.1'
 - Extensions:        'cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf'

Device 1 info
 - Vendor:            'ARM'
 - Name:              'Mali-T628'
 - Type:              'CL_DEVICE_TYPE_GPU'
 - Max frequency:     '533'
 - Max compute units: '4'
 - Global mem size:   '2097192960'
 - Local mem size:    '32768'
 - Profile:           'FULL_PROFILE'
 - Driver version:    '1.1'
 - Extensions:        'cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf'

Step 5: Done

Yes, that's all. You can now start writing you multi-device OpenCL codes for ARM's Mali GPU. Let's verify that everything is in order... shall we?

OpenMP 'matmul_1024x1024' on ARM-CPU   [cores:8] => 16632 msecs
OpenCL 'matmul_1024x1024' on Mali-T628 [cores:2] => 616 msecs
OpenCL 'matmul_1024x1024' on Mali-T628 [cores:4] => 319 msecs

Stay tuned for more experiments!

C++ <3

Sunday, March 30, 2014

Generating Flex and Bison rules with X-Macros

Lately I found myself working on a pet project for which I needed a parser for an assembly-like programming language for a real architecture. Since I wanted to use this opportunity to learn something new, I decided to use a combination of tools, i.e. Flex (for lexing) and Bison (for parsing)... obviously in C++ :)

However I found myself facing a pretty interesting problem for which I could not find a workaround, therefore I set myself with the goal of filling that void with this post. 

The problem is easily explained, in the language I am parsing (let's call it X for brevity), there  is a large amount of opcodes, e.g. add, sub, bra. These operations have different properties (e.g. accepting different number and type of parameter) and I wanted to have a centralized specification for those properties which I can pull-out as needed throughout my project. This is a typical case where X-macros should be used. However since flex and bison specification files are not a C-like language, I cannot use the preprocessor capabilities. Before we jump to the problem I wanted to recap what X-Macros are and why they are awesome. 

Let us consider the following example:

By properly defining the two macros in the opcodes.def file, and thus take advantage of the C preprocessor, we can reshape the information in different ways. As an example consider a function which determines if an argument is valid for a given operation:
As you expect the output of the code above is:
Is 'sub(_,_,_,reg)' valid? false
Is 'add(_,reg,_)'   valid? true
Is 'add(_,imm,_)'   valid? false
Is 'sub(_,_,imm)'   valid? true
Is 'sub(_,_,lab)'   valid? false

Cool eh? and think how easy is going to be adding a new opcode or change their semantics. This is quite standard practice in compiler-related project, beware that working with macros can get messy... but what not to love here? If you start to get weird errors by your compiler, sometimes it is instructive to run only through the preprocessor and have a look at the generated code. This can be done in GCC using the -E option (checkout an example for the code above here).
In my experience many programmers consider these meta-programming techniques (i.e., macros and templates) harmful and therefore to avoid like the plague. Of course everyone is entitled to his opinion but in a problem like this one, X-macros can save you an huge amount of time. The generated code is also a lot faster and more memory efficient then any implementation based on lookup tables.

Anyhow I am digressing... I hope all those macro haters have had enough by now and they left this post so that we can start to get things to the next level. Without all those 'C is better than C++' holding us back, can move on solving the actual problem which inspired this post. The idea is to use the the properties of X-macros inside the lexer and parser specification for a language parser. Since we are working in C++, the natural choice of tools is Flex (for lexing) and Bison (for parsing) [another choice could be Boost.Spirit, but I never felt the tool to be mature enough to be practical useful, feel free to comment on this]. For those who are not aware of parser generator tools, these are (meta-)programs which take a language specification (usually in a EBNF form) and produce code to determine whether an input stream matches that specification.

This is not going to be a tutorial on how to use Flex and Bison, there are other tutorials/blogs you can check out for that (Google is your friend). I start by setting up a basic flex and bison specification for parsing our language X. X allows 1 instruction per line, where each instruction is composed by an operator and a list of comma-separated arguments. Labels are allowed as well. For example the following is valid X program:

add r1, r2, r3 
sub r1, r1, #10
bra r1, exit
...
exit:
...

Follows the Flex and Bison specification files for the X language:
and
In order to compile this code we need a Makefile which triggers the generation of the lexer and parser. We are going to base our setup on cmake, checkout the CMakeLists.txt file for details. You can test out the code so far by checking out the following github repository (refer to the no_autogen branch).

You may start to notice the problem now. Our opcodes, which we defined in the opcodes.def file, need to be defined (again and again) both in flex and bison specifications. Since these are not C/C++ files, we cannot rely on the preprocessor for generating these rules automatically. This makes adding new opcodes to the X language very cumbersome since many files need to be modified. However there is my solution to the problem.

The idea is to generate temporary C/C++ files used to reshape the information contained in the opcodes.def file in a form that it is suitable for flex and bison. After that we replace the content generated by X-Macros in the language specification. The workflow is not particularly difficult, however we the main effort would be on automatizing this entire process so that specification files are refreshed when needed (i.e. when we lunch make).

Let's start with the Flex specification file (lexer.ll). My solution of the problem is the following. We define a placeholder, i.e. @@{{OPCODES}}@@ (you can choose any name for the placeholder, however make sure that it is unique within your flex specification),  which we are going to replace with some content automatically generated by the preprocessor. We also rename the file into lexer.in.ll which we are now going to call the template file. Next thing we need to do is to add to CMake the necessary action for generating and replacing the placeholder with content.
Similar thing can be done for the bison specification file. Note that here things are a little bit more complex since we need to declare opcodes both in the token section (line 28 of the parser.y file) and later in the grammar itself within the parsing rules (line 53 of the parser.y file). The full CMakeLists.txt file is available here.

In order to make replacement of placeholders with the preprocessor output I wrote a simple python script (i.e. xmacro_patcher.py). The script does remove empty lines and comments produced by the preprocessor and collapse the lines if needed. I actually spent lot of time finding out a way of doing this without the need of an additional script. I found a command line which will do the same work but unfortunately the way CMake treats the command line arguments in the COMMAND section made it impossible to use (damn you CMake). Anyway, in the case anyone is curious
awk -vf2="$(while read x; do echo $x; done)" '/$1/{print;print f2;next}1' $2
This like will read content from stdin into variable 'f2', then match the input file $2 with the regex provided in $1 and append the content of f2 at that location. The patched file will be retuned on the standard output.

And that's really it. We can now add new operators to our opcodes.def, press make... and our parser will now accept the broader language.

C++ <3