Tuesday, April 15, 2014

Understanding ROS Memory Allocator in ART Virtual Machine

Agenda

1. Basic structure of a Run
2. Main algorithms
a)    Allocate
b)    Free

Main Text

The ROS memory allocator is a memory allocator used by the GC of the Android ART virtual machine. ROS is short for Run-of-Slots. The ROS allocator hands out ranges of virtual addresses (when the underlying physical memory is actually allocated is up to the kernel's page-fault handling). Its basic allocation unit is the slot; slot sizes run from 16 bytes to 2048 bytes: 16, 32, 48, …, n*16, …, 512, 1024, 2048. Slots of different sizes belong to different kinds of Run; in other words, each kind of Run provides slots of a single size. There are 34 kinds of Run in total.
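The size-to-Run mapping above can be sketched as follows. This is a hypothetical helper, not the actual ART code; the function and constant names are mine. Brackets 0..31 cover 16..512 bytes in 16-byte steps, and the last two brackets cover 1024 and 2048 bytes, giving the 34 kinds of Run:

```c
#include <stddef.h>

#define NUM_BRACKETS 34  /* 32 small brackets + 1024 + 2048 */

/* Map a request size (<= 2048 bytes) to its Run kind. */
static size_t bracket_index(size_t request_size)
{
    if (request_size <= 16)
        return 0;                            /* smallest bracket: 16 bytes */
    if (request_size <= 512)
        return (request_size + 15) / 16 - 1; /* round up to a 16*n bracket */
    if (request_size <= 1024)
        return 32;                           /* 1024-byte bracket */
    return 33;                               /* 2048-byte bracket */
}

/* Inverse mapping: the slot size served by bracket idx. */
static size_t bracket_size(size_t idx)
{
    if (idx < 32)
        return (idx + 1) * 16;
    return idx == 32 ? 1024 : 2048;
}
```

For example, a 170-byte request rounds up to the 176-byte bracket, index 10 — matching the (10+1)*16 = 176 example in the Run header description below.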

I. Basic Structure of a Run
A Run can be divided into two major parts: the Run header and the Data Area, as shown below:

The Run header contains:
            magic_num_: used for debugging.
            size_bracket_idx_: indicates the size of the slots in this Run. For example, if size_bracket_idx_ is 10, the slot size of this Run is (10+1)*16 = 176 bytes.
            is_thread_local_: indicates whether this Run is a thread-local area. The ROS allocator stipulates that every thread holds a set of thread-local Runs, whose slot size is at most 176 bytes. That is, when a thread needs to allocate a space of at most 176 bytes, it obtains the virtual memory from its own Runs rather than from the ART runtime.
            to_be_bulk_freed_: a flag that helps handle freeing many objects at once. When set, several slots of this Run are being freed. If the Run is thread-local, the bits set in bulk_free_bit_map_ are propagated to thread_local_free_bit_map_; then, when an allocation of a thread-local slot fails, the information in thread_local_free_bit_map_ is merged into alloc_bit_map_. If the Run is not thread-local, the bits are merged into alloc_bit_map_ directly during the bulk free. alloc_bit_map_ is what the allocator consults to know which slots have already been handed out.
            top_slot_idx_: within a Run, slots are allocated from the Data Area in increasing order, and this field records how far allocation has progressed; allocating a new slot simply increments it. However, slots that were allocated earlier may have been freed again, so when top_slot_idx_ reaches the end, alloc_bit_map_ is checked to see whether the Run still has a free slot; if so, that slot is handed out.
            alloc_bit_map_[]: one bit per slot in the Run; 1 means the slot has been allocated, 0 means it is still free.
            bulk_free_bit_map_[]: one bit per slot; the bit is set to 1 when the slot is freed. At the appropriate time this field is propagated to thread_local_free_bit_map_[] and later to alloc_bit_map_[] (when the Run is thread-local), or directly to alloc_bit_map_[] (when it is not). This field exists for performance: it reduces lock contention.
            thread_local_free_bit_map_[]: if the Run is thread-local, the corresponding bit is set to 1 when a slot is freed, and the bits are merged into alloc_bit_map_[] at the appropriate time (see to_be_bulk_freed_ above for when).
            padding: needed for alignment. Each kind of Run occupies a different amount of space (Run header plus Data Area), with different slot sizes and slot counts, so the three bitmaps (alloc_bit_map_[], bulk_free_bit_map_[], thread_local_free_bit_map_[]) occupy different amounts of space. The Data Area, holding n*slot_size bytes, must end page-aligned, so the padding size also differs for each kind of Run.

slot 0, slot 1, …, last slot make up the actual data area. A slot that has not been handed out is free; one that has been handed out is in use.
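The layout described above can be sketched as a C struct. Field names follow the text; the types, widths, and single-element bitmap are illustrative, not the exact ART layout (in ART the bitmaps are variable-length and sized per bracket):

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch of a Run header; not the real ART definition. */
typedef struct Run {
    uint32_t magic_num_;        /* debug aid */
    uint8_t  size_bracket_idx_; /* slot size = (idx+1)*16 for idx < 32 */
    uint8_t  is_thread_local_;  /* Run belongs to one thread's local set */
    uint8_t  to_be_bulk_freed_; /* several slots are being freed at once */
    uint32_t top_slot_idx_;     /* next never-used slot in the data area */
    /* Three per-slot bitmaps follow; each needs ceil(num_slots/32)
     * words, so their sizes vary with the bracket: */
    uint32_t alloc_bit_map_[1];               /* 1 = slot handed out    */
    /* uint32_t bulk_free_bit_map_[];            pending bulk frees     */
    /* uint32_t thread_local_free_bit_map_[];    pending local frees    */
    /* ... padding so the data area ends page-aligned ...               */
    /* slot 0, slot 1, ..., last slot                                   */
} Run;
```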

II. Main Algorithms
1.      Allocation
When the system asks the garbage collector to allocate space for an object, it first computes the size the object needs, request_size. If request_size is larger than 2048 bytes, pages are requested directly. Otherwise the Run matching request_size is looked up. The Runs currently being allocated from are organized into arrays, of two kinds: one kind serves blocks larger than 176 bytes, the other serves blocks of at most 176 bytes and is thread-local. As described at the beginning, each slot size corresponds to one index: the Run with 16-byte slots is element 0 of the array, and element n holds the Run whose slot size is (n+1)*16. Once the Run is found:
(1)     If the Run is null, it has not yet requested space from the system. Even so, space is not requested immediately; the allocator first checks whether a Run of this kind that was once fully allocated has since had slots freed and can be reused. If so, the reusable Run is taken. Otherwise several virtual pages are requested from the system; the number of pages needed to back each kind of Run differs and is stored in the numOfPages[] array. Runs whose slots are all used up are kept in the full_runs_ hash set, while Runs that have usable slots again because some were freed are kept in the non_full_runs_ set.
(2)     If the Run is non-null, top_slot_idx_ is checked first; if it has not reached the end, a slot is returned and top_slot_idx_ is incremented. If top_slot_idx_ has reached the end, the allocator searches for slots that were handed out and then freed again, and allocates one if found. If all slots are in use and the Run is thread-local, the information in thread_local_free_bit_map_[] is merged in and the search for a usable slot is repeated. If no usable slot has been found by this point, the Run is put into the full_runs_ hash set, a new Run is set up by requesting pages from the system, and the slot is allocated from that Run. Every time a slot is allocated, its bit in alloc_bit_map_ is set to 1, marking the slot as handed out.
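The two-phase slot allocation in (2) can be sketched as below. This is a minimal model, not the real ART code: first bump top_slot_idx_, then fall back to scanning alloc_bit_map_ for a slot that was allocated and later freed:

```c
#include <stdint.h>

enum { NUM_SLOTS = 8 };  /* a tiny Run for illustration */

struct run_state {
    uint32_t top_slot_idx_;   /* next never-used slot */
    uint32_t alloc_bit_map_;  /* one bit per slot, 1 = allocated */
};

/* Returns the allocated slot index, or -1 if the run is full. */
static int alloc_slot(struct run_state *r)
{
    if (r->top_slot_idx_ < NUM_SLOTS) {      /* fast path: bump the index */
        int idx = (int)r->top_slot_idx_++;
        r->alloc_bit_map_ |= 1u << idx;      /* mark slot as handed out */
        return idx;
    }
    for (int i = 0; i < NUM_SLOTS; i++) {    /* slow path: reuse a freed slot */
        if (!(r->alloc_bit_map_ & (1u << i))) {
            r->alloc_bit_map_ |= 1u << i;
            return i;
        }
    }
    return -1;  /* truly full: the Run would move into full_runs_ */
}

static void free_slot(struct run_state *r, int idx)
{
    r->alloc_bit_map_ &= ~(1u << idx);       /* clear bit: slot is free again */
}
```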
2.     Free
When an object is freed, its address is rounded down to a page address, and that page address is used to look up what the page is used for. The virtual pages owned by a ROS allocator are contiguous, obtained up front via mmap. The allocator maintains a page_map_ array with one byte per page. For pages backing a Run used for slot allocation, the byte for the Run's first page is set to the enum constant kPageMapRun and the bytes for the remaining pages are set to kPageMapRunPart. Hence, when an object is freed, the Run's address in memory can be recovered from the object's address and page_map_. If the object is a large object (the byte in page_map_ is kPageMapLargeObject), its pages are freed directly.
If the page's byte in page_map_ is kPageMapRun, the slot's bit in the Run's alloc_bit_map_ is cleared to 0, marking the slot free. If the Run is not one the system is currently allocating from (its slots were exhausted earlier and the system has since allocated a new Run), it is moved from full_runs_ to non_full_runs_. If the Run becomes entirely free, it is torn down and its pages are returned to the free_pages_set. A thread-local Run, however, is never returned this way, because a thread-local Run is always one of the Runs currently in use; once all slots of a thread-local Run are consumed, it is moved into full_runs_ and its thread-local attribute is removed.
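The page_map_ lookup on the free path can be sketched as follows. The enum names follow the text; the values, page size, and helper are illustrative, not the real ART code:

```c
#include <stdint.h>
#include <stddef.h>

/* One byte per page, classifying how the page is used. */
enum page_map_type {
    kPageMapEmpty,
    kPageMapRun,          /* first page of a Run */
    kPageMapRunPart,      /* remaining pages of a Run */
    kPageMapLargeObject,  /* first page of a large object */
};

#define ROS_PAGE_SIZE 4096u

/* Given an object address inside a Run, round it down to its page and
 * scan page_map_ backwards to the Run's first page (kPageMapRun) --
 * this is how the Run header's address is recovered on free. */
static size_t run_page_index(const uint8_t *page_map,
                             uintptr_t base, uintptr_t obj)
{
    size_t pi = (obj - base) / ROS_PAGE_SIZE;  /* round down to page index */
    while (page_map[pi] == kPageMapRunPart)    /* back up to the header page */
        pi--;
    return pi;
}
```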

III. References


Tuesday, September 24, 2013

Understanding the Three ARMv8 Device Memory Attributes

The ARMv8 spec introduces three new concepts for device memory: Gathering, Reordering, and Early Write Acknowledgement. Below, these three concepts are explained from a software engineer's point of view.

Gathering

The Gathering attribute determines whether it is permissible for either:
• Multiple memory accesses of the same type, read or write, to the same memory location to be merged into a single transaction.
• Multiple memory accesses of the same type, read or write, to different memory locations to be merged into a single memory transaction on an interconnect.

When a load/store instruction reaches the memory-access stage of the pipeline and misses the cache (for non-cacheable memory the cache is simply bypassed, and device memory is generally non-cacheable), the request must be sent out on the bus (for example an AXI bus) to reach the device memory; this bus access is called a transaction. As a performance optimization, the core contains a buffer: when a memory-access request is issued, the request is placed into this buffer, and if a later instruction accesses the same cache line or the same address as the previous one, it does not issue another transaction on the bus but shares the result the previous instruction read into the buffer. This sharing is called gathering. If the data read into the buffer is cacheable, it is moved into the cache at some later point.

Reordering
Reordering also concerns device-memory transactions, that is, consecutive access transactions to one device-memory location or across device-memory locations. Under non-reordering, a device-memory transaction may proceed only after the previous device-memory transaction has completed (after the acknowledgement of the previous load or write). If the transactions may instead proceed out of order, without waiting for the previous access's acknowledgement, that is reordering.

Early Write acknowledge

Early Write Acknowledgement is a hint to the platform memory system. Assigning the No Early Write Acknowledgement attribute to a Device memory location recommends that only the endpoint of the write access returns a write acknowledgement of the access, and that no earlier point in the memory system returns a write acknowledge. This means that a DSB barrier, executed by the processor that performed the write to the No Early Write Acknowledgement location, completes only after the write has reached its endpoint in the memory system. Typically, this endpoint is a peripheral or physical memory.

When a write to device memory requires an acknowledgement, the next device-memory transaction cannot proceed until the ACK arrives. As a performance optimization, a device may provide a register-like buffer invisible to the programmer: when the CPU writes data to device memory over the bus, the data is simply written into this buffer, which can then return an ACK to the CPU indicating the write succeeded. When the buffer actually moves the data on into device memory is up to the device's internal protocol.
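Taken together, the three attributes yield the four ARMv8 device memory types, from the most restrictive (nGnRnE) to the most relaxed (GRE). The sketch below lists them with their MAIR_ELx attribute encodings; the enum names are mine, and the encodings should be checked against the ARM ARM for your core before use:

```c
#include <stdint.h>

/* ARMv8 device memory types as MAIR_ELx attribute bytes (0b0000dd00).
 * Names are illustrative C identifiers for the architectural types. */
enum device_mem_attr {
    DEVICE_nGnRnE = 0x00, /* no Gathering, no Reordering, no Early Write Ack */
    DEVICE_nGnRE  = 0x04, /* no Gathering, no Reordering, Early Write Ack    */
    DEVICE_nGRE   = 0x08, /* no Gathering, Reordering, Early Write Ack       */
    DEVICE_GRE    = 0x0c, /* Gathering, Reordering, Early Write Ack          */
};
```

DEVICE_nGnRE corresponds to the classic ARMv7 "Device" memory, and DEVICE_nGnRnE to "Strongly-ordered".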

Friday, September 13, 2013

Understanding DMB


DMB: Data memory barrier
To understand the DMB instruction, first look at the following example, in which core 0 and core 1 simultaneously run two different instruction sequences (shown in the table below):
core 0          core 1
Write A;        Load B;
Write B;        Load A;
Core 0 executes two instructions that write the values A and B. These writes may be reordered, or Write A may take a cache miss, so A's new value may land in the cache later than B's. Core 1's Load B would then see the new value while Load A sees the old one. If A and B are related, this can lead to problems such as deadlock. A classic example: https://lkml.org/lkml/2012/7/13/123
The solution is as follows:
core 0          core 1
Write A;        Load B;
DMB;            Load A;
Write B;
A DMB is inserted between core 0's two instructions. Then, if core 1's Load B sees the new value of B, Load A is guaranteed to see the new value of A as well. (Strictly speaking, core 1 also needs a barrier, or an address dependency, between its two loads, since loads may likewise be reordered on its side.) This is the effect of DMB: the loads/stores before the DMB complete, and their values become observable, before the memory accesses issued after the DMB.
DSB and DMB are easy to confuse. The difference is that after a DMB the CPU may continue executing subsequent instructions as long as they are not memory-access instructions, whereas a DSB forces the CPU to wait for all of its preceding instructions to complete, regardless of what follows. In fact, in many processor designs DMB and DSB are implemented identically (DMB performing the same function as DSB). Their definitions, together with ISB, from the ARM reference are as follows [1]:
  • Data Synchronization Barrier (DSB) completes when all instructions before this instruction complete.
  • Data Memory Barrier (DMB) ensures that all explicit memory accesses before the DMB instruction complete before any explicit memory accesses after the DMB instruction start.
  • An Instruction Synchronization Barrier (ISB) flushes the pipeline in the processor, so that all instructions following the ISB are fetched from cache or memory, after the ISB has been completed.

ISB does everything DSB does and additionally flushes the pipeline [2]. So their heavyweight ordering can be stated as: ISB > DSB > DMB :-)
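The guarantee in the two-core example can be demonstrated portably with C11 atomics, whose fences compile down to DMB on ARM. In this sketch, the release fence plays the role of core 0's DMB, and the acquire fence is the reader-side barrier noted above; the helper names are mine:

```c
#include <pthread.h>
#include <stdatomic.h>

static int A;                 /* the payload    ("Write A") */
static atomic_int B;          /* the ready flag ("Write B") */

static void *producer(void *arg)
{
    (void)arg;
    A = 42;                                             /* Write A */
    atomic_thread_fence(memory_order_release);          /* core 0's DMB */
    atomic_store_explicit(&B, 1, memory_order_relaxed); /* Write B */
    return NULL;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&B, memory_order_relaxed) == 0)
        ;                                               /* spin on Load B */
    atomic_thread_fence(memory_order_acquire);          /* core 1's barrier */
    *(int *)arg = A;                                    /* Load A */
    return NULL;
}

/* Runs the two "cores" and returns the value Load A observed. */
static int run_demo(void)
{
    pthread_t p, c;
    int seen = 0;
    atomic_store(&B, 0);
    pthread_create(&c, NULL, consumer, &seen);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return seen;  /* with both fences, Load A must see 42 */
}
```

Without the two fences, a weakly-ordered CPU is permitted to let Load A return the old value of A even after Load B has seen B == 1.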

References:

Sunday, June 2, 2013

How to Debug Crash Linux for ARM


It’s known that there are many tools for debugging a crashed Linux kernel on the x86 architecture, such as kdump and LKCD. Although many debugging tools claim to support the ARM architecture, they are unstable. Is there a reliable method for ARM SoCs? Yes: this post shows a stable method for debugging ARM Linux. The premise is that you can dump a memory snapshot with some assistive hardware tool when a crash happens, which is easy for an ARM SoC company.

The method is to compose a crash image and analyze the crashed kernel with the aid of the Crash Utility.

Now I will explain the method step by step.
1.       Build the Linux kernel for the ARM SoC. Make sure that kexec and debug info are enabled:
·         Boot options  ---> 
·         [*] Kexec system call (EXPERIMENTAL)
·         [*] Export atags in procfs (NEW)
·         Kernel hacking  --->
·         [*] Compile the kernel with debug info
2.       Modify kernel source code for kernel/kexec.c
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1089,14 +1089,16 @@ void crash_kexec(struct pt_regs *regs)
         * sufficient.  But since I reuse the memory...
         */
        if (mutex_trylock(&kexec_mutex)) {
-               if (kexec_crash_image) {
+//             if (kexec_crash_image) {
                        struct pt_regs fixed_regs;
 
                        crash_setup_regs(&fixed_regs, regs);
                        crash_save_vmcoreinfo();
                        machine_crash_shutdown(&fixed_regs);
-                       machine_kexec(kexec_crash_image);
-               }
+                       flush_cache_all();
+//                     machine_kexec(kexec_crash_image);
+//             }
                mutex_unlock(&kexec_mutex);
        }
3.       Build Linux Kernel.
4.       After the Linux kernel boots and the console shows, record two physical addresses:
a.    cpu_notes_paddr: $ cat /sys/devices/system/cpu/cpu0/crash_notes
b.   vmcore_notes_paddr: $ cat /sys/kernel/vmcoreinfo
5.       When a crash happens, use some assistive hardware method to dump the memory snapshot.
6.       Compose an image usable by the Crash Utility.
        This image is in ELF format and contains several program headers. If the dumped memory is one flat sequence, two program headers are enough: one for the cpu_notes and vmcore notes, the other for the memory snapshot. I have written a tool that can be used as a reference; the code can be found at https://github.com/tek-life/dump-crash-tool.
7.       After step 6 is completed, a new file ‘newvmcore’ is produced. We can then use the Crash Utility to analyze the crashed Linux kernel.
$./crash vmlinux newvmcore
‘vmlinux’ is the kernel image with debug information; ‘newvmcore’ is the file obtained in step 6.
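For step 6, a minimal sketch of the ELF skeleton such an image needs is shown below: one PT_NOTE segment for the CPU/vmcore notes and one PT_LOAD segment for the flat memory snapshot. The struct, function, and offset choices are placeholders; the linked dump-crash-tool is the complete reference:

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* A 32-bit ARM core file skeleton: ELF header + two program headers. */
struct core_header {
    Elf32_Ehdr ehdr;
    Elf32_Phdr note;  /* cpu_notes + vmcore notes */
    Elf32_Phdr load;  /* the raw memory snapshot  */
};

static void fill_core_header(struct core_header *h,
                             Elf32_Addr ram_base, Elf32_Word ram_size)
{
    memset(h, 0, sizeof(*h));
    memcpy(h->ehdr.e_ident, ELFMAG, SELFMAG);     /* "\x7fELF" magic */
    h->ehdr.e_ident[EI_CLASS]   = ELFCLASS32;
    h->ehdr.e_ident[EI_DATA]    = ELFDATA2LSB;
    h->ehdr.e_ident[EI_VERSION] = EV_CURRENT;
    h->ehdr.e_type      = ET_CORE;                /* core dump file */
    h->ehdr.e_machine   = EM_ARM;
    h->ehdr.e_version   = EV_CURRENT;
    h->ehdr.e_phoff     = offsetof(struct core_header, note);
    h->ehdr.e_ehsize    = sizeof(Elf32_Ehdr);
    h->ehdr.e_phentsize = sizeof(Elf32_Phdr);
    h->ehdr.e_phnum     = 2;

    h->note.p_type   = PT_NOTE;
    h->note.p_offset = sizeof(*h);                /* notes follow the headers */

    h->load.p_type   = PT_LOAD;
    h->load.p_vaddr  = ram_base;                  /* kernel mapping of RAM */
    h->load.p_paddr  = ram_base;
    h->load.p_filesz = ram_size;
    h->load.p_memsz  = ram_size;
}
```

Write this header, then the note data (from cpu_notes_paddr and vmcore_notes_paddr), then the raw snapshot, and the result is a ‘newvmcore’ the Crash Utility will accept.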

Following text is the Crash Utility output.

$ ./crash examples/vmlinux examples/vmcore

crash 6.1.5
Copyright (C) 2002-2013  Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010  IBM Corporation
Copyright (C) 1999-2006  Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012  Fujitsu Limited
Copyright (C) 2006, 2007  VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011  NEC Corporation
Copyright (C) 1999, 2002, 2007  Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002  Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions.  Enter "help copying" to see the conditions.
This program has absolutely no warranty.  Enter "help warranty" for details.

GNU gdb (GDB) 7.3.1
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=i686-pc-linux-gnu --target=arm-linux-gnueabi"...

      KERNEL: examples/vmlinux
    DUMPFILE: examples/vmcore
        CPUS: 1
        DATE: Sat Jan  1 10:29:39 2000
      UPTIME: 00:00:51
LOAD AVERAGE: 0.08, 0.03, 0.01
       TASKS: 22
    NODENAME: Linux
     RELEASE: 3.4.5-00527-g320e261-dirty
     VERSION: #506 Fri May 10 15:57:51 CST 2013
     MACHINE: armv7l  (unknown Mhz)
      MEMORY: 128 MB
       PANIC: "[   51.520000] Internal error: Oops: 817 [#1] ARM" (check log for details)
         PID: 297
     COMMAND: "sh"
        TASK: c7870900  [THREAD_INFO: c793a000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash>


Friday, April 26, 2013

Understanding Kdump (How to make crashnote)


crashnote contains the register state at the moment of the crash. In the kernel, this data is stored in “note_buf_t __percpu *crash_notes”.

include/linux/kexec.h
typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];

crashnote is also part of /proc/vmcore. The crash_notes address, obtained by reading /sys/devices/system/cpu/cpu0/crash_notes, is recorded in a program header, which can be found through “elfcorehdr”. The crash notes themselves are produced by crash_save_cpu().

crash_kexec->machine_crash_shutdown->crash_save_cpu:
void crash_save_cpu(struct pt_regs *regs, int cpu)
{
        struct elf_prstatus prstatus;
        u32 *buf;

        if ((cpu < 0) || (cpu >= nr_cpu_ids))
                return;

        /* Using ELF notes here is opportunistic.
         * I need a well defined structure format
         * for the data I pass, and I need tags
         * on the data to indicate what information I have
         * squirrelled away.  ELF notes happen to provide
         * all of that, so there is no need to invent something new.
         */
        buf = (u32*)per_cpu_ptr(crash_notes, cpu);
        if (!buf)
                return;
        memset(&prstatus, 0, sizeof(prstatus));
        prstatus.pr_pid = current->pid;
        elf_core_copy_kernel_regs(&prstatus.pr_reg, regs);
        buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
                              &prstatus, sizeof(prstatus));
        final_note(buf);
}

The final layout is illustrated below.
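The note record that append_elf_note() and final_note() produce can be sketched as follows. This is my simplified reimplementation, not the kernel's code: a 12-byte header (name size, descriptor size, type) followed by the name and descriptor, each padded to a 4-byte boundary:

```c
#include <stdint.h>
#include <string.h>

/* The fixed header of one ELF note record. */
struct elf_note_hdr {
    uint32_t n_namesz;  /* length of name, including the NUL */
    uint32_t n_descsz;  /* length of the descriptor payload  */
    uint32_t n_type;    /* e.g. NT_PRSTATUS (== 1)           */
};

/* Append one note to a word-aligned buffer; returns the new end. */
static uint32_t *append_note(uint32_t *buf, const char *name,
                             uint32_t type, const void *desc, uint32_t descsz)
{
    struct elf_note_hdr note;
    note.n_namesz = (uint32_t)strlen(name) + 1;  /* include the NUL */
    note.n_descsz = descsz;
    note.n_type   = type;
    memcpy(buf, &note, sizeof(note));
    buf += sizeof(note) / sizeof(*buf);
    memcpy(buf, name, note.n_namesz);
    buf += (note.n_namesz + 3) / 4;              /* pad name to 4 bytes */
    memcpy(buf, desc, descsz);
    buf += (descsz + 3) / 4;                     /* pad desc to 4 bytes */
    return buf;
}

/* final_note() terminates the note list with an all-zero header. */
static uint32_t *final_note_sketch(uint32_t *buf)
{
    memset(buf, 0, sizeof(struct elf_note_hdr));
    return buf + sizeof(struct elf_note_hdr) / sizeof(uint32_t);
}
```

For a prstatus note the kernel uses the name "CORE" (KEXEC_CORE_NOTE_NAME) and type NT_PRSTATUS, with the elf_prstatus structure as the descriptor.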