Understanding Hot and Cold Pages in Linux/ARM

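Agenda
1. What are hot and cold pages?
2. What do hot and cold pages actually buy us?
3. How large should the hot/cold page list be?
4. How does Linux/ARM maintain hot and cold pages?
4.1 Initialization
4.2 How the hot/cold page stock is replenished
4.3 How hot and cold pages are allocated

1. What are hot and cold pages?

The buddy system, which manages physical memory in the Linux kernel, introduces the notion of hot and cold pages. A cold page is a free page that is no longer in the CPU cache (usually meaning the L2 cache); a hot page is a free page that is still in the cache. Hot and cold pages are tracked per CPU: every zone sets up a per-cpu-pageset of hot/cold pages for each CPU.

2. What do hot and cold pages actually buy us?

There are three benefits:

2.1) When the buddy allocator hands out an order-0 free page and that page is hot, the page is already in the L2 cache, so a CPU write does not have to pull the contents from memory into the cache before writing. A cold page, by contrast, is not in the L2 cache. It is easy to see why hot pages are preferred in general. So when is a cold page the right choice?

While allocating a physical page frame, there is a bit specifying whether we would like a hot or a cold page (that is, a page likely to be in the CPU cache, or a page not likely to be there). If the page will be used by the CPU, a hot page will be faster. If the page will be used for device DMA the CPU cache would be invalidated anyway, and a cold page does not waste precious cache contents.[1]

2.2) When the buddy system allocates a free page from a zone on behalf of a process, it first has to take that zone's spinlock and only then allocate the page. If processes on several CPUs allocate pages at the same time, they contend for the lock. With a per-cpu-pageset, processes on different CPUs can allocate pages simultaneously without contending, which improves efficiency. Likewise, when a single page is freed, it first goes back into the per-cpu-pageset, which reduces the use of the zone spinlock. Only when the number of pages cached there exceeds a threshold are pages returned to the buddy system.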
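The locking argument in point 2.2 can be made concrete with a small user-space model. This is only a sketch under stated assumptions: `toy_zone`, `toy_pcp` and every `toy_*` function are hypothetical names invented for illustration, and a counter stands in for the real zone spinlock. It shows that a per-CPU cache refilled in batches touches the shared lock once per batch rather than once per page.

```c
#include <assert.h>

/* Toy model: a shared pool guarded by a "lock" whose acquisitions we count. */
struct toy_zone {
    unsigned long free_pages;   /* pages left in the shared pool */
    unsigned long lock_taken;   /* how many times the zone lock was taken */
};

/* Hypothetical per-CPU cache in front of the zone (not the kernel's struct). */
struct toy_pcp {
    unsigned long count;        /* pages currently cached on this CPU */
    unsigned long batch;        /* pages moved per refill */
};

/* Take up to n pages from the zone under a single lock acquisition. */
static unsigned long toy_zone_take(struct toy_zone *z, unsigned long n)
{
    z->lock_taken++;            /* stands in for spin_lock(&zone->lock) */
    if (n > z->free_pages)
        n = z->free_pages;
    z->free_pages -= n;
    return n;
}

/* Allocate one page with no per-CPU cache: one lock acquisition per page. */
static int toy_alloc_direct(struct toy_zone *z)
{
    return toy_zone_take(z, 1) == 1;
}

/* Allocate one page through the per-CPU cache: the lock is taken only on refill. */
static int toy_alloc_cached(struct toy_zone *z, struct toy_pcp *pcp)
{
    assert(pcp->batch > 0);
    if (pcp->count == 0)
        pcp->count += toy_zone_take(z, pcp->batch);
    if (pcp->count == 0)
        return 0;               /* the zone is exhausted */
    pcp->count--;
    return 1;
}

/* Run n direct allocations and report how often the lock was taken. */
static unsigned long toy_locks_direct(unsigned long n)
{
    struct toy_zone z = { .free_pages = n, .lock_taken = 0 };
    for (unsigned long i = 0; i < n; i++)
        toy_alloc_direct(&z);
    return z.lock_taken;
}

/* Run n cached allocations and report how often the lock was taken. */
static unsigned long toy_locks_cached(unsigned long n, unsigned long batch)
{
    struct toy_zone z = { .free_pages = n, .lock_taken = 0 };
    struct toy_pcp pcp = { .count = 0, .batch = batch };
    for (unsigned long i = 0; i < n; i++)
        toy_alloc_cached(&z, &pcp);
    return z.lock_taken;
}
```

For 310 allocations with a batch of 31, `toy_locks_direct(310)` reports 310 lock acquisitions while `toy_locks_cached(310, 31)` reports 10. The model ignores the free path and interrupt disabling, which the real allocator also has to handle.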
2.3) Another benefit of per-CPU hot/cold pages is that a page tends to stay with a single CPU, which helps the cache hit rate.

3. How large should the hot/cold page list be?

Hot and cold pages are managed on one linked list, with hot pages at the front and cold pages at the back. Each time a CPU frees an order-0 page as a hot page, the page is inserted at the head of the list, as long as the number of pages in the per-cpu-pageset stays below its threshold. Pages inserted earlier are therefore pushed further back as newer hot pages keep arriving, and the chance that they have gone from hot to cold grows accordingly. So how long does the list have to be for a hot page to have gone completely cold? The Linux kernel computes it like this:

3638 static int zone_batchsize(struct zone *zone)
3639 {
3640 #ifdef CONFIG_MMU
3641     int batch;
3642
3643     /*
3644      * The per-cpu-pages pools are set to around 1000th of the
3645      * size of the zone. But no more than 1/2 of a meg.
3646      *
3647      * OK, so we don't know how big the cache is. So guess.
3648      */
3649     batch = zone->present_pages / 1024;
3650     if (batch * PAGE_SIZE > 512 * 1024)
3651         batch = (512 * 1024) / PAGE_SIZE;
3652     batch /= 4; /* We effectively *= 4 below */
3653     if (batch < 1)
3654         batch = 1;
3655
3656     /*
3657      * Clamp the batch to a 2^n - 1 value. Having a power
3658      * of 2 value was found to be more likely to have
3659      * suboptimal cache aliasing properties in some cases.
3660      *
3661      * For example if 2 tasks are alternately allocating
3662      * batches of pages, one task can end up with a lot
3663      * of pages of one half of the possible page colors
3664      * and the other with pages of the other colors.
3665      */
3666     batch = rounddown_pow_of_two(batch + batch/2) - 1;
3667
3668     return batch;
3669
3670 #else
3671     /* The deferral and batching of frees should be suppressed under NOMMU
3672      * conditions.
3673      *
3674      * The problem is that NOMMU needs to be able to allocate large chunks
3675      * of contiguous memory as there's no hardware page translation to
3676      * assemble apparent contiguous memory from discontiguous pages.
3677      *
3678      * Queueing large contiguous runs of pages for batching, however,
3679      * causes the pages to actually be freed in smaller chunks. As there
3680      * can be a significant delay between the individual batches being
3681      * recycled, this leads to the once large chunks of space being
3682      * fragmented and becoming unavailable for high-order allocations.
3683      */
3684     return 0;
3685 #endif
3686 }

For a zone larger than 512 MB, the calculation is capped as if the zone were exactly 512 MB (lines 3650-3651), which gives a batch of 31 with 4 KB pages. For a zone of 512 MB or less, the value follows from lines 3649 and 3652. In both cases the result then goes through the rounding on line 3666.

3688 static void setup_pageset(struct per_cpu_pageset *p, unsigned long batch)
3689 {
3690     struct per_cpu_pages *pcp;
3691     int migratetype;
3692
3693     memset(p, 0, sizeof(*p));
3694
3695     pcp = &p->pcp;
3696     pcp->count = 0;
3697     pcp->high = 6 * batch;
3698     pcp->batch = max(1UL, 1 * batch);
3699     for (migratetype = 0; migratetype < MIGRATE_PCPTYPES; migratetype++)
3700         INIT_LIST_HEAD(&pcp->lists[migratetype]);
3701 }

So for a zone larger than 512 MB, the pcp high value is 6 * 31 = 186 pages. In other words, its count can never exceed 186 pages.

4. How does Linux/ARM maintain hot and cold pages?

In Linux, on a UMA architecture, hot and cold pages live on one list. Hot pages are at the front and cold pages at the back, and insertion follows a LIFO (first in, last out) discipline. A hot page that drifts toward the back becomes cold.

4.1 Initialization

First, the data structures Linux uses for hot and cold pages:

197 struct per_cpu_pages {
198     int count; /* number of pages in the list */
199     int high; /* high watermark, emptying needed */
200     int batch; /* chunk size for buddy add/remove */
201
202     /* Lists of pages, one per migrate type stored on the pcp-lists */
203     struct list_head lists[MIGRATE_PCPTYPES];
204 };

206 struct per_cpu_pageset {
207     struct per_cpu_pages pcp;
208 #ifdef CONFIG_NUMA
209     s8 expire;
210 #endif
211 #ifdef CONFIG_SMP
212     s8 stat_threshold;
213     s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
214 #endif
215 };

3291 static DEFINE_PER_CPU(struct per_cpu_pageset, boot_pageset);

3819 static __meminit void zone_pcp_init(struct zone *zone)
3820 {
3821     /*
3822      * per cpu subsystem is not up at this point. The following code
3823      * relies on the ability of the linker to provide the
3824      * offset of a (static) per cpu variable into the per cpu area.
3825      */
3826     zone->pageset = &boot_pageset;
3827
3828     if (zone->present_pages)
3829         printk(KERN_DEBUG " %s zone: %lu pages, LIFO batch:%u\n",
3830             zone->name, zone->present_pages,
3831             zone_batchsize(zone));
3832 }

Initialization call chain:

start_kernel->build_all_zonelists->__build_all_zonelists

3329     for_each_possible_cpu(cpu) {
3330         setup_pageset(&per_cpu(boot_pageset, cpu), 0);

start_kernel->build_all_zonelists->__build_all_zonelists->setup_pageset

Here setup_pageset is the same function listed in section 3 (lines 3688-3701). Called with batch = 0, it leaves each boot pageset with high = 0 (line 3697) and batch = 1 (line 3698).

4.2 How the hot/cold page stock is replenished

2458 void __free_pages(struct page *page, unsigned int order)
2459 {
2460     if (put_page_testzero(page)) {
2461         if (order == 0)
2462             free_hot_cold_page(page, 0);
2463         else
2464             __free_pages_ok(page, order);
2465     }
2466 }

Lines 2461-2464: when a single page (order 0) is being freed, free_hot_cold_page is called to store the free page in the per-CPU pageset. Otherwise __free_pages_ok is called.

free_all_bootmem->free_all_bootmem_core->__free_pages_bootmem->__free_pages->free_hot_cold_page

1213 /*
1214  * Free a 0-order page
1215  * cold == 1 ? free a cold page : free a hot page
1216  */
1217 void free_hot_cold_page(struct page *page, int cold)
1218 {
1219     struct zone *zone = page_zone(page);
1220     struct per_cpu_pages *pcp;
1221     unsigned long flags;
1222     int migratetype;
1223     int wasMlocked = __TestClearPageMlocked(page);
1224
1225     if (!free_pages_prepare(page, 0))
1226         return;
1227
1228     migratetype = get_pageblock_migratetype(page);
1229     set_page_private(page, migratetype);
1230     local_irq_save(flags);
1231     if (unlikely(wasMlocked))
1232         free_page_mlock(page);
1233     __count_vm_event(PGFREE);
1234
1235     /*
1236      * We only track unmovable, reclaimable and movable on pcp lists.
1237      * Free ISOLATE pages back to the allocator because they are being
1238      * offlined but treat RESERVE as movable pages so we can get those
1239      * areas back if necessary. Otherwise, we may have to free
1240      * excessively into the page allocator
1241      */
1242     if (migratetype >= MIGRATE_PCPTYPES) {
1243         if (unlikely(migratetype == MIGRATE_ISOLATE)) {
1244             free_one_page(zone, page, 0, migratetype);
1245             goto out;
1246         }
1247         migratetype = MIGRATE_MOVABLE;
1248     }
1249
1250     pcp = &this_cpu_ptr(zone->pageset)->pcp;
1251     if (cold)
1252         list_add_tail(&page->lru, &pcp->lists[migratetype]);
1253     else
1254         list_add(&page->lru, &pcp->lists[migratetype]);
1255     pcp->count++;
1256     if (pcp->count >= pcp->high) {
1257         free_pcppages_bulk(zone, pcp->batch, pcp);
1258         pcp->count -= pcp->batch;
1259     }
1260
1261 out:
1262     local_irq_restore(flags);
1263 }

Note line 1229: for a page on the hot/cold lists, page->private holds its migratetype. Lines 1251-1254 show that a hot page is added at the head of the list and a cold page at the tail. If too many pages accumulate, the stock is reduced in bulk by batch pages (lines 1256-1259). For example, if the zone holds 512 MB of physical memory, batch is 31, so as soon as the number of hot/cold pages reaches 186, 31 of them are released.

free_all_bootmem->free_all_bootmem_core->__free_pages_bootmem->__free_pages->free_hot_cold_page->free_pcppages_bulk:

621 /*
622  * Frees a number of pages from the PCP lists
623  * Assumes all pages on list are in same zone, and of same order.
624  * count is the number of pages to free.
625  *
626  * If the zone was previously in an "all pages pinned" state then look to
627  * see if this freeing clears that state.
628  *
629  * And clear the zone's pages_scanned counter, to hold off the "all pages are
630  * pinned" detection logic.
631  */
632 static void free_pcppages_bulk(struct zone *zone, int count,
633                     struct per_cpu_pages *pcp)
634 {
635     int migratetype = 0;
636     int batch_free = 0;
637     int to_free = count;
638
639     spin_lock(&zone->lock);
640     zone->all_unreclaimable = 0;
641     zone->pages_scanned = 0;
642
643     while (to_free) {
644         struct page *page;
645         struct list_head *list;
646
647         /*
648          * Remove pages from lists in a round-robin fashion. A
649          * batch_free count is maintained that is incremented when an
650          * empty list is encountered. This is so more pages are freed
651          * off fuller lists instead of spinning excessively around empty
652          * lists
653          */
654         do {
655             batch_free++;
656             if (++migratetype == MIGRATE_PCPTYPES)
657                 migratetype = 0;
658             list = &pcp->lists[migratetype];
659         } while (list_empty(list));
660
661         /* This is the only non-empty list. Free them all. */
662         if (batch_free == MIGRATE_PCPTYPES)
663             batch_free = to_free;
664
665         do {
666             page = list_entry(list->prev, struct page, lru);
667             /* must delete as __free_one_page list manipulates */
668             list_del(&page->lru);
669             /* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
670             __free_one_page(page, zone, 0, page_private(page));
671             trace_mm_page_pcpu_drain(page, 0, page_private(page));
672         } while (--to_free && --batch_free && !list_empty(list));
673     }
674     __mod_zone_page_state(zone, NR_FREE_PAGES, count);
675     spin_unlock(&zone->lock);
676 }

The function is a while loop that contains two do-while loops. The pcp free lists hold three classes of pages, one list per class:

38 #define MIGRATE_UNMOVABLE 0
39 #define MIGRATE_RECLAIMABLE 1
40 #define MIGRATE_MOVABLE 2
41 #define MIGRATE_PCPTYPES 3 /* the number of types on the pcp lists */

The lists are drained round-robin. Because migratetype is incremented before it is used (line 656), the scan starts at MIGRATE_RECLAIMABLE and later wraps around to MIGRATE_UNMOVABLE. The first do-while loop (lines 654-659) skips empty lists and increments batch_free once for every list it looks at. The second (lines 665-672) then frees up to batch_free pages from the tail of the non-empty list it stopped on (list->prev, the coldest page). While all three lists are non-empty, each visit therefore frees a single page. Every empty list skipped on the way lets the next non-empty list give up one more page, and once batch_free reaches MIGRATE_PCPTYPES the current list is treated as the only non-empty one and everything still owed is freed from it (lines 661-663). See also [2].

4.3 How hot and cold pages are allocated

When an order-0 page is allocated, the allocator first finds a suitable zone and then uses the requested migratetype to pick a hot/cold list (each zone has, for every CPU, three hot/cold lists, corresponding to MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE and MIGRATE_MOVABLE). If a hot page is wanted, the page at the head of the list is taken (the "hottest" one). If a cold page is wanted, the page at the tail is taken (the "coldest" one). Here is the relevant implementation:

1350 static inline
1351 struct page *buffered_rmqueue(struct zone *preferred_zone,
1352             struct zone *zone, int order, gfp_t gfp_flags,
1353             int migratetype)
1354 {
1355     unsigned long flags;
1356     struct page *page;
1357     int cold = !!(gfp_flags & __GFP_COLD);
1358
1359 again:
1360     if (likely(order == 0)) {
1361         struct per_cpu_pages *pcp;
1362         struct list_head *list;
1363
1364         local_irq_save(flags);
1365         pcp = &this_cpu_ptr(zone->pageset)->pcp;
1366         list = &pcp->lists[migratetype];
1367         if (list_empty(list)) {
1368             pcp->count += rmqueue_bulk(zone, 0,
1369                     pcp->batch, list,
1370                     migratetype, cold);
1371             if (unlikely(list_empty(list)))
1372                 goto failed;
1373         }
1374
1375         if (cold)
1376             page = list_entry(list->prev, struct page, lru);
1377         else
1378             page = list_entry(list->next, struct page, lru);
1379
1380         list_del(&page->lru);
1381         pcp->count--;
1382     } else {

Lines 1367-1373: if the list for the requested type is empty, a batch of pcp->batch pages is requested from the buddy allocator again via rmqueue_bulk.

References:
1. http://www.win.tue.nl/~aeb/linux/lk/lk-9.html
2. http://blog.chinaunix.net/uid-25845340-id-3039220.html
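As a cross-check of the figures quoted in section 3 (batch = 31, high = 186), the arithmetic of zone_batchsize and setup_pageset can be replayed in user space. This is a minimal sketch, not kernel code: the `model_*` names are hypothetical, the page size is passed as a parameter instead of using PAGE_SIZE, and `model_rounddown_pow_of_two` is a hand-written stand-in for the kernel helper of similar name.

```c
#include <assert.h>

/* User-space stand-in for the kernel's rounddown_pow_of_two(). */
static unsigned long model_rounddown_pow_of_two(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0);
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Mirrors the CONFIG_MMU branch of zone_batchsize() step by step. */
static unsigned long model_zone_batchsize(unsigned long present_pages,
                                          unsigned long page_size)
{
    unsigned long batch;

    assert(page_size > 0);
    batch = present_pages / 1024;           /* about 1/1000 of the zone */
    if (batch * page_size > 512 * 1024)     /* but at most 512 KB worth */
        batch = (512 * 1024) / page_size;
    batch /= 4;
    if (batch < 1)
        batch = 1;
    /* Clamp to a 2^n - 1 value, as the kernel comment explains. */
    return model_rounddown_pow_of_two(batch + batch / 2) - 1;
}

/* Mirrors "pcp->high = 6 * batch" in setup_pageset(). */
static unsigned long model_pcp_high(unsigned long batch)
{
    return 6 * batch;
}

/* Mirrors "pcp->batch = max(1UL, 1 * batch)" in setup_pageset(). */
static unsigned long model_pcp_batch(unsigned long batch)
{
    return batch > 1 ? batch : 1;
}
```

For a 512 MB zone with 4 KB pages (131072 pages), `model_zone_batchsize(131072, 4096)` returns 31 and `model_pcp_high(31)` returns 186, matching the text. A 64 MB zone (16384 pages) yields a batch of 3, and `model_pcp_batch(0)` returns 1, which is the value the boot pageset ends up with.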

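The list discipline described in sections 4.2 and 4.3 (hot pages enter and leave at the head, cold pages at the tail, and a batch is drained from the tail once count reaches high) can also be sketched outside the kernel. The model below is an illustration only: it collapses the three migratetype lists into one, uses a small circular buffer instead of list_head, and all `model_*` names and `MODEL_CAP` are hypothetical.

```c
#include <assert.h>

#define MODEL_CAP 64    /* capacity of the toy list; arbitrary, not a kernel value */

/* Hypothetical single-list model of one per-CPU page list. */
struct model_pcp {
    int pages[MODEL_CAP];   /* circular buffer of page numbers */
    int head;               /* slot of the hottest page */
    int count;              /* pages currently on the list */
    int high;               /* drain threshold, like pcp->high */
    int batch;              /* pages drained at once, like pcp->batch */
    int drained;            /* pages handed back to the "buddy system" */
};

static void model_init(struct model_pcp *p, int high, int batch)
{
    assert(high > 0 && high <= MODEL_CAP);
    assert(batch > 0 && batch <= high);
    p->head = 0;
    p->count = 0;
    p->high = high;
    p->batch = batch;
    p->drained = 0;
}

/* Remove and return the page at the tail (the coldest end). */
static int model_pop_tail(struct model_pcp *p)
{
    p->count--;
    return p->pages[(p->head + p->count) % MODEL_CAP];
}

/* Remove and return the page at the head (the hottest end). */
static int model_pop_head(struct model_pcp *p)
{
    int pfn = p->pages[p->head];

    p->head = (p->head + 1) % MODEL_CAP;
    p->count--;
    return pfn;
}

/* Free a page: hot pages go to the head, cold pages to the tail. */
static void model_free(struct model_pcp *p, int pfn, int cold)
{
    if (cold) {
        p->pages[(p->head + p->count) % MODEL_CAP] = pfn;
    } else {
        p->head = (p->head + MODEL_CAP - 1) % MODEL_CAP;
        p->pages[p->head] = pfn;
    }
    p->count++;
    if (p->count >= p->high) {          /* too many cached: drain a batch */
        for (int i = 0; i < p->batch; i++) {
            model_pop_tail(p);          /* the coldest pages leave first */
            p->drained++;
        }
    }
}

/* Allocate a page; returns -1 when empty (the kernel would refill instead). */
static int model_alloc(struct model_pcp *p, int cold)
{
    if (p->count == 0)
        return -1;
    return cold ? model_pop_tail(p) : model_pop_head(p);
}
```

With high = 6 and batch = 2, freeing pages 1, 2, 3 as hot and page 9 as cold leaves the list ordered 3, 2, 1, 9, so a hot allocation returns 3 and a cold allocation returns 9. Refilling is deliberately left out, since in the kernel it goes through rmqueue_bulk and the buddy allocator.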
Note (2012.8.30): added point 2.3 to section 2, on the benefits of hot and cold pages.
