We have a situation where a server with 8 cores initializes our service in 6 minutes, while another server with 32 CPU cores takes 35 minutes to initialize the same service. After profiling, we have seen that this is because two kernel APIs (get_counters and snmp_fold_field) try to collect data from each existing core, so as the number of CPU cores increases, the execution time takes longer than expected. In order to reduce the initialization time we thought of keeping the extra cores disabled and enabling them only after initialization. But with that approach, once we enable the cores, synchronization has to happen on the newly enabled cores since it is an SMP kernel.
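For reference, here is a minimal sketch of the disable/re-enable idea using Linux CPU hotplug. It assumes a kernel built with CONFIG_HOTPLUG_CPU, root privileges, and 32 CPUs numbered 0-31; note that cpu0 usually cannot be taken offline:

/* Sketch of "offline the extra cores, initialize, then bring them back".
 * Assumes CONFIG_HOTPLUG_CPU and root; the CPU range 8..31 is an example. */
#include <stdio.h>

static int set_cpu_online(int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%d\n", online);
    fclose(f);
    return 0;
}

int main(void)
{
    int cpu;

    /* Take cores 8..31 offline before the slow initialization phase ... */
    for (cpu = 8; cpu < 32; cpu++)
        set_cpu_online(cpu, 0);

    /* ... run service initialization here ... */

    /* ... then bring the cores back once initialization is done. */
    for (cpu = 8; cpu < 32; cpu++)
        set_cpu_online(cpu, 1);
    return 0;
}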
Can anyone suggest how to reduce the overhead caused by the increased number of CPU cores efficiently?
Instead of code, let me rather explain the initialization functionality of our user-defined system service. During initialization the service plumbs virtual interfaces on the configured IPs. To avoid overlapping-IP situations, for each configured IP it creates an internal IP, and all communication is done on the interfaces plumbed on the internal IP. As a packet reaches the system with a configured IP as its destination, mangling/NAT/routing table rules are applied on the system to manage it. An interface is also plumbed on the configured IP to avoid IP forwarding. Our issue is scale: with 1024 configured IPs, an 8-core machine takes 8 minutes, and a 32-core machine takes 35 minutes.
On further debugging done using system profiling, we saw the arptables/iptables kernel module consuming time in get_counters() and the IP kernel module consuming time in snmp_fold_field(). If we disable the arptables mangling rules, the time drops from 35 minutes to 18 minutes. I can share the kernel modules' call stacks collected with the profiler if needed.
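For context on why those two calls are sensitive to core count: both sum per-CPU copies of a counter on every read (the for_each_possible_cpu pattern in the kernel sources), so each rule or statistic read costs time proportional to the number of CPUs. A simplified user-space illustration of that pattern, not the actual kernel code:

/* Simplified illustration of the per-CPU aggregation done by
 * get_counters()/snmp_fold_field(): one slot per possible CPU,
 * summed on every read, so each read scales with the CPU count. */
#include <stdint.h>
#include <stdio.h>

#define NR_POSSIBLE_CPUS 32   /* hypothetical; the kernel uses the real count */

struct percpu_counter {
    uint64_t per_cpu[NR_POSSIBLE_CPUS];   /* one counter per CPU, updated locally */
};

static uint64_t fold_counter(const struct percpu_counter *c)
{
    uint64_t sum = 0;
    for (int cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)  /* ~for_each_possible_cpu() */
        sum += c->per_cpu[cpu];
    return sum;
}

int main(void)
{
    struct percpu_counter pkts = { .per_cpu = { [0] = 10, [5] = 7 } };
    printf("total packets: %llu\n", (unsigned long long)fold_counter(&pkts));
    return 0;
}

With 1024 configured IPs each installing its own rules, that per-read CPU loop runs many thousands of times during initialization, which is presumably where the 8-core versus 32-core gap comes from.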
A common bottleneck in applications utilizing multiple cores is uncontrolled/un-optimized access to shared resources. The biggest performance killer is the bus lock (critical sections), followed by accesses to RAM, accesses to the L3 cache, and accesses to the L2 cache. As long as a core is operating inside its own L1 (code and data) caches, the only bottlenecks will be due to poorly-written code and badly-organized data. When two cores need to go outside their L1 caches, they may collide when accessing their (shared) L2 cache. Who gets to go first? Is the data even there, or must we move down to L3 or RAM to access it? For every further level required to find the data, you're looking at a 3-4x time penalty.
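As a concrete illustration of "badly-organized data" (a generic example, not taken from the service in question): keeping each core's hot data on its own cache line avoids bouncing a shared line between cores.

/* Sketch: padding per-core data to cache-line size so updates stay
 * core-local instead of invalidating a shared line in every other
 * core's cache (false sharing). */
#include <stdint.h>

#define CACHE_LINE 64   /* assumed line size; query it at runtime in real code */

/* Bad: counters packed together; any core's update invalidates the
 * same cache line for all other cores. */
struct counters_shared {
    uint64_t per_core[32];
};

/* Better: one cache line per core, so each core updates its own line. */
struct counters_padded {
    struct {
        uint64_t value;
        char pad[CACHE_LINE - sizeof(uint64_t)];
    } per_core[32] __attribute__((aligned(CACHE_LINE)));
};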
Concerning your application, it seems to have worked reasonably well because eight cores can only collide to a limited degree when accessing the shared resources. With thirty-two cores the problem becomes more than four times as large, because you'll have four times the strain on the resources and four times as many cores to resolve who gets access first. It's like a doorway: as long as only a few people run in or out through the doorway, there's no problem. Someone approaching the doorway might stop a little to let the other guy through. Now you have a situation where many people are running in and out, and you'll not just have congestion, you'll have a traffic jam.
My advice is to experiment with restricting the number of threads (which translates into the number of cores) your application may utilize at any one time. The six-minute startup time on eight cores can serve as a benchmark. You may find that it executes faster on four cores than on eight, not to mention thirty-two.
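One low-effort way to run that experiment is to restrict the process's CPU affinity during startup rather than offlining cores. A sketch using the Linux sched_setaffinity() call; limiting to four CPUs is just an example value:

/* Sketch: pin the initializing process to a small CPU set so it cannot
 * fan out across all thirty-two cores during startup. Linux-specific;
 * the choice of CPUs 0-3 is only an example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; cpu++)   /* allow CPUs 0-3 only */
        CPU_SET(cpu, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* 0 = calling process */
        perror("sched_setaffinity");
        return 1;
    }

    /* ... run the startup phase here, then widen the mask again if desired ... */
    return 0;
}

The same restriction can be applied externally with taskset when launching the service, and the affinity mask can be widened again once initialization completes.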
If that is the case and you still insist on running faster on thirty-two cores than you do today, you're looking at an enormous amount of work, both making the SNMP-related code more efficient and organizing your data to lessen the load on shared resources. It is not for the faint of heart. You may end up re-writing the entire application.