15:04:53 #startmeeting oVirt Infra
15:04:53 Meeting started Mon Dec 23 15:04:53 2013 UTC. The chair is knesenko. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:04:53 Useful Commands: #action #agreed #help #info #idea #link #topic.
15:05:00 #chair ewoud dcaro eedri
15:05:00 Current chairs: dcaro eedri ewoud knesenko
15:05:03 orc_orc: here ?
15:06:42 #topic Hosting
15:06:47 hello guys
15:07:22 so I think rackspace is a big issue to talk about now
15:07:28 ewoud: yes
15:07:52 I don't even looking on their answer on the ticket they are trying to handle for few month
15:08:00 not worth our time
15:08:06 knesenko: Here, too. Running late
15:08:13 I think eedri sent a good start, but I'm having a hard time deciding because I don't know budgets constraints etc
15:08:15 since we are planning to move to another infra provider
15:08:25 #chair bkp
15:08:25 Current chairs: bkp dcaro eedri ewoud knesenko
15:08:32 hello bkp
15:08:34 ewoud, i already have a proposal for engine VM and storage server
15:08:52 ewoud, i wanted to wait for other suggestions of the hypervisors specs
15:09:01 bkp: maybe you should introduce yourself to ewoud ? :)
15:09:14 that would be helpful :)
15:09:22 ewoud: This is Brian Proffitt, the new Community Manager
15:09:45 bkp, welcome brian!
15:09:50 Thanks!
15:09:53 bkp: welcome
15:09:59 welcome aboard !
15:10:09 ok lets continue
15:10:19 bkp, welcome
15:10:20 I think that starting with NFS is good
15:10:48 knesenko, i have a quote for storage server
15:10:48 without knowing anything about the budget, there's not much more I can say than 'bigger is better' ;)
15:10:51 * eedri looking for it
15:10:52 is there any updated instructions for installation from source for ovirt 3.3 or later?
15:11:02 ewoud, yea, that's always true
15:11:10 :)
15:11:13 ewoud, i'm trying to really understand our needs though
15:11:34 ewoud, for example i wouldn't got with 96 MEM, but rather with 64
15:11:39 eedri: we need powerful hypervisors to run jenkins slaves
15:11:40 ewoud, if we can upgrade it later
15:11:59 ewoud, every cost we'll involve might affect other services we might not get
15:12:09 ewoud, i'm not sure exactly what is the budget
15:12:31 eedri: we could build slaves more dedicated to some tasks to save memory
15:12:45 eedri: maybe we should sync with the guy who know more about the budget we have, and then we can discuss what iss the best we can get with the budget we have
15:12:47 ?
15:13:03 eedri: for example, we don't need to build java on all platforms IMHO, so having big slaves for all platforms might not be needed
15:13:15 eedri: where vdsm needs more platforms but less memory
15:13:26 skyy111: We're having a meeting right now, can you give us a bit? Thanks!
15:13:36 ewoud, yea, i also thought of that
15:14:07 ewoud, but i think there is a value in having the same hardware for all hypervisors
15:14:19 ewoud, since we will need them in the same cluster for maintance/migration/etc..
15:14:34 eedri: hypervisors: yes; guests: no
15:14:48 ewoud, yes, i'm only talking on hypervisors
15:14:56 ewoud, the bare metal hosts
15:15:06 so I like the proposal of having 2 or 3 bare metal hosts + SAN/NAS
15:15:10 ewoud, guests we'll handle ourselves after we'll set the hypervisors
15:15:18 ewoud, i think 3 is a much
15:15:19 must
15:15:23 ewoud, for maintanance
15:15:27 eedri: +1
15:15:30 given NetApp is a sponsor, could we get them to sponsor some HW?
15:15:59 ewoud, that's can be an option, but the current propoal for softlayer isnt relevant to him
15:16:06 maybe bkp can help with that
15:16:31 ewoud: why NetApp ?
15:16:46 As far as getting the budget numbers, or getting budget lined up?
15:16:59 ewoud, this is the storage server i got a qoute for : https://www.softlayer.com/Sales/orderQuote/7bff92a098d78da6ed795c5832d99738/1052649
15:17:01 knesenko: a sponsor can generally provide better HW for the same budget
15:17:37 ewoud: yes I understand that ... but why NetApp ... maybe we can ask for EMC as well ?
15:17:43 that said, I have little knowledge of SANs
15:18:13 https://www.softlayer.com/services/storagelayer/quantastor-servers
15:18:19 knesenko: NetApp came to mind first because I read most about them and oVirt on blogs, but I have no preference
15:18:23 quantastor server is anyone is familiar with it
15:18:31 ewoud: ah ok :)
15:18:49 ewoud, knesenko guys, let's keep to the real and actual provider we have now
15:19:09 ewoud, knesenko we don't know if any other those companies is will to provide support for ovirt yet,
15:19:30 focusing on migrating as soon as we can from existing vendor
15:19:36 eedri: so this is a proposal for a storage server ?
15:19:39 anyone has comments on the suggested storage server?
15:19:44 the link you sent ?
15:19:52 * knesenko is looking
15:19:53 knesenko, yes, initial 2TB storage nfs
15:20:09 on quanta store
15:20:38 eedri: I am not a storage expert, but it depends on a disks we have
15:20:42 currently runs on SATA disks
15:20:45 Newbie question: how does this compare to what we were paying before?
15:21:07 of course SSD will be faster, but much more expensive
15:21:11 and SAS?
15:21:13 bkp, in terms of cost?
15:21:26 bkp, or performance?
15:21:33 Yes, to start. Config too...
15:22:41 eedri: SAS = SATA * 2 in price
15:22:44 eedri: +-
15:23:03 * orc_orc rolls in late to the office
15:23:17 bkp, softlayer should be cheaper than current vendor afaik
15:23:44 bkp, also, service should be better, they a usefull live chat option that proved helpfull when i was digging for proposals
15:24:03 knesenko, i'm not sure they even offer sas.
15:24:11 orc_orc, here?
15:24:12 eedri: more a general question than the storage: suppose we do run into limitations, how easy/fast can we switch?
15:24:20 Right, and from what I've picked up, it's going to be better service from the get-go.
15:24:26 orc_orc, i remember you wanted to comment on the hardware specs
15:24:48 ewoud, i think they offer very flexible upgrades
15:24:49 #chair orc_orc
15:24:49 Current chairs: bkp dcaro eedri ewoud knesenko orc_orc
15:25:15 ewoud, for example if we choose a server that can support up to 256 GB mem, no issues with upgrading
15:25:28 ewoud, also, each storage server supports up to 12-24 disks
15:25:28 eedri: and how long would the contract be?
15:25:52 not that I expect the same thing we have now at rackspace, but then again, we didn't expect it at rackspace either
15:25:53 ewoud, so even if we choose one disk, we can monitor it and change to a better one laster one
15:26:13 eedri: I run a public colo / hosting business in a high end datacenter
15:26:19 ewoud, from experienee from other groups in $company, it seems the they are safisfied
15:26:29 eedri: ok, sounds good
15:26:34 orc_orc, did you happen to see the email i sent on the specs?
15:26:55 orc_orc, i'm trying to get a ballpark estimation on which servers we should use for the hypervisors
15:27:02 yes * I did
15:27:05 orc_orc, which storage server we should use
15:27:17 usually a hoting center does not care so much about the hardware as the following:
15:27:20 the RUs used
15:27:23 the BW used
15:27:27 the A used, and
15:27:28 the '
15:27:36 'hands time' needed
15:27:54 the customer specifies needs and they return a price
15:28:01 orc_orc, i'm trying to think on it form a CI point of view, what our slaves will need
15:28:09 sometimes IP leasing if the custoer does not have an ASN block to use
15:28:20 orc_orc, not sure i follow
15:28:30 eedri: how capacty constrained are we presently from usage stats?
15:28:50 orc_orc, pretty constrained from serveral points
15:28:51 as I understand it, R'03 was needed for space, not compute strength
15:29:01 eedri: what other points?
15:29:03 orc_orc, 1st i would say that using local storage for all vms pretty much lower the performance
15:29:15 orc_orc, and limits us from adding more vms
15:29:19 orc_orc: I think we're more interested in iops than raw storage, but we are low on storage currently
15:29:28 orc_orc, so one of the most important issues is storage imo
15:29:36 eeI have heard that said -- but I do not find a formal study indicating local store is sloter than, say, NFS on like loads
15:29:43 slower*
15:29:54 I'd expect local storage to be faster tbh
15:29:58 orc_orc, so maybe its worth investing more in NAS/SAN solution than taking the best servers for hypervisors
15:29:59 no network latency
15:30:01 as do I
15:30:12 but I am engaged in a study of this atm
15:30:32 ewoud, maybe the disks that were used were not fast enough then
15:30:33 we are rarely compute constrained
15:30:42 orc_orc, ewoud there is also the specific jobs on ci that needs cpu
15:30:47 orc_orc, ewoud like findbugs for e.g
15:30:53 eedri: cpu, or ram to work in?
15:30:55 orc_orc, ewoud or any other static analysis
15:31:11 we find ram constraints are the major chokepoint
15:31:16 orc_orc, i think there are some more mem oriented like maven builds with gwt compilation
15:31:32 for sure we need good HDs .... jenkins slaves creates a lot of IO
15:31:36 and other cpu cusuming like findbugs
15:32:13 eedri: as to commercial backend SAN servers, is this saying that iscsi, nfs, and gluster are 'less good' choices', or more the 'brand name effect is driving the desire?
15:32:45 orc_orc, i wouldn't care the any brand.. as long as it's performance is good enough for us
15:32:47 ... so I thought the email specifying hardware was a bit early in the process
15:32:47 orc_orc: Error: ".." is not a valid command.
15:32:50 though if we're flexible in upgrades I'm leaning towards starting sooner and upgrade in a month or 2 if needed
15:32:50 orc_orc, and maintance is low
15:32:53 ... so I thought the email specifying hardware was a bit early in the process
15:33:07 orc_orc, what do you suggest?
15:33:25 that said, generally upgrading a hypervisor is easier because downtime is more acceptable than your SAN
15:33:30 eedri: first, I think this interactive discussion is very good, compared to email
15:33:41 orc_orc, i agree
15:33:45 perhaps we should ask for a conference bridge and discuss in real time
15:34:00 orc_orc, i can arrange a conf call if needed
15:34:15 bkp: knows the model of the weekly LSB conference call, and thoase are very productive in knocking out issues
15:34:35 eedri: I would ask for that .. the holiday schedule hurts a bit, but
15:34:42 orc_orc, i'd really like to make the right choise here for ovirt infra going forward
15:34:50 eedri: ++
15:35:09 orc_orc, and not revisit again a wrong infra layout
15:35:14 I agree, with the caveat of the holidays
15:35:18 so testing and surveying where the hurt points before deciding is a good thing
15:35:21 agree
15:35:45 dcaro, can you do a survery on our current bottle necks?
15:36:02 is sysstat running on all units to get real stats?
15:36:03 dcaro, assuming checking ovirt-engine webadmin + stats from awstats
15:36:16 web stats may not tell the tale
15:36:17 dcaro, or other monitoring tool we have running
15:36:32 orc_orc, we need something like cacti/graphite
15:36:48 we also track traffic in and out, and disk IP, and 'free' load
15:36:59 eedri: setting up some performance monitoring is an open issue, I can try to focus on that
15:37:02 IO*
15:37:16 orc_orc, i can tell you from current observation that jobs in ci takes longer than on other systems we have
15:37:30 orc_orc, even on other VMs, not just bare metal
15:37:39 eedri: but this may imply just that the CI tool is sluggish ;)
15:38:02 orc_orc, well.. i'm running very similar jobs on a differnt env, with much faster results
15:38:08 eedri: I am pretty sure its because we are running on the local disks
15:38:20 eedri: great, in that this permits comparing to find 'choke points'
15:38:47 eedri: can you set up your CI environment in an EL6 environment?
15:39:01 orc_orc, we do run it on RHEL6
15:39:03 I am running a test w the LSB atm on this and we can add your
15:39:06 load ...
15:39:22 I ill contact you out of band with details then
15:39:22 orc_orc, it's not public though
15:39:30 my tool is quite private
15:39:38 orc_orc, ok
15:40:21 #action conference call to discuss COLO needs to be scheduled by eedri
15:40:33 so we agreed that we need to do some research of performance chock points before moving forward?
15:40:41 or we'll discuss it on the conf call?
15:40:59 I will look at graphite and the other later today and know more by then
15:41:58 when is a good date to set up the call., with all the holidays
15:42:36 post 2 jan, sadly, I think ... isn't RH already on shutdown til EOY?
15:42:45 eedri: that's always hard
15:43:01 orc_orc, yea, most of it, exluding israel though
15:43:50 orc_orc: Pretty much, starting tomorrow
15:44:27 ok
15:45:05 so who's taking a lead on finding the chockpoints for current ci infra?
15:45:30 we'll need i guess a week worth of mem/cpu/io/network stats of ci jobs on current slaves?
15:45:50 yes ...
15:45:55 eedri: I converted your mail into an etherpad: http://etherpad.ovirt.org/p/Hardware_requirements_infra
15:46:08 would that be a good to use as a working document?
15:46:18 ewoud, +1
15:46:21 we probably will need to monitor our slaves and hypervisors here - http://monitoring.ovirt.org/icinga/
15:46:38 ewoud, i would add the relevant links there to the current servers/proposals
15:47:22 google docs and etherpad are poor for revision history .. if simultaneous editting is not needed, perhaps the wiki should be preferred?
15:47:35 knesenko, can we can it public?
15:47:49 knesenko, it needs a login to view that
15:48:17 eedri: mmmm .... I think we can get a public ro permissions
15:48:36 orc_orc, personally i find adding/updating the wiki bit more cumbersome for collaborating on a in progress issue
15:48:45 * nod *
15:48:47 orc_orc, wiki is more for documenting a final doc imo
15:49:10 agreed: etherpad is working document which is then finalized into a wiki
15:49:52 also, I've seen that in the past we've been discussing some issues for a long time
15:50:05 can we help speeding it up by setting a general time frame with deadlines?
15:51:05 ewoud: some projects doing time based releases turn out poor product ... I think a general 'rule' is dangerous
15:51:16 for example, we want to decide on the HW before mid Januari and have the basics installed by mid February (dates are made up)
15:51:32 ewoud, i agree
15:51:45 ewoud, we should set a basic deadlines and try to follow on it
15:51:45 orc_orc: but infra isn't really a product with releases and this is more project management
15:51:49 ewoud, and not leave it in the air
15:52:45 orc_orc, i understand where ewoud comes from
15:52:59 orc_orc, we might have open issues that are taking too long to be resolved sometimes
15:53:12 oh. I do too -- that is part of why I started logging the r'03 updates weekly, so it would become a barb to action
15:53:13 eedri: exactly
15:53:27 and it feels like this weekly meeting might not be enough to push things forward
15:54:00 so 1st i don't think that not going over tasks weekly cause we've reached 18:00 is good pratice
15:54:09 this results in forever procastinating
15:54:37 ewoud, we should either appoint someone to make sure there is progress made, or even do a rorating montly
15:54:55 at $work we've made standard filters on issues to show which ones are open too long
15:55:00 we could do the same with trac reports
15:55:11 or set a ground rule of at least going over some tracs during the meeting
15:55:13 eedri: or simply have an agenda whre new bugs are triaged to priority, and old open items come first
15:55:23 +1
15:55:40 orc_orc, ewoud do you think a different ticketing system will help? or its not the case
15:55:52 eedri: the problem is not the tool, it is the process
15:56:02 eedri: I think all ticketing system can be made to work, but as orc_orc said the process
15:56:21 the tool is the implementation of the process
15:57:42 so now we've gone a bit offtopic: what do we decide on the hardware?
15:58:05 I saw some suggestions of monitoring
15:58:23 and learning the budget we have to work with
15:59:38 having real stats on local store builds network store ones, per eedri use case
16:01:03 OVF on any domain feature overview starting now
16:01:21 http://www.ovirt.org/Feature/OvfOnWantedDomains
16:01:46 eedri: knesenko it seems our time is up; can we finish it with some action items?
16:01:59 ewoud, there was an action item on me setting a conf call
16:02:09 I think we should monitor our current infra ... to see what we have now
16:02:11 ewoud, we need action item on who's doing the stats analysis
16:02:14 what do you think ?
16:02:20 eedri: and I am composing an OOB email to you atm
16:02:21 I can do
16:02:30 knesenko, like i said, we need to run a week long analysis
16:02:41 knesenko, on io/net/cpu/mem on our servers
16:02:57 knesenko, and find the chock points.. then we can have the conf and talk about the needs of the infra
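[Editor's note: to make the week-long stats collection discussed above concrete, here is a minimal sketch of a sampler that appends one CSV row of CPU, memory, disk I/O and network counters per interval; left running for a week on each slave and hypervisor it would provide raw data for the choke-point analysis. It assumes Python with the psutil library is available on the hosts, which was not confirmed in the meeting, and the interval and output path are placeholders.]

    #!/usr/bin/env python
    # Sketch of a resource sampler for the week-long analysis; psutil is
    # an assumed dependency, INTERVAL and OUTPUT are placeholders.
    import csv
    import time

    import psutil

    INTERVAL = 60                            # seconds between samples
    OUTPUT = "/var/log/resource-stats.csv"   # placeholder path

    def sample():
        """Collect one row of cpu/mem/disk-io/network counters."""
        disk = psutil.disk_io_counters()
        net = psutil.net_io_counters()
        return [
            int(time.time()),
            psutil.cpu_percent(interval=None),   # CPU % since previous call
            psutil.virtual_memory().percent,     # memory in use, %
            disk.read_bytes,
            disk.write_bytes,
            net.bytes_sent,
            net.bytes_recv,
        ]

    def main():
        with open(OUTPUT, "a") as out:
            writer = csv.writer(out)
            while True:
                writer.writerow(sample())
                out.flush()
                time.sleep(INTERVAL)

    if __name__ == "__main__":
        main()

[The disk and network fields are cumulative byte counters, so the load caused by a given CI job is the difference between the rows bracketing it.]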
16:03:11 #action knesenko add jenkins slaves and hypervisors to http://monitoring.ovirt.org/icinga/
16:03:35 ok what else do we need ?
16:04:05 knesenko, is icigna more like nagios or cacai?
16:04:12 eedri: nagios
16:04:25 knesenko, ewoud so i don't see how that helps us
16:04:34 knesenko, ewoud we need to monitor performance..
16:04:41 knesenko, ewoud like cacti or graphite do
16:05:31 isn't nagios for monitoring services?
16:05:48 eedri: yes, cacti, munin or graphite should be better tools
16:06:15 can we install it on the same machine we are running icinga ?
16:07:34 ok guys , let me handle it ...
16:07:36 Enter the etherpad for the discussion: http://etherpad.ovirt.org/p/OvfOnAnyDomain
16:07:54 I will install one of these tools and will monitor
16:08:09 any objects I will handle it ?
16:08:36 knesenko, +1
16:09:46 #action knesenko install one of cacti, munin or graphite
16:10:25 do we need more action items ?
16:11:07 eedri: ?
16:11:10 ewoud: ?
16:11:11 knesenko, we do, but we need to revise the way we doi the meeeings
16:11:17 knesenko, like we talked
16:12:10 knesenko, we need to think how make things happen faster
16:12:17 knesenko, in terms of open tickets, etc...
16:12:20 it's hard to meet in person, but maybe FOSDEM can be a good place to talk about it?
16:12:47 or cfgmgmtcamp.eu which is 3 & 4 February
16:12:51 ewoud, i'd love that, but unfourtunately i wont be there due to a test i hdave at the same day
16:12:55 ewoud, i think dcaro will be there
16:13:41 ewoud: yep :)
16:13:54 or http://community.redhat.com/blog/2013/12/announcing-infrastructure-next/ even
16:14:51 ok guys
16:14:56 I think we are done here
16:15:00 agreed
16:15:06 * nod *
16:15:12 Have a nice holiday !
16:15:29 happy new year !
16:15:30 :)
16:15:51 #endmeeting
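[Editor's note: for the "install one of cacti, munin or graphite" action item, a minimal sketch of feeding one sampled value into Graphite, whose carbon daemon accepts plaintext "path value timestamp" lines on TCP port 2003. The carbon hostname and metric path below are placeholders, not existing oVirt infra addresses, and psutil is again an assumed dependency; which of the three tools to deploy was left open in the meeting.]

    #!/usr/bin/env python
    # Sketch: push one CPU sample to a Graphite/carbon plaintext listener.
    # CARBON_HOST and the metric path are placeholders.
    import socket
    import time

    import psutil

    CARBON_HOST = "graphite.example.org"   # placeholder host
    CARBON_PORT = 2003                     # carbon's default plaintext port

    def send_metric(path, value, timestamp=None):
        """Send a single 'path value timestamp' line to carbon."""
        timestamp = int(timestamp if timestamp is not None else time.time())
        line = "%s %s %d\n" % (path, value, timestamp)
        sock = socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5)
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    if __name__ == "__main__":
        host = socket.gethostname().replace(".", "_")
        send_metric("infra.%s.cpu_percent" % host,
                    psutil.cpu_percent(interval=1))

[Run from cron on every slave and hypervisor, this would give the per-host CPU/memory/IO/network graphs that cacti or munin also produce out of the box.]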