Displaying posts tagged Mission Control

Mission Control 1.0

Just a quick announcement that the first “production-ready” version of Mission Control went live yesterday, at this easy-to-remember URL:

https://missioncontrol.telemetry.mozilla.org

For those not yet familiar with the project, Mission Control aims to track release stability and quality across Firefox releases. It’s similar in spirit to arewestableyet and other crash dashboards, with the following new and exciting properties:

  • Uses the full set of crash counts gathered via telemetry, rather than the arbitrary sample that users decide to submit to crash-stats
  • Results are available within minutes of ingestion by telemetry (although be warned: initial results for a release always look bad)
  • The denominator in our crash rate is usage hours, rather than the probably-incorrect calculation of active-daily-installs used by arewestableyet (not a knock on the people who wrote that tool, there was nothing better available at the time)
  • We have a detailed breakdown of the results by platform (rather than letting Windows results dominate the overall rates due to its high volume of usage)

In general, my hope is that this tool will provide a more scientific and accurate idea of release stability and quality over time. There’s lots more to do, but I think this is a promising start. Much gratitude to kairo, calixte, chutten and others who helped build my understanding of this area.

The dashboard itself is an easier thing to show than to talk about, so I recorded a quick demonstration of some of the dashboard’s capabilities and published it on Air Mozilla:

link

Mission Control update

Yep, still working on this project. We’ve shifted gears somewhat from trying to identify problems in a time series of error aggregates to tracking somewhat longer term trends release over release, to fill the needs of the release management team at Mozilla. It’s been a good change, I think. A bit of a tighter focus.

The main motivator for this work is that the ADI (active daily install) numbers that crash stats used to provide as input to a similar service, AreWeStableYet (link requires Mozilla credentials), are going away and we need some kind of replacement. I’ve been learning about how this older system worked (this blog post from KaiRo was helpful) and trying to develop a replacement which reproduces some of its useful characteristics while also taking advantage of some of the new features that are provided by the error_aggregates dataset and the mission control user interface.

Some preliminary screenshots of what I’ve been able to come up with:

One of the key things to keep in mind with this dashboard is that by default it shows an adjusted set of rates (defined as total number of events divided by total usage khours), which means we compare the latest release to the previous one across the same time interval.

So if, say, the latest release is “59” and it’s been out for two weeks, we will compare it against the previous release (“58”) in its first two weeks. As I’ve said here before, things are always crashier when they first go out, and comparing a new release to one that has been out in the field for some time is not a fair comparison at all.
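To make the comparison concrete, here is a minimal Python sketch of an adjusted rate. The function name `adjusted_rate`, the tuple layout of the samples, the release dates and all of the numbers are assumptions made up for illustration; this is not the dashboard's actual code.

```python
from datetime import datetime, timedelta


def adjusted_rate(samples, release_time, window):
    """Events per 1000 usage hours over the first `window` after release.

    `samples` is an iterable of (timestamp, event_count, usage_hours)
    tuples. This layout is hypothetical, chosen only for the example.
    """
    events = 0
    hours = 0.0
    for timestamp, count, usage_hours in samples:
        # Only count data from the same stretch of each release's life.
        if release_time <= timestamp < release_time + window:
            events += count
            hours += usage_hours
    if hours == 0:
        return None  # no usage observed, so there is no meaningful rate
    return events * 1000.0 / hours


# Placeholder release dates: "59" has been out for two weeks.
release_58 = datetime(2018, 1, 1)
release_59 = datetime(2018, 3, 1)
window = timedelta(days=14)

samples_59 = [
    (release_59 + timedelta(days=1), 30, 10_000.0),
    (release_59 + timedelta(days=10), 20, 40_000.0),
]
samples_58 = [
    (release_58 + timedelta(days=1), 40, 10_000.0),
    (release_58 + timedelta(days=10), 20, 30_000.0),
    # More than two weeks after 58 shipped, so left out of the comparison.
    (release_58 + timedelta(days=30), 5, 50_000.0),
]

rate_59 = adjusted_rate(samples_59, release_59, window)
rate_58 = adjusted_rate(samples_58, release_58, window)
print(rate_59, rate_58)
```

The point of the window is that the late, quiet sample for “58” never enters the calculation, so both releases are judged on their own first two weeks.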

This adjusted view of things is still not apples-to-apples: the causality of crashes and errors is so complex that there will always be differences between releases which are beyond our control or even understanding. Many crash reports, for example, have nothing to do with our product but with third party software and web sites beyond our control. That said, I feel like this adjusted rate is still good enough to tell us (broadly speaking) (1) whether our latest release / beta / nightly is ok (i.e. there is no major showstopper issue) and (2) whether our overall error rate is going up or down over several versions (if there is a continual increase in our crash rate, it might point to a problem in our release/qa process).

Interestingly, the first things that we’ve found with this system are not real problems with the product but data collection issues:

  • we don’t seem to be collecting counts of gmplugin crashes on Windows anymore via telemetry
  • the number of content_shutdown_crashes is greater than the number of content_crashes, even though the former should be a subset of the latter

Data issues aside, the indications are that there’s been a steady increase in the quality of Firefox over the last few releases based on the main user facing error metric we’ve cared about in the past (main crashes), so that’s good. 🙂

If you want to play with the system yourself, the development instance is still up. We will probably look at making this thing “official” next quarter.

Derived versus direct

To attempt to make complex phenomena more understandable, we often use derived measures when representing Telemetry data at Mozilla. For error rates for example, we often measure things in terms of “X per khours of use” (where X might be “main crashes” or “appearance of the slow script dialogue”). I.e. instead of showing a raw count of errors we show a rate. Usually this is a good thing: it allows the user to easily compare two things which might have different raw numbers for whatever reason but where you’d normally expect the ratio to be similar. For example, we see that although the uptake of the newly-released Firefox 58.0.2 is a bit slower than 58.0.1, the overall crash rate (as sampled every 5 minutes) is more or less the same after about a day has rolled around:

On the other hand, looking at raw counts doesn’t really give you much of a hint on how to interpret the results. Depending on the scale of the graph, the actual rates could resolve to being vastly different:

Ok, so this simple tool (using a ratio) is useful. Yay! Unfortunately, there is one case where using this technique can lead to a really deceptive visualization: when the number of samples is really small, a few outliers can give a completely false impression of what’s actually happening. Take this graph of what the crash rate looked like just after Firefox 58.0 was released:

10 to 100 errors per 1000 hours, say it isn’t so? But wait, how many errors do we have in absolute terms? Hovering over a representative point in the graph with the normalization (use of a ratio) turned off:

We’re really only talking about something between 1 and 40 crash events over a relatively small number of usage hours. This is clearly so little data that we can’t (and shouldn’t) draw any kind of conclusion whatsoever.

Ok, so that’s just science 101: don’t jump to conclusions based on small, vastly unrepresentative samples. Unfortunately, due to human psychology, people tend to assume that charts like this are authoritative and represent something real, absent an explanation otherwise, and the use of a ratio obscured the one fact (extreme lack of data) that would have given the user a hint on how to correctly interpret the results. Something to keep in mind as we build our tools.
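One way to picture the problem is to carry the raw numbers along with the ratio. The Python sketch below is only an illustration: `describe_point`, the `min_usage_hours` cutoff and the sample values are all made up, and nothing here reflects how the dashboard is actually implemented.

```python
def rate_per_khour(crashes, usage_hours):
    """Crashes per 1000 hours of use."""
    return crashes * 1000.0 / usage_hours


def describe_point(crashes, usage_hours, min_usage_hours=1000.0):
    """Return the rate together with the raw numbers behind it.

    `min_usage_hours` is an arbitrary cutoff picked for this example;
    below it the point is flagged as having too little data to trust.
    """
    return {
        "rate": rate_per_khour(crashes, usage_hours),
        "crashes": crashes,
        "usage_hours": usage_hours,
        "low_data": usage_hours < min_usage_hours,
    }


# Invented numbers: a handful of crashes right after release versus a
# much larger sample a day or so later.
early = describe_point(crashes=4, usage_hours=50.0)
later = describe_point(crashes=400, usage_hours=200_000.0)

for label, point in (("early", early), ("later", later)):
    print(label, point)
```

Here the early point works out to a scary-looking 80 crashes per khour on the strength of just 4 crashes, which is exactly the fact the ratio on its own would hide.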

Maintaining metricsgraphics

Just a quick announcement that I’ve taken it upon myself to assume some maintainership duties of the popular MetricsGraphics library and have released a new version with some bug fixes (2.12.0). We use this package pretty extensively at Mozilla for visualizing telemetry and other time series data, but its original authors (Hamilton Ulmer and Ali Almossawi) have mostly moved on to other things, so there was a bit of a gap in getting fixes and improvements in that I hope to fill.

I don’t yet claim to be an expert in this library (which is quite rich and complex), but I’m sure I’ll learn more as I go along. At least initially, I expect that the changes I make will be small and primarily targeted to filling the needs of the Mission Control project.

Note that this emphatically does not mean I am promising to respond to every issue/question/pull request made against the project. Like my work with mozregression and perfherder, my maintenance work is being done on a best-effort basis to support Mozilla and the larger open source community. I’ll help people out where I can, but there are only so many working hours in a day and I need to spend most of those pushing my team’s immediate projects and deliverables forward! In particular, when it comes to getting pull requests merged, small, self-contained and logical changes with good commit messages will take priority.

Better or worse: by what measure?

Ok, after a series of posts extolling the virtues of my current project, it’s time to take a more critical look at some of its current limitations, and what we might do about them. In my introductory post, I talked about how Mission Control can let us know how “crashy” a new release is, within a short interval of it being released. I also alluded to the fact that things appear considerably worse when something first goes out, though I didn’t go into a lot of detail about how and why that happens.

It just so happens that a new point release (56.0.2) just went out, so it’s a perfect opportunity to revisit this issue. Let’s take a look at what the graphs are saying (each of the images is also a link to the dashboard where they were generated):

ZOMG! It looks like 56.0.2 is off the charts relative to the two previous releases (56.0 and 56.0.1). Is it time to sound the alarm? Mission control abort? Well, let’s see what happened the last time we rolled something new out, say 56.0.1:

We see the exact same pattern. Hmm. How about 56.0?

Yep, same pattern here too (actually slightly worse).

What could be going on? Let’s start by reviewing what these time series graphs are based on. Each point on the graph represents the number of crashes reported by telemetry “main” pings corresponding to that channel/version/platform within a 5 minute interval, divided by the number of usage hours (how long users have had Firefox open) also reported in that interval. A main ping is submitted under a few circumstances:

  • The user shuts down Firefox
  • It’s been about 24 hours since the last time we sent a main ping.
  • The user starts Firefox after Firefox failed to start properly
  • The user changes something about Firefox’s environment (adds an addon, flips a user preference)
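The per-point calculation described just before the list can be sketched in a few lines of Python. Treat this as a rough illustration only: a real main ping is a much richer structure than the hypothetical `(timestamp, crashes, usage_hours)` tuple used here, and the function names and sample values are invented.

```python
from collections import defaultdict
from datetime import datetime


def window_start(timestamp):
    """Round a datetime down to the start of its 5 minute window."""
    return timestamp.replace(
        minute=timestamp.minute - timestamp.minute % 5,
        second=0,
        microsecond=0,
    )


def crash_rates(pings):
    """Map each 5 minute window to its crashes per 1000 usage hours.

    `pings` is an iterable of (timestamp, crashes, usage_hours) tuples,
    a simplified stand-in for the data carried by main pings.
    """
    totals = defaultdict(lambda: [0, 0.0])
    for timestamp, crashes, usage_hours in pings:
        bucket = totals[window_start(timestamp)]
        bucket[0] += crashes
        bucket[1] += usage_hours
    return {
        start: crashes * 1000.0 / hours
        for start, (crashes, hours) in sorted(totals.items())
        if hours > 0  # skip windows with no reported usage
    }


# Invented pings for one channel/version/platform combination.
pings = [
    (datetime(2017, 10, 26, 12, 1), 1, 20.0),
    (datetime(2017, 10, 26, 12, 4), 0, 30.0),
    (datetime(2017, 10, 26, 12, 7), 2, 50.0),
]
rates = crash_rates(pings)
for start, rate in rates.items():
    print(start, rate)
```

Both the numerator and the denominator come from whatever pings happened to arrive in the window, which is why the conditions under which a ping gets sent matter so much for interpreting the result.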

A high crash rate either means a larger number of crashes over the same number of usage hours, or a lower number of usage hours over the same number of crashes. There are several likely explanations for why we might see this type of crashy behaviour immediately after a new release:

  • A Firefox update is applied after the user restarts their browser for any reason, including a browser crash. Thus a user whose browser crashes a lot (for any reason) is likely to update to the latest version sooner than a user whose browser doesn’t crash as much.
  • Inherently, any crash data submitted to telemetry after a new version is released will have a low number of usage hours attached, because the client would not have had a chance to use it much (because it’s so new).

Assuming that we’re reasonably satisfied with the above explanation, there are a few things we could try to do to correct for this situation when implementing an “alerting” system for mission control (the next item on my todo list for this project):

  • Set “error” thresholds for each crash measure sufficiently high that we don’t consider these high initial values an error (i.e. only alert if there are 500 crashes per 1k hours).
  • Only trigger an error threshold when some kind of minimum quantity of usage hours has been observed (this has the disadvantage of potentially obscuring a serious problem until a large percentage of the user population is affected by it).
  • Come up with some expected range of what we expect a value to be when a new version of firefox is first released and ratchet that down as time goes on (according to some kind of model of our previous expectations).
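As a thought experiment, the three options above could be combined along these lines. This is a sketch under heavy assumptions, not a design: apart from the 500 crashes per 1k hours figure mentioned above, every number is invented, `should_alert` is a hypothetical name, and the linear ratchet merely stands in for “some kind of model”.

```python
def should_alert(crashes, usage_hours, hours_since_release,
                 min_usage_hours=10_000.0,
                 initial_threshold=500.0,
                 steady_threshold=5.0,
                 settle_hours=72.0):
    """Decide whether a crash rate is worth alerting on.

    Thresholds are in crashes per 1000 usage hours. All default values
    are placeholders chosen for the example.
    """
    # Second option: stay quiet until enough usage has been observed.
    if usage_hours < min_usage_hours:
        return False

    rate = crashes * 1000.0 / usage_hours

    # First and third options: start with a very high threshold and
    # ratchet it down linearly until `settle_hours` after release.
    progress = min(hours_since_release / settle_hours, 1.0)
    threshold = initial_threshold + (steady_threshold - initial_threshold) * progress
    return rate > threshold


# The same rate (150 crashes per khour) at different points after release.
for hours in (0.0, 36.0, 72.0):
    print(hours, should_alert(3000, 20_000.0, hours))
```

With these made-up settings the same rate is tolerated at release time and halfway through the settling period, but triggers an alert once the threshold has ratcheted all the way down.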

The initial specification for this project called for just using raw thresholds for these measures (discounting usage hours), but I’m becoming increasingly convinced that won’t cut it. I’m not a quality control expert, but 500 crashes for 1k hours of use sounds completely unacceptable if we’re measuring things at all accurately (which I believe we are, given a sufficient period of time). At the same time, generating 20–30 “alerts” every time a new release went out wouldn’t be particularly helpful either. Once again, we’re going to have to do this the hard way…

If this sounds interesting and you have some react/d3/data visualization skills (or want to gain some), learn about contributing to mission control.

Shout out to chutten for reviewing this post and providing feedback and additions.

Mission Control: Ready for contributions

One of the great design decisions that was made for Treeherder was a strict separation of the client and server portions of the codebase. While its backend was rather complicated to get up and running (especially into a state that looked at all like what we were running in production), you could get its web frontend running (pointed against the production data) just by starting up a simple node.js server. This dramatically lowered the barrier to entry, for Mozilla employees and casual contributors alike.

I knew right from the beginning that I wanted to take the same approach with Mission Control. While the full source of the project is available, unfortunately it isn’t currently possible to bring up the full stack with real data, as that requires privileged access to the athena/parquet error aggregates table. But since the UI is self-contained, it’s quite easy to bring up a development environment that allows you to freely browse the cached data which is stored server-side (essentially: git clone https://github.com/mozilla/missioncontrol.git && yarn install && yarn start).

In my experience, the most interesting problems when it comes to projects like these center around the question of how to present extremely complex data in a way that is intuitive but not misleading. Probably 90% of that work happens in the frontend. In the past, I’ve had pretty good luck finding contributors for my projects (especially Perfherder) by doing call-outs on this blog. So let it be known: If Mission Control sounds like an interesting project and you know React/Redux/D3/MetricsGraphics (or want to learn), let’s work together!

I’ve created some good first bugs to tackle in the github issue tracker. From there, I have a galaxy of other work in mind to improve and enhance the usefulness of this project. Please get in touch with me (wlach) on irc.mozilla.org #missioncontrol if you want to discuss further.

Mission Control

Time for an overdue post on the mission control project that I’ve been working on for the past few quarters, since I transitioned to the data platform team.

One of the gaps in our data story when it comes to Firefox is being able to see how a new release is doing in the immediate hours after release. Tools like crashstats and the telemetry evolution dashboard are great, but it can take many hours (if not days) before you can reliably see that there is an issue in a metric that we care about (number of crashes, say). This is just too long: such delays unnecessarily hold back rolling out a release when it is safe (because there is a paranoia that there might be some lingering problem which we’re waiting to see reported). And if, somehow, despite our abundant caution a problem did slip through, it would take us some time to recognize it and roll out a fix.

Enter mission control. By hooking up a high-performance spark streaming job directly to our ingestion pipeline, we can now detect within moments whether firefox is performing unacceptably within the field according to a particular measure.

To make the volume of data manageable, we create a grouped data set with the raw count of the various measures (e.g. main crashes, content crashes, slow script dialog counts) along with each unique combination of dimensions (e.g. platform, channel, release).
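The real grouping is done by the spark streaming job, but the idea can be shown with a small plain-Python sketch. The field names (`platform`, `channel`, `version`, `main_crashes`, `content_crashes`) and the sample pings are assumptions made up for this example and are not the actual dataset schema.

```python
from collections import Counter

# Hypothetical field names, used only for illustration.
DIMENSIONS = ("platform", "channel", "version")
MEASURES = ("main_crashes", "content_crashes")


def aggregate(pings):
    """Sum each measure for every unique combination of dimensions."""
    totals = {}
    for ping in pings:
        key = tuple(ping[dimension] for dimension in DIMENSIONS)
        counts = totals.setdefault(key, Counter())
        for measure in MEASURES:
            counts[measure] += ping.get(measure, 0)
    return totals


# Invented pings: three raw records collapse into two grouped rows.
pings = [
    {"platform": "windows", "channel": "release", "version": "56.0",
     "main_crashes": 1},
    {"platform": "windows", "channel": "release", "version": "56.0",
     "content_crashes": 2},
    {"platform": "linux", "channel": "release", "version": "56.0",
     "main_crashes": 1, "content_crashes": 1},
]
totals = aggregate(pings)
for key, counts in totals.items():
    print(key, dict(counts))
```

However many pings arrive, the output only grows with the number of distinct dimension combinations, which is what keeps the data set small enough to query quickly.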

Of course, all this data is not so useful without a tool to visualize it, which is what I’ve been spending the majority of my time on. The idea is to be able to go from a top level description of what’s going on in a particular channel (release for example) all the way down to a detailed view of how a measure has been performing over a time interval:

This particular screenshot shows the volume of content crashes (sampled every 5 minutes) over the last 48 hours on windows release. You’ll note that the later version (56.0) seems to be much crashier than earlier versions (55.0.3), which would seem to be a problem except that the populations are not directly comparable (since the profile of a user still on an older version of Firefox is rather different from that of one who has already upgraded). This is one of the still unsolved problems of this project: finding a reliable, automatable baseline of what an “acceptable result” for any particular measure might be.

Even so, the tool can still be useful for exploring a bunch of data quickly and it has been progressing rapidly over the past few weeks. And like almost everything Mozilla does, both the source and dashboard are open to the public. I’m planning on flagging some easier bugs for newer contributors to work on in the next couple of weeks, but in the meantime if you’re interested in this project and want to get involved, feel free to look us up on irc.mozilla.org #missioncontrol (I’m there as ‘wlach’).
