The US has 11 separate ‘nations’ with entirely different cultures


If you want to better understand U.S. politics, history, and culture, AMERICAN NATIONS: A History of the Eleven Rival Regional Cultures of North America, by Colin Woodard, should be required reading.

He argues there isn’t, and never has been, one America, but rather several Americas. In American Nations, Woodard leads us through the history of our fractured continent, and the rivalries and alliances between its component nations. It’s a revolutionary take on America’s myriad identities, and how the conflicts between them have shaped our past and continue to mold our future.

How to create an agile organization

Rapid changes in competition, demand, technology, and regulations have made it more important than ever for organizations to be able to respond and adapt quickly. But according to a recent McKinsey Global Survey, organizational agility—the ability to quickly reconfigure strategy, structure, processes, people, and technology toward value-creating and value-protecting opportunities—is elusive for most.[1] Throughout the report, we will use “agile transformations” to refer to transformations that focus on organizational agility.

[1] This definition of organizational agility was given to respondents when they began the survey and reflects McKinsey’s proprietary definition, which is distinct from how we define organizations with agile software-development processes.

Choosing the Correct Architecture for an iOS Application

Design patterns and architectures are very important today in creating a successful and reliable application, and I recently stumbled upon the question of how to choose an architecture for an iOS application. The main objective here is to explain what features make a good architecture, and what having a good architecture can do for your application.
Why Do We Care About Choosing the Correct Architecture?
It’s because if we don’t care, then one day it will be a nightmare to find and fix bugs. We can probably ignore architecture in simple applications like “Hello World,” or ones with a small number of screens and lines of code, where you can simply write all your code in your View Controller. But what if it’s not just Hello World? Then we might end up with a huge pile of code in the View Controller, making it a “Messy View Controller” or “Massive View Controller.” This can happen even if we follow Apple’s MVC guidelines.
What Can Be Considered Good Architecture?
Let’s define some of the features of good architecture:
• Balanced distribution among entities.

• Measurability.

• Testability.

• Ease of use and maintainability.

Why Balanced Distribution Among Entities?
The easiest way to reduce complexity is to divide the responsibilities among different entities, following the Single Responsibility Principle: a class should have one and only one reason to change.
Let’s consider an example: a class that generates a PDF report and also displays it in a view. Such a class has two reasons to change. First, the data it loads from the web server or database may change; second, the format in which the report is presented in the user interface may change. These two concerns are entirely different from one another: the first is a substantive change, while setting up the user interface is entirely a cosmetic one. The Single Responsibility Principle says these are two separate responsibilities that should be independent.
It would be bad to couple two things that change for entirely different reasons, and that’s where the balance of distribution comes in: splitting the class leaves each part focused on a single concern and makes it robust.
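To make this concrete, here is a minimal sketch of such a split (Python used as a language-neutral illustration; all names are invented for the example):

class ReportDataSource:
    # Single responsibility: load the report's data.
    def load(self):
        # Stand-in for a web-server or database call.
        return [("2017-06-01", 42)]

class ReportFormatter:
    # Single responsibility: turn the data into a presentable format.
    def to_text(self, rows):
        return "\n".join("%s: %s" % (date, value) for date, value in rows)

rows = ReportDataSource().load()
print(ReportFormatter().to_text(rows))  # each class now has one reason to change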
Why Testability?
Testability does not mean testing. An application is testable when there is an effective test strategy to verify that an implementation conforms to its specification. Writing automated tests becomes very easy because, once you complete one composition root, it is ready to be tested independently. These tests help developers find and fix bugs before the application reaches users’ devices.
Why Ease of Use?
The less code you write, the fewer chances it has of going wrong. The more code you have, the more places there are for bugs. If the code is bad, it rots, so one should not reach for quick solutions while keeping one’s eyes closed to the later maintenance cost. Even a new developer joining the project should not feel lost in your code.
Now we have many design patterns that we can apply based on the requirements of our project, like:
• MVC

• MVP

• MVVM

MVC
Model-View-Controller is a widely used pattern for creating software applications. Most Cocoa applications and frameworks created by Apple have implemented this design pattern.

• The model is where your domain data resides; it manages reading and writing data and persisting state. Things like persistence, networking code, model objects, and parsers that manipulate the data live here.

• The view is the face of your application. It is responsible for the presentation (user interface) and handles user interactions.

• The controllers act as glue, or mediators, between the model layer and the presentation layer (Model and View). A controller alters the model in reaction to actions the user performs on the view, and updates the view in response to changes in the model.

Now, what is wrong with MVC? If we try building complex applications, it gets difficult. Over time, more and more code gets transferred to the controllers, making them more fragile and bloated. The controllers are so tightly coupled with the views that if we try to change something in the view, we have to go back to the controller and make changes there, too. This violates the balanced distribution among entities that we listed above.
Who comes to the rescue for MVC now?
MVP
MVP stands for Model-View-Presenter; Cocoa’s MVC promise is fulfilled here. It delivers testability and a clean separation of view and model.

• The model is the same as MVC’s model. It manages reading and writing data and persisting states. There is no change.

• Here, the view part includes both view and view controllers. The view here delegates user interactions to the presenter. The view in MVP is as dumb as possible and contains no logic that can query the model.

• The presenter contains the logic that handles user interactions. It does not have any UIKit dependencies. The responsibility of the presenter is to communicate with the model, convert the data to a user friendly format, then update the view.

Here in MVP, the view controllers are considered subclasses of the view, not of the presenter. The responsibility is now divided between the model and presenter, as the view is dumb and contains no logic, fulfilling the balanced-distribution feature. The code is much cleaner now, and we can easily write unit tests for the presenter.
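As a rough illustration of that division of labor, consider this bare-bones sketch (Python as a language-neutral stand-in for Swift; all names are invented):

class GreetingModel:
    # Same as MVC's model: owns the data.
    def fetch_name(self):
        return "world"  # stand-in for persistence or networking

class GreetingView:
    # Dumb view: renders what it is told and forwards user actions to the presenter.
    def __init__(self, presenter):
        self.presenter = presenter

    def show_greeting(self, text):
        print(text)

    def button_tapped(self):
        self.presenter.on_greet_requested()

class GreetingPresenter:
    # Talks to the model, formats the data, updates the view. No UIKit here.
    def __init__(self, model):
        self.model = model
        self.view = GreetingView(self)

    def on_greet_requested(self):
        self.view.show_greeting("Hello, %s!" % self.model.fetch_name())

presenter = GreetingPresenter(GreetingModel())
presenter.view.button_tapped()  # prints "Hello, world!"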
We cannot say that MVP is a perfect pattern, or that one should adopt MVP without weighing the requirements of the application. MVP is not suitable for simple, few-screen applications; it would lead to boilerplate code written just to wire up the view’s interface.
MVVM
MVVM is one of the latest of the Model-View patterns. It stands for Model-View-ViewModel. Here, the mediator is the ViewModel. It is an implementation of the observer design pattern, where any changes in the model are reflected in the view via the viewmodel. Nowadays, when we think of using MVVM, we think of Reactive Cocoa, although it is possible to build the MVVM pattern with simple bindings, too. MVVM includes:
• Model: This represents a data model that our app consumes. This class declares the properties to manage business data similar to the above two design patterns.

• View: It is similar to MVP’s. The MVVM view includes both views and view controllers. It simply renders data and delegates everything to the viewmodel.

• ViewModel: The viewmodel acts as a link between the model and view. It is responsible for wrapping up the model and preparing the observable data needed by the view.

One can use MVVM by keeping in mind some points:
1. The view is dumb; it should only know how to present data.

2. The controller knows nothing about the model.

3. The model knows nothing about the viewmodel.

4. The viewmodel owns the model.

5. The view controller owns the view.

6. The controller owns the view model and interacts with the model layer through the ViewModel.

MVVM satisfies almost all the features of good architecture. The responsibility is now distributed between the viewmodel and the view. One of the advantages of using MVVM is testability: the viewmodel has nothing to do with the view, so each entity can be tested separately.
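To make the binding idea concrete, here is a minimal sketch of MVVM with hand-rolled bindings (Python as a language-neutral illustration; the Observable class stands in for what Reactive Cocoa would provide, and all names are invented):

class Observable:
    # A tiny observable value: notifies subscribers whenever it changes.
    def __init__(self, value):
        self._value = value
        self._observers = []

    def bind(self, observer):
        self._observers.append(observer)
        observer(self._value)  # push the current value immediately

    def set(self, new_value):
        self._value = new_value
        for observer in self._observers:
            observer(new_value)

class CounterViewModel:
    # Owns the model (a plain counter here) and exposes view-ready, observable data.
    def __init__(self):
        self._count = 0
        self.display_text = Observable("Count: 0")

    def increment(self):
        self._count += 1
        self.display_text.set("Count: %d" % self._count)

vm = CounterViewModel()
vm.display_text.bind(print)  # the "view" just renders what it observes; prints "Count: 0"
vm.increment()               # prints "Count: 1"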
But as we know, “With great power comes great responsibility.” It is very easy to make a mess with Reactive Cocoa; if you do something wrong, you will have to spend a lot of time debugging and fixing the issue. This pattern should not be used for simple, limited-screen applications; otherwise it could end up making your code more complex and difficult to maintain for new developers.
I hope this article helps you understand the importance of choosing the correct architecture and design patterns based on the requirements and scale of your application.

A User Story Checklist

Sometimes it’s difficult to determine if something is a User Story or not. Not everything a software team does is a User Story.
In my other article, User Story Basics, I talk about the definition of a User Story. I will use that definition to make a checklist of characteristics a User Story must have.
1. Does the story describe new or changed functionality in the system under development?
2. Is the functionality of the story primarily of value to one or more key stakeholders (rather than to the developers themselves)?
3. Is there a fairly unique written description of the story? (This can be just a couple of words, like a title)
4. Have conversations about the story, that clarify story details, taken place?
5. Does the story have acceptance tests that convey and confirm when the functionality of the story is complete, correct, and present in the system under development?
If you cannot answer “Definitely, yes!” to all of the above 5 questions, then it is almost always… NOT a User Story.
Be sure you know who your key stakeholders are.
With respect to Question #2, I tend to define “developer” fairly broadly, like Scrum does. From the Scrum Guide: “[Development] Team members often have specialized skills, such as programming, quality control, business analysis, architecture, user interface design, or data base design…”. On some development teams, people who play a role of developer also play a role of “product support member” for the system under development. In this way, sometimes User Stories describe functionality that is important to someone doing production support. Other than this one exception, though, functionality and features that are primarily of importance to developers are almost always not considered User Stories. On the other hand, there is nothing wrong with a developer suggesting some functionality to a key stakeholder, and/or convincing a key stakeholder that some functionality will be primarily of value to that stakeholder. If that stakeholder agrees, then you can re-classify that functionality as a User Story and put it on the Product Backlog… so long as you can say “Yes” to all of the other questions above.
If something fails Question #2 because there is no direct key stakeholder value, then the work to be completed is likely a task for the Sprint Backlog, or some work that needs to be represented on the Dev Team Improvement Backlog. If the work is fairly small, you’ll probably lean towards a Sprint Backlog task. If the work will be done later or is fairly large in size, then you’ll probably lean towards the improvement backlog. This “non user story” work is often extremely important to accomplish, but it’s not something that the key stakeholders can really appreciate, and it’s not something that the Product Owner should have control over when ordering the backlog.
Should I use a User Story to represent bugs/defects in a system?
The short answer is “it depends.” If it is a legacy or deferred bug, then yes, and it should end up on the Product Backlog (story points assigned). If it is any other bug, then it should end up on the Sprint Backlog (no story points assigned) and is not a User Story. See One way to handle Bugs and Production Support in Scrum for the longer answer.
Looking at our checklist above, the reason that legacy and deferred bugs are User Stories is because they have business value. At some point, the legacy/deferred bug was not fixed (deferred) because there were higher priority business value items to be implemented. This includes bugs that have existed for a really long time that no one knew about — clearly the bug was not impacting the business in any meaningful way. In essence, the deferred bug fix became the equivalent of a deferred feature, and thus it now has business value attached to it.
Bugs where there was no business decision to defer are software defects. There is no business value in creating software defects. If anything, these kinds of defects detract from business value, which is really bad. These defects should be handled by putting them on the Sprint Backlog and fixing them immediately. Again, see more at the above link on handling bugs.

From Idea To Development: How To Write Mobile Application Requirements That Work

Why write requirements? Well, let’s imagine you want to produce a mobile app, but you don’t have the programming skills. So, you find a developer who can build the app for you, and you describe the idea to him. Surprisingly, when he showcases the app for the first time, you see that it is not exactly what you want. Why? Because you didn’t provide enough detail when describing the idea.
To prevent this from happening, you need to formalize the idea, shape it into something less vague. The best way to do that is to write a requirements document and share it with the developer. A requirements document describes how you see the result of the development process, thus making sure that you and the developer are on the same page.
In this article, we will outline the most common approaches to writing requirements documents. You will learn the basic steps of writing mobile application requirements and what a good requirements document looks like.
A carefully crafted requirements document eliminates ambiguity, thus ensuring that the developer does exactly what needs to be done. In addition, the document gives a clear picture of the scope of the work, enabling the developer to better assess the time and effort required. But how do we create a good document? Below are some tips that our mobile team at Polecat follows when crafting requirements.
We believe that a proper description of the idea should fit in one sentence. The sentence may include a core feature of the application, so that the reader understands instantly what the app is about. For a calorie-tracking mobile application, it could be, “An app to track calorie consumption to help those who care about their weight.”
Hint: Gua Tabidze shares a few models that others use to describe an idea.
Study basic navigation patterns, and describe your application in the same sequence that users would experience while exploring it. Once the idea part is done, describe the first steps of the application, such as the onboarding screens and user registration.
Then, move on to what goes next, such as the application’s home screen. This approach will give the reader a sense of what the user’s journey would look like.
At the end, don’t forget about basic features and screens such as the privacy policy and the “forgot password” feature.
Review existing applications in Apple’s App Store and Google Play, and refer to them when describing your app. If you like how the “forgot password” feature works in applications A and B, put it in the requirements document.
Focus on the features of the application, and skip details such as the color of a button. Most app users do not care about such details. What they do care about is whether your application helps to solve their problem. So, when writing requirements, concentrate on things that the user should be able to do in the app.
Convey which features are more important than others, so that the developer knows what to focus on first. We usually follow the MoSCoW method, marking items with “Must,” “Should,” “Could” and “Won’t” levels of priority.
Create wireframes of the screens of the application to accompany your textual description of them. If you have more than four wireframe screens, then drawing a screen map makes sense. We’ll show a screen map later in this article.
Now that you know how to write the requirements, you’ll need to choose an appropriate format for the document. There are a few basic formats for writing the requirements for a mobile app, such as a functional specification document (FSD), user stories and wireframes.
An FSD is probably the default format in the software development industry. It consists of a standard list of items that cover what the product should do and how it should do it.
Let’s take a simple calculator application and describe its features as an FSD:
• Application screen presents a digital keyboard with additional buttons for basic arithmetic operations (addition, subtraction, multiplication, division) and a result button (marked with “=”).
• Tapping on a digit button adds it to the display section of the screen. Each new digit is added to the right side of the number.
• Tapping on an operation button causes the current number shown in the display section to be added to the memory. It also clears the display section for the next number.
• Tapping on the display-result button combines the number in memory with the one in the display section according to the operation requested previously. The resulting number is shown in the display section of the screen.
As you can see, this format requires quite a detailed description of the product because the description will be used by both the business and the developers. It ensures that all participants are on the same page.
The person who composes the FSD should have strong experience in software development and should know the specifics of the mobile or other platform for which you are building. Also, because of the high level of detail required, creating and polishing such a document usually takes a decent amount of time.
A user story is less formal than an FSD yet still very powerful. It lists the things that the user can do in the application and is described from the user’s perspective. The document could also briefly explain why the user would want to do it, if that’s not obvious.
Let’s take our calculator example and add a few other features, describing them as a user story:
• As a user, I want to be able to change the number notation from decimal to exponential (and vice versa), so that I can work with very small or very large numbers.
• As a user, I want to be able to export a calculation’s history as a PDF file to share with my colleagues.
Because of the explanation, such a format provides not only a technical overview of the requirements, but also a good business case for them. Thus, if a feature is identified that is not critical to the business, you could decide either to completely remove it from the scope or to postpone it to a future release.
Using this format, you can easily split one story into multiple sub-stories to provide more detail. For example, we could split the PDF-exporting story into the following sub-stories:
• As a user, I want to be able to tap on the sharing button (top right of the screen) to see my options (sharing as PDF, text, image).
• Once I select a sharing option, I want to select the calculation timeframe that will be shared, using iOS’ date picker.
Because of the simplicity and non-technical nature of user stories, in most cases, a manager cannot simply ask a developer to implement a particular user story. Turning a story into a task that can be added to a task tracker requires further discussion and detailing between the manager and technical leader.
User stories have become one of the most convenient and popular formats because of their simplicity and flexibility.
Another way to outline an application’s requirements is to visualize them in sketches or wireframes. With iOS development, around 70% of development time is spent on interface implementation, so having all of the screens in front of you would give you a good sense of what needs to be done and the scope of the work.

Calculator wireframe example created in Balsamiq Mockups.
Creating a relevant set of wireframes for a mobile application requires you to know the basics of the user experience: how screens can be linked with each other, which states each screen can have, and how the application will behave when it is opened from a push notification.
Don’t be afraid to mix formats. By doing this properly, you take advantage of the strengths of each format. In our experience, mixing user stories and wireframes makes the most sense. While the user stories describe the features of the application and provide a business case for them, the wireframes show how these features would appear on the screens of the app. In addition, putting together user stories and wireframes would take you less time than writing an FSD, with all of its accompanying detail and descriptions of the interactions.
Start by sketching out wireframes for the application. Once the wireframes are done, add two or more user stories for each screen, describing what the user can do on that screen. We’ve found this approach to be the most appropriate for mobile application development, so we use it a lot.
I’ll take our What I Eat application as an example. I’ll compose the requirements document as if we were developing the application from scratch.
First, let’s formalize the idea using Steve Blank’s XYZ pattern: “We help X do Y by doing Z.” The premise of the application is to enable users to take control of what they eat during the day and of their calorie intake. According to the XYZ method: “What I Eat helps those who care about their weight to track calorie consumption by providing functionality for a simple meal log.”
As mentioned, mixing user stories and wireframes works best for us, so why not use them here?
The next step is to describe the What I Eat app as user stories, screen by screen. We’ll begin with the application’s start and home screen:
• As a user, I want to open the app and instantly see today’s meal log and the calories consumed.
• I want to be able to quickly add new meals and calories that I’ve just consumed.
• I also want quick access to the in-app calendar to view my meal logs from previous days.
To avoid any ambiguity, we’ll create a wireframe for this screen.

Home screen wireframe
As you can see, we weren’t able to put the “Add new meal” functionality on the home screen. Instead, we added a button to navigate to another screen that presents this feature. Now, we need to put together user stories for this new screen:
• I want to type in the name of the meal I’ve just had.
• Along with the name of the meal, I want to enter the number of calories.

Wireframe for add-meal screen
The home screen has a button that opens the calendar. Because there are many other calendar apps, checking their designs first makes sense. We like the iPhone’s default calendar app, so we will use it as a reference.
• As a user, I want to be able to quickly select a date in the current month.
• When selecting a date, I want to see a list of meals for that date below, like in the iPhone’s calendar app.
• I want to be able to switch to the next or previous month.
We will also put a piece of the iPhone calendar’s user interface in the wireframe.

Calendar wireframe
Finally, we need to add some settings to the app.
• I want to be able to enable and disable iCloud backups for my meal records.
• I want to be able to enable and disable daily push notifications that remind me to track my calorie intake.

Wireframe of settings screen
Phew! Almost done. The final step is to put the wireframes and user stories together in one document, with each wireframe and its respective story on its own page.

Wireframe and respective user story on one page. Download the full document (PDF, 0.2 MB).
In addition, we can draw a map to visualize how the screens are connected to each other. We’ll use RealtimeBoard for that.

Screen map for calorie-tracking iPhone application
In doing the screen map, we realize that there is no button to go to the settings screen, so we’ll add that to the home screen.
We have created two documents: a PDF with user stories and wireframes, and a screen map that complements the PDF. Together, they describe in detail what features the application should have. We can go ahead and send that to our developer. This time, the application the developer delivers will match our vision.
Generally speaking, writing a requirements document is mostly about conveying your vision to the rest of the team. Don’t limit yourself to the methodology described above. Feel free to experiment and find the solution that works best for you.
We’d love to hear about your approach to creating requirements documents. Please share your thoughts in the comments.
(da, vf, yk, al, il)

Getting Started With Scrapy

Scrapy is a Python-based web crawler that can be used to extract information from websites. It is fast and simple, and can navigate pages just like a browser can.
However, note that it is not suitable for websites and apps that use JavaScript to manipulate the user interface. Scrapy loads just the HTML. It has no facilities to execute JavaScript that might be used by the website to tailor the user’s experience.
Installation
We use Virtualenv to install scrapy. This allows us to install scrapy without affecting other system-installed modules.
Create a working directory and initialize a virtual environment in that directory.
Install scrapy now (for example, with pip install scrapy).
Check that it is working: running scrapy with no arguments prints usage information, showing the version of scrapy as 1.4.0 and listing the available commands, including:
Usage: scrapy <command> [options] [args]
genspider Generate new spider using pre-defined templates
runspider Run a self-contained spider (without creating a project)
Writing a Spider
Scrapy works by loading a Python module called a spider, which is a class inheriting from scrapy.Spider.
Let’s write a simple spider class to load the top posts from Reddit.
To begin with, create a file called redditspider.py and add the following to it. This is a complete spider class, though one which does not do anything useful for us. A spider class requires, at a minimum, the following:
• A name identifying the spider.
• A start_urls list variable containing the URLs from which to begin crawling.
• A parse() method, which can be a no-op as shown.
def parse(self, response):
    pass
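Pieced together, the complete minimal spider might look like the sketch below (the class name and start URL are reconstructions for illustration, not necessarily the article’s originals):

import scrapy

class RedditSpider(scrapy.Spider):
    name = 'redditspider'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        pass  # a no-op for now; we will fill it in shortly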
This class can now be executed as follows (for example, with scrapy runspider redditspider.py):
2017-06-16 10:42:34 [scrapy.middleware] INFO: Enabled item pipelines:
2017-06-16 10:42:34 [scrapy.core.engine] INFO: Spider opened
2017-06-16 10:42:34 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
Turn Off Logging
As you can see, this spider runs and prints a bunch of messages, which can be useful for debugging. However, since it obscures the output of our program, let’s turn it off for now.
Add these lines to the beginning of the file:
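The exact lines did not survive here; one common way to do it, using Python’s standard logging module, is:

import logging

# Silence scrapy's INFO/DEBUG chatter; only warnings and errors get through.
logging.getLogger('scrapy').setLevel(logging.WARNING)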
Now, when we run the spider, we should no longer see the distracting log messages.
Parsing the Response
Let’s now parse the response from the scraper. This is done in the method parse(). In this method, we use the method response.css() to perform CSS-style selections on the HTML and extract the required elements.
To identify the CSS selections to extract, we use Chrome’s DOM Inspector tool to pick the elements. From reddit’s front page, we see that each post is wrapped in a <div> element with the class thing.
So we select all div.thing elements from the page and work with them further.
for element in response.css('div.thing'):
We also implement the following helper methods within the spider class to extract the required text.
The following method extracts all text from an element as a list, joins the elements with a space, and strips away the leading and trailing whitespace from the result.
def a(self, response, cssSel):
    return ' '.join(response.css(cssSel).extract()).strip()
And this method extracts text from the first element and returns it.
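The body of this method was lost in extraction; judging from how it is invoked later, it presumably resembled this sketch:

def f(self, response, cssSel):
    # Extract text from the first matching element only.
    return response.css(cssSel).extract_first('').strip()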
Extracting Required Elements
Once these helper methods are in place, let’s extract the title from each Reddit post. Within div.thing, the title is available at div.entry>p.title>a.title::text. As mentioned before, this CSS selection for the required elements can be determined from any browser’s DOM Inspector.
for e in response.css('div.thing'):
    yield {
        'title': self.a(e, 'div.entry>p.title>a.title::text'),
    }
The results are returned to the caller using Python’s yield statement. The way yield works is as follows — executing a function which contains a yield statement returns a generator to the caller. The caller repeatedly executes this generator and receives results of the execution till the generator terminates.
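For readers new to generators, a minimal standalone illustration of this behavior:

def numbers():
    yield 1
    yield 2

for n in numbers():  # each iteration resumes the generator at the last yield
    print(n)         # prints 1, then 2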
In our case, the parse() method returns a dictionary object containing a key (title) to the caller on each invocation till the div.thing list ends.
Running the Spider and Collecting Output
Let us now run the spider again. A part of the copious output is shown (after re-instating the log statements).
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from
2017-06-16 11:35:27 [scrapy.core.scraper] DEBUG: Scraped from
It is hard to see the real output. Let us redirect the output to a file (posts.json) by running the spider with scrapy’s -o option, e.g. scrapy runspider redditspider.py -o posts.json.
And here is a part of posts.json.
Extract All Required Information
Let’s also extract the subreddit name and the number of votes for each post. To do that, we just update the result returned from the yield statement.
for e in response.css('div.thing'):
    yield {
        'title': self.a(e, 'div.entry>p.title>a.title::text'),
        'votes': self.f(e, 'div.score.likes::attr(title)'),
        'subreddit': self.a(e, 'div.entry>p.tagline>a.subreddit::text'),
    }
The resulting posts.json:
Conclusion
This article provided a basic view of how to extract information from websites using Scrapy. To use scrapy, we need to write a spider module which instructs scrapy to crawl a website and extract structured information from it. This information can then be returned in JSON format for consumption by downstream software.

Gartner confirms what we all know: AWS and Microsoft are the cloud leaders, by a fair way

Paranormal parallelogram for IaaS has Google on the same lap, IBM and Oracle trailing
Gartner has published a new magic quadrant for infrastructure-as-a-service (IaaS) that – surprising nobody – has Amazon Web Services and Microsoft alone in the leader’s quadrant and a few others thought outside of the box.
Here’s the Soothsaying Square in all its glory.

Gartner’s Magic Quadrant for Cloud Infrastructure as a Service, Worldwide June 2017. Click here to embiggen
That Oracle and IBM are rated visionaries may turn heads, as both strut like cloud leaders: Oracle regularly says its cloud is superior to Amazon’s. Yet Gartner rates Oracle’s cloud “a bare-bones ‘minimum viable product’” that offers “only the most vitally necessary cloud IaaS compute, storage and networking capabilities.” The analyst firm also worries about the Oracle cloud’s “limited operational track record” and warns that “Customers need to have a very high tolerance for risk, along with strong technical acumen.”
It’s not all scary: Gartner says “Oracle has a realistic perspective on its late entry into the market, and has a sensible engineering roadmap focused on building a set of core capabilities that will eventually make it attractive for targeted use cases.”
IBM was rated visionary because its cloud is a work in progress. Probably slow progress. While the company is working on a “Next-Generation Infrastructure” project that will improve scale and efficiency, there’s no news of when it will debut. For now, IBM’s cloud offering is mostly SoftLayer with a feature set Gartner says “has not improved significantly since the IBM acquisition in mid-2013; it is SMB-centric, hosting-oriented and missing many cloud IaaS capabilities required by midmarket and enterprise customers.”
Gartner also offers the following observations:
IBM has, throughout its history in the cloud IaaS business, repeatedly encountered engineering challenges that have negatively impacted its time to market. It has discontinued a previous attempt at a new cloud IaaS offering, an OpenStack-based infrastructure that was offered via the Bluemix portal in a 2016 beta. Customers must thus absorb the risk of an uncertain roadmap. This uncertainty also impacts partners, and therefore the potential ecosystem.
Don’t forget, dear readers, that IBM has pretty much bet its future on cloudy cognitive services.
Alibaba also makes the visionary quadrant, as it’s ambitious and growing fast. But Gartner warns its English-language portal doesn’t offer all the services available in China. On top of that, the company’s not an innovator. Or as Gartner puts it, “Alibaba Cloud’s vision seems inextricably tied to that of its global competitors; it takes liberal inspiration from competitors when developing service capabilities and branding.”
Google’s the visionary closest to escaping into the leaders’ quadrant. Gartner says it’s a fine cloud for developers of cloud-native applications and is building features suited to more conventional workloads at speed. The analyst firm suggests that once Google’s partner ecosystem is more mature, it’ll be more attractive. For now, it’s offering “deep discounts and exceptionally flexible contracts to try to win projects from customers that are currently spending significant sums of money with cloud competitors.” Which makes it a fine alternative to AWS, a role it’s often filling.
The leaders
On Azure, Gartner is impressed that Microsoft’s moved beyond building a vanilla IaaS to innovating with its own features and praises its useful role for companies committed to Microsoft. But the cautions it offers are a bit scary.
Here’s the first:
While Microsoft Azure is an enterprise-ready platform, Gartner clients report that the service experience feels less enterprise-ready than they expected, given Microsoft’s long history as an enterprise vendor. Customers cite issues with technical support, documentation, training and breadth of the ISV partner ecosystem.
Microsoft is actively addressing these issues and has made significant improvements over the last year. However, the disorganized and inexperienced ecosystem of managed and professional service partners makes it challenging for customers to obtain expertise and mitigate risks, resulting in greater reluctance to deploy production applications or conduct data center migrations.
Azure Fast Start implementations by Microsoft professional services are inconsistent in quality, and do not always accurately reflect what a customer will need to deploy production applications in Azure.
The second caution warns that security and DevOps features may not be sufficiently mature to satisfy customers, while “Multiple generations of solutions, coupled with unclear guidance on when to use each, create significant complexity in determining the right implementation.”
AWS is rated “the most mature, enterprise-ready provider, with the deepest capabilities for governing a large number of users and resources.” Gartner says it can satisfy the cool kids who want cloud-native and old hands who want to shift traditional workloads to the cloud, in part because independent software vendors have clambered aboard in large numbers.
But the analyst firm warns that AWS “has just begun to adapt to the emergence of meaningful competitors”, continues to offer complex pricing that makes third-party cost-management tools highly desirable and doesn’t offer SLAs on most services.
It also plays hard: Gartner says “Its disciplined approach to contract negotiation and discounts is based almost solely on customer spending and near-term revenue opportunity.”
There’s a flock of challengers on the quadrant too, namely CenturyLink, Joyent, Virtustream, Interoute, Skytap, NTT, Rackspace and Fujitsu. But this article’s already long enough, so if you want to read them perhaps you’ll go get the whole Supernatural Square for yourself: AWS has a free data-for-download link here and Microsoft’s done the same here.
As you read it, ponder the omissions, too. OVH hasn’t made Gartner’s list. Nor has Digital Ocean. The likes of Salesforce, SAP and ServiceNow, which will happily rent you servers by the hour, fall outside Gartner’s definition of IaaS, falling into PaaS territory and therefore a different paranormal parallelogram. ®

There Is a Better Way to Teach Math (and Understand It)

In 1939, the fictional professor J. Abner Pediwell published a curious book called “The Saber-Tooth Curriculum.”
Through a series of satirical lectures, Pediwell (or the actual author, education professor Harold R. W. Benjamin) describes a Paleolithic curriculum that includes lessons in grabbing fish with your bare hands and scaring saber-toothed tigers with fire. Even after the invention of fishnets proved to be a far superior method of catching fish, teachers continued teaching the bare-hands method, claiming that it helps students develop “generalized agility.”
Pediwell showed how curricula can become entrenched and ritualistic, failing to respond to changes in the world around them. In math education, the problem is not quite so dire – but it’s time to start breaking a few of our own traditions. There’s a growing interest in emphasizing problem-solving and understanding concepts over skills and procedures. While memorized skills and procedures are useful, knowing the underlying meanings and understanding how they work builds problem-solving skills so that students may go beyond solving the standard book chapter problem.
As education researchers, we see two different ways that educators can build alternative mathematics courses. These updated courses work better for all students by changing what they teach and how they teach it.
In math, the usual curricular pathway – or sequence of courses – starts with algebra in eighth or ninth grade. This is followed by geometry, second-year algebra and trigonometry, all the way up to calculus and differential equations in college.
This pathway still serves science, technology, engineering and mathematics (STEM) majors reasonably well. However, some educators are now concerned about students who may have other career goals or interests. These students are stuck on largely the same path, but many end up terminating their mathematics studies at an earlier point along the way.
In fact, students who struggle early with the traditional singular STEM pathway are more likely to fall out of the higher education pipeline entirely. Many institutions have identified college algebra courses as a key roadblock leading to students dropping out of college altogether.
Another issue is that there is a growing need for new quantitative skills and reasoning in a wide variety of careers – not just STEM careers. In the 21st century, workers across many fields need to know how to deal effectively with data (statistical reasoning), detect trends and patterns in huge amounts of information (“big data”), use computers to solve problems (computational thinking) and make predictions about the relationships between different components of a system (mathematical modeling).
The quest to improve student retention has led schools to consider other pathways that would provide students with the quantitative skills they need. For example, courses that use spreadsheets extensively for mathematical modeling and powerful statistical software packages have been developed as part of an alternative pathway designed for students with interests in business and economics.

What’s more, sophisticated computational tools provide us with mathematical capabilities far beyond arithmetic calculations. For example, large numerical data sets can be visually examined for patterns using computer graphing software. Other tools can derive predictive equations that would be impractical for anyone to compute with paper and pencil. What’s really needed are people who can make use of those tools productively, by posing the right questions and then interpreting the results sensibly.
The Carnegie Foundation for the Advancement of Teaching has created alternative math curricula called Quantway and Statway as examples of alternative pathways – used primarily in community colleges – that focus on quantitative reasoning and statistics/data analysis, respectively.
These alternative pathways involve activities that go beyond students writing examples down in their notebooks. Students might use software, build mathematical models or exercise other skills – all of which require flexible instruction.
Both new and old pathways can benefit from new and more flexible methods. In 2012, the President’s Council of Advisors on Science and Technology called for a 34 percent increase in the number of STEM graduates by 2020. Their report suggested current STEM teaching practices could improve through evidence-based approaches like active learning.
In a traditional classroom, students act as passive observers, watching an expert correctly work out problems. This approach doesn’t foster an environment where mistakes can be made and answers can be questioned. Without mistakes, students lack the opportunity to more deeply explore how and why things don’t work. They then tend to view mathematics as a series of isolated problems for which the solution is merely a prescribed formula.
Conversely, classrooms that incorporate active learning allow students to ask questions and explore. Active learning is not a specifically defined teaching technique. Rather, it’s a spectrum of instructional approaches, all of which involve students actively participating in lessons. For example, teachers could pose questions during class time for students to answer with an electronic clicker. Or, the class could skip the lecture entirely, leaving students to work on problems in groups.
While the idea of active learning has existed for decades, there has been a greater push for widespread adoption in recent years, as more scientific research has emerged. A 2014 analysis looked at 225 studies comparing active learning with traditional lecture in STEM courses. Their findings unequivocally support using active learning and question whether or not lecture should even continue in STEM classrooms. If this were a medical study in which active learning was the experimental drug, the authors write, trials would be “stopped for benefit” – because active learning is so clearly beneficial for students.
The studies in this analysis varied greatly in the level of active learning that took place, yet the benefit held across that whole spectrum. In other words, active learning, no matter how minimal, leads to greater student achievement than a traditional lecture classroom.
Regardless of pathway, all students can benefit from active engagement in the classroom. As mathematician Paul Halmos put it: “The best way to learn is to do; the worst way to teach is to talk.”
Mary E. Pilgrim, Assistant Professor of Mathematics Education, Colorado State University and Thomas Dick, Professor of Mathematics, Oregon State University
This article was originally published on The Conversation. Read the original article.


Creating Your First Machine Learning Classifier Model with Sklearn

Originally appeared on kasperfred.com
Okay, so you’re interested in machine learning.
But you don’t know where to start, or perhaps you have read some theory, but don’t know how to implement what you have learned.
This tutorial will help you break the ice, and walk you through the complete process from importing and analysing a dataset to implementing and training a few different well known classification algorithms and assessing their performance.
I’ll be using a minimal amount of discrete mathematics, and aim to express details using intuition, and concrete examples instead of dense mathematical formulas. You can read why here.
At the end of the post you will know how to:
• Import and transform data from a .csv file to use with sklearn
• Inspect the dataset and select relevant features
• Train different classifiers on the data using sklearn
• Analyse the results with the intention of improving your model
We will be classifying flower-species based on their sepal and petal characteristics using the Iris flower dataset which you can download from Kaggle here.
Kaggle, if you haven’t heard of it, has a ton of cool open datasets, and is a place where data scientists share their work which can be a valuable resource when learning.
The Iris flower dataset is rather small (consisting of only 150 evenly distributed samples), and is well behaved which makes it ideal for this project.
You might ask why we use this admittedly rather boring dataset when there are so many other interesting ones available. The reason is that when we’re learning about data analysis, using simple, well-behaved data reduces the cognitive load and makes it easier to debug, as we are able to better comprehend the data we are working with.
When learning machine learning, the data is less important than how it’s analysed.
Importing data
Once we have downloaded the data, the first thing we want to do is to load it in and inspect its structure. For this we will use pandas.
Pandas is a Python library that gives us a common interface for data processing called a DataFrame. DataFrames are essentially Excel spreadsheets with rows and columns, but without the fancy UI Excel offers. Instead, we do all the data manipulation programmatically.
Pandas also has the added benefit of making it super simple to import data, as it supports many different formats including Excel spreadsheets, CSV files, and even HTML documents.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot') # make plots look better
After having imported the libraries we are going to use, we can now read the datafile using pandas’ read_csv() method.
Pandas automatically interprets the first line as column headers. If your dataset doesn’t specify the column headers in the first line, you can pass the argument header=None to the read_csv() function to interpret the whole document as data. Alternatively, you can also pass a list with the column names as the header parameter.
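The call itself is a single line (the filename below assumes the dataset was saved under its default Kaggle name):

df = pd.read_csv('Iris.csv')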
To confirm that pandas has correctly read the csv file we can call df.head() to display the first five rows.
print (df.head())

# Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
# 0 1 5.1 3.5 1.4 0.2 Iris-setosa
# 1 2 4.9 3.0 1.4 0.2 Iris-setosa
# 2 3 4.7 3.2 1.3 0.2 Iris-setosa
# 3 4 4.6 3.1 1.5 0.2 Iris-setosa
# 4 5 5.0 3.6 1.4 0.2 Iris-setosa
It’s seen that pandas has indeed imported the data correctly. Pandas also has a neat function, df.describe(), to calculate the descriptive statistics for each column, like so:
print (df.describe())

# Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
# count 150.000000 150.000000 150.000000 150.000000 150.000000
# mean 75.500000 5.843333 3.054000 3.758667 1.198667
# std 43.445368 0.828066 0.433594 1.764420 0.763161
# min 1.000000 4.300000 2.000000 1.000000 0.100000
# 25% 38.250000 5.100000 2.800000 1.600000 0.300000
# 50% 75.500000 5.800000 3.000000 4.350000 1.300000
# 75% 112.750000 6.400000 3.300000 5.100000 1.800000
# max 150.000000 7.900000 4.400000 6.900000 2.500000
Now that we can confirm there are no missing values, we are ready to begin analyzing the data with the intention of selecting the most relevant features.
Feature selection
After having become comfortable with the dataset, it’s time to select which features we are going to use for our machine learning model.
You might reasonably ask why do feature selection at all; can’t we just throw all the data we have at the model, and let it figure out what’s relevant?
To answer this, it’s important to understand that features are not the same as information.
Suppose you want to predict a house’s price from a set of features. We can ask ourselves if it’s really important to know how many lamps, and power outlets there are; is it something people think about when buying a house? Does it add any information, or is it just data for the sake of data?
Adding a lot of features that don’t contain any information makes the model needlessly slow, and you risk confusing the model into trying to fit informationless features. Furthermore, having many features increases the risk of your model overfitting (more on that later).
As a rule of thumb, you want the least amount of features that gives you as much information about your data as possible.
It’s also possible to combine correlated features such as number of rooms, living area, and number of windows from the example above into higher level principal components, for example size, using combination techniques such as principal component analysis (PCA). Although we won’t be using these techniques in this tutorial, you should know that they exist.
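For reference only, since we won’t use it here, a sketch of how sklearn’s PCA could collapse the two correlated sepal columns of this dataset into a single component:

from sklearn.decomposition import PCA

pca = PCA(n_components=1)
# One derived "sepal size" feature in place of two correlated columns.
sepal_size = pca.fit_transform(df[['SepalLengthCm', 'SepalWidthCm']])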
One useful way of determining the relevance of features is by visualizing their relationship to other features by plotting them. Below, we plot the relationship between two features using the DataFrame's plot.scatter() method.
df.plot.scatter(x="SepalLengthCm", y="SepalWidthCm")

plt.show()

The resulting figure shows the relationship between sepal length and sepal width; however, it's difficult to see whether there's any grouping without an indication of the true species each datapoint represents.
Luckily, this is easy to get using seaborn’s FacetGrid class where we can use a column to drive the color, or hue, of the scatter points.
sns.FacetGrid(df, hue="Species").map(plt.scatter, "SepalLengthCm", "SepalWidthCm").add_legend()

plt.show()

This is much better.
Using the function above with different feature combinations, we find that PetalLengthCm and PetalWidthCm cluster together in fairly well-defined groups, as per the figure below.

Notably, the boundary between Iris-versicolor and Iris-virginica seems fuzzy. This may cause trouble for some classifiers and is worth keeping in mind when training.
How did I know how to create those graphs?
I googled it.
When doing machine learning, you will find that being able to look things up is essential. There are endless things to remember, and spending a lot of time trying to memorize them is incredibly inefficient. It's more efficient to look up the things you are unsure of, and let your brain automatically remember the things you use often.
Being able to quickly look things up is much more valuable than memorizing the entire sklearn documentation.
On the bright side, sklearn is extensively documented and well organized, making things easy to look up. Sklearn also has a very consistent interface; something you will likely notice throughout the tutorial.
If correlating different features in order to select the best ones sounds like a lot of work, it should be noted that there are automated methods of doing this, such as SelectKBest and recursive feature elimination, both of which are available in sklearn.
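As a sketch of the automated route (we don't rely on it in this tutorial), sklearn's SelectKBest scores each column against the labels and keeps only the k best; it assumes numeric features and labels arrays like the ones we build in the next section:

from sklearn.feature_selection import SelectKBest, f_classif

# score every feature column against the labels and keep the best two
selector = SelectKBest(score_func=f_classif, k=2)
best_features = selector.fit_transform(features, labels)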
Preparing data to be trained by a sklearn classifier
Now that we have selected the features we want to use (PetalLengthCm and PetalWidthCm), we need to prepare the data, so we can use it with sklearn.
Currently, all the data is encoded in a DataFrame, but sklearn's estimators expect numeric arrays, so we need to extract the features and labels and convert them into numpy arrays instead.
Separating the labels is quite simple, and can be done in one line using np.asarray().
labels = np.asarray(df.Species)
We could stop here, as it's possible to train a classifier using the labels above; however, because the species values are strings rather than numbers, we will run into problems when evaluating the model.
Luckily, sklearn provides a nifty tool that encodes label-strings as numerical representations. It works by going through an array of labels, encoding the first unique label as 0, the next unique label as 1, and so on.
Using the LabelEncoder follows the standard sklearn interface protocol with which you will soon become familiar.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(labels)

# apply encoding to labels
labels = le.transform(labels)
The table below shows labels before and after the data transformation, and was created using df.sample(5).
      After  Before
139   2      Iris-virginica
87    1      Iris-versicolor
149   2      Iris-virginica
45    0      Iris-setosa
113   2      Iris-virginica
We see that each unique string label now has a unique integer associated with it. If we ever want to return to the string labels, we can use le.inverse_transform(labels).
Encoding the features follows a similar process.
First, we remove all the columns we don't want from the DataFrame using the drop() method.
df_selected = df.drop(["SepalLengthCm", "SepalWidthCm", "Id", "Species"], axis=1)
Now we only have the columns PetalLengthCm and PetalWidthCm left.
Since we want to use more than one column, we can't simply use np.asarray(). Instead, we can use the to_dict() method together with sklearn's DictVectorizer.
df_features = df_selected.to_dict(orient='records')
The sklearn interface for using the DictVectorizer class is similar to that of the LabelEncoder. One notable difference is the .toarray() call chained onto fit_transform: DictVectorizer returns a sparse matrix by default, and .toarray() converts it into a dense numpy array.
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
features = vec.fit_transform(df_features).toarray()
Now that we have numerical feature and label arrays, there's only one thing left to do, which is to split our data into a training and a test set.
You might ask: why have a test set when you could train using all the data?
Having a test set helps validate the model and check for things like overfitting, where the model fails to generalize from the training data and instead just memorizes the answers; this is not ideal if we want it to do well on unknown data. The purpose of the test set is to mimic the unknown data the model will be presented with in the real world. It's therefore very important not to train using the test set.
Sometimes, with algorithms particularly prone to overfitting, you also have a separate validation set. When tuning a model's parameters (often called hyperparameters), information about the test set can leak into the model, causing it to overfit the test set as well as the training set; tuning against a validation set instead, and touching the test set only at the very end, avoids this.
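As a sketch only (we won't do this here), a validation set can be carved out with a second call to train_test_split; the 60/20/20 proportions are just one common choice:

from sklearn.model_selection import train_test_split

# first reserve 20% of the data as the final test set...
features_rest, features_test, labels_rest, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=42)

# ...then take 25% of the remainder (20% of the total) as the validation set
features_train, features_val, labels_train, labels_val = train_test_split(
    features_rest, labels_rest, test_size=0.25, random_state=42)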
For this tutorial, however, we will only use a test set and a training set.
Sklearn has a tool that helps divide the data into a test and a training set.
from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(
features, labels,
test_size=0.20, random_state=42)
Of interest here are the test_size and random_state parameters. The test_size parameter is the fraction of the total dataset reserved for testing. An 80/20 split is often considered a good rule of thumb, but it may need some adjustment later.
The other notable parameter is random_state. Its exact value is not important, as it's just a random seed that makes the shuffle reproducible; what matters is the act of shuffling the data.
Why?
The dataset is sorted by species, so without shuffling we would train on only the first two species, and the model wouldn't be useful when tested on a third species it has never seen before.
If you have never seen something before, it is difficult to correctly classify it.
Choosing a classifier
Now that we have separated the data into test and training sets, we can begin to choose a classifier.
Considering our data, a Random Forest classifier stands out as a good starting point. Random Forests are simple, flexible in that they work well with a wide variety of data, and rarely overfit badly.
One notable downside to Random Forests is that they are non-deterministic in nature, so they don’t necessarily produce the same results every time you train them.
While Random Forests are a good starting point, in practice, you will often use multiple classifiers, and see which ones get good results.
You can limit the guesswork over time by developing a sense for which algorithms generally do well on which problems; of course, a first-principles analysis of an algorithm's underlying mathematics helps with this as well.
Training the classifier
Now that we have chosen a classifier, it's time to implement it.
Implementing a classifier in sklearn follows three steps.
• Import (I usually Google this)
• Initialization (usually self-evident from the import statement)
• Training (or fitting)
In code, it looks like this:
# import
from sklearn.ensemble import RandomForestClassifier

# initialize
clf = RandomForestClassifier()

# train the classifier using the training data
clf.fit(features_train, labels_train)
A trained classifier isn’t much use if we don’t know how accurate it is.
We can quickly get an idea of how well the model works on the data by using the score() method on the classifier.
# compute accuracy using test data
acc_test = clf.score(features_test, labels_test)

print("Test Accuracy:", acc_test)
# Test Accuracy: 0.98
98%!
That is not bad for three lines of code. Granted, this wasn’t the hardest of problems, but 98% on our first try is still really good.
Note: If you get a slightly different result, you shouldn't worry; it's expected with this classifier, as it works by generating a collection of randomized decision trees and averaging their predictions.
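If you want run-to-run reproducibility, you can pin the randomness by passing a fixed random_state when creating the classifier; a minimal sketch:

# fixing the seed makes training reproducible across runs
clf = RandomForestClassifier(random_state=42)
clf.fit(features_train, labels_train)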
We can also compute the accuracy on the training data and compare the two to get an idea of how much the model is overfitting. The method is similar to how we computed the test accuracy, only this time we use the training data for evaluation.
# compute accuracy using training data
acc_train = clf.score(features_train, labels_train)

print("Train Accuracy:", acc_train)
# Train Accuracy: 0.98
We see that the accuracy on the training data is also 98%, which suggests that the model is not overfitting.
But what about entirely new data?
Suppose now we have found a new, unique iris flower and measured its petal length and width.
Say we measured the length to be 5.2 cm and the width to be 0.9 cm; how can we figure out which species it is using our newly trained model?
The answer is by using the predict() method as shown below.
flower = [[5.2, 0.9]]  # note the nested list: predict() expects a 2D array of samples
class_code = clf.predict(flower)  # [1]
This is great.
We now know the encoded class of the species. However, its usefulness is limited, as a bare integer isn't easily understood by humans.
It would be much easier if it returned the species label instead.
Remember the inverse_transform() on the label encoder from before? We can use this to decode the group ID like so:
flower = [[5.2, 0.9]]
class_code = clf.predict(flower)  # [1]

decoded_class = le.inverse_transform(class_code)
print(decoded_class)  # ['Iris-versicolor']
And so we see that our new flower is of the species Iris versicolor.
Evaluating the results
Even though we can see the test accuracy lies at 98%, it would be interesting to see what kind of mistakes the model makes.
There are two ways a classification model can fail to predict the correct result; false positives, and false negatives.
A false positive is where something is guessed to be true when it’s really false.
A false negative is where something is guessed to be false when it’s really true.
Since we are not running a binary classifier (one which predicts "yes" or "no"), but instead a classifier that guesses which one of several labels applies, every mistake will be both a false positive with respect to some labels and a false negative with respect to others.
In machine learning, we often use precision and recall instead of false positives and false negatives.
Precision measures how many of the samples the model flagged as positive really are positive (penalizing false positives), whereas recall measures how many of the truly positive samples the model found (penalizing false negatives). Both are fractions between 0 and 1, where higher is better.
Formally, precision and recall are calculated like so:
precision = true positives / (true positives + false positives)
recall = true positives / (true positives + false negatives)
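For example, if a classifier flags 100 samples as Iris-versicolor and 95 of them really are, its precision is 95 / 100 = 0.95; if the test set contains 120 Iris-versicolor samples in total, its recall is 95 / 120 ≈ 0.79. (These numbers are made up purely to illustrate the arithmetic.)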
Sklearn has built-in functions to calculate the precision and recall scores, so we don't have to do it by hand. Note that they compare true labels against predictions, so we first need the model's predictions for the test set.
from sklearn.metrics import recall_score, precision_score

# generate predictions for the test set
pred = clf.predict(features_test)

precision = precision_score(labels_test, pred, average="weighted")
recall = recall_score(labels_test, pred, average="weighted")

print("Precision:", precision)  # Precision: 0.98125
print("Recall:", recall)  # Recall: 0.98
As seen above, the model makes slightly more false negatives than false positives (recall is marginally lower than precision), but the two are generally evenly balanced.
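If you want to see exactly which species get mistaken for which, sklearn also provides a confusion matrix; a short sketch reusing the pred array from above:

from sklearn.metrics import confusion_matrix

# rows are the true species, columns are the predicted species;
# the off-diagonal entries are the mistakes
print(confusion_matrix(labels_test, pred))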
Tuning the classifier
Currently, our Random Forest classifier just uses the default parameter values. However, for increased control, we can change some or all of them.
One interesting parameter is min_samples_split. It denotes the minimum number of samples required to split an internal node of a decision tree.
Generally speaking, the lower it is, the more detail the model captures, but this also increases the likelihood of overfitting. A higher value, on the other hand, tends to capture the broad trends while ignoring the little details.
By default it’s set to 2.
It doesn't make much sense to lower the value, and the model doesn't seem to be overfitting; still, we can try raising the value from 2 to 4.
We can specify classifier parameters when we create the classifier like so:
clf = RandomForestClassifier(
min_samples_split=4
)
And that’s it.
Train Accuracy: 0.98
Test Accuracy: 1.0

Precision: 1.0
Recall: 1.0
When we retrain the model, we see that the test accuracy has risen to a perfect 100%, while the training accuracy remains at 98%, suggesting that there's still more information to extract.
Another parameter we can change is criterion, which denotes how the quality of a split is measured.
By default it is set to "gini", which measures Gini impurity, but sklearn also supports "entropy", which measures information gain.
We can train the classifier using entropy instead just by setting the parameter like we set min_samples_split.
clf = RandomForestClassifier(
min_samples_split=4,
criterion="entropy"
)
When we retrain the model with the new parameters, nothing changes, which suggests that the criterion function isn't an important factor for this kind of data/problem.
Train Accuracy: 0.98
Test Accuracy: 1.0

Precision: 1.0
Recall: 1.0
You can read about all the tuning parameters for the RandomForestClassifier on Sklearn’s documentation page.
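Rather than tweaking one parameter at a time by hand, you can also let sklearn search a parameter grid for you with GridSearchCV; a minimal sketch, where the grid values are just examples:

from sklearn.model_selection import GridSearchCV

param_grid = {
    "min_samples_split": [2, 4, 8],
    "criterion": ["gini", "entropy"],
}

# try every combination in the grid with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(features_train, labels_train)
print(search.best_params_)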
Other classifiers
Another way of improving a model is by changing the algorithm.
Suppose we want to use a support vector machine instead.
Using sklearn's support vector classifier only requires us to change two lines of code: the import and the initialization.
from sklearn.svm import SVC
clf = SVC()
And that’s all.
Running this with just the default settings gives us results comparable to the Random Forest classifier.
Train Accuracy: 0.95
Test Accuracy: 1.0

Precision: 1.0
Recall: 1.0
In the same way, we can swap in any other classifier supported by sklearn; see the sketch below.
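For instance, a sketch of swapping in a k-nearest neighbors classifier (just one of many options; no claim it performs better here):

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier()
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))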
Conclusion
Some 3,000 words later: first of all, congratulations on making it this far. You should celebrate by watching this hamster eating a tiny pizza.
You back? Great!
We have covered everything from reading the data into a pandas DataFrame, to selecting relevant features and training a classifier with sklearn, to assessing the model's accuracy, tuning the parameters, and, if necessary, changing the classifier algorithm.
You should now have the tools necessary to investigate unknown datasets and build simple classification models, even with algorithms you are not yet familiar with.
Homework
• Optimize the SVM classifier model using the parameters found here.
• Train different classifiers; sklearn's documentation may be of use.
• Use the methods discussed in the article to analyze a different dataset, for example Kaggle's Titanic dataset.
For feedback, send your answers to “homework [at] kasperfred.com”. Remember to include the title of the blog post in the subject line.
You can access the full source code used in the tutorial here.