@Wikidata - #Quality in a #Wiki environment

11:44, Sunday, 20 January 2019 UTC
What quality is, and what quality means in a data environment, has been studied often enough. Many words have been spent on it, but one notion is always left out: what is data quality in a Wiki environment, and how does that translate to Wikidata?

First of all, Wikidata serves many purposes. The initial purpose of Wikidata was to replace the in-article "interwiki" links. They were notoriously difficult to maintain and often wrong. A single Wikidata item replaced the links for a subject in all Wikipedias, and this brought stability and a high level of confidence in the result. Over time the quality of the "interwiki" links went down; there are fewer people involved in adding and curating these links, and it is seen as a quality issue when new items are generated for new articles: they do not have statements and are often not linked. There have been protests against these new additions.

A second purpose is the use of Wikidata statements in Wikipedia templates. Assessing data quality becomes complicated as there are micro, meso and macro levels of quality at play. The micro level: is sufficient data available for one template in one Wikipedia article? The meso level: is sufficient data available for one template in Wikipedia articles on the same topic? The macro level: is the same data available for all interested Wikipedias, and do we have the required labels in those languages?

Quality considerations are driven by this approach. On the micro level you want all awards for a scientist to be linked on an item. On the meso level you want all recipients of an award to be linked to their items. On the macro level you want all awards to have labels in the language of a Wikipedia, and all local considerations to have been met.

Standard quality considerations in a Wiki environment are not helpful; they are judgemental. People contribute to Wikidata and all have their own purposes. A Wiki is a work in progress, and when quality assessments are to be performed, the question should focus on the extent to which a specific function is supported. What people seek in support also changes; as long as there was no article for professor Angela Byars-Winston it was fine to know about her only for one publication. Now that Jess Wade has picked her for an article, it may be relevant that she is the first and so far only person known to Wikidata who was a "champion of change", and that more papers are identified for her.

Wikidata includes many references to scientific papers and authors. However, so far this serves no purpose. Allegedly there is a process underway that imports papers used as citations in the Wikipedias, but it is not clear what papers are used in what Wikipedia article. So far it is a big stamp collection, a collection with rapidly growing quality. A collection that highlights authors who are open about their work and who share the details of their work at ORCID. In effect, this data set indicates that the relevance of a scientist improves by being open.

Wikidata invites people to add and curate the data that is of interest to them. Particularly the esoteric data, data about subjects like African geography or Islamic history, needs a lot of tender loving care. It is where Wikidata and the large Wikipedias are weak. For as long as Wikidata is largely defined by the large Wikipedias it will reflect the same biases, and these biases will be hard to assess and curate.
Thanks,
      GerardM

Gerrit now automatically adds reviewers

01:33, Saturday, 19 January 2019 UTC

Finding reviewers for a change is often a challenge, especially for a newcomer or folks proposing changes to projects they are not familiar with. Since January 16th, 2019, Gerrit automatically adds reviewers on your behalf based on who last changed the code you are affecting.

Antoine "@hashar" Musso exposes what lead us to enable that feature and how to configure it to fit your project. He will offers tip as to how to seek more reviewers based on years of experience.


When uploading a new patch, reviewers should be added automatically; that is the subject of task T91190, opened almost four years ago (March 2015). I had declined the task since we already have the Reviewer bot (see section below), but @Tgr found a plugin for Gerrit which analyzes the code history with git blame and uses that to determine potential reviewers for a change. It took us a while to add that particular Gerrit plugin, and the first version we installed was not compatible with our Gerrit version. The plugin was upgraded yesterday (Jan 16th) and is working fine (T101131).

Let's have a look at the functionality the plugin provides, and how it can be configured per repository. I will then offer a refresher of how one can search for reviewers based on git history.

Reviewers by blame plugin

The Gerrit plugin looks at the affected code using git blame and extracts the top three past authors, which are then added as reviewers to the change on your behalf. The added reviewers will thus receive a notification showing you have asked them for a code review.

The configuration is done on a per-project basis and inherits from the parent project. Without any tweaks, your project inherits the configuration from All-Projects. If you are a project owner, you can adjust the configuration. As an example, the configuration for operations/mediawiki-config shows inherited values and an exception to not process a file named InitialiseSettings.php.
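Such per-repository settings live in the project's refs/meta/config branch, in its project.config file. As a rough sketch (the regular expression below is an illustrative assumption, not the exact value used for operations/mediawiki-config), the relevant section might look like:

[plugin "reviewers-by-blame"]
    maxReviewers = 3
    ignoreFileRegEx = InitialiseSettings\.php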

The three settings are described in the documentation for the plugin:

plugin.reviewers-by-blame.maxReviewers
The maximum number of reviewers that should be added to a change by this plugin.
By default 3.

plugin.reviewers-by-blame.ignoreFileRegEx
Ignore files where the filename matches the given regular expression when computing the reviewers. If empty or not set, no files are ignored.
By default not set.

plugin.reviewers-by-blame.ignoreSubjectRegEx
Ignore commits where the subject of the commit messages matches the given regular expression. If empty or not set, no commits are ignored.
By default not set.

By making past authors aware of a change to code they previously altered, I believe you will get more reviews and hopefully get your changes approved faster.

Previously we had other methods to add reviewers, one opt-in based and the others being cumbersome manual steps. They should be used to complement the Gerrit reviewers by blame plugin, and I am giving an overview of each of them in the following sections.

Gerrit watchlist

The original system from Gerrit lets you watch projects, similar to a user watch list on MediaWiki. In Gerrit preferences, one can get notified for new changes, patchsets, comments... Simply indicate a repository and, optionally, a search query, and you will receive email notifications for matching events.

The attached image is my watched projects configuration: I thus receive notifications for any changes made to the integration/config repository, as well as for changes in mediawiki/core which affect either composer.json or one of the Wikimedia deployment branches for that repo.

One drawback is that we can not watch a whole hierarchy of projects such as mediawiki and all its descendants, which would be helpful to watch our deployment branch. It is still useful when you are the primary maintainer of a repository since you can keep track of all activity for the repository.

Reviewer bot

The reviewer bot was written by Merlijn van Deen (@valhallasw). It is similar to the Gerrit watched projects feature, with some major benefits:

  • the watcher is added as a reviewer, so the author knows you were notified
  • it supports watching a hierarchy of projects (eg: mediawiki/*)
  • the file/branch filtering might be easier to grasp compared to Gerrit search queries
  • the watchers are stored at a central place which is public to anyone, making it easy to add others as reviewers.

One registers reviewers on a single wiki page: https://www.mediawiki.org/wiki/Git/Reviewers.

Each repository filter is a wikitext section (eg: === mediawiki/core ===) followed by a wikitext template and a file filter using Python fnmatch. Some examples:

Listen to any changes that touch i18n:

== Listen to repository groups ==
=== * ===
* {{Gerrit-reviewer|JohnDoe|file_regexp=<nowiki>i18n</nowiki>}}

Listen to MediaWiki core search related code:

=== mediawiki/core ===
* {{Gerrit-reviewer|JaneDoe|file_regexp=<nowiki>^includes/search/</nowiki>}}

The system works great, provided maintainers remember to register on the page and the files are not moved around. The bot is not that well known though, and most repositories do not have any reviewers listed.

Inspecting git history

A source of reviewers is the git history: one can easily retrieve a list of past authors, who should be good candidates to review the code. I typically use git shortlog --summary --no-merges for that (--no-merges filters out the merge commits crafted by Gerrit when a change is submitted). Example for the MediaWiki job queue system:

$ git shortlog --no-merges --summary --since "one year ago" includes/jobqueue/|sort -n|tail -n4
     3 Petr Pchelko
     4 Brad Jorsch
     4 Umherirrender
    16 Aaron Schulz

That gives me four candidates who acted on that directory over the past year.

Past reviewers from git notes

When a patch is merged, Gerrit records in git the votes and the canonical URL of the change. They are available in git notes under refs/notes/review. Once the notes are fetched, they can be shown by git show or git log by passing --show-notes=review: for each commit, after the commit message, the notes are displayed and show votes among other metadata:

$ git fetch origin refs/notes/review:refs/notes/review
$ git log --no-merges --show-notes=review -n1
commit e1d2c92ac69b6537866c742d8e9006f98d0e82e8
Author: Gergő Tisza <tgr.huwiki@gmail.com>
Date:   Wed Jan 16 18:14:52 2019 -0800

    Fix error reporting in MovePage
    
    Bug: T210739
    Change-Id: I8f6c9647ee949b33fd4daeae6aed6b94bb1988aa

Notes (review):
    Code-Review+2: Jforrester <jforrester@wikimedia.org>
    Verified+2: jenkins-bot
    Submitted-by: jenkins-bot
    Submitted-at: Thu, 17 Jan 2019 05:02:23 +0000
    Reviewed-on: https://gerrit.wikimedia.org/r/484825
    Project: mediawiki/core
    Branch: refs/heads/master

And I can then get the list of authors that previously voted Code-Review +2 for a given path. Using the previous example of includes/jobqueue/ over a year, the list is slightly different:

$ git log --show-notes=review --since "1 year ago" includes/jobqueue/|grep 'Code-Review+2:'|sort|uniq -c|sort -n|tail -n5
      2     Code-Review+2: Umherirrender <umherirrender_de.wp@web.de>
      3     Code-Review+2: Jforrester <jforrester@wikimedia.org>
      3     Code-Review+2: Mobrovac <mobrovac@wikimedia.org>
      9     Code-Review+2: Aaron Schulz <aschulz@wikimedia.org>
     18     Code-Review+2: Krinkle <krinklemail@gmail.com>

User Krinkle has approved a lot of patches, even though he does not show up in the list of authors obtained by the previous method (inspecting the git history).

Conclusion

The Gerrit reviewers by blame plugin acts automatically, which offers a good chance your newly uploaded patch will get reviewers added out of the box. For finer tweaking one should register as a reviewer on https://www.mediawiki.org/wiki/Git/Reviewers, which benefits everyone. The last course of action is to complement those by inspecting the git log history yourself.

For any remarks, support, or concerns, reach out on the Freenode IRC channel #wikimedia-releng or file a task in Phabricator.

Thank you @thcipriani for the proofreading and English fixes.

EmbedScript 2019

00:24, Saturday, 19 January 2019 UTC

Based on my decision to stick with browser-based sandboxing instead of pushing on with ScriptinScript, I’ve started writing up notes reviving my old idea for an EmbedScript extension for MediaWiki. It’ll use the <iframe> sandbox and Content-Security-Policy to run widgets in article content (with a visible/interactive HTML area) and plugins (headless) for use in UI extensions backed by trusted host APIs which let the plugin perform limited actions with the user’s permission.
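As a very rough sketch of the browser primitives involved (not the actual EmbedScript API; the function name, sandbox flags and CSP directives here are illustrative assumptions still to be validated):

// Embed untrusted widget markup in a sandboxed <iframe> with a restrictive
// Content-Security-Policy. 'allow-scripts' without 'allow-same-origin' keeps
// the frame in an opaque origin, and default-src 'none' blocks network access.
function embedWidget(container, widgetHtml) {
    const frame = document.createElement('iframe');
    frame.setAttribute('sandbox', 'allow-scripts');
    frame.srcdoc = `<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Security-Policy"
      content="default-src 'none'; script-src 'unsafe-inline'; style-src 'unsafe-inline'">
</head>
<body>${widgetHtml}</body>
</html>`;
    container.appendChild(frame);
    return frame;
}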

There are many details yet to be worked out, and I’ll keep updating that page for a while before I get to coding. In particular I want to confirm things like the proper CSP headers to prevent cross-origin network access (pretty sure I got it, but must test) and how to perform the equivalent sandboxing in a web view on mobile platforms! Ensuring that the sandbox is secure in a browser before loading code is important as well — older browsers may not support all the sandboxing needed.

I expect to iterate further on the widgets/plugins/host APIs model, probably to include a concept of composable libraries and access to data resources (images, fonts, audio/video files, even data from the wiki, etc).

The widget, plugin, and host API definitions will need to be stored and editable on-wiki — like a jsfiddle.net “fiddle”, editing can present them as 4-up windows of HTML, CSS, JS, and output — but with additional metadata for dependencies and localizable strings. I hope to use MediaWiki’s new “multi-content revisions” system to store the multiple components as separate content pieces of a single wiki page, versioned together.

Making sure that content widgets can be fairly easily ported to/from non-MediaWiki platforms would be really wise though. A git repo adapter? A single-file .html exporter? Embedding the content as offsite-embeddable iframes as well, without the host page having API or data access? Many possibilities. Is there prior art in this area?

Also need to work out the best way to instantiate objects in pages from the markup end. I’d like for widgets in article pages to act like media files in that you can create them with the [[File:blah]] syntax, size them, add borders and captions, etc. But it needs to integrate well with the multimedia viewer zoom view, etc (something I still need to fix for video too!)… and exposing them as a file upload seems weird. And what about pushing one-off data parameters in? Desirable or extra complication?

Anyway, check the on-wiki notes and feel free to poke me with questions, concerns, and ideas y’all!

BlueSpice 3.0.1 – our patch release with many exciting improvements

After the release of the long-awaited BlueSpice version 3.0 at the end of 2018, we are pleased to announce significant improvements to our enterprise wiki software with our first patch release.

In this article we give you a quick overview of the new features you can expect:

 

1. Changing between the editors

Editors can now switch between visual editing and text editing without having to (temporarily) save. This is very useful if you want to review wikitext or add additional features that are only available in wikitext.

2. Privacy center: data protection and GDPR

From now on wiki users will have improved control over their personal data stored in BlueSpice. Users can not only request anonymization and deletion of the account, but also retrieve all data stored about them. In addition, the management of approvals for cookies and data protection guidelines has been optimized.

Administrators can now better manage anonymization and deletion. In addition, data protection officers receive an overview of which users have given their consent to the use of cookies and the implementation of data protection policies.

The innovations mentioned cover essential requirements of the GDPR (general data protection regulation) and make it easier for administrators to comply with applicable regulations. In the upcoming BlueSpice updates, this function will be further optimized to meet the requirements and procedures of the GDPR. We are pleased to announce a separate article with further details on this topic which will be published in the upcoming weeks.

3. Customizable top navigation

The header area of your wiki can now be used to integrate an additional navigation area (top navigation). For example, you can integrate a central, cross-page navigation or place important links to other platforms. There are no limits to your creativity. The navigation is structured hierarchically and has a predefined number of navigation levels.

4. Revision status

With our new revision list quality managers can easily view the condition of all pages undergoing a quality check. The list can be filtered by both page name and approval status.

5. Navigation and access rights

In BlueSpice 3.0 some navigation areas were shown or hidden depending on the login status of a user. For example, users who were not logged in could not see the book list even if they had read permission. This problem is now fixed.

6. Backlinking

Users who navigate to a discussion page now have a link to return to the original page without having to press the back button. This makes it easier to navigate between pages and discussions.

 


What’s next?

Our goal is to publish patch releases at significantly shorter intervals in the future, so that even small improvements can be implemented more quickly in the productive system of our customers.

 

You are a customer already?

Customers who would like to benefit from the new features in BlueSpice 3.0.1 are kindly requested to contact their project manager via phone, use our ticket system or contact us via support@hallowelt.com.

 

Not a customer yet but interested in BlueSpice MediaWiki?

Please send an e-mail to sales@bluespice.com or give us a call at +49 (0) 941 660 80 197.

 

You can find further information about BlueSpice and our services on our website.


We look forward to working with you to make BlueSpice 3 better and better. Let’s Wiki together!

The post BlueSpice 3.0.1 – our patch release with many exciting improvements appeared first on BlueSpice Blog.

Defender of the Realm

04:51, Friday, 18 January 2019 UTC

I showed some iterations of ScriptinScript’s proposed object value representation, using native JS objects with a custom prototype chain to isolate the guest world’s JS objects. The more I looked, the more corner cases I saw… I thought of the long list of security issues with the old Caja transpiling embedding system, and decided it would be best to change course.

Not only are there a lot of things to get right to avoid leaking host objects, it’s simply a lot of work to create a mostly spec-compliant JavaScript implementation, and then to maintain it. Instead I plan to let the host JavaScript implementation run almost the entire show, using realms.

What’s a Realm?

Astute readers may have clicked on that link and noticed that the ECMAScript committee’s realms proposal is still experimental, with no real implementations yet… But realms are actually a part of JS already, there’s just no standard way to manipulate them! Every function is associated with a realm that it runs in, which holds the global object and the intrinsic objects we take for granted — say, Object. Each realm has its own instance of each of these instrinsics, so if an object from one realm does make its way to another realm, their prototype chains will compare differently.
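A quick way to see realm-specific intrinsics in action (using a same-origin <iframe> purely for illustration):

// Each frame gets its own copy of the intrinsics, so prototype chains from
// different realms never compare equal.
const frame = document.createElement('iframe');
document.body.appendChild(frame);

const GuestArray = frame.contentWindow.Array;
const arr = new GuestArray(1, 2, 3);

console.log(GuestArray === Array);  // false: different realm, different intrinsic
console.log(arr instanceof Array);  // false: arr's prototype chain is the guest one
console.log(Array.isArray(arr));    // true: brand checks still work across realms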

That sounds like what we were manually setting up last time, right? The difference is that when native host operations happen, like throwing an exception in a built-in function or auto-boxing a primitive value to an object, the created Error or String instance will have the realm-specific prototype without us having to guard for it and switch it around.

If we have a separate realm for the guest environment, then there are a lot fewer places we have to guard against getting host objects.

Getting a realm

There are a few possible ways we can manage to get ahold of a separate realm for our guest code:

  • Un-sandboxed <iframe>
  • Sandboxed <iframe>
  • Web Worker thread
  • ‘vm’ module for Node.js

It should be possible to combine some of these techniques, such as using the future-native Realm inside a Worker inside a sandboxed iframe, which can be further locked down with Content-Security-Policy headers!

Note that using sandboxed or cross-origin <iframe>s or Workers requires asynchronous messaging between host and guest, but is much safer than Realm or same-origin <iframe> because they prevent all object leakage.
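As a sketch of what that asynchronous messaging looks like with a plain Worker (not ScriptinScript’s actual API; in practice the Worker would itself be created from a sandboxed or cross-origin <iframe>, and eval here just stands in for running guest code):

// Host side: spin up a guest Worker from a blob and talk to it with messages.
// Only structured-cloneable data crosses the boundary, so host objects cannot leak.
const workerSource = `
    addEventListener('message', (event) => {
        const { id, expression } = event.data;
        let result;
        try {
            result = String(eval(expression)); // stand-in for running the guest program
        } catch (e) {
            result = 'error: ' + e.message;
        }
        postMessage({ id, result });
    });
`;
const blob = new Blob([workerSource], { type: 'application/javascript' });
const guest = new Worker(URL.createObjectURL(blob));

guest.addEventListener('message', (event) => {
    console.log('guest answered:', event.data);
});
guest.postMessage({ id: 1, expression: '6 * 7' });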

Similar techniques are used in existing projects like Oasis, to seemingly good effect.

Keep it secret! Keep it safe!

To keep the internal API for the guest environment clean and prevent surprise leakages to the host realm, it’s probably wise to clean up the global object namespace and the contents of the accessible intrinsics.

This is less important if cross-origin isolation and Content-Security-Policy are locked down carefully, but probably still a good idea.

For instance you probably want to hide some things from guest code:

  • the global message-passing handlers for postMessage to implement host APIs
  • fetch and XMLHttpRequest for network access
  • indexedDB for local-origin info
  • etc

In an <iframe> you would probably want to hide the entire DOM to create a fresh realm… But if it’s same-origin I don’t quite feel confident that intrinsics/globals can be safely cleaned up enough to avoid escapes. I strongly, strongly recommend using cross-origin or sandboxed <iframe> only! And a Worker that’s loaded from an <iframe> might be best.

In principle the realm can be “de-fanged” by walking through the global object graph and removing any property not on an allow list. Often you can also replace a constructor or method with an alternate implementation… as long as its intrinsic version won’t come creeping back somewhere. Engine code may throw exceptions of certain types, for instance, so they may need pruning in their details as well as pruning from the global tree itself.

In order to provide host APIs over postMessage, keep local copies of the global’s postMessage and addEventListener in a closure and set them up before cleaning the globals. Be careful in the messaging API to use only local variable references, no globals, to avoid guest code interfering with the messaging code.
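A minimal sketch of that bootstrap, assuming it runs first inside the guest Worker (the allow list and host API are placeholders; a real implementation would also have to prune prototypes and intrinsics, not just the global’s own properties):

(function bootstrap(global) {
    // Capture the messaging primitives before anything gets removed.
    const post = global.postMessage.bind(global);
    const listen = global.addEventListener.bind(global);

    listen('message', (event) => {
        // Host API dispatch would go here; it only uses the captured locals,
        // so later tampering with globals by guest code cannot interfere.
        post({ ok: true, echo: event.data });
    });

    // De-fang the global: delete own properties not on the allow list.
    const allowed = new Set(['Object', 'Array', 'Math', 'JSON', 'String',
                             'Number', 'Boolean', 'Error', 'undefined']);
    for (const name of Object.getOwnPropertyNames(global)) {
        if (!allowed.has(name)) {
            try { delete global[name]; } catch (e) { /* non-configurable */ }
        }
    }
})(self);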

Whither transpiling?

At this point, with the guest environment in a separate realm *and* probably a separate thread *and* with its globals and intrinsics squeaky clean… do we need to do any transpiling still?

It’s actually, I think, safe at that point to just pass JS code for strict mode or non-strict-mode functions in and execute it after the messaging kernel is set up. You should even be able to create runtime code with eval and the Function constructor without leaking anything to/from the host context!

Do we still even need to parse/transpile? Yes!

But the reason isn’t for safety, it’s more for API clarity, bundling, and module support… Currently there’s no way to load JS module code (with native import/export syntax) in a Worker, and there’s no way to override module URL-to-code resolution in <script type=”module”> in an <iframe>.

So to support modern JS modules for guest code, you’d need some kind of bundling… which is probably desired anyway for fetching common libraries and such… and which may be needed to combine the messaging kernel / globals cleanup bootstrap code with the guest code anyway.

There’s plenty of prior art on JS module -> bundle conversion, so this can either make use of existing tools or be inspired by it.

Debugging

If code is simply executed in the host engine, this means two things:

One, it’s hard to debug from within the web page because there aren’t tools for stopping the other thread and introspecting it.

Two, it’s easy to debug from within the web browser because the host debugger Just Works.

So this is probably good for Tools For Web Developers To Embed Stuff, but may be more difficult for Beginner’s Programming Tools (like the BASIC and LOGO environments of my youth) where you want to present a slimmed-down custom interface on the debugger.

Conclusions

Given a modern-browser target that supports workers, sandboxed iframes, etc, using those native host tools to implement sandboxing looks like a much, much better return on investment than continuing to implement a full-on interpreter or transpiler for in-process code.

This in some ways is a return to older plans I had, but the picture’s made a LOT clearer by not worrying about old browsers or in-process execution. Setting a minimal level of ES2017 support is something I’d like to do to expose a module-oriented system for libraries and APIs, async, etc but this isn’t strictly required.

I’m going to re-work ScriptinScript in four directions:

First, the embedding system using <iframe>s and workers for web or ‘vm’ for Node, with a messaging kernel and global rewriter.

Second, a module bundling frontend that produces ready-to-load-in-worker JS, that can be used client-side for interactive editing or server-side for pre-rendering. I would like to get the semantics of native JS modules right, but may approximate them as a simplification measure.

Third, a “Turtle World” demo implementing a much smaller interpreter for a LOGO-like language, connected to a host API implementing turtle graphics in SVG or <canvas>. This will scratch my itch to write an interpreter, but be a lot simpler to create and maintain. ;)

Finally, a MediaWiki extension that allows storing the host API and guest code for Turtle World in a custom wiki namespace and embedding them as media in articles.

I think this is a much more tractable plan, and can be tackled bit by bit.

Eliminating PHP polyfills

07:50, Thursday, 17 January 2019 UTC

The Symfony project has recently created a set of pure-PHP polyfills for both PHP extensions and newer language features. They allow developers to add requirements upon those functions or language additions without increasing the system requirements upon end users. For the most part, I think this is a good thing, and valuable to have. We've done similar things inside MediaWiki as well for CDB support, Memcached, and internationalization, just to name a few.

But the downside is that on platforms where it is possible to install the missing PHP extensions or upgrade PHP itself, we're shipping empty code. MediaWiki requires both the ctype and mbstring PHP extensions, and our servers have those, so there's no use in deploying polyfills for them, because they'll never be used. In September, Reedy and I replaced the polyfills with "unpolyfills" that simply provide the correct package, so the polyfill is skipped by composer. That removed about 3,700 lines of code from what we're committing, reviewing, and deploying - a big win.
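For illustration, one standard composer technique for skipping a polyfill when the native extension is guaranteed is to require the extension and declare the polyfill as replaced in the root composer.json; a generic sketch, not the exact MediaWiki change:

{
    "require": {
        "ext-ctype": "*",
        "ext-mbstring": "*"
    },
    "replace": {
        "symfony/polyfill-ctype": "*",
        "symfony/polyfill-mbstring": "*"
    }
}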

Last month I came across the same problem in Debian: #911832. The php-symfony-polyfill package was failing tests on the new PHP 7.3 and was up for removal from the next stable release (Buster). On its own, the package isn't too important, but it was a dependency of other important packages. In Debian, the polyfills are even more useless, since instead of depending upon e.g. php-symfony-polyfill-mbstring, packages could simply depend upon the native PHP extension, php-mbstring. In fact, there was already a system designed to implement those kinds of overrides. After looking at the dependencies, I uploaded a fixed version of php-webmozart-assert, filed bugs for two other packages, and provided patches for symfony. I also made a patch to the default overrides in pkg-php-tools, so that any future package that depends upon a symfony polyfill should now automatically depend upon the native PHP extension if necessary.

Ideally composer would support alternative requirements like ext-mbstring | php-symfony-polyfill-mbstring, but that's been declined by their developers. There's another issue that is somewhat related, but doesn't do much to reduce the installation of polyfills when unnecessary.

Incident Documentation: An Unexpected Journey

15:45, Wednesday, 16 January 2019 UTC

Introduction

The Release Engineering team wants to continually improve the quality of our software over time. One of the ways in which we hoped to do that this year is by creating more useful Selenium smoke tests. (From now on, test will be used instead of Selenium test.) This blog post is about how we determined where the tests should focus and the relative priority.

At first, I thought this would be a trivial task. A few hours of work. A few days at most. A week or two if I've completely underestimated it. A couple of months later, I know I have completely underestimated it.

Things I needed to do:

  • Define prioritization scheme.
  • Prioritize target repositories.

Define Prioritization Scheme

In general:

  • Does a repository have stewards? (Do the stewards want tests?)
  • Does a repository have existing tests?

For the last year:

  • How much change did happen for a repository? Simply put: more change can lead to more risk.
  • How many incidents is a repository connected to? We wanted to make sure we didn't miss any obvious problematic areas.

Does a Repository Have Stewards?

This was a relatively simple task. The best source of information is the Developers/Maintainers page.

Does a Repository Have Existing Tests?

This was also easy. The Selenium/Node.js page has a list of repositories that have tests in Node.js. I already had all repositories with Node.js and Ruby tests on my machine, so a quick search for webdriverio (Node.js) and mediawiki_selenium (Ruby) found all the tests. To be really sure I had found all repositories with tests, I cloned all repositories from Gerrit.

$ ack --json webdriverio
extensions/Echo/package.json
27:        "webdriverio": "4.12.0"
...
$ ack --type-add=lock:ext:lock --lock mediawiki_selenium
skins/MinervaNeue/Gemfile.lock
42:    mediawiki_selenium (1.7.3)
...

To make extra sure I have not missed any repositories, I've used MediaWiki code search (mediawiki_selenium, webdriverio) and GitHub search (org:wikimedia extension:lock mediawiki_selenium, org:wikimedia extension:json webdriverio)

This is the list.

Repository Language
mediawiki/core JavaScript
mediawiki/extensions/AdvancedSearch JavaScript
mediawiki/extensions/CentralAuth Ruby
mediawiki/extensions/CentralNotice Ruby
mediawiki/extensions/CirrusSearch JavaScript
mediawiki/extensions/Cite JavaScript
mediawiki/extensions/Echo JavaScript
mediawiki/extensions/ElectronPdfService JavaScript
mediawiki/extensions/GettingStarted Ruby
mediawiki/extensions/Math JavaScript
mediawiki/extensions/MobileFrontend Ruby
mediawiki/extensions/MultimediaViewer Ruby
mediawiki/extensions/Newsletter JavaScript
mediawiki/extensions/ORES JavaScript
mediawiki/extensions/Popups JavaScript
mediawiki/extensions/QuickSurveys Ruby
mediawiki/extensions/RelatedArticles JavaScript
mediawiki/extensions/RevisionSlider Ruby
mediawiki/extensions/TwoColConflict JavaScript, Ruby
mediawiki/extensions/Wikibase JavaScript, Ruby
mediawiki/extensions/WikibaseLexeme JavaScript, Ruby
mediawiki/extensions/WikimediaEvents PHP
mediawiki/skins/MinervaNeue Ruby
phab-deployment JavaScript
wikimedia/community-tech-tools Ruby
wikimedia/portals/deploy JavaScript

How Much Change Did Happen for a Repository?

After reviewing several tools, I've found that we already use Bitergia for various metrics. There is even a nice list of top 50 repositories by the number of commits. The tool even supports limiting the report from a date to a date. Exactly what I needed.

Bitergia > Last 90 days > Absolute > From 2017-11-01 00:00:00.000 > To 2018-10-31 23:59:59.999 > Go > Git > Overview > Repositories (raw data: P7776, direct link).

This is the top 50 list (excludes empty commits and bots).

Repository Commits
mediawiki/extensions 11300
operations/puppet 7988
mediawiki/core 4590
operations/mediawiki-config 4005
integration/config 1652
operations/software/librenms 1169
pywikibot/core 927
mediawiki/extensions/Wikibase 806
apps/android/wikipedia 789
mediawiki/services/parsoid 700
mediawiki/extensions/VisualEditor 692
operations/dns 653
VisualEditor/VisualEditor 599
mediawiki/skins 570
mediawiki/extensions/MobileFrontend 504
mediawiki/extensions/ContentTranslation 491
translatewiki 486
oojs/ui 469
wikimedia/fundraising/crm 457
mediawiki/extensions/BlueSpiceFoundation 414
mediawiki/extensions/CirrusSearch 357
mediawiki/extensions/AbuseFilter 306
phabricator/phabricator 302
mediawiki/services/restbase 290
mediawiki/extensions/Flow 232
mediawiki/extensions/Echo 223
mediawiki/vagrant 221
mediawiki/extensions/Popups 184
mediawiki/extensions/Translate 182
mediawiki/extensions/DonationInterface 180
analytics/refinery 178
mediawiki/extensions/PageTriage 177
mediawiki/extensions/Cargo 176
mediawiki/tools/codesniffer 156
mediawiki/extensions/TimedMediaHandler 152
mediawiki/extensions/UniversalLanguageSelector 142
mediawiki/vendor 140
mediawiki/extensions/SocialProfile 139
analytics/refinery/source 138
operations/software 137
mediawiki/services/restbase/deploy 136
operations/debs/pybal 123
mediawiki/extensions/CentralAuth 116
mediawiki/tools/release 116
mediawiki/services/cxserver 112
mediawiki/extensions/BlueSpiceExtensions 110
mediawiki/extensions/WikimediaEvents 110
labs/private 108
operations/debs/python-kafka 104
labs/tools/heritage 96

I got similar results by running git rev-list for all repositories (script, results: P7834).
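For reference, the per-repository counting boils down to something like the following (a sketch of the kind of command involved, not the actual script):

$ git rev-list --count --no-merges --since=2017-11-01 --until=2018-10-31 HEAD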

How Many Incidents Is a Repository Connected To?

This proved to be the most time consuming task.

I started by reviewing the existing incident documentation. Take a look at a few incidents: can you tell which incident report is connected to which repository? I couldn't. (If you can, please let me know. I need your help.)

Incident reports are a wall of text. It was really hard for me to connect an incident report to a repository. An incident report has a title and text, for example 20180724-Train. The text has several sections, including Actionables, and contains links to Gerrit patches and Phabricator tasks. (From now on, I'll use patches instead of Gerrit patches and tasks instead of Phabricator tasks.)

A patch belongs to a repository. The wikitext [[gerrit:448103]] is the patch mediawiki/extensions/Wikibase/+/448103, so the repository is mediawiki/extensions/Wikibase. That is the strongest link between an incident and a repository.

A task usually has patches associated with it. The wikitext [[phab:T181315]] is the task T181315. The Gerrit search bug:T181315 finds many connected patches, many of them in operations/puppet and one in mediawiki/vagrant. That is a useful, but not a strong, link between an incident and a repository. Some tasks have several related patches, so they provide a lot of data.

A task also usually has several tags. Most of them are not useful in this context, but tags that are components (and not, for example, milestones) could be useful, if the component can be linked to a repository. It is also not a strong link between an incident and a repository, and it usually does not provide a lot of data.

In the end, I wrote a tool with the imaginative name Incident Documentation. The tool currently collects data from the patches and tasks in the Actionables section of the incident report. It does not collect data from task components; that is tracked as issue #5.

Incident Review 2017-11-01 to 2018-10-31

After reviewing the Actionables section for each incident report, with the related patches and tasks, here are the results. Please note this table only connects incident reports and repositories. It does not show how many patches from a repository are connected to an incident report; that is tracked as issue #11.

Repository Incidents
operations/puppet 22
mediawiki/core 6
operations/mediawiki-config 4
mediawiki/extensions/Wikibase 4
wikidata/query/rdf 2
operations/debs/pybal 2
mediawiki/extensions/ORES 2
integration/config 2
wikidata/query/blazegraph 1
operations/software 1
operations/dns 1
mediawiki/vagrant 1
mediawiki/tools/release 1
mediawiki/services/ores/deploy 1
mediawiki/services/eventstreams 1
mediawiki/extensions/WikibaseQualityConstraints 1
mediawiki/extensions/PropertySuggester 1
mediawiki/extensions/PageTriage 1
mediawiki/extensions/Cognate 1
mediawiki/extensions/Babel 1
maps/tilerator/deploy 1
maps/kartotherian/deploy 1
integration/jenkins 1
eventlogging 1
analytics/refinery/source 1
analytics/refinery 1
All-Projects 1

Selecting Repositories

This table is sorted by the amount of change. The only column that needs explanation is Selected: it shows whether a test makes sense for the repository, taking into account all available data. Repositories without maintainers and with existing tests are excluded.

Repository Change Stewards Coverage Incidents Selected
mediawiki/extensions 11300
operations/puppet 7988 SRE 22
mediawiki/core 4590 Core Platform JavaScript 6
operations/mediawiki-config 4005 Release Engineering 4
integration/config 1652 Release Engineering 2
operations/software/librenms 1169 SRE
pywikibot/core 927
mediawiki/extensions/Wikibase 806 WMDE JavaScript, Ruby 4
apps/android/wikipedia 789
mediawiki/services/parsoid 700 Parsing
mediawiki/extensions/VisualEditor 692 Editing
operations/dns 653 SRE 1
VisualEditor/VisualEditor 599 Editing
mediawiki/skins 570 Reading
mediawiki/extensions/MobileFrontend 504 Reading Ruby
mediawiki/extensions/ContentTranslation 491 Language engineering
translatewiki 486
oojs/ui 469
wikimedia/fundraising/crm 457 Fundraising tech
mediawiki/extensions/BlueSpiceFoundation 414
mediawiki/extensions/CirrusSearch 357 Search Platform JavaScript
mediawiki/extensions/AbuseFilter 306 Contributors
phabricator/phabricator 302 Release Engineering
mediawiki/services/restbase 290 Core Platform
mediawiki/extensions/Flow 232 Growth
mediawiki/extensions/Echo 223 Growth JavaScript
mediawiki/vagrant 221 Release Engineering 1
mediawiki/extensions/Popups 184 Reading JavaScript
mediawiki/extensions/Translate 182 Language engineering
mediawiki/extensions/DonationInterface 180 Fundraising tech
analytics/refinery 178 Analytics 1
mediawiki/extensions/PageTriage 177 Growth 1
mediawiki/extensions/Cargo 176
mediawiki/tools/codesniffer 156
mediawiki/extensions/TimedMediaHandler 152 Reading
mediawiki/extensions/UniversalLanguageSelector 142 Language engineering
mediawiki/vendor 140
mediawiki/extensions/SocialProfile 139
analytics/refinery/source 138 Analytics 1
operations/software 137 SRE 1
mediawiki/services/restbase/deploy 136 Core Platform
operations/debs/pybal 123 SRE 2
mediawiki/extensions/CentralAuth 116 Ruby
mediawiki/tools/release 116 1
mediawiki/services/cxserver 112
mediawiki/extensions/BlueSpiceExtensions 110
mediawiki/extensions/WikimediaEvents 110 PHP
labs/private 108
operations/debs/python-kafka 104 SRE
labs/tools/heritage 96

Since some of the repositories connected to incidents are not in the top 50 Bitergia report, I've used git rev-list to sort them. Numbers are different because Bitergia excludes empty commits and bots (script, results: P7834).

Repository Change Stewards Coverage Incidents Selected
mediawiki/extensions/WikibaseQualityConstraints 910 WMDE 1
mediawiki/extensions/ORES 364 Growth JavaScript 2
wikidata/query/rdf 204 WMDE 2
mediawiki/extensions/Babel 146 Editing 1
mediawiki/services/ores/deploy 84 Growth 1
maps/kartotherian/deploy 80 1
mediawiki/extensions/PropertySuggester 67 WMDE 1
maps/tilerator/deploy 61 1
mediawiki/extensions/Cognate 47 WMDE 1
All-Projects 37 1
eventlogging 26 1
integration/jenkins 19 Release Engineering 1
mediawiki/services/eventstreams 16 1
wikidata/query/blazegraph 10 WMDE 1

Prioritize Repositories

Change column uses Bitergia numbers. Numbers in italic are from git rev-list.

Repository Change Stewards Coverage Incidents Selected
mediawiki/extensions/VisualEditor 692 Editing
mediawiki/extensions/ContentTranslation 491 Language engineering
mediawiki/extensions/AbuseFilter 306 Contributors
phabricator/phabricator 302 Release Engineering
mediawiki/extensions/Flow 232 Growth
mediawiki/extensions/Translate 182 Language engineering
mediawiki/extensions/DonationInterface 180 Fundraising tech
mediawiki/extensions/PageTriage 177 Growth 1
mediawiki/extensions/TimedMediaHandler 152 Reading
mediawiki/extensions/UniversalLanguageSelector 142 Language engineering
mediawiki/extensions/WikibaseQualityConstraints 910 WMDE 1
mediawiki/extensions/Babel 146 Editing 1
mediawiki/extensions/PropertySuggester 67 WMDE 1
mediawiki/extensions/Cognate 47 WMDE 1

The same table grouped by stewards.

Repository Change Stewards Coverage Incidents Selected
mediawiki/extensions/VisualEditor 692 Editing
mediawiki/extensions/Babel 146 Editing 1
mediawiki/extensions/ContentTranslation 491 Language engineering
mediawiki/extensions/Translate 182 Language engineering
mediawiki/extensions/UniversalLanguageSelector 142 Language engineering
mediawiki/extensions/AbuseFilter 306 Contributors
phabricator/phabricator 302 Release Engineering
mediawiki/extensions/Flow 232 Growth
mediawiki/extensions/PageTriage 177 Growth 1
mediawiki/extensions/DonationInterface 180 Fundraising tech
mediawiki/extensions/TimedMediaHandler 152 Reading
mediawiki/extensions/WikibaseQualityConstraints 910 WMDE 1
mediawiki/extensions/PropertySuggester 67 WMDE 1
mediawiki/extensions/Cognate 47 WMDE 1

Conclusions

  • There are some repositories that do not fit the Selenium/end-to-end testing model (eg: operations/puppet or operations/mediawiki-config) but could benefit from other testing mechanisms or deployment practices.
  • A test could prevent an outage if it runs:
    • Every time a patch is uploaded to Gerrit. That way it could find a problem during development. That is already done for repositories that have tests.
    • After deployment. That way it could find a problem that was not found during development. In the ideal case, deployment would be made to a test server in production and a test would run targeting the test server. If it fails, further deployment would be cancelled. This is not yet done.
  • Automattic runs tests targeting WordPress.com production:

We decided to implement some basic e2e test scenarios which would only run in production – both after someone deploys a change and a few times a day to cover situations where someone makes some changes to a server or something.

Next steps:

  • I will contact owners of selected repositories (see Prioritize Repositories section) and offer help in creating the first test.
  • I will add results from Incident Documentation tool to incident reports as a new Related Repositories section. The section will link to the tool and explain how it got the data. It will also ask for edits if the data is not correct.
  • I will reach out to people that created (or edited) incident reports and ask them to populate Related Repositories section. This might have mixed results. For best results, the section will already be populated with the data from Incident Documentation tool.
  • I will add Related Repositories section to the incident report template.

Incident Documentation tool improvements:

  • There are several ways to link from a wiki page to a patch or task. The tool for now only supports [[gerrit:]] and [[phab:]]. Tracked as issue #6.
  • Gerrit patches and Phabricator tasks from the Actionables section do not provide enough data; the entire incident report should be used. I limited it at first because I was collecting data manually (and Actionables looked like the most important part of the incident report), and later because of #6. Tracked as issue #4.
  • Find Gerrit repository from task component. Tracked as issue #5.
  • A table with the number of patches from each repository would be helpful. Tracked as issue #11.
  • A report with folder/file names from a repository that are mentioned the most. Especially useful for big repositories like operations/puppet and mediawiki/core. Tracked as issue #12.

ScriptinScript value representation

20:20, Tuesday, 15 January 2019 UTC

As part of my long-running side quest to make a safe, usable environment for user-contributed scripted widgets for Wikipedia and other web sites, I’ve started working on ScriptinScript, a modern JavaScript interpreter written in modern JavaScript.

It’ll be a while before I have it fully working, as I’m moving from a seat-of-the-pants proof of concept into something actually based on the language spec… After poking a lot at the spec details of how primitives and objects work, I’m pretty sure I have a good idea of how to represent guest JavaScript values using host JavaScript values in a safe, spec-compliant way.

Primitives

JavaScript primitive types — numbers, strings, symbols, null, and undefined — are suitable to represent themselves; pretty handy! They’re copyable and don’t expose any host environment details.

Note that when you do things like reading str.length or calling str.charCodeAt(index) per spec it’s actually boxing the primitive value into a String object and then calling a method on that! The primitive string value itself has no properties or methods.
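A tiny illustration of that boxing behaviour:

// The primitive itself has no methods; property access wraps it in a
// temporary String object behind the scenes.
let str = 'hello';
console.log(typeof str);            // 'string' (still a primitive)
console.log(str.length);            // 5, via implicit boxing
console.log(str.charCodeAt(1));     // 101, also via boxing
console.log(str instanceof String); // false: the primitive is not the box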

Objects

Objects, though. Ah now that’s tricky. A JavaScript object is roughly a hash map of properties indexed with string or symbol primitives, plus some internal metadata such as a prototype chain relationship with other objects.

The prototype chain is similar, but oddly unlike, class-based inheritance typical in many other languages.

Somehow we need to implement the semantics of JavaScript objects as JavaScript objects, though the actual API visible to other script implementations could be quite different.

First draft: spec-based

My initial design modeled the spec behavior pretty literally, with prototype chains and property descriptors to be followed step by step in the interpreter.

Guest property descriptors live as properties of a this.props sub-object created with a null prototype, so things on the host Object prototype or the custom VMObject wrapper class don’t leak in.

If a property doesn’t exist on this.props when looking it up, the interpreter will follow the chain down through this.Prototype. Once a property descriptor is found, it has to be examined for the value or get/set callables, and handled manually.

// VMObject is a regular class
[VMObject] {
    // "Internal slots" and implementation details
    // as properties directly on the object
    machine: [Machine],
    Prototype: [VMObject] || null,

    // props contains only own properties
    // so prototype lookups must follow this.Prototype
    props: [nullproto] {
        // prop values are virtual property descriptors
        // like you would pass to Object.defineProperty()
        aDataProp: {
            value: [VMObject],
            writable: true,
            enumerable: true,
            configurable: true,
        },
        anAccessorProp: {
            get: [VMFunction],
            set: [VMFunction],
            enumerable: true,
            configurable: true,
        },
    },
}

Prototype chains

Handling of prototype chains in property lookups can be simplified by using native host prototype chains on the props object that holds the property descriptors.

Instead of Object.create(null) to make props, use Object.create(this.Prototype ? this.Prototype.props : null).

The object layout looks about the same as above, except that props itself has a prototype chain.
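A minimal sketch of that lookup path, using the hypothetical fields from the layout above (machine.call is an assumed interpreter helper, not a real API):

class VMObject {
    constructor(machine, Prototype = null) {
        this.machine = machine;
        this.Prototype = Prototype;
        // Descriptors chain natively, so the host engine walks inherited
        // properties for us instead of a manual loop over this.Prototype.
        this.props = Object.create(Prototype ? Prototype.props : null);
    }

    getProperty(key, receiver = this) {
        const desc = this.props[key]; // own or inherited descriptor
        if (desc === undefined) {
            return undefined;
        }
        if ('value' in desc) {
            return desc.value;
        }
        // Accessor descriptor: call back into the interpreter.
        return desc.get ? this.machine.call(desc.get, receiver, []) : undefined;
    }
}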

Property descriptors

We can go a step further, using native property descriptors which lets us model property accesses as direct loads and stores etc.

Object.defineProperty can be used directly on this.props to add native property descriptors including support for accessors by using closure functions to wrap calls into the interpreter.

This should make property gets and sets faster and awesomer!

Proper behavior should be retained as long as operations that can affect property descriptor handling are forwarded to props, such as calling Object.preventExtensions(this.props) when the equivalent guest operation is called on the VMObject.
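Roughly, defining a guest property then becomes a direct Object.defineProperty call on props, with guest accessors wrapped in closures (again assuming a hypothetical machine.call helper):

function defineGuestProperty(machine, vmobj, key, desc) {
    if (desc.get || desc.set) {
        Object.defineProperty(vmobj.props, key, {
            get: desc.get && (() => machine.call(desc.get, vmobj, [])),
            set: desc.set && ((v) => machine.call(desc.set, vmobj, [v])),
            enumerable: desc.enumerable,
            configurable: desc.configurable,
        });
    } else {
        Object.defineProperty(vmobj.props, key, {
            value: desc.value,
            writable: desc.writable,
            enumerable: desc.enumerable,
            configurable: desc.configurable,
        });
    }
}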

Native objects

At this point, our inner props object is pretty much the “real” guest object, with all its properties and an inheritance chain.

We could instead have a single object which holds both “internal slots” and the guest properties…

let MachineRef = Symbol('MachineRef');

// VMObject is prototyped on a null-prototype object
// that does not descend from host Object, and which
// is named 'Object' as well from what guest can see.
// Null-proto objects can also be used, as long as
// they have the marker slots.
let VMObject = function Object(val) {
    return VMObject[MachineRef].ToObject(val);
};
VMObject[MachineRef] = machine;
VMObject.prototype = Object.create(null);
VMObject.prototype[MachineRef] = machine;
VMObject.prototype.constructor = VMObject;

[VMObject] || [nullproto] {
    // "Internal slots" and implementation details
    // as properties indexed by special symbols.
    // These will be excluded from enumeration and
    // the guest's view of own properties.
    [MachineRef]: [Machine],

    // prop values are stored directly on the object
    aDataProp: [VMObject],
    // use native prop descriptors, with accessors
    // as closures wrapping the interpreter.
    get anAccessorProp: [Function],
    set anAccessorProp: [Function],
}

The presence of the symbol-indexed [MachineRef] property tells host code in the engine that a given object belongs to the guest and is safe to use — this should be checked at various points in the interpreter like setting properties and making calls, to prevent dangerous scenarios like exposing the native Function constructor to create new host functions, or script injection via DOM innerHTML properties.

Functions

There’s an additional difficulty, which is function objects.

Various properties will want to be host-callable functions — things like valueOf and toString. You may also want to expose guest functions directly to host code… but if we use VMObject instances for guest function objects, then there’s no way to make them directly callable by the host.

Function re-prototyping

One possibility is to outright represent guest function objects using host function objects! They’d be closures wrapping the interpreter, and ‘just work’ from host code (though possibly careful in how they accept input).

However we’d need a function object that has a custom prototype, and there’s no way to create a function object that way… but you can change the prototype of a function that already has been instantiated.

Everyone says don’t do this, but you can. ;)

let MachineRef = Symbol('MachineRef');

// Create our own prototype chain...
let VMObjectPrototype = Object.create(null);
let VMFunctionPrototype = Object.create(VMObjectPrototype);

function guestFunc(func) {
    // ... and attach it to the given closure function!
    Reflect.setPrototypeOf(func, VMFunctionPrototype);

    // Also save our internal marker property.
    func[MachineRef] = machine;
    return func;
}

// Create our constructors, which do not descend from
// the host Function but rather from VMFunction!
let VMObject = guestFunc(function Object(val) {
    let machine = VMObject[MachineRef];
    return machine.ToObject(val);
});

let VMFunction = guestFunc(function Function(src) {
    throw new Error('Function constructor not yet supported');
});

VMFunction.prototype = VMFunctionPrototype;
VMFunctionPrototype.constructor = VMFunction;

VMObject.prototype = VMObjectPrototype;
VMObjectPrototype.constructor = VMObject;

This seems to work but feels a bit … freaky.

Function proxying

An alternative is to use JavaScript’s Proxy feature to make guest function objects into a composite object that works transparently from the outside:

let MachineRef = Symbol('MachineRef');

// Helper function to create guest objects
function createObj(proto) {
    let obj = Object.create(proto);
    obj[MachineRef] = machine;
    return obj;
}

// We still create our own prototype chain...
let VMObjectPrototype = createObj(null);
let VMFunctionPrototype = createObj(VMObjectPrototype);

// Wrap our host implementation functions...
function guestFunc(func) {
    // Create a separate VMFunction instance instead of
    // modifying the original function.
    //
    // This object is not callable, but will hold the
    // custom prototype chain and non-function properties.
    let obj = createObj(VMFunctionPrototype);

    // ... now wrap the func and the obj together!
    return new Proxy(func, {
        // In order to make the proxy object callable,
        // the proxy target is the native function.
        //
        // The proxy automatically forwards function calls
        // to the target, so there's no need to include an
        // 'apply' or 'construct' handler.
        //
        // However we have to divert everything else to
        // the VMFunction guest object.
        defineProperty: function(target, key, descriptor) {
            if (target.hasOwnProperty(key)) {
                return Reflect.defineProperty(target, key, descriptor);
            }
            return Reflect.defineProperty(obj, key, descriptor);
        },
        deleteProperty: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.deleteProperty(target, key);
            }
            return Reflect.deleteProperty(obj, key);
        },
        get: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.get(target, key);
            }
            return Reflect.get(obj, key);
        },
        getOwnPropertyDescriptor: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.getOwnPropertyDescriptor(target, key);
            }
            return Reflect.getOwnPropertyDescriptor(obj, key);
        },
        getPrototypeOf: function(target) {
            return Reflect.getPrototypeOf(obj);
        },
        has: function(target, key) {
            if (target.hasOwnProperty(key)) {
                return Reflect.has(target, key);
            }
            return Reflect.has(obj, key);
        },
        isExtensible: function(target) {
            return Reflect.isExtensible(obj);
        },
        ownKeys: function(target) {
            return Reflect.ownKeys(target).concat(
                Reflect.ownKeys(obj)
            );
        },
        preventExtensions: function(target) {
            return Reflect.preventExtensions(target) &&
                Reflect.preventExtensions(obj);
        },
        set: function(target, key, val, receiver) {
            if (target.hasOwnProperty(key)) {
                return Reflect.set(target, key, val, receiver);
            }
            return Reflect.set(obj, key, val, receiver);
        },
        setPrototypeOf: function(target, proto) {
            return Reflect.setPrototypeOf(obj, proto);
        },
    });
}

// Create our constructors, which now do not descend from
// the host Function but rather from VMFunction!
let VMObject = guestFunc(function Object(val) {
    // The actual behavior of Object() is more complex ;)
    return VMObject[MachineRef].ToObject(val);
});

let VMFunction = guestFunc(function Function(args, src) {
    // Could have the engine parse and compile a new guest func...
    throw new Error('Function constructor not yet supported');
});

// Set up the circular reference between
// the constructors and prototypes.
VMFunction.prototype = VMFunctionPrototype;
VMFunctionPrototype.constructor = VMFunction;
VMObject.prototype = VMObjectPrototype;
VMObjectPrototype.constructor = VMObject;

There’s more details to work out, like filling out the VMObject and VMFunction prototypes, ensuring that created functions always have a guest prototype property, etc.

Note that implementing the engine in JS’s “strict mode” means we don’t have to worry about bridging the old-fashioned arguments and caller properties, which otherwise couldn’t be replaced by the proxy because they’re non-configurable.

My main worries with this layout are that it’ll be hard to tell host from guest objects in the debugger, since the internal constructor names are the same as the external constructor names… the [MachineRef] marker property should help though.

And secondarily, it’s easier to accidentally inject a host object into a guest object’s properties or a guest function’s arguments…

Blocking host objects

We could protect guest objects from injection of host objects using another Proxy:

function wrapObj(obj) {
    return new Proxy(obj, {
        defineProperty: function(target, key, descriptor) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(descriptor.value) ||
                !machine.isGuestVal(descriptor.get) ||
                !machine.isGuestVal(descriptor.set)
            ) {
                throw new TypeError('Cannot define property with host object as value or accessors');
            }
            return Reflect.defineProperty(target, key, descriptor);
        },
        set: function(target, key, val, receiver) {
            // invariant: key is a string or symbol
            let machine = target[MachineRef];
            if (!machine.isGuestVal(val)) {
                throw new TypeError('Cannot set property to host object');
            }
            return Reflect.set(target, key, val, receiver);
        },
        setPrototypeOf: function(target, proto) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(proto)) {
                throw new TypeError('Cannot set prototype to host object');
            }
            return Reflect.setPrototypeOf(target, proto);
        },
    });
}
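
As a usage sketch (assuming, as elsewhere in the post, that createObj tags each new object with the [MachineRef] marker pointing back at its machine; that tagging isn’t shown here), a guest object would be wrapped once at creation time so that later writes from host code go through the checking proxy:

let guestObj = wrapObj(createObj(VMObjectPrototype));
guestObj.answer = 42;       // fine: primitives count as guest values
guestObj.oops = new Date(); // throws TypeError: Cannot set property to host object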

This may slow down access to the object, however. Need to benchmark and test some more and decide whether it’s worth it.

For functions, can also include the `apply` and `construct` traps to check for host objects in arguments:

function guestFunc(func) {
    let obj = createObj(VMFunctionPrototype);
    return new Proxy(func, {
        //
        // ... all the same traps as wrapObj and also:
        //
        apply: function(target, thisValue, args) {
            let machine = target[MachineRef];
            if (!machine.isGuestVal(thisValue)) {
                throw new TypeError('Cannot call with host object as "this" value');
            }
            for (let arg of args) {
                if (!machine.isGuestVal(arg)) {
                    throw new TypeError('Cannot call with host object as argument');
                }
            }
            return Reflect.apply(target, thisValue, args);
        },
        construct: function(target, args, newTarget) {
            let machine = target[MachineRef];
            for (let arg of args) {
                if (!machine.isGuestVal(arg)) {
                    throw new TypeError('Cannot construct with host object as argument');
                }
            }
            if (!machine.isGuestVal(newTarget)) {
                throw new TypeError('Cannot construct with host object as new.target');
            }
            return Reflect.construct(target, args, newTarget);
        },
    });
}
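
The traps above lean on machine.isGuestVal(), which isn’t defined anywhere in this post; as a rough sketch (my guess at the logic, not the actual implementation), it could treat primitives as always safe and require anything object-like to carry this machine’s [MachineRef] marker:

class Machine {
    // ... (MachineRef is the Symbol defined earlier in the post)
    isGuestVal(val) {
        // Primitives (numbers, strings, booleans, symbols, undefined, null)
        // carry no host capabilities, so they always pass.
        if (val === null || (typeof val !== 'object' && typeof val !== 'function')) {
            return true;
        }
        // Objects and functions must be tagged as belonging to this machine.
        return val[MachineRef] === this;
    }
}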

Exotic objects

There are also “exotic objects”, proxies, and other funky things like Arrays that need to handle properties differently from a native object… I’m pretty sure they can all be represented using proxies.

Next steps

I need to flesh out the code a bit more using the new object model, and start on spec-compliant versions of interpreter operations to get through a few simple test functions.

Once that’s done, I’ll start pushing up the working code and keep improving it. :)

Update (benchmarks)

I did some quick benchmarks and found that, at least in Node 11, swapping out the Function prototype doesn’t appear to harm call performance, while using a Proxy adds a fair amount of overhead to short calls.

$ node protobench.js 
empty in 22 ms
native in 119 ms
guest in 120 ms

$ node proxybench.js
empty in 18 ms
native in 120 ms
guest in 1075 ms
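
Roughly what such a micro-benchmark looks like (a reconstruction for illustration; the actual protobench.js / proxybench.js scripts may differ): time a large number of short calls through a plain function, the same function with its prototype swapped out, and the same function behind a pass-through Proxy.

function bench(label, fn) {
    const n = 1e7;
    let sum = 0;
    const start = Date.now();
    for (let i = 0; i < n; i++) {
        sum += fn(i);
    }
    console.log(label + ' in ' + (Date.now() - start) + ' ms');
    return sum;
}

const native = function (x) { return x + 1; };

// Same body, but with its prototype swapped out, roughly what guest
// functions get in the non-proxy layout.
const swapped = function (x) { return x + 1; };
Object.setPrototypeOf(swapped, Object.create(Function.prototype));

// Same body again, behind a do-nothing Proxy.
const proxied = new Proxy(function (x) { return x + 1; }, {});

bench('empty', function (x) { return x; });
bench('native', native);
bench('swapped prototype', swapped);
bench('proxied', proxied);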

This may not be significant when functions have to go through the interpreter anyway, but I’ll consider whether the proxy is needed and weigh the options…

Update 2 (benchmarks)

Note that the above benchmarks don’t reflect another issue: de-optimization of call sites that accept user-provided callbacks. If you sometimes pass them regular functions and other times pass them re-prototyped or proxied objects, those call sites can switch optimization modes and end up slightly slower even when passed regular functions.

If you know you’re going to pass a guest object into a separate place where it may be interchangeable with a native host function, you can make a native wrapper closure around the guest call, which should avoid this (a sketch follows below).
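
A minimal sketch of that workaround (the helper name is mine, not part of ScriptinScript): wrap the guest call in an ordinary host closure, so hot call sites only ever see plain Function objects and keep a stable shape.

// Hypothetical helper: the re-prototyped or proxied guest function is only
// touched inside the wrapper, so the outer call site stays monomorphic.
function wrapForHost(guestFn) {
    return function (...args) {
        return guestFn.apply(this, args);
    };
}

// e.g. pass wrapForHost(guestCallback) to Array.prototype.map or an event
// listener instead of the guest function itself.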

15 January 2019 marks the eighteenth birthday of Wikipedia—and the fourth year of #1Lib1Ref, an annual campaign that asks everyone to jump in and improve Wikipedia by adding at least one citation. In doing so, they help improve the reliability and authenticity of the site for billions of readers.

Though literally meaning “One Librarian, One Reference”, #1Lib1Ref has grown to include archivists, professors, researchers, and Wikimedia volunteer editors interested in an entertaining way to effect change on Wikipedia.

You can participate in the campaign through five easy steps:

  1. Find an article that needs a citation, using Citation Hunt
  2. Find a reliable source that can support that article
  3. Add a citation using referencing tools
  4. Add the project hashtag #1Lib1Ref in the Wikipedia edit summary
  5. Share your edit(s) on social media and invite others to participate!

 

Video by Felix Nartey/Jessamyn West/Wikimedia Foundation, CC BY-SA 4.0. The video may not play on certain internet browsers. If you are having trouble, please watch it directly on Wikimedia Commons.

Need more help? The new resources page has materials that can aid anybody who wishes to participate, irrespective of their experience level with editing Wikipedia.

Event organizers and community leaders can register their #1Lib1Ref events and activities on an outreach dashboard that allows for easy metrics and activity tracking, and we invite all participants or organizers to share stories of their achievements and pain points through a new feedback form. Participants can also sign up to a community event or program near them.

———

#1Lib1Ref has proven to be a key catalyst in directly improving references on Wikipedia and in raising the level of trust that librarians and similar professional circles around our movement have in the encyclopedia. We’re very excited to see where this year’s campaign brings us. Wikipedia is an invaluable part of every researcher’s process, and more credible and reliable references help bring us closer to our dream of creating “the sum of all human knowledge.”

Felix Nartey, Global Coordinator, The Wikipedia Library
Wikimedia Foundation

Wikipedia Day: a year in review

17:43, Tuesday, 15 2019 January UTC

Every year on January 15, we celebrate Wikipedia’s birthday. It takes thousands all around the world to make Wikipedia the resource that it is; Wikipedia Day is a great time to recognize all that hard work and successful collaboration. It’s also a day to speak to the importance of freely available knowledge and to continue conversations about how we can further Wikipedia’s purpose as a community. On Wikipedia’s 18th birthday, we’d like to share what Wiki Education has been up to over the last year to help achieve the vision of a world where everyone has access to free, accurate knowledge.

Our new strategy

Wikidata education for librarians group at WikiCite 2018

We announced our new strategy, which will shape our work for the next three years. We will increase knowledge equity by focusing on content and communities that are underrepresented on Wikipedia and Wikidata; provide people who seek knowledge online with accurate information in topic areas that are underdeveloped; and reach large audiences with free knowledge by making Wikipedia and Wikidata more complete.

Part of our new strategic direction also includes the development of Wikidata-focused programs. We kicked off this program development with a visit to Wayne State University, where we are beginning a collaboration with the School of Information Sciences around integrating Wikidata assignments into the curriculum. We also attended WikiCite, where we had great conversations about the amazing work that’s happening globally related to citations, open knowledge, and structured data.

Training academics to channel their expertise into Wikipedia

We were thrilled to launch our professional development program this last year, a series of courses for academics, researchers, and other scholars to learn how to contribute their expertise to Wikipedia. Our approach was featured in William Beutler’s “Top Ten Wikipedia Stories of 2018”. It’s a model that offers a potential solution to engaging more academics and subject-matter experts in Wikipedia editing. These professionals target highly trafficked, complex topics that student editors don’t necessarily have the skills to tackle in our Wikipedia Student Program (formerly named the Classroom Program). And so far, we’ve received a lot of positive feedback from course participants about the value of the collaborative learning experience. It seems the rest of the Wikipedia community is as eager as we are to see where the venture goes.

Our Wikipedia Student Program is as booming as ever

We supported more instructors and student editors than ever before in our Wikipedia Student Program. More than 16,000 student editors added more than 13 million words to more than 16,000 articles on Wikipedia, our highest numbers to date.

Our partnership with the National Women’s Studies Association (one of the many academic associations we work with) was featured in the Chronicle of Higher Education last March. The article shows the impact of our efforts so far to engage women’s and gender studies students around the country to channel their classwork into the public resource that is Wikipedia. Since 2014, the partnership has yielded more than 4.4 million words added to Wikipedia to help close the gender gap.

We also hit a major milestone last April: student editors in our Wikipedia Student Program have officially contributed more words to Wikipedia since our program’s inception in 2010 than were published in the last print edition of the Encyclopædia Britannica. Executive Director Frank Schulenburg wrote about the significance of this milestone in the context of encyclopedic history and conversations around open access.

Reinforcing our tools so that more people can do better work on Wikipedia, worldwide

We continually seek feedback from the thousands of people who use our Dashboard so that we can improve it to fit new and changing needs. A portion of these continual improvements is made through our tech mentorship program, which engages new coders in our open source project. An example of a great feature to come out of one of these mentorships last year is Google Summer of Code student Pratyush Singhal’s Article Finder. This is a tool on our Dashboard which will help newcomers find Wikipedia articles in need of development in an automated, straightforward way.

We also embarked on an exciting new journey for our software development team. Chief Technology Officer Sage Ross hired Wes Reid as our new Software Developer. The expansion of our tech team will allow for more integral changes to our Dashboard in the near future, enabling the thousands who use it worldwide to learn Wikipedia editing and track their contributions with more efficiency.

Connecting with our fellow community members

We attended numerous academic conferences this year to invite instructors and researchers to join our programs. Multiple alumni from our professional development courses presented with our staff at these conferences, speaking to the skills and enthusiasm that the experience fostered. Participating instructors in our Wikipedia Student Program presented about incorporating Wikipedia editing into their curricula, as well.

Dr. Jenn Brandt presents at NWSA’s annual meeting about her experience in our professional development course.
Alum from one of our professional development courses visits our booth at the National Communication Association annual convention.
At the Midwest Political Science Association’s annual conference, Director of Partnerships Jami Mathewson served as discussant alongside three instructors who are teaching with Wikipedia: Dr. Jinu Abraham, Dr. Matthew Bergman, and Dr. Megan Osterbur.

We also had the privilege of attending multiple Wikipedian-centered conferences and events last year. In April, Executive Director Frank Schulenburg and Sage traveled to Berlin to participate in the annual Wikimedia Conference. It was a great opportunity to join other leaders of the global Wikimedia movement, both to speak to the present and future of the global Programs & Events Dashboard that we maintain and to discuss the strategic direction of the movement.

Group photo at the Wikimedia Conference 2018 in Berlin.
Image: File:Wikimedia Conference 2018, Group photo.jpg, Jason Krüger, CC BY-SA 4.0, via Wikimedia Commons.

Then, during October, we had a wonderful opportunity to interact with the North American Wikimedia community at WikiConference North America. Several of our instructors were in attendance, and both Wiki Education staff and instructors in our program gave presentations during the conference, making education one of the most prominent themes of this year’s gathering.

Wikipedia Student Program instructor Winnie Lamour presents at WikiConference North America

We often see visitors at our San Francisco office in the Presidio, which is a treat for us and a great way to further connect with Wikipedians, instructors, and students who we usually work with only virtually.

Frank, Camelia Boban, and Rosie Stephenson-Goodknight in the Presidio of San Francisco
Ole Miss students visit Wiki Education in July to learn what it means to work for a mission-driven non-profit in the San Francisco Bay Area.

On the whole, we’re proud of what we’ve accomplished in 2018 and look forward to productive work in the coming year to make Wikipedia even more of a robust and accurate resource for everyone.


For detailed reports of programmatic activities and budgeting, see our Monthly Reports published on our blog and on Wikimedia Commons.


Header image: File:WikiConNA 18 -WikiEd Group Photo 1.jpg, Sixflashphoto, CC BY-SA 4.0, via Wikimedia Commons.

#WikipediaDay – Wikipedia turns 18

17:29, Tuesday, 15 2019 January UTC
Wikipedia birthday cakes made for Wikipedia’s 16th birthday – image by Beko CC BY-SA 4.0

By John Lubbock, Wikimedia UK Communications Coordinator

January 15 is the anniversary of the day on which Wikipedia was launched in 2001. I first got involved with Wikipedia in 2011, when I volunteered at a party organised by a friend of mine for Wikipedia’s 10th anniversary. 18, although a coming of age in many countries, doesn’t have quite the same ring to it as the 10th or 20th anniversary, and so there’s no big party this year, but we are marking it on social media anyway with the hashtag #WikipediaDay, and asking people to send us messages about why they value Wikipedia, why they think others should value Wikipedia, and what they would say to someone to encourage them to become a Wikipedia editor.

We’ve also released a video interview with Wikipedia co-founder Jimmy Wales, which is on our YouTube channel, as well as on Wikimedia Commons, where you can download it to reuse however you want.

We’d love to hear how everybody else is celebrating Wikipedia Day, and what you are looking forward to doing or working on with any of the Wikimedia projects this year. There are lots of important Wikimedia events coming up this year, and we hope to work with more academic and cultural institutions than ever before to grow Wikipedia and help people use it in an effective way. The Structured Data on Commons project will hopefully be completed, which will lead to big improvements on Commons, and there will be lots of work to promote and document Wikidata as it continues to evolve into an important project in its own right. So send us a message on social media and tell us what you’re doing and what you’re looking forward to!

A forgotten upheaval

14:54, Tuesday, 15 2019 January UTC
Among the many amazing stories from India that went into the pages of science is that of Sindri Fort. It is hardly mentioned in India anymore, but it was of interest to Charles Lyell, whose work was also significant for Charles Darwin. The fort sank, along with a large area around it, on 16 January 1819 around 6.45 PM, when the region was struck by an earthquake that also caused a tsunami. While this region sank, the northern edge, a few kilometers away, rose, forming a feature that was named the Allah Bund. It apparently became, for a while, a standard textbook example demonstrating a dynamic earth where sudden and catastrophic changes could occur. Some years ago, when I learned about the significance of this location in the history of geology and evolutionary thought, I decided that it needed a bit more coverage and began an entry in Wikipedia at Sindri Fort. Wikipedia articles on geographical entities can also be marked with coordinates (so as to show up on maps), but I had a lot of trouble figuring out where this fort stood, and having never been anywhere near Gujarat, I had set it aside after some fruitless searches across the largely featureless Rann of Kutch on Google Earth. Today, I happened to look up the work of the geologist A.B. Wynne and found that he had mapped the region in 1869 (along with the very meticulous people from the Survey of India). Fortunately, the Memoirs of the Geological Survey have recently been scanned by the Biodiversity Heritage Library and are readily available online. I downloaded the four map pages, stitched them into one large map, and uploaded it to Wikimedia Commons (the shared image repository of the Wikipedias in various languages) at https://commons.wikimedia.org/wiki/File:Kutch_geology.jpg. I then changed the metadata template from "information" to "map", which allowed the image to be transferred to the MapWarper (an open-source system) installation at https://warper.wmflabs.org/. Finding a few corresponding points on the old map and the base map allowed me to overlay the image atop Google Earth by exporting a KML file of the alignment.
Wynne's 1869 map of the region (stitched)

I then looked at Google Earth to see what lay under the location of the ruins of Sindri Fort indicated on the old map from 1869. Not very far (within a kilometer) from the marked location, lo and behold, there were faint traces of a structure right where Sindri Fort would have stood. So here it is for anyone interested. It is quite possible that the location is well known to locals, but it was still quite thrilling that one could work this out from afar thanks to the accessibility of information. It would not have been possible but for a combination of Wikipedia, the Biodiversity Heritage Library, the Internet Archive, Google Earth, the Warper project, and the numerous people behind all of these who are working not just as researchers but as research enablers.


It would seem that this was not visible until the imagery of 2013.

More information

Happy Birthday Wikipedia!

13:14, Tuesday, 15 2019 January UTC

Wikipedia turns 18 today!  Hurray!  I hope it doesn’t go out and get completely hammered and wake up in the morning with no memory of how it got home.   To celebrate this momentous occasion, Wikimedia UK has asked us all to tell them why we value Wikipedia.

  • What does Wikipedia mean to you?

The power of open knowledge at your fingertips!

  • Why do you think people should value Wikipedia?

Used correctly, Wikipedia is an invaluable source of open knowledge.  It’s one of the few truly open and transparent sources of knowledge and information on the web.  Its very existence is a testament to human ingenuity and perseverance, and a challenge to those who seek to manipulate and restrict access to knowledge and information.

Also it’s dead handy when you need to know the population of villages in Fife.

  • What would you say to someone to encourage them to become a Wikipedia editor?

Wikipedia is an amazing achievement but we still have so much work to do.  The encyclopaedia is a reflection of the world and the people who edit it and as such it mirrors all our inequalities, prejudices and power structures.  If we want Wikipedia to be more diverse, more inclusive and more representative, then we need to encourage more people, and specifically more women and minorities, to edit.  Now more so than ever, open knowledge is far too important to be left in the hands of the few.

Ewan McAndrew, our fabulous Wikimedian in Residence at the University of Edinburgh, often reminds us that the number of Very Active editors (i.e. more than 100 contributions in a given month) on English Wikipedia is just over 3,000, which is roughly equivalent to the population of a small village in Fife.  Anstruther for example.  Imagine the sum of all knowledge being left in the hands of Fifers?!  Perish the thought!  You know what you have to do….Edit!

Anstruther from Kirkyard, CC0, Poliphilo, Wikimedia Commons

Disclaimer: I’m sure Anstruther is lovely.

Code Health Metrics and SonarQube

11:42, Tuesday, 15 2019 January UTC

Code Health

Inside a broad Code Health project there is a small Code Health Metrics group. We meet weekly and discuss how code health could be improved by metrics. Each member has only a few hours each week to work on this, so our projects are small.

In our discussions, we have agreed on a few principles. Some of them are:

  • Metrics are about improving the process as much as improving the code.
  • Focus on new code, not existing code.
  • Humans are smarter than tools.

The goal of the project is to provide fast and actionable feedback on code health metrics. Since our time for this project is limited, we've decided to make a spike (T207046). The spike focuses on:

  • one repository,
  • one language,
  • one metric,
  • one tool,
  • one feedback mechanism.

All of the above tasks are already completed, except for the last one. In parallel to finishing the spike, we are also working on expanding the scope to more repositories, languages and metrics. At the moment, the spike works for several Java repositories.

SonarQube

After some investigation, the tool we have selected is SonarQube. The tool does everything we need, and more. In this post I'll only mention one feature. We have decided not to host SonarQube ourselves at the moment; we are using a hosted solution, SonarCloud. You can see our current dashboard in the wmftest organization at SonarCloud.

As mentioned in the principles, in order to make the metrics actionable, we've decided to focus only on new code, ignoring existing code for now. That means that when you make a change to a repository with a lot of code, you are not overwhelmed with all the metrics (and problems) the tool has found. Instead, the tool focuses just on the code you have written. So, for example, if a small patch you have submitted to a big repository does not introduce new problems, the tool says so. If the patch introduces new problems (like decreased branch coverage), the tool lets you know.

Members of the Code Health Metrics group have reminded me multiple times that I have to mention SonarLint, an IDE extension. I don't use it myself, since it doesn't support my favorite editor.

Example

A good example is in the wmftest organization at SonarCloud: the Elasticsearch extra plugins project has failed its quality gate.

Opening the Elasticsearch extra plugins project, you see that the failure is related to test coverage (less than 80%).

Click the warning and you get more details: Coverage on New Code 0.0%.

Click the ExtraCorePlugin.java file. New lines have a yellow background, and lines with no coverage are marked red in the sidebar, so it's easy to see which new lines (yellow background) lack coverage (red sidebar).

Talks

We plan to present what we have so far during Wikimedia Foundation All Hands. To prepare for that, we've created this blog post and presented at a 5 Minute Demo and the Testival Meetup.

I would like to thank all members of the Code Health Metrics Working group for help writing this post and especially to Guillaume Lederrey and Kosta Harlan.

FAQ

Q: Sonar-what?!
A: SonarQube is the tool. SonarCloud is the hosted version of the tool. SonarLint is an IDE extension.

Q: When can I use this on my project?
A: Soon. Probably when T207046 is resolved. If there are no blockers, in a few weeks.

Q: Why are we using SonarCloud instead of hosting SonarQube ourselves?
A: We did not want to invest time in hosting it ourselves until we're sure the tool is the right choice for us.

Today is Wikipedia's eighteenth birthday. The English Wikipedia, with 5.8 million articles, and the Malayalam Wikipedia, with around sixty thousand articles, continue their journey amid many limitations and challenges.

Although Wikipedia exists in 292 languages, the proportion of content is not the same across them. For the past four years, my main job at the Wikimedia Foundation has been leading the technology behind the system that translates articles between languages with the help of machine translation and other tools.

Yesterday, the number of new articles added with the help of this system reached four hundred thousand.

Mapping Ford Go Bike trips in the Bay Area

05:00, Tuesday, 15 2019 January UTC

Toolforge: Trusty deprecation and grid engine migration

16:25, Monday, 14 2019 January UTC

Ubuntu Trusty was released in April 2014, and support for it (including security updates) will cease in April 2019. We need to shut down all Trusty hosts before the end of support date to ensure that Toolforge remains a secure platform. This migration will take several months because many people still use the Trusty hosts and our users are working on tools in their spare time.

Initial timeline

Subject to change; see Wikitech for the living timeline.

  • 2019-01-11: Availability of Debian Stretch grid announced to community
  • Week of 2019-02-04: Weekly reminders via email to tool maintainers for tools still running on Trusty
  • Week of 2019-03-04:
    • Daily reminders via email to tool maintainers for tools still running on Trusty
    • Switch login.tools.wmflabs.org to point to Stretch bastion
  • Week of 2019-03-18: Evaluate migration status and formulate plan for final shutdown of Trusty grid
  • Week of 2019-03-25: Shutdown Trusty grid

What is changing?

  • New job grid running Son of Grid Engine on Debian Stretch instances
  • New limits on concurrent job execution and job submission by a single tool
  • New bastion hosts running Debian Stretch with connectivity to the new job grid
  • New versions of PHP, Python2, Python3, and other language runtimes
  • New versions of various support libraries

What should I do?

Some of you will remember the Ubuntu Precise deprecation from 2016-2017. This time the process is similar, but slightly different. We were unable to build a single grid engine cluster that mixed both the old Trusty hosts and the new Debian Stretch hosts. That means that moving your jobs from one grid to the other is a bit more complicated than it was last time.

The cloud-services-team has created the News/Toolforge Trusty deprecation page on wikitech.wikimedia.org to document basic steps needed to move webservices, cron jobs, and continuous jobs from the old Trusty grid to the new Stretch grid. That page also provides more details on the language runtime and library version changes and will provide answers to common problems people encounter as we find them. If the answer to your problem isn't on the wiki, ask for help in the #wikimedia-cloud IRC channel or file a bug in Phabricator.

See also

  • News/Toolforge Trusty deprecation on Wikitech for full details including links to tools that will help us monitor the migration of jobs to the new grid and help with common problems

Using bots to change the landscape of Wikipedia

15:33, Monday, 14 2019 January UTC
A robot by Banksy in New York – image by Scott Lynch CC BY-SA 2.0

This post has been written by User:TheSandDoctor, an admin on English Wikipedia. An original version of this article appeared on Medium.

A Request for Comment (RfC) is a process for requesting outside input concerning disputes, policies, guidelines or article content. As an admin on the English Wikipedia, I deal with these kinds of bureaucratic issues regularly.

For a bot task to be approved on the English Wikipedia, a request, called a Bot Request For Approval (BRFA), must be filed. If there is determined to be sufficient need for the task, a member of the body which provides oversight on bots, the Bot Approvals Group, will generally request a trial. If the trial goes to plan, the task is usually approved within a couple of days of the trial’s completion. If there are issues, they are resolved by the submitter(s), the reviewing member(s) are notified, and a new trial potentially follows; if the retrial goes according to plan, the task is most likely approved shortly thereafter.

After a successful Request for Comment, I knew it was time to get to work on my next Wikipedia bot. Little did I realize at the time that this would be the most controversial task I had filed to date, one that would end up triggering an unprecedented series of events I never predicted, culminating in the rare re-opening of a Request for Comment. The change that resulted in this series of events? Moving the year an election or other referendum took place from the end to the front of the page name. For example,

United States presidential election, 2016 would become 2016 United States presidential election, and Electoral fraud and violence during the Turkish general election, June 2015 would be renamed Electoral fraud and violence during the June 2015 Turkish general election, with the old titles kept as redirects so as to avoid breaking any incoming links.

It was October 17, 2018, and the opening of the approval request started off as countless others I had filed in the past did, with routine questions being asked by a volunteer Bot Approvals Group member, in this case the user named SQL. It was at this point that there were some indications that this would not go as smoothly as I had previously experienced. It was slightly unusual when the normally quiet and routine process began to attract more attention from editors and other members of the Bot Approvals Group, who began to express concerns regarding the RfC itself. In particular, concerns were expressed that there was not enough participation within the original Request for Comment and that it was inadequately advertised at the various relevant noticeboards watched by editors who might be affected by the proposed article naming convention change. By October 20th, the unprecedented happened. The decision was made to reopen the Request for Comment, and the discussion kicked off once again, with the bot approval request taking a temporary backseat. The reopening of a Request for Comment is a fairly unusual measure that, while possible, is seldom done or deemed necessary.

Following the RfC’s reopening, there was thorough discussion on both sides of the debate, which lasted an additional 31 days. On November 20th, 2018, the findings of the original close were confirmed. The consensus was that the naming convention was to be updated as proposed and, as a direct side effect, the bot task which I had submitted was given a renewed life. The upholding of the initial close, this time with clearer support, effectively cleared the way for a trial run. It was decided on the task’s discussion page that roughly 150 articles would be renamed in the trial of my task approval request. The task to move the pages to correspond with the updated naming conventions was approved on November 27th, following the successful completion of the trial and after leaving a few days holding time for any further comments or technical concerns.

Number of pages renamed, taken from xTools. Notice the similarity in the numbers.

From November 27th until early December, TheSandBot enacted the consensus achieved by the Request for Comment, moving (renaming) over 43,000 election related pages within a couple of days.

When a page is moved/renamed, MediaWiki, the wiki software which Wikipedia uses, creates a redirect from the old title to the new one. This is done to prevent the breakage of any links to the older title: instead of visiting the old link and receiving the equivalent of an HTTP 404 error, readers are merely redirected to the new location. Move operations have either two or four parts, each of which takes one edit. In the two-part case, the parts are a redirect page creation and the move itself, so two edits are performed for every ‘move’. In the four-part case it is slightly more complicated, but the number of actions is doubled. Taking advantage of this property, I was able to save time and reduce the size of the task script. As a consequence, although approximately 21,000 articles were moved, the logs indicate that 43,000 pages were moved and register over 86,000 edits within that time frame (see figures above/below).

From left to right: total number of edits over the account lifetime, further statistics regarding the edits made within the past year.

An example of the four edits per page move mentioned above. N signifies a page creation; m signifies a minor edit, which the software automatically applies to page moves.

With the successful completion of all the specified page moves, that particular task has come to an end. Now it is time for me to move on to different ones, like the recently approved task removing article-specific templates from drafts. There is always more work to do within the largest online encyclopedia that is Wikipedia.

Find out more about TheSandDoctor’s work at thesanddoctor.com.

ScriptinScript is coming

03:23, Monday, 14 2019 January UTC

Got kinda sidetracked for the last week and ended up with a half-written JavaScript interpreter written in JavaScript, which I’m calling “ScriptinScript”. O_O

There are such things already in existence, but they all seem outdated, incomplete, unsafe, or some combination of those. I’ll keep working on this for my embeddable widgets project but have to get back to other projects for the majority of my work time for now… :)

I’ve gotten it to a stage where I understand more or less how the pieces go together, and have been documenting how the rest of it will be implemented. Most of it for now is a straightforward implementation of the language spec as native modern JS code, but I expect it can be optimized with some fancy tricks later on. I think it’s important to actually implement a real language spec rather than half-assing a custom “JS-like” language, so code behaves as you expect it to … and so we’re not stuck with some totally incompatible custom tool forever if we deploy things using it.
Will post the initial code some time in the next week or two once I’ve got it running again after some major restructuring from initial proof of concept to proper spec-based behavior.

weeklyOSM 442

03:01, Sunday, 13 2019 January UTC

01/01/2019-07/01/2019

Logo

JOSM Plugin for landuse overlap 1 | © KiaaTiX, JOSM © Map data OpenStreetMap contributors

Mapping

  • “A fool with a tool…” (… is still a fool) is probably not the right subject line if you want to address an issue on an international mailing list, even if you encounter an OSM element where name spacing is taken to an extreme level.
  • Disputed boundaries are a sensitive issue, and so are proposals as to how to map them. John Paris announced version 1.6 of his proposal for mapping disputed boundaries in OSM and requests feedback.
  • Konrad Lischka suggested extending the tag amenity=kindergarten with operator=, operator:type=charitable and organisation= to properly tag kindergartens that are operated as independent charitable entities.
  • Following the recent approval of the new key interval= for tagging the time between departures at any given stop (see Headway on Wikipedia), Leif Rasmussen continues with the next step and suggests the key departures= for adding the departure time for a given interval.
  • Minh Nguyen reports about his mapping of Oldenburg (Indiana), a city where street signs are in German.
  • In some countries the number 13 is considered unlucky, and the 13th floor is often omitted in multi-level buildings. The mapper ‘This Is A Display Name Desu’ asks how to deal with this when mapping building levels.

Community

  • A community representative from the Thai community will meet with the mapping lead from Grab to discuss recent quality issues caused by Grab’s activity. Mishari Muqbil is helping to prepare for the meeting and asks for input from the wider community on what topics should be addressed during the meeting.
  • “What use is OpenStreetMap?” John Whelan was asked by an employee of a municipal government. How would you have responded and where would you refer them to find further information? The answers to his question might help you too.
  • The interpretation of population limits in the OSM wiki for choosing the correct value for a place= tag led to a dispute between mappers in Turkmenistan. The issue is that the importance of a settlement in some countries is not reflected in the same way by the population number as it is in Western Europe for example. Joseph Eisenberg wrote a good summary of how to distinguish between the place-values without using hard population limits. Kevin Kenny reports on a similar discussion in the US and points to the outcome.
  • David Garcia, a cartographer, wrote an article on medium.com about why he is committed to the commons of cartography. As he lays out in the article, his turning point was his humanitarian contribution to OSM after the devastating cyclone Haiyan.

OpenStreetMap Foundation

  • We want to repeat a tweet from the OSM Operations Team: “Thank you to @AssoGrifon 🇫🇷 and @iwayAG 🇨🇭 for offering us Tile CDN servers today. Both now up & already serving traffic.”
  • The German company Alpha9 Marketing Gmbh Co KG with its brand auskunft.de joined the OpenStreetMap Foundation as a Bronze Corporate Member.

Events

  • The 4th GeoPython Conference will take place on June 24-26, 2019 in Basel/Muttenz, Switzerland. The topics range from Python in general, geospatial webservices and geovisualisation to indoor mapping and modelling.

Education

  • Yongyang Xu, Zhong Xie and Liang Wu from the Faculty of Information Engineering, China University of Geosciences, Wuhan, China, have analysed the road network in OSM and found that the data in OSM is highly detailed and complex but also includes many duplicate lines, which degrade efficiency and increase the difficulty of extracting multilane roads. The team suggested a machine-learning-based approach to predict multilane roads. Unfortunately the study is behind a paywall, so it could not be confirmed whether the research area indeed has many duplicate roads or whether this is a misconception about OSM’s data model.

Maps

  • Scott Davies published a nicely styled Chronological Map of Walthamstow on Twitter that he made with OSM, QGIS and inkscape.
  • Paul Norman offers pre-rendered OSM Carto tiles at low zoom levels for download. The files, available for zoom levels 0 to 6, 8, 9 or 10, could be useful if you need a world map without high zoom levels and don’t want to set up a database server.
  • Juminet has created (fr)(automatic translation) an OSM map of the Ardennes region. He describes a clever method to determine orientation for markers, and shows the special attention his map gives to Christmas trees.
  • An interesting study about street names in Germany won an award at the Information is Beautiful Awards.
  • OpenSnowMap points out that a number of cross-country ski trails in the Jura are yet to be mapped in OSM.

Software

Releases

  • Version 2.12.2 of the iD editor has recently been released. The new version allows a preset to control the “add field” dropdown, improves usability and performance, fixes a lot of bugs, adds localisation to label/description pulled from Wikidata, and further improves the presets.

Did you know …

  • … about get-map.org, a tool to create printable maps?
  • … how to run your own instance of uMap? A user asked in the forum and was pointed to a step-by-step guide.
  • … that Austria’s building coverage in OSM has reached 85 percent? That’s an increase of 21% compared with the beginning of 2016. Even the least-mapped municipalities exceed 50 percent coverage.
  • … Geofabrik’s tile calculator, which returns the number and size of tiles in a given bounding box?

Other “geo” things

  • The theory that the geography of a country determines its development is not new. However, Lionel Page disagrees and points to a road map of the border area between the neighbouring countries Finland and Russia, which shows significant differences although the two areas are very close to each other geographically. To the critics who said that one example does not falsify a theory, Lionel responded by naming similar examples in North/South Korea, Austria and Czechia (the Czech Republic), and the USA and Mexico.
  • Frost & Sullivan, a global business consulting firm headquartered in Mountain View, California, has chosen Mapbox’s mapping and live location platform as 2019 Platform of the Year.
  • The company Here, formerly known as Navteq and Nokia Maps, launched an app for planning and sharing rides called SoMo (derived from “social mobility”). The app aims to combine all public, private and personal transport offerings in a single system and adds social functions to share rides between personal contacts.

Upcoming Events

Where What When Country
Dresden Stammtisch Dresden 2019-01-10 germany
Berlin 127. Berlin-Brandenburg Stammtisch 2019-01-10 germany
Nantes Réunion mensuelle 2019-01-10 france
Zurich OSM Stammtisch Zurich 2019-01-11 switzerland
Rennes Réunion mensuelle 2019-01-14 france
London Missing Maps January London mid-month mapping party/working group 2019-01-15 uk
Toulouse Rencontre mensuelle 2019-01-16 france
Karlsruhe Stammtisch 2019-01-16 germany
Salzburg Maptime – Stammtisch 2019-01-16 austria
Mumble Creek OpenStreetMap Foundation public board meeting 2019-01-17 everywhere
Freiberg Stammtisch Freiberg 2019-01-17 germany
Leoben Stammtisch Obersteiermark 2019-01-17 austria
Reutti Stammtisch Ulmer Alb 2019-01-22 germany
Nottingham Nottingham 2019-01-22 england
Cologne Köln Stammtisch 2019-01-23 germany
Lübeck Lübecker Mappertreffen 2019-01-24 germany
Mannheim Mannheimer Mapathons e.V. 2019-01-24 germany
Greater Vancouver area Metrotown mappy Hour 2019-01-25 canada
Bremen Bremer Mappertreffen 2019-01-28 germany
Arlon Réunion au Pays d’Arlon 2019-02-04 belgium
London Missing Maps Monthly Mapathon London 2019-02-05 uk
Toulouse Rencontre mensuelle 2019-02-06 france
Stuttgart Stuttgarter Stammtisch 2019-02-06 germany
Dresden Stammtisch Dresden 2019-02-07 germany
Berlin 128. Berlin-Brandenburg Stammtisch 2019-02-08 germany
Ulm ÖPNV-Mapathon Ulm 2019-02-09 germany
Dresden FOSSGIS 2019 2019-03-13-2019-03-16 germany
Montpellier State of the Map France 2019 2019-06-14-2019-06-16 france
Heidelberg HOT Summit 2019 2019-09-19-2019-09-20 germany
Heidelberg State of the Map 2019 (international conference) 2019-09-21-2019-09-23 germany
Grand-Bassam State of the Map Africa 2019 2019-11-22-2019-11-24 ivory coast

Note: If you would like to see your event here, please put it into the calendar. Only data which is there will appear in weeklyOSM. Please check your event in our public calendar preview and correct it where appropriate.

This weeklyOSM was produced by Polyglot, Rogehm, SK53, SunCobalt, TheSwavu, YoViajo, derFred.

Why performance matters

01:19, Sunday, 13 2019 January UTC

There are practical reasons that web performance matters. From a user perspective, a site that’s slow results in frustration, annoyance, and ultimately a preference for alternatives. From the perspective of a site operator, frustrated users are users who aren’t going to return, and that makes it more difficult to accomplish your mission (be it commercial or public service). Optimizations keep people happy, keep them coming back, and keep them engaged[1].

But, there’s a far more important reason to care about performance, especially for an organization like Wikimedia: improving performance is an essential step toward equity of access.

There are a multitude of factors that influence how quickly a web site loads. Many of these are universal to every user: the software itself, the operational environment in which that software runs, the network that carries the bits from the server. Improvement in any of these areas benefits every consumer of the site.

This doesn’t account for the large number of factors that are user specific. Among the factors that can significantly influence how quickly a web page loads for a given user are geography (a user who lives further away from the servers that host a web site will typically have slower access than a user who is closer); the network between the server and the user (a network that is less developed may be slower, or more susceptible to congestion); the user’s connection (mobile data is slower than wired broadband in most cases); and the user’s actual device (an old computer will load pages more slowly than a new one).

The common thread between these factors is that they correlate to socioeconomic and social factors, rather than technical ones. Wealthier people, in more developed countries, have a significantly easier time accessing the vast resources of the Internet than others. If an increasingly networked world is going to result in a more equal human society, we need to make thoughtful interventions, including interventions focused on performance.

Geography
The correspondence of geography to socioeconomic factors manifests primarily in where servers are located. Data centers, by and large, are located in wealthier parts of wealthier countries -- places where physical and network security guarantees are high, infrastructure is reliable, and trained staff are easy to hire. This is a sensible decision by those who build and operate these facilities, but it has the unintended consequence of slowing web performance for anyone who isn’t located in a wealthier part of a wealthy country.

Backbone Networks
Backbone networks are the networks that carry traffic from servers to end users -- the highways that collectively make up the “information superhighway”. And like highways, not all are equal. Massive cables connect cities like San Francisco, Seattle, and New York; many other cities, even ones that are quite large, are served by second or third order spurs off of these primary lines. Dozens of cables traverse the North Atlantic and North Pacific; only a small handful cross any oceans South of the equator. Interior network maps are hard to come by, but in most of the world we know that smaller towns and sometimes even smaller cities are simply not connected to the Internet at all.

Last-mile connectivity
Last-mile connectivity is how engineers describe the way your computer or smartphone connects to the network. Cable internet is one form of last-mile connectivity; so are 4G cellular and DSL. In most of the world, the last mile is the biggest bottleneck in network traffic. It’s more likely than not that the last mile is the slowest part of the entire journey from the server to your computer, regardless of where you are in the world.

However, depending on where in the world you are, “slowest” can have very different meanings. In many countries, only a tiny fraction of the population has any access to high-speed internet, whether wired or wireless. Less than 1% in Ethiopia; about 2.5% of the population in Nicaragua; 15% in Libya. Even in India, considered by many to be a key cog in the modern Internet economy, less than 25% of the population has high speed data access. Meanwhile, in Japan, the average individual has 2 broadband subscriptions. In much of Western Europe, too, the rate of broadband penetration approaches or exceeds 100%.

Device quality
The final factor that corresponds with development and socioeconomic status is device quality. Stated simply, computers are expensive, whether those computers are placed on a desk or carried in a pocket. Recent trends in software development have pushed more computation down the wire to the client. This, in turn, means that the performance difference for a site when run on a high-end versus a low-end device can be quite significant, and in some cases it’s not even possible to access sites on devices that are underpowered[2].


Though there is no single change that we can make that will address all of these factors, addressing each of them is core to serving the mission of the Wikimedia Foundation, and of the Wikimedia movement as a whole.

One ongoing element of this work is research to understand the actual factors that influence user perception of performance, and the way that user satisfaction is impacted when a page loads slowly. This allows us to make data-driven decisions about where to spend our time and our energy.

We’ve shown that expanding our cache footprint can help to minimize the effects of geography. This gives us a way to address the imbalances that result from immutable physics.

We’re not in a position to address inequality of backbone or last-mile network infrastructure -- that’s something best left to telecom companies, governments, or non-profit organizations that have chosen that as their work. What we can do is to minimize the effects of these disparities by reducing the number of bytes that need to go down the wire in order to display a page, by exploring technologies like peer-to-peer distribution to eliminate them altogether, or by increasing usage of offline content that can be downloaded in bulk using public high-speed connections.

Finally, we can aggressively work to lower the compute cost of each page that we serve, so that the cost or the age of a user’s device doesn’t impact their ability to read, learn, and contribute to the world of free knowledge.

Performance engineering matters, in other words, because it gives us a way to eliminate technological divides that are otherwise difficult, expensive, or even impossible to address at a systemic level.


[1] http://engineroom.ft.com/2016/04/04/a-faster-ft-com/ is a great breakdown of the implications of performance on content consumption, based on the experience of the Financial Times as they were developing a new website. https://medium.com/@vikigreen/impact-of-slow-page-load-time-on-website-performance-40d5c9ce568a aggregates a number of different studies that illustrate the financial implications of slow page-load performance for commercial websites.

[2] A number of years ago, Chris Zacharias, formerly an engineer at Youtube, published an anecdote about the creation of a very lightweight video display page. When they launched it to a subset of traffic, the result was that measured page performance got worse, a surprising result when the page was significantly smaller. In the end it turned out that this happened because it was suddenly possible to load the player on low-powered devices and in less-connected geographies -- previously those data hadn’t been included at all because Youtube was entirely inaccessible at any speed.

Why a student wrote “oat milk” into Wikipedia

21:41, Friday, 11 2019 January UTC

The current cultural buzz around oat milk weaves together conversations around food sustainability, plant science, health, pop culture, and new industry growth. But before December 10, 2018, you couldn’t find anything about oat milk on Wikipedia. Now, thanks to a student in Yin-long Qiu’s Plants and Human Health course at the University of Michigan, the oat milk Wikipedia article is a great source of information about the plant-based product that anyone with an internet connection can access.

Oat milk has taken off internationally in the last few years, and in the United States in the last year. Coffee shops are running out of the stuff; celebrities like Leonardo DiCaprio are backing new production companies; and publications like Quartz and the New York Times are writing about its sudden cultural relevance. In 2018, Innova Market Insights predicted that the industry would reach a value of $16 billion. But not as much is known about the production process of oat milk (what is it exactly?) or its health properties (is it really better for you than dairy?).

The Wikipedia article can now help demystify the product for the curious. This University of Michigan student created the article from scratch as a classroom assignment. Throughout the process, he connected course subject-matter to a wide array of interdisciplinary topics. And now the article sees an average of about 82 pageviews a day. Considering that most student writing is read, on average, one time (by the instructor and maybe a peer), 82 pageviews a day is an incredible impact for a single classroom assignment. Studies have shown that the visibility of their work in an assignment like this motivates students to do better work. It also inspires a sense of pride that they’re making a difference for public knowledge.

Read about the production, market expansion, and uses of oat milk on Wikipedia now!


Interested in incorporating a similar assignment into your classroom? Visit teach.wikiedu.org for all you need to know to get started. Or reach out to contact@wikiedu.org with questions.


Image: File:Green oat field.jpg, W.carter, CC BY-SA 4.0, via Wikimedia Commons.

This Month in GLAM: December 2018

20:02, Thursday, 10 2019 January UTC
  • Armenia report: Cooperation with Yerevan Drama Theatre Named After Hrachia Ghaplanian; Singing Wikipedia (continuation); Photographs by Vahan Kochar (continuation)
  • Australia report: 2019 Australia’s Year of the Public Domain
  • Belgium report: Writing weeks German-speaking Community; End of year drink; Wiki Loves Heritage photo contest
  • Brazil report: Google Art and GLAM initiatives in Brazil
  • India report: Collaboration with RJVD Municipal Public Library
  • Italy report: Challenges and alliances with libraries, WLM and more
  • Macedonia report: Exhibition:”Poland through photographs” & Wikipedia lectures with children in social risk
  • Malaysia report: Technology Talk and Update on Wikipedia @ National Library of Malaysia
  • Portugal report: Glam Days ’18 at the National Library of Portugal
  • Sweden report: Hats 🎩🧢👒🎓
  • UK report: Oxford
  • USA report: Holiday gatherings and visit to Internet Archive
  • Wikidata report: Wikidata reports
  • WMF GLAM report: Structured Data on Wikimedia Commons: pilot projects and multilingual captions
  • Calendar: January’s GLAM events

No creature as horrible as Tlaltecuhtli – the “embodiment of the chaos that raged before Earth’s creation” – should be allowed to roam the world, decided Mesoamerican gods Quetzalcoatl and Tezcatlipoca. So, the two powerful deities tore the great sea monster in half in a terrific battle. But Tlaltecuhtli survived (although in pieces) and demanded human sacrifice evermore as retribution. The other gods heard of her affliction and took it to be a great injustice. They scattered her dismembered body around the new world to right the wrong. “Her skin became grasses and small flowers, her hair the trees and herbs, her eyes the springs and wells, her nose the hills and valleys, her shoulders the mountains, and her mouth the caves and rivers.”

Thanks to a Yale University student in Dr. Barbara Mundy’s Fall 2018 course, Aztec Art and Architecture, the Wikipedia article about Tlaltecuhtli is robust and fascinating. The student, whose Wikipedia username is Pestocavatappi, made improvements to the article as an assignment, learning the intricacies of Wikipedia editing and community etiquette over 9 weeks. As Dr. Mundy’s course description states, the class focused on how the Aztecs of Mexico “used art and architecture to align themselves to the larger cosmos and to connect their empire to past Mesoamerican civilizations and project it into the future.”

Stone carving of Tlaltecuhtli, found in Tenochtitlan (ca. 1500). 
Public domain.

Tlaltecuhtli’s article had remained relatively unchanged for years before Pestocavatappi began working on it. Back in early October, it was just two paragraphs long with half as many references as the current article cites. Now it boasts six additional images, a clear organization, and information that contextualizes the deity in larger narratives of Aztec mythology and history.

Pestocavatappi uploaded five of the great images now featured in the article, too. One shows a stone carving of Tlaltecuhtli, found in Tenochtitlan (ca. 1500). Another shows Tlaltecuhtli depicted in the Codex Borbonicus, an Aztec text written by priests around the time of Spanish conquest of Mexico. Images can convey additional information about a concept that words can’t; they’re a great addition to an article, especially about art history.

Tlaltecuhtli depicted in the Codex Borbonicus, an Aztec text written by priests around the time of Spanish conquest of Mexico.
Image: File:Tlaltecuhtli codex painting.jpg, Pestocavatappi, CC BY-SA 4.0, via Wikimedia Commons.

Uploading images to Wikipedia presents different challenges than the technical process of editing text in an article. Wikipedia has strict licensing requirements, prompting necessary discussions of “fair use”, open licenses, and copyright violation. In addition to our uploading images training, we’ve developed a handout for art history students that walks through these distinctions. Dr. Mundy was one of the instructors who we consulted in the production of that handout, and we thank her again!

To see what other articles Dr. Mundy’s students improved this last term, check out their course page on our Dashboard.


For more information about teaching with Wikipedia, visit teach.wikiedu.org or reach out to contact@wikiedu.org.


Image: File:Tlaltecuhtli monolith.jpg, public domain, via Wikimedia Commons. 

The WikiCite 2018 conference in Berkeley, California was an exciting meeting of the minds. There were a number of good developments for the Newspapers on Wikipedia (NOW) campaign. Here, I’ll recap those that stood out to me, as well as a few points that are unrelated to NOW. (Most of the talk videos linked below are very short, 1-3 minutes.)

This was the third annual WikiCite conference. WikiCite is an initiative to ensure that citation data (broadly defined, including publications, articles, authors, publishing houses, etc.) is well represented as open data on the web. (See also my recent post on Wikimedia Executive Director Katherine Maher’s keynote talk.) WikiCite has a great deal of overlap with NOW; though the primary focus of NOW has been prose on Wikipedia, we have been improving Wikidata in parallel, and we can see the increasing importance of structured data to our project’s broad goal of making information about newspapers more accessible.

Newspapers on Wikipedia: A popular initiative

Pete Forsyth presents Newspapers on Wikipedia. (most) photos by Dario Taraborelli, dedicated CC0.

The WikiCite organizing committee encouraged NOW to engage with the conference, and when I arrived I could immediately see why. Many participants (librarians and Wikimedia enthusiasts, for the most part) were intrigued by what we are doing, and motivated to help out in a variety of ways.

I formally introduced NOW (3 minute video) on the second day of the 3-day conference, focusing on our choice to structure our approach as a “WikiProject,” what that means, and why it has been a good fit. Day 3 was a “hackathon” day; no fewer than four sessions (how gratifying!!) produced tangible accomplishments for NOW. These included:

  • My own hackathon session (video intro; video report) had three groups working in parallel: (1) Don Elsborg, a librarian from Colorado, and Satdeep Gill, a longtime Wikimedian, jumped right in to start an article on an Oregon newspaper, the Stayton Mail, engaging with the fundamental work of our initiative. (2) Susanna Ånäs, a Finnish librarian, worked to import a database of Finnish newspapers to Wikidata; and we had an interesting discussion about a pair of U.S.-based Finnish-language newspapers I had recently discovered. (3) Stas Malyshev of the Wikimedia Foundation found that many newspapers’ Wikidata entries had no description, and worked on a scalable/automated method for filling in a basic description on each.
  • Mahmoud Hashemi, Stephen LaPorte, Chunliang Lyu, and Sam Walton (video intro; video report) worked on two very cool things: (1) Demonstrating how to make a Wikidata-based citation on a Wikipedia newspaper article (see the Register-Guard article); and (2) Working on an enduring tool that will help campaigns like ours (useful term, by the way…a “campaign” to achieve a specific goal in a specific time period is more specific than most WikiProjects, and can occur within one) measure progress. They named the project PaceTrack; there is a Google Doc and a GitHub repository. They’ve made much progress, and work is still underway.
  • Simon Cobb (video report), a Welsh librarian, created an example of a newspaper infobox built entirely from information on Wikidata. See the Cambrian.
  • Rob Fernandez (U.S.-based Wikimedian & librarian) demonstrated the Listeria tool, which can generate automatic lists for Wikipedia campaigns based on a Wikidata search. He created a list for Florida as an example.

Some fruitful informal chats

  • Mark Graham, Executive Director of the Wayback Machine, told me about their efforts to create archival copies of news items at scale. He also drew my attention to a substantial directory of black newspapers in the U.S., which I immediately used as a reference to expand a newspaper article, and he pointed out a couple of aligned projects to address trust in news media.
  • Dan Brickley, founder of schema.org and a Google employee, suggested a number of aligned projects. In addition, he affirmed our general belief that Wikidata is in ascendance as an important source for search results and knowledge panels.
  • Joshua Dockery pointed out the “Misinformation Alerts” site, in which humans fact-check algorithm-based misinformation spreading on the web.
  • I had a chance to catch up with LiAnna Davis of the Wiki Education Foundation, and learn a bit about how a project like ours fits with their current priorities. One specific point of interest: they are working on their first piece of university curriculum centering on Wikidata. Lane Rasberry and Daniel Mietchen are working with Wiki Ed on this as well.
  • While this is not directly “of WikiCite,” I had the chance to visit with Sage Ross, also of Wiki Ed, just before the conference; Sage has made time to guide me, Nicholas Boudreau, and Lane in learning the Python programming language. This started as an effort to build a tool to measure progress in NOW; it’s likely that the PaceTrack project described above will “outpace” us, but regardless, a better understanding of Python can only help in any effort to work closely with the PaceTrack team’s emerging project. Also during our visit, Sage and I dug into Wikidata, and had a really educational session that deepened my understanding of how the site works. Many thanks to Sage!

Relevant grant proposal advances

Coinciding with the WikiCite conference, our colleague Lane Rasberry learned that his proposal to the “Ethics and Governance of AI Initiative” had advanced to the second round, surviving a cut from 500+ applicants to 66. Lane’s proposal, through the Center for Data Ethics at the University of Virginia, is strongly aligned with NOW, and if successful may help us to forge ahead with a second round of our project. His colleague Daniel Mietchen, who has also substantially contributed to NOW by writing most of the code for our progress-tracking map, was an organizer of WikiCite. Lane, Daniel and I took the opportunity to work together in crafting the response to the Round 2 questions, and submitted the application shortly after the conference’s conclusion. It was great to have the chance to work together on this in person! Here is the application we submitted.

Getting the word out

  • Konrad Förstner, Professor for Information Literacy at Cologne, interviewed me (Pete Forsyth) and Lane Rasberry about NOW for the Open Science Radio audio podcast. UPDATE 1/9/19: Here’s the link!
  • Lane interviewed me about NOW as well, in a shorter video piece; it should be published by early 2019.
  • Separate from WikiCite, but coinciding, was the successful newspaper edit-a-thon that NOW founding members Eni Mustafaraj and Emma Lurie hosted at Wellesley College. Wellesley published a nice blog post about it.

Can’t always live in the NOW: General highlights from WikiCite

Megan Wacha & Phoebe Ayers presenting at WikiCite 2018

Megan Wacha & Phoebe Ayers, two serious Wikimedians, gave us a break from the serious with a comedy routine, “Mind Your P’s & Q’s.”

WikiCite 2018 offered many compelling moments, and of course many were not directly related to the NOW project. Here are a few that stood out to me:

Wikimedia Foundation (WMF) executive director Katherine Maher spoke about Wikimedia as the “essential infrastructure of the ecosystem of free knowledge.” See my recap and commentary here.

The second day centered on group work around strategic questions. Megan Wacha asked a question of one of the groups that Wikimedians will appreciate: In mapping out a product vision, what was your take on the role of Wikimedia volunteers and the Wikimedia Foundation? It was an insightful question, shining a light on an area of disconnect that seemed to crop up at various times in the conference. The group’s answer(s) focused almost entirely on the Wikimedia Foundation, suggesting to me that there wasn’t much understanding of the volunteers of Wikimedia as a separate entity. It seems to me that there was a great deal of technical learning and individual networking at the conference, as well as good strategic work (Day 2) and technical work (Day 3). But I’m not sure that the librarians and professionals in attendance had many opportunities to learn about the Wikimedia movement’s culture, values, or social norms. Megan’s question highlighted the point concisely, and I find myself hoping that future conferences might seek out ways to make cultural learning a more central component.

Many of my Signpost colleagues, former and present, were in attendance. Phoebe Ayers, Andrew Lih, Rob Fernandez, Rosie Stephenson-Goodknight, Lane Rasberry and I discussed what it might take to put together a thorough history of Wikimedia’s in-house newspaper, perhaps as an oral history, and perhaps in time for Wikipedia’s upcoming 20th birthday. I discussed similar things with former editor Sage Ross the week before the conference.

Tpt pointed out to me that the 2019 Community Wishlist Survey, an annual effort to identify the top 10 projects of interest to members of the Wikimedia editing community for work by the Wikimedia Foundation’s developers, was likely to include improved export of books from Wikisource in formats like PDF, ePub, etc. This proved true; it was announced as the #4 priority. I look forward to seeing an improved mechanism for sharing valuable Wikimedia content more widely in offline formats!

John Mark Ockerbloom, digital librarian at the University of Pennsylvania, brought many ideas; the one that stood out most to me was a tool to make it easier to search copyright renewals. He has a database of copyright renewals for U.S. periodicals, which he introduced in a lightning talk. He talked about the value of maintaining such a list outside of Wikimedia, where an expert can assert responsibility for things like completeness.

Of course, in this blog post I have touched on only what most caught my attention. There were all kinds of great things happening; if these topics speak to you, I urge you to explore the conference wiki pages, including video links, notes on Etherpad pages, etc.

Daniel Mietchen & the WikiCite 2018 organizing committee in the closing session.

Group photo by Satdeep Gill, CC BY-SA 4.0

(Note: Post updated with items on the Signpost and on copyright renewals after initial publication.)

Migrating tools.wmflabs.org to HTTPS

19:45, Wednesday, 09 2019 January UTC

Starting 2019-01-03, GET and HEAD requests to http://tools.wmflabs.org will receive a 301 redirect to https://tools.wmflabs.org. This change should be transparent to most visitors. Some webservices may need to be updated to use explicit https:// or protocol relative URLs for stylesheets, images, JavaScript, and other content that is rendered as part of the pages they serve to their visitors.

Three and a half years ago @yuvipanda created T102367: Migrate tools.wmflabs.org to https only (and set HSTS) about making this change. Fifteen months ago a change was made to the 'admin' tool that serves the landing page for tools.wmflabs.org so that it performs an http to https redirect and sets a Strict-Transport-Security: max-age=86400 header in its response. This header instructs modern web browsers to remember to use https instead of http when talking to tools.wmflabs.org for the next 24 hours. Since that change there have been no known reports of tools breaking.

The new step we are taking now is to make this same redirect and set the same header for all visits to tools.wmflabs.org where it is safe to redirect the visitor. As mentioned in the lead paragraph, there may be some tools that this will break due to the use of hard-coded http://... URLs in the pages they serve. Because of the HSTS header covering tools.wmflabs.org, this breakage should be limited to resources that are loaded from external domains.
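If you want to see the new behaviour for yourself, a quick probe from Python (a minimal sketch, assuming the requests library is installed) shows the redirect and the HSTS header:

    import requests

    # Minimal sketch: observe the 301 redirect and the HSTS header described above.
    resp = requests.get("http://tools.wmflabs.org/", allow_redirects=False)
    print(resp.status_code)                      # expected: 301
    print(resp.headers.get("Location"))          # expected: an https://tools.wmflabs.org/ URL
    secure = requests.get("https://tools.wmflabs.org/")
    print(secure.headers.get("Strict-Transport-Security"))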

Fixing tools should be relatively simple. Hardcoded URLs can be updated to be either protocol relative (//example.org) or to use the https protocol explicitly (https://example.org). The proxy server also sends an X-Forwarded-Proto: https header to the tool's webservice, which can be detected and used to switch to generating https links. Many common web application frameworks already have support for this; a minimal sketch of the underlying idea follows.
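The sketch below is illustrative only; it is not taken from any particular tool or framework, and the tool name is hypothetical. It shows a plain WSGI webservice consulting the header when it builds absolute URLs:

    # Minimal sketch: honour the X-Forwarded-Proto header sent by the proxy when
    # building absolute URLs. "my-tool" is a hypothetical tool name.
    def application(environ, start_response):
        # Prefer the scheme reported by the proxy, falling back to what WSGI saw.
        scheme = environ.get("HTTP_X_FORWARDED_PROTO") or environ.get("wsgi.url_scheme", "http")
        base_url = "%s://tools.wmflabs.org/my-tool/" % scheme
        body = ('<link rel="stylesheet" href="%sstatic/style.css">' % base_url).encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [body]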

If you need some help figuring out how to fix your own tool's output, or to report a tool that needs to be updated, join us in the #wikimedia-cloud IRC channel.

Studies estimate that there are more than 7,000 languages spoken around the world. Wikipedia exists in about 300 of them. That’s about 4 percent of some of the world’s languages documenting some of the world’s knowledge.

Consider the Arabic language. With more than 420 million speakers, it’s one of the most widely spoken languages in the world. Yet, only 3 percent of internet content today is available in Arabic. Or consider Zulu, with more than 12 million speakers—but only about 1,100 Wikipedia articles.

In the Wikimedia vision lies a core promise to everyone who uses our sites—all the world’s knowledge, for free, and in your own language. We have a long way to go toward achieving that vision, but we’re excited about the expansion of a tool we already know has been successful in helping us get there.

Our content translation tool has been used to translate nearly 400,000 articles on Wikipedia. We leverage machine translation to support editors by producing an initial translation of an article they can then review, edit, and improve. Today, we’re excited to announce that Google Translate, one of the most advanced machine translation systems available today, will now be available for editors to utilize when translating articles through the content translation tool.

How it works

Integrating Google Translate into the content translation tool on Wikipedia has been long-requested by volunteer editor communities. Editors can select from several machine translation systems to support an initial article translation, Google Translate now being one of these options. By introducing Google Translate as one of the machine translation systems, the content translation tool can now support an additional 15 languages, including Hausa, Kurdish (Kurmanji), Yoruba, and Zulu. Today, the content translation tool can facilitate translations in 121 total languages.

We’re excited to collaborate with Google on this new added functionality of the content translation tool. Translations will be published under a free license that allows content to be integrated back into Wikipedia in line with our own licensing policies. No personal data will be shared with Google or Wikimedia as part of Google Translate’s integration into the content translation tool.

If you have questions about this new functionality and how it works, you can also take a look at the FAQ on MediaWiki.org and post any questions on the project’s talk page.

Stay tuned for more updates on the content translation tool and the Wikimedia Foundation’s Language team’s work to expand language support for all the world’s knowledge in all the world’s languages.

Runa Bhattacharjee, Senior Engineering Manager, Language
Pau Giner, Senior User Experience Designer, Audiences Design
Wikimedia Foundation

Challenges I Face

02:15, Wednesday, 09 2019 January UTC

Thus far, I’ve been stuck on traversing the DOM. Understanding the different parts of the DOM is one thing and traversing it is another, since you have to walk the DOM to detect an issue like the link-inside-link problem at the heart of this project. I remember writing about the DOM without mentioning BFS (breadth-first search) and DFS (depth-first search), which means I mistakenly left out the ways of traversing the DOM; that matters, because the methods we saw earlier are not sufficient on their own. This has left me stuck on the task at hand: writing code in Parsoid to detect the use of links-in-links.
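To make the traversal idea concrete, here is a rough Python sketch of a depth-first walk that flags links nested inside other links. It only illustrates the approach; it is not the actual Parsoid code, which lives in a different codebase:

    # Minimal sketch (not Parsoid code): depth-first walk of a DOM, reporting any
    # <a> element that has an <a> ancestor ("link inside link").
    from xml.dom.minidom import parseString

    def find_nested_links(node, inside_link=False, found=None):
        if found is None:
            found = []
        is_link = getattr(node, "tagName", None) == "a"
        if is_link and inside_link:
            found.append(node.toxml())       # the outer HTML of the offending inner link
        for child in node.childNodes:        # depth-first: children before siblings
            find_nested_links(child, inside_link or is_link, found)
        return found

    doc = parseString('<p><a href="/wiki/A">outer <a href="https://example.org">inner</a></a></p>')
    print(find_nested_links(doc))            # ['<a href="https://example.org">inner</a>']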

I am currently trying to write code to detect the external links in an HTML file. I am supposed to parse the HTML into a DOM structure and then detect any external links (links which are not wikilinks). After detecting them, I am supposed to output the outer HTML and the dsr value of each external link.
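For the external-link part, my understanding is that Parsoid’s HTML marks link types with rel attributes such as mw:WikiLink and mw:ExtLink; treating that as an assumption about the input, a small sketch of the detection step (again in Python rather than Parsoid’s own code, and leaving out the dsr handling) might look like this:

    # Minimal sketch (not the real implementation): collect links whose rel
    # attribute marks them as external in Parsoid-style HTML.
    from html.parser import HTMLParser

    class ExternalLinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.external = []

        def handle_starttag(self, tag, attrs):
            if tag != "a":
                return
            attrs = dict(attrs)
            if "mw:ExtLink" in (attrs.get("rel") or ""):
                self.external.append(attrs.get("href"))

    collector = ExternalLinkCollector()
    collector.feed('<p><a rel="mw:WikiLink" href="./Foo">Foo</a> '
                   '<a rel="mw:ExtLink" href="https://example.org">ref</a></p>')
    print(collector.external)                # ['https://example.org']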

My mentor is helping me understand the code in the files I need to work on. He is my primary source of help, as I turn to him whenever I have doubts or get stuck.

Sometimes I feel like I can do it on my own … OK, most of the time actually, which is not right. The best thing is to ask for help when you get stuck on a single task.

I will double my efforts and become more focused and careful so I can get through this task. I want my story to flow till the end so I can tell you how I did it.

Looking forward to seeing you in the next post, where I’ll give you my solution :-)


The post Challenges I Face appeared first on Farida.

Wikidata Architecture Overview (diagrams)

01:03, Wednesday, 09 2019 January UTC

Over the years diagrams have appeared in a variety of forms covering various areas of the architecture of Wikidata. Now, as the current tech lead for Wikidata it is my turn.

Wikidata has slowly become a more and more complex system, including multiple extensions, services and storage backends. Those of us who work with it on a day-to-day basis have a pretty good idea of the full system, but it can be challenging for others to get up to speed. Hence, diagrams!

All diagrams can currently be found on Wikimedia Commons using this search, and are released under CC-BY-SA 4.0. The layout of the diagrams with extra whitespace is intended to allow easy comparison of diagrams that feature the same elements.

High level overview

High level overview of the Wikidata architecture

This overview shows the Wikidata website, running Mediawiki with the Wikibase extension, in the left blue box. Various other extensions are also run, such as WikibaseLexeme, WikibaseQualityConstraints, and PropertySuggester.

Wikidata is accessed through a Varnish caching and load balancing layer provided by the WMF. Users, tools and any 3rd parties interact with Wikidata through this layer.

Off to the right are various other external services provided by the WMF. Hadoop, Hive, Oozie and Spark make up part of the WMF analytics cluster for creating pageview datasets. Graphite and Grafana provide live monitoring. There are many other general WMF services that are not listed in the diagram.

Finally we have our semi-persistent and persistent storage, used directly by Mediawiki and Wikibase. This includes Memcached and Redis for caching, SQL (MariaDB) for primary metadata, Blazegraph for triples, Swift for files and Elasticsearch for search indexing.

Getting data into Wikidata

There are two ways to interact with Wikidata: the UI and the API.

The primary UI is JS-based and itself interacts with the API. The JS UI covers most of the core functionality of Wikibase, with the exception of some small features such as merging of entities (T140124, T181910).

A non-JS UI also exists, covering most features; it is composed of a series of Mediawiki SpecialPages. Due to the complexities around editing statements, there is currently no non-JS UI for that.

The API and UIs interact with Wikidata entities stored as Mediawiki pages, saving changes to persistent storage and doing any other necessary work.
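As an illustration of reading data through that same action API (a minimal sketch; Q42 and the User-Agent string are just placeholders):

    # Minimal sketch: fetch an item via the Wikibase action API used by the UIs.
    import requests

    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={"action": "wbgetentities", "ids": "Q42",
                "props": "labels|claims", "format": "json"},
        headers={"User-Agent": "architecture-overview-example/0.1"},  # placeholder UA
    )
    entity = resp.json()["entities"]["Q42"]
    print(entity["labels"]["en"]["value"])    # "Douglas Adams"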

Wikidata data getting to Wikipedia

Wikidata clients within the Wikimedia cluster can use data from Wikidata in a variety of ways. The most common and automatic use is the generation of the “Languages” sidebar on projects, linking to the same article in other languages.

Data can also be accessed through the property parser function and various Lua functions.

Once entities are updated on wikidata.org, that data needs to be pushed to client sites that are subscribed to the entity. This happens using various subscription metadata tables on both the clients and the repo (wikidata.org) itself. The Mediawiki jobqueue is used to process the updates outside of a regular webrequest, and the whole process is controlled by a cron job running the dispatchChanges.php maintenance script.

For wikidata.org, multiple copies of the dispatchChanges script run simultaneously. They look at the list of client sites and the changes that have happened since updates were last pushed, determine whether updates need to be pushed, and queue jobs to actually update the data where needed, causing a page purge on the client. When these jobs are triggered, the changes are also added to the client recent changes table so that they appear next to other changes for users of the site.

The Query Service

The Wikidata query service, powered by Blazegraph, listens to a stream of changes happening on Wikidata.org. There are two possible modes: polling Special:RecentChanges, or using a Kafka queue of EventLogging data. Whenever an entity changes, the query service requests new Turtle data for the entity from Special:EntityData, munges it (does further processing) and adds it to the triple store.

Data can also be loaded into the query service from the RDF dumps. More details can be found here.
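As a small illustration of querying the triple store once data is loaded (a sketch against the public SPARQL endpoint; the particular query and User-Agent string are arbitrary):

    # Minimal sketch: run a SPARQL query against the Wikidata query service.
    import requests

    query = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q11173 .                # instances of "chemical compound"
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """
    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": query, "format": "json"},
        headers={"User-Agent": "architecture-overview-example/0.1"},  # placeholder UA
    )
    for row in resp.json()["results"]["bindings"]:
        print(row["itemLabel"]["value"])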

Data Dumps

Wikidata data is dumped in a variety of formats using a couple of different PHP-based dump scripts.

More can be read about this here.
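As an example of consuming one of those formats (a sketch only; the filename and the one-entity-per-line layout of the JSON dump are assumptions based on the published dumps):

    # Minimal sketch: stream a downloaded JSON entity dump and count items.
    # The dump is one large JSON array with roughly one entity per line.
    import gzip
    import json

    count = 0
    with gzip.open("latest-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            entity = json.loads(line)
            if entity["id"].startswith("Q"):  # count items, as opposed to properties etc.
                count += 1
    print(count)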

The post Wikidata Architecture Overview (diagrams) appeared first on Addshore.

As 2018 turned to 2019, people around the world celebrated the start of a brand new year with parties, family, and friends. The transition into 2019 also marked a new era for access to knowledge and culture in the United States, as new works finally entered the public domain through copyright expiration for the first time in over 20 years.

This means that when the clock struck midnight on 1 January, a flood of photographs, songs, novels, and artwork from 1923 all became freely available for everyone to use, remix, and share in the U.S. Sounds like something worth celebrating!

But let’s back up: what is the public domain, how does it work, and why should you care about its grand re-opening?

In the United States, copyright law and the public domain are enshrined in the founding documents. Article 1 of the U.S. Constitution incentivizes the creation of new works by granting them copyright protection for a limited amount of time, after which those works will enter the public domain where they can be used by anyone, for any purpose. This baked-in balancing act is a constitutional recognition both that the public benefits from the creation of new works and that the benefit to the public is not fully realized until everyone has the freedom to share, build on, and adapt those works.

Over the years, copyright expiration in the U.S. has been subject to a number of extensions. Where a work would once have been available for everyone to learn from, share, and build upon within 28 years of publication, the copyright in a work created today will last until 70 years after the author’s death. This slow creep of copyright terms has left gaps in works entering the public domain—the largest a result of the Copyright Term Extension Act of 1998, which resulted in no new works entering the public domain in the last 20 years.

It is true that works can enter the public domain in other ways. Official publications by the United States government are not protected by copyright and thus automatically enter the public domain, including everything from presidential speeches to spectacular photographs of space from NASA. Others recognize the importance of a rich public domain to learning and culture and donate their works using the CC0 public domain dedication. Still, due to the slow creep of term extensions, works from the early 20th century that are most in need of preservation and digitization have only just been released and others have been lost before their copyright expired.

It is important to note that the rules mentioned above apply only to the United States. In many countries, the public domain is slightly more robust, with works entering the public domain 50 years after the author’s death or earlier. In others, it is more restrictive, with copyright terms lasting for the life of the author plus 100 years or restrictions placed on government works.

No matter where you are, the public domain is a gift that cannot be taken for granted. The benefits of information, art, and culture are universal, and with the proliferation of the internet across the globe, public domain works are reaching broader audiences than their original authors may ever have imagined.

Wikimedia’s volunteer editors excel at honoring the public domain, and put it to good use. Every year, when new creative works around the world enter the public domain, those works and their creators receive a flurry of attention on Wikipedia and its sister projects. This month, German Wikipedians have extended Kurt Schwitters’ Wikipedia article with quite a few new illustrations. French Wikimedians have begun transcribing Antonin Artaud’s Le théâtre et son double on Wikisource. And international volunteers have uploaded hundreds of new files to Wikimedia Commons, all works that have entered the public domain in January 2019.

This is what is so special about the public domain. The commons is not just a collection of old photographs and classical music. It’s a living record of our history, our discoveries, and our humanity. It deserves to be celebrated, and in the United States this year, we have a reason to celebrate!

So, join us in celebrating the public domain however you can. Here are some ways you can participate:

  • Upload. An entire year’s worth of works, from Charlie Chaplin’s film Safety Last! to William Carlos Williams’ The Great American Novel, is now free to upload on projects like Wikimedia Commons and Wikisource.
  • Proofread. With so many works already in the public domain, Wikisource could use your proofreading skills for works from all years, not just 1923.
  • Remix. If your skills lie in creation, then a brand new set of music, movies, photographs, stories, and artwork just opened up for you to play with. Why not try your hand at coming up with a creative way to make these classics new again?
  • Tweet. Share your love of the public domain on social media by tweeting out your favorite public domain image or media file.
  • Learn. As always, one of the best ways to show your love for something is to learn about it, so why not spend a few minutes reading articles about notable events from 1923, like the first publication of Time magazine or the completion of the Hollywood sign? Or, you could learn more about the public domain itself.
  • Party. If you are based in the San Francisco Bay Area, join us, Creative Commons, and the Internet Archive for an event celebrating the “Grand Re-opening of the Public Domain” on January 25th. The event will have demos, interactive displays, lightning talks and several addresses by public domain enthusiasts, including Lawrence Lessig, Pamela Samuelson, Cory Doctorow, and our own Ben Vershbow, who will give a talk about GLAM institutions and the public domain. Tickets are still available!

Allison Davenport, Technology, Law, and Policy Fellow
Wikimedia Foundation

Thank you to Sandra Fauconnier, who contributed content for this post.

Editor’s note: In a previous version of this post, we identified Bambi, A Life in the Woods as entering the public domain in 2019. However, a 1996 court decision officially found that the novel was not copyrighted in the United States until 1926, not 1923. We look forward to welcoming Bambi, A Life in the Woods as a member of the public domain class of 2022.

Older blog entries