@Wikipedia - could give a clue to #deleted articles

16:00, Saturday, 15 June 2019 UTC
Even deleted Wikipedia articles have "false friends". In a list of award winners, a Mr Markku Laakso used to have an article. This Mr Laakso was actually a conductor, not the diabetes researcher the list was referring to. For whatever reason, the conductor's article was deemed not notable and was deleted.

When you are NOT a Wikipedia admin, there is no way to know what was deleted.

One solution is for all blue, red and black links to refer to Wikidata items. When an article is deleted, the Wikidata item is still there, making it easy to prevent cases of mistaken identity like that of the two Mr Laaksos.

You can find a more expanded proposal here.


23:30, Thursday, 13 June 2019 UTC


Heading to the airport soon, to fly to Sydney for the WMAU Community Conference. I'm looking forward to meeting new people, and finding out more about what's going on around the country. It's pretty rare that we all get together — it's never happened since I've been involved in Wikimedia stuff. Australia's a bit too big, really.

The annual conference behind Wikipedia, one of the world’s most visited and beloved websites, will take place in Stockholm from 14–18 August 2019. The event will bring together members of the global community behind Wikipedia and the other Wikimedia free knowledge projects: volunteer editors (Wikimedians) from around the world, museum and archival experts, leaders of the free knowledge movement, academics, and more. It will be the first time the conference is held in the Nordics, with previous events being held in Cape Town, South Africa; Montreal, Canada; and Esino Lario, Italy, among other locations.

The conference theme this year is “Stronger together: Wikimedia, free knowledge and the Sustainable Development Goals,” emphasizing the shared aspirations of Wikimedia, free knowledge, the United Nations Sustainable Development Goals, and the role of each in creating a more equitable world. It is not up to one or a few actors to build a better world. It is up to all of us who inhabit this planet.

Wikimania will encourage attendees to reflect on how our free knowledge movement can support and collaborate with others to address some of the world’s most challenging problems in gender diversity, environmental sustainability, quality education, and more. The conference programme will include a diverse array of speakers and presentations, panel discussions, workshops, and more. This year’s conference is co-organized by Wikimedia Sweden, the independent non-profit Wikimedia chapter dedicated to supporting Wikipedia, the other Wikimedia projects, and free knowledge in Sweden, and the Wikimedia Foundation, the international non-profit that operates Wikipedia.

“Wikimedians believe that a world that is better informed is a world that is better equipped to meet our shared challenges. This year’s Wikimania is an opportunity to come together and celebrate the accomplishments that we have made in free knowledge over the past year. It’s also a time to reflect on what’s next—how shared knowledge can contribute to a more equitable and sustainable future for all,” said Katherine Maher, CEO of the Wikimedia Foundation.

An estimated 1,000 people are expected to attend this year’s conference. Among the main speakers are Michael Peter Edson, co-founder of the Museum for the United Nations – UN Live; Liv Inger Somby, journalist and associate professor at the Sami University of Applied Sciences; Karin Holmgren, Vice-Chancellor at the Swedish University of Agricultural Sciences; and Ryan Merkley, CEO of Creative Commons.

“One of Wikipedia’s greatest strengths lies in collaboration—that we can collectively achieve greater things when we collaborate together. The UN has also explicitly stated that we can only achieve greater equity and peace in the world when we openly contribute and work together. At the first ever Wikimania in the Nordics, Wikimedians from around the world will meet to discuss how we can collectively support the Sustainable Development Goals in our work with the free knowledge movement. We are also hoping to see many of our partners and users of Wikipedia join the discussions in Stockholm,” said John Andersson, Executive Director of Wikimedia Sweden.

This year’s conference program is underway, with sessions based on individual Sustainable Development Goals, including diversity, education, technology, and Wikimedia’s role within each of the goals.

For more information, please see the program (presented in full at the beginning of July) and how to register for the conference. For questions regarding interviews and press accreditation, please contact Victoria Englund at victoria[at]wenderfalck[dot]com.

Quick facts

      • Our theme is “Stronger together: Wikimedia, free knowledge and the Sustainable Development Goals”. This year’s Wikimania is hosted by Wikimedia Sweden, the local independent non-profit Wikimedia affiliate.
      • The fifteenth edition of Wikimania will take place from 14–18 August 2019 in Stockholm, a city recognized internationally for its work in sustainability and sustainable development.
      • Among the expected 1,000 guests will be: Katherine Maher, CEO of the Wikimedia Foundation; Jimmy Wales, founder of Wikipedia; Michael Peter Edson, co-founder of the Museum for the United Nations – UN Live; Ryan Merkley, CEO of Creative Commons; and Wikimedians from all over the world.


About Wikimedia Sweden

Wikimedia Sweden is the non-profit, volunteer-driven Swedish Wikimedia chapter. It exists to promote the free knowledge movement in the country by supporting Wikipedia editors, supporting contributions to Wikimedia projects such as Wikipedia, Wikidata and Wikimedia Commons, and increasing access to free knowledge sources. Wikimedia Sweden has around 500 members and several organisational members. The organisation consists of nine board members and nine employees. Its headquarters are in Stockholm, but its members are spread across the country, and much of its activity is managed online, which makes it easier for people to get engaged in the organisation.

About the Wikimedia Foundation

The Wikimedia Foundation is the international non-profit organization that operates Wikipedia and the other Wikimedia free knowledge projects. Our vision is a world in which every single human can freely share in the sum of all knowledge. We believe that everyone has the potential to contribute something to our shared knowledge, and that everyone should be able to access that knowledge freely. We host Wikipedia and the Wikimedia projects, build software experiences for reading, contributing, and sharing Wikimedia content, support the volunteer communities and partners who make Wikimedia possible, and advocate for policies that enable Wikimedia and free knowledge to thrive. The Wikimedia Foundation is a charitable, not-for-profit organization that relies on donations. We receive financial support from millions of individuals around the world, with an average donation of about $15. We also receive donations through institutional grants and gifts. The Wikimedia Foundation is a United States 501(c)(3) tax-exempt organization with offices in San Francisco, California, USA.

About Wikimania

Wikimania is the annual conference centered on the Wikimedia projects (Wikipedia and the other Wikimedia free knowledge websites) and the Wikimedia community of volunteers who contribute to them. It features presentations on Wikimedia projects, other wikis, free and open source software, free knowledge and free content, and the social and technical aspects which relate to these topics. Wikimania 2019 marks the 15th year of the conference.

About Wikipedia

Wikipedia is the world’s free knowledge resource. It is a collaborative creation that has been added to and edited by millions of people from around the globe since it was created in 2001: anyone can edit it, at any time. Wikipedia is offered in 300 languages, contains a total of more than 50 million articles, and is viewed more than 15 billion times every month. It is the largest collaborative collection of free knowledge in human history, and today its content is contributed and edited by a community of more than 250,000 volunteer editors each month.

Unlike most websites, Wikipedia and its sister projects are ad-free. This is actually one of the reasons why our performance is so good: we don't have to deal with slow and invasive third parties.

However, while we don't have ads, we do display announcement and fundraising banners frequently at the top of wikis. Here's an example:

Those are driven by JS and as a result always appear after the initial page render. Worse, they push down content when they appear. This is a long-standing technical debt issue that we hope to tackle one day, and one of the most obvious issues we deal with that may affect performance perception. How big is the impact? With our performance perception micro survey asking our visitors about page performance, we can finally find out.

Perception distribution

We can look at the distribution (Y axis) of positive and negative survey answers based on when the banner was injected into the DOM, in milliseconds (X axis).

We see the obvious pattern that positive answers to the micro-survey question (did this page load fast enough?) are more likely if the banner appeared quickly. However, looking at the data globally like this, we can't separate the banner's slowness from the page's. After all, if your internet connection and device are slow, both the page itself and the banner will be slow, and users might be responding based on the page, ignoring the banner. This distribution might be near identical to the one we would get for page load time alone, regardless of whether a banner was present.

Banner vs no banner

A simple way to look at this problem is to check the ratio of micro-survey responses for pageviews where a banner was present vs pageviews where there was no banner. Banner campaigns tend to run for specific periods, targeting certain geographies, meaning that a lot of visits don't have a banner displayed at all. Both sample sizes should be enough to draw conclusions.

Corpus | User satisfaction ratio | Sample size
No banner, or answered before banner | 86.64% | 1,111,542
Banner, and answered after banner | 87.8% | 311,332
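As a rough sketch of how this comparison can be computed from raw survey responses (the record fields `answer`, `answer_time` and `banner_time` are hypothetical names for illustration, not our actual schema):

```python
def satisfaction_ratio(responses):
    """Share of positive answers among all answers, as a percentage."""
    positive = sum(1 for r in responses if r["answer"] == "yes")
    return 100.0 * positive / len(responses)

def split_by_banner(responses):
    """Split responses into the two corpora compared above:
    'no banner or answered before banner' vs 'answered after banner'."""
    no_banner = [r for r in responses
                 if r.get("banner_time") is None
                 or r["answer_time"] < r["banner_time"]]
    with_banner = [r for r in responses
                   if r.get("banner_time") is not None
                   and r["answer_time"] >= r["banner_time"]]
    return no_banner, with_banner
```

The ratio in each row of the table is then simply `satisfaction_ratio` applied to the corresponding corpus.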

For the banner case, we didn't collect whether the banner was in the user's viewport (i.e. whether it was actually seen).

What is going on? It would seem that users are slightly more satisfied, or at least equally satisfied, with page performance when a banner is injected. This would suggest that our late-loading banners aren't affecting page performance perception. It sounds too good to be true. We're probably looking at the data too globally, including all outliers. One of our team's best practices when a finding seems too good to be true is to keep digging and try to disprove it. Let's zoom in on more specific data.

Slow vs fast banners

Let's look at "fast" pageloads, where loadEventEnd is under a second. That event happens when the whole page has fully loaded, including all the images.

Corpus | User satisfaction ratio | Sample size
Banner injected into DOM before loadEventEnd | 92.66% | 4,761
Banner injected into DOM less than 500ms after loadEventEnd | 92.03% | 67,588
Banner injected into DOM between 2 and 5 seconds after loadEventEnd | 85.33% | 859

We can see that the effect on user performance satisfaction starts being quite dramatic as soon as the banner is really late compared to the speed of the main page load.

What if the main pageload is slow? Are users more tolerant of a banner that takes 2-5 seconds to appear? Let's look at "slow" pageloads, where loadEventEnd is between 5 and 10 seconds:

Corpus | User satisfaction ratio | Sample size
Banner injected into DOM before loadEventEnd | 79.13% | 3,019
Banner injected into DOM less than 500ms after loadEventEnd | 78.45% | 2,488
Banner injected into DOM between 2 and 5 seconds after loadEventEnd | 76.17% | 2,480

While there is a loss of satisfaction, it's not as dramatic as for fast pages. This makes sense, as users experiencing slow page loads probably have a higher tolerance to slowness in general.

Slicing it further

We've established that even for a really slow pageload, the impact of a slow late-loading banner is already visible at 2-5 seconds. If it happens within 500ms after loadEventEnd, the impact isn't that big (less than 1% satisfaction drop). Let's look at the timespan after loadEventEnd in more detail for fast pageloads (< 1s loadEventEnd) in order to find out where things start to really take a turn for the worse.

Here's the user page performance satisfaction ratio, based on how long after loadEventEnd the banner was injected into the DOM:


The reason the issues caused by late-loading banners don't show up when looking at the data globally is probably that most of the time banners load fast. But when they are injected after loadEventEnd, users start to be quite unforgiving, with the performance satisfaction ratio dropping rapidly. For users with an otherwise fast experience, we can't afford for banners to be injected more than 500ms after loadEventEnd if we want to maintain a 90% satisfaction ratio.

Of course, we would like to change our architecture so that banners are rendered server-side, which would get rid of the issue entirely. But in the meantime, loadEventEnd + 500ms seems like a good performance budget to aim for if we want to mitigate the user impact of our current architectural limitations.
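That budget, and the buckets used in the tables above, can be expressed as simple checks over the timing data. This is an illustrative sketch (the argument names are hypothetical), assuming both timestamps are in milliseconds relative to navigation start:

```python
BUDGET_MS = 500  # banner must appear within 500 ms of loadEventEnd

def within_banner_budget(banner_injected_ms, load_event_end_ms,
                         budget_ms=BUDGET_MS):
    """True if the banner appeared early enough to stay within the
    loadEventEnd + 500 ms performance budget described above."""
    return banner_injected_ms <= load_event_end_ms + budget_ms

def banner_bucket(banner_injected_ms, load_event_end_ms):
    """Assign a pageview to one of the buckets used in the tables above."""
    delta = banner_injected_ms - load_event_end_ms
    if delta <= 0:
        return "before loadEventEnd"
    if delta < 500:
        return "less than 500ms after"
    if 2000 <= delta <= 5000:
        return "2-5 seconds after"
    return "other"
```

A monitoring job could then alert when the share of pageviews failing `within_banner_budget` grows.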

What would your life be like #WithoutWikipedia?

16:00, Wednesday, 12 June 2019 UTC

For more than two years, the people of Turkey have been unable to access or participate on Wikipedia—what Ege, a high school student in the country, calls the “source of the sources.”

And they’re not the only ones affected. Without Wikipedia, the rest of the world is missing the voices, perspectives, and knowledge of Turkey’s 82 million people. We’ve petitioned the European Court of Human Rights to lift this block, part of our continued commitment to knowledge and freedom of expression as fundamental rights for every person.

In light of all this, we wanted to know how people might imagine their lives would be different #WithoutWikipedia.

Hundreds of people responded on social media. Here are a few of their answers:

• • •

@manikesh15: “Are you kidding me? Wikipedia is the answer book for variety of questions, ranging from 1940s Telugu films to scholastic publications, Chemical elements to Sherlock Holmes, Harry  Potter adventures. It’s one place destination for all the queries.”

@buseyerlii: “It’s been hell for us without you guys. Turkey missed you❤”

@DanGFSouza: “Wikipedia, despite its shortcomings, is the best collaborative project in the internet. It amazes me when people join together to build something for the benefit of all. I’d be very disappointed if it closes down or if my country restricts access to it.”

@AashirwadGupta7: “Trust me I can’t imagine life without Wikipedia. I remember the day when I started using internet and only thing I know was Wikipedia where I can find most probably ever answer to my questions it has changed the I act and think my prospective towards life has changed bcoz of it.”

@KC_Velaga: “If I had answered this question six years ago, I would have said; completing my assignments will be tough. Being a Wikimedian for five years now, I will say I would miss the wonderful experience of being part of this incredible community and even more delightful experience”

@iugoaoj: “Depressing. Browsing Wikipedia was my favourite (and maybe only) past time for a good chunk of my childhood”

@alphrho: “Going to library to find outdated encyclopaedias or paying  an online subscription to access one of those.”

@GrungiePunkiePie: “Something BIG is missing #WeMissWikipedia”

@1stbullet: “Finding the result on google but not able to click is disappointment and ‘learned desperateness’”

@LeniLaniLucero: “I would not have instant access to knowledge when a question arises about most anything! I would probably pay more attention to the back stories of news stories, instead of waiting until it came up again, and then search for answers or data.”

@alexjstinson: “I would suck at Trivial Pursuit.”

“Wouldn’t know what to stare at (in bed) on my phone at 2.30am.” – @shreshthx

@Merolyn: “We’re just two lost souls
Swimming in a fish bowl
Year after year
Running over the same old ground
And how we found
The same old fears
Wish you were here @Wikipedia”

Introducing the codehealth pipeline beta

02:54, Wednesday, 12 June 2019 UTC

After many months of discussion, work and consultation across teams and departments[0], and with much gratitude and appreciation to the hard work and patience of @thcipriani and @hashar, the Code-Health-Metrics group is pleased to announce the introduction of the code health pipeline. The pipeline is currently in beta and enabled for GrowthExperiments, soon to be followed by Notifications, PageCuration, and StructuredDiscussions. (If you'd like to enable the pipeline for an extension you maintain or contribute to, please reach out to us via the comments on this post.)

What are we trying to do?

The Code-Health-Metrics group has been working to define a set of common code health metrics. Our current understanding is that the code health factors are: simplicity, readability, testability, and buildability. Beyond analyzing a given patch set for these factors, we also want a historical view of code as it evolves over time. We want to be able to see which areas of code lack test coverage, where refactoring a class due to excessive complexity might be called for, and where possible bugs exist.

After talking through some options, we settled on a proof-of-concept to integrate Wikimedia's gerrit patch sets with SonarQube as the hub for analyzing and displaying metrics on our code[1]. SonarQube is a Java project that analyzes code according to a set of rules. SonarQube has a concept of a "Quality Gate", which can be defined organization-wide or overridden on a per-project basis. The default Quality Gate says that of the code added in a patch set, over 80% must be covered by tests, less than 3% may contain duplicated lines of code, and the maintainability, reliability and security ratings should all be graded as an A. If code passes these criteria then we say it has passed the quality gate; otherwise it has failed.
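As an illustration, the default Quality Gate described above could be modelled like this; the dict structure is a sketch for exposition, not SonarQube's actual data model or API:

```python
def passes_quality_gate(new_code):
    """Evaluate the default Quality Gate described above against metrics
    for the code added in a patch set. `new_code` is a hypothetical dict
    with a coverage percentage, a duplicated-lines percentage, and
    letter ratings for maintainability, reliability and security."""
    return (new_code["coverage_pct"] > 80.0
            and new_code["duplicated_lines_pct"] < 3.0
            and all(new_code["ratings"][key] == "A"
                    for key in ("maintainability", "reliability", "security")))
```

A single non-A rating (for example the C maintainability grade in the screenshot below) is enough to fail the gate.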

Here's an example of a patch that failed the quality gate:

screenshot of sonarqube quality gate

If you click through to the report, you can see that it failed because the patch introduced an unused local variable (code smell), so the maintainability score for that patch was graded as a C.

How does it integrate with gerrit?

For projects that have been opted in to the code health pipeline, submitting a new patch or commenting with "check codehealth" will result in the following actions:

  1. The mwext-codehealth-patch job checks out the patchset and installs MediaWiki
  2. PHPUnit is run and a code coverage report is generated
  3. npm test:unit is run which may generate a code coverage report if the package.json file is configured to do so
  4. sonar-scanner binary runs which sends 1) the code, 2) PHP code coverage, and 3) the JavaScript code coverage to Sonar
  5. After Sonar is done analyzing the code and coverage reports, the pipeline reports whether the quality gate passed or failed. A failing outcome does not prevent the patch from being merged.
pipeline screenshot

If you click the link, you'll be able to view the analysis in SonarQube. From there you can also view the code of a project and see which lines are covered by tests, which lines have issues, etc.

Also, when a patch merges, the mwext-codehealth-master-non-voting job executes, updating the default view of the project in SonarQube with the latest code coverage and code metrics.[3]

What's next?

We would like to enable the code health pipeline for more projects, and eventually we would like to use it for core. One challenge with core is that it currently takes ~2 hours to generate the PHPUnit coverage report. We also want to gather feedback from the developer community on false positives and unhelpful rules. We have tried to start with a minimal set of rules that we think everyone could agree with but are happy to adjust based on developer feedback[2]. Our current list of rules can be seen in this quality profile.

If you'll be at the Hackathon, we will be presenting on the code health pipeline and SonarQube at the Code health and quality metrics in Wikimedia continuous integration session on Friday at 3 PM. We look forward to your feedback!

Kosta, for the Code-Health-Metrics group

[0] More about the Code Health Metrics group: https://www.mediawiki.org/wiki/Code_Health_Group/projects/Code_Health_Metrics, currently comprised of Guillaume Lederrey (R), Jean-Rene Branaa (A), Kosta Harlan (R), Kunal Mehta (C), Piotr Miazga (C), Željko Filipin (R). Thank you also to @daniel for feedback and review of rules in SonarQube.
[1] While SonarQube is an open source project, we currently use the hosted version at sonarcloud.io. We plan to eventually migrate to our own self-hosted SonarQube instance, so we have full ownership of tools and data.
[2] You can add a topic here https://www.mediawiki.org/wiki/Talk:Code_Health_Group/projects/Code_Health_Metrics
[3] You might have also noticed a post-merge job over the last few months, wmf-sonar-scanner-change. This job did not incorporate code coverage, but it did analyze most of our extensions and MediaWiki core, and as a result there is a set of project data and issues that might be of interest to you. The Issues view in SonarQube might be interesting, for example, as a starting point for new developers who want to contribute to a project and want to make some small fixes.

On Thursday, 30 May, Wikimedia Foundation lawyers were in a courtroom in Alexandria, Virginia, to watch oral arguments in our ongoing case against the United States government’s mass surveillance practices. As our counsel from the American Civil Liberties Union (ACLU) rose to stand before the Judge, we rehearsed in our heads the arguments we knew our team had prepared, eager to hear them play out in real time.

Just over four years ago, Wikimedia joined eight co-plaintiffs in filing a lawsuit against the US National Security Agency. We detailed how the NSA’s Upstream mass surveillance practices violate both the U.S. Constitution and the federal law that supposedly authorizes the surveillance. The U.S. government sought to dismiss the case, arguing that none of the plaintiffs had standing—that is, the right to have their claims heard in court. In October of that year, District Court Judge T.S. Ellis, III granted the government’s motion. The Foundation and our co-plaintiffs appealed to the Court of Appeals for the Fourth Circuit, which overturned the lower court ruling in May 2017, but only as to the Foundation. The case would proceed, but Wikimedia would go it alone.

Two years later, following a long discovery process, Patrick Toomey—one of the ACLU lawyers—again stood before Judge Ellis. And again, the topic of the day was standing. Toomey was prepared to argue that Wikimedia could proceed as a plaintiff because we had presented extensive evidence that our communications are subject to Upstream surveillance. This evidence includes the government’s own official disclosures about the scope of this surveillance, as well as testimony from an internet networking expert.

As the movant, the U.S. government presented its case first. It argued that the Foundation had not provided sufficient evidence for the lawsuit to continue. It also claimed that the case could not proceed because doing so would require the court to consider information the government claims is protected by the state secrets privilege. In other words, the government alleged that disclosing certain information about Upstream surveillance would harm national security—and, since that information cannot be produced as evidence, the entire case must be dismissed.

Next, Toomey presented Wikimedia’s arguments, clearly and concisely laying out the case while fielding questions from the judge about topics such as our expert declaration and the legal criteria for establishing standing. Toomey also explained that the state secrets privilege does not apply, because the Foreign Intelligence Surveillance Act created a process by which courts must review privileged information in electronic surveillance cases such as ours. If the government seeks to rely on privileged information for its defense, it cannot withhold that information from the court and argue for the dismissal of Wikimedia’s lawsuit on that basis. Instead, FISA’s procedures apply, and the court is required to examine the government’s sensitive evidence behind closed doors.

And now we wait. The court will likely issue a ruling soon as to whether or not the case can proceed to the merits. At the next stage of the case, the judge may review the government’s evidence about Wikimedia’s standing in camera, relying on FISA’s procedures. If the court concludes that the case can proceed, Wikimedia would at last be able to present our arguments about the NSA’s activities, and demonstrate why the interception of Wikimedia’s internet communications is a violation of the law. It has been a long road, and we are grateful for the hard work and wisdom of our pro bono counsel at the ACLU, the Knight First Amendment Institute, and Cooley, LLP. We continue to look forward to the opportunity to present the merits of our case in court.

When the court’s ruling is announced, we will update our communities on the outcome and next steps. For more information, you can always visit our landing page on the case or the ACLU’s website. As Toomey observed, reflecting on the hearing, “The government’s own public disclosures establish that Wikimedia is subject to Upstream surveillance and has standing to challenge it. Congress authorized the public courts to hear cases challenging unlawful surveillance, recognizing that judicial oversight is essential to accountability. The case should now go forward for the court to determine whether the NSA’s spying is lawful.”

The annual Wikimedia Hackathon brings people with different skillsets together to improve the software that powers Wikimedia websites. This edition took place at the Prague National Library of Technology and provided an opportunity for volunteer technologists to meet in person, share ideas, have fun, and work together to improve the software that Wikipedia and its sister projects depend on to ensure free knowledge is available and accessible to anyone with an internet connection.

Each year the Wikimedia Hackathon takes place in a different location. This year, volunteers with WMCZ—an independent, nonprofit affiliate organization—did the work of planning and hosting a diverse group of attendees, who ranged from long-term contributors to Wikimedia projects to brand new members of the community.

Natalia, WMCZ’s event organizer, describes her experience: “It was both challenging and exciting to work on organizing this technical event. Our ultimate goal was to ensure the best working conditions for participants so they could focus on what they came to Prague for. This would not have been possible if I had not had an amazing team which was well organized, proactive and deeply involved.”

The planning was worth it! For three days, we saw people who are used to collaborating in online spaces working together in a physical one. Attendees shared their knowledge and skills in real time. They discussed and informed each other about ongoing projects. Experienced community members enthusiastically mentored and helped newcomers to get their hands on code.

As Jon describes it, “Everywhere I look, someone is helping somebody solve the latest problem, or sharing a laugh, or furiously coding to wrap up their project so they have something to show.”

It was truly inspiring to observe and participate with others in a real space. Whether people spent their time one-on-one, attended organized sessions on software development, or gathered together in front of their laptops at tables in the hacking area, we know that every collaboration has a real effect, and that we were able to help facilitate that.  What happened at the Prague Hackathon will make the life of readers, editors, translators, multimedia creators, open data enthusiasts, and developers easier.

It’s especially inspiring to see new attendees. Nearly half of this year’s attendees were at their first Wikimedia Hackathon!

Gopa, who attended his first Wikimedia Hackathon and works on an online tool which will allow users to cut videos uploaded to Wikimedia Commons, shares that “hacking through the code for 3 continue days and developing something productive is a great learning experience.”

More than 300 software projects, trainings, discussion sessions and activities proposed by attendees took place over the three days of the Hackathon, and hundreds of code changes were made to improve the user experience on Wikimedia websites.

On the last day, dozens of achievements were presented by attendees in a showcase session. These included:



  • Tonina worked on improving the search function (see screenshot above). She added a new feature which allows you to sort your search results on Wikimedia websites by the date a wiki page was edited or created.
  • Florian wrote an extension called PasswordlessLogin. His proof of concept allows you to log into a Wikimedia website with your smartphone without having to enter your password.

The next large gathering of Wikimedians to work on Wikimedia software will take place at Wikimania Stockholm in August 2019.

Andre Klapper, Developer Advocate (Contractor), Technical Engagement
Sarah Rodlund, Technical Writer, Technical Engagement

Wikimedia Foundation

Last chance to sign up for July Wikidata courses!

17:45, Monday, 10 June 2019 UTC

Like Wikipedia, Wikidata is a collaborative online community that organizes knowledge and presents it to the world for free. This global repository is important for many reasons, chief among them that the data stored in Wikidata is machine readable. That means when you ask Alexa or Siri a question, it’s likely that the answer is coming from Wikidata. By engaging in Wikidata’s community, you have the power to equip millions of people with accurate and comprehensive information.

Wiki Education is offering online courses and in-person workshops for those interested in learning how to contribute to Wikidata and answer their own research questions utilizing its analytical tools. Take a course or attend a workshop with us this July! Enroll one participant for $800, or, enroll two or more participants from your institution for $500 each. To reserve your seat, please fill out your registration by EOD June 17th. The payment deadline is June 24th.

Wikidata as a platform allows librarians (and researchers in general!) to do things we’ve all only dreamt of before. What’s so great about it is that once you have a grip on linked data practices and the mechanics of working within Wikidata, the applications are endless. The repository is only expanding (there are already 56 million items). Soon, thousands of institutions around the world will have connected their collections so that we can all query them. Here are just a few applications you can pursue after taking our Wikidata course(s):

  • Elevate the visibility of your collections by mapping your items in the global repository that is Wikidata.
  • Draw new insights about your collections using Wikidata’s customizable visualization and query tools.
  • Gain a comprehensive understanding of existing research by tracking and linking faculty publications.
  • Develop an equitable and inclusive ontology for linked data.
  • Teach students data literacy by incorporating Wikidata / metadata practices into the classroom.
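To give a flavour of the querying mentioned above, here is a minimal Python sketch that builds (but does not send) a request to Wikidata's public SPARQL endpoint at query.wikidata.org. The query itself is an illustrative assumption, counting paintings per collection; adapt it to your own items.

```python
from urllib.parse import urlencode

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: count paintings (Q3305213) per collection (P195).
query = """
SELECT ?collection (COUNT(?item) AS ?count) WHERE {
  ?item wdt:P31 wd:Q3305213 ;
        wdt:P195 ?collection .
}
GROUP BY ?collection
ORDER BY DESC(?count)
LIMIT 10
"""

def build_request_url(endpoint, sparql):
    """Build the GET URL for a SPARQL query, asking for JSON results."""
    return endpoint + "?" + urlencode({"query": sparql, "format": "json"})

url = build_request_url(SPARQL_ENDPOINT, query)
```

Sending `url` with any HTTP client returns JSON bindings you can feed straight into a visualization tool.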

If you’re new to linked data, check out our beginner course. If you have some experience with linked data (not necessarily Wikidata), check out our intermediate course. And if you’re in the DC or New York area, sign up for one of our in-person workshops!

So much is possible with Wikidata. We’ll help you discover how it can best work for your goals.

To explore more options or to sign up to receive updates about future courses, visit data.wikiedu.org.

Ewan McAndrew (centre) at the University of Edinburgh Spy Week Wikipedia edit-a-thon – image by Mihaela Bodlovic CC BY-SA 4.0

Wikimedia UK is very pleased to announce that our partners at the University of Edinburgh have been awarded the Innovative Use of Technology award for their use of Wikipedia in the Curriculum at the Herald Higher Education Awards 2019.

University of Edinburgh Wikimedian in Residence Ewan McAndrew has been leading this work in Edinburgh, running dozens of Wikimedia events since he began his residency in January 2016 and developing innovative projects and partnerships across the university.

The award is well-deserved recognition for Ewan’s hard work in changing the perception of Wikipedia within academia, and the progress that has been made in the understanding of Wikimedia projects as important teaching tools.

Ewan attended the LILAC information literacy conference at Nottingham University in April, where he saw evidence of the growing recognition of Wikipedia as a learning platform. The University of Edinburgh was also awarded Wikimedia UK’s Partnership of the Year award in 2018.

As of April 2019, Ewan has delivered a total of 156 training sessions, trained 635 students, 419 staff, and 260 members of the public, and helped create 476 Wikipedia articles and improve 1946 articles.

Courses at the University which now include a Wikipedia assignment include: World Christianity MSc, Translation Studies MSc, History MSc (Online), Global Health MSc, Digital Sociology MSc, Data Science for Design MSc, Language Teaching MSc, Psychology in Action MSc, Digital Education MSc, Public Health MSc and Reproductive Biology Honours. Working with the Wikimedia projects not only allows students to improve the skills any university wants to develop (such as critical reading, summarising, paraphrasing, original writing, referencing, citing, publishing and data handling), but also gives them influence beyond the university, with their work being read by, and influencing, the thousands of people who read Wikipedia.

Wikimedia UK hopes that the long term success of Ewan’s residency will encourage other universities to also employ Wikimedians to mainstream the use of Wikimedia projects as teaching and learning tools in UK universities. This trend seems to be taking effect already as Coventry University’s Disruptive Media Learning Lab has recently employed Andy Mabbett as a part time Wikimedian in Residence, and we hope that other universities will follow suit.

We would like to thank Melissa Highton, the Director of Learning, Teaching and Web Services at the University, for her vision and support of the residency, and Allison Littlejohn, whose research was crucial in showing that it was a worthwhile endeavour. Dr Martin Poulter’s work as Wikimedian in Residence at the Bodleian Libraries, Oxford, was also instrumental in demonstrating the worth of a Wikimedian in Residence in a university setting.

We wish Ewan and the rest of the team at the University of Edinburgh all the best, and look forward to their further recognition as trailblazers of learning innovation in the higher education sector.


@Wikipedia: #notability versus #relevance

11:40, Monday, 10 June 2019 UTC
I had a good discussion with a Wikipedia admin whom I consider a deletionist. For me the biggest takeaway was how notability gets in the way of relevance.

With statements like "There are only two options, one is that the same standards apply, and the other is the perpetuation of prejudice" and "I view our decisions of notability as primarily subjective--decisions based on individual values and understandings of what WP should be like", little or no room is given for contrary points of view.

The problem with notability is that it enables such a personal POV, while relevance is about what others want to read. There is no article for Professor Bart O. Roep. Given two relevant diabetes-related awards he should be notable, and as he started a human study for a type 1 diabetes vaccine, he should be extremely relevant.

A personal POV that ignores the science in the news has its dangers. It is easy enough for Wikimedians to learn about scientific credentials; the papers are there to read. But what we write is not for us, it is for our public. Withholding articles opens our public up to fake facts and fake science. An article about Mr Roep is therefore relevant and timely, particularly because people die when they cannot afford their insulin. Articles about the best of what science now has to offer on diabetes are of extreme relevance.

At Wikidata, there is no notability issue. Given the relevance of diabetes, all that is needed is to concentrate effort on a subject for a few days. New authors and papers are connected to what we already have, genders are added to authors (to document the gender ratio), and as a result more objective facts become available for the subjective Wikipedia admins to consider, particularly when they accept tooling like Scholia to open up the available data.

Tech News issue #24, 2019 (June 10, 2019)

00:00, Monday, 10 June 2019 UTC
2019, week 24 (Monday 10 June 2019)

Grafana, Graphite and maxDataPoints confusion for totals

The title is a little wordy, but I hope you get the gist. I just spent 10 minutes staring at some data on a Grafana dashboard, comparing it with some other data, and finding the numbers didn’t add up. Here is the story in case it catches you out.

The dashboard

The dashboard in question is the Wikidata Edits dashboard hosted on the Wikimedia Grafana instance that is public for all to see. The top of the dashboard features a panel that shows the total number of edits on Wikidata in the past 7 days. The rest of the dashboard breaks these edits down further, including another general edits panel on the left of the second row. 

The problem

The screenshot above shows that the top edit panel is fixed to show the last 7 days (this can be seen by looking at the blue text in the top right of the panel). The second edits panel on the left of the second row is also currently displaying data for the last 7 days (this can be seen by looking at the range selector on the top right of the dashboard).

The outlines of the two graphs in the panels appear to follow the same general shape. However, the panels report different totals for the edits made in the window. The first panel reports 576k edits in one week, but the second panel reports 307k. What on earth is going on?

Double checking the data against another source, I found that both numbers here are totally off. For a single day the total edit count is closer to 700k, which scales up to 4-5 million edits per week.

hive (event)> select count(*)
            > from mediawiki_revision_create
            > where `database` = "wikidatawiki"
            > and meta.dt between "2018-09-09T02:00Z" and "2018-09-10T02:00Z"
            > and year=2018 and month=9 and (day=9 or day=10)
            > ;
Time taken: 24.991 seconds, Fetched: 1 row(s)


The Graphite render API used by Grafana has a parameter called maxDataPoints which decides the total number of data points to return. The docs are slightly more detailed:

Set the maximum numbers of datapoints for each series returned when using json content.
If for any output series the number of datapoints in a selected range exceeds the maxDataPoints value then the datapoints over the whole period are consolidated.
The function used to consolidate points can be set using the consolidateBy function.

Graphite 1.14 docs

Reading the documentation of the consolidateBy function we find the problem:

The consolidateBy() function changes the consolidation function from the default of ‘average’ to one of ‘sum’, ‘max’, ‘min’, ‘first’, or ‘last’.

Graphite 1.14 docs

As the default consolidateBy function of ‘average’ is used, the total value on the dashboard will never be correct. Instead we will get the total of the averages.
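To see why averaging breaks totals, here is a small sketch in plain Python, with made-up numbers, of what consolidation does to a "total" panel: 10080 minutely data points squeezed into at most 1000 returned points.

```python
# Simulate 7 days of minutely edit counts. A constant rate keeps the
# arithmetic obvious; real data just makes the same effect noisier.
minutes = 7 * 24 * 60          # 10080 data points at minutely resolution
edits_per_minute = 57
series = [edits_per_minute] * minutes

true_total = sum(series)       # what the panel *should* report

def consolidate(points, max_points, fn):
    """Collapse a series into at most max_points buckets using fn."""
    bucket = -(-len(points) // max_points)   # ceil division: points per bucket
    return [fn(points[i:i + bucket]) for i in range(0, len(points), bucket)]

def avg(xs):
    return sum(xs) / len(xs)

# Default behaviour: consolidateBy('average'), then total the result.
averaged = consolidate(series, 1000, avg)
sum_of_averages = sum(averaged)   # roughly true_total / points-per-bucket

# With consolidateBy(sum) the buckets are summed and the total survives.
summed = consolidate(series, 1000, sum)
assert sum(summed) == true_total
```

The "total" of the averaged series is smaller than the real total by roughly the number of raw points per consolidated bucket, which is exactly the kind of mismatch the two panels showed.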

Fixes for the dashboard

I could set the maxDataPoints parameter to 9999999 for all panels; that would mean the previous assumptions would hold true, with Grafana getting ALL of the data points in Graphite and correctly totaling them. I gave it a quick shot, but it probably isn’t what we want. We don’t need that level of granularity.

Adding consolidateBy(sum) should do the trick. And in the screenshot below we can now see that the totals make sense and roughly line up with our estimations.

For now I have actually set the second panel to have a maxDataPoints value of 9999999. As the data is stored at a minutely granularity, this means roughly 19 years of minutely data can be accessed. When looking at the default of 7 days, that equates to 143KB of data.
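The back-of-envelope numbers above check out. A quick sketch of the arithmetic; note the bytes-per-point figure is my own rough assumption, reverse-engineered from the quoted 143KB, not something measured:

```python
max_data_points = 9_999_999

# Minutely granularity: how far back can max_data_points reach?
minutes_per_year = 60 * 24 * 365
years_covered = max_data_points / minutes_per_year   # roughly 19 years

# A default 7-day window at minutely resolution:
points_per_week = 7 * 24 * 60                        # 10080 points

# Rough payload size if each JSON data point costs ~14.5 bytes (assumption).
bytes_per_point = 14.5
approx_kb = points_per_week * bytes_per_point / 1024 # roughly 143 KB
```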

Continued confusion and misdirection

I have no doubt that Grafana will continue to trip me and others up with little quirks like this. At least the tooltip for the maxDataPoints option explains exactly what the option does, although this is hidden by default on the current Wikimedia version.

Data data everywhere. If only it were all correct.

The post Grafana, Graphite and maxDataPoints confusion for totals appeared first on Addshore.

Wikidata is 6

18:42, Sunday, 09 June 2019 UTC

It was Wikidata’s 6th birthday on the 30th of October 2018. WMUK celebrated this with a meetup on the 7th of November. They also made this great post-event video.

Video from WMUK hosted Wikidata birthday event

Celebrated all over the world

The 6th birthday was celebrated in over 40 different locations around the world according to the Wikidata item for the birthday:


Various Wikidata-related presents were made by volunteers. The presents can be found on Wikidata:Sixth_Birthday on the right hand side and include various scripts, tools, dashboards and lists.

Next year

The 7th birthday will again take the form of a WikidataCon conference.

Watch this space…

The post Wikidata is 6 appeared first on Addshore.

Wikidata Map October 2018

18:42, Sunday, 09 June 2019 UTC

It has been another 6 months since my last post in the Wikidata Map series. In that time Wikidata has gained 4 million items, 1 property with the globe-coordinate data type (coordinates of geographic centre) and 1 million items with coordinates [1]. Each Wikidata item with a coordinate is represented on the map with a single dim pixel. Below you can see the areas of change between this new map and the one generated in March. To see the equivalent change in the previous 4 months take a look at the previous post.

Comparison of March 26th and October 1st maps in 2018

Daniel Mietchen believes that much of the increased coverage can probably be attributed to Cebuano Bot (link needed).

Areas of increase

Below I have extracted sections of the map that have shown significant increase in items.

If you know why these areas saw an increase, such as a wikiproject or individual working on the area, then please leave a comment below and I’ll be sure to add explanations for each area.

If you think I have missed an area also please leave a comment and I’ll add that too!


Some areas within Africa can be picked out as having specific increases:

  • Republic of Cameroon
  • Gabonese Republic
  • Democratic Republic of the Congo
  • People’s Democratic Republic of Algeria
  • Republic of Djibouti

The increase in Wikidata’s coverage of the African continent in general could be down to Wikimania 2018, which was held in Cape Town. Cape Town itself doesn’t show any real increase in items in the 6 month period and is not included in the image above. Mexico also saw an increase in the number of items in the area when Wikimania was hosted there in 2015.


The main areas of increase here appear to be:

  • Jakarta
  • Indonesia
  • Bangkok
  • North Korea


The main areas of increase here appear to be:

  • Scotland
  • Ireland
  • Norway
  • Finland
  • Latvia
  • Greece
  • Croatia
  • Cyprus (while not in Europe) can be seen in the bottom right

North America

There is a general increase across the whole of North America, most notably in the west of the continent and in Canada.

The Dominican Republic can also be seen in bright colour to the bottom right of the image.

South America

South America has a general increase throughout, however various areas appear highlighted such as:

  • Colombia
  • Chile
  • São Paulo & Brazil

Smaller snippets


Sri Lanka & Maldives



[1] Number of items with coordinates based on grepping the generated wdlabel.json file used by the map generation.
addshore@stat1005:~$ grep -o ",\"Q" wdlabel-20181001.json | wc -l
addshore@stat1005:~$ grep -o ",\"Q" wdlabel-20180326.json | wc -l


The October 2018 images: https://tools.wmflabs.org/wikidata-analysis/20181001/geo2png/

The post Wikidata Map October 2018 appeared first on Addshore.

wikibase-docker, Mediawiki & Wikibase update

18:42, Sunday, 09 June 2019 UTC

Today on the Wikibase Community User Group Telegram chat I noticed some people discussing issues with upgrading Mediawiki and Wikibase using the docker images provided for Wikibase.

As the wikibase-registry is currently only running Mediawiki 1.30 I should probably update it to 1.31, which is the next long term stable release.

This blog post was written as I performed the update and is yet to be proofread, so expect some typos. I hope it can help those that were chatting on Telegram today.

Starting state


There is a small amount of documentation in the wikibase docker image README file that talks about upgrading, but this simply tells you to run update.php.

The update.php script has its own documentation on mediawiki.org, but none of this helps you piece everything together for the docker world.


The installation creation process is documented in this blog post, and some customization regarding LocalSettings and extensions was covered here.
The current state of the docker-compose file can be seen below with private details redacted.

This docker-compose file is found in /root/wikibase-registry on the server hosting the installation. (Yes, I know that’s a dumb place, but that’s not the point of this post.)

version: '3'

services:
  wikibase:
    image: wikibase/wikibase:1.30-bundle
    restart: always
    links:
      - mysql
    ports:
     - "8181:80"
    volumes:
      - mediawiki-images-data:/var/www/html/images
      - ./LocalSettings.php:/var/www/html/LocalSettings.php:ro
      - ./Nuke:/var/www/html/extensions/Nuke
      - ./ConfirmEdit:/var/www/html/extensions/ConfirmEdit
    depends_on:
      - mysql
    environment:
      MW_ADMIN_NAME: "private"
      MW_ADMIN_PASS: "private"
      MW_SITE_NAME: "Wikibase Registry"
      DB_SERVER: "mysql.svc:3306"
      DB_PASS: "private"
      DB_USER: "private"
      DB_NAME: "private"
      MW_WG_SECRET_KEY: "private"
    networks:
      default:
        aliases:
         - wikibase.svc
         - wikibase-registry.wmflabs.org

  mysql:
    image: mariadb:latest
    restart: always
    volumes:
      - mediawiki-mysql-data:/var/lib/mysql
    environment:
      MYSQL_DATABASE: 'private'
      MYSQL_USER: 'private'
      MYSQL_PASSWORD: 'private'
    networks:
      default:
        aliases:
         - mysql.svc

  wdqs-frontend:
    image: wikibase/wdqs-frontend:latest
    restart: always
    ports:
     - "8282:80"
    depends_on:
      - wdqs-proxy
    environment:
      BRAND_TITLE: 'Wikibase Registry Query Service'
      WIKIBASE_HOST: wikibase.svc
      WDQS_HOST: wdqs-proxy.svc
    networks:
      default:
        aliases:
         - wdqs-frontend.svc

  wdqs:
    image: wikibase/wdqs:0.3.0
    restart: always
    volumes:
      - query-service-data:/wdqs/data
    command: /runBlazegraph.sh
    environment:
      WIKIBASE_HOST: wikibase-registry.wmflabs.org
    networks:
      default:
        aliases:
         - wdqs.svc

  wdqs-proxy:
    image: wikibase/wdqs-proxy
    restart: always
    environment:
      - PROXY_PASS_HOST=wdqs.svc:9999
    ports:
     - "8989:80"
    depends_on:
      - wdqs
    networks:
      default:
        aliases:
         - wdqs-proxy.svc

  wdqs-updater:
    image: wikibase/wdqs:0.3.0
    restart: always
    command: /runUpdate.sh
    depends_on:
      - wdqs
      - wikibase
    environment:
      WIKIBASE_HOST: wikibase-registry.wmflabs.org
    networks:
      default:
        aliases:
         - wdqs-updater.svc

volumes:
  mediawiki-mysql-data:
  mediawiki-images-data:
  query-service-data:




So that you can always return to your previous configuration, take a snapshot of your docker-compose file.

If you have any other mounted files it also might be worth taking a quick snapshot of those.


The wikibase docker-compose example README has a short section about backing up docker volumes using the loomchild/volume-backup docker image.
So let’s give that a go.

I’ll run the backup command for all 3 volumes used in the docker-compose file, covering the 3 locations of persistent data that I care about.

docker run -v wikibase-registry_mediawiki-mysql-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup mediawiki-mysql-data_20190129
docker run -v wikibase-registry_mediawiki-images-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup mediawiki-images-data_20190129
docker run -v wikibase-registry_query-service-data:/volume -v /root/volumeBackups:/backup --rm loomchild/volume-backup backup query-service-data_20190129

Looking in the /root/volumeBackups directory I can see that the backup files have been created.

ls -lahr /root/volumeBackups/ | grep 2019
-rw-r--r-- 1 root root 215K Jan 29 16:40 query-service-data_20190129.tar.bz2
-rw-r--r-- 1 root root  57M Jan 29 16:40 mediawiki-mysql-data_20190129.tar.bz2
-rw-r--r-- 1 root root  467 Jan 29 16:40 mediawiki-images-data_20190129.tar.bz2

I’m not going to bother checking that the backups are actually complete here, but you might want to do that!

Prepare the next version

Grab new versions of extensions

The wikibase-registry has a couple of extensions shoehorned into it, mounted through volume mounts in the docker-compose file (see above).

We need new versions of these extensions for Mediawiki 1.31 while leaving the old versions in place for the still running 1.30 version.

I’ll do this by creating a new folder, copying the existing extension code into it, and then fetching and checking out the new branch.

# Make copies of the current 1.30 versions of extensions
root@wbregistry-01:~/wikibase-registry# mkdir mw131
root@wbregistry-01:~/wikibase-registry# cp -r ./Nuke ./mw131/Nuke
root@wbregistry-01:~/wikibase-registry# cp -r ./ConfirmEdit ./mw131/ConfirmEdit

# Update them to the 1.31 branch of code
root@wbregistry-01:~/wikibase-registry# cd ./mw131/Nuke/
root@wbregistry-01:~/wikibase-registry/mw131/Nuke# git fetch origin REL1_31
From https://github.com/wikimedia/mediawiki-extensions-Nuke
 * branch            REL1_31    -> FETCH_HEAD
root@wbregistry-01:~/wikibase-registry/mw131/Nuke# git checkout REL1_31
Branch REL1_31 set up to track remote branch REL1_31 from origin.
Switched to a new branch 'REL1_31'
root@wbregistry-01:~/wikibase-registry/mw131/Nuke# cd ./../ConfirmEdit/
root@wbregistry-01:~/wikibase-registry/mw131/ConfirmEdit# git fetch origin REL1_31
From https://github.com/wikimedia/mediawiki-extensions-ConfirmEdit
 * branch            REL1_31    -> FETCH_HEAD
root@wbregistry-01:~/wikibase-registry/mw131/ConfirmEdit# git checkout REL1_31
Branch REL1_31 set up to track remote branch REL1_31 from origin.
Switched to a new branch 'REL1_31'

Define an updated Wikibase container / service

We can run a container with the new Mediawiki and Wikibase code alongside the old container without causing any problems; it just needs a different name.

So below I define this new service, called wikibase-131 using the same general details as my previous wikibase service, but pointing to the new versions of my extensions, and add it to my docker-compose file.

Note that no port is exposed, as I don’t want public traffic here yet, and also no network aliases are yet defined. We will switch those from the old service to the new service at a later stage.

  wikibase-131:
    image: wikibase/wikibase:1.31-bundle
    restart: always
    links:
      - mysql
    volumes:
      - mediawiki-images-data:/var/www/html/images
      - ./LocalSettings.php:/var/www/html/LocalSettings.php:ro
      - ./mw131/Nuke:/var/www/html/extensions/Nuke
      - ./mw131/ConfirmEdit:/var/www/html/extensions/ConfirmEdit
    depends_on:
      - mysql
    environment:
      MW_ADMIN_NAME: "private"
      MW_ADMIN_PASS: "private"
      MW_SITE_NAME: "Wikibase Registry"
      DB_SERVER: "mysql.svc:3306"
      DB_PASS: "private"
      DB_USER: "private"
      DB_NAME: "private"
      MW_WG_SECRET_KEY: "private"

I tried running this service as is but ran into an issue with the change from 1.30 to 1.31. (Your output will be much more verbose if you need to pull the image)

root@wbregistry-01:~/wikibase-registry# docker-compose up wikibase-131
wikibase-registry_mysql_1 is up-to-date
Creating wikibase-registry_wikibase-131_1 ... done
Attaching to wikibase-registry_wikibase-131_1
wikibase-131_1   | wait-for-it.sh: waiting 120 seconds for mysql.svc:3306
wikibase-131_1   | wait-for-it.sh: mysql.svc:3306 is available after 0 seconds
wikibase-131_1   | wait-for-it.sh: waiting 120 seconds for mysql.svc:3306
wikibase-131_1   | wait-for-it.sh: mysql.svc:3306 is available after 1 seconds
wikibase-131_1   | /extra-entrypoint-run-first.sh: line 3: MW_ELASTIC_HOST: unbound variable
wikibase-registry_wikibase-131_1 exited with code 1

The wikibase:1.31-bundle docker image includes the Elastica and CirrusSearch extensions which were not a part of the 1.30 bundle, and due to the entrypoint infrastructure added along with it I will need to change some things to continue without using Elastic for now.

Fix MW_ELASTIC_HOST requirement with a custom entrypoint.sh

The above error message shows that the error occurred while running extra-entrypoint-run-first.sh, which is provided as part of the bundle image and automatically loaded by the base image entrypoint.
The bundle now also runs some extra steps as part of the install for Wikibase that we don’t want if we are not using Elastic.

If you give the entrypoint file a read through you can see that it does a few things:

  • Makes sure the required environment variables are passed in
  • Waits for the DB server to be online
  • Runs extra scripts added by the bundle image
  • Does the Mediawiki / Wikibase install on the first run (if LocalSettings does not exist)
  • Run apache

This is a bit excessive for what the wikibase-registry requires right now, so let’s strip it down, saving the result next to our docker-compose file as /root/wikibase-registry/entrypoint.sh.


#!/bin/bash

# Check that the required environment variables have been set
REQUIRED_VARIABLES=(MW_ADMIN_NAME MW_ADMIN_PASS MW_SITE_NAME DB_SERVER DB_PASS DB_USER DB_NAME MW_WG_SECRET_KEY)
for i in ${REQUIRED_VARIABLES[@]}; do
    eval THISSHOULDBESET=\$$i
    if [ -z "$THISSHOULDBESET" ]; then
        echo "$i is required but isn't set. You should pass it to docker. See: https://docs.docker.com/engine/reference/commandline/run/#set-environment-variables--e---env---env-file";
        exit 1;
    fi
done

set -eu

# Wait for the database server to be reachable
/wait-for-it.sh $DB_SERVER -t 120
sleep 1
/wait-for-it.sh $DB_SERVER -t 120

docker-php-entrypoint apache2-foreground

And mount it in the wikibase-131 service that we have created by adding a new volume.

      - ./entrypoint.sh:/entrypoint.sh

Run the new service alongside the old one

Running the service now works as expected.

root@wbregistry-01:~/wikibase-registry# docker-compose up wikibase-131
wikibase-registry_mysql_1 is up-to-date
Recreating wikibase-registry_wikibase-131_1 ... done
Attaching to wikibase-registry_wikibase-131_1
{snip, boring output}

And the service appears in the list of running containers.

root@wbregistry-01:~/wikibase-registry# docker-compose ps
              Name                             Command               State          Ports
wikibase-registry_mysql_1           docker-entrypoint.sh mysqld      Up      3306/tcp
wikibase-registry_wdqs-frontend_1   /entrypoint.sh nginx -g da ...   Up      0.0.0.0:8282->80/tcp
wikibase-registry_wdqs-proxy_1      /bin/sh -c "/entrypoint.sh"      Up      0.0.0.0:8989->80/tcp
wikibase-registry_wdqs-updater_1    /entrypoint.sh /runUpdate.sh     Up      9999/tcp
wikibase-registry_wdqs_1            /entrypoint.sh /runBlazegr ...   Up      9999/tcp
wikibase-registry_wikibase-131_1    /bin/bash /entrypoint.sh         Up      80/tcp
wikibase-registry_wikibase_1        /bin/bash /entrypoint.sh         Up      0.0.0.0:8181->80/tcp


From here you should now be able to get into your new container with the new code.

root@wbregistry-01:~/wikibase-registry# docker-compose exec wikibase-131 bash

And then run update.php

In theory, updates to the database (and anything else) will always be backward compatible for at least one major version, which is why we can run this update while the site is still being served by Mediawiki 1.30.

root@40de55dc62fc:/var/www/html# php ./maintenance/update.php --quick
MediaWiki 1.31.1 Updater

Your composer.lock file is up to date with current dependencies!
Going to run database updates for wikibase_registry
Depending on the size of your database this may take a while!
{snip boring output}
Purging caches...done.

Done in 0.9 s.

Switching versions

The new service is already running alongside the old one, and the database has already been updated, now all we have to do is switch the services over.

If you want a less big-bang approach, you could probably set up a second port exposing the updated version and direct a different domain or subdomain to it, but I won’t go into that here.

Move the “ports” definition and “networks” definition from the “wikibase” service to the “wikibase-131” service, then recreate the container for each service using the updated configuration. (If you have any other references to the “wikibase” service in the docker-compose.yml file, such as in depends_on, you will also need to change those.)

root@wbregistry-01:~/wikibase-registry# docker-compose up -d wikibase
wikibase-registry_mysql_1 is up-to-date
Recreating wikibase-registry_wikibase_1 ... done
root@wbregistry-01:~/wikibase-registry# docker-compose up -d wikibase-131
wikibase-registry_mysql_1 is up-to-date
Recreating wikibase-registry_wikibase-131_1 ... done

If everything has worked you should see Special:Version reporting the newer version, which we now see on the wikibase-registry.


Now that everything is updated we can stop and remove the previous “wikibase” service container.

root@wbregistry-01:~/wikibase-registry# docker-compose stop wikibase
Stopping wikibase-registry_wikibase_1 ... done
root@wbregistry-01:~/wikibase-registry# docker-compose rm wikibase
Going to remove wikibase-registry_wikibase_1
Are you sure? [yN] y
Removing wikibase-registry_wikibase_1 ... done

You can then do some cleanup:

  • Remove the “wikibase” service definition from the docker-compose.yml file, leaving “wikibase-131” in place.
  • Remove any files or extensions (older versions) that are only loaded by the old service that you have now removed.

Further notes

There are lots of other things I noticed while writing this blog post:

  • It would be great to move the env vars out of the docker-compose and into env var files.
  • The default entrypoint in the docker images is quite annoying after the initial install and if you don’t use all of the features in the bundle.
  • We need a documentation hub? ;)

The post wikibase-docker, Mediawiki & Wikibase update appeared first on Addshore.

Wikidata Architecture Overview (diagrams)

18:42, Sunday, 09 June 2019 UTC

Over the years, diagrams have appeared in a variety of forms covering various areas of the architecture of Wikidata. Now, as the current tech lead for Wikidata, it is my turn.

Wikidata has slowly become a more and more complex system, including multiple extensions, services and storage backends. Those of us that work with it on a day to day basis have a pretty good idea of the full system, but it can be challenging for others to get up to speed. Hence, diagrams!

All diagrams can currently be found on Wikimedia Commons using this search, and are released under CC-BY-SA 4.0. The layout of the diagrams with extra whitespace is intended to allow easy comparison of diagrams that feature the same elements.

High level overview

High level overview of the Wikidata architecture

This overview shows the Wikidata website, running Mediawiki with the Wikibase extension in the left blue box. Various other extensions are also run such as WikibaseLexeme, WikibaseQualityConstraints, and PropertySuggester.

Wikidata is accessed through a Varnish caching and load balancing layer provided by the WMF. Users, tools and any 3rd parties interact with Wikidata through this layer.

Off to the right are various other external services provided by the WMF. Hadoop, Hive, Oozie and Spark make up part of the WMF analytics cluster for creating pageview datasets. Graphite and Grafana provide live monitoring. There are many other general WMF services that are not listed in the diagram.

Finally we have our semi-persistent and persistent storage, used directly by Mediawiki and Wikibase: Memcached and Redis for caching, SQL (MariaDB) for primary metadata, Blazegraph for triples, Swift for files and Elasticsearch for search indexing.

Getting data into Wikidata

There are two ways to interact with Wikidata, either the UI or the API.

The primary UI is JS based and itself interacts with the API. The JS UI covers most of the core functionality of Wikibase, with the exception of some small features such as merging of entities (T140124, T181910).

A non-JS UI also exists covering most features, comprised of a series of Mediawiki SpecialPages. Due to the complexities around editing statements, there is currently no non-JS UI for that.

The API and UIs interact with Wikidata entities stored as Mediawiki pages saving changes to persistent storage and doing any other necessary work.
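As a small illustration of the API side, here is a Python sketch that builds (but does not send) a request to the Action API's wbgetentities module, which returns an entity's data as JSON. The endpoint and parameters are the standard ones; the entity id is just an example.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"

def build_entity_request(entity_id, languages=("en",)):
    """Build a wbgetentities GET URL for one entity's labels and claims."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "props": "labels|claims",
        "languages": "|".join(languages),
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

url = build_entity_request("Q42")   # Q42: Douglas Adams
```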

Wikidata data getting to Wikipedia

Wikidata clients within the Wikimedia cluster can use data from Wikidata in a variety of ways. The most common and automatic one is the generation of the “Languages” sidebar on projects, linking to the same article in other languages.

Data can also be accessed through the property parser function and various Lua functions.

Once entities are updated on wikidata.org, that data needs to be pushed to the client sites subscribed to the entity. This happens using various subscription metadata tables on both the clients and the repo (wikidata.org) itself. The Mediawiki job queue is used to process the updates outside of a regular web request, and the whole process is controlled by a cron job running the dispatchChanges.php maintenance script.

For wikidata.org, multiple copies of the dispatchChanges script run simultaneously. They look at the list of client sites and the changes that have happened since updates were last pushed, determine whether updates need to be pushed, and queue jobs to actually update the data where needed, causing a page purge on the client. When these jobs are triggered, the changes are also added to the client recent changes table so that they appear next to other changes for users of the site.
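The dispatch logic described above can be sketched as a toy loop. All names and data here are my own invention; the real dispatchChanges.php is far more involved (locking, batching, multiple concurrent dispatchers), but the shape is the same: each client remembers the last change it has seen, and the dispatcher queues one job per newer change on a subscribed entity.

```python
# (change_id, entity_id) pairs on the repo, oldest first.
changes = [
    (101, "Q64"), (102, "Q42"), (103, "Q64"),
]

client_state = {"enwiki": 101, "dewiki": 103}            # last dispatched id
subscriptions = {"enwiki": {"Q42", "Q64"}, "dewiki": {"Q64"}}

def dispatch(changes, client_state, subscriptions):
    """Return the update jobs to queue and advance each client's pointer."""
    jobs = []
    for client, last_seen in client_state.items():
        pending = [(cid, eid) for cid, eid in changes
                   if cid > last_seen and eid in subscriptions[client]]
        for cid, eid in pending:
            jobs.append((client, eid))   # job: purge page + inject recent change
        if pending:
            client_state[client] = pending[-1][0]
    return jobs

jobs = dispatch(changes, client_state, subscriptions)
```

Here enwiki gets jobs for the two changes it has not yet seen, while dewiki is already up to date and gets none.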

The Query Service

The Wikidata query service, powered by Blazegraph, listens to a stream of changes happening on Wikidata.org. There are two possible modes: polling Special:RecentChanges, or using a Kafka queue of EventLogging data. Whenever an entity changes, the query service requests new turtle data for the entity from Special:EntityData, munges it (does further processing) and adds it to the triple store.

Data can also be loaded into the query service from the RDF dumps. More details can be found here.
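The update loop described above can be sketched as follows; every function body here is a stand-in of my own invention for the corresponding real service, so treat it as a shape, not an implementation.

```python
# Toy model of the query service updater: poll for changed entities,
# fetch their RDF, "munge" it, and load it into the store.

def poll_recent_changes(since):
    """Stand-in for polling Special:RecentChanges (or reading a Kafka stream)."""
    return [("Q42", since + 1), ("Q64", since + 2)]

def fetch_turtle(entity_id):
    """Stand-in for fetching turtle RDF from Special:EntityData."""
    return 'wd:%s rdfs:label "..." .' % entity_id

def munge(turtle):
    """Stand-in for the munge step that post-processes the fetched RDF."""
    return "# munged\n" + turtle

triple_store = []   # stand-in for the Blazegraph triple store
last_seen = 0
for entity_id, change_id in poll_recent_changes(last_seen):
    triple_store.append(munge(fetch_turtle(entity_id)))
    last_seen = max(last_seen, change_id)
```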

Data Dumps

Wikidata data is dumped in a variety of formats using a couple of different php based dump scripts.

More can be read about this here.

The post Wikidata Architecture Overview (diagrams) appeared first on Addshore.

Hacking vs Editing, Wikipedia & Declan Donnelly

18:42, Sunday, 09 June 2019 UTC

On the 18th of November 2018 the Wikipedia article for Declan Donnelly was edited and vandalised. Vandalism isn’t new on Wikipedia; it happens to all sorts of articles throughout every day. A few minutes after the vandalism, the change made its way to Twitter and from there on to some media outlets such as thesun.co.uk and metro.co.uk the following day, each with a scaremongering and misleading headline using the word “hack”.

“I’m A Celebrity fans hack Declan Donnelly by changing his height on Wikipedia after Holly Willoughby mocks him”

Hacking has nothing to do with it. One definition of hacking is to “gain unauthorized access to data in a system or computer”. What actually happened is that someone, somewhere, edited the article, which everyone is able and authorized to do. Editing is a feature, and it’s the main action that happens on Wikipedia.

The word ‘hack’ used to mean something, and hackers were known for their technical brilliance and creativity. Now, literally anything is a hack — anything — to the point where the term is meaningless, and should be retired.

The word ‘hack’ is meaningless and should be retired – 15 June 2018 by MATTHEW HUGHES

The edit that triggered the story can be seen below. It adds a few words to the lead paragraph of the article at 22:04 and was reverted at 22:19, giving it 15 minutes of life on the site.

The resulting news coverage increased traffic to the article quite dramatically, going from just 500–1,000 views a day to 27,000–29,000 for the two days following, then slowly subsiding to 12,000 and 9,800 by day 4. This is similar to the uptick in traffic caused by a YouTube video that I spotted some time ago, but realistically these upticks happen pretty much every day, for various articles, for various reasons.

Wikimedia pageviews tool for Declan Donnelly article

I posted about David Cameron’s Wikipedia page back in 2015 when another vandalism edit made some slightly more dramatic changes to the page. Unfortunately the page views tool for Wikimedia projects doesn’t have readily available data going back that far.

Maybe one day people will stop vandalising Wikipedia… Maybe one day people will stop reporting everything that happens online as a “hack”.

The post Hacking vs Editing, Wikipedia & Declan Donnelly appeared first on Addshore.

Creating a Dockerfile for the Wikibase Registry

18:42, Sunday, 09 June 2019 UTC

Currently the Wikibase Registry (setup post) is deployed using the shoehorning approach described in one of my earlier posts. After continued discussion on the Wikibase User Group Telegram chat about different setups and upgrade woes, I have decided to convert the Wikibase Registry to use the preferred approach: a custom Dockerfile building a layer on top of one of the wikibase images.

I recently updated the Wikibase Registry from MediaWiki version 1.30 to 1.31 and described the process in a recent post, so if you want to see what the current setup and docker-compose file look like, head there.

As a summary the Wikibase Registry uses:

  • The wikibase/wikibase:1.31-bundle image from docker hub
  • MediaWiki extensions:
    • ConfirmEdit
    • Nuke

Creating the Dockerfile

Our Dockerfile will likely end up looking vaguely similar to the wikibase base and bundle Dockerfiles, with a fetching stage, a possible composer stage and a final wikibase stage, but we won’t have to redo anything that is already done in the base image.

FROM ubuntu:xenial as fetcher
# TODO add logic
FROM composer as composer
# TODO add logic
FROM wikibase/wikibase:1.31-bundle
# TODO add logic

Fetching stage

Modifying the logic that is used in the wikibase Dockerfile, the extra Wikibase Registry extensions can be fetched and extracted.

Note that I am using the convenience script for fetching MediaWiki extensions from the wikibase-docker git repo, matching the version of MediaWiki I will be deploying.

FROM ubuntu:xenial as fetcher

RUN apt-get update && \
    apt-get install --yes --no-install-recommends unzip=6.* jq=1.* curl=7.* ca-certificates=201* && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

ADD https://raw.githubusercontent.com/wmde/wikibase-docker/master/wikibase/1.31/bundle/download-extension.sh /download-extension.sh

RUN bash download-extension.sh ConfirmEdit;\
bash download-extension.sh Nuke;\
tar xzf ConfirmEdit.tar.gz;\
tar xzf Nuke.tar.gz

Composer stage

None of these extensions requires a composer install, so there will be no composer stage in this example. If Nuke, for example, required a composer install, the stage would look like this:

FROM composer as composer
COPY --from=fetcher /Nuke /Nuke
RUN composer install --no-dev

Wikibase stage

The Wikibase stage needs to pull in the two fetched extensions and make any other modifications to the resulting image.

In my previous post I overwrote the entrypoint with something much simpler, removing logic to do with ElasticSearch that the Registry is not currently using. In my Dockerfile I have simplified this even further, inlining the creation of a simple five-line entrypoint, overwriting what was provided by the wikibase image.

I have left the default LocalSettings.php in the image for now, and I will continue to override this with a docker-compose.yml volume mount over the file. This avoids the need to rebuild the image when all you want to do is tweak a setting.

FROM wikibase/wikibase:1.31-bundle

COPY --from=fetcher /ConfirmEdit /var/www/html/extensions/ConfirmEdit
COPY --from=fetcher /Nuke /var/www/html/extensions/Nuke

RUN echo $'#!/bin/bash\n\
set -eu\n\
/wait-for-it.sh $DB_SERVER -t 120\n\
sleep 1\n\
/wait-for-it.sh $DB_SERVER -t 120\n\
docker-php-entrypoint apache2-foreground\n\
' > /entrypoint.sh

If the composer stage was used to run a composer command on something that was fetched, then you would likely need to COPY that extension --from the composer layer rather than the fetcher layer.

Building the image

I’m going to build the image on the same server that the Wikibase Registry is running on, as this is the simplest option. More complicated options could involve building in some Continuous Integration pipeline and publishing to an image registry such as Docker Hub.

I chose the descriptive name “Dockerfile.wikibase.1.31-bundle” and saved the file alongside my docker-compose.yml file.

There are multiple approaches that could now be used to build and deploy the image.

  1. I could add a build configuration to my docker-compose file, specifying the location of the Dockerfile as described here, then build the service image using docker-compose as described here.
  2. I could build the image separately from docker-compose, giving it an appropriate name, and then simply use that image name (which will exist on the host) in the docker-compose.yml file.

I’m going with option 2.

docker build --tag wikibase-registry:1.31-bundle-1 --pull --file ./Dockerfile.wikibase.1.31-bundle .

docker build documentation can be found here. The command tells docker to build an image from the “Dockerfile.wikibase.1.31-bundle” file, pulling new versions of any images being used, and giving the image the name “wikibase-registry” with the tag “1.31-bundle-1”.

The image should now be visible in the docker images list for the machine.

root@wbregistry-01:~/wikibase-registry# docker images | grep wikibase-registry
wikibase-registry         1.31-bundle-1       e5dad76c3975        8 minutes ago       844MB

Deploying the new image

In my previous post I migrated from one image to another having two Wikibase containers running at the same time with different images.

For this image change, however, I’ll be going for more of a “big bang” approach; I’m pretty confident it will go smoothly.

The current wikibase service definition can be seen below. This includes volumes for the entrypoint, extensions, LocalSettings and images, some of which I can now get rid of. I have also removed the need for most of these environment variables by using my own entrypoint file and overriding LocalSettings entirely.

  wikibase-131:
    image: wikibase/wikibase:1.31-bundle
    restart: always
    links:
      - mysql
    ports:
      - "8181:80"
    volumes:
      - mediawiki-images-data:/var/www/html/images
      - ./LocalSettings.php:/var/www/html/LocalSettings.php:ro
      - ./mw131/Nuke:/var/www/html/extensions/Nuke
      - ./mw131/ConfirmEdit:/var/www/html/extensions/ConfirmEdit
      - ./entrypoint.sh:/entrypoint.sh
    depends_on:
      - mysql
    environment:
      MW_SITE_NAME: "Wikibase Registry"
      DB_PASS: "XXXX"
      DB_USER: "XXXX"
      DB_NAME: "XXXX"
    networks:
      default:
        aliases:
          - wikibase.svc
          - wikibase-registry.wmflabs.org

The new service definition has an updated image name, removes the redundant volumes and trims the environment variables (DB_SERVER is still used, as it is needed in the entrypoint I added).

  wikibase-131:
    image: wikibase-registry:1.31-bundle-1
    restart: always
    links:
      - mysql
    ports:
      - "8181:80"
    volumes:
      - mediawiki-images-data:/var/www/html/images
      - ./LocalSettings.php:/var/www/html/LocalSettings.php:ro
    depends_on:
      - mysql
    environment:
      DB_SERVER: "mysql.svc:3306"
    networks:
      default:
        aliases:
          - wikibase.svc
          - wikibase-registry.wmflabs.org

For the big bang switchover I can simply reload the service.

root@wbregistry-01:~/wikibase-registry# docker-compose up -d wikibase-131
wikibase-registry_mysql_1 is up-to-date
Recreating wikibase-registry_wikibase-131_1 ... done

Using the docker-compose images command I can confirm that it is now running from my new image.

root@wbregistry-01:~/wikibase-registry# docker-compose images | grep wikibase-131
wikibase-registry_wikibase-131_1    wikibase-registry        1.31-bundle-1   e5dad76c3975   805 MB

Final thoughts

  • This should probably be documented in the wikibase-docker git repo which everyone seems to find, and also in the README for the wikibase image.
  • It would be nice if there were a single place to pull the download-extension.sh script from, perhaps with a parameter for version?

The post Creating a Dockerfile for the Wikibase Registry appeared first on Addshore.

weeklyOSM 463

08:27, Sunday, 09 June 2019 UTC



Multimapas – a combination of many historical, topographic, satellite and road maps 1 | © CC BY 4.0 – Instituto Geográfico Nacional de España | © Leaflet | map data © OpenStreetMap contributors


  • The latest version of “How Did You Contribute” by Pascal Neis now uses osmose to display information about the quality of the user’s edits.
  • Facebook’s Maps Team announced the RapiD Editor based on iD, which enables mappers to convert data they extracted using machine learning into OSM features.
  • WiGeoGIS, a contractor of the fuel station chain OMV, announced plans to improve and maintain fuel stations of the brands OMV, AVANTI, Petrom, FE Trading and EuroTruck in OSM. See also the discussion on the Talk mailing list (May, June).
  • Valor Naram put a revised version of changing_table=* tagging to a vote.
  • Developers of the Maps.me navigator, which uses OSM data, created a validator for underground railways (subways/metros) all over the world.


  • Simon Poole made some suggestions on the Tagging mailing list on how to deal with the increasing number of messages and proposals there. There were lots of replies.
  • Recently the Kazakhstan OSM community created a Telegram chat.

OpenStreetMap Foundation

  • Joost Schouppe reports about the OSMF Board meeting in Brussels (including discussing the survey that was carried out beforehand) on the OSM Foundation’s blog.


  • OpenStreetMap Argentina announces a local State of the Map to be held on Saturday 27th July in Santa Fe.

Humanitarian OSM

  • HOT shares on their blog about an experimental version of the Tasking Manager which incorporates machine learning assistance for various tasks.
  • Simon Johnson writes on Medium about how “big data” in the humanitarian community is often at odds with the idea of “minimum viable data” – the idea that less data is actually more valuable because it’s better quality and easier to verify.
  • In Better Bombing with Machine Learning Frederik Ramm points out that computer vision/machine learning could be used by military forces for aerial bombing, and the OSM community should consider whether we should be so jubilant regarding companies that use those technologies for improving OSM.


  • Richard Fairhurst tweets an example of using Lua to automatically transliterate names whilst processing OSM data for rendering. Post-processing OSM data in this way often removes the need for many name:<iso-code> tags.
  • The map of electoral districts on the website of the Magnitogorsk City (in Russia) is based on OSM.
  • A map of roads that are or will be repaired has been posted on the website of the Russian federal program “Safe and high-quality roads”. It is based on OSM but first you need to choose a region.


  • A new stable JOSM version (15155) was released. Category icons and a field for filtering background image settings have been added, dynamic entries in the Background Images menu are displayed in submenus, and there are many more improvements.
  • The OsmAnd Online GPS Tracker has been updated to version 0.5. New features are contact search, proxy settings, GPS settings and active markers.
  • Thanks to v2.0 of Babykarte, babies will not lose their way anymore. Or rather, it will become easier for parents to find baby and toddler friendly amenities (map).

Did you know …

OSM in the media

  • This short podcast from BBC Radio 4 asks the question: are there more stars in the universe than grains of beach sand on Earth? A contributor to the programme, using the OSM Planet File, computed an approximation for the amount of beach sand on the planet Earth.
  • An interview with Russian mapper Ilya Zverev has been posted (ru) on Habr. He spoke about what he did during his two years on the OSMF Board, why the American OSM community is the most friendly, and why you need to participate in offline conferences. (automatic translation)

Other “geo” things

  • How to make a Simpsons-inspired map with expressions.
  • John Murray has been using the recent release of Rapids AI, a data science library for GPUs, to compute distances to everywhere in Great Britain from a point in a few seconds.
  • An extract from a forthcoming book by Barbara Tversky discusses “What makes a good map?”
  • An outdoor clothing company has been sneakily adding photos of its products to Wikipedia articles, in an attempt to get its brand higher up in Google image results. Just another danger of an open wiki system that we must be aware of.
  • Daniel J-H announced the release of a new version of RoboSat, which can detect roads and buildings in aerial imagery.
  • Die Welt has an article (automatic translation) about the best apps for water sports enthusiasts.

Upcoming Events

Where What When Country
Rennes Réunion mensuelle 2019-06-10 france
Bordeaux Réunion mensuelle 2019-06-10 france
Lyon Rencontre mensuelle pour tous 2019-06-11 france
Salt Lake City SLC Mappy Hour 2019-06-11 united states
Zurich OSM Stammtisch Zurich 2019-06-11 switzerland
Bordeaux Réunion mensuelle 2019-06-11 france
Hamburg Hamburger Mappertreffen 2019-06-11 germany
Wuppertal Wuppertaler Stammtisch im Hutmacher 18 Uhr 2019-06-12 germany
Leoben Stammtisch Obersteiermark 2019-06-13 austria
Munich Münchner Stammtisch 2019-06-13 germany
Bochum Mappertreffen 2019-06-13 germany
Berlin 132. Berlin-Brandenburg Stammtisch 2019-06-14 germany
Montpellier State of the Map France 2019 2019-06-14-2019-06-16 france
Essen 5. OSM-Sommercamp und 12. FOSSGIS-Hackingevent im Linuxhotel 2019-06-14-2019-06-16 germany
Kyoto 京都!街歩き!マッピングパーティ:第9回 光明寺 2019-06-15 japan
Dublin OSM Ireland AGM & Talks 2019-06-15 ireland
Santa Cruz Santa Cruz Ca. Mapping Party 2019-06-15 California
Cologne Bonn Airport Bonner Stammtisch 2019-06-18 germany
Lüneburg Lüneburger Mappertreffen 2019-06-18 germany
Sheffield Sheffield pub meetup 2019-06-18 england
Rostock Rostocker Treffen 2019-06-18 germany
Karlsruhe Stammtisch 2019-06-19 germany
London #geomob London 2019-06-19 england
Leoberdorf Leobersdorfer Stammtisch 2019-06-20 austria
Rennes Préparer ses randos pédestres ou vélos 2019-06-23 france
Bremen Bremer Mappertreffen 2019-06-24 germany
Angra do Heroísmo Erasmus+ EuYoutH_OSM Meeting 2019-06-24-2019-06-29 portugal
Salt Lake City SLC Map Night 2019-06-25 united states
Montpellier Réunion mensuelle 2019-06-26 france
Lübeck Lübecker Mappertreffen 2019-06-27 germany
Mannheim Mannheimer Mapathons e.V. 2019-06-27 germany
Düsseldorf Stammtisch 2019-06-28 germany
London OSMUK Annual Gathering including Wikidata UK Meets OSM 2019-06-29 united kingdom
Kyoto 幕末京都オープンデータソン#11:京の浪士と池田屋事件 2019-06-29 japan
Santa Fe State of the Map Argentina 2019 2019-07-27 argentina
Minneapolis State of the Map US 2019 2019-09-06-2019-09-08 united states
Edinburgh FOSS4GUK 2019 2019-09-18-2019-09-21 united kingdom
Heidelberg Erasmus+ EuYoutH_OSM Meeting 2019-09-18-2019-09-23 germany
Heidelberg HOT Summit 2019 2019-09-19-2019-09-20 germany
Heidelberg State of the Map 2019 (international conference) 2019-09-21-2019-09-23 germany
Grand-Bassam State of the Map Africa 2019 2019-11-22-2019-11-24 ivory coast

Note: If you would like to see your event here, please put it into the calendar. Only data which is there will appear in weeklyOSM. Please check your event in our public calendar preview and correct it where appropriate.

This weeklyOSM was produced by Polyglot, Rogehm, SK53, Silka123, SomeoneElse, TheFive, TheSwavu, YoViajo, derFred, geologist, jinalfoflia.

#Wikidata - Exposing #Diabetes #Research

07:04, Sunday, 09 June 2019 UTC
People die of diabetes when they cannot afford their insulin. There is not much that I can do about it, but I can work in Wikidata on the scholars, the awards and the published papers that have to do with diabetes. The important Wikidata tools here are Reasonator, Scholia and SourceMD, and the ORCiD, Google Scholar and VIAF websites prove themselves to be essential as well.

One way to stay focused is by concentrating on awards; at this time it is the Minkowski Prize, conferred by the European Association for the Study of Diabetes. The list of award winners was already complete, so I concentrated on their papers and co-authors. The first thing to do is to check whether there is an ORCiD identifier and whether that ORCiD identifier is already known in Wikidata; I found that it often is, and merges of Wikidata items may follow. I then submit a SourceMD job to update that author and their co-authors.

The next (manual) step is about gender ratios. Scholia includes a graphic representation of co-authors, and for all the "white" ones no gender has been entered. The process is as follows: when the gender is "obvious", it is just added. For an "Andrea" you look them up on Google and add what you think you see. When a name is given as "A. Winkowsky", you check ORCiD for a full name and iterate the process.

Once the SourceMD job is done, chances are that you have to start the gender process again because of new co-authors. Thomas Yates is a good example of a new co-author: already with a sizable number of papers (95) to his name, though the coverage is not yet complete (417 in total). Thomas is a "male".

What I achieve is an increasingly rich coverage of everything related to diabetes. The checks and balances ensure a high quality. And as more data is included in Wikidata, people who query will gain a better result.

What I personally do NOT do is add authors without an ORCiD identifier. It takes much more effort, and the chance of getting it wrong makes it unattractive as well. In addition, I care for science, but when people are not "Open" about their work I am quite happy for their colleagues to get the recognition they deserve.

This Month in GLAM: May 2019

01:21, Sunday, 09 June 2019 UTC

We've recently published research on performance perception that we carried out last year. The micro survey used in this study is still running on multiple Wikipedia languages and gives us insights into perceived performance.

The micro survey simply asks users on Wikipedia articles, in their own language, if they think that the current page loaded fast enough:

Let's look at the results on Spanish and Russian Wikipedias, where we're collecting the most data. We have collected more than 1.1 million survey responses on Spanish Wikipedia and close to 1 million on Russian Wikipedia so far. The survey is displayed to a small fraction of our visitors.

How satisfied are our visitors with our page load performance?

Ignoring neutral responses ("I'm not sure"), we see that, consistently across wikis, between 85 and 90% of visitors find that the page loaded fast enough. That's an excellent score, one that we can be proud of. And it makes sense, considering that Wikipedia is one of the fastest websites on the Web.
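That ratio can be reproduced with a few lines of shell; the counts below are invented for illustration, not the survey's actual response counts:

```shell
# Percentage of positive responses once neutral ("I'm not sure") answers are excluded.
satisfaction_ratio() {
  # $1 = positive responses, $2 = negative responses
  awk -v pos="$1" -v neg="$2" 'BEGIN { printf "%.2f", 100 * pos / (pos + neg) }'
}

satisfaction_ratio 858 142   # → 85.80
```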

Now, a very interesting finding is that this satisfaction ratio varies quite a bit depending on whether you're logged into the website, or if like most Wikipedia visitors, you're logged out:

wiki      status      sample size   satisfaction ratio
spanish   logged in         1,500   89.70%
spanish   logged out    1,109,205   85.82%
russian   logged in         7,093   92.28%
russian   logged out      885,926   85.82%

It appears that logged-in users are consistently more satisfied with our performance than logged-out visitors.

The contributor performance penalty

Andres Apevalov — Press team of Prima Vista Literature Festival, CC BY-SA 4.0

What's very surprising about logged-in users being more satisfied is that we know for a fact that the logged-in experience is slower, because our logged-in users have to reach our master datacenter in the US instead of hitting the cache point of presence closest to them. This is a long-standing technical limitation of our architecture, and an issue we intend to resolve one day.

Why could they possibly be happier, then?

The Spanish paradox

Spanish Wikipedia, at first glance, seems to contradict this phenomenon of slower page loads for logged-in users. Looking at the desktop site only (to rule out differences in the mobile/desktop mix):

wiki      status      median loadEventEnd (ms)
spanish   logged in   1400.5
spanish   logged out  1834
russian   logged in   1356
russian   logged out  1075

The reason why - contrary to what we see on other wikis and at a global scale - Spanish Wikipedia page loads seem faster for logged-in users is that Spanish Wikipedia traffic has a very peculiar geographic distribution. Logged-in users are much more likely to be based in Spain (30.04%) than their logged-out counterparts (22.3%), rather than in Latin American countries. Since internet connectivity tends to be faster in Spain, this ratio difference explains why the logged-in experience appears to be faster - but isn't - when looking at RUM data at the website level.

This is a very common pitfall of RUM data, where seemingly contradictory results can emerge depending on how you slice the data. RUM data has to be studied from many angles before drawing conclusions.

Caching differences

Looking at the Navigation Timing data we collect for survey respondents, we see that for logged-in users the median connect time on Spanish Wikipedia is 0, while for logged-out users it's 144ms. This means that logged-in users view a lot of pages, and the survey mostly ends up being displayed on their nth viewed page, where n is more than 1, because their browser is already connected to our domain. Whereas for a lot of logged-out users, we capture their first page load, with a higher probability of a cold cache. This means that logged-in users, despite the (potential) latency penalty of connecting to the US, tend to have more cached assets, particularly the JS and CSS needed by the page. This doesn't fully compensate for the performance penalty of connecting to a potentially distant datacenter, but it might reduce the variability of performance between page loads.
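The comparison above relies on median metrics. As a throwaway sketch (the sample values are invented for illustration, not the survey data), a median of connect-time samples can be computed like this:

```shell
# Median of numeric arguments: sort them, then take the middle value
# (or the mean of the two middle values for an even count).
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END { if (NR % 2) { print v[(NR + 1) / 2] } else { print (v[NR / 2] + v[NR / 2 + 1]) / 2 } }'
}

median 0 0 144 210 0   # → 0 (warm connections dominate these made-up samples)
```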

In order to further confirm this theory, in the future we could try to record information about how much of the JS and CSS was already available in the browser cache at the time the page load happened. This is not information we currently collect. Such data would allow us to confirm whether or not satisfaction is correlated with how well cached dependencies are, regardless of the user's logged-in/logged-out status.

Brand affinity?

Becoming a Wikipedia contributor - and therefore, logging in - requires a certain affinity to the Wikipedia project. It's possible, as a result, that logged-in users have a more favourable view of Wikipedia than logged-out users on average. And that positive outlook might influence how they judge the performance of the website.

This is a theory we will explore in the future by asking more questions in the micro survey, in order to determine whether or not the user who responds has a positive view of our website in general. This would allow us to quantify how large the effect of brand affinity might be on performance perception.

Kitty Quintanilla is a second year medical student at Western Michigan University Homer Stryker M.D. School of Medicine. Here, she shares why she’s passionate about increasing access to free information.

Kitty Quintanilla

Have you ever had your parents ramble at you about the “when I was young, we didn’t have all these computers and the internet and smartphones” thing? Mine have done it countless times, but they often sounded sad when they said it, wistful instead of condescending. A lot of older folk like to give our generation flak for using the internet as much as we do, but some like my parents wish they had this kind of thing growing up.

We have billions of pages of information at our fingertips, in seconds. It’s a modern miracle.

But among my family, a bunch of Latinos from El Salvador, I noticed a considerable portion were under-educated and had grown up in rural towns and extreme poverty, in a third-world country that didn't have much technology at all, not even basic electricity, let alone computers of any kind. Even now, in their forties, fifties, and sixties, they struggle with modern technology, and if they manage to figure out how to get Google open, they struggle to find things in the language they understand best.

I grew up translating things for my parents and family members: countless people who didn't speak or read English well enough to navigate this country or find the information and help they needed. As I grew older, I found myself in another position of translation. I worked at Johns Hopkins doing pediatrics research, trialing a text-alert system to help new Latino mothers navigate their newborn children's healthcare visits and needs, something to help break down the language barriers often present in healthcare. We hoped that once we had this system working, we could expand on it and even spread it farther, so hospitals all across the country could use similar templates for multiple languages, helping make the healthcare system a little easier to handle.

Look, healthcare is hard to navigate even for those of us who speak English. It must be even more terrifying and frustrating when you don’t speak English at all!

So I was delighted when I found out that the librarians at my medical school, Liz Lorbeer and Isaac Clark, wanted to create (and had been creating!) an elective and projects to help translate medical resources into Spanish, and Spanish resources into English.

The more information we can make accessible to Spanish-speaking people, the more we can help those who are consistently left floundering in the U.S. healthcare system. My parents would be thrilled to discover that there were pages upon pages of information in their native tongue, and more importantly, at a level they could understand. They didn't have the benefit of a robust education system; my father never even finished his equivalent of middle school, while my mother only had a high-school education. They always lament their lack of education, their struggle with English as a second language, and the way the Salvadoran Civil War stole opportunities and chances that others take for granted. They want to learn! They want to be able to search for information quickly and find what they are looking for. They want to make up for all the lost time.

For them to be able to learn at the click of a button, to open Wikipedia and find that the Spanish Wikipedia had pages on what they were looking for? That would be monumental.

I want to make all kinds of information more accessible for people like my parents: people who maybe do not know all the complicated jargon, or do not feel confident in their English, or want to read something simple and understandable in their native tongue. Wikipedia was created because of a desire to share knowledge and make it possible for anyone to learn, anyone to access and read, at the click of a button.

Our Wikipedia project was a fantastic chance to build something to help. I was delighted to help create the curriculum and syllabus for the elective, which would have students adding more information to Spanish Wikipedia articles, or even creating new ones! The English Wikipedia by far has the most articles, and there is such a vast gap of knowledge between the different Wikipedias, with so many topics not covered in other languages.

There were plenty of things I realized while helping to work on the project, though. For one, I realize how badly I've always taken my course syllabi for granted, especially in undergrad. Now that I have experienced the amount of planning and detail work that a syllabus requires, I have a completely new appreciation for every professor who has had to make one.

(I’m sorry, every single undergraduate professor whose syllabus I never read.)

The project also required a lot of testing—with me as a guinea pig, oh boy—to make sure it was feasible, and I also had to find resources for students who maybe weren't super fluent in Spanish. This is a translation course, but I was told to make it accessible and possible even for students who aren't very fluent or bilingual, which was a challenge.

If our particular project does become successful, I hope we can share how we’ve adapted the Wiki Education course template with other institutions and encourage their students to help in the endeavor to make medical information available, in even more languages than Spanish. Long story short, hopefully this project can be a step forward in the big grand goal of accessible information for everyone.

If you’re interested in having students write or translate Wikipedia articles as an assignment, use our free assignment templates and management tools! Visit teach.wikiedu.org for all you need to know to get started.

Wikipedia for Peace at Europride

10:14, Friday, 07 June 2019 UTC

Next week I’ll be taking a little time out from my work at Edinburgh to go to Wikipedia for Peace at Europride 2019 in Vienna. Europride promotes lesbian, gay, bisexual, trans (LGBT) and other queer issues on an international level through parades, festivals and other cultural activities.  During the event a group of international editors will be coming together to create and edit LBGT+ articles in a range of European languages.  The event, which is run by Wikimedia Austria, is part of the Wikipedia for Peace movement which aims to strengthen peace and social justice through Wikimedia projects. Wikipedia for Peace organises community projects which bring together Wikipedia editors and people active in social and peace movements.

Although I’m not exactly the world’s most prolific Wikipedia editor, one of my proudest editing achievements is creating a page for Mary Susan McIntosh during one of Ewan McAndrew’s early editathons at the University of Edinburgh.  McIntosh was one of the founders of the Gay Liberation Front in the UK, and a member of the Policy Advisory Committee which advocated for lowering the age of male homosexual consent from 21 to 18.  As an academic criminologist and sociologist, she was one of the first to present evidence that homosexuality was not a psychiatric or clinical pathology but rather influenced by historical and cultural factors, and her paper The Homosexual Role was crucial in shaping the development of social constructionism. 

I had never heard of McIntosh before writing her Wikipedia entry and it was shocking to me that such an important activist and foundational thinker had been omitted from the encyclopedia.  I hope I can use my time in Vienna to create articles for other overlooked individuals from the queer community.   I’m particularly interested in focusing on the creation of articles around bisexual topics and individuals, which are sometimes marginalised in the LGBT+ community.  So if there are any LGBT+ (with emphasis on the B) topics or individuals that you think should be added to the encyclopedia, please let me know!  You can also participate in the event remotely by signing up here.

I’m also looking forward to having an opportunity to photograph the European Pride Parade for Wikimedia Commons.  I think this will be my first Pride since 1998!

I’m immensely grateful to Wikimedia Austria for supporting my attendance at this event, and to Wikimedia UK for funding my travel through one of their project grants. Wikimedia UK’s project grants support volunteers to complete activities that benefit the organisation’s strategic goals, including creating and raising awareness of open knowledge, building volunteer communities, releasing information and images under an open licence, and technology innovation. You can find out more about project grants and how to apply on the Wikimedia UK Project Grants page.

Perspectives on #references, #citations

13:02, Thursday, 06 2019 June UTC
Wikipedia articles, scientific papers and some books have them: citations. Depending on your outlook, citations serve a different purpose. They exist to prove a point or to enable further reading. These differing purposes are not without friction.

In science, it makes sense to cite the original research establishing a fact. This is important because when such a fact is retracted, the whole chain of citing papers may need to be reconsidered. In a Wikipedia article it is, imho, a bit different. For many people references are next-level reading material, and therefore a well-written text expanding on the article is to be preferred; it helps bring things together.

When you consider the points made in a book to be important, like the (many) points made in Superior, the book by Angela Saini, you can expand the Wikidata item for the book by including its citations. It is one way to underline a point, because those who seek such information will find a lot of additional reading and confirmation of the points made.

Adding citations in Wikidata often means that the sources and their authors have to be introduced first. It takes some doing, but by adding DOI, ORCiD, VIAF, and/or Google Scholar data it is easy to make future connections. If you care to add citations to this book with me, this is my project page.
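For the curious, here is a hedged sketch of what one such edit looks like at the API level: building the parameters for a single MediaWiki wbcreateclaim call that adds a “cites work” (P2860) statement to a book’s item. The item IDs below are placeholders, and a real request would also need an edit token and OAuth/bot authentication (which a tool like QuickStatements handles for you).

```python
import json

# Wikidata's MediaWiki API endpoint.
API_URL = "https://www.wikidata.org/w/api.php"

def cites_work_params(book_item: str, cited_item: str) -> dict:
    """Parameters for one wbcreateclaim request (edit token omitted)."""
    return {
        "action": "wbcreateclaim",
        "entity": book_item,            # the item for the book
        "property": "P2860",            # "cites work"
        "snaktype": "value",
        "value": json.dumps({
            "entity-type": "item",
            "numeric-id": int(cited_item.lstrip("Q")),
        }),
        "format": "json",
    }

# Placeholder IDs, purely for illustration.
params = cites_work_params("Q11111", "Q22222")
print(params["property"], params["value"])
```

Each cited paper would get one such statement on the book’s item; the papers themselves (and their authors) are introduced as items of their own first.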

Welcome interns Amit, Khyati, and Ujjwal!

18:27, Wednesday, 05 2019 June UTC
Amit Joki

Last week we kicked off three exciting internship projects to improve the Wiki Education Dashboard. Over the next few months, Outreachy and Google Summer of Code students will join the Wiki Education technology team to build new features and tools for both Wiki Education programs and the global Wikimedia community.

Amit Joki, an Information Technology student from Madurai, has a wide-ranging plan for improving the support for tracking cross-wiki programs. Among other things, Amit’s project will make it easier and more intuitive to choose which wikis to track edits for. Amit has been contributing to the Dashboard since spring 2018, and is responsible for a number of key features, including completing the internationalization system for Dashboard training modules. He just finished his second year of college.

Khyati Soneji

Khyati Soneji, a Computer Science student from Gandhinagar, will be working on making the Dashboard a better tool for the #1lib1ref campaign, which focuses on adding citations and other improvements to Wikipedia through outreach to librarians. One big part of this project will be to add ‘references added’ as one of the core statistics that the Dashboard tracks. Khyati also just finished her second year of college.

Ujjwal Agrawal, who just finished his Electronics and Communication Engineering degree from Indian Institute of Technology, Dhanbad, will be building an Android app for accessing the Dashboard. Ujjwal is a veteran Android developer who spent last summer working on the Wikimedia Commons app for Google Summer of Code.

Ujjwal Agrawal

Wes Reid and I will serve as mentors, and we’re looking forward to seeing what Amit, Khyati, and Ujjwal can accomplish.

To read more about Wiki Education’s open tech project and mentorship, read our blog post about running a newbie-friendly software project.




This week, the Wikimedia Foundation was invited to provide opening remarks for the third annual Global Conference of the Internet Jurisdiction and Policy Network in Berlin, Germany. This conference represents a place for civil society, platforms, elected representatives, policymakers, and other stakeholders to come together and discuss how we can manage tensions between competing national laws that impact information on the internet while elevating our essential rights and freedoms.  As advocates of free knowledge, we saw it as an opportunity to share our belief in the importance of policymaking that supports an internet that is open and accessible to all.

This conference comes at a critical moment. The internet is in a moment of change, a testing of the boundaries of the free exchange of information and ideas. In the past year, we have seen increased concern about what information is available on social media and online, and how videos, images and stories are being shared more quickly and with wider audiences than has previously been possible.

This summit is an opportunity for all of us to continue to weigh how potential regulation may impact the promise of the internet to connect people and serve the common good. An overly broad, one-size-fits-all approach to regulation across the internet preferences platforms over people, places limits on knowledge and collaboration online, and effectively builds walls between people and ideas, rather than bridges. As stakeholders consider the very real challenges and responsibilities posed by internet governance and regulation, it is crucial to consider the following:

  • The importance of clearly articulating the norms and values we seek to uphold
  • The responsibility of governments to protect, and platforms to respect human rights
  • The challenges and risks of reactionary responses and one-size-fits-all regulation
  • The need for cross-border collaboration in service of our common humanity, and
  • The need to engage all stakeholders, especially civil society, in these critical dialogues

Laws and public policy should promote and preserve the freedom to share and participate in knowledge and exchange. The internet—and Wikipedia—is richer, more useful, and more representative when more people can engage together. That is why, unlike other internet platforms, Wikipedia does not localize knowledge for different countries or target it to individual users. Versions of Wikipedia are differentiated only by language—never by geography, demographic, or personal preference.

That means the information on Wikipedia is the same whether you are in Berlin or Brasilia, and editors from around the world can work together to improve, correct, and advance knowledge. Such a flourishing and competition of ideas and perspectives from different cultures may be a messy process, but it allows people to build consensus on how we see and share the world around us.

Any regulation also needs to consider its impact on international human rights. They are universal, fundamental, and non-negotiable. We should carefully examine all solutions to make sure that we are aware of how potential restrictions could be abused, applied unevenly to different populations, or enforced too broadly in a way that silences or excludes people online. When we are overzealous about limiting knowledge, we risk impacting inclusivity and diversity. Permanent removal of knowledge can have long-term invisible impacts.

So how can we keep knowledge free and the internet open? Our recommendation is that this happens by giving power not to the few but to the many. Wikipedia is often held up as an exception to more traditional models for the consumer web, but we believe it is evidence that decentralized models of curation and regulation can work. Wikipedia has shown how effective it can be when we empower communities to uphold a clear mission, purpose, and set of standards. As we look to the future of content moderation, we must similarly devise means to involve broad groups of stakeholders in these discussions, in order to create truly democratic, diverse, and sustainable online spaces.

Wikimedia’s vision is a world where every single human can freely share in the sum of all knowledge. This week’s conference produced some powerful momentum and collaboration between a multitude of stakeholders towards this shared future. The hard work is just beginning, but by meaningfully engaging more people and organizations today and in the future, we can develop standards and principles that are more inclusive, more enforceable, and more effective. We are encouraged by the possibility in front of us.

Together, we can help protect a flourishing and open internet that allows for new forms of culture, science, participation and knowledge.

Batches of Rust

10:47, Wednesday, 05 2019 June UTC

QuickStatements is a workhorse for Wikidata, but it has had a few problems of late.

One of those is bad performance with batches. Users can submit a batch of commands to the tool, and these commands are then run on the Labs server. This mechanism has been bogged down for several reasons:

  • Batch processing written in PHP
  • Each batch running in a separate process
  • Limitation of 10 database connections per tool (web interface, batch processes, testing etc. together) on Labs
  • Limitation of (16? observed but not validated) simultaneous processes per tool on Labs cloud
  • No good way to auto-start a batch process when it is submitted (currently, auto-starting a PHP process every 5 minutes, and exit if there is nothing to do)
  • Large backlog developing

Amid continued bombardment on wiki talk pages, Twitter, Telegram etc. that “my batch is not running (fast enough)”, I set out to mitigate the issue. My approach is to run all the batches in a new processing engine, written in Rust. This has several advantages:

  • Faster and easier on the resources than PHP
  • A single process running on Labs cloud
  • Each batch is a thread within that process
  • Checking for a batch to start every second (if you submit a new batch, it should start almost immediately)
  • Use of a database connection pool (the individual thread might have to wait a few milliseconds to get a connection, but the system never runs out)
  • Limiting simultaneous batch processing for batches from the same user (currently: 2 batches max) to avoid the MediaWiki API “you-edit-too-fast” error
  • Automatic handling of maxlag, bot/OAuth login etc. by using my mediawiki crate
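The threading-plus-pool design above can be sketched in miniature, using only the Rust standard library (the real engine uses a proper database connection pool and my mediawiki crate; here the pool is simulated as a channel of connection tokens, and the batch work is stubbed out). Each batch thread must take a token before “editing” and returns it afterwards, so the process never exceeds the pool size no matter how many batches are open:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// Simulated pool size; the real limit on Labs is per-tool connections.
const POOL_SIZE: usize = 3;

fn process_batches(batch_count: usize) -> usize {
    // The "pool" is a channel pre-filled with connection tokens.
    let (pool_tx, pool_rx) = mpsc::channel();
    for conn_id in 0..POOL_SIZE {
        pool_tx.send(conn_id).unwrap();
    }
    let pool_rx = Arc::new(Mutex::new(pool_rx));

    // Completion channel so we can wait for all batches to finish.
    let (done_tx, done_rx) = mpsc::channel();
    for batch in 0..batch_count {
        let pool_rx = Arc::clone(&pool_rx);
        let pool_tx = pool_tx.clone();
        let done_tx = done_tx.clone();
        // Each batch is a thread within the single process.
        thread::spawn(move || {
            // Block until a connection token is free
            // (may wait a few milliseconds under load).
            let conn = pool_rx.lock().unwrap().recv().unwrap();
            // ... run the batch's commands against the API here ...
            pool_tx.send(conn).unwrap(); // return the token to the pool
            done_tx.send(batch).unwrap();
        });
    }
    drop(done_tx);
    done_rx.iter().count() // number of batches completed
}

fn main() {
    let completed = process_batches(40);
    println!("{completed} batches processed");
}
```

This is only a toy model of the scheduling, not the actual QuickStatements code; the per-user batch limit and maxlag handling layer on top of the same pattern.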

This is now running on Labs, processing all (~40 at the moment) open batches simultaneously. Grafana shows the spikes in edits, but no increased lag so far. The process is given 4GB of RAM, but could probably do with a lot less (for comparison, each individual PHP process used 2GB).

A few caveats:

  • This is a “first attempt”. It might break in new, fun, unpredicted ways
  • It will currently not process batches that deal with Lexemes. This is mostly a limitation of the wikibase crate I use, and will likely get solved soon. In the meantime, please run Lexeme batches only within the browser!
  • I am aware that I now have code duplication (the PHP and the Rust processing). For me, the solution will be to implement QuickStatements command parsing in Rust as well, and replace PHP completely. I am aware that this will impact third-party use of QuickStatements (e.g. the WikiBase docker container), but the PHP and Rust sources are independent, so there will be no breakage; of course, the Rust code will likely evolve away from PHP in the long run, possibly causing incompatibilities

So far, it seems to be running fine. Please let me know if you encounter any issues (unusual errors in your batch, weird edits etc.)!

Older blog entries