September 18, 2018

Andrew Newell, User:RockMagnetist

Andrew Newell, also known as RockMagnetist on Wikipedia, is the Visiting Scholar at the Deep Carbon Observatory for 2017-2018. That means he has institutional access to academic sources to improve Wikipedia’s coverage of topics relevant to deep carbon science. He is a long-time contributor and administrator; if you’ve read about geophysics-related subjects on Wikipedia, there’s a very good chance you could find his username somewhere in the articles’ edit histories. 

An honorarium

This year, I became the first Visiting Scholar to be paid an honorarium. It wasn’t all that much ($3000 over a year), but it did raise questions about conflict of interest. Wikipedia was built out of the voluntary contributions of thousands of editors, and paid editing is rightly viewed with deep suspicion. Wikipedia has a “policy with legal implications” on disclosure of paid contributions and a conflict of interest behavioral guideline. The latter says that conflict-of-interest editing is “strongly discouraged”.

The dreaded sock puppet.
Image: File:Gebreide aap techniek rondbreien.JPG, Ellywa, CC BY-SA 4.0, via Wikimedia Commons.

My honorarium was paid by the Deep Carbon Observatory (DCO). We had to decide how I should conduct myself so that there wouldn’t be any problems. Under the user name RockMagnetist, I have been an editor on Wikipedia since 2010 and an administrator since 2013. Naturally, using administrative privileges for paid work is a major no-no! So the first thing I did was to create a separate non-admin account, RockMagnetist (DCO visiting scholar). This also made it easy to track my contributions using Wiki Education’s Dashboard. But even this step must be done carefully because alternate accounts are often used to deceive other editors and make trouble. Such accounts are called sock puppets and are among the most reviled objects on Wikipedia. To avoid any suspicion of my being a sock puppet, I made the connection clear in my user name and with a tag on my user page.

Avoiding conflict of interest

To avoid conflict of interest, I needed to have a clear understanding with the DCO. Fortunately, they agreed that I shouldn’t actively promote their work. Instead, I would add content in subject areas related to deep carbon, and while I would include work by DCO scientists, I wouldn’t give it undue weight.

Readers may wonder if editing articles for the DCO while not giving undue weight to DCO scholars is like Douglas Hofstadter’s cure for hiccups in Gödel, Escher, Bach: “Run around the house three times without thinking of the word ‘wolf’.”[1] But really, it’s not that hard. The key is to rely primarily on secondary sources, as Wikipedia policy requires. These are published sources that discuss original research, providing context and interpretation. All analysis, evaluation and interpretation should come from secondary sources. And these sources determine the proper weight to assign to contributions by DCO scholars.

Choosing pages to edit

My initial plan was to use a top-down approach, adding material to frequently visited articles such as Diamond and Carbon cycle, and working downwards from there. This benefits the most people; while Diamond is visited a million times a year, Deep Carbon Observatory only gets about two thousand views. An article on one of the broader topics can also interest readers in more specialized topics. Indeed, I was often inspired to write about those specialized topics.

Diamonds

Artist’s conception of a multitude of tiny diamonds next to a hot star.
Image: File:SpaceNanoDiamonds.jpg, public domain, via Wikimedia Commons.

My work on the article on diamonds was comparatively straightforward. It was already a featured article, the highest level of quality, but the section on the geology was out of date and missing some basic information on how diamonds form. I needed to tear it apart and reorganize it, but if I did that in situ it would leave a mess – something one wouldn’t want to see in an FA-class article, even temporarily. So I rewrote the section in my user space and replaced the existing one in a single edit.

The old version had a link to an article called “Diamonds on Jupiter and Saturn”. This was a rambling account of a theory that predicted that it rains diamonds in the atmospheres of those planets. This intriguing theory never gained any traction in the scientific world, but there is a much stronger case for diamond rain on Uranus and Neptune. This was the start of a fascinating journey into accounts of a vast number of nanodiamonds forming in space, as well as diamond planets, diamonds forming in stars and a theory that diamond might have been the first mineral in the Universe. This became the article Extraterrestrial diamonds, which appeared on Wikipedia’s Main Page in the Did you know column and attracted over 10,000 views in a few hours.

Carbon cycle and geochemistry

An ad for the 1911 Encyclopedia Britannica.
Image: File:EncycBrit1913.jpg, public domain, via Wikimedia Commons.

My goal for the article on the carbon cycle was to expand and update the skimpy section on the Earth’s interior. However, a discussion of the carbon cycle, or any biogeochemical cycle, requires a variety of geochemical concepts such as reservoirs and residence times. These are basic ideas in geochemistry, and I felt it was important that they be explained properly somewhere in Wikipedia. This led me to the article Geochemistry.

When I first encountered Geochemistry back in 2012, it was a bit out of date – most of it was taken from a 1911 edition of Encyclopedia Britannica! By 2017, sections on the history of geochemistry and trace metals in the oceans had been added, but it was still dominated by a 1911 account of the composition of Earth’s crust. That bothered me. I don’t really know much about geochemistry, but I have often encountered it in my work as a geophysicist, and I appreciate its importance. So I started to update the article. I added a lot of material on the abundance of the elements on scales from the Universe down to the Earth. I also added some basic concepts like differentiation and mixing.

After looking at Carbon cycle, I returned to Geochemistry and added information on reservoirs and residence times. I put in some basic information on so-called box models that represent a geophysical system by reservoirs with inputs and outputs. I also discovered that there was already an article on residence time – in fact there were several! The concept is important not only in earth science but also in several engineering fields (it was first developed to analyze chemical reactors). I merged five articles that covered the same ground and developed a coherent introduction to the subject with applications in fields like groundwater flow and pharmacology. I also worked on Fugacity, another important concept in geochemistry and one that I had never understood. After reading the existing article, I still didn’t understand it, so I wrote a version that I did understand.

Organic materials

Carpathite, an organic mineral that fluoresces under ultraviolet light.
Image: File:Pierre-img 0579.jpg, Rama, CC BY-SA 2.0 FR, via Wikimedia Commons.

I didn’t always use the top-down approach. I was aware of efforts by DCO scholars to add to Wikipedia and couldn’t resist looking over their shoulders. They added new articles on some recently discovered minerals such as middlebackite. They were mostly organic minerals, a group of minerals containing carbon. I noticed that this group had no article devoted to it. Organic minerals are generally rare; they are found in curious settings such as fossilized guano; their very definition involves some historical quirks related to early ideas about vitalism; and the DCO launched a Carbon Mineral Challenge to promote the discovery of more of them. So with a bottom-up approach, I created the article Organic mineral, and it was accepted for the Did you know column.

Biographies

Other DCO scholars added biographies of colleagues. Creating autobiographical articles is strongly discouraged because it is difficult to write neutrally and objectively about yourself. The same goes for people with whom you have a close connection. I had intended to avoid biographies of DCO scholars, but I did end up working on two – those of Mark S. Ghiorso and Robert Hazen. Aside from passing Ghiorso in the hallway when I was a graduate student, I have no connection with these people, and objectively they make excellent subjects for biographies, with plenty of independent sources. The joys and challenges of writing scientific biographies could easily fill another blog post.

COI revisited

Conflict of interest is primarily a quality issue. As the guidelines state, people with conflicts of interest add material that is “typically unsourced or poorly sourced and often violates the neutral point of view policy by being promotional and omitting negative information.” The best way to avoid this trap is to only write about a subject when you can find good secondary sources. And have fun – write about stuff that piques your interest!

References

1. Hofstadter, Douglas R. (1979). Gödel, Escher, Bach: An Eternal Golden Braid. Basic Books. p. 254. ISBN 0394745027.


To read more about RockMagnetist’s work, check out our roundup here. To learn more about current Visiting Scholars, click here.

Another piece of Wiki research has been published, and you would not know it when you look for it in the Wikiverse. The people who are big in Wiki research run a project called Wikicite. It has multiple aspects: publications and their authors are entered in Wikidata, and blogs are written about how wonderful it all is, about the growing quality and quantity of the data on offer and how well an author may be presented in Scholia.

This is all well and good, and indeed there is plenty to cheer about, like the documentation on Zika fever that is included. But when a subject that is key to the Wikimedia researchers themselves is not as well represented, it will never be good enough.

There is a reason why using your own tools is so relevant: it shows you where your model fails you. All kinds of things, like conference speakers, are included, and many more female scientists are represented as a result of the indomitable Jess Wade. But when new research by Wikimedians, professional Wikimedians, is not included, the effort is not sincere; the people who do go to conferences do not learn from the daily practice, and that makes Wikicite stale and mostly academic.
Thanks,
     GerardM

September 17, 2018

What happened?

On May 3rd, 2018, a large spike in the number of login attempts was detected on English Wikipedia, caused by a dictionary attack originating primarily from a single internet service provider.

Several hours into the attack, the security team and others at the Foundation launched countermeasures to mitigate the attacker's efforts. While the countermeasures were successful, end users continued to receive "failed login" notification emails as usual.

What information was involved?

Users whose accounts were compromised were contacted or blocked. Information disclosed consisted of usernames and passwords derived as part of the dictionary attack. No personal information was disclosed.

What are we doing about it?

Changes to password policies: The security team and others at the Foundation are evaluating our current password policy with the intention of strengthening it to better protect online identities, promote a culture of security, and to align with best practices. More on this in the coming weeks but it’s definitely a step in the right direction.

Routine security assessments: Starting at the end of September, the security team will begin a series of penetration tests to assess some of our current controls and capabilities.

As the Security team grows (we’re hiring) we will expand our capabilities to include additional assessments such as routine dictionary attacks to identify poorly credentialed accounts, penetration testing, policy updates, and additional security controls and countermeasures.

Other technical controls and countermeasures: While we can’t disclose our exact countermeasures, we have a series of additional technical controls and countermeasures that will be implemented in the near future.

Security Awareness: There are several changes coming, and to support them the security team will be launching various security awareness campaigns in the coming months.

John Bennett
Director of Security, Wikimedia Foundation


September 15, 2018

In my last posts I explored some optimization strategies inside the Rust code for mtpng, a multithreaded PNG encoder I’m creating. Now that it’s fast, and the remaining features are small enough to pick up later, let’s start working on a C API so the library can be used by C/C++-based apps.

If you search the web you’ll find a number of tutorials on the actual FFI-level interactions, so I’ll just cover a few basics and things that stood out. The real trick is in making a good build system; currently Rust’s “Cargo” doesn’t interact well with Meson, for instance, which really wants to build everything out-of-tree. I haven’t even dared touch autoconf. ;)

For now, I’m using a bare Makefile for Unix (Linux/macOS) and a batch file for Windows (ugh!) to drive the building of a C-API-exercising test program. But let’s start with the headers and FFI code!

Contracts first: make a header file

The C header file (“mtpng.h”) defines the types, constants, and functions available to users of the library. It defines the API contract, both in code and in textual comments, because you can’t express things like lifetimes in C code. :)

Enums

Simple enums (“fieldless enums” as they’re called) can be mapped to C enums fairly easily, but be warned the representation may not be compatible. C enums usually map to the C ‘int’ type (the C standard leaves an enum’s underlying integer type implementation-defined, so “usually” is as strong as it gets), which on the Rust side is known as c_int. (Clever!)

//
// Color types for mtpng_encoder_set_color().
//
typedef enum mtpng_color_t {
    MTPNG_COLOR_GREYSCALE = 0,
    MTPNG_COLOR_TRUECOLOR = 2,
    MTPNG_COLOR_INDEXED_COLOR = 3,
    MTPNG_COLOR_GREYSCALE_ALPHA = 4,
    MTPNG_COLOR_TRUECOLOR_ALPHA = 6
} mtpng_color;

So the representation is not memory-compatible with the Rust version, which is explicitly specified to fit in a byte:

#[derive(Copy, Clone)]
#[repr(u8)]
pub enum ColorType {
    Greyscale = 0,
    Truecolor = 2,
    IndexedColor = 3,
    GreyscaleAlpha = 4,
    TruecolorAlpha = 6,
}

If you really need to ship the enums around bit-identical, use #[repr(C)] so the Rust enum gets the same representation a C compiler would pick. Note also that there’s no enforcement that enum values transferred over the FFI boundary will have valid values! So always use a checked transfer function on input like this:

impl ColorType {
    pub fn from_u8(val: u8) -> io::Result<ColorType> {
        return match val {
            0 => Ok(ColorType::Greyscale),
            2 => Ok(ColorType::Truecolor),
            3 => Ok(ColorType::IndexedColor),
            4 => Ok(ColorType::GreyscaleAlpha),
            6 => Ok(ColorType::TruecolorAlpha),
            _ => Err(other("Invalid color type value")),
        }
    }
}

More complex enums can contain fields, and then things get trickier. For my case I found it simplest to map some of my mode-selection enums into a shared namespace, where the “Adaptive” value maps to something that doesn’t fit in a byte (and so could never collide with a valid fixed value), and the “Fixed” values map to their contained byte values:

#[derive(Copy, Clone)]
pub enum Mode<T> {
    Adaptive,
    Fixed(T),
}

#[repr(u8)]
#[derive(Copy, Clone)]
pub enum Filter {
    None = 0,
    Sub = 1,
    Up = 2,
    Average = 3,
    Paeth = 4,
}

maps to C:

typedef enum mtpng_filter_t {
    MTPNG_FILTER_ADAPTIVE = -1,
    MTPNG_FILTER_NONE = 0,
    MTPNG_FILTER_SUB = 1,
    MTPNG_FILTER_UP = 2,
    MTPNG_FILTER_AVERAGE = 3,
    MTPNG_FILTER_PAETH = 4
} mtpng_filter;

And the FFI wrapper function that takes it maps them to appropriate Rust values.
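
That mapping is just a match on the raw value. Here is a sketch of what such a helper could look like (the function name is my own, not mtpng’s actual code, and it assumes the other() error helper plus the io and libc imports used in the snippets above):

// Sketch only: translate a C-side mtpng_filter value into Rust's Mode<Filter>.
// -1 selects adaptive filtering, 0-4 are the fixed PNG filter types, and any
// other value is rejected rather than trusted across the FFI boundary.
fn filter_mode_from_c(val: c_int) -> io::Result<Mode<Filter>> {
    match val {
        -1 => Ok(Mode::Adaptive),
        0 => Ok(Mode::Fixed(Filter::None)),
        1 => Ok(Mode::Fixed(Filter::Sub)),
        2 => Ok(Mode::Fixed(Filter::Up)),
        3 => Ok(Mode::Fixed(Filter::Average)),
        4 => Ok(Mode::Fixed(Filter::Paeth)),
        _ => Err(other("Invalid filter value")),
    }
}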

Callbacks and function pointers

The API for mtpng uses a few callbacks, required for handling output data and optionally as a driver for input data. In Rust, these are handled using the Write and Read traits, and the encoder functions are even generic over them to avoid having to make virtual function calls.

In C, the traditional convention is followed of passing function pointers and a void* as “user data” (which may be NULL, or a pointer to a private state structure, or whatever floats your boat).

In the C header file, the callback types are defined for reference and so they can be validated as parameters:

typedef size_t (*mtpng_read_func)(void* user_data,
                                  uint8_t* p_bytes,
                                  size_t len);

typedef size_t (*mtpng_write_func)(void* user_data,
                                   const uint8_t* p_bytes,
                                   size_t len);

typedef bool (*mtpng_flush_func)(void* user_data);

On the Rust side we must define them as well:

pub type CReadFunc = unsafe extern "C"
    fn(*const c_void, *mut uint8_t, size_t) -> size_t;

pub type CWriteFunc = unsafe extern "C"
    fn(*const c_void, *const uint8_t, size_t) -> size_t;

pub type CFlushFunc = unsafe extern "C"
    fn(*const c_void) -> bool;

Note that the function types are defined as unsafe (so must be called from within an unsafe { … } block or another unsafe function), and extern “C” which defines them as using the platform C ABI. Otherwise the function defs are pretty standard, though they use C-specific types from the libc crate.

Note it’s really important to use the proper C types because different platforms may have different sizes of things. Not only do you have the 32/64-bit split, but 64-bit Windows has a different c_long type (32 bits) than 64-bit Linux or macOS (64 bits)! This way, if there are any surprises, the compiler will catch them when you build.

Let’s look at a function that takes one of those callbacks:

extern mtpng_result
mtpng_encoder_write_image(mtpng_encoder* p_encoder,
                          mtpng_read_func read_func,
                          void* user_data);

#[no_mangle]
pub unsafe extern "C"
fn mtpng_encoder_write_image(p_encoder: PEncoder,
                             read_func: Option<CReadFunc>,
                             user_data: *const c_void)
-> CResult
{
    if p_encoder.is_null() {
        CResult::Err
    } else {
        match read_func {
            Some(rf) => {
                let mut reader = CReader::new(rf, user_data);
                match (*p_encoder).write_image(&mut reader) {
                    Ok(()) => CResult::Ok,
                    Err(_) => CResult::Err,
                }
            },
            _ => {
                CResult::Err
            }
        }
    }
}

Note that in addition to the unsafe extern “C” we saw on the function pointer definitions, the exported function also needs to use #[no_mangle]. This marks it as using a C-compatible function name; otherwise the C linker won’t find it by name! (If it’s an internal function you want to pass by reference to C, but not expose as a symbol, then you don’t need that.)

Notice that we took an Option<CReadFunc> as a parameter value, not just a CReadFunc. This is needed so we can check for NULL input values, which map to None, while valid values map to Some(CReadFunc). (The actual pointer to the struct is more easily checked for NULL, since that’s inherent to pointers.)

The actual function is passed into a CReader, a struct that implements the Read trait by calling the function pointer:

pub struct CReader {
    read_func: CReadFunc,
    user_data: *const c_void,
}

impl CReader {
    fn new(read_func: CReadFunc,
           user_data: *const c_void)
    -> CReader
    {
        CReader {
            read_func: read_func,
            user_data: user_data,
        }
    }
}

impl Read for CReader {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let ret = unsafe {
            (self.read_func)(self.user_data,
                             &mut buf[0],
                             buf.len())
        };
        if ret == buf.len() {
            Ok(ret)
        } else {
            Err(other("mtpng read callback returned failure"))
        }
    }
}
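
The write side works the same way. mtpng’s actual CWriter isn’t shown here, but assuming the CWriteFunc and CFlushFunc types above, the std::io::Write trait, and the same other() error helper, a sketch of it could look like this:

pub struct CWriter {
    write_func: CWriteFunc,
    flush_func: CFlushFunc,
    user_data: *const c_void,
}

impl CWriter {
    fn new(write_func: CWriteFunc,
           flush_func: CFlushFunc,
           user_data: *const c_void)
    -> CWriter
    {
        CWriter {
            write_func: write_func,
            flush_func: flush_func,
            user_data: user_data,
        }
    }
}

impl Write for CWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Hand the buffer to the C callback; a short write counts as failure.
        if buf.is_empty() {
            return Ok(0);
        }
        let ret = unsafe {
            (self.write_func)(self.user_data,
                              &buf[0],
                              buf.len())
        };
        if ret == buf.len() {
            Ok(ret)
        } else {
            Err(other("mtpng write callback returned failure"))
        }
    }

    fn flush(&mut self) -> io::Result<()> {
        // The flush callback just reports success or failure.
        if unsafe { (self.flush_func)(self.user_data) } {
            Ok(())
        } else {
            Err(other("mtpng flush callback returned failure"))
        }
    }
}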

Opaque structs

Since I’m not exposing any structs with public fields, I’ve got a couple of “opaque struct” types on the C side which are used to handle pointing at the Rust structs from C. Not a lot of fancy-pants work is needed to marshal them; the pointers on the C side pass directly to pointers on the Rust side and vice versa.

typedef struct mtpng_threadpool_struct
    mtpng_threadpool;

typedef struct mtpng_encoder_struct
    mtpng_encoder;

One downside of opaque structs on the C side is that you cannot allocate them on the stack, because the compiler doesn’t know how big they are — so we must allocate them on the heap and explicitly release them.

In Rust, it’s conventional to give structs an associated “new” method, and/or wrap them in a builder pattern to set up options. Here I wrapped Rayon’s ThreadPool builder with a single function that takes a number of threads, boxes it up on the heap, and returns a pointer to the heap object:

extern mtpng_result
mtpng_threadpool_new(mtpng_threadpool** pp_pool,
                     size_t threads);

pub type PThreadPool = *mut ThreadPool;

#[no_mangle]
pub unsafe extern "C"
fn mtpng_threadpool_new(pp_pool: *mut PThreadPool,
                        threads: size_t)
-> CResult
{
    if pp_pool.is_null() {
        CResult::Err
    } else {
        match ThreadPoolBuilder::new().num_threads(threads).build() {
            Ok(pool) => {
                *pp_pool = Box::into_raw(Box::new(pool));
                CResult::Ok
            },
            Err(_err) => {
                CResult::Err
            }
        }
    }
}

The real magic here is Box::into_raw() which replaces the Box<ThreadPool> smart pointer with a raw pointer you can pass to C. This means there’s no longer any smart management or releasing, so it’ll outlive the function… and we need an explicit release function:

extern mtpng_result
mtpng_threadpool_release(mtpng_threadpool** pp_pool);

#[no_mangle]
pub unsafe extern "C"
fn mtpng_threadpool_release(pp_pool: *mut PThreadPool)
-> CResult
{
    if pp_pool.is_null() {
        CResult::Err
    } else {
        drop(Box::from_raw(*pp_pool));
        *pp_pool = ptr::null_mut();
        CResult::Ok
    }
}

Box::from_raw() turns the pointer back into a Box<ThreadPool>, which de-allocates the ThreadPool at the end of function scope.

Lifetimes

Annotating object lifetimes in this situation is … confusing? I’m not sure I did it right at all. The only lifetime marker I currently use is the one for the thread pool, which must live at least as long as the encoder struct.

As a horrible hack I’ve defined the CEncoder to use static lifetime for the threadpool, which seems …. horribly wrong. I probably don’t need to do it like this. (Guidance and hints welcome! I will update the post and the code! :D)

// Cheat on the lifetimes?
type CEncoder = Encoder<'static, CWriter>;

Then the encoder creation, which takes an optional ThreadPool pointer and required callback function pointers, looks like:

extern mtpng_result
mtpng_encoder_new(mtpng_encoder** pp_encoder,
                  mtpng_write_func write_func,
                  mtpng_flush_func flush_func,
                  void* const user_data,
                  mtpng_threadpool* p_pool);

pub type PEncoder = *mut CEncoder;

#[no_mangle]
pub unsafe extern "C"
fn mtpng_encoder_new(pp_encoder: *mut PEncoder,
                     write_func: Option<CWriteFunc>,
                     flush_func: Option<CFlushFunc>,
                     user_data: *const c_void,
                     p_pool: PThreadPool)
-> CResult
{
    if pp_encoder.is_null() {
        CResult::Err
    } else {
        match (write_func, flush_func) {
            (Some(wf), Some(ff)) => {
                let writer = CWriter::new(wf, ff, user_data);
                if p_pool.is_null() {
                    let encoder = Encoder::new(writer);
                    *pp_encoder = Box::into_raw(Box::new(encoder));
                    CResult::Ok
                } else {
                    let encoder = Encoder::with_thread_pool(writer, &*p_pool);
                    *pp_encoder = Box::into_raw(Box::new(encoder));
                    CResult::Ok
                }
            },
            _ => {
                CResult::Err
            }
        }
    }
}

Note how we take the p_pool pointer and turn it into a Rust reference by dereferencing the pointer (*) and then re-referencing it (&). :)

Because we’re passing the thread pool across a safe/unsafe boundary, it’s entirely the caller’s responsibility to uphold the compiler’s traditional guarantee that the pool instance outlives the encoder. There’s literally nothing to stop it from being released early by C code.

Calling from C

Pretty much all the external-facing functions return a result status enum type, onto which I’ve mapped the Rust Result<_, Error(_)> system. For now it’s just ok or error states; I’ll add more detailed error codes later.
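
On the Rust side that can be a plain fieldless enum with a C-compatible representation, mirrored by mtpng_result in the header. A minimal sketch (the exact declaration and numeric values are assumptions; the Ok and Err variants are the ones used by the wrapper functions above):

// Sketch: the status type returned across the FFI boundary.
#[repr(C)]
pub enum CResult {
    Ok = 0,
    Err = 1,
}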

Since C doesn’t have a convenient “?” syntax or try! macro for trapping those, I wrote a manual TRY macro for my sample’s main(). Ick!

#define TRY(ret) { \
    mtpng_result _ret = (ret); \
    if (_ret != MTPNG_RESULT_OK) { \
        fprintf(stderr, "Error: %d\n", (int)(_ret)); \
        retval = 1; \
        goto cleanup; \
    }\
}

The calls are then wrapped to check for errors:

int main() {

... some state setup ...

    mtpng_threadpool* pool = NULL;
    TRY(mtpng_threadpool_new(&pool, threads));

    mtpng_encoder* encoder = NULL;
    TRY(mtpng_encoder_new(&encoder,
                          write_func,
                          flush_func,
                          (void*)&state,
                          pool));

    TRY(mtpng_encoder_set_chunk_size(encoder, 200000));

    TRY(mtpng_encoder_set_size(encoder, 1024, 768));
    TRY(mtpng_encoder_set_color(encoder, color_type, depth));
    TRY(mtpng_encoder_set_filter(encoder, MTPNG_FILTER_ADAPTIVE));

    TRY(mtpng_encoder_write_header(encoder));
    TRY(mtpng_encoder_write_image(encoder, read_func, (void*)&state));
    TRY(mtpng_encoder_finish(&encoder));

cleanup:
    if (encoder) {
        TRY(mtpng_encoder_release(&encoder));
    }
    if (pool) {
        TRY(mtpng_threadpool_release(&pool));
    }

    printf("goodbye\n");
    return retval;
}

Ok, mostly straightforward right? And if you don’t like the TRY macro you can check error returns manually, or whatever. Just don’t forget to check them! :D

Error state recovery other than at API boundary checks may or may not be very good right now; I’ll clean that up later.

Now I’m pretty sure some things will still explode if I lean the wrong way on this system. For instance, the pool and encoder pointers above have to be explicitly initialized to NULL, or the checks in the cleanup block would read uninitialized memory. :D

Don’t forget the callbacks

Oh right, we needed read and write callbacks! Let’s put those back in. Start with a state structure we can use (it doesn’t have to be the same state struct for both read and write, but it is here because why not?)

typedef struct main_state_t {
    FILE* out;
    size_t width;
    size_t bpp;
    size_t stride;
    size_t y;
} main_state;

static size_t read_func(void* user_data,
                        uint8_t* bytes,
                        size_t len)
{
    main_state* state = (main_state*)user_data;
    for (size_t x = 0; x < state->width; x++) {
        size_t i = x * state->bpp;
        bytes[i] = (x + state->y) % 256;
        bytes[i + 1] = (2 * x + state->y) % 256;
        bytes[i + 2] = (x + 2 * state->y) % 256;
    }
    state->y++;
    return len;
}

static size_t write_func(void* user_data,
                         const uint8_t* bytes,
                         size_t len)
{
    main_state* state = (main_state*)user_data;
    return fwrite(bytes, 1, len, state->out);
}

static bool flush_func(void* user_data)
{
    main_state* state = (main_state*)user_data;
    if (fflush(state->out) == 0) {
        return true;
    } else {
        return false;
    }
}

A couple of things stand out to me: first, this is a bit verbose for some common cases.

If you’re generating input row by row like in this example, or reading it from another source of data, the read callback works ok though you have to set up some state. If you already have it in a buffer, it’s a lot of extra hoops. I’ve added a convenience function for that, which I’ll describe in more detail in a later post due to some Rust-side oddities. :)

And writing to a stdio FILE* is probably really common too. So maybe I’ll set up a convenience function for that? Don’t know yet.

Building the library

Oh right! We have to build this code don’t we, or it won’t actually work.

Start with the library itself. Since we’re creating everything but the .h file in Rust-land, we can emit a shared library directly from the Cargo build system by adding a ‘cdylib’ target. In our Cargo.toml:

[lib]
crate-type = ["rlib", "cdylib"]

The “rlib” is a regular Rust library; the “cdylib” is a C-compatible shared library that exports only the C-compatible public symbols (mostly). The rest of the Rust standard library (the parts that get used) is compiled statically inside the cdylib, so it doesn’t interfere with other libraries that might have been built in a similar way.

Note this means that while a Rust app that uses mtpng and another rlib can share a Rayon ThreadPool instance, a C app that uses mtpng and another cdylib cannot share ThreadPools because they might be different versions etc.

Be warned that shared libraries are complex beasts, starting with the file naming! On Linux and most other Unix systems, the output file starts with “lib” and ends with “.so” (libmtpng.so). But on macOS, it ends in “.dylib” (libmtpng.dylib). And on Windows, you end up with both “mtpng.dll” which is linked at runtime and an “mtpng.dll.lib” which is linked at compile time (and really should be “mtpng.lib” to follow normal conventions on Windows, I think).

I probably should wrap the C API and the cdylib output in a feature flag, so it’s not added when building pure-Rust apps. Todo!
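
On the Rust side that gating would probably just be a cfg attribute on the FFI module; a sketch of the idea (the “capi” feature name and module name are placeholders, not anything mtpng defines yet):

// Sketch: only compile the C API glue when the (hypothetical) "capi"
// feature is enabled, so pure-Rust users don't pull in the FFI layer.
#[cfg(feature = "capi")]
pub mod capi;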

For now, the Makefile or batch file are invoking Cargo directly, and building in-tree. To build cleanly out of tree there are some options on Cargo to specify the target dir and (in nightly toolchain) the work output dir. This seems to be a work in progress, so I’m not worrying about the details too much yet, but if you need to get something working soon that integrates with autotools, check out GNOME’s librsvg library which has been migrating code from C to Rust over the last couple years and has some relevant build hacks. :)

Once you have the output library file(s), you put them in the appropriate place and use the appropriate system-specific magic to link to them in your C program like for any shared library.

The usual gotchas apply:

  • Unix (Linux/macOS) doesn’t like libraries that aren’t in the standard system locations. You have to add an -L param with the path to the library to your linker flags to build against the library in-tree or otherwise not standardly located.
  • Oh and Unix also hates to load libraries at runtime for the same reason! Use LD_LIBRARY_PATH or DYLD_LIBRARY_PATH when running your executable. Sigh.
  • But sometimes macOS doesn’t need that because it stores the relative path to your library into your executable at link time. I don’t even understand that Mach-O format, it’s crazy! ;)
  • Windows will just load any library out of the current directory, so putting mtpng.dll in with your executable should work. Allll right!

It should also be possible to build as a C-friendly static library, but I haven’t fiddled with that yet.

Windows fun

As an extra fun spot on Windows when using the MSVC compiler and libc… Cargo can find the host-native build tools for you, but sometimes gets really confused when you try to cross-build 32-bit on 64-bit.

And MSVC’s setup batch files are … hard to find reliably.

In the current batch file I’ve had some trouble getting the 32-bit build working, but 64-bit is ok. Yay!

Next steps

There’s a lot of little things left to do on mtpng: adding support for more chunks so it can handle indexed color images, color profiles, comments, etc… Ensuring the API is good and error handling is correct… Improving that build system. :)

And I think I can parallelize filtering and deflate better within a chunk, making compression faster for files too small for a single block. The same technique should work in reverse for decoding too.

Well, it’s been a fun project and I really think this is going to be useful as well as a great learning experience. :) Expect a couple more update posts in this series in the next couple weeks on those tuning issues, and fun little things I learn about Rust along the way!

04/09/2018-10/09/2018

Pic

Visualisation of changes in OSM over time [1] | © HOT US Inc., © MapBox, Map data © OpenStreetMap contributors

EU upload filters and ancillary copyright

  • The proposal to reform EU copyright was originally put forward by Günther Oettinger during his time as the EU Digital Commissioner (2014-2016). The EU’s controversial reform may introduce upload filters and ancillary copyright, and it has only a very limited exception for text and data mining. A key hurdle was passed last Wednesday when a majority of the members of the EU Parliament voted in favour of the proposal, with only slight modifications. Many online communities protested against the proposal in the hope that the EU Parliament would decline it again, as it did in July 2018. The concern is that the new law will require all parties in the EU to implement upload filters, i.e. censorship infrastructure, and as such it will become a big hurdle for projects with user-generated content like OSM or Wikipedia. Wikipedia visitors in many EU countries were met with a black banner urging them to defend the internet, and some Wikipedia editions were shut down completely, showing only a black protest message. OSM in Germany, with its local chapter FOSSGIS, participated in this protest by replacing every tenth tile with a protest message on a black background. The OSMF Board did not take a stance on the matter, although there was and still is some noise on the OSMF mailing list. The new copyright law still needs parliamentary approval before it becomes law: by passing the draft, the EU Parliament opens negotiations with the European Council and the European Commission. If, as the lawmakers currently expect, this version of the law is approved by all three EU bodies, it could be ready for a final vote by the end of the year.

Mapping

  • Osmose now allows potential errors to be exported as KML files. You can import these files into apps, such as Maps.me, to assist in checking on the ground.
  • The Osmose QA tool now includes a layer with traffic signs from Mapillary where the equivalent information is missing from OSM. At the moment, the layer is available in France, Brussels and parts of Germany.
  • The question as to whether waterway=riverbank is an ‘old scheme’ and should be replaced by natural=water + water=river was discussed on the tagging mailing list.
  • On the tagging mailing list, terraced buildings – rows of houses with shared walls – were discussed.
  • Andrew Wiseman from Apple’s mapping team started a project to help improve the road networks of Senegal, Côte d’Ivoire, Tunisia and Uganda, and he asks the local mappers on the respective countries’ mailing lists where to direct his team’s attention.
  • Telenav created MapRoulette tasks for highways in OSM lacking a speed limit, but where data are available from image recognition processing of OpenStreetCam photos. So far the new MapRoulette challenge is limited to Detroit.
  • A reddit user asks whether tools exist that semi-automatically trace buildings from images. The answers included pointers to Microsoft’s building footprints, RoboSat from Mapbox and the JOSM plugins building tools, Lakewalker, Scanaerial, Tracer2 and mapathoner. However, you should make yourself familiar with the Automated Edits code of conduct.
  • Harry Wood re-iterates Christoph Hormann’s call to double-check multipolygons. As reported, osm2pgsql is repairing fewer errors than it did before it was updated; now errors are simply rejected, leading to gaps in the map.

Community

  • The OsmAnd team launched a short survey, asking if you know about their in-built Travel Feature, which uses data from Wikimedia’s Wikivoyage project.
  • OSM is becoming a multi-generational project. Florian Lohoff’s son, born when he joined OSM in 2008, has now also started contributing to OSM.
  • A bicycle enthusiast from France created tutorial videos (automatic translation) explaining the many features of OsmAnd in French. English documentation can be found on the official website.
  • We reported earlier about a quality analysis tool written by Pierre Béland and used to analyse 12 African cities. The tool has now been used to analyse 25 Tasking Manager tasks from August. Maps for each contributor allow rapid visualisation to detect those buildings that need to be revised as a matter of priority. Once again there is variability in the quality of building geometries and the number of topological errors. This analysis suggests we need better monitoring tools to follow up on the work done at mapathons by inexperienced mappers.
  • Tigerfell wants to rewrite the Relation-Template for the wiki. He explains his motivation in a blog post and is inviting you to participate.

OpenStreetMap Foundation

  • A new server, named pyrene, was added to our tile rendering cluster. It is located in the OpenSystems Lab, based at Oregon State University, Corvallis (USA), and should reduce tile latency for North American users.

Events

  • The FOSS4G Belgium Conference will be held on October 25, 2018. The organiser, OSGeo.be, just announced the continuation of the annual conferences that started in 2015 and is calling for papers, maps and sponsors.
  • The FOSS4G SotM Oceania, a joint conference of OSM and FOSS4G communities of Australia, New Zealand, and the Pacific Islands, will take place in Melbourne on November 20-23, 2018. The organisers are looking for open geospatial projects for the community day on the last day of the conference.

Humanitarian OSM

  • [1] HOT announced the launch of Visualize Change, a tool that can create an embedded and downloadable visualisation of the changes in OSM over time. The tool was shown to the public at the recent FOSS4G.
  • CartONG, a French NGO, is looking for (fr) (automatic translation) someone willing to apply for a civic service (fr) (automatic translation) position. Missions include planning and hosting mapathons, taking part in the operations at CartONG and raising the awareness of new audiences.
  • Janet Chapman is organising a Global Mapathon from 28th to 30th of September 2018 to help end female genital mutilation (FGM).
  • Luc Kpogbe from OSM Benin reports that they are mapping their northern border region and the city of Tanguieta. They are training young mappers and are able to provide them with smartphones thanks to a micro-grant from HOT.

Education

  • Marena Brinkhurst and Jinal Foflia, from Mapbox, together with Yves Barthelemy, of the Zanzibar Mapping Initiative, and Nuala Cowan, of Open Cities Africa, hosted a two-day hackathon in Zanzibar following the FOSS4G event in Dar es Salaam. This resulted in:
    • tools for comparing imagery (2 islands of Zanzibar were flown over with drones to create fresh high resolution imagery)
    • interacting with crowd-sourced data
    • visualisations of school enrolment, urban development, and flood prone areas
    • an interactive tour of cultural heritage sites in Zanzibar

Maps

  • It looks like many users originating from Western countries have a problem with the “main map” on openstreetmap.org as they are only familiar with the Latin script (we reported earlier). Sven Geggus explained in detail how he copes with localisation in the German map style on openstreetmap.de. Specifically, he makes notes about proper tagging, what one should care about and which problems might arise. In addition, he presents a feasibility study in case OSM.org should one day be converted to vector tiles.

switch2OSM

Open Data

  • WeeklyOSM often receives links to papers about OSM in science journals that are unfortunately closed source and – except for the abstract – behind a paywall. The EU and some national research funders have announced a plan to make research papers free to read. The new policy will be implemented from 1 January 2020.
  • The saga of non-open addresses in the UK continues. Owen Boswarva summarises the current position as the national Geospatial Commission gets up to speed.

Software

  • CartONG, a French NGO, prepared a comparison of three commercial UAV post-processing applications: Pix4D, ESRI’s Drone2Map and AgiSoft.

Programming

  • Andrew just announced the release of AgentMaps, a Javascript library for building and visualising dynamic social systems on maps.
  • Simon Poole has added a function to OSM that provides a file with a list of deleted user account UIDs. Most of those user accounts have been deleted as they were used exclusively for spamming. Pascal Neis’ tool How did you contribute is already making use of the list.
  • Sabrina Marx, from the Heidelberg Institute for Geoinformation Technology (HeiGIT), published a step by step tutorial in a Jupyter Notebook on analysing HOT Tasking Manager projects using the ohsome API.
  • OSM user Glassmann documented his approach to generating a vector tile overlay for use in iD in his OSM diary.

Releases

  • Wambacher’s OSM related software list has been updated. The most recent version changes were Komoot Android, Mapbox Nav SDK, Mapillary Android, Naviki Android, QMapShack and Vespucci.

Did you know …

  • … JungleBus, an Android mobile app that makes it easy for beginners to collect bus stops all around the world?
  • … OpenBeerMap, which allows the user to specify their favourite beer brand and then displays pubs offering it on a map? The question of whether the beer brands offered in pubs merit being added to OSM is more contentious though.
  • … the Christmas map, where you can already start adding events like Christmas markets?

Other “geo” things

  • Randy Meech (ex CEO of Mapzen) outlines, in a long twitter thread, the philosophy behind his new business, Street Cred labs, which is developing new approaches to capturing POI data. Of particular interest are his remarks about how the data may be licensed. TechCrunch has written a short article about the company.
  • From time to time people use OSM to add fantasy locations. The medieval fantasy city generator might help people who are looking for a nice looking fantasy map.
  • The Open Geospatial Consortium (OGC) is looking for comments on the draft Geospatial Artificial Intelligence (GeoAI) Domain Working Group (DWG) Charter, that is intended to provide a forum for discussion and aims to ensure interoperability of OGC standards in AI applications.
  • The Guardian reports that sheet 440 is the worst selling map in the OS Explorer series (1:25k), covering an area of over 800 square kilometres in the far north of Scotland with a population of less than 200, and ‘a few dozen’ buildings. The best selling map, of Snowdonia, is 180 times more popular. OS surveyor Dave Robertson also explains how new roads are mapped – not much different to how we do it in OSM.

Upcoming Events

Where What When Country
Posadas Mapatón de parajes y caminos 2018-09-15 argentina
Nantes Participation aux Journées européennes du patrimoine à l’École de Longchamp 2018-09-15-2018-09-16 france
Rennes Découverte d’OpenStreetMap aux Journées Européennes du Patrimoine 2018-09-15-2018-09-16 france
Berlin Berliner Hackweekend 2018-09-15-2018-09-16 germany
Grenoble Rencontre mensuelle 2018-09-17 france
Toronto Mappy Hour 2018-09-17 canada
Cologne Bonn Airport Bonner Stammtisch 2018-09-18 germany
Nottingham Pub Meetup 2018-09-18 united kingdom
Lonsee Stammtisch Ulmer Alb 2018-09-18 germany
Viersen OSM Stammtisch Viersen 2018-09-18 germany
Karlsruhe Stammtisch 2018-09-19 germany
Mumble Creek OpenStreetMap Foundation public board meeting 2018-09-20 everywhere
Leoben Stammtisch Obersteiermark 2018-09-20 austria
Kyoto 幕末京都オープンデータソン#06:壬生の浪士と新撰組 2018-09-22 japan
Tokyo 首都圏マッピングパーティー 戸山公園で山登り!? 2018-09-22 japan
La Mandragore Réunion franco-allemande à Strasbourg 2018-09-22 france
Miyazaki 宮崎マッピングパーティ 2018-09-23 japan
Graz Stammtisch Graz 2018-09-24 austria
Bremen Bremer Mappertreffen 2018-09-24 germany
Lüneburg Lüneburger Mappertreffen 2018-09-25 germany
San Juan Maptime! Manila Meet-up 2018-09-27 philippines
Biella Incontro mapper di Biellese 2018-09-29 italy
Buenos Aires State of the Map Latam 2018 2018-09-24-2018-09-25 argentina
Detroit State of the Map US 2018 2018-10-05-2018-10-07 united states
Bengaluru State of the Map Asia 2018 2018-11-17-2018-11-18 india
Melbourne FOSS4G SotM Oceania 2018 2018-11-20-2018-11-23 australia
Lübeck Lübecker Mappertreffen 2018-09-27 germany
Düsseldorf Stammtisch 2018-09-28 germany
Stuttgart Stuttgarter Stammtisch 2018-10-03 germany
Bochum Mappertreffen 2018-10-04 germany

Note: If you would like to see your event here, please put it into the calendar. Only data which is there will appear in weeklyOSM. Please check your event in our public calendar preview and correct it where appropriate.

This weeklyOSM was produced by Nakaner, PierZen, Polyglot, Rogehm, SK53, Spanholz, Guillaume Rischard, SunCobalt, TheSwavu, derFred, geologist, jinalfoflia, sev_osm.

September 14, 2018

Dr. Denneal Jamison-McClung is Interim Director of the Biotech Program at UC Davis, as well as Director of BioTech SYSTEM and DEB Program Coordinator. She also serves on the Our Voices blog editorial board and as an ex officio member of the UC Davis ADVANCE management team. She has utilized Wiki Education’s tools and support in her Biotech courses. This is a republishing of her post about how women and allies in STEM can make Wikipedia more representative of all people.

In our cyber-connected modern world, many people search the internet via smartphones as a first step when learning about new topics.  Wikipedia, in particular, is a “go to” source of information and is the fifth most popular website in the world, racking up millions of views per day. Given Wikipedia’s ubiquity as an information source, the quality and scope of information provided on the platform has the power to shape our collective understanding of the world around us.

Wikipedia’s volunteer editors tackle the behemoth task of curating the information added to the site by the public, including decisions about which contributed pages or paragraphs are “notable” and should be maintained and which pages should be deleted.  In recent years, the Wikimedia Foundation (WMF) has been asking vital research questions about the influence of editor diversity (the vast majority [~85%] of Wikipedia editors are male) on the breadth of content available on the site.

Several studies have found that Wikipedia content skews towards topics of traditional/stereotypical interest to males (e.g. sports teams, video games, military history), likely reflecting the interests of the pool of editors. In response, WMF has launched campaigns to recruit and train a more diverse pool of “Wikipedians” and have made an effort to ensure editorial guidelines and policies are gender neutral. There has been steady improvement, but there is still a long way to go…

Klein and Konieczny (2018) recently published an analysis of the ratio of biographies of women and non-binary-gendered notable individuals to the total number of Wikipedia biographies (called the Wikidata Human Gender Indicator [WHGI]) across different cultures. Around the world, biographies of women are underrepresented on Wikipedia though the balance is shifting toward parity (expected in February 2034 given their analysis of ~2014 WHGI data…sixteen more years to go?!).

Figure 3. WHGI-country world map visualization. View interactive version at http://whgi.wmflabs.org/data.html
Published in: Piotr Konieczny; Maximilian Klein; New Media & Society  Ahead of Print
DOI: 10.1177/1461444818779080
Copyright © 2018 SAGE Publications

The lack of diversity in Wikipedia biographies of notable individuals extends to women in STEM.  Ongoing lack of representation in the largest, most frequently accessed body of knowledge in the world contributes to the silencing of our voices.

“According to Wikipedia, being a notable person is less about life expectancy, somewhat more about education and economic status, and even more about positions of power.”  – Klein and Konieczny, 2018

To help change cultural perceptions of who can contribute to STEM and to inspire the next generation of young scientists and engineers, it is essential that open access platforms, especially Wikipedia, offer a realistic perspective on the diversity of people already working to tackle big global challenges and historical contributions by underrepresented groups. Let’s speed up the process… sixteen more years to achieve gender parity on notable biographies is too long!

What can women and allies in STEM do to accelerate the improvement of Wikipedia as an inclusive, fact-based resource for the global community?

  • Brainstorm a list of the notable academics in your own research community and check to see if a Wikipedia page exists for those individuals, as well as whether they meet the platform guidelines for notability.
  • Using your list of potentially notable biographies:
    • Incorporate a “wiki writing” assignment into existing courses or work with humanities colleagues to develop new, interdisciplinary courses that highlight the diversity of STEM scientists.
    • Organize an “editathon” event that brings together students and colleagues to add or improve Wikipedia pages on notable STEM professionals from underrepresented backgrounds.
    • Start a Wikipedia page for a person in your discipline whom you know of, but have never met or have had only limited interactions with (this limits potential conflict of interest).
  • Contribute related open access content to Wikimedia Commons (e.g. articles, photos, illustrations, drawings, videos) in your topic area of expertise.

For inspiration, keep an eye on these collective and heroic individual efforts to improve Wikipedia’s content:

In addition to educating the public and inspiring young scholars, highlighting the work of diverse STEM professionals currently active in teaching and research on Wikipedia may help to improve the diversity of:

  • invited speakers at professional meetings (avoid the dreaded “manel”!… in addition to Wikipedia, check out the 500 Women Scientists database)
  • prestigious awardees (Nobel prizes, national academies)
  • participants in high level government advisory bodies and review panels
  • recruitment of STEM women to leadership roles in academia (chairs, directors, deans, chancellors, presidents) and industry (C-suite, VCs, board chairs)



This is a republishing with permission from the author. See the original post here.

Interested in teaching with Wikipedia? Visit teach.wikiedu.org or reach out to contact@wikiedu.org with questions.


Header image: File:UC Davis campus buildings and scenes (16188061937).jpg, UC Davis Arboretum and Public Garden, CC BY 2.0, via Wikimedia Commons.

The guard rails I'll be following are those of the original blog post created by Darian Patrick in November 2016. I'll do my best to fill in what gaps I can.

What Happened?
The attackers targeted a small group of privileged and high-profile users. The attackers were most likely using passwords that had been published as part of dumps from other compromised websites, such as LinkedIn. Compromised users confirmed this, reporting that they had in fact been recycling passwords across multiple sites that appear in known password dumps. There was no evidence of system compromise.

What information was involved?
There is no evidence of any personal information being disclosed beyond usernames and passwords.

What was done about it?
  • Improved alerting and reporting to identify dictionary and brute force attacks
  • Extended password policy to mitigate attacks

John Bennett
Director of Security, Wikimedia Foundation

September 13, 2018

Each year, the Wikimedia Foundation surveys the volunteer communities who edit Wikimedia sites for their input on a variety of topics that, in turn, help Foundation staff make decisions about how to support these communities. In April 2018, over 4,000 Wikimedia community members answered up to 50 questions about their experiences working on the Wikimedia projects. We heard from editors on Wikipedia and other websites, community organizers who coordinate programs or manage organizations, and volunteer software developers. The Community Engagement Insights 2018 report is now published, and here are a few highlights.

Diversity among contributors to the Wikimedia projects remains the same as last year.

In examining demographic changes from last year, we found few differences in gender and age among contributors. Women continue to represent between 5.2% and 13.6% of contributors across the Wikimedia projects. The age of contributors seemed to increase slightly, but the average contributor continues to be in the 35–44 age range. Contributors with less activity are younger, closer to the 25–34 age range. We found a significant reduction in the regional distribution of Wikimedia editors, which could have been a result of changes made to the sampling and needs to be investigated further.

Self-awareness about how one’s behaviors or actions affect others stands out as needing improvement among communities.

We measured various aspects of community health, diversity, and inclusion among Wikimedia communities. Contributors were asked whether their peers are aware of how their behavior or actions affect others. On a scale of 5, the mean score was about 3.05, while other measures of community health were quite a bit higher (from 3.5 to 4.1). This suggests that peer self-awareness could be an area to focus on to improve the health of the community.

Editors and community organizers value diversity differently.

There is room for improvement in how communities value diversity of content and people. Respondents were able to select up to 4 statements about whether their community valued diversity in different ways. Contributors selected an average of just 1.5 out of 4, while community organizers selected about 2.5. This shows that organizers perceive that their communities place more value on diversity.

Harassment is still an ongoing issue on the Wikimedia projects.

Harassment doesn’t seem to have gotten worse, based on a few questions we asked. For example, in one question we asked contributors how often they are bullied or harassed on various Wikimedia projects. Although 71% of 280 respondents reported having been bullied or harassed on Wikipedia, we did not find statistically significant changes from the 2017 data. There were small changes on smaller Wikimedia projects.

Survey results lead to better understanding of Wikimedia communities that improves our work.

This survey is the work of 11 teams across the Foundation who wanted to learn how their work affected the communities that we serve, and the data will now help more teams at the Foundation make future decisions about their work:

  • The Community Resources team learned that the most important thing that participants gain from Wikimania is discovering new ideas or projects. For Wikimedia Conference, participants gain time to resolve issues or conflicts. For regional or local events, participants reported gaining new skills.
  • The Legal Department plans to increase awareness of the Transparency Report, because the survey showed that contributors are often not informed about the report.
  • For the Community Programs team, which supports Structured Data on Commons, the survey revealed that Wikimedia Commons users would like more support for multilingual descriptions of media files and would like to be able to easily discover new or unexpected media files.
  • Our Learning & Evaluation team will be using key results in annual planning discussions that will shape their work to build capacity among communities.
  • And the Trust and Safety team will be working to increase awareness of the emergency@wikimedia.org email address so that users know where to go when they see threats of violence.

 
Other Wikimedia Foundation teams are continuing to learn from their reports and deciding what to do next with the data; each team will have a list of actionable next steps in its report.

Read the reports and leave your comments! Your ideas and feedback are welcome. We will be hosting a livestream presentation on Thursday, 20 September 2018, at 9:00 am Pacific / 16:00 UTC.

Edward Galvez, Evaluation Strategist (Surveys), Learning and Evaluation
Wikimedia Foundation

Amy Dye-Reeves
Image: File:Adreeves01.jpg, Adreeves01, CC BY-SA 4.0, via Wikimedia Commons.

Amy Dye-Reeves is an Assistant Professor and Research and Instruction Librarian at Murray State University. Amy is the College of Arts & Humanities Liaison to the Departments of History, Psychology, Political Science, and Sociology. During the Fall of 2018, Amy is teaching a Research in the Information Age course and providing a multitude of information literacy instruction sessions. For future development, Amy will use the Wiki Education platform to explore digital information and the accurate dissemination of information through the ACRL framework.

Before I joined Wiki Education’s Wikipedia Fellows program, my base knowledge of the wide variety of Wiki Tools was little to none. I would bring up Wikipedia in the classroom as a way of discussing the varying accuracy of its bibliographic references. But my own experience “behind the scenes” on the platform was limited.

In the winter/spring of 2018, I began my career as a new academic subject specialist serving the departments of history, psychology, sociology, and political science. I presently teach over 30 instructional sessions in a given semester, covering a variety of information literacy skills including search strategies, resource gathering, and evaluating resources. In student conversations, Wikipedia came up quite frequently as a resource-gathering tool, but it became clear that students did not have the skills to evaluate its articles for accuracy before trusting them. With a majority of my students utilizing Wikipedia, I wanted to find a program that would strengthen my base knowledge of the virtual tools and help my students utilize this resource correctly. With this in mind, I began researching professional development opportunities within Wikipedia, and by chance found the Wikipedia Fellows Program link while exploring Wiki Education’s homepage. The program has proved to be a great fit for my professional development goals.

My main objective in participating in this program was to learn the overall landscape of Wikipedia itself, with the hope of expanding and incorporating this resource within the university classroom. At first, the program seemed overwhelming, with an abundance of new tools in a virtual landscape unfamiliar to me. The first major hurdle was figuring out how to use Wikipedia’s editing features. The visual editor is straightforward, as it allows users to make edits to an article by changing the text as it appears, but it’s limiting. The source editor, on the other hand, has a bit of a learning curve, but allows a user to make more complex contributions by editing the article’s underlying wiki markup. These initial hurdles would often lead to feelings of anxiety, but each fellowship session provided videos and tutorials that let me practice with the major Wiki Tools without the fear of breaking the product itself. Every week, I felt more adventurous in trying new aspects of Wikipedia. I am currently learning how to use and incorporate Wiki Datasets into a variety of library instructional sessions.

For future development, I plan on utilizing Wikipedia within my upper-division Information minor courses. At present, I am teaching an Information in the Research Age survey course, which covers the basic skills of information gathering and dissemination. It would be a great fit for engagement with Wikipedia editing, a practice that involves honing one’s digital literacy and communication skills. For an upcoming upper-division course, I plan on pairing the ACRL Framework for Information Literacy for Higher Education with Wikipedia. Wiki Education’s Dashboard will be an important tool for this, as it provides students with the training needed to produce information on a worldwide scale and gives instructors the means to track those contributions. In this future Wikipedia assignment, students will instantly notice that the information they have created can be read by a large-scale audience, rather than just a professor reading a term paper at the end of the course. The readership more than doubles (it may even grow exponentially, depending on the articles they choose to work on!), and students will feel the intellectual impact of seeing their work on a large digital platform. I am excited to incorporate this new platform into my teaching and to see students’ reactions to it next year.


Learn more about Amy’s Wikipedia Fellows cohort here. And check out our latest professional development opportunity with the National Archives.


Header image: File:Pogue library.JPG, public domain, via Wikimedia Commons.

A galloping overview

Let’s first get a bird’s-eye view of the parts of the search process: text comes in and gets processed and stored in a database (called an index); a user submits a query; documents that match the query are retrieved from the index, ranked based on how well they match the query, and are then presented to the user. That sounds easy enough, but each step hides a wealth of detail. Today we’ll focus on another part of the step where “text gets processed”—and look at normalization.[1]
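To make the moving parts concrete, here is a toy sketch in Rust (nothing like the real MediaWiki/Elasticsearch stack, just an illustration): process text into tokens, store which documents each token appears in, then look a query term up in that index.

use std::collections::HashMap;

// Toy "text gets processed" step: split on whitespace and lowercase.
fn tokenize(text: &str) -> Vec<String> {
    text.split_whitespace().map(|w| w.to_lowercase()).collect()
}

fn main() {
    let docs = ["Wikipedia is a free encyclopedia", "A free lunch"];

    // Indexing: remember which documents each token appears in.
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (doc_id, doc) in docs.iter().enumerate() {
        for token in tokenize(doc) {
            index.entry(token).or_default().push(doc_id);
        }
    }

    // Querying: retrieve the matching documents for a (normalized) term.
    println!("{:?}", index.get("free")); // Some([0, 1])
}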

Also keep in mind that humans and computers have very different strengths, and what is easy for one can be incredibly hard for the other.

A foolish consistency

The simplest kind of normalization that readers of Latin, Greek, Cyrillic, Armenian and many other scripts often don’t even notice is case—that is, uppercase vs. lowercase. For general text, we want Wikipedia, wikipedia, WIKIpedia, wikiPEDIA, and WiKiPeDiA to all be treated the same. The usual method is to convert everything to lowercase.
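As a quick illustration (not our actual analysis chain, which lives inside the search engine), here is what default Unicode lowercasing does in Rust:

fn main() {
    let variants = ["Wikipedia", "wikipedia", "WIKIpedia", "wikiPEDIA", "WiKiPeDiA"];

    // Default Unicode lowercasing collapses all five spellings into one form.
    let folded: Vec<String> = variants.iter().map(|v| v.to_lowercase()).collect();
    assert!(folded.iter().all(|f| f == "wikipedia"));
}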

There are exceptions—called capitonyms—where the capitalized form means something different. In English, we have March/march, May/may, August/august—so many months!—Polish/polish, Hamlet/hamlet, and others. In German, where nouns are always capitalized, there are also words that differ only by capitalization, such as Laut (“sound”) and laut (“loud”). Their conflation through lowercasing is often something we just have to live with.

As with everything else when dealing with the diversity of the world’s languages, there isn’t just one “right” way to do things. A speaker of English, for example, will tell you that the lowercase version of I is i, while a Turkish speaker would have to disagree, because Turkish has a dotted uppercase İ and dotless lowercase ı, and the corresponding pairs are İ/i and I/ı. As a result, we lowercase I differently on Turkish wikis than we do on other wikis in other languages. Cyrillic Г (“ge”) has different lowercase forms in Russian, Bulgarian, and Serbian, and then different italic lowercase forms in those languages as well.
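Rust’s str::to_lowercase, like most default case-folding routines, applies the language-neutral Unicode mapping (I → i), so a Turkish-aware fold has to special-case the two extra letters. A minimal sketch of the idea, not what our analyzers actually do:

// Hypothetical Turkish-aware lowercasing: İ pairs with i, and I pairs with ı.
fn turkish_lowercase(s: &str) -> String {
    s.chars()
        .flat_map(|c| match c {
            'İ' => vec!['i'],                // dotted pair: İ / i
            'I' => vec!['ı'],                // dotless pair: I / ı
            _ => c.to_lowercase().collect(), // default Unicode mapping for everything else
        })
        .collect()
}

fn main() {
    // Default folding gives the wrong letter for Turkish:
    assert_eq!("IRMAK".to_lowercase(), "irmak");
    // The locale-aware version keeps the dotless pairing:
    assert_eq!(turkish_lowercase("IRMAK"), "ırmak");
}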

Other complications include German ß, which, depending on who and when you ask, might or might not have an uppercase form. It can be capitalized as SS,[2] or with the capital ẞ, which is not well supported by typical fonts and was only accepted by the Council for German Orthography in 2017.

There are also uppercase vs. lowercase complications with digraphs used in some languages, like Dutch ij or Serbian dž, lj, or nj—which are treated as single letters in the alphabet. The Serbian letters have three case variants: ALL CAPS DŽ / LJ / NJ, Title Case Dž / Lj / Nj, and lowercase dž / lj / nj. The Dutch letter, in contrast, only comes in two variants, UPPERCASE IJ and lowercase ij. Though they are usually typed as two letters, there are distinct Unicode characters for all three Serbian variants: DŽ, Dž, and dž, but only IJ and ij for Dutch.

The calculus of variations

Another common form of normalization is to replace “variant” forms of a character with the more “typical” version. For example, we might replace the aforementioned single Serbian character dž with two separate characters: d and ž; or Dutch ij with i and j. Unicode also has precomposed single-character roman numerals—like ⅲ and Ⅷ—and breaking them up into iii and VIII makes them much easier to search for.

So-called “stylistic ligatures” are also relatively common. For example, in some typefaces, the letter f tends to not sit well with a following letter, particularly i and l; either there’s too much space between the letters, or the top of the f is too close to the following letter. To solve this problem, ligatures—a single character made by combining multiple characters—are used. The most common in English are ff, fl, fi, ffi, and ffl. Open almost any book in English (in a serif font and published by a large publisher) and you’ll find these ligatures on almost every page. The most obvious is often fi, which is usually missing the dot on the i (in a serif font—not necessary in a sans-serif font).[3] Some stylistic ligatures, like the st ligature (st) are more about looking fancy.
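Unicode calls these “compatibility” characters, and its NFKC normalization form unfolds them, covering both the stylistic ligatures here and the precomposed roman numerals above. A quick sketch using the third-party unicode-normalization crate (not necessarily what our search stack does internally):

// Assumes unicode-normalization = "0.1" in Cargo.toml.
use unicode_normalization::UnicodeNormalization;

fn main() {
    // Compatibility normalization (NFKC) unfolds ligatures and roman numerals.
    assert_eq!("ﬁle".nfkc().collect::<String>(), "file"); // fi ligature
    assert_eq!("Ⅷ".nfkc().collect::<String>(), "VIII");   // precomposed roman numeral
    assert_eq!("ⅲ".nfkc().collect::<String>(), "iii");
}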

Other ligatures, like æ, œ, and ß (see footnote 2) can—depending on the language—be full-fledged independent letters, or just stylistic/posh/pretentious ways of writing the two letters. Separating them can make matching words like encyclopaedia and encyclopædia easier.
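NFKC deliberately leaves æ, œ, and ß alone, since they are real letters in some languages, so folding them is a per-language decision made with an explicit character mapping. A minimal sketch of that kind of table:

// Hypothetical per-language folding: only apply on wikis where these are
// treated as spelling variants rather than distinct letters.
fn fold_letter_ligatures(s: &str) -> String {
    s.chars()
        .flat_map(|c| match c {
            'æ' => vec!['a', 'e'],
            'Æ' => vec!['A', 'E'],
            'œ' => vec!['o', 'e'],
            'Œ' => vec!['O', 'E'],
            'ß' => vec!['s', 's'],
            _ => vec![c],
        })
        .collect()
}

fn main() {
    assert_eq!(fold_letter_ligatures("encyclopædia"), "encyclopaedia");
}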

Non-Latin variants abound! Greek sigma (Σ/σ) has a word-final variant, ς, which is probably best indexed as “regular” σ. Many Arabic letters have multiple forms—as many as four: for initial, medial, final, and stand-alone variants. For example, bāʾ  has four forms: ب, ـب, ـبـ, بـ.

For other letters, their status as “variants” is language dependent. In English, we often don’t care much about diacritics. The names Zoë and Zoe are—with apologies to people with those names—more or less equivalent, and you have to remember who uses the diaeresis[4] and who doesn’t. Similarly, while résumé or resumé always refer to your CV, resume often does, too. In Russian, the Cyrillic letter Ё/ё is treated as essentially the same as Е/е, and many people don’t bother to type the diaeresis—except in dictionaries and encyclopedias. So, of course, we have to merge them on Russian-language wikis. In other languages, such as Belarusian and Rusyn, the letters are treated as distinct. And, whether you want to keep the diacritics or not, you may still need to normalize them. For example, you can use a single Unicode character,[5] é, or a regular e combined with a “combining diacritic”, é, which is two characters, not one.[6] Similarly, in Cyrillic, ё and й are one character each, while ё and й are two characters each.
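The fix for the combining-character problem is to pick one canonical form, usually NFC, and apply it to everything before indexing. A small sketch, again with the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let precomposed = "\u{00e9}";  // é as a single character
    let combining = "e\u{0301}";   // e plus a combining acute accent: two characters

    // They render identically but compare as different strings...
    assert_ne!(precomposed, combining);

    // ...until both are normalized to the same canonical (composed) form.
    let a: String = precomposed.nfc().collect();
    let b: String = combining.nfc().collect();
    assert_eq!(a, b);
}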

Some characters are difficult to tell apart, while others are hard to identify—and even if you could identify them, could you type them? Normalizing them to their unaccented counterpart makes a lot of sense in many cases. My general rule of thumb is that if a letter is a separate letter in a language’s alphabet, then it needs to be a separate letter when you normalize that language for search. Russian is an exception, but it’s a good first place to start.

Below is a collection of letters related to A/a, presented in image form just in case you don’t have the fonts to support them all. All of them except one (in grey) are separate Unicode characters.[7] For some reason, there is a “latin small letter a with right half ring”[8] but no version with a capital A. Good thing we don’t need to convert it to uppercase for searching!

A collection of letters related to A/a.

Now for our last character-level variation to consider: in some applications, it makes sense to normalize characters across writing systems. For example, numbers probably do represent the same thing across writing systems, and where multiple versions are common, it makes sense to normalize them to one common form. Thus, on Arabic-language wikis, Eastern Arabic numerals are normalized to Western Arabic numerals so that searching for ١٩٨٤ will find 1984, and vice versa. For multi-script languages, where the conversion is possible to do at search time, it makes sense to normalize to one writing system for searching. On Serbian-language wikis, Cyrillic is normalized to Latin; on Chinese-language wikis, Traditional characters are normalized to Simplified characters.[9]
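The numeral case boils down to a character-by-character substitution; here is a rough sketch of the Eastern-to-Western Arabic digit mapping (the real version runs inside the search engine’s character filters):

// Map Eastern Arabic digits (U+0660–U+0669) to Western Arabic 0–9.
fn normalize_digits(s: &str) -> String {
    s.chars()
        .map(|c| match c {
            '\u{0660}'..='\u{0669}' => {
                // The offset from ٠ is the digit's value; from_digit maps it back to '0'–'9'.
                std::char::from_digit(c as u32 - 0x0660, 10).unwrap()
            }
            _ => c,
        })
        .collect()
}

fn main() {
    assert_eq!(normalize_digits("١٩٨٤"), "1984");
}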

Further reading / Homework

You can read more about the complexities of supporting Traditional and Simplified Chinese characters and Cyrillic and Latin versions of Serbian (and other multi-script languages) in one of my earlier blog posts, “Confound it!” Wikipedia has lots more information on the surprisingly complex topic of letter case, and examples of ligatures in a wide variety of languages.

If you can’t wait for next time, I put together a poorly edited and mediocrely presented video in January of 2018, available on Wikimedia Commons, that covers the Bare-Bones Basics of Full-Text Search. It starts with no prerequisites, and covers tokenization and stemming, inverted indexes, basic boolean and proximity retrieval operations, TF/IDF and the vector space model of similarity, field-level indexing, using multiple indexes, and then touches on some of the elements of scoring.

Up next

In my next blog post, we will almost certainly actually look at stemming—which involves reducing a word to its base form, or a reasonable facsimile thereof—as well as stop words, and thesauri.

Trey Jones, Senior Software Engineer, Search Platform
Wikimedia Foundation

———

Footnotes

1. Last time I said we’d talk about stemming and other normalization, but character-level normalization kind of took over this post, so we’ll put off stemming and related topics until next time.

2. Why does a symbol that stands for an “s” sound look like a B (or Greek β)? Well, you see, long ago there was a written letter form in common use called long s—which looks like this (in whatever font your computer is willing and able to show it to you): ſ. It was historically written or printed in a way that looks like an integral sign or an esh: ʃ, or in a chopped-off version that looks like an f without the crossbar (or, maddeningly, with only the left half of the crossbar). If you take the long s and the regular s—ſs—and squish them together, and make the top of the long s reach over to the top of the regular s, you get an ß—whose name in German, Eszett, reflects that even earlier it was a long s and a tailed z: ſʒ.

A fun side note: optical character recognition (OCR) software generally isn’t trained on either form of long s, and often interprets it as an “f”. As a result, The Google Books Ngram Viewer will (incorrectly) tell you that fleek was wildly popular in the very early 1800s. In reality, it’s usually “sleek” written with a long s (for example: full height or chopped), or some other unusual character, like a ligature of long s and t in Scots/Scottish English “steek”.

Typography is fun!

An ſʒ ligature and ß in six different fonts, which reveal or elide the “ſs” origin of ß to varying degrees.

3. A small miscellany: Turkish, which distinguishes dotted i and dotless ı, doesn’t use the fi ligature, since it often removes the dot. The spacing between letters is called kerning, and once you start paying attention to it, you can find poor kerning everywhere. Finally, another character that most people don’t use in their everyday writing—but which shows up a lot in printed books—is the em dash (i.e., —); I personally love it—obviously.

4. The technical name for the double-dot diacritic ( ¨ ) is “diaeresis” or “trema”, though many English speakers call it an “umlaut”, because one of its common uses is to mark umlaut in German—which is a kind of sound change. In English, the diaeresis is usually used to mark that two adjacent vowels are separate syllables—e.g., Chloë and Zoë rhyme with Joey, not Joe, and naïve is not pronounced like knave. For extra pretentiousness points, you can use it in words like coöperate and reënter, too—or you can just use a hyphen: co-operate, re-enter—though doing so may mess with your tokenization!

5. The term “Unicode character” can be more than a little ambiguous. It can refer to a code point, which is the numerical representation of the Unicode entity, which is what the computer deals with. It can refer to a grapheme which is an atomic unit of a writing system, which is usually what humans are thinking about. You can also talk about a glyph, which is the specific shape of a grapheme—for example in a particular font. There are invisible characters used for formatting that have a code point but no glyph, surrogate code points that can pair up in many ways to represent a Chinese character as a single glyph, special “non-characters,” and lots of other weird corner cases. It’s complicated, so people often just use “character” and sort out the details as needed.

6. This kind of normalization is relatively common, and there’s a reasonable chance that between me writing this and it getting published on the Wikimedia blog, some software along the way will convert my two-character version to the one-character é. Not all letter+diacritic combinations have precomposed equivalents, though.

7. You may have noticed that in the last row of lowercase a’s, the third from the right has a different shape. Different fonts can and do use either of the letter shapes as their base form, but in more traditional typography, the a with the hook on top is the “regular” or “roman” form, and the rounder one is the “italic” form. Lowercase g can have a similar difference in form, and there is also a Unicode character for the specific “single-story” g—“latin small letter script g” (see footnote 6). Some Cyrillic letters also have very different italic forms—see the grey highlighted examples below.

For the typography nerds, traditional italic versions of a font have distinct forms. When they are just slanted versions of the roman forms, they are technically “oblique”, rather than italic. Below is the same pangram in the font Garamond, set in roman, italic, and oblique type.

8. Unicode descriptions of characters are always in ALL CAPITAL LETTERS / small caps. Why? Because we like it that way! Seriously, though, I don’t really know why. Hmmm.

9. Chinese is a case where real life gets a bit messier than our idealized abstraction. Sometimes you have to do some normalization before tokenization. Because Chinese doesn’t use spaces, tokenization is much more difficult than it is for English and other European languages. The software we use that does tokenization only works on Simplified characters, so we have to normalize Traditional characters to Simplified before tokenization.

Google Code-in will take place again soon (from October 23 to December 13). GCI is an annual contest for 13–17 year old students to start contributing to free and open projects. It is not only about coding: we also need tasks about design, documentation, outreach/research, and quality assurance. And you can mentor them!

Last year, 300 students worked on 760 Wikimedia tasks, supported by 51 mentors from our community.

  • Your gadget code uses some deprecated API calls?
  • You’d enjoy helping someone port your template to Lua?
  • You’d welcome some translation help (which cannot be performed by machines)?
  • Your documentation needs specific improvements?
  • Your user interface has some smaller design issues?
  • Your Outreachy/Summer of Code project welcomes small tweaks?
  • You have tasks in mind that welcome some research?

Note that “beginner tasks” (e.g. “Set up Vagrant”) and generic tasks (like “Choose and fix 2 PHP7 issues from the list in this task”) are very welcome.

If you have tasks in mind which would take an experienced contributor 2-3 hours, become a mentor and add your name to our list!

Thank you in advance, as we cannot run this without your help.

September 12, 2018

Scientists recognize the importance of communicating about science to the general public. When scientific information reaches outside of the academy, more people are equipped to make better informed political and behavioral choices. But how effective are the current channels scientists are using to reach people outside their specific scientific communities?

The public gets science information online

According to a 2017 study by Pew Research Center, “most Americans say they get science news no more than a couple of times per month, and when they do, most say it is by happenstance rather than intentionally.” People primarily learn about science from general news outlets. But what information are they receiving exactly? And where can they go if they want to learn more? The answer is, they go online.

“Individuals are increasingly turning to online environments to find information about science and to follow scientific developments,” says Dominique Brossard in a recent PNAS article. It’s therefore crucial for scientists and scientific institutions to engage online platforms to reach the public sphere with the latest research.

When citizens are informed, they can make policy decisions and behavioral choices that have a positive effect on our planet’s future. The need for science communication has never been more urgent.

“Public debates over science-related policy issues – such as global climate change, vaccine requirements for children, genetically engineered foods, or developments in human gene editing – place continuous demands on the citizenry to stay abreast of scientific developments,” states the Pew Research Center study.

Science journalism is in decline. Without journalists to do the work of science promotion, it’s more important than ever for scientists to do it themselves.

When the public wants to learn more about a scientific topic, they turn to search engines, which inevitably point them to Wikipedia. But, depending on how well that topic is covered on the online encyclopedia, they may not find what they’re looking for. That’s why it’s important for scientific experts to contribute to Wikipedia and fill in content gaps.

But there’s another reason why scientists should engage with Wikipedia: the public isn’t the only audience looking to Wikipedia to understand science. So are scientists!

Wikipedia influences science itself

In a study published in November 2017, Doug Hanley, a macroeconomist at the University of Pittsburgh, and Neil Thompson, an innovation scholar at MIT, found that Wikipedia articles about science have an effect on the progress of future scientific research. In the study, Hanley and Thompson analyzed the language that appears in Wikipedia science articles and measured that against how language in scientific research papers changes over time. Read more about their methodology in Bethany Brookshire’s write-up on ScienceNews.org here.

What Hanley and Thompson found was that Wikipedia articles have a real effect on the vocabulary of scientific journal articles all around the world, but especially in countries with weaker economies. Scientists in these countries don’t necessarily have access to the latest paywalled research and are more likely to rely on public resources like Wikipedia. Whether or not scientists are admitting they’re among the millions of people who turn to the online encyclopedia daily, the influence is real.

Essentially, Brookshire explains, the research shows that “Wikipedia is not just a passive resource, it also has an effect on the frontiers of knowledge.”

“[Wikipedia] is a big resource for science and I think we need to recognize that,” researcher Thompson says. “There’s value in making sure the science on Wikipedia is as good and complete as possible.”

Wikipedia articles are meant to incorporate all facets of a topic from a neutral point of view and from all angles. That consensus building is vital to the Wikipedia editing community, as well as the scientific community. So, not only is improving Wikipedia an act of public scholarship, but it’s also one that allows scientists to indirectly communicate with (and better inform) their peers.

Improving Wikipedia serves the public good and advances scientific knowledge. It turns out that the experience is also personally rewarding for the scientists doing it.

Scientists want to engage in public scholarship

“Many academics enter science to change the world for the better. … [But] most academic work is shared only with a particular scientific community, rather than policymakers or businesses, which makes it entirely disconnected from practice.”

That’s what Julian Kirchherr of the Guardian stresses in his article A PhD should be about improving society, not chasing academic kudos. Some even cite science communication as a moral imperative. Environmental scientist Jonathan Foley speaks to the personally fulfilling aspects of sharing one’s knowledge:

“Communicating your science with the broader world is one of the most fulfilling things you will ever do,” he writes. “I guarantee you will find it fun, rewarding, and ultimately very educational.”

Engaging with the public through platforms that people use and trust is becoming increasingly important to new generations of academics. In Do Scientists Understand the Public?, Chris Mooney writes,

“In a recent survey of one thousand graduate-level science students at a top research institution (the University of California, San Francisco), less than half designated academic research as their top career choice. Instead, these young scientists are often interested in public engagement and communication, but face limited career opportunities to pursue these goals. In other words, if there is a crying need to forge better connections between scientists and the public, there is also an army of talent within universities looking for such outreach work. That base is young, optimistic, and stands ready to be mobilized.”

Wikipedia democratizes science communication, allowing for all to participate

Wikipedia provides these academics the opportunity to reach millions with their scholarship, using language that a non-expert can understand.

“Public sources of scientific information such as Wikipedia,” says Thompson, “are incredibly important for spreading knowledge to people who are not usually part of the conversation.”

Often, when members of the public don’t participate in channels of science communication, it isn’t for lack of interest. Instead, it may be the result of structural inequalities that limit their access to those channels. Everyone with internet access can get to Wikipedia. It’s the most effective way to put reliable, up-to-date scientific information into the hands of everyone, everywhere.

Wikipedia brings science to the public, but also connects the public to scientists and scientists to each other. It is a platform where reliable, neutral fact-reporting is valued and passionate rhetoric is not tolerated. It is a space to work together toward a clearer understanding of scientific research for the benefit of all.


Image: File:Astronomy and curiosity take us to unknown places.jpg, Ralina Shaikhetdinova, CC BY-SA 4.0, via Wikimedia Commons.

#1Lib1Ref is an annual campaign where librarians and other contributors to Wikipedia add references to improve statements with the ultimate objective of improving the reliability of Wikipedia.

In 2018, the Iberocoop Network participated in the #1lib1ref campaign (1bib1ref in Spanish) in Latin America. The campaign ran for three weeks in May 2018, to commemorate the birthday of Spanish Wikipedia. During the campaign, several Latin American countries participated in online and offline activities involving librarians and the general public to drive contributions on Wikipedia. In addition to the participation of the Iberocoop Latin American countries (Argentina, Chile, Uruguay, Bolivia and Mexico), the campaign attracted participants from Brazil, Portugal, France, Italy, Catalonia, Australia, Cameroon, India and Ghana.

The results of #1bib1ref from the Latin American participants were encouraging and a great start given the limited resources and very limited prior planning: 70 editors made about 522 edits in 371 articles with the hashtag #1bib1ref, and possibly more without the hashtag. The campaign that emerged during the same period under the broader hashtag #1lib1ref saw about 940 edits in 358 articles by 132 participants in 12 languages. On social media, 32 countries made 1200 posts with the hashtag, reaching 1.8 million people, 3.6 million times.

Here are some highlights from around the world:

  • For the first time ever, #1Lib1Ref saw participation from the West African country of Ghana. While partnerships were sought with libraries to champion the campaign, the Wikimedia community also stepped in and actively contributed by improving sources on several articles on Wikipedia.
  • Wikimedia Uruguay organised a workshop on Referencing on Wikipedia for the Educational Documentation Department of the Ministry of Education and Culture.
  • In Argentina, two workshops were organised by Wikimedia Argentina for the network of librarians at Universidad de Buenos Aires and at the School of Librarians of the City of Buenos Aires.
  • In India, two workshops were organised by Krishna Chaitanya Velaga for librarians at Kakatiya University and the Annamayya Library, in Warangal and Guntur respectively. At the close of these workshops, around 118 articles had been improved and 24 new participants had been introduced to the basics of editing Wikipedia and, importantly, adding a citation! The momentum built from the campaign carried into the Advance Wiki Training (for users familiar with Wikipedia) organised by CIS-A2K and Krishna, which focused on addressing reference issues, encouraging the use of reliable references, and shedding more light on the Wikipedia Library Program and how the different Indian language communities could benefit by acquiring sources in their own languages.

 
All the enthusiasm and sharing around the campaign made its way even into Wikimania Cape Town in July 2018. The Wikipedia Library team organised a #1Lib1Ref workshop called Wikipedia 101 for Librarians, which I led, during the pre-conference at Wikimania. This one-day pre-conference event trained librarians and library staff on how to contribute to the Wikimedia projects and introduced them to their role in bridging some of the knowledge gaps that exist about South Africa on Wikipedia (as the theme for Wikimania 2018 implied). The session included lightning talks from active Wikimedians (Kerry Raymond and Andy Mabbett) working in the Wikipedia + Library space, a #1Lib1Ref editathon, and a Wikidata session to showcase the opportunities around linked open data. Jake Orlowitz carried the energy into the main conference with a three-year retrospective talk at Wikimania looking at the campaign’s evolution and growth.

This is indeed a new turn for 1Lib1Ref. We were able to witness firsthand the great growth potential in emerging communities and the void that needs to be filled. At the same time, the campaign is gradually gaining an audience and transforming into a widespread activity across multiple Wikimedia communities and regions. We believe that by making the campaign more suitable for emerging communities, we have generated enthusiasm for participation from librarians and our friends in the southern hemisphere. Watch out for more activities from India and Iran in the coming months.

And of course, we invite everyone to join us for #1lib1ref 2019 kicking off in January for yet another opportunity to make Wikipedia even more reliable globally!

Felix Nartey, Global Coordinator, The Wikipedia Library
Wikimedia Foundation

In my last posts I covered profiling and some tips for optimizing inner loops in Rust code while working on a multithreaded PNG encoder. Rust’s macro system is another powerful tool for simplifying your code, and sometimes awesomeizing your performance…

Rust has a nice system of generics, where types and functions can be specialized based on type and lifetime parameters. For instance, a Vec<u8> and a Vec<u32> both use the same source code describing the Vec structure and all its functions, but at compile time any actual Vec<T> variants get compiled separately, with code that’s as efficient as possible for each. So this gives a lot of specialization-based performance already, and should be your first course of action for most things.

Unfortunately you can’t vary generics over constant values, like integers, which turns out to sometimes be something you really wish you could have!

In mtpng, the PNG image filtering code needs to iterate through rows of bytes and refer back to bytes from a previous pixel. This requires an offset that’s based on the color type and bit depth. From our last refactoring of the main inner loop it looked like this, using Rust’s iterator system:

let len = out.len();
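// izip! (from the itertools crate) zips all of these slices into one iterator.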
for (dest, cur, left, up, above_left) in
    izip!(&mut out[bpp ..],
          &src[bpp ..],
          &src[0 .. len - bpp],
          &prev[bpp ..],
          &prev[0 .. len - bpp]) {
    *dest = func(*cur, *left, *up, *above_left);
}

When bpp is a variable argument to the function containing this loop, everything works fine — but total runtime is a smidge lower if I replace it with a constant value.

Nota bene: In fact, I found that the improvements from this macro hack got smaller and smaller as I made other optimizations, to the point that it’s now saving only a single instruction per loop iteration. :) But for times when it makes a bigger difference, keep reading! Always profile your code to find out what’s actually slower or faster!

Specializing macros

The filter functions look something like this, without the specialization goodies I added:

fn filter_paeth(bpp: usize, prev: &[u8], src: &[u8], dest: &mut [u8]) {
    dest[0] = Filter::Paeth as u8;

    filter_iter(bpp, &prev, &src, &mut dest[1 ..],
                |val, left, above, upper_left| -> u8 {
        val.wrapping_sub(paeth_predictor(left, above, upper_left))
    })
}

The filter_iter function is the wrapper func from the previous post; it runs the inner loop and calls the closure with the actual filter (inlining the closure call to make it zippy in release builds). Rust + LLVM do a great job of optimizing things already, and this is quite fast — especially since moving to iterators.

But if we need to specialize on something we can’t express as a type constraint on the function definition… macros are your friend!

The macro-using version of the function looks very similar, with one addition:

fn filter_paeth(bpp: usize, prev: &[u8], src: &[u8], dest: &mut [u8]) {
    filter_specialize!(bpp, |bpp| {
        dest[0] = Filter::Paeth as u8;

        filter_iter(bpp, &prev, &src, &mut dest[1 ..],
                    |val, left, above, upper_left| -> u8 {
            val.wrapping_sub(paeth_predictor(left, above, upper_left))
        })
    })
}

The “!” on “filter_specialize!” indicates it’s a macro, not a regular function, and makes for a lot of fun! commentary! about! how! excited! Rust! is! like! Captain! Kirk! ;)

Rust macros can be very powerful, from simple token replacement up to and including code plugins to implement domain-specific languages… we’re doing something pretty simple, accepting a couple expressions and wrapping them up differently:

macro_rules! filter_specialize {
    ( $bpp:expr, $filter_closure:expr ) => {
        {
            match $bpp {
                // indexed, greyscale@8
                1 => $filter_closure(1),
                // greyscale@16, greyscale+alpha@8 
                2 => $filter_closure(2),
                // truecolor@8
                3 => $filter_closure(3),
                // truecolor+alpha@8, greyscale+alpha@16
                4 => $filter_closure(4),
                // truecolor@16
                6 => $filter_closure(6),
                // truecolor+alpha@16
                8 => $filter_closure(8),
                _ => panic!("Invalid bpp, should never happen."),
            }
        }
    }
}

The “macro_rules!” bit defines the macro using the standard magic, which lets you specify some token types to match and then a replacement token stream.

Here both $bpp and $filter_closure params are expressions — you can also take identifiers or various other token types, but here it’s easy enough.

Note that unlike C macros, you don’t have to put a “\” at the end of every line, or carefully put parentheses around your parameters so they don’t explode. Neat!

However, you should be careful about repeating expression parameters: every use of $filter_closure in the expansion duplicates it. We could save it in a variable and reuse that, but since the whole point here is to generate specialized inline versions of it, that’s probably ok.

Note that things like match, if, and function calls can all inline constants at compile time! This means each invocation uses the exact-bpp inlined variant of the function.
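As a toy illustration (not code from mtpng), here is the effect in miniature: once a call like this is inlined with a literal argument, the optimizer can prove which arm is taken and fold the whole match away.

#[inline(always)]
fn bytes_for_pixels(bpp: usize, pixels: usize) -> usize {
    // With a constant bpp, release builds can reduce this to a single multiply (or shift).
    match bpp {
        1 => pixels,
        2 => pixels * 2,
        3 => pixels * 3,
        4 => pixels * 4,
        _ => panic!("Invalid bpp, should never happen."),
    }
}

fn main() {
    // The literal 4 is known at compile time, so no branch needs to survive here.
    println!("{}", bytes_for_pixels(4, 1024));
}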

Down on the assembly line

Looking at the “Paeth” image filter, which takes three byte inputs from different pixels… here’s a fragment from the top of the inner loop where it reads those pixel byte values:

movdqu (%rcx,%r9,1),%xmm1     ; read *left
movdqu (%rsi,%rbx,1),%xmm3    ; read *up
movdqu (%rsi,%r9,1),%xmm5     ; read *above_left

(Note we got our loop unrolled and vectorized by Rust’s underlying LLVM optimizer “for free” so it’s loading and operating on 16 bytes at a time here.)

Here, %rcx points to the current row and %rsi to the previous row. %rbx contains the loop iterator index, and %r9 has a copy of the index offset by -bpp (in this case -4, but the compiler doesn’t know that) to point to the previous pixel.

The version with a constant bytes-per-channel is able to use the fixed offset directly in x86’s addressing scheme:

movdqu (%rcx,%rbx,1),%xmm1    ; read *left
movdqu (%rsi,%rbx,1),%xmm5    ; read *above_left
movdqu 0x4(%rsi,%rbx,1),%xmm3 ; read *up

Here, %rbx has the previous-pixel index, and there’s no need to maintain the second indexer.

That doesn’t seem like an improvement there — it’s the same number of instructions and as far as I know it’s “free” to use a constant offset in terms of instructions per cycle. But it is faster. Why?

Well, let’s go to the end of the loop! The variable version has to increment both indexes:

add $0x10,%r9   ; %r9 += 16
add $0x10,%rbx  ; %rbx += 16
cmp %r9,%r14    ; if %r9 != len - (len % 16)
jne f30         ; then continue loop

but our constant version only has to update one.

add $0x10,%rbx ; %rbx += 16
cmp %rbx,%r8   ; if %rbx != len - (len % 16)
jne 2092       ; then continue loop

Saved one instruction in an inner loop. It’s one of the cheapest instructions you can do (adding to a register is a single cycle, IIRC), so it doesn’t save much. But on very large files it adds up to a few ms here and there.

The improvement was bigger earlier in the code evolution, when I was using manual indexing into the slices. :)

Macro conclusions

  • Always profile to see what’s slow, and always profile to see if your changes make a difference.
  • Use generics to vary functions and types when possible.
  • Consider macros to specialize code in ways you can’t express in the generics system, but check that the compiled output does and performs how you want!

I may end up removing the filter specialization macro since it’s such a small improvement now and it costs in code size. :) But it’s a good trick to know for when it’s helpful!

Next: interfacing Rust code with C!


September 11, 2018

I already covered some inner-loop optimization tricks for low-level Rust code in mtpng, but how do you check how fast bits of your code are anyway?

It’s fairly easy to wrap your whole code in a timer, which I’ve done for convenience of testing total runtime:

extern crate time;
use time::precise_time_s;

...

let start_time = precise_time_s();
write_png(&outfile, header, options, &pool, &data)?;
let delta = precise_time_s() - start_time;

println!("Done in {} ms", (delta * 1000.0).round());

(High-level benchmarks like this are supported in nightly/unstable Rust via “cargo bench”, or in stable with some additional tools.)
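For reference, the nightly-only harness looks roughly like this (a sketch, not code from mtpng; on stable you would reach for something like the criterion crate instead):

#![feature(test)] // nightly only
extern crate test;

#[cfg(test)]
mod benches {
    use test::{black_box, Bencher};

    // Hypothetical micro-benchmark: time one row's worth of byte math.
    #[bench]
    fn bench_row_math(b: &mut Bencher) {
        let src = vec![1u8; 4096];
        b.iter(|| {
            // black_box keeps the optimizer from deleting the work entirely.
            black_box(src.iter().map(|&x| x as u32).sum::<u32>())
        });
    }
}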

It’s a pain in the butt to do that kind of testing on all your inner functions though, and taking the timings affects your performance so you really shouldn’t try!

The way to go is to use a sampling-based profiler native to your operating system. I’ve done most of my detailed profiling on Linux, using the “perf” tool.

Build preparation

Currently the “cargo” package manager doesn’t support a profiling-specific … profile … for building. You need debug symbols or you won’t understand much of your profile details, but you need a fully optimized build or why bother measuring its performance?

It’s fairly easy to add debug info to your release builds in your Cargo.toml file:

[profile.release]
# Unoptimized debug builds are too slow to profile
# having debug info doesn't hurt perf for now
debug = true

Though you might want to remove it before making binary releases. :)

Alternatively you could remove all the things that slow down performance from the debug profile and add optimization to it.
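That would look something like this in Cargo.toml (an untested sketch; these are standard profile keys):

[profile.dev]
opt-level = 3            # full optimizations, like release
debug = true             # keep debug symbols (the dev default)
debug-assertions = false # skip the checks that slow debug builds down
overflow-checks = false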

Perf is good

Perf seems to have come from Linux kernel-land, and has all kinds of magical abilities of system-wide profiling I haven’t even fathomed yet. But it’s fairly easy to use for quick and dirty profiling!

Run your program with “perf record” plus your command line; I usually go ahead and run via “cargo run” and then skip past the cargo invocation so I don’t have to go finding the binary under targets/release/bin or wherever.

$ perf record cargo run --release -- in.png out.png --threads=1

(Using a single worker thread makes it easier to read the profile, though sometimes you want to go crazy and test with all threads. Perf will happily record all the threads and child processes!)

If you have a long-running process you want to attach to at runtime, you can do that with the -p option and the pid:

$ perf record -p 12345

This causes a very slight slowdown to your program as it records, but it’s consistent over time and can measure hotspots down to the instructions!

Reporting for duty

Once you’ve run the program, pull up an interactive report in your terminal with “perf report”:

$ perf report

You can also save multiple recordings and pick up a past one, if you want to compare multiple runs of versions of your program.

The initial view will be of the entire run: all processes, all threads, every symbol in it. (You won’t see symbols for system libraries unless you’ve installed debug info packages.)

That “Cannot load tips.txt file” seems to be a packaging problem with perf in Fedora. :) It’s harmless.

Use the arrow keys to pick an item of interest — in this case mtpng::filter::Filterator::filter — and hit enter to get a menu:

“bin” is the name of my binary (clever), so let’s first dive into the relevant worker thread from my process, and ignore the rest:

Now we can see that the biggest individual time cost in the worker thread pool is libz’s “deflate” compression! This is only because we’re seeing the profile after a few days of optimization. Before, the biggest was the filter call. :D

Let’s scroll down and hit enter to look at the filter function in detail:

Hit “Annotate” and the real fun starts:

It’ll pop you into what it thinks is the biggest hotspot, but in a big function you might have to scroll around to find the relevant stuff.

You’ll notice both the instruction-by-instruction disassembly gibberish (useful but tricky to understand!) and little bits of your original source and (mangled) symbol declarations.

In heavily optimized Rust code it can be hard to follow exactly what’s going on, because things get re-ordered and there can be magic under the hood you didn’t realize was happening… but it can be enough to piece together which individual parts of your function are slow, and to see which bits are different when you change something.

In this case we can see that the filter complexity heuristic loop has been auto-vectorized to use SSE instructions, even though my source code is a regular call in an iterator loop. Pretty rad!

Next post: using Rust macros to specialize functions for fun and performance!

python

“ Programs are meant to be read by humans and only incidentally for computers to execute.”
― Donald Knuth

Python, the programming language that supports multiple programming paradigms and was first released in 1991 (yes, earlier than Java!), has become widely accepted thanks to its role in enabling easier programming for scientific computing applications. This wonderful language has been my companion during my life as a research student, with its wonderful libraries and ease of use, especially its REPL.

I was excited to attend EuroSciPy 2018, which took place in Trento, a beautiful valley town in Italy. I attended the main conference on the 30th and 31st of August 2018. It showcased the use of Python in different scientific applications, with talks given by people from academia as well as industry who use Python to get their jobs done. It was interesting to observe that most of the talks mentioned the use of Python in machine learning and data science. Also, various Python libraries that people initially developed for use within their research groups or companies have been made open source, so that others can contribute to their further development.

 


There were some talks that I found especially interesting, and I would like to mention a few points about them.

One of these was about a Python library called Imbalanced-learn, presented by Guillaume Lemaitre. This library is used to make more accurate predictions when the training data set is skewed, with the samples in some classes being comparatively much fewer in number. For example, in problems such as cancer cell detection, solar wind records, and car insurance claims, the ratio of data samples across classes can be as high as 26:1. The approaches used to solve this involve a combination of unsupervised learning (outlier detection), semi-supervised learning (novelty detection), and supervised learning (resampling).

Various researchers from the bio-medical community were also present at the conference, explaining the use of Python and its libraries to solve interesting problems in the bio-medical field, like named entity recognition (using a library called OGER), dimensionality reduction in neuroscience (using techniques like Tensor Component Analysis and demixed PCA, in addition to normal PCA), and Chaosolver, which helps to determine phase space dynamics in bio-medical applications.

Another interesting talk, which I found to be particularly pragmatic, was titled ‘How not to screw up with Machine Learning in Production’. This talk focused on the components of a machine learning system that are essential for production beyond the core models, such as training/serving skew and data validation. The talk suggested using existing solutions or a hybrid approach instead of building the entire ML ecosystem from scratch (using tools such as TensorFlow Serving, Clipper, Apache PredictionIO, and SeldonCore/KubeFlow).

 


This was my first time attending an international tech conference, and it has given me many valuable experiences and insights. I am sure that I will find good use for the open-source Python libraries introduced to me at the conference. Getting to know the speakers and discussing my interests and challenges with them has also widened my horizons. Now that I am back from the conference, what remains are the memories of the kind people I met at Trento and some wise words and thoughts from the speakers.

You can read this post in the original Spanish on Wikimedia Mexico’s blog.

Editatona, Wikimedia Mexico’s initiative to reduce the gender gap on Wikipedia and Wikimedia projects, won the 2018 Premio del Fondo Regional para la Innovación Digital en América Latina y el Caribe (abbreviated FRIDA, and translated as “Award for the Regional Fund for Digital Innovation in Latin America and the Caribbean”), under the Technology and Gender category.

The announcement, made on 13 August 2018, recognizes efforts made in Mexico to achieve a larger presence of women on Wikipedia, the free online encyclopedia, through in-person events called “Editatonas”. These events include the participation of female volunteers as well as institutions, collectives, and allies that have collections and information that can provide verifiability and accuracy according to the existing rules of Wikipedia.

“Editatona aims to both reduce the historical injustice against women and create a balance,” said Carmen Alcázar, coordinator and founder of the project. “It adds to a series of efforts on different fronts that we, women, are opening to clearly say: a history without us—never again.”

The initiative started in Mexico in 2015 with the support of SocialTIC, Luchadoras, Ímpetu A.C., and La Sandía Digital. Since then, dozens of events have multiplied in countries like Spain, Brazil, Guatemala, Ecuador, and Uruguay. Editatona was recognized among over 400 proposals in 24 countries of Latin America and the Caribbean.

The FRIDA award, which awards a monetary stipend, will be given to Editatona at UNESCO’s headquarters in Paris during the 2018 Internet Governance Forum.

Wikimedia Mexico

You can read more about the roles and history of independent Wikimedia affiliates, like Wikimedia Mexico.

This term, Dr. David Webster is trying something new. “This year’s textbook [is] written by previous years’ students,” he announced over Twitter, much to the excitement of his followers.

Dr. Webster has taught his students at Bishop’s University how to contribute to Wikipedia as a classroom assignment for a few terms now. In his Spring 2016 course, Memory, truth and reconciliation in the developing world, students created new Wikipedia articles on a variety of course-related topics, including Truth and Reconciliation in Cambodia and the International Commission of Investigation on Human Rights Violations in Rwanda since October 1, 1990. Sixteen of these articles, along with 4 from a 2014 course, are included in a new textbook resource for Dr. Webster’s students to utilize this term and beyond.

The textbook that students will be referencing this term (along with other course materials and readings), written by former students.

Dr. Webster wrote about the experience of having students write for their future classmates back in 2017. The experience provided students with research, writing, and digital literacy skills – as well as newfound confidence. “Now that they are content providers,” he writes, “they won’t look at Wikipedia the same way.”

Wikipedia is a unique platform through which to engage students. They learn to collaborate with each other and with other Wikipedia users as they distill course topics into concise, well-researched, heavily cited articles for the general public. The exercise strengthens research, writing, collaboration, and digital literacy skills – all while providing a public service. Students make academic information (often restricted behind paywalls) available to anyone with internet connection worldwide. Wikipedia is the ultimate open educational resource.

Another instructor in our program, Dr. Clare Talwalker of UC Berkeley, finds a Wikipedia assignment inspiring because it transcends traditional academic timelines. “Students may build on each other’s work in the coming semesters, returning to some of the same articles and slowly improving many important parts of the Wikipedia universe.”

And it turns out that this collaborative nature of a Wikipedia assignment is attractive to students, too. One student in Dr. Webster’s Fall 2016 course reflected,

“One of the main points I have taken away from this course is that public history, and by extension public memory, cannot solely be shaped by individual scholars. They must be created diversely and as collaborative works by all those whom it may affect. Wikipedia is optimal for this presentation.”

Students tend to be more motivated to produce quality work when they know their work can make an impact beyond their course. We’ve received such feedback from instructors and students alike that seeing the measurable impact of their work on Wikipedia makes a difference.

And the passion that the assignment can inspire also has the power to live on beyond the term. Haleigh Marcello at the University of California, San Diego, for example, shared why she found contributing to Wikipedia to be such a “rewarding and fun experience,” and one that she’ll continue to pursue. Similarly, Jane Lee came back to update her Wikipedia article months after her course at Washington University in St. Louis ended and shared how proud she was to work toward a better final product.

“There are few assignments that better illustrate the nature of sources, the research process, and the relevance of student writing,” writes Dr. Webster, whose course (which will utilize the new textbook) started last week. “I’m looking forward to seeing how this class responds to using a textbook written by former students.”

As are we!


Interested in teaching with Wikipedia? Visit teach.wikiedu.org to get started, or reach out to contact@wikiedu.org with questions.


Image: File:A course textbook for History & Global Studies 228, Bishop’s University.jpg, Dwebsterbu, CC BY-SA 4.0, via Wikimedia Commons.

Looking at interesting patterns and bottlenecks I discover working on a multithreaded PNG encoder in Rust

A common pattern in low-level code is passing around references to data buffers that are owned somewhere higher up the call chain. In C you either send a pointer and a length as a pair of parameters, or you send a pointer and have a convention like NULL-termination, and then you carefully read and write only within that region…

Take a slice

In Rust, you use a data type called a “slice”, which is a pointer+length pair but with safer semantics and some really nice syntactic sugar. :) A slice can be … sliced … out of any contiguous structure like a fixed-size array, a resizable vector, or another slice.

There’s two big safety improvements over C pointers:

  • Rust’s compile-time borrow checker system ensures that only one piece of code holds a mutable reference to the underlying data, so you can’t have an invalid reference. (Imagine having a slice into a vector, and then the vector resizes due to an append! Can’t happen.)
  • Access via index (my_slice[i]) is bounds-checked at runtime. Like dereferencing a null pointer in C, it will probably* kill your process to access a slice out of bounds.

[Update: *Dereferencing null in C is “undefined behavior” and sometimes doesn’t crash, depending on the system and whether you’ve installed signal handlers. Rust’s “panics” are more defined in how they behave, and can in some cases be caught and recovered from. But by default, either is bad for you if you don’t handle it! ;)]

“But wait!” I hear you say. “Bounds checks at runtime are sslloowww!” Well there’s ways around that. :)

Bounds checkin’

So what is indexing into an array anyway? The index value is a “usize” (unsigned integer, pointer-sized) which is behind the scenes added to the underlying pointer to produce a final pointer. So we can think of my_slice[i] = x as doing something like this behind the scenes:

*(my_slice.as_ptr() + i) = x;

With the bounds check, it looks something like:

if i < my_slice.len() {
    *(my_slice.as_ptr() + i) = x;
} else {
    panic!("Out of bounds");
}

Note that you don’t need to check for i >= 0 because it’s an unsigned type!

But what about in a loop? Won’t that check slow tight loops down?

for i in 0 .. my_slice.len() {
    if i < my_slice.len() {
        *(my_slice.as_ptr() + i) = x;
    } else {
        panic!("Out of bounds");
    }
}

That looked kind of redundant right? Isn’t the loop already checking that i < my_slice.len() on every iteration? In fact it is… And in an optimized build, the bounds check can actually be removed by the optimizer!

Don’t be afraid to let the optimizer do your work for you — the default immutability and ownership semantics of Rust mean there are a lot of cases like this that improve dramatically in an optimized build, while the code stays straightforward to read and refactor and still performs well.

Iterators

Using a for loop with an index range isn’t always considered good style in Rust, both because of those bounds checks and because iterators are far, far more flexible, since they can work with data structures other than slices.

An iterator version of that little loop would start out as:

for elem in my_slice.iter_mut() {
    *elem = x;
}

You call iter_mut() to iterate with mutable references, or iter() for immutable ones. Each pass through the loop gives you a reference to an element, which you can read or write as appropriate.

For a slice, that essentially compiles down to the same thing as the for loop with an index range, but without needing the intermediate check even in unoptimized builds.

Cheating

You can also use the “unsafe” get_unchecked and get_unchecked_mut functions to get a reference to an indexed value without the bounds check! But you have to wrap in an “unsafe” block, because Rust makes you label stuff like that. :D

for i in 0 .. my_slice.len() {
    unsafe {
        *(my_slice.get_unchecked_mut(i)) = x;
    }
}

Multiple slices and the optimizer

In mtpng I found a case where I had to use indexing instead of iterators because I was working with multiple slices in sync, which introduced several bounds checks.

I found that adding validation checks that the lengths were all the same actually made all the bounds checks disappear, doubling the speed of the tight loop and improving overall encode speed by over 25%.

Without the validation, the function looked something like this:

fn filter_iter<F>(bpp: usize, prev: &[u8], src: &[u8], out: &mut [u8], func: F)
    where F : Fn(u8, u8, u8, u8) -> u8
{
    for i in 0 .. bpp {
        let zero = 0u8;
        out[i] = func(src[i], zero, prev[i], zero);
    }
    for i in bpp .. out.len() {
        out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
    }
}

With the checks added at the top, before the inner loop:

fn filter_iter<F>(bpp: usize, prev: &[u8], src: &[u8], out: &mut [u8], func: F)
    where F : Fn(u8, u8, u8, u8) -> u8
{
    assert!(out.len() >= bpp);
    assert!(prev.len() == out.len());
    assert!(src.len() == out.len());

    for i in 0 .. bpp {
        let zero = 0u8;
        out[i] = func(src[i], zero, prev[i], zero);
    }
    for i in bpp .. out.len() {
        out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
    }
}

[Update: using the assert! macro is better style than manually calling panic! in your high-level code. Note that assert! code is always present in both debug and release builds; use the debug_assert! macro for checks that aren’t necessary for safety or performance.]

At runtime those extra checks at the top should never trigger, because all three slices are the same length and bpp is never larger than the length. But the optimizer didn’t know that! Making the invariant explicit in the code, instead of just hoping it was right, lets the optimizer turn all of these:

for i in bpp .. out.len() {
    if i < src.len() {
        if i < prev.len() {
            if i < out.len() {
                out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
            } else {
                panic!("Out of bounds");
            }
        } else {
            panic!("Out of bounds");
        }
    } else {
        panic!("Out of bounds");
    }
}

Into this with no bounds checks:

for i in bpp .. out.len() {
    out[i] = func(src[i], src[i - bpp], prev[i], prev[i - bpp]);
}

Pretty neat right!

zip and izip!

Update: The above case can also be rewritten with iterators by “zipping” multiple iterators together.

If you only have two iterators, you can use the “zip” function in the standard library; if you have more you can use the “izip!” macro in the itertools crate.

This ends up with code that can be a bit verbose but should also run cleanly:

let len = out.len();
for (dest, cur, left, up, above_left) in
    izip!(&mut out[bpp ..],
          &src[bpp ..],
          &src[0 .. len - bpp],
          &prev[bpp ..],
          &prev[0 .. len - bpp]) {
    *dest = func(*cur, *left, *up, *above_left);
}

[Update: I was able to confirm that careful use of izip! slightly outperforms indexing plus voodoo assertions, removing another instruction or two per inner loop iteration. If you can write sanely that way, it works nicely! Won’t work if you need random access to the various slices, but for this kind of lock-step iteration it’s perfect.]

Debug vs release builds

The Rust compiler and the cargo package manager default to unoptimized debug builds if you don’t tell them to make a release build.

This sounds good, except the entire Rust standard library is built on the same patterns of using safe clean code that optimizes well… For mtpng I’m seeing a 50x slowdown in runtime in unoptimized debug builds versus optimized release builds. Yeeeooooowwwwch!

Note that you can change the optimization level for your own debug builds in the Cargo.toml file in your project, which can help; you can crank it all the way up to a release build’s optimization level or leave it somewhere in the middle.
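
For example, something like this in your Cargo.toml bumps up optimization for debug builds (treat it as a sketch to tune, not a recommendation):

[profile.dev]
opt-level = 2   # 0 is the debug default; 3 is what release builds use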

The title is a little wordy, but I hope you get the gist. I just spent 10 minutes staring at some data on a Grafana dashboard, comparing it with some other data, and finding the numbers didn’t add up. Here is the story in case it catches you out.

The dashboard

The dashboard in question is the Wikidata Edits dashboard hosted on the Wikimedia Grafana instance that is public for all to see. The top of the dashboard features a panel that shows the total number of edits on Wikidata in the past 7 days. The rest of the dashboard breaks these edits down further, including another general edits panel on the left of the second row. 

The problem

The screenshot above shows that the top edit panel is fixed to show the last 7 days (this can be seen by looking at the blue text in the top right of the panel). The second edits panel on the left of the second row is also currently displaying data for the last 7 days (this can be seen by looking at the range selector on the top right of the dashboard).

The outlines of the two graphs in the panels appear to follow the same general shape. However, the two panels show different totals for the edits made in the window. The first panel reports 576k edits in one week, but the second reports 307k. What on earth is going on?

Double-checking the data against another source, I found that both numbers are well off. For a single day the total edit count is closer to 700k, which scales up to 4-5 million edits per week.

hive (event)> select count(*)
            > from mediawiki_revision_create
            > where `database` = "wikidatawiki"
            > and meta.dt between "2018-09-09T02:00Z" and "2018-09-10T02:00Z"
            > and year=2018 and month=9 and (day=9 or day=10)
            > ;
.....
_c0
702453
Time taken: 24.991 seconds, Fetched: 1 row(s)

maxDataPoints

The Graphite render API used by Grafana has a parameter called maxDataPoints which determines the maximum number of data points to return. The docs are slightly more detailed, saying:

Set the maximum numbers of datapoints for each series returned when using json content.
If for any output series the number of datapoints in a selected range exceeds the maxDataPoints value then the datapoints over the whole period are consolidated.
The function used to consolidate points can be set using the consolidateBy function.

Graphite 1.14 docs

Reading the documentation of the consolidateBy function, we find the problem:

The consolidateBy() function changes the consolidation function from the default of ‘average’ to one of ‘sum’, ‘max’, ‘min’, ‘first’, or ‘last’.

Graphite 1.14 docs

Because the default consolidation function of ‘average’ is used, the total value on the dashboard will never be correct. Instead we get the total of the averages.
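
To make the arithmetic concrete, here is a tiny standalone sketch (with made-up numbers, nothing to do with the real metric) of what consolidating by ‘average’ versus ‘sum’ does to a total:

fn main() {
    // Twelve one-minute data points (e.g. edits per minute); made-up numbers.
    let per_minute = [120u32, 80, 100, 90, 110, 100, 95, 105, 130, 70, 100, 100];
    let true_total: u32 = per_minute.iter().sum();

    // Consolidate down to 3 points (4 minutes per bucket), as Graphite does
    // when a series has more points than maxDataPoints.
    let averaged: Vec<f64> = per_minute
        .chunks(4)
        .map(|c| c.iter().sum::<u32>() as f64 / c.len() as f64)
        .collect();
    let summed: Vec<u32> = per_minute.chunks(4).map(|c| c.iter().sum()).collect();

    println!("true total:              {}", true_total);                   // 1200
    println!("total of averages:       {}", averaged.iter().sum::<f64>()); // 300
    println!("total of summed buckets: {}", summed.iter().sum::<u32>());   // 1200
}

Totalling the averaged series undercounts by roughly the consolidation factor, which is why the panel’s number looks plausible in shape but wrong in magnitude.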

Fixes for the dashboard

I could set the maxDataPoints parameter to 9999999 for all panels; that would mean the previous assumptions would hold true, since Grafana would be getting ALL of the data points from Graphite and correctly totalling them. I gave it a quick shot, but it probably isn’t what we want. We don’t need that level of granularity.

Adding consolidateBy(sum) should do the trick. And in the screenshot below we can now see that the totals make sense and roughly line up with our estimations.

For now I have actually set the second panel to have a maxDataPoints value of 9999999. As the data is stored at one-minute granularity, this means roughly 19 years of minutely data can be accessed. When looking at the default of 7 days, that equates to 143KB of data.

Continued confusion and misdirection

I have no doubt that Grafana will continue to trip me and others up with little quirks like this. At least the tooltip for the maxDataPoints option explains exactly what it does, although it is hidden by default on the current Wikimedia version.

Data data everywhere. If only it were all correct.

The post Grafana, Graphite and maxDataPoints confusion for totals appeared first on Addshore.

September 10, 2018

On 2 September, disaster struck the National Museum of Brazil: a massive fire devastated the building and its extensive holdings. Centuries of cultural heritage, including recordings of dead languages and ancient artifacts from pre-Columbian times, were lost.

But amid the carnage and destruction, a movement has risen, one with the aim of adding as much knowledge about the museum’s collections to Wikimedia projects (including Wikipedia) before anything more is lost forever.

This mobilization includes the creation and development of articles about the disaster that destroyed the museum, which held the oldest scientific collection in Brazil and was one of the largest museums in Latin America, and the launch of a campaign to gather images of the building and collection. Long-time Wikimedia editors and first-timers got together to make sure we would learn from this incident, one that forcefully reminds us that the goal of recording the sum of all knowledge has a deadline.

The first Portuguese Wikipedia edit about the National Museum fire was made by an anonymous user at 8:40 PM (UTC), minutes after reports began appearing on TV. An hour and a half later, long-time Wikimedian DarwIn created an entry about the fire itself. It initially read: “the fire at the National Museum was a fire of large proportion at the National Museum, in Rio de Janeiro, on September 2, 2018.”

According to the museum’s entry on the English Wikipedia, “The National Museum held a vast collection with more than 20 million objects, encompassing some of the most important material records regarding natural science and anthropology in Brazil, as well as numerous items that came from other regions of the world and were produced by several cultures and ancient civilizations […] The museum also held one of the largest scientific libraries of Brazil, with over 470,000 volumes and 2,400 rare works.”

Improvements on the article were coordinated among Wikimedians through social networks, as tasks were being assigned or taken up by many. On the Portuguese Wikipedia, around 250 people contributed to the entry on the museum and on the fire from 2 to 6 September. As of 10 September, entries on the fire existed in 21 languages.

———

The first picture of the fire was uploaded at 11:23 PM, only hours after the fire had begun. The photographer, Felipe Milanez, is a university professor and journalist who had worked on research projects with the National Museum’s team. As he posted on Facebook about what he was seeing, I asked him to send me his pictures to upload to Wikimedia Commons (a formal authorization to use the images followed). He sent five images; in just three days, this set was seen over 2 million times, particularly an image of the statue of the Brazilian Emperor Pedro II with the museum on fire behind it. This was Felipe Milanez’s first contribution to Wikimedia projects.

“People need to know about this, [and] people need to see this”, Felipe Milanez said of why he was contributing to Wikimedia Commons. He also wrote reports for the local and international press. According to him, the disaster that struck the National Museum was linked to a lack of investment from the Brazilian government, which had cut funding for the institution.

Wikimedians then began publishing a call on social networks for people to upload images of the museum building and collection to Wikimedia Commons, a freely licensed media repository that holds many of the images used on Wikipedia. In just three days, around 2,000 images of the museum taken before the fire were uploaded. Received images were usually uploaded to a generic category on Commons, and more experienced users then worked to curate the content, often communicating through private messaging to discuss categorization strategies.

Student Juliana Gouy was one of those responding to the call. The National Museum is a special place to her, somewhere she went with her family as a child; it was a calm refuge from Rio de Janeiro, one of the largest and most hectic cities in Brazil. “As soon as I heard the fire was going on,” she said, “I felt the need to look at the pictures I had taken there and I thought many of my friends would like to see them. I shared these pictures publicly, and then many people started liking them.” That’s when she was approached to contribute the images to Wikimedia Commons.

Juliana Gouy had never contributed to Wikimedia Commons, and had actually never heard of the project. She uploaded a small set of images and, as she was having trouble with the UploadWizard, was given help to upload around 200 pictures of the museum building and its collections.

As content was being produced collaboratively, editors of the Portuguese Wikipedia agreed to a proposal by long-time editor Dornicke that a site banner should appear above all pages to mourn the loss of so much cultural heritage. This led to a formal call, translated into eleven languages, for people to contribute images on the building and collection to Wikimedia Commons.

———

This campaign, asking people to contribute their images of the National Museum, is still ongoing. There is no fully digitized collection of the museum’s holdings, much less of the items that were destroyed in the fire. We need your help to preserve as much of the museum’s knowledge as we possibly can.

You may also be interested in helping with other Wikimedia museum partnerships, such as the Museu do Ipiranga, the Brazilian National Archives, and an ongoing facial reconstruction of Luzia.

João Alexandre Peschanski, Wikimedian

Last May, Prince Harry, fifth in line to the British throne, married Meghan Markle, an American actress and activist.

The event captivated millions upon millions of people for several weeks, and many of them journeyed to Wikipedia to read the encyclopedia’s curated content about the British monarchy, the wedding plans, and the people involved. Unfortunately, one person they weren’t able to read much about on Wikipedia until three days before the wedding was Doria Ragland, Markle’s mother, who had recently become the subject of news articles in the weeks leading up to the wedding.

You might be forgiven for thinking that the English Wikipedia would have had an article on Ragland. After all, it’s the largest encyclopedia in the history of the world. But the site is built on contributions from hundreds of thousands of volunteer editors, each of whom donates their time to write general-interest educational material for fun, and it has strict “notability” standards which lay out specific requirements that have to be met before someone can have a Wikipedia article about them.*

This is where Wikipedia editor and AfroCROWD member Linda Fletcher comes in. Fletcher, who like so many others had come to Wikipedia to learn more about the royal wedding, believed that Ragland met those notability standards and resolved to fix the content gap. Motivated by what she calls the “enduring relationship” between Markle and Ragland, Fletcher was excited that the research required for writing a Wikipedia article would lead to her learning a great deal more about both of them.

On 16 May, Fletcher created a short entry about Ragland. Fifteen minutes later, another editor nominated her work for deletion, as they disagreed that Ragland was notable under Wikipedia’s definition.

Without intended irony, this is a feature of Wikipedia, not a bug. Articles on non-notable people are created every day, some with good intentions and others as high school pranks, and they are inevitably nominated for deletion. Volunteers then debate whether the nomination was correct, with the aim of coming to a consensus agreement.** Fletcher did not participate in the debate, telling me that “I knew that Doria Ragland was significant, and that in time she will become even more significant.”

In this case, the system worked. The article about Ragland was not deleted, but her high profile meant that it was a stressful and heated affair, with 84 editors weighing in over five days. Some jumped in to help bolster the article Fletcher created, adding hundreds of words, sourced with citations to sources with a reputation for fact checking and accuracy. Today, the article is nearly 450 words long, and was viewed over two million times in the month of May, making it one of the most popular articles on the site.

For her part, Fletcher is still keeping an eye out for new news articles about Ragland and her daughter, but has moved back into her usual pattern of identifying content gaps and creating new articles, whether they’re about jazz musicians, the founder of permaculture in Ghana, historical sites in Harlem, or something else entirely.

Want to join? Here’s how.

Ed Erhart, Senior Editorial Associate, Communications
Wikimedia Foundation

*These standards can put women and non-Western people at a disadvantage, and indeed the English Wikipedia has a persistent gender gap and systemic bias in its content and contributors. A bit under 18% of the encyclopedia’s biographies are about women, whose place in the historical record has often been overshadowed, neglected, or simply omitted, and the majority of its geographically tied topics come from the United States and Europe. Some of this stems from the bias in the sources used to write Wikipedia articles, which are themselves broadly western.

**Here’s more information about the English Wikipedia’s deletion process.


September 09, 2018

In my last post I wrapped up the patches to improve perceived performance of screenshots on the Linux GNOME desktop. With that done, why not implement my crazy plan for parallel PNG encoding to speed the actual save time?

Before starting any coding project, you need to choose a language to implement it in. This is driven by a number of considerations: familiarity, performance, breadth and depth of the standard library, and whether it’ll fit in with whatever eventual deployment target you have in mind.

Now C here

I’ve done a fair amount of coding in C; it’s well known for performing well, and it’s the default for most of the components in GNOME, so that’s a plus… but its standard library is small, its type and macro systems are limited enough that repetitious code is common, and memory management is notoriously unreliable — especially in multithreaded situations. I would either have to implement some threading details myself, or find a library to use… which might introduce a dependency into the GNOME platform.

There’s also C++, C’s weird cousin. Templates are way more powerful than C macros, but I have trouble wrapping my head around the details, and error messages expose a lot of internals. Memory management and, in the latest versions, threading are a big improvement! But it adds the C++ standard library as a dependency, which might or might not fly for GNOME.

My usual go-to languages are PHP and JavaScript, neither of which are suitable for a high-performance binary-parsing system component. :)

Getting Rust-y

But what about Rust? It’s being used by some other GNOME components such as librsvg which renders SVG icons (and is also used by us at Wikimedia for rendering thumbnails of SVG icons, diagrams, and maps!), and can be integrated into a C-compatible library without introducing runtime or install-time dependencies.

Rust’s type system is powerful, with a generics system that’s in some ways more limited than C++ templates but much easier to grok. And it also has a macro system that can rewrite bits of code for you in a way that, again, I find a lot easier to comprehend than C++ templates and is much more powerful than C macros.

And the memory management is much cleaner, with a lot of compile-time checking through a “borrow checker” system that can identify a HUGE number of memory-misuse problems long before you get into the debugger.

Plus, Rust’s community seems friendly, and they’ve got a good package manager system that makes it easy to pull in compile-time dependencies to fill in gaps in the standard library.

I’ve been keeping an eye on Rust for a couple years, reading a lot of docs and some books, and half-starting a few tiny test projects, but never had a project in front of me that seemed as good a time to start as this!

Well that’s just crate

Packages in Rust are called “crates”, and the package manager is named “cargo”. A crate can contain a library, a binary, or both, and can have tests, docs, examples, and even benchmarks that can be run by the package manager.

Creating a stub library crate via ‘cargo init --lib’ is fairly straightforward, and I immediately had something I could compile and run tests on with ‘cargo test’!
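
The generated src/lib.rs stub is tiny; at the time it looked something like this (the exact template varies by toolchain version):

#[cfg(test)]
mod tests {
    #[test]
    fn it_works() {
        assert_eq!(2 + 2, 4);
    }
}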

As an editor I’m using Visual Studio Code, which has an “rls” package that provides Rust support, including inline display of compiler warnings and errors. This is a *huge* improvement over manually switching to a terminal to run compiles to check for errors, especially as cascading errors can scroll the initial cause of your error off the screen. ;) But the error messages themselves are usually clear and helpful, often suggesting a fix for common problems.

I’m also able to run my tests on Linux, macOS, and Windows machines without any additional work, as Rust and cargo are pretty portable.

From there, I was able to start adding tiny parts of my plan, bit by bit, with confidence.

Atypical types

Rust’s type system has a number of quirks and interesting points, but the most important kinds of things are structs and enums.

Structs are like structs or classes in C++, kinda sorta. Their fields can be exposed publicly or private to the module, and they can have methods (more or less). Here’s one from the project so far:

#[derive(Copy, Clone)]
pub struct Options {
    pub chunk_size: usize,
    pub compression_level: CompressionLevel,
    pub strategy: Strategy,
    pub streaming: bool,
    pub filter_mode: FilterMode,
}

The “derive(Copy, Clone)” bit tells the compiler to treat the struct as something that can be trivially copied, like a primitive value. Not always what you want for a struct; I may remove this in my final API. The fields are also exposed directly with “pub”, which I may change to a builder pattern and accessors. But it’s an easy way to get started!

An “impl” block adds your associated functions and methods. The compiler checks that you initialize all fields; it’s impossible to end up with uninitialized memory by accident.

impl Options {
    // Use default options
    pub fn new() -> Options {
        Options {
            chunk_size: 128 * 1024,
            compression_level: CompressionLevel::Default,
            strategy: Strategy::Default,
            filter_mode: FilterMode::Adaptive,
            streaming: true,
        }
    }
}
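
Using it then looks something like this (a hypothetical caller, not necessarily the final API, since the public fields may later give way to a builder):

// Start from the defaults, then tweak public fields directly.
let mut options = Options::new();
options.chunk_size = 256 * 1024;
options.streaming = false;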

Meanwhile, Rust enums can be made to work like a C enum, which is mostly a list of constant values, but they can also be a very powerful construct!

Here’s a boring enum example which maps names for the PNG byte filters to their codes:

#[derive(Copy, Clone)]
pub enum FilterType {
    None = 0,
    Sub = 1,
    Up = 2,
    Average = 3,
    Paeth = 4,
}

And here’s one that’s more interesting:

#[derive(Copy, Clone)]
pub enum FilterMode {
    Adaptive,
    Fixed(FilterType),
}

This means a FilterMode can be either the value “Adaptive” or the value “Fixed”… with an associated sub-value of type FilterType. These are kind of like C unions, but they’re not awful. ;)

That fancy enum pattern is used everywhere in Rust, with the Option<T> enum (either None or Some(T)) and Result<T, E> (either Ok(T) or Err(E)). In combination with the “match” construct this makes for great parsing of command line options into optional settings…

options.filter_mode = match matches.value_of("filter") {
    None             => FilterMode::Adaptive,
    Some("adaptive") => FilterMode::Adaptive,
    Some("none")     => FilterMode::Fixed(FilterType::None),
    Some("up")       => FilterMode::Fixed(FilterType::Up),
    Some("sub")      => FilterMode::Fixed(FilterType::Sub),
    Some("average")  => FilterMode::Fixed(FilterType::Average),
    Some("paeth")    => FilterMode::Fixed(FilterType::Paeth),
    _                => return Err(err("Unsupported filter type")),
};

There’s a little redundancy of namespaces that I can clean up with better naming there, but you get the idea. Note you can even match on an enum containing a string!

Threading with Rayon

Another big concern of mine was being able to do threading management easily. Rust’s standard library includes tools to launch threads and safely send ownership of data between them, but it’s a little bare bones for running a threadpool with a queue of work items.

The Rayon library was highly recommended; it includes both a really solid ThreadPool type that you can drop function closures into, and higher-level constructs building on top of iterators to, for instance, split a large array-based process into small chunks.

I ended up not using the fancy iterator systems (yet?), but the ThreadPool is perfect. I combined that with message passing from the standard library to send completed blocks back to the main thread for processing, and was able to get a working multithreaded app going in about two days, with my bugs almost entirely in my filtering, compression, and output file structure, and not in the threading. :)
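
As a rough sketch of that shape (the names and chunk contents here are illustrative, not mtpng’s actual code, and it assumes the rayon crate is declared in Cargo.toml):

use std::sync::mpsc;

fn main() {
    // Build a pool with a fixed number of worker threads.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build()
        .unwrap();

    let (tx, rx) = mpsc::channel();

    // Pretend each "chunk" is a block of image rows to filter and compress.
    for chunk_index in 0..8 {
        let tx = tx.clone();
        pool.spawn(move || {
            // Stand-in for the real per-chunk work.
            let compressed = vec![chunk_index as u8; 16];
            tx.send((chunk_index, compressed)).unwrap();
        });
    }
    drop(tx); // close the channel so the receiving loop ends

    // The main thread collects completed blocks as they arrive
    // (out of order; a real encoder would reorder them before writing).
    for (index, data) in rx {
        println!("chunk {} finished, {} bytes", index, data.len());
    }
}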

Deflation for once

The last main library component I needed was the actual data compression — I can implement image filtering and chunk output myself, and it was easy to grab a crc32 library via a crate. But the ‘deflate’ compression algo looks hard, and why reimplement that if I don’t have to?

There are several ‘deflate’ and ‘gzip’ implementations written in Rust available as crates, but none of them included all the abilities I needed for parallel chunked compression:

  • overriding the sliding-window dictionary with the previous chunk’s last 32 KiB of uncompressed data to improve compression across chunks
  • flushing output at the end of each chunk without marking the last block as end-of-stream

Possibly I’ll find some time later to figure out how to add those features. But for now, I used the C zlib library (the same one used by the C libpng library!) through an API adapter crate that exposes the absolute minimum C API of zlib as Rust functions and types.

It shouldn’t introduce dependencies since zlib is already used by …. lots of stuff. :D

So how’s it working out?

Pretty well. Check out my mtpng project over on github if you like!

I got it mostly working for truecolor images in about 3 days, and did some performance tuning later in the week that sped it up enough that it’s often beating libpng on a single thread, and always beats it using multiple threads on large files.

The API needs work before I publish it as a crate, so I’ll keep fiddling with it, and I haven’t yet figured out how to implement the C API but I know it can be done. I’ve also got a long list of todos… :)

I’ll add more blog posts soon on some interesting details…

I’m happy to release two Clean Architecture + Bounded Contexts diagrams into the public domain (CC0 1.0).

I created these diagrams for Wikimedia Deutschland with the help of Jan Dittrich, Charlie Kritschmar and Hanna Petruschat. They represent the architecture of our fundraising codebase. I explain the rules of this architecture in my post Clean Architecture + Bounded Contexts. The new diagrams are based on the ones I published two years ago in my Clean Architecture Diagram post.

Diagram 1: Clean Architecture + DDD, generic version. Click to enlarge. Link: SVG version

Diagram 2: Clean Architecture + DDD, fundraising version. Click to enlarge. Link: SVG version

 

September 08, 2018

While tuning BIOS settings on my old workstation PC with two CPU sockets, I noticed a setting I hadn’t touched in a long time, if ever — the “memory interleaving mode” could be tuned for either SMP or NUMA, and was set to SMP.

What… what does that mean? SMP is Symmetric Multi-Processing and I’ve heard that term since ages past in Linux-land for handling multiple CPUs and cores. NUMA is just some server thing right?

So… NUMA is Non-Uniform Memory Access, and here specifically refers to the fact that each CPU socket has its own connection to its share of system memory that’s slightly faster than accessing the same memory through the other CPU socket.

With the BIOS’s memory interleave mode set to NUMA, that speed differential is exposed to the operating system, and memory from each processor is assigned to a separate region of physical memory addressing. This means the OS can assign memory and processor time to any given process optimized for speed as much as possible, only slowing down if a process needs more than fits on one socket. Cool, right?

Meanwhile with it set to SMP, the memory is laid out interleaved, so any given piece of memory might be fast, or it might be slow. Lame right?

So.

I tried it for fun, on Linux and Windows both and at first didn’t see much difference. Using valgrind’s “cachegrind” tool confirmed that the things I was testing (PHP in a tight interpreter loop, or PNG compression) were mostly working in-cache and so memory latency wasn’t a big deal, and memory bandwidth is nooooowhere near being saturated.

Then I found a case where NUMA mode fell down badly: multithreaded app with more threads than physical CPU cores on a single socket.

My PNG compression tests at 1, 2, 4, or 16 threads ran about as fast in SMP or NUMA mode. But at 8 threads, there was a big dip in speed.

Since I have 2x quad-core processors with hyper-threading, the layout is:

  • 2 sockets
    • 4 cores per socket
      • 2 threads per core

SMP mode assigns threads to logical processors like this:

  • 1 thread – runs on either socket
  • 2 threads – one on each socket
  • 4 threads – two on each socket, on separate cores
  • 8 threads – four on each socket, on separate cores
  • 16 threads – eight on each socket, packing all cores’ threads

NUMA mode prefers to group them together, because they’re all in a single process:

  • 1 thread – runs on either socket
  • 2 threads – two on one socket, on separate cores
  • 4 threads – four on one socket, on separate cores
  • 8 threads – eight on one socket, packing all cores’ threads
  • 16 threads – eight on each socket, packing all cores’ threads

Now we see the problem! Because Hyper-Threading shares execution resources between the two threads of each core, you don’t get as much work done when both threads are packed.

If I hadn’t had Hyper-Threading on, 8 threads would’ve scaled better but it wouldn’t be able to run 16 (at a sllliiiggghhhtt speed boost) anymore.

Putting it back on SMP mode for now. For a system that’s always under full load NUMA is theoretically superior, but it interacts with Hyper-Threading weirdly for my up-and-down desktop/workstation workload.

 

28/08/2018-03/09/2018

Pic

Public footpath on Selsley Common (part of the Cotswold Way national trail), near Stroud, Gloucestershire [1] | Photo © 2018 Nick Johnston

Mapping

  • The city of Stuttgart allows OSM to use its aerial imagery, which has a ground resolution of 20 cm. This step was discussed with the local OSM community during the last local OSM meetup where the city representatives also asked what data could help OSM further.
  • On the tagging mailing list Joseph Guillaume introduces his proposal for canal=qanat; these are underground channels for conveying groundwater.
  • On talk-GB Microsoft asked for help with some unnamed roads that they have identified.
  • [1] Nick Johnston wrote a blog post about mapping Britain’s paths in OpenStreetMap. He includes plenty of background information on public rights of way. He got the data for the paths in his home region of Gloucestershire from Rowmaps and surveyed them using OsmAnd with the tracks loaded in the background. Even if you have no intention of mapping paths in the UK, the pictures of his beautiful home region alone make the article worth reading. He ends his article with a warning that you should not import the data straight away, for a number of reasons.
  • User MKnight has refined (automatic translation) the reservoir Lac de Petit-Saut in French Guiana in weeks of hard work. He also summarises the negative environmental consequences of the dam. (de)
  • OpenSnowMap.org’s maintainer yvecai has compiled up-to-date statistics about ski pistes in OSM. The combined length of all ski pistes in OSM exceeds 100 000 km!
  • Land parcel data in OSM is mentioned in a study about the impact of spatial data on the development of slums in cities with rapid urbanisation. According to the report, OSM has not reached a consensus as to whether parcel-level data should be mapped due to concerns about data quality, validity, and maintainability. The same applies to metadata on parcel data. It also mentions that mapping parcel data is becoming more common.

Community

  • We recently wrote about Matthias Plennert’s article “The social construction of technological stasis: The stagnating data structure in OpenStreetMap” in the journal Big Data & Society. The article in general and one of the key points in the article, i.e. that hidden gatekeepers would be responsible for technological stagnation in OSM, caused some discussions. The author responded to the feedback in a special article.
  • Noémie Lehuby, Florian Lainez, Flora Hayat and Jocelyn Jaubert have revamped the OSM France website. This was also discussed on the Talk-Fr mailing list. (fr) (automatic translation)
  • MySociety, a UK-based non-profit organisation, providing technology, research and data to help people become active citizens, is looking for worldwide electoral boundaries. David Earl forwarded the query to the OSMF mailing list.
  • A lot of text has been written recently about Facebook and its contributions to OpenStreetMap. LukeWalsh’s article provides a new perspective on it: he wrote about his experience in a college rotational program, during which he was placed with the Facebook team that makes OSM contributions.
  • The Kosovo community now has a Kosovo-wide Telegram group. Recent topics include open data and the naming of things in an ethnically diverse country. All mappers from anywhere in any language are welcome.
  • United States ambassador to Turkmenistan, Allan Mustard (OSM apm-wa) wrote two articles about his work on OSM, currently the best map of Turkmenistan available to the general public. As you can read in one article, being an ambassador has the advantage that you can use the head of the Motor Roads State Concern as a QA tool. In his second article he gave some insight into his tour to a southern Balkan Province in Turkmenistan where he updated road information and collected POIs, GPS traces, and Mapillary pictures.

Imports

  • Andrew Harvey plans to import buildings in the Australian Capital Territory into OSM and started the required discussion on the import mailing list. The import would add over 60,000 buildings with over 500,000 nodes to OSM but does not include address data.
  • The import of road network data is planned in Kerala, India, a region affected by severe flooding recently. The data was generated by Facebook making use of machine learning. The plan is to allow Facebook’s mapping team to do the initial import, after which the OSM-India community will validate the data. The local community requested Facebook’s assistance with this import. However, during the discussion it was highlighted that the import guideline was not followed exactly and some remarks were made.

Events

  • FOSS4G TOKAI 2018 (automatic translation) was held on Aug 24-25 in Tokai. The organising committee published the collected tweets (ja) of the event and published a video of the “core day” as well.
  • Mapping events will be hosted during OSM Geography Awareness Week on Nov 11-17, 2018 in several places around the world.
  • Michael Schultz from the GIScience Group at Heidelberg University invites participation in a validation mapathon on September 13 during EuroGEOSS in Geneva. The goal is to validate a forthcoming Landuse-Dataset that is being generated through fusion of OSMlanduse Data and Sentinel-2 Data using machine learning.

Humanitarian OSM

  • The World Bank is funding the improvement of OpenStreetMap-Analytics (OSMA) with new functionality through the project Open Cities Africa and the Global Facility for Disaster Reduction and Recovery (GFDRR). HOT reported at FOSS4G. The development, as well as the hosting of the project, is being carried out by the Heidelberg Institute for Geoinformation Technology (HeiGIT).

Maps

  • Laurence Tratt complains on Twitter that he can’t really read anything on the “official” osm.org map in countries where no Latin characters are used. Laurence knows about openstreetmap.de, but is looking for a map that lists names in English (Munich, not München; Milan, not Mailand 😉). Sven Geggus, maintainer of the German style and of the localisation code used there (available from here), thinks that such a change on osm.org is barely possible with the currently used technology. However, in his opinion, this issue should also be addressed as part of the switch to vector tiles. In principle, operating an English version of the localised map would not be a big problem; Sven already runs a server for his employer’s projects.
  • OpenAndroMaps, a website for free Android vector maps, now provides additional information like peaks, capitals and ocean depths. The maker of OAM explains his motivation and gives some background information in a blog post.
  • Daniel Koć, maintainer of OSM’s main map style on osm.org, has written a second article about his personal design principles for the OSM Carto map style. It discusses why the size of a feature is the primary property considered for showing some objects earlier than others. The article also has hints on how to choose a proper zoom level for starting to show them.
  • Christoph Hormann addresses the possibilities of using patterns to mark different area types in his blog post “More on Pattern use in Maps”.

switch2OSM

  • The website wedemain.fr published (automatic translation) an article about how local authorities and administrations are using open-source alternatives to Google services. It discusses the recent switch by the department of Maine-et-Loire from Google Maps to OSM, and the move of other institutions mainly to Framasoft. The association Framasoft hosts many open-source services, free of charge.
  • Gadgets360 published an article about some companies that have based their business model on free access to Google Maps. The article examines the trouble they are now in and provides some information on alternatives, including OSM.
  • etourisme.info recommends that Tourist Offices switch to OpenStreetMap and explains what other benefits this brings on top of the saved G-Maps fees. (fr) (automatic translation)

Software

  • An overview of important OSM apps for Mobile, classified according to purpose, was made for a mapathon event in Kinshasa.

Programming

  • Fabian Kowatsch from the Heidelberg Institute for Geoinformation Technology (HeiGIT) published a first draft of the documentation of the ohsome API, which is a REST API for analysing the history of OSM data through the OSM History Analytics platform Ohsome.org.
  • manoharuss gives a short introduction in his user diary on how to use OSMCha to find suspicious edits in OSM.

Releases

  • Simon Poole announced the release of Vespucci 11.1 BETA. The newly implemented features include:
    • improved preset handling
    • lookup in taginfo now happens manually for the recently introduced “auto-presets”
    • improved copy and paste of tags
    • more minor changes.

Did you know …

  • … the website autobahnkilometer.ch created by dktue? According to his announcement (automatic translation) in the forum, he created this page to measure the current length of the OSM motorway network in Switzerland and to count the tagged motorway kilometres.
  • … the site share.mapbbcode.org that allows you to easily create and share maps with custom markers and geometries? You can invite others to edit or view your map. The site also provides [map] bbcode that is supported by some forums.
  • … the site mundraub.org? Originally used mainly in Germany, the user base of the site is expanding. The name “mundraub” means petty theft of food. The site shares the locations of plants, mainly fruit trees, that are freely available for harvesting by everyone.

OSM in the media

  • Our map was vandalised recently at a prominent location. As usual, the offensive data was fixed after a very short time by volunteer OSM contributors; this time the data was corrected within two hours. Unfortunately Mapbox, whose maps are used by many companies including Snapchat, pulled the data for updating their maps during this two-hour window. Hence the vandalism was visible to a very large audience and found a loud echo in the media, with TechCrunch, the BBC, Gizmodo, BuzzFeed News and many more reporting on it. In an official article OSM condemns this kind of vandalism.

Other “geo” things

  • Melanie Froude has set up an online tool that allows mapping of landslides. Dave Petley wrote a blog post about the new tool, where he mentions it’s still experimental. It provides a global and regional view of deaths from landslides. Both Melanie and Dave are from the Department of Geography, University of Sheffield. They also authored Global fatal landslide occurrence from 2004 to 2016.
  • As Eric Wilson reported on Twitter, the Russell Senate Office Building in the US was renamed to John McCain Senate Office Building on Google Maps after McCain’s death. According to the Washington Post it is not clear yet by whom and why the entry was changed. The name was later reverted to its previous version.
  • Agence France-Presse has published a map illustrating emergency medical care in France. All white areas are more than 30 minutes away from an emergency department.

Upcoming Events

Where What When Country
Bochum Mappertreffen 2018-09-06 germany
Nantes Discussions et préparations 2018-09-06 france
Rennes Réunion mensuelle 2018-09-10 france
Lyon Rencontre mensuelle pour tous 2018-09-11 france
Munich Münchner Stammtisch 2018-09-13 germany
Berlin 123. Berlin-Brandenburg Stammtisch 2018-09-14 germany
Posadas Mapatón de parajes y caminos 2018-09-15 argentina
Berlin Berliner Hackweekend 2018-09-15-2018-09-16 germany
Grenoble Rencontre mensuelle 2018-09-17 france
Toronto Mappy Hour 2018-09-17 canada
Cologne Bonn Airport Bonner Stammtisch 2018-09-18 germany
Lüneburg Lüneburger Mappertreffen 2018-09-18 germany
Nottingham Pub Meetup 2018-09-18 united kingdom
Lonsee Stammtisch Ulmer Alb 2018-09-18 germany
Karlsruhe Stammtisch 2018-09-19 germany
Mumble Creek OpenStreetMap Foundation public board meeting 2018-09-20 everywhere
Leoben Stammtisch Obersteiermark 2018-09-20 austria
Kyoto 幕末京都オープンデータソン#06:壬生の浪士と新撰組 2018-09-22 japan
Buenos Aires State of the Map Latam 2018 2018-09-24-2018-09-25 argentina
Detroit State of the Map US 2018 2018-10-05-2018-10-07 united states
Bengaluru State of the Map Asia 2018 2018-11-17-2018-11-18 india
Melbourne FOSS4G SotM Oceania 2018 2018-11-20-2018-11-23 australia
Lübeck Lübecker Mappertreffen 2018-09-27 germany
Manila Maptime! Manila 2018-09-27 philippines

Note: If you would like to see your event here, please put it into the calendar. Only data which is in the calendar will appear in weeklyOSM. Please check your event in our public calendar preview and correct it where appropriate.

This weeklyOSM was produced by Nakaner, PierZen, Polyglot, Rogehm, SK53, SunCobalt, TheSwavu, YoViajo, derFred, geologist, jcoupey, jinalfoflia, keithonearth.

I've been using the JetBrains IDE PHPStorm ever since I really got started in MediaWiki development in 2013. Its symbol analysis and autocomplete is fantastic, and the built-in inspections generally caught most coding issues while you were still writing the code.

But, it's also non-free software, which has always made me feel uncomfortable using it. I used to hope that they would one day make a free/libre community version, like they did with their Python IDE, PyCharm. But after five years of waiting, I think it's time to give up on that hope.

So, about a year ago I started playing with replacements. I evaluated NetBeans, Eclipse, and Atom. I quickly gave up on NetBeans and Eclipse because it took too long for me to figure out how to create a project to import my code into. Atom looked promising, but if I remember correctly, it didn't have the symbol analysis part working yet.

I gave Atom a try again two weeks ago, since it looked like the PHP 7 language server was ready (spoiler: it isn't really). I like it. Here are my initial impressions:

  • The quick search bar (ctrl+t) has to re-index every time I open up Atom, which means I can't use it right away. It only searches filenames, but that's not a huge issue since most MediaWiki class names now match their filenames.
  • Everything that is .gitignore'd is excluded from the editor. This is smart but also gets in the way when I have all MediaWiki extensions cloned to extensions/, which is .gitignore'd in core.
  • Theme needs more contrast, I need to create my own or look through other community ones.
  • The language server regularly takes up an entire CPU when I'm not even using the editor. I don't know what it's doing - definitely not providing good symbol analysis. It really can't do anything more advanced than resolving things that are in the same file. I'm much less concerned about this since phan tends to catch most of these errors anyway.
  • The PHPCS linter plugin doesn't work. I need to spend some time understanding how it's supposed to work still, because I think I'm using it wrong.

Overall I'm pretty happy with Atom. I think there are still some glaring places where it falls short, but now I have the power to actually fix those things. I'd estimate that my productivity loss in the past two weeks has been 20%, but now it's probably closer to 10-15%. And as time goes on, I expect I'll start making productivity gains since I can customize my editor significantly more. Hooray for software freedom!
